Majorization-Minimization Algorithm
Ying Sun and Prof. Daniel P. Palomar
ELEC5470/IEDA6100A - Convex Optimization
The Hong Kong University of Science and Technology (HKUST)
Fall 2019-20
Acknowledgment
Slides of this lecture are mainly based on the following works:

[Hun-Lan'J04] D. R. Hunter and K. Lange, "A Tutorial on MM Algorithms," Amer. Statistician, pp. 30-37, 2004.

[Raz-Hon-Luo'J13] M. Razaviyayn, M. Hong, and Z. Luo, "A Unified Convergence Analysis of Block Successive Minimization Methods for Nonsmooth Optimization," SIAM J. Optim., pp. 1126-1153, 2013.

[Scu-Fac-Son-Pal-Pan'J14] G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, "Decomposition by Partial Linearization: Parallel Optimization of Multi-Agent Systems," IEEE Trans. Signal Process., vol. 62, no. 3, pp. 641-656, Feb. 2014.

[Sun-Bab-Pal'J17] Y. Sun, P. Babu, and D. P. Palomar, "Majorization-Minimization Algorithms in Signal Processing, Communications, and Machine Learning," IEEE Trans. Signal Process., vol. 65, no. 3, pp. 794-816, Feb. 2017.
Outline
1. The Majorization-Minimization Algorithm: Introduction; Construction Techniques; Example Algorithms; Applications
2. Block Successive Majorization-Minimization: Introduction; Block Coordinate Descent; Block Successive Majorization-Minimization; Example Algorithms; Applications
3. Distributed Algorithm for Nonlinear Programming: Exact Jacobi Successive Convex Approximation; Extensions
Problem Statement
Consider the following optimization problem

    minimize_x   f(x)
    subject to   x ∈ X,

with X being a closed convex set and f(x) being continuous. f(x) is too complicated to manipulate.

Idea: successively minimize an approximating function u(x, x^k):

    x^{k+1} = arg min_{x ∈ X} u(x, x^k),

hoping the sequence of minimizers {x^k} will converge to an optimal point x*.

Question: how to construct u(x, x^k)?
Terminology
Distance from a point to a set:

    d(x, S) = inf_{s ∈ S} ‖x − s‖.

Directional derivative:

    f′(x; d) ≜ liminf_{λ↓0} [ f(x + λd) − f(x) ] / λ.

Stationary point: x is a stationary point if

    f′(x; d) ≥ 0, ∀d such that x + d ∈ X.
Majorization-Minimization
Construction rule:

    u(y, y) = f(y), ∀y ∈ X                                   (A1)
    u(x, y) ≥ f(x), ∀x, y ∈ X                                (A2)
    u′(x, y; d)|_{x=y} = f′(y; d), ∀d with y + d ∈ X          (A3)
    u(x, y) is continuous in x and y                          (A4)

Pictorially: u(x, x^k) lies above f(x) everywhere and touches it at x = x^k.
Algorithm
Majorization-Minimization (Successive Upper-Bound Minimization):
1: Find a feasible point x^0 ∈ X and set k = 0
2: repeat
3:    x^{k+1} = arg min_{x ∈ X} u(x, x^k)   (global minimum)
4:    k ← k + 1
5: until some convergence criterion is met
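To make the iteration concrete, here is a minimal Python sketch of the generic MM loop (not taken from the slides): the caller supplies the surrogate minimizer, and as an illustration we plug in the classical quadratic majorizer |t| ≤ t²/(2|t^k|) + |t^k|/2 of the absolute value, so that an ℓ1-regularized least-squares problem is solved through a sequence of weighted ridge regressions. The function names, data, and the damping constant eps are our own choices.

```python
import numpy as np

def mm_minimize(surrogate_min, x0, tol=1e-8, max_iter=500):
    """Generic MM loop: surrogate_min(xk) must return the global
    minimizer over X of the surrogate u(., xk)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        x_new = surrogate_min(x)
        if np.linalg.norm(x_new - x) <= tol * max(1.0, np.linalg.norm(x)):
            return x_new, k + 1
        x = x_new
    return x, max_iter

# Illustration: minimize ||Ax - b||^2 + lam * ||x||_1 by majorizing
# |x_i| <= x_i^2 / (2|x_i^k|) + |x_i^k| / 2 (equality at x_i = x_i^k),
# so each MM step is a weighted ridge regression solved in closed form.
rng = np.random.default_rng(0)
A, b, lam, eps = rng.standard_normal((30, 10)), rng.standard_normal(30), 0.5, 1e-8

def ridge_step(xk):
    w = 1.0 / (np.abs(xk) + eps)              # weights from the current iterate
    H = 2.0 * A.T @ A + lam * np.diag(w)      # Hessian of the quadratic surrogate
    return np.linalg.solve(H, 2.0 * A.T @ b)  # its global minimizer

x_hat, iters = mm_minimize(ridge_step, x0=np.ones(10))
```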
Convergence
Under assumptions A1-A4, every limit point of the sequence {x^k} is a stationary point of the original problem.

If we further assume that the level set X^0 = {x | f(x) ≤ f(x^0)} is compact, then

    lim_{k→∞} d(x^k, X*) = 0,

where X* is the set of stationary points.
Construct Surrogate Function
The performance of the Majorization-Minimization algorithm depends crucially on the surrogate function u(x, x^k).

Guideline: the global minimizer of u(x, x^k) should be easy to find.

Suppose f(x) = f1(x) + κ(x), where f1(x) is some "nice" function and κ(x) is the one that needs to be approximated.
Construction by Convexity
Suppose κ(t) is convex; then

    κ( Σ_i α_i t_i ) ≤ Σ_i α_i κ(t_i)

with α_i ≥ 0 and Σ_i α_i = 1.
For example:

    κ(w^T x) = κ( w^T (x − x^k) + w^T x^k )
             = κ( Σ_i α_i [ w_i (x_i − x_i^k) / α_i + w^T x^k ] )
             ≤ Σ_i α_i κ( w_i (x_i − x_i^k) / α_i + w^T x^k )

If we further assume that w and x are positive (α_i = w_i x_i^k / w^T x^k):

    κ(w^T x) ≤ Σ_i ( w_i x_i^k / w^T x^k ) κ( (w^T x^k / x_i^k) x_i )

The surrogate function is separable (parallel algorithm).
Construction by Taylor Expansion
Suppose κ(x) is concave and differentiable; then

    κ(x) ≤ κ(x^k) + ∇κ(x^k)^T (x − x^k),

which is a linear upper bound.

Suppose κ(x) is convex and twice differentiable; then

    κ(x) ≤ κ(x^k) + ∇κ(x^k)^T (x − x^k) + (1/2) (x − x^k)^T M (x − x^k)

if M − ∇²κ(x) ⪰ 0, ∀x.
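As a quick worked example of the second bound (our own addition, for the unconstrained special case f(x) = κ(x) with M = L·I and L an upper bound on the eigenvalues of ∇²κ(x)), minimizing the quadratic surrogate in closed form shows that MM then reduces to gradient descent with step size 1/L:

```latex
x^{k+1} = \arg\min_{x}\Big\{\kappa(x^k) + \nabla\kappa(x^k)^{T}(x - x^k)
          + \tfrac{L}{2}\,\|x - x^k\|^2\Big\}
        = x^k - \tfrac{1}{L}\,\nabla\kappa(x^k).
```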
Construction by Inequalities
Arithmetic-Geometric Mean Inequality:

    ( ∏_{i=1}^n x_i )^{1/n} ≤ (1/n) Σ_{i=1}^n x_i

Cauchy-Schwarz Inequality:

    ‖x‖ ≥ x^T x^k / ‖x^k‖

Jensen's Inequality:

    κ( E[x] ) ≤ E[ κ(x) ]

with κ(·) being convex.
EM Algorithm
Assume the complete data set {x, z} consists of an observed variable x and a latent variable z.

Objective: estimate the parameter θ ∈ Θ from x.

Maximum likelihood estimator: θ̂ = arg min_{θ∈Θ} − log p(x|θ).

EM (Expectation-Maximization) algorithm:
- E-step: evaluate p(z|x, θ^k) ("guess" z from the current estimate of θ)
- M-step: update θ as θ^{k+1} = arg min_{θ∈Θ} u(θ, θ^k), where

    u(θ, θ^k) = − E_{z|x,θ^k} log p(x, z|θ)

  (update θ from the "guessed" complete data set)
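For concreteness, here is a minimal sketch of EM for a two-component univariate Gaussian mixture (a standard textbook instance of the E- and M-steps above, not taken from these slides; initialization and data are our own choices).

```python
import numpy as np

def em_gmm2(x, n_iter=100):
    """EM for a 2-component 1-D Gaussian mixture; returns (pi, mu, sigma2)."""
    rng = np.random.default_rng(0)
    pi, mu, s2 = 0.5, rng.choice(x, 2), np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: responsibilities p(z = 1 | x_i, theta^k)
        n0 = np.exp(-0.5 * (x - mu[0])**2 / s2[0]) / np.sqrt(s2[0])
        n1 = np.exp(-0.5 * (x - mu[1])**2 / s2[1]) / np.sqrt(s2[1])
        r = pi * n1 / ((1 - pi) * n0 + pi * n1)
        # M-step: minimize -E_{z|x,theta^k} log p(x, z | theta) in closed form
        pi = r.mean()
        mu = np.array([np.sum((1 - r) * x) / np.sum(1 - r),
                       np.sum(r * x) / np.sum(r)])
        s2 = np.array([np.sum((1 - r) * (x - mu[0])**2) / np.sum(1 - r),
                       np.sum(r * (x - mu[1])**2) / np.sum(r)])
    return pi, mu, s2

# toy data drawn from two Gaussians
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm2(data))
```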
An MM Interpretation of EM
The objective function can be written as

    − log p(x|θ) = − log E_{z|θ} p(x|z, θ)
                 = − log E_{z|θ} [ p(z|x, θ^k) p(x|z, θ) / p(z|x, θ^k) ]
                 = − log E_{z|x,θ^k} [ p(x|z, θ) p(z|θ) / p(z|x, θ^k) ]
                 ≤ − E_{z|x,θ^k} log [ p(x|z, θ) p(z|θ) / p(z|x, θ^k) ]   (Jensen's inequality)
                 = − E_{z|x,θ^k} log p(x, z|θ) + E_{z|x,θ^k} log p(z|x, θ^k),

where the first term is the surrogate u(θ, θ^k) and the second term does not depend on θ.
Proximal Minimization
f(x) is convex. Solve min_x f(x) by solving the equivalent problem

    minimize_{x∈X, y∈X}   f(x) + (1/2c) ‖x − y‖².

The objective function is strongly convex in both x and y.

Algorithm:

    x^{k+1} = arg min_{x∈X} { f(x) + (1/2c) ‖x − y^k‖² }
    y^{k+1} = x^{k+1}.

An MM interpretation:

    x^{k+1} = arg min_{x∈X} { f(x) + (1/2c) ‖x − x^k‖² }
DC Programming
Consider the unconstrained problem

    minimize_{x∈R^n}   f(x),

where f(x) = g(x) + h(x) with g(x) convex and h(x) concave.

DC (Difference of Convex) Programming generates {x^k} by solving

    ∇g(x^{k+1}) = −∇h(x^k).

An MM interpretation:

    x^{k+1} = arg min_x { g(x) + ∇h(x^k)^T (x − x^k) }.
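A minimal sketch of this iteration on a toy problem of our own choosing: g(x) = ½‖x − a‖² (convex) and h(x) = −½λ‖x‖² (concave, with λ < 1 so that f stays bounded below). Each step linearizes h at x^k and minimizes the resulting convex surrogate in closed form.

```python
import numpy as np

# DC programming / convex-concave procedure on a toy objective:
#   f(x) = 0.5*||x - a||^2  -  0.5*lam*||x||^2   (g convex, h concave, lam < 1)
a, lam = np.array([1.0, -2.0, 3.0]), 0.4
x = np.zeros_like(a)
for k in range(100):
    grad_h = -lam * x                      # gradient of the concave part at x^k
    # minimize g(x) + grad_h^T (x - x^k)  ->  closed form: x = a - grad_h
    x_new = a - grad_h
    if np.linalg.norm(x_new - x) < 1e-10:
        break
    x = x_new
print(x, a / (1.0 - lam))                  # both equal the stationary point a/(1-lam)
```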
Power Control by GP
[Chi-Tan-Pal-O'Ne-Jul'J07] Problem: maximize system throughput. Essentially we need to solve the following problem:

    minimize_{P∈P}   ( Σ_{j≠i} G_ij P_j + n_i ) / ( Σ_j G_ij P_j + n_i )

The objective function is the ratio of two posynomials.

Minorize a posynomial, denoted by g(x) = Σ_i m_i(x), by a monomial:

    g(x) ≥ ∏_i ( m_i(x) / α_i )^{α_i}

where α_i = m_i(x^k) / g(x^k). (Arithmetic-Geometric Mean Inequality)

Solution: approximate the denominator posynomial Σ_j G_ij P_j + n_i by a monomial.
Reweighted ℓ1-norm

Sparse signal recovery problem:

    minimize_x   ‖x‖_0
    subject to   Ax = b

ℓ1-norm approximation:

    minimize_x   ‖x‖_1
    subject to   Ax = b

General form:

    minimize_x   Σ_{i=1}^n φ(|x_i|)
    subject to   Ax = b
[Can-Wak-Boy'J08] Assume φ(t) is concave and nondecreasing. At x_i^k, φ(|x_i|) is majorized by w_i^k |x_i| (plus a constant) with w_i^k = φ′(t)|_{t=|x_i^k|}.

At each iteration a weighted ℓ1-norm problem is solved:

    minimize_x   Σ_i w_i^k |x_i|
    subject to   Ax = b
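A minimal sketch of the resulting reweighting scheme (our own illustration): the weighted subproblem is cast as a linear program and solved with scipy, and the weights w_i^k = 1/(ε + |x_i^k|) correspond to the penalty φ(t) = log(ε + t) used in [Can-Wak-Boy'J08]. Function names, data, and the iteration count are our choices.

```python
import numpy as np
from scipy.optimize import linprog

def weighted_l1(A, b, w):
    """Solve  min_x sum_i w_i |x_i|  s.t.  Ax = b  as an LP in (x, t) with |x_i| <= t_i."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), w])                          # objective: sum_i w_i t_i
    A_eq = np.hstack([A, np.zeros((m, n))])                       # Ax = b
    I = np.eye(n)
    A_ub = np.vstack([np.hstack([I, -I]), np.hstack([-I, -I])])   # x - t <= 0, -x - t <= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n), A_eq=A_eq, b_eq=b,
                  bounds=[(None, None)] * n + [(0, None)] * n, method="highs")
    return res.x[:n]

def reweighted_l1(A, b, eps=1e-3, n_iter=10):
    x = weighted_l1(A, b, np.ones(A.shape[1]))          # first pass = plain l1 minimization
    for _ in range(n_iter):
        x = weighted_l1(A, b, 1.0 / (eps + np.abs(x)))  # w_i^k = phi'(|x_i^k|)
    return x

# toy sparse recovery instance
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 60))
x_true = np.zeros(60); x_true[[3, 17, 41]] = [1.0, -2.0, 1.5]
x_rec = reweighted_l1(A, A @ x_true)
```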
Sparse Generalized Eigenvalue Problem
ℓ0-norm regularized generalized eigenvalue problem:

    maximize_x   x^T A x − ρ ‖x‖_0
    subject to   x^T B x = 1.

Replace each ‖x_i‖_0 by some nicely behaved function g_p(x_i):

    g_p(x_i) = |x_i|^p,                              0 < p ≤ 1
    g_p(x_i) = log(1 + |x_i|/p) / log(1 + 1/p),      p > 0
    g_p(x_i) = 1 − e^{−|x_i|/p},                     p > 0.

Take g_p(x_i) = |x_i|^p for example.
[Son-Bab-Pal'J15a] Majorize g_p(x_i) at x_i^k by the quadratic function w_i^k x_i² + c_i^k.

The surrogate function for g_p(x_i) = |x_i|^p is

    u(x_i, x_i^k) = (p/2) |x_i^k|^{p−2} x_i² + (1 − p/2) |x_i^k|^p.

Solve at each iteration the following GEVP:

    maximize_x   x^T A x − ρ x^T diag(w^k) x
    subject to   x^T B x = 1

However, as |x_i| → 0, w_i → +∞...
Smooth approximation of g_p(x):

    g_p^ε(x) = (p/2) ε^{p−2} x²,            |x| ≤ ε
    g_p^ε(x) = |x|^p − (1 − p/2) ε^p,       |x| > ε

When |x| ≤ ε, w remains a constant.
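A minimal sketch of the resulting iteratively reweighted GEVP step (our own illustration of the idea, not the full IRQM algorithm of [Son-Bab-Pal'J15a]): the weights come from the smoothed surrogate above, and each subproblem is solved by the leading generalized eigenvector. A is assumed symmetric and B positive definite; data and iteration count are our choices.

```python
import numpy as np
from scipy.linalg import eigh

def sparse_gevp(A, B, rho, p=0.5, eps=1e-3, n_iter=50):
    """Iteratively reweighted GEVP for  max x'Ax - rho*sum_i g_p(x_i)  s.t.  x'Bx = 1."""
    x = eigh(A, B)[1][:, -1]                 # start from the ordinary leading generalized EV
    for _ in range(n_iter):
        ax = np.maximum(np.abs(x), eps)      # smoothed surrogate: weight saturates below eps
        w = 0.5 * p * ax ** (p - 2.0)
        # surrogate subproblem: max x'(A - rho*diag(w))x  s.t.  x'Bx = 1
        evals, evecs = eigh(A - rho * np.diag(w), B)
        x = evecs[:, -1]                     # eigenvectors are B-orthonormal, so x'Bx = 1
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((20, 20))
x_sparse = sparse_gevp((M + M.T) / 2, np.eye(20), rho=1.0)
```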
(Figure: objective value versus iterations, comparing IRQM (proposed) with DC-SGEP.)
Sequence Design
Complex unimodular sequence {x_n ∈ C}_{n=1}^N.

Autocorrelation: r_k = Σ_{n=k+1}^N x_n x*_{n−k} = r*_{−k}, k = 0, . . . , N−1.

Integrated sidelobe level (ISL):

    ISL = Σ_{k=1}^{N−1} |r_k|².

Problem formulation:

    minimize_{ {x_n}_{n=1}^N }   ISL
    subject to   |x_n| = 1, n = 1, . . . , N.
By Fourier transform:

    ISL ∝ Σ_{p=1}^{2N} [ |a_p^H x|² − N ]²

with x = [x_1, . . . , x_N]^T, a_p = [1, e^{jω_p}, . . . , e^{jω_p(N−1)}]^T and ω_p = (2π/2N)(p − 1).

Equivalent problem:

    minimize_x   Σ_{p=1}^{2N} ( a_p^H x x^H a_p )²
    subject to   |x_n| = 1, ∀n.
[Son-Bab-Pal'J15b] Define A = [a_1, . . . , a_{2N}],

    p^k = [ |a_1^H x^k|², . . . , |a_{2N}^H x^k|² ]^T,    Ā = A ( diag(p^k) − p^k_max I ) A^H.

Quadratic surrogate function:

    p^k_max x^H A A^H x + 2 Re( x^H ( Ā − 2N² x^k (x^k)^H ) x^k ),

where the first term is constant on the constraint set (A A^H = 2N·I and ‖x‖² = N). The surrogate minimization is equivalent to

    minimize_x   ‖x − y‖²
    subject to   |x_n| = 1, ∀n

with

    y = −( Ā − 2N² x^k (x^k)^H ) x^k

Closed-form solution: x_n = e^{j arg(y_n)}.
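A literal transcription of this update in Python, offered only as a sketch that follows the formulas on this slide; the sequence length, initialization, and iteration count are our own choices.

```python
import numpy as np

def isl_mm_step(x, A):
    """One MM update for the ISL minimization problem, following the slide."""
    N = x.size
    p = np.abs(A.conj().T @ x) ** 2                         # p^k, length 2N
    A_bar = A @ np.diag(p - p.max()) @ A.conj().T            # A (diag(p^k) - p_max I) A^H
    y = -(A_bar - 2 * N**2 * np.outer(x, x.conj())) @ x      # y = -(A_bar - 2N^2 x x^H) x
    return np.exp(1j * np.angle(y))                          # closed form: x_n = e^{j arg(y_n)}

N = 32
omega = 2 * np.pi * np.arange(2 * N) / (2 * N)
A = np.exp(1j * np.outer(np.arange(N), omega))               # columns a_p
x = np.exp(1j * 2 * np.pi * np.random.default_rng(0).random(N))   # random unimodular start
for _ in range(200):
    x = isl_mm_step(x, A)
```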
(Figure: integrated sidelobe level (dB) versus iteration, comparing the proposed method with a benchmark.)
Covariance Estimation
x_i ∼ elliptical(0, Σ)

Fitting the normalized samples s_i = x_i / ‖x_i‖_2 to the Angular Central Gaussian distribution:

    f(s_i) ∝ det(Σ)^{−1/2} ( s_i^T Σ^{−1} s_i )^{−K/2}

[Sun-Bab-Pal'J14] Shrinkage penalty:

    h(Σ) = log det(Σ) + Tr( Σ^{−1} T )

Solve the following problem:

    minimize_Σ   log det(Σ) + (K/N) Σ_i log( x_i^T Σ^{−1} x_i ) + α h(Σ)
    subject to   Σ ≻ 0
At Σ^k, the objective function is majorized by

    (1 + α) log det(Σ) + (K/N) Σ_{i=1}^N [ x_i^T Σ^{−1} x_i / ( x_i^T (Σ^k)^{−1} x_i ) ] + α Tr( Σ^{−1} T )

The surrogate function is convex in Σ^{−1}. Setting the gradient to zero leads to the weighted sample average

    Σ^{k+1} = (1/(1+α)) (K/N) Σ_i [ x_i x_i^T / ( x_i^T (Σ^k)^{−1} x_i ) ] + (α/(1+α)) T
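A minimal sketch of this fixed-point iteration (a direct transcription of the update above; initialization, stopping rule, and data are our own choices).

```python
import numpy as np

def regularized_scatter(X, T, alpha, n_iter=100, tol=1e-8):
    """MM iteration for the shrinkage-penalized scatter estimate.
    X is N x K (one sample per row), T is the K x K shrinkage target."""
    N, K = X.shape
    Sigma = np.eye(K)
    for _ in range(n_iter):
        d = np.einsum('ij,jk,ik->i', X, np.linalg.inv(Sigma), X)   # x_i' Sigma^{-1} x_i
        S = (K / N) * (X.T / d) @ X                                # weighted sample average
        Sigma_new = (S + alpha * T) / (1.0 + alpha)
        if np.linalg.norm(Sigma_new - Sigma, 'fro') < tol * np.linalg.norm(Sigma, 'fro'):
            return Sigma_new
        Sigma = Sigma_new
    return Sigma

rng = np.random.default_rng(0)
Sigma_hat = regularized_scatter(rng.standard_normal((200, 5)), T=np.eye(5), alpha=0.1)
```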
(Figure: objective function value versus iteration.)
Problem Statement
Consider the following problem

    minimize_{x∈X}   f(x)

The set X possesses a Cartesian product structure X = ∏_{i=1}^m X_i.

Observation: the problem

    minimize_{x_i∈X_i}   f( x_1^0, . . . , x_{i−1}^0, x_i, x_{i+1}^0, . . . , x_m^0 ),

with x_{−i}^0 taking some feasible value, is easy to solve.
Block Coordinate Descent (BCD)
Denote x ≜ (x_1, . . . , x_m) and f( x_1^0, . . . , x_{i−1}^0, x_i, x_{i+1}^0, . . . , x_m^0 ) ≜ f( x_i, x_{−i}^0 ).

Block Coordinate Descent (nonlinear Gauss-Seidel):
1: Initialize x^0 ∈ X and set k = 0.
2: repeat
3:    k = k + 1, i = (k mod m) + 1
4:    x_i^k = arg min_{x_i∈X_i} f( x_i, x_{−i}^{k−1} )
5:    x_j^k ← x_j^{k−1}, ∀j ≠ i
6: until some convergence criterion is met
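A minimal two-block illustration (our own example, not from the slides): low-rank matrix factorization min_{U,V} ‖Y − U V^T‖_F², where each block update is a least-squares problem with a closed-form solution.

```python
import numpy as np

def bcd_matrix_factorization(Y, r, n_sweeps=50):
    """Two-block coordinate descent for  min_{U,V} ||Y - U V^T||_F^2."""
    rng = np.random.default_rng(0)
    m, n = Y.shape
    U, V = rng.standard_normal((m, r)), rng.standard_normal((n, r))
    for _ in range(n_sweeps):
        # block 1: U-update with V fixed (least squares, closed form)
        U = Y @ V @ np.linalg.inv(V.T @ V)
        # block 2: V-update with U fixed
        V = Y.T @ U @ np.linalg.inv(U.T @ U)
    return U, V

Y = np.random.default_rng(1).standard_normal((50, 40))
U, V = bcd_matrix_factorization(Y, r=5)
```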
Convergence
[Ber'B99] Assume that

- f(x) is continuously differentiable over the set X;
- x_i^k = arg min_{x_i∈X_i} f( x_i, x_{−i}^{k−1} ) has a unique solution.

Then every limit point of the sequence {x^k} is a stationary point.

[Gri-Sci'J00] Generalizations:

- globally convergent for m = 2;
- f is component-wise strictly quasi-convex w.r.t. m − 2 components;
- f is pseudo-convex.
BS-MM Algorithm
Combination of MM and BCD.

Block Successive Majorization-Minimization (BS-MM):
1: Initialize x^0 ∈ X and set k = 0.
2: repeat
3:    k = k + 1, i = (k mod m) + 1
4:    X^k = arg min_{x_i∈X_i} u_i( x_i, x^{k−1} )
5:    Set x_i^k to be an arbitrary element in X^k
6:    x_j^k ← x_j^{k−1}, ∀j ≠ i
7: until some convergence criterion is met

Generalization of BCD.
Convergence
The surrogate functions u_i(·, ·) satisfy the following assumptions:

    u_i(y_i, y) = f(y), ∀y ∈ X, ∀i                                             (B1)
    u_i(x_i, y) ≥ f( y_1, . . . , y_{i−1}, x_i, y_{i+1}, . . . , y_m ),
        ∀x_i ∈ X_i, ∀y ∈ X, ∀i                                                  (B2)
    u_i′(x_i, y; d_i)|_{x_i=y_i} = f′(y; d),
        ∀d = (0, . . . , d_i, . . . , 0) such that y_i + d_i ∈ X_i, ∀i            (B3)
    u_i(x_i, y) is continuous in (x_i, y), ∀i                                   (B4)

In short, u_i(x_i, x^k) majorizes f(x) on the ith block.
Under assumptions B1-B4 (for simplicity additionally assume that f is continuously differentiable), if

- u_i(x_i, y) is quasi-convex in x_i, and
- each subproblem min_{x_i∈X_i} u_i( x_i, x^{k−1} ) has a unique solution for any x^{k−1} ∈ X,

then every limit point of {x^k} is a stationary point.

If

- the level set X^0 = {x | f(x) ≤ f(x^0)} is compact, and
- each subproblem min_{x_i∈X_i} u_i( x_i, x^{k−1} ) has a unique solution for any x^{k−1} ∈ X for at least m − 1 blocks,

then lim_{k→∞} d( x^k, X* ) = 0.

The assumptions are more restrictive than for MM due to the cyclic update behavior.
Alternating Proximal Minimization
Consider the problem

    minimize_x   f( x_1, . . . , x_m )
    subject to   x_i ∈ X_i,

with f(·) being convex in each block.

The convergence of BCD is not easy to establish since each subproblem may have multiple solutions.

Alternating Proximal Minimization solves

    minimize_{x_i}   f( x_1^k, . . . , x_{i−1}^k, x_i, x_{i+1}^k, . . . , x_m^k ) + (1/2c) ‖x_i − x_i^k‖²
    subject to   x_i ∈ X_i

Strictly convex objective → unique minimizer.
Proximal Splitting Algorithm
Consider the following problem

    minimize_x   Σ_{i=1}^m f_i(x_i) + f_{m+1}( x_1, . . . , x_m )
    subject to   x_i ∈ X_i, i = 1, . . . , m

with the f_i convex and lower semicontinuous, f_{m+1} convex, and

    ‖∇f_{m+1}(x) − ∇f_{m+1}(y)‖ ≤ β_i ‖x_i − y_i‖

Cyclically update:

    x_i^{k+1} = prox_{γ f_i} ( x_i^k − γ ∇_{x_i} f_{m+1}(x^k) ),

with the proximity operator defined as

    prox_f(x) = arg min_{y∈X} f(y) + (1/2) ‖x − y‖².
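With a single block, f_1 = λ‖x‖_1 and f_2 = ½‖Ax − b‖², this cyclic update is the well-known ISTA iteration. A minimal sketch (step size, data, and iteration count are our own choices):

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau*||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((40, 100)), rng.standard_normal(40), 0.1
gamma = 1.0 / np.linalg.norm(A, 2) ** 2        # gamma <= 1/beta, with beta = ||A||_2^2
x = np.zeros(100)
for _ in range(500):
    grad = A.T @ (A @ x - b)                   # gradient of the smooth part f_{m+1}
    x = soft_threshold(x - gamma * grad, gamma * lam)   # prox_{gamma * f_1}
```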
Proximal Splitting Algorithm
BS-MM interpretation:

    u_i( x_i, x^k ) = f_i(x_i) + (1/2γ) ‖x_i − x_i^k‖² + ∇_{x_i} f_{m+1}(x^k)^T ( x_i − x_i^k )
                      + Σ_{j≠i} f_j( x_j^k ) + f_{m+1}(x^k).

Check:

    f_{m+1}(x^k) + (1/2γ) ‖x_i − x_i^k‖² + ∇_{x_i} f_{m+1}(x^k)^T ( x_i − x_i^k )
    ≥ f_{m+1}(x^k) + (β_i/2) ‖x_i − x_i^k‖² + ∇_{x_i} f_{m+1}(x^k)^T ( x_i − x_i^k )
    ≥ f_{m+1}( x_{−i}^k, x_i )                                   (descent lemma)

with γ ∈ [ε_i, 2/β_i − ε_i] and ε_i ∈ (0, min{1, 1/β_i}).
Robust Estimation of Location and Scatter
x_i ∼ elliptical(μ, R)

[Sun-Bab-Pal'J15] Fitting x_i to a Cauchy distribution with pdf

    f(x) ∝ det(R)^{−1/2} ( 1 + (x − μ)^T R^{−1} (x − μ) )^{−(K+1)/2}

Shrinkage penalty:

    h(t, T) = K log( Tr( R^{−1} T ) ) + log det(R) + log( 1 + (t − μ)^T R^{−1} (t − μ) )

Solve the following problem:

    minimize_{μ, R≻0}   log det(R) + ((K+1)/N) Σ_{i=1}^N log( 1 + (x_i − μ)^T R^{−1} (x_i − μ) ) + α h(t, T)
BS-MM Algorithm update:

    μ_{t+1} = [ (K+1) Σ_{i=1}^N w_i(μ_t, R_t) x_i + Nα w_t(μ_t, R_t) t ]
              / [ (K+1) Σ_{i=1}^N w_i(μ_t, R_t) + Nα w_t(μ_t, R_t) ]

    R_{t+1} = (K+1)/(N+Nα) Σ_{i=1}^N w_i(μ_{t+1}, R_t) (x_i − μ_{t+1})(x_i − μ_{t+1})^T
              + Nα/(N+Nα) w_t(μ_{t+1}, R_t) (μ_{t+1} − t)(μ_{t+1} − t)^T
              + NαK/(N+Nα) · T / Tr( R_t^{−1} T )

where

    w_i(μ, R) = 1 / ( 1 + (x_i − μ)^T R^{−1} (x_i − μ) )
    w_t(μ, R) = 1 / ( 1 + (t − μ)^T R^{−1} (t − μ) ).
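A direct transcription of this block update in Python, offered only as a sketch that follows the formulas above; initialization, data, and the iteration count are our own choices.

```python
import numpy as np

def cauchy_shrinkage_estimate(X, t, T, alpha, n_iter=50):
    """BS-MM updates for the regularized Cauchy location/scatter fit.
    X is N x K (one sample per row); t and T are the shrinkage targets."""
    N, K = X.shape
    mu, R = X.mean(axis=0), np.cov(X, rowvar=False)

    def w(z, m, Rinv):
        return 1.0 / (1.0 + (z - m) @ Rinv @ (z - m))

    for _ in range(n_iter):
        Rinv = np.linalg.inv(R)                 # R_t^{-1}, reused throughout this sweep
        wi = np.array([w(x, mu, Rinv) for x in X])
        wt = w(t, mu, Rinv)
        # mu-update: weighted average of the samples and the target t
        mu = ((K + 1) * (wi[:, None] * X).sum(axis=0) + N * alpha * wt * t) \
             / ((K + 1) * wi.sum() + N * alpha * wt)
        # R-update: weights re-evaluated at the new mu (with the old R), as on the slide
        wi = np.array([w(x, mu, Rinv) for x in X])
        wt = w(t, mu, Rinv)
        D = X - mu
        R = ((K + 1) / (N + N * alpha)) * (D.T * wi) @ D \
            + (N * alpha / (N + N * alpha)) * wt * np.outer(mu - t, mu - t) \
            + (N * alpha * K / (N + N * alpha)) * T / np.trace(Rinv @ T)
    return mu, R

rng = np.random.default_rng(1)
X = rng.standard_t(df=1, size=(300, 4)) + np.array([1.0, 0.0, -1.0, 2.0])
mu_hat, R_hat = cauchy_shrinkage_estimate(X, t=np.zeros(4), T=np.eye(4), alpha=0.2)
```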
(Figure: objective function value versus iterations.)
Problem Statement
Consider the following problem:

    minimize_x   f(x)
    subject to   x_i ∈ X_i

where the X_i's are closed and convex sets, and f(x) = Σ_{l=1}^L f_l( x_1, . . . , x_m ).

Conditional gradient update (Frank-Wolfe):

    x^{k+1} = x^k + γ^k d^k

- direction d^k ≜ x̄^k − x^k, with x̄_i^k = arg min_{x_i∈X_i} ∇_{x_i} f(x^k)^T ( x_i − x_i^k )
- step size γ^k ∈ (0, 1], chosen to guarantee convergence.
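A minimal single-block sketch of the conditional gradient update (our own toy instance, not from the slides): a convex quadratic minimized over the probability simplex, where the linear subproblem is solved by picking the best vertex; the diminishing step size γ^k = 2/(k+2) is a standard choice.

```python
import numpy as np

# Toy problem: minimize f(x) = 0.5*||x - c||^2 over the probability simplex.
rng = np.random.default_rng(0)
n, c = 20, rng.standard_normal(20)
x = np.ones(n) / n                      # feasible starting point
for k in range(200):
    grad = x - c
    # linear subproblem over the simplex: minimized at the vertex e_i with i = argmin grad_i
    x_bar = np.zeros(n)
    x_bar[np.argmin(grad)] = 1.0
    gamma = 2.0 / (k + 2.0)             # diminishing step size
    x = x + gamma * (x_bar - x)         # conditional gradient (Frank-Wolfe) update
```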
Exact Jacobi SCA Algorithm
Idea:

- The conditional gradient update linearizes all the f_l's at x^k.
- Each function f_l might be convex w.r.t. some block x_i.
- We want to preserve the convexity of f_l( x_i, x_{−i}^k ).

Solution: keep the convex f_l( x_i, x_{−i}^k )'s and linearize the others.
Define C_i as the set of indices l such that f_l( x_i, x_{−i}^k ) is convex.

Approximate f(x) on the ith block at the point x^k:

    f̃_i( x_i, x^k ) = Σ_{l∈C_i} f_l( x_i, x_{−i}^k )                          (convex terms)
                      + π_i(x^k)^T ( x_i − x_i^k )                             (linear approx. of the non-convex terms)
                      + (τ_i/2) ( x_i − x_i^k )^T H_i(x^k) ( x_i − x_i^k )      (proximal term)

with

    π_i(x^k) = Σ_{l∉C_i} ∇_{x_i} f_l(x^k)    and    H_i(x^k) ⪰ c_{H_i} I.
Exact Jacobi SCA update:

    x̄_i( x^k, τ_i ) = arg min_{x_i∈X_i} f̃_i( x_i, x^k )
    x^{k+1} = x^k + γ^k ( x̄ − x^k )

Step-size rules:

- constant step size that depends on the Lipschitz constant of ∇f
- diminishing step size

Remark: the update of the blocks can also be done sequentially (Gauss-Seidel SCA Algorithm).
Extensions
FLEXA
- non-smooth objective function
- inexact update direction
- flexible block update choice
HyFLEXA
Comparison
                     BS-MM                       FLEXA
convergence          stationary point            stationary point
objective function   continuous,                 continuous,
                     may not be smooth           may not be smooth
constraint set       Cartesian                   Cartesian & convex
update rule          sequential                  sequential or parallel
approx. function     global upper bound          local approximation
                     unique minimizer            not required
                     can be non-convex           convex approximation
Summary
We have studied:
- the Majorization-Minimization algorithm
- the Block Coordinate Descent algorithm
- the Block Successive Majorization-Minimization algorithm

We have briefly introduced:
- the Distributed Successive Convex Approximation algorithm
References I

D. R. Hunter and K. Lange. A tutorial on MM algorithms. Amer. Statistician, vol. 58, pp. 30-37, 2004.

M. Razaviyayn, M. Hong, and Z. Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J. Optim., vol. 23, no. 2, pp. 1126-1153, 2013.

G. Scutari, F. Facchinei, P. Song, D. Palomar, and J.-S. Pang. Decomposition by partial linearization: Parallel optimization of multi-agent systems. IEEE Trans. Signal Process., vol. 62, no. 3, pp. 641-656, 2014.

Y. Sun, P. Babu, and D. Palomar. Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE Trans. Signal Process., vol. 65, no. 3, pp. 794-816, 2017.

M. Chiang, C. W. Tan, D. Palomar, D. O'Neill, and D. Julian. Power control by geometric programming. IEEE Trans. Wireless Commun., vol. 6, no. 7, pp. 2640-2651, 2007.

References II

E. J. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted l1 minimization. J. Fourier Anal. Appl., vol. 14, no. 5-6, pp. 877-905, 2008.

J. Song, P. Babu, and D. P. Palomar. Sparse generalized eigenvalue problem via smooth optimization. IEEE Trans. Signal Process., vol. 63, no. 7, pp. 1627-1642, 2015.

J. Song, P. Babu, and D. P. Palomar. Optimization methods for designing sequences with low autocorrelation sidelobes. IEEE Trans. Signal Process., vol. 63, no. 15, pp. 3998-4009, 2015.

Y. Sun, P. Babu, and D. P. Palomar. Regularized Tyler's scatter estimator: Existence, uniqueness, and algorithms. IEEE Trans. Signal Process., vol. 62, no. 19, pp. 5143-5156, 2014.

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Oper. Res. Lett., vol. 26, no. 3, pp. 127-136, 2000.

References III

Y. Sun, P. Babu, and D. P. Palomar. Regularized robust estimation of mean and covariance matrix under heavy-tailed distributions. IEEE Trans. Signal Process., vol. 63, no. 12, pp. 3096-3109, 2015.
Thanks
For more information visit:
https://www.danielppalomar.com