

Convergent Nested Alternating Minimization Algorithms for Non-Convex Optimization Problems

Eyal Gur, Shoham Sabach*, Shimrit Shtern†

October 1, 2020

Abstract

We introduce a new algorithmic framework for solving non-convex optimization problems, called Nested Alternating Minimization, which combines the classical Alternating Minimization technique with inner iterations of any optimization method. We provide a global convergence analysis of the new algorithmic framework to critical points of the problem at hand, which, to the best of our knowledge, is the first of its kind for nested methods in the non-convex setting. Central to our global convergence analysis is a new extension of classical proof techniques in the non-convex setting that allows for errors in the conditions. The power of our framework is illustrated with numerical experiments that show the superiority of this algorithmic framework over existing methods.

1 Introduction

In recent years, non-convex and non-smooth optimization problems have gained a lot of interest in numerous applied fields such as machine learning [17], signal processing [29, 23],

*This research was partially supported by the D.F.G German Research Foundation 800240.
†Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology, Technion City, Haifa 3200003, Israel (e-mail: [email protected], [email protected], [email protected]).

data science [13, 20], operations research [14, 12] and more. While the theoretical results in the convex setting (such as convergence analysis and rates of convergence) are extensive (see, for instance, the recent book [5] and references therein), the tools and results available in the non-convex setting are very limited, making this setting much more challenging. An important task in the non-convex problem domain, which has already gained success in several settings, is to establish global convergence of algorithms to critical points (see more detail in Section 2). In this paper, we are motivated by the analysis and application of a large and important class of algorithms, which we call Nested Alternating Minimization and which will be precisely defined in Section 3. Our contribution consists of providing a new general proof procedure that can be applied to obtain global convergence results for this class of algorithms. As far as we know, this is the first theoretical guarantee for this class of algorithms in the non-convex setting.

1.1 Global Convergence

Proving global convergence of an algorithm to a critical point involves showing that, for any starting point, the entire sequence generated by the algorithm converges to a single critical point of the optimization problem at hand. In recent years, this task has received a lot of attention, starting with the works [2, 3, 4] and later with the work [10], which formulates a simple proof technique for obtaining global convergence results for algorithms that satisfy three conditions in addition to the KL property of the objective function (see [19, 18, 9]). These works paved the way for many modifications and extensions that have been, and are still being, developed (e.g., [26, 22] and references therein). In this work we propose another proof technique for obtaining global convergence results in the non-convex setting. This extension allows more flexibility in accommodating the conditions of previously proposed recipes, in the sense that we propose a relaxed variant of the first two conditions by adding some non-negative error terms (our recipe is a further relaxation of the recipe suggested in [22], which only relaxes the second condition). The first condition consists of a sufficient decrease requirement of the algorithm with respect to all variables (see, for instance, [10]) or part of them (see, for instance, [26]), where the sufficient decrease can be measured in terms of the original objective function as in [10] or a related Lyapunov function as in [24]. The second

condition consists of bounding the iterates gap from below by the function's sub-gradient norm. As mentioned, we relax these conditions by allowing for some error terms.

In Section 2, following the approach introduced in [10] and recently summarized in [11], we develop the extended proof methodology in detail. Even though the techniques used to derive the new methodology are not new and are based on [10, 11], the newly developed technique allows us to easily analyze methods that are not necessarily sufficient-descent methods (due to the relaxation of the first condition), which in turn opens the gate to obtaining global convergence results for the very important and challenging class of Nested Alternating Minimization algorithms.

1.2 Nested Alternating Minimization Algorithms

The celebrated Alternating Minimization (AM) methodology is a technique for handling complicated optimization problems that has regained huge popularity in the last decade in the context of non-convex optimization. The AM methodology splits an optimization problem into smaller problems, which hopefully are easier to solve in closed form. Unfortunately, in many cases, especially in the non-convex setting, even the resulting sub-problems remain too complicated or too large to be solved explicitly. This led to a stream of papers in recent years that suggest approximating the solution of the sub-problems by computing one iteration of a certain optimization algorithm (see, for instance, [16, 24, 11, 26]). This approach of One Iteration Approximation (OIA) led, in many cases, to algorithms that enjoy global convergence guarantees using the techniques mentioned in Section 1.1. However, the idea of OIA is limited to algorithms with a certain sufficient descent property (see our discussion above). Motivated by approximating the solution of the sub-problems using methods without a sufficient decrease property (like FISTA [8] and MFISTA [7]), we study the class of Nested Alternating Minimization (NAM) algorithms, which approximate the solution of the sub-problems using several iterations instead of one, in order to achieve the relaxed sufficient descent property discussed above.

In Section 3 we propose the general algorithmic framework of NAM for solving non-convex and non-smooth problems. Our main contribution is to show that this algorithmic framework, which captures a wide spectrum of algorithms, generates a globally convergent

sequence to critical points. To this end we define the notion of Nested Friendly Algorithms (NFA), which captures the minimal requirements a nested algorithm must satisfy in order to achieve the global convergence of NAM. In Section 4 we illustrate the concept of NFA by presenting several examples that satisfy the requirements and can therefore be used in NAM. Another central aspect of applying NAM is determining the number of inner iterations which is sufficient to guarantee global convergence. This is also discussed in Section 4. Finally, in Section 5 we illustrate the advantage of the NAM framework in tackling the Regularized Structured Total Least Squares (RSTLS) problem via its application to an image processing task.

2 A Procedure for Global Convergence with Errors

Consider an extended real-valued function F : ℝ^n × ℝ^m → (−∞, ∞] which is bounded from below, and assume that F is proper and lower semi-continuous. We consider the following two-block minimization problem

    min_{(z,u) ∈ ℝ^n × ℝ^m} F(z, u).    (1)

Starting with any pair (z^0, u^0), let {(z^k, u^k)}_{k≥0} be a sequence generated by a generic algorithm, which we denote by A, that aims at tackling Problem (1) in the sense of global convergence to critical points of the function F. As mentioned above, our goal in this section is to extend recent proof techniques with a set of relaxed conditions that still guarantee global convergence (see [10] and also [11] for a recent and concise version). We also use the same notations and definitions from non-smooth analysis as in these papers. Therefore, for instance, when we write ∂F we mean the limiting sub-differential (see [25]). Following [10] we extend the definition of gradient-like descent sequences in the following way.

Definition 1. [Approximate gradient-like descent sequence]. A sequence {(z^k, u^k)}_{k≥0} is called an approximate gradient-like descent sequence for minimizing the function F : ℝ^n × ℝ^m → (−∞, ∞] if the following conditions hold:

(C1) Approximate sufficient decrease property. The sequence {F(z^k, u^k)}_{k≥0} is non-increasing, and there exist a positive scalar ρ₁ > 0 and non-negative error terms e₁^k ≥ 0 such that

    ρ₁ ‖z^{k+1} − z^k‖² − e₁^k ≤ F(z^k, u^k) − F(z^{k+1}, u^{k+1}),  ∀ k ≥ 0.

(C2) Approximate sub-gradient lower bound on the iterates gap. There exist a vector w^{k+1} ∈ ∂F(z^{k+1}, u^{k+1}), a positive scalar ρ₂ > 0 and non-negative error terms e₂^k ≥ 0 such that

    ‖w^{k+1}‖ ≤ ρ₂ ‖z^{k+1} − z^k‖ + e₂^k,  ∀ k ≥ 0.

(C3) Continuity. If (z̄, ū) is a limit point of some sub-sequence {(z^k, u^k)}_{k∈K⊆ℕ}, then

    lim sup_{k∈K⊆ℕ} F(z^k, u^k) ≤ F(z̄, ū).

(C4) Summability of the errors. The sequences {√(e₁^k)}_{k≥0} and {e₂^k}_{k≥0} are summable, that is,

    ∑_{k=1}^∞ √(e₁^k) < ∞  and  ∑_{k=1}^∞ e₂^k < ∞.

By Condition (C4), these errors should be small enough that their infinite sum is finite. See [22] for a set of conditions which includes errors only in Condition (C2).

We would also like to stress the fact that now, in Definition 1, the two-block structure comes into play, since the first two conditions are measured only in terms of the variable z (see also [26]). This adds another level of flexibility in proving the approximate gradient-like descent sequence property.
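On recorded iterates, the conditions of Definition 1 can be checked numerically up to a finite horizon. The following sketch is illustrative only; the function name, arguments, and tolerance are our own choices, not from the text, and summability (C4) can only be proxied by partial sums on a finite run:

```python
import numpy as np

def check_descent_conditions(F_vals, z_iters, subgrad_norms, e1, e2,
                             rho1=1.0, rho2=1.0, tol=1e-12):
    """Check conditions (C1), (C2) and (C4) of Definition 1 on recorded
    iterates. Illustrative naming: F_vals[k] = F(z^k, u^k), z_iters[k] = z^k,
    subgrad_norms[k] = ||w^k|| for some w^k in the sub-differential, and
    e1[k], e2[k] are the error terms e_1^k, e_2^k."""
    gaps = [np.linalg.norm(z_iters[k + 1] - z_iters[k])
            for k in range(len(z_iters) - 1)]
    # (C1): rho1 * ||z^{k+1} - z^k||^2 - e_1^k <= F(z^k,u^k) - F(z^{k+1},u^{k+1})
    c1 = all(rho1 * g ** 2 - e1[k] <= F_vals[k] - F_vals[k + 1] + tol
             for k, g in enumerate(gaps))
    # (C2): ||w^{k+1}|| <= rho2 * ||z^{k+1} - z^k|| + e_2^k
    c2 = all(subgrad_norms[k + 1] <= rho2 * g + e2[k] + tol
             for k, g in enumerate(gaps))
    # (C4): on a finite run we can only report the partial sums that the
    # condition requires to stay bounded
    c4_partial_sums = (float(np.sum(np.sqrt(e1))), float(np.sum(e2)))
    return c1, c2, c4_partial_sums
```

For instance, gradient descent with step 1/2 on F(z, u) = ‖z‖² (no u-block) satisfies (C1) and (C2) with ρ₁ = 1, ρ₂ = 2 and zero error terms.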

Now, we are in a position to prove that these relaxed conditions are enough to guarantee global convergence to critical points of F. To this end, we denote by ω(x^0) the set of limit points of a sequence {x^k}_{k≥0}, and for any function f we denote by crit(f) the set of its critical points.

Lemma 1. Let {(z^k, u^k)}_{k≥0} be a bounded approximate gradient-like descent sequence for minimizing F of Problem (1). Then, ω(z^0, u^0) is a non-empty and compact subset of crit(F) and

    lim_{k→∞} dist((z^k, u^k), ω(z^0, u^0)) = 0.

In addition, the function F is finite and constant on ω(z^0, u^0).

Proof. Since {(z^k, u^k)}_{k≥0} is bounded, ω(z^0, u^0) is a non-empty and compact set. Therefore, it follows that dist((z^k, u^k), ω(z^0, u^0)) converges to 0 as k goes to infinity. From Condition (C1) we have

    ‖z^{k+1} − z^k‖² ≤ (1/ρ₁) (F(z^k, u^k) − F(z^{k+1}, u^{k+1}) + e₁^k).    (2)

The non-ascent property required in Condition (C1) and the boundedness from below of F yield the convergence of the sequence {F(z^k, u^k)}_{k≥0}. This, together with Condition (C4), implies from (2) that

    lim_{k→∞} ‖z^{k+1} − z^k‖ = 0.    (3)

Let {(z^k, u^k)}_{k∈K} be a sub-sequence converging to some pair (z*, u*). Denote F* ≡ F(z*, u*). From Condition (C3) and the lower semi-continuity of F we derive that

    lim_{k→∞} F(z^k, u^k) = lim_{k∈K} F(z^k, u^k) = F*.    (4)

From Condition (C2) there exists a sequence {w^k}_{k≥0}, with w^k ∈ ∂F(z^k, u^k) for all k ≥ 0, that satisfies

    lim_{k→∞} ‖w^k‖ ≤ lim_{k→∞} (ρ₂ ‖z^{k+1} − z^k‖ + e₂^k) = 0,

where the equality follows from (3) and Condition (C4). This means that the sequence {w^k}_{k≥0} converges to 0 as k → ∞, and therefore (z*, u*) ∈ crit(F), as follows from the closedness property of the sub-differential (see [25]). Finally, since {F(z^k, u^k)}_{k≥0} converges, we can assume that lim_{k→∞} F(z^k, u^k) = c ∈ ℝ. Then lim_{k∈K} F(z^k, u^k) = c, and from (4) we derive that F is finite and constant over the set ω(z^0, u^0).

We now prove the main result of this section. It states that if a generic optimization method A generates a bounded approximate gradient-like descent sequence for minimizing F of Problem (1), then it globally converges to a critical point of F. It should be noted that this unique limit point depends on the initialization point of A.

Theorem 1. [Global convergence]. Let {(z^k, u^k)}_{k≥0} be a bounded approximate gradient-like descent sequence for minimizing F of Problem (1). If F satisfies the KL property, then the sequence {z^k}_{k≥0} has finite length and globally converges to a point z*. Moreover, let u* be any limit point of the sequence {u^k}_{k≥0}; then (z*, u*) ∈ crit(F).

Proof. Let (z*, u*) be a limit point of the sequence {(z^k, u^k)}_{k≥0}, which exists due to the boundedness assumption. From Lemma 1 (see (4)) we derive

    lim_{k→∞} F(z^k, u^k) = F(z*, u*).

Assume first that there exists k̃ ∈ ℕ such that F(z^k, u^k) = F(z*, u*) and e₁^k = 0 for all k ≥ k̃. Then, from Condition (C1) it follows that {F(z^k, u^k)}_{k≥0} has a "regular" sufficient decrease property, and therefore z^{k+1} = z^k for all k ≥ k̃. This proves finite convergence of the sequence {z^k}_{k≥0}, and obviously the desired result.

Now, on the other hand, assume that F(z^k, u^k) > F(z*, u*) for all k ≥ 0. We know from Condition (C1) that the sequence {F(z^k, u^k)}_{k≥0} is non-increasing and converges to F(z*, u*). Thus, for any γ > 0 there exists an integer k₀ such that F(z^k, u^k) < F(z*, u*) + γ for all k > k₀. From Lemma 1,

    lim_{k→∞} dist((z^k, u^k), ω(z^0, u^0)) = 0.

So, for any ε > 0 there exists an integer k₁ such that dist((z^k, u^k), ω(z^0, u^0)) < ε for any k > k₁. Again from Lemma 1, the function F is finite and constant on the non-empty and compact set ω(z^0, u^0). Since F(z^k, u^k) − F(z*, u*) > 0, it follows that ϕ′(F(z^k, u^k) − F(z*, u*)) is well-defined, where ϕ is any desingularizing function (see [9]).

From the Uniformized KL property [10, Lemma 6] there exists a desingularizing function ϕ such that for any k > s ≡ max{k₀, k₁} + 1 we have

    ϕ′(F(z^k, u^k) − F(z*, u*)) · dist(0, ∂F(z^k, u^k)) ≥ 1.    (5)

Since dist(0, ∂F(z^k, u^k)) ≤ ‖w^k‖ for any w^k ∈ ∂F(z^k, u^k), it follows from Condition (C2) and (5) that

    ϕ′(F(z^k, u^k) − F(z*, u*)) ≥ 1 / (ρ₂ ‖z^k − z^{k−1}‖ + e₂^{k−1}).    (6)

For n₁, n₂ ∈ ℕ we denote

    Δ_{n₁,n₂} ≡ ϕ(F(z^{n₁}, u^{n₁}) − F(z*, u*)) − ϕ(F(z^{n₂}, u^{n₂}) − F(z*, u*)).

Since ϕ is concave, it follows from the gradient inequality that

    Δ_{k,k+1} ≥ ϕ′(F(z^k, u^k) − F(z*, u*)) · (F(z^k, u^k) − F(z^{k+1}, u^{k+1})).    (7)

Combining (6) and (7) with Condition (C1) we establish

    Δ_{k,k+1} ≥ ρ₁ (‖z^{k+1} − z^k‖² − e₁^k/ρ₁) / (ρ₂ (‖z^k − z^{k−1}‖ + e₂^{k−1}/ρ₂)).    (8)

Denote c = ρ₂/ρ₁, ẽ₁^k := e₁^k/ρ₁ and ẽ₂^k := e₂^k/ρ₂. We obtain from (8) that

    √(c Δ_{k,k+1} (‖z^k − z^{k−1}‖ + ẽ₂^{k−1}) + ẽ₁^k) ≥ ‖z^{k+1} − z^k‖.    (9)

Now, since c Δ_{k,k+1} (‖z^k − z^{k−1}‖ + ẽ₂^{k−1}) ≥ 0 and ẽ₁^k ≥ 0, we obtain from (9) and the geometric-arithmetic mean inequality that

    ‖z^{k+1} − z^k‖ ≤ √(c Δ_{k,k+1} (‖z^k − z^{k−1}‖ + ẽ₂^{k−1})) + √(ẽ₁^k)
        ≤ (c/2) Δ_{k,k+1} + (1/2) ‖z^k − z^{k−1}‖ + (1/2) ẽ₂^{k−1} + √(ẽ₁^k).    (10)

Summing up inequalities (10) for r = s+1, s+2, ..., k, we derive that

    2 ∑_{r=s+1}^k ‖z^{r+1} − z^r‖ ≤ ∑_{r=s+1}^k ‖z^r − z^{r−1}‖ + c ∑_{r=s+1}^k Δ_{r,r+1} + ∑_{r=s+1}^k (2√(ẽ₁^r) + ẽ₂^{r−1})
        ≤ ∑_{r=s+1}^k ‖z^{r+1} − z^r‖ + ‖z^{s+1} − z^s‖ + c ∑_{r=s+1}^k Δ_{r,r+1} + ∑_{r=s+1}^k (2√(ẽ₁^r) + ẽ₂^{r−1})
        = ∑_{r=s+1}^k ‖z^{r+1} − z^r‖ + ‖z^{s+1} − z^s‖ + c Δ_{s+1,k+1} + ∑_{r=s+1}^k (2√(ẽ₁^r) + ẽ₂^{r−1}),    (11)

where the last equality follows from the definition of Δ_{r,r+1}. We now see that

    ∑_{r=s+1}^k ‖z^{r+1} − z^r‖ ≤ ‖z^{s+1} − z^s‖ + ∑_{r=s+1}^k (2√(ẽ₁^r) + ẽ₂^{r−1}) + c Δ_{s+1,k+1}
        ≤ ‖z^{s+1} − z^s‖ + ∑_{r=s+1}^k (2√(ẽ₁^r) + ẽ₂^{r−1}) + c ϕ(F(z^{s+1}, u^{s+1}) − F(z*, u*)),    (12)

where the last inequality follows from the definition of Δ_{s+1,k+1} and the fact that ϕ ≥ 0. Due to Condition (C4), the infinite sum of errors in (12) is also finite, and therefore we finally derive that ∑_{r=1}^∞ ‖z^{r+1} − z^r‖ < ∞. Hence, the sequence {z^k}_{k≥0} is of finite length, and it converges globally to some z*. Last, from Lemma 1 it follows that (z*, u*) is a critical point of the function F.

3 Global Convergence of Nested Alternating Minimization Algorithms

In this section we introduce the class of Nested Alternating Minimization (NAM) algorithms and provide a global convergence result for this class. To better illustrate the concept of this class, we consider the optimization model (1), where the vector z is split into two blocks as follows: z ≡ (x^T, y^T)^T ∈ ℝ^p × ℝ^q. More precisely, we focus here on the following three-block model

    min_{(x,y,u) ∈ ℝ^p × ℝ^q × ℝ^m} {F(x, y, u) ≡ G(x, y, u) + H(u)},    (13)

where the function G : ℝ^p × ℝ^q × ℝ^m → ℝ is non-convex and smooth, and H : ℝ^m → (−∞, ∞] is non-convex and non-smooth (precise assumptions will be made below). The results in this section can easily be extended to any finite number of blocks of variables; however, in order to keep the developments simple, we address the case where the block z is decomposed into only two blocks. The algorithmic framework of Alternating Minimization suggests alternating between the three blocks of Problem (13) in a cyclic fashion. This results in three sub-problems that should be solved exactly at each iteration. However, as already discussed in the Introduction, here we are interested in their approximations. Therefore, a reason to consider this three-block model is the assumption that the sub-problems with respect to the first two blocks x and y will be approximated using nested algorithms, while the third block will remain exactly solved. More precisely, we now introduce the class of Nested Alternating Minimization algorithms.

Algorithm 1 Nested Alternating Minimization (NAM) Scheme

1: Input: two nested friendly optimization algorithms A1 and A2.
2: Initialization: (x^{−1}, y^{−1}, u^0) ∈ ℝ^p × ℝ^q × ℝ^m.
3: Iterative step:
4: for k ≥ 0 do
5:   Set x^k after i_{k−1} iterations of A1 starting with x^{k−1} for minimizing F(·, y^{k−1}, u^k).
6:   Set y^k after j_{k−1} iterations of A2 starting with y^{k−1} for minimizing F(x^k, ·, u^k).
7:   u^{k+1} ∈ argmin {F(x^k, y^k, u) : u ∈ ℝ^m}.
8: end for
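In code, the scheme above can be sketched as follows. This is a minimal illustration on a toy instance: the smooth model F(x, y, u) = 0.5‖x − u‖² + 0.5‖y − u‖² + 0.5‖u − b‖² (with H = 0), plain gradient descent standing in for the nested algorithms A1 and A2, and the growing inner-iteration schedule are all assumptions made for this sketch, not prescriptions from the text:

```python
import numpy as np

def grad_steps(grad, v, step, n_iter):
    # Plain gradient descent, used here as a stand-in for the nested
    # algorithms A1 and A2 (the scheme allows any nested friendly algorithm).
    for _ in range(n_iter):
        v = v - step * grad(v)
    return v

def nam(b, outer_iters=50, inner_iters=lambda k: 1 + k):
    """Sketch of the NAM scheme on the illustrative model
    F(x, y, u) = 0.5||x-u||^2 + 0.5||y-u||^2 + 0.5||u-b||^2  (H = 0).
    The model, step size and inner-iteration schedule are our own choices."""
    x = np.zeros_like(b)
    y = np.zeros_like(b)
    u = np.zeros_like(b)
    for k in range(outer_iters):
        n = inner_iters(k)
        # Step 5: approximately minimize F(., y, u) in x (gradient is x - u)
        x = grad_steps(lambda v: v - u, x, step=0.5, n_iter=n)
        # Step 6: approximately minimize F(x, ., u) in y (gradient is y - u)
        y = grad_steps(lambda v: v - u, y, step=0.5, n_iter=n)
        # Step 7: exact minimization in u, in closed form for this model
        u = (x + y + b) / 3.0
    return x, y, u
```

On this toy model the three blocks are driven to the common minimizer b; the growing schedule for the number of inner iterations anticipates the requirement, discussed in Section 4, that the inner iterations increase along the outer iterations.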

It should be noted that algorithms A1 and A2, which are used in NAM, need not be the same. This adds flexibility in choosing algorithms which best fit the structure of each sub-problem. In order to obtain the global convergence of NAM, we will make the following assumption on the involved objective function F.

Assumption A. (i) For any fixed (y, u) ∈ ℝ^q × ℝ^m, the function x ↦ F(x, y, u) is σ₁(y, u)-strongly convex with an L₁(y, u)-Lipschitz continuous gradient. In addition, for any compact D ⊆ ℝ^q × ℝ^m there exist positive scalars σ̄₁ and L̄₁ such that inf_{(y,u)∈D} σ₁(y, u) = σ̄₁ and sup_{(y,u)∈D} L₁(y, u) = L̄₁.

(ii) For any fixed (x, u) ∈ ℝ^p × ℝ^m, the function y ↦ F(x, y, u) is σ₂(x, u)-strongly convex with an L₂(x, u)-Lipschitz continuous gradient. In addition, for any compact D ⊆ ℝ^p × ℝ^m there exist positive scalars σ̄₂ and L̄₂ such that inf_{(x,u)∈D} σ₂(x, u) = σ̄₂ and sup_{(x,u)∈D} L₂(x, u) = L̄₂.

(iii) The gradient of the function G is L-Lipschitz continuous on bounded subsets of ℝ^p × ℝ^q × ℝ^m.

At this juncture we would like to note that this assumption is satisfied in many challenging applications, such as the Regularized Structured Total Least Squares problem (which will be discussed in detail in Section 5), the Wireless Sensor Network Localization problem [27], Multi-Dimensional Scaling [28, Chapter 6], and more.

Before proceeding with NAM, we would like to discuss the requirements on the nested algorithms for guaranteeing the global convergence of NAM. To this end, and as already mentioned in NAM, we define the following concept.

Definition 2. [Nested friendly algorithm]. Let ϕ_k : ℝ^d → ℝ, k ∈ ℕ, be a family of convex functions. Let A be an optimization algorithm that generates, for any k ∈ ℕ, a sequence {v^{k,i}}_{i≥0} starting from v^{k,0}. We say that A is a Nested Friendly Algorithm (NFA) with respect to {ϕ_k}_{k≥0} if for any k ∈ ℕ there exist an index i_k ∈ ℕ and a sequence of non-negative scalars {c_A(k, i)}_{i≥0} such that

(i) lim_{i→∞} c_A(k, i) = 0.

(ii) ϕ_k(v^{k,i_k}) − ϕ_k^* ≤ (c_A²(k, i_k)/2) · ‖v^{k,0} − v_k^*‖², where v_k^* is a minimizer of ϕ_k with an optimal value of ϕ_k^*.

(iii) ∑_{k=1}^∞ √(c_A(k, i_k)) < ∞.
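As a concrete illustration of Definition 2(ii), consider plain gradient descent with step size 1/L on a σ-strongly convex and L-smooth function (an example of our own choosing, not one named in the text). The classical bound ϕ(v^i) − ϕ* ≤ (L/2)(1 − σ/L)^i ‖v^0 − v*‖² corresponds to the rate sequence c_A(k, i) = √L · (1 − σ/L)^{i/2}, which the following sketch verifies numerically on a two-dimensional quadratic:

```python
import numpy as np

def gd_rate_certificate(sigma=0.5, L=4.0, n_iter=30, seed=0):
    """Check Definition 2(ii) numerically for gradient descent with step 1/L
    on the quadratic phi(v) = 0.5 * (sigma * v_1^2 + L * v_2^2), for which
    v* = 0 and phi* = 0. The candidate rate sequence is
    c_A(k, i) = sqrt(L) * (1 - sigma/L)^(i/2)."""
    rng = np.random.default_rng(seed)
    diag = np.array([sigma, L])
    v0 = rng.standard_normal(2)
    v, ok = v0.copy(), True
    for i in range(n_iter + 1):
        c = np.sqrt(L) * (1.0 - sigma / L) ** (i / 2.0)  # c_A(k, i)
        phi = 0.5 * float(diag @ (v ** 2))               # phi(v^i), phi* = 0
        ok = ok and phi <= 0.5 * c ** 2 * float(v0 @ v0) + 1e-12
        v = v - (1.0 / L) * (diag * v)                   # gradient step, step 1/L
    return bool(ok)
```

Since c_A(k, i) decays geometrically in i here, requirement (i) holds, and requirement (iii) can then be met by a suitable choice of the indices i_k.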

This definition is crucial for us to derive the global convergence of NAM, which will be carried out next. We will discuss examples of optimization algorithms which are NFAs in Section 4. Before getting into the convergence of NAM, we would like to show that NFAs satisfy some properties which will be useful for us in proving the global convergence of NAM.

Lemma 2. Let ϕ_k : ℝ^d → ℝ, k ∈ ℕ, be a family of continuously differentiable and σ_k-strongly convex functions, and assume that the gradient of each ϕ_k is L_k-Lipschitz continuous. Let A be an NFA with respect to {ϕ_k}_{k≥0} that generates sequences {v^{k,i}}_{i≥0} with v^{k,0} = v^{k−1,i_{k−1}} for some v^{0,0} = v^0. Then, the following properties hold:

(i) Overall non-increasing function value. ϕ_k(v^{k,i_k}) ≤ ϕ_k(v^{k,0}) = ϕ_k(v^{k−1,i_{k−1}}).

(ii) Convergence of the gradients. ‖∇ϕ_k(v^{k,i_k})‖ ≤ δ_{k,i_k} ≡ L_k c_A(k, i_k) ‖v^{k,0} − v_k^*‖ / √(σ_k).

Proof. To prove that item (i) holds, we use the strong convexity of ϕ_k to establish that

    ϕ_k(v^{k,0}) ≥ ϕ_k^* + ∇ϕ_k(v_k^*)^T (v^{k,0} − v_k^*) + (σ_k/2) ‖v^{k,0} − v_k^*‖² = ϕ_k^* + (σ_k/2) ‖v^{k,0} − v_k^*‖²,    (14)

where the last equality follows from the fact that v_k^* is a minimizer of ϕ_k and therefore ∇ϕ_k(v_k^*) = 0. This fact with Definition 2(ii) yields that

    ϕ_k(v^{k,i_k}) − ϕ_k^* ≤ (c_A²(k, i_k)/σ_k) (ϕ_k(v^{k,0}) − ϕ_k^*).

Therefore, item (i) holds for any i_k ≥ 0 that satisfies c_A²(k, i_k) ≤ σ_k. Since Definition 2(i) states that c_A(k, i) → 0 as i → ∞, there must exist such an i_k ∈ ℕ.

To show that item (ii) holds, we use the fact that the functions ϕ_k are σ_k-strongly convex with L_k-Lipschitz continuous gradients for all k ≥ 0 to obtain that

    ϕ_k(v^{k,i_k}) − ϕ_k^* ≥ (σ_k/2) ‖v^{k,i_k} − v_k^*‖² ≥ (σ_k/(2L_k²)) ‖∇ϕ_k(v^{k,i_k}) − ∇ϕ_k(v_k^*)‖² = (σ_k/(2L_k²)) ‖∇ϕ_k(v^{k,i_k})‖²,

where the first inequality follows again from the fact that ∇ϕ_k(v_k^*) = 0. This fact with Definition 2(ii) yields that

    ‖∇ϕ_k(v^{k,i_k})‖ ≤ (c_A(k, i_k)/√(σ_k)) · L_k ‖v^{k,0} − v_k^*‖,    (15)

and item (ii) holds true with δ_{k,i_k} ≡ L_k c_A(k, i_k) ‖v^{k,0} − v_k^*‖ / √(σ_k).

Remark 1. Equipped with the properties of NFAs from Lemma 2, we would like to describe the concept of NFA in the context of NAM. For example, according to Lemma 2, saying that algorithm A1 in NAM is an NFA means that, with ϕ_k(x) := F(x, y^k, u^{k+1}) for all k ∈ ℕ, at each outer iteration k algorithm A1 iteratively generates a sequence {x^{k,i}}_{i≥0} which starts at x^{k,0} = x^k and stops after i_k inner iterations at x^{k,i_k} = x^{k+1} (where the index i_k is as in NAM and Lemma 2) such that

(i) F(x^{k+1}, y^k, u^{k+1}) ≤ F(x^k, y^k, u^{k+1}) for all k ≥ 0.

(ii) ‖∇_x F(x^{k+1}, y^k, u^{k+1})‖ ≤ δ_x^{k+1}, where δ_x^{k+1} ≡ δ_{k,i_k} = L_k c_A(k, i_k) ‖x^k − x_k^*‖ / √(σ_k).

Similar properties follow for algorithm A2 with respect to the y-block.

Now we are ready to prove the following main result.

Theorem 2. [Global convergence of NAM]. Let {(x^k, y^k, u^k)}_{k≥0} be a bounded sequence generated by NAM. If F satisfies Assumption A and the KL property, then the sequence {(x^k, y^k)}_{k≥0} has finite length and globally converges to some pair (x*, y*). Moreover, let u* be any limit point of the sequence {u^k}_{k≥0}; then (x*, y*, u*) ∈ crit(F).

Proof. Since A1 is an NFA, we denote ϕ_k(x) = F(x, y^k, u^{k+1}) for each k ≥ 0, and derive from Assumption A that ϕ_k is σ_k(y^k, u^{k+1})-strongly convex with an L_k(y^k, u^{k+1})-Lipschitz continuous gradient. Therefore, Lemma 2 can be applied to {ϕ_k}_{k≥0}, and the properties of Remark 1 hold true. Similar arguments hold for the y-block, since A2 is also an NFA. Throughout the proof, x^{k,0} = x^k, x^{k,i_k} = x^{k+1}, y^{k,0} = y^k and y^{k,j_k} = y^{k+1} for all k ∈ ℕ.

We will show that the sequence {(x^k, y^k, u^k)}_{k≥0} is an approximate gradient-like descent sequence according to Definition 1, which is enough to obtain the stated result due to Theorem 1 with z = (x, y). To this end, we first prove that Condition (C1) is satisfied. By the definition of u^{k+1} (see step 7 in NAM) we have

    F(x^k, y^k, u^k) ≥ F(x^k, y^k, u^{k+1}) ≥ F(x^{k+1}, y^k, u^{k+1}) ≥ F(x^{k+1}, y^{k+1}, u^{k+1}),    (16)

where the last two inequalities hold true using the fact that algorithms A1 and A2 are NFAs (see Remark 1). Therefore, the sequence {F(x^k, y^k, u^k)}_{k≥0} is non-increasing. Since the sequences {x^k}_{k≥0} and {y^k}_{k≥0} are bounded as assumed, there exists M ≥ 0 such that

‖x^k − x^{k+1}‖ ≤ M and ‖y^k − y^{k+1}‖ ≤ M for all k ≥ 0. In addition, from strong convexity we establish that

    F(x^k, y^k, u^{k+1}) − F(x^{k+1}, y^k, u^{k+1})
        ≥ ∇_x F(x^{k+1}, y^k, u^{k+1})^T (x^k − x^{k+1}) + (σ₁(y^k, u^{k+1})/2) ‖x^{k+1} − x^k‖²
        ≥ −‖∇_x F(x^{k+1}, y^k, u^{k+1})‖ · ‖x^k − x^{k+1}‖ + (σ₁(y^k, u^{k+1})/2) ‖x^{k+1} − x^k‖²
        ≥ (σ̄₁/2) ‖x^{k+1} − x^k‖² − M δ_x^{k+1},    (17)

where the last inequality follows from Assumption A(i) and Lemma 2(ii). Similarly, for the y-block we obtain

    F(x^{k+1}, y^k, u^{k+1}) − F(x^{k+1}, y^{k+1}, u^{k+1}) ≥ (σ̄₂/2) ‖y^{k+1} − y^k‖² − M δ_y^{k+1}.    (18)

Combining (16), (17) and (18) we obtain that

    F(x^k, y^k, u^k) − F(x^{k+1}, y^{k+1}, u^{k+1}) ≥ ρ₁ ‖z^{k+1} − z^k‖² − e₁^k,

where ρ₁ = (1/2) · min{σ̄₁, σ̄₂} and e₁^k = M (δ_x^{k+1} + δ_y^{k+1}). This proves Condition (C1).

Now we prove that Condition (C2) is satisfied. We notice that

    ∂F(z^{k+1}, u^{k+1}) = (∇_z F(z^{k+1}, u^{k+1}), ∂_u F(z^{k+1}, u^{k+1}))
                         = (∇_z F(z^{k+1}, u^{k+1}), ∇_u G(z^{k+1}, u^{k+1}) + ∂H(u^{k+1})).    (19)

Since u^{k+1} is a minimizer of the function u ↦ F(z^k, u), it follows from the first-order optimality condition that

    0 ∈ ∇_u G(z^k, u^{k+1}) + ∂H(u^{k+1}).    (20)

Therefore, combining (19) with (20) we establish the inclusion

    w^{k+1} ≡ (∇_z F(z^{k+1}, u^{k+1}), ∇_u G(z^{k+1}, u^{k+1}) − ∇_u G(z^k, u^{k+1})) ∈ ∂F(z^{k+1}, u^{k+1}).    (21)

From the triangle inequality we have

    ‖∇_z F(z^{k+1}, u^{k+1})‖ ≤ ‖(∇_x F(x^{k+1}, y^{k+1}, u^{k+1}) − ∇_x F(x^{k+1}, y^k, u^{k+1}), 0)‖
                              + ‖(∇_x F(x^{k+1}, y^k, u^{k+1}), ∇_y F(x^{k+1}, y^{k+1}, u^{k+1}))‖.    (22)

Since {(x^k, y^k, u^k)}_{k≥0} is bounded, using Assumption A(iii) and Remark 1(ii) we obtain from (22) that

    ‖∇_z F(z^{k+1}, u^{k+1})‖ ≤ L ‖y^{k+1} − y^k‖ + δ_x^{k+1} + δ_y^{k+1} ≤ L ‖z^{k+1} − z^k‖ + δ_x^{k+1} + δ_y^{k+1}.    (23)

From (21) and (23), we derive using the triangle inequality and Assumption A(iii) that

    ‖w^{k+1}‖ ≤ ‖∇_z F(z^{k+1}, u^{k+1})‖ + ‖∇_u G(z^{k+1}, u^{k+1}) − ∇_u G(z^k, u^{k+1})‖ ≤ ρ₂ ‖z^{k+1} − z^k‖ + e₂^k,

where ρ₂ = 2L and e₂^k = δ_x^{k+1} + δ_y^{k+1}. This proves Condition (C2).

To prove Condition (C3), we take a sub-sequence {(z^{q_k}, u^{q_k})}_{k≥0} which converges to some pair (z*, u*). From the definition of u^{k+1} we obtain that

    G(z^k, u^{k+1}) + H(u^{k+1}) ≤ G(z^k, u*) + H(u*),

and thus from Assumption A(iii) it follows that

    H(u^{k+1}) ≤ H(u*) + G(z^k, u*) − G(z^k, u^{k+1})
               ≤ H(u*) + ∇_u G(z^k, u^{k+1})^T (u* − u^{k+1}) + (L/2) ‖u* − u^{k+1}‖².

Substituting k with q_k − 1 and taking k → ∞ yields

    lim sup_{k→∞} H(u^{q_k}) ≤ H(u*).

Using the continuity of G we obtain Condition (C3). Finally, in order to prove Condition (C4) we have to show that both {√(e₁^k)}_{k≥0} and {e₂^k}_{k≥0} are summable. Therefore, by the definitions of e₁^k and e₂^k, it is enough to show that {√(e₁^k)}_{k≥0}

is summable. From the σ̄₁-strong convexity of the function x ↦ F(x, y^{k−1}, u^k) (see Assumption A(i)) we have

    F(x^k, y^{k−1}, u^k) − F(x_k^*, y^{k−1}, u^k) ≥ ∇_x F(x_k^*, y^{k−1}, u^k)^T (x^k − x_k^*) + (σ̄₁/2) ‖x^k − x_k^*‖²
                                                  = (σ̄₁/2) ‖x^k − x_k^*‖²,    (24)

where the last equality follows from the fact that ∇_x F(x_k^*, y^{k−1}, u^k) = 0. Therefore, since we assume that F is bounded from below by some F̄ ∈ ℝ, it follows from (24) that

    ‖x^k − x_k^*‖² ≤ (2/σ̄₁) (F(x^k, y^{k−1}, u^k) − F(x_k^*, y^{k−1}, u^k)) ≤ (2/σ̄₁) (F(x^{−1}, y^{−1}, u^1) − F̄),    (25)

where the last inequality follows from (16). Now, using Remark 1(ii), we derive from (25) that

    δ_x^{k+1} = (L_k/√(σ_k)) c_{A1}(k, i_k) ‖x^k − x_k^*‖ ≤ (L̄₁/√(σ̄₁)) c_{A1}(k, i_k) ‖x^k − x_k^*‖
              ≤ √(2 (F(x^{−1}, y^{−1}, u^1) − F̄)) · κ̄₁ · c_{A1}(k, i_k),

where κ̄₁ ≡ L̄₁/σ̄₁. Therefore, from Definition 2(iii) we obtain that ∑_{k=1}^∞ √(δ_x^{k+1}) < ∞.

and therefore c_A(k, i) = 2√(L_k)/(i + 1), where L_k is the Lipschitz constant of the gradient of the function ϕ_k. So, as can be seen, any optimization method with a provable rate of convergence in terms of function values is an NFA (the third requirement, to be discussed below, will also be valid). Before discussing the number of inner iterations we present another example. If ϕ_k, k ∈ ℕ, is additionally σ_k-strongly convex (as will be the case in NAM due to Assumption A), then we can use a variant of AG called V-AG which exploits the strong convexity and enjoys a better rate (see [5, Theorem 10.42]):

    ϕ_k(v^{k,i}) − ϕ_k^* ≤ (1 − 1/√(κ_k))^i · ((σ_k + L_k)/2) ‖v^{k,0} − v_k^*‖²,

and therefore c_A(k, i) = √(σ_k + L_k) · (1 − κ_k^{−1/2})^{i/2}, where κ_k = L_k/σ_k.

Now we would like to focus on the issue of determining the number of inner iterations $i_k$ without focusing on a specific algorithm. From Lemma 2 we see that there are two requirements on $i_k$:

(a) $c_{\mathcal{A}}^2(k, i_k) \leq \sigma_k$, where $\sigma_k$ is the strong convexity parameter of $\varphi_k$.

(b) Summability of the sequence $\{\sqrt{c_{\mathcal{A}}(k, i_k)}\}_{k \geq 0}$.

We should mention that it is enough to require that the above two requirements are satisfied for all $k \geq K$ for some $K \in \mathbb{N}$ (and we use this fact in Section 5). Notice that each requirement is satisfied for some $i_k$, and the number of inner iterations is chosen to be the maximum between them. The first requirement can be easily verified since $c_{\mathcal{A}}(k, i_k)$ has an explicit dependence on $i_k$. For instance, in the case of AG we have that
$$c_{\mathcal{A}}(k, i_k) = \frac{2\sqrt{L_k}}{i_k + 1},$$
and therefore $\left\lceil 2\sqrt{\kappa_k} - 1 \right\rceil \leq i_k$, where $\kappa_k = L_k/\sigma_k$. Similarly, we can show that for V-AG we have
$$\left\lceil \frac{\log(\sigma_k + L_k) - \log(2\sigma_k)}{\log\left(\sqrt{\kappa_k}\right) - \log\left(\sqrt{\kappa_k} - 1\right)} \right\rceil \leq i_k.$$
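To make the two lower bounds on $i_k$ concrete, the following sketch (not from the paper; the values $\sigma_k = 1$ and $L_k = 100$ are purely illustrative) evaluates the minimal number of inner iterations implied by requirement (a) for AG and for V-AG:

```python
import math

def min_iters_ag(kappa):
    # AG: c_A(k, i)^2 = 4 L_k / (i + 1)^2 <= sigma_k  <=>  i >= 2 sqrt(kappa) - 1,
    # with condition number kappa = L_k / sigma_k.
    return math.ceil(2.0 * math.sqrt(kappa) - 1.0)

def min_iters_vag(sigma, L):
    # V-AG bound from the text:
    # i >= (log(sigma + L) - log(2 sigma)) / (log(sqrt(kappa)) - log(sqrt(kappa) - 1)).
    kappa = L / sigma
    num = math.log(sigma + L) - math.log(2.0 * sigma)
    den = math.log(math.sqrt(kappa)) - math.log(math.sqrt(kappa) - 1.0)
    return math.ceil(num / den)

# Illustrative subproblem with sigma_k = 1 and L_k = 100 (kappa = 100).
print(min_iters_ag(100.0))        # 19
print(min_iters_vag(1.0, 100.0))  # 38
```

Note that for large $\kappa_k$ the V-AG threshold can exceed the AG one, which is consistent with the practical issues with V-AG discussed in Section 5.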

The second requirement can also be easily satisfied, since all we need is to guarantee that the sequence $\{\sqrt{c_{\mathcal{A}}(k, i_k)}\}_{k \geq 0}$ converges to 0 fast enough so that its sum is finite. Since $c_{\mathcal{A}}(k, i)$ converges to 0 as $i \to \infty$, independently of the specific structure of $c_{\mathcal{A}}(k, i)$, it is enough to take a sequence $\{i_k\}_{k \in \mathbb{N}}$ which grows fast enough to obtain the summability. For example, if we set $i_k = s + 2^{\lfloor (k+1)/r \rfloor}$ for some fixed integers $s$ and $r$, we easily obtain the desired requirement.

It should be noted that since the NFA is used within the framework of NAM, we can rely on Assumption A to bound the strong convexity parameters $\sigma_k$ and the Lipschitz constants $L_k$ by $\bar{\sigma}$ and $\bar{L}$, respectively. For example, in the case of AG and V-AG, the number of inner iterations $i_k$ that satisfies (a) can also be determined in terms of $\kappa = \bar{L}/\bar{\sigma}$ instead of $\kappa_k$. Since $\kappa$ is fixed for any $k \in \mathbb{N}$, taking $i_k = s + 2^{\lfloor (k+1)/r \rfloor}$ (as described above) guarantees that both (a) and (b) are satisfied simultaneously, and there is no need to calculate the value of $\kappa$.
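The doubling schedule above is straightforward to implement; the sketch below (with the illustrative choice $L_k \equiv 1$ in the AG coefficient) reproduces the inner iteration counts reported later in Section 5.3 and numerically illustrates the summability of $\sqrt{c_{\mathcal{A}}(k, i_k)}$:

```python
import math

def inner_iters(k, s=10, r=10):
    # Schedule from the text: i_k = s + 2^floor((k + 1) / r).
    return s + 2 ** ((k + 1) // r)

# With s = r = 10 this gives i_0 = 11, i_9 = 12, i_19 = 14, i_29 = 18, ...
counts = [inner_iters(k) for k in (0, 9, 19, 29)]
print(counts)  # [11, 12, 14, 18]

# For AG, c_A(k, i) = 2 sqrt(L_k) / (i + 1); taking L_k = 1 for illustration,
# the geometric growth of i_k makes sqrt(c_A(k, i_k)) summable, so the
# partial sums stabilize quickly.
partial = sum(math.sqrt(2.0 / (inner_iters(k) + 1.0)) for k in range(5000))
```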

    5 Numerical Experiments

In this section we give a numerical example to show the advantage of using the NAM framework relative to existing methods on the challenging non-convex problem of Regularized Structured Total Least Squares (RSTLS). Unlike One Iteration Approximation based methods, NAM allows several iterations in order to approximate the solution of each sub-problem. In the following example we show that these additional inner iterations at each outer iteration lead to a superior algorithm compared to existing methods, both in terms of the function values that NAM converges to and in terms of the relative error of the obtained solution.

    5.1 Regularized Structured Total Least Squares (RSTLS)

Following the recent paper [6], we consider the class of RSTLS problems, which can be formulated as the following non-convex and possibly non-smooth optimization problem:
$$\min_{x \in \mathbb{R}^p,\, u \in \mathbb{R}^m} F(x, u) \equiv \sigma_w^2 f(x) + \left\|\left(A + \sum_{i=1}^{m} u_i A_i\right)x - b\right\|^2 + \frac{\sigma_w^2}{\sigma_e^2}\|u\|^2, \quad (26)$$
where $A, A_1, A_2, \ldots, A_m \in \mathbb{R}^{n \times p}$ and $b \in \mathbb{R}^n$ are given and contain noise, $\sigma_w$ and $\sigma_e$ are the standard deviations of the noise components, respectively, and $f : \mathbb{R}^p \to (-\infty, \infty]$ is a regularizing function. For more details about the problem formulation given in (26) and its

applications in image processing see [6] and references therein. For the RSTLS model to be compatible with the NAM framework, we choose the function $f$ to be $\lambda\|x\|^2$ for some regularization parameter $\lambda > 0$. This yields a non-convex formulation which is proper, lower semi-continuous, bounded from below by 0 and satisfies the KL property (all these properties can be easily verified in this case, since the function $F$ is a quartic polynomial). Under the NAM framework we can write $F(x, u) = G(x, u) + H(u)$, where
$$G(x, u) = \sigma_w^2 f(x) + \tilde{G}(x, u) \equiv \lambda\sigma_w^2\|x\|^2 + \left\|\left(A + \sum_{i=1}^{m} u_i A_i\right)x - b\right\|^2, \qquad H(u) = \frac{\sigma_w^2}{\sigma_e^2}\|u\|^2.$$
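The split $F = G + H$ can be sketched directly; the following minimal NumPy evaluation of the RSTLS objective uses a tiny synthetic instance (all data below are hypothetical, generated at random purely for illustration):

```python
import numpy as np

def rstls_objective(x, u, A, A_list, b, lam, sigma_w, sigma_e):
    """Evaluate F(x, u) = G(x, u) + H(u) for the RSTLS split in the text."""
    Au = A + sum(ui * Ai for ui, Ai in zip(u, A_list))
    G = lam * sigma_w**2 * np.dot(x, x) + np.linalg.norm(Au @ x - b) ** 2
    H = (sigma_w**2 / sigma_e**2) * np.dot(u, u)
    return G + H

# Tiny synthetic instance (hypothetical data, not from the experiments).
rng = np.random.default_rng(0)
n, p, m = 6, 4, 2
A = rng.standard_normal((n, p))
A_list = [rng.standard_normal((n, p)) for _ in range(m)]
b = rng.standard_normal(n)
x, u = rng.standard_normal(p), rng.standard_normal(m)
val = rstls_objective(x, u, A, A_list, b, lam=0.02, sigma_w=1e-4, sigma_e=1e-3)
```

Since all three terms are non-negative for $\lambda > 0$, the objective is bounded from below by 0, as stated above.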

We mention that in the general NAM framework, the function $H$ is allowed to be non-smooth, which is not the case in this specific application. In addition, in this application the function $u \mapsto F(x, u)$, for a fixed $x \in \mathbb{R}^p$, is a strongly convex quadratic with a very small $m$, and thus can be minimized exactly, as also done in [6] (see more details below).

We also point out that the NAM algorithm has three blocks, while our application has only two. Therefore, we can either consider the $y$-block in NAM to vanish, or alternatively treat the $u$-block of RSTLS as the $y$-block in NAM, in which case the $u$-block in NAM vanishes. The difference between these two approaches is that the $u$-block in NAM is solved exactly, while the $y$-block in NAM can be solved approximately. In this work we choose to solve each such sub-problem exactly, which is possible due to the low dimension of $u$; nevertheless, the NAM framework allows us to solve it approximately instead.

In order to apply Theorem 2 to formulation (26) we need to show that Assumption A is satisfied for the RSTLS problem (without item (ii), since we have no $y$-block here). To this end, we notice that the function $x \mapsto F(x, u)$ is strongly convex, with Hessian matrix given by
$$\nabla_x^2 F(x, u) = 2\lambda\sigma_w^2 I_{p \times p} + 2\left(A + \sum_{i=1}^{m} u_i A_i\right)^T\left(A + \sum_{i=1}^{m} u_i A_i\right).$$
Therefore, we derive that $\bar{\sigma} = 2\lambda\sigma_w^2$ is a positive lower bound on the strong convexity parameter of the function $x \mapsto F(x, u)$ for all $u \in \mathbb{R}^m$. In addition, it is easy to check that

the partial gradient $\nabla_x G(x, u)$ is Lipschitz continuous with constant
$$L(u) = 2\lambda\sigma_w^2 + 2\left\|A + \sum_{i=1}^{m} u_i A_i\right\|^2. \quad (27)$$
Therefore, we see that $L(u)$ is bounded on compact sets by some $L > 0$, and thus Assumption A(i) is satisfied. Since $G(x, u)$ is a quadratic polynomial function, we easily derive Assumption A(iii).

    5.2 Description of the Data and Methods

We follow the experiment of [6] and utilize the RSTLS formulation (26) in order to reconstruct a given blurred image. We consider the vector $b$ as the vectorized observed blurred image. The vector $b$ is obtained from the real image, denoted $x_{\mathrm{real}}$, by first scaling the pixels of the real image to be between 0 and 1. The scaled image is then passed through a Gaussian blur point spread function (PSF) of size $q \times q$ and standard deviation $\gamma$ (see [15, Chapter 3]), and then noise drawn from a Gaussian distribution with parameters $(0, \sigma_w)$ is added to each coordinate.

The unknown blurring operator is constructed from the original PSF by adding to it a $q \times q$ matrix (before padding) with the same structure as the original PSF. Noise is added to each of the structure matrices $A_i$ by the update rule $A_i := \mu A_i$, where $\mu \sim \mathrm{Gauss}(0, \sigma_e)$. We use periodic boundary conditions, resulting in a block circulant with circulant blocks (BCCB) blurring matrix.

The authors of [6] solve the RSTLS problem with a One Iteration Approximation based algorithm called SemiProximal-Alternating (SPA). It was proven that SPA converges globally to a critical point of $F(x, u)$. In order to present the algorithm, we need to recall the definition of the proximal mapping of a function $\psi : \mathbb{R}^d \to (-\infty, \infty]$:
$$\mathrm{prox}_{\psi}(z) = \underset{v \in \mathbb{R}^d}{\operatorname{argmin}}\left\{\psi(v) + \frac{1}{2}\|v - z\|^2\right\} \quad \text{for any } z \in \mathbb{R}^d.$$
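For the squared-norm regularizer used here, the proximal mapping has a simple closed form, which a minimal sketch can verify numerically (the function and values below are illustrative only):

```python
import numpy as np

def prox_sq_norm(z, c):
    # prox of psi(v) = c * ||v||^2:  argmin_v { c ||v||^2 + (1/2) ||v - z||^2 }.
    # First-order optimality 2 c v + (v - z) = 0 gives v = z / (1 + 2 c).
    return z / (1.0 + 2.0 * c)

z = np.array([3.0, -1.0])
v = prox_sq_norm(z, c=0.5)  # -> z / 2 = [1.5, -0.5]

# Check optimality: the gradient of the prox objective vanishes at v.
grad = 2.0 * 0.5 * v + (v - z)
```

This scaling-by-a-constant structure is exactly what yields the explicit SPA x-update stated below in (31).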

Algorithm 2 SPA Algorithm for RSTLS
1: Initialization: $(x^0, u^0) \in \mathbb{R}^p \times \mathbb{R}^m$.
2: Iterative step:
3: for $k \geq 0$ do
4:   Pick $L_k$ (a Lipschitz constant of the gradient of $x \mapsto \tilde{G}(x, u^k)$) and update
$$x^{k+1} = \mathrm{prox}_{\frac{\sigma_w^2}{L_k} f}\left(x^k - \frac{1}{L_k}\nabla_x \tilde{G}(x^k, u^k)\right). \quad (28)$$
5:   Update
$$u^{k+1} = \operatorname{argmin}\left\{F(x^{k+1}, u) : u \in \mathbb{R}^m\right\}. \quad (29)$$
6: end for

Since the function $u \mapsto F(x^{k+1}, u)$ is strongly convex, the unique minimizer in (29) can be obtained from the first-order optimality condition $\nabla_u F(x^{k+1}, u^{k+1}) = 0$, and is given by the formula
$$u^{k+1} = -\left(\frac{\sigma_w^2}{\sigma_e^2} I_{m \times m} + B(x^{k+1})^T B(x^{k+1})\right)^{-1} B(x^{k+1})^T\left(Ax^{k+1} - b\right), \quad (30)$$

where $B(x) = [A_1 x \;\; A_2 x \;\; \ldots \;\; A_m x]$. Since, in practice, $m$ is small, this system of linear equations can be solved efficiently. As for the update rule in (28), since the function $f$ is set to be $\lambda\|x\|^2$, simple calculations show that an explicit update rule is given by the formula
$$x^{k+1} = \frac{L_k}{L_k + 2\lambda\sigma_w^2}\left(x^k - \frac{1}{L_k}\nabla_x \tilde{G}(x^k, u^k)\right). \quad (31)$$
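One full SPA iteration, combining the explicit x-update (31) with the exact u-update (30) and taking $L_k$ as in (32) below, can be sketched as follows (the data layout, dense matrices, and instance sizes are hypothetical; the real experiments use structured BCCB operators):

```python
import numpy as np

def spa_step(x, u, A, A_list, b, lam, sigma_w, sigma_e):
    """One SPA iteration: explicit x-update (31), then exact u-update (30)."""
    Au = A + sum(ui * Ai for ui, Ai in zip(u, A_list))
    Lk = 2.0 * np.linalg.norm(Au, 2) ** 2               # tight constant, as in (32)
    grad = 2.0 * Au.T @ (Au @ x - b)                    # gradient of x -> ||Au x - b||^2
    x_new = (Lk / (Lk + 2.0 * lam * sigma_w**2)) * (x - grad / Lk)   # (31)
    B = np.column_stack([Ai @ x_new for Ai in A_list])  # B(x) = [A_1 x ... A_m x]
    m = len(A_list)
    M = (sigma_w**2 / sigma_e**2) * np.eye(m) + B.T @ B
    u_new = -np.linalg.solve(M, B.T @ (A @ x_new - b))  # (30), small m x m solve
    return x_new, u_new
```

Since $m$ is small, the $m \times m$ solve in the u-update is cheap, matching the remark above about (30).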

Obtaining an upper bound on the Lipschitz constant $L_k$, $k \in \mathbb{N}$, can be a computationally challenging task. However, since we know that the blurring matrix is BCCB, its eigenvalues can be calculated efficiently (see [15, Section 4.2.1]). Therefore, we use in this paper the tight Lipschitz constant given by
$$L_k = 2\left\|A + \sum_{i=1}^{m} u_i^k A_i\right\|^2. \quad (32)$$
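The BCCB structure makes (32) cheap to evaluate: the eigenvalues of a BCCB matrix are the 2-D DFT of its generating kernel, so the spectral norm never requires forming the matrix. A toy-sized sketch (the $4 \times 4$ grid and kernel values are illustrative only) compares the FFT shortcut against a direct computation:

```python
import numpy as np

def bccb_sq_norm(kernel):
    """Squared spectral norm of the BCCB matrix generated by `kernel`
    (its first column, reshaped to the image grid). The eigenvalues of a
    BCCB matrix are the 2-D DFT of that kernel, so no matrix is formed."""
    return np.max(np.abs(np.fft.fft2(kernel))) ** 2

# Sanity check on a tiny 4x4 grid: build the BCCB matrix explicitly by
# circularly shifting the kernel, and compare spectral norms.
kernel = np.zeros((4, 4))
kernel[0, 0], kernel[0, 1], kernel[1, 0] = 0.5, 0.25, 0.25
n = kernel.size
C = np.empty((n, n))
for j in range(n):
    r, c = divmod(j, 4)
    C[:, j] = np.roll(np.roll(kernel, r, axis=0), c, axis=1).ravel()

L_fft = 2.0 * bccb_sq_norm(kernel)             # FFT shortcut for (32)
L_direct = 2.0 * np.linalg.norm(C, 2) ** 2     # dense reference
```

Because a BCCB matrix is normal, its spectral norm equals its largest eigenvalue modulus, which is what the FFT shortcut exploits.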

Now, we would like to compare the SPA algorithm with our NAM method equipped with inner iterations of the AG algorithm (which was proved in Section 4 to be NFA). This method applies several iterations of AG to the function $x \mapsto G(x, u^k)$, which requires finding the Lipschitz constant of the gradient of this function. From (27) we see that we can use the constant $2\lambda\sigma_w^2 + L_k$, where $L_k$ is defined in (32).

Algorithm 3 NAM with AG for RSTLS
1: Initialization: $(x^{-1}, u^0) \in \mathbb{R}^p \times \mathbb{R}^m$.
2: Iterative step:
3: for $k \geq 0$ do
4:   Set $w^{k-1,0} = x^{k-1}$, $t_0 = 1$ and pick $L_k \geq L_{\nabla_x \tilde{G}}(u^k)$.
5:   for $i = 0, 1, 2, \ldots, i_{k-1} - 1$ do
6:     Update
$$x^{k-1,i+1} = w^{k-1,i} - \frac{1}{2\lambda\sigma_w^2 + L_k}\nabla_x F\left(w^{k-1,i}, u^k\right). \quad (33)$$
7:     Set $t_{i+1} = \frac{1 + \sqrt{1 + 4t_i^2}}{2}$.
8:     Update
$$w^{k-1,i+1} = x^{k-1,i+1} + \frac{t_i - 1}{t_{i+1}}\left(x^{k-1,i+1} - x^{k-1,i}\right). \quad (34)$$
9:   end for
10:  Set $x^k = x^{k-1,i_{k-1}}$.
11:  Update
$$u^{k+1} = \operatorname{argmin}\left\{F(x^k, u) : u \in \mathbb{R}^m\right\}. \quad (35)$$
12: end for
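The inner loop (33)-(34) is a standard accelerated gradient (FISTA-style) sweep and can be sketched generically; `grad_F` and `step_L` below are placeholders for the partial gradient $\nabla_x F(\cdot, u^k)$ and the constant $2\lambda\sigma_w^2 + L_k$:

```python
import numpy as np

def ag_inner_loop(x_prev, u, grad_F, step_L, num_iters):
    """Inner AG iterations (33)-(34) of Algorithm 3: accelerated gradient
    steps on x -> F(x, u) with fixed stepsize 1 / step_L.
    `grad_F(w, u)` is assumed to return the partial gradient in x."""
    w = x_prev.copy()
    x = x_prev.copy()
    t = 1.0
    for _ in range(num_iters):
        x_new = w - grad_F(w, u) / step_L                  # gradient step (33)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0   # momentum parameter
        w = x_new + ((t - 1.0) / t_new) * (x_new - x)      # extrapolation (34)
        x, t = x_new, t_new
    return x

# Illustrative run on a toy quadratic: F(x) = (1/2)||x - c||^2, so the
# inner loop should drive x toward c (grad_F ignores u here).
c = np.array([1.0, -2.0])
x_out = ag_inner_loop(np.zeros(2), None, lambda w, u: w - c, 1.0, 20)
```

In Algorithm 3 the output of this loop becomes $x^k$ before the exact u-update (35).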

We mention that since the blurring matrix is BCCB, we can efficiently compute the strong convexity parameters $\sigma_k$ of the functions $x \mapsto F(x, u^k)$ for any $k \geq 0$. Therefore, we can use V-AG instead of AG as the NFA in Algorithm 3. Moreover, following the arguments in Section 4, we actually know an explicit minimum number of inner iterations $i_k$ for both AG and V-AG. However, the condition number $\kappa_k$ increases as $\sigma_w$ decreases, giving very large $\kappa_k$. Two issues arise when the value of $\kappa_k$ is very large, which limits its applicability here: (i) the number of inner iterations $i_k$ increases significantly, even for small $k$; (ii) the stepsize in V-AG, which depends on $\kappa_k$, increases significantly. Therefore, we use the AG method instead of V-AG. To overcome issue (i), we simply follow the arguments of Section 4 and choose $i_k = s + 2^{\lfloor (k+1)/r \rfloor}$ for some fixed integers $s$ and $r$. This selection guarantees the conditions for global convergence while keeping a relatively low number of inner iterations when $k$ is small. Additionally, this selection does not require the calculation of $\kappa$.

    5.3 Results

We run the SPA algorithm and NAM with inner AG iterations on a blurred version of the $256 \times 256$ pixels cameraman test image taken from the MATLAB Image Processing Toolbox available in [1]. In order to blur the original image, we used a Gaussian PSF of size $5 \times 5$ with a standard deviation of 4. See figures (a) and (b) below for the original and the blurred image.

(a) Original image (b) Blurred image

We let both methods run for $N = 3000$ iterations, where for NAM the iteration counter also counts inner iterations (for example, if $i_1 = 100$ and $i_2 = 200$, then the iteration counter after two iterations is $N = 300$). In all experiments we set $\lambda = 0.02$, $\sigma_w = 10^{-4}$ and $\sigma_e = 10^{-3}$. The initial point for both algorithms is set to be the original blurred image, and the variable $u$ is initialized as the zero vector.

As for the NAM with AG algorithm, we always set $r = 10$, which lets us control the number of inner AG iterations while also establishing global convergence. For example, from the formula $i_k = s + 2^{\lfloor (k+1)/r \rfloor}$ we get the following numbers of inner AG iterations for $s = 10$ and $r = 10$: $i_{-1} = 11$, $i_0 = 11$, $i_9 = 12$, $i_{19} = 14$, $i_{29} = 18$ and so forth. Note that when keeping $N$ fixed, increasing $s$ reduces the effect of $r$ on the number of inner iterations.

All experiments were run on an Intel(R) Core(TM) i5-6400 CPU @ 2.70GHz with 8.00GB RAM, Windows 7 Professional 64-bit, using MATLAB 2020a.

We compare in Figure 2 and Figure 3 the two methods by the achieved function value and by the relative error at each iteration, calculated by
$$\text{Relative Error} = \frac{\left\|x^i - x_{\mathrm{real}}\right\|}{\left\|x_{\mathrm{real}}\right\|},$$
where $x^i$ is the image obtained in the $i$-th iteration.

Figure 2: Function value at each iteration for $N = 3000$ on a logarithmic scale. The NAM with AG method was run for $s = 1, 10, 50, 100$ and $200$.

Figure 3: Relative error at each iteration for $N = 3000$ on a logarithmic scale. The NAM with AG method was run for $s = 1, 10, 50, 100$ and $200$.

In Figure 2 and Figure 3 we see that after 10 iterations, NAM with AG for $s = 10, 50, 100, 200$ is superior to SPA in terms of both function value and relative error. We also see that for NAM with AG, increasing the number of inner iterations increases the accuracy of the output, indicating that, at least for the RSTLS problem, it is better to use methods that are not based on One Iteration Approximation.

The following figures show the output of each of the methods after $N = 3000$ iterations.

(a) SPA Algorithm (b) NAM with AG, $s = 1$
(c) NAM with AG, $s = 10$ (d) NAM with AG, $s = 50$
(e) NAM with AG, $s = 100$ (f) NAM with AG, $s = 200$

To conclude, in this paper we provided a general recipe which serves as a basis for proving global convergence of algorithms that tackle non-convex optimization problems. We have generalized the methodology of [10] to allow errors in the conditions. We utilized the new methodology in order to establish global convergence of Nested Alternating Minimization methods that incorporate a Nested Friendly Algorithm. To the best of our knowledge, this is the first result that proves global convergence of general nested algorithms in the non-convex setting. Moreover, to the best of our knowledge, this is the first methodology which can be used to derive global convergence of methods which are not necessarily sufficient descent methods. In future work, it would be interesting to generalize the NAM framework to also include constraints or non-smooth regularizers in the blocks $x$ and $y$. Another challenging task is to extend the framework of NAM to functions which are not necessarily strongly convex in the $x$ and $y$ blocks.

References

[1] Image Processing Toolbox. The MathWorks Inc., Natick, Massachusetts, United States. Online: mathworks.com/products/image, 2020.

[2] H. Attouch and J. Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1-2):5-16, 2009.

[3] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438-457, 2010.

[4] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2):91-129, 2013.

[5] A. Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.

[6] A. Beck, S. Sabach, and M. Teboulle. An alternating semiproximal method for nonconvex regularized structured total least squares problems. SIAM Journal on Matrix Analysis and Applications, 37(3):1129-1150, 2016.

[7] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419-2434, 2009.

[8] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

[9] J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205-1223, 2007.

[10] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459-494, 2014.

[11] J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd. First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM Journal on Optimization, 28(3):2131-2151, 2018.

[12] B. L. Gorissen, İ. Yanıkoğlu, and D. den Hertog. A practical guide to robust optimization. Omega, 53:124-137, 2015.

[13] P. J. Groenen, M. van de Velden, et al. Multidimensional scaling by majorization: A review. Journal of Statistical Software, 73(8):1-26, 2016.

[14] W. J. Gutjahr and A. Pichler. Stochastic multi-objective optimization: a survey on non-scalarizing methods. Annals of Operations Research, 236(2):475-499, 2016.

[15] P. C. Hansen, J. G. Nagy, and D. P. O'Leary. Deblurring Images: Matrices, Spectra, and Filtering. SIAM, 2006.

[16] R. Hesse, D. R. Luke, S. Sabach, and M. K. Tam. Proximal heterogeneous block implicit-explicit method and application to blind ptychographic diffraction imaging. SIAM Journal on Imaging Sciences, 8(1):426-457, 2015.

[17] P. Jain, P. Kar, et al. Non-convex optimization for machine learning. Foundations and Trends in Machine Learning, 10(3-4):142-336, 2017.

[18] K. Kurdyka. On gradients of functions definable in o-minimal structures. Annales de l'Institut Fourier (Grenoble), 48(3):769-783, 1998.

[19] S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. In Les Équations aux Dérivées Partielles (Paris, 1962), pages 87-89. Éditions du Centre National de la Recherche Scientifique, Paris, 1963.

[20] F. G. Mohammadi, M. H. Amini, and H. R. Arabnia. Evolutionary computation, optimization, and learning algorithms for data science. In Optimization, Learning, and Control for Interdependent Complex Networks, pages 37-65. Springer, 2020.

[21] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate $O(1/k^2)$. Dokl. Akad. Nauk SSSR, 269(3):543-547, 1983.

[22] P. Ochs. Unifying abstract inexact convergence theorems and block coordinate variable metric iPiano. SIAM Journal on Optimization, 29(1):541-570, 2019.

[23] N. Piovesan and T. Erseghe. Cooperative localization in WSNs: A hybrid convex/nonconvex solution. IEEE Transactions on Signal and Information Processing over Networks, 4(1):162-172, 2018.

[24] T. Pock and S. Sabach. Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems. SIAM Journal on Imaging Sciences, 9(4):1756-1787, 2016.

[25] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

[26] S. Sabach, M. Teboulle, and S. Voldman. A smoothing alternating minimization-based algorithm for clustering with sum-min of Euclidean norms. Pure and Applied Functional Analysis, 3:653-679, 2018.

[27] C. Soares, J. Xavier, and J. Gomes. Simple and fast convex relaxation method for cooperative localization in sensor networks using range measurements. IEEE Transactions on Signal Processing, 63(17):4532-4543, 2015.

[28] J. Wang. Geometric Structure of High-Dimensional Data and Dimensionality Reduction. Springer, 2012.

[29] F. Wen, L. Chu, P. Liu, and R. C. Qiu. A survey on nonconvex regularization-based sparse and low-rank recovery in signal processing, statistics, and machine learning. IEEE Access, 6:69883-69906, 2018.