Convergent Nested Alternating Minimization
Algorithms for Non-Convex Optimization Problems
Eyal Gur, Shoham Sabach∗, Shimrit Shtern†
October 1, 2020
Abstract
We introduce a new algorithmic framework for solving non-convex optimization
problems, called Nested Alternating Minimization, which aims at combining the
classical Alternating Minimization technique with inner iterations of any optimization
method. We provide a global convergence analysis of the new algorithmic framework
to critical points of the problem at hand, which, to the best of our knowledge, is the
first of its kind for nested methods in the non-convex setting. Central to our global
convergence analysis is a new extension of classical proof techniques to the non-convex
setting that allows for errors in the conditions. The power of our framework is illustrated
with numerical experiments that show the superiority of this algorithmic
framework over existing methods.
1 Introduction
In recent years non-convex and non-smooth optimization problems have gained a lot of
interest in numerous applied fields such as machine learning [17], signal processing [29, 23],
∗This research was partially supported by the D.F.G German Research Foundation 800240.
†Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology, Technion City, Haifa 3200003, Israel (e-mail: [email protected], [email protected], shim-
data science [13, 20], operations research [14, 12] and more. While the theoretical results in
the convex setting (such as convergence analysis and rates of convergence) are wide (see, for
instance, the recent book [5] and references therein), the tools and results available regarding
the non-convex setting are very limited, making this setting much more challenging. An
important task in the non-convex problem domain, which has already gained success in
several settings, is to establish global convergence of algorithms to critical points (see more
detail in Section 2). In this paper, we are motivated by the analysis and application of
a large and important class of algorithms, which we call Nested Alternating Minimization
and will be precisely defined in Section 3. Our contribution consists of providing a new
general proof procedure that can be applied to obtain global convergence results for this
class of algorithms. As far as we know, this is the first theoretical guarantee for this class of
algorithms in the non-convex setting.
1.1 Global Convergence
Proving global convergence of an algorithm to a critical point involves showing that for any
starting point the entire sequence generated by the algorithm converges to a single critical
point of the optimization problem at hand. In recent years, this task received a lot of
attention starting with the works [2, 3, 4] and later with the work [10], which formulates a
simple proof technique to obtain global convergence results of algorithms that satisfy three
conditions in addition to the KL property of the objective function (see [19, 18, 9]). These
works pave the way to many modifications and extensions that have been, and are still being
developed (e.g., [26, 22] and references therein). In this work we propose another proof
technique to obtain global convergence results in the non-convex setting. This extension
allows more flexibility in accommodating the conditions of previously proposed recipes in
the sense that we propose a relaxed variant of the first two conditions by adding some
non-negative error terms (our recipe is a further relaxation of the recipe suggested in [22],
which only relaxes the second condition). The first condition consists of a sufficient decrease
requirement of the algorithm with respect to all variables (see, for instance, [10]) or part of
them (see, for instance, [26]), where the sufficient decrease can be measured in terms of the
original objective function as in [10] or a related Lyapunov function as in [24]. The second
condition consists of bounding the iterates gap from below by the function’s sub-gradient
norm. As mentioned, we relax these conditions by allowing for some error terms.
In Section 2, following the approach introduced in [10] and recently summarized in [11], we
develop the extended proof methodology in detail. Even though the techniques used to derive
the new methodology are not new and are based on [10, 11], the newly developed technique
allows us to easily analyze methods that are not necessarily sufficient-descent methods (due to the relaxation
of the first condition), which in turn opens the gate for obtaining global convergence results
of the very important and challenging class of Nested Alternating Minimization algorithms.
1.2 Nested Alternating Minimization Algorithms
The celebrated Alternating Minimization (AM) methodology is a technique for handling
complicated optimization problems that has regained huge popularity in the last decade
in the context of non-convex optimization. The AM methodology splits an optimization
problem into smaller problems, which would hopefully be easier to solve in a closed form.
Unfortunately, in many cases, especially in the non-convex setting, even the resulting sub-
problems remain too complicated or too large to be solved explicitly. This led to a stream
of papers in recent years that suggest to approximate the solution of the sub-problems by
computing one iteration of a certain optimization algorithm (see, for instance, [16, 24, 11,
26]). This approach of One Iteration Approximation (OIA) led, in many cases, to algorithms
that enjoy global convergence guarantees using the techniques mentioned in Section 1.1.
However, the idea of OIA is limited to algorithms with a certain sufficient descent property
(see our discussion above). Motivated by the idea of approximating the solution of the
sub-problems using methods that lack the sufficient decrease property (such as FISTA [8] and MFISTA [7]), we study
the class of Nested Alternating Minimization (NAM) algorithms, which allow approximating
the solution of the sub-problems using several iterations instead of one, in order to achieve
the relaxed sufficient descent property discussed above.
In Section 3 we propose the general algorithmic framework of NAM for solving non-
convex and non-smooth problems. Our main contribution is to show that this algorithmic
framework, which captures a wide spectrum of algorithms, generates a globally convergent
sequence to critical points. To this end we define the notion of Nested Friendly Algorithms
(NFA), which captures the minimal necessary requirements of the nested algorithm to be
used within NAM in order to achieve the global convergence of NAM. In Section 4 we
illustrate the concept of NFA by presenting several examples that satisfy the requirements
and therefore can be used in NAM. Another central aspect in applying NAM is to determine
the number of inner iterations which is sufficient to guarantee the global convergence. This
is also discussed in Section 4. Finally, in Section 5 we illustrate the advantage of the NAM
framework in tackling the Regularized Structured Total Least Squares (RSTLS) problem via
its application on an image processing task.
2 A Procedure for Global Convergence with Errors
Consider an extended real-valued function F : Rn ×Rm → (−∞,∞] which is bounded from
below and assume that F is proper and lower semi-continuous. We consider the following
two-block minimization problem
$$\min_{(z,u)\in\mathbb{R}^n\times\mathbb{R}^m} F(z,u). \quad (1)$$
Starting with any pair $(z^0,u^0)$, let $\{(z^k,u^k)\}_{k\geq0}$ be a sequence generated by a generic
algorithm, which we denote by $\mathcal{A}$, that aims at tackling Problem (1) in the sense of global
convergence to critical points of the function F . As mentioned above, our goal in this section
is to extend recent proof techniques with a set of relaxed conditions that still guarantee
global convergence (see [10] and also [11] for a recent and concise version). We also use the
same notations and definitions from non-smooth analysis as in these papers. Therefore, for
instance, when we write ∂F we mean the limiting sub-differential (see [25]). Following [10]
we extend the definition of gradient-like descent sequences in the following way.
Definition 1. [Approximate gradient-like descent sequence]. A sequence $\{(z^k,u^k)\}_{k\geq0}$ is
called an approximate gradient-like descent sequence for minimizing the function $F:\mathbb{R}^n\times\mathbb{R}^m\to(-\infty,\infty]$ if the following conditions hold:

(C1) Approximate sufficient decrease property. The sequence $\{F(z^k,u^k)\}_{k\geq0}$ is non-increasing
and there exist a positive scalar $\rho_1>0$ and a non-negative error term $e_1^k\geq0$ such that
$$\rho_1\|z^{k+1}-z^k\|^2 - e_1^k \leq F(z^k,u^k) - F(z^{k+1},u^{k+1}), \quad \forall\, k\geq0.$$

(C2) Approximate sub-gradient lower bound on the iterates gap. There exist a vector $w^{k+1}\in\partial F(z^{k+1},u^{k+1})$, a positive scalar $\rho_2>0$ and a non-negative error term $e_2^k\geq0$ such that
$$\|w^{k+1}\| \leq \rho_2\|z^{k+1}-z^k\| + e_2^k, \quad \forall\, k\geq0.$$

(C3) Continuity. If $(\bar{z},\bar{u})$ is a limit point of some sub-sequence $\{(z^k,u^k)\}_{k\in K\subseteq\mathbb{N}}$, then
$$\limsup_{k\in K\subseteq\mathbb{N}} F(z^k,u^k) \leq F(\bar{z},\bar{u}).$$

(C4) Summability of the errors. The sequences $\{\sqrt{e_1^k}\}_{k\geq0}$ and $\{e_2^k\}_{k\geq0}$ are summable, that is,
$$\sum_{k=1}^{\infty}\sqrt{e_1^k} < \infty \quad \text{and} \quad \sum_{k=1}^{\infty} e_2^k < \infty.$$
By Condition (C4), these errors should be small enough that their infinite sum is
finite. See [22] for a set of conditions which includes errors only in Condition (C2).
We would also like to stress the fact that in Definition 1 the two-block structure
comes into play, since the first two conditions are measured only in terms of the variable z
(see also [26]). This adds another level of flexibility in proving the approximate gradient-like
descent sequence property.
Now, we are in a position to prove that these relaxed conditions are enough to guarantee
global convergence to critical points of $F$. To this end, we denote by $\omega(x^0)$ the set of limit
points of a sequence $\{x^k\}_{k\geq0}$, and for any function $f$ we denote by $\operatorname{crit}(f)$ the set of its
critical points.
Lemma 1. Let $\{(z^k,u^k)\}_{k\geq0}$ be a bounded approximate gradient-like descent sequence for
minimizing $F$ of Problem (1). Then, $\omega(z^0,u^0)$ is a non-empty and compact subset of $\operatorname{crit}(F)$
and
$$\lim_{k\to\infty}\operatorname{dist}\left((z^k,u^k),\omega(z^0,u^0)\right)=0.$$
In addition, the function $F$ is finite and constant on $\omega(z^0,u^0)$.
Proof. Since $\{(z^k,u^k)\}_{k\geq0}$ is bounded, $\omega(z^0,u^0)$ is a non-empty and compact set. Therefore,
it follows that $\operatorname{dist}\left((z^k,u^k),\omega(z^0,u^0)\right)$ converges to $0$ as $k$ goes to infinity. From Condition
(C1) we have
$$\|z^{k+1}-z^k\|^2 \leq \frac{1}{\rho_1}\left(F(z^k,u^k)-F(z^{k+1},u^{k+1})+e_1^k\right). \quad (2)$$
The non-ascent property required in Condition (C1) and the boundedness from below of $F$
yield the convergence of the sequence $\{F(z^k,u^k)\}_{k\geq0}$. This, together with Condition (C4),
implies from (2) that
$$\lim_{k\to\infty}\|z^{k+1}-z^k\|=0. \quad (3)$$
Let $\{(z^k,u^k)\}_{k\in K}$ be a sub-sequence converging to some pair $(z^*,u^*)$. Denote $F^*\equiv F(z^*,u^*)$. From Condition (C3) and the lower semi-continuity of $F$ we derive that
$$\lim_{k\to\infty}F(z^k,u^k)=\lim_{k\in K}F(z^k,u^k)=F^*. \quad (4)$$
From Condition (C2) there exists a sequence $\{w^k\}_{k\geq0}$, for which $w^k\in\partial F(z^k,u^k)$ for all
$k\geq0$, that satisfies
$$\lim_{k\to\infty}\|w^k\| \leq \lim_{k\to\infty}\left(\rho_2\|z^{k+1}-z^k\|+e_2^k\right)=0,$$
where the equality follows from (3) and Condition (C4). This means that the sequence
$\{w^k\}_{k\geq0}$ converges to $0$ as $k\to\infty$, and therefore $(z^*,u^*)\in\operatorname{crit}(F)$, as follows from the
closedness property of the sub-differential (see [25]). Finally, since $\{F(z^k,u^k)\}_{k\geq0}$ converges,
we can assume that $\lim_{k\to\infty}F(z^k,u^k)=c\in\mathbb{R}$. Then $\lim_{k\in K}F(z^k,u^k)=c$, and from (4) we
derive that $F$ is finite and constant over the set $\omega(z^0,u^0)$.
We now prove the main result of this section. It states that if a generic optimization
method A generates a bounded approximate gradient-like descent sequence for minimizing
F of Problem (1), then it globally converges to a critical point of F . It should be noted that
this unique limit point is dependent on the initialization point of A.
Theorem 1. [Global convergence]. Let $\{(z^k,u^k)\}_{k\geq0}$ be a bounded approximate gradient-like
descent sequence for minimizing $F$ of Problem (1). If $F$ satisfies the KL property, then
the sequence $\{z^k\}_{k\geq0}$ has finite length and globally converges to a point $z^*$. Moreover, let
$u^*$ be any limit point of the sequence $\{u^k\}_{k\geq0}$; then $(z^*,u^*)\in\operatorname{crit}(F)$.
Proof. Let $(z^*,u^*)$ be a limit point of the sequence $\{(z^k,u^k)\}_{k\geq0}$, which exists due to the
boundedness assumption. From Lemma 1 (see (4)) we derive
$$\lim_{k\to\infty}F(z^k,u^k)=F(z^*,u^*).$$
Assume first that there exists $\tilde{k}\in\mathbb{N}$ such that $F(z^k,u^k)=F(z^*,u^*)$ and $e_1^k=0$ for all $k\geq\tilde{k}$.
Then, from Condition (C1), the sequence $\{F(z^k,u^k)\}_{k\geq0}$ has a "regular" sufficient decrease
property and therefore $z^{k+1}=z^k$ for all $k\geq\tilde{k}$. This proves finite convergence of the
sequence $\{z^k\}_{k\geq0}$, and obviously the desired result.
Now, on the other hand, assume that $F(z^k,u^k)>F(z^*,u^*)$ for all $k\geq0$. We know
from Condition (C1) that the sequence $\{F(z^k,u^k)\}_{k\geq0}$ is non-increasing and converges to
$F(z^*,u^*)$. Thus, for any $\gamma>0$ there exists an integer $k_0$ such that $F(z^k,u^k)<F(z^*,u^*)+\gamma$
for all $k>k_0$. From Lemma 1,
$$\lim_{k\to\infty}\operatorname{dist}\left((z^k,u^k),\omega(z^0,u^0)\right)=0.$$
So, for any $\varepsilon>0$ there exists an integer $k_1$ such that $\operatorname{dist}\left((z^k,u^k),\omega(z^0,u^0)\right)<\varepsilon$ for any $k>k_1$. Again from Lemma 1, the function $F$ is finite and constant on the non-empty and compact
set $\omega(z^0,u^0)$. Since $F(z^k,u^k)-F(z^*,u^*)>0$, it follows that $\varphi'\left(F(z^k,u^k)-F(z^*,u^*)\right)$ is
well-defined, where $\varphi$ is any desingularizing function (see [9]).
From the Uniformized KL property [10, Lemma 6] there exists a desingularizing function
$\varphi$ such that for any $k>s\equiv\max\{k_0,k_1\}+1$ we have
$$\varphi'\left(F(z^k,u^k)-F(z^*,u^*)\right)\cdot\operatorname{dist}\left(0,\partial F(z^k,u^k)\right)\geq1. \quad (5)$$
Since $\operatorname{dist}\left(0,\partial F(z^k,u^k)\right)\leq\|w^k\|$ for any $w^k\in\partial F(z^k,u^k)$, it follows from Condition (C2)
and (5) that
$$\varphi'\left(F(z^k,u^k)-F(z^*,u^*)\right)\geq\frac{1}{\rho_2\|z^k-z^{k-1}\|+e_2^{k-1}}. \quad (6)$$
For $n_1,n_2\in\mathbb{N}$ we denote
$$\Delta_{n_1,n_2}\equiv\varphi\left(F(z^{n_1},u^{n_1})-F(z^*,u^*)\right)-\varphi\left(F(z^{n_2},u^{n_2})-F(z^*,u^*)\right).$$
Since $\varphi$ is concave, it follows from the gradient inequality that
$$\Delta_{k,k+1}\geq\varphi'\left(F(z^k,u^k)-F(z^*,u^*)\right)\cdot\left(F(z^k,u^k)-F(z^{k+1},u^{k+1})\right). \quad (7)$$
Combining (6) and (7) with Condition (C1) we establish
$$\Delta_{k,k+1}\geq\frac{\rho_1\left(\|z^{k+1}-z^k\|^2-e_1^k/\rho_1\right)}{\rho_2\left(\|z^k-z^{k-1}\|+e_2^{k-1}/\rho_2\right)}. \quad (8)$$
Denote $c=\rho_2/\rho_1$, $\tilde{e}_1^k:=e_1^k/\rho_1$ and $\tilde{e}_2^k:=e_2^k/\rho_2$. We obtain from (8) that
$$\sqrt{c\Delta_{k,k+1}\left(\|z^k-z^{k-1}\|+\tilde{e}_2^{k-1}\right)+\tilde{e}_1^k}\geq\|z^{k+1}-z^k\|. \quad (9)$$
Now, since $c\Delta_{k,k+1}\left(\|z^k-z^{k-1}\|+\tilde{e}_2^{k-1}\right)\geq0$ and $\tilde{e}_1^k\geq0$, we obtain from (9) and the
geometric-arithmetic mean inequality that
$$\|z^{k+1}-z^k\|\leq\sqrt{c\Delta_{k,k+1}\left(\|z^k-z^{k-1}\|+\tilde{e}_2^{k-1}\right)}+\sqrt{\tilde{e}_1^k}\leq\frac{c}{2}\Delta_{k,k+1}+\frac{1}{2}\|z^k-z^{k-1}\|+\frac{1}{2}\tilde{e}_2^{k-1}+\sqrt{\tilde{e}_1^k}. \quad (10)$$
Summing up inequalities (10) for $r=s+1,\ldots,k$ we derive that
$$2\sum_{r=s+1}^{k}\|z^{r+1}-z^r\|\leq\sum_{r=s+1}^{k}\|z^r-z^{r-1}\|+c\sum_{r=s+1}^{k}\Delta_{r,r+1}+\sum_{r=s+1}^{k}\left(2\sqrt{\tilde{e}_1^r}+\tilde{e}_2^{r-1}\right)$$
$$\leq\sum_{r=s+1}^{k}\|z^{r+1}-z^r\|+\|z^{s+1}-z^s\|+c\sum_{r=s+1}^{k}\Delta_{r,r+1}+\sum_{r=s+1}^{k}\left(2\sqrt{\tilde{e}_1^r}+\tilde{e}_2^{r-1}\right)$$
$$=\sum_{r=s+1}^{k}\|z^{r+1}-z^r\|+\|z^{s+1}-z^s\|+c\Delta_{s+1,k+1}+\sum_{r=s+1}^{k}\left(2\sqrt{\tilde{e}_1^r}+\tilde{e}_2^{r-1}\right), \quad (11)$$
where the last equality follows from the definition of $\Delta_{r,r+1}$. We now see that
$$\sum_{r=s+1}^{k}\|z^{r+1}-z^r\|\leq\|z^{s+1}-z^s\|+\sum_{r=s+1}^{k}\left(2\sqrt{\tilde{e}_1^r}+\tilde{e}_2^{r-1}\right)+c\Delta_{s+1,k+1}$$
$$\leq\|z^{s+1}-z^s\|+\sum_{r=s+1}^{k}\left(2\sqrt{\tilde{e}_1^r}+\tilde{e}_2^{r-1}\right)+c\varphi\left(F(z^{s+1},u^{s+1})-F(z^*,u^*)\right), \quad (12)$$
where the last inequality follows from the definition of $\Delta_{s+1,k+1}$ and the fact that $\varphi\geq0$.
Due to Condition (C4), the infinite sum of errors in (12) is also finite, and therefore we finally
derive that $\sum_{r=1}^{\infty}\|z^{r+1}-z^r\|<\infty$. Hence, the sequence $\{z^k\}_{k\geq0}$ is of finite length, and it
converges globally to some $z^*$. Last, from Lemma 1 it follows that $(z^*,u^*)$ is a critical point
of the function $F$.
3 Global Convergence of Nested Alternating Minimization Algorithms
In this section we introduce the class of Nested Alternating Minimization (NAM) algorithms
and provide a global convergence result for this class. To illustrate better the concept of
this class, we will consider the optimization model (1), where the vector $z$ is split into two
blocks as $z\equiv(x^T,y^T)^T\in\mathbb{R}^p\times\mathbb{R}^q$. More precisely, we focus here on the following three-block model
$$\min_{(x,y,u)\in\mathbb{R}^p\times\mathbb{R}^q\times\mathbb{R}^m}\left\{F(x,y,u)\equiv G(x,y,u)+H(u)\right\}, \quad (13)$$
where the function $G:\mathbb{R}^p\times\mathbb{R}^q\times\mathbb{R}^m\to\mathbb{R}$ is non-convex and smooth and $H:\mathbb{R}^m\to(-\infty,\infty]$
is non-convex and non-smooth (precise assumptions will be made below). The results in this
section can be easily extended to any finite number of blocks of variables. However, in
order to keep the developments simple we address the case where the block z is decomposed
into only two blocks. The algorithmic framework of Alternating Minimization suggests to
alternate between the three blocks of Problem (13) in a cyclic fashion. This results in three
sub-problems that should be exactly solved at each iteration. However, here as already
discussed in the Introduction we are interested in their approximations. Therefore, a reason
to consider this three-block model is due to the assumption that the sub-problems with
respect to the first two blocks x and y will be approximated using nested algorithms, while
the third block will remain to be exactly solved. More precisely, we introduce now the class
of Nested Alternating Minimization algorithms.
Algorithm 1 Nested Alternating Minimization (NAM) Scheme
1: Input: two nested friendly optimization algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$.
2: Initialization: $(x^{-1},y^{-1},u^0)\in\mathbb{R}^p\times\mathbb{R}^q\times\mathbb{R}^m$.
3: Iterative step:
4: for $k\geq0$ do
5:   Set $x^k$ after $i_{k-1}$ iterations of $\mathcal{A}_1$ starting with $x^{k-1}$ for minimizing $F(\cdot,y^{k-1},u^k)$.
6:   Set $y^k$ after $j_{k-1}$ iterations of $\mathcal{A}_2$ starting with $y^{k-1}$ for minimizing $F(x^k,\cdot,u^k)$.
7:   $u^{k+1}\in\operatorname{argmin}\{F(x^k,y^k,u):u\in\mathbb{R}^m\}$.
8: end for
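As an illustration only, the NAM scheme can be sketched in code as follows; the function names, the choice of plain gradient steps standing in for $\mathcal{A}_1$ and $\mathcal{A}_2$, and the toy objective are all hypothetical, since the framework permits any nested friendly algorithm:

```python
def nam(F_grad_x, F_grad_y, argmin_u, x, y, inner_iters, step_x, step_y, outer_iters=50):
    """Schematic sketch of the NAM scheme (Algorithm 1).

    A1 and A2 are instantiated here, for illustration only, as plain
    gradient steps; any nested friendly algorithm could be plugged in.
    F_grad_x / F_grad_y return partial gradients, and argmin_u solves
    the u-sub-problem exactly (step 7).  All names are hypothetical.
    """
    u = argmin_u(x, y)
    for k in range(outer_iters):
        for _ in range(inner_iters(k)):      # inner iterations of A1 on x
            x = x - step_x * F_grad_x(x, y, u)
        for _ in range(inner_iters(k)):      # inner iterations of A2 on y
            y = y - step_y * F_grad_y(x, y, u)
        u = argmin_u(x, y)                   # exact minimization in u
    return x, y, u

# Toy instance (hypothetical): F(x,y,u) = (x-1)^2 + (y-x)^2 + (u-y)^2.
x, y, u = nam(
    F_grad_x=lambda x, y, u: 2*(x - 1) - 2*(y - x),
    F_grad_y=lambda x, y, u: 2*(y - x) - 2*(u - y),
    argmin_u=lambda x, y: y,          # u-sub-problem solved in closed form
    x=0.0, y=0.0,
    inner_iters=lambda k: k + 2,      # growing inner-iteration count
    step_x=0.25, step_y=0.25,
    outer_iters=200,
)
# x, y, u all approach the minimizer (1, 1, 1).
```

On this toy instance the iterates contract geometrically toward $(1,1,1)$, the unique critical point, illustrating the alternation between approximate blocks and the exactly solved block.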
It should be noted that algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$, which are used in NAM, need not be the
same. This adds flexibility in using algorithms which best fit the
structure of each sub-problem. In order to obtain the global convergence of NAM, we will
make the following assumption on the involved objective function F .
Assumption A. (i) For any fixed $(y,u)\in\mathbb{R}^q\times\mathbb{R}^m$, the function $x\mapsto F(x,y,u)$ is
$\sigma_1(y,u)$-strongly convex with an $L_1(y,u)$-Lipschitz continuous gradient. In addition,
there exist positive scalars $\bar\sigma_1$ and $\bar L_1$ such that $\inf_{(y,u)\in D}\sigma_1(y,u)=\bar\sigma_1$ and
$\sup_{(y,u)\in D}L_1(y,u)=\bar L_1$ for any compact $D\subseteq\mathbb{R}^q\times\mathbb{R}^m$.

(ii) For any fixed $(x,u)\in\mathbb{R}^p\times\mathbb{R}^m$, the function $y\mapsto F(x,y,u)$ is $\sigma_2(x,u)$-strongly
convex with an $L_2(x,u)$-Lipschitz continuous gradient. In addition, there exist
positive scalars $\bar\sigma_2$ and $\bar L_2$ such that $\inf_{(x,u)\in D}\sigma_2(x,u)=\bar\sigma_2$ and $\sup_{(x,u)\in D}L_2(x,u)=\bar L_2$ for
any compact $D\subseteq\mathbb{R}^p\times\mathbb{R}^m$.

(iii) The gradient of the function $G$ is $L$-Lipschitz continuous on bounded subsets of $\mathbb{R}^p\times\mathbb{R}^q\times\mathbb{R}^m$.
At this juncture we would like to note that this assumption is satisfied in many chal-
lenging applications, such as the Regularized Structured Total Least Squares (which will be
discussed in detail in Section 5), the Wireless Sensor Network Localization problem [27],
Multi-Dimensional Scaling [28, Chapter 6], and more.
Before proceeding with NAM we would like to discuss the requirements of the nested
algorithms for guaranteeing the global convergence of NAM. To this end, and as we already
mentioned in NAM, we would like to define the following concept.
Definition 2. [Nested friendly algorithm]. Let $\varphi_k:\mathbb{R}^d\to\mathbb{R}$, $k\in\mathbb{N}$, be a family of convex
functions. Let $\mathcal{A}$ be an optimization algorithm that generates, for any $k\in\mathbb{N}$, a sequence
$\{v^{k,i}\}_{i\geq0}$ starting from $v^{k,0}$. We say that $\mathcal{A}$ is a Nested Friendly Algorithm (NFA) with
respect to $\{\varphi_k\}_{k\geq0}$ if for any $k\in\mathbb{N}$ there exist an index $i_k\in\mathbb{N}$ and a sequence of non-negative
scalars $\{c_{\mathcal{A}}(k,i)\}_{i\geq0}$ such that

(i) $\lim_{i\to\infty}c_{\mathcal{A}}(k,i)=0$.

(ii) $\varphi_k(v^{k,i_k})-\varphi_k^*\leq\frac{c_{\mathcal{A}}^2(k,i_k)}{2}\|v^{k,0}-v_k^*\|^2$, where $v_k^*$ is a minimizer of $\varphi_k$ with an optimal value of $\varphi_k^*$.

(iii) $\sum_{k=1}^{\infty}\sqrt{c_{\mathcal{A}}(k,i_k)}<\infty$.
This definition is crucial for us to derive the global convergence of NAM, which will be
carried out next. We will discuss examples of optimization algorithms which are NFAs in
Section 4. Before getting into the convergence of NAM, we would like to show that NFAs
satisfy some properties which will be useful for us in proving the global convergence of NAM.
Lemma 2. Let $\varphi_k:\mathbb{R}^d\to\mathbb{R}$, $k\in\mathbb{N}$, be a family of continuously differentiable and $\sigma_k$-strongly
convex functions. Assume that the gradient of each $\varphi_k$ is $L_k$-Lipschitz continuous. Let $\mathcal{A}$ be
an NFA with respect to $\{\varphi_k\}_{k\geq0}$ that generates sequences $\{v^{k,i}\}_{i\geq0}$ with $v^{k,0}=v^{k-1,i_{k-1}}$ for
some $v^{0,0}=v^0$. Then, the following properties hold:

(i) Overall non-increasing function value: $\varphi_k(v^{k,i_k})\leq\varphi_k(v^{k,0})=\varphi_k(v^{k-1,i_{k-1}})$.

(ii) Convergence of the gradients: $\|\nabla\varphi_k(v^{k,i_k})\|\leq\delta_{k,i_k}\equiv L_kc_{\mathcal{A}}(k,i_k)\|v^{k,0}-v_k^*\|/\sqrt{\sigma_k}$.
Proof. To prove that item (i) holds, we use the strong convexity of $\varphi_k$ to establish that
$$\varphi_k(v^{k,0})\geq\varphi_k^*+\nabla\varphi_k(v_k^*)^T(v^{k,0}-v_k^*)+\frac{\sigma_k}{2}\|v^{k,0}-v_k^*\|^2=\varphi_k^*+\frac{\sigma_k}{2}\|v^{k,0}-v_k^*\|^2, \quad (14)$$
where the last equality follows from the fact that $v_k^*$ is a minimizer of $\varphi_k$ and therefore
$\nabla\varphi_k(v_k^*)=0$. This fact with Definition 2(ii) yields that
$$\varphi_k(v^{k,i_k})-\varphi_k^*\leq\frac{c_{\mathcal{A}}^2(k,i_k)}{\sigma_k}\left(\varphi_k(v^{k,0})-\varphi_k^*\right).$$
Therefore, item (i) holds for any $i_k\geq0$ that satisfies $c_{\mathcal{A}}^2(k,i_k)\leq\sigma_k$. Since Definition 2(i)
states that $c_{\mathcal{A}}(k,i)\to0$ as $i\to\infty$, there must exist such an $i_k\in\mathbb{N}$.
To show that item (ii) holds, we use the fact that the functions $\varphi_k$ are strongly convex
with $L_k$-Lipschitz continuous gradient for all $k\geq0$ to obtain that
$$\varphi_k(v^{k,i_k})-\varphi_k^*\geq\frac{\sigma_k}{2}\|v^{k,i_k}-v_k^*\|^2\geq\frac{\sigma_k}{2L_k^2}\|\nabla\varphi_k(v^{k,i_k})-\nabla\varphi_k(v_k^*)\|^2=\frac{\sigma_k}{2L_k^2}\|\nabla\varphi_k(v^{k,i_k})\|^2,$$
where the last equality follows again from the fact that $\nabla\varphi_k(v_k^*)=0$. This fact with
Definition 2(ii) yields that
$$\|\nabla\varphi_k(v^{k,i_k})\|\leq\frac{c_{\mathcal{A}}(k,i_k)}{\sqrt{\sigma_k}}\cdot L_k\|v^{k,0}-v_k^*\|, \quad (15)$$
and item (ii) holds true with $\delta_{k,i_k}\equiv L_kc_{\mathcal{A}}(k,i_k)\|v^{k,0}-v_k^*\|/\sqrt{\sigma_k}$.
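The gradient bound of Lemma 2(ii) can be checked numerically. In the sketch below (our own construction, not from the paper), $\mathcal{A}$ is plain gradient descent on a strongly convex quadratic; with step $1/L$ it satisfies $\varphi(v^i)-\varphi^*\leq(1-\sigma/L)^i(\varphi(v^0)-\varphi^*)$, so $c_{\mathcal{A}}(k,i)=\sqrt{L}\,(1-\sigma/L)^{i/2}$ is a valid choice satisfying Definition 2(ii), and the bound $\|\nabla\varphi(v^i)\|\leq Lc_{\mathcal{A}}(k,i)\|v^0-v^*\|/\sqrt{\sigma}$ should hold at every inner iteration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A sigma-strongly convex quadratic phi(v) = 1/2 v^T Q v - b^T v,
# with sigma = lambda_min(Q) and gradient Lipschitz constant lambda_max(Q).
n = 20
M = rng.standard_normal((n, n))
Q = M.T @ M + np.eye(n)                 # positive definite
b = rng.standard_normal(n)
eigs = np.linalg.eigvalsh(Q)
sigma, Lip = eigs[0], eigs[-1]
v_star = np.linalg.solve(Q, b)          # unique minimizer

# Gradient descent with step 1/Lip satisfies the linear rate above, so
# c(i) = sqrt(Lip) * (1 - sigma/Lip)^(i/2) is a valid c_A sequence
# (our own choice here, for illustration).
v0 = rng.standard_normal(n)
v = v0.copy()
ok = True
for i in range(1, 200):
    v = v - (1.0 / Lip) * (Q @ v - b)   # one gradient step
    c = np.sqrt(Lip) * (1 - sigma / Lip) ** (i / 2)
    delta = Lip * c * np.linalg.norm(v0 - v_star) / np.sqrt(sigma)
    ok = ok and np.linalg.norm(Q @ v - b) <= delta   # Lemma 2(ii) bound
print(ok)
```

The check passes with slack, since for quadratics gradient descent contracts the iterates even faster than the function-value rate suggests.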
Remark 1. Equipped with the properties of NFAs from Lemma 2, we would like to describe
the concept of NFA in the context of NAM. For example, according to Lemma 2, saying that
algorithm $\mathcal{A}_1$ in NAM is an NFA means that, with $\varphi_k(x):=F(x,y^k,u^{k+1})$ for all $k\in\mathbb{N}$, at
each outer iteration $k$, algorithm $\mathcal{A}_1$ iteratively generates a sequence $\{x^{k,i}\}_{i\geq0}$ which starts at
$x^{k,0}=x^k$ and stops after $i_k$ inner iterations at $x^{k,i_k}=x^{k+1}$ (where the index $i_k$ is as mentioned
in NAM and Lemma 2) such that

(i) $F(x^{k+1},y^k,u^{k+1})\leq F(x^k,y^k,u^{k+1})$ for all $k\geq0$.

(ii) $\|\nabla_xF(x^{k+1},y^k,u^{k+1})\|\leq\delta_x^{k+1}$, where $\delta_x^{k+1}\equiv\delta_{k,i_k}=L_kc_{\mathcal{A}}(k,i_k)\|x^k-x_k^*\|/\sqrt{\sigma_k}$.

Similar properties follow for algorithm $\mathcal{A}_2$ with respect to the $y$-block.
Now we are ready to prove the following main result.
Theorem 2. [Global convergence of NAM]. Let $\{(x^k,y^k,u^k)\}_{k\geq0}$ be a bounded sequence
generated by NAM. If $F$ satisfies Assumption A and the KL property, then the sequence
$\{(x^k,y^k)\}_{k\geq0}$ has finite length and globally converges to some pair $(x^*,y^*)$. Moreover, let
$u^*$ be any limit point of the sequence $\{u^k\}_{k\geq0}$; then $(x^*,y^*,u^*)\in\operatorname{crit}(F)$.
Proof. Since $\mathcal{A}_1$ is an NFA, we denote $\varphi_k(x)=F(x,y^k,u^{k+1})$ for each $k\geq0$, and derive
from Assumption A that $\varphi_k$ is $\sigma_k(y^k,u^{k+1})$-strongly convex with an $L_k(y^k,u^{k+1})$-Lipschitz
continuous gradient. Therefore, Lemma 2 can be applied to $\{\varphi_k\}_{k\geq0}$ and the properties of
Remark 1 hold true. Similar arguments apply for the $y$-block, since $\mathcal{A}_2$ is also an NFA.
Throughout the proof, $x^{k,0}=x^k$, $x^{k,i_k}=x^{k+1}$, $y^{k,0}=y^k$ and $y^{k,j_k}=y^{k+1}$ for all $k\in\mathbb{N}$.
We will show that the sequence $\{(x^k,y^k,u^k)\}_{k\geq0}$ is an approximate gradient-like descent
sequence according to Definition 1, which is enough to obtain the stated result due to
Theorem 1 with $z=(x,y)$. To this end, we first prove that Condition (C1) is satisfied. By
the definition of $u^{k+1}$ (see step 7 in NAM) we have
$$F(x^k,y^k,u^k)\geq F(x^k,y^k,u^{k+1})\geq F(x^{k+1},y^k,u^{k+1})\geq F(x^{k+1},y^{k+1},u^{k+1}), \quad (16)$$
where the last two inequalities hold true using the fact that algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$ are NFAs
(see Remark 1). Therefore, the sequence $\{F(x^k,y^k,u^k)\}_{k\geq0}$ is non-increasing. Since the
sequences $\{x^k\}_{k\geq0}$ and $\{y^k\}_{k\geq0}$ are bounded as assumed, there exists $M\geq0$ such that
$\|x^k-x^{k+1}\|\leq M$ and $\|y^k-y^{k+1}\|\leq M$ for all $k\geq0$. In addition, from strong convexity
we establish that
$$F(x^k,y^k,u^{k+1})-F(x^{k+1},y^k,u^{k+1})\geq\nabla_xF(x^{k+1},y^k,u^{k+1})^T(x^k-x^{k+1})+\frac{\sigma_1(y^k,u^{k+1})}{2}\|x^{k+1}-x^k\|^2$$
$$\geq-\|\nabla_xF(x^{k+1},y^k,u^{k+1})\|\cdot\|x^k-x^{k+1}\|+\frac{\sigma_1(y^k,u^{k+1})}{2}\|x^{k+1}-x^k\|^2\geq\frac{\bar\sigma_1}{2}\|x^{k+1}-x^k\|^2-M\delta_x^{k+1}, \quad (17)$$
where the last inequality follows from Assumption A(i) and Lemma 2(ii). Similarly, for the
$y$-block we obtain
$$F(x^{k+1},y^k,u^{k+1})-F(x^{k+1},y^{k+1},u^{k+1})\geq\frac{\bar\sigma_2}{2}\|y^{k+1}-y^k\|^2-M\delta_y^{k+1}. \quad (18)$$
Combining (16), (17) and (18) we obtain that
$$F(x^k,y^k,u^k)-F(x^{k+1},y^{k+1},u^{k+1})\geq\rho_1\|z^{k+1}-z^k\|^2-e_1^k,$$
where $\rho_1=(1/2)\cdot\min\{\bar\sigma_1,\bar\sigma_2\}$ and $e_1^k=M(\delta_x^{k+1}+\delta_y^{k+1})$. This proves Condition (C1).
Now we prove that Condition (C2) is satisfied. We notice that
$$\partial F(z^{k+1},u^{k+1})=\begin{pmatrix}\nabla_zF(z^{k+1},u^{k+1})\\ \partial_uF(z^{k+1},u^{k+1})\end{pmatrix}=\begin{pmatrix}\nabla_zF(z^{k+1},u^{k+1})\\ \nabla_uG(z^{k+1},u^{k+1})+\partial H(u^{k+1})\end{pmatrix}. \quad (19)$$
Since $u^{k+1}$ is a minimizer of the function $u\mapsto F(z^k,u)$, it follows from the first-order
optimality condition that
$$0_m\in\nabla_uG(z^k,u^{k+1})+\partial H(u^{k+1}). \quad (20)$$
Therefore, combining (19) with (20) we establish the inclusion
$$w^{k+1}\equiv\begin{pmatrix}\nabla_zF(z^{k+1},u^{k+1})\\ \nabla_uG(z^{k+1},u^{k+1})-\nabla_uG(z^k,u^{k+1})\end{pmatrix}\in\partial F(z^{k+1},u^{k+1}). \quad (21)$$
From the triangle inequality we have
$$\|\nabla_zF(z^{k+1},u^{k+1})\|\leq\left\|\begin{pmatrix}\nabla_xF(x^{k+1},y^{k+1},u^{k+1})\\0\end{pmatrix}-\begin{pmatrix}\nabla_xF(x^{k+1},y^k,u^{k+1})\\0\end{pmatrix}\right\|+\left\|\begin{pmatrix}\nabla_xF(x^{k+1},y^k,u^{k+1})\\ \nabla_yF(x^{k+1},y^{k+1},u^{k+1})\end{pmatrix}\right\|. \quad (22)$$
Since $\{(x^k,y^k,u^k)\}_{k\geq0}$ is bounded, using Assumption A(iii) and Remark 1(ii) we obtain from
(22) that
$$\|\nabla_zF(z^{k+1},u^{k+1})\|\leq L\|y^{k+1}-y^k\|+\delta_x^{k+1}+\delta_y^{k+1}\leq L\|z^{k+1}-z^k\|+\delta_x^{k+1}+\delta_y^{k+1}. \quad (23)$$
From (21) and (23), we derive using the triangle inequality and Assumption A(iii) that
$$\|w^{k+1}\|\leq\|\nabla_zF(z^{k+1},u^{k+1})\|+\|\nabla_uG(z^{k+1},u^{k+1})-\nabla_uG(z^k,u^{k+1})\|\leq\rho_2\|z^{k+1}-z^k\|+e_2^k,$$
where $\rho_2=2L$ and $e_2^k=\delta_x^{k+1}+\delta_y^{k+1}$. This proves Condition (C2).
To prove Condition (C3), we take a sub-sequence $\{(z^{q_k},u^{q_k})\}_{k\geq0}$ which converges to some
pair $(z^*,u^*)$. From the definition of $u^{k+1}$ we obtain that
$$G(z^k,u^{k+1})+H(u^{k+1})\leq G(z^k,u^*)+H(u^*),$$
and thus from Assumption A(iii) it follows that
$$H(u^{k+1})\leq H(u^*)+G(z^k,u^*)-G(z^k,u^{k+1})\leq H(u^*)+\nabla_uG(z^k,u^{k+1})^T(u^*-u^{k+1})+\frac{L}{2}\|u^*-u^{k+1}\|^2.$$
Substituting $k$ with $q_k-1$ and taking $k\to\infty$ yields
$$\limsup_{k\to\infty}H(u^{q_k})\leq H(u^*).$$
Using the continuity of $G$ we obtain Condition (C3). Finally, in order to prove Condition
(C4) we have to show that both $\{\sqrt{e_1^k}\}_{k\geq0}$ and $\{e_2^k\}_{k\geq0}$ are summable. Therefore, by the
definitions of $e_1^k$ and $e_2^k$, it is enough to show that $\{\sqrt{e_1^k}\}_{k\geq0}$ is summable. From the $\bar\sigma_1$-strong
convexity of the function $x\mapsto F(x,y^{k-1},u^k)$ (see Assumption A(i)) we have
$$F(x^k,y^{k-1},u^k)-F(x_k^*,y^{k-1},u^k)\geq\nabla_xF(x_k^*,y^{k-1},u^k)^T(x^k-x_k^*)+\frac{\bar\sigma_1}{2}\|x^k-x_k^*\|^2=\frac{\bar\sigma_1}{2}\|x^k-x_k^*\|^2, \quad (24)$$
where the last equality follows from the fact that $\nabla_xF(x_k^*,y^{k-1},u^k)=0$. Therefore, since
we assume that $F$ is bounded from below by some $\bar F\in\mathbb{R}$, it follows from (24) that
$$\|x^k-x_k^*\|^2\leq\frac{2}{\bar\sigma_1}\left(F(x^k,y^{k-1},u^k)-F(x_k^*,y^{k-1},u^k)\right)\leq\frac{2}{\bar\sigma_1}\left(F(x^{-1},y^{-1},u^1)-\bar F\right), \quad (25)$$
where the last inequality follows from (16). Now, using Remark 1(ii) we derive from (25) that
$$\delta_x^{k+1}=\frac{L_k}{\sqrt{\sigma_k}}c_{\mathcal{A}_1}(k,i_k)\|x^k-x_k^*\|\leq\frac{\bar L_1}{\sqrt{\bar\sigma_1}}c_{\mathcal{A}_1}(k,i_k)\|x^k-x_k^*\|\leq\sqrt{2\left(F(x^{-1},y^{-1},u^1)-\bar F\right)}\,\bar\kappa_1\cdot c_{\mathcal{A}_1}(k,i_k),$$
where $\bar\kappa_1\equiv\bar L_1/\bar\sigma_1$. Therefore, from Definition 2(iii) we obtain that $\sum_{k=1}^{\infty}\sqrt{\delta_x^{k+1}}<\infty$.
and therefore $c_{\mathcal{A}}(k,i)=2\sqrt{L_k}/(i+1)$, where $L_k$ is the Lipschitz constant of the gradient
of the function $\varphi_k$. So, as can be seen, any optimization method with a provable rate of
convergence in terms of function values is an NFA (the third requirement, to be discussed below,
will also be valid). Before discussing the number of inner iterations, we present another
example. If $\varphi_k$, $k\in\mathbb{N}$, is additionally $\sigma_k$-strongly convex (as will be the case in NAM due to
Assumption A), then we can use a variant of AG called V-AG which exploits the strong
convexity and enjoys a better rate (see [5, Theorem 10.42]):
$$\varphi_k(v^{k,i})-\varphi_k^*\leq\left(1-\frac{1}{\sqrt{\kappa_k}}\right)^i\frac{\sigma_k+L_k}{2}\|v^{k,0}-v_k^*\|^2,$$
and therefore $c_{\mathcal{A}}(k,i)=\sqrt{\sigma_k+L_k}\cdot\left(1-\kappa_k^{-1/2}\right)^{i/2}$, where $\kappa_k=L_k/\sigma_k$.
Now we would like to focus on the issue of determining the number of inner iterations
$i_k$ without focusing on a specific algorithm. From Lemma 2 we see that there are two
requirements on $i_k$:

(a) $c_{\mathcal{A}}^2(k,i_k)\leq\sigma_k$, where $\sigma_k$ is the strong convexity parameter of $\varphi_k$.

(b) Summability of the sequence $\{\sqrt{c_{\mathcal{A}}(k,i_k)}\}_{k\geq0}$.

We should mention that it is enough to require that the above two requirements are satisfied
for all $k\geq K$ for some $K\in\mathbb{N}$ (and we use this fact in Section 5). Notice that each
requirement is satisfied for some $i_k$, and the number of inner iterations is chosen to be the
maximum between them. The first requirement can be easily verified since $c_{\mathcal{A}}(k,i_k)$ depends
explicitly on $i_k$. For instance, in the case of AG we have
$$c_{\mathcal{A}}(k,i_k)=\frac{2\sqrt{L_k}}{i_k+1},$$
and therefore $\lceil2\sqrt{\kappa_k}-1\rceil\leq i_k$, where $\kappa_k=L_k/\sigma_k$. Similarly, we can show that for V-AG we
have
$$\left\lceil\frac{\log(\sigma_k+L_k)-\log(2\sigma_k)}{\log(\sqrt{\kappa_k})-\log(\sqrt{\kappa_k}-1)}\right\rceil\leq i_k.$$
The second requirement can also be easily satisfied, since all we need is to guarantee that
the sequence $\{\sqrt{c_{\mathcal{A}}(k,i_k)}\}_{k\geq0}$ converges to $0$ fast enough that its sum is finite. Since
$c_{\mathcal{A}}(k,i)$ converges to $0$ as $i\to\infty$, independently of the specific structure of $c_{\mathcal{A}}(k,i)$, it is
enough to take a sequence $\{i_k\}_{k\in\mathbb{N}}$ which grows fast enough to obtain the summability. For
example, if we set $i_k=s+2^{\lfloor(k+1)/r\rfloor}$ for some fixed integers $s$ and $r$, we easily obtain the
desired requirement.
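As a sketch (assumed parameter values, AG example only), the two requirements can be combined in code: requirement (a) fixes a minimal $i_k$ from $\kappa_k$, the exponential schedule handles requirement (b), and the stabilizing partial sums of $\sqrt{c_{\mathcal{A}}(k,i_k)}$ illustrate the summability:

```python
import math

# For AG (our illustration), c_A(k, i) = 2*sqrt(L_k)/(i + 1).
# Requirement (a): c_A(k, i_k)^2 <= sigma_k  <=>  i_k >= ceil(2*sqrt(kappa_k) - 1).
def ik_from_requirement_a(L_k, sigma_k):
    kappa = L_k / sigma_k
    return math.ceil(2 * math.sqrt(kappa) - 1)

# Requirement (b): with the schedule i_k = s + 2^((k+1)//r), the partial
# sums of sqrt(c_A(k, i_k)) stabilize, indicating summability.
def schedule(k, s=1, r=3):
    return s + 2 ** ((k + 1) // r)

L_k, sigma_k = 100.0, 1.0
i_a = ik_from_requirement_a(L_k, sigma_k)    # here kappa = 100, so i_a = 19

partial = 0.0
for k in range(200):
    i_k = max(i_a, schedule(k))              # satisfy both (a) and (b)
    cA = 2 * math.sqrt(L_k) / (i_k + 1)
    partial += math.sqrt(cA)
print(i_a, partial)
```

The partial sum changes negligibly beyond a few dozen outer iterations, since $\sqrt{c_{\mathcal{A}}(k,i_k)}$ decays geometrically under the exponential schedule.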
It should be noted that since an NFA is used within the framework of NAM, we can rely on
Assumption A to bound the strong convexity parameters $\sigma_k$ and the Lipschitz constants $L_k$
by $\bar\sigma$ and $\bar L$, respectively. For example, in the case of AG and V-AG, the number of inner
iterations $i_k$ that satisfies (a) can also be determined in terms of $\kappa=\bar L/\bar\sigma$ instead of $\kappa_k$.
Since $\kappa$ is fixed for any $k\in\mathbb{N}$, taking $i_k=s+2^{\lfloor(k+1)/r\rfloor}$ (as described above) guarantees
that both (a) and (b) are satisfied simultaneously, and there is no need to calculate the value
of $\kappa$.
5 Numerical Experiments
In this section we give a numerical example to show the advantage of using the NAM framework
relative to existing methods on the challenging non-convex problem of Regularized
Structured Total Least Squares (RSTLS). Unlike One Iteration Approximation based methods,
NAM allows several iterations in order to approximate the solution of each sub-problem.
In the following example we show that these additional inner iterations at each outer iteration
lead to a superior algorithm compared to existing methods, both in terms of the
function values that NAM converges to and in terms of the relative error of the obtained
solution.
5.1 Regularized Structured Total Least Squares (RSTLS)
Following the recent paper [6], we consider the class of RSTLS problems, which can be
formulated as the following non-convex and possibly non-smooth optimization problem
$$\min_{x\in\mathbb{R}^p,\,u\in\mathbb{R}^m}\left\{F(x,u)\equiv\sigma_w^2f(x)+\left\|\left(A+\sum_{i=1}^m u_iA_i\right)x-b\right\|^2+\frac{\sigma_w^2}{\sigma_e^2}\|u\|^2\right\}, \quad (26)$$
where $A,A_1,A_2,\ldots,A_m\in\mathbb{R}^{n\times p}$ and $b\in\mathbb{R}^n$ are given and contain noise, $\sigma_w$ and $\sigma_e$ are
the standard deviations of the noise components, respectively, and $f:\mathbb{R}^p\to(-\infty,\infty]$ is a
regularizing function. For more details about the problem formulation given in (26) and its
applications in image processing, see [6] and references therein.
be compatible with the NAM framework, we choose the function f to be λ ‖x‖2 for some
regularization parameter λ > 0. This yields a non-convex formulation which is proper, lower
semi-continuous, bounded from below by 0 and satisfies the KL property (all these properties
can be easily verified in this case, since the function F is a quartic polynomial function).
Under the NAM framework we can write $F(x,u)=G(x,u)+H(u)$, where
$$G(x,u)=\sigma_w^2f(x)+\tilde G(x,u)\equiv\lambda\sigma_w^2\|x\|^2+\left\|\left(A+\sum_{i=1}^m u_iA_i\right)x-b\right\|^2,\qquad H(u)=\frac{\sigma_w^2}{\sigma_e^2}\|u\|^2.$$
We mention that in the general NAM framework the function $H$ is allowed to be non-smooth,
which is not the case in this specific application. In addition, in this application the
function $u\mapsto F(x,u)$, for a fixed $x\in\mathbb{R}^p$, is a strongly convex quadratic with a very small $m$,
and thus can be minimized exactly, as also done in [6] (see more details below).
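For a fixed $x$, the $u$-sub-problem of (26) is a small ridge-regression-type quadratic: writing $B=[A_1x,\ldots,A_mx]$ and $r=Ax-b$, the objective in $u$ is $\|Bu+r\|^2+(\sigma_w^2/\sigma_e^2)\|u\|^2$, minimized by $u=-(B^TB+(\sigma_w^2/\sigma_e^2)I)^{-1}B^Tr$. A minimal sketch (function and variable names are our own):

```python
import numpy as np

def exact_u_step(A, A_list, x, b, ratio):
    """Closed-form u-sub-problem of RSTLS for fixed x (a sketch).

    Minimizes ||(A + sum_i u_i A_i) x - b||^2 + ratio * ||u||^2,
    where ratio stands for sigma_w^2 / sigma_e^2.  Writing
    B = [A_1 x, ..., A_m x] and r = A x - b, this is the ridge
    problem ||B u + r||^2 + ratio * ||u||^2, solved via an
    m x m normal-equations system.
    """
    B = np.column_stack([Ai @ x for Ai in A_list])
    r = A @ x - b
    m = B.shape[1]
    return np.linalg.solve(B.T @ B + ratio * np.eye(m), -B.T @ r)

# Sanity check on random data: the returned u zeroes the gradient in u.
rng = np.random.default_rng(1)
n, p, m = 8, 5, 3
A = rng.standard_normal((n, p))
A_list = [rng.standard_normal((n, p)) for _ in range(m)]
x = rng.standard_normal(p)
b = rng.standard_normal(n)
u = exact_u_step(A, A_list, x, b, ratio=0.5)
B = np.column_stack([Ai @ x for Ai in A_list])
grad = 2 * B.T @ (B @ u + (A @ x - b)) + 2 * 0.5 * u
print(np.linalg.norm(grad))   # ~0
```

Since $m$ is small, this exact solve costs only an $m\times m$ linear system per outer iteration.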
We also point out that NAM has three blocks, while our application has only two.
Therefore, we can either consider the $y$-block in NAM to be absent, or alternatively treat
the $u$-block of RSTLS as the $y$-block in NAM, in which case the $u$-block in NAM is absent.
The difference between these two approaches is that the $u$-block in NAM is solved exactly,
while the $y$-block in NAM can be solved approximately. In this work we choose to solve
each such sub-problem exactly, which is possible due to the low dimension of $u$; nevertheless,
the NAM framework allows us to solve it approximately instead.
In order to apply Theorem 2 to formulation (26), we need to show that Assumption A is satisfied for the RSTLS problem (without item (ii), since we have no y-block here). To this end, we notice that the function x ↦ F(x,u) is a strongly convex quadratic with Hessian matrix given by

$$\nabla_x^2 F(x,u) = 2\lambda\sigma_w^2 I_{p \times p} + 2\Bigg( A + \sum_{i=1}^m u_i A_i \Bigg)^T \Bigg( A + \sum_{i=1}^m u_i A_i \Bigg).$$
Therefore, we derive that $\bar{\sigma} = 2\lambda\sigma_w^2$ is a positive lower bound on the strong convexity parameter of the function x ↦ F(x,u) for all u ∈ R^m. In addition, it is easy to check that
the partial gradient ∇_x G(x,u) is Lipschitz continuous with constant

$$L(u) = 2\lambda\sigma_w^2 + 2\Bigg\| A + \sum_{i=1}^m u_i A_i \Bigg\|^2. \qquad (27)$$
Therefore, we see that L (u) is bounded on compact sets by some L > 0, thus Assump-
tion A(i) is satisfied. Since G (x,u) is a quadratic polynomial function we easily derive
Assumption A(iii).
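The bounds above can be checked numerically on a small instance: the minimum eigenvalue of the Hessian is at least $\bar{\sigma} = 2\lambda\sigma_w^2$, and the maximum eigenvalue is at most the constant L(u) of (27). The data below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 6, 4, 2
sigma_w, lam = 1e-2, 0.02
A = rng.standard_normal((n, p))
As = [rng.standard_normal((n, p)) for _ in range(m)]
u = rng.standard_normal(m)

Au = A + sum(ui * Ai for ui, Ai in zip(u, As))
# Hessian of x -> F(x, u): 2*lam*sigma_w^2*I + 2*Au^T Au (constant in x)
hess = 2 * lam * sigma_w**2 * np.eye(p) + 2 * Au.T @ Au
eigs = np.linalg.eigvalsh(hess)

sigma_bar = 2 * lam * sigma_w**2                            # strong convexity bound
L_u = 2 * lam * sigma_w**2 + 2 * np.linalg.norm(Au, 2) ** 2 # Lipschitz constant (27)

assert eigs.min() >= sigma_bar - 1e-12   # Hessian >= sigma_bar * I
assert eigs.max() <= L_u + 1e-12         # gradient of x -> G(x,u) is L(u)-Lipschitz
```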
5.2 Description of the Data and Methods
We follow the experiment of [6] and utilize the RSTLS formulation of (26) in order to
reconstruct a given blurred image. We consider the vector b as the vectorized observed blurred image. The vector b is obtained from the real image, denoted x_real, by first scaling the pixels of the real image to be between 0 and 1, then blurring it with a Gaussian blur point spread function (PSF) of size q × q and standard deviation γ (see [15, Chapter 3]), and finally adding to each coordinate noise drawn from a Gaussian distribution with parameters (0, σ_w).
The unknown blurring operator is constructed from the original PSF by adding to it a q × q matrix (before padding) with the same structure as the original PSF. Noise is added to each of the structure matrices A_i by the update rule A_i := µA_i, where µ ∼ Gauss(0, σ_e). We use periodic boundary conditions, resulting in a block circulant with circulant blocks (BCCB) blurring matrix.
The authors of [6] solve the RSTLS problem with a One Iteration Approximation based algorithm called SemiProximal-Alternating (SPA). It was proven that SPA converges globally to a critical point of F(x,u). In order to present the algorithm, we recall the definition of the proximal mapping of a function ψ : R^d → (−∞, ∞]:

$$\mathrm{prox}_{\psi}(z) = \operatorname*{argmin}_{v \in \mathbb{R}^d} \left\{ \psi(v) + \frac{1}{2}\|v - z\|^2 \right\} \quad \text{for any } z \in \mathbb{R}^d.$$
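For the choice ψ(v) = c‖v‖^2 used throughout this section, the proximal mapping has the closed form prox_ψ(z) = z/(1 + 2c), which is exactly what produces the explicit SPA update below. The following sketch (function name hypothetical) verifies this against the definition by brute force in one dimension.

```python
import numpy as np

def prox_sq_norm(z, c):
    """prox of psi(v) = c*||v||^2: argmin_v c*||v||^2 + 0.5*||v - z||^2 = z / (1 + 2c)."""
    return z / (1.0 + 2.0 * c)

# brute-force check of the closed form on a fine 1-D grid
z, c = 3.0, 0.5
vs = np.linspace(-5, 5, 100001)
obj = c * vs**2 + 0.5 * (vs - z) ** 2
assert abs(vs[obj.argmin()] - prox_sq_norm(np.array([z]), c)[0]) < 1e-3
```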
Algorithm 2 SPA Algorithm for RSTLS
1: Initialization: (x^0, u^0) ∈ R^p × R^m.
2: Iterative step:
3: for k ≥ 0 do
4:   Pick L_k (a Lipschitz constant of the gradient of x ↦ G̃(x, u^k)) and update
$$x^{k+1} = \mathrm{prox}_{\frac{\sigma_w^2}{L_k} f}\left( x^k - \frac{1}{L_k}\nabla_x \tilde{G}\big(x^k, u^k\big) \right). \qquad (28)$$
5:   Update
$$u^{k+1} = \operatorname*{argmin}\left\{ F\big(x^{k+1}, u\big) : u \in \mathbb{R}^m \right\}. \qquad (29)$$
6: end for
Since the function u ↦ F(x^{k+1}, u) is strongly convex, the unique minimizer in (29) can be obtained from the first-order optimality condition ∇_u F(x^{k+1}, u^{k+1}) = 0, and is given by the formula

$$u^{k+1} = -\left( \frac{\sigma_w^2}{\sigma_e^2} I_{m \times m} + B\big(x^{k+1}\big)^T B\big(x^{k+1}\big) \right)^{-1} B\big(x^{k+1}\big)^T \big( A x^{k+1} - b \big), \qquad (30)$$
where B(x) = [A_1x  A_2x  ⋯  A_mx]. Since, in practice, m is small, this system of linear equations can be solved efficiently. As for the update rule in (28), since the function f is set
to be λ‖x‖^2, simple calculations show that an explicit update rule is given by the formula

$$x^{k+1} = \frac{L_k}{L_k + 2\lambda\sigma_w^2}\left( x^k - \frac{1}{L_k}\nabla_x \tilde{G}\big(x^k, u^k\big) \right). \qquad (31)$$
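One full SPA iteration, combining the explicit x-update (31) with the exact u-update (30), can be sketched as follows. This is a minimal dense-matrix sketch on random illustrative data, not the BCCB implementation used in the experiments; after the u-update, the first-order optimality condition ∇_u F(x, u) = 0 holds up to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 8, 5, 2
sigma_w, sigma_e, lam = 1e-2, 1e-3, 0.02
A = rng.standard_normal((n, p))
As = [rng.standard_normal((n, p)) for _ in range(m)]
b = rng.standard_normal(n)

def A_of(u):
    return A + sum(ui * Ai for ui, Ai in zip(u, As))

def spa_step(x, u):
    Au = A_of(u)
    Lk = 2 * np.linalg.norm(Au, 2) ** 2              # tight Lipschitz constant (32)
    grad = 2 * Au.T @ (Au @ x - b)                    # gradient of x -> G~(x, u)
    x_new = Lk / (Lk + 2 * lam * sigma_w**2) * (x - grad / Lk)   # update (31)
    B = np.column_stack([Ai @ x_new for Ai in As])    # B(x) = [A_1 x ... A_m x]
    M = (sigma_w**2 / sigma_e**2) * np.eye(m) + B.T @ B
    u_new = -np.linalg.solve(M, B.T @ (A @ x_new - b))            # update (30)
    return x_new, u_new

x, u = np.zeros(p), np.zeros(m)
for _ in range(50):
    x, u = spa_step(x, u)

# the exact u-update satisfies the optimality condition grad_u F(x, u) = 0
B = np.column_stack([Ai @ x for Ai in As])
grad_u = 2 * B.T @ (A_of(u) @ x - b) + 2 * (sigma_w**2 / sigma_e**2) * u
```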
Obtaining an upper bound on the Lipschitz constant L_k, k ∈ N, can be a computationally challenging task. However, since we know that the blurring matrix is BCCB, its eigenvalues can be calculated efficiently (see [15, Section 4.2.1]). Therefore, we use in this paper the tight Lipschitz constant given by

$$L_k = 2\Bigg\| A + \sum_{i=1}^m u_i^k A_i \Bigg\|^2. \qquad (32)$$
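The efficient eigenvalue computation rests on the fact that a BCCB matrix with periodic boundary conditions is diagonalized by the 2-D DFT, so its spectral norm is the largest modulus of the 2-D FFT of its (circularly shifted) PSF. The sketch below, with a hypothetical tiny image size and PSF, verifies this against the dense matrix built column by column.

```python
import numpy as np

N = 8                                    # tiny N x N image (illustrative size)
psf = np.array([[0.0, 0.1, 0.0],
                [0.1, 0.6, 0.1],
                [0.0, 0.1, 0.0]])        # small blur kernel, center at (1, 1)

# embed the PSF in an N x N array and shift its center to index (0, 0)
h = np.zeros((N, N))
h[:3, :3] = psf
h = np.roll(h, shift=(-1, -1), axis=(0, 1))

# eigenvalues of the BCCB blurring matrix are the 2-D DFT of h, so the
# spectral norm needs a single FFT instead of an SVD
spec_norm_fft = np.abs(np.fft.fft2(h)).max()

def blur(img):
    """Circular convolution with the PSF (periodic boundary conditions)."""
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(h)))

# brute-force check: build the dense N^2 x N^2 BCCB matrix column by column
E = np.eye(N * N)
A_dense = np.column_stack([blur(E[:, j].reshape(N, N)).ravel() for j in range(N * N)])
assert np.isclose(np.linalg.norm(A_dense, 2), spec_norm_fft)
```

Since this PSF is nonnegative and sums to one, the spectral norm computed either way equals 1.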
Now, we would like to compare the SPA algorithm with our NAM method equipped with inner iterations of the AG algorithm (which was proved in Section 4 to be an NFA). This method applies several iterations of AG to the function x ↦ G(x, u^k), which requires the Lipschitz constant of the gradient of this function. From (27) we see that we can use the constant 2λσ_w^2 + L_k, where L_k is defined in (32).
Algorithm 3 NAM with AG for RSTLS
1: Initialization: (x^{-1}, u^0) ∈ R^p × R^m.
2: Iterative step:
3: for k ≥ 0 do
4:   Set w^{k-1,0} = x^{k-1}, t_0 = 1 and pick L_k ≥ L_{∇_x G̃}(u^k).
5:   for i = 0, 1, 2, ..., i_{k-1} − 1 do
6:     Update
$$x^{k-1,i+1} = w^{k-1,i} - \frac{1}{2\lambda\sigma_w^2 + L_k}\nabla_x F\big(w^{k-1,i}, u^k\big). \qquad (33)$$
7:     Set $t_{i+1} = \frac{1 + \sqrt{1 + 4t_i^2}}{2}$.
8:     Update
$$w^{k-1,i+1} = x^{k-1,i+1} + \frac{t_i - 1}{t_{i+1}}\big( x^{k-1,i+1} - x^{k-1,i} \big). \qquad (34)$$
9:   end for
10:  Set x^k = x^{k-1,i_{k-1}}.
11:  Update
$$u^{k+1} = \operatorname*{argmin}\left\{ F\big(x^k, u\big) : u \in \mathbb{R}^m \right\}. \qquad (35)$$
12: end for
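The inner AG loop of steps (33)-(34) is a Nesterov-type accelerated gradient method on the strongly convex quadratic x ↦ F(x, u^k) for the fixed block u^k. The following sketch, on random illustrative data, runs the loop with the stepsize 1/(2λσ_w^2 + L_k) and checks that the iterates approach the unique minimizer.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 20, 5, 2
sigma_w, lam = 1e-2, 0.02
A = rng.standard_normal((n, p))
As = [rng.standard_normal((n, p)) for _ in range(m)]
b = rng.standard_normal(n)
u = rng.standard_normal(m)                       # the fixed block u^k
Au = A + sum(ui * Ai for ui, Ai in zip(u, As))

def f(x):
    """x -> F(x, u) up to the constant H(u)."""
    return lam * sigma_w**2 * (x @ x) + np.sum((Au @ x - b) ** 2)

def grad(x):
    return 2 * lam * sigma_w**2 * x + 2 * Au.T @ (Au @ x - b)

Lk = 2 * np.linalg.norm(Au, 2) ** 2              # Lipschitz constant (32) of grad G~
step = 1.0 / (2 * lam * sigma_w**2 + Lk)         # stepsize of update (33)

x, w, t = np.zeros(p), np.zeros(p), 1.0
for i in range(2000):                            # i_k inner AG iterations
    x_new = w - step * grad(w)                   # gradient step (33)
    t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2      # momentum coefficient
    w = x_new + (t - 1) / t_new * (x_new - x)    # extrapolation step (34)
    x, t = x_new, t_new

# compare with the exact minimizer of the strongly convex quadratic
x_star = np.linalg.solve(lam * sigma_w**2 * np.eye(p) + Au.T @ Au, Au.T @ b)
assert f(x) - f(x_star) < 1e-3
```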
We mention that since the blurring matrix is BCCB, we can efficiently compute the strong convexity parameters σ_k of the functions x ↦ F(x, u^k) for any k ≥ 0. Therefore, we can use V-AG instead of AG as the NFA in Algorithm 3. Moreover, following the arguments in Section 4, we actually know an explicit minimum number of inner iterations i_k for both AG and V-AG. However, the condition number κ_k increases as σ_w decreases, giving very large κ_k. Two issues arise when the value of κ_k is very large, which limit its applicability here: (i) the number of inner iterations i_k increases significantly, even for small k; (ii) the stepsize in V-AG, which depends on κ_k, increases significantly. Therefore, we use the AG method instead of V-AG. To overcome issue (i), we simply follow the arguments of Section 4
and choose $i_k = s + 2^{\lfloor (k+1)/r \rfloor}$ for some fixed integers s and r. This selection guarantees the conditions for global convergence while keeping a relatively low number of inner iterations when k is small. Additionally, this selection does not require the calculation of κ_k.
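The inner-iteration schedule can be sketched as a one-line function (function name hypothetical); it reproduces the counts reported in the next subsection for s = 10 and r = 10.

```python
import math

def inner_iters(k, s, r):
    """Inner-iteration schedule i_k = s + 2^floor((k + 1) / r)."""
    return s + 2 ** math.floor((k + 1) / r)

# for s = 10, r = 10: i_{-1} = 11, i_0 = 11, i_9 = 12, i_19 = 14, i_29 = 18
assert [inner_iters(k, 10, 10) for k in (-1, 0, 9, 19, 29)] == [11, 11, 12, 14, 18]
```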
5.3 Results
We run the SPA algorithm and NAM with inner AG iterations on a blurred version of the 256 × 256 pixels cameraman test image taken from the MATLAB Image Processing Toolbox available in [1]. In order to blur the original image, we used a Gaussian PSF of size 5 × 5 with a standard deviation of 4. See figures (a) and (b) below for both the original and the blurred image.
(a) Original image (b) Blurred image
We let both methods run for N = 3000 iterations, where for NAM the iteration counter also counts inner iterations (for example, if i_1 = 100 and i_2 = 200, then the iteration counter after two iterations is N = 300). In all experiments we set λ = 0.02, σ_w = 10^{-4} and σ_e = 10^{-3}. The initial point for both algorithms is set to be the observed blurred image, and the variable u is initialized as the zero vector.
As for the NAM with AG algorithm, we always set r = 10, which lets us control the number of inner AG iterations while also establishing global convergence. For example, from the formula $i_k = s + 2^{\lfloor (k+1)/r \rfloor}$ we get the following numbers of inner AG iterations for s = 10 and r = 10: i_{-1} = 11, i_0 = 11, i_9 = 12, i_{19} = 14, i_{29} = 18, and so forth. Note that when keeping N fixed, increasing s reduces the effect of r on the number of inner iterations.
All experiments were run on an Intel(R) Core(TM) i5-6400 CPU @ 2.70GHz with 8.00GB RAM, Windows 7 Professional 64-bit, using MATLAB 2020a.
We compare in Figure 2 and Figure 3 the two methods by the achieved function value and by the relative error at each iteration, calculated by

$$\text{Relative Error} = \frac{\|x^i - x_{\mathrm{real}}\|}{\|x_{\mathrm{real}}\|},$$

where x^i is the image obtained in the i-th iteration.
Figure 2: Function value at each iteration for N = 3000 in a logarithmic scale. The NAM
with AG method was conducted for s = 1, 10, 50, 100 and 200.
Figure 3: Relative error at each iteration for N = 3000 in a logarithmic scale. The NAM with AG method was conducted for s = 1, 10, 50, 100 and 200.
In Figure 2 and Figure 3 we see that after 10 iterations, NAM with AG for s = 10, 50, 100, 200 is superior to SPA in terms of both function value and relative error. We also see that for NAM with AG, increasing the number of inner iterations increases the accuracy of the output, indicating that, at least for the RSTLS problem, it is better to use methods that are not based on One Iteration Approximation.
The following figures show the output of each of the methods after N = 3000 iterations.
(a) SPA Algorithm (b) NAM with AG, s = 1
(c) NAM with AG, s = 10 (d) NAM with AG, s = 50
(e) NAM with AG, s = 100 (f) NAM with AG, s = 200
To conclude, in this paper we provided a general recipe which serves as a basis for proving global convergence of algorithms that tackle non-convex optimization problems. We generalized the methodology of [10] to allow errors in the conditions, and utilized the new methodology to establish global convergence of Nested Alternating Minimization methods that incorporate a Nested Friendly Algorithm. To the best of our knowledge, this is the first result that proves global convergence of general nested algorithms in the non-convex setting, and the first methodology which can be used to derive global convergence of methods which are not necessarily sufficient descent methods. In future work, it would be interesting to generalize the NAM framework to also include constraints or non-smooth regularizers in the blocks x and y. Another challenging task is to extend the NAM framework to functions which are not necessarily strongly convex in the x and y blocks.
References
[1] Image Processing Toolbox. The MathWorks Inc. Natick, Massachusetts, United States.
Online: mathworks.com/products/image, 2020.
[2] H. Attouch and J. Bolte. On the convergence of the proximal algorithm for nonsmooth
functions involving analytic features. Mathematical Programming, 116(1-2):5–16, 2009.
[3] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
[4] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2):91–129, 2013.
[5] A. Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.
[6] A. Beck, S. Sabach, and M. Teboulle. An alternating semiproximal method for non-
convex regularized structured total least squares problems. SIAM Journal on Matrix
Analysis and Applications, 37(3):1129–1150, 2016.
[7] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419–2434, 2009.
[8] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[9] J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
[10] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for
nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494,
2014.
[11] J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd. First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM Journal on Optimization, 28(3):2131–2151, 2018.
[12] B. L. Gorissen, İ. Yanıkoğlu, and D. den Hertog. A practical guide to robust optimiza-
tion. Omega, 53:124–137, 2015.
[13] P. J. Groenen, M. van de Velden, et al. Multidimensional scaling by majorization: A
review. Journal of Statistical Software, 73(8):1–26, 2016.
[14] W. J. Gutjahr and A. Pichler. Stochastic multi-objective optimization: a survey on
non-scalarizing methods. Annals of Operations Research, 236(2):475–499, 2016.
[15] P. C. Hansen, J. G. Nagy, and D. P. O'Leary. Deblurring images: matrices, spectra, and filtering. SIAM, 2006.
[16] R. Hesse, D. R. Luke, S. Sabach, and M. K. Tam. Proximal heterogeneous block
implicit-explicit method and application to blind ptychographic diffraction imaging.
SIAM Journal on Imaging Sciences, 8(1):426–457, 2015.
[17] P. Jain, P. Kar, et al. Non-convex optimization for machine learning. Foundations and Trends in Machine Learning, 10(3-4):142–336, 2017.
[18] K. Kurdyka. On gradients of functions definable in o-minimal structures. Ann. Inst.
Fourier (Grenoble), 48(3):769–783, 1998.
[19] S. Lojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. In
Les Équations aux Dérivées Partielles (Paris, 1962), pages 87–89. Éditions du Centre
National de la Recherche Scientifique, Paris, 1963.
[20] F. G. Mohammadi, M. H. Amini, and H. R. Arabnia. Evolutionary computation, op-
timization, and learning algorithms for data science. In Optimization, Learning, and
Control for Interdependent Complex Networks, pages 37–65. Springer, 2020.
[21] Y. E. Nesterov. A method for solving the convex programming problem with conver-
gence rate O(1/k2). Dokl. Akad. Nauk SSSR, 269(3):543–547, 1983.
[22] P. Ochs. Unifying abstract inexact convergence theorems and block coordinate variable metric iPiano. SIAM Journal on Optimization, 29(1):541–570, 2019.
[23] N. Piovesan and T. Erseghe. Cooperative localization in WSNs: A hybrid convex/nonconvex solution. IEEE Transactions on Signal and Information Processing over Networks, 4(1):162–172, 2018.
[24] T. Pock and S. Sabach. Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems. SIAM Journal on Imaging Sciences, 9(4):1756–1787, 2016.
[25] R. T. Rockafellar and R. J.-B. Wets. Variational analysis, volume 317. Springer Science
& Business Media, 2009.
[26] S. Sabach, M. Teboulle, and S. Voldman. A smoothing alternating minimization-based algorithm for clustering with sum-min of Euclidean norms. Pure and Applied Functional Analysis, 3:653–679, 2018.
[27] C. Soares, J. Xavier, and J. Gomes. Simple and fast convex relaxation method for coop-
erative localization in sensor networks using range measurements. IEEE Transactions
on Signal Processing, 63(17):4532–4543, 2015.
[28] J. Wang. Geometric structure of high-dimensional data and dimensionality reduction.
Springer, 2012.
[29] F. Wen, L. Chu, P. Liu, and R. C. Qiu. A survey on nonconvex regularization-based
sparse and low-rank recovery in signal processing, statistics, and machine learning. IEEE
Access, 6:69883–69906, 2018.