

Convergent Nested Alternating Minimization Algorithms for Non-Convex Optimization Problems

Eyal Gur, Shoham Sabach*, Shimrit Shtern†

October 1, 2020

Abstract

We introduce a new algorithmic framework for solving non-convex optimization problems, called Nested Alternating Minimization, which combines the classical Alternating Minimization technique with inner iterations of any optimization method. We provide a global convergence analysis of the new algorithmic framework to critical points of the problem at hand, which, to the best of our knowledge, is the first of its kind for nested methods in the non-convex setting. Central to our global convergence analysis is a new extension of classical proof techniques in the non-convex setting that allows for errors in the conditions. The power of our framework is illustrated with numerical experiments that show the superiority of this algorithmic framework over existing methods.

1 Introduction

In recent years, non-convex and non-smooth optimization problems have gained a lot of interest in numerous applied fields such as machine learning [17], signal processing [29, 23],

*This research was partially supported by the D.F.G German Research Foundation 800240.
†Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology, Technion City, Haifa 3200003, Israel (e-mail: [email protected], [email protected], [email protected]).

data science [13, 20], operations research [14, 12] and more. While the theoretical results in the convex setting (such as convergence analysis and rates of convergence) are extensive (see, for instance, the recent book [5] and references therein), the tools and results available in the non-convex setting are very limited, making this setting much more challenging. An important task in the non-convex problem domain, which has already gained success in several settings, is to establish global convergence of algorithms to critical points (see more detail in Section 2). In this paper, we are motivated by the analysis and application of a large and important class of algorithms, which we call Nested Alternating Minimization and which will be precisely defined in Section 3. Our contribution consists of providing a new general proof procedure that can be applied to obtain global convergence results for this class of algorithms. As far as we know, this is the first theoretical guarantee for this class of algorithms in the non-convex setting.

1.1 Global Convergence

Proving global convergence of an algorithm to a critical point involves showing that, for any starting point, the entire sequence generated by the algorithm converges to a single critical point of the optimization problem at hand. In recent years, this task has received a lot of attention, starting with the works [2, 3, 4] and later with the work [10], which formulates a simple proof technique for obtaining global convergence results for algorithms that satisfy three conditions in addition to the KL property of the objective function (see [19, 18, 9]). These works paved the way for many modifications and extensions that have been, and are still being, developed (e.g., [26, 22] and references therein). In this work we propose another proof technique for obtaining global convergence results in the non-convex setting. This extension allows more flexibility in accommodating the conditions of previously proposed recipes, in the sense that we propose a relaxed variant of the first two conditions by adding some non-negative error terms (our recipe is a further relaxation of the recipe suggested in [22], which only relaxes the second condition). The first condition consists of a sufficient decrease requirement of the algorithm with respect to all variables (see, for instance, [10]) or part of them (see, for instance, [26]), where the sufficient decrease can be measured in terms of the original objective function as in [10] or a related Lyapunov function as in [24]. The second

condition consists of bounding the iterates gap from below by the function's sub-gradient norm. As mentioned, we relax these conditions by allowing for some error terms.

In Section 2, following the approach introduced in [10] and recently summarized in [11], we develop the extended proof methodology in detail. Even though the techniques used to derive the new methodology are not new and are based on [10, 11], the newly developed technique allows us to easily analyze methods that are not necessarily sufficient-descent methods (due to the relaxation of the first condition), which in turn opens the gate to obtaining global convergence results for the very important and challenging class of Nested Alternating Minimization algorithms.

1.2 Nested Alternating Minimization Algorithms

The celebrated Alternating Minimization (AM) methodology is a technique for handling complicated optimization problems that has regained huge popularity in the last decade in the context of non-convex optimization. The AM methodology splits an optimization problem into smaller problems, which hopefully are easier to solve in closed form. Unfortunately, in many cases, especially in the non-convex setting, even the resulting sub-problems remain too complicated or too large to be solved explicitly. This led to a stream of papers in recent years that suggest approximating the solution of the sub-problems by computing one iteration of a certain optimization algorithm (see, for instance, [16, 24, 11, 26]). This approach of One Iteration Approximation (OIA) led, in many cases, to algorithms that enjoy global convergence guarantees using the techniques mentioned in Section 1.1. However, the idea of OIA is limited to algorithms with a certain sufficient descent property (see our discussion above). Motivated by approximating the solution of the sub-problems using methods without a sufficient decrease property (like FISTA [8] and MFISTA [7]), we study the class of Nested Alternating Minimization (NAM) algorithms, which approximate the solution of the sub-problems using several iterations instead of one, in order to achieve the relaxed sufficient descent property discussed above.

In Section 3 we propose the general algorithmic framework of NAM for solving non-convex and non-smooth problems. Our main contribution is to show that this algorithmic framework, which captures a wide spectrum of algorithms, generates a globally convergent

sequence to critical points. To this end we define the notion of Nested Friendly Algorithms (NFA), which captures the minimal requirements a nested algorithm must satisfy in order to achieve the global convergence of NAM. In Section 4 we illustrate the concept of NFA by presenting several examples that satisfy the requirements and can therefore be used in NAM. Another central aspect of applying NAM is determining the number of inner iterations which is sufficient to guarantee global convergence. This is also discussed in Section 4. Finally, in Section 5 we illustrate the advantage of the NAM framework in tackling the Regularized Structured Total Least Squares (RSTLS) problem via its application to an image processing task.

2 A Procedure for Global Convergence with Errors

Consider an extended real-valued function F : ℝ^n × ℝ^m → (−∞, ∞] which is bounded from below, and assume that F is proper and lower semi-continuous. We consider the following two-block minimization problem

    min_{(z,u) ∈ ℝ^n × ℝ^m} F(z, u).    (1)

Starting with any pair (z^0, u^0), let {(z^k, u^k)}_{k≥0} be a sequence generated by a generic algorithm, which we denote by A, that aims at tackling Problem (1) in the sense of global convergence to critical points of the function F. As mentioned above, our goal in this section is to extend recent proof techniques with a set of relaxed conditions that still guarantee global convergence (see [10] and also [11] for a recent and concise version). We also use the same notations and definitions from non-smooth analysis as in these papers. Therefore, for instance, when we write ∂F we mean the limiting sub-differential (see [25]). Following [10] we extend the definition of gradient-like descent sequences in the following way.

Definition 1. [Approximate gradient-like descent sequence]. A sequence {(z^k, u^k)}_{k≥0} is called an approximate gradient-like descent sequence for minimizing the function F : ℝ^n × ℝ^m → (−∞, ∞] if the following conditions hold:

(C1) Approximate sufficient decrease property. The sequence {F(z^k, u^k)}_{k≥0} is non-increasing, and there exist a positive scalar ρ₁ > 0 and non-negative error terms e₁^k ≥ 0 such that

    ρ₁ ‖z^{k+1} − z^k‖² − e₁^k ≤ F(z^k, u^k) − F(z^{k+1}, u^{k+1}),  ∀ k ≥ 0.

(C2) Approximate sub-gradient lower bound on the iterates gap. There exist a vector w^{k+1} ∈ ∂F(z^{k+1}, u^{k+1}), a positive scalar ρ₂ > 0 and non-negative error terms e₂^k ≥ 0 such that

    ‖w^{k+1}‖ ≤ ρ₂ ‖z^{k+1} − z^k‖ + e₂^k,  ∀ k ≥ 0.

(C3) Continuity. If (z̄, ū) is a limit point of some sub-sequence {(z^k, u^k)}_{k∈K⊆ℕ}, then

    lim sup_{k∈K⊆ℕ} F(z^k, u^k) ≤ F(z̄, ū).

(C4) Summability of the errors. The sequences {√(e₁^k)}_{k≥0} and {e₂^k}_{k≥0} are summable, that is,

    ∑_{k=1}^∞ √(e₁^k) < ∞  and  ∑_{k=1}^∞ e₂^k < ∞.

By Condition (C4), these errors should be small enough that their infinite sum is finite. See [22] for a set of conditions which includes errors only in Condition (C2).

We would also like to stress the fact that now, in Definition 1, the two-block structure comes into play, since the first two conditions are measured only in terms of the variable z (see also [26]). This adds another level of flexibility in proving the approximate gradient-like descent sequence property.
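On recorded iterates, the conditions of Definition 1 can be checked numerically up to a finite horizon. The following sketch is illustrative only; the function name, arguments, and tolerance are our own choices, not from the text, and summability (C4) can only be proxied by partial sums on a finite run:

```python
import numpy as np

def check_descent_conditions(F_vals, z_iters, subgrad_norms, e1, e2,
                             rho1=1.0, rho2=1.0, tol=1e-12):
    """Check conditions (C1), (C2) and (C4) of Definition 1 on recorded
    iterates. Illustrative naming: F_vals[k] = F(z^k, u^k), z_iters[k] = z^k,
    subgrad_norms[k] = ||w^k|| for some w^k in the sub-differential, and
    e1[k], e2[k] are the error terms e_1^k, e_2^k."""
    gaps = [np.linalg.norm(z_iters[k + 1] - z_iters[k])
            for k in range(len(z_iters) - 1)]
    # (C1): rho1 * ||z^{k+1} - z^k||^2 - e_1^k <= F(z^k,u^k) - F(z^{k+1},u^{k+1})
    c1 = all(rho1 * g ** 2 - e1[k] <= F_vals[k] - F_vals[k + 1] + tol
             for k, g in enumerate(gaps))
    # (C2): ||w^{k+1}|| <= rho2 * ||z^{k+1} - z^k|| + e_2^k
    c2 = all(subgrad_norms[k + 1] <= rho2 * g + e2[k] + tol
             for k, g in enumerate(gaps))
    # (C4): on a finite run we can only report the partial sums that the
    # condition requires to stay bounded
    c4_partial_sums = (float(np.sum(np.sqrt(e1))), float(np.sum(e2)))
    return c1, c2, c4_partial_sums
```

For instance, gradient descent with step 1/2 on F(z, u) = ‖z‖² (no u-block) satisfies (C1) and (C2) with ρ₁ = 1, ρ₂ = 2 and zero error terms.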

Now, we are in a position to prove that these relaxed conditions are enough to guarantee global convergence to critical points of F. To this end, we denote by ω(x^0) the set of limit points of a sequence {x^k}_{k≥0}, and for any function f we denote by crit(f) the set of its critical points.

Lemma 1. Let {(z^k, u^k)}_{k≥0} be a bounded approximate gradient-like descent sequence for minimizing F of Problem (1). Then, ω(z^0, u^0) is a non-empty and compact subset of crit(F) and

    lim_{k→∞} dist((z^k, u^k), ω(z^0, u^0)) = 0.

In addition, the function F is finite and constant on ω(z^0, u^0).

Proof. Since {(z^k, u^k)}_{k≥0} is bounded, ω(z^0, u^0) is a non-empty and compact set. Therefore, it follows that dist((z^k, u^k), ω(z^0, u^0)) converges to 0 as k goes to infinity. From Condition (C1) we have

    ‖z^{k+1} − z^k‖² ≤ (1/ρ₁) (F(z^k, u^k) − F(z^{k+1}, u^{k+1}) + e₁^k).    (2)

The non-ascent property required in Condition (C1) and the boundedness from below of F yield the convergence of the sequence {F(z^k, u^k)}_{k≥0}. This, together with Condition (C4), implies from (2) that

    lim_{k→∞} ‖z^{k+1} − z^k‖ = 0.    (3)

Let {(z^k, u^k)}_{k∈K} be a sub-sequence converging to some pair (z*, u*). Denote F* ≡ F(z*, u*). From Condition (C3) and the lower semi-continuity of F we derive that

    lim_{k→∞} F(z^k, u^k) = lim_{k∈K} F(z^k, u^k) = F*.    (4)

From Condition (C2) there exists a sequence {w^k}_{k≥0}, with w^k ∈ ∂F(z^k, u^k) for all k ≥ 0, that satisfies

    lim_{k→∞} ‖w^k‖ ≤ lim_{k→∞} (ρ₂ ‖z^{k+1} − z^k‖ + e₂^k) = 0,

where the equality follows from (3) and Condition (C4). This means that the sequence {w^k}_{k≥0} converges to 0 as k → ∞, and therefore (z*, u*) ∈ crit(F), as follows from the closedness property of the sub-differential (see [25]). Finally, since {F(z^k, u^k)}_{k≥0} converges, we can assume that lim_{k→∞} F(z^k, u^k) = c ∈ ℝ. Then lim_{k∈K} F(z^k, u^k) = c, and from (4) we derive that F is finite and constant over the set ω(z^0, u^0).

We now prove the main result of this section. It states that if a generic optimization method A generates a bounded approximate gradient-like descent sequence for minimizing F of Problem (1), then it globally converges to a critical point of F. It should be noted that this unique limit point depends on the initialization point of A.

Theorem 1. [Global convergence]. Let {(z^k, u^k)}_{k≥0} be a bounded approximate gradient-like descent sequence for minimizing F of Problem (1). If F satisfies the KL property, then the sequence {z^k}_{k≥0} has finite length and globally converges to a point z*. Moreover, let u* be any limit point of the sequence {u^k}_{k≥0}; then (z*, u*) ∈ crit(F).

Proof. Let (z*, u*) be a limit point of the sequence {(z^k, u^k)}_{k≥0}, which exists due to the boundedness assumption. From Lemma 1 (see (4)) we derive

    lim_{k→∞} F(z^k, u^k) = F(z*, u*).

Assume first that there exists k̃ ∈ ℕ such that F(z^k, u^k) = F(z*, u*) and e₁^k = 0 for all k ≥ k̃. Then, from Condition (C1) it follows that {F(z^k, u^k)}_{k≥0} has a "regular" sufficient decrease property, and therefore z^{k+1} = z^k for all k ≥ k̃. This proves finite convergence of the sequence {z^k}_{k≥0}, and obviously the desired result.

Now, on the other hand, assume that F(z^k, u^k) > F(z*, u*) for all k ≥ 0. We know from Condition (C1) that the sequence {F(z^k, u^k)}_{k≥0} is non-increasing and converges to F(z*, u*). Thus, for any γ > 0 there exists an integer k₀ such that F(z^k, u^k) < F(z*, u*) + γ for all k > k₀. From Lemma 1,

    lim_{k→∞} dist((z^k, u^k), ω(z^0, u^0)) = 0.

So, for any ε > 0 there exists an integer k₁ such that dist((z^k, u^k), ω(z^0, u^0)) < ε for any k > k₁. Again from Lemma 1, the function F is finite and constant on the non-empty and compact set ω(z^0, u^0). Since F(z^k, u^k) − F(z*, u*) > 0, it follows that ϕ′(F(z^k, u^k) − F(z*, u*)) is well-defined, where ϕ is any desingularizing function (see [9]).

From the Uniformized KL property [10, Lemma 6] there exists a desingularizing function ϕ such that for any k > s ≡ max{k₀, k₁} + 1 we have

    ϕ′(F(z^k, u^k) − F(z*, u*)) · dist(0, ∂F(z^k, u^k)) ≥ 1.    (5)

Since dist(0, ∂F(z^k, u^k)) ≤ ‖w^k‖ for any w^k ∈ ∂F(z^k, u^k), it follows from Condition (C2) and (5) that

    ϕ′(F(z^k, u^k) − F(z*, u*)) ≥ 1 / (ρ₂ ‖z^k − z^{k−1}‖ + e₂^{k−1}).    (6)

For n₁, n₂ ∈ ℕ we denote

    Δ_{n₁,n₂} ≡ ϕ(F(z^{n₁}, u^{n₁}) − F(z*, u*)) − ϕ(F(z^{n₂}, u^{n₂}) − F(z*, u*)).

Since ϕ is concave, it follows from the gradient inequality that

    Δ_{k,k+1} ≥ ϕ′(F(z^k, u^k) − F(z*, u*)) · (F(z^k, u^k) − F(z^{k+1}, u^{k+1})).    (7)

Combining (6) and (7) with Condition (C1) we establish

    Δ_{k,k+1} ≥ ρ₁ (‖z^{k+1} − z^k‖² − e₁^k/ρ₁) / (ρ₂ (‖z^k − z^{k−1}‖ + e₂^{k−1}/ρ₂)).    (8)

Denote c = ρ₂/ρ₁, ẽ₁^k := e₁^k/ρ₁ and ẽ₂^k := e₂^k/ρ₂. We obtain from (8) that

    √(c Δ_{k,k+1} (‖z^k − z^{k−1}‖ + ẽ₂^{k−1}) + ẽ₁^k) ≥ ‖z^{k+1} − z^k‖.    (9)

Now, since c Δ_{k,k+1} (‖z^k − z^{k−1}‖ + ẽ₂^{k−1}) ≥ 0 and ẽ₁^k ≥ 0, we obtain from (9) and the geometric-arithmetic mean inequality that

    ‖z^{k+1} − z^k‖ ≤ √(c Δ_{k,k+1} (‖z^k − z^{k−1}‖ + ẽ₂^{k−1})) + √(ẽ₁^k)
        ≤ (c/2) Δ_{k,k+1} + (1/2) ‖z^k − z^{k−1}‖ + (1/2) ẽ₂^{k−1} + √(ẽ₁^k).    (10)

Summing up inequalities (10) for r = s+1, s+2, ..., k, we derive that

    2 ∑_{r=s+1}^k ‖z^{r+1} − z^r‖ ≤ ∑_{r=s+1}^k ‖z^r − z^{r−1}‖ + c ∑_{r=s+1}^k Δ_{r,r+1} + ∑_{r=s+1}^k (2√(ẽ₁^r) + ẽ₂^{r−1})
        ≤ ∑_{r=s+1}^k ‖z^{r+1} − z^r‖ + ‖z^{s+1} − z^s‖ + c ∑_{r=s+1}^k Δ_{r,r+1} + ∑_{r=s+1}^k (2√(ẽ₁^r) + ẽ₂^{r−1})
        = ∑_{r=s+1}^k ‖z^{r+1} − z^r‖ + ‖z^{s+1} − z^s‖ + c Δ_{s+1,k+1} + ∑_{r=s+1}^k (2√(ẽ₁^r) + ẽ₂^{r−1}),    (11)

where the last equality follows from the definition of Δ_{r,r+1}. We now see that

    ∑_{r=s+1}^k ‖z^{r+1} − z^r‖ ≤ ‖z^{s+1} − z^s‖ + ∑_{r=s+1}^k (2√(ẽ₁^r) + ẽ₂^{r−1}) + c Δ_{s+1,k+1}
        ≤ ‖z^{s+1} − z^s‖ + ∑_{r=s+1}^k (2√(ẽ₁^r) + ẽ₂^{r−1}) + c ϕ(F(z^{s+1}, u^{s+1}) − F(z*, u*)),    (12)

where the last inequality follows from the definition of Δ_{s+1,k+1} and the fact that ϕ ≥ 0. Due to Condition (C4), the infinite sum of errors in (12) is also finite, and therefore we finally derive that ∑_{r=1}^∞ ‖z^{r+1} − z^r‖ < ∞. Hence, the sequence {z^k}_{k≥0} is of finite length, and it converges globally to some z*. Last, from Lemma 1 it follows that (z*, u*) is a critical point of the function F.

3 Global Convergence of Nested Alternating Minimization Algorithms

In this section we introduce the class of Nested Alternating Minimization (NAM) algorithms and provide a global convergence result for this class. To better illustrate the concept of this class, we consider the optimization model (1), where the vector z is split into two blocks as follows: z ≡ (x^T, y^T)^T ∈ ℝ^p × ℝ^q. More precisely, we focus here on the following three-block model

    min_{(x,y,u) ∈ ℝ^p × ℝ^q × ℝ^m} {F(x, y, u) ≡ G(x, y, u) + H(u)},    (13)

where the function G : ℝ^p × ℝ^q × ℝ^m → ℝ is non-convex and smooth, and H : ℝ^m → (−∞, ∞] is non-convex and non-smooth (precise assumptions will be made below). The results in this section can easily be extended to any finite number of blocks of variables; however, in order to keep the developments simple, we address the case where the block z is decomposed into only two blocks. The algorithmic framework of Alternating Minimization suggests alternating between the three blocks of Problem (13) in a cyclic fashion. This results in three sub-problems that should be solved exactly at each iteration. However, as already discussed in the Introduction, here we are interested in their approximations. Therefore, a reason to consider this three-block model is the assumption that the sub-problems with respect to the first two blocks x and y will be approximated using nested algorithms, while the third block will remain exactly solved. More precisely, we now introduce the class of Nested Alternating Minimization algorithms.

Algorithm 1 Nested Alternating Minimization (NAM) Scheme

1: Input: two nested friendly optimization algorithms A1 and A2.
2: Initialization: (x^{−1}, y^{−1}, u^0) ∈ ℝ^p × ℝ^q × ℝ^m.
3: Iterative step:
4: for k ≥ 0 do
5:   Set x^k after i_{k−1} iterations of A1 starting with x^{k−1} for minimizing F(·, y^{k−1}, u^k).
6:   Set y^k after j_{k−1} iterations of A2 starting with y^{k−1} for minimizing F(x^k, ·, u^k).
7:   u^{k+1} ∈ argmin {F(x^k, y^k, u) : u ∈ ℝ^m}.
8: end for
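In code, the scheme above can be sketched as follows. This is a minimal illustration on a toy instance: the smooth model F(x, y, u) = 0.5‖x − u‖² + 0.5‖y − u‖² + 0.5‖u − b‖² (with H = 0), plain gradient descent standing in for the nested algorithms A1 and A2, and the growing inner-iteration schedule are all assumptions made for this sketch, not prescriptions from the text:

```python
import numpy as np

def grad_steps(grad, v, step, n_iter):
    # Plain gradient descent, used here as a stand-in for the nested
    # algorithms A1 and A2 (the scheme allows any nested friendly algorithm).
    for _ in range(n_iter):
        v = v - step * grad(v)
    return v

def nam(b, outer_iters=50, inner_iters=lambda k: 1 + k):
    """Sketch of the NAM scheme on the illustrative model
    F(x, y, u) = 0.5||x-u||^2 + 0.5||y-u||^2 + 0.5||u-b||^2  (H = 0).
    The model, step size and inner-iteration schedule are our own choices."""
    x = np.zeros_like(b)
    y = np.zeros_like(b)
    u = np.zeros_like(b)
    for k in range(outer_iters):
        n = inner_iters(k)
        # Step 5: approximately minimize F(., y, u) in x (gradient is x - u)
        x = grad_steps(lambda v: v - u, x, step=0.5, n_iter=n)
        # Step 6: approximately minimize F(x, ., u) in y (gradient is y - u)
        y = grad_steps(lambda v: v - u, y, step=0.5, n_iter=n)
        # Step 7: exact minimization in u, in closed form for this model
        u = (x + y + b) / 3.0
    return x, y, u
```

On this toy model the three blocks are driven to the common minimizer b; the growing schedule for the number of inner iterations anticipates the requirement, discussed in Section 4, that the inner iterations increase along the outer iterations.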

It should be noted that algorithms A1 and A2, which are used in NAM, need not be the same. This adds flexibility in choosing algorithms which best fit the structure of each sub-problem. In order to obtain the global convergence of NAM, we will make the following assumption on the involved objective function F.

Assumption A. (i) For any fixed (y, u) ∈ ℝ^q × ℝ^m, the function x ↦ F(x, y, u) is σ₁(y, u)-strongly convex with an L₁(y, u)-Lipschitz continuous gradient. In addition, for any compact D ⊆ ℝ^q × ℝ^m there exist positive scalars σ̄₁ and L̄₁ such that inf_{(y,u)∈D} σ₁(y, u) = σ̄₁ and sup_{(y,u)∈D} L₁(y, u) = L̄₁.

(ii) For any fixed (x, u) ∈ ℝ^p × ℝ^m, the function y ↦ F(x, y, u) is σ₂(x, u)-strongly convex with an L₂(x, u)-Lipschitz continuous gradient. In addition, for any compact D ⊆ ℝ^p × ℝ^m there exist positive scalars σ̄₂ and L̄₂ such that inf_{(x,u)∈D} σ₂(x, u) = σ̄₂ and sup_{(x,u)∈D} L₂(x, u) = L̄₂.

(iii) The gradient of the function G is L-Lipschitz continuous on bounded subsets of ℝ^p × ℝ^q × ℝ^m.

At this juncture we would like to note that this assumption is satisfied in many challenging applications, such as the Regularized Structured Total Least Squares problem (which will be discussed in detail in Section 5), the Wireless Sensor Network Localization problem [27], Multi-Dimensional Scaling [28, Chapter 6], and more.

Before proceeding with NAM, we would like to discuss the requirements on the nested algorithms for guaranteeing the global convergence of NAM. To this end, and as already mentioned in NAM, we define the following concept.

Definition 2. [Nested friendly algorithm]. Let ϕ_k : ℝ^d → ℝ, k ∈ ℕ, be a family of convex functions. Let A be an optimization algorithm that generates, for any k ∈ ℕ, a sequence {v^{k,i}}_{i≥0} starting from v^{k,0}. We say that A is a Nested Friendly Algorithm (NFA) with respect to {ϕ_k}_{k≥0} if for any k ∈ ℕ there exist an index i_k ∈ ℕ and a sequence of non-negative scalars {c_A(k, i)}_{i≥0} such that

(i) lim_{i→∞} c_A(k, i) = 0.

(ii) ϕ_k(v^{k,i_k}) − ϕ_k^* ≤ (c_A²(k, i_k)/2) · ‖v^{k,0} − v_k^*‖², where v_k^* is a minimizer of ϕ_k with an optimal value of ϕ_k^*.

(iii) ∑_{k=1}^∞ √(c_A(k, i_k)) < ∞.
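As a concrete illustration of Definition 2(ii), consider plain gradient descent with step size 1/L on a σ-strongly convex and L-smooth function (an example of our own choosing, not one named in the text). The classical bound ϕ(v^i) − ϕ* ≤ (L/2)(1 − σ/L)^i ‖v^0 − v*‖² corresponds to the rate sequence c_A(k, i) = √L · (1 − σ/L)^{i/2}, which the following sketch verifies numerically on a two-dimensional quadratic:

```python
import numpy as np

def gd_rate_certificate(sigma=0.5, L=4.0, n_iter=30, seed=0):
    """Check Definition 2(ii) numerically for gradient descent with step 1/L
    on the quadratic phi(v) = 0.5 * (sigma * v_1^2 + L * v_2^2), for which
    v* = 0 and phi* = 0. The candidate rate sequence is
    c_A(k, i) = sqrt(L) * (1 - sigma/L)^(i/2)."""
    rng = np.random.default_rng(seed)
    diag = np.array([sigma, L])
    v0 = rng.standard_normal(2)
    v, ok = v0.copy(), True
    for i in range(n_iter + 1):
        c = np.sqrt(L) * (1.0 - sigma / L) ** (i / 2.0)  # c_A(k, i)
        phi = 0.5 * float(diag @ (v ** 2))               # phi(v^i), phi* = 0
        ok = ok and phi <= 0.5 * c ** 2 * float(v0 @ v0) + 1e-12
        v = v - (1.0 / L) * (diag * v)                   # gradient step, step 1/L
    return bool(ok)
```

Since c_A(k, i) decays geometrically in i here, requirement (i) holds, and requirement (iii) can then be met by a suitable choice of the indices i_k.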

This definition is crucial for us to derive the global convergence of NAM, which will be carried out next. We will discuss examples of optimization algorithms which are NFAs in Section 4. Before getting into the convergence of NAM, we would like to show that NFAs satisfy some properties which will be useful for us in proving the global convergence of NAM.

Lemma 2. Let ϕ_k : ℝ^d → ℝ, k ∈ ℕ, be a family of continuously differentiable and σ_k-strongly convex functions, and assume that the gradient of each ϕ_k is L_k-Lipschitz continuous. Let A be an NFA with respect to {ϕ_k}_{k≥0} that generates sequences {v^{k,i}}_{i≥0} with v^{k,0} = v^{k−1,i_{k−1}} for some v^{0,0} = v^0. Then, the following properties hold:

(i) Overall non-increasing function value. ϕ_k(v^{k,i_k}) ≤ ϕ_k(v^{k,0}) = ϕ_k(v^{k−1,i_{k−1}}).

(ii) Convergence of the gradients. ‖∇ϕ_k(v^{k,i_k})‖ ≤ δ_{k,i_k} ≡ L_k c_A(k, i_k) ‖v^{k,0} − v_k^*‖ / √(σ_k).

Proof. To prove that item (i) holds, we use the strong convexity of ϕ_k to establish that

    ϕ_k(v^{k,0}) ≥ ϕ_k^* + ∇ϕ_k(v_k^*)^T (v^{k,0} − v_k^*) + (σ_k/2) ‖v^{k,0} − v_k^*‖² = ϕ_k^* + (σ_k/2) ‖v^{k,0} − v_k^*‖²,    (14)

where the last equality follows from the fact that v_k^* is a minimizer of ϕ_k and therefore ∇ϕ_k(v_k^*) = 0. This fact with Definition 2(ii) yields that

    ϕ_k(v^{k,i_k}) − ϕ_k^* ≤ (c_A²(k, i_k)/σ_k) (ϕ_k(v^{k,0}) − ϕ_k^*).

Therefore, item (i) holds for any i_k ≥ 0 that satisfies c_A²(k, i_k) ≤ σ_k. Since Definition 2(i) states that c_A(k, i) → 0 as i → ∞, there must exist such an i_k ∈ ℕ.

To show that item (ii) holds, we use the fact that the functions ϕ_k are σ_k-strongly convex with L_k-Lipschitz continuous gradients for all k ≥ 0 to obtain that

    ϕ_k(v^{k,i_k}) − ϕ_k^* ≥ (σ_k/2) ‖v^{k,i_k} − v_k^*‖² ≥ (σ_k/(2L_k²)) ‖∇ϕ_k(v^{k,i_k}) − ∇ϕ_k(v_k^*)‖² = (σ_k/(2L_k²)) ‖∇ϕ_k(v^{k,i_k})‖²,

where the first inequality follows again from the fact that ∇ϕ_k(v_k^*) = 0. This fact with Definition 2(ii) yields that

    ‖∇ϕ_k(v^{k,i_k})‖ ≤ (c_A(k, i_k)/√(σ_k)) · L_k ‖v^{k,0} − v_k^*‖,    (15)

and item (ii) holds true with δ_{k,i_k} ≡ L_k c_A(k, i_k) ‖v^{k,0} − v_k^*‖ / √(σ_k).

Remark 1. Equipped with the properties of NFAs from Lemma 2, we would like to describe the concept of NFA in the context of NAM. For example, according to Lemma 2, saying that algorithm A1 in NAM is an NFA means that, with ϕ_k(x) := F(x, y^k, u^{k+1}) for all k ∈ ℕ, at each outer iteration k algorithm A1 iteratively generates a sequence {x^{k,i}}_{i≥0} which starts at x^{k,0} = x^k and stops after i_k inner iterations at x^{k,i_k} = x^{k+1} (where the index i_k is as in NAM and Lemma 2) such that

(i) F(x^{k+1}, y^k, u^{k+1}) ≤ F(x^k, y^k, u^{k+1}) for all k ≥ 0.

(ii) ‖∇_x F(x^{k+1}, y^k, u^{k+1})‖ ≤ δ_x^{k+1}, where δ_x^{k+1} ≡ δ_{k,i_k} = L_k c_A(k, i_k) ‖x^k − x_k^*‖ / √(σ_k).

Similar properties follow for algorithm A2 with respect to the y-block.

Now we are ready to prove the following main result.

Theorem 2. [Global convergence of NAM]. Let {(x^k, y^k, u^k)}_{k≥0} be a bounded sequence generated by NAM. If F satisfies Assumption A and the KL property, then the sequence {(x^k, y^k)}_{k≥0} has finite length and globally converges to some pair (x*, y*). Moreover, let u* be any limit point of the sequence {u^k}_{k≥0}; then (x*, y*, u*) ∈ crit(F).

Proof. Since A1 is an NFA, we denote ϕ_k(x) = F(x, y^k, u^{k+1}) for each k ≥ 0, and derive from Assumption A that ϕ_k is σ_k(y^k, u^{k+1})-strongly convex with an L_k(y^k, u^{k+1})-Lipschitz continuous gradient. Therefore, Lemma 2 can be applied to {ϕ_k}_{k≥0}, and the properties of Remark 1 hold true. Similar arguments hold for the y-block, since A2 is also an NFA. Throughout the proof, x^{k,0} = x^k, x^{k,i_k} = x^{k+1}, y^{k,0} = y^k and y^{k,j_k} = y^{k+1} for all k ∈ ℕ.

We will show that the sequence {(x^k, y^k, u^k)}_{k≥0} is an approximate gradient-like descent sequence according to Definition 1, which is enough to obtain the stated result due to Theorem 1 with z = (x, y). To this end, we first prove that Condition (C1) is satisfied. By the definition of u^{k+1} (see step 7 in NAM) we have

    F(x^k, y^k, u^k) ≥ F(x^k, y^k, u^{k+1}) ≥ F(x^{k+1}, y^k, u^{k+1}) ≥ F(x^{k+1}, y^{k+1}, u^{k+1}),    (16)

where the last two inequalities hold true using the fact that algorithms A1 and A2 are NFAs (see Remark 1). Therefore, the sequence {F(x^k, y^k, u^k)}_{k≥0} is non-increasing. Since the sequences {x^k}_{k≥0} and {y^k}_{k≥0} are bounded as assumed, there exists M ≥ 0 such that

‖x^k − x^{k+1}‖ ≤ M and ‖y^k − y^{k+1}‖ ≤ M for all k ≥ 0. In addition, from strong convexity we establish that

    F(x^k, y^k, u^{k+1}) − F(x^{k+1}, y^k, u^{k+1})
        ≥ ∇_x F(x^{k+1}, y^k, u^{k+1})^T (x^k − x^{k+1}) + (σ₁(y^k, u^{k+1})/2) ‖x^{k+1} − x^k‖²
        ≥ −‖∇_x F(x^{k+1}, y^k, u^{k+1})‖ · ‖x^k − x^{k+1}‖ + (σ₁(y^k, u^{k+1})/2) ‖x^{k+1} − x^k‖²
        ≥ (σ̄₁/2) ‖x^{k+1} − x^k‖² − M δ_x^{k+1},    (17)

where the last inequality follows from Assumption A(i) and Lemma 2(ii). Similarly, for the y-block we obtain

    F(x^{k+1}, y^k, u^{k+1}) − F(x^{k+1}, y^{k+1}, u^{k+1}) ≥ (σ̄₂/2) ‖y^{k+1} − y^k‖² − M δ_y^{k+1}.    (18)

Combining (16), (17) and (18) we obtain that

    F(x^k, y^k, u^k) − F(x^{k+1}, y^{k+1}, u^{k+1}) ≥ ρ₁ ‖z^{k+1} − z^k‖² − e₁^k,

where ρ₁ = (1/2) · min{σ̄₁, σ̄₂} and e₁^k = M (δ_x^{k+1} + δ_y^{k+1}). This proves Condition (C1).

Now we prove that Condition (C2) is satisfied. We notice that

    ∂F(z^{k+1}, u^{k+1}) = (∇_z F(z^{k+1}, u^{k+1}), ∂_u F(z^{k+1}, u^{k+1}))
                         = (∇_z F(z^{k+1}, u^{k+1}), ∇_u G(z^{k+1}, u^{k+1}) + ∂H(u^{k+1})).    (19)

Since u^{k+1} is a minimizer of the function u ↦ F(z^k, u), it follows from the first-order optimality condition that

    0 ∈ ∇_u G(z^k, u^{k+1}) + ∂H(u^{k+1}).    (20)

Therefore, combining (19) with (20) we establish the inclusion

    w^{k+1} ≡ (∇_z F(z^{k+1}, u^{k+1}), ∇_u G(z^{k+1}, u^{k+1}) − ∇_u G(z^k, u^{k+1})) ∈ ∂F(z^{k+1}, u^{k+1}).    (21)

From the triangle inequality we have

    ‖∇_z F(z^{k+1}, u^{k+1})‖ ≤ ‖(∇_x F(x^{k+1}, y^{k+1}, u^{k+1}) − ∇_x F(x^{k+1}, y^k, u^{k+1}), 0)‖
                              + ‖(∇_x F(x^{k+1}, y^k, u^{k+1}), ∇_y F(x^{k+1}, y^{k+1}, u^{k+1}))‖.    (22)

Since {(x^k, y^k, u^k)}_{k≥0} is bounded, using Assumption A(iii) and Remark 1(ii) we obtain from (22) that

    ‖∇_z F(z^{k+1}, u^{k+1})‖ ≤ L ‖y^{k+1} − y^k‖ + δ_x^{k+1} + δ_y^{k+1} ≤ L ‖z^{k+1} − z^k‖ + δ_x^{k+1} + δ_y^{k+1}.    (23)

From (21) and (23), we derive using the triangle inequality and Assumption A(iii) that

    ‖w^{k+1}‖ ≤ ‖∇_z F(z^{k+1}, u^{k+1})‖ + ‖∇_u G(z^{k+1}, u^{k+1}) − ∇_u G(z^k, u^{k+1})‖ ≤ ρ₂ ‖z^{k+1} − z^k‖ + e₂^k,

where ρ₂ = 2L and e₂^k = δ_x^{k+1} + δ_y^{k+1}. This proves Condition (C2).

To prove Condition (C3), we take a sub-sequence {(z^{q_k}, u^{q_k})}_{k≥0} which converges to some pair (z*, u*). From the definition of u^{k+1} we obtain that

    G(z^k, u^{k+1}) + H(u^{k+1}) ≤ G(z^k, u*) + H(u*),

and thus from Assumption A(iii) it follows that

    H(u^{k+1}) ≤ H(u*) + G(z^k, u*) − G(z^k, u^{k+1})
               ≤ H(u*) + ∇_u G(z^k, u^{k+1})^T (u* − u^{k+1}) + (L/2) ‖u* − u^{k+1}‖².

Substituting k with q_k − 1 and taking k → ∞ yields

    lim sup_{k→∞} H(u^{q_k}) ≤ H(u*).

Using the continuity of G we obtain Condition (C3). Finally, in order to prove Condition (C4) we have to show that both {√(e₁^k)}_{k≥0} and {e₂^k}_{k≥0} are summable. Therefore, by the definitions of e₁^k and e₂^k, it is enough to show that {√(e₁^k)}_{k≥0}

is summable. From the σ̄₁-strong convexity of the function x ↦ F(x, y^{k−1}, u^k) (see Assumption A(i)) we have

    F(x^k, y^{k−1}, u^k) − F(x_k^*, y^{k−1}, u^k) ≥ ∇_x F(x_k^*, y^{k−1}, u^k)^T (x^k − x_k^*) + (σ̄₁/2) ‖x^k − x_k^*‖²
                                                  = (σ̄₁/2) ‖x^k − x_k^*‖²,    (24)

where the last equality follows from the fact that ∇_x F(x_k^*, y^{k−1}, u^k) = 0. Therefore, since we assume that F is bounded from below by some F̄ ∈ ℝ, it follows from (24) that

    ‖x^k − x_k^*‖² ≤ (2/σ̄₁) (F(x^k, y^{k−1}, u^k) − F(x_k^*, y^{k−1}, u^k)) ≤ (2/σ̄₁) (F(x^{−1}, y^{−1}, u^1) − F̄),    (25)

where the last inequality follows from (16). Now, using Remark 1(ii), we derive from (25) that

    δ_x^{k+1} = (L_k/√(σ_k)) c_{A1}(k, i_k) ‖x^k − x_k^*‖ ≤ (L̄₁/√(σ̄₁)) c_{A1}(k, i_k) ‖x^k − x_k^*‖
              ≤ √(2 (F(x^{−1}, y^{−1}, u^1) − F̄)) · κ̄₁ · c_{A1}(k, i_k),

where κ̄₁ ≡ L̄₁/σ̄₁. Therefore, from Definition 2(iii) we obtain that ∑_{k=1}^∞ √(δ_x^{k+1}) < ∞.

and therefore c_A(k, i) = 2√(L_k)/(i + 1), where L_k is the Lipschitz constant of the gradient of the function ϕ_k. So, as can be seen, any optimization method with a provable rate of convergence in terms of function values is an NFA (the third requirement, to be discussed below, will also be valid). Before discussing the number of inner iterations we present another example. If ϕ_k, k ∈ ℕ, is additionally σ_k-strongly convex (as will be the case in NAM due to Assumption A), then we can use a variant of AG called V-AG which exploits the strong convexity and enjoys a better rate (see [5, Theorem 10.42]):

    ϕ_k(v^{k,i}) − ϕ_k^* ≤ (1 − 1/√(κ_k))^i · ((σ_k + L_k)/2) ‖v^{k,0} − v_k^*‖²,

and therefore c_A(k, i) = √(σ_k + L_k) · (1 − κ_k^{−1/2})^{i/2}, where κ_k = L_k/σ_k.

Now we would like to focus on the issue of determining the number of inner iterations $i_k$ without focusing on a specific algorithm. From Lemma 2 we see that there are two requirements on $i_k$:

(a) $c_{\mathcal{A}}^2(k, i_k) \leq \sigma_k$, where $\sigma_k$ is the strong convexity parameter of $\varphi_k$.

(b) Summability of the sequence $\{\sqrt{c_{\mathcal{A}}(k, i_k)}\}_{k \geq 0}$.

We should mention that it is enough to require that the above two requirements are satisfied for all $k \geq K$ for some $K \in \mathbb{N}$ (and we use this fact in Section 5). Notice that each requirement is satisfied for some $i_k$, and the number of inner iterations is chosen to be the maximum between them. The first requirement can be easily verified since $c_{\mathcal{A}}(k, i_k)$ has an explicit dependence on $i_k$. For instance, in the case of AG we have that
$$c_{\mathcal{A}}(k, i_k) = \frac{2\sqrt{L_k}}{i_k + 1},$$
and therefore $\left\lceil 2\sqrt{\kappa_k} - 1 \right\rceil \leq i_k$, where $\kappa_k = L_k/\sigma_k$. Similarly, we can show that for V-AG we have
$$\left\lceil \frac{\log(\sigma_k + L_k) - \log(2\sigma_k)}{\log\left(\sqrt{\kappa_k}\right) - \log\left(\sqrt{\kappa_k} - 1\right)} \right\rceil \leq i_k.$$
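To make the two lower bounds on $i_k$ concrete, the following sketch (not from the paper; the values $\sigma_k = 1$ and $L_k = 100$ are purely illustrative) evaluates the minimal number of inner iterations implied by requirement (a) for AG and for V-AG:

```python
import math

def min_iters_ag(kappa):
    # AG: c_A(k, i)^2 = 4 L_k / (i + 1)^2 <= sigma_k  <=>  i >= 2 sqrt(kappa) - 1,
    # with condition number kappa = L_k / sigma_k.
    return math.ceil(2.0 * math.sqrt(kappa) - 1.0)

def min_iters_vag(sigma, L):
    # V-AG bound from the text:
    # i >= (log(sigma + L) - log(2 sigma)) / (log(sqrt(kappa)) - log(sqrt(kappa) - 1)).
    kappa = L / sigma
    num = math.log(sigma + L) - math.log(2.0 * sigma)
    den = math.log(math.sqrt(kappa)) - math.log(math.sqrt(kappa) - 1.0)
    return math.ceil(num / den)

# Illustrative subproblem with sigma_k = 1 and L_k = 100 (kappa = 100).
print(min_iters_ag(100.0))        # 19
print(min_iters_vag(1.0, 100.0))  # 38
```

Note that for large $\kappa_k$ the V-AG threshold can exceed the AG one, which is consistent with the practical issues with V-AG discussed in Section 5.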

The second requirement can also be easily satisfied, since all we need is to guarantee that the sequence $\{\sqrt{c_{\mathcal{A}}(k, i_k)}\}_{k \geq 0}$ converges to 0 fast enough so that its sum is finite. Since $c_{\mathcal{A}}(k, i)$ converges to 0 as $i \to \infty$, independently of the specific structure of $c_{\mathcal{A}}(k, i)$, it is enough to take a sequence $\{i_k\}_{k \in \mathbb{N}}$ which grows fast enough to obtain the summability. For example, if we set $i_k = s + 2^{\lfloor (k+1)/r \rfloor}$ for some fixed integers $s$ and $r$, we easily obtain the desired requirement.

It should be noted that since the NFA is used within the framework of NAM, we can rely on Assumption A to bound the strong convexity parameters $\sigma_k$ and the Lipschitz constants $L_k$ by $\bar{\sigma}$ and $\bar{L}$, respectively. For example, in the case of AG and V-AG, the number of inner iterations $i_k$ that satisfies (a) can also be determined in terms of $\kappa = \bar{L}/\bar{\sigma}$ instead of $\kappa_k$. Since $\kappa$ is fixed for any $k \in \mathbb{N}$, taking $i_k = s + 2^{\lfloor (k+1)/r \rfloor}$ (as described above) guarantees that both (a) and (b) are satisfied simultaneously, and there is no need to calculate the value of $\kappa$.
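The doubling schedule above is straightforward to implement; the sketch below (with the illustrative choice $L_k \equiv 1$ in the AG coefficient) reproduces the inner iteration counts reported later in Section 5.3 and numerically illustrates the summability of $\sqrt{c_{\mathcal{A}}(k, i_k)}$:

```python
import math

def inner_iters(k, s=10, r=10):
    # Schedule from the text: i_k = s + 2^floor((k + 1) / r).
    return s + 2 ** ((k + 1) // r)

# With s = r = 10 this gives i_0 = 11, i_9 = 12, i_19 = 14, i_29 = 18, ...
counts = [inner_iters(k) for k in (0, 9, 19, 29)]
print(counts)  # [11, 12, 14, 18]

# For AG, c_A(k, i) = 2 sqrt(L_k) / (i + 1); taking L_k = 1 for illustration,
# the geometric growth of i_k makes sqrt(c_A(k, i_k)) summable, so the
# partial sums stabilize quickly.
partial = sum(math.sqrt(2.0 / (inner_iters(k) + 1.0)) for k in range(5000))
```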

    5 Numerical Experiments

In this section we give a numerical example to show the advantage of using the NAM framework relative to existing methods on the challenging non-convex problem of Regularized Structured Total Least Squares (RSTLS). Unlike One Iteration Approximation based methods, NAM allows several iterations in order to approximate the solution of each sub-problem. In the following example we show that these additional inner iterations at each outer iteration lead to a superior algorithm compared to existing methods, both in terms of the function values that NAM converges to and in terms of the relative error of the obtained solution.

    5.1 Regularized Structured Total Least Squares (RSTLS)

Following the recent paper [6], we consider the class of RSTLS problems, which can be formulated as the following non-convex and possibly non-smooth optimization problem:
$$\min_{x \in \mathbb{R}^p,\, u \in \mathbb{R}^m} F(x, u) \equiv \sigma_w^2 f(x) + \left\|\left(A + \sum_{i=1}^{m} u_i A_i\right)x - b\right\|^2 + \frac{\sigma_w^2}{\sigma_e^2}\|u\|^2, \quad (26)$$
where $A, A_1, A_2, \ldots, A_m \in \mathbb{R}^{n \times p}$ and $b \in \mathbb{R}^n$ are given and contain noise, $\sigma_w$ and $\sigma_e$ are the standard deviations of the noise components, respectively, and $f : \mathbb{R}^p \to (-\infty, \infty]$ is a regularizing function. For more details about the problem formulation given in (26) and its

applications in image processing see [6] and references therein. For the RSTLS model to be compatible with the NAM framework, we choose the function $f$ to be $\lambda\|x\|^2$ for some regularization parameter $\lambda > 0$. This yields a non-convex formulation which is proper, lower semi-continuous, bounded from below by 0 and satisfies the KL property (all these properties can be easily verified in this case, since the function $F$ is a quartic polynomial). Under the NAM framework we can write $F(x, u) = G(x, u) + H(u)$, where
$$G(x, u) = \sigma_w^2 f(x) + \tilde{G}(x, u) \equiv \lambda\sigma_w^2\|x\|^2 + \left\|\left(A + \sum_{i=1}^{m} u_i A_i\right)x - b\right\|^2, \qquad H(u) = \frac{\sigma_w^2}{\sigma_e^2}\|u\|^2.$$
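The split $F = G + H$ can be sketched directly; the following minimal NumPy evaluation of the RSTLS objective uses a tiny synthetic instance (all data below are hypothetical, generated at random purely for illustration):

```python
import numpy as np

def rstls_objective(x, u, A, A_list, b, lam, sigma_w, sigma_e):
    """Evaluate F(x, u) = G(x, u) + H(u) for the RSTLS split in the text."""
    Au = A + sum(ui * Ai for ui, Ai in zip(u, A_list))
    G = lam * sigma_w**2 * np.dot(x, x) + np.linalg.norm(Au @ x - b) ** 2
    H = (sigma_w**2 / sigma_e**2) * np.dot(u, u)
    return G + H

# Tiny synthetic instance (hypothetical data, not from the experiments).
rng = np.random.default_rng(0)
n, p, m = 6, 4, 2
A = rng.standard_normal((n, p))
A_list = [rng.standard_normal((n, p)) for _ in range(m)]
b = rng.standard_normal(n)
x, u = rng.standard_normal(p), rng.standard_normal(m)
val = rstls_objective(x, u, A, A_list, b, lam=0.02, sigma_w=1e-4, sigma_e=1e-3)
```

Since all three terms are non-negative for $\lambda > 0$, the objective is bounded from below by 0, as stated above.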

We mention that in the general NAM framework, the function $H$ is allowed to be non-smooth, which is not the case in this specific application. In addition, in this application the function $u \mapsto F(x, u)$, for a fixed $x \in \mathbb{R}^p$, is a strongly convex quadratic with a very small $m$, and thus can be minimized exactly, as also done in [6] (see more details below).

We also point out that the NAM algorithm has three blocks, while our application has only two. Therefore, we can either consider the $y$-block in NAM to vanish, or alternatively treat the $u$-block of RSTLS as the $y$-block in NAM, in which case the $u$-block in NAM vanishes. The difference between these two approaches is that the $u$-block in NAM is solved exactly, while the $y$-block in NAM can be solved approximately. In this work we choose to solve each such sub-problem exactly, which is possible due to the low dimension of $u$; nevertheless, the NAM framework allows us to solve it approximately instead.

In order to apply Theorem 2 to formulation (26) we need to show that Assumption A is satisfied for the RSTLS problem (without item (ii), since we have no $y$-block here). To this end, we notice that the function $x \mapsto F(x, u)$ is strongly convex, with Hessian matrix given by
$$\nabla_x^2 F(x, u) = 2\lambda\sigma_w^2 I_{p \times p} + 2\left(A + \sum_{i=1}^{m} u_i A_i\right)^T\left(A + \sum_{i=1}^{m} u_i A_i\right).$$
Therefore, we derive that $\bar{\sigma} = 2\lambda\sigma_w^2$ is a positive lower bound on the strong convexity parameter of the function $x \mapsto F(x, u)$ for all $u \in \mathbb{R}^m$. In addition, it is easy to check that

the partial gradient $\nabla_x G(x, u)$ is Lipschitz continuous with constant
$$L(u) = 2\lambda\sigma_w^2 + 2\left\|A + \sum_{i=1}^{m} u_i A_i\right\|^2. \quad (27)$$
Therefore, we see that $L(u)$ is bounded on compact sets by some $L > 0$, and thus Assumption A(i) is satisfied. Since $G(x, u)$ is a quadratic polynomial function, we easily derive Assumption A(iii).

    5.2 Description of the Data and Methods

We follow the experiment of [6] and utilize the RSTLS formulation (26) in order to reconstruct a given blurred image. We consider the vector $b$ as the vectorized observed blurred image. The vector $b$ is obtained from the real image, denoted $x_{\mathrm{real}}$, by first scaling the pixels of the real image to be between 0 and 1. The scaled image is then passed through a Gaussian blur point spread function (PSF) of size $q \times q$ and standard deviation $\gamma$ (see [15, Chapter 3]), and then noise drawn from a Gaussian distribution with parameters $(0, \sigma_w)$ is added to each coordinate.

The unknown blurring operator is constructed from the original PSF by adding to it a $q \times q$ matrix (before padding) with the same structure as the original PSF. Noise is added to each of the structure matrices $A_i$ by the update rule $A_i := \mu A_i$, where $\mu \sim \mathrm{Gauss}(0, \sigma_e)$. We use periodic boundary conditions, resulting in a block circulant with circulant blocks (BCCB) blurring matrix.

The authors of [6] solve the RSTLS problem with a One Iteration Approximation based algorithm called SemiProximal-Alternating (SPA). It was proven that SPA converges globally to a critical point of $F(x, u)$. In order to present the algorithm, we need to recall the definition of the proximal mapping of a function $\psi : \mathbb{R}^d \to (-\infty, \infty]$:
$$\mathrm{prox}_{\psi}(z) = \underset{v \in \mathbb{R}^d}{\operatorname{argmin}}\left\{\psi(v) + \frac{1}{2}\|v - z\|^2\right\} \quad \text{for any } z \in \mathbb{R}^d.$$
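For the squared-norm regularizer used here, the proximal mapping has a simple closed form, which a minimal sketch can verify numerically (the function and values below are illustrative only):

```python
import numpy as np

def prox_sq_norm(z, c):
    # prox of psi(v) = c * ||v||^2:  argmin_v { c ||v||^2 + (1/2) ||v - z||^2 }.
    # First-order optimality 2 c v + (v - z) = 0 gives v = z / (1 + 2 c).
    return z / (1.0 + 2.0 * c)

z = np.array([3.0, -1.0])
v = prox_sq_norm(z, c=0.5)  # -> z / 2 = [1.5, -0.5]

# Check optimality: the gradient of the prox objective vanishes at v.
grad = 2.0 * 0.5 * v + (v - z)
```

This scaling-by-a-constant structure is exactly what yields the explicit SPA x-update stated below in (31).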

Algorithm 2 SPA Algorithm for RSTLS
1: Initialization: $(x^0, u^0) \in \mathbb{R}^p \times \mathbb{R}^m$.
2: Iterative step:
3: for $k \geq 0$ do
4:   Pick $L_k$ (a Lipschitz constant of the gradient of $x \mapsto \tilde{G}(x, u^k)$) and update
$$x^{k+1} = \mathrm{prox}_{\frac{\sigma_w^2}{L_k} f}\left(x^k - \frac{1}{L_k}\nabla_x \tilde{G}(x^k, u^k)\right). \quad (28)$$
5:   Update
$$u^{k+1} = \operatorname{argmin}\left\{F(x^{k+1}, u) : u \in \mathbb{R}^m\right\}. \quad (29)$$
6: end for

Since the function $u \mapsto F(x^{k+1}, u)$ is strongly convex, the unique minimizer in (29) can be obtained from the first-order optimality condition $\nabla_u F(x^{k+1}, u^{k+1}) = 0$, and is given by the formula
$$u^{k+1} = -\left(\frac{\sigma_w^2}{\sigma_e^2} I_{m \times m} + B(x^{k+1})^T B(x^{k+1})\right)^{-1} B(x^{k+1})^T\left(Ax^{k+1} - b\right), \quad (30)$$

where $B(x) = [A_1 x \;\; A_2 x \;\; \ldots \;\; A_m x]$. Since, in practice, $m$ is small, this system of linear equations can be solved efficiently. As for the update rule in (28), since the function $f$ is set to be $\lambda\|x\|^2$, simple calculations show that an explicit update rule is given by the formula
$$x^{k+1} = \frac{L_k}{L_k + 2\lambda\sigma_w^2}\left(x^k - \frac{1}{L_k}\nabla_x \tilde{G}(x^k, u^k)\right). \quad (31)$$
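One full SPA iteration, combining the explicit x-update (31) with the exact u-update (30) and taking $L_k$ as in (32) below, can be sketched as follows (the data layout, dense matrices, and instance sizes are hypothetical; the real experiments use structured BCCB operators):

```python
import numpy as np

def spa_step(x, u, A, A_list, b, lam, sigma_w, sigma_e):
    """One SPA iteration: explicit x-update (31), then exact u-update (30)."""
    Au = A + sum(ui * Ai for ui, Ai in zip(u, A_list))
    Lk = 2.0 * np.linalg.norm(Au, 2) ** 2               # tight constant, as in (32)
    grad = 2.0 * Au.T @ (Au @ x - b)                    # gradient of x -> ||Au x - b||^2
    x_new = (Lk / (Lk + 2.0 * lam * sigma_w**2)) * (x - grad / Lk)   # (31)
    B = np.column_stack([Ai @ x_new for Ai in A_list])  # B(x) = [A_1 x ... A_m x]
    m = len(A_list)
    M = (sigma_w**2 / sigma_e**2) * np.eye(m) + B.T @ B
    u_new = -np.linalg.solve(M, B.T @ (A @ x_new - b))  # (30), small m x m solve
    return x_new, u_new
```

Since $m$ is small, the $m \times m$ solve in the u-update is cheap, matching the remark above about (30).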

Obtaining an upper bound on the Lipschitz constant $L_k$, $k \in \mathbb{N}$, can be a computationally challenging task. However, since we know that the blurring matrix is BCCB, its eigenvalues can be calculated efficiently (see [15, Section 4.2.1]). Therefore, we use in this paper the tight Lipschitz constant given by
$$L_k = 2\left\|A + \sum_{i=1}^{m} u_i^k A_i\right\|^2. \quad (32)$$
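The BCCB structure makes (32) cheap to evaluate: the eigenvalues of a BCCB matrix are the 2-D DFT of its generating kernel, so the spectral norm never requires forming the matrix. A toy-sized sketch (the $4 \times 4$ grid and kernel values are illustrative only) compares the FFT shortcut against a direct computation:

```python
import numpy as np

def bccb_sq_norm(kernel):
    """Squared spectral norm of the BCCB matrix generated by `kernel`
    (its first column, reshaped to the image grid). The eigenvalues of a
    BCCB matrix are the 2-D DFT of that kernel, so no matrix is formed."""
    return np.max(np.abs(np.fft.fft2(kernel))) ** 2

# Sanity check on a tiny 4x4 grid: build the BCCB matrix explicitly by
# circularly shifting the kernel, and compare spectral norms.
kernel = np.zeros((4, 4))
kernel[0, 0], kernel[0, 1], kernel[1, 0] = 0.5, 0.25, 0.25
n = kernel.size
C = np.empty((n, n))
for j in range(n):
    r, c = divmod(j, 4)
    C[:, j] = np.roll(np.roll(kernel, r, axis=0), c, axis=1).ravel()

L_fft = 2.0 * bccb_sq_norm(kernel)             # FFT shortcut for (32)
L_direct = 2.0 * np.linalg.norm(C, 2) ** 2     # dense reference
```

Because a BCCB matrix is normal, its spectral norm equals its largest eigenvalue modulus, which is what the FFT shortcut exploits.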

Now, we would like to compare the SPA algorithm with our NAM method equipped with inner iterations of the AG algorithm (which was proved in Section 4 to be NFA). This method applies several iterations of AG to the function $x \mapsto G(x, u^k)$, which requires finding the Lipschitz constant of the gradient of this function. From (27) we see that we can use the constant $2\lambda\sigma_w^2 + L_k$, where $L_k$ is defined in (32).

Algorithm 3 NAM with AG for RSTLS
1: Initialization: $(x^{-1}, u^0) \in \mathbb{R}^p \times \mathbb{R}^m$.
2: Iterative step:
3: for $k \geq 0$ do
4:   Set $w^{k-1,0} = x^{k-1}$, $t_0 = 1$ and pick $L_k \geq L_{\nabla_x \tilde{G}}(u^k)$.
5:   for $i = 0, 1, 2, \ldots, i_{k-1} - 1$ do
6:     Update
$$x^{k-1,i+1} = w^{k-1,i} - \frac{1}{2\lambda\sigma_w^2 + L_k}\nabla_x F\left(w^{k-1,i}, u^k\right). \quad (33)$$
7:     Set $t_{i+1} = \frac{1 + \sqrt{1 + 4t_i^2}}{2}$.
8:     Update
$$w^{k-1,i+1} = x^{k-1,i+1} + \frac{t_i - 1}{t_{i+1}}\left(x^{k-1,i+1} - x^{k-1,i}\right). \quad (34)$$
9:   end for
10:  Set $x^k = x^{k-1,i_{k-1}}$.
11:  Update
$$u^{k+1} = \operatorname{argmin}\left\{F(x^k, u) : u \in \mathbb{R}^m\right\}. \quad (35)$$
12: end for
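The inner loop (33)-(34) is a standard accelerated gradient (FISTA-style) sweep and can be sketched generically; `grad_F` and `step_L` below are placeholders for the partial gradient $\nabla_x F(\cdot, u^k)$ and the constant $2\lambda\sigma_w^2 + L_k$:

```python
import numpy as np

def ag_inner_loop(x_prev, u, grad_F, step_L, num_iters):
    """Inner AG iterations (33)-(34) of Algorithm 3: accelerated gradient
    steps on x -> F(x, u) with fixed stepsize 1 / step_L.
    `grad_F(w, u)` is assumed to return the partial gradient in x."""
    w = x_prev.copy()
    x = x_prev.copy()
    t = 1.0
    for _ in range(num_iters):
        x_new = w - grad_F(w, u) / step_L                  # gradient step (33)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0   # momentum parameter
        w = x_new + ((t - 1.0) / t_new) * (x_new - x)      # extrapolation (34)
        x, t = x_new, t_new
    return x

# Illustrative run on a toy quadratic: F(x) = (1/2)||x - c||^2, so the
# inner loop should drive x toward c (grad_F ignores u here).
c = np.array([1.0, -2.0])
x_out = ag_inner_loop(np.zeros(2), None, lambda w, u: w - c, 1.0, 20)
```

In Algorithm 3 the output of this loop becomes $x^k$ before the exact u-update (35).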

We mention that since the blurring matrix is BCCB, we can efficiently compute the strong convexity parameters $\sigma_k$ of the functions $x \mapsto F(x, u^k)$ for any $k \geq 0$. Therefore, we can use V-AG instead of AG as the NFA in Algorithm 3. Moreover, following the arguments in Section 4, we actually know an explicit minimum number of inner iterations $i_k$ for both AG and V-AG. However, the condition number $\kappa_k$ increases as $\sigma_w$ decreases, giving very large $\kappa_k$. Two issues arise when the value of $\kappa_k$ is very large, which limits its applicability here: (i) the number of inner iterations $i_k$ increases significantly, even for small $k$; (ii) the stepsize in V-AG, which depends on $\kappa_k$, increases significantly. Therefore, we use the AG method instead of V-AG. To overcome issue (i), we simply follow the arguments of Section 4 and choose $i_k = s + 2^{\lfloor (k+1)/r \rfloor}$ for some fixed integers $s$ and $r$. This selection guarantees the conditions for global convergence while keeping a relatively low number of inner iterations when $k$ is small. Additionally, this selection does not require the calculation of $\kappa$.

    5.3 Results

We run the SPA algorithm and NAM with inner AG iterations on a blurred version of the $256 \times 256$ pixels cameraman test image taken from the MATLAB Image Processing Toolbox available in [1]. In order to blur the original image, we used a Gaussian PSF of size $5 \times 5$ with a standard deviation of 4. See figures (a) and (b) below for the original and the blurred image.

(a) Original image (b) Blurred image

We let both methods run for $N = 3000$ iterations, where for NAM the iteration counter also counts inner iterations (for example, if $i_1 = 100$ and $i_2 = 200$, then the iteration counter after two iterations is $N = 300$). In all experiments we set $\lambda = 0.02$, $\sigma_w = 10^{-4}$ and $\sigma_e = 10^{-3}$. The initial point for both algorithms is set to be the original blurred image, and the variable $u$ is initialized as the zero vector.

As for the NAM with AG algorithm, we always set $r = 10$, which lets us control the number of inner AG iterations while also establishing global convergence. For example, from the formula $i_k = s + 2^{\lfloor (k+1)/r \rfloor}$ we get the following numbers of inner AG iterations for $s = 10$ and $r = 10$: $i_{-1} = 11$, $i_0 = 11$, $i_9 = 12$, $i_{19} = 14$, $i_{29} = 18$ and so forth. Note that when keeping $N$ fixed, increasing $s$ reduces the effect of $r$ on the number of inner iterations.

All experiments were run on an Intel(R) Core(TM) i5-6400 CPU @ 2.70GHz with 8.00GB RAM, Windows 7 Professional 64-bit, using MATLAB 2020a.

We compare in Figure 2 and Figure 3 the two methods by the achieved function value and by the relative error at each iteration, calculated by
$$\text{Relative Error} = \frac{\left\|x^i - x_{\mathrm{real}}\right\|}{\left\|x_{\mathrm{real}}\right\|},$$
where $x^i$ is the image obtained in the $i$-th iteration.

Figure 2: Function value at each iteration for $N = 3000$ on a logarithmic scale. The NAM with AG method was run for $s = 1, 10, 50, 100$ and $200$.

Figure 3: Relative error at each iteration for $N = 3000$ on a logarithmic scale. The NAM with AG method was run for $s = 1, 10, 50, 100$ and $200$.

In Figure 2 and Figure 3 we see that after 10 iterations, NAM with AG for $s = 10, 50, 100, 200$ is superior to SPA in terms of both function value and relative error. We also see that for NAM with AG, increasing the number of inner iterations increases the accuracy of the output, indicating that, at least for the RSTLS problem, it is better to use methods that are not based on One Iteration Approximation.

The following figures show the output of each of the methods after $N = 3000$ iterations.

(a) SPA Algorithm (b) NAM with AG, $s = 1$
(c) NAM with AG, $s = 10$ (d) NAM with AG, $s = 50$
(e) NAM with AG, $s = 100$ (f) NAM with AG, $s = 200$

To conclude, in this paper we provided a general recipe which serves as a basis for proving global convergence of algorithms that tackle non-convex optimization problems. We have generalized the methodology of [10] to allow errors in the conditions. We utilized the new methodology in order to establish global convergence of Nested Alternating Minimization methods that incorporate a Nested Friendly Algorithm. To the best of our knowledge, this is the first result that proves global convergence of general nested algorithms in the non-convex setting. Moreover, to the best of our knowledge, this is the first methodology which can be used to derive global convergence of methods which are not necessarily sufficient descent methods. In future work, it would be interesting to generalize the NAM framework to also include constraints or non-smooth regularizers in the blocks $x$ and $y$. Another challenging task is to extend the framework of NAM to functions which are not necessarily strongly convex in the $x$ and $y$ blocks.

References

[1] Image Processing Toolbox. The MathWorks Inc., Natick, Massachusetts, United States. Online: mathworks.com/products/image, 2020.

[2] H. Attouch and J. Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1-2):5-16, 2009.

[3] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438-457, 2010.

[4] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2):91-129, 2013.

[5] A. Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.

[6] A. Beck, S. Sabach, and M. Teboulle. An alternating semiproximal method for nonconvex regularized structured total least squares problems. SIAM Journal on Matrix Analysis and Applications, 37(3):1129-1150, 2016.

[7] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419-2434, 2009.

[8] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

[9] J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205-1223, 2007.

[10] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459-494, 2014.

[11] J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd. First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM Journal on Optimization, 28(3):2131-2151, 2018.

[12] B. L. Gorissen, İ. Yanıkoğlu, and D. den Hertog. A practical guide to robust optimization. Omega, 53:124-137, 2015.

[13] P. J. Groenen, M. van de Velden, et al. Multidimensional scaling by majorization: A review. Journal of Statistical Software, 73(8):1-26, 2016.

[14] W. J. Gutjahr and A. Pichler. Stochastic multi-objective optimization: a survey on non-scalarizing methods. Annals of Operations Research, 236(2):475-499, 2016.

[15] P. C. Hansen, J. G. Nagy, and D. P. O'Leary. Deblurring Images: Matrices, Spectra, and Filtering. SIAM, 2006.

[16] R. Hesse, D. R. Luke, S. Sabach, and M. K. Tam. Proximal heterogeneous block implicit-explicit method and application to blind ptychographic diffraction imaging. SIAM Journal on Imaging Sciences, 8(1):426-457, 2015.

[17] P. Jain, P. Kar, et al. Non-convex optimization for machine learning. Foundations and Trends in Machine Learning, 10(3-4):142-336, 2017.

[18] K. Kurdyka. On gradients of functions definable in o-minimal structures. Annales de l'Institut Fourier (Grenoble), 48(3):769-783, 1998.

[19] S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. In Les Équations aux Dérivées Partielles (Paris, 1962), pages 87-89. Éditions du Centre National de la Recherche Scientifique, Paris, 1963.

[20] F. G. Mohammadi, M. H. Amini, and H. R. Arabnia. Evolutionary computation, optimization, and learning algorithms for data science. In Optimization, Learning, and Control for Interdependent Complex Networks, pages 37-65. Springer, 2020.

[21] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate $O(1/k^2)$. Dokl. Akad. Nauk SSSR, 269(3):543-547, 1983.

[22] P. Ochs. Unifying abstract inexact convergence theorems and block coordinate variable metric iPiano. SIAM Journal on Optimization, 29(1):541-570, 2019.

[23] N. Piovesan and T. Erseghe. Cooperative localization in WSNs: A hybrid convex/nonconvex solution. IEEE Transactions on Signal and Information Processing over Networks, 4(1):162-172, 2018.

[24] T. Pock and S. Sabach. Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems. SIAM Journal on Imaging Sciences, 9(4):1756-1787, 2016.

[25] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

[26] S. Sabach, M. Teboulle, and S. Voldman. A smoothing alternating minimization-based algorithm for clustering with sum-min of Euclidean norms. Pure and Applied Functional Analysis, 3:653-679, 2018.

[27] C. Soares, J. Xavier, and J. Gomes. Simple and fast convex relaxation method for cooperative localization in sensor networks using range measurements. IEEE Transactions on Signal Processing, 63(17):4532-4543, 2015.

[28] J. Wang. Geometric Structure of High-Dimensional Data and Dimensionality Reduction. Springer, 2012.

[29] F. Wen, L. Chu, P. Liu, and R. C. Qiu. A survey on nonconvex regularization-based sparse and low-rank recovery in signal processing, statistics, and machine learning. IEEE Access, 6:69883-69906, 2018.