UNIVERSITY OF CALIFORNIA, SAN DIEGO
Reduced Hessian Quasi-Newton Methods for Optimization
A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy
in Mathematics
by
Michael Wallace Leonard
Committee in charge:
Professor Philip E. Gill, Chair
Professor Randolph E. Bank
Professor James R. Bunch
Professor Scott B. Baden
Professor Pao C. Chau
1995
Copyright © 1995 Michael Wallace Leonard
All rights reserved.
The dissertation of Michael Wallace Leonard is approved,
and it is acceptable in quality and form for publication
on microfilm:
Professor Philip E. Gill, Chair
University of California, San Diego
1995
This dissertation is dedicated to my mother and father.
Contents
Signature Page iii
Dedication iv
Table of Contents vi
List of Tables vii
Preface viii
Acknowledgements xiii
Curriculum Vita xiv
Abstract xv

1 Introduction to Unconstrained Optimization 1
1.1 Newton's method 2
1.2 Quasi-Newton methods 6
1.2.1 Minimizing strictly convex quadratic functions 9
1.2.2 Minimizing convex objective functions 10
1.3 Computation of the search direction 13
1.3.1 Notation 13
1.3.2 Using Cholesky factors 13
1.3.3 Using conjugate-direction matrices 15
1.4 Transformed and reduced Hessians 16

2 Reduced-Hessian Methods for Unconstrained Optimization 18
2.1 Fenelon's reduced-Hessian BFGS method 19
2.1.1 The Gram-Schmidt process 20
2.1.2 The BFGS update to R_Z 22
2.2 Reduced inverse Hessian methods 23
2.3 An extension of Fenelon's method 25
2.4 The effective approximate Hessian 29
2.5 Lingering on a subspace 31
2.5.1 Updating Z when p = p_r 34
2.5.2 Calculating s_Z and y_Z^ε 36
2.5.3 The form of R_Z when using the BFGS update 37
2.5.4 Updating R_Z after the computation of p 39
2.5.5 The Broyden update to R_Z 41
2.5.6 A reduced-Hessian algorithm with lingering 41

3 Rescaling Reduced Hessians 43
3.1 Self-scaling variable metric methods 44
3.2 Rescaling conjugate-direction matrices 46
3.2.1 Definition of p 46
3.2.2 Rescaling V 47
3.2.3 The conjugate-direction rescaling algorithm 48
3.2.4 Convergence properties 49
3.3 Extending Algorithm RH 50
3.3.1 Reinitializing the approximate curvature 50
3.3.2 Numerical results 53
3.4 Rescaling combined with lingering 54
3.4.1 Numerical results 57
3.4.2 Algorithm RHRL applied to a quadratic 58

4 Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling 62
4.1 A search-direction basis for range(V_1) 62
4.2 A transformed Hessian associated with B 66
4.3 How rescaling V affects U^T B U 70
4.4 The proof of equivalence 75

5 Reduced-Hessian Methods for Large-Scale Unconstrained Optimization 79
5.1 Large-scale quasi-Newton methods 79
5.2 Extending Algorithm RH to large problems 82
5.2.1 Imposing a storage limit 83
5.2.2 The deletion procedure 84
5.2.3 The computation of T 86
5.2.4 The updates to g_Z and R_Z 87
5.2.5 Gradient-based reduced-Hessian algorithms 88
5.2.6 Quadratic termination 89
5.2.7 Replacing g with p 90
5.3 Numerical results 97
5.4 Algorithm RHR-L-P applied to quadratics 107

6 Reduced-Hessian Methods for Linearly-Constrained Problems 114
6.1 Linearly constrained optimization 114
6.2 A dynamic null-space method for LEP 118
6.3 Numerical results 123

Bibliography 125
List of Tables
2.1 Alternate methods for computing Z̄ 21

3.1 Alternate values for σ 53
3.2 Test Problems from Moré et al. 54
3.3 Results for Algorithm RHR using R1, R4 and R5 55
3.4 Results for Algorithm RHRL on problems 1–18 58
3.5 Results for Algorithm RHRL on problems 19–22 59

5.1 Comparing p from CG and Algorithm RH-L-G on quadratics 90
5.2 Iterations/Functions for RHR-L-G (m = 5) 98
5.3 Iterations/Functions for RHR-L-P (m = 5) 99
5.4 Results for RHR-L-P using R3–R5 (m = 5) on Set #1 100
5.5 Results for RHR-L-P using R3–R5 (m = 5) on Set #2 101
5.6 RHR-L-P using different m with R4 102
5.7 RHR-L-P (R4) for m ranging from 2 to n 103
5.8 Results for RHR-L-P and L-BFGS-B (m = 5) on Set #1 105
5.9 Results for RHR-L-P and L-BFGS-B (m = 5) on Set #2 106

6.1 Results for LEPs (m_L = 5, δ = 10^{−10}, ‖N^T g‖ ≤ 10^{−6}) 124
6.2 Results for LEPs (m_L = 8, δ = 10^{−10}, ‖N^T g‖ ≤ 10^{−6}) 124
Preface
This thesis consists of six chapters and a bibliography. Each chapter
starts with a review of the literature and proceeds to new material developed
by the author under the direction of the Chair of the dissertation committee.
All lemmas, theorems, corollaries and algorithms are those of the author unless
otherwise stated.
Problems from all areas of science and engineering can be posed as
optimization problems. An optimization problem involves a set of independent
variables, and often includes constraints or restrictions that define acceptable val-
ues of the variables. The solution of an optimization problem is a set of allowed
values of the variables for which some objective function achieves its maximum
or minimum value. The class of model-based methods forms quadratic approxi-
mations of optimization problems using first and sometimes second derivatives of
the objective and constraint functions.
If no constraints are present, an optimization problem is said to be
unconstrained. The formulation of effective methods for the unconstrained case
is the first step towards defining methods for constrained optimization. The
unconstrained optimization problem is considered in Chapters 1–5. Methods for
problems with linear equality constraints are considered in Chapter 6.
Chapter 1 opens with a discussion of Newton’s method for unconstrained
optimization. Newton’s method is a model-based method that requires both
first and second derivatives. In Section 1.2 we move on to quasi-Newton meth-
ods, which are intended for the situation when the provision of analytic second
derivatives is inconvenient or impossible. Quasi-Newton methods use only first
derivatives to build up an approximate Hessian over a number of iterations. At
each iteration of a quasi-Newton method, the approximate Hessian is altered
to incorporate new curvature information. This process, which is known as an
update, involves the addition of a low-rank matrix (usually of rank one or rank
two). This thesis will be concerned with a class of rank-two updates known as
the Broyden class. The most important member of this class is the so-called
Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula.
In Chapter 2 we consider quasi-Newton methods from a completely dif-
ferent point of view. Quasi-Newton methods that employ updates from the
Broyden class are known to accumulate approximate curvature in a sequence
of expanding subspaces. It follows that the search direction can be defined using
matrices of smaller dimension than the approximate Hessian. In exact arithmetic
these so-called reduced Hessians generate the same iterates as the standard quasi-
Newton methods. This result is the basis for all of the new algorithms defined
in this thesis. Reduced-Hessian and reduced inverse Hessian methods are con-
sidered in Sections 2.1 and 2.2 respectively. In Section 2.3 we propose Algorithm
RH, which is the template algorithm for this thesis. In Section 2.5 this algorithm
is generalized to include a “lingering scheme” (Algorithm RHL) that allows the
iterates to be restricted to certain low dimensional manifolds.
In practice, the choice of initial approximate Hessian can greatly influ-
ence the performance of quasi-Newton methods. In the absence of exact second-
derivative information, the approximate Hessian is often initialized to the identity
matrix. Several authors have observed that a poor choice of initial approximate
Hessian can lead to inefficiencies, especially if the Hessian itself is ill-conditioned.
These inefficiencies can lead to a large number of function evaluations in some
cases.
Rescaling techniques are intended to address this difficulty and are the
subject of Chapter 3. The rescaling methods of Oren and Luenberger [39], Siegel
[45] and Lalee and Nocedal [27] are discussed. In particular, the conjugate-
direction rescaling method of Siegel (Algorithm CDR), which is also a variant of
the BFGS method, is described in some detail. Algorithm CDR (page 48) has
been shown to be effective in solving ill-conditioned problems. Algorithm CDR
has notable similarities to reduced-Hessian methods, and two new rescaling algo-
rithms follow naturally from the interpretation of Algorithm CDR as a reduced
Hessian method. These algorithms are derived in Sections 3.3 and 3.4. The
first (Algorithm RHR) is a modification of Algorithm RH; the second (Algo-
rithm RHRL) is derived from Algorithm RHL. Numerical results are given for
both algorithms. Moreover, under certain conditions Algorithm RHRL is shown
to converge in a finite number of iterations when applied to a class of quadratic
problems. This property, often termed quadratic termination, can be numerically
beneficial for quasi-Newton methods.
In Chapter 4, it is shown that if Algorithm RHRL is used in conjunction
with a particular rescaling technique of Siegel [45], then it is equivalent to Algo-
rithm CDR in exact arithmetic. Chapter 4 is mostly technical in nature and may
be skipped without loss of continuity. However, the convergence results given in
Section 4.4 should be reviewed before passing to Chapter 5.
If the problem has many independent variables, it may not be practical
to store the Hessian matrix or an approximate Hessian. In Chapter 5, meth-
ods for solving large unconstrained problems are reviewed. Conjugate-gradient
(CG) methods require storage for only a few vectors and can be used in the
large-scale case. However, CG methods can require a large number of itera-
tions relative to the problem size and can be prohibitively expensive in terms
of function evaluations. In an effort to accelerate CG methods, several authors
have proposed limited-memory and reduced-Hessian quasi-Newton methods. The
limited-memory algorithm of Nocedal [35], the successive affine reduction method
of Nazareth [34], the reduced-Hessian method of Fenelon [14] and reduced inverse-
Hessian methods due to Siegel [46] are reviewed.
In Chapter 5, new reduced-Hessian rescaling algorithms are derived as
extensions of Algorithms RH and RHR. These algorithms (Algorithms RHR-L-G
and RHR-L-P) employ the rescaling method of Algorithm RHR. Algorithm RHR-
L-P shares features of the methods of Fenelon, Nazareth and Siegel. However, the
inclusion of rescaling is demonstrated numerically to be essential for efficiency.
Moreover, Algorithm RHR-L-P is shown to enjoy the property of quadratic ter-
mination, which is shown to be beneficial when the algorithm is applied to general
functions.
Chapter 6 considers the minimization of a function subject to linear
equality constraints. Two algorithms (Algorithms RH-LEP and RHR-LEP) ex-
tend reduced-Hessian methods to problems with linear constraints. Numerical
results are given comparing Algorithm RHR-LEP with a standard method for
solving linearly constrained problems.
In summary, a total of seven new reduced-Hessian algorithms are pro-
posed.
• Algorithm RH (p. 28)—The algorithm template.
• Algorithm RHL (p. 41)—Uses a lingering scheme that constrains the iter-
ates to remain on a manifold.
• Algorithm RHR (p. 52)—Rescales when approximate curvature is obtained
in a new subspace.
• Algorithm RHRL (p. 56)—Exploits the special form of the reduced Hessian
resulting from the lingering strategy. This special form allows rescaling on
larger subspaces.
• Algorithm RHR-L-G (p. 95)—A gradient-based method with rescaling for
large-scale optimization.
• Algorithm RHR-L-P (p. 95)—A direction-based method with rescaling for
large-scale optimization. This algorithm converges in a finite number of
iterations when applied to a quadratic function.
• Algorithm RHR-LEP (p. 123)—A reduced-Hessian rescaling method for
linear equality-constrained problems.
Acknowledgements
I am pleased to acknowledge my advisor, Professor Philip E. Gill. I
became interested in doing research while I was a student in the Master of Arts
program, but writing a dissertation seemed an unlikely task. However, Professor
Gill thought that I had the right stuff. He has helped me hurdle many obstacles,
not the least of which was transferring into the Ph.D. program. He introduced
me to a very interesting and rewarding problem in numerical optimization. He
also supported me as a Research Assistant for several summers and during my
last quarter as a graduate student.
I would like to express my gratitude to Professors James R. Bunch,
Randolph E. Bank, Scott B. Baden and Pao C. Chau, all of whom served on my
thesis committee. My thanks also to Professors Maria E. Ong and Donald R.
Smith from whom I learned much in my capacity as a teaching assistant.
My special thanks to Professor Carl H. Fitzgerald. His training inspired
in me a much deeper appreciation of mathematics and is the basis of my technical
knowledge.
My family has always prompted me towards further education. I want
to thank my mother and father, my stepmother Maggie and my brother Clif for
their encouragement and support while I have been a graduate student.
I also want to express my appreciation to all of my friends who have
been supportive while I worked on this thesis. My climbing friends Scott Marshall,
Michael Smith, Fred Weening and Jeff Gee listened to my ranting and raving and
always encouraged me. My friends in the department, Jerome Braunstein, Scott
Crass, Sam Eldersveld, Ricardo Fierro, Richard LeBorne, Ned Lucia, Joe Shin-
nerl, Mark Stankus, Tuan Nguyen and others were all inspirational, informative
and helpful.
Vita
1982      Appointed U.C. Regents Scholar. University of California, Santa Barbara
1985      B.S., Mathematical Sciences, Highest Honors. University of California, Santa Barbara
1985      B.S., Mechanical Engineering, Highest Honors. University of California, Santa Barbara
1985-1987 Associate Engineering Scientist. McDonnell-Douglas Astronautics Corporation
1987-1990 High School Mathematics Teacher. Vista Unified School District
1988      Mathematics Single Subject Teaching Credential. University of California, San Diego
1991      M.A., Applied Mathematics. University of California, San Diego
1991-1993 Adjunct Mathematics Instructor. Mesa Community College
1991-1995 Teaching Assistant. Department of Mathematics, University of California, San Diego
1993      C.Phil., Mathematics. University of California, San Diego
1995      Research Assistant. Department of Mathematics, University of California, San Diego
1995      Ph.D., Mathematics. University of California, San Diego
Major Fields of Study
Major Field: Mathematics

Studies in Numerical Optimization. Professor Philip E. Gill

Studies in Numerical Analysis. Professors Randolph E. Bank, James R. Bunch, Philip E. Gill and Donald R. Smith

Studies in Complex Analysis. Professor Carl H. Fitzgerald

Studies in Applied Algebra. Professors Jeffrey B. Remmel and Adriano M. Garsia
Abstract of the Dissertation
Reduced Hessian Quasi-Newton Methods for Optimization
by
Michael Wallace Leonard
Doctor of Philosophy in Mathematics
University of California, San Diego, 1995
Professor Philip E. Gill, Chair
Many methods for optimization are variants of Newton’s method, which
requires the specification of the Hessian matrix of second derivatives. Quasi-
Newton methods are intended for the situation where the Hessian is expensive
or difficult to calculate. Quasi-Newton methods use only first derivatives to
build an approximate Hessian over a number of iterations. This approximation
is updated each iteration by a matrix of low rank. This thesis is concerned with
the Broyden class of updates, with emphasis on the Broyden-Fletcher-Goldfarb-
Shanno (BFGS) update.
Updates from the Broyden class accumulate approximate curvature in
a sequence of expanding subspaces. This allows the approximate Hessians to be
represented in compact form using smaller reduced approximate Hessians. These
reduced matrices offer computational advantages when the objective function is
highly nonlinear or the number of variables is large.
Although the initial approximate Hessian is arbitrary, some choices may
cause quasi-Newton methods to fail on highly nonlinear functions. In this case,
rescaling can be used to decrease inefficiencies resulting from a poor initial ap-
proximate Hessian. Reduced-Hessian methods facilitate a trivial rescaling that
implicitly changes the initial curvature as iterations proceed. Methods of this
type are shown to have global and superlinear convergence. Moreover, numerical
results indicate that this rescaling is effective in practice.
In the large-scale case, so-called limited-storage reduced-Hessian meth-
ods offer advantages over conjugate-gradient methods, with only slightly in-
creased memory requirements. We propose two limited-storage methods that uti-
lize rescaling, one of which can be shown to terminate on quadratics. Numerical
results suggest that the method is effective compared with other state-of-the-art
limited-storage methods.
Finally, we extend reduced-Hessian methods to problems with linear
equality constraints. These methods are the first step towards reduced-Hessian
methods for the important class of nonlinearly constrained problems.
Chapter 1
Introduction to Unconstrained Optimization
Problems from all areas of science and engineering can be posed as
optimization problems. An optimization problem involves a set of independent
variables, and often includes constraints or restrictions that define acceptable val-
ues of the variables. The solution of an optimization problem is a set of allowed
values of the variables for which some objective function achieves its maximum
or minimum value. The class of model-based methods forms quadratic approxi-
mations of optimization problems using first and sometimes second derivatives of
the objective and constraint functions.
Consider the unconstrained optimization problem
minimize_{x ∈ IR^n}  f(x),   (1.1)
where f : IRn → IR is twice-continuously differentiable. Since maximizing f can
be achieved by minimizing −f , it suffices to consider only minimization. When
no constraints are present, the problem of minimizing f is often called “uncon-
strained optimization.” When linear constraints are present, the minimization
problem is called “linearly-constrained optimization.” The unconstrained opti-
mization problem is introduced in the next section. Linearly constrained opti-
mization is introduced in Chapter 6. Nonlinearly constrained optimization is not
considered. However, much of the work given here applies to solving “subprob-
lems” that might arise in the course of solving nonlinearly constrained problems.
1.1 Newton’s method
A local minimizer x∗ of (1.1) satisfies f(x∗) ≤ f(x) for all x in some open neigh-
borhood of x∗. The necessary optimality conditions at x∗ are
∇f(x∗) = 0 and ∇2f(x∗) ≥ 0,
where ∇2f(x∗) ≥ 0 means that the Hessian of f at x∗ is positive semi-definite.
Sufficient conditions for a point x∗ to be a local minimizer are
∇f(x∗) = 0 and ∇2f(x∗) > 0,
where ∇2f(x∗) > 0 means that the Hessian of f at x∗ is positive definite. Since
∇f(x∗) = 0, many methods for solving (1.1) attempt to "drive" the gradient
to zero. The methods considered here are iterative and generate search directions
by minimizing quadratic approximations to f . In what follows, let xk denote the
kth iterate and pk the kth search direction.
Newton’s method for solving (1.1) minimizes a quadratic model of f
at each iteration. The function q_k^N(x) given by

q_k^N(x) = f(x_k) + ∇f(x_k)^T (x − x_k) + (1/2)(x − x_k)^T ∇²f(x_k)(x − x_k),   (1.2)
is a second-order Taylor-series approximation to f at the point x_k. If ∇²f(x_k) > 0, then q_k^N(x) has a unique minimizer, corresponding to the point at which ∇q_k^N(x)
vanishes. This point is taken as the new estimate xk+1 of x∗. If the substitution
p = x − x_k is made in (1.2), then the resulting quadratic model

q_k^{N′}(p) = f(x_k) + ∇f(x_k)^T p + (1/2) p^T ∇²f(x_k) p   (1.3)

can be minimized with respect to p to obtain a search direction p_k. If ∇²f(x_k) > 0, then the vector p_k such that ∇q_k^{N′}(p_k) = ∇²f(x_k) p_k + ∇f(x_k) = 0 minimizes q_k^{N′}(p).
The new iterate is defined as xk+1 = xk + pk. This leads to the definition of
Newton’s method given below.
Algorithm 1.1. Newton’s method
Initialize k = 0 and choose x0.
while not converged do
Solve ∇2f(xk)p = −∇f(xk) for pk.
xk+1 = xk + pk.
k ← k + 1
end do
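To make the statement concrete, the following is a minimal Python sketch of Algorithm 1.1 for callables supplying the gradient and Hessian. The function names, the tolerance and the iteration limit are illustrative assumptions, not part of the algorithm.

    import numpy as np

    def newton(x0, fgrad, fhess, tol=1e-8, max_iter=100):
        # Pure Newton iteration: solve grad^2 f(x_k) p = -grad f(x_k), step to x_k + p.
        x = np.asarray(x0, dtype=float)
        for k in range(max_iter):
            g = fgrad(x)
            if np.linalg.norm(g) <= tol:        # stop when the gradient is nearly zero
                break
            p = np.linalg.solve(fhess(x), -g)   # Newton equations
            x = x + p
        return x

    # Example: f(x) = x1^4 + x1^2 + x2^2, a strictly convex function minimized at 0.
    x = newton([3.0, -2.0],
               fgrad=lambda x: np.array([4*x[0]**3 + 2*x[0], 2*x[1]]),
               fhess=lambda x: np.diag([12*x[0]**2 + 2.0, 2.0]))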
We now summarize the convergence properties of Newton’s method. It
is important to note that the method seeks points at which the gradient vanishes
and has no particular affinity for minimizers. In the following theorem we will
let x̄ denote a point such that ∇f(x̄) = 0.
Theorem 1.1 Let f : IRn → IR be a twice-continuously differentiable mapping
defined in an open set D, and assume that ∇f(x̄) = 0 for some x̄ ∈ D and that ∇²f(x̄) is nonsingular. Then there is an open set S such that for any x0 ∈ S the Newton iterates are well defined, remain in S, and converge to x̄.
Proof. See Moré and Sorensen [30, pp. 37–38].
The rate or order of convergence of a sequence of iterates is as important
as its convergence. If a sequence {x_k} converges to x̄ and

‖x_{k+1} − x̄‖ ≤ C‖x_k − x̄‖^p   (1.4)

for some positive constant C, then {x_k} is said to converge with order p. The special cases of p = 1 and p = 2 correspond to linear and quadratic convergence respectively. In the case of linear convergence, the constant C must satisfy C ∈ (0, 1). Note that if C is close to 1, linear convergence can be unsatisfactory. For example, if C = .9 and ‖x_k − x̄‖ = .1, then roughly 21 iterations may be required to attain ‖x_k − x̄‖ = .01.
A sequence {x_k} that converges to x̄ and satisfies

‖x_{k+1} − x̄‖ ≤ β_k ‖x_k − x̄‖,
for some sequence βk that converges to zero, is said to converge superlinearly.
Note that a sequence that converges superlinearly also converges linearly. More-
over, a sequence that converges quadratically converges superlinearly. In this
sense, superlinear convergence can be considered a “middle ground” between lin-
ear and quadratic convergence.
We now state order of convergence results for Newton’s method (for
proofs of these results, see Moré and Sorensen [30]). If f satisfies the conditions of Theorem 1.1, the iterates converge to x̄ superlinearly. Moreover, if the Hessian is Lipschitz continuous at x̄, i.e.,

‖∇²f(x) − ∇²f(x̄)‖ ≤ κ‖x − x̄‖   (κ > 0),   (1.5)
then xk converges quadratically. These asymptotic rates of convergence of
Newton’s method are the benchmark for all other methods that use only first
and second derivatives of f . Note that since x∗ satisfies ∇f(x∗) = 0, these
results hold also for minimizers.
If x0 is far from x∗, Newton’s method can have several deficiencies.
Consider first when ∇2f(xk) is positive definite. In this case, pk is a descent
direction satisfying ∇f(x_k)^T p_k < 0. However, since the quadratic model q_k^{N′} is
only a local approximation of f , it is possible that f(xk + pk) > f(xk). This
problem is alleviated by redefining xk+1 = xk + αkpk, where αk is a positive step
length. If p_k^T ∇f(x_k) < 0, then the existence of ᾱ > 0 such that α_k ∈ (0, ᾱ) implies f(x_{k+1}) < f(x_k) is guaranteed (see Fletcher [15]). The specific value
of αk is computed using a line search algorithm that approximately minimizes
the univariate function f(xk + αpk). As a result of the line search, the iterates
satisfy f(xk+1) < f(xk) for all k, which is the defining property associated with
all descent methods. This thesis is concerned mainly with descent methods that
use a line search.
Another problem with Algorithm 1.1 arises when ∇2f(xk) is indefinite
or singular. In this case, pk may be undefined, non-uniquely defined, or a non-
descent direction. This drawback has been successfully overcome by both modi-
fied Newton methods and trust-region methods. Modified Newton methods replace
∇2f(xk) with a positive-definite approximation whenever the former is indefinite
or singular (see Gill et al. [22] for details). Trust-region methods minimize the
quadratic model (1.3) in some small region surrounding x_k (see Moré and Sorensen [13, pp. 61–67] for further details).
Any Newton method requires the definition of O(n²) second derivatives
associated with the Hessian. In some cases, for example when f is the solution to
a differential or integral equation, it may be inconvenient or expensive to define
the Hessian. In the next section, quasi-Newton methods are introduced that solve
the unconstrained problem (1.1) using only gradient information.
1.2 Quasi-Newton methods
The idea of approximating the Hessian with a symmetric positive-definite matrix
was first introduced in Davidon’s 1959 paper, Variable metric methods for min-
imization [9]. If Bk denotes an approximate Hessian, then the quadratic model
q_k^N is replaced by

q_k(x) = f(x_k) + ∇f(x_k)^T(x − x_k) + (1/2)(x − x_k)^T B_k (x − x_k).   (1.6)

In this case, p_k is the solution of the subproblem

minimize_{p ∈ IR^n}  f(x_k) + ∇f(x_k)^T p + (1/2) p^T B_k p.   (1.7)

Since B_k is positive definite, p_k satisfies

B_k p_k = −∇f(x_k)   (1.8)
and pk is guaranteed to be a descent direction. Approximate second-derivative
information obtained in moving from xk to xk+1 is incorporated into Bk+1 using
an “update” to Bk. Hence, a general quasi-Newton method takes the form given
in Algorithm 1.2 below.
Algorithm 1.2. Quasi-Newton method
Initialize k = 0; Choose x0 and B0;
while not converged do
Solve Bkpk = −∇f(xk);
Compute αk, and set xk+1 = xk + αkpk;
Compute Bk+1 by applying an update to Bk;
k ← k + 1;
end do
It remains to discuss the form of the update to Bk and the choice of αk.
Define sk = xk+1 − xk, gk = ∇f(xk) and yk = gk+1 − gk. The definition of xk+1
implies that sk satisfies
s_k = α_k p_k.   (1.9)

This relationship will be used throughout this thesis. The curvature of f along s_k at a point x_k is defined as s_k^T ∇²f(x_k) s_k. The gradient of f can be expanded about x_k to give

g_{k+1} = ∇f(x_k + s_k) = g_k + ( ∫₀¹ ∇²f(x_k + ξ s_k) dξ ) s_k.

It follows from the definition of y_k that

s_k^T ∇²f(x_k) s_k ≈ s_k^T y_k.   (1.10)

The quantity s_k^T y_k is called the approximate curvature of f at x_k along s_k.
Next, we present a class of low-rank changes to B_k that ensure

s_k^T B_{k+1} s_k = s_k^T y_k,   (1.11)

so that B_{k+1} incorporates the correct approximate curvature.

• The well-known Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula defined by

B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(s_k^T y_k)   (1.12)

is easily shown to satisfy (1.11). An implementation of Algorithm 1.2 using the BFGS update will be called a "BFGS method".
• The Davidon-Fletcher-Powell (DFP) formula is defined by

B_{k+1} = B_k + (1 + (s_k^T B_k s_k)/(s_k^T y_k)) (y_k y_k^T)/(s_k^T y_k) − (y_k s_k^T B_k + B_k s_k y_k^T)/(s_k^T y_k).   (1.13)

An implementation of Algorithm 1.2 using the DFP update will be called a "DFP method".
• The approximate Hessians of the so-called Broyden class are defined by the formulae

B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(s_k^T y_k) + φ_k (s_k^T B_k s_k) w_k w_k^T,   (1.14)

where

w_k = y_k/(s_k^T y_k) − (B_k s_k)/(s_k^T B_k s_k),

and φ_k is a scalar parameter. Note that the BFGS and DFP formulae correspond to the choices φ_k = 0 and φ_k = 1.

• The convex class of updates is a subclass of the Broyden updates for which φ_k ∈ [0, 1] for all k. The updates from the convex class satisfy (1.11) since they are all elements of the Broyden class.
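To make the family concrete, here is a small Python sketch (the names are ours) that applies the Broyden-class update (1.14); φ_k = 0 and φ_k = 1 reproduce the BFGS formula (1.12) and the DFP formula (1.13).

    import numpy as np

    def broyden_update(B, s, y, phi=0.0):
        # Broyden-class update (1.14): phi = 0 gives BFGS, phi = 1 gives DFP.
        Bs = B @ s
        sBs = s @ Bs                  # s'Bs
        sy = s @ y                    # approximate curvature s'y; must be positive
        w = y / sy - Bs / sBs
        return (B - np.outer(Bs, Bs) / sBs
                  + np.outer(y, y) / sy
                  + phi * sBs * np.outer(w, w))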
Several results follow immediately from the definition of the updates
in the Broyden class. First, formulae in the Broyden class apply at most rank-
two updates to Bk. Second, updates in the Broyden class are such that Bk+1 is
symmetric as long as Bk is symmetric. Third, if Bk is positive definite and φk is
properly chosen (e.g., any φk ≥ 0 is acceptable (see Fletcher [16])), then Bk+1 is
positive definite if and only if s_k^T y_k > 0.
In unconstrained optimization, the value of α_k can ensure that s_k^T y_k > 0. In particular, s_k^T y_k is positive if α_k satisfies the Wolfe [48] conditions

f(x_k + α_k p_k) ≤ f(x_k) + ν α_k g_k^T p_k  and  g_{k+1}^T p_k ≥ η g_k^T p_k,   (1.15)

where 0 < ν < 1/2 and ν ≤ η < 1. The existence of such an α_k is guaranteed if, for example, f is bounded below. In a practical line search, it is often convenient to require α_k to satisfy the modified Wolfe conditions

f(x_k + α_k p_k) ≤ f(x_k) + ν α_k g_k^T p_k  and  |g_{k+1}^T p_k| ≤ η |g_k^T p_k|.   (1.16)
The existence of an αk satisfying these conditions can also be guaranteed theoret-
ically. (See Fletcher [15, pp. 26–30] for the existence results and further details.)
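Both sets of conditions are straightforward to test in code. The following sketch is a minimal illustration (the parameter values are assumptions); it checks whether a trial step length satisfies (1.15), or (1.16) when modified=True.

    import numpy as np

    def wolfe_ok(f, grad, x, p, alpha, nu=1e-4, eta=0.9, modified=False):
        # Test the Wolfe conditions (1.15) or the modified Wolfe conditions (1.16).
        g0p = grad(x) @ p                             # g_k'p_k, negative for descent
        xt = x + alpha * p
        decrease = f(xt) <= f(x) + nu * alpha * g0p   # sufficient decrease
        gtp = grad(xt) @ p
        curvature = abs(gtp) <= eta * abs(g0p) if modified else gtp >= eta * g0p
        return decrease and curvature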
For theoretical discussion, αk is sometimes considered to be an exact
minimizer of the univariate function Ψ(α) defined by Ψ(α) = f(xk + αpk). This
choice ensures a positive-definite update since, for such an α_k, g_{k+1}^T p_k = 0, which implies s_k^T y_k > 0. Properties of Algorithm 1.2 when it is applied to a convex
quadratic objective function using such an exact line search are given in the next
section.
1.2.1 Minimizing strictly convex quadratic functions
Consider the quadratic function
q(x) = d + c^T x + (1/2) x^T H x,  where d ∈ IR, c ∈ IR^n, H ∈ IR^{n×n},   (1.17)
and H is symmetric positive definite and independent of x. This quadratic has
a unique minimizer x∗ that satisfies Hx∗ = −c. If Algorithm 1.2 is used with
an exact line search and an update from the Broyden class, then the following
properties hold at the kth (0 < k ≤ n) iteration:
B_k s_i = H s_i,   (1.18)
s_i^T H s_k = 0,   (1.19)
s_i^T g_k = 0,   (1.20)
for all i < k. Multiplying (1.18) by s_i^T gives s_i^T B_k s_i = s_i^T H s_i, which implies that the curvature of the quadratic model (1.6) along s_i (i < k) is exact. Define S_k = ( s_0 s_1 · · · s_{k−1} ) and assume that s_i ≠ 0 (0 ≤ i ≤ n − 1). Under this assumption, note that (1.19) implies that the set {s_i | i ≤ n − 1} is linearly independent. At the start of the nth iteration, (1.18) implies that B_n S_n = H S_n, and B_n = H since S_n is nonsingular.
It can be shown that xk minimizes q(x) on the manifold defined by x0
and range(Sk) (see Fletcher [15, pp. 25–26]). It follows that xn minimizes q(x).
This implies that Algorithm 1.2 with an exact line search finds the minimizer of
the quadratic (1.17) in at most n steps, a property often referred to as quadratic
termination.
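Quadratic termination is easy to observe numerically. The following sketch (an illustration under the stated hypotheses, with the BFGS update written inline) applies Algorithm 1.2 with B_0 = I and an exact line search to a random strictly convex quadratic; the gradient falls to roundoff level after at most n iterations.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    A = rng.standard_normal((n, n))
    H = A @ A.T + n * np.eye(n)             # symmetric positive-definite Hessian
    c = rng.standard_normal(n)

    x, B = np.zeros(n), np.eye(n)
    g = H @ x + c                           # gradient of q(x) = d + c'x + x'Hx/2
    for k in range(n):
        if np.linalg.norm(g) < 1e-12:       # terminated early
            break
        p = np.linalg.solve(B, -g)
        alpha = -(g @ p) / (p @ (H @ p))    # exact minimizer of q(x + alpha p)
        s = alpha * p
        x = x + s
        g_new = H @ x + c
        y = g_new - g
        Bs = B @ s                          # BFGS update (1.12)
        B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)
        g = g_new
    print(np.linalg.norm(g))                # roundoff level after at most n steps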
Further properties of Algorithm 1.2 follow from its well-known equiva-
lence to the conjugate-gradient method when used to minimize convex quadratic
functions using an exact line search. If B0 = I and the updates are from the
Broyden class, then for all k ≥ 1 and 0 ≤ i < k,
g_i^T g_k = 0   (1.21)

and

p_k = −g_k + β_{k−1} p_{k−1},   (1.22)

where β_{k−1} = ‖g_k‖²/‖g_{k−1}‖² (see Fletcher [15, p. 65] for further details).
1.2.2 Minimizing convex objective functions
Much of the convergence theory for quasi-Newton methods involves convex func-
tions. The theory focuses on two properties of the sequence of iterates. First,
given an arbitrary starting point x0, will the sequence of iterates converge to x∗?
If so, then the method is said to be globally convergent. Second, what is the order
of convergence of the sequence of iterates? In the next two sections, we present
some of the results from the literature regarding the convergence properties of
quasi-Newton methods.
Global convergence of quasi-Newton methods
Consider the application of Algorithm 1.2 to a convex function. Powell has shown
that in this case, the BFGS method with a Wolfe line search is globally convergent
with lim inf ‖gk‖ = 0 (see Powell [40]). Byrd, Nocedal and Yuan have extended
Powell’s result to a quasi-Newton method using any update from the convex class
except the DFP update (see Byrd et al. [6]).
Uniformly convex functions are an important subclass of the set of con-
vex functions. The Hessians of these functions satisfy

m‖z‖² ≤ z^T ∇²f(x) z ≤ M‖z‖²,   (1.23)
for all x and z in IRn. It follows that a function in this class has a unique minimizer
x∗. Although the DFP method is on the boundary of the convex class, it has not
been shown to be globally convergent, even on uniformly convex functions (see
Nocedal [36]).
Order of convergence of quasi-Newton methods
The order of convergence of a sequence has been defined in Section 1.1. The
method of steepest descent, which sets pk = −gk for all k, is known to con-
verge linearly from any starting point (see, for example, Gill et al. [22, p. 103]).
This poor rate of convergence occurs because steepest descent uses no second-
derivative information (the method implicitly chooses Bk = I for all k). On the
other hand, Newton’s method can be shown to converge quadratically for x0 suf-
ficiently close to x∗ if ∇2f(x) is nonsingular and satisfies the Lipschitz condition
(1.5) at x∗. Since quasi-Newton methods use an approximation to the Hessian,
they might be expected to converge at a rate between linear and quadratic. This
is indeed the case.
The following order of convergence results apply to the general quasi-
Newton method given in Algorithm 1.2. It has been shown that xk converges
superlinearly to x∗ if and only if

lim_{k→∞} ‖(B_k − ∇²f(x∗)) s_k‖ / ‖s_k‖ = 0   (1.24)
(see Dennis and Moré [11]). Hence, the approximate curvature must converge to
the curvature in f along the unit directions sk/‖sk‖. In a quasi-Newton method
using a Wolfe line search, it has been shown that if the search direction approaches
the Newton direction asymptotically, the step length αk = 1 is acceptable for large
enough k (see Dennis and Moré [12]).
Suppose now that a quasi-Newton method using updates from the con-
vex class converges to a point x∗ such that ∇2f(x∗) is nonsingular. In this case,
if f is convex, Powell has shown that the BFGS method with a Wolfe line search
converges superlinearly as long as the unit step length is taken whenever possible
(see [40]). This result has been extended to every member of the convex class of
Broyden updates except the DFP update (see Byrd et al. [6]).
The DFP method has not been shown to be superlinearly convergent
when using a Wolfe line search. However, there are convergence results concerning
the application of the DFP method using an exact line search (see Nocedal [36]
for further discussion).
In Section 1.2.1, it was noted that if Algorithm 1.2 with exact line search
is applied to a strictly convex quadratic function, and the steps sk (0 ≤ k ≤ n−1)
are nonzero, then Bn = H. When applied to general functions, it should be noted
that B_k need not converge to ∇²f(x∗) even when x_k converges to x∗ (see Dennis and Moré [11]).
The global and superlinear convergence of Algorithm 1.2 when applied
to general f using a Wolfe line search remains an open question.
1.3 Computation of the search direction
Various methods for solving the system Bkpk = −gk in a practical implementation
of Algorithm 1.2 are discussed in this section.
1.3.1 Notation
For simplicity, the subscript k is suppressed in much of what follows. Bars, tildes
and cups are used to define updated quantities obtained during the kth iteration.
Underlines are sometimes used to denote quantities associated with xk−1. The use
of the subscript will be retained in the definition of sets that contain a sequence of
quantities belonging to different iterations, e.g., {g_0, g_1, . . . , g_k}. Also, for clarity,
the use of subscripts will be retained in the statement of results.
Throughout the thesis, Ij denotes the j × j identity matrix, where j
satisfies 1 ≤ j < n. The matrix I is reserved for the n× n identity matrix. The
vector ei denotes the ith column of an identity matrix whose order depends on
the context.
If u ∈ IRn and v ∈ IRm, then (u, v)T denotes the column vector of order
n+m whose components are the components of u and v.
1.3.2 Using Cholesky factors
The equations Bp = −g can be solved if an upper-triangular matrix R is known such that B = R^T R. If B̄ is obtained from B using a Broyden update, then an upper-triangular matrix R̄ satisfying B̄ = R̄^T R̄ can be obtained from a rank-one update to R (see Goldfarb [24], Dennis and Schnabel [10]). In particular, the BFGS update can be written as

R̄ = S(R + u(w − R^T u)^T),  where u = Rs/‖Rs‖ and w = y/(y^T s)^{1/2},   (1.25)

and S is an orthogonal matrix that transforms R + u(w − R^T u)^T to upper-triangular form.
Since many choices of S yield an upper-triangular R̄, we now describe the particular choice used throughout this thesis. The matrix S is of the form S = S_2 S_1, where S_1 and S_2 are products of Givens matrices. The matrix S_1 is defined by S_1 = P_{n,1} · · · P_{n,n−2} P_{n,n−1}, where P_{n,j} (1 ≤ j ≤ n − 1) is a Givens matrix in the (j, n) plane designed to annihilate the jth element of P_{n,j+1} · · · P_{n,n−1} u. The product S_1 R is upper triangular except for the presence of a "row spike" in the nth row. Since S_1 u = ±e_n, the matrix S_1(R + u(w − R^T u)^T) is also upper triangular except for a row spike in the nth row. This matrix is restored to upper-triangular form using a second product of Givens matrices. In particular, S_2 = P_{n−1,n} P_{n−2,n} · · · P_{1,n}, where P_{i,n} (1 ≤ i ≤ n − 1) is a Givens matrix in the (i, n) plane defined to annihilate the (n, i) element of P_{i−1,n} · · · P_{1,n} S_1(R + u(w − R^T u)^T).
For simplicity, the BFGS update (1.25) and the Broyden update to R will be written

R̄ = BFGS(R, s, y)  and  R̄ = Broyden(R, s, y).   (1.26)
The form of S will be as described in the last paragraph.
Another choice of S, for which S_1(R + u(w − R^T u)^T) is upper Hessenberg, is described by Gill, Golub, Murray and Saunders [17]. Goldfarb prefers
to write the update as a product of R and a rank-one modification of the iden-
tity. This form of the update is also easily restored to upper-triangular form (see
Goldfarb [24]).
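A simple (though O(n³) rather than O(n²)) way to realize (1.25) in code is to form R + u(w − R^T u)^T and re-triangularize with a QR factorization, which plays the role of the orthogonal matrix S; the Givens-based construction described above achieves the same result more cheaply. A sketch, with names that are ours:

    import numpy as np

    def bfgs_update_R(R, s, y):
        # Update the factor R (B = R'R) after a BFGS step, as in (1.25)-(1.26).
        Rs = R @ s
        u = Rs / np.linalg.norm(Rs)
        w = y / np.sqrt(y @ s)               # requires approximate curvature y's > 0
        A = R + np.outer(u, w - R.T @ u)     # rank-one modification of R
        return np.linalg.qr(A)[1]            # re-triangularize; QR supplies S

One can check that the returned factor R̄ satisfies R̄^T R̄ = B̄, where B̄ is the BFGS update (1.12) of B = R^T R; as noted below, the diagonal of R̄ is not restricted in sign.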
Some authors reserve the term “Cholesky factor” of a positive definite
matrix B to mean the triangular factor with positive diagonals satisfying B =
RTR. However, throughout this thesis, the diagonal components of R are not
restricted in sign, but R will be called “the” Cholesky factor of B.
1.3.3 Using conjugate-direction matrices
Since B is symmetric positive definite, there exists a nonsingular matrix V such that V^T B V = I. The columns of V are said to be "conjugate" with respect to B. In terms of V, the approximate Hessian satisfies

B^{−1} = V V^T,   (1.27)

which implies that the solution of (1.7) may be written as

p = −V V^T g.   (1.28)

If B̄ is defined by the BFGS formula (1.12), then a formula for V̄ satisfying V̄^T B̄ V̄ = I can be obtained from the product form of the BFGS update (see Brodlie, Gourlay, and Greenstadt [3]). The formula is given by

V̄ = (I − s u^T) V Ω,  where u = (Bs)/((s^T y)^{1/2} (s^T B s)^{1/2}) + y/(s^T y)   (1.29)

and Ω is an orthogonal matrix.
Powell has proposed that Ω be defined as follows. Let Ṽ denote the product V Ω. The matrix Ω is chosen as a lower-Hessenberg matrix such that the first column of Ṽ is parallel to s (see Powell [42]). Let g_V be defined as

g_V = V^T g,   (1.30)

and define Ω such that Ω^T = P_{1,2} P_{2,3} · · · P_{n−1,n}, where P_{i,i+1} is a rotation in the (i, i+1) plane chosen to annihilate the (i+1)th component of P_{i+1,i+2} · · · P_{n−1,n} g_V. Then Ω is an orthogonal lower-Hessenberg matrix such that Ω^T g_V = ‖g_V‖ e_1. Furthermore, (1.28) and the relation s = αp give

Ṽ e_1 = −(1/(α‖g_V‖)) s.   (1.31)

Hence, the first column of Ṽ is parallel to s.
With this choice of Ω, Powell shows that the columns v̄_i of V̄ satisfy

v̄_i = s/(s^T y)^{1/2} if i = 1;  v̄_i = ṽ_i − (ṽ_i^T y/(s^T y)) s otherwise,   (1.32)

where ṽ_i denotes the ith column of Ṽ. Note that the matrix B in the update (1.29) has been eliminated in the formulae (1.32).
Formulae have also been derived for matrices V̄ that satisfy V̄^T B̄ V̄ = I, where B̄ is any Broyden update to B (see Siegel [47]).
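Since (1.29) produces a conjugate-direction matrix for any orthogonal Ω, taking Ω = I gives a directly checkable sketch of the update (the names are ours; this illustrates (1.29) itself, not Powell's particular choice of Ω).

    import numpy as np

    def update_V(V, B, s, y):
        # Conjugate-direction BFGS update (1.29) with Omega = I.
        Bs = B @ s
        u = Bs / np.sqrt((s @ y) * (s @ Bs)) + y / (s @ y)
        return V - np.outer(s, u @ V)        # (I - s u') V

    # If V V' = inv(B), the result Vbar satisfies Vbar Vbar' = inv(Bbar),
    # where Bbar is the BFGS update (1.12) of B.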
1.4 Transformed and reduced Hessians
Let Q denote an n×n orthogonal matrix and let B denote a positive-definite ap-
proximation to ∇2f(x). The matrix QTBQ is called the transformed approximate
Hessian. If Q is partitioned as Q = ( Z W ), the transformed Hessian has a
corresponding partition
Q^T B Q = ( Z^T B Z   Z^T B W
            W^T B Z   W^T B W ).

The positive-definite submatrices Z^T B Z and W^T B W are called reduced approximate Hessians.
Transformed Hessians are often used in the solution of constrained op-
timization problems (see, for example, Gill et al. [21]). In the next chapter, a
particular choice of Q will be seen to give block-diagonal structure to the ap-
proximate Hessians associated with quasi-Newton methods for unconstrained op-
timization. This simplification leads to another technique for solving Bp = −g
that involves a reduced Hessian. Reduced Hessian quasi-Newton methods using
this technique are the subject of Chapter 2.
Chapter 2
Reduced-Hessian Methods for Unconstrained Optimization
In her dissertation, Fenelon [14] has shown that the BFGS method accu-
mulates approximate curvature information in a sequence of expanding subspaces.
This feature is used to show that the BFGS search direction can often be gen-
erated with matrices of smaller dimension than the approximate Hessian. Use
of these reduced approximate Hessians leads to a variant of the BFGS method
that can be used to solve problems whose Hessians may be too large to store.
In this chapter, reduced Hessian methods are reviewed from Fenelon’s point of
view. A reduced inverse Hessian method, due to Siegel [46], is reviewed in Sec-
tion 2.2. Fenelon’s and Siegel’s work is extended in Sections 2.3–2.5, giving new
reduced-Hessian methods that utilize the Broyden class of updates.
2.1 Fenelon’s reduced-Hessian BFGS method
Using the equations B_i p_i = −g_i and s_i = α_i p_i for 0 ≤ i ≤ k, the BFGS updates from B_0 to B_k can be "telescoped" to give

B_k = B_0 + Σ_{i=0}^{k−1} ( (g_i g_i^T)/(g_i^T p_i) + (y_i y_i^T)/(s_i^T y_i) ).   (2.1)

If B_0 = σI (σ > 0), then (2.1) can be used to show that the solution of B_k p_k = −g_k is given by

p_k = −(1/σ) g_k − (1/σ) Σ_{i=0}^{k−1} ( ((g_i^T p_k)/(g_i^T p_i)) g_i + ((y_i^T p_k)/(s_i^T y_i)) y_i ).   (2.2)

Hence, if G_k denotes the set of vectors

G_k = {g_0, g_1, . . . , g_k},   (2.3)

then (2.2) implies that p_k ∈ span(G_k). The following lemma summarizes this result.
Lemma 2.1 (Fenelon) If the BFGS method is used to solve the unconstrained
minimization problem (1.1) with B0 = σI (σ > 0), then pk ∈ span(Gk) for all k.
Using this result, Fenelon has shown that if Z_k is a full-rank matrix such that range(Z_k) = span(G_k), then

p_k = Z_k p_Z,  where p_Z = −(Z_k^T B_k Z_k)^{−1} Z_k^T g_k.   (2.4)

This form of the search direction implies a reduced-Hessian implementation of the BFGS method employing Z_k and an upper-triangular matrix R_Z such that R_Z^T R_Z = Z_k^T B_k Z_k.
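In code, (2.4) replaces the n × n system B_k p_k = −g_k by an r_k × r_k pair of triangular solves. A minimal sketch, assuming Z has orthonormal columns spanning span(G_k) and R_Z is the Cholesky factor of Z^T B_k Z:

    import numpy as np

    def reduced_direction(Z, R_Z, g):
        # Search direction p = Z p_Z of (2.4), with R_Z'R_Z = Z'B_k Z.
        gZ = Z.T @ g
        tZ = np.linalg.solve(R_Z.T, -gZ)     # forward substitution
        pZ = np.linalg.solve(R_Z, tZ)        # back substitution
        return Z @ pZ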
2.1.1 The Gram-Schmidt process
The matrix Zk is obtained from Gk using the Gram-Schmidt process. This process
gives an orthonormal basis for Gk. The choice of orthonormal basis is motivated
by the result
cond(Z_k^T B_k Z_k) ≤ cond(B_k)  if Z_k^T Z_k = I_{r_k}
(see Gill et al. [22, p. 162]).
To simplify the description of this process we drop the subscript k, as discussed in Section 1.3.1. At the start of the first iteration, Z is initialized to g_0/‖g_0‖. During the kth iteration, assume that the columns of Z approximate an orthonormal basis for span(G). The matrix Z̄ is defined so that range(Z̄) = span(G ∪ {ḡ}) as follows. The vector ḡ can be uniquely written as ḡ = g_R + g_N, where g_R ∈ range(Z) and g_N ∈ null(Z^T). The vector g_R satisfies g_R = ZZ^T ḡ, which implies that the component of ḡ orthogonal to range(Z) satisfies g_N = ḡ − ZZ^T ḡ = (I − ZZ^T)ḡ. Let z_g denote the normalized component of ḡ orthogonal to range(Z). If we define ρ_g = ‖g_N‖, then z_g = g_N/ρ_g. Note that if ρ_g = 0, then ḡ ∈ range(Z). In this case, we will define Z̄ = Z.
To summarize, if r denotes the column dimension of Z, we define

r̄ = r if ρ_g = 0;  r̄ = r + 1 otherwise.   (2.5)

In terms of r̄, the vector z_g and the matrix Z̄ satisfy

z_g = 0 if r̄ = r;  z_g = (1/ρ_g)(I − ZZ^T)ḡ otherwise,   (2.6)

and

Z̄ = Z if r̄ = r;  Z̄ = ( Z  z_g ) otherwise.   (2.7)
It is well-known that the Gram-Schmidt process is unstable in the pres-
ence of computer round-off error (see Golub and Van Loan [25, p. 218]). Several
methods have been proposed to stabilize the process. These methods are given
in Table 2.1. The advantages and disadvantages of each method are also given in
the table. Note that a “flop” is defined as a multiplication and an addition. The
flop counts given in the table are only approximations of the actual counts. The
value of 3.2nr flops for the reorthogonalization process is an average that results
if 3 reorthogonalizations are performed every 5 iterations.
Table 2.1: Alternate methods for computing Z̄

Method                                   Advantage            Disadvantage
Gram-Schmidt (2nr flops)                 Simple               Unstable
Modified Gram-Schmidt                    More stable than GS  Z̄ must be recomputed each iteration
Gram-Schmidt with reorthogonalization    Stable               Expensive, e.g., 3.2nr flops
  (Daniel et al. [8], Fenelon [14])
Implicit Z (Siegel [46])                 nr + O(r²) flops     Expensive if r is large
Another technique for stabilizing the process, suggested by Daniel et al. [8] (and used by Siegel [46]), is to ignore the component of ḡ orthogonal to range(Z) if it is small (but possibly nonzero) relative to ‖ḡ‖. In this case, the definition of r̄ satisfies

r̄ = r if ρ_g ≤ ε‖ḡ‖;  r̄ = r + 1 otherwise,   (2.8)

where ε ≥ 0 is a preassigned constant.
The matrix Z̄ that results when this definition of r̄ is used has properties that depend on the choice of ε. If ε = 0, then in exact arithmetic the columns of Z̄ form an orthonormal basis for span(G). Moreover, for any ε ≥ 0, the columns of Z̄ form an orthonormal basis for the span of a subset of G. If K_ε = {k_1, k_2, . . . , k_r} denotes the set of indices for which ρ_g > ε‖ḡ‖ and G_ε = ( g_{k_1} g_{k_2} · · · g_{k_r} ) is the matrix of corresponding gradients, then the columns of Z̄ form an orthonormal basis for range(G_ε). Gradients satisfying ρ_g > ε‖ḡ‖ are said to be "accepted"; otherwise, they are said to be "rejected". Hence, G_ε is the matrix of accepted gradients associated with a particular choice of ε. Note that the column dimension of Z̄ is nondecreasing with k.
During iteration k + 1, the vector ḡ_Z (ḡ_Z = Z̄^T ḡ) is needed to compute the next search direction p. Since

ḡ_Z = Z^T ḡ if r̄ = r;  ḡ_Z = (Z^T ḡ, ρ_g)^T otherwise,   (2.9)

this quantity is a by-product of the computation of Z̄.

If r̄, ḡ_Z and Z̄ satisfy (2.8), (2.9) and (2.7), then we will write

(Z̄, ḡ_Z, r̄) = GS(Z, ḡ, r, ε).   (2.10)
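The procedure GS of (2.10) translates directly into code. The following Python sketch (gbar stands for ḡ) accepts or rejects the new gradient according to (2.8) and returns the quantities in (2.7)–(2.9):

    import numpy as np

    def GS(Z, gbar, r, eps):
        # Gram-Schmidt acceptance test: possibly extend the basis Z by gbar.
        gZ = Z.T @ gbar
        gN = gbar - Z @ gZ                        # component orthogonal to range(Z)
        rho = np.linalg.norm(gN)
        if rho <= eps * np.linalg.norm(gbar):     # reject gbar: basis unchanged (2.8)
            return Z, gZ, r
        Zbar = np.column_stack([Z, gN / rho])     # accept gbar: append z_g (2.7)
        return Zbar, np.append(gZ, rho), r + 1    # reduced gradient as in (2.9)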
2.1.2 The BFGS update to RZ
If Z, g_Z and R_Z are known during the kth iteration of a reduced-Hessian method, then p is computed using (2.4). Following the calculation of x̄ in the line search, ḡ is either rejected or added to the basis defined by Z. It remains to define a matrix R̄_Z satisfying Z̄^T B̄ Z̄ = R̄_Z^T R̄_Z, where B̄ is obtained from B using the BFGS update.

Let y_Z denote the quantity Z̄^T y. If ḡ is rejected, Fenelon employs the method of Gill et al. [17] to obtain R̄_Z from R_Z via two rank-one updates involving g_Z and y_Z. If ḡ is accepted, R̄_Z can be partitioned as

R̄_Z = ( R̃_Z   R_g
         0     φ ),  where φ is a scalar.

The matrix R̃_Z is obtained from R_Z using g_Z and y_Z. The following lemma is used to define R_g and φ.
Lemma 2.2 (Fenelon) If z_g denotes the normalized component of g_{k+1} orthogonal to span(G_k), then

Z^T B_{k+1} z_g = (y^T z_g/(s^T y)) y_Z  and  z_g^T B_{k+1} z_g = σ + (z_g^T y)²/(s^T y).   (2.11)

(Although the relation z_g^T g = 0 is used in the proof of Lemma 2.2, it was not used to simplify (2.11).) The solution of an upper-triangular system involving R_Z and (y^T z_g/(s^T y)) y_Z is used to define R_g. The value φ is then obtained from R_g and z_g^T B̄ z_g.
2.2 Reduced inverse Hessian methods
Many quasi-Newton algorithms are defined in terms of the inverse approximate Hessian H_k = B_k^{−1}. The Broyden update to H_k is

H_{k+1} = M_k H_k M_k^T + (s_k s_k^T)/(s_k^T y_k) − ψ_k (y_k^T H_k y_k) r_k r_k^T,   (2.12)

where M_k = I − (s_k y_k^T)/(s_k^T y_k) and r_k = (H_k y_k)/(y_k^T H_k y_k) − s_k/(s_k^T y_k). The parameter φ_k is related to ψ_k by the equation

φ_k (ψ_k − 1)(y_k^T H_k y_k)(s_k^T B_k s_k) = ψ_k (φ_k − 1)(s_k^T y_k)².
Note that the values ψk = 0 and ψk = 1 correspond to the BFGS and the DFP
updates respectively.
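A Python sketch of (2.12), mirroring the earlier sketch of (1.14) (the names are ours; ψ = 0 gives the BFGS and ψ = 1 the DFP inverse update):

    import numpy as np

    def broyden_update_H(H, s, y, psi=0.0):
        # Inverse-form Broyden update (2.12).
        sy = s @ y
        Hy = H @ y
        M = np.eye(len(s)) - np.outer(s, y) / sy
        r = Hy / (y @ Hy) - s / sy
        return (M @ H @ M.T + np.outer(s, s) / sy
                - psi * (y @ Hy) * np.outer(r, r))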
Siegel [46] gives a more general result than Lemma 2.1 that applies to
the entire Broyden class. The result is stated below without proof.
Lemma 2.3 (Siegel) If Algorithm 1.2 is used to solve the unconstrained min-
imization problem (1.1) with B0 = σI (σ > 0) and a Broyden update, then
pk ∈ span(Gk) for all k. Moreover, if z ∈ span(Gk) and w ∈ span(Gk)⊥, then
B_k z ∈ span(G_k), H_k z ∈ span(G_k), B_k w = σ w and H_k w = σ^{−1} w.
Let G_k denote the matrix of the first k + 1 gradients. For simplicity, assume that these gradients are linearly independent and that k is less than n. Since G_k has full column rank, it has a QR factorization of the form

G_k = Q_k ( T_k
             0 ),  where Q_k^T Q_k = I,   (2.13)

and T_k is nonsingular and upper triangular. Define r_k = dim(span(G_k)), and partition Q_k = ( Z_k  W_k ), where Z_k ∈ IR^{n×r_k}. Note that the product G_k = Z_k T_k defines a "skinny" QR factorization of G_k (see Golub and Van Loan [25, p. 217]). The columns of Z_k form an orthonormal basis for range(G_k) and the columns of W_k form an orthonormal basis for null(G_k^T). If the first k + 1 gradients are not linearly independent, Q_k is defined as in (2.13), except that G_k^0 (the matrix of accepted gradients corresponding to ε = 0) is used in place of G_k. Hence, the first r_k columns of Q_k are still an orthonormal basis for span(G_k).
Consider the transformed inverse Hessian Q_k^T H_k Q_k. Lemma 2.3 implies that if H_0 = σ^{−1} I, then Q_k^T H_k Q_k is block diagonal and satisfies

Q_k^T H_k Q_k = ( Z_k^T H_k Z_k    0
                  0               σ^{−1} I_{n−r_k} ).   (2.14)

As the equation for the search direction in terms of H_k satisfies p_k = −H_k g_k, we have Q_k^T p_k = −(Q_k^T H_k Q_k) Q_k^T g_k. It follows that p_k = −Z_k (Z_k^T H_k Z_k) Z_k^T g_k since W_k^T g_k = 0. This form of the search direction leads to a reduced inverse Hessian method employing Z_k and Z_k^T H_k Z_k. Instead of using reorthogonalization for stability, Siegel defines Z_k implicitly in terms of G_k and a nonsingular upper-triangular matrix similar to T_k in (2.13) (see Siegel [46] for further details). This form of Z_k has some advantages in the case of large-scale unconstrained optimization (see Table 2.1).
2.3 An extension of Fenelon’s method
Lemma 2.3 is now used to show that p_k is of the form (2.4) when B_k is updated using any member of the Broyden class. Let Q_k be defined as in Section 2.2, i.e., Q_k = ( Z_k  W_k ), where range(Z_k) = span(G_k^0) and Q_k^T Q_k = I. If B_0 = σI and B_k is updated using (1.14), then Lemma 2.3 implies that

Q_k^T B_k Q_k = ( Z_k^T B_k Z_k    0
                  0               σ I_{n−r_k} ).   (2.15)

The equation for the search direction can be written as (Q_k^T B_k Q_k) Q_k^T p_k = −Q_k^T g_k. Since W_k^T g_k = 0, it follows from the form of the transformed Hessian (2.15) that p_k satisfies (2.4).
Since W Tk gk = 0, it follows from the form of the transformed Hessian (2.15) that
pk satisfies (2.4).
The curvature of the quadratic model (1.7) along any unit vector in
range(Wk) depends only on the choice of B0 and has no effect on pk. All relevant
curvature in Bk is contained in the reduced Hessian ZTkBkZk. Since rk+1 ≥ rk
for all k, the curvature in the quadratic model used to define pk accumulates in
subspaces of nondecreasing dimension.
Let Q_{k+1} denote an update to Q_k satisfying

G_{k+1}^0 = Q_{k+1} ( T_{k+1}
                       0 ),

where T_{k+1} is nonsingular and upper triangular. Partition Q_{k+1} as Q_{k+1} = ( Z_{k+1}  W_{k+1} ), where Z_{k+1} ∈ IR^{n×r_{k+1}}. Furthermore, let Z_{k+1} be defined so that its first r_k columns are identical to Z_k. In the remainder of the section, the subscript k will be omitted.
Let R_Q and R̃_Q denote upper-triangular matrices such that R_Q^T R_Q = Q^T B Q and R̃_Q^T R̃_Q = Q̄^T B Q̄. Since the first r columns of Q and Q̄ are identical, Lemma 2.3 implies that the matrices Q^T B Q and Q̄^T B Q̄ are identical. Hence, the form of the transformed Hessian given by (2.15) implies that R̃_Q is of the form

R̃_Q = ( R̃_Z   0
         0     σ^{1/2} I_{n−r̄} ),   (2.16)

where R̃_Z satisfies

R̃_Z = R_Z if r̄ = r;  R̃_Z = ( R_Z   0
                               0     σ^{1/2} ) if r̄ = r + 1.   (2.17)
Define the transformed vectors s_Q = Q̄^T s and y_Q = Q̄^T y. Let R̄_Q denote the Cholesky factor of Q̄^T B̄ Q̄, where B̄ is obtained from B using a Broyden update. The following lemma follows from the definitions of B̄, s_Q and y_Q.

Lemma 2.4 If R_Q, R̃_Q and R̄_Q satisfy R_Q^T R_Q = Q^T B Q, R̃_Q^T R̃_Q = Q̄^T B Q̄ and R̄_Q^T R̄_Q = Q̄^T B̄ Q̄, then

R̄_Q = Broyden(R̃_Q, s_Q, y_Q).   (2.18)

Hence, the updated Cholesky factor of the transformed Hessian is obtained in the same way as R̄ in (1.26), except that s_Q and y_Q are used in place of s and y.
Lemma 2.3 and the definition of y imply that

s_Q = (s_Z, 0)^T  and  y_Q = (y_Z, 0)^T,   (2.19)

where s_Z = Z̄^T s and y_Z = Z̄^T y. A simplification of the Broyden update results from the special form of s_Q and y_Q.
Lemma 2.5 If s_Q and y_Q are of the form (2.19) and R̃_Q satisfies (2.16), then R̄_Q = Broyden(R̃_Q, s_Q, y_Q) satisfies

R̄_Q = ( R̄_Z   0
         0     σ^{1/2} I_{n−r̄} ),  where R̄_Z = Broyden(R̃_Z, s_Z, y_Z).

Since R̄_Q^T R̄_Q = Q̄^T B̄ Q̄, and Q̄^T B̄ Q̄ satisfies (2.15) post-dated one iteration, R̄_Z is the Cholesky factor of Z̄^T B̄ Z̄. It follows that the Cholesky factor corresponding to the updated reduced Hessian can be obtained directly from R̃_Z using the reduced quantities s_Z and y_Z.
This discussion leads to the definition of reduced-Hessian methods using
updates from the Broyden class. We first present a version of these methods that
is identical in exact arithmetic to the corresponding quasi-Newton method. This
method will serve as a template for the more practical reduced-Hessian methods
that follow.
Algorithm 2.1. Template reduced-Hessian quasi-Newton method

Initialize k = 0, r_0 = 1; Choose x_0 and σ;
Initialize Z = g_0/‖g_0‖, g_Z = ‖g_0‖ and R_Z = σ^{1/2};
while not converged do
    Solve R_Z^T t_Z = −g_Z and R_Z p_Z = t_Z, and set p = Z p_Z;
    Compute α so that s^T y > 0 and set x̄ = x + αp;
    Compute (Z̄, ḡ_Z, r̄) = GS(Z, ḡ, r, 0);
    Form R̃_Z according to (2.17);
    if r̄ = r then
        leave g_Z and p_Z unchanged;
    else
        replace g_Z by (g_Z, 0)^T and p_Z by (p_Z, 0)^T;
    end if
    Compute s_Z = α p_Z and y_Z = ḡ_Z − g_Z;
    Compute R̄_Z = Broyden(R̃_Z, s_Z, y_Z);
    k ← k + 1;
end do
The columns of Z form an orthonormal basis for span(G) since ε = 0. The definition of the padded g_Z follows because, when ḡ is accepted, Z̄^T g = (Z^T g, z_g^T g)^T = (g_Z, 0)^T since g ∈ range(Z). A similar argument implies that the form of p_Z is correct.
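Stringing the earlier sketches together gives a compact model implementation of Algorithm 2.1 (ε = 0, BFGS updates and, for brevity, an exact line search on a quadratic test function). GS is the sketch from Section 2.1.1 and the factor update mimics the QR-based sketch of Section 1.3.2; this illustrates the structure only and is not the author's implementation.

    import numpy as np

    def bfgs_update_R(R, s, y):
        # Cholesky-factor BFGS update via QR, as sketched in Section 1.3.2.
        Rs = R @ s
        u = Rs / np.linalg.norm(Rs)
        A = R + np.outer(u, y / np.sqrt(y @ s) - R.T @ u)
        return np.linalg.qr(A)[1]

    def algorithm_rh(x, grad, hess_vec, sigma=1.0, tol=1e-8, max_iter=200):
        # Template reduced-Hessian BFGS method on a quadratic (exact line search).
        g = grad(x)
        Z = (g / np.linalg.norm(g)).reshape(-1, 1)
        gZ = np.array([np.linalg.norm(g)])
        RZ = np.array([[np.sqrt(sigma)]])
        r = 1
        for k in range(max_iter):
            if np.linalg.norm(g) <= tol:
                break
            pZ = np.linalg.solve(RZ, np.linalg.solve(RZ.T, -gZ))
            p = Z @ pZ
            alpha = -(g @ p) / (p @ hess_vec(p))   # exact step for a quadratic
            x = x + alpha * p
            gbar = grad(x)
            Z, gZbar, rbar = GS(Z, gbar, r, 0.0)   # Section 2.1.1 sketch
            if rbar > r:                           # pad RZ, gZ, pZ as in (2.17)
                RZ = np.block([[RZ, np.zeros((r, 1))],
                               [np.zeros((1, r)), np.array([[np.sqrt(sigma)]])]])
                gZ, pZ = np.append(gZ, 0.0), np.append(pZ, 0.0)
            RZ = bfgs_update_R(RZ, alpha * pZ, gZbar - gZ)
            g, gZ, r = gbar, gZbar, rbar
        return x

    # Example: algorithm_rh(np.zeros(8), lambda x: H @ x + c, lambda v: H @ v)
    # reproduces the full BFGS iterates while factoring only an r x r reduced Hessian.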
Round-off error can cause the computed value of ρ_g to be inaccurate when ḡ is nearly in range(Z). For this reason, we consider a modification of Algorithm 2.1 that employs a positive value for ε. In this case, the following comment is made with regard to the definition of g_Z.

Consider the case when g has been rejected and ḡ is accepted. At the end of iteration k, the padded vector g_Z satisfies Z̄^T g = (g_Z, z_g^T g)^T. Note that z_g^T g may be nonzero since g might have been rejected with 0 < ρ_g ≤ ε‖g‖. In this case, g_Z is not of the form given in Algorithm 2.1. We take the suggestion of Siegel [46] and define the update in terms of an approximation g_Z^ε defined by g_Z^ε = (g_Z, 0)^T. The quantity y_Z is replaced by the approximation y_Z^ε defined by y_Z^ε = ḡ_Z − g_Z^ε.
This discussion leads to the definition of the following reduced-Hessian
algorithm.
Algorithm 2.2. Reduced-Hessian quasi-Newton method (RH)
Initialize k = 0, r0 = 1; Choose x0, σ and ε;
Initialize Z = g0/‖g0‖, gZ = ‖g0‖ and RZ = σ1/2;
while not converged do
Reduced-Hessian Methods for Unconstrained Optimization 29
Solve RTZ tZ = −gZ, RZpZ = tZ, and set p = ZpZ;
Compute α so that sTy > 0 and set x = x+ αp;
Compute (Z, gZ, r) = GS(Z, g, r, ε);
Form RZ according to (2.17);
if r = r then
Define gεZ = gZ and pZ = pZ;
else
Define gεZ = (gZ, 0)T and pZ = (pZ, 0)T ;
Compute sZ = αpZ and yεZ = gZ − gε
Z;
Compute RZ = Broyden(RZ, sZ, yεZ);
k ← k + 1;
end do
Note that the Broyden update is well defined as long as sTy > 0 since
sTZy
εZ = sT
ZyZ = sTQyQ = sTy. (2.20)
2.4 The effective approximate Hessian
As suggested by Nazareth [34], we define an effective approximate Hessian Bε in
terms of Z, RZ, and the implicit matrix W . In particular, with Q = ( Z W ),
Bε is given by
Bε = Q(RεQ)TRε
QQT , where Rε
Q =
RZ 0
0 σ1/2In−r
.The quadratic model associated with Bε is denoted by qε(p) and satisfies
qε(p) = f(x) + gTp+ 12pTBεp.
It can be verified that the search direction p = ZpZ defined in Algorithm RH
minimizes qε(p) in range(Z).
Reduced-Hessian Methods for Unconstrained Optimization 30
It is important to note that if ε > 0, Bε may not be equal to the
approximate Hessian B generated by Algorithm 1.2. To see this, suppose that
the first k+ 1 gradients are accepted and that RZ is updated as described above.
During iteration k, suppose that g is not accepted, but that 0 < ρg ≤ ε‖g‖. This
implies that the component of g orthogonal to range(Z) is nonzero. Since g is
not accepted and g has been accepted, it follows that yQ satisfies
yQ =
ZT g
W T g
− ZTg
W Tg
=
ZTy
W T g
. (2.21)
Since 0 < ρg ≤ ε‖g‖, it follows that W T g 6= 0 (note that ‖W T g‖ = ρg) and that yQ
does not satisfy the hypothesis (2.19) of Lemma 2.5. If RQ = Broyden(RQ, sQ, yQ),
where RQ satisfies (2.16) and yQ satisfies (2.21), then RQ is generally a dense
upper-triangular matrix (although the elements corresponding to W are “small”),
which is not equal to RεQ.
The structure of RεQ corresponds to an approximate gradient g ε defined
by g ε = ZgZ. Note that g 0 = g and that g ε = g, whenever g is accepted.
The vector gε is similarly defined as gε = ZgZ. In terms of these approximate
gradients, yε is defined by yε = g ε − gε. Since
yεQ = QTyε =
ZTyε
W Tyε
=
yεZ
0
,the following lemma holds.
Lemma 2.6 Let Z ∈ IRn×r and let RεQ denote a nonsingular upper-triangular
matrix of the form
RεQ =
RZ 0
0 σIn−r
, where RZ ∈ IRr×r.
Let r and Z be defined by (Z, r) = GS(Z, g, r, ε). Define
RεQ =
RZ 0
0 σIn−r
, where RZ is defined by (2.17).
Reduced-Hessian Methods for Unconstrained Optimization 31
If RεQ = Broyden(Rε
Q, sQ, yεQ), then
RεQ =
RZ 0
0 σIn−r
, where RZ = Broyden(RZ, sZ, yεZ).
Proof. Since s ∈ range(Z) and the first r columns of Q are the columns of Z, it
follows that sQ = (sZ, 0)T. A short calculation verifies that the first r components
of yεQ are given by yZ. Hence,
sTQy
εQ = sT
ZyZ = sTQyQ = sTy > 0, (2.22)
which implies that RεQ is well defined. An identical argument shows that RZ is
well defined. The rest of the proof follows from the form of sQ and yεQ and the
definition of the Broyden update to the Cholesky factor.
2.5 Lingering on a subspace
We have seen that quasi-Newton methods gain approximate curvature in a se-
quence of expanding subspaces whose dimensions are given by dim(span(Gk)).
The subspace span(Gk) ⊕ span(x0) is the manifold determined by x0 and Gk
and will be denoted by Mk(Gk, x0). Because of the form of pk, it is clear that
x0, x1, . . . , xk lies in Mk(Gk, x0). Moreover, as will be shown in Chapter 5,
each iteration that dim(span(Gk)) increases, the iterates “step into” the cor-
responding larger subspace. Hence, x0, x1, . . . , xk spans Mk(Gk, x0). This
property also holds if Algorithm RH is used with a positive value of ε, i.e.,
spanx0, x1, . . . , xk =Mk(Gεk, x0) for any ε ≥ 0.
We now consider a modification of Algorithm RH that employs a scheme
in which successive iterates “linger” on a manifold smaller than Mk(Gεk, x0).
Both Fenelon [14] and Siegel [45] have considered lingering as an optimization
Reduced-Hessian Methods for Unconstrained Optimization 32
strategy. Our experience has shown that lingering can be beneficial, especially
when combined with the rescaling strategies defined in Chapter 3. We develop
the lingering strategy as a modification of Algorithm RH during iteration k. The
iteration subscript is again dropped as described in Section 1.3.1.
Suppose that g is accepted during iteration k − 1 of Algorithm RH. It
follows that Z satisfies Z = ( Z zg ) at the start of iteration k (recall that Z is
defined by Z = Zk−1). Define U = Z and Y = zg so that Z = ( U Y ) and let
l denote the number of columns in U , which is r − 1 in this case. Partition RZ
according to
RZ =
RU RUY
RY
, where RU ∈ IRl×l. (2.23)
The search direction defined in Algorithm RH satisfies
pr = ZpZ, where RTZ tZ = −gZ and RZpZ = tZ. (2.24)
The superscript r has been added to emphasize that the search direction is ob-
tained from the r-dimensional subspace range(Z). A unit step along pr min-
imizes the quadratic model qε(p) in range(Z). Partition gZ = (gU , gY )T and
tZ = (tU , tY )T , where gU = UTg, gY = Y Tg, tU ∈ IRl and tY ∈ IRr−l. Note that the
partition of RZ given by (2.23) and the equation RZtZ = −gZ imply that
RTU tU = −gU and RT
Y tY = −(RTUY tU + gY ). (2.25)
The reduction in the quadratic model qε(p) along pr satisfies
qε(0)− qε(pr) = 12‖tZ‖2.
Let pl denote the vector obtained by minimizing the quadratic model in range(U).
The vector pl satisfies
pl = −U(RTURU)−1gU
Reduced-Hessian Methods for Unconstrained Optimization 33
and the reduction in the quadratic model along pl satisfies
qε(0)− qε(pl) = 12‖tU‖2.
When minimizing a convex quadratic function with exact line search, successive
gradients are mutually orthogonal. In this case, gU = 0 and it follows from (2.25)
that tU = 0. Hence, a decrease in the quadratic model can be made only by
minimizing on the subspace determined by Z. However, Siegel has observed that
this behavior can be nearly reversed when minimizing general functions with an
inexact line search (see Siegel [45]). In this case, it is possible that ‖tU‖ ≈ ‖tZ‖,
which implies that nearly all of the reduction in the quadratic model is obtained
in range(U).
Since gU = 0 when minimizing quadratics with exact line search, quasi-
Newton methods minimize completely on range(U) before to moving into the
larger subspace range(Z). This phenomenon leads to the well-known property of
quadratic termination. Although this property is not retained when minimizing
general f , the quasi-Newton method can be modified so that gU is “smaller”
before moving into the larger subspace range(Z). This modification is achieved
easily by choosing pl instead of pr as the search direction. If the search direction
is given by pl, then the iterate x = x + αpl remains on the manifold M(U, x0)
defined by x0, x1, . . ., xk.
While the iterates linger on range(U), it is likely that the column di-
mension of Z continues to grow as gradients are accepted into the basis. The new
components of each accepted gradient are appended to Z as in Algorithm RH and
contribute to an increase in the dimension of range(Y ). The matrix U remains
fixed as long as the iterates linger onM(U, x0). While the iterates linger, unused
approximate curvature accumulates in the effective Hessian along directions in
Reduced-Hessian Methods for Unconstrained Optimization 34
range(Y ).
As noted by Fenelon [14, p. 72], it is not generally efficient to remain
onM(U, x0) until gU = 0. As suggested by Siegel [45], we will allow the iterates
to linger on M(U, x0) until the reduction in the quadratic model obtained by
moving into range(Y ) is significantly better than that obtained by lingering. In
particular, the iterates will remain inM(U, x0) as long as ‖tU‖2 > τ‖tZ‖2, where
τ ∈ (12, 1] is a preassigned constant. Since ‖tZ‖2 = ‖tU‖2 + ‖tY ‖2, the inequality
‖tU‖2 > τ‖tZ‖2 is equivalent to (1 − τ)‖tU‖2 > ‖tZ‖2. Hence, if τ = 1, then the
iterates do not linger.
In the case that p = pr (‖tU‖2 ≤ τ‖tZ‖2), let pZ be partioned as pZ =
(pU , pY )T, where pU ∈ IRl and pY ∈ IRr−l. The partition of RZ and the equation
RZpZ = tZ imply that
RY pY = tY and RUpU = tU −RUY pY . (2.26)
In terms of pU and pY , the search direction satisfies p = UpU + Y pY . Note that
if τ < 1, then the inequality (1− τ)‖tU‖2 ≤ ‖tY ‖2 implies that tY 6= 0. It follows
from (2.26) that pY 6= 0 since RY is nonsingular. Hence, Y pY is a nonzero step
into range(Y ) and we will say that the iterate x = x+αpr “steps into” range(Y ).
2.5.1 Updating Z when p = pr
When x “steps into” range(Y ), the dimension of the manifold defined by the
sequence of iterates increases by one. The new manifold is determined by x0, U
and pr. If subsequent iterates are to linger on this manifold, then it is convenient
to change U to another matrix, say U , such that range(U) ⊂ range(U) and
pr ∈ range(U). The new manifold is then given by M(x0, U). If the search
directions are taken from range(U), then the iterates will remain onM(x0, U).
Reduced-Hessian Methods for Unconstrained Optimization 35
The matrix U can be defined using an update to Z following the com-
putation of pr. Let Z denote the desired update to Z and partition Z as
Z = ( U Y ), where U ∈ IRn×(l+1) and Y ∈ IRn×(r−l−1). The component of
pr in range(Y ) is given by Y pY . The matrix Z is defined so that
range(U) = range(U)⊕ range(Y pY ),
and range(Z) = range(Z). Because range(Z) = range(Z), the update essentially
defines a “reorganization” of Z.
The update described here corresponds to an update of the Gram-
Schmidt QR factorization associated with Gε and is due to Daniel, et al. [8].
Let S denote an orthogonal (r − l) × (r − l) matrix satisfying SpY = ‖pY ‖e1
and define Z = ( U Y ST ). Note that Y STe1 = Y pY /‖pY ‖. Accordingly, the
update U is given by the first l + 1 columns of Z, i.e., U = ( U Y STe1 ). The
remainder of Z is denoted by Y , i.e., Y = ( Y STe2 Y STe3 · · · Y STer ). A
short argument shows that range(Z) = range(Z) and ZTZ = Ir. Hence, Z is also
an orthonormal basis for Gε. The matrix S satisfies S = Pl+1,l+2Pl+2,l+3 · · ·Pr−1,r,
where Pi,i+1 is a symmetric (r−l)×(r−l) Givens matrix in the (i, i+1) plane cho-
sen to annihilate the (i+1)th component of Pi+1,i+2 · · ·Pr−1,rpY . We say that the
component of p in range(Y ) is “rotated” into U . This component is considered
to be removed from Y to define Y since these two matrices satisfy
range(Y ) = range(Y )⊕ range(Y pY ).
As an aside, we note that if U is defined as above, then the columns of
U form a basis for the search directions. Moreover, if Pk = ( pk0 pk1 · · · pkl)
denotes the matrix of “full” search directions satisfying p = pr, then Pk has full
rank and range(Pk) = range(U).
Reduced-Hessian Methods for Unconstrained Optimization 36
If p = pl, then let U = U , Y = Y and Z = Z. The new partition
parameter satisfies
l =
l, if ‖tU‖2 > τ‖tZ‖2;
l + 1, otherwise,(2.27)
The matrix Z is defined by the Gram-Schmidt process used in Algo-
rithm RH, except that Z is used in place of Z. Hence, Z, gZ and r satisfy
(Z, gZ, r) = GS(Z, g, r, ε).
2.5.2 Calculating sZ and yεZ
At the end of iteration k, the quantities sZ and yεZ are required to compute RZ
using a Broyden update. (Recall that yεZ is the approximation of yZ that results
when a positive value of ε is used in the Gram-Schmidt process.) Computational
savings can be made if these quantities are obtained using pZ and gZ. We discuss
the definition of sZ first. The vector sZ satisfies
sZ =
sZ, if r = r;
(sZ, 0)T , if r = r + 1,
(2.28)
where sZ
= ZTs. The vector sZ
satisfies sZ
= αpZ, where p
Z= ZTp. If the
partition parameter increases, then pZ6= pZ. However, we shall show below that
pZ
can be obtained directly from pZ without a matrix-vector multiplication. This
is important, especially when n is large, since the computation of ZTp “from
scratch” requires nr floating point operations. From the definition of Z, pZ
satisfies
pZ
= ZTp = ( U Y ST )Tp =
UTp
SY Tp
=
pU
‖pY ‖e1
.The value ‖pY ‖ is computed during the update of Z. Hence, the definition of p
Z
requires no further computation. Note that if l = l, then pZ
= pZ.
Reduced-Hessian Methods for Unconstrained Optimization 37
Second, we discuss the calculation of yεZ. The vector g
Zdefined by
gZ
= ZTg satisfies
gZ
=
gZ
SgY
.This vector can be calculated by applying the Givens matrices defining S to the
vector gY . The definition of gεZ is similar to the definition used in Algorithm RH,
i.e.,
gεZ =
gZ, if r = r;
(gZ, 0)T , if r = r + 1.
(2.29)
The vector yεZ is defined as in Algorithm RH, i.e., yε
Z = gZ − gεZ.
2.5.3 The form of RZ when using the BFGS update
In this section, the effect of the lingering strategy on the block structure of RZ
is examined. Although a complete algorithm utilizing lingering has not yet been
defined, we present some preliminary results based on the discussion given to this
point. The first result gives information about the effect of the BFGS update on
RZ when s ∈ range(U).
Lemma 2.7 Let RZ denote a nonsingular upper-triangular r × r matrix parti-
tioned as
RZ =
RU RUY
0 RY
, where RU ∈ IRl×l and RY ∈ IRr−l.
Suppose sZ is of the form sZ = (sU , 0)T, where sU ∈ IRl, and that yZ ∈ IRr. If
RZ = BFGS(RZ, sZ, yZ), then the (2, 2) block of RZ is unaltered by the update,
i.e.,
RZ =
RU RUY
0 RY
Reduced-Hessian Methods for Unconstrained Optimization 38
Proof. The result follows from the definition of the rank-one BFGS update given
in Section 1.3.2.
Note that the result is purely algebraic in nature. The notation used
in the lemma is consistent with the current discussion to facilitate application in
Lemma 2.8 below.
Lemma 2.8 Assume that Algorithm RH has been applied with the BFGS update
to minimize f for k iterations. Moreover, assume that g was accepted at iteration
k− 1. Let lk = rk − 1 and partition RZ as in (2.23). During iteration k, suppose
that the iterates begin to linger onM(U, x0), and that they remain on the manifold
for m (m ≥ 0) iterations. Then, at the start of iteration k +m, the (2, 2) block
of RZ satisfies RY = σ1/2Irk+m−lk .
Proof. The result is proved by induction on i, where (0 ≤ i ≤ m). Since g is
accepted and l = r−1, Z is of the form Z = ( U Y ), where U = Z and Y = zg.
Prior to application of the BFGS update, the Cholesky factor satisfies
RZ =
RU 0
0 RY
, where RY = σ1/2 .
Since s ∈ range(U), it follows that sZ = (sU , 0)T . Hence, the result holds for i = 0
by application of Lemma 2.7 predated by one iteration. Assume that the result
holds for i = m−1. Since the iterates linger during iterations k through k+m−1,
the partition parameter satisfies lk+m = lk+m−1 = · · · = lk. Hence, we may use l
to denote this common value of the partition parameter. For the remainder of the
proof, let unbarred quantities be associated with the start of iteration k+m− 1
and let barred quantities denote their corresponding updates. By the inductive
hypothesis, RY = σ1/2Ir−l. Since x lingers on M(U, x0), s ∈ range(U) and it
Reduced-Hessian Methods for Unconstrained Optimization 39
follows that sZ = (sU , 0)T . Prior to the BFGS update, RY = σ1/2Ir−l. After the
BFGS update, Lemma 2.7 implies that RY = σ1/2Ir−l, as required.
2.5.4 Updating RZ after the computation of p
The change of basis from Z to Z necessitates a corresponding change in RZ
whenever l = l + 1. Recall that the effective approximate Hessian is defined by
Bε = QRTQRQQ
T, where RQ =
RZ 0
0 σ1/2In−r
and Q = ( Z W ) is orthogonal. The reduced Hessian ZTBεZ satisfies ZTBεZ =
RTZRZ. Following the change of basis, the Cholesky factor of the reduced Hessian
ZTBεZ is required. Let RZ
denote the desired matrix. A short calculation
shows that ZTBεZ = diag(Il, S)RTZRZ diag(Il, S
T ). The partition of RZ defined
by (2.23) gives
RZ diag(Il, ST ) =
RU RUYST
0 RYST
,which is not generally upper triangular. Hence, R
Zis defined by
RZ
= diag(Il, S)RZ diag(Il, ST ),
where S is defined so that SRYST is upper triangular. In the next section, we
consider the definition of S when BFGS updates are used.
The form of S when using the BFGS update
Lemma 2.8 implies that RY = σ1/2Ir−l. Hence, the matrix RZ diag(Il, ST ) satisfies
RZ diag(Il, ST ) =
RU RUYST
0 σ1/2ST
.
Reduced-Hessian Methods for Unconstrained Optimization 40
Thus, S may be set equal to S giving
RZ
=
RU RUYST
0 σ1/2Ir−l
.Note that the Givens matrices defined by S need only be applied to RUY in this
case. In the next section, we consider the definition of S when general Broyden
updates are used.
The form of S when using Broyden updates
When using Broyden updates other that the BFGS update, RY is not generally
diagonal. Restoring RYST to upper-triangular form is more complicated in this
case. The matrix S is defined by a product of Givens matrices. In particular,
S = P l+1,l+2 · · · P r−1,r, where P i,i+1 is an (r − l) × (r − l) Givens matrix in the
(i, i+ 1) plane defined to annihilate the (i+ 1, i) component of
P i+1,i+2 · · · P r−1,rRYPr−1,r · · ·Pi,i+1.
Note that P i,i+1 is defined immediately after the definition of Pi,i+1. For this
reason, the Givens matrices defining S are said to be interlaced with those defining
S. This technique of interlacing Givens matrices to maintain upper-triangular
form has been described by Crawford in the context of the generalized eigenvalue
problem and has been suggested for use in optimization by Gill et al. (see [20]).
In summary, the update to RZ satisfies
RZ
=
RZ, if l = l;
diag(Il, S)RZ diag(Il, ST ), otherwise.
(2.30)
Reduced-Hessian Methods for Unconstrained Optimization 41
2.5.5 The Broyden update to RZ
The matrix RZ is defined by
RZ =
R
Z, if r = r; R
Z0
0 σ1/2
, if r = r + 1.(2.31)
The updated Cholesky factor RZ satisfies RZ = Broyden(RZ, sZ, yεZ). Note that
when the BFGS update is used, fewer Givens matrices need be defined in order
to reduce uZ to er, resulting in computational savings. (See Section 1.3.2 and
note that uZ = RZsZ/‖RZsZ‖ is of the form uZ = (uU , 0)T , where uU ∈ IRl.)
2.5.6 A reduced-Hessian algorithm with lingering
A reduced Hessian algorithm with lingering is given below. This algorithm will
be referred to as Algorithm RHL.
Algorithm 2.3. Reduced Hessian method with lingering (RHL)
Initialize k = 0, r0 = 1 and l0 = 0; Choose x0, σ0 (σ0 > 0) and ε;
Initialize Z = Y = g0/‖g0‖ and RZ = RY = σ1/20 (U and RU are void);
while not converged do
Compute tU and tY according to (2.25);
Compute l according to (2.27);
if l = l then
Solve RUpU = −tU and compute p = UpU ;
else
Compute pU and pY according to (2.26) and set p = UpU + Y pY ;
end if
Compute α so that sTy > 0 and set x = x+ αp;
Reduced-Hessian Methods for Unconstrained Optimization 42
if l = l then
Define Z = Z, U = U and Y = Y ;
else
Define Z = Z diag(Il, ST ), where S satisfies SpY = ‖pY ‖e1;
Define U = ( U Y ST e1 ) and Y = Y ST ( e2 e3 · · · er );
Compute RZ
= diag(Il, S)RZ diag(Il, ST ), where S is defined so that
SRYST is upper triangular;
Define pZ
= (pU , ‖pY ‖e1)T and gZ
= (gU , SgY )T ;
end if
Form RZ according to (2.31);
Compute sZ and yεZ;
Compute RZ = Broyden(RZ, sZ, yεZ);
k ← k + 1;
end do
Chapter 3
Rescaling Reduced Hessians
In practice, the choice of B0 can greatly influence the performance of
quasi-Newton methods. If no second-derivative information is available at x0,
then B0 is often initialized to I. Several authors have observed that a poor
choice of B0 can lead to inefficiences—especially if ∇2f(x∗) is ill-conditioned (e.g.,
see Powell [41] and Siegel [45]). These inefficiences can lead to a large number
of function evaluations in practical implementations. Function evaluations are
often expensive in comparison to the linear algebra required to implement quasi-
Newton methods.
One remedy involves rescaling the approximate Hessians. To date,
rescaling has involved multiplying the approximate Hessians (or part of a fac-
torization of the approximate Hessians) by positive scalars. The following are
examples of rescaling methods.
• The self-scaling variable metric (SSVM) method, reviewed in Section 3.1,
multiplies Bk by a scalar prior to application of the Broyden update.
• Siegel [45] has demonstrated global and superlinear convergence of a scheme
that rescales columns of a conjugate-direction factorization of B−1k . This
43
Rescaling Reduced Hessians 44
method is reviewed in Section 3.2.
• Lalee and Nocedal [27] have defined an algorithm that rescales columns of
a lower-Hessenberg factor of Bk.
In this thesis, rescaling is achieved by reassigning the values of certain
elements of the reduced-Hessian Cholesky factor. In Sections 3.3 and 3.4, two
new rescaling algorithms of this type are introduced as extensions of Algorithm
RH (p. 28) and Algorithm RHL (p. 41).
3.1 Self-scaling variable metric methods
The first rescaling method, suggested by Oren and Luenberger [39], involves a
scalar factor ηk applied to the approximate Hessian before the quasi-Newton
update. Although the original SSVM methods were formulated in terms of the
inverse approximate Hessian, we shall describe them in terms of Bk (see Brodlie
[2]).
Let Mk = H1/2B−1k H1/2, where H is the Hessian of the quadratic q(x)
given in (1.17). Assume that Bk is positive definite. Brodlie states the result
that when q(x) is minimized using an exact line search,
q(xk+1)− q(x∗) ≤ γ2k(q(xk)− q(x∗)),
where γk = (κ(Mk)− 1)/(κ(Mk) + 1).
The value γ2k is called the “one-step convergence rate”. Note that the
smaller the value of κ(Mk), the smaller the value of γk. Hence, Oren and Luen-
berger suggest that a good method should decrease κ(Mk) every iteration. How-
ever, when Bk is updated by a formula from the Broyden convex class, κ(Mk)
Rescaling Reduced Hessians 45
can fluctuate. Consider the scalar ηk(β) defined by
ηk(β) = βsT
k yk
sTkBksk
+ (1− β)yT
k B−1k yk
sTk yk
, (3.1)
If Bk is multiplied by ηk before application of an update from the Broyden convex
class, then κ(Mk) decreases monotonically assuming an exact line search (see
Oren and Luenberger [39]).
The choice β = 1 avoids the need to form B−1k yk, a quantity not normally
computed by methods updating Bk. The corresponding value of the rescaling
parameter.
ηk(1) =sT
k yk
sTkBksk
, (3.2)
has been studied by several authors. Contreras and Tapia [7] consider this choice
in connection with trust-region methods for unconstrained optimization using
both the BFGS and the DFP updates. They report positive results for the DFP
update but negative results for the BFGS update (see [7] for further details).
Results given by Nocedal and Yuan [37] suggest that rescaling by ηk(1) every
iteration may inhibit superlinear convergence in line search algorithms that use
an initial step length of one.
Several researchers have proposed rescaling at the first iteration only.
Shanno and Phua suggest multiplying H0 by the scalar 1/η0(0) prior to the first
BFGS update. This is analogous to multiplying B0 by
η0(0) =y0B
−10 y0
sT0 y0
.
Numerical results imply that the method can be superior to the BFGS method,
especially for larger values of n. They also compare the method to a SSVM
method that suggested by Oren and Spedicato [38] and conclude that initial scal-
Rescaling Reduced Hessians 46
ing is superior (see Shanno and Phua [44]). Siegel has suggested multiplying B0
by η0(1) = sT0 y0/s
T0B0s0 in methods for large-scale unconstrained optimization.
Liu and Nocedal [28] have studied rescaling parameters in connection
with limited-memory methods (see Section 5.1). In these methods, the “initial”
inverse approximate Hessian H0k can be redefined every iteration. Several choices
for H0k are compared and they conclude that
H0k =
1
η0(0)I =
sTk−1yk−1
yTk−1yk−1
I
is the most effective in practice.
A common feature of SSVM methods is that they alter the approxi-
mate curvature in all directions. Recent methods, such as the conjugate-direction
rescaling algorithm reviewed in the next section, rescale more selectively.
3.2 Rescaling conjugate-direction matrices
Siegel has proposed a rescaling algorithm that uses conjugate direction matrices.
The matrices are updated using the form of the BFGS update suggested by Powell
(see Section 1.3.3). The algorithm is similar to Powell’s, but the definition of p
is different and the updated matrix V is rescaled. We present an outline of the
method in the following sections. See Siegel [45] for further details regarding both
the motivation and implementation of the method.
3.2.1 Definition of p
Consider the matrix V defined in Section 1.3.3. The rescaling algorithm uses an
integer parameter l (0 ≤ l ≤ n) that may be increased at any iteration. The
matrix V is partitioned as V = ( V1 V2 ), where V1 = (v1 v2 · · · vl) and
Rescaling Reduced Hessians 47
V2 = (vl+1 vl+2 · · · vn). Define the vectors
g1 = V T1 g, and g2 = V T
2 g. (3.3)
Note that the definitions of gV (1.30) and p (1.28) satisfy gV = (g1, g2)T and
p = −V gV . The definition of gV is modified for the rescaling scheme as follows.
Let τ ∈ (12, 1] denote a preassigned constant and define
gV =
(g1, 0)T, if ‖g1‖2 > τ(‖g1‖2 + ‖g2‖2);
(g1, g2)T, otherwise.
(3.4)
As before, the search direction is given by
p = −V gV . (3.5)
Note that if gV is defined by the first part of (3.4), then p depends only on the
first l columns of V . The parameter l is initialized at zero and updated during the
calculation of p. This parameter is always incremented during the first iteration.
For k ≥ 1,
l =
l, if ‖g1‖2 > τ(‖g1‖2 + ‖g2‖2);
l + 1, otherwise.(3.6)
3.2.2 Rescaling V
After the calculation of p, V is updated to give V using the BFGS update (1.32).
Let V be partitioned as V = ( V 1 V 2 ), where V 1 = (v1 v2 · · · vl) and
V 2 = (vl+1 vl+2 · · · vn). Let γk and µk denote the scalar parameters
γk =yT
ksk
‖sk‖2and (3.7)
µk = min0≤i≤k
γi. (3.8)
Rescaling Reduced Hessians 48
The matrix V , which is used to denote V after rescaling, satisfies
V = ( V 1 βkV 2 ), (3.9)
where
βk =
(1/γ0)
1/2, if k = 0;
max1, (µk−1/γk)
1/2, otherwise.
(3.10)
This choice of βk is motivated by considering the application of the
BFGS method to the convex quadratic function
q(x) = 12xTHx, where λ1(H) · · · λn(H) > 0. (3.11)
When the BFGS method with exact line search is applied to q(x), the search
directions tend to be almost parallel with the eigenvectors. Moreover, successive
search directions are aligned with eigenvectors associated with smaller eigenval-
ues. Under these conditions, the curvature along pk+1 should be no larger than
γ0, . . ., γk, i.e., it should be no larger than µk. Recall that W defines the subspace
of IRn in which the BFGS method has gained no approximate curvature through
the (k + 1)th iteration. We shall show in Chapter 4 that the choice of βk is such
that the approximate curvature in (V V T )−1 along unit directions w ∈ range(W )
is equal to µk.
3.2.3 The conjugate-direction rescaling algorithm
For reference, the conjugate-direction rescaling algorithm is given below.
Algorithm 3.1. Conjugate-direction rescaling (CDR) (Siegel)
Initialize k = 0, l = 0;
Choose x0 and V0 (V T0 V0 = I);
Define τ (12< τ < 1);
Rescaling Reduced Hessians 49
while not converged do
if k = 0 then
Compute gV = V Tg and set l = l + 1;
else if l < n
if ‖g1‖2 > τ(‖g1‖2 + ‖g2‖2) then
Set gV = (g1, 0)T and l = l;
else
Set gV = (g1, g2)T and l = l + 1;
end if
end if
Compute p = −V gV ;
Compute α so that yTs > 0, and set x = x+ αp;
Compute y = g − g;
Compute V from V using (1.32) and set V = ( V 1 βV 2 );
k ← k + 1;
end do
Note that β ≥ 1 except possibly on the first iteration. It follows that
the columns of V 2 are either unchanged or “scaled up” on every iteration after
the first. Once l reaches n, the BFGS update to V is no longer rescaled.
3.2.4 Convergence properties
It has been shown that if Algorithm CDR is applied to a strictly convex, twice-
continuously differentiable f with Lipschitz continuous Hessian satisfying
‖∇2f(x)−1‖ < C, where C > 0,
Rescaling Reduced Hessians 50
for all x in the level set of f(x0), then the iterates converge globally and superlin-
early to f(x∗). The proof uses the convergence properties of the BFGS algorithm
proven by Powell to imply global and superlinear convergence of xk. In the
case that l never reaches n, it is also necessary to show that the limit of xk
minimizes f(x) (see Siegel [45] for further details).
3.3 Extending Algorithm RH
In this section, a new rescaling algorithm is introduced that is an extension of
Algorithm RH (p. 28). This new algorithm alters the approximate curvature of
Bε on a subspace of dimension n−r at each iteration that r = r+1. Attention is
now restricted to the BFGS update (1.12) since this has been the most successful
update in practice.
3.3.1 Reinitializing the approximate curvature
The effective transformed Hessian associated with Algorithm RH is
QTBεQ =
ZT BεZ ZT BεW
W T BεZ W T BεW
=
RTZ RZ 0
0 σIn−r
(3.12)
at the end of iteration k. The approximate curvature along unit vectors in
range(W ) is equal to σ. The approximate curvature along zg is given by the
following lemma.
Lemma 3.1 Suppose that g is accepted during the kth iteration of Algorithm RH.
If the BFGS update is used at the end of the iteration, then
zgT Bεzg = σ +
ρg2
sTy.
Proof. The value zgT Bεzg is the (r, r) element of RT
Z RZ. This matrix satisfies
RTZ RZ = RT
ZRZ −RT
ZRZsZsTZR
TZRZ
sTZR
TZRZsZ
+yε
Z(yεZ)T
sTZy
εZ
.
Rescaling Reduced Hessians 51
The (r, r) element of RTZRZ is σ. The result follows since sZ = (sZ, 0)T, yε
Z =
(ZTy, ρg)T and sT
ZyεZ = sTy.
Lemma 3.1 is analogous to Lemma 2.2, which applies to the BFGS method in
exact arithmetic.
Lemma 3.1 implies that
zgT Bεzg − (σ − σ) = σ +
ρg2
sTy.
This is the value of the approximate curvature along zg that would result from
choosing B0 = σI. In this sense, subtracting σ − σ reinitializes the approximate
curvature along zg. The approximate curvature along directions in range(W ) can
be reinitialized in the same way. The rescaled transformed effective Hessian is
defined accordingly by
QTBeQ =
ZTBεZ ZTBεzg 0
zTgB
εZ zTgB
εzg − (σ − σ) 0
0 0 σIn−r
(3.13)
(the “hat” denotes rescaling as in the definition of Algorithm CDR).
The rescaling suggested by (3.13) can be simply applied to RZ. Since
p ∈ range(Z), it follows that sZ = (sZ, 0)T . Moreover, Lemma 2.7 implies that
the (r, r) element of RZ is unaltered by the BFGS update. It follows that RZ can
be partitioned as
RZ =
RZ Rg
0 σ1/2
, which implies RTZ RZ =
RTZ RZ RT
Z Rg
RgT RZ Rg
T Rg + σ
.Let RZ be defined by replacing the (r, r) element of RZ with σ, i.e.,
RZ =
RZ Rg
0 σ1/2
.
Rescaling Reduced Hessians 52
It follows that
RTZ RZ =
RTZ RZ RT
Z r
rT RZ rT r + σ
= RTZ RZ −
0 0
0 σ − σ
.Hence, RZ is the Cholesky factor of ZT BεZ. Note that RZ is nonsingular after
the reassignment since σ > 0. Thus, no loss of positive definiteness occurs as a
result of subtracting σ − σ from the reduced Hessian.
An algorithm using this rescaling scheme is given below.
Algorithm 3.2. Reduced Hessian rescaling
Initialize k = 0; Choose x0, σ0 and ε;
Initialize r = 1, Z = g0/‖g0‖, and RZ = σ1/20 ;
while not converged do
Solve RTZ tZ = −gZ, RZpZ = tZ, and set p = ZpZ;
Compute α so that sTy > 0 and set x = x+ αp;
Compute (Z, gZ, r) = GS(Z, g, r, ε).
Define RZ as in (2.17);
Compute RZ = BFGS(RZ, sZ, yεZ);
Compute or define σ;
if r > r and σ 6= σ then
Set the (r, r) element of RZ equal to σ;
end if
k ← k + 1;
end do
It remains to define an appropriate value of σ. We draw upon the
discussion in Section 3.1 to define four possible values. The fifth value has been
Rescaling Reduced Hessians 53
suggested by Siegel for Algorithm CDR. The five alternatives are summarized in
Table 3.1.
Table 3.1: Alternate values for σLabel σ ReferenceR0 σ No rescalingR1 γ0 Siegel [46]
R2 yT0 y0/s
T0 y0 Shanno and Phua [44]
R3 γk Analogous to Liu and Nocedal
R4 yTk yk/s
Tk yk Liu and Nocedal [28]
R5 µk Siegel [45]
3.3.2 Numerical results
The first set of test problems consists of the 18 unconstrained optimization prob-
lems given by More et al. [29]. These problems are listed in Table 3.2 below.
The method is implemented in double precision FORTRAN 77 on a
DEC 5000/240. The line search is a slightly modified version of that included in
NPSOL. The line search is designed to ensure that α satisfies the modified Wolfe
conditions (1.16) (see Gill et al. [21]). The step length α = 1 is always attempted
first. The step length parameters are ν = 10−4 and η = 0.9. The value ε = 10−4
is used in the Gram-Schmidt process and the stopping criterion is ‖gk‖ < 10−8.
The results of Table 3.3 compare Algorithm RH (p. 28) with Algo-
rithm RHR using several of the rescaling values. The numbers of iterations and
function evaluations needed to achieve the stopping criterion are given for each
run. For example, the notation “31/39” indicates that 31 iterations and 39 func-
tion evaluations are required for convergence. The notation “L” indicates that
the method terminated in the line search. In this case, the number in parentheses
gives gives the final norm of the gradient. The final column in Table 3.3 gives an
Rescaling Reduced Hessians 54
Table 3.2: Test Problems from More et al.Number n Problem name
1 3 Helical valley2 6 Biggs EXP63 3 Gaussian function4 2 Powell badly scaled function5 3 Box three-dimensional function6 16 Variably dimensioned function7 12 Watson8 16 Penalty I9 16 Penalty II10 2 Brown badly scaled function11 4 Brown and Dennis12 3 Gulf research and development13 20 Trigonometric14 14 Extended Rosenbrock15 16 Extended Powell singular function16 2 Beale17 4 Wood18 16 Chebyquad
“at a glance” comparison of Algorithm RHR and Algorithm RH. For example,
the notation “+−+” means that Algorithm RHR required fewer function evalu-
ations than Algorithm RH for rescaling methods R1 and R5, but required more
function evaluations for R4 (note that “+” means fewer function evaluations).
3.4 Rescaling combined with lingering
Sometimes it is desirable to reinitialize the approximate curvature in a larger
subspace than that determined by zg. Our objective is to alleviate inefficiencies
resulting from poor initial approximate curvature. In doing so, we must be careful
to alter only the affects of the initial approximate curvature. In Lemma 2.8 it
is shown that RY = σIr−l when the BFGS update is used in Algorithm RHL
(p. 41). In this sense, the initial approximate curvature along unit directions in
Rescaling Reduced Hessians 55
Table 3.3: Results for Algorithm RHR using R1, R4 and R5
Problem Alg. RH Algorithm RHRNo. n σ = 1 R1 R4 R5 Comp.1 3 31/39 28/35 27/36 26/35 + + +2 6 34/42 44/50 41/45 39/44 −−−3 3 4/6 5/8 5/8 5/8 −−−4 2 145/191 147/199 140/193 147/199 −−−5 3 32/36 34/37 34/37 33/36 −− 06 16 24/32 24/32 24/32 24/32 0007 12 80/93 146/153 124/129 109/114 −−−8 16 58/74 61/85 57/69 56/68 −+ +9 16 301/442 416/470 505/584 582/710 −−−10 2 L(.1E-1) L(.1E-1) L(.5E-2) L(.1E-1) 0 + 011 4 65/88 74/83 72/80 72/80 + + +12 3 31/40 47/65 48/69 47/65 −−−13 20 48/53 54/61 47/50 39/49 −+ +14 14 37/52 38/54 36/50 38/54 −+−15 16 48/62 74/79 81/87 60/65 −−−16 2 16/23 16/20 16/20 16/20 + + +17 4 80/121 39/48 63/78 66/87 + + +18 16 68/102 77/87 70/78 58/82 + + +
range(Y ) is unaltered. Hence, the approximate curvature in Bε corresponding to
Y is easily reinitialized. The approximate curvature along directions in range(U)
will be considered to be established and the associated reduced Hessian will not
be rescaled. Following the BFGS update, RZ will be defined by
RZ =
RU RUY
0 σIr−l
, which replaces RZ =
RU RUY
0 σIr−l
at the start of iteration k + 1.
The approximate curvature along directions w ∈ range(W ) is also reini-
tialized. The Cholesky factors of the effective transformed Hessians QTBεQ and
Rescaling Reduced Hessians 56
QTBεQ satisfy
RQ =
RZ RZY 0
0 σ1/2Ir−l 0
0 0 σ1/2Ir−l
and RQ =
RZ RZY 0
0 σ1/2Ir−l 0
0 0 σ1/2Ir−l
.Note that the rescaled transformed Hessian satisfies
QT BεQ = RTQRQ = RT
QRQ −
0 0
0 (σ − σ)In−l
. (3.14)
This rescaling it therefore analogous to that defined for Algorithm RH (3.13). In
this case the rescaling is defined on the (possibly larger) subspace range( Y W )
instead of range( zg W ).
An algorithm employing this strategy is given below.
Algorithm 3.3. Reduced-Hessian rescaling with lingering (RHRL)
Initialize k = 0, r0 = 1 and l0 = 0; Choose x0, σ0 and ε;
Initialize Z = Y = g0/‖g0‖ and RZ = RY = σ1/20 (U is void);
while not converged do
Compute p as in Algorithm RHL;
Compute Z and associated quantities as in Algorithm RHL;
Compute α so that sTy > 0 and set x = x+ αp;
if r < n then
Compute (Z, gZ, r) = GS(Z, g, r, ε);
else
Define Z = Z;
Compute gZ;
end if
Form RZ according to (2.31);
Compute sZ and yεZ;
Rescaling Reduced Hessians 57
Compute RZ = BFGS(RZ, sZ, yεZ);
Compute σ;
if l < r and σ 6= σ then
Set RZ =
RU RUY
0 σIr−l
;
end if
k ← k + 1;
end do
3.4.1 Numerical results
Results are given in Table 3.4 that compare Algorithm RHRL using R1, R4
and R5 with Algorithm RH (p. 28) on the 18 problems listed in Table 3.2. The
constants used in the line search and the Gram Schmidt process are the same as
those given in Section 3.3.2.
Results are given for four additional problems in Table 3.5. Problem 19
was used by Siegel to test Algorithm 3.1 (p. 48). Results are given for the case
D11 = 1 and D55 = 10−12, which define a function whose Hessian is very ill-
conditioned (see [45] for further details). In this case, the convergence criteria
are ‖gk‖ < 10−8 and |f(xk)−f∗| < 10−8, where f∗ = 3.085557482E−3. Problems
20, 21 and 22 are the calculus of variation problems discussed by Gill and Murray
(see [18]). We give results for these problems for n = 50, n = 100 and n = 200.
Generally the column dimension of Y stays large as the iterations proceed, which
means that the approximate curvature is rescaled on high-dimensional subspaces
of IRn. For example, in the solution of problem 20 with n = 50, the column
dimension of Y reaches 10 at iteration 33 and remains greater than or equal to
10 until iteration 49.
Rescaling Reduced Hessians 58
Table 3.4: Results for Algorithm RHRL on problems 1–18
Problem Alg. RH Algorithm RHRLNo. n σ = 1 R1 R4 R5 Comp.1 3 31/39 28/37 26/33 31/38 + + +2 6 40/43 46/52 48/53 40/43 −− 03 3 4/6 5/8 5/8 5/8 −−−4 2 146/194 147/199 140/193 147/199 −+−5 3 32/36 34/37 31/34 29/33 −+ +6 16 24/32 24/32 24/32 24/32 0007 12 80/93 146/157 119/126 79/84 −−+8 16 58/74 60/77 57/75 57/75 −−−9 16 285/411 526/614 449/558 493/614 −−−10 2 L(.1E-1) L(.1E-1) L(.5E-2) L(.1E-1) 0 + 011 4 65/88 77/84 66/73 68/75 + + +12 3 31/40 51/60 49/67 38/53 −−−13 20 48/53 55/59 48/52 38/50 −+ +14 14 37/52 37/52 39/55 36/50 0−+15 16 48/52 74/79 61/68 67/73 −−−16 2 16/23 16/20 16/20 16/20 + + +17 4 80/121 38/45 71/88 69/91 + + +18 16 68/102 80/89 67/77 57/83 + + +
3.4.2 Algorithm RHRL applied to a quadratic
The following theorem summarizes some properties of Algorithm RHRL when it
is used with an exact line search to minimize the quadratic function (1.17). In
the statement and proof of the theorem, rij denotes the (i, j) component of RZ.
Theorem 3.1 Consider the use of Algorithm RHRL with exact line search to
minimize the strictly convex quadratic function (1.17). In this case, the upper-
triangular matrix RZ is upper bidiagonal. At the start of iteration k, rk = k + 1,
lk = k, RU ∈ IRk×k, RUY = −‖gk‖/(sTk−1yk−1)ek and RY = σ
1/2k . The nonzero
elements of RU satisfy
rii =‖gi−1‖
(sTi−1yi−1)1/2
and ri,i+1 = − ‖gi‖(sT
i−1yi−1)1/2
Rescaling Reduced Hessians 59
Table 3.5: Results for Algorithm RHRL on problems 19–22
Problem Alg. RH Algorithm RHRLNo. n σ = 1 R1 R4 R5 Comp.19 5 L(.2E-8) L(.4E-9) L(.4E-9) 115/139 00+
50 222/255 266/291 209/215 79/128 −+ +20 100 398/480 470/524 475/478 137/249 −+ +
200 731/912 849/966 969/985 260/524 −−+50 50/172 197/202 110/115 49/74 −+ +
21 100 64/310 247/253 169/174 74/124 + + +200 127/623 335/341 295/302 127/227 + + +50 164/217 280/284 187/191 70/107 −+ +
22 100 250/350 421/425 312/316 99/148 −+ +200 217/317 152/252 217/220 161/292 + + +
for 1 ≤ i ≤ k. The matrix Z satisfies Z = ( U Y ), where
U =( g0
‖g0‖g1
‖g1‖· · · gk−1
‖gk−1‖
)and Y =
gk
‖gk‖.
Furthermore, the search directions satisfy
pk =
−gk, if k = 0;
1
σk
(σk−1
‖gk‖2
‖gk−1‖2pk−1 − gk
), otherwise.
(3.15)
Proof. The result is clearly true for k = 0. Assume that the result holds at the
start of iteration k, i.e., RZ, Z and the first k search directions are of the stated
form.
The first k+ 1 gradients are orthogonal and nonzero by assumption (or
by the assumed form of the first k search directions). Hence, gU = UTgk = 0,
which implies tU = 0 and tY = −σ−1/2k ‖gk‖. Since ‖tU‖2 < τ(‖tU‖2 + ‖tY ‖2), lk+1
satisfies lk is incremented and lk+1 = k + 1, as required. The definitions of pY
and pU give
pY = −‖gk‖σk
and pU = −‖gk‖2
σk
‖g0‖−1
...
‖gk−1‖−1
, (3.16)
Rescaling Reduced Hessians 60
respectively. Hence,
pk = UkpU + YkpY =1
σk
(− ‖gk‖2
‖gk−1‖2‖gk−1‖2
(g0
‖g0‖2+ · · ·+ gk−1
‖gk−1‖2
)− gk
).
A short inductive argument verifies that
−‖gk−1‖2(
g0
‖g0‖2+ · · ·+ gk−1
‖gk−1‖2
)= σk−1pk−1,
which, together with the previous equation, implies that pk is of the required
form.
Following the computation of pk, the matrix Z must be reorganized since
the partition parameter has increased. Since pY is a scalar, S = 1, U = ( U Y )
and Y is void. The Cholesky factor RZ
satisfies RZ
= RZ. Since the first k + 1
search directions are parallel to the conjugate-gradient directions, xk+1 is such
that gk+1 is orthogonal to g0, . . ., gk. Thus, gk+1 is accepted if it is nonzero.
It follows that the matrices U , Y and Z satisfy U = U , Y = gk+1/‖gk+1‖ and
Z = ( U Y ), as required.
We complete the proof by considering the computation of RZ (see Sec-
tion 1.3.2). Since gk+1 is always accepted, RZ = diag(RZ, σ
1/2k ) = diag(RZ, σ
1/2k ).
The vector uZ used in the BFGS update satisfies
uZ =RZsZ
‖RZsZ‖=
RZpZ
‖RZpZ‖=
1
‖tU‖
tZ
tY
0
=
0
−1
0
.Thus, the matrices S1 and S1RZ satisfy
S1 =
Ik 0 0
0 0 1
0 1 0
and S1RZ =
RU RUY 0
0 0 σ1/2k
0 σ1/2k 0
,respectively. Since
RTZuZ =
0
−σ1/2k
0
and wZ =1
(sTk yk)1/2
0
−‖gk‖‖gk+1‖
,
Rescaling Reduced Hessians 61
it follows that
S1(RZ + uZ(wZ −RTZuZ)T ) =
RU RUY 0
0 0 σ1/2k
0‖gk‖
(sTk yk)1/2
− ‖gk+1‖(sT
k yk)1/2
.
If S2 is defined by S2 = S1, then RZ is upper triangular and satisfies
RZ =
RZ RUY
0 RY
, where RZ =
RU RUY
0‖gk‖
(sTk yk)1/2
,
RUY =
0
− ‖gk+1‖(sT
k yk)1/2
and RY = σ1/2k .
The rescaled matrix RZ satisfies
RZ =
RZ RUY
0 σ1/2k+1
,which completes the inductive argument.
Now we show that Theorem 3.1 implies that Algorithm RHRL termi-
nates on quadratics.
Corollary 3.1 If Algorithm RHRL is used to minimize the convex quadratic
function (1.17) with exact line search and σ0 = 1, then the method converges to
the minimizer in at most n iterations.
Proof. The search directions are parallel to the conjugate-gradient directions
by Theorem 3.1. Thus, Algorithm RHRL enjoys quadratic termination since the
conjugate-gradient method has this property.
Chapter 4
Equivalence of Reduced-Hessianand Conjugate-DirectionRescaling
In this chapter, it is shown that if Algorithm RHRL is used in conjunc-
tion with a particular rescaling technique of Siegel [45], then it is equivalent to
Algorithm CDR in exact arithmetic. This chapter is mostly technical in nature
and may be skipped without loss of continuity. However, the convergence results
given in Section 4.4 should be reviewed before passing to Chapter 5.
First, we show that a basis for V1 can be formulated in terms of the
search directions generated by Algorithm CDR. Second, a transformed approxi-
mate Hessian associated with B is derived that has the same form as the trans-
formed Hessian generated by Algorithm RHRL. Third, we define the affect that
rescaling the conjugate-direction matrices has on this transformed Hessian.
4.1 A search-direction basis for range(V1)
The following two lemmas lead to a result that gives a basis for range(V1) in
terms of a subset of the search directions generated by Algorithm 3.1.
62
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling 63
Lemma 4.1 If l is unchanged during any iteration of Algorithm 3.1, then
range(V1) = range(V1).
Proof. Since l remains fixed, gV = (g1, 0)T. Thus Ω (see Section 1.3.3 for the
definition of Ω) is of the form
Ω =
Ω1 0
0 In−l
, where Ω1 ∈ IRl×l
is orthogonal and lower Hessenberg. Using the update (1.29) and the form of Ω,
we find
V = (I − suT )(V1Ω1 V2
).
If rescaling is applied to the second part of V , we obtain
V = (I − suT )(V1Ω1 βV2
),
which implies V1 = (I − suT )V1Ω1. Since s = αp = −αV1g1, it follows that
V1 = (I + αV1g1uT )V1Ω1 = V1(Il + αg1u
TV1)Ω1.
From this we see that range(V1) ⊆ range(V1). However, (Il + αg1uTV1)Ω1 is
invertible since otherwise, V1 is rank deficient. Thus, range(V1) ⊆ range(V1), and
we may conclude that range(V1) = range(V1).
The second lemma relates to a property of Ω. In both the statement
and proof of the lemma, Ω is partitioned according to
Ω = ( Ω1 Ω2 ), where Ω1 ∈ IRn×(l+1) and Ω2 ∈ IRn×(n−l−1).
The tildes are used to distinguish the partition from that used in Lemma 4.1 and
Theorem 4.1.
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling 64
Lemma 4.2 Let Ω ∈ IRn×n be an orthogonal, lower-Hessenberg matrix. Given
an integer l (1 ≤ l ≤ n− 1), partition Ω as in (4.1). There exist w1, w2, . . . , wl ∈
IRl+1, such that Ω1wi = ei (1 ≤ i ≤ l), where ei is the ith column of I.
Proof. The first l rows of Ω2 are zero since Ω is lower Hessenberg. Hence, Ω2
may be partitioned as
Ω2 =
0
Ω22
, where Ω22 ∈ IR(n−l)×(n−l−1).
The product Ω2ΩT2 satisfies
Ω2ΩT2 =
0 0
0 Ω22ΩT22
.Since I = ΩΩT = Ω1Ω
T1 + Ω2Ω
T2 , it follows that
Ω1ΩT1 =
Il 0
0 In−l − Ω22ΩT22
.Thus, with wi (1 ≤ i ≤ l) defined as the transpose of the ith row of Ω1, we have
the desired result.
Let P denote the set of search directions generated by Algorithm CDR.
Let l denote the value of the partition parameter at the kth (k ≥ 1) iteration
of Algorithm CDR before calculation of the search direction. Let k1, k2, . . . , kl
(0 = k1 < k2 < · · · < kl < k) denote the indices of the iterations at which l is
incremented. Define
P1 = pk1 , pk2 , . . . , pkl, and P2 = P − P1. (4.1)
Note that the subscripts 1 and 2 on P in this definition are not iteration indices.
The main result of this section follows.
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling 65
Theorem 4.1 Let P1 be defined as in (4.1). Then P1 is a basis for range(V1)
for all k ≥ 1.
Proof. If k = 1, then Algorithm 3.1 gives l = 1 and P1 = p0 automatically.
Since l = 1 and by (1.32), V1 = s0/(sT0 y0)
1/2. Hence, the result holds for k = 1.
Given l (1 ≤ l ≤ n), assume that the result holds for k = kl +1. The set
P1 as given in (4.1) is a basis for range(V1). If l = n, then the inductive argument
is complete since this would imply that V1 = V , P1 is a basis for IRn, and hence
P1 is a basis for range(V 1) = range(V ). If l < n and l does not increase during
or after iteration k, then the inductive argument is complete since Lemma 4.1
implies range(V 1) = range(V1). Therefore, assume that l < n and that l increases
during or after iteration k.
The result is true for all k (kl + 1 < k ≤ kl+1) by Lemma 4.1, and we
fix k = kl+1 for the rest of the argument. Since l = l + 1, p 6∈ range(V1). Hence,
p is independent of P1, which implies that P1 is a linearly independent set. It
remains to show that P1 is a spanning set for range(V 1).
The vector p ∈ range(V1) since the first column of V is parallel to it by
(1.32). We now show that each member of P1 is also an element of range(V1).
Partition V as
V = ( V1 V2 ), where V1 ∈ IRn×(l+1) and V2 ∈ IRn×(n−l−1).
Since Ω is constructed to make v1 parallel to s and since v1 is parallel to s by
(1.32), it follows that v1 ∈ range(V1). Rearranging the definition of vi in (1.32)
gives vi = vi − (vTi y/s
Ty)s (2 ≤ i ≤ n). Thus, vi ∈ range(V1) (2 ≤ i ≤ l + 1).
Therefore range(V1) ⊆ range(V1). Since V1 = V Ω1, we have V Ω1wi = V ei =
vi ∈ range(V1) (1 ≤ i ≤ l) where Ω1 and wi are defined as in Lemma 4.2. Thus,
range(V1) ⊂ range(V1), and since P1 is a basis for range(V1), P1 ⊂ range(V1).
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling 66
It has been shown that p ∈ range(V ) and P1 ⊂ range(V ). Thus,
P1 ⊆ range(V 1). Since P1 consists of l + 1 linearly independent vectors and
dim(range(V 1)) = l+ 1, P1 is a basis for V1. Finally, since rescaling has no effect
on V 1, P1 is a basis for V1, as required.
4.2 A transformed Hessian associated with B
The set Gk is defined as in (2.3), i.e.,
Gk = g0, g1, . . . , gk.
In this section, Q will denote an orthogonal matrix partitioned as
Q = ( Z W ), where range(Z) = span(G).
The following lemma, analogous to Lemma 2.3, shows that the transformed Hes-
sian QTBQ has the same structure as the transformed Hessian associated with
the BFGS method. Hence, conjugate-direction rescaling preserves the block di-
agonal structure of the transformed Hessian. The proof of Lemma 4.3 is similar
to the proof of Lemma 2.3 given by Siegel [46].
Lemma 4.3 Let V0 be any orthogonal matrix. If Algorithm CDR is applied to
a twice-continuously differentiable function f : IRn → IR, then s ∈ span(G) for
all k. Moreover, if z and w belong to span(G) and the orthogonal complement of
span(G) respectively, then
Bz ∈ span(G),B−1z ∈ span(G) for all k, while
Bw =
w if k = 0;
µw otherwise,
(4.2)
where µ = µk−1 is defined by (3.8).
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling 67
Proof. The result for k = 0 is proved directly, while induction is used for
iterations such that k ≥ 1. Since B = I, (1.8) implies p = −g. Thus, s =
αp = −αg, which implies that s ∈ span(G). Also Bz = z implies Bz ∈ span(G)
and B−1z ∈ span(G), for all z ∈ span(G). Since Bw = B−1w = w, for all
w ∈ span(G)⊥ the result is true for k = 0.
With k = 0, the update B satisfies
B = I − ssT
sTs+yyT
yTs.
The set G satisfies G = g0, g1. The vector s = −αg ∈ span(G) and y = g − g ∈
span(G). span(G). Hence, for all w ∈ span(G)⊥, it is true that Bw = w, which
implies B−1w = w. The matrix B−1 = V V T by definition and it follows that
V V Tw = w. (4.3)
Since the first column of V is parallel to s and since l = 1 at the end of the first
iteration, V T1w = 0. It follows from (4.3) that V 2V
T2w = w. Hence, using (3.9),
B−1w = (V 1VT1 + β2V 2V
T2 )w = β2w. (4.4)
Using (3.10) and (3.8), equation (4.4) implies that B−1w = (1/µ)w, which also
gives Bw = µw. For all z ∈ span(G), we have (Bz)Tw = µzTw = 0. Hence,
Bz ∈ span(G). Similarly, B−1z ∈ span(G). Finally, since s = −αB−1g, and since
B−1g ∈ span(G), if follows that s ∈ span(G). Therefore, the result holds for
k = 1.
Assume that the result holds at the start of iteration k. By the inductive
hypothesis, s ∈ span(G) ⊆ span(G), and Bs ∈ span(G) ⊆ span(G). Also, y ∈
span(G) by definition. With w ∈ span(G)⊥, and using (1.12) along with the
inductive hypothesis, we find
Bw = Bw = µw. (4.5)
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling 68
Equation (4.5) implies V V Tw = (1/µ)w, whence V 2VT2w = (1/µ)w − V 1V
T1w.
Using Theorem 4.1 and the inductive hypothesis, range(V 1) = span(P1) ⊆
span(G) ⊆ span(G). Hence, V T1w = 0 and V 2V
T2w = (1/µ)w. Thus, B−1w =
(V 1VT1 + β2V 2V
T2 )w = (β2/µ)w. Using (3.10) and (3.8),
β2
µ=
1
µ, (4.6)
for all k ≥ 1, which implies B−1w = (1/µ)w as desired. Hence, Bw = µw.
Exactly as above, we find Bz ∈ span(G), and B−1z ∈ span(G) for all z ∈ span(G).
Finally, if s = −αB−1g, then s ∈ span(G) since B−1g ∈ span(G). Otherwise, if
s = −αV1VT1 g, then s ∈ span(G) since range(V1) = span(P1) ⊆ span(G) ⊆
span(G).
Lemma 4.3 implies that for k = 0, QTBQ = I, and for k ≥ 1,
QTBQ =
ZTBZ 0
0 µIn−r
. (4.7)
Hence, the transformed Hessian associated with Algorithm CDR has the same
block structure as that given in equation (2.15) in connection with the BFGS
method. Furthermore, the transformed gradient satisfies
QTg =
ZTg
W Tg
=
ZTg
0
. (4.8)
When l = l+1, the form of p given in (3.5) satisfies Bp = −g, which is equivalent
to
(QTBQ)QTp = −QTg. (4.9)
Equations (4.7), (3.5), and (4.9) imply that
p = −Z(ZTBZ)−1ZTg. (4.10)
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling 69
Using this form of p (4.10) and Theorem 4.1, it is now shown that the
search directions in P1 can be rotated into the basis defined by Z. For k = 0, p
can replace g in the definition of Z since p = −g as long as V is orthogonal. At
the start of iteration k, assume that
Z = ( U Y ), where range(U) = span(P1). (4.11)
Let r denote the column dimension of Z. If l = l+1, equation (4.10) implies that
p = UpU +Y pY , for some pU ∈ IRl and pY ∈ IRr−l. By Theorem 4.1, p is indepen-
dent of P1, and hence pY 6= 0. Therefore, Y pY /‖pY ‖ can be rotated into the first
column of Y as described in Section 2.5.1. Let S denote an orthogonal matrix
satisfying SpY = ‖pY ‖e1 and define Z = ( U Y ), where U = ( U Y STe1 ),
and Y = Y ST( e2 · · · er ). Let ρg denote the norm of the component of g
orthogonal to Z. Let yg denote the normalized component of g orthogonal to Z
and define
Y =
Y , if ρg = 0;
( Y yg ), otherwise.(4.12)
If Z = ( U Y ), then range(Z) = span(G) and range(U) = span(P1) and this
completes the argument.
For the remainder of the chapter, Q is defined as an orthogonal matrix
satisfying
Q = ( Z W ), where Z = ( U Y ). (4.13)
Consider the (2, 2) block of the transformed Hessian determined by W . From
equation (4.5), it follows that W T BW = µIn−r while the form of the transformed
Hessian given by (4.7) implies that W T BW = µIn−r. Since the off-diagonal blocks
of the transformed Hessian are 0, the affect of rescaling V on the transformed
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling 70
Hessian corresponding to W is now determined. The affect of conjugate direction
rescaling on the reduced Hessian UT BU is examined in the next section.
4.3 How rescaling V affects UT BU
A preliminary lemma is required that relates V −1 to V −1.
Lemma 4.4 If V −1 is partitioned as
V −1 =
V −11
V −12
,where V −1
1 ∈ IRl×n, then
V −1 =
V −11
1
βV −1
2
.Proof. Using the partition V = ( V 1 V 2 ), it follows that I = V V −1 =
V 1V−11 + V 2V
−12 . Since V = ( V 1 βV 2 ),
V
V −11
1
βV −1
2
= V 1V−11 + (βV 2)(
1
βV −1
2 ) = I,
as required.
The overall form of QTBQ given by (4.7) (postdated one iteration) is
QTBQ =
ZTBZ 0
0 µIn−r
.Since Z satisfies equation (4.11) postdated one iteration, range(U) = range(V 1).
Thus, there exists a nonsingular M ∈ IRl×l such that U = V 1M , and we may
write Z = ( V 1M Y ). It follows that
ZTBZ =
MT V T1BV 1M MT V T
1 BY
Y TBV 1M Y TBY
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling 71
By definition of B, V T1 BV 1 = Il. Since V −1V = I, we have V −1V 1 = E1, where
E1 denotes the first l columns of the n× n identity matrix. It follows that
BV 1 = V T−1V −1V 1 = V T−1E1 = (V −11 )T,
which implies
ZTBZ =
MTM MT V −11 Y
Y T(V −11 )TM Y TBY
.Using the relation V 1 = V1, Z
TBZ is found to satisfy
ZTBZ =
MTM MT V −11 Y
Y T(V −11 )TM Y TBY
.Lemma 4.4 implies that V −1
1 = V −11 and it follows that ZTBZ and ZTBZ are
identical except in the (2, 2) block.
The quantity Y TBY can be written in terms of quantities involving V
as follows
Y TBY = Y T(V −11 )T V −1
1 Y + Y T(V −12 )T V −1
2 Y . (4.14)
Similarly, using the equality V −11 = V −1
1 ,
Y TBY = Y T(V −11 )T V −1
1 Y + Y T (V −12 )TV −1
2 Y . (4.15)
Subtracting (4.14) from (4.15), and using Lemma 4.4 gives
Y TBY − Y TBY = (1− β2)Y T(V −12 )T V −1
2 Y . (4.16)
From (4.16) it is seen that the form of Y T(V −12 )T V −1
2 Y is required to determine
how rescaling V affects the reduced Hessian Y TBY .
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling 72
The form of Y T (V −12 )T V −1
2 Y
The following theorem gives information about the block structure of (V^T V)^{-1}, from which the right-hand side of (4.16) is ascertained.

Theorem 4.2 If (V^T V)^{-1} is partitioned as

    (V^T V)^{-1} = [ X_11     X_12
                     X_12^T   X_22 ],   where X_11 ∈ IR^{l×l},

then X_22 = µ I_{n−l} for all k ≥ 1.

Proof. The proof is by induction on k. For k = 0, V is orthogonal by definition of Algorithm CDR. Using (1.29), and the Sherman–Morrison formula (see Golub and Van Loan [26, p. 51]),

    (Ṽ^T Ṽ)^{-1} = Ω^T V^{-1}(I − γ s u^T)(I − γ u s^T) V^{-T} Ω,

where γ = 1/(s^T u − 1). From (1.31), the quantity Ω^T V^{-1} s satisfies

    Ω^T V^{-1} s = −α‖g_V‖ e_1.

Hence,

    (Ṽ^T Ṽ)^{-1} = (Ω^T V^{-1} + γα‖g_V‖ e_1 u^T)(V^{-T} Ω + γα‖g_V‖ u e_1^T),

which can be written as

    (Ṽ^T Ṽ)^{-1} = I + δ(e_1 f^T + f e_1^T) + δ²‖u‖² e_1 e_1^T,    (4.17)

where δ = γα‖g_V‖ and f = Ω^T V^{-1} u. Let X̃_22 denote the (n−1)×(n−1) (2, 2) block of (Ṽ^T Ṽ)^{-1}. Since equation (4.17) only involves rank-one changes to I, all of which include e_1 as a factor,

    X̃_22 = I_{n−1}.    (4.18)

Using Lemma 4.4, (4.18), (3.8), and (3.10), we have X̄_22 = (1/β²) X̃_22 = µ̄ I. Since l = 1, the result is true at the start of the first iteration.
Assume that the result is true at the start of the kth iteration. Exactly as in the derivation of (4.17),

    (Ṽ^T Ṽ)^{-1} = Ω^T (V^T V)^{-1} Ω + δ(e_1 f^T + f e_1^T) + δ²‖d‖² e_1 e_1^T.    (4.19)

Let Ω be partitioned as

    Ω = [ Ω_11   Ω_12
          Ω_21   Ω_22 ],

where Ω_11 ∈ IR^{l×l}, while (V^T V)^{-1} is partitioned as in the statement of the theorem, and consider the cases l̄ = l and l̄ = l + 1.

If l̄ = l, then Ω_12 = 0 and Ω_22 = I_{n−l}. In this case,

    Ω^T (V^T V)^{-1} Ω = [ Ω_11^T   Ω_21^T  ] [ X_11     X_12 ] [ Ω_11   0       ]
                         [ 0        I_{n−l} ] [ X_12^T   X_22 ] [ Ω_21   I_{n−l} ]
                       = [ X̃_11     X̃_12
                           X̃_12^T   X̃_22 ],

where quantities with tildes have been affected by Ω. Using this in (4.19) gives

    (Ṽ^T Ṽ)^{-1} = [ X̄_11     X̄_12
                       X̄_12^T   X̄_22 ],

where the quantities with bars differ from those with tildes as a result of the rank-one matrices in (4.19). By the inductive hypothesis, X̃_22 = µ I_{n−l}, and since each rank-one matrix in (4.19) includes e_1 as a factor, this block is unchanged, i.e., X̄_22 = µ I_{n−l}. Hence, using Lemma 4.4 and (4.6), the rescaled matrix satisfies X_22 = µ̄ I_{n−l}, as required.

Suppose that l̄ = l + 1. Due to the lower Hessenberg form of Ω, we have Ω_12 = ζ e_{n−l} e_1^T, where ζ is some constant. The matrix Ω_22 is lower Hessenberg and, because of the form of Ω_12, orthogonal. Let X̃_22 denote the (2, 2) block of Ω^T (V^T V)^{-1} Ω. Block multiplication, the orthogonality of Ω_22, and the inductive hypothesis give

    X̃_22 = ζ e_1 ( e_{n−l}^T (X_11 Ω_12 + X_12 Ω_22) ) + ζ ( Ω_22^T X_12^T e_{n−l} ) e_1^T + µ I_{n−l},    (4.20)

which differs from µ I_{n−l} only in the first row and column. Since l̄ = l + 1, the partitioning of (Ṽ^T Ṽ)^{-1} is changed so that X̄_11 ∈ IR^{(l+1)×(l+1)}. Substitution of (4.20) into (4.19) yields

    (Ṽ^T Ṽ)^{-1} = [ X̄_11     X̄_12
                       X̄_12^T   µ I_{n−l̄} ].

Finally, using Lemma 4.4 and (4.6) again, we have X_22 = µ̄ I_{n−l̄}, as required.
We now return to the derivation of the right-hand side of (4.16). Since V̄_2^{-1} V̄_1 = 0 by definition of a matrix inverse, the rows of V̄_2^{-1} are a basis for null(V_1). Theorem 4.2 implies that V̄_2^{-1} (V̄_2^{-1})^T = µ̄ I_{n−l}, which means that the rows of µ̄^{-1/2} V̄_2^{-1} are orthonormal. Hence, the rows of µ̄^{-1/2} V̄_2^{-1} form an orthonormal basis for null(V_1). The form of Q given in (4.13) and the definition of U imply that ( Y  W ) also forms an orthonormal basis for null(V_1). Thus, µ̄^{-1/2} (V̄_2^{-1})^T = ( Y  W ) N, where N ∈ IR^{(n−l)×(n−l)} is nonsingular. Moreover, N is orthogonal. Therefore,

    Y^T (V̄_2^{-1})^T V̄_2^{-1} Y = µ̄ Y^T ( Y  W ) N N^T [ Y^T
                                                            W^T ] Y = µ̄ I_{r−l}.

Substituting this result into (4.16), and using (4.6), it follows that

    Y^T B̄ Y − Y^T B Y = (µ̄ − µ) I_{r−l}.

The effect that rescaling V has on Q^T B Q is now fully determined and is summarized in the following theorem.
Theorem 4.3 Let V_0 denote any orthogonal matrix. During the kth iteration of Algorithm CDR, let B̃ = (Ṽ Ṽ^T)^{-1}, where Ṽ is the BFGS update to V. Let Q̄ be defined as in (4.13) postdated one iteration. Then, for k = 0 and k ≥ 1 respectively,

    Q̄^T B̃ Q̄ = [ Z̄^T B̃ Z̄   0
                  0           I_{n−r̄} ]    and    Q̄^T B̃ Q̄ = [ Z̄^T B̃ Z̄   0
                                                                 0           µ I_{n−r̄} ].

Now, let V̄ be given by (3.9). If B̄ = (V̄ V̄^T)^{-1}, then B̄ satisfies

    Q̄^T B̄ Q̄ = [ Z̄^T B̄ Z̄   0
                  0           µ̄ I_{n−r̄} ],

where

    Z̄^T B̄ Z̄ = [ Ū^T B̃ Ū   Ū^T B̃ Ȳ
                  Ȳ^T B̃ Ū   Ȳ^T B̃ Ȳ ] − [ 0   0
                                              0   (µ − µ̄) I_{r̄−l̄} ].
4.4 The proof of equivalence
The results of this chapter are now applied in the proof of the following theorem on the equivalence of Algorithms RHRL and CDR. Following the theorem are two corollaries, one of which addresses the convergence properties of Algorithm RHRL on strictly convex quadratic functions. In the proof of the theorem, the subscript "c" is used to denote quantities generated by Algorithm CDR.

Theorem 4.4 Consider the application of Algorithm RHRL and Algorithm CDR in exact arithmetic to find a local minimizer of a twice-continuously differentiable f : IR^n → IR, where the former algorithm uses σ_0 = 1, R5 and ε = 0. If τ = τ_c and both algorithms start from the same initial point x_0, then they generate the same sequence {x_k} of iterates.
Proof. It suffices to show that both algorithms generate the same sequence of search directions. Clearly, p = p_c = −g for k = 0. Assume that the first k search directions satisfy p = p_c, and that the index l increases on the same iterations that l_c increases. Assume that the matrices Q and Q_C are identical, satisfying U = U_C, Y = Y_C and W = W_C. This is true at the start of the first iteration since U and U_C are vacuous, Y and Y_C both equal g_0/‖g_0‖, and the implicit matrices W and W_C can be considered to be equal. Furthermore, assume that V and R_Q are such that R_Q^T R_Q = Q_C^T (V V^T)^{-1} Q_C. This is true at the start of the first iteration since R_Q = I and since V is orthogonal.

At the start of iteration k, the reduction in the quadratic model in range(U) is equal to ½‖t_U‖², where R_U^T t_U = −g_U. Since U = U_C = V_1 M for some nonsingular matrix M, and since R_U^T R_U = U_C^T B_C U_C, it follows that

    ½‖t_U‖² = ½ g_U^T (U_C^T B_C U_C)^{-1} g_U = ½ g_1^T M (M^T V_1^T B_C V_1 M)^{-1} M^T g_1 = ½‖g_1‖²,

where the last equality follows since V_1^T B_C V_1 = I_{l_c}. A similar argument shows explicitly that

    τ(‖t_Z‖² + ‖t_Y‖²) = τ_c(‖g_1‖² + ‖g_2‖²),

since τ = τ_c by assumption. Hence, the parameters l and l_c increase or remain fixed in tandem. If l̄ = l̄_c = l, then

    p = −U(R_U^T R_U)^{-1} U^T g = −V_1 M (M^T V_1^T B_C V_1 M)^{-1} M^T V_1^T g = −V_1 V_1^T g = p_c.

Otherwise, l̄ = l̄_c = l + 1 and

    p = −Z(R_Z^T R_Z)^{-1} Z^T g = −Z_C (Z_C^T B_C Z_C)^{-1} Z_C^T g = p_c,

where the last equality is given in (4.10). Thus, the search directions satisfy p = p_c and x̄ = x̄_c, assuming that both algorithms use the same line search strategy.
The matrix Z̄ is defined by Algorithm RHRL so that range(Z̄) = span(Ḡ), and if l̄ = l + 1, then p is rotated into the basis so that range(Ū) = span(P̄). The implicit matrix W̄ is defined so that Q̄ = ( Z̄  W̄ ) is orthogonal. Note that the update to Q does not affect the underlying matrix B, i.e., Q̄ R_Q̄^T R_Q̄ Q̄^T = B = B_C. Since s = s_c and y = y_c, the BFGS updates to R_Q̄ and V yield R̃_Q and Ṽ respectively, satisfying Q̄ R̃_Q^T R̃_Q Q̄^T = (Ṽ Ṽ^T)^{-1} = B̃_C. This equation and Theorem 4.3 imply that

    Q̄^T B̄_C Q̄ = R̃_Q^T R̃_Q − [ 0   0
                                  0   (µ − µ̄) I_{n−l̄} ] = R̄_Q^T R̄_Q,

where the last equality follows from (3.14) and the choice of σ̄. Since Q̄ = Q̄_C, the last equation implies that R̄_Q^T R̄_Q = Q̄_C^T (V̄ V̄^T)^{-1} Q̄_C, as required.
The following corollary addresses the quadratic termination of Algorithm CDR. This result was not given by Siegel in [45], although the algorithm was designed specifically not to interfere with the quadratic termination of the BFGS method.

Corollary 4.1 If Algorithm 3.1 is used with an exact line search at each iteration to minimize the strictly convex quadratic (1.17), then the iteration terminates at the minimizer x* in at most n steps.

Proof. The result follows since Algorithm CDR generates the same iterates as Algorithm RHRL and since the latter terminates on quadratics by Corollary 3.1.
The last result of this chapter gives the convergence properties of Algorithm RHRL when applied to strictly convex functions.

Corollary 4.2 Let f : IR^n → IR denote a strictly convex, twice-continuously differentiable function. Furthermore, assume that ∇²f(x) is Lipschitz continuous with ‖∇²f(x)^{-1}‖ bounded above for all x in the level set of x_0. If Algorithm RHRL with a Wolfe line search is used to minimize f, then convergence is global and superlinear.

Proof. Since Algorithm CDR has these convergence properties, the proof is immediate from Theorem 4.4.
Chapter 5
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
5.1 Large-scale quasi-Newton methods
When n is large, it may be impossible to store the Cholesky factor of Bk or the
conjugate-direction matrix Vk. Conjugate-gradient methods can be used in this
case and require storage for only a few n-vectors (see Gill et al. [22, pp. 144–
150]). However, these methods can require as many as 5n iterations and may be
prohibitively costly in terms of function evaluations. In an effort to accelerate
these methods, several authors have proposed “limited-memory” quasi-Newton
methods. These methods define a quasi-Newton update used either alone (e.g.,
see Shanno [43], Gill and Murray [19] or Nocedal [35]) or in a preconditioned
conjugate-gradient scheme (e.g., see Nazareth [33] or Buckley [4], [5]). Instead of
forming Hk explicitly, these methods store vectors that implicitly define Hk as a
sequence of updates to an “initial” inverse approximate Hessian. This allows the
direction pk = −Hkgk to be computed using a sequence of inner products.
For example, Nocedal’s method [35] makes use of the product form of
the inverse BFGS update
Hk+1 = MTk HkMk +
sksTk
sTk yk
, where Mk = I − yksTk
sTk yk
(5.1)
(see (2.12) for the corresponding form of the general Broyden update). Storage
is provided for a maximum of m pairs of vectors (si, yi). Once the storage limit
is reached, the oldest pair of vectors is discarded at each iteration. Hence, after
the mth iteration, Hk is given by
Hk = MTk−1 · · ·MT
k−m−1H0kMk−m−1 · · ·Mk−1
+MTk−1 · · ·MT
k−m
sk−m−1sTk−m−1
sTk−m−1yk−m−1
Mk−m · · ·Mk−1
...
+MTk−1
sk−2sTk−2
sTk−2yk−2
Mk−1 +sk−1s
Tk−1
sTk−1yk−1
,
(5.2)
where H0k is chosen during iteration k. Liu and Nocedal study several different
choices for H0k , all of which are diagonal. In particular, the choice
H0k = θkI, where θk =
sTk yk
yTk yk
,
is shown to be the most effective in practice (see Liu and Nocedal [28]).
The formula (5.2) for H_k is used to compute p_k from the stored vectors s_{k−m}, . . . , s_{k−1} and y_{k−m}, . . . , y_{k−1}. An efficient method for computing p_k, due to Strang, is given in Nocedal [35] and requires 4mn floating-point operations. The iterations proceed using formula (5.2) until a non-descent search direction is computed. The matrix H_k is then reset to a diagonal matrix and the storage of pairs (s_i, y_i) begins from scratch.
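The product (5.2) need not be formed explicitly. The following sketch applies the standard two-loop recursion (equivalent in cost, roughly 4mn operations, to the Strang recursion cited above) to compute p_k = −H_k g_k from the stored pairs; the function and variable names are ours, not those of the thesis implementation.

    import numpy as np

    def lbfgs_direction(g, S, Y, theta):
        """Return p = -H g for the limited-memory inverse Hessian H
        defined by the stored pairs (s_i, y_i), oldest first, and the
        initial matrix H0 = theta*I."""
        q = g.copy()
        rhos = [1.0 / (y @ s) for s, y in zip(S, Y)]
        alphas = []
        # First loop: run from the newest pair to the oldest.
        for s, y, rho in zip(reversed(S), reversed(Y), reversed(rhos)):
            a = rho * (s @ q)
            alphas.append(a)
            q -= a * y
        r = theta * q                  # apply the initial matrix H0 = theta*I
        # Second loop: run from the oldest pair back to the newest.
        for (s, y, rho), a in zip(zip(S, Y, rhos), reversed(alphas)):
            b = rho * (y @ r)
            r += (a - b) * s
        return -r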
Reduced-Hessian methods provide an alternative to standard limited-memory methods. Fenelon proposed the first reduced-Hessian method for large-scale unconstrained optimization in her dissertation (see Fenelon [14]). Her method is an extension of the Cholesky-factor method given in Section 2.1 and is based on the fact that the reduced Hessian is tridiagonal when minimizing quadratic functions using an exact line search. This tridiagonal form implies that the reduced Hessian can be written as Z^T B Z = L D L^T, where L is unit lower bidiagonal. The matrix Z is partitioned as Z = ( Z_1  Z_2 ), where Z_2 corresponds to the last m accepted gradients. Fenelon suggests forcing L to have the block structure

    L = [ L_11               0
          λ e_1 e_{r−m}^T    L_22 ],   where λ ∈ IR,

L_11 is unit lower bidiagonal and L_22 is unit lower triangular. A recurrence relation is given for computing p satisfying L D L^T p_Z = −g_Z and p = Z p_Z using L_22, Z_2 and one extra n-vector. The form of L is motivated by the desire for quadratic termination. However, the update to L_22 may not be defined when minimizing a general f because of a loss of positive definiteness in the matrix L D L^T. Such an indefinite update does not occur because of roundoff error, but stems rather from the assumed structure of L. Fenelon suggests a restart strategy to alleviate the problem, but reports disappointing results (see Fenelon [14] for further details of the algorithm and complete test results).
Nazareth has defined reduced inverse-Hessian successive affine reduction (SAR) methods. These methods store a matrix Z^T H Z, where

    range(Z) = span{p_{k−1}, g_{k−m+1}, g_{k−m+2}, . . . , g_k}   (assuming k ≥ m − 1)

and the columns of Z are orthonormal. In terms of Z^T H Z, the search direction satisfies p = −Z(Z^T H Z) Z^T g. The inclusion of p_{k−1} in range(Z) ensures that the method terminates on quadratics.
Siegel has proposed a method based on the reduced inverse approximate Hessian method of Section 2.2. The method differs from Fenelon's in that no attempt is made to define the approximate inverse Hessian corresponding to Z_1. Positive definiteness of Z_2^T H Z_2 is guaranteed, and quadratic termination is achieved by redefining Z_2 after the computation of p so that p ∈ range(Z_2). The method differs from SAR methods because the size of the reduced Hessian is explicitly controlled. Information is only discarded when the acceptance of a new gradient causes the reduced Hessian to exceed order m (see Siegel [46] for complete details).
5.2 Extending Algorithm RH to large problems
Four new reduced-Hessian methods for large-scale optimization are introduced in this chapter. The first, which is called Algorithm RH-L-G, is similar to Fenelon's method in the sense that it uses both a Cholesky factor of the reduced Hessian and an orthonormal basis for the gradients. The second, called Algorithm RH-L-P, uses an orthonormal basis for previous search directions and possibly the last accepted gradient. The third and fourth new algorithms, called RHR-L-G and RHR-L-P, use the method of rescaling proposed in Chapter 3. Since Algorithms RHR-L-G and RHR-L-P can be implemented without rescaling, they include Algorithms RH-L-G and RH-L-P as special cases.

The methods are similar to Siegel's and utilize two important features of his algorithms.
• When information is discarded, the exact reduced Hessian corresponding to the saved gradients (or search directions) is maintained. Since the saved gradients (search directions) will be linearly independent, this implies that there is no loss of positive definiteness in exact arithmetic.

• In Algorithm RHR-L-P, the last accepted gradient is replaced by the search direction in order to establish quadratic termination.
In Section 5.3, numerical results are given for the algorithms. The results show that Algorithm RHR-L-P outperforms RHR-L-G, which suggests that the quadratic termination property is beneficial in practice. Rescaling is shown to be crucial in practice through numerical experimentation. Results are given comparing the methods to the limited-memory BFGS algorithm of Zhu et al. [49], which may be considered the current state of the art.
5.2.1 Imposing a storage limit
Let m denote a prespecified “storage limit”. This limit restricts the size of the
reduced Hessian passed from one iteration to the next. If the reduced Hessian
grows to size (m+1)× (m+1) during any iteration, then approximate curvature
information will be discarded and an m × m reduced Hessian is passed to the
next iteration. Several authors have suggested discarding curvature information
corresponding to the “oldest” gradient (e.g., see Fenelon [14], Nazareth [34] and
Siegel [46]). Alternative discard procedures are the subject of future research and
will not be considered in this thesis.
To introduce some of the notation that will be used, we present an example illustrating how the oldest gradient can be discarded. At the end of the kth iteration, suppose that Z and R_Z are associated with G = ( g_0  g_1  · · ·  g_m ) (g_0, g_1, . . . , g_m are assumed to be linearly independent). Because m + 1 linearly independent vectors are in the basis, g_0 will be discarded before the start of iteration k + 1. We will use tildes to denote the corresponding quantities following the deletion of g_0. In this case, G̃ = ( g_1  g_2  · · ·  g_m ) and range(Z̃) = range(G̃) with Z̃^T Z̃ = I_m. The matrix R_Z̃ will denote the Cholesky factor of the reduced Hessian Z̃^T B^ε Z̃. The determination of Z̃ is considered in the next section.
5.2.2 The deletion procedure
In the next two sections, we consider the definition of Z̃ when G̃ is obtained from G by dropping the oldest gradient. This procedure is due to Daniel et al. (see [8] for further details). In the first section, we consider an example; in the second, we give the general procedure.
An example of the discard procedure
Consider the case where n = 4, m = 2 and G = ( g_0  g_1  g_2 ). (The gradients g_0, g_1 and g_2 are assumed to be linearly independent.) The matrix Z satisfies range(Z) = range(G), Z^T Z = I_3, and is obtained using the Gram-Schmidt process on g_0, g_1 and g_2. We require Z̃ such that range(Z̃) = range(G̃), where G̃ = ( g_1  g_2 ), and Z̃^T Z̃ = I_2. Recall that there exists a nonsingular upper-triangular matrix T such that G = Z T. Let Z and T be partitioned as

    Z = ( z_1  z_2  z_3 )   and   T = [ t_11   t_12   t_13
                                        0      t_22   t_23
                                        0      0      t_33 ].

It follows that

    g_1 = t_12 z_1 + t_22 z_2   and   g_2 = t_13 z_1 + t_23 z_2 + t_33 z_3.

Hence, no two columns of Z define a basis for range(G̃). Let P_12 denote a 3×3 symmetric Givens matrix in the (1, 2) plane defined to annihilate t_22 in T. Let P_23 denote a symmetric Givens matrix in the (2, 3) plane defined to annihilate t_33 in P_12 T. It follows that G = (Z P_12 P_23)(P_23 P_12 T) and that P_23 P_12 T is of the form

    P_23 P_12 T = [ ×   ×   ×
                    ×   0   ×
                    ×   0   0 ].

Suppose we partition Z P_12 P_23 and P_23 P_12 T such that

    Z P_12 P_23 = ( Z̃  z )   and   P_23 P_12 T = [ t   T̃
                                                    τ   0 ],

where Z̃ ∈ IR^{4×2} and T̃ ∈ IR^{2×2}. Note that T̃ is nonsingular since G has full rank. These partitions imply that G̃ = Z̃ T̃, and it follows that Z̃ and T̃ define a skinny Gram-Schmidt QR factorization of G̃.

It is important to note that the discard procedure cannot be accomplished without knowledge of T. The Givens matrices P_12 and P_23 depend on every nonzero component of T except t_11.
The general drop-off procedure
During iteration k, suppose that g = g_{k_m} is accepted into the basis and that, with the addition of g, the reduced approximate Hessian attains order m + 1. The matrix of accepted gradients, G = ( g_{k_0}  g_{k_1}  · · ·  g_{k_m} ), may be partitioned as G = ( g_{k_0}  G̃ ) in accordance with the strategy of deleting the oldest gradient. Define T_S as S T, where S denotes an orthogonal matrix, and define T̃ as the m×m (1, 2) block of T_S. The matrix S is defined so that T̃ is nonsingular and upper triangular. In particular, S = P_{m,m+1} P_{m−1,m} · · · P_{12}, where P_{i,i+1} is a symmetric (m+1)×(m+1) Givens matrix in the (i, i+1) plane defined to annihilate the (i+1, i+1) element of P_{i−1,i} · · · P_{12} T. The resulting product satisfies

    S T = [ t   T̃
            τ   0 ],   where t ∈ IR^m.

Let Z_S = Z S^T and define Z̃ ∈ IR^{n×m} by the partition Z_S = ( Z̃  z ), i.e., Z̃ = Z_S E_m, where E_m consists of the first m columns of I. From the definition of G, we have

    G = ( g_{k_0}  G̃ ) = Z T = Z_S T_S = ( Z̃ t + τ z   Z̃ T̃ ),

and it follows that G̃ = Z̃ T̃ is a Gram-Schmidt QR factorization corresponding to the last m accepted gradients.
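The sweep of symmetric Givens rotations is easy to state in code. The following is a sketch of the deletion step described above; the function name and the dense-matrix representation are ours, and the factors are assumed to have full rank.

    import numpy as np

    def drop_oldest(Z, T):
        """Given the skinny QR factorization G = Z T (Z is n-by-(m+1)
        with orthonormal columns, T upper triangular), return factors of
        the matrix formed by the last m columns of G."""
        Z, T = Z.copy(), T.copy()
        mp1 = T.shape[0]
        for i in range(mp1 - 1):
            # Symmetric rotation in the (i, i+1) plane annihilating T[i+1, i+1].
            a, b = T[i, i + 1], T[i + 1, i + 1]
            r = np.hypot(a, b)
            c, s = a / r, b / r
            P = np.array([[c, s], [s, -c]])
            T[i:i + 2, :] = P @ T[i:i + 2, :]
            Z[:, i:i + 2] = Z[:, i:i + 2] @ P    # accumulate Z S^T
        return Z[:, :mp1 - 1], T[:mp1 - 1, 1:]   # the factors Z~ and T~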
5.2.3 The computation of T
We now describe the computation of the nonsingular upper-triangular matrix T .
This matrix is a by-product of the Gram-Schmidt process described in Section
2.1.
Given that Z and T are known at the start of iteration k (they are easily defined at the start of the first iteration), consider the definition of Z̄ and T̄. During iteration k, suppose that g is accepted, giving Ḡ = ( G  g ). Define ρ_g = ‖(I − Z Z^T)g‖ and z_g = (I − Z Z^T)g/ρ_g as in Section 2.1. If Z̄ and T̄ are defined by

    Z̄ = ( Z  z_g )   and   T̄ = [ T   Z^T g
                                  0   ρ_g   ],

then Z̄ T̄ = Ḡ. If g is rejected, we will define Z̄ = Z and T̄ = T.

In summary, after the computation of g, r̄ is defined as in Chapter 2, i.e.,

    r̄ = { r,       if ρ_g ≤ ε‖g‖;
        { r + 1,   otherwise.                                           (5.3)

The updates to Z and T satisfy

    Z̄ = { Z,            if r̄ = r;
        { ( Z  z_g ),   otherwise,                                      (5.4)

and

    T̄ = { T,            if r̄ = r;
        { [ T   g_Z
        {   0   ρ_g ],  otherwise.                                      (5.5)

For convenience, we define the function GST (short for Gram-Schmidt orthogonalization including T),

    (Z̄, T̄, ḡ_Z, r̄) = GST(Z, T, g, r, ε),

which defines r̄, T̄ and Z̄ according to (5.3)–(5.5).
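A sketch of GST in code form is given below (the names are ours). It performs one Gram-Schmidt step, applies the acceptance test (5.3), and extends Z and T according to (5.4)–(5.5).

    import numpy as np

    def gst(Z, T, g, r, eps):
        """Gram-Schmidt orthogonalization step that also maintains T.
        Returns the (possibly expanded) basis, triangular factor,
        reduced gradient Z^T g, and rank."""
        gZ = Z.T @ g
        w = g - Z @ gZ                    # component of g orthogonal to range(Z)
        rho = np.linalg.norm(w)
        if rho <= eps * np.linalg.norm(g):
            return Z, T, gZ, r            # g rejected: basis unchanged
        Znew = np.column_stack([Z, w / rho])
        m = T.shape[0]
        Tnew = np.zeros((m + 1, m + 1))   # append (Z^T g, rho) as a new column
        Tnew[:m, :m] = T
        Tnew[:m, m] = gZ
        Tnew[m, m] = rho
        return Znew, Tnew, np.append(gZ, rho), r + 1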
5.2.4 The updates to gZ and RZ
The change of basis necessitates changing g_Z and R_Z so that all quantities passed to the next iteration correspond to the new basis defined by Z̃. The quantity g_Z̃ needed to compute the search direction during iteration k + 1 can be obtained from g_Z without the mn floating-point operations required to compute Z̃^T g from scratch. Let g_S denote the vector S g_Z = P_{m,m+1} P_{m−1,m} · · · P_{12} g_Z. Since Z̃ = Z_S E_m (recall that E_m denotes the matrix of first m columns of I), we have

    g_Z̃ = Z̃^T g = (Z_S E_m)^T g = E_m^T S Z^T g = E_m^T S g_Z = E_m^T g_S.

Thus, g_Z̃ is given by the first m components of g_S.

It remains to define an update to R_Z that yields R_Z̃, where R_Z̃^T R_Z̃ = Z̃^T B^ε Z̃. The latter quantity satisfies

    Z̃^T B^ε Z̃ = (Z S^T E_m)^T B^ε (Z S^T E_m) = E_m^T S R_Z^T R_Z S^T E_m.

In general, the matrix R_Z S^T is not upper triangular. Let S̄ denote an orthogonal matrix of order m + 1 defined so that S̄ R_Z S^T is upper triangular. If R_S = S̄ R_Z S^T denotes the resulting matrix, then it follows that

    Z̃^T B^ε Z̃ = E_m^T R_S^T R_S E_m,

which implies that the leading m×m block of R_S is the required factor R_Z̃.

The matrix S̄ is the product P̄_{m,m+1} · · · P̄_{23} P̄_{12}, where P̄_{i,i+1} is an (m+1)×(m+1) Givens matrix in the (i, i+1) plane that annihilates the (i+1, i) element of P̄_{i−1,i} · · · P̄_{12} R_Z P_{12} · · · P_{i,i+1}. The two sweeps of Givens matrices defined by S and S̄ must be interlaced, as in the update to the Cholesky factor following the change of basis for Z (see Section 2.5).

For notational convenience, we define the function discard corresponding to the drop-off procedure. We write

    (Z̃, T̃, g_Z̃, R_Z̃) = discard(Z, T, g_Z, R_Z).

The quantities g_Z and R_Z are supplied to discard because if g_Z̃ and R_Z̃ are computed during the computation of Z̃, the rotations defining S need not be stored.
5.2.5 Gradient-based reduced-Hessian algorithms
The first of the four reduced-Hessian methods for large-scale unconstrained op-
timization is given as Algorithm RH-L-G below.
Algorithm 5.1. Gradient-based large-scale reduced-Hessian method (RH-L-G)

    Initialize k = 0; Choose x_0, σ, ε and m;
    Initialize r = 1, Z = g_0/‖g_0‖, T = ‖g_0‖ and R_Z = σ^{1/2};
    while not converged do
        Solve R_Z^T t_Z = −g_Z, R_Z p_Z = t_Z and set p = Z p_Z;
        Compute α so that s^T y > 0 and set x̄ = x + αp;
        Compute (Z̄, T̄, ḡ_Z, r̄) = GST(Z, T, ḡ, r, ε);
        if r̄ = r + 1 then
            Define p̄_Z = (p_Z, 0)^T, g^ε_Z = (g_Z, 0)^T and R̄_Z = diag(R_Z, σ^{1/2});
        else
            Define p̄_Z = p_Z, g^ε_Z = g_Z and R̄_Z = R_Z;
        end if
        Compute s_Z = α p̄_Z and y^ε_Z = ḡ_Z − g^ε_Z;
        Compute R̄_Z = Broyden(R̄_Z, s_Z, y^ε_Z);
        if r̄ = m + 1 then
            Compute (Z̄, T̄, ḡ_Z, R̄_Z) = discard(Z̄, T̄, ḡ_Z, R̄_Z);
            r̄ ← m;
        end if
        k ← k + 1;
    end do
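For reference, the first step of the loop, the pair of triangular solves defining the search direction, might look as follows in code (a sketch only; the names are ours):

    import numpy as np
    from scipy.linalg import solve_triangular

    def search_direction(R, Z, gZ):
        """Solve R^T t_Z = -g_Z and R p_Z = t_Z, then lift p = Z p_Z."""
        t = solve_triangular(R, -gZ, trans='T', lower=False)  # R^T t = -g_Z
        pZ = solve_triangular(R, t, lower=False)              # R p_Z = t
        return Z @ pZ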
5.2.6 Quadratic termination
Fenelon [14] and Siegel [46] have observed that gradient-based reduced-Hessian al-
gorithms may not enjoy quadratic termination. Consider a quasi-Newton method
employing an update from the Broyden class and an exact line search. Recall that
when this method is applied to a quadratic, the search directions are parallel to
the conjugate-gradient directions (see Section 1.2.1). However, we demonstrate
below that the directions generated by Algorithm RH-L-G are not necessarily
parallel to the conjugate-gradient directions. Moreover, Algorithm RH-L-G does
not exhibit quadratic termination in practice.
Table 5.1 gives the definition of the search direction generated during
iteration k + 1 of both the conjugate-gradient method and Algorithm RH-L-G.
Note that the conjugate-gradient direction is a linear combination of g and p.
Table 5.1: Comparing p from CG and Algorithm RH-L-G on quadratics

    Iteration    Conjugate Gradient                 Reduced Hessian
    k + 1        p̄ = −ḡ + (‖ḡ‖²/‖g‖²) p             p̄ = Z̄ p̄_Z

Suppose that the first k + 1 directions of Algorithm RH-L-G are parallel to the first k + 1 conjugate-gradient directions. Under this assumption, ḡ is accepted by Algorithm RH-L-G during iteration k, since ḡ is orthogonal to the previous gradients (see Section 1.2.1). It follows by construction that ḡ ∈ range(Z̄). However, if the oldest gradient is dropped from Ḡ, and if p̄ has a nonzero component along the direction of the oldest gradient, then p̄ ∉ range(Z̃). Hence, the search direction p̄ generated by Algorithm RH-L-G cannot be parallel to the corresponding conjugate-gradient direction.
Authors have devised various ways of ensuring quadratic termination of
reduced-Hessian type methods. As described in Section 5.1, Fenelon [14] obtains
quadratic termination by recurring the super-diagonal elements of RZ correspond-
ing to the deleted gradients. Nazareth [34] defines the basis used during iteration
k+1 to include p and g. Siegel [46] maintains quadratic termination by replacing
g with p in the definition of Z whenever the former is accepted. This exchange is
discussed further in the next section and will lead to a modification of Algorithm
RH-L-G.
5.2.7 Replacing g with p
Consider the set of search directions

    P_k = {p_0, p_1, . . . , p_k},                                      (5.6)

generated by a quasi-Newton method (see Algorithm 1.2) using updates from the Broyden class. Siegel has observed that the subspace associated with G_k is also determined by P_k, i.e., span(G_k) = span(P_k). Lemma 5.1 given below is essential
for the proof of this result. In Lemma 5.1, z_g and z_p are the normalized components of ḡ and p̄, respectively, that are orthogonal to span(G). The normalized component of p̄ orthogonal to range(Z) is given by

    z_p = { 0,                        if ρ_p = 0;
          { (1/ρ_p)(I − Z Z^T) p̄,    otherwise,                         (5.7)

where ρ_p = ‖(I − Z Z^T) p̄‖. The lemma establishes that z_p is nonzero as long as z_g is nonzero, i.e., p̄ always includes a component along z_g. Note that in the proof of Lemma 5.1, Z and Z̄ are assumed to be exact orthonormal bases for span(G) and span(Ḡ), respectively.
Lemma 5.1 If B_0 = σI (σ > 0), and B_k is updated using an update from the Broyden class, then z_p = ±z_g.

Proof. The proof is trivial if z_g = 0. Suppose z_g ≠ 0. Using (2.4), p̄ = Z̄ Z̄^T p̄ = Z Z^T p̄ + (z_g^T p̄) z_g, which implies that (I − Z Z^T) p̄ = (z_g^T p̄) z_g and ρ_p = |z_g^T p̄|. Hence, as long as z_g^T p̄ ≠ 0, z_p = sign(z_g^T p̄) z_g, as required.

It remains to show that z_g^T p̄ cannot be zero. Assume that z_g^T p̄ = 0 with z_g ≠ 0, which means that p̄ = Z p_1, where p_1 ∈ IR^r, i.e., p̄ ∈ span(G). Using the Broyden update formulae (1.14), the equation B̄ p̄ = −ḡ, and the equations

    s = αp   and   B p = −g,                                            (5.8)

it follows that

    B p̄ + ( αφ w^T p̄ + (p̄^T g)/(p^T g) + (αφ s^T g w^T p̄ − p̄^T y)/(s^T y) ) g
        = ( (αφ s^T g w^T p̄ − p̄^T y)/(s^T y) − 1 ) ḡ.                   (5.9)

Since p̄ ∈ span(G), Lemma 2.3 implies that B p̄ ∈ span(G). Thus, if (αφ s^T g w^T p̄ − p̄^T y)/(s^T y) ≠ 1, then equation (5.9) implies that ḡ ∈ span(G), which contradicts z_g ≠ 0. Otherwise, equation (5.9) implies that

    B p̄ = −β g,   where   β = αφ w^T p̄ + (p̄^T g)/(p^T g) + (αφ s^T g w^T p̄ − p̄^T y)/(s^T y).

Multiplying through by B^{-1} gives p̄ = −β B^{-1} g = β p. Combining this with the quasi-Newton condition B̄ s = y and (5.8) gives

    (β/α + 1) ḡ = (β/α) g,

which must imply that ḡ is parallel to g, contradicting z_g ≠ 0. These contradictions establish that z_g^T p̄ ≠ 0, as required.
Once Lemma 5.1 is established, the following result follows directly.
Theorem 5.1 (Siegel) If B0 = σI (σ > 0), and Bk is updated using any formula
from the Broyden class, then
span(Gk) = span(Pk).
Proof. The result is clearly true for k = 0. Suppose that the result holds
through iteration k − 1, i.e., span(Pk−1) = span(Gk−1). Since pk ∈ span(Gk) (see
Lemma 2.3), it follows that span(Pk) ⊆ span(Gk). A straightforward application
of Lemma 5.1 (predated one iteration) implies that span(Gk) ⊆ span(Pk) and the
desired result follows.
Recall that the iterates x0, x1, . . ., xk+1 of quasi-Newton methods (using
updates from the Broyden class) lie on the manifoldM(x0,Gk). Since span(Pk) =
span(Gk), the iterates are a spanning set for the manifold. Hence, these methods
exploit all gradient information.
We now consider exchanging the search direction for the gradient in
Algorithm RH (p. 28). Recall that this algorithm uses an approximate basis Gk
for span(G) (see Section 2.1). The columns of G_k are the accepted gradients g_{k_1}, g_{k_2}, . . ., g_{k_r}. We define P_k as the corresponding matrix of search directions, i.e.,

    P_k = ( p_{k_1}   p_{k_2}   · · ·   p_{k_r} ).                      (5.10)
The following corollary, analogous to Lemma 5.1, implies that p has a nonzero
component along zg whenever g is accepted.
Corollary 5.1 If zg is defined as in Algorithm RH and zp is defined by (5.7),
then zp = ±zg.
Proof. In the reduced-Hessian method, a full approximate Hessian is not stored, but one can be implicitly defined by

    B^ε = Q [ R_Z^T R_Z   0
              0           σ I_{n−r} ] Q^T

(see Section 2.4). Similarly, define

    B̄^ε = Q̄ [ R̄_Z^T R̄_Z   0
               0             σ I_{n−r̄} ] Q̄^T,

and note that in terms of these effective Hessians, the search directions satisfy B^ε p = −Z g_Z and B̄^ε p̄ = −Z̄ ḡ_Z̄, respectively. In light of these two equations, define g^ε = Z g_Z, ḡ^ε = Z̄ ḡ_Z̄, and y^ε = ḡ^ε − g^ε. Hence,

    B^ε p = −g^ε   and   B̄^ε p̄ = −ḡ^ε.                                  (5.11)

If R̄_Z is obtained from R_Z using a Broyden update as in Algorithm 2.2, then B̄^ε is the matrix obtained by applying the same Broyden update to B^ε using the quantities B^ε and y^ε in place of B and y, respectively. A short calculation verifies the quasi-Newton condition B̄^ε s = y^ε. The rest of the proof proceeds as in Lemma 5.1.
Theorem 5.2, which is analogous to Theorem 5.1, follows immediately from Corollary 5.1.

Theorem 5.2 If B_0 = σI (σ > 0) in Algorithm RH, then

    range(G_k) = range(P_k).

Proof. The proof is analogous to the proof of Theorem 5.1 and is omitted.
When the discard procedure is used, we expect that z_p is nonzero whenever g is accepted. However, at the time of the completion of this dissertation, this result had not been proved. We therefore give the following proposition.

Proposition 5.1 If Algorithm RH-L-G is used to minimize f : IR^n → IR, then z_p ≠ 0 whenever g is accepted.

Replacing g with p can be accomplished by simply replacing the last column of T with p_Z. We will use the function chbs (for "change of basis") to denote this replacement and will write

    T = chbs(T).

In the absence of a proof of the proposition, we only perform the replacement if ρ_p > ε_M ‖p_Z‖. We note that in exact arithmetic, ρ_p must be nonzero if the search directions are conjugate-gradient directions. This follows because the conjugate-gradient directions are linearly independent (see Fletcher [15, p. 25]).
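In code, chbs amounts to overwriting a single column; a minimal sketch (the names are ours):

    def chbs(T, pZ):
        """Change of basis: replace the last column of T with p_Z, so that
        the last basis vector accounted for by T represents the search
        direction p = Z p_Z rather than the last accepted gradient."""
        T = T.copy()
        T[:, -1] = pZ
        return T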
The following algorithm employs the change of basis.
Algorithm 5.2. Direction-based large-scale reduced-Hessian method (RH-L-P)
The algorithm is identical to Algorithm RH-L-G except after defining p.
The lines following the computation of p are as follows.
if g was accepted then
T = chbs(T )
end if
Rescaling reduced Hessians for large problems
When solving smaller problems, the numerical effects of rescaling vary as shown in
Chapter 3. For larger problems, the discard procedure makes rescaling essential.
We now present two rescaling algorithms defined as extensions of Algorithms RH-
L-G and RH-L-P. The definition is based on the rescaling suggested in Algorithm
RHR (p. 52). Algorithm RHR-L-G is identical to RH-L-G except following the
BFGS update.
Algorithm 5.3. Gradient-based large-scale reduced-Hessian rescaling method (RHR-L-G)

    Compute σ̄;
    if r̄ = r + 1 then
        Replace the (r̄, r̄) element of R̄_Z with σ̄^{1/2};
    end if
Since much of the notation is altered for the direction-based algorithm,
it is given in its entirety.
Algorithm 5.4. Direction-based large-scale reduced-Hessian rescaling method (RHR-L-P)

    Initialize k = 0; Choose x_0, σ, ε and m;
    Initialize r = 1, Z = g_0/‖g_0‖, T = ‖g_0‖ and R_Z = σ^{1/2};
    while not converged do
        Solve R_Z^T t_Z = −g_Z, R_Z p_Z = t_Z and set p = Z p_Z;
        if g was accepted then
            T = chbs(T);
        end if
        Compute α so that s^T y > 0 and set x̄ = x + αp;
        Compute (Z̄, T̄, ḡ_Z, r̄) = GST(Z, T, ḡ, r, ε);
        if r̄ = r + 1 then
            Define p̄_Z = (p_Z, 0)^T, g^ε_Z = (g_Z, 0)^T and R̄_Z = diag(R_Z, σ^{1/2});
        else
            Define p̄_Z = p_Z, g^ε_Z = g_Z and R̄_Z = R_Z;
        end if
        Compute s_Z = α p̄_Z and y^ε_Z = ḡ_Z − g^ε_Z;
        Compute R̄_Z = BFGS(R̄_Z, s_Z, y^ε_Z);
        Compute σ̄;
        if r̄ = r + 1 then
            Replace the (r̄, r̄) element of R̄_Z with σ̄^{1/2};
        end if
        if r̄ = m + 1 then
            Compute (Z̄, T̄, ḡ_Z, R̄_Z) = discard(Z̄, T̄, ḡ_Z, R̄_Z);
            r̄ ← m;
        end if
        k ← k + 1;
    end do
In Section 5.3, we compare several choices of σ̄ used in Algorithm RHR-L-P. In Section 5.4, Algorithm RHR-L-P (and consequently Algorithm RH-L-P)
is shown to enjoy quadratic termination in exact arithmetic.
5.3 Numerical results
Results are presented in this section for various aspects of Algorithms RHR-L-G and RHR-L-P. We also present a comparison with the L-BFGS-B algorithm proposed by Zhu et al. [49]. (The L-BFGS-B algorithm is an extension of the L-BFGS method reviewed in Section 5.1, but performs similarly on unconstrained problems.)
Many of the problems are taken from the CUTE collection (see Bongartz
et al. [1]). In the tables of results, we will use the CUTE designation for the test
problems, although there is some overlap with the problems from Moré et al. [29] listed in Table 3.2.
In the following sections we answer four questions concerning the algo-
rithms.
• Does the enforcement of quadratic termination in Algorithm RHR-L-P re-
sult in practical benefits in comparison to Algorithm RHR-L-G?
• How do the various rescaling schemes presented in Table 3.1 affect the
performance of Algorithms RHR-L-G and RHR-L-P?
• What effect does the value of m have on the number of iterations and
function evaluations required by the algorithms?
• How many iterations and function evaluations do the algorithms require in
comparison with L-BFGS-B?
Algorithm RHR-L-G compared with RHR-L-P
In this section we examine the performance of Algorithm RHR-L-G compared
with Algorithm RHR-L-P. The implementation is in FORTRAN 77 using a DEC
5000/240. The line search, step length parameters and acceptance parameter are
the same as those used to test Algorithm RHR (see Section 3.3.2, p. 53). We
present results in Tables 5.2–5.3 for the rescaling methods R0 (no rescaling), R2
and R4 (see Table 3.1, p. 53, for definitions of the rescaling schemes).
The stopping criterion is ‖g_k‖_∞ < 10^{−5}, as suggested by Zhu et al. [49].
The notation “L” indicates that the algorithm terminated during the line search.
Termination during the line search usually occurs when the search direction is
nearly orthogonal to the gradient. A limit of 1500 iterations was imposed. The
notation “I” indicates that the algorithm was terminated after 1500 iterations.
Both the “L” and “I” are accompanied by a number in parentheses that indicates
the final infinity norm of the gradient.
Table 5.2: Iterations/Functions for RHR-L-G (m = 5)

    Problem       n     R0           R2           R4
    ARWHEAD     1000    8/14         17/22        17/22
    BDQRTIC      100    211/325      I(.1E-2)     259/268
    CRAGGLVY    1000    L(.2E-4)     I(.6E-2)     249/256
    DIXMAANA    1500    12/16        17/21        15/19
    DIXMAANB    1500    28/45        22/26        17/21
    DIXMAANE    1500    965/976      I(.6E-3)     1232/1236
    EIGENALS     110    I(.8E-1)     I(.7E-1)     I(.6E-1)
    GENROSE      500    I(.2E+1)     I(.2E+1)     I(.2E+1)
    MOREBV      1000    120/196      130/132      158/160
    PENALTY1    1000    49/60        75/89        59/71
    QUARTC      1000    120/178      645/1176     97/103
Table 5.3: Iterations/Functions for RHR-L-P (m = 5)

    Problem       n     R0           R2           R4
    ARWHEAD     1000    8/14         17/22        17/22
    BDQRTIC      100    123/211      625/634      148/163
    CRAGGLVY    1000    L(.1E-3)     L(.3E-4)     108/116
    DIXMAANA    1500    12/16        17/21        15/19
    DIXMAANB    1500    25/39        22/26        18/22
    DIXMAANE    1500    209/216      772/776      178/183
    EIGENALS     110    975/1880     659/1189     682/704
    GENROSE      500    1432/3449    1058/1903    1136/1211
    MOREBV      1000    100/201      96/98        81/83
    PENALTY1    1000    49/60        75/89        59/71
    QUARTC      1000    122/160      722/728      87/93

Consider the performance of the two algorithms for a given rescaling technique. For some of the problems (e.g., ARWHEAD, DIXMAANA–B and
PENALTY1) they perform similarly. However, for most of the problems, it is
clear that Algorithm RHR-L-P performs better than RHR-L-G. This is particu-
larly true for DIXMAANE, EIGENALS and GENROSE. There are very few cases
where the reverse is true and in these cases the difference is quite small. For ex-
ample, RHR-L-G without rescaling takes 196 function evaluations for MOREBV
while RHR-L-P requires 201.
A comparison of the rescaling schemes R3–R5
If we consider the three rescaling schemes shown in Table 5.3, then clearly R4 is
the best choice. In this section, we give results comparing more of the rescaling
techniques in conjunction with Algorithm RHR-L-P. In Tables 5.4–5.5, we con-
sider the choices R3–R5 for two sets of test problems from the CUTE collection.
The first set includes 26 problems whose names range in alphabetical order from
ARWHEAD to FLETCHCR. The second set includes problems from FMINSURF
to WOODS. The maximum number of iterations is increased to 3500 for this set of results.
Table 5.4: Results for RHR-L-P using R3–R5 (m = 5) on Set #1

    Problem       n     R3           R4           R5
    ARWHEAD     1000    17/22        17/22        17/22
    BDQRTIC      100    129/177      148/163      127/189
    BROYDN7D    1000    352/614      367/372      329/657
    BRYBND      1000    41/52        30/36        39/53
    CRAGGLVY    1000    100/138      108/116      91/162
    DIXMAANA    1500    15/19        15/19        15/19
    DIXMAANB    1500    18/22        18/22        18/22
    DIXMAANC    1500    12/17        12/17        12/17
    DIXMAAND    1500    24/28        21/25        23/27
    DIXMAANE    1500    156/293      178/183      159/303
    DIXMAANF    1500    175/320      147/152      165/306
    DIXMAANG    1500    108/199      124/130      134/251
    DIXMAANH    1500    268/442      243/254      256/542
    DIXMAANI    1500    1534/3059    1352/1367    1153/2289
    DIXMAANK    1500    136/257      129/136      153/299
    DIXMAANL    1500    174/318      147/152      154/286
    DQDRTIC     1000    7/10         12/15        7/10
    DQRTIC       500    85/92        85/91        90/97
    EIGENALS     110    667/1239     682/704      756/1439
    EIGENBLS     110    1184/2321    329/343      1562/3434
    EIGENCLS     462    3313/6810    2808/2843    3097/6804
    ENGVAL1     1000    23/29        22/26        25/33
    FLETCBV2    1000    1003/2007    952/967      1002/2005
    FLETCBV3    1000    L(.1E+2)     L(.3E-1)     L(.1E+2)
    FLETCHBV     100    L(.2E-1)     L(.2E+5)     L(.8E-1)
    FLETCHCR     100    84/142       76/84        92/196
From the tables, it is clear that R4 is the best choice of the rescaling
parameter in practice. The number of function evaluations for R3 and R5 is
similar with slightly fewer required for option R3. Note that for many of the
problems option R5 requires roughly twice as many function evaluations as iter-
ations. This behavior results because the rescaling parameter σ = µ results in a
search direction of larger norm than is acceptable by the line search. (In the case of a quadratic with exact line search, the length of the search direction varies as the inverse of σ.) For this choice, the line search is forced to interpolate nearly every iteration.

Table 5.5: Results for RHR-L-P using R3–R5 (m = 5) on Set #2

    Problem       n     R3           R4           R5
    FMINSURF    1024    180/337      229/233      214/470
    FREUROTH    1000    L(.3E-3)     L(.8E-4)     L(.1E-3)
    GENROSE      500    1150/2097    1136/1211    1417/3361
    LIARWHD     1000    20/26        20/26        20/26
    MANCINO      100    L(.2E-4)     L(.2E-4)     L(.2E-4)
    MOREBV      1000    96/179       81/82        95/187
    NONDIA      1000    9/21         9/21         9/21
    NONDQUAR     100    943/1445     794/848      1095/1848
    PENALTY1    1000    59/71        59/71        59/71
    PENALTY2     100    107/135      125/133      113/161
    PENALTY3     100    L(.2E-1)     L(.8E-2)     L(.8E-2)
    POWELLSG    1000    41/46        44/50        41/46
    POWER       1000    149/223      177/184      152/253
    QUARTC      1000    90/98        87/93        90/97
    SINQUAD     1000    143/189      142/187      130/173
    SROSENBR    1000    18/24        18/24        18/24
    TOINTGOR      50    134/220      157/161      137/256
    TOINTGSS    1000    5/8          5/8          5/8
    TOINTPSP      50    123/193      133/153      118/227
    TOINTQOR      50    41/57        43/45        51/74
    TQUARTIC    1000    20/26        20/26        20/26
    TRIDIA      1000    398/796      903/924      391/783
    VARDIM       100    36/44        36/44        36/44
    VAREIGVL    1000    91/152       90/95        92/165
    WOODS       1000    43/49        45/51        43/49
Results for Algorithm RHR-L-P using different m
Results are given in Table 5.6 for Algorithm RHR-L-P (with R4) using different
values of m.
Table 5.6: RHR-L-P using different m with R4

    Problem       n     m = 2        m = 5        m = 10       m = 15
    BDQRTIC      100    216/250      148/163      109/118      104/115
    CRAGGLVY    1000    127/134      108/116      118/126      124/132
    DIXMAANA    1500    13/17        15/19        15/19        15/19
    DIXMAANB    1500    14/18        18/22        18/22        18/22
    DIXMAANE    1500    222/226      178/183      190/194      193/197
    EIGENALS     110    752/765      682/704      682/711      389/409
    GENROSE      500    1776/1804    1136/1211    1106/1254    1133/1295
    MOREBV      1000    132/134      81/83        70/72        73/75
    PENALTY1    1000    59/71        59/71        59/71        59/71
    QUARTC      1000    55/61        87/93        139/145      212/218
    SINQUAD     1000    376/435      142/187      142/187      142/187

When m = 2, the search direction is a linear combination of two vectors, as it is for conjugate-gradient methods. We note that the choice m = 5 gives better results than m = 2 on most of the problems. The choice of m that minimizes the number of function evaluations varies: for several of the problems, for example BDQRTIC and EIGENALS, the fewest function evaluations are required for m = 15, while for others the smallest number is required for an intermediate value of m; for example, the smallest number of function evaluations for CRAGGLVY, DIXMAANE and GENROSE occurs for m = 5. The function evaluations might be expected to decrease for values of m greater than 15, especially as m is chosen closer to n. The variation in function evaluations for values of m ranging from 2 to n is given in Table 5.7 for four problems. The first three are the calculus-of-variations problems (see Section 3.4.1). Problem 23 is a minimum-energy problem (see Siegel [46]). For this table, the termination criterion is ‖g_k‖ < 10^{−4}|f(x_k)|. The maximum number of iterations is 15,000 and the notation "I" indicates that this limit was reached.
In this case, the number in parentheses gives the final ratio ‖gk‖/|f(xk)|.
Table 5.7: RHR-L-P (R4) for m ranging from 2 to n

    m      Problem 20     Problem 21    Problem 22     Problem 23
    2      I(.4E-1)       3898/3903     I(.6E-2)       269/281
    3      13496/13906    2545/2619     I(.1E-2)       210/223
    5      11765/11896    1920/1947     9027/9129      163/176
    7      I(.4E-3)       1792/1805     I(.5E-2)       181/194
    10     14496/14532    1992/1997     7927/7953      192/213
    15     13152/13167    2315/2321     14766/14789    228/266
    20     13672/13683    1712/1718     953/957        225/247
    30     11042/11064    1426/1431     1216/1218      254/281
    40     13775/13794    997/1002      744/746        311/340
    50     14170/14189    1002/1007     451/453        316/341
    70     13361/13373    510/516       313/316        296/320
    100    11419/11432    315/326       207/210        374/398
    150    5829/5302      808/821       190/193        437/461
    200    854/865        464/475       190/193        656/680
For all four of the problems there is a slight “dip” in the plot of function
evaluations versus m. This dip occurs for m = 5, m = 7, m = 5 and m = 5 for
the respective problems. For Problems 20–22, the number of function evaluations
decreases dramatically for values ofm closer to n. This is not the case for Problem
23 where the choice m = 5 results in the smallest number of function evaluations.
Algorithm RHR-L-P compared with L-BFGS-B
In this section, we compare Algorithm RHR-L-P (using rescaling option R4)
with L-BFGS-B. The L-BFGS-B method is run using the primal option (see Zhu et al. [49] for a description of the three L-BFGS-B options). The line search provided with L-BFGS-B is an implementation of the line search proposed by Moré and Thuente [31]. The line search parameters ν = 10^{−4} and η = .9 are used in L-BFGS-B as suggested by Zhu et al. (these are the same as those used for RHR-L-P). The termination criterion is ‖g_k‖_∞ < 10^{−5}.
The number of iterations and function evaluations required to solve the
problems in Set #1 and Set #2 is given in Tables 5.8–5.9. The notation “R” in
the tables indicates that L-BFGS-B did not meet the termination criterion, but that when the iterations were halted, a secondary termination criterion had been met. This criterion is given by (f(x_k) − f(x_{k+1}))/max(|f(x_k)|, |f(x_{k+1})|, 1) ≤ C ε_M, where C = 10^{−7} and ε_M is the machine precision.

From the tables, we see that in terms of function evaluations Algorithm RHR-L-P is comparable with L-BFGS-B. The number of problems on which RHR-L-P requires fewer function evaluations is relatively low. However, we should stress that the results are somewhat preliminary.
Table 5.8: Results for RHR-L-P and L-BFGS-B (m = 5) on Set #1

    Problem       n     Algorithm RHR-L-P    Algorithm L-BFGS-B
    ARWHEAD     1000    17/22                11/13
    BDQRTIC      100    148/163              86/101
    BROYDN7D    1000    367/372              362/373
    BRYBND      1000    30/36                29/31
    CRAGGLVY    1000    108/116              87/95
    DIXMAANA    1500    15/19                10/12
    DIXMAANB    1500    18/22                10/12
    DIXMAANC    1500    12/17                12/14
    DIXMAAND    1500    21/25                14/16
    DIXMAANE    1500    178/183              165/171
    DIXMAANF    1500    147/152              153/160
    DIXMAANG    1500    124/130              158/166
    DIXMAANH    1500    243/254              152/157
    DIXMAANI    1500    1352/1367            1170/1215
    DIXMAANK    1500    129/136              135/139
    DIXMAANL    1500    147/152              171/177
    DQDRTIC     1000    12/15                13/19
    DQRTIC       500    85/91                38/43
    EIGENALS     110    682/704              541/574
    EIGENBLS     110    329/343              1072/1116
    EIGENCLS     462    2808/2843            2795/2900
    ENGVAL1     1000    22/26                20/23
    FLETCBV2    1000    952/967              490/505
    FLETCBV3    1000    L(.3E-1)             R(.4E+1)
    FLETCHBV     100    L(.2E+5)             R(.5E+0)
    FLETCHCR     100    76/84                525/602
Table 5.9: Results for RHR-L-P and L-BFGS-B (m = 5) on Set #2

    Problem       n     Algorithm RHR-L-P    Algorithm L-BFGS-B
    FMINSURF    1024    229/233              198/208
    FREUROTH    1000    L(.8E-4)             R(.2E-4)
    GENROSE      500    1136/1211            1086/1244
    INDEF       1000    L(.1E+2)             R(.6E+1)
    LIARWHD     1000    20/26                22/27
    MANCINO      100    L(.2E-4)             11/15
    MOREBV      1000    81/82                74/79
    NONDIA      1000    9/21                 19/23
    NONDQUAR     100    794/848              907/1001
    PENALTY1    1000    59/71                50/60
    PENALTY2     100    125/133              69/74
    PENALTY3     100    L(.8E-2)             L(.3E-2)
    POWELLSG    1000    44/50                51/57
    POWER       1000    177/184              131/136
    QUARTC      1000    87/93                41/47
    SINQUAD     1000    142/187              150/207
    SROSENBR    1000    18/24                17/20
    TOINTGOR      50    157/161              122/134
    TOINTGSS    1000    5/8                  14/20
    TOINTPSP      50    133/153              105/129
    TOINTQOR      50    43/45                38/42
    TQUARTIC    1000    20/26                21/27
    TRIDIA      1000    903/924              675/705
    VARDIM       100    36/44                36/37
    VAREIGVL    1000    90/95                122/130
    WOODS       1000    45/51                28/31
5.4 Algorithm RHR-L-P applied to quadratics
We now show that Algorithm RHR-L-P has the quadratic termination property.
For use in the proof of quadratic termination, we define

    γ_k = (s_k^T y_k)^{1/2}.

Note that this definition of γ_k differs from that given in (3.7).
Theorem 5.3 Consider Algorithm RHR-L-P implemented with an exact line search and σ_0 = 1. If this algorithm is applied to the strictly convex quadratic function (1.17), then R_Z is upper bidiagonal. At the start of iteration k (0 ≤ k ≤ m − 1), R_Z satisfies

    R_Z = [ ‖g_0‖/γ_0   −‖g_1‖/γ_0
                         ‖g_1‖/γ_1   −‖g_2‖/γ_1
                                      ⋱                  ⋱
                                         ‖g_{k−1}‖/γ_{k−1}   −‖g_k‖/γ_{k−1}
                                                              σ_k^{1/2}      ].

At the start of iteration k (k ≥ m), R_Z satisfies

    R_Z = [ −‖g_l‖²/(σ_l‖p_l‖γ_l)   −‖g_{l+1}‖/γ_l
                                     ‖g_{l+1}‖/γ_{l+1}   −‖g_{l+2}‖/γ_{l+1}
                                                          ⋱                  ⋱
                                                             ‖g_{k−1}‖/γ_{k−1}   −‖g_k‖/γ_{k−1}
                                                                                  σ_k^{1/2}      ],

where l = k − m + 1. The matrices Z_k and T_k satisfy

    Z_k = ( g_0/‖g_0‖   g_1/‖g_1‖   · · ·   g_k/‖g_k‖ )

and
    T_k = [ −‖g_0‖   −‖g_1‖²/(σ_1‖g_0‖)   −‖g_2‖²/(σ_2‖g_0‖)   · · ·   −‖g_{k−1}‖²/(σ_{k−1}‖g_0‖)      0
                     −‖g_1‖/σ_1           −‖g_2‖²/(σ_2‖g_1‖)   · · ·   −‖g_{k−1}‖²/(σ_{k−1}‖g_1‖)      0
                                          −‖g_2‖/σ_2                   ⋮                               ⋮
                                                                ⋱      −‖g_{k−1}‖²/(σ_{k−1}‖g_{k−2}‖)  0
                                                                       −‖g_{k−1}‖/σ_{k−1}              0
                                                                                                       ‖g_k‖ ]

at the start of iteration k (0 ≤ k ≤ m − 1). At the start of iteration k (k ≥ m), these two matrices satisfy

    Z_k = ( p_l/‖p_l‖   g_{l+1}/‖g_{l+1}‖   g_{l+2}/‖g_{l+2}‖   · · ·   g_k/‖g_k‖ )

and

    T_k = C_k [ ‖p_l‖   (‖g_{l+1}‖²/‖g_l‖²)‖p_l‖   (‖g_{l+2}‖²/‖g_l‖²)‖p_l‖   · · ·   (‖g_{k−1}‖²/‖g_l‖²)‖p_l‖   0
                        −‖g_{l+1}‖                 −‖g_{l+2}‖²/‖g_{l+1}‖      · · ·   −‖g_{k−1}‖²/‖g_{l+1}‖      0
                                                   −‖g_{l+2}‖                         ⋮                          ⋮
                                                                               ⋱      −‖g_{k−1}‖²/‖g_{k−2}‖      0
                                                                                      −‖g_{k−1}‖                 0
                                                                                                                 ‖g_k‖ ] D_k,

where C_k = diag(σ_l, I_{m−1}) and D_k = diag(σ_l^{−1}, σ_{l+1}^{−1}, . . . , σ_{k−1}^{−1}, 1). Furthermore, the search direction is given by equation (3.15) of Theorem 3.1, i.e.,

    p_k = { −g_k,                                                  if k = 0;
          { (1/σ_k)( σ_{k−1}(‖g_k‖²/‖g_{k−1}‖²) p_{k−1} − g_k ),   otherwise.        (5.12)
Proof. The form of Z, R_Z and the search directions is already established for iterations k (0 ≤ k ≤ m − 1) by Theorem 3.1, since Algorithm RHR-L-P and Algorithm RHRL are equivalent for the first m − 1 iterations. Since the first m − 1 search directions are parallel to the conjugate-gradient directions, the first m gradients are mutually orthogonal and accepted. Moreover, the rank is r_k = k + 1. The search directions p_0, p_1, . . ., p_{m−2} replace the corresponding gradients during iterations k (0 ≤ k ≤ m − 2). The form of T_k at the start of iterations k (0 ≤ k ≤ m − 1) follows from the form of p_Z (3.16, p. 59).
The value of m is assumed to be 3 for the remainder of the argument. The key ideas of the proof are illustrated using this value. At the start of iteration m − 1, the forms of Z_k, R_Z and T_k are given by

    Z_k = ( g_0/‖g_0‖   g_1/‖g_1‖   g_2/‖g_2‖ ),

    R_Z = [ ‖g_0‖/γ_0   −‖g_1‖/γ_0   0
            0            ‖g_1‖/γ_1   −‖g_2‖/γ_1
            0            0            σ_2^{1/2}  ]

and

    T_k = [ −‖g_0‖   −‖g_1‖²/(σ_1‖g_0‖)   0
            0        −‖g_1‖/σ_1           0
            0        0                    ‖g_2‖ ].

The rank satisfies r_2 = 3. The form of p_2 is given by (5.12), since the algorithm is identical to Algorithm RHRL until the end of iteration m − 1. Since g_2 has been accepted, g_2 is replaced by p_2. The form of p_Z given by (3.16) implies that T_k satisfies

    T_k = [ −‖g_0‖   −‖g_1‖²/(σ_1‖g_0‖)   −‖g_2‖²/(σ_2‖g_0‖)
            0        −‖g_1‖/σ_1           −‖g_2‖²/(σ_2‖g_1‖)
            0        0                    −‖g_2‖/σ_2         ].
The gradient g_3 is orthogonal to g_0, g_1 and g_2 and is accepted. The rank satisfies r̄_2 = 4 and the updates to Z_k, R_Z and T_k satisfy

    Z̄_k = ( Z_k   g_3/‖g_3‖ ),   T̄_k = [ T_k   0
                                           0     ‖g_3‖ ]

and

    R̄_Z = [ ‖g_0‖/γ_0   −‖g_1‖/γ_0   0            0
             0            ‖g_1‖/γ_1   −‖g_2‖/γ_1   0
             0            0            ‖g_2‖/γ_2   −‖g_3‖/γ_2
             0            0            0            σ_3^{1/2}  ],

respectively.
As r̄_2 > m, the algorithm performs the first discard procedure. Since ‖p_1‖ is given by the norm of the second column of T̄_k, the rotation P_12 satisfies

    P_12 = [ −‖g_1‖²/(σ_1‖g_0‖‖p_1‖)   −‖g_1‖/(σ_1‖p_1‖)         0   0
             −‖g_1‖/(σ_1‖p_1‖)          ‖g_1‖²/(σ_1‖g_0‖‖p_1‖)   0   0
             0                          0                        1   0
             0                          0                        0   1 ].

In the remainder of the proof, we will use the symbol "×" to denote a value that will be discarded. A short computation gives

    P_12 T̄_k = [ ×   ‖p_1‖   σ_1‖g_2‖²‖p_1‖/(σ_2‖g_1‖²)   0
                 ×   0       0                             0
                 0   0       −‖g_2‖/σ_2                    0
                 0   0       0                             ‖g_3‖ ].

Since the second row of P_12 T̄_k is a multiple of e_1^T, P_23 and P_34 are permutations. Hence, T̃_k satisfies

    T̃_k = [ ‖p_1‖   σ_1‖g_2‖²‖p_1‖/(σ_2‖g_1‖²)   0
             0       −‖g_2‖/σ_2                   0
             0       0                            ‖g_3‖ ].
The matrix S_k is defined by S_k = P_34 P_23 P_12, and it is easily verified that

    Z̃_k = Z_k S_k^T E_3 = ( p_1/‖p_1‖   g_2/‖g_2‖   g_3/‖g_3‖ ).

Another short computation gives

    R̄_Z P_12 = [ 0                        ×   0            0
                 −‖g_1‖²/(σ_1 γ_1‖p_1‖)   ×   −‖g_2‖/γ_1   0
                 0                        0    ‖g_2‖/γ_2   −‖g_3‖/γ_2
                 0                        0    0            σ_3^{1/2}  ].
Hence, the rotation P̄_12 is a permutation, as are P̄_23 and P̄_34. The matrix S̄_k is defined by S̄_k = P̄_34 P̄_23 P̄_12. The update R_Z̃, given by the leading 3×3 block of S̄_k R̄_Z S_k^T, satisfies

    R_Z̃ = [ −‖g_1‖²/(σ_1 γ_1‖p_1‖)   −‖g_2‖/γ_1   0
             0                         ‖g_2‖/γ_2   −‖g_3‖/γ_2
             0                         0            σ_3^{1/2}  ].

Following the drop-off procedure, the rank satisfies r_3 = 3.
We have shown that Z_k, T_k and R_Z have the required structure at the start of iteration m. Now assume that they have this structure at the start of iteration k (k ≥ m). Moreover, assume that the first k search directions satisfy (5.12). Hence,

    Z_k = ( p_{k−2}/‖p_{k−2}‖   g_{k−1}/‖g_{k−1}‖   g_k/‖g_k‖ ),

    T_k = [ ‖p_{k−2}‖   (σ_{k−2}/σ_{k−1})(‖g_{k−1}‖²/‖g_{k−2}‖²)‖p_{k−2}‖   0
            0           −‖g_{k−1}‖/σ_{k−1}                                  0
            0           0                                                   ‖g_k‖ ]

and

    R_Z = [ −‖g_{k−2}‖²/(σ_{k−2} γ_{k−2}‖p_{k−2}‖)   −‖g_{k−1}‖/γ_{k−2}   0
            0                                          ‖g_{k−1}‖/γ_{k−1}   −‖g_k‖/γ_{k−1}
            0                                          0                    σ_k^{1/2}      ].

The rank satisfies r_k = 3.
Since g_Z = (0, 0, ‖g_k‖)^T, the equations R_Z^T t_Z = −g_Z and R_Z p_Z = t_Z give

    t_Z = ( 0,  0,  −‖g_k‖/σ_k^{1/2} )^T

and

    p_Z = (1/σ_k) ( σ_{k−2}‖g_k‖²‖p_{k−2}‖/‖g_{k−2}‖²,   −‖g_k‖²/‖g_{k−1}‖,   −‖g_k‖ )^T.

Using the form of Z_k and the form of p_{k−1} given by (5.12), p_k satisfies

    p_k = (1/σ_k)( σ_{k−1}(‖g_k‖²/‖g_{k−1}‖²) p_{k−1} − g_k ),

as required. Since g_k has been accepted, p_k is exchanged for g_k, giving

    T_k = [ ‖p_{k−2}‖   (σ_{k−2}/σ_{k−1})(‖g_{k−1}‖²/‖g_{k−2}‖²)‖p_{k−2}‖   (σ_{k−2}/σ_k)(‖g_k‖²/‖g_{k−2}‖²)‖p_{k−2}‖
            0           −‖g_{k−1}‖/σ_{k−1}                                  −(1/σ_k)(‖g_k‖²/‖g_{k−1}‖)
            0           0                                                   −‖g_k‖/σ_k                              ].
Since p_k is parallel to the (k+1)th conjugate-gradient direction, g_{k+1} is orthogonal to the previous gradients (and to p_{k−2}) and is accepted. The updated rank satisfies r̄_k = 4. The corresponding updates satisfy

    Z̄_k = ( Z_k   g_{k+1}/‖g_{k+1}‖ ),   T̄_k = [ T_k   0
                                                   0     ‖g_{k+1}‖ ]   and   R̄_Z = [ R_Z   0
                                                                                      0     σ_k^{1/2} ].

The matrix R̄_Z is obtained from R_Z in the manner described in Theorem 3.1. The rescaled matrix R̄_Z satisfies

    R̄_Z = [ −‖g_{k−2}‖²/(σ_{k−2} γ_{k−2}‖p_{k−2}‖)   −‖g_{k−1}‖/γ_{k−2}   0                0
             0                                          ‖g_{k−1}‖/γ_{k−1}   −‖g_k‖/γ_{k−1}   0
             0                                          0                    ‖g_k‖/γ_k        −‖g_{k+1}‖/γ_k
             0                                          0                    0                 σ_{k+1}^{1/2}  ].

Since r̄_k > m, the algorithm executes the drop-off procedure. The remainder of the proof is similar to that given above for the drop-off at the end of iteration m − 1 and is omitted.
Chapter 6
Reduced-Hessian Methods for Linearly-Constrained Problems
6.1 Linearly constrained optimization
In this section we consider the linear equality constrained problem (LEP)

    minimize_{x ∈ IR^n}   f(x)
    subject to            Ax = b,                                       (6.1)

where rank(A) = m_L and A ∈ IR^{m_L×n}. The assumption of full rank is only included to simplify the discussion; the methods proposed here do not require this assumption.
A point x ∈ IR^n satisfying Ax = b is said to be feasible. Since A has full row rank, the existence of feasible points is guaranteed. (If A were rank deficient, then the existence of feasible points would require that b ∈ range(A).) Standard methods can be used to determine a particular feasible point (e.g., see Gill et al. [23, p. 316]).
Two first-order necessary conditions for optimality hold at a minimizer x*. The first is that x* be feasible. The second requires the existence of λ* ∈ IR^{m_L} such that

    A^T λ* = ∇f(x*).                                                    (6.2)
The components of λ∗ are often called Lagrange multipliers. If nL denotes n−mL,
let N ∈ IRn×nL denote a full-rank matrix whose columns form a basis for null(A).
Condition (6.2) is equivalent to the condition
NT∇f(x∗) = 0 (6.3)
(see Gill et al. [22, pp. 69–70]). The quantity NT∇f(x) is often called a reduced
gradient. In methods for solving (6.1) that utilize a representation for null(A),
equation (6.3) gives a simple method for verifying first-order optimality.
The second-order necessary conditions for optimality at x∗ are that
Ax∗ = b, NT∇f(x∗) = 0 and that the reduced Hessian NT∇2f(x∗)N is posi-
tive semi-definite. Sufficient conditions for optimality at a point x∗ are that the
first-order conditions hold and that NT∇2f(x∗)N is positive definite.
In what follows, we will assume that an initial feasible iterate x0 is
known. Since the constraints for LEP are linear, it is simple to enforce feasibility
of all the iterates. Let x denote a feasible iterate satisfying Ax = b and let x̄ = x + αp. If

    p = N p_N,   where p_N ∈ IR^{n_L},                                  (6.4)

then x̄ is feasible for all α. Furthermore, it is easily shown that x̄ is feasible only if p = N p_N for some p_N ∈ IR^{n_L}. To see this, note that any given direction p can be written as p = N p_N + A^T p_R. Since x is feasible, A x̄ = A(x + αp) = Ax + αAp = b + αAp, and so x̄ is feasible only if Ap = 0. It follows that p ∈ null(A), i.e., p = N p_N for some p_N. A vector p satisfying (6.4) is called a feasible direction. Because of the feasibility requirement, the subproblem (1.7) used to define p in the unconstrained case is replaced by

    minimize_{p ∈ IR^n}   f(x_k) + g^T p + (1/2) p^T B p
    subject to            Ap = 0.                                       (6.5)

This subproblem is an equality-constrained quadratic program (EQP). If B is positive definite, then the solution of the EQP is given by

    p = N p_N,   where p_N = −(N^T B N)^{-1} N^T g,                     (6.6)

which is of the form (6.4).
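As an illustration, the null-space solution (6.6) can be computed as follows. This is a sketch only: the helper names are ours, and N is obtained here from an SVD-based null-space routine rather than the QR-based constructions discussed later in the chapter.

    import numpy as np
    from scipy.linalg import null_space, solve

    def eqp_direction(A, B, g):
        """Solve the EQP (6.5) via the null-space formula (6.6)."""
        N = null_space(A)                  # orthonormal basis for null(A)
        pN = solve(N.T @ B @ N, -(N.T @ g), assume_a='pos')
        return N @ pN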
For the remainder of the chapter, the columns of N are assumed to be orthonormal. Let Q denote an orthogonal matrix of the form Q = ( N  Y ), where Y ∈ IR^{n×m_L}. Consider the transformed Hessian

    Q^T B Q = [ N^T B N   N^T B Y
                Y^T B N   Y^T B Y ].

If all of the search directions satisfy the EQP (6.5), only the reduced Hessian N^T B N is needed, i.e., no information about the transformed Hessian corresponding to Y is required. Hence, we consider quasi-Newton methods for solving LEP that store only N^T B N.
The question naturally arises as to whether the reduced Hessian can be updated (e.g., using the BFGS update) without knowledge of the entire transformed Hessian. In the unconstrained case, we have seen that B can be block-diagonalized using a certain choice of Q. The Broyden update to the transformed Hessian is completely defined by the corresponding update to the (possibly much smaller) reduced Hessian. In the linearly constrained case, Q^T B Q is generally dense when N is chosen only so that range(N) = null(A). It is not possible in this case to define the updated matrix Q^T B̄ Q (corresponding to a fixed Q) if only N^T B N is known. However, it can be shown that N^T B̄ N can be obtained from N^T B N without knowledge of N^T B Y or Y^T B Y. The update to N^T B N is obtained by way of an update from the Broyden class using the reduced quantities s_N = N^T s and y_N = N^T y in place of s and y, respectively.
Based on the above discussion, a method for solving LEP is presented
below. The matrix RN is the Cholesky factor of NTBN and is used to solve for
pN according to (6.6).
Algorithm 6.1. Quasi-Newton method for LEP
Initialize k = 0; Obtain x_0 such that Ax_0 = b;
Initialize N so that range(N) = null(A) and N^TN = I_{n_L};
Initialize R_N = σ^{1/2}I_{n_L};
while not converged do
   Solve R_N^Tt_N = −g_N, R_Np_N = t_N, and set p = Np_N;
   Compute α so that s^Ty > 0 and set x̄ = x + αp;
   Compute s_N = αp_N and y_N = ḡ_N − g_N;
   Compute R̄_N = Broyden(R_N, s_N, y_N);
end do
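The loop structure of Algorithm 6.1 can be made concrete with a short sketch.
The version below is a simplified rendering under stated assumptions: it updates
the reduced Hessian H_N = N^TBN directly with the BFGS formula instead of
updating the Cholesky factor R_N, and it borrows SciPy's Wolfe line search in
place of the Fletcher line search used in the implementation of Section 6.3; the
fallback step length of one when the search fails is also an illustrative choice.

    import numpy as np
    from scipy.linalg import null_space, cho_factor, cho_solve
    from scipy.optimize import line_search

    def qn_lep(f, grad_f, A, x0, sigma=1.0, tol=1e-6, max_iter=200):
        N = null_space(A)                  # fixed orthonormal basis for null(A)
        n_L = N.shape[1]
        H_N = sigma * np.eye(n_L)          # reduced Hessian N^T B N = sigma*I
        x = x0
        g_N = N.T @ grad_f(x)
        for _ in range(max_iter):
            if np.linalg.norm(g_N) <= tol:
                break
            # Feasible search direction p = N p_N with H_N p_N = -g_N.
            p_N = cho_solve(cho_factor(H_N), -g_N)
            p = N @ p_N
            # A Wolfe line search keeps s^T y > 0, preserving definiteness.
            alpha = line_search(f, grad_f, x, p)[0] or 1.0
            x = x + alpha * p
            gbar_N = N.T @ grad_f(x)
            s_N, y_N = alpha * p_N, gbar_N - g_N
            # BFGS update of the reduced Hessian (equivalent to updating R_N).
            Hs = H_N @ s_N
            H_N = H_N - np.outer(Hs, Hs) / (s_N @ Hs) \
                      + np.outer(y_N, y_N) / (y_N @ s_N)
            g_N = gbar_N
        return x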
Note that N is fixed in Algorithm 6.1. The matrix N can be obtained
in several ways. For example, N can be obtained from a QR factorization of A.
Alternatively, if B denotes a nonsingular matrix formed from columns of A, then
N can be obtained by applying a Gram-Schmidt QR factorization to the columns
of the associated variable-reduction form of a basis for null(A) (see Murtagh and
Saunders [32]). This factorization can be computed stably using either reorthog-
onalization or modified Gram-Schmidt (see Golub and Van Loan [26, pp. 218–220]).
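For instance, assuming A has full row rank, a complete QR factorization of A^T
delivers an orthonormal null-space basis directly; this is one concrete reading of
the QR route mentioned above, not the variable-reduction construction of
Murtagh and Saunders.

    import numpy as np

    def nullspace_basis_qr(A):
        # Complete QR of A^T: the first m_L columns of Q span range(A^T); the
        # remaining n - m_L columns form an orthonormal basis for null(A).
        m_L, n = A.shape
        Q, _ = np.linalg.qr(A.T, mode='complete')
        N = Q[:, m_L:]
        return N          # satisfies A N = 0 and N^T N = I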
6.2 A dynamic null-space method for LEP
Many quasi-Newton methods for solving LEP utilize a fixed representation, N,
for null(A). In this section, a new method is given for solving LEP that employs
a dynamic choice of N . Since N is a matrix with orthonormal columns, the
method is only practical for small problems or when the number of constraints
is close to n. In the case when the number of constraints is small, an alternative
range-space method can be used in conjunction with the techniques for large-
scale optimization discussed in Chapter 5. This method is the subject of current
research and will not be discussed further.
The freedom to vary N stems from the invariance of the search direction
(6.6) with respect to N as long as the columns of N form a basis for null(A). To
see this, suppose that N̄ is another matrix with orthonormal columns such that
range(N̄) = range(N). In this case, there exists an orthogonal matrix M such
that N = N̄M. The search direction p = −N(N^TBN)^{−1}N^Tg satisfies

    p = −N̄M(M^TN̄^TBN̄M)^{−1}M^TN̄^Tg = −N̄(N̄^TBN̄)^{−1}N̄^Tg
in terms of N . We now consider a choice for N that will induce special structure
in the reduced Hessian.
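The invariance argument just given is easy to confirm numerically. The following
self-contained check (an illustrative sketch with randomly generated data) builds
two orthonormal bases for null(A) related by an orthogonal M and compares the
resulting search directions.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m_L = 8, 3
    A = rng.standard_normal((m_L, n))
    C = rng.standard_normal((n, n))
    B = C @ C.T + n * np.eye(n)                    # positive-definite B
    g = rng.standard_normal(n)

    N = np.linalg.qr(A.T, mode='complete')[0][:, m_L:]   # orthonormal basis
    M = np.linalg.qr(rng.standard_normal((n - m_L, n - m_L)))[0]
    N2 = N @ M                                     # rotated basis, same range

    p1 = -N @ np.linalg.solve(N.T @ B @ N, N.T @ g)
    p2 = -N2 @ np.linalg.solve(N2.T @ B @ N2, N2.T @ g)
    assert np.allclose(p1, p2)                     # direction is unchanged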
In the unconstrained case, the quasi-Newton search direction satisfies
p = Zp_Z, where range(Z) = span{g_0, g_1, . . . , g_k}. If Z has orthonormal columns,
then the orthogonal matrix Q = ( Z W ) induces block-diagonal structure
in the transformed Hessian Q^TBQ. In Algorithm RHR (p. 52), this structure
facilitates a rescaling scheme that reinitializes the approximate curvature at each
iteration in which g is accepted.
In the linearly-constrained case, if Q = ( N Y ), the transformed
Hessian Q^TBQ generally has no special structure. In actuality, we are concerned
with the structure of N^TBN since the transformed Hessian corresponding to Y
is not used to define p. Let g_{N_i} denote the reduced gradient N^Tg_i (0 ≤ i ≤ k).
Note that Ng_{N_i} is the component of g_i in null(A). Let Z denote a matrix of
orthonormal columns satisfying

    range(Z) = span{Ng_{N_0}, Ng_{N_1}, . . . , Ng_{N_k}}.    (6.7)

Let r denote the column dimension of Z and note that r ≤ n_L since range(Z) ⊆
range(N). There exists an n_L×r matrix M_1 with orthonormal columns such that
Z = NM_1. Let M = ( M_1 M_2 ) denote an orthogonal matrix with M_1 as its first
block. If N̄ is defined as N̄ = NM, then range(N̄) = range(N). Let W = NM_2
and note that the partition of M implies that

    N̄ = NM = ( NM_1 NM_2 ) = ( Z W ).
By construction of Z and W, it follows that if w ∈ range(W), then w^Tg_i = 0 for
0 ≤ i ≤ k. Hence if B_0 = σI, Lemma 2.3 implies that

    N̄^TBN̄ = ( Z^TBZ   0          )        and    N̄^Tg = ( g_Z )
             ( 0       σI_{n_L−r} )                       ( 0  ),    (6.8)

where g_Z = Z^Tg.
We have defined N̄ so that it induces block-diagonal structure in N̄^TBN̄.
If the quantities (6.8) are substituted into the equation for p (6.6), it is easy to
see that p = −Z(Z^TBZ)^{−1}g_Z. This search direction can be computed using a
Cholesky factor R_Z of Z^TBZ.
Comparison with the unconstrained case
The choice of Z and W discussed here has both similarities to and differences from
the definitions of Z and W in the unconstrained case. The subspace range(Z) is
associated with gradient information in both cases, although in the linearly con-
strained case Z defines gradient information in null(A). Elements from range(W )
are orthogonal to the first k + 1 gradients in both cases. Also in both cases, the
column dimension of Z is nondecreasing while that of W is nonincreasing. In
the unconstrained case, the column dimension of Z can become as large as n,
but in the linearly-constrained case, the column dimension can only reach nL.
Finally, the approximate curvature in the quadratic model along unit directions
in range(W ) is equal to σ in both cases.
Maintaining Z and W
At the start of iteration k, suppose that N is such that N = ( Z W ), where Z
satisfies (6.7). The component of g in null(A) is given by

    NN^Tg = ( Z W ) ( g_Z )
                    ( g_W ),    where g_Z = Z^Tg and g_W = W^Tg.
If g_W = 0, then NN^Tg ∈ range(Z) and the matrices Z and W remain fixed. If
g_W ≠ 0, then g has a nonzero component in range(W). In this case, updates
to both Z and W are required. The purpose of the updates is to ensure that Z̄
satisfies (6.7) postdated one iteration. More specifically, the updates are defined
so that if N̄ = ( Z̄ W̄ ), then

    range(N̄) = range(N),    range(Z̄) = range(Z) + span{NN^Tg}

and N̄^TN̄ = I_{n_L}. In the implementation, we preassign a positive constant δ
and require that ‖g_W‖ > δ(‖g_Z‖² + ‖g_W‖²)^{1/2} for an update to be performed.
The updates to Z and W are similar to those defined in Section 2.5.1.
Recall that symmetric Givens matrices can be used to define an orthogonal matrix
S such that Sg_W = ‖g_W‖e_1. If N̄ is defined by N̄ = ( Z WS^T ), then range(N̄) =
range(N) and N̄^TN̄ = I_{n_L}. Moreover, the first column of WS^T satisfies WS^Te_1 =
Wg_W/‖g_W‖, which is the normalized component of g in range(W). Hence, we
may define Z̄ = ( Z WS^Te_1 ) and W̄ = ( WS^Te_2 WS^Te_3 · · · WS^Te_{n_L−r} ).
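A sketch of this basis update follows. The thesis builds S from a sweep of
symmetric Givens matrices; for brevity this illustration uses a single Householder
reflector, which is also symmetric and orthogonal and likewise maps g_W to
‖g_W‖e_1 (the degenerate cases where g_W vanishes or is already a positive
multiple of e_1 are assumed away, since no update is performed then).

    import numpy as np

    def update_basis(Z, W, g):
        # Reflector v with (I - 2 v v^T / v^T v) g_W = ||g_W|| e_1.
        g_W = W.T @ g
        v = g_W.copy()
        v[0] -= np.linalg.norm(g_W)
        S = np.eye(len(g_W)) - 2.0 * np.outer(v, v) / (v @ v)
        WS = W @ S.T                  # same range as W, columns still orthonormal
        Z_bar = np.hstack([Z, WS[:, :1]])   # absorb W g_W / ||g_W|| into Z
        W_bar = WS[:, 1:]
        return Z_bar, W_bar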
When the null-space basis is changed from N to N̄, some related quan-
tities must also be altered. The first is the Cholesky factor R_Z. In Section 2.4
(p. 29), we define the effective approximate Hessian associated with a reduced-
Hessian method. In the linearly constrained case, we define an effective reduced
Hessian since the approximate Hessian corresponding to Y is not needed. The
effective reduced Hessian is defined by

    N^TB_δN = ( R_Z^TR_Z   0          )
              ( 0          σI_{n_L−r} ).

If N̄ = N diag(I_r, S^T), the reduced Hessian corresponding to N̄ satisfies

    N̄^TB_δN̄ = ( I_r  0 ) ( R_Z^TR_Z   0          ) ( I_r  0   )
               ( 0    S ) ( 0          σI_{n_L−r} ) ( 0    S^T )  =  N^TB_δN.

It follows that the Cholesky factor of Z̄^TB_δZ̄ is given by R_Z̄ = diag(R_Z, σ^{1/2}).
This definition of R_Z̄ is identical to that used in the unconstrained case.
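In code, the expansion of the factor when Z gains a column is a simple bordering
operation; a minimal sketch, assuming R_Z is stored as a dense upper-triangular
array:

    import numpy as np

    def expand_cholesky(R_Z, sigma):
        # Grow R_Z to diag(R_Z, sigma^(1/2)): the new last row and column are
        # zero except for sigma^(1/2) on the diagonal.
        r = R_Z.shape[0]
        R = np.zeros((r + 1, r + 1))
        R[:r, :r] = R_Z
        R[r, r] = np.sqrt(sigma)
        return R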
The other quantities corresponding to Z̄ are ḡ_Z̄, g_Z̄ and s_Z̄. We use
an approximation to g_Z̄ (as in the unconstrained case) given by g^δ_Z̄ = (g_Z, 0)^T.
The vector ḡ_Z̄ satisfies

    ḡ_Z̄ = Z̄^Tḡ = ( Z  WS^Te_1 )^Tḡ = ( ḡ_Z   )
                                        ( ‖ḡ_W‖ ).

Since p ∈ range(Z), s_Z̄ = α(p_Z, 0)^T. As in the unconstrained case, the vector
y^δ_Z̄ = ḡ_Z̄ − g^δ_Z̄ is used in the Broyden update.
The following algorithm solves LEP using the choice of Z described
above.
Algorithm 6.2. Reduced Hessian method for LEP (RH-LEP)
Initialize k = 0, r = 1; Choose σ and δ;
Compute x_0 satisfying Ax_0 = b;
Compute N satisfying range(N) = null(A) and N^TN = I_{n_L};
Initialize R_Z = σ^{1/2};
Rotate NN^Tg_0 into the first column of N and partition N = ( Z W ), where
Z ∈ IR^{n×1};
while not converged do
   Solve R_Z^Tt_Z = −g_Z, R_Zp_Z = t_Z, and set p = Zp_Z;
   Compute α so that s^Ty > 0 and set x̄ = x + αp;
   if ‖ḡ_W‖ > δ(‖ḡ_Z‖² + ‖ḡ_W‖²)^{1/2} then
      Update Z and W as described in this section;
      Set r̄ = r + 1;
   else Set r̄ = r; end if
   Compute s_Z̄ and y^δ_Z̄;
   Define R_Z̄ by

       R_Z̄ =  R_Z,                     if r̄ = r;
              ( R_Z  0       )
              ( 0    σ^{1/2} ),        otherwise;        (6.9)

   Compute R̄_Z̄ = Broyden(R_Z̄, s_Z̄, y^δ_Z̄);
end do
Rescaling R_Z̄

If r̄ = r + 1, then the value σ^{1/2} in the (r̄, r̄) position of R_Z̄ is unaltered by the
BFGS update. Hence, the last diagonal component of R̄_Z̄ can be reinitialized ex-
actly as in Algorithm RHR. This leads to the definition of the rescaling algorithm
for LEP given below. All steps are the same as in Algorithm RH-LEP except the
last three.
Algorithm 6.3. Reduced Hessian rescaling method for LEP (RHR-LEP)
Compute R̄_Z̄ = BFGS(R_Z̄, s_Z̄, y^δ_Z̄);
Compute σ̄;
if r̄ = r + 1 and σ̄ < σ then
   Replace the (r̄, r̄) component of R̄_Z̄ with σ̄^{1/2};
end if
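The rescaling step is tiny in code. A sketch under the assumptions above (the
factor stored dense and upper triangular; sigma_bar the freshly computed scale;
the square root reflecting that a diagonal entry σ̄^{1/2} of a Cholesky factor
corresponds to curvature σ̄):

    import numpy as np

    def rescale_last_diagonal(R, sigma_bar, sigma, basis_grew):
        # Reinitialize the trailing diagonal only if Z gained a column this
        # iteration and the new scale estimate is smaller than the old one.
        if basis_grew and sigma_bar < sigma:
            R[-1, -1] = np.sqrt(sigma_bar)
        return R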
6.3 Numerical results
The test problems correspond to seven of the eighteen problems listed in Table
3.2 (p. 54). The constraints are randomly generated. The starting point x_0 is the
closest point to the starting point given by Moré et al. [29] that satisfies Ax_0 = b.
More specifically, x_0 is the solution to

    minimize_{x∈IR^n}   ½‖x − x_MGH‖²
    subject to          Ax = b,                          (6.10)

where x_MGH is the starting point suggested by Moré et al.
Numerical results given in Tables 6.1 and 6.2 compare Algorithm 6.1
with Algorithm RHR-LEP. Algorithm RHR-LEP is tested using the rescaling
techniques R0 (no rescaling), R1, R4 and R5 (see Table 3.1, p. 53). Table 6.1
gives results for a random set of five linear constraints. The second table gives
results for m_L = 8.
The algorithm is implemented in Matlab on a DEC 5000/240
workstation using the line search given by Fletcher [15, pp. 33–39]. The line
search ensures that α meets the modified Wolfe conditions (1.16) and uses the step
Table 6.1: Results for LEPs (m_L = 5, δ = 10^{−10}, ‖N^Tg‖ ≤ 10^{−6})

Problem          Alg. 6.1                Algorithm 6.3
No.    n         σ = 1        R0         R1         R4         R5
6      16        26/30        24/28      24/28      24/28      24/28
7      12        39/50        39/50      75/79      67/70      54/57
8      16        54/72        49/66      113/147    105/138    99/132
9      16        33/49        33/49      51/55      28/32      27/31
13     20        20/37        20/37      11/14      10/13      9/12
14     14        45/79        45/79      53/61      48/54      48/56
15     16        41/74        41/74      66/71      47/54      38/44
Table 6.2: Results for LEPs (m_L = 8, δ = 10^{−10}, ‖N^Tg‖ ≤ 10^{−6})

Problem          Alg. 6.1                 Algorithm 6.3
No.    n         σ = 1        R0          R1         R4         R5
6      16        26/30        24/28       24/28      24/28      24/28
7      12        15/21        15/21       22/25      22/25      21/24
8      16        17/23        16/21       18/23      18/23      18/23
9      16        16/46        16/46       20/25      16/21      16/21
13     20        17/40        17/40       12/15      11/14      12/16
14     14        18/40        L(1.2E-6)   25/30      19/24      L(1.1E-6)
15     16        19/41        19/41       30/35      23/28      19/25
length of one whenever it satisfies these conditions. The step-length parameters
are ν = 10^{−4} and η = 0.9. The implementation uses δ = 10^{−10} with the stopping
criterion ‖N^Tg‖ < 10^{−6}. The numbers of iterations and function evaluations
required to achieve the stopping criterion are given for each run. For example,
the notation “26/30” indicates that 26 iterations and 30 function evaluations are
required for convergence. The notation “L” indicates termination during the line
search. In this case, the value in parentheses gives the final norm of the reduced
gradient.
Bibliography
[1] I. Bongartz, A. R. Conn, N. I. M. Gould, and P. L. Toint, CUTE: Constrained and unconstrained testing environment, Report 93/10, Département de Mathématique, Facultés Universitaires de Namur, 1993.

[2] K. W. Brodlie, An assessment of two approaches to variable metric methods, Math. Prog., 12 (1977), pp. 344–355.

[3] K. W. Brodlie, A. R. Gourlay, and J. Greenstadt, Rank-one and rank-two corrections to positive definite matrices expressed in product form, Journal of the Institute of Mathematics and its Applications, 11 (1973), pp. 73–82.

[4] A. G. Buckley, A combined conjugate-gradient quasi-Newton minimization algorithm, Math. Prog., 15 (1978), pp. 200–210.

[5] ———, Extending the relationship between the conjugate-gradient and BFGS algorithms, Math. Prog., 15 (1978), pp. 343–348.

[6] R. H. Byrd, J. Nocedal, and Y.-X. Yuan, Global convergence of a class of quasi-Newton methods on convex problems, SIAM J. Numer. Anal., 24 (1987), pp. 1171–1190.

[7] M. Contreras and R. A. Tapia, Sizing the BFGS and DFP updates: Numerical study, J. Optim. Theory and Applics., 78 (1993), pp. 93–108.

[8] J. W. Daniel, W. B. Gragg, L. Kaufman, and G. W. Stewart, Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization, Math. Comput., 30 (1976), pp. 772–795.

[9] W. C. Davidon, Variable metric method for minimization, A.E.C. Research and Development Report ANL-5990, Argonne National Laboratory, 1959.

[10] J. E. Dennis, Jr. and R. B. Schnabel, A new derivation of symmetric positive definite secant updates, in Nonlinear Programming 4 (Proc. Sympos., Special Interest Group on Math. Programming, Univ. Wisconsin, Madison, Wis., 1980), Academic Press, New York, 1981, pp. 167–199. ISBN 0-12-468662-1.
[11] J. E. Dennis, Jr. and J. J. Moré, A characterization of superlinear convergence and its application to quasi-Newton methods, Math. Comput., 28 (1974), pp. 549–560.

[12] ———, Quasi-Newton methods, motivation and theory, SIAM Review, 19 (1977), pp. 46–89.

[13] J. E. Dennis, Jr. and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983.

[14] M. C. Fenelon, Preconditioned Conjugate-Gradient-Type Methods for Large-Scale Unconstrained Optimization, PhD thesis, Department of Operations Research, Stanford University, Stanford, CA, 1981.

[15] R. Fletcher, Practical Methods of Optimization, John Wiley and Sons, Chichester, New York, Brisbane, Toronto and Singapore, second ed., 1987. ISBN 0471915475.

[16] ———, An overview of unconstrained optimization, Report NA/149, Department of Mathematics and Computer Science, University of Dundee, June 1993.

[17] P. E. Gill, G. H. Golub, W. Murray, and M. A. Saunders, Methods for modifying matrix factorizations, Math. Comput., 28 (1974), pp. 505–535.

[18] P. E. Gill and W. Murray, The numerical solution of a problem in the calculus of variations, in Recent Mathematical Developments in Control, D. J. Bell, ed., vol. 24, Academic Press, New York and London, 1973, pp. 97–122.

[19] ———, Conjugate-gradient methods for large-scale nonlinear optimization, Report SOL 79-15, Department of Operations Research, Stanford University, Stanford, CA, 1979.

[20] P. E. Gill, W. Murray, M. A. Saunders, and M. H. Wright, Procedures for optimization problems with a mixture of bounds and general linear constraints, ACM Trans. Math. Software, 10 (1984), pp. 282–298.

[21] ———, User's guide for NPSOL (Version 4.0): a Fortran package for nonlinear programming, Report SOL 86-2, Department of Operations Research, Stanford University, Stanford, CA, 1986.

[22] P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization, Academic Press, London and New York, 1981. ISBN 0-12-283952-8.

[23] ———, Numerical Linear Algebra and Optimization, volume 1, Addison-Wesley Publishing Company, Redwood City, 1991. ISBN 0-201-12649-4.
[24] D. Goldfarb, Factorized variable metric methods for unconstrained optimization, Math. Comput., 30 (1976), pp. 796–811.

[25] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Maryland, 1983.

[26] ———, Matrix Computations, The Johns Hopkins University Press, Baltimore, Maryland, second ed., 1989. ISBN 0-8018-5414-8.

[27] M. Lalee and J. Nocedal, Automatic column scaling strategies for quasi-Newton methods, SIAM J. Optim., 3 (1993), pp. 637–653.

[28] D. C. Liu and J. Nocedal, On the limited memory BFGS method for large scale optimization, Math. Prog., 45 (1989), pp. 503–528.

[29] J. J. Moré, B. S. Garbow, and K. E. Hillstrom, Testing unconstrained optimization software, ACM Trans. Math. Software, 7 (1981), pp. 17–41.

[30] J. J. Moré and D. C. Sorensen, Newton's method, in Studies in Mathematics, Volume 24: Studies in Numerical Analysis, Math. Assoc. America, Washington, DC, 1984, pp. 29–82.

[31] J. J. Moré and D. J. Thuente, Line search algorithms with guaranteed sufficient decrease, ACM Trans. Math. Software, 20 (1994), pp. 286–307.

[32] B. A. Murtagh and M. A. Saunders, Large-scale linearly constrained optimization, Math. Prog., 14 (1978), pp. 41–72.

[33] J. L. Nazareth, A relationship between the BFGS and conjugate gradient algorithms and its implications for new algorithms, SIAM J. Numer. Anal., 16 (1979), pp. 794–800.

[34] ———, The method of successive affine reduction for nonlinear minimization, Math. Prog., 35 (1986), pp. 97–109.

[35] J. Nocedal, Updating quasi-Newton matrices with limited storage, Math. Comput., 35 (1980), pp. 773–782.

[36] ———, Theory of algorithms for unconstrained optimization, in Acta Numerica 1992, A. Iserles, ed., Cambridge University Press, New York, USA, 1992, pp. 199–242. ISBN 0-521-41026-6.

[37] J. Nocedal and Y. Yuan, Analysis of a self-scaling quasi-Newton method, Math. Prog., 61 (1993), pp. 19–37.

[38] S. Oren and E. Spedicato, Optimal conditioning of self-scaling variable metric algorithms, Math. Prog., 10 (1976), pp. 70–90.
[39] S. S. Oren and D. G. Luenberger, Self-scaling variable metric (SSVM) algorithms, Part I: Criteria and sufficient conditions for scaling a class of algorithms, Management Science, 20 (1974), pp. 845–862.

[40] M. J. D. Powell, Some global convergence properties of a variable metric algorithm for minimization without exact line searches, in SIAM-AMS Proceedings, R. W. Cottle and C. E. Lemke, eds., vol. IX, Philadelphia, 1976, SIAM Publications.

[41] ———, How bad are the BFGS and DFP methods when the objective function is quadratic?, Math. Prog., 34 (1986), pp. 34–37.

[42] ———, Methods for nonlinear constraints in optimization calculations, in The State of the Art in Numerical Analysis, A. Iserles and M. J. D. Powell, eds., Oxford University Press, Oxford, 1987, pp. 325–357.

[43] D. F. Shanno, Conjugate-gradient methods with inexact searches, Math. Oper. Res., 3 (1978), pp. 244–256.

[44] D. F. Shanno and K. Phua, Matrix conditioning and nonlinear optimization, Math. Prog., 14 (1978), pp. 149–160.

[45] D. Siegel, Modifying the BFGS update by a new column scaling technique, Report DAMTP/1991/NA5, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, May 1991.

[46] ———, Implementing and modifying Broyden class updates for large scale optimization, Report DAMTP/1992/NA12, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, December 1992.

[47] ———, Updating of conjugate direction matrices using members of Broyden's family, Math. Prog., 60 (1993), pp. 167–185.

[48] P. Wolfe, Convergence conditions for ascent methods, SIAM Review, 11 (1969), pp. 226–235.

[49] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization, ACM Trans. Math. Software, 23 (1997), pp. 550–560.