
ELE 604/ELE 704 Optimization

Hacettepe University

Dr. Cenk Toker & Dr. Umut Sezen

27 December 2016


Contents

Textbook

1 Notation and Background
  1.1 Notation
  1.2 Background
    1.2.1 Unconstrained Optimization
    1.2.2 Constrained Optimization
      1.2.2.1 Hyperplane, Halfspace, Polyhedra, Euclidean Balls & Ellipsoids, and Cones
    1.2.3 Basic Definitions
      1.2.3.1 Gradient and Hessian
      1.2.3.2 Positive Semidefinite & Positive Definite Matrices
      1.2.3.3 Optimality Conditions for Unconstrained Problems
  1.3 Convex Sets and Functions
    1.3.1 Convex Sets
    1.3.2 Convex and Concave Functions
    1.3.3 First and Second Order Conditions for Convexity
      1.3.3.1 First Order Condition for Convexity
      1.3.3.2 Second Order Conditions for Convexity
      1.3.3.3 Operations that Preserve Convexity
    1.3.4 Quadratic Functions, Forms and Optimization
      1.3.4.1 Optimality Conditions
      1.3.4.2 Characteristics of Symmetric Matrices

2 Unconstrained Optimization and Descent Methods
  2.1 Unconstrained Optimization
  2.2 Descent Methods
    2.2.1 Motivation
    2.2.2 General Descent Method
    2.2.3 Line Search
      2.2.3.1 Exact Line Search
      2.2.3.2 Bisection Algorithm
      2.2.3.3 Backtracking Line Search
    2.2.4 Convergence
  2.3 Gradient Descent (GD) Method
    2.3.1 Convergence Analysis
      2.3.1.1 Convergence of GD with Exact Line Search
      2.3.1.2 Convergence of GD with Backtracking Line Search
    2.3.2 Examples
  2.4 Steepest Descent (SD) Method
    2.4.1 Preliminary Definitions
    2.4.2 Steepest Descent Method
    2.4.3 Steepest Descent for Different Norms
      2.4.3.1 Euclidean Norm
      2.4.3.2 Quadratic Norm
      2.4.3.3 L1-norm
      2.4.3.4 Choice of Norm
    2.4.4 Convergence Analysis
    2.4.5 Examples
  2.5 Conjugate Gradient (CG) Method
    2.5.1 Conjugate Directions
      2.5.1.1 Descent Properties of the Conjugate Gradient Method
    2.5.2 The Conjugate Gradient Method
    2.5.3 Extension to Nonquadratic Problems
  2.6 Newton's Method (NM)
    2.6.1 The Newton Step
      2.6.1.1 Interpretation of the Newton Step
    2.6.2 The Newton Decrement
    2.6.3 Newton's Method
    2.6.4 Convergence Analysis
    2.6.5 Summary
    2.6.6 Examples
    2.6.7 Approximation of the Hessian

3 Constrained Optimization Methods
  3.1 Duality
    3.1.1 Lagrange Dual Function
      3.1.1.1 Examples
  3.2 The Lagrange Dual Problem
    3.2.1 Dual Problem
  3.3 Weak and Strong Duality
    3.3.1 Weak Duality
    3.3.2 Strong Duality
    3.3.3 Slater's Condition
    3.3.4 Saddle-point Interpretation
  3.4 Optimality Conditions
    3.4.1 Certificate of Suboptimality and Stopping Criterion
      3.4.1.1 Stopping Criterion
    3.4.2 Complementary Slackness
    3.4.3 KKT Optimality Conditions
    3.4.4 Solving the Primal Problem via the Dual
    3.4.5 Perturbation and Sensitivity Analysis
  3.5 Constrained Optimization Algorithms
    3.5.1 Introduction
    3.5.2 Primal Methods
      3.5.2.1 Feasible Direction Methods
      3.5.2.2 Active Set Methods
      3.5.2.3 Gradient Projection Method
    3.5.3 Equality Constrained Optimization
      3.5.3.1 Quadratic Minimization
      3.5.3.2 Eliminating Equality Constraints
      3.5.3.3 Newton's Method with Equality Constraints
      3.5.3.4 Newton's Method with Equality Constraint Elimination
    3.5.4 Penalty and Barrier Methods
      3.5.4.1 Penalty Methods
      3.5.4.2 Barrier Methods
      3.5.4.3 Properties of the Penalty & Barrier Methods
    3.5.5 Interior-Point Methods
      3.5.5.1 Logarithmic Barrier
      3.5.5.2 Central Path
      3.5.5.3 Dual Points from Central Path
      3.5.5.4 KKT Interpretation
      3.5.5.5 Newton Step for Modified KKT Equations
      3.5.5.6 The Interior-Point Algorithm
      3.5.5.7 How to Start from a Feasible Point?

3

Page 5: ELE 604/ELE 704 Optimization - Hacettepe Universityusezen/ele604/lecture_notes.pdfELE 604/ELE 704 Optimization Hacettepe University Dr. Cenk Toker & Dr. Umut Sezen 27 December 2016

Textbook

There is no specific textbook. The lecture notes are a composition of the references below:

1. Boyd and Vandenberghe, Convex Optimization, Cambridge, 2004.

2. Luenberger, Linear and Nonlinear Programming, Kluwer, 2002.

3. S. S. Rao, Engineering Optimization: Theory and Practice, 4th Edition, Wiley, 2009.

4. Baldick, Applied Optimization, Cambridge, 2006.

5. Freund, Lecture Notes, MIT.

6. Bertsekas, Lecture Notes, MIT.

7. Bertsekas, Nonlinear Programming, Athena Scientific, 1999.


Course Contents

http://www.ee.hacettepe.edu.tr/~usezen/ele704/



1. Notation and Background

2. Unconstrained Optimization: Steepest Descent

3. Unconstrained Optimization: Conjugate Gradient

4. Unconstrained Optimization: Newton's Method

5. Constrained Optimization: Optimality and Duality

6. Constrained Optimization: Gradient Projection

7. Constrained Optimization: Modified Newton's Method

8. Constrained Optimization: Penalty & Barrier Methods

9. Constrained Optimization: Interior Point Method


Chapter 1

Notation and Background

1.1 Notation

Before continuing further, we introduce the notation used in these lecture notes.

• The symbols Z and R denote the sets of integers and real numbers, respectively.

• The symbols Z++ and R++ denote the sets of positive (not including zero) integer and real numbers, respectively. Similarly, Z+ and R+ denote the sets of nonnegative integer and real numbers, respectively.

• The symbols ∀, ∃, :, ∧, ∨, =⇒, ⇐⇒, ⊂ and ∈ denote the terms "for all", "there exists", "such that", "and", "or", "if . . . then", "if and only if (iff)", "subset of" and "element of", respectively.

• Functions of a continuous variable are indicated with round brackets, for instance, f(t) where t ∈ R.

• Functions are always assumed to be real-valued unless explicitly stated otherwise.

• Boldface type or a bar over the letter is used to denote matrix and vector quantities. Small letters will be used for column vectors, e.g. a or ā, and capital letters will be used for matrices, e.g. A or Ā. In this context, aT or āT denotes a row vector. The elements of vectors are denoted by a subscript starting from 1. See below for some examples,

a = [ a1 a2 · · · aN ]T = [ai]N×1

aT = [ a1 a2 · · · aN ]

where i = 1, 2, · · · , N.


• Arithmetic, sign and equality (or inequality) operators for vectors and matrices apply to their elements directly,

n− η = [n1 −K1 n2 −K2 · · · nN −KN ]T

n < η ≡ n1 < K1, n2 < K2, · · · , nN < KN

where n = [n1 n2 · · · nN ]T and η = [K1 K2 · · · KN ]T .

• Scalar values apply to every element of a vector. Here are some explanatory examples

Ka = [ Ka1 Ka2 · · · KaN ]T

a + K = [ a1+K a2+K · · · aN+K ]T

a < K ≡ a1 < K, a2 < K, · · · , aN < K

a = K ≡ a1 = K, a2 = K, · · · , aN = K

where a = [ a1 a2 · · · aN ]T and K is a scalar value.

• The symbols ZN and RN denote the sets of N-dimensional integer vectors and N-dimensional real number vectors, respectively. In other words, ZN and RN denote N-dimensional (column) vector spaces where the vector elements are integers and real numbers, respectively, for instance

c ∈ RN ≡ { c = [ c1 c2 · · · cN ]T and c1 ∈ R, c2 ∈ R, · · · , cN ∈ R }

• An M×N matrix A can be stated in two forms

A =
| A1,1 A1,2 · · · A1,N |
| A2,1 A2,2 · · · A2,N |
|  ...   ...  . . .  ... |
| AM,1 AM,2 · · · AM,N |

A = [Ai,j]M×N

where i = 1, 2, · · · , M and j = 1, 2, · · · , N.

• The quantity AT denotes the transpose of A.

• The quantities A−1 and A−T denote the inverse and the inverse transpose of A, respectively.

• An N-length column vector a is essentially an N×1 matrix, and similarly the row vector aT is also a 1×N matrix.

• Matrix multiplication is defined only between an M×P matrix A and a P×N matrix B, to produce an M×N matrix C:

C = AB

where

Ci,j = Σ_{k=1}^{P} Ai,k Bk,j

with i = 1, 2, · · · , M and j = 1, 2, · · · , N.
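As an illustrative aside (not part of the original notes), the element-wise definition above can be sketched in plain Python; the `matmul` helper name is ours:

```python
# Multiply an M x P matrix A by a P x N matrix B using the definition
# C[i][j] = sum_k A[i][k] * B[k][j], with k running over the shared dimension P.
def matmul(A, B):
    M, P, N = len(A), len(B), len(B[0])
    assert all(len(row) == P for row in A)  # inner dimensions must agree
    return [[sum(A[i][k] * B[k][j] for k in range(P)) for j in range(N)]
            for i in range(M)]

A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]           # 3 x 2
C = matmul(A, B)         # 2 x 2 result
assert C == [[58, 64], [139, 154]]
```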


• Matrix multiplication is not commutative in general:

AB ≠ BA

• Matrix multiplication is associative:

ABC = (AB)C = A(BC)

• Matrix multiplication is distributive:

A(B + C) = AB + AC

• The transpose of a matrix product is equal to the product of the transposed matrices in reverse order, e.g.,

(AB)T = BTAT

• The trace of an N×N square matrix A is defined to be the sum of the elements on the main diagonal (the diagonal from the upper left to the lower right) of A, i.e.,

tr(A) = A1,1 + A2,2 + · · · + AN,N = Σ_{i=1}^{N} Ai,i

• The notation

diag [ d1 d2 · · · dN ]

stands for an N×N diagonal matrix D with the main diagonal elements Di,i = di, i.e.

D =
| d1 0  · · · 0  |
| 0  d2 · · · 0  |
| ...    . . . ... |
| 0  0  · · · dN |

The shorthand notations diag d and diag dT can also be used, i.e.

diag d ≡ diag dT ≡ diag [ d1 d2 · · · dN ]

where d = [ d1 d2 · · · dN ]T.

• The symbol IN denotes the N×N identity matrix, i.e.

IN = diag [ 1 1 · · · 1 ] = diag [ 1 ]1×N

with N ones on the diagonal. Sometimes the subscript N is omitted and simply I is used to denote an identity matrix.

• Note that, for an N×N nonsingular (i.e., invertible) matrix A

AA−1 = A−1A = I


• The · operator denotes the inner product (or dot product) operator for two vectors of the same length, as defined below

a · b = aTb = a1b1 + a2b2 + · · · + aNbN

where a = [ a1 a2 · · · aN ]T and b = [ b1 b2 · · · bN ]T. Note that the inner product produces a scalar value and it is commutative, i.e.

a · b = b · a

so aTb = bTa. The inner product is sometimes denoted with the 〈 , 〉 operator, i.e.,

〈a,b〉 = a · b = aTb

• The ⊗ operator denotes the outer product (or tensor product) operator for two vectors of the same length, as defined below

a ⊗ b = abT

where a = [ a1 a2 · · · aN ]T and b = [ b1 b2 · · · bN ]T. Note that the outer product produces an N×N square matrix and it is not commutative, i.e.

a ⊗ b ≠ b ⊗ a

so abT ≠ baT.
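The two products above can be checked with a small Python sketch (an aside, not from the notes; the `inner`/`outer` helper names are ours):

```python
def inner(a, b):
    # Scalar-valued dot product a . b = a^T b
    return sum(x * y for x, y in zip(a, b))

def outer(a, b):
    # Matrix-valued outer product a (x) b = a b^T
    return [[x * y for y in b] for x in a]

a = [1, 2, 3]
b = [4, 5, 6]

assert inner(a, b) == inner(b, a) == 32        # scalar, commutative
assert outer(a, b) != outer(b, a)              # N x N matrix, not commutative
# The two outer products are transposes of each other: (a b^T)^T = b a^T
assert outer(a, b) == [list(row) for row in zip(*outer(b, a))]
```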

• Functions are also defined in terms of their input and output sets, i.e.

f : A → B

means "f is a B-valued function of an A-valued variable", where A ⊂ R and B ⊂ R.

• Vector variables (or arguments) will be denoted by small letters using boldface type, that is

f(x) ≡ f(x1, x2, · · · , xN)

where x = [x1 x2 · · · xN ]T. Assuming x ∈ RN and f(x) is a real-valued function, then

f(x) : RN → R

• Domain: The domain of a function is the set of "input" or argument values for which the function is defined. That is, the function provides an "output" or value for each member of the domain. The domain of a function f is denoted by

dom f


For instance, the domain of cosine is the set of all real numbers, while the domain of the square root consists only of numbers greater than or equal to 0 (ignoring complex numbers in both cases), i.e.,

dom cos = R
dom sqrt = R+

For a function whose domain is a subset of the real numbers, when the function is representedin an xy Cartesian coordinate system, the domain is represented on the x-axis.

• Range: The range of a function refers to the image of the function. The codomain is a set containing the function's outputs, whereas the image is the part of the codomain which consists only of the function's outputs.

For example, the function f(x) = x2 is often described as a function from the real numbers to the real numbers, meaning that the codomain is R, but its image (i.e., range) is the set of non-negative real numbers, i.e., R+.

For a function whose range is a subset of the real numbers, when the function is represented in an xy Cartesian coordinate system, the range is represented on the y-axis.

• Function vectors and function matrices are denoted in a similar fashion, like f(x) and F(x), respectively. For example, a function vector f(x) of size N can be stated as

f(x) = [ f1(x) f2(x) · · · fN(x) ]T.

Assuming x ∈ R and the fi(x) are real-valued functions, then

f(x) : R → RN

• Sets are denoted by { }. An example set is defined below

A = { x | P(x) }

meaning "A is a set of x such that P(x) is true". Here, the letter x can be replaced by other symbols. Another example would be

B = { y | y is a prime number }

• Supremum: The supremum (sup) of a subset S of a totally or partially ordered set T is the least element of T that is greater than or equal to all elements of S. Consequently, the supremum is also referred to as the least upper bound (LUB). The plural of supremum is suprema.

If S contains a greatest element, then that element is the supremum. Otherwise, the supremum does not belong to S (or does not exist).

For instance, the negative real numbers do not have a greatest element, and their supremum is 0 (which is not a negative real number).


Examples:

sup {1, 2, 3} = 3

sup {x ∈ R : 0 < x < 1} = sup {x ∈ R : 0 ≤ x ≤ 1} = 1

One basic property of the supremum is

sup {f(t) + g(t) : t ∈ A} ≤ sup {f(t) : t ∈ A}+ sup {g(t) : t ∈ A}

for any functions f and g.

• Infimum: The infimum (inf) of a subset S of a partially ordered set T is the greatest element of T that is less than or equal to all elements of S. Consequently, the term greatest lower bound (GLB) is also commonly used. The plural of infimum is infima.

If the infimum exists, it is unique. If S contains a least element, then that element is the infimum; otherwise, the infimum does not belong to S (or does not exist).

For instance, the positive real numbers do not have a least element, and their infimum is 0, which is not a positive real number.

Examples:

inf {1, 2, 3} = 1

inf {x ∈ R : 0 < x < 1} = inf {x ∈ R : 0 ≤ x ≤ 1} = 0

• Norm: A norm is a way of measuring the length or strength of a vector. The general form of the norm is called the Lp-norm, given by

‖x‖p = ( Σ_{i=1}^{N} |xi|^p )^{1/p}

for p ≥ 1. The value of p is typically 1, 2 or ∞.

The L1-norm is the Taxicab norm or Manhattan norm:

‖x‖1 = Σ_{i=1}^{N} |xi|

The L2-norm is the Euclidean norm:

‖x‖2 = ( Σ_{i=1}^{N} |xi|^2 )^{1/2}

The L∞-norm is the maximum norm or infinity norm:

‖x‖∞ = max_i |xi|
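The three norms above can be evaluated with a short Python sketch (an aside, not from the notes; the `lp_norm` helper name is ours):

```python
# Lp-norm from the definition ||x||_p = (sum |x_i|^p)^(1/p).
def lp_norm(x, p):
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0]
assert lp_norm(x, 1) == 7.0                    # Taxicab norm
assert lp_norm(x, 2) == 5.0                    # Euclidean norm
assert max(abs(xi) for xi in x) == 4.0         # infinity norm
# The Lp-norm approaches the L-infinity norm as p grows.
assert abs(lp_norm(x, 100) - 4.0) < 0.1
```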


• Note: If not stated otherwise, ‖x‖ will denote the Euclidean norm ‖x‖2, i.e., the L2-norm.

• Unit Ball: Let ‖ · ‖ denote any norm on RN; then the unit ball is the set of all vectors with norm less than or equal to one, i.e.,

B = { x : ‖x‖ ≤ 1 }.

Then, B is called the unit ball of the norm ‖ · ‖.

The unit ball satisfies the following properties:

- B is symmetric about the origin, i.e., x ∈ B if and only if −x ∈ B.
- B is convex.
- B is closed, bounded, and has nonempty interior.

• Two-dimensional unit ball of the L1-norm, i.e., B1 = {x ∈ R2 : ‖x‖1 ≤ 1}.

• Two-dimensional unit ball of the L2-norm, i.e., B2 = {x ∈ R2 : ‖x‖2 ≤ 1}.

• Two-dimensional unit ball of the L∞-norm, i.e., B∞ = {x ∈ R2 : ‖x‖∞ ≤ 1}.


• Two-dimensional unit balls of the L1, L2 and L∞ norms shown together.

• Dual Norm: Let ‖ · ‖ denote any norm on RN; then the dual norm, denoted by ‖ · ‖∗, is the function from RN to R with values

‖x‖∗ = max_y { yTx : ‖y‖ ≤ 1 } = sup { yTx : ‖y‖ ≤ 1 }

The above definition also corresponds to a norm: it is convex, as it is the pointwise maximum of convex (in fact, linear) functions y → xTy; it is homogeneous of degree 1, that is, ‖αx‖∗ = α‖x‖∗ for every x in RN and α ≥ 0.

• By definition of the dual norm,

xTy ≤ ‖x‖ · ‖y‖∗

This can be seen as a generalized version of the Cauchy-Schwarz inequality, which corresponds to the Euclidean norm.

• The dual to the dual norm above is the original norm, i.e.,

‖x‖∗∗ = ‖x‖

- The norm dual to the Euclidean norm is itself. This comes directly from the Cauchy-Schwarz inequality.

‖x‖2∗ = ‖x‖2

13

Page 15: ELE 604/ELE 704 Optimization - Hacettepe Universityusezen/ele604/lecture_notes.pdfELE 604/ELE 704 Optimization Hacettepe University Dr. Cenk Toker & Dr. Umut Sezen 27 December 2016

- The norm dual to the L∞-norm is the L1-norm, and vice versa.

‖x‖∞∗ = ‖x‖1 and ‖x‖1∗ = ‖x‖∞

- More generally, the dual of the Lp-norm is the Lq-norm

‖x‖p∗ = ‖x‖q

where q = p/(p − 1), or equivalently 1/p + 1/q = 1.

• An eigenvector of an N×N square matrix A is a non-zero vector v that, when multiplied by A, yields the original vector multiplied by a single number λ, i.e.,

Av = λv

The number λ is called the eigenvalue of A corresponding to v.

• Thus, in order to find the eigenvalues of A, we solve the above equation for λ

Av − λv = 0
(A − λI)v = 0

where I is the N×N identity matrix. It is a fundamental result of linear algebra that an equation Mv = 0 has a non-zero solution v if and only if the determinant of the matrix M is zero, i.e., det(M) = 0. It follows that the eigenvalues of A are precisely the values λ that satisfy the characteristic equation

det (A − λI) = 0

• Note that the condition number, κ(·), of a matrix is given by the ratio of the largest and the smallest eigenvalue (in magnitude), e.g.,

κ(H(x)) = | max_i λi / min_i λi |

If the condition number is close to one, the matrix is well-conditioned, which means its inverse can be computed with good accuracy. If the condition number is large, then the matrix is said to be ill-conditioned. Practically, such a matrix is almost singular, and the computation of its inverse, or the solution of a linear system of equations, is prone to large numerical errors. A matrix that is not invertible has a condition number equal to infinity.
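For a 2×2 symmetric matrix the characteristic equation is a quadratic in λ, so the eigenvalues and the condition number can be computed directly. A small Python sketch (an aside, not from the notes; the `eig2x2_sym` helper name is ours):

```python
import math

# For a symmetric matrix [[a, b], [b, c]] the characteristic equation
# det(A - lam*I) = lam^2 - (a + c)*lam + (a*c - b^2) = 0 can be solved directly.
def eig2x2_sym(a, b, c):
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(tr * tr - 4 * det)   # nonnegative for a symmetric matrix
    return (tr - disc) / 2, (tr + disc) / 2

lam_min, lam_max = eig2x2_sym(2.0, 1.0, 2.0)   # A = [[2, 1], [1, 2]]
assert (lam_min, lam_max) == (1.0, 3.0)

# Condition number as the ratio of the extreme eigenvalue magnitudes.
kappa = abs(lam_max / lam_min)
assert kappa == 3.0
```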

• An N×N symmetric matrix A is called positive-semidefinite if

xTAx ≥ 0

for all x ∈ RN, and is called positive-definite if

xTAx > 0

for all nonzero x ∈ RN.


• If A is positive-semidefinite, it is denoted by

A ⪰ 0

and has nonnegative eigenvalues.

• If A is positive-definite, it is denoted by

A ≻ 0

and has positive eigenvalues.

• For any real matrix B, the matrix BTB is positive-semidefinite, and rank(B) = rank(BTB).

• An N×N symmetric matrix A is called negative-semidefinite if

xTAx ≤ 0

for all x ∈ RN, and is called negative-definite if

xTAx < 0

for all nonzero x ∈ RN.

• If A is negative-semidefinite, it is denoted by

A ⪯ 0

and has nonpositive eigenvalues.

• If A is negative-definite, it is denoted by

A ≺ 0

and has negative eigenvalues.
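The eigenvalue-sign characterization above can be turned into a small classifier for symmetric 2×2 matrices, sketched in Python (an aside, not from the notes; the `definiteness` helper name is ours):

```python
import math

# Classify a symmetric 2x2 matrix [[a, b], [b, c]] by the signs of its
# eigenvalues, which are the roots of lam^2 - (a + c)*lam + (a*c - b^2) = 0.
def definiteness(a, b, c):
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(tr * tr - 4 * det)
    lo, hi = (tr - disc) / 2, (tr + disc) / 2
    if lo > 0:
        return "positive-definite"
    if lo >= 0:
        return "positive-semidefinite"
    if hi < 0:
        return "negative-definite"
    if hi <= 0:
        return "negative-semidefinite"
    return "indefinite"

assert definiteness(2.0, 1.0, 2.0) == "positive-definite"      # eigenvalues 1, 3
assert definiteness(1.0, 1.0, 1.0) == "positive-semidefinite"  # eigenvalues 0, 2
assert definiteness(-2.0, 0.0, -1.0) == "negative-definite"
assert definiteness(0.0, 1.0, 0.0) == "indefinite"             # eigenvalues -1, 1
```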

• Quadratic norm: A generalized quadratic norm of x is defined by

‖x‖P = (xTPx)^{1/2} = ‖P^{1/2}x‖2 = ‖Mx‖2

where P = MTM is an N×N symmetric positive definite (SPD) matrix.

• When P = I, the quadratic norm is equal to the Euclidean norm.

• The dual of the quadratic norm is given by

‖x‖P∗ = ‖x‖Q = (xTQx)^{1/2}

where Q = P−1, i.e.,

‖x‖P∗ = (xTP−1x)^{1/2}
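For a diagonal SPD matrix both the quadratic norm and its dual are easy to evaluate by hand, which makes a convenient sanity check. A Python sketch (an aside, not from the notes; the diagonal P and the vector x are hypothetical example values):

```python
import math

# Hypothetical diagonal SPD matrix P = diag(4, 1) and a test vector x.
P = [4.0, 1.0]                       # diagonal entries of P
x = [1.0, 2.0]

# ||x||_P = (x^T P x)^(1/2); here x^T P x = 4*1 + 1*4 = 8.
norm_P = math.sqrt(sum(p * xi * xi for p, xi in zip(P, x)))
assert norm_P == math.sqrt(8.0)

# ||x||_P = ||M x||_2 with M = P^(1/2) for diagonal P.
Mx = [math.sqrt(p) * xi for p, xi in zip(P, x)]
assert math.isclose(norm_P, math.sqrt(sum(v * v for v in Mx)))

# The dual norm uses Q = P^(-1) = diag(1/4, 1).
norm_P_dual = math.sqrt(sum(xi * xi / p for p, xi in zip(P, x)))
assert math.isclose(norm_P_dual, math.sqrt(0.25 + 4.0))
```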


• The notation ∂/∂x stands for the multidimensional partial derivative operator in the form of a column vector, i.e.

∂/∂x = [ ∂/∂x1 ∂/∂x2 · · · ∂/∂xN ]T.

Consider the following example

∂f(x)/∂x = [ ∂f(x)/∂x1 ∂f(x)/∂x2 · · · ∂f(x)/∂xN ]T.

• The symbol ∇x denotes the multidimensional gradient (or del) operator in the form of a column vector, similar to the multidimensional partial derivative operator. In general we are going to omit the subscript x, i.e.

∇ = ∇x = [ ∂/∂x1 ∂/∂x2 · · · ∂/∂xN ]T.

Consider the following example

∇f(x) = ∇xf(x) = ∂f(x)/∂x = [ ∂f(x)/∂x1 ∂f(x)/∂x2 · · · ∂f(x)/∂xN ]T.

Note that the ∇ operator is not commutative, i.e.,

∇f(x) ≠ f(x)∇

• The gradient of the dot product bTx is given by

∇(bTx) = ∇(xTb) = b

where b is an N×1 column vector.

• Thus, the directional derivative in the direction d is given by

(d · ∇)f(x) = d · ∇f(x) = ∇f(x) · d = dT∇f(x) = ∇Tf(x) d
            = d1 ∂f(x)/∂x1 + d2 ∂f(x)/∂x2 + · · · + dN ∂f(x)/∂xN

where d = [ d1 d2 · · · dN ]T.

• The partial derivative of a vector variable x is given by

∂x/∂x = ∂xT/∂x = ∇xT = I

where x = [x1 x2 · · · xN ]T.


• Derivative under multiplication:

∂(xTB)/∂x = ∇(xTB) = B

∂(Cx)/∂x = ∇(xTCT) = CT

where x = [x1 x2 · · · xN ]T, and B and C are N×M and M×N matrices, respectively.

• The gradient of a quadratic form is given by

∇(xTAx) = ATx + Ax

where x = [x1 x2 · · · xN ]T and A is an N×N matrix.
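The identity ∇(xTAx) = (AT + A)x can be verified numerically with central differences; a Python sketch on a small hypothetical example (an aside, not from the notes):

```python
# Check grad(x^T A x) = (A^T + A) x with central differences on a 2x2 example.
A = [[1.0, 2.0],
     [3.0, 4.0]]
x = [1.0, -1.0]

def quad(v):
    # x^T A x = sum_{i,j} v_i A_{i,j} v_j
    return sum(v[i] * A[i][j] * v[j] for i in range(2) for j in range(2))

eps = 1e-6
num_grad = []
for i in range(2):
    xp = list(x); xp[i] += eps
    xm = list(x); xm[i] -= eps
    num_grad.append((quad(xp) - quad(xm)) / (2 * eps))

# Analytic gradient: component i of (A^T + A) x.
analytic = [sum((A[j][i] + A[i][j]) * x[j] for j in range(2)) for i in range(2)]
assert all(abs(n - a) < 1e-6 for n, a in zip(num_grad, analytic))
```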

• The symbol ∇2 denotes the Hessian operator (∇∇T) or Hessian matrix (symmetric), not the Laplace operator ∇ · ∇ (or ∇T∇):

∇2 = ∇ ⊗ ∇ = ∇∇T =
| ∂2/∂x1∂x1  ∂2/∂x1∂x2  · · ·  ∂2/∂x1∂xN |
| ∂2/∂x2∂x1  ∂2/∂x2∂x2  · · ·  ∂2/∂x2∂xN |
| ...        ...        . . .  ...       |
| ∂2/∂xN∂x1  ∂2/∂xN∂x2  · · ·  ∂2/∂xN∂xN |

H(x) = ∇2f(x) = ∇∇Tf(x) =
| ∂2f(x)/∂x1∂x1  ∂2f(x)/∂x1∂x2  · · ·  ∂2f(x)/∂x1∂xN |
| ∂2f(x)/∂x2∂x1  ∂2f(x)/∂x2∂x2  · · ·  ∂2f(x)/∂x2∂xN |
| ...            ...            . . .  ...           |
| ∂2f(x)/∂xN∂x1  ∂2f(x)/∂xN∂x2  · · ·  ∂2f(x)/∂xN∂xN |

• The second derivative (i.e., Hessian) of a quadratic form is given by

∇2(xTAx) = A + AT

where x = [x1 x2 · · · xN ]T and A is an N×N matrix.

• For a symmetric matrix S,

∇2(xTSx) = 2S
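Both Hessian identities above can be checked on a small hypothetical example in Python (an aside, not from the notes):

```python
# For f(x) = x^T A x the Hessian is the constant matrix A + A^T,
# which reduces to 2S when A = S is symmetric.
A = [[1.0, 2.0],
     [0.0, 3.0]]
H = [[A[i][j] + A[j][i] for j in range(2)] for i in range(2)]

assert H == [[2.0, 2.0], [2.0, 6.0]]
assert H == [list(row) for row in zip(*H)]     # Hessians are symmetric

# The symmetric part S = (A + A^T)/2 gives the same quadratic form, and 2S = H.
S = [[H[i][j] / 2 for j in range(2)] for i in range(2)]
assert [[2 * S[i][j] for j in range(2)] for i in range(2)] == H
```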

• For a vector function f(x), (∇fT(x))T gives the Jacobian matrix

Jf(x) = [ ∂fi(x)/∂xj ]M×N =
| ∂f1(x)/∂x1  ∂f1(x)/∂x2  · · ·  ∂f1(x)/∂xN |
| ∂f2(x)/∂x1  ∂f2(x)/∂x2  · · ·  ∂f2(x)/∂xN |
| ...         ...         . . .  ...        |
| ∂fM(x)/∂x1  ∂fM(x)/∂x2  · · ·  ∂fM(x)/∂xN |

where f(x) = [ f1(x) f2(x) · · · fM(x) ]T and x = [x1 x2 · · · xN ]T.


• Here,

Jf(x) = (∇fT(x))T
JfT(x) = ∇fT(x)

• Sometimes ∇f(x) is also used to denote the Jacobian matrix.

• The directional gradient in the direction d for a vector function f(x) would be given by

(d · ∇)f(x) = (dT∇fT(x))T = (∇fT(x))T d = (dT JfT(x))T = Jf(x) d
            =
| d1 ∂f1(x)/∂x1 + d2 ∂f1(x)/∂x2 + · · · + dN ∂f1(x)/∂xN |
| d1 ∂f2(x)/∂x1 + d2 ∂f2(x)/∂x2 + · · · + dN ∂f2(x)/∂xN |
| ...                                                   |
| d1 ∂fN(x)/∂x1 + d2 ∂fN(x)/∂x2 + · · · + dN ∂fN(x)/∂xN |

where f(x) = [ f1(x) f2(x) · · · fN(x) ]T and x = [x1 x2 · · · xN ]T.

• The Taylor series of a real or complex-valued function f(x) that is infinitely differentiable in a neighborhood of a point x0 is the power series

f(x) = f(x0) + f′(x0)/1! (x − x0) + f′′(x0)/2! (x − x0)2 + f′′′(x0)/3! (x − x0)3 + · · ·

f(x0 + ∆x) = f(x0) + f′(x0)/1! ∆x + f′′(x0)/2! (∆x)2 + f′′′(x0)/3! (∆x)3 + · · ·

∆f = f′(x0)∆x + (1/2) f′′(x0)(∆x)2 + (1/6) f′′′(x0)(∆x)3 + · · ·

where n! denotes the factorial of n, ∆x = x − x0, ∆f = f(x) − f(x0), and f′(x0), f′′(x0), f′′′(x0), . . . denote the first, second, third, . . . derivatives of f(x) evaluated at the point x0, respectively.
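The truncated series above can be tried on f(x) = e^x around x0 = 0, where every derivative equals 1; a Python sketch (an aside, not from the notes; the `taylor_exp` helper name is ours):

```python
import math

# Truncated Taylor expansion of exp around x0 = 0: f(dx) ~ sum_n dx^n / n!.
def taylor_exp(dx, terms):
    return sum(dx ** n / math.factorial(n) for n in range(terms))

dx = 0.5
assert abs(taylor_exp(dx, 3) - math.exp(dx)) < 0.03   # quadratic model
assert abs(taylor_exp(dx, 6) - math.exp(dx)) < 1e-4   # more terms, tighter fit
```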

• The min_{x∈RN} or min_x operator returns the minimum value (i.e., the global minimum) of the function f(x); similarly, the argmin_{x∈RN} or argmin_x operator returns the argument value x∗ which results in the minimum value of the function f(x), i.e.,

f(x∗) = min_x f(x) = min_{x∈RN} f(x)

x∗ = argmin_x f(x) = argmin_{x∈RN} f(x)
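The distinction between min and argmin can be illustrated with a grid-search sketch in Python (an aside, not from the notes; the function and grid are hypothetical example values):

```python
# f(x) = (x - 2)^2 + 1: the minimum value is 1, attained at the minimizer x* = 2.
def f(x):
    return (x - 2.0) ** 2 + 1.0

xs = [i / 100.0 for i in range(-500, 501)]     # grid on [-5, 5]
f_star = min(f(x) for x in xs)                 # min_x f(x): the minimum VALUE
x_star = min(xs, key=f)                        # argmin_x f(x): the minimizer

assert f_star == 1.0
assert x_star == 2.0
```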


• Similarly, min_{x∈L} gives the local minimum where the argument values are restricted to the subdomain L ⊂ RN, i.e.,

f(x∗) = min_{x∈L} f(x)

x∗ = argmin_{x∈L} f(x)

gives the local minimum.

• The superscript (k) denotes the iteration level; for example, a local minimum and its argument values at the k-th iteration would be defined as

f(x(k)) = min_{x∈L(k)} f(x)

x(k) = argmin_{x∈L(k)} f(x)

respectively, where k = 1, 2, · · · , K.

1.2 Background

1.2.1 Unconstrained Optimization

• Unconstrained optimization is to find the point which minimizes the cost function f(x) without being subject to any constraints. In other words,

min_{x∈X} f(x)

where x = [x1 x2 · · · xN ]T, X ⊂ RN and x∗ is a feasible solution (or point) with

f(x∗) = min_{x∈X} f(x)

1.2.2 Constrained Optimization

• Constrained optimization is to find the point which minimizes the cost function f(x) subject to some equality and/or inequality constraints.

In other words,

min_{x∈X} f(x)

subject to g(x) ≤ 0

h(x) = 0

where x = [x1 x2 · · · xN ]T, X ⊂ RN, g(x) = [ g1(x) g2(x) · · · gM(x) ]T are the inequality constraints, h(x) = [h1(x) h2(x) · · · hL(x) ]T are the equality constraints, and x∗ is a feasible solution (or point) iff g(x∗) ≤ 0 and h(x∗) = 0 with

f(x∗) = min_{x∈X} f(x)

• Question: What if we want to maximize f(x)?

max_x f(x) = −min_x (−f(x))

• Examples:

- Computer networks - e.g., optimum routing problem

- Production planning in a factory

- Resource allocation

- Computer aided design (CAD) - e.g., shortest paths in a PCB

- Travelling salesman problem

• A Diet Problem: Find the "most economical" diet that satisfies the minimum nutrition requirements for good health.

- N different foods

- price of i-th food is ci

- M basic nutritional ingredients

- an individual must take "at least" bj units of j-th nutrient per day

- each unit of i-th food contains Aj,i units of the j-th nutrient

Formulation:

- Variable: amount of food xi for i-th food. Thus, variables can be represented by the vector x

x = [ x1 x2 · · · xi · · · xN ]T

where i = 1, 2, · · · , N.

Amount of food cannot be negative, i.e., xi ≥ 0, so

x ≥ 0

- Cost:

f(x) = c1x1 + c2x2 + · · ·+ cixi + · · ·+ cNxN

= cTx


- Constraints:

A1,1x1 + A1,2x2 + · · · + A1,ixi + · · · + A1,NxN ≥ b1
A2,1x1 + A2,2x2 + · · · + A2,ixi + · · · + A2,NxN ≥ b2
⋮
Aj,1x1 + Aj,2x2 + · · · + Aj,ixi + · · · + Aj,NxN ≥ bj
⋮
AM,1x1 + AM,2x2 + · · · + AM,ixi + · · · + AM,NxN ≥ bM

or, in matrix form,

Ax ≥ b

Thus, the optimization formulation for this diet problem is formed as

min_x cTx

s.t. b − Ax ≤ 0

−x ≤ 0
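This linear program can be handed to an off-the-shelf LP solver. The sketch below uses SciPy's `linprog`; the numbers (prices c, nutrient matrix A, requirements b) are hypothetical, chosen only to make the program feasible:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: N = 3 foods, M = 2 nutrients.
c = np.array([2.0, 3.0, 1.5])      # price of each food
A = np.array([[1.0, 2.0, 0.5],     # A[j, i]: units of nutrient j per unit of food i
              [3.0, 1.0, 2.0]])
b = np.array([10.0, 8.0])          # minimum daily units of each nutrient

# linprog minimizes c^T x subject to A_ub x <= b_ub, so Ax >= b is rewritten
# as -Ax <= -b, and bounds=(0, None) encodes x >= 0.
res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 3)
print(res.x, res.fun)
```

The returned `res.x` is the cheapest food mix that meets every nutrient requirement.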

1.2.2.1 Hyperplane, Halfspace, Polyhedra, Euclidean Balls & Ellipsoids, and Cones

• Each equality constraint defines a different hyperplane. So, the set of points

{x | aTx = b; a, x ∈ RN, b ∈ R, a ≠ 0}

constitutes a hyperplane; the vector a is normal to the hyperplane, and the points x lie on the hyperplane. Note that hyperplanes are affine, i.e., they preserve collinearity (all points lying on a line initially still lie on a line after transformation).

aTx = b

A hyperplane describes a plane in R3 (i.e., 3D), as shown in the figure above, and describes a line in R2 (i.e., 2D), as shown in the next two figures.

• Each inequality constraint defines a different halfspace. So, the set of points

{x | aTx ≥ b; a, x ∈ RN, b ∈ R, a ≠ 0}

or

{x | aTx ≤ b; a, x ∈ RN, b ∈ R, a ≠ 0}

constitutes a halfspace.


The half-space {x | 2x1 + x2 ≤ 0} The half-space {x | 2x1 + x2 ≤ 3}

• The solution set of finitely many linear inequalities & equalities constitutes a polyhedron,

{x | Ax ≤ b, Cx = d; A ∈ RM×N, C ∈ RP×N, b ∈ RM, d ∈ RP, x ∈ RN}

In other words, a polyhedron is the intersection of a finite number of halfspaces and hyperplanes.

Example 1:

Find the solution of

max x1 + 2x2 + 3x3

s.t. x1 + x2 + x3 = 1

x1, x2, x3 ≥ 0


Example 2:

Consider the pyramid in the following figure. The length of each side is 1 unit. Find the halfspaces defining this volume.

Solution:

[ 0   1   0 ][x1; x2; x3] ≥ 0

[ 1/√2   −1/(2√3)   −1/√6 ][x1; x2; x3] ≥ 0

[ 0   −1/(2√3)   √(2/3) ][x1; x2; x3] ≥ 0

[ −1/√2   −1/(2√3)   −1/√6 ][x1; x2; x3] ≥ −1/√2

• The set of points B(xc, r) = {x | ‖x − xc‖ ≤ r} = {xc + ru | ‖u‖ ≤ 1} forms a ball with respect to the Euclidean norm with center xc and radius r.

• The set of points BP(xc, r) = {x | (x − xc)T P (x − xc) ≤ r} with respect to the quadratic norm with a symmetric positive definite (SPD) matrix P forms an ellipsoid. The axes of the ellipsoid are related to the eigenvalues of P.


Change of coordinates:

Let y = P1/2x; then the ellipsoid becomes {y | (y − yc)T(y − yc) ≤ r}, i.e., a ball with respect to the Euclidean norm with center yc = P1/2xc and radius √r.

• The set of points {x | x = θ1x1 + θ2x2; x1, x2 ∈ RN, θ1, θ2 ≥ 0} forms a cone.

Example 3:

Find the region x1 + 2x2 ≥ 4, i.e., [1 2][x1; x2] ≥ 4, and x1, x2 ≥ 0

Solution:


Example 4: Find the region

[1 2; 2 1][x1; x2] ≥ [6; 6],  [x1; x2] ≥ [0; 0]

Solution:

Example 5: Find region x1 + 2x2 + 3x3 = 1, and x1, x2, x3 ≥ 0

Solution:

Example 6: Find region 1 ≤ x1 ≤ 2, 0 ≤ x2 ≤ 1 and 4 ≤ x3 ≤ 5

Solution:


Example 7:

min x1 + x2

s.t. x1 + 2x2 ≥ 6

2x1 + x2 ≥ 6

x1, x2 ≥ 0

Solution:
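This LP is intended to be solved graphically; as a numerical cross-check (a sketch, not part of the original notes), its optimum can be found by enumerating the vertices of the feasible polyhedron, since an LP attains its minimum at a vertex:

```python
from itertools import combinations

# Constraints of min x1 + x2, each written as a1*x1 + a2*x2 >= rhs
cons = [(1.0, 2.0, 6.0), (2.0, 1.0, 6.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

def intersect(c1, c2):
    """Point where two constraints hold with equality (None if parallel)."""
    a, b, e = c1
    c, d, f = c2
    det = a * d - b * c
    if abs(det) < 1e-12:
        return None
    return ((e * d - b * f) / det, (a * f - e * c) / det)

def feasible(p):
    return all(a * p[0] + b * p[1] >= rhs - 1e-9 for a, b, rhs in cons)

vertices = []
for c1, c2 in combinations(cons, 2):
    p = intersect(c1, c2)
    if p is not None and feasible(p):
        vertices.append(p)

best = min(vertices, key=lambda p: p[0] + p[1])
print(best)  # the optimum vertex (2.0, 2.0), with objective value 4
```

The minimum of x1 + x2 is attained at the intersection of the two nutrient-style constraints, (2, 2).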

Example 8:

min 3x1 + x2

s.t. x1 + 2x2 ≥ 6

2x1 + x2 ≥ 6

x1, x2 ≥ 0

Solution:

Example 9:

min x1 − x2

s.t. x1 + 2x2 ≥ 6

2x1 + x2 ≥ 6

x1, x2 ≥ 0

Solution:


1.2.3 Basic Definitions

Definition: Ball centered at point x0 with radius ε:

B(x0, ε) = {x | ‖x − x0‖ < ε}

Local/global, strict/non-strict, minima/maxima

Consider the optimization problem

min_x f(x)

s.t. x ∈ F (including constraints)

• Definition: x∗ ∈ F is a local minimum if

∃ε > 0 ⇒ f(x∗) ≤ f(y), ∀y ∈ B(x∗, ε) ∩ F, y ≠ x∗

• Definition: x∗ ∈ F is a global minimum if

f(x∗) ≤ f(y), ∀y ∈ F, y ≠ x∗

• Definition: x∗ ∈ F is a strict local minimum if

∃ε > 0 ⇒ f(x∗) < f(y), ∀y ∈ B(x∗, ε) ∩ F, y ≠ x∗


• Definition: x∗ ∈ F is a strict global minimum if

f(x∗) < f(y), ∀y ∈ F, y ≠ x∗

• Definition: For strict/non-strict and local/global maxima, change ≤ and < to ≥ and >, respectively, in the above definitions.

1.2.3.1 Gradient and Hessian

• Let f(x) : X → R, X ⊂ RN. f(x) is differentiable at x0 ∈ X if ∇f(x0) exists, where ∇f(x0) is the gradient of f(x) at point x0.

• In the neighbourhood of a point x0, the first order approximation of f(x) can be given as

f(x) = f(x0) + ∇Tf(x0)(x − x0) + ‖x − x0‖ α(x0, x − x0)

where lim_{(x−x0)→0} α(x0, x − x0) = 0. In other words,

∆f = ∇Tf(x0)∆x + ‖∆x‖ α(x0, ∆x)

where ∆f = f(x) − f(x0), ∆x = x − x0 and lim_{∆x→0} α(x0, ∆x) = 0.

• f(x) is differentiable on X if f(x) is differentiable at every x ∈ X.

Example 10: Let f(x) = 3x1²x2³ + x2²x3³, then

∇f(x) = [ 6x1x2³ ;  9x1²x2² + 2x2x3³ ;  3x2²x3² ]
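The gradient in Example 10 can be sanity-checked numerically with central finite differences (a sketch; the test point is arbitrary):

```python
def f(x1, x2, x3):
    return 3 * x1**2 * x2**3 + x2**2 * x3**3

def grad(x1, x2, x3):
    # Analytic gradient from Example 10
    return (6 * x1 * x2**3,
            9 * x1**2 * x2**2 + 2 * x2 * x3**3,
            3 * x2**2 * x3**2)

def num_grad(x, h=1e-6):
    # Central-difference approximation of each partial derivative
    g = []
    for i in range(3):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(*xp) - f(*xm)) / (2 * h))
    return g

x = (1.0, 2.0, -1.0)
print(grad(*x))     # (48.0, 32.0, 12.0)
print(num_grad(x))  # agrees to about six decimal places
```

The analytic and numerical gradients coincide up to the O(h²) error of the central-difference scheme.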

• Remember: In the neighbourhood of a point x0, a function f(x) can be approximated by a second order Taylor series expansion

f(x) = f(x0) + f′(x0)(x − x0) + (1/2)f′′(x0)(x − x0)² + residual

where the first order derivative f′(x) of f(x) around x0 is given by

f′(x0) = lim_{(x−x0)→0} (f(x) − f(x0))/(x − x0) = lim_{∆x→0} ∆f/∆x

where ∆x = x − x0, ∆f = f(x) − f(x0). In terms of the differences,

∆f = f′(x0)∆x + (1/2)f′′(x0)(∆x)² + residual


• The second order Taylor series expansion in the neighbourhood of a point x0 is given by

f(x) = f(x0) + ∇Tf(x0)(x − x0) + (1/2)(x − x0)T ∇²f(x0) (x − x0) + residual

where ∇²f(x0) = H(x0) is called the Hessian (matrix) of f(x) at point x0, given by

H(x) = ∇²f(x) = ∇∇Tf(x) = [ ∂²f(x)/∂xi∂xj ]_{N×N}

• Definition: The directional derivative of f(x) at x0 in the direction d is defined by

∇Tf(x0) d = lim_{λ→0} (f(x0 + λd) − f(x0))/λ

• f(x) is twice differentiable at x0 ∈ X if ∇f(x0) and H(x0) exist, where H(x0) is an N × N symmetric matrix representing the Hessian of f(x) at x = x0.

• In the neighbourhood of a point x0, the second order approximation of f(x) can be given as

f(x) = f(x0) + ∇Tf(x0)(x − x0) + (1/2)(x − x0)TH(x0)(x − x0) + ‖x − x0‖² β(x0, x − x0)

where lim_{(x−x0)→0} β(x0, x − x0) = 0. Similarly,

∆f = ∇Tf(x0)∆x + (1/2)∆xTH(x0)∆x + ‖∆x‖² β(x0, ∆x)

where lim_{∆x→0} β(x0, ∆x) = 0.

• f(x) is twice differentiable on X if f(x) is twice differentiable at every x ∈ X.

Example 11:

Let f(x) = 3x1²x2³ + x2²x3³ as in the previous example, then

H(x) = ∇²f(x) = [ 6x2³       18x1x2²           0
                  18x1x2²    18x1²x2 + 2x3³    6x2x3²
                  0          6x2x3²            6x2²x3 ]


1.2.3.2 Positive Semidefinite & Positive Definite Matrices

An N × N matrix M is called

• positive definite (M ≻ 0), if xTMx > 0, ∀x ∈ RN, x ≠ 0, or, equivalently, all eigenvalues of M are positive

• positive semidefinite (M ⪰ 0), if xTMx ≥ 0, ∀x ∈ RN, or, equivalently, all eigenvalues of M are nonnegative

• negative definite (M ≺ 0), if xTMx < 0, ∀x ∈ RN, x ≠ 0, or, equivalently, all eigenvalues of M are negative

• negative semidefinite (M ⪯ 0), if xTMx ≤ 0, ∀x ∈ RN, or, equivalently, all eigenvalues of M are nonpositive

• indefinite, if ∃x, y ∈ RN with xTMx > 0 and yTMy < 0, or, equivalently, some eigenvalues of M are positive and some are negative

Example 12:

M = [2 0; 0 3] ≻ 0, positive definite

M = [8 −1; −1 1] ≻ 0, positive definite

Check xTMx or check the eigenvalues.
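The eigenvalue test can be automated; the helper below (a sketch using NumPy) classifies a symmetric matrix from the signs of its eigenvalues:

```python
import numpy as np

def definiteness(M):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    w = np.linalg.eigvalsh(M)          # eigenvalues of a symmetric matrix
    if np.all(w > 0):
        return "positive definite"
    if np.all(w >= 0):
        return "positive semidefinite"
    if np.all(w < 0):
        return "negative definite"
    if np.all(w <= 0):
        return "negative semidefinite"
    return "indefinite"

print(definiteness(np.array([[2.0, 0.0], [0.0, 3.0]])))    # positive definite
print(definiteness(np.array([[8.0, -1.0], [-1.0, 1.0]])))  # positive definite
print(definiteness(np.array([[1.0, 0.0], [0.0, -1.0]])))   # indefinite
```

Both matrices of Example 12 come out positive definite, matching the xTMx check.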

• Hint: Recall that for x ∈ R, f(x) = ax² ⇒ f′′(x) = 2a. Thus the function f(x) is convex if a > 0 or concave if a < 0.

Thus, positive/negative definiteness of the Hessian is related to convexity.

1.2.3.3 Optimality Conditions for Unconstrained Problems

Problem:

min_{x∈X} f(x)

• Theorem: x∗ is a strict local minimum if

∇f(x∗) = 0 and H(x∗) ≻ 0

• Theorem: If x∗ is a (non-strict) local minimum, then

∇f(x∗) = 0 and H(x∗) ⪰ 0


• Definition: d is a descent direction of f(x) at point x0 if

f(x0 + εd) < f(x0), ∀ε > 0 sufficiently small (ε → 0)

• Theorem: Assume that f(x) is differentiable at x0. If

∃d : ∇Tf(x0)d < 0

then ∀λ > 0 (λ → 0), f(x0 + λd) < f(x0); hence d is a descent direction of f(x) at x = x0.

Proof: We know that (from the definition of the gradient)

f(x0 + λd) = f(x0) + ∇Tf(x0)(λd) + λ‖d‖ α(x0, λd)

where lim_{λ→0} α(x0, λd) = 0. Then

(f(x0 + λd) − f(x0))/λ = ∇Tf(x0)d (< 0, given) + residual (→ 0 as λ → 0)

Hence f(x0 + λd) − f(x0) < 0 as λ → 0 (λ > 0).

• Corollary: (First order necessary optimality condition) If x∗ is a local minimum then ∇f(x∗) = 0.

Proof: If ∇f(x∗) ≠ 0, then d = −∇f(x∗) would be a descent direction and x∗ would not be a local minimum (at x∗, there would still be room for decrease in f(x)).

• Theorem: (Second order necessary optimality condition) Suppose f(x) is twice differentiable at x∗. If x∗ is a local minimum, then H(x∗) ⪰ 0.

Proof: We know that ∇f(x∗) = 0. Now, suppose H(x∗) is not positive semidefinite; then ∃d : dTH(x∗)d < 0, and

f(x∗ + λd) = f(x∗) + λ∇Tf(x∗)d + (1/2)λ²dTH(x∗)d + residual
           = f(x∗) + (1/2)λ²dTH(x∗)d + residual

since ∇f(x∗) = 0. If we rearrange,

(f(x∗ + λd) − f(x∗))/λ² = (1/2)dTH(x∗)d (< 0) + residual (→ 0 as λ → 0)

then

f(x∗ + λd) − f(x∗) < 0, ∀λ > 0, λ → 0

⇒ x∗ is not a local minimum. Contradiction!


• Note: If ∇f(x0) = 0 and H(x0) is positive semidefinite, i.e., H(x0) ⪰ 0, point x0 may not be a (local) minimum.

• Theorem: (Sufficient condition for local optimality) If ∇f(x0) = 0 and H(x0) is positive definite, i.e., H(x0) ≻ 0, then x = x0 is a strict local minimum.

Proof:

(f(x0 + λd) − f(x0))/λ² = (1/2)dTH(x0)d (> 0) + residual (→ 0 as λ → 0)

then

f(x0 + λd) − f(x0) > 0, ∀λ > 0, λ → 0, ∀d ≠ 0

⇒ x0 is a strict local minimum.

• Semidefiniteness does not guarantee a minimum or maximum (e.g. it can be a saddle point, as shown below).

• If ∇f(x0) = 0 and H(x0) is positive (negative) definite, then x0 is a strict local minimum (maximum).

Example 13:

H(x0) = [1 1; 1 4] ≻ 0

∇f(x0) = 0

Point x0 satisfies the sufficient conditions, so x0 is a strict local minimum.

Example 14: Let f(x) = x1³ + x2², then

∇f(x) = [3x1²; 2x2] and H(x) = [6x1 0; 0 2]

∇f(x) = 0 at x0 = [0; 0], but H(x0) = [0 0; 0 2] is only positive semidefinite, so x0 may or may not be a local minimum. Note that f(x0) = f(0, 0) = 0.

f(−ε, 0) = −ε³ < 0 = f(x0), ∀ε > 0 ⇒ x0 is not a local minimum.


Example 15: Let f(x) = x1⁴ + x2², then

∇f(x) = [4x1³; 2x2] and H(x) = [12x1² 0; 0 2]

∇f(x) = 0 at x0 = [0; 0] and H(x0) = [0 0; 0 2]. H(x0) is only positive semidefinite, so x0 may or may not be a local minimum. Note that f(x0) = f(0, 0) = 0.

∀x, f(x) ≥ 0 = f(x0) ⇒ x0 is a local minimum.

Example 16:

min f(x) = min [ log Σ_{i=1}^{N} exp(aiTx + bi) ]

The first order optimality condition is ∇f(x) = 0:

∇f(x) = ( Σ_{i=1}^{N} ai exp(aiTx + bi) ) / ( Σ_{i=1}^{N} exp(aiTx + bi) ) = 0

But there is no analytical solution, so the solution can be obtained numerically by an iterative algorithm.
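As a sketch of such an iterative solution, plain gradient descent (introduced in Chapter 2) can be applied to a small log-sum-exp instance; the ai and bi below are made-up symmetric data, chosen so that the minimizer is x∗ = 0:

```python
import numpy as np

# Made-up data: rows of `a` are the a_i^T, chosen symmetric so that x* = 0
a = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.zeros(4)

def grad(x):
    # Gradient of log sum exp(a_i^T x + b_i): a weighted average of the a_i
    w = np.exp(a @ x + b)
    return a.T @ w / w.sum()

x = np.array([2.0, 1.0])
for _ in range(200):            # gradient descent with a constant stepsize
    x = x - 0.5 * grad(x)

print(x, np.linalg.norm(grad(x)))  # iterates approach the stationary point
```

After a few hundred iterations the gradient norm is essentially zero, i.e., ∇f(x) = 0 is solved numerically.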

1.3 Convex Sets and Functions

1.3.1 Convex Sets

• Definition: A set C ⊆ RN is said to be convex if

x1, x2 ∈ C and 0 ≤ θ ≤ 1 ⇒ θx1 + (1 − θ)x2 ∈ C

θx1 + (1 − θ)x2 defines a line segment between the points x1 and x2.

Some simple convex and nonconvex sets are shown below.


Left: The hexagon, which includes its boundary (shown darker), is convex. Middle: The kidney-shaped set is not convex, since the line segment between the two points in the set shown as dots is not contained in the set. Right: The square contains some boundary points but not others, and is not convex.

• Convex Combination: Given points x1, x2, x3, . . . , xk and θi ≥ 0 with Σ_{i=1}^{k} θi = 1, the point x = θ1x1 + θ2x2 + · · · + θkxk is a convex combination of the xi. If every such convex combination of points of C lies in C, then C is a convex set.

• Convex Hull: The convex hull of a set S is the set of all convex combinations of the points in S.

For example, the convex hulls of two sets in R2 are given below.

Left: The convex hull of a set of fifteen points (shown as dots) is the pentagon (shown shaded). Right: The convex hull of the kidney-shaped set in the middle of the convex set examples is the whole shaded set.

1.3.2 Convex and Concave Functions

• Definition: A function f : C → R is convex if

dom f = C ⊆ RN

is a convex set and f satisfies Jensen's inequality given below:

f(θx1 + (1 − θ)x2) ≤ θf(x1) + (1 − θ)f(x2)

∀x1, x2 ∈ C and 0 ≤ θ ≤ 1.


• Definition: A function f : C → R is concave if

dom f = C ⊆ RN

is a convex set and

f(θx1 + (1 − θ)x2) ≥ θf(x1) + (1 − θ)f(x2)

∀x1, x2 ∈ C and 0 ≤ θ ≤ 1.

• Note: If f is convex (concave), then −f is concave (convex).

• Note: f is strictly convex or strictly concave if the corresponding inequalities hold strictly. For example, f is strictly convex if

dom f = C ⊆ RN

is a convex set and

f(θx1 + (1 − θ)x2) < θf(x1) + (1 − θ)f(x2)

∀x1, x2 ∈ C, x1 ≠ x2, and 0 < θ < 1.

• Note: An affine function

f(x) = aTx + b (or f(x) = Ax + b)

with dom f convex satisfies

f(θx1 + (1 − θ)x2) = θf(x1) + (1 − θ)f(x2)

Hence it can be considered as convex or concave depending on the problem.

Examples on R: (scalar)


• Convex:

- affine: ax + b, on R, for any a, b ∈ R
- exponential: e^{ax}, on R, for any a ∈ R
- powers: x^α, on R++, for α ≥ 1 or α ≤ 0
- powers of absolute value: |x|^α, on R, for α ≥ 1
- negative entropy: x log x, on R++

• Concave:

- affine: ax + b, on R, for any a, b ∈ R
- powers: x^α, on R++, for 0 ≤ α ≤ 1
- logarithm: log x, on R++

Examples on RN: (vectors)

• All norms (i.e. Lp-norms) are convex:

‖x‖p = ( Σ_{i=1}^{N} |xi|^p )^{1/p}

for p ≥ 1.

• Affine functions are convex and concave depending on the problem:

f(x) = aTx + b

1.3.3 First and Second Order Conditions for Convexity

1.3.3.1 First Order Condition for Convexity

• Theorem: If f(x) is differentiable (i.e. ∇f(x) exists ∀x ∈ dom f(x)) and dom f(x) is convex, then f(x) is convex iff

f(x) ≥ f(x0) + ∇Tf(x0)(x − x0), ∀x, x0 ∈ dom f(x)

As shown in the figure below, the first-order approximation of a convex function f(x) is a global underestimator.


1.3.3.2 Second Order Conditions for Convexity

• Theorem: If f(x) is twice differentiable (i.e. H(x) exists ∀x ∈ dom f(x)) and dom f(x) is convex, then f(x) is convex iff

H(x) ⪰ 0, ∀x ∈ dom f(x)

• Theorem: If S is a convex set and f(x) : S → R is a convex function with local minimum x∗, then x∗ is a global minimum of f(x) over S.

• If f(x) is (strictly) convex, a local minimum is the (unique) global minimum.

• If f(x) is (strictly) concave, a local maximum is the (unique) global maximum.

• If f(x) is convex, then the following global optimality condition is both necessary and sufficient.

Theorem: Let f(x) : X → R be convex and differentiable on X. Then x0 is a global minimum iff ∇f(x0) = 0.

• Example 17: f(x) = (1/2)x1² + x1x2 + 2x2² − 4x1 − 4x2 − x2³ with dom f(x) = {(x1, x2) | x2 < 0}

∇f(x) = [ x1 + x2 − 4 ;  x1 + 4x2 − 4 − 3x2² ] and H(x) = [ 1 1 ; 1 4 − 6x2 ]

H(x) ≻ 0 on dom f(x), thus f(x) is convex.

Further Examples:

- Quadratic function

f(x) = (1/2)xTPx + qTx + r

∇f(x) = Px + q

H(x) = P

f(x) is convex if P ⪰ 0.

- Least-squares objective function

f(x) = ‖Ax − b‖² = (Ax − b)T(Ax − b) = xTATAx − 2bTAx + bTb

∇f(x) = 2AT(Ax − b)

H(x) = 2ATA

f(x) is convex for any A (here f(x) is a quadratic function with P = 2ATA, q = −2ATb and r = bTb).
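A minimal numeric illustration (with a made-up 4×2 system): setting ∇f(x) = 2AT(Ax − b) = 0 gives the normal equations ATAx = ATb, whose solution is the unique global minimum since f is convex:

```python
import numpy as np

# Over-determined system: fit a line y ≈ x[0] + x[1]*t to four points (made-up data)
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

# Normal equations A^T A x = A^T b, from the zero-gradient condition
x = np.linalg.solve(A.T @ A, A.T @ b)
print(x)  # [3.5 1.4]
```

The result matches what a dedicated least-squares routine returns, since both solve the same convex problem.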


- Log-sum-exp: The function

f(x) = log Σ_{i=1}^{N} e^{xi}

is convex over RN. This function can be interpreted as a differentiable (in fact, analytic) approximation of the max function, since

max_i xi ≤ f(x) ≤ max_i xi + log N

for all x.

- Geometric mean

f(x) = ( Π_{i=1}^{N} xi )^{1/N}

is concave on RN++.

1.3.3.3 Operations that Preserve Convexity

• Nonnegative multiple:

αf(x), for α ≥ 0

• Sum (including infinite sums and integrals):

f1(x) + f2(x)

• Composition with an affine function:

f(Ax + b)

is convex if f(x) is convex.

Ex: Log barrier

f(x) = −Σ_{i=1}^{M} log(bi − aiTx), dom f(x) = {x | aiTx < bi, i = 1, . . . , M}

and norm of an affine function

f(x) = ‖Ax + b‖

• Pointwise maximum:

f(x) = max {f1(x), f2(x), . . . , fi(x), . . . , fM(x)}

is convex if the fi(x) are convex.


Ex: Piecewise-linear function

f(x) = max_{i=1,...,M} (aiTx + bi)

• Pointwise supremum:

g(x) = sup_{y∈A} f(x, y)

is convex if f(x, y) is convex in x for ∀y ∈ A.

Ex: Distance to the farthest point in a (convex) set C

f(x) = sup_{y∈C} ‖x − y‖

• Pointwise infimum:

g(x) = inf_{y∈C} f(x, y)

is convex if f(x, y) is convex in (x, y) and C is a convex set.

Ex: Distance to a (convex) set S

d(x, S) = inf_{y∈S} ‖x − y‖

• Composition with scalar functions:

f(x) = h(g(x))

with g : RN → R and h : R → R is convex if g is convex, h is convex and h is nondecreasing, or g is concave, h is convex and h is nonincreasing.

Ex: e^{g(x)} is convex if g is convex. Similarly, 1/g(x) is convex if g is concave and positive.


1.3.4 Quadratic Functions, Forms and Optimization

• Definition: A quadratic function has the form (f : RN → R)

f(x) = (1/2)xTQx + cTx + r

where Q ∈ RN×N, x, c ∈ RN, r ∈ R and Q is a symmetric matrix.

• Quadratic optimization problem (Quadratic Program)

min (1/2)xTQx + cTx + r
s.t. x ∈ RN

Ex: Least-squares problem (approximation of an over-determined linear system Ax = b, where A is an M × N matrix and M > N, i.e., the number of equations is more than the number of variables)

min ‖Ax − b‖2² = xTATAx − 2bTAx + bTb
s.t. x ∈ RN

• Property: Assuming f(x) : RN → R is twice differentiable at x = x0, f(x) can be approximated by a quadratic function in the neighbourhood of x0 (a very useful property for Newton's method).

min f(x) ≅ f(x0) + ∇Tf(x0)(x − x0) + (1/2)(x − x0)TH(x0)(x − x0)
s.t. x ∈ RN

is a quadratic optimization problem.

Solution of the Quadratic Problem (QP)

f(x) = (1/2)xTQx + cTx + r

∇f(x) = Qx + c

H(x) = Q

• Theorem: The function f(x) = (1/2)xTQx + cTx + r is convex iff Q ⪰ 0.

Proof: Apply Jensen's inequality.

• Corollary:

f(x) is strictly convex iff Q ≻ 0, and convex iff Q ⪰ 0.

f(x) is strictly concave iff Q ≺ 0, and concave iff Q ⪯ 0.

f(x) is neither convex nor concave iff Q is indefinite.


1.3.4.1 Optimality Conditions

• Theorem: Suppose Q is a symmetric positive semidefinite (SPSD) matrix. Then f(x) = (1/2)xTQx + cTx + r has its minimum at x∗ iff x∗ satisfies

∇f(x∗) = Qx∗ + c = 0

Proof: Express f(x) as f(x) = f(x∗ + (x − x∗)) and show that f(x) ≥ f(x∗) ∀x ∈ RN by using Qx∗ + c = 0.

• Ex:

f(x) = xTx = ‖x‖2²

f(x) = (x − a)T(x − a) = ‖x − a‖2²

f(x) = (x − a)TP(x − a) = ‖x − a‖P², where P is SPD.
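A small numeric check of the theorem (Q and c below are made up, with Q symmetric positive definite): the minimizer solves the linear system Qx∗ = −c, and the gradient vanishes there:

```python
import numpy as np

Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive definite
c = np.array([-1.0, 1.0])

def f(x):
    return 0.5 * x @ Q @ x + c @ x       # r omitted: it does not move the minimizer

x_star = np.linalg.solve(Q, -c)          # grad f(x*) = Q x* + c = 0
print(x_star, np.linalg.norm(Q @ x_star + c))
```

Perturbing x∗ in any direction increases f, consistent with x∗ being the global minimum of a convex quadratic.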

1.3.4.2 Characteristics of Symmetric Matrices

• Definition: A matrix M is an orthonormal matrix if MT = M−1.

• Corollary: If M is orthonormal and y = Mx, then

‖y‖2² = yTy = xT(MTM)x = xTx = ‖x‖2² ⇒ ‖y‖2 = ‖x‖2

Recall: Mx = λx, where λ ∈ R is an eigenvalue of M, and x ∈ RN, x ≠ 0, is a corresponding eigenvector. How many eigenvalues?

Recall: A matrix Q is called symmetric if Q = QT.

• Proposition: If Q is a real symmetric matrix, then all of its eigenvalues are real.

• Proposition: If Q is a real symmetric matrix, then its eigenvectors corresponding to different eigenvalues are orthogonal.

• Proposition: If Q is a symmetric matrix with rank N, then it has N distinct eigenvectors which constitute an orthonormal basis for RN.

• Proposition: If Q is SPSD, its eigenvalues are nonnegative.

• Proposition: If Q is an N × N square matrix, then the trace and determinant of Q are equal to the sum and product of its eigenvalues, respectively:

tr(Q) = Σ_{i=1}^{N} Qi,i = Σ_{i=1}^{N} λi

det(Q) = Π_{i=1}^{N} λi


• Proposition: (Eigendecomposition) If Q is a symmetric matrix, then Q = RDRT. The columns of the orthonormal matrix R are the eigenvectors of Q, and D is a diagonal matrix with the eigenvalues of Q on the main diagonal.

• Proposition: If Q is SPSD, then Q = MTM, where M = D1/2RT is a square root of Q.

• Proposition: If Q is SPSD and xTQx = 0, then Qx = 0.

• Proposition: If a symmetric matrix Q is positive definite (i.e. Q ≻ 0), then Q is nonsingular (i.e. its inverse exists) as det Q > 0.

• Proposition: If Q is positive definite (i.e. Q ≻ 0), then any principal sub-matrix of Q is also positive definite.

• Proposition: If Q is positive semidefinite (i.e. Q ⪰ 0), then any principal sub-matrix of Q is also positive semidefinite.

• Proposition: If Q is a symmetric matrix with Q ≻ 0 and M = [Q c; cT b], then M ≻ 0 iff b > cTQ−1c.


Chapter 2

Unconstrained Optimization and Descent Methods

2.1 Unconstrained Optimization

• The aim is

min f(x)

where f(x) : RN → R is twice differentiable.

• The problem is solvable, i.e., a finite optimal point x∗ exists.

• The (finite) optimal value is given by

p∗ = inf_x f(x) = f(x∗) (> −∞)

• Example 1: Quadratic program

min_{x∈RN} f(x) = (1/2)xTQx − bTx + c

where Q ∈ RN×N is symmetric, b ∈ RN and c ∈ R.

Necessary conditions:

∇f(x∗) = Qx∗ − b = 0

H(x∗) = Q ⪰ 0 (PSD)

- Q ≺ 0 ⇒ f(x) has no local minimum.
- Q ≻ 0 ⇒ x∗ = Q−1b is the unique global minimum.
- Q ⪰ 0 ⇒ either no solution or an infinite number of solutions.


• Example 2: Consider

min_{x1,x2∈R} f(x1, x2) = (1/2)(αx1² + βx2²) − x1

Here, let us first express the above equation in the quadratic program form with

Q = [α γ; γ β],  b = [1; 0]

where γ ∈ R; for simplicity we can take γ = 0. So,

- If α > 0 and β > 0 (i.e., Q ≻ 0): x∗ = (1/α, 0) is the unique global minimum.
- If α > 0 and β = 0 (i.e., Q ⪰ 0): an infinite number of solutions, {(1/α, y), y ∈ R}.
- If α = 0 and β > 0 (i.e., Q ⪰ 0): no solution.
- If α < 0 and β > 0, or α > 0 and β < 0 (i.e., Q is indefinite): no solution.

[Surface plots of f(x1, x2) over (x1, x2) for the four cases: α > 0, β > 0; α > 0, β = 0; α = 0, β > 0; α > 0, β < 0.]

• Two possibilities:

- {f(x) : x ∈ X} is unbounded below ⇒ no optimal solution.
- {f(x) : x ∈ X} is bounded below ⇒ a global minimum exists, provided it is attained at a finite point (‖x∗‖ ≠ ∞).


Then, unconstrained minimization methods

- produce a sequence of points x(k) ∈ dom f(x) for k = 0, 1, . . . with

f(x(k)) → p∗

- can be interpreted as iterative methods for solving the optimality condition

∇f(x∗) = 0

2.2 Descent Methods

2.2.1 Motivation

• If ∇f(x) ≠ 0, there is an interval (0, δ) of stepsizes such that

f(x − α∇f(x)) < f(x), ∀α ∈ (0, δ)

• If d makes an angle with ∇f(x) that is greater than 90°, i.e.,

∇Tf(x) d < 0

then there exists an interval (0, δ) of stepsizes such that

f(x + αd) < f(x), ∀α ∈ (0, δ)

• Definition: The descent direction d is selected such that

∇Tf(x) d < 0


• Proposition: For a descent method,

f(x(k+1)) < f(x(k))

except when x(k) = x∗.

• Definition: The minimizing sequence is defined as

x(k+1) = x(k) + α(k)d(k)

where the scalar α(k) ∈ (0, δ) is the stepsize (or step length) at iteration k, and d(k) ∈ RN is the step or search direction.

- How to find the optimum α(k)? Line search algorithm.
- How to find the optimum d(k)? Depends on the descent algorithm, e.g., d(k) = −∇f(x(k)).

2.2.2 General Descent Method

• Given a starting point x(0) ∈ dom f(x),

repeat

1. Determine a descent direction d(k),
2. Line search: choose a stepsize α(k) > 0,
3. Update: x(k+1) = x(k) + α(k)d(k),

until the stopping criterion is satisfied.

• Example 3: Simplest method: Gradient Descent

x(k+1) = x(k) − α(k)∇f(x(k)), k = 0, 1, . . .

Note that here the descent direction is d(k) = −∇f(x(k)).

• Example 4: Most sophisticated method: Newton's Method

x(k+1) = x(k) − α(k)H−1(x(k))∇f(x(k)), k = 0, 1, . . .

Note that here the descent direction is d(k) = −H−1(x(k))∇f(x(k)).
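A side-by-side sketch of the two updates on a small quadratic f(x) = (1/2)xTQx − bTx (data made up): gradient descent needs many iterations, while one Newton step with α = 1 lands on x∗ exactly, because H(x) = Q for a quadratic:

```python
import numpy as np

Q = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([3.0, 1.0])                 # minimizer is x* = Q^{-1} b = (1, 1)

def grad(x):
    return Q @ x - b

x_gd = np.zeros(2)
for _ in range(100):                     # gradient descent, constant alpha = 0.1
    x_gd = x_gd - 0.1 * grad(x_gd)

x_nt = np.zeros(2) - np.linalg.solve(Q, grad(np.zeros(2)))  # one Newton step

print(x_gd)  # close to (1, 1)
print(x_nt)  # exactly (1, 1)
```

On non-quadratic functions Newton's method is no longer exact, but near the optimum it typically converges much faster than gradient descent.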


2.2.3 Line Search

• Suppose f(x) is a continuously differentiable convex function and we want to find

α(k) = argmin_α f(x(k) + αd(k))

for a given descent direction d(k). Now, let

h(α) = f(x(k) + αd(k))

where h(α) : R → R is a convex function in the scalar variable α; then the problem becomes

α(k) = argmin_α h(α)

Then, as h(α) is convex, it has a minimum where

h′(α(k)) = ∂h(α(k))/∂α = 0

where h′(α) is given by

h′(α) = ∂h(α)/∂α = ∇Tf(x(k) + αd(k)) d(k) (using the chain rule)

Therefore, since d(k) is a descent direction (i.e., ∇Tf(x(k)) d(k) < 0), we have h′(0) < 0. Also, h′(α) is a monotone increasing function of α because h(α) is convex. Hence, search for h′(α(k)) = 0.


Choice of stepsize:

• Constant stepsize:

α(k) = c (constant)

• Diminishing stepsize:

α(k) → 0, while satisfying Σ_{k=0}^{∞} α(k) = ∞.

• Exact line search (analytic):

α(k) = argmin_α f(x(k) + αd(k))

2.2.3.1 Exact Line Search

Exact line search: (for quadratic programs)

• If f(x) is a quadratic function, then h(α) is also a quadratic function, i.e.,

h(α) = f(x(k) + αd(k)) = f(x(k)) + α∇Tf(x(k))d(k) + (α²/2) d(k)TH(x(k))d(k)

The exact line search solution α0, which minimizes the quadratic equation above, i.e., ∂h(α0)/∂α = 0, is given by

α0 = α(k) = − ∇Tf(x(k))d(k) / ( d(k)TH(x(k))d(k) )

- If f(x) is a higher order function, then the second order Taylor series approximation can be used for the exact line search algorithm (which then gives an approximate solution).
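The formula for α0 can be verified directly on a small quadratic (Q, b, x and d below are made up): after the exact step, h′(α0) = ∇Tf(x + α0d) d vanishes:

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 2.0]])   # Hessian of the quadratic, H(x) = Q
b = np.array([1.0, 1.0])

def grad(x):
    return Q @ x - b

x = np.array([2.0, -1.0])
d = -grad(x)                             # steepest-descent direction
alpha = -(grad(x) @ d) / (d @ Q @ d)     # exact stepsize for a quadratic
x_new = x + alpha * d

print(alpha, grad(x_new) @ d)  # h'(alpha) = 0 at the exact minimizer along d
```

Since d is a descent direction, the numerator −∇Tf(x)d is positive and the denominator dTQd > 0 for Q ≻ 0, so α0 > 0.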


2.2.3.2 Bisection Algorithm

Bisection Algorithm:

• Assume h(α) is convex; then h′(α) is a monotonically increasing function. Suppose that we know a value ᾱ such that h′(ᾱ) > 0.

- Since h′(0) < 0, α = (0 + ᾱ)/2 is the next test point
- If h′(α) = 0, α(k) = α is found (very difficult to achieve)
- If h′(α) > 0, narrow down the search interval to (0, α)
- If h′(α) < 0, narrow down the search interval to (α, ᾱ)

Algorithm:

1. Set k = 0, αℓ = 0 and αu = ᾱ
2. Set α = (αℓ + αu)/2 and calculate h′(α)
3. If h′(α) > 0 ⇒ αu = α, k = k + 1, go to step 2
4. If h′(α) < 0 ⇒ αℓ = α, k = k + 1, go to step 2
5. If h′(α) = 0 ⇒ stop.

Proposition: After every iteration, the current interval [αℓ, αu] contains α∗ with h′(α∗) = 0.

Proposition: At the k-th iteration, the length of the current interval is

L = (1/2)^k ᾱ

Proposition: A value of α such that |α − α∗| < ε can be found in at most ⌈log2(ᾱ/ε)⌉ steps.

• How to find ᾱ such that h′(ᾱ) > 0?

1. Make an initial guess of ᾱ
2. If h′(ᾱ) < 0 ⇒ ᾱ = 2ᾱ, go to step 2
3. Stop.

• Stopping criterion for the Bisection Algorithm: h′(α) → 0 as k → ∞, but the algorithm may not converge quickly.

Some relevant stopping criteria:

1. Stop after k = K iterations (K: user defined)

2. Stop when |α_u − α_ℓ| ≤ ε (ε: user defined)

3. Stop when |h′(α)| ≤ ε (ε: user defined)

In general, the 3rd criterion is the best.
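A minimal sketch of the bisection procedure above, including the doubling search for ᾱ and stopping criterion 3; the test function h(α) = (α − 1)², with minimizer α∗ = 1, is an illustrative choice:

```python
# Bisection line search on h'(alpha) for the convex h(alpha) = (alpha - 1)^2,
# so h'(alpha) = 2*(alpha - 1), h'(0) < 0, and alpha* = 1.

def hprime(alpha):
    return 2.0 * (alpha - 1.0)

# Step 0: find alpha_bar with h'(alpha_bar) > 0 by doubling an initial guess.
alpha_bar = 0.3
while hprime(alpha_bar) < 0:
    alpha_bar *= 2.0

# Bisection on [alpha_l, alpha_u]; stop when |h'(alpha)| <= eps (criterion 3),
# with an iteration cap as a safety net (criterion 1).
alpha_l, alpha_u = 0.0, alpha_bar
eps = 1e-8
for _ in range(100):
    alpha = 0.5 * (alpha_l + alpha_u)
    if abs(hprime(alpha)) <= eps:
        break
    if hprime(alpha) > 0:
        alpha_u = alpha          # narrow to (alpha_l, alpha)
    else:
        alpha_l = alpha          # narrow to (alpha, alpha_u)
```

Each pass halves the interval, matching the (1/2)^k ᾱ interval-length proposition above.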

2.2.3.3 Backtracking Line Search

Backtracking line search

For small enough α:

f(x₀ + αd) ≈ f(x₀) + α ∇ᵀf(x₀) d < f(x₀) + γα ∇ᵀf(x₀) d

where 0 < γ < 0.5, since ∇ᵀf(x₀) d < 0.

• Algorithm: Backtracking line search

Given a descent direction d for f(x) at x₀ ∈ dom f(x)

α = 1

while f(x₀ + αd) > f(x₀) + γα ∇ᵀf(x₀) d

    α = βα

end

where 0 < γ < 0.5 and 0 < β < 1.

- At each iteration the step size α is reduced by the factor β (β ≈ 0.1: coarse search, β ≈ 0.8: fine search).

- γ can be interpreted as the fraction of the decrease in f(x) predicted by linear extrapolation (γ = 0.01 to 0.3 typically, meaning that we accept a decrease in f(x) between 1% and 30% of the prediction).

- The backtracking exit inequality

f(x₀ + αd) ≤ f(x₀) + γα ∇ᵀf(x₀) d

holds for α ∈ [0, α₀]. Then, the line search stops with a step length α:

i. α = 1 if α₀ ≥ 1

ii. α ∈ [βα₀, α₀] otherwise.

In other words, the step length obtained by backtracking line search satisfies

α ≥ min {1, βα₀}.
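The backtracking loop above can be sketched directly in one dimension; the function f(x) = x⁴, the starting point and the parameter values are illustrative choices:

```python
# Backtracking line search on f(x) = x^4 at x0 = 1 with d = -f'(x0).

def f(x):
    return x**4

def fprime(x):
    return 4.0 * x**3

x0 = 1.0
d = -fprime(x0)              # descent direction, here -4
gamma, beta = 0.3, 0.5       # 0 < gamma < 0.5, 0 < beta < 1

alpha = 1.0
# shrink alpha until the sufficient-decrease (exit) inequality holds
while f(x0 + alpha*d) > f(x0) + gamma * alpha * fprime(x0) * d:
    alpha *= beta
```

The loop rejects α = 1, 0.5, 0.25 (each overshoots the predicted decrease) and accepts α = 0.125, illustrating how the accepted step always lies in [βα₀, α₀] when α₀ < 1.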

2.2.4 Convergence

• Convergence

Definition: Let ‖·‖ be a norm on R^N. Let {x^(k)}_{k=0}^∞ be a sequence of vectors in R^N. Then, the sequence {x^(k)}_{k=0}^∞ is said to converge to a limit x∗ if

∀ε > 0, ∃N_ε ∈ Z₊ : (k ∈ Z₊ and k ≥ N_ε) ⇒ (‖x^(k) − x∗‖ < ε)

If the sequence {x^(k)}_{k=0}^∞ converges to x∗, then we write

lim_{k→∞} x^(k) = x∗

and call x∗ the limit of the sequence {x^(k)}_{k=0}^∞.

- N_ε may depend on ε

- For a given distance ε, after N_ε iterations all the subsequent iterates are within this distance ε of x∗.

This definition does not characterize how fast the convergence is (i.e., the rate of convergence).

• Rate of Convergence


Definition: Let ‖·‖ be a norm on R^N. A sequence {x^(k)}_{k=0}^∞ that converges to x∗ ∈ R^N is said to converge at rate R ∈ R₊₊ and with rate constant δ ∈ R₊₊ if

lim_{k→∞} ‖x^(k+1) − x∗‖ / ‖x^(k) − x∗‖^R = δ

- If R = 1, 0 < δ < 1, then rate is linear

- If 1 < R < 2, 0 < δ <∞, then rate is called super-linear

- If R = 2, 0 < δ <∞, then rate is called quadratic

The rate of convergence R is sometimes called the asymptotic convergence rate. It may not apply for the early iterates, but applies asymptotically as k → ∞.

Example 5: The sequence {a^k}_{k=0}^∞, 0 < a < 1, converges to 0.

lim_{k→∞} ‖a^(k+1) − 0‖ / ‖a^k − 0‖¹ = a ⇒ R = 1, δ = a

Example 6: The sequence {a^(2^k)}_{k=0}^∞, 0 < a < 1, converges to 0.

lim_{k→∞} ‖a^(2^(k+1)) − 0‖ / ‖a^(2^k) − 0‖² = 1 ⇒ R = 2, δ = 1
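Examples 5 and 6 can be checked numerically; a = 0.5 is an illustrative choice:

```python
# The sequence a^k converges linearly (R = 1, delta = a), while a^(2^k)
# converges quadratically (R = 2, delta = 1).

a = 0.5
lin = [a**k for k in range(1, 20)]        # a^k
quad = [a**(2**k) for k in range(1, 8)]   # a^(2^k)

# ratio ||x_(k+1) - 0|| / ||x_k - 0||^R
lin_ratios = [lin[k+1] / lin[k] for k in range(len(lin) - 1)]         # R = 1
quad_ratios = [quad[k+1] / quad[k]**2 for k in range(len(quad) - 1)]  # R = 2
```

The linear-rate ratios stay at a = 0.5, while the quadratic-rate ratios stay at 1, exactly as the two limits above predict.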

2.3 Gradient Descent (GD) Method

• First-order Taylor series expansion at x₀ gives us

f(x₀ + αd) ≈ f(x₀) + α ∇ᵀf(x₀) d.

This approximation is valid for α‖d‖ → 0.

• We want to choose d so that ∇ᵀf(x₀) d is as small as (as negative as) possible for maximum descent.


• If we normalize d, i.e., ‖d‖ = 1, then the normalized direction

d = − ∇f(x₀) / ‖∇f(x₀)‖

makes the smallest inner product with ∇f(x₀).

• Then, the unnormalized direction

d = −∇f(x₀)

is called the direction of gradient descent (GD) at the point x₀.

• d is a descent direction as long as ∇f(x₀) ≠ 0.

• Algorithm: Gradient Descent Algorithm

Given a starting point x^(0) ∈ dom f(x)

repeat

1. d^(k) = −∇f(x^(k))

2. Line search: Choose step size α^(k) via a line search algorithm

3. Update: x^(k+1) = x^(k) + α^(k) d^(k)

until stopping criterion is satisfied

- A typical stopping criterion is ‖∇f(x)‖ < ε, with ε small.
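A minimal sketch of the full gradient descent algorithm with a backtracking line search; the quadratic objective f(x) = ½(x₁² + 10x₂²), starting point and parameter values are illustrative choices:

```python
# Gradient descent with backtracking line search and stopping criterion
# ||grad f(x)|| < eps.

def f(x):
    return 0.5 * (x[0]**2 + 10.0 * x[1]**2)

def grad(x):
    return [x[0], 10.0 * x[1]]

def norm(v):
    return (v[0]**2 + v[1]**2) ** 0.5

x = [5.0, 1.0]
gamma, beta, eps = 0.3, 0.5, 1e-8
iters = 0
while norm(grad(x)) >= eps and iters < 10000:
    g = grad(x)
    d = [-g[0], -g[1]]                         # 1. d^(k) = -grad f(x^(k))
    slope = g[0]*d[0] + g[1]*d[1]              # = -||g||^2 < 0
    alpha = 1.0                                # 2. backtracking line search
    while f([x[0]+alpha*d[0], x[1]+alpha*d[1]]) > f(x) + gamma*alpha*slope:
        alpha *= beta
    x = [x[0] + alpha*d[0], x[1] + alpha*d[1]] # 3. update
    iters += 1
```

The iteration cap is only a safety net; for this strongly convex quadratic the gradient-norm criterion triggers long before it.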

2.3.1 Convergence Analysis

• The Hessian matrix H(x) is assumed to be bounded as

1. mI ⪯ H(x), i.e., (H(x) − mI) ⪰ 0, so that

yᵀH(x)y ≥ m‖y‖², ∀y ∈ R^N

2. H(x) ⪯ MI, i.e., (MI − H(x)) ⪰ 0, so that

yᵀH(x)y ≤ M‖y‖², ∀y ∈ R^N

for all x ∈ dom f(x).


Note that the condition number of a matrix is given by the ratio of its largest and smallest eigenvalues, e.g.,

κ(H(x)) = |max λᵢ / min λᵢ| = M/m

If the condition number is close to one, the matrix is well-conditioned, which means its inverse can be computed with good accuracy. If the condition number is large, then the matrix is said to be ill-conditioned.

• Lower Bound: mI ⪯ H(x)

For x, y ∈ dom f(x),

f(y) = f(x) + ∇ᵀf(x)(y − x) + (1/2)(y − x)ᵀH(z)(y − x)

for some z on the line segment [x, y], where H(z) ⪰ mI. Thus,

f(y) ≥ f(x) + ∇ᵀf(x)(y − x) + (m/2)‖y − x‖²

- If m = 0, then the inequality characterizes convexity.

- If m > 0, then we have a better lower bound for f(y).

The right-hand side is convex in y; its minimum is achieved at

y₀ = x − (1/m)∇f(x)

Then, for all y ∈ dom f,

f(y) ≥ f(x) + ∇ᵀf(x)(y₀ − x) + (m/2)‖y₀ − x‖² = f(x) − (1/2m)‖∇f(x)‖²

When y = x∗,

f(x∗) = p∗ ≥ f(x) − (1/2m)‖∇f(x)‖²

- A stopping criterion:

f(x) − p∗ ≤ (1/2m)‖∇f(x)‖²

• Upper Bound: H(x) ⪯ MI

For any x, y ∈ dom f(x), using derivations similar to the lower bound, we arrive at

f(y) ≤ f(x) + ∇ᵀf(x)(y − x) + (M/2)‖y − x‖²

Evaluating the right-hand side at its minimizer y₀ = x − (1/M)∇f(x), and noting f(x∗) ≤ f(y₀), we get

f(x∗) = p∗ ≤ f(x) − (1/2M)‖∇f(x)‖²


2.3.1.1 Convergence of GD with Exact Line Search

• Convergence of GD using exact line search

For the exact line search, let us use the second-order approximation for f(x^(k+1)):

f(x^(k+1)) = f(x^(k) − α∇f(x^(k)))
           ≅ f(x^(k)) − α‖∇f(x^(k))‖² + (α²/2) ∇ᵀf(x^(k)) H(x^(k)) ∇f(x^(k))

using ∇ᵀf(x^(k))∇f(x^(k)) = ‖∇f(x^(k))‖², with H(x^(k)) ⪯ MI. This expression is quadratic in α.

Normally, the exact line search solution α₀, which minimizes the quadratic above, is given by

α₀ = ∇ᵀf(x^(k))∇f(x^(k)) / (∇ᵀf(x^(k)) H(x^(k)) ∇f(x^(k)))

- However, for the convergence analysis let us use the upper bound of the second-order approximation:

f(x^(k+1)) ≤ f(x^(k)) − α‖∇f(x^(k))‖² + (Mα²/2)‖∇f(x^(k))‖²

Find α′₀ such that the upper bound of f(x^(k) − α∇f(x^(k))) is minimized over α.

The upper-bound (right-hand side) expression is quadratic in α, hence minimized for

α′₀ = 1/M

with the minimum value

f(x^(k)) − (1/2M)‖∇f(x^(k))‖²

Then, for α′₀,

f(x^(k+1)) ≤ f(x^(k)) − (1/2M)‖∇f(x^(k))‖²

Subtract p∗ from both sides:

f(x^(k+1)) − p∗ ≤ f(x^(k)) − p∗ − (1/2M)‖∇f(x^(k))‖²

We know that

f(x^(k)) − p∗ ≤ (1/2m)‖∇f(x^(k))‖² ⇒ ‖∇f(x^(k))‖² ≥ 2m(f(x^(k)) − p∗)

Then, substituting this result into the above inequality,

f(x^(k+1)) − p∗ ≤ (f(x^(k)) − p∗) − (m/M)(f(x^(k)) − p∗) = (1 − m/M)(f(x^(k)) − p∗)


or

(f(x^(k+1)) − p∗) / (f(x^(k)) − p∗) ≤ (1 − m/M) = c ≤ 1   (since m/M ≤ 1)

- Rate of convergence is unity (i.e., R = 1) ⇒ linear convergence

- Upper limit of the rate constant is (1 − m/M)

• Number of steps? Apply the above inequality recursively:

f(x^(k)) − p∗ ≤ c^k (f(x^(0)) − p∗)

i.e., f(x^(k)) → p∗ as k → ∞, since 0 ≤ c < 1. Thus, convergence is guaranteed.

- If m = M ⇒ c = 0, then convergence occurs in one iteration.

- If m ≪ M ⇒ c → 1, then convergence is slow.

(f(x^(k)) − p∗) ≤ ε is achieved after at most

K = log([f(x^(0)) − p∗]/ε) / log(1/c)

iterations.

- The numerator is small when the initial point is close to x∗ (K gets smaller).

- The numerator increases as the accuracy increases (i.e., ε decreases) (K gets larger).

- The denominator decreases linearly with m/M (the reciprocal of the condition number), since c = (1 − m/M), i.e., log(1/c) = −log(1 − m/M) ≈ m/M (using log(x) = log(x₀) + (1/x₀)(x − x₀) − (1/2x₀²)(x − x₀)² + ··· with x₀ = 1).

- Well-conditioned Hessian, m/M → 1 ⇒ denominator is large (K gets smaller).

- Ill-conditioned Hessian, m/M → 0 ⇒ denominator is small (K gets larger).
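The dependence of the iteration count on the condition number can be seen numerically: run GD with exact line search on f(x) = ½(x₁² + Mx₂²), where m = 1 so κ = M, and count iterations until the gradient is small. The starting point and the two values of M are illustrative choices:

```python
# Iteration count of exact-line-search GD vs. condition number kappa = M/m.

def gd_iterations(M, x=(10.0, 1.0), eps=1e-6, cap=100000):
    x1, x2 = x
    for k in range(cap):
        g1, g2 = x1, M * x2                          # gradient of f
        if (g1*g1 + g2*g2) ** 0.5 < eps:
            return k
        # exact step for a quadratic: alpha = g^T g / (g^T H g)
        alpha = (g1*g1 + g2*g2) / (g1*g1 + M * g2*g2)
        x1, x2 = x1 - alpha * g1, x2 - alpha * g2
    return cap

k_well = gd_iterations(2.0)      # well-conditioned, kappa = 2
k_ill = gd_iterations(100.0)     # ill-conditioned, kappa = 100
```

The well-conditioned problem finishes in a handful of iterations, while the ill-conditioned one needs hundreds, consistent with K growing like M/m.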

2.3.1.2 Convergence of GD with Backtracking Line Search

• Convergence of GD using backtracking line search


The backtracking exit condition

f(x^(k) − α∇f(x^(k))) ≤ f(x^(k)) − γα‖∇f(x^(k))‖²

is satisfied when α ∈ [βα₀, α₀], where α₀ ≥ 1/M.

Backtracking line search terminates either with α = 1 or with α ≥ β/M, which gives a lower bound on the decrease:

1. f(x^(k+1)) ≤ f(x^(k)) − γ‖∇f(x^(k))‖² if α = 1

2. f(x^(k+1)) ≤ f(x^(k)) − (βγ/M)‖∇f(x^(k))‖² if α ≥ β/M

If we put these inequalities (1 & 2) together,

f(x^(k+1)) ≤ f(x^(k)) − min{γ, βγ/M} ‖∇f(x^(k))‖²

Similar to the exact line search analysis, subtract p∗ from both sides:

f(x^(k+1)) − p∗ ≤ f(x^(k)) − p∗ − γ min{1, β/M} ‖∇f(x^(k))‖²

But we know that ‖∇f(x^(k))‖² ≥ 2m(f(x^(k)) − p∗), then

f(x^(k+1)) − p∗ ≤ (1 − 2mγ min{1, β/M}) (f(x^(k)) − p∗)

Finally,

(f(x^(k+1)) − p∗) / (f(x^(k)) − p∗) ≤ (1 − 2mγ min{1, β/M}) = c < 1

- Rate of convergence is unity (i.e., R = 1) ⇒ linear convergence

- The rate constant is c < 1, and

f(x^(k)) − p∗ ≤ c^k (f(x^(0)) − p∗)

Thus, k → ∞ ⇒ c^k → 0, so convergence is guaranteed.

2.3.2 Examples

Note: Examples 7, 8, 9 and 10 are taken from Convex Optimization (Boyd and Vandenberghe), Ch. 9.

Example 7: (quadratic problem in R2) Replace γ with σ.


Example 8: (nonquadratic problem in R²) Replace α and t with γ and α.


Example 9: (problem in R100) Replace α and t with γ and α.


Example 10: (Condition number) Replace γ, α and t with σ, γ and α.


Observations:

- The gradient descent algorithm is simple.

- The gradient descent method often exhibits approximately linear convergence.


- The choice of backtracking parameters γ and β has a noticeable but not dramatic effect on the convergence. Exact line search sometimes improves the convergence of the gradient method, but the effect is not large (and probably not worth the trouble of implementing the exact line search).

- The convergence rate depends greatly on the condition number of the Hessian, or of the sublevel sets. Convergence can be very slow, even for problems that are moderately well-conditioned (say, with condition number in the 100s). When the condition number is larger (say, 1000 or more), the gradient method is so slow that it is useless in practice.

- The main advantage of the gradient method is its simplicity. Its main disadvantage is that its convergence rate depends so critically on the condition number of the Hessian or sublevel sets.

2.4 Steepest Descent (SD) Method

2.4.1 Preliminary Definitions

• Dual Norm: Let ‖·‖ denote any norm on R^N; then the dual norm, denoted by ‖·‖∗, is the function from R^N to R with values

‖x‖∗ = max_y {yᵀx : ‖y‖ ≤ 1} = sup {yᵀx : ‖y‖ ≤ 1}

The above definition also corresponds to a norm: it is convex, as it is the pointwise maximum of convex (in fact, linear) functions y ↦ xᵀy; and it is homogeneous of degree 1, that is, ‖αx‖∗ = α‖x‖∗ for every x in R^N and α ≥ 0.

• By definition of the dual norm,

xᵀy ≤ ‖x‖ · ‖y‖∗

This can be seen as a generalized version of the Cauchy-Schwarz inequality, which corresponds to the Euclidean norm.

• The dual of the dual norm is the original norm.

- The norm dual to the Euclidean norm is itself; this comes directly from the Cauchy-Schwarz inequality:

‖x‖₂∗ = ‖x‖₂

- The norm dual to the L∞-norm is the L1-norm, and vice versa:

‖x‖∞∗ = ‖x‖₁ and ‖x‖₁∗ = ‖x‖∞

- More generally, the dual of the Lp-norm is the Lq-norm,

‖x‖p∗ = ‖x‖q, where q = p/(p − 1), i.e., 1/p + 1/q = 1.
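The L∞/L1 pairing can be checked numerically: the maximizer of yᵀx over the unit ball of one norm attains the dual norm of x. The vector x below is an illustrative choice:

```python
# Dual-norm check: sup { y^T x : ||y|| <= 1 } equals the dual norm of x.

x = [3.0, -4.0, 1.0]

# ||x||_1 as the dual of the L-infinity norm: y_i = sign(x_i) is feasible
# (||y||_inf = 1) and attains sum |x_i|.
y_inf = [1.0 if xi >= 0 else -1.0 for xi in x]
dual_of_inf = sum(yi * xi for yi, xi in zip(y_inf, x))

# ||x||_inf as the dual of the L1 norm: put all weight on the largest |x_i|.
i = max(range(len(x)), key=lambda j: abs(x[j]))
y_1 = [0.0] * len(x)
y_1[i] = 1.0 if x[i] >= 0 else -1.0
dual_of_1 = sum(yi * xi for yi, xi in zip(y_1, x))
```

Here dual_of_inf recovers ‖x‖₁ = 8 and dual_of_1 recovers ‖x‖∞ = 4, matching the stated pairing.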


• Quadratic norm: A generalized quadratic norm of x is defined by

‖x‖P = (xᵀPx)^(1/2) = ‖P^(1/2)x‖₂ = ‖Mx‖₂

where P = MᵀM is an N × N symmetric positive definite (SPD) matrix.

• When P = I, the quadratic norm is equal to the Euclidean norm.

• The dual of the quadratic norm is given by

‖x‖P∗ = ‖x‖Q = (xᵀP⁻¹x)^(1/2)

where Q = P⁻¹.

2.4.2 Steepest Descent Method

• The first-order Taylor series approximation of f(x^(k) + αd) around x^(k) is

f(x^(k) + αd) ≈ f(x^(k)) + α ∇ᵀf(x^(k)) d.

This approximation is valid for α‖d‖₂ → 0.

• We want to choose d so that ∇ᵀf(x^(k)) d is as small as (as negative as) possible for maximum descent.

• First normalize d to obtain the normalized steepest descent direction (nsd) dnsd:

dnsd = argmin {∇ᵀf(x^(k)) d : ‖d‖ = 1}

where ‖·‖ is any norm on R^N. The choice of norm is very important.

• It is also convenient to consider the unnormalized steepest descent direction (sd)

dsd = ‖∇f(x)‖∗ dnsd

where ‖·‖∗ is the dual norm of ‖·‖.

• Then, for the steepest descent step, we have

∇ᵀf(x) dsd = ‖∇f(x)‖∗ ∇ᵀf(x) dnsd = −‖∇f(x)‖∗²

since ∇ᵀf(x) dnsd = −‖∇f(x)‖∗.

• Algorithm: Steepest Descent Algorithm

Given a starting point x^(0) ∈ dom f(x)

repeat

1. Compute the steepest descent direction d_sd^(k)

2. Line search: Choose step size α^(k) via a line search algorithm

3. Update: x^(k+1) = x^(k) + α^(k) d_sd^(k)

until stopping criterion is satisfied


2.4.3 Steepest Descent for different norms

2.4.3.1 Euclidean Norm

• Steepest descent for different norms:

- Euclidean norm: As ‖·‖₂∗ = ‖·‖₂ and having x₀ = x^(k), the steepest descent direction is the negative gradient, i.e.,

dsd = −∇f(x₀)

For the Euclidean norm, the steepest descent algorithm is the same as the gradient descent algorithm.

2.4.3.2 Quadratic Norm

- Quadratic norm: For a quadratic norm ‖·‖P and having x₀ = x^(k), the normalized descent direction is given by

dnsd = −P⁻¹∇f(x₀) / ‖∇f(x₀)‖P∗ = −P⁻¹∇f(x₀) / (∇ᵀf(x₀)P⁻¹∇f(x₀))^(1/2)

As ‖∇f(x)‖P∗ = ‖P^(−1/2)∇f(x)‖₂, then

dsd = −P⁻¹∇f(x₀)


Change of coordinates: Let y = P^(1/2)x, so that ‖x‖P = ‖y‖₂. Using this change of coordinates, we can solve the original problem of minimizing f(x) by solving the equivalent problem of minimizing the function f̄(y) : R^N → R, given by

f̄(y) = f(P^(−1/2)y) = f(x)

Apply the gradient descent method to f̄(y). The descent direction at y₀ (x₀ = P^(−1/2)y₀ for the original problem) is

dy = −∇f̄(y₀) = −P^(−1/2)∇f(P^(−1/2)y₀) = −P^(−1/2)∇f(x₀)

Then the descent direction for the original problem becomes

dx = P^(−1/2) dy = −P⁻¹∇f(x₀)

Thus, x∗ = P^(−1/2)y∗.

The steepest descent method in the quadratic norm ‖·‖P is equivalent to the gradient descent method applied to the problem after the coordinate transformation

y = P^(1/2)x

2.4.3.3 L1-norm

- L1-norm: For the L1-norm ‖·‖₁ and having x₀ = x^(k), the normalized descent direction is given by

dnsd = argmin {∇ᵀf(x) d : ‖d‖₁ = 1}.

Let i be any index for which ‖∇f(x₀)‖∞ = max |(∇f(x₀))ᵢ|. Then a normalized steepest descent direction dnsd for the L1-norm is given by

dnsd = − sign(∂f(x₀)/∂xᵢ) eᵢ

where eᵢ is the i-th standard basis vector (i.e., the coordinate axis direction) with the steepest gradient. For example, in the figure above we have dnsd = e₁.


Then, the unnormalized steepest descent direction is given by

dsd = dnsd ‖∇f(x₀)‖∞ = −(∂f(x₀)/∂xᵢ) eᵢ

The steepest descent algorithm in the L1-norm has a very natural interpretation:

- At each iteration we select the component of ∇f(x₀) with maximum absolute value, and then decrease or increase the corresponding component of x₀, according to the sign of (∇f(x₀))ᵢ.

- The algorithm is sometimes called a coordinate-descent algorithm, since only one component of the variable x^(k) is updated at each iteration.

- This can greatly simplify, or even trivialize, the line search.
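The coordinate-descent interpretation above can be sketched as follows: update only the coordinate with the largest gradient magnitude, with an exact line search along eᵢ. The quadratic objective (Hessian H) and starting point are illustrative choices:

```python
# Steepest descent in the L1 norm = coordinate descent on the steepest
# coordinate, for f(x) = 1/2 x^T H x with SPD Hessian H (minimizer: 0).

H = [[3.0, 1.0], [1.0, 1.0]]

def grad(x):
    return [H[0][0]*x[0] + H[0][1]*x[1], H[1][0]*x[0] + H[1][1]*x[1]]

x = [1.0, -2.0]
for _ in range(100):
    g = grad(x)
    i = 0 if abs(g[0]) >= abs(g[1]) else 1   # coordinate with max |(grad f)_i|
    x[i] -= g[i] / H[i][i]                   # exact minimization along e_i
```

Minimizing f along eᵢ is a one-dimensional quadratic problem solved in closed form, which is exactly the sense in which the line search becomes trivial.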

2.4.3.4 Choice of norm

Choice of norm:

- The choice of norm can dramatically affect the convergence.

- The condition number of the Hessian should be close to unity for fast convergence.

- Consider the quadratic norm with respect to an SPD matrix P. Performing the change of coordinates

y = P^(1/2)x

can change the condition number.

- If an approximation of the Hessian at the optimal point, H(x∗), is known, then setting P ≅ H(x∗) will yield

P^(−1/2) H(x∗) P^(−1/2) ≅ I

resulting in a very low condition number.

- If P is chosen correctly, the ellipsoid E = {x : xᵀPx ≤ 1} approximates the cost surface at the point x.

- A correct P will greatly improve the convergence, whereas a wrong choice of P will result in very poor convergence.

2.4.4 Convergence Analysis

- (Using backtracking line search) It can be shown that any norm can be bounded in terms of the Euclidean norm with a constant η ∈ (0, 1]:

‖x‖∗ ≥ η‖x‖₂


- Assuming strongly convex f(x) and using H(x) ⪯ MI,

f(x^(k) + αdsd) ≤ f(x^(k)) + α ∇ᵀf(x^(k)) dsd + (Mα²/2)‖dsd‖₂²
              ≤ f(x^(k)) + α ∇ᵀf(x^(k)) dsd + (Mα²/2η²)‖dsd‖∗²
              = f(x^(k)) − α‖∇f(x^(k))‖∗² + α²(M/2η²)‖∇f(x^(k))‖∗²

The right-hand side of the inequality is a quadratic function of α and has a minimum at α = η²/M. Then,

f(x^(k) + αdsd) ≤ f(x^(k)) − (η²/2M)‖∇f(x^(k))‖∗² ≤ f(x^(k)) + (γη²/M) ∇ᵀf(x^(k)) dsd

Since γ < 0.5 and −‖∇f(x)‖∗² = ∇ᵀf(x) dsd, backtracking line search will return α ≥ min{1, βη²/M}, then

f(x^(k) + αdsd) ≤ f(x^(k)) − γ min{1, βη²/M} ‖∇f(x^(k))‖∗²
              ≤ f(x^(k)) − γη² min{1, βη²/M} ‖∇f(x^(k))‖₂²

Subtracting p∗ from both sides and using ‖∇f(x^(k))‖² ≥ 2m(f(x^(k)) − p∗), we have

(f(x^(k+1)) − p∗) / (f(x^(k)) − p∗) ≤ 1 − 2mγη² min{1, βη²/M} = c < 1

- Linear convergence:

f(x^(k)) − p∗ ≤ c^k (f(x^(0)) − p∗)

As k → ∞, c^k → 0, so convergence is guaranteed.

2.4.5 Examples

Example 11: A steepest descent example with L1-norm.


Example 12: Consider the nonquadratic problem in R² given in Example 8 (replace α and t with γ and α).


When P = I, i.e., gradient descent


2.5 Conjugate Gradient (CG) Method

• Can overcome the slow convergence of the Gradient Descent algorithm.

• Computational complexity is lower than Newton's Method.

• Can be very effective in dealing with general objective functions.

• We will first investigate the quadratic problem

min (1/2) xᵀQx − bᵀx

where Q is SPD, and then extend the solution to the general case by approximation.


2.5.1 Conjugate Directions

• Definition: Given a symmetric matrix Q, two vectors d₁ and d₂ are said to be Q-orthogonal, or conjugate with respect to Q, if

d₁ᵀQd₂ = 0

- Although it is not required, we will assume that Q is SPD.

- If Q = I, then the above definition becomes the definition of orthogonality.

- A finite set of non-zero vectors d₀, d₁, ..., d_k is said to be a Q-orthogonal set if

dᵢᵀQdⱼ = 0, ∀i, j : i ≠ j

• Proposition: If Q is SPD and the set of non-zero vectors d₀, d₁, ..., d_k is Q-orthogonal, then these vectors are linearly independent.

Proof: Assume linear dependence and suppose ∃αᵢ, i = 0, 1, ..., k, not all zero, such that

α₀d₀ + α₁d₁ + ··· + α_k d_k = 0

Multiplying with dᵢᵀQ yields

α₀ dᵢᵀQd₀ + α₁ dᵢᵀQd₁ + ··· + αᵢ dᵢᵀQdᵢ + ··· + α_k dᵢᵀQd_k = αᵢ dᵢᵀQdᵢ = 0

since all cross terms vanish by Q-orthogonality. But dᵢᵀQdᵢ > 0 (Q is PD), so αᵢ = 0. Repeating for all i yields a contradiction.

• Quadratic Problem:

min (1/2) xᵀQx − bᵀx

If Q is an N × N PD matrix, then we have the unique solution

Qx∗ = b

Let d₀, d₁, ..., d_{N−1} be non-zero Q-orthogonal vectors corresponding to the N × N SPD matrix Q. They are linearly independent, so the optimum solution can be expanded as

x∗ = α₀d₀ + α₁d₁ + ··· + α_{N−1}d_{N−1}

We can find the values of the coefficients αᵢ by multiplying the above equation with dᵢᵀQ:

dᵢᵀQx∗ = αᵢ dᵢᵀQdᵢ  ⇒  αᵢ = dᵢᵀb / (dᵢᵀQdᵢ)   (using Qx∗ = b)


Finally, the optimum solution is given by

x∗ = Σ_{i=0}^{N−1} [dᵢᵀb / (dᵢᵀQdᵢ)] dᵢ

- αi can be found from the known vector b and matrix Q once di are found.

- The expansion of x∗ is a result of an iterative process of N steps where at the i-th step αidi isadded.
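The expansion above can be verified numerically for a small SPD system; the Q and b below are the pair used in Example 13 later, and the Q-orthogonal directions are built by Gram-Schmidt in the Q-inner product:

```python
# Verify x* = sum_i (d_i^T b / d_i^T Q d_i) d_i for Q-orthogonal d_0, d_1.

Q = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]

def mv(A, v):                       # matrix-vector product
    return [A[0][0]*v[0] + A[0][1]*v[1], A[1][0]*v[0] + A[1][1]*v[1]]

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1]

d0 = [1.0, 0.0]
e1 = [0.0, 1.0]
c = dot(d0, mv(Q, e1)) / dot(d0, mv(Q, d0))
d1 = [e1[0] - c*d0[0], e1[1] - c*d0[1]]    # Gram-Schmidt: d1 Q-orthogonal to d0

xstar = [0.0, 0.0]
for d in (d0, d1):
    a = dot(d, b) / dot(d, mv(Q, d))       # alpha_i = d_i^T b / d_i^T Q d_i
    xstar = [xstar[0] + a*d[0], xstar[1] + a*d[1]]
```

The accumulated xstar satisfies Qx∗ = b, with each αᵢdᵢ added in one step of the N-step process described above.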

• Conjugate Direction Theorem: Let {dᵢ}_{i=0}^{N−1} be a set of non-zero Q-orthogonal vectors. For any x^(0) ∈ dom f(x), the sequence {x^(k)}_{k=0}^{N} generated according to

x^(k+1) = x^(k) + α^(k) d_k,  k ≥ 0

with

α^(k) = − d_kᵀg^(k) / (d_kᵀQd_k)

where g^(k) is the gradient at x^(k),

g^(k) = ∇f(x^(k)) = Qx^(k) − b

converges to the unique solution x∗ of Qx∗ = b after N steps, i.e., x^(N) = x∗.

Proof: Since the d_k are linearly independent, we can write

x∗ − x^(0) = α^(0)d₀ + α^(1)d₁ + ··· + α^(N−1)d_{N−1}

for some α^(k). We can find α^(k) by

α^(k) = d_kᵀQ(x∗ − x^(0)) / (d_kᵀQd_k)   (2.1)

Now, following the iterative steps from x^(0) to x^(k),

x^(k) − x^(0) = α^(0)d₀ + α^(1)d₁ + ··· + α^(k−1)d_{k−1}

and due to Q-orthogonality,

d_kᵀQ(x^(k) − x^(0)) = 0   (2.2)

Using (2.1) and (2.2) we arrive at

α^(k) = d_kᵀQ(x∗ − x^(k)) / (d_kᵀQd_k) = − d_kᵀg^(k) / (d_kᵀQd_k)


2.5.1.1 Descent Properties of the Conjugate Gradient Method

• We define B^(k), the subspace of R^N spanned by {d₀, d₁, ..., d_{k−1}}, i.e.,

B^(k) = span {d₀, d₁, ..., d_{k−1}} ⊆ R^N

We will show that at each step x^(k) minimizes the objective over the k-dimensional linear variety x^(0) + B^(k).

• Theorem (Expanding Subspace Theorem): Let {dᵢ}_{i=0}^{N−1} be non-zero Q-orthogonal vectors in R^N. For any x^(0) ∈ R^N, the sequence

x^(k+1) = x^(k) + α^(k)d_k,  α^(k) = − d_kᵀg^(k) / (d_kᵀQd_k)

minimizes f(x) = (1/2)xᵀQx − bᵀx on the line

x = x^(k−1) − αd_{k−1},  −∞ < α < ∞

and on x^(0) + B^(k).

Proof: Since x^(k) ∈ x^(0) + B^(k), and B^(k) contains the line x = x^(k−1) − αd_{k−1}, it is enough to show that x^(k) minimizes f(x) over x^(0) + B^(k).

Since we assume that f(x) is strictly convex, this condition holds when g^(k) is orthogonal to B^(k), i.e., when the gradient of f(x) at x^(k) is orthogonal to B^(k).

- The proof of g^(k) ⊥ B^(k) is by induction:

For k = 0, B^(0) = {} (the empty set), so g^(0) ⊥ B^(0) holds trivially.


Now assume that g^(k) ⊥ B^(k); we show that g^(k+1) ⊥ B^(k+1).

From the definition of g^(k) (g^(k) = Qx^(k) − b), it can be shown that

g^(k+1) = g^(k) + α^(k)Qd_k

Hence, by the definition of α^(k),

d_kᵀg^(k+1) = d_kᵀg^(k) + α^(k) d_kᵀQd_k = 0

Also, for i < k,

dᵢᵀg^(k+1) = dᵢᵀg^(k) + α^(k) dᵢᵀQd_k = 0

where the first term vanishes by the induction hypothesis and dᵢᵀQd_k = 0 by Q-orthogonality.

- Corollary: The gradients g^(k), k = 0, 1, ..., N, satisfy

dᵢᵀg^(k) = 0 for i < k.

By the expanding subspace property, at every iteration d_k increases the dimensionality of B. Since x^(k) minimizes f(x) over x^(0) + B^(k), x^(N) is the overall minimum of f(x).

2.5.2 The Conjugate Gradient Method

In the conjugate direction method, the successive direction vectors are selected as a conjugate version of the successive gradients obtained as the method progresses.

• Conjugate Gradient Algorithm:

Start at any x^(0) ∈ R^N and define d^(0) = −g^(0) = b − Qx^(0)

repeat

1. α^(k) = − d^(k)ᵀg^(k) / (d^(k)ᵀQd^(k))

2. x^(k+1) = x^(k) + α^(k)d^(k)

3. g^(k+1) = Qx^(k+1) − b

4. β^(k) = g^(k+1)ᵀQd^(k) / (d^(k)ᵀQd^(k))

5. d^(k+1) = −g^(k+1) + β^(k)d^(k)

until k = N.

- The algorithm terminates in at most N steps with the exact solution (for the quadratic case).

- The gradient is always linearly independent of all previous direction vectors, i.e., g^(k) ⊥ B^(k), where B^(k) = span {d₀, d₁, ..., d_{k−1}}.

- If the solution is reached before N steps, the gradient is zero.

- Very simple formula; the computational complexity is only slightly higher than that of the gradient descent algorithm.

- The process makes uniform progress toward the solution at every step. This is important for the nonquadratic case.

Example 13: Consider the quadratic problem

min (1/2) xᵀQx − bᵀx

where

Q = [3 2; 2 6] and b = [2; −8].

The solution x∗ = [2; −2] satisfies Qx∗ = b, and the conjugate gradient algorithm reaches it in N = 2 steps.
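A sketch of the algorithm applied to this problem, with the loop body numbered as in the algorithm above; with exact arithmetic it terminates in N = 2 steps:

```python
# Conjugate Gradient for min 1/2 x^T Q x - b^T x with Q, b of Example 13.

Q = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]

def mv(A, v):
    return [A[0][0]*v[0] + A[0][1]*v[1], A[1][0]*v[0] + A[1][1]*v[1]]

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1]

x = [0.0, 0.0]
g = [mv(Q, x)[0] - b[0], mv(Q, x)[1] - b[1]]    # g^(0) = Qx - b
d = [-g[0], -g[1]]                              # d^(0) = -g^(0)
for k in range(2):                              # N = 2 steps
    Qd = mv(Q, d)
    alpha = -dot(d, g) / dot(d, Qd)             # 1.
    x = [x[0] + alpha*d[0], x[1] + alpha*d[1]]  # 2.
    g = [mv(Q, x)[0] - b[0], mv(Q, x)[1] - b[1]]# 3.
    beta = dot(g, Qd) / dot(d, Qd)              # 4.
    d = [-g[0] + beta*d[0], -g[1] + beta*d[1]]  # 5.
```

After the second step the gradient is zero (up to rounding) and x equals the exact solution.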

CG Summary

• In theory (with exact arithmetic), CG converges to the solution in N steps.

- The bad news: due to numerical round-off errors, it can take more than N steps (or fail to converge).

- The good news: with luck (i.e., a good spectrum of Q), a good approximate solution can be obtained in ≪ N steps.

• Compared to direct (factor-solve) methods, CG is less reliable and data dependent; it often requires a good (problem-dependent) preconditioner.

• But, when it works, it can solve extremely large systems.


2.5.3 Extension to Nonquadratic Problems

• The idea is simple. We have two loops:

- The outer loop approximates the problem with a quadratic one.

- The inner loop runs the conjugate gradient method (CGM) on the approximation.

I.e., for the neighbourhood of a point x₀,

f(x) ≅ f(x₀) + ∇ᵀf(x₀)(x − x₀) + (1/2)(x − x₀)ᵀH(x₀)(x − x₀) + residual

where the first three terms form a quadratic function and the residual → 0.

- Expanding,

f(x) ≅ (1/2)xᵀH(x₀)x + (∇ᵀf(x₀) − x₀ᵀH(x₀))x + f(x₀) + (1/2)x₀ᵀH(x₀)x₀ − ∇ᵀf(x₀)x₀

where the last three terms are independent of x, i.e., constant. Thus,

min f(x) ≡ min (1/2)xᵀH(x₀)x + (∇ᵀf(x₀) − x₀ᵀH(x₀))x

         ≡ min (1/2)xᵀQx − bᵀx

• Here,

Q = H(x₀)

bᵀ = −∇ᵀf(x₀) + x₀ᵀH(x₀)

The gradient g^(k) is

g^(k) = Qx^(k) − b = H(x₀)x^(k) + ∇f(x₀) − H(x₀)x₀ = ∇f(x₀)   (evaluated at x^(k) = x₀)

• Nonquadratic Conjugate Gradient Algorithm:

Starting at any x^(0) ∈ R^N, compute g^(0) = ∇f(x^(0)) and set d^(0) = −g^(0)

repeat

repeat

1. α^(k) = − d^(k)ᵀg^(k) / (d^(k)ᵀH(x^(k))d^(k))

2. x^(k+1) = x^(k) + α^(k)d^(k)

3. g^(k+1) = ∇f(x^(k+1))

4. β^(k) = g^(k+1)ᵀH(x^(k))d^(k) / (d^(k)ᵀH(x^(k))d^(k))

5. d^(k+1) = −g^(k+1) + β^(k)d^(k)

until k = N.

The new starting point is x^(0) = x^(N), g^(0) = ∇f(x^(N)) and d^(0) = −g^(0).

until stopping criterion is satisfied

- No line search is required.

- H(x(k)) must be evaluated at each point, can be impractical.

- Algorithm may not be globally convergent.

• The involvement of H(x^(k)) can be avoided by employing a line search algorithm for α^(k) and slightly modifying β^(k).

• Nonquadratic Conjugate Gradient Algorithm with Line Search:

Starting at any x^(0) ∈ R^N, compute g^(0) = ∇f(x^(0)) and set d^(0) = −g^(0)

repeat

repeat

1. Line search: α^(k) = argmin_α f(x^(k) + αd^(k))

2. Update: x^(k+1) = x^(k) + α^(k)d^(k)

3. Gradient: g^(k+1) = ∇f(x^(k+1))

4. Use the

Fletcher-Reeves method: β^(k) = g^(k+1)ᵀg^(k+1) / (g^(k)ᵀg^(k)), or

Polak-Ribiere method: β^(k) = (g^(k+1) − g^(k))ᵀg^(k+1) / (g^(k)ᵀg^(k))

5. d^(k+1) = −g^(k+1) + β^(k)d^(k)

until k = N.

The new starting point is x^(0) = x^(N), g^(0) = ∇f(x^(N)) and d^(0) = −g^(0).

until stopping criterion is satisfied

- The Polak-Ribiere method can be superior to the Fletcher-Reeves method.

- Global convergence of the line search methods is established by noting that a gradient descent step is taken every N steps and serves as a spacer step. Since the other steps do not increase the objective, and in fact hopefully decrease it, global convergence is guaranteed.
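A sketch of the algorithm above with the Fletcher-Reeves β. The objective, the backtracking stand-in for the exact line search, and the descent safeguard (a common practical addition, not part of the algorithm as stated in the notes) are all illustrative choices:

```python
# Nonquadratic CG with line search and Fletcher-Reeves beta, restarting with a
# gradient (spacer) step every N inner iterations.

def f(x):
    return x[0]**2 + 2.0*x[1]**2 + x[0]**4    # smooth, strongly convex

def grad(x):
    return [2.0*x[0] + 4.0*x[0]**3, 4.0*x[1]]

def norm(v):
    return (v[0]**2 + v[1]**2) ** 0.5

def backtrack(x, d, g, gamma=0.3, beta=0.5):
    # stand-in for the exact line search alpha = argmin_a f(x + a d)
    slope = g[0]*d[0] + g[1]*d[1]             # < 0 for a descent direction
    alpha = 1.0
    while f([x[0]+alpha*d[0], x[1]+alpha*d[1]]) > f(x) + gamma*alpha*slope:
        alpha *= beta
    return alpha

N = 2                                         # restart (spacer) period
x = [2.0, 1.0]
g = grad(x)
outer = 0
while norm(g) > 1e-8 and outer < 1000:
    d = [-g[0], -g[1]]                        # restart: pure gradient step
    for _ in range(N):
        alpha = backtrack(x, d, g)
        x = [x[0] + alpha*d[0], x[1] + alpha*d[1]]
        g_new = grad(x)
        if norm(g_new) <= 1e-8:
            g = g_new
            break
        fr = (g_new[0]**2 + g_new[1]**2) / (g[0]**2 + g[1]**2)  # Fletcher-Reeves
        d = [-g_new[0] + fr*d[0], -g_new[1] + fr*d[1]]
        if d[0]*g_new[0] + d[1]*g_new[1] >= 0:
            d = [-g_new[0], -g_new[1]]        # safeguard: keep d a descent dir.
        g = g_new
    outer += 1
```

The restart every N steps is exactly the spacer gradient step the convergence argument above relies on; the safeguard handles the fact that an inexact line search does not guarantee that the FR update yields a descent direction.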

Example 14: Convergence example of the nonlinear Conjugate Gradient Method: (a) A complicated function with many local minima and maxima. (b) Convergence path of Fletcher-Reeves CG. Unlike linear CG, convergence does not occur in two steps. (c) Cross-section of the surface corresponding to the first line search. (d) Convergence path of Polak-Ribiere CG.


2.6 Newton's Method (NA)

2.6.1 The Newton Step

• In Newton's Method, local quadratic approximations of f(x) are utilized. Starting with the second-order Taylor approximation around x^(k),

f(x^(k+1)) = f(x^(k)) + ∇ᵀf(x^(k))∆x + (1/2)∆xᵀH(x^(k))∆x + residual

where ∆x = x^(k+1) − x^(k) and the first three terms form the quadratic model of f(x^(k+1)); find ∆x = ∆x_nt such that this quadratic model is minimized.

• The optimum step of the quadratic approximation (obtained by solving ∂f(x^(k+1))/∂∆x = 0),

∆x_nt = −H⁻¹(x^(k))∇f(x^(k))


is called the Newton step, which is a descent direction, i.e.,

∇ᵀf(x^(k))∆x_nt = −∇ᵀf(x^(k))H⁻¹(x^(k))∇f(x^(k)) < 0

• Then

x^(k+1) = x^(k) + ∆x_nt
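A minimal sketch of one Newton step: for a quadratic objective the step lands on the exact minimizer in a single update. The quadratic uses the Q, b of Example 13, the starting point is an illustrative choice, and the 2×2 solve is done by Cramer's rule:

```python
# One Newton step dx = -H^{-1} grad f on f(x) = 1/2 x^T Q x - b^T x.

Q = [[3.0, 2.0], [2.0, 6.0]]                # Hessian of the quadratic
b = [2.0, -8.0]

x = [5.0, 5.0]
g = [Q[0][0]*x[0] + Q[0][1]*x[1] - b[0],
     Q[1][0]*x[0] + Q[1][1]*x[1] - b[1]]    # grad f = Qx - b

# Solve H dx = -g by Cramer's rule for the 2x2 case.
det = Q[0][0]*Q[1][1] - Q[0][1]*Q[1][0]
dx = [(-g[0]*Q[1][1] + g[1]*Q[0][1]) / det,
      (-g[1]*Q[0][0] + g[0]*Q[1][0]) / det]

x_new = [x[0] + dx[0], x[1] + dx[1]]        # lands on the exact minimizer
```

Since f is exactly quadratic here, x_new solves Qx = b directly, and ∇ᵀf(x)∆x_nt < 0 confirms the descent property.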

2.6.1.1 Interpretation of the Newton Step

1. Minimizer of the second-order approximation

As given above, ∆x_nt minimizes the quadratic approximation of f(x) in the neighbourhood of x^(k).

- If f(x) is quadratic, then x^(0) + ∆x_nt is the exact minimizer of f(x) and the algorithm terminates in a single step with the exact answer.

- If f(x) is nearly quadratic, then x + ∆x_nt is a very good estimate of the minimizer of f(x), x∗.

- For twice differentiable f(x), the quadratic approximation is very accurate in the neighbourhood of x∗, i.e., when x is very close to x∗, the point x + ∆x_nt is a very good estimate of x∗.

2. Steepest Descent Direction in Hessian Norm

- The Newton step is the steepest descent direction at x(k) in the quadratic norm defined by the Hessian,

‖v‖H(x(k)) = (vT H(x(k)) v)^(1/2)

- In the steepest descent method, the quadratic norm ‖ · ‖P can significantly increase the speed of convergence by decreasing the condition number. In the neighbourhood of x∗, P = H(x∗) is a very good choice.

- In Newton's method, when x is near x∗, we have H(x) ≈ H(x∗).


3. Solution of Linearized Optimality Condition

- The first-order optimality condition is

∇f(x∗) = 0

Near x∗ (using the first-order Taylor approximation for ∇f(x + ∆x)),

∇f(x + ∆x) ≈ ∇f(x) + H(x)∆x = 0

with the solution

∆xnt = −H−1(x)∇f(x)

2.6.2 The Newton Decrement

• The norm of the Newton step in the quadratic norm defined by H(x) is called the Newton decrement

λ(x) = ‖∆xnt‖H(x) = ((∆xnt)T H(x) ∆xnt)^(1/2)

• It can be used as a stopping criterion, since λ²(x)/2 is an estimate of f(x) − p∗, i.e.,

f(x) − infy f̂(y) = f(x) − f̂(x + ∆xnt) = (1/2)λ²(x)

where

f̂(x + ∆xnt) = f(x) + ∇Tf(x)∆xnt + (1/2)(∆xnt)T H(x)∆xnt

is the second-order quadratic approximation of f(x) at x.


Substituting f̂(x + ∆xnt) into f(x) − infy f̂(y), with

∆xnt = −H−1(x)∇f(x)

then

f(x) − infy f̂(y) = (1/2)∇Tf(x)H−1(x)∇f(x) = (1/2)λ²(x)

• So, if λ²(x)/2 < ε, the algorithm can be terminated, for some small ε > 0.

• With the substitution of ∆xnt = −H−1(x)∇f(x), the Newton decrement can also be written as

λ(x(k)) = (∇Tf(x(k)) H−1(x(k)) ∇f(x(k)))^(1/2)

2.6.3 Newton's Method

• Given a starting point x(0) ∈ dom f(x) and some small tolerance ε > 0

repeat

1. Compute the Newton step and Newton decrement

∆x(k) = −H−1(x(k))∇f(x(k))

λ(x(k)) = (∇Tf(x(k)) H−1(x(k)) ∇f(x(k)))^(1/2)

2. Stopping criterion: quit if λ²(x(k))/2 ≤ ε.

3. Line search: Choose a stepsize α(k) > 0, e.g., by backtracking line search.

4. Update: x(k+1) = x(k) + α(k)∆x(k).

• The stepsize α(k) (i.e., the line search) is required in the initial, non-quadratic phase of the algorithm. Otherwise, the algorithm may not converge due to large higher-order residuals.

• As x(k) gets closer to x∗, f(x) is better approximated by the second-order expansion. Hence, a stepsize α(k) < 1 is no longer required; the line search algorithm will automatically select α(k) = 1.

• If we start with α(k) = 1 and keep it fixed, then the algorithm is called the pure Newton's method.

• For an arbitrary f(x), there are two regions of convergence:

- the damped Newton phase, when x is far from x∗

- the quadratically convergent phase, when x gets close to x∗

• If we let H(x) = I, the algorithm reduces to gradient descent (GD)

x(k+1) = x(k) − α(k)∇f(x(k))
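The steps above can be sketched in code. The following is a minimal damped Newton implementation with backtracking line search and the Newton-decrement stopping rule; the log-sum-exp test function, its hand-coded gradient, the finite-difference Hessian, and all tolerance values are illustrative assumptions, not part of the notes:

```python
import numpy as np

def newton_method(f, grad, hess, x0, eps=1e-8, alpha=0.25, beta=0.5, max_iter=100):
    """Damped Newton's method; stops when lambda^2/2 <= eps."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        H = hess(x)
        # Newton step: solve H dx = -g instead of forming H^{-1} explicitly
        dx = np.linalg.solve(H, -g)
        lam2 = -g @ dx                 # lambda^2 = grad^T H^{-1} grad
        if lam2 / 2 <= eps:            # stopping criterion
            break
        # backtracking line search (Armijo condition)
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta
        x = x + t * dx
    return x

# Illustrative smooth convex test function (minimizer at the origin):
# f(x) = log(exp(x1+x2) + exp(-x1) + exp(-x2))
def f(x):
    return np.log(np.exp(x[0] + x[1]) + np.exp(-x[0]) + np.exp(-x[1]))

def grad(x):
    e = np.array([np.exp(x[0] + x[1]), np.exp(-x[0]), np.exp(-x[1])])
    s = e.sum()
    return np.array([(e[0] - e[1]) / s, (e[0] - e[2]) / s])

def hess(x):
    # central-difference Hessian of the gradient is enough for a sketch
    n, h = len(x), 1e-5
    H = np.zeros((n, n))
    for i in range(n):
        ei = np.zeros(n); ei[i] = h
        H[:, i] = (grad(x + ei) - grad(x - ei)) / (2 * h)
    return 0.5 * (H + H.T)

x_star = newton_method(f, grad, hess, np.array([2.0, -1.0]))
```

In practice only a handful of iterations are needed once the quadratically convergent phase is reached, matching the behaviour described above.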


• If H(x) is not positive definite, Newton's method may fail to converge.

So, use (aI + H(x))−1 instead of H−1(x); this is also known as (a.k.a.) the Marquardt method. There always exists an a > 0 which makes the matrix (aI + H(x)) positive definite.

a is a trade-off between GD and NA:

- a → ∞ ⇒ Gradient Descent (GD)

- a → 0 ⇒ Newton's Method (NA)
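The damping idea can be sketched in a few lines. Choosing a from the smallest eigenvalue of H is just one simple heuristic assumed here; practical Marquardt/Levenberg-style codes adapt a between iterations:

```python
import numpy as np

def marquardt_step(g, H, a_min=0.1):
    """Return the damped Newton step -(aI + H)^{-1} g.

    'a' is chosen just large enough to make aI + H positive definite,
    based on the smallest eigenvalue of H (one simple heuristic).
    """
    lam_min = np.linalg.eigvalsh(H).min()
    a = 0.0 if lam_min > a_min else (a_min - lam_min)
    return np.linalg.solve(a * np.eye(len(g)) + H, -g)

# Indefinite Hessian example: plain Newton would move toward a saddle point
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
d = marquardt_step(g, H)
assert g @ d < 0   # the damped step is a descent direction
```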

• The Newton step and decrement are independent of affine transformations (i.e., linear coordinate transformations): for a non-singular T ∈ RN×N, let

x = Ty and f̄(y) = f(Ty)

then

∇f̄(y) = TT∇f(x)

H̄(y) = TTH(x)T

- So, the Newton step will be

∆ynt = −H̄−1(y)∇f̄(y)

= −(TTH(x)T)−1(TT∇f(x))

= −T−1H−1(x)∇f(x)

= T−1∆xnt

i.e.,

x + ∆xnt = T(y + ∆ynt), ∀x

- Similarly, the Newton decrement will be

λ(y) = (∇Tf̄(y) H̄−1(y) ∇f̄(y))^(1/2)

= ((∇Tf(x)T)(TTH(x)T)−1(TT∇f(x)))^(1/2)

= (∇Tf(x) H−1(x) ∇f(x))^(1/2)

= λ(x)

• Thus, Newton's Method is independent of affine transformations (i.e., linear coordinate transformations).


2.6.4 Convergence Analysis

Read Boyd, Section 9.5.3.

• Assume a strongly convex f(x) with mI ⪯ H(x) for some constant m > 0, ∀x ∈ dom f(x), and H(x) Lipschitz continuous on dom f(x), i.e.,

‖H(x) − H(y)‖2 ≤ L‖x − y‖2

for a constant L > 0. This inequality imposes a bound on the third derivative of f(x).

If L is small, f(x) is close to a quadratic function. If L is large, f(x) is far from a quadratic function. If L = 0, then f(x) is quadratic.

Thus, L measures how well f(x) can be approximated by a quadratic function.

- Newton's Method will perform well for small L.

Convergence: There exist constants η ∈ (0, m²/L) and σ > 0 such that

• Damped Newton Phase: ‖∇f(x)‖2 ≥ η

- α(k) < 1 gives better progress, so most iterations require a line search, e.g., backtracking line search.

- At each iteration, the function value decreases by at least σ, but the convergence is not necessarily quadratic.

- This phase ends after at most (f(x(0)) − p∗)/σ iterations.

• Quadratically Convergent Phase: ‖∇f(x)‖2 < η

- All iterations use α(k) = 1 (i.e., the quadratic approximation fits very well).

- ‖∇f(x(k+1))‖2 / ‖∇f(x(k))‖2² ≤ L/(2m²), i.e., quadratic convergence.

- For a small ε > 0, f(x) − p∗ < ε is achieved after at most

log2 log2(ε0/ε)

iterations, where ε0 = 2m³/L². This is typically 5-6 iterations.

- The total number of iterations is bounded above by

(f(x(0)) − p∗)/σ + log2 log2(ε0/ε)

where σ and ε0 depend on m, L and x(0).


2.6.5 Summary

• Convergence of Newton's method is rapid in general, and quadratic near x∗. Once the quadratically convergent phase is reached, at most six or so iterations are required to produce a solution of very high accuracy.

• Newton's method is affine invariant. It is insensitive to the choice of coordinates, or to the condition number of the sublevel sets of the objective.

• Newton's method scales well with problem size. Ignoring the computation of the Hessian, its performance on problems in R10000 is similar to its performance on problems in R10, with only a modest increase in the number of steps required.

• The good performance of Newton's method is not dependent on the choice of algorithm parameters. In contrast, the choice of norm for steepest descent plays a critical role in its performance.

• The main disadvantage of Newton's method is the cost of forming and storing the Hessian, and the cost of computing the Newton step, which requires solving a set of linear equations.

• A family of algorithms for unconstrained optimization, called quasi-Newton methods, provides alternatives. These methods require less computational effort to form the search direction, yet they share some of the strong advantages of Newton's method, such as rapid convergence near x∗.

2.6.6 Examples

Example 15: Consider the nonquadratic problem in R2 given in Example 8 and Example 12 (replace α and t with γ and α).


Example 16: Consider the nonquadratic problem in R100 given in Example 9 (replace α and t with γ and α).


Example 17: (problem in R10000) Replace α and t with γ and α.


2.6.7 Approximation of the Hessian

For relatively large-scale problems, i.e., when N is large, calculating the inverse of the Hessian at each iteration can be costly. So, we may use some approximation S(x) of the inverse of the Hessian,

S(x) ≈ H−1(x)

x(k+1) = x(k) − α(k)S(x(k))∇f(x(k))

1. Hybrid GD + NA

We know that the first (damped) phase of Newton's Algorithm (NA) is not very fast. Therefore, we can first run GD, which has considerably lower complexity, and after some conditions are satisfied, switch to the NA.

Newton's Algorithm may not converge for highly non-quadratic functions unless x is close to x∗.

The hybrid method (given below) also guarantees global convergence.

• Hybrid Algorithm

- Start at x(0) ∈ dom f(x)

repeat

run GD (i.e., S(x(k)) = I)

until stopping criterion is satisfied

- Start at the final point of GD

repeat

run NA with exact H(x) (i.e., S(x(k)) = H−1(x(k)))

until stopping criterion is satisfied

2. The Chord Method

If f(x) is close to a quadratic function, we may use S(x(k)) = H−1(x(0)) throughout the iterations, i.e.,

∆x(k) = −H−1(x(0))∇f(x(k))

x(k+1) = x(k) + ∆x(k)

This is also the same as the SD algorithm with P = H(x(0)) and α(k) = 1.

3. The Shamanski Method

Updating the Hessian every N iterations may give better performance, i.e., S(x(k)) = H−1(x(⌊k/N⌋N)):

∆x(k) = −H−1(x(⌊k/N⌋N))∇f(x(k))

x(k+1) = x(k) + ∆x(k)


This is a trade-off between the Chord method (N ← ∞) and the full NA (N ← 1).
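Both variants can be sketched with a single Hessian reuse period N; the nearly quadratic test objective and the tolerances below are illustrative assumptions:

```python
import numpy as np

def shamanskii(grad, hess, x0, N=5, tol=1e-10, max_iter=200):
    """Newton-type iteration that refreshes the Hessian every N steps.

    N = 1 gives the full Newton's method; a very large N gives the
    Chord method, which reuses H(x(0)) throughout.
    """
    x = x0.astype(float)
    for k in range(max_iter):
        if k % N == 0:                   # refresh H at k = 0, N, 2N, ...
            H = hess(x)
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(H, g)
    return x

# Nearly quadratic illustrative objective: f(x) = 0.5 x^T Q x + 0.01 sum(x_i^4),
# whose unique minimizer is x = 0.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: Q @ x + 0.04 * x**3
hess = lambda x: Q + np.diag(0.12 * x**2)

x_chord = shamanskii(grad, hess, np.array([1.0, -1.0]), N=10**6)  # Chord
x_full  = shamanskii(grad, hess, np.array([1.0, -1.0]), N=1)      # full NA
```

The chord iterates converge linearly (but cheaply, since the factorization of H(x(0)) can be reused), while N = 1 recovers quadratic convergence.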

4. Approximating Particular Terms

Inversion of sparse matrices can be easier, i.e., when many entries of H(x) are zero.

- If some entries of H(x) are small, or below a small threshold, set them to zero, obtaining H̃(x). Thus, H̃(x) becomes sparse.

- In the extreme case, when the Hessian is strongly diagonally dominant, set the off-diagonal terms to zero, obtaining H̃(x). Thus, H̃(x) becomes diagonal, which is very easy to invert.

There are also other, more advanced quasi-Newton (modified Newton) algorithms developed to approximate the inverse of the Hessian, e.g., the Broyden and Davidon-Fletcher-Powell (DFP) methods.


Chapter 3

Constrained Optimization Methods

3.1 Duality

• Consider the standard minimization problem (which will be referred to as the primal problem)

min f(x)

s.t. g(x) ≤ 0

h(x) = 0

where

g(x) = [g1(x) g2(x) · · · gi(x) · · · gL(x)]T

h(x) = [h1(x) h2(x) · · · hj(x) · · · hM(x)]T

represent the L inequality and M equality constraints, respectively.

• The domain D of the optimization problem is defined by

D = dom f(x) ∩ (∩i=1..L dom gi(x)) ∩ (∩j=1..M dom hj(x))

• Any point x ∈ D satisfying the constraints, i.e., g(x) ≤ 0 and h(x) = 0, is called a feasible point.

3.1.1 Lagrange Dual Function

• Define the Lagrangian L : RN × RL × RM → R as

L(x, λ, ν) = f(x) + λTg(x) + νTh(x)

= f(x) + Σi=1..L λi gi(x) + Σj=1..M νj hj(x)


where λi ≥ 0 and νj are called the Lagrange multipliers, and

λ = [λ1 λ2 · · · λi · · · λL]T

ν = [ν1 ν2 · · · νj · · · νM]T

are called the Lagrange multiplier vectors.

• On the feasible set F, where

F = {x | x ∈ D ∧ g(x) ≤ 0 ∧ h(x) = 0}

the Lagrangian has a value less than or equal to the cost function, i.e.,

L(x, λ, ν) ≤ f(x), ∀x ∈ F, ∀λi ≥ 0.

• Then the Lagrange dual function ℓ(λ, ν) is defined as

ℓ(λ, ν) = infx∈D L(x, λ, ν)

= infx∈D (f(x) + λTg(x) + νTh(x))

= infx∈D (f(x) + Σi=1..L λi gi(x) + Σj=1..M νj hj(x))

• The dual function ℓ(λ, ν) is the pointwise infimum of a set of affine functions of λ and ν; hence it is concave even if f(x), gi(x) and hj(x) are not convex.

• Proposition: The dual function constitutes a lower bound on p∗ = f(x∗), i.e.,

ℓ(λ, ν) ≤ p∗, ∀λ ≥ 0

• Proof: Let x be a feasible point, that is x ∈ F (i.e., gi(x) ≤ 0 and hj(x) = 0), and let λ ≥ 0; then

λTg(x) + νTh(x) ≤ 0.

Then, the Lagrangian satisfies

L(x, λ, ν) = f(x) + λTg(x) + νTh(x) ≤ f(x)

So,

ℓ(λ, ν) = infx∈D L(x, λ, ν) ≤ L(x, λ, ν) ≤ f(x), for every feasible x. ∎

• The pair (λ, ν) is called dual feasible when λ ≥ 0.


3.1.1.1 Examples

• Least Squares (LS) solution of linear equations

min xTx

s.t. Ax = b

- The Lagrangian is

L(x, ν) = xTx + νT(Ax − b)


then the Lagrange dual function is

ℓ(ν) = infx L(x, ν)

Since L(x, ν) is quadratic in x, it is convex, and its minimizer follows from

∇xL(x, ν) = 2x + ATν = 0 ⇒ x = −(1/2)ATν

Hence, the Lagrange dual function is given by

ℓ(ν) = L(−(1/2)ATν, ν) = −(1/4)νTAATν − bTν

which is obviously concave, so that

p∗ ≥ ℓ(ν) = −(1/4)νTAATν − bTν
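Both the lower bound and its tightness (strong duality holds for this problem) can be checked numerically; a sketch with random data, where the sizes and seed are arbitrary and the concave dual is maximized in closed form by setting its gradient −(1/2)AATν − b to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))   # fat matrix, illustrative sizes
b = rng.standard_normal(3)

# Primal: min x^T x s.t. Ax = b  ->  minimum-norm solution x* = A^T (A A^T)^{-1} b
x_star = A.T @ np.linalg.solve(A @ A.T, b)
p_star = x_star @ x_star

# Dual: l(nu) = -(1/4) nu^T A A^T nu - b^T nu, maximized at nu* = -2 (A A^T)^{-1} b
nu_star = -2.0 * np.linalg.solve(A @ A.T, b)
d_star = -0.25 * nu_star @ (A @ A.T) @ nu_star - b @ nu_star

assert d_star <= p_star + 1e-8    # lower bound
assert abs(d_star - p_star) < 1e-8  # and it is tight here
```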

• Linear Programming (LP)

min cTx

s.t. Ax = b

−x ≤ 0

- The Lagrangian is

L(x, λ, ν) = cTx − λTx + νT(Ax − b)

then the Lagrange dual function is

ℓ(λ, ν) = −bTν + infx {(c + ATν − λ)Tx}

In order for ℓ(λ, ν) to be bounded below, we must have c + ATν − λ = 0, i.e.,

ℓ(λ, ν) = { −bTν if c + ATν − λ = 0; −∞ otherwise }

It is affine, hence concave, on the set where c + ATν − λ = 0 with λ ≥ 0.

So, p∗ ≥ −bTν when c + ATν − λ = 0 with λ ≥ 0.

• Two-way partitioning (a non-convex problem)

min xTWx

s.t. xj² = 1, j = 1, . . . , N

- This is a discrete problem, since xj ∈ {−1, +1}, and it is very difficult to solve for large N.

The Lagrange dual function is

ℓ(ν) = infx {xTWx + Σj=1..N νj(xj² − 1)}

= infx {xT(W + diag ν)x} − 1Tν

= { −1Tν if W + diag ν ⪰ 0; −∞ otherwise }


We may take ν = −λmin(W)1, which yields

p∗ ≥ −1Tν = N λmin(W)

where λmin(W) is the minimum eigenvalue of W.

- If we relax the constraint to ‖x‖₂² = N, i.e., Σj=1..N xj² = N, then the problem becomes easy to solve:

min xTWx

s.t. ‖x‖₂² = N

with the exact solution

p∗ = N λmin(W)

where λmin(W) is the minimum eigenvalue of W.
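The eigenvalue lower bound can be checked against brute-force enumeration for small N; a sketch (the random W and N = 8 are illustrative assumptions, and enumeration is feasible only for small N):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N = 8
W = rng.standard_normal((N, N))
W = 0.5 * (W + W.T)              # symmetric cost matrix

# Dual lower bound: p* >= N * lambda_min(W)
bound = N * np.linalg.eigvalsh(W).min()

# Brute-force primal optimum over all 2^N points x in {-1,+1}^N
p_star = min(x @ W @ x for x in (np.array(s) for s in product([-1.0, 1.0], repeat=N)))

assert bound <= p_star + 1e-9
```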

3.2 The Lagrange Dual Problem

3.2.1 Dual Problem

max ℓ(λ, ν)

s.t. λ ≥ 0

gives the best lower bound for p∗, i.e., p∗ ≥ ℓ(λ, ν).

• The pairs (λ, ν) with λ ≥ 0, ν ∈ RM and ℓ(λ, ν) > −∞ are dual feasible.

• The solution of the above problem over dual feasible points is called the dual optimal point (λ∗, ν∗) (i.e., the optimal Lagrange multipliers), with dual optimal value

d∗ = ℓ(λ∗, ν∗) ≤ p∗

• Some hidden (implicit) constraints can be made explicit in the dual problem; e.g., consider the LP problems below.

• Linear program (LP)

min cTx

s.t. Ax = b

x ≥ 0

has the dual function

ℓ(λ, ν) = { −bTν if c + ATν − λ = 0; −∞ otherwise }


- So, the dual problem can be given by

max −bTν

s.t. ATν − λ + c = 0

λ ≥ 0

- The dual problem can be further simplified to

max −bTν

s.t. ATν + c ≥ 0

• Linear program (LP) with inequality constraints

min cTx

s.t. Ax ≤ b

has the dual function

ℓ(λ) = { −bTλ if ATλ + c = 0; −∞ otherwise }

- So, the dual problem can be given by

max −bTλ

s.t. ATλ + c = 0

λ ≥ 0

3.3 Weak and Strong Duality

3.3.1 Weak Duality

• We know that

sup{λ≥0, ν} ℓ(λ, ν) = d∗ ≤ p∗ = infx∈D f(x)

• This inequality is known as weak duality.

• Here (p∗ − d∗) is called the duality gap, which is always nonnegative.

• Weak duality always holds, even when the primal problem is non-convex.

3.3.2 Strong Duality

• Strong duality refers to the case where the duality gap is zero, i.e.,

d∗ = p∗

• In general it does not hold.

• It may hold if the primal problem is convex.

• The conditions under which strong duality holds are called constraint qualifications; one of them is Slater's condition.


3.3.3 Slater's Condition

• Strong duality holds for a convex problem

min f(x)

s.t. g(x) ≤ 0

Ax = b

if it is strictly feasible, i.e., if ∃x ∈ int D such that

g(x) < 0

Ax = b

i.e., the inequality constraints hold with strict inequality.

3.3.4 Saddle-point Interpretation

• First consider the following problem (with no equality constraints),

supλ≥0 L(x, λ) = supλ≥0 (f(x) + λTg(x))

= supλ≥0 (f(x) + Σi=1..L λi gi(x))

= { f(x) if gi(x) ≤ 0 ∀i; ∞ otherwise }

• If gi(x) ≤ 0 ∀i, then the optimum choice is λi = 0.

• Hence, in terms of the duality gap, we have d∗ ≤ p∗, with weak duality as the inequality

supλ≥0 infx L(x, λ) ≤ infx supλ≥0 L(x, λ)

and strong duality as the equality

supλ≥0 infx L(x, λ) = infx supλ≥0 L(x, λ)

• With strong duality we can switch inf and sup for λ ≥ 0.

• In general, for any f(w, z) : RN × RL → R,

supz∈Z infw∈W f(w, z) ≤ infw∈W supz∈Z f(w, z)

with W ⊆ RN and Z ⊆ RL; this is called the max-min inequality.


• f(w, z) satisfies the strong max-min property (or the saddle-point property) if the above inequality holds with equality.

• A point (w̃, z̃) with w̃ ∈ W and z̃ ∈ Z is called a saddle-point for f(w, z) if

f(w̃, z) ≤ f(w̃, z̃) ≤ f(w, z̃), ∀w ∈ W, ∀z ∈ Z

• In other words,

f(w̃, z̃) = infw∈W f(w, z̃) = supz∈Z f(w̃, z)

and the strong max-min property holds with the value f(w̃, z̃).

• For Lagrange duality, if x∗ and λ∗ are optimal points for the primal and dual problems with strong duality (zero duality gap), then they form a saddle-point for the Lagrangian, and vice versa (i.e., the converse is also true).

3.4 Optimality Conditions

3.4.1 Certi�cate of Suboptimality and Stopping Criterion

• We know that a dual feasible point satisfies

ℓ(λ, ν) ≤ p∗

i.e., the point (λ, ν) is a proof or certificate of this bound.

Then,

f(x) − p∗ ≤ f(x) − ℓ(λ, ν)

for a primal feasible point x and a dual feasible point (λ, ν), where the duality gap associated with these points is

f(x) − ℓ(λ, ν)

in other words,

p∗ ∈ [ℓ(λ, ν), f(x)] and d∗ ∈ [ℓ(λ, ν), f(x)]

- If the duality gap is zero, i.e., f(x) = ℓ(λ, ν), then x∗ = x and (λ∗, ν∗) = (λ, ν) are the primal and dual optimal points.

3.4.1.1 Stopping Criterion

• If an algorithm produces the sequences x(k) and (λ(k), ν(k)), check whether

f(x(k)) − ℓ(λ(k), ν(k)) < ε

to guarantee ε-suboptimality. ε can be driven to zero, i.e., ε → 0, when strong duality holds.


3.4.2 Complementary Slackness

• Assume that x∗ and (λ∗, ν∗) satisfy strong duality:

f(x∗) = ℓ(λ∗, ν∗)

= infx (f(x) + λ∗Tg(x) + ν∗Th(x))

= infx (f(x) + Σi λ∗i gi(x) + Σj ν∗j hj(x))

≤ f(x∗) + Σi λ∗i gi(x∗) + Σj ν∗j hj(x∗)    (first sum ≤ 0, second sum = 0)

≤ f(x∗)

• Observations (due to strong duality):

1. The inequality in the third line always holds with equality, i.e., x∗ minimizes L(x, λ∗, ν∗).

2. From the fourth line we have Σi λ∗i gi(x∗) = 0, and since each term satisfies λ∗i gi(x∗) ≤ 0,

λ∗i gi(x∗) = 0, ∀i

which is known as complementary slackness.

• In other words,

λ∗i > 0 only if gi(x∗) = 0

λ∗i = 0 if gi(x∗) < 0

i.e., the i-th optimal Lagrange multiplier is

- positive only if the constraint gi(x) is active at x∗,

- zero if the constraint gi(x) is inactive at x∗.

3.4.3 KKT Optimality Conditions

• x∗ minimizes L(x, λ∗, ν∗); thus ∇xL(x∗, λ∗, ν∗) = 0, i.e.,

∇f(x∗) + Σi λ∗i ∇gi(x∗) + Σj ν∗j ∇hj(x∗) = 0

• Then the Karush-Kuhn-Tucker (KKT) conditions for x∗ and (λ∗, ν∗) being primal and dual optimal points with zero duality gap (i.e., with strong duality) are

gi(x∗) ≤ 0 (primal constraint)

hj(x∗) = 0 (primal constraint)

λ∗i ≥ 0 (dual constraint)

λ∗i gi(x∗) = 0 (complementary slackness)

∇f(x∗) + Σi λ∗i ∇gi(x∗) + Σj ν∗j ∇hj(x∗) = 0 (stationarity)


• For any optimization problem with differentiable objective (cost) and constraint functions for which strong duality holds, any pair of primal and dual optimal points satisfies the KKT conditions.

• For convex problems:

If f(x), gi(x) and hj(x) are convex and x̃, λ̃ and ν̃ satisfy the KKT conditions, then they are optimal points, i.e.,

- from complementary slackness, f(x̃) = L(x̃, λ̃, ν̃)

- from the last (stationarity) condition, ℓ(λ̃, ν̃) = L(x̃, λ̃, ν̃) = infx L(x, λ̃, ν̃). Note that L(x, λ̃, ν̃) is convex in x.

Thus,

f(x̃) = ℓ(λ̃, ν̃).

Example 18:

min (1/2)xTQx + cTx + r (Q : SPD)

s.t. Ax = b

Solution: From the KKT conditions,

Ax∗ = b

Qx∗ + c + ATν∗ = 0

which can be written as the linear system

[Q AT; A 0][x∗; ν∗] = [−c; b]
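For this equality-constrained QP the KKT conditions are a single linear system, so the solution can be sketched directly; the particular Q, A, b and c below are illustrative assumptions:

```python
import numpy as np

# Illustrative data: Q symmetric positive definite, A full row rank
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([1.0, -2.0])
A = np.array([[1.0, 1.0]])       # single equality constraint
b = np.array([1.0])

n, m = Q.shape[0], A.shape[0]
# Assemble the KKT block matrix [[Q, A^T], [A, 0]] and right-hand side [-c; b]
KKT = np.block([[Q, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([-c, b])
sol = np.linalg.solve(KKT, rhs)
x_star, nu_star = sol[:n], sol[n:]

# Check the KKT conditions directly
assert np.allclose(A @ x_star, b)
assert np.allclose(Q @ x_star + c + A.T @ nu_star, 0.0)
```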

Example 19:

Solution:


3.4.4 Solving The Primal Problem via The Dual

• If strong duality holds and a dual optimal solution (λ∗, ν∗) exists, then we can compute a primal optimal solution from the dual solution.

• When strong duality holds and (λ∗, ν∗) is given, if the minimizer of L(x, λ∗, ν∗), i.e., the solution of

min f(x) + Σi λ∗i gi(x) + Σj ν∗j hj(x)

is unique and primal feasible, then it is also primal optimal.

• If the dual problem is easier to solve (e.g., it has fewer dimensions or an analytical solution), then solving the dual problem to find the optimal dual parameters (λ∗, ν∗) first, and then solving

x∗ = argminx L(x, λ∗, ν∗)

is an acceptable method for solving constrained minimization problems.

• Example 20: Consider the following problem

min f(x) = Σi=1..N fi(xi)

s.t. aTx = b

where each fi(x) is strictly convex and differentiable, a ∈ RN and b ∈ R. Assume that the problem has a unique (finite) solution and that it is dual feasible.


- f(x) is separable because each fi(xi) is a function of xi only. The Lagrangian will be given as

L(x, ν) = Σi=1..N fi(xi) + ν(aTx − b)

= −bν + Σi=1..N (fi(xi) + νai xi)

Then the dual function is given by

ℓ(ν) = infx L(x, ν)

= −bν + infx Σi=1..N (fi(xi) + νai xi)

= −bν + Σi=1..N infxi (fi(xi) + νai xi)

= −bν − Σi=1..N f∗i(−νai)

where f∗i(y) is the conjugate function of fi(x), since infxi (fi(xi) + νai xi) = −f∗i(−νai).

NOTE: The conjugate function f∗(y) is the maximum gap between the linear function yTx and f(x) (see Boyd, Section 3.3). If f(x) is differentiable, this maximum occurs at a point x where ∇f(x) = y. Note that f∗(y) is always a convex function.

f∗(y) = supx∈dom f(x) (yTx − f(x))

Then the dual problem is a function of a scalar ν ∈ R:

maxν (−bν − Σi=1..N f∗i(−νai))

- Once we find ν∗, we know that L(x, ν∗) is strictly convex in x, since each fi(x) is strictly convex. So, we can find x∗ by solving ∇xL(x, ν∗) = 0, i.e., component-wise,

∂fi(x∗i)/∂xi = −ν∗ai.
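For the simplest separable case fi(xi) = xi²/2, the conjugate is fi∗(y) = y²/2, so ℓ(ν) = −bν − (ν²/2)‖a‖², maximized at ν∗ = −b/‖a‖², and the recovery step ∂fi/∂xi = xi = −ν∗ai gives the minimum-norm solution. A sketch (the data values are illustrative assumptions):

```python
import numpy as np

# Separable problem: min sum_i x_i^2 / 2  s.t.  a^T x = b
a = np.array([1.0, 2.0, -1.0])
b = 3.0

# For f_i(x) = x^2/2 the conjugate is f_i*(y) = y^2/2, so the scalar dual is
#   l(nu) = -b*nu - (nu^2 / 2) * ||a||^2,  maximized at nu* = -b / ||a||^2
nu_star = -b / (a @ a)

# Recover the primal point from d f_i / d x_i = x_i = -nu* a_i
x_star = -nu_star * a

assert np.isclose(a @ x_star, b)              # primal feasible
# Matches the minimum-norm solution of the equality constraint
assert np.allclose(x_star, b * a / (a @ a))
```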

3.4.5 Perturbation and Sensitivity Analysis

• Original Problem


primal:

min f(x)

s.t. g(x) ≤ 0

h(x) = 0

dual:

max ℓ(λ, ν)

s.t. λ ≥ 0

• Perturbed problem

primal:

min f(x)

s.t. g(x) ≤ u

h(x) = v

dual:

max ℓ(λ, ν) − λTu − νTv

s.t. λ ≥ 0

• Here u = [ui]L×1 and v = [vj]M×1 are called the perturbations. When ui = 0 and vj = 0, the problem becomes the original problem. If ui > 0, it means we have relaxed the i-th inequality constraint, and if ui < 0, it means we have tightened the i-th inequality constraint.

• Let us use the notation p∗(u, v) to denote the optimal value of the perturbed problem. Thus, the optimal value of the original problem is p∗ = p∗(0, 0).

• Assume that strong duality holds and the dual optimal solution exists, i.e., p∗ = ℓ(λ∗, ν∗). Then, we can show that

p∗(u, v) ≥ p∗(0, 0) − λ∗Tu − ν∗Tv

• If λ∗i is large and ui < 0, then p∗(u, v) is guaranteed to increase greatly.

• If λ∗i is small and ui > 0, then p∗(u, v) will not decrease too much.

• If |ν∗j| is large:

- If ν∗j > 0 and vj < 0, then p∗(u, v) is guaranteed to increase greatly.

- If ν∗j < 0 and vj > 0, then p∗(u, v) is guaranteed to increase greatly.

• If |ν∗j| is small:

- If ν∗j > 0 and vj > 0, then p∗(u, v) will not decrease too much.

- If ν∗j < 0 and vj < 0, then p∗(u, v) will not decrease too much.


• The perturbation inequality, p∗(u, v) ≥ p∗(0, 0) − λ∗Tu − ν∗Tv, gives a lower bound on the perturbed optimal value, but no upper bound. For this reason, the results are not symmetric with respect to loosening or tightening a constraint. For example, if λ∗i is large and we loosen the i-th constraint a bit (i.e., take ui small and positive, 0 < ui < ε), then the perturbation inequality is not useful: it does not imply that the optimal value will decrease considerably.

3.5 Constrained Optimization Algorithms

3.5.1 Introduction

• The general constrained optimization problem

min f(x)

s.t. g(x) ≤ 0

h(x) = 0

• Defn (Active Constraint): Given a feasible point x(k), if gi(x(k)) ≤ 0 is satisfied with equality, i.e., gi(x(k)) = 0, then constraint i is said to be active at x(k). Otherwise, it is inactive at x(k).


3.5.2 Primal Methods

See Luenberger Chapter 12.

• A primal method is a search method that works on the original problem directly, searching through the feasible region for the optimal solution. Each point in the process is feasible, and the value of the objective function constantly decreases.

For a problem with N variables and M equality constraints, primal methods work in the feasible space of dimension N − M.

Advantages:

- The iterates x(k) are all feasible points.

- If x(k) is a convergent sequence, it converges at least to a local minimum.

- They do not rely on special problem structure, e.g., convexity; in other words, primal methods are applicable to general non-linear problems.

Disadvantages:

- They must start from a feasible initial point.

- They may fail to converge for inequality constraints if precautions are not taken.


3.5.2.1 Feasible Direction Methods

• The update equation is

x(k+1) = x(k) + α(k)d(k)

d(k) must be a descent direction, and x(k) + α(k)d(k) must be contained in the feasible region, i.e., d(k) must be a feasible direction for some α(k) > 0.

This is very similar to unconstrained descent methods, but now the line search is constrained so that x(k+1) is also a feasible point.

• Example 21: Consider the following problem

min f(x)

s.t. aiTx ≤ bi, i = 1, . . . , M

- Let A(k) be the set of indices of the constraints active at x(k), i.e., aiTx(k) = bi for i ∈ A(k). Then the direction vector d(k) is calculated by

mind ∇Tf(x(k))d

s.t. aiTd ≤ 0, i ∈ A(k)

Σi=1..N |di| = 1

The last constraint ensures a bounded solution. The other constraints ensure that vectors of the form x(k) + α(k)d(k) will be feasible for sufficiently small α(k) > 0; subject to these conditions, d(k) is chosen to line up as closely as possible with the negative gradient of f(x(k)). In some sense this will result in the locally best direction in which to proceed. The overall procedure progresses by generating feasible directions in this manner, and moving along them to decrease the objective.
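The direction-finding subproblem is a linear program once |di| is handled by the standard split d = u − v with u, v ≥ 0; a sketch using scipy.optimize.linprog, where the gradient value and the active-constraint data are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linprog

def feasible_direction(grad_f, A_active):
    """Solve  min grad_f^T d  s.t. a_i^T d <= 0 (active i), sum|d_i| = 1.

    With the split d = u - v, u, v >= 0, the equality sum(u + v) = 1
    enforces the l1 normalization at a basic optimal solution.
    """
    n = len(grad_f)
    c = np.concatenate([grad_f, -grad_f])      # objective on (u, v)
    A_eq = np.ones((1, 2 * n))                 # sum(u) + sum(v) = 1
    b_eq = np.array([1.0])
    A_ub = np.hstack([A_active, -A_active]) if len(A_active) else None
    b_ub = np.zeros(len(A_active)) if len(A_active) else None
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None))
    u, v = res.x[:n], res.x[n:]
    return u - v

# Illustrative data: gradient [3, -1], one active constraint a = [0, 1],
# so the direction must satisfy d2 <= 0; the LP puts all weight on d1 = -1.
d = feasible_direction(np.array([3.0, -1.0]), np.array([[0.0, 1.0]]))
```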

There are two major shortcomings of feasible direction methods that require them to be modified in most cases.

• The first shortcoming is that for general problems there may not exist any feasible directions. If, for example, a problem has nonlinear equality constraints, we might find ourselves in the situation where no straight line from x(k) has a feasible segment. For such problems it is necessary either to relax our requirement of feasibility, allowing points to deviate slightly from the constraint surface, or to introduce the concept of moving along curves rather than straight lines.

• A second shortcoming is that, in their simplest form, most feasible direction methods are not globally convergent. They are subject to jamming (sometimes referred to as zigzagging), where the sequence of points generated by the process converges to a point that is not even a constrained local minimum point.


3.5.2.2 Active Set Methods

• The idea underlying active set methods is to partition the inequality constraints into two groups: those that are to be treated as active and those that are to be treated as inactive. The constraints treated as inactive are essentially ignored.

• Consider the following problem

min f(x)

s.t. g(x) ≤ 0

For simplicity, there are no equality constraints; the inclusion of equality constraints is straightforward.

Necessary conditions for optimum x∗ are

∇f(x∗) + λT∇g(x∗) = 0

g(x∗) ≤ 0

λTg(x∗) = 0

λ ≥ 0

• Let A be the set of indices of the active constraints (i.e., gi(x∗) = 0, i ∈ A). Then

∇f(x∗) + Σi∈A λi∇gi(x∗) = 0

gi(x∗) = 0, i ∈ A

gi(x∗) < 0, i ∉ A

λi ≥ 0, i ∈ A

λi = 0, i ∉ A

Inactive constraints are inhibited (i.e., λi = 0).

• By keeping only the active constraints in A, the problem is converted to an equality-constrained-only problem.

• Active set method:

- at each step, find the working set W and treat it as the active set (it can be a subset of the actual active set, i.e., W ⊆ A)

- move to a lower point on the surface of the working set

- find a new working set W for this point

- repeat

• The surface defined by the working set W will be called the working surface.


• Given any working set W ⊆ A, assume that xW is a solution to the problem PW

min f(x)

s.t. gi(x) = 0, i ∈ W

also satisfying gi(xW) < 0, i ∉ W. If such an xW cannot be found, change the working set W until a solution xW is obtained.

• Once xW is found, solve

∇f(xW) + Σi∈W λi∇gi(xW) = 0

to find the λi.

• If λi ≥ 0, ∀i ∈ W, then xW is a locally optimal solution of the original problem.

• If ∃i ∈ W such that λi < 0, then dropping constraint i (but staying feasible) will decrease the value due to the sensitivity theorem (relaxing constraint i to gi(x) = −c reduces f(x) by λic).

• By dropping i from W and moving on the new working surface (toward the interior of the feasible region F), we move to an improved solution.


• Monitor the movement to avoid infeasibility until one or more constraints become active, then add them to the working set W. Now, solve the changed problem PW again to find a new xW and repeat the previous steps.

• If we can ensure that the objective function (cost function) value is monotonically decreasing, then no working set will appear twice in the process. Hence the active set method terminates in a finite number of iterations.


Active Set Algorithm: For a given working set W

repeat

repeat

- minimize f(x) over the working set W using x(k+1) = x(k) + α(k)d(k)

- check whether a new constraint becomes active;

if so, add the new constraint to the working set W

until some stopping criterion is satisfied

check the Lagrange multipliers λi

- drop constraints with λi < 0 from the working set W

until some stopping criterion is satisfied

• Note that f(x) strictly decreases at each step.

• Disadvantage: The inner loop must terminate at a global optimum in order to determine correct λi and to guarantee that the same working set is not encountered in the following iterations.

• Discuss: How can we integrate equality constraints into the original problem?
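As a concrete illustration of the loop above, the sketch below applies it to the bound-constrained QP min ½xᵀQx + cᵀx s.t. x ≥ 0 (so gi(x) = −xi ≤ 0). This is a minimal sketch under these assumptions — not the general method — and the function name is illustrative:

```python
import numpy as np

def active_set_bound_qp(Q, c, x0, max_iter=100):
    """Primal active-set sketch for: min 0.5 x^T Q x + c^T x  s.t.  x >= 0.

    The constraints are g_i(x) = -x_i <= 0, so the working set W holds the
    indices whose variables are fixed at zero. Q is assumed symmetric
    positive definite (illustrative assumption, not from the notes).
    """
    n = len(c)
    x = np.asarray(x0, dtype=float).copy()
    W = {i for i in range(n) if x[i] == 0.0}
    for _ in range(max_iter):
        # Minimize on the working surface: x_i = 0 for i in W, free otherwise.
        free = [i for i in range(n) if i not in W]
        x_w = np.zeros(n)
        if free:
            x_w[free] = np.linalg.solve(Q[np.ix_(free, free)], -c[free])
        d = x_w - x
        # Largest step in [0, 1] that stays feasible; remember any blocker.
        alpha, blocking = 1.0, None
        for i in free:
            if d[i] < 0 and -x[i] / d[i] < alpha:
                alpha, blocking = -x[i] / d[i], i
        x = x + alpha * d
        if blocking is not None:
            W.add(blocking)                      # a new constraint became active
            continue
        # On the working-surface minimizer, the multipliers are (Qx + c)_i.
        lam = Q @ x + c
        if all(lam[i] >= -1e-10 for i in W):
            return x                             # KKT holds: lambda >= 0 on W
        W.remove(min(W, key=lambda i: lam[i]))   # drop most negative multiplier
    return x
```

Started at the origin with Q = I and c = (−1, 1), the sketch drops the first working constraint (its multiplier is −1 < 0), moves to x = (1, 0), and stops there since the remaining multiplier is +1 ≥ 0.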


3.5.2.3 Gradient Projection Method

• The Gradient Projection Method is an extension of the Gradient (or Steepest) Descent Method (GD or SD) to the constrained case.

• Let us first consider linear constraints

min f(x)

s.t. aTi x ≤ bi, i ∈ I1

aTi x = bi, i ∈ I2

Let us take the active constraints, i.e., aTi x = bi, as the working set W and seek a feasible descent direction:

find ∇Tf(x)d < 0

while satisfying aTi d = 0, i ∈ W

i.e., d must lie in the tangent plane defined by aTi d = 0, i ∈ W.

Hence, the above problem is the projection of −∇f(x) onto this tangent plane.

• Another perspective is to let the equality Ax = b represent all the active constraints aTi x = bi, i ∈ W. Thus

min f(x)

s.t. Ax = b

• If we use a first-order Taylor approximation around the point x(k), such that f(x) = f(x(k) + d) ≅ f(x(k)) + ∇Tf(x(k))d for small enough d, then we will have

min f(x(k)) + ∇Tf(x(k))d

s.t. A(x(k) + d) = b

dT I d ≤ 1

• Since Ax(k) = b and Ax(k+1) = b, this implies that Ad = 0.

• Thus, the problem simplifies to

min ∇Tf(x(k))d

s.t. Ad = 0

dT I d ≤ 1

- Ad = 0 defines the tangent plane M ⊆ RN and ensures that x(k+1) is still feasible.

- dT I d ≤ 1 is the Euclidean unit ball (‖d‖22 ≤ 1), so this is the projected GD algorithm.


- We may also use the constraint dTQd ≤ 1, where Q is an SPD matrix, to obtain the projected SD algorithm.

Projected Steepest Descent Algorithm (PSDA):

1. For a feasible initial point x(0) (i.e., x(0) ∈ F)

2. Solve the Direction Finding Problem (DFP)

d(k) = argmin ∇Tf(x(k))d

s.t. Ad = 0

dTQd ≤ 1, (Q : SPD)

3. If ∇Tf(x(k))d(k) = 0, stop. x(k) is a KKT point.

4. Solve α(k) = argmin f(x(k) + αd(k)) using a line search algorithm, e.g., exact or backtracking line search.

5. x(k+1) = x(k) + α(k)d(k)

6. Goto Step 1 with x(0) = x(k+1).

Projection:

• The DFP is another constrained optimization problem, which should satisfy its own KKT conditions, i.e.,

Ad(k) = 0

d(k)TQd(k) ≤ 1

∇f(x(k)) + 2βkQd(k) + ATλk = 0

βk ≥ 0

βk (1 − d(k)TQd(k)) = 0

- Here, Ad(k) = 0 defines the tangent plane M(k) ⊆ RN .

Absorbing the positive scaling factor 2βk into d(k) (and rescaling λk accordingly), the third condition gives

d(k) = −Q−1∇f(x(k)) − Q−1ATλk

Now, if we put this value into the first condition, we obtain

λk = −(AQ−1AT)−1AQ−1∇f(x(k))

Thus, d(k) is obtained as

d(k) = −[Q−1 − Q−1AT(AQ−1AT)−1AQ−1]∇f(x(k))
     = −P(k)∇f(x(k))

117

Page 119: ELE 604/ELE 704 Optimization - Hacettepe Universityusezen/ele604/lecture_notes.pdfELE 604/ELE 704 Optimization Hacettepe University Dr. Cenk Toker & Dr. Umut Sezen 27 December 2016

where

P(k) = Q−1 − Q−1AT(AQ−1AT)−1AQ−1

is called the projection matrix.

• Note that if Q = I, then

P(k) = I − AT(AAT)−1A

• As 2βk is just a scaling factor, we can safely write that the descent direction d(k) is given by

d(k) = −P(k)∇f(x(k))
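The projection can be checked numerically. The sketch below uses illustrative data (A, Q and the gradient vector are made up, not from the notes): it builds P(k) and verifies that d = −P(k)∇f lies in the tangent plane and decreases f to first order.

```python
import numpy as np

def projection_matrix(A, Q):
    """P = Q^{-1} - Q^{-1} A^T (A Q^{-1} A^T)^{-1} A Q^{-1}, Q symmetric pos. def."""
    Qinv = np.linalg.inv(Q)
    return Qinv - Qinv @ A.T @ np.linalg.inv(A @ Qinv @ A.T) @ A @ Qinv

# Illustrative data: two active linear constraints in R^4, Q = diag(1, 2, 3, 4).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 1.0]])
Q = np.diag([1.0, 2.0, 3.0, 4.0])
grad = np.array([1.0, -2.0, 0.5, 3.0])   # stands in for grad f(x^(k))

P = projection_matrix(A, Q)
d = -P @ grad                            # projected descent direction

assert np.allclose(A @ P, 0.0)           # range of P lies in the tangent plane
assert np.allclose(A @ d, 0.0)           # d keeps x^(k+1) feasible
assert grad @ d < 0                      # descent direction
```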

PSDA with DFP Algorithm: Given a feasible point x(k) (i.e., x(k) ∈ F)

1. Find the active constraint set W(k) and form the A (actually A(k)) matrix

2. Calculate

P(k) = Q−1 − Q−1AT(AQ−1AT)−1AQ−1

d(k) = −P(k)∇f(x(k))

3. If d(k) ≠ 0

a) Find α

α1 = max {α : x(k) + αd(k) is feasible}

α2 = argmin {f(x(k) + αd(k)) : 0 ≤ α ≤ α1}

b) x(k+1) = x(k) + α2 d(k)

c) Goto Step 1 (with x(k) = x(k+1))

4. If d(k) = 0

a) Find λ

λ = −(AQ−1AT)−1AQ−1∇f(x(k))

b) If λi ≥ 0, ∀i, stop. x(k) satisfies the KKT conditions.

c) If ∃λi < 0, delete the row from A corresponding to the inequality with the most negative component of λi and drop its index from W. Goto Step 2.

• Example 22: Consider the following problem

min x1² + x2² + x3² + x4² − 2x1 − 3x4

s.t. 2x1 + x2 + x3 + 4x4 = 7

x1 + x2 + 2x3 + x4 = 6

xi ≥ 0, i = 1, 2, 3, 4

Given the initial point x(0) = [2 2 1 0]T, find the initial direction d(0) for the projected gradient descent algorithm (PGDA).


- The active constraints are the two equalities and the inequality x4 ≥ 0, thus

A = [ 2 1 1 4
      1 1 2 1
      0 0 0 1 ]

Also, ∇f(x(0)) is given by

∇f(x(0)) = [2 4 2 −3]T

So,

AAT = [ 22 9 4
        9  7 1
        4  1 1 ]

(AAT)−1 = (1/11) [ 6   −5  −19
                   −5   6   14
                   −19  14  73 ]

Hence, the projection matrix P(0) is given by

P(0) = I − AT(AAT)−1A = (1/11) [ 1  −3  1  0
                                 −3  9 −3  0
                                 1  −3  1  0
                                 0   0  0  0 ]

Finally, the direction d(0) is given by

d(0) = −P(0)∇f(x(0)) = (1/11) [8 −24 8 0]T

which is a feasible descent direction, since Ad(0) = 0 and ∇Tf(x(0))d(0) = −64/11 < 0.
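The computation in Example 22 can be verified numerically; this sketch rebuilds A, ∇f(x(0)) and P(0) and evaluates d(0) = −P(0)∇f(x(0)), which works out to (1/11)[8 −24 8 0]T, a feasible descent direction:

```python
import numpy as np

# Data of Example 22 at x^(0) = [2, 2, 1, 0]^T.
A = np.array([[2.0, 1.0, 1.0, 4.0],      # 2x1 + x2 + x3 + 4x4 = 7 (active)
              [1.0, 1.0, 2.0, 1.0],      # x1 + x2 + 2x3 + x4 = 6 (active)
              [0.0, 0.0, 0.0, 1.0]])     # x4 >= 0 holds with equality (active)
x0 = np.array([2.0, 2.0, 1.0, 0.0])
grad = 2.0 * x0 - np.array([2.0, 0.0, 0.0, 3.0])   # [2x1-2, 2x2, 2x3, 2x4-3]

P = np.eye(4) - A.T @ np.linalg.inv(A @ A.T) @ A   # projection matrix (Q = I)
d0 = -P @ grad

assert np.allclose(11 * P, [[ 1, -3,  1, 0],
                            [-3,  9, -3, 0],
                            [ 1, -3,  1, 0],
                            [ 0,  0,  0, 0]])
assert np.allclose(11 * d0, [8.0, -24.0, 8.0, 0.0])
assert np.allclose(A @ d0, 0.0) and grad @ d0 < 0  # feasible descent direction
```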

Nonlinear constraints:

• Consider the problem which contains only the active constraints

min f(x)

s.t. h(x) = 0

• With linear constraints, x(k) lies on the tangent plane M. However, with nonlinear constraints, the surface defined by the constraints and the tangent plane M touch only at a single point.


• In this case the updated point must be projected onto the constraint surface. For the projected gradient descent method, the projection matrix is given by

P(k) = I − JTh(x(k)) (Jh(x(k)) JTh(x(k)))−1 Jh(x(k))

where Jh(x) is the Jacobian of h(x), i.e., Jh(x) = (∇hT(x))T.
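For instance, with the single nonlinear constraint h(x) = x1² + x2² − 1 = 0 (an illustrative choice, not from the notes), the Jacobian-based projection can be checked numerically:

```python
import numpy as np

def tangent_projection(J):
    """P = I - J^T (J J^T)^{-1} J: projection onto the tangent plane of h(x) = 0."""
    n = J.shape[1]
    return np.eye(n) - J.T @ np.linalg.inv(J @ J.T) @ J

# Illustrative constraint: h(x) = x1^2 + x2^2 - 1 = 0 (the unit circle).
x = np.array([0.6, 0.8])                  # a point on the constraint surface
J = np.array([[2 * x[0], 2 * x[1]]])      # 1 x 2 Jacobian of h at x

P = tangent_projection(J)
grad = np.array([1.0, -0.5])              # stands in for grad f(x)
d = -P @ grad

assert np.allclose(J @ d, 0.0)            # d lies in the tangent plane at x
assert np.allclose(P @ P, P)              # P is a projector (idempotent)
```

A step x + αd leaves the circle slightly, which is exactly why the updated point must then be projected back onto h(x) = 0.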

3.5.3 Equality Constrained Optimization

See Boyd Chapter 10.

min f(x)

s.t. Ax = b

where f(x) : RN → R is convex and twice differentiable, A ∈ RM×N , M < N .

The minimum occurs at

p∗ = inf {f(x) | Ax = b} = f(x∗)

where x∗ satisfies the KKT conditions

Ax∗ = b

∇f(x∗) + ATν∗ = 0


3.5.3.1 Quadratic Minimization

min (1/2) xTQx + cTx + r

s.t. Ax = b

where Q ∈ RN×N is a symmetric positive semidefinite (SPSD) matrix, A ∈ RM×N and b ∈ RM .

• Using the KKT conditions

Ax∗ = b
Qx∗ + c + ATν∗ = 0

which can be written in matrix form as

[ Q  AT ] [ x∗ ]   [ −c ]
[ A  0  ] [ ν∗ ] = [  b ]

where the matrix on the left is called the KKT matrix.

• The above equations define the KKT system for an equality-constrained quadratic optimization problem: (N + M) linear equations in the (N + M) variables (x∗,ν∗).
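The KKT system can be assembled and solved directly; the sketch below uses illustrative data (Q, c, A, b are made up):

```python
import numpy as np

def solve_eq_qp(Q, c, A, b):
    """Solve min 0.5 x^T Q x + c^T x + r  s.t.  Ax = b  via the KKT system.

    Builds the (N+M) x (N+M) KKT matrix [[Q, A^T], [A, 0]]; the constant r
    does not affect the minimizer.
    """
    N, M = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T],
                  [A, np.zeros((M, M))]])
    sol = np.linalg.solve(K, np.concatenate([-c, b]))
    return sol[:N], sol[N:]               # (x*, nu*)

# Illustrative problem: min 0.5 ||x||^2 - x1  s.t.  x1 + x2 + x3 = 3.
Q = np.eye(3)
c = np.array([-1.0, 0.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([3.0])

x_star, nu_star = solve_eq_qp(Q, c, A, b)
assert np.allclose(A @ x_star, b)                        # primal feasibility
assert np.allclose(Q @ x_star + c + A.T @ nu_star, 0.0)  # stationarity
```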

3.5.3.2 Eliminating Equality Constraints

• One general approach to solving the equality constrained problem is to eliminate the equality constraints and then solve the resulting unconstrained problem using methods for unconstrained minimization.

• We first find a matrix F ∈ RN×(N−M) and a vector x̄ that parametrize the (affine) feasible set:

{x | Ax = b} = {Fz + x̄ | z ∈ RN−M}.

Here x̄ can be chosen as any particular solution of Ax = b, and F ∈ RN×(N−M) is any matrix whose range (column space) is the nullspace of A, i.e., R(F) = N(A).

• We then form the reduced or eliminated optimization problem

min f̃(z) = f(Fz + x̄)

which is an unconstrained problem with variable z ∈ RN−M . From its solution z∗, we can find the solution of the equality constrained problem as

x∗ = Fz∗ + x̄.

• There are, of course, many possible choices for the elimination matrix F: it can be chosen as any matrix in RN×(N−M) whose range (column space) equals the nullspace of A, i.e., R(F) = N(A). If F is one such matrix and T ∈ R(N−M)×(N−M) is nonsingular, then F̂ = FT is also a suitable elimination matrix, since

R(F̂) = R(F) = N(A).


• Conversely, if F and F̂ are any two suitable elimination matrices, then there is some nonsingular T such that F̂ = FT. If we eliminate the equality constraints using F, we solve the unconstrained problem

min f(Fz + x̄)

while if F̂ is used, we solve the unconstrained problem

min f(F̂z̃ + x̄) = f(F(Tz̃) + x̄)

This problem is equivalent to the one above, and is simply obtained by the change of coordinates z = Tz̃. In other words, changing the elimination matrix can be thought of as changing variables in the reduced problem.
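A numerical sketch of the elimination approach for a quadratic objective (illustrative data; the SVD used to build F is just one of the many valid choices discussed above):

```python
import numpy as np

def nullspace_basis(A, tol=1e-12):
    """Return F with R(F) = N(A), built from the SVD of A."""
    _, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return Vt[rank:].T                    # columns span the nullspace of A

# Illustrative QP: min 0.5 ||x||^2 - x1  s.t.  x1 + x2 + x3 = 3.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([3.0])
Q = np.eye(3)
c = np.array([-1.0, 0.0, 0.0])

F = nullspace_basis(A)                          # N x (N - M), here 3 x 2
x_part = np.linalg.lstsq(A, b, rcond=None)[0]   # a particular solution of Ax = b

# Reduced problem in z: min 0.5 (Fz + x_part)^T Q (Fz + x_part) + c^T (Fz + x_part),
# whose optimality condition is (F^T Q F) z = -F^T (Q x_part + c).
z_star = np.linalg.solve(F.T @ Q @ F, -F.T @ (Q @ x_part + c))
x_star = F @ z_star + x_part

assert np.allclose(A @ F, 0.0)            # R(F) = N(A)
assert np.allclose(A @ x_star, b)         # recovered solution is feasible
assert np.allclose(x_star, [5/3, 2/3, 2/3])
```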


Example 23:

3.5.3.3 Newton's Method with Equality Constraints

• It is the same as the unconstrained Newton's Method, except that

- the initial point is feasible, x(0) ∈ F, i.e., Ax(0) = b.

- the Newton step ∆xnt is a feasible direction, i.e., A∆xnt = 0.

• In order to use Newton's Method on the problem

min f(x)

s.t. Ax = b

we can use the second-order Taylor approximation around x (actually x(k)) to obtain a quadratic minimization problem

min f(x + ∆x) = f(x) + ∇Tf(x)∆x + (1/2)∆xTH(x)∆x

s.t. A(x + ∆x) = b

This problem is convex if H(x) ⪰ 0.

• Using the results of quadratic minimization with equality constraints,

[ H(x)  AT ] [ ∆xnt ]   [ −∇f(x) ]
[ A     0  ] [ ν∗   ] = [    0   ]

where the matrix on the left is the KKT matrix.


Solution exists when the KKT matrix is non-singular.

• The same solution can also be obtained by setting x∗ = x + ∆xnt in the optimality condition equations of the original problem

Ax∗ = b

∇f(x∗) + ATν∗ = 0

as ∆xnt and ν∗ should satisfy the optimality conditions.

• The Newton decrement λ(x) is the same as the one used for the unconstrained problem, i.e.,

λ(x) = (∆xTnt H(x) ∆xnt)1/2 = ‖∆xnt‖H(x)

being the norm of the Newton step in the norm defined by H(x).

• λ2(x)/2, being a good estimate of f(x) − p∗ (i.e., f(x) − p∗ ≈ λ2(x)/2), can be used as the stopping criterion λ2(x)/2 ≤ ε.

• The Newton decrement also appears in the line search:

(d/dα) f(x + α∆xnt) |α=0 = ∇Tf(x)∆xnt = −λ2(x)

• In order for the Newton step ∆xnt to be a feasible descent direction, the following two conditions need to be satisfied:

∇Tf(x)∆xnt = −λ2(x)

A∆xnt = 0

• The solution for the equality constrained Newton's Method is invariant to affine transformations.
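A minimal sketch of the feasible-start iteration on a non-quadratic example (the objective, constraint and function names are illustrative; a fixed unit step is enough here, while in general a line search would be added):

```python
import numpy as np

def newton_eq(f_grad, f_hess, A, x0, eps=1e-10, max_iter=50):
    """Feasible-start Newton's method for min f(x) s.t. Ax = b.

    x0 must be feasible; every step satisfies A dx = 0, so all iterates stay
    feasible. Stops when lambda^2 / 2 <= eps (Newton decrement criterion).
    """
    x = np.asarray(x0, dtype=float).copy()
    M = A.shape[0]
    for _ in range(max_iter):
        g, H = f_grad(x), f_hess(x)
        K = np.block([[H, A.T], [A, np.zeros((M, M))]])
        dx = np.linalg.solve(K, np.concatenate([-g, np.zeros(M)]))[:len(x)]
        if dx @ H @ dx / 2 <= eps:        # lambda^2(x) / 2 <= eps: stop
            return x
        x = x + dx                        # unit Newton step
    return x

# Illustrative problem: min exp(x1) + exp(x2)  s.t.  x1 + x2 = 2, optimum (1, 1).
A = np.array([[1.0, 1.0]])
x_star = newton_eq(np.exp, lambda x: np.diag(np.exp(x)), A, np.array([1.9, 0.1]))
assert np.allclose(A @ x_star, 2.0)                  # iterates stayed feasible
assert np.allclose(x_star, [1.0, 1.0], atol=1e-6)
```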

3.5.3.4 Newton's Method with Equality Constraint Elimination

min f(x)

s.t. Ax = b    ≡    min f̃(z) = f(Fz + x̄)

with R(F) = N(A) and Ax̄ = b.

• The gradient and Hessian of f̃(z) are

∇f̃(z) = FT∇f(Fz + x̄)

H̃(z) = FTH(Fz + x̄)F

• Then, the KKT matrix is invertible iff H̃(z) is invertible.


• The Newton step of the reduced problem is

∆znt = −H̃−1(z)∇f̃(z) = −(FTH(x)F)−1FT∇f(x)

where x = Fz + x̄.

• It can be shown that

∆xnt = F∆znt

where ∆xnt is the Newton step of the original problem with equality constraints.

• The Newton decrement λ(z) is the same as the Newton decrement of the original problem:

λ2(z) = ∆zTnt H̃(z) ∆znt
      = ∆zTnt FT H(x) F ∆znt
      = ∆xTnt H(x) ∆xnt
      = λ2(x)

3.5.4 Penalty and Barrier Methods

See Luenberger Chapter 13.

• They approximate constrained problems by unconstrained problems.

• The approximation is obtained by adding

- a term with a high cost for violating the constraints (penalty methods)

- a term that favors points interior to the feasible region over those near the boundary (barrier methods)

to the objective function (cost function).

• The penalty and barrier methods work directly in the original N -dimensional space RN rather than in the (N − M)-dimensional space RN−M as in the case of the primal methods.

3.5.4.1 Penalty Methods

min f(x)

s.t. x ∈ F

• The idea is to replace this problem by the following penalty problem

min f(x) + cP (x)

where c ∈ R++ is a constant and P (x) : RN → R is the penalty function.

• Here

- P (x) is continuous


- P (x) ≥ 0 ∀x ∈ RN

- P (x) = 0 iff x ∈ F

• In general, the following quadratic penalty function is used

P(x) = (1/2) ∑i=1..L (max {0, gi(x)})2 + (1/2) ∑j=1..M (hj(x))2

where gi(x) ≤ 0 are the inequality constraints and hj(x) = 0 are the equality constraints.

• For example, consider P(x) = (1/2) ∑i=1..2 (max {0, gi(x)})2 with g1(x) = x − b and g2(x) = a − x.

• For large c, the minimum point of the penalty problem will be in a region where P (x) is small.

• For increasing c, the solution will approach the feasible region F, and as c → ∞ the penalty problem will converge to a solution of the constrained problem.

Penalty Method:

• Let {c(k)}, k = 1, 2, . . . be a sequence tending to ∞ such that ∀k, c(k) ≥ 0 and c(k+1) > c(k).

• Let

q(c,x) = f(x) + cP(x)

and for each k solve the penalty problem

min q(c(k),x)

obtaining a solution point x(k).

• Lemma: As k → ∞ with c(k+1) > c(k) (e.g., starting from an exterior point):

126

Page 128: ELE 604/ELE 704 Optimization - Hacettepe Universityusezen/ele604/lecture_notes.pdfELE 604/ELE 704 Optimization Hacettepe University Dr. Cenk Toker & Dr. Umut Sezen 27 December 2016

i. q(c(k),x(k)) ≤ q(c(k+1),x(k+1))

ii. P (x(k)) ≥ P (x(k+1))

iii. f(x(k)) ≤ f(x(k+1))

iv. f(x∗) ≥ q(c(k),x(k)) ≥ f(x(k))

• Theorem: Let {x(k)}, k = 1, 2, . . . be a sequence generated by the penalty method. Then, any limit point of the sequence is a solution of the original constrained problem.
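A one-dimensional sketch (illustrative problem, not from the notes): min (x − 2)² s.t. x ≤ 1, so g(x) = x − 1 and x∗ = 1. Minimizing the quadratic-penalty problem for increasing c drives the iterate toward x∗ from outside the feasible region; for this problem the minimizer of q(c, x) works out to (4 + c)/(2 + c).

```python
def penalty_min_1d(c, x0=2.0, lr=0.01, iters=20000):
    """Minimize q(c, x) = (x - 2)^2 + (c/2) * max(0, x - 1)^2 by gradient descent."""
    x = x0
    for _ in range(iters):
        grad = 2 * (x - 2) + c * max(0.0, x - 1)
        x -= lr * grad / (1 + c)        # damp the step as c grows, for stability
    return x

for c in [1.0, 10.0, 100.0, 1000.0]:
    x_c = penalty_min_1d(c)
    assert x_c > 1.0                               # exterior (infeasible) iterate
    assert abs(x_c - (4 + c) / (2 + c)) < 1e-6     # minimizer of q(c, x)
assert abs(penalty_min_1d(1000.0) - 1.0) < 5e-3    # approaches x* = 1 as c grows
```

The iterates behave exactly as the lemma predicts: f(x(k)) increases, P(x(k)) decreases, and x(k) reaches x∗ = 1 only in the limit c → ∞.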

3.5.4.2 Barrier Methods

Also known as interior methods.

min f(x)

s.t. x ∈ F

• The idea is to replace this problem by the following barrier problem

min f(x) + (1/c) B(x)

s.t. x ∈ interior of F

where c ∈ R++ is a constant and B(x) : RN → R is the barrier function.

• Here, the barrier function B(x) is defined on the interior of F such that

- B(x) is continuous

- B(x) ≥ 0

- B(x)→∞ as x approaches the boundary of F

• Ideally,

B(x) = { 0, x ∈ interior of F; ∞, x ∉ interior of F }

• There are several approximations. Two common barrier functions for the inequality constraints are given below:

Log barrier: B(x) = −∑i=1..L log(−gi(x))

Inverse barrier: B(x) = −∑i=1..L 1/gi(x)

• For example, consider B(x) = −1/g1(x) − 1/g2(x), with g1(x) = x − b and g2(x) = a − x.


Barrier Method:

• Let {c(k)}, k = 1, 2, . . . be a sequence tending to ∞ such that ∀k, c(k) ≥ 0 and c(k+1) > c(k).

• Let

r(c,x) = f(x) + (1/c) B(x)

and for each k solve the barrier problem

min r(c(k),x)

s.t. x ∈ interior of F

obtaining a solution point x(k).

• Lemma: As k → ∞ with c(k+1) > c(k) (starting from an interior point):

i. r(c(k),x(k)) ≥ r(c(k+1),x(k+1))

ii. B(x(k)) ≤ B(x(k+1))

iii. f(x(k)) ≥ f(x(k+1))

iv. f(x∗) ≤ f(x(k)) ≤ r(c(k),x(k))

• Theorem: Any limit point of a sequence {x(k)}, k = 1, 2, . . . generated by the barrier method is a solution of the original constrained problem.
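A one-dimensional sketch (illustrative problem, not from the notes): min (x − 2)² s.t. x ≤ 1, so g(x) = x − 1, L = 1 and p∗ = f(1) = 1, with the log barrier B(x) = −log(1 − x). The subproblem minimizers stay strictly inside the feasible region and approach x∗ = 1 from the interior:

```python
import math

def barrier_min_1d(c, lo=-10.0, hi=1.0 - 1e-12, iters=200):
    """Minimize r(c, x) = (x - 2)^2 - (1/c) * log(1 - x) over the interior x < 1
    by bisection on r'(x) = 2(x - 2) + 1/(c * (1 - x)), which is increasing."""
    def rprime(x):
        return 2 * (x - 2) + 1.0 / (c * (1 - x))
    for _ in range(iters):            # r' < 0 at lo, r' -> +inf as x -> 1
        mid = 0.5 * (lo + hi)
        if rprime(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for c in [1.0, 10.0, 100.0, 1000.0]:
    x_c = barrier_min_1d(c)
    assert x_c < 1.0                          # strictly interior iterate
    gap = (x_c - 2) ** 2 - 1.0                # f(x*(c)) - p*, with p* = 1
    assert 0 < gap <= 1.0 / c + 1e-9          # within the duality-gap bound L/c
```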


3.5.4.3 Properties of the Penalty & Barrier Methods

• The penalty method solves the unconstrained problem

min f(x) + cP (x)

• Let pi(x) = max {0, gi(x)} and p(x) = [pi]L×1.

Let the penalty function γ(y), where P(x) = γ(p(x)), be the Euclidean norm function

γ(y) = yTy ⇒ P(x) = pT(x)p(x) = ∑i=1..L (pi(x))2

or, more generally, the quadratic norm function

γ(y) = yTΓy ⇒ P(x) = pT(x)Γp(x)

• The Hessian of the above problem becomes more and more ill-conditioned as c → ∞.

• Defining

q(c,x) = f(x) + c γ(p(x))

the Hessian Q(c,x) is given by

Q(c,x) = F(x) + c∇Tγ(p(x)) G(x) + cJTp(x) Γ(p(x)) Jp(x)

where F(x), G(x) and Γ(x) are the Hessians of f(x), g(x) and γ(x) respectively, and Jp(x) is the Jacobian of p(x).

• If at x∗ there are r active constraints, then the Hessian matrices Q(c(k),x(k)) have r eigenvalues tending to ∞ as c(k) → ∞ and (N − r) eigenvalues tending to some finite value. In other words, the condition number goes to infinity (κ → ∞) as c(k) → ∞.

• Gradient Descent may not be directly applicable; instead, Newton's Method is preferred!

• The same observation also applies to the barrier method.

3.5.5 Interior-Point Methods

3.5.5.1 Logarithmic Barrier

See Boyd Chapter 11.

min f(x) + (1/c) (−∑i=1..L log(−gi(x)))

s.t. Ax = b


The minimum occurs at

p∗ = inf {f(x) | Ax = b} = f(x∗)

• The function

φ(x) = −∑i=1..L log(−gi(x))

with dom φ(x) = {x ∈ RN | gi(x) < 0, ∀i} is called the logarithmic barrier function.

• We will modify Newton's algorithm to solve the above problem. So, we will need

∇φ(x) = −∑i=1..L (1/gi(x)) ∇gi(x)

Hφ(x) = ∑i=1..L (1/gi²(x)) ∇gi(x)∇Tgi(x) − ∑i=1..L (1/gi(x)) Hgi(x)

where Hφ(x) = ∇2φ(x) and Hgi(x) = ∇2gi(x) are the Hessians of φ(x) and gi(x) respectively.

3.5.5.2 Central Path

• Consider the equivalent problem (c > 0)

min cf(x) + φ(x)

s.t. Ax = b

130

Page 132: ELE 604/ELE 704 Optimization - Hacettepe Universityusezen/ele604/lecture_notes.pdfELE 604/ELE 704 Optimization Hacettepe University Dr. Cenk Toker & Dr. Umut Sezen 27 December 2016

• The point x∗(c) is the solution. The trajectory of x∗(c) as a function of c is called the central path. Points on the central path satisfy the centrality conditions

Ax∗(c) = b

gi(x∗(c)) < 0, ∀i

c∇f(x∗(c)) + ∇φ(x∗(c)) + ATν = 0

for some ν ∈ RM .

• The last line can be rewritten as

c∇f(x∗(c)) − ∑i=1..L (1/gi(x∗(c))) ∇gi(x∗(c)) + ATν = 0

• Example 24: Inequality form linear programming. The logarithmic barrier function for an LP in inequality form

min eTx

s.t. Ax ≤ b

(where e ∈ RN , A ∈ RL×N and b ∈ RL are constants) is given by

φ(x) = −∑i=1..L log(bi − aTi x), dom φ(x) = {x | Ax < b}

where aT1 , . . . , aTL are the rows of A.

The gradient and Hessian of the barrier function are

∇φ(x) = ∑i=1..L (1/(bi − aTi x)) ai = ATd

Hφ(x) = ∑i=1..L (1/(bi − aTi x)2) ai aTi = AT(diag d)2A

where di = 1/(bi − aTi x).

Since x is strictly feasible, we have d > 0, so the Hessian of φ(x), Hφ(x), is nonsingular if and only if A has rank N , i.e., full rank.

The centrality condition is

c e + ATd = 0

131

Page 133: ELE 604/ELE 704 Optimization - Hacettepe Universityusezen/ele604/lecture_notes.pdfELE 604/ELE 704 Optimization Hacettepe University Dr. Cenk Toker & Dr. Umut Sezen 27 December 2016

We can give a simple geometric interpretation of the centrality condition. At a point x∗(c) on the central path the gradient ∇φ(x∗(c)), which is normal to the level set of φ(x) through x∗(c), must be parallel to e. In other words, the hyperplane eTx = eTx∗(c) is tangent to the level set of φ(x) through x∗(c). The figure below shows an example with L = 6 and N = 2.

The dashed curves in the previous figure show three contour lines of the logarithmic barrier function φ(x). The central path converges to the optimal point x∗ as c → ∞. Also shown is the point on the central path with c = 10. The optimality condition at this point can be verified geometrically: the line eTx = eTx∗(10) is tangent to the contour line of φ(x) through x∗(10).
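The gradient and Hessian formulas of Example 24 can be checked against finite differences (A and b are illustrative and describe a box):

```python
import numpy as np

def lp_barrier(A, b, x):
    """Value, gradient and Hessian of phi(x) = -sum_i log(b_i - a_i^T x)."""
    r = b - A @ x
    assert np.all(r > 0), "x must be strictly feasible"
    d = 1.0 / r
    # grad = A^T d, Hessian = A^T diag(d)^2 A
    return -np.log(r).sum(), A.T @ d, A.T @ (A * (d ** 2)[:, None])

# Illustrative data: L = 4 inequalities describing the box -1 < x_i < 1 in R^2.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.ones(4)
x = np.array([0.3, -0.2])
phi, grad, hess = lp_barrier(A, b, x)

h = 1e-6                                 # central finite differences
for k in range(2):
    e = np.zeros(2); e[k] = h
    num = (lp_barrier(A, b, x + e)[0] - lp_barrier(A, b, x - e)[0]) / (2 * h)
    assert abs(num - grad[k]) < 1e-5     # matches grad phi = A^T d
# A has rank N = 2 here, so the Hessian is (symmetric) positive definite.
assert np.allclose(hess, hess.T) and np.all(np.linalg.eigvalsh(hess) > 0)
```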

3.5.5.3 Dual Points from Central Path

• From the centrality condition, let

λ∗i(c) = −1/(c gi(x∗(c))), ∀i

ν∗(c) = ν/c

then

∇f(x∗(c)) + ∑i=1..L λ∗i(c)∇gi(x∗(c)) + ATν∗(c) = 0

Hence, from the KKT conditions, x∗(c) minimizes

L(x,λ,ν) = f(x) + λTg(x) + νT(Ax − b)

for λ = λ∗(c) and ν = ν∗(c) for a particular c, which means that (λ∗(c),ν∗(c)) is a dual feasible pair.


So, the dual function value ℓ(λ∗(c),ν∗(c)) is finite and given as

ℓ(λ∗(c),ν∗(c)) = f(x∗(c)) + ∑i=1..L λ∗i(c) gi(x∗(c)) + ν∗T(c)(Ax∗(c) − b)
             = f(x∗(c)) − L/c

since each term λ∗i(c) gi(x∗(c)) equals −1/c and Ax∗(c) − b = 0.

• The duality gap is L/c, and

f(x∗(c)) − p∗ ≤ L/c

goes to zero as c → ∞.

3.5.5.4 KKT Interpretation

• The central path (centrality) conditions can be seen as deformed KKT conditions, i.e.,

Ax∗ = b

gi(x∗) ≤ 0, ∀i

λ∗i ≥ 0, ∀i

−λ∗i gi(x∗) = 1/c, ∀i

∇f(x∗) + ∑i=1..L λ∗i∇gi(x∗) + ATν∗ = 0

satisfied by x∗(c), λ∗(c) and ν∗(c).

• Complementary slackness λigi(x) = 0 is replaced by −λ∗i gi(x∗) = 1/c.

� As c→∞, x∗(c), λ∗(c) and ν∗(c) almost satisfy the KKT optimality conditions.

3.5.5.5 Newton Step for Modi�ed KKT equations

• Let λi = −1/(c gi(x)); then

∇f(x) − ∑i=1..L (1/(c gi(x))) ∇gi(x) + ATν = 0

Ax = b

• To solve this set of (N + M) equations in the (N + M) variables x and ν, consider the nonlinear part of the first set of equations and linearize it around x:

∇f(x + d) − ∑i=1..L (1/(c gi(x + d))) ∇gi(x + d)

≅ ∇f(x) − ∑i=1..L (1/(c gi(x))) ∇gi(x)   (this term is g below)

+ H(x)d − ∑i=1..L (1/(c gi(x))) Hgi(x)d + ∑i=1..L (1/(c gi²(x))) ∇gi(x)∇Tgi(x)d   (this term is Hd below)

• Substituting back, we obtain

Hd + ATν = −g

Ad = 0

where

H = H(x) − ∑i=1..L (1/(c gi(x))) Hgi(x) + ∑i=1..L (1/(c gi²(x))) ∇gi(x)∇Tgi(x)

g = ∇f(x) − ∑i=1..L (1/(c gi(x))) ∇gi(x)

• Using the derivations of ∇φ(x) and Hφ(x),

H = H(x) + (1/c) Hφ(x)

g = ∇f(x) + (1/c) ∇φ(x)

• Let us represent the previous modified KKT equations in matrix form:

[ H  AT ] [ d ]   [ −g ]
[ A  0  ] [ ν ] = [  0 ]

whose solution gives the modified Newton step ∆xnt and ν∗nt:

[ cH(x) + Hφ(x)  AT ] [ ∆xnt ]   [ −c∇f(x) − ∇φ(x) ]
[ A              0  ] [ ν∗nt ] = [         0        ]

where ν∗nt = cν∗.

• Using this Newton step, the Interior-Point Method (i.e., the Barrier Method) can be constructed.

3.5.5.6 The Interior-Point Algorithm

The Interior-Point Method (Barrier Method):

Given a strictly feasible x ∈ F, c = c(0) > 0, µ > 1 and tolerance ε > 0

repeat

1. Centering step:


Compute x∗(c) by minimizing the modified barrier problem

x∗(c) = argmin (c f(x) + φ(x))

s.t. Ax = b

starting at x, using the modified Newton's Method.

2. Update: x = x∗(c)

3. Stopping criterion: quit if L/c < ε

4. Increase c: c = µ c
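A compact sketch of the whole method for the inequality-form LP min eᵀx s.t. Ax ≤ b (no equality constraints, so the centering Newton step is simply −H⁻¹g; the data at the bottom is illustrative):

```python
import numpy as np

def centering(e, A, b, x, c, tol=1e-9, max_iter=100):
    """Newton's method for the centering problem min c * e^T x + phi(x)."""
    def obj(z):
        r = b - A @ z
        return np.inf if np.any(r <= 0) else c * (e @ z) - np.log(r).sum()
    for _ in range(max_iter):
        d = 1.0 / (b - A @ x)
        g = c * e + A.T @ d                   # gradient: c*e + A^T d
        H = A.T @ (A * (d ** 2)[:, None])     # Hessian:  A^T diag(d)^2 A
        dx = np.linalg.solve(H, -g)
        if -(g @ dx) / 2 < tol:               # lambda^2 / 2 stopping criterion
            return x
        t = 1.0                               # backtracking line search; the inf
        while obj(x + t * dx) > obj(x) + 0.25 * t * (g @ dx):
            t *= 0.5                          # value also enforces b - Ax > 0
        x = x + t * dx
    return x

def barrier_method(e, A, b, x0, c0=1.0, mu=10.0, eps=1e-6):
    """Outer loop: center, tighten c, stop when the gap bound L/c < eps."""
    x, c, L = np.asarray(x0, dtype=float), c0, len(b)
    while L / c >= eps:
        x = centering(e, A, b, x, c)          # warm start at the previous center
        c *= mu
    return x

# Illustrative LP: min x1 + x2  s.t.  -x <= 0 (i.e., x >= 0), so p* = 0.
e = np.array([1.0, 1.0])
A = -np.eye(2)
b = np.zeros(2)
x_star = barrier_method(e, A, b, x0=np.array([1.0, 1.0]))
assert np.all(b - A @ x_star > 0)             # strictly feasible throughout
assert e @ x_star < 1e-4                      # within the L/c gap of p* = 0
```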

• Accuracy of centering:

- Computing x∗(c) exactly is not necessary, since the central path has no significance beyond the fact that it leads to a solution of the original problem as c → ∞; inexact centering will still yield a sequence of points x(k) that converges to an optimal point. Inexact centering, however, means that the points λ∗(c) and ν∗(c), computed from the first two equations given in the section titled "Dual Points from Central Path", are not exactly dual feasible. This can be corrected by adding a correction term to these formulae, which yields a dual feasible point provided the computed x is near the central path, i.e., near x∗(c).

- On the other hand, the cost of computing an extremely accurate minimizer of c f(x) + φ(x), as compared to the cost of computing a good minimizer, is only marginally more, i.e., a few Newton steps at most. For this reason it is not unreasonable to assume exact centering.

• Choice of µ:

- µ provides a trade-off in the number of iterations for the inner and outer loops.

- small µ: at each step the inner loop starts from a very good point, so few inner loop iterations are required, but too many outer loop iterations may be required.

- large µ: at each step c increases by a large amount, so the current starting point may not be a good point for the inner loop. Too many inner loop iterations may be required, but few outer loop iterations are required.

- In practice, small values of µ (i.e., near one) result in many outer iterations, with just a few Newton steps for each outer iteration. For µ in a fairly large range, from around 3 to 100 or so, the two effects nearly cancel, so the total number of Newton steps remains approximately constant. This means that the choice of µ is not particularly critical; values from around 10 to 20 or so seem to work well.

• Choice of c(0):

- large c(0): the first run of the inner loop may require too many iterations.

- small c(0): more outer-loop iterations are required.


- One reasonable choice is to choose c(0) so that L/c(0) is approximately of the same order as f(x(0)) − p∗, or µ times this amount. For example, if a dual feasible point (λ,ν) is known, with duality gap η = f(x(0)) − ℓ(λ,ν), then we can take c(0) = L/η. Thus, in the first outer iteration we simply compute a pair with the same duality gap as the initial primal and dual feasible points.

- Another possibility is to find the c(0) which minimizes

infν ‖c∇f(x(0)) + ∇φ(x(0)) + ATν‖2

Example 25:

Solution:


Example 26:

Solution:


3.5.5.7 How to start from a feasible point?

• The interior-point method requires a strictly feasible starting point x(0).

• If such a point is not known, a preliminary stage, Phase I, is run first.

Basic Phase I Method:


• Consider gi(x) ≤ 0, i = 1, 2, . . . , L and Ax = b.

• We always assume that we are given a point x(0) ∈ ∩i=1..L dom gi(x) satisfying Ax(0) = b.

• Then, we form the following optimization problem

min s

s.t. gi(x) ≤ s, i = 1, 2, . . . , L

Ax = b

in the variables x ∈ RN and s ∈ R.

s is a bound on the maximum infeasibility of the inequalities, and it is to be driven below zero.

• The problem is always feasible when we select s(0) ≥ maxi gi(x(0)) together with the given x(0) ∈ ∩i=1..L dom gi(x) with Ax(0) = b.

• Then, apply the interior-point method to solve the above problem. There are three cases depending on the optimal value p∗:

1. If p∗ < 0, then a strictly feasible solution is reached. Moreover, if (x, s) is feasible with s < 0, then x satisfies gi(x) < 0. This means we do not need to solve the optimization problem with high accuracy; we can terminate when s < 0.

2. If p∗ > 0, then there is no feasible solution.

3. If p∗ = 0 and the minimum is attained at x∗ with s∗ = 0, then the set of inequalities is feasible, but not strictly feasible. However, if p∗ = 0 and the minimum is not attained, then the inequalities are infeasible.
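For linear inequalities aiᵀx ≤ bi, the same idea — drive the maximum infeasibility s(x) = maxi (aiᵀx − bi) below zero, and stop as soon as s < 0 — can be sketched with a simple relaxation (projection) update rather than the full interior-point solve; this is only an illustrative stand-in for applying the interior-point method to the Phase I problem above:

```python
import numpy as np

def phase1_relaxation(A, b, x0, margin=0.1, max_iter=10000):
    """Find x with Ax < b (strictly) by repeatedly projecting onto the most
    violated halfspace, shifted inward by `margin`. Succeeds whenever some
    point satisfies every constraint with slack >= margin."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        s = A @ x - b                     # max(s) is the current infeasibility
        j = int(np.argmax(s))
        if s[j] < 0:
            return x                      # strictly feasible: stop early, just
                                          # as the notes stop once s < 0
        x = x - ((s[j] + margin) / (A[j] @ A[j])) * A[j]
    raise RuntimeError("no strictly feasible point found")

# Illustrative box 1 <= x1 <= 3, 0 <= x2 <= 2, written as Ax <= b.
A = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]])
b = np.array([-1.0, 3.0, 0.0, 2.0])
x_feas = phase1_relaxation(A, b, np.array([10.0, -5.0]))   # infeasible start
assert np.all(A @ x_feas < b)                              # strictly feasible
```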

Example 27:

Solution:
