

Introduction

Dimitar Dimitrov

Örebro University

May, 2011

1 / 81


The materials for this seminar are heavily based on the books by

S. Boyd and L. Vandenberghe [1]

P. E. Gill, W. Murray, and M. H. Wright [2]

J. Nocedal and S. J. Wright [3]

D. Bertsimas and J. N. Tsitsiklis [5]

R. Fletcher [6]

D. P. Bertsekas [8]

2 / 81


Topics addressed in the seminar (the topics covered today are depicted in blue)

Classification of optimization problems

A short review of some useful concepts from linear algebra and analysis

Optimality conditions: unconstrained; linear equality constraints; linear inequality constraints

Basic things we need to know about convex sets and functions

Some important convex optimization problems

Lagrange duality

Some applications of convex optimization

Algorithms for unconstrained optimization: gradient descent method; steepest descent method; Newton's method

Solving equality constrained optimization

Algorithms for inequality constrained optimization (we focus on LPs and QPs): active-set methods; interior-point methods

We will have many examples in Matlab that demonstrate the concepts we discuss on simple problems

3 / 81


“Nature optimizes. Physical systems tend to a state of minimum energy. The molecules in an isolated chemical system react with each other until the total potential energy of their electrons is minimized. Rays of light follow paths that minimize their travel time.” [3], pp. 1.

Mathematically speaking, optimization is the minimization (or maximization) of an objective function subject to constraints on its variables.

In the context of optimization, the identification of

a set of relevant variables (unknowns),

an objective function that depends on the variables, and reflects the properties of a system of study,

possible restrictions (constraints) on the variables,

is known as modeling. The construction of an appropriate model is the first (and sometimes the most important) step in the optimization process. If the model is too simplistic, it will not give useful insights into the practical problem. If it is too complex, it may be too difficult to solve. Even though this is a very important topic, we will not address it here (and most of the time, it will be assumed that the modeling stage has been completed).

4 / 81


Given a model of some “real-life” problem, our goal is to find values for the variables that optimize the objective, while accounting for the constraints that the variables should satisfy.

Mathematical optimization problem (standard form)

minimize_x f0(x),   equivalent to   maximize_x −f0(x)    (1)

subject to fi(x) ≤ 0, i = 1, . . . ,mi

hi(x) = 0, i = 1, . . . , me

x = (x1, . . . , xn) - optimization (decision) variables, n is a positive integer.

f0 : Rn → R ∪ {±∞} - objective function,

fi : Rn → R, i = 1, . . . ,mi - inequality constraint functions,

hi : Rn → R, i = 1, . . . , me - equality constraint functions.

Since the minimization of f0(x) is equivalent to the maximization of −f0(x), we will consider only minimization from now on (with small exceptions, when we deal with Lagrangian duality theory). When there is no possibility of confusion, we will denote the objective function by f.

5 / 81


Notation

R, R+ and R++ denote the sets of real, nonnegative real, and positive real numbers

Rn and Rm×n denote the sets of real n-vectors and real m × n matrices

Sn denotes the set of real symmetric n × n matrices, while Sn+ and Sn++ stand for the sets of real positive semidefinite, and positive definite matrices (of course they are assumed to be symmetric)

The inequality sign in x ≥ y (where x, y ∈ Rn) is to be interpreted as component-wise inequality, i.e., xi ≥ yi for i = 1, . . . , n

f : Rn → R, for example, means that f is a scalar-valued function on some subset of Rn, which we call its domain, and denote dom(f). For example, consider the natural logarithm log : R → R, with domain dom(log) = R++

Sometimes we use x = (x1, . . . , xn) as a shorthand notation for the column vector x = [x1; . . . ; xn]

Since we consider minimization of a function, if not stated otherwise, henceforth we assume that f0(x) > −∞ for all feasible x, and in addition, f0(x) < ∞ for at least one feasible x. Functions that satisfy these two properties are called proper [7], pp. 15.

6 / 81


Domain of the optimization problem

minimize_x f0(x)

subject to fi(x) ≤ 0, i = 1, . . . ,mi

hi(x) = 0, i = 1, . . . ,me

We denote the domain of the above minimization problem by

D = ( ⋂_{i=0}^{mi} dom(fi) ) ∩ ( ⋂_{i=1}^{me} dom(hi) )

Feasible point

A point x ∈ D is called feasible if it satisfies all the inequality and equality constraints. Hence, a necessary condition for having at least one solution is

D ∩ {x : fi(x) ≤ 0, i = 1, . . . , mi, hi(x) = 0, i = 1, . . . , me} ≠ ∅

Sometimes we will call

{x : fi(x) ≤ 0, i = 1, . . . , mi, hi(x) = 0, i = 1, . . . , me}

the feasible set. In such cases we assume that D = Rn.

7 / 81


Typical characteristics of the objective function and constraints used to classify optimization problems [2], pp. 4.

Objective function

Function of a single variable

Linear function

Sum of squares of linear functions

Quadratic function

Sum of squares of nonlinear functions

Smooth nonlinear function

Sparse nonlinear function

Non-smooth nonlinear function

Constraints

No constraints

Simple bounds

Linear functions

Sparse linear functions

Smooth nonlinear functions

Sparse nonlinear functions

Non-smooth nonlinear functions

Other features can also be used to distinguish between optimization problems

size of the problem (small, medium, large scale),

information available (objective function value, derivatives of the objectivefunction),

desired accuracy of the solution (this may determine the choice of optimizationalgorithm to be used),

and (most importantly) whether the problem is convex or not.

8 / 81


“... in fact, the great watershed in optimization is not between linearity and nonlinearity, but between convexity and non-convexity.” Rockafellar.

Convex optimization problem (standard form)

minimize_x f0(x)

subject to fi(x) ≤ 0, i = 1, . . . ,mi

Cx = d, C ∈ Rme×n,

where fi(x), i = 0, . . . , mi are convex functions. Note that the equality constraints of a convex optimization problem can only be linear (actually affine) functions of x.

We will properly define convex functions and sets later on. Some examples of convex problems are:

Linear program (LP) in inequality form

minimize_{x∈Rn} f0(x) = gᵀx

subject to Ax ≤ b, A ∈ Rmi×n

Cx = d, C ∈ Rme×n

Quadratic program (QP)

minimize_{x∈Rn} f0(x) = (1/2)xᵀHx + xᵀg

subject to Ax ≤ b, A ∈ Rmi×n

Cx = d, C ∈ Rme×n

9 / 81


Linear inequality constraints

[Figure: the hyperplane aᵀx = b in the (x1, x2) plane with normal vector a, separating the half-spaces aᵀx ≤ b and aᵀx ≥ b; together with other such half-spaces it bounds the polyhedron P.]

10 / 81


Example (linear inequality constraints)

[Figure: the polyhedron P in the (x1, x2) plane defined by Ax ≤ b, with

A = [−1 −1; −1 0; 0 −1; 1 0; 0 1],  b = [−2; 0; 0; 3; 3].]
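To make this concrete, here is a minimal Matlab sketch (assuming linprog from the Optimization Toolbox is available) that minimizes a linear cost over this polyhedron; the cost vector g below is a hypothetical choice, used only for illustration.

% The polyhedron P above: A*x <= b
A = [-1 -1; -1 0; 0 -1; 1 0; 0 1];
b = [-2; 0; 0; 3; 3];
g = [1; 2];              % hypothetical cost vector, not from the slides
x = linprog(g, A, b)     % minimizes g'*x over P; the solution is a vertex of P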

11 / 81


Example (ball hanging on a spring subject to constraints)

[Figure: the feasible set {x : Ax ≤ b} with

A = [−1 −1; 1 −1; 0.1 0.8],  b = [1; 3; 0.4],

i.e., the region bounded by the lines −x1 − x2 = 1, x1 − x2 = 3 and 0.1x1 + 0.8x2 = 0.4; the spring is attached at (0, 0) and the mass m hangs near (1, −2).]

Consider the following 2D mass-spring-damper model (x = (x1, x2))

[m 0; 0 m] ẍ + [c 0; 0 c] ẋ + [k 0; 0 k] x = [0; mg],

where m, k, c > 0 denote mass, spring coefficient and damper coefficient, respectively, and g = −9.81 m/s² is the acceleration due to gravity. We want to find the position at rest (i.e., ẋ = ẍ = 0) when x is subject to constraints of the form Ax ≤ b.

12 / 81


Example (ball hanging on a spring subject to constraints)

If we temporarily disregard the constraints, the point at rest is simply the point where the force due to the spring (kx) and the force due to the weight of the mass (mg) are in balance, i.e., (x1 = 0, x2 = mg/k), which is simply the unique solution of

H x = [0; mg],  with H = [k 0; 0 k].

For some values of m and k, the above solution would even satisfy the constraints. However, in general it would not, so how do we approach this problem then? Recall “Nature optimizes. Physical systems tend to a state of minimum energy ...”. Essentially, “nature” comes up with a solution that minimizes the potential energy in the system.

Find the position at rest = minimize potential energy in system, subject to Ax ≤ b

P = (k/2)(x1² + x2²) − mgx2 = (1/2)xᵀHx + xᵀg,  where g = [0; −mg];

the first term is the energy stored in the spring and the second is due to gravity.

13 / 81


Example (ball hanging on a spring subject to constraints)

[Figure omitted.] Figure: The figure depicts the level curves of the objective function (k = 4 N/m, m = 0.8 kg), its gradient −∇f(x⋆) evaluated at x⋆, and the gradient of the active constraint aᵀ1x = b1; the unconstrained minimizer x⋆unc = (0, −1.96) is also marked. Note that the “geometry of the mass” is not considered.

minimize_{x∈R2} f(x) = (1/2)xᵀHx + xᵀg
subject to Ax ≤ b, A ∈ R3×2
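A minimal Matlab sketch of this QP (assuming quadprog from the Optimization Toolbox is available), with H, g, A and b assembled from the data of the previous slides:

% Ball on a spring: k = 4 N/m, m = 0.8 kg, g = -9.81 m/s^2
k = 4; m = 0.8; grav = -9.81;
H = k*eye(2);                  % Hessian of the potential energy
g = [0; -m*grav];              % linear term: g = (0, -m*grav) = (0, 7.848)
A = [-1 -1; 1 -1; 0.1 0.8];    % the three inequality constraints A*x <= b
b = [1; 3; 0.4];
x_unc = -(H\g)                 % unconstrained rest position (0, -1.962), infeasible here
[x, fval, ~, ~, lam] = quadprog(H, g, A, b);
x                              % constrained rest position, approx. (0.48, -1.48)
lam.ineqlin                    % only the first constraint is active (multiplier approx. 1.92)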

14 / 81


Example (unconstrained problem - projection onto a subspace)

[Figure omitted.] Figure: Projection of vector y on the subspace (plane with normal n) spanned by a1 and a2; p is the projection and e = y − p the error.

Least squares problem (A ∈ Rm×n, m > n, rank(A) = n)

minimize_{x∈Rn, e∈Rm} ‖e‖₂²
subject to e = Ax − y
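In Matlab this least-squares problem is solved directly by the backslash operator; a small sketch with assumed data (the A and y below are arbitrary illustrations):

A = [1 0; 0 1; 1 1];      % assumed tall matrix: m = 3 > n = 2, rank(A) = 2
y = [1; 2; 0];            % assumed vector to be projected
x = A \ y                 % minimizes ||A*x - y||_2^2
x_ne = (A'*A) \ (A'*y)    % same minimizer via the normal equations
e = A*x - y;              % residual (note e = -(y - p), with p = A*x)
norm(A'*e)                % numerically zero: the residual is orthogonal to range(A)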

15 / 81


Example (a transportation problem [3], pp. 4)

Problem description

A company has 2 factories F1 and F2, where factory Fi can provide at most ai tons of product each week

There are 5 retail outlets R1, . . . , R5, where each retail outlet Rj has a known weekly demand of at least bj tons of product

The cost of shipping one ton of the product from a factory Fi to retail outlet Rj is gij EUR.

Determine how much of the product to ship from each factory to each outlet so as to satisfy all the requirements and minimize the cost.

Mathematical formulation

The variables of the problem are xij , i = 1, 2, j = 1, . . . , 5 (hence, x ∈ R10).

minimize ∑_{ij} gij xij   (xij is tons of product shipped from Fi to Rj per week)
subject to ∑_{j=1}^{5} xij ≤ ai, i = 1, 2   (limited resources)
∑_{i=1}^{2} xij ≥ bj, j = 1, . . . , 5   (at least equal to the demand)
xij ≥ 0, i = 1, 2, j = 1, . . . , 5   (we do not want to receive product)
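A Matlab sketch of this LP with assumed data (the capacities a, demands b and costs gij below are hypothetical; linprog is assumed available):

% Assumed data: 2 factories, 5 outlets
a = [60; 50];                 % tons/week available at F1, F2
bd = [15; 20; 10; 25; 12];    % minimum weekly demand at R1,...,R5
G = [3 5 4 6 9; 5 4 7 3 4];   % G(i,j): cost of shipping one ton Fi -> Rj
g = reshape(G', [], 1);       % variables ordered x = (x11..x15, x21..x25)
Acap = [ones(1,5) zeros(1,5); zeros(1,5) ones(1,5)];   % sum_j xij <= ai
Adem = -[eye(5) eye(5)];                               % sum_i xij >= bj
x = linprog(g, [Acap; Adem], [a; -bd], [], [], zeros(10,1), []);
X = reshape(x, 5, 2)'         % X(i,j): tons shipped from Fi to Rj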

16 / 81


Example (a transportation problem)

[Figure: a bipartite graph with factories F1, F2 on one side and retail outlets R1, . . . , R5 on the other; each arc Fi → Rj carries the shipping cost gij (g11 and g25 are labeled).]

17 / 81


Example (a problem involving absolute values [5], pp. 17)

Problems involving norms are another example of convex optimization problems. Here we consider the “simplest” example of a norm, i.e., the absolute value on R.

minimize_{x∈Rn} ∑_{i=1}^{n} gi|xi|    (2)
subject to Ax ≤ b.

The objective function is non-differentiable (it is the sum of piecewise linear convex functions). We can formulate it as an LP by observing that |xi| is the smallest number zi that satisfies −zi ≤ xi ≤ zi.

minimize_{x,z∈Rn} ∑_{i=1}^{n} gi zi,  where z = (z1, . . . , zn)    (3)
subject to Ax ≤ b,
xi ≤ zi, i = 1, . . . , n
−xi ≤ zi, i = 1, . . . , n.

Note that problems (2) and (3) are equivalent (for a definition of equivalence see [1], pp. 130). It is interesting to observe that after the reformulation, the objective function is differentiable.

Later on, we will see how this simple technique can be applied to a variety of problems, e.g., optimal control of linear systems, robust (to outliers or noise) estimation, sparse signal reconstruction, etc.
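A Matlab sketch that assembles the LP (3) in the stacked variable (x, z) and solves it with linprog (assumed available); the data gi, A and b below are hypothetical:

n = 3;
gc = [1; 2; 1];                % assumed weights g_i > 0
A = [1 1 1];  b = -1;          % assumed constraint: x1 + x2 + x3 <= -1
f = [zeros(n,1); gc];          % cost over (x, z): sum_i g_i z_i
Aineq = [A zeros(1,n); eye(n) -eye(n); -eye(n) -eye(n)];  % Ax<=b, x<=z, -x<=z
bineq = [b; zeros(2*n,1)];
xz = linprog(f, Aineq, bineq);
x = xz(1:n), z = xz(n+1:end)   % at the solution, z_i = |x_i|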

18 / 81


Example (optimization in control of linear systems [5], pp. 20)

Consider a dynamical system that evolves according to a model of the form

xk+1 = Axk + Buk
yk = cᵀxk,

where xk is the state of the system at discrete time k, yk is the system output (assumed to be scalar for simplicity), and uk is the control vector that we are free to choose subject to linear constraints of the form Dkuk ≤ dk. One possible problem is to choose the values of the control variables u0, . . . , uN−1 to drive the state xN to a target state. In addition, it is often desired to keep the magnitude of the output small at all intermediate times. We wish to minimize

max_{k=1,...,N} |yk|.

The solution of the following LP accomplishes the above objective

minimize_{z,u0,...,uN−1} z
subject to −z ≤ yk ≤ z, k = 1, . . . , N
xk+1 = Axk + Buk, k = 0, . . . , N − 1
yk = cᵀxk, k = 1, . . . , N
Dkuk ≤ dk, k = 0, . . . , N − 1
xN = target state,
x0 is given.

19 / 81


Example (a nonconvex problem - just for fun)

Problem description

Find a point x⋆ whose distance from the origin is minimized and

x⋆ ∉ {x ∈ Rn : Ax < b}.

Note that the complement of {x ∈ Rn : Ax < b} is not a convex set.

[Figure: the set {x ∈ R2 : Ax < b} containing the origin (0, 0); the minimizer x⋆ lies on its boundary.]

20 / 81


Example (a nonlinear and non-convex problem)

Problem description

The forward geometric model of a manipulator system can be expressed as f(q) = p, where f : Rn → Rm is a nonlinear (vector valued) function of the joint angles q, and p ∈ Rm is (for simplicity, only) the end-effector Cartesian position. Find q⋆ such that f(q⋆) = pdes, where pdes is a desired position for the end-effector.

Consider the following recursion for k = 0, 1, . . .

J(q(k)) q̇ = −λ c(q(k)),

where λ > 0 and

J(q(k)) = ∂f(q)/∂q |_{q=q(k)},  c(q(k)) = f(q(k)) − pdes.

Update q using

q(k+1) = q(k) + α q̇,  α > 0.

[Figure: a three-joint planar manipulator (joint angles q1, q2, q3) with end-effector at f(q(0)); the vector −λc points from f(q(0)) toward pdes.]
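A Matlab sketch of this recursion for a hypothetical two-link planar arm (the link lengths, target and gains below are all assumptions, chosen only for illustration):

l1 = 1; l2 = 1;                % assumed link lengths
f = @(q) [l1*cos(q(1)) + l2*cos(q(1)+q(2));
          l1*sin(q(1)) + l2*sin(q(1)+q(2))];      % forward geometric model
J = @(q) [-l1*sin(q(1))-l2*sin(q(1)+q(2)), -l2*sin(q(1)+q(2));
           l1*cos(q(1))+l2*cos(q(1)+q(2)),  l2*cos(q(1)+q(2))];
pdes = [1.2; 0.8];             % assumed (reachable) desired position
q = [0.1; 0.5]; lambda = 1; alpha = 0.5;   % assumed start and gains
for k = 1:50
    c = f(q) - pdes;           % c(q) = f(q) - pdes
    qdot = J(q) \ (-lambda*c); % solve J(q)*qdot = -lambda*c(q)
    q = q + alpha*qdot;        % q(k+1) = q(k) + alpha*qdot
end
norm(f(q) - pdes)              % close to zero after convergence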

21 / 81


Example (a problem with non-smooth constraints - integer programming)

The zero-one knapsack problem [5], pp. 453

We are given n items; the jth item has weight wj and its value is gj. Given a bound W on the weight that can be carried in the knapsack, we would like to select items to maximize the total value (of the items in the knapsack).

Modeling of the problem

We define a binary variable xj which is 1 if the jth item is chosen and 0 otherwise. The problem can be formulated as follows

maximize_{x∈Rn} ∑_{j=1}^{n} gj xj
subject to ∑_{j=1}^{n} wj xj ≤ W,
xj ∈ {0, 1}, j = 1, . . . , n.

Problems with both continuous and binary (or integer) variables are called mixed integer programming problems. A common solution strategy for such problems is to solve a sequence of “relaxed” continuous problems, where one common “relaxation” is to simply replace the integrality constraints with 0 ≤ xj ≤ 1, j = 1, . . . , n.

Note that an integer programming problem is a special case of a nonlinear programming problem, since the constraint xj ∈ {0, 1} can be expressed as xj(1 − xj) = 0.
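A Matlab sketch of the continuous relaxation under assumed data; replacing xj ∈ {0, 1} by 0 ≤ xj ≤ 1 turns the problem into an LP (linprog assumed available):

w = [2; 3; 4; 5]; gv = [3; 4; 5; 6]; W = 6;   % assumed weights, values, bound
n = numel(w);
% linprog minimizes, so negate the values to maximize sum_j g_j*x_j
x = linprog(-gv, w', W, [], [], zeros(n,1), ones(n,1));
x'          % relaxed solution; possibly fractional
gv'*x       % an upper bound on the optimal knapsack value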

22 / 81


Conventions [7], pp. 13

Let us denote the set of vectors satisfying the constraints of a given optimization problem by S ⊆ Rn (this is the set of all feasible points). What exactly do we mean by solving the problem

minimize_{x∈S} f(x) ?

In order to answer the above question, we define the following two operations

f⋆ := infimum_{x∈S} f(x),
S⋆ := argminimum_{x∈S} f(x), or simply argmin_{x∈S} f(x).

The former operation defines the infimum value of the function f over the set S. The latter one defines the set of solutions to the problem at hand (S⋆ ⊆ S). Note that S⋆ is not empty if and only if f⋆ is attained at some point x⋆ ∈ S. In this case we can write

f(x⋆) = f⋆ = minimum_{x∈S} f(x).

23 / 81


Example

Consider the problem of minimizing f(x) defined by

f(x) := 1/x if x > 0, and +∞ otherwise,

over S = {x ∈ R : x ≥ 0}. Here the infimum f⋆ = 0, however, S⋆ = ∅, and the value 0 is not attained for a finite x. This problem has a finite infimum, but not a solution.

The above example motivates the following convention when interpreting the statement

minimize_{x∈S} f(x).    (4)

“Solving (4)” means: find f⋆, and then x⋆ ∈ S⋆, or conclude that S⋆ = ∅.

Abusing the above convention

In practice, when we have to solve the general nonlinear problem (1), finding an actual solution turns out to be a very difficult problem. So, in some cases (in practice) we label a feasible point x as a “solution” if it “seems to be reasonable”...

24 / 81


Local and global minima

In general, we would be happy if we find a global minimizer of f. This is a point where the function attains its least value, or formally

Global minimizer

A point x⋆ is a global minimizer if f(x⋆) ≤ f(x) for all feasible x.

A global minimizer can be difficult to find because our knowledge of f is usually only local. Since our algorithm does not visit many points (we hope!), we usually do not have a good picture of the overall shape of f, and we can never be sure that the function does not take a sharp dip in a region that has not been sampled by the algorithm. Most algorithms are able to find only a local minimizer ([3], pp. 12), which is a point that achieves the lowest value of f in its neighborhood, or formally

Weak local minimizer

A point x⋆ is a weak local minimizer if there is an open set (i.e., a neighborhood) N of x⋆, such that f(x⋆) ≤ f(x) for all x ∈ N.

Strict (strong) local minimizer

A point x⋆ is a strict local minimizer if there is an open set N of x⋆, such that f(x⋆) < f(x) for all x ∈ N and x ≠ x⋆.

25 / 81


Local and global minima

[Figure: a univariate function exhibiting a strong local minimum, weak local minima, and the global minimum.]

Figure: Examples of minima in the univariate case. Note that it is possible to have multiple global minimizers as well (of course their corresponding function values would be the same; recall the manipulator example).

26 / 81


Global minima do not necessarily exist

Note that in general, it is possible that there may exist none of these types of minima. In particular, f(x) may be unbounded below (in the feasible region), e.g., f(x1, x2) = x1 + x2³. Even if the function is bounded below, its infimum may occur at a limit as ‖x‖ approaches infinity, e.g., f(x) = e−x [2], pp. 60.

[Figure: the graph of f(x) = e−x, which is bounded below by 0 but does not attain its infimum.]

27 / 81


Closed, open, bounded sets ... (just a reminder [8], pp. 667)

Let X be a subset of Rn. x is a closure point of X if there exists a sequence {xk} ⊂ X that converges to x. The closure of X is the set of all closure points of X. For example, the two sets Bc = {x ∈ R2 : ‖x‖2 ≤ 1} and Bo = {x ∈ R2 : ‖x‖2 < 1} have the same closure, i.e., all points contained in the interior and on the boundary of the unit ball in R2.

X is called

closed if it is equal to its closure,

open if its complement {x : x ∉ X} is closed,

bounded if there exists a scalar c such that ‖x‖ ≤ c for all x ∈ X ,

compact if it is closed and bounded.

A neighborhood of a vector x is an open set containing x (e.g., Bo).

x is an interior point of X if there exists a neighborhood of x that is contained in X.

A vector x ∈ X which is not an interior point of X is said to be a boundary point of X.

The set of all boundary points of X is called the boundary of X .

28 / 81


To summarize, let S ⊆ Rn be the feasible set of (4). The reason for nonexistence of global minima is often

unboundedness of S

lack of closedness of S

lack of continuity of f(x) on S

Weierstrass’ theorem

Let S be a nonempty compact (i.e., closed and bounded) subset of Rn and let f : Rn → R be a continuous function on S; then f(x) has at least one global minimum (maximum) point.

A more elaborate version of the above theorem can be found in [8], pp. 669, where the boundedness of S can be removed from the assumptions if additional conditions are imposed on f. Furthermore, a weaker condition than continuity of f is sufficient when dealing with minimization problems, i.e., f need only be lower semi-continuous (see [7], pp. 79 for a definition and graphical interpretation).

Implications

It is important to notice that the closedness of S is crucial. If S is not closed, a sequence generated in S may converge to a point outside of S (recall the example of minimizing f(x) = 1/x, x > 0). This is the reason why the inequality constraints for our optimization problem are not strict (i.e., ≤ and not <)!

29 / 81


Consider the following “LP look-alike” problem

minimize_{x∈R2} gᵀx    (5)
subject to −1 < xi < 1, i = 1, 2.

[Figure: the open square −1 < xi < 1 (blue dashed), the cost vector g, the solution x⋆LP = (1, 1) of the closed problem, and a sequence of iterates (blue dots) approaching the boundary from the interior.]

The solution of the problem minimize gᵀx, subject to −1 ≤ xi ≤ 1, i = 1, 2, is denoted by x⋆LP = (1, 1). The problem (5) does not have a solution, since we can get arbitrarily close to the boundary of the feasible region S (the interior of the square depicted with the blue dashed line), but we cannot reach it. The closer we get to the boundary, the lower the function value would be. We can think of the blue dots as a sequence {xk} starting in (the open set) S whose limit point is outside of S.

As we will see later on (when we study interior-point methods), the blue dots are the iterates of an algorithm that we can use to find an approximate solution to an LP. It is interesting to note that this algorithm is actually “trying to solve” problem (5) by gradually improving its approximation of −1 < xi < 1, i = 1, 2 (see the red contours in the figure).

30 / 81


Other examples (non-existence of a minimizer)

[Figure: three univariate sketches of non-existence of a minimizer, corresponding to: unboundedness of S; lack of closedness of S; lack of continuity of f(x) on S.]

31 / 81


Recognizing a local minimum

If f is twice continuously differentiable (in a neighborhood of x⋆), we can determine whether x⋆ is a local minimizer (or strict local minimizer) by examining the gradient ∇f(x⋆) and the Hessian ∇2f(x⋆) of f (evaluated at the point x⋆). This is far more appealing compared to evaluating the function for “sufficiently many” points in the neighborhood of x⋆.

Recall that the gradient and Hessian are given by

∇f(x) = [∂f(x)/∂x1; . . . ; ∂f(x)/∂xn],

∇2f(x) = [∂2f(x)/∂x1∂x1 · · · ∂2f(x)/∂x1∂xn; . . . ; ∂2f(x)/∂xn∂x1 · · · ∂2f(x)/∂xn∂xn].

In what follows, we make the implicit assumption that f is continuously differentiable (twice continuously differentiable) whenever ∇f(x) (∇2f(x)) is used.

Example

Consider the quadratic function f(x) = (1/2)xᵀHx + xᵀg. Note that without loss of generality we could assume that H ∈ Sn (why?). Its gradient and Hessian are given by

∇f(x) = Hx + g,
∇2f(x) = H.
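A quick numeric check of this formula in Matlab (with random assumed data), comparing central finite differences with Hx + g:

n = 3;
B = randn(n); H = (B + B')/2;    % symmetric H (WLOG, as noted above)
g = randn(n,1); x = randn(n,1);
f = @(x) 0.5*x'*H*x + x'*g;
h = 1e-6; gnum = zeros(n,1);
for i = 1:n
    e = zeros(n,1); e(i) = h;
    gnum(i) = (f(x+e) - f(x-e)) / (2*h);   % central difference
end
norm(gnum - (H*x + g))           % on the order of 1e-9: matches H*x + g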

32 / 81


Taylor’s theorem [3], pp. 14, [2], pp. 52

“The results from analysis that are most frequently used in optimization come from the group of “Taylor” or “mean-value” theorems.” [2], pp. 52.

For simplicity, we limit ourselves to only three terms of the expansion

Suppose that f : Rn → R is continuously differentiable and that ∆x ∈ Rn; then we have that

f(x + ∆x) = f(x) + ∇f(x + t∆x)ᵀ∆x

for some t ∈ [0, 1]. Moreover, if f is twice continuously differentiable we have that

f(x + ∆x) = f(x) + ∇f(x)ᵀ∆x + (1/2)∆xᵀ∇2f(x + t∆x)∆x,

for some t ∈ [0, 1].

The Taylor series expansion of a function f about a point x allows us to construct simple approximations to the function in a neighborhood of x. For example, ignoring all but the linear term of the Taylor series gives

f(x + ∆x) ≈ f(x) + ∇f(x)ᵀ∆x,  the RHS being a linear function in ∆x.

Including one additional term from the Taylor series produces a quadratic approximation of f at the point x

f(x + ∆x) ≈ f(x) + ∇f(x)ᵀ∆x + (1/2)∆xᵀ∇2f(x)∆x.

33 / 81


Taylor’s theorem

[Figure: the function f(x) = (1/3)x² at the point (x = 1.5, f(x)), with step ∆x = 1. The linear approximation gives f(x) + ∇f(x)∆x = 1.75, while the exact value is f(x + ∆x) = f(x) + ∇f(x + 0.5∆x)∆x ≈ 2.08; indeed 2.08 = 1.75 + (1/2)∆x (2/3) ∆x, with ∇2f = 2/3.]
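The numbers in the figure can be reproduced with a few lines of Matlab:

f = @(x) x.^2/3;  df = @(x) 2*x/3;  d2f = 2/3;
x = 1.5; dx = 1;
lin = f(x) + df(x)*dx                   % 1.75, the linear approximation
quad = f(x) + df(x)*dx + 0.5*d2f*dx^2   % 2.0833...
f(x + dx)                               % 2.0833...: for a quadratic f, the
                                        % second-order model is exact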

34 / 81


Level sets of a function

[Figure: level curves {x : f(x) = c} of a function for c = 0.5, 1, 2.25, 3, together with a point x and the normalized negative gradient −∇f(x)/‖∇f(x)‖2, which is orthogonal to the level curve through x.]

The level set of a function f : Rn → R corresponding to a real value c is the set of points {x ∈ Rn : f(x) = c}. The gradient of f at a point x, ∇f(x), is orthogonal to the level set containing x.

The sublevel set of a function f : Rn → R corresponding to a real value c is the set of points {x ∈ Rn : f(x) ≤ c}.

35 / 81


Optimality conditions at a glance

Consider the problem

minimize_{x∈S} f(x),    (6)

where S ⊆ Rn is the set of feasible points. Loosely speaking,

x ∈ S is a stationary point of (6) if, with the use of only first-order information about the problem at x (i.e., f(x), ∇f(x), and the same for the constraint functions defining S), we cannot find a feasible descent direction at x [7], pp. 16.

Investigating whether x is a stationary point or not is important because:

If x is a stationary point, then it is a “good” candidate for a solution (as we will see, in the case of a convex problem, such an x would be guaranteed to be a global minimizer).

If x is not a stationary point, in the “investigation process” we can generate a feasible descent direction from x, in order to move towards a “better” feasible point. Such an approach would produce an iterative sequence of feasible points.

Three important keywords

stationary point

descent direction

feasible direction (e.g., for unconstrained problems all directions are feasible)

36 / 81


Univariate unconstrained case [8], pp. 10

[Figure: f(x) = x² − x⁴, with ∇f(x) = 2x − 4x³ and ∇2f(x) = 2 − 12x²; the stationary points are x = 0 and x = ±1/√2, and there is no global minimizer since f(x) → −∞ as |x| → ∞.]

necessary condition: ∇f(x⋆) = 0 (stationarity)

sufficient conditions: x⋆ is stationary and ∇2f(x⋆) > 0

Figure: A necessary condition, and sufficient conditions for a local minimum.

37 / 81


Another necessary condition

necessary condition: ∇2f(x⋆) ≥ 0

[Figure: f(x) = x³, with ∇f(x) = 3x² and ∇2f(x) = 6x; the stationary point at x = 0 is a saddle point.]

Figure: Another necessary condition for a local minimum. Note that the function f(x) = x³ has no local minima or maxima.

38 / 81


Descent direction

The first-order Taylor approximation of f(x + ∆x) around the point x is

f(x + ∆x) = f(x) + ∇f(x + t∆x)ᵀ∆x, for some t ∈ [0, 1]
          ≈ f(x) + ∇f(x)ᵀ∆x,

where ∆x is a step away from x. The term ∇f(x)ᵀ∆x is the directional derivative of f at x in the direction ∆x.

A step ∆x is a descent direction from x if ∇f(x)T∆x < 0.

This can be demonstrated as follows: if ∇f(x)ᵀ∆x < 0 then, due to the fact that the gradient is continuous, there must exist a small enough t such that ∇f(x + t∆x)ᵀ∆x < 0 as well, which implies that f(x + α∆x) < f(x) for a sufficiently small α > 0 [3], pp. 15.

If ∇f(x) ≠ 0, there are infinitely many descent directions at x

One possible choice is ∆x = −∇f(x), which leads to

∇f(x)ᵀ∆x = −∇f(x)ᵀ∇f(x) = −‖∇f(x)‖₂² < 0.

Another choice would be ∆x = −B∇f(x), for any B ∈ Sn++, because

∇f(x)ᵀ∆x = −∇f(x)ᵀB∇f(x) = −‖∇f(x)‖_B² < 0.

Recall that a matrix B ∈ Rn×n is positive definite if it is symmetric and vᵀBv > 0 for all nonzero v ∈ Rn.

39 / 81


Geometric interpretation of vᵀBv > 0, B ∈ Sn++

The condition vᵀBv > 0 for all nonzero v means that the angle between v and Bv is strictly less than π/2.

We define the (unsigned) angle between two nonzero vectors x and y as

cos⁻¹( xᵀy / (‖x‖₂‖y‖₂) )

[Figure omitted.] Figure: Three descent directions are depicted at a point x. The descent direction −∇f(x) is in black. Note the “very special” descent direction that leads exactly to the minimum of the quadratic function.

40 / 81


Steepest descent direction

Let us assume that descent directions exist at x.

Unreasonable objective

Choose ∆x so that ∇f(x)T∆x is as negative as possible.

Reasonable objective

Choose ∆x so that ∇f(x)ᵀ∆x is as negative as possible and ‖∆x‖ = 1, i.e.,

minimize_{∆x} ∇f(x)ᵀ∆x
subject to ‖∆x‖ = 1

The set of solutions of the above problem (which need not be a singleton) is given by

∆xnsd = argmin_z {∇f(x)ᵀz : ‖z‖ = 1},

where ∆xnsd is called a normalized steepest descent direction [1], pp. 475.

Depending on the choice of norm, we can obtain different solutions for ∆x. If we take the norm ‖·‖ to be the Euclidean norm, the unique solution is given by (why?)

∆xnsd = −∇f(x)/‖∇f(x)‖₂.

Among all directions in which we could move from a point x, ∆xnsd is the one along which f decreases most rapidly. We will consider other norms later on.
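A small numeric illustration of this in Matlab (the gradient below is an assumed example): random unit directions never achieve a directional derivative below that of ∆xnsd.

gradf = [2; -1];                    % assumed gradient at some point x
dx_nsd = -gradf/norm(gradf);        % normalized steepest descent direction
best = inf;
for trial = 1:1e4
    z = randn(2,1); z = z/norm(z);  % random unit direction
    best = min(best, gradf'*z);
end
[gradf'*dx_nsd, best]               % first entry is -norm(gradf) = -2.2361;
                                    % no sampled direction goes below it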

41 / 81


Unconstrained case - summary

Assuming f : Rn → R to be twice continuously differentiable, we have

Necessary conditions

x⋆ is a local minimizer of f ⇒ ∇f(x⋆) = 0 and ∇2f(x⋆) ∈ Sn+

Note that the “reverse direction” is false (consider f(x) = x³). The first condition is necessary since, if ∇f(x⋆) ≠ 0, a descent direction exists. The second condition can (as well) be justified by using Taylor's theorem

f(x⋆ + ∆x) = f(x⋆) + ∇f(x⋆)ᵀ∆x + (1/2)∆xᵀ∇2f(x⋆ + t∆x)∆x, for some t ∈ [0, 1],

where the term ∇f(x⋆)ᵀ∆x = 0.

If ∇2f(x⋆) is not positive semidefinite, then we can choose ∆x such that ∆xᵀ∇2f(x⋆)∆x < 0 and, due to the continuity of ∇2f, there must exist t such that ∆xᵀ∇2f(x⋆ + t∆x)∆x < 0, which would imply that f(x⋆ + α∆x) < f(x⋆) for a sufficiently small α > 0.

Sufficient conditions

∇f(x⋆) = 0 and ∇2f(x⋆) ∈ Sn++ ⇒ x⋆ is a (strong) local minimizer of f

The “reverse direction” is clearly false, since ∇2f(x⋆) need not be positive definite in order for x⋆ to be a local minimizer (for a proof see [3], pp. 16).
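For a quadratic f the two conditions are easy to check numerically; a Matlab sketch with assumed data:

H = [2 0; 0 3]; g = [-2; -3];   % assumed quadratic f = 0.5*x'*H*x + x'*g
xs = H \ (-g)                   % stationary point: grad f = H*x + g = 0
eig(H)                          % all eigenvalues positive => H in Sn++,
                                % so xs is a strict local minimizer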

42 / 81


Example (stationary points in 2D)

[Figure: three quadratic surfaces over the (x1, x2) plane and their level curves, illustrating the three kinds of stationary points in 2D: a minimum, a saddle point, and a maximum.]

43 / 81


Properties of quadratic functions [2], pp. 65

The Taylor’s theorem implies that a smooth function can be closely approximated by aquadratic function in a sufficiently small neighborhood of a given point. For thisreason many algorithms are based on the properties of quadratic functions. Consider

f(x) =1

2xTHx+ xT g, x ∈ R

n,H ∈ Sn.

Directly from the definition of f it follows that

f(x + α∆x) = f(x) + α(Hx + g)ᵀ∆x + (1/2)α²∆xᵀH∆x.

f has a stationary point x⋆ only if ∇f(x⋆) = Hx⋆ + g = 0. This means that x⋆ should satisfy

Hx⋆ = −g.

If g ∉ R(H) (i.e., the system is not compatible), there are no stationary points and the problem is unbounded above and below.

If g ∈ R(H) and rank(H) < n, there are infinitely many stationary points.

If g ∈ R(H) and rank(H) = n, there is a unique stationary point (which can be any of the three cases on the previous slide). If H is indefinite, then an eigenvector corresponding to a negative eigenvalue can be used as a descent direction.

44 / 81


(i) infinitely many stationary points; (ii) unbounded above and below

[Figure: two surfaces of f(x) = (1/2)xᵀHx + xᵀg with H = [1 1; 1 1] ∈ S+: (i) for g = [0; 0] there are infinitely many stationary points; (ii) for g = [0; 10] (so that g ∉ R(H)) the function is unbounded above and below.]
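These two cases can be checked in Matlab by testing whether the system Hx = −g is compatible:

H = [1 1; 1 1];
for g = [[0; 0], [0; 10]]
    compatible = (rank([H, -g]) == rank(H));   % is -g in R(H)?
    fprintf('g = (%g, %g): compatible = %d\n', g(1), g(2), compatible);
end
% g = (0, 0): compatible, rank(H) = 1 < 2, infinitely many stationary points
% g = (0, 10): not compatible, no stationary points (unbounded above and below)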

45 / 81


Example (ball hanging on a spring - unconstrained case)

[Figure: level curves of the potential energy in the (x1, x2) plane and the unconstrained minimizer x⋆.]

The unconstrained solution is given by

x⋆ = −H⁻¹g = [0; mg/k]

A cool fact

The condition ∇f(x⋆) = 0 was “originally formulated by Fermat in 1637 in the short treatise “Methodus ad Disquirendam Maximam et Minimam” without proof (of course!)” [8], pp. 5.

46 / 81


Linear equality constraints [2], pp. 67

In general, a set of linear equality constraints can be expressed as Cx = d, C ∈ Rme×n (we assume rank(C) < n), where the ith constraint is given by

ciᵀx = ci1x1 + · · · + cinxn = di,  ci ∈ Rn, di ∈ R.

If d ∉ R(C), there would be no feasible point (such constraints are called inconsistent), and thus we assume that d is a linear combination of the columns of C.

r linearly independent constraints remove r degrees of freedom from the choice of x⋆. In two dimensions, for example, from a point x we have a “strong urge” to follow the negative gradient (slide down the slope); however, in the presence of equality constraints, we are not free to choose any descent direction (see the figure below).

Feasible direction [7], pp. 88

Suppose that we are at a point x ∈ S ⊆ Rn. ∆x ∈ Rn defines a feasible direction at x if a “small” step in the direction ∆x does not lead outside of the set S. In other words,

∃δ > 0 such that x + α∆x ∈ S, for all α ∈ [0, δ].

[Figure: level curves of f, the constraint line 0.2x1 − x2 = −0.6, a feasible point x on it, and −∇f(x) pointing off the line.]

47 / 81


Null space of a matrix ... (just a reminder)

Note that it is not the number of linear constraints that is important, but their rank. For example, the two constraints

x1 + x2 = 0,

2x1 + 2x2 = 0,

would remove only one degree of freedom. The reason is that any multiple of (−1, 1) would solve the above system of equations.

Null space of a matrix

The null space of C ∈ Rme×n is denoted by N(C) and is defined as

N(C) = {x ∈ Rn : Cx = 0}.

In words, this means that the null space of C is the set of all vectors x that solve the system Cx = 0. N(C) is a vector space. This means that if x, y ∈ N(C), then any linear combination of x and y is in N(C). Every matrix has a null space (C0 = 0).

Basis for the null space

We denote by Z ∈ Rn×z a basis for N(C). The z columns of Z are linearly independent (since they form a basis). Any vector x ∈ N(C) can be expressed as x = Zxz for some xz ∈ Rz. In the above example

Z = [−1; 1].
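In Matlab, null returns an orthonormal basis for the null space; a one-line check for the example above:

C = [1 1; 2 2];     % the two rank-deficient constraints above
Z = null(C)         % approx. (-0.7071, 0.7071), parallel to (-1, 1)
norm(C*Z)           % numerically zero: the columns of Z solve C*x = 0
% any basis of N(C) works; null() simply returns an orthonormal one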

48 / 81


Feasible directions (linear equality constraints)

Consider the set of constraints Cx = d, C ∈ Rme×n, me < n, d ∈ R(C)

By the definition of a feasible direction ∆x (from a feasible x), we have

C(x + ∆x) = d,

from which it follows that ∆x should satisfy

C∆x = 0.

Or in words, any feasible direction for a system of linear equality constraints belongs to N(C). This means that any feasible direction can be expressed as Zxz for some xz.

In order to determine whether a given feasible point x is a (constrained) local minimizer, we examine the Taylor-series expansion of f about x along a feasible direction ∆x = Zxz:

f(x + αZxz) = f(x) + αxzᵀZᵀ∇f(x) + (1/2)α²xzᵀZᵀ∇2f(x + αt∆x)Zxz,

for some t ∈ [0, 1]. We consider (without loss of generality) α > 0. Using an argument similar to the unconstrained case, we can show that if there exists xz such that

xzᵀZᵀ∇f(x) < 0,

then there exist feasible descent directions.

49 / 81


Projected gradient and Projected Hessian

Clearly, if Zᵀ∇f(x) ≠ 0, we can always choose xz such that xzᵀZᵀ∇f(x) < 0; hence a (first-order) necessary condition for x to be a (constrained) local minimizer is

Zᵀ∇f(x) = 0.    (7)

The vector Zᵀ∇f(x) is called the projected gradient of f at x. Any point at which the projected gradient vanishes is called a constrained stationary point.

If a point x is a constrained stationary point, then the Taylor-series expansion becomes

f(x + αZxz) = f(x) + (1/2)α²xzᵀZᵀ∇2f(x + αt∆x)Zxz, for some t ∈ [0, 1].

In a similar way as in the unconstrained case, we can show that if the matrix Zᵀ∇2f(x)Z is not positive semidefinite, every neighborhood of x contains feasible points with a strictly lower value than f(x). Therefore, a (second-order) necessary condition is that

Zᵀ∇2f(x)Z ∈ Sz+.    (8)

The matrix Zᵀ∇2f(x)Z is called the projected Hessian of f at x. Note that when we have constraints, ∇2f(x) need not be positive semidefinite for x to be a local minimizer. It is possible for ∇2f(x) to be indefinite, but Zᵀ∇2f(x)Z to be positive semidefinite.

50 / 81


Where are the Lagrange multipliers?

It is possible to state the first-order optimality condition Zᵀ∇f(x) = 0 in an alternative way by observing that

N(C) ⊥ R(Cᵀ)

Consider for example

[1 2 3; 4 5 6] [x1; x2; x3] = [0; 0],  with C = [1 2 3; 4 5 6].

Clearly, any vector a ∈ N(C) is orthogonal to the rows of C, since two vectors a, b ∈ Rn are orthogonal only if aᵀb = 0.

First, note that Zᵀ∇f(x) = 0 implies that ∇f(x) is orthogonal to the rows of Zᵀ, which are the columns of Z, hence

∇f(x) ⊥ N(C).    (9)

This means that ∇f(x) ∈ R(Cᵀ). Hence, if x is a (constrained) local minimizer, there exists a vector ν such that

∇f(x) + Cᵀν = 0.    (10)

The above equation states that ∇f(x) is a linear combination of the rows of C (i.e., the constraint normals), and the vector ν contains the weights for this linear combination. These weights are called Lagrange multipliers (and are unique only if the rows of C are linearly independent). Note that, equivalently, we could have written (10) as ∇f(x) = Cᵀν (in which case ν would have the opposite sign).

51 / 81


Linear equality constraints - summary ([2], pp. 70)

Necessary conditions

x⋆ is a constrained local minimizer of f ⇒ Cx⋆ = d, Zᵀ∇f(x⋆) = 0, Zᵀ∇2f(x⋆)Z ∈ Sz+

or equivalently, there exists ν⋆ such that

x⋆ is a constrained local minimizer of f ⇒ Cx⋆ = d, ∇f(x⋆) + Cᵀν⋆ = 0, Zᵀ∇2f(x⋆)Z ∈ Sz+

Sufficient conditions

Cx⋆ = d, ∇f(x⋆) + Cᵀν⋆ = 0, Zᵀ∇2f(x⋆)Z ∈ Sz++ ⇒ x⋆ is a constrained local minimizer of f

Essentially, the difference between the unconstrained case and the case with linear equality constraints is that we interchanged the gradient and Hessian with the projected gradient and projected Hessian. And of course we explicitly added the equality constraints in the conditions.

52 / 81


Lagrangian function

In the unconstrained case, the first-order necessary condition for x⋆ to be a local minimizer of f is ∇f(x⋆) = 0. Is it possible to state the first-order necessary conditions in the case of linear equality constraints as the gradient of some function being equal to zero?

The answer is: yes

Consider the function L : Rn+me → R

L(x, ν) = f(x) + νᵀ(Cx − d).

L(x, ν) has the property that

∇xL(x, ν) = ∇f(x) + Cᵀν
∇νL(x, ν) = Cx − d,

where ∇xL(x, ν) are the partial derivatives of L(x, ν) with respect to x, and ∇νL(x, ν) are the partial derivatives of L(x, ν) with respect to ν. The first-order necessary conditions can be expressed as

∇xL(x⋆, ν⋆) = 0,  ∇νL(x⋆, ν⋆) = 0.

L(x, ν) is called a Lagrangian function. The above two conditions suggest that we can search for solutions of the equality-constrained problem by seeking stationary points of L(x, ν) [3], pp. 310. Note that we could have defined L(x, ν) as f(x) + νᵀ(d − Cx), in which case ν would have the opposite sign.

53 / 81


KKT conditions (linear equality constraints)

The Karush-Kuhn-Tucker (KKT) conditions are given by

Cx⋆ = d
∇f(x⋆) + Cᵀν⋆ = 0

Hence, the KKT conditions are equivalent to the first-order conditions we derived. In general, this is a set of nonlinear equations.

Example

In the case of a quadratic function, ∇f(x) = Hx + g is a linear function of x, hence the KKT conditions can be expressed as

Cx⋆ = d
Hx⋆ + g + Cᵀν⋆ = 0

or in matrix form

[H Cᵀ; C 0] [x⋆; ν⋆] = [−g; d],

where the left-hand matrix is denoted K.

The matrix K ∈ R(n+me)×(n+me) is called the KKT matrix. (x⋆, ν⋆) is called an optimal pair. Every such pair satisfies the above system of linear equations. The fact that solving an equality constrained QP amounts to solving one system of linear equations is heavily used in various algorithms.

54 / 81


Example [8], pp. 294

Consider the problem

minimize_x (1/2)(x1² + x2² + x3²)
subject to x1 + x2 + x3 = 3.

The first-order necessary conditions yield

∇xL(x, ν) = ∇f(x) + Cᵀν = [x1; x2; x3] + ν[1; 1; 1] = [0; 0; 0]
∇νL(x, ν) = Cx − d = [1 1 1][x1; x2; x3] − 3 = 0.

This is a system of four linear equations in four unknowns

[1 0 0 1; 0 1 0 1; 0 0 1 1; 1 1 1 0] [x1; x2; x3; ν] = [0; 0; 0; 3],    (11)

where the matrix on the left is the KKT matrix.

The unique solution is given by x⋆ = (1, 1, 1), ν⋆ = −1. Since ∇2f(x) is the identity matrix, we conclude that x⋆ is a local minimizer (in fact, x⋆ is the unique global minimizer, as we will see later on). Note that ∇2f(x) = ∇2xxL(x, ν).

55 / 81


Example (ball hanging on a spring subject to one equality constraint)

[Figure: level curves of the potential energy, the constraint line −x1 − x2 = 1, the minimizer x⋆ on it, and the vectors −∇f(x⋆) and cᵀ.]

Consider our mass-spring example, but now subject to only one equality constraint cᵀx = −x1 − x2 = 1. It is easy to verify that x⋆ satisfies the optimality conditions (x⋆ = (0.48, −1.48), ν = 1.92). Note that since ∇f(x⋆) is orthogonal to the level curve associated with x⋆, the geometric interpretation of the condition ∇f(x⋆) + Cᵀν⋆ = 0 is that Cᵀν⋆ should as well be orthogonal to it.
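These numbers can be verified in Matlab by solving the KKT system from the previous slides (the data is that of the earlier spring example):

k = 4; m = 0.8; grav = -9.81;
H = k*eye(2); g = [0; -m*grav];   % the spring QP: 0.5*x'*H*x + x'*g
C = [-1 -1]; d = 1;               % the equality constraint -x1 - x2 = 1
K = [H C'; C 0];                  % KKT matrix
sol = K \ [-g; d];
x = sol(1:2)                      % approx. (0.48, -1.48)
nu = sol(3)                       % approx. 1.92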

56 / 81


Linear inequality constraints [2], pp. 71

In general, a set of linear inequality constraints can be expressed as Ax ≤ b, A ∈ Rmi×n, where the ith constraint is given by

aiᵀx = ai1x1 + · · · + ainxn ≤ bi,  ai ∈ Rn, bi ∈ R.

It is possible that there is no x that satisfies a given set of inequality constraints Ax ≤ b, in which case the constraints are called inconsistent.

Note that our mass-spring examples with:

three inequality constraints

one equality constraint

have the same solution (of course in both cases the objective function is the same).

[Figure: two panels with the same minimizer x⋆ and −∇f(x⋆): left, the problem with the three inequality constraints (the unconstrained minimizer x⋆unc = (0, −1.96) is infeasible); right, the problem with the single equality constraint −x1 − x2 = 1.]

57 / 81


Active constraints

Consider the following set of equality and inequality constraints

[c1ᵀ; . . . ; cmeᵀ] x = [d1; . . . ; dme],  [a1ᵀ; . . . ; amiᵀ] x ≤ [b1; . . . ; bmi],

where the stacked matrices and vectors are denoted C, d, A and b, respectively.

We introduce two sets: E and A. E(x) contains the indices of equality constraints that are satisfied at a given point x. A(x) contains the indices of inequality constraints that are satisfied as equalities at x.

Active set

The set of constraints

ciᵀx = di, for all i ∈ E(x),  ajᵀx = bj, for all j ∈ A(x)

is called the set of active constraints at x, i.e., all constraints that are satisfied as equalities at the point x. For example, at a feasible x where the first, third and fourth inequality constraints are active (i.e., satisfied as equalities),

E(x) = {1, . . . , me},  A(x) = {1, 3, 4}.

We gather the coefficients of all active inequality constraints (at a point x) in a matrix Ā and vector b̄. With the notation (Ā, b̄) ← A(x) we mean the matrix Ā and vector b̄ corresponding to the set of active inequality constraints at a point x.
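A one-line way to extract A(x) numerically in Matlab, using the spring example's constraints and the minimizer found earlier:

A = [-1 -1; 1 -1; 0.1 0.8]; b = [1; 3; 0.4];   % spring example constraints
x = [0.48; -1.48];                             % the minimizer found earlier
active = find(abs(A*x - b) < 1e-6)             % indices j with aj'*x = bj; here {1}
Abar = A(active, :); bbar = b(active);         % (Abar, bbar) <- A(x)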

58 / 81


The two problems below have the same solution x⋆ (note that sometimes we refer to x⋆ as a “solution” even though it might be only a local minimizer).

minimize_x f(x)    (12)
subject to Ax ≤ b,
Cx = d,

minimize_x f(x)    (13)
subject to Cx = d,
Āx = b̄,  (Ā, b̄) ← A(x⋆).

“It is clear that, if it were known a priori which constraints were active at the solution of (12), the solution would be a local minimum point of the problem defined by ignoring the inactive constraints and treating all active constraints as equality constraints. Hence with respect to local (or relative) solutions, the problem could be regarded as having equality constraints only.” [4], pp. 322.

Hence, if we have to solve problem (12), but (somehow) we knew what the active set at x⋆ is, we could directly solve (13), i.e., completely disregard the inactive inequality constraints at x⋆. The whole problem (of course!) is that we usually do not know A(x⋆). For the moment, let us assume that there are no equality constraints. There are two possibilities: A(x⋆) = ∅, or ma inequality constraints are active at x⋆.

The optimality conditions when A(x⋆) = ∅ are identical to the ones we had for the unconstrained case. Next, we discuss the second case, i.e., when x⋆ is on the boundary of the feasible region Ax ≤ b.

59 / 81


Feasible directions (linear inequality constraints)

If the jth inequality constraint is inactive at a feasible point x, it is possible to move a non-zero distance from x in any direction without violating that constraint. This means that for any vector ∆x, there exists α such that x + α∆x is feasible with respect to the jth inequality constraint.

On the other hand, consider a feasible point x that satisfies ajᵀx = bj (i.e., the jth inequality constraint is active at x). Then there are two categories of feasible directions ∆x (with respect to this constraint):

ajᵀ∆x = 0, binding (stay on the constraint)

ajᵀ∆x < 0, non-binding (move “off” the constraint).

The latter ∆x is a feasible direction since

ajᵀ(x + α∆x) = ajᵀx + α ajᵀ∆x < bj,

because ajᵀx = bj and ajᵀ∆x < 0.

[Figure omitted.] Figure: Binding feasible directions are depicted in red, while non-binding feasible directions are depicted in green.

60 / 81


λ ≥ 0 [2], pp. 71

Let x be a feasible point and (Ā, b̄) ← A(x). Let (like in the equality constrained case) Z be a matrix whose columns form a basis for the set of vectors orthogonal to the rows of Ā. If we regard the active constraints simply as equality constraints, we have shown that a necessary condition for (local) optimality of a point x is that the projected gradient Zᵀ∇f(x) = 0, or equivalently

∇f(x) + Āᵀλ = 0,    (14)

where we denote the Lagrange multipliers corresponding to the active inequality constraints by λ ∈ Rma.

Additional condition

Condition (14) ensures that f is stationary along all binding feasible directions from the point x. However, since non-binding feasible directions might exist as well, we require an additional condition which states that there is no non-binding feasible descent direction ∆x, i.e., one that satisfies ∇f(x)ᵀ∆x < 0. To avoid this possibility, we impose the condition that for all ∆x that satisfy Ā∆x ≤ 0, it should hold that ∇f(x)ᵀ∆x ≥ 0.

Substituting ∇f(x) = −Āᵀλ in ∇f(x)ᵀ∆x ≥ 0, we obtain the condition that

−λᵀ(Ā∆x) ≥ 0,  where Ā∆x ≤ 0.

The above inequality can hold only if λ ≥ 0.

61 / 81


λ ≥ 0

Geometric interpretation

Consider the condition

Āᵀλ = −∇f(x),  λ ≥ 0.

It states that, at x, the negative gradient has to be a non-negative linear combination of the rows of Ā (i.e., the normals to the active inequality constraints at x). The set of all non-negative linear combinations (i.e., conic combinations) of a set of vectors is called a cone [1], pp. 25.

The following two conditions are mutually exclusive (“by construction”)

there exists ∆x such that
∇f(x)^T ∆x < 0, (descent direction)
A∆x ≤ 0, (feasible direction)

−∇f(x) is a conic combination of the rows of A
A^T λ = −∇f(x), λ ≥ 0.

It should be noted that even if there is no ∆x that satisfies ∇f(x)^T ∆x < 0, there could still be a descent direction from x, depending on the eigenvalues of ∇²f(x) (as in the unconstrained case).


Farkas’ lemma

The above mutually exclusive conditions do not appear only in our particular context; they are known as Farkas’ lemma, which states

Farkas’ lemma [5], pp. 165, [8], pp. 339

Exactly one of the following two alternatives holds

there exists some x ≥ 0 such that Ax = b;

there exists some p such that A^T p ≤ 0 and b^T p > 0,

where A ∈ R^{m×n} and b ∈ R^m. See as well [6], pp. 205.

Note that if we change b^T p > 0 to b^T p ≥ 0, the two alternatives would no longer be mutually exclusive. For example, if b = 0, then both would be satisfied, for x = 0 and p = 0.

Farkas’ lemma restated (with our notation)

Exactly one of the following two alternatives holds

there exists some λ ≥ 0 such that A^T λ = −∇f(x);

there exists some ∆x such that A∆x ≤ 0 and ∇f(x)^T ∆x < 0.

From the above lemma, it follows that at a local minimizer x the negative gradient can be expressed as a conic combination of the normals to the active inequality constraints at x.
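For concrete data, one can test which alternative holds with a feasibility LP. A Matlab sketch (the matrix A and the stand-in gradient g are made-up example values; linprog is from the Optimization Toolbox):

% Test alternative 1 of Farkas' lemma: does lambda >= 0 with A'*lambda = -g exist?
A = [1, 0; 0, 1];   % rows: normals of the active constraints (example data)
g = [-1; -2];       % stands in for grad f(x); we ask if -g is in the cone
ma = size(A, 1);
[lambda, ~, flag] = linprog(zeros(ma, 1), [], [], A', -g, zeros(ma, 1), []);
if flag == 1        % feasible: -g is a conic combination of the rows of A
    fprintf('alternative 1 holds, lambda = (%g, %g)\n', lambda);
else                % infeasible: a feasible descent direction dx must exist
    disp('alternative 2 holds');
end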


Farkas’ lemma - geometric interpretation

Figure: If the vector −∇f(x) does not belong to the set of all non-negative linear combinations of a_1^T, …, a_ma^T (i.e., {c : c = ∑_{j=1}^{ma} λ_j a_j^T, λ_j ≥ 0}), then we can find a hyperplane {z : ∆x^T z = 0} that separates −∇f(x) from the set. Note that a polyhedral cone is an example of a closed but unbounded set.


Farkas’ lemma - geometric interpretation

Figure: Clearly, ∇f(x)^T ∆x > 0 for all ∆x that satisfy a_j^T ∆x ≤ 0, j = 1, …, ma (the plot shows this set of directions together with −∇f(x), a_1^T, and a_2^T).


Linear inequality constraints - summary ([2], pp. 73)

Necessary conditions

Let (A, b) ← A(x⋆). Then

x⋆ is a constrained local minimizer of f ⇒

Ax⋆ ≤ b (feasibility w.r.t. all constraints)
Ax⋆ = b (the active constraints)
∇f(x⋆) + A^T λ⋆ = 0
λ⋆ ≥ 0
Z^T ∇²f(x⋆) Z ∈ S^n_+

Note that the λ⋆ corresponding to x⋆ is unique only if the matrix A has linearly independent rows (why?).

Sufficient conditions

Ax⋆ ≤ b
Ax⋆ = b
∇f(x⋆) + A^T λ⋆ = 0
λ⋆ > 0
Z^T ∇²f(x⋆) Z ∈ S^n_{++}

⇒ x⋆ is a constrained local minimizer of f

For the reasoning behind the strict inequality λ⋆ > 0, see [2], pp. 74, [6], pp. 201. Essentially, it is needed to handle the case when x⋆ coincides with an unconstrained minimizer that lies on some of the inequality constraints. Such constraints are called weakly active at x⋆.
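These conditions can be checked numerically at a candidate point. A Matlab sketch, assuming a quadratic objective f(x) = (1/2) x^T H x + c^T x and one active constraint (all values below are made up for illustration):

% Numeric check of the optimality conditions at a candidate point x.
H = [2, 0; 0, 2]; c = [-2; -2];   % hypothetical quadratic objective
A = [1, 1]; b = 1;                % active constraint(s) at x: A*x = b
x = [0.5; 0.5];                   % candidate point (A*x = 1 = b)
g = H*x + c;                      % gradient of f at x
lambda = -(A') \ g;               % least-squares solve of A'*lambda = -g
Z = null(A);                      % basis of directions with A*dx = 0
fprintf('stationarity residual: %g\n', norm(g + A'*lambda));
fprintf('lambda >= 0: %d\n', all(lambda >= 0));
fprintf('projected Hessian PD: %d\n', all(eig(Z'*H*Z) > 0));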


Example (linear inequality constraints)

Figure: The point x is not a local minimizer, because the negative gradient is not within the cone spanned by the normals to the active constraints at x. This cone is called a normal cone.


Normal cone

Figure: Normal cones at three points on the boundary of a polygon P.

Normal cone [1], pp. 66

The normal cone of a set P at a boundary point x̄ is the set of all vectors y such that y^T (x − x̄) ≤ 0 for all x ∈ P. In other words, it is the set of vectors that define a supporting hyperplane to P at x̄.


Lagrange multipliers (interpretation)

[Two plots: in each, a constrained minimizer x⋆ with the force −∇f(x⋆) (normalized in the second plot), together with the unconstrained minimizers x⋆unc = (0, −1.96) and x⋆unc = (0, −3.68), respectively.]

The two figures above illustrate the condition: at x⋆ there exists λ⋆ ≥ 0 such that

A^T λ⋆ = −∇f(x⋆).

The non-negativity of λ⋆ implies that each inequality constraint can apply only “repelling forces” (i.e., the constraints are unilateral). The equality implies that the combined “effort” of the constraints can counterbalance the force −∇f(x⋆).

If the constraint normals are assumed to be of unit length, then the Lagrange multiplier λ⋆_i can be interpreted as the magnitude of the repelling force applied by the ith active constraint in order to counterbalance the potential pull at x⋆.


Lagrange multipliers (interpretation)

[6], pp. 198

“... the Lagrange multiplier of any constraint measures the rate of change in theobjective function, consequent upon changes in the constraint function.”

Consider the ith equality constraint h_i(x) = 0. We perturb it “slightly” to obtain

h_i(x) = ε_i.

Let x(ε) and ν(ε) denote how the solution and the Lagrange multipliers change with ε = (ε_1, …, ε_me). The Lagrangian function for the equality constrained problem is given by

L(x, ν, ε) = f(x) + ∑_{i=1}^{me} ν_i (h_i(x) − ε_i).

If x is a (local) solution to the equality constrained problem, then

f(x(ε)) = L(x(ε), ν(ε), ε), ∇x L = 0, ∇ν L = 0.

Using the chain rule leads to

df(x(ε))/dε_i = dL(x(ε), ν(ε), ε)/dε_i = ∇x L^T ∂x/∂ε_i + ∇ν L^T ∂ν/∂ε_i + ∂L/∂ε_i = −ν_i.
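The identity df/dε_i = −ν_i is easy to verify by finite differences on a toy problem. A Matlab sketch, assuming the example minimize (1/2)‖x‖² subject to x1 + x2 + x3 = 3 + ε, whose KKT system is linear:

% Finite-difference check of df/d(eps) = -nu on a small equality-constrained QP.
solve = @(eps) [eye(3), ones(3,1); ones(1,3), 0] \ [zeros(3,1); 3 + eps];
z0 = solve(0);                  % z = [x; nu]
x0 = z0(1:3); nu = z0(4);
h  = 1e-6;
z1 = solve(h); x1 = z1(1:3);
dfde = (0.5*(x1'*x1) - 0.5*(x0'*x0)) / h;    % rate of change of the objective
fprintf('df/deps = %.6f, -nu = %.6f\n', dfde, -nu);   % both should be ~1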


Where is the complementarity condition?

So far, we associated a Lagrange multiplier λ_i with the ith active inequality constraint (stored as the ith row of the matrix A and vector b).

Let us, for convenience, introduce instead a vector of Lagrange multipliers λ ∈ R^mi, where λ_i is associated with the ith inequality constraint of the full problem (stored as the ith row of the matrix A and vector b). The fact that constraints that are not active at the solution do not apply any “forces” for counterbalancing the potential pull can be expressed by setting their corresponding Lagrange multipliers equal to zero. The following two conditions are equivalent

(active constraints only)
∇f(x⋆) + A^T λ⋆ = 0
λ⋆ ≥ 0

(all mi constraints)
∇f(x⋆) + A^T λ⋆ = 0
λ⋆ ≥ 0
(Ax⋆ − b)^T λ⋆ = 0

(Ax⋆ − b)^T λ⋆ = 0 is known as the complementarity condition and is equivalent to

∑_{i=1}^{mi} λ⋆_i (a_i^T x⋆ − b_i) = 0.

Due to the non-negativity of λ⋆ (and the feasibility of x⋆, i.e., a_i^T x⋆ − b_i ≤ 0), each term in the summation is non-positive, so the above equality can be attained only if all terms are equal to zero

λ⋆_i (a_i^T x⋆ − b_i) = 0, for all i.

The complementarity condition states that either the ith constraint is active, or its corresponding Lagrange multiplier λ⋆_i is equal to zero.
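A quick numeric illustration with quadprog (Optimization Toolbox); the QP below is made-up example data. One constraint ends up active with λ⋆_i > 0, the other inactive with λ⋆_i = 0, and every product λ⋆_i (a_i^T x⋆ − b_i) vanishes:

% Verify complementarity on a small QP: min 0.5*x'*x - 3*(x1 + x2).
H = eye(2); f = [-3; -3];
A = [1, 0; 0, 1]; b = [1; 5];      % x1 <= 1 (active), x2 <= 5 (inactive)
[x, ~, ~, ~, mult] = quadprog(H, f, A, b);
lambda = mult.ineqlin;             % multipliers of the inequality constraints
disp([A*x - b, lambda]);           % rows: (a_i'*x - b_i, lambda_i)
fprintf('complementarity: %g\n', abs((A*x - b)' * lambda));   % ~0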


Linear constraints - summary

Consider the following problem

minimize_{x∈R^n} f(x)
subject to Ax ≤ b, A ∈ R^{mi×n}
           Cx = d, C ∈ R^{me×n}.

KKT optimality conditions (these are our necessary conditions)

Ax⋆ ≤ b, primal feasibility condition
Cx⋆ = d, primal feasibility condition
λ⋆ ≥ 0, dual feasibility condition
(Ax⋆ − b)^T λ⋆ = 0, complementarity condition
∇f(x⋆) + A^T λ⋆ + C^T ν⋆ = 0, stationarity condition

Note that the Lagrange multipliers ν⋆ ∈ R^me associated with the equality constraints do not have a sign restriction. The Lagrange multipliers λ⋆ ∈ R^mi associated with the inequality constraints have a sign restriction and, in addition, should satisfy the complementarity condition.

We will see later on why λ⋆ ≥ 0 is referred to as the “dual feasibility” condition. (λ, ν) are sometimes called dual variables, while the elements of x are called primal variables. A minimal numeric check of all five conditions is sketched below.
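A hedged Matlab helper (a sketch, not a library routine; the name kkt_residuals and its interface are our own) that evaluates the residual of each KKT condition for a candidate triple (x, λ, ν), where g is the gradient of f at x:

function r = kkt_residuals(g, A, b, C, d, x, lambda, nu)
% Residuals of the KKT conditions for: min f(x) s.t. A*x <= b, C*x = d.
r.primal_ineq = max([A*x - b; 0]);           % > 0 means A*x <= b is violated
r.primal_eq   = norm(C*x - d);               % should be ~0
r.dual        = max([-lambda; 0]);           % > 0 means some lambda_i < 0
r.compl       = abs((A*x - b)' * lambda);    % complementarity, should be ~0
r.stationary  = norm(g + A'*lambda + C'*nu); % stationarity, should be ~0
end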


Summary of important points

Let x⋆ be a local minimizer of f and let the rows of the matrix A ∈ R^{ma×n} contain the normal vectors to the active inequality constraints at x⋆.

In the equality constrained case, the equation

∇f(x⋆) + C^T ν⋆ = 0

implies that

∇f(x⋆) ⊥ N(C); this means that at x⋆ the gradient of the objective function is orthogonal to the set of feasible variations
∇f(x⋆) belongs to the subspace spanned by the rows of C
there are no sign restrictions for the Lagrange multipliers ν⋆.

In the inequality constrained case, the conditions

∇f(x⋆) + A^T λ⋆ = 0, λ⋆ ≥ 0

imply that

∇f(x⋆) ⊥ N(A)
−∇f(x⋆) belongs to the normal cone at x⋆.

Local minima, local maxima, and saddle points all satisfy the first-order necessary conditions.

A step ∆x is a descent direction from a point x if ∇f(x)^T ∆x < 0.

Even if ∇f(x) = 0, descent directions may exist. Such directions can be found by examining the eigenvalues of ∇²f(x) (see the sketch below).
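A minimal Matlab sketch of the last point (the Hessian below is made up): at a stationary point, an eigenvector associated with a negative eigenvalue of ∇²f(x) is a descent direction, since the curvature along it is negative:

% Find a descent direction at a stationary point from the Hessian's eigenvalues.
Hess = [1, 0; 0, -2];           % indefinite Hessian at a point with grad f = 0
[V, D] = eig(Hess);
[dmin, k] = min(diag(D));
if dmin < 0
    d = V(:, k);                % f decreases along +d (and -d): d'*Hess*d < 0
    fprintf('descent direction (%.2f, %.2f), curvature %.2f\n', d, dmin);
end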


The condition ∇f(x⋆) ⊥ N(C) is a generalization of the “zero gradient condition” in the unconstrained case. This is because if there are no equality constraints, N(C) = R^n.

Consider the following problem

minimize_{x∈R^n} f(x) (15)
subject to Ax ≤ b, A ∈ R^{mi×n}
           Cx = d, C ∈ R^{me×n}.

At a local minimizer x⋆, there exists at least one pair of Lagrange multipliers (λ⋆, ν⋆) such that (x⋆, λ⋆, ν⋆) satisfies the KKT conditions.

When there are only equality constraints and their normal vectors are linearly independent, i.e., rank(C) = me, there is a unique ν⋆ associated with x⋆ [8], pp. 291.

When there are only inequality constraints and the normal vectors to the active constraints at x⋆ are linearly independent, there is a unique λ⋆ associated with x⋆. Note that it is possible that ma > n, i.e., there are more than n active inequality constraints at x⋆ (of course, they are not linearly independent then).

When there are both equality and inequality constraints (as in (15)), the pair (ν⋆, λ⋆) associated with x⋆ is unique if the matrix

[ C ]
[ A ] , (16)

has linearly independent rows [8], pp. 315. Note that we could write the stationarity condition as

∇f(x⋆) + [C; A]^T [ν⋆; λ⋆] = 0.

When there is a unique pair (ν⋆, λ⋆) associated with x⋆, x⋆ is called a regular local minimizer [8], pp. 316.


Problems

Consider the problem

minimize_x −(1/2)(x1² + x2² + x3²)
subject to x1 + x2 + x3 = 3.

Write down the KKT optimality conditions. How many stationary points does this problem have? Are there local minima?

Consider the problem

minimize_x (1/2)(x1² + x2² + x3²) (17)
subject to x1 + x2 + x3 = 3.

Express x1 as a function of x2 and x3 and write down the resulting unconstrained minimization problem (in matrix form). Verify that x⋆ = (1, 1, 1). Note that the system of linear equations you need to solve with this approach is smaller, but turns out to be completely dense, as opposed to the KKT matrix in

[ 1 0 0 1 ] [ x1 ]   [ 0 ]
[ 0 1 0 1 ] [ x2 ] = [ 0 ]
[ 0 0 1 1 ] [ x3 ]   [ 0 ]
[ 1 1 1 0 ] [ ν  ]   [ 3 ] . (18)

The leading matrix in (18) (the KKT matrix) has a very nice structure (it looks like an “arrow”). One possible way of obtaining the solution is to perform block elimination.


Block elimination

Consider the following system of linear equations

[ A11 A12 ] [ x1 ]   [ b1 ]
[ A21 A22 ] [ x2 ] = [ b2 ] .

From the first equation, solve for x1 to obtain

x1 = A11^{-1} (b1 − A12 x2),

then substituting x1 in the second equation leads to

(A22 − A21 A11^{-1} A12) x2 = b2 − A21 A11^{-1} b1,

where A22 − A21 A11^{-1} A12 is called the Schur complement. Obviously, the above procedure is possible only if A11 is an invertible matrix. Note that even if A11 is not invertible, A can be invertible (give an example). When working with a KKT matrix, A22 = 0.
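As an illustration (and a way to check the first exercise below), here is a Matlab sketch of block elimination applied to the arrow-shaped system (18), where A11 = I, A12 = A21^T = (1, 1, 1), and A22 = 0:

% Block elimination for the KKT system (18).
A11 = eye(3); A12 = ones(3, 1); A21 = A12'; A22 = 0;
b1 = zeros(3, 1); b2 = 3;
S  = A22 - A21 * (A11 \ A12);        % Schur complement (a 1x1 block here)
x2 = S \ (b2 - A21 * (A11 \ b1));    % the multiplier nu
x1 = A11 \ (b1 - A12 * x2);          % the primal variables x
disp([x1; x2]');                     % expected: 1 1 1 -1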

Solve the system of linear equations (18) using block elimination.

Consider again the objective function of (17), this time subject to the inequality constraint x1 + x2 + x3 ≤ 3 ([8], pp. 317). The problem has a unique solution; find it using only pen and paper. Hint: essentially, there are only two possibilities at x⋆: either the inequality constraint is active or it is not.


Using the same approach as in the previous problem, find the solution to our mass-spring example for the following three cases

(m = 0.3, k = 4, g = −9.81)
(m = 0.8, k = 4, g = −9.81)
(m = 1.5, k = 4, g = −9.81)

For all cases determine both x⋆ and λ⋆. The inequality constraints (which you should visualize) are given by Ax ≤ b with

[ −1  −1  ] [ x1 ]   [ 1   ]
[  1  −1  ] [ x2 ] ≤ [ 3   ]
[ 0.1 0.8 ]          [ 0.4 ] .

Use Matlab’s QP solver quadprog to verify your results.

Consider the problem ([8], pp. 320, [2], pp. 74)

minimize_x (1/2)(x1² − x2²)
subject to x2 ≤ 0.

Which of the sufficient conditions in the case of linear inequality constraints are not satisfied at x = (0, 0), λ = 0? What can you say about the Hessian matrix and the projected Hessian matrix? Give a feasible descent direction from x.


Consider the problem

minimize_{x∈R^n} ‖Ax − y‖₂²,

where A ∈ R^{m×n}, m > n, rank(A) = n. This is a classical “least-squares” problem, and it is well known that under our assumptions there is a unique solution; derive it.

Consider the problem

minimize_{x∈R^n} ‖x‖₂²
subject to Ax = y,

where A ∈ R^{m×n}, m < n, rank(A) = m. This is a classical “least-norm” problem, and it is well known that there is a unique solution; derive it.

Choose the cost g of the following LP so that

there are no stationary points
there are infinitely many stationary points
there is only one stationary point

minimize_{x∈R²} g^T x
subject to x1 + x2 ≥ 1
           x2 ≥ 0.

Draw the feasible region, the normals to the two constraints, and g. You can check your solution using Matlab’s LP solver linprog. Does an unconstrained LP (with g ≠ 0) have stationary points?


([1], pp. 282) Prove (without using any linear programming code) that the optimal solution of the following LP is given by x = (1, 1, 1, 1).

minimize_x 47 x1 + 93 x2 + 17 x3 − 93 x4

subject to

[ −1  −6   1   3 ] [ x1 ]   [ −3 ]
[ −1  −2   7   1 ] [ x2 ] ≤ [  5 ]
[  0   3 −10  −1 ] [ x3 ]   [ −8 ]
[ −6 −11  −2  12 ] [ x4 ]   [ −7 ]
[  1   6  −1  −3 ]          [  4 ] .

Is the solution unique?

Construct an example of an inequality constrained QP (in 2D) that has a solution x⋆ at a point where the normals to the active inequality constraints are linearly dependent. What can you say about the vector λ⋆?


[Plot: two polygons P1 and P2 in the (x, y) plane, with points v ∈ P1 and w ∈ P2 marked.]

Find the shortest distance between the polygons P1 and P2, and the points v ∈ P1 and w ∈ P2 that yield the minimum distance. Hint: you have to minimize ‖v − w‖₂ (equivalently, its square) subject to the constraints v ∈ P1 and w ∈ P2 (this is a QP that you can solve using quadprog; see the sketch after the data below). What can you say about the Hessian matrix of the QP? What can you say about the projected Hessian matrix at x⋆?

P1 : [ −1  −1  ] [ x ]   [ 1   ]      P2 : [  0 −1 ] [ x ]   [ 4.5  ]
     [  1  −1  ] [ y ] ≤ [ 3   ]           [  1  0 ] [ y ] ≤ [ −1.5 ]
     [ 0.1 0.8 ]         [ 0.4 ]           [ −1  1 ]         [ −0.1 ] .

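One possible quadprog formulation of the hint (a sketch, not the only option): stack z = (v, w) and minimize ‖v − w‖₂², whose Hessian is singular but positive semidefinite:

% Shortest distance between P1 and P2 as a QP in z = [v; w].
A1 = [-1, -1; 1, -1; 0.1, 0.8];  b1 = [1; 3; 0.4];
A2 = [0, -1; 1, 0; -1, 1];       b2 = [4.5; -1.5; -0.1];
I2 = eye(2);
H  = 2 * [I2, -I2; -I2, I2];     % Hessian of ||v - w||^2 (only semidefinite)
z  = quadprog(H, zeros(4, 1), blkdiag(A1, A2), [b1; b2]);
v  = z(1:2); w = z(3:4);
fprintf('distance = %.4f\n', norm(v - w));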

[1] S. Boyd, and L. Vandenberghe, “Convex Optimization,” Cambridge, 2004.

[2] P. E. Gill, W. Murray, and M. H. Wright, “Practical Optimization,” Emerald,2007.

[3] J. Nocedal, and S. J. Wright, “Numerical Optimization,” Springer.

[4] D. G. Luenberger, Y. Ye, “Linear and Nonlinear programming,” Springer, 2010.

[5] D. Bertsimas, and J. N. Tsitsiklis, “Introduction to Linear Optimization,” AthenaScientific, 1997.

[6] R. Fletcher, “Practical Methods of Optimization,” Wiley, (2nd edition), 1987.

[7] N. Andreasson, A. Evgrafov, and M. Patriksson, “An Introduction to ContinuousOptimization: Foundations and Fundamental Algorithms,” 2005.

[8] D. P. Bertsekas, “Nonlinear Programming,” Athena Scientific, (3rd print) 2008.
