lecture notes: convex duality and financial mathematics

48
Peter Carr and Qiji Jim Zhu Lecture notes: Convex Duality and Financial Mathematics September 9, 2015

Upload: others

Post on 27-Dec-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture notes: Convex Duality and Financial Mathematics

Peter Carr and Qiji Jim Zhu

Lecture notes: Convex Dualityand Financial Mathematics

September 9, 2015

Page 2: Lecture notes: Convex Duality and Financial Mathematics

Contents

1 Lagrange Multipliers and Duality . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Lagrange multiplier rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Convex Functions and Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.3 Sandwich Theorems and Calculus . . . . . . . . . . . . . . . . . . . . . . . . . 191.4 Fenchel Conjugate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241.5 Convex duality theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Page 3: Lecture notes: Convex Duality and Financial Mathematics
Page 4: Lecture notes: Convex Duality and Financial Mathematics

1

Lagrange Multipliers and Duality

This chapter discusses Lagrange multiplier rule for constrained optimizationproblems and its relationship with duality. We use a variational approachto the Lagrange multiplier rule. To understand this approach heuristically,we take real functions f and g := (g1, g2, . . . gN ), and consider the simpleoptimization problem

v(y) = min{f(x) : g(x) ≤ y}.

Here the order used is the coordinate order: z ≤ y if and only if zj ≤ yj foreach 1 ≤ j ≤ N .

Let us assume that xy is a (local) solution to the above problem. Then weobserve that the function

x 7→ f(x)− v(g(x))

attains a minimum at xy. Assuming all functions involved are smooth thenwe have

f ′(xy)−Dv(g(xy))g′(xy) = 0,

revealing λy = −Dv(g(xy)) as a Lagrange multiplier. This gives us an idea asto why we might expect a Lagrange multiplier to exist.

As point out by David Gale in the convex case, if we view y as constraintson resources and −f as the output attainable with that level of resources. ALagrange multiplier, then, reflects the marginal gain of the output functionwith respect to the vector of resource constraints. Following this observation,if we penalize the resource utilization with a (vector) Lagrange multiplier thenthe constrained optimization problem can be converted to an unconstrainedone.

In general, however, v is neither convex nor smooth. This helps explainswhy this view was not prevalent before the systematic development of nons-mooth and variational analysis. This systematic development of convex andnonsmooth analysis during the 1950s through 1970s, respectively, provided

Page 5: Lecture notes: Convex Duality and Financial Mathematics

2 1 Lagrange Multipliers and Duality

suitable tools for proper analysis of Lagrange multipliers. Gale himself pro-vided a rigorous proof of the fact that for well behaved convex problems thesubdifferential of the optimal value function exactly characterizes the set of allLagrange multipliers. Subsequently, many researchers have derived versions ofLagrange multiplier theorems with different ranges of application using othergeneralized derivative concepts (see e.g.,[3]).

Also worthy noting is that there is a close relationship between Lagrangemultiplier and duality theory for convex problems. Many financial problemscan be modeled using this framework. In this chapter we will first discussLagrange multiplier rules and then explore their relationship with convexduality theory. These tools will be applied in later chapters to various financialproblems.

1.1 Lagrange multiplier rules

1.1.1 Notation and the Problem

Let R be the real numbers and let X be a finite dimensional Banach space.Consider an extended-real-valued function f : X → R ∪ {+∞} . The domainof f is the set where it is finite and is denoted by dom f := {x | f(x) <+∞}. The range of f is the set of all the values of f and is denoted byrange f := {f(x) | x ∈ dom f}. We call an extended-valued function f properprovided that its domain is nonempty. We say f : X → R ∪ {+∞} is lowersemicontinuous (lsc) at x provided that lim infy→x f(y) ≥ f(x). We say thatf is lsc if it is lsc everywhere in its domain.

A subset S of X can often be better studied by using related functions.The extended-valued indicator function of S,

ιS(x) = ι(S;x) :=

{0 x ∈ S,

+∞ otherwise,

characterizes S.We also use the distance function

dS(x) = d(S;x) := inf{∥x− y∥ | y ∈ S}.

The Exercises 1.1.1 and 1.1.2 show that the distance function determinesclosed sets.

On the other hand, to study a function f : X → R ∪ {+∞} it is oftenequally helpful to examine its epigraph and graph, related sets in X × R,defined by

epi f := {(x, r) ∈ X × R | f(x) ≤ r}

andgraph f := {(x, f(x)) ∈ X × R | x ∈ dom f}.

Page 6: Lecture notes: Convex Duality and Financial Mathematics

1.1 Lagrange multiplier rules 3

We denote the preimage of f : X → R ∪ {+∞} of a subset S in R by

f−1(S) := {x ∈ X | f(x) ∈ S}.

Two special cases which will be used often are f−1((−∞, a]), the sublevelset, and f−1(a), the level set, of f at a ∈ R. For a set S ⊂ X, we denoteby intS, S, bdS, its interior, closure, boundary, respectively, and we denoteby Br(S) := {x ∈ X | d(S;x) ≤ r} its r-enlargement. Closed sets and lscfunctions are closely related as illustrated in Exercises 1.1.4 and 1.1.5.

Let X, Y and Z be finite dimensional Banach spaces. Assume that Y hasa partial order ≤K generated by the closed convex cone K (y1 ≤K y2 iffy2 − y1 ∈ K). We denote the polar cone of K by

K+ := {y∗ ∈ Y ∗ : ⟨y∗, y⟩ ≥ 0 forall y ∈ K}.

Consider the following class of constrained optimization problems

P (y, z) Minimize f(x) (1.1.1)

Subject to g(x) ≤K y,

h(x) = z,

x ∈ C,

where C is a closed set, f : X → R is lower semicontinuous, g : X → Y islower semicontinuous with respect to ≤Y , and h : X → Z is continuous. Wewill use v(y, z) to represent the optimal value function

v(y, z) := inf{f(x) : g(x) ≤K y, h(x) = z, x ∈ C},

which may take values ±∞ (in infeasible or unbounded below cases), andS(y, z) the (possibly empty) solution set of problem P (y, z).

A concrete example is

P (y, z) Minimize f(x) (1.1.2)

Subject to gi(x) ≤ ym,m = 1, 2, . . . ,M,

hk(x) = zk, k = 1, 2, . . . ,K

x ∈ C ⊂ RN ,

where C is a closed subset, f, gm : RN → R are lower semicontinuousand hk : RN → R are continuous. Defining vector valued function g =(g1, g2, . . . , gM ) and h = (h1, h2, . . . , hk) problem (1.1.2) becomes problem(1.1.1) with ≤K=≤RM+ . Beside euclidean spaces, for applications in this classwe will often need to consider the Banach space of random variables.

Page 7: Lecture notes: Convex Duality and Financial Mathematics

4 1 Lagrange Multipliers and Duality

1.1.2 Lagrange multiplier and subdifferential

Recall the following definition.

Definition 1.1.1 (Subdifferential) The subdifferential of a lower semicontin-uous function ϕ : X → R ∪ {+∞} at x ∈ dom ϕ is defined by

∂ϕ(x) = {x∗ ∈ X∗ : ϕ(y)− ϕ(x) ≥ ⟨x∗, y − x⟩ for all y ∈ X}.

We define the domain of the subdifferential of ϕ by

dom ∂ϕ = {x ∈ RN | ∂ϕ(x) = ∅}.

An element of ∂ϕ(x) is called a subgradient of ϕ at x.

Definition 1.1.2 (Normal cone) For a closed convex set C ⊂ RN , we definethe normal cone of C at x ∈ C by N(C; x) = ∂ιC(x).

Sometimes we will also use the notation NC(x) = N(C; x). A useful charac-terization of the normal cone is x∗ ∈ N(C;x) if and only if, for all y ∈ C,⟨x∗, y − x⟩ ≤ 0.

This globally-defined subdifferential is introduced as a replacement for thepossibly nonexistent derivative of a convex function. It has many applications.While it arises naturally for convex functions the definition works equallywell, at least formally, for nonconvex functions. As we shall see from the twoversions of Lagrange multiplier rules given below, the subdifferential of theoptimal value function completely characterizes the set of Lagrange multipliers(denoted λ in these theorems).

Theorem 1.1.3 (Lagrange multiplier without existence of optimal solution)Let v(y, z) be the optimal value function of the constrained optimization prob-lem P (y, z). Then −λ ∈ ∂v(0, 0) if and only if

(i) (nonnegativity) λ ∈ K+ × Z∗; and(ii)(unconstrained optimum) for any x ∈ C,

f(x) + ⟨λ, (g(x), h(x))⟩ ≥ v(0, 0).

Proof. (a) The “only if” part. Suppose that −λ ∈ ∂v(0, 0). It is easy to seethat v(y, 0) is non-increasing with respect to the partial order ≤K . Thus, forany y ∈ K,

0 ≥ v(y, 0)− v(0, 0) ≥ ⟨−λ, (y, 0)⟩

so that λ ∈ K+×Z∗. Conclusion (ii) follows from, the fact that for all x ∈ C,

f(x) + ⟨λ, (g(x), h(x))⟩ ≥ v(g(x), h(x)) + ⟨λ, (g(x), h(x))⟩ ≥ v(0, 0).(1.1.3)

(b) The “if” part. Suppose λ satisfies conditions (i) and (ii). Then we have,for any x ∈ C, g(x) ≤K y and h(x) = z,

Page 8: Lecture notes: Convex Duality and Financial Mathematics

1.1 Lagrange multiplier rules 5

f(x) + ⟨λ, (y, z)⟩ ≥ f(x) + ⟨λ, (g(x), h(x))⟩ (1.1.4)

≥ v(g(x), h(x)) + ⟨λ, (g(x), h(x))⟩ ≥ v(0, 0).

Taking the infimum of the leftmost term under the constraints x ∈ C, g(x) ≤K

y and h(x) = z, we arrive at

v(y, z) + ⟨λ, (y, z)⟩ ≥ v(0, 0). (1.1.5)

Therefore, −λ ∈ ∂v(0, 0). •If we denote by Λ(y, z) the multipliers satisfying (i) and (ii) of Theorem

1.1.3 then we may write the useful set equality

Λ(0, 0) = −∂v(0, 0).

The next corollary is now immediate. It is often a useful variant since h maywell be affine.

Corollary 1.1.4 (Lagrange multiplier without existence of optimal solution)Let v(y, z) be the optimal value function of the constrained optimization prob-lem P (y, z). Then −λ ∈ ∂v(0, 0) if and only if

(i) (nonnegativity) λ ∈ Y + × Z∗;(ii)(unconstrained optimum) for any x ∈ C, satisfying g(x) ≤K y and h(x) =

z,f(x) + ⟨λ, (y, z)⟩ ≥ v(0, 0).

When an optimal solution for the problem P (0, 0) exists, we can also derivea so called complementary slackness condition.

Theorem 1.1.5 (Lagrange multiplier when optimal solution exists) Let v(y, z)be the optimal value function of the constrained optimization problem P (y, z).Then the pair (x, λ) satisfies −λ ∈ ∂v(0, 0) and x ∈ S(0, 0) if and only if

(i) (nonnegativity) λ ∈ RM+ × RK ;

(ii)(unconstrained optimum) the function

x 7→ f(x) + ⟨λ, (g(x), h(x))⟩

attains its minimum over C at x;(iii)(complementary slackness) ⟨λ, (g(x), h(x))⟩ = 0.

Proof. (a) The “only if” part. Suppose that x ∈ S(0, 0) and −λ ∈ ∂v(0, 0).As in the proof of Theorem 1.1.3 we can show that λ ∈ K+ × Z∗. By thedefinition of the subdifferential and the fact that v(g(x), h(x)) = v(0, 0), wehave

0 = v(g(x), h(x))− v(0, 0) ≥ ⟨−λ, (g(x), h(x))⟩ ≥ 0,

so that the complementary slackness condition ⟨λ, (g(x), h(x))⟩ = 0 holds.

Page 9: Lecture notes: Convex Duality and Financial Mathematics

6 1 Lagrange Multipliers and Duality

Observing that v(0, 0) = f(x) + ⟨λ, (g(x), h(x))⟩, the strengthened uncon-strained optimum condition follows directly from that of Theorem 1.1.3.

(b) The “if” part. Let λ, x satisfy conditions (i), (ii) and (iii). Then, forany x ∈ C satisfying g(x) ≤K 0 and h(x) = 0,

f(x) ≥ f(x) + ⟨λ, (g(x), h(x))⟩ (1.1.6)

≥ f(x) + ⟨λ, (g(x), h(x))⟩ = f(x).

That is to say x ∈ S(0, 0).Moreover, for any g(x) ≤ y, h(x) = z, f(x) + ⟨λ, (y, z)⟩ ≥ f(x) +

⟨λ, (g(x), h(x))⟩. Since v(0, 0) = f(x), by (1.1.6) we have

f(x) + ⟨λ, (y, z)⟩ ≥ f(x) = v(0, 0). (1.1.7)

Taking the infimum on the left hand side of (1.1.7) yields

v(y, z) + ⟨λ, (y, z)⟩ ≥ v(0, 0),

which is to say, −λ ∈ ∂v(0, 0). •We can deduce from Theorems 1.1.3 and 1.1.5 that ∂v(0, 0) completely

characterizes the set of Lagrange multipliers. Thus, sufficient conditions toensure the non-emptiness of ∂v(0, 0) are important in analyzing Lagrangemultipliers. Convex problems provide examples where such analysis is partic-ularly convenient. Since many financial problems naturally involve convexitydue to the use of concave utility functions and convex risk measures the convexcase also plays a crucial role in financial applications. The next two sectionsprovides a concise discussion of the convex sets, functions and the subdiffer-ential theory.

1.1.3 Necessary optimaility conditions

While the theorems in the previous subsection completely characterizes theLagrange multiplier, calculating ∂v is, in general, quite difficult – it requires tosolve the optimal value of problem 1.1.1 at all points near (0, 0). Thus, in prac-tice people often, under additional assumptions, directly using the Lagrangemultipliers as a necessary condition to determine candidates for possible op-timal solutions.

Let’s illustrate how this idea works by considering an idea situation: as-sume that the solution to problem 1.1.1 exists and is denoted x; all the functioninvolved are C1 at x with subdifferential of v at (0, 0) nonempty and, thus, theLagrange multiplier set is nonempty. Let λ = (λ1, . . . , λM+K) be a Lagrangemultiplier with λm ≥ 0,m = 1, 2, . . . ,M . Then the function

x → f(x) +

M∑m=1

λmgm(x) +

K∑k=1

λM+khk(x) (1.1.8)

Page 10: Lecture notes: Convex Duality and Financial Mathematics

1.1 Lagrange multiplier rules 7

attains a minimum at x = x and the Lagrange multipliers satisfies the com-plementary slackness condition

M∑m=1

λmgm(x) = 0 (1.1.9)

It follows that

f ′(x) +M∑

m=1

λmg′m(x) +K∑

k=1

λm+kh′k(x) = 0. (1.1.10)

Clearly, every pair of Lagrange multiplier λ and optimal solution x mustsatisfy the stationary condition (1.1.10) and the complementary slackness con-dition (1.1.9). Thus, these two conditions can be used to find candidates foroptimality. A few examples are assigned as exercises.

1.1.4 Exercises

Exercise 1.1.1 Show that x ∈ S if and only if dS(x) = 0.

Exercise 1.1.2 Suppose that S1 and S2 are two subsets of X. Show thatdS1 = dS2 if and only if S1 = S2.

Exercise 1.1.3 Prove that S is a closed set if and only if ιS is lsc.

Exercise 1.1.4 Prove that f is lsc if and only if epi f is closed.

Exercise 1.1.5 Prove that f is lsc if and only if its sublevel set at a,f−1((−∞, a]), is closed for all a ∈ R.

Exercise 1.1.6 (Normal Cone’s Characterization) Let C be a closed convexsubset of X. Prove that x∗ ∈ NC(x) if and only if, for all y ∈ C,

⟨x∗, y − x⟩ ≤ 0.

Exercise 1.1.7 Show that ∂ϕ(x) is a closed convex set.

Exercise 1.1.8 Show that if ϕ is C1 at x then ∂ϕ(x) ⊂ {ϕ′(x)} and useexample to illustrate that ∂ϕ(x) maybe empty.

Exercise 1.1.9 Find the maximum and minimum of the function f(x, y, z) =x + y + z + 3 on the feasible region defined by constraints x2 + y2 + z2 ≤ 1and x+ 2y + z = 0.

Exercise 1.1.10 Representing the payoff of portfolios by random variable ξ,an investor with risk aversion factor a will try to maximize E[ξ] − aVar(ξ).Suppose the portfolio consists three assets that are independent with returns

Page 11: Lecture notes: Convex Duality and Financial Mathematics

8 1 Lagrange Multipliers and Duality

r1, r2, r3 and variance σ21 , σ

22 , σ

23 . Solve the portfolio optimization problem for

such an investor:

max w1r1 + w2r2 + w3r3 − a(w21σ

21 + w2

2σ22 + w2

3σ23) (1.1.11)

subject to w1 + w2 + w3 = 1, w1, w2, w3 ≥ 0,

where w1, w2, w3 are weights of the corresponding assets in the portfolio.

Exercise 1.1.11 In information theory, the entropy of an information sourceis measure of the disorder or randomness of the source. It is defined to be

E(x1, x2, . . . , xN ) = −N∑

n=1

xn lnxn.

Find the solution to the following entropy maximization problem:

max E(x1, x2, . . . , xN ) (1.1.12)

subject to x1 + x2 + . . .+ xN = 1.

Exercise 1.1.12 (Reilay Quotient) Let A be a symmetric square matrix.Solve optimization problem

max{⟨Ax, x⟩ : ⟨x, x⟩ = 1}

using the Lagrange multiplier rule. What does the Lagrange multiplier mean-ing in this example?

Exercise 1.1.13 (AG average inequality) Using Lagrange multiplier rule toprove the arithmatic -geometric average inequality. Hint: consider problem

min{x1 + . . .+ xN : x1 . . . xN = 1, xn > 0, n = 1, . . . , N}.

Page 12: Lecture notes: Convex Duality and Financial Mathematics

1.2 Convex Functions and Sets

We see that Lagrange multipliers are characterized by subdifferentials of theoptimal value function. Naturally we want to know when will the subdiffer-ential be nonempty. It turns out for a convex function the subdifferential isnonempty at all continuous points. This section is devoted to this importantresult and we start with the basics.

1.2.1 Definitions and Basic Properties

Definition 1.2.1 (Convex sets and functions) We say that a subset C of Xis convex if, for any x, y ∈ C and any λ ∈ [0, 1], λx+ (1− λ)y ∈ C. We say anextended-valued function f : X → R∪{+∞} is convex if its domain is convexand for any x, y ∈ dom f and any λ ∈ [0, 1], one has

f(λx+ (1− λ)y) ≤ λf(x) + (1− λ)f(y).

We call a function f : X → [−∞,+∞) concave if −f is convex.

In some sense convex functions are the simplest functions next to linearfunctions. Convex functions and convex sets are intrinsically related. For ex-ample, if C is a convex set then ιC and dC are convex functions. On the otherhand if f is a convex function then epi f and f−1((−∞, a]), a ∈ R are convexsets. These facts are not hard to verify and are left as exercises. Some of themcan also be easily derived as corollaries of the following two general results.

Proposition 1.2.2 (Convexity of optimal value function) Suppose that in theconstriained optimization problem (1.1.1), functions f is convex, mapping gis ≤K convex, and h is affine and set C is convex. Then the optimal valuefunction v is convex.

Proof. Consider (yi, zi), i = 1, 2 in the domain of v and an arbitrary ε > 0.We can find xi

ε feasible to the constraint of problem P (yi, zi) such that

f(xiε) < v(yi, zi) + ε, i = 1, 2. (1.2.1)

Now for any λ ∈ [0, 1], we have

f(λx1ε + (1− λ)x2

ε) ≤ λf(x1ε) + (1− λ)f(x2

ε) (1.2.2)

< λv(y1, z1) + (1− λ)v(y2, z2) + ε.

It is easy to check that λx1ε + (1− λ)x2

ε is feasible for problem P (λ(y1, z1) +(1 − λ)(y2, z2)). Thus, v(λ(y1, z1) + (1 − λ)(y2, z2)) ≤ f(λx1

ε + (1 − λ)x2ε).

Combining with inequality (1.2.2) and letting ε → 0 we arrive at

v(λ(y1, z1) + (1− λ)(y2, z2)) ≤ λv(y1, z1) + (1− λ)v(y2, z2),

that is to say v is convex. •

Page 13: Lecture notes: Convex Duality and Financial Mathematics

10 1 Lagrange Multipliers and Duality

Proposition 1.2.3 (Epigraph of convex function) Let f : X → R∪{+∞} bean extended-valued function. Then f is convex if and only if epi f is a convexsubset of X × R.

Proof. Exercise. •Two other important functions related to a convex set C are the gauge

function defined by

γC(x) := inf{r > 0 | x ∈ rC},

and the support function defined on the dual space X∗ by

σC(x∗) = σ(C;x∗) := sup{⟨x, x∗⟩ | x ∈ C}.

Several useful properties of the gauge function and the support function areincluded in exercises.

An important property of convex functions related to applications in eco-nomics and finance is the Jensen inequality.

Proposition 1.2.4 (Jensen’s inequality) Let f be a convex function. Then,for any random variable X on a finite probability space,

f(E(X)) ≤ E(f(X)).

Proof. This follows directly from the definitions and is left as an exercise.•

1.2.2 Local Lipschitz Property of Convex Functions

Let f : X → R ∪ {+∞} be a lsc convex functions. We define cont f :={x ∈ int dom f : f is continuous at x}. A somewhat surprising fact is that alsc convex function is actually locally Lipschitz in int dom f and, therefore,cont f = int dom f . The key to this result is that a convex function f locallybounded above is locally Lipschitz in int dom f . This result is quite usefulitself and we describe it in two propositions.

Proposition 1.2.5 Let f : X → R ∪ {+∞} be a convex function. Supposethat f is locally bounded above at x ∈ D := int(dom f). Then f is locallybounded at x (meaning bounded in a ball centered at x).

Proof. Suppose f is bounded above by M , say, in Br(x) ⊂ int(dom f) forsome r > 0, then it is bounded below in Br(x). Indeed, if y ∈ Br(x) then sois 2x− y, the reflection of y with respect to x (see Fig. 1.2.1).

Page 14: Lecture notes: Convex Duality and Financial Mathematics

1.2 Functions and Sets 11

x

y

2x− y

Fig. 1.2.1

Since x = [y + (2x− y)]/2, we have

f(x) ≤ 1

2[f(y) + f(2x− y)] ≤ 1

2[f(y) +M ]

so f(y) ≥ 2f(x)−M for all y ∈ Br(x). •

Proposition 1.2.6 Let f : X → R ∪ {+∞} be a convex function. Supposethat f is locally bounded at x ∈ D := int(dom f). Then f is locally Lipschitzat x.

Proof. Suppose that |f | is bounded byM over B2r(x) ⊂ D. Consider distinctpoints x, y ∈ Br(x). Let a = ∥y − x∥ and let z = y + (r/a)(y − x). Thenz ∈ B2r(x) (see Fig. 1.2.2).

x

y

x

z = y + r∥y−x∥ (y − x)

Fig. 1.2.2

Sincey =

a

a+ rz +

r

a+ rx

is a convex combination lying in B2r(x), we have

f(y) ≤ a

a+ rf(z) +

r

a+ rf(x).

Page 15: Lecture notes: Convex Duality and Financial Mathematics

12 1 Lagrange Multipliers and Duality

Thus,

f(y)− f(x) ≤ a

a+ r(f(z)− f(x)) ≤ 2Ma

r=

2M

r∥y − x∥.

Interchange x and y gives

|f(y)− f(x)| ≤ 2M

r∥y − x∥.

•Our main result is

Theorem 1.2.7 (Lipschitz Property of Convex Functions) Let f : X → R ∪{+∞} be a lsc convex function. Then f is locally Lipschitz on int(dom f).

Proof. We need only show that f is locally bounded above at any pointx ∈ int(dom f). Let {en : n = 1, . . . , N} be a bases of X, let e0 = 0 and e =∑N

n=1 en. Define the unit simplex by ∆ := {

∑Nn=0 λne

n : λn ≥ 0,∑N

n=0 λn =1}. Then, 0 ∈ int(∆ − e/2N). Taking an η small enough we have x + η(∆ −e/2N) ⊂ int(dom f) (see Fig. 1.2.3).

s

t

x

Fig. 1.2.3

For any y ∈ x+ η(∆− e/2N), there exists λn ≥ 0,∑N

n=0 λn = 1 such that

y = x+ ηN∑

n=0

λn(en − e

2N) =

N∑n=0

λn(x+ η(en − e

2N)).

It follows that

f(y) = f(N∑

n=0

λn(x+ η(en − e

2N)) ≤

N∑n=0

λnf(x+ η(en − e

2N))

≤ max{f(x+ η(en − e

2N)) : n = 0, 1, . . . , N}.

Page 16: Lecture notes: Convex Duality and Financial Mathematics

1.2 Functions and Sets 13

Thus, f is locally bounded above at x. •Clearly, as a corollary we have

Corollary 1.2.8 Let f : RN → R ∪ {+∞} be a lsc convex function. Thencont f = int(dom f).

Page 17: Lecture notes: Convex Duality and Financial Mathematics

1.2.3 Fermat’s rule for convex functions

The following easy observation suggests the fundamental significance of sub-differential in optimization.

Proposition 1.2.9 (Subdifferential at Optimality) Let X be a Banach spaceand let f : X → R∪{+∞} be a proper convex function. Then the point x ∈ Xis a (global) minimizer of f if and only if the condition 0 ∈ ∂f(x) holds.

Proof. Exercise. •Alternatively put, minimizers of f correspond exactly to “zeroes” of ∂f .

We know that in general continuity does not imply differentiability. However,for a lscconvex function its subdifferential is nonempty at every point in cont f .This will be the focus of the next two subsections.

1.2.4 Directional Derivatives and Subdifferential

One useful tool in analyzing the convex subdifferential is the directionalderivative. Let f : X → R ∪ {+∞} and let x ∈ dom f and d ∈ X. Thedirectional derivative of f at x in the direction of d is defined by

f ′(x; d) := limt→0+

f(x+ td)− f(x)

t

when this limit exists. It turns out that the directional derivative of a convexfunction is again convex. If a convex function f satisfies the stronger condition

f(λx+ µy) ≤ λf(x) + µf(y) for all x, y ∈ X, λ, µ ≥ 0

we say f is sublinear. The sublinear property of a function f can be decom-posed further into positively homogeneous: f(λx) = λf(x) for all x and λ ≥ 0and subadditive: f(x+ y) ≤ f(x) + f(y) for all x and y. It is an easy exerciseto show that these two properties characterize a sublinear function.

It is immediate that if the function f is sublinear then −f(x) ≤ f(−x) forall x in X. The linearity space of a sublinear function f is the set

lin f = {x ∈ X | −f(x) = f(−x)}.

The following result shows this set is a subspace.

Proposition 1.2.10 (Linearity Space) Let X be a Banach space and letf : X → R ∪ {+∞} be a sublinear function. Then, the linearity space lin fof f is the largest subspace of X on which f is linear.

Proof. It is clear that if Y is a subspace on which f is linear then Y ⊂ lin f .We need only show that lin f is a subspace. Let x ∈ lin f and a ∈ R. Since fis homogeneous we have

Page 18: Lecture notes: Convex Duality and Financial Mathematics

1.2 Functions and Sets 15

f(ax) = |a|f( a

|a|x)= −|a|f

(− a

|a|x)= −f

(|a|(− a

|a|x))

= −f(−ax),

so that ax ∈ lin f . Let x, y ∈ lin f . Since f is subadditive we have

f(x+ y) ≤ f(x) + f(y) = −f(−x)− f(−y) ≤ −f(−x− y) = −f(−(x+ y)),

so that x+ y ∈ lin f . Thus, lin f is a subspace. •Our next proposition asserts that the directional derivative of a convex

function exists at every continuous point of f and is, in fact, sublinear.

Proposition 1.2.11 (Sublinearity of the Directional Derivative) Let f : X →R∪{+∞} be a convex function. Suppose that x ∈ cont f . Then the directionalderivative f ′(x; ·) is everywhere finite and sublinear.

Proof. For d in X and nonzero t in R, define

g(d; t) =f(x+ td)− f(x)

t.

By convexity we deduce (Exercise 1.2.13) for 0 < t ≤ s ∈ R, the inequality

g(d;−s) ≤ g(d;−t) ≤ g(d; t) ≤ g(d; s).

Since x lies in core(dom f), for small s > 0 both g(d;−s) and g(d; s) are finite,so as t ↓ 0 we have

+∞ > g(d; s) ≥ g(d; t) ↓ f ′(x; d) ≥ g(d;−s) > −∞. (1.2.3)

Again by convexity we have for any directions d and e in X and real t > 0,

g(d+ e; t) ≤ g(d; 2t) + g(e; 2t).

Now letting t ↓ 0 we see that f ′(x; ·) is subadditive. The positive homogeneityis easy to check. •

Next we show that the directional derivative characterizes subgradients.That explains why it is useful in analyzing the subdifferential.

Proposition 1.2.12 (Subgradients and Directional Derivatives) Let f : RN →R ∪ {+∞} be a convex function and let x ∈ dom f . Then x∗ ∈ X∗ is a sub-gradient of f at x if and only if it satisfies ⟨x∗, x⟩ ≤ f ′(x;x) for all x ∈ RN .

Proof. For the “only if” part, let x∗ ∈ ∂f(x). Then, for any h ∈ X andt > 0,

⟨x∗, th⟩ ≤ f(x+ th)− f(x).

Dividing by t and taking limits as t → 0 we have ⟨x∗, h⟩ ≤ f ′(x;h).For the reverse direction, it follows from the proof of Proposition 1.2.11

that for any h ∈ X and t > 0,

Page 19: Lecture notes: Convex Duality and Financial Mathematics

16 1 Lagrange Multipliers and Duality

⟨x∗, h⟩ ≤ f ′(x;h) ≤ f(x+ th)− f(x)

t.

Let x be an arbitrary element of X. Setting h = x− x and t = 1 in the aboveinequality we have

⟨x∗, x− x⟩ ≤ f(x)− f(x),

that is x∗ ∈ ∂f(x). •

1.2.5 Nonemptiness of the Subdifferential

The main result of this section is that the set of subgradients of a convexfunction is usually nonempty. We prove this by actually constructing a sub-gradient. The idea is rather simple. We recursively construct a decreasingsequence of sublinear functions which, after translation, minorize f . At eachstep we guarantee one extra direction of linearity. Thus, after at most N stepswe arrive at a linear function that minorizes f subject to a translation. Thebasic step is summarized in the following lemma.

Lemma 1.2.13 Let p : X → R∪{+∞} be a sublinear function. Suppose thatd ∈ cont p. Then the function q(·) = p′(d; ·) satisfies the conditions

(i) q(λd) = λp(d) for all real λ,(ii) q ≤ p,(iii) lin q ⊃ lin p+ span{d}, and(iv) p = q on lin p.

The prove is merely to check the conclusions hold and is left as an exercise.With these tools we are now ready for the main result, which gives condi-

tions guaranteeing the existence of a subgradient of a convex function. Propo-sition 1.2.12 showed how to identify subgradients from directional derivatives;this next result shows how to move in the reverse direction.

Theorem 1.2.14 (Max Formula) Let d ∈ X and let f : X → R ∪ {+∞} bea convex function. Suppose that x ∈ cont f . Then,

f ′(x; d) = max{⟨x∗, d⟩ : x∗ ∈ ∂f(x)}. (1.2.4)

Proof. In view of Proposition 1.2.12, we simply have to show that for anyfixed d in RN there is a subgradient x∗ satisfying ⟨x∗, d⟩ = f ′(x; d).

Choose a basis {e1, e2, . . . , eN} for X with e1 = d if d is nonzero. Nowdefine a sequence of functions p0, p1, . . . , pN recursively by p0(x) = f ′(x;x),and pk(x) = p′k−1(ek;x) for k = 1, 2, . . . , N . We will show that pN is therequired subgradient.

Note that by Proposition 1.2.11 each pk is a sublinear function defined onRN . By part (iii) of Proposition 1.2.13 we know that, for k = 1, 2, . . . , N ,

Page 20: Lecture notes: Convex Duality and Financial Mathematics

1.2 Functions and Sets 17

lin pk ⊃ lin pk−1 + span{ek},

so pN is linear. Thus, there is an element x∗ ∈ RN satisfying ⟨x∗, x⟩ = pN (x).Part (ii) of Proposition 1.2.13 implies that pN ≤ pN−1 ≤ . . . ,≤ p0. It

follows that, for any x ∈ RN ,

⟨x∗, x− x⟩ = pN (x− x) ≤ p0(x− x) = f ′(x;x− x) ≤ f(x)− f(x).

Thus, x∗ ∈ ∂f(x). The max formula follows from ⟨x∗, d⟩ = p0(d) = f ′(x; d).•

As an easy corollary of the max formula we have the following key resultin subdifferential theory due to Fenchel and Rockafellar.

Theorem 1.2.15 (Nonemptiness of Subdifferential) Let f : X → R ∪ {+∞}be a convex function. Suppose that x ∈ cont f . Then the subdifferential ∂f(x)is nonempty.

Proof. Follows directly from Theorem 1.2.14. •The conclusion ∂f(x) = ∅ can be stated alternatively as there exists a

linear functional x∗ such that f − x∗ attains its minimum at x. This is a veryuseful perspective relates to the use of variational arguments – derive resultsby observing a certain auxiliary function attains a minimum or maximum.

Finally, we note that for smooth functions the convexity be characterizedusing the derivatives and second order derivative (see Exercises 1.2.15 and1.2.16).

1.2.6 Exercises

Exercise 1.2.1 Prove Proposition 1.2.3

Exercise 1.2.2 Let C be a convex subset. Show that

(1) dC is convex, (Hint: Use Propostion 1.2.2)(2) σC is convex. (Hint: Use Propostion 1.2.3)

Exercise 1.2.3 Let f be a convex function. Show that for any a ∈ R,f−1((−∞, a]) is a convex set.

Exercise 1.2.4 Show that the intersection of a family of arbitrary convexsets is convex. Conclude that f(x) := sup{fα(x) : α ∈ A} is convex (and lsc)when {fα}α∈A is a collection of convex (and lsc) functions.

Exercise 1.2.5 * Calculate the gauge function for C := epi 1/x ∩ R2+ and

conclude that a gauge function is not necessarily lsc.

Exercise 1.2.6 Let C be a convex subset of X and let γC be the gaugefunction of C.

Page 21: Lecture notes: Convex Duality and Financial Mathematics

18 1 Lagrange Multipliers and Duality

(i) Show that γC is convex.(ii) Show that if x ∈ intC then dom γC−x = X.(iii) Suppose 0 ∈ intC. Prove that clC ⊂ {x ∈ X | γC(x) ≤ 1}.

Exercise 1.2.7 Prove Lemma 1.2.13.

Exercise 1.2.8 Let C be a convex subset of RN .

(i) Show that intC = {x ∈ X | γC(x) < 1}.(ii) Deduce that γC is defined on X and is continuous.

Exercise 1.2.9 Prove Proposition 1.2.9

Exercise 1.2.10 (Jensen’s inequality) Prove the Jensen’s inequality in Propo-sition 1.2.4.

Exercise 1.2.11 (AM-GM inequality) Using Jensen’s inequality to prove theAM-GM inequality: for nonnegative numbers xn, n = 1, 2, . . . N ,

(x1x2 . . . xN )1/N ≤ x1 + x2 + . . .+ xN

N.

Hint: Let f = ln.

Exercise 1.2.12 Let C1 and C2 be closed convex subsets RN . Then C1 ⊂ C2

if and only if, for any x∗ ∈ RN , σ(C1;x∗) ≤ σ(C2;x

∗). Thus, a closed convexset is characterized by its support function.

Exercise 1.2.13 Let f : X → R ∪ {+∞} be a convex function. Suppose thatx ∈ cont f . Show that for any d ∈ X, t → g(d; t) := (f(x+ td)− f(x))/t is anondecreasing function.

Exercise 1.2.14 Let p0(x1, x2, x3) := |x1| + |x2| − x3. Show that p0 is sub-linear and calculate pi(·) = p′i−1(e

i; ·), lin pi, i = 1, 2, 3

∗Exercise 1.2.15 (Monotonicity of Gradients) Suppose that the set C ⊂ RN

is open and convex and that the function f : C → R is differentiable. Prove fis convex if and only if

⟨f ′(x)− f ′(y), x− y⟩ ≥ 0 for all x, y ∈ S,

and f is strictly convex if and only if the above inequality holds strictlywhenever x = y.

∗Exercise 1.2.16 (Second Order Characterization) Suppose that the set C ⊂RN is open and convex and that the function f : C → R is twice continuouslydifferentiable on C. Prove that f is convex if and only if its Hessian matrix ispositive semidefinite everywhere on C, and f is strictly convex if its Hessianmatrix is positive definite everywhere on C. Hint: Consider function g definedby g(t) = f(x+ td) for small real t, points x in C, and directions d in X.

Page 22: Lecture notes: Convex Duality and Financial Mathematics

1.3 Sandwich Theorems and Calculus

1.3.1 A Decoupling Lemma

To apply the convex subdifferential we need a convenient calculus for it. Itturns out the key for developing such a calculus is to combine a decouplingmechanism with the existence of subgradient. We summarize this idea in thefollowing lemma.

Lemma 1.3.1 (Decoupling Lemma) Let the functions f : X → R and g : Y →R be convex and let A : X → Y be a linear transform. Suppose that f , g andA satisfy the condition

0 ∈ ri dom g −A dom f. (1.3.1)

Then there is a y∗ ∈ Y ∗ such that for any x ∈ X and y ∈ Y ,

p ≤ [f(x)− ⟨y∗, Ax⟩] + [g(y) + ⟨y∗, y⟩], (1.3.2)

where p = infX{f(x) + g(Ax)}.

Proof. Define an optimal value function v : Y → [−∞,+∞] by

v(u) = infx∈X

{f(x) + g(Ax+ u)}

= infx∈X

{f(x) + g(y) : y −Ax = u.} (1.3.3)

Proposition 1.2.2 implies that v is convex. Moreover, it is easy to check thatdom v = dom g−A dom f so that the constraint qualification condition (1.3.1)ensures that ∂v(0) = ∅. By Theorem 1.1.3 constrained optimization problem(1.3.3) has a Lagrange multiplier −y∗ ∈ ∂v(0) such that

v(0) = p ≤ f(x) + g(y) + ⟨y∗, y −Ax⟩. (1.3.4)

1.3.2 Sandwich Theorems

We apply the decoupling lemma of Lemma 1.3.1 to establish a sandwich the-orem.

Theorem 1.3.2 (Sandwich Theorem) Let f : X → R ∪ {+∞} and g : Y →R ∪ {+∞} be convex functions and let A : X → Y be a linear map. Supposethat f ≥ −g ◦ A and f , g and A satisfy condition (1.3.1). Then there isan affine function α : X → R of the form α(x) = ⟨A⊤y∗, x⟩ + r satisfyingf ≥ α ≥ −g ◦ A. Moreover, for any x satisfying f(x) = −g ◦ A(x), we have−y∗ ∈ ∂g(Ax).

Page 23: Lecture notes: Convex Duality and Financial Mathematics

20 1 Lagrange Multipliers and Duality

Proof. By Lemma 1.3.1 there exists y∗ ∈ Y ∗ such that for any x ∈ X andy ∈ Y ,

0 ≤ p ≤ [f(x)− ⟨y∗, Ax⟩] + [g(y) + ⟨y∗, y⟩]. (1.3.5)

For any z ∈ X setting y = Az in (1.3.5) we have

f(x)− ⟨A⊤y∗, x⟩ ≥ −g(Az)− ⟨A⊤y∗, z⟩. (1.3.6)

Thus,

a := infx∈X

[f(x)− ⟨A⊤y∗, x⟩] ≥ b := supz∈X

[−g(Az)− ⟨A⊤y∗, z⟩].

Picking any r ∈ [a, b], α(x) := ⟨A⊤y∗, x⟩+r is an affine function that separatesf and −g ◦ A. Finally, when f(x) = −g ◦ A(x), it follows from (1.3.5) that−x∗ ∈ ∂g(Ax). •

1.3.3 Calculus for the Subdifferential

We now use the tools established above to deduce calculus rules for the convexfunctions. We start with a sum rule plays a role similar to the sum rule forderivatives in calculus.

Theorem 1.3.3 (Convex Subdifferential Sum Rule) Let f : X → R∪ {+∞}and g : Y → R ∪ {+∞} be convex functions and let A : X → Y be a linearmap. Then at any point x in X, we have the sum rule

∂(f + g ◦A)(x) ⊃ ∂f(x) +A⊤∂g(Ax), (1.3.7)

with equality if condition (1.3.1) holds.

Proof. Inclusion (1.3.7) is easy and left as an exercise. We prove the reverseinclusion under condition (1.3.1). Suppose x∗ ∈ ∂(f +g ◦A)(x). Since shiftingby a constant does not change the subdifferential of a convex function, wemay assume without loss of generality that

x → f(x) + g(Ax)− ⟨x∗, x⟩

attains its minimum 0 at x = x. By the sandwich theorem there exists anaffine function α(x) := ⟨A⊤y∗, x⟩+ r with −y∗ ∈ ∂g(Ax) such that

f(x)− ⟨x∗, x⟩ ≥ α(x) ≥ −g(Ax).

Clearly equality is attained at x = x. It is now an easy matter to check thatx∗ +A∗y∗ ∈ ∂f(x). •

Page 24: Lecture notes: Convex Duality and Financial Mathematics

1.3 Sandwich Theorems 21

Note that when A is the identity mapping and both f and g are differen-tiable Theorem 1.3.3 recovers sum rules in calculus. The geometrical interpre-tation of this is that one can find a hyperplane in X × R that separates theepigraph of f and hypograph of −g.

To deal with convex sets we need the concept of normal cone:

Definition 1.3.4 Let C be a closed convex set and let x ∈ C. We define thenormal cone of C at x by

N(C;x) = ∂ιC(x).

By applying the subdifferential sum rule to the indicator functions of twoconvex sets we have parallel results for the normal cones to the intersectionof convex sets.

Theorem 1.3.5 (Normals to an Intersection) Let C1 and C2 be two convexsubsets of X and let x ∈ C1 ∩ C2. Suppose that C1 ∩ intC2 = ∅. Then

N(C1 ∩ C2;x) = N(C1;x) +N(C2;x).

Proof. Exercise 1.3.4. •The condition (1.3.1) is often referred to as a constraint qualification.

Without it the equality in the convex subdifferential sum rule may not hold(Exercise 1.3.9).

A useful application of the sum rule is the Pshenichnii–Rockafellar condi-tions. Consider the constrained convex optimization problem of

CP minimize f(x)

subject to x ∈ C ⊂ X,

(1.3.8)

where C is a closed convex subset of X and f : X → R∪{+∞} is a convex lscfunction. The convex subdifferential calculus developed in this section enablesus to derive sharp necessary optimality conditions for CP.

Theorem 1.3.6 (Pshenichnii–Rockafellar Conditions) Let C be a closed con-vex subset of RN and let f : X → R ∪ {+∞} be a convex function. Supposethat C ∩ cont f = ∅ and f is bounded below on C. Then x is a solution of CPif and only if it satisfies

0 ∈ ∂f(x) +N(C; x).

Proof. Apply the convex subdifferential sum rule of Theorem 1.3.3 to f + ιCat x. •

We can also derive a version of the multidirectional mean value inequalityfor convex functions that generalizes the Lagrange mean value theorem. Wewill use [x,C] to signal the convex hull of {x} ∪ C–the smallest convex setthat contains this set.

Page 25: Lecture notes: Convex Duality and Financial Mathematics

22 1 Lagrange Multipliers and Duality

Theorem 1.3.7 (Convex Multidirectional Mean Value Inequality) Let C be anonempty, closed and bounded convex subset of X and x ∈ X and let f : X →R be a continuous convex function. Then there exist z ∈ [x,C] and z∗ ∈ ∂f(z),such that

infc∈C

f(c)− f(x) ≤ ⟨z∗, y − x⟩ for all y ∈ C.

Proof. First we consider the special case when infc∈C f(c) − f(x) = 0. Letz ∈ [x,C] be a minimum of f over [x,C]. Without loss of generality we mayassume that z ∈ C (why?). Then by Theorem 1.3.6,

0 ∈ ∂f(z) +N([x,C]; z).

Thus, there exists z∗ ∈ ∂f(z) such that −z∗ ∈ N([x,C]; z). It follows fromExercise 1.1.6 that

0 ≤ ⟨z∗, u− z⟩, for all u ∈ [x,C]. (1.3.9)

Since z ∈ [x,C]\C, z = λx+ (1− λ)c where c ∈ C and λ > 0. For any y ∈ C,letting u = λy + (1− λ)c in (1.3.9) we obtain

0 ≤ ⟨z∗, y − x⟩.

For the general case, denote r = infc∈C f(c) − f(x) and define F (c, t) =f(c)− rt. We can verify that F is a continuous convex function on X×R and

0 = inf(c,t)∈C×{1}

F (c, t)− F (x, 0).

It remains to apply the special case to function F , point (x, 0) and convex setC × {1}. •

1.3.4 The Separation Theorems

We have seen that the sandwich theorem and the convex subdifferential calcu-lus are intimately related to the separation theorem. We now explicitly deducethe separation theorem.

Theorem 1.3.8 (Separation Theorem) Let C1 and C2 be two convex subsetsof X. Suppose that intC1 = ∅ and C2 ∩ intC1 = ∅. Then there exists anaffine function α on X such that

supc1∈C1

α(c1) ≤ infc2∈C2

α(c2).

Proof. Without loss of generality we may assume that 0 ∈ intC1 and thenapply the sandwich theorem with f = ιclC2 , A the identity mapping of X andg = γC1 − 1. •

Page 26: Lecture notes: Convex Duality and Financial Mathematics

1.3 Sandwich Theorems 23

1.3.5 Commentary and Exercises

There are different ways of developing basic results in convex analysis. Here wefollow our short notes [2] using the Decoupling Lemma as a starting point. Thisdevelopment actually works also in infinite dimensional space (see [?]). Theproof of this result is a typical variational argument –proof by using the fact acertain function attains minimum. We illustrate the potential of this theoremby deducing the sandwich theorem, the sum rule for convex subdifferential anda convex version of the multidirectional mean value theorem due to Ledyaevand Zhu [5]. The Pshenichnii–Rockafellar condition [6, 7] provides a prototypefor necessary conditions for nonsmooth constrained optimization problems.

Exercise 1.3.1 Define h(u) = infx∈X{f(x)+g(Ax+u)}. Prove that domh =dom g −A dom f.

Exercise 1.3.2 Supply details for the proof of Theorem 1.3.3 by

(i) Proving (1.3.7).(ii) Verifying x∗ +A∗y∗ ∈ ∂f(x).

Exercise 1.3.3 Interpret the sandwich theorem geometrically in the casewhen A is the identity map.

∗Exercise 1.3.4 Prove Theorem 1.3.5.

Exercise 1.3.5 Give the details of the proof of Theorem 1.3.6.

Exercise 1.3.6 Apply the Pshenichnii–Rockafellar conditions to the follow-ing two cases:

(i) C a single point {x0} ⊂ RN ,(ii) C a polyhedron {x | Ax ≤ b}, where b ∈ RM .

Exercise 1.3.7 Provide details for the proof of Theorem 1.3.8.

∗Exercise 1.3.8 (Subdifferential of a Max-Function) Suppose that I is a finiteset of integers, and gi : X → R∪{+∞} , i ∈ I are lower semicontinuous convexfunctions with dom gj ∩

∩i∈I\{j} cont gi = ∅ for some index j in I. Prove

∂(maxi gi)(x) = conv∪

i∈I ∂gi(x).

∗Exercise 1.3.9 (Failure of Convex Calculus)

(i) Find convex functions f, g : R → R ∪ {+∞} with

∂f(0) + ∂g(0) = ∂(f + g)(0).

(ii) Find a convex function g : R2 → R∪{+∞} and a linear map A : R → R2

with A∗∂g(0) = ∂(g ◦A)(0).

Page 27: Lecture notes: Convex Duality and Financial Mathematics

1.4 Fenchel Conjugate

Obtaining Lagrange multipliers by using convex subdifferentials is closely re-lated to convex duality theory based on the concept of conjugation functionsintroduced by Fenchel. In this section we give a concise sketch of the Fenchelconjugation theory. One can regard it as a natural generalization of the linearprogramming duality, or as a form of the Legendre transform in the convexsetting. It is an elegant primal-dual space representation of the basic varia-tional techniques for convex functions.

1.4.1 The Fenchel Conjugate

The Fenchel conjugate of a function f : X → [−∞,+∞] is the functionf∗ : X∗ → [−∞,+∞] defined by

f∗(x∗) = supx∈X

{⟨x∗, x⟩ − f(x)}.

The function f∗ is convex and if the domain of f is nonempty then f∗ nevertakes the value −∞. Clearly the conjugacy operation is order-reversing : forfunctions f, g : X → [−∞,+∞], the inequality f ≥ g implies f∗ ≤ g∗. We canconsider the conjugate of f∗ called the biconjugate of f and denoted f∗∗. Thisis a function on X∗∗ = X.

1.4.2 The Fenchel–Young Inequality

This is an elementary but important result that relates conjugation with thesubgradient.

Proposition 1.4.1 (Fenchel–Young Inequality) Let f : X → R∪{+∞} be aconvex function. Suppose that x∗ ∈ X∗ and x ∈ dom f . Then

f(x) + f∗(x∗) ≥ ⟨x∗, x⟩. (1.4.1)

Equality holds if and only if x∗ ∈ ∂f(x).

Proof. The inequality (1.4.1) follows directly from the definition. We havethe equality

f(x) + f∗(x∗) = ⟨x∗, x⟩,

if and only if, for any y ∈ X,

f(x) + ⟨x∗, y⟩ − f(y) ≤ ⟨x∗, x⟩.

That isf(y)− f(x) ≥ ⟨x∗, y − x⟩,

or x∗ ∈ ∂f(x). •Combining Fenchel inequality and the sandwich theorem we can show that

f∗∗ = f for convex lsc function f .

Page 28: Lecture notes: Convex Duality and Financial Mathematics

1.4 Fenchel Conjugate 25

Theorem 1.4.2 (Biconjugate) Let X be a finite dimensional Banach space.Then f∗∗ ≤ f in dom f and equality holds when f is lsc and convex.

Proof. It is easy to check f∗∗ ≤ f and we leave it as an exercise. Now assumethat f is lsc and convex. For any x ∈ dom f , define g(x) = ιx(x) − f(x). Wecan see that g is lsc convex function satisfying f ≥ −g and f(x) = g(x).Applying the sandwich theorem we conclude there exists an affine function⟨x∗, x⟩+ r such that

f(x) ≥ ⟨x∗, x⟩+ r ≥ −ιx(x) + f(x).

Setting x = x we derive r = f(x)− ⟨x∗, x⟩. Hence, for all x ∈ X,

f(x) ≥ f(x) + ⟨x∗, x− x⟩.

That is to say x∗ ∈ ∂f(x). By the Fenchel inequality we have

f(x) = ⟨x∗, x⟩ − f∗(x∗) = supy∗

[⟨y∗, x⟩ − f∗(y∗)] = f∗∗(x).

Remark 1.4.3 The order reversing property and the self-involution prop-erty described in Theorem 1.4.2, in fact, characterizes the Fenchel -Legrendratransform up to a linear transform. This is a deep result derived in [].

To relate Fenchel duality and convex programming with linear constraints,we let g be the indicator function of a point, which gives the following partic-ularly elegant and useful corollary.

Corollary 1.4.4 (Fenchel Duality for Linear Constraints) Given any func-tion f : X → R∪{+∞} , any bounded linear map A : X → Y , and any elementb of Y , the weak duality inequality

infx∈X

{f(x) | Ax = b} ≥ supx∗∈Y

{⟨b, x∗⟩ − f∗(A∗x∗)}

holds. If f is lsc and convex and b belongs to int(Adom f) then equality holds,and the supremum is attained when finite.

Proof. Exercise 1.4.9. •

1.4.3 Graphic illustration and generalizations

For increasing function ϕ, ϕ(0) = 0, f(x) =∫ x

0ϕ(s)ds is convex and f∗(x∗) =∫ x∗

0ϕ−1(t)dt. Graphs Fig. 1.4.1 and Fig. 1.4.2. illustrates the Fenchel-Young

inequality graphically. The additional areas enclosed by the graph of ϕ−1,

Page 29: Lecture notes: Convex Duality and Financial Mathematics

26 1 Lagrange Multipliers and Duality

s = x and t = x∗ or that of ϕ, s = x and t = x∗ beyond the area of therectangle [0, x] × [0, x∗] generates the additional area that leads to a strictinequality. We also see that equality holds when x∗ = ϕ(x) = f ′(x) andx = ϕ−1(x∗) = (f∗)′(x∗) in Fig. 1.4.3.

s

t

O x

x∗

ϕϕ−1

Fig. 1.4.1. Fenchel-Young inequality

s

t

O x

x∗

ϕϕ−1

Fig. 1.4.2. Fenchel-Young inequality

s

t

O x

x∗

ϕϕ−1

Fig. 1.4.3. Fenchel-Young equality

Page 30: Lecture notes: Convex Duality and Financial Mathematics

1.4 Fenchel Conjugate 27

Thinking in terms of these graphs we also realize that the underlyinginequality relationship remains valid when the area is weighted by a positive‘density’ function K(s, t). Thus, we have

Theorem 1.4.5 (Weighted Fenchel-Young inequality) Let K(x, y) be a con-tinuous positive function and let ϕ be a continuous increasing function withϕ(0) = 0. Then∫ x

0

∫ x∗

0

K(s, t)dsdt ≤∫ x

0

∫ ϕ(s)

0

K(s, t)dtds+

∫ x∗

0

∫ ϕ−1(t)

0

K(s, t)dsdt

With equality attiained when x∗ = ϕ(x) and x = ϕ−1(x∗).

Proof. If ϕ(x) ≥ x∗ we have∫ x

0

∫ ϕ(s)

0

K(s, t)dtds+

∫ x∗

0

∫ ϕ−1(t)

0

K(s, t)dsdt (1.4.2)

≥∫ x

0

∫ x∗

0

K(s, t)dsdt+

∫ x

ϕ−1(x∗)

∫ ϕ(s)

x∗K(s, t)dtds

≥∫ x

0

∫ x∗

0

K(s, t)dsdt.

Otherwise, ϕ(x) < x∗ and we have∫ x

0

∫ ϕ(s)

0

K(s, t)dtds+

∫ x∗

0

∫ ϕ−1(t)

0

K(s, t)dsdt (1.4.3)

≥∫ x

0

∫ x∗

0

K(s, t)dsdt+

∫ ϕ−1(x∗)

x

∫ x∗

ϕ(s)

K(s, t)dtds

≥∫ x

0

∫ x∗

0

K(s, t)dsdt.

Clearly equality holds if and only if ϕ(x) = x∗. •The condition ϕ(0) = 0 merely conveniently locate the lower left conner

of the graph to the coordinate origin and is clearly not essential. In generalwe can always shift this corner to any point (a, ϕ(a)). More substantively, therequirment that ϕ being a continuous increasing function can be relaxed tonondecresing as long as ϕ−1 is replaced appropriately by

ϕ−1inf (t) = inf{s, ϕ(s) ≥ t}.

Now we can state a more general Fenchel- Young inequality

Theorem 1.4.6 (Weighted Fenchel-Young inequality) Let K(x, y) be a boundedessentially positive measurable function and let ϕ be a nonincreasing function.Then

Page 31: Lecture notes: Convex Duality and Financial Mathematics

28 1 Lagrange Multipliers and Duality∫ x

a

∫ x∗

ϕ(a)

K(s, t)dsdt ≤∫ x

a

∫ ϕ(s)

ϕ(a)

K(s, t)dtds+

∫ x∗

ϕ(a)

∫ ϕ−1inf (t)

a

K(s, t)dsdt

With equality attiained when x∗ ∈ [ϕ(x−), ϕ(x+)], x ∈ [ϕ−1inf (x

∗−), ϕ−1inf (x

∗+)].

Proof. Exercise. •The above idea can be further pushed in two different directions in the

next two sections.

1.4.4 n-dimensional Fenchel Young inequality

It is easier to understand and to formulate general n-dimensional FenchelYoung inequality starting by re-examining the graphs presented above with aparameterization (ϕ1, ϕ2) of the graph of ϕ in Fig 1.4.4–1.4.6.

ϕ1

ϕ2

(ϕ1(a), ϕ2(a)) ϕ1(b1)

ϕ2(b2)

(ϕ1(t), ϕ2(t))

Fig. 1.4.4. Fenchel-Young inequality b2 > b1

ϕ1

ϕ2

(ϕ1(a), ϕ2(a)) ϕ1(b1)

ϕ2(b2)

(ϕ1(t), ϕ2(t))

Fig. 1.4.5. Fenchel-Young inequality b2 < b1

Page 32: Lecture notes: Convex Duality and Financial Mathematics

1.4 Fenchel Conjugate 29

ϕ1

ϕ2

(ϕ1(a), ϕ2(a)) ϕ1(b1)

ϕ2(b1)

(ϕ1(t), ϕ2(t))

Fig. 1.4.6. Fenchel-Young equality b2 = b1

Let K(s1, s2) be a nonnegative function and let ϕ1, ϕ2 be increasing func-tions. To avoid technical complication we assume ϕ1, ϕ2 are invertible. Thenwe can rewrite the Fenchel-Young inequality as

∫ ϕ2(b2)

ϕ2(a)

∫ ϕ1(b1)

ϕ1(a)

K(s1, s2)ds1ds2 (1.4.4)

≤∫ ϕ1(b1)

ϕ1(a)

∫ ϕ2(ϕ−11 (s1))

ϕ2(a)

K(s1, s2)ds2ds1 +

∫ ϕ2(b2)

ϕ2(a)

∫ ϕ1(ϕ−12 (s2))

ϕ1(a)

K(s1, s2)ds1ds2(1.4.5)

with equality attained when b1 = b2.This form of the Fenchel - Young inequality can easily be generalized to

N -dimention with an induction argument. We will use the following vectornotation: sN = (s1, . . . , sN ), 1N = (1, 1, . . . , 1) and

sNn = (s1, . . . , sn−1, sn+1, . . . , sN ).

When ϕN = (ϕ1, . . . , ϕN ) is a vector valued function we define

ϕN (sN ) = (ϕ1(s1), . . . , ϕN (sN )).

Similarly,∫ ϕN (bN )

ϕN (aN )

K(sN )dsN =

∫ ϕN (b1)

ϕ1(a1)

. . .

∫ ϕN (bN )

ϕN (aN )

K(s1, . . . , sN )dsN . . . ds1.

Theorem 1.4.7 (N -dimensioal generalized Fenchel-Young inequality) LetK : RN 7→ R be a nonnegative function and let ϕN be vector function with allthe components increasing and invertible. We have∫ ϕN (bN )

ϕN (a·1N )

K(sN )dsN ≤N∑

n=1

∫ ϕn(bn)

ϕn(a)

∫ ϕNn (ϕ−1

n (sn)·1N−1)

ϕNn (a·1N−1)

K(sN )dsNn dsn

with equality attained when b1 = b2 = . . . = bN .

Page 33: Lecture notes: Convex Duality and Financial Mathematics

30 1 Lagrange Multipliers and Duality

Proof. We prove by induction. The case N = 2 has already been established.We focus on the induction step. By separating the integration wirh respect todsN+1, we can write the left hand side of the inequality as

LHS =

∫ ϕN (bN+1)

ϕN+1(a·1N+1)

K(sN+1)dsN+1

=

∫ ϕN+1(bN+1)

ϕN+1(a)

∫ ϕN (bN )

ϕN (a·1N )

K(sN+1)dsNdsN+1

Applying the induction hypothesis to the inner layer of the integration wehave

LHS ≤∫ ϕN+1(bN+1)

ϕN+1(a)

N∑n=1

∫ ϕn(bn)

ϕn(a)

∫ ϕNn (ϕ−1

n (sn)·1N−1)

ϕNn (a·1N−1)

K(sN+1)dsNn dsndsN+1

=N∑

n=1

∫ ϕN+1(bN+1)

ϕN+1(a)

∫ ϕn(bn)

ϕn(a)

(∫ ϕNn (ϕ−1

n (sn)·1N−1)

ϕNn (a·1N−1)

K(sN+1)dsNn

)dsndsN+1.

The last equality groups the two out layers of the integration together. NowApplying the Fenchel-Young inequality with N = 2 to get

LHS ≤N∑

n=1

∫ ϕn(bn)

ϕn(a)

∫ ϕN+1(ϕ−1n (sn))

ϕN+1(a)

∫ ϕNn (ϕ−1

n (sn)·1N−1)

ϕNn (a·1N−1)

K(sN+1)dsNn dsN+1dsn

+

∫ ϕN+1(bN+1)

ϕN+1(a)

N∑n=1

∫ ϕn(ϕ−1N+1(sN+1))

ϕn(a)

∫ ϕNn (ϕ−1

n (sn)·1N−1)

ϕNn (a·1N−1)

K(sN+1)dsNn dsndsN+1

Combining the inner layersof the integration in the first sum and applyingthe equality part of the induction hypothesis for the second sum we arrive at

LHS ≤N∑

n=1

∫ ϕn(bn)

ϕn(a)

∫ ϕN+1n (ϕ−1

n (sn)·1N )

ϕN+1n (a·1N )

K(sN+1)dsN+1n dsn

+

∫ ϕN+1(bN+1)

ϕN+1(a)

∫ ϕN (ϕ−1N+1(sN+1)·1N )

ϕNn (a·1N )

K(sN+1)dsNdsN+1

=N+1∑n=1

∫ ϕn(bn)

ϕn(a)

∫ ϕN+1n (ϕ−1

n (sn)·1N )

ϕN+1n (a·1N )

K(sN+1)dsN+1n dsn = RHS.

Remark 1.4.8 We also have the following alternative form of estimations bychanging the way of integration. Let K(s1, s2) be a nonnegative function andlet ϕ1, ϕ2 be non-decreasing functions.

Page 34: Lecture notes: Convex Duality and Financial Mathematics

1.4 Fenchel Conjugate 31∫ ϕ2(b2)

ϕ2(a)

∫ ϕ1(b1)

ϕ1(a)

K(s1, s2)ds1ds2

≤∫ ϕ1(b2)

ϕ1(a)

∫ ϕ2(b2)

ϕ2(ϕ−11 (s1))

K(s1, s2)ds2ds1 +

∫ ϕ2(b1)

ϕ2(a)

∫ ϕ1(b1)

ϕ1(ϕ−12 (s2))

K(s1, s2)ds1ds2

with equality attained when b1 = b2.

1.4.5 Generalized Fenchel conjugate

Note that the graphic illustrations in section 1.4.3 only works when x, x∗ ∈ R.When in general (x, x∗) ∈ X×X∗ we can immitate the general defintion of theFencel conjugate. In such a generalization a nonlinear function c(x, x∗) replace

the role of ⟨x∗, x⟩ just as in Theorem 1.4.5∫ x

0

∫ x∗

0K(s, t)dsdt replacing the

product x∗x. This is a more significant generalization. For this to work oneneeds to first revise the concept of convexity.

Definition 1.4.9 (Generalized convexity) Let Φ be a set of extended real val-ued functions. We say f is Φ− convex if it is the pointwise supremum of asubset of Φ.

Φ− convex functions are closed under supremum. Thus, every function has alargest Φ− convex minorant called its Φ− convex hull. Moreover, if f is Φ−convex then it is coincide with its Φ− convex hull. By setting Φ to be the classof affine functions we get the usual convexity.

Similar to Fenchel conjugate we define:

Definition 1.4.10 (Generalized Fenchel conjugate) Let c be a function onX × Y . We define

f c(y) = supx[c(x, y)− f(x)] and gc

′(x) = sup

y[c(x, y)− g(y)].

They are generalizations of Fenchel conjugate. When the function c isnot symmetric with respect to its two variables, the c and c′ conjugate aredifferent. It is easy to see that the generalized Fenchel conjugate also has theorder reversing property. Define Φc = {c(·, y) − b : y ∈ Y, b ∈ R} and Φc′ ={c(x, ·) − b : x ∈ X, b ∈ R}. Then f c is Φc′− convex and gc

′is Φc− convex.

Next we discuss some basic properties of generalized Fenchel conjugate.

Theorem 1.4.11 (Fenchel inequality and duality) Let f : X → R ∪ {+∞}and g : Y → R ∪ {+∞}. Then

(i) (Fenchel inequality) f c(y) ≥ c(x, y)− f(x), gc′(x) ≥ c(x, y)− g(y),

(ii)(Convex hull) The Φc(c′)− convex hull of f(g) is f cc′(gc′c),

(iii)(Duality) f c = f cc′c, gc′cc′ = gc

′.

Page 35: Lecture notes: Convex Duality and Financial Mathematics

32 1 Lagrange Multipliers and Duality

Proof. (i) follows directly from the definitions.To prove (ii) we observe that by (i) f(x) ≥ c(x, y) − f c(y). Taking sup

over y we get f ≥ f cc′ . On the other hand if for some y, b, f(x) ≥ c(x, y)− bfor all x, then b ≥ c(x, y)− f(x). Taking sup over x we have b ≥ f c(y). Thus,

f(x) ≥ f cc′(x) ≥ c(x, y)− f c(y) ≥ c(x, y)− b

establishing f cc′ as the largest Φc− convex function dominated by f . Theproof that gc

′c is the Φc′− convex hull of g is similar.(iii) follows from (ii) since f c is Φc′− convex and gc

′is Φc− convex. •

Corollary 1.4.12 Let f : X → R ∪ {+∞} and g : Y → R ∪ {+∞}. Thenf cc′ is the Φc− convex hull of f and gc

′c is the Φc′− convex hull of g.

Remark 1.4.13 We see from the discussion about generalized Fenchel conju-gate that what is essential in dealing with conjugate operation is the closednesswith respect to the sup operation. For simple convexity the key link is thata convex function is the sup of all the affine functions it dominates. A factrelies on the fundamental convex separation theorem.

The concept of subdifferential and its relationship with Fenchel conjugatecan also be generalized.

Definition 1.4.14 (Generalized subdifferential) Let c be a function on X ×Y . We say y0(x0) is a c(c′)−subdifferential of f(g) at x0(y0) if

f(x)− c(x, y0)(g(y)− c(x0, y))

attains minimum at x0(y0).

Notation y0 ∈ ∂cf(x0)(x0 ∈ ∂c′g(y0)).

Theorem 1.4.15 (Fenchel equality)

(i) (Fenchel equality) y0 ∈ ∂cf(x0) iff f(x0) + f c(y0) = c(x0, y0).(ii)(Symmetry) y0 ∈ ∂cf

cc′(x0) iff x0 ∈ ∂c′fc(y0).

(iii)(Φ convexity) ∂cf(x0) = ∅ implies that f is Φc convex at x0. On theother hand f is Φc convex at x0 implies that ∂cf(x0) = ∂cf

cc′(x0).

Proof. The argument for proving Fenchel equality applies to (i) with ⟨y0, x0⟩replaced by c(x0, y0). The rest follows from this generalized Fenchel equality.Details are left as an exercise. •

Similar to the usual subdifferential we have

Theorem 1.4.16 (Cyclical monotonicity) Subdifferential ∂cf is c− cycli-cally monotone that is for any m pairs of points yi ∈ ∂cf(xi) we have

(c(x1, y0)− c(x0, y0)) + (c(x2, y1)− c(x1, y1)) +

. . .+ (c(x0, ym)− c(xm, ym)) ≤ 0.

Page 36: Lecture notes: Convex Duality and Financial Mathematics

1.4 Fenchel Conjugate 33

Proof. Adding the following inequalities:

f(x1)− f(x0) ≥ c(x1, y0)− c(x0, y0)

f(x2)− f(x1) ≥ c(x2, y1)− c(x1, y1)

. . . . . .

f(x0)− f(xm) ≥ c(x0, ym)− c(xm, ym).

and noticing all the terms on the left hand side are cancelled. •

Theorem 1.4.17 (Duality and shift reversing) Let fi be a sequence of func-tions and d ∈ R an arbitrary number. Then

(i) (Duality) (inf fi)c = sup f c

i , and(ii)(Shift reversing) (f + d)c = f c − d.

Proof. (ii) follows directly from the definition. We focus on (i) Order revers-ing and inf fi ≤ fi implies that (inf fi)

c ≥ f ci . Taking sup we have

(inf fi)c ≥ sup f c

i .

On the other hand, g = sup f ci ≥ f c

i is Φc− convex and, therefore, g = gc′c.

By order revesing gc′ ≤ f cc′

i ≤ fi. Taking inf we have gc′ ≤ inf fi. Using order

reversing again we have

sup f ci = g = gc

′c ≥ (inf fi)c.

•In fact, (i) and (ii) characterizes c−conjugate with

c(x, y) = ∆(ι{x})(y).

1.4.6 Exercises

Exercise 1.4.1 Prove f** ≤ f. Hint: this follows immediately from the Fenchel–Young inequality (1.4.1).

Exercise 1.4.2 Many important convex functions f on X = R^N equal their biconjugates f**. Such functions thus occur in natural pairs, f and f*. Table 1.1 shows some elegant examples on R, and Table 1.2 describes some simple transformations of these examples. Check the calculation of f* and check f = f** for the functions in Table 1.1. Verify the formulas in Table 1.2. (A brute-force numerical spot-check of sample entries follows Table 1.2 below.)


f(x) = g*(x)          dom f      g(y) = f*(y)                       dom g

0                     R          0                                  {0}
0                     R+         0                                  −R+
0                     [−1, 1]    |y|                                R
0                     [0, 1]     y+                                 R
|x|^p/p  (p > 1)      R          |y|^q/q  (1/p + 1/q = 1)           R
|x|^p/p  (p > 1)      R+         |y+|^q/q  (1/p + 1/q = 1)          R
−x^p/p  (0 < p < 1)   R+         −(−y)^q/q  (1/p + 1/q = 1)         −int R+
−log x                int R+     −1 − log(−y)                       −int R+
e^x                   R          y log y − y (y > 0);  0 (y = 0)    R+

Table 1.1. Conjugate pairs of convex functions on R.

f(x) = g*(x)       g(y) = f*(y)

h(ax)  (a ≠ 0)     h*(y/a)
h(x + b)           h*(y) − by
ah(x)  (a > 0)     ah*(y/a)

Table 1.2. Transformed conjugates.
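The table entries can be spot-checked numerically by brute-forcing f*(y) = sup_x [xy − f(x)] over a grid. A minimal sketch (our illustration; the grid bounds are ad hoc and must contain the maximizer) for the pair f(x) = e^x, f*(y) = y log y − y, and for the scaling rule ah(x) ↔ ah*(y/a) of Table 1.2:

```python
import numpy as np

x = np.linspace(-20.0, 5.0, 200001)     # the maximizer must lie inside this grid

def conj(h, y):
    # brute-force h*(y) = sup_x [x*y - h(x)] over the grid
    return np.max(x * y - h(x))

for y in [0.5, 1.0, 2.0]:               # y > 0 rows of the e^x entry
    print(conj(np.exp, y), y * np.log(y) - y)   # the two numbers agree

a = 3.0                                 # Table 1.2 scaling rule: (a h)*(y) = a h*(y/a)
h = lambda t: a * np.exp(t)
y = 2.0
print(conj(h, y), a * ((y / a) * np.log(y / a) - y / a))
```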

Exercise 1.4.3 Calculate the conjugate and biconjugate of the function

f(x1, x2) =  x1²/(2x2) + x2 log x2 − x2   if x2 > 0,
             0                            if x1 = x2 = 0,
             +∞                           otherwise.

*Exercise 1.4.4 (Maximum Entropy Example)

(i) Let a0, a1, . . . , aN ∈ X. Prove that the function

g(z) := inf_{x∈R^{N+1}} { Σ_{n=0}^N exp*(x_n) : Σ_{n=0}^N x_n = 1, Σ_{n=0}^N x_n a_n = z }

is convex.

(ii) For any point y in X, prove

g*(y) = sup_{x∈R^{N+1}} { Σ_{n=0}^N (x_n ⟨a_n, y⟩ − exp*(x_n)) : Σ_{n=0}^N x_n = 1 }.

(iii) Deduce the conjugacy formula

g*(y) = 1 + ln( Σ_{n=0}^N exp ⟨a_n, y⟩ ).

(iv) Compute the conjugate of the function of x ∈ R^{N+1},

Σ_{n=0}^N exp*(x_n)   if Σ_{n=0}^N x_n = 1,
+∞                    otherwise.

Exercise 1.4.5 Give the details of the proof of Theorem ?? (Fenchel Weak Duality).

Exercise 1.4.6 (Conjugate of Indicator Function) Let X be a reflexive Banach space and let C be a closed convex subset of X. Show that ι_C* = σ_C and ι_C** = ι_C.

Exercise 1.4.7 Let X = R^N. Suppose that A : X → X* is a linear map, C is a convex subset of X and D is a nonempty closed bounded convex subset of X*. Show that

inf_{x∈C} sup_{y∈D} ⟨y, Ax⟩ = max_{y∈D} inf_{x∈C} ⟨y, Ax⟩.

Hint: Apply the Fenchel duality theorem to f = ι_C and g = ι_D*.

∗Exercise 1.4.8 Prove Theorem 1.4.6.

*Exercise 1.4.9 Prove Corollary 1.4.4 (Fenchel Duality for Linear Constraints). Deduce duality theorems for the following separable problems:

inf { Σ_{n=1}^N p(x_n) : Ax = b },

where the map A : R^N → R^M is linear, b ∈ R^M, and the convex function p : R → R ∪ {+∞} is defined as follows:

(i) (Nearest Points in Polyhedra) p(t) = t²/2 with domain R+.
(ii) (Analytic Center) p(t) = − log t with domain int R+.
(iii) (Maximum Entropy) p = exp*.

What happens if the objective function is replaced by Σ_{n=1}^N p_n(x_n)?

Exercise 1.4.10 (Symmetric Fenchel Duality) Let X be a Banach space. For functions f, g : X → [−∞, +∞], define the concave conjugate g_* : X* → [−∞, +∞] by

g_*(x*) = inf_{x∈X} {⟨x*, x⟩ − g(x)}.

Prove that

inf(f − g) ≥ sup(g_* − f*),

with equality if f is lower semicontinuous and convex, g is upper semicontinuous and concave, and

0 ∈ core(dom f − dom(−g)),

or f is convex, g is concave and

dom f ∩ cont g ≠ ∅.

Exercise 1.4.11 Illustrate the generalized weighted Fenchel–Young inequality for N = 3 using computer graphing.

Exercise 1.4.12 Represent the gap in the generalized weighted Fenchel–Young inequality for N = 2. Then generalize.

Exercise 1.4.13 Generalize the inequality in Remark 1.4.8 to 3-dimensional spaces and then to general N-dimensional spaces.

Exercise 1.4.14 Fill in the details of the proof of Theorem 1.4.15.


1.5 Convex duality theory

Using the Fenchel–Young inequality, for each constrained optimization problem we can write its companion dual problem. There are several different but equivalent perspectives. Since the dimensions of the Euclidean spaces involved are usually not essential, we will use X, Y, Z, etc. to denote those Euclidean spaces to simplify notation.

1.5.1 Rockafellar duality

We start with the Rockafellar formulation of the biconjugate. It is very general and, as we shall see, the other perspectives can easily be written as special cases.

Consider a function F(x, y) of two variables. Treating y as a parameter, consider the parameterized optimization problem

v(y) = inf_x F(x, y).     (1.5.1)

Our associated primal optimization problem¹ is

p = v(0) = inf_{x∈X} F(x, 0)     (1.5.2)

and the dual problem is

d = v**(0) = sup_{y*∈Y*} −F*(0, −y*).     (1.5.3)

Since v dominates v**, as the Fenchel–Young inequality establishes, we have

v(0) = p ≥ d = v**(0).

This is called weak duality, and the nonnegative number p − d = v(0) − v**(0) is called the duality gap, which we aspire to make small or zero.
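A concrete, low-tech way to see weak duality is to discretize a specific F and compare the two values. In the sketch below (ours; the choice F(x, y) = x²/2 + |x − y| and all grids are arbitrary), F*(0, −y*) is approximated by a brute-force supremum, and the gap happens to vanish:

```python
import numpy as np

F = lambda x, y: 0.5 * x**2 + np.abs(x - y)   # an arbitrary convex choice of F

xg = np.linspace(-5.0, 5.0, 801)
yg = np.linspace(-5.0, 5.0, 801)
lg = np.linspace(-1.0, 1.0, 201)              # dual grid; F*(0, -y*) = +inf for |y*| > 1 here

p = np.min(F(xg, 0.0))                        # primal value p = v(0)

X, Y = np.meshgrid(xg, yg, indexing="ij")
def F_star(s, t):                             # F*(s, t) = sup_{x,y} [s x + t y - F(x, y)]
    return np.max(s * X + t * Y - F(X, Y))

d = max(-F_star(0.0, -l) for l in lg)         # dual value d = v**(0)
print(p, d)                                   # weak duality: p >= d (equality for this F)
```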

Let F(x, (y, z)) := f(x) + ι_{epi(g)}(x, y) + ι_{graph(h)}(x, z). Then problem P(y, z) becomes problem (1.5.1) with parameters (y, z). On the other hand, we can rewrite (1.5.1) as

v(y) = inf_x {F(x, u) : u = y},

which is problem P(0, y) with x = (x, u), C = X × Y, f(x, u) = F(x, u), h(x, u) = u and g(x, u) = 0. So where we start is a matter of taste and predisposition.

Theorem 1.5.1 (Duality and Lagrange multipliers) The following are equivalent:

¹ The use of the term 'primal' is much more recent than the term 'dual'; it was suggested by George Dantzig's father, Tobias, when linear programming was being developed in the 1940s.


(i) the primal problem has a Lagrange multiplier λ;
(ii) there is no duality gap, i.e., d = p is finite, and the dual problem has solution λ.

Proof. If the primal problem has a Lagrange multiplier λ, then −λ ∈ ∂v(0). By the Fenchel–Young equality,

v(0) + v*(−λ) = ⟨−λ, 0⟩ = 0.

Direct calculation yields

v*(−λ) = sup_y {⟨−λ, y⟩ − v(y)} = sup_{y,x} {⟨−λ, y⟩ − F(x, y)} = F*(0, −λ).

Since

−F*(0, −λ) ≤ v**(0) ≤ v(0) = −v*(−λ) = −F*(0, −λ),     (1.5.4)

λ is a solution to the dual problem and p = v(0) = v**(0) = d.

On the other hand, if v**(0) = v(0) and λ is a solution to the dual problem, then all the quantities in (1.5.4) are equal. In particular,

v(0) + v*(−λ) = 0.

This implies that −λ ∈ ∂v(0), so that λ is a Lagrange multiplier of the primal problem. •

Example 1 (Finite duality gap). Consider

v(y) = inf{ |x2 − 1| : √(x1² + x2²) − x1 ≤ y }.

We can easily calculate

v(y) =  0    if y > 0,
        1    if y = 0,
        +∞   if y < 0,

and v**(0) = 0; i.e., there is a finite duality gap v(0) − v**(0) = 1.

In this example neither the primal nor the dual problem has a Lagrange multiplier, yet both have solutions. Hence, even in two dimensions, the existence of a Lagrange multiplier is only a sufficient condition for the dual to attain a solution and is far from necessary. ♢
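Because v is known in closed form here, the gap can also be reproduced numerically: represent v on a grid, form v* by brute force, and evaluate v**(0) = sup_λ [−v*(λ)]. A minimal sketch (ours; grids are arbitrary, and the answer is only accurate up to grid resolution):

```python
import numpy as np

yg = np.linspace(-2.0, 2.0, 4001)                   # contains y = 0 exactly
v = np.where(yg > 0, 0.0, np.where(yg == 0, 1.0, np.inf))

lg = np.linspace(-50.0, 50.0, 2001)                 # grid of dual slopes
v_star = np.array([np.max(l * yg - v) for l in lg])
v_bistar0 = np.max(-v_star)                         # v**(0) = sup_l [-v*(l)]

print(1.0 - v_bistar0)                              # duality gap v(0) - v**(0), ~1 up to grid error
```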


1.5.2 Fenchel duality

Let us specify F(x, y) := f(x) + g(Ax + y), where A : X → Y is a linear operator. We thence get the Fenchel formulation of duality. Now the primal problem is

p = v(0) = inf_x f(x) + g(Ax).     (1.5.5)

To derive the dual problem we calculate

F*(0, −y*) = sup_{x,y} ⟨−y*, y⟩ − f(x) − g(Ax + y).

Letting u = Ax + y we have

F*(0, −y*) = sup_{x,u} ⟨−y*, u − Ax⟩ − f(x) − g(u)
           = sup_x [⟨y*, Ax⟩ − f(x)] + sup_u [⟨−y*, u⟩ − g(u)]
           = f*(A*y*) + g*(−y*).

Thus, the dual problem is

d = v**(0) = sup_{y*} −f*(A*y*) − g*(−y*).     (1.5.6)

If both f and g are convex functions then it is easy to see that so is

v(y) = inf_x f(x) + g(Ax + y).

It is easy to check that dom v = dom g − A dom f. Now, in the Euclidean setting, a sufficient condition for the existence of Lagrange multipliers for the primal problem, i.e., ∂v(0) ≠ ∅, is

0 ∈ ri dom v = ri[dom g − A dom f].     (1.5.7)

Checking dom v = dom g − A dom f is not hard and is left as an exercise.

Figure 1.1 illustrates the Fenchel duality theorem for f(x) := x²/2 + 1 and g(x) := (x − 1)²/2 + 1/2. The upper function is f and the lower one is −g. The minimum gap occurs at x = 1/2 and equals 7/4.

In infinite dimensions more care is necessary, but in a Banach space, when the functions are lower semicontinuous and the operator is closed,

0 ∈ core dom v = core[dom g − A dom f]     (1.5.8)

suffices.

A condition of the form (1.5.7) or (1.5.8) is often referred to as a constraint qualification or a transversality condition. Enforcing such constraint qualification conditions we can write Theorem 1.5.1 in the following form:


Fig. 1.1. The Fenchel duality sandwich.
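The sandwich in Figure 1.1 can be verified directly. With A the identity, the sketch below (ours; grids arbitrary) computes the primal value (1.5.5) and the dual value (1.5.6) by brute force, and both come out to 7/4:

```python
import numpy as np

xg = np.linspace(-5.0, 5.0, 20001)
f = 0.5 * xg**2 + 1.0
g = 0.5 * (xg - 1.0)**2 + 0.5

p = np.min(f + g)                       # primal (1.5.5) with A = identity

def conj(h, y):                         # h*(y) = sup_x [x*y - h(x)] on the grid
    return np.max(xg * y - h)

yg = np.linspace(-4.0, 4.0, 1601)
d = max(-conj(f, y) - conj(g, -y) for y in yg)   # dual (1.5.6)

print(p, d)                             # both are 7/4, attained at x = 1/2 and y* = 1/2
```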

Theorem 1.5.2 (Duality and constraint qualification) If the convex functions f, g and the linear operator A satisfy the constraint qualification condition (1.5.7) or (1.5.8), then there is a zero duality gap between the primal and dual problems (1.5.5) and (1.5.6), and the dual problem has a solution.

A really illustrative example is the application to entropy optimization.

Example 2. (Entropy optimization problem) Entropy maximization refers to

minimize f(x) subject to Ax = b ∈ R^N,     (1.5.9)

with the lower semicontinuous convex function f, defined on a Banach space of signals, emulating the negative of an entropy, and A emulating a finite number of continuous linear constraints representing conditions on some given moments. A wide variety of applications can be covered by this model due to its physical relevance.

Applying Theorem 1.5.2 with g = ι_{b} we have: if b ∈ ri(A dom f) then

inf_{x∈X} {f(x) : Ax = b} = max_{ϕ∈R^N} {⟨ϕ, b⟩ − f*(A*ϕ)}.     (1.5.10)

When N < dim X (often infinite) the dual problem is typically much easier to solve than the primal. ♢

Example 3. (Boltzmann–Shannon entropy in Euclidean space) Let

f(x) := Σ_{n=1}^N p(x_n),     (1.5.11)

where

p(t) :=  t ln t − t   if t > 0,
         0            if t = 0,
         +∞           if t < 0.

The functions p and f defined above are (negatives of) Boltzmann–Shannon entropy functions on R and R^N, respectively. For c ∈ R^N, b ∈ R^M and a linear mapping A : R^N → R^M consider the entropy optimization problem

minimize {f(x) + ⟨c, x⟩ : Ax = b}.     (1.5.12)

Example 2 can help us conveniently derive an explicit formula for solutions of (1.5.12) in terms of the solution to its dual problem.

First we note that the sublevel sets of the objective function are compact, thus ensuring the existence of solutions to problem (1.5.12). We can also see by direct calculation that, at any boundary point x of dom f = R^N_+ (the domain of the cost function), the directional derivative of the cost function in the direction z − x, for z interior to R^N_+, is −∞. Thus, any solution of (1.5.12) must be in the interior of R^N_+. Since the cost function is strictly convex on int(R^N_+), the solution is unique.

Let us denote this unique solution of (1.5.12) by x̄. Then the duality result in Example 2 implies that

f(x̄) + ⟨c, x̄⟩ = inf_{x∈R^N} {f(x) + ⟨c, x⟩ : Ax = b} = max_{ϕ∈R^M} {⟨ϕ, b⟩ − (f + c)*(A^⊤ϕ)}.

Now let ϕ̄ be a solution to the dual problem, i.e., a Lagrange multiplier for the constrained minimization problem (1.5.12). We have

f(x̄) + ⟨c, x̄⟩ + (f + c)*(A^⊤ϕ̄) = ⟨ϕ̄, b⟩ = ⟨ϕ̄, Ax̄⟩ = ⟨A^⊤ϕ̄, x̄⟩.

It follows from the Fenchel–Young equality that A^⊤ϕ̄ ∈ ∂(f + c)(x̄). Since x̄ ∈ int(R^N_+), where f is differentiable, we have A^⊤ϕ̄ = f′(x̄) + c. Explicit computation shows that x̄ = (x̄1, . . . , x̄N) is determined by

x̄_n = exp(A^⊤ϕ̄ − c)_n,  n = 1, . . . , N.     (1.5.13)

Indeed, we can use the existence of the dual solution to prove that the primal problem has the given solution without direct appeal to compactness: we deduce the existence of the primal solution from the duality theory. ♢


We say a function f is polyhedral if its epigraph is polyhedral, i.e., a finite intersection of closed half-spaces. When both f and g are polyhedral, the constraint qualification condition (1.5.7) simplifies to

dom g ∩ A dom f ≠ ∅.     (1.5.14)

This is very useful in dealing with polyhedral cone programming and, in particular, linear programming problems. One can also similarly handle a subset of polyhedral constraints [3].

Example 4 (Abstract linear programming). Consider the cone programming problem (??), whose dual is

sup{⟨b, ϕ⟩ : A*ϕ + c ∈ K^+}.     (1.5.15)

Assuming that K is polyhedral, if b ∈ AK (i.e., the primal problem is feasible) then there is no duality gap and the dual optimal value is attained when finite. Symmetrically, if −c ∈ A*X* − K^+ (i.e., the dual problem is feasible) then there is no duality gap and the primal optimal value is attained when finite.

1.5.3 Lagrange duality

For problem (1.1.1) define the Lagrangian

L(λ, x; (y, z)) = f(x) + ⟨λ, (g(x) − y, h(x) − z)⟩.

Then

sup_{λ∈Y*_+×Z*} L(λ, x; (y, z)) =  f(x)   if g(x) ≤ y and h(x) = z,
                                   +∞     otherwise.

Thus problem (1.1.1) can be written as

p = v(0) = inf_{x∈C} sup_{λ∈Y*_+×Z*} L(λ, x; 0).     (1.5.16)

We can calculate

v*(−λ) = sup_{y,z} [⟨−λ, (y, z)⟩ − v(y, z)]
       = sup_{y,z} [⟨−λ, (y, z)⟩ − inf_{x∈C} {f(x) : g(x) ≤ y, h(x) = z}]
       = sup_{x∈C,y,z} [⟨−λ, (y, z)⟩ − f(x) : g(x) ≤ y, h(x) = z].

Letting ξ = y − g(x) ∈ K, we can rewrite the expression above as

v*(−λ) = sup_{x∈C,ξ∈K} [⟨−λ, (g(x), h(x))⟩ − f(x) + ⟨−λ, (ξ, 0)⟩]
       = − inf_{x∈C,ξ∈K} [L(λ, x; 0) + ⟨λ, (ξ, 0)⟩]
       =  − inf_{x∈C} L(λ, x; 0)   if λ ∈ Y*_+ × Z*,
          +∞                       otherwise.

Thus, the dual problem is

d = v**(0) = sup_λ −v*(−λ) = sup_{λ∈Y*_+×Z*} inf_{x∈C} L(λ, x; 0).     (1.5.17)

We can see that the weak duality inequality v(0) ≥ v**(0) is simply the familiar fact that

inf sup ≥ sup inf.

Example 5. (Classical linear programming duality) Consider the linear programming problem

min ⟨c, x⟩     (1.5.18)
subject to Ax ≤ b,

where x ∈ R^N, b ∈ R^M, A is an M × N matrix and ≤ is the partial order generated by the cone R^M_+. Then by Lagrange duality the dual problem is

max −⟨b, ϕ⟩     (1.5.19)
subject to A*ϕ = −c, ϕ ≥ 0.

Clearly all the functions involved here are polyhedral. Applying the polyhedral cone duality results of Example 4 we can conclude that if either the primal problem or the dual problem is feasible then there is no duality gap. Moreover, when the common optimal value is finite, both problems have optimal solutions. ♢

The hard work in Example 5 was hidden in establishing that the constraint qualification (1.5.14) is sufficient, but unlike many applied developments we have rigorously recaptured linear programming duality within our framework.
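As a numerical companion to Example 5, one can solve (1.5.18) and its dual (1.5.19) with an off-the-shelf LP solver and watch the optimal values coincide. A minimal sketch using scipy.optimize.linprog (ours; the data are random, constructed so that both problems are feasible and hence the common value is finite):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
M, N = 8, 3
A = rng.normal(size=(M, N))
b = A @ rng.normal(size=N) + rng.uniform(0.1, 1.0, size=M)  # A x0 < b: primal feasible
c = -A.T @ rng.uniform(0.1, 1.0, size=M)                    # c = -A^T phi0 with phi0 > 0: dual feasible

# primal (1.5.18): min <c, x> subject to Ax <= b, x free
primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * N)

# dual (1.5.19): max -<b, phi> subject to A^T phi = -c, phi >= 0
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * M)

print(primal.fun, -dual.fun)   # equal optimal values: no duality gap
```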

1.5.4 Exercises

Exercise 1.5.1 Let f and g be convex functions and A a linear mapping. Show that v(y) = inf_x f(x) + g(Ax + y) is a convex function.

Exercise 1.5.2 Let v(y) = inf_x f(x) + g(Ax + y). Show that dom v = dom g − A dom f.


References

1. J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, 3. Springer-Verlag, New York, 2000.

2. J. M. Borwein and Q. J. Zhu. Variational methods in convex analysis. JOGO, to appear.

3. J. M. Borwein and Q. J. Zhu. Techniques of Variational Analysis. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, 20. Springer-Verlag, New York, 2005.

4. W. Fenchel. On conjugate convex functions. Canad. J. Math., 1:73–77, 1949.

5. Yu. S. Ledyaev and Q. J. Zhu. Implicit multifunction theorems. Set-Valued Anal., 7:209–238, 1999.

6. B. N. Pshenichnyi. Necessary Conditions for an Extremum. Pure and Applied Mathematics, Vol. 4. Translated from the Russian by Karol Makowski; translation edited by Lucien W. Neustadt. Marcel Dekker, New York, 1971.

7. R. T. Rockafellar. Convex Analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton, N.J., 1970.