
Optimization for Machine Learning

Lecture 4: Optimality conditions

6.881: MIT

Suvrit Sra
Massachusetts Institute of Technology

25 Feb, 2021


Optimality (Local and global optima)


Optimality

Def. A point x∗ ∈ X is locally optimal if f(x∗) ≤ f(x) for all x in a neighborhood of x∗. Global if f(x∗) ≤ f(x) for all x ∈ X.

Theorem. For convex f, a locally optimal point is also global.

▶ Let x∗ be a local minimizer of f on X; let y ∈ X be any other feasible point.

▶ We need to show that f(y) ≥ f(x∗) = p∗.

▶ X is convex, so we have x_θ = θy + (1 − θ)x∗ ∈ X for θ ∈ (0, 1)

▶ Since f is convex, and x∗, y ∈ dom f, we have

f(x_θ) ≤ θf(y) + (1 − θ)f(x∗)
f(x_θ) − f(x∗) ≤ θ(f(y) − f(x∗)).

▶ Since x∗ is a local minimizer, for small enough θ > 0, the lhs is ≥ 0.
▶ So the rhs is also nonnegative, proving f(y) ≥ f(x∗) as desired.
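
As a quick numerical illustration (a minimal sketch; the quadratic objective, step size, and iteration count below are assumptions chosen for the demo), gradient descent on a convex quadratic reaches essentially the same objective value from many random starting points, which is exactly what "local = global" predicts.

```python
import numpy as np

# Sketch (illustrative assumption): f(x) = 0.5 * x^T A x - b^T x with A
# positive definite is convex, so every local minimum is global.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5 * np.eye(5)          # positive definite => f is convex
b = rng.normal(size=5)

def grad(x):
    return A @ x - b                  # gradient of f

step = 1.0 / np.linalg.norm(A, 2)     # step 1/L, L = largest eigenvalue of A
finals = []
for _ in range(10):                   # many random starts
    x = rng.normal(size=5) * 10
    for _ in range(2000):
        x = x - step * grad(x)
    finals.append(0.5 * x @ A @ x - b @ x)

print(np.ptp(finals))                 # spread of final values ~ 0: same optimum
```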


Set of Optimal Solutions

The set of optimal solutions X∗ may be empty

Example. If X = ∅, i.e., there are no feasible solutions, then X∗ = ∅

Example. When only an inf but no min exists, e.g., inf e^x = 0 is not attained (e^x → 0 as x → −∞). In general, we should worry about the question "Is X∗ = ∅?"

Exercise: Verify that X∗ is always a convex set.


Optimality conditions (Recognizing optima)


First-order conditions: unconstrained

Theorem. Let f : Rⁿ → R be continuously differentiable on an open set S containing x∗, a local minimum. Then ∇f(x∗) = 0.

Proof: Consider the function g(t) = f(x∗ + td), where d ∈ Rⁿ, t > 0. Since x∗ is a local min, for small enough t, f(x∗ + td) ≥ f(x∗). Hence

0 ≤ lim_{t↓0} [f(x∗ + td) − f(x∗)] / t = dg(0)/dt = 〈∇f(x∗), d〉.

Similarly, using −d it follows that 〈∇f(x∗), d〉 ≤ 0, so 〈∇f(x∗), d〉 = 0 must hold. Since d is arbitrary, ∇f(x∗) = 0.

Exercise: Prove that if f is convex, then ∇f(x∗) = 0 is actually sufficient for global optimality! For general f this is not true. (This property is what makes convex optimization special!)
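
As a sanity check of the theorem (a minimal sketch; the quadratic objective below is an assumption chosen so the minimizer is known in closed form), the gradient at the unconstrained minimizer is numerically zero, while at a nearby point it is not:

```python
import numpy as np

# Sketch: for f(x) = 0.5 * x^T A x - b^T x with A positive definite,
# the unique (global) minimizer solves A x = b, and grad f(x*) = 0 there.
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)
b = rng.normal(size=4)

grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)             # stationary point
print(np.linalg.norm(grad(x_star)))        # ~ 1e-15: first-order condition holds
print(np.linalg.norm(grad(x_star + 0.1)))  # clearly nonzero away from x*
```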


First-order conditions: constrained

♠ For convex f, we have f(y) ≥ f(x) + 〈∇f(x), y − x〉.

♠ Thus, x∗ is optimal if and only if

〈∇f(x∗), y − x∗〉 ≥ 0, for all y ∈ X.

♠ If X = Rⁿ, this reduces to ∇f(x∗) = 0

[Figure: the feasible set X with the optimal point x∗ on its boundary and the gradient ∇f(x∗) at x∗.]

♠ If ∇f(x∗) ≠ 0, it defines a supporting hyperplane to X at x∗
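
A quick numerical sketch of this condition (the box constraint, objective, and projected-gradient solver below are illustrative assumptions): solve min ½‖x − c‖² over the box [0,1]ⁿ by projected gradient, then check 〈∇f(x∗), y − x∗〉 ≥ 0 over many random feasible y.

```python
import numpy as np

# Sketch: minimize f(x) = 0.5 * ||x - c||^2 over the box [0, 1]^n
# with projected gradient, then test the first-order condition
# <grad f(x*), y - x*> >= 0 for random feasible y.
rng = np.random.default_rng(2)
n = 5
c = rng.normal(size=n) * 2             # some targets lie outside the box

grad = lambda x: x - c
proj = lambda x: np.clip(x, 0.0, 1.0)  # projection onto the box

x = np.zeros(n)
for _ in range(500):
    x = proj(x - 0.5 * grad(x))        # projected gradient step

ys = rng.uniform(0.0, 1.0, size=(1000, n))   # random feasible points
print(np.min((ys - x) @ grad(x)))      # >= -1e-8 at (near-)optimality
```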


First-order conditions: constrained

▶ Let f be continuously differentiable, possibly nonconvex
▶ Suppose ∃y ∈ X such that 〈∇f(x∗), y − x∗〉 < 0
▶ Using the mean-value theorem of calculus, ∃ξ ∈ [0, 1] s.t.

f(x∗ + t(y − x∗)) = f(x∗) + 〈∇f(x∗ + ξt(y − x∗)), t(y − x∗)〉
(we applied the MVT to g(t) := f(x∗ + t(y − x∗)))

▶ For sufficiently small t, since ∇f is continuous, by the assumption on y, 〈∇f(x∗ + ξt(y − x∗)), y − x∗〉 < 0
▶ This in turn implies that f(x∗ + t(y − x∗)) < f(x∗)
▶ Since X is convex, x∗ + t(y − x∗) ∈ X is also feasible
▶ Contradiction to local optimality of x∗


Optimality without differentiability

Theorem. (Fermat's rule): Let f : Rⁿ → (−∞,+∞]. Then,

Argmin f = zer(∂f) := {x ∈ Rⁿ | 0 ∈ ∂f(x)}.

Proof: x ∈ Argmin f implies that f(x) ≤ f(y) for all y ∈ Rⁿ. Equivalently, f(y) ≥ f(x) + 〈0, y − x〉 for all y, which holds ⇔ 0 ∈ ∂f(x).

Nonsmooth problem:

min_x f(x) s.t. x ∈ X   ⇐⇒   min_x f(x) + 1_X(x).
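
Fermat's rule already yields useful closed forms. A minimal sketch (the ℓ1-regularized scalar problem below is an assumption chosen for illustration): for min_x ½(x − y)² + λ|x|, the condition 0 ∈ ∂f(x) gives the soft-thresholding operator, which we can verify against a brute-force grid search.

```python
import numpy as np

def soft_threshold(y, lam):
    """Solution of min_x 0.5*(x - y)**2 + lam*|x|, obtained from Fermat's
    rule: 0 in (x - y) + lam * d|x|."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# brute-force check on a fine grid
y, lam = 1.3, 0.5
grid = np.linspace(-3, 3, 200001)
obj = 0.5 * (grid - y) ** 2 + lam * np.abs(grid)
print(soft_threshold(y, lam), grid[np.argmin(obj)])   # both ~ 0.8
```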


Optimality – nonsmooth

▶ A minimizing x must satisfy: 0 ∈ ∂(f + 1_X)(x)

▶ (CQ) Assuming ri(dom f) ∩ ri(X) ≠ ∅, 0 ∈ ∂f(x) + ∂1_X(x)

▶ Recall, g ∈ ∂1_X(x) iff 1_X(y) ≥ 1_X(x) + 〈g, y − x〉 for all y.
▶ So g ∈ ∂1_X(x) means x ∈ X and 0 ≥ 〈g, y − x〉 ∀y ∈ X.
▶ Subdifferential of the indicator 1_X(x), aka normal cone:

N_X(x) := {g ∈ Rⁿ | 0 ≥ 〈g, y − x〉 ∀y ∈ X}

Application

min f(x) + 1_X(x).

♦ If f is diff., we get 0 ∈ ∇f(x∗) + N_X(x∗)
♦ −∇f(x∗) ∈ N_X(x∗) ⇐⇒ 〈∇f(x∗), y − x∗〉 ≥ 0 for all y ∈ X.
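
To make the normal cone concrete, here is a small sketch (the box X = [0,1]ⁿ and the particular vector g are assumptions for the demo): for the box, a vector g lies in N_X(x) exactly when g_i ≥ 0 at coordinates with x_i = 1, g_i ≤ 0 where x_i = 0, and g_i = 0 elsewhere; the defining inequality can be spot-checked by sampling feasible y.

```python
import numpy as np

# Sketch: normal cone of the box X = [0,1]^n at x, checked by sampling.
rng = np.random.default_rng(3)
n = 4
x = np.array([0.0, 1.0, 0.3, 1.0])           # a feasible point
g = np.array([-2.0, 1.5, 0.0, 0.7])          # g_i <= 0 where x_i = 0,
                                              # g_i >= 0 where x_i = 1,
                                              # g_i = 0 where 0 < x_i < 1
ys = rng.uniform(0.0, 1.0, size=(2000, n))    # random feasible y
print(np.max((ys - x) @ g))                   # <= 0, so <g, y - x> <= 0 holds
```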


Example

min f(x)  s.t.  ‖x‖ ≤ 1.

A point x is optimal if and only if

x ∈ dom f, ‖x‖ ≤ 1, and ∀y s.t. ‖y‖ ≤ 1: ∇f(x)ᵀ(y − x) ≥ 0.

In other words

∀ ‖y‖ ≤ 1: ∇f(x)ᵀy ≥ ∇f(x)ᵀx
∀ ‖y‖ ≤ 1: −∇f(x)ᵀy ≤ −∇f(x)ᵀx
sup{−∇f(x)ᵀy | ‖y‖ ≤ 1} ≤ −∇f(x)ᵀx
‖−∇f(x)‖∗ ≤ −∇f(x)ᵀx
‖∇f(x)‖∗ ≤ −∇f(x)ᵀx.

Observe: If the constraint is satisfied strictly at the optimum (‖x‖ < 1), then ∇f(x) = 0 (else we'd violate the last inequality above).
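
As a numerical sketch for the ℓ2 ball (the least-squares objective and projected-gradient solver are assumptions; recall that for the ℓ2 norm the dual norm is again the ℓ2 norm), we can solve the problem approximately and check the derived condition ‖∇f(x)‖∗ ≤ −∇f(x)ᵀx:

```python
import numpy as np

# Sketch: min 0.5*||A x - b||^2  s.t.  ||x||_2 <= 1, via projected gradient.
rng = np.random.default_rng(4)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20) * 5            # chosen large so the constraint binds

grad = lambda x: A.T @ (A @ x - b)
proj = lambda x: x / max(1.0, np.linalg.norm(x))   # projection onto unit l2 ball
step = 1.0 / np.linalg.norm(A, 2) ** 2

x = np.zeros(5)
for _ in range(5000):
    x = proj(x - step * grad(x))

g = grad(x)
# For the l2 norm the dual norm is the l2 norm; at optimality
# ||grad f(x)||_2 <= -grad f(x)^T x (equality when the constraint is active).
print(np.linalg.norm(g), -g @ x)       # the two numbers nearly coincide
```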


Optimality conditions (KKT and friends)


Optimality conditions via Lagrangian

min f(x)  s.t.  f_i(x) ≤ 0, i = 1, . . . , m.

▶ Recall: 〈∇f(x∗), x − x∗〉 ≥ 0 for all feasible x ∈ X
▶ Can we simplify this using the Lagrangian?
▶ g(λ) = inf_x ( L(x, λ) := f(x) + ∑_i λ_i f_i(x) )

Assume strong duality and that p∗, d∗ are attained!

Thus, there exists a pair (x∗, λ∗) such that

p∗ = f(x∗) = d∗ = g(λ∗) = min_x L(x, λ∗) ≤ L(x∗, λ∗) ≤ f(x∗) = p∗

▶ Thus, equalities hold in the above chain, and

x∗ ∈ argmin_x L(x, λ∗).


Optimality conditions via Lagrangian

x∗ ∈ argmin_x L(x, λ∗). If f, f_1, . . . , f_m are differentiable, this implies

∇_x L(x, λ∗)|_{x=x∗} = ∇f(x∗) + ∑_i λ∗_i ∇f_i(x∗) = 0.

Moreover, since L(x∗, λ∗) = f(x∗), we also have

∑_i λ∗_i f_i(x∗) = 0.

But λ∗_i ≥ 0 and f_i(x∗) ≤ 0, so complementary slackness

λ∗_i f_i(x∗) = 0, i = 1, . . . , m.
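
A tiny worked instance (the one-dimensional problem below is an assumption for illustration, not from the slides): for min (x − 2)² s.t. x − 1 ≤ 0, the pair x∗ = 1, λ∗ = 2 satisfies stationarity and complementary slackness, and f(x∗) = g(λ∗) = 1, so the chain of equalities above can be checked directly.

```python
import numpy as np

# Sketch: min (x-2)^2  s.t.  f1(x) = x - 1 <= 0.
f  = lambda x: (x - 2.0) ** 2
f1 = lambda x: x - 1.0
L  = lambda x, lam: f(x) + lam * f1(x)

x_star, lam_star = 1.0, 2.0
print(2 * (x_star - 2) + lam_star)        # stationarity: dL/dx = 0
print(lam_star * f1(x_star))              # complementary slackness: 0
xs = np.linspace(-5, 5, 100001)
g_lam = np.min(L(xs, lam_star))           # dual value g(lambda*) = min_x L
print(f(x_star), g_lam)                   # both ~ 1: strong duality holds here
```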


KKT conditions

Karush-Kuhn-Tucker Conditions (KKT)

f_i(x∗) ≤ 0, i = 1, . . . , m   (primal feasibility)
λ∗_i ≥ 0, i = 1, . . . , m   (dual feasibility)
λ∗_i f_i(x∗) = 0, i = 1, . . . , m   (compl. slackness)
∇_x L(x, λ∗)|_{x=x∗} = 0   (Lagrangian stationarity)

▶ Thus, if strong duality holds and (x∗, λ∗) exists, then the KKT conditions are necessary for the pair (x∗, λ∗) to be optimal

▶ If the problem is convex, then KKT is also sufficient

Exercise: Prove the above sufficiency of KKT.
Hint: Use that L(x, λ∗) is convex, and conclude from the KKT conditions that g(λ∗) = f_0(x∗), so that (x∗, λ∗) is an optimal primal-dual pair.

Read Ch. 5 of BV
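
A small reusable sketch (the function names and tolerance handling are assumptions): given callables for ∇f, the f_i and ∇f_i, and a candidate pair (x, λ), compute the four KKT residuals above and report the largest violation of each.

```python
import numpy as np

def kkt_residuals(grad_f, cons, grad_cons, x, lam):
    """Four KKT residuals for min f(x) s.t. f_i(x) <= 0.
    cons / grad_cons are lists of callables; lam is the multiplier vector."""
    fi = np.array([c(x) for c in cons])
    primal = max(0.0, fi.max())                       # primal feasibility
    dual = max(0.0, (-lam).max())                     # dual feasibility
    slack = np.abs(lam * fi).max()                    # complementary slackness
    stat = grad_f(x) + sum(l * g(x) for l, g in zip(lam, grad_cons))
    return primal, dual, slack, np.linalg.norm(stat)  # Lagrangian stationarity

# usage on the toy problem min (x-2)^2 s.t. x - 1 <= 0 with x* = 1, lam* = 2
res = kkt_residuals(lambda x: 2 * (x - 2),
                    [lambda x: x - 1.0],
                    [lambda x: 1.0],
                    np.array([1.0]), np.array([2.0]))
print(res)   # all four residuals ~ 0
```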


Examples


Projection onto a hyperplane

min_x ½‖x − y‖², s.t. aᵀx = b.

KKT Conditions

L(x, ν) = ½‖x − y‖² + ν(aᵀx − b)

∂L/∂x = x − y + νa = 0
x = y − νa
aᵀx = aᵀy − ν aᵀa
‖a‖² ν = aᵀy − b
x = y − ((aᵀy − b)/‖a‖²) a
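
The closed form above translates directly into code; a minimal sketch (names are illustrative) with a feasibility check:

```python
import numpy as np

def project_to_hyperplane(y, a, b):
    """Euclidean projection of y onto {x : a^T x = b}, from the KKT solution."""
    return y - ((a @ y - b) / (a @ a)) * a

rng = np.random.default_rng(5)
a, y = rng.normal(size=4), rng.normal(size=4)
b = 2.0
x = project_to_hyperplane(y, a, b)
print(a @ x - b)                 # ~ 0: x is feasible
print(np.linalg.norm(x - y))     # distance from y to the hyperplane
```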


Projection onto simplex

min_x ½‖x − y‖², s.t. xᵀ1 = 1, x ≥ 0.

KKT Conditions

L(x, λ, ν) = ½‖x − y‖² − ∑_i λ_i x_i + ν(xᵀ1 − 1)

∂L/∂x_i = x_i − y_i − λ_i + ν = 0
λ_i x_i = 0
λ_i ≥ 0
xᵀ1 = 1, x ≥ 0

Challenge A. Solve the above conditions in O(n log n) time.

Challenge A+. Solve the above conditions in O(n) time.
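
One standard way to resolve this KKT system in O(n log n) time (a sketch of the well-known sort-and-threshold approach; it is one solution to Challenge A, not the only one) is to sort y, find the active set, and compute the threshold that plays the role of ν:

```python
import numpy as np

def project_to_simplex(y):
    """Euclidean projection of y onto {x : x >= 0, sum(x) = 1}.
    Sort-and-threshold method; tau below plays the role of nu in the KKT system."""
    u = np.sort(y)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)                        # cumulative sums
    j = np.arange(1, len(y) + 1)
    rho = np.max(np.where(u - (css - 1.0) / j > 0)[0]) + 1
    tau = (css[rho - 1] - 1.0) / rho
    return np.maximum(y - tau, 0.0)

y = np.array([0.8, 1.4, -0.3, 0.1])
x = project_to_simplex(y)
print(x, x.sum())   # [0.2, 0.8, 0, 0], sums to 1
```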


Total variation minimization

min ½‖x − y‖² + λ ∑_i |x_{i+1} − x_i|,
equivalently min ½‖x − y‖² + λ‖Dx‖₁,

(the matrix D is also known as a differencing matrix).

Step 1. Take the dual (recall from L3-25) to obtain:

min_u ½‖Dᵀu‖² − uᵀDy, s.t. ‖u‖∞ ≤ λ.

Step 2. Replace the obj by ‖Dᵀu − y‖² (argmin is unchanged)
Step 3. Add dummies u_0 = u_n = 0; write s = r − u, where r is the vector of partial sums r_i = y_1 + · · · + y_i

min_s ∑_{i=1}^{n} (s_{i−1} − s_i)², s.t. ‖s − r‖∞ ≤ λ, s_0 = 0, s_n = r_n.

Step 4 (Challenge). Look at the KKT conditions, and keep working . . . finally, obtain an O(n) method!
For the full story look at: A. Barbero, S. Sra. "Modular proximal optimization for multidimensional total-variation regularization" (JMLR 2019, pp. 1–82)
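
To fix notation concretely, a minimal sketch (the matrix constructor and helper names are assumptions) that builds the differencing matrix D and confirms that λ‖Dx‖₁ equals the sum-of-absolute-differences penalty in the first formulation:

```python
import numpy as np

def diff_matrix(n):
    """(n-1) x n differencing matrix: (D x)_i = x_{i+1} - x_i."""
    D = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

def tv_objective(x, y, lam):
    return 0.5 * np.sum((x - y) ** 2) + lam * np.sum(np.abs(np.diff(x)))

n, lam = 6, 0.7
rng = np.random.default_rng(6)
x, y = rng.normal(size=n), rng.normal(size=n)
D = diff_matrix(n)
# the two penalty forms agree
assert np.isclose(np.sum(np.abs(D @ x)), np.sum(np.abs(np.diff(x))))
print(tv_objective(x, y, lam))
```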

Suvrit Sra ([email protected]) 6.881 Optimization for Machine Learning (2/25/21; Lecture 4) 19



Nonsmooth KKT (via subdifferentials)

Suvrit Sra ([email protected]) 6.881 Optimization for Machine Learning (2/25/21; Lecture 4) 20


KKT via subdifferentials?

Assume all f_i(x) are finite valued, and dom f = ℝ^n.

min_{x ∈ ℝ^n}  f(x)   s.t.  f_i(x) ≤ 0,  i ∈ [m].

Assume Slater’s condition: ∃ x such that f_i(x) < 0 for i ∈ [m].

Write C_i := {x | f_i(x) ≤ 0}. Then, the above problem becomes

min_x  φ(x) := f(x) + 1_{C_1}(x) + · · · + 1_{C_m}(x).

An optimal solution to this problem is a vector x̄ such that

0 ∈ ∂φ(x̄).

Slater’s condition tells us that  int C_1 ∩ · · · ∩ int C_m ≠ ∅.

Exercise: Rigorously justify the above (Hint: use continuity of f_i)

Suvrit Sra ([email protected]) 6.881 Optimization for Machine Learning (2/25/21; Lecture 4) 21



KKT via subdifferentials?

Since int C_1 ∩ · · · ∩ int C_m ≠ ∅, Rockafellar’s theorem tells us

∂φ(x) = ∂f(x) + ∂1_{C_1}(x) + · · · + ∂1_{C_m}(x).

Recall: ∂1_{C_i} = N_{C_i} (normal cone). Verify (Challenge) that

N_{C_i}(x) =  ⋃ {λ_i ∂f_i(x) | λ_i ≥ 0},   if f_i(x) = 0,
              {0},                          if f_i(x) < 0,
              ∅,                            if f_i(x) > 0.

Thus, ∂φ(x) ≠ ∅ iff x satisfies f_i(x) ≤ 0 for all i ∈ [m].
(Verify: the Minkowski sum A + ∅ = ∅.)

Suvrit Sra ([email protected]) 6.881 Optimization for Machine Learning (2/25/21; Lecture 4) 22
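Example (sanity check). Take a single constraint f_1(x) = x on ℝ, so C_1 = (−∞, 0] and ∂f_1(x) = {1} everywhere. The formula above gives N_{C_1}(0) = ⋃ {λ_1 · 1 | λ_1 ≥ 0} = [0, ∞), N_{C_1}(x) = {0} for x < 0, and N_{C_1}(x) = ∅ for x > 0, which is exactly the normal cone of the half-line.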



KKT via subdifferentials?

Thus, ∂φ(x) = ⋃ {∂f(x) + λ_1∂f_1(x) + · · · + λ_m∂f_m(x)}, over all choices of λ_i ≥ 0 such that λ_i f_i(x) = 0.

If f_i(x) < 0, then ∂1_{C_i}(x) = {0}, while for f_i(x) = 0, ∂1_{C_i}(x) = {λ_i∂f_i(x) | λ_i ≥ 0}; and we cannot jointly have λ_i ≥ 0 and f_i(x) > 0 (there ∂1_{C_i}(x) = ∅).

In other words, 0 ∈ ∂φ(x) iff there exist λ_1, . . . , λ_m that satisfy the KKT conditions.

Exercise: Double-check the above for differentiable f, f_i.

Suvrit Sra ([email protected]) 6.881 Optimization for Machine Learning (2/25/21; Lecture 4) 23
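Example (sanity check). Consider min |x| s.t. 1 − x ≤ 0, i.e., f(x) = |x| and f_1(x) = 1 − x (Slater holds at x = 2). At x̄ = 1 we have ∂f(x̄) = {1} and ∂f_1(x̄) = {−1}, so λ_1 = 1 gives 0 ∈ ∂f(x̄) + λ_1∂f_1(x̄) with λ_1 ≥ 0 and λ_1 f_1(x̄) = 0; hence 0 ∈ ∂φ(x̄), and x̄ = 1 is indeed the minimizer.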



Example: Constrained regression

min_x  (1/2)‖Ax − b‖²,   s.t.  ‖x‖ ≤ θ.

KKT Conditions

L(x, λ) = (1/2)‖Ax − b‖² + λ(‖x‖ − θ)

0 ∈ A^T(Ax − b) + λ ∂‖x‖

∂‖x‖ =  ‖x‖⁻¹ x,        if x ≠ 0,
        {z | ‖z‖ ≤ 1},   if x = 0.

Hmmm...?

▶ Case (i). Let x ← pinv(A) b; if ‖x‖ < θ, then x∗ = x.

▶ Case (ii). If ‖x‖ ≥ θ, then ‖x∗‖ = θ. Thus, consider instead (1/2)‖Ax − b‖² s.t. ‖x‖² = θ². (Exercise: complete the idea.)

Suvrit Sra ([email protected]) 6.881 Optimization for Machine Learning (2/25/21; Lecture 4) 24
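A small NumPy sketch of the two-case recipe above. Case (ii) is completed here by bisection on the ridge parameter μ = λ/θ (one way to finish the exercise, not necessarily the intended one); the function name and tolerance are illustrative.

import numpy as np

def constrained_ls(A, b, theta, tol=1e-10):
    # Case (i): if the least-squares solution already satisfies ||x|| < theta,
    # the constraint is inactive and that solution is optimal.
    x = np.linalg.pinv(A) @ b
    if np.linalg.norm(x) < theta:
        return x
    # Case (ii): the constraint is active, ||x*|| = theta. Stationarity with
    # ||x|| = theta reads (A^T A + mu I) x = A^T b with mu = lambda / theta >= 0;
    # ||x(mu)|| is nonincreasing in mu, so bisection locates the right mu.
    AtA, Atb = A.T @ A, A.T @ b
    x_mu = lambda mu: np.linalg.solve(AtA + mu * np.eye(A.shape[1]), Atb)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(x_mu(hi)) > theta:       # grow hi until it shrinks x enough
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(x_mu(mid)) > theta else (lo, mid)
    return x_mu(hi)

# example usage
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
x = constrained_ls(A, b, theta=0.5)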
