Optimization for Machine Learning · optml.mit.edu/teach/6881/lect4.pdf · 2021-02-26
TRANSCRIPT
Optimization for Machine Learning
Lecture 4: Optimality conditions
6.881: MIT
Suvrit SraMassachusetts Institute of Technology
25 Feb, 2021
Optimality (Local and global optima)
Optimality
Def. A point x∗ ∈ X is locally optimal if f (x∗) ≤ f (x) for all x in a neighborhood of x∗. Global if f (x∗) ≤ f (x) for all x ∈ X.

Theorem. For convex f , a locally optimal point is also global.

- Let x∗ be a local minimizer of f on X; let y ∈ X be any other feasible point.
- We need to show that f (y) ≥ f (x∗) = p∗.
- X is convex, so we have xθ = θy + (1− θ)x∗ ∈ X for θ ∈ (0, 1)
- Since f is convex and x∗, y ∈ dom f , we have

  f (xθ) ≤ θf (y) + (1− θ)f (x∗)
  f (xθ) − f (x∗) ≤ θ(f (y) − f (x∗)).

- Since x∗ is a local minimizer, for small enough θ > 0 the lhs is ≥ 0.
- So the rhs is also nonnegative, proving f (y) ≥ f (x∗) as desired.
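To see the theorem in action, here is a minimal numerical sketch (the least-squares data A, b, the step size, and the iteration count are illustrative choices, not from the lecture): gradient descent on a convex objective reaches the same optimal value from every random start, precisely because every local minimizer is global.

```python
import numpy as np

# Minimal sketch: for convex f, any local minimizer found by descent is
# global, so independent runs agree. A, b are arbitrary illustrative data.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)

def f(x):
    return np.sum((A @ x - b) ** 2)   # convex least-squares objective

def grad(x):
    return 2 * A.T @ (A @ x - b)

for trial in range(3):
    x = rng.normal(size=5)            # random initialization
    for _ in range(5000):             # plain gradient descent
        x -= 1e-3 * grad(x)
    print(f"start {trial}: f = {f(x):.8f}")   # same value every time
```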
Set of Optimal Solutions
The set of optimal solutions X ∗ may be empty
Example. If X = ∅, i.e., no feasible solutions, then X ∗ = ∅
Example. When only an infimum exists but no minimum, e.g., inf e^x = 0 is approached as x → −∞ but never attained.
In general, we should worry about the question "Is X ∗ = ∅?"
Exercise: Verify that X ∗ is always a convex set.
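A quick numerical sketch for this exercise (the flat-bottomed convex function below is an illustrative choice, not from the slides): for f (x) = max(|x| − 1, 0) the optimal set is the whole interval [−1, 1], and on a grid the minimizers indeed form a contiguous, i.e. convex, set.

```python
import numpy as np

# Sketch: the optimal set of the convex f(x) = max(|x| - 1, 0) is the
# interval [-1, 1]; on a grid the minimizers are contiguous (convex).
xs = np.linspace(-3, 3, 601)
fs = np.maximum(np.abs(xs) - 1, 0.0)
minimizers = xs[fs == fs.min()]
print(minimizers.min(), minimizers.max())   # -1.0 and 1.0
idx = np.where(fs == fs.min())[0]
print(np.all(np.diff(idx) == 1))            # True: no gaps between minimizers
```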
Optimality conditions (Recognizing optima)
First-order conditions: unconstrained
Theorem. Let f : Rn → R be continuously differentiable on an open set S containing x∗, a local minimum. Then ∇f (x∗) = 0.

Proof: Consider the function g(t) = f (x∗ + td), where d ∈ Rn and t > 0. Since x∗ is a local min, for small enough t, f (x∗ + td) ≥ f (x∗). Hence

0 ≤ lim_{t↓0} [f (x∗ + td) − f (x∗)]/t = dg(0)/dt = 〈∇f (x∗), d〉.

Similarly, using −d it follows that 〈∇f (x∗), d〉 ≤ 0, so 〈∇f (x∗), d〉 = 0 must hold. Since d is arbitrary, ∇f (x∗) = 0.

Exercise: Prove that if f is convex, then ∇f (x∗) = 0 is actually sufficient for global optimality! For general f this is not true. (This property is what makes convex optimization special!)
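A small sketch of the directional-derivative argument in the proof (the quadratic f and its known minimizer are illustrative choices, not from the slides): at x∗ the one-sided difference quotient (f (x∗ + td) − f (x∗))/t is nonnegative for every direction d and tends to 〈∇f (x∗), d〉 = 0 as t ↓ 0.

```python
import numpy as np

# Sketch: at a minimizer x*, one-sided difference quotients are >= 0 and
# tend to <grad f(x*), d> = 0. Here f(x) = ||x - c||^2 with x* = c.
c = np.array([1.0, -2.0, 0.5])

def f(x):
    return np.sum((x - c) ** 2)

x_star = c
rng = np.random.default_rng(1)
d = rng.normal(size=3)                # arbitrary direction
for t in [1e-1, 1e-3, 1e-6]:
    q = (f(x_star + t * d) - f(x_star)) / t
    print(f"t={t:.0e}  quotient={q:+.3e}")   # nonnegative, -> 0
```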
First-order conditions: constrained
♠ For convex f , we have f (y) ≥ f (x) + 〈∇f (x), y− x〉.
♠ Thus, x∗ is optimal if and only if
〈∇f (x∗), y− x∗〉 ≥ 0, for all y ∈ X.
♠ If X = Rn, this reduces to ∇f (x∗) = 0
[Figure: level sets of f and the feasible set X , with ∇f (x∗) defining a supporting hyperplane to X at the optimum x∗]
♠ If ∇f (x∗) ≠ 0, it defines a supporting hyperplane to X at x∗
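A numerical sketch of this variational inequality (the box-constrained least-squares problem is an illustrative choice, not from the slides): the minimizer of f (x) = ‖x − z‖² over the box X = [0, 1]^n is the clipped point, and 〈∇f (x∗), y − x∗〉 ≥ 0 checks out for random feasible y.

```python
import numpy as np

# Sketch: check <grad f(x*), y - x*> >= 0 for f(x) = ||x - z||^2 over
# the box X = [0,1]^n, whose minimizer is x* = clip(z, 0, 1).
rng = np.random.default_rng(2)
z = 2 * rng.normal(size=4)
x_star = np.clip(z, 0.0, 1.0)                # minimizer over the box
g = 2 * (x_star - z)                         # grad f(x*)

ys = rng.uniform(0.0, 1.0, size=(1000, 4))   # random feasible points
print(((ys - x_star) @ g).min() >= -1e-12)   # True: VI holds on X
```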
First-order conditions: constrained
- Let f be continuously differentiable, possibly nonconvex
- Suppose ∃y ∈ X such that 〈∇f (x∗), y− x∗〉 < 0
- Using the mean-value theorem of calculus, ∃ξ ∈ [0, 1] s.t.

  f (x∗ + t(y− x∗)) = f (x∗) + 〈∇f (x∗ + ξt(y− x∗)), t(y− x∗)〉
  (we applied the MVT to g(t) := f (x∗ + t(y− x∗)))

- For sufficiently small t, since ∇f is continuous, by the assumption on y, 〈∇f (x∗ + ξt(y− x∗)), y− x∗〉 < 0
- This in turn implies that f (x∗ + t(y− x∗)) < f (x∗)
- Since X is convex, x∗ + t(y− x∗) ∈ X is also feasible
- Contradiction to the local optimality of x∗
Optimality without differentiability
Theorem. (Fermat’s rule): Let f : Rn → (−∞,+∞]. Then,
Argmin f = zer(∂f ) := {x ∈ Rn | 0 ∈ ∂f (x)}.

Proof: x ∈ Argmin f implies that f (x) ≤ f (y) for all y ∈ Rn. Equivalently, f (y) ≥ f (x) + 〈0, y− x〉 for all y, which holds iff 0 ∈ ∂f (x).
Nonsmooth problem:

min_x f (x) s.t. x ∈ X  ⇔  min_x f (x) + 1_X (x).
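A minimal sketch of Fermat's rule on the classic nonsmooth example f (x) = |x| (an illustrative choice, not from the slides): ∂f (x) = {sign(x)} for x ≠ 0 and the full interval [−1, 1] at x = 0, so 0 ∈ ∂f (x) exactly at x = 0, which is indeed Argmin f .

```python
import numpy as np

# Sketch of Fermat's rule for f(x) = |x|: 0 lies in the subdifferential
# only at x = 0, the unique minimizer.
def subdiff_abs(x, tol=1e-12):
    """Subdifferential of |x| as an interval [lo, hi]."""
    if abs(x) < tol:
        return (-1.0, 1.0)            # full interval at the kink
    s = float(np.sign(x))
    return (s, s)                     # singleton {sign(x)} elsewhere

for x in [-2.0, 0.0, 3.0]:
    lo, hi = subdiff_abs(x)
    print(f"x={x:+.1f}  subdiff=[{lo:+.1f},{hi:+.1f}]  0 inside: {lo <= 0 <= hi}")
```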
Optimality – nonsmooth
- A minimizing x must satisfy: 0 ∈ ∂(f + 1_X)(x)
- (CQ) Assuming ri(dom f ) ∩ ri(X ) ≠ ∅, 0 ∈ ∂f (x) + ∂1_X(x)
- Recall, g ∈ ∂1_X (x) iff 1_X (y) ≥ 1_X (x) + 〈g, y− x〉 for all y.
- So g ∈ ∂1_X (x) means x ∈ X and 0 ≥ 〈g, y− x〉 ∀y ∈ X.
- Subdifferential of the indicator 1_X (x), aka the normal cone:

  N_X (x) := {g ∈ Rn | 0 ≥ 〈g, y− x〉 ∀y ∈ X}

Application

min f (x) + 1_X (x).

♦ If f is differentiable, we get 0 ∈ ∇f (x∗) + N_X (x∗)
♦ −∇f (x∗) ∈ N_X (x∗) ⇐⇒ 〈∇f (x∗), y− x∗〉 ≥ 0 for all y ∈ X.
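A sketch of the condition −∇f (x∗) ∈ N_X (x∗) for a box constraint (an illustrative problem, not from the slides; for a box the normal cone is separable across coordinates: at x_i = 0 it allows g_i ≤ 0, at x_i = 1 it allows g_i ≥ 0, and in the interior it forces g_i = 0):

```python
import numpy as np

# Sketch: verify -grad f(x*) lies in N_X(x*) for f(x) = ||x - z||^2 over
# the box X = [0,1]^n, using the coordinate-wise description of N_X.
rng = np.random.default_rng(3)
z = 2 * rng.normal(size=5)
x_star = np.clip(z, 0.0, 1.0)       # minimizer over the box
g = -2 * (x_star - z)               # candidate normal-cone element

in_cone = np.where(x_star == 0.0, g <= 1e-12,
          np.where(x_star == 1.0, g >= -1e-12, np.abs(g) <= 1e-12))
print(np.all(in_cone))              # True: -grad f(x*) is in N_X(x*)
```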
Example
min f (x) s.t. ‖x‖ ≤ 1.
A point x is optimal if and only if
x ∈ dom f , ‖x‖ ≤ 1, and ∀y: ‖y‖ ≤ 1 =⇒ ∇f (x)ᵀ(y− x) ≥ 0.
In other words
∀‖y‖ ≤ 1: ∇f (x)ᵀy ≥ ∇f (x)ᵀx
∀‖y‖ ≤ 1: −∇f (x)ᵀy ≤ −∇f (x)ᵀx
sup{−∇f (x)ᵀy | ‖y‖ ≤ 1} ≤ −∇f (x)ᵀx
‖−∇f (x)‖∗ ≤ −∇f (x)ᵀx
‖∇f (x)‖∗ ≤ −∇f (x)ᵀx.
Observe: If the constraint is satisfied strictly at the optimum (‖x‖ < 1), then ∇f (x) = 0 (else we'd violate the last inequality above).
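A sketch of the dual-norm condition for the Euclidean ball, where the dual norm is again the 2-norm (the point z is an illustrative choice, not from the slides): projecting z with ‖z‖ > 1 onto the ball makes the constraint active, and ‖∇f (x)‖∗ ≤ −∇f (x)ᵀx holds with equality.

```python
import numpy as np

# Sketch: minimize f(x) = 0.5||x - z||^2 over ||x|| <= 1 with ||z|| > 1.
# The optimum is x* = z/||z||; both sides of the dual-norm condition
# equal ||z|| - 1 (here 2.0), so it holds with equality.
z = np.array([2.0, -1.0, 2.0])          # ||z|| = 3 > 1: constraint active
x_star = z / np.linalg.norm(z)          # projection onto the unit ball
g = x_star - z                          # grad f(x*)
print(np.linalg.norm(g), -g @ x_star)   # both ~ 2.0
```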
Optimality conditions (KKT and friends)
Optimality conditions via Lagrangian
min f (x), f_i(x) ≤ 0, i = 1, . . . , m.

- Recall: 〈∇f (x∗), x− x∗〉 ≥ 0 for all feasible x ∈ X
- Can we simplify this using the Lagrangian?
- g(λ) = inf_x (L(x, λ) := f (x) + ∑_i λ_i f_i(x))
Assume strong duality and that p∗, d∗ are attained!

Thus, there exists a pair (x∗, λ∗) such that

p∗ = f (x∗) = d∗ = g(λ∗) = min_x L(x, λ∗) ≤ L(x∗, λ∗) ≤ f (x∗) = p∗

- Thus, equalities hold in the above chain, and

x∗ ∈ argmin_x L(x, λ∗).
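As a concrete sanity check of this chain (a minimal one-dimensional example of our choosing, not from the slides): for min x² s.t. 1 − x ≤ 0 we have p∗ = 1 at x∗ = 1, L(x, λ) = x² + λ(1 − x) gives g(λ) = λ − λ²/4, and λ∗ = 2 yields d∗ = g(2) = 1 = p∗ with x∗ minimizing L(·, λ∗).

```python
import numpy as np

# Sketch: verify p* = d* = min_x L(x, lam*) = L(x*, lam*) = f(x*) on
# min x^2 s.t. 1 - x <= 0, where x* = 1 and lam* = 2.
lam_star, x_star = 2.0, 1.0

def L(x, lam):
    return x ** 2 + lam * (1 - x)

xs = np.linspace(-5, 5, 100001)
vals = L(xs, lam_star)
print(vals.min(), L(x_star, lam_star), x_star ** 2)   # all 1.0
print(np.isclose(xs[np.argmin(vals)], x_star))        # x* minimizes L(., lam*)
```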
Optimality conditions via Lagrangian
x∗ ∈ argmin_x L(x, λ∗). If f , f_1, . . . , f_m are differentiable, this implies

∇_x L(x, λ∗)|_{x=x∗} = ∇f (x∗) + ∑_i λ∗_i ∇f_i(x∗) = 0.

Moreover, since L(x∗, λ∗) = f (x∗), we also have

∑_i λ∗_i f_i(x∗) = 0.

But λ∗_i ≥ 0 and f_i(x∗) ≤ 0, so every term in this sum is nonpositive; for the sum to vanish, each term must be zero, giving complementary slackness

λ∗_i f_i(x∗) = 0, i = 1, . . . , m.
KKT conditions
Karush-Kuhn-Tucker Conditions (KKT)
f_i(x∗) ≤ 0, i = 1, . . . , m (primal feasibility)
λ∗_i ≥ 0, i = 1, . . . , m (dual feasibility)
λ∗_i f_i(x∗) = 0, i = 1, . . . , m (complementary slackness)
∇_x L(x, λ∗)|_{x=x∗} = 0 (Lagrangian stationarity)
- Thus, if strong duality holds and (x∗, λ∗) exists, then the KKT conditions are necessary for the pair (x∗, λ∗) to be optimal
- If the problem is convex, then KKT is also sufficient

Exercise: Prove the above sufficiency of KKT.
Hint: Use that L(x, λ∗) is convex, and conclude from the KKT conditions that g(λ∗) = f (x∗), so that (x∗, λ∗) is an optimal primal-dual pair.
Read Ch. 5 of BV
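A sketch verifying all four KKT conditions on a small convex program (the halfspace-constrained projection below, with arbitrary data, is an illustrative choice, not from BV or the slides): for min ½‖x − z‖² s.t. aᵀx − b ≤ 0 with z infeasible, the KKT system gives λ∗ = (aᵀz − b)/‖a‖² and x∗ = z − λ∗a.

```python
import numpy as np

# Sketch: check primal/dual feasibility, complementary slackness, and
# Lagrangian stationarity for min 0.5||x - z||^2 s.t. a^T x - b <= 0.
a, b = np.array([1.0, 2.0]), 1.0
z = np.array([2.0, 2.0])             # a^T z = 6 > b: constraint active

lam = (a @ z - b) / (a @ a)          # dual variable, here 1.0
x = z - lam * a                      # from stationarity: (x - z) + lam*a = 0

print("primal feasibility:", a @ x - b <= 1e-12)
print("dual feasibility:  ", lam >= 0)
print("compl. slackness:  ", abs(lam * (a @ x - b)) <= 1e-12)
print("stationarity:      ", np.allclose((x - z) + lam * a, 0))
```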
Examples
Projection onto a hyperplane

min_x ½‖x − y‖²  s.t.  aᵀx = b.

KKT Conditions

L(x, ν) = ½‖x − y‖² + ν(aᵀx − b)

∂L/∂x = x − y + νa = 0
x = y − νa
aᵀx = aᵀy − ν aᵀa = b  ⟹  ‖a‖²ν = aᵀy − b

x = y − (1/‖a‖²)(aᵀy − b) a
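The closed form above is one line of code; the following sketch (the function name project_hyperplane is ours) also verifies primal feasibility numerically:

```python
import numpy as np

def project_hyperplane(y, a, b):
    # KKT solution derived above: x = y - ((a^T y - b) / ||a||^2) a
    return y - ((a @ y - b) / (a @ a)) * a

y, a, b = np.array([3.0, 1.0]), np.array([1.0, 2.0]), 1.0
x = project_hyperplane(y, a, b)
assert abs(a @ x - b) < 1e-12      # primal feasibility: a^T x = b
```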
Projection onto the simplex

min_x ½‖x − y‖²  s.t.  xᵀ1 = 1, x ≥ 0.

KKT Conditions

L(x, λ, ν) = ½‖x − y‖² − Σi λixi + ν(xᵀ1 − 1)

∂L/∂xi = xi − yi − λi + ν = 0
λixi = 0,  λi ≥ 0
xᵀ1 = 1,  x ≥ 0

Challenge A. Solve the above conditions in O(n log n) time (one standard approach is sketched below).
Challenge A+. Solve the above conditions in O(n) time.
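For Challenge A, the classical sort-and-threshold solution of these KKT conditions runs in O(n log n). A sketch under that approach (function name project_simplex is ours):

```python
import numpy as np

def project_simplex(y):
    """Sort-and-threshold solution of the KKT system above: find the
    multiplier nu so that x = max(y - nu, 0) sums to 1."""
    u = np.sort(y)[::-1]                          # y sorted in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]    # last coordinate kept positive
    nu = css[rho] / (rho + 1.0)                   # the optimal dual variable nu
    return np.maximum(y - nu, 0.0)

x = project_simplex(np.array([0.5, 1.2, -0.3]))
assert abs(x.sum() - 1.0) < 1e-12 and (x >= 0).all()
```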
Total variation minimization

min_x ½‖x − y‖² + λ Σi |xi+1 − xi|,  equivalently  min_x ½‖x − y‖² + λ‖Dx‖₁

(the matrix D is known as a differencing matrix).

Step 1. Take the dual (recall L3-25) to obtain:

min_u ½‖Dᵀu‖² − uᵀDy,  s.t. ‖u‖∞ ≤ λ.

Step 2. Replace the objective by ‖Dᵀu − y‖² (the argmin is unchanged).
Step 3. Add dummies u0 = un = 0; write s = r − u, where ri = y1 + ··· + yi:

min_s Σi (si−1 − si)² over i = 1, . . . , n,  s.t. ‖s − r‖∞ ≤ λ, s0 = 0, sn = rn.

Step 4 (Challenge). Work through the KKT conditions and keep going . . . finally, one obtains an O(n) method!

For the full story, see: A. Barbero, S. Sra. “Modular proximal optimization for multidimensional total-variation regularization” (JMLR 2019, pp. 1–82).
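Short of the O(n) method, the dual from Step 1 already yields a simple (if slower) algorithm: projected gradient, with projection onto the ∞-ball being coordinatewise clipping. A minimal sketch, assuming a dense differencing matrix and a fixed iteration count (both our choices, not the slides'):

```python
import numpy as np

def tv_prox(y, lam, iters=2000):
    # Projected gradient on the dual of Step 1:
    #   min_u 0.5 ||D^T u||^2 - u^T D y   s.t.  ||u||_inf <= lam
    n = len(y)
    D = np.diff(np.eye(n), axis=0)                # (n-1) x n differencing matrix
    u = np.zeros(n - 1)
    step = 0.25                                   # 1/L with L = ||D D^T|| <= 4
    for _ in range(iters):
        grad = D @ (D.T @ u - y)                  # gradient of the dual objective
        u = np.clip(u - step * grad, -lam, lam)   # project onto the inf-ball
    return y - D.T @ u                            # recover primal: x = y - D^T u

x = tv_prox(np.array([1.0, 1.1, 5.0, 5.2]), lam=0.5)
```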
Nonsmooth KKT (via subdifferentials)
KKT via subdifferentials?

Assume all fi(x) are finite valued and dom f = Rⁿ:

min_{x∈Rⁿ} f(x)  s.t.  fi(x) ≤ 0, i ∈ [m].

Assume Slater’s condition: ∃x such that fi(x) < 0 for all i ∈ [m].
Write Ci := {x | fi(x) ≤ 0}. Then the above problem becomes

min_x φ(x) := f(x) + 1C1(x) + ··· + 1Cm(x).

An optimal solution to this problem is a vector x̄ such that

0 ∈ ∂φ(x̄).

Slater’s condition tells us that int C1 ∩ ··· ∩ int Cm ≠ ∅.

Exercise: Rigorously justify the above (Hint: use continuity of fi).
KKT via subdifferentials?

Since int C1 ∩ ··· ∩ int Cm ≠ ∅, Rockafellar’s theorem tells us

∂φ(x) = ∂f(x) + ∂1C1(x) + ··· + ∂1Cm(x).

Recall: ∂1Ci = NCi (the normal cone). Verify (Challenge) that

NCi(x) = ∪ {λi∂fi(x) | λi ≥ 0}  if fi(x) = 0,
NCi(x) = {0}                    if fi(x) < 0,
NCi(x) = ∅                      if fi(x) > 0.

Thus, ∂φ(x) ≠ ∅ iff x satisfies fi(x) ≤ 0 for all i.
(Verify: the Minkowski sum A + ∅ = ∅.)
KKT via subdifferentials?

Thus, ∂φ(x) = ∪ {∂f(x) + λ1∂f1(x) + ··· + λm∂fm(x)}, where the union is over all choices of λi ≥ 0 such that λifi(x) = 0.

(If fi(x) < 0, then ∂1Ci(x) = {0}; if fi(x) = 0, then ∂1Ci(x) = {λi∂fi(x) | λi ≥ 0}; and we cannot jointly have λi ≥ 0 and fi(x) > 0.)

In other words, 0 ∈ ∂φ(x) iff there exist λ1, . . . , λm that satisfy the KKT conditions.

Exercise: Double-check the above for differentiable f, fi.
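As a sanity check of this equivalence, consider a toy 1-D instance (entirely our own illustration, not from the slides): min |x − 2| s.t. f1(x) = x − 1 ≤ 0, whose optimum is x̄ = 1.

```python
# At xbar = 1 the objective |x - 2| is smooth with subdifferential {-1},
# and f1(xbar) = 0 is active with grad f1 = 1, so N_C(xbar) = [0, inf).
# Hence 0 in  d f(xbar) + lam * d f1(xbar)  holds with lam = 1, which also
# satisfies lam >= 0 and lam * f1(xbar) = 0: exactly the KKT conditions.
xbar, lam = 1.0, 1.0
df, df1 = -1.0, 1.0                           # subgradients at xbar
assert df + lam * df1 == 0.0                  # 0 in  ∂f + lam ∂f1
assert lam >= 0.0 and lam * (xbar - 1.0) == 0.0
```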
Example: Constrained regression

min_x ½‖Ax − b‖²  s.t.  ‖x‖ ≤ θ.

KKT Conditions

L(x, λ) = ½‖Ax − b‖² + λ(‖x‖ − θ)

0 ∈ Aᵀ(Ax − b) + λ∂‖x‖

∂‖x‖ = {‖x‖⁻¹x}       if x ≠ 0,
∂‖x‖ = {z | ‖z‖ ≤ 1}  if x = 0.

Hmmm...?

▶ Case (i). Let x ← pinv(A)b. If ‖x‖ < θ, then x∗ = x (take λ = 0).
▶ Case (ii). If ‖x‖ ≥ θ, then ‖x∗‖ = θ. Thus, consider instead ½‖Ax − b‖² s.t. ‖x‖² = θ². (Exercise: complete the idea; a numerical sketch follows below.)
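One way to complete the idea numerically (our sketch; the name constrained_ls and the bisection tolerance are assumptions): in case (ii), stationarity gives (AᵀA + μI)x = Aᵀb with μ = λ/θ > 0, and ‖x(μ)‖ decreases in μ, so bisect on μ until ‖x(μ)‖ = θ.

```python
import numpy as np

def constrained_ls(A, b, theta, tol=1e-10):
    x = np.linalg.pinv(A) @ b                 # Case (i): try least squares
    if np.linalg.norm(x) <= theta:
        return x
    # Case (ii): optimum lies on the boundary ||x|| = theta. Solve the
    # regularized normal equations (A^T A + mu I) x = A^T b, bisect on mu.
    AtA, Atb, I = A.T @ A, A.T @ b, np.eye(A.shape[1])
    xnorm = lambda mu: np.linalg.norm(np.linalg.solve(AtA + mu * I, Atb))
    lo, hi = 0.0, 1.0
    while xnorm(hi) > theta:                  # grow until norm drops below theta
        hi *= 2.0
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        lo, hi = (mu, hi) if xnorm(mu) > theta else (lo, mu)
    return np.linalg.solve(AtA + hi * I, Atb)
```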