
Page 1: Tensors for optimization(?)

Tensors for optimization(?)
Higher order and accelerated methods

Based on “Estimate sequence methods: extensions and approximations”, Michel Baes, 2009
MLRG Fall 2020 – Tensor basics and applications – Nov 18

Page 2: Tensors for optimization(?)

Tensors in optimization

Goal: find the minimum of a function f

Only available information is local: x, f(x), ∇f(x), ...

[Figure: plot of f]

Which one is it?

Higher order derivatives = more information


Page 7: Tensors for optimization(?)

But... isn’t Newton “bad”?

Newton’s method: x′ = x − α [∇²f(x)]⁻¹ ∇f(x)

Less stable than gradient descent
Can go up instead of down
Only works for small problems
Awful scaling with dimension

Why use even higher order information?

Today: a primer

Newton: the issues and how to fix them
General recipe for higher order
Some intuition for faster/approximate methods

Next week: Superfast higher order methods
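Below is a minimal NumPy sketch of the damped Newton update above. The quadratic test function, its gradient/Hessian, and the step size α = 1 are assumptions chosen for illustration, not part of the original slides.

```python
import numpy as np

def damped_newton_step(grad, hess, x, alpha=1.0):
    """One damped Newton update: x' = x - alpha * [Hessian]^{-1} gradient."""
    g = grad(x)
    H = hess(x)
    d = np.linalg.solve(H, g)      # solve H d = g instead of forming the inverse
    return x - alpha * d

# Assumed test function: f(x) = 0.5 x'Ax - b'x with A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A

x = np.zeros(2)
for _ in range(3):
    x = damped_newton_step(grad, hess, x)
print(x, np.linalg.solve(A, b))    # for a quadratic, one full step already lands on the minimizer
```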


Page 10: Tensors for optimization(?)

The issues with Newton: Non-convex functions


Page 15: Tensors for optimization(?)

The issues with Newton: Stability

What?

[Figure: f, f′ and f″]


Page 20: Tensors for optimization(?)

Optimization

Want to minimize f. At x, we know f(x), ∇f(x), ...

[Figure: f, with the current point x]

Surrogate: Simple(r) to optimize, progress on it leads to progress on f

Assumptions on f and available information → Best algorithm?

Algorithm → Why does it work? When does it work?

Gradient descent → Modified Newton → Arbitrary order


Page 25: Tensors for optimization(?)

First order

Gradient Descent (fixed α): x_{t+1} = x_t − α ∇f(x_t)

Need continuity: if the gradient changes too fast, not informative


Page 28: Tensors for optimization(?)

First order

Gradient Descent (fixed α): x_{t+1} = x_t − α ∇f(x_t)

The Hessian is bounded → The gradient does not change too fast

Taylor expansion:

f(y) = f(x) + f′(x)(y − x) + (1/2!) f″(x)(y − x)² + (1/3!) f‴(x)(y − x)³ + ...

Truncated error:

f(y) = f(x) + f′(x)(y − x) + O((y − x)²)
     ≤ f(x) + f′(x)(y − x) + [ (1/2) max_x f″(x) ] (y − x)²
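As a quick sanity check (not from the slides), this sketch verifies the quadratic upper bound numerically on an assumed 1-D example, f = sin, for which the second derivative is bounded by L = 1.

```python
import numpy as np

# Assumed 1-D example: f = sin, so |f''| <= L = 1 everywhere.
f = np.sin
df = np.cos
L = 1.0

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=1000)
y = rng.uniform(-3, 3, size=1000)

upper = f(x) + df(x) * (y - x) + 0.5 * L * (y - x) ** 2
assert np.all(f(y) <= upper + 1e-12)   # quadratic model upper-bounds f everywhere
print("max slack:", np.max(upper - f(y)))
```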


Page 32: Tensors for optimization(?)

First order

[Figure: f and its quadratic upper bound at x]

f(y) ≤ f(x) + ∇f(x)(y − x) + (L/2) ‖y − x‖²

Convex quadratic upper bound on f. Setting its gradient (in y) to zero:

0 = ∇f(x) + L(y − x)  ⟹  y = x − (1/L) ∇f(x)
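A minimal gradient-descent sketch using the step size α = 1/L that falls out of minimizing the quadratic upper bound above; the test problem and its Lipschitz constant L are assumptions for illustration.

```python
import numpy as np

# Assumed test problem: f(x) = 0.5 x'Ax - b'x, whose gradient is A x - b
# and whose Hessian is A, so L can be taken as the largest eigenvalue of A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()

x = np.zeros(2)
for t in range(200):
    x = x - (1.0 / L) * grad(x)    # y = x - (1/L) grad f(x): minimizer of the upper bound

print(x, np.linalg.solve(A, b))    # converges to the true minimizer of the quadratic
```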


Page 37: Tensors for optimization(?)

Second order

The Hessian does not change too fast:

f(x) + f′(x)(y − x) + (1/2) f″(x)(y − x)² + O((y − x)³)

The third derivative is bounded:

f(x) + f′(x)(y − x) + (1/2) f″(x)(y − x)² + [ (1/6) max_x f‴(x) ] (y − x)³

x′ = argmin_y { ⟨∇f(x), y − x⟩ + (1/2) ∇²f(x)[y − x, y − x] + (M/6) ‖y − x‖³ }

Cubic regularization of Newton’s method
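A rough 1-D illustration of one cubic-regularized Newton step: the cubic model above is minimized by brute force on a grid. The test function f(x) = x⁴/4 and the bound M on the third derivative are assumptions, and grid search is used only to keep the sketch short.

```python
import numpy as np

def cubic_reg_step_1d(g, h, M, s_grid=np.linspace(-5, 5, 200001)):
    """Minimize the 1-D cubic model m(s) = g*s + 0.5*h*s**2 + (M/6)*|s|**3
    over a fine grid (illustration only) and return the step s."""
    m = g * s_grid + 0.5 * h * s_grid**2 + (M / 6.0) * np.abs(s_grid) ** 3
    return s_grid[np.argmin(m)]

# Assumed example: f(x) = x**4 / 4, so f' = x**3, f'' = 3x**2, f''' = 6x.
f, df, d2f = lambda x: x**4 / 4, lambda x: x**3, lambda x: 3 * x**2
x = 2.0
M = 12.0           # assumed bound on |f'''| for |x| <= 2, valid along these iterates
for _ in range(10):
    x = x + cubic_reg_step_1d(df(x), d2f(x), M)
print(x, f(x))     # the iterates approach the minimizer at 0
```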


Page 40: Tensors for optimization(?)

Cubic regularization: non-convex

[Figure: Newton vs. cubic regularization steps on a non-convex function; stationary points]


Page 47: Tensors for optimization(?)

General strategy

For mth-order methods: if the (m + 1)th derivative is bounded,

minimize { mth-order Taylor expansion + C ‖x − y‖^(m+1) }

So far:

Issues with naïve Newton
Regularity assumptions for GD and higher order methods
Cubic regularization

Next:

How to solve the subproblem
Some caveats
Is it faster? Fastest?


Page 50: Tensors for optimization(?)

Time per iteration

Cubic regularization update: x′ = x − d, with

d = argmin_d { −⟨g, d⟩ + (1/2) H[d, d] + (M/6) ‖d‖³ }

... It’s not convex (but it is simpler)

If ‖d‖ = r is fixed ⟹ minimizing a quadratic (with simple constraints):

d = [H + (Mr/2) I]⁻¹ g

Find the fixed point of r = ‖ [H + (Mr/2) I]⁻¹ g ‖   (1D, convex)

Time: Matrix inverse (once, then reuse): O(n³)
+ a few iterations of a convex 1D solver (only matrix-vector products)
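A sketch of the subproblem solver outlined above, in the easy case where H + (Mr/2)I stays positive definite: the fixed point in r is found by bisection (one reasonable choice, not necessarily the paper's), and for brevity the linear system is re-solved each time rather than factorizing H once and reusing it as the slide suggests.

```python
import numpy as np

def cubic_subproblem(g, H, M, r_max=1e6, tol=1e-10):
    """Find r solving r = ||[H + (M r / 2) I]^{-1} g|| by bisection, then
    return d = [H + (M r / 2) I]^{-1} g, so that x' = x - d (easy case only)."""
    n = len(g)
    d_of_r = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(n), g)
    phi = lambda r: np.linalg.norm(d_of_r(r)) - r     # strictly decreasing in r
    lo, hi = 0.0, r_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) > 0 else (lo, mid)
    return d_of_r(0.5 * (lo + hi))

# Assumed toy data standing in for g = grad f(x), H = Hessian of f at x.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
g = np.array([1.0, -1.0])
x = np.zeros(2)
d = cubic_subproblem(g, H, M=1.0)
x_new = x - d     # cubic-regularized update x' = x - d
print(d, x_new)
```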



Page 58: Tensors for optimization(?)

Caveats

GD works well ≠ Cubic regularization works well

Quality of approximation goes up if higher derivatives are smooth enough

[Figure: GD vs. cubic regularization]

Bounds on f, f′, f″, f‴, ...

Some functions get “smoother” with higher derivatives, some less so:

f = sin(cx):   f′ = c cos(cx),   f″ = −c² sin(cx),   f‴ = −c³ cos(cx),   f′′′′ = c⁴ sin(cx)

c < 1 ⟹ max_x f^(m)(x) → 0
c > 1 ⟹ max_x f^(m)(x) → ∞
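A small numerical illustration (not from the slides) of the derivative bounds for sin(cx): the m-th derivative is c^m sin(cx + mπ/2), so its maximum magnitude is c^m, which shrinks for c < 1 and blows up for c > 1.

```python
import numpy as np

x = np.linspace(0, 20, 200001)
for c in (0.5, 2.0):
    for m in range(5):
        # m-th derivative of sin(c x) in closed form: c**m * sin(c x + m*pi/2)
        dm = c**m * np.sin(c * x + m * np.pi / 2)
        print(f"c={c}, m={m}: max |f^(m)| ~ {np.max(np.abs(dm)):.3f} (= c**m = {c**m:.3f})")
```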

Page 59: Tensors for optimization(?)

Is it faster?

After T steps, f(x_T) − f* ≤ ?   (in convex world)

C_m depends on the bound on f^(m+1) (and the initial error)

Gradient descent:         f(x_T) − f* ≤ C_1 / T
Cubic regularization:     f(x_T) − f* ≤ C_2 / T²
mth-order (regularized):  f(x_T) − f* ≤ C_m / T^m

How many iterations to reach f(x_T) − f* ≤ ε?

Gradient descent:         T ≥ C_1 / ε
Cubic regularization:     T ≥ (C_2 / ε)^(1/2)
mth-order (regularized):  T ≥ (C_m / ε)^(1/m)
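To make the rates concrete, a throwaway calculation with all constants C_m set to 1 (purely an assumption), showing how many iterations each rate predicts for a target accuracy ε:

```python
import math

# Assumed constants C_m = 1 for every order; only the scaling in epsilon matters here.
for eps in (1e-2, 1e-4, 1e-8):
    for m in (1, 2, 3):
        T = math.ceil((1.0 / eps) ** (1.0 / m))   # T >= (C_m / eps)^(1/m)
        print(f"eps={eps:.0e}, order m={m}: T >= {T}")
```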


Page 62: Tensors for optimization(?)

Is it faster?

Time to reach f(x_T) − f* ≤ ε

[Figure: log-log plot of T (number of iterations) against the error, from hard (small ε) to easy (large ε), for gradient descent, cubic regularization, and a 3rd-order method]

Plot caveats:
- Height depends on constants
- Only slopes are accurate
- Worst case, if assumptions hold
- Log-log scale

Main takeaway: For tiny ε, higher order methods are better even if more expensive per iteration

Page 63: Tensors for optimization(?)

Is it fastest? (main part of the paper)

“Actual time”: nope

Only need to solve subproblem approximately (at first):

f_t(x_{t+1}) − f_t* ≤ ε_t,   ε_t = O(1/t^c)

Number of iterations: nope (for convex functions)

Gradient descent: 1/T
Cubic regularization: 1/T²
Accelerated Gradient Descent: 1/T²
Accelerated cubic regularization: 1/T³
mth order: 1/T^(m+1)


Page 68: Tensors for optimization(?)

Main ideas

Optimization with higher order approximations
Regularity assumptions
Constructing upper bounds
Solving polynomials

Next week: Super fast accelerated higher order methods (and maybe a tensor)

Thanks!