
Page 1: Tensors for optimization(?)

Tensors for optimization(?)
Higher order and accelerated methods

Based on “Estimate sequence methods: extensions and approximations”, Michel Baes, 2009
MLRG Fall 2020 – Tensor basics and applications – Nov 18

Page 2: Tensors for optimization(?)

Tensors in optimization

Goal: find the minimum of a function f

Only available information is local: x, f(x), ∇f(x), ...

[Figure: plot of f]

Which one is it?

Higher order derivatives = more information


Page 7: Tensors for optimization(?)

But... isn’t Newton “bad”?

Newton’s method: x′ = x − α [∇²f(x)]⁻¹ ∇f(x)

Less stable than gradient descent
Can go up instead of down
Only works for small problems
Awful scaling with dimension

Why use even higher order information?

Today: a primer

Newton: the issues and how to fix them
General recipe for higher order
Some intuition for faster/approximate methods

Next week: Superfast higher order methods
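Below is a minimal NumPy sketch of the damped Newton update above. The quadratic test function, its gradient/Hessian, and the step size α = 1 are assumptions chosen for illustration, not part of the original slides.

```python
import numpy as np

def damped_newton_step(grad, hess, x, alpha=1.0):
    """One damped Newton update: x' = x - alpha * [Hessian]^{-1} gradient."""
    g = grad(x)
    H = hess(x)
    d = np.linalg.solve(H, g)      # solve H d = g instead of forming the inverse
    return x - alpha * d

# Assumed test function: f(x) = 0.5 x'Ax - b'x with A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A

x = np.zeros(2)
for _ in range(3):
    x = damped_newton_step(grad, hess, x)
print(x, np.linalg.solve(A, b))    # for a quadratic, one full step already lands on the minimizer
```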


Page 10: Tensors for optimization(?)

The issues with Newton: Non-convex functions


Page 15: Tensors for optimization(?)

The issues with Newton: Stability

What?

[Figure: f, f′ and f″]


Page 20: Tensors for optimization(?)

Optimization

Want to minimize f. At x, we know f(x), ∇f(x), ...

[Figure: f, with the current point x]

Surrogate: Simple(r) to optimize, progress on it leads to progress on f

Assumptions on f and available information → Best algorithm?

Algorithm → Why does it work? When does it work?

Gradient descent → Modified Newton → Arbitrary order


Page 25: Tensors for optimization(?)

First order

Gradient Descent (fixed α): x_{t+1} = x_t − α ∇f(x_t)

Need continuity: if the gradient changes too fast, not informative


Page 28: Tensors for optimization(?)

First order

Gradient Descent (fixed α): x_{t+1} = x_t − α ∇f(x_t)

The Hessian is bounded → The gradient does not change too fast

Taylor expansion:

f(y) = f(x) + f′(x)(y − x) + (1/2!) f″(x)(y − x)² + (1/3!) f‴(x)(y − x)³ + ...

Truncated error:

f(y) = f(x) + f′(x)(y − x) + O((y − x)²)
     ≤ f(x) + f′(x)(y − x) + [ (1/2) max_x f″(x) ] (y − x)²
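As a quick sanity check (not from the slides), this sketch verifies the quadratic upper bound numerically on an assumed 1-D example, f = sin, for which the second derivative is bounded by L = 1.

```python
import numpy as np

# Assumed 1-D example: f = sin, so |f''| <= L = 1 everywhere.
f = np.sin
df = np.cos
L = 1.0

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=1000)
y = rng.uniform(-3, 3, size=1000)

upper = f(x) + df(x) * (y - x) + 0.5 * L * (y - x) ** 2
assert np.all(f(y) <= upper + 1e-12)   # quadratic model upper-bounds f everywhere
print("max slack:", np.max(upper - f(y)))
```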


Page 32: Tensors for optimization(?)

First order

[Figure: f and its quadratic upper bound at x]

f(y) ≤ f(x) + ∇f(x)(y − x) + (L/2) ‖y − x‖²

Convex quadratic upper bound on f. Setting its gradient (in y) to zero:

0 = ∇f(x) + L(y − x)  ⟹  y = x − (1/L) ∇f(x)
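A minimal gradient-descent sketch using the step size α = 1/L that falls out of minimizing the quadratic upper bound above; the test problem and its Lipschitz constant L are assumptions for illustration.

```python
import numpy as np

# Assumed test problem: f(x) = 0.5 x'Ax - b'x, whose gradient is A x - b
# and whose Hessian is A, so L can be taken as the largest eigenvalue of A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()

x = np.zeros(2)
for t in range(200):
    x = x - (1.0 / L) * grad(x)    # y = x - (1/L) grad f(x): minimizer of the upper bound

print(x, np.linalg.solve(A, b))    # converges to the true minimizer of the quadratic
```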


Page 37: Tensors for optimization(?)

Second order

The Hessian does not change too fast:

f(x) + f′(x)(y − x) + (1/2) f″(x)(y − x)² + O((y − x)³)

The third derivative is bounded:

f(x) + f′(x)(y − x) + (1/2) f″(x)(y − x)² + [ (1/6) max_x f‴(x) ] (y − x)³

x′ = argmin_y { ⟨∇f(x), y − x⟩ + (1/2) ∇²f(x)[y − x, y − x] + (M/6) ‖y − x‖³ }

Cubic regularization of Newton’s method
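A rough 1-D illustration of one cubic-regularized Newton step: the cubic model above is minimized by brute force on a grid. The test function f(x) = x⁴/4 and the bound M on the third derivative are assumptions, and grid search is used only to keep the sketch short.

```python
import numpy as np

def cubic_reg_step_1d(g, h, M, s_grid=np.linspace(-5, 5, 200001)):
    """Minimize the 1-D cubic model m(s) = g*s + 0.5*h*s**2 + (M/6)*|s|**3
    over a fine grid (illustration only) and return the step s."""
    m = g * s_grid + 0.5 * h * s_grid**2 + (M / 6.0) * np.abs(s_grid) ** 3
    return s_grid[np.argmin(m)]

# Assumed example: f(x) = x**4 / 4, so f' = x**3, f'' = 3x**2, f''' = 6x.
f, df, d2f = lambda x: x**4 / 4, lambda x: x**3, lambda x: 3 * x**2
x = 2.0
M = 12.0           # assumed bound on |f'''| for |x| <= 2, valid along these iterates
for _ in range(10):
    x = x + cubic_reg_step_1d(df(x), d2f(x), M)
print(x, f(x))     # the iterates approach the minimizer at 0
```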


Page 40: Tensors for optimization(?)

Cubic regularization: non-convex

[Figure: Newton vs. cubic regularization steps on a non-convex function; stationary points]


Page 47: Tensors for optimization(?)

General strategy

For mth-order methods: if the (m + 1)th derivative is bounded,

minimize { mth-order Taylor expansion + C ‖x − y‖^(m+1) }

So far:

Issues with naïve Newton
Regularity assumptions for GD and higher order methods
Cubic regularization

Next:

How to solve the subproblem
Some caveats
Is it faster? Fastest?


Page 50: Tensors for optimization(?)

Time per iteration

Cubic regularization update: x′ = x − d, with

d = argmin_d { −⟨g, d⟩ + (1/2) H[d, d] + (M/6) ‖d‖³ }

... It’s not convex (but it is simpler)

If ‖d‖ = r is fixed ⟹ minimizing a quadratic (with simple constraints):

d = [H + (Mr/2) I]⁻¹ g

Find the fixed point of r = ‖ [H + (Mr/2) I]⁻¹ g ‖   (1D, convex)

Time: Matrix inverse (once, then reuse): O(n³)
+ a few iterations of a convex 1D solver (only matrix-vector products)
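A sketch of the subproblem solver outlined above, in the easy case where H + (Mr/2)I stays positive definite: the fixed point in r is found by bisection (one reasonable choice, not necessarily the paper's), and for brevity the linear system is re-solved each time rather than factorizing H once and reusing it as the slide suggests.

```python
import numpy as np

def cubic_subproblem(g, H, M, r_max=1e6, tol=1e-10):
    """Find r solving r = ||[H + (M r / 2) I]^{-1} g|| by bisection, then
    return d = [H + (M r / 2) I]^{-1} g, so that x' = x - d (easy case only)."""
    n = len(g)
    d_of_r = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(n), g)
    phi = lambda r: np.linalg.norm(d_of_r(r)) - r     # strictly decreasing in r
    lo, hi = 0.0, r_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) > 0 else (lo, mid)
    return d_of_r(0.5 * (lo + hi))

# Assumed toy data standing in for g = grad f(x), H = Hessian of f at x.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
g = np.array([1.0, -1.0])
x = np.zeros(2)
d = cubic_subproblem(g, H, M=1.0)
x_new = x - d     # cubic-regularized update x' = x - d
print(d, x_new)
```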



Page 58: Tensors for optimization(?)

Caveats

GD works well ≠ Cubic regularization works well

Quality of approximation goes up if higher derivatives are smooth enough

[Figure: GD vs. cubic regularization]

Bounds on f, f′, f″, f‴, ...

Some functions get “smoother” with higher derivatives, some less so:

f = sin(cx):   f′ = c cos(cx),   f″ = −c² sin(cx),   f‴ = −c³ cos(cx),   f′′′′ = c⁴ sin(cx)

c < 1 ⟹ max_x f^(m)(x) → 0
c > 1 ⟹ max_x f^(m)(x) → ∞
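A small numerical illustration (not from the slides) of the derivative bounds for sin(cx): the m-th derivative is c^m sin(cx + mπ/2), so its maximum magnitude is c^m, which shrinks for c < 1 and blows up for c > 1.

```python
import numpy as np

x = np.linspace(0, 20, 200001)
for c in (0.5, 2.0):
    for m in range(5):
        # m-th derivative of sin(c x) in closed form: c**m * sin(c x + m*pi/2)
        dm = c**m * np.sin(c * x + m * np.pi / 2)
        print(f"c={c}, m={m}: max |f^(m)| ~ {np.max(np.abs(dm)):.3f} (= c**m = {c**m:.3f})")
```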

Page 59: Tensors for optimization(?)

Is it faster?

After T steps, f(x_T) − f* ≤ ?   (in convex world)

C_m depends on the bound on f^(m+1) (and the initial error)

Gradient descent:         f(x_T) − f* ≤ C_1 / T
Cubic regularization:     f(x_T) − f* ≤ C_2 / T²
mth-order (regularized):  f(x_T) − f* ≤ C_m / T^m

How many iterations to reach f(x_T) − f* ≤ ε?

Gradient descent:         T ≥ C_1 / ε
Cubic regularization:     T ≥ (C_2 / ε)^(1/2)
mth-order (regularized):  T ≥ (C_m / ε)^(1/m)
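To make the rates concrete, a throwaway calculation with all constants C_m set to 1 (purely an assumption), showing how many iterations each rate predicts for a target accuracy ε:

```python
import math

# Assumed constants C_m = 1 for every order; only the scaling in epsilon matters here.
for eps in (1e-2, 1e-4, 1e-8):
    for m in (1, 2, 3):
        T = math.ceil((1.0 / eps) ** (1.0 / m))   # T >= (C_m / eps)^(1/m)
        print(f"eps={eps:.0e}, order m={m}: T >= {T}")
```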


Page 62: Tensors for optimization(?)

Is it faster?

Time to reach f(x_T) − f* ≤ ε

[Figure: log-log plot of T (number of iterations) against the error, from hard (small ε) to easy (large ε), for gradient descent, cubic regularization, and a 3rd-order method]

Plot caveats:
- Height depends on constants
- Only slopes are accurate
- Worst case, if assumptions hold
- Log-log scale

Main takeaway: For tiny ε, higher order methods are better even if more expensive per iteration

Page 63: Tensors for optimization(?)

Is it fastest? (main part of the paper)

“Actual time”: nope

Only need to solve subproblem approximately (at first):

f_t(x_{t+1}) − f_t* ≤ ε_t,   ε_t = O(1/t^c)

Number of iterations: nope (for convex functions)

Gradient descent: 1/T
Cubic regularization: 1/T²
Accelerated Gradient Descent: 1/T²
Accelerated cubic regularization: 1/T³
mth order: 1/T^(m+1)


Page 68: Tensors for optimization(?)

Main ideas

Optimization with higher order approximations
Regularity assumptions
Constructing upper bounds
Solving polynomials

Next week: Super fast accelerated higher order methods (and maybe a tensor)

Thanks!