Local, Unconstrained Function Optimization
COMPSCI 371D — Machine Learning

TRANSCRIPT

Page 1: Local, Unconstrained Function Optimization

COMPSCI 371D — Machine Learning

Page 2: Outline

1. Motivation and Scope
2. Gradient, Hessian, and Convexity
3. A Local, Unconstrained Optimization Template
4. Steepest Descent
5. Termination
6. Convergence Speed of Steepest Descent
7. Newton's Method
8. Convergence Speed of Newton's Method
9. Counting Steps versus Clocking

Page 3: Motivation and Scope

• Parametric predictor: h(x ; v) : ℝᵈ × ℝᵐ → Y
• As a predictor: h(x ; v) = hᵥ(x) : ℝᵈ → Y
• Risk: L_T(v) = (1/N) Σₙ₌₁ᴺ ℓ(yₙ, h(xₙ ; v)) : ℝᵐ → ℝ
• For risk minimization, fix xₙ and view h(xₙ ; v) as a function of v : ℝᵐ → Y
• Training a parametric predictor with m real parameters is function optimization:
  v̂ = argmin_{v ∈ ℝᵐ} L_T(v)
• Some v may be subject to constraints. We ignore those ML problems for now.
• Other v may be integer-valued. We ignore those, too (combinatorial optimization).
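To make the notation concrete, here is a minimal sketch of the training risk L_T as a function of the parameter vector v. The squared-error loss, the linear predictor, and the names predict and risk are all illustrative assumptions, not course code:

    import numpy as np

    def predict(x, v):
        # Hypothetical linear predictor: v = (b, w), h(x; v) = b + w^T x
        return v[0] + x @ v[1:]

    def risk(v, X, y):
        # L_T(v) = (1/N) sum_n loss(y_n, h(x_n; v)), here with squared-error loss
        predictions = np.array([predict(x, v) for x in X])
        return np.mean((y - predictions) ** 2)

    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = np.array([1.0, 2.0, 3.0])
    v = np.array([0.5, 1.0, 1.0])   # b = 0.5, w = (1, 1)
    print(risk(v, X, y))            # each squared error is 0.25, so L_T = 0.25

Training then means handing risk(·, X, y) to a function minimizer over v ∈ ℝᵐ.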

Page 4: Example

• A binary linear classifier has decision boundary b + wᵀx = 0
• So v = (b, w) ∈ ℝᵈ⁺¹
• m = d + 1
• Counterexample: Can you think of an ML method that does not involve v ∈ ℝᵐ?
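Concretely, a toy instance of this classifier (a sketch with made-up numbers; nothing here is from the course materials). For d = 2 the classifier carries m = d + 1 = 3 real parameters:

    import numpy as np

    b, w = 0.5, np.array([1.0, -2.0])   # v = (b, w), m = d + 1 = 3
    x = np.array([0.3, 0.8])
    label = 1 if b + w @ x > 0 else -1  # which side of the boundary b + w^T x = 0?
    print(label)                        # b + w.x = 0.5 + 0.3 - 1.6 = -0.8, so -1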

Page 5: Warning: Change of Notation

• Optimization is used for much more than ML
• Even in ML, there are more things than risks to optimize
• So we use "generic notation" for optimization
• Function to be minimized: f(z) : ℝᵐ → ℝ
• This is more in keeping with the literature...
  ... except that we use z instead of x (too loaded for us!)
• Minimizing f(z) is the same as maximizing −f(z)

Page 6: Only Local Minimization

• All we know about f is a "black box" (think Python function)
• For many problems, f has many local minima
• Start somewhere (z₀), and take steps "down": f(zₖ₊₁) < f(zₖ)
• When we get stuck at a local minimum, we declare success
• We would like global minima, but all we get is local ones
• For some problems, f has a unique minimum...
• ... or at least a single connected set of minima

Page 7: Gradient

∇f(z) = ∂f/∂z = [∂f/∂z₁, …, ∂f/∂zₘ]ᵀ

• If ∇f(z) exists everywhere, the condition ∇f(z) = 0 is necessary and sufficient for a stationary point (max, min, or saddle)
• Warning: only necessary for a minimum!
• Reduces to the first derivative for f : ℝ → ℝ

Page 8: First Order Taylor Expansion

f(z) ≈ g₁(z) = f(z₀) + [∇f(z₀)]ᵀ(z − z₀)

approximates f(z) near z₀ with a (hyper)plane through z₀.

[Figure: surface plot of f over (z₁, z₂) with the tangent plane g₁ touching it at z₀]

• ∇f(z₀) points in the direction of steepest increase of f at z₀
• If we want to find z₁ where f(z₁) < f(z₀), going along −∇f(z₀) seems promising
• This is the general idea of steepest descent

Page 9: Hessian

H(z) = [ ∂²f/∂z₁²     ⋯   ∂²f/∂z₁∂zₘ ]
       [     ⋮               ⋮       ]
       [ ∂²f/∂zₘ∂z₁   ⋯   ∂²f/∂zₘ²   ]

• Symmetric matrix because of Schwarz's theorem: ∂²f/∂zᵢ∂zⱼ = ∂²f/∂zⱼ∂zᵢ
• Eigenvalues are real because of symmetry
• Reduces to d²f/dz² for f : ℝ → ℝ
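As a sanity check on these definitions, here is a small sketch of my own (not course code) that approximates the gradient and Hessian of a black-box f by central finite differences; on a quadratic the results match the exact derivatives and the Hessian comes out symmetric:

    import numpy as np

    def gradient(f, z, eps=1e-5):
        # Central differences: (f(z + eps e_i) - f(z - eps e_i)) / (2 eps)
        m = len(z)
        g = np.zeros(m)
        for i in range(m):
            e = np.zeros(m); e[i] = eps
            g[i] = (f(z + e) - f(z - e)) / (2 * eps)
        return g

    def hessian(f, z, eps=1e-4):
        # Differentiate the finite-difference gradient once more, column by column
        m = len(z)
        H = np.zeros((m, m))
        for j in range(m):
            e = np.zeros(m); e[j] = eps
            H[:, j] = (gradient(f, z + e) - gradient(f, z - e)) / (2 * eps)
        return H

    f = lambda z: z[0]**2 + 3*z[0]*z[1] + 5*z[1]**2
    z0 = np.array([1.0, -1.0])
    print(gradient(f, z0))   # exact gradient: [2z1 + 3z2, 3z1 + 10z2] = [-1, -7]
    print(hessian(f, z0))    # exact Hessian: [[2, 3], [3, 10]], symmetric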

Page 10: Convexity

[Figure: graph of f with the chord from (z, f(z)) to (z′, f(z′)); at uz + (1−u)z′ the graph value f(uz + (1−u)z′) lies below the chord value u f(z) + (1−u) f(z′)]

• Convex everywhere: for all z, z′ in the (open) domain of f and for all u ∈ [0, 1],
  f(uz + (1 − u)z′) ≤ u f(z) + (1 − u) f(z′)
• Convex at z₀: the function f is convex everywhere in some open neighborhood of z₀

Page 11: Convexity and Hessian

• If H(z) is defined at a stationary point z of f, then H(z) ⪰ 0 is necessary for z to be a minimum (and H(z) ≻ 0, positive definite, is sufficient)
• "H ⪰ 0" means positive semidefinite, by definition: zᵀHz ≥ 0 for all z ∈ ℝᵐ
• To check computationally: all eigenvalues are nonnegative
• H(z) ⪰ 0 reduces to d²f/dz² ≥ 0 for f : ℝ → ℝ
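The eigenvalue test is one line in practice. A sketch, assuming the Hessian is available as a NumPy array (eigvalsh handles the symmetric case and returns real eigenvalues):

    import numpy as np

    def is_positive_semidefinite(H, tol=1e-10):
        # All eigenvalues nonnegative (up to numerical tolerance)
        return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

    print(is_positive_semidefinite(np.array([[2.0, 3.0], [3.0, 10.0]])))  # True
    print(is_positive_semidefinite(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False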

Page 12: Second Order Taylor Expansion

f(z) ≈ g₂(z) = f(z₀) + [∇f(z₀)]ᵀ(z − z₀) + ½ (z − z₀)ᵀ H(z₀) (z − z₀)

approximates f(z) near z₀ with a quadratic function through z₀.

• For minimization, this is useful only when H(z₀) ⪰ 0
• The function then looks locally like a bowl

[Figure: quadratic bowl g₂ fitted to f at z₀, with the bottom of the bowl at z₁]

• If we want to find z₁ where f(z₁) < f(z₀), going to the bottom of the bowl seems promising
• This is the general idea of Newton's method

Page 13: A Template

• Regardless of method, most local unconstrained optimization methods fit the following template:

    k = 0
    while zₖ is not a minimum
        compute step direction pₖ with ‖pₖ‖ = 1
        compute step size αₖ > 0
        zₖ₊₁ = zₖ + αₖ pₖ
        k = k + 1
    end
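As a minimal Python sketch of this template (my own illustration: minimize, direction, and step_size are hypothetical names standing in for whatever a particular method computes, and the stopping test anticipates the displacement criterion discussed under Termination below):

    import numpy as np

    def minimize(f, grad, z0, direction, step_size, delta=1e-6, max_iters=10_000):
        # Generic descent template: the choices of direction and step size
        # are what distinguish one method from another.
        z = z0
        for _ in range(max_iters):
            p = direction(f, grad, z)          # unit-norm step direction p_k
            alpha = step_size(f, grad, z, p)   # step size alpha_k > 0
            z_new = z + alpha * p
            if np.linalg.norm(z_new - z) < delta:   # "z_k is not a minimum" test
                return z_new
            z = z_new
        return z

Concrete choices for direction and step_size appear in the steepest-descent and Newton sketches later in these slides.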

Page 14: Design Decisions

• Whether to stop ("while zₖ is not a minimum")
• In what direction to proceed (pₖ)
• How long a step to take in that direction (αₖ)
• Different decisions for the last two lead to different methods with very different behaviors and computational costs

Page 15: Steepest Descent: Follow the Gradient

• In what direction to proceed: pₖ = −∇f(zₖ)
• "Steepest descent" (ignoring normalization, which can be folded into the step size αₖ)
• The problem reduces to one dimension: h(α) = f(zₖ + α pₖ)
• α = 0 ⇒ z = zₖ
• Find α = αₖ > 0 such that f(zₖ + αₖ pₖ) is a local minimum along the line
• Line search (search along a line)
• Q1: How to find αₖ?
• Q2: Is this a good strategy?

Page 16: Line Search

• Bracketing triple: a < b < c with h(a) ≥ h(b) and h(b) ≤ h(c)
• Such a triple contains a (local) minimum!
• Split the bigger of [a, b] and [b, c] in half with a point u
• Find a new, narrower bracketing triple involving u and two out of a, b, c
• Stop when the bracket is narrow enough (say, 10⁻⁶)
• This pins down a minimum to within 10⁻⁶ (see the sketch after Page 19)

Page 17: Phase 1: Find a Bracketing Triple

[Figure: plot of h(α) with trial points stepping out along α until three of them form a bracketing triple]

Page 18: Phase 2: Shrink the Bracketing Triple

[Figure: plot of h(α) with a bracketing triple (a, b, c) being repeatedly narrowed around the minimum]

Page 19: Shrinking the Bracketing Triple (pseudocode)

    if b − a > c − b
        u = (a + b)/2
        if h(u) > h(b)
            (a, b, c) = (u, b, c)
        otherwise
            (a, b, c) = (a, u, b)
        end
    otherwise
        u = (b + c)/2
        if h(u) > h(b)
            (a, b, c) = (a, b, u)
        otherwise
            (a, b, c) = (b, u, c)
        end
    end
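Here is a runnable sketch of the whole line search, where h is the one-dimensional restriction h(α) = f(zₖ + α pₖ) and the bracket starts at a = 0 because we want αₖ > 0. The bracketing phase (find_bracket) is my own simple version, not necessarily the one used in class; the shrinking phase follows the pseudocode above:

    def find_bracket(h, a=0.0, step=1.0, max_tries=50):
        # Phase 1: walk right until h turns upward, yielding a < b < c
        # with h(a) >= h(b) and h(b) <= h(c).
        b = a + step
        while h(b) > h(a) and max_tries > 0:
            step /= 2          # overshot immediately: try a smaller first step
            b = a + step
            max_tries -= 1
        c = b + step
        while h(c) < h(b) and max_tries > 0:
            a, b = b, c
            step *= 2          # keep doubling until the function increases
            c = b + step
            max_tries -= 1
        return a, b, c

    def shrink_bracket(h, a, b, c, tol=1e-6):
        # Phase 2: halve the larger of [a, b] and [b, c] until the bracket is tiny
        while c - a > tol:
            if b - a > c - b:
                u = (a + b) / 2
                a, b, c = (u, b, c) if h(u) > h(b) else (a, u, b)
            else:
                u = (b + c) / 2
                a, b, c = (a, b, u) if h(u) > h(b) else (b, u, c)
        return b

    h = lambda alpha: (alpha - 0.7) ** 2        # toy h with minimum at 0.7
    print(shrink_bracket(h, *find_bracket(h)))  # approximately 0.7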

Page 20: Termination

• Are we still making "significant progress"?
• Check f(zₖ₋₁) − f(zₖ)? (We want this to be strictly positive)
• Check ‖zₖ₋₁ − zₖ‖? (We want this to be large enough)
• The second test is more stringent close to the minimum: since ∇f(z) ≈ 0 there, f changes very little even while zₖ is still moving appreciably
• Stop when ‖zₖ₋₁ − zₖ‖ < δ
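In code, the displacement test is one line (a sketch matching the template from Page 13):

    import numpy as np

    def keep_going(z_prev, z, delta=1e-6):
        # Displacement test: stop once consecutive iterates are within delta
        return np.linalg.norm(z_prev - z) >= delta

    print(keep_going(np.array([1.0, 2.0]), np.array([1.0, 2.0 + 1e-9])))  # False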

Page 21: Is Steepest Descent a Good Strategy?

• "We are going in the direction of fastest descent"
• "We choose an optimal step by line search"
• "Must be good, no?" Not so fast! (Pun intended)
• An example for which we know the answer:
  f(z) = c + aᵀz + ½ zᵀQz with Q ≻ 0 (a convex paraboloid)
• All smooth functions look like this close enough to z*

[Figure: elliptical isocontours of f around the minimum z*]
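For this quadratic we can check the answer analytically: setting the gradient to zero, ∇f(z) = a + Qz = 0, gives the unique minimum z* = −Q⁻¹a. It is unique because Q ≻ 0 makes Q invertible, and it is a minimum because the Hessian is H = Q ≻ 0 everywhere.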

Page 22: Skating to a Minimum

[Figure: zigzag path of steepest descent from z₀ toward z* across elongated elliptical isocontours; each line search stops where the search line is tangent to an isocontour, so consecutive directions p₀, p₁, … are perpendicular]

Page 23: How to Measure Convergence Speed

• Asymptotics (k → ∞) are what matters
• If z* is the true solution, how does ‖zₖ₊₁ − z*‖ compare with ‖zₖ − z*‖ for large k?
• Which converges faster:
  ‖zₖ₊₁ − z*‖ ≈ γ ‖zₖ − z*‖¹ or ‖zₖ₊₁ − z*‖ ≈ γ ‖zₖ − z*‖² ?
• Close to convergence these distances are small numbers, so
  ‖zₖ − z*‖² ≪ ‖zₖ − z*‖¹ [Example: (0.001)² ≪ (0.001)¹]

Page 24: How to Measure Convergence Speed

• A fast algorithm has a large exponent q in
  ‖zₖ₊₁ − z*‖ ≈ γ ‖zₖ − z*‖^q
  when k is large
• The order of convergence is the largest natural number q such that
  0 < lim_{k→∞} ‖zₖ₊₁ − z*‖ / ‖zₖ − z*‖^q < ∞
• (The value of the limit is γ)

Page 25: Convergence Speed of Steepest Descent

• Steepest descent has order of convergence q = 1 ("linear convergence rate"):
  0 < lim_{k→∞} ‖zₖ₊₁ − z*‖ / ‖zₖ − z*‖¹ < ∞
• Hopefully, when q = 1, we have at least γ < 1
• Example: γ = 0.1
  ‖zₖ₊₁ − z*‖ ≈ 0.1 ‖zₖ − z*‖¹
• Gain one correct decimal digit at every iteration

Page 26: Convergence Speed of Newton's Method

• Newton's method has a quadratic convergence rate:
  0 < lim_{k→∞} ‖zₖ₊₁ − z*‖ / ‖zₖ − z*‖² < ∞
• Now things are OK even if γ ≥ 1
• Example: γ = 1
  ‖zₖ₊₁ − z*‖ ≈ ‖zₖ − z*‖²
• Double the number of correct digits at every iteration
• Very fast!
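A quick numeric illustration of the two rates (a sketch; the error sequences are synthetic, with γ = 0.1 for the linear rate and γ = 1 for the quadratic one, both starting from an error of 0.1):

    e_lin, e_quad = 0.1, 0.1
    for k in range(5):
        print(f"k={k}: linear {e_lin:.1e}   quadratic {e_quad:.1e}")
        e_lin = 0.1 * e_lin       # one more correct digit per step
        e_quad = e_quad ** 2      # number of correct digits doubles per step

After four steps the linear sequence has reached 10⁻⁵ while the quadratic one is already at 10⁻¹⁶.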

Page 27: Newton's Method

f(z) ≈ g₂(z) = f(zₖ) + [∇f(zₖ)]ᵀ(z − zₖ) + ½ (z − zₖ)ᵀ H(zₖ) (z − zₖ)

• Check that H(zₖ) ⪰ 0 (otherwise, fall back on steepest descent for that step)
• Let Δz = z − zₖ:
  f(z) ≈ g₂(z) = f(zₖ) + [∇f(zₖ)]ᵀ Δz + ½ Δzᵀ H(zₖ) Δz

[Figure: quadratic bowl g₂ fitted to f at zₖ, with the bottom of the bowl giving the next iterate]

• Solve H(zₖ) Δz = −∇f(zₖ) (jump to the bottom of the bowl)
• Repeat
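A minimal Newton iteration built from these formulas (a sketch: grad and hess are assumed callables returning the gradient vector and Hessian matrix, the Hessian is assumed positive definite so the system is solvable, and the fall-back to steepest descent is omitted for brevity):

    import numpy as np

    def newton(grad, hess, z0, delta=1e-10, max_iters=100):
        z = z0
        for _ in range(max_iters):
            # Solve H(z_k) dz = -grad f(z_k): jump to the bottom of the local bowl
            dz = np.linalg.solve(hess(z), -grad(z))
            z = z + dz
            if np.linalg.norm(dz) < delta:
                break
        return z

    # Toy quadratic f(z) = (z1 - 1)^2 + 10 (z2 + 2)^2: one Newton step suffices
    grad = lambda z: np.array([2 * (z[0] - 1), 20 * (z[1] + 2)])
    hess = lambda z: np.array([[2.0, 0.0], [0.0, 20.0]])
    print(newton(grad, hess, np.zeros(2)))   # approximately [1, -2]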

Page 28: Counting Steps versus Clocking

• Newton's method finds direction and step at once (Δz)
• No line search required
• But we need to evaluate the gradient and Hessian at every step: m + m(m+1)/2 derivatives!
• ...and solve an m × m linear system
• Asymptotic complexity depends on assumptions about arithmetic (exact or fixed-precision)
• Practical times are significant, and naive algorithms are O(m³)
  (for example, m = 1000 already means about 500,000 distinct second derivatives and on the order of 10⁹ operations per step)
• So both storage space and computation time become prohibitive as m grows

Page 29: Bottom Line

• Newton's method takes few steps to converge...
  ... but each step is expensive
• Advantageous when m is small
• For bigger problems, use steepest descent
• Compromises exist (e.g., conjugate gradients; see the Appendix in the notes)
• For very big m, even steepest descent is too expensive
• Stay tuned