TRANSCRIPT
Local, Unconstrained Function Optimization
COMPSCI 371D — Machine Learning
Outline
1 Motivation and Scope
2 Gradient, Hessian, and Convexity
3 A Local, Unconstrained Optimization Template
4 Steepest Descent
5 Termination
6 Convergence Speed of Steepest Descent
7 Newton’s Method
8 Convergence Speed of Newton’s Method
9 Counting Steps versus Clocking
Motivation and Scope
• Parametric predictor: h(x ; v) : R^d × R^m → Y
• As a predictor: h(x ; v) = h_v(x) : R^d → Y
• Risk: L_T(v) = (1/N) Σ_{n=1}^{N} ℓ(y_n, h(x_n ; v)) : R^m → R
• For risk minimization, h(x_n ; v) = h_{x_n}(v) : R^m → Y
• Training a parametric predictor with m real parameters is function optimization: v̂ = argmin_{v ∈ R^m} L_T(v)
• Some v may be subject to constraints. We ignore those ML problems for now.
• Other v may be integer-valued. We ignore those, too (combinatorial optimization).
Example
• A binary linear classifier has decision boundary b + w^T x = 0
• So v = (b, w) ∈ R^{d+1}
• m = d + 1
• Counterexample: Can you think of an ML method that does not involve v ∈ R^m?
Warning: Change of Notation
• Optimization is used for much more than ML
• Even in ML, there are more than risks to optimize
• So we use “generic notation” for optimization
• Function to be minimized: f(z) : R^m → R
• More in keeping with the literature... except that we use z instead of x (too loaded for us!)
• Minimizing f(z) is the same as maximizing −f(z)
Only Local Minimization
• All we know about f is a “black box” (think Python function)
• For many problems, f has many local minima
• Start somewhere (z_0), and take steps “down”: f(z_{k+1}) < f(z_k)
• When we get stuck at a local minimum, we declare success
• We would like global minima, but all we get is local ones
• For some problems, f has a unique minimum...
• ... or at least a single connected set of minima
Gradient
∇f(z) = ∂f/∂z = [ ∂f/∂z_1 , … , ∂f/∂z_m ]^T
• If ∇f(z) exists everywhere, the condition ∇f(z) = 0 is necessary and sufficient for a stationary point (max, min, or saddle)
• Warning: only necessary for a minimum!
• Reduces to the first derivative df/dz for f : R → R (a numerical sketch follows below)
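Since f is treated as a black box (a Python function), its gradient can also be approximated numerically when no formula is at hand. A minimal sketch, assuming f takes a NumPy array; the name numerical_gradient and the step eps are illustrative choices, not part of the notes:

    import numpy as np

    def numerical_gradient(f, z, eps=1e-6):
        """Central-difference approximation of the gradient of f at z."""
        z = np.asarray(z, dtype=float)
        g = np.zeros_like(z)
        for i in range(z.size):
            e = np.zeros_like(z)
            e[i] = eps
            g[i] = (f(z + e) - f(z - e)) / (2 * eps)
        return g

    # Example: f(z) = z1^2 + 3 z2, whose gradient at (1, 2) is (2, 3)
    print(numerical_gradient(lambda z: z[0]**2 + 3 * z[1], np.array([1.0, 2.0])))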
First Order Taylor Expansion
f(z) ≈ g_1(z) = f(z_0) + [∇f(z_0)]^T (z − z_0)
approximates f(z) near z_0 with a (hyper)plane through z_0
[Figure: surface f(z) over (z_1, z_2) and its tangent plane at z_0]
∇f(z_0) points in the direction of steepest increase of f at z_0
• If we want to find z_1 where f(z_1) < f(z_0), going along −∇f(z_0) seems promising
• This is the general idea of steepest descent
Hessian
H(z) = [ ∂²f/∂z_1²        …   ∂²f/(∂z_1∂z_m)
         ⋮                     ⋮
         ∂²f/(∂z_m∂z_1)   …   ∂²f/∂z_m²      ]
• Symmetric matrix because of Schwarz’s theorem: ∂²f/(∂z_i∂z_j) = ∂²f/(∂z_j∂z_i)
• Eigenvalues are real because of symmetry
• Reduces to d²f/dz² for f : R → R
Convexity
[Figure: graph of a convex f; the chord between (z, f(z)) and (z', f(z')) lies above the graph, so f(uz + (1−u)z') ≤ u f(z) + (1−u) f(z')]
• Convex everywhere: for all z, z' in the (open) domain of f and for all u ∈ [0, 1], f(uz + (1−u)z') ≤ u f(z) + (1−u) f(z')
• Convex at z_0: the function f is convex everywhere in some open neighborhood of z_0
Convexity and Hessian
• If H(z) is defined at a stationary point z of f, then H(z) ⪰ 0 is necessary for z to be a minimum, and H(z) ≻ 0 (positive definite) is sufficient
• “⪰ 0” means positive semidefinite: z^T H z ≥ 0 for all z ∈ R^m
• The above is the definition of H(z) ⪰ 0
• To check computationally: all eigenvalues are nonnegative (see the sketch below)
• H(z) ⪰ 0 reduces to d²f/dz² ≥ 0 for f : R → R
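A minimal computational check of semidefiniteness along the lines of the bullet above, assuming the Hessian is available as a NumPy array; eigvalsh applies because H is symmetric, and the small tolerance is an illustrative guard against round-off:

    import numpy as np

    def is_positive_semidefinite(H, tol=1e-10):
        """True if all eigenvalues of the symmetric matrix H are (numerically) nonnegative."""
        return np.all(np.linalg.eigvalsh(H) >= -tol)

    print(is_positive_semidefinite(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True: a bowl
    print(is_positive_semidefinite(np.array([[2.0, 0.0], [0.0, -1.0]])))  # False: a saddle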
Second Order Taylor Expansion
f(z) ≈ g_2(z) = f(z_0) + [∇f(z_0)]^T (z − z_0) + ½ (z − z_0)^T H(z_0) (z − z_0)
approximates f(z) near z_0 with a quadratic function through z_0
• For minimization, this is useful only when H(z_0) ⪰ 0
• The function then looks locally like a bowl
[Figure: the quadratic bowl g_2(z) approximating f near z_0, with its minimum at z_1]
• If we want to find z_1 where f(z_1) < f(z_0), going to the bottom of the bowl seems promising
• This is the general idea of Newton’s method
A Template
• Most local, unconstrained optimization methods fit the following template (a Python sketch follows below):
k = 0
while z_k is not a minimum
    compute a step direction p_k with ‖p_k‖ = 1
    compute a step size α_k > 0
    z_{k+1} = z_k + α_k p_k
    k = k + 1
end
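The template translates almost line by line into Python. This is only a sketch; the callable names direction, step_size, and is_minimum (and the max_iters safeguard) are illustrative stand-ins for the design decisions discussed next, not part of the notes:

    import numpy as np

    def minimize(f, z0, direction, step_size, is_minimum, max_iters=1000):
        """Generic template for local, unconstrained minimization.

        direction(f, z)          -> step direction p_k with ||p_k|| = 1
        step_size(f, z, p)       -> step size alpha_k > 0
        is_minimum(z_old, z_new) -> True when we should stop
        """
        z = np.asarray(z0, dtype=float)
        for _ in range(max_iters):       # "while z_k is not a minimum"
            p = direction(f, z)          # compute step direction p_k
            alpha = step_size(f, z, p)   # compute step size alpha_k > 0
            z_new = z + alpha * p        # z_{k+1} = z_k + alpha_k p_k
            if is_minimum(z, z_new):
                return z_new
            z = z_new
        return z

Plugging in p_k = −∇f(z_k)/‖∇f(z_k)‖ for direction, a line search for step_size, and the ‖z_{k−1} − z_k‖ < δ test from the Termination slide for is_minimum turns this template into steepest descent.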
Design Decisions
• Whether to stop (“while z_k is not a minimum”)
• In what direction to proceed (p_k)
• How long a step to take in that direction (α_k)
• Different decisions for the last two lead to different methods with very different behaviors and computational costs
Steepest Descent: Follow the Gradient
• In what direction to proceed: p_k = −∇f(z_k)
• “Steepest descent” (ignoring normalization; fold it into the step size α_k)
• The problem reduces to one dimension: h(α) = f(z_k + α p_k)
• α = 0 ⇒ z = z_k
• Find α = α_k > 0 such that f(z_k + α_k p_k) is a local minimum along the line
• Line search (search along a line)
• Q1: How to find α_k?
• Q2: Is this a good strategy?
Line Search
• Bracketing triple: a < b < c with h(a) ≥ h(b) and h(b) ≤ h(c)
• Contains a (local) minimum!
• Split the bigger of [a, b] and [b, c] in half with a point u
• Find a new, narrower bracketing triple involving u and two out of a, b, c
• Stop when the bracket is narrow enough (say, 10^-6)
• This pins down a minimum to within 10^-6
Phase 1: Find a Bracketing Triple
[Figure: sampling h(α) along α until a bracketing triple is found]
Phase 2: Shrink the Bracketing Triple
[Figure: the bracketing triple (a, b, c) shrinking around a local minimum of h(α)]
if b − a > c − b
    u = (a + b)/2
    if h(u) > h(b)
        (a, b, c) = (u, b, c)
    otherwise
        (a, b, c) = (a, u, b)
    end
otherwise
    u = (b + c)/2
    if h(u) > h(b)
        (a, b, c) = (a, b, u)
    otherwise
        (a, b, c) = (b, u, c)
    end
end
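A direct Python transcription of the two phases; the initial step, the doubling strategy in phase 1, and the tolerance are illustrative choices rather than prescriptions from the slides:

    def line_search(h, step=1.0, tol=1e-6, max_expand=50):
        """Find alpha >= 0 that locally minimizes h(alpha) by bracketing and bisection."""
        # Phase 1: find a bracketing triple a < b < c with h(a) >= h(b) <= h(c)
        a, b = 0.0, step
        while h(b) > h(a):              # shrink until the first step goes downhill
            b /= 2
            if b < tol:
                return 0.0              # no descent along this direction within tolerance
        c = 2 * b
        for _ in range(max_expand):     # expand until h turns back up
            if h(c) >= h(b):
                break
            a, b, c = b, c, 2 * c
        # Phase 2: shrink the bracketing triple
        while c - a > tol:
            if b - a > c - b:
                u = (a + b) / 2
                if h(u) > h(b):
                    a, b, c = u, b, c
                else:
                    a, b, c = a, u, b
            else:
                u = (b + c) / 2
                if h(u) > h(b):
                    a, b, c = a, b, u
                else:
                    a, b, c = b, u, c
        return b

    # Example: minimize h(alpha) = (alpha - 3)^2 along the line
    print(line_search(lambda a: (a - 3.0)**2))   # approximately 3

For steepest descent, h is the restriction of f to the search line, e.g. h = lambda a: f(z_k + a * p_k).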
Termination
• Are we still making “significant progress”?
• Check f(z_{k−1}) − f(z_k)? (We want this to be strictly positive)
• Check ‖z_{k−1} − z_k‖? (We want this to be large enough)
• The second is more stringent close to the minimum, because ∇f(z) ≈ 0 there
• Stop when ‖z_{k−1} − z_k‖ < δ
Is Steepest Descent a Good Strategy?
• “We are going in the direction of fastest descent”
• “We choose an optimal step by line search”
• “Must be good, no?” Not so fast! (Pun intended)
• An example for which we know the answer: f(z) = c + a^T z + ½ z^T Q z with Q ≻ 0 (convex paraboloid)
• All smooth functions look like this close enough to z*
[Figure: elongated elliptical isocontours of f around the minimum z*]
Skating to a Minimum
[Figure: steepest-descent iterates zig-zag from z_0 toward z*; each line search ends where the search direction is tangent to an isocontour]
How to Measure Convergence Speed
• Asymptotics (k → ∞) are what matters
• If z* is the true solution, how does ‖z_{k+1} − z*‖ compare with ‖z_k − z*‖ for large k?
• Which converges faster: ‖z_{k+1} − z*‖ ≈ γ ‖z_k − z*‖^1 or ‖z_{k+1} − z*‖ ≈ γ ‖z_k − z*‖^2 ?
• Close to convergence these distances are small numbers, so ‖z_k − z*‖^2 ≪ ‖z_k − z*‖^1 [Example: (0.001)^2 ≪ (0.001)^1]
How to Measure Convergence Speed
• A fast algorithm has a large exponent q in ‖z_{k+1} − z*‖ ≈ γ ‖z_k − z*‖^q when k is large
• The order of convergence is the largest natural number q such that
  0 < lim_{k→∞} ‖z_{k+1} − z*‖ / ‖z_k − z*‖^q < ∞
• (The value of the limit is γ; a small numerical illustration follows below)
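A toy illustration of what the exponent q means, iterating the error recursion ‖z_{k+1} − z*‖ ≈ γ ‖z_k − z*‖^q directly; the starting error and constants are arbitrary choices for the example:

    # Error recursions e_{k+1} = gamma * e_k**q, both starting from the same error
    e_lin, e_quad = 0.1, 0.1
    for k in range(5):
        print(f"k={k}:  linear (q=1, gamma=0.1): {e_lin:.1e}   quadratic (q=2, gamma=1): {e_quad:.1e}")
        e_lin, e_quad = 0.1 * e_lin, e_quad**2

The linear sequence gains one correct digit per step; the quadratic one doubles its correct digits per step, matching the next two slides.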
Convergence Speed of Steepest Descent
• Steepest descent has order of convergence q = 1 (“linear convergence rate”):
  0 < lim_{k→∞} ‖z_{k+1} − z*‖ / ‖z_k − z*‖^1 < ∞
• Hopefully, when q = 1, we have at least γ < 1
• Example: γ = 0.1: ‖z_{k+1} − z*‖ ≈ 0.1 ‖z_k − z*‖^1
• Gain one correct decimal digit at every iteration (see the sketch below)
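As a concrete check, the sketch below runs steepest descent with exact line search on a convex quadratic f(z) = ½ z^T Q z, whose minimum is z* = 0; for a quadratic, the minimizing step along −∇f has the closed form α = (gᵀg)/(gᵀQg). The particular Q and starting point are arbitrary; the printed ratio settles to a constant γ < 1, i.e., linear convergence:

    import numpy as np

    Q = np.diag([1.0, 10.0])               # ill-conditioned convex paraboloid, minimum at z* = 0
    z = np.array([10.0, 1.0])
    prev_err = np.linalg.norm(z)
    for k in range(10):
        g = Q @ z                          # gradient of f(z) = 0.5 * z^T Q z
        alpha = (g @ g) / (g @ (Q @ g))    # exact line search along -g for a quadratic
        z = z - alpha * g
        err = np.linalg.norm(z)
        print(f"k={k}: ||z_k - z*|| = {err:.3e}, ratio = {err / prev_err:.3f}")
        prev_err = err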
Newton’s Method
• Newton’s method has a quadratic convergence rate:
  0 < lim_{k→∞} ‖z_{k+1} − z*‖ / ‖z_k − z*‖^2 < ∞
• Now things are OK even if γ ≥ 1
• Example: γ = 1: ‖z_{k+1} − z*‖ ≈ ‖z_k − z*‖^2
• Double the number of correct digits at every iteration
• Very fast!
Newton’s Method
f(z) ≈ g_2(z) = f(z_k) + [∇f(z_k)]^T (z − z_k) + ½ (z − z_k)^T H(z_k) (z − z_k)
• Check that H(z_k) ≻ 0 (otherwise, fall back on steepest descent for that step)
• Let Δz = z − z_k:
  f(z) ≈ g_2(z) = f(z_k) + [∇f(z_k)]^T Δz + ½ Δz^T H(z_k) Δz
[Figure: the quadratic bowl g_2(z) around z_k; the Newton step jumps to its bottom]
• Solve H(z_k) Δz = −∇f(z_k) (jump to the bottom of the bowl)
• Repeat (a Python sketch of one step follows below)
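A minimal sketch of one Newton step, assuming functions grad and hess that return the gradient and Hessian as NumPy arrays; the Cholesky test and the fallback logic are illustrative implementation choices, not prescribed by the slides:

    import numpy as np

    def newton_step(grad, hess, z):
        """One Newton step, falling back to a steepest-descent direction if H is not positive definite."""
        g, H = grad(z), hess(z)
        try:
            np.linalg.cholesky(H)            # succeeds only for (numerically) positive definite H
            dz = np.linalg.solve(H, -g)      # jump to the bottom of the local bowl
        except np.linalg.LinAlgError:
            dz = -g                          # fallback direction (a real implementation would line-search along it)
        return z + dz

    # Example: f(z) = (z1 - 1)^2 + 3 (z2 + 2)^2 is itself quadratic,
    # so a single Newton step from anywhere lands on the minimum (1, -2)
    grad = lambda z: np.array([2 * (z[0] - 1), 6 * (z[1] + 2)])
    hess = lambda z: np.array([[2.0, 0.0], [0.0, 6.0]])
    print(newton_step(grad, hess, np.array([10.0, 10.0])))   # -> [ 1. -2.]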
Counting Steps versus Clocking
• Newton’s method finds direction and step at once (Δz)
• No line search required
• But we need to evaluate the gradient and the Hessian at every step: m + m(m+1)/2 derivatives (the m gradient entries plus the distinct entries of the symmetric Hessian)!
• ...and solve an m × m linear system
• Asymptotic complexity depends on assumptions about arithmetic (exact, fixed-precision)
• Practical times are significant, and naive algorithms are O(m³)
• So both storage space and computation time become prohibitive as m grows
Bottom Line
• Newton’s method takes few steps to converge... but each step is expensive
• Advantageous when m is small
• For bigger problems, use steepest descent
• Compromises exist (e.g., conjugate gradients; see the Appendix in the notes)
• For very big m, even steepest descent is too expensive
• Stay tuned