TRANSCRIPT
Local, Unconstrained Function Optimization
COMPSCI 371D — Machine Learning
Outline
1 Motivation and Scope
2 Gradient, Hessian, and Convexity
3 A Local, Unconstrained Optimization Template
4 Steepest Descent
5 Termination
6 Convergence Speed of Steepest Descent
7 Newton’s Method
8 Convergence Speed of Newton’s Method
9 Counting Steps versus Clocking
Motivation and Scope
• Parametric predictor: h(x ; v) : R^d × R^m → Y
• As a predictor: h(x ; v) = h_v(x) : R^d → Y
• Risk: L_T(v) = (1/N) Σ_{n=1}^{N} ℓ(y_n, h(x_n ; v)) : R^m → R
• For risk minimization, h(x_n ; v) = h_{x_n}(v) : R^m → Y
• Training a parametric predictor with m real parameters is function optimization: v̂ = argmin_{v ∈ R^m} L_T(v)
• Some v may be subject to constraints. We ignore those ML problems for now.
• Other v may be integer-valued. We ignore those, too (combinatorial optimization).
Example
• A binary linear classifier has decision boundary b + w^T x = 0
• So v = (b, w) ∈ R^{d+1}
• m = d + 1
• Counterexample: Can you think of an ML method that does not involve v ∈ R^m?
Warning: Change of Notation
• Optimization is used for much more than ML
• Even in ML, there are more than risks to optimize
• So we use “generic notation” for optimization
• Function to be minimized: f(z) : R^m → R
• More in keeping with the literature... except that we use z instead of x (too loaded for us!)
• Minimizing f(z) is the same as maximizing −f(z)
Only Local Minimization
• All we know about f is a “black box” (think Python function)
• For many problems, f has many local minima
• Start somewhere (z_0), and take steps “down”: f(z_{k+1}) < f(z_k)
• When we get stuck at a local minimum, we declare success
• We would like global minima, but all we get is local ones
• For some problems, f has a unique minimum...
• ... or at least a single connected set of minima
Gradient
∇f(z) = ∂f/∂z = [ ∂f/∂z_1 , … , ∂f/∂z_m ]^T
• If ∇f(z) exists everywhere, the condition ∇f(z) = 0 is necessary and sufficient for a stationary point (max, min, or saddle)
• Warning: only necessary for a minimum!
• Reduces to the first derivative df/dz for f : R → R (a numerical sketch follows below)
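Since f is treated as a black box (a Python function), its gradient can also be approximated numerically when no formula is at hand. A minimal sketch, assuming f takes a NumPy array; the name numerical_gradient and the step eps are illustrative choices, not part of the notes:

    import numpy as np

    def numerical_gradient(f, z, eps=1e-6):
        """Central-difference approximation of the gradient of f at z."""
        z = np.asarray(z, dtype=float)
        g = np.zeros_like(z)
        for i in range(z.size):
            e = np.zeros_like(z)
            e[i] = eps
            g[i] = (f(z + e) - f(z - e)) / (2 * eps)
        return g

    # Example: f(z) = z1^2 + 3 z2, whose gradient at (1, 2) is (2, 3)
    print(numerical_gradient(lambda z: z[0]**2 + 3 * z[1], np.array([1.0, 2.0])))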
First Order Taylor Expansion
f(z) ≈ g_1(z) = f(z_0) + [∇f(z_0)]^T (z − z_0)
approximates f(z) near z_0 with a (hyper)plane through z_0
[Figure: surface f(z) over (z_1, z_2) and its tangent plane at z_0]
∇f(z_0) points in the direction of steepest increase of f at z_0
• If we want to find z_1 where f(z_1) < f(z_0), going along −∇f(z_0) seems promising
• This is the general idea of steepest descent
Hessian
H(z) = [ ∂²f/∂z_1²        …   ∂²f/(∂z_1∂z_m)
         ⋮                     ⋮
         ∂²f/(∂z_m∂z_1)   …   ∂²f/∂z_m²      ]
• Symmetric matrix because of Schwarz’s theorem: ∂²f/(∂z_i∂z_j) = ∂²f/(∂z_j∂z_i)
• Eigenvalues are real because of symmetry
• Reduces to d²f/dz² for f : R → R
Convexity
[Figure: graph of a convex f; the chord between (z, f(z)) and (z', f(z')) lies above the graph, so f(uz + (1−u)z') ≤ u f(z) + (1−u) f(z')]
• Convex everywhere: for all z, z' in the (open) domain of f and for all u ∈ [0, 1], f(uz + (1−u)z') ≤ u f(z) + (1−u) f(z')
• Convex at z_0: the function f is convex everywhere in some open neighborhood of z_0
Convexity and Hessian
• If H(z) is defined at a stationary point z of f, then H(z) ⪰ 0 is necessary for z to be a minimum, and H(z) ≻ 0 (positive definite) is sufficient
• “⪰ 0” means positive semidefinite: z^T H z ≥ 0 for all z ∈ R^m
• The above is the definition of H(z) ⪰ 0
• To check computationally: all eigenvalues are nonnegative (see the sketch below)
• H(z) ⪰ 0 reduces to d²f/dz² ≥ 0 for f : R → R
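A minimal computational check of semidefiniteness along the lines of the bullet above, assuming the Hessian is available as a NumPy array; eigvalsh applies because H is symmetric, and the small tolerance is an illustrative guard against round-off:

    import numpy as np

    def is_positive_semidefinite(H, tol=1e-10):
        """True if all eigenvalues of the symmetric matrix H are (numerically) nonnegative."""
        return np.all(np.linalg.eigvalsh(H) >= -tol)

    print(is_positive_semidefinite(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True: a bowl
    print(is_positive_semidefinite(np.array([[2.0, 0.0], [0.0, -1.0]])))  # False: a saddle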
Second Order Taylor Expansion
f(z) ≈ g_2(z) = f(z_0) + [∇f(z_0)]^T (z − z_0) + ½ (z − z_0)^T H(z_0) (z − z_0)
approximates f(z) near z_0 with a quadratic function through z_0
• For minimization, this is useful only when H(z_0) ⪰ 0
• The function then looks locally like a bowl
[Figure: the quadratic bowl g_2(z) approximating f near z_0, with its minimum at z_1]
• If we want to find z_1 where f(z_1) < f(z_0), going to the bottom of the bowl seems promising
• This is the general idea of Newton’s method
A Template
• Most local, unconstrained optimization methods fit the following template (a Python sketch follows below):
k = 0
while z_k is not a minimum
    compute a step direction p_k with ‖p_k‖ = 1
    compute a step size α_k > 0
    z_{k+1} = z_k + α_k p_k
    k = k + 1
end
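The template translates almost line by line into Python. This is only a sketch; the callable names direction, step_size, and is_minimum (and the max_iters safeguard) are illustrative stand-ins for the design decisions discussed next, not part of the notes:

    import numpy as np

    def minimize(f, z0, direction, step_size, is_minimum, max_iters=1000):
        """Generic template for local, unconstrained minimization.

        direction(f, z)          -> step direction p_k with ||p_k|| = 1
        step_size(f, z, p)       -> step size alpha_k > 0
        is_minimum(z_old, z_new) -> True when we should stop
        """
        z = np.asarray(z0, dtype=float)
        for _ in range(max_iters):       # "while z_k is not a minimum"
            p = direction(f, z)          # compute step direction p_k
            alpha = step_size(f, z, p)   # compute step size alpha_k > 0
            z_new = z + alpha * p        # z_{k+1} = z_k + alpha_k p_k
            if is_minimum(z, z_new):
                return z_new
            z = z_new
        return z

Plugging in p_k = −∇f(z_k)/‖∇f(z_k)‖ for direction, a line search for step_size, and the ‖z_{k−1} − z_k‖ < δ test from the Termination slide for is_minimum turns this template into steepest descent.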
Design Decisions
• Whether to stop (“while z_k is not a minimum”)
• In what direction to proceed (p_k)
• How long a step to take in that direction (α_k)
• Different decisions for the last two lead to different methods with very different behaviors and computational costs
Steepest Descent: Follow the Gradient
• In what direction to proceed: p_k = −∇f(z_k)
• “Steepest descent” (ignoring normalization; fold it into the step size α_k)
• The problem reduces to one dimension: h(α) = f(z_k + α p_k)
• α = 0 ⇒ z = z_k
• Find α = α_k > 0 such that f(z_k + α_k p_k) is a local minimum along the line
• Line search (search along a line)
• Q1: How to find α_k?
• Q2: Is this a good strategy?
Line Search
• Bracketing triple: a < b < c with h(a) ≥ h(b) and h(b) ≤ h(c)
• Contains a (local) minimum!
• Split the bigger of [a, b] and [b, c] in half with a point u
• Find a new, narrower bracketing triple involving u and two out of a, b, c
• Stop when the bracket is narrow enough (say, 10^-6)
• This pins down a minimum to within 10^-6
Phase 1: Find a Bracketing Triple
[Figure: sampling h(α) along α until a bracketing triple is found]
Phase 2: Shrink the Bracketing Triple
[Figure: the bracketing triple (a, b, c) shrinking around a local minimum of h(α)]
if b − a > c − b
    u = (a + b)/2
    if h(u) > h(b)
        (a, b, c) = (u, b, c)
    otherwise
        (a, b, c) = (a, u, b)
    end
otherwise
    u = (b + c)/2
    if h(u) > h(b)
        (a, b, c) = (a, b, u)
    otherwise
        (a, b, c) = (b, u, c)
    end
end
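A direct Python transcription of the two phases; the initial step, the doubling strategy in phase 1, and the tolerance are illustrative choices rather than prescriptions from the slides:

    def line_search(h, step=1.0, tol=1e-6, max_expand=50):
        """Find alpha >= 0 that locally minimizes h(alpha) by bracketing and bisection."""
        # Phase 1: find a bracketing triple a < b < c with h(a) >= h(b) <= h(c)
        a, b = 0.0, step
        while h(b) > h(a):              # shrink until the first step goes downhill
            b /= 2
            if b < tol:
                return 0.0              # no descent along this direction within tolerance
        c = 2 * b
        for _ in range(max_expand):     # expand until h turns back up
            if h(c) >= h(b):
                break
            a, b, c = b, c, 2 * c
        # Phase 2: shrink the bracketing triple
        while c - a > tol:
            if b - a > c - b:
                u = (a + b) / 2
                if h(u) > h(b):
                    a, b, c = u, b, c
                else:
                    a, b, c = a, u, b
            else:
                u = (b + c) / 2
                if h(u) > h(b):
                    a, b, c = a, b, u
                else:
                    a, b, c = b, u, c
        return b

    # Example: minimize h(alpha) = (alpha - 3)^2 along the line
    print(line_search(lambda a: (a - 3.0)**2))   # approximately 3

For steepest descent, h is the restriction of f to the search line, e.g. h = lambda a: f(z_k + a * p_k).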
Termination
• Are we still making “significant progress”?
• Check f(z_{k−1}) − f(z_k)? (We want this to be strictly positive)
• Check ‖z_{k−1} − z_k‖? (We want this to be large enough)
• The second is more stringent close to the minimum, because ∇f(z) ≈ 0 there
• Stop when ‖z_{k−1} − z_k‖ < δ
Is Steepest Descent a Good Strategy?
• “We are going in the direction of fastest descent”
• “We choose an optimal step by line search”
• “Must be good, no?” Not so fast! (Pun intended)
• An example for which we know the answer: f(z) = c + a^T z + ½ z^T Q z with Q ≻ 0 (convex paraboloid)
• All smooth functions look like this close enough to z*
[Figure: elongated elliptical isocontours of f around the minimum z*]
Skating to a Minimum
[Figure: steepest-descent iterates zig-zag from z_0 toward z*; each line search ends where the search direction is tangent to an isocontour]
How to Measure Convergence Speed
• Asymptotics (k → ∞) are what matters
• If z* is the true solution, how does ‖z_{k+1} − z*‖ compare with ‖z_k − z*‖ for large k?
• Which converges faster: ‖z_{k+1} − z*‖ ≈ γ ‖z_k − z*‖^1 or ‖z_{k+1} − z*‖ ≈ γ ‖z_k − z*‖^2 ?
• Close to convergence these distances are small numbers, so ‖z_k − z*‖^2 ≪ ‖z_k − z*‖^1 [Example: (0.001)^2 ≪ (0.001)^1]
How to Measure Convergence Speed
• A fast algorithm has a large exponent q in ‖z_{k+1} − z*‖ ≈ γ ‖z_k − z*‖^q when k is large
• The order of convergence is the largest natural number q such that
  0 < lim_{k→∞} ‖z_{k+1} − z*‖ / ‖z_k − z*‖^q < ∞
• (The value of the limit is γ; a small numerical illustration follows below)
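A toy illustration of what the exponent q means, iterating the error recursion ‖z_{k+1} − z*‖ ≈ γ ‖z_k − z*‖^q directly; the starting error and constants are arbitrary choices for the example:

    # Error recursions e_{k+1} = gamma * e_k**q, both starting from the same error
    e_lin, e_quad = 0.1, 0.1
    for k in range(5):
        print(f"k={k}:  linear (q=1, gamma=0.1): {e_lin:.1e}   quadratic (q=2, gamma=1): {e_quad:.1e}")
        e_lin, e_quad = 0.1 * e_lin, e_quad**2

The linear sequence gains one correct digit per step; the quadratic one doubles its correct digits per step, matching the next two slides.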
Convergence Speed of Steepest Descent
• Steepest descent has order of convergence q = 1 (“linear convergence rate”):
  0 < lim_{k→∞} ‖z_{k+1} − z*‖ / ‖z_k − z*‖^1 < ∞
• Hopefully, when q = 1, we have at least γ < 1
• Example: γ = 0.1: ‖z_{k+1} − z*‖ ≈ 0.1 ‖z_k − z*‖^1
• Gain one correct decimal digit at every iteration (see the sketch below)
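As a concrete check, the sketch below runs steepest descent with exact line search on a convex quadratic f(z) = ½ z^T Q z, whose minimum is z* = 0; for a quadratic, the minimizing step along −∇f has the closed form α = (gᵀg)/(gᵀQg). The particular Q and starting point are arbitrary; the printed ratio settles to a constant γ < 1, i.e., linear convergence:

    import numpy as np

    Q = np.diag([1.0, 10.0])               # ill-conditioned convex paraboloid, minimum at z* = 0
    z = np.array([10.0, 1.0])
    prev_err = np.linalg.norm(z)
    for k in range(10):
        g = Q @ z                          # gradient of f(z) = 0.5 * z^T Q z
        alpha = (g @ g) / (g @ (Q @ g))    # exact line search along -g for a quadratic
        z = z - alpha * g
        err = np.linalg.norm(z)
        print(f"k={k}: ||z_k - z*|| = {err:.3e}, ratio = {err / prev_err:.3f}")
        prev_err = err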
Newton’s Method
• Newton’s method has a quadratic convergence rate:
  0 < lim_{k→∞} ‖z_{k+1} − z*‖ / ‖z_k − z*‖^2 < ∞
• Now things are OK even if γ ≥ 1
• Example: γ = 1: ‖z_{k+1} − z*‖ ≈ ‖z_k − z*‖^2
• Double the number of correct digits at every iteration
• Very fast!
Newton’s Method
f(z) ≈ g_2(z) = f(z_k) + [∇f(z_k)]^T (z − z_k) + ½ (z − z_k)^T H(z_k) (z − z_k)
• Check that H(z_k) ≻ 0 (otherwise, fall back on steepest descent for that step)
• Let Δz = z − z_k:
  f(z) ≈ g_2(z) = f(z_k) + [∇f(z_k)]^T Δz + ½ Δz^T H(z_k) Δz
[Figure: the quadratic bowl g_2(z) around z_k; the Newton step jumps to its bottom]
• Solve H(z_k) Δz = −∇f(z_k) (jump to the bottom of the bowl)
• Repeat (a Python sketch of one step follows below)
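A minimal sketch of one Newton step, assuming functions grad and hess that return the gradient and Hessian as NumPy arrays; the Cholesky test and the fallback logic are illustrative implementation choices, not prescribed by the slides:

    import numpy as np

    def newton_step(grad, hess, z):
        """One Newton step, falling back to a steepest-descent direction if H is not positive definite."""
        g, H = grad(z), hess(z)
        try:
            np.linalg.cholesky(H)            # succeeds only for (numerically) positive definite H
            dz = np.linalg.solve(H, -g)      # jump to the bottom of the local bowl
        except np.linalg.LinAlgError:
            dz = -g                          # fallback direction (a real implementation would line-search along it)
        return z + dz

    # Example: f(z) = (z1 - 1)^2 + 3 (z2 + 2)^2 is itself quadratic,
    # so a single Newton step from anywhere lands on the minimum (1, -2)
    grad = lambda z: np.array([2 * (z[0] - 1), 6 * (z[1] + 2)])
    hess = lambda z: np.array([[2.0, 0.0], [0.0, 6.0]])
    print(newton_step(grad, hess, np.array([10.0, 10.0])))   # -> [ 1. -2.]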
Counting Steps versus Clocking
• Newton’s method finds direction and step at once (Δz)
• No line search required
• But we need to evaluate the gradient and the Hessian at every step: m + m(m+1)/2 derivatives (the m gradient entries plus the distinct entries of the symmetric Hessian)!
• ...and solve an m × m linear system
• Asymptotic complexity depends on assumptions about arithmetic (exact, fixed-precision)
• Practical times are significant, and naive algorithms are O(m³)
• So both storage space and computation time become prohibitive as m grows
Bottom Line
• Newton’s method takes few steps to converge... but each step is expensive
• Advantageous when m is small
• For bigger problems, use steepest descent
• Compromises exist (e.g., conjugate gradients; see the Appendix in the notes)
• For very big m, even steepest descent is too expensive
• Stay tuned