Neural Networks: Optimization Part 1
Intro to Deep Learning, Fall 2020
Story so far
• Neural networks are universal approximators
  – Can model any odd thing
  – Provided they have the right architecture
• We must train them to approximate any function
  – Specify the architecture
  – Learn their weights and biases
• Networks are trained to minimize total "loss" on a training set
  – We do so through empirical risk minimization
• We use variants of gradient descent to do so
• The gradient of the error with respect to network parameters is computed through backpropagation
Recap: Gradient Descent Algorithm
• In order to minimize any function $f(x)$ w.r.t. $x$
• Initialize:
  – $x^0$
  – $k = 0$
• Do
  – $k = k + 1$
  – $x^{k} = x^{k-1} - \eta \nabla_x f(x^{k-1})^T$
• while $|f(x^{k}) - f(x^{k-1})| > \epsilon$
(A code sketch of this loop follows.)
Recap: Training Neural Nets by Gradient Descent
Total training error:
$$Loss = \frac{1}{T} \sum_t Div\left(Y_t, d_t; W_1, W_2, \ldots, W_K\right)$$
• Gradient descent algorithm:
• Initialize all weights $W_1, W_2, \ldots, W_K$
• Do:
  – For every layer $k = 1 \ldots K$, compute (using backprop):
    • $\nabla_{W_k} Loss = \frac{1}{T} \sum_t \nabla_{W_k} Div(Y_t, d_t)$
    • $W_k = W_k - \eta \left(\nabla_{W_k} Loss\right)^T$
• Until $Loss$ has converged (a code sketch of one such step follows)
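A minimal sketch of one batch gradient-descent step over a training set; `backprop_grads` is a hypothetical helper standing in for the per-sample backprop computation of $\nabla_{W_k} Div$:

```python
import numpy as np

def batch_gd_step(W, X, D, eta, backprop_grads):
    """One step of batch gradient descent.
    W: list of layer weight arrays; X, D: training inputs and targets;
    backprop_grads(W, x, d): assumed helper returning per-layer dDiv/dW_k."""
    T = len(X)
    # Average per-sample gradients: grad_k = (1/T) * sum_t dDiv(Y_t, d_t)/dW_k
    grads = [np.zeros_like(Wk) for Wk in W]
    for x, d in zip(X, D):
        for k, g in enumerate(backprop_grads(W, x, d)):
            grads[k] += g / T
    # W_k <- W_k - eta * grad_k
    return [Wk - eta * gk for Wk, gk in zip(W, grads)]
```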
Issues
• Convergence: How well does it learn
  – And how can we improve it
• How well will it generalize (outside training data)
• What does the output really mean?
• Etc.
Onward
• Does backprop always work?
• Convergence of gradient descent
  – Rates, restrictions
  – Hessians
  – Acceleration and Nesterov
  – Alternate approaches
• Modifying the approach: Stochastic gradients
• Speedup extensions: RMSprop, Adagrad
Does backprop do the right thing?
• Is backprop always right?
  – Assuming it actually finds the minimum of the divergence function?
Recap: The differentiable activation
• Threshold activation: Equivalent to counting errors
  – Shifting the threshold from T1 to T2 does not change classification error
  – Does not indicate if moving the threshold left was good or not
• Differentiable activation: Computes "distance to answer"
  – "Distance" == divergence
  – Perturbing the function changes this quantity
    • Even if the classification error itself doesn't change
[Figure: threshold activation vs. differentiable (sigmoid) activation, with thresholds T1 and T2 marked on the x-axis]
Does backprop do the right thing?
• Is backprop always right?
  – Assuming it actually finds the global minimum of the divergence function?
• In classification problems, the classification error is a non-differentiable function of weights
• The divergence function minimized is only a proxy for classification error
• Minimizing divergence may not minimize classification error
Backprop fails to separate where perceptron succeeds
• Brady, Raghavan, Slawny, '89
• Simple problem, 3 training instances, single neuron
• Perceptron training rule trivially finds a perfect solution
[Figure: a single neuron with inputs x and y and a bias input of 1]
Training points: (1,0), +1; (0,1), +1; (-1,0), -1
Backprop vs. Perceptron
• Backpropagation using the logistic function and $L_2$ divergence ($Div = (d - y)^2$)
• A unique minimum can trivially be shown to exist; backprop finds it
Training points: (1,0), +1; (0,1), +1; (-1,0), -1
Unique solution exists
• Let $u = f^{-1}(1 - \epsilon)$
  – E.g. $u = f^{-1}(0.99)$, representing a 99% confidence in the class
• From the three points we get three independent equations:
$$w_x \cdot 1 + w_y \cdot 0 + b = u$$
$$w_x \cdot 0 + w_y \cdot 1 + b = u$$
$$w_x \cdot (-1) + w_y \cdot 0 + b = -u$$
• A unique solution ($w_x = u$, $w_y = u$, $b = 0$) exists
  – It represents a unique line regardless of the value of $u$
Training points: (1,0), +1; (0,1), +1; (-1,0), -1
Backprop vs. Perceptron
• Now add a fourth point: (0, -t), +1
• $t$ is very large (the point is near $-\infty$)
• Perceptron trivially finds a solution (though it may take up to $t^2$ iterations)
Training points: (1,0), +1; (0,1), +1; (-1,0), -1; (0,-t), +1
Backprop
• Consider backprop: the contribution of the fourth point to the derivative of the $L_2$ error is
$$E_4 = \left(1 - \epsilon - f(-w_y t + b)\right)^2$$
$$\frac{dE_4}{dw_y} = 2\left(1 - \epsilon - f(-w_y t + b)\right) f'(-w_y t + b)\, t$$
$$\frac{dE_4}{db} = -2\left(1 - \epsilon - f(-w_y t + b)\right) f'(-w_y t + b)$$
Notation: $y = f(z)$ is the logistic activation; $1 - \epsilon$ is the actual achievable output value.
Training points: (1,0), +1; (0,1), +1; (-1,0), -1; (0,-t), +1
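A quick numerical check of these expressions (a sketch; the values $\epsilon = 0.01$, $w_y = 1$, and $b = 0$ are assumed examples):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_fourth_point(w_y, b, t, eps=0.01):
    """Gradient contribution of the point (0, -t) with achievable target
    1 - eps, following E4 = (1 - eps - f(-w_y*t + b))^2 from the slide."""
    z = -w_y * t + b
    f = logistic(z)
    fp = f * (1.0 - f)                  # derivative of the logistic
    resid = 1.0 - eps - f
    return 2 * resid * fp * t, -2 * resid * fp   # dE4/dw_y, dE4/db

# Near the 3-point optimum (w_y > 0), the contribution dies off as t grows:
for t in [1, 5, 10, 50]:
    print(t, grad_fourth_point(w_y=1.0, b=0.0, t=t))
# The f'(-w_y t + b) factor decays like e^{-w_y t}, overwhelming the linear t.
```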
Backprop
• Recall the fourth point's gradient contributions $\frac{dE_4}{dw_y}$ and $\frac{dE_4}{db}$ above
• For very large positive $t$, $w_y > 0$ (where $W = (w_x, w_y, b)$ is near the 3-point optimum)
• $1 - \epsilon - f(-w_y t + b) \to 1 - \epsilon$ as $t \to \infty$
• $f'(-w_y t + b) \to 0$ exponentially as $t \to \infty$
• Therefore, for very large positive $t$:
$$\frac{dE_4}{dw_y} = \frac{dE_4}{db} \approx 0$$
Backprop
• The fourth point at $(0, -t)$ does not change the gradient of the $L_2$ divergence near the optimal solution for the 3-point problem
• The optimum solution for 3 points is also a broad local minimum (zero gradient) for the 4-point problem!
  – It will be found by backprop nearly all the time
• Although the global minimum, with unbounded weights, would separate the classes correctly
Training points: (1,0), +1; (0,1), +1; (-1,0), -1; (0,-t), +1, with $t$ very large
Backprop
• Local optimum solution found by backprop
• It does not separate the points even though the points are linearly separable!
• Compare to the perceptron: backpropagation fails to separate where the perceptron succeeds
Training points: (1,0), +1; (0,1), +1; (-1,0), -1; (0,-t), +1
Backprop fails to separate where perceptron succeeds
• Brady, Raghavan, Slawny, '89
• Several linearly separable training examples
• Simple setup: both backprop and perceptron algorithms find solutions
A more complex problem
• Adding a "spoiler" (or a small number of spoilers)
  – Perceptron finds the linear separator
  – For bounded $|W|$, backprop does not find a separator
• A single additional training point does not change the loss function significantly
  – Assuming the weights are constrained to be bounded
So what is happening here?
• The perceptron may change greatly upon adding just a single new training instance
  – But it fits the training data well
  – The perceptron rule has low bias
    • Makes no errors if possible
  – But high variance
    • Swings wildly in response to small changes to the input
• Backprop is minimally changed by new training instances
  – It prefers consistency over perfection
  – It is a low-variance estimator, at the potential cost of bias
Backprop fails to separate even when possible
• This is not restricted to single perceptrons
• An MLP learns non-linear decision boundaries that are determined from the entirety of the training data
• Adding a few "spoilers" will not change its behavior
Backpropagation: Finding the separator
• Backpropagation will often not find a separating solution even though the solution is within the class of functions learnable by the network
• This is because the separating solution is not a feasible optimum for the loss function
• One resulting benefit: a backprop-trained neural network classifier has lower variance than an optimal classifier for the training data
Variance and Depth
• Dark figures show the desired decision boundary (2D)
  – 1000 training points, 660 hidden neurons
  – Network heavily overdesigned even for shallow nets
• Anecdotal: variance decreases with
  – Depth
  – Data
[Figure: boundaries learned by 3-, 4-, 6-, and 11-layer networks, trained on 1000 (top) and 10000 (bottom) training instances]
The Loss Surface
• The example (and statements) earlier assumed the loss objective had a single global optimum that could be found
  – The statement about variance assumes the global optimum is reached
• What about local optima?
The Loss Surface
• Popular hypothesis:
  – In large networks, saddle points are far more common than local minima
    • Frequency of occurrence exponential in network size
  – Most local minima are equivalent
    • And close to the global minimum
  – This is not true for small networks
• Saddle point: a point where
  – The slope is zero
  – The surface increases in some directions, but decreases in others
    • Some of the eigenvalues of the Hessian are positive; others are negative
  – Gradient descent algorithms often get "stuck" at saddle points
The Controversial Loss Surface
• Baldi and Hornik (1989), "Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima": an MLP with a single hidden layer has only saddle points and no local minima
• Dauphin et al. (2014), "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization": an exponential number of saddle points in large networks
• Choromanska et al. (2015), "The loss surface of multilayer networks": for large networks, most local minima lie in a band and are equivalent
  – Based on analysis of spin glass models
• Swirszcz et al. (2016), "Local minima in training of deep networks": in networks of finite size, trained on finite data, you can have horrible local minima
• Watch this space…
Story so far
• Neural nets can be trained via gradient descent that minimizes a loss function
• Backpropagation can be used to derive the derivatives of the loss
• Backprop is not guaranteed to find a "true" solution, even if it exists and lies within the capacity of the network to model
  – The optimum for the loss function may not be the "true" solution
• For large networks, the loss function may have a large number of unpleasant saddle points
  – Which backpropagation may find
Convergence
• In the discussion so far we have assumed the training arrives at a local minimum
• Does it always converge?
• How long does it take?
• Hard to analyze for an MLP, but we can look at the problem through the lens of convex optimization
A quick tour of (convex) optimization
Convex Loss Functions
• A surface is "convex" if it is continuously curving upward
  – We can connect any two points on or above the surface without intersecting it
  – Many mathematical definitions that are equivalent
• Caveat: the neural network loss surface is generally not convex
  – Streetlight effect
[Figure: contour plot of a convex function]
Convergence of gradient descent
• An iterative algorithm is said to converge to a solution if the value updates arrive at a fixed point
  – Where the gradient is 0 and further updates do not change the estimate
• The algorithm may not actually converge
  – It may jitter around the local minimum
  – It may even diverge
• Conditions for convergence?
[Figure: converging, jittering, and diverging update sequences]
Convergence and convergence rate
• Convergence rate: how fast the iterations arrive at the solution
• Generally quantified as
$$R = \frac{\left|f\left(x^{(k+1)}\right) - f(x^*)\right|}{\left|f\left(x^{(k)}\right) - f(x^*)\right|}$$
  – $x^{(k)}$ is the $k$-th iterate
  – $x^*$ is the optimal value of $x$
• If $R$ is a constant (or upper bounded), the convergence is linear
  – In reality, it arrives at the solution exponentially fast:
$$\left|f\left(x^{(k)}\right) - f(x^*)\right| \le R^k \left|f\left(x^{(0)}\right) - f(x^*)\right|$$
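To make the definition concrete, a small sketch (an assumed example, not from the slides) that measures the per-iteration error ratio $R$ for gradient descent on a 1-D quadratic:

```python
import numpy as np

# Empirically measuring the linear rate R for gradient descent on
# E(w) = 0.5*a*w^2 (optimum w* = 0, E* = 0).
a, eta = 2.0, 0.1
w = 1.0
E = lambda w: 0.5 * a * w * w
errors = []
for _ in range(10):
    w = w - eta * a * w              # gradient of E is a*w
    errors.append(E(w))
rates = [e1 / e0 for e0, e1 in zip(errors, errors[1:])]
print(rates)   # constant ~ (1 - eta*a)^2 = 0.64: linear (geometric) convergence
```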
Convergence for quadratic surfaces
• Gradient descent to find the optimum of a quadratic, starting from $w^{(0)}$
• Assuming fixed step size $\eta$
• What is the optimal step size $\eta$ to get there fastest?

Minimize $E = \frac{1}{2} a w^2 + b w + c$.

Gradient descent with fixed step size $\eta$ to estimate scalar parameter $w$:
$$w^{(k+1)} = w^{(k)} - \eta \frac{dE\left(w^{(k)}\right)}{dw}$$
Convergence for quadratic surfaces
• Any quadratic objective can be written as
$$E(w) = E\left(w^{(k)}\right) + E'\left(w^{(k)}\right)\left(w - w^{(k)}\right) + \frac{1}{2} E''\left(w^{(k)}\right)\left(w - w^{(k)}\right)^2$$
  – Taylor expansion
• Minimizing w.r.t. $w$, we get (Newton's method)
$$w_{min} = w^{(k)} - E''\left(w^{(k)}\right)^{-1} E'\left(w^{(k)}\right)$$
• Note: $\frac{dE\left(w^{(k)}\right)}{dw} = E'\left(w^{(k)}\right)$
• Comparing to the gradient descent rule $w^{(k+1)} = w^{(k)} - \eta \frac{dE\left(w^{(k)}\right)}{dw}$, we see that we can arrive at the optimum in a single step using the optimum step size
$$\eta_{opt} = E''\left(w^{(k)}\right)^{-1} = a^{-1}$$
for $E = \frac{1}{2} a w^2 + b w + c$.
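A short numerical confirmation of the single-step property (the coefficients are assumed example values):

```python
# E(w) = 0.5*a*w^2 + b*w + c, minimized at w* = -b/a.
a, b = 4.0, -8.0               # example coefficients (w* = 2)
grad = lambda w: a * w + b

w = 10.0                       # arbitrary start
w = w - (1.0 / a) * grad(w)    # one step with eta_opt = 1/a = 1/E''
print(w)                       # 2.0 exactly: the optimum in a single step
```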
With non-optimal step size
• For $\eta < \eta_{opt}$ the algorithm will converge monotonically
• For $\eta_{opt} < \eta < 2\eta_{opt}$ we have oscillating convergence
• For $\eta > 2\eta_{opt}$ we get divergence

Gradient descent with fixed step size $\eta$ to estimate scalar parameter $w$:
$$w^{(k+1)} = w^{(k)} - \eta \frac{dE\left(w^{(k)}\right)}{dw}$$
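The three regimes, demonstrated on an assumed 1-D quadratic $E(w) = \frac{1}{2} a w^2$ (a sketch, not from the slides):

```python
import numpy as np

a = 2.0
eta_opt = 1.0 / a

def run(eta, w=1.0, steps=5):
    traj = [w]
    for _ in range(steps):
        w = w - eta * a * w     # gradient of 0.5*a*w^2 is a*w
        traj.append(w)
    return np.round(traj, 3)

print(run(0.5 * eta_opt))   # monotonic convergence:   1, 0.5, 0.25, ...
print(run(1.5 * eta_opt))   # oscillating convergence: 1, -0.5, 0.25, ...
print(run(2.5 * eta_opt))   # divergence:              1, -1.5, 2.25, ...
```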
For generic differentiable convex objectives
• Any differentiable convex objective $E(w)$ can be approximated as
$$E \approx E\left(w^{(k)}\right) + \left(w - w^{(k)}\right)\frac{dE\left(w^{(k)}\right)}{dw} + \frac{1}{2}\left(w - w^{(k)}\right)^2 \frac{d^2E\left(w^{(k)}\right)}{dw^2} + \cdots$$
  – Taylor expansion
• Using the same logic as before, we get (Newton's method)
$$\eta_{opt} = \left(\frac{d^2E\left(w^{(k)}\right)}{dw^2}\right)^{-1}$$
• We can get divergence if $\eta \ge 2\eta_{opt}$
[Figure: the local quadratic approximation to $E(w)$ and its minimum $w_{min}$]
For functions of multivariate inputs
• Consider a simple quadratic convex (paraboloid) function
$$E = \frac{1}{2} w^T A w + w^T b + c$$
where $E = g(w)$ and $w$ is a vector, $w = [w_1, w_2, \ldots, w_D]$
  – Since $E^T = E$ ($E$ is a scalar), $A$ can always be made symmetric
• For convex $E$, $A$ is always positive definite, and has positive eigenvalues
• When $A$ is diagonal:
$$E = \frac{1}{2}\sum_i a_{ii} w_i^2 + \sum_i b_i w_i + c$$
  – The $w_i$s are uncoupled
  – For convex (paraboloid) $E$, the $a_{ii}$ values are all positive
  – Just a sum of $D$ independent quadratic functions
Multivariate Quadratic with Diagonal $A$
$$E = \frac{1}{2} w^T A w + w^T b + c = \frac{1}{2}\sum_i a_{ii} w_i^2 + \sum_i b_i w_i + c$$
• Equal-value contours will be ellipses with principal axes parallel to the spatial axes
Multivariate Quadratic with Diagonal $A$
$$E = \frac{1}{2} w^T A w + w^T b + c = \frac{1}{2}\sum_i a_{ii} w_i^2 + \sum_i b_i w_i + c$$
• Each coordinate contributes its own quadratic, plus terms constant in $w_i$:
$$E = \frac{1}{2} a_{ii} w_i^2 + b_i w_i + c + C(w_{\neg i})$$
• Equal-value contours will be parallel to the axes
  – All "slices" parallel to an axis are shifted versions of one another
"Descents" are uncoupled
$$E = \frac{1}{2} a_{11} w_1^2 + b_1 w_1 + c + C(w_{\neg 1}) \qquad E = \frac{1}{2} a_{22} w_2^2 + b_2 w_2 + c + C(w_{\neg 2})$$
$$\eta_{1,opt} = a_{11}^{-1} \qquad \eta_{2,opt} = a_{22}^{-1}$$
• The optimum of each coordinate is not affected by the other coordinates
  – I.e. we could optimize each coordinate independently
• Note: the optimal learning rate differs across coordinates
Vector update rule
• Conventional vector update rule for gradient descent: update the entire vector against the direction of the gradient
  – Note: the gradient is perpendicular to the equal-value contour
  – The same learning rate is applied to all components
$$w^{(k+1)} \leftarrow w^{(k)} - \eta \nabla_w E^T \qquad w_i^{(k+1)} = w_i^{(k)} - \eta \frac{dE\left(w_i^{(k)}\right)}{dw_i}$$
Problem with vector update rule
• The learning rate must be lower than twice the smallest optimal learning rate for any component
$$\eta < 2\min_i \eta_{i,opt}$$
  – Otherwise the learning will diverge
• This, however, makes the learning very slow
  – And it will oscillate in all directions where $\eta_{i,opt} \le \eta < 2\eta_{i,opt}$
$$w^{(k+1)} \leftarrow w^{(k)} - \eta \nabla_w E^T \qquad w_i^{(k+1)} = w_i^{(k)} - \eta \frac{dE\left(w_i^{(k)}\right)}{dw_i}$$
$$\eta_{i,opt} = \left(\frac{d^2 E\left(w_i^{(k)}\right)}{dw_i^2}\right)^{-1} = a_{ii}^{-1}$$
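A small sketch of this constraint on an assumed diagonal quadratic with mismatched curvatures (not from the slides):

```python
import numpy as np

# E = 0.5*(a1*w1^2 + a2*w2^2), so eta_{1,opt} = 1/a1 and eta_{2,opt} = 1/a2.
a = np.array([1.0, 10.0])          # a11 = 1, a22 = 10
eta_opts = 1.0 / a                 # [1.0, 0.1]

def run(eta, steps=50):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - eta * a * w        # per-component gradient is a_i * w_i
    return w

# Safe but slow: eta must stay below 2*min_i eta_{i,opt} = 0.2
print(run(0.15))   # converges, but the flat direction w1 shrinks slowly
print(run(0.25))   # diverges along w2, the steep direction
```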
Dependence on learning rate
• $\eta_{1,opt} = 1$; $\eta_{2,opt} = 0.33$
[Figure panels: trajectories for $\eta = 2.1\eta_{2,opt}$, $\eta = 2\eta_{2,opt}$, $\eta = 1.5\eta_{2,opt}$, $\eta = \eta_{2,opt}$, and $\eta = 0.75\eta_{2,opt}$]
Dependence on learning rate
• $\eta_{1,opt} = 1$; $\eta_{2,opt} = 0.91$; $\eta = 1.9\,\eta_{2,opt}$
Convergence
• Convergence behaviors become increasingly unpredictable as dimensions increase
• For the fastest convergence, ideally, the learning rate $\eta$ must be close to both the largest $\eta_{i,opt}$ and the smallest $\eta_{i,opt}$
  – To ensure convergence in every direction
  – Generally infeasible
• Convergence is particularly slow if $\frac{\max_i \eta_{i,opt}}{\min_i \eta_{i,opt}}$ is large
  – I.e., the problem is ill-conditioned (the "condition number" is large)
Comments on the quadratic
• Why are we talking about quadratics?
  – Quadratic functions form some kind of benchmark
  – Convergence of gradient descent on them is linear
    • Meaning it converges to the solution exponentially fast
• The convergence for other kinds of functions can be viewed against this benchmark
• Actual losses will not be quadratic, but may locally have other structure
  – Locally, between the current location and the nearest local minimum
• Some examples in the following slides…
  – Strong convexity
  – Lipschitz continuity
  – Lipschitz smoothness
  – …and how they affect convergence of gradient descent
Quadratic convexity
• A quadratic function has the form $\frac{1}{2} w^T A w + w^T b + c$
  – Every "slice" is a quadratic bowl
• In some sense, the "standard" for gradient-descent based optimization
  – Other convex functions will be steeper in some regions, but flatter in others
• Gradient descent solutions will have linear convergence
  – Take $O(\log 1/\epsilon)$ steps to get within $\epsilon$ of the optimal solution
Strong convexity
• A strongly convex function is at least quadratic in its convexity
  – It has a lower bound on its second derivative
• The function sits within a quadratic bowl
  – At any location, you can draw a quadratic bowl of fixed convexity (quadratic constant equal to the lower bound on the 2nd derivative) touching the function at that point, which contains it
• Convergence of gradient descent algorithms is at least as good as that of the enclosing quadratic
Types of continuity
• Most functions are not strongly convex (if they are convex at all)
• Instead we will talk in terms of Lipschitz smoothness
• But first, a definition
• Lipschitz continuous: the function always lies outside a cone
  – The slope of the outer surface is the Lipschitz constant $K$
  – $|f(x) - f(y)| \le K|x - y|$
[Figure: Lipschitz cone, from Wikipedia]
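A rough numerical sketch of the definition (an added example, not from the slides), estimating $K$ for $f = \sin$ by sampling pairs of points:

```python
import numpy as np

# Estimate K such that |f(x) - f(y)| <= K|x - y|.
# For f = sin, the true Lipschitz constant is K = 1 = max|cos|.
rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, 100000)
y = rng.uniform(-10, 10, 100000)
mask = np.abs(x - y) > 1e-12          # avoid 0/0 on (nearly) coincident samples
K_est = np.max(np.abs(np.sin(x) - np.sin(y))[mask] / np.abs(x - y)[mask])
print(K_est)                           # close to, and never above, 1
```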
Lipschitz smoothness
• Lipschitz smooth: the function's derivative is Lipschitz continuous
  – The function need not be convex (or even twice differentiable)
  – It has an upper bound on its second derivative (if that derivative exists)
• You can always place a quadratic bowl of a fixed curvature within the function
  – The minimum curvature of the quadratic must be ≥ the upper bound on the second derivative of the function (if it exists)
Types of smoothness
• A function can be both strongly convex and Lipschitz smooth
  – Its second derivative has upper and lower bounds
  – Convergence depends on the curvature of strong convexity (at least linear)
• A function can be convex and Lipschitz smooth, but not strongly convex
  – Convex, with an upper bound on the second derivative
  – Weaker convergence guarantees, if any (at best linear)
  – This is often a reasonable assumption for the local structure of your loss function
Convergence Problems
• For quadratic (strongly) convex functions, gradient descent is exponentially fast
  – Linear convergence
    • Assuming the learning rate is non-divergent
• For generic (Lipschitz smooth) convex functions, however, it is very slow:
$$f\left(w^{(k)}\right) - f(w^*) \propto \frac{1}{k}\left(f\left(w^{(0)}\right) - f(w^*)\right)$$
  – And inversely proportional to the learning rate:
$$f\left(w^{(k)}\right) - f(w^*) \le \frac{1}{2\eta k}\left\|w^{(0)} - w^*\right\|^2$$
  – Takes $O(1/\epsilon)$ iterations to get to within $\epsilon$ of the solution
  – An inappropriate learning rate will destroy your happiness
• Second order methods will locally convert the loss function to quadratic
  – Convergence behavior will still depend on the nature of the original function
• Continuing with the quadratic-based explanation…
One reason for the problem
• The objective function has different eccentricities in different directions
  – Resulting in different optimal learning rates for different directions
  – The problem is more difficult when the ellipsoid is not axis-aligned: the steps along the two directions are coupled! Moving in one direction changes the gradient along the other
• Solution: normalize the objective to have identical eccentricity in all directions
  – Then all directions will have identical optimal learning rates
  – Easier to find a working learning rate
Solution: Scale the axes
• Scale (and rotate) the axes, such that all of them have identical (identity) "spread"
  – Equal-value contours are circular
  – Movement along the coordinate axes becomes independent
• Note: the equation of a quadratic surface with circular equal-value contours can be written as
$$E = \frac{1}{2}\hat{w}^T \hat{w} + \hat{b}^T \hat{w} + c$$
• Scaling: $\hat{w}_1 = s_1 w_1$, $\hat{w}_2 = s_2 w_2$, i.e. $\hat{w} = Sw$ with
$$w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}, \quad S = \begin{bmatrix} s_1 & 0 \\ 0 & s_2 \end{bmatrix}$$
Scaling the axes
• Original equation:
$$E = \frac{1}{2} w^T A w + b^T w + c$$
• We want to find a (diagonal) scaling matrix $S$ such that
$$S = \begin{bmatrix} s_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & s_D \end{bmatrix}, \quad \hat{w} = S w$$
• And
$$E = \frac{1}{2}\hat{w}^T \hat{w} + \hat{b}^T \hat{w} + c$$
• By inspection: $S = A^{0.5}$
Scaling the axes
• We have
$$E = \frac{1}{2} w^T A w + b^T w + c, \quad \hat{w} = S w$$
$$E = \frac{1}{2}\hat{w}^T \hat{w} + \hat{b}^T \hat{w} + c = \frac{1}{2} w^T S^T S w + \hat{b}^T S w + c$$
• Equating linear and quadratic coefficients, we get
$$S^T S = A, \quad \hat{b}^T S = b^T$$
• Solving: $S = A^{0.5}$, $\hat{b} = A^{-0.5} b$, and hence $\hat{w} = A^{0.5} w$
The Inverse Square Root of A
• For any positive definite $A$, we can write
$$A = U \Lambda U^T$$
  – Eigen decomposition
  – $U$ is an orthogonal matrix
  – $\Lambda$ is a diagonal matrix of positive diagonal entries
• Define $A^{0.5} = U \Lambda^{0.5} U^T$
  – Check: $\left(A^{0.5}\right)^T A^{0.5} = U \Lambda U^T = A$
• Define $A^{-0.5} = U \Lambda^{-0.5} U^T$
  – Check: $\left(A^{-0.5}\right)^T A^{-0.5} = U \Lambda^{-1} U^T = A^{-1}$
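This construction translates directly to NumPy; a minimal sketch for an assumed example positive definite $A$:

```python
import numpy as np

# Matrix square root / inverse square root via eigendecomposition,
# for a symmetric positive definite A.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
lam, U = np.linalg.eigh(A)            # A = U diag(lam) U^T, with lam > 0

A_half     = U @ np.diag(lam ** 0.5)  @ U.T   # A^{0.5}
A_neg_half = U @ np.diag(lam ** -0.5) @ U.T   # A^{-0.5}

print(np.allclose(A_half @ A_half, A))                         # True
print(np.allclose(A_neg_half @ A_neg_half, np.linalg.inv(A)))  # True
```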
Returning to our problem
$$E = \frac{1}{2}\hat{w}^T \hat{w} + \hat{b}^T \hat{w} + c$$
• Computing the gradient, and noting that $A^{0.5}$ is symmetric, we can relate $\nabla_{\hat{w}} E$ and $\nabla_w E$:
$$\nabla_{\hat{w}} E = \hat{w}^T + \hat{b}^T = w^T A^{0.5} + b^T A^{-0.5} = \left(w^T A + b^T\right) A^{-0.5} = \nabla_w E \cdot A^{-0.5}$$
Returning to our problem
$$E = \frac{1}{2}\hat{w}^T \hat{w} + \hat{b}^T \hat{w} + c$$
• Gradient descent rule:
$$\hat{w}^{(k+1)} = \hat{w}^{(k)} - \eta \nabla_{\hat{w}} E\left(\hat{w}^{(k)}\right)^T$$
  – The learning rate is now independent of direction
• Using $\hat{w} = A^{0.5} w$ and $\nabla_{\hat{w}} E(\hat{w})^T = A^{-0.5}\nabla_w E(w)^T$:
$$w^{(k+1)} = w^{(k)} - \eta A^{-1} \nabla_w E\left(w^{(k)}\right)^T$$
Modified update rule
• The gradient descent rule in the scaled space
$$\hat{w}^{(k+1)} = \hat{w}^{(k)} - \eta \nabla_{\hat{w}} E\left(\hat{w}^{(k)}\right)^T$$
• leads to the modified gradient descent rule in the original space:
$$w^{(k+1)} = w^{(k)} - \eta A^{-1} \nabla_w E\left(w^{(k)}\right)^T$$
where $\hat{w} = A^{0.5} w$ and $E = \frac{1}{2} w^T A w + b^T w + c = \frac{1}{2}\hat{w}^T \hat{w} + \hat{b}^T \hat{w} + c$.
For non-axis-aligned quadratics…
$$E = \frac{1}{2} w^T A w + w^T b + c = \frac{1}{2}\sum_i a_{ii} w_i^2 + \sum_i \sum_{j > i} a_{ij} w_i w_j + \sum_i b_i w_i + c$$
• If $A$ is not diagonal, the contours are not axis-aligned
  – Because of the cross-terms $a_{ij} w_i w_j$
  – The major axes of the ellipsoids are the eigenvectors of $A$, and their diameters are determined by the eigenvalues of $A$
• But this does not affect the discussion
  – This is merely a rotation of the space from the axis-aligned case
  – The component-wise optimal learning rates along the major and minor axes of the equal-value-contour ellipsoids will be different, causing problems
    • The optimal rates along the axes are inversely proportional to the eigenvalues of $A$
For non-axis-aligned quadratics…
• The component-wise optimal learning rates along the major and minor axes of the contour ellipsoids will differ, causing problems
  – They are inversely proportional to the eigenvalues of $A$
• This can be fixed as before by rotating and resizing the different directions to obtain the same normalized update rule as before:
$$w^{(k+1)} = w^{(k)} - \eta A^{-1} \nabla_w E^T$$
Generic differentiable multivariate convex functions
• Taylor expansion:
$$E(w) \approx E\left(w^{(k)}\right) + \nabla_w E\left(w^{(k)}\right)\left(w - w^{(k)}\right) + \frac{1}{2}\left(w - w^{(k)}\right)^T H_E\left(w^{(k)}\right)\left(w - w^{(k)}\right) + \cdots$$
where $H_E$ is the Hessian of $E$.
Generic differentiable multivariate convex functions
• Note that the Taylor expansion above has the quadratic form $\frac{1}{2} w^T A w + w^T b + c$
• Using the same logic as before, we get the normalized update rule
$$w^{(k+1)} = w^{(k)} - \eta H_E\left(w^{(k)}\right)^{-1} \nabla_w E\left(w^{(k)}\right)^T$$
• For a quadratic function, the optimal $\eta$ is 1 (which is exactly Newton's method)
  – And it should not be greater than 2!
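A compact sketch of this normalized (Newton) update on an assumed smooth convex function, with the gradient and Hessian computed in closed form:

```python
import numpy as np

# Newton's method (eta = 1) on f(w) = log(e^{w1} + e^{w2}) + 0.5*||w||^2.
def grad_hess(w):
    p = np.exp(w) / np.sum(np.exp(w))             # softmax of the coordinates
    g = p + w                                     # gradient of log-sum-exp + 0.5||w||^2
    H = np.diag(p) - np.outer(p, p) + np.eye(2)   # Hessian: PSD term + I (positive definite)
    return g, H

w = np.array([3.0, -2.0])
for _ in range(6):
    g, H = grad_hess(w)
    w = w - np.linalg.solve(H, g)   # w <- w - H^{-1} grad: fast local convergence
    print(w, np.linalg.norm(g))     # gradient norm shrinks rapidly
```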
Minimization by Newton's method ($\eta = 1$)
• Iterated localized optimization with quadratic approximations:
$$w^{(k+1)} = w^{(k)} - \eta H_E\left(w^{(k)}\right)^{-1} \nabla_w E\left(w^{(k)}\right)^T, \quad \eta = 1$$
• Fit a quadratic at each point and find the minimum of that quadratic
[Figure: successive local quadratic fits homing in on the minimum]
Issues: 1. The Hessian
• Normalized update rule:
$$w^{(k+1)} = w^{(k)} - \eta H_E\left(w^{(k)}\right)^{-1} \nabla_w E\left(w^{(k)}\right)^T$$
• For complex models such as neural networks, with a very large number of parameters, the Hessian $H_E\left(w^{(k)}\right)$ is extremely difficult to compute
  – For a network with only 100,000 parameters, the Hessian will have $10^{10}$ cross-derivative terms
  – And it is even harder to invert, since it will be enormous
Issues: 1. The Hessian
• For non-convex functions, the Hessian may not be positive semi-definite, in which case the algorithm can diverge
  – It goes away from, rather than towards, the minimum
  – It now requires additional checks to avoid movement in directions corresponding to negative eigenvalues of the Hessian
Issues: 1 – contd.
• A great many approaches have been proposed in the literature to approximate the Hessian in a number of ways and improve its positive definiteness
  – Broyden-Fletcher-Goldfarb-Shanno (BFGS)
    • And "limited-memory" BFGS (L-BFGS)
    • Estimate the Hessian from finite differences
  – Levenberg-Marquardt
    • Estimate the Hessian from Jacobians
    • Diagonally load it to ensure positive definiteness
  – Other "quasi-Newton" methods
• Hessian estimates may even be local to a set of variables
• Not particularly popular anymore for large neural networks…
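For reference, a minimal usage sketch of one such quasi-Newton method via SciPy's L-BFGS-B, which builds a low-memory Hessian approximation from gradient history (the objective here is an arbitrary example):

```python
import numpy as np
from scipy.optimize import minimize

def f(w):
    return np.sum((w - 3.0) ** 2) + np.sum(w ** 4)

def grad_f(w):
    return 2.0 * (w - 3.0) + 4.0 * w ** 3

result = minimize(f, x0=np.zeros(5), jac=grad_f, method="L-BFGS-B")
print(result.x, result.fun)
```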
Issues: 2. The learning rate
• Much of the analysis we just saw was based on trying to ensure that the step size was not so large as to cause divergence within a convex region
– $\eta < 2\eta_{opt}$
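As a quick sanity check of this condition, consider a one-dimensional quadratic (a worked example added here for illustration):

$$E(w) = \tfrac{c}{2}w^2 \;\Rightarrow\; w^{(k+1)} = w^{(k)} - \eta\,c\,w^{(k)} = (1 - \eta c)\,w^{(k)}$$

The optimal step is $\eta_{opt} = 1/c$, which reaches the minimum in a single step; the iterates grow in magnitude exactly when $|1 - \eta c| > 1$, i.e. when $\eta > 2/c = 2\eta_{opt}$.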
Issues: 2. The learning rate
• For complex models such as neural networks the loss function is often not convex
– Having $\eta > 2\eta_{opt}$ can actually help escape local optima
• However, always having $\eta > 2\eta_{opt}$ will ensure that you never actually converge to a solution
Decaying learning rate
• Start with a large learning rate
– Greater than 2 (assuming Hessian normalization)
– Gradually reduce it with iterations
• Note: this is actually a reduced step size
Decaying learning rate
• Typical decay schedules
– Linear decay: $\eta_k = \dfrac{\eta_0}{k+1}$
– Quadratic decay: $\eta_k = \dfrac{\eta_0}{(k+1)^2}$
– Exponential decay: $\eta_k = \eta_0 e^{-\beta k}$, where $\beta > 0$
• A common approach (for nnets):
1. Train with a fixed learning rate $\eta$ until the loss (or performance on a held-out data set) stagnates
2. $\eta \leftarrow \alpha\eta$, where $\alpha < 1$ (typically 0.1)
3. Return to step 1 and continue training from where we left off
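A small Python sketch of the three schedules and the stagnation-based step decay described above; the stagnation test here is a simplified stand-in, not a criterion prescribed by the slides.

```python
import math

def linear_decay(eta0, k):
    """eta_k = eta_0 / (k + 1)"""
    return eta0 / (k + 1)

def quadratic_decay(eta0, k):
    """eta_k = eta_0 / (k + 1)^2"""
    return eta0 / (k + 1) ** 2

def exponential_decay(eta0, k, beta=0.1):
    """eta_k = eta_0 * exp(-beta * k), with beta > 0"""
    return eta0 * math.exp(-beta * k)

def step_decay(eta, val_losses, alpha=0.1, patience=3, tol=1e-4):
    """Multiply eta by alpha once the held-out loss stagnates.

    'Stagnates' here means: no improvement better than tol over the
    last `patience` epochs, relative to the best loss seen before.
    """
    if len(val_losses) > patience:
        best_before = min(val_losses[:-patience])
        if best_before - min(val_losses[-patience:]) < tol:
            return eta * alpha
    return eta

# e.g. the first five exponentially decayed rates from eta_0 = 1.0:
print([round(exponential_decay(1.0, k), 3) for k in range(5)])
```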
Story so far: Convergence
• Gradient descent can miss obvious answers
– And this may be a good thing
• Convergence issues abound
– The loss surface has many saddle points
• Although, perhaps, not so many bad local minima
• Gradient descent can stagnate on saddle points
– Vanilla gradient descent may not converge, or may converge too slowly
• The optimal learning rate for one component may be too high or too low for others
Story so far: Second-order methods
• Second-order methods “normalize” the variation along the components to mitigate the problem of different optimal learning rates for different components
– But this requires computation of the inverses of second-order derivative matrices
– Computationally infeasible for large networks
– Not stable in non-convex regions of the loss surface
– Approximate methods address these issues, but simpler solutions may be better
Story so far: Learning rate
• Divergence-causing learning rates may not be a bad thing
– Particularly for ugly loss functions
• Decaying learning rates provide a good compromise between escaping poor local minima and convergence
• Many of the convergence issues arise because we force the same learning rate on all parameters
Let’s take a step back
• Problems arise because of requiring a fixed step size across all dimensions
– Because steps are “tied” to the gradient
• Let’s try relaxing this requirement
$$W^{(k+1)} \leftarrow W^{(k)} - \eta \left(\nabla_W E\right)^{\top}$$
– i.e., for each component: $w_i^{(k+1)} = w_i^{(k)} - \eta \, \dfrac{\partial E}{\partial w_i}$
[Figure: contour plot of $E(W)$ showing the step from $W^{(k)}$ to $W^{(k+1)}$]
Derivative-inspired algorithms
• Algorithms that use derivative information for trends, but do not follow it absolutely
• Rprop
• Quickprop
RProp
• Resilient propagation
• A simple algorithm, to be followed independently for each component
– I.e., steps in different directions are not coupled
• At each time step:
– If the derivative at the current location recommends continuing in the same direction as before (i.e., it has not changed sign from earlier):
• increase the step, and continue in the same direction
– If the derivative has changed sign (i.e., we’ve overshot a minimum):
• reduce the step and reverse direction
Rprop
• Select an initial value $w_0$ and compute the derivative
– Take an initial step $\Delta w$ against the derivative, in the direction that reduces the function:
• $\Delta w = \mathrm{sign}\!\left(E'(w_0)\right)\,\Delta w$
• $w = w - \Delta w$
[Figure: plot of $E(w)$ with initial point $w_0$ and step $\Delta w_0$; the orange arrow shows the direction of the derivative, i.e. the direction of increasing $E(w)$]
Rprop
• Compute the derivative at the new location
– If the derivative has not changed sign from the previous location, increase the step size and take a longer step:
• $\Delta w = \alpha \Delta w$, with $\alpha > 1$
• $w = w - \Delta w$
[Figure: plot of $E(w)$ with points $w_0$, $w_1$ and steps $\Delta w_0$, $\alpha\Delta w_0$; the orange arrow shows the direction of the derivative, i.e. the direction of increasing $E(w)$]
Rprop
• Compute the derivative at the new location
– If the derivative has changed sign (we have overshot):
– Return to the previous location
• $w = w + \Delta w$
– Shrink the step
• $\Delta w = \beta \Delta w$, with $\beta < 1$
– Take the smaller step forward
• $w = w - \Delta w$
[Figure: plot of $E(w)$ with points $w_0$, $w_1$, $w_2$ and steps $\Delta w_0$, $\alpha\Delta w_0$, $\alpha^2\beta\Delta w_0$; the orange arrow shows the direction of the derivative, i.e. the direction of increasing $E(w)$]
Rprop (simplified)
• Set $\alpha = 1.2$, $\beta = 0.5$
• For each layer $l$, for each $i, j$:
– Initialize $w_{l,i,j}$ and $\Delta w_{l,i,j} > 0$
– $prevD(l,i,j) = \dfrac{d\,Loss(w_{l,i,j})}{d\,w_{l,i,j}}$ (obtained via backprop)
– $\Delta w_{l,i,j} = \mathrm{sign}\!\left(prevD(l,i,j)\right)\,\Delta w_{l,i,j}$
– While not converged:
• $w_{l,i,j} = w_{l,i,j} - \Delta w_{l,i,j}$
• $D(l,i,j) = \dfrac{d\,Loss(w_{l,i,j})}{d\,w_{l,i,j}}$
• If $\mathrm{sign}\!\left(prevD(l,i,j)\right) == \mathrm{sign}\!\left(D(l,i,j)\right)$:
– $\Delta w_{l,i,j} = \min\!\left(\alpha\,\Delta w_{l,i,j},\, \Delta_{max}\right)$
– $prevD(l,i,j) = D(l,i,j)$
• else:
– $w_{l,i,j} = w_{l,i,j} + \Delta w_{l,i,j}$
– $\Delta w_{l,i,j} = \max\!\left(\beta\,\Delta w_{l,i,j},\, \Delta_{min}\right)$
• The min/max place a ceiling and floor on the step
• Note: different parameters are updated independently
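A minimal, self-contained Python sketch of this simplified loop, applied elementwise to a parameter vector; `grad_fn` (which would be backprop in a network) and the toy loss are stand-ins for illustration.

```python
import numpy as np

def rprop(grad_fn, w, n_steps=100, alpha=1.2, beta=0.5,
          delta0=0.1, delta_min=1e-6, delta_max=50.0):
    """Simplified Rprop on a vector of parameters w.

    Each component keeps its own step size; the step is decoupled
    from the gradient magnitude and only uses its sign.
    """
    w = np.asarray(w, dtype=float).copy()
    delta = np.full_like(w, delta0)     # per-parameter step sizes
    prev_g = grad_fn(w)                 # e.g. obtained via backprop
    step = np.sign(prev_g) * delta
    for _ in range(n_steps):
        w -= step
        g = grad_fn(w)
        same = np.sign(g) == np.sign(prev_g)
        # Same sign: grow the step (up to a ceiling) and keep going
        delta[same] = np.minimum(alpha * delta[same], delta_max)
        # Sign flipped: undo the step and shrink (down to a floor)
        w[~same] += step[~same]
        delta[~same] = np.maximum(beta * delta[~same], delta_min)
        prev_g = np.where(same, g, prev_g)
        step = np.sign(prev_g) * delta
    return w

# Toy usage: minimize E(w) = sum((w - 3)^2)
grad = lambda w: 2.0 * (w - 3.0)
print(rprop(grad, np.zeros(3)))   # approaches [3, 3, 3]
```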
RProp
• A remarkably simple first-order algorithm that is frequently much more efficient than gradient descent
– And can even be competitive with some of the more advanced second-order methods
• Makes only minimal assumptions about the loss function
– No convexity assumption
QuickProp
• Quickprop employs the Newton update
$$w^{(k+1)} = w^{(k)} - \eta \, H_E\!\left(w^{(k)}\right)^{-1} \nabla_w E\!\left(w^{(k)}\right)^{\top}$$
• But with two modifications
QuickProp: Modification 1
• It treats each dimension independently
• For $i = 1:D$:
$$w_i^{(k+1)} = w_i^{(k)} - E''\!\left(w_i^{(k)} \,\big|\, w_j^{(k)},\, j \neq i\right)^{-1} E'\!\left(w_i^{(k)} \,\big|\, w_j^{(k)},\, j \neq i\right)$$
• This eliminates the need to compute and invert expensive Hessians
[Figure: quadratic approximation of $E(w)$ within each component, showing the step from $w^{(k)}$ to $w^{(k+1)}$]
QuickProp: Modification 2
• It approximates the second derivative through finite differences
• For $i = 1:D$:
$$w_i^{(k+1)} = w_i^{(k)} - \widetilde{E''}\!\left(w_i^{(k)}, w_i^{(k-1)}\right)^{-1} E'\!\left(w_i^{(k)} \,\big|\, w_j^{(k)},\, j \neq i\right)$$
– where $\widetilde{E''}$ is a finite-difference estimate of the second derivative
• This eliminates the need to compute expensive double derivatives
[Figure: quadratic approximation of $E(w)$ within each component, showing the step from $w^{(k)}$ to $w^{(k+1)}$]
QuickProp
• Updates are independent for every parameter
• For every layer $l$, for every connection from node $i$ in the $(l-1)$th layer to node $j$ in the $l$th layer:
$$w^{(k+1)} = w^{(k)} - \left(\frac{E'\!\left(w^{(k)}\right) - E'\!\left(w^{(k-1)}\right)}{\Delta w^{(k-1)}}\right)^{-1} E'\!\left(w^{(k)}\right)$$
– The bracketed term is a finite-difference approximation to the double derivative, obtained by assuming a quadratic $E(\cdot)$; $\Delta w^{(k-1)}$ denotes the previous step
• Component-wise:
$$w_{l,ij}^{(k+1)} = w_{l,ij}^{(k)} - \Delta w_{l,ij}^{(k)}, \qquad \Delta w_{l,ij}^{(k)} = \frac{\Delta w_{l,ij}^{(k-1)}}{Err'\!\left(w_{l,ij}^{(k)}\right) - Err'\!\left(w_{l,ij}^{(k-1)}\right)} \, Err'\!\left(w_{l,ij}^{(k)}\right)$$
– The derivatives $Err'(\cdot)$ are computed using backprop
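A minimal per-parameter sketch of this secant-style update, under the stated quadratic assumption; `grad_fn` is a hypothetical stand-in for the backprop-computed derivative, the bootstrap gradient step and the epsilon guard on the denominator are additions of this sketch, and the sign convention is handled by tracking the signed previous step.

```python
import numpy as np

def quickprop(grad_fn, w, n_steps=50, eta0=0.01, eps=1e-12):
    """Per-parameter Quickprop-style secant updates (a sketch).

    Each weight is updated independently with a finite-difference
    (secant) approximation to its second derivative, built from the
    current gradient, the previous gradient, and the previous step.
    """
    w = np.asarray(w, dtype=float).copy()
    g_prev = grad_fn(w)
    dw_prev = -eta0 * g_prev         # bootstrap with a plain gradient step
    w += dw_prev
    for _ in range(n_steps):
        g = grad_fn(w)               # in a network: obtained via backprop
        denom = g_prev - g
        denom = np.where(np.abs(denom) < eps, eps, denom)  # guard division
        dw = dw_prev * g / denom     # secant Newton step, per parameter
        w += dw
        g_prev, dw_prev = g, dw
    return w

# Toy usage: E(w) = sum((w - 2)^2) is quadratic,
# so the secant step lands on the minimum almost immediately.
grad = lambda w: 2.0 * (w - 2.0)
print(quickprop(grad, np.zeros(3)))
```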
Quickprop
• Employs Newton updates with empirically derived derivatives
• Prone to some instability for non-convex objective functions
• But is still one of the fastest training algorithms for many problems
Story so far: Convergence
• Gradient descent can miss obvious answers
– And this may be a good thing
• Vanilla gradient descent may be too slow or unstable due to the differences between the dimensions
• Second order methods can normalize the variation across dimensions, but are complex
• Adaptive or decaying learning rates can improve convergence
• Methods that decouple the dimensions can improve convergence
A closer look at the convergence problem
• With dimension-independent learning rates, the solution will converge smoothly in some directions, but oscillate or diverge in others
• Proposal:
– Keep track of oscillations
– Emphasize steps in directions that converge smoothly
– Shrink steps in directions that bounce around
The momentum methods
• Maintain a running average of all past steps
– In directions in which the convergence is smooth, the average will have a large value
– In directions in which the estimate swings, the positive and negative swings will cancel out in the average
• Update with the running average, rather than the current gradient
Momentum Update
• The momentum method maintains a running average of all gradients until the current step
$$\Delta W^{(k)} = \beta \Delta W^{(k-1)} - \eta \nabla_W Loss\!\left(W^{(k-1)}\right)^{\top}$$
$$W^{(k)} = W^{(k-1)} + \Delta W^{(k)}$$
– Typical $\beta$ value is 0.9
• The running average steps
– Get longer in directions where the gradient retains the same sign
– Become shorter in directions where the sign keeps flipping
[Figure: trajectories comparing the plain gradient update with the momentum update]
Training by gradient descent
• Initialize all weights $W_1, W_2, \ldots, W_K$
• Do:
– For all $i, j, k$, initialize $\nabla_{W_k}Loss = 0$
– For all $t = 1:T$
• For every layer $k$:
– Compute $\nabla_{W_k}Div(Y_t, d_t)$
– $\nabla_{W_k}Loss \mathrel{+}= \frac{1}{T}\nabla_{W_k}Div(Y_t, d_t)$
– For every layer $k$:
$$W_k = W_k - \eta\left(\nabla_{W_k}Loss\right)^{\top}$$
• Until $Loss$ has converged
Training with momentum
• Initialize all weights $W_1, W_2, \ldots, W_K$
• Do:
– For all layers $k$, initialize $\nabla_{W_k}Loss = 0$, $\Delta W_k = 0$
– For all $t = 1:T$
• For every layer $k$:
– Compute gradient $\nabla_{W_k}Div(Y_t, d_t)$
– $\nabla_{W_k}Loss \mathrel{+}= \frac{1}{T}\nabla_{W_k}Div(Y_t, d_t)$
– For every layer $k$:
$$\Delta W_k = \beta \Delta W_k - \eta\left(\nabla_{W_k}Loss\right)^{\top}$$
$$W_k = W_k + \Delta W_k$$
• Until $Loss$ has converged
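A compact numpy sketch of this loop for a generic parameter vector (all layers flattened together for brevity); `grad_fn`, standing in for the backprop-computed full-batch average gradient, is an assumption of this sketch.

```python
import numpy as np

def train_momentum(grad_fn, w, eta=0.1, beta=0.9, n_epochs=100):
    """Full-batch gradient descent with momentum.

    dw accumulates a running average of past steps: it grows along
    directions where the gradient keeps the same sign and shrinks
    where the sign keeps flipping.
    """
    w = np.asarray(w, dtype=float).copy()
    dw = np.zeros_like(w)
    for _ in range(n_epochs):
        g = grad_fn(w)               # average gradient over the training set
        dw = beta * dw - eta * g     # running average of steps
        w = w + dw
    return w

# Toy usage on an ill-conditioned quadratic, where momentum helps most
A = np.diag([1.0, 50.0])
grad = lambda w: A @ w
print(train_momentum(grad, np.array([5.0, 5.0]), eta=0.02))
```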
Momentum Update
• The momentum method:
$$\Delta W^{(k)} = \beta \Delta W^{(k-1)} - \eta \nabla_W Loss\!\left(W^{(k-1)}\right)^{\top}$$
• At any iteration, to compute the current step:
– First compute the gradient step at the current location
– Then add in the scaled previous step
• Which is actually a running average
– To get the final step
Momentum update
• Momentum update steps are actually computed in two stages:
– First: take a step against the gradient at the current location
– Second: add a scaled version of the previous step
• The procedure can be made more optimal by reversing the order of operations..
[Figure: the two stages of the momentum step, labeled 1 (gradient step) and 2 (scaled previous step)]
Nestorov’s Accelerated Gradient
• Change the order of operations
• At any iteration, to compute the current step:– First extend by the (scaled) historical average
– Then compute the gradient at the resultant position– Add the two to obtain the final step
134
Nesterov’s Accelerated Gradient
• Nesterov’s method:
$$\Delta W^{(k)} = \beta \Delta W^{(k-1)} - \eta \nabla_W Loss\!\left(W^{(k-1)} + \beta \Delta W^{(k-1)}\right)^{\top}$$
$$W^{(k)} = W^{(k-1)} + \Delta W^{(k)}$$
Nesterov’s Accelerated Gradient
• Comparison with momentum (example from Hinton)
• Converges much faster
Training with Nesterov
• Initialize all weights $W_1, W_2, \ldots, W_K$
• Do:
– For all layers $k$, initialize $\nabla_{W_k}Loss = 0$, $\Delta W_k = 0$
– For every layer $k$:
$$W_k = W_k + \beta \Delta W_k$$
– For all $t = 1:T$
• For every layer $k$:
– Compute gradient $\nabla_{W_k}Div(Y_t, d_t)$
– $\nabla_{W_k}Loss \mathrel{+}= \frac{1}{T}\nabla_{W_k}Div(Y_t, d_t)$
– For every layer $k$:
$$W_k = W_k - \eta\left(\nabla_{W_k}Loss\right)^{\top}$$
$$\Delta W_k = \beta \Delta W_k - \eta\left(\nabla_{W_k}Loss\right)^{\top}$$
• Until $Loss$ has converged
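Under the same assumptions as the momentum sketch above (a generic flattened parameter vector and a hypothetical `grad_fn`), here is a minimal numpy version of the Nesterov loop: extend by the scaled previous step first, then take the gradient step from that look-ahead point.

```python
import numpy as np

def train_nesterov(grad_fn, w, eta=0.1, beta=0.9, n_epochs=100):
    """Full-batch Nesterov accelerated gradient (a sketch).

    The gradient is evaluated at the look-ahead point w + beta*dw,
    not at w itself; that is what distinguishes it from plain momentum.
    """
    w = np.asarray(w, dtype=float).copy()
    dw = np.zeros_like(w)
    for _ in range(n_epochs):
        g = grad_fn(w + beta * dw)   # gradient at the look-ahead position
        dw = beta * dw - eta * g
        w = w + dw
    return w

# Toy usage on the same ill-conditioned quadratic as before
A = np.diag([1.0, 50.0])
grad = lambda w: A @ w
print(train_nesterov(grad, np.array([5.0, 5.0]), eta=0.02))
```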
Momentum and trend-based methods..
• We will return to this topic again, very soon..
Story so far
• Gradient descent can miss obvious answers
– And this may be a good thing
• Vanilla gradient descent may be too slow or unstable due to the differences between the dimensions
• Second order methods can normalize the variation across dimensions, but are complex
• Adaptive or decaying learning rates can improve convergence
• Methods that decouple the dimensions can improve convergence
• Momentum methods which emphasize directions of steady improvement are demonstrably superior to other methods
Coming up
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences
– Activations
– Normalizations