Lecture 7: Training Neural Networks, Part 2
TRANSCRIPT
Lecture 7: Training Neural Networks, Part 2
Administrative
- Assignment 1 is being graded, stay tuned
- Project proposals due today by 11:59pm
- Assignment 2 is out, due Thursday May 4 at 11:59pm
Administrative: Google Cloud
- STOP YOUR INSTANCES when not in use!
- Keep track of your spending!
- GPU instances are much more expensive than CPU instances - only use a GPU instance when you need it (e.g. for A2, only on the TensorFlow / PyTorch notebooks)
Last time: Activation Functions
- Sigmoid
- tanh
- ReLU (good default choice)
- Leaky ReLU
- Maxout
- ELU
Last time: Weight Initialization
- Initialization too small: activations go to zero, gradients also zero, no learning
- Initialization too big: activations saturate (for tanh), gradients zero, no learning
- Initialization just right: nice distribution of activations at all layers, learning proceeds nicely
Last time: Data Preprocessing
- Before normalization: classification loss very sensitive to changes in the weight matrix; hard to optimize
- After normalization: less sensitive to small changes in weights; easier to optimize
Last time: Batch Normalization
Input: a mini-batch of activations. Intermediates: per-batch mean and variance. Learnable params: scale and shift. Output: normalized, scaled, and shifted activations (equations reconstructed below).
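The batch-norm equations on the slide did not survive extraction; a standard reconstruction (per feature dimension j over a minibatch of N examples, with learnable γ, β and a small ε):

```latex
\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}, \qquad
\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{i,j} - \mu_j)^2, \qquad
\hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \varepsilon}}, \qquad
y_{i,j} = \gamma_j \hat{x}_{i,j} + \beta_j
```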
Last time: Babysitting Learning
Last time: Hyperparameter Search
Grid Layout vs. Random Layout (axes: Important Parameter vs. Unimportant Parameter); coarse to fine search.
Today
- Fancier optimization
- Regularization
- Transfer Learning
Optimization
(Figure: loss landscape over two weights, W_1 and W_2.)
Optimization: Problems with SGD
What if loss changes quickly in one direction and slowly in another? What does gradient descent do? Very slow progress along the shallow dimension, jitter along the steep direction.
The loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large.
Optimization: Problems with SGD
What if the loss function has a local minimum or saddle point?
Zero gradient; gradient descent gets stuck.
Saddle points are much more common in high dimensions.
Dauphin et al, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, NIPS 2014
Optimization: Problems with SGD
Our gradients come from minibatches so they can be noisy!
SGD + Momentum (vs. plain SGD)
- Build up “velocity” as a running mean of gradients
- Rho gives “friction”; typically rho = 0.9 or 0.99
(See the sketch below.)
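A minimal runnable sketch of the two update rules, using a toy gradient function in place of a real network (grad, x, and the hyperparameter values are illustrative, not from the slides):

```python
import numpy as np

grad = lambda x: np.array([0.1 * x[0], 10.0 * x[1]])  # toy gradient (illustrative)
x = np.array([1.0, 1.0])
learning_rate, rho = 1e-2, 0.9

# Vanilla SGD: step along the (negative) gradient
for _ in range(100):
    x -= learning_rate * grad(x)

# SGD + Momentum: build up velocity as a running mean of gradients;
# rho acts as "friction"
x = np.array([1.0, 1.0])
vx = np.zeros_like(x)
for _ in range(100):
    dx = grad(x)
    vx = rho * vx + dx
    x -= learning_rate * vx
```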
SGD + Momentum helps with all of the above: local minima, saddle points, poor conditioning, and gradient noise.
SGD + Momentum
Momentum update: the velocity combines the previous velocity with the current gradient, and the actual step is taken along the velocity.
Nesterov Momentum
Momentum update: evaluate the gradient at the current point and combine it with the velocity. Nesterov momentum: step in the direction of the velocity first, evaluate the gradient at that “looked-ahead” point, and combine the two for the actual step.
Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Nesterov, “Introductory lectures on convex optimization: a basic course”, 2004
Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
Nesterov Momentum
Nesterov Momentum is annoying in this form: usually we want the update in terms of the current parameters and the gradient evaluated there.
Change of variables and rearrange:
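A reconstruction of the equations this refers to (the slide math was lost in extraction); ρ is the momentum, α the learning rate, v the velocity:

```latex
% Nesterov momentum, gradient evaluated at the look-ahead point:
v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t), \qquad x_{t+1} = x_t + v_{t+1}
% Change of variables \tilde{x}_t = x_t + \rho v_t and rearrange:
v_{t+1} = \rho v_t - \alpha \nabla f(\tilde{x}_t), \qquad
\tilde{x}_{t+1} = \tilde{x}_t + v_{t+1} + \rho (v_{t+1} - v_t)
```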
Nesterov Momentum: animation comparing SGD, SGD+Momentum, and Nesterov trajectories.
AdaGrad
Added element-wise scaling of the gradient based on the historical sum of squares in each dimension
Duchi et al, “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
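A minimal runnable sketch (toy gradient and hyperparameters are illustrative):

```python
import numpy as np

grad = lambda x: np.array([0.1 * x[0], 10.0 * x[1]])  # toy gradient (illustrative)
x, learning_rate = np.array([1.0, 1.0]), 1e-2

# AdaGrad: accumulate the element-wise sum of squared gradients and
# divide each step by its square root
grad_squared = np.zeros_like(x)
for _ in range(100):
    dx = grad(x)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```

Note that the accumulated sum only ever grows, so the effective step size shrinks over the course of training.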
AdaGrad
Q: What happens with AdaGrad?
AdaGrad
Q2: What happens to the step size over a long time?
RMSProp (Tieleman and Hinton, 2012)
Like AdaGrad, but with a leaky running average of squared gradients instead of a growing sum.
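A runnable sketch of the RMSProp loop (toy gradient and decay_rate are illustrative); compare with the AdaGrad sketch above — the cumulative sum is replaced by a leaky running average:

```python
import numpy as np

grad = lambda x: np.array([0.1 * x[0], 10.0 * x[1]])  # toy gradient (illustrative)
x, learning_rate, decay_rate = np.array([1.0, 1.0]), 1e-2, 0.99

grad_squared = np.zeros_like(x)
for _ in range(100):
    dx = grad(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```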
RMSProp: animation comparing SGD, SGD+Momentum, and RMSProp trajectories.
Adam (almost)
Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015
In the update, the first moment plays the role of Momentum and the second moment plays the role of AdaGrad / RMSProp, so it is sort of like RMSProp with momentum.
Q: What happens at the first timestep?
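A runnable sketch of this “almost Adam” (toy gradient, illustrative hyperparameters):

```python
import numpy as np

grad = lambda x: np.array([0.1 * x[0], 10.0 * x[1]])  # toy gradient (illustrative)
x, learning_rate = np.array([1.0, 1.0]), 1e-3
beta1, beta2 = 0.9, 0.999

first_moment = np.zeros_like(x)   # Momentum-like term
second_moment = np.zeros_like(x)  # AdaGrad / RMSProp-like term
for _ in range(100):
    dx = grad(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    x -= learning_rate * first_moment / (np.sqrt(second_moment) + 1e-7)
```

At the first timestep second_moment is still very close to zero, so the division can produce a very large step; the bias correction in the full form below addresses this.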
Adam (full form)
Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015
Momentum term + AdaGrad / RMSProp term + bias correction, to account for the fact that the first and second moment estimates start at zero.
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
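A runnable sketch of the full update with bias correction (toy gradient; the hyperparameters are the ones recommended above):

```python
import numpy as np

grad = lambda x: np.array([0.1 * x[0], 10.0 * x[1]])  # toy gradient (illustrative)
x, learning_rate = np.array([1.0, 1.0]), 1e-3
beta1, beta2 = 0.9, 0.999

first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
for t in range(1, 101):
    dx = grad(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx           # Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx    # AdaGrad / RMSProp
    first_unbias = first_moment / (1 - beta1 ** t)                   # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```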
Adam: animation comparing SGD, SGD+Momentum, RMSProp, and Adam trajectories.
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
Q: Which one of these learning rates is best to use?
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
=> Learning rate decay over time!
- Step decay: e.g. decay the learning rate by half every few epochs
- Exponential decay and 1/t decay (formulas reconstructed below)
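The decay formulas on the slide were lost; the standard forms, with initial rate α₀, decay hyperparameter k, and epoch t:

```latex
\alpha = \alpha_0 e^{-kt} \quad \text{(exponential decay)}, \qquad
\alpha = \alpha_0 / (1 + kt) \quad \text{(1/t decay)}
```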
(Plot of training loss vs. epoch: the sharp drops in the loss curve mark points where the learning rate was decayed.)
Learning rate decay is more critical with SGD+Momentum, less common with Adam.
First-Order Optimization
(Plot: loss vs. w1.)
(1) Use the gradient to form a linear approximation
(2) Step to minimize the approximation
Second-Order Optimization
(Plot: loss vs. w1.)
(1) Use gradient and Hessian to form a quadratic approximation
(2) Step to the minimum of the approximation
Second-Order Optimization
Second-order Taylor expansion of the loss around the current parameters; solving for the critical point of this approximation gives the Newton parameter update (reconstructed below).
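A reconstruction of the equations (θ₀ = current parameters, H = Hessian of the loss J at θ₀):

```latex
J(\theta) \approx J(\theta_0)
  + (\theta - \theta_0)^{\top} \nabla_{\theta} J(\theta_0)
  + \tfrac{1}{2} (\theta - \theta_0)^{\top} H (\theta - \theta_0)
\qquad\Longrightarrow\qquad
\theta^{*} = \theta_0 - H^{-1} \nabla_{\theta} J(\theta_0)
```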
Q: What is nice about this update?
No hyperparameters! No learning rate!
Q2: Why is this bad for deep learning?
The Hessian has O(N^2) elements; inverting it takes O(N^3); and N = (tens or hundreds of) millions.
Second-Order Optimization
- Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank 1 updates over time (O(n^2) each).
- L-BFGS (Limited memory BFGS): does not form/store the full inverse Hessian.
L-BFGS
- Usually works very well in full-batch, deterministic mode, i.e. if you have a single, deterministic f(x) then L-BFGS will probably work very nicely.
- Does not transfer very well to the mini-batch setting; gives bad results. Adapting L-BFGS to the large-scale, stochastic setting is an active area of research.
Le et al, “On optimization methods for deep learning”, ICML 2011
In practice:
- Adam is a good default choice in most cases
- If you can afford to do full batch updates then try out L-BFGS (and don’t forget to disable all sources of noise)
Beyond Training Error
Better optimization algorithms help reduce training loss
But we really care about error on new data - how to reduce the gap?
Model Ensembles
1. Train multiple independent models
2. At test time average their results
Enjoy 2% extra performance
Model Ensembles: Tips and Tricks
Instead of training independent models, use multiple snapshots of a single model during training! Cyclic learning rate schedules can make this work even better!
Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017
Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.
Model Ensembles: Tips and Tricks
Instead of using the actual parameter vector, keep a moving average of the parameter vector and use that at test time (Polyak averaging).
Polyak and Juditsky, “Acceleration of stochastic approximation by averaging”, SIAM Journal on Control and Optimization, 1992.
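A minimal runnable sketch of the idea (toy gradient; the 0.995 decay constant is illustrative):

```python
import numpy as np

grad = lambda x: np.array([0.1 * x[0], 10.0 * x[1]])  # toy gradient (illustrative)
x, learning_rate = np.array([1.0, 1.0]), 1e-2

x_test = x.copy()  # moving average of the parameters, used at test time
for _ in range(100):
    x -= learning_rate * grad(x)
    x_test = 0.995 * x_test + 0.005 * x
```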
How to improve single-model performance?
Regularization
Regularization: Add term to loss
In common use:
- L2 regularization (weight decay)
- L1 regularization
- Elastic net (L1 + L2)
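The corresponding terms, reconstructed (λ is the regularization strength added to the data loss):

```latex
L = \frac{1}{N} \sum_{i=1}^{N} L_i\big(f(x_i, W), y_i\big) + \lambda R(W)
R(W) = \sum_{k}\sum_{l} W_{k,l}^{2} \ \text{(L2)}, \qquad
R(W) = \sum_{k}\sum_{l} |W_{k,l}| \ \text{(L1)}, \qquad
R(W) = \sum_{k}\sum_{l} \big(\beta W_{k,l}^{2} + |W_{k,l}|\big) \ \text{(Elastic net)}
```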
Regularization: Dropout
In each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common.
Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014
Regularization: Dropout
Example forward pass with a 3-layer network using dropout:
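The slide's code did not survive extraction; a hedged numpy reconstruction of the train-time forward pass (the layer sizes and weights here are toy values):

```python
import numpy as np

p = 0.5  # probability of keeping a unit active (higher = less dropout)

# toy 3-layer network parameters (illustrative sizes)
W1, b1 = np.random.randn(20, 10), np.zeros(20)
W2, b2 = np.random.randn(20, 20), np.zeros(20)
W3, b3 = np.random.randn(5, 20), np.zeros(5)

def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # drop!
    out = np.dot(W3, H2) + b3
    return out                           # backward pass / parameter update not shown

out = train_step(np.random.randn(10))
```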
Regularization: Dropout
How can this possibly be a good idea?
Forces the network to have a redundant representation; prevents co-adaptation of features.
(Figure: a “cat score” computed from features such as “has an ear”, “has a tail”, “is furry”, “has claws”, “mischievous look”, with some features randomly crossed out by dropout.)
Regularization: Dropout
How can this possibly be a good idea?
Another interpretation: dropout is training a large ensemble of models (that share parameters). Each binary mask is one model.
An FC layer with 4096 units has 2^4096 ~ 10^1233 possible masks! Only ~ 10^82 atoms in the universe...
Dropout: Test time
Dropout makes our output random! The output (label) is a function of the input (image) and a random mask. We want to “average out” the randomness at test time, but this integral seems hard …
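In symbols (reconstructed; z is the random dropout mask):

```latex
y = f_W(x, z), \qquad
y = f(x) = \mathbb{E}_z\big[f(x, z)\big] = \int p(z)\, f(x, z)\, dz
```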
Dropout: Test time
Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2, and compare its value at test time with its expected value during training (reconstruction below).
At test time, multiply by the dropout probability.
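The expectations on the slide, reconstructed assuming each of the two inputs is dropped independently with probability 1/2:

```latex
\text{At test time:}\quad a = w_1 x + w_2 y
\text{During training:}\quad
\mathbb{E}[a] = \tfrac{1}{4}(w_1 x + w_2 y) + \tfrac{1}{4}(w_1 x + 0)
             + \tfrac{1}{4}(0 + w_2 y) + \tfrac{1}{4}(0 + 0)
             = \tfrac{1}{2}(w_1 x + w_2 y)
```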
Dropout: Test time
At test time all neurons are always active => we must scale the activations so that, for each neuron: output at test time = expected output at training time.
Dropout Summary: drop in the forward pass at training time; scale at test time.
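A hedged sketch of the test-time half, continuing the toy 3-layer example above: every unit stays active, and each hidden layer is scaled by the keep probability p:

```python
import numpy as np

p = 0.5
W1, b1 = np.random.randn(20, 10), np.zeros(20)   # same toy parameters as above
W2, b2 = np.random.randn(20, 20), np.zeros(20)
W3, b3 = np.random.randn(5, 20), np.zeros(5)

def predict(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p   # scale at test time
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p  # scale at test time
    return np.dot(W3, H2) + b3
```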
More common: “Inverted dropout” — scale by 1/p at training time, so test time is unchanged!
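A hedged reconstruction of inverted dropout with the same toy network: divide the masks by p during training so the test-time forward pass needs no scaling:

```python
import numpy as np

p = 0.5
W1, b1 = np.random.randn(20, 10), np.zeros(20)
W2, b2 = np.random.randn(20, 20), np.zeros(20)
W3, b3 = np.random.randn(5, 20), np.zeros(5)

def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p   # mask scaled by 1/p at train time
    H1 *= U1
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p
    H2 *= U2
    return np.dot(W3, H2) + b3

def predict(X):
    # test time is unchanged: no masks, no scaling
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    return np.dot(W3, H2) + b3
```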
Regularization: A common pattern
Training: Add some kind of randomness
Testing: Average out randomness (sometimes approximate)
Example: Batch Normalization — Training: normalize using stats from random minibatches; Testing: use fixed stats to normalize.
Regularization: Data Augmentation
Training pipeline: load image and label (“cat”) → transform the image → CNN → compute loss.
(This image by Nikita is licensed under CC-BY 2.0.)
Data Augmentation: Horizontal Flips
Data Augmentation: Random crops and scales
Training: sample random crops / scales. ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
Testing: average a fixed set of crops. ResNet:
1. Resize image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, + flips
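A hedged sketch of the training-time recipe above, assuming the image is a PIL Image (the crop size and scale range follow the ResNet numbers; the helper itself is not from the slides):

```python
import numpy as np
from PIL import Image

def random_crop_and_scale(img, crop=224, lo=256, hi=480):
    """Pick a random short-side length L in [lo, hi], resize so the short
    side equals L, then sample a random crop x crop patch."""
    L = np.random.randint(lo, hi + 1)
    w, h = img.size
    scale = L / min(w, h)
    img = img.resize((int(round(w * scale)), int(round(h * scale))))
    w, h = img.size
    x = np.random.randint(0, w - crop + 1)
    y = np.random.randint(0, h - crop + 1)
    return img.crop((x, y, x + crop, y + crop))
```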
Data Augmentation: Color Jitter
Simple: randomize contrast and brightness.
More complex:
1. Apply PCA to all [R, G, B] pixels in training set
2. Sample a “color offset” along principal component directions
3. Add offset to all pixels of a training image
(As seen in [Krizhevsky et al. 2012], ResNet, etc.)
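A hedged sketch of the PCA-based jitter; for simplicity the principal components are computed per-image here, whereas the recipe above computes them once over the whole training set (alpha_std is illustrative):

```python
import numpy as np

def pca_color_jitter(img, alpha_std=0.1):
    """img: H x W x 3 float array in [0, 1]. Add a random color offset
    sampled along the principal components of the pixel colors."""
    pixels = img.reshape(-1, 3)
    pixels = pixels - pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)            # 3 x 3 color covariance
    eigvals, eigvecs = np.linalg.eigh(cov)        # principal directions
    alphas = np.random.normal(0, alpha_std, 3)    # random strengths
    offset = eigvecs @ (alphas * eigvals)         # one offset value per channel
    return np.clip(img + offset, 0.0, 1.0)

jittered = pca_color_jitter(np.random.rand(32, 32, 3))
```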
Data Augmentation: Get creative for your problem!
Random mixes / combinations of:
- translation
- rotation
- stretching
- shearing
- lens distortions, … (go crazy)
Regularization: A common pattern
Training: Add random noise
Testing: Marginalize over the noise
Examples:
- Dropout
- Batch Normalization
- Data Augmentation
- DropConnect (Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013)
- Fractional Max Pooling (Graham, “Fractional Max Pooling”, arXiv 2014)
- Stochastic Depth (Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016)
Transfer Learning
“You need a lot of data if you want to train/use CNNs”
BUSTED
Transfer Learning with CNNs
(Architecture on the slides, VGG-style: Image → [Conv-64, Conv-64, MaxPool] → [Conv-128, Conv-128, MaxPool] → [Conv-256, Conv-256, MaxPool] → [Conv-512, Conv-512, MaxPool] → [Conv-512, Conv-512, MaxPool] → FC-4096 → FC-4096 → FC-1000.)
1. Train on Imagenet.
2. Small Dataset (C classes): freeze all the earlier layers; reinitialize the final layer as FC-C and train only it.
3. Bigger dataset: freeze the lower layers and train the upper ones. With a bigger dataset, train more layers. Lower the learning rate when finetuning; 1/10 of the original LR is a good starting point.
Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014
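A hedged PyTorch-style sketch of the recipe (torchvision's VGG-16, 2017-era API, stands in for the network above; num_classes and the learning rates are illustrative):

```python
import torch
import torch.nn as nn
import torchvision

num_classes = 10  # "C" classes in the small dataset (illustrative)

model = torchvision.models.vgg16(pretrained=True)    # 1. start from ImageNet weights

# 2. Small dataset: freeze everything, swap in a fresh FC-C head and train only it
for param in model.parameters():
    param.requires_grad = False
model.classifier[6] = nn.Linear(4096, num_classes)    # reinitialize this and train
optimizer = torch.optim.SGD(model.classifier[6].parameters(), lr=1e-3, momentum=0.9)

# 3. Bigger dataset: unfreeze more layers and finetune with ~1/10 of the original LR
for param in model.classifier.parameters():
    param.requires_grad = True
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4, momentum=0.9)
```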
(Same VGG-style network as above; the lower conv layers are more generic, the later layers more specific.)

|  | very similar dataset | very different dataset |
| --- | --- | --- |
| very little data | Use Linear Classifier on top layer | You’re in trouble… Try linear classifier from different stages |
| quite a lot of data | Finetune a few layers | Finetune a larger number of layers |
Transfer learning with CNNs is pervasive… (it’s the norm, not an exception)
- Object Detection (Fast R-CNN): CNN pretrained on ImageNet. (Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.)
- Image Captioning (CNN + RNN): CNN pretrained on ImageNet, word vectors pretrained with word2vec. (Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.)
Takeaway for your projects and beyond: have some dataset of interest but it has < ~1M images?
1. Find a very large dataset that has similar data, train a big ConvNet there
2. Transfer learn to your dataset
Deep learning frameworks provide a “Model Zoo” of pretrained models so you don’t need to train your own:
- Caffe: https://github.com/BVLC/caffe/wiki/Model-Zoo
- TensorFlow: https://github.com/tensorflow/models
- PyTorch: https://github.com/pytorch/vision
Summary
- Optimization: Momentum, RMSProp, Adam, etc.
- Regularization: Dropout, etc.
- Transfer learning: use this for your projects!
Next time: Deep Learning Software!