Lecture 5 Recap
I2DL: Prof. Niessner, Dr. Dai
Gradient Descent for Neural Networks
[Figure: two-layer network with inputs x_0, x_1, x_2, hidden units h_0, h_1, h_2, h_3, predictions \hat{y}_0, \hat{y}_1, and targets y_0, y_1.]

\hat{y}_i = A\left(b_{1,i} + \sum_j h_j \, w_{1,i,j}\right)

h_j = A\left(b_{0,j} + \sum_k x_k \, w_{0,j,k}\right)

Loss function: L_i = (\hat{y}_i - y_i)^2

Activation, just simple: A(x) = \max(0, x)

\nabla_{W,b} f_{x,y}(W) = \left[ \frac{\partial f}{\partial w_{0,0,0}}, \; \ldots, \; \frac{\partial f}{\partial w_{l,m,n}}, \; \ldots, \; \frac{\partial f}{\partial b_{l,m}} \right]^\top
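As a hedged illustration of this slide's forward pass and loss, here is a minimal NumPy sketch; the array shapes, random initialization, and names (relu, W0, b0, W1, b1) are my assumptions, not code from the lecture:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # A(x) = max(0, x)

x = np.array([0.5, -1.0, 2.0])         # inputs x_0..x_2
W0 = np.random.randn(4, 3) * 0.1       # hidden weights w_{0,j,k}
b0 = np.zeros(4)                       # hidden biases b_{0,j}
W1 = np.random.randn(2, 4) * 0.1       # output weights w_{1,i,j}
b1 = np.zeros(2)                       # output biases b_{1,i}
y = np.array([1.0, 0.0])               # targets y_0, y_1

h = relu(b0 + W0 @ x)                  # hidden activations h_j
y_hat = relu(b1 + W1 @ h)              # predictions y_hat_i
loss = np.sum((y_hat - y) ** 2)        # L = sum_i (y_hat_i - y_i)^2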
Stochastic Gradient Descent (SGD)

\theta^{k+1} = \theta^k - \alpha \, \nabla_\theta L(\theta^k, x_{\{1..m\}}, y_{\{1..m\}})

\nabla_\theta L = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i

• k now refers to the k-th iteration
• m training samples in the current minibatch
• \nabla_\theta L is the gradient for the k-th minibatch
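A minimal sketch of one SGD step as written above, averaging per-sample gradients over the current minibatch; grad_fn and the default learning rate are illustrative placeholders, not part of the lecture code:

import numpy as np

def sgd_step(theta, grad_fn, x_batch, y_batch, alpha=1e-2):
    # (1/m) * sum_i grad L_i over the m samples of the minibatch
    grads = np.stack([grad_fn(theta, x, y) for x, y in zip(x_batch, y_batch)])
    # theta^(k+1) = theta^k - alpha * grad
    return theta - alpha * grads.mean(axis=0)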
Gradient Descent with Momentum

v^{k+1} = \beta \cdot v^k + \nabla_\theta L(\theta^k)
\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}

• Exponentially-weighted average of the gradient
• Important: the velocity v^k is vector-valued!
• Notation: \nabla_\theta L(\theta^k) is the gradient of the current minibatch, v is the velocity, \beta is the accumulation rate ("friction", momentum), \alpha is the learning rate, \theta are the model parameters
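A minimal sketch of the momentum update above; variable names and default values are illustrative:

def momentum_step(theta, v, grad, alpha=1e-2, beta=0.9):
    v = beta * v + grad          # accumulate velocity: exponentially-weighted gradients
    theta = theta - alpha * v    # step along the velocity instead of the raw gradient
    return theta, v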
RMSProp
[Figure: loss contours with small gradients in the x-direction and large gradients in the y-direction. Source: A. Ng]
s^{k+1} = \beta \cdot s^k + (1 - \beta) \, [\nabla_\theta L \circ \nabla_\theta L]
\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}

• We are dividing by the squared gradients:
  – the division in the y-direction will be large
  – the division in the x-direction will be small
• (Uncentered) variance of the gradients → second moment
• Can increase the learning rate!
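A minimal sketch of the RMSProp update above, keeping a running estimate s of the squared gradients and damping directions with large gradients; names and defaults are illustrative:

import numpy as np

def rmsprop_step(theta, s, grad, alpha=1e-3, beta=0.9, eps=1e-8):
    s = beta * s + (1.0 - beta) * grad * grad            # s <- beta*s + (1-beta)*(grad o grad)
    theta = theta - alpha * grad / (np.sqrt(s) + eps)    # small steps where gradients are large
    return theta, s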
Adam
• Combines Momentum and RMSProp

m^{k+1} = \beta_1 \cdot m^k + (1 - \beta_1) \, \nabla_\theta L(\theta^k)
v^{k+1} = \beta_2 \cdot v^k + (1 - \beta_2) \, [\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]

• m^{k+1} and v^{k+1} are initialized with zero → biased towards zero
  → typically, bias-corrected moment updates:

\hat{m}^{k+1} = \frac{m^{k+1}}{1 - \beta_1^{k+1}} \qquad \hat{v}^{k+1} = \frac{v^{k+1}}{1 - \beta_2^{k+1}}

\theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}
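A minimal sketch of the Adam update with bias correction as written above; k counts iterations from 0, and names and defaults are illustrative:

import numpy as np

def adam_step(theta, m, v, grad, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad               # first moment (momentum-style)
    v = beta2 * v + (1.0 - beta2) * grad * grad        # second moment (RMSProp-style)
    m_hat = m / (1.0 - beta1 ** (k + 1))               # bias correction for zero initialization
    v_hat = v / (1.0 - beta2 ** (k + 1))
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v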
Training Neural Nets
Learning Rate: Implications
• What if it is too high?
• What if it is too low?
Source: http://cs231n.github.io/neural-networks-3/
Learning Rate
Need high learning rate when far away
Need low learning rate when close
Learning Rate Decay
• \alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch}} \cdot \alpha_0
  – E.g., \alpha_0 = 0.1, decay_rate = 1.0
  – Epoch 0: 0.1
  – Epoch 1: 0.05
  – Epoch 2: 0.033
  – Epoch 3: 0.025
  – …

[Plot: learning rate over epochs, decaying from 0.1 towards 0]
Learning Rate Decay
Many options:
• Step decay: \alpha = \alpha - t \cdot \alpha (only every n steps)
  – t is the decay rate (often 0.5)
• Exponential decay: \alpha = t^{\text{epoch}} \cdot \alpha_0
  – t is the decay rate (t < 1.0)
• \alpha = \frac{t}{\sqrt{\text{epoch}}} \cdot \alpha_0
  – t is the decay rate
• Etc.
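Minimal sketches of these decay schedules; the function names and default values are illustrative, not from the lecture code:

def inverse_decay(alpha0, decay_rate, epoch):
    # e.g. alpha0 = 0.1, decay_rate = 1.0 -> 0.1, 0.05, 0.033, 0.025, ...
    return alpha0 / (1.0 + decay_rate * epoch)

def step_decay(alpha, t=0.5):
    return alpha - t * alpha          # apply only every n epochs

def exponential_decay(alpha0, t, epoch):
    return (t ** epoch) * alpha0      # t < 1.0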
Training Schedule
Manually specify the learning rate for the entire training process:
• Manually set the learning rate every n epochs
• How?
  – Trial and error (the hard way)
  – Some experience (only generalizes to some degree)
• Consider: #epochs, training set size, network size, etc.
Basic Recipe for Training
• Given a dataset with ground-truth labels {x_i, y_i}
  – x_i is the i-th training image, with label y_i
  – Often dim(x) ≫ dim(y) (e.g., for classification)
  – i is often in the hundred-thousands or millions
• Take the network f and its parameters w, b
• Use SGD (or a variation) to find the optimal parameters w, b
  – Gradients from backpropagation
Gradient Descent on Train Set
• Given a large train set with n training samples {x_i, y_i}
  – Let's say 1 million labeled images
  – Let's say our network has 500k parameters
• Gradient has 500k dimensions
• n = 1 million
• Extremely expensive to compute
Learning
• Learning means generalization to an unknown dataset
  – (So far no "real" learning)
  – I.e., train on a known dataset → test with the optimized parameters on an unknown dataset
• Basically, we hope that, based on the train set, the optimized parameters will give similar results on different data (i.e., test data)
Learning
• Training set ("train"):
  – Use for training your neural network
• Validation set ("val"):
  – Hyperparameter optimization
  – Check generalization progress
• Test set ("test"):
  – Only for the very end
  – NEVER TOUCH DURING DEVELOPMENT OR TRAINING
Learning
• Typical splits:
  – Train (60%), Val (20%), Test (20%)
  – Train (80%), Val (10%), Test (10%)
• During training:
  – Train error comes from the average minibatch error
  – Typically evaluate on a subset of the validation set every n iterations
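As a hedged illustration, a sketch of an 80/10/10 split (a 60/20/20 split works the same way); the function name and seed handling are my assumptions:

import numpy as np

def split_dataset(n_samples, train=0.8, val=0.1, seed=0):
    idx = np.random.default_rng(seed).permutation(n_samples)   # shuffle once, keep sets disjoint
    n_train = int(train * n_samples)
    n_val = int(val * n_samples)
    return (idx[:n_train],                  # train: fit the weights
            idx[n_train:n_train + n_val],   # val: hyperparameters, generalization checks
            idx[n_train + n_val:])          # test: only at the very end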
Basic Recipe for Machine Learning
• Split your data:
  [Diagram: train (60%) | validation (20%) | test (20%)]
• Find your hyperparameters
Basic Recipe for Machine Learning
• Split your data: train (60%) | validation (20%) | test (20%)
• Example scenario:
  – Ground truth error ......... 1%
  – Training set error ......... 5%  → bias (underfitting)
  – Val/test set error ......... 8%  → variance (overfitting)
Basic Recipe for Machine Learning
[Flowchart: basic recipe for diagnosing bias/variance, ending in "Done". Credits: A. Ng]
Over- and Underfitting
[Figure: underfitted | appropriate | overfitted model fits]
Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017
Over- and Underfitting
Source: https://srdas.github.io/DLBook/ImprovingModelGeneralization.html
Learning Curves
• Training graphs:
  – Accuracy
  – Loss
Good Learning Curves
Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
Overfitting Curves
Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
Other Curves
[Left: underfitting (loss still decreasing). Right: validation set is easier than the training set.]
Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
To Summarize
• Underfitting
  – Training and validation losses are still decreasing even at the end of training
• Overfitting
  – Training loss decreases while validation loss increases
• Ideal training
  – Small gap between training and validation loss, and both go down at the same rate (stable, without fluctuations)
To Summarize
• Bad signs
  – Training error not going down
  – Validation error not going down
  – Performance on validation better than on the training set
  – Tests on the train set different than during training
• Bad practice
  – Training set contains test data
  – Debugging the algorithm on test data
→ Never touch the test set during development or training!
Hyperparameters
• Network architecture (e.g., number of layers, #weights)
• Number of iterations
• Learning rate(s) (i.e., solver parameters, decay, etc.)
• Regularization (more in the next lecture)
• Batch size
• …
• Overall: learning setup + optimization = hyperparameters
Hyperparameter Tuning
• Methods:
  – Manual search:
    • Most common
  – Grid search (structured, for "real" applications):
    • Define ranges for all parameter spaces and select points
    • Usually pseudo-uniformly distributed
    • Iterate over all possible configurations
  – Random search:
    • Like grid search, but pick points at random within the predefined ranges
[Figure: grid search vs. random search — candidate points in the (first parameter, second parameter) space; grid search places points on a regular grid, random search samples them at random.]
How to Start
• Start with a single training sample (see the sketch below)
  – Check if the output is correct
  – Overfit: accuracy should be 100% because the input is just memorized
• Increase to a handful of samples (e.g., 4)
  – Check if the input is handled correctly
• Move from overfitting to more samples
  – 5, 10, 100, 1000, …
  – At some point, you should see generalization
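A minimal PyTorch sketch of the "start with a single training sample" check; model, x_single, y_single, and the hyperparameters are placeholders, and a classification loss is assumed:

import torch

def overfit_single_sample(model, x_single, y_single, steps=200, lr=1e-2):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x_single), y_single)
        loss.backward()
        optimizer.step()
    # If the setup is correct, the loss should go to ~0 and accuracy to 100% on this sample.
    return loss.item()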
Find a Good Learning Rate
• Use all training data with a small weight decay
• Perform an initial loss sanity check, e.g., log(C) for softmax with C classes (see the sketch below)
• Find a learning rate that makes the loss drop significantly (exponentially) within ~100 iterations
• Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4
[Plot: loss over training time]
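A small sketch of the initial-loss sanity check mentioned above, assuming a PyTorch classifier trained with softmax/cross-entropy; the function and argument names are illustrative:

import math
import torch
import torch.nn.functional as F

def initial_loss_check(model, x_batch, y_batch, num_classes):
    with torch.no_grad():
        loss = F.cross_entropy(model(x_batch), y_batch).item()
    # With random weights, the expected loss is about log(C), e.g. ~2.3 for C = 10.
    print(f"initial loss {loss:.3f}, expected about {math.log(num_classes):.3f}")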
Coarse Grid Search
• Choose a few values of learning rate and weight decay around what worked in the previous step
• Train a few models for a few epochs
• Good weight decay values to try: 1e-4, 1e-5, 0
[Figure: grid search — candidate points over the (first parameter, second parameter) space]
Refine Grid
• Pick the best models found with the coarse grid
• Refine the grid search around these models
• Train them for longer (10-20 epochs) without learning rate decay
• Study the loss curves
Timings
• How long does each iteration take?
  – Get precise timings!
  – If an iteration exceeds 500 ms, things get dicey
• Look for bottlenecks
  – Data loading: smaller resolution, compression, train from an SSD
  – Backprop
• Estimate the total time
  – How long until you see some pattern?
  – How long until convergence?
Network Architecture
• Frequent mistake: "Let's use this super big network, train for two weeks and see where we stand."
• Instead, start with the simplest network possible
  – Rule of thumb: divide the #layers you started with by 5
• Get debug cycles down
  – Ideally, minutes
Debugging
• Use train/validation/test curves
  – Evaluation needs to be consistent
  – Numbers need to be comparable
• Only make one change at a time
  – "I've added 5 more layers and doubled the training set size, and now I also trained 5 days longer. Now it's better, but why?"
Common Mistakes in Practice
• Did not overfit to a single batch first
• Forgot to toggle train/eval mode for the network
• Forgot to call .zero_grad() (in PyTorch) before calling .backward()
• Passed softmaxed outputs to a loss function that expects raw logits
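A hedged PyTorch sketch of one training epoch that avoids the mistakes listed above; model, loader, and optimizer are placeholders:

import torch

def train_one_epoch(model, loader, optimizer):
    model.train()                            # training mode (remember model.eval() for validation)
    loss_fn = torch.nn.CrossEntropyLoss()    # expects raw logits, applies softmax internally
    for x, y in loader:
        optimizer.zero_grad()                # clear gradients before the next backward pass
        loss = loss_fn(model(x), y)          # pass logits, not softmaxed outputs
        loss.backward()
        optimizer.step()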
Tensorboard: Visualization in Practice
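A minimal sketch of logging train/val curves with torch.utils.tensorboard; the tags, log directory, and placeholder loss values are illustrative:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/experiment_1")
for step in range(100):
    train_loss = 1.0 / (step + 1)            # placeholder values, not real training
    val_loss = 1.2 / (step + 1)
    writer.add_scalar("Loss/train", train_loss, step)
    writer.add_scalar("Loss/val", val_loss, step)
writer.close()
# Inspect with:  tensorboard --logdir runs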
Tensorboard: Compare Train/Val Curves
Tensorboard: Compare Different Runs
Tensorboard: Visualize Model Predictions
Tensorboard: Compare Hyperparameters
Next Lecture
• Next lecture on 10th December
  – More about training neural networks: output functions, loss functions, activation functions
• Exam date: Wednesday, 19.02.2020, 13:30 - 15:00
• Reminder: Exercise 1 due tomorrow, 04.12.2019, 18:00
• Exercise 2 release: Thursday, 05.12.2019, 10:00 - 12:00
See you next week!
References
• Goodfellow et al., "Deep Learning" (2016)
  – Chapter 6: Deep Feedforward Networks
• Bishop, "Pattern Recognition and Machine Learning" (2006)
  – Chapter 5.5: Regularization in Neural Networks
• http://cs231n.github.io/neural-networks-1/
• http://cs231n.github.io/neural-networks-2/
• http://cs231n.github.io/neural-networks-3/