
Page 1: Lecture 5 Recap

Lecture 5 Recap

I2DL: Prof. Niessner, Dr. Dai 1

Page 2: Lecture 5 Recap

Gradient Descent for Neural Networks

[Figure: two-layer fully-connected network with inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, predictions $\hat{y}_0, \hat{y}_1$, and targets $y_0, y_1$]

$\hat{y}_i = A\big(b_{1,i} + \sum_j h_j \, w_{1,i,j}\big)$

$h_j = A\big(b_{0,j} + \sum_k x_k \, w_{0,j,k}\big)$

Loss function: $L_i = (\hat{y}_i - y_i)^2$

Just a simple activation: $A(x) = \max(0, x)$

$\nabla_{W,b} f_{x,y}(W) = \left[ \frac{\partial f}{\partial w_{0,0,0}}, \;\dots\;, \frac{\partial f}{\partial w_{l,m,n}}, \;\dots\;, \frac{\partial f}{\partial b_{l,m}} \right]$

I2DL: Prof. Niessner, Dr. Dai 3

Page 3: Lecture 5 Recap

Stochastic Gradient Descent (SGD)

$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} - \alpha \, \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{k}, \boldsymbol{x}_{\{1..m\}}, \boldsymbol{y}_{\{1..m\}})$

$\nabla_{\boldsymbol{\theta}} L = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\boldsymbol{\theta}} L_i$

• $k$ now refers to the $k$-th iteration
• $m$ training samples in the current minibatch
• $\nabla_{\boldsymbol{\theta}} L$ is the gradient for the $k$-th minibatch
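A minimal sketch of one SGD iteration over a sampled minibatch; `grad_fn` is a hypothetical helper that returns the averaged minibatch gradient:

```python
import numpy as np

def sgd_step(theta, grad_fn, X, Y, alpha=1e-2, m=32, rng=np.random.default_rng(0)):
    idx = rng.choice(len(X), size=m, replace=False)   # sample the k-th minibatch
    grad = grad_fn(theta, X[idx], Y[idx])             # (1/m) * sum_i grad L_i
    return theta - alpha * grad                       # theta_{k+1} = theta_k - alpha * grad
```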

I2DL: Prof. Niessner, Dr. Dai 4

Page 4: Lecture 5 Recap

Gradient Descent with Momentum

$\boldsymbol{v}^{k+1} = \beta \cdot \boldsymbol{v}^{k} + \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{k})$

$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} - \alpha \cdot \boldsymbol{v}^{k+1}$

• $\boldsymbol{v}$: velocity, an exponentially-weighted average of the gradient
• $\beta$: accumulation rate ('friction', momentum)
• $\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{k})$: gradient of the current minibatch
• $\alpha$: learning rate
• Important: the velocity $\boldsymbol{v}^{k}$ is vector-valued!
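A corresponding sketch of the momentum update (names assumed):

```python
# beta is the accumulation rate ('friction'), alpha the learning rate.
def momentum_step(theta, v, grad, alpha=1e-2, beta=0.9):
    v = beta * v + grad          # v_{k+1} = beta * v_k + grad L(theta_k)
    theta = theta - alpha * v    # theta_{k+1} = theta_k - alpha * v_{k+1}
    return theta, v
```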

I2DL: Prof. Niessner, Dr. Dai 5

Page 5: Lecture 5 Recap

RMSProp

[Figure: contour plot of the loss – small gradients in the x-direction, large gradients in the y-direction. Source: A. Ng]

$\boldsymbol{s}^{k+1} = \beta \cdot \boldsymbol{s}^{k} + (1 - \beta)\,[\nabla_{\boldsymbol{\theta}} L \circ \nabla_{\boldsymbol{\theta}} L]$

$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} - \alpha \cdot \dfrac{\nabla_{\boldsymbol{\theta}} L}{\sqrt{\boldsymbol{s}^{k+1}} + \epsilon}$

We're dividing by the accumulated squared gradients:
– the division in the y-direction is by a large value
– the division in the x-direction is by a small value

$\boldsymbol{s}$ is the (uncentered) variance of the gradients → second moment

Can increase the learning rate!
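A corresponding sketch of the RMSProp update (names assumed):

```python
import numpy as np

# s tracks an exponentially-weighted average of the element-wise squared
# gradients, which damps the step in directions with large gradients.
def rmsprop_step(theta, s, grad, alpha=1e-3, beta=0.9, eps=1e-8):
    s = beta * s + (1.0 - beta) * grad * grad          # s_{k+1} = beta*s_k + (1-beta)*(grad o grad)
    theta = theta - alpha * grad / (np.sqrt(s) + eps)  # per-dimension scaled step
    return theta, s
```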

I2DL: Prof. Niessner, Dr. Dai 6

Page 6: Lecture 5 Recap

Adam

• Combines Momentum and RMSProp

$\boldsymbol{m}^{k+1} = \beta_1 \cdot \boldsymbol{m}^{k} + (1 - \beta_1)\,\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{k})$

$\boldsymbol{v}^{k+1} = \beta_2 \cdot \boldsymbol{v}^{k} + (1 - \beta_2)\,[\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{k}) \circ \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{k})]$

• $\boldsymbol{m}^{k+1}$ and $\boldsymbol{v}^{k+1}$ are initialized with zero → bias towards zero
  → Typically, bias-corrected moment updates:

$\widehat{\boldsymbol{m}}^{k+1} = \dfrac{\boldsymbol{m}^{k+1}}{1 - \beta_1^{k+1}} \qquad \widehat{\boldsymbol{v}}^{k+1} = \dfrac{\boldsymbol{v}^{k+1}}{1 - \beta_2^{k+1}}$

$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} - \alpha \cdot \dfrac{\widehat{\boldsymbol{m}}^{k+1}}{\sqrt{\widehat{\boldsymbol{v}}^{k+1}} + \epsilon}$

I2DL: Prof. Niessner, Dr. Dai 7

Page 7: Lecture 5 Recap

Training Neural Nets

I2DL: Prof. Niessner, Dr. Dai 12

Page 8: Lecture 5 Recap

Learning Rate: Implications

• What if it is too high?

• What if it is too low?

Source: http://cs231n.github.io/neural-networks-3/

I2DL: Prof. Niessner, Dr. Dai 13

Page 9: Lecture 5 Recap

Learning Rate

Need high learning rate when far away

Need low learning rate when close

I2DL: Prof. Niessner, Dr. Dai 14

Page 10: Lecture 5 Recap

Learning Rate Decay

• $\alpha = \dfrac{1}{1 + \mathit{decay\_rate} \cdot \mathit{epoch}} \cdot \alpha_0$
  – E.g., $\alpha_0 = 0.1$, $\mathit{decay\_rate} = 1.0$
    → Epoch 0: 0.1
    → Epoch 1: 0.05
    → Epoch 2: 0.033
    → Epoch 3: 0.025

[Plot: Learning Rate over Epochs – decaying from 0.1 towards 0]
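A small sketch that reproduces the decay values above, assuming $\alpha_0 = 0.1$ and decay_rate = 1.0:

```python
# Reproduces the example: alpha_0 = 0.1, decay_rate = 1.0.
alpha_0, decay_rate = 0.1, 1.0
for epoch in range(4):
    alpha = alpha_0 / (1.0 + decay_rate * epoch)
    print(epoch, round(alpha, 3))   # 0 -> 0.1, 1 -> 0.05, 2 -> 0.033, 3 -> 0.025
```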

I2DL: Prof. Niessner, Dr. Dai 15

Page 11: Lecture 5 Recap

Learning Rate Decay

Many options:

• Step decay: $\alpha = \alpha - t \cdot \alpha$ (applied only every n steps)
  – t is the decay rate (often 0.5)

• Exponential decay: $\alpha = t^{\mathit{epoch}} \cdot \alpha_0$
  – t is the decay rate (t < 1.0)

• $\alpha = \dfrac{t}{\mathit{epoch}} \cdot \alpha_0$
  – t is the decay rate

• Etc.

I2DL: Prof. Niessner, Dr. Dai 16
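Hedged sketches of the decay options listed above (function names are illustrative):

```python
# t is the decay rate in each case.
def step_decay(alpha, t=0.5):
    # alpha = alpha - t * alpha, applied only every n steps
    return alpha - t * alpha

def exponential_decay(alpha_0, t, epoch):
    # alpha = t^epoch * alpha_0, with t < 1.0
    return (t ** epoch) * alpha_0

def inverse_epoch_decay(alpha_0, t, epoch):
    # alpha = (t / epoch) * alpha_0 (start at epoch 1 to avoid division by zero)
    return (t / epoch) * alpha_0
```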

Page 12: Lecture 5 Recap

Training Schedule

Manually specify the learning rate for the entire training process

• Manually set the learning rate every n epochs
• How?
  – Trial and error (the hard way)
  – Some experience (only generalizes to some degree)

Consider: #epochs, training set size, network size, etc.

I2DL: Prof. Niessner, Dr. Dai 17

Page 13: Lecture 5 Recap

Basic Recipe for Training

• Given a dataset with ground truth labels
  – $\{x_i, y_i\}$
    • $x_i$ is the $i$-th training image, with label $y_i$
    • Often $\dim(x) \gg \dim(y)$ (e.g., for classification)
    • $i$ is often in the hundred-thousands or millions
  – Take the network $f$ and its parameters $w, b$
  – Use SGD (or a variation) to find the optimal parameters $w, b$
    • Gradients from backpropagation

I2DL: Prof. Niessner, Dr. Dai 18

Page 14: Lecture 5 Recap

Gradient Descent on Train Set

• Given a large train set with $n$ training samples $\{\boldsymbol{x}_i, \boldsymbol{y}_i\}$
  – Let's say 1 million labeled images
  – Let's say our network has 500k parameters

• Gradient has 500k dimensions
• $n = 1$ million
• Extremely expensive to compute

I2DL: Prof. Niessner, Dr. Dai 19

Page 15: Lecture 5 Recap

Learning

• Learning means generalization to an unknown dataset
  – (So far, no 'real' learning)
  – I.e., train on a known dataset → test with the optimized parameters on an unknown dataset

• Basically, we hope that, based on the train set, the optimized parameters will give similar results on different data (i.e., test data)

I2DL: Prof. Niessner, Dr. Dai 20

Page 16: Lecture 5 Recap

Learning

• Training set ('train'):
  – Use for training your neural network

• Validation set ('val'):
  – Hyperparameter optimization
  – Check generalization progress

• Test set ('test'):
  – Only for the very end
  – NEVER TOUCH DURING DEVELOPMENT OR TRAINING

I2DL: Prof. Niessner, Dr. Dai 21

Page 17: Lecture 5 Recap

Learning

• Typical splits
  – Train (60%), Val (20%), Test (20%)
  – Train (80%), Val (10%), Test (10%)

• During training:
  – The train error comes from the average minibatch error
  – Typically evaluate on a subset of the validation set every n iterations
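A minimal sketch of an 80/10/10 split along the lines above (names assumed):

```python
import numpy as np

def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    idx = np.random.default_rng(seed).permutation(n)   # shuffle once, up front
    n_test, n_val = int(n * test_frac), int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test   # keep the test indices aside until the very end

train_idx, val_idx, test_idx = split_indices(10_000)
```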

I2DL: Prof. Niessner, Dr. Dai 22

Page 18: Lecture 5 Recap

Basic Recipe for Machine Learning

• Split your data

[Diagram: data split into train (60%), validation (20%), and test (20%), annotated 'Find your hyperparameters']

I2DL: Prof. Niessner, Dr. Dai 23

Page 19: Lecture 5 Recap

Basic Recipe for Machine Learning

• Split your data

[Diagram: data split into train (60%), validation (20%), and test (20%)]

Example scenario:
  Ground truth error ...... 1%
  Training set error ...... 5%   → Bias (underfitting)
  Val/test set error ...... 8%   → Variance (overfitting)

I2DL: Prof. Niessner, Dr. Dai 24

Page 20: Lecture 5 Recap

Basic Recipe for Machine Learning

[Flowchart of the bias/variance recipe, ending in 'Done'. Credits: A. Ng]

I2DL: Prof. Niessner, Dr. Dai 25

Page 21: Lecture 5 Recap

Over- and Underfitting

[Figure: three fits of the same data – underfitted, appropriate, overfitted]

Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017

I2DL: Prof. Niessner, Dr. Dai 26

Page 22: Lecture 5 Recap

Over- and Underfitting

Source: https://srdas.github.io/DLBook/ImprovingModelGeneralization.html

I2DL: Prof. Niessner, Dr. Dai 27

Page 23: Lecture 5 Recap

Learning Curves

• Training graphs
  – Accuracy
  – Loss

I2DL: Prof. Niessner, Dr. Dai 28

Page 24: Lecture 5 Recap

Good Learning Curves

[Plots: training and validation learning curves]

Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

I2DL: Prof. Niessner, Dr. Dai 29

Page 25: Lecture 5 Recap

Overfitting Curves

[Plot: learning curves where the validation loss starts to rise while the training loss keeps decreasing]

Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

I2DL: Prof. Niessner, Dr. Dai 30

Page 26: Lecture 5 Recap

Other Curves

[Plots: underfitting (loss still decreasing); validation set is easier than the training set]

Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

I2DL: Prof. Niessner, Dr. Dai 31

Page 27: Lecture 5 Recap

To Summarize

• Underfitting
  – Training and validation losses are still decreasing even at the end of training

• Overfitting
  – Training loss decreases while validation loss increases

• Ideal training
  – Small gap between training and validation loss, and both go down at the same rate (stable, without fluctuations)

I2DL: Prof. Niessner, Dr. Dai 32

Page 28: Lecture 5 Recap

To Summarize

• Bad signs
  – Training error not going down
  – Validation error not going down
  – Performance on validation better than on the training set
  – Tests on the train set differ from the behaviour seen during training

• Bad practice
  – Training set contains test data
  – Debugging the algorithm on test data
  → Never touch the test set during development or training!

I2DL: Prof. Niessner, Dr. Dai 33

Page 29: Lecture 5 Recap

Hyperparameters

• Network architecture (e.g., number of layers, #weights)
• Number of iterations
• Learning rate(s) (i.e., solver parameters, decay, etc.)
• Regularization (more in the next lecture)
• Batch size
• …
• Overall:

learning setup + optimization = hyperparameters

I2DL: Prof. Niessner, Dr. Dai 34

Page 30: Lecture 5 Recap

Hyperparameter Tuning

• Methods:
  – Manual search:
    • most common
  – Grid search (structured, for 'real' applications):
    • Define ranges for all parameter spaces and select points
    • Usually pseudo-uniformly distributed
    → Iterate over all possible configurations
  – Random search:
    • Like grid search, but pick the points at random within the predefined ranges

I2DL: Prof. Niessner, Dr. Dai 35

[Plots: grid search vs. random search – sampled points over two hyperparameters ('First Parameter' and 'Second Parameter'), both axes from 0 to 1]
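A minimal sketch contrasting the two strategies; the two hyperparameters (learning rate, weight decay) and their ranges are illustrative assumptions:

```python
import itertools
import random

lr_values = [1e-1, 1e-2, 1e-3, 1e-4]
wd_values = [1e-4, 1e-5, 0.0]

# Grid search: iterate over all possible configurations.
grid_configs = list(itertools.product(lr_values, wd_values))

# Random search: pick points at random within predefined ranges (log-uniform here).
rng = random.Random(0)
random_configs = [(10 ** rng.uniform(-4, -1), 10 ** rng.uniform(-6, -3))
                  for _ in range(len(grid_configs))]
```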

Page 31: Lecture 5 Recap

How to Start

• Start with a single training sample
  – Check if the output is correct
  – Overfit: accuracy should be 100%, because the input is simply memorized

• Increase to a handful of samples (e.g., 4)
  – Check if the input is handled correctly

• Move from overfitting to more samples
  – 5, 10, 100, 1000, …
  – At some point, you should see generalization

I2DL: Prof. Niessner, Dr. Dai 36

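A minimal PyTorch sketch of the first step, overfitting a single training sample (model, sizes, and hyperparameters are assumptions):

```python
import torch
import torch.nn as nn

# Sanity check: the model should reach ~100% accuracy / near-zero loss
# on a single sample, since it can simply memorize it.
model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
criterion = nn.CrossEntropyLoss()

x = torch.randn(1, 3)              # a single training sample
y = torch.tensor([1])              # its label

for it in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
print(loss.item())                 # should be close to 0 if everything is wired up correctly
```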

Page 32: Lecture 5 Recap

Find a Good Learning Rate

• Use all training data with a small weight decay
• Perform an initial loss sanity check, e.g., log(C) for softmax with C classes
• Find a learning rate that makes the loss drop significantly (exponentially) within 100 iterations
• Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4
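A minimal PyTorch sketch of the initial-loss sanity check; with C classes and a softmax/cross-entropy loss, the untrained loss should be roughly log(C) (model and input sizes are assumptions):

```python
import math
import torch
import torch.nn as nn

C = 10
model = nn.Linear(128, C)
x, y = torch.randn(64, 128), torch.randint(0, C, (64,))
initial_loss = nn.CrossEntropyLoss()(model(x), y).item()
print(initial_loss, math.log(C))   # the two numbers should be close (~2.30 for C = 10)
```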

I2DL: Prof. Niessner, Dr. Dai 37

[Plot: loss over training time]

Page 33: Lecture 5 Recap

Coarse Grid Search

• Choose a few values of learning rate and weight decay around what worked before
• Train a few models for a few epochs
• Good weight decay values to try: 1e-4, 1e-5, 0
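A minimal sketch of such a coarse search; `train_for_epochs` is a dummy stand-in for the real training loop:

```python
import random

def train_for_epochs(lr, weight_decay, epochs=3):
    # Placeholder: run the real training loop here and return validation accuracy.
    return random.random()

learning_rates = [1e-1, 1e-2, 1e-3, 1e-4]   # around what worked before
weight_decays = [1e-4, 1e-5, 0.0]           # values suggested above

results = {(lr, wd): train_for_epochs(lr, wd)
           for lr in learning_rates for wd in weight_decays}
best = sorted(results, key=results.get, reverse=True)[:3]   # refine the grid around these
```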

I2DL: Prof. Niessner, Dr. Dai 38

[Plot: grid search – sampled points over 'First Parameter' and 'Second Parameter', both axes from 0 to 1]

Page 34: Lecture 5 Recap

Refine Grid

• Pick the best models found with the coarse grid
• Refine the grid search around these models
• Train them for longer (10–20 epochs) without learning rate decay
• Study the loss curves

I2DL: Prof. Niessner, Dr. Dai 39

Page 35: Lecture 5 Recap

Timings

• How long does each iteration take?
  – Get precise timings!
  – If an iteration exceeds 500 ms, things get dicey

• Look for bottlenecks
  – Data loading: smaller resolution, compression, train from an SSD
  – Backprop

• Estimate the total time
  – How long until you see some pattern?
  – How long until convergence?
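A minimal sketch for timing one iteration, splitting data-loading time from compute time (`load_batch` and `train_step` are hypothetical callables):

```python
import time

def timed_iteration(load_batch, train_step):
    t0 = time.perf_counter()
    batch = load_batch()        # data-loading time
    t1 = time.perf_counter()
    train_step(batch)           # forward + backward + update time
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1     # (data time, compute time) in seconds
```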

I2DL: Prof. Niessner, Dr. Dai 40

Page 36: Lecture 5 Recap

Network Architecture

• Frequent mistake: "Let's use this super big network, train it for two weeks and see where we stand."

• Start with the simplest network possible
  – Rule of thumb: divide the #layers you started with by 5

• Get debug cycles down
  – Ideally, minutes

I2DL: Prof. Niessner, Dr. Dai 41

Page 37: Lecture 5 Recap

Debugging

• Use train/validation/test curves
  – Evaluation needs to be consistent
  – Numbers need to be comparable

• Only make one change at a time
  – "I've added 5 more layers and doubled the training set size, and now I also trained 5 days longer. Now it's better, but why?"

I2DL: Prof. Niessner, Dr. Dai 42

Page 38: Lecture 5 Recap

Common Mistakes in Practice

• Did not overfit to a single batch first
• Forgot to toggle the network's train/eval mode
• Forgot to call .zero_grad() (in PyTorch) before calling .backward()
• Passed softmaxed outputs to a loss function that expects raw logits
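A minimal PyTorch loop that avoids these mistakes; the model, dummy data, and hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Dummy data standing in for real DataLoaders, so the sketch runs end to end.
train_loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(10)]
val_loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(2)]

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()            # expects raw logits, so no softmax in the model

for epoch in range(2):
    model.train()                            # toggle train mode (dropout, batchnorm, ...)
    for x, y in train_loader:
        optimizer.zero_grad()                # clear old gradients before backward()
        loss = criterion(model(x), y)        # pass raw logits, not softmaxed outputs
        loss.backward()
        optimizer.step()

    model.eval()                             # toggle eval mode for validation
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
```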

I2DL: Prof. Niessner, Dr. Dai 43

Page 39: Lecture 5 Recap

Tensorboard: Visualization in Practice
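A minimal sketch of logging train/val curves with torch.utils.tensorboard (the run name and dummy values are assumptions); view the result with `tensorboard --logdir runs`:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/experiment_1")
for epoch in range(10):
    train_loss, val_loss = 1.0 / (epoch + 1), 1.2 / (epoch + 1)   # dummy values for illustration
    writer.add_scalar("Loss/train", train_loss, epoch)
    writer.add_scalar("Loss/val", val_loss, epoch)
writer.close()
```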

I2DL: Prof. Niessner, Dr. Dai 44

Page 40: Lecture 5 Recap

Tensorboard: Compare Train/Val Curves

I2DL: Prof. Niessner, Dr. Dai 45

Page 41: Lecture 5 Recap

Tensorboard: Compare Different Runs

I2DL: Prof. Niessner, Dr. Dai 46

Page 42: Lecture 5 Recap

Tensorboard: Visualize Model Predictions

I2DL: Prof. Niessner, Dr. Dai 47

Page 43: Lecture 5 Recap

Tensorboard: Visualize Model Predictions

I2DL: Prof. Niessner, Dr. Dai 48

Page 44: Lecture 5 Recap

Tensorboard: Compare Hyperparameters

I2DL: Prof. Niessner, Dr. Dai 49

Page 45: Lecture 5 Recap

Next Lecture

• Next lecture on 10th December
  – More about training neural networks: output functions, loss functions, activation functions

• Exam date: Wednesday, 19.02.2020, 13:30 – 15:00
• Reminder: Exercise 1 is due tomorrow, 04.12.2019, 18:00
• Exercise 2 release: Thursday, 05.12.2019, 10:00 – 12:00

I2DL: Prof. Niessner, Dr. Dai 50

Page 46: Lecture 5 Recap

See you next week!

I2DL: Prof. Niessner, Dr. Dai 51

Page 47: Lecture 5 Recap

References

• Goodfellow et al., "Deep Learning" (2016)
  – Chapter 6: Deep Feedforward Networks

• Bishop, "Pattern Recognition and Machine Learning" (2006)
  – Chapter 5.5: Regularization in Neural Networks

• http://cs231n.github.io/neural-networks-1/
• http://cs231n.github.io/neural-networks-2/
• http://cs231n.github.io/neural-networks-3/

I2DL: Prof. Niessner, Dr. Dai 52