
Page 1

Basics of DL

Prof. Leal-Taixé and Prof. Niessner

Page 2

What we assume you know
• Linear Algebra & Programming!

• Basics from the Introduction to Deep Learning lecture

• PyTorch (can use TensorFlow…)

• You have already trained several models and know how to debug problems, observe training curves, and prepare training/validation/test data.

Page 3

What is a Neural Network?

Page 4

Neural Network
• Linear score function 𝑓 = 𝑊𝑥

(Figures: on CIFAR-10 and on ImageNet. Credit: Li/Karpathy/Johnson)

Page 5

Neural Network
• Linear score function 𝑓 = 𝑊𝑥

• A neural network is a nesting of 'functions':
  – 2 layers: 𝑓 = 𝑊₂ max(0, 𝑊₁𝑥)
  – 3 layers: 𝑓 = 𝑊₃ max(0, 𝑊₂ max(0, 𝑊₁𝑥))
  – 4 layers: 𝑓 = 𝑊₄ tanh(𝑊₃ max(0, 𝑊₂ max(0, 𝑊₁𝑥)))
  – 5 layers: 𝑓 = 𝑊₅ 𝜎(𝑊₄ tanh(𝑊₃ max(0, 𝑊₂ max(0, 𝑊₁𝑥))))

– … up to hundreds of layers
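As a concrete sketch, the 2-layer form maps directly to code (the widths 16384 -> 1000 -> 10 are taken from the example on the next slide):

import torch

x = torch.randn(16384)          # flattened 128x128 input image
W1 = torch.randn(1000, 16384)   # first-layer weights
W2 = torch.randn(10, 1000)      # second-layer weights

h = torch.clamp(W1 @ x, min=0)  # max(0, W1 x), the ReLU nonlinearity
f = W2 @ h                      # 10 class scores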

Page 6

Neural Network

2-layer network: 𝑓 = 𝑊₂ max(0, 𝑊₁𝑥)
  𝑥: 128 × 128 = 16384 inputs -> ℎ: 1000 hidden units (via 𝑊₁) -> 𝑓: 10 outputs (via 𝑊₂)

1-layer network: 𝑓 = 𝑊𝑥
  𝑥: 128 × 128 = 16384 inputs -> 𝑓: 10 outputs (via 𝑊)

Page 7

Neural Network

(Figure credit: Li/Karpathy/Johnson)

Page 8

Loss functions

Page 9

Neural networks: what is the shape of this function?

(Diagram: input -> network -> Prediction -> Loss (Softmax, Hinge))

Page 10

Loss functions
• Softmax loss function

• Hinge Loss (derived from the Multiclass SVM loss)


Evaluate the ground truth score for the image
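For reference, the standard forms behind these two names, with scores s = f(x_i; W) and ground-truth class y_i:

L_i^{\text{softmax}} = -\log\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}, \qquad L_i^{\text{hinge}} = \sum_{j \neq y_i} \max(0, \, s_j - s_{y_i} + 1)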

Page 11

Loss functions
• Softmax loss function

– Optimizes until the loss is zero

• Hinge Loss (derived from the Multiclass SVM loss)

– Saturates whenever it has learned a class “well enough”
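A minimal PyTorch sketch of the two losses: nn.CrossEntropyLoss implements the softmax loss on raw logits, nn.MultiMarginLoss the multiclass hinge loss.

import torch
import torch.nn as nn

scores = torch.randn(4, 10)            # raw logits for 4 samples, 10 classes
targets = torch.tensor([3, 0, 7, 1])   # ground-truth class indices

softmax_loss = nn.CrossEntropyLoss()   # softmax + negative log-likelihood
hinge_loss = nn.MultiMarginLoss()      # multiclass SVM (hinge) loss

print(softmax_loss(scores, targets), hinge_loss(scores, targets))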

Page 12

Activation functions

Page 13

Sigmoid (forward pass)


Saturated neurons kill the gradient flow

Page 14

Problem of positive output


More on zero-mean data later

Page 15

tanh

• Zero-centered
• Still saturates

(LeCun 1991)

Page 16

Rectified Linear Units (ReLU)

• Large and consistent gradients
• Does not saturate -> fast convergence
• What happens if a ReLU outputs zero? A 'dead' ReLU: it receives no gradient and stops learning
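A quick sketch comparing the three activations and their gradient flow (input values are illustrative):

import torch

x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0], requires_grad=True)

for act in [torch.sigmoid, torch.tanh, torch.relu]:
    y = act(x)
    g, = torch.autograd.grad(y.sum(), x)   # gradient of each activation w.r.t. x
    print(act.__name__, y.detach(), g)
# sigmoid/tanh gradients vanish at |x| = 10 (saturation);
# relu has gradient 1 for x > 0 and exactly 0 for x < 0 (dead regime)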

Page 17

Maxout units

• Generalization of ReLU
• Linear regimes: does not die, does not saturate
• Increases the number of parameters
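A maxout unit takes the maximum over k linear pieces, f(x) = max(w₁ᵀx + b₁, w₂ᵀx + b₂) for k = 2; a minimal sketch (sizes and k are assumptions):

import torch
import torch.nn as nn

class Maxout(nn.Module):
    # maxout over k linear pieces (Goodfellow et al. 2013); k = 2 assumed here
    def __init__(self, d_in, d_out, k=2):
        super().__init__()
        self.k, self.d_out = k, d_out
        self.lin = nn.Linear(d_in, d_out * k)   # all pieces in one matmul

    def forward(self, x):
        z = self.lin(x).view(*x.shape[:-1], self.d_out, self.k)
        return z.max(dim=-1).values             # elementwise max over the k pieces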

Page 18

Optimization

Page 19

Gradient Descent for Neural Networks

(Diagram: inputs 𝑥₀, 𝑥₁, 𝑥₂ -> hidden units ℎ₀ … ℎ₃ -> outputs 𝑦₀, 𝑦₁, compared against targets 𝑡₀, 𝑡₁)

h_j = A\big(b_{0,j} + \sum_k x_k \, w_{0,j,k}\big)

y_i = A\big(b_{1,i} + \sum_j h_j \, w_{1,i,j}\big)

L_i = (y_i - t_i)^2

\nabla_{w,b}\, f_{x,t}(w) = \Big( \frac{\partial f}{\partial w_{0,0,0}}, \; \dots, \; \frac{\partial f}{\partial w_{l,m,n}}, \; \dots, \; \frac{\partial f}{\partial b_{l,m}} \Big)

Just simple: A(x) = max(0, x)
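The same setup as a sketch in PyTorch, with autograd producing the gradient vector (layer sizes 3-4-2 taken from the diagram):

import torch

x = torch.randn(3)                            # inputs x0..x2
t = torch.randn(2)                            # targets t0, t1
W0 = torch.randn(4, 3, requires_grad=True)    # weights w_{0,j,k}
b0 = torch.zeros(4, requires_grad=True)
W1 = torch.randn(2, 4, requires_grad=True)    # weights w_{1,i,j}
b1 = torch.zeros(2, requires_grad=True)

h = torch.relu(W0 @ x + b0)                   # h_j = A(b_{0,j} + sum_k x_k w_{0,j,k})
y = torch.relu(W1 @ h + b1)                   # y_i = A(b_{1,i} + sum_j h_j w_{1,i,j})
loss = ((y - t) ** 2).sum()                   # sum of L_i = (y_i - t_i)^2

loss.backward()                               # fills W0.grad, b0.grad, W1.grad, b1.grad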

Page 20

Stochastic Gradient Descent (SGD)

\theta^{k+1} = \theta^k - \alpha \, \nabla_\theta L(\theta^k, x_{\{1..m\}}, y_{\{1..m\}})

\nabla_\theta L = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i

• k now refers to the k-th iteration; the gradient is for the k-th batch
• m is the number of training samples in the current batch
• Note the terminology: iteration vs. epoch
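A hand-rolled sketch of one SGD step, matching the update above (assumes loss is the batch-mean loss over the m samples):

import torch

def sgd_step(params, loss, lr=0.01):
    # theta^{k+1} = theta^k - alpha * grad
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g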

Page 21

Gradient Descent with Momentum

v^{k+1} = \beta \cdot v^k + \nabla_\theta L(\theta^k)

\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}

• v is the velocity, an exponentially-weighted average of the gradient of the current minibatch; important: the velocity v^k is vector-valued!
• \beta is the accumulation rate ('friction', momentum), \alpha the learning rate, \theta the model parameters

Page 22

Gradient Descent with Momentum

\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}

The step will be largest when a sequence of gradients all point in the same direction. (Fig. credit: I. Goodfellow)

Hyperparameters are \alpha and \beta; \beta is often set to 0.9.
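In PyTorch this update is one flag on the stock optimizer; a minimal sketch with a stand-in model:

import torch
import torch.nn as nn

model = nn.Linear(16, 2)                    # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x, y = torch.randn(8, 16), torch.randn(8, 2)
opt.zero_grad()
nn.functional.mse_loss(model(x), y).backward()
opt.step()                                  # v = beta*v + grad; theta -= lr*v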

Page 23

RMSProp

s^{k+1} = \beta \cdot s^k + (1 - \beta)\,[\nabla_\theta L \circ \nabla_\theta L]

\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}

• \circ denotes element-wise multiplication
• Hyperparameters: \alpha (needs tuning!), \beta (often 0.9), \epsilon (typically 10^{-8})

Page 24

RMSProp

(Fig. credit: A. Ng. Loss contours with small gradients in the x-direction and large gradients in the y-direction)

s^{k+1} = \beta \cdot s^k + (1 - \beta)\,[\nabla_\theta L \circ \nabla_\theta L]

\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}

We're dividing by the squared gradients: the division will be large in the y-direction and small in the x-direction, which damps the oscillation.

s is the (uncentered) variance of the gradients -> second moment.

Can increase the learning rate!

Page 25

Adaptive Moment Estimation (Adam)

Combines Momentum and RMSProp:

m^{k+1} = \beta_1 \cdot m^k + (1 - \beta_1)\,\nabla_\theta L(\theta^k)   (first moment: mean of gradients)

v^{k+1} = \beta_2 \cdot v^k + (1 - \beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]   (second moment: variance of gradients)

\theta^{k+1} = \theta^k - \alpha \cdot \frac{m^{k+1}}{\sqrt{v^{k+1}} + \epsilon}

Page 26

Adam

Combines Momentum and RMSProp:

m^{k+1} = \beta_1 \cdot m^k + (1 - \beta_1)\,\nabla_\theta L(\theta^k)

v^{k+1} = \beta_2 \cdot v^k + (1 - \beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]

m^{k+1} and v^{k+1} are initialized with zero -> biased towards zero. Typically, bias-corrected moment updates are used:

\hat{m}^{k+1} = \frac{m^{k+1}}{1 - \beta_1^{k+1}}, \qquad \hat{v}^{k+1} = \frac{v^{k+1}}{1 - \beta_2^{k+1}}

\theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}
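A compact sketch of these bias-corrected updates on plain tensors (hyperparameter values are the common defaults); torch.optim.Adam provides the stock implementation:

import torch

def adam_step(theta, grad, m, v, k, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # one Adam update; k is the 1-based iteration counter
    m = b1 * m + (1 - b1) * grad            # first moment (mean)
    v = b2 * v + (1 - b2) * grad * grad     # second moment (variance)
    m_hat = m / (1 - b1 ** k)               # bias correction
    v_hat = v / (1 - b2 ** k)
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)
    return theta, m, v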

Page 27

Convergence

Page 28

Training NNs

Page 29

Importance of Learning Rate

Page 30

Over- and Underfitting


Figure extracted from Deep Learning by Adam Gibson and Josh Patterson, O'Reilly Media Inc., 2017

Page 31

Over- and Underfitting


Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html

Page 32

Basic recipe for machine learning
• Split your data: train 60% | validation 20% | test 20%
• Find your hyperparameters on the validation set
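A minimal sketch of such a 60/20/20 split with random_split (the dataset is a stand-in):

import torch
from torch.utils.data import random_split, TensorDataset

data = TensorDataset(torch.randn(1000, 16), torch.randint(0, 10, (1000,)))
n = len(data)
n_train, n_val = int(0.6 * n), int(0.2 * n)
train_set, val_set, test_set = random_split(
    data, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(0))   # reproducible split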

Page 33

Basic recipe for machine learning

Page 34

Regularization

Page 35

Regularization
• Any strategy that aims to lower the validation error, even at the cost of increasing the training error

Page 36

Data augmentation

(Krizhevsky 2012)
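A sketch with torchvision transforms, in the spirit of AlexNet-style augmentation (the exact choices here are illustrative assumptions):

import torchvision.transforms as T

train_tf = T.Compose([
    T.RandomResizedCrop(224),        # random crop + rescale
    T.RandomHorizontalFlip(),        # mirror images half the time
    T.ColorJitter(0.4, 0.4, 0.4),    # brightness/contrast/saturation noise
    T.ToTensor(),
])
# applied on-the-fly, so each epoch sees different variants of each image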

Page 37

Early stopping
• Training time is also a hyperparameter
• Stop when the validation error starts to go up again; beyond that point we are overfitting
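A minimal sketch of the stopping rule on a stand-in model and random data (the patience value is an assumption):

import torch
import torch.nn as nn

model = nn.Linear(16, 1)
opt = torch.optim.Adam(model.parameters())
x, y = torch.randn(200, 16), torch.randn(200, 1)
x_val, y_val = torch.randn(50, 16), torch.randn(50, 1)

best_val, patience, bad = float("inf"), 5, 0
for epoch in range(100):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    opt.step()
    with torch.no_grad():
        val = nn.functional.mse_loss(model(x_val), y_val).item()
    if val < best_val:
        best_val, bad = val, 0
        torch.save(model.state_dict(), "best.pt")   # snapshot the best model
    else:
        bad += 1
        if bad >= patience:                          # validation stopped improving
            break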

Page 38

Bagging and ensemble methods
• Bagging: train k models on k different datasets (Training Set 1, Training Set 2, Training Set 3, …) and ensemble their predictions

Page 39

Dropout
• Disable a random set of neurons (typically 50%) in the forward pass (Srivastava 2014)
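In PyTorch, dropout is a module whose behavior is toggled by train()/eval(); a sketch:

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Dropout(p=0.5),   # drop 50% of activations during training
    nn.Linear(64, 10),
)
net.train()              # dropout active
net.eval()               # dropout disabled (inverted dropout: no test-time rescaling needed)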

Page 40

How to deal with images?

Page 41

Using CNNs in Computer Vision

Credit: Li/Karpathy/Johnson

Page 42

Image filters
• Each kernel gives us a different image filter:

Applied to the input image:
• Edge detection: [−1 −1 −1; −1 8 −1; −1 −1 −1]
• Sharpen: [0 −1 0; −1 5 −1; 0 −1 0]
• Box mean: (1/9) · [1 1 1; 1 1 1; 1 1 1]
• Gaussian blur: (1/16) · [1 2 1; 2 4 2; 1 2 1]
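A sketch applying one of these kernels with conv2d (a random grayscale image stands in for the input):

import torch
import torch.nn.functional as F

img = torch.rand(1, 1, 64, 64)            # batch x channel x H x W
edge = torch.tensor([[-1., -1., -1.],
                     [-1.,  8., -1.],
                     [-1., -1., -1.]]).view(1, 1, 3, 3)
out = F.conv2d(img, edge, padding=1)      # same-size edge-detection response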

Page 43

Convolutions on RGB Images

32 × 32 × 3 image (pixels 𝑥); 5 × 5 × 3 filter (weights 𝑤).

Convolve: slide over all spatial locations 𝑥ᵢ and compute all outputs 𝑧ᵢ. Without padding there are 28 × 28 locations, giving a 28 × 28 × 1 activation map (also called a feature map).
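Checking that arithmetic in code, as a sketch:

import torch
import torch.nn as nn

x = torch.rand(1, 3, 32, 32)                    # 32 x 32 x 3 image
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)  # one 5 x 5 x 3 filter
print(conv(x).shape)                            # torch.Size([1, 1, 28, 28])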

Page 44

Convolution Layer

32 × 32 × 3 image -> convolve with five filters, each with different weights -> five 28 × 28 activation maps (a 28 × 28 × 5 output). This is a convolution "layer".

Page 45

CNN Prototype

A ConvNet is a concatenation of conv layers and activations:

Input image 32 × 32 × 3
-> Conv + ReLU (5 filters, 5 × 5 × 3) -> 28 × 28 × 5
-> Conv + ReLU (8 filters, 5 × 5 × 5) -> 24 × 24 × 8
-> Conv + ReLU (12 filters, 5 × 5 × 8) -> 20 × 20 × 12
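The same prototype as a sketch:

import torch
import torch.nn as nn

proto = nn.Sequential(
    nn.Conv2d(3, 5, kernel_size=5), nn.ReLU(),    # 32x32x3 -> 28x28x5
    nn.Conv2d(5, 8, kernel_size=5), nn.ReLU(),    # -> 24x24x8
    nn.Conv2d(8, 12, kernel_size=5), nn.ReLU(),   # -> 20x20x12
)
print(proto(torch.rand(1, 3, 32, 32)).shape)      # torch.Size([1, 12, 20, 20])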

Page 46

CNN learned filters

Page 47

Pooling Layer: Max Pooling

Single depth slice of the input:
3 1 3 5
6 0 7 9
3 2 1 4
0 2 4 3

Max pool with 2 × 2 filters and stride 2 -> 'pooled' output:
6 9
3 4
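Verifying the example in code:

import torch
import torch.nn as nn

x = torch.tensor([[3., 1., 3., 5.],
                  [6., 0., 7., 9.],
                  [3., 2., 1., 4.],
                  [0., 2., 4., 3.]]).view(1, 1, 4, 4)
print(nn.MaxPool2d(kernel_size=2, stride=2)(x))
# tensor([[[[6., 9.],
#           [3., 4.]]]])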

Page 48

Classic CNN architectures

Page 49

LeNet
• Digit recognition: 10 classes
• Conv -> Pool -> Conv -> Pool -> Conv -> FC
• As we go deeper: width and height decrease, number of filters increases
• 60k parameters

Page 50

AlexNet

• Softmax for 1000 classes


[Krizhevsky et al. 2012]

Page 51

VGGNet
• Striving for simplicity

• CONV = 3x3 filters with stride 1, same convolutions

• MAXPOOL = 2x2 filters with stride 2


[Simonyan and Zisserman 2014]

Page 52

VGGNet

Conv = 3×3, s = 1, same convolutions; Maxpool = 2×2, s = 2

Still very common: VGG-16

Page 53

ResNet


[He et al. 2015]

Page 54

ResNet

• Xavier/2 initialization (by He et al.)
• SGD + momentum (0.9)
• Learning rate 0.1, divided by 10 when the validation error plateaus
• Mini-batch size 256
• Weight decay of 1e-5
• No dropout

[He et al. 2015]
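That recipe as a sketch (a tiny stand-in model replaces the actual ResNet):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # stand-in; the paper uses a ResNet
opt = torch.optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=1e-5)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.1)
# per epoch: sched.step(val_error)   # divides the lr by 10 on a plateau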

Page 55

ResNet
• If we make the network deeper, at some point performance starts to degrade
• Too many parameters: the optimizer cannot properly train the network

Page 56

ResNet
• If we make the network deeper, at some point performance starts to degrade

Page 57

Inception layer

Page 58

GoogLeNet: using the inception layer


[Szegedy et al. 2014]

Inception block

Page 59

CNN Architectures

Page 60

ADL4CV Content

Page 61

Rough Outline
• Lecture 1: Introduction
• Lecture 2: Advanced architectures (e.g., siamese networks, capsules, attention)
• Lecture 3: Advanced architectures (cont'd)
• Lecture 4: Visualization: t-SNE, Grad-CAM (activation heatmaps), DeepDream, excitation backprop
• Lecture 5: Bayesian deep learning
• Lecture 6: Autoencoders, VAEs
• Lecture 7: GANs 1: generative models, GANs
• Lecture 8: GANs 2: generative models, GANs
• Lecture 9: CNN++ / audio <-> visual: autoregressive models, PixelCNN
• Lecture 10: RNN -> NLP <-> visual Q&A (focus on the cross-domain: CNN for images, RNN for text)
• Lecture 11: Multi-dimensional CNNs, 3D DL, video DL: pooling vs. fully-convolutional operations… self-supervised / unsupervised learning
• Lecture 12: Domain adaptation / transfer learning

Page 62

ADL4CV
• Use Moodle!

• Enjoy the 6 vs. 30-40-ish relationship (make use of it to prepare yourself for research)

• Interactive Lectures!

Page 63

How to train your neural network?

Page 64

Is data loading correct?
• Data output (target): overfit to a single training sample (needs to reach 100% because the network just memorizes the input)
  – It's irrespective of the input!!!
• Data input: overfit to a handful (e.g., 4) of training samples
  – It's now conditioned on the input data
• Save and re-load data from PyTorch / TensorFlow
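A sketch of this overfit-a-handful test on stand-in data (sizes and step count are assumptions):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(4, 16), torch.randint(0, 10, (4,))   # a handful of samples

for step in range(500):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
print(loss.item())   # should be near zero; if not, suspect the data pipeline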

Page 65

Network debugging
• Move from overfitting to a handful of samples
  – 5, 10, 100, 1000…
  – At some point, we should see generalization
• Apply common sense: can we overfit to the current number of samples?
• Always be aware of the network parameter count!

Page 66

Check timings
• How long does each iteration take?
  – Get precise timings!!!
  – If an iteration takes > 500 ms, things get dicey…
• Where is the bottleneck: data loading vs. backprop?
  – Speed up data loading: smaller resolutions, compression; training from an SSD is a good idea
  – Speed up backprop
• Estimate total timings: how long until you see some pattern? How long until convergence?
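A sketch for precise per-iteration timings (note the synchronize calls: without them, queued GPU work is not counted). Time the data-loading loop and the forward/backward pass separately to locate the bottleneck.

import time
import torch

def time_per_iter(step_fn, iters=50):
    if torch.cuda.is_available():
        torch.cuda.synchronize()               # flush queued GPU work first
    t0 = time.perf_counter()
    for _ in range(iters):
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()               # wait for the last kernels
    return (time.perf_counter() - t0) / iters  # seconds per iteration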

Page 67

Network Architecture
• 100% mistake so far: "let's use a super-big network, train for two weeks, and see where we stand" [because we desperately need those 2%...]
• Start with the simplest network possible: rule of thumb, divide the number of layers you started with by 5.
• Get debug cycles down – ideally, minutes!!!

Page 68

Debugging
• Need train/val/test curves
  – Evaluation needs to be consistent!
  – Numbers need to be comparable
• Only make one change at a time
  – "I've added 5 more layers and doubled the training set, and now I also trained 5 days longer" – it's better, but WHY?

Page 69

Overfitting
ONLY THINK ABOUT THIS ONCE YOUR TRAINING LOSS GOES DOWN AND YOU CAN OVERFIT!

Typically try this order:
• Network too big – a smaller one also makes things faster
• More regularization, e.g., weight decay
• Not enough data – more data makes things slower!
• Dropout – makes things slower!
• Guideline: make training harder -> generalize better

Page 70

Pushing the limits!
PROCEED ONLY IF YOU GENERALIZE AND HAVE ADDRESSED OVERFITTING ISSUES!

• Bigger network -> more capacity, more power – but it also needs more data!
• Better architecture -> ResNet is typically the standard, but Inception-style architectures often perform better (e.g., Inception-v4, Xception, etc.)
• Schedules for learning-rate decay
• Class-based re-weighting (e.g., give under-represented classes a higher weight)
• Hyperparameter tuning: e.g., grid search; apply common sense!
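Two of these as sketches: a step learning-rate schedule and class-based re-weighting in the loss (the numbers are illustrative assumptions):

import torch
import torch.nn as nn

model = nn.Linear(16, 3)   # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)
# per epoch: opt.step() ... then sched.step()   # lr *= 0.1 every 30 epochs

# give the under-represented class 2 a 5x higher weight in the loss
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.0, 5.0]))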

Page 71

Bad signs…
• Train error doesn't go down…
• Validation error doesn't go down… (ahhh, we don't learn)
• Validation performs better than train… (trust me, this scenario is very unlikely – unless you have a bug)
• Testing on the train set gives a different error than training… (forgot dropout?)
• Often people mess up the last batch in an epoch…
• Your training set contains test data…
• You debug your algorithm on test data…

Page 72

"Most common" neural net mistakes
1) You didn't try to overfit a single batch first.
2) You forgot to toggle train/eval mode for the net.
3) You forgot to .zero_grad() (in PyTorch) before .backward().
4) You passed softmaxed outputs to a loss that expects raw logits.
5) You didn't use bias=False for your Linear/Conv2d layers when using BatchNorm, or conversely forgot to include it for the output layer.

Credit: A. Karpathy
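A training-step sketch that avoids mistakes 2)-5) (model and data are stand-ins):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, bias=False),   # 5) bias=False before BatchNorm
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 10),       # 5) output layer keeps its bias
)
opt = torch.optim.Adam(model.parameters())
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))

model.train()                         # 2) training mode (BatchNorm/Dropout behave correctly)
opt.zero_grad()                       # 3) clear old gradients first
loss = nn.functional.cross_entropy(model(x), y)   # 4) raw logits, no softmax
loss.backward()
opt.step()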

Page 73

Next lecture

• Next Monday: advanced architectures
  – Lecture always from 10am to 11:30am
• Keep projects in mind!
  – Start actively discussing -> reach out to us if you have questions!