Case Study of CNN from LeNet to ResNet

NamHyuk Ahn @ Ajou Univ. 2016. 03. 09

Convolutional Neural Network

Convolution Layer

- Convolve the image with a filter (a 3-dim dot product)

- Stack several filters in one layer (see the blue and green outputs; each output is called a channel)
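The "3-dim dot product" can be sketched with a naive loop. This is an illustration only, assuming no padding and a channels-last layout; `conv2d_naive` is a hypothetical helper name, and (as in most CNN libraries) the filter is not flipped, so strictly speaking it is a cross-correlation.

```python
import numpy as np

def conv2d_naive(image, filters, stride=1):
    """Valid (no padding) convolution.

    image:   (H, W, C)
    filters: (K, F, F, C), K filters that each span all C input channels
    returns: (H_out, W_out, K), one output channel per filter
    """
    H, W, C = image.shape
    K, F, _, _ = filters.shape
    H_out = (H - F) // stride + 1
    W_out = (W - F) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):
        for i in range(H_out):
            for j in range(W_out):
                r, c = i * stride, j * stride
                patch = image[r:r + F, c:c + F, :]
                # 3-dim dot product between the local patch and the filter
                out[i, j, k] = np.sum(patch * filters[k])
    return out

image = np.ones((5, 5, 3))
filters = np.ones((2, 3, 3, 3))      # two stacked filters -> two channels
out = conv2d_naive(image, filters)   # shape (3, 3, 2), every value is 27
```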

Convolution Layer

- Local connectivity

• Instead of connecting every pixel to every neuron, connect each neuron only to a local region of the input (called the receptive field)

• This greatly reduces the number of parameters

- Parameter sharing

• To reduce parameters further, all positions within one output channel share the same filter (# of filters == # of output channels)
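A rough count shows how much each idea saves. The layer sizes below (a 32x32x3 input, 5x5 filters, a 28x28x6 output) are illustrative assumptions, not taken from the slides, and biases are ignored.

```python
def fully_connected_weights(in_shape, out_shape):
    """Every output neuron sees every input value."""
    h, w, c = in_shape
    oh, ow, k = out_shape
    return (h * w * c) * (oh * ow * k)

def locally_connected_weights(out_shape, f, c):
    """Every output neuron sees only an f x f x c receptive field."""
    oh, ow, k = out_shape
    return (oh * ow * k) * (f * f * c)

def shared_filter_weights(f, c, k):
    """All positions in one output channel reuse the same filter."""
    return k * (f * f * c)

in_shape, out_shape = (32, 32, 3), (28, 28, 6)
fc = fully_connected_weights(in_shape, out_shape)        # 14,450,688
local = locally_connected_weights(out_shape, f=5, c=3)   # 352,800
shared = shared_filter_weights(f=5, c=3, k=6)            # 450
```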

Convolution Layer

- Example: 1st conv layer in AlexNet

• Input: [224, 224], filter: [11x11x3] x 96, output: [55, 55]

- Each filter extracts a different feature (e.g. horizontal edges, vertical edges…)
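The output size follows from (N + 2P - F) / S + 1. The sketch below assumes a stride of 4 and no padding; the stride comes from the AlexNet paper, not from this slide. Under those assumptions a 224 input does not tile evenly, and the listed 55 is obtained for a 227 input, so treat the 224 as approximate.

```python
def conv_output_size(n, f, stride, pad=0):
    """Spatial output size of a conv layer: (n + 2*pad - f) / stride + 1."""
    span = n + 2 * pad - f
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // stride + 1

out_227 = conv_output_size(227, f=11, stride=4)   # 55

# Weights in the layer: 96 filters of 11x11x3 (biases ignored)
n_weights = 11 * 11 * 3 * 96                      # 34,848
```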

Pooling Layer

- Downsample the image to reduce its size (and the parameters needed by later layers)

- Usually max pooling is used (take the maximum value in each region)
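Max pooling can be sketched as follows, assuming a 2x2 window with stride 2 (a common choice, not stated on the slide) and a channels-last layout; `max_pool2d` is a hypothetical helper name.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over (H, W, C); each channel is pooled independently."""
    H, W, C = x.shape
    H_out = (H - size) // stride + 1
    W_out = (W - size) // stride + 1
    out = np.empty((H_out, W_out, C))
    for i in range(H_out):
        for j in range(W_out):
            r, c = i * stride, j * stride
            window = x[r:r + size, c:c + size, :]
            out[i, j, :] = window.max(axis=(0, 1))
    return out

x = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = max_pool2d(x)   # 4x4 -> 2x2: [[5, 7], [13, 15]]
```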

ReLU, FC Layer

- ReLU

• A kind of activation function (like sigmoid, tanh…)

- Fully-connected Layer

• Same as in an ordinary neural network

Convolutional Neural Network


Training CNN

1. Calculate the loss function with forward-prop

2. Optimize the parameters w.r.t. the loss function with back-prop

• Use a gradient descent method (SGD)

• The gradient of each weight can be calculated with the chain rule of partial derivatives
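The two steps can be sketched on a toy problem. This is a minimal illustration, assuming a single linear layer with a squared-error loss in place of a CNN; the data, learning rate, and batch size are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w                       # noise-free targets

w = np.zeros(3)
lr = 0.1
losses = []
for step in range(200):
    idx = rng.choice(64, size=16, replace=False)    # mini-batch
    xb, yb = x[idx], y[idx]

    # 1. forward-prop: prediction and loss
    pred = xb @ w
    loss = np.mean((pred - yb) ** 2)
    losses.append(loss)

    # 2. back-prop: chain rule, dL/dw = (dpred/dw)^T dL/dpred
    dloss_dpred = 2.0 * (pred - yb) / len(yb)
    grad_w = xb.T @ dloss_dpred

    # SGD update
    w -= lr * grad_w
```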

ILSVRC trend

AlexNet (2012) (ILSVRC 2012 winner)


- ReLU

- Data augmentation

- Dropout

- Ensemble of CNNs (1 CNN: 18.2%, 7 CNNs: 15.4% top-5 error)


- Other methods (not covered today)

• SGD + momentum (+ mini-batch)

• Multiple GPU

• Weight Decay

• Local Response Normalization

Problems of sigmoid

- Vanishing gradient

• When the gradient passes back through a sigmoid, it can vanish because the sigmoid's local gradient can be almost zero

- Output is not zero-centered

• This makes gradient updates less efficient and hurts performance

ReLU

- SGD converges faster than with sigmoid-like activations

- Computationally cheap
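The local gradients can be compared directly. A small sketch: the sigmoid's gradient peaks at 0.25 and is nearly zero for large |x|, while the ReLU's gradient is exactly 1 for every positive input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, reached at x = 0

def relu_grad(x):
    return (x > 0).astype(float)  # 1 for positive inputs, 0 otherwise

x = np.array([-10.0, 0.0, 10.0])
sg = sigmoid_grad(x)              # tiny at both ends, 0.25 in the middle
rg = relu_grad(x)                 # [0, 0, 1]

# Even in the best case, ten sigmoids in a row scale the gradient by 0.25**10
best_case_after_10_layers = 0.25 ** 10
```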

Data augmentation

- Randomly crop [224, 224] patches from [256, 256] images

- At test time, crop 5 patches and average their predictions
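Both steps can be sketched as below. The slide does not say which 5 patches are used; four corners plus the center is an assumption here, and `model` is a hypothetical callable that maps one patch to class probabilities.

```python
import numpy as np

def random_crop(img, size=224, rng=None):
    """Training time: take one random size x size patch."""
    if rng is None:
        rng = np.random.default_rng()
    H, W = img.shape[:2]
    top = rng.integers(0, H - size + 1)
    left = rng.integers(0, W - size + 1)
    return img[top:top + size, left:left + size]

def five_crops(img, size=224):
    """Test time: four corners and the center (assumed choice of patches)."""
    H, W = img.shape[:2]
    offsets = [(0, 0), (0, W - size), (H - size, 0), (H - size, W - size),
               ((H - size) // 2, (W - size) // 2)]
    return [img[t:t + size, l:l + size] for t, l in offsets]

def predict_averaged(model, img):
    """Average the model's predictions over the five patches."""
    return np.mean([model(patch) for patch in five_crops(img)], axis=0)
```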

Dropout

- Similar to bagging (an approximation of bagging)

- Acts like a regularizer (reduces overfitting)

- Instead of using all neurons, randomly “drop out” some neurons (usually with probability 0.5)

Dropout

• At test time, do not drop neurons; instead scale each neuron's output by the keep probability (usually 0.5)

• The scaled output is the expected value of each neuron's output under dropout
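The train and test behavior can be sketched as follows, assuming the plain (non-inverted) form of dropout described here; `p_keep` is a hypothetical parameter name.

```python
import numpy as np

def dropout_train(x, p_keep=0.5, rng=None):
    """Training: zero each activation with probability 1 - p_keep."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) < p_keep
    return x * mask

def dropout_test(x, p_keep=0.5):
    """Test: keep every neuron, scaled to its expected training-time value."""
    return x * p_keep

x = np.ones(100_000)
train_out = dropout_train(x, rng=np.random.default_rng(0))
test_out = dropout_test(x)
# train_out.mean() is close to 0.5, which matches test_out exactly
```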


- conv - pool - … - fc - softmax (similar to LeNet)

- Uses large filters (e.g. 11x11)


- Weights must be initialized randomly

• If not, all neurons get the same gradient

• Usually a Gaussian distribution with std = 0.01 is used

- Use mini-batch SGD with momentum to update the weights
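Initialization and the momentum update can be sketched as below. The layer shape, learning rate, and momentum coefficient are illustrative assumptions, and the toy objective 0.5 * ||w||^2 (whose gradient is w) stands in for a real loss.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.01, size=(4, 3))   # Gaussian init, std = 0.01

def momentum_step(w, grad, velocity, lr=0.1, mu=0.9):
    """One SGD + momentum update: the velocity accumulates past gradients."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# Toy check: minimize 0.5 * ||w||^2, where grad = w
w = np.ones(3)
v = np.zeros(3)
for _ in range(500):
    w, v = momentum_step(w, grad=w, velocity=v)
```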

VGGNet (2014) (ILSVRC 2014 2nd)


- Use small kernels (always 3x3)

• Can stack more non-linearities (e.g. ReLU)

• Fewer weights to train

- Heavier data augmentation (more than AlexNet)

- Ensemble of 7 models (ILSVRC submission: 7.3% top-5 error)
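The "fewer weights" point can be checked by comparing three stacked 3x3 layers with one 7x7 layer, which cover the same receptive field. A sketch assuming stride 1, 64 channels in and out (an illustrative width), and no biases:

```python
def stacked_receptive_field(kernel, n_layers):
    """Receptive field of n stacked stride-1 conv layers."""
    return 1 + n_layers * (kernel - 1)

def conv_weights(kernel, c_in, c_out):
    return kernel * kernel * c_in * c_out

rf_three_3x3 = stacked_receptive_field(3, 3)    # 7, same as one 7x7 layer
three_3x3 = 3 * conv_weights(3, 64, 64)         # 110,592 weights, 3 ReLUs
one_7x7 = conv_weights(7, 64, 64)               # 200,704 weights, 1 ReLU
```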


- Most memory is needed in the early layers; most parameters are in the fc layers.

GoogLeNet - Inception v1 (2014) (ILSVRC 2014 winner)


Inception module

- Use 1x1, 3x3 and 5x5 conv simultaneously to capture a variety of structures

- 1x1 captures dense structure; 3x3 and 5x5 capture more spread-out structure

- Computationally expensive

• Use 1x1 conv layers to reduce dimensions (explained in detail later, in ResNet)
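The saving from a 1x1 reduction can be estimated by counting multiply-adds for a 5x5 branch. The sizes below (a 28x28 map, 192 input channels, 16 reduced channels, 32 output channels) are illustrative assumptions rather than figures from the slides.

```python
def conv_mult_adds(h, w, kernel, c_in, c_out):
    """Multiply-adds of a stride-1 conv that keeps the spatial size."""
    return h * w * kernel * kernel * c_in * c_out

h = w = 28
c_in, c_mid, c_out = 192, 16, 32

# 5x5 conv applied directly to all 192 channels
direct = conv_mult_adds(h, w, 5, c_in, c_out)             # 120,422,400

# 1x1 conv down to 16 channels first, then the 5x5 conv
reduced = (conv_mult_adds(h, w, 1, c_in, c_mid)
           + conv_mult_adds(h, w, 5, c_mid, c_out))       # 12,443,648
```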

Auxiliary Classifiers

- A deep network raises concerns about how effectively the gradient propagates in backprop

- The auxiliary losses are added to the total loss (weighted by 0.3); the auxiliary classifiers are removed at test time

Average Pooling

- Proposed in Network in Network (also used in GoogLeNet)

- Problems of the fc layer

• Needs lots of parameters, easy to overfit

- Replace the fc layer with average pooling

Average Pooling

- Make the number of channels in the last conv layer equal to the # of classes

- Calculate the average of each channel, and pass it to softmax

- Reduces overfitting
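Global average pooling followed by softmax can be sketched as below, assuming a 7x7 final feature map with 10 channels (one per class); both sizes are made up for the example.

```python
import numpy as np

def global_average_pool(fmap):
    """(H, W, n_classes) -> (n_classes,): one average per channel."""
    return fmap.mean(axis=(0, 1))

def softmax(z):
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

fmap = np.random.default_rng(0).normal(size=(7, 7, 10))
scores = global_average_pool(fmap)   # no weights to learn in this step
probs = softmax(scores)
```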

MSRA ResNet (2015) (ILSVRC 2015 winner)

Before ResNet…

- Need to know about:

• PReLU

• Xavier Initialization

• Batch Normalization

PReLU

- Adaptive version of ReLU

- Learns the slope of the function for x < 0

- Slightly more parameters (# of layers x # of channels)
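A minimal PReLU sketch, assuming one learnable slope per channel with a channels-last layout. The starting slope of 0.25 follows the PReLU paper and is not stated on the slide.

```python
import numpy as np

def prelu(x, a):
    """x for x > 0, a * x otherwise; `a` has one slope per channel."""
    return np.where(x > 0, x, a * x)

def prelu_grad_a(x):
    """d(prelu)/da: 0 where x > 0, x elsewhere (used to train the slope)."""
    return np.where(x > 0, 0.0, x)

x = np.array([[-2.0, 3.0],
              [4.0, -1.0]])        # 2 samples, 2 channels
a = np.full(2, 0.25)               # one slope per channel
y = prelu(x, a)                    # [[-0.5, 3.0], [4.0, -0.25]]
```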

Xavier Initialization

- If weights are initialized from a Gaussian distribution with a small std (e.g. 0.01), the outputs of neurons will be nearly zero when the network is deep

- If the std is increased (1.0), the outputs saturate to -1 or 1

- Xavier init chooses the initial scale from the number of input neurons

- Looks fine, but this method assumes a linear activation, so it does not work well in ReLU-like networks

[Figures: histograms of layer outputs, labeled "output is saturated" and "output is vanished"]

Xavier Initialization / 2
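The behavior of these initializations can be reproduced with a small experiment. This is a sketch under assumptions: a 10-layer fully-connected network of width 500 (sizes chosen for illustration), tanh as the saturating activation, Xavier taken as std = 1/sqrt(n_in), and "Xavier / 2" taken as std = sqrt(2/n_in), i.e. dividing n_in by 2.

```python
import numpy as np

def forward(init, act, n_layers=10, width=500, batch=200, seed=0):
    """Push random inputs through n_layers and return the last activations."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    for _ in range(n_layers):
        h = act(h @ init(rng, width, width))
    return h

def gaussian(std):
    return lambda rng, n_in, n_out: rng.normal(0.0, std, (n_in, n_out))

def xavier(rng, n_in, n_out):
    return rng.normal(0.0, 1.0, (n_in, n_out)) / np.sqrt(n_in)

def xavier_over_2(rng, n_in, n_out):
    return rng.normal(0.0, 1.0, (n_in, n_out)) / np.sqrt(n_in / 2.0)

def relu(x):
    return np.maximum(x, 0.0)

vanished = forward(gaussian(0.01), np.tanh)    # outputs collapse toward 0
saturated = forward(gaussian(1.0), np.tanh)    # outputs pile up at -1 / 1
tanh_xavier = forward(xavier, np.tanh)         # outputs keep a usable spread
relu_xavier = forward(xavier, relu)            # spread shrinks layer by layer
relu_xavier2 = forward(xavier_over_2, relu)    # spread is preserved
```

Printing `h.std()` for each case makes the contrast visible without plotting histograms.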

Batch Normalization

- Make the output follow a Gaussian distribution, but full normalization costs a lot

• Calculate mean and variance per dimension (assume the dimensions are uncorrelated)

• Calculate mean and variance over the mini-batch (not the entire set)

- Normalization restricts the non-linearity, and assuming uncorrelated dimensions restricts the network

• So apply a linear transform to the normalized output (its scale and shift are learned parameters)

Batch Normalization

- At test time, use the mean and variance of the entire set (estimated with a moving average)

- BN acts like a regularizer (Dropout may not be needed)
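The train and test paths can be sketched as below for a fully-connected layer (for conv layers the statistics would be taken per channel). The class name, the moving-average `momentum` of 0.9, and `eps` are assumptions for illustration; only the forward pass is shown.

```python
import numpy as np

class BatchNorm1d:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(dim)          # learned scale
        self.beta = np.zeros(dim)          # learned shift
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training):
        if training:
            # per-dimension statistics of the current mini-batch
            mean, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # test time: moving averages stand in for whole-set statistics
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta   # linear transform

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(256, 4))
bn = BatchNorm1d(4)
for _ in range(100):
    train_out = bn.forward(x, training=True)
test_out = bn.forward(x, training=False)
```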



Problem of degradation

- More depth gives more accuracy, but gradients in a deep network can vanish/explode

• BN, Xavier init, and Dropout can handle this (up to ~30 layers)

- Going even deeper, the degradation problem occurs

• It is not just overfitting: the training error also increases

Deep Residual Learning

- Add F(x) and the shortcut connection element-wise, then pass the sum through a ReLU non-linearity

- When the dims of x and F(x) are unequal (the number of channels changes), linearly project x to match the dims (done by a 1x1 conv)

- Similar to LSTM
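The residual block can be sketched as follows. As a simplifying assumption, dense matrices stand in for the conv layers (and `W_proj` for the 1x1 projection), and batch normalization is left out.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2, W_proj=None):
    """y = ReLU(F(x) + shortcut), with F(x) = ReLU(x W1) W2."""
    f = relu(x @ W1) @ W2
    # identity shortcut, or a linear projection when the dims differ
    shortcut = x if W_proj is None else x @ W_proj
    return relu(f + shortcut)

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(4, 8)))          # non-negative inputs

# If F(x) is zero, the block simply passes x through the shortcut
same = residual_block(x, np.zeros((8, 16)), np.zeros((16, 8)))

# Changing the width from 8 to 12 needs a projection on the shortcut
wider = residual_block(x, rng.normal(size=(8, 16)),
                       rng.normal(size=(16, 12)),
                       W_proj=rng.normal(size=(8, 12)))
```

The first call shows why extra layers need not hurt: driving F(x) to zero leaves an identity mapping.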

Deeper Bottleneck

- To reduce training time, switch to a bottleneck design (purely for economic reasons)

• (3x3x3)x64x64 + (3x3x3)x64x64 = 221184 (left)

• (1x1x3)x256x64 + (3x3x3)x64x64 + (1x1x3)x64x256 = 208896 (right)

• The right block is wider (more channels), but has a similar number of parameters

• A similar method is also used in GoogLeNet
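The comparison can be recomputed with the usual weight count for a conv layer, kernel height x kernel width x input channels x output channels, ignoring biases. The slide's totals carry an extra factor of 3 in every term (221184 = 3 x 73728 and 208896 = 3 x 69632), which does not change the ratio between the two blocks.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights in a conv layer: kernel * kernel * c_in * c_out."""
    return kernel * kernel * c_in * c_out

# Left: two 3x3 convs on 64 channels
plain = conv_weights(3, 64, 64) + conv_weights(3, 64, 64)     # 73,728

# Right: 1x1 down to 64, 3x3 on 64, 1x1 back up to 256
bottleneck = (conv_weights(1, 256, 64)
              + conv_weights(3, 64, 64)
              + conv_weights(1, 64, 256))                     # 69,632
```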


- Data augmentation as in AlexNet

- Batch Normalization (no dropout)

- Xavier / 2 initialization

- Average pooling

- Structure follows VGGNet style



- Dropout, BN

- ReLU-like activation (e.g. PReLU, ELU..)

- Xavier initialization

- Average pooling

- Use pre-trained model :)

- Thanks also to CS231n: I used some figures from the CS231n lecture slides. See

