
EE-559 – Deep learning

7. Networks for computer vision

François Fleuret

https://fleuret.org/dlc/

[version of: June 8, 2018]

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Tasks and data-sets

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 2 / 89


Computer vision tasks:

• classification,

• object detection,

• semantic or instance segmentation,

• other (tracking in videos, camera pose estimation, body pose estimation, 3d reconstruction, denoising, super-resolution, auto-captioning, synthesis, etc.)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 3 / 89

“Small scale” classification data-sets.

MNIST and Fashion-MNIST: 10 classes (digits or pieces of clothing), 50,000 train images, 10,000 test images, 28×28 grayscale.

(LeCun et al., 1998; Xiao et al., 2017)

CIFAR10 and CIFAR100 (10 classes and 5 × 20 “super classes”), 50,000 train images, 10,000 test images, 32×32 RGB.

(Krizhevsky, 2009, chap. 3)
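All of these data-sets are available directly through torchvision; a minimal sketch (the root directory './data' is an arbitrary choice):

import torchvision

# Download (if needed) and load the data-sets as tensors.
to_tensor = torchvision.transforms.ToTensor()

mnist_train = torchvision.datasets.MNIST('./data', train = True,  download = True, transform = to_tensor)
mnist_test  = torchvision.datasets.MNIST('./data', train = False, download = True, transform = to_tensor)
cifar_train = torchvision.datasets.CIFAR10('./data', train = True, download = True, transform = to_tensor)

x, y = mnist_train[0]
print(x.size(), y)   # a 1x28x28 tensor and its class label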

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 4 / 89


ImageNet

http://www.image-net.org/

This data-set is built by filling the leaves of the “WordNet” hierarchy, which are called “synsets”, for “sets of synonyms”.

• 21,841 non-empty synsets,

• 14,197,122 images,

• 1,034,908 images with bounding box annotations.

ImageNet Large Scale Visual Recognition Challenge 2012

• 1,000 classes taken among all synsets,

• 1,200,000 training and 50,000 validation images.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 5 / 89

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 6 / 89


n02123394_2084.xml

<annotation>
  <folder>n02123394</folder>
  <filename>n02123394_2084</filename>
  <source>
    <database>ImageNet database</database>
  </source>
  <size>
    <width>500</width>
    <height>375</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>n02123394</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>265</xmin>
      <ymin>185</ymin>
      <xmax>470</xmax>
      <ymax>374</ymax>
    </bndbox>
  </object>
  <object>
    <name>n02123394</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>90</xmin>
      <ymin>1</ymin>
      <xmax>323</xmax>
      <ymax>353</ymax>
    </bndbox>
  </object>
</annotation>

n02123394_2084.JPEG
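Such annotation files can be parsed with the standard library; a minimal sketch, assuming the file above has been saved locally as n02123394_2084.xml:

import xml.etree.ElementTree as ET

tree = ET.parse('n02123394_2084.xml')

# Extract (class, bounding box) pairs from the annotation.
for obj in tree.findall('object'):
    name = obj.find('name').text
    box  = obj.find('bndbox')
    xmin, ymin = int(box.find('xmin').text), int(box.find('ymin').text)
    xmax, ymax = int(box.find('xmax').text), int(box.find('ymax').text)
    print(name, (xmin, ymin, xmax, ymax))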

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 7 / 89

Cityscapes data-set

https://www.cityscapes-dataset.com/

Images from 50 cities over several months; each is the 20th image of a 30-frame video snippet (1.8s). Meta-data about vehicle position + depth.

• 30 classes:
  • flat: road, sidewalk, parking, rail track
  • human: person, rider
  • vehicle: car, truck, bus, on rails, motorcycle, bicycle, caravan, trailer
  • construction: building, wall, fence, guard rail, bridge, tunnel
  • object: pole, pole group, traffic sign, traffic light
  • nature: vegetation, terrain
  • sky: sky
  • void: ground, dynamic, static

• 5,000 images with fine annotations

• 20,000 images with coarse annotations.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 8 / 89


Cityscapes fine annotations (5,000 images)

Cityscapes coarse annotations (20,000 images)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 9 / 89

Tasks and performance measures

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 10 / 89


Image classification consists of predicting the class of an image, which is often the class of the “main object” visible in it.

The standard performance measures are:

• The error rate P(f(X) ≠ Y), or conversely the accuracy P(f(X) = Y),

• the balanced error rate (BER):

  BER = 1/C ∑_{y=1}^{C} P(f(X) ≠ y | Y = y).
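Both measures can be estimated in a few lines from predicted and true labels; a minimal sketch with made-up tensors:

import torch

nb_classes = 10
y    = torch.LongTensor(1000).random_(nb_classes)    # made-up ground-truth labels
pred = torch.LongTensor(1000).random_(nb_classes)    # made-up predicted labels

error_rate = (pred != y).float().mean()

# Balanced error rate: the average of the per-class error rates.
per_class = torch.stack([ (pred[y == c] != c).float().mean() for c in range(nb_classes) ])
ber = per_class.mean()

print(error_rate.item(), ber.item())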

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 11 / 89

In the two-class case, we can define the True Positive (TP) rate as P(f(X) = 1 | Y = 1) and the False Positive (FP) rate as P(f(X) = 1 | Y = 0).

The ideal algorithm would have TP ≃ 1 and FP ≃ 0.

Most of the algorithms produce a score, and the decision threshold is application-dependent:

• Cancer detection: low threshold to get a high TP rate (you do not want to miss a cancer), at the cost of a high FP rate (it will be double-checked by an oncologist anyway),

• Image retrieval: high threshold to get a low FP rate (you do not want to bring an image that does not match the request), at the cost of a low TP rate (you have so many images that missing a lot is not an issue).

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 12 / 89


In that case, a standard performance representation is the Receiver Operating Characteristic (ROC), which shows the performance at multiple thresholds.

It is the minimum increasing function above the points of True Positive (TP) rate P(f(X) = 1 | Y = 1) vs. False Positive (FP) rate P(f(X) = 1 | Y = 0) obtained by varying the decision threshold.

[ROC curve: TP rate (y axis, from 0.9 to 1) as a function of FP rate (x axis, from 0 to 0.1).]

A standard measure is the area under the curve (AUC).
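A minimal sketch of how the ROC points and the AUC can be computed from scores and binary labels (made-up data; in practice a library routine such as sklearn.metrics.roc_curve would be the usual choice):

import torch

scores = torch.randn(1000)                      # made-up classifier scores
labels = torch.LongTensor(1000).random_(2)      # made-up 0/1 ground-truth labels

# Sort the samples by decreasing score and sweep the threshold.
order = scores.sort(descending = True)[1]
l = labels[order].float()

tp = l.cumsum(0) / l.sum()                      # TP rate at each threshold
fp = (1 - l).cumsum(0) / (1 - l).sum()          # FP rate at each threshold

# Area under the step-wise curve.
dfp = fp - torch.cat((torch.zeros(1), fp[:-1]), 0)
auc = (tp * dfp).sum()
print(auc.item())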

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 13 / 89

Object detection aims at predicting classes and locations of targets in an image. The notion of “location” is ill-defined. In the standard setup, the output of the predictor is a series of bounding boxes, each with a class label.

A standard performance assessment considers that a predicted bounding box B is correct if there is an annotated bounding box B′ for that class such that the Intersection over Union (IoU) is large enough:

area(B ∩ B′) / area(B ∪ B′) ≥ 1/2.

[Illustration: two overlapping boxes B and B′, their intersection B ∩ B′, and their union B ∪ B′.]
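A minimal sketch of this criterion for axis-aligned boxes given as (xmin, ymin, xmax, ymax) tuples (the helper name iou is ours; the two boxes are the ones of the ImageNet annotation shown earlier):

def iou(a, b):
    # Intersection rectangle, empty if the boxes do not overlap.
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, xmax - xmin) * max(0, ymax - ymin)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A predicted box counts as correct if its IoU with an annotated box is >= 1/2.
print(iou((265, 185, 470, 374), (90, 1, 323, 353)) >= 0.5)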

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 14 / 89


Image segmentation consists of labeling individual pixels with the class of the object they belong to, and may also involve predicting the instance they belong to.

The standard performance measure frames the task as a classification one. For VOC2012, the segmentation accuracy (SA) for a class is defined as

SA = n / (n + e)

where

• n is the number of pixels of the right class, predicted as such,

• e is the number of pixels erroneously labeled.
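A minimal sketch of this measure for a single class, given predicted and ground-truth label maps as integer tensors (the helper name segmentation_accuracy is ours; we read e as the pixels labeled with the class in exactly one of the two maps, which makes SA the per-class intersection over union):

import torch

def segmentation_accuracy(pred, target, c):
    p, t = (pred == c), (target == c)
    n = (p & t).long().sum()    # pixels of class c predicted as such
    e = (p ^ t).long().sum()    # pixels erroneously labeled, in either direction
    return n.float() / (n + e).float()

pred   = torch.LongTensor(375, 500).random_(21)   # made-up 21-class label maps
target = torch.LongTensor(375, 500).random_(21)
print(segmentation_accuracy(pred, target, 3).item())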

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 15 / 89

All these performance measures are debatable, and in practice they are highly application-dependent.

In spite of their weaknesses, the ones adopted as standards by the community enable an assessment of the field’s “long-term progress”.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 16 / 89


Image classification, standard convnets

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 17 / 89

The most standard networks for image classification are the LeNet family (LeCun et al., 1998), and its modern extensions, among which AlexNet (Krizhevsky et al., 2012) and VGGNet (Simonyan and Zisserman, 2014).

They share a common structure of several convolutional layers seen as a feature extractor, followed by fully connected layers seen as a classifier.

The performance of AlexNet was a wake-up call for the computer vision community, as it vastly out-performed other methods in spite of its simplicity.

Recent advances rely on moving from standard convolutional layers to local complex architectures to reduce the model size.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 18 / 89


torchvision.models provides a collection of reference networks for computer vision, e.g.:

import torchvision

alexnet = torchvision.models.alexnet()

The trained models can be obtained by passing pretrained = True to the constructor(s). This may involve a heavy download, given their size.

⚠ The networks from PyTorch listed in the coming slides may differ slightly from the reference papers which introduced them historically.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 19 / 89

LeNet5 (LeCun et al., 1989). 10 classes, input 1×28×28.

(features): Sequential (

(0): Conv2d(1, 6, kernel_size =(5, 5), stride =(1, 1))

(1): ReLU (inplace)

(2): MaxPool2d (size=(2, 2), stride =(2, 2), dilation =(1, 1))

(3): Conv2d(6, 16, kernel_size =(5, 5), stride =(1, 1))

(4): ReLU (inplace)

(5): MaxPool2d (size=(2, 2), stride =(2, 2), dilation =(1, 1))

)

(classifier): Sequential (

(0): Linear (256 -> 120)

(1): ReLU (inplace)

(2): Linear (120 -> 84)

(3): ReLU (inplace)

(4): Linear (84 -> 10)

)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 20 / 89


AlexNet (Krizhevsky et al., 2012). 1,000 classes, input 3×224×224.

(features): Sequential (

(0): Conv2d(3, 64, kernel_size =(11, 11), stride =(4, 4), padding =(2, 2))

(1): ReLU (inplace)

(2): MaxPool2d (size=(3, 3), stride =(2, 2), dilation =(1, 1))

(3): Conv2d (64, 192, kernel_size =(5, 5), stride =(1, 1), padding =(2, 2))

(4): ReLU (inplace)

(5): MaxPool2d (size=(3, 3), stride =(2, 2), dilation =(1, 1))

(6): Conv2d (192, 384, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(7): ReLU (inplace)

(8): Conv2d (384, 256, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(9): ReLU (inplace)

(10): Conv2d (256, 256, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(11): ReLU (inplace)

(12): MaxPool2d (size=(3, 3), stride =(2, 2), dilation =(1, 1))

)

(classifier): Sequential (

(0): Dropout (p = 0.5)

(1): Linear (9216 -> 4096)

(2): ReLU (inplace)

(3): Dropout (p = 0.5)

(4): Linear (4096 -> 4096)

(5): ReLU (inplace)

(6): Linear (4096 -> 1000)

)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 21 / 89

Krizhevsky et al. used data augmentation during training to reduce over-fitting.

They generated 2,048 samples from every original training example through two classes of transformations:

• crop a 224×224 image at a random position in the original 256×256, and randomly reflect it horizontally,

• apply a color transformation using a PCA model of the color distribution.

During test, the prediction is averaged over five random crops and their horizontal reflections.
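The geometric part of this augmentation can be reproduced with torchvision’s transforms; a minimal sketch (the PCA-based color transformation is not a built-in transform and is omitted here):

import torchvision.transforms as T

# Training-time augmentation: random 224x224 crops with horizontal flips.
train_transform = T.Compose([
    T.Resize(256),               # rescale the image (smaller side to 256)
    T.RandomCrop(224),           # random 224x224 crop
    T.RandomHorizontalFlip(),    # random horizontal reflection
    T.ToTensor(),
])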

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 22 / 89


VGGNet19 (Simonyan and Zisserman, 2014). 1,000 classes, input 3×224×224. 16 convolutional layers + 3 fully connected layers.

(features): Sequential (

(0): Conv2d(3, 64, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(1): ReLU (inplace)

(2): Conv2d (64, 64, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(3): ReLU (inplace)

(4): MaxPool2d (size=(2, 2), stride =(2, 2), dilation =(1, 1))

(5): Conv2d (64, 128, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(6): ReLU (inplace)

(7): Conv2d (128, 128, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(8): ReLU (inplace)

(9): MaxPool2d (size=(2, 2), stride =(2, 2), dilation =(1, 1))

(10): Conv2d (128, 256, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(11): ReLU (inplace)

(12): Conv2d (256, 256, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(13): ReLU (inplace)

(14): Conv2d (256, 256, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(15): ReLU (inplace)

(16): Conv2d (256, 256, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(17): ReLU (inplace)

(18): MaxPool2d (size=(2, 2), stride =(2, 2), dilation =(1, 1))

(19): Conv2d (256, 512, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(20): ReLU (inplace)

(21): Conv2d (512, 512, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(22): ReLU (inplace)

(23): Conv2d (512, 512, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(24): ReLU (inplace)

(25): Conv2d (512, 512, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(26): ReLU (inplace)

(27): MaxPool2d (size=(2, 2), stride =(2, 2), dilation =(1, 1))

(28): Conv2d (512, 512, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(29): ReLU (inplace)

(30): Conv2d (512, 512, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(31): ReLU (inplace)

(32): Conv2d (512, 512, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(33): ReLU (inplace)

(34): Conv2d (512, 512, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(35): ReLU (inplace)

(36): MaxPool2d (size=(2, 2), stride =(2, 2), dilation =(1, 1))

)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 23 / 89

VGGNet19 (cont.)

(classifier): Sequential (

(0): Linear (25088 -> 4096)

(1): ReLU (inplace)

(2): Dropout (p = 0.5)

(3): Linear (4096 -> 4096)

(4): ReLU (inplace)

(5): Dropout (p = 0.5)

(6): Linear (4096 -> 1000)

)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 24 / 89


We can illustrate the convenience of these pre-trained models on a simple image-classification problem.

To be sure this picture did not appear in the training data, it was not taken from the web.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 25 / 89

import PIL, torch, torchvision
from torch.autograd import Variable

# Load and normalize the image
img = torchvision.transforms.ToTensor()(PIL.Image.open('blacklab.jpg'))
img = img.view(1, img.size(0), img.size(1), img.size(2))
img = 0.5 + 0.5 * (img - img.mean()) / img.std()

# Load an already trained network and compute its prediction
alexnet = torchvision.models.alexnet(pretrained = True)
alexnet.eval()
output = alexnet(Variable(img))

# Print the classes
scores, indexes = output.data.view(-1).sort(descending = True)
class_names = eval(open('imagenet1000_clsid_to_human.txt', 'r').read())
for k in range(15):
    print('#{:d} ({:.02f}) {:s}'.format(k, scores[k], class_names[indexes[k]]))

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 26 / 89


#1 (12.26) Weimaraner
#2 (10.95) Chesapeake Bay retriever
#3 (10.87) Labrador retriever
#4 (10.10) Staffordshire bullterrier, Staffordshire bull terrier
#5 (9.55) flat-coated retriever
#6 (9.40) Italian greyhound
#7 (9.31) American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier
#8 (9.12) Great Dane
#9 (8.94) German short-haired pointer
#10 (8.53) Doberman, Doberman pinscher
#11 (8.35) Rottweiler
#12 (8.25) kelpie
#13 (8.24) barrow, garden cart, lawn cart, wheelbarrow
#14 (8.12) bucket, pail
#15 (8.07) soccer ball

[Images: Weimaraner, Chesapeake Bay retriever]

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 27 / 89

Fully convolutional networks

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 28 / 89


In many applications, standard convolutional networks are made fully convolutional by converting their fully connected layers to convolutional ones.

[Diagram: the activation tensor x(l), of size C × H × W, is either reshaped into a CHW vector and processed by fully connected layers of weights w(l+1) and w(l+2) to produce x(l+1) and x(l+2), or processed directly by equivalent convolutional layers with the same weights.]

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 29 / 89


Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 30 / 89


This “convolutionization” does not change anything if the input size is such that the output has a single spatial cell, but it fully re-uses computation to get a prediction at multiple locations when the input is larger.


Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 31 / 89

We can write a routine that transforms a series of layers from a standard convnet to make it fully convolutional:

import torch, torchvision
from torch import nn
from torch.autograd import Variable

def convolutionize(layers, input_size):
    l = []
    x = Variable(torch.zeros(torch.Size((1, ) + input_size)))
    for m in layers:
        if isinstance(m, nn.Linear):
            # Replace the linear layer with a convolution whose kernel
            # covers the full spatial extent of its input
            n = nn.Conv2d(in_channels = x.size(1),
                          out_channels = m.weight.size(0),
                          kernel_size = (x.size(2), x.size(3)))
            n.weight.data.view(-1).copy_(m.weight.data.view(-1))
            n.bias.data.view(-1).copy_(m.bias.data.view(-1))
            m = n
        l.append(m)
        x = m(x)
    return l

model = torchvision.models.alexnet(pretrained = True)
model = nn.Sequential(
    *convolutionize(list(model.features) + list(model.classifier), (3, 224, 224))
)

⚠ This function makes the [strong and disputable] assumption that only nn.Linear layers have to be converted.
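As a quick check of the interest of this conversion, and continuing with the model built above, feeding an input larger than 224×224 yields a spatial map of class scores instead of a single vector (a sketch; for a 448×448 input, AlexNet’s strides and poolings give an 8×8 map):

# A larger input produces one 1000d score vector per spatial location.
x = Variable(torch.randn(1, 3, 448, 448))
y = model(x)
print(y.size())   # torch.Size([1, 1000, 8, 8])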

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 32 / 89


Original Alexnet

AlexNet (

(features): Sequential (

(0): Conv2d(3, 64, kernel_size =(11, 11), stride =(4, 4), padding =(2, 2))

(1): ReLU (inplace)

(2): MaxPool2d (size=(3, 3), stride =(2, 2), dilation =(1, 1))

(3): Conv2d (64, 192, kernel_size =(5, 5), stride =(1, 1), padding =(2, 2))

(4): ReLU (inplace)

(5): MaxPool2d (size=(3, 3), stride =(2, 2), dilation =(1, 1))

(6): Conv2d (192, 384, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(7): ReLU (inplace)

(8): Conv2d (384, 256, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(9): ReLU (inplace)

(10): Conv2d (256, 256, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(11): ReLU (inplace)

(12): MaxPool2d (size=(3, 3), stride =(2, 2), dilation =(1, 1))

)

(classifier): Sequential (

(0): Dropout (p = 0.5)

(1): Linear (9216 -> 4096)

(2): ReLU (inplace)

(3): Dropout (p = 0.5)

(4): Linear (4096 -> 4096)

(5): ReLU (inplace)

(6): Linear (4096 -> 1000)

)

)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 33 / 89

Result of convolutionize

Sequential (

(0): Conv2d(3, 64, kernel_size =(11, 11), stride =(4, 4), padding =(2, 2))

(1): ReLU (inplace)

(2): MaxPool2d (size=(3, 3), stride =(2, 2), dilation =(1, 1))

(3): Conv2d (64, 192, kernel_size =(5, 5), stride =(1, 1), padding =(2, 2))

(4): ReLU (inplace)

(5): MaxPool2d (size=(3, 3), stride =(2, 2), dilation =(1, 1))

(6): Conv2d (192, 384, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(7): ReLU (inplace)

(8): Conv2d (384, 256, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(9): ReLU (inplace)

(10): Conv2d (256, 256, kernel_size =(3, 3), stride =(1, 1), padding =(1, 1))

(11): ReLU (inplace)

(12): MaxPool2d (size=(3, 3), stride =(2, 2), dilation =(1, 1))

(13): Dropout (p = 0.5)

(14): Conv2d (256, 4096, kernel_size =(6, 6), stride =(1, 1))

(15): ReLU (inplace)

(16): Dropout (p = 0.5)

(17): Conv2d (4096 , 4096, kernel_size =(1, 1), stride =(1, 1))

(18): ReLU (inplace)

(19): Conv2d (4096 , 1000, kernel_size =(1, 1), stride =(1, 1))

)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 34 / 89


In their “overfeat” approach, Sermanet et al. (2013) combined this with a stride-1 final max-pooling to get multiple predictions.

[Diagram: input image → conv layers → max-pooling → FC layers → 1000d output, contrasting “AlexNet random cropping” with “Overfeat dense max-pooling”.]

Doing so, they could afford parsing the scene at 6 scales to improve invariance.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 35 / 89

This “convolutionization” has a practical consequence, as we can now re-use classification networks for dense prediction without re-training.

Also, and maybe more importantly, it blurs the conceptual boundary between “features” and “classifier” and leads to an intuitive understanding of convnet activations as gradually transitioning from appearance to semantics.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 36 / 89


In the case of a large output prediction map, a final prediction can be obtained by averaging the final output map channel-wise.

If the last layer is linear, the averaging can be done first, as in the residual networks (He et al., 2015).
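With the fully convolutional model built earlier, this channel-wise averaging is a single mean over the two spatial dimensions (a sketch; nn.AdaptiveAvgPool2d(1) is a module-based equivalent):

# Average the output map over its spatial dimensions to get a single
# 1000d prediction for the whole (large) image.
y = model(Variable(torch.randn(1, 3, 448, 448)))   # 1 x 1000 x 8 x 8
y_avg = y.mean(3).mean(2)                          # 1 x 1000
print(y_avg.size())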

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 37 / 89

Image classification, network in network

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 38 / 89


Lin et al. (2013) re-interpreted a convolution filter as a one-layer perceptron, and extended it with an “MLP convolution” (aka “network in network”) to improve the capacity vs. parameter ratio.

[Excerpt from (Lin et al., 2013), including Figure 1: comparison of a linear convolution layer and an mlpconv layer; both map the local receptive field to a confidence value of the latent concept, the latter through a micro network (a multilayer perceptron). The paper also replaces the final fully connected layers with global average pooling over the last mlpconv feature maps, fed directly to the softmax.]

(Lin et al., 2013)

As for the fully convolutional networks, such local MLPs can be implemented with 1×1 convolutions.
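A minimal sketch of such an “mlpconv” block in PyTorch: a standard convolution followed by two 1×1 convolutions, i.e. a small MLP applied at every spatial location (the channel sizes are arbitrary):

from torch import nn

mlpconv = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size = 5, padding = 2),   # linear filter over a local patch
    nn.ReLU(inplace = True),
    nn.Conv2d(96, 96, kernel_size = 1),               # 1x1 convolution = per-location MLP layer
    nn.ReLU(inplace = True),
    nn.Conv2d(96, 96, kernel_size = 1),
    nn.ReLU(inplace = True),
)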

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 39 / 89

The same notion was generalized by Szegedy et al. (2015) for their GoogLeNet, through the use of modules combining convolutions at multiple scales to let the optimal ones be picked during training.

[Excerpt from (Szegedy et al., 2015), including Figure 2, “Inception module”: (a) a naïve version concatenating 1×1, 3×3 and 5×5 convolutions and 3×3 max pooling, and (b) the version with 1×1 convolutions used as dimension reductions before the expensive 3×3 and 5×5 convolutions.]

(Szegedy et al., 2015)
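A minimal sketch of an Inception-like module with dimension reductions, concatenating four parallel branches along the channel dimension (the channel counts mirror an early GoogLeNet module but are only illustrative here):

import torch
from torch import nn

class InceptionLike(nn.Module):
    def __init__(self, c_in, c1, c3r, c3, c5r, c5, cp):
        super(InceptionLike, self).__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(c_in, c1, 1), nn.ReLU(inplace = True))
        self.branch3 = nn.Sequential(nn.Conv2d(c_in, c3r, 1), nn.ReLU(inplace = True),
                                     nn.Conv2d(c3r, c3, 3, padding = 1), nn.ReLU(inplace = True))
        self.branch5 = nn.Sequential(nn.Conv2d(c_in, c5r, 1), nn.ReLU(inplace = True),
                                     nn.Conv2d(c5r, c5, 5, padding = 2), nn.ReLU(inplace = True))
        self.branchp = nn.Sequential(nn.MaxPool2d(3, stride = 1, padding = 1),
                                     nn.Conv2d(c_in, cp, 1), nn.ReLU(inplace = True))

    def forward(self, x):
        # Concatenate the branch outputs along the channel dimension.
        return torch.cat((self.branch1(x), self.branch3(x), self.branch5(x), self.branchp(x)), 1)

m = InceptionLike(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).size())   # 1 x 256 x 28 x 28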

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 40 / 89


Szegedy et al. (2015) also introduce the idea of auxiliary classifiers to help the propagation of the gradient in the early layers.

This is motivated by the reasonable performance of shallow networks, which indicates that the early layers already encode informative and invariant features.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 41 / 89

The resulting GoogLeNet has 12 times fewer parameters than AlexNet and is more accurate on ILSVRC14 (Szegedy et al., 2015).

[Figure 3 from (Szegedy et al., 2015): the full GoogLeNet network “with all the bells and whistles” — an initial stem of convolutions, max-poolings and local response normalization, a deep stack of Inception modules (1×1, 3×3, 5×5 convolution and 3×3 max-pooling branches merged by depth concatenation) interleaved with max-poolings, a final average pooling and fully connected layer with softmax, and two auxiliary softmax classifiers branching off intermediate modules.]

(Szegedy et al., 2015)

It was later extended with techniques we are going to see in the next slides: batch normalization (Ioffe and Szegedy, 2015) and pass-through connections à la ResNet (Szegedy et al., 2016).

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 42 / 89


Image classification, residual networks

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 43 / 89

We already saw the structure of the residual networks and how well they perform on CIFAR10 (He et al., 2015).

The default residual block proposed by He et al. is of the form

… → Conv 3×3, 64→64 → BN → ReLU → Conv 3×3, 64→64 → BN → (+ identity) → ReLU → …

and as such requires 2 × (3×3×64 + 1) × 64 ≃ 73k parameters.
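A minimal sketch of this block as a PyTorch module, with a check of the parameter count (identity pass-through, no down-sampling; the BatchNorm parameters are not counted in the 73k above):

from torch import nn
from torch.nn import functional as F

class ResBlock(nn.Module):
    def __init__(self, c):
        super(ResBlock, self).__init__()
        self.conv1 = nn.Conv2d(c, c, kernel_size = 3, padding = 1)
        self.bn1   = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, kernel_size = 3, padding = 1)
        self.bn2   = nn.BatchNorm2d(c)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)            # pass-through connection, then ReLU

nb_params = sum(p.numel() for p in ResBlock(64).parameters())
print(nb_params)   # 74,112 -- the two convolutions alone account for 73,856, i.e. the ~73k above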

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 44 / 89


To apply the same architecture to ImageNet, more channels are required, e.g.

… → Conv 3×3, 256→256 → BN → ReLU → Conv 3×3, 256→256 → BN → (+ identity) → ReLU → …

However, such a block requires 2 × (3×3×256 + 1) × 256 ≃ 1.2m parameters.

They mitigated that requirement with what they call a bottleneck block:

… → Conv 1×1, 256→64 → BN → ReLU → Conv 3×3, 64→64 → BN → ReLU → Conv 1×1, 64→256 → BN → (+ identity) → ReLU → …

256×64 + (3×3×64 + 1)×64 + 64×256 ≃ 70k parameters.

The encoding pushed between blocks is high-dimensional, but the “contextual reasoning” in convolutional layers is done on a simpler feature representation.
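The corresponding bottleneck block, as a rough sketch (biases and initialization details are ignored, so the count differs marginally from the 70k above):

from torch import nn
from torch.nn import functional as F

class Bottleneck(nn.Module):
    def __init__(self, c, c_mid):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(c, c_mid, kernel_size = 1)        # reduce, e.g. 256 -> 64
        self.bn1   = nn.BatchNorm2d(c_mid)
        self.conv2 = nn.Conv2d(c_mid, c_mid, kernel_size = 3, padding = 1)
        self.bn2   = nn.BatchNorm2d(c_mid)
        self.conv3 = nn.Conv2d(c_mid, c, kernel_size = 1)         # restore, e.g. 64 -> 256
        self.bn3   = nn.BatchNorm2d(c)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = F.relu(self.bn2(self.conv2(y)))
        y = self.bn3(self.conv3(y))
        return F.relu(x + y)

print(sum(p.numel() for p in Bottleneck(256, 64).parameters()))   # ~71k, vs. ~1.2m for the wide 3x3 block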

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 45 / 89

[Excerpt from (He et al., 2015), page 5: Table 1, the ResNet architectures for ImageNet (18, 34, 50, 101 and 152 layers, built from stacked residual blocks, with down-sampling performed by conv3_1, conv4_1 and conv5_1 with stride 2, from 1.8×10⁹ to 11.3×10⁹ FLOPs); Figure 4, training and validation error curves of plain vs. residual 18- and 34-layer networks; and Table 2, top-1 errors (10-crop testing) showing that the 34-layer plain network is worse than the 18-layer one, while the 34-layer ResNet is better than the 18-layer ResNet.]

(He et al., 2015)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 46 / 89


[Excerpt from (He et al., 2015), page 6: Table 3 (error rates with 10-crop testing on the ImageNet validation set), Table 4 (single-model error rates), Table 5 (ensemble error rates, with the ResNet ensemble reaching 3.57% top-5 on the test set), and Figure 5, comparing a standard residual block on 64-channel maps with the “bottleneck” block used for ResNet-50/101/152.]

(He et al., 2015)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 47 / 89

This was extended to the ResNeXt architecture by Xie et al. (2016), with blocks with a similar number of parameters, but split into 32 “aggregated” pathways.

… → [32 parallel pathways, each: Conv 1×1, 256→4 → BN → ReLU → Conv 3×3, 4→4 → BN → ReLU → Conv 1×1, 4→256 → BN] → sum with the identity pass-through → ReLU → …

When equalizing the number of parameters, this architecture performs better than a standard ResNet.
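In practice the 32 pathways are usually implemented with a single grouped 3×3 convolution rather than 32 explicit branches, which computes the same thing; a minimal sketch of such an aggregated block (this is the standard equivalent formulation, not code from the paper):

import torch
from torch import nn
from torch.nn import functional as F

class ResNeXtBlock(nn.Module):
    def __init__(self, c, c_mid, groups):
        super(ResNeXtBlock, self).__init__()
        self.conv1 = nn.Conv2d(c, c_mid, kernel_size = 1)
        self.bn1   = nn.BatchNorm2d(c_mid)
        # One grouped convolution computes the 32 parallel 3x3 convolutions.
        self.conv2 = nn.Conv2d(c_mid, c_mid, kernel_size = 3, padding = 1, groups = groups)
        self.bn2   = nn.BatchNorm2d(c_mid)
        self.conv3 = nn.Conv2d(c_mid, c, kernel_size = 1)
        self.bn3   = nn.BatchNorm2d(c)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = F.relu(self.bn2(self.conv2(y)))
        y = self.bn3(self.conv3(y))
        return F.relu(x + y)

# 32 pathways of width 4, i.e. a grouped convolution over 32 x 4 = 128 channels.
block = ResNeXtBlock(256, 128, 32)
print(block(torch.randn(1, 256, 14, 14)).size())   # 1 x 256 x 14 x 14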

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 48 / 89


Image classification, summary

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 49 / 89

To summarize roughly the evolution of convnets for image classification:

• standard ones are extensions of LeNet5,

• everybody loves ReLU,

• state-of-the-art networks have 100s of channels and 10s of layers,

• they can (should?) be fully convolutional,

• pass-through connections allow deeper “residual” nets,

• bottleneck local structures reduce the number of parameters,

• aggregated pathways reduce the number of parameters.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 50 / 89


Image classification networks

[Diagram: genealogy of image classification networks — LeNet5 (LeCun et al., 1989); Deep hierarchical CNN (Ciresan et al., 2012), “bigger + GPU”; AlexNet (Krizhevsky et al., 2012), “bigger + ReLU + dropout”; Overfeat (Sermanet et al., 2013), “fully convolutional”; VGG (Simonyan and Zisserman, 2014), “bigger + small filters”; Net in Net (Lin et al., 2013), “MLPConv”; GoogLeNet (Szegedy et al., 2015), “Inception modules”; BN-Inception (Ioffe and Szegedy, 2015), “batch normalization”; Inception-ResNet (Szegedy et al., 2016); LSTM (Hochreiter and Schmidhuber, 1997); Highway Net (Srivastava et al., 2015), “no recurrence”; ResNet (He et al., 2015), “no gating”; Wide ResNet (Zagoruyko and Komodakis, 2016), “wider”; DenseNet (Huang et al., 2016), “dense pass-through”; ResNeXt (Xie et al., 2016), “aggregated channels”.]

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 51 / 89

Object detection

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 52 / 89


The simplest strategy to move from image classification to object detection is to classify local regions, at multiple scales and locations.

[Illustration: parsing at a fixed scale, and the final list of detections.]

This “sliding window” approach evaluates a classifier multiple times, and its computational cost increases with the prediction accuracy.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 53 / 89

This was mitigated in overfeat (Sermanet et al., 2013) by adding a regression part to predict the object’s bounding box.

[Diagram: input image → conv layers → max-pooling, followed by two heads of FC layers: a 1000d classification head and a 4d localization head.]

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 54 / 89


In the single-object case, the convolutional layers are frozen, and the localization layers are trained with an ℓ2 loss.

[Excerpt from (Sermanet et al., 2013), including Figure 7: examples of bounding boxes produced by the regression network before being combined into final predictions — boxes initially organized as a grid converge to a single location and scale. The feature extraction layers (1-5) are fixed and the regression network is trained, at multiple scales, with an ℓ2 loss between the predicted and true bounding boxes; the individual predictions are then combined with a greedy merge strategy.]

(Sermanet et al., 2013)

Combining the multiple boxes is done with an ad hoc greedy algorithm.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 55 / 89

This architecture can be applied directly to detection by adding a class “Background” to the object classes.

Negative samples are taken in each scene either at random or by selecting the ones with the worst mis-classification.

Surprisingly, using class-specific localization layers did not provide better results than having a single one shared across classes (Sermanet et al., 2013).

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 56 / 89


Other approaches evolved from AlexNet, relying on region proposals:

• Generate thousands of proposal bounding boxes with a non-CNN “objectness” approach such as Selective Search (Uijlings et al., 2013),

• feed to an AlexNet-like network sub-images cropped and warped from the input image (“R-CNN”, Girshick et al., 2013), or from the convolutional feature maps to share computation (“Fast R-CNN”, Girshick, 2015).

These methods suffer from the cost of the region proposal computation, which is non-convolutional and non-GPUified.

They were improved by Ren et al. (2015) in “Faster R-CNN” by replacing the region proposal algorithm with a convolutional processing similar to Overfeat.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 57 / 89

The most famous algorithm from this lineage is “You Only Look Once” (YOLO, Redmon et al., 2015).

It comes back to a classical architecture with a series of convolutional layers followed by a few fully connected layers. It is sometimes described as “one shot” since a single information pathway suffices.

YOLO’s network is not a pre-existing one. It uses leaky ReLU, and its convolutional layers make use of the 1×1 bottleneck filters (Lin et al., 2013) to control the memory footprint and computational cost.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 58 / 89


making predictions. Unlike sliding window and regionproposal-based techniques, YOLO sees the entire imageduring training and test time so it implicitly encodes contex-tual information about classes as well as their appearance.Fast R-CNN, a top detection method [14], mistakes back-ground patches in an image for objects because it can’t seethe larger context. YOLO makes less than half the numberof background errors compared to Fast R-CNN.

Third, YOLO learns generalizable representations of ob-jects. When trained on natural images and tested on art-work, YOLO outperforms top detection methods like DPMand R-CNN by a wide margin. Since YOLO is highly gen-eralizable it is less likely to break down when applied tonew domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systemsin accuracy. While it can quickly identify objects in im-ages it struggles to precisely localize some objects, espe-cially small ones. We examine these tradeoffs further in ourexperiments.

All of our training and testing code is open source. Avariety of pretrained models are also available to download.

2. Unified Detection

We unify the separate components of object detectioninto a single neural network. Our network uses featuresfrom the entire image to predict each bounding box. It alsopredicts all bounding boxes across all classes for an im-age simultaneously. This means our network reasons glob-ally about the full image and all the objects in the image.The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

Our system divides the input image into an S × S grid.If the center of an object falls into a grid cell, that grid cellis responsible for detecting that object.

Each grid cell predictsB bounding boxes and confidencescores for those boxes. These confidence scores reflect howconfident the model is that the box contains an object andalso how accurate it thinks the box is that it predicts. For-mally we define confidence as Pr(Object) ∗ IOUtruth

pred . If noobject exists in that cell, the confidence scores should bezero. Otherwise we want the confidence score to equal theintersection over union (IOU) between the predicted boxand the ground truth.

Each bounding box consists of 5 predictions: x, y, w, h,and confidence. The (x, y) coordinates represent the centerof the box relative to the bounds of the grid cell. The widthand height are predicted relative to the whole image. Finallythe confidence prediction represents the IOU between thepredicted box and any ground truth box.

Each grid cell also predicts C conditional class proba-bilities, Pr(Classi|Object). These probabilities are condi-tioned on the grid cell containing an object. We only predict

one set of class probabilities per grid cell, regardless of thenumber of boxes B.

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

Pr(Class_i | Object) ∗ Pr(Object) ∗ IOU^truth_pred = Pr(Class_i) ∗ IOU^truth_pred   (1)

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

[Figure 2 panels: S × S grid on input; bounding boxes + confidence; class probability map; final detections.]

Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.

2.1. Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al. [22]. The full network is shown in Figure 3.

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

(Redmon et al., 2015)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 59 / 89

The output corresponds to splitting the image into a regular S × S grid, with S = 7, and to predicting, for each cell, a 30d vector (see the decoding sketch after this list):

• B = 2 bounding box coordinates and confidences,

• C = 20 class probabilities, corresponding to the classes of Pascal VOC.
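
As an illustration, here is a minimal sketch, not the authors' code, of how such a 7 × 7 × 30 output tensor could be decoded into class-specific box scores, assuming the per-cell layout (x, y, w, h, c) for each of the B boxes followed by the C class probabilities:

import torch

def decode_yolo_output(output, S = 7, B = 2, C = 20):
    # output: (S, S, B * 5 + C) tensor; assumed per-cell layout is
    # [x, y, w, h, conf] for each of the B boxes, then C class probabilities.
    boxes = output[..., :B * 5].reshape(S, S, B, 5)           # (S, S, B, 5)
    class_probs = output[..., B * 5:]                         # (S, S, C)
    conf = boxes[..., 4]                                      # Pr(Object) * IOU
    # Class-specific confidence of Eq. (1): Pr(Class_i) * IOU
    scores = class_probs.unsqueeze(2) * conf.unsqueeze(-1)    # (S, S, B, C)
    return boxes[..., :4], scores

coords, scores = decode_yolo_output(torch.randn(7, 7, 30))
print(coords.size(), scores.size())  # (7, 7, 2, 4) and (7, 7, 2, 20)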

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.

The final output of our network is the 7 × 7 × 30 tensor of predictions.

2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24]. We use the Darknet framework for all training and inference [26].

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

\[
\phi(x) =
\begin{cases}
x, & \text{if } x > 0 \\
0.1x, & \text{otherwise}
\end{cases}
\quad (2)
\]

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, λcoord and λnoobj to accomplish this. We set λcoord = 5 and λnoobj = .5.

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

During training we optimize the following, multi-part loss function.

(Redmon et al., 2015)

x_{i,1}  y_{i,1}  w_{i,1}  h_{i,1}  c_{i,1}  . . .  x_{i,B}  y_{i,B}  w_{i,B}  h_{i,B}  c_{i,B}  |  p_{i,1}  . . .  p_{i,C}

(5B values for the boxes, then C values for the class probabilities)
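
For illustration, a minimal sketch of how an annotated box could be encoded into such per-cell (x, y, w, h) values, assuming boxes given as (x_min, y_min, x_max, y_max) in pixels; this is not the authors' code:

def encode_box(box, img_w, img_h, S = 7):
    x_min, y_min, x_max, y_max = box
    xc, yc = (x_min + x_max) / 2, (y_min + y_max) / 2
    # Cell responsible for the box (the one containing its center).
    col, row = int(xc / img_w * S), int(yc / img_h * S)
    # (x, y): offset of the center within that cell, in [0, 1].
    x = xc / img_w * S - col
    y = yc / img_h * S - row
    # (w, h): size relative to the whole image, in [0, 1].
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return row, col, (x, y, w, h)

print(encode_box((100, 150, 300, 350), img_w = 448, img_h = 448))
# (3, 3, (0.125, 0.90625, 0.446..., 0.446...))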

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 60 / 89

Page 31: 7. Networks for computer vision EE-559 { Deep learning...2018/06/08  · Computer vision tasks: classi cation, object detection, semantic or instance segmentation, other (tracking

So the network predicts class scores and bounding-box regressions, and although the output comes from fully connected layers, it has a 2D structure.

This in particular allows YOLO to leverage the absolute location in the image to improve performance (e.g. vehicles tend to be at the bottom, umbrellas at the top), which may or may not be desirable.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 61 / 89

During training, YOLO makes the assumption that any of the S^2 cells contains at most [the center of] a single object. We define, for every image, cell index i = 1, . . . , S^2, predicted box index j = 1, . . . , B, and class index c = 1, . . . , C:

• 1^obj_i is 1 if there is an object in cell i and 0 otherwise,

• 1^obj_{i,j} is 1 if there is an object in cell i and predicted box j is the most fitting one, 0 otherwise,

• p_{i,c} is 1 if there is an object of class c in cell i, and 0 otherwise,

• x_i, y_i, w_i, h_i the annotated object bounding box (defined only if 1^obj_i = 1, and relative in location and scale to the cell),

• c_{i,j} the IOU between predicted box j of cell i and the ground truth target (a minimal IOU sketch is given after this list).
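
The "most fitting" box is the one with the highest IOU with the annotated box. A minimal illustrative IOU computation, for boxes given as (x_min, y_min, x_max, y_max):

def iou(box_a, box_b):
    # Boxes given as (x_min, y_min, x_max, y_max).
    x_min = max(box_a[0], box_b[0])
    y_min = max(box_a[1], box_b[1])
    x_max = min(box_a[2], box_b[2])
    y_max = min(box_a[3], box_b[3])
    inter = max(0.0, x_max - x_min) * max(0.0, y_max - y_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7, approximately 0.143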

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 62 / 89

Page 32: 7. Networks for computer vision EE-559 { Deep learning...2018/06/08  · Computer vision tasks: classi cation, object detection, semantic or instance segmentation, other (tracking

The training procedure first computes on each image the value of the 1^obj_{i,j}s and c_{i,j}, and then does one step to minimize

\[
\begin{aligned}
& \lambda_{\mathrm{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbf{1}^{\mathrm{obj}}_{i,j}
\left( (x_i - \hat{x}_{i,j})^2 + (y_i - \hat{y}_{i,j})^2
+ \left(\sqrt{w_i} - \sqrt{\hat{w}_{i,j}}\,\right)^2
+ \left(\sqrt{h_i} - \sqrt{\hat{h}_{i,j}}\,\right)^2 \right) \\
& + \lambda_{\mathrm{obj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbf{1}^{\mathrm{obj}}_{i,j} \left(c_{i,j} - \hat{c}_{i,j}\right)^2
+ \lambda_{\mathrm{noobj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \left(1 - \mathbf{1}^{\mathrm{obj}}_{i,j}\right) \hat{c}_{i,j}^{\,2} \\
& + \lambda_{\mathrm{classes}} \sum_{i=1}^{S^2} \mathbf{1}^{\mathrm{obj}}_{i} \sum_{c=1}^{C} \left(p_{i,c} - \hat{p}_{i,c}\right)^2
\end{aligned}
\]

where the hatted quantities \(\hat{p}_{i,c}, \hat{x}_{i,j}, \hat{y}_{i,j}, \hat{w}_{i,j}, \hat{h}_{i,j}, \hat{c}_{i,j}\) are the network's outputs.

(slightly re-written from Redmon et al. 2015)
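
A minimal PyTorch sketch of this loss, for illustration only and not the authors' implementation. It assumes the 1^obj indicators and the target IOUs c_{i,j} have been pre-computed, that the output tensor uses the per-cell layout (x, y, w, h, c) for each box followed by the class probabilities, and that λ_obj = λ_classes = 1:

import torch

def yolo_loss(pred, target_box, target_class, obj_i, obj_ij, target_iou,
              l_coord = 5.0, l_obj = 1.0, l_noobj = 0.5, l_classes = 1.0):
    # pred:         (N, S, S, B * 5 + C) network outputs
    # target_box:   (N, S, S, 4)  annotated (x, y, w, h) per cell (zeros if empty)
    # target_class: (N, S, S, C)  one-hot p_{i,c}
    # obj_i:        (N, S, S)     float 1^obj_i
    # obj_ij:       (N, S, S, B)  float 1^obj_{i,j}
    # target_iou:   (N, S, S, B)  c_{i,j}, IOU of each predicted box with the target
    N, S, _, D = pred.size()
    B = obj_ij.size(-1)
    boxes = pred[..., :5 * B].reshape(N, S, S, B, 5)
    xy, wh, conf = boxes[..., :2], boxes[..., 2:4], boxes[..., 4]
    class_probs = pred[..., 5 * B:]

    t_xy = target_box[..., :2].unsqueeze(3)   # broadcast over the B boxes
    t_wh = target_box[..., 2:4].unsqueeze(3)

    coord = (obj_ij * ((xy - t_xy)**2).sum(-1)).sum() \
          + (obj_ij * ((wh.clamp(min = 0).sqrt() - t_wh.sqrt())**2).sum(-1)).sum()
    obj = (obj_ij * (target_iou - conf)**2).sum()
    noobj = ((1 - obj_ij) * conf**2).sum()
    classes = (obj_i * ((target_class - class_probs)**2).sum(-1)).sum()

    return (l_coord * coord + l_obj * obj + l_noobj * noobj + l_classes * classes) / N

Dividing by N averages the loss over the mini-batch, which is a choice of this sketch rather than of the original formulation.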

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 63 / 89

Training YOLO relies on many engineering choices that illustrate well how involved deep learning is “in practice”:

• pre-train the first 20 convolutional layers on ImageNet classification,

• use 448× 448 input for detection, instead of 224× 224,

• use leaky ReLU (slope 0.1) for all layers but the final linear one,

• dropout after the first fully connected layer,

• normalize bounding box parameters in [0, 1],

• use a quadratic loss not only for the bounding box coordinates, but also for the confidence and the class scores,

• reduce the weight of large bounding boxes by using the square roots of the sizes in the loss,

• reduce the importance of empty cells by down-weighting the confidence-related loss on them,

• use momentum 0.9 and weight decay 5e-4,

• data augmentation with scaling, translation, and HSV transformations (a sketch of some of these choices follows this list).
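
A minimal PyTorch sketch of a few of these choices, namely the leaky ReLU slope, the optimizer settings, and an augmentation pipeline in the same spirit; apart from momentum 0.9, weight decay 5e-4, and the 0.1 slope, the values are illustrative assumptions:

import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

model = nn.Conv2d(3, 30, kernel_size = 1)  # placeholder for the detection network

activation = nn.LeakyReLU(0.1)             # leaky ReLU used in the hidden layers

# Momentum and weight decay quoted above; the learning rate schedule is omitted.
optimizer = optim.SGD(model.parameters(), lr = 1e-3,
                      momentum = 0.9, weight_decay = 5e-4)

# Scaling / translation / color augmentation in the same spirit as the paper.
train_transforms = transforms.Compose([
    transforms.RandomAffine(degrees = 0, translate = (0.2, 0.2), scale = (0.8, 1.2)),
    transforms.ColorJitter(saturation = 0.5, hue = 0.1),
    transforms.ToTensor(),
])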

A critical technical point is the design of the loss function, which articulates both a classification and a regression objective.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 64 / 89

Page 33: 7. Networks for computer vision EE-559 { Deep learning...2018/06/08  · Computer vision tasks: classi cation, object detection, semantic or instance segmentation, other (tracking

The Single Shot MultiBox Detector (SSD, Liu et al., 2015) improves upon YOLO with a fully convolutional architecture and multi-scale maps.

[Figure residue: the SSD and YOLO architecture diagrams are not reproduced; in this figure SSD reaches 74.3 mAP at 59 FPS and YOLO 63.4 mAP at 45 FPS.]

Fig. 2: A comparison between two single shot detection models: SSD and YOLO [5]. Our SSD model adds several feature layers to the end of a base network, which predict the offsets to default boxes of different scales and aspect ratios and their associated confidences. SSD with a 300 × 300 input size significantly outperforms its 448 × 448 YOLO counterpart in accuracy on VOC2007 test while also improving the speed.

box position relative to each feature map location (cf. the architecture of YOLO [5] that uses an intermediate fully connected layer instead of a convolutional filter for this step). Default boxes and aspect ratios: We associate a set of default bounding boxes with each feature map cell, for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, we compute c class scores and the 4 offsets relative to the original default box shape. This results in a total of (c + 4)k filters that are applied around each location in the feature map, yielding (c + 4)kmn outputs for an m × n feature map. For an illustration of default boxes, please refer to Fig. 1. Our default boxes are similar to the anchor boxes used in Faster R-CNN [2], however we apply them to several feature maps of different resolutions. Allowing different default box shapes in several feature maps lets us efficiently discretize the space of possible output box shapes.

2.2 Training

The key difference between training SSD and training a typical detector that uses region proposals, is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Some version of this is also required for training in YOLO [5] and for the region proposal stage of Faster R-CNN [2] and MultiBox [7]. Once this assignment is determined, the loss function and back propagation are applied end-to-end. Training also involves choosing the set of default boxes and scales for detection as well as the hard negative mining and data augmentation strategies.

(Liu et al., 2015)
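
To make the arithmetic of these default-box predictors concrete, a minimal sketch with example values; the 19 × 19 map of Fig. 2, k = 6 default boxes, c = 21 class scores including background, and the 1024 input channels are assumptions for the illustration:

import torch
import torch.nn as nn

c, k = 21, 6          # class scores per box and default boxes per location
m = n = 19            # feature map size
in_channels = 1024    # channels of that feature map (assumption)

# A single 3x3 convolution produces the (c + 4) * k values per location.
classifier = nn.Conv2d(in_channels, (c + 4) * k, kernel_size = 3, padding = 1)

y = classifier(torch.randn(1, in_channels, m, n))
print(y.size())    # (1, 150, 19, 19)
print(y.numel())   # (c + 4) * k * m * n = 54150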

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 65 / 89

To summarize roughly how “one shot” deep detection can be achieved:

• networks trained on image classification capture localization information,

• regression layers can be attached to classification-trained networks,

• object localization does not have to be class-specific,

• multiple detections are estimated at each location to account for different aspect ratios and scales.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 66 / 89


Object detection networks

[Diagram: lineage of object detection networks, each step labelled with its main new ingredient.]

• AlexNet (Krizhevsky et al., 2012)

• Overfeat (Sermanet et al., 2013): box regression

• R-CNN (Girshick et al., 2013): region proposal + crop in image

• Fast R-CNN (Girshick, 2015): crop in feature maps

• Faster R-CNN (Ren et al., 2015): convolutional region proposal

• YOLO (Redmon et al., 2015): no crop

• SSD (Liu et al., 2015): fully convolutional + multi-scale maps; multi-scale convolutions + multi boxes

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 67 / 89

Semantic segmentation

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 68 / 89


The historical approach to image segmentation was to define a measure of similarity between pixels, and to cluster groups of similar pixels. Such approaches account poorly for semantic content.

The deep-learning approach re-casts semantic segmentation as pixel classification, and re-uses networks trained for image classification by making them fully convolutional.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 69 / 89

Shelhamer et al. (2016) use a pre-trained classification network (e.g. the 16-layer VGG) from which the final fully connected layer is removed, and the other ones are converted to 1 × 1 convolutional filters.

They add a final 1 × 1 convolutional layer with 21 output channels (the 20 VOC classes + “background”).

Since VGG16 has 5 max-poolings with 2 × 2 kernels, with proper padding, the output is 1/2^5 = 1/32 the size of the input.

This map is then up-scaled with a de-convolution layer with kernel 64 × 64 and stride 32 × 32 to get a final map of the same size as the input image.

Training is achieved with full images and a pixel-wise cross-entropy, starting with a pre-trained VGG16. All layers are fine-tuned, although fixing the up-scaling de-convolution to bilinear up-sampling works as well.
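
A minimal sketch of such an FCN-32s head on top of VGG16 features, for illustration only; the channel sizes follow the description above, while the exact padding of the up-scaling layer is an assumption (the original implementation crops instead):

import torch
import torch.nn as nn
import torchvision

nb_classes = 21  # 20 VOC classes + "background"

vgg = torchvision.models.vgg16(pretrained = True)
features = vgg.features  # 5 max-poolings: the output is 1/32 of the input size

head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size = 1), nn.ReLU(inplace = True),
    nn.Conv2d(4096, 4096, kernel_size = 1), nn.ReLU(inplace = True),
    nn.Conv2d(4096, nb_classes, kernel_size = 1),
    # Up-scale by 32 back to the input resolution.
    nn.ConvTranspose2d(nb_classes, nb_classes,
                       kernel_size = 64, stride = 32, padding = 16),
)

x = torch.randn(1, 3, 224, 224)
print(head(features(x)).size())  # (1, 21, 224, 224)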

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 70 / 89


[Diagram: the VGG16-based FCN. Input 3d → (2× conv/relu + maxpool) → 1/2, 64d → (2× conv/relu + maxpool) → 1/4, 128d → (3× conv/relu + maxpool) → 1/8, 256d → (3× conv/relu + maxpool) → 1/16, 512d → (3× conv/relu + maxpool) → 1/32, 512d → (2× fc-conv/relu) → 1/32, 4096d → (fc-conv) → 1/32, 21d → (deconv ×32) → 21d full-resolution output.]

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 71 / 89

Although this Fully Convolutional Network (FCN) achieved almost state-of-the-art results when published, its main weakness is the coarseness of the signal from which the final output is produced (1/32 of the original resolution).

Shelhamer et al. proposed an additional element, which consists of using the same prediction/up-scaling from intermediate layers of the VGG network, as sketched below.
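
A minimal sketch of such a skip fusion, in the spirit of FCN-16s, assuming class score maps at 1/32 and features at 1/16 of the input resolution; the cropping of the original implementation is replaced by padding for simplicity:

import torch
import torch.nn as nn

nb_classes = 21

score_pool4 = nn.Conv2d(512, nb_classes, kernel_size = 1)   # predict from the 1/16 features
up2 = nn.ConvTranspose2d(nb_classes, nb_classes, kernel_size = 4, stride = 2, padding = 1)
up16 = nn.ConvTranspose2d(nb_classes, nb_classes, kernel_size = 32, stride = 16, padding = 8)

pool4 = torch.randn(1, 512, 14, 14)            # 1/16 resolution features (224x224 input)
score32 = torch.randn(1, nb_classes, 7, 7)     # 1/32 resolution class scores

fused = score_pool4(pool4) + up2(score32)      # fuse the two predictions at 1/16
print(up16(fused).size())                      # (1, 21, 224, 224)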

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 72 / 89


[Diagram: the same VGG16 trunk, with additional fc-conv predictions (21d) computed from the 1/16 and 1/8 feature maps. The 1/32 prediction is up-scaled ×2 and summed with the 1/16 one, the result is up-scaled ×2 and summed with the 1/8 one, and the fused 1/8 map is finally up-scaled ×8 to the input resolution.]

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 73 / 89

[Figure columns: FCN-8s, SDS [14], Ground Truth, Image.]

Fig. 6. Fully convolutional networks improve performance on PASCAL. The left column shows the output of our most accurate net, FCN-8s. The second shows the output of the previous best method by Hariharan et al. [14]. Notice the fine structures recovered (first row), ability to separate closely interacting objects (second row), and robustness to occluders (third row). The fifth and sixth rows show failure cases: the net sees lifejackets in a boat as people and confuses human hair with a dog.

6 ANALYSIS

We examine the learning and inference of fully convolutional networks. Masking experiments investigate the role of context and shape by reducing the input to only foreground, only background, or shape alone. Defining a “null” background model checks the necessity of learning a background classifier for semantic segmentation. We detail an approximation between momentum and batch size to further tune whole image learning. Finally, we measure bounds on task accuracy for given output resolutions to show there is still much to improve.

6.1 Cues

Given the large receptive field size of an FCN, it is natural to wonder about the relative importance of foreground and background pixels in the prediction. Is foreground appearance sufficient for inference, or does the context influence the output? Conversely, can a network learn to recognize a class by its shape and context alone?

Masking: To explore these issues we experiment with masked versions of the standard PASCAL VOC segmentation challenge. We both mask input to networks trained on normal PASCAL, and learn new networks on the masked PASCAL. See Table 8 for masked results.

TABLE 8: The role of foreground, background, and shape cues. All scores are the mean intersection over union metric excluding background. The architecture and optimization are fixed to those of FCN-32s (Reference) and only input masking differs.

                 train         test
                 FG     BG     FG     BG     mean IU
Reference        keep   keep   keep   keep   84.8
Reference-FG     keep   keep   keep   mask   81.0
Reference-BG     keep   keep   mask   keep   19.8
FG-only          keep   mask   keep   mask   76.1
BG-only          mask   keep   mask   keep   37.8
Shape            mask   mask   mask   mask   29.1

Masking the foreground at inference time is catastrophic. However, masking the foreground during learning yields a network capable of recognizing object segments without observing a single pixel of the labeled class. Masking the background has little effect overall but does lead to class confusion in certain cases. When the background is masked during both learning and inference, the network unsurprisingly achieves nearly perfect background accuracy; however certain classes are more confused. All-in-all this suggests that FCNs do incorporate context even though decisions are driven by foreground pixels.

To separate the contribution of shape, we learn a net restricted to the simple input of foreground/background masks. The accuracy in this shape-only condition is lower than when only the foreground is masked, suggesting that the net is capable of learning context to boost recognition. Nonetheless, it is surprisingly accurate. See Figure 7.

Background modeling: It is standard in detection and semantic segmentation to have a background model. This model usually takes the same form as the models for the classes of interest, but is supervised by negative instances. In our experiments we have followed the same approach, learning parameters to score all classes including background. Is this actually necessary, or do class models suffice?

To investigate, we define a net with a “null” background model that gives a constant score of zero. Instead of training with the softmax loss, which induces competition by normalizing across classes, we train with the sigmoid cross-entropy loss, which independently normalizes each score. For inference each pixel is assigned the highest scoring class. In all other respects the experiment is identical to our FCN-32s on PASCAL VOC. The null background net scores 1 point lower than the reference FCN-32s and a control FCN-32s trained on all classes including background with the sigmoid cross-entropy loss. To put this drop in perspective, note that discarding the background model in this way reduces the total number of parameters by less than 0.1%. Nonetheless, this result suggests that learning a dedicated background model for semantic segmentation is not vital.

6.2 Momentum and batch size

In comparing optimization schemes for FCNs, we find that “heavy” online learning with high momentum trains more accurate models in less wall clock time (see Section 4.2). Here we detail a relationship between momentum and batch size that motivates heavy learning.

Left column is the best network from Shelhamer et al. (2016).

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 74 / 89



[Figure columns: Image, Ground Truth, Output, Input.]

Fig. 7. FCNs learn to recognize by shape when deprived of other input detail. From left to right: regular image (not seen by network), ground truth, output, mask input.

By writing the updates computed by gradient accumulation as a non-recursive sum, we will see that momentum and batch size can be approximately traded off, which suggests alternative training parameters. Let g_t be the step taken by minibatch SGD with momentum at time t,

\[
g_t = -\eta \sum_{i=0}^{k-1} \nabla_\theta \, \ell(x_{kt+i}; \theta_{t-1}) + p \, g_{t-1},
\]

where ℓ(x; θ) is the loss for example x and parameters θ, p < 1 is the momentum, k is the batch size, and η is the learning rate. Expanding this recurrence as an infinite sum with geometric coefficients, we have

\[
g_t = -\eta \sum_{s=0}^{\infty} \sum_{i=0}^{k-1} p^s \, \nabla_\theta \, \ell(x_{k(t-s)+i}; \theta_{t-s}).
\]

In other words, each example is included in the sum with coefficient p^⌊j/k⌋, where the index j orders the examples from most recently considered to least recently considered. Approximating this expression by dropping the floor, we see that learning with momentum p and batch size k appears to be similar to learning with momentum p′ and batch size k′ if p^(1/k) = p′^(1/k′). Note that this is not an exact equivalence: a smaller batch size results in more frequent weight updates, and may make more learning progress for the same number of gradient computations. For typical FCN values of momentum 0.9 and a batch size of 20 images, an approximately equivalent training regime uses momentum 0.9^(1/20) ≈ 0.99 and a batch size of one, resulting in online learning. In practice, we find that online learning works well and yields better FCN models in less wall clock time.
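
A quick numerical check of this trade-off, for illustration:

p, k = 0.9, 20
# Momentum giving the same per-example coefficient with a batch size of one.
p_prime = p ** (1 / k)
print(round(p_prime, 4))  # 0.9947, i.e. approximately 0.99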

6.3 Upper bounds on IU

FCNs achieve good performance on the mean IU segmentation metric even with spatially coarse semantic prediction. To better understand this metric and the limits of this approach with respect to it, we compute approximate upper bounds on performance with prediction at various resolutions. We do this by downsampling ground truth images and then upsampling back to simulate the best results obtainable with a particular downsampling factor. The following table gives the mean IU on a subset of PASCAL 2011 val for various downsampling factors.

factor   mean IU
128      50.9
64       73.3
32       86.1
16       92.8
8        96.4
4        98.5

Pixel-perfect prediction is clearly not necessary to achieve mean IU well above state-of-the-art, and, conversely, mean IU is not a good measure of fine-scale accuracy. The gaps between oracle and state-of-the-art accuracy at every stride suggest that recognition and not resolution is the bottleneck for this metric.

7 CONCLUSION

Fully convolutional networks are a rich class of models that address many pixelwise tasks. FCNs for semantic segmentation dramatically improve accuracy by transferring pre-trained classifier weights, fusing different layer representations, and learning end-to-end on whole images. End-to-end, pixel-to-pixel operation simultaneously simplifies and speeds up learning and inference. All code for this paper is open source in Caffe, and all models are freely available in the Caffe Model Zoo. Further works have demonstrated the generality of fully convolutional networks for a variety of image-to-image tasks.

ACKNOWLEDGEMENTS

This work was supported in part by DARPA's MSEE and SMISC programs, NSF awards IIS-1427425, IIS-1212798, IIS-1116411, and the NSF GRFP, Toyota, and the Berkeley Vision and Learning Center. We gratefully acknowledge NVIDIA for GPU donation. We thank Bharath Hariharan and Saurabh Gupta for their advice and dataset tools. We thank Sergio Guadarrama for reproducing GoogLeNet in Caffe. We thank Jitendra Malik for his helpful comments. Thanks to Wei Liu for pointing out an issue with our SIFT Flow mean IU computation and an error in our frequency weighted mean IU formula.

Results with a network trained from masks only (Shelhamer et al., 2016).

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 75 / 89

It is noteworthy that for detection and semantic segmentation, there is a heavy re-use of large networks trained for classification.

The models themselves, as much as the source code of the algorithm that produced them, or the training data, are generic and re-usable assets.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 76 / 89


torch.utils.data.DataLoader

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 77 / 89

Until now, we have dealt with image sets that could fit in memory, and we manipulated them as regular tensors:

train_set = datasets.MNIST('./data/mnist/', train = True, download = True)
train_input = Variable(train_set.train_data.view(-1, 1, 28, 28).float())
train_target = Variable(train_set.train_labels)

Large sets do not fit in memory, and samples have to be constantly loaded during training.

This requires [sophisticated] machinery to parallelize the loading itself, but also the normalization and data-augmentation operations.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 78 / 89


PyTorch offers the torch.utils.data.DataLoader object, which combines a data-set and a sampling policy to create an iterator over mini-batches.

Standard data-sets are available in torchvision.datasets, and they allow applying transformations to the images or the labels transparently.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 79 / 89

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_transforms = transforms.Compose(
    [
        transforms.RandomCrop(28, padding = 3),
        transforms.ToTensor(),
        transforms.Normalize(mean = (33.32,), std = (78.56,))
    ]
)

train_loader = DataLoader(
    datasets.MNIST(root = './data', train = True, download = True,
                   transform = train_transforms),
    batch_size = 100,
    num_workers = 4,
    shuffle = True,
    pin_memory = torch.cuda.is_available()
)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 80 / 89


Given this train_loader, we can now re-write our training procedure with a loop over the mini-batches:

for e in range(nb_epochs):
    for input, target in iter(train_loader):
        if torch.cuda.is_available():
            input, target = input.cuda(), target.cuda()
        input, target = Variable(input), Variable(target)

        output = model(input)
        loss = criterion(output, target)

        model.zero_grad()
        loss.backward()
        optimizer.step()

Note that for data-sets that can fit in memory this is quite inefficient, as they are constantly moved from the CPU to the GPU memory.
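
For such in-memory data-sets, a minimal sketch of the alternative is to move all the samples to the GPU once and slice mini-batches from there, re-using train_input, train_target, model, criterion, and optimizer as defined previously (shuffling is omitted for brevity):

if torch.cuda.is_available():
    train_input, train_target = train_input.cuda(), train_target.cuda()

batch_size = 100

for b in range(0, train_input.size(0), batch_size):
    output = model(train_input[b:b + batch_size])
    loss = criterion(output, train_target[b:b + batch_size])
    model.zero_grad()
    loss.backward()
    optimizer.step()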

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 81 / 89

Example of neuro-surgery and fine-tuning in PyTorch

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 82 / 89


As an example of re-using a network and fine-tuning it, we will construct a network for CIFAR10 composed of:

• the first layer of an [already trained] AlexNet,

• several resnet blocks, stored in an nn.ModuleList and each combining nn.Conv2d, nn.BatchNorm2d, and nn.ReLU,

• a final channel-wise averaging, using nn.AvgPool2d, and

• a final fully connected linear layer nn.Linear.

During training, we keep the AlexNet features frozen for a few epochs. This is done by setting requires_grad of the related Parameters to False.

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 83 / 89

# Imports used by this example and the following slides.
import os
import torch
import torchvision
from torch import nn, optim
from torch.autograd import Variable

data_dir = os.environ.get('PYTORCH_DATA_DIR') or '.'

num_workers = 4
batch_size = 64

transform = torchvision.transforms.ToTensor()

train_set = torchvision.datasets.CIFAR10(root = data_dir, train = True,
                                         download = False, transform = transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size = batch_size,
                                           shuffle = True, num_workers = num_workers)

test_set = torchvision.datasets.CIFAR10(root = data_dir, train = False,
                                        download = False, transform = transform)
test_loader = torch.utils.data.DataLoader(test_set, batch_size = batch_size,
                                          shuffle = False, num_workers = num_workers)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 84 / 89


def make_resnet_block(nb_channels, kernel_size = 3):
    return nn.Sequential(
        nn.Conv2d(nb_channels, nb_channels,
                  kernel_size = kernel_size,
                  padding = (kernel_size - 1) // 2),
        nn.BatchNorm2d(nb_channels),
        nn.ReLU(inplace = True),
        nn.Conv2d(nb_channels, nb_channels,
                  kernel_size = kernel_size,
                  padding = (kernel_size - 1) // 2),
        nn.BatchNorm2d(nb_channels),
    )

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 85 / 89

class Monster(nn.Module):
    def __init__(self, nb_residual_blocks, nb_channels):
        super(Monster, self).__init__()

        nb_alexnet_channels = 64
        alexnet_feature_map_size = 7  # For 32x32 inputs (e.g. CIFAR)

        alexnet = torchvision.models.alexnet(pretrained = True)

        # Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
        self.features = nn.Sequential(
            alexnet.features[0],
            nn.ReLU(inplace = True)
        )

        self.converter = nn.Sequential(
            nn.Conv2d(nb_alexnet_channels, nb_channels,
                      kernel_size = 3, padding = 1),
            nn.ReLU(inplace = True)
        )

        self.resnet_blocks = nn.ModuleList()
        for k in range(nb_residual_blocks):
            self.resnet_blocks.append(make_resnet_block(nb_channels, 3))

        self.final_average = nn.AvgPool2d(alexnet_feature_map_size)
        self.fc = nn.Linear(nb_channels, 10)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 86 / 89


    def freeze_features(self, q):
        for p in self.features.parameters():
            # If frozen (q == True) we do NOT need the gradient
            p.requires_grad = not q

    def forward(self, x):
        x = self.features(x)
        x = self.converter(x)

        for b in self.resnet_blocks:
            x = x + b(x)

        x = self.final_average(x).view(x.size(0), -1)
        x = self.fc(x)

        return x

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 87 / 89

nb_epochs = 100
nb_epochs_frozen_features = nb_epochs // 2
nb_residual_blocks = 16
nb_channels = 64

model, criterion = Monster(nb_residual_blocks, nb_channels), nn.CrossEntropyLoss()

if torch.cuda.is_available():
    model.cuda()
    criterion.cuda()

optimizer = optim.SGD(model.parameters(), lr = 1e-2)

model.train(True)

for e in range(nb_epochs):
    model.freeze_features(e < nb_epochs_frozen_features)

    acc_loss = 0.0

    for input, target in iter(train_loader):
        if torch.cuda.is_available():
            input, target = input.cuda(), target.cuda()
        input, target = Variable(input), Variable(target)

        output = model(input)
        loss = criterion(output, target)
        acc_loss += loss.data[0]

        model.zero_grad()
        loss.backward()
        optimizer.step()

    print(e, acc_loss)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 88 / 89


nb_test_errors, nb_test_samples = 0, 0

model.train(False)

for input, target in iter(test_loader):
    if torch.cuda.is_available():
        input = input.cuda()
        target = target.cuda()
    input = Variable(input)

    output = model(input)
    wta = torch.max(output.data, 1)[1].view(-1)

    for i in range(target.size(0)):
        nb_test_samples += 1
        if wta[i] != target[i]: nb_test_errors += 1

print('test_error {:.02f}% ({:d}/{:d})'.format(
    100 * nb_test_errors / nb_test_samples,
    nb_test_errors,
    nb_test_samples)
)

Francois Fleuret EE-559 – Deep learning / 7. Networks for computer vision 89 / 89

References

D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. CoRR, abs/1202.2745, 2012.

R. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.

R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

G. Huang, Z. Liu, K. Weinberger, and L. van der Maaten. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.

A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), 2012.


Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.

J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.

S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.

E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1605.06211, 2016.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

R. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.

J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.

H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.

S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016.

S. Zagoruyko and N. Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.