-
Modern Convolutional Neural Network
techniques for image segmentation
Deep Learning Journal Club
Gioele Ciaparrone
Michele Curci
November 30, 2016
University of Salerno
-
Index
1. Introduction
2. The Inception architecture
3. Fully convolutional networks
4. Hypercolumns
5. Conclusion
-
Introduction
-
CNN recap
Sequence of convolutional and pooling layers
Rectifier activation function
Fully connected layers at the end
Softmax function for classification
-
Convolution I
-
Convolution II
Valid padding (left) and same padding (right) convolutions
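The difference between the two padding modes comes down to the output size. A minimal sketch, assuming square inputs, one spatial axis at a time, and the common convention that "same" padding yields ceil(n / stride) outputs; the function name is hypothetical:

```python
import math

def conv_output_size(n, k, stride=1, padding="valid"):
    """Output length along one axis for an input of length n and a kernel of length k."""
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (n - k) // stride + 1
    if padding == "same":
        # Zero padding chosen so that, at stride 1, the output matches the input size.
        return math.ceil(n / stride)
    raise ValueError("padding must be 'valid' or 'same'")

print(conv_output_size(5, 3, padding="valid"))  # 3
print(conv_output_size(5, 3, padding="same"))   # 5
```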
-
LeNet-5 (1989-1998)
First CNN proven to work well (1989), used for handwritten zip code recognition [1]
Refined through the years until the LeNet-5 version (1998) [2]
-
LeNet-5 interactive visualization [3]
It's possible to interact with the network in 3D by manually drawing a digit
to be classified, clicking on the neurons to get info about the parameters
and the connected units, or rotating and zooming the network:
http://scs.ryerson.ca/~aharley/vis/conv/
-
AlexNet (2012) [5]
After a long hiatus in which deep learning was ignored [4], neural networks received attention once again after Alex Krizhevsky overwhelmingly won the ILSVRC in 2012 with AlexNet
Structure very similar to LeNet-5, but with some new key insights: a very efficient GPU implementation, ReLU neurons and dropout
-
The Inception architecture
-
Motivations
Increasing model size tends to improve quality
More computational resources are needed
Computational efficiency and low parameter count are still important
Mobile vision and embedded systems
Big Data
-
Going Deeper with Convolutions [6]
The Inception module addresses this problem by making better use of the computing resources
Proposed in 2014 by Christian Szegedy and other Google researchers
Used in the GoogLeNet architecture that won both the ILSVRC 2014 classification and detection challenges
-
Inception module I
Visual information is processed at various scales and then aggregated
Since pooling operations are beneficial in CNNs, a parallel pooling path has been added
Problems:
3x3 and 5x5 convolutions can be very expensive on top of a layer with lots of filters
The number of filters substantially increases for each Inception layer added, leading to a computational blow-up
-
Inception module II
Adding 1x1 convolutions before the bigger convolutions reduces dimensionality
The same is done after the pooling layer
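To see why the 1x1 reduction helps, one can count weights with and without it. A rough sketch; the channel counts below are hypothetical values chosen only for illustration, and biases are ignored:

```python
def conv_params(in_channels, out_channels, k):
    """Weights in a k x k convolution layer (biases ignored)."""
    return in_channels * out_channels * k * k

# Hypothetical channel counts, not taken from GoogLeNet.
c_in, c_mid, c_out = 256, 32, 64

# 5x5 convolution applied directly to the 256-channel input.
direct = conv_params(c_in, c_out, 5)
# 1x1 convolution down to 32 channels, then the 5x5 convolution.
reduced = conv_params(c_in, c_mid, 1) + conv_params(c_mid, c_out, 5)

print(direct)   # 409600
print(reduced)  # 59392
```

The number of multiplications per output location scales the same way, so the saving applies to compute as well as to parameters.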
-
GoogLeNet I
GoogLeNet is a particular incarnation of the Inception architecture
22 layers with parameters (27 including pooling layers)
9 Inception modules
2 auxiliary classifiers to mitigate the vanishing gradient problem and for regularization
Designed with computational efficiency in mind
Inference can be run on devices with limited computational resources, especially memory
7 of these networks were used in an ensemble for the ILSVRC 2014 classification task
-
GoogLeNet II
-
GoogLeNet III
-
GoogLeNet - Training
Trained with the DistBelief distributed machine learning system
Asynchronous stochastic gradient descent with 0.9 momentum
Image sampling methods were changed many times before the competition
Already converged models were trained further with other options
Models were trained on crops of different sizes
There is no definitive guidance on the single most effective way to train these networks
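The momentum update mentioned above can be sketched as follows. This is a single-worker, synchronous toy version (not the asynchronous DistBelief setup), and the learning rate and the quadratic objective are assumptions made for illustration:

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9):
    """One update: v <- momentum * v - lr * grad, then w <- w + v."""
    v = momentum * v - lr * grad
    return w + v, v

# Toy problem (an assumption): minimize f(w) = w^2, whose gradient is 2w.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=2.0 * w)

print(abs(w) < 1e-3)
```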
-
GoogLeNet - ILSVRC 2014 Results
Classification (above) and object detection (below) results.
-
DeepDream
Google's DeepDream uses a GoogLeNet to produce "machine dreams"
-
Inception-v2 and Inception-v3
The Inception module authors later presented new optimized versions of the architecture, called Inception-v2 and Inception-v3 [7]
They managed to significantly improve on GoogLeNet's ILSVRC 2014 results
The improvements were based on various key principles:
Avoid representational bottlenecks
Spatial aggregation on lower dimensional embeddings doesn't usually induce relevant losses in representational power
Balance the width and depth of the network
-
Convolution factorization I
Factorizing convolutions reduces the number of parameters without losing much expressiveness
For example, a 5x5 convolution can be factorized into a pair of 3x3 convolutions
It is also possible to factorize an NxN convolution into a 1xN and an Nx1 convolution
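The savings from both factorizations can be checked by counting weights per filter. A small sketch, assuming a single input and output channel so that only the kernel shapes matter; the helper names are hypothetical:

```python
def kernel_weights(h, w):
    """Weights in a single h x w filter over one input channel."""
    return h * w

def stacked_receptive_field(kernel_sizes):
    """Receptive field (one axis) of stride-1 convolutions applied in sequence."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field

# Two stacked 3x3 filters see the same 5x5 window as one 5x5 filter.
print(stacked_receptive_field([3, 3]))              # 5
print(kernel_weights(5, 5))                         # 25
print(2 * kernel_weights(3, 3))                     # 18

# One NxN filter vs. a 1xN filter followed by an Nx1 filter, here with N = 7.
n = 7
print(kernel_weights(n, n))                         # 49
print(kernel_weights(1, n) + kernel_weights(n, 1))  # 14
```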
-
Convolution factorization II
The original Inception module (left) and the new factorized module (right).
-
Efficient grid size reduction - problem
Suppose we want to pass from a $d \times d$ grid with $k$ filters to a $\frac{d}{2} \times \frac{d}{2}$ grid with $2k$ filters
We need to compute a stride-1 convolution and then a pooling
Computational cost dominated by convolutions: $2d^2k^2$ operations
Inverting the order, the number of operations is reduced to $2\left(\frac{d}{2}\right)^2k^2$,
but we violate the bottleneck principle
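The two operation counts above differ by a factor of 4, which a quick calculation confirms. A sketch using the slide's formulas as given (constant factors such as kernel size are left out); the values of d and k are hypothetical:

```python
def conv_then_pool_cost(d, k):
    """Stride-1 convolution to 2k filters on the full d x d grid, then pooling."""
    return 2 * d**2 * k**2

def pool_then_conv_cost(d, k):
    """Pooling first, then the convolution on the (d/2) x (d/2) grid (d assumed even)."""
    return 2 * (d // 2)**2 * k**2

d, k = 32, 64  # hypothetical grid size and filter count
print(conv_then_pool_cost(d, k))                               # 8388608
print(pool_then_conv_cost(d, k))                               # 2097152
print(conv_then_pool_cost(d, k) // pool_then_conv_cost(d, k))  # 4
```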
-
Efficient grid size reduction - solution
The solution is an Inception module with convolution and pooling blocks with stride 2
Computationally efficient, and no representational bottleneck is introduced
-
The new architecture
Using various modified Inception modules, here is the new Inception-v2 architecture
-
Inception-v2: modules used
n = 7
-
Inception-v2: training and observations
The network was trained on the ILSVRC 2012 images using stochastic gradient descent and the TensorFlow library
Experiments showed that the two auxiliary classifiers have less impact on training convergence than expected
In the early training phases, the model performance was not affected by the presence of the auxiliary classifiers: they only improved the performance near the end of training
Removing the lower auxiliary classifier didn't have any effect
The main classifier performs better if batch normalization or dropout are added to the auxiliary ones
The model was also trained and tested on smaller receptive fields with only a small loss of top-1 accuracy (76.6% for 299x299 RF vs. 75.2% for 79x79 RF). This is important for post-classification of detections
-
Inception-v2 to Inception-v3 results (single model)
Each row's Inception-v2 model adds a feature with respect to the previous row's model
The last row's model is referred to as the Inception-v3 model
-
Inception-v3 vs other models (single and ensemble)
Single model results Ensemble results
On the ILSVRC 2012 dataset, there is a significant improvement versus state-of-the-art models, both with a single model and with an ensemble of models
Note that the ensemble errors here are validation errors (except for the one marked with *, which is a test error)
-
Fully convolutional networks
-
Semantic segmentation
Image segmentation is the process of partitioning an image into multiple segments (sets of pixels, or super-pixels)
Semantic segmentation partitions an image into semantically meaningful parts and classifies each part into one of the pre-determined classes
It's possible to achieve the same result with pixel-wise classification, i.e. assigning a class to each pixel
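Pixel-wise classification can be illustrated by taking, at every pixel, the class with the highest score. A minimal sketch with made-up scores; the function name and the score layout (class, row, column) are assumptions:

```python
def segment(scores):
    """scores[c][y][x] is the score of class c at pixel (y, x); returns a label map."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]

# Hypothetical scores for 2 classes on a 2x3 image.
scores = [
    [[0.9, 0.2, 0.4], [0.8, 0.1, 0.3]],  # class 0
    [[0.1, 0.8, 0.6], [0.2, 0.9, 0.7]],  # class 1
]
print(segment(scores))  # [[0, 1, 1], [0, 1, 1]]
```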
-
Fully convolutional networks
Shelhamer et al. [8] showed that fully convolutional networks trained pixels-to-pixels exceed the state of the art in semantic segmentation
The fully convolutional networks they proposed take input of arbitrary size and produce same-sized output to make dense predictions
-
Convolutionalization of a classic net I
Typical recognition nets (AlexNet, GoogLeNet, etc.) take fixed-sized inputs and produce non-spatial outputs
The fully connected layers have fixed dimensions and drop the spatial coordinates
However, we can view these fully connected layers as convolutions that cover their entire input regions
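A toy example may make this equivalence concrete: a fully connected unit defined on a 2x2 input, when slid over a larger image as a convolution, returns at each position the score the unit would give on that patch. The weights, bias and image below are hypothetical, and only a single output unit and channel are considered:

```python
def fully_connected(patch, weights, bias):
    """Score of one output unit on a k x k patch (both flattened row by row)."""
    flat_x = [v for row in patch for v in row]
    flat_w = [v for row in weights for v in row]
    return sum(a * b for a, b in zip(flat_x, flat_w)) + bias

def conv_valid(image, kernel, bias):
    """Valid cross-correlation of a 2D image with a k x k kernel (stride 1)."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j] for i in range(k) for j in range(k)) + bias
            for x in range(w - k + 1)
        ]
        for y in range(h - k + 1)
    ]

weights = [[1, -1], [2, 0]]  # hypothetical weights of one unit on 2x2 inputs
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

score_map = conv_valid(image, weights, bias=1)
print(score_map)  # [[8, 10], [14, 16]]

# The top-left entry equals the fully connected unit applied to the top-left patch.
print(fully_connected([[1, 2], [4, 5]], weights, bias=1))  # 8
```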
-
Convolutionalization of a classic net II
These fully convolutional networks take input of any size and output classification maps
The resulting maps are equivalent to the evaluation of the original network on particular input patches
The new network is more than 5 times faster than evaluating the original network patch by patch, both at learning time and at inference time (considering a 10x10 output grid)
Note that the output dimensions are typically reduced by subsampling
So output interpolation is needed to obtain dense predictions
The interpolation is obtained through backwards convolutions
-
Backwards strided convolution
Upsampling from a 3x3 grid to a 5x5 grid
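One way to obtain a 5x5 output from a 3x3 input is a backwards (transposed) convolution with stride 2, a 3x3 kernel and a 1-pixel crop. These parameters and the fixed bilinear kernel are assumptions for illustration and may not match the slide's figure; in a network the kernel would be learned:

```python
def transposed_conv2d(x, kernel, stride=2, crop=1):
    """Each input value adds a scaled copy of the kernel to the output; the border is then cropped."""
    n, k = len(x), len(kernel)
    full = (n - 1) * stride + k
    out = [[0.0] * full for _ in range(full)]
    for i in range(n):
        for j in range(n):
            for a in range(k):
                for b in range(k):
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return [row[crop:full - crop] for row in out[crop:full - crop]]

# Bilinear interpolation kernel for 2x upsampling.
kernel = [[0.25, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 0.25]]
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

y = transposed_conv2d(x, kernel)
print(len(y), len(y[0]))  # 5 5
print(y[0])               # [1.0, 1.5, 2.0, 2.5, 3.0]
```

The output size follows (n - 1) * stride + k - 2 * crop, which gives 5 for n = 3.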
-
Architecture I
Coarse and local information is fused by combining lower and higher layers
Three network types, each fusing different layers, were tested
-
Architecture II
Three proven classification architectures were transformed to fully convolutional: AlexNet, VGG16 and GoogLeNet
Each net's final classifier layer is discarded and all the fully connected layers are converted to convolutions
A 1x1 convolution with 21 channels (the number of classes in the PASCAL VOC 2011 dataset, including background) is added to the end, followed by a backwards convolution layer
-
Architecture III
The original nets were first pre-trained using image classification
Then they were transformed to fully convolutional for fine-tuning using whole images (using SGD with momentum)
The best results were obtained with FCN-VGG16
Training on whole images proved to be as effective as sampling patches
-
Architecture comparison
The first models (FCN-32s) didn't fuse different layers, and the resulting output is very coarse
They then fused lower layers with the last one (as shown earlier) to obtain better results (mean IU 62.7 for FCN-8s vs. 59.4 for FCN-32s)
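Mean IU (intersection over union, averaged over classes) can be computed from a pixel-level confusion matrix. A small sketch with hypothetical counts; the slide's values are this quantity expressed as a percentage:

```python
def mean_iu(confusion):
    """confusion[i][j] = number of pixels of true class i predicted as class j."""
    n = len(confusion)
    ious = []
    for i in range(n):
        true_pos = confusion[i][i]
        total_true = sum(confusion[i])
        total_pred = sum(confusion[r][i] for r in range(n))
        union = total_true + total_pred - true_pos
        if union > 0:  # skip classes absent from both ground truth and prediction
            ious.append(true_pos / union)
    return sum(ious) / len(ious)

confusion = [[8, 2], [1, 9]]  # hypothetical 2-class pixel counts
print(round(mean_iu(confusion), 4))  # 0.7386
```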
-
Results comparison I
The model reaches state-of-the-art performance on semantic segmentation
The model is also much faster at inference time than previous architectures
-
Results comparison II
-
Hypercolumns
-
Hypercolumns I
The last layer of a CNN captures general features of the image, but is too coarse spatially to allow precise localization
Earlier layers instead may be precise in localization but will not capture semantics
Hariharan et al. [9] presented the hypercolumn concept, which puts together the information from both higher and lower layers to obtain better results on 3 fine-grained localization tasks:
Simultaneous detection and segmentation
Keypoint localization
Part labeling
-
Hypercolumns II
The hypercolumn corresponding to a given input location is defined as the outputs of all units above that location at all layers of the
CNN, stacked into one vector
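The definition above can be sketched by resizing every feature map to a common resolution and reading off one value per channel. This toy version uses nearest-neighbor resizing for brevity (an assumption, a smoother interpolation could be used instead), and the activations are made up:

```python
def upsample_nearest(fmap, size):
    """Resize a square 2D feature map to size x size by nearest-neighbor lookup."""
    n = len(fmap)
    return [[fmap[y * n // size][x * n // size] for x in range(size)] for y in range(size)]

def hypercolumn(layers, y, x, size):
    """Stack, for location (y, x), the value of every channel of every layer."""
    column = []
    for channels in layers:  # one list of 2D feature maps per layer
        for fmap in channels:
            column.append(upsample_nearest(fmap, size)[y][x])
    return column

# Hypothetical activations: a fine 4x4 layer with 1 channel, a coarse 2x2 layer with 2 channels.
fine = [[[r * 4 + c for c in range(4)] for r in range(4)]]
coarse = [[[10, 20], [30, 40]], [[-1, -2], [-3, -4]]]

print(hypercolumn([fine, coarse], y=1, x=2, size=4))  # [6, 20, -2]
```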
-
Problem setting I
Input: a set of detections (subjected to non-maximum suppression), each with a bounding box, a category label and a score
According to the task we are performing, for each detection we want to:
segment out the object
segment its parts
predict its keypoints
Whichever the task, the bounding boxes are slightly expanded and a 50x50 heatmap is predicted on each of them
-
Problem setting II
The information encoded in each heatmap and the number of heatmaps depend on the chosen task:
For segmentation, the heatmap encodes the probability that a particular location is inside the object
For part labeling, a separate heatmap is predicted for each part, where each heatmap is the probability that a location belongs to that part
For keypoint localization, a separate heatmap is predicted for each keypoint, with each heatmap encoding the probability that the keypoint is at a particular location
The heatmaps are finally resized to the size of the expanded bounding boxes
So all the tasks are solved by assigning a probability to each of the 50x50 locations
-
Problem setting III
For each of the 50x50 locations and for each category a classifier should be trained
But doing so has 3 problems:
The amount of data that each classifier sees during training is heavily reduced
Training so many classifiers is computationally expensive
While the classifier should vary according to the location, adjacent pixels should be classified similarly
The solution is to train a coarse KxK (usually K = 5 or K = 10) grid of classifiers and interpolate between them
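The interpolation between grid classifiers can be sketched as a weighted sum of the coarse classifiers' outputs at a location, with weights that depend on the location's distance from the grid cell centers. The bilinear weighting scheme and all names below are assumptions for illustration, not necessarily the exact scheme of [9]:

```python
def axis_weights(pos, size, k):
    """Linear weights of the two nearest of k grid cells for a pixel index in [0, size)."""
    # Express the pixel center in grid units, where cell j has its center at j.
    g = (pos + 0.5) * k / size - 0.5
    g = min(max(g, 0.0), k - 1.0)  # border pixels use the nearest classifier only
    lo = int(g)
    hi = min(lo + 1, k - 1)
    frac = g - lo
    return [(lo, 1.0 - frac), (hi, frac)]

def interpolated_score(grid_scores, y, x, size=50):
    """grid_scores[a][b] is the output of coarse classifier (a, b) evaluated at pixel (y, x)."""
    k = len(grid_scores)
    return sum(
        wy * wx * grid_scores[a][b]
        for a, wy in axis_weights(y, size, k)
        for b, wx in axis_weights(x, size, k)
    )

# Hypothetical outputs of a 2x2 grid of classifiers at one pixel of a 4x4 heatmap.
grid_scores = [[0.0, 1.0], [0.0, 1.0]]
print(interpolated_score(grid_scores, y=0, x=1, size=4))  # 0.25
```

Because the weights change smoothly with position, adjacent pixels receive similar classifiers, while only KxK classifiers have to be trained.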
-
Network architecture
[Figure: network architecture. Block labels: conv (x3), upsample (x3), sigmoid, classifier interpolation]
Note: inverting the order of upsampling and convolutions (which calculate the KxK grids), and computing them separately for each of the 3 combined layers, reduces the computational cost
-
Bounding box refining
A special technique, called rescoring, is used to improve the box selection
-
SDS results
-
Keypoint prediction results
-
Part labeling results
-
Conclusion
-
Conclusion
We have seen how the Inception modules make it possible to train deeper and better networks in a computationally efficient manner
We have then observed how to transform a classification CNN into a fully convolutional network for pixel-wise classification
We have learned the hypercolumn technique to combine high- and low-level information to improve the accuracy on various fine-grained localization tasks
-
Thank you for your patience! :)
-
References I
[1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1(4), pp. 541–551, 1989.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, pp. 2278–2324, 1998.
[3] A. W. Harley, "An interactive node-link visualization of convolutional neural networks," in ISVC, pp. 867–877, 2015.
[4] A. Kurenkov, "A brief history of neural nets and deep learning, part 4." http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4/
-
References II
[5] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1106–1114, 2012.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR, vol. abs/1409.4842, 2014.
[7] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," CoRR, vol. abs/1512.00567, 2015.
[8] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," CoRR, vol. abs/1605.06211, 2016.
-
References III
[9] B. Hariharan, P. A. Arbelaez, R. B. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," CoRR, vol. abs/1411.5752, 2014.