1
Spatial Pyramid Pooling in Deep Convolutional
Networks for Visual Recognition
Presenter ByungIn Yoo
CS688/WST665
2
Contents
● Introduction
● Motivation
● Previous work
● Main Idea
● Details
● Experiments
● Conclusion
3
Introduction
● Web-scale image retrieval
● Classify images or videos
● Detect and localize objects
● Estimate semantic and geometrical attributes
● Why is this challenging?
- Viewpoint
- Illumination
- Occlusion
- Scale
- Deformation
- Cluttered background
4
Motivation
● Current CNNs require a fixed input image size (e.g., 224 x 224).
● Recognition accuracy is degraded!
[Figure: the image is cropped (content loss) or warped (distortion) to 224x224 before entering the Convolutional Neural Network (CNN)]
5
Motivation
● Current CNNs require a fixed input image size (e.g., 224 x 224).
● Recognition accuracy is degraded!
[Figure: the same crop (content loss) / warp (distortion) pipeline into the 224x224 Convolutional Neural Network (CNN), now with Spatial Pyramid Pooling]
6
Previous work (1/2)
● Spatial Pyramid Matching
- very successful in traditional computer vision
Grauman et al., The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features, ICCV 2005.
Lazebnik et al., Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, CVPR 2006.
7
Previous work (2/2)
● Zeiler-Fergus Architecture (2013, 1st)
- 8 layers: still low accuracy, and fixed image size
● GoogLeNet (2014, 1st)
- 22 layers: overly complex model, and fixed image size
[Figure legend: Convolution, Pooling, Softmax, Other]
M. D. Zeiler et al., "Visualizing and Understanding Convolutional Neural Networks", arXiv:1311.2901, 2013.
Christian Szegedy et al., "Going Deeper with Convolutions", arXiv:1409.4842, 2014.
8
Main Idea (1/2)
● Add Spatial Pyramid Pooling layer!
[Figure: previous nets compared with SPP-net]
9
Main Idea (2/2)
● Generate a fixed-length representation regardless of image size/scale.
● Simple (still 8 layers) and powerful model!
● Variable input size/scale
- Multi-size training, multi-scale testing, full-image view
● Multi-level pooling
- Robust to deformation
● Operates on feature maps
- Pooling in regions
10
Details – Convolutional Layers and Feature Maps
● Inherently, the convolutional layers can accept images of arbitrary size.
● Feature maps capture not only the strength of the responses, but also their spatial positions.
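Both points can be checked with a toy convolution: the same filter slides over inputs of any size, and the location of a response tracks the location of the stimulus. This is a minimal sketch, assuming a hypothetical 3x3 averaging filter with stride 1 and no padding; none of these values come from the slides.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the response map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.ones((3, 3)) / 9.0  # hypothetical averaging filter

# The same filter works on inputs of different sizes; only the output size changes.
small = conv2d_valid(np.random.rand(32, 48), kernel)
large = conv2d_valid(np.random.rand(60, 40), kernel)
print(small.shape, large.shape)

# A single bright pixel produces responses only around its own position.
spike = np.zeros((20, 20))
spike[5, 12] = 1.0
response = conv2d_valid(spike, kernel)
print(np.argwhere(response > 0))
```

Only the fully connected layers that follow need a fixed-length input, which is the constraint the SPP layer removes.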
11
Details – The Spatial Pyramid Pooling Layer
● SPP-net is a network with a new Spatial Pyramid Pooling (SPP) layer
[Figure: Conv1 -> Conv2 -> Conv3 -> Conv4 -> Conv5 -> SPP -> FC6 -> FC7 -> SoftMax]
With 256 filters, the SPP output is a 256 x (4x4 + 2x2 + 1x1) = 5376-dimensional vector.
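The 5376-dimensional figure can be reproduced with a small NumPy sketch of a three-level pyramid (4x4, 2x2, 1x1) applied to a 256-channel feature map. The function name `spatial_pyramid_pool` and the floor/ceil rule for bin edges are assumptions made for illustration, not the presenter's implementation.

```python
import math
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(4, 2, 1)):
    """Max-pool a (channels, H, W) feature map into n x n bins for each level n."""
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # Assumed bin rule: floor/ceil edges cover the whole map and leave no bin empty.
            r0 = math.floor(i * height / n)
            r1 = math.ceil((i + 1) * height / n)
            for j in range(n):
                c0 = math.floor(j * width / n)
                c1 = math.ceil((j + 1) * width / n)
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    # One value per channel per bin: channels * (16 + 4 + 1) values in total.
    return np.concatenate(pooled)

for shape in [(256, 13, 13), (256, 10, 17)]:
    vector = spatial_pyramid_pool(np.random.rand(*shape))
    print(shape, "->", vector.shape)
```

Both feature maps yield a vector of length 5376, so the fully connected layers see the same input size regardless of the spatial size of the map.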
12
Details – Training with the Spatial Pyramid Pooling
● Single-size training
- Simply modify the configuration file of CNN frameworks
[Figure: Conv1 -> Conv2 -> Conv3 -> Conv4 -> Conv5 -> SPP -> FC6 -> FC7 -> SoftMax]
Feature map: 13x13
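For a fixed 13x13 feature map, each pyramid level can be written as an ordinary sliding-window pooling layer. The sketch below uses the rule described in the SPP-net paper (window = ceil(a/n), stride = floor(a/n) for an a x a map and n x n bins); the helper name `pooling_params` is hypothetical.

```python
import math

def pooling_params(a, n):
    """Return (window, stride, bins per side) for pooling an a x a map into n x n bins."""
    window = math.ceil(a / n)
    stride = math.floor(a / n)
    bins = (a - window) // stride + 1  # output size of a sliding-window pooling layer
    return window, stride, bins

for n in (4, 2, 1):
    window, stride, bins = pooling_params(13, n)
    print(f"{n}x{n} level: window={window}, stride={stride}, output={bins}x{bins}")
# 4x4 level: window=4, stride=3, output=4x4
# 2x2 level: window=7, stride=6, output=2x2
# 1x1 level: window=13, stride=13, output=1x1
```

These window and stride values are what would go into the pooling-layer entries of the framework's configuration file.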
13
Details – Training with the Spatial Pyramid Pooling
● Multiple-size training
- Multiple networks sharing all weights
- Each network handles a single size (e.g., 224x224, 180x180)
- Improves scale invariance
[Figure label: resize]
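One way to organize weight sharing across sizes is to switch the input size between epochs while updating a single set of weights. This is a minimal sketch, assuming the sizes alternate one full epoch at a time; `size_schedule` and `train_one_epoch` are hypothetical placeholders, and no real training happens here.

```python
from itertools import cycle, islice

def size_schedule(sizes, num_epochs):
    """Return the input size to use in each epoch, cycling through `sizes`."""
    return list(islice(cycle(sizes), num_epochs))

# Stand-in for the one set of weights shared by every fixed-size network.
shared_weights = {"updates": 0}

def train_one_epoch(weights, input_size):
    # Placeholder: a real epoch would resize each training image to
    # input_size x input_size and run forward/backward passes.
    weights["updates"] += 1
    return weights

for epoch, size in enumerate(size_schedule([224, 180], 6)):
    shared_weights = train_one_epoch(shared_weights, size)
    print(f"epoch {epoch}: trained at {size}x{size}")
```

Because the SPP layer always emits a vector of the same length, the same fully connected weights apply at either size.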
14
Details – Fast CNN-based Object Detection
● The feature maps are computed from the entire image only once.
● Similar accuracy to R-CNN, but much faster (24x to 64x)
[Figure: R-CNN: 2000 convolutions! SPP-net: 1 convolution!]
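The speed-up comes from pooling each candidate window directly on the shared feature map instead of re-running the convolutional layers per window. The sketch below assumes windows are already given in feature-map coordinates; the mapping from image coordinates is omitted, and the window values and the name `pool_region` are hypothetical.

```python
import math
import numpy as np

def pool_region(feature_map, box, levels=(4, 2, 1)):
    """Pyramid max-pool one window; box = (r0, c0, r1, c1), end-exclusive, on the feature map."""
    r0, c0, r1, c1 = box
    region = feature_map[:, r0:r1, c0:c1]
    _, h, w = region.shape
    parts = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                cell = region[:,
                              math.floor(i * h / n):math.ceil((i + 1) * h / n),
                              math.floor(j * w / n):math.ceil((j + 1) * w / n)]
                parts.append(cell.max(axis=(1, 2)))
    return np.concatenate(parts)

# Stand-in for the conv5 output of the full image, computed a single time.
feature_map = np.random.rand(256, 30, 40)

# Hypothetical candidate windows of different sizes and aspect ratios.
windows = [(0, 0, 12, 9), (5, 10, 25, 38), (14, 3, 30, 20)]
features = np.stack([pool_region(feature_map, box) for box in windows])
print(features.shape)  # (3, 5376)
```

Every window, whatever its shape, yields a 5376-dimensional feature that can be passed to the fully connected layers.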
15
Experiments (1/4)
● ILSVRC image classification task
● 1000 object classes (1,431,167 images)
16
Experiments (2/4)
● ILSVRC image classification task (rank #3)
● SPP improves all CNN architectures
[Tables: top-5 test accuracy; top-5 val. accuracy]
17
Experiments (3/4)
● ILSVRC image detection task
● Fully annotated 200 object classes across 121,931 images
● Allows evaluation of generic object detection in cluttered scenes at scale
[Figure legend: detected region, ground truth; true and false detections]
18
Experiments (4/4)
● ILSVRC image detection task (rank #2)
● More practical than R-CNN
19
Conclusion
● SPP is a flexible solution for handling different scales, sizes, and aspect ratios.
● Spatial Pyramid Pooling improves accuracy.
● Multi-size training improves accuracy.
● Full-image representation improves accuracy.
● Classification: SPP improves all CNNs in the literature.
● Detection: more practical and much faster than R-CNN, with similar accuracy.