
1

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Presenter: ByungIn Yoo

CS688/WST665

2

Contents

● Introduction

● Motivation

● Previous work

● Main Idea

● Details

● Experiments

● Conclusion

3

Introduction

● Web-scale image retrieval

● Classify images or videos

● Detect and localize objects

● Estimate semantic and geometrical attributes

● Why is this challenging?

- Viewpoint

- Illumination

- Occlusion

- Scale

- Deformation

- Background clutter

4

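Motivation

● Current CNNs require a fixed input image size (e.g., 224 x 224).

● Cropping or warping an image to that size degrades recognition accuracy: cropping loses content, and warping distorts the object.

[Figure: image → crop (content loss) or warp (distortion) → 224 x 224 → Convolutional Neural Network (CNN)]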

5

Motivation

● Current CNNs require a fixed input image size (e.g., 224 x 224).

● Cropping or warping an image to that size degrades recognition accuracy.

[Figure: crop (content loss) or warp (distortion) to 224 x 224 before the Convolutional Neural Network (CNN); Spatial Pyramid Pooling is introduced as the remedy]

6

Previous work (1/2)

● Spatial Pyramid Matching

- very successful in traditional computer vision

Grauman et al., The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features, ICCV 2005.

Lazebnik et al., Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, CVPR 2006.

7

Previous work (2/2)

● Zeiler-Fergus architecture (2013, 1st): 8 layers. Still low accuracy, and fixed image size!

● GoogLeNet (2014, 1st): 22 layers. Too complex a model, and fixed image size!

[Figure: layer diagrams of both networks. Legend: Convolution, Pooling, Softmax, Other]

M. D. Zeiler et al., “Visualizing and Understanding Convolutional Neural Networks”, arXiv:1311.2901, 2013.

C. Szegedy et al., “Going Deeper with Convolutions”, arXiv:1409.4842, 2014.

8

Main Idea (1/2)

● Add Spatial Pyramid Pooling layer!

[Figure: network structure of SPP-net compared with previous nets]

9

Main Idea (2/2)

● Generate fixed length representation regardless of image size/scale.

● Simple (still 8 layers) and Powerful Model!

● Variable input size/scale

- Multi-size training, multi-scale testing, full-image view

● Multi-level pooling

- Robust to deformation

● Operated on feature maps

- Pooling in regions

10

Details – Convolutional Layers and Feature Maps

● Convolutional layers inherently accept images of arbitrary size.

● Feature maps encode not only the strength of the responses but also their spatial positions.
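
The two points above can be illustrated with a minimal sketch. It is not code from the talk: the filter, the toy images, and the helper name `conv2d_valid` are assumptions made for illustration. The same sliding-window filter runs on two images of different sizes, and the position of the strongest response follows the position of the edge in each image.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the response map.

    This is cross-correlation, which is what CNN frameworks usually compute.
    """
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Illustrative vertical-edge filter (an assumption, not a learned filter).
kernel = np.array([[1.0, 0.0, -1.0]] * 3)

small = np.zeros((8, 8))
small[:, 4:] = 1.0        # vertical edge between columns 3 and 4
large = np.zeros((12, 20))
large[:, 15:] = 1.0       # same kind of edge, different image size and position

resp_small = conv2d_valid(small, kernel)
resp_large = conv2d_valid(large, kernel)

# The output size follows the input size; no fixed input size is needed.
print(resp_small.shape, resp_large.shape)
# The column of the strongest response tracks where the edge is.
print(np.argmax(np.abs(resp_small[0])), np.argmax(np.abs(resp_large[0])))
```

Only the fully connected layers that follow need a fixed-length input, which is the constraint the SPP layer removes.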

11

Details – The Spatial Pyramid Pooling Layer

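● SPP-net adds a new Spatial Pyramid Pooling (SPP) layer between the last convolutional layer and the first fully connected layer.

Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → SPP → FC6 → FC7 → SoftMax

● Conv5 has 256 filters, so the SPP layer outputs a 256 x (4x4 + 2x2 + 1x1) = 5376-dimensional vector.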
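
The dimension arithmetic on this slide can be checked with a small NumPy sketch. This is an assumed implementation written for illustration, not the authors' code: the function name `spatial_pyramid_pool` and the floor/ceil bin boundaries are choices made here. Max-pooling is used inside each bin.

```python
import math
import numpy as np

def spatial_pyramid_pool(feature_maps, levels=(4, 2, 1)):
    """Max-pool a (channels, height, width) array into n x n bins per level
    and concatenate the results into one fixed-length vector."""
    channels, height, width = feature_maps.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor for the start and ceil for the end, so the bins cover
            # the whole map and are never empty.
            r0, r1 = (i * height) // n, math.ceil((i + 1) * height / n)
            for j in range(n):
                c0, c1 = (j * width) // n, math.ceil((j + 1) * width / n)
                pooled.append(feature_maps[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
fm_a = rng.standard_normal((256, 13, 13))   # stand-in for Conv5 output
fm_b = rng.standard_normal((256, 10, 17))   # different size and aspect ratio
vec_a = spatial_pyramid_pool(fm_a)
vec_b = spatial_pyramid_pool(fm_b)
print(vec_a.shape, vec_b.shape)   # (5376,) (5376,)
```

Both vectors have 256 x 21 = 5376 entries, so the same FC6 weights can be applied to either input.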

12

Details – Training with the Spatial Pyramid Pooling

● Single-size training

- Simply modify the configuration file of CNN frameworks

Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → SPP → FC6 → FC7 → SoftMax

Feature map after Conv5: 13x13
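
With a fixed 13x13 feature map, each pyramid level reduces to an ordinary pooling layer, which is why a configuration change is enough. A minimal sketch, assuming the rule described in the SPP-net paper (window = ceil(a/n), stride = floor(a/n) for an a x a map and n x n bins); the helper name `pooling_params` is hypothetical.

```python
import math

def pooling_params(feature_size, bins):
    """Window and stride that turn a feature_size x feature_size map
    into bins x bins pooled outputs."""
    window = math.ceil(feature_size / bins)
    stride = feature_size // bins
    # Standard output-size formula for pooling without padding.
    outputs = (feature_size - window) // stride + 1
    return window, stride, outputs

for n in (4, 2, 1):
    window, stride, outputs = pooling_params(13, n)
    print(f"{n}x{n} bins: window={window}, stride={stride}, outputs={outputs}x{outputs}")
```

For the 13x13 map this gives window/stride pairs of 4/3, 7/6, and 13/13, each producing exactly the intended number of bins.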

13

Details – Training with the Spatial Pyramid Pooling

● Multi-size training

- Multiple networks sharing all weights

● Each network handles a single input size (e.g., 224x224, 180x180).

● Improves scale invariance

[Figure: image resized to each input size]
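
A rough sketch of why the networks can share all weights: with the pooling windows chosen per input size, both sizes yield the same SPP vector length. The Conv5 feature-map sizes (13 and 10) and the epoch-by-epoch alternation are assumptions taken from the SPP-net paper's description, not from this slide, and the function names are hypothetical.

```python
import math

# Input size -> assumed Conv5 feature-map size.
FEATURE_SIZE = {224: 13, 180: 10}
LEVELS = (4, 2, 1)

def spp_output_length(feature_size, channels=256):
    """SPP vector length when each level uses window=ceil(a/n), stride=floor(a/n)."""
    total = 0
    for n in LEVELS:
        window = math.ceil(feature_size / n)
        stride = feature_size // n
        per_side = (feature_size - window) // stride + 1
        total += channels * per_side * per_side
    return total

def size_for_epoch(epoch, sizes=(224, 180)):
    """Alternate the input size from one epoch to the next (assumed schedule)."""
    return sizes[epoch % len(sizes)]

schedule = [size_for_epoch(e) for e in range(4)]
lengths = {size: spp_output_length(a) for size, a in FEATURE_SIZE.items()}
print(schedule)   # [224, 180, 224, 180]
print(lengths)    # {224: 5376, 180: 5376}
```

Because the vector length is identical, the fully connected layers need no change when the input size switches.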

14

Details – Fast CNN-based Object Detection

● The feature maps are computed from the entire image only once.

● Accuracy is similar to R-CNN, at 24x to 64x the speed.

R-CNN: 2000 convolution passes per image! SPP-net: 1 convolution pass!
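
The "compute once, pool per region" idea can be sketched as follows. This is a simplified illustration, not the authors' implementation: the total stride of 16, the plain divide-by-stride projection, the proposal boxes, and the function name `pool_window` are all assumptions.

```python
import math
import numpy as np

def pool_window(feature_maps, box, stride=16, levels=(4, 2, 1)):
    """SPP-pool the part of (C, H, W) feature maps under an image-space
    box given as (x0, y0, x1, y1)."""
    _, height, width = feature_maps.shape
    x0, y0, x1, y1 = box
    # Project image coordinates onto the feature map (simplified mapping),
    # keeping at least one cell in each direction.
    c0 = min(x0 // stride, width - 1)
    r0 = min(y0 // stride, height - 1)
    c1 = min(max(math.ceil(x1 / stride), c0 + 1), width)
    r1 = min(max(math.ceil(y1 / stride), r0 + 1), height)
    region = feature_maps[:, r0:r1, c0:c1]
    _, h, w = region.shape
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                rows = slice((i * h) // n, math.ceil((i + 1) * h / n))
                cols = slice((j * w) // n, math.ceil((j + 1) * w / n))
                pooled.append(region[:, rows, cols].max(axis=(1, 2)))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
# Stand-in for the Conv5 output, computed once for the whole image.
conv5 = rng.standard_normal((256, 30, 40))
# Hypothetical region proposals in image coordinates.
boxes = [(0, 0, 200, 150), (320, 100, 600, 420), (50, 60, 90, 470)]
features = np.stack([pool_window(conv5, b) for b in boxes])
print(features.shape)   # (3, 5376)
```

Each proposal costs only a pooling step over the shared feature maps, instead of a full pass through the convolutional layers.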

15

Experiments (1/4)

● ILSVRC image classification task

● 1000 object classes (1,431,167 images)

16

Experiments (2/4)

● ILSVRC image classification task (rank #3)

● SPP improves all CNN architectures evaluated

[Tables: top-5 test accuracy; top-5 validation accuracy]
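
For readers unfamiliar with the metric, top-5 accuracy counts a prediction as correct when the true label is among the five highest-scoring classes. A minimal sketch with made-up scores (the data and the function name `top_k_accuracy` are illustrative assumptions):

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]   # indices of the k largest scores per row
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

scores = np.array([
    [0.05, 0.10, 0.20, 0.30, 0.15, 0.12, 0.08],   # true class 3 is ranked 1st
    [0.30, 0.25, 0.20, 0.10, 0.08, 0.05, 0.02],   # true class 6 is ranked last
    [0.10, 0.05, 0.30, 0.20, 0.15, 0.12, 0.08],   # true class 5 is ranked 4th
    [0.25, 0.20, 0.18, 0.15, 0.12, 0.06, 0.04],   # true class 5 is ranked 6th
])
labels = np.array([3, 6, 5, 5])
accuracy = top_k_accuracy(scores, labels)
print(accuracy)   # 0.5
```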

17

Experiments (3/4)

● ILSVRC image detection task

● Fully annotated 200 object classes across 121,931 images

● Allows evaluation of generic object detection in cluttered scenes at scale

[Figure: detected regions compared with ground-truth boxes, each marked true or false]
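
Whether a detected region counts as true or false is commonly decided by its intersection over union (IoU) with the ground-truth box. The sketch below assumes the conventional 0.5 threshold, which is not stated on the slide, and uses made-up boxes.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_true_detection(detected, ground_truth, threshold=0.5):
    """True when the overlap exceeds the (assumed) IoU threshold."""
    return iou(detected, ground_truth) > threshold

ground_truth = (10, 10, 110, 110)
good = (20, 20, 120, 120)   # overlap 90x90 = 8100, union 11900, IoU about 0.68
bad = (80, 80, 180, 180)    # overlap 30x30 = 900, union 19100, IoU about 0.05
print(is_true_detection(good, ground_truth), is_true_detection(bad, ground_truth))
```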

18

Experiments (4/4)

● ILSVRC image detection task (rank #2)

● More practical than R-CNN

19

Conclusion

● SPP is a flexible solution for handling different scales, sizes, and aspect ratios.

● Spatial Pyramid Pooling improves accuracy.

● Multi-size training improves accuracy.

● Full-image representation improves accuracy.

● Classification: SPP improves all CNNs in the literature.

● Detection: more practical and much faster than R-CNN, with similar accuracy.
