Parallel Grid Pooling for Data Augmentation

Akito Takeki, Daiki Ikami, Go Irie†, and Kiyoharu Aizawa

{takeki,iakmi,aizawa}@hal.t.u-tokyo.ac.jp, The University of Tokyo
†[email protected], NTT Communications Science Laboratories

Abstract. Convolutional neural network (CNN) architectures utilize downsampling layers, which restrict the subsequent layers to learn spatially invariant features while reducing computational costs. However, such a downsampling operation makes it impossible to use the full spectrum of input features. Motivated by this observation, we propose a novel layer called parallel grid pooling (PGP) which is applicable to various CNN models. PGP performs downsampling without discarding any intermediate feature. It works as data augmentation and is complementary to commonly used data augmentation techniques. Furthermore, we demonstrate that a dilated convolution can naturally be represented using PGP operations, which suggests that the dilated convolution can also be regarded as a type of data augmentation technique. Experimental results based on popular image classification benchmarks demonstrate the effectiveness of the proposed method. Code is available at: https://github.com/akitotakeki/pgp-chainer.

Keywords: Data Augmentation, Dilated Convolution, Convolutional Neural Networks, Image Classification

1 Introduction

Deep learning using convolutional neural networks (CNNs) has achieved excellent performance for a wide range of tasks, such as image recognition [1,2,3], object detection [4,5,6], and semantic segmentation [7,8]. Most CNN architectures utilize spatial operation layers for downsampling. Spatial downsampling restricts subsequent layers in a CNN to learn spatially invariant features and reduces computational costs. Modern architectures often use multi-stride convolutional layers, such as a 3 × 3 or 1 × 1 convolution with a stride of 2 [1,3,9,10,11,12], or 2 × 2 average pooling with a stride of 2 [2]. Other methods for downsampling have also been proposed [13,14,15,16].

One drawback of these downsampling operations is that they make it impossible to use the full spectrum of input features, and most previous works overlook the importance of this drawback. To clarify this issue, we provide an example of such a downsampling operation in Fig. 1. Let F_k^s be a spatial operation (e.g., convolution or pooling) with a kernel size of k × k and a stride of s × s (s ≥ 2). In the first step, F_k^1 is performed on the input and yields an intermediate output. In the second step, the intermediate output is split into an s × s grid pattern and downsampled by selecting the coordinate (0, 0) (top-left corner) from each grid, resulting in an output feature that is s times smaller than the input feature. In this paper, the second-step downsampling operation G_s^(0,0) is referred to as grid pooling (GP) for the coordinate (0, 0). From this two-step example, one can see that the typical downsampling operation F_k^s utilizes only 1/s^2 of the intermediate output and discards the remaining 1 - 1/s^2.


Fig. 1. A two-step view of a spatial operation F_k^s (e.g., convolution or pooling) with a kernel size of k × k and a stride of s × s (s ≥ 2): (a) a convolution/pooling with stride s and kernel k × k is equivalent to (b) a convolution/pooling with stride 1 and kernel k × k, followed by grid pooling G_s^(0,0), which keeps the top-left corner of each grid.

Motivated by this observation, we propose a novel layer for CNNs called the parallel grid pooling (PGP) layer (Fig. 2). PGP is utilized to perform downsampling without discarding any intermediate feature and can be regarded as a data augmentation technique. Specifically, PGP splits the intermediate output into an s × s grid pattern and performs grid pooling for each of the coordinates (i, j) (0 ≤ i ≤ s - 1, 0 ≤ j ≤ s - 1) in the grid. This means that PGP transforms an input feature into s^2 feature maps, each downsampled by a factor of 1/s^2. The layers following PGP compute the s^2 feature maps, shifted from one another by several pixels, in parallel, which can be interpreted as data augmentation in the feature space.

Furthermore, we demonstrate that a dilated convolution [7,17,18,19] can naturally be decomposed into a convolution and PGP. In general, a dilated convolution is considered an efficient operation for exploiting spatial (and temporal) information and achieves state-of-the-art performance for various tasks such as semantic segmentation [7,20,21], speech recognition [18], and natural language processing [19,22]. In this sense, we provide a novel interpretation of the dilated convolution operation as a data augmentation technique in the feature space.

The proposed PGP layer has several remarkable properties. PGP can be implemented easily by inserting it after a multi-stride convolutional and/or pooling layer without altering the original learning strategy. PGP does not have any trainable parameters and can even be utilized for test-time augmentation (i.e., even if a CNN was not trained using PGP layers, the performance can still be improved by inserting PGP layers into the pretrained CNN at testing time).

We evaluate our method on standard image classification benchmarks: CIFAR-10, CIFAR-100 [23], SVHN [24], and ImageNet [25]. Experimental results demonstrate that PGP can improve the performance of recent state-of-the-art network architectures and is complementary to widely used data augmentation techniques (e.g., random flipping, random cropping, and random erasing).

The major contributions of this paper are as follows:


- We propose PGP, which performs downsampling without discarding any intermediate feature information and can be regarded as a data augmentation technique. PGP is easily applicable to various CNN models without altering their learning strategies or the number of parameters.

- We demonstrate that PGP is implicitly used in a dilated convolution operation, which suggests that dilated convolution can also be regarded as a type of data augmentation technique.

2 Related Works

PGP is most closely related to data augmentation. In the context of training a CNN, data augmentation is a standard regularization technique that is used to avoid overfitting and artificially enlarge the size of a training dataset. For example, applying crops, flips, and various affine transforms in a fixed order or at random is a widely used technique in many computer vision tasks. Data augmentation techniques can be divided into two categories: data augmentation in the image space and data augmentation in the feature space.

Data augmentation in the image space. AlexNet [26], which achieved state-of-the-art performance on the ILSVRC2012 dataset, applies random flipping, random cropping, color perturbation, and noise addition based on principal component analysis. TANDA [27] learns a generative sequence model that produces data augmentation strategies using adversarial techniques on unlabeled data. The method in [28] trains a class-conditional model of diffeomorphisms by interpolating between nearest-neighbor data. Recently, training techniques utilizing images with randomly masked rectangular regions, such as random erasing [29] and cutout [30], have been proposed. Using these augmentation techniques, recent methods have achieved state-of-the-art results for image classification [31,32]. Mix-up [33] and BC-learning [34] generate between-class images by mixing two images belonging to different classes with a random ratio.

Data augmentation in the feature space. Some techniques generate augmented training sets by interpolating [35] or extrapolating [36] features from the same class. DAGAN [37] utilizes conditional generative adversarial networks to conduct feature-space data augmentation for one-shot learning. A-Fast-RCNN [38] generates hard examples with occlusions and deformations using an adversarial learning strategy in feature space. Smart augmentation [39] trains a neural network to combine training samples in order to generate augmented data. As another type of feature-space data augmentation, AGA [40] and Fatten [41] perform augmentation in the feature space by using properties of training images (e.g., depth and pose).

While our PGP performs downsampling without discarding any intermediate feature, it splits the input feature map into branches, which will be discussed in detail in Sec. 3.1. The operations following PGP (e.g., convolution layers or fully connected layers) compute s^2 slightly shifted feature maps, which can be interpreted as a novel data augmentation technique in the feature space. In addition, dilated convolution can be regarded as a data augmentation technique in the feature space, which will be discussed in detail in Sec. 3.2.

Fig. 2. An overview of Parallel Grid Pooling (PGP). An input x passes through F_k^1 to produce an intermediate feature map v, which is pooled in parallel by the grid poolings G_s^(0,0), G_s^(0,1), G_s^(1,0), and G_s^(1,1); the subsequent CNN branches share their weights.

3 Proposed Method

3.1 Parallel Grid Pooling (PGP)

An overview of PGP is illustrated in Fig. 2. Let x ∈ R^(b×c×h×w) be an input feature map, where b, c, h, and w are the number of batches, number of channels, height, and width, respectively. F_k^s is a spatial operation (e.g., a convolution or a pooling) with a kernel size of k × k and a stride of s × s (s ≥ 2). The height and width of the output feature map are h/s and w/s, respectively.

F_k^s can be divided into two steps (Fig. 1). In the first step, v = F_k^1(x) is performed, which is an operation that makes full use of the input feature map information with a stride of one, producing an intermediate feature map v. Then v is split into w/s × h/s blocks of size s × s. Let the coordinate in each grid be (i, j) (0 ≤ i ≤ s - 1, 0 ≤ j ≤ s - 1). In the second step, v is downsampled by selecting the coordinate (0, 0) (top-left corner) in each grid. Let this downsampling operation G_s^(i,j) of selecting the coordinate (i, j) in each grid be called grid pooling (GP). To summarize, given a kernel size of k × k and a stride of s × s (s ≥ 2), the multi-stride spatial operation layer is represented as

    y = F_k^s(x) = G_s^(0,0)(F_k^1(x)).                                   (1)

Our proposed PGP method is defined as

    (y^(0,0), y^(0,1), ..., y^(s-1,s-1)) = (G_s^(0,0)(x), G_s^(0,1)(x), ..., G_s^(s-1,s-1)(x)) = PGP(x).    (2)

The conventional multi-stride operation uses y^(0,0), which consists of only 1/s^2 of the intermediate output, and discards the remaining 1 - 1/s^2 of the feature map. In contrast, PGP retains all possible choices of grid pooling, meaning that PGP makes full use of the feature map without discarding the 1 - 1/s^2 portion of the intermediate output that is discarded by the conventional method.


A network architecture with a PGP layer can be viewed as one with s^2 internal branches obtained by performing grid pooling in parallel. Note that the weights (i.e., parameters) of the following network are shared between all branches; hence there are no additional parameters throughout the network. PGP performs downsampling while maintaining the spatial structure of the intermediate features, thereby producing output features shifted by several pixels. As described above, because the operations in each layer following PGP share the same weights across branches, the subsequent layers in each branch are trained over a set of slightly shifted feature maps simultaneously. This works as data augmentation: with PGP, the layers are trained with an s^2 times larger number of mini-batch samples compared to the case without PGP.
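To make the operation concrete, the following is a minimal NumPy sketch of grid pooling and PGP (our own illustration, not code from the pgp-chainer repository; the function names grid_pool and pgp are ours). It stacks the s^2 branches along the batch axis so that subsequent layers process all shifted feature maps in parallel with shared weights.

import numpy as np

def grid_pool(v, s, i, j):
    """Grid pooling G_s^(i,j): keep coordinate (i, j) of every s x s grid.

    v: intermediate feature map of shape (b, c, h, w), with h and w divisible by s.
    Returns an array of shape (b, c, h//s, w//s).
    """
    return v[:, :, i::s, j::s]

def pgp(v, s):
    """Parallel grid pooling: all s^2 grid poolings, stacked along the batch axis.

    Output shape: (s^2 * b, c, h//s, w//s); branch (i, j) occupies block i*s + j.
    """
    branches = [grid_pool(v, s, i, j) for i in range(s) for j in range(s)]
    return np.concatenate(branches, axis=0)

# Example: a conventional stride-2 downsampling keeps only grid_pool(v, 2, 0, 0);
# PGP keeps all four shifted versions.
v = np.random.randn(8, 16, 32, 32).astype(np.float32)
y = pgp(v, s=2)
print(y.shape)  # (32, 16, 16, 16)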

3.2 PGP vs. Dilated Convolution

We here discuss the relationship between PGP and dilated convolution [7,20]. Dilated convolution is a convolutional operation applied to an input with a defined gap. It is considered to be an operation that efficiently utilizes spatial and temporal information [7,18,19,20,42]. For brevity, let us consider one-dimensional signals. Dilated convolution is defined as

    y[i] = Σ_{l=1}^{L} x[i + rl] u[l],                                    (3)

where x[i] is an input feature, y[i] is an output feature, u[l] is a convolution filter of kernel size L, and r is a dilation rate. A dilated convolution is equivalent to a regular convolution when r = 1. Let i = rj + k (0 ≤ k ≤ r - 1). Then the output y[rj + k] is

    y[rj + k] = Σ_{l=1}^{L} x[rj + k + rl] u[l]                           (4)
              = Σ_{l=1}^{L} x[r(j + l) + k] u[l].                         (5)

This suggests that the outputs {y[i] | i ≡ k (mod r)} can be obtained from the inputs {x[i] | i ≡ k (mod r)}. To summarize, an r-dilated convolution is equivalent to an operation in which the input is split into r branches, convolutions are performed with the same kernel u, and the results are spatially rearranged into a single output.
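The index identity behind Eqs. (4) and (5) can be checked numerically. The sketch below is our own (not from the paper's code); it compares an r-dilated 1-D convolution with plain convolutions applied per phase using the same kernel u.

import numpy as np

def dilated_conv1d(x, u, r):
    """y[i] = sum_{l=1..L} x[i + r*l] * u[l-1], for all valid i (no padding)."""
    L = len(u)
    n_out = len(x) - r * L
    return np.array([sum(x[i + r * (l + 1)] * u[l] for l in range(L))
                     for i in range(n_out)])

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
u = rng.standard_normal(3)   # kernel size L = 3
r = 2                        # dilation rate

y = dilated_conv1d(x, u, r)

# Phase k of the output depends only on phase k of the input (Eq. (5)):
for k in range(r):
    y_phase = dilated_conv1d(x[k::r], u, 1)   # plain convolution on the k-th branch
    assert np.allclose(y[k::r], y_phase)
print("r-dilated conv == per-phase plain conv with the shared kernel u")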

A dilated convolution in two-dimensional space is illustrated in Fig. 3 (in the case r = 2). The r-dilated convolution downsamples an input feature with an interval of r pixels, then produces r^2 branches as outputs. Therefore, the dilated convolution is equivalent to the following three-step operation: PGP, a convolution sharing the same weights across branches, and the inverse of PGP (PGP^-1), which rearranges the cells to their original positions. For example, let the size of the input features be x = (b, c, h, w), where b, c, h, and w are the number of batches, number of channels, height, and width, respectively. Using PGP, the input feature x is divided into r^2 × (b, c, h/r, w/r) branches; subsequently, a convolutional filter (with shared weight parameters) is applied to each branch in parallel (Fig. 3). Then the intermediate r^2 feature maps are rearranged by PGP^-1, resulting in an output of the same size (b, c, h, w) as the input. In short, PGP is embedded in a dilated convolution, which suggests that the success of dilated convolution can be attributed to data augmentation in the feature space.

Fig. 3. PGP is implicitly used by dilated convolution: a dilated convolution with r = 2 on a (b, c, h, w) input is equivalent to PGP, convolutions with shared weights applied to the 4 × (b, c, h/2, w/2) branches, and PGP^-1 back to (b, c, h, w).
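A hedged sketch of the same equivalence in two dimensions, written with PyTorch's F.conv2d (our own check rather than the paper's implementation): with no padding and h, w divisible by r, each (i, j) phase of the dilated output matches an ordinary convolution of the corresponding PGP branch with the shared kernel.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
r = 2                                   # dilation rate / PGP stride
x = torch.randn(1, 8, 32, 32)           # (b, c, h, w), h and w divisible by r
w = torch.randn(16, 8, 3, 3)            # shared 3x3 kernel, 16 output channels

# r-dilated convolution on the full-resolution input (no padding).
y_dilated = F.conv2d(x, w, dilation=r)

# The same result, branch by branch: PGP (phase (i, j) of the input),
# an ordinary convolution with the *same* kernel, and PGP^-1 (scatter back).
y_pgp = torch.empty_like(y_dilated)
for i in range(r):
    for j in range(r):
        branch = x[:, :, i::r, j::r]                    # G_r^(i,j)(x)
        y_pgp[:, :, i::r, j::r] = F.conv2d(branch, w)   # shared weights

print(torch.allclose(y_dilated, y_pgp, atol=1e-5))  # True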

3.3 Architectural Differences between Dilated Convolution and PGP

The structure of the network model with PGP is better suited for learning than that with dilated convolutions. For example, consider the cases where two downsampling layers in a base CNN (Base-CNN) are replaced by either dilated convolutions (Fig. 4(b)) or PGP (Fig. 4(d)). According to [7], a conventional multi-stride downsampling layer and all subsequent convolution layers can be replaced with dilated convolution layers with a doubled dilation rate (Fig. 4(b)). As mentioned in Section 3.2, a dilated convolution is equivalent to (PGP + Convolution + PGP^-1). When n r-dilated convolutions are stacked, they are expressed as (PGP + Convolution × n + PGP^-1). This means that each branch in a CNN with dilated convolutions (DConv-CNN) is independent of the others.

Let us consider the case of Fig. 4(b), in which dilated convolutions with two different dilation rates (e.g., r1 = 2 and r2 = 4) are stacked. Using the fact that a sequence of n 4-dilated convolutions is (PGP^2 + Convolution × n + PGP^-2), this architecture can readily be transformed into the one given in Fig. 4(c), which is the equivalent form with PGP. We can clearly see that each branch split by the former dilation layer is split again by the latter, and all the branches are eventually rearranged by PGP^-1 layers prior to the global average pooling (GAP) layer. In contrast, the CNN with PGP (PGP-CNN) is implemented as follows: the stride of each convolution or pooling layer that decreases resolution is set to 1, after which PGP is inserted (Fig. 4(d)). In DConv-CNN, all the branches are aggregated through PGP^-1 layers just before the GAP layer, which can be regarded as a feature ensemble, whereas PGP-CNN does not use any PGP^-1 layer throughout the network, so the intermediate features are not fused until the end of the network.

Fig. 4. Comparison between DConv-CNN and PGP-CNN. As mentioned in Section 3.1, Base-CNN is expressed as the architecture using GP in (a); (b) is equivalent to (c). (a) Base CNN (Base-CNN): Image → CNN → GP → CNN → GP → CNN → GAP → FC → Loss. (b) CNN with dilated convolution (DConv-CNN), with dilated stages at r = 2 and r = 4. (c) DConv-CNN represented by PGP, with weight sharing across branches and PGP^-1 layers before the GAP layer. (d) CNN with PGP (PGP-CNN), in which each branch has the same structure as Base-CNN.

Unlike a feature ensemble (i.e., the case of DConv-CNN), this structure enforces all the branches to learn correct classification results. As mentioned above, DConv-CNN averages an ensemble of intermediate features; the likelihood of each class is then predicted based on the average. Even if some branches contain useful features (e.g., for image classification) while the other branches do not acquire any, the CNN attempts to determine the correct class based on the average, which prevents the useless branches from further learning. In contrast, thanks to the independence of the branches, PGP-CNN can perform additional learning with superior regularization.
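As a sketch of the PGP-CNN construction (ours, in PyTorch rather than the paper's Chainer, assuming a single stride-2 convolution as the downsampling layer): the stride is set to 1 and a parameter-free PGP module is appended, so the parameter shapes stay identical to Base-CNN.

import torch
import torch.nn as nn

class PGP(nn.Module):
    """Parallel grid pooling: stack all s^2 grid-pooled branches along the batch axis."""
    def __init__(self, s: int = 2):
        super().__init__()
        self.s = s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.s
        branches = [x[:, :, i::s, j::s] for i in range(s) for j in range(s)]
        return torch.cat(branches, dim=0)

# Base-CNN downsampling stage: a 3x3 convolution with stride 2.
base_stage = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

# PGP-CNN counterpart: the same convolution with stride 1, followed by PGP.
# The weight tensors have identical shapes, so parameters transfer directly.
pgp_stage = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    PGP(s=2),
)
pgp_stage[0].load_state_dict(base_stage.state_dict())

x = torch.randn(4, 64, 32, 32)
print(base_stage(x).shape)  # torch.Size([4, 128, 16, 16])
print(pgp_stage(x).shape)   # torch.Size([16, 128, 16, 16]): 4 branches in the batch axis

# The stride-2 output coincides with the (0, 0) branch of the PGP stage.
assert torch.allclose(base_stage(x), pgp_stage(x)[:4], atol=1e-6)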


3.4 Weight Transfer Using PGP

Compared to DConv-CNN, PGP-CNN has a significant advantage in terms of weight transfer. Specifically, the weights learned on PGP-CNN work well on the Base-CNN structure when transferred; we demonstrate later in the experiments that this can improve the performance of the base architecture. However, DConv-CNN does not have the same structure as Base-CNN because of PGP^-1: the replacement of downsampling by dilated convolutions makes it difficult to transfer weights from the pre-trained base model. The base CNN with multi-stride convolutions (the top of Fig. 5(a)) is represented by single-stride convolutions and grid pooling (GP) layers (the bottom of Fig. 5(a)). DConv-CNN (the top of Fig. 5(b)) is equivalent to the base CNN with PGP and PGP^-1 (the bottom of Fig. 5(b)). Focusing on the difference in the size of the input features of a 3 × 3 convolution, the size is (b, c1, h, w) in Base-CNN, whereas it is (b, c1, h/2, w/2) in DConv-CNN. This difference in resolution makes it impossible to directly transfer weights from the pre-trained model. Unlike DConv-CNN, PGP-CNN maintains the same structure as Base-CNN (Fig. 5(c)), which results in better transferability and better performance.

This weight transferability of PGP has two advantages. First, applying PGP to Base-CNN with learned weights at test time can improve recognition performance without retraining. For example, if one does not have sufficient computing resources to train PGP-CNN, using PGP only in the test phase can still make full use of the features discarded by downsampling in the training phase to achieve higher accuracy. Second, the weights learned by PGP-CNN can work well in Base-CNN. For example, on an embedded system specialized for a particular CNN, one cannot change the calculations inside the CNN; however, the weights learned by PGP-CNN can result in superior performance even when used in Base-CNN at test time.
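A minimal sketch of the first advantage (our own toy example; the module names are hypothetical, and averaging the s^2 branch predictions after global average pooling is our assumption, since this section does not specify how branch outputs are aggregated at test time):

import torch
import torch.nn as nn

def gap_head(features, fc):
    """Global average pooling followed by a linear classifier."""
    pooled = features.mean(dim=(2, 3))
    return fc(pooled)

# Toy Base-CNN pieces (hypothetical): a stem, one stride-2 downsampling conv, a classifier.
stem = nn.Conv2d(3, 64, 3, padding=1)
down = nn.Conv2d(64, 128, 3, stride=2, padding=1)   # pretrained weights live here
fc = nn.Linear(128, 10)

x = torch.randn(2, 3, 32, 32)

# Base-CNN inference.
logits_base = gap_head(down(stem(x)), fc)

# Test-time PGP: reuse the *same* pretrained weights, run the downsampling conv with
# stride 1, split into the 4 shifted branches, and average the branch predictions.
h = nn.functional.conv2d(stem(x), down.weight, down.bias, stride=1, padding=1)
branch_logits = [gap_head(h[:, :, i::2, j::2], fc) for i in range(2) for j in range(2)]
logits_pgp = torch.stack(branch_logits).mean(dim=0)

print(logits_base.shape, logits_pgp.shape)  # torch.Size([2, 10]) torch.Size([2, 10])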

4 Experimental Results

4.1 Image Classification

Datasets. We evaluated PGP on the benchmark datasets CIFAR-10, CIFAR-100 [23], and SVHN [24]. Both CIFAR-10 and CIFAR-100 consist of 32 × 32 color images, with 50,000 images for training and 10,000 images for testing. The CIFAR-10 dataset has 10 classes, each containing 6,000 images; there are 5,000 training images and 1,000 testing images per class. The CIFAR-100 dataset has 100 classes, each containing 600 images; there are 500 training images and 100 testing images per class. We adopted a standard data augmentation scheme (mirroring/shifting) that is widely used for these two datasets [1,2,11]. For preprocessing, we normalized the data using the channel means and standard deviations. We used all 50,000 training images for training and calculated the final testing error after training.

The Street View House Numbers (SVHN) dataset consists of 32 × 32 color digit images. There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 additional images for training. Following the method in [10,2], we used all the training data without any data augmentation.

Fig. 5. The structure performing downsampling inside (a) Base-CNN, (b) DConv-CNN, and (c) PGP-CNN. This downsampling structure (a 3 × 3 convolution with a stride of 2) is used in a derivative of ResNet [9], WideResNet [10], and ResNeXt [11].

Implementation details. We trained a pre-activation ResNet (PreResNet) [12], All-CNN [43], WideResNet [10], ResNeXt [11], PyramidNet [44], and DenseNet [2] from scratch. We used a 164-layer network for PreResNet. We used WRN-28-10, ResNeXt-29 (8×64d), PyramidNet-164 (bottleneck, α = 48), and DenseNet-BC-100 (k = 12) in the same manner as the original papers.

When comparing the base CNN (Base), the CNN with dilated convolutions (DConv), and the CNN with PGP (PGP), we utilized the same hyperparameter values (e.g., learning rate strategy) to carefully monitor the performance improvements. We adopted the cosine-shaped learning rate schedule [45], which smoothly anneals the learning rate [46,47]. The learning rate was initially set to 0.2 and gradually reduced to 0.002. Following [2], for CIFAR and SVHN, we trained the networks using mini-batches of size 64 for 300 epochs and 40 epochs, respectively. All networks were optimized using stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 1.0 × 10^-4. We adopted the weight initialization method introduced in [48]. We used the Chainer framework [49] and ChainerCV [50] for the implementation of all experiments.
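For reference, a cosine-shaped schedule in the spirit of [45] can be written as below (a sketch under the assumption of a single cosine cycle over all epochs with the end points 0.2 and 0.002 quoted above; the exact variant used in the experiments may differ).

import math

def cosine_lr(epoch: int, total_epochs: int,
              lr_max: float = 0.2, lr_min: float = 0.002) -> float:
    """Cosine-shaped learning rate: starts at lr_max and decays smoothly to lr_min."""
    t = epoch / max(total_epochs - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

# Example: the 300-epoch CIFAR schedule.
for epoch in [0, 150, 299]:
    print(epoch, round(cosine_lr(epoch, 300), 4))  # 0.2, ~0.101, 0.002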


Table 1. Test errors (%) with various architectures on CIFAR-10, CIFAR-100, and SVHN. DConv: dilated convolution; PGP: parallel grid pooling.

Dataset     Network                    Params   Base         DConv        PGP
CIFAR-10    PreResNet-164              1.7M     4.71±0.01    4.15±0.12    3.77±0.09
            All-CNN                    1.4M     8.42±0.23    8.68±0.43    7.17±0.32
            WideResNet-28-10           36.5M    3.44±0.06    3.88±0.21    3.13±0.15
            ResNeXt-29 (8×64d)         34.4M    3.86±0.16    3.87±0.13    3.22±0.34
            PyramidNet-164 (α=28)      1.7M     3.91±0.15    3.72±0.07    3.38±0.09
            DenseNet-BC-100 (k=12)     0.8M     4.60±0.07    4.35±0.22    4.11±0.15
CIFAR-100   PreResNet-164              1.7M     22.68±0.27   20.71±0.27   20.18±0.45
            All-CNN                    1.4M     32.43±0.31   33.10±0.39   29.79±0.33
            WideResNet-28-10           36.5M    18.70±0.39   19.69±0.50   18.20±0.17
            ResNeXt-29 (8×64d)         34.4M    19.63±1.12   17.90±0.57   17.18±0.29
            PyramidNet-164 (α=28)      1.7M     19.65±0.24   19.10±0.32   18.12±0.16
            DenseNet-BC-100 (k=12)     0.8M     22.62±0.29   22.20±0.03   21.69±0.37
SVHN        PreResNet-164              1.7M     1.96±1.74    1.74±0.07    1.54±0.02
            All-CNN                    1.4M     1.94±0.06    1.77±0.07    1.75±0.02
            WideResNet-28-10           36.5M    1.81±0.03    1.53±0.02    1.38±0.03
            ResNeXt-29 (8×64d)         34.4M    1.81±0.02    1.66±0.03    1.52±0.01
            PyramidNet-164 (α=28)      1.7M     2.02±0.05    1.86±0.06    1.79±0.05
            DenseNet-BC-100 (k=12)     0.8M     1.97±0.12    1.77±0.06    1.67±0.07

Results. The results of applying DConv and PGP on CIFAR-10, CIFAR-100, and SVHN with different networks are listed in Table 1. In all cases, PGP outperformed Base and DConv.

Comparison to other data augmentation methods. We compare our method to random flipping (RF), random cropping (RC), and random erasing (RE) in Table 2. PGP achieves better performance than any other data augmentation method when applied alone. When applied in combination with other data augmentation methods, PGP works in a complementary manner. Combining RF, RC, RE, and PGP yields an error rate of 19.16%, which is an 11.74-point improvement over the baseline without any augmentation methods.

Comparison to classical ensemble method. We compare our method to the classical ensemble method in Table 3. On CIFAR-10 and SVHN, a single CNN with the PGP model achieves better performance than an ensemble of three basic CNN models. Furthermore, PGP works in a complementary manner when applied to the ensemble.

Table 2. Test errors (%) with different data augmentation methods on CIFAR-100 based on PreResNet-164. RF: random flipping; RC: random cropping; RE: random erasing; PGP: parallel grid pooling.

RF   RC   RE     w/o PGP       w/ PGP
-    -    -      30.90±0.32    22.68±0.27
X    -    -      26.40±0.18    20.70±0.34
-    X    -      23.94±0.07    22.34±0.23
-    -    X      27.52±0.22    21.32±0.21
X    X    -      22.68±0.27    20.18±0.45
X    -    X      24.39±0.55    20.01±0.14
-    X    X      22.59±0.09    20.71±0.02
X    X    X      21.48±0.21    19.16±0.20

Table 3. Ensembled test errors (%) on CIFAR-10/100 and SVHN based on PreResNet-164. "×k" means an ensemble of k CNN models. DConv: dilated convolution; PGP: parallel grid pooling.

                      ×1                           ×3
Dataset      Base    DConv   PGP          Base    DConv   PGP
CIFAR-10     4.71    4.15    3.77         3.98    3.37    3.14
CIFAR-100    22.68   20.71   20.18        19.16   17.35   17.15
SVHN         1.96    1.74    1.54         1.65    1.51    1.39

Weight transfer. The results of applying PGP as a test-time data augmentation method on CIFAR-10, CIFAR-100, and SVHN with different networks are listed in Table 4. PGP achieves better performance than the base model except on the All-CNN and ResNeXt models. The reason for this incompatibility is left for future work.

The results of applying PGP as a training-time data augmentation technique on the CIFAR-10, CIFAR-100, and SVHN datasets with different networks are listed in Table 5. The models trained with PGP, which have the same structure as the base model at the test phase, outperformed both the base models and the models with dilated convolutions. The models trained with dilated convolutions performed worse than the baseline models.

4.2 ImageNet Classification

We further evaluated PGP using the ImageNet 2012 classification dataset [25]. The ImageNet dataset comprises 1.28 million training images and 50,000 validation images from 1,000 classes. With the data augmentation scheme for training presented in [9,11,51], input images of size 224 × 224 were randomly cropped from a resized image using scale and aspect ratio augmentation. We reimplemented a derivative of ResNet [9] for training. According to the dilated network strategy [7,20], we used dilated convolutions or PGP after the conv4 stage. All networks were optimized using SGD with a momentum of 0.9 and a weight decay of 1.0 × 10^-4. The models were trained over 90 epochs using mini-batches of size 256 on eight GPUs. We adopted the weight initialization method introduced in [48]. We adopted a cosine-shaped learning rate schedule, where the learning rate was initially set to 0.1 and gradually reduced to 1.0 × 10^-4.

Table 4. Test errors (%) when applying PGP as a test-time data augmentation with various architectures on CIFAR-10, CIFAR-100, and SVHN. PGP: parallel grid pooling.

Dataset     Network                    Params   Base         PGP
CIFAR-10    PreResNet-164              1.7M     4.71±0.01    4.56±0.14
            All-CNN                    1.4M     8.42±0.23    9.03±0.30
            WideResNet-28-10           36.5M    3.44±0.06    3.39±0.02
            ResNeXt-29 (8×64d)         34.4M    3.86±0.16    4.01±0.21
            PyramidNet-164 (α=28)      1.7M     3.91±0.15    3.82±0.05
            DenseNet-BC-100 (k=12)     0.8M     4.60±0.07    4.53±0.12
CIFAR-100   PreResNet-164              1.7M     22.68±0.27   22.19±0.26
            All-CNN                    1.4M     32.43±0.31   33.24±0.30
            WideResNet-28-10           36.5M    18.70±0.39   18.60±0.39
            ResNeXt-29 (8×64d)         34.4M    19.63±1.12   20.02±0.98
            PyramidNet-164 (α=28)      1.7M     19.65±0.24   19.34±0.28
            DenseNet-BC-100 (k=12)     0.8M     22.62±0.29   22.33±0.26
SVHN        PreResNet-164              1.7M     1.96±0.07    1.83±0.06
            All-CNN                    1.4M     1.94±0.06    3.86±0.05
            WideResNet-28-10           36.5M    1.81±0.03    1.77±0.03
            ResNeXt-29 (8×64d)         34.4M    1.81±0.02    2.47±0.51
            PyramidNet-164 (α=28)      1.7M     2.02±0.05    2.08±0.08
            DenseNet-BC-100 (k=12)     0.8M     1.97±0.12    1.89±0.08

The results when applying dilated convolution and PGP on ImageNet with ResNet-50 and ResNet-101 are listed in Table 6. As can be seen in the results, the networks with PGP outperformed the baseline. PGP works in a complementary manner when applied in combination with 10-crop testing, and it can be used as a training-time or testing-time data augmentation technique.

4.3 Multi-label Classification

We further evaluated PGP for multi-label classification on the NUS-WIDE [52] and MS-COCO [53] datasets. NUS-WIDE contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated with 81 concepts, including objects and scenes. Following the method in [54], we used 161,789 images for training and 107,859 images for testing. MS-COCO contains 82,783 images for training and 40,504 images for testing, which are labeled with 80 common object categories. The input images were resized to 256 × 256, randomly cropped, and then resized again to 224 × 224.

Table 5. Test errors (%) when applying PGP as a training-time data augmentation technique with various architectures on CIFAR-10, CIFAR-100, and SVHN. DConv: dilated convolution; PGP: parallel grid pooling.

Dataset     Network                    Params   Base         DConv         PGP
CIFAR-10    PreResNet-164              1.7M     4.71±0.01    7.30±0.20     4.08±0.09
            All-CNN                    1.4M     8.42±0.23    38.77±0.85    7.30±0.31
            WideResNet-28-10           36.5M    3.44±0.06    7.90±0.90     3.30±0.13
            ResNeXt-29 (8×64d)         34.4M    3.86±0.16    16.91±0.45    3.36±0.27
            PyramidNet-164 (α=28)      1.7M     3.91±0.15    6.82±0.46     3.55±0.08
            DenseNet-BC-100 (k=12)     0.8M     4.60±0.07    7.03±0.70     4.36±0.10
CIFAR-100   PreResNet-164              1.7M     22.68±0.27   24.90±0.47    21.01±0.54
            All-CNN                    1.4M     32.43±0.31   64.27±2.21    30.26±0.44
            WideResNet-28-10           36.5M    18.70±0.39   26.90±0.80    18.56±0.21
            ResNeXt-29 (8×64d)         34.4M    19.63±1.12   44.57±13.78   17.67±0.31
            PyramidNet-164 (α=28)      1.7M     19.65±0.24   25.28±0.30    18.58±0.13
            DenseNet-BC-100 (k=12)     0.8M     22.62±0.29   27.56±0.78    22.46±0.16
SVHN        PreResNet-164              1.7M     1.96±1.74    3.21±0.32     1.67±0.04
            All-CNN                    1.4M     1.94±0.06    4.64±0.60     1.80±0.04
            WideResNet-28-10           36.5M    1.81±0.03    3.11±0.17     1.45±0.03
            ResNeXt-29 (8×64d)         34.4M    1.81±0.02    5.64±0.61     1.58±0.03
            PyramidNet-164 (α=28)      1.7M     2.02±0.05    3.23±0.10     1.90±0.09
            DenseNet-BC-100 (k=12)     0.8M     1.97±0.12    2.74±0.10     1.78±0.07

We trained a derivative of ResNet [9] with pre-training on the ImageNet dataset. We used the same pre-trained weights throughout all experiments for a fair comparison. All networks were optimized using SGD with a momentum of 0.9 and a weight decay of 5.0 × 10^-4. The models were trained for 16 and 30 epochs using mini-batches of size 96 on four GPUs for NUS-WIDE and MS-COCO, respectively. We adopted the weight initialization method introduced in [48]. For fine-tuning, the learning rate was initially set to 1.0 × 10^-3 and divided by 10 after the 8th and 12th epochs for NUS-WIDE and after the 15th and 22nd epochs for MS-COCO. We employed mean average precision (mAP), macro/micro F-measure, precision, and recall for performance evaluation. If the confidence for a class was greater than 0.5, the label was predicted as positive.
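To make the evaluation protocol concrete, here is a small sketch (ours, with randomly generated stand-in scores and labels) of the 0.5-threshold decision and the macro/micro precision, recall, and F-measure described above; the exact definitions used in the paper may differ in detail.

import numpy as np

def multilabel_metrics(scores, labels, threshold=0.5, eps=1e-12):
    """Macro/micro precision, recall, and F1 for multi-label predictions.

    scores: (n_samples, n_classes) confidences in [0, 1]
    labels: (n_samples, n_classes) binary ground truth
    """
    pred = (scores > threshold).astype(np.int64)
    tp = (pred * labels).sum(axis=0).astype(np.float64)
    fp = (pred * (1 - labels)).sum(axis=0).astype(np.float64)
    fn = ((1 - pred) * labels).sum(axis=0).astype(np.float64)

    # Macro: per-class precision/recall/F1, then average over classes.
    p_c = tp / (tp + fp + eps)
    r_c = tp / (tp + fn + eps)
    f_c = 2 * p_c * r_c / (p_c + r_c + eps)
    macro = (f_c.mean(), p_c.mean(), r_c.mean())

    # Micro: pool all classes before computing precision/recall/F1.
    P = tp.sum() / (tp.sum() + fp.sum() + eps)
    R = tp.sum() / (tp.sum() + fn.sum() + eps)
    micro = (2 * P * R / (P + R + eps), P, R)
    return macro, micro

scores = np.random.rand(4, 80)                       # e.g., 80 MS-COCO categories
labels = (np.random.rand(4, 80) > 0.9).astype(np.int64)
print(multilabel_metrics(scores, labels))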

The results when applying dilated convolution and PGP to ResNet-50 and ResNet-101 are listed in Table 7. The networks with PGP achieved higher mAP than the others. In contrast, the networks with dilated convolutions achieved lower scores than the baseline, which suggests that the ImageNet pre-trained weights were not successfully transferred to the ResNet with dilated convolutions due to the difference in resolution, as mentioned in Sec. 3.4. This problem can be avoided by maintaining the dilation rate in the layers that perform downsampling, as in dilated residual networks (DRN) [17]. This operation allows the CNN with dilated convolutions to perform convolution at the same resolution as the base CNN. The DRN achieved better mAP scores than the base CNN and nearly the same scores as the CNN with PGP. The CNN with PGP was, however, superior to DRN in that it retains weight transferability.


Table 6. Classification error (%) for the ImageNet 2012 validation set. The error was evaluated with and without 10-crop testing for 224 × 224-pixel inputs. DConv: dilated convolution; PGP: parallel grid pooling.

                                              1 crop             10 crops
Network      Params   Train    Test       top-1   top-5       top-1   top-5
ResNet-50    25.6M    Base     Base       23.69   7.00        21.87   6.07
                      DConv    DConv      22.47   6.27        21.17   5.58
                      PGP      PGP        22.40   6.30        20.83   5.56
                      Base     PGP        23.32   6.85        21.62   5.93
                      DConv    Base       31.44   11.40       26.79   8.19
                      PGP      Base       23.01   6.66        21.22   5.74
ResNet-101   44.5M    Base     Base       22.49   6.38        20.85   5.50
                      DConv    DConv      21.26   5.61        20.03   5.02
                      PGP      PGP        21.34   5.65        19.81   5.00
                      Base     PGP        22.13   6.21        20.46   5.36
                      DConv    Base       25.63   8.01        22.40   6.05
                      PGP      Base       21.80   5.95        20.18   5.15

5 Conclusion

In this paper, we proposed PGP, which performs downsampling without discarding any intermediate feature and can be regarded as a data augmentation technique. PGP is easily applicable to various CNN models without altering their learning strategies. Additionally, we demonstrated that PGP is implicitly used in dilated convolution, which suggests that dilated convolution is also a type of data augmentation technique. Experiments on CIFAR-10/100 and SVHN with six network models demonstrated that PGP outperformed the base CNN and the CNN with dilated convolutions. PGP also obtained better results than the base CNN and the CNN with dilated convolutions on the ImageNet dataset for image classification and on the NUS-WIDE and MS-COCO datasets for multi-label image classification.


Table 7. Quantitative results on NUS-WIDE and MS-COCO. DConv: dilated convolution; DRN: dilated residual networks; PGP: parallel grid pooling.

                                               Macro                   Micro
Dataset    Network      Params   Arch    mAP    F      P      R        F      P      R
NUS-WIDE   ResNet-50    25.6M    Base    55.7   52.1   63.3   47.0     70.2   75.3   65.8
                                 DConv   55.3   51.7   63.3   46.6     70.0   75.7   65.2
                                 DRN     55.9   52.2   43.8   47.0     70.1   75.8   65.3
                                 PGP     56.1   51.9   64.4   46.2     70.0   76.4   64.6
           ResNet-101   44.5M    Base    56.5   53.0   63.6   48.3     70.4   75.4   66.1
                                 DConv   56.3   53.4   62.7   49.2     70.5   74.9   66.5
                                 DRN     56.8   53.1   63.1   48.9     70.5   75.1   66.5
                                 PGP     57.2   53.5   64.6   48.7     70.6   75.7   66.2
MS-COCO    ResNet-50    25.6M    Base    73.4   67.5   78.2   60.9     72.8   82.2   65.4
                                 DConv   72.7   66.6   77.9   59.7     72.2   81.9   64.5
                                 DRN     74.2   68.3   78.8   61.8     73.4   82.5   66.1
                                 PGP     74.2   67.9   80.0   60.5     73.1   83.7   64.9
           ResNet-101   44.5M    Base    74.6   68.9   77.6   62.8     73.8   82.2   66.9
                                 DConv   74.0   68.3   78.0   62.0     73.2   81.6   66.4
                                 DRN     75.3   69.7   78.2   63.8     74.3   81.8   68.1
                                 PGP     75.5   69.4   79.8   62.9     74.2   83.1   67.0


References

1. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
2. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
3. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
4. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: ECCV (2016)
5. Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. In: CVPR (2017)
6. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: CVPR (2017)
7. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI (2017)
8. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
9. Gross, S., Wilber, M.: Training and investigating residual nets. Facebook AI Research, CA. [Online]: http://torch.ch/blog/2016/02/04/resnets.html (2016)
10. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)
11. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
12. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: ECCV (2016)
13. Zeiler, M.D., Fergus, R.: Stochastic pooling for regularization of deep convolutional neural networks. In: ICLR (2013)
14. Graham, B.: Fractional max-pooling. arXiv preprint arXiv:1412.6071 (2014)
15. Lee, C.Y., Gallagher, P.W., Tu, Z.: Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In: AISTATS (2016)
16. Zhai, S., Wu, H., Kumar, A., Cheng, Y., Lu, Y., Zhang, Z., Feris, R.: S3Pool: Pooling with stochastic spatial sampling. In: CVPR (2017)
17. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
18. Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
19. Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A.v.d., Graves, A., Kavukcuoglu, K.: Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 (2016)
20. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)
21. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611 (2018)
22. Yang, Z., Hu, Z., Salakhutdinov, R., Berg-Kirkpatrick, T.: Improved variational autoencoders for text modeling using dilated convolutions. In: ICML (2017)
23. Hinton, G.E.: Learning multiple layers of representation. Trends in Cognitive Sciences 11(10) (2007) 428-434
24. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Machine Learning Systems, Volume 2011 (2011)
25. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3) (2015) 211-252
26. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
27. Ratner, A.J., Ehrenberg, H., Hussain, Z., Dunnmon, J., Ré, C.: Learning to compose domain-specific transformations for data augmentation. In: NIPS (2017)
28. Hauberg, S., Freifeld, O., Larsen, A.B.L., Fisher, J., Hansen, L.: Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In: AISTATS (2016)
29. Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. arXiv preprint arXiv:1708.04896 (2017)
30. DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
31. Yamada, Y., Iwamura, M., Kise, K.: ShakeDrop regularization. arXiv preprint arXiv:1802.02375 (2018)
32. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 (2018)
33. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: ICLR (2018)
34. Yuji, T., Yoshitaka, U., Tatsuya, H.: Between-class learning for image classification. In: CVPR (2018)
35. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. JAIR 16 (2002) 321-357
36. DeVries, T., Taylor, G.W.: Dataset augmentation in feature space. In: ICLR Workshop (2017)
37. Fawzi, A., Samulowitz, H., Turaga, D., Frossard, P.: Adaptive data augmentation for image classification. In: ICIP (2016)
38. Wang, X., Shrivastava, A., Gupta, A.: A-Fast-RCNN: Hard positive generation via adversary for object detection. In: CVPR (2017)
39. Lemley, J., Bazrafkan, S., Corcoran, P.: Smart augmentation: learning an optimal data augmentation strategy. IEEE Access (2017)
40. Dixit, M., Kwitt, R., Niethammer, M., Vasconcelos, N.: AGA: Attribute-guided augmentation. In: CVPR (2017)
41. Liu, B., Dixit, M., Kwitt, R., Vasconcelos, N.: Feature space transfer for data augmentation. In: CVPR (2018)
42. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
43. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. In: ICLR Workshop (2015)
44. Han, D., Kim, J., Kim, J.: Deep pyramidal residual networks. In: CVPR (2017)
45. Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with restarts. In: ICLR (2017)
46. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J.E., Weinberger, K.Q.: Snapshot ensembles: Train 1, get M for free. In: ICLR (2017)
47. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR (2018)
48. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV (2015)
49. Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: NIPS Workshop (2015)
50. Niitani, Y., Ogawa, T., Saito, S., Saito, M.: ChainerCV: a library for deep learning in computer vision. In: ACMMM (2017)
51. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In: NIPS (2017)
52. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: NUS-WIDE: a real-world web image database from National University of Singapore. In: ACM-CIVR (2009)
53. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
54. Zhu, F., Li, H., Ouyang, W., Yu, N., Wang, X.: Learning spatial regularization with image-level supervisions for multi-label image classification. In: CVPR (2017)

  • Parallel Grid Pooling for Data Augmentation
Page 2: arXiv:1803.11370v1 [cs.CV] 30 Mar 2018 · Parallel Grid Pooling for Data Augmentation Akito Takeki, Daiki Ikami, Go Iriey, and Kiyoharu Aizawa {takeki,iakmi,aizawa}@hal.t.u-tokyo.ac.jp,

2

Input

ℱ119896119904

Output

Convolution Pooling

stride 119904 kernel 119896 times 119896

119896

Input

Top-left Corner

Pooling

(Grid Pooling)

Grid

ℱ1198961 119970119904

(00)

Output

Convolution Pooling

stride 1 kernel 119896 times 119896

119896

(a) (b)

Fig 1 A two step view of a spatial operation (eg convolution or pooling) Fsk with a

kernel size of k times k and stride of stimes s (s ge 2) (a) is equivalent to (b)

resulting in an output feature that is s times smaller than the input feature

In this paper the second step downsampling operation G(00)s is referred to asgrid pooling (GP) for the coordinate (0 0) From this two-step example onecan see that the typical downsampling operation Fs

k utilizes only 1s2 of theintermediate output and discards the remaining 1minus 1s2

Motivated by the observation we propose a novel layer for CNNs called theparallel grid pooling (PGP) layer (Fig 2) PGP is utilized to perform down-sampling without discarding any intermediate feature and can be regarded as adata augmentation technique Specifically PGP splits the intermediate outputinto an stimess grid pattern and performs a grid pooling for each of the coordinates(i j) (0 le i le sminus 1 0 le j le sminus 1) in the grid This means that PGP transformsan input feature into s2 feature maps downsampled by 1s2 The layers follow-ing PGP compute s2 feature maps shifted by several pixels in parallel whichcan be interpreted as data augmentation in the feature space

Furthermore we demonstrate that dilated convolution [7171819] can nat-urally be decomposed into a convolution and PGP In general a dilated con-volution is considered as an efficient operation for spatial (and temporal) infor-mation and achieves the state-of-the-art performance for various tasks such assemantic segmentation [72021] speech recognition [18] and natural languageprocessing [1922] In this sense we provide a novel interpretation of the dilatedconvolution operation as a data augmentation technique in the feature space

The proposed PGP layer has several remarkable properties PGP can be im-plemented easily by inserting it after a multi-stride convolutional andor pool-ing layer without altering the original learning strategy PGP does not have anytrainable parameters and can even be utilized for test-time augmentation (ieeven if a CNN was not trained using PGP layers the performance can still beimproved by inserting PGP layers into the pretrained CNN at testing time)

We evaluate our method on three standard image classification benchmarksCIFAR-10 CIFAR-100 [23] SVHN [24] and ImageNet [25] Experimental resultsdemonstrate that PGP can improve the performance of recent state-of-the-artnetwork architectures and is complementary to widely used data augmentationtechniques (eg random flipping random cropping and random erasing)

The major contributions of this paper are as follows

3

ndash We propose PGP which performs downsampling without discarding any in-termediate feature information and can be regarded as a data augmentationtechnique PGP is easily applicable to various CNN models without alteringtheir learning strategies and the number of parameters

ndash We demonstrate that PGP is implicitly used in a dilated convolution opera-tion which suggests that dilated convolution can also be regarded as a typeof data augmentation technique

2 Related Works

PGP is most closely related to data augmentation In the context of traininga CNN data augmentation is a standard regularization technique that is usedto avoid overfitting and artificially enlarge the size of a training dataset Forexample applying crops flips and various affine transforms in a fixed orderor at random is a widely used technique in many computer vision tasks Dataaugmentation techniques can be divided into two categories data augmentationin the image space and data augmentation in the feature space

Data augmentation in the image space AlexNet [26] which achievesstate-of-the-art performance on the ILSVRC2012 dataset applies random flip-ping random cropping color perturbation and adding noise based on principalcomponent analysis TANDA [27] learns a generative sequence model that pro-duces data augmentation strategies using adversarial techniques on unlabeleddata The method in [28] proposes to train a class-conditional model of diffeo-morphisms by interpolating between nearest-neighbor data Recently trainingtechniques utilizing randomly masked images of rectangular regions such as ran-dom erasing [29] and cutout [30] have been proposed Using these augmentationtechniques recent methods have achieved state-of-the-art results for image clas-sification [3132] Mix-up [33] and BC-learning [34] generate between-class imagesby mixing two images belonging to different classes with a random ratio

Data augmentation in the feature space Some techniques generateaugmented training sets by interpolating [35] or extrapolating [36] features fromthe same class DAGAN [37] utilizes conditional generative adversarial net-works to conduct feature space data augmentation for one-shot learning A-Fast-RCNN [38] generates hard examples with occlusions and deformations using anadversarial learning strategy in feature space Smart augmentation [39] trains aneural network to combine training samples in order to generate augmented dataAs another type of feature space data augmentation AGA [40] and Fatten [41]perform augmentation in the feature space by using properties of training images(eg depth and pose)

While our PGP performs downsampling without discarding any intermediatefeature it splits the input feature map into branches which will be discussedin detail in Sec 31 The operations following PGP (eg convolution layersor fully connected layers) compute slightly shifted s2 feature maps which canbe interpreted as a novel data augmentation technique in the feature space Inaddition dilated convolution can be regarded as data augmentation techniquein the feature space which will be discussed in detail in Sec 32

4

Grid CNN

CNN

CNN

CNN

Intermediate

Feature Map

Parallel Pooling

Parallel Grid Pooling

(PGP) 119970119904(11)

119970119904(01)

119970119904(10)

119970119904(00)

Weight

Sharing

119909 ℱ1198961

Input

119907

Fig 2 An overview of Parallel Grid Pooling (PGP)

3 Proposed Method

31 Parallel Grid Pooling (PGP)

An overview of PGP is illustrated in Fig 2 Let x isin Rbtimesctimeshtimesw be an inputfeature map where b c h and w are the number of batches number of channelsheight and width respectively Fs

k is a spatial operation (eg a convolution or apooling) with a kernel size of k times k and stride of stimes s (s ge 2) The height andwidth of the output feature map are hs and ws respectivelyFs

k can be divided into two steps (Fig 1) In the first step v = F1k (x)

is performed which is an operation that makes full use of the input featuremap information with a stride of one producing an intermediate feature mapv Then v is split into ws times hs blocks of size s times s Let the coordinate ineach grid be (i j) (0 le i le s minus 1 0 le j le s minus 1) In the second step v isdownsampled by selecting the coordinate (0 0) (top-left corner) in each gridLet this downsampling operation Gs(ij) of selecting the coordinate (i j) in each

grid be called grid pooling (GP) To summarize given a kernel size of ktimes k andstride of stimes s (s ge 2) the multi-stride spatial operation layer is represented as

y = Fsk (x)

= Gs(00)(F1

k (x))

(1)

Our proposed PGP method is defined as(y(00) y(01) y(sminus1sminus1)

)=(Gs(00) (x) Gs(01) (x) Gs(sminus1sminus1) (x)

)= PGP (x) (2)

The conventional multi-stride operation uses y(00) which consists of only 1s2

of the intermediate output and discards the remaining 1 minus 1s2 of the featuremap In contrast PGP retains all possible choices of grid poolings meaning PGPmakes full use of the feature map without discarding the 1minus 1s2 portion of theintermediate output in the conventional method

5

A network architecture with a PGP layer can be viewed as one with s2 inter-nal branches by performing grid pooling in parallel Note that the weights (ieparameters) of the following network are shared between all branches hencethere is no additional parameter throughout the network PGP performs down-sampling while maintaining the spatial structure of the intermediate featuresthereby producing output features shifted by several pixels As described abovebecause the operations in each layer following PGP share the same weightsacross branches the subsequent layers in each branch are trained over a set ofslightly shifted feature maps simultaneously This works as data augmentationwith PGP the layers are trained with s2 times larger number of mini-batchescompared to the case without PGP

32 PGP vs Dilated Convolution

We here discuss the relationship between PGP and dilated convolution [720]Dilated convolution is a convolutional operation applied to an input with a de-fined gap Dilated convolution is considered to be an operation that efficientlyutilizes spatial and temporal information [718192042] For brevity let us con-sider one-dimensional signals Dilated convolution is defined as

y [i] =

Lsuml=1

x [i + rl]u [l] (3)

where x[i] is an input feature y[i] is an output feature u[l] is a convolution filterof kernel size L and r is a dilation rate A dilated convolution is equivalent toa regular convolution when r = 1 Let i = rj + k (0 le k le r minus 1) Then theoutput y [rj + k] is

y [rj + k] =

Lsuml=1

x [rj + k + rl]u [l] (4)

=

Lsuml=1

x [r (j + l) + k]u [l] (5)

This suggests that the output y [i] where i equiv k (mod r) can be obtained fromthe input x [i] | i equiv k (mod r) To summarize an r-dilated convolution isequivalent to an operation where the input is split into r branches and convo-lutions are performed with the same kernel u then spatially rearranged into asingle output

A dilated convolution in a two-dimensional space is illustrated in Fig 3 (incase r = 2) The r-dilated convolution downsamples an input feature with aninterval of r pixels then produces r2 branches as outputs Therefore the dilatedconvolution is equivalent to a following three-step operation PGP a convolu-tion sharing the same weight and inverse of PGP (PGPminus1) which rearranges thecells to their original positions For example let the size of the input features be

6

Conv

Conv

Conv

Conv

4 times (119887 119888 ℎ2 1199082)

(119887 119888 ℎ 119908) (119887 119888 ℎ 119908)

=

Dilated

Convolution

(119903 = 2) (119887 119888 ℎ 119908) (119887 119888 ℎ 119908) (119887 119888 ℎ 119908)

PGP PGPminus120783

Fig 3 PGP is implicitly used by dilated convolution

x = (b c h w) where b c h and w are the number of batches number of chan-nels height and width respectively Using PGP the input feature x is dividedinto r2times (b c hr wr) subsequently a convolutional filter (with shared weightparameters) is applied to each branch in parallel (Fig 3) Then the intermedi-ate r2 feature maps are rearranged by PGP resulting the same size (b c h w)as the input by PGPminus1 In short PGP is embedded in a dilated convolutionwhich suggests that the success of dilated convolution can be attributed to dataaugmentation in the feature space

33 Architectural Differences between dilated convolution and PGP

The structure of the network model with PGP is better for learning than thatwith dilated convolutions For example consider the cases where two downsam-pling layers in a base CNN (Base-CNN) are replaced by either dilated convolution(Fig 4(b)) or PGP (Fig 4(d)) According to [7] the conventional multi-stridedownsampling layer and all subsequent convolution layers can be replaced withdilated convolution layers with two times dilation rate (Fig 4(b)) As mentionedin Section 32 a dilated convolution is equivalent to (PGP + Convolution +PGPminus1) When n r-dilated convolutions are stacked they are expressed by(PGP + Convolution times n + PGPminus1) This means that each branch in a CNNwith dilated convolution (DConv-CNN) is independent from the others

Let us consider the case of Fig 4(b) that dilated convolutions with two dif-ferent dilation rates (eg r1 = 2 and r2 = 4) are stacked Using the fact thata sequence of n 4-dilated convolutions is (PGP2 + Convolution timesn + PGPminus2)this architecture can readily be transformed to the one given in Fig 4(c) whichis the equivalent form with PGP We can clearly see that each branch split bythe former dilation layer is split again by the latter and all the branches areeventually rearranged by PGPminus1 layers prior the global average pooling (GAP)

Fig. 4. Comparison between DConv-CNN and PGP-CNN: (a) base CNN (Base-CNN); (b) CNN with dilated convolution (DConv-CNN); (c) DConv-CNN represented by PGP; (d) CNN with PGP (PGP-CNN), whose weight-shared branches have the same structure as Base-CNN. As mentioned in Section 3.1, Base-CNN is expressed as the architecture using GP (a); (b) is equivalent to (c).

In contrast, CNN with PGP (PGP-CNN) is implemented as follows: the stride of the convolution or pooling layer that decreases resolution is set to 1, after which PGP is inserted (Fig. 4(d)). In DConv-CNN, all the branches are aggregated through PGP⁻¹ layers just before the GAP layer, which can be regarded as a feature ensemble, whereas PGP-CNN does not use any PGP⁻¹ layer throughout the network, so the intermediate features are not fused until the end of the network.

Unlike a feature ensemble (i.e., the case of DConv-CNN), this structure enforces all the branches to learn correct classification results. As mentioned above, DConv-CNN averages an ensemble of intermediate features; the likelihood of each class is then predicted based on the average. Even if some branches contain useful features (e.g., for image classification) while the other branches do not acquire such features, the CNN attempts to determine the correct class based on the average, which prevents the less useful branches from further learning. In contrast, thanks to the independence of the branches, PGP-CNN can perform additional learning with superior regularization.
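Because the layers after PGP share their weights across branches, the s² (here 2² = 4) shifted branches can be processed in one pass by stacking them along the batch axis, and each branch then receives its own loss term; no PGP⁻¹ or feature averaging occurs before the loss. A rough NumPy sketch of this idea (an illustration, not the paper's Chainer implementation; W denotes an assumed shared classifier matrix of shape (c, classes)):

    import numpy as np

    def pgp_fold_to_batch(x, r=2):
        # Stack the r*r shifted branches along the batch axis so that every
        # subsequent weight-shared layer processes all branches in one pass.
        branches = [x[:, :, i::r, j::r] for i in range(r) for j in range(r)]
        return np.concatenate(branches, axis=0)       # (r*r*b, c, h//r, w//r)

    def pgp_cnn_head(features, W):
        # features: (r*r*b, c, h', w') coming out of the weight-shared backbone.
        gap = features.mean(axis=(2, 3))              # global average pooling per branch
        return gap @ W                                # shared classifier, one logit set per branch

    # Each of the r*r*b rows then gets its own classification loss, so no PGP^-1
    # or feature averaging happens before the loss, unlike DConv-CNN.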


3.4 Weight transfer using PGP

Compared to DConv-CNN, PGP-CNN has a significant advantage in terms of weight transfer. Specifically, the weights learned on PGP-CNN work well on the Base-CNN structure when transferred. We demonstrate later in the experiments that this can improve the performance of the base architecture. However, DConv-CNN does not have the same structure as Base-CNN because of PGP⁻¹, and the replacement of downsampling by dilated convolutions makes it difficult to transfer weights from the pre-trained base model. The base CNN with multi-stride convolutions (the top of Fig. 5(a)) is represented by single-stride convolutions and grid pooling (GP) layers (the bottom of Fig. 5(a)). DConv-CNN (the top of Fig. 5(b)) is equivalent to the base CNN with PGP and PGP⁻¹ (the bottom of Fig. 5(b)). Focusing on the size of the input features of the 3 × 3 convolution, the size is (b, c1, h, w) in Base-CNN, whereas it is (b, c1, h/2, w/2) in DConv-CNN. This difference in resolution makes it impossible to directly transfer weights from the pre-trained model. Unlike DConv-CNN, PGP-CNN maintains the same structure as Base-CNN (Fig. 5(c)), which results in better transferability and better performance.

This weight transferability of PGP has two advantages. First, applying PGP to Base-CNN with learned weights in the test phase can improve recognition performance without retraining. For example, if one does not have sufficient computing resources to train PGP-CNN, using PGP only in the test phase can still make full use of the features discarded by downsampling in the training phase to achieve higher accuracy. Second, the weights learned by PGP-CNN can work well in Base-CNN. As an example, on an embedded system specialized to a particular CNN, one cannot change the calculations inside the CNN; however, the weights learned by PGP-CNN can result in superior performance even when used in Base-CNN at test time.
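Since PGP is parameter-free and leaves the layer structure of Base-CNN unchanged, the two models share an identical parameter set, so weights can be copied in either direction. A toy PyTorch-style sketch of this point (the paper's experiments used Chainer; the two-layer models below are purely illustrative, not the paper's architectures):

    import torch
    import torch.nn as nn

    class PGP(nn.Module):
        # Parameter-free: splits (b, c, h, w) into 4 shifted branches stacked on the batch axis.
        def forward(self, x):
            return torch.cat([x[:, :, i::2, j::2] for i in range(2) for j in range(2)], dim=0)

    def make_layers(stride):
        # Identical parameter shapes in both variants; only the downsampling differs.
        return nn.Conv2d(3, 16, 3, stride=stride, padding=1), nn.Linear(16, 10)

    class BaseCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv, self.fc = make_layers(stride=2)      # strided downsampling
        def forward(self, x):
            return self.fc(self.conv(x).mean(dim=(2, 3)))

    class PGPCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv, self.fc = make_layers(stride=1)      # stride 1 + PGP instead
            self.pgp = PGP()
        def forward(self, x):
            return self.fc(self.pgp(self.conv(x)).mean(dim=(2, 3)))

    base_cnn, pgp_cnn = BaseCNN(), PGPCNN()
    pgp_cnn.load_state_dict(base_cnn.state_dict())   # test-time PGP with Base-CNN weights
    base_cnn.load_state_dict(pgp_cnn.state_dict())   # deploy PGP-trained weights in Base-CNN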

4 Experimental Results

4.1 Image classification

Datasets. We evaluated PGP on the benchmark datasets CIFAR-10, CIFAR-100 [23], and SVHN [24]. Both CIFAR-10 and CIFAR-100 consist of 32 × 32 color images, with 50,000 images for training and 10,000 images for testing. The CIFAR-10 dataset has 10 classes, each containing 6,000 images; there are 5,000 training images and 1,000 testing images per class. The CIFAR-100 dataset has 100 classes, each containing 600 images; there are 500 training images and 100 testing images per class. We adopted a standard data augmentation scheme (mirroring/shifting) that is widely used for these two datasets [12,11]. For preprocessing, we normalized the data using the channel means and standard deviations. We used all 50,000 training images for training and calculated the final testing error after training.
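The mirroring/shifting scheme is typically implemented as a random horizontal flip plus a random crop from a zero-padded image; the 4-pixel padding in the sketch below is the common convention for CIFAR and is an assumption, since the exact setting is not restated here:

    import numpy as np

    def mirror_shift(img, pad=4):
        # img: (c, 32, 32). Random horizontal flip, then a random 32x32 crop
        # from the zero-padded (32 + 2*pad) x (32 + 2*pad) image.
        if np.random.rand() < 0.5:
            img = img[:, :, ::-1]
        c, h, w = img.shape
        padded = np.zeros((c, h + 2 * pad, w + 2 * pad), dtype=img.dtype)
        padded[:, pad:pad + h, pad:pad + w] = img
        top, left = np.random.randint(0, 2 * pad + 1, size=2)
        return padded[:, top:top + h, left:left + w]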

The Street View House Numbers (SVHN) dataset consists of 32 × 32 color digit images.

Fig. 5. The structure performing downsampling inside (a) Base-CNN, (b) DConv-CNN, and (c) PGP-CNN. This downsampling structure (3 × 3 convolution with stride of 2) is used in a derivative of ResNet [9], WideResNet [10], and ResNeXt [11].

There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 additional images for training. Following the method in [10,2], we used all the training data without any data augmentation.

Implementation details. We trained a pre-activation ResNet (PreResNet) [12], All-CNN [43], WideResNet [10], ResNeXt [11], PyramidNet [44], and DenseNet [2] from scratch. We used a 164-layer network for PreResNet. We used WRN-28-10, ResNeXt-29 (8×64d), PyramidNet-164 (bottleneck, α=48), and DenseNet-BC-100 (k=12) in the same manner as the original papers.

When comparing the base CNN (Base), the CNN with dilated convolutions (DConv), and the CNN with PGP (PGP), we utilized the same hyperparameter values (e.g., learning rate strategy) to carefully monitor the performance improvements. We adopted the cosine-shaped learning rate schedule [45], which smoothly anneals the learning rate [46,47]; the learning rate was initially set to 0.2 and gradually reduced to 0.002. Following [2], for CIFAR and SVHN we trained the networks using mini-batches of size 64 for 300 epochs and 40 epochs, respectively. All networks were optimized using stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay of 1.0 × 10⁻⁴. We adopted the weight initialization method introduced in [48]. We used the Chainer framework [49] and ChainerCV [50] for the implementation of all experiments.
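For reference, a cosine-shaped schedule of this kind (an SGDR-style half-cosine is assumed here) that starts at 0.2 and is annealed to 0.002 can be written as:

    import math

    def cosine_lr(epoch, total_epochs=300, lr_max=0.2, lr_min=0.002):
        # Smoothly anneal the learning rate from lr_max to lr_min over training.
        t = min(epoch / total_epochs, 1.0)
        return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))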

Table 1. Test errors (%) with various architectures on CIFAR-10, CIFAR-100, and SVHN. DConv: dilated convolution, PGP: parallel grid pooling.

Dataset    Network                   Params  Base         DConv        PGP
CIFAR-10   PreResNet-164             1.7M    4.71±0.01    4.15±0.12    3.77±0.09
           All-CNN                   1.4M    8.42±0.23    8.68±0.43    7.17±0.32
           WideResNet-28-10          36.5M   3.44±0.06    3.88±0.21    3.13±0.15
           ResNeXt-29 (8×64d)        34.4M   3.86±0.16    3.87±0.13    3.22±0.34
           PyramidNet-164 (α=28)     1.7M    3.91±0.15    3.72±0.07    3.38±0.09
           DenseNet-BC-100 (k=12)    0.8M    4.60±0.07    4.35±0.22    4.11±0.15
CIFAR-100  PreResNet-164             1.7M    22.68±0.27   20.71±0.27   20.18±0.45
           All-CNN                   1.4M    32.43±0.31   33.10±0.39   29.79±0.33
           WideResNet-28-10          36.5M   18.70±0.39   19.69±0.50   18.20±0.17
           ResNeXt-29 (8×64d)        34.4M   19.63±1.12   17.90±0.57   17.18±0.29
           PyramidNet-164 (α=28)     1.7M    19.65±0.24   19.10±0.32   18.12±0.16
           DenseNet-BC-100 (k=12)    0.8M    22.62±0.29   22.20±0.03   21.69±0.37
SVHN       PreResNet-164             1.7M    1.96±1.74    1.74±0.07    1.54±0.02
           All-CNN                   1.4M    1.94±0.06    1.77±0.07    1.75±0.02
           WideResNet-28-10          36.5M   1.81±0.03    1.53±0.02    1.38±0.03
           ResNeXt-29 (8×64d)        34.4M   1.81±0.02    1.66±0.03    1.52±0.01
           PyramidNet-164 (α=28)     1.7M    2.02±0.05    1.86±0.06    1.79±0.05
           DenseNet-BC-100 (k=12)    0.8M    1.97±0.12    1.77±0.06    1.67±0.07

Results. The results of applying DConv and PGP on CIFAR-10, CIFAR-100, and SVHN with different networks are listed in Table 1. In all cases, PGP outperformed Base and DConv.

Comparison to other data augmentation methods. We compare our method to random flipping (RF), random cropping (RC), and random erasing (RE) in Table 2. PGP achieves better performance than any other data augmentation method when applied alone. When applied in combination with other data augmentation methods, PGP works in a complementary manner. Combining RF, RC, RE, and PGP yields an error rate of 19.16%, which is an 11.74% improvement over the baseline without any augmentation methods.

Comparison to the classical ensemble method. We compare our method to the classical ensemble method in Table 3. On CIFAR-10 and SVHN, a single CNN with the PGP model achieves better performance than an ensemble of three basic CNN models. Furthermore, PGP works in a complementary manner when applied to the ensemble.

Weight transfer. The results of applying PGP as a test-time data augmentation method on CIFAR-10, CIFAR-100, and SVHN with different networks are listed in Table 4. PGP achieves better performance than the base model, except on the All-CNN and ResNeXt models. The reason for this incompatibility is left for future work.

Table 2. Test errors (%) with different data augmentation methods on CIFAR-100 based on PreResNet-164. RF: random flipping, RC: random cropping, RE: random erasing, PGP: parallel grid pooling.

RF  RC  RE    w/o PGP      w/ PGP
              30.90±0.32   22.68±0.27
X             26.40±0.18   20.70±0.34
    X         23.94±0.07   22.34±0.23
        X     27.52±0.22   21.32±0.21
X   X         22.68±0.27   20.18±0.45
X       X     24.39±0.55   20.01±0.14
    X   X     22.59±0.09   20.71±0.02
X   X   X     21.48±0.21   19.16±0.20

Table 3. Ensembled test errors (%) on CIFAR-10/100 and SVHN based on PreResNet-164. "×k" means an ensemble of k CNN models. DConv: dilated convolution, PGP: parallel grid pooling.

                       ×1                         ×3
Dataset      Base     DConv    PGP      Base     DConv    PGP
CIFAR-10     4.71     4.15     3.77     3.98     3.37     3.14
CIFAR-100    22.68    20.71    20.18    19.16    17.35    17.15
SVHN         1.96     1.74     1.54     1.65     1.51     1.39


The results of applying PGP as a training-time data augmentation technique on the CIFAR-10, CIFAR-100, and SVHN datasets with different networks are listed in Table 5. The models trained with PGP, which have the same structure as the base model at the test phase, outperformed both the base model and the model with dilated convolutions. The models trained with dilated convolutions performed worse than the baseline models.

4.2 ImageNet classification

We further evaluated PGP using the ImageNet 2012 classification dataset [25]. The ImageNet dataset comprises 1.28 million training images and 50,000 validation images from 1,000 classes. With the data augmentation scheme for training presented in [9,11,51], input images of size 224 × 224 were randomly cropped from a resized image using scale and aspect ratio augmentation. We reimplemented a derivative of ResNet [9] for training. According to the dilated network strategy [7,20], we used dilated convolutions or PGP after the conv4 stage. All networks were optimized using SGD with a momentum of 0.9 and weight decay of 1.0 × 10⁻⁴. The models were trained over 90 epochs using mini-batches of size 256 on eight GPUs.

Table 4. Test errors (%) when applying PGP as a test-time data augmentation with various architectures on CIFAR-10, CIFAR-100, and SVHN. PGP: parallel grid pooling.

Dataset    Network                   Params  Base         PGP
CIFAR-10   PreResNet-164             1.7M    4.71±0.01    4.56±0.14
           All-CNN                   1.4M    8.42±0.23    9.03±0.30
           WideResNet-28-10          36.5M   3.44±0.06    3.39±0.02
           ResNeXt-29 (8×64d)        34.4M   3.86±0.16    4.01±0.21
           PyramidNet-164 (α=28)     1.7M    3.91±0.15    3.82±0.05
           DenseNet-BC-100 (k=12)    0.8M    4.60±0.07    4.53±0.12
CIFAR-100  PreResNet-164             1.7M    22.68±0.27   22.19±0.26
           All-CNN                   1.4M    32.43±0.31   33.24±0.30
           WideResNet-28-10          36.5M   18.70±0.39   18.60±0.39
           ResNeXt-29 (8×64d)        34.4M   19.63±1.12   20.02±0.98
           PyramidNet-164 (α=28)     1.7M    19.65±0.24   19.34±0.28
           DenseNet-BC-100 (k=12)    0.8M    22.62±0.29   22.33±0.26
SVHN       PreResNet-164             1.7M    1.96±0.07    1.83±0.06
           All-CNN                   1.4M    1.94±0.06    3.86±0.05
           WideResNet-28-10          36.5M   1.81±0.03    1.77±0.03
           ResNeXt-29 (8×64d)        34.4M   1.81±0.02    2.47±0.51
           PyramidNet-164 (α=28)     1.7M    2.02±0.05    2.08±0.08
           DenseNet-BC-100 (k=12)    0.8M    1.97±0.12    1.89±0.08

We adopted the weight initialization method introduced in [48] and a cosine-shaped learning rate schedule, where the learning rate was initially set to 0.1 and gradually reduced to 1.0 × 10⁻⁴.

The results of applying dilated convolution and PGP on ImageNet with ResNet-50 and ResNet-101 are listed in Table 6. As can be seen from the results, the networks with PGP outperformed the baselines. PGP works in a complementary manner when applied in combination with 10-crop testing, and it can be used as a training-time or testing-time data augmentation technique.
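The 10-crop protocol referred to above is the standard one: the four corner crops and the center crop of the input, plus their horizontal flips, with predictions averaged over the ten crops. A minimal NumPy sketch of the usual procedure (an assumption, not the paper's evaluation code):

    import numpy as np

    def ten_crops(img, size=224):
        # img: (c, H, W) with H, W >= size. Returns 10 crops:
        # 4 corners + center, and the horizontal flip of each.
        c, H, W = img.shape
        offsets = [(0, 0), (0, W - size), (H - size, 0), (H - size, W - size),
                   ((H - size) // 2, (W - size) // 2)]
        crops = [img[:, top:top + size, left:left + size] for top, left in offsets]
        crops += [crop[:, :, ::-1] for crop in crops]
        return crops

    # Class probabilities from the model would then be averaged over the 10 crops.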

4.3 Multi-label classification

We further evaluated PGP for multi-label classification on the NUS-WIDE [52] and MS-COCO [53] datasets. NUS-WIDE contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated with 81 concepts, including objects and scenes. Following the method in [54], we used 161,789 images for training and 107,859 images for testing. MS-COCO contains 82,783 images for training and 40,504 images for testing, which are labeled with 80 common objects. The input images were resized to 256 × 256, randomly cropped, and then resized again to 224 × 224.

We trained a derivative of ResNet [9] with pre-training on the ImageNet dataset. We used the same pre-trained weights throughout all experiments for a fair comparison. All networks were optimized using SGD

Table 5. Test errors (%) when applying PGP as a training-time data augmentation technique with various architectures on CIFAR-10, CIFAR-100, and SVHN. DConv: dilated convolution, PGP: parallel grid pooling.

Dataset    Network                   Params  Base         DConv         PGP
CIFAR-10   PreResNet-164             1.7M    4.71±0.01    7.30±0.20     4.08±0.09
           All-CNN                   1.4M    8.42±0.23    38.77±0.85    7.30±0.31
           WideResNet-28-10          36.5M   3.44±0.06    7.90±0.90     3.30±0.13
           ResNeXt-29 (8×64d)        34.4M   3.86±0.16    16.91±0.45    3.36±0.27
           PyramidNet-164 (α=28)     1.7M    3.91±0.15    6.82±0.46     3.55±0.08
           DenseNet-BC-100 (k=12)    0.8M    4.60±0.07    7.03±0.70     4.36±0.10
CIFAR-100  PreResNet-164             1.7M    22.68±0.27   24.90±0.47    21.01±0.54
           All-CNN                   1.4M    32.43±0.31   64.27±2.21    30.26±0.44
           WideResNet-28-10          36.5M   18.70±0.39   26.90±0.80    18.56±0.21
           ResNeXt-29 (8×64d)        34.4M   19.63±1.12   44.57±13.78   17.67±0.31
           PyramidNet-164 (α=28)     1.7M    19.65±0.24   25.28±0.30    18.58±0.13
           DenseNet-BC-100 (k=12)    0.8M    22.62±0.29   27.56±0.78    22.46±0.16
SVHN       PreResNet-164             1.7M    1.96±1.74    3.21±0.32     1.67±0.04
           All-CNN                   1.4M    1.94±0.06    4.64±0.60     1.80±0.04
           WideResNet-28-10          36.5M   1.81±0.03    3.11±0.17     1.45±0.03
           ResNeXt-29 (8×64d)        34.4M   1.81±0.02    5.64±0.61     1.58±0.03
           PyramidNet-164 (α=28)     1.7M    2.02±0.05    3.23±0.10     1.90±0.09
           DenseNet-BC-100 (k=12)    0.8M    1.97±0.12    2.74±0.10     1.78±0.07

with a momentum of 0.9 and weight decay of 5.0 × 10⁻⁴. The models were trained for 16 and 30 epochs using mini-batches of 96 on four GPUs for NUS-WIDE and MS-COCO, respectively. We adopted the weight initialization method introduced in [48]. For fine-tuning, the learning rate was initially set to 1.0 × 10⁻³ and divided by 10 after the 8th and 12th epochs for NUS-WIDE and after the 15th and 22nd epochs for MS-COCO. We employed mean average precision (mAP) and macro/micro F-measure, precision, and recall for performance evaluation. If the confidence for a class was greater than 0.5, the label was predicted as positive.
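For clarity, the macro/micro precision, recall, and F-measure reported below can be computed from the thresholded confidences as in the following NumPy sketch, which assumes the usual definitions of these metrics:

    import numpy as np

    def multilabel_metrics(y_true, y_score, thr=0.5, eps=1e-12):
        # y_true: (n, num_labels) binary ground truth; y_score: (n, num_labels) confidences.
        y_pred = (y_score > thr).astype(np.int64)
        tp = (y_pred * y_true).sum(axis=0)
        fp = (y_pred * (1 - y_true)).sum(axis=0)
        fn = ((1 - y_pred) * y_true).sum(axis=0)

        # Macro metrics: per-label precision/recall/F1, averaged over the labels.
        p_c = tp / (tp + fp + eps)
        r_c = tp / (tp + fn + eps)
        f_c = 2 * p_c * r_c / (p_c + r_c + eps)
        macro = (f_c.mean(), p_c.mean(), r_c.mean())

        # Micro metrics: pool true/false positives over all labels first.
        micro_p = tp.sum() / (tp.sum() + fp.sum() + eps)
        micro_r = tp.sum() / (tp.sum() + fn.sum() + eps)
        micro = (2 * micro_p * micro_r / (micro_p + micro_r + eps), micro_p, micro_r)
        return {"macro (F, P, R)": macro, "micro (F, P, R)": micro}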

The results of applying dilated convolution and PGP on ResNet-50 and ResNet-101 are listed in Table 7. The networks with PGP achieved higher mAP compared to the others. In contrast, the networks with dilated convolutions achieved lower scores than the baseline, which suggests that the ImageNet pre-trained weights were not successfully transferred to the ResNet with dilated convolutions due to the difference in resolution, as mentioned in Sec. 3.4. This problem can be avoided by maintaining the dilation rate in the layers that perform downsampling, as in dilated residual networks (DRN) [17]. This operation allows the CNN with dilated convolutions to perform convolution at the same resolution as the base CNN. The DRN achieved better mAP scores than the base CNN and nearly the same scores as the CNN with PGP. The CNN with PGP was, however, superior to DRN in that it offers weight transferability.

Table 6. Classification error (%) for the ImageNet 2012 validation set. The error was evaluated with and without 10-crop testing for 224 × 224-pixel inputs. DConv: dilated convolution, PGP: parallel grid pooling.

                                             1 crop            10 crops
Network      Params  Train    Test     top-1    top-5     top-1    top-5
ResNet-50    25.6M   Base     Base     23.69    7.00      21.87    6.07
                     DConv    DConv    22.47    6.27      21.17    5.58
                     PGP      PGP      22.40    6.30      20.83    5.56
                     Base     PGP      23.32    6.85      21.62    5.93
                     DConv    Base     31.44    11.40     26.79    8.19
                     PGP      Base     23.01    6.66      21.22    5.74
ResNet-101   44.5M   Base     Base     22.49    6.38      20.85    5.50
                     DConv    DConv    21.26    5.61      20.03    5.02
                     PGP      PGP      21.34    5.65      19.81    5.00
                     Base     PGP      22.13    6.21      20.46    5.36
                     DConv    Base     25.63    8.01      22.40    6.05
                     PGP      Base     21.80    5.95      20.18    5.15

5 Conclusion

In this paper, we proposed PGP, which performs downsampling without discarding any intermediate features and can be regarded as a data augmentation technique. PGP is easily applicable to various CNN models without altering their learning strategies. Additionally, we demonstrated that PGP is implicitly used in dilated convolution, which suggests that dilated convolution is also a type of data augmentation technique. Experiments on CIFAR-10/100 and SVHN with six network models demonstrated that PGP outperformed the base CNN and the CNN with dilated convolutions. PGP also obtained better results than the base CNN and the CNN with dilated convolutions on the ImageNet dataset for image classification and on the NUS-WIDE and MS-COCO datasets for multi-label image classification.

Table 7. Quantitative results on NUS-WIDE and MS-COCO. DConv: dilated convolution, DRN: dilated residual networks, PGP: parallel grid pooling.

                                               Macro                  Micro
Dataset    Network      Params  Arch    mAP    F      P      R       F      P      R
NUS-WIDE   ResNet-50    25.6M   Base    55.7   52.1   63.3   47.0    70.2   75.3   65.8
                                DConv   55.3   51.7   63.3   46.6    70.0   75.7   65.2
                                DRN     55.9   52.2   43.8   47.0    70.1   75.8   65.3
                                PGP     56.1   51.9   64.4   46.2    70.0   76.4   64.6
           ResNet-101   44.5M   Base    56.5   53.0   63.6   48.3    70.4   75.4   66.1
                                DConv   56.3   53.4   62.7   49.2    70.5   74.9   66.5
                                DRN     56.8   53.1   63.1   48.9    70.5   75.1   66.5
                                PGP     57.2   53.5   64.6   48.7    70.6   75.7   66.2
MS-COCO    ResNet-50    25.6M   Base    73.4   67.5   78.2   60.9    72.8   82.2   65.4
                                DConv   72.7   66.6   77.9   59.7    72.2   81.9   64.5
                                DRN     74.2   68.3   78.8   61.8    73.4   82.5   66.1
                                PGP     74.2   67.9   80.0   60.5    73.1   83.7   64.9
           ResNet-101   44.5M   Base    74.6   68.9   77.6   62.8    73.8   82.2   66.9
                                DConv   74.0   68.3   78.0   62.0    73.2   81.6   66.4
                                DRN     75.3   69.7   78.2   63.8    74.3   81.8   68.1
                                PGP     75.5   69.4   79.8   62.9    74.2   83.1   67.0


References

1. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
2. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
3. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
4. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: ECCV (2016)
5. Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. In: CVPR (2017)
6. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: CVPR (2017)
7. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI (2017)
8. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
9. Gross, S., Wilber, M.: Training and investigating residual nets. Facebook AI Research. [Online] http://torch.ch/blog/2016/02/04/resnets.html (2016)
10. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)
11. Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
12. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: ECCV (2016)
13. Zeiler, M.D., Fergus, R.: Stochastic pooling for regularization of deep convolutional neural networks. In: ICLR (2013)
14. Graham, B.: Fractional max-pooling. arXiv preprint arXiv:1412.6071 (2014)
15. Lee, C.Y., Gallagher, P.W., Tu, Z.: Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In: AISTATS (2016)
16. Zhai, S., Wu, H., Kumar, A., Cheng, Y., Lu, Y., Zhang, Z., Feris, R.: S3Pool: Pooling with stochastic spatial sampling. In: CVPR (2017)
17. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
18. Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
19. Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A.v.d., Graves, A., Kavukcuoglu, K.: Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 (2016)
20. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)
21. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611 (2018)
22. Yang, Z., Hu, Z., Salakhutdinov, R., Berg-Kirkpatrick, T.: Improved variational autoencoders for text modeling using dilated convolutions. In: ICML (2017)
23. Hinton, G.E.: Learning multiple layers of representation. Trends in Cognitive Sciences 11(10) (2007) 428-434
24. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Machine Learning Systems, vol. 2011 (2011)
25. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3) (2015) 211-252
26. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
27. Ratner, A.J., Ehrenberg, H., Hussain, Z., Dunnmon, J., Re, C.: Learning to compose domain-specific transformations for data augmentation. In: NIPS (2017)
28. Hauberg, S., Freifeld, O., Larsen, A.B.L., Fisher, J., Hansen, L.: Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In: AISTATS (2016)
29. Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. arXiv preprint arXiv:1708.04896 (2017)
30. DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
31. Yamada, Y., Iwamura, M., Kise, K.: ShakeDrop regularization. arXiv preprint arXiv:1802.02375 (2018)
32. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 (2018)
33. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: ICLR (2018)
34. Yuji, T., Yoshitaka, U., Tatsuya, H.: Between-class learning for image classification. In: CVPR (2018)
35. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. JAIR 16 (2002) 321-357
36. DeVries, T., Taylor, G.W.: Dataset augmentation in feature space. ICLR Workshop (2017)
37. Fawzi, A., Samulowitz, H., Turaga, D., Frossard, P.: Adaptive data augmentation for image classification. In: ICIP (2016)
38. Wang, X., Shrivastava, A., Gupta, A.: A-Fast-RCNN: Hard positive generation via adversary for object detection. In: CVPR (2017)
39. Lemley, J., Bazrafkan, S., Corcoran, P.: Smart augmentation: learning an optimal data augmentation strategy. IEEE Access (2017)
40. Dixit, M., Kwitt, R., Niethammer, M., Vasconcelos, N.: AGA: Attribute-guided augmentation. In: CVPR (2017)
41. Liu, B., Dixit, M., Kwitt, R., Vasconcelos, N.: Feature space transfer for data augmentation. In: CVPR (2018)
42. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
43. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. In: ICLR Workshop (2015)
44. Han, D., Kim, J., Kim, J.: Deep pyramidal residual networks. In: CVPR (2017)
45. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with restarts. In: ICLR (2017)
46. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J.E., Weinberger, K.Q.: Snapshot ensembles: Train 1, get M for free. In: ICLR (2017)
47. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR (2018)
48. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV (2015)
49. Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: NIPS Workshop (2015)
50. Niitani, Y., Ogawa, T., Saito, S., Saito, M.: ChainerCV: a library for deep learning in computer vision. In: ACMMM (2017)
51. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In: NIPS (2017)
52. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: NUS-WIDE: a real-world web image database from National University of Singapore. In: ACM-CIVR (2009)
53. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
54. Zhu, F., Li, H., Ouyang, W., Yu, N., Wang, X.: Learning spatial regularization with image-level supervisions for multi-label image classification. In: CVPR (2017)

We here discuss the relationship between PGP and dilated convolution [720]Dilated convolution is a convolutional operation applied to an input with a de-fined gap Dilated convolution is considered to be an operation that efficientlyutilizes spatial and temporal information [718192042] For brevity let us con-sider one-dimensional signals Dilated convolution is defined as

y [i] =

Lsuml=1

x [i + rl]u [l] (3)

where x[i] is an input feature y[i] is an output feature u[l] is a convolution filterof kernel size L and r is a dilation rate A dilated convolution is equivalent toa regular convolution when r = 1 Let i = rj + k (0 le k le r minus 1) Then theoutput y [rj + k] is

y [rj + k] =

Lsuml=1

x [rj + k + rl]u [l] (4)

=

Lsuml=1

x [r (j + l) + k]u [l] (5)

This suggests that the output y [i] where i equiv k (mod r) can be obtained fromthe input x [i] | i equiv k (mod r) To summarize an r-dilated convolution isequivalent to an operation where the input is split into r branches and convo-lutions are performed with the same kernel u then spatially rearranged into asingle output

A dilated convolution in a two-dimensional space is illustrated in Fig 3 (incase r = 2) The r-dilated convolution downsamples an input feature with aninterval of r pixels then produces r2 branches as outputs Therefore the dilatedconvolution is equivalent to a following three-step operation PGP a convolu-tion sharing the same weight and inverse of PGP (PGPminus1) which rearranges thecells to their original positions For example let the size of the input features be

6

Conv

Conv

Conv

Conv

4 times (119887 119888 ℎ2 1199082)

(119887 119888 ℎ 119908) (119887 119888 ℎ 119908)

=

Dilated

Convolution

(119903 = 2) (119887 119888 ℎ 119908) (119887 119888 ℎ 119908) (119887 119888 ℎ 119908)

PGP PGPminus120783

Fig 3 PGP is implicitly used by dilated convolution

x = (b c h w) where b c h and w are the number of batches number of chan-nels height and width respectively Using PGP the input feature x is dividedinto r2times (b c hr wr) subsequently a convolutional filter (with shared weightparameters) is applied to each branch in parallel (Fig 3) Then the intermedi-ate r2 feature maps are rearranged by PGP resulting the same size (b c h w)as the input by PGPminus1 In short PGP is embedded in a dilated convolutionwhich suggests that the success of dilated convolution can be attributed to dataaugmentation in the feature space

33 Architectural Differences between dilated convolution and PGP

The structure of the network model with PGP is better for learning than thatwith dilated convolutions For example consider the cases where two downsam-pling layers in a base CNN (Base-CNN) are replaced by either dilated convolution(Fig 4(b)) or PGP (Fig 4(d)) According to [7] the conventional multi-stridedownsampling layer and all subsequent convolution layers can be replaced withdilated convolution layers with two times dilation rate (Fig 4(b)) As mentionedin Section 32 a dilated convolution is equivalent to (PGP + Convolution +PGPminus1) When n r-dilated convolutions are stacked they are expressed by(PGP + Convolution times n + PGPminus1) This means that each branch in a CNNwith dilated convolution (DConv-CNN) is independent from the others

Let us consider the case of Fig 4(b) that dilated convolutions with two dif-ferent dilation rates (eg r1 = 2 and r2 = 4) are stacked Using the fact thata sequence of n 4-dilated convolutions is (PGP2 + Convolution timesn + PGPminus2)this architecture can readily be transformed to the one given in Fig 4(c) whichis the equivalent form with PGP We can clearly see that each branch split bythe former dilation layer is split again by the latter and all the branches areeventually rearranged by PGPminus1 layers prior the global average pooling (GAP)

7

CNN Image GP CNN GP CNN GAP FC Loss

(a) Base CNN (Base-CNN)

CNN GAP FC Image Loss

Dilated

CNN

(119903 = 2)

Dilated

CNN

(119903 = 4)

(b) CNN with dilated convolution (DConv-CNN)

CNN

CNN

PGP CNN

CNN

CNN PGP

PGP

PGP

PGP

GAP FC Image PGP -1

Loss

PGP

PGP

PGP

PGP -1

-1

-1

-1

(c) DConv-CNN represented by PGP

Loss

Weight

sharing

CNN

CNN

PGP CNN

CNN

CNN PGP

PGP

PGP

PGP

Image

The same structure as

Base-CNN

(d) CNN with PGP (PGP-CNN)

Fig 4 Comparison between DConv-CNN and PGP-CNN As mentioned in Section 31Base-CNN is expressed as the architecture using GP (a) (b) is equivalent to (c)

layer In contrast CNN with PGP (PGP-CNN) is implemented as follows Theconvolution or pooling layer that decreases resolution is set to 1 stride of thatlayer after which PGP is inserted (Fig 4(d)) In DConv-CNN all the branchesare aggregated through PGPminus1 layers just before the GAP layer which can beregarded as a feature ensemble while PGP-CNN does not use any PGPminus1 layerthroughout the network so the intermediate features are not fused until the endof the network

Unlike feature ensemble (ie the case of DConv-CNN) this structure en-forces all the branches to learn correct classification results As mentioned aboveDConv-CNN averages an ensemble of intermediate feature The likelihood ofeach class is then predicted based on the average Even if some branches containuseful features (eg for image classification) while the other branches do notacquire the CNN attempts to determine the correct class based on the averagewhich prevents the useless branches from further learning In contrast thanks tothe independency of branches PGP-CNN can perform additional learning withsuperior regularization

8

34 Weight transfer using PGP

Compared to DConv-CNN PGP-CNN has a significant advantage in terms ofweight transfer Specifically the weights learned on PGP-CNN work well onBase-CNN structure when transferred We demonstrate later in the experimentsthat this can improve performance of the base architecture However DConv-CNN does not have the same structure as Base-CNN because of PGPminus1 Thereplacement of downsampling by dilated convolutions makes it difficult to trans-fer weights from the pre-trained base model The base CNN with multi-strideconvolutions (the top of Fig 5 (a)) is represented by single-stride convolutionsand grid pooling (GP) layers (the bottom of Fig 5 (a)) DConv-CNN (the topof Fig 5 (b)) is equivalent to the base CNN with PGP and PGPminus1 (the bottomof Fig 5 (b)) Focusing on the difference in the size of the input features of a3times 3 convolution the size is (b c1 h w) in Base-CNN whereas (b c1 h2 w2)in DConv-CNN This difference in resolution makes it impossible to directlytransfer weights from the pre-trained model Unlike DConv-CNN PGP-CNNmaintains the same structure as Base-CNN (Fig 5 (c)) which results in bettertransferability and better performance

This weight transferability of PGP has two advantages First applying PGPto Base-CNN with learned weights in test phase can improve recognition perfor-mance without retraining For example if one does not have sufficient computingresources to train PGP-CNN using PGP only in the test phase can still make fulluse of the features discarded by downsampling in the training phase to achievehigher accuracy Second the weights learned by PGP-CNN can work well inBase-CNN As an example on an embedded system specialized to a particularCNN one can not change calculations inside the CNN However the weightslearned by on PGP-CNN can result in superior performance even when used inBase-CNN at test time

4 Experimental Results

41 Image classification

Datasets We evaluated PGP on benchmark datasets CIFAR-10 CIFAR-100 [23]and SVHN [24] Both CIFAR-10 and CIFAR-100 consist of 32times32 color imageswith 50000 images for training and 10000 images for testing The CIFAR-10dataset has 10 classes each containing 6000 images There are 5000 trainingimages and 1000 testing images for each class The CIFAR-100 dataset has 100classes each containing 600 images There are 500 training images and 100 test-ing images for each class We adopted a standard data augmentation scheme(mirroringshifting) that is widely used for these two datasets [1211] For pre-processing we normalized the data using the channel means and standard devi-ations We used all 50000 training images for training and calculated the finaltesting error after training

The Street View House Numbers (SVHN) dataset consists of 32 times 32 colordigit image There are 73257 images in the training set 26032 images in the test

9

Conv 1x1 str 1x1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 2x2

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 2x2

Conv 1x1 str 1x1

GP

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 1x1

(119887 1198882 ℎ 119908)

(119887 1198882 ℎ2 1199082)

(119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

GP

(119887 1198883 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Dconv 3x3 str 1x1

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ 119908)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

PGPminus1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

PGP

Conv 3x3 str 1x1

4 times (119887 1198881 ℎ2 1199082)

4 times (119887 1198882 ℎ2 1199082)

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ 119908)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

PGP

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 1x1

(119887 1198882 ℎ 119908)

4 times (119887 1198882 ℎ2 1199082)

4 times (119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

PGP

(119887 1198883 ℎ 119908)

(a) Base-CNN (b) DConv-CNN (c) PGP-CNN

Fig 5 The structure performing downsampling inside Base-CNN DConv-CNN andPGP-CNN This downsampling structure (3 times 3 convolution with stride of 2) is usedin a derivative of ResNet [9] WideResNet [10] and ResNeXt [11]

set and 531131 additional images for training Following the method in [102]we used all the training data without any data augmentation

Implementation details We trained a pre-activation ResNet (PreResNet) [12]All-CNN [43] WideResNet [10] ResNeXt [11] PyramidNet [44] and DenseNet [2]from scratch We use a 164-layer network for PreResNet We used WRN-28-10ResNeXt-29 (8x64d) PyramidNet-164 (Bottleneck alpha=48) and DenseNet-BC-100 (k=12) in the same manner as the original papers

When comparing the base CNN (Base) the CNN with dilated convolutions(DConv) and the CNN with PGP (PGP) we utilized the same hyperparametervalues (eg learning rate strategy) to carefully monitor the performance improve-ments We adopted the cosine shape learning rate schedule [45] which smoothlyanneals the learning rate [4647] The learning rate was initially set to 02 andgradually reduced to 0002 Following [2] for CIFAR and SVHN we trainedthe networks using mini-batches of size 64 for 300 epochs and 40 epochs re-spectively All networks were optimized using stochastic gradient descent (SGD)with a momentum of 09 and weight decay of 10times10minus4 We adopted the weightinitialization method introduced in [48] We used the Chainer framework [49]and ChainerCV [50] for the implementation of all experiments

10

Table 1 Test errors () with various architectures on CIFAR-10 CIFAR-100 andSVHN DConv dilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 415plusmn012 377plusmn009All-CNN 14M 842plusmn023 868plusmn043 717plusmn032WideResNet-28-10 365M 344plusmn006 388plusmn021 313plusmn015ResNeXt-29 (8times64d) 344M 386plusmn016 387plusmn013 322plusmn034PyramidNet-164 (α=28) 17M 391plusmn015 372plusmn007 338plusmn009DenseNet-BC-100 (k=12) 08M 460plusmn007 435plusmn022 411plusmn015

CIFAR-100

PreResNet-164 17M 2268plusmn027 2071plusmn027 2018plusmn045All-CNN 14M 3243plusmn031 3310plusmn039 2979plusmn033WideResNet-28-10 365M 1870plusmn039 1969plusmn050 1820plusmn017ResNeXt-29 (8times64d) 344M 1963plusmn112 1790plusmn057 1718plusmn029PyramidNet-164 (α=28) 17M 1965plusmn024 1910plusmn032 1812plusmn016DenseNet-BC-100 (k=12) 08M 2262plusmn029 2220plusmn003 2169plusmn037

SVHN

PreResNet-164 17M 196plusmn174 174plusmn007 154plusmn002All-CNN 14M 194plusmn006 177plusmn007 175plusmn002WideResNet-28-10 365M 181plusmn003 153plusmn002 138plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 166plusmn003 152plusmn001PyramidNet-164 (α=28) 17M 202plusmn005 186plusmn006 179plusmn005DenseNet-BC-100 (k=12) 08M 197plusmn012 177plusmn006 167plusmn007

Results The results of applying DConv and PGP on CIFAR-10 CIFAR-100and SVHN with different networks are listed in Table 1 In all cases PGP out-performed Base and DConv

Comparison to other data augmentation methods We compare our methodto random flipping (RF) random cropping (RC) and random erasing (RE) inTable 2 PGP achieves better performance than any other data augmentationmethods when applied alone When applied in combination with other data aug-mentation methods PGP works in a complementary manner Combining RFRC RE and PGP yields an error rate of 1916 which is 1174 improvementover the baseline without any augmentation methods

Comparison to classical ensemble method We compare our method to theclassical ensemble method in Table 3 On CIFAR-10 and SVHN a single CNNwith the PGP model achieves better performance than an ensemble of threebasic CNN models Furthermore PGP works in a complementary manner whenapplied to the ensemble

Weight transfer The results of applying PGP as a test-time data augmentationmethod on CIFAR-10 CIFAR-100 and SVHN on different networks are listedin Table 4 PGP achieves better performance than the base model except for

11

Table 2 Test errors () with different data augmentation methods on CIFAR-100based on PreResNet-164 RF random flipping RC random cropping RE randomerasing PGP parallel grid pooling

Network RF RC RE wo PGP w PGP

PreResNet-164

3090plusmn032 2268plusmn027X 2640plusmn018 2070plusmn034

X 2394plusmn007 2234plusmn023X 2752plusmn022 2132plusmn021

X X 2268plusmn027 2018plusmn045X X 2439plusmn055 2001plusmn014

X X 2259plusmn009 2071plusmn002X X X 2148plusmn021 1916plusmn020

Table 3 Ensembled test errors () on CIFAR-10100 and SVHN based on PreResNet-164 ldquotimeskrdquo means an ensembling of k CNN models DConv dilated convolution PGPparallel grid pooling

Dataset Networktimes1 times3

Base DConv PGP Base DConv PGP

CIFAR-10PreResNet-164

471 415 377 398 337 314CIFAR-100 2268 2071 2018 1916 1735 1715SVHN 196 174 154 165 151 139

on the ALL-CNN and ResNeXt models The reason of the incompatibility is leftfor future work

The results of applying PGP as a training-time data augmentation techniqueon the CIFAR-10 CIFAR-100 and SVHN datasets with different networks arelisted in Table 5 The models trained with PGP which have the same structureto the base model at the test phase outperformed the base model and themodel with dilated convolutions The models trained with dilated convolutionsperformed worse compared to the baseline models

42 ImageNet classification

We further evaluated PGP using the ImageNet 2012 classification dataset [25]The ImageNet dataset is comprised of 128 million training images and 50000validation images from 1000 classes With the data augmentation scheme fortraining presented in [91151] input images of size 224 times 224 were randomlycropped from a resized image using scale and aspect ratio augmentation Wereimplemented a derivative of ResNet [9] for training According to the dilatednetwork strategy [720] we used dilated convolutions or PGP after the conv4stage All networks were optimized using SGD with a momentum of 09 andweight decay of 10times 10minus4 The models were trained over 90 epochs using mini-batches of size 256 on eight GPUs We adopted the weight initialization method

12

Table 4 Test errors () when applying PGP as a test-time data augmentation withvarious architectures on CIFAR-10 CIFAR-100 and SVHN PGP parallel grid pool-ing

Dataset Network Params Base PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 456plusmn014All-CNN 14M 842plusmn023 903plusmn030WideResNet-28-10 365M 344plusmn006 339plusmn002ResNeXt-29 (8times64d) 344M 386plusmn016 401plusmn021PyramidNet-164 (α=28) 17M 391plusmn015 382plusmn005DenseNet-BC-100 (k=12) 08M 460plusmn007 453plusmn012

CIFAR-100

PreResNet-164 17M 2268plusmn027 2219plusmn026All-CNN 14M 3243plusmn031 3324plusmn030WideResNet-28-10 365M 1870plusmn039 1860plusmn039ResNeXt-29 (8times64d) 344M 1963plusmn112 2002plusmn098PyramidNet-164 (α=28) 17M 1965plusmn024 1934plusmn028DenseNet-BC-100 (k=12) 08M 2262plusmn029 2233plusmn026

SVHN

PreResNet-164 17M 196plusmn007 183plusmn006All-CNN 14M 194plusmn006 386plusmn005WideResNet-28-10 365M 181plusmn003 177plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 247plusmn051PyramidNet-164 (α=28) 17M 202plusmn005 208plusmn008DenseNet-BC-100 (k=12) 08M 197plusmn012 189plusmn008

introduced in [48] We adopted a cosine-shaped learning rate schedule wherethe learning rate was initially set to 01 and gradually reduced to 10times 10minus4

The results when applying dilated convolution and PGP on ImageNet withResNet-50 and ResNet-101 are listed in Table 6 As can be seen in the resultsthe network with PGP outperformed the baseline performance PGP works in acomplementary manner when applied in combination with 10-crop testing andcan be used as a training-time or testing-time data augmentation technique

43 Multi-label classification

We further evaluated PGP for multi-label classification on the NUS-WIDE [52]and MS-COCO [53] datasets NUS-WIDE contains 269648 images with a totalof 5018 tags collected from Flickr These images are manually annotated with81 concepts including objects and scenes Following the method in [54] we used161789 images for training and 107859 images for testing MS-COCO contains82783 images for training and 40504 images for testing which are labeled as 80common objects The input images were resized to 256times256 randomly croppedand then resized again to 224times 224

We trained a derivative of ResNet [9] with pre-training using the ImageNetdataset We used the same pre-trained weights throughout all experiments forfair comparison All networks were optimized using SGD with a momentum of

13

Table 5 Test errors () when applying PGP as a training-time data augmentationtechnique with various architectures on CIFAR-10 CIFAR-100 and SVHN DConvdilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 730plusmn020 408plusmn009All-CNN 14M 842plusmn023 3877plusmn085 730plusmn031WideResNet-28-10 365M 344plusmn006 790plusmn090 330plusmn013ResNeXt-29 (8times64d) 344M 386plusmn016 1691plusmn045 336plusmn027PyramidNet-164 (α=28) 17M 391plusmn015 682plusmn046 355plusmn008DenseNet-BC-100 (k=12) 08M 460plusmn007 703plusmn070 436plusmn010

CIFAR-100

PreResNet-164 17M 2268plusmn027 2490plusmn047 2101plusmn054All-CNN 14M 3243plusmn031 6427plusmn221 3026plusmn044WideResNet-28-10 365M 1870plusmn039 2690plusmn080 1856plusmn021ResNeXt-29 (8times64d) 344M 1963plusmn112 4457plusmn1378 1767plusmn031PyramidNet-164 (α=28) 17M 1965plusmn024 2528plusmn030 1858plusmn013DenseNet-BC-100 (k=12) 08M 2262plusmn029 2756plusmn078 2246plusmn016

SVHN

PreResNet-164 17M 196plusmn174 321plusmn032 167plusmn004All-CNN 14M 194plusmn006 464plusmn060 180plusmn004WideResNet-28-10 365M 181plusmn003 311plusmn017 145plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 564plusmn061 158plusmn003PyramidNet-164 (α=28) 17M 202plusmn005 323plusmn010 190plusmn009DenseNet-BC-100 (k=12) 08M 197plusmn012 274plusmn010 178plusmn007

09 and weight decay of 50 times 10minus4 The models were trained for 16 and 30epochs using mini-batches of 96 on four GPUs for NUS-WIDE and MS-COCOrespectively We adopted the weight initialization method introduced in [48] Forfine tuning the learning rate was initially set to 10 times 10minus3 and divided by 10after the 8th and 12th epoch for NUS-WIDE and after the 15th and 22nd epochfor MS-COCO We employed mean average precision (mAP) macromicro F-measure precision and recall for performance evaluation If the confidences foreach class were greater than 05 the label were predicted as positive

The results when applying dilated convolution and PGP on ResNet-50 andResNet-101 are listed in Table 7 The networks with PGP achieved higher mAPcompared to the others In contrast the networks with dilated convolutionsachieved lower scores than the baseline which suggests that ImageNet pre-trained weights were not successfully transferred to ResNet with dilated convo-lutions due to the difference in resolution as we mentioned in Sec 34 This prob-lem can be avoided by maintaining the dilation rate in the layers that performdownsampling like the method used in a dilated residual networks (DRN) [17]The operation allows the CNN with dilated convolutions to perform convolu-tion with the same resolution as the base CNN The DRN achieved better mAPscores than the base CNN and nearly the same score as the CNN with PGPThe CNN with PGP was however superior to DRN in that the CNN with PGPhas the weight transferability

14

Table 6 Classification error () for the ImageNet 2012 validation set The error wasevaluated with and without 10-crop testing for 224times224-pixel inputs DConv dilatedconvolution PGP parallel grid pooling

Network Params Train Test1 crop 10 crops

top-1 top-5 top-1 top-5

ResNet-50 256M

Base Base 2369 700 2187 607DConv DConv 2247 627 2117 558PGP PGP 2240 630 2083 556Base PGP 2332 685 2162 593

DConv Base 3144 1140 2679 819PGP Base 2301 666 2122 574

ResNet-101 445M

Base Base 2249 638 2085 550DConv DConv 2126 561 2003 502PGP PGP 2134 565 1981 500Base PGP 2213 621 2046 536

DConv Base 2563 801 2240 605PGP Base 2180 595 2018 515



[Fig. 2. An overview of Parallel Grid Pooling (PGP). An input $x$ passes through $\mathcal{F}^1_k$ to produce an intermediate feature map $v$; the parallel grid poolings $\mathcal{G}^s_{(0,0)}$, $\mathcal{G}^s_{(0,1)}$, $\mathcal{G}^s_{(1,0)}$, and $\mathcal{G}^s_{(1,1)}$ downsample $v$ into branches that are processed by CNNs with shared weights.]

3 Proposed Method

3.1 Parallel Grid Pooling (PGP)

An overview of PGP is illustrated in Fig. 2. Let $x \in \mathbb{R}^{b \times c \times h \times w}$ be an input feature map, where $b$, $c$, $h$, and $w$ are the number of batches, number of channels, height, and width, respectively. $\mathcal{F}^s_k$ is a spatial operation (e.g., a convolution or a pooling) with a kernel size of $k \times k$ and a stride of $s \times s$ ($s \ge 2$). The height and width of the output feature map are $h/s$ and $w/s$, respectively.

$\mathcal{F}^s_k$ can be divided into two steps (Fig. 1). In the first step, $v = \mathcal{F}^1_k(x)$ is performed, which is an operation that makes full use of the input feature map information with a stride of one, producing an intermediate feature map $v$. Then $v$ is split into $w/s \times h/s$ blocks of size $s \times s$. Let the coordinate in each grid be $(i, j)$ $(0 \le i \le s-1,\ 0 \le j \le s-1)$. In the second step, $v$ is downsampled by selecting the coordinate $(0, 0)$ (top-left corner) in each grid. Let this downsampling operation $\mathcal{G}^s_{(i,j)}$ of selecting the coordinate $(i, j)$ in each grid be called grid pooling (GP). To summarize, given a kernel size of $k \times k$ and a stride of $s \times s$ ($s \ge 2$), the multi-stride spatial operation layer is represented as

$$y = \mathcal{F}^s_k(x) = \mathcal{G}^s_{(0,0)}\bigl(\mathcal{F}^1_k(x)\bigr). \qquad (1)$$

Our proposed PGP method is defined as

$$\bigl(y_{(0,0)}, y_{(0,1)}, \dots, y_{(s-1,s-1)}\bigr) = \bigl(\mathcal{G}^s_{(0,0)}(x),\ \mathcal{G}^s_{(0,1)}(x),\ \dots,\ \mathcal{G}^s_{(s-1,s-1)}(x)\bigr) = \mathrm{PGP}(x). \qquad (2)$$

The conventional multi-stride operation uses $y_{(0,0)}$, which consists of only $1/s^2$ of the intermediate output, and discards the remaining $1 - 1/s^2$ of the feature map. In contrast, PGP retains all possible choices of grid pooling, meaning that PGP makes full use of the feature map without discarding the $1 - 1/s^2$ portion of the intermediate output that the conventional method throws away.
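To make the operations in Eqs. (1) and (2) concrete, here is a minimal NumPy sketch of grid pooling and PGP (a toy illustration that assumes $h$ and $w$ are divisible by $s$; the helper names are ours, not those of the authors' released Chainer code):

```python
import numpy as np

def grid_pool(v, s, i, j):
    # G^s_(i,j): keep the (i, j) element of every s x s block of the
    # intermediate feature map v of shape (b, c, h, w).
    return v[:, :, i::s, j::s]

def parallel_grid_pooling(v, s=2):
    # PGP: return all s**2 shifted downsampled maps instead of only (0, 0).
    return [grid_pool(v, s, i, j) for i in range(s) for j in range(s)]

v = np.arange(16, dtype=np.float32).reshape(1, 1, 4, 4)  # toy map: b = c = 1, h = w = 4
branches = parallel_grid_pooling(v, s=2)
print(len(branches), branches[0].shape)  # 4 branches of shape (1, 1, 2, 2)
# branches[0] is the conventional multi-stride output y = G^s_(0,0)(F^1_k(x)) of Eq. (1).
```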

5

A network architecture with a PGP layer can be viewed as one with s2 inter-nal branches by performing grid pooling in parallel Note that the weights (ieparameters) of the following network are shared between all branches hencethere is no additional parameter throughout the network PGP performs down-sampling while maintaining the spatial structure of the intermediate featuresthereby producing output features shifted by several pixels As described abovebecause the operations in each layer following PGP share the same weightsacross branches the subsequent layers in each branch are trained over a set ofslightly shifted feature maps simultaneously This works as data augmentationwith PGP the layers are trained with s2 times larger number of mini-batchescompared to the case without PGP
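One way to picture this augmentation effect is to fold the $s^2$ branches into the batch axis before the weight-shared layers, which is exactly where the $s^2$-fold increase in effective mini-batch size comes from. The short sketch below (our illustration, not the authors' implementation) shows the resulting shapes:

```python
import numpy as np

s = 2
b, c, h, w = 8, 16, 32, 32
v = np.random.randn(b, c, h, w).astype(np.float32)

# Stack the s**2 shifted maps along the batch axis: the weight-shared layers
# that follow then effectively see s**2 * b samples per mini-batch.
branches = [v[:, :, i::s, j::s] for i in range(s) for j in range(s)]
folded = np.concatenate(branches, axis=0)
print(folded.shape)  # (32, 16, 16, 16) = (s**2 * b, c, h // s, w // s)
```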

3.2 PGP vs. Dilated Convolution

We here discuss the relationship between PGP and dilated convolution [7,20]. Dilated convolution is a convolutional operation applied to an input with a defined gap, and it is considered to be an operation that efficiently utilizes spatial and temporal information [7,18,19,20,42]. For brevity, let us consider one-dimensional signals. Dilated convolution is defined as

$$y[i] = \sum_{l=1}^{L} x[i + rl]\, u[l], \qquad (3)$$

where $x[i]$ is an input feature, $y[i]$ is an output feature, $u[l]$ is a convolution filter of kernel size $L$, and $r$ is a dilation rate. A dilated convolution is equivalent to a regular convolution when $r = 1$. Let $i = rj + k$ $(0 \le k \le r - 1)$. Then the output $y[rj + k]$ is

$$y[rj + k] = \sum_{l=1}^{L} x[rj + k + rl]\, u[l] \qquad (4)$$
$$\phantom{y[rj + k]} = \sum_{l=1}^{L} x[r(j + l) + k]\, u[l]. \qquad (5)$$

This suggests that the outputs $\{y[i] \mid i \equiv k \pmod{r}\}$ can be obtained from the inputs $\{x[i] \mid i \equiv k \pmod{r}\}$. To summarize, an $r$-dilated convolution is equivalent to an operation in which the input is split into $r$ branches, convolutions are performed with the same kernel $u$, and the results are then spatially rearranged into a single output.
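The 1-D argument in Eqs. (3)-(5) can be checked numerically. The sketch below uses valid positions only and hypothetical helper names, so it is an illustration of the equivalence rather than a reference implementation:

```python
import numpy as np

def dilated_conv1d(x, u, r):
    # Eq. (3): y[i] = sum_{l=1..L} x[i + r*l] * u[l], at positions where all taps exist.
    L, n = len(u), len(x) - r * len(u)
    return np.array([np.dot(x[i + r : i + r * L + 1 : r], u) for i in range(n)])

def branch_conv1d(x, u, r):
    # Eqs. (4)-(5): split x into r strided branches, convolve each branch with the
    # same kernel u at stride 1, then interleave the branch outputs back together.
    L, n = len(u), len(x) - r * len(u)
    y = np.empty(n)
    for k in range(r):
        xk = x[k::r]                              # branch k (grid pooling in 1-D)
        nk = len(range(k, n, r))                  # outputs landing at i = r*j + k
        y[k::r] = [np.dot(xk[j + 1 : j + L + 1], u) for j in range(nk)]
    return y

rng = np.random.default_rng(0)
x, u = rng.standard_normal(64), rng.standard_normal(3)
assert np.allclose(dilated_conv1d(x, u, 2), branch_conv1d(x, u, 2))
```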

A dilated convolution in a two-dimensional space is illustrated in Fig. 3 (for the case $r = 2$). The $r$-dilated convolution downsamples an input feature with an interval of $r$ pixels and produces $r^2$ branches as outputs. Therefore, the dilated convolution is equivalent to the following three-step operation: PGP, a convolution sharing the same weights across branches, and the inverse of PGP (PGP$^{-1}$), which rearranges the cells to their original positions. For example, let the size of the input features be $x = (b, c, h, w)$, where $b$, $c$, $h$, and $w$ are the number of batches, number of channels, height, and width, respectively. Using PGP, the input feature $x$ is divided into $r^2 \times (b, c, h/r, w/r)$; subsequently, a convolutional filter (with shared weight parameters) is applied to each branch in parallel (Fig. 3). The intermediate $r^2$ feature maps are then rearranged by PGP$^{-1}$, resulting in the same size $(b, c, h, w)$ as the input. In short, PGP is embedded in a dilated convolution, which suggests that the success of dilated convolution can be attributed to data augmentation in the feature space.

[Fig. 3. PGP is implicitly used by dilated convolution: a dilated convolution with $r = 2$ on a $(b, c, h, w)$ input is equivalent to PGP producing $4 \times (b, c, h/2, w/2)$ branches, a weight-shared convolution applied to each branch, and PGP$^{-1}$ restoring the $(b, c, h, w)$ layout.]
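As a sanity check on this rearrangement, a short NumPy sketch of PGP and its inverse in 2-D (our own helpers, assuming $h$ and $w$ are divisible by $r$) confirms that PGP$^{-1}$(PGP($x$)) restores the original layout:

```python
import numpy as np

def pgp(x, r):
    # Split (b, c, h, w) into r**2 branches of shape (b, c, h // r, w // r).
    return [x[:, :, i::r, j::r] for i in range(r) for j in range(r)]

def pgp_inverse(branches, r):
    # Rearrange the r**2 branches back to their original (b, c, h, w) positions.
    b, c, hs, ws = branches[0].shape
    out = np.empty((b, c, hs * r, ws * r), dtype=branches[0].dtype)
    for idx, (i, j) in enumerate((i, j) for i in range(r) for j in range(r)):
        out[:, :, i::r, j::r] = branches[idx]
    return out

x = np.random.randn(2, 3, 8, 8)
assert np.allclose(pgp_inverse(pgp(x, 2), 2), x)  # PGP^{-1}(PGP(x)) == x
```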

3.3 Architectural Differences between Dilated Convolution and PGP

The structure of the network model with PGP is better suited for learning than that with dilated convolutions. For example, consider the cases where two downsampling layers in a base CNN (Base-CNN) are replaced by either dilated convolution (Fig. 4(b)) or PGP (Fig. 4(d)). According to [7], the conventional multi-stride downsampling layer and all subsequent convolution layers can be replaced with dilated convolution layers with a two-times-larger dilation rate (Fig. 4(b)). As mentioned in Section 3.2, a dilated convolution is equivalent to (PGP + Convolution + PGP$^{-1}$). When $n$ $r$-dilated convolutions are stacked, they are expressed by (PGP + Convolution $\times n$ + PGP$^{-1}$). This means that each branch in a CNN with dilated convolution (DConv-CNN) is independent of the others.

Let us consider the case of Fig. 4(b), in which dilated convolutions with two different dilation rates (e.g., $r_1 = 2$ and $r_2 = 4$) are stacked. Using the fact that a sequence of $n$ 4-dilated convolutions is (PGP$^2$ + Convolution $\times n$ + PGP$^{-2}$), this architecture can readily be transformed into the one given in Fig. 4(c), which is the equivalent form with PGP. We can clearly see that each branch split by the former dilation layer is split again by the latter, and all the branches are eventually rearranged by PGP$^{-1}$ layers prior to the global average pooling (GAP) layer.

[Fig. 4. Comparison between DConv-CNN and PGP-CNN: (a) the base CNN (Base-CNN), expressed as an architecture using GP as mentioned in Section 3.1; (b) CNN with dilated convolution (DConv-CNN); (c) DConv-CNN represented by PGP, which is equivalent to (b) and rearranges all branches with PGP$^{-1}$ before the GAP layer; (d) CNN with PGP (PGP-CNN), which shares weights across branches and has the same structure as Base-CNN.]

In contrast, CNN with PGP (PGP-CNN) is implemented as follows: the stride of each convolution or pooling layer that decreases resolution is set to 1, after which PGP is inserted (Fig. 4(d)). In DConv-CNN, all the branches are aggregated through PGP$^{-1}$ layers just before the GAP layer, which can be regarded as a feature ensemble, whereas PGP-CNN does not use any PGP$^{-1}$ layer throughout the network, so the intermediate features are not fused until the end of the network.

Unlike a feature ensemble (i.e., the case of DConv-CNN), this structure forces all the branches to learn correct classification results. As mentioned above, DConv-CNN averages an ensemble of intermediate features, and the likelihood of each class is then predicted based on the average. Even if some branches acquire useful features (e.g., for image classification) while the other branches do not, the CNN attempts to determine the correct class based on the average, which prevents the less useful branches from learning further. In contrast, thanks to the independence of its branches, PGP-CNN can perform additional learning with superior regularization.
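The difference can be illustrated with a toy computation that contrasts a single loss on averaged branch features (the DConv-CNN view) with one loss per branch under a shared classifier (the PGP-CNN view). The shapes and the plain softmax cross-entropy below are made up for the sketch and are not the authors' training code:

```python
import numpy as np

def softmax_xent(logits, label):
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

rng = np.random.default_rng(0)
branch_feats = [rng.standard_normal(8) for _ in range(4)]  # 4 PGP branches after GAP
W = rng.standard_normal((8, 10))                           # shared 10-class classifier
label = 3

# DConv-CNN style: branches are fused (feature ensemble) before a single loss.
fused_loss = softmax_xent(np.mean(branch_feats, axis=0) @ W, label)

# PGP-CNN style: each branch must classify correctly on its own; the shared
# weights W receive gradients from every branch, acting as a regularizer.
per_branch_loss = np.mean([softmax_xent(f @ W, label) for f in branch_feats])
print(fused_loss, per_branch_loss)
```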

3.4 Weight Transfer Using PGP

Compared to DConv-CNN, PGP-CNN has a significant advantage in terms of weight transfer. Specifically, the weights learned on PGP-CNN work well on the Base-CNN structure when transferred, and we demonstrate later in the experiments that this can improve the performance of the base architecture. However, DConv-CNN does not have the same structure as Base-CNN because of PGP$^{-1}$: the replacement of downsampling by dilated convolutions makes it difficult to transfer weights from the pre-trained base model. The base CNN with multi-stride convolutions (the top of Fig. 5(a)) is represented by single-stride convolutions and grid pooling (GP) layers (the bottom of Fig. 5(a)). DConv-CNN (the top of Fig. 5(b)) is equivalent to the base CNN with PGP and PGP$^{-1}$ (the bottom of Fig. 5(b)). Focusing on the difference in the size of the input features of a 3 × 3 convolution, the size is $(b, c_1, h, w)$ in Base-CNN, whereas it is $(b, c_1, h/2, w/2)$ in DConv-CNN. This difference in resolution makes it impossible to directly transfer weights from the pre-trained model. Unlike DConv-CNN, PGP-CNN maintains the same structure as Base-CNN (Fig. 5(c)), which results in better transferability and better performance.

This weight transferability of PGP has two advantages. First, applying PGP to a Base-CNN with learned weights at test time can improve recognition performance without retraining. For example, if one does not have sufficient computing resources to train PGP-CNN, using PGP only in the test phase can still make full use of the features discarded by downsampling in the training phase to achieve higher accuracy. Second, the weights learned by PGP-CNN work well in Base-CNN. For example, on an embedded system specialized for a particular CNN, one cannot change the calculations inside the CNN; however, the weights learned on PGP-CNN can yield superior performance even when used in Base-CNN at test time.

4 Experimental Results

4.1 Image classification

Datasets. We evaluated PGP on the benchmark datasets CIFAR-10, CIFAR-100 [23], and SVHN [24]. Both CIFAR-10 and CIFAR-100 consist of 32 × 32 color images, with 50,000 images for training and 10,000 images for testing. The CIFAR-10 dataset has 10 classes, each containing 6,000 images; there are 5,000 training images and 1,000 testing images per class. The CIFAR-100 dataset has 100 classes, each containing 600 images; there are 500 training images and 100 testing images per class. We adopted a standard data augmentation scheme (mirroring/shifting) that is widely used for these two datasets [12,11]. For preprocessing, we normalized the data using the channel means and standard deviations. We used all 50,000 training images for training and calculated the final testing error after training.

The Street View House Numbers (SVHN) dataset consists of 32 × 32 color digit images. There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 additional images for training. Following the method in [10,2], we used all the training data without any data augmentation.

[Fig. 5. The structure performing downsampling inside (a) Base-CNN, (b) DConv-CNN, and (c) PGP-CNN. This downsampling structure (a 3 × 3 convolution with a stride of 2) is used in a derivative of ResNet [9], WideResNet [10], and ResNeXt [11].]

Implementation details. We trained a pre-activation ResNet (PreResNet) [12], All-CNN [43], WideResNet [10], ResNeXt [11], PyramidNet [44], and DenseNet [2] from scratch. We used a 164-layer network for PreResNet. We used WRN-28-10, ResNeXt-29 (8×64d), PyramidNet-164 (bottleneck, α = 48), and DenseNet-BC-100 (k = 12) in the same manner as the original papers.

When comparing the base CNN (Base), the CNN with dilated convolutions (DConv), and the CNN with PGP (PGP), we utilized the same hyperparameter values (e.g., learning rate strategy) to carefully monitor the performance improvements. We adopted the cosine-shaped learning rate schedule [45], which smoothly anneals the learning rate [46,47]. The learning rate was initially set to 0.2 and gradually reduced to 0.002. Following [2], for CIFAR and SVHN we trained the networks using mini-batches of size 64 for 300 epochs and 40 epochs, respectively. All networks were optimized using stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of $1.0 \times 10^{-4}$. We adopted the weight initialization method introduced in [48]. We used the Chainer framework [49] and ChainerCV [50] for the implementation of all experiments.
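For reference, here is a sketch of the cosine-shaped schedule as we read it from the text (initial rate 0.2 annealed to 0.002 over 300 epochs); the closed form follows SGDR [45] and may differ in minor details from the authors' code:

```python
import math

def cosine_lr(epoch, total_epochs=300, lr_max=0.2, lr_min=0.002):
    # Smoothly anneal the learning rate from lr_max down to lr_min.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_lr(0), cosine_lr(150), cosine_lr(300))  # 0.2, ~0.101, 0.002
```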

Table 1. Test errors (%) with various architectures on CIFAR-10, CIFAR-100, and SVHN. DConv: dilated convolution, PGP: parallel grid pooling.

Dataset    Network                   Params  Base         DConv        PGP
CIFAR-10   PreResNet-164             1.7M    4.71±0.01    4.15±0.12    3.77±0.09
           All-CNN                   1.4M    8.42±0.23    8.68±0.43    7.17±0.32
           WideResNet-28-10          36.5M   3.44±0.06    3.88±0.21    3.13±0.15
           ResNeXt-29 (8×64d)        34.4M   3.86±0.16    3.87±0.13    3.22±0.34
           PyramidNet-164 (α=28)     1.7M    3.91±0.15    3.72±0.07    3.38±0.09
           DenseNet-BC-100 (k=12)    0.8M    4.60±0.07    4.35±0.22    4.11±0.15
CIFAR-100  PreResNet-164             1.7M    22.68±0.27   20.71±0.27   20.18±0.45
           All-CNN                   1.4M    32.43±0.31   33.10±0.39   29.79±0.33
           WideResNet-28-10          36.5M   18.70±0.39   19.69±0.50   18.20±0.17
           ResNeXt-29 (8×64d)        34.4M   19.63±1.12   17.90±0.57   17.18±0.29
           PyramidNet-164 (α=28)     1.7M    19.65±0.24   19.10±0.32   18.12±0.16
           DenseNet-BC-100 (k=12)    0.8M    22.62±0.29   22.20±0.03   21.69±0.37
SVHN       PreResNet-164             1.7M    1.96±1.74    1.74±0.07    1.54±0.02
           All-CNN                   1.4M    1.94±0.06    1.77±0.07    1.75±0.02
           WideResNet-28-10          36.5M   1.81±0.03    1.53±0.02    1.38±0.03
           ResNeXt-29 (8×64d)        34.4M   1.81±0.02    1.66±0.03    1.52±0.01
           PyramidNet-164 (α=28)     1.7M    2.02±0.05    1.86±0.06    1.79±0.05
           DenseNet-BC-100 (k=12)    0.8M    1.97±0.12    1.77±0.06    1.67±0.07

Results. The results of applying DConv and PGP on CIFAR-10, CIFAR-100, and SVHN with different networks are listed in Table 1. In all cases, PGP outperformed Base and DConv.

Comparison to other data augmentation methods. We compare our method to random flipping (RF), random cropping (RC), and random erasing (RE) in Table 2. PGP achieves better performance than any of the other data augmentation methods when applied alone. When applied in combination with other data augmentation methods, PGP works in a complementary manner: combining RF, RC, RE, and PGP yields an error rate of 19.16%, an 11.74-point improvement over the baseline without any augmentation methods.

Comparison to the classical ensemble method. We compare our method to the classical ensemble method in Table 3. On CIFAR-10 and SVHN, a single CNN with PGP achieves better performance than an ensemble of three basic CNN models. Furthermore, PGP works in a complementary manner when applied to the ensemble.

Table 2. Test errors (%) with different data augmentation methods on CIFAR-100 based on PreResNet-164. RF: random flipping, RC: random cropping, RE: random erasing, PGP: parallel grid pooling.

Network        RF  RC  RE   w/o PGP      w/ PGP
PreResNet-164  -   -   -    30.90±0.32   22.68±0.27
               ✓   -   -    26.40±0.18   20.70±0.34
               -   ✓   -    23.94±0.07   22.34±0.23
               -   -   ✓    27.52±0.22   21.32±0.21
               ✓   ✓   -    22.68±0.27   20.18±0.45
               ✓   -   ✓    24.39±0.55   20.01±0.14
               -   ✓   ✓    22.59±0.09   20.71±0.02
               ✓   ✓   ✓    21.48±0.21   19.16±0.20

Table 3. Ensembled test errors (%) on CIFAR-10/100 and SVHN based on PreResNet-164. "×k" denotes an ensemble of k CNN models. DConv: dilated convolution, PGP: parallel grid pooling.

                           ×1                       ×3
Dataset     Network        Base    DConv   PGP      Base    DConv   PGP
CIFAR-10    PreResNet-164  4.71    4.15    3.77     3.98    3.37    3.14
CIFAR-100   PreResNet-164  22.68   20.71   20.18    19.16   17.35   17.15
SVHN        PreResNet-164  1.96    1.74    1.54     1.65    1.51    1.39

Weight transfer. The results of applying PGP as a test-time data augmentation method on CIFAR-10, CIFAR-100, and SVHN with different networks are listed in Table 4. PGP achieves better performance than the base model, except on the All-CNN and ResNeXt models. The reason for this incompatibility is left for future work.

The results of applying PGP as a training-time data augmentation technique on the CIFAR-10, CIFAR-100, and SVHN datasets with different networks are listed in Table 5. The models trained with PGP, which have the same structure as the base model at the test phase, outperformed the base model and the model with dilated convolutions. The models trained with dilated convolutions performed worse than the baseline models.

4.2 ImageNet classification

We further evaluated PGP using the ImageNet 2012 classification dataset [25]. The ImageNet dataset comprises 1.28 million training images and 50,000 validation images from 1,000 classes. Following the data augmentation scheme for training presented in [9,11,51], input images of size 224 × 224 were randomly cropped from a resized image using scale and aspect-ratio augmentation. We reimplemented a derivative of ResNet [9] for training. According to the dilated network strategy [7,20], we used dilated convolutions or PGP after the conv4 stage. All networks were optimized using SGD with a momentum of 0.9 and a weight decay of $1.0 \times 10^{-4}$. The models were trained over 90 epochs using mini-batches of size 256 on eight GPUs. We adopted the weight initialization method introduced in [48]. We adopted a cosine-shaped learning rate schedule, where the learning rate was initially set to 0.1 and gradually reduced to $1.0 \times 10^{-4}$.

Table 4. Test errors (%) when applying PGP as a test-time data augmentation method with various architectures on CIFAR-10, CIFAR-100, and SVHN. PGP: parallel grid pooling.

Dataset    Network                   Params  Base         PGP
CIFAR-10   PreResNet-164             1.7M    4.71±0.01    4.56±0.14
           All-CNN                   1.4M    8.42±0.23    9.03±0.30
           WideResNet-28-10          36.5M   3.44±0.06    3.39±0.02
           ResNeXt-29 (8×64d)        34.4M   3.86±0.16    4.01±0.21
           PyramidNet-164 (α=28)     1.7M    3.91±0.15    3.82±0.05
           DenseNet-BC-100 (k=12)    0.8M    4.60±0.07    4.53±0.12
CIFAR-100  PreResNet-164             1.7M    22.68±0.27   22.19±0.26
           All-CNN                   1.4M    32.43±0.31   33.24±0.30
           WideResNet-28-10          36.5M   18.70±0.39   18.60±0.39
           ResNeXt-29 (8×64d)        34.4M   19.63±1.12   20.02±0.98
           PyramidNet-164 (α=28)     1.7M    19.65±0.24   19.34±0.28
           DenseNet-BC-100 (k=12)    0.8M    22.62±0.29   22.33±0.26
SVHN       PreResNet-164             1.7M    1.96±0.07    1.83±0.06
           All-CNN                   1.4M    1.94±0.06    3.86±0.05
           WideResNet-28-10          36.5M   1.81±0.03    1.77±0.03
           ResNeXt-29 (8×64d)        34.4M   1.81±0.02    2.47±0.51
           PyramidNet-164 (α=28)     1.7M    2.02±0.05    2.08±0.08
           DenseNet-BC-100 (k=12)    0.8M    1.97±0.12    1.89±0.08

The results when applying dilated convolution and PGP on ImageNet with ResNet-50 and ResNet-101 are listed in Table 6. As can be seen from the results, the network with PGP outperformed the baseline. PGP works in a complementary manner when applied in combination with 10-crop testing, and it can be used as a training-time or testing-time data augmentation technique.

4.3 Multi-label classification

We further evaluated PGP for multi-label classification on the NUS-WIDE [52] and MS-COCO [53] datasets. NUS-WIDE contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated with 81 concepts, including objects and scenes. Following the method in [54], we used 161,789 images for training and 107,859 images for testing. MS-COCO contains 82,783 images for training and 40,504 images for testing, which are labeled with 80 common object categories. The input images were resized to 256 × 256, randomly cropped, and then resized again to 224 × 224.

Table 5. Test errors (%) when applying PGP as a training-time data augmentation technique with various architectures on CIFAR-10, CIFAR-100, and SVHN. DConv: dilated convolution, PGP: parallel grid pooling.

Dataset    Network                   Params  Base         DConv         PGP
CIFAR-10   PreResNet-164             1.7M    4.71±0.01    7.30±0.20     4.08±0.09
           All-CNN                   1.4M    8.42±0.23    38.77±0.85    7.30±0.31
           WideResNet-28-10          36.5M   3.44±0.06    7.90±0.90     3.30±0.13
           ResNeXt-29 (8×64d)        34.4M   3.86±0.16    16.91±0.45    3.36±0.27
           PyramidNet-164 (α=28)     1.7M    3.91±0.15    6.82±0.46     3.55±0.08
           DenseNet-BC-100 (k=12)    0.8M    4.60±0.07    7.03±0.70     4.36±0.10
CIFAR-100  PreResNet-164             1.7M    22.68±0.27   24.90±0.47    21.01±0.54
           All-CNN                   1.4M    32.43±0.31   64.27±2.21    30.26±0.44
           WideResNet-28-10          36.5M   18.70±0.39   26.90±0.80    18.56±0.21
           ResNeXt-29 (8×64d)        34.4M   19.63±1.12   44.57±13.78   17.67±0.31
           PyramidNet-164 (α=28)     1.7M    19.65±0.24   25.28±0.30    18.58±0.13
           DenseNet-BC-100 (k=12)    0.8M    22.62±0.29   27.56±0.78    22.46±0.16
SVHN       PreResNet-164             1.7M    1.96±1.74    3.21±0.32     1.67±0.04
           All-CNN                   1.4M    1.94±0.06    4.64±0.60     1.80±0.04
           WideResNet-28-10          36.5M   1.81±0.03    3.11±0.17     1.45±0.03
           ResNeXt-29 (8×64d)        34.4M   1.81±0.02    5.64±0.61     1.58±0.03
           PyramidNet-164 (α=28)     1.7M    2.02±0.05    3.23±0.10     1.90±0.09
           DenseNet-BC-100 (k=12)    0.8M    1.97±0.12    2.74±0.10     1.78±0.07

We trained a derivative of ResNet [9], pre-trained on the ImageNet dataset. We used the same pre-trained weights throughout all experiments for a fair comparison. All networks were optimized using SGD with a momentum of 0.9 and a weight decay of $5.0 \times 10^{-4}$. The models were trained for 16 and 30 epochs using mini-batches of size 96 on four GPUs for NUS-WIDE and MS-COCO, respectively. We adopted the weight initialization method introduced in [48]. For fine-tuning, the learning rate was initially set to $1.0 \times 10^{-3}$ and divided by 10 after the 8th and 12th epochs for NUS-WIDE, and after the 15th and 22nd epochs for MS-COCO. We employed mean average precision (mAP) and macro/micro F-measure, precision, and recall for performance evaluation. If the confidence for a class was greater than 0.5, the label was predicted as positive.
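For clarity, the macro/micro scores with the 0.5 threshold can be computed as in the following sketch (our implementation of the standard definitions, not the authors' evaluation script):

```python
import numpy as np

def multilabel_scores(probs, targets, thresh=0.5):
    # probs, targets: arrays of shape (num_images, num_classes); targets are 0/1 labels.
    preds = (probs > thresh).astype(int)
    tp = (preds * targets).sum(axis=0).astype(float)
    fp = (preds * (1 - targets)).sum(axis=0).astype(float)
    fn = ((1 - preds) * targets).sum(axis=0).astype(float)
    eps = 1e-12
    # Macro: compute precision/recall/F per class, then average over classes.
    p_c, r_c = tp / (tp + fp + eps), tp / (tp + fn + eps)
    f_c = 2 * p_c * r_c / (p_c + r_c + eps)
    macro = (f_c.mean(), p_c.mean(), r_c.mean())
    # Micro: pool true/false positives over all classes first.
    p = tp.sum() / (tp.sum() + fp.sum() + eps)
    r = tp.sum() / (tp.sum() + fn.sum() + eps)
    micro = (2 * p * r / (p + r + eps), p, r)
    return macro, micro

probs = np.array([[0.9, 0.2], [0.6, 0.7]])
targets = np.array([[1, 0], [1, 1]])
print(multilabel_scores(probs, targets))
```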

The results when applying dilated convolution and PGP to ResNet-50 and ResNet-101 are listed in Table 7. The networks with PGP achieved higher mAP than the others. In contrast, the networks with dilated convolutions achieved lower scores than the baseline, which suggests that the ImageNet pre-trained weights were not successfully transferred to ResNet with dilated convolutions owing to the difference in resolution mentioned in Sec. 3.4. This problem can be avoided by maintaining the dilation rate in the layers that perform downsampling, as in dilated residual networks (DRN) [17]. This operation allows the CNN with dilated convolutions to perform convolution at the same resolution as the base CNN. The DRN achieved better mAP scores than the base CNN and nearly the same scores as the CNN with PGP. The CNN with PGP is, however, superior to DRN in that it offers weight transferability.

Table 6. Classification error (%) for the ImageNet 2012 validation set, evaluated with and without 10-crop testing for 224 × 224-pixel inputs. DConv: dilated convolution, PGP: parallel grid pooling.

                                     1 crop            10 crops
Network     Params  Train   Test     top-1   top-5     top-1   top-5
ResNet-50   25.6M   Base    Base     23.69   7.00      21.87   6.07
                    DConv   DConv    22.47   6.27      21.17   5.58
                    PGP     PGP      22.40   6.30      20.83   5.56
                    Base    PGP      23.32   6.85      21.62   5.93
                    DConv   Base     31.44   11.40     26.79   8.19
                    PGP     Base     23.01   6.66      21.22   5.74
ResNet-101  44.5M   Base    Base     22.49   6.38      20.85   5.50
                    DConv   DConv    21.26   5.61      20.03   5.02
                    PGP     PGP      21.34   5.65      19.81   5.00
                    Base    PGP      22.13   6.21      20.46   5.36
                    DConv   Base     25.63   8.01      22.40   6.05
                    PGP     Base     21.80   5.95      20.18   5.15

5 Conclusion

In this paper, we proposed PGP, which performs downsampling without discarding any intermediate feature and can be regarded as a data augmentation technique. PGP is easily applicable to various CNN models without altering their learning strategies. Additionally, we demonstrated that PGP is implicitly used in dilated convolution, which suggests that dilated convolution is also a type of data augmentation technique. Experiments on CIFAR-10/100 and SVHN with six network models demonstrated that PGP outperformed the base CNN and the CNN with dilated convolutions. PGP also obtained better results than the base CNN and the CNN with dilated convolutions on the ImageNet dataset for image classification, and on the NUS-WIDE and MS-COCO datasets for multi-label image classification.

Table 7. Quantitative results on NUS-WIDE and MS-COCO. DConv: dilated convolution, DRN: dilated residual networks, PGP: parallel grid pooling.

                                              Macro                  Micro
Dataset    Network     Params  Arch    mAP    F      P      R        F      P      R
NUS-WIDE   ResNet-50   25.6M   Base    55.7   52.1   63.3   47.0     70.2   75.3   65.8
                               DConv   55.3   51.7   63.3   46.6     70.0   75.7   65.2
                               DRN     55.9   52.2   43.8   47.0     70.1   75.8   65.3
                               PGP     56.1   51.9   64.4   46.2     70.0   76.4   64.6
           ResNet-101  44.5M   Base    56.5   53.0   63.6   48.3     70.4   75.4   66.1
                               DConv   56.3   53.4   62.7   49.2     70.5   74.9   66.5
                               DRN     56.8   53.1   63.1   48.9     70.5   75.1   66.5
                               PGP     57.2   53.5   64.6   48.7     70.6   75.7   66.2
MS-COCO    ResNet-50   25.6M   Base    73.4   67.5   78.2   60.9     72.8   82.2   65.4
                               DConv   72.7   66.6   77.9   59.7     72.2   81.9   64.5
                               DRN     74.2   68.3   78.8   61.8     73.4   82.5   66.1
                               PGP     74.2   67.9   80.0   60.5     73.1   83.7   64.9
           ResNet-101  44.5M   Base    74.6   68.9   77.6   62.8     73.8   82.2   66.9
                               DConv   74.0   68.3   78.0   62.0     73.2   81.6   66.4
                               DRN     75.3   69.7   78.2   63.8     74.3   81.8   68.1
                               PGP     75.5   69.4   79.8   62.9     74.2   83.1   67.0

References

1. He K, Zhang X, Ren S, Sun J: Deep residual learning for image recognition. In: CVPR (2016)
2. Huang G, Liu Z, van der Maaten L, Weinberger KQ: Densely connected convolutional networks. In: CVPR (2017)
3. Hu J, Shen L, Sun G: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
4. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC: SSD: Single shot multibox detector. In: ECCV (2016)
5. Redmon J, Farhadi A: YOLO9000: Better, faster, stronger. In: CVPR (2017)
6. Lin TY, Goyal P, Girshick R, He K, Dollár P: Focal loss for dense object detection. In: CVPR (2017)
7. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI (2017)
8. Zhao H, Shi J, Qi X, Wang X, Jia J: Pyramid scene parsing network. In: CVPR (2017)
9. Gross S, Wilber M: Training and investigating residual nets. Facebook AI Research. [Online] http://torch.ch/blog/2016/02/04/resnets.html (2016)
10. Zagoruyko S, Komodakis N: Wide residual networks. In: BMVC (2016)
11. Xie S, Girshick R, Dollár P, Tu Z, He K: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
12. He K, Zhang X, Ren S, Sun J: Identity mappings in deep residual networks. In: ECCV (2016)
13. Zeiler MD, Fergus R: Stochastic pooling for regularization of deep convolutional neural networks. In: ICLR (2013)
14. Graham B: Fractional max-pooling. arXiv preprint arXiv:1412.6071 (2014)
15. Lee CY, Gallagher PW, Tu Z: Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In: AISTATS (2016)
16. Zhai S, Wu H, Kumar A, Cheng Y, Lu Y, Zhang Z, Feris R: S3Pool: Pooling with stochastic spatial sampling. In: CVPR (2017)
17. Yu F, Koltun V, Funkhouser T: Dilated residual networks. In: CVPR (2017)
18. Oord Avd, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K: WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
19. Kalchbrenner N, Espeholt L, Simonyan K, Oord Avd, Graves A, Kavukcuoglu K: Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 (2016)
20. Yu F, Koltun V: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)
21. Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611 (2018)
22. Yang Z, Hu Z, Salakhutdinov R, Berg-Kirkpatrick T: Improved variational autoencoders for text modeling using dilated convolutions. In: ICML (2017)
23. Hinton GE: Learning multiple layers of representation. Trends in Cognitive Sciences 11(10) (2007) 428-434
24. Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Machine Learning Systems, Volume 2011 (2011)
25. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al.: ImageNet large scale visual recognition challenge. IJCV 115(3) (2015) 211-252
26. Krizhevsky A, Sutskever I, Hinton GE: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
27. Ratner AJ, Ehrenberg H, Hussain Z, Dunnmon J, Ré C: Learning to compose domain-specific transformations for data augmentation. In: NIPS (2017)
28. Hauberg S, Freifeld O, Larsen ABL, Fisher J, Hansen L: Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In: AISTATS (2016)
29. Zhong Z, Zheng L, Kang G, Li S, Yang Y: Random erasing data augmentation. arXiv preprint arXiv:1708.04896 (2017)
30. DeVries T, Taylor GW: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
31. Yamada Y, Iwamura M, Kise K: ShakeDrop regularization. arXiv preprint arXiv:1802.02375 (2018)
32. Real E, Aggarwal A, Huang Y, Le QV: Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 (2018)
33. Zhang H, Cisse M, Dauphin YN, Lopez-Paz D: mixup: Beyond empirical risk minimization. In: ICLR (2018)
34. Yuji T, Yoshitaka U, Tatsuya H: Between-class learning for image classification. In: CVPR (2018)
35. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP: SMOTE: Synthetic minority over-sampling technique. JAIR 16 (2002) 321-357
36. DeVries T, Taylor GW: Dataset augmentation in feature space. In: ICLR Workshop (2017)
37. Fawzi A, Samulowitz H, Turaga D, Frossard P: Adaptive data augmentation for image classification. In: ICIP (2016)
38. Wang X, Shrivastava A, Gupta A: A-Fast-RCNN: Hard positive generation via adversary for object detection. In: CVPR (2017)
39. Lemley J, Bazrafkan S, Corcoran P: Smart augmentation: Learning an optimal data augmentation strategy. IEEE Access (2017)
40. Dixit M, Kwitt R, Niethammer M, Vasconcelos N: AGA: Attribute-guided augmentation. In: CVPR (2017)
41. Liu B, Dixit M, Kwitt R, Vasconcelos N: Feature space transfer for data augmentation. In: CVPR (2018)
42. Yu F, Koltun V, Funkhouser T: Dilated residual networks. In: CVPR (2017)
43. Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M: Striving for simplicity: The all convolutional net. In: ICLR Workshop (2015)
44. Han D, Kim J, Kim J: Deep pyramidal residual networks. In: CVPR (2017)
45. Loshchilov I, Hutter F: SGDR: Stochastic gradient descent with restarts. In: ICLR (2017)
46. Huang G, Li Y, Pleiss G, Liu Z, Hopcroft JE, Weinberger KQ: Snapshot ensembles: Train 1, get M for free. In: ICLR (2017)
47. Zoph B, Vasudevan V, Shlens J, Le QV: Learning transferable architectures for scalable image recognition. In: CVPR (2018)
48. He K, Zhang X, Ren S, Sun J: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV (2015)
49. Tokui S, Oono K, Hido S, Clayton J: Chainer: A next-generation open source framework for deep learning. In: NIPS Workshop (2015)
50. Niitani Y, Ogawa T, Saito S, Saito M: ChainerCV: A library for deep learning in computer vision. In: ACMMM (2017)
51. Chen Y, Li J, Xiao H, Jin X, Yan S, Feng J: Dual path networks. In: NIPS (2017)
52. Chua TS, Tang J, Hong R, Li H, Luo Z, Zheng Y: NUS-WIDE: A real-world web image database from National University of Singapore. In: ACM-CIVR (2009)
53. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL: Microsoft COCO: Common objects in context. In: ECCV (2014)
54. Zhu F, Li H, Ouyang W, Yu N, Wang X: Learning spatial regularization with image-level supervisions for multi-label image classification. In: CVPR (2017)

  • Parallel Grid Pooling for Data Augmentation
Page 5: arXiv:1803.11370v1 [cs.CV] 30 Mar 2018 · Parallel Grid Pooling for Data Augmentation Akito Takeki, Daiki Ikami, Go Iriey, and Kiyoharu Aizawa {takeki,iakmi,aizawa}@hal.t.u-tokyo.ac.jp,

5

A network architecture with a PGP layer can be viewed as one with s2 inter-nal branches by performing grid pooling in parallel Note that the weights (ieparameters) of the following network are shared between all branches hencethere is no additional parameter throughout the network PGP performs down-sampling while maintaining the spatial structure of the intermediate featuresthereby producing output features shifted by several pixels As described abovebecause the operations in each layer following PGP share the same weightsacross branches the subsequent layers in each branch are trained over a set ofslightly shifted feature maps simultaneously This works as data augmentationwith PGP the layers are trained with s2 times larger number of mini-batchescompared to the case without PGP

32 PGP vs Dilated Convolution

We here discuss the relationship between PGP and dilated convolution [720]Dilated convolution is a convolutional operation applied to an input with a de-fined gap Dilated convolution is considered to be an operation that efficientlyutilizes spatial and temporal information [718192042] For brevity let us con-sider one-dimensional signals Dilated convolution is defined as

y [i] =

Lsuml=1

x [i + rl]u [l] (3)

where x[i] is an input feature y[i] is an output feature u[l] is a convolution filterof kernel size L and r is a dilation rate A dilated convolution is equivalent toa regular convolution when r = 1 Let i = rj + k (0 le k le r minus 1) Then theoutput y [rj + k] is

y [rj + k] =

Lsuml=1

x [rj + k + rl]u [l] (4)

=

Lsuml=1

x [r (j + l) + k]u [l] (5)

This suggests that the output y [i] where i equiv k (mod r) can be obtained fromthe input x [i] | i equiv k (mod r) To summarize an r-dilated convolution isequivalent to an operation where the input is split into r branches and convo-lutions are performed with the same kernel u then spatially rearranged into asingle output

A dilated convolution in a two-dimensional space is illustrated in Fig 3 (incase r = 2) The r-dilated convolution downsamples an input feature with aninterval of r pixels then produces r2 branches as outputs Therefore the dilatedconvolution is equivalent to a following three-step operation PGP a convolu-tion sharing the same weight and inverse of PGP (PGPminus1) which rearranges thecells to their original positions For example let the size of the input features be

6

Conv

Conv

Conv

Conv

4 times (119887 119888 ℎ2 1199082)

(119887 119888 ℎ 119908) (119887 119888 ℎ 119908)

=

Dilated

Convolution

(119903 = 2) (119887 119888 ℎ 119908) (119887 119888 ℎ 119908) (119887 119888 ℎ 119908)

PGP PGPminus120783

Fig 3 PGP is implicitly used by dilated convolution

x = (b c h w) where b c h and w are the number of batches number of chan-nels height and width respectively Using PGP the input feature x is dividedinto r2times (b c hr wr) subsequently a convolutional filter (with shared weightparameters) is applied to each branch in parallel (Fig 3) Then the intermedi-ate r2 feature maps are rearranged by PGP resulting the same size (b c h w)as the input by PGPminus1 In short PGP is embedded in a dilated convolutionwhich suggests that the success of dilated convolution can be attributed to dataaugmentation in the feature space

33 Architectural Differences between dilated convolution and PGP

The structure of the network model with PGP is better for learning than thatwith dilated convolutions For example consider the cases where two downsam-pling layers in a base CNN (Base-CNN) are replaced by either dilated convolution(Fig 4(b)) or PGP (Fig 4(d)) According to [7] the conventional multi-stridedownsampling layer and all subsequent convolution layers can be replaced withdilated convolution layers with two times dilation rate (Fig 4(b)) As mentionedin Section 32 a dilated convolution is equivalent to (PGP + Convolution +PGPminus1) When n r-dilated convolutions are stacked they are expressed by(PGP + Convolution times n + PGPminus1) This means that each branch in a CNNwith dilated convolution (DConv-CNN) is independent from the others

Let us consider the case of Fig 4(b) that dilated convolutions with two dif-ferent dilation rates (eg r1 = 2 and r2 = 4) are stacked Using the fact thata sequence of n 4-dilated convolutions is (PGP2 + Convolution timesn + PGPminus2)this architecture can readily be transformed to the one given in Fig 4(c) whichis the equivalent form with PGP We can clearly see that each branch split bythe former dilation layer is split again by the latter and all the branches areeventually rearranged by PGPminus1 layers prior the global average pooling (GAP)

7

CNN Image GP CNN GP CNN GAP FC Loss

(a) Base CNN (Base-CNN)

CNN GAP FC Image Loss

Dilated

CNN

(119903 = 2)

Dilated

CNN

(119903 = 4)

(b) CNN with dilated convolution (DConv-CNN)

CNN

CNN

PGP CNN

CNN

CNN PGP

PGP

PGP

PGP

GAP FC Image PGP -1

Loss

PGP

PGP

PGP

PGP -1

-1

-1

-1

(c) DConv-CNN represented by PGP

Loss

Weight

sharing

CNN

CNN

PGP CNN

CNN

CNN PGP

PGP

PGP

PGP

Image

The same structure as

Base-CNN

(d) CNN with PGP (PGP-CNN)

Fig 4 Comparison between DConv-CNN and PGP-CNN As mentioned in Section 31Base-CNN is expressed as the architecture using GP (a) (b) is equivalent to (c)

layer In contrast CNN with PGP (PGP-CNN) is implemented as follows Theconvolution or pooling layer that decreases resolution is set to 1 stride of thatlayer after which PGP is inserted (Fig 4(d)) In DConv-CNN all the branchesare aggregated through PGPminus1 layers just before the GAP layer which can beregarded as a feature ensemble while PGP-CNN does not use any PGPminus1 layerthroughout the network so the intermediate features are not fused until the endof the network

Unlike feature ensemble (ie the case of DConv-CNN) this structure en-forces all the branches to learn correct classification results As mentioned aboveDConv-CNN averages an ensemble of intermediate feature The likelihood ofeach class is then predicted based on the average Even if some branches containuseful features (eg for image classification) while the other branches do notacquire the CNN attempts to determine the correct class based on the averagewhich prevents the useless branches from further learning In contrast thanks tothe independency of branches PGP-CNN can perform additional learning withsuperior regularization

8

34 Weight transfer using PGP

Compared to DConv-CNN PGP-CNN has a significant advantage in terms ofweight transfer Specifically the weights learned on PGP-CNN work well onBase-CNN structure when transferred We demonstrate later in the experimentsthat this can improve performance of the base architecture However DConv-CNN does not have the same structure as Base-CNN because of PGPminus1 Thereplacement of downsampling by dilated convolutions makes it difficult to trans-fer weights from the pre-trained base model The base CNN with multi-strideconvolutions (the top of Fig 5 (a)) is represented by single-stride convolutionsand grid pooling (GP) layers (the bottom of Fig 5 (a)) DConv-CNN (the topof Fig 5 (b)) is equivalent to the base CNN with PGP and PGPminus1 (the bottomof Fig 5 (b)) Focusing on the difference in the size of the input features of a3times 3 convolution the size is (b c1 h w) in Base-CNN whereas (b c1 h2 w2)in DConv-CNN This difference in resolution makes it impossible to directlytransfer weights from the pre-trained model Unlike DConv-CNN PGP-CNNmaintains the same structure as Base-CNN (Fig 5 (c)) which results in bettertransferability and better performance

This weight transferability of PGP has two advantages First applying PGPto Base-CNN with learned weights in test phase can improve recognition perfor-mance without retraining For example if one does not have sufficient computingresources to train PGP-CNN using PGP only in the test phase can still make fulluse of the features discarded by downsampling in the training phase to achievehigher accuracy Second the weights learned by PGP-CNN can work well inBase-CNN As an example on an embedded system specialized to a particularCNN one can not change calculations inside the CNN However the weightslearned by on PGP-CNN can result in superior performance even when used inBase-CNN at test time

4 Experimental Results

41 Image classification

Datasets We evaluated PGP on benchmark datasets CIFAR-10 CIFAR-100 [23]and SVHN [24] Both CIFAR-10 and CIFAR-100 consist of 32times32 color imageswith 50000 images for training and 10000 images for testing The CIFAR-10dataset has 10 classes each containing 6000 images There are 5000 trainingimages and 1000 testing images for each class The CIFAR-100 dataset has 100classes each containing 600 images There are 500 training images and 100 test-ing images for each class We adopted a standard data augmentation scheme(mirroringshifting) that is widely used for these two datasets [1211] For pre-processing we normalized the data using the channel means and standard devi-ations We used all 50000 training images for training and calculated the finaltesting error after training

The Street View House Numbers (SVHN) dataset consists of 32 times 32 colordigit image There are 73257 images in the training set 26032 images in the test

9

Conv 1x1 str 1x1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 2x2

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 2x2

Conv 1x1 str 1x1

GP

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 1x1

(119887 1198882 ℎ 119908)

(119887 1198882 ℎ2 1199082)

(119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

GP

(119887 1198883 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Dconv 3x3 str 1x1

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ 119908)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

PGPminus1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

PGP

Conv 3x3 str 1x1

4 times (119887 1198881 ℎ2 1199082)

4 times (119887 1198882 ℎ2 1199082)

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ 119908)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

PGP

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 1x1

(119887 1198882 ℎ 119908)

4 times (119887 1198882 ℎ2 1199082)

4 times (119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

PGP

(119887 1198883 ℎ 119908)

(a) Base-CNN (b) DConv-CNN (c) PGP-CNN

Fig 5 The structure performing downsampling inside Base-CNN DConv-CNN andPGP-CNN This downsampling structure (3 times 3 convolution with stride of 2) is usedin a derivative of ResNet [9] WideResNet [10] and ResNeXt [11]

set and 531131 additional images for training Following the method in [102]we used all the training data without any data augmentation

Implementation details We trained a pre-activation ResNet (PreResNet) [12]All-CNN [43] WideResNet [10] ResNeXt [11] PyramidNet [44] and DenseNet [2]from scratch We use a 164-layer network for PreResNet We used WRN-28-10ResNeXt-29 (8x64d) PyramidNet-164 (Bottleneck alpha=48) and DenseNet-BC-100 (k=12) in the same manner as the original papers

When comparing the base CNN (Base) the CNN with dilated convolutions(DConv) and the CNN with PGP (PGP) we utilized the same hyperparametervalues (eg learning rate strategy) to carefully monitor the performance improve-ments We adopted the cosine shape learning rate schedule [45] which smoothlyanneals the learning rate [4647] The learning rate was initially set to 02 andgradually reduced to 0002 Following [2] for CIFAR and SVHN we trainedthe networks using mini-batches of size 64 for 300 epochs and 40 epochs re-spectively All networks were optimized using stochastic gradient descent (SGD)with a momentum of 09 and weight decay of 10times10minus4 We adopted the weightinitialization method introduced in [48] We used the Chainer framework [49]and ChainerCV [50] for the implementation of all experiments

10

Table 1 Test errors () with various architectures on CIFAR-10 CIFAR-100 andSVHN DConv dilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 415plusmn012 377plusmn009All-CNN 14M 842plusmn023 868plusmn043 717plusmn032WideResNet-28-10 365M 344plusmn006 388plusmn021 313plusmn015ResNeXt-29 (8times64d) 344M 386plusmn016 387plusmn013 322plusmn034PyramidNet-164 (α=28) 17M 391plusmn015 372plusmn007 338plusmn009DenseNet-BC-100 (k=12) 08M 460plusmn007 435plusmn022 411plusmn015

CIFAR-100

PreResNet-164 17M 2268plusmn027 2071plusmn027 2018plusmn045All-CNN 14M 3243plusmn031 3310plusmn039 2979plusmn033WideResNet-28-10 365M 1870plusmn039 1969plusmn050 1820plusmn017ResNeXt-29 (8times64d) 344M 1963plusmn112 1790plusmn057 1718plusmn029PyramidNet-164 (α=28) 17M 1965plusmn024 1910plusmn032 1812plusmn016DenseNet-BC-100 (k=12) 08M 2262plusmn029 2220plusmn003 2169plusmn037

SVHN

PreResNet-164 17M 196plusmn174 174plusmn007 154plusmn002All-CNN 14M 194plusmn006 177plusmn007 175plusmn002WideResNet-28-10 365M 181plusmn003 153plusmn002 138plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 166plusmn003 152plusmn001PyramidNet-164 (α=28) 17M 202plusmn005 186plusmn006 179plusmn005DenseNet-BC-100 (k=12) 08M 197plusmn012 177plusmn006 167plusmn007

Results The results of applying DConv and PGP on CIFAR-10 CIFAR-100and SVHN with different networks are listed in Table 1 In all cases PGP out-performed Base and DConv

Comparison to other data augmentation methods We compare our methodto random flipping (RF) random cropping (RC) and random erasing (RE) inTable 2 PGP achieves better performance than any other data augmentationmethods when applied alone When applied in combination with other data aug-mentation methods PGP works in a complementary manner Combining RFRC RE and PGP yields an error rate of 1916 which is 1174 improvementover the baseline without any augmentation methods

Comparison to classical ensemble method We compare our method to theclassical ensemble method in Table 3 On CIFAR-10 and SVHN a single CNNwith the PGP model achieves better performance than an ensemble of threebasic CNN models Furthermore PGP works in a complementary manner whenapplied to the ensemble

Weight transfer The results of applying PGP as a test-time data augmentationmethod on CIFAR-10 CIFAR-100 and SVHN on different networks are listedin Table 4 PGP achieves better performance than the base model except for

11

Table 2 Test errors () with different data augmentation methods on CIFAR-100based on PreResNet-164 RF random flipping RC random cropping RE randomerasing PGP parallel grid pooling

Network RF RC RE wo PGP w PGP

PreResNet-164

3090plusmn032 2268plusmn027X 2640plusmn018 2070plusmn034

X 2394plusmn007 2234plusmn023X 2752plusmn022 2132plusmn021

X X 2268plusmn027 2018plusmn045X X 2439plusmn055 2001plusmn014

X X 2259plusmn009 2071plusmn002X X X 2148plusmn021 1916plusmn020

Table 3 Ensembled test errors () on CIFAR-10100 and SVHN based on PreResNet-164 ldquotimeskrdquo means an ensembling of k CNN models DConv dilated convolution PGPparallel grid pooling

Dataset Networktimes1 times3

Base DConv PGP Base DConv PGP

CIFAR-10PreResNet-164

471 415 377 398 337 314CIFAR-100 2268 2071 2018 1916 1735 1715SVHN 196 174 154 165 151 139

on the ALL-CNN and ResNeXt models The reason of the incompatibility is leftfor future work

The results of applying PGP as a training-time data augmentation techniqueon the CIFAR-10 CIFAR-100 and SVHN datasets with different networks arelisted in Table 5 The models trained with PGP which have the same structureto the base model at the test phase outperformed the base model and themodel with dilated convolutions The models trained with dilated convolutionsperformed worse compared to the baseline models

42 ImageNet classification

We further evaluated PGP using the ImageNet 2012 classification dataset [25]The ImageNet dataset is comprised of 128 million training images and 50000validation images from 1000 classes With the data augmentation scheme fortraining presented in [91151] input images of size 224 times 224 were randomlycropped from a resized image using scale and aspect ratio augmentation Wereimplemented a derivative of ResNet [9] for training According to the dilatednetwork strategy [720] we used dilated convolutions or PGP after the conv4stage All networks were optimized using SGD with a momentum of 09 andweight decay of 10times 10minus4 The models were trained over 90 epochs using mini-batches of size 256 on eight GPUs We adopted the weight initialization method

12

Table 4 Test errors () when applying PGP as a test-time data augmentation withvarious architectures on CIFAR-10 CIFAR-100 and SVHN PGP parallel grid pool-ing

Dataset Network Params Base PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 456plusmn014All-CNN 14M 842plusmn023 903plusmn030WideResNet-28-10 365M 344plusmn006 339plusmn002ResNeXt-29 (8times64d) 344M 386plusmn016 401plusmn021PyramidNet-164 (α=28) 17M 391plusmn015 382plusmn005DenseNet-BC-100 (k=12) 08M 460plusmn007 453plusmn012

CIFAR-100

PreResNet-164 17M 2268plusmn027 2219plusmn026All-CNN 14M 3243plusmn031 3324plusmn030WideResNet-28-10 365M 1870plusmn039 1860plusmn039ResNeXt-29 (8times64d) 344M 1963plusmn112 2002plusmn098PyramidNet-164 (α=28) 17M 1965plusmn024 1934plusmn028DenseNet-BC-100 (k=12) 08M 2262plusmn029 2233plusmn026

SVHN

PreResNet-164 17M 196plusmn007 183plusmn006All-CNN 14M 194plusmn006 386plusmn005WideResNet-28-10 365M 181plusmn003 177plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 247plusmn051PyramidNet-164 (α=28) 17M 202plusmn005 208plusmn008DenseNet-BC-100 (k=12) 08M 197plusmn012 189plusmn008

introduced in [48] We adopted a cosine-shaped learning rate schedule wherethe learning rate was initially set to 01 and gradually reduced to 10times 10minus4

The results when applying dilated convolution and PGP on ImageNet withResNet-50 and ResNet-101 are listed in Table 6 As can be seen in the resultsthe network with PGP outperformed the baseline performance PGP works in acomplementary manner when applied in combination with 10-crop testing andcan be used as a training-time or testing-time data augmentation technique

43 Multi-label classification

We further evaluated PGP for multi-label classification on the NUS-WIDE [52]and MS-COCO [53] datasets NUS-WIDE contains 269648 images with a totalof 5018 tags collected from Flickr These images are manually annotated with81 concepts including objects and scenes Following the method in [54] we used161789 images for training and 107859 images for testing MS-COCO contains82783 images for training and 40504 images for testing which are labeled as 80common objects The input images were resized to 256times256 randomly croppedand then resized again to 224times 224

We trained a derivative of ResNet [9] with pre-training using the ImageNetdataset We used the same pre-trained weights throughout all experiments forfair comparison All networks were optimized using SGD with a momentum of

13

Table 5 Test errors () when applying PGP as a training-time data augmentationtechnique with various architectures on CIFAR-10 CIFAR-100 and SVHN DConvdilated convolution PGP parallel grid pooling

Dataset     Network                  Params   Base         DConv         PGP

CIFAR-10    PreResNet-164            1.7M     4.71±0.01    7.30±0.20     4.08±0.09
            All-CNN                  1.4M     8.42±0.23    38.77±0.85    7.30±0.31
            WideResNet-28-10         36.5M    3.44±0.06    7.90±0.90     3.30±0.13
            ResNeXt-29 (8×64d)       34.4M    3.86±0.16    16.91±0.45    3.36±0.27
            PyramidNet-164 (α=28)    1.7M     3.91±0.15    6.82±0.46     3.55±0.08
            DenseNet-BC-100 (k=12)   0.8M     4.60±0.07    7.03±0.70     4.36±0.10

CIFAR-100   PreResNet-164            1.7M     22.68±0.27   24.90±0.47    21.01±0.54
            All-CNN                  1.4M     32.43±0.31   64.27±2.21    30.26±0.44
            WideResNet-28-10         36.5M    18.70±0.39   26.90±0.80    18.56±0.21
            ResNeXt-29 (8×64d)       34.4M    19.63±1.12   44.57±13.78   17.67±0.31
            PyramidNet-164 (α=28)    1.7M     19.65±0.24   25.28±0.30    18.58±0.13
            DenseNet-BC-100 (k=12)   0.8M     22.62±0.29   27.56±0.78    22.46±0.16

SVHN        PreResNet-164            1.7M     1.96±1.74    3.21±0.32     1.67±0.04
            All-CNN                  1.4M     1.94±0.06    4.64±0.60     1.80±0.04
            WideResNet-28-10         36.5M    1.81±0.03    3.11±0.17     1.45±0.03
            ResNeXt-29 (8×64d)       34.4M    1.81±0.02    5.64±0.61     1.58±0.03
            PyramidNet-164 (α=28)    1.7M     2.02±0.05    3.23±0.10     1.90±0.09
            DenseNet-BC-100 (k=12)   0.8M     1.97±0.12    2.74±0.10     1.78±0.07

All networks were optimized using SGD with a momentum of 0.9 and a weight decay of 5.0 × 10^-4. The models were trained for 16 and 30 epochs using mini-batches of size 96 on four GPUs for NUS-WIDE and MS-COCO, respectively. We adopted the weight initialization method introduced in [48]. For fine-tuning, the learning rate was initially set to 1.0 × 10^-3 and divided by 10 after the 8th and 12th epochs for NUS-WIDE and after the 15th and 22nd epochs for MS-COCO. We employed mean average precision (mAP) as well as macro/micro F-measure, precision, and recall for performance evaluation. If the confidence for a class was greater than 0.5, the label was predicted as positive.
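To make the evaluation protocol explicit, the sketch below computes the macro- and micro-averaged precision, recall, and F-measure from a matrix of per-class confidences using the 0.5 threshold. It assumes the common convention of deriving the macro F-measure from the macro-averaged precision and recall (the paper does not state which macro-F variant was used) and is not the authors' evaluation code.

```python
import numpy as np

def multilabel_scores(confidences, targets, threshold=0.5, eps=1e-12):
    """confidences, targets: arrays of shape (num_images, num_classes), targets in {0, 1}."""
    preds = (confidences > threshold).astype(np.int64)
    targets = targets.astype(np.int64)
    tp = (preds * targets).sum(axis=0).astype(np.float64)        # per-class true positives
    fp = (preds * (1 - targets)).sum(axis=0).astype(np.float64)  # per-class false positives
    fn = ((1 - preds) * targets).sum(axis=0).astype(np.float64)  # per-class false negatives

    # Macro: average precision/recall over classes, then take their harmonic mean.
    prec_c, rec_c = tp / (tp + fp + eps), tp / (tp + fn + eps)
    macro_p, macro_r = prec_c.mean(), rec_c.mean()
    macro_f = 2 * macro_p * macro_r / (macro_p + macro_r + eps)

    # Micro: pool all per-class counts before computing precision/recall.
    micro_p = tp.sum() / (tp.sum() + fp.sum() + eps)
    micro_r = tp.sum() / (tp.sum() + fn.sum() + eps)
    micro_f = 2 * micro_p * micro_r / (micro_p + micro_r + eps)

    return {"macro": (macro_f, macro_p, macro_r), "micro": (micro_f, micro_p, micro_r)}
```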

The results of applying dilated convolution and PGP to ResNet-50 and ResNet-101 are listed in Table 7. The networks with PGP achieved higher mAP than the others. In contrast, the networks with dilated convolutions achieved lower scores than the baseline, which suggests that the ImageNet pre-trained weights were not successfully transferred to ResNet with dilated convolutions because of the difference in feature resolution mentioned in Sec. 3.4. This problem can be avoided by maintaining the dilation rate in the layers that perform downsampling, as in dilated residual networks (DRN) [17]. This allows the CNN with dilated convolutions to perform convolution at the same resolution as the base CNN. The DRN achieved better mAP scores than the base CNN and nearly the same scores as the CNN with PGP. The CNN with PGP is, however, superior to the DRN in that its weights transfer directly to the base architecture.


Table 6. Classification error (%) on the ImageNet 2012 validation set. The error was evaluated with and without 10-crop testing for 224 × 224-pixel inputs. DConv: dilated convolution; PGP: parallel grid pooling.

Network      Params   Train    Test     1 crop              10 crops
                                        top-1    top-5      top-1    top-5

ResNet-50    25.6M    Base     Base     23.69    7.00       21.87    6.07
                      DConv    DConv    22.47    6.27       21.17    5.58
                      PGP      PGP      22.40    6.30       20.83    5.56
                      Base     PGP      23.32    6.85       21.62    5.93
                      DConv    Base     31.44    11.40      26.79    8.19
                      PGP      Base     23.01    6.66       21.22    5.74

ResNet-101   44.5M    Base     Base     22.49    6.38       20.85    5.50
                      DConv    DConv    21.26    5.61       20.03    5.02
                      PGP      PGP      21.34    5.65       19.81    5.00
                      Base     PGP      22.13    6.21       20.46    5.36
                      DConv    Base     25.63    8.01       22.40    6.05
                      PGP      Base     21.80    5.95       20.18    5.15

5 Conclusion

In this paper, we proposed PGP, which performs downsampling without discarding any intermediate features and can be regarded as a data augmentation technique. PGP is easily applicable to various CNN models without altering their learning strategies. Additionally, we demonstrated that PGP is implicitly used in dilated convolution, which suggests that dilated convolution is also a type of data augmentation technique. Experiments on CIFAR-10/100 and SVHN with six network models demonstrated that PGP outperformed both the base CNNs and the CNNs with dilated convolutions. PGP also obtained better results than the base CNN and the CNN with dilated convolutions on the ImageNet dataset for image classification and on the NUS-WIDE and MS-COCO datasets for multi-label image classification.


Table 7. Quantitative results on NUS-WIDE and MS-COCO. DConv: dilated convolution; DRN: dilated residual networks; PGP: parallel grid pooling.

Dataset     Network      Params   Arch    mAP    Macro-F  Macro-P  Macro-R  Micro-F  Micro-P  Micro-R

NUS-WIDE    ResNet-50    25.6M    Base    55.7   52.1     63.3     47.0     70.2     75.3     65.8
                                  DConv   55.3   51.7     63.3     46.6     70.0     75.7     65.2
                                  DRN     55.9   52.2     43.8     47.0     70.1     75.8     65.3
                                  PGP     56.1   51.9     64.4     46.2     70.0     76.4     64.6
            ResNet-101   44.5M    Base    56.5   53.0     63.6     48.3     70.4     75.4     66.1
                                  DConv   56.3   53.4     62.7     49.2     70.5     74.9     66.5
                                  DRN     56.8   53.1     63.1     48.9     70.5     75.1     66.5
                                  PGP     57.2   53.5     64.6     48.7     70.6     75.7     66.2

MS-COCO     ResNet-50    25.6M    Base    73.4   67.5     78.2     60.9     72.8     82.2     65.4
                                  DConv   72.7   66.6     77.9     59.7     72.2     81.9     64.5
                                  DRN     74.2   68.3     78.8     61.8     73.4     82.5     66.1
                                  PGP     74.2   67.9     80.0     60.5     73.1     83.7     64.9
            ResNet-101   44.5M    Base    74.6   68.9     77.6     62.8     73.8     82.2     66.9
                                  DConv   74.0   68.3     78.0     62.0     73.2     81.6     66.4
                                  DRN     75.3   69.7     78.2     63.8     74.3     81.8     68.1
                                  PGP     75.5   69.4     79.8     62.9     74.2     83.1     67.0


References

1. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
2. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
3. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
4. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: ECCV (2016)
5. Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. In: CVPR (2017)
6. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: CVPR (2017)
7. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI (2017)
8. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
9. Gross, S., Wilber, M.: Training and investigating residual nets. Facebook AI Research, CA. [Online]: http://torch.ch/blog/2016/02/04/resnets.html (2016)
10. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)
11. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
12. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: ECCV (2016)
13. Zeiler, M.D., Fergus, R.: Stochastic pooling for regularization of deep convolutional neural networks. In: ICLR (2013)
14. Graham, B.: Fractional max-pooling. arXiv preprint arXiv:1412.6071 (2014)
15. Lee, C.Y., Gallagher, P.W., Tu, Z.: Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In: AISTATS (2016)
16. Zhai, S., Wu, H., Kumar, A., Cheng, Y., Lu, Y., Zhang, Z., Feris, R.: S3Pool: Pooling with stochastic spatial sampling. In: CVPR (2017)
17. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
18. Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
19. Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A.v.d., Graves, A., Kavukcuoglu, K.: Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 (2016)
20. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)
21. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611 (2018)
22. Yang, Z., Hu, Z., Salakhutdinov, R., Berg-Kirkpatrick, T.: Improved variational autoencoders for text modeling using dilated convolutions. In: ICML (2017)
23. Hinton, G.E.: Learning multiple layers of representation. Trends in Cognitive Sciences 11(10) (2007) 428–434
24. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Machine Learning Systems, Volume 2011 (2011)
25. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3) (2015) 211–252
26. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
27. Ratner, A.J., Ehrenberg, H., Hussain, Z., Dunnmon, J., Ré, C.: Learning to compose domain-specific transformations for data augmentation. In: NIPS (2017)
28. Hauberg, S., Freifeld, O., Larsen, A.B.L., Fisher, J., Hansen, L.: Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In: AISTATS (2016)
29. Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. arXiv preprint arXiv:1708.04896 (2017)
30. DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
31. Yamada, Y., Iwamura, M., Kise, K.: ShakeDrop regularization. arXiv preprint arXiv:1802.02375 (2018)
32. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 (2018)
33. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: ICLR (2018)
34. Yuji, T., Yoshitaka, U., Tatsuya, H.: Between-class learning for image classification. In: CVPR (2018)
35. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. JAIR 16 (2002) 321–357
36. DeVries, T., Taylor, G.W.: Dataset augmentation in feature space. In: ICLR Workshop (2017)
37. Fawzi, A., Samulowitz, H., Turaga, D., Frossard, P.: Adaptive data augmentation for image classification. In: ICIP (2016)
38. Wang, X., Shrivastava, A., Gupta, A.: A-Fast-RCNN: Hard positive generation via adversary for object detection. In: CVPR (2017)
39. Lemley, J., Bazrafkan, S., Corcoran, P.: Smart augmentation - learning an optimal data augmentation strategy. IEEE Access (2017)
40. Dixit, M., Kwitt, R., Niethammer, M., Vasconcelos, N.: AGA: Attribute-guided augmentation. In: CVPR (2017)
41. Liu, B., Dixit, M., Kwitt, R., Vasconcelos, N.: Feature space transfer for data augmentation. In: CVPR (2018)
42. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
43. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. In: ICLR Workshop (2015)
44. Han, D., Kim, J., Kim, J.: Deep pyramidal residual networks. In: CVPR (2017)
45. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with restarts. In: ICLR (2017)
46. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J.E., Weinberger, K.Q.: Snapshot ensembles: Train 1, get M for free. In: ICLR (2017)
47. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR (2018)
48. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV (2015)
49. Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: NIPS Workshop (2015)
50. Niitani, Y., Ogawa, T., Saito, S., Saito, M.: ChainerCV: a library for deep learning in computer vision. In: ACMMM (2017)
51. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In: NIPS (2017)
52. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: NUS-WIDE: a real-world web image database from National University of Singapore. In: ACM-CIVR (2009)
53. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
54. Zhu, F., Li, H., Ouyang, W., Yu, N., Wang, X.: Learning spatial regularization with image-level supervisions for multi-label image classification. In: CVPR (2017)

  • Parallel Grid Pooling for Data Augmentation
Page 6: arXiv:1803.11370v1 [cs.CV] 30 Mar 2018 · Parallel Grid Pooling for Data Augmentation Akito Takeki, Daiki Ikami, Go Iriey, and Kiyoharu Aizawa {takeki,iakmi,aizawa}@hal.t.u-tokyo.ac.jp,

6

Conv

Conv

Conv

Conv

4 times (119887 119888 ℎ2 1199082)

(119887 119888 ℎ 119908) (119887 119888 ℎ 119908)

=

Dilated

Convolution

(119903 = 2) (119887 119888 ℎ 119908) (119887 119888 ℎ 119908) (119887 119888 ℎ 119908)

PGP PGPminus120783

Fig 3 PGP is implicitly used by dilated convolution

x = (b c h w) where b c h and w are the number of batches number of chan-nels height and width respectively Using PGP the input feature x is dividedinto r2times (b c hr wr) subsequently a convolutional filter (with shared weightparameters) is applied to each branch in parallel (Fig 3) Then the intermedi-ate r2 feature maps are rearranged by PGP resulting the same size (b c h w)as the input by PGPminus1 In short PGP is embedded in a dilated convolutionwhich suggests that the success of dilated convolution can be attributed to dataaugmentation in the feature space

33 Architectural Differences between dilated convolution and PGP

The structure of the network model with PGP is better for learning than thatwith dilated convolutions For example consider the cases where two downsam-pling layers in a base CNN (Base-CNN) are replaced by either dilated convolution(Fig 4(b)) or PGP (Fig 4(d)) According to [7] the conventional multi-stridedownsampling layer and all subsequent convolution layers can be replaced withdilated convolution layers with two times dilation rate (Fig 4(b)) As mentionedin Section 32 a dilated convolution is equivalent to (PGP + Convolution +PGPminus1) When n r-dilated convolutions are stacked they are expressed by(PGP + Convolution times n + PGPminus1) This means that each branch in a CNNwith dilated convolution (DConv-CNN) is independent from the others

Let us consider the case of Fig 4(b) that dilated convolutions with two dif-ferent dilation rates (eg r1 = 2 and r2 = 4) are stacked Using the fact thata sequence of n 4-dilated convolutions is (PGP2 + Convolution timesn + PGPminus2)this architecture can readily be transformed to the one given in Fig 4(c) whichis the equivalent form with PGP We can clearly see that each branch split bythe former dilation layer is split again by the latter and all the branches areeventually rearranged by PGPminus1 layers prior the global average pooling (GAP)

7

CNN Image GP CNN GP CNN GAP FC Loss

(a) Base CNN (Base-CNN)

CNN GAP FC Image Loss

Dilated

CNN

(119903 = 2)

Dilated

CNN

(119903 = 4)

(b) CNN with dilated convolution (DConv-CNN)

CNN

CNN

PGP CNN

CNN

CNN PGP

PGP

PGP

PGP

GAP FC Image PGP -1

Loss

PGP

PGP

PGP

PGP -1

-1

-1

-1

(c) DConv-CNN represented by PGP

Loss

Weight

sharing

CNN

CNN

PGP CNN

CNN

CNN PGP

PGP

PGP

PGP

Image

The same structure as

Base-CNN

(d) CNN with PGP (PGP-CNN)

Fig 4 Comparison between DConv-CNN and PGP-CNN As mentioned in Section 31Base-CNN is expressed as the architecture using GP (a) (b) is equivalent to (c)

layer In contrast CNN with PGP (PGP-CNN) is implemented as follows Theconvolution or pooling layer that decreases resolution is set to 1 stride of thatlayer after which PGP is inserted (Fig 4(d)) In DConv-CNN all the branchesare aggregated through PGPminus1 layers just before the GAP layer which can beregarded as a feature ensemble while PGP-CNN does not use any PGPminus1 layerthroughout the network so the intermediate features are not fused until the endof the network

Unlike feature ensemble (ie the case of DConv-CNN) this structure en-forces all the branches to learn correct classification results As mentioned aboveDConv-CNN averages an ensemble of intermediate feature The likelihood ofeach class is then predicted based on the average Even if some branches containuseful features (eg for image classification) while the other branches do notacquire the CNN attempts to determine the correct class based on the averagewhich prevents the useless branches from further learning In contrast thanks tothe independency of branches PGP-CNN can perform additional learning withsuperior regularization

8

34 Weight transfer using PGP

Compared to DConv-CNN PGP-CNN has a significant advantage in terms ofweight transfer Specifically the weights learned on PGP-CNN work well onBase-CNN structure when transferred We demonstrate later in the experimentsthat this can improve performance of the base architecture However DConv-CNN does not have the same structure as Base-CNN because of PGPminus1 Thereplacement of downsampling by dilated convolutions makes it difficult to trans-fer weights from the pre-trained base model The base CNN with multi-strideconvolutions (the top of Fig 5 (a)) is represented by single-stride convolutionsand grid pooling (GP) layers (the bottom of Fig 5 (a)) DConv-CNN (the topof Fig 5 (b)) is equivalent to the base CNN with PGP and PGPminus1 (the bottomof Fig 5 (b)) Focusing on the difference in the size of the input features of a3times 3 convolution the size is (b c1 h w) in Base-CNN whereas (b c1 h2 w2)in DConv-CNN This difference in resolution makes it impossible to directlytransfer weights from the pre-trained model Unlike DConv-CNN PGP-CNNmaintains the same structure as Base-CNN (Fig 5 (c)) which results in bettertransferability and better performance

This weight transferability of PGP has two advantages First applying PGPto Base-CNN with learned weights in test phase can improve recognition perfor-mance without retraining For example if one does not have sufficient computingresources to train PGP-CNN using PGP only in the test phase can still make fulluse of the features discarded by downsampling in the training phase to achievehigher accuracy Second the weights learned by PGP-CNN can work well inBase-CNN As an example on an embedded system specialized to a particularCNN one can not change calculations inside the CNN However the weightslearned by on PGP-CNN can result in superior performance even when used inBase-CNN at test time

4 Experimental Results

41 Image classification

Datasets We evaluated PGP on benchmark datasets CIFAR-10 CIFAR-100 [23]and SVHN [24] Both CIFAR-10 and CIFAR-100 consist of 32times32 color imageswith 50000 images for training and 10000 images for testing The CIFAR-10dataset has 10 classes each containing 6000 images There are 5000 trainingimages and 1000 testing images for each class The CIFAR-100 dataset has 100classes each containing 600 images There are 500 training images and 100 test-ing images for each class We adopted a standard data augmentation scheme(mirroringshifting) that is widely used for these two datasets [1211] For pre-processing we normalized the data using the channel means and standard devi-ations We used all 50000 training images for training and calculated the finaltesting error after training

The Street View House Numbers (SVHN) dataset consists of 32 times 32 colordigit image There are 73257 images in the training set 26032 images in the test

9

Conv 1x1 str 1x1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 2x2

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 2x2

Conv 1x1 str 1x1

GP

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 1x1

(119887 1198882 ℎ 119908)

(119887 1198882 ℎ2 1199082)

(119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

GP

(119887 1198883 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Dconv 3x3 str 1x1

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ 119908)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

PGPminus1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

PGP

Conv 3x3 str 1x1

4 times (119887 1198881 ℎ2 1199082)

4 times (119887 1198882 ℎ2 1199082)

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ 119908)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

PGP

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 1x1

(119887 1198882 ℎ 119908)

4 times (119887 1198882 ℎ2 1199082)

4 times (119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

PGP

(119887 1198883 ℎ 119908)

(a) Base-CNN (b) DConv-CNN (c) PGP-CNN

Fig 5 The structure performing downsampling inside Base-CNN DConv-CNN andPGP-CNN This downsampling structure (3 times 3 convolution with stride of 2) is usedin a derivative of ResNet [9] WideResNet [10] and ResNeXt [11]

set and 531131 additional images for training Following the method in [102]we used all the training data without any data augmentation

Implementation details We trained a pre-activation ResNet (PreResNet) [12]All-CNN [43] WideResNet [10] ResNeXt [11] PyramidNet [44] and DenseNet [2]from scratch We use a 164-layer network for PreResNet We used WRN-28-10ResNeXt-29 (8x64d) PyramidNet-164 (Bottleneck alpha=48) and DenseNet-BC-100 (k=12) in the same manner as the original papers

When comparing the base CNN (Base) the CNN with dilated convolutions(DConv) and the CNN with PGP (PGP) we utilized the same hyperparametervalues (eg learning rate strategy) to carefully monitor the performance improve-ments We adopted the cosine shape learning rate schedule [45] which smoothlyanneals the learning rate [4647] The learning rate was initially set to 02 andgradually reduced to 0002 Following [2] for CIFAR and SVHN we trainedthe networks using mini-batches of size 64 for 300 epochs and 40 epochs re-spectively All networks were optimized using stochastic gradient descent (SGD)with a momentum of 09 and weight decay of 10times10minus4 We adopted the weightinitialization method introduced in [48] We used the Chainer framework [49]and ChainerCV [50] for the implementation of all experiments

10

Table 1 Test errors () with various architectures on CIFAR-10 CIFAR-100 andSVHN DConv dilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 415plusmn012 377plusmn009All-CNN 14M 842plusmn023 868plusmn043 717plusmn032WideResNet-28-10 365M 344plusmn006 388plusmn021 313plusmn015ResNeXt-29 (8times64d) 344M 386plusmn016 387plusmn013 322plusmn034PyramidNet-164 (α=28) 17M 391plusmn015 372plusmn007 338plusmn009DenseNet-BC-100 (k=12) 08M 460plusmn007 435plusmn022 411plusmn015

CIFAR-100

PreResNet-164 17M 2268plusmn027 2071plusmn027 2018plusmn045All-CNN 14M 3243plusmn031 3310plusmn039 2979plusmn033WideResNet-28-10 365M 1870plusmn039 1969plusmn050 1820plusmn017ResNeXt-29 (8times64d) 344M 1963plusmn112 1790plusmn057 1718plusmn029PyramidNet-164 (α=28) 17M 1965plusmn024 1910plusmn032 1812plusmn016DenseNet-BC-100 (k=12) 08M 2262plusmn029 2220plusmn003 2169plusmn037

SVHN

PreResNet-164 17M 196plusmn174 174plusmn007 154plusmn002All-CNN 14M 194plusmn006 177plusmn007 175plusmn002WideResNet-28-10 365M 181plusmn003 153plusmn002 138plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 166plusmn003 152plusmn001PyramidNet-164 (α=28) 17M 202plusmn005 186plusmn006 179plusmn005DenseNet-BC-100 (k=12) 08M 197plusmn012 177plusmn006 167plusmn007

Results The results of applying DConv and PGP on CIFAR-10 CIFAR-100and SVHN with different networks are listed in Table 1 In all cases PGP out-performed Base and DConv

Comparison to other data augmentation methods We compare our methodto random flipping (RF) random cropping (RC) and random erasing (RE) inTable 2 PGP achieves better performance than any other data augmentationmethods when applied alone When applied in combination with other data aug-mentation methods PGP works in a complementary manner Combining RFRC RE and PGP yields an error rate of 1916 which is 1174 improvementover the baseline without any augmentation methods

Comparison to classical ensemble method We compare our method to theclassical ensemble method in Table 3 On CIFAR-10 and SVHN a single CNNwith the PGP model achieves better performance than an ensemble of threebasic CNN models Furthermore PGP works in a complementary manner whenapplied to the ensemble

Weight transfer The results of applying PGP as a test-time data augmentationmethod on CIFAR-10 CIFAR-100 and SVHN on different networks are listedin Table 4 PGP achieves better performance than the base model except for

11

Table 2 Test errors () with different data augmentation methods on CIFAR-100based on PreResNet-164 RF random flipping RC random cropping RE randomerasing PGP parallel grid pooling

Network RF RC RE wo PGP w PGP

PreResNet-164

3090plusmn032 2268plusmn027X 2640plusmn018 2070plusmn034

X 2394plusmn007 2234plusmn023X 2752plusmn022 2132plusmn021

X X 2268plusmn027 2018plusmn045X X 2439plusmn055 2001plusmn014

X X 2259plusmn009 2071plusmn002X X X 2148plusmn021 1916plusmn020

Table 3 Ensembled test errors () on CIFAR-10100 and SVHN based on PreResNet-164 ldquotimeskrdquo means an ensembling of k CNN models DConv dilated convolution PGPparallel grid pooling

Dataset Networktimes1 times3

Base DConv PGP Base DConv PGP

CIFAR-10PreResNet-164

471 415 377 398 337 314CIFAR-100 2268 2071 2018 1916 1735 1715SVHN 196 174 154 165 151 139

on the ALL-CNN and ResNeXt models The reason of the incompatibility is leftfor future work

The results of applying PGP as a training-time data augmentation techniqueon the CIFAR-10 CIFAR-100 and SVHN datasets with different networks arelisted in Table 5 The models trained with PGP which have the same structureto the base model at the test phase outperformed the base model and themodel with dilated convolutions The models trained with dilated convolutionsperformed worse compared to the baseline models

42 ImageNet classification

We further evaluated PGP using the ImageNet 2012 classification dataset [25]The ImageNet dataset is comprised of 128 million training images and 50000validation images from 1000 classes With the data augmentation scheme fortraining presented in [91151] input images of size 224 times 224 were randomlycropped from a resized image using scale and aspect ratio augmentation Wereimplemented a derivative of ResNet [9] for training According to the dilatednetwork strategy [720] we used dilated convolutions or PGP after the conv4stage All networks were optimized using SGD with a momentum of 09 andweight decay of 10times 10minus4 The models were trained over 90 epochs using mini-batches of size 256 on eight GPUs We adopted the weight initialization method

12

Table 4 Test errors () when applying PGP as a test-time data augmentation withvarious architectures on CIFAR-10 CIFAR-100 and SVHN PGP parallel grid pool-ing

Dataset Network Params Base PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 456plusmn014All-CNN 14M 842plusmn023 903plusmn030WideResNet-28-10 365M 344plusmn006 339plusmn002ResNeXt-29 (8times64d) 344M 386plusmn016 401plusmn021PyramidNet-164 (α=28) 17M 391plusmn015 382plusmn005DenseNet-BC-100 (k=12) 08M 460plusmn007 453plusmn012

CIFAR-100

PreResNet-164 17M 2268plusmn027 2219plusmn026All-CNN 14M 3243plusmn031 3324plusmn030WideResNet-28-10 365M 1870plusmn039 1860plusmn039ResNeXt-29 (8times64d) 344M 1963plusmn112 2002plusmn098PyramidNet-164 (α=28) 17M 1965plusmn024 1934plusmn028DenseNet-BC-100 (k=12) 08M 2262plusmn029 2233plusmn026

SVHN

PreResNet-164 17M 196plusmn007 183plusmn006All-CNN 14M 194plusmn006 386plusmn005WideResNet-28-10 365M 181plusmn003 177plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 247plusmn051PyramidNet-164 (α=28) 17M 202plusmn005 208plusmn008DenseNet-BC-100 (k=12) 08M 197plusmn012 189plusmn008

introduced in [48] We adopted a cosine-shaped learning rate schedule wherethe learning rate was initially set to 01 and gradually reduced to 10times 10minus4

The results when applying dilated convolution and PGP on ImageNet withResNet-50 and ResNet-101 are listed in Table 6 As can be seen in the resultsthe network with PGP outperformed the baseline performance PGP works in acomplementary manner when applied in combination with 10-crop testing andcan be used as a training-time or testing-time data augmentation technique

43 Multi-label classification

We further evaluated PGP for multi-label classification on the NUS-WIDE [52]and MS-COCO [53] datasets NUS-WIDE contains 269648 images with a totalof 5018 tags collected from Flickr These images are manually annotated with81 concepts including objects and scenes Following the method in [54] we used161789 images for training and 107859 images for testing MS-COCO contains82783 images for training and 40504 images for testing which are labeled as 80common objects The input images were resized to 256times256 randomly croppedand then resized again to 224times 224

We trained a derivative of ResNet [9] with pre-training using the ImageNetdataset We used the same pre-trained weights throughout all experiments forfair comparison All networks were optimized using SGD with a momentum of

13

Table 5 Test errors () when applying PGP as a training-time data augmentationtechnique with various architectures on CIFAR-10 CIFAR-100 and SVHN DConvdilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 730plusmn020 408plusmn009All-CNN 14M 842plusmn023 3877plusmn085 730plusmn031WideResNet-28-10 365M 344plusmn006 790plusmn090 330plusmn013ResNeXt-29 (8times64d) 344M 386plusmn016 1691plusmn045 336plusmn027PyramidNet-164 (α=28) 17M 391plusmn015 682plusmn046 355plusmn008DenseNet-BC-100 (k=12) 08M 460plusmn007 703plusmn070 436plusmn010

CIFAR-100

PreResNet-164 17M 2268plusmn027 2490plusmn047 2101plusmn054All-CNN 14M 3243plusmn031 6427plusmn221 3026plusmn044WideResNet-28-10 365M 1870plusmn039 2690plusmn080 1856plusmn021ResNeXt-29 (8times64d) 344M 1963plusmn112 4457plusmn1378 1767plusmn031PyramidNet-164 (α=28) 17M 1965plusmn024 2528plusmn030 1858plusmn013DenseNet-BC-100 (k=12) 08M 2262plusmn029 2756plusmn078 2246plusmn016

SVHN

PreResNet-164 17M 196plusmn174 321plusmn032 167plusmn004All-CNN 14M 194plusmn006 464plusmn060 180plusmn004WideResNet-28-10 365M 181plusmn003 311plusmn017 145plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 564plusmn061 158plusmn003PyramidNet-164 (α=28) 17M 202plusmn005 323plusmn010 190plusmn009DenseNet-BC-100 (k=12) 08M 197plusmn012 274plusmn010 178plusmn007

09 and weight decay of 50 times 10minus4 The models were trained for 16 and 30epochs using mini-batches of 96 on four GPUs for NUS-WIDE and MS-COCOrespectively We adopted the weight initialization method introduced in [48] Forfine tuning the learning rate was initially set to 10 times 10minus3 and divided by 10after the 8th and 12th epoch for NUS-WIDE and after the 15th and 22nd epochfor MS-COCO We employed mean average precision (mAP) macromicro F-measure precision and recall for performance evaluation If the confidences foreach class were greater than 05 the label were predicted as positive

The results when applying dilated convolution and PGP on ResNet-50 andResNet-101 are listed in Table 7 The networks with PGP achieved higher mAPcompared to the others In contrast the networks with dilated convolutionsachieved lower scores than the baseline which suggests that ImageNet pre-trained weights were not successfully transferred to ResNet with dilated convo-lutions due to the difference in resolution as we mentioned in Sec 34 This prob-lem can be avoided by maintaining the dilation rate in the layers that performdownsampling like the method used in a dilated residual networks (DRN) [17]The operation allows the CNN with dilated convolutions to perform convolu-tion with the same resolution as the base CNN The DRN achieved better mAPscores than the base CNN and nearly the same score as the CNN with PGPThe CNN with PGP was however superior to DRN in that the CNN with PGPhas the weight transferability

14

Table 6 Classification error () for the ImageNet 2012 validation set The error wasevaluated with and without 10-crop testing for 224times224-pixel inputs DConv dilatedconvolution PGP parallel grid pooling

Network Params Train Test1 crop 10 crops

top-1 top-5 top-1 top-5

ResNet-50 256M

Base Base 2369 700 2187 607DConv DConv 2247 627 2117 558PGP PGP 2240 630 2083 556Base PGP 2332 685 2162 593

DConv Base 3144 1140 2679 819PGP Base 2301 666 2122 574

ResNet-101 445M

Base Base 2249 638 2085 550DConv DConv 2126 561 2003 502PGP PGP 2134 565 1981 500Base PGP 2213 621 2046 536

DConv Base 2563 801 2240 605PGP Base 2180 595 2018 515

5 Conclusion

In this paper we proposed PGP which performs downsampling without dis-carding any intermediate feature and can be regarded as a data augmentationtechnique PGP is easily applicable to various CNN models without altering theirlearning strategies Additionally we demonstrated that PGP is implicitly usedin dilated convolution which suggests that dilated convolution is also a type ofdata augmentation technique Experiments on CIFAR-10100 and SVHN withsix network models demonstrated that PGP outperformed the base CNN andthe CNN with dilated convolutions PGP also obtained better results than thebase CNN and the CNN with dilated convolutions on the ImageNet dataset forimage classification and the NUS-WIDE and MS-COCO datasets for multi-labelimage classification

15

Table 7 Quantitative results on NUS-WIDE and MS-COCO DConv dilated con-volution DRN dilated residual networks PGP parallel grid pooling

Dataset Network Params Arch mAPMacro Micro

F P R F P R

NUS-WIDE

ResNet-50 256M

Base 557 521 633 470 702 753 658DConv 553 517 633 466 700 757 652DRN 559 522 438 470 701 758 653PGP 561 519 644 462 700 764 646

ResNet-101 445M

Base 565 530 636 483 704 754 661DConv 563 534 627 492 705 749 665DRN 568 531 631 489 705 751 665PGP 572 535 646 487 706 757 662

MS-COCO

ResNet-50 256M

Base 734 675 782 609 728 822 654DConv 727 666 779 597 722 819 645DRN 742 683 788 618 734 825 661PGP 742 679 800 605 731 837 649

ResNet-101 445M

Base 746 689 776 628 738 822 669DConv 740 683 780 620 732 816 664DRN 753 697 782 638 743 818 681PGP 755 694 798 629 742 831 670

16

References

1 He K Zhang X Ren S Sun J Deep residual learning for image recognitionIn CVPR (2016)

2 Huang G Liu Z van der Maaten L Weinberger KQ Densely connectedconvolutional networks In CVPR (2017)

3 Hu J Shen L Sun G Squeeze-and-excitation networks arXiv preprintarXiv170901507 (2017)

4 Liu W Anguelov D Erhan D Szegedy C Reed S Fu CY Berg ACSsd Single shot multibox detector In ECCV (2016)

5 Redmon J Farhadi A YOLO9000 Better faster stronger In CVPR (2017)6 Lin TY Goyal P Girshick R He K Dollar P Focal loss for dense object

detection In CVPR (2017)7 Chen LC Papandreou G Kokkinos I Murphy K Yuille AL DeepLab

Semantic image segmentation with deep convolutional nets atrous convolutionand fully connected CRFs TPAMI (2017)

8 Zhao H Shi J Qi X Wang X Jia J Pyramid scene parsing network InCVPR (2017)

9 Gross S Wilber M Training and investigating residual nets Facebook AIResearch CA[Online] httptorchchblog20160204resnetshtml (2016)

10 Zagoruyko S Komodakis N Wide residual networks In BMVC (2016)11 Xie S Girshick R Dollar P Tu Z He K Aggregated residual transformations

for deep neural networks In CVPR (2017)12 He K Zhang X Ren S Sun J Identity mappings in deep residual networks

In ECCV (2016)13 Zeiler MD Fergus R Stochastic pooling for regularization of deep convolutional

neural networks In ICLR (2013)14 Graham B Fractional max-pooling arXiv preprint arXiv14126071 (2014)15 Lee CY Gallagher PW Tu Z Generalizing pooling functions in convolutional

neural networks Mixed gated and tree In AISTATS (2016)16 Zhai S Wu H Kumar A Cheng Y Lu Y Zhang Z Feris R S3Pool

Pooling with stochastic spatial sampling In CVPR (2017)17 Yu F Koltun V Funkhouser T Dilated residual networks In CVPR (2017)18 Oord Avd Dieleman S Zen H Simonyan K Vinyals O Graves A Kalch-

brenner N Senior A Kavukcuoglu K WaveNet A generative model for rawaudio arXiv preprint arXiv160903499 (2016)

19 Kalchbrenner N Espeholt L Simonyan K Oord Avd Graves AKavukcuoglu K Neural machine translation in linear time arXiv preprintarXiv161010099 (2016)

20 Yu F Koltun V Multi-scale context aggregation by dilated convolutions InICLR (2016)

21 Chen LC Zhu Y Papandreou G Schroff F Adam H Encoder-decoderwith atrous separable convolution for semantic image segmentation arXiv preprintarXiv180202611 (2018)

22 Yang Z Hu Z Salakhutdinov R Berg-Kirkpatrick T Improved variationalautoencoders for text modeling using dilated convolutions In ICML (2017)

23 Hinton GE Learning multiple layers of representation Trends in cognitivesciences 11(10) (2007) 428ndash434

24 Netzer Y Wang T Coates A Bissacco A Wu B Ng AY Reading digits innatural images with unsupervised feature learning In NIPS Workshop on MachineLearning Systemsrdquo Volume 2011 (2011)

17

25 Russakovsky O Deng J Su H Krause J Satheesh S Ma S Huang ZKarpathy A Khosla A Bernstein M et al ImageNet large scale visual recog-nition challenge IJCV 115(3) (2015) 211ndash252

26 Krizhevsky A Sutskever I Hinton GE ImageNet classification with deepconvolutional neural networks In NIPS (2012)

27 Ratner AJ Ehrenberg H Hussain Z Dunnmon J Re C Learning to com-pose domain-specific transformations for data augmentation In NIPS (2017)

28 Hauberg S Freifeld O Larsen ABL Fisher J Hansen L Dreaming moredata Class-dependent distributions over diffeomorphisms for learned data aug-mentation In AISTATS (2016)

29 Zhong Z Zheng L Kang G Li S Yang Y Random erasing data augmenta-tion arXiv preprint arXiv170804896 (2017)

30 DeVries T Taylor GW Improved regularization of convolutional neural net-works with cutout arXiv preprint arXiv170804552 (2017)

31 Yamada Y Iwamura M Kise K Shakedrop regularization arXiv preprintarXiv180202375 (2018)

32 Real E Aggarwal A Huang Y Le QV Regularized evolution for imageclassifier architecture search arXiv preprint arXiv180201548 (2018)

33 Zhang H Cisse M Dauphin YN Lopez-Paz D mixup Beyond empirical riskminimization In ICLR (2018)

34 Yuji T Yoshitaka U Tatsuya H Between-class learning for image classificationIn CVPR (2018)

35 Chawla NV Bowyer KW Hall LO Kegelmeyer WP SMOTE syntheticminority over-sampling technique JAIR 16 (2002) 321ndash357

36 DeVries T Taylor GW Dataset augmentation in feature space ICLR workshop(2017)

37 Fawzi A Samulowitz H Turaga D Frossard P Adaptive data augmentationfor image classification In ICIP (2016)

38 Wang X Shrivastava A Gupta A A-fast-rcnn Hard positive generation viaadversary for object detection In CVPR (2017)

39 Lemley J Bazrafkan S Corcoran P Smart augmentation-learning an optimaldata augmentation strategy IEEE Access (2017)

40 Dixit M Kwitt R Niethammer M Vasconcelos N AGA Attribute-guidedaugmentation In CVPR (2017)

41 Liu B Dixit M Kwitt R Vasconcelos N Feature space transfer for dataaugmentation In CVPR (2018)

42 Yu F Koltun V Funkhouser T Dilated residual networks In CVPR (2017)43 Springenberg JT Dosovitskiy A Brox T Riedmiller M Striving for simplic-

ity The all convolutional net In ICLR Workshop (2015)44 Han D Kim J Kim J Deep pyramidal residual networks In CVPR (2017)45 Loshchilov I Hutter F SGDR stochastic gradient descent with restarts In

ICLR (2017)46 Huang G Li Y Pleiss G Liu Z Hopcroft JE Weinberger KQ Snapshot

ensembles Train 1 get m for free In ICLR (2017)47 Zoph B Vasudevan V Shlens J Le QV Learning transferable architectures

for scalable image recognition In CVPR (2018)48 He K Zhang X Ren S Sun J Delving deep into rectifiers Surpassing human-

level performance on ImageNet classification In ICCV (2015)49 Tokui S Oono K Hido S Clayton J Chainer a next-generation open source

framework for deep learning In NIPS Workshop (2015)

18

50 Niitani Y Ogawa T Saito S Saito M ChainerCV a library for deep learningin computer vision In ACMMM (2017)

51 Chen Y Li J Xiao H Jin X Yan S Feng J Dual path networks In NIPS(2017)

52 Chua TS Tang J Hong R Li H Luo Z Zheng Y NUS-WIDE a real-world web image database from national university of singapore In ACM-CIVR(2009)

53 Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar PZitnick CL Microsoft COCO Common objects in context In ECCV (2014)

54 Zhu F Li H Ouyang W Yu N Wang X Learning spatial regularization withimage-level supervisions for multi-label image classification In CVPR (2017)

  • Parallel Grid Pooling for Data Augmentation
Page 7: arXiv:1803.11370v1 [cs.CV] 30 Mar 2018 · Parallel Grid Pooling for Data Augmentation Akito Takeki, Daiki Ikami, Go Iriey, and Kiyoharu Aizawa {takeki,iakmi,aizawa}@hal.t.u-tokyo.ac.jp,

7

CNN Image GP CNN GP CNN GAP FC Loss

(a) Base CNN (Base-CNN)

CNN GAP FC Image Loss

Dilated

CNN

(119903 = 2)

Dilated

CNN

(119903 = 4)

(b) CNN with dilated convolution (DConv-CNN)

CNN

CNN

PGP CNN

CNN

CNN PGP

PGP

PGP

PGP

GAP FC Image PGP -1

Loss

PGP

PGP

PGP

PGP -1

-1

-1

-1

(c) DConv-CNN represented by PGP

Loss

Weight

sharing

CNN

CNN

PGP CNN

CNN

CNN PGP

PGP

PGP

PGP

Image

The same structure as

Base-CNN

(d) CNN with PGP (PGP-CNN)

Fig 4 Comparison between DConv-CNN and PGP-CNN As mentioned in Section 31Base-CNN is expressed as the architecture using GP (a) (b) is equivalent to (c)

layer In contrast CNN with PGP (PGP-CNN) is implemented as follows Theconvolution or pooling layer that decreases resolution is set to 1 stride of thatlayer after which PGP is inserted (Fig 4(d)) In DConv-CNN all the branchesare aggregated through PGPminus1 layers just before the GAP layer which can beregarded as a feature ensemble while PGP-CNN does not use any PGPminus1 layerthroughout the network so the intermediate features are not fused until the endof the network

Unlike feature ensemble (ie the case of DConv-CNN) this structure en-forces all the branches to learn correct classification results As mentioned aboveDConv-CNN averages an ensemble of intermediate feature The likelihood ofeach class is then predicted based on the average Even if some branches containuseful features (eg for image classification) while the other branches do notacquire the CNN attempts to determine the correct class based on the averagewhich prevents the useless branches from further learning In contrast thanks tothe independency of branches PGP-CNN can perform additional learning withsuperior regularization

8

34 Weight transfer using PGP

Compared to DConv-CNN PGP-CNN has a significant advantage in terms ofweight transfer Specifically the weights learned on PGP-CNN work well onBase-CNN structure when transferred We demonstrate later in the experimentsthat this can improve performance of the base architecture However DConv-CNN does not have the same structure as Base-CNN because of PGPminus1 Thereplacement of downsampling by dilated convolutions makes it difficult to trans-fer weights from the pre-trained base model The base CNN with multi-strideconvolutions (the top of Fig 5 (a)) is represented by single-stride convolutionsand grid pooling (GP) layers (the bottom of Fig 5 (a)) DConv-CNN (the topof Fig 5 (b)) is equivalent to the base CNN with PGP and PGPminus1 (the bottomof Fig 5 (b)) Focusing on the difference in the size of the input features of a3times 3 convolution the size is (b c1 h w) in Base-CNN whereas (b c1 h2 w2)in DConv-CNN This difference in resolution makes it impossible to directlytransfer weights from the pre-trained model Unlike DConv-CNN PGP-CNNmaintains the same structure as Base-CNN (Fig 5 (c)) which results in bettertransferability and better performance

This weight transferability of PGP has two advantages First applying PGPto Base-CNN with learned weights in test phase can improve recognition perfor-mance without retraining For example if one does not have sufficient computingresources to train PGP-CNN using PGP only in the test phase can still make fulluse of the features discarded by downsampling in the training phase to achievehigher accuracy Second the weights learned by PGP-CNN can work well inBase-CNN As an example on an embedded system specialized to a particularCNN one can not change calculations inside the CNN However the weightslearned by on PGP-CNN can result in superior performance even when used inBase-CNN at test time

4 Experimental Results

41 Image classification

Datasets We evaluated PGP on benchmark datasets CIFAR-10 CIFAR-100 [23]and SVHN [24] Both CIFAR-10 and CIFAR-100 consist of 32times32 color imageswith 50000 images for training and 10000 images for testing The CIFAR-10dataset has 10 classes each containing 6000 images There are 5000 trainingimages and 1000 testing images for each class The CIFAR-100 dataset has 100classes each containing 600 images There are 500 training images and 100 test-ing images for each class We adopted a standard data augmentation scheme(mirroringshifting) that is widely used for these two datasets [1211] For pre-processing we normalized the data using the channel means and standard devi-ations We used all 50000 training images for training and calculated the finaltesting error after training

The Street View House Numbers (SVHN) dataset consists of 32 times 32 colordigit image There are 73257 images in the training set 26032 images in the test

9

Conv 1x1 str 1x1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 2x2

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 2x2

Conv 1x1 str 1x1

GP

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 1x1

(119887 1198882 ℎ 119908)

(119887 1198882 ℎ2 1199082)

(119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

GP

(119887 1198883 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Dconv 3x3 str 1x1

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ 119908)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

PGPminus1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

PGP

Conv 3x3 str 1x1

4 times (119887 1198881 ℎ2 1199082)

4 times (119887 1198882 ℎ2 1199082)

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ 119908)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

PGP

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 1x1

(119887 1198882 ℎ 119908)

4 times (119887 1198882 ℎ2 1199082)

4 times (119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

PGP

(119887 1198883 ℎ 119908)

(a) Base-CNN (b) DConv-CNN (c) PGP-CNN

Fig 5 The structure performing downsampling inside Base-CNN DConv-CNN andPGP-CNN This downsampling structure (3 times 3 convolution with stride of 2) is usedin a derivative of ResNet [9] WideResNet [10] and ResNeXt [11]

set and 531131 additional images for training Following the method in [102]we used all the training data without any data augmentation

Implementation details We trained a pre-activation ResNet (PreResNet) [12]All-CNN [43] WideResNet [10] ResNeXt [11] PyramidNet [44] and DenseNet [2]from scratch We use a 164-layer network for PreResNet We used WRN-28-10ResNeXt-29 (8x64d) PyramidNet-164 (Bottleneck alpha=48) and DenseNet-BC-100 (k=12) in the same manner as the original papers

When comparing the base CNN (Base) the CNN with dilated convolutions(DConv) and the CNN with PGP (PGP) we utilized the same hyperparametervalues (eg learning rate strategy) to carefully monitor the performance improve-ments We adopted the cosine shape learning rate schedule [45] which smoothlyanneals the learning rate [4647] The learning rate was initially set to 02 andgradually reduced to 0002 Following [2] for CIFAR and SVHN we trainedthe networks using mini-batches of size 64 for 300 epochs and 40 epochs re-spectively All networks were optimized using stochastic gradient descent (SGD)with a momentum of 09 and weight decay of 10times10minus4 We adopted the weightinitialization method introduced in [48] We used the Chainer framework [49]and ChainerCV [50] for the implementation of all experiments

10

Table 1 Test errors () with various architectures on CIFAR-10 CIFAR-100 andSVHN DConv dilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 415plusmn012 377plusmn009All-CNN 14M 842plusmn023 868plusmn043 717plusmn032WideResNet-28-10 365M 344plusmn006 388plusmn021 313plusmn015ResNeXt-29 (8times64d) 344M 386plusmn016 387plusmn013 322plusmn034PyramidNet-164 (α=28) 17M 391plusmn015 372plusmn007 338plusmn009DenseNet-BC-100 (k=12) 08M 460plusmn007 435plusmn022 411plusmn015

CIFAR-100

PreResNet-164 17M 2268plusmn027 2071plusmn027 2018plusmn045All-CNN 14M 3243plusmn031 3310plusmn039 2979plusmn033WideResNet-28-10 365M 1870plusmn039 1969plusmn050 1820plusmn017ResNeXt-29 (8times64d) 344M 1963plusmn112 1790plusmn057 1718plusmn029PyramidNet-164 (α=28) 17M 1965plusmn024 1910plusmn032 1812plusmn016DenseNet-BC-100 (k=12) 08M 2262plusmn029 2220plusmn003 2169plusmn037

SVHN

PreResNet-164 17M 196plusmn174 174plusmn007 154plusmn002All-CNN 14M 194plusmn006 177plusmn007 175plusmn002WideResNet-28-10 365M 181plusmn003 153plusmn002 138plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 166plusmn003 152plusmn001PyramidNet-164 (α=28) 17M 202plusmn005 186plusmn006 179plusmn005DenseNet-BC-100 (k=12) 08M 197plusmn012 177plusmn006 167plusmn007

Results The results of applying DConv and PGP on CIFAR-10 CIFAR-100and SVHN with different networks are listed in Table 1 In all cases PGP out-performed Base and DConv

Comparison to other data augmentation methods We compare our methodto random flipping (RF) random cropping (RC) and random erasing (RE) inTable 2 PGP achieves better performance than any other data augmentationmethods when applied alone When applied in combination with other data aug-mentation methods PGP works in a complementary manner Combining RFRC RE and PGP yields an error rate of 1916 which is 1174 improvementover the baseline without any augmentation methods

Comparison to classical ensemble method We compare our method to theclassical ensemble method in Table 3 On CIFAR-10 and SVHN a single CNNwith the PGP model achieves better performance than an ensemble of threebasic CNN models Furthermore PGP works in a complementary manner whenapplied to the ensemble

Weight transfer The results of applying PGP as a test-time data augmentationmethod on CIFAR-10 CIFAR-100 and SVHN on different networks are listedin Table 4 PGP achieves better performance than the base model except for

11

Table 2 Test errors () with different data augmentation methods on CIFAR-100based on PreResNet-164 RF random flipping RC random cropping RE randomerasing PGP parallel grid pooling

Network RF RC RE wo PGP w PGP

PreResNet-164

3090plusmn032 2268plusmn027X 2640plusmn018 2070plusmn034

X 2394plusmn007 2234plusmn023X 2752plusmn022 2132plusmn021

X X 2268plusmn027 2018plusmn045X X 2439plusmn055 2001plusmn014

X X 2259plusmn009 2071plusmn002X X X 2148plusmn021 1916plusmn020

Table 3 Ensembled test errors () on CIFAR-10100 and SVHN based on PreResNet-164 ldquotimeskrdquo means an ensembling of k CNN models DConv dilated convolution PGPparallel grid pooling

Dataset Networktimes1 times3

Base DConv PGP Base DConv PGP

CIFAR-10PreResNet-164

471 415 377 398 337 314CIFAR-100 2268 2071 2018 1916 1735 1715SVHN 196 174 154 165 151 139

on the ALL-CNN and ResNeXt models The reason of the incompatibility is leftfor future work

The results of applying PGP as a training-time data augmentation techniqueon the CIFAR-10 CIFAR-100 and SVHN datasets with different networks arelisted in Table 5 The models trained with PGP which have the same structureto the base model at the test phase outperformed the base model and themodel with dilated convolutions The models trained with dilated convolutionsperformed worse compared to the baseline models

42 ImageNet classification

We further evaluated PGP using the ImageNet 2012 classification dataset [25]The ImageNet dataset is comprised of 128 million training images and 50000validation images from 1000 classes With the data augmentation scheme fortraining presented in [91151] input images of size 224 times 224 were randomlycropped from a resized image using scale and aspect ratio augmentation Wereimplemented a derivative of ResNet [9] for training According to the dilatednetwork strategy [720] we used dilated convolutions or PGP after the conv4stage All networks were optimized using SGD with a momentum of 09 andweight decay of 10times 10minus4 The models were trained over 90 epochs using mini-batches of size 256 on eight GPUs We adopted the weight initialization method

12

Table 4 Test errors () when applying PGP as a test-time data augmentation withvarious architectures on CIFAR-10 CIFAR-100 and SVHN PGP parallel grid pool-ing

Dataset Network Params Base PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 456plusmn014All-CNN 14M 842plusmn023 903plusmn030WideResNet-28-10 365M 344plusmn006 339plusmn002ResNeXt-29 (8times64d) 344M 386plusmn016 401plusmn021PyramidNet-164 (α=28) 17M 391plusmn015 382plusmn005DenseNet-BC-100 (k=12) 08M 460plusmn007 453plusmn012

CIFAR-100

PreResNet-164 17M 2268plusmn027 2219plusmn026All-CNN 14M 3243plusmn031 3324plusmn030WideResNet-28-10 365M 1870plusmn039 1860plusmn039ResNeXt-29 (8times64d) 344M 1963plusmn112 2002plusmn098PyramidNet-164 (α=28) 17M 1965plusmn024 1934plusmn028DenseNet-BC-100 (k=12) 08M 2262plusmn029 2233plusmn026

SVHN

PreResNet-164 17M 196plusmn007 183plusmn006All-CNN 14M 194plusmn006 386plusmn005WideResNet-28-10 365M 181plusmn003 177plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 247plusmn051PyramidNet-164 (α=28) 17M 202plusmn005 208plusmn008DenseNet-BC-100 (k=12) 08M 197plusmn012 189plusmn008

introduced in [48] We adopted a cosine-shaped learning rate schedule wherethe learning rate was initially set to 01 and gradually reduced to 10times 10minus4

The results when applying dilated convolution and PGP on ImageNet withResNet-50 and ResNet-101 are listed in Table 6 As can be seen in the resultsthe network with PGP outperformed the baseline performance PGP works in acomplementary manner when applied in combination with 10-crop testing andcan be used as a training-time or testing-time data augmentation technique

43 Multi-label classification

We further evaluated PGP for multi-label classification on the NUS-WIDE [52]and MS-COCO [53] datasets NUS-WIDE contains 269648 images with a totalof 5018 tags collected from Flickr These images are manually annotated with81 concepts including objects and scenes Following the method in [54] we used161789 images for training and 107859 images for testing MS-COCO contains82783 images for training and 40504 images for testing which are labeled as 80common objects The input images were resized to 256times256 randomly croppedand then resized again to 224times 224

We trained a derivative of ResNet [9] with pre-training using the ImageNetdataset We used the same pre-trained weights throughout all experiments forfair comparison All networks were optimized using SGD with a momentum of

13

Table 5 Test errors () when applying PGP as a training-time data augmentationtechnique with various architectures on CIFAR-10 CIFAR-100 and SVHN DConvdilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 730plusmn020 408plusmn009All-CNN 14M 842plusmn023 3877plusmn085 730plusmn031WideResNet-28-10 365M 344plusmn006 790plusmn090 330plusmn013ResNeXt-29 (8times64d) 344M 386plusmn016 1691plusmn045 336plusmn027PyramidNet-164 (α=28) 17M 391plusmn015 682plusmn046 355plusmn008DenseNet-BC-100 (k=12) 08M 460plusmn007 703plusmn070 436plusmn010

CIFAR-100

PreResNet-164 17M 2268plusmn027 2490plusmn047 2101plusmn054All-CNN 14M 3243plusmn031 6427plusmn221 3026plusmn044WideResNet-28-10 365M 1870plusmn039 2690plusmn080 1856plusmn021ResNeXt-29 (8times64d) 344M 1963plusmn112 4457plusmn1378 1767plusmn031PyramidNet-164 (α=28) 17M 1965plusmn024 2528plusmn030 1858plusmn013DenseNet-BC-100 (k=12) 08M 2262plusmn029 2756plusmn078 2246plusmn016

SVHN

PreResNet-164 17M 196plusmn174 321plusmn032 167plusmn004All-CNN 14M 194plusmn006 464plusmn060 180plusmn004WideResNet-28-10 365M 181plusmn003 311plusmn017 145plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 564plusmn061 158plusmn003PyramidNet-164 (α=28) 17M 202plusmn005 323plusmn010 190plusmn009DenseNet-BC-100 (k=12) 08M 197plusmn012 274plusmn010 178plusmn007

09 and weight decay of 50 times 10minus4 The models were trained for 16 and 30epochs using mini-batches of 96 on four GPUs for NUS-WIDE and MS-COCOrespectively We adopted the weight initialization method introduced in [48] Forfine tuning the learning rate was initially set to 10 times 10minus3 and divided by 10after the 8th and 12th epoch for NUS-WIDE and after the 15th and 22nd epochfor MS-COCO We employed mean average precision (mAP) macromicro F-measure precision and recall for performance evaluation If the confidences foreach class were greater than 05 the label were predicted as positive

The results when applying dilated convolution and PGP on ResNet-50 andResNet-101 are listed in Table 7 The networks with PGP achieved higher mAPcompared to the others In contrast the networks with dilated convolutionsachieved lower scores than the baseline which suggests that ImageNet pre-trained weights were not successfully transferred to ResNet with dilated convo-lutions due to the difference in resolution as we mentioned in Sec 34 This prob-lem can be avoided by maintaining the dilation rate in the layers that performdownsampling like the method used in a dilated residual networks (DRN) [17]The operation allows the CNN with dilated convolutions to perform convolu-tion with the same resolution as the base CNN The DRN achieved better mAPscores than the base CNN and nearly the same score as the CNN with PGPThe CNN with PGP was however superior to DRN in that the CNN with PGPhas the weight transferability

14

Table 6 Classification error () for the ImageNet 2012 validation set The error wasevaluated with and without 10-crop testing for 224times224-pixel inputs DConv dilatedconvolution PGP parallel grid pooling

Network Params Train Test1 crop 10 crops

top-1 top-5 top-1 top-5

ResNet-50 256M

Base Base 2369 700 2187 607DConv DConv 2247 627 2117 558PGP PGP 2240 630 2083 556Base PGP 2332 685 2162 593

DConv Base 3144 1140 2679 819PGP Base 2301 666 2122 574

ResNet-101 445M

Base Base 2249 638 2085 550DConv DConv 2126 561 2003 502PGP PGP 2134 565 1981 500Base PGP 2213 621 2046 536

DConv Base 2563 801 2240 605PGP Base 2180 595 2018 515

5 Conclusion

In this paper we proposed PGP which performs downsampling without dis-carding any intermediate feature and can be regarded as a data augmentationtechnique PGP is easily applicable to various CNN models without altering theirlearning strategies Additionally we demonstrated that PGP is implicitly usedin dilated convolution which suggests that dilated convolution is also a type ofdata augmentation technique Experiments on CIFAR-10100 and SVHN withsix network models demonstrated that PGP outperformed the base CNN andthe CNN with dilated convolutions PGP also obtained better results than thebase CNN and the CNN with dilated convolutions on the ImageNet dataset forimage classification and the NUS-WIDE and MS-COCO datasets for multi-labelimage classification


Table 7. Quantitative results on NUS-WIDE and MS-COCO. DConv: dilated convolution; DRN: dilated residual networks; PGP: parallel grid pooling.

Dataset     Network      Params   Arch    mAP     Macro                 Micro
                                                  F      P      R       F      P      R

NUS-WIDE    ResNet-50    25.6M    Base    55.7    52.1   63.3   47.0    70.2   75.3   65.8
                                  DConv   55.3    51.7   63.3   46.6    70.0   75.7   65.2
                                  DRN     55.9    52.2   43.8   47.0    70.1   75.8   65.3
                                  PGP     56.1    51.9   64.4   46.2    70.0   76.4   64.6
            ResNet-101   44.5M    Base    56.5    53.0   63.6   48.3    70.4   75.4   66.1
                                  DConv   56.3    53.4   62.7   49.2    70.5   74.9   66.5
                                  DRN     56.8    53.1   63.1   48.9    70.5   75.1   66.5
                                  PGP     57.2    53.5   64.6   48.7    70.6   75.7   66.2

MS-COCO     ResNet-50    25.6M    Base    73.4    67.5   78.2   60.9    72.8   82.2   65.4
                                  DConv   72.7    66.6   77.9   59.7    72.2   81.9   64.5
                                  DRN     74.2    68.3   78.8   61.8    73.4   82.5   66.1
                                  PGP     74.2    67.9   80.0   60.5    73.1   83.7   64.9
            ResNet-101   44.5M    Base    74.6    68.9   77.6   62.8    73.8   82.2   66.9
                                  DConv   74.0    68.3   78.0   62.0    73.2   81.6   66.4
                                  DRN     75.3    69.7   78.2   63.8    74.3   81.8   68.1
                                  PGP     75.5    69.4   79.8   62.9    74.2   83.1   67.0


References

1. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
2. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
3. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
4. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: ECCV (2016)
5. Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. In: CVPR (2017)
6. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: CVPR (2017)
7. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI (2017)
8. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
9. Gross, S., Wilber, M.: Training and investigating residual nets. Facebook AI Research, CA. [Online] http://torch.ch/blog/2016/02/04/resnets.html (2016)
10. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)
11. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
12. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: ECCV (2016)
13. Zeiler, M.D., Fergus, R.: Stochastic pooling for regularization of deep convolutional neural networks. In: ICLR (2013)
14. Graham, B.: Fractional max-pooling. arXiv preprint arXiv:1412.6071 (2014)
15. Lee, C.Y., Gallagher, P.W., Tu, Z.: Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In: AISTATS (2016)
16. Zhai, S., Wu, H., Kumar, A., Cheng, Y., Lu, Y., Zhang, Z., Feris, R.: S3Pool: Pooling with stochastic spatial sampling. In: CVPR (2017)
17. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
18. Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
19. Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A.v.d., Graves, A., Kavukcuoglu, K.: Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 (2016)
20. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)
21. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611 (2018)
22. Yang, Z., Hu, Z., Salakhutdinov, R., Berg-Kirkpatrick, T.: Improved variational autoencoders for text modeling using dilated convolutions. In: ICML (2017)
23. Hinton, G.E.: Learning multiple layers of representation. Trends in Cognitive Sciences 11(10) (2007) 428–434
24. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Machine Learning Systems, Volume 2011 (2011)
25. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3) (2015) 211–252
26. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
27. Ratner, A.J., Ehrenberg, H., Hussain, Z., Dunnmon, J., Ré, C.: Learning to compose domain-specific transformations for data augmentation. In: NIPS (2017)
28. Hauberg, S., Freifeld, O., Larsen, A.B.L., Fisher, J., Hansen, L.: Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In: AISTATS (2016)
29. Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. arXiv preprint arXiv:1708.04896 (2017)
30. DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
31. Yamada, Y., Iwamura, M., Kise, K.: ShakeDrop regularization. arXiv preprint arXiv:1802.02375 (2018)
32. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 (2018)
33. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: ICLR (2018)
34. Yuji, T., Yoshitaka, U., Tatsuya, H.: Between-class learning for image classification. In: CVPR (2018)
35. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. JAIR 16 (2002) 321–357
36. DeVries, T., Taylor, G.W.: Dataset augmentation in feature space. ICLR Workshop (2017)
37. Fawzi, A., Samulowitz, H., Turaga, D., Frossard, P.: Adaptive data augmentation for image classification. In: ICIP (2016)
38. Wang, X., Shrivastava, A., Gupta, A.: A-Fast-RCNN: Hard positive generation via adversary for object detection. In: CVPR (2017)
39. Lemley, J., Bazrafkan, S., Corcoran, P.: Smart augmentation: Learning an optimal data augmentation strategy. IEEE Access (2017)
40. Dixit, M., Kwitt, R., Niethammer, M., Vasconcelos, N.: AGA: Attribute-guided augmentation. In: CVPR (2017)
41. Liu, B., Dixit, M., Kwitt, R., Vasconcelos, N.: Feature space transfer for data augmentation. In: CVPR (2018)
42. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
43. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. In: ICLR Workshop (2015)
44. Han, D., Kim, J., Kim, J.: Deep pyramidal residual networks. In: CVPR (2017)
45. Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with restarts. In: ICLR (2017)
46. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J.E., Weinberger, K.Q.: Snapshot ensembles: Train 1, get M for free. In: ICLR (2017)
47. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR (2018)
48. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV (2015)
49. Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: A next-generation open source framework for deep learning. In: NIPS Workshop (2015)
50. Niitani, Y., Ogawa, T., Saito, S., Saito, M.: ChainerCV: A library for deep learning in computer vision. In: ACMMM (2017)
51. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In: NIPS (2017)
52. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: NUS-WIDE: A real-world web image database from National University of Singapore. In: ACM-CIVR (2009)
53. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
54. Zhu, F., Li, H., Ouyang, W., Yu, N., Wang, X.: Learning spatial regularization with image-level supervisions for multi-label image classification. In: CVPR (2017)

PreResNet-164 17M 2268plusmn027 2490plusmn047 2101plusmn054All-CNN 14M 3243plusmn031 6427plusmn221 3026plusmn044WideResNet-28-10 365M 1870plusmn039 2690plusmn080 1856plusmn021ResNeXt-29 (8times64d) 344M 1963plusmn112 4457plusmn1378 1767plusmn031PyramidNet-164 (α=28) 17M 1965plusmn024 2528plusmn030 1858plusmn013DenseNet-BC-100 (k=12) 08M 2262plusmn029 2756plusmn078 2246plusmn016

SVHN

PreResNet-164 17M 196plusmn174 321plusmn032 167plusmn004All-CNN 14M 194plusmn006 464plusmn060 180plusmn004WideResNet-28-10 365M 181plusmn003 311plusmn017 145plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 564plusmn061 158plusmn003PyramidNet-164 (α=28) 17M 202plusmn005 323plusmn010 190plusmn009DenseNet-BC-100 (k=12) 08M 197plusmn012 274plusmn010 178plusmn007

09 and weight decay of 50 times 10minus4 The models were trained for 16 and 30epochs using mini-batches of 96 on four GPUs for NUS-WIDE and MS-COCOrespectively We adopted the weight initialization method introduced in [48] Forfine tuning the learning rate was initially set to 10 times 10minus3 and divided by 10after the 8th and 12th epoch for NUS-WIDE and after the 15th and 22nd epochfor MS-COCO We employed mean average precision (mAP) macromicro F-measure precision and recall for performance evaluation If the confidences foreach class were greater than 05 the label were predicted as positive

The results when applying dilated convolution and PGP on ResNet-50 andResNet-101 are listed in Table 7 The networks with PGP achieved higher mAPcompared to the others In contrast the networks with dilated convolutionsachieved lower scores than the baseline which suggests that ImageNet pre-trained weights were not successfully transferred to ResNet with dilated convo-lutions due to the difference in resolution as we mentioned in Sec 34 This prob-lem can be avoided by maintaining the dilation rate in the layers that performdownsampling like the method used in a dilated residual networks (DRN) [17]The operation allows the CNN with dilated convolutions to perform convolu-tion with the same resolution as the base CNN The DRN achieved better mAPscores than the base CNN and nearly the same score as the CNN with PGPThe CNN with PGP was however superior to DRN in that the CNN with PGPhas the weight transferability

14

Table 6 Classification error () for the ImageNet 2012 validation set The error wasevaluated with and without 10-crop testing for 224times224-pixel inputs DConv dilatedconvolution PGP parallel grid pooling

Network Params Train Test1 crop 10 crops

top-1 top-5 top-1 top-5

ResNet-50 256M

Base Base 2369 700 2187 607DConv DConv 2247 627 2117 558PGP PGP 2240 630 2083 556Base PGP 2332 685 2162 593

DConv Base 3144 1140 2679 819PGP Base 2301 666 2122 574

ResNet-101 445M

Base Base 2249 638 2085 550DConv DConv 2126 561 2003 502PGP PGP 2134 565 1981 500Base PGP 2213 621 2046 536

DConv Base 2563 801 2240 605PGP Base 2180 595 2018 515

5 Conclusion

In this paper we proposed PGP which performs downsampling without dis-carding any intermediate feature and can be regarded as a data augmentationtechnique PGP is easily applicable to various CNN models without altering theirlearning strategies Additionally we demonstrated that PGP is implicitly usedin dilated convolution which suggests that dilated convolution is also a type ofdata augmentation technique Experiments on CIFAR-10100 and SVHN withsix network models demonstrated that PGP outperformed the base CNN andthe CNN with dilated convolutions PGP also obtained better results than thebase CNN and the CNN with dilated convolutions on the ImageNet dataset forimage classification and the NUS-WIDE and MS-COCO datasets for multi-labelimage classification

15

Table 7 Quantitative results on NUS-WIDE and MS-COCO DConv dilated con-volution DRN dilated residual networks PGP parallel grid pooling

Dataset Network Params Arch mAPMacro Micro

F P R F P R

NUS-WIDE

ResNet-50 256M

Base 557 521 633 470 702 753 658DConv 553 517 633 466 700 757 652DRN 559 522 438 470 701 758 653PGP 561 519 644 462 700 764 646

ResNet-101 445M

Base 565 530 636 483 704 754 661DConv 563 534 627 492 705 749 665DRN 568 531 631 489 705 751 665PGP 572 535 646 487 706 757 662

MS-COCO

ResNet-50 256M

Base 734 675 782 609 728 822 654DConv 727 666 779 597 722 819 645DRN 742 683 788 618 734 825 661PGP 742 679 800 605 731 837 649

ResNet-101 445M

Base 746 689 776 628 738 822 669DConv 740 683 780 620 732 816 664DRN 753 697 782 638 743 818 681PGP 755 694 798 629 742 831 670

16

References

1 He K Zhang X Ren S Sun J Deep residual learning for image recognitionIn CVPR (2016)

2 Huang G Liu Z van der Maaten L Weinberger KQ Densely connectedconvolutional networks In CVPR (2017)

3 Hu J Shen L Sun G Squeeze-and-excitation networks arXiv preprintarXiv170901507 (2017)

4 Liu W Anguelov D Erhan D Szegedy C Reed S Fu CY Berg ACSsd Single shot multibox detector In ECCV (2016)

5 Redmon J Farhadi A YOLO9000 Better faster stronger In CVPR (2017)6 Lin TY Goyal P Girshick R He K Dollar P Focal loss for dense object

detection In CVPR (2017)7 Chen LC Papandreou G Kokkinos I Murphy K Yuille AL DeepLab

Semantic image segmentation with deep convolutional nets atrous convolutionand fully connected CRFs TPAMI (2017)

8 Zhao H Shi J Qi X Wang X Jia J Pyramid scene parsing network InCVPR (2017)

9 Gross S Wilber M Training and investigating residual nets Facebook AIResearch CA[Online] httptorchchblog20160204resnetshtml (2016)

10 Zagoruyko S Komodakis N Wide residual networks In BMVC (2016)11 Xie S Girshick R Dollar P Tu Z He K Aggregated residual transformations

for deep neural networks In CVPR (2017)12 He K Zhang X Ren S Sun J Identity mappings in deep residual networks

In ECCV (2016)13 Zeiler MD Fergus R Stochastic pooling for regularization of deep convolutional

neural networks In ICLR (2013)14 Graham B Fractional max-pooling arXiv preprint arXiv14126071 (2014)15 Lee CY Gallagher PW Tu Z Generalizing pooling functions in convolutional

neural networks Mixed gated and tree In AISTATS (2016)16 Zhai S Wu H Kumar A Cheng Y Lu Y Zhang Z Feris R S3Pool

Pooling with stochastic spatial sampling In CVPR (2017)17 Yu F Koltun V Funkhouser T Dilated residual networks In CVPR (2017)18 Oord Avd Dieleman S Zen H Simonyan K Vinyals O Graves A Kalch-

brenner N Senior A Kavukcuoglu K WaveNet A generative model for rawaudio arXiv preprint arXiv160903499 (2016)

19 Kalchbrenner N Espeholt L Simonyan K Oord Avd Graves AKavukcuoglu K Neural machine translation in linear time arXiv preprintarXiv161010099 (2016)

20 Yu F Koltun V Multi-scale context aggregation by dilated convolutions InICLR (2016)

21 Chen LC Zhu Y Papandreou G Schroff F Adam H Encoder-decoderwith atrous separable convolution for semantic image segmentation arXiv preprintarXiv180202611 (2018)

22 Yang Z Hu Z Salakhutdinov R Berg-Kirkpatrick T Improved variationalautoencoders for text modeling using dilated convolutions In ICML (2017)

23 Hinton GE Learning multiple layers of representation Trends in cognitivesciences 11(10) (2007) 428ndash434

24 Netzer Y Wang T Coates A Bissacco A Wu B Ng AY Reading digits innatural images with unsupervised feature learning In NIPS Workshop on MachineLearning Systemsrdquo Volume 2011 (2011)

17

25 Russakovsky O Deng J Su H Krause J Satheesh S Ma S Huang ZKarpathy A Khosla A Bernstein M et al ImageNet large scale visual recog-nition challenge IJCV 115(3) (2015) 211ndash252

26 Krizhevsky A Sutskever I Hinton GE ImageNet classification with deepconvolutional neural networks In NIPS (2012)

27 Ratner AJ Ehrenberg H Hussain Z Dunnmon J Re C Learning to com-pose domain-specific transformations for data augmentation In NIPS (2017)

28 Hauberg S Freifeld O Larsen ABL Fisher J Hansen L Dreaming moredata Class-dependent distributions over diffeomorphisms for learned data aug-mentation In AISTATS (2016)

29 Zhong Z Zheng L Kang G Li S Yang Y Random erasing data augmenta-tion arXiv preprint arXiv170804896 (2017)

30 DeVries T Taylor GW Improved regularization of convolutional neural net-works with cutout arXiv preprint arXiv170804552 (2017)

31 Yamada Y Iwamura M Kise K Shakedrop regularization arXiv preprintarXiv180202375 (2018)

32 Real E Aggarwal A Huang Y Le QV Regularized evolution for imageclassifier architecture search arXiv preprint arXiv180201548 (2018)

33 Zhang H Cisse M Dauphin YN Lopez-Paz D mixup Beyond empirical riskminimization In ICLR (2018)

34 Yuji T Yoshitaka U Tatsuya H Between-class learning for image classificationIn CVPR (2018)

35 Chawla NV Bowyer KW Hall LO Kegelmeyer WP SMOTE syntheticminority over-sampling technique JAIR 16 (2002) 321ndash357

36 DeVries T Taylor GW Dataset augmentation in feature space ICLR workshop(2017)

37 Fawzi A Samulowitz H Turaga D Frossard P Adaptive data augmentationfor image classification In ICIP (2016)

38 Wang X Shrivastava A Gupta A A-fast-rcnn Hard positive generation viaadversary for object detection In CVPR (2017)

39 Lemley J Bazrafkan S Corcoran P Smart augmentation-learning an optimaldata augmentation strategy IEEE Access (2017)

40 Dixit M Kwitt R Niethammer M Vasconcelos N AGA Attribute-guidedaugmentation In CVPR (2017)

41 Liu B Dixit M Kwitt R Vasconcelos N Feature space transfer for dataaugmentation In CVPR (2018)

42 Yu F Koltun V Funkhouser T Dilated residual networks In CVPR (2017)43 Springenberg JT Dosovitskiy A Brox T Riedmiller M Striving for simplic-

ity The all convolutional net In ICLR Workshop (2015)44 Han D Kim J Kim J Deep pyramidal residual networks In CVPR (2017)45 Loshchilov I Hutter F SGDR stochastic gradient descent with restarts In

ICLR (2017)46 Huang G Li Y Pleiss G Liu Z Hopcroft JE Weinberger KQ Snapshot

ensembles Train 1 get m for free In ICLR (2017)47 Zoph B Vasudevan V Shlens J Le QV Learning transferable architectures

for scalable image recognition In CVPR (2018)48 He K Zhang X Ren S Sun J Delving deep into rectifiers Surpassing human-

level performance on ImageNet classification In ICCV (2015)49 Tokui S Oono K Hido S Clayton J Chainer a next-generation open source

framework for deep learning In NIPS Workshop (2015)

18

50 Niitani Y Ogawa T Saito S Saito M ChainerCV a library for deep learningin computer vision In ACMMM (2017)

51 Chen Y Li J Xiao H Jin X Yan S Feng J Dual path networks In NIPS(2017)

52 Chua TS Tang J Hong R Li H Luo Z Zheng Y NUS-WIDE a real-world web image database from national university of singapore In ACM-CIVR(2009)

53 Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar PZitnick CL Microsoft COCO Common objects in context In ECCV (2014)

54 Zhu F Li H Ouyang W Yu N Wang X Learning spatial regularization withimage-level supervisions for multi-label image classification In CVPR (2017)

  • Parallel Grid Pooling for Data Augmentation
Page 9: arXiv:1803.11370v1 [cs.CV] 30 Mar 2018 · Parallel Grid Pooling for Data Augmentation Akito Takeki, Daiki Ikami, Go Iriey, and Kiyoharu Aizawa {takeki,iakmi,aizawa}@hal.t.u-tokyo.ac.jp,

9

Conv 1x1 str 1x1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 2x2

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 2x2

Conv 1x1 str 1x1

GP

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 1x1

(119887 1198882 ℎ 119908)

(119887 1198882 ℎ2 1199082)

(119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

GP

(119887 1198883 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Dconv 3x3 str 1x1

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ 119908)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

PGPminus1

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

PGP

Conv 3x3 str 1x1

4 times (119887 1198881 ℎ2 1199082)

4 times (119887 1198882 ℎ2 1199082)

(119887 1198882 ℎ 119908)

(119887 1198883 ℎ 119908)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

Conv 1x1 str 1x1

PGP

Conv 1x1 str 1x1

(119887 1198881 ℎ 119908)

Conv 3x3 str 1x1

(119887 1198882 ℎ 119908)

4 times (119887 1198882 ℎ2 1199082)

4 times (119887 1198883 ℎ2 1199082)

(119887 1198880 ℎ 119908)

Conv 1x1 str 1x1

PGP

(119887 1198883 ℎ 119908)

(a) Base-CNN (b) DConv-CNN (c) PGP-CNN

Fig 5 The structure performing downsampling inside Base-CNN DConv-CNN andPGP-CNN This downsampling structure (3 times 3 convolution with stride of 2) is usedin a derivative of ResNet [9] WideResNet [10] and ResNeXt [11]

set and 531131 additional images for training Following the method in [102]we used all the training data without any data augmentation

Implementation details We trained a pre-activation ResNet (PreResNet) [12]All-CNN [43] WideResNet [10] ResNeXt [11] PyramidNet [44] and DenseNet [2]from scratch We use a 164-layer network for PreResNet We used WRN-28-10ResNeXt-29 (8x64d) PyramidNet-164 (Bottleneck alpha=48) and DenseNet-BC-100 (k=12) in the same manner as the original papers

When comparing the base CNN (Base) the CNN with dilated convolutions(DConv) and the CNN with PGP (PGP) we utilized the same hyperparametervalues (eg learning rate strategy) to carefully monitor the performance improve-ments We adopted the cosine shape learning rate schedule [45] which smoothlyanneals the learning rate [4647] The learning rate was initially set to 02 andgradually reduced to 0002 Following [2] for CIFAR and SVHN we trainedthe networks using mini-batches of size 64 for 300 epochs and 40 epochs re-spectively All networks were optimized using stochastic gradient descent (SGD)with a momentum of 09 and weight decay of 10times10minus4 We adopted the weightinitialization method introduced in [48] We used the Chainer framework [49]and ChainerCV [50] for the implementation of all experiments

10

Table 1 Test errors () with various architectures on CIFAR-10 CIFAR-100 andSVHN DConv dilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 415plusmn012 377plusmn009All-CNN 14M 842plusmn023 868plusmn043 717plusmn032WideResNet-28-10 365M 344plusmn006 388plusmn021 313plusmn015ResNeXt-29 (8times64d) 344M 386plusmn016 387plusmn013 322plusmn034PyramidNet-164 (α=28) 17M 391plusmn015 372plusmn007 338plusmn009DenseNet-BC-100 (k=12) 08M 460plusmn007 435plusmn022 411plusmn015

CIFAR-100

PreResNet-164 17M 2268plusmn027 2071plusmn027 2018plusmn045All-CNN 14M 3243plusmn031 3310plusmn039 2979plusmn033WideResNet-28-10 365M 1870plusmn039 1969plusmn050 1820plusmn017ResNeXt-29 (8times64d) 344M 1963plusmn112 1790plusmn057 1718plusmn029PyramidNet-164 (α=28) 17M 1965plusmn024 1910plusmn032 1812plusmn016DenseNet-BC-100 (k=12) 08M 2262plusmn029 2220plusmn003 2169plusmn037

SVHN

PreResNet-164 17M 196plusmn174 174plusmn007 154plusmn002All-CNN 14M 194plusmn006 177plusmn007 175plusmn002WideResNet-28-10 365M 181plusmn003 153plusmn002 138plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 166plusmn003 152plusmn001PyramidNet-164 (α=28) 17M 202plusmn005 186plusmn006 179plusmn005DenseNet-BC-100 (k=12) 08M 197plusmn012 177plusmn006 167plusmn007

Results The results of applying DConv and PGP on CIFAR-10 CIFAR-100and SVHN with different networks are listed in Table 1 In all cases PGP out-performed Base and DConv

Comparison to other data augmentation methods We compare our methodto random flipping (RF) random cropping (RC) and random erasing (RE) inTable 2 PGP achieves better performance than any other data augmentationmethods when applied alone When applied in combination with other data aug-mentation methods PGP works in a complementary manner Combining RFRC RE and PGP yields an error rate of 1916 which is 1174 improvementover the baseline without any augmentation methods

Comparison to classical ensemble method We compare our method to theclassical ensemble method in Table 3 On CIFAR-10 and SVHN a single CNNwith the PGP model achieves better performance than an ensemble of threebasic CNN models Furthermore PGP works in a complementary manner whenapplied to the ensemble

Weight transfer The results of applying PGP as a test-time data augmentationmethod on CIFAR-10 CIFAR-100 and SVHN on different networks are listedin Table 4 PGP achieves better performance than the base model except for

11

Table 2 Test errors () with different data augmentation methods on CIFAR-100based on PreResNet-164 RF random flipping RC random cropping RE randomerasing PGP parallel grid pooling

Network RF RC RE wo PGP w PGP

PreResNet-164

3090plusmn032 2268plusmn027X 2640plusmn018 2070plusmn034

X 2394plusmn007 2234plusmn023X 2752plusmn022 2132plusmn021

X X 2268plusmn027 2018plusmn045X X 2439plusmn055 2001plusmn014

X X 2259plusmn009 2071plusmn002X X X 2148plusmn021 1916plusmn020

Table 3 Ensembled test errors () on CIFAR-10100 and SVHN based on PreResNet-164 ldquotimeskrdquo means an ensembling of k CNN models DConv dilated convolution PGPparallel grid pooling

Dataset Networktimes1 times3

Base DConv PGP Base DConv PGP

CIFAR-10PreResNet-164

471 415 377 398 337 314CIFAR-100 2268 2071 2018 1916 1735 1715SVHN 196 174 154 165 151 139

on the ALL-CNN and ResNeXt models The reason of the incompatibility is leftfor future work

The results of applying PGP as a training-time data augmentation techniqueon the CIFAR-10 CIFAR-100 and SVHN datasets with different networks arelisted in Table 5 The models trained with PGP which have the same structureto the base model at the test phase outperformed the base model and themodel with dilated convolutions The models trained with dilated convolutionsperformed worse compared to the baseline models

42 ImageNet classification

We further evaluated PGP using the ImageNet 2012 classification dataset [25]The ImageNet dataset is comprised of 128 million training images and 50000validation images from 1000 classes With the data augmentation scheme fortraining presented in [91151] input images of size 224 times 224 were randomlycropped from a resized image using scale and aspect ratio augmentation Wereimplemented a derivative of ResNet [9] for training According to the dilatednetwork strategy [720] we used dilated convolutions or PGP after the conv4stage All networks were optimized using SGD with a momentum of 09 andweight decay of 10times 10minus4 The models were trained over 90 epochs using mini-batches of size 256 on eight GPUs We adopted the weight initialization method

12

Table 4 Test errors () when applying PGP as a test-time data augmentation withvarious architectures on CIFAR-10 CIFAR-100 and SVHN PGP parallel grid pool-ing

Dataset Network Params Base PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 456plusmn014All-CNN 14M 842plusmn023 903plusmn030WideResNet-28-10 365M 344plusmn006 339plusmn002ResNeXt-29 (8times64d) 344M 386plusmn016 401plusmn021PyramidNet-164 (α=28) 17M 391plusmn015 382plusmn005DenseNet-BC-100 (k=12) 08M 460plusmn007 453plusmn012

CIFAR-100

PreResNet-164 17M 2268plusmn027 2219plusmn026All-CNN 14M 3243plusmn031 3324plusmn030WideResNet-28-10 365M 1870plusmn039 1860plusmn039ResNeXt-29 (8times64d) 344M 1963plusmn112 2002plusmn098PyramidNet-164 (α=28) 17M 1965plusmn024 1934plusmn028DenseNet-BC-100 (k=12) 08M 2262plusmn029 2233plusmn026

SVHN

PreResNet-164 17M 196plusmn007 183plusmn006All-CNN 14M 194plusmn006 386plusmn005WideResNet-28-10 365M 181plusmn003 177plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 247plusmn051PyramidNet-164 (α=28) 17M 202plusmn005 208plusmn008DenseNet-BC-100 (k=12) 08M 197plusmn012 189plusmn008

introduced in [48] We adopted a cosine-shaped learning rate schedule wherethe learning rate was initially set to 01 and gradually reduced to 10times 10minus4

The results when applying dilated convolution and PGP on ImageNet withResNet-50 and ResNet-101 are listed in Table 6 As can be seen in the resultsthe network with PGP outperformed the baseline performance PGP works in acomplementary manner when applied in combination with 10-crop testing andcan be used as a training-time or testing-time data augmentation technique

43 Multi-label classification

We further evaluated PGP for multi-label classification on the NUS-WIDE [52]and MS-COCO [53] datasets NUS-WIDE contains 269648 images with a totalof 5018 tags collected from Flickr These images are manually annotated with81 concepts including objects and scenes Following the method in [54] we used161789 images for training and 107859 images for testing MS-COCO contains82783 images for training and 40504 images for testing which are labeled as 80common objects The input images were resized to 256times256 randomly croppedand then resized again to 224times 224

We trained a derivative of ResNet [9] with pre-training using the ImageNetdataset We used the same pre-trained weights throughout all experiments forfair comparison All networks were optimized using SGD with a momentum of

13

Table 5 Test errors () when applying PGP as a training-time data augmentationtechnique with various architectures on CIFAR-10 CIFAR-100 and SVHN DConvdilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 730plusmn020 408plusmn009All-CNN 14M 842plusmn023 3877plusmn085 730plusmn031WideResNet-28-10 365M 344plusmn006 790plusmn090 330plusmn013ResNeXt-29 (8times64d) 344M 386plusmn016 1691plusmn045 336plusmn027PyramidNet-164 (α=28) 17M 391plusmn015 682plusmn046 355plusmn008DenseNet-BC-100 (k=12) 08M 460plusmn007 703plusmn070 436plusmn010

CIFAR-100

PreResNet-164 17M 2268plusmn027 2490plusmn047 2101plusmn054All-CNN 14M 3243plusmn031 6427plusmn221 3026plusmn044WideResNet-28-10 365M 1870plusmn039 2690plusmn080 1856plusmn021ResNeXt-29 (8times64d) 344M 1963plusmn112 4457plusmn1378 1767plusmn031PyramidNet-164 (α=28) 17M 1965plusmn024 2528plusmn030 1858plusmn013DenseNet-BC-100 (k=12) 08M 2262plusmn029 2756plusmn078 2246plusmn016

SVHN

PreResNet-164 17M 196plusmn174 321plusmn032 167plusmn004All-CNN 14M 194plusmn006 464plusmn060 180plusmn004WideResNet-28-10 365M 181plusmn003 311plusmn017 145plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 564plusmn061 158plusmn003PyramidNet-164 (α=28) 17M 202plusmn005 323plusmn010 190plusmn009DenseNet-BC-100 (k=12) 08M 197plusmn012 274plusmn010 178plusmn007

09 and weight decay of 50 times 10minus4 The models were trained for 16 and 30epochs using mini-batches of 96 on four GPUs for NUS-WIDE and MS-COCOrespectively We adopted the weight initialization method introduced in [48] Forfine tuning the learning rate was initially set to 10 times 10minus3 and divided by 10after the 8th and 12th epoch for NUS-WIDE and after the 15th and 22nd epochfor MS-COCO We employed mean average precision (mAP) macromicro F-measure precision and recall for performance evaluation If the confidences foreach class were greater than 05 the label were predicted as positive

The results when applying dilated convolution and PGP on ResNet-50 andResNet-101 are listed in Table 7 The networks with PGP achieved higher mAPcompared to the others In contrast the networks with dilated convolutionsachieved lower scores than the baseline which suggests that ImageNet pre-trained weights were not successfully transferred to ResNet with dilated convo-lutions due to the difference in resolution as we mentioned in Sec 34 This prob-lem can be avoided by maintaining the dilation rate in the layers that performdownsampling like the method used in a dilated residual networks (DRN) [17]The operation allows the CNN with dilated convolutions to perform convolu-tion with the same resolution as the base CNN The DRN achieved better mAPscores than the base CNN and nearly the same score as the CNN with PGPThe CNN with PGP was however superior to DRN in that the CNN with PGPhas the weight transferability

14

Table 6 Classification error () for the ImageNet 2012 validation set The error wasevaluated with and without 10-crop testing for 224times224-pixel inputs DConv dilatedconvolution PGP parallel grid pooling

Network Params Train Test1 crop 10 crops

top-1 top-5 top-1 top-5

ResNet-50 256M

Base Base 2369 700 2187 607DConv DConv 2247 627 2117 558PGP PGP 2240 630 2083 556Base PGP 2332 685 2162 593

DConv Base 3144 1140 2679 819PGP Base 2301 666 2122 574

ResNet-101 445M

Base Base 2249 638 2085 550DConv DConv 2126 561 2003 502PGP PGP 2134 565 1981 500Base PGP 2213 621 2046 536

DConv Base 2563 801 2240 605PGP Base 2180 595 2018 515

5 Conclusion

In this paper we proposed PGP which performs downsampling without dis-carding any intermediate feature and can be regarded as a data augmentationtechnique PGP is easily applicable to various CNN models without altering theirlearning strategies Additionally we demonstrated that PGP is implicitly usedin dilated convolution which suggests that dilated convolution is also a type ofdata augmentation technique Experiments on CIFAR-10100 and SVHN withsix network models demonstrated that PGP outperformed the base CNN andthe CNN with dilated convolutions PGP also obtained better results than thebase CNN and the CNN with dilated convolutions on the ImageNet dataset forimage classification and the NUS-WIDE and MS-COCO datasets for multi-labelimage classification

15

Table 7 Quantitative results on NUS-WIDE and MS-COCO DConv dilated con-volution DRN dilated residual networks PGP parallel grid pooling

Dataset Network Params Arch mAPMacro Micro

F P R F P R

NUS-WIDE

ResNet-50 256M

Base 557 521 633 470 702 753 658DConv 553 517 633 466 700 757 652DRN 559 522 438 470 701 758 653PGP 561 519 644 462 700 764 646

ResNet-101 445M

Base 565 530 636 483 704 754 661DConv 563 534 627 492 705 749 665DRN 568 531 631 489 705 751 665PGP 572 535 646 487 706 757 662

MS-COCO

ResNet-50 256M

Base 734 675 782 609 728 822 654DConv 727 666 779 597 722 819 645DRN 742 683 788 618 734 825 661PGP 742 679 800 605 731 837 649

ResNet-101 445M

Base 746 689 776 628 738 822 669DConv 740 683 780 620 732 816 664DRN 753 697 782 638 743 818 681PGP 755 694 798 629 742 831 670

16

References

1 He K Zhang X Ren S Sun J Deep residual learning for image recognitionIn CVPR (2016)

2 Huang G Liu Z van der Maaten L Weinberger KQ Densely connectedconvolutional networks In CVPR (2017)

3 Hu J Shen L Sun G Squeeze-and-excitation networks arXiv preprintarXiv170901507 (2017)

4 Liu W Anguelov D Erhan D Szegedy C Reed S Fu CY Berg ACSsd Single shot multibox detector In ECCV (2016)

5 Redmon J Farhadi A YOLO9000 Better faster stronger In CVPR (2017)6 Lin TY Goyal P Girshick R He K Dollar P Focal loss for dense object

detection In CVPR (2017)7 Chen LC Papandreou G Kokkinos I Murphy K Yuille AL DeepLab

Semantic image segmentation with deep convolutional nets atrous convolutionand fully connected CRFs TPAMI (2017)

8 Zhao H Shi J Qi X Wang X Jia J Pyramid scene parsing network InCVPR (2017)

9 Gross S Wilber M Training and investigating residual nets Facebook AIResearch CA[Online] httptorchchblog20160204resnetshtml (2016)

10 Zagoruyko S Komodakis N Wide residual networks In BMVC (2016)11 Xie S Girshick R Dollar P Tu Z He K Aggregated residual transformations

for deep neural networks In CVPR (2017)12 He K Zhang X Ren S Sun J Identity mappings in deep residual networks

In ECCV (2016)13 Zeiler MD Fergus R Stochastic pooling for regularization of deep convolutional

neural networks In ICLR (2013)14 Graham B Fractional max-pooling arXiv preprint arXiv14126071 (2014)15 Lee CY Gallagher PW Tu Z Generalizing pooling functions in convolutional

neural networks Mixed gated and tree In AISTATS (2016)16 Zhai S Wu H Kumar A Cheng Y Lu Y Zhang Z Feris R S3Pool

Pooling with stochastic spatial sampling In CVPR (2017)17 Yu F Koltun V Funkhouser T Dilated residual networks In CVPR (2017)18 Oord Avd Dieleman S Zen H Simonyan K Vinyals O Graves A Kalch-

brenner N Senior A Kavukcuoglu K WaveNet A generative model for rawaudio arXiv preprint arXiv160903499 (2016)

19 Kalchbrenner N Espeholt L Simonyan K Oord Avd Graves AKavukcuoglu K Neural machine translation in linear time arXiv preprintarXiv161010099 (2016)

20 Yu F Koltun V Multi-scale context aggregation by dilated convolutions InICLR (2016)

21 Chen LC Zhu Y Papandreou G Schroff F Adam H Encoder-decoderwith atrous separable convolution for semantic image segmentation arXiv preprintarXiv180202611 (2018)

22 Yang Z Hu Z Salakhutdinov R Berg-Kirkpatrick T Improved variationalautoencoders for text modeling using dilated convolutions In ICML (2017)

23 Hinton GE Learning multiple layers of representation Trends in cognitivesciences 11(10) (2007) 428ndash434

24 Netzer Y Wang T Coates A Bissacco A Wu B Ng AY Reading digits innatural images with unsupervised feature learning In NIPS Workshop on MachineLearning Systemsrdquo Volume 2011 (2011)

17

25 Russakovsky O Deng J Su H Krause J Satheesh S Ma S Huang ZKarpathy A Khosla A Bernstein M et al ImageNet large scale visual recog-nition challenge IJCV 115(3) (2015) 211ndash252

26 Krizhevsky A Sutskever I Hinton GE ImageNet classification with deepconvolutional neural networks In NIPS (2012)

27 Ratner AJ Ehrenberg H Hussain Z Dunnmon J Re C Learning to com-pose domain-specific transformations for data augmentation In NIPS (2017)

28 Hauberg S Freifeld O Larsen ABL Fisher J Hansen L Dreaming moredata Class-dependent distributions over diffeomorphisms for learned data aug-mentation In AISTATS (2016)

29 Zhong Z Zheng L Kang G Li S Yang Y Random erasing data augmenta-tion arXiv preprint arXiv170804896 (2017)

30 DeVries T Taylor GW Improved regularization of convolutional neural net-works with cutout arXiv preprint arXiv170804552 (2017)

31 Yamada Y Iwamura M Kise K Shakedrop regularization arXiv preprintarXiv180202375 (2018)

32 Real E Aggarwal A Huang Y Le QV Regularized evolution for imageclassifier architecture search arXiv preprint arXiv180201548 (2018)

33 Zhang H Cisse M Dauphin YN Lopez-Paz D mixup Beyond empirical riskminimization In ICLR (2018)

34 Yuji T Yoshitaka U Tatsuya H Between-class learning for image classificationIn CVPR (2018)

35 Chawla NV Bowyer KW Hall LO Kegelmeyer WP SMOTE syntheticminority over-sampling technique JAIR 16 (2002) 321ndash357

36 DeVries T Taylor GW Dataset augmentation in feature space ICLR workshop(2017)

37 Fawzi A Samulowitz H Turaga D Frossard P Adaptive data augmentationfor image classification In ICIP (2016)

38 Wang X Shrivastava A Gupta A A-fast-rcnn Hard positive generation viaadversary for object detection In CVPR (2017)

39 Lemley J Bazrafkan S Corcoran P Smart augmentation-learning an optimaldata augmentation strategy IEEE Access (2017)

40 Dixit M Kwitt R Niethammer M Vasconcelos N AGA Attribute-guidedaugmentation In CVPR (2017)

41 Liu B Dixit M Kwitt R Vasconcelos N Feature space transfer for dataaugmentation In CVPR (2018)

42 Yu F Koltun V Funkhouser T Dilated residual networks In CVPR (2017)43 Springenberg JT Dosovitskiy A Brox T Riedmiller M Striving for simplic-

ity The all convolutional net In ICLR Workshop (2015)44 Han D Kim J Kim J Deep pyramidal residual networks In CVPR (2017)45 Loshchilov I Hutter F SGDR stochastic gradient descent with restarts In

ICLR (2017)46 Huang G Li Y Pleiss G Liu Z Hopcroft JE Weinberger KQ Snapshot

ensembles Train 1 get m for free In ICLR (2017)47 Zoph B Vasudevan V Shlens J Le QV Learning transferable architectures

for scalable image recognition In CVPR (2018)48 He K Zhang X Ren S Sun J Delving deep into rectifiers Surpassing human-

level performance on ImageNet classification In ICCV (2015)49 Tokui S Oono K Hido S Clayton J Chainer a next-generation open source

framework for deep learning In NIPS Workshop (2015)

18

50 Niitani Y Ogawa T Saito S Saito M ChainerCV a library for deep learningin computer vision In ACMMM (2017)

51 Chen Y Li J Xiao H Jin X Yan S Feng J Dual path networks In NIPS(2017)

52 Chua TS Tang J Hong R Li H Luo Z Zheng Y NUS-WIDE a real-world web image database from national university of singapore In ACM-CIVR(2009)

53 Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar PZitnick CL Microsoft COCO Common objects in context In ECCV (2014)

54 Zhu F Li H Ouyang W Yu N Wang X Learning spatial regularization withimage-level supervisions for multi-label image classification In CVPR (2017)

  • Parallel Grid Pooling for Data Augmentation
Page 10: arXiv:1803.11370v1 [cs.CV] 30 Mar 2018 · Parallel Grid Pooling for Data Augmentation Akito Takeki, Daiki Ikami, Go Iriey, and Kiyoharu Aizawa {takeki,iakmi,aizawa}@hal.t.u-tokyo.ac.jp,

10

Table 1 Test errors () with various architectures on CIFAR-10 CIFAR-100 andSVHN DConv dilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 415plusmn012 377plusmn009All-CNN 14M 842plusmn023 868plusmn043 717plusmn032WideResNet-28-10 365M 344plusmn006 388plusmn021 313plusmn015ResNeXt-29 (8times64d) 344M 386plusmn016 387plusmn013 322plusmn034PyramidNet-164 (α=28) 17M 391plusmn015 372plusmn007 338plusmn009DenseNet-BC-100 (k=12) 08M 460plusmn007 435plusmn022 411plusmn015

CIFAR-100

PreResNet-164 17M 2268plusmn027 2071plusmn027 2018plusmn045All-CNN 14M 3243plusmn031 3310plusmn039 2979plusmn033WideResNet-28-10 365M 1870plusmn039 1969plusmn050 1820plusmn017ResNeXt-29 (8times64d) 344M 1963plusmn112 1790plusmn057 1718plusmn029PyramidNet-164 (α=28) 17M 1965plusmn024 1910plusmn032 1812plusmn016DenseNet-BC-100 (k=12) 08M 2262plusmn029 2220plusmn003 2169plusmn037

SVHN

PreResNet-164 17M 196plusmn174 174plusmn007 154plusmn002All-CNN 14M 194plusmn006 177plusmn007 175plusmn002WideResNet-28-10 365M 181plusmn003 153plusmn002 138plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 166plusmn003 152plusmn001PyramidNet-164 (α=28) 17M 202plusmn005 186plusmn006 179plusmn005DenseNet-BC-100 (k=12) 08M 197plusmn012 177plusmn006 167plusmn007

Results The results of applying DConv and PGP on CIFAR-10 CIFAR-100and SVHN with different networks are listed in Table 1 In all cases PGP out-performed Base and DConv

Comparison to other data augmentation methods We compare our methodto random flipping (RF) random cropping (RC) and random erasing (RE) inTable 2 PGP achieves better performance than any other data augmentationmethods when applied alone When applied in combination with other data aug-mentation methods PGP works in a complementary manner Combining RFRC RE and PGP yields an error rate of 1916 which is 1174 improvementover the baseline without any augmentation methods

Comparison to classical ensemble method We compare our method to theclassical ensemble method in Table 3 On CIFAR-10 and SVHN a single CNNwith the PGP model achieves better performance than an ensemble of threebasic CNN models Furthermore PGP works in a complementary manner whenapplied to the ensemble

Weight transfer The results of applying PGP as a test-time data augmentationmethod on CIFAR-10 CIFAR-100 and SVHN on different networks are listedin Table 4 PGP achieves better performance than the base model except for

11

Table 2 Test errors () with different data augmentation methods on CIFAR-100based on PreResNet-164 RF random flipping RC random cropping RE randomerasing PGP parallel grid pooling

Network RF RC RE wo PGP w PGP

PreResNet-164

3090plusmn032 2268plusmn027X 2640plusmn018 2070plusmn034

X 2394plusmn007 2234plusmn023X 2752plusmn022 2132plusmn021

X X 2268plusmn027 2018plusmn045X X 2439plusmn055 2001plusmn014

X X 2259plusmn009 2071plusmn002X X X 2148plusmn021 1916plusmn020

Table 3 Ensembled test errors () on CIFAR-10100 and SVHN based on PreResNet-164 ldquotimeskrdquo means an ensembling of k CNN models DConv dilated convolution PGPparallel grid pooling

Dataset Networktimes1 times3

Base DConv PGP Base DConv PGP

CIFAR-10PreResNet-164

471 415 377 398 337 314CIFAR-100 2268 2071 2018 1916 1735 1715SVHN 196 174 154 165 151 139

on the ALL-CNN and ResNeXt models The reason of the incompatibility is leftfor future work

The results of applying PGP as a training-time data augmentation techniqueon the CIFAR-10 CIFAR-100 and SVHN datasets with different networks arelisted in Table 5 The models trained with PGP which have the same structureto the base model at the test phase outperformed the base model and themodel with dilated convolutions The models trained with dilated convolutionsperformed worse compared to the baseline models

42 ImageNet classification

We further evaluated PGP using the ImageNet 2012 classification dataset [25]The ImageNet dataset is comprised of 128 million training images and 50000validation images from 1000 classes With the data augmentation scheme fortraining presented in [91151] input images of size 224 times 224 were randomlycropped from a resized image using scale and aspect ratio augmentation Wereimplemented a derivative of ResNet [9] for training According to the dilatednetwork strategy [720] we used dilated convolutions or PGP after the conv4stage All networks were optimized using SGD with a momentum of 09 andweight decay of 10times 10minus4 The models were trained over 90 epochs using mini-batches of size 256 on eight GPUs We adopted the weight initialization method


Table 4. Test errors (%) when applying PGP as a test-time data augmentation with various architectures on CIFAR-10, CIFAR-100, and SVHN. PGP: parallel grid pooling.

Dataset     Network                  Params   Base          PGP
CIFAR-10    PreResNet-164            1.7M     4.71±0.01     4.56±0.14
            All-CNN                  1.4M     8.42±0.23     9.03±0.30
            WideResNet-28-10         36.5M    3.44±0.06     3.39±0.02
            ResNeXt-29 (8×64d)       34.4M    3.86±0.16     4.01±0.21
            PyramidNet-164 (α=28)    1.7M     3.91±0.15     3.82±0.05
            DenseNet-BC-100 (k=12)   0.8M     4.60±0.07     4.53±0.12
CIFAR-100   PreResNet-164            1.7M     22.68±0.27    22.19±0.26
            All-CNN                  1.4M     32.43±0.31    33.24±0.30
            WideResNet-28-10         36.5M    18.70±0.39    18.60±0.39
            ResNeXt-29 (8×64d)       34.4M    19.63±1.12    20.02±0.98
            PyramidNet-164 (α=28)    1.7M     19.65±0.24    19.34±0.28
            DenseNet-BC-100 (k=12)   0.8M     22.62±0.29    22.33±0.26
SVHN        PreResNet-164            1.7M     1.96±0.07     1.83±0.06
            All-CNN                  1.4M     1.94±0.06     3.86±0.05
            WideResNet-28-10         36.5M    1.81±0.03     1.77±0.03
            ResNeXt-29 (8×64d)       34.4M    1.81±0.02     2.47±0.51
            PyramidNet-164 (α=28)    1.7M     2.02±0.05     2.08±0.08
            DenseNet-BC-100 (k=12)   0.8M     1.97±0.12     1.89±0.08

The results of applying dilated convolution and PGP on ImageNet with ResNet-50 and ResNet-101 are listed in Table 6. As can be seen from the results, the network with PGP outperformed the baseline. PGP works in a complementary manner when applied in combination with 10-crop testing, and it can be used as either a training-time or a test-time data augmentation technique.

4.3 Multi-label classification

We further evaluated PGP for multi-label classification on the NUS-WIDE [52] and MS-COCO [53] datasets. NUS-WIDE contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated with 81 concepts, including objects and scenes. Following the protocol in [54], we used 161,789 images for training and 107,859 images for testing. MS-COCO contains 82,783 images for training and 40,504 images for testing, labeled with 80 common object categories. The input images were resized to 256 × 256, randomly cropped, and then resized again to 224 × 224.

We trained a derivative of ResNet [9] pre-trained on the ImageNet dataset. We used the same pre-trained weights throughout all experiments for a fair comparison. All networks were optimized using SGD with a momentum of 0.9 and a weight decay of 5.0 × 10⁻⁴. The models were trained for 16 epochs (NUS-WIDE) and 30 epochs (MS-COCO) using mini-batches of size 96 on four GPUs. We adopted the weight initialization method introduced in [48]. For fine-tuning, the learning rate was initially set to 1.0 × 10⁻³ and divided by 10 after the 8th and 12th epochs for NUS-WIDE and after the 15th and 22nd epochs for MS-COCO. We employed mean average precision (mAP) as well as macro/micro F-measure, precision, and recall for performance evaluation. If the confidence for a class was greater than 0.5, that label was predicted as positive.
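As a reference for this evaluation protocol, the sketch below computes the macro and micro precision, recall, and F-measure from thresholded confidences. It is a minimal NumPy sketch under the 0.5-threshold rule stated above; mAP is omitted, and the function name and the exact averaging conventions are our assumptions rather than details from the paper.

import numpy as np

def multilabel_prf(scores, labels, thresh=0.5):
    """Macro/micro precision, recall and F-measure for multi-label predictions.

    scores, labels: arrays of shape (num_images, num_classes); labels in {0, 1}.
    A label is predicted positive when its confidence exceeds `thresh`.
    """
    pred = (scores > thresh).astype(np.int64)
    tp = (pred * labels).sum(axis=0)
    fp = (pred * (1 - labels)).sum(axis=0)
    fn = ((1 - pred) * labels).sum(axis=0)

    # Macro: per-class precision/recall averaged over classes, then F-measure.
    p_cls = tp / np.maximum(tp + fp, 1)
    r_cls = tp / np.maximum(tp + fn, 1)
    macro_p, macro_r = p_cls.mean(), r_cls.mean()
    macro_f = 2 * macro_p * macro_r / max(macro_p + macro_r, 1e-12)

    # Micro: counts pooled over all classes before computing the ratios.
    micro_p = tp.sum() / max(tp.sum() + fp.sum(), 1)
    micro_r = tp.sum() / max(tp.sum() + fn.sum(), 1)
    micro_f = 2 * micro_p * micro_r / max(micro_p + micro_r, 1e-12)

    return {"macro": (macro_f, macro_p, macro_r), "micro": (micro_f, micro_p, micro_r)}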


Table 5. Test errors (%) when applying PGP as a training-time data augmentation technique with various architectures on CIFAR-10, CIFAR-100, and SVHN. DConv: dilated convolution, PGP: parallel grid pooling.

Dataset     Network                  Params   Base          DConv          PGP
CIFAR-10    PreResNet-164            1.7M     4.71±0.01     7.30±0.20      4.08±0.09
            All-CNN                  1.4M     8.42±0.23     38.77±0.85     7.30±0.31
            WideResNet-28-10         36.5M    3.44±0.06     7.90±0.90      3.30±0.13
            ResNeXt-29 (8×64d)       34.4M    3.86±0.16     16.91±0.45     3.36±0.27
            PyramidNet-164 (α=28)    1.7M     3.91±0.15     6.82±0.46      3.55±0.08
            DenseNet-BC-100 (k=12)   0.8M     4.60±0.07     7.03±0.70      4.36±0.10
CIFAR-100   PreResNet-164            1.7M     22.68±0.27    24.90±0.47     21.01±0.54
            All-CNN                  1.4M     32.43±0.31    64.27±2.21     30.26±0.44
            WideResNet-28-10         36.5M    18.70±0.39    26.90±0.80     18.56±0.21
            ResNeXt-29 (8×64d)       34.4M    19.63±1.12    44.57±13.78    17.67±0.31
            PyramidNet-164 (α=28)    1.7M     19.65±0.24    25.28±0.30     18.58±0.13
            DenseNet-BC-100 (k=12)   0.8M     22.62±0.29    27.56±0.78     22.46±0.16
SVHN        PreResNet-164            1.7M     1.96±1.74     3.21±0.32      1.67±0.04
            All-CNN                  1.4M     1.94±0.06     4.64±0.60      1.80±0.04
            WideResNet-28-10         36.5M    1.81±0.03     3.11±0.17      1.45±0.03
            ResNeXt-29 (8×64d)       34.4M    1.81±0.02     5.64±0.61      1.58±0.03
            PyramidNet-164 (α=28)    1.7M     2.02±0.05     3.23±0.10      1.90±0.09
            DenseNet-BC-100 (k=12)   0.8M     1.97±0.12     2.74±0.10      1.78±0.07

The results of applying dilated convolution and PGP to ResNet-50 and ResNet-101 are listed in Table 7. The networks with PGP achieved higher mAP than the others. In contrast, the networks with dilated convolutions achieved lower scores than the baseline, which suggests that the ImageNet pre-trained weights were not successfully transferred to ResNet with dilated convolutions due to the difference in resolution, as mentioned in Sec. 3.4. This problem can be avoided by maintaining the dilation rate in the layers that perform downsampling, as is done in dilated residual networks (DRN) [17]. This operation allows the CNN with dilated convolutions to perform convolution at the same resolution as the base CNN. The DRN achieved better mAP scores than the base CNN and nearly the same scores as the CNN with PGP. The CNN with PGP is nevertheless preferable to the DRN in that it retains weight transferability.
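The grid decomposition behind this argument can be written compactly. In our notation, with $x_{(i,j)}[p,q] = x[sp+i,\, sq+j]$ denoting the grid phase of $x$ at offset $(i,j)$, a stride-1 convolution (correlation convention, boundary effects ignored) with kernel $k$ and dilation rate $s$ satisfies
\[
y[sp+i,\; sq+j] \;=\; \sum_{u,v} k[u,v]\, x[s(p+u)+i,\; s(q+v)+j] \;=\; \bigl(x_{(i,j)} \ast k\bigr)[p,q],
\]
that is, each of the $s^2$ grid phases of the dilated-convolution output is an ordinary convolution of the corresponding grid phase of the input, which is exactly the parallel-branch view used by PGP.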


Table 6. Classification error (%) for the ImageNet 2012 validation set. The error was evaluated with and without 10-crop testing for 224 × 224-pixel inputs. DConv: dilated convolution, PGP: parallel grid pooling.

                                              1 crop               10 crops
Network      Params   Train    Test      top-1    top-5       top-1    top-5
ResNet-50    25.6M    Base     Base      23.69    7.00        21.87    6.07
                      DConv    DConv     22.47    6.27        21.17    5.58
                      PGP      PGP       22.40    6.30        20.83    5.56
                      Base     PGP       23.32    6.85        21.62    5.93
                      DConv    Base      31.44    11.40       26.79    8.19
                      PGP      Base      23.01    6.66        21.22    5.74
ResNet-101   44.5M    Base     Base      22.49    6.38        20.85    5.50
                      DConv    DConv     21.26    5.61        20.03    5.02
                      PGP      PGP       21.34    5.65        19.81    5.00
                      Base     PGP       22.13    6.21        20.46    5.36
                      DConv    Base      25.63    8.01        22.40    6.05
                      PGP      Base      21.80    5.95        20.18    5.15

5 Conclusion

In this paper, we proposed PGP, which performs downsampling without discarding any intermediate features and can be regarded as a data augmentation technique. PGP is easily applicable to various CNN models without altering their learning strategies. Additionally, we demonstrated that PGP is implicitly used in dilated convolution, which suggests that dilated convolution is also a type of data augmentation technique. Experiments on CIFAR-10/100 and SVHN with six network models demonstrated that PGP outperformed both the base CNN and the CNN with dilated convolutions. PGP also obtained better results than the base CNN and the CNN with dilated convolutions on the ImageNet dataset for image classification and on the NUS-WIDE and MS-COCO datasets for multi-label image classification.


Table 7. Quantitative results on NUS-WIDE and MS-COCO. DConv: dilated convolution, DRN: dilated residual networks, PGP: parallel grid pooling.

                                                       Macro                  Micro
Dataset     Network       Params   Arch     mAP     F      P      R        F      P      R
NUS-WIDE    ResNet-50     25.6M    Base     55.7    52.1   63.3   47.0     70.2   75.3   65.8
                                   DConv    55.3    51.7   63.3   46.6     70.0   75.7   65.2
                                   DRN      55.9    52.2   43.8   47.0     70.1   75.8   65.3
                                   PGP      56.1    51.9   64.4   46.2     70.0   76.4   64.6
            ResNet-101    44.5M    Base     56.5    53.0   63.6   48.3     70.4   75.4   66.1
                                   DConv    56.3    53.4   62.7   49.2     70.5   74.9   66.5
                                   DRN      56.8    53.1   63.1   48.9     70.5   75.1   66.5
                                   PGP      57.2    53.5   64.6   48.7     70.6   75.7   66.2
MS-COCO     ResNet-50     25.6M    Base     73.4    67.5   78.2   60.9     72.8   82.2   65.4
                                   DConv    72.7    66.6   77.9   59.7     72.2   81.9   64.5
                                   DRN      74.2    68.3   78.8   61.8     73.4   82.5   66.1
                                   PGP      74.2    67.9   80.0   60.5     73.1   83.7   64.9
            ResNet-101    44.5M    Base     74.6    68.9   77.6   62.8     73.8   82.2   66.9
                                   DConv    74.0    68.3   78.0   62.0     73.2   81.6   66.4
                                   DRN      75.3    69.7   78.2   63.8     74.3   81.8   68.1
                                   PGP      75.5    69.4   79.8   62.9     74.2   83.1   67.0


References

1. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
2. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
3. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
4. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: ECCV (2016)
5. Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. In: CVPR (2017)
6. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: CVPR (2017)
7. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI (2017)
8. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
9. Gross, S., Wilber, M.: Training and investigating residual nets. Facebook AI Research, CA. [Online] http://torch.ch/blog/2016/02/04/resnets.html (2016)
10. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)
11. Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
12. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: ECCV (2016)
13. Zeiler, M.D., Fergus, R.: Stochastic pooling for regularization of deep convolutional neural networks. In: ICLR (2013)
14. Graham, B.: Fractional max-pooling. arXiv preprint arXiv:1412.6071 (2014)
15. Lee, C.Y., Gallagher, P.W., Tu, Z.: Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In: AISTATS (2016)
16. Zhai, S., Wu, H., Kumar, A., Cheng, Y., Lu, Y., Zhang, Z., Feris, R.: S3Pool: Pooling with stochastic spatial sampling. In: CVPR (2017)
17. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
18. Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
19. Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A.v.d., Graves, A., Kavukcuoglu, K.: Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 (2016)
20. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)
21. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611 (2018)
22. Yang, Z., Hu, Z., Salakhutdinov, R., Berg-Kirkpatrick, T.: Improved variational autoencoders for text modeling using dilated convolutions. In: ICML (2017)
23. Hinton, G.E.: Learning multiple layers of representation. Trends in Cognitive Sciences 11(10) (2007) 428-434
24. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Machine Learning Systems. Volume 2011 (2011)
25. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3) (2015) 211-252
26. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
27. Ratner, A.J., Ehrenberg, H., Hussain, Z., Dunnmon, J., Re, C.: Learning to compose domain-specific transformations for data augmentation. In: NIPS (2017)
28. Hauberg, S., Freifeld, O., Larsen, A.B.L., Fisher, J., Hansen, L.: Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In: AISTATS (2016)
29. Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. arXiv preprint arXiv:1708.04896 (2017)
30. DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
31. Yamada, Y., Iwamura, M., Kise, K.: Shakedrop regularization. arXiv preprint arXiv:1802.02375 (2018)
32. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 (2018)
33. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: ICLR (2018)
34. Yuji, T., Yoshitaka, U., Tatsuya, H.: Between-class learning for image classification. In: CVPR (2018)
35. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. JAIR 16 (2002) 321-357
36. DeVries, T., Taylor, G.W.: Dataset augmentation in feature space. ICLR Workshop (2017)
37. Fawzi, A., Samulowitz, H., Turaga, D., Frossard, P.: Adaptive data augmentation for image classification. In: ICIP (2016)
38. Wang, X., Shrivastava, A., Gupta, A.: A-Fast-RCNN: Hard positive generation via adversary for object detection. In: CVPR (2017)
39. Lemley, J., Bazrafkan, S., Corcoran, P.: Smart augmentation - learning an optimal data augmentation strategy. IEEE Access (2017)
40. Dixit, M., Kwitt, R., Niethammer, M., Vasconcelos, N.: AGA: Attribute-guided augmentation. In: CVPR (2017)
41. Liu, B., Dixit, M., Kwitt, R., Vasconcelos, N.: Feature space transfer for data augmentation. In: CVPR (2018)
42. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
43. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. In: ICLR Workshop (2015)
44. Han, D., Kim, J., Kim, J.: Deep pyramidal residual networks. In: CVPR (2017)
45. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with restarts. In: ICLR (2017)
46. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J.E., Weinberger, K.Q.: Snapshot ensembles: Train 1, get M for free. In: ICLR (2017)
47. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR (2018)
48. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV (2015)
49. Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: NIPS Workshop (2015)
50. Niitani, Y., Ogawa, T., Saito, S., Saito, M.: ChainerCV: a library for deep learning in computer vision. In: ACMMM (2017)
51. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In: NIPS (2017)
52. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: NUS-WIDE: a real-world web image database from National University of Singapore. In: ACM-CIVR (2009)
53. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
54. Zhu, F., Li, H., Ouyang, W., Yu, N., Wang, X.: Learning spatial regularization with image-level supervisions for multi-label image classification. In: CVPR (2017)

  • Parallel Grid Pooling for Data Augmentation
Page 11: arXiv:1803.11370v1 [cs.CV] 30 Mar 2018 · Parallel Grid Pooling for Data Augmentation Akito Takeki, Daiki Ikami, Go Iriey, and Kiyoharu Aizawa {takeki,iakmi,aizawa}@hal.t.u-tokyo.ac.jp,

11

Table 2 Test errors () with different data augmentation methods on CIFAR-100based on PreResNet-164 RF random flipping RC random cropping RE randomerasing PGP parallel grid pooling

Network RF RC RE wo PGP w PGP

PreResNet-164

3090plusmn032 2268plusmn027X 2640plusmn018 2070plusmn034

X 2394plusmn007 2234plusmn023X 2752plusmn022 2132plusmn021

X X 2268plusmn027 2018plusmn045X X 2439plusmn055 2001plusmn014

X X 2259plusmn009 2071plusmn002X X X 2148plusmn021 1916plusmn020

Table 3 Ensembled test errors () on CIFAR-10100 and SVHN based on PreResNet-164 ldquotimeskrdquo means an ensembling of k CNN models DConv dilated convolution PGPparallel grid pooling

Dataset Networktimes1 times3

Base DConv PGP Base DConv PGP

CIFAR-10PreResNet-164

471 415 377 398 337 314CIFAR-100 2268 2071 2018 1916 1735 1715SVHN 196 174 154 165 151 139

on the ALL-CNN and ResNeXt models The reason of the incompatibility is leftfor future work

The results of applying PGP as a training-time data augmentation techniqueon the CIFAR-10 CIFAR-100 and SVHN datasets with different networks arelisted in Table 5 The models trained with PGP which have the same structureto the base model at the test phase outperformed the base model and themodel with dilated convolutions The models trained with dilated convolutionsperformed worse compared to the baseline models

42 ImageNet classification

We further evaluated PGP using the ImageNet 2012 classification dataset [25]The ImageNet dataset is comprised of 128 million training images and 50000validation images from 1000 classes With the data augmentation scheme fortraining presented in [91151] input images of size 224 times 224 were randomlycropped from a resized image using scale and aspect ratio augmentation Wereimplemented a derivative of ResNet [9] for training According to the dilatednetwork strategy [720] we used dilated convolutions or PGP after the conv4stage All networks were optimized using SGD with a momentum of 09 andweight decay of 10times 10minus4 The models were trained over 90 epochs using mini-batches of size 256 on eight GPUs We adopted the weight initialization method

12

Table 4 Test errors () when applying PGP as a test-time data augmentation withvarious architectures on CIFAR-10 CIFAR-100 and SVHN PGP parallel grid pool-ing

Dataset Network Params Base PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 456plusmn014All-CNN 14M 842plusmn023 903plusmn030WideResNet-28-10 365M 344plusmn006 339plusmn002ResNeXt-29 (8times64d) 344M 386plusmn016 401plusmn021PyramidNet-164 (α=28) 17M 391plusmn015 382plusmn005DenseNet-BC-100 (k=12) 08M 460plusmn007 453plusmn012

CIFAR-100

PreResNet-164 17M 2268plusmn027 2219plusmn026All-CNN 14M 3243plusmn031 3324plusmn030WideResNet-28-10 365M 1870plusmn039 1860plusmn039ResNeXt-29 (8times64d) 344M 1963plusmn112 2002plusmn098PyramidNet-164 (α=28) 17M 1965plusmn024 1934plusmn028DenseNet-BC-100 (k=12) 08M 2262plusmn029 2233plusmn026

SVHN

PreResNet-164 17M 196plusmn007 183plusmn006All-CNN 14M 194plusmn006 386plusmn005WideResNet-28-10 365M 181plusmn003 177plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 247plusmn051PyramidNet-164 (α=28) 17M 202plusmn005 208plusmn008DenseNet-BC-100 (k=12) 08M 197plusmn012 189plusmn008

introduced in [48] We adopted a cosine-shaped learning rate schedule wherethe learning rate was initially set to 01 and gradually reduced to 10times 10minus4

The results when applying dilated convolution and PGP on ImageNet withResNet-50 and ResNet-101 are listed in Table 6 As can be seen in the resultsthe network with PGP outperformed the baseline performance PGP works in acomplementary manner when applied in combination with 10-crop testing andcan be used as a training-time or testing-time data augmentation technique

43 Multi-label classification

We further evaluated PGP for multi-label classification on the NUS-WIDE [52]and MS-COCO [53] datasets NUS-WIDE contains 269648 images with a totalof 5018 tags collected from Flickr These images are manually annotated with81 concepts including objects and scenes Following the method in [54] we used161789 images for training and 107859 images for testing MS-COCO contains82783 images for training and 40504 images for testing which are labeled as 80common objects The input images were resized to 256times256 randomly croppedand then resized again to 224times 224

We trained a derivative of ResNet [9] with pre-training using the ImageNetdataset We used the same pre-trained weights throughout all experiments forfair comparison All networks were optimized using SGD with a momentum of

13

Table 5 Test errors () when applying PGP as a training-time data augmentationtechnique with various architectures on CIFAR-10 CIFAR-100 and SVHN DConvdilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 730plusmn020 408plusmn009All-CNN 14M 842plusmn023 3877plusmn085 730plusmn031WideResNet-28-10 365M 344plusmn006 790plusmn090 330plusmn013ResNeXt-29 (8times64d) 344M 386plusmn016 1691plusmn045 336plusmn027PyramidNet-164 (α=28) 17M 391plusmn015 682plusmn046 355plusmn008DenseNet-BC-100 (k=12) 08M 460plusmn007 703plusmn070 436plusmn010

CIFAR-100

PreResNet-164 17M 2268plusmn027 2490plusmn047 2101plusmn054All-CNN 14M 3243plusmn031 6427plusmn221 3026plusmn044WideResNet-28-10 365M 1870plusmn039 2690plusmn080 1856plusmn021ResNeXt-29 (8times64d) 344M 1963plusmn112 4457plusmn1378 1767plusmn031PyramidNet-164 (α=28) 17M 1965plusmn024 2528plusmn030 1858plusmn013DenseNet-BC-100 (k=12) 08M 2262plusmn029 2756plusmn078 2246plusmn016

SVHN

PreResNet-164 17M 196plusmn174 321plusmn032 167plusmn004All-CNN 14M 194plusmn006 464plusmn060 180plusmn004WideResNet-28-10 365M 181plusmn003 311plusmn017 145plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 564plusmn061 158plusmn003PyramidNet-164 (α=28) 17M 202plusmn005 323plusmn010 190plusmn009DenseNet-BC-100 (k=12) 08M 197plusmn012 274plusmn010 178plusmn007

09 and weight decay of 50 times 10minus4 The models were trained for 16 and 30epochs using mini-batches of 96 on four GPUs for NUS-WIDE and MS-COCOrespectively We adopted the weight initialization method introduced in [48] Forfine tuning the learning rate was initially set to 10 times 10minus3 and divided by 10after the 8th and 12th epoch for NUS-WIDE and after the 15th and 22nd epochfor MS-COCO We employed mean average precision (mAP) macromicro F-measure precision and recall for performance evaluation If the confidences foreach class were greater than 05 the label were predicted as positive

The results when applying dilated convolution and PGP on ResNet-50 andResNet-101 are listed in Table 7 The networks with PGP achieved higher mAPcompared to the others In contrast the networks with dilated convolutionsachieved lower scores than the baseline which suggests that ImageNet pre-trained weights were not successfully transferred to ResNet with dilated convo-lutions due to the difference in resolution as we mentioned in Sec 34 This prob-lem can be avoided by maintaining the dilation rate in the layers that performdownsampling like the method used in a dilated residual networks (DRN) [17]The operation allows the CNN with dilated convolutions to perform convolu-tion with the same resolution as the base CNN The DRN achieved better mAPscores than the base CNN and nearly the same score as the CNN with PGPThe CNN with PGP was however superior to DRN in that the CNN with PGPhas the weight transferability

14

Table 6 Classification error () for the ImageNet 2012 validation set The error wasevaluated with and without 10-crop testing for 224times224-pixel inputs DConv dilatedconvolution PGP parallel grid pooling

Network Params Train Test1 crop 10 crops

top-1 top-5 top-1 top-5

ResNet-50 256M

Base Base 2369 700 2187 607DConv DConv 2247 627 2117 558PGP PGP 2240 630 2083 556Base PGP 2332 685 2162 593

DConv Base 3144 1140 2679 819PGP Base 2301 666 2122 574

ResNet-101 445M

Base Base 2249 638 2085 550DConv DConv 2126 561 2003 502PGP PGP 2134 565 1981 500Base PGP 2213 621 2046 536

DConv Base 2563 801 2240 605PGP Base 2180 595 2018 515

5 Conclusion

In this paper we proposed PGP which performs downsampling without dis-carding any intermediate feature and can be regarded as a data augmentationtechnique PGP is easily applicable to various CNN models without altering theirlearning strategies Additionally we demonstrated that PGP is implicitly usedin dilated convolution which suggests that dilated convolution is also a type ofdata augmentation technique Experiments on CIFAR-10100 and SVHN withsix network models demonstrated that PGP outperformed the base CNN andthe CNN with dilated convolutions PGP also obtained better results than thebase CNN and the CNN with dilated convolutions on the ImageNet dataset forimage classification and the NUS-WIDE and MS-COCO datasets for multi-labelimage classification

15

Table 7 Quantitative results on NUS-WIDE and MS-COCO DConv dilated con-volution DRN dilated residual networks PGP parallel grid pooling

Dataset Network Params Arch mAPMacro Micro

F P R F P R

NUS-WIDE

ResNet-50 256M

Base 557 521 633 470 702 753 658DConv 553 517 633 466 700 757 652DRN 559 522 438 470 701 758 653PGP 561 519 644 462 700 764 646

ResNet-101 445M

Base 565 530 636 483 704 754 661DConv 563 534 627 492 705 749 665DRN 568 531 631 489 705 751 665PGP 572 535 646 487 706 757 662

MS-COCO

ResNet-50 256M

Base 734 675 782 609 728 822 654DConv 727 666 779 597 722 819 645DRN 742 683 788 618 734 825 661PGP 742 679 800 605 731 837 649

ResNet-101 445M

Base 746 689 776 628 738 822 669DConv 740 683 780 620 732 816 664DRN 753 697 782 638 743 818 681PGP 755 694 798 629 742 831 670

16

References

1 He K Zhang X Ren S Sun J Deep residual learning for image recognitionIn CVPR (2016)

2 Huang G Liu Z van der Maaten L Weinberger KQ Densely connectedconvolutional networks In CVPR (2017)

3 Hu J Shen L Sun G Squeeze-and-excitation networks arXiv preprintarXiv170901507 (2017)

4 Liu W Anguelov D Erhan D Szegedy C Reed S Fu CY Berg ACSsd Single shot multibox detector In ECCV (2016)

5 Redmon J Farhadi A YOLO9000 Better faster stronger In CVPR (2017)6 Lin TY Goyal P Girshick R He K Dollar P Focal loss for dense object

detection In CVPR (2017)7 Chen LC Papandreou G Kokkinos I Murphy K Yuille AL DeepLab

Semantic image segmentation with deep convolutional nets atrous convolutionand fully connected CRFs TPAMI (2017)

8 Zhao H Shi J Qi X Wang X Jia J Pyramid scene parsing network InCVPR (2017)

9 Gross S Wilber M Training and investigating residual nets Facebook AIResearch CA[Online] httptorchchblog20160204resnetshtml (2016)

10 Zagoruyko S Komodakis N Wide residual networks In BMVC (2016)11 Xie S Girshick R Dollar P Tu Z He K Aggregated residual transformations

for deep neural networks In CVPR (2017)12 He K Zhang X Ren S Sun J Identity mappings in deep residual networks

In ECCV (2016)13 Zeiler MD Fergus R Stochastic pooling for regularization of deep convolutional

neural networks In ICLR (2013)14 Graham B Fractional max-pooling arXiv preprint arXiv14126071 (2014)15 Lee CY Gallagher PW Tu Z Generalizing pooling functions in convolutional

neural networks Mixed gated and tree In AISTATS (2016)16 Zhai S Wu H Kumar A Cheng Y Lu Y Zhang Z Feris R S3Pool

Pooling with stochastic spatial sampling In CVPR (2017)17 Yu F Koltun V Funkhouser T Dilated residual networks In CVPR (2017)18 Oord Avd Dieleman S Zen H Simonyan K Vinyals O Graves A Kalch-

brenner N Senior A Kavukcuoglu K WaveNet A generative model for rawaudio arXiv preprint arXiv160903499 (2016)

19 Kalchbrenner N Espeholt L Simonyan K Oord Avd Graves AKavukcuoglu K Neural machine translation in linear time arXiv preprintarXiv161010099 (2016)

20 Yu F Koltun V Multi-scale context aggregation by dilated convolutions InICLR (2016)

21 Chen LC Zhu Y Papandreou G Schroff F Adam H Encoder-decoderwith atrous separable convolution for semantic image segmentation arXiv preprintarXiv180202611 (2018)

22 Yang Z Hu Z Salakhutdinov R Berg-Kirkpatrick T Improved variationalautoencoders for text modeling using dilated convolutions In ICML (2017)

23 Hinton GE Learning multiple layers of representation Trends in cognitivesciences 11(10) (2007) 428ndash434

24 Netzer Y Wang T Coates A Bissacco A Wu B Ng AY Reading digits innatural images with unsupervised feature learning In NIPS Workshop on MachineLearning Systemsrdquo Volume 2011 (2011)

17

25 Russakovsky O Deng J Su H Krause J Satheesh S Ma S Huang ZKarpathy A Khosla A Bernstein M et al ImageNet large scale visual recog-nition challenge IJCV 115(3) (2015) 211ndash252

26 Krizhevsky A Sutskever I Hinton GE ImageNet classification with deepconvolutional neural networks In NIPS (2012)

27 Ratner AJ Ehrenberg H Hussain Z Dunnmon J Re C Learning to com-pose domain-specific transformations for data augmentation In NIPS (2017)

28 Hauberg S Freifeld O Larsen ABL Fisher J Hansen L Dreaming moredata Class-dependent distributions over diffeomorphisms for learned data aug-mentation In AISTATS (2016)

29 Zhong Z Zheng L Kang G Li S Yang Y Random erasing data augmenta-tion arXiv preprint arXiv170804896 (2017)

30 DeVries T Taylor GW Improved regularization of convolutional neural net-works with cutout arXiv preprint arXiv170804552 (2017)

31 Yamada Y Iwamura M Kise K Shakedrop regularization arXiv preprintarXiv180202375 (2018)

32 Real E Aggarwal A Huang Y Le QV Regularized evolution for imageclassifier architecture search arXiv preprint arXiv180201548 (2018)

33 Zhang H Cisse M Dauphin YN Lopez-Paz D mixup Beyond empirical riskminimization In ICLR (2018)

34 Yuji T Yoshitaka U Tatsuya H Between-class learning for image classificationIn CVPR (2018)

35 Chawla NV Bowyer KW Hall LO Kegelmeyer WP SMOTE syntheticminority over-sampling technique JAIR 16 (2002) 321ndash357

36 DeVries T Taylor GW Dataset augmentation in feature space ICLR workshop(2017)

37 Fawzi A Samulowitz H Turaga D Frossard P Adaptive data augmentationfor image classification In ICIP (2016)

38 Wang X Shrivastava A Gupta A A-fast-rcnn Hard positive generation viaadversary for object detection In CVPR (2017)

39 Lemley J Bazrafkan S Corcoran P Smart augmentation-learning an optimaldata augmentation strategy IEEE Access (2017)

40 Dixit M Kwitt R Niethammer M Vasconcelos N AGA Attribute-guidedaugmentation In CVPR (2017)

41 Liu B Dixit M Kwitt R Vasconcelos N Feature space transfer for dataaugmentation In CVPR (2018)

42 Yu F Koltun V Funkhouser T Dilated residual networks In CVPR (2017)43 Springenberg JT Dosovitskiy A Brox T Riedmiller M Striving for simplic-

ity The all convolutional net In ICLR Workshop (2015)44 Han D Kim J Kim J Deep pyramidal residual networks In CVPR (2017)45 Loshchilov I Hutter F SGDR stochastic gradient descent with restarts In

ICLR (2017)46 Huang G Li Y Pleiss G Liu Z Hopcroft JE Weinberger KQ Snapshot

ensembles Train 1 get m for free In ICLR (2017)47 Zoph B Vasudevan V Shlens J Le QV Learning transferable architectures

for scalable image recognition In CVPR (2018)48 He K Zhang X Ren S Sun J Delving deep into rectifiers Surpassing human-

level performance on ImageNet classification In ICCV (2015)49 Tokui S Oono K Hido S Clayton J Chainer a next-generation open source

framework for deep learning In NIPS Workshop (2015)

18

50 Niitani Y Ogawa T Saito S Saito M ChainerCV a library for deep learningin computer vision In ACMMM (2017)

51 Chen Y Li J Xiao H Jin X Yan S Feng J Dual path networks In NIPS(2017)

52 Chua TS Tang J Hong R Li H Luo Z Zheng Y NUS-WIDE a real-world web image database from national university of singapore In ACM-CIVR(2009)

53 Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar PZitnick CL Microsoft COCO Common objects in context In ECCV (2014)

54 Zhu F Li H Ouyang W Yu N Wang X Learning spatial regularization withimage-level supervisions for multi-label image classification In CVPR (2017)

  • Parallel Grid Pooling for Data Augmentation
Page 12: arXiv:1803.11370v1 [cs.CV] 30 Mar 2018 · Parallel Grid Pooling for Data Augmentation Akito Takeki, Daiki Ikami, Go Iriey, and Kiyoharu Aizawa {takeki,iakmi,aizawa}@hal.t.u-tokyo.ac.jp,

12

Table 4 Test errors () when applying PGP as a test-time data augmentation withvarious architectures on CIFAR-10 CIFAR-100 and SVHN PGP parallel grid pool-ing

Dataset Network Params Base PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 456plusmn014All-CNN 14M 842plusmn023 903plusmn030WideResNet-28-10 365M 344plusmn006 339plusmn002ResNeXt-29 (8times64d) 344M 386plusmn016 401plusmn021PyramidNet-164 (α=28) 17M 391plusmn015 382plusmn005DenseNet-BC-100 (k=12) 08M 460plusmn007 453plusmn012

CIFAR-100

PreResNet-164 17M 2268plusmn027 2219plusmn026All-CNN 14M 3243plusmn031 3324plusmn030WideResNet-28-10 365M 1870plusmn039 1860plusmn039ResNeXt-29 (8times64d) 344M 1963plusmn112 2002plusmn098PyramidNet-164 (α=28) 17M 1965plusmn024 1934plusmn028DenseNet-BC-100 (k=12) 08M 2262plusmn029 2233plusmn026

SVHN

PreResNet-164 17M 196plusmn007 183plusmn006All-CNN 14M 194plusmn006 386plusmn005WideResNet-28-10 365M 181plusmn003 177plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 247plusmn051PyramidNet-164 (α=28) 17M 202plusmn005 208plusmn008DenseNet-BC-100 (k=12) 08M 197plusmn012 189plusmn008

introduced in [48] We adopted a cosine-shaped learning rate schedule wherethe learning rate was initially set to 01 and gradually reduced to 10times 10minus4

The results when applying dilated convolution and PGP on ImageNet withResNet-50 and ResNet-101 are listed in Table 6 As can be seen in the resultsthe network with PGP outperformed the baseline performance PGP works in acomplementary manner when applied in combination with 10-crop testing andcan be used as a training-time or testing-time data augmentation technique

43 Multi-label classification

We further evaluated PGP for multi-label classification on the NUS-WIDE [52]and MS-COCO [53] datasets NUS-WIDE contains 269648 images with a totalof 5018 tags collected from Flickr These images are manually annotated with81 concepts including objects and scenes Following the method in [54] we used161789 images for training and 107859 images for testing MS-COCO contains82783 images for training and 40504 images for testing which are labeled as 80common objects The input images were resized to 256times256 randomly croppedand then resized again to 224times 224

We trained a derivative of ResNet [9] with pre-training using the ImageNetdataset We used the same pre-trained weights throughout all experiments forfair comparison All networks were optimized using SGD with a momentum of

13

Table 5 Test errors () when applying PGP as a training-time data augmentationtechnique with various architectures on CIFAR-10 CIFAR-100 and SVHN DConvdilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 730plusmn020 408plusmn009All-CNN 14M 842plusmn023 3877plusmn085 730plusmn031WideResNet-28-10 365M 344plusmn006 790plusmn090 330plusmn013ResNeXt-29 (8times64d) 344M 386plusmn016 1691plusmn045 336plusmn027PyramidNet-164 (α=28) 17M 391plusmn015 682plusmn046 355plusmn008DenseNet-BC-100 (k=12) 08M 460plusmn007 703plusmn070 436plusmn010

CIFAR-100

PreResNet-164 17M 2268plusmn027 2490plusmn047 2101plusmn054All-CNN 14M 3243plusmn031 6427plusmn221 3026plusmn044WideResNet-28-10 365M 1870plusmn039 2690plusmn080 1856plusmn021ResNeXt-29 (8times64d) 344M 1963plusmn112 4457plusmn1378 1767plusmn031PyramidNet-164 (α=28) 17M 1965plusmn024 2528plusmn030 1858plusmn013DenseNet-BC-100 (k=12) 08M 2262plusmn029 2756plusmn078 2246plusmn016

SVHN

PreResNet-164 17M 196plusmn174 321plusmn032 167plusmn004All-CNN 14M 194plusmn006 464plusmn060 180plusmn004WideResNet-28-10 365M 181plusmn003 311plusmn017 145plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 564plusmn061 158plusmn003PyramidNet-164 (α=28) 17M 202plusmn005 323plusmn010 190plusmn009DenseNet-BC-100 (k=12) 08M 197plusmn012 274plusmn010 178plusmn007

09 and weight decay of 50 times 10minus4 The models were trained for 16 and 30epochs using mini-batches of 96 on four GPUs for NUS-WIDE and MS-COCOrespectively We adopted the weight initialization method introduced in [48] Forfine tuning the learning rate was initially set to 10 times 10minus3 and divided by 10after the 8th and 12th epoch for NUS-WIDE and after the 15th and 22nd epochfor MS-COCO We employed mean average precision (mAP) macromicro F-measure precision and recall for performance evaluation If the confidences foreach class were greater than 05 the label were predicted as positive

The results when applying dilated convolution and PGP on ResNet-50 andResNet-101 are listed in Table 7 The networks with PGP achieved higher mAPcompared to the others In contrast the networks with dilated convolutionsachieved lower scores than the baseline which suggests that ImageNet pre-trained weights were not successfully transferred to ResNet with dilated convo-lutions due to the difference in resolution as we mentioned in Sec 34 This prob-lem can be avoided by maintaining the dilation rate in the layers that performdownsampling like the method used in a dilated residual networks (DRN) [17]The operation allows the CNN with dilated convolutions to perform convolu-tion with the same resolution as the base CNN The DRN achieved better mAPscores than the base CNN and nearly the same score as the CNN with PGPThe CNN with PGP was however superior to DRN in that the CNN with PGPhas the weight transferability

14

Table 6 Classification error () for the ImageNet 2012 validation set The error wasevaluated with and without 10-crop testing for 224times224-pixel inputs DConv dilatedconvolution PGP parallel grid pooling

Network Params Train Test1 crop 10 crops

top-1 top-5 top-1 top-5

ResNet-50 256M

Base Base 2369 700 2187 607DConv DConv 2247 627 2117 558PGP PGP 2240 630 2083 556Base PGP 2332 685 2162 593

DConv Base 3144 1140 2679 819PGP Base 2301 666 2122 574

ResNet-101 445M

Base Base 2249 638 2085 550DConv DConv 2126 561 2003 502PGP PGP 2134 565 1981 500Base PGP 2213 621 2046 536

DConv Base 2563 801 2240 605PGP Base 2180 595 2018 515

5 Conclusion

In this paper we proposed PGP which performs downsampling without dis-carding any intermediate feature and can be regarded as a data augmentationtechnique PGP is easily applicable to various CNN models without altering theirlearning strategies Additionally we demonstrated that PGP is implicitly usedin dilated convolution which suggests that dilated convolution is also a type ofdata augmentation technique Experiments on CIFAR-10100 and SVHN withsix network models demonstrated that PGP outperformed the base CNN andthe CNN with dilated convolutions PGP also obtained better results than thebase CNN and the CNN with dilated convolutions on the ImageNet dataset forimage classification and the NUS-WIDE and MS-COCO datasets for multi-labelimage classification

15

Table 7 Quantitative results on NUS-WIDE and MS-COCO DConv dilated con-volution DRN dilated residual networks PGP parallel grid pooling

Dataset Network Params Arch mAPMacro Micro

F P R F P R

NUS-WIDE

ResNet-50 256M

Base 557 521 633 470 702 753 658DConv 553 517 633 466 700 757 652DRN 559 522 438 470 701 758 653PGP 561 519 644 462 700 764 646

ResNet-101 445M

Base 565 530 636 483 704 754 661DConv 563 534 627 492 705 749 665DRN 568 531 631 489 705 751 665PGP 572 535 646 487 706 757 662

MS-COCO

ResNet-50 256M

Base 734 675 782 609 728 822 654DConv 727 666 779 597 722 819 645DRN 742 683 788 618 734 825 661PGP 742 679 800 605 731 837 649

ResNet-101 445M

Base 746 689 776 628 738 822 669DConv 740 683 780 620 732 816 664DRN 753 697 782 638 743 818 681PGP 755 694 798 629 742 831 670

16

References

1 He K Zhang X Ren S Sun J Deep residual learning for image recognitionIn CVPR (2016)

2 Huang G Liu Z van der Maaten L Weinberger KQ Densely connectedconvolutional networks In CVPR (2017)

3 Hu J Shen L Sun G Squeeze-and-excitation networks arXiv preprintarXiv170901507 (2017)

4 Liu W Anguelov D Erhan D Szegedy C Reed S Fu CY Berg ACSsd Single shot multibox detector In ECCV (2016)

5 Redmon J Farhadi A YOLO9000 Better faster stronger In CVPR (2017)6 Lin TY Goyal P Girshick R He K Dollar P Focal loss for dense object

detection In CVPR (2017)7 Chen LC Papandreou G Kokkinos I Murphy K Yuille AL DeepLab

Semantic image segmentation with deep convolutional nets atrous convolutionand fully connected CRFs TPAMI (2017)

8 Zhao H Shi J Qi X Wang X Jia J Pyramid scene parsing network InCVPR (2017)

9 Gross S Wilber M Training and investigating residual nets Facebook AIResearch CA[Online] httptorchchblog20160204resnetshtml (2016)

10 Zagoruyko S Komodakis N Wide residual networks In BMVC (2016)11 Xie S Girshick R Dollar P Tu Z He K Aggregated residual transformations

for deep neural networks In CVPR (2017)12 He K Zhang X Ren S Sun J Identity mappings in deep residual networks

In ECCV (2016)13 Zeiler MD Fergus R Stochastic pooling for regularization of deep convolutional

neural networks In ICLR (2013)14 Graham B Fractional max-pooling arXiv preprint arXiv14126071 (2014)15 Lee CY Gallagher PW Tu Z Generalizing pooling functions in convolutional

neural networks Mixed gated and tree In AISTATS (2016)16 Zhai S Wu H Kumar A Cheng Y Lu Y Zhang Z Feris R S3Pool

Pooling with stochastic spatial sampling In CVPR (2017)17 Yu F Koltun V Funkhouser T Dilated residual networks In CVPR (2017)18 Oord Avd Dieleman S Zen H Simonyan K Vinyals O Graves A Kalch-

brenner N Senior A Kavukcuoglu K WaveNet A generative model for rawaudio arXiv preprint arXiv160903499 (2016)

19 Kalchbrenner N Espeholt L Simonyan K Oord Avd Graves AKavukcuoglu K Neural machine translation in linear time arXiv preprintarXiv161010099 (2016)

20 Yu F Koltun V Multi-scale context aggregation by dilated convolutions InICLR (2016)

21 Chen LC Zhu Y Papandreou G Schroff F Adam H Encoder-decoderwith atrous separable convolution for semantic image segmentation arXiv preprintarXiv180202611 (2018)

22 Yang Z Hu Z Salakhutdinov R Berg-Kirkpatrick T Improved variationalautoencoders for text modeling using dilated convolutions In ICML (2017)

23 Hinton GE Learning multiple layers of representation Trends in cognitivesciences 11(10) (2007) 428ndash434

24 Netzer Y Wang T Coates A Bissacco A Wu B Ng AY Reading digits innatural images with unsupervised feature learning In NIPS Workshop on MachineLearning Systemsrdquo Volume 2011 (2011)

17

25 Russakovsky O Deng J Su H Krause J Satheesh S Ma S Huang ZKarpathy A Khosla A Bernstein M et al ImageNet large scale visual recog-nition challenge IJCV 115(3) (2015) 211ndash252

26 Krizhevsky A Sutskever I Hinton GE ImageNet classification with deepconvolutional neural networks In NIPS (2012)

27 Ratner AJ Ehrenberg H Hussain Z Dunnmon J Re C Learning to com-pose domain-specific transformations for data augmentation In NIPS (2017)

28 Hauberg S Freifeld O Larsen ABL Fisher J Hansen L Dreaming moredata Class-dependent distributions over diffeomorphisms for learned data aug-mentation In AISTATS (2016)

29 Zhong Z Zheng L Kang G Li S Yang Y Random erasing data augmenta-tion arXiv preprint arXiv170804896 (2017)

30 DeVries T Taylor GW Improved regularization of convolutional neural net-works with cutout arXiv preprint arXiv170804552 (2017)

31 Yamada Y Iwamura M Kise K Shakedrop regularization arXiv preprintarXiv180202375 (2018)

32 Real E Aggarwal A Huang Y Le QV Regularized evolution for imageclassifier architecture search arXiv preprint arXiv180201548 (2018)

33 Zhang H Cisse M Dauphin YN Lopez-Paz D mixup Beyond empirical riskminimization In ICLR (2018)

34 Yuji T Yoshitaka U Tatsuya H Between-class learning for image classificationIn CVPR (2018)

35 Chawla NV Bowyer KW Hall LO Kegelmeyer WP SMOTE syntheticminority over-sampling technique JAIR 16 (2002) 321ndash357

36 DeVries T Taylor GW Dataset augmentation in feature space ICLR workshop(2017)

37 Fawzi A Samulowitz H Turaga D Frossard P Adaptive data augmentationfor image classification In ICIP (2016)

38 Wang X Shrivastava A Gupta A A-fast-rcnn Hard positive generation viaadversary for object detection In CVPR (2017)

39 Lemley J Bazrafkan S Corcoran P Smart augmentation-learning an optimaldata augmentation strategy IEEE Access (2017)

40 Dixit M Kwitt R Niethammer M Vasconcelos N AGA Attribute-guidedaugmentation In CVPR (2017)

41 Liu B Dixit M Kwitt R Vasconcelos N Feature space transfer for dataaugmentation In CVPR (2018)

42 Yu F Koltun V Funkhouser T Dilated residual networks In CVPR (2017)43 Springenberg JT Dosovitskiy A Brox T Riedmiller M Striving for simplic-

ity The all convolutional net In ICLR Workshop (2015)44 Han D Kim J Kim J Deep pyramidal residual networks In CVPR (2017)45 Loshchilov I Hutter F SGDR stochastic gradient descent with restarts In

ICLR (2017)46 Huang G Li Y Pleiss G Liu Z Hopcroft JE Weinberger KQ Snapshot

ensembles Train 1 get m for free In ICLR (2017)47 Zoph B Vasudevan V Shlens J Le QV Learning transferable architectures

for scalable image recognition In CVPR (2018)48 He K Zhang X Ren S Sun J Delving deep into rectifiers Surpassing human-

level performance on ImageNet classification In ICCV (2015)49 Tokui S Oono K Hido S Clayton J Chainer a next-generation open source

framework for deep learning In NIPS Workshop (2015)

18

50 Niitani Y Ogawa T Saito S Saito M ChainerCV a library for deep learningin computer vision In ACMMM (2017)

51 Chen Y Li J Xiao H Jin X Yan S Feng J Dual path networks In NIPS(2017)

52 Chua TS Tang J Hong R Li H Luo Z Zheng Y NUS-WIDE a real-world web image database from national university of singapore In ACM-CIVR(2009)

53 Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar PZitnick CL Microsoft COCO Common objects in context In ECCV (2014)

54 Zhu F Li H Ouyang W Yu N Wang X Learning spatial regularization withimage-level supervisions for multi-label image classification In CVPR (2017)

  • Parallel Grid Pooling for Data Augmentation
Page 13: arXiv:1803.11370v1 [cs.CV] 30 Mar 2018 · Parallel Grid Pooling for Data Augmentation Akito Takeki, Daiki Ikami, Go Iriey, and Kiyoharu Aizawa {takeki,iakmi,aizawa}@hal.t.u-tokyo.ac.jp,

13

Table 5 Test errors () when applying PGP as a training-time data augmentationtechnique with various architectures on CIFAR-10 CIFAR-100 and SVHN DConvdilated convolution PGP parallel grid pooling

Dataset Network Params Base DConv PGP

CIFAR-10

PreResNet-164 17M 471plusmn001 730plusmn020 408plusmn009All-CNN 14M 842plusmn023 3877plusmn085 730plusmn031WideResNet-28-10 365M 344plusmn006 790plusmn090 330plusmn013ResNeXt-29 (8times64d) 344M 386plusmn016 1691plusmn045 336plusmn027PyramidNet-164 (α=28) 17M 391plusmn015 682plusmn046 355plusmn008DenseNet-BC-100 (k=12) 08M 460plusmn007 703plusmn070 436plusmn010

CIFAR-100

PreResNet-164 17M 2268plusmn027 2490plusmn047 2101plusmn054All-CNN 14M 3243plusmn031 6427plusmn221 3026plusmn044WideResNet-28-10 365M 1870plusmn039 2690plusmn080 1856plusmn021ResNeXt-29 (8times64d) 344M 1963plusmn112 4457plusmn1378 1767plusmn031PyramidNet-164 (α=28) 17M 1965plusmn024 2528plusmn030 1858plusmn013DenseNet-BC-100 (k=12) 08M 2262plusmn029 2756plusmn078 2246plusmn016

SVHN

PreResNet-164 17M 196plusmn174 321plusmn032 167plusmn004All-CNN 14M 194plusmn006 464plusmn060 180plusmn004WideResNet-28-10 365M 181plusmn003 311plusmn017 145plusmn003ResNeXt-29 (8times64d) 344M 181plusmn002 564plusmn061 158plusmn003PyramidNet-164 (α=28) 17M 202plusmn005 323plusmn010 190plusmn009DenseNet-BC-100 (k=12) 08M 197plusmn012 274plusmn010 178plusmn007

09 and weight decay of 50 times 10minus4 The models were trained for 16 and 30epochs using mini-batches of 96 on four GPUs for NUS-WIDE and MS-COCOrespectively We adopted the weight initialization method introduced in [48] Forfine tuning the learning rate was initially set to 10 times 10minus3 and divided by 10after the 8th and 12th epoch for NUS-WIDE and after the 15th and 22nd epochfor MS-COCO We employed mean average precision (mAP) macromicro F-measure precision and recall for performance evaluation If the confidences foreach class were greater than 05 the label were predicted as positive

The results when applying dilated convolution and PGP on ResNet-50 andResNet-101 are listed in Table 7 The networks with PGP achieved higher mAPcompared to the others In contrast the networks with dilated convolutionsachieved lower scores than the baseline which suggests that ImageNet pre-trained weights were not successfully transferred to ResNet with dilated convo-lutions due to the difference in resolution as we mentioned in Sec 34 This prob-lem can be avoided by maintaining the dilation rate in the layers that performdownsampling like the method used in a dilated residual networks (DRN) [17]The operation allows the CNN with dilated convolutions to perform convolu-tion with the same resolution as the base CNN The DRN achieved better mAPscores than the base CNN and nearly the same score as the CNN with PGPThe CNN with PGP was however superior to DRN in that the CNN with PGPhas the weight transferability

14

Table 6 Classification error () for the ImageNet 2012 validation set The error wasevaluated with and without 10-crop testing for 224times224-pixel inputs DConv dilatedconvolution PGP parallel grid pooling

Network Params Train Test1 crop 10 crops

top-1 top-5 top-1 top-5

ResNet-50 256M

Base Base 2369 700 2187 607DConv DConv 2247 627 2117 558PGP PGP 2240 630 2083 556Base PGP 2332 685 2162 593

DConv Base 3144 1140 2679 819PGP Base 2301 666 2122 574

ResNet-101 445M

Base Base 2249 638 2085 550DConv DConv 2126 561 2003 502PGP PGP 2134 565 1981 500Base PGP 2213 621 2046 536

DConv Base 2563 801 2240 605PGP Base 2180 595 2018 515

5 Conclusion

In this paper we proposed PGP which performs downsampling without dis-carding any intermediate feature and can be regarded as a data augmentationtechnique PGP is easily applicable to various CNN models without altering theirlearning strategies Additionally we demonstrated that PGP is implicitly usedin dilated convolution which suggests that dilated convolution is also a type ofdata augmentation technique Experiments on CIFAR-10100 and SVHN withsix network models demonstrated that PGP outperformed the base CNN andthe CNN with dilated convolutions PGP also obtained better results than thebase CNN and the CNN with dilated convolutions on the ImageNet dataset forimage classification and the NUS-WIDE and MS-COCO datasets for multi-labelimage classification

15

Table 7 Quantitative results on NUS-WIDE and MS-COCO DConv dilated con-volution DRN dilated residual networks PGP parallel grid pooling

Dataset Network Params Arch mAPMacro Micro

F P R F P R

NUS-WIDE

ResNet-50 256M

Base 557 521 633 470 702 753 658DConv 553 517 633 466 700 757 652DRN 559 522 438 470 701 758 653PGP 561 519 644 462 700 764 646

ResNet-101 445M

Base 565 530 636 483 704 754 661DConv 563 534 627 492 705 749 665DRN 568 531 631 489 705 751 665PGP 572 535 646 487 706 757 662

MS-COCO

ResNet-50 256M

Base 734 675 782 609 728 822 654DConv 727 666 779 597 722 819 645DRN 742 683 788 618 734 825 661PGP 742 679 800 605 731 837 649

ResNet-101 445M

Base 746 689 776 628 738 822 669DConv 740 683 780 620 732 816 664DRN 753 697 782 638 743 818 681PGP 755 694 798 629 742 831 670

16

References

1 He K Zhang X Ren S Sun J Deep residual learning for image recognitionIn CVPR (2016)

2 Huang G Liu Z van der Maaten L Weinberger KQ Densely connectedconvolutional networks In CVPR (2017)

3 Hu J Shen L Sun G Squeeze-and-excitation networks arXiv preprintarXiv170901507 (2017)

4 Liu W Anguelov D Erhan D Szegedy C Reed S Fu CY Berg ACSsd Single shot multibox detector In ECCV (2016)

5 Redmon J Farhadi A YOLO9000 Better faster stronger In CVPR (2017)6 Lin TY Goyal P Girshick R He K Dollar P Focal loss for dense object

detection In CVPR (2017)7 Chen LC Papandreou G Kokkinos I Murphy K Yuille AL DeepLab

Semantic image segmentation with deep convolutional nets atrous convolutionand fully connected CRFs TPAMI (2017)

8 Zhao H Shi J Qi X Wang X Jia J Pyramid scene parsing network InCVPR (2017)

9 Gross S Wilber M Training and investigating residual nets Facebook AIResearch CA[Online] httptorchchblog20160204resnetshtml (2016)

10 Zagoruyko S Komodakis N Wide residual networks In BMVC (2016)11 Xie S Girshick R Dollar P Tu Z He K Aggregated residual transformations

for deep neural networks In CVPR (2017)12 He K Zhang X Ren S Sun J Identity mappings in deep residual networks

In ECCV (2016)13 Zeiler MD Fergus R Stochastic pooling for regularization of deep convolutional

neural networks In ICLR (2013)14 Graham B Fractional max-pooling arXiv preprint arXiv14126071 (2014)15 Lee CY Gallagher PW Tu Z Generalizing pooling functions in convolutional

neural networks Mixed gated and tree In AISTATS (2016)16 Zhai S Wu H Kumar A Cheng Y Lu Y Zhang Z Feris R S3Pool

Pooling with stochastic spatial sampling In CVPR (2017)17 Yu F Koltun V Funkhouser T Dilated residual networks In CVPR (2017)18 Oord Avd Dieleman S Zen H Simonyan K Vinyals O Graves A Kalch-

brenner N Senior A Kavukcuoglu K WaveNet A generative model for rawaudio arXiv preprint arXiv160903499 (2016)

19 Kalchbrenner N Espeholt L Simonyan K Oord Avd Graves AKavukcuoglu K Neural machine translation in linear time arXiv preprintarXiv161010099 (2016)

20 Yu F Koltun V Multi-scale context aggregation by dilated convolutions InICLR (2016)

21 Chen LC Zhu Y Papandreou G Schroff F Adam H Encoder-decoderwith atrous separable convolution for semantic image segmentation arXiv preprintarXiv180202611 (2018)

22 Yang Z Hu Z Salakhutdinov R Berg-Kirkpatrick T Improved variationalautoencoders for text modeling using dilated convolutions In ICML (2017)

23 Hinton GE Learning multiple layers of representation Trends in cognitivesciences 11(10) (2007) 428ndash434

24 Netzer Y Wang T Coates A Bissacco A Wu B Ng AY Reading digits innatural images with unsupervised feature learning In NIPS Workshop on MachineLearning Systemsrdquo Volume 2011 (2011)

17

25 Russakovsky O Deng J Su H Krause J Satheesh S Ma S Huang ZKarpathy A Khosla A Bernstein M et al ImageNet large scale visual recog-nition challenge IJCV 115(3) (2015) 211ndash252

26 Krizhevsky A Sutskever I Hinton GE ImageNet classification with deepconvolutional neural networks In NIPS (2012)

27 Ratner AJ Ehrenberg H Hussain Z Dunnmon J Re C Learning to com-pose domain-specific transformations for data augmentation In NIPS (2017)

28 Hauberg S Freifeld O Larsen ABL Fisher J Hansen L Dreaming moredata Class-dependent distributions over diffeomorphisms for learned data aug-mentation In AISTATS (2016)

29 Zhong Z Zheng L Kang G Li S Yang Y Random erasing data augmenta-tion arXiv preprint arXiv170804896 (2017)

30 DeVries T Taylor GW Improved regularization of convolutional neural net-works with cutout arXiv preprint arXiv170804552 (2017)

31 Yamada Y Iwamura M Kise K Shakedrop regularization arXiv preprintarXiv180202375 (2018)

32 Real E Aggarwal A Huang Y Le QV Regularized evolution for imageclassifier architecture search arXiv preprint arXiv180201548 (2018)

33 Zhang H Cisse M Dauphin YN Lopez-Paz D mixup Beyond empirical riskminimization In ICLR (2018)

34 Yuji T Yoshitaka U Tatsuya H Between-class learning for image classificationIn CVPR (2018)

35 Chawla NV Bowyer KW Hall LO Kegelmeyer WP SMOTE syntheticminority over-sampling technique JAIR 16 (2002) 321ndash357

36 DeVries T Taylor GW Dataset augmentation in feature space ICLR workshop(2017)

37 Fawzi A Samulowitz H Turaga D Frossard P Adaptive data augmentationfor image classification In ICIP (2016)

38 Wang X Shrivastava A Gupta A A-fast-rcnn Hard positive generation viaadversary for object detection In CVPR (2017)

39 Lemley J Bazrafkan S Corcoran P Smart augmentation-learning an optimaldata augmentation strategy IEEE Access (2017)

40 Dixit M Kwitt R Niethammer M Vasconcelos N AGA Attribute-guidedaugmentation In CVPR (2017)

41 Liu B Dixit M Kwitt R Vasconcelos N Feature space transfer for dataaugmentation In CVPR (2018)

42 Yu F Koltun V Funkhouser T Dilated residual networks In CVPR (2017)43 Springenberg JT Dosovitskiy A Brox T Riedmiller M Striving for simplic-

ity The all convolutional net In ICLR Workshop (2015)44 Han D Kim J Kim J Deep pyramidal residual networks In CVPR (2017)45 Loshchilov I Hutter F SGDR stochastic gradient descent with restarts In

ICLR (2017)46 Huang G Li Y Pleiss G Liu Z Hopcroft JE Weinberger KQ Snapshot

ensembles Train 1 get m for free In ICLR (2017)47 Zoph B Vasudevan V Shlens J Le QV Learning transferable architectures

for scalable image recognition In CVPR (2018)48 He K Zhang X Ren S Sun J Delving deep into rectifiers Surpassing human-

level performance on ImageNet classification In ICCV (2015)49 Tokui S Oono K Hido S Clayton J Chainer a next-generation open source

framework for deep learning In NIPS Workshop (2015)

18

50 Niitani Y Ogawa T Saito S Saito M ChainerCV a library for deep learningin computer vision In ACMMM (2017)

51 Chen Y Li J Xiao H Jin X Yan S Feng J Dual path networks In NIPS(2017)

52 Chua TS Tang J Hong R Li H Luo Z Zheng Y NUS-WIDE a real-world web image database from national university of singapore In ACM-CIVR(2009)

53 Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar PZitnick CL Microsoft COCO Common objects in context In ECCV (2014)

54 Zhu F Li H Ouyang W Yu N Wang X Learning spatial regularization withimage-level supervisions for multi-label image classification In CVPR (2017)

  • Parallel Grid Pooling for Data Augmentation
Page 14: arXiv:1803.11370v1 [cs.CV] 30 Mar 2018 · Parallel Grid Pooling for Data Augmentation Akito Takeki, Daiki Ikami, Go Iriey, and Kiyoharu Aizawa {takeki,iakmi,aizawa}@hal.t.u-tokyo.ac.jp,

14

Table 6 Classification error () for the ImageNet 2012 validation set The error wasevaluated with and without 10-crop testing for 224times224-pixel inputs DConv dilatedconvolution PGP parallel grid pooling

Network Params Train Test1 crop 10 crops

top-1 top-5 top-1 top-5

ResNet-50 256M

Base Base 2369 700 2187 607DConv DConv 2247 627 2117 558PGP PGP 2240 630 2083 556Base PGP 2332 685 2162 593

DConv Base 3144 1140 2679 819PGP Base 2301 666 2122 574

ResNet-101 445M

Base Base 2249 638 2085 550DConv DConv 2126 561 2003 502PGP PGP 2134 565 1981 500Base PGP 2213 621 2046 536

DConv Base 2563 801 2240 605PGP Base 2180 595 2018 515

5 Conclusion

In this paper we proposed PGP which performs downsampling without dis-carding any intermediate feature and can be regarded as a data augmentationtechnique PGP is easily applicable to various CNN models without altering theirlearning strategies Additionally we demonstrated that PGP is implicitly usedin dilated convolution which suggests that dilated convolution is also a type ofdata augmentation technique Experiments on CIFAR-10100 and SVHN withsix network models demonstrated that PGP outperformed the base CNN andthe CNN with dilated convolutions PGP also obtained better results than thebase CNN and the CNN with dilated convolutions on the ImageNet dataset forimage classification and the NUS-WIDE and MS-COCO datasets for multi-labelimage classification


Table 7. Quantitative results on NUS-WIDE and MS-COCO (mAP and macro/micro F1, precision, and recall, in %). DConv: dilated convolution; DRN: dilated residual networks; PGP: parallel grid pooling.

Dataset     Network      Params   Arch    mAP    Macro-F  Macro-P  Macro-R  Micro-F  Micro-P  Micro-R
NUS-WIDE    ResNet-50    25.6M    Base    55.7   52.1     63.3     47.0     70.2     75.3     65.8
                                  DConv   55.3   51.7     63.3     46.6     70.0     75.7     65.2
                                  DRN     55.9   52.2     43.8     47.0     70.1     75.8     65.3
                                  PGP     56.1   51.9     64.4     46.2     70.0     76.4     64.6
            ResNet-101   44.5M    Base    56.5   53.0     63.6     48.3     70.4     75.4     66.1
                                  DConv   56.3   53.4     62.7     49.2     70.5     74.9     66.5
                                  DRN     56.8   53.1     63.1     48.9     70.5     75.1     66.5
                                  PGP     57.2   53.5     64.6     48.7     70.6     75.7     66.2
MS-COCO     ResNet-50    25.6M    Base    73.4   67.5     78.2     60.9     72.8     82.2     65.4
                                  DConv   72.7   66.6     77.9     59.7     72.2     81.9     64.5
                                  DRN     74.2   68.3     78.8     61.8     73.4     82.5     66.1
                                  PGP     74.2   67.9     80.0     60.5     73.1     83.7     64.9
            ResNet-101   44.5M    Base    74.6   68.9     77.6     62.8     73.8     82.2     66.9
                                  DConv   74.0   68.3     78.0     62.0     73.2     81.6     66.4
                                  DRN     75.3   69.7     78.2     63.8     74.3     81.8     68.1
                                  PGP     75.5   69.4     79.8     62.9     74.2     83.1     67.0
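The close relationship between the dilated-convolution-based rows (DConv, DRN) and the PGP rows in the tables above, restated in the conclusion, can also be checked numerically: a 3×3 convolution with dilation rate 2 applied to a zero-padded feature map equals splitting the map into four PGP branches, convolving each branch with the same kernel, and interleaving the outputs back onto the original grid. The snippet below is a small sanity check of this identity, again written in PyTorch with arbitrary tensor sizes chosen purely for illustration, not taken from the authors' implementation.

import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 16, 16)    # dummy feature map (N, C, H, W) with even H and W
w = torch.randn(8, 8, 3, 3)      # one shared 3x3 kernel

# (a) dilated convolution with rate 2; zero padding of 2 keeps the spatial size
y_dilated = F.conv2d(x, w, padding=2, dilation=2)

# (b) PGP with s = 2, a standard convolution on each branch, then inverse PGP
y_pgp = torch.empty_like(y_dilated)
for i in range(2):
    for j in range(2):
        branch = x[:, :, i::2, j::2]                      # pixels at grid offset (i, j)
        y_pgp[:, :, i::2, j::2] = F.conv2d(branch, w, padding=1)

print(torch.allclose(y_dilated, y_pgp, atol=1e-5))        # expected: True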


References

1. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
2. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
3. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
4. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: ECCV (2016)
5. Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. In: CVPR (2017)
6. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: CVPR (2017)
7. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI (2017)
8. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
9. Gross, S., Wilber, M.: Training and investigating residual nets. Facebook AI Research [Online]: http://torch.ch/blog/2016/02/04/resnets.html (2016)
10. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)
11. Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
12. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: ECCV (2016)
13. Zeiler, M.D., Fergus, R.: Stochastic pooling for regularization of deep convolutional neural networks. In: ICLR (2013)
14. Graham, B.: Fractional max-pooling. arXiv preprint arXiv:1412.6071 (2014)
15. Lee, C.Y., Gallagher, P.W., Tu, Z.: Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In: AISTATS (2016)
16. Zhai, S., Wu, H., Kumar, A., Cheng, Y., Lu, Y., Zhang, Z., Feris, R.: S3Pool: Pooling with stochastic spatial sampling. In: CVPR (2017)
17. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
18. Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
19. Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A.v.d., Graves, A., Kavukcuoglu, K.: Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 (2016)
20. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)
21. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611 (2018)
22. Yang, Z., Hu, Z., Salakhutdinov, R., Berg-Kirkpatrick, T.: Improved variational autoencoders for text modeling using dilated convolutions. In: ICML (2017)
23. Hinton, G.E.: Learning multiple layers of representation. Trends in Cognitive Sciences 11(10) (2007) 428–434
24. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Machine Learning Systems, Volume 2011 (2011)
25. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3) (2015) 211–252
26. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
27. Ratner, A.J., Ehrenberg, H., Hussain, Z., Dunnmon, J., Re, C.: Learning to compose domain-specific transformations for data augmentation. In: NIPS (2017)
28. Hauberg, S., Freifeld, O., Larsen, A.B.L., Fisher, J., Hansen, L.: Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In: AISTATS (2016)
29. Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. arXiv preprint arXiv:1708.04896 (2017)
30. DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
31. Yamada, Y., Iwamura, M., Kise, K.: ShakeDrop regularization. arXiv preprint arXiv:1802.02375 (2018)
32. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 (2018)
33. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: ICLR (2018)
34. Yuji, T., Yoshitaka, U., Tatsuya, H.: Between-class learning for image classification. In: CVPR (2018)
35. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. JAIR 16 (2002) 321–357
36. DeVries, T., Taylor, G.W.: Dataset augmentation in feature space. In: ICLR Workshop (2017)
37. Fawzi, A., Samulowitz, H., Turaga, D., Frossard, P.: Adaptive data augmentation for image classification. In: ICIP (2016)
38. Wang, X., Shrivastava, A., Gupta, A.: A-Fast-RCNN: Hard positive generation via adversary for object detection. In: CVPR (2017)
39. Lemley, J., Bazrafkan, S., Corcoran, P.: Smart augmentation: learning an optimal data augmentation strategy. IEEE Access (2017)
40. Dixit, M., Kwitt, R., Niethammer, M., Vasconcelos, N.: AGA: Attribute-guided augmentation. In: CVPR (2017)
41. Liu, B., Dixit, M., Kwitt, R., Vasconcelos, N.: Feature space transfer for data augmentation. In: CVPR (2018)
42. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
43. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. In: ICLR Workshop (2015)
44. Han, D., Kim, J., Kim, J.: Deep pyramidal residual networks. In: CVPR (2017)
45. Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with restarts. In: ICLR (2017)
46. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J.E., Weinberger, K.Q.: Snapshot ensembles: Train 1, get M for free. In: ICLR (2017)
47. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR (2018)
48. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV (2015)
49. Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: NIPS Workshop (2015)
50. Niitani, Y., Ogawa, T., Saito, S., Saito, M.: ChainerCV: a library for deep learning in computer vision. In: ACMMM (2017)
51. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In: NIPS (2017)
52. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: NUS-WIDE: a real-world web image database from National University of Singapore. In: ACM-CIVR (2009)
53. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
54. Zhu, F., Li, H., Ouyang, W., Yu, N., Wang, X.: Learning spatial regularization with image-level supervisions for multi-label image classification. In: CVPR (2017)
