
Multi-Sample Dropout for Accelerated Training and Better Generalization

Hiroshi Inoue
IBM Research - Tokyo, Tokyo, Japan

[email protected]

Abstract—Dropout is a simple but efficient regularization technique for achieving better generalization of deep neural networks (DNNs); hence it is widely used in tasks based on DNNs. During training, dropout randomly discards a portion of the neurons to avoid overfitting. This paper presents an enhanced dropout technique, which we call multi-sample dropout, for both accelerating training and improving generalization over the original dropout. The original dropout creates a randomly selected subset (called a dropout sample) from the input in each training iteration, while multi-sample dropout creates multiple dropout samples. The loss is calculated for each sample, and then the sample losses are averaged to obtain the final loss. This technique can be easily implemented by duplicating a part of the network after the dropout layer while sharing the weights among the duplicated fully connected layers. Experimental results using image classification tasks including ImageNet, CIFAR-10, and CIFAR-100 showed that multi-sample dropout accelerates training. Moreover, the networks trained using multi-sample dropout achieved lower error rates compared to networks trained with the original dropout.

Index Terms—deep neural network, dropout, regularization

I. INTRODUCTION

Dropout [8] is one of the key regularization techniques for improving the generalization of deep neural networks (DNNs). Because of its simplicity and efficiency, the original dropout and various similar techniques are widely used to train neural networks for various tasks. The use of dropout prevents the trained network from overfitting to the training data by randomly discarding (i.e., "dropping") 50% of the neurons at each training iteration. As a result, the neurons cannot depend on each other, and the trained network achieves better generalization. During inference, neurons are not discarded, so all information is preserved; instead, each outgoing value is multiplied by 0.5 to make the average value consistent with that at training time. The network used for inference can be viewed as an ensemble of many sub-networks randomly created during training. The success of dropout inspired the development of many techniques using various ways of selecting information to discard. For example, DropConnect [23] discards a portion of the connections between neurons randomly selected during training instead of randomly discarding neurons.

This paper reports multi-sample dropout, a dropout technique extended in a different way, especially for deep convolutional neural networks (CNNs). The original dropout creates a randomly selected subset (a dropout sample) from the input during training. The proposed multi-sample dropout creates multiple dropout samples. The loss is calculated for each sample, and then the sample losses are averaged to obtain the final loss used for back propagation. By calculating losses for M dropout samples and ensembling them, the network parameters are updated to achieve a smaller loss with any of these samples. This is similar to performing M training repetitions for each input image in the same minibatch. Therefore, it significantly reduces the number of iterations needed for training. With our multi-sample dropout, we do not discard neurons during inference, just as with the original dropout.
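To make the averaging step precise, the objective can be written as follows; the notation here is ours and is not taken from the paper. For an input $x$ with label $y$, independently drawn dropout masks $z_1, \dots, z_M$, a network $f$ with parameters $\theta$, and a per-sample loss $\ell$ (e.g., cross entropy),

$$L_{\text{multi}}(x, y; \theta) = \frac{1}{M} \sum_{m=1}^{M} \ell\big(f(x; \theta, z_m),\, y\big),$$

which reduces to the original dropout loss when $M = 1$.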

We observed that multi-sample dropout also improved the accuracy of the trained network with an increasing number of dropout samples. Noh et al. [18] showed that creating multiple noise samples by noise injection, such as dropout, during the training of deep networks gives stochastic gradient descent (SGD) optimizers a tighter lower bound on the marginal likelihood over the noise. Our multi-sample dropout is an easy and effective way to exploit the benefits of using multiple noise samples without adding a huge computation overhead.

In CNNs, dropout is typically applied to layers near the end of the network. VGG16 [19], for example, uses dropout for the two fully connected layers following its 13 convolution layers. Because the execution time for the fully connected layers is much shorter than that for the convolution layers, duplicating the fully connected layers for each of multiple dropout samples does not significantly increase the total execution time per iteration. Experiments using the ImageNet, CIFAR-10, and CIFAR-100 datasets showed that, with an increasing number of dropout samples created at each iteration, the improvements obtained (reduced number of iterations needed for training) became more significant at the expense of a longer execution time per iteration. Consideration of the reduced number of iterations along with the increased time per iteration revealed that the total training time was shortest with a moderate number of dropout samples, such as eight.

Multi-sample dropout can be easily implemented on various existing deep learning frameworks without adding a new operator by duplicating a part of the network after the dropout layer while sharing the weights among the fully connected layers duplicated for each dropout sample.

The main contribution of this paper is multi-sample dropout, a new regularization technique for accelerating the training of deep neural networks compared to the original dropout. Evaluation of multi-sample dropout on image classification tasks demonstrated that it increases accuracy on both the training and validation sets as well as accelerating training.


[Figure 1 omitted: block diagrams of the original dropout pipeline (convolution + relu, pooling, dropout, fully connected, softmax + loss func) and of our multi-sample dropout with two dropout samples, in which the "dropout," "fully connected," and "softmax + loss func" layers are duplicated with different dropout masks but shared weights, and the two losses are averaged.]

Fig. 1. Overview of original dropout and our multi-sample dropout.

II. MULTI-SAMPLE DROPOUT

A. Overview

This section describes the multi-sample dropout technique. The basic idea is quite simple: create multiple dropout samples instead of only one. Figure 1 depicts an easy way to implement multi-sample dropout (with two dropout samples) using an existing deep learning framework with only common operators. The dropout layer and several layers after the dropout are duplicated for each dropout sample; in the figure, the "dropout," "fully connected," and "softmax + loss func" layers are duplicated. Different masks are used for each dropout sample in the dropout layer so that a different subset of neurons is used for each dropout sample. In contrast, the parameters (i.e., connection weights) are shared between the duplicated fully connected layers. The loss is computed for each dropout sample using the same loss function, e.g., cross entropy, and the final loss value is obtained by averaging the loss values for all dropout samples. This final loss value is used as the objective function for optimization during training. We select the class label as the prediction based on the average of the outputs from the last fully connected layer. Although a configuration with two dropout samples is shown in Figure 1, multi-sample dropout can be configured to use any number of dropout samples. The original dropout can be seen as a special case of multi-sample dropout where the number of samples is set to one.
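As a concrete illustration, the following PyTorch-style sketch shows one way to build such a classification head. It is our minimal example, not the implementation used in the experiments; the layer sizes, default dropout ratio, and the MultiSampleDropoutHead name are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSampleDropoutHead(nn.Module):
    """Minimal sketch of a multi-sample dropout classification head.

    The fully connected layer is shared across all dropout samples; only
    the dropout mask differs between samples. Layer sizes and the module
    name are illustrative assumptions, not taken from the paper's code.
    """

    def __init__(self, in_features=512, num_classes=10,
                 num_samples=8, drop_ratio=0.3):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes)  # shared weights
        self.dropouts = nn.ModuleList(
            [nn.Dropout(p=drop_ratio) for _ in range(num_samples)]
        )

    def forward(self, features, targets=None):
        # One forward pass through the shared layer per dropout sample.
        logits_list = [self.fc(drop(features)) for drop in self.dropouts]
        # Prediction: average the outputs of the shared fully connected layer.
        logits = torch.stack(logits_list).mean(dim=0)
        if targets is None:
            return logits
        # Training: average the per-sample cross-entropy losses.
        loss = torch.stack(
            [F.cross_entropy(l, targets) for l in logits_list]
        ).mean()
        return logits, loss
```

At inference (`model.eval()` with `targets=None`), the dropout layers act as identities, so all samples coincide and a single pass through the shared layer suffices, matching the description below.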

During inference, neurons are not discarded, just as in the original dropout. The loss can be calculated using only one dropout sample because the dropout samples become identical at inference time if we do not drop any neurons at the dropout layer. Hence, we always use only one dropout sample at inference regardless of the training method.

Compared to Importance Weighted Stochastic Gradient Descent (IWSGD) [18], which also makes multiple samples by dropout, we duplicate operations only in the part of the forward pass after the dropout, while IWSGD duplicates operations in the entire backward pass as well as the forward pass. Hence our multi-sample dropout is much more lightweight in terms of computation cost. Especially when dropout is applied to a layer near the end of the network, the additional execution time due to the duplicated operations in multi-sample dropout is not significant; this characteristic makes multi-sample dropout more suitable for deep CNNs. We can apply multi-sample dropout to shallow networks, such as the multilayer perceptron. We observed that multi-sample dropout reduces the number of iterations needed for training even for shallow networks, but the cost of the increased execution time per iteration surpassed the benefits; because of the increase in computation time per iteration, multi-sample dropout actually degraded the training speed in terms of wall-clock time.

If the network includes multiple dropout layers, we can apply multi-sample dropout at any of them. Multi-sampling at an earlier dropout layer may increase the diversity among dropout samples and thus increase the benefits, in exchange for higher additional costs due to more duplicated layers.

B. Why multi-sample dropout accelerates training

Intuitively, the effect of multi-sample dropout with M dropout samples is similar to that of enlarging the minibatch M times by duplicating each sample in the minibatch M times. For example, if a minibatch consists of two data samples 〈A, B〉, training a network using multi-sample dropout with two dropout samples closely corresponds to training a network using the original dropout and a minibatch of 〈A, A, B, B〉, assuming a different mask is applied to each sample in the minibatch. This is similar to batch augmentation [9], which applies a different data augmentation to each of the duplicated samples to create diversity among them. Using a larger minibatch size with duplicated samples may not be a sensible way to accelerate training because it increases the computation time per iteration by M times. In contrast, multi-sample dropout can enjoy similar gains without a huge increase in computation cost per iteration for deep CNNs because it duplicates only the operations after the dropout. For example, when we duplicated the last two fully connected layers of VGG16 [19] eight times, we observed that the execution time per iteration increased by only 2%. Because of the non-linearity of the activation functions, the original dropout with duplicated samples and multi-sample dropout do not give exactly the same results. However, similar acceleration was observed in training in terms of the number of iterations, as shown by the experimental results.
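The correspondence can be made concrete with a small, self-contained PyTorch sketch (ours, not the paper's code); the feature dimension, class count, dropout ratio, and M below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch contrasting the two views described above.
torch.manual_seed(0)
M, feat_dim, num_classes = 2, 512, 10
features = torch.randn(4, feat_dim)            # minibatch of 4 feature vectors
targets = torch.randint(0, num_classes, (4,))
fc = nn.Linear(feat_dim, num_classes)          # shared classifier weights
dropout = nn.Dropout(p=0.3)

# (1) Multi-sample dropout: run the cheap head M times with different masks
# on the same features, then average the per-sample losses.
loss_multi = torch.stack(
    [F.cross_entropy(fc(dropout(features)), targets) for _ in range(M)]
).mean()

# (2) Original dropout with an M-times duplicated minibatch: the whole
# forward pass (including the convolutions that produced `features`)
# would have to process M times as many rows.
dup_features = features.repeat_interleave(M, dim=0)
dup_targets = targets.repeat_interleave(M, dim=0)
loss_dup = F.cross_entropy(fc(dropout(dup_features)), dup_targets)

# Both losses average per-sample cross entropy over M dropout masks per
# input; (1) avoids re-running the expensive layers before the dropout.
```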

C. Why multi-sample dropout yields higher accuracy

Noh et al. [18] showed that creating multiple samples during the training of deep networks improves the accuracy of the trained network. Training a noisy network (e.g., one with dropout) requires optimizing the marginal likelihood over the noise ($L_{\text{marginal}}$), and SGD optimizers optimize the network using an approximated marginal likelihood based on a finite number of samples ($L_{\text{SGD}}$) as the objective function. Here, the SGD objective $L_{\text{SGD}}$ is a lower bound of the marginal likelihood over the noise, and using more dropout samples makes the lower bound tighter, i.e.,

$$L_{\text{marginal}} \geq L_{\text{SGD}}(M+1) \geq L_{\text{SGD}}(M),$$

where $L_{\text{SGD}}(M)$ denotes $L_{\text{SGD}}$ when M dropout samples are used. This results in better accuracy of the trained network with an increasing number of dropout samples. Although Noh's Importance Weighted Stochastic Gradient Descent (IWSGD) creates multiple noise (dropout) samples at the dropout layer like our multi-sample dropout, it executes both the forward pass and backward pass separately for each sample, and then calculates the gradients for updating the network parameters as a weighted average of the gradients calculated for each dropout sample, with the normalized likelihood of the sample as the weight. Our results show that a much simpler and lighter-weight technique, which duplicates only a small part of the forward pass, can enjoy the benefits of using multiple dropout samples.
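Stated in our own notation (a paraphrase of the description above, not a formula taken from either paper): if $\ell_m$ is the loss and $p_m$ the likelihood of the target for the $m$-th dropout sample, IWSGD updates with a likelihood-weighted average of per-sample gradients, whereas multi-sample dropout simply differentiates the unweighted average loss,

$$\nabla_\theta L_{\text{IWSGD}} = \sum_{m=1}^{M} \frac{p_m}{\sum_{m'=1}^{M} p_{m'}}\, \nabla_\theta \ell_m, \qquad \nabla_\theta L_{\text{multi}} = \frac{1}{M} \sum_{m=1}^{M} \nabla_\theta \ell_m.$$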

D. Other sources of diversity among samples

The key to faster training with multi-sample dropout is the diversity among dropout samples; if there is no diversity, the multi-sampling technique gives no gain and simply wastes computation resources. Although we tested only dropout in this paper, the multi-sampling technique can be used with other sources of diversity. For example, variants of dropout, such as DropConnect, can be enhanced by using the multi-sampling technique.

III. EXPERIMENTAL RESULTS

A. Implementation

This section describes the effects of using multi-sample dropout for various image classification tasks based on the ImageNet, CIFAR-10, and CIFAR-100 datasets. For ImageNet, as well as the full dataset with 1,000 classes, a reduced dataset with only the first 100 classes was tested (ImageNet-100). For most of the experiments, we use eight as the number of dropout samples, which generally gives a good tradeoff between benefits and additional cost. For CIFAR-10 and CIFAR-100, an 8-layer network with six convolutional layers and batch normalization [12] followed by two fully connected layers with dropout was used. This network executes dropout twice with dropout ratios of 30%, which were tuned for the original dropout but are used here for all cases unless otherwise specified. The same network architecture, except for the number of neurons in the output layer, was used for the CIFAR-10 (10 output neurons) and CIFAR-100 (100 output neurons) datasets. The network was trained using the Adam optimizer [13] with a batch size of 100. These tasks were run on an NVIDIA K20m GPU.

For the ImageNet datasets, VGG16 was used as the network architecture, and the network was trained using stochastic gradient descent with momentum as the optimization method with a batch size of 100 samples. The initial learning rate of 0.01 was exponentially decayed by multiplying it by 0.92 at each epoch. Weight decay regularization was used with a decay rate of $5 \cdot 10^{-4}$, following the original paper. In the VGG16 architecture, dropout was applied to the first two fully connected layers with a 20% dropout ratio. An NVIDIA V100 GPU was used for training with the ImageNet datasets. For all datasets, data augmentation was used by extracting a patch from a random position of the input image and by performing random horizontal flipping during training [14]. For ImageNet, we additionally applied random resizing and tilting. For the validation set, the patch from the center position was extracted and fed into the classifier without any modifications. All tests were executed five times, and the averages of the five results are shown in the figures with 95% confidence intervals.
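For reference, the following is a minimal sketch of the optimization setup described above (SGD with momentum, initial learning rate 0.01 multiplied by 0.92 each epoch, weight decay 5e-4, batch size 100). The momentum value of 0.9, the tiny stand-in model, and the synthetic data are our assumptions, used only to keep the sketch self-contained; the real experiments use VGG16 with the multi-sample dropout head.

```python
import torch
import torch.nn as nn

# Stand-in model and data so the sketch runs on its own; the experiments
# use VGG16 with the multi-sample dropout head on full-resolution ImageNet.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.92)

for epoch in range(2):                 # a couple of epochs for the sketch
    for _ in range(3):                 # stand-in for the real data loader
        images = torch.randn(100, 3, 32, 32)       # batch size 100
        targets = torch.randint(0, 100, (100,))
        optimizer.zero_grad()
        loss = criterion(model(images), targets)   # in the real setup this is
        loss.backward()                            # the averaged per-sample loss
        optimizer.step()
    scheduler.step()                   # learning rate *= 0.92 each epoch
```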

B. Improvements by multi-sample dropout

Figure 2 plots the trends in validation errors and training errors against training time for three configurations: trained with the original dropout, with multi-sample dropout, and without dropout. For multi-sample dropout, the losses for eight dropout samples were averaged. How the number of dropout samples affects performance is discussed in the next section. The figure shows that multi-sample dropout achieved faster training than the original dropout for all datasets, i.e., both errors became smaller within the same training time. As is common with regularization techniques, dropout achieves better generalization (i.e., lower validation error rates) than the "without dropout" case at the expense of slower training. Multi-sample dropout alleviates this slowdown while still achieving better generalization.

Table I summarizes the final validation error rates and training error rates. After training, the networks trained with multi-sample dropout had lower error rates, on average, for all datasets than those trained with the original dropout. Note that the improvements in validation errors for the ImageNet datasets were not significant given the confidence intervals.

[Figure 2 omitted: eight panels plotting training and validation error rates against execution time (minutes) for CIFAR-10, CIFAR-100, ImageNet-100, and ImageNet, each comparing original dropout (1 sample), multi-sample dropout (8 samples), and no dropout.]

Fig. 2. Trends in error rates for the validation set and training set against training time for original dropout and multi-sample dropout (averaged over five training runs). Multi-sample dropout achieved faster convergence than the original dropout. Note that training errors for multi-sample dropout include the effects of the inherent ensemble mechanism, but no ensemble was used when evaluating validation sets.

TABLE I
FINAL VALIDATION AND TRAINING ERROR RATES OF TRAINED NETWORKS WITH ORIGINAL DROPOUT, WITH MULTI-SAMPLE DROPOUT (8 SAMPLES), AND WITHOUT DROPOUT (AVERAGE OF FIVE RUNS WITH 95% CONFIDENCE INTERVALS).

             |              validation error rate               |                training error rate
dataset      | original dropout | multi-sample  | no dropout    | original dropout | multi-sample   | no dropout
CIFAR-10     | 7.95% ± 0.18%    | 7.38% ± 0.08% | 9.33% ± 0.17% | 0.51% ± 0.00%    | 0.02% ± 0.00%  | 0.20% ± 0.02%
CIFAR-100    | 30.9% ± 0.24%    | 29.8% ± 0.26% | 36.6% ± 0.23% | 4.87% ± 0.02%    | 0.08% ± 0.00%  | 0.64% ± 0.00%
ImageNet-100 | 25.1% ± 0.65%    | 24.4% ± 0.30% | 25.8% ± 0.40% | 10.8% ± 0.22%    | 7.78% ± 0.97%  | 7.41% ± 0.84%
ImageNet     | 27.5% ± 0.09%    | 27.4% ± 0.04% | 27.6% ± 0.10% | 17.8% ± 0.07%    | 15.48% ± 0.02% | 14.0% ± 0.04%

[Figure 3 omitted: two panels of CIFAR-10 validation error rates, (a) against the number of training epochs and (b) against wall-clock execution time, comparing original dropout (1 sample) with multi-sample dropout using 2, 8, and 32 samples.]

Fig. 3. Validation errors during training for CIFAR-10 using different numbers of dropout samples. Using more dropout samples makes convergence faster in terms of the number of iterations at the cost of increased execution time per iteration; an excessively large number of dropout samples may hurt the training speed.

TABLE II
EXECUTION TIME PER ITERATION RELATIVE TO THAT OF ORIGINAL DROPOUT FOR DIFFERENT NUMBERS OF DROPOUT SAMPLES. INCREASING THE NUMBER OF DROPOUT SAMPLES LENGTHENED THE COMPUTATION TIME PER ITERATION. WE FAILED TO EXECUTE VGG16 WITH 16 DROPOUT SAMPLES DUE TO AN OUT-OF-MEMORY ERROR WITH THE CURRENT MINIBATCH SIZE.

          |             | original dropout | multi-sample dropout
dataset   | network     | 1 sample | 2 samples | 4 samples | 8 samples | 16 samples | 32 samples | 64 samples
CIFAR-10  | 8-layer CNN | 1.00     | 1.02      | 1.07      | 1.17      | 1.36       | 1.99       | 3.38
CIFAR-100 | 8-layer CNN | 1.00     | 1.02      | 1.07      | 1.17      | 1.37       | 2.06       | 3.29
ImageNet  | VGG16       | 1.00     | 1.00      | 1.01      | 1.02      | -          | -          | -

C. Effects of parameters on performance

Number of dropout samples: Figures 3(a) and 3(b) compare the validation errors for different numbers of dropout samples (1, 2, 8, and 32) for CIFAR-10 against the number of training epochs and the wall-clock training time, respectively. Using a larger number of dropout samples made training progress faster in terms of the number of epochs (iterations) as well as making the final error rates lower.

The accelerated training in terms of the number of iterations due to using more dropout samples came at a cost: increased execution time per iteration. Consideration of the increased execution time per iteration along with the reduced number of iterations revealed that multi-sample dropout achieves the largest speedup in training time when a moderate number of dropout samples, such as eight, is used, as shown in Figure 3(b). Using an excessive number of dropout samples may actually slow down the training.

The execution time per iteration relative to that of the original dropout is shown in Table II for different numbers of dropout samples. The VGG16 network architecture was used for the ImageNet dataset, and a smaller 8-layer CNN was used for the other datasets, as mentioned above. Because a larger network tends to spend more time in the deep convolutional layers than in the fully connected layers, which are the layers duplicated in our multi-sample dropout technique, the overhead in execution time compared with that of the original dropout is more significant for the smaller network than for the VGG16 architecture. Multi-sample dropout with eight dropout samples increased the execution time per iteration by about 1.97% for the VGG16 architecture and by about 17.2% for the small network. Hence, larger networks may benefit from multi-sample dropout more than smaller networks. For very small networks, the increase in execution time per iteration may surpass the benefits.

[Figure 4 omitted: four panels (CIFAR-10 validation error, CIFAR-10 training error, CIFAR-100 validation error, CIFAR-100 training error) showing final error rates against the number of dropout samples (1, 2, 4, 8, 16, 32, 64, and no dropout).]

Fig. 4. Final validation and training error rates with different numbers of dropout samples. Multi-sample dropout achieved lower error rates with more dropout samples.

[Figure 5 omitted: panel (a) plots final validation error rates against dropout ratio (10% to 90%, plus no dropout) for original and multi-sample dropout; panel (b) plots validation error rates against training epoch for dropout ratios of 30% and 90%.]

Fig. 5. (a) Validation error rates and (b) progress of validation errors during training with original and multi-sample dropout for various dropout ratios. Multi-sample dropout works regardless of the dropout ratio.

The final validation error rates and training error rates of networks trained using different numbers of dropout samples are shown in Figure 4 for CIFAR-10 and CIFAR-100. Multi-sample dropout achieved lower error rates as the number of dropout samples was increased. The gains in the validation errors were relatively small when the number was increased above eight, considering the increased computation costs shown in Table II.

From these observations, we determined that eight is a reasonable value for the number of dropout samples, and it is used in the other experiments.

Dropout ratio: Another important parameter is the dropout ratio, which controls the ratio of neurons to discard. In the 8-layer CNN used for the CIFAR datasets, we used 30% as the ratio in the two dropout layers. This value was tuned for the original dropout but is also used for multi-sample dropout. Here we show how multi-sample dropout works for the CIFAR-10 dataset with various dropout ratios: 10%, 30% (default), 50%, 70%, and 90%.

Figure 5(a) shows the final validation error rates. Regardless of the dropout ratio setting, multi-sample dropout consistently achieved lower error rates than the original dropout.

TABLE III
FINAL VALIDATION AND TRAINING ERROR RATES FOR RESNET WITH AND WITHOUT MULTI-SAMPLE DROPOUT (USING 8 DROPOUT SAMPLES).

          |              validation error rate              |              training error rate
dataset   | original dropout | multi-sample  | no dropout    | original dropout | multi-sample  | no dropout
CIFAR-10  | 6.98% ± 0.09%    | 6.89% ± 0.18% | 6.95% ± 0.06% | 0.10% ± 0.00%    | 0.07% ± 0.01% | 0.09% ± 0.00%
CIFAR-100 | 30.0% ± 0.52%    | 29.8% ± 0.21% | 30.2% ± 0.14% | 0.48% ± 0.00%    | 0.12% ± 0.00% | 0.28% ± 0.00%

[Figure 6 omitted: two panels of CIFAR-10 training loss for a 4-layer multilayer perceptron, against the number of epochs and against wall-clock time (seconds), comparing original dropout (1 sample) and multi-sample dropout (4 samples).]

Fig. 6. Comparison of training loss using a shallow network (a 4-layer multilayer perceptron) with and without multi-sample dropout for CIFAR-10. The additional computation cost of multi-sample dropout becomes significant in shallow networks, while it is quite small for deep convolutional networks.

When excessively high dropout ratios, such as 90%, were used, dropout degraded the validation error rate compared with the "without dropout" case. Even with such ratios, multi-sample dropout achieved improvements over the original dropout.

Figure 5(b) compares the convergence trends of validation errors for dropout ratios of 30% (default) and 90%. When an excessively high dropout ratio of 90% was used, the speedup provided by multi-sample dropout was much more significant.

These results show that multi-sample dropout does not depend on a specific dropout ratio to achieve improvements and that it can be used with a wide range of dropout ratio settings.

D. Effect of multi-sample dropout when original dropout does not work

Our multi-sample dropout can magnify the benefits of the original dropout. When the original dropout works poorly for improving accuracy, multi-sample dropout may also work poorly. For example, it is known that adding dropout at the end of the ResNet architecture [7], e.g., after the global average pooling layer, does not improve the final accuracy; hence ResNet typically does not employ dropout after the final pooling layer. We tested ResNet with multi-sample dropout after the pooling layer using CIFAR-10 and CIFAR-100. Table III summarizes the performance with and without multi-sample dropout. The gain from multi-sample dropout in the validation error rates was smaller than the gain with our 8-layer CNN for these datasets (shown in Table I). Whether dropout works well or not depends on many aspects of the workload, e.g., the network architecture used, the amount of training data, and other regularization techniques (e.g., [17]). These characteristics also matter for multi-sample dropout.

E. Applying multi-sample dropout for shallow networks

As discussed in Section II.B, multi-sample dropout mainly targets deep convolutional neural networks, in which most of the computation time is consumed in the convolution layers before the dropout. Here, we show the effect of multi-sample dropout on shallow networks using a multilayer perceptron as an example. We use a network consisting of four fully connected layers, each with 2,000 neurons, and apply dropout to each fully connected layer. For multi-sampling, we created four dropout samples at the last fully connected layer; i.e., only one layer is duplicated. Figure 6 shows the training loss for the CIFAR-10 dataset with and without multi-sample dropout. Multi-sample dropout yields a smaller training loss than the original dropout after the same number of iterations (epochs). However, due to the increase in computation time per iteration, multi-sample dropout actually degraded the training speed in terms of training time; the execution time per iteration increased by more than 50% even though we created only four dropout samples. Hence, to make multi-sample dropout effective, it is important to apply multi-sampling near the end of the network to limit the number of operations duplicated for multiple dropout samples.

F. Multi-sample dropout and duplicating samples in the same minibatch

As discussed in Section II.B, the effect of multi-sample dropout with M dropout samples is similar to that of enlarging the size of a minibatch M times by duplicating each sample in the minibatch M times.

[Figure 7 omitted: two panels (CIFAR-10 and CIFAR-100) plotting validation error rate against epoch for original dropout (1 sample), original dropout with 2x and 4x duplicated minibatches, and multi-sample dropout with 2 and 4 samples.]

Fig. 7. Comparison of original dropout with data duplication in the minibatch and multi-sample dropout. The x-axis shows the number of epochs. Both techniques yield similar improvements in accuracy, while the computation cost is much smaller for multi-sample dropout.

This is the primary reason for the accelerated training with multi-sample dropout, as illustrated in Figure 7: the validation errors with multi-sample dropout match well with those of the original dropout using duplicated data in a minibatch. Training errors and training losses also match well, although they are not shown here. If the same sample is included in a minibatch multiple times, the results from the multiple copies are ensembled when the parameters are updated, even if there is no explicit ensembling in the network. Duplicating a sample in the input and ensembling the results at the parameter update seems to have an effect on training quite similar to that of multi-sample dropout, which duplicates a sample at the dropout layer and ensembles the results at the end of the forward pass. However, duplicating the data M times makes the execution time per iteration (and hence the total training time) M times longer than without duplication. Multi-sample dropout achieves similar benefits at a much smaller computation cost.

IV. RELATED WORK

The multi-sample dropout regularization technique presented in this paper can achieve better generalization and faster training than the original dropout. Dropout is one of the most widely used regularization techniques, but a wide variety of other regularization techniques for better generalization have been reported. They include, for example, weight decay [15], data augmentation [3], [2], [24], [11], label smoothing [21], and batch normalization [12]. Although batch normalization is aimed at accelerating training, it also improves generalization. Many of these techniques are network independent, while others, such as Shake-Shake [5] and Drop-Path [16], are specialized for a specific network architecture.

The success of dropout led to the development of many variations that extend the basic idea of dropout (e.g., [6], [10], [22], [4]). The reported techniques use a variety of ways to randomly drop information in the network. For example, DropConnect [23] discards randomly selected connections between neurons. DropBlock [6] randomly discards areas in convolution layers, while dropout is typically used in fully connected layers after the convolution layers. Stochastic Depth [10] randomly skips layers in a very deep network. However, none of these techniques use the approach used in our multi-sample dropout. Many of them can be combined with the multi-sampling technique to create the diversity among dropout samples. Another way to enhance dropout is adaptively tuning the dropout ratio (e.g., [1]). These techniques are also orthogonal to the multi-sampling technique, since multi-sample dropout does not depend on a specific dropout ratio, as we have already shown.

Multi-sample dropout calculates the final prediction and loss by averaging the results from multiple loss functions. Several network architectures have multiple exits with loss functions. For example, GoogLeNet [20] has two early exits in addition to the main exit, and the final prediction is made using a weighted average of the outputs from these three loss functions. Unlike multi-sample dropout, GoogLeNet creates the two additional exits at earlier positions in the network. Multi-sample dropout creates multiple uniform exits, each with a loss function, by duplicating a part of the network.

V. SUMMARY

In this paper, we described multi-sample dropout, a regularization technique for accelerating training and improving generalization. The key is creating multiple dropout samples at the dropout layer, whereas the original dropout creates only one sample. Multi-sample dropout can be easily implemented using existing deep learning frameworks by duplicating a part of the network after the dropout layer. Experimental results using image classification tasks demonstrated that multi-sample dropout reduces training time and improves accuracy. Because of its simplicity, the basic idea of the multi-sampling technique can be used in a wide range of neural network applications and tasks.

REFERENCES

[1] Lei Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. In Annual Conference on Neural Information Processing Systems (NIPS), pages 3084-3092, 2013.

[2] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16(1):321-357, 2002.

[3] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation policies from data. arXiv:1805.09501, 2018.

[4] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552, 2017.

[5] Xavier Gastaldi. Shake-shake regularization. arXiv:1705.07485, 2017.

[6] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. DropBlock: A regularization method for convolutional networks. In Annual Conference on Neural Information Processing Systems (NIPS), pages 10727-10737, 2018.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.

[8] Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.

[9] Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, and Daniel Soudry. Augment your batch: Improving generalization through instance repetition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[10] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv:1603.09382, 2016.

[11] Hiroshi Inoue. Data augmentation by pairing samples for images classification. arXiv:1801.02929, 2018.

[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.

[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Annual Conference on Neural Information Processing Systems (NIPS), pages 1106-1114, 2012.

[15] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In Annual Conference on Neural Information Processing Systems (NIPS), pages 950-957, 1991.

[16] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. In International Conference on Learning Representations (ICLR), 2017.

[17] Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[18] Hyeonwoo Noh, Tackgeun You, Jonghwan Mun, and Bohyung Han. Regularizing deep neural networks by noise: Its interpretation and optimization. In Annual Conference on Neural Information Processing Systems (NIPS), pages 5115-5124, 2017.

[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

[20] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[21] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[22] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christopher Bregler. Efficient object localization using convolutional networks. arXiv:1411.4280, 2014.

[23] Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In International Conference on Machine Learning (ICML), pages III-1058-III-1066, 2013.

[24] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv:1710.09412, 2017.