What makes ImageNet good for transfer learning?

Minyoung Huh    Pulkit Agrawal    Alexei A. Efros
Berkeley Artificial Intelligence Research (BAIR) Laboratory
UC Berkeley
{minyoung,pulkitag,aaefros}@berkeley.edu

Abstract

The tremendous success of ImageNet-trained deep features on a wide range of transfer tasks raises the question: what is it about the ImageNet dataset that makes the learnt features as good as they are? This work provides an empirical investigation into various facets of this question, such as the importance of the number of examples, the number of classes, the balance between images-per-class and classes, and the role of fine- and coarse-grained recognition. We pre-train CNN features on various subsets of the ImageNet dataset and evaluate transfer performance on a variety of standard vision tasks. Our overall findings suggest that most changes in the choice of pre-training data, long thought to be critical, do not significantly affect transfer performance.

1. Introduction

It has become increasingly common within the computer vision community to treat image classification on ImageNet [35] not as an end in itself, but rather as a "pretext task" for training deep convolutional neural networks (CNNs [25, 22]) to learn good general-purpose features. This practice of first training a CNN to perform image classification on ImageNet (i.e. pre-training) and then adapting these features for a new target task (i.e. fine-tuning) has become the de facto standard for solving a wide range of computer vision problems. Using ImageNet pre-trained CNN features, impressive results have been obtained on several image classification datasets [10, 33], as well as object detection [12, 37], action recognition [38], human pose estimation [6], image segmentation [7], optical flow [42], image captioning [9, 19] and others [24].

Given the success of ImageNet pre-trained CNN features, it is only natural to ask: what is it about the ImageNet dataset that makes the learnt features as good as they are? One school of thought believes that it is the sheer size of the dataset (1.2 million labeled images) that forces the representation to be general. Others argue that it is the large number of distinct object classes (1000), which forces the network to learn a hierarchy of generalizable features. Yet others believe that the secret sauce is not just the large number of classes, but the fact that many of these classes are visually similar (e.g. many different breeds of dogs), turning this into a fine-grained recognition task and pushing the representation to "work harder". But, while almost everyone in computer vision seems to have their own opinion on this hot topic, little empirical evidence has been produced so far.

In this work, we systematically investigate which aspects of the ImageNet task are most critical for learning good general-purpose features. We evaluate the features by fine-tuning on three tasks: object detection on the PASCAL-VOC 2007 dataset (PASCAL-DET), action classification on the PASCAL-VOC 2012 dataset (PASCAL-ACT-CLS) and scene classification on the SUN dataset (SUN-CLS); see Section 3 for more details.

The paper is organized as a set of experiments answering a list of key questions about feature learning with ImageNet. The following is a summary of our main findings:

1. How many pre-training ImageNet examples are sufficient for transfer learning? Pre-training with only half the ImageNet data (500 images per class instead of 1000) results in only a small drop in transfer learning performance (1.5 mAP drop on PASCAL-DET). This drop is much smaller than the drop on the ImageNet classification task itself. See Section 4 and Figure 1 for details.

2. How many pre-training ImageNet classes are sufficient for transfer learning? Pre-training with an order of magnitude fewer classes (127 classes instead of 1000) results in only a small drop in transfer learning performance (2.8 mAP drop on PASCAL-DET). Curiously, we also found that for some transfer tasks, pre-training with fewer classes leads to better performance. See Section 5.1 and Figure 2 for details.


Figure 1: Change in transfer task performance of a CNN pre-trained with varying numbers of images per ImageNet class. The left y-axis is the mean class accuracy, used for SUN and ImageNet CLS. The right y-axis measures mAP for PASCAL DET and ACTION-CLS. The number of examples per class is reduced by random sampling. Accuracy on the ImageNet classification task increases faster than performance on the transfer tasks.

Figure 2: Change in transfer task performance with varying numbers of pre-training ImageNet classes. The number of ImageNet classes is varied using the technique described in Section 5.1. With only 486 pre-training classes, transfer performance is unaffected, and only a small drop is observed when only 79 classes are used for pre-training. ImageNet classification performance is measured by finetuning the last layer for the original 1000-way classification.

3. How important is fine-grained recognition for learning good features for transfer learning? Features pre-trained with a subset of ImageNet classes that do not require fine-grained discrimination still demonstrate good transfer performance. See Section 5.2 and Figure 2 for details.

4. Does pre-training on coarse classes produce features capable of fine-grained recognition (and vice versa) on ImageNet itself? We found that a CNN trained to classify only between the 127 coarse ImageNet classes produces features capable of telling apart fine-grained ImageNet classes whose labels it has never seen in training (Section 5.3). Likewise, a CNN trained to classify the 1000 ImageNet classes is able to distinguish between unseen coarse-level classes higher up in the WordNet hierarchy (Section 5.4).

5. Given the same budget of pre-training images, should we have more classes or more images per class? Training with fewer classes but more images per class performs slightly better at transfer tasks than training with more classes but fewer images per class. See Section 5.5 and Table 2 for details.

6. Is more data always helpful? We found that training with 771 ImageNet classes (out of 1000) that exclude all PASCAL VOC classes achieves nearly the same performance on PASCAL-DET as training on the complete ImageNet. Further experiments confirm that blindly adding more training data does not always lead to better performance and can sometimes hurt performance. See Section 6 and Figure 9 for more details.

2. Related Work

A number of papers have studied transfer learning in CNNs, including the various factors that affect pre-training and fine-tuning. For example, the question of whether pre-training should be terminated early to prevent over-fitting and which layers should be used for transfer learning was studied by [2, 44]. A thorough investigation of good architectural choices for transfer learning was conducted by [3], while [26] propose an approach to fine-tuning for new tasks without "forgetting" the old ones. In contrast to these works, we use a fixed fine-tuning procedure and instead study how the choice of pre-training data affects transfer performance.

One central downside of supervised pre-training is that a large quantity of expensive, manually-supervised training data is required. The possibility of using large amounts of unlabelled data for feature learning has therefore been very attractive. Numerous methods for learning features by optimizing some auxiliary criterion of the data itself have been proposed. The most well-known such criteria are image reconstruction [5, 36, 29, 27, 32, 20] (see [4] for a comprehensive overview) and feature slowness [43, 14]. Unfortunately, features learned using these methods turned out not to be competitive with those obtained from supervised ImageNet pre-training [31]. To try and force better feature generalization, more recent "self-supervised" methods use more difficult data prediction auxiliary tasks in an effort to make the CNNs "work harder". Attempted self-supervised tasks include prediction of ego-motion [1, 16], spatial context [8, 31, 28], temporal context [41], and even color [45, 23] and sound [30]. While features learned using these methods often come close to ImageNet performance, to date, none have been able to beat it.


Figure 3: An illustration of the bottom-up procedure used to construct different label sets using the WordNet tree. Each node of the tree represents a class and the leaf nodes are shown in red. Different label sets are iteratively constructed by clustering together all the leaf nodes with a common parent. In each iteration, only leaf nodes are clustered. This procedure results in a sequence of label sets for 1.2M images, where each consequent set contains labels coarser than the previous one. Because the WordNet tree is imbalanced, even after multiple iterations, label sets contain some classes that are present in the 1000-way ImageNet challenge.

A reasonable middle ground between the expensive, fully-supervised pre-training and free unsupervised pre-training is to use weak supervision. For example, [18] use the YFCC100M dataset of 100 million Flickr images labeled with noisy user tags for pre-training instead of ImageNet. But yet again, even though YFCC100M is almost two orders of magnitude larger than ImageNet, somewhat surprisingly, the resulting features do not appear to give any substantial boost over those pre-trained on ImageNet.

Overall, despite keen interest in this problem, alternative methods for learning general-purpose deep features have not managed to outperform ImageNet-supervised pre-training on transfer tasks.

The goal of this work is to try to understand the secret of ImageNet's continuing success.

3. Experimental Setup

The process of using supervised learning to initialize CNN parameters with the task of ImageNet classification is referred to as pre-training. The process of adapting a pre-trained CNN by continuing to train it on a target dataset is referred to as finetuning. All of our experiments use the Caffe [17] implementation of a single network architecture proposed by Krizhevsky et al. [22]. We refer to this architecture as AlexNet.

We closely follow the experimental setup of Agrawal et al. [2] for evaluating the generalization of pre-trained features on three transfer tasks: PASCAL VOC 2007 object detection (PASCAL-DET), PASCAL VOC 2012 action recognition (PASCAL-ACT-CLS) and scene classification on the SUN dataset (SUN-CLS).

• For PASCAL-DET, we used the PASCAL VOC 2007 train/val sets for finetuning, using the experimental setup and code provided by Faster-RCNN [34], and report performance on the test set. Finetuning on PASCAL-DET was performed by adapting the pre-trained convolution layers of AlexNet. The model was trained for 70K iterations using stochastic gradient descent (SGD), with an initial learning rate of 0.001 reduced by a factor of 10 at 40K iterations.

Pre-trained Dataset   PASCAL    SUN
Original              58.3      52.2
127 Classes           55.5      48.7
Random                41.3 [21] 35.7 [2]

Table 1: The transfer performance of a network pre-trained using the 127 (coarse) classes obtained after top-down clustering of the WordNet tree is comparable to the transfer performance obtained after pre-training on all 1000 ImageNet classes. This indicates that fine-grained recognition is not necessary for learning good transferable features.

• For PASCAL-ACT-CLS, we used the PASCAL VOC 2012 train/val sets for finetuning and testing, using the experimental setup and code provided by R*CNN [13]. The finetuning procedure for PASCAL-ACT-CLS mimics the procedure described for PASCAL-DET.

• For SUN-CLS, we used the same train/val/test splits as [2]. Finetuning on SUN was performed by first replacing the FC-8 layer of the AlexNet model with a randomly initialized fully connected layer with 397 output units. Finetuning was performed for 50K iterations using SGD with an initial learning rate of 0.001, which was reduced by a factor of 10 every 20K iterations.
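To make the finetuning recipe concrete, the sketch below mirrors the SUN-CLS setup in PyTorch. It is illustrative only: the paper itself uses Caffe and AlexNet, and the momentum value and the data loader are assumptions not specified in the text.

```python
# Illustrative PyTorch sketch of the SUN-CLS finetuning recipe above; the paper
# uses Caffe, and momentum / the data loader are assumptions.
import torch
import torch.nn as nn
import torchvision


def finetune_sun(loader, num_iters=50_000):
    model = torchvision.models.alexnet()          # ImageNet pre-trained weights would be loaded here
    model.classifier[6] = nn.Linear(4096, 397)    # replace FC-8 with a randomly initialized 397-way layer

    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20_000, gamma=0.1)

    it = 0
    while it < num_iters:
        for images, labels in loader:
            loss = nn.functional.cross_entropy(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()                      # 10x learning-rate drop every 20K iterations
            it += 1
            if it >= num_iters:
                return model
    return model
```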

Faster-RCNN and R*CNN are known to have varianceacross training runs; we therefore run it three times and re-port the mean ± standard deviation. On the other hand, [2],reports little variance between runs on SUN-CLS so we re-port our result using a single run.

In some experiments we pre-train on ImageNet using a different number of images per class. The model with 1000 images/class uses the original ImageNet ILSVRC 2012 training set. Models with N images/class for N < 1000 are trained by drawing a random sample of N images from all images of that class available in the ImageNet training set.
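A minimal sketch of this per-class subsampling is shown below; the authors' exact sampling code is not given in the paper, and `class_to_images` (a mapping from each synset to its list of training images) is a hypothetical input.

```python
import random


def subsample_imagenet(class_to_images, n_per_class, seed=0):
    """Keep all 1000 classes, but retain only a random subset of N images
    per class (sketch; `class_to_images` maps synset -> list of image paths)."""
    rng = random.Random(seed)
    return {cls: rng.sample(imgs, min(n_per_class, len(imgs)))
            for cls, imgs in class_to_images.items()}


# e.g. the five pre-training sets used in Section 4:
# subsets = {n: subsample_imagenet(class_to_images, n) for n in (50, 125, 250, 500, 1000)}
```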

4. How does the amount of pre-training data affect transfer performance?

To answer this question, we trained 5 different AlexNet models from scratch using 50, 125, 250, 500 and 1000 images per each of the 1000 ImageNet classes, following the procedure described in Section 3. The variation in performance with the amount of pre-training data when these models are finetuned for PASCAL-DET, PASCAL-ACT-CLS and SUN-CLS is shown in Figure 1. For PASCAL-DET, the mean average precision (mAP) for CNNs with 1000, 500 and 250 images/class is found to be 58.3, 57.0 and 54.6. A similar trend is observed for PASCAL-ACT-CLS and SUN-CLS. These results indicate that using half the amount of pre-training data leads to only a marginal reduction in performance on transfer tasks. It is important to note that performance on the ImageNet classification task (the pre-training task) steadily increases with the amount of training data, whereas on transfer tasks the performance increase from additional pre-training data is significantly slower. This suggests that while adding additional examples to ImageNet classes will improve ImageNet performance, it has diminishing returns for transfer task performance.

Figure 4: Does a CNN trained to discriminate between coarse classes learn a feature embedding capable of distinguishing between fine classes? We quantified this by measuring the induction accuracy, defined as follows: after training a feature embedding for a particular set of classes (set A), the induction accuracy is the nearest neighbor (top-1 and top-5) classification accuracy, measured in the FC8 feature space, on the subset of the 1000 ImageNet classes not present in set A. The syntax on the x-axis, A Classes (B), indicates that the network was trained with A classes and the induction accuracy was measured on B classes. The baseline accuracy is the accuracy on B classes when the CNN was trained for all 1000 classes. The margin between the baseline and the induction accuracy indicates the drop in the network's ability to distinguish fine classes when trained on coarse classes. The results show that features learnt by pre-training on just 127 classes still lead to fairly good induction.

5. How does the taxonomy of the pre-training task affect transfer performance?

In the previous section we investigated how varying the number of pre-training images per class affects performance on transfer tasks. Here we investigate the flip side: keeping the amount of data constant while changing the nomenclature of the training labels.

5.1. The effect of the number of pre-training classes on transfer performance

The 1000 classes of the ImageNet challenge [35] are derived from leaves of the WordNet tree [11]. Using this tree, it is possible to generate different class taxonomies while keeping the total number of images constant. One can generate taxonomies in two ways: (1) bottom-up clustering, wherein the leaf nodes belonging to a common parent are iteratively clustered together (see Figure 3), or (2) by fixing the distance of the nodes from the root node (i.e. top-down clustering). Using bottom-up clustering, 18 possible taxonomies can be generated. Among these, we chose 5 sets of labels constituting 918, 753, 486, 79 and 9 classes respectively. Using top-down clustering, only 3 label sets of 127, 10 and 2 classes can be generated, and we used the one with 127 classes. For studying the effect of the number of pre-training classes on transfer performance, we trained separate AlexNet CNNs from scratch using these label sets.
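The sketch below illustrates one plausible implementation of the bottom-up procedure of Figure 3: at each iteration, every current label that has no other current label beneath it (a leaf of the current label tree) is merged into its parent. The parent map and synset names are toy placeholders, not the authors' code.

```python
from collections import Counter


def ancestors(node, parent):
    """All ancestors of `node`, walking parent pointers up to the root."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def coarsen_once(labels, parent):
    """One bottom-up iteration: merge every current label that is a leaf of
    the current label tree (no other current label below it) into its parent."""
    current = set(labels.values())
    non_leaves = {a for c in current for a in ancestors(c, parent) if a in current}
    remap = {c: parent[c] if (c not in non_leaves and c in parent) else c
             for c in current}
    return {img: remap[c] for img, c in labels.items()}


# toy example with made-up synsets
parent = {"husky": "dog", "poodle": "dog", "tabby": "cat",
          "dog": "mammal", "cat": "mammal", "mammal": "entity"}
labels = {"img1": "husky", "img2": "poodle", "img3": "tabby"}
for _ in range(2):
    labels = coarsen_once(labels, parent)
    print(Counter(labels.values()))   # {'dog': 2, 'cat': 1}, then {'mammal': 3}
```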

Figure 2 shows the effect of the number of pre-training classes obtained using bottom-up clustering of the WordNet tree on transfer performance. We also include the performance of these different networks on the ImageNet classification task itself after finetuning only the last layer to distinguish between all 1000 classes. The results show that the increase in performance on transfer tasks with an increasing number of classes is significantly slower than the increase on ImageNet itself. Using only 486 classes results in a performance drop of 1.7 mAP for PASCAL-DET, 0.8% accuracy for SUN-CLS and a boost of 0.6 mAP for PASCAL-ACT-CLS. Table 1 shows the transfer performance after pre-training with the 127 classes obtained from top-down clustering. The results from this table and the figure indicate that only diminishing returns in transfer performance are observed when more than 127 classes are used. Our results also indicate that making the ImageNet classes finer will not help improve transfer performance.

It can be argued that the PASCAL task requires discrimination between only 20 classes and therefore pre-training with only 127 classes should not lead to a substantial reduction in performance. However, the trend also holds for SUN-CLS, which requires discrimination between 397 classes. These two results taken together suggest that although training with a large number of classes is beneficial, diminishing returns are observed beyond 127 distinct pre-training classes.

Figure 5: Can feature embeddings obtained by training on coarse classes distinguish fine classes they were never trained on? E.g. by training on monkeys, can the network pick out macaques? Here we look at the FC7 nearest neighbors (NN) of two randomly sampled images: a macaque (left column) and a giant schnauzer (right column), with each row showing feature embeddings trained with a different number of classes (from fine to coarse). The row(s) above the dotted line indicate that the image class (i.e. macaque/giant schnauzer) was one of the training classes, whereas in the rows below, the image class was not present in the training set. Images in green indicate that the NN image belongs to the correct fine class (i.e. either macaque or giant schnauzer); orange indicates the correct coarse class (based on the WordNet hierarchy) but the incorrect fine class; red indicates the incorrect coarse class. All green images below the dotted line are instances of correct fine-grained nearest neighbor retrieval for features that were never trained on that class.

Furthermore, for PASCAL-ACT-CLS and SUN-CLS, finetuning CNNs pre-trained with class set sizes of 918 and 753 actually results in better performance than using all 1000 classes. This may indicate that having too many classes for pre-training works against learning good generalizable features. Hence, when generating a dataset, one should be attentive to the nomenclature of the classes.

5.2. Is fine-grained recognition necessary for learning transferable features?

The ImageNet challenge requires a classifier to distinguish between 1000 classes, some of which are very fine-grained, such as different breeds of dogs and cats. Indeed, most humans do not perform well on ImageNet unless specifically trained [35], and yet are easily able to perform most everyday visual tasks. This raises the question: is fine-grained recognition necessary for CNN models to learn good feature representations, or is coarse-grained object recognition (e.g. just distinguishing cats from dogs) sufficient?

Note that the label set of 127 classes from the previous experiment contains 65 classes that are present in the original set of 1000 classes, and the remainder are inner nodes of the WordNet tree. However, all these 127 classes (see supplementary materials) represent coarse semantic concepts. As discussed earlier, pre-training with these classes results in only a small drop in transfer performance (see Table 1). This suggests that performing fine-grained recognition is only marginally helpful and does not appear to be critical for learning good transferable features.

5.3. Does training with coarse classes induce features relevant for fine-grained recognition?

Earlier, we showed that the features learned on the 127 coarse classes perform almost as well on our transfer tasks as those learned on the full set of 1000 ImageNet classes. Here we probe this further by asking a different question: is the feature embedding induced by the coarse-class classification task capable of separating the fine labels of ImageNet (which it never saw during training)?

To investigate this, we used top-1 and top-5 nearest neighbors in the FC7 feature space to measure the accuracy of identifying fine-grained ImageNet classes after training only on a set of coarse classes. We call this measure "induction accuracy". As a qualitative example, Figure 5 shows nearest neighbors for a macaque (left) and a schnauzer (right) for feature embeddings trained on ImageNet with different numbers of classes. All green-bordered images below the dotted line are instances of correct fine-grained nearest neighbor retrieval for features that were never trained on that class.
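A sketch of how such an induction accuracy could be computed is given below. The choice of cosine similarity and the "any of the k neighbors is correct" criterion for top-5 are our assumptions; the paper does not spell out the exact nearest-neighbor protocol.

```python
import numpy as np


def induction_accuracy(train_feats, train_labels, test_feats, test_labels, k=5):
    """Top-1 / top-k nearest-neighbor classification accuracy in a feature
    space (e.g. FC7 activations), used as a proxy for how well features
    trained on coarse classes separate fine classes never seen in training.
    Inputs are numpy arrays; labels are integer class ids."""
    a = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    b = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = a @ b.T                                    # (n_test, n_train) cosine similarities
    nn = np.argsort(-sims, axis=1)[:, :k]             # indices of the k nearest neighbors
    nn_labels = train_labels[nn]                      # (n_test, k)
    top1 = float((nn_labels[:, 0] == test_labels).mean())
    topk = float((nn_labels == test_labels[:, None]).any(axis=1).mean())
    return top1, topk
```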

Quantitative results are shown in Figure 4. The results show that when 127 classes are used, fine-grained recognition k-NN performance is only about 15% lower compared to training directly for these fine-grained classes (i.e. the baseline accuracy). This is rather surprising and suggests that CNNs implicitly discover features capable of distinguishing between finer classes while attempting to distinguish between relatively coarse classes.


Figure 6: Does the network learn to discriminate coarse semantic concepts by training only on their finer sub-classes? The degree to which the concept of a coarse class is learnt was quantified by measuring the difference (in percentage points) between the accuracy of classifying the coarse class and the average accuracy of individually classifying all the sub-classes of this coarse class. Here, the top and bottom classes sorted by this metric are shown, using the label set of size 127 and only classes with at least 5 subclasses. We observe that classes whose subclasses are visually consistent (e.g. mammal) are better represented than those that are visually dissimilar (e.g. home appliance).

5.4. Does training with fine-grained classes induce features relevant for coarse recognition?

Investigating whether the network learns features relevant for fine-grained recognition by training on coarse classes raises the reverse question: does training with fine-grained classes induce features relevant for coarse recognition? If this is indeed the case, then we would expect that when a CNN makes an error, it is more likely to confuse a sub-class (i.e. an error in fine-grained recognition) with other sub-classes of the same coarse class. This effect can be measured by computing the difference between the accuracy of classifying the coarse class and the average accuracy of individually classifying all the sub-classes of this coarse class (please see supplementary materials for details).
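One way to compute this quantity is sketched below; it is written under our own assumptions, and the authors' exact protocol is described in their supplementary materials.

```python
import numpy as np


def coarse_minus_fine_gap(y_true, y_pred, fine_to_coarse, coarse_cls):
    """Difference, in percentage points, between the accuracy of getting the
    coarse class right and the average accuracy of naming the exact subclass,
    for one coarse class. `y_true`/`y_pred` are numpy arrays of fine labels
    and `fine_to_coarse` maps each fine label to its coarse class."""
    subclasses = [f for f, c in fine_to_coarse.items() if c == coarse_cls]
    mask = np.isin(y_true, subclasses)
    # fraction of this coarse class's examples predicted as *some* subclass of it
    coarse_acc = np.mean([fine_to_coarse.get(p) == coarse_cls for p in y_pred[mask]])
    # mean accuracy of predicting the exact subclass
    per_subclass = [np.mean(y_pred[y_true == s] == s)
                    for s in subclasses if np.any(y_true == s)]
    return 100.0 * (coarse_acc - np.mean(per_subclass))
```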

Figure 6 shows the results. We find that coarse semantic classes such as mammal, fruit, bird, etc. that contain visually similar sub-classes show the hypothesized effect, whereas classes such as tool and home appliance that contain visually dissimilar subclasses do not exhibit this effect. These results indicate that subclasses that share a common visual structure allow the CNN to learn features that are more generalizable. This might suggest a way to improve feature generalization by making class labels respect visual commonality rather than simply WordNet semantics.

5.5. More Classes or More Examples Per Class?

Results in the previous sections show that it is possible to achieve good performance on transfer tasks using significantly less pre-training data and fewer pre-training classes. However, it is unclear what is more important: the number of classes or the number of examples per class. One extreme is to have only 1 class and all 1.2M images from that class, and the other extreme is to have 1.2M classes and 1 image per class. It is clear that both ways of splitting the data will result in poor generalization, so the answer must lie somewhere in-between.

Dataset               PASCAL                SUN
Data size             500K   250K   125K   500K   250K   125K
More examples/class   57.1   54.8   50.6   50.6   45.7   42.2
More classes          57.0   52.5   49.8   49.7   46.7   42.3

Table 2: For a fixed budget of pre-training data, is it better to have more examples per class and fewer classes, or vice versa? The row "more examples/class" was pretrained with subsets of ImageNet containing 500, 250 and 125 classes with 1000 examples each. The row "more classes" was pretrained with 1000 classes, but 500, 250 and 125 examples each. Interestingly, the transfer performance on both PASCAL and SUN appears to be broadly similar under both scenarios.

Pre-trained Dataset        PASCAL
ImageNet                   58.3 ± 0.3
PASCAL-removed ImageNet    57.8 ± 0.1
Places                     53.8 ± 0.1

Table 3: PASCAL-DET results after pre-training on the entire ImageNet, PASCAL-removed ImageNet and Places datasets. Removing PASCAL classes from ImageNet leads to an insignificant reduction in performance.

To investigate this, we split the same amount of pre-training data in two ways: (1) more classes with fewer images per class, and (2) fewer classes with more images per class. We use datasets of size 500K, 250K and 125K images for this experiment. For 500K images, we considered two ways of constructing the training set: (1) 1000 classes with 500 images/class, and (2) 500 classes with 1000 images/class. Similar splits were made for data budgets of 250K and 125K images. The 500, 250 and 125 classes for these experiments were drawn from a uniform distribution over the 1000 ImageNet classes. Similarly, the image subsets containing 500, 250 and 125 images were drawn from a uniform distribution over the images belonging to each class.

The results presented in Table 2 show that having more images per class with fewer classes results in features that perform very slightly better on PASCAL-DET, whereas for SUN-CLS the performance is comparable across the two settings.

5.6. How important is it to pre-train on classes that are also present in a target task?

It is natural to expect that a higher correlation between pre-training and transfer tasks leads to better performance on a transfer task. This has indeed been shown to be true in [44]. One possible source of correlation between pre-training and transfer tasks is classes common to both tasks. In order to investigate how strong the influence of these common classes is, we ran an experiment where we removed from ImageNet all the classes that are contained in the PASCAL challenge. PASCAL has 20 classes, some of which map to more than one ImageNet class; thus, after applying this exclusion criterion, we are left with only 771 ImageNet classes.

Figure 7: An illustration of the procedure used to split the ImageNet dataset. Splits were constructed in 2 different ways. The random split selects classes at random from the 1000 ImageNet classes. The minimal split is made in a manner that ensures no two classes in different splits have a common ancestor below depth four of the WordNet tree. The collage in Figure 8 visualizes the random and minimal splits.

Table 3 compares the results on PASCAL-DET when the PASCAL-removed ImageNet is used for pre-training against the original ImageNet and a baseline of pre-training on the Places [46] dataset. The PASCAL-removed ImageNet achieves an mAP of 57.8 (compared to 58.3 with the full ImageNet), indicating that training on ImageNet classes that are not present in PASCAL is sufficient to learn features that are also good for PASCAL classes.
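Constructing the PASCAL-removed pre-training set amounts to dropping every ImageNet synset that a PASCAL class maps to. A sketch is below; `pascal_to_wnids`, the hand-built mapping from the 20 PASCAL classes to ImageNet synsets, is a hypothetical input.

```python
def pascal_removed_classes(imagenet_wnids, pascal_to_wnids):
    """Drop every ImageNet synset that any of the 20 PASCAL classes maps to,
    leaving the 771 classes used for the PASCAL-removed pre-training set
    (sketch; `pascal_to_wnids` must be built by hand from WordNet)."""
    excluded = {wnid for wnids in pascal_to_wnids.values() for wnid in wnids}
    return [w for w in imagenet_wnids if w not in excluded]
```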

6. Does data augmentation from non-target classes always improve performance?

The analysis using PASCAL-removed ImageNet indicates that pre-training on non-PASCAL classes aids performance on PASCAL. This raises the question: is it always better to add pre-training data from additional classes that are not part of the target task? To investigate this hypothesis, we chose two different methods of splitting the ImageNet classes. The first is a random split, in which the 1000 ImageNet classes are split randomly; the second is a minimal split, in which the classes are deliberately split to ensure that similar classes are not in the same split (Figure 7). In order to determine whether additional data helps performance for classes in split A, we pre-trained two CNNs: one for classifying all classes in split A and the other for classifying all classes in both splits A and B (i.e. the full dataset). We then finetuned the last layer of the network trained on the full dataset on split A only. If additional data from split B helps performance on split A, then the CNN pre-trained on the full dataset should perform better than the CNN pre-trained only on split A.

Figure 8: Visualization of the random and minimal splits used for testing whether adding more pre-training data is always useful. The two minimal sets contain disparate sets of objects: minimal splits A and B consist mostly of inanimate objects and living things, respectively. On the other hand, the random splits contain semantically similar objects.

Using the random split, Figure 9 shows that the results of this experiment confirm the intuition that additional data is indeed useful for both splits. However, under a random class split within ImageNet, we are almost certain to have extremely similar classes (e.g. two different breeds of dogs) ending up on different sides of the split. So, what we have shown so far is that we can improve performance on, say, husky classification by also training on poodles. Hence the motivation for the minimal split: does adding arbitrary, unrelated classes, such as fire trucks, help dog classification?

The classes in minimal split A do not share any common ancestor with those in minimal split B up until the nodes at depth 4 of the WordNet hierarchy (Figure 7). This ensures that any class in split A is sufficiently disjoint from split B. Split A has 522 classes and split B has 478 classes (N.B.: for consistency, random splits A and B also had the same number of classes). In order to intuitively understand the difference between min splits A and B, we have visualized a random sample of images from these splits in Figure 8. Min split A consists mostly of inanimate objects and min split B consists mostly of living objects.
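A sketch of how such a minimal split could be constructed is shown below: each leaf class is assigned to its ancestor at depth 4, and whole depth-4 subtrees are then assigned to one side or the other. The greedy size balancing is our own assumption, not necessarily the authors' procedure.

```python
def ancestor_at_depth(wnid, parent, depth=4):
    """Ancestor of `wnid` at the given depth from the root (or the node
    itself if it lies above that depth)."""
    chain = [wnid]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    chain.reverse()                      # root, ..., wnid
    return chain[min(depth, len(chain) - 1)]


def minimal_split(classes, parent):
    """Assign every depth-4 WordNet subtree entirely to one split, so that
    classes in different splits only share ancestors above depth 4
    (sketch; split sizes are balanced greedily)."""
    groups = {}
    for c in classes:
        groups.setdefault(ancestor_at_depth(c, parent), []).append(c)
    split_a, split_b = [], []
    for group in sorted(groups.values(), key=len, reverse=True):
        (split_a if len(split_a) <= len(split_b) else split_b).extend(group)
    return split_a, split_b
```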

Figure 9: Does adding arbitrary classes to the pre-training data always improve transfer performance? This question was tested by training two CNNs, one for classifying classes in split A and the other for classifying classes in both splits A and B. We then finetuned the CNN trained on both splits on split A. If adding more pre-training data helps, then the performance of the CNN pre-trained on both splits (black) should be higher than that of the CNN pre-trained on a single split (orange). For random splits this is indeed the case, whereas for minimal splits adding more pre-training data hurts performance. This suggests that additional pre-training data is useful only if it is correlated with the target task.

Contrary to the earlier observation, Figure 9 shows that both min splits A and B perform better than the full dataset when we finetune only the last layer. This result is quite surprising because it shows that, by finetuning only the last layer of a network pre-trained on the full dataset, it is not possible to match the performance of a network trained on just one split. We have observed that when training all the layers for an extensive amount of time (420K iterations), the accuracy on min split A does benefit from pre-training on split B, but the accuracy on min split B does not benefit from pre-training on split A. One explanation could be that images in split B (e.g. person) are contained in images in split A (e.g. buildings, clothing), but not vice versa.

While it might be possible to recover performance with very clever adjustments of learning rates, the current results suggest that training with data from unrelated classes may push the network into a local minimum from which it might be hard to reach the better optimum that could be obtained by training the network from scratch.

7. Discussion

In this work we analyzed factors that affect the quality of ImageNet pre-trained features for transfer learning. Our goal was not to consider alternative neural network architectures, but rather to establish facts about which aspects of the training data are important for feature learning.

The current consensus in the field is that the key to learning highly generalizable deep features is the large amount of training data and the large number of classes.

To quote the influential R-CNN paper: "..success resulted from training a large CNN on 1.2 million labeled images..." [12]. After the publication of R-CNN, most researchers assumed that the full ImageNet is necessary to pre-train good general-purpose features. Our work quantitatively questions this assumption, and yields some quite surprising results. For example, we have found that a significant reduction in the number of classes or the number of images used in pre-training has only a modest effect on transfer task performance.

While we do not have an explanation as to the cause of this resilience, we list some speculative possibilities that should inform further study of this topic:

• In our experiments, we investigated only one CNN architecture, AlexNet. While ImageNet-trained AlexNet features are currently the most popular starting point for fine-tuning on transfer tasks, there exist deeper architectures such as VGG [39], ResNet [15], and GoogLeNet [40]. It would be interesting to see if our findings hold up on deeper networks. If not, it might suggest that AlexNet's capacity is less than previously thought.

• Our results might indicate that researchers have been overestimating the amount of data required for learning good general CNN features. If that is the case, it might suggest that CNN training is not as data-hungry as previously thought. It would also suggest that beating ImageNet-trained features with models trained on a much bigger data corpus will be much harder than once thought.

• Finally, it might be that the currently popular target tasks, such as PASCAL and SUN, are too similar to the original ImageNet task to really test the generalization of the learned features. Alternatively, perhaps a more appropriate approach to testing generalization is with much less fine-tuning (e.g. one-shot learning) or no fine-tuning at all (e.g. nearest neighbour in the learned feature space).

In conclusion, while the titular question "What makes ImageNet good for transfer learning?" still lacks a definitive answer, our results have shown that a lot of the "folk wisdom" on why ImageNet works well is not accurate. We hope that this paper will pique our colleagues' curiosity and facilitate further research on this fascinating topic.

8. Acknowledgements

This work was supported in part by ONR MURI N00014-14-1-0671. We gratefully acknowledge NVIDIA Corporation for the donation of K40 GPUs and access to the NVIDIA PSG cluster for this research. We would like to acknowledge the support of the Berkeley Vision and Learning Center (BVLC) and Berkeley DeepDrive (BDD). Minyoung Huh was partially supported by the Rose Hill Foundation.

References

[1] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pages 37–45, 2015.
[2] P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. In Computer Vision–ECCV 2014, pages 329–344. Springer, 2014.
[3] H. Azizpour, A. Razavian, J. Sullivan, A. Maki, and S. Carlsson. From generic to specific deep representations for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 36–45, 2015.
[4] Y. Bengio, A. C. Courville, and P. Vincent. Unsupervised feature learning and deep learning: A review and new perspectives. CoRR, abs/1206.5538, 2012.
[5] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.
[6] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. arXiv preprint arXiv:1507.06550, 2015.
[7] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. arXiv preprint arXiv:1512.04412, 2015.
[8] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
[9] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
[10] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[11] C. Fellbaum. WordNet: An Electronic Lexical Database. Bradford Books, 1998.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
[13] G. Gkioxari, R. Girshick, and J. Malik. Contextual action recognition with R*CNN. In ICCV, 2015.
[14] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised feature learning from temporal data. arXiv preprint arXiv:1504.02518, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[16] D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In Proceedings of the IEEE International Conference on Computer Vision, pages 1413–1421, 2015.
[17] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
[18] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
[19] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
[20] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[21] P. Krahenbuhl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. In ICLR, 2016.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[23] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In ECCV, 2016.
[24] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[25] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[26] Z. Li and D. Hoiem. Learning without forgetting. In ECCV, 2016.
[27] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 737–744. ACM, 2009.
[28] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
[29] B. A. Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
[30] A. Owens, P. Isola, J. McDermott, A. Torralba, E. Adelson, and W. Freeman. Visually indicated sounds. In CVPR, 2016.
[31] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[32] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1–8. IEEE, 2007.

[33] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.
[34] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
[36] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 448–455, 2009.
[37] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[38] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[41] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2015.
[42] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 1385–1392, 2013.
[43] L. Wiskott and T. J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
[44] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
[45] R. Zhang, P. Isola, and A. Efros. Colorful image colorization. In ECCV, 2016.
[46] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.