

Parting with Illusions about Deep Active Learning

Sudhanshu Mittal    Maxim Tatarchenko    Özgün Çiçek    Thomas Brox
University of Freiburg

Abstract

Active learning aims to reduce the high labeling cost involved in training machine learning models on large datasets by efficiently labeling only the most informative samples. Recently, deep active learning has shown success on various tasks. However, the conventional evaluation scheme used for deep active learning is below par. Current methods disregard some apparent parallel work in the closely related fields. Active learning methods are quite sensitive w.r.t. changes in the training procedure like data augmentation. They improve by a large margin when integrated with semi-supervised learning, but barely perform better than the random baseline. We re-implement various latest active learning approaches for image classification and evaluate them under more realistic settings. We further validate our findings for semantic segmentation. Based on our observations, we realistically assess the current state of the field and propose a more suitable evaluation protocol.

1. Introduction

Supervised training of convolutional networks has shown remarkable success in various computer vision tasks. Its price is the collection of large datasets and their annotation. The data annotation in particular is a common bottleneck. Depending on the task, its cost may vary from a few seconds to a few hours per sample.

Active learning (AL) presumably mitigates this large annotation cost. It is based on the attractive idea that some samples are more valuable for learning than others: by identifying those in the pool of unlabeled data, we can use an annotator's time more efficiently. The typical AL process consists of multiple cycles; in each cycle, a batch of samples is selected from the pool of still unlabeled data using a query function. The selected samples are manually annotated and added to the labeled set. Then the model is re-trained. The process is repeated until the maximum annotation budget or the desired performance level is reached.

The appeal of the active learning idea has spawned a multitude of ConvNet-based AL methods. In this paper we aim to objectively assess the state of the field and challenge the principal hypothesis behind active learning: active selection of the samples to be labeled leads to a significant reduction in the annotation effort compared to random selection. Our study seeks answers to the following four scientific questions.

[Figure 1 plot: accuracy (%) vs. % of labeled data; curves for state-of-the-art AL with augmentation and for semi-supervised learning.]

Figure 1. State-of-the-art active learning methods do not consistently use modern data augmentation techniques or advances in the closely related field of semi-supervised learning, which leads to a wrong impression about the current state of the field. Results are shown for image classification on CIFAR-10.


(1) Since a widely accepted evaluation protocol is missing, methods are often tested under incompatible circumstances: different architectures, different augmentation strategies, etc. We evaluate the effect of compatible experimental settings on the ranking of methods. In particular, do AL methods work consistently well in conjunction with data augmentation?

(2) Contemporary papers on active learning largely ignore the progress of the closely related field of semi-supervised learning, where approaches effectively operate under the same assumptions with regard to the used data. What is the effect when concepts from semi-supervised learning are integrated into active learning?

(3) Existing methods are typically not evaluated in a low-budget setting, a mode crucially important to kick-start network training on a new dataset. How do active learning concepts work in such a low-budget regime?



(4) Active learning is typically evaluated only on image classification tasks, where manual labeling is relatively cheap compared to other tasks, e.g. semantic segmentation. Can active learning better exploit its potential in the more costly scope of semantic segmentation, where efficient annotation is practically more relevant?

In this work, keeping in mind the aforementioned questions, we perform an extensive comparison of existing approaches for image classification and semantic segmentation. Our experiments reveal that the progress recently made in the field of active learning is practically negligible when viewed under more realistic circumstances, in particular when using modern data augmentation and taking the advances of semi-supervised learning into account; see Figure 1. Based on our extensive study, we suggest a more suitable evaluation protocol.

2. Related Work

Deep active learning (AL) methods can be categorized into three types: uncertainty-based methods, representation-based methods, and learning-based methods. Additionally, some methods have proposed a hybrid approach.

Uncertainty-based methods try to find the samples which are hard to learn. Several methods have been proposed to estimate uncertainty for neural networks using Bayesian [8, 15, 16, 23] and non-Bayesian approaches [25, 32]. Gal et al. [17] proposed to estimate posterior uncertainty using dropout for active learning. Wang et al. [41] used the entropy of the softmax output of a neural network as a proxy uncertainty measure to query samples. Beluch et al. [6] use an ensemble to estimate prediction uncertainty and select new samples based on a statistical measure of committee disagreement called the variation ratio [21]; they show that this method outperforms all other uncertainty-based methods. Representation-based methods [36, 43], also referred to as density-based methods, try to find a diverse set of samples that optimally represents the complete dataset distribution. Sener et al. [36] formulated the active learning problem as core-set selection and showed its effectiveness for CNNs. Learning-based approaches [39, 44] use an auxiliary network module and loss function to learn a measure of information gain from new samples. Yoo et al. [44] proposed to learn a loss prediction module that predicts target losses of unlabeled samples and selects the samples with the highest predicted loss; it can also be considered a pseudo-uncertainty heuristic. Sinha et al. [39] proposed a semi-supervised active learning approach that learns a VAE-GAN hybrid network to select unlabeled samples that are not well represented in the labeled set; it can also be considered a representation-type method.

Many of the above-mentioned approaches focus mainly on image classification. Lately, a few works have proposed to solve tasks involving higher annotation cost, like object detection [44], pedestrian detection [44], human pose estimation [27], and segmentation [20, 39]. We focus on semantic segmentation in this work, since creating segmentation masks is a highly expensive labeling task. This makes it one of the most relevant tasks for active learning. Suggestive Annotation [43], Cereals [28] and VAAL [39] are a few works which have shown the applicability of deep active learning for semantic segmentation. VAAL is a task-agnostic learning-based approach using adversarial training. Suggestive Annotation is a hybrid approach proposed for a binary segmentation problem. Cereals is a patch-based selection approach based on a hybrid heuristic of uncertainty, a learned labeling cost model, and spatial coherency of the image.

Semi-supervised Learning (SSL) methods make use of the unlabeled data for training the model. Effectively, this class of methods uses the same amount of information as the AL methods. SSL has recently seen a lot of progress due to consistency regularization [7, 42, 46]. Consistency regularization minimizes the discrepancy between class predictions for differently perturbed versions of an unlabeled image. Various additional schemes have been proposed to avoid overfitting and to improve training stability, such as temporal ensembling [24], the student-teacher model [40], adversarial perturbation [30], self-supervision [46], data filtering [42], and snapshot ensembling [5]. In this work, we use the semi-supervised learning method UDA [42] for image classification.

A few recent works [19, 29] applied semi-supervised methods also to semantic segmentation. Mittal et al. [29] proposed s4GAN, which uses a conditional GAN [18] to learn from unlabeled images, French et al. [14] used consistency regularization, and Kalluri et al. [22] proposed a feature alignment objective to learn from unlabeled samples. We make use of s4GAN [29] in this work.

Semi-supervised Active Learning. Most representation-based AL methods use unlabeled samples to learn the underlying distribution, but only a few methods use semi-supervised learning to improve their selection criteria [35, 36, 39, 41]. Sinha et al. [39] used the unlabeled pool to learn its distribution against the distribution of labeled samples, but did not take advantage of it to improve the feature representation of the target model itself. Sener et al. [36] have also previously shown the advantage of using the unlabeled pool for learning the target model. Wang et al. [41] also explored the usage of the most-certain samples from the unlabeled pool via pseudo-labeling, but the pseudo-labeling process can easily propagate erroneous labels if not tuned properly. Ravanbakhsh et al. [35] proposed a GAN-based approach that makes use of the unlabeled pool and utilizes the discriminator score to query low-confidence samples for active learning. Recently, two open-source concurrent works [2, 3] have also shown findings similar to ours. However, they are restricted to image classification only.


Oliver et al. [31] raised concerns about the evaluation scheme for semi-supervised learning to help guide its applicability to real-world problems. In this work, we raise similar questions about unrealistic evaluation schemes for active learning by providing evidence through extensive experiments. In the following sections, we analyze the performance of active learning methods for image classification and semantic segmentation, respectively.

3. Active Learning for Image Classification

In active learning, we usually start with a small set of labeled samples L, whose size is defined by the initial labeling budget Bi, and a large pool of unlabeled samples U. In each cycle, a set of samples is selected from the unlabeled pool U according to the sampling budget Bs and added to the labeled set L, with the corresponding annotations provided by the oracle annotator. The selection of samples is performed using a query function, which can be learned using the full set of available samples (U ∪ L). This process is also called pool-based active learning. This acquisition step is iterated over several cycles until the objective is achieved.
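Schematically, one run of this procedure can be written as a short loop. The sketch below is our own illustration; train_model, query_function, and annotate are hypothetical placeholders for the target-model training routine, the acquisition heuristic, and the (simulated) oracle.

```python
import random

def active_learning_loop(pool_size, b_init, b_sample, num_cycles,
                         train_model, query_function, annotate):
    """Generic pool-based AL loop over sample indices (sketch, not the paper's exact code)."""
    all_ids = list(range(pool_size))
    labeled_ids = set(random.sample(all_ids, b_init))       # initial (class-balanced) random set
    labels = {i: annotate(i) for i in labeled_ids}

    model = train_model(labeled_ids, labels)                 # SSL variants also see the unlabeled ids
    for _ in range(num_cycles):
        unlabeled_ids = [i for i in all_ids if i not in labeled_ids]
        # Rank the remaining pool with the query function and take the B_s top-scoring samples.
        ranked = sorted(unlabeled_ids, key=lambda i: query_function(model, i), reverse=True)
        for i in ranked[:b_sample]:
            labeled_ids.add(i)
            labels[i] = annotate(i)                          # oracle provides the annotation
        model = train_model(labeled_ids, labels)             # re-train for the next cycle
    return model, labeled_ids
```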

In this section, we assess the performance of state-of-the-art AL methods for image classification and compare them with the state-of-the-art semi-supervised approach. We validate our experiments using at least one recent approach from each of the three categories of AL methods defined in the related work section.

3.1. Baseline Methods

Random. A new set of samples is selected randomly from the unlabeled pool and is added to the labeled pool with annotations.

Entropy [38] is an information-theoretic measure used as an uncertainty metric for sampling. This method naively selects the samples for which the pseudo-probabilities predicted by the softmax classifier have the highest entropy.

Ensemble with Variation Ratio (ENS-varR). The second method that selects samples based on an uncertainty criterion relies on ensembles. It has been shown to consistently outperform all other uncertainty-based approaches for active learning by Beluch et al. [6]. The core of the method is the variation ratio (varR), given as the proportion of predicted class labels that are not the modal class prediction:

varR = 1 − f_m / T,    (1)

where f_m is the frequency of the modal class and T is the number of ensemble members. This heuristic is motivated by the query-by-committee algorithm proposed by Seung et al. [37]. The query function selects the samples with larger varR values. The ensemble is only used for sample querying; the target performance is still reported for a single model. Similar to Beluch et al. [6], we use an ensemble of 5 models in our experiments.

Core-set. This type of method selects a batch of samples such that the performance of the model trained on the labeled set matches the performance of the model trained on the whole dataset [34]. The recent core-set approach proposed by Sener et al. [36] casts core-set selection as a k-center problem and proposes a robust k-center approach: the chosen subset minimizes the largest distance in feature space between any unlabeled point and its closest chosen point. For the core-set approach, we use the k-center greedy implementation, since it is much faster and performs only marginally worse than the robust version.

Learning Loss (LL). This method [44] proposes a loss prediction module which is attached to the target network to estimate the loss value of unlabeled samples. The samples with the largest predicted loss are selected for annotation. The auxiliary module is trained to preserve the pairwise ranking of the original loss values, which is imposed using a hinge loss over random pairs of samples in a minibatch.

Unsupervised Data Augmentation (UDA). UDA [42] is a semi-supervised learning method for image classification. It uses consistency regularization to learn from unlabeled samples, along with AutoAugment [10] and other augmentation techniques to reduce overfitting. We selected this method because 1) it shows state-of-the-art performance, and 2) it is based on a simple idea and is easy to implement. Also, the method performs well even when the number of labeled samples is very small. Our implementation uses online data augmentation instead of the offline augmentation in the original work [42].
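The uncertainty-based query functions above (Entropy and ENS-varR) reduce to simple scoring rules over model predictions. The NumPy sketch below is our own illustration of these scores, not the reference implementations of [6, 38]:

```python
import numpy as np

def entropy_score(probs):
    """Softmax entropy per sample; probs has shape (N, C)."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def variation_ratio(ensemble_preds):
    """varR = 1 - f_m / T from Eq. (1); ensemble_preds has shape (T, N) with class indices."""
    T, N = ensemble_preds.shape
    scores = np.empty(N)
    for n in range(N):
        _, counts = np.unique(ensemble_preds[:, n], return_counts=True)
        scores[n] = 1.0 - counts.max() / T     # fraction of members disagreeing with the mode
    return scores

# Query step: label the B_s samples with the highest score, e.g.
# picked = np.argsort(-entropy_score(probs))[:b_sample]
```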

3.2. Experiments and Results

3.2.1 Evaluation protocol

Datasets. We evaluate the methods on the CIFAR-10 and CIFAR-100 datasets. Both datasets contain the same set of 60,000 images, assigned to 10 and 100 classes respectively. The training and test sets contain 50,000 and 10,000 images respectively. CIFAR-10 is the most commonly tested dataset in the field of active learning. CIFAR-100 is an extension with 100 classes, which makes the task more challenging. The initial labeling budget is Bi = 5000 and the sampling budget is Bs = 2500 labels for each cycle. We tested this configuration for 6 sampling cycles (i.e., going from 10% to 40% labeled samples). In the first step, we randomly sampled a class-balanced subset of samples from the unlabeled pool.

Training Details. For the network architecture, we consistently use a Wide Residual Network [45] with depth 28 and width 2 (WRN-28-2). We select WRN due to its efficiency and widespread adoption. WRN-28-2 contains only 1.5M parameters and shows close-to-state-of-the-art performance on the CIFAR datasets. The WRN-28-2 classification network is optimized using the SGD optimizer with a base learning rate of 3e-2, momentum 0.9, and a weight decay of 5e-4. We use a cosine learning rate schedule for training each model. We trained all AL methods (without SSL) for 150 epochs per sampling cycle with a batch size of 64. We train the semi-supervised AL methods for 50k iterations per sampling cycle with a batch size of 64 for the labeled loss and a batch size of 320 for the unlabeled loss. We mask out unlabeled examples whose highest probability across categories is less than 0.6 and set the softmax temperature scaling constant to 0.5. Other hyperparameters are used exactly as proposed in [42]. Our implementation is based on the open-source framework PyTorch [33].
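For concreteness, the supervised part of this setup corresponds roughly to the PyTorch sketch below (our own illustration, not our exact training script); model is any WRN-28-2 implementation and labeled_loader iterates over the current labeled set.

```python
import torch
from torch import nn, optim

def make_optimizer_and_scheduler(model: nn.Module, iters_per_cycle: int):
    """SGD with a cosine learning rate schedule, as described above (sketch)."""
    optimizer = optim.SGD(model.parameters(), lr=3e-2, momentum=0.9, weight_decay=5e-4)
    # One cosine cycle spans all training iterations of the current AL cycle.
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=iters_per_cycle)
    return optimizer, scheduler

def supervised_cycle(model, labeled_loader, epochs=150, device="cpu"):
    """Plain supervised training for one AL cycle; labeled_loader yields (image, label) batches."""
    criterion = nn.CrossEntropyLoss()
    optimizer, scheduler = make_optimizer_and_scheduler(model, epochs * len(labeled_loader))
    model.to(device).train()
    for _ in range(epochs):
        for images, targets in labeled_loader:
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
            scheduler.step()   # step the cosine schedule once per iteration
    return model
```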


[Figure 2 plot: accuracy (%) vs. number of labeled images on CIFAR-10; curves for Random, Entropy, LL, Core-set, and ENS-varR, each with and without augmentation.]

Figure 2. Using data augmentation on CIFAR-10 significantly improves the performance of active learning methods and makes the relative difference between them less pronounced. Results without augmentation are denoted as 'X-wo-Aug'.


All results are shown as performance curves. We report the mean performance over 3 trials with different initial labeled sets for all single-model-based methods, and over 2 trials for ensemble-based methods due to their higher computational cost and lower variance.

The LL method usually starts with a higher initial performance due to the extra regularization effect of the loss-prediction module. All other methods start from a similar initial performance, with slight differences due to model variance. This variance is more prominent in the beginning due to overfitting on the small labeled set.

3.2.2 Do AL methods work consistently well together with data augmentation?

Data augmentation is a widely accepted regularization technique which increases the power of machine learning models, particularly when there is little labeled data. Nevertheless, several recent AL works [6, 39] resort to either not using any augmentation during training or only doing simplistic horizontal flipping. In this experiment, we validated the importance of elaborate, up-to-date image augmentation for the performance of AL methods.

[Figure 3 plot: accuracy (%) vs. number of labeled images on CIFAR-100; curves for Random, Entropy, LL, Core-set, and ENS-varR, each with and without augmentation.]

Figure 3. The performance of AL methods on CIFAR-100 improves significantly when using up-to-date image augmentation. Results without augmentation are denoted as 'X-wo-Aug'.


We first evaluated all methods without any augmentation. Subsequently, we evaluated the same methods with augmentation, which includes the AutoAugment policies found by Cubuk et al. [10], cutout [11], horizontal random flipping, and random cropping. Figure 2 shows that without using any augmentation, all AL methods clearly perform better than the random baseline. The LL method shows a distinct improvement over the other methods (matching the results from Yoo et al. [44]) and an overall improvement of 3.2% over the random baseline on the CIFAR-10 dataset. When the same experiment is performed with augmentation, all methods improve drastically in absolute performance. However, the relative effect of using different AL methods becomes far less pronounced: all AL methods show similar performance within a range of 0.4%. In conclusion, AL works well with data augmentation, but data augmentation blurs the differences between AL strategies: they all perform largely the same.
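For reference, the "with augmentation" pipeline can be approximated with standard torchvision transforms plus a small cutout transform. This is a sketch under the assumption of a recent torchvision release (which ships AutoAugment with a CIFAR-10 policy); the exact policies and parameters may differ from ours.

```python
import torch
from torchvision import transforms

class Cutout:
    """Zero out one random square patch of the image tensor, in the spirit of [11]."""
    def __init__(self, size=16):
        self.size = size

    def __call__(self, img):                      # img: (C, H, W) tensor
        _, h, w = img.shape
        cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
        y0, y1 = max(0, cy - self.size // 2), min(h, cy + self.size // 2)
        x0, x1 = max(0, cx - self.size // 2), min(w, cx + self.size // 2)
        img[:, y0:y1, x0:x1] = 0.0
        return img

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.AutoAugment(transforms.AutoAugmentPolicy.CIFAR10),   # policies from [10]
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),  # CIFAR-10 statistics
    Cutout(size=16),
])
```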

For completeness, we further validate the importance of using up-to-date augmentation for AL methods on the CIFAR-100 dataset. We evaluate all methods with and without augmentation, analogous to the CIFAR-10 experiment. The overall conclusion is also very similar: without augmentation, the LL method shows a distinct improvement of 1.4% over the random baseline; with augmentation, all methods improve by a large margin in absolute performance, but the relative difference between the methods becomes insignificant and their relative ranking changes. Performance curves are shown in Figure 3.


[Figure 4 plot: accuracy (%) vs. number of labeled images on CIFAR-10; curves for Random, Entropy, LL, Core-set, ENS-varR, their SSL-integrated counterparts, and the fully-supervised model.]

Figure 4. Combining AL methods with semi-supervised learning leads to a significant performance improvement on CIFAR-10 compared to the raw AL case. Results are shown in the large-budget setting with Bi = 5000, Bs = 2500.

3.2.3 Does semi-supervised learning or active learning make better use of the pool of unlabeled data?

A common practice in previous works has been to utilize the unlabeled pool only for sampling, although it is available throughout the learning process (otherwise one could not sample from it) and could be used more rigorously. Using semi-supervised learning, we can utilize this unlabeled pool for training the model itself. To this end, we employed the UDA semi-supervised learning method. We integrated SSL into the AL methods by training the model using the UDA objective and defining the query function based on this model. In each cycle, the target model is trained using UDA instead of standard supervised training. Data augmentation stays the same as in Sec. 3.2.2. We refer to the integrated methods as SSL-X, where X is the name of the AL method.
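Schematically, the UDA objective we plug into each cycle combines the supervised loss on the labeled batch with a confidence-masked consistency term on unlabeled batches. The sketch below is a simplified illustration using the threshold and temperature from our setup, not a faithful re-implementation of [42].

```python
import torch
import torch.nn.functional as F

def uda_loss(model, x_lab, y_lab, x_unlab_weak, x_unlab_strong,
             conf_threshold=0.6, temperature=0.5, lambda_u=1.0):
    # Supervised cross-entropy on the labeled batch.
    sup_loss = F.cross_entropy(model(x_lab), y_lab)

    # Target distribution: sharpened predictions on the weakly augmented view, no gradient.
    with torch.no_grad():
        logits_weak = model(x_unlab_weak)
        probs_weak = F.softmax(logits_weak / temperature, dim=1)          # temperature sharpening
        mask = (F.softmax(logits_weak, dim=1).max(dim=1).values
                >= conf_threshold).float()                                # drop low-confidence samples

    # Consistency: predictions on the strongly augmented view should match the target.
    log_probs_strong = F.log_softmax(model(x_unlab_strong), dim=1)
    consistency = F.kl_div(log_probs_strong, probs_weak, reduction="none").sum(dim=1)
    unsup_loss = (mask * consistency).mean()

    return sup_loss + lambda_u * unsup_loss
```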

Figures 4 and 5 show a remarkably strong performance of the SSL method (SSL-Random) on CIFAR-10 and CIFAR-100: when using 5K random labeled samples, SSL almost reaches the performance which AL methods achieve on 20K samples picked by the corresponding query functions. Also for the remaining data ratios, there is a large performance gap between semi-supervised and active learning, both on CIFAR-10 and CIFAR-100. Clearly, semi-supervised learning makes much better use of the same data than active learning.

SSL and AL can be combined, which yields an improvement over raw SSL on CIFAR-10. The SSL-LL method performs best and shows an improvement over the random baseline by 0.7% after 6 cycles. However, on CIFAR-100 the relative ranking of the AL methods changes completely; SSL-LL performs worse than the other methods and struggles even to compete with the random selection method.

[Figure 5 plot: accuracy (%) vs. number of labeled images on CIFAR-100; curves for Random, Entropy, LL, Core-set, ENS-varR, their SSL-integrated counterparts, and the fully-supervised model.]

Figure 5. Integrating SSL and AL leads to an overall performance improvement on CIFAR-100; however, not all combinations consistently outperform random sampling. Results are shown in the large-budget setting with Bi = 5000, Bs = 2500.


The same is true for raw active learning without SSL: on CIFAR-100, some active learning methods do not reach the performance of randomly drawing the samples to be labeled, as shown in Figure 5.

3.2.4 Does active learning consistently outperform random sampling in low-budget regimes?

There is an inconsistency in the methods' behavior when switching from CIFAR-10 to CIFAR-100. This challenges the principal assumption of active learning that a dedicated selection strategy always improves over random selection of samples. Does active learning benefit from a low-budget setting, where every sample is particularly crucial? In certain applications, such as medical image analysis, even 10000 annotated samples can be very costly. Thus, training with only a few labeled samples in the beginning is attractive. We explored such a low-budget setting with Bi and Bs for each cycle set to 250 labels for CIFAR-10 and 500 labels for CIFAR-100. We tested this setting for 7 sampling cycles with a total budget of 2000 and 4000 labels for CIFAR-10 and CIFAR-100, respectively. We kept all the augmentation techniques from the previous experiments.

The results are shown in Figures 6 and 7. None of the active learning methods consistently outperforms the random baseline, neither on CIFAR-10 nor on CIFAR-100. This always holds for the combination of active learning and semi-supervised learning, whereas for raw active learning only ENS-varR could marginally outperform the random baseline. In fact, some techniques perform considerably worse than the random baseline, especially in conjunction with semi-supervised learning, showing that their selection strategy is counter-productive in the low-budget regime.


[Figure 6 plot: accuracy (%) vs. number of labeled images on CIFAR-10 in the low-budget regime; curves for Random, Entropy, LL, Core-set, ENS-varR, and their SSL-integrated counterparts.]

Figure 6. When evaluated in the low-budget regime (Bi = Bs = 250) on CIFAR-10, integrated SSL-AL methods are still better than their raw counterparts; however, SSL with random sampling shows the best performance.

[Figure 7 plot: accuracy (%) vs. number of labeled images on CIFAR-100 in the low-budget regime; curves for Random, Entropy, LL, Core-set, ENS-varR, and their SSL-integrated counterparts.]

Figure 7. When evaluated in the low-budget regime (Bi = Bs = 500) on CIFAR-100, most integrated SSL-AL methods are still better than their raw counterparts, but nothing beats SSL with random sampling.


3.2.5 Comparison to Transfer Learning

Oliver et al. [31] argued that transfer learning may be a preferable alternative to semi-supervised learning when a suitable labeled dataset is available for transfer learning. Following this recommendation, we compare the performance of the SSL-Random baseline with a fine-tuned ImageNet pre-trained network on CIFAR-10.

The ImageNet pre-trained network is fine-tuned only on the labeled samples. The experiment was conducted with ResNet-18 due to the availability of pre-trained ImageNet weights. We observe that the SSL-AL method clearly outperforms fine-tuning of a pre-trained ImageNet network in both high- and low-budget settings. We tested both budget settings for 4 sampling cycles; the corresponding results are shown in Figures 8 and 9, respectively. This experiment shows that including an up-to-date semi-supervised learning algorithm in an active learning pipeline makes sense even when large pre-training data is available.

4. Active Learning for Semantic Segmentation

Image classification is a standard active learning task. However, with its relatively low annotation cost (1 click per image), it is not necessarily the most important one. In this section, we aim to evaluate the applicability of active learning methods to a task with a significantly higher labeling cost: semantic segmentation. We adapt existing AL methods for classification to support semantic segmentation.

[Figure 8 plot: accuracy (%) vs. number of labeled images on CIFAR-10 in the low-budget setting; curves for Pre-trained ImageNet Random and SSL-Random.]

Figure 8. The SSL-Random baseline clearly outperforms a fine-tuned network pre-trained on ImageNet in the low-budget setting. Results shown on CIFAR-10.

[Figure 9 plot: accuracy (%) vs. number of labeled images on CIFAR-10 in the large-budget setting; curves for Pre-trained ImageNet Random and SSL-Random.]

Figure 9. The SSL-Random baseline clearly outperforms a fine-tuned network pre-trained on ImageNet in the large-budget setting. Results shown on CIFAR-10.


[Figure 10 plot: mIoU vs. average click cost per image for polygon-approximation tolerance values from 5 to 40 pixels; a tolerance of 10 pixels retains 95.06 mIoU.]

Figure 10. Trade-off between the polygon approximation quality and the annotation cost in clicks. The tolerance values used for each measurement are given above the markers.

4.1. Annotation model

A conventional active learning setup includes a human in the loop who annotates the samples picked by the query function. Since training with an actual human annotator is prohibitively expensive, we simulated the annotator's actions during training. We used the number of clicks required to annotate the entire image as a proxy for the annotation cost. To do so, we approximate each connected component in the ground-truth image with a polygon using the Ramer-Douglas-Peucker algorithm [12]. The approximation quality is controlled by a pre-defined pixel-level tolerance parameter. The total number of clicks per image is then calculated by adding up the number of vertices of all polygons in the image. We perform a grid search over tolerance values ranging from 5 to 40 pixels to find a suitable value. Figure 10 shows the trade-off between the average click cost per image and the polygon approximation quality of the annotations for different tolerance values. The effect of different tolerance values on labeling quality is shown in Figure 11. We select a pixel-level labeling tolerance of 10 pixels. The approximated labels retain 95.06% mIoU compared to the original ground-truth labels. According to this approximation, an average image costs around 33 clicks to label.
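A minimal version of this click-cost simulation can be written with OpenCV, whose approxPolyDP routine implements the Ramer-Douglas-Peucker simplification. The sketch below is our own reconstruction and omits details such as the treatment of holes and very small regions.

```python
import cv2
import numpy as np

def click_cost(label_map, tolerance=10, ignore_ids=(255,)):
    """Approximate number of annotation clicks for one ground-truth label map of shape (H, W)."""
    clicks = 0
    for class_id in np.unique(label_map):
        if class_id in ignore_ids:
            continue
        mask = (label_map == class_id).astype(np.uint8)
        # One polygon per connected component of this class.
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for contour in contours:
            poly = cv2.approxPolyDP(contour, tolerance, True)   # RDP simplification with pixel tolerance
            clicks += len(poly)                                  # one click per polygon vertex
    return clicks
```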

4.2. Baseline Methods

We evaluated two kinds of AL methods: uncertainty-based methods and learning-based methods.

Random. We consider the random sampling method as our first baseline.

Entropy. This uncertainty method is based on the softmax entropy of the segmentation prediction. Different from the classification case, we need an integral uncertainty measure that aggregates the per-pixel uncertainty values. There are a few heuristics for this [1, 43], but none of them is considered standard. We evaluated several simple heuristics, including averaging and taking the maximum value over the image, and concluded that the best results are achieved by counting the number of pixels per image with an uncertainty value higher than a certain threshold. We use 0.6 as the threshold value, which was determined via grid search.
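The resulting per-image uncertainty score is then just a pixel count; a small illustration of this aggregation (our own sketch, not the exact implementation):

```python
import numpy as np

def segmentation_entropy_score(softmax_probs, threshold=0.6):
    """softmax_probs: (C, H, W) per-pixel class probabilities for one image."""
    pixel_entropy = -np.sum(softmax_probs * np.log(softmax_probs + 1e-12), axis=0)  # (H, W)
    # Per-image score: number of pixels whose entropy exceeds the threshold.
    return int(np.sum(pixel_entropy > threshold))
```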

Ensemble with Average Entropy (ENS-ent). This second uncertainty-based method is based on the average entropy over the predictions of all members of the model ensemble. We use the same aggregation heuristic as for the Entropy method.

Learning Loss (LL). We adapted the LL method [44] from image classification to semantic segmentation. Since the original module was proposed for a ResNet architecture and the segmentation network used in this work is also based on a ResNet architecture, the method is directly adapted by reusing the original loss prediction module.

Semi-supervised Learning (s4GAN). To leverage unlabeled samples, we used the semi-supervised semantic segmentation method by Mittal et al. [29]. It has been shown to produce large performance gains with as few as 2% labeled samples on the PASCAL VOC dataset. We only used the s4GAN branch of the proposed SSL method, which can be trained in an end-to-end manner, and dropped the classification branch for simplicity. The s4GAN method is based on a conditional generative adversarial network, which uses the segmentation network as the generator. The discriminator of s4GAN discriminates between predicted and ground-truth segmentation masks. We used the same hyperparameters as provided by Mittal et al. [29] for our experiments. We also combine all the above-mentioned AL methods with the s4GAN method and evaluate them in the active learning setting.

SSL-D-score. Inspired by Ravanbakhsh et al. [35], we propose to use the discriminator of the s4GAN network as a query function for sampling. The output of the discriminator varies between 0 and 1, where a higher score is assigned to a higher-quality segmentation prediction. In other words, the discriminator of the s4GAN network acts as a critic which gives a higher rating to better segmentation quality. This heuristic selects the samples which are not well represented by the currently learned model, as indicated by a lower rating. We refer to this semi-supervised approach for active learning as the SSL-D-score method.

[Figure 11 panels: original labels and polygon approximations at tolerance values of 5, 10, 15, and 20 pixels.]

Figure 11. Labeling quality when using polygon approximation with different tolerance values (in pixels). We picked a tolerance value of 10 pixels for our experiments.


[Figure 12 panels: original image, predictions of the Entropy and SSL-Random models at click budgets from 5k to 30k, and ground truth.]

Figure 12. Qualitative semantic segmentation results on PASCAL VOC at each cycle, comparing the Entropy-Image method and the SSL-Random-Image baseline. The column headings indicate the click budget used to train the corresponding model.


4.3. Experimental design

We show the performance of the AL methods for semantic segmentation on PASCAL VOC 2012 [13]. The dataset consists of 20 foreground classes and one background class. We use the augmented annotated dataset, which contains 10582 training images and 1449 validation images.

Data Setting. In the AL experiments for segmentation, we define the labeling cost in clicks. We use an initial labeling budget Bi and a subsequent sampling budget Bs of 5000 clicks each, which is approximately 1.5% of the total labeling cost of the dataset. In the first cycle, randomly sampled images are completely labeled until Bi is exhausted. In the subsequent cycles, an AL query method selects images based on a certain criterion, and the picked images are labeled until Bs is exhausted. We test all segmentation AL methods for 5 sampling cycles. All results are shown on the validation set.

Training Details. We used the DeepLabv2 [9] architecture for all experiments. The DeepLabv2 model is pre-trained on the Microsoft COCO [26] dataset. The network is optimized using an SGD optimizer with a base learning rate of 2.5e-4, momentum of 0.9, and a weight decay of 5e-4. We use a poly learning rate policy, similar to the original segmentation work [9]. We use a batch size of 10 and train each cycle for 150 epochs. The network gets an input image of size 321 × 321. The used DeepLabv2 version achieves a performance of 73.6 and 71.6 mIoU on the original and approximated ground-truth labels, respectively, on the validation set. The combined semi-supervised active learning (SSL-AL) methods are trained for 10k iterations per cycle with a batch size of 8. The training procedure and hyperparameters for semi-supervised learning are the same as in [29].

We evaluate AL methods for semantic segmentation in two different settings: (1) using standard augmentations and (2) using semi-supervised learning for training the target model. The mean performance is reported over 3 trials for all single-model-based methods and over 2 trials for ensemble-based methods due to their higher computational cost.

4.4. Results

4.4.1 Is active learning beneficial in more labor-intense labeling tasks than image classification, e.g. semantic segmentation?

To assess the relevance of active learning for semantic segmentation, we first train the model using only augmentations, including random horizontal flipping and random resized cropping. Other geometry-preserving augmentations like brightness, contrast, rotation, and scaling do not show any measurable improvement for semantic segmentation [4]. In this setting, the uncertainty method based on entropy performs best and shows an improvement of around 1.1% mIoU over the random sampling baseline after 5 AL sampling cycles. The LL method fails to outperform the random baseline. The corresponding performance curves are shown with solid lines in Figure 13.


[Figure 13 plot: mIoU vs. annotation cost in clicks on PASCAL VOC; curves for Random-Image, Entropy-Image, LL-Image, ENS-ent-Image, their SSL-integrated counterparts, SSL-D-score-Image, and the fully-supervised models on approximated and original labels.]

Figure 13. Integrated SSL-AL methods for semantic segmentation mostly perform better than their raw counterparts on PASCAL VOC with Bi = Bs = 5000 clicks (≈ 1.5% of the dataset). None of the methods outperforms SSL with random sampling.

Further, we use the unlabeled pool of images to train the target model for semantic segmentation and analyze the performance. We utilize the s4GAN model [29] to leverage the information from unlabeled samples. When used on top of the SSL method, all AL methods show a clear gain in performance. The performance of the random baseline (Random), when combined with s4GAN, increases by the largest margin of 4.1% mIoU and reaches the overall best value. In addition, the SSL-D-score heuristic shows performance comparable to the random baseline after 5 sampling cycles, but does not bring any improvement over the SSL-Random baseline. The performance curves for all integrated methods are shown with dashed lines in Figure 13. Figure 12 shows qualitative results at each sampling cycle, comparing the Entropy-Image method and the SSL-Random-Image baseline.

Figure 14. Image labeling in the polygon-level labeling regime. From left to right: original image, approximated ground truth, pixel-wise entropy, and the polygon selected for labeling based on the entropy heuristic.

4.4.2 Polygon-level Labeling

Here, we explore whether labeling only a part of an image is more effective than labeling the complete image. We evaluate active learning methods for semantic segmentation where only a region of an image is selected by the query function. This region is approximately labeled with a polygon by the annotation simulator. We evaluate methods where an image is selected randomly, but the polygon within the image is selected based on the active learning heuristic. We compare entropy-based and random polygon selection in both the raw and the SSL-integrated active learning settings.

Experiment Details. The entropy of a polygon is measured in a similar way as in the image-level labeling regime: we create a binary mask from the pixel-wise entropy based on a threshold and use the area of high-entropy pixels as our selection heuristic (see the sketch below). Only in the first cycle are images completely labeled until Bi is covered. In the subsequent cycles, images are labeled polygon-wise until the sampling budget Bs is exhausted. Figure 14 shows two examples of how the polygon-level labeling regime works based on the entropy heuristic. The budget settings and the hyperparameters exactly match those of the image-level labeling regime.
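Under one plausible reading of this heuristic, each candidate polygon is scored by the number of high-entropy pixels it contains. The sketch below assumes candidate regions are given as binary masks (e.g., the connected components produced by the polygon annotation simulator); this is our interpretation rather than the exact procedure.

```python
import numpy as np

def select_polygon(pixel_entropy, candidate_masks, threshold=0.6):
    """Pick the candidate region containing the largest high-entropy area.

    pixel_entropy:   (H, W) per-pixel entropy of the current model's prediction.
    candidate_masks: list of (H, W) boolean masks, one per candidate polygon.
    """
    high_entropy = pixel_entropy > threshold          # binary mask of uncertain pixels
    scores = [np.sum(high_entropy & mask) for mask in candidate_masks]
    return int(np.argmax(scores))                     # index of the polygon to annotate next
```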

Results. The entropy-based polygon selection approach is more effective than random polygon selection in the raw active learning (without SSL) setting. However, when combined with semi-supervised learning, both the entropy-based (Random-Image-Entropy-Polygon) and the random (Random-Image-Random-Polygon) polygon selection strategies perform very similarly. Results are shown in Figure 15. Moreover, when all polygon-level labeling approaches are compared with the image-level labeling approaches, we find that the SSL-Random-Image baseline even outperforms all polygon-level active learning methods.

[Figure 15 plot: mIoU vs. annotation cost in clicks on PASCAL VOC; curves for SSL-Random-Image, Random-Image-Random-Polygon, Random-Image-Entropy-Polygon, SSL-Random-Image-Random-Polygon, and SSL-Random-Image-Entropy-Polygon.]

Figure 15. Active learning for semantic segmentation: comparison between SSL integrated with active learning (SSL-X) and the standard setting. Results are shown on the PASCAL VOC dataset with Bi = Bs = 5000 clicks. The suffixes 'Image' and 'Polygon' refer to the image-level and polygon-level labeling regimes, respectively.



In this experiment, we also observed that the SSL-Random-Image baseline outperformed the SSL-Random-Image-Random-Polygon baseline, showing that image-level labeling is a more effective way of labeling an image.

5. Discussion

Our experiments provide strong evidence that the current evaluation protocol used in active learning is sub-optimal, which in turn leads to wrong conclusions about the methods' performance and the state of the field in general.

Evaluating on CIFAR-100, which differs only marginally from CIFAR-10, dramatically changes the ranking of the methods. Applying state-of-the-art data augmentation significantly increases the scores of all methods, making them virtually indistinguishable in terms of final performance.

Modern semi-supervised learning algorithms applied in the conventional active learning setting show a higher relative performance increase than any of the active learning methods proposed in recent years.

State-of-the-art active learning approaches often fail to outperform simple random sampling, especially when the labeling budget is small, a setting critically important for many real-world applications.

Based on these observations, we formulate a more appropriate evaluation protocol and recommend using it for benchmarking future active learning methods.

1. AL methods should be evaluated on a wider range of datasets to assess their general robustness.

2. It is important to evaluate AL methods with up-to-date network architectures and up-to-date augmentation techniques.

3. There should always be a direct comparison between AL methods and SSL methods.

4. Together with the existing large-budget regime, AL methods should be evaluated in the low-budget regime.

It would be interesting to know why AL often performs worse than random sampling, and consistently does so in the low-budget regime. For now, we can only speculate. We believe that AL sampling introduces a bias into the distribution of annotated samples, i.e., the sampled distribution no longer sufficiently matches the true distribution. The damage caused by this bias is larger than the positive effect of learning from "more interesting" samples. If this hypothesis is true, research in active learning should focus on ways to avoid any bias introduced by the selection strategy.

Our results also indicate that adapting AL methods to tasks with a higher labeling cost, e.g. semantic segmentation, is a non-trivial problem. Although there is not enough empirical evidence for this, we speculate that such tasks can potentially benefit much more from employing informed sample selection strategies and thus define a promising direction for future research.

Acknowledgements

This study was supported by the German Federal Ministry of Education and Research via the project Deep-PTL and by the Intel Network of Intelligent Systems.

References

[1] Hamed H. Aghdam, Abel Gonzalez-Garcia, Joost van de Weijer, and Antonio M. Lopez. Active learning for deep detection neural networks. In ICCV, 2019.
[2] Anonymous. Consistency-based semi-supervised active learning: Towards minimizing labeling budget. Submitted to ICLR, 2020. Under review.
[3] Anonymous. Rethinking deep active learning: Using unlabeled data at model training. Submitted to ICLR, 2020. Under review.
[4] Anonymous. Semi-supervised semantic segmentation needs strong, high-dimensional perturbations. Submitted to ICLR, 2020. Under review.
[5] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. There are many consistent explanations of unlabeled data: Why you should average. In ICLR, 2019.
[6] William H. Beluch, Tim Genewein, Andreas Nürnberger, and Jan M. Köhler. The power of ensembles for active learning in image classification. In CVPR, 2018.
[7] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.
[8] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In ICML, 2015.
[9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 2018.
[10] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation strategies from data. In CVPR, 2019.
[11] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[12] David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. In Classics in Cartography, 2011.
[13] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.
[14] Geoffrey French, Timo Aila, Samuli Laine, Michal Mackiewicz, and Graham D. Finlayson. Consistency regularization and CutMix for semi-supervised semantic segmentation. CoRR, 2019.
[15] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. In ICLR Workshop Track, 2016.
[16] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
[17] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In ICML, 2017.
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[19] W.-C. Hung, Y.-H. Tsai, Y.-T. Liou, Y.-Y. Lin, and M.-H. Yang. Adversarial learning for semi-supervised semantic segmentation. In BMVC, 2018.
[20] S. D. Jain and K. Grauman. Active image segmentation propagation. In CVPR, 2016.
[21] Elmer H. Johnson. Elementary applied statistics: For students in behavioral science. Social Forces, 1966.
[22] Tarun Kalluri, Girish Varma, Manmohan Chandraker, and C. V. Jawahar. Universal semi-supervised semantic segmentation. In ICCV, 2019.
[23] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS, 2017.
[24] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In ICLR (Poster), 2017.
[25] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, 2017.
[26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[27] B. Liu and V. Ferrari. Active learning for human pose estimation. In ICCV, 2017.
[28] Radek Mackowiak, Philip Lenz, Omair Ghori, Ferran Diego, Oliver Lange, and Carsten Rother. Cereals: Cost-effective region-based active learning for semantic segmentation. In BMVC, 2018.
[29] Sudhanshu Mittal, Maxim Tatarchenko, and Thomas Brox. Semi-supervised semantic segmentation with high- and low-level consistency. arXiv preprint arXiv:1908.05724, 2019.
[30] T. Miyato, S. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. TPAMI, 2019.
[31] Avital Oliver, Augustus Odena, Colin A. Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In NeurIPS, 2018.
[32] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In NeurIPS, 2016.
[33] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[34] R. Paul, D. Feldman, D. Rus, and P. Newman. Visual precis generation using coresets. In ICRA, 2014.
[35] Mahdyar Ravanbakhsh, Tassilo Klein, Kayhan Batmanghelich, and Moin Nabi. Uncertainty-driven semantic segmentation through human-machine collaborative learning. In International Conference on Medical Imaging with Deep Learning, Extended Abstract Track, 2019.
[36] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In ICLR, 2018.
[37] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Annual Workshop on Computational Learning Theory, 1992.
[38] Claude Elwood Shannon. A mathematical theory of communication. The Bell System Technical Journal, 1948.
[39] Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In ICCV, 2019.
[40] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, 2017.
[41] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[42] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848, 2019.
[43] Lin Yang, Yizhe Zhang, Jianxu Chen, Siyuan Zhang, and Danny Z. Chen. Suggestive annotation: A deep active learning framework for biomedical image segmentation. In MICCAI, 2017.
[44] Donggeun Yoo and In So Kweon. Learning loss for active learning. In CVPR, 2019.
[45] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
[46] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-supervised semi-supervised learning. In ICCV, 2019.