


Semi- and Weakly Supervised Directional Bootstrapping Model for Automated Skin Lesion Segmentation

Yutong Xie, Jianpeng Zhang, Yong Xia, and Chunhua Shen

Abstract—Automated skin lesion segmentation on dermoscopy images is an essential and challenging task in the computer-aided diagnosis of skin cancer. Despite their prevalence and relatively good performance, deep learning based segmentation methods require a large number of training images with pixel-level dense annotation, which are hard to obtain owing to the effort and cost of dermoscopy image acquisition and annotation. In this paper, we propose the semi- and weakly supervised directional bootstrapping (SWSDB) model for skin lesion segmentation, which consists of three deep convolutional neural networks: a coarse segmentation network (coarse-SN), a dilated classification network (dilated-CN) and an enhanced segmentation network (enhanced-SN). Both the coarse-SN and enhanced-SN are trained using images with pixel-level annotation, and the dilated-CN is trained using images with image-level class labels. The coarse-SN generates rough segmentation masks that provide a prior bootstrapping for the dilated-CN and help it produce accurate lesion localization maps. The maps are then fed into the enhanced-SN, transferring the localization information learned from image-level labels to the enhanced-SN to generate the segmentation results. Furthermore, we introduce a hybrid loss, the weighted sum of a dice loss and a rank loss, to the coarse-SN and enhanced-SN, making both networks robust to imbalanced classes and to the imbalance between hard and easy pixels. We evaluated the proposed SWSDB model on the ISIC-2017 challenge dataset and the PH2 dataset, achieving a Jaccard index of 80.4% and 89.4%, respectively, and setting a new record in skin lesion segmentation.

Index Terms—Semi-supervised learning, weakly supervised learning, skin lesion segmentation.

I. INTRODUCTION

Skin cancer is one of the most common cancers worldwide, with an estimated 99,550 new cases and 13,460 deaths in the United States in 2018 [1], [2]. Dermoscopy, a popular in vivo non-invasive imaging tool, is one of the essential means to improve diagnostic performance and reduce skin cancer deaths [3]. Dermoscopy images produced globally are currently analysed by dermatologists almost entirely through visual inspection, which requires a high degree of skill and concentration and is time-consuming, expensive, and prone to operator bias.

This work was supported in part by the National Natural Science Foundation of China under Grants 61771397 and 61471297, and by an Australian Research Council (ARC) Future Fellowship to C. Shen. (Corresponding author: Y. Xia)

Y. Xie, J. Zhang, and Y. Xia are with the National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China (e-mail: [email protected]; [email protected]; [email protected]). C. Shen is with the School of Computer Science, University of Adelaide, SA, Australia ([email protected]). The first two authors' contributions were made when visiting The University of Adelaide.

Fig. 1. Challenges in skin lesion segmentation. Skin lesions exhibit a wide variety of appearances, fuzzy object borders, and inevitable distractions. Red dashed lines represent the skin lesions contoured by dermatologists.

Computer-aided analysis avoids many of these issues and has increasingly been studied to assist dermatologists in improving the efficiency and objectivity of dermoscopy image analysis.

Automated segmentation of skin lesions is an essential step in the computer-aided analysis of dermoscopy images [4], [5]. It is a very challenging task for three reasons. First, lesions of different subjects exhibit significant variations in location, shape, color, and texture. Second, the low contrast between a lesion and the surrounding skin results in fuzzy lesion boundaries. Third, some cases contain artifacts, such as hair, frames, blood vessels, and air bubbles. Six examples that illustrate these issues are displayed in Figure 1.

Recently, deep convolutional neural network (DCNN) models have formulated skin lesion segmentation as a pixel-level classification problem and achieved remarkable success [6], [4], [7], [8], [9], [10], [11]. However, training a DCNN model requires a large number of images with pixel-wise labels, which are hard to obtain due to the tremendous effort and cost of dermoscopy image acquisition and pixel-wise annotation. There are two strategies to address this issue. Semi-supervised learning (SSL) enables us to train a DCNN using both labeled and unlabeled data, reducing the required number of pixel-wisely labeled images. Weakly supervised learning (WSL) uses weak annotations, e.g. bounding boxes [12], scribbles [13], points [14] or image-level labels [15], [16], to train the model, reducing the annotation workload.

arXiv:1903.03313v2 [cs.CV] 12 Mar 2019


Among the various types of weak annotations, image-level labels are the weakest and the easiest to obtain [15], [16]. We propose hybrid semi- and weakly supervised learning for skin lesion segmentation, i.e., training the segmentation DCNN using some images with pixel-wise dense annotation and other images with only image-level labels.

Specifically, we tackle the problem by focusing on generating high-quality lesion localization maps from image-level labels. With the bootstrapping of these maps, we train a segmentation network with pixel-level labels. To this end, we need to solve two problems: (1) how to obtain accurate localization maps, and (2) how to use the maps to improve the segmentation network. A typical solution to the first problem is to leverage a classification network to localize the lesion region and obtain high-quality class-specific maps [15], [16], [17], [18]. However, directly using the maps produced by the classification network may capture only the discriminative region related to the class, which is not sufficiently accurate to guide the generation of a good segmentation mask. A popular approach to the second problem is to create proxy pixel-level labels for the image-level labelled images based on the localization maps, and to jointly use the proxy label set and the real label set to train a segmentation network [15], [16]. Nevertheless, once the localization is wrong, the proxy labels are corrupted and may lead to unsatisfactory segmentation results; in this case, the proxy labels optimize the segmentation network in the wrong direction.

In this paper, we propose a semi- and weakly supervised directional bootstrapping (SWSDB) model for skin lesion segmentation, which consists of three DCNNs: a coarse segmentation network (coarse-SN), a dilated classification network (dilated-CN) and an enhanced segmentation network (enhanced-SN). Both the coarse-SN and enhanced-SN are trained using data with pixel-level dense annotations, and the dilated-CN is trained using data with image-level class labels. The coarse-SN is constructed to generate rough lesion masks, which, together with the original images, are fed into the dilated-CN to produce lesion localization maps. Each rough lesion mask provides a prior bootstrapping for the dilated-CN and makes it focus more accurately on the lesion. The enhanced-SN has an encoder-decoder architecture, in which the image features extracted by the encoder are first fused with the lesion localization maps using an enhanced layer, and then fed into the decoder to generate the segmentation results. In this way, we not only transfer the high-quality localization information learned by the dilated-CN to the enhanced-SN, but also alleviate the negative impact of inaccurate localization on the segmentation results by fusing the image features with the localization maps. We introduce a hybrid loss, the weighted sum of a dice loss and a rank loss, to both the coarse-SN and enhanced-SN. The dice loss copes well with imbalanced classes, whereas the rank loss copes well with the imbalance between hard and easy pixels. The results on the ISIC-2017 and PH2 datasets show that the proposed SWSDB model outperforms the state of the art in skin lesion segmentation. Our contributions are summarized as follows:

1) We propose the SWSDB model, which uses image-level labels to guide pixel-wise training and alleviates the reliance on densely annotated training samples. This is significant since pixel-wise annotations are difficult and expensive to acquire in medical imaging. We achieve new state-of-the-art skin lesion segmentation performance on the ISIC-2017 and PH2 datasets.

2) We leverage the rough lesion masks from the coarse-SN to boost the localization ability of the dilated-CN, which in turn promotes the enhanced-SN. We also propose a novel and effective method that transfers the high-quality localization information from the classification network to the segmentation network for a performance boost.

3) To improve segmentation on images with imbalanced hard and easy pixels, we propose a rank loss that is employed jointly with the dice loss in the segmentation networks.

II. RELATED WORK

A. Skin lesion segmentation

A series of methods have been proposed to segment lesions from the background skin, which can be broadly classified into clustering, histogram thresholding, region growing, active contour and deep learning approaches [5], [19], [4], [7], [8], [9], [20], [10], [11]. In particular, recent DCNN-based methods have achieved notable success in skin lesion segmentation [4], [7], [8], [9], [10], [11]. Bi et al. [9] propose multi-stage fully convolutional networks (FCN) with a parallel integration method to segment skin lesions. Yuan et al. [4] develop an end-to-end DCNN with a Jaccard distance based loss for skin lesion segmentation, without prior knowledge or sample re-weighting. Li et al. [7] present a new dense deconvolutional network based on residual learning. Mirikharaji et al. [10] encode a star shape prior into the loss of a DCNN to guarantee a global structure in the segmentation results. Sarker et al. [11] present a robust deep encoder-decoder framework to accurately segment the boundaries of lesion regions. Despite great improvements, these DCNN-based methods are designed for the fully supervised setting and need a huge number of pixel-level labels, which require a large amount of labor and time to obtain. In contrast, obtaining image-level labels is much easier. A natural idea is therefore to use a large number of image-level labels to help a segmentation network learn from pixel-level labels, which is the focus of this paper.

B. Image segmentation based on image-level labels

Image-level labels are the simplest and easiest-to-get supervision for learning to segment, compared to other supervision such as bounding boxes [12], scribbles [13], and points [14]. Segmentation methods based on image-level labels mainly focus on generating high-quality object region cues for supervising segmentation. Hong et al. [15] propose a decoupled network for semi-supervised segmentation, where class-specific localization cues are transferred from the classification network to the segmentation network. Kolesnikov et al. [18] propose a seed, expand and constrain (SEC) framework in which object localization cues, expansion and boundary refinement are integrated into a unified framework for segmentation.


Fig. 2. Illustration of our pipeline, which contains three steps: (1) we train the coarse-SN with the hybrid loss on the images with pixel-level labels; (2) we concatenate the original input and the mask obtained by the coarse-SN as the input to train the dilated-CN on the images with image-level labels for localization; (3) we train the enhanced-SN with the hybrid loss on the images with pixel-level labels, fusing the localization maps generated by the dilated-CN with the image features extracted by the encoder as the input of the decoder to produce the final segmentation masks.

However, both methods provide only small and sparse class-related regions for supervision, which is not sufficiently accurate for learning a reliable segmentation model. To ease this problem, Wei et al. [21] present an adversarial erasing method that progressively trains multiple classification networks to expand object regions. Shen et al. [17] propose a bi-directional transfer learning framework to generate high-quality object localization masks. References [22] and [16] replace traditional convolutions with dilated convolutions, enlarging the receptive field to obtain more accurate object regions. Our model also adopts dilated convolutions in the classification network. Different from [22] and [16], we additionally provide a coarse prior mask to help the dilated network better localize lesions. Both the decoupled network [15] and the proposed model feed class-related localization maps generated by a classification network into a decoder network. However, the decoupled network discards the encoder and trains only the decoder, so error accumulation caused by inaccurate localization is hard to correct. We fuse the image features produced by the encoder with the localization maps as the decoder input and then train both encoder and decoder, which eases the negative impact of inaccurate localization compared with using the maps alone.

III. METHOD

The pipeline of our proposed SWSDB model is shown in Figure 2 and consists of three networks: the coarse-SN, dilated-CN and enhanced-SN. Denote the segmentation training set by I_N = {(X_n, Y_n)}_{n=1}^{N_1}, which consists of N_1 images. Each input X_n is annotated with pixel-level labels Y_n, and each pixel label belongs to the background class {0} or the lesion class {1}. Denote the classification training set by I_M = {(X_m, Y_m)}_{m=1}^{N_2}, which consists of N_2 (N_2 > N_1) images. Each input X_m is annotated with an image-level label Y_m ∈ {l_1, ..., l_c}, where c is the number of classes. Our goal is to employ the dataset I_M to improve the segmentation performance on the dataset I_N. The three networks transfer information directionally and finally provide a high-quality segmentation mask. In detail, we first train the coarse-SN on the dataset I_N. Then, the masks produced by the coarse-SN bootstrap the training of the dilated-CN on the dataset I_M to produce high-quality lesion localization maps. Finally, we train the enhanced-SN on the dataset I_N with the help of the localization maps generated by the dilated-CN to obtain the final masks.

A. Coarse-SN

Our SWSDB model first trains a coarse-SN on pixel-level labels to roughly segment lesions. The coarse mask provides a rough lesion location for the dilated-CN to enhance its localization ability. We employ the state-of-the-art semantic segmentation network DeepLabv3+ [23], pre-trained on the MS-COCO [24] and PASCAL VOC 2012 [25] datasets, as the backbone of the coarse-SN. To adapt the DeepLabv3+ network to our segmentation task, we remove its last convolutional layer and add a new convolutional layer with a single output channel for prediction. The weights of the new layer are randomly initialized, and the activation function of the last layer is set to the sigmoid function.
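As a minimal sketch of this adaptation in Keras (the framework the paper reports using), the function below swaps the final classifier convolution of a pretrained segmentation model for a randomly initialized single-channel sigmoid head. A DeepLabv3+ builder is not part of tf.keras, so `base` is assumed to be supplied from elsewhere, with its prediction convolution as the last layer.

```python
# Minimal sketch (Keras): replace the final classifier convolution of a
# pretrained segmentation model with a 1-channel sigmoid head. `base` is
# assumed to be a DeepLabv3+ Keras Model obtained elsewhere, whose last
# layer is the original prediction convolution.
from tensorflow.keras import layers, Model

def make_coarse_sn(base: Model) -> Model:
    # Drop the original (multi-class) prediction layer.
    features = base.layers[-2].output
    # New, randomly initialized single-channel head for the binary
    # lesion/background task, with a sigmoid activation.
    x = layers.Conv2D(1, kernel_size=1, name="lesion_pred")(features)
    probs = layers.Activation("sigmoid", name="lesion_prob")(x)
    return Model(inputs=base.input, outputs=probs)
```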

B. Bootstrapping dilated-CN for localization

We use the coarse masks generated by the coarse-SN to boost the localization ability of the dilated-CN on the training images with image-level labels. The motivation is that directly employing the dilated-CN to localize lesions may capture only the discriminative regions related to the class, which is not sufficiently accurate to guide the enhanced-SN to generate a good mask.


Fig. 3. Architecture of dilated-CN.

Fig. 4. Comparison of the CAMs obtained from the dilated-CN without (w/o) and with the bootstrapping of coarse masks. Panels: (a) input, (b) w/o coarse mask, (c) with coarse mask, (d) ground truth.

The coarse masks provide a prior bootstrapping that eases this difficulty and yields an accurate location of the lesions. The architecture of the dilated-CN is shown in Figure 3. The images with image-level labels and the corresponding coarse masks are concatenated as the input to the dilated-CN. Inspired by the recent success of dilated convolution for dense object localization from image-level labels [22], [16], we build the classification network on dilated convolutions. The network is based on the popular classification network Xception [26], pre-trained on ImageNet [27]. We first remove the last pooling layer of Xception to enlarge the resolution of the feature maps. Then, separable dilated convolutions with a dilation rate of 2 replace the last two separable convolutions of Xception. This operation localizes lesion-related regions perceived by a larger receptive field [22], [16]. After global average pooling (GAP), the resulting features are fed to a new fully connected (FC) layer with c neurons, followed by a softmax activation function, to produce the image-level predictions.
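The following Keras sketch is a simplified stand-in for this design: it truncates an Xception trunk before its last downsampling stage, appends two dilated separable convolutions (rate 2), and adds GAP plus a c-way softmax head. The layer names come from tf.keras.applications.Xception; weights=None is used because Keras' ImageNet weights assume 3-channel input, whereas here the input has 4 channels (RGB plus coarse mask), whose extra channel is initialized separately as described next.

```python
# Simplified stand-in for the dilated-CN (layer names from
# tf.keras.applications.Xception). The trunk is cut before 'block13_pool'
# so the feature resolution is not halved again, and two dilated
# separable convolutions (rate 2) stand in for Xception's last separable
# convolutions. weights=None because ImageNet weights expect 3 channels.
import tensorflow as tf
from tensorflow.keras import layers, Model

def make_dilated_cn(num_classes: int, input_shape=(224, 224, 4)) -> Model:
    trunk = tf.keras.applications.Xception(
        include_top=False, weights=None, input_shape=input_shape)
    x = trunk.get_layer("block13_sepconv2_bn").output
    # Dilated separable convolutions enlarge the receptive field without
    # further downsampling, as in [22], [16].
    x = layers.SeparableConv2D(1024, 3, padding="same", dilation_rate=2,
                               activation="relu", name="dilated_sep1")(x)
    x = layers.SeparableConv2D(2048, 3, padding="same", dilation_rate=2,
                               activation="relu", name="dilated_sep2")(x)
    x = layers.GlobalAveragePooling2D(name="gap")(x)
    out = layers.Dense(num_classes, activation="softmax", name="cls")(x)
    return Model(trunk.input, out)
```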

Fig. 5. Architecture of enhanced-SN.

The weights of the revised separable dilated convolutional layers and the new FC layer are randomly initialized. The input weights of the 4th channel (i.e., the coarse mask) are initialized by averaging the weights of the other three channels (i.e., the RGB image). We optimize the dilated-CN by minimizing the cross-entropy loss. Finally, we employ the class activation map (CAM) [28] method to produce the fine localization maps M from the last separable dilated convolutional layer. Figure 4 shows a qualitative comparison between the dilated-CN without and with the bootstrapping of coarse masks. The CAMs in the second column, generated by the dilated-CN without coarse masks, localize the lesions poorly. With the coarse masks, the CAMs in the third column provide much better localization maps.
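The two steps above, 4th-channel initialization and CAM extraction, can be sketched as follows, assuming the dilated-CN from the previous snippet ('block1_conv1' is Xception's first convolution in Keras; 'dilated_sep2' and 'cls' were named in that sketch):

```python
# Sketch of the 4th-channel initialization and CAM extraction, assuming
# the model built in the previous snippet.
import numpy as np
import tensorflow as tf

def init_fourth_channel(model, rgb_kernel: np.ndarray):
    """rgb_kernel: (3, 3, 3, 32) pretrained first-conv weights of a
    3-channel Xception; the mask channel is set to their average."""
    conv1 = model.get_layer("block1_conv1")
    kernel = conv1.get_weights()[0]               # (3, 3, 4, 32)
    kernel[..., :3, :] = rgb_kernel               # copy RGB kernels
    kernel[..., 3, :] = rgb_kernel.mean(axis=2)   # mask = RGB average
    conv1.set_weights([kernel])

def class_activation_map(model, x, class_idx: int):
    """CAM [28]: weight the last dilated conv's feature maps by the FC
    weights of the target class and sum over channels."""
    feat_model = tf.keras.Model(
        model.input, model.get_layer("dilated_sep2").output)
    feats = feat_model(x)                          # (B, h, w, 2048)
    w = model.get_layer("cls").get_weights()[0]    # (2048, c)
    cam = tf.tensordot(feats, w[:, class_idx], axes=[[3], [0]])
    cam = tf.nn.relu(cam)                          # keep positive evidence
    return cam / (tf.reduce_max(cam) + 1e-8)       # normalize to [0, 1]
```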

C. Bootstrapping enhanced-SN for predictions

The class-specific localization maps M generated by the dilated-CN are employed to further help the training of the enhanced-SN on the pixel-level labels. The architecture of the enhanced-SN is shown in Figure 5; it consists of an encoder network, a decoder network and an enhanced layer between the two. The encoder and decoder networks share their architecture and parameters with the coarse-SN. The enhanced layer first concatenates the output features F of the encoder and the localization maps M along the channel dimension, and then uses a 1×1 convolutional layer followed by batch normalization and a ReLU activation to fuse the high-level features with the lesion localization information. The resulting feature maps from the enhanced layer are fed into the decoder to produce the final fine segmentation mask. We randomly initialize the weights of the enhanced layer and train the enhanced-SN on the pixel-level labels.
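A minimal Keras sketch of the enhanced layer, assuming the localization maps have already been resized to the spatial size of the encoder features; the fused channel width (256 here) is an assumption, as the paper does not state it:

```python
# Minimal sketch of the enhanced layer: concatenate encoder features F
# and localization maps M along channels, then fuse with a 1x1
# convolution, batch normalization and ReLU. The maps are assumed to be
# resized to the encoder's spatial size; 256 fused channels is a guess.
from tensorflow.keras import layers

def enhanced_layer(encoder_feats, loc_maps, fused_channels=256):
    x = layers.Concatenate(axis=-1)([encoder_feats, loc_maps])
    x = layers.Conv2D(fused_channels, kernel_size=1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)   # this output is fed into the decoder
```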

D. Hybrid loss for coarse- and enhanced-SN

We propose a hybrid loss function to optimize the coarse- and enhanced-SN, which consists of a dice loss and a rank loss and is defined as:

L_hybrid = L_dice + λ L_rank,    (1)

where λ is a weight factor balancing the two losses. L_dice is the dice loss, which measures the agreement between the prediction and the ground truth and copes well with class-imbalanced data [29]. However, it is still challenging to train the segmentation network to a high-quality score using the dice loss alone, given the severe imbalance between hard and easy pixels in each sample.


Fig. 6. Comparison of the predicted masks of the coarse-SN with the dice loss and with the proposed hybrid loss. Panels: (a) input, (b) dice loss, (c) hybrid loss, (d) ground truth.

We tackle this problem by designing a rank loss that adds additional constraints between hard pixels of the lesion and of the background. The motivation is that easily recognized pixels contribute little to the optimization, whereas hard pixels (e.g., boundary pixels) are difficult to distinguish and thus more informative for network learning. We design an online ranking scheme that selects hard pixels dynamically based on the prediction error, which originates from the observation that hard pixels usually produce larger errors than easy ones. In implementation, after the forward propagation of each batch, we rank the pixels of the lesion and background areas by their error, respectively. The top K pixels with the largest error in the lesion or background area are selected as the hard pixels of that area. Let H^0_ni and H^1_nj be the prediction values of the i-th hard background pixel and the j-th hard lesion pixel of the n-th input image. We define the rank loss as:

L_rank(X_n, Y_n) = (1/K^2) Σ_{i=1}^{K} Σ_{j=1}^{K} max{0, H^0_ni(X_n, Y_n) − H^1_nj(X_n, Y_n) + margin},    (2)

which enforces H^1_nj > H^0_ni + margin during training. This design makes the network pay more attention to hard pixels and thus learn more discriminative information from them. Figure 6 shows a qualitative comparison between the coarse-SN trained with the dice loss and with the proposed hybrid loss. The masks in the second column, generated by the coarse-SN with the dice loss, give only the rough extent of the lesion; the masks in the third column, from the hybrid loss, are clearly more accurate.
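Putting the two terms together, the sketch below implements a common soft dice formulation (the smoothing constant is an assumption; the paper cites [29] without spelling out its exact variant) and the online hard-pixel rank loss of Eq. (2), written per image for clarity, with the paper's hyper-parameters K = 30, margin = 0.3 and λ = 0.05 as defaults:

```python
# Sketch of the hybrid loss: a common soft dice variant (smooth=1.0 is
# an assumption) plus the rank loss of Eq. (2), which selects the top-K
# hard pixels per class online by prediction error.
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1.0):
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    inter = tf.reduce_sum(y_true * y_pred)
    denom = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)

def rank_loss(y_true, y_pred, k=30, margin=0.3):
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    err = tf.abs(y_true - y_pred)
    # Hard pixels: largest prediction error within each class region.
    bg_err = tf.where(tf.equal(y_true, 0.0), err, tf.zeros_like(err))
    fg_err = tf.where(tf.equal(y_true, 1.0), err, tf.zeros_like(err))
    h0 = tf.gather(y_pred, tf.math.top_k(bg_err, k=k).indices)  # background
    h1 = tf.gather(y_pred, tf.math.top_k(fg_err, k=k).indices)  # lesion
    # Pairwise hinge over all K x K pairs: enforce h1_j > h0_i + margin.
    return tf.reduce_mean(tf.nn.relu(h0[:, None] - h1[None, :] + margin))

def hybrid_loss(y_true, y_pred, lam=0.05):
    return dice_loss(y_true, y_pred) + lam * rank_loss(y_true, y_pred)
```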

IV. EXPERIMENTS

A. Datasets

We evaluate the proposed SWSDB model on two benchmark datasets.

ISIC-2017 dataset. The dataset of the 2017 International Skin Imaging Collaboration (ISIC) skin lesion segmentation challenge [30] contains 2000 training, 150 validation, and 600 test dermoscopy images. Each image is paired with an expert manual tracing of the skin lesion boundaries for the segmentation task and a gold-standard lesion diagnosis (melanoma, nevus, or seborrheic keratosis) for the classification task. Furthermore, we collect 1320 additional dermoscopy images with only image-level labels, including 466 melanoma, 32 seborrheic keratosis, and 822 nevus images, from the ISIC archive¹ to extend the training set for the classification task.

PH2 dataset. The PH2 public dataset [31] contains 200 dermoscopy images, including 160 nevi and 40 melanomas. All 200 images were acquired under the same conditions with the Tuebinger Mole Analyzer system at 20-fold magnification, and each is paired with an expert manual tracing of the skin lesion boundaries.

B. Training/testing settings

Training phase. The coarse- and enhanced-SN are trained on the ISIC-2017 training set with pixel-level labels only, while the dilated-CN is trained on the ISIC-2017 and additional training sets with image-level labels only. To alleviate overfitting, we employ online data augmentation, including random central cropping of each training image at a scale of 50% to 100% of the original size, random rotation from 0 to 10 degrees, shear from 0 to 0.1 radians, shifts from 0 to 20 pixels, zoom up to 110% of width and height, whitening, and horizontal and vertical flips. The augmented patches are then resized to 224 × 224 for training. The Adam algorithm [32] with a batch size of 16 and 32 is adopted to optimize the segmentation and classification networks, respectively. Both learning rates are initialized to 0.0001, and both maximum epoch numbers are set to 500. We use the ISIC-2017 validation set to monitor the performance of the three networks and stop training when a network begins to overfit. For the hyper-parameters of the hybrid loss, we set λ = 0.05, K = 30, and margin = 0.3.

Testing phase. The ISIC-2017 testing set is fed directly to the trained SWSDB model to evaluate the results. For the PH2 dataset, we perform two experiments. In the first, we directly apply the SWSDB model trained on the ISIC-2017 training set to the PH2 dataset. In the second, we fine-tune the trained SWSDB model on the PH2 dataset with 4-fold cross-validation: in each fold, we fine-tune the model with 50 images and validate on the ISIC-2017 validation set; the network performing best on the validation set is selected and applied to segment the remaining 150 images. We calculate four evaluation metrics: the Jaccard index (JA), dice coefficient (DI), pixel-wise accuracy (AC) and sensitivity (SE). Note that the final ranking on the ISIC-2017 testing set is determined by JA in this challenge.
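For reference, the four reported metrics can be computed from binary masks as below; this follows their standard confusion-matrix definitions and may differ in minor details from the challenge's official evaluation code:

```python
# Standard confusion-matrix definitions of the four metrics on binary
# masks; the challenge's official evaluation code may differ in detail.
import numpy as np

def seg_metrics(gt: np.ndarray, pred: np.ndarray) -> dict:
    gt, pred = gt.astype(bool), pred.astype(bool)
    tp = np.sum(gt & pred)
    fp = np.sum(~gt & pred)
    fn = np.sum(gt & ~pred)
    tn = np.sum(~gt & ~pred)
    return {
        "JA": tp / (tp + fp + fn),            # Jaccard index
        "DI": 2 * tp / (2 * tp + fp + fn),    # dice coefficient
        "AC": (tp + tn) / gt.size,            # pixel-wise accuracy
        "SE": tp / (tp + fn),                 # sensitivity
    }
```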

C. Results

We compare our method with several recently published skin lesion segmentation methods in Table I. On the ISIC-2017 dataset, the compared methods include a convolutional-deconvolutional neural network (CDNN) [8], a dense deconvolutional network (DDN) [7], a fully convolutional network with a star shape prior (FCN+SSP) [10], and a skin lesion segmentation model based on dilated residual and pyramid pooling networks (SLSDeep) [11].

¹ https://www.isic-archive.com


On the PH2 dataset, the compared methods are multi-stage FCN with parallel integration (mFCNPI) [9], a retrained FCN (RFCN) [4], and a simple linear iterative clustering (SLIC) method [20]. Note that we quote the performance reported in the respective published papers.

The results in Table I show that our method achieves the best performance on both datasets. On ISIC-2017, the JA improves by 2.2% over the SLSDeep method [11] (from 78.2% to 80.4%). Directly applying the model trained on the ISIC-2017 training set to the PH2 dataset achieves the best JA, with a 2.7% improvement over the mFCNPI method [9] (from 84.0% to 86.7%), which demonstrates the strong generalization ability of our model. By further fine-tuning the trained model on the PH2 dataset, our SWSDB method achieves an even higher JA of 89.4% and the best DI of 94.2%. We also visualize some segmentation results in Figure 7, which shows that the coarse-SN boosts the dilated-CN to produce more accurate localization maps, and that these maps are useful in guiding the enhanced-SN towards better segmentation results.

D. Ablation analysis

In our model, the coarse-SN boosts the dilated-CN to produce high-quality localization maps, which are transferred into the enhanced-SN to obtain the final fine masks. Therefore, both the quality of the localization maps and the transfer manner affect the segmentation results.

The quality of localization maps. We start with the simplest setting, in which only the coarse-SN is used and trained on pixel-level labels. Using the images with image-level labels, we then train the dilated-CN with and without the bootstrapping of the coarse-SN to generate lesion localization maps, which are used to guide the training of the enhanced-SN. From Table II, we observe that transferring localization maps from the dilated-CN without the bootstrapping of the coarse-SN into the enhanced-SN yields a lower JA than with the bootstrapping (2nd row vs. 3rd row), and even lower than using the coarse-SN alone (2nd row vs. 1st row). The main reason is that the dilated-CN without the bootstrapping of the coarse-SN has a weak localization ability that damages the performance of the enhanced-SN. Our method, which bootstraps the dilated-CN with the coarse-SN, produces good localization maps and thus makes the enhanced-SN perform better than the coarse-SN (3rd row vs. 1st row).

Analysis of transfer manner. Table III compares different transfer manners while keeping the localization maps the same. The decoupled network [15] transfers the localization maps only to the decoder. Our method, which fuses the image features from the encoder with the localization maps as the decoder input, eases the error accumulation caused by inaccurate maps and achieves a higher JA, with a 3.8% improvement. The MDC method [16] creates proxy pixel-level labels from the localization maps for the images with image-level labels, and then trains two segmentation networks on the real and proxy pixel-level labels with shared parameters. Different from MDC, we transfer the maps into the decoder and avoid the impact of inaccurate proxy labels, obtaining a 1.1% JA improvement over MDC.

Analysis of loss function. We further compare the impact of different loss functions by training our SWSDB model with the dice loss and with the proposed hybrid loss. Table IV shows that the hybrid loss attains a higher JA than the dice loss, suggesting that accounting for both imbalanced classes and imbalanced hard-easy pixels produces more accurate estimates.

E. Analysis of the number of pixel-level labels

It is important to analyze how the number of images with pixel-level labels affects the results. Table V reports the performance obtained when the number of images with pixel-level labels is set to 100, 500, 1000 and 2000; the set of images with image-level labels is the same in all settings. As the number increases from 100 to 2000, the overall JA, DI and AC improve by 4.3%, 3.8% and 2.0% on the validation set, respectively. This suggests that the more pixel-level labels we use, the higher the overall performance, which is unsurprising, since more information about skin lesions can be learned as the number of pixel-level labels grows.

In addition, Figure 8 compares our SWSDB model with a fully supervised method for different numbers of images with pixel-level labels. We plot the JA of our SWSDB model (trained with pixel- and image-level labels) and of a fully supervised DeepLabv3+ network (trained with pixel-level labels only). Although the performance of the fully supervised network improves as the number of pixel-level labels increases, our semi- and weakly supervised method consistently obtains higher performance in the same settings, which shows that our model effectively exploits the images with image-level labels. Moreover, as more pixel-level labels are used, the JA gap between our SWSDB and the fully supervised method shrinks from 2.9% to 1.1%, confirming that our method leverages the image-level labelled images best when the set of pixel-level labels is small. Another interesting finding is that our SWSDB model using only 1000 images with pixel-level labels achieves a JA of 78.7%, comparable to the fully supervised method using 2000 images with pixel-level labels. This further demonstrates the effectiveness of our model.

F. Model complexity

The three networks in our SWSDB model are trained using the open-source Keras and TensorFlow software packages. In our experiments, it takes about 2 days to train the SWSDB model (1 day for the coarse-SN, 0.5 day for the dilated-CN and 0.5 day for the enhanced-SN) and less than 0.5 seconds to segment each skin lesion, on a server with 4 NVIDIA GTX 1080 Ti GPUs and 512 GB of memory. Although training the model is time-consuming, it can be done offline. The fast online testing suggests that our SWSDB model could be used in a routine clinical workflow.


TABLE I
PERFORMANCE EVALUATION ON THE ISIC-2017 TESTING SET AND THE PH2 DATASET.

ISIC-2017:
Method               JA    DI    AC    SE
CDNN, 2017 [8]       76.5  84.9  93.4  82.5
DDN, 2017 [7]        76.5  86.6  93.9  82.5
FCN+SSP, 2018 [10]   77.3  85.7  93.8  85.5
SLSDeep, 2018 [11]   78.2  87.8  93.6  81.6
Our SWSDB            80.4  87.8  94.7  87.4

PH2:
Method               JA    DI    AC    SE
mFCNPI, 2017 [9]     84.0  90.7  94.2  94.9
RFCN, 2017 [4]       -     93.8  -     -
SLIC, 2018 [20]      -     -     90.4  91.0
Our SWSDB            86.7  92.6  95.8  97.9
Fine-tuned SWSDB     89.4  94.2  96.5  96.7

Fig. 7. Examples of predicted segmentation masks by our SWSDB model. Rows: input, coarse-SN output, localization map, enhanced-SN output, and ground truth.

TABLE II
ABLATION STUDY USING LOCALIZATION MAPS OBTAINED FROM THE DILATED-CN WITH AND WITHOUT BOOTSTRAPPING OF THE COARSE-SN ON THE ISIC-2017 VALIDATION SET. 'CS' IS COARSE-SN, 'DC' IS DILATED-CN, 'ES' IS ENHANCED-SN.

CS  DC  ES    JA    DI    AC    SE
✓   ×   ×     78.9  86.5  95.7  88.7
×   ✓   ✓     77.7  86.0  95.5  89.1
✓   ✓   ✓     80.0  87.9  96.2  87.4

TABLE III
COMPARISON OF DIFFERENT TRANSFER MANNERS ON THE ISIC-2017 VALIDATION SET.

Method                      JA    DI    AC    SE
Decoupled net, 2015 [15]    76.2  84.9  94.4  89.5
MDC, 2018 [16]              78.9  87.0  96.1  86.1
Our SWSDB                   80.0  87.9  96.2  87.4

TABLE IV
COMPARISON UNDER DIFFERENT LOSS SETTINGS ON THE ISIC-2017 VALIDATION SET.

Method                    JA    DI    AC    SE
SWSDB with dice loss      77.0  84.9  95.4  83.7
SWSDB with hybrid loss    80.0  87.9  96.2  87.4

TABLE V
ABLATION STUDY USING DIFFERENT NUMBERS OF IMAGES WITH PIXEL-LEVEL LABELS ON THE ISIC-2017 VALIDATION SET.

Number  JA    DI    AC    SE
100     75.7  84.1  94.2  90.2
500     76.8  85.5  95.4  91.5
1000    78.7  86.6  95.7  89.5
2000    80.0  87.9  96.2  87.4

Fig. 8. JA curves of our SWSDB model and the fully supervised method for different numbers of images with pixel-level labels on the ISIC-2017 validation set.


V. CONCLUSION

In this paper, we have presented a novel SWSDB model for skin lesion segmentation that uses both image- and pixel-level labels: the coarse-SN provides prior bootstrapping to the dilated-CN for accurate lesion localization, and the resulting fine localization ability is transferred to the enhanced-SN to obtain better lesion segmentation masks. We introduce a hybrid loss for both the coarse-SN and enhanced-SN that makes both networks robust to imbalanced classes and to imbalanced hard-easy pixels. Comprehensive experimental results on the ISIC-2017 and PH2 datasets demonstrate the effectiveness of our SWSDB model, which outperforms the existing state-of-the-art skin lesion segmentation methods by substantial margins. We believe this method is general and can be extended to other semi- and weakly supervised medical image segmentation tasks.

REFERENCES

[1] R. L. Siegel, K. D. Miller, and A. Jemal, "Cancer statistics, 2018," CA: A Cancer Journal for Clinicians, vol. 68, no. 1, pp. 7–30, 2018.

[2] C. M. Balch, J. E. Gershenwald, S.-J. Soong, J. F. Thompson, M. B. Atkins, D. R. Byrd, A. C. Buzaid, A. J. Cochran, D. G. Coit, S. Ding, A. M. Eggermont, K. T. Flaherty, P. A. Gimotty, J. M. Kirkwood, K. M. McMasters, M. C. Mihm, D. L. Morton, M. I. Ross, A. J. Sober, and V. K. Sondak, "Final version of 2009 AJCC melanoma staging and classification," Journal of Clinical Oncology, vol. 27, no. 36, pp. 6199–6206, 2009.

[3] "Diagnostic accuracy of dermoscopy," The Lancet Oncology, vol. 3, no. 3, pp. 159–165, 2002.

[4] Y. Yuan, M. Chao, and Y. Lo, "Automatic skin lesion segmentation using deep fully convolutional networks with Jaccard distance," IEEE Transactions on Medical Imaging, vol. 36, no. 9, pp. 1876–1886, 2017.

[5] E. Ahn, J. Kim, L. Bi, A. Kumar, C. Li, M. Fulham, and D. D. Feng, "Saliency-based lesion segmentation via background detection in dermoscopic images," IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 6, pp. 1685–1693, Nov 2017.

[6] K. Korotkov and R. Garcia, "Computerized analysis of pigmented skin lesions: A review," Artificial Intelligence in Medicine, vol. 56, no. 2, pp. 69–90, 2012.

[7] H. Li, X. He, F. Zhou, Z. Yu, D. Ni, S. Chen, T. Wang, and B. Lei, "Dense deconvolutional network for skin lesion segmentation," IEEE Journal of Biomedical and Health Informatics, 2018.

[8] Y. Yuan and Y. Lo, "Improving dermoscopic image segmentation with enhanced convolutional-deconvolutional networks," IEEE Journal of Biomedical and Health Informatics, 2018.

[9] L. Bi, J. Kim, E. Ahn, A. Kumar, M. Fulham, and D. Feng, "Dermoscopic image segmentation via multistage fully convolutional networks," IEEE Transactions on Biomedical Engineering, vol. 64, no. 9, pp. 2065–2074, Sept 2017.

[10] Z. Mirikharaji and G. Hamarneh, "Star shape prior in fully convolutional networks for skin lesion segmentation," in MICCAI, vol. 11073, 2018, pp. 737–745.

[11] M. M. K. Sarker, H. A. Rashwan, F. Akram, S. F. Banu, A. Saleh, V. K. Singh, F. U. H. Chowdhury, S. Abdulwahab, S. Romani, P. Radeva, and D. Puig, "SLSDeep: Skin lesion segmentation based on dilated residual and pyramid pooling networks," in MICCAI, vol. 11071, 2018, pp. 21–29.

[12] J. Dai, K. He, and J. Sun, "BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation," in ICCV, 2015, pp. 1635–1643.

[13] M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers, "Normalized cut loss for weakly-supervised CNN segmentation," in CVPR, 2018.

[14] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei, "What's the point: Semantic segmentation with point supervision," in ECCV, vol. 9911, 2016, pp. 549–565.

[15] S. Hong, H. Noh, and B. Han, "Decoupled deep neural network for semi-supervised semantic segmentation," in NIPS, 2015, pp. 1495–1503.

[16] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang, "Revisiting dilated convolution: A simple approach for weakly- and semi-supervised semantic segmentation," in CVPR, 2018.

[17] T. Shen, G. Lin, C. Shen, and I. Reid, "Bootstrapping the performance of webly supervised semantic segmentation," in CVPR, 2018.

[18] A. Kolesnikov and C. H. Lampert, "Seed, expand and constrain: Three principles for weakly-supervised image segmentation," in ECCV, 2016, pp. 695–711.

[19] F. Peruch, F. Bogo, M. Bonazza, V. Cappelleri, and E. Peserico, "Simpler, faster, more accurate melanocytic lesion segmentation through MEDS," IEEE Transactions on Biomedical Engineering, vol. 61, no. 2, pp. 557–565, Feb 2014.

[20] D. Patino, J. Avendano, and J. W. Branch, "Automatic skin lesion segmentation on dermoscopic images by the means of superpixel merging," in MICCAI, vol. 11073, 2018, pp. 728–736.

[21] Y. Wei, J. Feng, X. Liang, M. Cheng, Y. Zhao, and S. Yan, "Object region mining with adversarial erasing: A simple classification to semantic segmentation approach," in CVPR, 2017, pp. 6488–6496.

[22] F. Yu, V. Koltun, and T. Funkhouser, "Dilated residual networks," in CVPR, 2017, pp. 636–644.

[23] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in ECCV, 2018.

[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014, pp. 740–755.

[25] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vision, vol. 88, no. 2, pp. 303–338, Jun. 2010.

[26] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in CVPR, 2017, pp. 1800–1807.

[27] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009, pp. 248–255.

[28] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in CVPR, 2016, pp. 2921–2929.

[29] F. Milletari, N. Navab, and S. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation," in 3DV, 2016, pp. 565–571.

[30] N. C. F. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. K. Mishra, H. Kittler, and A. Halpern, "Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC)," arXiv preprint, 2017.

[31] T. Mendonça, P. M. Ferreira, J. S. Marques, A. R. S. Marcal, and J. Rozeira, "PH2 - a dermoscopic image database for research and benchmarking," in EMBC, 2013, pp. 5437–5440.

[32] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.