
Convolutional Neural Network based

Medical Imaging Segmentation: Recent

Progress and Challenges

Jiaxing Tan

A literature review submitted to

the Graduate Faculty in Computer Science in partial fulfillment of

the requirements for the degree of Doctor of Philosophy,

The City University of New York

Committee Members:

Dr. Yumei Huo (Advisor)

Dr. Shuqun Zhang

Dr. Sos Agaian

Dr. Yingli Tian

January 14, 2018

© Copyright by Jiaxing Tan, 2018.

All rights reserved.

Abstract

A key research topic in Medical Image Analysis is image segmentation. Traditionally, this task is solved by methods based on hand-engineered features, which can be highly dataset-dependent. Convolutional Neural Networks (CNNs) have shown great success in many areas, especially computer vision. Unlike classification based on hand-engineered features, a CNN classifies using features learned from the data itself. Recently, progress has been made in CNN based image segmentation, which casts light on the area of medical imaging. This survey gives a brief introduction to CNN based medical image segmentation. Three categories of methods are discussed: CNN based methods, Encoder-Decoder based methods and Generative Adversarial Network based methods. For each category, a brief introduction is given first, followed by a timeline of how the models evolved, and then some recent progress in medical imaging is introduced.

Besides some technical details, we also introduce some available public packages for fast development and some public data sources.


Contents

Abstract
0.1 INTRODUCTION
0.2 Convolution Neural Network
    0.2.1 What is Convolutional Neural Network
    0.2.2 Some available Packages
    0.2.3 Conclusion
0.3 Some Recent CNN based methods
    0.3.1 2D CNN based method
    0.3.2 3D CNN based lung nodule detection
    0.3.3 Holistic CNN Model
    0.3.4 Conclusion
0.4 Encoder-Decoder Based CNN Structure
0.5 Recent Progress on the Encoder-Decoder Structure Based Segmentation
    0.5.1 A timeline for such models
    0.5.2 Conclusion
0.6 Generative Adversarial Network
0.7 GAN Based Segmentation models
    0.7.1 GAN for segmentation
    0.7.2 Conclusion
0.8 Some Challenges
    0.8.1 Data Source
    0.8.2 Data Preparation
    0.8.3 Some Other Challenges
0.9 Conclusion

Figure 1: An example of semantic segmentation from the famous VOC dataset. Left: input image. Right: semantic segmentation result.

0.1 INTRODUCTION

One key research topic in Medical Imaging is image segmentation. Image segmentation, or semantic segmentation, is a pixel-level image understanding task: it performs a pixel-by-pixel classification to decide the class of each pixel. Figure 1 shows an example from the famous VOC [1] dataset, where the input image is on the left and the semantic segmentation result is on the right. In this example, each pixel is labeled as background, motorbike or rider.

In the context of Medical Imaging, such segmentation methods can be used to solve problems such as nodule detection, anomaly detection and organ segmentation. Let's take lung cancer detection as an example to show the importance of image segmentation in medical imaging, as lung cancer is the leading cause of cancer-related deaths. According to statistics provided by the American Cancer Society [2], lung cancer was estimated to cause 158,080 deaths in the United States in 2016. Early detection is the key to treating lung cancer in time and thus to a sharp increase in the survival rate. A popular detection tool, computed tomography (CT), has been analyzed subjectively by radiologists. The anticipated large amount of interpretation effort


Figure 2: An example of the segmentation task in the context of medical imaging. The goal is to detect the location of lung nodules given a lung CT scan.

demands a computer-aided detection (CAD) scheme to help radiologists diagnose lung cancer efficiently. As a result, many automated CAD methods based on nodule segmentation have been developed. Moreover, lung volume segmentation is often performed as the first step in other automated CAD methods and in lung disease diagnosis. Thus, a robust and high-quality image segmentation method could greatly reduce the cost of operation and shorten the waiting time, not only for lung CT but also for various other types of medical image analysis.

However, medical images are often noisy and of low quality, which makes segmentation much harder. Figure 2 shows some examples of lung CT slices with the nodules masked in red. The nodules are quite small compared to the whole image, so the detection task is as hard as finding a needle in a haystack [5].


Traditionally, medical image segmentation is performed by classification based on hand-engineered features. One common attribute of such methods is that empirical "magic numbers" are used for thresholding and preprocessing; see the methods proposed by Shen et al. [6] and Duggan et al. [7] for examples. There remains a risk that those empirical values are dataset-specific, which might prevent the model from generalizing to different datasets. As the final goal of medical imaging is to build CAD systems that serve a large population, a robust and adaptive model is highly desirable. In this respect, the Convolutional Neural Network, which can learn features by itself with little or no empirical prior, casts some light on the medical imaging area.

The Convolutional Neural Network (CNN) has shown great success in computer vision on the ImageNet challenge. Since AlexNet [9] was proposed, performance has improved almost every year, with deeper and deeper structures supported by high performance computing facilities. Compared with classifiers based on manually selected features, a CNN learns features from the data itself, which turns out to be more efficient and automatic [11]. Progress has also been made in using CNNs to solve real-life problems in medical imaging. In recent medical imaging segmentation competitions, such as the Kaggle lung cancer detection competition [3] and the LUNA16 Challenge [4], the top-ranked teams all used CNNs.

In this survey, we aim to give a brief introduction to what is happening in the area of CNN based medical image segmentation, using typical methods as examples. As CNNs have developed into different sub-types, we discuss CNN based medical segmentation methods in three categories: CNN based methods, Encoder-Decoder based methods and Generative Adversarial Network based methods. For each category, we first briefly introduce the technical points of the model design with some related papers from computer vision, and then illustrate the corresponding model designs and literature in medical imaging. We then provide lists of available public packages for fast development and some public datasets (specifically for lung cancer detection). Finally, we discuss some challenges and possible data preparation methods, and reach a conclusion.

The rest of this survey is organized as follows: In section 2, we give a general picture of CNNs with a list of commonly used public packages. We then introduce the three categories of CNN based medical segmentation methods in the sections that follow. In section 8, we describe some challenges and their possible solutions. Finally, the conclusion is given in section 9.


0.2 Convolution Neural Network

In this section, we give a brief introduction to the Convolutional Neural Network, based on the example shown in Figure 3. A list of publicly available packages is then provided for fast development.

0.2.1 What is Convolutional Neural Network

Convolutional Neural Networks are similar to ordinary deep neural networks in that they are made up of neurons with learnable weights and biases. The difference is that, to reduce the number of connections, a CNN shares weights across different locations. Such weight sharing turns out to be equivalent to the convolution operation in signal processing. More commonly, the CNN is introduced as a deep neural network inspired by biological studies of the visual cortex. A CNN is built from four types of layers: an input layer, multiple convolution layers, each followed by a subsampling layer, and several fully connected layers at the end [8].

A CNN example is shown in figure 3. It contains an input layer of size 256 × 256 × 1 and two convolution layers, each followed by a max-pooling layer. The first convolution layer has 32 filters of size 7 × 7, and the second has 64 filters of size 5 × 5. The convolution layers are followed by two fully connected layers: the first has 128 neurons and the second has 2 neurons. A softmax is applied to the last layer to normalize the outputs to the range [0, 1]. We introduce each type of layer below using this example.

Input Layer The input layer reads data of a predefined size without making any changes to it. In figure 3, the input layer reads in a CT scan image of size 256 × 256. Note that the nodes of the input layer are passive, meaning that they do not modify the data [24]: each receives a single value on its input and passes that value on to the hidden layers.


Figure 3: An example of a CNN with 2 convolution layers and 2 fully connected layers. Each convolution layer is followed by a pooling layer.

In practice, some packages, such as Caffe, TensorFlow and Theano, require the input size to be specified through the input layer. Others, such as PyTorch, can adapt to the input data without such a specification. However, the test data and the training data must be of the same size, as different input sizes would result in different numbers of parameters in the fully connected layers.

Convolution layer A convolution layer has k different kernels, each of shape m × n, and each kernel performs a convolution operation (denoted as ∗) on each of the j sub-images of the input image i. A bias b is added to the convolution result, and a non-linear function g is then applied. The whole procedure is shown in equation 1.

$$f = g\big((W_i \ast i_{1:j}) + b\big) \tag{1}$$

The output of the convolution layer is k feature maps, each generated by convolving one kernel over the whole image. In figure 3, there are 2 convolution layers: conv1 has 32 kernels, each of size 7 × 7, and conv2 has 64 kernels, each of size 5 × 5.
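As a minimal sketch of equation 1, the following pure-Python code slides a single kernel over a single-channel image. It assumes no padding and a stride of 1, and, like most deep learning packages, it does not flip the kernel. The names `conv2d_valid` and `relu` are illustrative and do not come from any of the packages discussed here.

```python
def relu(x):
    # Non-linear function g used in this sketch.
    return max(0.0, x)


def conv2d_valid(image, kernel, bias=0.0, g=relu):
    """Apply one m x n kernel to a 2D image (no padding, stride 1)."""
    m, n = len(kernel), len(kernel[0])
    rows, cols = len(image), len(image[0])
    feature_map = []
    for r in range(rows - m + 1):
        out_row = []
        for c in range(cols - n + 1):
            # Weighted sum over the sub-image currently under the kernel.
            s = sum(kernel[a][b] * image[r + a][c + b]
                    for a in range(m) for b in range(n))
            # Add the bias, then apply the non-linear function.
            out_row.append(g(s + bias))
        feature_map.append(out_row)
    return feature_map


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_valid(image, kernel, bias=5.0)
```

A layer with k kernels would simply repeat this computation k times, giving the k feature maps described above.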

Besides the 2D convolution layer, some tasks deal with 3D data, such as spatial convolution over volumes, and need 3D convolution. A 3D convolution layer is therefore also available in most current packages.


One additional type of convolution layer that has drawn a lot of research interest is the transpose convolution layer, also known as the deconvolution layer. The deconvolution layer was first proposed by Zeiler et al. [12] and can be thought of as a reverse operation of convolution. This idea was extended to build an Encoder-Decoder structure [13] that takes advantage of the reverse operation, which we will discuss later in the corresponding section. Like the convolution operation, the transpose convolution operation has 2D and 3D versions.
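To make the "reverse operation" more concrete, here is a small one-dimensional sketch of a transposed convolution. It is an illustration under simplifying assumptions (one channel, no padding), not the formulation of [12]: each input value scatters a scaled copy of the kernel into the output, so the output is longer than the input. The function name `conv_transpose1d` is hypothetical.

```python
def conv_transpose1d(x, kernel, stride=1):
    """1D transposed convolution, single channel, no padding."""
    k = len(kernel)
    # Output length grows with the stride: (len(x) - 1) * stride + k.
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            # Each input value adds a scaled copy of the kernel to the output.
            out[i * stride + j] += value * weight
    return out


# Two input values become four output values with stride 2.
upsampled = conv_transpose1d([1.0, 2.0], [1.0, 1.0], stride=2)
```

With a stride of 2 the output is roughly twice as long as the input, which is why this layer is a natural building block for the up-sampling half of an Encoder-Decoder structure.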

Subsampling Layer A subsampling layer, following a convolution layer, performs a down-sampling operation on the feature maps generated by that convolution layer. There are several ways to perform subsampling, such as average-pooling, median-pooling, global average pooling and max-pooling. In figure 3, the network uses the max-pooling strategy.
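A minimal sketch of max-pooling follows. The window size of 2 and the non-overlapping windows are assumptions made for illustration, since the pooling size of the network in figure 3 is not stated; `max_pool2d` is a hypothetical helper name.

```python
def max_pool2d(feature_map, size=2):
    """Max-pooling over non-overlapping size x size windows."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[r + a][c + b]
             for a in range(size) for b in range(size))
         for c in range(0, cols - size + 1, size)]
        for r in range(0, rows - size + 1, size)
    ]


pooled = max_pool2d([[1, 3, 2, 0],
                     [4, 2, 1, 1],
                     [0, 0, 5, 6],
                     [1, 2, 7, 8]])
```

Each 2 × 2 window is reduced to its largest value, so a 4 × 4 feature map becomes a 2 × 2 one. Average-pooling would replace `max` with the mean of the window.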

Besides pooling based sampling, a newer trend is to perform the (down)sampling in the convolution layer itself by using a stride larger than 1. To explain this further, let's take the 2D convolution operation provided by PyTorch [32] as an example. Its output size is given by equation 2:

$$W_{out} = \left\lfloor \frac{W_{in} + 2 \times \text{padding} - \text{dilation} \times (\text{kernel\_size} - 1) - 1}{\text{stride}} + 1 \right\rfloor \tag{2}$$

where Wout is the output size, Win is the input size, padding is the size of the zero-padding added to both sides of the input, dilation is the spacing between kernel elements, and stride is the stride of the convolution. We can see that setting stride to 2 is equivalent to down-sampling by a factor of 2. So a new down-sampling method is to replace the pooling layer with a convolution layer with filter size 1 × 1 and stride 2.
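The output size formula of equation 2 can be checked with a few lines of Python. This is a plain arithmetic sketch: `conv_output_size` is a hypothetical helper that mirrors the formula and does not call PyTorch itself.

```python
import math


def conv_output_size(w_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output size of a convolution along one dimension (equation 2)."""
    return math.floor(
        (w_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1
    )


# A 7 x 7 kernel with stride 1 and no padding shrinks 256 to 250.
after_conv1 = conv_output_size(256, kernel_size=7)

# A 1 x 1 kernel with stride 2 halves the size, like a pooling layer would.
halved = conv_output_size(256, kernel_size=1, stride=2)
```

The second call illustrates the point made above: with stride 2 the output is half the size of the input, so the convolution layer also takes over the role of the subsampling layer.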


Fully Connected Layer A fully connected layer is the same as a layer in a traditional Multi-Layer Perceptron (MLP). Given an input i, it performs the operation shown in equation 3, where b denotes the bias. The last layer of a CNN model is usually a fully connected layer that serves as the output layer, in which the number of neurons equals the number of classes in the classification task.

$$f = g\big((W_i \times i_{1:j}) + b\big) \tag{3}$$

Non-linear Function A non-linear function g is used in both convolution and fully connected layers. Functions such as Tanh, Sigmoid and ReLU are often used. For the output layer, i.e. the last layer of a CNN, a softmax function is commonly used.
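As a short illustration of the softmax function used in the output layer, the sketch below turns raw output scores into values in [0, 1] that sum to 1. Subtracting the maximum score before exponentiation is a common numerical-stability habit added here, not something stated in the text.

```python
import math


def softmax(scores):
    """Normalize raw output scores into values in [0, 1] that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two output neurons, as in the example network of figure 3.
probs = softmax([2.0, 0.5])
```

The larger score receives the larger share, so the outputs can be read as confidence scores for the two classes.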

Here we list some commonly used non-linear functions. The first one is the "old-fashioned" sigmoid function, shown in equation 4.

$$\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{4}$$

ReLU, first used by Glorot et al. [15], is one of the most commonly used non-linear functions today. It is defined in equation 5.

$$\text{ReLU}(x) = \max(0, x) \tag{5}$$

Currently, some variants of ReLU have also been proposed and used in the newest models. For example, LeakyReLU, which allows a small, non-zero gradient when the unit is not active, was proposed by Maas et al. [16]. LeakyReLU is defined in equation 6. This design can alleviate potential problems caused by ReLU, which sets all negative values to 0.


$$\text{LeakyReLU}(x, \alpha) = \max(x, 0) + \alpha \times \min(x, 0) \tag{6}$$
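The three functions in equations 4 to 6 translate directly into Python. This is a sketch for scalar inputs; the default value of `alpha` below is an assumption chosen for illustration and is not prescribed by [16].

```python
import math


def sigmoid(x):
    # Equation 4.
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    # Equation 5: negative inputs are set to 0.
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    # Equation 6: negative inputs are scaled by alpha instead of zeroed.
    return max(x, 0.0) + alpha * min(x, 0.0)


negative_input = -2.0
relu_out = relu(negative_input)
leaky_out = leaky_relu(negative_input, alpha=0.1)
```

For a negative input, ReLU returns 0 while LeakyReLU returns a small negative value, which is what keeps a non-zero gradient flowing through inactive units.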

0.2.2 Some available Packages

One advantage of CNN is that several public packages are available. Instead of

building your CNN from scratch, you could try the packages listed below:

• Caffe: http://caffe.berkeleyvision.org/

• Torch: http://torch.ch/

• Theano: http://deeplearning.net/software/theano/

• Lasagne: https://lasagne.readthedocs.io/en/latest/

• TensorFlow: https://www.tensorflow.org/

• Keras: https://keras.io/

• Caffe2: https://caffe2.ai/

• PyTorch: http://pytorch.org/

• Mxnet: http://mxnet.io/

All these packages are publicly available online with multi-platform and GPU support, which reduces the workload of implementing a CNN based task. The packages have also been optimized for efficiency.

The choice from this list, in my opinion, should be based on code support. Typical models, such as VGG-Net [20] or ResNet [22], have been pretrained on different platforms, so for them the choice of package makes little difference. But some new models have only been implemented in one or a few packages. In this situation, you should choose the package in which the model was implemented, to guarantee performance. Moreover, note that the maintenance of Theano has been stopped.

0.2.3 Conclusion

In this section, we described the general design of a Convolutional Neural Network and briefly discussed the four types of layers used by CNNs, along with some new trends. We also listed some commonly used packages for model development.


0.3 Some Recent CNN based methods

In this section, we present some typical recent methods for CNN based lung nodule detection. Various CNN designs have been proposed in the literature. Most of the methods reviewed here are bounding box based: the general idea is to infer the class label of the center pixel(s) from the neighboring pixels. We classify the methods into three groups: 2D CNN based, 3D CNN based and Holistic CNN based.

The 2D and 3D groups differ in whether a 2D or a 3D neighborhood is considered when classifying the center pixel(s). As a complement to the bounding box methods, we also introduce methods that do not use a bounding box, which are called holistic in some of the medical imaging literature. We introduce these three groups in turn in this section.

0.3.1 2D CNN based method

The main idea of 2D CNN based segmentation is to decide the label of the center pixel(s) from its 2D neighborhood. The input is a 2D image slice, and the output is the predicted label of the center pixel(s). Take the work of Sermanet et al. [34] as an example, illustrated in Figure 4. At each location, an image slice is generated to predict the label of its center pixel. Given an input image slice, the model outputs a predicted label with a confidence score.

To increase the prediction accuracy of such bounding box based methods, voting algorithms such as multi-view voting have been applied when predicting for a single input. Krizhevsky et al. [35] applied multi-view voting to boost performance: the predictions over a fixed set of 10 views (4 corners and center, each with its horizontal flip) are averaged. To explore this idea further, [34] introduces a new type of view by using multiple scales at each location. It is also claimed that while the sliding window approach may be

computationally prohibitive for certain types of model, it is inherently efficient in the


Figure 4: Example from [34]. The classifier/detector outputs a class and a confidence for each location. These bounding boxes are then grouped together to generate the final segmentation mask.

case of ConvNets. Such a multi-scale method further increases the number of views for voting while remaining efficient.
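The 10-view voting described above can be sketched as follows. This is an illustration only: `ten_crop_views` and `average_predictions` are hypothetical helper names, the crop size is a free parameter, and `toy_predict` is a stand-in for a trained CNN rather than any model from [34] or [35].

```python
def ten_crop_views(image, crop):
    """Return 10 views: 4 corner crops and the center crop, each with its flip."""
    rows, cols = len(image), len(image[0])
    mid_r, mid_c = (rows - crop) // 2, (cols - crop) // 2
    origins = [(0, 0), (0, cols - crop), (rows - crop, 0),
               (rows - crop, cols - crop), (mid_r, mid_c)]
    views = []
    for r, c in origins:
        view = [row[c:c + crop] for row in image[r:r + crop]]
        views.append(view)
        views.append([row[::-1] for row in view])  # horizontal flip
    return views


def average_predictions(views, predict):
    """Average the per-class scores returned by predict over all views."""
    scores = [predict(v) for v in views]
    n_classes = len(scores[0])
    return [sum(s[k] for s in scores) / len(scores) for k in range(n_classes)]


def toy_predict(view):
    # Hypothetical stand-in for a trained CNN: two class scores from mean intensity.
    values = [v for row in view for v in row]
    p = min(1.0, sum(values) / (len(values) * 15.0))
    return [p, 1.0 - p]


image = [[r * 4 + c for c in range(4)] for r in range(4)]
views = ten_crop_views(image, crop=2)
final_scores = average_predictions(views, toy_predict)
```

Averaging over several views of the same input smooths out the errors of individual predictions, at the cost of running the classifier once per view.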

Given a set of CT scans of a patient, which usually contains more than 300 slices depending on the body size of the patient, radiologists check the scan slice by slice to detect nodules. For each slice, radiologists observe every sub-region in it. This procedure is performed on 2D slices.

In this procedure, each slice can be viewed as a 2D image and the inspection can be viewed as image classification on sub-regions, so a 2D CNN based classifier can be used to simulate it. After pre-processing steps such as lung segmentation and noise elimination, the slice is cut into small sub-regions. Each sub-region is fed into the CNN, which decides whether the region contains a nodule or not. The combined results show whether a nodule exists in the slice.
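The sub-region pipeline above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the patch size, the stride, and in particular `toy_classifier`, which replaces the trained CNN with a simple intensity threshold.

```python
def sliding_window_detect(slice_2d, patch_size, stride, classify):
    """Return the top-left corners of all patches flagged by classify."""
    rows, cols = len(slice_2d), len(slice_2d[0])
    hits = []
    for r in range(0, rows - patch_size + 1, stride):
        for c in range(0, cols - patch_size + 1, stride):
            # Cut one sub-region out of the slice.
            patch = [row[c:c + patch_size]
                     for row in slice_2d[r:r + patch_size]]
            if classify(patch):
                hits.append((r, c))
    return hits


def toy_classifier(patch):
    # Hypothetical stand-in for a trained CNN: flag bright patches.
    values = [v for row in patch for v in row]
    return sum(values) / len(values) > 0.5


# Toy 6 x 6 slice with one bright 2 x 2 "nodule" in the middle.
slice_2d = [[0.0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        slice_2d[r][c] = 1.0

hits = sliding_window_detect(slice_2d, patch_size=2, stride=2,
                             classify=toy_classifier)
slice_has_nodule = len(hits) > 0
```

The slice-level decision is simply whether any sub-region was flagged, which corresponds to combining the per-region results as described above.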

Most recently, Nima et al. [14] compared the performance of several CNN structures with MTANN on lung nodule detection. Several models, including a shallow CNN, rd-CNN, the famous LeNet [19] and AlexNet [9], are compared in this paper, which gives some guidance on how to design a CNN. Although MTANN outperforms the CNNs, MTANN is a group of neural networks, each in charge of a certain situation, while a CNN has only one network for all situations. It would be very interesting to know whether a group of CNNs, each dealing with a certain situation, would give a better result than MTANN.

We would also like to mention the work of Shin et al. [18], who use transfer learning to take advantage of CNN models pre-trained in computer vision for CT scan analysis. Although they address lung disease detection instead of lung nodule detection, we think this paper is complementary to [14]: in [18] three very deep CNN structures achieve good results, while in [14] only shallow CNN structures perform well. They developed a strategy for transforming the single-channel CT scan into a 3-channel image, so that models pretrained in computer vision can be used instead of training a model from scratch. We will explain pretraining further in the challenges section. They experiment on three famous CNN structures and compare variants with different numbers of parameters and different training strategies: training the model from scratch, transfer learning and "off-the-shelf" use. The results show that the performance of a very deep CNN can be limited by the size of the dataset. When using a deep CNN from computer vision, reducing the number of parameters could possibly increase the performance. The idea of transfer learning thus provides a way to use pretrained models for CT scan analysis.


Figure 5: Feature embedding visualizations of Imagenet [10] pretrained with Caffe and C3D [21] on UCF101 dataset using t-SNE [28]. C3D features are semantically separable compared to Imagenet, suggesting that it is a better feature for videos. Each clip is visualized as a point and clips belonging to the same action have the same color. Best viewed in color.

0.3.2 3D CNN based lung nodule detection

Traditionally, a CNN takes a 2D matrix as input. However, sometimes a 3D neighborhood is needed to make predictions. Stimulated by this need, some recent publications in computer vision introduce 3D CNNs. Here I briefly review two papers that originated this idea. Du Tran et al. [21] proposed the C3D model, which applies a 3D CNN to video scene recognition in order to exploit the third dimension of video, time, for better performance. To show the benefit of 3D convolution in video action recognition, [21] provides the feature embedding illustrated in figure 5, which compares the feature learning power of the ImageNet [10] baseline with that of C3D. The visualization is made with t-SNE [28], which generates a low-dimensional representation of high-dimensional data. In this visualization, the features learned by C3D are more separable than those of the 2D based model.

To recognize 3D objects, Daniel Maturana et al. [25] applied a 3D CNN, VoxNet, which achieved promising performance. As shown in figure 6, the structure of VoxNet is very straightforward: you can think of it as a CNN in which the 2D convolutions are replaced with 3D ones. One more thing I would like to mention is that VoxNet publishes its benchmark


Figure 6: Architecture of voxnet model.

dataset for 3D object recognition. Its website maintains a list of current benchmark models with their scores, which can be used as a reference for validating 3D models.

When radiologists check a scan, besides going through each region of a slice, they also check the same region on the slices before and after the current slice to decide whether there is a nodule inside. This detection procedure takes advantage of the 3D nature of the CT scan, which can also inspire the design of CNN based detection.

In the area of lung nodule detection, the 3D nature of the CT scan makes it reasonable to apply a 3D CNN, and some efforts have been made in this direction. Rushil Anirudh [26] applied a 3D CNN to a weakly labelled lung nodule dataset. He uses a voxel block v̂ as the input to the CNN to decide whether the center point v located at (x, y, z) is a nodule or not. v̂ is defined as (x − w : x + w, y − w : y + w, z − h : z + h), which means that not only the neighbors of v in the same slice but also the neighbors on the previous and following slices are considered in the decision. This design is closer to how a radiologist performs lung nodule detection. A sensitivity of 80% at 10 false positives per scan is reported on the weakly labeled dataset.
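The definition of v̂ can be illustrated by cutting a block out of a small toy volume. The sketch makes several assumptions that are not taken from [26]: the volume is stored as nested lists indexed as `volume[z][y][x]`, the bounds are treated as inclusive, and the center lies far enough from the border. `extract_block` is a hypothetical helper name.

```python
def extract_block(volume, x, y, z, w, h):
    """Return the sub-volume around (x, y, z).

    Assumed layout: volume[z][y][x]. Bounds are inclusive, so the block
    spans 2w + 1 voxels in-plane and 2h + 1 slices.
    """
    return [
        [row[x - w:x + w + 1] for row in plane[y - w:y + w + 1]]
        for plane in volume[z - h:z + h + 1]
    ]


# Toy 5 x 5 x 5 volume whose voxel value encodes its own (z, y, x) position.
volume = [[[100 * z + 10 * y + x for x in range(5)]
           for y in range(5)]
          for z in range(5)]

block = extract_block(volume, x=2, y=2, z=2, w=1, h=1)
center_value = block[1][1][1]
```

The block contains the center voxel together with its neighbors on the same slice and on the slices before and after it, which is the input a 3D CNN would classify.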

In [27], Golan et al. developed a 3D CNN based lung nodule detector that uses votes to locate nodules. Normally in a detection procedure, the type of each pixel in the scan is decided only once. In their work, the detection result for each pixel is obtained by combining multiple votes. The votes are generated by sliding windows in 3D space: each sliding window provides one vote for all the pixels inside it, based on its classification result, so that all the pixels inside are considered to have the same type as the classification result. In this way, a single pixel receives multiple votes from different sliding windows. The final decision is made by comparing the total votes from the different sliding windows with a predefined threshold. Such a strategy can reduce the prediction error by ensembling multiple detection results.
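The voting scheme can be sketched as a vote counter over a 3D grid. This is a simplified illustration and not the implementation of [27]: the window size, the stride and the threshold are free parameters, and `classify_window` is a hypothetical callback standing in for the 3D CNN.

```python
def vote_map(shape, window, stride, classify_window):
    """Count, for every voxel, how many positive windows contain it."""
    depth, height, width = shape
    d, h, w = window
    votes = [[[0] * width for _ in range(height)] for _ in range(depth)]
    for z in range(0, depth - d + 1, stride):
        for y in range(0, height - h + 1, stride):
            for x in range(0, width - w + 1, stride):
                if classify_window(z, y, x):
                    # Every voxel inside a positive window receives one vote.
                    for dz in range(d):
                        for dy in range(h):
                            for dx in range(w):
                                votes[z + dz][y + dy][x + dx] += 1
    return votes


def threshold_votes(votes, min_votes):
    """Final decision: keep voxels with at least min_votes votes."""
    return [[[v >= min_votes for v in row] for row in plane]
            for plane in votes]


# Toy case: a single 3 x 3 slice, 2 x 2 windows, every window votes positive.
votes = vote_map(shape=(1, 3, 3), window=(1, 2, 2), stride=1,
                 classify_window=lambda z, y, x: True)
mask = threshold_votes(votes, min_votes=3)
```

In the toy case the central voxel is covered by four windows and the corners by only one, so only the center passes a threshold of 3. Voxels supported by many overlapping windows are kept, while isolated positive windows are filtered out.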

0.3.3 Holistic CNN Model

In contrast to the bounding box based models discussed in the previous two parts, here I review literature that makes predictions based on the whole picture without cutting it into small slices. The term "Holistic" is used in the Medical Imaging area because the input is the whole image.

As the output of a CNN is a vector of classification results, a good design of the classes could allow detection on the whole image at once. Recently, YOLO [30] proposed performing object detection with only one scan of the image. Given an input image, the output of the CNN gives the type and the location, in the form of a boundary, of each object. As shown in figure 7, it takes a whole image, and the output is the object class and position, and a bounding box is generated to show the location


Figure 7

of the object. This is the first paper that uses a CNN to generate such information all at once. The difference between this CNN and a traditional CNN is that the bounding box size and object location are also represented by neurons in the output layer. It is remarkable that the network can learn how to represent such information. The shortcoming is that such a complex design requires a huge amount of data. In my experiments, performance was poor when too little data was available.

In the area of medical imaging, Mingchen Gao et al. [29] applied holistic classification to lung CT scans to detect 6 different kinds of diseases. Although the task is lung disease detection rather than lung cancer detection, this paper casts some light on using a different pipeline to perform nodule detection. Their method takes the whole lung CT image as input and outputs whether the patient is healthy or has some kind of lung disease. It reaches a decent result in their experiments. Besides efficiency, another advantage of taking the whole CT scan as input is that noise has less effect, as more information is available.

0.3.4 Conclusion

CNN based lung nodule detection can be thought of, in Computer Vision terms, as an image segmentation problem or an object detection problem. In recent medical imaging publications, most papers use a bounding box to make a decision pixel by pixel.


All the methods mentioned in the first two parts follow the same pipeline: the detection result for a given slice is based on the results of a group of sub-tasks, each of which performs nodule detection in one region of the slice using a sliding window.

For the bounding box based method, in my opinion:

Pro:
1. The input size is small, so more accurate results can be generated.
2. Working pixel by pixel makes it possible to mitigate the effect of false positives with a post-processing method.

Con:
1. It is time consuming, which makes real-time results hard to achieve.
2. In Computer Vision, the new trend is to operate on the whole picture at once instead of pixel by pixel.

Then there is the question of whether we can perform nodule detection on the whole image and obtain the detection result without dividing it into sub-tasks. Holistic detection methods have been studied in both Computer Vision and Medical Imaging. Limited by the fact that the output of a CNN is a vector of predicted confidence scores for each class, the only trick is to redefine the classes so that they represent locations in the whole image. A natural idea is to overcome this limitation, which leads to the methods discussed in the next section. I will also mention some new trends in medical imaging.


0.4 Encoder-Decoder Based CNN Structure

An auto-encoder is an artificial neural network used for unsupervised learning of efficient codings [46]. It contains an Encoder part and a Decoder part. The Encoder down-samples the input to reach a low-dimensional representation of the original data. The Decoder then up-samples the low-dimensional representation to recover the original information.

Auto-encoders have already shown success in image segmentation [47]. Nowadays, there is a trend to design CNNs in the form of an Encoder-Decoder structure. The Encoder part is the same as the convolution layers of a CNN, since the convolution layers can be viewed as a feature extractor with the power of down-sampling. The Decoder can be viewed as the reverse operation of convolution, so the deconvolution layer mentioned before and up-sampling layers are used to recover the original size from the low-dimensional representation.

The idea of the deep convolutional Encoder-Decoder structure, to the best of my knowledge, originated from [12], which discusses how to reverse the output of a convolution layer back to its input. The proposed method is shown in figure 8.

Another step was then made by Zeiler et al. [13], who proposed a symmetric structure trained with a reconstruction loss. After that, people started to work on deep convolutional Encoder-Decoder structures for image segmentation.


Figure 8: The method proposed by Zeiler et al. [12]. The figure shows how to reverse the operation of convolution with ReLU as the activation function followed by down-sampling.

0.5 Recent Progress on the Encoder-Decoder Structure Based Segmentation

0.5.1 A timeline for such models

From the previous discussion, we know that the output layer of a CNN gives classification labels with confidence scores. In the work of Long et al. [31], a new type of CNN called the Fully Convolutional Network (FCN) is proposed. The fully connected layers that traditionally come at the very end of a CNN prevent one-pass prediction over the whole image, so a structure based purely on convolution layers is used instead. An upsampling operation is then applied to the last few extracted low-dimensional features to recover a segmentation mask of larger size. As shown in figure 9, a cat


Figure 9: Design of fcn

Figure 10: Design of Deeplab

picture is taken as an image and the segmented result is shown as a mask image. A

FCN is a network only has Convolution Layers and Pooling layers without any fully

connected layers. Transforming fully connected layers into convolution layers enables

a classification net to output a heatmap.
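One way to see why replacing the fully connected layer yields a heatmap: a fully connected classifier on a C-dimensional feature vector performs the same computation as a 1x1 convolution with the same weights, applied at every spatial position. The sketch below assumes a hypothetical feature map indexed as `feature_map[channel][row][col]` and a single output class; the names are illustrative and not taken from [31].

```python
def fc_score(feature_vector, weights, bias):
    """Fully connected layer with one output: a dot product plus bias."""
    return sum(f * w for f, w in zip(feature_vector, weights)) + bias


def conv1x1_heatmap(feature_map, weights, bias):
    """Apply the same weights as a 1x1 convolution at every position.

    feature_map is indexed as feature_map[channel][row][col].
    """
    channels = len(feature_map)
    rows, cols = len(feature_map[0]), len(feature_map[0][0])
    return [
        [fc_score([feature_map[ch][r][c] for ch in range(channels)], weights, bias)
         for c in range(cols)]
        for r in range(rows)
    ]


# Two channels on a 2x3 grid (made-up values).
feature_map = [
    [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]],
    [[0.0, 1.0, 1.0], [2.0, 0.0, 1.0]],
]
weights, bias = [1.0, -1.0], 0.5
heatmap = conv1x1_heatmap(feature_map, weights, bias)
print(heatmap)
```

Each entry of `heatmap` is the class score at one location, so the output keeps the spatial layout of the input instead of collapsing to a single label.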

Another new trend is to ask the CNN to draw a picture of the segmentation result for the image. This idea originated with Chen et al. [33], as shown in figure 10. Based on a pre-trained VGGnet, given an input image of a dog, the segmented result is produced as an image mask by a Decoder in a symmetric structure. Although its performance is not as good as that of the papers that followed this brilliant idea, this design was the first to cast light on this path.

We can see that [33] gives an Encoder-Decoder structure. The encoder consists of convolution and pooling operations, while the decoder is based on deconvolution and upsampling. The idea of building such a decoder was first raised in [36], where the authors applied these methods to visualize what is happening in each layer of a CNN. Deconvolution and upsampling are not an exact reverse operation, but they can roughly go back a step.

Figure 11: PSPnet result comparison

Figure 12: Design of PSPnet

Furthermore, this direction has developed fast recently. PSPNet [37], which is designed to perform image segmentation on natural scenes, has been proposed to further reuse different levels of features. A comparison with other networks is shown in figure 11. The boundaries and detection results of PSPNet are smoother and more accurate.

The trick of PSPNet is the fusion of different levels of features, i.e. the feature maps acquired by each convolution layer. As shown in figure 12, feature maps of different sizes are supposed to carry features of different levels. A combination of all these features can outperform the previous designs. However, how to design and fuse these layers is barely explained. The code published on GitHub is written in Caffe and contains 2000 lines of model definition, which makes it hard for people to understand.


Figure 13: Design of U-net

One important paper to mention is U-net [38], which is very well known in medical image segmentation and serves as a benchmark. The detailed design is shown in figure 13. Although it was trained on a small dataset, U-net has proved very powerful in many other scenarios. Borrowing from the Residual Network, each module in the figure is a residual block containing several convolution layers and a sampling operation. The novel idea here is that U-net does not only build an auto-encoder: it also combines the output of the corresponding encoder layer with the input of the decoder layer, and feeds the result as a new input into the next decoder layer. This design may look surprising, but a series of related networks has shown it to be very useful. The U-net model has been widely applied and has proved successful in the area of medical imaging, in applications such as lumbar surgery [48] and gland segmentation [49].
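The combination of encoder output and decoder input described above amounts to a channel-wise concatenation. A minimal sketch, assuming feature maps are stored as lists of channels and that the encoder map and the up-sampled decoder map already share the same spatial size (the original U-net crops the encoder map to achieve this); `concat_skip` and `make_map` are illustrative names.

```python
def concat_skip(encoder_features, decoder_features):
    """Concatenate encoder and decoder feature maps along the channel axis.

    Both inputs are lists of channels; each channel is a 2D list of equal size.
    """
    enc_shape = (len(encoder_features[0]), len(encoder_features[0][0]))
    dec_shape = (len(decoder_features[0]), len(decoder_features[0][0]))
    if enc_shape != dec_shape:
        raise ValueError("spatial sizes must match before concatenation")
    return encoder_features + decoder_features


def make_map(channels, rows, cols, value):
    """Build a constant feature map with the given number of channels."""
    return [[[value] * cols for _ in range(rows)] for _ in range(channels)]


encoder_out = make_map(channels=64, rows=8, cols=8, value=1.0)  # saved on the way down
decoder_in = make_map(channels=64, rows=8, cols=8, value=0.5)   # up-sampled on the way up
merged = concat_skip(encoder_out, decoder_in)
print(len(merged))  # number of channels fed to the next decoder layer
```

The next decoder layer therefore sees both the coarse, up-sampled features and the finer encoder features of the same resolution.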

One step further, I would like to mention LinkNet [40]. This is a very new paper, currently available only on arXiv. The network architecture is shown in figure 14. In nature, it is very similar to U-net, but it is designed for efficient semantic segmentation. The input of each encoder layer is also bypassed to the output of its corresponding decoder. By doing this, the model aims at recovering lost spatial information that can be used by the decoder and its upsampling operations. Such a design can reduce the number of parameters and be more efficient.

Figure 14: Structure of Linknet

0.5.2 Conclusion

In this section, we went through the timeline of Encoder-Decoder based segmentation network structures. As we can see, two trends have emerged in the development of Encoder-Decoder based models. The first is the reuse of features to gain more information for segmentation. The second is the reduction of parameters to speed up the model.


Figure 15: General Framework of GAN. GAN usually consists of two networks: the Generator and the Discriminator. The Generator is trained to generate fake data samples from noise, while the Discriminator network tries to discriminate fake data from the real data.

0.6 Generative Adversarial Network

Generative Adversarial Network (GAN) is a deep generative model proposed by Goodfellow et al. [41], and later improved by DCGAN [42] and WGAN [43]. The general framework is demonstrated in figure 15. There are two networks in a GAN: the generator and the discriminator. The generator network is trained to fool the discriminator network, while the discriminator serves to distinguish whether an image is generated by the generator or is ground truth. The generator and the discriminator are updated in alternation during training.

The goal of the Generator G is to learn a distribution $p_g$ that matches the data distribution by mapping noise $z \sim p_z$ to samples $G(z)$, while the goal of the Discriminator D is to distinguish the real data (i.e. from the real distribution $p_{data}$) from the fake data generated by G. The adversarial part comes from the min-max game between G and D, which is formulated as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \tag{7}$$


where G tries to minimize this objective against an adversarial D that tries to

maximize it.
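To make the objective concrete, the sketch below estimates V(D, G) from a handful of discriminator outputs. It assumes the discriminator outputs are already available as probabilities in (0, 1); the function name `gan_value` and the sample values are illustrative only.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) for real samples, each in (0, 1).
    d_fake: D(G(z)) for generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator pushes V towards 0, its upper bound.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
print(confused, confident)
```

D is rewarded for moving from the first situation to the second, while G is rewarded for pushing D back towards the first, which equals -log 4.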

However, the vanilla GAN suffers from the mode collapse problem due to its loss design. To make the training process more stable, Arjovsky et al. [43] proposed the Wasserstein GAN, which uses the Earth Mover (EM) distance, also called the Wasserstein-1 distance, to evaluate the distance between the real distribution and the fake distribution. Specifically, given the real distribution $p_{data}$ and the distribution $p_g$ of the generated samples $G(z)$ with $z \sim p_z$, and samples $x \sim p_{data}$ and $y \sim p_g$, the Wasserstein-1 distance is defined as:

$$W(p_{data}, p_g) = \inf_{\gamma \in \Pi(p_{data}, p_g)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|], \tag{8}$$

where $\Pi(p_{data}, p_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $p_{data}$ and $p_g$. The term $\gamma(x, y)$ can be viewed as the amount of mass moved from x to y in order to transform the distribution $p_{data}$ into the distribution $p_g$, so the Wasserstein-1 distance is the cost of the optimal transport plan. Under this design, the loss for the G network is:

$$L_G = -\mathbb{E}_{z \sim p_z}[D(G(z))] \tag{9}$$

And the loss for the D network is:

$$L_D = \mathbb{E}_{z \sim p_z}[D(G(z))] - \mathbb{E}_{x \sim p_{data}}[D(x)] \tag{10}$$
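The following sketch illustrates both ideas on toy numbers. `wasserstein_1d` relies on the fact that, for two equal-size one-dimensional samples, the optimal transport plan simply pairs the sorted values; the two loss helpers compute sample averages of the critic scores. All function names and inputs are illustrative assumptions, and the critic scores are plain numbers rather than outputs of a trained network.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D empirical samples.

    For equal-size 1D samples the optimal coupling pairs the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


def critic_loss(d_real, d_fake):
    """L_D: mean critic score on fake samples minus mean score on real ones."""
    return sum(d_fake) / len(d_fake) - sum(d_real) / len(d_real)


def generator_loss(d_fake):
    """L_G: negative mean critic score on generated samples."""
    return -sum(d_fake) / len(d_fake)


real = [0.0, 1.0, 2.0]
fake = [3.0, 5.0, 4.0]
print(wasserstein_1d(real, fake))  # every point has to move by 3
print(critic_loss([1.0, 2.0], [-1.0, 0.0]))
print(generator_loss([-1.0, 0.0]))
```

Unlike the log-based GAN objective, these losses are unbounded critic scores, not probabilities, which is why the discriminator is usually called a critic in this setting.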

0.7 GAN Based Segmentation models

In this section, I will review related GAN based segmentation methods. This is a very new area with great potential, and only a few papers are available at the moment.


Figure 16

0.7.1 GAN for segmentation

Luc et al. [39] proposed a GAN based segmentation method, which is shown in figure 16. The model contains two parts, a segmentor and a discriminator. The segmentor is designed to generate a mask of the natural image, while the discriminator decides whether a given mask is the ground truth mask or a generated one. The quality of the generated mask is evaluated by how well it fools the discriminator. The loss function is similar to that of CGAN, with some modification to consider the relationship between the mask and the original image. Given a data set of N training images $x_n$ and corresponding label maps $y_n$, with $\theta_s$ and $\theta_a$ representing the parameters of the segmentor and the adversarial model, the loss used in this model is shown in equation 11.

$$L(\theta_s, \theta_a) = \sum_{n=1}^{N} L(s(x_n), y_n) - \lambda \left[ \mathrm{bce}(a(x_n, y_n), 1) + \mathrm{bce}(a(x_n, s(x_n)), 0) \right], \tag{11}$$

where $a(x, y) \in [0, 1]$ denotes the scalar probability with which the adversarial model predicts that y is the ground truth label map of x, as opposed to being a label map produced by the segmentation model $s(\cdot)$, and bce denotes the binary cross-entropy.
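As a worked example of equation 11, the sketch below evaluates the objective for given per-image quantities. It assumes the segmentation loss and the adversary outputs have already been computed elsewhere; `hybrid_loss` and its argument names are hypothetical and not taken from [39].

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for one scalar prediction in (0, 1)."""
    return -(target * math.log(prediction)
             + (1 - target) * math.log(1.0 - prediction))


def hybrid_loss(seg_losses, adv_on_truth, adv_on_pred, lam):
    """Evaluate the objective of equation 11 over N training images.

    seg_losses[n]   : segmentation loss L(s(x_n), y_n)
    adv_on_truth[n] : a(x_n, y_n), adversary output for the ground truth map
    adv_on_pred[n]  : a(x_n, s(x_n)), adversary output for the predicted map
    lam             : weight lambda of the adversarial term
    """
    total = 0.0
    for seg, a_true, a_pred in zip(seg_losses, adv_on_truth, adv_on_pred):
        total += seg - lam * (bce(a_true, 1) + bce(a_pred, 0))
    return total


# One image; the adversary is unsure (0.5) about both label maps.
loss = hybrid_loss([0.7], [0.5], [0.5], lam=1.0)
print(loss)
```

The segmentor is trained to decrease this quantity and the adversarial model to increase it, so a well-trained adversary (small bce terms) makes the objective larger for a segmentor whose masks are easy to spot.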

The motivation for their approach is that, with the help of GAN, it can detect and correct higher-order inconsistencies between ground truth segmentation maps and the ones produced by the segmentation net. The experiments show that their adversarial approach leads to improved accuracy on the Stanford Background and PASCAL VOC 2012 datasets.

Souly et al. [44] proposed two neural frameworks for GAN based semi-supervised and weakly supervised learning. The difference from [39] is that the Discriminator, rather than the Generator, is asked to generate the mask. Besides the original K classes of the segmentation task, one extra class, the fake class, is added for the discriminator so that it can decide whether the input is a fake one generated by the Generator. To ensure higher quality of the generated images, and consequently improved pixel classification, the second framework extends the first by adding weakly annotated data. This greatly alleviates the shortage of labelled data.
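The K + 1 class output can be pictured per pixel as a softmax over K real classes plus one fake class. The sketch below is only an illustration of that idea under the assumption that the last logit is the fake class; the function names and the logit values are made up and do not reproduce the network of [44].

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pixel(logits):
    """Interpret K + 1 logits: indices 0..K-1 are real classes, index K is fake.

    Returns (probability that the pixel is fake, most likely real class).
    """
    probs = softmax(logits)
    k = len(logits) - 1
    best_real = max(range(k), key=lambda i: probs[i])
    return probs[k], best_real


# K = 3 segmentation classes plus one fake class.
p_fake, label = classify_pixel([2.0, 0.5, 0.1, -1.0])
print(round(p_fake, 3), label)
```

With this layout the same output serves both purposes: the first K entries give the segmentation label, and the last entry plays the role of the usual real-versus-fake decision.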

In the area of medical imaging, only one paper related to GAN based segmentation has been published recently: SegAN [45], whose workflow is shown in figure 17. Since image segmentation requires dense, pixel-level labeling, the gradient feedback of the original GAN may not be enough for training, an issue that has been mentioned in WGAN. To overcome this, a new loss is proposed: an adversarial critic network is designed with a multi-scale L1 loss function that forces the critic and the segmentor to learn both global and local features, capturing long- and short-range spatial relationships between pixels. The loss proposed in this paper is very similar to the idea of the EM loss used by WGAN.
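A simplified reading of a multi-scale L1 loss is an average of mean absolute differences between features taken at several scales. The sketch below assumes the critic features have already been extracted and flattened for two inputs (the image masked by the predicted mask and by the ground truth mask); it is an illustration of the idea, not the implementation of [45].

```python
def l1_mean(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def multi_scale_l1(features_pred, features_true):
    """Average the L1 distance over feature vectors taken at several scales.

    features_pred[i] / features_true[i]: flattened critic features at scale i
    for the image masked by the predicted / ground truth segmentation.
    """
    per_scale = [l1_mean(p, t) for p, t in zip(features_pred, features_true)]
    return sum(per_scale) / len(per_scale)


# Three hypothetical scales, from fine (4 values) to coarse (1 value).
pred = [[0.0, 1.0, 1.0, 0.0], [0.5, 0.5], [1.0]]
true = [[0.0, 1.0, 0.0, 0.0], [0.5, 1.5], [1.0]]
print(multi_scale_l1(pred, true))
```

Because fine scales react to single-pixel errors and coarse scales to larger structures, one scalar loss carries both local and global feedback.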

0.7.2 Conclusion

In this section, several papers on the topic of GAN based image segmentation are reviewed. This is a new direction in which only three papers could be found. The model structures mostly follow the GAN workflow without much modification. Apparently, there is a lot of meat on this bone.


Figure 17: Structure of SegAN

0.8 Some Challenges

Despite the success of applying CNN to lung nodule detection, challenges remain. In this section, we discuss some challenges presented in the literature and experienced in our own practice. As CNN is a highly data-oriented method, a majority of the challenges lie in the data. We therefore discuss data related challenges in the first two parts and other challenges in the third part.

0.8.1 Data Source

CNN, like some other big data technologies, requires a large enough dataset to learn the classification rules. Unlike the computer vision area, where large and clean benchmark datasets are available, only limited lung nodule datasets are available to the public. Most groups have their own datasets containing different numbers of patients from various sources. Where to get data is therefore a big challenge for deep learning based detection. Here we list some public lung nodule datasets used by recent publications as a reference.


• SPIE-AAPM-LUNGx dataset: a dataset used for a lung challenge, originally to decide whether a nodule is benign or malignant.

• LIDC-IDRI: contains 1018 cases and is the largest public database, founded by the Lung Image Database Consortium and Image Database Resource Initiative. Lung nodule CT scans are available for download on the website.

• ELCAP Public Lung Image Database: contains 50 low-dose thin-slice chest CT images with annotations for small nodules.

• NSCLC-Radiomics: contains 422 non-small cell lung cancer (NSCLC) patients.

0.8.2 Data Preparation

The major purposes of data preparation are to make the training data less confusing, to make it fit CNN better, and to enlarge the dataset.

To make the data less confusing, some papers perform lung segmentation to reduce noise. Smoothing methods can then be applied to the segmented lung parts. Also, other unnecessary parts, such as light dots or air, can be filtered out by thresholding or other techniques.

To make the data fit CNN better, one challenge to mention is the difference between a CT scan and an RGB image. An RGB image contains three channels, and each channel has values ranging from 0 to 255. A CT scan has only one channel, with values ranging over [-1000, 3000], which is a much larger range. Based on our experiments, if one directly feeds CT scans with such a large range into a CNN, the performance will be limited. To make a CT scan more similar to the images originally processed by CNN in computer vision, one solution is to rescale the data range of the CT scan to [0, 255]. This inevitably causes information loss. In [29] and [18], the idea has been raised of turning the one-channel CT scan into three channels by separating attenuations into three levels: low, normal and high. Then the three-channel image is rescaled into [0, 255]. One benefit of this method is that the CT image is now in the same format as an RGB image, so people can use a pretrained CNN model to enhance detection performance. The pretrained model can be viewed as follows: first, we train a CNN on normal RGB images so that it can extract strong features from RGB images, in other words, features that are very good at recognizing RGB images of different classes; then we finetune the CNN on the transformed CT images. This is like training an expert on a new task, which is much less difficult than training a newbie from scratch.
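The three-level idea can be sketched as three intensity windows, each clipped and rescaled to [0, 255]. The window boundaries in `WINDOWS` below are assumptions chosen only for illustration; they are not the values used in [29] or [18] and should be tuned for the task at hand.

```python
def rescale_to_byte(value, low, high):
    """Clip a CT value to [low, high] and map it linearly to [0, 255]."""
    clipped = min(max(value, low), high)
    return round(255 * (clipped - low) / (high - low))


# Assumed attenuation windows over the [-1000, 3000] range; illustrative only.
WINDOWS = {
    "low": (-1000, -400),
    "normal": (-400, 400),
    "high": (400, 3000),
}


def ct_to_three_channels(ct_slice):
    """Turn a one-channel CT slice (2D list) into three [0, 255] channels."""
    return [
        [[rescale_to_byte(v, low, high) for v in row] for row in ct_slice]
        for low, high in WINDOWS.values()
    ]


ct_slice = [[-1000, -700], [0, 3000]]
low_ch, normal_ch, high_ch = ct_to_three_channels(ct_slice)
print(low_ch)
print(normal_ch)
print(high_ch)
```

Each channel spends its 256 levels on a narrower attenuation range, which limits the information loss of a single rescaling over the whole range.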

To enlarge the dataset to meet CNN's need for big data, some methods such as image translation can be applied. The generated images are considered different from the original image. Adding random noise, such as white noise, to the original image is another way to enlarge the dataset.
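Both augmentations can be sketched in a few lines. The example assumes images are small 2D lists and uses Gaussian noise as the white noise; `translate`, `add_gaussian_noise`, the shift and the noise level are illustrative choices.

```python
import random


def translate(image, dx, dy, fill=0):
    """Shift a 2D list by (dx, dy) pixels, padding uncovered pixels with fill."""
    rows, cols = len(image), len(image[0])
    shifted = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dy, c + dx
            if 0 <= nr < rows and 0 <= nc < cols:
                shifted[nr][nc] = image[r][c]
    return shifted


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian (white) noise to every pixel."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in image]


rng = random.Random(0)  # fixed seed so the example is repeatable
image = [[1, 2], [3, 4]]
augmented = [
    translate(image, dx=1, dy=0),
    add_gaussian_noise(image, sigma=0.1, rng=rng),
]
print(augmented[0])
```

Each augmented copy keeps the label of the original image, so one annotated slice yields several training samples.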

One more issue is the imbalanced dataset. Nodule detection is a binary classification problem (nodule or non-nodule), and to train such a classifier the dataset should be balanced, which means both classes have an equal number of samples. However, in a set of CT scans, the number of slices containing nodules is much smaller than the number of slices that do not contain nodules. So when preparing the training dataset, we need to balance it so that the numbers of the two types of samples, with and without nodules, are equal.
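One simple way to obtain equal class sizes is to randomly undersample the larger class. The sketch below assumes labels of 1 for nodule and 0 for non-nodule; `balance_by_undersampling` and the slice names are hypothetical, and oversampling or augmenting the minority class would be an alternative.

```python
import random


def balance_by_undersampling(samples, labels, rng):
    """Randomly drop majority-class samples until both classes are equal in size.

    labels use 1 for nodule and 0 for non-nodule.
    """
    positives = [s for s, y in zip(samples, labels) if y == 1]
    negatives = [s for s, y in zip(samples, labels) if y == 0]
    n = min(len(positives), len(negatives))
    chosen = ([(s, 1) for s in rng.sample(positives, n)]
              + [(s, 0) for s in rng.sample(negatives, n)])
    rng.shuffle(chosen)
    return chosen


rng = random.Random(0)
samples = ["slice_%d" % i for i in range(10)]
labels = [1, 1] + [0] * 8  # 2 nodule slices, 8 non-nodule slices
balanced = balance_by_undersampling(samples, labels, rng)
print(len(balanced))
```

Undersampling discards data, so in practice it is often combined with the augmentation methods above to enlarge the minority class instead.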

0.8.3 Some Other Challenges

Besides the data-related challenges mentioned above, there remain some other challenges. Below we list some of them:

• HPC support: The training of a CNN based model requires a huge amount of calculation on a huge amount of data. Only with the help of HPC can the CNN model be trained in a tolerable length of time. Nowadays, besides training CNN purely on CPU, CUDA accelerated GPUs are also used for training.


• High Cost: Along with the need for HPC, another challenge is cost. Supporting HPC consumes a large amount of energy and requires facilities. The concept of energy aware computing could be considered when designing an HPC system for CNN.

• Multi-disciplinary Cooperation Required: The design of a CNN based lung nodule detection system requires cooperation across multiple disciplines such as medicine, radiology and computer science. Each expert provides domain knowledge from their own area. The knowledge from different domains can guide how to perform data preparation, how to design the CNN model and how to make the system user friendly to radiologists. So it is very important for people from different areas to understand each other. How to protect data privacy is also a big concern in the cooperation.


0.9 Conclusion

In this survey, we give an introduction to the recent progress in using CNN for medical image segmentation. A list of public packages as well as a list of public datasets are given. We can see that CNN has shown great potential in the area of lung nodule detection, from bounding box based methods to more advanced ones. Meanwhile, challenges remain and researchers are working on solving them. We can see a very promising future for CNN based medical image segmentation.


Bibliography

[1] Everingham, Mark, and John Winn. "The pascal visual object classes challenge 2011 (voc2011) development kit." Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep (2011).

[2] American Cancer Society. http://www.cancer.org/cancer/lungcancer-non-smallcell/detailedguide/non-small-cell-lung-cancer-key-statistics

[3] Data Science Bowl 2017. https://www.kaggle.com/c/data-science-bowl-2017, 2017

[4] Setio, Arnaud Arindra Adiyoso, et al. "Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge." Medical image analysis 42 (2017): 1-13.

[5] Zhe, Xiaoning, Michael L. Cher, and R. Daniel Bonfil. "Circulating tumor cells: finding the needle in the haystack." American journal of cancer research 1.6 (2011): 740.

[6] Shen, Shiwen, et al. "An automated lung segmentation approach using bidirectional chain codes to improve nodule detection accuracy." Computers in biology and medicine 57 (2015): 139-149.

[7] Duggan, Nóirín, et al. "A Technique for Lung Nodule Candidate Detection in CT Using Global Minimization Methods." Energy Minimization Methods in Computer Vision and Pattern Recognition. Springer International Publishing, 2015.

[8] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

[10] Jia, Yangqing, et al. "Caffe: Convolutional architecture for fast feature embedding." Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014.

[11] LeCun, Yann, et al. "Deep learning." Nature 521.7553 (2015): 436-444.


[12] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European conference on computer vision. Springer, Cham, 2014.

[13] Zeiler, Matthew D., et al. "Deconvolutional networks." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.

[14] Nima Tajbakhsh, et al. "Comparing two classes of end-to-end machine-learning models in lung nodule detection and classification: MTANNs vs. CNNs", Pattern Recognition, in press.

[15] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks." Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011.

[16] Maas, Andrew L., Awni Y. Hannun, and Andrew Y. Ng. "Rectifier nonlinearities improve neural network acoustic models." Proc. ICML. Vol. 30. No. 1. 2013.

[17] Paszke, Adam, et al. "PyTorch." (2017).

[18] Shin, Hoo-Chang, et al. "Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning." IEEE transactions on medical imaging 35.5 (2016): 1285-1298.

[19] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

[20] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

[21] Tran, Du, et al. "Learning Spatiotemporal Features with 3D Convolutional Networks." arXiv preprint arXiv:1412.0767 (2014).

[22] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[23] Wu, Zhirong, et al. "3d shapenets: A deep representation for volumetric shapes." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[24] Smith, Steven. Digital signal processing: a practical guide for engineers and scientists. Newnes, 2013.

[25] Maturana, Daniel, and Sebastian Scherer. "Voxnet: A 3d convolutional neural network for real-time object recognition." Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015.

[26] Anirudh, Rushil, et al. "Lung nodule detection using 3D convolutional neural networks trained on weakly labeled data." SPIE Medical Imaging. International Society for Optics and Photonics, 2016.


[27] Golan, Rotem, et al. "Lung nodule detection in CT images using deep convolutional neural networks." Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE, 2016.

[28] Maaten, Laurens van der, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of Machine Learning Research 9.Nov (2008): 2579-2605.

[29] Gao, Mingchen, et al. "Holistic classification of CT attenuation patterns for interstitial lung diseases via deep convolutional neural networks." Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization (2016): 1-6.

[30] Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).

[31] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[32] Paszke, Adam, et al. "Automatic differentiation in PyTorch." (2017).

[33] Chen, Liang-Chieh, et al. "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs." arXiv preprint arXiv:1606.00915 (2016).

[34] Sermanet, Pierre, et al. "Overfeat: Integrated recognition, localization and detection using convolutional networks." arXiv preprint arXiv:1312.6229 (2013).

[35] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

[36] Yosinski, Jason, et al. "Understanding neural networks through deep visualization." arXiv preprint arXiv:1506.06579 (2015).

[37] Zhao, Hengshuang, et al. "Pyramid scene parsing network." arXiv preprint arXiv:1612.01105 (2016).

[38] Ronneberger, O., Fischer, P., & Brox, T. (2015, October). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 234-241). Springer, Cham.

[39] Luc, Pauline, et al. "Semantic segmentation using adversarial networks." arXiv preprint arXiv:1611.08408 (2016).

[40] LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation


[41] Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.

[42] Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).

[43] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein gan." arXiv preprint arXiv:1701.07875 (2017).

[44] Souly, Nasim, Concetto Spampinato, and Mubarak Shah. "Semi and Weakly Supervised Semantic Segmentation Using Generative Adversarial Network." arXiv preprint arXiv:1703.09695 (2017).

[45] Xue, Yuan, et al. "SegAN: Adversarial Network with Multi-scale L1 Loss for Medical Image Segmentation." arXiv preprint arXiv:1706.01805 (2017).

[46] Bengio, Yoshua. "Learning deep architectures for AI." Foundations and trends in Machine Learning 2.1 (2009): 1-127.

[47] Nash, Charlie, and Chris KI Williams. "The shape variational autoencoder: A deep generative model of part-segmented 3D objects." Computer Graphics Forum. Vol. 36. No. 5. 2017.

[48] Baka, Nora, Sieger Leenstra, and Theo van Walsum. "Ultrasound Aided Vertebral Level Localization for Lumbar Surgery." IEEE transactions on medical imaging 36.10 (2017): 2138-2147.

[49] Manivannan, Siyamalan, et al. "Structure Prediction for Gland Segmentation with Hand-Crafted and Deep Convolutional Features." IEEE transactions on medical imaging (2017).
