Convolutional Neural Network based
Medical Imaging Segmentation: Recent
Progress and Challenges
Jiaxing Tan
A literature review submitted to
the Graduate Faculty in Computer Science in partial fulfillment of
the requirements for the degree of Doctor of Philosophy,
The City University of New York
Committee Members:
Dr. Yumei Huo (Advisor)
Dr. Shuqun Zhang
Dr. Sos Agaian
Dr. Yingli Tian
January 14, 2018
© Copyright by Jiaxing Tan, 2018.
All rights reserved.
Abstract
A key research topic in Medical Image Analysis is image segmentation. Traditionally,
this task is solved by methods based on hand-engineered features, which can be highly
dataset dependent. Convolutional Neural Networks (CNNs) have shown great success in
many areas, especially computer vision. Unlike classification based on hand-engineered
features, a CNN learns its features from the data. Recently, progress in CNN based
image segmentation has cast light on the area of medical imaging. This survey gives a
brief introduction to CNN based medical image segmentation. Three categories of
methods are discussed: CNN based methods, Encoder-Decoder based methods and
Generative Adversarial Network based methods. For each category, a brief introduction
is given first, followed by a timeline of how the models evolved, and then some recent
progress on medical imaging is introduced.
Besides the technical details, we also introduce some publicly available packages
for fast development and some public data sources.
Contents
Abstract
0.1 INTRODUCTION
0.2 Convolution Neural Network
0.2.1 What is Convolutional Neural Network
0.2.2 Some available Packages
0.2.3 Conclusion
0.3 Some Recent CNN based methods
0.3.1 2D CNN based method
0.3.2 3D CNN based lung nodule detection
0.3.3 Holistic CNN Model
0.3.4 Conclusion
0.4 Encoder-Decoder Based CNN Structure
0.5 Recent Progress on the Encoder-Decoder Structure Based Segmentation
0.5.1 A timeline for such models
0.5.2 Conclusion
0.6 Generative Adversarial Network
0.7 GAN Based Segmentation models
0.7.1 GAN for segmentation
0.7.2 Conclusion
0.8 Some Challenges
0.8.1 Data Source
0.8.2 Data Preparation
0.8.3 Some Other Challenges
0.9 Conclusion
Figure 1: An example of semantic segmentation from the famous VOC dataset. Left: input image. Right: semantic segmentation result.
0.1 INTRODUCTION
One key research topic in Medical Imaging is image segmentation. Image segmentation,
or semantic segmentation, is a pixel-level image understanding task: it performs a
pixel-by-pixel classification to decide the class of each pixel. Figure 1 shows one
example from the famous VOC [1] dataset, where the real image is on the left and the
semantic segmentation result is on the right. In this example, each pixel is labeled
as background, motorbike or rider.
In the context of Medical Imaging, such segmentation methods can be used to solve
problems such as nodule detection, anomaly detection and organ segmentation. Let’s
take lung cancer detection as an example of the importance of image segmentation in
medical imaging, since lung cancer is the leading cause of cancer-related deaths.
Based on the statistics provided by the American Cancer Society [2], lung cancer was
estimated to cause 158,080 deaths in the United States in 2016. Early detection of
lung cancer is the key to a sharp increase in the survival rate. Computed tomography
(CT), a popular detection tool, has so far been analyzed subjectively by radiologists.
The anticipated large interpretation effort demands a computer-aided detection (CAD)
scheme to help radiologists diagnose lung cancer efficiently. As a result, many
automated CAD methods based on nodule segmentation have been developed. Moreover,
lung volume segmentation is often performed as the first step in other automated CAD
methods and in lung disease diagnosis. Thus a robust, high-quality image segmentation
method could greatly reduce the cost of operation and shorten the waiting time, not
only for lung CT but also for various other types of medical image analysis.
Figure 2: An example of the segmentation task in the context of medical imaging. The goal is to detect the location of lung nodules given a lung CT scan.
However, medical images are often noisy and of low quality, which makes segmentation
much harder. Figure 2 shows some examples of lung CT slices with nodules marked by a
red mask; the nodules are quite small compared to the whole image. The detection task
is therefore very hard, like finding a needle in a haystack [5].
Traditionally, medical image segmentation is performed by classification based on
hand-engineered features. One common attribute of such methods is that some empirical
"magic numbers" are used for thresholding and preprocessing; see the methods proposed
by Shen et al. [6] and Duggan et al. [7] for examples. There remains a risk that those
empirical values are dataset specific, which might prevent the model from being a
general solution for different datasets. As the final goal of medical imaging is to
build CAD systems that serve large numbers of people, a robust and adaptive model is
highly desirable. In this respect, the Convolutional Neural Network, which can learn
features by itself with little or no empirical prior, casts some light on the medical
imaging area.
Convolutional Neural Networks (CNNs) have shown great success in computer vision on
the ImageNet challenge. Since AlexNet [9] was proposed, performance has improved
almost every year with deeper and deeper structures supported by high performance
computing facilities. Compared with classifiers based on manually selected features,
a CNN can learn features from the data itself, which turns out to be more efficient
and automatic [11]. Some progress has also been made in medical imaging using CNNs to
solve real-life problems. In recent medical imaging competitions such as the Kaggle
lung cancer detection competition [3] and the LUNA16 Challenge [4], the top ranked
teams all used CNNs in their solutions.
In this survey, we aim to give a brief introduction to what is happening in the area
of CNN based medical image segmentation, with typical methods. As CNN has developed
into different sub-types, we discuss CNN based medical segmentation methods in three
categories: CNN based methods, Encoder-Decoder based methods and Generative
Adversarial Network based methods. For each category, we first briefly introduce the
technical points of model design with some related papers in computer vision, and
then we illustrate the corresponding model designs and literature in the medical
imaging area. We then provide lists of publicly available packages for fast
development and some public datasets (specifically for lung cancer detection).
Finally, we discuss some challenges and possible data preparation methods and reach a
conclusion.
The rest of this survey is organized as follows: In section 2, we give a general
picture of CNN with a list of commonly used public packages. Then we introduce the
three categories of CNN based medical segmentation methods in sections 3 to 7. In
section 8, we describe some challenges and their possible solutions. Finally, the
conclusion is given in section 9.
0.2 Convolution Neural Network
In this section, we give a brief introduction to the Convolutional Neural Network. Our
introduction is based on the example shown in Figure 3. Then a list of publicly
available packages is provided for fast development.
0.2.1 What is Convolutional Neural Network
Convolutional Neural Networks are similar to deep neural networks in that they are
made up of neurons with learnable weights and biases. The difference is that, to
reduce the number of connections of a deep neural network, a CNN shares weights across
different locations. Such weight sharing turns out to be equivalent to the convolution
operation in signal processing. More commonly, the CNN is introduced as a deep neural
network inspired by biological studies of the human cortex. A CNN is constructed from
four types of layers: an input layer, multiple convolution layers, a subsampling layer
following each convolution layer [8], and several fully-connected layers at the end.
A CNN example is shown in figure 3. It contains an input layer of size 256 × 256 × 1
and two convolution layers, each followed by a max-pooling layer. The first convolution
layer has 32 filters of size 7 × 7, and the second has 64 filters of size 5 × 5. After
the convolution layers come two fully connected layers. The first has 128 neurons and
the second has 2 neurons. A softmax is applied to the last layer to normalize the
outputs to the range [0, 1]. We introduce each of the layers below using this example.
Input Layer The Input layer is in charge of reading data with a predefined size
without performing any changes to it. In figure 3, the input layer reads in a CT
scan image with size 256 × 256. Note that the nodes of the input layer are passive,
meaning that they do not modify the data [24]. They receive a single value on their
input, and duplicate the value to the hidden layers.
Figure 3: An example of a CNN with 2 convolution layers and 2 fully connected layers. Each convolution layer is followed by a pooling layer.
In practice, some packages, such as Caffe, TensorFlow and Theano, require the input
size to be specified in the input layer. Others, such as PyTorch, adapt to the input
data without such a specification. Either way, for a network with fully connected
layers the test data and the training data must have the same size, since a different
input size results in a different number of parameters.
Convolution layer A convolution layer has k different kernels, each of shape m × n,
and performs the convolution operation (denoted as ∗) on each of the j sub-images of
the input image i. A bias b is added to the convolution result and a non-linear
function g is applied. The whole procedure is shown in equation 1.
f = g((W_i ∗ i_{1:j}) + b) (1)
The output of the convolution layer is k feature maps, each generated by a
convolution operation with one kernel applied to the whole image. In figure 3, there
are 2 convolution layers. Conv1 has 32 kernels, each of size 7 × 7, and conv2 has 64
kernels, each of size 5 × 5.
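The convolution layer described above can be sketched in plain Python. This is a minimal illustration, not the implementation used by any of the packages: it assumes no padding and a stride of 1, uses ReLU as the non-linear function g, and, as most deep learning packages do, slides the kernel without flipping it. The names `conv2d_valid` and `conv_layer` and the toy kernels are hypothetical.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide one m x n kernel over a 2D image (no padding, stride 1)."""
    m, n = len(kernel), len(kernel[0])
    out_h = len(image) - m + 1
    out_w = len(image[0]) - n + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum over the m x n sub-image whose top-left corner is (r, c).
            s = sum(kernel[a][b] * image[r + a][c + b]
                    for a in range(m) for b in range(n))
            row.append(max(0.0, s + bias))  # g is ReLU in this sketch
        out.append(row)
    return out


def conv_layer(image, kernels, biases):
    """Apply k kernels to the same image and return k feature maps."""
    return [conv2d_valid(image, k, b) for k, b in zip(kernels, biases)]


image = [[1, 2, 3, 0],
         [0, 1, 2, 3],
         [3, 0, 1, 2],
         [2, 3, 0, 1]]
kernels = [[[1, 0], [0, 1]],    # sums the main diagonal of each 2 x 2 window
           [[0, -1], [-1, 0]]]  # negated anti-diagonal sum, clipped to 0 by ReLU
feature_maps = conv_layer(image, kernels, biases=[0.0, 0.0])
```

With two kernels the layer returns two 3 × 3 feature maps, one per kernel, which matches the "k kernels give k feature maps" description.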
Besides the 2D convolution layer, some tasks need to deal with 3D data, such as
spatial convolution over volumes, and require 3D convolution. For this reason, a 3D
convolution layer is also available in most current packages.
One additional type of convolution layer that has drawn a lot of researchers’ interest
is the transpose convolution layer, also known as the deconvolution layer. The
deconvolution layer was first proposed by Zeiler et al. [12] and can be thought of as
a reverse operation of convolution. This idea was extended to build an Encoder-Decoder
structure [13] that takes advantage of the reverse operation, which we will discuss
later in the corresponding section. Like the convolution operation, the transpose
convolution operation has 2D and 3D versions.
Subsampling Layer The subsampling layer, which follows a convolution layer, performs
a down-sampling operation on the feature maps generated by the convolution layer.
There are several ways to perform sub-sampling, such as average-pooling,
median-pooling, global average pooling and max-pooling. In figure 3, the network uses
the max-pooling strategy.
Besides pooling based sampling, a newer trend is to perform the (down)sampling in the
convolution layer itself by using a stride larger than 1. To explain this further,
let’s take the 2D convolution operation provided by PyTorch [32] as an example. Its
output size is given by equation 2:
W_out = floor((W_in + 2 × padding − dilation × (kernel_size − 1) − 1) / stride + 1), (2)
where W_out is the output size, W_in is the input size, padding is the size of the
zero-padding added to both sides of the input, dilation is the spacing between kernel
elements and stride is the stride of the convolution. We can see that setting stride
to 2 is equivalent to down-sampling by a factor of 2. So a new down-sampling method is
to replace the pooling layer with a convolution layer with filter size 1 × 1 and
stride 2.
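The output size formula of equation 2 is easy to check numerically. The sketch below implements it with integer arithmetic and then walks through the network of figure 3. The walk-through rests on assumptions that the figure does not state: unpadded convolutions with stride 1, 2 × 2 pooling with stride 2, and the same size formula applied to the pooling layers. The function name `conv_output_size` is hypothetical.

```python
def conv_output_size(w_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output size along one dimension, following equation 2."""
    # Floor division on integers equals floor(x / stride) in the formula.
    return (w_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# A 1 x 1 convolution with stride 2 halves a side of 256 pixels.
halved = conv_output_size(256, kernel_size=1, stride=2)   # -> 128

# Walk through figure 3 under the assumptions named above.
size = 256
size = conv_output_size(size, kernel_size=7)              # conv1 -> 250
size = conv_output_size(size, kernel_size=2, stride=2)    # pool1 -> 125
size = conv_output_size(size, kernel_size=5)              # conv2 -> 121
size = conv_output_size(size, kernel_size=2, stride=2)    # pool2 -> 60
```

Under these assumptions the first fully connected layer would receive 64 feature maps of size 60 × 60, which is why the input size fixes the number of parameters of that layer.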
Fully Connected Layer A fully connected layer is the same as the layers in a
traditional Multi-Layer Perceptron (MLP). The input is an image i, with the operations
shown in equation 3, where b denotes the bias. The last layer of a CNN model is
usually a fully connected layer which serves as the output layer. In the output layer,
the number of neurons equals the number of classes in the classification task.
f = g((W_i × i_{1:j}) + b) (3)
Non-linear Function The non-linear function g is used in both convolution and fully
connected layers. Functions like Tanh, Sigmoid and ReLU are often used as the
non-linear function. For the output layer, i.e. the last layer of a CNN, a softmax
function is commonly used.
Here we list some commonly used non-linear functions. The first one is the
"old fashioned" sigmoid function, shown in equation 4.
Sigmoid(x) = 1 / (1 + e^{−x}) (4)
ReLU, first used by Glorot et al. [15], is now one of the most commonly used
non-linear functions. ReLU is shown in equation 5.
ReLU(x) = max(0, x) (5)
Some variants of ReLU have also been proposed and are used in the newest models. For
example, LeakyReLU, which allows a small, non-zero gradient when the unit is not
active, was proposed by Maas et al. [16]. LeakyReLU is shown in equation 6. Such a
design can alleviate potential problems caused by ReLU, which sets all negative values
to 0.
LeakyReLU(x, α) = max(x, 0) + α × min(x, 0). (6)
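The non-linear functions of equations 4 to 6, together with the softmax used in the output layer, can be written directly in plain Python. This is a minimal sketch for scalar inputs; the default α = 0.01 is an assumption chosen for illustration, and the softmax subtracts the maximum score before exponentiating, a common trick for numerical stability that the text does not discuss.

```python
import math


def sigmoid(x):
    """Equation 4: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Equation 5: sets every negative value to 0."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Equation 6: keeps a small slope alpha for negative inputs."""
    return max(x, 0.0) + alpha * min(x, 0.0)


def softmax(scores):
    """Normalize a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For a negative input such as −2, `relu` returns 0 while `leaky_relu` with α = 0.1 returns approximately −0.2, so the unit still passes a small gradient.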
0.2.2 Some available Packages
One advantage of CNN is that several public packages are available. Instead of
building your CNN from scratch, you could try the packages listed below:
• Caffe: http://caffe.berkeleyvision.org/
• Torch: http://torch.ch/
• Theano: http://deeplearning.net/software/theano/
• Lasagne: https://lasagne.readthedocs.io/en/latest/
• TensorFlow: https://www.tensorflow.org/
• Keras: https://keras.io/
• Caffe2: https://caffe2.ai/
• PyTorch: http://pytorch.org/
• Mxnet: http://mxnet.io/
All these packages are publicly available online with multi-platform and GPU support,
which reduces the workload of implementing a CNN based task. The packages have also
been optimized for efficiency.
The principle of selection from this list, in my opinion, should be based on code
support. Typical models, such as VGG-Net [20] or ResNet [22], have been pretrained on
the different platforms, so for them there is not much difference between packages.
But some new models have only been implemented in a single package or a few packages.
In this situation, you should choose the package in which the model was implemented
to guarantee performance. Moreover, note that the maintenance of Theano has been
stopped.
0.2.3 Conclusion
In this section, we present the general design of a Convolutional Neural Network and
briefly discuss the four types of layers used by a CNN, along with some new trends.
We also list some commonly used packages for model development.
0.3 Some Recent CNN based methods
In this section, we present some typical recent methods for CNN based lung nodule
detection. In the literature, various CNN designs have been proposed. Most of the
methods reviewed here are bounding box based methods. The general idea is to infer the
class label of the center pixel(s) from the nearby neighbors. We classify the methods
into three groups: 2D CNN based, 3D CNN based and Holistic CNN based.
The 2D and 3D methods differ in whether a 2D or 3D neighborhood is considered for
classification of the centered pixel(s). As a complement to the bounding box methods,
we also introduce methods that do not use a bounding box, which are called holistic in
some medical imaging literature. We introduce these three groups in turn in this
section.
0.3.1 2D CNN based method
The main idea of the 2D CNN based segmentation method is to decide the label of the
centered pixel(s) from its 2D neighborhood. The input is a 2D image slice, and the
output is the predicted label of the centered pixel(s). Let’s take the work by
Sermanet et al. [34] as an example, which is illustrated in Figure 4. At each
location, an image slice is generated to predict the label of its centered pixel.
Given an input image slice, the model generates a predicted label with a confidence
score.
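The per-pixel pipeline described above can be sketched as follows. This is an illustration under stated assumptions, not the method of [34]: patches are square, pixels outside the image are treated as zero, and `toy_classifier` is a hypothetical stand-in for a trained CNN that simply thresholds the mean intensity of the patch.

```python
def extract_patch(image, row, col, half):
    """Return the (2*half+1) x (2*half+1) neighborhood centered on (row, col).

    Pixels that fall outside the image are filled with 0.
    """
    h, w = len(image), len(image[0])
    return [[image[r][c] if 0 <= r < h and 0 <= c < w else 0
             for c in range(col - half, col + half + 1)]
            for r in range(row - half, row + half + 1)]


def segment(image, classify_patch, half=1):
    """Label every pixel by classifying the patch centered on it."""
    return [[classify_patch(extract_patch(image, r, c, half))
             for c in range(len(image[0]))]
            for r in range(len(image))]


def toy_classifier(patch):
    """Hypothetical stand-in for a CNN: label 1 if the patch mean exceeds 0.5."""
    values = [v for row in patch for v in row]
    return 1 if sum(values) / len(values) > 0.5 else 0


image = [[0, 0, 0, 0],
         [0, 1, 1, 1],
         [0, 1, 1, 1],
         [0, 1, 1, 1]]
mask = segment(image, toy_classifier)
```

The classifier is called once per pixel, which makes the cost of this approach grow with the image area; this is the efficiency concern raised later for bounding box based methods.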
Figure 4: Example from [34]. The classifier/detector outputs a class and a confidence for each location. These bounding boxes are then grouped together to generate the final segmentation mask.
To increase the prediction accuracy of such bounding box based methods, voting
algorithms such as multi-view voting have been applied to the prediction for a single
input. In [35], Krizhevsky et al. applied multi-view voting to boost performance: the
predictions over a fixed set of 10 views (4 corners and center, with horizontal flip)
are averaged. To explore this idea further, [34] introduces a new type of view by
using multiple scales at each location. It is also claimed that while the sliding
window approach may be computationally prohibitive for certain types of model, it is
inherently efficient in the case of ConvNets. Such a multi-scale method further
increases the number of views for voting while remaining efficient.
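The 10-view voting scheme (4 corner crops and the center crop, each with its horizontal flip) can be sketched in a few lines. This is a minimal sketch assuming square crops and a `predict` function that returns a list of per-class scores; `predict` stands in for a trained network and is hypothetical, as are the helper names.

```python
def crop(image, top, left, size):
    """Cut a size x size view whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def hflip(image):
    """Mirror a view horizontally."""
    return [row[::-1] for row in image]


def ten_views(image, size):
    """Four corner crops and the center crop, plus their horizontal flips."""
    h, w = len(image), len(image[0])
    origins = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    crops = [crop(image, top, left, size) for top, left in origins]
    return crops + [hflip(c) for c in crops]


def multi_view_predict(image, size, predict):
    """Average the per-class scores returned by `predict` over the ten views."""
    scores = [predict(view) for view in ten_views(image, size)]
    return [sum(column) / len(scores) for column in zip(*scores)]
```

Averaging scores rather than taking a majority of hard labels keeps the confidence information of each view, which is how the averaging in [35] is described above.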
Given a set of CT scans of a patient, which usually contains more than 300 slices
depending on the body size of the patient, radiologists check the scan slice by slice
to detect nodules. In each slice, radiologists observe every sub-region. This
procedure is performed on 2D slices.
In this procedure, each slice can be viewed as a 2D image and the inspection can be
viewed as image classification on sub-regions, so we can use a 2D CNN based classifier
to simulate it. After applying pre-processing methods such as lung segmentation and
noise elimination, the slice is cut into small sub-regions. Each region is an input to
the CNN, and the output is a decision on whether the region contains a nodule. The
combined result shows whether a nodule exists in the slice.
Most recently, Nima et al. [14] compared the performance of several CNN structures
with MTANN on lung nodule detection. Several models, including a shallow CNN, rd-CNN,
the famous LeNet [19] and AlexNet [9], are compared in this paper, which gives some
guidance on how to design a CNN. Although MTANN outperforms CNN, MTANN is a group of
neural networks, each in charge of a certain situation, while the CNN has only one
network for all situations. It would be very interesting to know whether a group of
CNNs, each designed to deal with a certain situation, would give a better result than
MTANN.
Also, we would like to mention the work of Shin et al. [18], who use transfer learning
to take advantage of CNN models pre-trained in computer vision to perform CT scan
analysis. Although they perform lung disease detection instead of lung nodule
detection, we think this paper is complementary to [14]: in [18] three very deep CNN
structures achieve good results, while in [14] only shallow CNN structures perform
well. They developed a strategy for transforming the single channel CT scan into a 3
channel image. In this way, people can use a pretrained model from computer vision
instead of training their model from scratch. We will further explain pretraining in
the challenge section. They experiment on three famous CNN structures and compare
variants that differ in the number of parameters and in the training strategy:
training the model from scratch, transfer learning and "off-the-shelf". The results
show that the performance of a CNN that is too deep can be limited by the size of the
dataset. When using a deep CNN from computer vision, reducing the number of parameters
may increase the performance. The idea of transfer learning provides a way to use
pretrained models to perform CT scan analysis.
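One way to turn a single channel CT slice into a 3 channel image is to view the same slice through three different intensity windows and stack the results. The sketch below illustrates that idea only; the text does not give the details of the strategy in [18], and the window bounds in `HYPOTHETICAL_WINDOWS` are placeholder values chosen for the example, not settings taken from that paper.

```python
def apply_window(slice_hu, low, high):
    """Clip intensity values to [low, high] and rescale them to 0..255."""
    return [[round((min(max(v, low), high) - low) * 255 / (high - low))
             for v in row]
            for row in slice_hu]


def to_three_channels(slice_hu, windows):
    """Return three 2D channels, one per (low, high) intensity window."""
    if len(windows) != 3:
        raise ValueError("exactly three windows are needed for three channels")
    return [apply_window(slice_hu, low, high) for low, high in windows]


# Placeholder windows for illustration only (assumed, not from [18]).
HYPOTHETICAL_WINDOWS = [(-1400, 200), (-1400, -950), (-160, 240)]

ct_slice = [[-2000, -1000],
            [0, 500]]
channels = to_three_channels(ct_slice, HYPOTHETICAL_WINDOWS)
```

The result has the channel layout that a network pretrained on color images expects, so its first convolution layer can be reused unchanged.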
Figure 5: Feature embedding visualizations of Imagenet [10] pretrained with Caffe and C3D [21] on UCF101 dataset using t-SNE [28]. C3D features are semantically separable compared to Imagenet suggesting that it is a better feature for videos. Each clip is visualized as a point and clips belonging to the same action have the same color. Best viewed in color.
0.3.2 3D CNN based lung nodule detection
Traditionally, a CNN takes a 2D matrix as input. However, sometimes a 3D neighborhood
is needed to make predictions. Stimulated by this need, some recent publications in
computer vision introduce 3D CNNs. Here I briefly review two papers that originated
this idea. Du Tran et al. [21] proposed the C3D model, which applies a 3D CNN to video
scene recognition to take advantage of the third dimension of the video, time, for
better performance. To show the benefit of 3D convolution in video action recognition,
[21] provides the feature embedding result illustrated in figure 5. It compares the
feature learning power of the ImageNet [10] benchmark with that of C3D. The
visualization is made by t-SNE [28], which can generate a low-dimensional
representation of high-dimensional data. The features learned by C3D are visibly more
separable than those of the 2D based model.
To recognize 3D objects, Daniel Maturana et al. [25] applied a 3D CNN, VoxNet, which
achieved promising performance. As shown in figure 6, the structure of VoxNet is very
straightforward: you can think of it as a CNN that replaces 2D convolution with 3D
convolution. One more thing I would like to mention is that VoxNet publishes its
benchmark dataset for 3D object recognition. Its website maintains a list of current
benchmark models with their scores, which can be used as a reference to validate 3D
models.
Figure 6: Architecture of the VoxNet model.
When radiologists check each scan, for a better inspection, besides going through
each region of the scan, they will also check the same region on the slices before or
after the current slice to decide whether there is a nodule inside. Such detection
procedure takes advantage of the 3D nature of the CT scan, which could also be an
inspiration on the design of CNN based detection.
In the area of lung nodule detection, due to the 3D nature of the CT scan, it is
reasonable to apply a 3D CNN, and some efforts have been made. Rushil Anirudh [26]
applied a 3D CNN to a weakly labelled lung nodule dataset. He uses a voxel v̂ as the
input to the CNN to decide whether the center point v located at (x, y, z) is a nodule
or not. v̂ is defined as (x − w : x + w, y − w : y + w, z − h : z + h), which means
that not only the neighbors of v in the same slice but also the neighbors on the
previous and following slices are considered when making the decision. This design is
closer to how a radiologist performs lung nodule detection. A sensitivity of 80% at 10
false positives per scan is reported on their weakly labeled dataset.
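Extracting the 3D neighborhood v̂ around a center point can be sketched as below. This is a minimal illustration, not the code of [26]: it assumes the volume is stored as nested lists indexed `volume[z][y][x]`, that both ends of each range are inclusive, and that positions outside the scan are filled with zero. The function name `extract_cube` is hypothetical.

```python
def extract_cube(volume, x, y, z, w, h):
    """Return the block (x-w : x+w, y-w : y+w, z-h : z+h) around (x, y, z).

    The result has 2*h+1 slices of (2*w+1) x (2*w+1) voxels each.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])

    def value(zz, yy, xx):
        inside = 0 <= zz < depth and 0 <= yy < height and 0 <= xx < width
        return volume[zz][yy][xx] if inside else 0  # zero outside the scan

    return [[[value(zz, yy, xx) for xx in range(x - w, x + w + 1)]
             for yy in range(y - w, y + w + 1)]
            for zz in range(z - h, z + h + 1)]
```

Setting h = 0 reduces the block to the 2D neighborhood in the current slice, so the 2D methods of the previous part are a special case of this input.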
In [27], Golan et al. developed a 3D CNN based lung nodule detector that uses votes to
locate nodules. Normally in a detection procedure, the type of each pixel in the scan
is decided only once. In their work, the detection result for each pixel is obtained
by combining multiple votes. The votes are generated by sliding windows in the 3D
space. Each sliding window provides one vote for all the pixels inside it according to
its classification result, that is, all the pixels inside are considered to have the
same type as the classification result. In this way, a single pixel receives multiple
votes from different sliding windows. The final decision is made by comparing the
total votes from the different sliding windows with a predefined threshold. Such a
strategy can reduce the prediction error by ensembling multiple detection results.
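The voting scheme described above can be sketched as follows. This is an illustration of the idea rather than the implementation in [27]: windows are cubes, `classify` is a hypothetical stand-in for the 3D CNN that receives the window origin and size and returns True for a positive window, and the threshold is a plain vote count.

```python
from itertools import product


def vote_map(shape, window, stride, classify):
    """Give one vote to every voxel inside each positively classified window."""
    depth, height, width = shape
    votes = [[[0] * width for _ in range(height)] for _ in range(depth)]
    origins = product(range(0, depth - window + 1, stride),
                      range(0, height - window + 1, stride),
                      range(0, width - window + 1, stride))
    for z, y, x in origins:
        if classify(z, y, x, window):
            for dz, dy, dx in product(range(window), repeat=3):
                votes[z + dz][y + dy][x + dx] += 1
    return votes


def threshold(votes, min_votes):
    """Mark a voxel as positive when it collected at least min_votes votes."""
    return [[[1 if v >= min_votes else 0 for v in row] for row in plane]
            for plane in votes]
```

Because overlapping windows vote on the same voxel, a single wrong window classification changes the vote count by only one, which is the ensembling effect mentioned above.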
0.3.3 Holistic CNN Model
In contrast to the bounding box based models mentioned in the previous two parts, here
I review literature that makes predictions based on the whole picture without cutting
it into small slices. The term "holistic" is used in the medical imaging area because
the input is the whole image.
As the output of a CNN is a vector of classification results, a good design of the
classes can allow detection on the whole image at once. Recently, YOLO [30] proposed
performing object detection with only one scan of the image. Given an input image,
the output of the CNN gives the type and the location, in the form of a boundary, of
each object. As shown in figure 7, it takes a whole image, and the output is the
object class and position, from which a bounding box is generated to show the location
of the object. Some examples can be seen below. This is the first paper that uses a
CNN to generate such information all at once. The difference from a traditional CNN is
that the bounding box size and object location are also represented by neurons in the
output layer. It is remarkable that the network can learn how to represent such
information. The shortcoming is that such a complex design requires a huge amount of
data. In my experiment, the performance was poor when too little data was available.
Figure 7
In the area of medical imaging, Mingchen Gao et al. [29] applied holistic
classification to lung CT scans to detect 6 different kinds of diseases. Although the
task is lung disease detection instead of lung cancer detection, this paper casts some
light on using a different pipeline to perform nodule detection. Their method takes
the whole lung CT image as input, and the output is whether the patient is healthy or
has some kind of lung disease. Their method reaches a decent result in the experiment.
Besides efficiency, another advantage of taking the whole CT scan as input is that
noise has less effect, as more information is available.
0.3.4 Conclusion
CNN based lung nodule detection can be thought of, in terms of Computer Vision, as an
image segmentation problem or an object detection problem. In recent medical imaging
publications, most of the papers use a bounding box to make a decision pixel by pixel.
All the methods we have mentioned in the first two parts follow the same pipeline: the
detection result of a given slice is based on the results of a group of sub-tasks,
each of which performs nodule detection in one region of the slice with a sliding
window.
For the bounding box based method, in my opinion:
Pro:
1. The input size is small, so more accurate results can be generated
2. Working pixel by pixel allows a post-processing method to mitigate the effect of
false positives
Con:
1. Time consuming, which makes real-time results hard to achieve
2. In Computer Vision, the new trend is to operate on the whole picture at once
instead of pixel by pixel.
This raises the question of whether we can perform nodule detection on the whole image
and obtain the detection result without dividing it into sub-tasks. Holistic detection
methods have been studied in both Computer Vision and Medical Imaging. Since the
output of a CNN is a vector of predicted confidence scores for each class, the only
trick available is to redefine each class to represent a location in the whole image.
A natural idea is to overcome this limitation, which leads to the methods discussed in
the next section. I will also mention some new trends in medical imaging.
0.4 Encoder-Decoder Based CNN Structure
An auto-encoder is an artificial neural network used for unsupervised learning of
efficient codings [46], which contains an Encoder part and a Decoder part. The Encoder
part aims to down-sample the input to reach a low-dimensional representation of the
original data. The Decoder then up-samples the low-dimensional data to recover the
original information.
Auto-encoders have already shown success in image segmentation [47]. Nowadays, there
is a trend to design CNNs in the form of an Encoder-Decoder structure. The Encoder
part is the same as the convolution layers in a CNN, as the convolution layers can be
viewed as a feature extractor with the power of down-sampling. The Decoder can be
viewed as the reverse operation of convolution, so it is built from the deconvolution
layer mentioned before and up-sampling layers, which recover the low-dimensional
representation to its original size.
The idea of a deep convolutional Encoder-Decoder structure, to the best of my
knowledge, originated from [12], which discusses how to reverse the output of a
convolution layer back to its input. The proposed method is shown in figure 8.
Another step was made by Zeiler et al. [13], who proposed a symmetric structure
trained with a reconstruction loss. After that, people started to work on deep
convolutional Encoder-Decoder structures for image segmentation.
Figure 8: The method proposed by Zeiler et al. [12]. The figure shows how to reverse the operation of convolution with ReLU as activation function followed by down-sampling.
0.5 Recent Progress on the Encoder-Decoder
Structure Based Segmentation
0.5.1 A timeline for such models
Figure 9: Design of FCN
Figure 10: Design of DeepLab
From the previous discussion, we know that the output layer of a CNN gives
classification labels with confidence scores. In the work of Long et al. [31], a new
type of CNN called the Fully Convolutional Network (FCN) is proposed. The fully
connected layers that traditionally come at the very end of a CNN prevent a one-pass
scan of the whole image, so a structure based purely on convolution layers is proposed
instead. An upsampling operation is then applied to the last few extracted
low-dimensional features to recover a segmentation mask of larger size. As shown in
figure 9, a cat picture is taken as the input image and the segmented result is shown
as a mask image. An FCN is a network that has only convolution layers and pooling
layers, without any fully connected layers. Transforming fully connected layers into
convolution layers enables a classification net to output a heatmap.
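Why a fully connected layer can be turned into a convolution layer is easiest to see on a tiny example. In the sketch below, which is an illustration and not code from [31], one fully connected unit scores a k × k patch; using the same weights as a convolution kernel on a larger image yields one score per location, i.e. a heatmap. Square weight matrices, stride 1 and no padding are assumed, and the function names are hypothetical.

```python
def dense_score(patch, weights, bias):
    """One fully connected unit: weighted sum over a flattened patch."""
    flat_patch = [v for row in patch for v in row]
    flat_weights = [v for row in weights for v in row]
    return sum(p * w for p, w in zip(flat_patch, flat_weights)) + bias


def heatmap(image, weights, bias):
    """Slide the same weights over a larger image as a convolution kernel."""
    k = len(weights)  # weights are assumed to be k x k
    return [[dense_score([row[c:c + k] for row in image[r:r + k]], weights, bias)
             for c in range(len(image[0]) - k + 1)]
            for r in range(len(image) - k + 1)]


weights = [[1, 1], [1, 1]]
patch = [[1, 0], [0, 1]]
image = [[1, 0, 0],
         [0, 1, 0],
         [0, 0, 1]]
single = heatmap(patch, weights, bias=0)   # 1 x 1 map, same value as dense_score
scores = heatmap(image, weights, bias=0)   # 2 x 2 heatmap over the larger image
```

On an input of exactly the training patch size the convolution reproduces the fully connected output, and on a larger input it produces a spatial map of scores without any change to the weights.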
Another new trend is to ask the CNN to draw a picture of the segmentation results for
the image. This idea originated with Chen et al. [33], as shown in figure 10. Based on
a pre-trained VGGnet, given an input image of a dog, the segmented result is produced
in the form of an image mask by a Decoder in a symmetric structure. Although the
performance is not as good as that of the papers that followed this brilliant idea,
this design was the first to cast light on this path.
We can see that in [33], an Encoder-Decoder structure is given. The encoder
consists of Convolution and Pooling operations, while the decoder is based on
Deconvolution and upsampling. The idea of building such a decoder was first raised
in [36], where the authors applied such methods to visualize what is happening in
each layer of a
Figure 11: PSPnet result comparison
Figure 12: Design of PSPnet
CNN. Deconvolution and upsampling are not an exact reverse operation, but they
can roughly go back one step.
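Why the reverse is only approximate can be seen on a tiny example. This is a
minimal sketch assuming 2 x 2 max pooling with recorded argmax positions
("switches"); the function names are hypothetical and the code is not taken from
[12] or [36].

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """k x k max pooling on a 2-D map; also returns the argmax index per window."""
    H, W = x.shape
    windows = (x.reshape(H // k, k, W // k, k)
                .transpose(0, 2, 1, 3)
                .reshape(H // k, W // k, k * k))
    return windows.max(axis=-1), windows.argmax(axis=-1)

def unpool(pooled, switches, k=2):
    """Put each pooled value back at its recorded position; the rest stays zero."""
    h, w = pooled.shape
    windows = np.zeros((h, w, k * k))
    rows, cols = np.indices((h, w))
    windows[rows, cols, switches] = pooled
    return windows.reshape(h, w, k, k).transpose(0, 2, 1, 3).reshape(h * k, w * k)

x = np.array([[1., 2., 0., 1.],
              [3., 4., 1., 0.],
              [0., 1., 5., 6.],
              [2., 0., 7., 8.]])
pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches)
print(pooled)    # the four window maxima
print(restored)  # maxima at their original positions, zeros elsewhere
```

The restored map keeps the location and value of each maximum, but every
non-maximal value is lost, so the input is recovered only roughly.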
Furthermore, this direction has developed fast recently. PSPNet [37], which is
designed to perform image segmentation on natural scenes, has been proposed to
further reuse different levels of features. A comparison with other networks is
shown in figure 11. The boundaries and detection results of PSPNet are smoother
and more accurate.
The trick of PSPNet is the fusion of different levels of features, i.e. the feature
maps acquired by each Convolution Layer. As shown in figure 12, feature maps of
different sizes are supposed to carry features of different levels. A combination
of all these features can outperform the previous designs. However, how to design
and fuse these layers is barely explained. The code published on GitHub is written
in Caffe and contains 2000 lines for the model definition, which makes it hard for
people to understand.
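Since the fusion step is hard to read from the published model definition, a
simplified sketch may help. It is written under stated assumptions: the bin sizes
(1, 2, 3, 6) follow my reading of [37], while the 1 x 1 convolutions after each
pooling level are omitted and nearest-neighbour upsampling stands in for bilinear
interpolation. All function names are hypothetical.

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool a (C, H, W) map into a (C, bins, bins) grid."""
    C, H, W = x.shape
    out = np.zeros((C, bins, bins))
    for i in range(bins):
        for j in range(bins):
            r0, r1 = (i * H) // bins, -(-(i + 1) * H // bins)  # floor, ceil
            c0, c1 = (j * W) // bins, -(-(j + 1) * W // bins)
            out[:, i, j] = x[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out

def upsample_nearest(x, H, W):
    """Resize a (C, h, w) map to (C, H, W) by nearest-neighbour lookup."""
    rows = np.arange(H) * x.shape[1] // H
    cols = np.arange(W) * x.shape[2] // W
    return x[:, rows][:, :, cols]

def pyramid_pool(x, bin_sizes=(1, 2, 3, 6)):
    """Concatenate the input with its pooled-and-upsampled versions."""
    C, H, W = x.shape
    levels = [upsample_nearest(adaptive_avg_pool(x, b), H, W) for b in bin_sizes]
    return np.concatenate([x] + levels, axis=0)

x = np.arange(72, dtype=float).reshape(2, 6, 6)
fused = pyramid_pool(x)
print(fused.shape)  # (10, 6, 6)
```

The coarsest level (one bin) carries a global summary of each channel, while the
finest level stays close to the original map; concatenating them lets the
following layers see both at every pixel.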
Figure 13: Design of U-net
One important paper to mention is U-net [38], which is very famous in medical
image segmentation and works as a benchmark. The detailed design is shown in
figure 13. Although it used a small dataset to train the model, U-net has proved
very powerful in many other scenarios. Each module in the figure is a block
containing several convolution layers followed by a sampling operation. The novel
idea here is not only to build an auto-encoder: U-net also combines the output of
the corresponding encoder with the input of the decoder as a new input to the next
decoder layer, a shortcut that resembles the skip connection of the Residual
Network [22]. This design is hard to believe but has turned out to be very useful
in a series of related networks. The U-net model has been widely applied and has
proved to be a success in the area of Medical Imaging, on applications such as
lumbar surgery [48] and gland segmentation [49].
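The combination of encoder output and decoder input can be sketched as a channel
concatenation. This is only an illustration under assumptions: the tensor sizes
are toy values, nearest-neighbour upsampling stands in for the learned
up-convolution of [38], and the cropping used in the original design is left out.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def skip_concat(encoder_feat, decoder_feat):
    """Upsample the decoder map and stack the encoder map with it along channels."""
    up = upsample2x(decoder_feat)
    assert up.shape[1:] == encoder_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([encoder_feat, up], axis=0)

encoder_feat = np.ones((64, 32, 32))     # toy encoder output at one resolution
decoder_feat = np.zeros((128, 16, 16))   # toy decoder input from the level below
merged = skip_concat(encoder_feat, decoder_feat)
print(merged.shape)  # (192, 32, 32)
```

The next decoder layer therefore sees the fine spatial detail of the encoder next
to the coarse, more semantic decoder features.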
One step ahead, I would like to mention LinkNet [40]. This is a very new paper
that has just become available on arXiv. The network architecture is shown in
figure 14. In nature, it is very similar to U-net, but it is designed for
efficient image segmentation. The input of each encoder layer is also bypassed to
the output of its corresponding decoder. By doing this, the model aims at
recovering lost spatial information that can be used by
Figure 14: Structure of Linknet
the decoder and its upsampling operations. Such a design can reduce the number of
parameters and be more efficient.
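One way to see where the saving can come from is to count the weights of the layer
that follows the bypass. The sketch below assumes that the bypassed features are
merged by element-wise addition, which is how I understand [40], and compares this
with a concatenation-style merge; the channel count and the helper conv_params are
illustrative only.

```python
def conv_params(in_ch, out_ch, k=3):
    """Number of weights plus biases of a k x k convolution layer."""
    return in_ch * out_ch * k * k + out_ch

C = 64  # channels of both the encoder map and the decoder map (toy value)

# Concatenation doubles the input channels of the next layer.
concat_params = conv_params(2 * C, C)
# Element-wise addition keeps the channel count unchanged.
add_params = conv_params(C, C)

print(concat_params, add_params)  # 73792 36928
```

Under these assumptions the layer after an additive bypass needs roughly half the
parameters of the layer after a concatenation.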
0.5.2 Conclusion
In this section, we went through the timeline of Encoder-Decoder based
segmentation network structures. As we can see, two trends have emerged in the
development of Encoder-Decoder based models. The first is the reuse of features to
gain more information for segmentation. The second is the reduction of parameters
to speed up the model.
Figure 15: General Framework of GAN. GAN usually consists of two networks: the Generator and the Discriminator. The Generator is trained to generate fake data samples from noise while the Discriminator network tries to discriminate fake data from the real data.
0.6 Generative Adversarial Network
Generative Adversarial Network (GAN) is a deep generative model proposed by
Goodfellow et al. [41], and later improved by DCGAN [42] and WGAN [43]. The
general framework is demonstrated in figure 15. There are two kinds of networks in
GAN: the generator and the discriminator. The generator network is trained to fool
the discriminator network, while the discriminator serves to distinguish whether
an image is generated by the generator or is ground-truth. The generator and the
discriminator are updated alternately during training.
The goal of the Generator G is to map noise $z \sim p_z$ to samples whose
distribution matches the data, while the goal of the Discriminator D is to
distinguish the real data (i.e. from the real distribution $p_{data}$) from the
fake data generated by G. The adversarial part comes from the min-max game between
G and D, which is formulated as:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \tag{7}$$
where G tries to minimize this objective against an adversarial D that tries to
maximize it.
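The objective in equation 7 can be estimated from a batch of discriminator
outputs. The following is a small sketch with made-up output values, only meant
to show how the two expectations behave; gan_value is a hypothetical helper name.

```python
import math

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of V(D, G) from discriminator outputs in (0, 1)."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well gives a high value.
confident = gan_value(d_real=[0.9, 0.8, 0.95], d_fake=[0.1, 0.2, 0.05])
# A fully fooled discriminator outputs 0.5 everywhere, giving -log(4).
fooled = gan_value(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])
print(confident > fooled)  # True
```

D pushes the value up by separating the two kinds of samples, while G pulls it
down towards -log(4) by making its samples indistinguishable from real ones.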
However, the vanilla GAN suffers from the mode collapse problem due to its loss
design. To make the training process more stable, Arjovsky et al. [43] proposed
the Wasserstein GAN, which uses the Earth Mover (EM) distance, also called
Wasserstein-1, to evaluate the distance between the real distribution and the fake
distribution. Specifically, let $P_{data}$ be the real distribution and $P_g$ the
distribution of the generated samples $G(z)$ with $z \sim p_z$. Given samples
$x \sim P_{data}$ and $y \sim P_g$, the Wasserstein-1 distance is defined as:
$$W(P_{data}, P_g) = \inf_{\gamma \in \Pi(P_{data}, P_g)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|], \tag{8}$$
where $\Pi(P_{data}, P_g)$ denotes the set of all joint distributions
$\gamma(x, y)$ whose marginals are respectively $P_{data}$ and $P_g$. The term
$\gamma(x, y)$ can be viewed as the amount of mass moved from x to y in order to
transform the distribution $P_{data}$ into the distribution $P_g$, so the
Wasserstein-1 distance is the optimal transport cost. Under this design, the loss
for the G network is:
$$L_G = -\mathbb{E}_{z \sim p_z}[D(G(z))] \tag{9}$$
And the loss for the D network is:
$$L_D = \mathbb{E}_{z \sim p_z}[D(G(z))] - \mathbb{E}_{x \sim P_{data}}[D(x)] \tag{10}$$
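For intuition, the Wasserstein-1 distance and the two losses can be evaluated on
one-dimensional toy data. The sketch below relies on two assumptions: for two
equal-sized 1-D samples the optimal transport plan matches sorted values, and the
critic is a hand-picked function D(x) = -x rather than a trained network. The
function names are hypothetical.

```python
import numpy as np

def wasserstein1_1d(x, y):
    """W1 between two equal-sized 1-D samples: mean gap between sorted values."""
    x, y = np.sort(x), np.sort(y)
    assert x.shape == y.shape
    return np.abs(x - y).mean()

def generator_loss(critic_fake):
    """Generator loss (equation 9) from critic scores of generated samples."""
    return -np.mean(critic_fake)

def critic_loss(critic_fake, critic_real):
    """Critic loss (equation 10) from critic scores of fake and real samples."""
    return np.mean(critic_fake) - np.mean(critic_real)

rng = np.random.default_rng(0)
real = rng.normal(size=1000)
fake = real + 2.0                      # same shape of distribution, shifted by 2

w1 = wasserstein1_1d(real, fake)
# With the hand-picked critic D(x) = -x, the negated critic loss equals W1 here.
loss_d = critic_loss(-fake, -real)
print(w1, -loss_d)  # both are 2.0 up to rounding
```

In this toy case minimizing the critic loss recovers the distance itself, which
is the sense in which the critic estimates how far the fake distribution is from
the real one.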
0.7 GAN Based Segmentation models
In this section, I will review the related GAN based segmentation methods. This is
a very new area with only a few papers available at the moment, and it is an area
of great potential.
Figure 16
0.7.1 GAN for segmentation
Luc et al. [39] proposed a GAN based segmentation method, which is shown in figure
16. The model contains two parts, a segmentor and a discriminator. The segmentor
is designed to generate a mask of the natural image, while the discriminator
decides whether a given mask is the ground truth mask or a generated one. The
quality of the generated mask is evaluated by how well it fools the discriminator.
The loss function is similar to that of the conditional GAN (CGAN), with some
modification to consider the relationship between the mask and the original image.
Given a data set of N training images $x_n$ with corresponding label maps $y_n$,
and with $\theta_s$ and $\theta_a$ representing the parameters of the segmentor
and the adversarial model, the loss used in this model is shown in equation 11.
$$L(\theta_s, \theta_a) = \sum_{n=1}^{N} L(s(x_n), y_n) - \lambda \left[ \mathrm{bce}(a(x_n, y_n), 1) + \mathrm{bce}(a(x_n, s(x_n)), 0) \right], \tag{11}$$
where $a(x, y) \in [0, 1]$ denotes the scalar probability with which the
adversarial model predicts that y is the ground truth label map of x, as opposed
to being a label map produced by the segmentation model s(·), and bce denotes the
binary cross-entropy.
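Equation 11 can be written out directly for toy inputs. This is a sketch under
assumptions rather than the code of [39]: the first term is taken to be a
multi-class cross-entropy summed over pixels, the adversarial outputs are passed
in as plain numbers, and all function and argument names are hypothetical.

```python
import math

def bce(p, target):
    """Binary cross-entropy of a probability p against a 0/1 target."""
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def mce(pred, label):
    """Multi-class cross-entropy summed over pixels.
    pred: per-pixel class probabilities; label: per-pixel class indices."""
    return -sum(math.log(probs[c]) for probs, c in zip(pred, label))

def hybrid_loss(seg_preds, labels, adv_on_truth, adv_on_pred, lam=1.0):
    """Sum over images of the segmentation term minus lam times the adversarial term."""
    total = 0.0
    for pred, label, a_true, a_pred in zip(seg_preds, labels, adv_on_truth, adv_on_pred):
        total += mce(pred, label) - lam * (bce(a_true, 1) + bce(a_pred, 0))
    return total

# One image with two pixels and two classes; the adversary is undecided (0.5).
seg_preds = [[[0.5, 0.5], [0.5, 0.5]]]
labels = [[0, 1]]
loss = hybrid_loss(seg_preds, labels, adv_on_truth=[0.5], adv_on_pred=[0.5])
print(loss)  # 2*log(2) - (log(2) + log(2)), i.e. 0 up to rounding
```

The adversarial model is trained to make the bracketed term small, while the
segmentor is trained to make its own mask hard to tell apart from the ground
truth.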
The motivation for their approach is that, with the help of the GAN, the model can
detect and correct higher-order inconsistencies between ground truth segmentation
maps and the ones produced by the segmentation net. The experiments show that
their adversarial approach leads to improved accuracy on the Stanford Background
and PASCAL VOC 2012 datasets.
Souly et al. [44] proposed two neural frameworks for GAN based semi-supervised
and weakly supervised learning. The difference from [39] is that the
Discriminator, instead of the Generator, is asked to generate the mask. Besides
the original K classes of the segmentation task, one extra class, the fake class,
is added for the discriminator so that it can decide whether the input is a fake
one generated by the Generator. To ensure a higher quality of the generated
images, and consequently improved pixel classification, the second framework
extends the first by adding weakly annotated data. This greatly alleviates the
shortage of labelled data.
In the area of medical imaging, only one paper related to GAN based segmentation
has been published recently: SegAN [45], whose workflow is shown in figure 17.
Since image segmentation requires dense, pixel-level labeling, the gradient
feedback of the original GAN may not be enough for training, an issue which has
also been mentioned in WGAN. To overcome this, a new loss is proposed: an
adversarial critic network with a multi-scale L1 loss function forces the critic
and the segmentor to learn both global and local features that capture long- and
short-range spatial relationships between pixels. The loss proposed in this paper
is very similar to the idea of the EM loss used by WGAN.
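The idea of comparing two masks at several scales can be illustrated as follows.
This is a loose sketch and not the loss of [45]: repeated 2 x 2 average pooling is
used as a stand-in for the learned multi-layer features of the critic, and the
function names and number of scales are assumptions.

```python
import numpy as np

def avg_pool2x(x):
    """2 x 2 average pooling of a 2-D map with even side lengths."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

def multiscale_l1(pred_mask, true_mask, n_scales=3):
    """Mean absolute difference between two masks, averaged over several scales."""
    total = 0.0
    a, b = pred_mask, true_mask
    for _ in range(n_scales):
        total += np.abs(a - b).mean()
        a, b = avg_pool2x(a), avg_pool2x(b)
    return total / n_scales

true_mask = np.zeros((8, 8))
true_mask[2:6, 2:6] = 1.0
pred_mask = np.zeros((8, 8))
pred_mask[3:7, 3:7] = 1.0   # same shape, shifted by one pixel
print(multiscale_l1(pred_mask, true_mask))
```

Fine scales penalize pixel-level errors, while coarse scales compare the overall
layout of the masks, which is the intuition behind capturing both short- and
long-range relationships.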
0.7.2 Conclusion
In this section, several papers on the topic of GAN based image segmentation are
reviewed. This is a new direction in which only 3 papers could be found. The model
structures just follow the GAN workflow without many modifications. Apparently,
there is still a lot of meat on this bone.
Figure 17: Structure of segAN
0.8 Some Challenges
Despite the success of applying CNN to lung nodule detection, challenges remain.
In this section, we discuss some challenges presented in some papers and
experienced in our practice. As CNN is a highly data-oriented method, the majority
of the challenges lie in the data part. So we discuss data related challenges in
the first two parts and other challenges in the third part.
0.8.1 Data Source
CNN, like some other big data technologies, requires a large enough dataset to
learn the classification rules. Unlike the computer vision area, where large and
clean benchmark datasets are available, only limited lung nodule datasets are
available to the public. Most people have their own datasets containing different
numbers of patients from various sources. Where to get data is a big challenge for
deep learning based detection. Here we list some public lung nodule datasets used
by recent publications as a reference.
• SPIE-AAPM-LUNGx dataset: a dataset originally used for a lung challenge to
decide whether a nodule is benign or malignant.
• LIDC-IDRI: contains 1018 cases; the largest public database, founded by the
Lung Image Database Consortium and Image Database Resource Initiative.
Lung nodule CT scans are available for download on its website.
• ELCAP Public Lung Image Database: contains 50 low-dose thin-slice chest CT
images with annotations for small nodules.
• NSCLC-Radiomics: contains 422 non-small cell lung cancer (NSCLC) patients.
0.8.2 Data Preparation
The major purposes of data preparation are to make the training data less
confusing, to make it fit CNN better and to enlarge the dataset.
To make the data less confusing, some papers perform lung segmentation to
reduce noise. Smoothing methods can then be applied to the segmented lung
parts. Also, other unnecessary parts, such as light dots or air, can be filtered
out with thresholding or other techniques.
For the purpose of modifying the data to fit CNN better, one challenge to mention
is the difference between a CT scan and an RGB image. An RGB image contains three
channels, each with values ranging from 0 to 255. A CT scan has only one channel
with values ranging over [-1000, 3000], which is a much larger range. Based on our
experiments, if one directly feeds CT scans with such a large range into a CNN,
the performance will be limited. To make a CT scan more similar to the images
originally processed by CNN in computer vision, one solution is to rescale the
data range of the CT scan to [0, 255]. This inevitably causes information loss. In
[29] and [18], the idea is raised of turning the one-channel CT scan into three
channels by separating the attenuations into three levels: low, normal and high.
Then
the three-channel image is rescaled to [0, 255]. One benefit of this method is
that the CT image is now in the same format as an RGB image, so people can use a
pretrained CNN model to enhance detection performance. Using a pretrained model
can be viewed as follows: first, we train a CNN on normal RGB images so that it
learns to extract strong features from RGB images, in other words, features that
are very good at recognizing RGB images of different classes; then we finetune the
CNN on the transformed CT images. This is like training an expert on a new task,
which is much less difficult than training a newbie from scratch.
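The three-level conversion can be sketched as below. The attenuation ranges in
WINDOWS are hypothetical values chosen only for illustration; [29] and [18] choose
their own ranges, and the function names are not taken from those papers.

```python
import numpy as np

def window_to_uint8(hu, low, high):
    """Clip attenuation values to [low, high] and rescale linearly to [0, 255]."""
    clipped = np.clip(hu, low, high)
    return np.round((clipped - low) / (high - low) * 255.0).astype(np.uint8)

# Hypothetical attenuation ranges for the low, normal and high levels.
WINDOWS = {"low": (-1000, -400), "normal": (-400, 200), "high": (200, 3000)}

def ct_to_three_channels(hu_slice):
    """Turn a one-channel CT slice into an (H, W, 3) image in RGB-like format."""
    channels = [window_to_uint8(hu_slice, lo, hi) for lo, hi in WINDOWS.values()]
    return np.stack(channels, axis=-1)

hu_slice = np.array([[-1000., -700.],
                     [-100., 3000.]])
rgb_like = ct_to_three_channels(hu_slice)
print(rgb_like.shape, rgb_like.dtype)  # (2, 2, 3) uint8
```

Because each channel covers a narrower attenuation range, less contrast is lost
than when the whole range [-1000, 3000] is squeezed into a single [0, 255]
channel.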
To enlarge the dataset to meet the CNN's need for big data, methods such as image
translation can be applied. The generated images are considered different from the
original image. Adding random noise, such as white noise, to the original image is
another way to enlarge the dataset.
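Both augmentations are simple to write down. The sketch below assumes 2-D image
arrays, zero filling for the pixels uncovered by a shift, and Gaussian white
noise; the shift sizes and noise level are arbitrary illustrative choices.

```python
import numpy as np

def translate(img, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling uncovered pixels with zeros."""
    H, W = img.shape
    assert abs(dy) < H and abs(dx) < W, "shift must be smaller than the image"
    out = np.zeros_like(img)
    src = img[max(0, -dy):H - max(0, dy), max(0, -dx):W - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

def add_white_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return img + rng.normal(scale=sigma, size=img.shape)

rng = np.random.default_rng(0)
img = np.arange(9, dtype=float).reshape(3, 3)
shifted = translate(img, dy=1, dx=-1)          # one pixel down, one pixel left
noisy = add_white_noise(img, sigma=0.1, rng=rng)
augmented = [img, shifted, noisy]              # three samples from one image
```

The label of the original image is reused for every generated copy, so the shifts
and the noise level should stay small enough that the label remains valid.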
One more issue is the imbalanced dataset. As nodule detection is a binary
classification problem (Nodule or Non-nodule), the dataset used to train a
classifier should be balanced, which means both classes have an equal number of
samples. However, in a set of CT scans, the number of slices containing nodules is
obviously much smaller than the number of slices that do not contain nodules. So
when preparing the training dataset, we need to balance it so that the numbers of
the two types of samples, with and without nodules, are equal.
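One simple way to reach equal class sizes is to undersample the larger class. The
sketch below assumes binary labels (1 for nodule, 0 for non-nodule) and uses
integer ids in place of real slices; oversampling or augmenting the smaller class
is an equally valid alternative that is not shown here.

```python
import random

def balance_by_undersampling(samples, labels, seed=0):
    """Return (sample, label) pairs with the same number of samples per class."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    n = min(len(pos), len(neg))
    pos, neg = rng.sample(pos, n), rng.sample(neg, n)
    balanced = [(s, 1) for s in pos] + [(s, 0) for s in neg]
    rng.shuffle(balanced)
    return balanced

# Toy data: 10 slices with a nodule and 90 without.
samples = list(range(100))
labels = [1] * 10 + [0] * 90
balanced = balance_by_undersampling(samples, labels)
print(len(balanced), sum(y for _, y in balanced))  # 20 10
```

Undersampling discards most of the non-nodule slices, so in practice it is often
combined with the augmentation of nodule slices described above.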
0.8.3 Some Other Challenges
Besides the data-related challenges mentioned above, there remain some other
challenges. Below we list some of them:
• HPC support: The training of a CNN based model requires a huge amount of
calculation on a huge amount of data. Only with the help of HPC can the CNN
model be trained in a reasonable length of time. Nowadays, besides training CNN
purely on CPU, CUDA accelerated GPUs are also used for training.
• High Cost: Along with the need for HPC comes another challenge, cost. HPC
support consumes a large amount of energy and requires facilities. The concept
of energy aware computing could be considered when designing an HPC system
for CNN.
• Multi-disciplinary Cooperation Required: The design of a CNN based lung
nodule detection system requires the cooperation of multiple disciplines such
as medicine, radiology and computer science. Each expert provides domain
knowledge from their own area. The knowledge from different domains
could guide how to perform data preparation, how to design the CNN model
and how to make the system user friendly to radiologists. So it is very important
for people from different areas to understand each other. How to protect
data privacy is also a big concern in the cooperation.
0.9 Conclusion
In this survey, we gave an introduction to the recent progress of using CNN for
medical image segmentation. A list of public packages as well as a list of public
datasets are given. We can see that CNN has shown great potential in the area of
lung nodule detection, from bounding box based methods to more advanced methods.
Meanwhile, challenges still remain and researchers are working on solving them. We
can see a very promising future for CNN based medical image segmentation.
Bibliography
[1] Everingham, Mark, and John Winn. "The pascal visual object classes challenge 2011 (voc2011) development kit." Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep (2011).
[2] American Cancer Society. http://www.cancer.org/cancer/lungcancer-non-smallcell/detailedguide/non-small-cell-lung-cancer-key-statistics
[3] Data Science Bowl 2017. https://www.kaggle.com/c/data-science-bowl-2017, 2017.
[4] Setio, Arnaud Arindra Adiyoso, et al. "Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge." Medical image analysis 42 (2017): 1-13.
[5] Zhe, Xiaoning, Michael L. Cher, and R. Daniel Bonfil. "Circulating tumor cells: finding the needle in the haystack." American journal of cancer research 1.6 (2011): 740.
[6] Shen, Shiwen, et al. "An automated lung segmentation approach using bidirectional chain codes to improve nodule detection accuracy." Computers in biology and medicine 57 (2015): 139-149.
[7] Duggan, Nóirín, et al. "A Technique for Lung Nodule Candidate Detection in CT Using Global Minimization Methods." Energy Minimization Methods in Computer Vision and Pattern Recognition. Springer International Publishing, 2015.
[8] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
[10] Jia, Yangqing, et al. "Caffe: Convolutional architecture for fast feature embedding." Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014.
[11] LeCun, Yann, et al. "Deep learning." Nature 521.7553 (2015): 436-444.
[12] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European conference on computer vision. Springer, Cham, 2014.
[13] Zeiler, Matthew D., et al. "Deconvolutional networks." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
[14] Nima Tajbakhsh, et al. "Comparing two classes of end-to-end machine-learning models in lung nodule detection and classification: MTANNs vs. CNNs." Pattern Recognition, in press.
[15] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks." Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011.
[16] Maas, Andrew L., Awni Y. Hannun, and Andrew Y. Ng. "Rectifier nonlinearities improve neural network acoustic models." Proc. ICML. Vol. 30. No. 1. 2013.
[17] Paszke, Adam, et al. "PyTorch." (2017).
[18] Shin, Hoo-Chang, et al. "Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning." IEEE transactions on medical imaging 35.5 (2016): 1285-1298.
[19] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.
[20] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[21] Tran, Du, et al. "Learning Spatiotemporal Features with 3D Convolutional Networks." arXiv preprint arXiv:1412.0767 (2014).
[22] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[23] Wu, Zhirong, et al. "3d shapenets: A deep representation for volumetric shapes." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[24] Smith, Steven. Digital signal processing: a practical guide for engineers and scientists. Newnes, 2013.
[25] Maturana, Daniel, and Sebastian Scherer. "Voxnet: A 3d convolutional neural network for real-time object recognition." Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015.
[26] Anirudh, Rushil, et al. "Lung nodule detection using 3D convolutional neural networks trained on weakly labeled data." SPIE Medical Imaging. International Society for Optics and Photonics, 2016.
[27] Golan, Rotem, et al. "Lung nodule detection in CT images using deep convolutional neural networks." Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE, 2016.
[28] Maaten, Laurens van der, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of Machine Learning Research 9.Nov (2008): 2579-2605.
[29] Gao, Mingchen, et al. "Holistic classification of CT attenuation patterns for interstitial lung diseases via deep convolutional neural networks." Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization (2016): 1-6.
[30] Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).
[31] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[32] Paszke, Adam, et al. "Automatic differentiation in PyTorch." (2017).
[33] Chen, Liang-Chieh, et al. "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs." arXiv preprint arXiv:1606.00915 (2016).
[34] Sermanet, Pierre, et al. "Overfeat: Integrated recognition, localization and detection using convolutional networks." arXiv preprint arXiv:1312.6229 (2013).
[35] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
[36] Yosinski, Jason, et al. "Understanding neural networks through deep visualization." arXiv preprint arXiv:1506.06579 (2015).
[37] Zhao, Hengshuang, et al. "Pyramid scene parsing network." arXiv preprint arXiv:1612.01105 (2016).
[38] Ronneberger, O., Fischer, P., & Brox, T. (2015, October). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 234-241). Springer, Cham.
[39] Luc, Pauline, et al. "Semantic segmentation using adversarial networks." arXiv preprint arXiv:1611.08408 (2016).
[40] LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation
[41] Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.
[42] Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).
[43] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein gan." arXiv preprint arXiv:1701.07875 (2017).
[44] Souly, Nasim, Concetto Spampinato, and Mubarak Shah. "Semi and Weakly Supervised Semantic Segmentation Using Generative Adversarial Network." arXiv preprint arXiv:1703.09695 (2017).
[45] Xue, Yuan, et al. "SegAN: Adversarial Network with Multi-scale L1 Loss for Medical Image Segmentation." arXiv preprint arXiv:1706.01805 (2017).
[46] Bengio, Yoshua. "Learning deep architectures for AI." Foundations and trends in Machine Learning 2.1 (2009): 1-127.
[47] Nash, Charlie, and Chris KI Williams. "The shape variational autoencoder: A deep generative model of part-segmented 3D objects." Computer Graphics Forum. Vol. 36. No. 5. 2017.
[48] Baka, Nora, Sieger Leenstra, and Theo van Walsum. "Ultrasound Aided Vertebral Level Localization for Lumbar Surgery." IEEE transactions on medical imaging 36.10 (2017): 2138-2147.
[49] Manivannan, Siyamalan, et al. "Structure Prediction for Gland Segmentation with Hand-Crafted and Deep Convolutional Features." IEEE transactions on medical imaging (2017).