DEGREE PROJECT IN MEDICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019
Evaluation of a Radiomics model for classification of Lung Nodules
PARASTU RAHGOZAR
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH
KTH Master Thesis Report
Abstract
Lung cancer is one of the major causes of cancer death worldwide. In the early
stages, lung nodules can be detected with the aid of imaging modalities such as
Computed Tomography (CT). At this stage, radiologists look for irregular, rounded
nodules in the lung, which are normally less than 3 centimeters in diameter.
Recent advances in image analysis have shown that images contain more information
than conventional parameters such as intensity, histogram and morphological
details.
Therefore, in this project we focus on extracting quantitative, hand-crafted
features from nearly 1400 lung CT images and on training a variety of classifiers
on them.
In the first experiment, a total of 424 Radiomics features per image were used
to train the following classifiers: Random Forest (RF), Support Vector Machine
(SVM), Decision Tree (DT), Naïve Bayes (NB), Linear Discriminant Analysis (LDA)
and Multi-Layer Perceptron (MLP). In the second experiment, we evaluated each
feature category separately with our classifiers. In the third experiment, wrapper
feature selection methods (Forward/Backward/Recursive) and filter-based feature
selection methods (Fisher score, Gini Index and Mutual Information) were
implemented to find the most relevant feature set for model construction.
The performance of each learning method was evaluated by accuracy score. The
highest accuracy, 78%, was achieved with the Random Forest classifier (74% on
average over 5 folds), together with an Area Under the Receiver Operating
Characteristics (AUROC) curve of 0.82. After RF, NB and MLP showed the best
average accuracies, 71.4% and 71% respectively.
Keywords
Master Thesis, Radiomics, Tumor Classification, Lung Nodule.
Sammanfattning (Summary)
Lung cancer has been a major cause of death among all types of cancer worldwide.
In the early stages, lung nodules can be detected with the help of imaging methods
such as Computed Tomography (CT). At this stage, radiologists look for irregular,
spherical nodules in the lung that are normally smaller than 3 centimeters in
diameter. Recent advances in image analysis have shown that images contain even
more information than common parameters such as intensity, histogram and
morphological details.
Therefore, this project has focused on extracting quantitative, hand-crafted
features from nearly 1400 lung CT images in order to train a number of classifiers
on them.
In the first experiment, a total of 424 Radiomics features per image were used to
train classifiers such as Random Forest (RF), Support Vector Machine (SVM),
Decision Tree (DT), Naïve Bayes (NB), Linear Discriminant Analysis (LDA) and
Multi-Layer Perceptron (MLP). In the second experiment, each feature category was
evaluated separately with our classifiers. The third experiment included wrapper
feature selection methods (forward/backward/recursive) and filter-based feature
selection methods (Fisher score, Gini Index and Mutual Information). All of these
were implemented to find the most relevant features for model construction.
The performance of each learning method was evaluated by accuracy score, where we
achieved the highest accuracy of 78% with the Random Forest classifier (74% as a
5-fold average) and an Area Under the Receiver Operating Characteristics (AUROC)
curve of 0.82. After RF, NB and MLP showed the best average accuracies, 71.4% and
71% respectively.
Nyckelord (Keywords)
Master Thesis, Radiomics, Tumor Classification, Lung Nodule
Acknowledgements
I would first like to thank my thesis supervisor, Dr. Chunliang Wang, for his
guidance, patience and supervision during this project. Special thanks to Mr. Mehdi
Astaraki for his constant advice, consideration and guidance, which kept me going in
the right direction whenever he thought I needed it.
I would also like to thank my reviewer, Prof. Örjan Smedby, for his kind comments
and attention regarding my progress. I am gratefully indebted to him for his very
valuable comments on this thesis. Thanks also to Yupei Chen, Cristina Zanin and
Didrik Nimander for their helpful comments during our supervision group meetings
throughout the thesis project.
Finally, I must express my very profound gratitude to my parents, for providing me
with unfailing support, love, confidence and continuous encouragement throughout
my years of study. To my boyfriend, who advised and supported me through the
process of researching and writing this thesis. This accomplishment would not have
been possible without them. Thank you.
Authors
Parastu Rahgozar
Master of Science, Medical Engineering
KTH Royal Institute of Technology
Place for Project
Stockholm, Sweden
KTH Flemingsberg Campus
Examiner
Örjan Smedby
KTH Royal Institute of Technology
Supervisor
Chunliang Wang
KTH Royal Institute of Technology
Contents
1 Introduction
1.1 Biology of Lung Nodules
1.2 Medical Imaging Modality
1.3 Application of Radiomics
1.4 Research Aim
2 Methodology and Methods
2.1 Dataset
2.2 Data Pre-processing
2.3 Feature Extraction
2.4 Feature Selection Methods
2.4.1 Wrapper based methods
2.4.2 Filter-based Feature Selection
2.5 Subject-Wise Cross Validation
2.6 SMOTE - Synthetic Minority Oversampling Technique
2.7 Classification Models
2.7.1 Support Vector Machine (SVM)
2.7.2 Random Forest (RF)
2.7.3 Linear Discriminant Analysis (LDA)
2.7.4 Decision Tree (DT)
2.7.5 Multi-Layer Perceptron (MLP)
2.7.6 Naïve Bayes (NB)
2.8 Evaluation of Classifiers
2.8.1 Accuracy
2.8.2 Area Under the Receiver Operating Characteristics (AUROC)
2.9 Setup of the data
2.9.1 Feature Extraction
2.9.2 Classification Models
3 Results
3.1 Experiment 1
3.2 Experiment 2
3.3 Experiment 3
4 Discussion
4.1 Feature Extraction Categories
4.2 Learning Methods
4.3 Feature selection methods
4.4 Evaluation metrics
4.5 Future Work
5 Conclusions
A Appendix - State of the Art
A.1 Introduction
A.2 Clinical definition of Lung Nodules
A.3 Background on Lung cancer diagnosis
A.4 Previous computer vision algorithms used in nodule classification
A.5 Background
A.6 Radiomics Workflow
A.6.1 Data Acquisition
A.6.2 Segmentation of Region of Interest
A.6.3 Feature Extraction
A.6.4 Feature Selection
A.7 Radiomics Classifiers (Mathematical Model)
A.8 Performance Evaluation
A.8.1 Area Under the Receiver Operating Characteristics (AUROC)
1 Introduction
Medical images are broadly used for prognosis observation, prediction of future
status and cancer detection. They play an important role in treatment selection,
since benign and malignant nodules require different treatment plans. Invasive
biopsy is a common procedure for nodule diagnosis, which involves extracting part
of the lesion and running tests on it. However, the majority of tumors are
spatially heterogeneous, so a biopsy does not reveal the full characterization of
the tumor [1].
Computed Tomography is one of the most common imaging systems used for nodule
detection and diagnosis, and its image quality allows lung nodules to be observed
precisely. Extracting features such as texture, intensity, size and shape of the
tumor can give an estimate of the probability of the tumor being benign or
malignant [2].
Radiomics, a recent advancement in the imaging field, is used to extract high-
dimensional features from images and use them in a predictive model to diagnose
and differentiate between malignant and benign tumors. Quantitative, hand-crafted
image features, also called "Radiomic Features", can provide richer information
about the intensity, shape, size or volume, and texture of the tumor phenotype that
is distinct from or complementary to that provided by clinical reports, laboratory
test results, and genomic or proteomic assays [3]. Research has shown that
Radiomics is a promising approach for obtaining information about the internal
heterogeneity of the tumor, which is related to genetic patterns. This approach
helps to personalize the treatment of patients according to their tumor
prognosis [4].
1.1 Biology of Lung Nodules
Lung cancer is the leading cause of cancer death in the world, causing as many
deaths as the next four most deadly cancers combined (breast, prostate, colon and
pancreas) according to [5].
As an initial definition proposed by [6], a lung nodule is a rounded or irregular
oval-shaped growth in the lungs, which is usually less than 3 centimeters in
diameter. An opacity smaller than 3 millimeters is considered a "micro-nodule",
while any nodule larger than 3 centimeters in diameter is considered a "mass" and
is more likely to be cancerous [7]. After detecting a nodule, the next step for
physicians is to decide whether it is suspicious enough to warrant further tests,
while avoiding unnecessary examinations [6]. One of the most useful, yet
non-invasive, methods for this is medical imaging.
1.2 Medical Imaging Modality
One of the most popular imaging modalities for this purpose is Computed
Tomography (CT), which plays an important role in both diagnosis and follow-up,
since early diagnosis of Solitary Pulmonary Nodules (SPN) can have a significant
effect on finding a safe and prompt solution. SPNs have a high probability of
becoming malignant nodules in the near future [8].
1.3 Application of Radiomics
The aim of Radiomics is to extract quantified characteristics from medical
images with the aid of automated algorithms. Numerous pipelines and methods have
been developed (hard-coded or deep learning methods), although the lack of
standardization in definitions and image processing makes it difficult to compare
different sets of results and affects their reproducibility [9]. To address this,
PyRadiomics [9] is available as an open-source platform that engineers can use
either from Python or from 3D Slicer. The platform consists of several classes
covering image pre-processing, mask production, filter application and feature
extraction.
1.4 Research Aim
This project aims at:
• Implementing the Radiomics method in order to classify lung nodules
• Comparing the performance of several predictive classifiers on the extracted
features
• Selecting the most informative Radiomics features as a smaller set on which to
train classifiers
2 Methodology and Methods
2.1 Dataset
This project has been implemented on the "Kaggle Data Science Bowl 2017" dataset¹.
The initial data set consisted of more than 1400 3D CT images, labeled as benign or
malignant. All images are in DICOM format and the size of each image is
(Z, 512, 512), Z being the number of slices per image.
2.2 Data Pre-processing
The first step in image pre-processing is to convert the 3D images from DICOM
to NIfTI-1 (Neuroimaging Informatics Technology Initiative) format, convert the
gray values to Hounsfield Units (HU) and normalize the histogram of the image. The
conversion and histogram windowing should be based on standard parameters for
lung CT images. Since radiomics features are usually calculated from a region of
interest, each nodule is then cropped separately and a binary mask is produced
based on the dimensions of the cropped nodule. The dimensions of the final image
files may therefore vary. If an image contains more than one nodule, the output is
one cropped image and one binary mask per nodule.
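The pre-processing steps above can be sketched on a toy volume as follows. This is a minimal illustration rather than the pipeline used in the project: the function names are hypothetical, the default slope and intercept are assumed values (in DICOM they are read from the RescaleSlope and RescaleIntercept attributes), and the (-1000, 500) window is the one reported in Section 2.9.

```python
def to_hounsfield(raw, slope=1.0, intercept=-1024.0):
    """Rescale stored voxel values to Hounsfield Units: HU = raw * slope + intercept."""
    return [[[v * slope + intercept for v in row] for row in sl] for sl in raw]


def apply_window(volume, low=-1000.0, high=500.0):
    """Clip every voxel to the [low, high] HU window."""
    return [[[min(max(v, low), high) for v in row] for row in sl] for sl in volume]


def bounding_box(mask):
    """Inclusive (z0, z1, y0, y1, x0, x1) box around the non-zero voxels of a mask."""
    coords = [(z, y, x)
              for z, sl in enumerate(mask)
              for y, row in enumerate(sl)
              for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask contains no nodule voxels")
    zs, ys, xs = zip(*coords)
    return min(zs), max(zs), min(ys), max(ys), min(xs), max(xs)


def crop(volume, box):
    """Crop a nested-list volume to an inclusive bounding box."""
    z0, z1, y0, y1, x0, x1 = box
    return [[row[x0:x1 + 1] for row in sl[y0:y1 + 1]] for sl in volume[z0:z1 + 1]]


# Toy volume with Z = 2 slices of 2 x 2 voxels (assumed raw scanner values).
raw = [[[0, 1024], [2024, 24]],
       [[524, 24], [24, 24]]]
mask = [[[0, 0], [1, 0]],
        [[0, 0], [1, 0]]]

hu = apply_window(to_hounsfield(raw))
box = bounding_box(mask)
nodule = crop(hu, box)        # cropped image of the nodule
nodule_mask = crop(mask, box)  # binary mask with the same dimensions
```

Cropping the image and the mask with the same box keeps the two arrays aligned, which is what a feature extractor operating on an image/mask pair expects.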
2.3 Feature Extraction
Feature extraction is the initial step and one of the most important stages in
image processing. Different types and classes of features can be extracted from
medical images to describe the characteristics of the extracted area, here the
tumor. Quantitative features give information about the shape of the lesion, the
intensity histogram or spatial parameters of the image. Texture features, in turn,
describe the distribution of the gray levels over the pixels and the spatial
arrangement of the intensities in the voxels [10].
¹ https://www.kaggle.com/c/data-science-bowl-2017/data
Overall, the quantitative features are divided into several subgroups which have been
implemented in this project:
• First Order Statistical Features: To characterize the texture of an image
area, one needs to look at the gray-level distribution over the pixels. The
first (and second) order features are defined to quantify relations underlying
the distribution of the observed image. First-order features are calculated from
individual voxel values without taking spatial relationships into account.
They are derived from the histogram of the image and include Variance,
Skewness, Kurtosis and Median [11]. In general, the first-order histogram is
defined as:
$$P(I) = \frac{\text{number of pixels with gray level } I}{\text{total number of pixels in the region}}$$
• Filters applied to statistical features:
The Laplacian of Gaussian (LoG) filter is a spatial filter that measures the
second derivative of the image. The Laplacian filter highlights regions of rapid
intensity change and is therefore used in edge detection. A smoothing Gaussian
filter is usually applied before the Laplacian filter. The combined filter takes
a gray-scale image as input and produces a gray-scale image as output. In the
equation below, the value of σ can vary [12].
$$\mathrm{LoG}(x, y) = -\frac{1}{\pi\sigma^{4}}\left[1 - \frac{x^{2} + y^{2}}{2\sigma^{2}}\right] e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}$$
The wavelet transform is a technique for evaluating the frequencies within an
image at different scales. The wavelet coefficients are derived according to
these scales and to whether the direction of the frequency changes. The analysis
is based on fast or slow variations in the gray-level values of the image: areas
with abrupt changes in gray value are assigned a high spatial frequency, while
regions with slow changes correspond to a low spatial frequency. Wavelet-derived
features are mainly aimed at texture differences in an image [13].
• Gray-Level Matrix features:
Texture features were primarily defined to analyze the surface structure in 2D
images, but their implementation can be extended to 3D images as well.
Moreover, while first-order features give information about the gray-level
properties of the image, they contain no details about the positions of the gray
levels in relation to each other, for instance whether all low gray values lie
next to each other or are interleaved with high gray values. The Grey-Level
Co-occurrence Matrix (GLCM) can quantify coarseness and smoothness, which are
parameters with high discriminatory power. This matrix expresses how
neighbouring pixels in a 3D volume are distributed in different directions [14].
It is calculated in 13 directions and the mean value is kept. Two of the
features derived from this matrix are contrast, which measures local gray-level
variation, and entropy, which measures randomness [11].
Another category of grey-level matrix features is the Grey-Level Run Length
Matrix (GLRLM), which was initially proposed in [15]. Unlike the GLCM, this
matrix analyzes the run length (the length of a sequence of pixels with the same
gray level in one direction).
The Grey-Level Size Zone Matrix (GLSZM) assesses groups (zones) of connected
voxels in the same neighbourhood. In a 3D image, a single voxel can be connected
to 26 neighbouring voxels. This group of features can be computed either from a
3D matrix or from 2D matrices averaged over slices [14]. Some of the texture
features calculated in this project are: Contrast, Correlation, Sum of Squares
(Variance), Entropy, Inverse Difference Moment and Maximal Correlation
Coefficient [16].
• Shape Features:
Geometric properties of a region of interest (cropped nodule) are described by
shape features such as compactness, sphericity and surface area.
• Clinical features:
A total of 9 features were defined and clinically determined by a radiologist.
These features include "calcification or fat contain", "attached to the
artery", etc. They were labeled as True/False and then converted to binary
values.
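To make the first-order histogram and the GLCM features above more concrete, the following sketch computes them for a small 2D image. It is a simplified illustration under stated assumptions: a single horizontal offset in 2D is used instead of the 13 averaged 3D directions described above, the function names are hypothetical, and the gray levels are assumed to be already quantized to small integers.

```python
import math
from collections import Counter


def first_order_histogram(values):
    """P(I) = (number of pixels with gray level I) / (total number of pixels)."""
    n = len(values)
    return {level: count / n for level, count in sorted(Counter(values).items())}


def glcm(image, dx=1, dy=0, levels=None):
    """Symmetric, normalized gray-level co-occurrence matrix for one 2D offset."""
    if levels is None:
        levels = max(max(row) for row in image) + 1
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                i, j = image[y][x], image[ny][nx]
                counts[i][j] += 1
                counts[j][i] += 1  # count the pair in both orders (symmetric GLCM)
    total = sum(sum(row) for row in counts)
    return [[c / total for c in row] for row in counts]


def glcm_contrast(p):
    """Local gray-level variation: sum of p(i, j) * (i - j)^2."""
    size = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(size) for j in range(size))


def glcm_entropy(p):
    """Randomness of the co-occurrence distribution (in bits)."""
    return -sum(v * math.log2(v) for row in p for v in row if v > 0)


# Toy 4 x 4 image with four gray levels.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]

hist = first_order_histogram([v for row in image for v in row])
p = glcm(image)
contrast = glcm_contrast(p)
entropy = glcm_entropy(p)
```

For this image, 12 horizontal neighbour pairs are counted in both orders, and the off-diagonal pairs give a contrast of 14/24.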
2.4 Feature Selection Methods
One of the most important parts of machine learning is the input data given to the
model. Analyzing the quality of the input data makes it possible to recognize noise
and reduce the dimension of large-scale data, so that only the truly important
and useful features are kept for implementation.
There are two main categories of feature selection methods. In order to give a
comparison, three methods from each category have been assessed.
2.4.1 Wrapper based methods
Wrapper methods were introduced by [17]. They estimate the accuracy of the
learning algorithm for candidate feature subsets. In Forward Feature Selection,
the accuracy is recalculated each time another unused feature is added to the
subset, and the feature(s) giving the best accuracy are added. This repeated
training makes wrapper methods computationally costly and poorly suited to
large-scale data [18].
Backward Feature Elimination starts from an algorithm trained on all input
features and removes one feature at each iteration, keeping the set of features
that gives the lowest error rate. This process continues until there is no
significant improvement in the error rate [19]. In Recursive Feature Elimination,
a linear regression model is trained on a subset of features each time and the
feature with the smallest ranking criterion is removed, since it has the least
effect on classification [20].
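Forward feature selection as described above can be sketched as a greedy loop. This is an illustrative sketch only: the scoring function is a hypothetical nearest-centroid classifier evaluated on the training data, whereas a real wrapper would score each subset with the chosen classifier under cross-validation.

```python
def nearest_centroid_accuracy(X, y, features):
    """Training accuracy of a nearest-centroid classifier restricted to `features`."""
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        members = [row for row, label in zip(X, y) if label == c]
        centroids[c] = [sum(row[f] for row in members) / len(members) for f in features]
    correct = 0
    for row, label in zip(X, y):
        point = [row[f] for f in features]
        predicted = min(
            classes,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        )
        correct += predicted == label
    return correct / len(y)


def forward_selection(n_features, score, k):
    """Greedily add the unused feature that gives the best score, k times."""
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: feature 0 separates the classes, features 1 and 2 do not on their own.
X = [[0.1, 5.0, 1.0],
     [0.2, 1.0, 1.2],
     [0.9, 4.0, 1.1],
     [1.0, 3.0, 1.3]]
y = [0, 0, 1, 1]

selected = forward_selection(3, lambda feats: nearest_centroid_accuracy(X, y, feats), k=2)
```

Each added feature requires re-scoring every remaining candidate, which illustrates why the cost of wrapper methods grows quickly with the number of features.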
2.4.2 Filter-based Feature Selection
Filter methods [21] follow another approach. The concept is based on choosing a
feature set regardless of the learning algorithm to be used in training. Therefore,
no bias from the learning method is imposed on the selected features.
• Gini Index: Suppose that S is a set of s samples belonging to k different
classes (C_i, i = 1,...,k). S can be divided into k subsets according to class
(S_i, i = 1,...,k), where S_i is the set of samples belonging to class C_i and
s_i is the number of samples in S_i. The Gini Index of set S is then:
$$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^{2}$$
where p_i is the probability, estimated as s_i/s, that any sample belongs to
C_i. The minimum of Gini(S) is 0, which is reached when all members of the set
belong to the same class; this denotes that the maximum useful information can
be obtained.
• Fisher Score: Consider a data space spanned by the features, in which data
points are grouped into classes based on the distance between them. The
distance between data points of the same class should be small, while
different classes should lie further away from each other. The Fisher score
favours the set of features that gives the largest distance between classes
[22].
• Mutual Information (MI) is a powerful statistical tool for evaluating the
dependency and relationship between variables. MI can detect any kind of
relevance between the given variables, whether it lies in the mean value, the
variance or other factors [23]. If X and Y are totally unrelated, then they are
independent of one another and X does not give any clue about Y. In this case,
their mutual information is zero. The basic concept of this function relies on
entropy estimation from k-nearest neighbors distances.
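Two of the filter criteria above can be written in a few lines. The sketch below implements the Gini Index exactly as defined in the formula; for the Fisher score it assumes one common per-feature formulation (between-class scatter divided by within-class scatter), which may differ in detail from the definition in [22]. The function names are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini(S) = 1 - sum_i p_i^2, with p_i estimated as s_i / s."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def fisher_score(values, labels):
    """Assumed formulation: sum_k n_k (mu_k - mu)^2 / sum_k n_k sigma_k^2 for one feature."""
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for c in set(labels):
        group = [v for v, label in zip(values, labels) if label == c]
        mean_c = sum(group) / len(group)
        var_c = sum((v - mean_c) ** 2 for v in group) / len(group)
        between += len(group) * (mean_c - overall_mean) ** 2
        within += len(group) * var_c
    return between / within if within > 0 else float("inf")


pure = gini_index(["benign"] * 4)                    # all samples in one class
mixed = gini_index([0, 0, 1, 1])                     # two equally sized classes
score = fisher_score([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

A pure set gives a Gini Index of 0, and a feature whose class means are far apart relative to the spread within each class receives a high Fisher score.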
2.5 Subject-Wise Cross Validation
The aim of k-fold cross-validation is to generate training and validation sets
from the same population, while making sure that each observation is used exactly
once for validation and is never used for training and validation in the same
fold. In this approach, the data is randomly divided into k equal subgroups, and
in each loop one subgroup is kept for validation (test) and the remaining k-1
subgroups are used as training data. In this project a single patient (same ID)
can have several analyzed nodules. Therefore, the k-fold cross validation needs
to take this into account and the process needs to be done subject-wise, which
means generating k folds of IDs rather than of nodules.
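A subject-wise split can be sketched by assigning whole patient IDs, rather than individual nodules, to folds. The function name, the toy IDs and the fixed random seed below are assumptions made for illustration.

```python
import random


def subject_wise_folds(subject_ids, k=5, seed=0):
    """Return k (train_indices, validation_indices) pairs split by subject ID."""
    unique_ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(unique_ids)
    # Every nodule inherits the fold of its patient, so no patient is split.
    fold_of = {sid: i % k for i, sid in enumerate(unique_ids)}
    folds = []
    for fold in range(k):
        validation = [i for i, sid in enumerate(subject_ids) if fold_of[sid] == fold]
        training = [i for i, sid in enumerate(subject_ids) if fold_of[sid] != fold]
        folds.append((training, validation))
    return folds


# One entry per nodule; patients p1, p3 and p6 have several nodules.
ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p6", "p6"]
folds = subject_wise_folds(ids, k=3)
```

Because folds are built from IDs, the number of nodules per fold can differ slightly when patients have different numbers of nodules.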
2.6 SMOTE - Synthetic Minority Oversampling Technique
In the majority of binary classification cases, the classes are presented as
"benign" and "malignant", or "normal" and "abnormal", and real-world cases include
more data labeled as "normal" than "abnormal" or, in binary terms, more 0s than
1s. This imbalance between the labels may result in an abnormal example being
misclassified as normal. One way to address this issue is to oversample the
minority class and thereby increase the sensitivity of the model towards it [24].
SMOTE does this by generating synthetic minority samples that are interpolated
between existing minority samples and their nearest minority-class neighbours.
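The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority-class neighbours, can be sketched as follows. This is a simplified illustration with hypothetical names and a fixed random seed, not the implementation used in the project.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority-class feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment between base and other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

In a cross-validation setting, oversampling is normally applied to the training folds only, so that synthetic samples do not leak into the validation data.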
2.7 Classification Models
All the lung nodule images have been labeled as zero (0) or one (1) (benign
or malignant) based on the diagnosis of an experienced radiologist. Therefore,
the desired classification methods only need to perform a simple binary
classification. The important point for the chosen models is the large number of
features that will be fed to the model. According to previous research on this
topic [9] [25], Machine Learning (ML) classification models such as Support
Vector Machine, Random Forest and neural networks have been implemented in this
field before. For the purpose of comparing performance, this project involves
several classifiers:
2.7.1 Support Vector Machine (SVM)
The idea of SVM is to map the training data into a higher-dimensional feature space
and then build a separating hyperplane with maximum margin between the classes,
which results in a decision boundary in the input space. The mapping can be
linear, or defined by a kernel function such as a polynomial, spline or radial
basis function (RBF) kernel [26].
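As an illustration of the kernels mentioned above, the sketch below evaluates the linear, polynomial and RBF kernels for two feature vectors. The parameter names `gamma`, `coef0` and `degree` and their default values are assumptions chosen for the example.

```python
import math


def linear_kernel(x, z):
    """K(x, z) = <x, z>."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=0.0):
    """K(x, z) = (gamma * <x, z> + coef0) ** degree."""
    return (gamma * linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


x, z = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(x, z)
k_poly = polynomial_kernel(x, z, degree=2, coef0=1.0)
k_rbf_same = rbf_kernel(x, x)
```

The kernel value plays the role of an inner product in the higher-dimensional feature space, so the mapping itself never has to be computed explicitly.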
2.7.2 Random Forest (RF)
Random Forest is an ensemble algorithm that uses perturb-and-combine techniques
to create a strong classifier from a set of weak classifiers. Each classification
tree in the ensemble is built from a sample drawn with replacement from the part
of the dataset used for training. The scikit-learn implementation of this
algorithm combines the trees by averaging their probabilistic predictions.
2.7.3 Linear Discriminant Analysis (LDA)
LDA is a generalization of Fisher's linear discriminant that is popular
in statistics and machine learning. The goal of this algorithm is to find a linear
combination of features that can be used to separate two classes. When
assigning a class to a set of observations (feature sets), LDA assumes that
the probability density functions are normally distributed. It then models the
distribution of the predictors separately for each class and uses Bayes' theorem
to estimate the probability of the observations.
2.7.4 Decision Tree (DT)
Decision trees are a non-parametric supervised learning method. They create a
predictive model of a target variable by learning decision rules extracted
from the features of a dataset. This decision process can be thought of as a series
of if-then-else statements of the kind often used in flowcharts and algorithms.
While a decision tree is easy to visualize, one of the downsides of this algorithm
is that it may lead to overfitting on large datasets [27].
2.7.5 Multi-Layer Perceptron (MLP)
The multi-layer perceptron is a simple form of neural network, which consists of an
input vector, an output vector and a set of interconnected neurons (nodes) that
are connected by weights and lead to an output signal. Each node applies a transfer
(activation) function to the weighted sum of its inputs. This function can be
linear or non-linear, depending on the implementation needed in the project. If
there is no feedback from the output side back into the network, the network is
known as a "Feed-Forward Neural Network" [28].
2.7.6 Naïve Bayes (NB)
Naïve Bayes is another supervised classifier, from the family of probabilistic
classifiers. It is based on Bayes' theorem and assumes conditional independence
between features. Using the joint probabilities of sample observations and
classes, the algorithm estimates the conditional probability of each class given
the observed features [27].
2.8 Evaluation of Classifiers
In the literature, a wide variety of evaluation metrics are used to measure the
performance of models. Two of the most common performance metrics, both used in
this project, are described below:
2.8.1 Accuracy
In general, accuracy tells us how many predictions were correct out of the total
number of test cases. It is computed as:
$$\mathrm{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP}$$
where TN, TP, FN and FP stand for True Negative, True Positive, False Negative and
False Positive respectively.
2.8.2 Area Under the Receiver Operating Characteristics (AUROC)
As a performance measure for classification, this approach evaluates the model at
different settings (thresholds). The Receiver Operating Characteristic (ROC) curve
plots the True Positive Rate (TPR) against the False Positive Rate (FPR) over
these thresholds, and shows how well the model separates and distinguishes between
the categories. ROC is a probability curve and the AUC measures separability: it
tells how capable the model is of distinguishing between classes. The higher the
AUC, the better the model is at predicting 0s as 0s and 1s as 1s or, by analogy,
at distinguishing between patients with and without disease.
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
$$\mathrm{FPR} = 1 - \mathrm{specificity} = \frac{FP}{TN + FP}$$
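The metrics of this section can be computed directly from the confusion counts. The sketch below is a minimal illustration with hypothetical function names; it assumes binary labels (0/1), that both classes are present, and it integrates the ROC curve with the trapezoidal rule.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tn + tp) / (tn + tp + fn + fp)


def roc_curve_points(y_true, scores):
    """(FPR, TPR) points obtained by thresholding the scores from high to low."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp, tn, fp, fn = confusion_counts(y_true, y_pred)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probability of class 1
acc = accuracy(y_true, [1 if s >= 0.5 else 0 for s in scores])
area = auroc(roc_curve_points(y_true, scores))
```

Accuracy depends on one chosen threshold (0.5 here), whereas the AUROC summarizes the behaviour of the classifier over all thresholds.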
2.9 Setup of the data
The complete dataset of 1398 individual nodules was used in this study. Each nodule
image was converted to Hounsfield Units and its intensities were then thresholded
to the range (-1000, 500). Each nodule has a 3D binary mask. In total, there are
1036 images with benign tumors and 362 images with malignant ones.
Figure 2.1: Sample of data: (a) Axial slice of lung CT image, (b) Cropped image of the identified nodule, (c) Binary mask of the nodule
2.9.1 Feature Extraction
Since this project aims to evaluate a variety of options in Radiomics, we calculated
features both from the original image and from images to which two different
filters (Wavelet and LoG) had been applied. First-order features were extracted
from the filtered images.
The combination of all feature classes (Shape, First order statistical, GLCM,
GLRLM, GLSZM, Clinical, LoG and Wavelet) led to a set of 424 features extracted
from each nodule. The set includes: 9 clinical, 17 shape, 19 first order, 24 GLCM,
16 GLRLM, 16 GLSZM, 171 LoG and 152 Wavelet hand-crafted features.
The final dataframe consists of 1398 nodules and 424 features (per nodule) after
the data is normalized with "Z-score" normalization.
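The feature counts and the Z-score normalization can be checked with a short sketch. The per-category counts are those listed above; the normalization function is a hypothetical helper that assumes the population standard deviation and maps constant columns to zero.

```python
import statistics

# Number of features per category, as listed in the text.
FEATURE_COUNTS = {
    "clinical": 9, "shape": 17, "first order": 19, "GLCM": 24,
    "GLRLM": 16, "GLSZM": 16, "LoG": 171, "Wavelet": 152,
}
assert sum(FEATURE_COUNTS.values()) == 424


def z_score_columns(rows):
    """Normalize each feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # assumed: population std
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy dataframe: 3 nodules x 2 features (the second feature is constant).
rows = [[1.0, 10.0],
        [2.0, 10.0],
        [3.0, 10.0]]
normalized = z_score_columns(rows)
```

After normalization every non-constant feature has zero mean and unit variance, so features with large numeric ranges do not dominate distance-based classifiers such as SVM.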
2.9.2 Classification Models
Six different classification models were used, with the settings described below:
• Support Vector Machine (SVM) (C = 5; three kernels: linear, polynomial and
radial basis function)
• Random Forest (RF) (100 trees)
• Linear Discriminant Analysis (LDA)
• Decision Tree (DT) (maximum depth = 5, minimum leaf size = 10)
• Multi-Layer Perceptron (MLP) (3 hidden layers of 30 nodes each)
• Naïve Bayes (NB) (Gaussian Naïve Bayes)
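Assuming the scikit-learn implementations were used for all six models (the text names scikit-learn explicitly only for Random Forest), the settings above could be expressed as follows. The dictionary keys are hypothetical labels, reading "C-measure" as the SVM parameter `C` and "minimum leaf" as `min_samples_leaf` is an interpretation, `probability=True` is an added assumption that allows ROC analysis, and all other hyperparameters are left at library defaults.

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

classifiers = {
    # Three SVM variants sharing C = 5 and differing only in the kernel.
    "SVM (linear)": SVC(C=5, kernel="linear", probability=True),
    "SVM (polynomial)": SVC(C=5, kernel="poly", probability=True),
    "SVM (RBF)": SVC(C=5, kernel="rbf", probability=True),
    "RF": RandomForestClassifier(n_estimators=100),
    "LDA": LinearDiscriminantAnalysis(),
    "DT": DecisionTreeClassifier(max_depth=5, min_samples_leaf=10),
    # Three hidden layers with 30 nodes each.
    "MLP": MLPClassifier(hidden_layer_sizes=(30, 30, 30)),
    "NB": GaussianNB(),
}
```

Each entry can then be fitted on the training folds with `fit(X_train, y_train)` and evaluated with `predict` or `predict_proba` on the validation fold.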
3 Results
3.1 Experiment 1
In the first experiment, the six models described above were trained with all 424
hand-crafted features (the whole data) using 5-fold cross validation. Evaluation
was done by measuring accuracy and Area Under the Receiver Operating
Characteristics (AUROC) for all models. Table 3.1 shows the accuracy results for
each classification model and Table 3.2 shows the AUROC value for each
classification model.
Based on these results, the random forest model reaches the highest average
accuracy (74%), followed by Naïve Bayes and Multi-Layer Perceptron with 71.4% and
71% respectively. The area under the ROC curve reaches a high of 0.82 in this
trial with the random forest classifier.
Table 3.1: Accuracy percentage and Standard Deviation in Experiment 1, training with 424 features
Folds   SVM5    SVM(P)  SVM(R)  RF      LDA     NB      DT      MLP
Fold1   69%     73%     73%     78%     70%     71%     66%     75%
Fold2   66%     64%     70%     74%     67%     66%     66%     70%
Fold3   72%     70%     74%     73%     68%     73%     69%     72%
Fold4   68%     72%     71%     73%     67%     73%     69%     69%
Fold5   69%     69%     66%     72%     62%     74%     70%     69%
Ave.    68.8%   69.6%   70.8%   74%     66.8%   71.4%   68%     71%
Std.    ±1.95   ±3.12   ±2.81   ±2.01   ±2.6    ±2.8    ±1.62   ±2.8
Table 3.2: Area under the ROC curve and average in Experiment 1, training with 424 features
Folds   SVM5    RF      LDA     NB      DT      MLP
Fold1   0.73    0.85    0.74    0.71    0.79    0.75
Fold2   0.71    0.76    0.68    0.71    0.67    0.66
Fold3   0.72    0.82    0.73    0.74    0.68    0.72
Fold4   0.70    0.78    0.72    0.73    0.72    0.77
Fold5   0.68    0.79    0.7     0.75    0.72    0.73
Ave.    0.708   0.80    0.714   0.728   0.716   0.726
Figure 3.1: Area under the ROC curve for different learning models ((a) SVM(5), (b) Random Forest, (c) LDA, (d) NB, (e) DT, (f) MLP)
3.2 Experiment 2
In the second experiment, the six classifiers were trained with separate feature
categories (Clinical, First order, Shape, GLCM, GLRLM, GLSZM, LoG, Wavelet) using
5-fold cross validation. The results are shown as heat-maps (Figure 3.3) for easier
comparison. The initial outcome shows that features extracted with the wavelet and
LoG filters have the strongest relation with the outcome, and that training the
model on these features leads to the most accurate results (Random Forest, 76%
average accuracy). The next best results are achieved with shape features, with
both the random forest and naïve Bayes methods reaching 73%. Figure 3.2 shows the
AUROC of the random forest classifier for each feature category.
Table 3.3: Average of accuracy percentage (over the 5 folds) in Experiment 2, training with separate feature categories
Feature Group   SVM5    SVM(P)  SVM(R)  RF      LDA     NB      DT      MLP
Clinical        58%     65%     55%     56%     53%     52%     57%     54%
FoS             67%     69%     69%     72%     68%     61%     67%     67%
Wavelet/LoG     69%     67%     70%     76%     67%     67%     66%     69%
Shape           70%     71%     70%     73%     69%     73%     68%     68%
GLCM            61%     61%     62%     65%     63%     51%     63%     64%
GLSZM           69%     69%     67%     67%     69%     69%     65%     68%
GLRLM           69%     70%     70%     68%     69%     73%     67%     68%
Figure 3.2: Area under the ROC curve (Random Forest) for separate feature categories ((a) Clinical, (b) First Order, (c) Shape, (d) Wavelet/LoG, (e) GLCM, (f) GLRLM, (g) GLSZM)
Figure 3.3: Heatmap of accuracy score based on chosen feature category and classifier (per fold)
3.3 Experiment 3
In the third experiment, the six classifiers were trained on the features returned by
the feature selection methods. The feature selection methods used were FFS, BFS, RFS,
Fisher score, Gini Index and Mutual information. Each of these methods picks the
top 25 features, and each classifier is then trained on a data frame of size 1398×25.
The results from this experiment are presented in Table 3.4. In this set, recursive
feature selection led to the best results overall across the classifiers. Random forest
is the most stable classifier, with 71-72% accuracy under every selection method, while
naïve Bayes reaches 73% with RFS, Fisher score and Gini Index.
Moreover, we inspected the features picked by each method: in every case at least
10 of the 25 features belonged to the wavelet and LoG category. The distribution of
the picked features over the categories for each method is as follows:
• FFS: (Wavelet/LoG: 13, GLSZM: 4, Shape: 7, First Order: 1)
• BFS: (Wavelet/LoG: 15, Shape: 5, GLRLM: 2, First Order: 3)
• RFS: (Wavelet/LoG: 11, GLRLM: 5, Shape: 2, First Order: 7)
• Fisher: (Wavelet/LoG: 11, GLCM: 3, GLRLM: 4, Shape: 3, First Order: 4)
• Gini In.: (Wavelet/LoG: 12, GLCM: 1, GLRLM: 6, Shape: 3, First Order: 3)
• Mutual: (Wavelet/LoG: 10, GLCM: 5, GLRLM: 2, Shape: 4, First Order: 4)
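The share of wavelet/LoG features per selection method follows directly from the counts above. The short sketch below recomputes it; the counts are copied from the list, while the dictionary layout and the helper name `category_share` are illustrative choices.

```python
# Counts copied from the list above: features per category among the top 25.
selected = {
    "FFS":      {"Wavelet/LoG": 13, "GLSZM": 4, "Shape": 7, "First Order": 1},
    "BFS":      {"Wavelet/LoG": 15, "Shape": 5, "GLRLM": 2, "First Order": 3},
    "RFS":      {"Wavelet/LoG": 11, "GLRLM": 5, "Shape": 2, "First Order": 7},
    "Fisher":   {"Wavelet/LoG": 11, "GLCM": 3, "GLRLM": 4, "Shape": 3, "First Order": 4},
    "Gini In.": {"Wavelet/LoG": 12, "GLCM": 1, "GLRLM": 6, "Shape": 3, "First Order": 3},
    "Mutual":   {"Wavelet/LoG": 10, "GLCM": 5, "GLRLM": 2, "Shape": 4, "First Order": 4},
}


def category_share(counts, category="Wavelet/LoG"):
    """Fraction of the selected features that belong to `category`."""
    return counts.get(category, 0) / sum(counts.values())


shares = {method: category_share(counts) for method, counts in selected.items()}
for method, share in shares.items():
    print(f"{method:9s} {share:.0%}")
```

The smallest share is 40% (mutual information) and the largest is 60% (BFS).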
Table 3.4: Average of accuracy percentage in Experiment 3, training with features picked by FS methods
Feature Selector  SVM5    SVM(P)  SVM(R)  RF      LDA     NB      DT      MLP
FFS               68%     69%     69%     72%     69%     68%     69%     67%
BFS               69%     68%     72%     72%     70%     71%     67%     69%
RFS               72%     72%     72%     71%     71%     73%     68%     68%
Fisher            69%     70%     69%     72%     70%     73%     69%     69%
Gini In.          69%     71%     61%     72%     70%     73%     69%     69%
Mutual Info.      65%     62%     68%     71%     67%     68%     65%     65%
Figure 3.4: Heatmap of accuracy score based on chosen feature selection method and classifier (per fold)
Figure 3.5: Area under the ROC curve (Random Forest) for separate feature selectors ((a) FFS, (b) BFS, (c) RFS, (d) Gini, (e) Mutual Info, (f) Fisher)
4 Discussion
This project mainly involves three types of comparison. First, the extracted feature
categories are compared. Then the learning methods are compared, followed by a
discussion of the feature selection methods. The section closes with a short
discussion of the performance metrics.
4.1 Feature Extraction Categories
Overall, seven feature categories are used in this work, and each is fed separately
to the classifiers in Experiment 2. The purpose of this experiment is to find which
category has the most influence on the final results. The heatmap in Figure 3.3 (a)
shows the outcome in Fold 2. The GLCM category (24 features), followed by the clinical
category, which consists of 9 visually inspected clinical parameters, performs worst
among all categories (RF 43% and 51% respectively).
On the other hand, features extracted after applying a wavelet filter or a Laplacian
of Gaussian (LoG) filter seem to be the most informative. As noted in Experiment 3,
at least 40% of the 25 features picked by each feature selection method came from the
wavelet and LoG category, which confirms the importance of these features for the
overall outcome.
Among the gray level categories (GLRLM, GLCM and GLSZM), GLCM is slightly weaker
than the other two.
4.2 Learning Methods
In this project, a variety of learning methods has been used to classify binary labels.
Overall, the Random Forest method reached slightly better results than the others in
terms of accuracy. Our data consist of a wide variety of numerical features on
different scales, and the comparison suggests that Random Forest works well with a
mixture of numerical and categorical features even if the features are on various
scales. Roughly speaking, Random Forest can use the data as they are. The algorithm
is robust and has few hyper-parameters that need tuning.
In all three experiments, NB produced results close to those of RF. This is interesting
since, theoretically, NB works best on small data sets. For high-dimensional data,
the likelihood may not follow a particular distribution unless the features are
dependent on each other, which is most probably the case in our data set.
MLP and SVM closely follow NB as the third best classifiers. Both of these classifiers
produce a hyperplane to discriminate the classes. Of course, the results may differ
slightly depending on the specific hyper-parameters used, and the number of such
parameters in MLP (number of layers, number of neurons per layer, learning rate,
activation function, etc.) makes the other classifiers easier to implement.
Generally, SVM is better at avoiding over-fitting and is less complex than MLP.
4.3 Feature selection methods
The results of Experiment 3 are the accuracies achieved by training the classifiers
on the top 25 features chosen by each feature selection method. As a general comparison,
the wrapper methods (FFS, BFS, RFS) lead to more accurate outcomes, with RFS giving the
best result (Naïve Bayes, 73%) and Mutual information giving the least desirable
outcomes (SVM, 62%).
As discussed at the beginning, wrapper methods differ from filter methods in the
criterion by which they select features. Filter-based methods use a mathematical
evaluation function, while wrapper methods use the performance of a classifier
(such as its accuracy) for the evaluation. In this project the classifier used in all
three wrapper methods is RF. Wrapper-based methods have given better feature sets
(in terms of accuracy) since they use the target classifier inside the feature
selection algorithm, but on the negative side they are computationally expensive.
A single run can take several hours, depending on the size of the data set.
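The difference between the two families can be sketched in a few lines. The example below is an illustration under stated assumptions, not the implementation used in this project: `fisher_score` follows the common between-class over within-class scatter definition of the Fisher score, `forward_select` is a generic greedy wrapper that takes any `evaluate(subset)` callback (in this project that role would be played by a cross-validated random forest), and the toy data and function names are hypothetical.

```python
from statistics import fmean, pvariance


def fisher_score(column, labels):
    """Filter criterion: between-class scatter divided by within-class scatter."""
    overall_mean = fmean(column)
    between = within = 0.0
    for cls in set(labels):
        values = [x for x, y in zip(column, labels) if y == cls]
        between += len(values) * (fmean(values) - overall_mean) ** 2
        within += len(values) * pvariance(values)
    return between / within if within > 0 else float("inf")


def filter_top_k(columns, labels, k):
    """Filter method: rank features by a statistic; no classifier is trained."""
    ranked = sorted(columns, key=lambda name: fisher_score(columns[name], labels),
                    reverse=True)
    return ranked[:k]


def forward_select(feature_names, evaluate, k):
    """Wrapper method: greedily add the feature that maximizes evaluate(subset).

    `evaluate` stands in for training and scoring a classifier on the subset,
    which is why wrapper methods are far more expensive than filters.
    """
    selected, remaining = [], list(feature_names)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda name: evaluate(selected + [name]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): one feature separates the classes, one does not.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "informative": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],
    "noisy": [1.0, 3.0, 2.0, 2.0, 1.0, 3.0],
}
top_filter = filter_top_k(columns, labels, 1)

# Toy stand-in for classifier accuracy: a fixed gain per feature.
toy_gain = {"a": 0.1, "b": 0.3, "c": 0.2}
top_wrapper = forward_select(toy_gain, lambda subset: sum(toy_gain[n] for n in subset), 2)
print(top_filter, top_wrapper)
```

With n features and k features to select, the forward wrapper calls `evaluate` on the order of n·k times, each of which would involve training a classifier, whereas the filter computes one statistic per feature.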
4.4 Evaluation metrics
The performance in the experiments has been evaluated by two metrics: accuracy and
area under the receiver operating characteristic curve (AUROC). Comparing Table 3.1
and Table 3.2, it can be seen that in most cases the AUROC is higher than the accuracy.
This can be explained by the fact that the AUROC reflects the model’s ability to score
the two classes consistently differently (i.e. class 0 generally scores higher or lower
than class 1), and does not explicitly depend on the model’s ability to assign samples
to classes. The accuracy, in contrast, is measured at one specific operating point and
reflects the model’s ability to place data into the proper category based on a specific
threshold.
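A small numerical example makes the distinction concrete. In the sketch below, the scores and labels are made-up toy values; `auroc` uses the pairwise-ranking interpretation (the fraction of positive/negative pairs in which the positive sample receives the higher score, ties counting one half), and `accuracy` thresholds the same scores.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of samples classified correctly at a fixed decision threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores (hypothetical): the classes are perfectly ranked, but every
# score lies below the default threshold of 0.5.
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]
labels = [0, 0, 0, 1, 1, 1]

print(auroc(scores, labels))            # ranking quality, threshold-free
print(accuracy(scores, labels))         # accuracy at threshold 0.5
print(accuracy(scores, labels, 0.32))   # accuracy at a better-placed threshold
```

Here the ranking is perfect, so the AUROC is 1.0, yet the accuracy at the default threshold is only 0.5 because every sample is assigned to class 0. Moving the threshold to 0.32 recovers an accuracy of 1.0.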
4.5 Future Work
In this project we aimed to analyze the power of hand-crafted Radiomics features and
of different combinations of feature categories and filters. The classifiers used in
this work are just a few examples of binary classifiers, together with a simple
implementation of a neural network (MLP). Developing a more advanced neural network,
or perhaps some type of auto-encoder, could therefore significantly affect the
outcome. Moreover, we extracted first order statistical features after applying
wavelet and LoG filters. Another trial could be to apply the filters first and then
extract the grey level features (GLCM, GLRLM and GLSZM) as well. Regarding the
optimization of the learning models, future work could also address prediction errors,
for example through the bias-variance trade-off, or reduce the dimensionality with
Principal Component Analysis (PCA).
5 Conclusions
This project is mainly a comprehensive comparison of categories of radiomics
features, dimensionality reduction methods and different types of classification
models, evaluated on CT lung nodule images. Overall, 1398 images, each described by
424 radiomic features, were normalized, cross-validated, oversampled and fed into six
different classifiers. The final outcome indicates that the Random Forest classifier
has the highest ability to learn and predict from a large data frame. We can also
conclude that radiomic features visibly influence the performance of binary
classifiers and that, as a whole feature set, they provide a better result than both
clinical features and individual groups of radiomics features. To summarize the
achievements of this project, the following conclusions can be made:
• In this implementation, Random Forest was the best-performing classifier, followed
by Naïve Bayes
• Among the feature categories, first order features extracted with Wavelet and
LoG filters form the most informative category
• Wavelet/LoG features are the group most often picked by the feature selection
methods
• Among the feature selection methods, Gini Index and RFS succeeded in selecting
an informative subset
To improve this project, several approaches can be considered, such as combining
filters and feature categories, or extracting deep features in addition to radiomic
features. Overall, this work can be an initial step towards investigating the
usability of radiomics in image classification and can be generalized to further
applications.
References
[1] C.-H. Chen, C.-K. Chang, C.-Y. Tu, W.-C. Liao, B.-R. Wu, K.-T. Chou, Y.-R.
Chiou, S.-N. Yang, G. Zhang, and T.-C. Huang, “Radiomic features analysis in
computed tomography images of lung nodule classification,” PloS one, vol. 13,
no. 2, p. e0192002, 2018.
[2] P. Lambin, E. Rios-Velazquez, R. Leijenaar, S. Carvalho, R. G. Van Stiphout,
P. Granton, C. M. Zegers, R. Gillies, R. Boellard, A. Dekker et al., “Radiomics:
extracting more information from medical images using advanced feature
analysis,” European journal of cancer, vol. 48, no. 4, pp. 441–446, 2012.
[3] H. J. Aerts, E. R. Velazquez, R. T. Leijenaar, C. Parmar, P. Grossmann,
S. Carvalho, J. Bussink, R. Monshouwer, B. Haibe-Kains, D. Rietveld et al.,
“Decoding tumour phenotype by noninvasive imaging using a quantitative
radiomics approach,” Nature communications, vol. 5, p. 4006, 2014.
[4] B. Zhao, Y. Tan, W.-Y. Tsai, J. Qi, C. Xie, L. Lu, and L. H. Schwartz,
“Reproducibility of radiomics for deciphering tumor phenotype with imaging,”
Scientific reports, vol. 6, p. 23428, 2016.
[5] P. B. Bach, J. N. Mirkin, T. K. Oliver, C. G. Azzoli, D. A. Berry, O. W.
Brawley, T. Byers, G. A. Colditz, M. K. Gould, J. R. Jett et al., “Benefits and
harms of ct screening for lung cancer: a systematic review,” Jama, vol. 307,
no. 22, pp. 2418–2429, 2012.
[6] D. M. Hansell, A. A. Bankier, H. MacMahon, T. C. McLoud, N. L. Muller, and
J. Remy, “Fleischner society: glossary of terms for thoracic imaging,” Radiology,
vol. 246, no. 3, pp. 697–722, 2008.
[7] 2019. [Online]. Available:
https://my.clevelandclinic.org/health/diseases/14799-pulmonary-nodules
[8] W. Li, P. Cao, D. Zhao, and J. Wang, “Pulmonary nodule classification
with deep convolutional neural networks on computed tomography images,”
Computational and mathematical methods in medicine, vol. 2016, 2016.
[9] J. J. Van Griethuysen, A. Fedorov, C. Parmar, A. Hosny, N. Aucoin,
V. Narayan, R. G. Beets-Tan, J.-C. Fillion-Robin, S. Pieper, and H. J. Aerts,
“Computational radiomics system to decode the radiographic phenotype,”
Cancer research, vol. 77, no. 21, pp. e104–e107, 2017.
[10] S. Rizzo, F. Botta, S. Raimondi, D. Origgi, C. Fanciullo, A. G. Morganti,
and M. Bellomi, “Radiomics: the facts and the challenges of image analysis,”
European radiology experimental, vol. 2, no. 1, p. 36, 2018.
[11] N. Aggarwal and R. Agrawal, “First and second order statistics features for
classification of magnetic resonance brain images,” Journal of Signal and
Information Processing, vol. 3, no. 02, p. 146, 2012.
[12] S. Liqin, S. Dinggang, and Q. Feihu, “Edge detection on real time using log
filter,” in Proceedings of ICSIPNN’94. International Conference on Speech,
Image Processing and Neural Networks. IEEE, 1994, pp. 37–40.
[13] G. Castellano, L. Bonilha, L. Li, and F. Cendes, “Texture analysis of medical
images,” Clinical radiology, vol. 59, no. 12, pp. 1061–1069, 2004.
[14] A. Zwanenburg, S. Leger, M. Vallieres, S. Lock et al., “Image biomarker
standardisation initiative,” arXiv preprint arXiv:1612.07003, 2016.
[15] R. W. Conners and C. A. Harlow, “Some theoretical considerations concerning
texture analysis of radiographic images,” in 1976 IEEE Conference on Decision
and Control including the 15th Symposium on Adaptive Processes. IEEE, 1976,
pp. 162–167.
[16] R. M. Haralick, K. Shanmugam et al., “Textural features for image
classification,” IEEE Transactions on systems, man, and cybernetics, no. 6,
pp. 610–621, 1973.
[17] G. H. John, R. Kohavi, and K. Pfleger, “Irrelevant features and the subset
selection problem,” in Machine Learning Proceedings 1994. Elsevier, 1994, pp.
121–129.
[18] S. Das, “Filters, wrappers and a boosting-based hybrid for feature selection,”
in Icml, vol. 1, 2001, pp. 74–81.
[19] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial
intelligence, vol. 97, no. 1-2, pp. 273–324, 1997.
[20] S. Maldonado and R. Weber, “A wrapper method for feature selection using
support vector machines,” Information Sciences, vol. 179, no. 13, pp. 2208–
2217, 2009.
[21] A. K. Jain and K. Karu, “Texture analysis: Representation and matching,” in
International Conference on Image Analysis and Processing. Springer, 1995,
pp. 2–10.
[22] Q. Gu, Z. Li, and J. Han, “Generalized fisher score for feature selection,” arXiv
preprint arXiv:1202.3725, 2012.
[23] B. C. Ross, “Mutual information between discrete and continuous data sets,”
PloS one, vol. 9, no. 2, p. e87357, 2014.
[24] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote:
synthetic minority over-sampling technique,” Journal of artificial intelligence
research, vol. 16, pp. 321–357, 2002.
[25] A. Bosch, A. Zisserman, and X. Munoz, “Image classification using random
forests and ferns,” in 2007 IEEE 11th international conference on computer
vision. Ieee, 2007, pp. 1–8.
[26] S. Sahu, M. Prasad, and B. Tripathy, “A support vector machine binary
classification and image segmentation of remote sensing data of chilika lagloon.”
[27] T. Li, C. Zhang, and M. Ogihara, “A comparative study of feature selection
and multiclass classification methods for tissue classification based on gene
expression,” Bioinformatics, vol. 20, no. 15, pp. 2429–2437, 2004.
[28] M. W. Gardner and S. Dorling, “Artificial neural networks (the multilayer
perceptron): a review of applications in the atmospheric sciences,” Atmospheric
environment, vol. 32, no. 14-15, pp. 2627–2636, 1998.
[29] F. Bray, J. Ferlay, I. Soerjomataram, R. L. Siegel, L. A. Torre, and A. Jemal,
“Global cancer statistics 2018: Globocan estimates of incidence and mortality
worldwide for 36 cancers in 185 countries,” CA: a cancer journal for clinicians,
vol. 68, no. 6, pp. 394–424, 2018.
[30] D. Kumar, A. Wong, and D. A. Clausi, “Lung nodule classification using deep
features in ct images,” in 2015 12th Conference on Computer and Robot Vision.
IEEE, 2015, pp. 133–138.
[31] T. P. Coroller, P. Grossmann, Y. Hou, E. R. Velazquez, R. T. Leijenaar,
G. Hermann, P. Lambin, B. Haibe-Kains, R. H. Mak, and H. J. Aerts, “Ct-
based radiomic signature predicts distant metastasis in lung adenocarcinoma,”
Radiotherapy and Oncology, vol. 114, no. 3, pp. 345–350, 2015.
[32] H. MacMahon, J. H. Austin, G. Gamsu, C. J. Herold, J. R. Jett, D. P.
Naidich, E. F. Patz Jr, and S. J. Swensen, “Guidelines for management of
small pulmonary nodules detected on ct scans: a statement from the fleischner
society,” Radiology, vol. 237, no. 2, pp. 395–400, 2005.
[33] M. P. Rivera and A. C. Mehta, “Initial diagnosis of lung cancer: Accp evidence-
based clinical practice guidelines,” Chest, vol. 132, no. 3, pp. 131S–148S, 2007.
[34] H. Lee and Y.-P. P. Chen, “Image based computer aided diagnosis
system for cancer detection,” Expert Systems with Applications,
vol. 42, no. 12, pp. 5356 – 5365, 2015. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0957417415000986
[35] K.-L. Hua, C.-H. Hsu, S. C. Hidayati, W.-H. Cheng, and Y.-J. Chen,
“Computer-aided classification of lung nodules on computed tomography
images via deep learning technique,” OncoTargets and therapy, vol. 8, 2015.
[36] A. A. Abdullah and S. M. Shaharum, “Lung cancer cell classification method
using artificial neural network,” information engineering letters, vol. 2, no. 1,
2012.
[37] C. Parmar, P. Grossmann, J. Bussink, P. Lambin, and H. J. Aerts, “Machine
learning methods for quantitative radiomic biomarkers,” Scientific reports,
vol. 5, p. 13087, 2015.
[38] J.-G. Lee, S. Jun, Y.-W. Cho, H. Lee, G. B. Kim, J. B. Seo, and N. Kim, “Deep
learning in medical imaging: general overview,” Korean journal of radiology,
vol. 18, no. 4, pp. 570–584, 2017.
[39] M. Mohri, A. Rostamizadeh, and A. Talwalkar, “Foundations of machine
learning. ch. 1, 1–3,” 2012.
[40] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep
belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[41] M. Avanzo, J. Stancanello, and I. E. Naqa, “Beyond imaging: The promise of
radiomics,” Physica Medica, vol. 38, pp. 122 – 139, 2017. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S1120179717301874
[42] R. Wilson and A. Devaraj, “Radiomics of pulmonary nodules and lung cancer,”
Translational lung cancer research, vol. 6, no. 1, p. 86, 2017.
[43] R. J. Gillies, P. E. Kinahan, and H. Hricak, “Radiomics: images are more than
pictures, they are data,” Radiology, vol. 278, no. 2, pp. 563–577, 2015.
[44] Y. Huang, Z. Liu, L. He, X. Chen, D. Pan, Z. Ma, C. Liang, J. Tian, and
C. Liang, “Radiomics signature: a potential biomarker for the prediction of
disease-free survival in early-stage (i or ii) non-small cell lung cancer,” Radiology,
vol. 281, no. 3, pp. 947–957, 2016.
[45] P. Lambin, R. T. Leijenaar, T. M. Deist, J. Peerlings, E. E. De Jong,
J. Van Timmeren, S. Sanduleanu, R. T. Larue, A. J. Even, A. Jochems et al.,
“Radiomics: the bridge between medical imaging and personalized medicine,”
Nature Reviews Clinical Oncology, vol. 14, no. 12, p. 749, 2017.
[46] V. Kumar, Y. Gu, S. Basu, A. Berglund, S. A. Eschrich, M. B. Schabath,
K. Forster, H. J. Aerts, A. Dekker, D. Fenstermacher et al., “Radiomics: the
process and the challenges,” Magnetic resonance imaging, vol. 30, no. 9, pp.
1234–1248, 2012.
[47] R. Thawani, M. McLane, N. Beig, S. Ghose, P. Prasanna, V. Velcheti, and
A. Madabhushi, “Radiomics and radiogenomics in lung cancer: a review for
the clinician,” Lung Cancer, vol. 115, pp. 34–41, 2018.
[48] G. Lee, H. Y. Lee, H. Park, M. L. Schiebler, E. J. van Beek, Y. Ohno, J. B.
Seo, and A. Leung, “Radiomics and its emerging role in lung cancer research,
imaging biomarkers and clinical management: state of the art,” European
journal of radiology, vol. 86, pp. 297–307, 2017.
[49] F. Han, H. Wang, G. Zhang, H. Han, B. Song, L. Li, W. Moore, H. Lu,
H. Zhao, and Z. Liang, “Texture feature analysis for computer-aided diagnosis
on pulmonary nodules,” Journal of digital imaging, vol. 28, no. 1, pp. 99–115,
2015.
[50] S. Hawkins, H. Wang, Y. Liu, A. Garcia, O. Stringfield, H. Krewer, Q. Li,
D. Cherezov, R. A. Gatenby, Y. Balagurunathan et al., “Predicting malignant
nodules from screening ct scans,” Journal of Thoracic Oncology, vol. 11, no. 12,
pp. 2120–2128, 2016.
[51] R. T. Larue, G. Defraene, D. De Ruysscher, P. Lambin, and W. Van Elmpt,
“Quantitative radiomics studies for tissue characterization: a review of
technology and methodological procedures,” The British journal of radiology,
vol. 90, no. 1070, p. 20160665, 2017.
[52] B. Zhang, X. He, F. Ouyang, D. Gu, Y. Dong, L. Zhang, X. Mo, W. Huang,
J. Tian, and S. Zhang, “Radiomic machine-learning classifiers for prognostic
biomarkers of advanced nasopharyngeal carcinoma,” Cancer letters, vol. 403,
pp. 21–27, 2017.
[53] C. Parmar, P. Grossmann, D. Rietveld, M. M. Rietbergen, P. Lambin, and H. J.
Aerts, “Radiomic machine-learning classifiers for prognostic biomarkers of head
and neck cancer,” Frontiers in oncology, vol. 5, p. 272, 2015.
[54] P. Yin, N. Mao, C. Zhao, J. Wu, C. Sun, L. Chen, and N. Hong, “Comparison of
radiomics machine-learning classifiers and feature selection for differentiation of
sacral chordoma and sacral giant cell tumour based on 3d computed tomography
features,” European radiology, pp. 1–7, 2018.
[55] C. Parmar, R. T. Leijenaar, P. Grossmann, E. R. Velazquez, J. Bussink,
D. Rietveld, M. M. Rietbergen, B. Haibe-Kains, P. Lambin, and H. J. Aerts,
“Radiomic feature clusters and prognostic signatures specific for lung and head
& neck cancer,” Scientific reports, vol. 5, p. 11044, 2015.
Appendices
A Appendix - State of the Art
A.1 Introduction
The following article reviews state-of-the-art advances in deep learning methods
used for the classification of lung nodules.
Medical imaging systems are broadly useful for gaining better insight into internal
organs and their ambiguities. They also provide information for prognosis, prediction
of future progress and cancer detection. They play an important role in treatment
selection, since the treatment plans for a benign and a malignant nodule differ.
While there is a variety of imaging modalities for different purposes (imaging
hard tissue, soft tissue, blood flow, etc.), Computed Tomography (CT) has shown
outstanding capabilities in cancer detection and tumor characterization. Recent
advancements in computer vision and image analysis have made it possible to extract
information from 2-dimensional or 3-dimensional images.
Extracting features such as the texture, intensity, size and shape of a tumor can
give an estimate of the probability of the tumor being benign or malignant [2].
Radiomics, a recent advancement in the imaging field, is used to obtain
high-dimensional features from images and use them in a predictive model for
diagnosis and for differentiating between malignant and benign tumors. Quantitative
image features, also called “radiomic features”, could provide richer information
about the intensity, shape, size or volume, and texture of the tumor phenotype that is
distinct from or complementary to that provided by clinical reports, laboratory test
results, and genomic or proteomic assays [3]. Research has shown that Radiomics can
be a promising approach for obtaining information about the internal heterogeneity of
the tumor, which is related to genetic patterns. This approach helps to personalize
the treatment of patients according to their tumor prognosis [4].
The content is organized as follows. First, a general overview is given of the
importance of diagnosing lung tumours, the needs in this field and the most common
solutions being offered. Second, recent advancements in computer aided diagnosis or
detection (CAD) using machine learning algorithms are presented, together with their
combination with genomics studies, which has resulted in the new era of “Radiomics”.
This is followed by a general review of Radiomics: its basics, its structure and
recent advancements.
A.2 Clinical definition of Lung Nodules
A pulmonary (lung) nodule is defined as a rounded or oval-shaped growth in the lungs,
and can also be referred to as a lesion. Nodules in the lung usually have a diameter
of less than three centimeters. Growths larger than three centimeters are referred to
as a “mass” and are more likely to indicate cancer. Most nodules are noncancerous
(benign). Risk factors for malignant pulmonary nodules include a history of smoking
and older age [7].
A.3 Background on Lung cancer diagnosis
According to the recent status report of the American Cancer Society [29], lung cancer
has been the leading cause of death among both male and female patients and was overall
the most commonly diagnosed type of cancer in 2018. Although several factors can be
involved in the diagnosis, the mortality rate of this type of cancer has been reported
as over 17% [30]. Lung cancer has two major types: non-small cell lung cancer
(non-SCLC) and small cell lung cancer (SCLC), non-SCLC being the most frequently
diagnosed type with a share of 85-90% of all lung cancer cases [31].
Moreover, because lung cancer is difficult to diagnose at early stages (small lesions
are often undetectable by X-rays), by the time the tumor is detected the patient is
already at an advanced stage and little can be done. Meanwhile, these nodules are more
easily detected by Computed Tomography (CT) scans of the lungs, and this technology
improves with every new generation of medical technology [32]. Therefore, it is
important to emphasize the substantial effort radiologists contribute to identifying
and labeling nodules as benign or malignant [30].
Over the past five years it has been shown that a chest CT scan can reveal valuable
information about the morphological and biological characteristics of lung nodules
[32]. According to a guideline by the American College of Chest Physicians (CHEST)
[33], several actions may be taken to investigate the malignancy of a lung nodule.
Depending on the type of lung cancer (SCLC or non-SCLC) and the size and location of
the tumor, several techniques can be considered. While biopsy has been mentioned as a
promising yet invasive diagnostic solution for other types of cancer, scientists have
come to a different conclusion in the case of lung cancer. The main purpose of choosing
a suitable modality is to use it for both diagnosis and prognosis of the tumor while
avoiding invasive tests as far as possible [33]. Therefore, the wide range of
applications of imaging modalities, with an emphasis on CT (capable of quantifying the
tumor’s intensity [31]), is being investigated in clinical oncology, from diagnosis to
personalizing the treatment of patients [2].
A.4 Previous computer vision algorithms used in nodule classification
As previously mentioned, labeling each nodule according to its malignancy can be
time-consuming and overwhelming for radiologists, considering the large number of cases
each of them handles. Computer vision algorithms can therefore play an influential
role in processing the data and supporting a quick decision [30]. These algorithms are
called Computer Aided Diagnosis (CAD) systems and are categorized into detection (CADe)
and diagnosis (CADx). The focus of this project is on CADx algorithms that distinguish
between benign and malignant nodules.
In this technology, after the medical images are acquired, they undergo several
pre-processing stages (software-based algorithms) that segment the suspected cancerous
area from the background (using thresholds, level sets, ...). Basic features such as
shape, size and texture are then calculated and extracted from the segmented region
[34]. This is followed by classification, and the results are evaluated by an accuracy
percentage. Usually the result of each step in this process is highly dependent on the
previous one and on the initial features extracted from the images [35].
The current focus in the CAD and deep learning field is to automate the process of
feature extraction from the images and to build an uncomplicated pipeline that handles
the processing and pattern recognition steps [35].
In previous studies, feed-forward neural networks with a small number of features
have been used to classify nodules in X-ray images [36]. Moreover, binary decision
tree classifiers (0 and 1) have been applied to classify more than 2000 nodules from
157 patients (data provided by the National Cancer Institute Lung Image Database,
NCI-LIDC) to investigate the performance of such systems, reaching an accuracy of
75.01% [30].
A.5 Background
To introduce the science and basics of Radiomics, the definition and function of
“machine learning” first need to be discussed. “Machine learning” refers to
computational models and algorithms that use data to improve their own performance or
increase the accuracy of their predictions [37]. A model is trained with a batch of
labeled data called the “training set”, from which it learns patterns and correlations
in the data, and is then evaluated on unseen data known as the “test set”. Learning
algorithms are generally categorized as “supervised” or “unsupervised” [38]. On the
subject of tumor classification, studies have been done on deep learning, a special
type of machine learning that resembles the human cognition system and has also gained
attention in healthcare big data. According to the overview in [38], the Naïve Bayes
(NB) model has been used as a typical classification algorithm because of its
considerably good performance, and the Support Vector Machine (SVM) has been the most
popular one. The Artificial Neural Network (ANN), one of the well-known networks for
classification and regression, has shown great performance in different areas despite
some drawbacks in optimization and over-fitting [39]. One study focuses on the
unsupervised restricted Boltzmann machine as a deeper architecture than ANNs and
reports an improvement in both over-fitting and optimization [40].
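The training-set and test-set procedure described above can be illustrated with a deliberately simple supervised model. The sketch below uses a nearest-centroid classifier as a stand-in (it is not one of the classifiers discussed in the cited works), and the two-feature toy data and function names are hypothetical.

```python
from statistics import fmean


def fit_centroids(features, labels):
    """Training: learn one mean feature vector (centroid) per class."""
    centroids = {}
    for cls in set(labels):
        rows = [x for x, y in zip(features, labels) if y == cls]
        centroids[cls] = [fmean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, row):
    """Prediction: pick the class whose centroid is nearest (squared Euclidean)."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda cls: squared_distance(centroids[cls]))


# Labeled training set: the model only ever sees these samples while learning.
train_x = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
train_y = [0, 0, 1, 1]

# Unseen test set: used once, after training, to estimate performance.
test_x = [[0.9, 1.1], [3.1, 3.0]]
test_y = [0, 1]

model = fit_centroids(train_x, train_y)
correct = sum(predict(model, row) == y for row, y in zip(test_x, test_y))
test_accuracy = correct / len(test_y)
print(test_accuracy)
```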
With deep learning and personalized treatment gaining popularity over the past decade,
studies have focused on the correlations between extracted features and on possible
relations between these features and certain drug responses [41]. Since 2010, research
on this topic has been formalized under the term “Radiomics”, which consists of two
parts: “Radio”, as in radiology, referring to the image acquisition method, and
“omics”, which comes from the term “genomics”, the science of studying human genes.
In this concept, radiomics is used to study the quantitative features derived from
images in order to predict the future status of a suspected nodule [42], [43].
From a technical point of view, radiomics is the conversion of medical images into
high-dimensional data, which can then be studied and processed to extract information.
According to [41], there are two main aspects to radiomics. The first is the number
of extracted features used in the model, which is considerably larger (hundreds to
thousands) than in conventional CAD algorithms. The second is the scope of
investigation. The features extracted in this approach can be investigated for
diagnostic and prognostic clinical applications, which means that the final treatment
of the patient can be personalized based on their individual status [43], and richer
information is obtained about the shape, size or texture of the detected tumor [41].
Comparing radiomics with conventional CAD systems, it can be concluded that the
previous approaches focused on a binary answer (whether there is a lesion or
not), while radiomics broadens this aspect by delivering an output that can not
only be used by radiologists as decision support, but can also be combined with
specific characteristics of each patient. A full set of features assigned to a
certain patient and holding prognostic details is called a “radiomic signature”
[43].
For instance, heterogeneity observed in an image can be connected to genomic
heterogeneity, which may indicate a worsening prognosis, since heterogeneity in
tumors can make them more resistant to treatment [41], [44].
A.6 Radiomics Workflow
A.6.1 Data Acquisition
Among all the medical imaging modalities, CT has shown promising results in the
assessment of textural features of tumors, and it is the modality most frequently
suggested by radiologists, especially in lung cancer cases [41].
A.6.2 Segmentation of Region of Interest
The second step is to segment the Region of Interest (ROI) and choose a
prediction target (e.g. malignancy). This step is critical, since the features
are generated from the segmented volume and tumors often have a vague boundary.
For this purpose, radiomics analysis should be done on a sub-region such as a
metastatic lesion or cancerous tumor [45]. Segmentation of the tumor is mostly
done by an in-house algorithm and can be semi-automated. There are also cases in
which radiologists do the segmentation manually [4].
Several algorithms can be used for segmentation. The most popular ones mentioned
by [46] are region growing methods (which are fast but sensitive to image noise),
the level set method, active contours, etc. However, no single algorithm works
efficiently for all image modalities. To calculate the features in the next step,
a segmentation mask must be defined, consisting of the voxels located within the
ROI. The segmented region should include two masks: an intensity mask and a
morphological mask [14].
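As an illustration of the region growing idea mentioned above, the following is
a minimal sketch and not the segmentation method used in the cited works. It
assumes a 2D slice stored as a nested list, a manually chosen seed pixel and a
fixed intensity tolerance; the names region_grow, toy_slice and tolerance are
hypothetical.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Return a binary mask of the pixels connected to `seed` whose intensity
    differs from the seed intensity by at most `tolerance`."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    mask = [[0] * cols for _ in range(rows)]
    mask[seed[0]][seed[1]] = 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the 4-connected neighbours of the current pixel.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    mask[nr][nc] = 1
                    queue.append((nr, nc))
    return mask


# Toy slice (made-up values): a bright blob on a dark background, plus one
# isolated bright pixel in the corner that is not connected to the blob.
toy_slice = [
    [10, 12, 11, 10, 9],
    [11, 98, 101, 12, 10],
    [10, 102, 100, 99, 11],
    [12, 11, 97, 10, 10],
    [10, 10, 11, 12, 90],
]
roi_mask = region_grow(toy_slice, seed=(2, 2), tolerance=5)
roi_size = sum(map(sum, roi_mask))  # number of pixels inside the ROI
```

Because the acceptance test compares each pixel with the seed intensity, a single
noisy pixel can block or leak the growth, which is the noise sensitivity noted in
the text.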
A.6.3 Feature Extraction
After segmentation, sets of quantitative features are extracted from each tumor,
describing different properties such as nodule volume, nodule shape and intensity
patterns. Features are divided into two main categories: “agnostic” and
“semantic”. Semantic features are visualized and defined by radiologists to
describe the ROI and can be used to build a classifier, while agnostic features
focus on heterogeneity and quantitative properties of the lesion [34]. Some
examples of semantic features are size, shape, location, vascularity, necrosis,
etc. Apart from semantic features, the other sets of features are divided into a
couple of subgroups as follows [47], [46]:
The shape category includes the basic features of the tumor such as shape, volume
and derivative measurements (area-to-volume ratio, first-order histograms,
compactness, etc.), and these are calculated in 3D rather than per slice. One of
the studies done in this field shows that shape features can be used to
distinguish malignant from benign tumors by analysing the surface-area-to-volume
ratio [48].
• Surface Area
• Surface to Volume Ratio
• Volume
• Sphericity
• Area Density
• ...
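The first few shape features in the list above can be sketched directly from a
binary segmentation mask. This is only an illustrative approximation under
stated assumptions: the mask is a nested list indexed as mask[z][y][x], the
surface area is estimated by counting exposed voxel faces (which overestimates
the area of curved surfaces compared to mesh-based estimates), and sphericity
uses the common definition pi^(1/3) (6V)^(2/3) / A. The function name
shape_features is hypothetical.

```python
import math


def shape_features(mask, spacing=(1.0, 1.0, 1.0)):
    """Compute basic 3D shape features from a binary voxel mask.

    mask    : nested list indexed as mask[z][y][x], truthy inside the ROI
    spacing : voxel size (dz, dy, dx), e.g. in millimetres (assumed layout)
    """
    dz, dy, dx = spacing
    nz, ny, nx = len(mask), len(mask[0]), len(mask[0][0])
    # Area of a voxel face perpendicular to the z, y and x axis respectively.
    face_area = (dy * dx, dz * dx, dz * dy)
    steps = ((1, 0, 0), (0, 1, 0), (0, 0, 1))

    def inside(z, y, x):
        in_bounds = 0 <= z < nz and 0 <= y < ny and 0 <= x < nx
        return in_bounds and bool(mask[z][y][x])

    n_voxels, surface_area = 0, 0.0
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if not mask[z][y][x]:
                    continue
                n_voxels += 1
                for axis, (sz, sy, sx) in enumerate(steps):
                    for sign in (-1, 1):
                        # A face is exposed when the neighbour across it
                        # lies outside the ROI (or outside the volume).
                        if not inside(z + sign * sz, y + sign * sy, x + sign * sx):
                            surface_area += face_area[axis]
    if n_voxels == 0:
        raise ValueError("mask contains no ROI voxels")

    volume = n_voxels * dz * dy * dx
    return {
        "volume": volume,
        "surface_area": surface_area,
        "surface_to_volume_ratio": surface_area / volume,
        "sphericity": math.pi ** (1 / 3) * (6 * volume) ** (2 / 3) / surface_area,
    }


# Toy ROI: a solid 2 x 2 x 2 cube of unit voxels.
cube = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
cube_features = shape_features(cube)
```

For a cube, the sphericity equals (pi/6)^(1/3), roughly 0.81, whereas a perfect
sphere would give 1.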
This set of features was initially introduced by [16] and describes the
distribution of intensity levels across voxels. One of the most common parameters
used in this category is the Local Binary Pattern (LBP), which turns an image
into an array of binary pixels; the dimension of the matrix is then reduced based
on the relative binary value of neighboring pixels, and the texture variety is
described by the histogram of the whole image [49].
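The LBP idea can be sketched as follows. This is a basic 8-neighbour variant
written for illustration, not necessarily the exact formulation in [49]: each
interior pixel is compared with its eight neighbours, the comparison results are
packed into an 8-bit code, and the texture is summarized by the histogram of
codes. The function names and the choice of bit ordering are assumptions.

```python
def lbp_codes(image):
    """Return the 8-bit LBP code of every interior pixel of a 2D image."""
    # Neighbour offsets, clockwise starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright
                # as the central pixel.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def lbp_histogram(codes):
    """Count how often each of the 256 possible LBP codes occurs."""
    histogram = [0] * 256
    for row in codes:
        for code in row:
            histogram[code] += 1
    return histogram


# Toy 4 x 4 patch with made-up intensities; it has 2 x 2 interior pixels.
toy_patch = [
    [5, 5, 5, 5],
    [5, 9, 5, 5],
    [5, 5, 5, 5],
    [5, 5, 5, 5],
]
toy_histogram = lbp_histogram(lbp_codes(toy_patch))
```

The histogram, rather than the code image itself, is what would be passed on as
a texture feature vector.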
A more sophisticated method on the same basis is the application of a grey-level
scale to the pixels. While there are assumptions regarding the role of these
features in the prognosis of cancer, Hawkins et al. [50] generated a model with
23 features to predict the status of a nodule in the upcoming 1-2 years and
reached an accuracy of 80%. Furthermore, software packages for automatic
radiomics feature extraction have recently been released, such as “Imaging
Biomarker Explorer (IBEX)” [41].
A.6.4 Feature Selection
This step depends on the number of selected feature categories and parameters:
the total number of extractable features is virtually unlimited, some of the
features are correlated, and taking all the features into account results in
over-fitting and reduced model performance. To avoid over-fitting and keep the
model robust in the presence of such large variability, the most suitable
features must be picked with the aid of dimensionality reduction or feature
selection techniques [45].
According to a review by [51], filter-based selection techniques are often used
to pick the most informative features by emphasizing a proper classification
method. Aerts et al. [3] used a total of 647 images of lung tumors (training and
testing) from 1019 patients and extracted 440 features in total, covering
first-order, shape and texture features, to show that choosing features based on
stability and reproducibility can also result in picking the most informative
features.
In contrast, [41] suggests that the smoothest method would be to use a scoring
system in which features are graded based on stability or correlation, and the
worst-ranked ones are then omitted. The drawback of such an algorithm is that the
dependency between features sometimes actually increases the performance and
accuracy of the model.
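A correlation-based filter of the kind described above can be sketched as
follows. This is one simple possibility and not the scoring system of [41]: it
assumes that features arrive in order of preference and drops any feature whose
absolute Pearson correlation with an already kept feature exceeds a threshold.
The threshold of 0.9, the function names and the toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def drop_correlated(features, threshold=0.9):
    """Keep features in insertion order, skipping any feature that is
    strongly correlated with a feature that has already been kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy feature table (made-up values), one list of values per feature.
toy_features = {
    "volume": [1.0, 2.0, 3.0, 4.0, 5.0],
    "surface_area": [2.1, 4.0, 6.2, 7.9, 10.1],  # nearly proportional to volume
    "entropy": [3.0, 1.0, 4.0, 1.0, 5.0],
}
kept_features = drop_correlated(toy_features)
```

As the text notes, such a filter looks at pairs of features in isolation, so it
can discard a feature that would have helped the model in combination with
others.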
In [52], six feature selection methods have been analyzed based on their
statistical approaches: Logistic Regression, Support Vector Machine (SVM), Random
Forest (RF), Distance Correlation (DC), Elastic Net Logistic Regression (EN-LOG)
and Sure Independence Screening (SIS). One of the most frequently discussed
methods is Minimum Redundancy Maximum Relevance (mRMR) [31] [37] [53], which
calculates the mutual information (MI) within a specific feature set. mRMR then
ranks the features in decreasing order of relevance while also minimizing the
average MI, that is the redundancy, within each selected set.
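The greedy mRMR selection can be sketched as below. This is a simplified
illustration under several assumptions that are not taken from the cited works:
features are already discretized, MI is estimated from raw counts, and each
candidate is scored as relevance minus average redundancy (the difference form
of the criterion). All names and toy values are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) of two discrete sequences."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log((c / n) / ((count_x[a] / n) * (count_y[b] / n)))
        for (a, b), c in count_xy.items()
    )


def mrmr(features, target, k):
    """Greedily select k feature names by maximum relevance to the target
    and minimum average redundancy with the features selected so far."""
    selected = []
    candidates = list(features)

    def score(name):
        relevance = mutual_information(features[name], target)
        if not selected:
            return relevance
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance - redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (made up): binary target and three binarized features.
toy_target = [0, 0, 0, 0, 1, 1, 1, 1]
toy_binned = {
    "volume_bin": [0, 0, 0, 1, 1, 1, 1, 1],
    "volume_bin_copy": [0, 0, 0, 1, 1, 1, 1, 1],  # fully redundant duplicate
    "texture_bin": [0, 1, 0, 0, 0, 1, 1, 1],
}
selected_features = mrmr(toy_binned, toy_target, k=2)
```

In this toy case the duplicate feature is as relevant as the original, but its
redundancy penalty lets the weaker, less redundant texture feature be chosen
second.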
A.7 Radiomics Classifiers (Mathematical Model)
The next step is to define a model and build a classifier based on the radiomic
features. Generally, ML classifiers are divided into two main groups: supervised
and unsupervised classifiers. Supervised classifiers are given two sets of data:
a training set, which consists of examples as input vectors, and the known
outcomes. This method may suffer from over-fitting, which means that the model
decides mostly based on the noise in the images rather than the underlying data.
Moreover, some of the most utilized supervised classifiers are Logistic
Regression Kernel Support (the simplest one), Random Forest (RF), Support Vector
Machine (SVM), Artificial Neural Network (ANN), K-Nearest Neighbor (KNN) and
Naïve Bayes (NB). These methods were compared in a couple of studies [41] [52],
in which performance was evaluated by test error and Area Under the Curve (AUC).
The highest AUC and lowest test error were obtained with the RF feature selector
followed by the RF classifier (AUC 0.8464), with RF + AdaBoost in second place
(AUC 0.8204). The classifiers are often trained with 10-fold cross-validation on
the training set [54].
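The 10-fold cross-validation scheme mentioned above can be sketched with the
standard library alone. The generator below is a minimal illustration: it
shuffles the sample indices once and yields, for each fold, the indices used for
training and the indices held out for validation. The function name, the fixed
seed and the sample count are hypothetical, and stratification by class is
deliberately left out.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Example with 25 samples: every sample is held out exactly once.
splits = list(k_fold_indices(25, k=10))
held_out = sorted(idx for _, validation in splits for idx in validation)
```

A classifier would be fitted on each training part and scored on the matching
validation part, and the k scores would then be averaged.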
Unsupervised classifiers take an extra a priori variable (e.g. survival rate)
instead of divided sets of data. This method aims to find a type of cancer among
a database of patients. For instance, consensus clustering is used in [55] to
form fewer clusters from high-dimensional data. In that work, a space of 440
features was reduced to 13 non-redundant features.
A.8 Performance Evaluation
A.8.1 Area Under the Receiver Operating Characteristics (AUROC)
As a performance measurement for classification, this approach can be applied at
different settings (thresholds). The Receiver Operating Characteristic (ROC)
curve plots the rate of “True Positives” against the rate of “False Positives”
as the threshold varies. This plot shows how well the model separates and
distinguishes between the categories. ROC is a probability curve and AUC is the
measure of separability: it tells how capable the model is of distinguishing
between classes. The larger the area under the curve, the better the model is at
predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the
model is at distinguishing between patients with and without disease.
\[ \mathrm{TPR} = \frac{TP}{TP + FN} \]
\[ \mathrm{Specificity} = \frac{TN}{TN + FP} \]
\[ \mathrm{FPR} = 1 - \mathrm{Specificity} = \frac{FP}{TN + FP} \]
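To make the relation between thresholds, rates and the area under the curve
concrete, the sketch below computes TPR = TP / (TP + FN) and FPR = FP / (FP + TN)
at every distinct score threshold and integrates the resulting ROC points with
the trapezoidal rule. It is a minimal illustration and not the evaluation code
of this project; the function names and the toy labels and scores are
hypothetical, and both classes are assumed to be present.

```python
def rates(labels, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)


def roc_auc(labels, scores):
    """Area under the ROC curve, using the trapezoidal rule."""
    # Start at (FPR, TPR) = (0, 0) and lower the threshold step by step;
    # the lowest threshold labels everything positive and gives (1, 1).
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tpr, fpr = rates(labels, scores, threshold)
        points.append((fpr, tpr))
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy example: 1 = malignant, 0 = benign, with made-up classifier scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

An AUC of 0.5 corresponds to random ranking of the two classes, and 1.0 to
perfect separation.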
TRITA CBH-GRU-2019:109
www.kth.se