DEGREE PROJECT IN MEDICAL ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

Evaluation of a Radiomics model for classification of Lung Nodules

PARASTU RAHGOZAR

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH

KTH Master Thesis Report

Abstract

Lung cancer is a major cause of death among all types of cancer worldwide. In the early stages, lung nodules can be detected with the aid of imaging modalities such as Computed Tomography (CT). At this stage, radiologists look for irregular, rounded nodules in the lung, which are normally less than 3 centimeters in diameter. Recent advances in image analysis have shown that images contain more information than regular parameters such as intensity, histogram and morphological details.

Therefore, in this project we focused on extracting quantitative, hand-crafted features from nearly 1400 lung CT images and training a variety of classifiers on them.

In the first experiment, a total of 424 Radiomics features per image were used to train the following classifiers: Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT), Naïve Bayes (NB), Linear Discriminant Analysis (LDA) and Multi-Layer Perceptron (MLP). In the second experiment, we evaluated each feature category separately with our classifiers. The third experiment applied wrapper feature selection methods (Forward/Backward/Recursive) and filter-based feature selection methods (Fisher score, Gini Index and Mutual Information) to find the most relevant feature set for model construction.

The performance of each learning method was evaluated by accuracy score. The highest accuracy, 78%, was achieved with the Random Forest classifier (74% on average over 5 folds), with an Area Under the Receiver Operating Characteristics (AUROC) curve of 0.82. After RF, NB and MLP showed the best average accuracies, 71.4% and 71% respectively.

Keywords

Master Thesis, Radiomics, Tumor Classification, Lung Nodule.

Sammanfattning (Summary)

Lung cancer has been an important cause of death among all types of cancer throughout the world. In the early stages, lung nodules can be detected with the help of imaging methods such as Computed Tomography (CT). At this stage, radiologists look for irregular spherical nodules in the lung, which are normally smaller than 3 centimeters in diameter. Recent advances in image analysis have shown that images contain even more information than ordinary parameters such as intensity, histogram and morphological details.

Therefore, this project focused on extracting quantitative, hand-crafted features from almost 1400 lung CT images in order to train a number of classifiers based on them.

In the first experiment, a total of 424 Radiomics features per image were used to train classifiers such as Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT), Naïve Bayes (NB), Linear Discriminant Analysis (LDA) and Multi-Layer Perceptron (MLP). In the second experiment, each feature category was evaluated separately with our classifiers. The third experiment included wrapper feature selection methods (forward/backward/recursive) and filter-based feature selection methods (Fisher score, Gini index and Mutual Information). All three were implemented to find the most relevant features for model construction.

The performance of each learning method was evaluated by accuracy score; we achieved the highest accuracy of 78% with the Random Forest classifier (74% as the five-fold average) and an Area Under the Receiver Operating Characteristics (AUROC) curve of 0.82. After RF, NB and MLP showed the best average accuracies of 71.4% and 71% respectively.

Keywords

Master's degree, Radiomics, tumor classification, lung nodule

Acknowledgements

I would first like to thank my thesis supervisor, Dr. Chunliang Wang, for his guidance, patience and supervision during this project. Special thanks to Mr. Mehdi Astaraki for his constant advice, consideration and guidance, which kept me heading in the right direction whenever he thought I needed it.

I would also like to thank my reviewer, Prof. Örjan Smedby, for his kind comments and attention regarding my progress. I am gratefully indebted to him for his very valuable comments on this thesis. Thanks also to Yupei Chen, Cristina Zanin and Didrik Nimander for their helpful comments during our supervision group meetings throughout the thesis project.

Finally, I must express my very profound gratitude to my parents for providing me with unfailing support, love, confidence and continuous encouragement throughout my years of study, and to my boyfriend, who advised and supported me through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.

Authors

Parastu Rahgozar

Master of Science, Medical Engineering

KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden

KTH Flemingsberg Campus

Examiner

Örjan Smedby


Supervisor

Chunliang Wang


Contents

1 Introduction
  1.1 Biology of Lung Nodules
  1.2 Medical Imaging Modality
  1.3 Application of Radiomics
  1.4 Research Aim

2 Methodology and Methods
  2.1 Dataset
  2.2 Data Pre-processing
  2.3 Feature Extraction
  2.4 Feature Selection Methods
    2.4.1 Wrapper based methods
    2.4.2 Filter-based Feature Selection
  2.5 Subject-Wise Cross Validation
  2.6 SMOTE - Synthetic Minority Oversampling Technique
  2.7 Classification Models
    2.7.1 Support Vector Machine (SVM)
    2.7.2 Random Forest (RF)
    2.7.3 Linear Discriminant Analysis (LDA)
    2.7.4 Decision Tree (DT)
    2.7.5 Multi-Layer Perceptron (MLP)
    2.7.6 Naïve Bayes (NB)
  2.8 Evaluation of Classifiers
    2.8.1 Accuracy
    2.8.2 Area Under the Receiver Operating Characteristics (AUROC)
  2.9 Setup of the data
    2.9.1 Feature Extraction
    2.9.2 Classification Models

3 Results
  3.1 Experiment 1
  3.2 Experiment 2
  3.3 Experiment 3

4 Discussion
  4.1 Feature Extraction Categories
  4.2 Learning Methods
  4.3 Feature selection methods
  4.4 Evaluation metrics
  4.5 Future Work

5 Conclusions

A Appendix - State of the Art
  A.1 Introduction
  A.2 Clinical definition of Lung Nodules
  A.3 Background on Lung cancer diagnosis
  A.4 Previous computer vision algorithms used in nodule classification
  A.5 Background
  A.6 Radiomics Workflow
    A.6.1 Data Acquisition
    A.6.2 Segmentation of Region of Interest
    A.6.3 Feature Extraction
    A.6.4 Feature Selection
  A.7 Radiomics Classifiers (Mathematical Model)
  A.8 Performance Evaluation
    A.8.1 Area Under the Receiver Operating Characteristics (AUROC)

1 Introduction

Medical images are broadly used for prognosis observation, prediction of future status and cancer detection. They play an important role in treatment selection, since benign and malignant nodules require different treatment plans. Invasive biopsy, in which part of the lesion is extracted and tested, is a common procedure for nodule diagnosis. However, the majority of tumors are spatially heterogeneous, so a biopsy does not reveal the characteristics of the entire tumor [1].

Computed Tomography is one of the most common imaging systems used for nodule detection and diagnosis, and recent advances have shown that its good imaging properties allow lung nodules to be observed more precisely. Extracting features such as the texture, intensity, size and shape of the tumor can give an estimate of the probability that the tumor is benign or malignant [2].

Radiomics, a recent advancement in the imaging field, extracts high-dimensional features from images and uses them in a predictive model to diagnose tumors and differentiate between malignant and benign ones. Quantitative, hand-crafted image features, also called "Radiomic Features", could provide richer information about the intensity, shape, size or volume, and texture of the tumor phenotype that is distinct from or complementary to that provided by clinical reports, laboratory test results, and genomic or proteomic assays [3]. Research has shown that Radiomics is a promising approach for obtaining information about the internal heterogeneity of the tumor, which is related to genetic patterns. This approach helps to personalize treatment for patients according to their tumor prognosis [4].

1.1 Biology of Lung Nodules

Lung cancer is the leading cause of cancer death in the world, causing as many

deaths as the next four most deadly cancers combined (breast, prostate, colon and

pancreas) according to [5].

As an initial definition proposed by [6], a lung nodule is a rounded or irregular oval-shaped growth in the lungs, usually less than 3 centimeters in diameter. An opacity smaller than 3 millimeters is considered a "micro-nodule", while any nodule more than 3 centimeters in diameter is considered a "mass" and is more likely to be cancerous [7]. After detecting a nodule, the physician's next step is to decide whether it is suspicious enough to warrant further tests, while avoiding unnecessary examinations [6]. One of the most useful, yet non-invasive, methods is medical imaging.

1.2 Medical Imaging Modality

One of the most popular imaging modalities for this purpose is Computed Tomography (CT), which can play an important role in both diagnosis and follow-up, since early diagnosis of Solitary Pulmonary Nodules (SPN) can have a significant effect on finding a safe and prompt solution. SPNs have a high probability of becoming malignant nodules in the near future [8].

1.3 Application of Radiomics

The aim of Radiomics is to extract quantified characteristics from the medical image with the aid of automated algorithms. Numerous pipelines and methods have been developed (hard-coded or based on deep learning), but the lack of standardization in definitions and image processing makes it difficult to compare different sets of results and affects their reproducibility [9]. To address this, PyRadiomics [9] was released as an open-source platform that can be used from Python or from 3D Slicer. The platform consists of several classes covering image pre-processing, mask production, filter application and feature extraction.

1.4 Research Aim

This project aims at:

• Implementing the Radiomics method to classify lung nodules

• Comparing the performance of several predictive classifiers on the extracted features

• Selecting the most informative Radiomics features as a smaller set for training classifiers

2 Methodology and Methods

2.1 Dataset

This project was carried out on the "Kaggle Data Science Bowl 2017" dataset (footnote 1). The initial dataset consisted of more than 1400 3D CT images, labeled as benign or malignant. All images are in DICOM format, and the size of each image is (Z, 512, 512), where Z is the number of slices per image.

2.2 Data Pre-processing

The first step in image pre-processing is to convert the 3D images from DICOM to NIfTI-1 (Neuroimaging Informatics Technology Initiative) format, convert the gray values to Hounsfield units (HU) and normalize the histogram of the image. The conversion and histogram windowing should be done with standard parameters for lung CT images. Since radiomics features are usually calculated from a region of interest, in the next step each nodule is cropped separately and a binary mask is produced with the dimensions of the cropped nodule. The dimensions of the final image files may therefore vary. If an image contains more than one nodule, the result is one cropped image and one binary mask for each nodule.
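The intensity conversion and windowing described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the code used in the thesis: the function names are hypothetical, the slope and intercept stand for the DICOM RescaleSlope and RescaleIntercept attributes (the values below are only examples), and the (-1000, 500) window is the one reported in Section 2.9.

```python
def to_hounsfield(raw_values, slope, intercept):
    """Map stored pixel values to Hounsfield units: HU = raw * slope + intercept."""
    return [v * slope + intercept for v in raw_values]


def apply_window(hu_values, low=-1000.0, high=500.0):
    """Clip HU values to [low, high] and rescale them linearly to [0, 1]."""
    span = high - low
    return [(min(max(v, low), high) - low) / span for v in hu_values]


# Hypothetical stored values for four voxels; slope and intercept are example
# values that would normally be read from the DICOM header of each slice.
raw = [0, 24, 1024, 3000]
hu = to_hounsfield(raw, slope=1.0, intercept=-1024.0)
windowed = apply_window(hu)
```

Values below the window are mapped to 0 and values above it to 1, so very dense structures such as bone no longer dominate the intensity range of the cropped nodule.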

2.3 Feature Extraction

Feature extraction is the initial step and one of the most important stages in image

processing. There are different types and classes of image features that can be

extracted from medical images which help to describe the characteristics of the

extracted area known as tumor. Quantitative features give information about the

shape of lesion, intensity histogram or spatial parameters of the image. They can

also be informative regarding texture features which focus on the distribution of the

gray levels over the pixels and spatial arrangement of the intensity in voxels [10].

Footnote 1: https://www.kaggle.com/c/data-science-bowl-2017/data

Overall, the quantitative features are divided into several subgroups which have been

implemented in this project:

• First Order Statistical Features: To observe the texture characteristics of an image area, one needs to look at the distribution of gray levels over the pixels. First (and second) order features are defined to quantify the relations underlying this distribution. First-order features are calculated from individual voxel values, without taking spatial relationships into account. They are derived from the histogram of the image and include Variance, Skewness, Kurtosis and Median [11]. In general, the first-order histogram is defined as:

$$P(I) = \frac{\text{number of pixels with gray level } I}{\text{total number of pixels in the region}}$$

• Filtered Statistical Features:

The Laplacian of Gaussian (LoG) filter is a spatial filter that measures the second derivative of the image. The Laplacian highlights regions of rapid intensity change and is therefore useful for edge detection. Because it is sensitive to noise, a smoothing Gaussian filter is usually applied before the Laplacian. The filter takes a gray-scale image as input and produces a gray-scale image as output. In the equation below, the value of σ can vary [12].

$$\mathrm{LoG}(x, y) = -\frac{1}{\pi\sigma^{4}}\left[1 - \frac{x^{2} + y^{2}}{2\sigma^{2}}\right] e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}$$

The wavelet transform is a technique for evaluating the frequencies within an image at different scales. The wavelet coefficients are derived from these scales and from the direction of the frequency changes, which reflect fast or slow variations in the gray-level values of the image. Areas with abrupt changes in gray value correspond to high spatial frequencies, while regions with slow changes correspond to low spatial frequencies. Wavelet-derived features are mainly aimed at capturing texture differences in an image [13].

• Gray-Level Matrix Features:

Texture features were primarily defined to analyze surface structure in 2D images, but their implementation can be extended to 3D images as well. While first-order features give information about the gray-level properties of the image, they contain no details about the positions of the gray levels in relation to one another: for instance, whether all low gray values lie next to each other or are interleaved with high gray values. The Gray-Level Co-occurrence Matrix (GLCM) can quantify coarseness and smoothness, which are parameters with high discriminatory power. This matrix expresses how neighbouring pixels in a 3D volume are distributed in different directions [14]. It is calculated in 13 directions and the mean value is kept. Two of the features derived from this matrix are contrast, which measures local gray-level variations, and entropy, which measures randomness [11].

Another category of gray-level matrix features is the Gray-Level Run Length Matrix (GLRLM), initially proposed in [15]. Unlike the GLCM, this matrix analyzes run lengths (the length of a sequence of pixels with the same gray level in one direction).

The Gray-Level Size Zone Matrix (GLSZM) assesses groups (zones) of connected voxels in the same neighbourhood. In a 3D image, a single voxel can be connected to 26 neighbouring voxels. This group of features can be computed either from a 3D matrix or from 2D matrices and averaged over slices [14]. Some of the texture features calculated in this project are: Contrast, Correlation, Sum of Squares (Variance), Entropy, Inverse Difference Moment and Maximal Correlation Coefficient [16].

• Shape Features:

Geometric properties of the region of interest (the cropped nodule) are described by shape features such as compactness, sphericity and surface area.

• Clinical Features:

A total of 9 features were defined and clinically determined by a radiologist. These features include "calcification or fat content", "attached to the artery", etc.; they were labeled as True/False and then converted to binary values.
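To make two of the feature groups above concrete, the sketch below computes the first-order histogram P(I), a few first-order statistics, and a co-occurrence matrix with its contrast and entropy. It is a simplified illustration under stated assumptions: all function names are hypothetical, the region is a tiny 2D array with gray levels already discretized, and only one horizontal offset is used, whereas the thesis computes the GLCM in 13 directions of a 3D volume and keeps the mean.

```python
import math
from collections import Counter


def first_order_histogram(values):
    """P(I): fraction of voxels in the region with gray level I."""
    counts = Counter(values)
    total = len(values)
    return {level: n / total for level, n in counts.items()}


def first_order_stats(values):
    """Mean, variance, skewness and kurtosis from voxel values only."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # Assumes a non-constant region (m2 > 0).
    return {"mean": mean, "variance": m2,
            "skewness": m3 / m2 ** 1.5, "kurtosis": m4 / m2 ** 2}


def glcm(image, offset=(0, 1)):
    """Normalized, symmetric co-occurrence matrix of a 2D image for one offset."""
    dr, dc = offset
    rows, cols = len(image), len(image[0])
    pairs = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                pairs[(a, b)] += 1
                pairs[(b, a)] += 1  # count both directions (symmetric matrix)
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()}


def glcm_contrast(p):
    """Local gray-level variation: sum of p(i, j) * (i - j)^2."""
    return sum(prob * (i - j) ** 2 for (i, j), prob in p.items())


def glcm_entropy(p):
    """Randomness of the co-occurrence distribution, in bits."""
    return -sum(prob * math.log2(prob) for prob in p.values() if prob > 0)


# Toy 3x3 region with gray levels already discretized to {0, 1}.
region = [[0, 0, 1],
          [0, 1, 1],
          [0, 1, 1]]
voxels = [v for row in region for v in row]
hist = first_order_histogram(voxels)
stats = first_order_stats(voxels)
p = glcm(region)
contrast = glcm_contrast(p)
entropy = glcm_entropy(p)
```

The first-order functions ignore where each voxel sits, while the co-occurrence matrix depends on which gray levels are neighbours, which is the distinction drawn in the list above.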

2.4 Feature Selection Methods

One of the most important parts of machine learning is the input data given to the model. Analyzing the quality of the input data makes it possible to recognize noise and to reduce the dimension of large-scale data, so that only the truly important and useful features are kept for implementation.

There are two main categories of feature selection methods. To allow a comparison, three methods from each category were assessed.

2.4.1 Wrapper based methods

Wrapper methods, introduced in [17], evaluate candidate features by the estimated accuracy of the learning algorithm trained on them. In Forward Feature Selection, the accuracy is recalculated each time another unused feature is added to the subset, and the feature(s) that improve the accuracy most are added. Because the model has to be retrained for every candidate, wrapper methods are computationally costly and do not scale well to large data [18].

Backward Feature Elimination starts from an algorithm trained on all input features, removes one feature at each iteration, and keeps the set of features that gives the lowest error rate. This process continues until there is no significant improvement in the error rate [19]. In Recursive Feature Elimination, a linear regression model is trained on a subset of features each time, and the feature with the smallest ranking criterion is removed, since it has the least effect on classification [20].
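The greedy loop behind forward selection can be sketched as below. This is an illustration only: `forward_selection` and `toy_score` are hypothetical names, and the scoring function is a stand-in for the cross-validated accuracy of a real classifier, which is what makes the method expensive in practice.

```python
def forward_selection(n_features, score, n_select):
    """Greedy forward selection.

    `score(subset)` must return the estimated accuracy of the classifier
    trained on the given list of feature indices (higher is better).
    """
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < n_select:
        # Try every unused feature and keep the one that helps the most.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical stand-in for "accuracy of a classifier trained on this subset":
# features 2 and 0 are informative, the others add nothing.
def toy_score(subset):
    gains = {0: 0.05, 2: 0.15}
    return 0.5 + sum(gains.get(f, 0.0) for f in subset)


chosen = forward_selection(n_features=4, score=toy_score, n_select=2)
```

Each pass calls `score` once per unused feature, so selecting m features out of n requires on the order of n times m model fits. Backward elimination follows the same pattern but starts from the full set and removes the feature whose removal hurts least.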

2.4.2 Filter-based Feature Selection

Filter methods [21] follow another approach: the feature set is chosen independently of the learning algorithm used for training. Therefore, the learning method introduces no bias into the selected features.

• Gini Index: Suppose that $S$ is a set of $s$ samples belonging to $k$ different classes ($C_i$, $i = 1, \dots, k$). $S$ can be divided into $k$ subsets according to class ($S_i$, $i = 1, \dots, k$), where $S_i$ is the set of samples belonging to class $C_i$ and $s_i$ is the number of samples in $S_i$. The Gini Index of set $S$ is then:

$$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^{2}$$

where $p_i$ is the probability, estimated as $s_i/s$, that any sample belongs to $C_i$. The minimum of $\mathrm{Gini}(S)$ is 0, reached when all members of the set belong to the same class; this denotes that the maximum useful information can be obtained.

• Fisher Score: In the data space spanned by the features, samples are grouped into classes based on the distances between them. The distance between data points in the same class should be small, while different classes should lie far away from each other. The Fisher score seeks the set of features for which the distance between classes is largest [22].

• Mutual Information (MI) is a powerful statistical tool for evaluating the dependency between two variables. MI can detect any kind of relationship between the variables, whether it involves the mean value, the variance or other factors [23]. If X and Y are totally unrelated, they are independent of one another and X gives no clue about Y; in this case, their mutual information is zero. The function used here relies on entropy estimation from k-nearest-neighbor distances.
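Two of the filter criteria above can be written in a few lines. The Gini function follows the definition given in the text. The Fisher score formula is not stated in the thesis, so the sketch assumes one common per-feature form (between-class scatter divided by within-class scatter); the function names and toy data are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini(S) = 1 - sum_i p_i^2, with p_i estimated as s_i / s."""
    s = len(labels)
    return 1.0 - sum((s_i / s) ** 2 for s_i in Counter(labels).values())


def fisher_score(feature, labels):
    """Between-class scatter of one feature divided by its within-class scatter.

    Assumed form; requires some spread inside at least one class (within > 0).
    """
    overall_mean = sum(feature) / len(feature)
    between = within = 0.0
    for cls in set(labels):
        values = [x for x, y in zip(feature, labels) if y == cls]
        mean = sum(values) / len(values)
        between += len(values) * (mean - overall_mean) ** 2
        within += sum((x - mean) ** 2 for x in values)
    return between / within


labels = [0, 0, 0, 1, 1, 1]
separating = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]   # classes far apart
overlapping = [1.0, 3.0, 2.0, 1.1, 2.9, 2.1]  # classes mixed together

balanced_gini = gini_index(labels)      # two equally sized classes
pure_gini = gini_index([1, 1, 1])       # a single class
```

A feature whose values separate the two classes receives a much higher Fisher score than one whose values overlap, and neither criterion involves training a classifier, which is what distinguishes filter methods from wrapper methods.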

2.5 Subject-Wise Cross Validation

The aim of k-fold cross-validation is to generate training and validation sets from the same population, while making sure that each observation is used exactly once for validation and that no observation is used for training and validation at the same time. In this approach, the data is randomly divided into k equal subgroups; in each loop, one subgroup is kept for validation (test) and the remaining k-1 subgroups are used as training data. In this project, a single patient (same ID) can have several analyzed nodules. The k-fold cross-validation therefore needs to be done subject-wise, which means generating k folds of patient IDs rather than of nodules.
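A subject-wise split can be sketched as follows. The function name, the round-robin assignment and the patient IDs are assumptions made for illustration; the thesis does not describe how its folds were generated beyond splitting by ID.

```python
import random


def subject_wise_folds(patient_ids, k=5, seed=0):
    """Split nodule indices into k folds so that no patient spans two folds."""
    unique_ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique_ids)
    # Deal the shuffled patient IDs round-robin into k groups.
    fold_of = {pid: i % k for i, pid in enumerate(unique_ids)}
    folds = [[] for _ in range(k)]
    for index, pid in enumerate(patient_ids):
        folds[fold_of[pid]].append(index)
    return folds


# Hypothetical IDs: patient "p1" contributes three nodules, "p3" two.
ids = ["p1", "p1", "p2", "p3", "p1", "p4", "p3", "p5", "p6"]
folds = subject_wise_folds(ids, k=3)
for test_idx in folds:
    train_idx = [i for fold in folds if fold is not test_idx for i in fold]
    # A classifier would be trained on train_idx and evaluated on test_idx.
```

With this scheme the folds contain equal numbers of patients rather than equal numbers of nodules, so fold sizes can differ slightly when patients have different numbers of nodules.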

2.6 SMOTE - Synthetic Minority Oversampling Technique

In the majority of binary classification problems, the classes are "benign" and "malignant", or "normal" and "abnormal", and real-world data include more cases labeled "normal" than "abnormal" (in binary terms, more 0s than 1s). This imbalance between the labels may result in abnormal examples being misclassified as normal. One way to address this issue is to oversample the minority class and thereby increase the sensitivity of the model towards it [24].
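The core idea of SMOTE, creating new minority samples on the line segment between an existing minority sample and one of its nearest minority neighbours, can be sketched as below. This is a simplified illustration rather than the implementation used in the thesis: the function name, the default k = 2 and the toy data are assumptions.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples by interpolating between minority samples."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself).
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority class ("malignant") with four samples.
malignant = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_samples = smote(malignant, n_new=6)
```

Because every synthetic point lies between two real minority samples, the new data stay inside the region already occupied by the minority class. Oversampling of this kind is normally applied to the training folds only, so that the validation data contain real nodules exclusively.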

2.7 Classification Models

All lung nodule images have been labeled as zero (0) or one (1) (benign or malignant) based on the diagnosis of an experienced radiologist. The classification methods therefore only need to perform a simple binary classification. An important consideration in choosing the models is the large number of features that will be fed to them. According to previous research on this topic [9] [25], Machine Learning (ML) classification models such as Support Vector Machine, Random Forest and neural networks have been applied in this field before. To compare performance, this project involves several classifiers:

2.7.1 Support Vector Machine (SVM)

The idea of SVM is to map the training data into a higher-dimensional feature space and then build a separating hyperplane with maximum margin between the classes, which results in a decision boundary in the input space. The SVM can be applied linearly, or with a kernel function such as a polynomial, spline or radial basis function (RBF) kernel [26].

2.7.2 Random Forest (RF)

Random Forest is an ensemble algorithm that uses perturb-and-combine techniques

to create a strong classifier from a set of weak classifiers. The trees in the ensemble of

classification trees are created from a sample drawn with replacement from the part

of the dataset used for training. The scikit-learn implementation of this algorithm

uses the average probabilistic predictions of the classifiers to combine them into

one.

2.7.3 Linear Discriminant Analysis (LDA)

LDA is a generalization of Fisher's linear discriminant that is popular in statistics and machine learning. The goal of the algorithm is to find a linear combination of features that separates two classes. When assigning a class to a set of observations (feature sets), LDA assumes that the class-conditional probability density functions are normally distributed. It models the distribution of the predictors separately for each class and uses Bayes' theorem to estimate the probability of each class given the observations.

2.7.4 Decision Tree (DT)

Decision trees are a non-parametric supervised learning method. They create a predictive model of a target variable by learning decision rules from the features of a dataset. This decision process can be thought of as a series of if-then-else statements of the kind often used in flowcharts and algorithms. While a decision tree is easy to visualize, one of its downsides is that it may overfit on large datasets [27].

2.7.5 Multi-Layer Perceptron (MLP)

The multi-layer perceptron is a simple form of neural network. It consists of an input vector, an output vector and a set of interconnected neurons (nodes), which are connected by weights and lead to an output signal. Each node passes the weighted sum of its inputs through a transfer (activation) function, which can be linear or non-linear depending on the needs of the project. If there is no feedback from the output side back into the network, the network is known as a "Feed-Forward Neural Network" [28].

2.7.6 Naïve Bayes (NB)

Naïve Bayes is another supervised classifier, from the family of probabilistic classifiers. It is based on Bayes' theorem and assumes conditional independence between features. Using the joint probabilities of sample observations and classes, the algorithm estimates the conditional probability of each class given an observation [27].

2.8 Evaluation of Classifiers

In the literature, a wide variety of evaluation metrics are used to measure the performance of models. Two of the most common performance metrics, which were used in this project, are described below:

2.8.1 Accuracy

In general, accuracy tells us how many predictions were correct out of the total number of test cases. It is computed as:

$$\mathrm{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP}$$

where TN, TP, FN and FP stand for True Negative, True Positive, False Negative and False Positive respectively.
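As a worked example of the accuracy definition, the sketch below counts the four outcomes for a small set of predictions and then applies the same formula to the class counts reported in Section 2.9. The function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = malignant, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions among all test cases."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels and predictions for eight test nodules.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)

# Class counts from Section 2.9 (1036 benign, 362 malignant): accuracy of a
# rule that labels every nodule as benign, useful as a reference point.
baseline = accuracy(tp=0, tn=1036, fp=0, fn=362)
```

Because the classes are imbalanced, accuracy alone does not show how the errors are split between the two classes, which is one reason it is reported together with the AUROC.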

2.8.2 Area Under the Receiver Operating Characteristics (AUROC)

This performance measure for classification is evaluated over different settings (thresholds). The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at each threshold, and shows how well the model can separate and distinguish between the categories. The ROC is a probability curve and the Area Under the Curve (AUC) is a measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s; by analogy, the better it is at distinguishing between patients with and without disease.

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

$$\mathrm{FPR} = 1 - \mathrm{specificity} = \frac{FP}{TN + FP}$$
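The construction of the ROC curve and its area can be sketched as follows: the decision threshold is lowered step by step, TPR and FPR are recomputed at each step, and the area is obtained with the trapezoidal rule. The function names and the scores are hypothetical, and the sketch assumes both classes are present.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical malignancy scores for six nodules (1 = malignant).
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]
curve = roc_points(y_true, scores)
area = auc(curve)
```

In this example 8 of the 9 possible (malignant, benign) pairs are ranked correctly, and the area equals 8/9. This reflects the usual interpretation of the AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one.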

2.9 Setup of the data

The complete dataset of 1398 individual nodules was used in this study. Each nodule image was converted to Hounsfield units, and its histogram was then thresholded to the range (-1000, 500). Each nodule has a 3D binary mask. In total, there are 1036 images with benign tumors and 362 images with malignant ones.

Figure 2.1: Sample of data: (a) axial slice of a lung CT image, (b) cropped image of the identified nodule, (c) binary mask of the nodule

2.9.1 Feature Extraction

Since this project aims to evaluate a variety of options in Radiomics, we calculated features both from the original image and from images processed with two different filters (Wavelet and LoG). First-order features were extracted from the filtered images.

The combination of all feature classes (Shape, First order statistical, GLCM, GLRLM, GLSZM, Clinical, LoG and Wavelet) led to a set of 424 features extracted from each nodule. The set includes 9 clinical, 17 shape, 19 first order, 24 GLCM, 16 GLRLM, 16 GLSZM, 171 LoG and 152 Wavelet hand-crafted features.

The final dataframe consists of 1398 nodules and 424 features (per nodule), after the data is normalized with "Z-score" normalization.
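The feature count and the normalization step can be checked with a short sketch. The per-category counts are those listed above; the function name and the toy values are hypothetical, and the use of the population standard deviation is an assumption, since the thesis does not state which variant was used or whether the statistics were computed per training fold.

```python
import math

# Feature counts per category as listed above; they add up to 424.
feature_counts = {
    "clinical": 9, "shape": 17, "first order": 19, "GLCM": 24,
    "GLRLM": 16, "GLSZM": 16, "LoG": 171, "wavelet": 152,
}
total_features = sum(feature_counts.values())


def z_score_columns(rows):
    """Standardize each feature column to zero mean and unit variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n)
            for col, m in zip(columns, means)]
    # Constant columns (std == 0) are mapped to 0 to avoid division by zero.
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in rows]


# Hypothetical values of two features for three nodules.
features = [[10.0, 0.2], [20.0, 0.4], [30.0, 0.6]]
normalized = z_score_columns(features)
```

After this step every feature has the same scale, which matters for distance-based and gradient-based classifiers such as SVM and MLP, whose behaviour depends on feature magnitudes.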

2.9.2 Classification Models

Six different classification models with settings as described below were used:

• Support Vector Machine (SVM) (regularization parameter C = 5, three kernels: linear, polynomial and radial basis function)

• Random Forest (RF) (100 trees)

• Linear Discriminant Analysis (LDA)

• Decision Tree (DT) (maximum depth = 5, minimum samples per leaf = 10)

• Multi-Layer Perceptron (MLP) (3 hidden layers of 30 nodes each)

• Naïve Bayes (NB) (Gaussian Naïve Bayes)

3 Results

3.1 Experiment 1

In the first experiment, the six models described above were trained on all 424 hand-crafted features (whole data) using 5-fold cross-validation. Evaluation was done by measuring accuracy and Area Under the Receiver Operating Characteristics (AUROC) for all models. Table 3.1 shows the accuracy of each classification model and Table 3.2 shows the AUROC value of each classification model.

According to the results, the random forest model reaches the highest average accuracy (74%), followed by Naïve Bayes and Multi-Layer Perceptron with 71.4% and 71% respectively. The area under the ROC curve reaches a high of 0.82 in this trial with the random forest classifier.

Table 3.1: Accuracy percentage and standard deviation in Experiment 1, training with 424 features

Folds   SVM5    SVM(P)  SVM(R)  RF      LDA     NB      DT      MLP
Fold1   69%     73%     73%     78%     70%     71%     66%     75%
Fold2   66%     64%     70%     74%     67%     66%     66%     70%
Fold3   72%     70%     74%     73%     68%     73%     69%     72%
Fold4   68%     72%     71%     73%     67%     73%     69%     69%
Fold5   69%     69%     66%     72%     62%     74%     70%     69%
Ave.    68.8%   69.6%   70.8%   74%     66.8%   71.4%   68%     71%
Std.    ±1.95   ±3.12   ±2.81   ±2.01   ±2.6    ±2.8    ±1.62   ±2.8

Table 3.2: Area under the ROC curve and average in Experiment 1, training with 424 features

Folds   SVM5    RF      LDA     NB      DT      MLP
Fold1   0.73    0.85    0.74    0.71    0.79    0.75
Fold2   0.71    0.76    0.68    0.71    0.67    0.66
Fold3   0.72    0.82    0.73    0.74    0.68    0.72
Fold4   0.70    0.78    0.72    0.73    0.72    0.77
Fold5   0.68    0.79    0.7     0.75    0.72    0.73
Ave.    0.708   0.80    0.714   0.728   0.716   0.726

Figure 3.1: Area under the ROC curve for different learning models ((a) SVM(5), (b) Random Forest, (c) LDA, (d) NB, (e) DT, (f) MLP)

3.2 Experiment 2

In the second experiment, the six classifiers were trained on each feature category separately (Clinical, First order, Shape, GLCM, GLRLM, GLSZM and Wavelet/LoG), again using 5-fold cross-validation. The results are shown as heat-maps (Figure 3.3) for easier comparison. The outcome shows that the features extracted with the wavelet and LoG filters have the strongest relation with the outcome, and that training on these features gives the most accurate results (Random Forest, 76% average accuracy). The next best results are achieved with the shape features, where both random forest and Naïve Bayes reach 73%. Figure 3.2 shows the AUROC of the random forest classifier for each feature category.

Table 3.3: Average of accuracy percentage (over the 5 folds) in Experiment 2, training with separate feature categories

Feature Group   SVM5    SVM(P)  SVM(R)  RF      LDA     NB      DT      MLP
Clinical        58%     65%     55%     56%     53%     52%     57%     54%
FoS             67%     69%     69%     72%     68%     61%     67%     67%
Wavelet/LoG     69%     67%     70%     76%     67%     67%     66%     69%
Shape           70%     71%     70%     73%     69%     73%     68%     68%
GLCM            61%     61%     62%     65%     63%     51%     63%     64%
GLSZM           69%     69%     67%     67%     69%     69%     65%     68%
GLRLM           69%     70%     70%     68%     69%     73%     67%     68%

Figure 3.2: Area under the ROC curve (Random Forest) for separate feature categories ((a) Clinical, (b) First Order, (c) Shape, (d) Wavelet/LoG, (e) GLCM, (f) GLRLM, (g) GLSZM)

Figure 3.3: Heatmap of accuracy score based on chosen feature category and classifier (for each fold)

3.3 Experiment 3

In the third experiment, the six classifiers were trained on the features output by the feature selection methods. The feature selection methods used were FFS, BFS, RFS, Fisher score, Gini Index and Mutual information. Each of these methods picks the top 25 features, and each classifier is then trained on a dataframe of size 1398×25.
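To make the "score every feature, keep the top k" idea concrete, the sketch below ranks features by a Fisher score on toy data. It is an illustration under stated assumptions: the score is the common between-class over within-class variance ratio, and the feature names and values are hypothetical rather than taken from the thesis data.

```python
from statistics import mean, pvariance

def fisher_score(column, labels):
    """Between-class scatter divided by within-class scatter for one feature."""
    overall = mean(column)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        values = [v for v, y in zip(column, labels) if y == cls]
        between += len(values) * (mean(values) - overall) ** 2
        within += len(values) * pvariance(values)
    return between / within if within > 0 else float("inf")

def top_k_features(table, labels, k):
    """Return the names of the k features with the highest Fisher score."""
    scores = {name: fisher_score(col, labels) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (hypothetical names and values): 0 = benign, 1 = malignant.
labels = [0, 0, 0, 1, 1, 1]
table = {
    "shape_diameter": [4.0, 5.0, 6.0, 14.0, 15.0, 16.0],  # separates the classes
    "wavelet_energy": [1.0, 3.0, 2.0, 4.0, 6.0, 5.0],     # partly informative
    "glcm_contrast":  [2.0, 9.0, 5.0, 3.0, 8.0, 6.0],     # mostly noise
}
selected = top_k_features(table, labels, k=2)
```

In the experiment k would be 25 and the table would hold the 424 normalized features; the other filter criteria (Gini index, mutual information) fit the same ranking scheme with a different scoring function.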


The results of this experiment are presented in Table 3.4. In this set, recursive feature selection led to the best results overall across the classifiers. Random forest, however, gives consistently high accuracy (71% to 72%) with every selection method, followed by Naïve Bayes. Moreover, we inspected the features picked by each method, and for every method at least 10 of the 25 selected features belonged to the wavelet and LoG category. The distribution of the picked features over the categories, for each method, is as follows:

• FFS: (Wavelet/LoG: 13, GLSZM: 4, Shape: 7, First Order: 1)

• BFS: (Wavelet/LoG: 15, Shape: 5, GLRLM: 2, First Order: 3)

• RFS: (Wavelet/LoG: 11, GLRLM: 5, Shape: 2, First Order: 7)

• Fisher: (Wavelet/LoG: 11, GLCM: 3, GLRLM: 4, Shape: 3, First Order: 4)

• Gini In.: (Wavelet/LoG: 12, GLCM: 1, GLRLM: 6, Shape: 3, First Order: 3)

• Mutual: (Wavelet/LoG: 10, GLCM: 5, GLRLM: 2, Shape: 4, First Order: 4)
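The counts in the list above can be checked with a few lines of code. The sketch below only re-enters the listed numbers; it confirms that each method selects 25 features and computes the Wavelet/LoG share per method. The variable names are illustrative.

```python
# Counts copied from the list above (selected features per category).
picked = {
    "FFS":      {"Wavelet/LoG": 13, "GLSZM": 4, "Shape": 7, "First Order": 1},
    "BFS":      {"Wavelet/LoG": 15, "Shape": 5, "GLRLM": 2, "First Order": 3},
    "RFS":      {"Wavelet/LoG": 11, "GLRLM": 5, "Shape": 2, "First Order": 7},
    "Fisher":   {"Wavelet/LoG": 11, "GLCM": 3, "GLRLM": 4, "Shape": 3, "First Order": 4},
    "Gini In.": {"Wavelet/LoG": 12, "GLCM": 1, "GLRLM": 6, "Shape": 3, "First Order": 3},
    "Mutual":   {"Wavelet/LoG": 10, "GLCM": 5, "GLRLM": 2, "Shape": 4, "First Order": 4},
}

# Total number of selected features per method.
totals = {method: sum(counts.values()) for method, counts in picked.items()}

# Fraction of each method's selection that comes from the Wavelet/LoG category.
wavelet_share = {m: c["Wavelet/LoG"] / totals[m] for m, c in picked.items()}
lowest = min(wavelet_share, key=wavelet_share.get)
```

The smallest share belongs to mutual information (10 of 25 features), so the Wavelet/LoG category accounts for at least 40% of the selection in every method.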

Table 3.4: Average of accuracy percentage in Experiment 3, training with features picked by FS methods

Feature Selector  SVM5    SVM(P)  SVM(R)  RF      LDA     NB      DT      MLP
FFS               68%     69%     69%     72%     69%     68%     69%     67%
BFS               69%     68%     72%     72%     70%     71%     67%     69%
RFS               72%     72%     72%     71%     71%     73%     68%     68%
Fisher            69%     70%     69%     72%     70%     73%     69%     69%
Gini In.          69%     71%     61%     72%     70%     73%     69%     69%
Mutual Info.      65%     62%     68%     71%     67%     68%     65%     65%

Figure 3.4: Heatmap of accuracy score based on chosen feature selection method and classifier (for each fold)

Figure 3.5: Area under the ROC curve (Random Forest) for separate feature selectors ((a) FFS, (b) BFS, (c) RFS, (d) Gini, (e) Mutual Info, (f) Fisher)

4 Discussion

The results of this project support three main comparisons. First, the extracted feature categories are compared. Then the learning methods are compared, followed by a discussion of the feature selection methods. The section closes with a short discussion of the performance metrics.

4.1 Feature Extraction Categories

Overall, seven feature categories are used in this work, and they are fed separately to the classifiers in Experiment 2. The purpose of this experiment is to find which category has the most influence on the final results. The heatmap in Figure 3.3 (a) shows the outcome in Fold 2. The GLCM category (24 features), followed by the clinical category, which consists of 9 visually inspected clinical parameters, has the lowest performance among all categories (RF 43% and 51% respectively).

On the other hand, the features extracted after applying a wavelet filter or a Laplacian of Gaussian (LoG) filter seem to be the most informative for the final results. As mentioned earlier for Experiment 3, at least 40% of the 25 features picked by each feature selection method came from the wavelet and LoG category. This confirms the importance of the wavelet and LoG features for the overall outcome. Among the gray level categories (GLRLM, GLCM and GLSZM), GLCM has been slightly weaker than the other two.

4.2 Learning Methods

In this project, a variety of learning methods were used to classify binary labels. Overall, Random Forest reached slightly better results in terms of accuracy score. Our data consist of a wide variety of numerical features on different scales, and comparing the learning methods on these data suggests that Random Forest works well with a mixture of numerical and categorical features even when the features are on different scales. Roughly speaking, with Random Forest the data can be used as they are. The Random Forest algorithm is robust and has few hyper-parameters that need to be tuned.


In all three experiments, NB produced results close to those of RF. This is interesting since, theoretically, NB works best on small data sets. For high-dimensional data, it is possible that the likelihood does not follow a particular distribution unless the features are dependent on each other, which is most probably the case in our data set.

MLP and SVM closely follow NB as the third-best classifiers. Both learn a decision boundary that separates the classes; for a linear SVM this boundary is a hyperplane. Of course, the result may differ slightly depending on the specific hyper-parameters used, and the number of such parameters in an MLP is one of the factors that make other classifiers easier to implement. These parameters include the number of layers, the number of neurons per layer, the learning rate and the activation function. Generally, SVM is better at avoiding over-fitting and is less complex than MLP.

4.3 Feature selection methods

The results of Experiment 3 are the accuracies achieved by training the classifiers on the top 25 features chosen by each feature selection method. As a general comparison, the wrapper methods (FFS, BFS, RFS) lead to more accurate outcomes, with RFS giving the best results (Naïve Bayes, 73%) and Mutual information giving the least desirable outcomes (polynomial-kernel SVM, 62%).

As discussed at the beginning, wrapper methods differ from filter methods in the criterion they use to select features. Filter-based methods use a mathematical evaluation function, while wrapper methods use the classification performance of a classifier (such as its accuracy) for the evaluation. In this project, the classifier used in all three wrapper methods is RF. The wrapper-based methods gave a better set of features (in terms of accuracy score) since they use the target classifier inside the feature selection algorithm; on the negative side, they are computationally expensive. A single run can take several hours, depending on the size of the data set.
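The cost of a wrapper method comes from re-evaluating a classifier for every candidate feature at every step. The sketch below shows greedy forward selection on toy data. To keep it self-contained, a nearest-centroid classifier scored by leave-one-out accuracy stands in for the Random Forest used in this project; that substitution, the function names and the toy values are all assumptions.

```python
from statistics import mean

def centroid_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in for RF)."""
    classes = set(labels)
    correct = 0
    for i, row in enumerate(rows):
        # Class centroids computed without the held-out sample i.
        centroids = {}
        for cls in classes:
            members = [r for j, r in enumerate(rows) if labels[j] == cls and j != i]
            centroids[cls] = [mean(r[f] for r in members) for f in features]

        def distance(cls):
            return sum((row[f] - c) ** 2 for f, c in zip(features, centroids[cls]))

        correct += min(centroids, key=distance) == labels[i]
    return correct / len(rows)

def forward_selection(rows, labels, k):
    """Greedily add the feature that gives the best wrapped-classifier accuracy."""
    remaining = list(rows[0].keys())
    chosen = []
    while remaining and len(chosen) < k:
        best = max(remaining,
                   key=lambda f: centroid_accuracy(rows, labels, chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data (hypothetical names and values): 0 = benign, 1 = malignant.
labels = [0, 0, 0, 1, 1, 1]
rows = [
    {"shape_diameter": 4.0,  "glcm_contrast": 2.0},
    {"shape_diameter": 5.0,  "glcm_contrast": 9.0},
    {"shape_diameter": 6.0,  "glcm_contrast": 5.0},
    {"shape_diameter": 14.0, "glcm_contrast": 3.0},
    {"shape_diameter": 15.0, "glcm_contrast": 8.0},
    {"shape_diameter": 16.0, "glcm_contrast": 6.0},
]
chosen = forward_selection(rows, labels, k=1)
```

Selecting 25 of 424 features this way requires on the order of ten thousand classifier evaluations, which is consistent with the long run times noted above.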


4.4 Evaluation metrics

The performance of the experiments was evaluated by two metrics: accuracy score and area under the receiver operating characteristic curve (AUROC). Looking at the heat-maps, the ROC curves and the tables (Table 3.1 and Table 3.2), it can be seen that in most cases the AUROC is higher than the accuracy score. This can be explained by the fact that the AUROC reflects the model's ability to rank data with different labels consistently (i.e. the scores of class 0 are generally higher or lower than those of class 1), and does not depend explicitly on the model's ability to assign samples to the correct class. The overall accuracy is measured at a single operating point, and reflects the model's ability to place data into the proper category for a specific threshold.
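The difference between the two metrics can be illustrated with a small example. In the sketch below the scores and labels are made up, and the AUROC is computed with the pairwise-ranking definition (the fraction of positive/negative pairs that are ordered correctly).

```python
def accuracy(labels, scores, threshold):
    """Fraction of samples classified correctly at one fixed threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def auroc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.40]  # made-up scores, ranked perfectly but all low

auc = auroc(labels, scores)
acc_default = accuracy(labels, scores, threshold=0.5)
acc_tuned = accuracy(labels, scores, threshold=0.25)
```

Here the ranking is perfect, so the AUROC is 1.0, yet the accuracy at a threshold of 0.5 is only 0.5 because every sample is assigned to class 0. Moving the threshold to 0.25 recovers full accuracy without changing the AUROC.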

4.5 Future Work

In this project we aimed to analyze the power of hand-crafted Radiomics features and of different combinations of feature categories and filters. The classifiers used in this work are just a few examples of binary classifiers, together with a simple implementation of a neural network (MLP). Therefore, developing a more advanced neural network, or perhaps some type of auto-encoder, could significantly affect the outcome. Moreover, we extracted first order statistical features in combination with the wavelet and LoG filters. Another trial could be to first apply the filters and then extract the grey level features (GLCM, GLRLM and GLSZM) once again. Regarding the optimization of the learning models, we could also focus on the prediction errors, such as the bias-variance trade-off, or reduce the dimensionality with Principal Component Analysis (PCA).


5 Conclusions

This project is mainly a comprehensive comparison between categories of radiomics features, dimensionality reduction methods and different types of classification models, evaluated on CT lung nodule images. Overall, 1398 images, each described by 424 radiomic features, were normalized, cross-validated, oversampled and fed into six different classifiers. The final outcome shows that the Random Forest classifier has the highest ability to learn and predict from a large dataframe. We can also conclude that radiomic features visibly influence the performance of binary classifiers and that, as a whole feature set, they provide a better result than both the clinical features and the individual groups of radiomics features. To summarize the achievements of this project, the following conclusions can be made:

• In this implementation, Random Forest was the best-performing classifier, followed by Naïve Bayes

• Among the feature categories, first order features extracted with the Wavelet and LoG filters form the most informative category

• Wavelet/LoG features are the group most often picked by the feature selection methods

• Among the feature selection methods, Gini Index and RFS succeeded in selecting an informative subset

To improve this project, several approaches can be considered, such as combining filters and feature categories, or extracting deep features in addition to radiomic features. Overall, this work can be an initial step towards investigating the usability of radiomics in image classification and can be generalized to further applications.


References

[1] C.-H. Chen, C.-K. Chang, C.-Y. Tu, W.-C. Liao, B.-R. Wu, K.-T. Chou, Y.-R.

Chiou, S.-N. Yang, G. Zhang, and T.-C. Huang, “Radiomic features analysis in

computed tomography images of lung nodule classification,” PloS one, vol. 13,

no. 2, p. e0192002, 2018.

[2] P. Lambin, E. Rios-Velazquez, R. Leijenaar, S. Carvalho, R. G. Van Stiphout,

P. Granton, C. M. Zegers, R. Gillies, R. Boellard, A. Dekker et al., “Radiomics:

extracting more information from medical images using advanced feature

analysis,” European journal of cancer, vol. 48, no. 4, pp. 441–446, 2012.

[3] H. J. Aerts, E. R. Velazquez, R. T. Leijenaar, C. Parmar, P. Grossmann,

S. Carvalho, J. Bussink, R. Monshouwer, B. Haibe-Kains, D. Rietveld et al.,

“Decoding tumour phenotype by noninvasive imaging using a quantitative

radiomics approach,” Nature communications, vol. 5, p. 4006, 2014.

[4] B. Zhao, Y. Tan, W.-Y. Tsai, J. Qi, C. Xie, L. Lu, and L. H. Schwartz,

“Reproducibility of radiomics for deciphering tumor phenotype with imaging,”

Scientific reports, vol. 6, p. 23428, 2016.

[5] P. B. Bach, J. N. Mirkin, T. K. Oliver, C. G. Azzoli, D. A. Berry, O. W.

Brawley, T. Byers, G. A. Colditz, M. K. Gould, J. R. Jett et al., “Benefits and

harms of ct screening for lung cancer: a systematic review,” Jama, vol. 307,

no. 22, pp. 2418–2429, 2012.

[6] D. M. Hansell, A. A. Bankier, H. MacMahon, T. C. McLoud, N. L. Muller, and

J. Remy, “Fleischner society: glossary of terms for thoracic imaging,” Radiology,

vol. 246, no. 3, pp. 697–722, 2008.

[7] 2019. [Online]. Available:

https://my.clevelandclinic.org/health/diseases/14799-pulmonary-nodules

[8] W. Li, P. Cao, D. Zhao, and J. Wang, “Pulmonary nodule classification

with deep convolutional neural networks on computed tomography images,”

Computational and mathematical methods in medicine, vol. 2016, 2016.


[9] J. J. Van Griethuysen, A. Fedorov, C. Parmar, A. Hosny, N. Aucoin,

V. Narayan, R. G. Beets-Tan, J.-C. Fillion-Robin, S. Pieper, and H. J. Aerts,

“Computational radiomics system to decode the radiographic phenotype,”

Cancer research, vol. 77, no. 21, pp. e104–e107, 2017.

[10] S. Rizzo, F. Botta, S. Raimondi, D. Origgi, C. Fanciullo, A. G. Morganti,

and M. Bellomi, “Radiomics: the facts and the challenges of image analysis,”

European radiology experimental, vol. 2, no. 1, p. 36, 2018.

[11] N. Aggarwal and R. Agrawal, “First and second order statistics features for

classification of magnetic resonance brain images,” Journal of Signal and

Information Processing, vol. 3, no. 02, p. 146, 2012.

[12] S. Liqin, S. Dinggang, and Q. Feihu, “Edge detection on real time using log

filter,” in Proceedings of ICSIPNN’94. International Conference on Speech,

Image Processing and Neural Networks. IEEE, 1994, pp. 37–40.

[13] G. Castellano, L. Bonilha, L. Li, and F. Cendes, “Texture analysis of medical

images,” Clinical radiology, vol. 59, no. 12, pp. 1061–1069, 2004.

[14] A. Zwanenburg, S. Leger, M. Vallieres, S. Lock et al., “Image biomarker

standardisation initiative,” arXiv preprint arXiv:1612.07003, 2016.

[15] R. W. Conners and C. A. Harlow, “Some theoretical considerations concerning

texture analysis of radiographic images,” in 1976 IEEE Conference on Decision

and Control including the 15th Symposium on Adaptive Processes. IEEE, 1976,

pp. 162–167.

[16] R. M. Haralick, K. Shanmugam et al., “Textural features for image

classification,” IEEE Transactions on systems, man, and cybernetics, no. 6,

pp. 610–621, 1973.

[17] G. H. John, R. Kohavi, and K. Pfleger, “Irrelevant features and the subset

selection problem,” in Machine Learning Proceedings 1994. Elsevier, 1994, pp.

121–129.

[18] S. Das, “Filters, wrappers and a boosting-based hybrid for feature selection,”

in Icml, vol. 1, 2001, pp. 74–81.


[19] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial

intelligence, vol. 97, no. 1-2, pp. 273–324, 1997.

[20] S. Maldonado and R. Weber, “A wrapper method for feature selection using

support vector machines,” Information Sciences, vol. 179, no. 13, pp. 2208–

2217, 2009.

[21] A. K. Jain and K. Karu, “Texture analysis: Representation and matching,” in

International Conference on Image Analysis and Processing. Springer, 1995,

pp. 2–10.

[22] Q. Gu, Z. Li, and J. Han, “Generalized fisher score for feature selection,” arXiv

preprint arXiv:1202.3725, 2012.

[23] B. C. Ross, “Mutual information between discrete and continuous data sets,”

PloS one, vol. 9, no. 2, p. e87357, 2014.

[24] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote:

synthetic minority over-sampling technique,” Journal of artificial intelligence

research, vol. 16, pp. 321–357, 2002.

[25] A. Bosch, A. Zisserman, and X. Munoz, “Image classification using random

forests and ferns,” in 2007 IEEE 11th international conference on computer

vision. IEEE, 2007, pp. 1–8.

[26] S. Sahu, M. Prasad, and B. Tripathy, “A support vector machine binary

classification and image segmentation of remote sensing data of chilika lagoon.”

[27] T. Li, C. Zhang, and M. Ogihara, “A comparative study of feature selection

and multiclass classification methods for tissue classification based on gene

expression,” Bioinformatics, vol. 20, no. 15, pp. 2429–2437, 2004.

[28] M. W. Gardner and S. Dorling, “Artificial neural networks (the multilayer

perceptron) - a review of applications in the atmospheric sciences,” Atmospheric

environment, vol. 32, no. 14-15, pp. 2627–2636, 1998.

[29] F. Bray, J. Ferlay, I. Soerjomataram, R. L. Siegel, L. A. Torre, and A. Jemal,

“Global cancer statistics 2018: Globocan estimates of incidence and mortality

worldwide for 36 cancers in 185 countries,” CA: a cancer journal for clinicians,

vol. 68, no. 6, pp. 394–424, 2018.


[30] D. Kumar, A. Wong, and D. A. Clausi, “Lung nodule classification using deep

features in ct images,” in 2015 12th Conference on Computer and Robot Vision.

IEEE, 2015, pp. 133–138.

[31] T. P. Coroller, P. Grossmann, Y. Hou, E. R. Velazquez, R. T. Leijenaar,

G. Hermann, P. Lambin, B. Haibe-Kains, R. H. Mak, and H. J. Aerts, “Ct-

based radiomic signature predicts distant metastasis in lung adenocarcinoma,”

Radiotherapy and Oncology, vol. 114, no. 3, pp. 345–350, 2015.

[32] H. MacMahon, J. H. Austin, G. Gamsu, C. J. Herold, J. R. Jett, D. P.

Naidich, E. F. Patz Jr, and S. J. Swensen, “Guidelines for management of

small pulmonary nodules detected on ct scans: a statement from the fleischner

society,” Radiology, vol. 237, no. 2, pp. 395–400, 2005.

[33] M. P. Rivera and A. C. Mehta, “Initial diagnosis of lung cancer: Accp evidence-

based clinical practice guidelines,” Chest, vol. 132, no. 3, pp. 131S–148S, 2007.

[34] H. Lee and Y.-P. P. Chen, “Image based computer aided diagnosis

system for cancer detection,” Expert Systems with Applications,

vol. 42, no. 12, pp. 5356 – 5365, 2015. [Online]. Available:

http://www.sciencedirect.com/science/article/pii/S0957417415000986

[35] K.-L. Hua, C.-H. Hsu, S. C. Hidayati, W.-H. Cheng, and Y.-J. Chen,

“Computer-aided classification of lung nodules on computed tomography

images via deep learning technique,” OncoTargets and therapy, vol. 8, 2015.

[36] A. A. Abdullah and S. M. Shaharum, “Lung cancer cell classification method

using artificial neural network,” information engineering letters, vol. 2, no. 1,

2012.

[37] C. Parmar, P. Grossmann, J. Bussink, P. Lambin, and H. J. Aerts, “Machine

learning methods for quantitative radiomic biomarkers,” Scientific reports,

vol. 5, p. 13087, 2015.

[38] J.-G. Lee, S. Jun, Y.-W. Cho, H. Lee, G. B. Kim, J. B. Seo, and N. Kim, “Deep

learning in medical imaging: general overview,” Korean journal of radiology,

vol. 18, no. 4, pp. 570–584, 2017.


[39] M. Mohri, A. Rostamizadeh, and A. Talwalkar, “Foundations of machine

learning. ch. 1, 1–3,” 2012.

[40] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep

belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[41] M. Avanzo, J. Stancanello, and I. E. Naqa, “Beyond imaging: The promise of

radiomics,” Physica Medica, vol. 38, pp. 122 – 139, 2017. [Online]. Available:

http://www.sciencedirect.com/science/article/pii/S1120179717301874

[42] R. Wilson and A. Devaraj, “Radiomics of pulmonary nodules and lung cancer,”

Translational lung cancer research, vol. 6, no. 1, p. 86, 2017.

[43] R. J. Gillies, P. E. Kinahan, and H. Hricak, “Radiomics: images are more than

pictures, they are data,” Radiology, vol. 278, no. 2, pp. 563–577, 2015.

[44] Y. Huang, Z. Liu, L. He, X. Chen, D. Pan, Z. Ma, C. Liang, J. Tian, and

C. Liang, “Radiomics signature: a potential biomarker for the prediction of

disease-free survival in early-stage (i or ii) nonsmall cell lung cancer,” Radiology,

vol. 281, no. 3, pp. 947–957, 2016.

[45] P. Lambin, R. T. Leijenaar, T. M. Deist, J. Peerlings, E. E. De Jong,

J. Van Timmeren, S. Sanduleanu, R. T. Larue, A. J. Even, A. Jochems et al.,

“Radiomics: the bridge between medical imaging and personalized medicine,”

Nature Reviews Clinical Oncology, vol. 14, no. 12, p. 749, 2017.

[46] V. Kumar, Y. Gu, S. Basu, A. Berglund, S. A. Eschrich, M. B. Schabath,

K. Forster, H. J. Aerts, A. Dekker, D. Fenstermacher et al., “Radiomics: the

process and the challenges,” Magnetic resonance imaging, vol. 30, no. 9, pp.

1234–1248, 2012.

[47] R. Thawani, M. McLane, N. Beig, S. Ghose, P. Prasanna, V. Velcheti, and

A. Madabhushi, “Radiomics and radiogenomics in lung cancer: a review for

the clinician,” Lung Cancer, vol. 115, pp. 34–41, 2018.

[48] G. Lee, H. Y. Lee, H. Park, M. L. Schiebler, E. J. van Beek, Y. Ohno, J. B.

Seo, and A. Leung, “Radiomics and its emerging role in lung cancer research,

imaging biomarkers and clinical management: state of the art,” European

journal of radiology, vol. 86, pp. 297–307, 2017.


[49] F. Han, H. Wang, G. Zhang, H. Han, B. Song, L. Li, W. Moore, H. Lu,

H. Zhao, and Z. Liang, “Texture feature analysis for computer-aided diagnosis

on pulmonary nodules,” Journal of digital imaging, vol. 28, no. 1, pp. 99–115,

2015.

[50] S. Hawkins, H. Wang, Y. Liu, A. Garcia, O. Stringfield, H. Krewer, Q. Li,

D. Cherezov, R. A. Gatenby, Y. Balagurunathan et al., “Predicting malignant

nodules from screening ct scans,” Journal of Thoracic Oncology, vol. 11, no. 12,

pp. 2120–2128, 2016.

[51] R. T. Larue, G. Defraene, D. De Ruysscher, P. Lambin, and W. Van Elmpt,

“Quantitative radiomics studies for tissue characterization: a review of

technology and methodological procedures,” The British journal of radiology,

vol. 90, no. 1070, p. 20160665, 2017.

[52] B. Zhang, X. He, F. Ouyang, D. Gu, Y. Dong, L. Zhang, X. Mo, W. Huang,

J. Tian, and S. Zhang, “Radiomic machine-learning classifiers for prognostic

biomarkers of advanced nasopharyngeal carcinoma,” Cancer letters, vol. 403,

pp. 21–27, 2017.

[53] C. Parmar, P. Grossmann, D. Rietveld, M. M. Rietbergen, P. Lambin, and H. J.

Aerts, “Radiomic machine-learning classifiers for prognostic biomarkers of head

and neck cancer,” Frontiers in oncology, vol. 5, p. 272, 2015.

[54] P. Yin, N. Mao, C. Zhao, J. Wu, C. Sun, L. Chen, and N. Hong, “Comparison of

radiomics machine-learning classifiers and feature selection for differentiation of

sacral chordoma and sacral giant cell tumour based on 3d computed tomography

features,” European radiology, pp. 1–7, 2018.

[55] C. Parmar, R. T. Leijenaar, P. Grossmann, E. R. Velazquez, J. Bussink,

D. Rietveld, M. M. Rietbergen, B. Haibe-Kains, P. Lambin, and H. J. Aerts,

“Radiomic feature clusters and prognostic signatures specific for lung and head

& neck cancer,” Scientific reports, vol. 5, p. 11044, 2015.


Appendices

A Appendix - State of the Art

A.1 Introduction

The following article reviews state-of-the-art advancements in deep learning methods used for the classification of lung nodules.

Medical imaging systems are broadly useful for gaining better insight into internal organs and their ambiguities. They also provide information for prognosis, prediction of future progress and cancer detection. They play an important role in treatment selection, since the treatment plan for a benign nodule differs from that for a malignant one.

While there is a variety of imaging modalities for different purposes (imaging hard tissue, soft tissue, blood flow, etc.), Computed Tomography (CT) has shown outstanding capabilities in cancer detection and tumor characterization. Recent advancements in the field of computer vision and image analysis have made it possible to extract more information from 2-dimensional and 3-dimensional images.

Extracting features such as the texture, intensity, size and shape of a tumor can give an estimate of the probability that the tumor is benign or malignant [2]. Radiomics, a recent advancement in the imaging field, is used to obtain high-dimensional features from images and to use them in a predictive model for diagnosis, differentiating between malignant and benign tumors. Quantitative image features, also called "radiomic features", can provide richer information about the intensity, shape, size or volume, and texture of the tumor phenotype that is distinct from or complementary to that provided by clinical reports, laboratory test results, and genomic or proteomic assays [3]. Research has shown that Radiomics can be a promising approach for obtaining information about the internal heterogeneity of the tumor, which is related to genetic patterns. This approach helps to personalize the treatment of patients according to their tumor prognosis [4].


The content is organized as follows. First, a general overview is given of the importance of diagnosing lung tumours, the needs in this field and the most common solutions offered. Second, recent advancements in computer aided diagnosis or detection (CAD) using machine learning algorithms are presented, together with their combination with genomics studies, which has resulted in the new era of "Radiomics". This is followed by a general review of Radiomics: the basics, its structure and recent advancements.

A.2 Clinical definition of Lung Nodules

A pulmonary (lung) nodule is defined as a rounded or oval-shaped growth in the lungs, and is also known as a lesion. Nodules in the lung usually have a diameter of less than three centimeters. Growths with a diameter of more than three centimeters are termed a "mass" and are more likely to indicate cancer. Most nodules are noncancerous (benign). Risk factors for malignant pulmonary nodules include a history of smoking and older age [7].

A.3 Background on Lung cancer diagnosis

According to the recent status report of the American Cancer Society [29], lung cancer has been the leading cause of cancer death among both male and female patients, and overall it was the most commonly diagnosed type of cancer in 2018. Although several factors can be involved in the diagnosis, the mortality rate of this type of cancer has been reported to be over 17% [30]. This category of cancer has two major types: non-small cell lung cancer (non-SCLC) and small cell lung cancer (SCLC). Non-SCLC is the most frequently diagnosed type, with a share of 85-90% of all lung cancer cases [31].

Moreover, because lung cancer is difficult to diagnose at early stages (small lesions are often undetectable by X-rays), by the time the tumor is detected the patient is often already in an advanced stage and little can be done. Meanwhile, these nodules can be detected more easily by Computed Tomography (CT) scans of the lungs, and this technology improves with every new generation of medical imaging equipment [32]. Therefore, it is important to emphasize the role of radiologists and the considerable effort they contribute to identifying and labeling nodules as benign or malignant [30].

During the past 5 years it has been shown that a chest CT scan can reveal valuable information about the morphological and biological characteristics of lung nodules [32]. According to a guideline by the American College of Chest Physicians (CHEST) [33], several actions may be taken to investigate the malignancy of a lung nodule. Depending on the type of lung cancer (SCLC or non-SCLC) and on the size and location of the tumor, several techniques can be considered. While biopsy has been mentioned as a promising yet invasive diagnostic solution for other types of cancer, scientists have come to a different conclusion in the case of lung cancer. The main purpose of choosing a suitable modality is to use it for both diagnosis and prognosis of the tumor, while avoiding invasive tests as far as possible [33]. Therefore, the many applications of imaging modalities, with an emphasis on CT (which is capable of quantifying the tumor's intensity [31]), are being investigated in clinical oncology, from diagnosis to personalizing the treatment of patients [2].

A.4 Previous computer vision algorithms used in nodule classification

As previously mentioned, labeling each nodule according to its malignancy can be time-consuming and overwhelming for radiologists, given the large number of cases each of them handles. Computer vision algorithms can therefore play an influential role in processing the data and supporting a quick decision [30]. These algorithms are called Computer Aided Diagnosis (CAD) systems, and they are categorized into detection (CADe) and diagnosis (CADx) systems. The focus of this project is on CADx algorithms that distinguish between benign and malignant nodules.

In this technology, after the medical images are acquired, they undergo several software-based pre-processing stages that segment the suspected cancerous area from the background (using thresholds, level sets, ...). Basic features such as shape, size and texture are then calculated and extracted from the segmented region [34]. This process is followed by classification, and the results are evaluated by an accuracy percentage. The result of each step in this process usually depends heavily on the previous step and on the initial features extracted from the images [35]. The current focus in the CAD and deep learning field is to automate the extraction of features from the images and to build an uncomplicated pipeline that handles the processing and pattern recognition steps [35].
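The segment-then-extract part of such a pipeline can be sketched on a toy image. The example below uses a fixed intensity threshold, the simplest of the segmentation options mentioned above, and computes a few basic size and intensity features. The pixel values, the threshold and the feature names are all made up for illustration.

```python
# Toy 5x5 "CT slice" with made-up intensities; the bright blob plays the nodule.
image = [
    [10, 12, 11, 13, 10],
    [11, 80, 85, 12, 11],
    [12, 82, 90, 84, 10],
    [10, 11, 83, 12, 13],
    [11, 10, 12, 11, 10],
]
THRESHOLD = 50  # hypothetical cut-off separating nodule from background

def segment(image, threshold):
    """Return a binary mask: 1 where the intensity exceeds the threshold."""
    return [[1 if px > threshold else 0 for px in row] for row in image]

def basic_features(image, mask):
    """Size and first-order intensity features of the masked region."""
    pixels = [px for img_row, mask_row in zip(image, mask)
              for px, m in zip(img_row, mask_row) if m]
    if not pixels:
        return {"area": 0, "mean_intensity": None, "max_intensity": None}
    return {
        "area": len(pixels),
        "mean_intensity": sum(pixels) / len(pixels),
        "max_intensity": max(pixels),
    }

mask = segment(image, THRESHOLD)
features = basic_features(image, mask)
```

The resulting feature dictionary is what a classifier would receive in the next step, which is why errors in the segmentation propagate to every later stage.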

In previous studies, feed-forward neural networks with a small number of features have been used to classify nodules in X-ray images [36]. Moreover, binary decision tree classifiers (labels 0 and 1) have been applied to classify more than 2000 nodules from 157 patients (the data were provided by the National Cancer Institute Lung Image Database, NCI-LIDC) to investigate the performance of such systems, reaching an accuracy of 75.01% [30].

A.5 Background

To introduce the science and basics of Radiomics, the definition and functionality of "machine learning" first need to be discussed. Machine learning refers to computational models and algorithms that use data to improve their own performance or to increase the accuracy of their predictions [37]. A model is trained on a batch of labeled data called the "training set", from which it learns patterns and correlations in the data, and it is then examined on unseen data known as the "test set". Learning algorithms are generally categorized as "supervised" or "unsupervised" [38]. Regarding tumor classification, studies have been done on deep learning, a special type of machine learning that resembles the human cognition system and has also gained attention in healthcare big data. According to the overview in [38], the Naïve Bayes (NB) model has been used as a typical classification algorithm since it performs considerably well, and the Support Vector Machine (SVM) has been the most popular one. The Artificial Neural Network (ANN), one of the well-known networks for classification and regression, has shown great performance in different areas despite some drawbacks in optimization and over-fitting [39]. One study focuses on the unsupervised restricted Boltzmann machine as a deeper architecture than ANNs and reports an improvement in both over-fitting and optimization [40].

With deep learning and personalized treatment gaining attention over the past decade, studies have focused on the correlations between extracted features and on the possible relation between these features and certain drug responses [41]. Since 2010, research on this topic has been formalized by adopting the term "Radiomics", which consists of two parts: "Radio", as in radiology, refers to the image acquisition method, and "omics" comes from the term "genomics", the science of studying human genes. In this concept, radiomics is used to study the quantitative features derived from images in order to predict the future status of a suspected nodule [42], [43].

From a technical point of view, radiomics is the conversion of medical images into high-dimensional data, which can then be studied and processed to extract information.

According to [41], there are two main aspects to radiomics. The first is the number of extracted features used in the model, which is considerably larger (hundreds to thousands) than in conventional CAD algorithms. The second is the scope of investigation. The features extracted in this approach can be investigated for diagnostic and prognostic clinical applications, which means that the final treatment of the patient can be personalized based on their individual status [43] and that richer information is obtained regarding the shape, size or texture of the detected tumor [41].

Compared with conventional CAD systems, which focused on a binary answer (whether or not there is a lesion), radiomics broadens the scope by delivering an output that can not only support the radiologist's decision but can also be combined with the specific characteristics of each patient. The full set of features assigned to a given patient and holding prognostic details is called the "radiomic signature" [43].


For instance, heterogeneity observed in an image can be linked to genomic heterogeneity, which may indicate a worsening prognosis, since heterogeneity in tumors can make them more resistant to treatment [41], [44].

A.6 Radiomics Workflow

A.6.1 Data Acquisition

Among all medical imaging modalities, CT has shown promising results in the assessment of textural features of tumors, and it is the modality most frequently suggested by radiologists, especially in lung cancer cases [41].

A.6.2 Segmentation of Region of Interest

The second step is to delineate the Region of Interest (ROI) and choose a prediction target (i.e. malignancy). This step is critical, since the features will be generated from the segmented volume and tumours often have a vague boundary that is hard to delineate. For this purpose, radiomics analysis should be done on a sub-region such as a metastatic lesion or a cancerous tumour [45]. Segmentation of the tumour is mostly done by an in-house algorithm and can be semi-automated. There are also cases in which radiologists do the segmentation manually [4].

Several algorithms can be used for segmentation. The most popular ones mentioned in [46] are region growing methods (which are fast but sensitive to image noise), the level set method, active contours, etc. However, no single algorithm works efficiently for all imaging modalities. To calculate the features in the next step, a segmentation mask must be defined, consisting of the voxels located within the ROI. The segmented region should include two masks: an intensity mask and a morphological mask [14].
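As a rough illustration of the region growing idea mentioned above, the following Python sketch grows a binary mask from a seed pixel on a toy 2D slice. The function name region_grow, the 4-connectivity, the fixed intensity tolerance and the toy intensity values are all assumptions made for illustration; they are not taken from [46] or from the pipeline used in this work.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Return a binary mask of the pixels connected to `seed` (4-connectivity)
    whose intensity differs from the seed intensity by at most `tolerance`."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    mask = [[0] * cols for _ in range(rows)]
    mask[seed[0]][seed[1]] = 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not mask[nr][nc]:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    mask[nr][nc] = 1
                    queue.append((nr, nc))
    return mask


# Toy slice (hypothetical values): a bright 2x2 "nodule" on a dark background
# plus one isolated bright pixel in the bottom-right corner.
toy_slice = [
    [10, 12, 11, 10, 10],
    [11, 90, 95, 10, 12],
    [10, 92, 88, 11, 10],
    [12, 10, 11, 10, 91],
]
roi_mask = region_grow(toy_slice, seed=(1, 1), tolerance=10)
```

On this toy slice the mask covers the four connected bright pixels and leaves out the isolated bright pixel, because it is not connected to the seed. A noisy pixel adjacent to the region would be absorbed, which is one way to see the noise sensitivity noted above.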

A.6.3 Feature Extraction

After segmentation, sets of quantitative features are extracted from each tumor; they describe different properties of it, such as nodule volume, nodule shape


and intensity patterns. Features are divided into two main categories, "agnostic" and "semantic". Semantic features are assessed visually and defined by radiologists to describe the ROI, and they can be used to build a classifier, whereas agnostic features focus on the heterogeneity and quantitative properties of the lesion [34]. Examples of semantic features are size, shape, location, vascularity and necrosis. Apart from the semantic ones, the remaining features are divided into a couple of subgroups, as follows [47], [46]:

The first subgroup includes the basic features of the tumor, such as shape, volume and derived measurements (area-to-volume ratio, first-order histograms, compactness, etc.); they are calculated in 3D rather than per slice. One study in this field shows that shape features can be used to distinguish malignant from benign tumors by analysing the surface-area-to-volume ratio [48].

• Surface Area

• Surface to Volume Ratio

• Volume

• Sphericity

• Area Density

• ...
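To make two of the listed shape features concrete, the sketch below computes the surface-to-volume ratio and the sphericity from a given surface area and volume. It assumes the commonly used definition of sphericity (the surface area of a sphere with the same volume divided by the actual surface area); the function names and the example geometries are illustrative assumptions, and in practice the area and volume would be estimated from the segmentation mask.

```python
import math


def surface_to_volume_ratio(surface_area, volume):
    """Surface area divided by volume (units: 1/length)."""
    return surface_area / volume


def sphericity(surface_area, volume):
    """Assumed definition: pi^(1/3) * (6V)^(2/3) / A.
    Equals 1 for a perfect sphere and is smaller for any other shape."""
    return (math.pi ** (1 / 3)) * ((6 * volume) ** (2 / 3)) / surface_area


# Hypothetical example shapes: a sphere of radius 5 mm and a cube of side 10 mm.
radius_mm = 5.0
sphere_area = 4 * math.pi * radius_mm ** 2
sphere_volume = 4 / 3 * math.pi * radius_mm ** 3
cube_area, cube_volume = 6 * 10.0 ** 2, 10.0 ** 3

sphere_sphericity = sphericity(sphere_area, sphere_volume)
cube_sphericity = sphericity(cube_area, cube_volume)
sphere_sv_ratio = surface_to_volume_ratio(sphere_area, sphere_volume)
```

For the sphere the surface-to-volume ratio is 3/r, so it decreases as the nodule grows, while sphericity is scale-independent and only reflects how far the shape departs from a sphere.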

The next subgroup of features was first introduced by [16] and describes the distribution of intensity levels across voxels. One of the most common descriptors in this subgroup is the Local Binary Pattern (LBP), which turns an image into an array of binary codes by comparing each pixel with its neighbors; the dimension of this array is then reduced based on the relative binary values of neighboring pixels, and the texture variation is described by the histogram of the codes over the whole image [49].
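The LBP idea can be sketched on a small 2D array as follows. This is a minimal version of the basic 3x3 operator, assuming that a neighbor contributes a 1 bit when its value is greater than or equal to the center pixel; the function names, the neighbor ordering and the toy images are assumptions for illustration and may differ from the variant described in [49].

```python
def lbp_codes(image):
    """Basic 3x3 LBP: each interior pixel gets an 8-bit code, one bit per
    neighbor, set when the neighbor is >= the center pixel."""
    # Neighbor offsets in clockwise order starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def lbp_histogram(codes):
    """256-bin histogram of LBP codes, used as the texture descriptor."""
    hist = [0] * 256
    for row in codes:
        for code in row:
            hist[code] += 1
    return hist


# Toy images (hypothetical): a perfectly flat patch and a single bright peak.
flat = [[5] * 4 for _ in range(4)]
flat_hist = lbp_histogram(lbp_codes(flat))
peak = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]
peak_codes = lbp_codes(peak)
```

A flat patch puts all interior pixels into a single histogram bin, whereas a heterogeneous texture spreads the codes over many bins; this spread is what the histogram captures.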

A more sophisticated method built on the same basis is the application of a grey-level scale to the pixels. While the role of these features in cancer prognosis is still a matter of assumption, Hawkins et al. [50] built a model with 23 features to predict the status of a nodule over the following 1-2 years and reached an accuracy of 80%. Furthermore, software packages have recently been released for


automatic radiomics feature extraction, such as ”Imaging Biomarker Explorer

(IBEX)” [41].

A.6.4 Feature Selection

This step depends on the number of selected feature categories and parameters. The total number of extractable features is virtually unlimited, some of the features are correlated, and taking all of them into account leads to over-fitting and reduced model performance. To avoid over-fitting and keep the model robust in the presence of such large variability, the most suitable features must be picked with the aid of dimensionality reduction or feature selection techniques [45].

According to the review in [51], filter-based selection techniques are often used to pick the most informative features, with emphasis on a proper classification method. Aerts et al. [3] used a total of 647 images of lung tumors (training and testing) from 1019 patients and extracted 440 features in total, covering first-order, shape and texture features, to show that choosing features based on stability and reproducibility can also result in picking the most informative features.

In contrast, [41] considers the most straightforward method to be a scoring system in which features are graded by stability or correlation and the worst-ranked ones are then omitted. The drawback of such an algorithm is that the dependency between features sometimes actually increases the performance and accuracy of the model.
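A scoring system of this kind could look like the following sketch. It assumes, purely for illustration, that the score of a feature is the absolute Pearson correlation between the feature and the class label; the function names, the feature names and the toy values are hypothetical, and [41] may grade features differently (for example by stability across repeated scans).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_and_keep(features, labels, keep):
    """Score every feature, rank by score and keep the `keep` best ones."""
    scores = {name: abs(pearson(values, labels))
              for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores


# Toy data (hypothetical): six nodules, label 1 = malignant.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "volume": [1.0, 1.2, 0.9, 3.1, 2.8, 3.3],
    "sphericity": [0.9, 0.8, 0.85, 0.5, 0.6, 0.55],
    "noise": [0.3, 0.9, 0.1, 0.2, 0.8, 0.4],
    "constant": [1.0] * 6,
}
kept, scores = rank_and_keep(features, labels, keep=2)
```

Because each feature is scored on its own, two features that are only informative in combination would both be ranked low and dropped, which is the drawback described above.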

In [52], six feature selection methods were analyzed based on their statistical approaches: Logistic Regression, Support Vector Machine (SVM), Random Forest (RF), Distance Correlation (DC), Elastic Net Logistic Regression (EN-LOG) and Sure Independence Screening (SIS). One of the most frequently discussed methods is Minimum Redundancy Maximum Relevance (mRMR) [31], [37], [53], which calculates the mutual information (MI) between the features of a given set and the outcome. mRMR then ranks the features in decreasing order of relevance while also minimizing the average MI among the selected features.
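The greedy selection behind mRMR can be sketched as below for discrete features. The sketch assumes the difference form of the criterion (relevance minus mean redundancy) and a plug-in MI estimate from counts; the function names, the feature names and the toy data are hypothetical, and the implementations in [31], [37], [53] may use a different variant.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in estimate of the MI (in bits) between two discrete sequences."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), n_ab in count_xy.items():
        mi += (n_ab / n) * math.log2(n_ab * n / (count_x[a] * count_y[b]))
    return mi


def mrmr_select(features, target, k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    relevance = {name: mutual_information(values, target)
                 for name, values in features.items()}
    selected = []
    remaining = sorted(features)  # sorted so that ties are broken by name
    while remaining and len(selected) < k:
        best_name, best_score = None, float("-inf")
        for name in remaining:
            if selected:
                redundancy = sum(
                    mutual_information(features[name], features[s])
                    for s in selected) / len(selected)
            else:
                redundancy = 0.0
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        remaining.remove(best_name)
    return selected


# Toy data (hypothetical): eight nodules with binarized features.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "texture_a": [0, 0, 0, 1, 1, 1, 1, 1],
    "texture_a_copy": [0, 0, 0, 1, 1, 1, 1, 1],  # fully redundant duplicate
    "shape_c": [0, 1, 0, 0, 1, 0, 1, 1],
}
selected = mrmr_select(features, target, k=2)
```

In this toy case the duplicate of the first selected feature is passed over in favor of a less relevant but less redundant one, which is the behavior that distinguishes mRMR from a plain relevance ranking.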


A.7 Radiomics Classifiers (Mathematical Model)

The next step is to define a model and build a classifier based on the radiomic features. ML classifiers are generally divided into two main groups: supervised and unsupervised. Supervised classifiers are given two kinds of data: a training set of examples, each represented as an input vector, and the known outcome for each example. This approach may suffer from over-fitting, meaning that the model bases its decisions mostly on noise in the images rather than on the underlying data.

Some of the most widely used supervised classifiers are Logistic Regression Kernel Support (the simplest one), Random Forest (RF), Support Vector Machine (SVM), Artificial Neural Network (ANN), K-Nearest Neighbor (KNN) and Naïve Bayes (NB). These methods were compared in a couple of studies [41], [52], in which performance was evaluated by test error and Area Under the Curve (AUC). The highest AUC and the lowest test error were obtained with an RF feature selector followed by an RF classifier (AUC 0.8464), with RF + AdaBoost in second place (AUC 0.8204). The classifiers are often trained with 10-fold cross-validation on the training set [54].
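The index bookkeeping behind 10-fold cross-validation can be sketched as follows. The function name k_fold_indices, the shuffling seed and the sample count are assumptions for illustration; the sketch only produces the training and validation index sets and leaves the choice of classifier open.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation.
    Every sample appears in exactly one validation fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation


# Hypothetical training set of 25 nodules split into 10 folds.
splits = list(k_fold_indices(n_samples=25, k=10))
```

Each model is fitted on the training indices of a split and scored on its validation indices; averaging the ten scores gives the cross-validated estimate. With class-imbalanced data a stratified split, which this sketch does not implement, is usually preferable.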

Unsupervised classifiers take an additional a priori variable (i.e. survival rate) instead of divided sets of data. This method aims to find a type of cancer among a database of patients. For instance, consensus clustering is used in [55] to form fewer clusters from high-dimensional data. In that study, a space of 440 features was reduced to 13 non-redundant features.

A.8 Performance Evaluation

A.8.1 Area Under the Receiver Operating Characteristics (AUROC)

As a performance measure for classification, this approach can be evaluated at different settings (thresholds). The Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) at each threshold. This plot shows how well the model separates and distinguishes between the categories. ROC is a probability curve, and AUC is the measure of separability: it tells how capable the model is of distinguishing between classes. The larger the area under the curve, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with and without the disease.

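$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{FPR} = 1 - \mathrm{Specificity} = \frac{FP}{TN + FP}$$

Here TP, FN, TN and FP denote the numbers of true positives, false negatives, true negatives and false positives.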
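As a worked example, the sketch below sweeps the decision threshold over a small set of classifier scores, computes TPR = TP / (TP + FN) and FPR = FP / (FP + TN) at each threshold, and integrates the resulting ROC curve with the trapezoidal rule. The function names and the toy labels and scores are assumptions made for illustration only.

```python
def roc_points(labels, scores):
    """Return the ROC curve as a list of (FPR, TPR) points, obtained by
    lowering the decision threshold over all distinct scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a curve given as (x, y) points, by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example (hypothetical): label 1 = disease, higher score = more suspicious.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

In this toy example eight of the nine (diseased, healthy) pairs are ranked correctly, and the area under the curve is correspondingly 8/9; an AUC of 1 would mean every diseased case scores above every healthy one, and 0.5 corresponds to random ranking.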


TRITA CBH-GRU-2019:109

www.kth.se
