DEGREE PROJECT IN BIOTECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Artificial intelligence for segmentation of nuclei from transmitted images

NORAH KLINTBERG SAKAL

KTH SCHOOL OF CHEMISTRY, BIOTECHNOLOGY AND HEALTH



Table of contents

Abstract
Introduction
    Background of data mining
    The basics of learning
    Varieties of learning methods
    Choosing a model and generalization
    Model validation and verification
    Model assessment
    Artificial neural networks
    Neural network architecture
    Learning rules
    Multilevel neural networks
    Backpropagation
    Learning rates
    Overfitting
    Image study
    Segmentation problem
    U-Net
Materials and methods
    Cell preparation
    Image preprocessing
    Network architecture
    Training
Results
    Dice coefficient
Discussion
    Summary
Future perspectives
Acknowledgements
References
Appendix 1
Appendix 2
Appendix 3
Appendix 4
Appendix 5


Abstract

State-of-the-art fluorescence imaging research is strictly limited to eight fluorophore labels during the study of intracellular interactions among organelles. The number of excited fluorophore colors is restricted due to overlap within the narrow spectrum of visible wavelengths, and telling the overlapping signals apart requires considerable analysis effort. Significant overlap already occurs with the use of more than four fluorophores, leaving researchers limited to a small number of labels and the hard decision of which cellular labels to prioritize. Apart from the physical limitations of fluorescent labeling, the labeling itself causes behavioral abnormalities due to sample perturbation. In addition, the labeling dye or dye-conjugated antibodies potentially cause phototoxicity and photobleaching, thus limiting the timescale of live cell imaging. Nontoxic imaging modalities such as transmitted-light microscopy, including bright-field and phase contrast methods, are available but do not come close to achieving the specificity of fluorophore labeling. An approach that could increase the number of organelles simultaneously studied with fluorophore labels, while being as cost-effective and nontoxic as transmitted-light microscopy, would be an invaluable tool in the quest to enhance knowledge in cellular studies of organelles. Here we present a deep learning solution, using convolutional neural networks built to predict the fluorophore labeling effect on the nucleus from a transmitted-light input. This solution renders a fluorescence channel available for another marker and eliminates the process of labeling the nucleus with dye or dye-conjugated antibodies by instead using deep convolutional neural networks.


Introduction

State-of-the-art research in fluorescence microscopy is limited to up to eight fluorophores for studies of intracellular communication between organelles. The number of fluorescent colors is limited as a consequence of spectral overlap in the visible wavelength range. Overlapping signals need to be processed mathematically, which means increased effort, and significant overlap already occurs when more than four fluorophores are used. This limitation ultimately means that researchers have a small number of fluorophores to work with and must therefore prioritize which cellular structures can be labeled simultaneously. Beyond the spectral limitations of fluorescence microscopy, the staining of cellular components itself also has a negative cellular impact in the form of deviating behavior. Fluorescent dyes and labeled antibodies potentially cause phototoxicity and photobleaching, which limits the time span of live cell studies. Transmitted-light microscopes such as bright-field and phase contrast have no toxic impact but do not produce nearly as detailed images as fluorescence microscopes do. An approach that could increase the number of organelles that can be examined simultaneously with fluorophores, while being cost-effective and nontoxic like transmitted-light microscopy, would be an invaluable tool for extended knowledge in cellular studies of organelles. Here, a machine learning method built with artificial neural networks is presented to predict the fluorescent staining of the cell nucleus in a fluorescence microscope, using images from transmitted-light microscopy. This solution frees a fluorophore that can be used for other organelles, while the work of fluorescently staining the cell nucleus is no longer necessary and is replaced by an artificial neural network.


Introduction

Various organelles orchestrate the numerous functions that occur within cells, each with distinctive responsibilities. Different organelles participate in nutrient transformation, energy generation and reproductive processes, to name a few. Metabolism of lipid compounds, as an example, occurs at multiple locations throughout the cell. This requires careful contact between the endoplasmic reticulum, lipid droplets, mitochondria and vacuoles, where different stages of the metabolism take place (Barbosa, et al., 2015). Therefore, studying the behavior and processes of cellular organelles is a fundamental and central part of life science. Microscopes make it possible to visually study structures, biochemical processes and, above all, organelle activities of living cells. Because cells and biological samples are mostly composed of water, examining cell morphology in its natural state requires assorted contrast enhancements. The challenge of accurately visualizing biological compounds drives the development of advanced optical techniques that enhance contrast and make samples observable (Christiansen, et al., 2018). Light and optical microscopes for imaging studies range from low-contrast transmitted-light microscopes, including bright-field, dark-field and phase contrast, to the highly detailed images of fluorescence microscopy. All methods come with a corresponding trade-off between expense, resolution and biological distress. While the benefits of transmitted-light microscopes, such as bright-field and phase contrast, are low cost, simple procedure and no toxicity to the biological sample, these technologies are hampered by a lack of precision and specificity. Fluorescence microscopy, on the other hand, provides high-resolution spatial characteristics and molecular structures (Ounkomol, et al., 2018).

However, the disadvantages of fluorescence microscopy are firstly the need for fluorescence labeling with certain dyes or dye-conjugated antibodies and secondly that it is a time-consuming and costly method. Another aspect of fluorescence microscopy is that the labeling dye itself is harmful to molecular structures, causing phototoxicity and photobleaching and potentially altering the usual behavior of cellular processes (Ounkomol, et al., 2018). The potential for abnormal behavior and toxicity to macromolecules is also a limitation for time-series imaging, which provides substantial biological insights over time. Nevertheless, fluorescence microscopy is a powerful method for high-resolution analysis of macromolecular distribution in cells and is widely used across the whole field (Sullivan & Lundberg, 2018). State-of-the-art imaging research on intracellular coordination uses multiple fluorescence labels to illustrate how cellular components collaborate. By staining different organelles such as the ER, Golgi, lysosome, peroxisome, mitochondria and lipid droplets with individual labels, it is possible to study organelle interactions (Valm, et al., 2017). This research revealed the whole journey of lipid droplets from the main site of lipid synthesis through multiple interactions with all the labeled organelles. The fundamental limitation on the number of labels possible in this approach is essentially the spectral overlap of fluorescence microscopy (Sullivan & Lundberg, 2018). The number of colors, excited fluorophores, is restricted by the narrow spectrum of visible wavelengths, with a maximum of eight colors in total: ultraviolet, violet, blue, green, yellow, orange, red and near IR. However, significant overlap between the colors already occurs when using more than four, due to the fluorophores' broad spectral profiles. This leaves researchers not only limited to the use of up to six labels but also with data that requires considerable effort to mathematically distinguish the overlapping signals of different samples.

In summary, nontoxic transmitted-light microscopes lack resolution, while fluorescence microscopy is harmful and physically limited to a number of labels. An approach that could increase the number of organelles simultaneously studied with labels, while being as cost-effective and nontoxic as transmitted-light microscopes, would be a cutting-edge tool in the quest to enhance knowledge in intracellular studies of organelles. Here we present a deep learning solution, using convolutional neural networks built to predict the fluorophore labeling effect on the nucleus from a transmitted-light input. This solution would eliminate the process of labeling the nucleus with dye or dye-conjugated antibodies by instead using deep learning, a type of machine learning. The premise of this solution is that organelle location can be predicted from transmitted-light inputs with a convolutional neural network built with a U-Net architecture (Ronneberger, et al., 2015). The trained model predicts the fluorophore labeling effect on the nuclei from bright-field and phase contrast input. Essential benefits of this solution are either pure "in silico" labeling without the need for fluorescent dyes, or rendering an additional fluorescence channel available by eliminating the staining process of the nucleus. This both enables imaging over time through a lowered risk of phototoxicity and, above all, increases the number of organelles that can be examined simultaneously, by freeing the fluorophore channel otherwise used for labeling the cell nucleus. In this thesis, I will explore whether there is a sufficient amount of latent regularities in transmitted-light images, from bright-field and phase contrast imaging, that a convolutional neural network can learn in order to segment and represent the fluorophore labeling effect of the nucleus. The aim is to train a convolutional neural network to segment the fluorophore labeling effect of the subcellular nucleus from transmitted-light images. The hypotheses are that segmentation of phase contrast images will perform better than that of bright-field images, and that Dice as a loss function performs better than cross entropy.
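The Dice coefficient named in the hypotheses measures the overlap between a predicted segmentation mask and the ground-truth mask. A minimal sketch on binary masks (the function name and smoothing term eps are illustrative, not taken from the thesis):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Overlap between two binary masks: 2|A intersect B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Perfect overlap gives 1.0, disjoint masks give approximately 0.0:
mask = np.array([[0, 1], [1, 1]])
print(round(dice_coefficient(mask, mask), 3))      # 1.0
print(round(dice_coefficient(mask, 1 - mask), 3))  # 0.0
```

Because the coefficient lies in [0, 1], one minus the (differentiable, soft) version of this quantity can serve as a training loss, which is the comparison the hypotheses refer to.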

Figure 1. The top row shows the aim: train a convolutional neural network to segment the fluorophore labeling effect from bright-field images and phase contrast images. (A) shows an input phase contrast image, with fluorophore nuclei overlay (B) and phase contrast image with segmented mask overlay (C). (D) shows an input bright-field image, with fluorophore nuclei overlay (E) and lastly a bright-field image with segmented mask overlay (F).


Background of data mining

There is a growing need for understanding and evaluating extensive, complex and information-rich data in all kinds of industries. The challenge is to obtain the valuable information hidden in complex data in order to benefit from and use the extracted knowledge. Data mining describes using iterative computational methods to extract, reveal and understand the underlying knowledge of large and complex data sets (Kantardzic, 2011). The methodology is of great use during exploratory studies, when the details and character of the data outcome are unknown in advance. Data mining is the effort to discover state-of-the-art knowledge and information from extensive and complex data sets. In order to achieve the best results, there should be a balance between human expertise formulating specific objectives and the computational power of systematically searching through and interpreting large sets of data. The core of data mining can be condensed into two fundamental goals: one being prediction and the other description (Kantardzic, 2011). Predictive data mining can be described as the task of examining data and categorically labeling the output, while descriptive data mining focuses on understanding unknown patterns and nontrivial information in the given data set. System identification, predicting the behavior of a system, consists of two fundamental steps: identifying the structure and identifying the parameters. The first step is deciding on a class of models to use; by implementing existing knowledge of the system, the model can be described by a problem-dependent parameterized function y = f(u, t). The variables of the model are the output y, the input vector u and the parameter vector t. Prior knowledge about the system decides which model class to use, and the aim is to find the most suitable model out of the chosen class (Kantardzic, 2011).
Further, when the architecture of the model is chosen, the next step is to iteratively optimize and improve the parameter vector to approximate a model that best describes the system. Parameter optimization results in the function y* = f(u, t*), which best describes the system being examined. Successful structure identification depends on prior knowledge; trial and error is otherwise the strategy if prior knowledge is lacking. The process of data mining can be divided into five iterative steps. The first step is to formulate a problem statement and compose a hypothesis, which generally requires prior knowledge of the problem being examined; this step also includes choosing a suitable model class as a platform to build on. Collecting the data is the second step, followed by preprocessing of the data as the third step. Preprocessing of data is usually a time-consuming task consisting of both distinguishing outliers and scaling and selecting which features to use further. Identifying outliers is followed by the decision of either keeping deviating data for a more robust model or removing the outliers as part of the preprocessing step. Scaling is an essential step, since features of different ranges will influence the model very differently. Data within the range [0, 1] will, for example, influence the model unevenly compared to data within the range [0, 100]; scaling therefore brings them both to equivalent weight. Model estimation is the fourth step, after data preprocessing, where an appropriate model is chosen for further analysis. The fifth and last step is evaluating the model and interpreting the results.
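The scaling step described above can be sketched as a simple min-max normalization that brings features of different ranges to equivalent weight (the function name and the example values are illustrative assumptions):

```python
import numpy as np

def min_max_scale(x, lo=0.0, hi=1.0):
    """Rescale a feature vector linearly into the range [lo, hi]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.full_like(x, lo)  # a constant feature carries no spread to rescale
    return lo + (x - x.min()) * (hi - lo) / span

# A feature spanning [0, 100] and one spanning [0, 1] end up on equal footing:
wide = np.array([0.0, 25.0, 50.0, 100.0])
narrow = np.array([0.0, 0.25, 0.5, 1.0])
print(min_max_scale(wide))    # both map to 0, 0.25, 0.5, 1
print(min_max_scale(narrow))
```

After scaling, both features contribute on the same numeric scale, so neither dominates the model purely because of its original units.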

The basics of learning

There are two fundamental phases in every learning procedure: learning undefined dependencies of a given system, and using the learned knowledge for predicting dependencies on future input. The learning phase is known as induction, while applying the learned knowledge to new input is called deduction. An algorithm is a form of learning method that drives the process of inductive learning by estimating unidentified dependencies between the input and output from the samples of the given data. When the algorithm has learned from the data and achieved a precise estimation of the dependencies, the knowledge can be used to predict outputs for new, unseen inputs. The concept of the learning process consists of three fundamental parts: first a data set as input vector X; then a system that takes the input vector X and returns an output y; and lastly the learning machine, the very core, which estimates a new output y' by observing the correlation between input X and output y. The first component of the system is the generator; it presents a random vector X, independently drawn from the input distribution. The learning machine being trained is not in control of the order in which the input vectors are fed into the system. Further, the system produces an output value y according to an unknown conditional probability p(y|X) for each input vector X. During training, the learning machine receives the input vector X and generates functions that estimate the behavior of the system with output y. In order to generalize from the input data, the learning machine requires preset knowledge of the possible function class for the given system. The functions that the learning machine implements for generalization can be written as f(X, w), w ∈ Ω, with X being the input to the system, w a parameter of the function and Ω a set of abstract parameters whose purpose is to index the set of functions. The set of approximating functions used by the learning machine for generalization of the system is ideally chosen according to prior knowledge of the given system, which is often difficult for complex problems. A way to visualize the search for an approximating function is seen in Figure 2, where a given input x yields a hypothesis function h(x) that approximates f(x). Since the perfect function f(x) is unknown, the learning machine has an abundance of functions h(x) to try out in order to approximate f(x).
It is here the prior criteria come in: without restricting the class of approximating functions, there are not only numerous functions to try, but also no way to decide which of the approximated functions to use. By declaring conditions linked to the search for an approximating function, the search space is reduced and an appropriate h(x) can be chosen. The algorithm needs preset criteria to restrict the search scope when trying functions (Kantardzic, 2011).

Figure 2. Visualization of the search for a hypothesis function h(x).

The parameters of the approximating function can be either linear or nonlinear. In conclusion, the learning process consists of both estimating the unknown function f(X, w) and approximating its parameters. So far, the learning machine's task is to choose the particular function, among the set of supported functions, that best estimates the output of the system while examining a given number of inputs. Samples fed to the learning machine can be described as (X_i, y_i), with i being the index of the input sample. The loss function L(y, f(X, w)) evaluates the quality of the approximated function, with X being the input to the system, y the output and w the set of parameters of the approximating function. The output produced by the learning machine for a specific approximating function is the term f(X, w).


The loss function, L, is used to estimate the difference between the output y_i that the system produced and the output f(X_i, w) produced by the learning machine, for every input X_i. The loss function ranges from large positive numbers, indicating an approximation very dissimilar from the system, to small positive numbers close to zero, indicating an appropriate approximating function. The cost function R(w) is the expected value of the loss function and is described as:

R(w) = ∫ L(y, f(X, w)) p(X, y) dX dy

Here, L(y, f(X, w)) is the loss function evaluating the approximation, and p(X, y) is the term describing the probability distribution. The task of inductive learning is, using a set of input values X_i, to approximate a function f(X, w_opt) that gives the lowest cost R(w) among the set of given functions with unknown probability distribution p(X, y). Since the input set of samples is finite, the approximating function is redefined as f(X, w_opt*), where w_opt* denotes the obtained parameters. Two standard learning tasks are regression and classification, with different ways of interpreting their cost functions. For a binary classification problem, where the output of the system takes one of two discrete values y ∈ {0, 1}, the loss function measures the classification error:

L(y, f(X, w)) = 0 if y = f(X, w), and 1 if y ≠ f(X, w)

When the cost function is used together with this classification loss, it reflects the probability that the learning machine predicts the incorrect class. The objective of the learning process is thereby to take training data as input and approximate a classifier function f(X, w) that has the lowest probability of misclassification. For regression problems, the cost function instead assesses the accuracy of the approximated function's predictions from the input data. For example, the squared error is a standard loss function for regression problems:

L(y, f(X, w)) = (y − f(X, w))²

A minimized cost function gives the highest accuracy in regression problems and occurs when the approximated function manages to predict outputs similar to the exact outputs produced by the system. The learning itself is possible by the principle of providing instructions on how to generate an approximated function:

f(X, w_opt*)

That approximating function is part of the class of accepted approximating functions, given a set of training data. The principle that provides instructions on what to do with the data is called the induction principle, while the learning method is responsible for specifying how to use the input data to estimate the approximated function. The challenge is to determine which class of approximating functions to use for the given training data. Besides designing the learning machine itself, additional effort is needed to decide what kind of preprocessing the input data and its variables need. Further, the user also needs to decide the input data rate and determine how to distribute the training data. Prior knowledge of the system and its usual behavior is used to conclude which set of approximating functions is suitable for the given problem and hypothesis.
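The 0-1 classification loss and the squared error defined above can be estimated empirically by averaging over a finite sample of (X_i, y_i) pairs, which is how the cost R(w) is approximated in practice. A minimal sketch (function names and example values are illustrative):

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """Average 0-1 loss over a sample: 0 per correct prediction, 1 per error."""
    return np.mean(np.asarray(y) != np.asarray(y_hat))

def squared_error(y, y_hat):
    """Mean squared error over a sample, the standard regression loss."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(zero_one_loss(y_true, y_pred))           # 0.25: one of four samples misclassified
print(squared_error([1.0, 2.0], [1.5, 1.0]))   # (0.5**2 + 1.0**2) / 2 = 0.625
```

Averaged over the sample, the 0-1 loss directly estimates the probability of misclassification described above.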


Statistical learning theory, SLT, presents precise mathematical evidence for the inductive learning explained above and describes the compromise between a more complex algorithm and the available sample size. Although SLT was initially developed to describe classification problems and different pattern recognition tasks, understanding the theory is fundamental for implementing and designing systems for inductive learning. The ultimate goal of inductive learning is to approximate undiscovered dependencies within a particular class of approximating functions, utilizing a finite amount of input data. The most favorable estimate results in a minimized cost function (Kantardzic, 2011).

Varieties of learning methods

Supervised learning and unsupervised learning are two frequently used learning methods. Supervised learning refers to approximating unidentified dependencies where both input data and output data are known. Supervised learning can be regarded as a system with a teacher present who knows the exact output that corresponds to the given input. The learning system is corrected with both the correct output from the teacher and an error signal, both affecting the adjustment of the learning system's parameters iteratively. The difference between the correct answer and the answer given by the learning system is the error signal, which is further used as feedback for the parameter adjustment. A supervised learning system is seen in Figure 3, where the knowledge from the teacher is fed to the learning system as the correct output, combined with the feedback loop of the error signal back to the learning machine until it has learned from the teacher. The difference between the teacher's output and the learning system's output can be evaluated using numerous error functions, such as the mean-square error or the sum of squared errors, to name two.

Figure 3. Supervised learning, where the learning machine approximates a function according to the error signal, which is the difference between the desired response generated by the teacher and the actual response from the learning system. The feedback loop enables the learning machine to learn from the supervising teacher by altering the parameters w according to the difference between the responses.

The chosen error function can be envisioned as a multidimensional surface on which the learning operation is located. Each learning cycle can be visualized as a small movement of the learning operation from its original position to a new location on that multidimensional surface. Figure 4 shows such a step in a simplified solution space, plotted in 3D for visualization purposes; the real solution space is a high-dimensional system.


Figure 4. A simplified 3D plot of the solution space for visualization purposes, showing how the output of a chosen error function affects the learning operator. For each iteration, the learning operator moves from its original location (A) to a new location (B). The small movement depends on the output of the chosen error function and is calculated for each learning iteration.

A successful learning process, where the learning system's performance continuously improves, results in movement towards a local or global minimum on the multidimensional surface. The movement of the operating point is seen in Figure 5, where it approaches the minimum stepwise for each learning cycle with the teacher. A successful supervised learning process, where the operating point has approached the minimum, results in a learning system capable of successfully approximating functions and thereby accurately classifying input samples (Kantardzic, 2011).

Figure 5. A successful learning process, where the learning operator gradually reaches the global minimum by stepwise improvement of the error function's output.
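The stepwise descent of the operating point pictured in Figures 4 and 5 can be sketched as plain gradient descent on a toy error surface (the surface E(w) = w1² + w2², the learning rate and the step count are illustrative assumptions, not values from the thesis):

```python
import numpy as np

def gradient_descent_step(w, grad_fn, lr=0.1):
    """Move the operating point one small step downhill on the error surface."""
    return w - lr * grad_fn(w)

# Toy error surface E(w) = w1^2 + w2^2, whose gradient is 2w and whose
# global minimum sits at the origin.
grad = lambda w: 2 * w
w = np.array([2.0, -1.0])
for _ in range(50):
    w = gradient_descent_step(w, grad)
print(w)  # both components are now very close to zero
```

Each iteration shrinks the error, mirroring the movement from location (A) to (B) in Figure 4 and the stepwise approach to the minimum in Figure 5.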

There is no teacher present in the other learning method, unsupervised learning, and the learning system can therefore not be corrected with desired output samples during training. In unsupervised learning, the learning machine is expected to both arrange and assess the approximating function without any external feedback loop. An unsupervised learning process results in a model that can uncover and disclose unidentified patterns and constellations in given data sets. The learning system adapts to the input data throughout the learning process and establishes a capability to estimate relationships between the characteristics of the input data.

Choosing a model and generalization

There is a fundamental question that has to be answered when designing a learning system: given a finite data set with an undiscovered probability distribution, does it contain an adequate amount of latent regularities for a learning system to learn and then represent? If not, the learning system might instead merely memorize the exact input data, with the consequence of not being able to perform accurate predictions on new, previously unseen data. For high-dimensional problems, the training data is likely not sufficient to yield a single solution, due to the large number of potential models. Additional assumptions are needed to successfully obtain a single unique model from the finite input data that is still robust enough to make accurate predictions on new unseen data. A compromise between the complexity of the model and the volume of input data has to be made in order to obtain a model with high generalization ability (Kantardzic, 2011).

Model validation and verification

The last and final phase of data mining consists of both validating and verifying the obtained model. Model validation concerns how the model is built according to the system and the objectives stated by the user: it evaluates whether the data is reconstructed by the model and whether its representation of the system reaches an acceptable accuracy. In other words, model validation refers to building the right model. Model verification, on the contrary, refers to building the model in the right way: it establishes that the acquired model is reconstructed from the input samples and reproduces samples with acceptable accuracy. Model testing is therefore a crucial part of both evaluating model performance and exposing possible errors and miscalculations in the model. Tests are performed both for validation, where the model is evaluated according to its behavioral precision, and from a verification point of view, which evaluates the precision of data reconstruction in the model. Predictive accuracy is used to estimate the quality of the model's predictions, whether it is correctly classifying samples or predicting new samples from the input data. The true error rate is used to determine the predictive accuracy of a model and is statistically defined as "the error rate of the model on an asymptotically large number of new cases that converge to the actual population distribution" (Kantardzic, 2011); it is approximated using the whole data set. The process of estimating the true error rate consists of dividing the data set into two new sets: a training set that is used during the training phase, and a separate testing set that is not seen during training. The training set is used in the initial training phase, where the data is used to construct the model; subsequently, the test set is used for validating the model according to its performance on the unseen data.

The ultimate goal of estimating a true error rate is to get an evaluation of the model's performance on future unseen input data. For the estimated error rate to be reliable, the training and testing data sets first have to be of adequate volume, and secondly, they have to be independent (Kantardzic, 2011). When deciding how to split the data set, there are significant trade-offs to consider. A small training set will likely result in a model with reduced generalization ability. In contrast, a split that results in a small testing set will yield an estimated error rate with low certainty. An important point to remember is that differently distributed splits will result in different error approximations.

Page 14: Artificial intelligence for segmentation of nuclei from ...1455037/FULLTEXT01.pdfEXAMENSARBETE INOM BIOTEKNIK, AVANCERAD NIVÅ, 30 HP STOCKHOLM, SVERIGE 2018 Artificial intelligence


As far as resampling methods are concerned, the actual techniques for splitting the data between training set and testing set, there are several options to choose from. The assumption behind splitting the data into a training portion and a testing portion is that both come from the same unknown distribution, an assumption that generally holds for extensive data sets but not necessarily for a small data set, which has to be accounted for when deciding how to split it. The most straightforward resampling technique, the resubstitution method, is to use the whole data set for both the initial training phase and the later validation phase; no split is performed, simply put. The consequence of using the whole data set for both training and validation is an estimated error that is smaller than it likely would be for future unseen data. This technique is rarely used due to the optimistically biased estimation of the error. The holdout method is a resampling technique where half to two-thirds of the data set is used for training and the rest for validation. The error is pessimistically estimated and is likely larger than it would be for new unseen samples. By randomly varying the samples in the split numerous times, the error estimation becomes more accurate. The leave-one-out method consists of systematically using B − 1 of the B available samples for training and then validating on the remaining sample; this is iterated B times, giving B models. The rotation method, or cross-validation, is generally a popular resampling technique, particularly if the data set is small. The approach is similar to the leave-one-out method, but the samples are first divided into C smaller subsets, where 1 ≤ C ≤ B; C − 1 subsets are used for training and the remaining subset for validation. The bootstrap method is beneficial for a small data set: new data sets, comparable in size to the original, are created by resampling the original data with replacement, and these resampled data sets give a new bootstrap error estimate.
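As an illustration, the rotation (cross-validation) split described above can be sketched in plain Python; the fold count and the toy data below are hypothetical.

```python
import random

def k_fold_splits(samples, k, seed=0):
    """Shuffle the samples once, then yield (train, test) pairs where
    each of the k folds serves as the test set exactly once."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# Every sample appears in a test set exactly once across the k rounds.
data = list(range(10))
for train, test in k_fold_splits(data, k=5):
    assert len(train) == 8 and len(test) == 2
```

The leave-one-out method is the special case k = B (one sample per fold).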

Model assessment Standard error estimation is a way of assessing the performance of a model derived through inductive-learning techniques, and the sought error rate is estimated by validating the model on a resampled testing data set. For classification problems, the error rate is the number of wrongly predicted classifications. If all errors are considered to be of the same significance, the error rate R can be described as the total number of errors E divided by the total number of samples S:

R = E/S (1)

The accuracy A of the model, the fraction of samples classified correctly, can be expressed by subtracting the error rate from one:

A = 1 − R = 1 − E/S = (S − E)/S (2)

Confusion matrices and lift charts are two different techniques for evaluating the performance of a classification model. A confusion matrix is a useful way to examine the prediction accuracy by drawing up a table according to Table 1. The number of possible types of classification errors can be expressed in terms of the number of classes m:

number of error types = m² − m (3)


Binary classification problems have two separate classes, where each sample can be classified as True or False. A binary classification problem can therefore only involve two different types of errors according to Equation 3:

2² − 2 = 2

Each prediction in a binary classification problem can therefore be either True or False, and is further divided into four possible outcomes: false negative predictions, false positive predictions, true negative predictions and true positive predictions. A false negative error is the scenario where the model predicts a sample to be false but the accurate answer is true. On the contrary, a false positive error occurs when the model predicts a sample as true when the actual output should be false. True negative and true positive predictions are when the model correctly classifies a sample as false or true, respectively. All the errors and accurate predictions are shown in Table 1 (Kantardzic, 2011). Table 1. Confusion matrix for a binary classification problem showing the four prediction scenarios of a model. The two possible errors are false positive and false negative, where a false positive error is obtained when the model predicts a sample as True when the actual class is False; the reverse scenario is a false negative, where the model predicts a sample as False when the accurate class is True.

                          Actual class
Predicted class       True                  False
True                  True positive (TP)    False positive (FP)
False                 False negative (FN)   True negative (TN)

Equation 3 can be used for classification problems other than binary ones, as seen in Table 2, which has three classes and therefore a total of 6 error types according to Equation 4:

3² − 3 = 6 (4)

Table 2. A classification problem with three classes has 6 different types of errors according to Equation 4.

                         True class
Classification model      0     1     2    Total
0                        11     2*    1*    14
1                         1*   12     1*    14
2                         3*    1*   13     17
Total                    15    15    15     45

* = Types of errors stated by Equation 3.

An increased number of classes means more types of errors according to Equation 3, which is seen in Equation 4, where the classification problem has three classes and a total of 45 samples, as shown in Table 2. Using the values from Table 2, the error rate of the model can be calculated:

R = E/S = (1 + 3 + 2 + 1 + 1 + 1)/45 = 9/45 = 0.2
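The calculation above can be verified with a short Python sketch over the confusion matrix from Table 2 (rows are the model's predicted classes, columns the true classes):

```python
# Confusion matrix from Table 2: rows = predicted class, columns = true class.
confusion = [
    [11, 2, 1],
    [1, 12, 1],
    [3, 1, 13],
]

total = sum(sum(row) for row in confusion)                 # 45 samples
correct = sum(confusion[i][i] for i in range(len(confusion)))  # diagonal = 36
error_rate = (total - correct) / total
accuracy = 1 - error_rate

print(error_rate, accuracy)  # 0.2 0.8
```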



When the error rate is stated, it is possible to determine the accuracy of the model's performance:

A = 1 − R = 1 − 0.2 = 0.8

The accuracy is generally expressed as a percentage, so the accuracy of the model is 80%. In this example, the classes were of equal size, with 15 samples each, constituting a balanced data set. Balanced classes are not always the case in real-world problems, and when the classes are imbalanced, accuracy might not be the preferred performance assessment. Models in real-world applications are often used to classify infrequent events in extensive data sets, such as hippocampal segmentation when studying group differences to detect Alzheimer's disease (D'Addabbo & Maglietta, 2015). Problems with using only accuracy as an evaluation metric on imbalanced data classes occur because the model may fail to capture substantial information about the smaller class. The process behind this can be illustrated with an example of two classes, where one class (A) accounts for 99% of the training data while only 1% of the data belongs to the other class (B). If the model predicts that all the inputs belong to the majority class (A), the accuracy will end up at 99% even though the model failed to classify any of the data belonging to the small class (B). This high accuracy might at first glance be regarded as a well-trained model, but it is essential to keep in mind that the model very likely failed to discover valuable information from the minority class (B). This kind of imbalance promotes models that cannot be expected to perform well on real-world data (D'Addabbo & Maglietta, 2015). Other, more useful metrics for evaluating the performance of the model are the precision, the recall and the relative overlap between the samples, all shown in Figure 6. Precision represents the proportion of the samples categorized as true that are actually true. Figure 6A illustrates precision, which is computed by dividing the true positive predictions by all the samples predicted as positive.
Recall describes the ability of a model to classify all relevant data correctly into their respective classes. Recall is the percentage of true positives that are correctly classified (Powers, 2008) and is determined by dividing the true positive predictions by all the samples of the actual class, as seen in Figure 6B. The last part, Figure 6C, shows the relative overlap, which rewards heavy overlap.
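A minimal Python sketch of precision and recall; the TP/FP/FN counts below are hypothetical and chosen to mimic an imbalanced problem:

```python
def precision(tp, fp):
    """Fraction of the positive predictions that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of the actual positives that the model found."""
    return tp / (tp + fn)

# Hypothetical imbalanced example: 10 true positives, 5 false positives,
# 40 false negatives in a large, mostly-negative data set.
tp, fp, fn = 10, 5, 40
print(precision(tp, fp))  # ≈ 0.67
print(recall(tp, fn))     # 0.2 — low recall that accuracy alone would hide
```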


Figure 6. Accuracy is not a suitable performance metric for classification problems when the data set is imbalanced. Precision, recall, and relative overlap are more useful for evaluating the performance of an imbalanced classification problem.


An additional metric for performance evaluation is the Sørensen-Dice index, DICE: a statistical measure used for evaluating the similarity of the overlap between two samples (Dice, 1945). Figure 7 shows the schematic principle, where the common overlap can be expressed as A ∩ B. DICE results in a similarity index, or rather the spatial overlap, between two samples. The DICE coefficient takes a value ranging from 0 to 1 and is calculated by taking the intersection region twice, divided by the sum of the two separate sample regions, as illustrated in Figure 7:

DICE = 2|A ∩ B| / (|A| + |B|)

Figure 7. The DICE coefficient is a similarity index that measures the intersection between two samples, giving a value in the range 0 to 1.
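The DICE coefficient can be sketched in a few lines of Python for flat binary masks; the two toy masks below are hypothetical:

```python
def dice(mask_a, mask_b):
    """Sørensen-Dice index: twice the intersection size divided by the
    sum of the two region sizes. Masks are flat lists of 0/1 pixels."""
    intersection = sum(a & b for a, b in zip(mask_a, mask_b))
    return 2 * intersection / (sum(mask_a) + sum(mask_b))

a = [1, 1, 1, 0, 0, 0]
b = [0, 1, 1, 1, 0, 0]
print(dice(a, b))  # 2*2 / (3+3) ≈ 0.667, partial overlap
print(dice(a, a))  # 1.0, perfect overlap
```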

Artificial neural networks Artificial neural networks, ANNs, are not a new discipline. Neural networks have been a research topic since the 1950s, when Rosenblatt presented the very first application of simple pattern classification (Kantardzic, 2011). The whole concept of ANNs is the idea of creating a learning machine similar to the complex human brain by mimicking the biological neurons with artificial ones. Each neuron has connections to adjacent neurons, similar to the biological structure of the human brain, and constitutes a piece of an extensive network of connected neurons. Each output that leaves a neuron depends on adjustable parameters connected to the neuron, making each node adaptive, which drives the learning and generalization process during training. An artificial neuron primarily consists of three components: a number of connecting links from the various inputs, each having an individual weight w_ki, an adder that estimates the sum of all


the input signals x_i, and an activation function f responsible for restricting the amplitude of the neuron's output, y_k. The neuron is indicated by the first index, k, in w_ki, while the index i corresponds to the input to which the weight is assigned. Figure 8 shows the three essential components of a neuron together with an added bias, b_k, whose value can be either positive or negative. The bias amplifies or lessens the input to the activation function, depending on whether its value is positive or negative.

Figure 8. Schematic overview of an artificial neuron consisting of input values x_i, each connected to the neuron with a weight w_ki, an adder that sums all the weighted inputs, amplified or reduced by the bias b_k, and an activation function f(net_k) responsible for the output y_k.

The learning procedure through the artificial neuron starts with multiplying each input value x_i with its weight w_ki before it is presented to the neuron. Inside the neuron, the adder takes the individual products and sums them up into a term called net_k. Before the weighted sum is sent further to the activation function, the bias, with its positive or negative sign, is added, which alters the amplitude of the final value sent to the activation function. Lastly, the activation function estimates and presents the output y_k. The weighted sum of all the input values is expressed as:

net_k = x_1·w_k1 + x_2·w_k2 + ... + x_m·w_km + b_k (5)

The next step is to replace the bias with the weight w_k0, and since its input x_0 has the default value of 1, Equation 5 is altered into:

net_k = x_0·w_k0 + x_1·w_k1 + ... + x_m·w_km = Σ (i = 0 to m) x_i·w_ki (6)

Vector notation is another way to express the sum in Equation 6: net_k = X · W, estimated as the scalar product of the vectors X and W:

X = [x_0, x_1, x_2, ..., x_m]
W = [w_k0, w_k1, w_k2, ..., w_km]


Figure 9 shows an example from which the weighted sum can be estimated:

net_k = x_1·w_k1 + x_2·w_k2 + x_3·w_k3 + b_k = 0·0.2 + 1·0.6 + 1·0.3 + 0.3 = 1.2

Figure 9. Illustrative process of estimating the weighted sum of each input multiplied with its individual weight.
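The forward pass of a single artificial neuron, using the weights, inputs and bias from the worked example above, can be sketched in Python (the sigmoid is chosen here only as one possible activation):

```python
import math

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, passed through an activation."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return net, activation(net)

def sigmoid(net):
    return 1 / (1 + math.exp(-net))

# Values from the worked example in Figure 9.
net, y = neuron(inputs=[0, 1, 1], weights=[0.2, 0.6, 0.3], bias=0.3,
                activation=sigmoid)
print(round(net, 2))  # 1.2
print(round(y, 2))    # 0.77
```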

The last step of the procedure is the activation function estimating an output y_k as a function of the calculated weighted sum, f(net_k). There are numerous activation functions to choose from, depending on the problem being solved. A few frequently used activation functions are illustrated in Table 3, each suitable for a different kind of hypothesis. Using these activation functions, the last step, f(1.2), from the example in Figure 9 can be estimated as illustrated in Table 3. Additional activation functions and their corresponding plots are found in Table 4.

Table 3. Estimated output y_k calculated using different activation functions.

Activation function    Estimated output y_k
Linear                 y = f(net_k) = f(1.2) = 1.2
Binary step            y = f(net_k) = f(1.2) = 1
Sigmoid                y = f(net_k) = f(1.2) = 1/(1 + e^(−1.2)) ≈ 0.77
ReLU                   y = f(net_k) = f(1.2) = 1.2


Table 4. Commonly used activation functions, each one suitable for different problems. (Plots omitted.)

Activation function                                   Equation
Linear                                                y = net
Binary step                                           y = 1 if net ≥ 0; y = −1 if net < 0
Sigmoid                                               y = 1/(1 + e^(−net))
TanH                                                  y = (e^net − e^(−net))/(e^net + e^(−net))
Rectified linear unit (ReLU)                          y = net if net ≥ 0; y = 0 if net < 0
Parametric rectified linear unit, PReLU (Leaky ReLU)  y = net if net ≥ 0; y = α·net if net < 0


Exponential linear unit (ELU)                         y = net if net ≥ 0; y = α(e^net − 1) if net < 0
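The activation functions of Tables 3 and 4 can be written as small Python functions; the α defaults below are hypothetical choices, not values prescribed in the text:

```python
import math

# Plain-Python versions of the activations in Tables 3 and 4. The alpha
# parameter of PReLU and ELU is a free parameter; the defaults are arbitrary.
def linear(net): return net
def binary_step(net): return 1 if net >= 0 else -1
def sigmoid(net): return 1 / (1 + math.exp(-net))
def tanh(net): return math.tanh(net)
def relu(net): return max(0.0, net)
def prelu(net, alpha=0.01): return net if net >= 0 else alpha * net
def elu(net, alpha=1.0): return net if net >= 0 else alpha * (math.exp(net) - 1)

# Reproduce the f(1.2) column of Table 3.
for f in (linear, binary_step, sigmoid, relu):
    print(f.__name__, round(f(1.2), 2))
```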

The previous example illustrates a single node, but the same principle applies when the network grows and additional nodes with corresponding connections are added. When estimating the output from a network larger than a single node, the weighted sum is calculated for each node before being added together for the final output. Neural network architecture Essential factors of a neural network are the number of inputs, outputs and nodes, and lastly their linkage to other nodes. The two categories of network architecture are feedforward and recurrent design, both seen in Figure 10. Feedforward architectures are characterized by a network where the computation consistently flows from input to output. Nodes of each layer are solely connected to nodes in the surrounding layers, and no connections exist between nodes of the same layer. Recurrent architecture is characterized by feedback loops to and from different parts of the network. Even though the two classes of architecture are presented as two completely separate structures, the most commonly built architecture is a combination of both: feedforward with backpropagation. Unlike previous illustrations of single nodes, Figure 10 shows a neural network with multiple layers. The layers in between the input and output layers are called hidden layers and enable the neural network to perform non-linear predictions, which are often needed for real-world problems. A single-layer network is only able to solve linear problems, since it constructs a straight line that separates the classes, as seen in Figure 11a. For non-linear problems, it is not possible to successfully separate the classes with just a straight line, as illustrated in Figure 11b. With that being said, single-layer ANNs are suitable for simple problems solvable with linear models, whereas multilayer ANNs construct highly nonlinear models capable of solving complex real-world problems.


Figure 10. Two common types of artificial neural networks. (A) Multilayer feedforward. (B) Single layer with recurrent design having a feedback loop.

Figure 11. Illustration showing points belonging to two different classes, separated by a linear and a nonlinear model respectively. (A) Linear problems are solvable using a single linear separation of points generated from single-layer neural networks. (B) Nonlinear problems need multilayered networks that generate a nonlinear separation.

The advantages of ANNs are the network's capability to acquire information and learn from real-world data, and to systematically improve its performance. The learning itself is an iterative process, resulting from small alterations to the associated weights at each repetition. Two factors of the learning process are a set of predefined learning rules and the architectural design of the nodes with their corresponding connections.


Learning rules For the learning process to take place, a corrective adjustment needs to be applied as a feedback loop back to the neuron through its input weights. Consider the simple neuron previously seen in Figure 9, with input values x_k1, ..., x_km and a corresponding desired output d_k(n), where n is a time point of the repetitive learning process. The expression for the output produces a predicted output y_k(n), which is measured against the requested target output d_k(n). The difference between the desired output and the estimated output from node k is denoted the error:

e_k(n) = d_k(n) − y_k(n)

The error is used as feedback by the neuron, with the goal of repetitively adjusting the output signal y_k(n) so that it becomes more similar to the desired output d_k(n) with each iteration. Minimizing the difference between desired output and predicted output is called error-correction learning, and the expression being minimized is the cost function:

E(n) = (1/2)·e_k²(n)

The adjustment needed to come a step closer to minimizing the cost function is calculated using the delta rule: a neuron k is excited by an input x_j(n) with a weight factor w_kj(n) at a given time point n. According to the delta rule, the weight adjustment Δw_kj(n) can thereby be calculated:

Δw_kj(n) = η · e_k(n) · x_j(n)

In addition to the weight, input, error and time point, the rate of learning is imposed by the positive constant η. In conclusion, the weight adjustment is equal to the product of the learning rate, the error and the input value of the particular link. When the adjustment has been estimated, it is added to the old weight to calculate the new value:

w_kj(n + 1) = w_kj(n) + Δw_kj(n)

Figure 12 illustrates the process of weight adjustment during learning. An additional key factor, so far only briefly mentioned, is the learning rate η: the positive constant that regulates the pace of convergence during the repetitive rounds of training.
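The delta-rule iteration above can be sketched in Python for a single linear node; the initial weights, inputs, target and learning rate below are hypothetical:

```python
def delta_rule_step(weights, inputs, desired, eta):
    """One error-correction step: predict y, compute the error
    e = d - y, then adjust each weight by eta * e * x_j."""
    y = sum(w * x for w, x in zip(weights, inputs))  # linear node
    error = desired - y
    new_weights = [w + eta * error * x for w, x in zip(weights, inputs)]
    return new_weights, error

weights = [0.5, -0.3]
for _ in range(50):
    weights, error = delta_rule_step(weights, inputs=[1.0, 2.0],
                                     desired=1.0, eta=0.1)
print(error)  # the error shrinks toward 0 as the iterations proceed
```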


Figure 12. Error-correction learning is the process of iterative weight adjustment with the intention of minimizing the error between the predicted output y_k and the desired output d_k. The estimated correction is sent backward as feedback for weight adjustment, moving the predicted output toward the desired one.

The whole feedback-loop learning procedure is calculated for an example in Figure 13, complete with initialized weights, estimation of adjustments and finally the new weights. The example displays how small changes are processed through the node and how the adjustment alters the initial weights before the iteration continues. The weight-improvement procedure, with shifting weights after every iteration, will proceed continually unless a halting parameter with stopping criteria is predefined; a fixed number of runs is one example of a stopping criterion.


Figure 13. Example of error correction-learning with initialized weights and estimation of new weights using the Delta rule.


Multilevel neural networks The earlier examples all described networks consisting of single nodes with corresponding connections, only briefly mentioning the benefits of multilayered networks. Three essential attributes of multilevel networks are the numerous intermediate layers of nodes, called hidden layers; activation functions that are usually nonlinear; and a high number of connections between the layers. Figure 14 shows different multilayer neural networks, all fully connected, meaning that every node is connected to all neurons in the preceding layer. Input data travels from left to right, layer by layer, until it reaches the last output layer, furthest to the right. The error-correction learning procedure for single-layer networks has a corresponding operation called error backpropagation, which is a popular learning algorithm. Similar to the single-layer counterpart, error backpropagation consists of two separate stages: a forward pass and a backward pass.

Figure 14. Chart showing three fully connected neural networks. (A) Deep forward network. (B) Autoencoder network. (C) Sparse network.

The first stage of error backpropagation is the forward pass: the input vector with training data is applied to the first input layer, and the response from each node travels forward through all the layers, resulting in a predicted output. The output response is compared to the desired output, and the difference between them constitutes the error signal. The network attempts to minimize the error signal by sending adjustments back through the network, adjusting the weights to produce a new response more comparable with the desired output. All the weights are fixed during the forward pass before being individually adjusted during the backpropagation. As previously stated for error-correction learning in single-layer networks, the error is calculated by subtracting the given response from the target response:

e_j(n) = d_j(n) − y_j(n)

where j is the output node and n the number of the training sample. To improve the predicted output, the error energy has to be calculated. For a single node j, the error energy is defined by E = (1/2)·e_j²(n), and the error energy for the whole network is accordingly the sum of the errors over the nodes in the output layer:

E(n) = (1/2) · Σ (j ∈ C) e_j²(n)


Here, C is the set of neurons in the output layer, while N is the total number of input data samples. The next step is to calculate the average squared error by aggregating the error energy over all iterations n and finally normalizing the expression by the total number of samples N, giving the average error energy:

E_avg = (1/N) · Σ (n = 1 to N) E(n)

The average error energy is the cost function of the problem, which is minimized by altering the free parameters of the network. As previously calculated for single-layer weight correction, the weight alteration is proportional to the product of the error, the input value, and a positive learning rate. Similarly, the backpropagation weight correction Δw_ji(n) is equal to the product of the learning rate, the error signal at the node, the derivative of the output, and the input:

Δw_ji(n) = η · e_j(n) · y_j(n)·(1 − y_j(n)) · x_i(n)

Backpropagation In summary, the full operation of backpropagation learning consists of two distinct phases. In the first stage, called the forward pass, the input data is fed to the input layer, and the signals start to travel through the network of nodes. Signals proceed through each neuron, layer by layer, until they reach the end of the network, the output layer. All the weights are fixed during this first stage of training, and the network generates an output when the signal reaches the last layer. After the predicted output, the second step of backpropagation learning starts: the backward pass. First, an error signal is calculated as the difference between the predicted output and the desired output, and all the weights are sequentially unlocked, ready to be adjusted. The error signal is passed through the net the other way around, from right to left, adjusting the weights of each connection according to the estimated error as the signal travels from neuron to neuron. When the error signal has passed through the whole network and all weights have been altered, a new input is passed to the input layer, and the entire process starts over. Iterations continue back and forth through the network until the error signal is minimized or until some specific preset criteria are fulfilled.
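The two-phase procedure can be sketched for a tiny network with one hidden layer of two sigmoid nodes and a single sigmoid output node; biases are omitted, and all initial weights, the target and the learning rate are hypothetical:

```python
import math

def sigmoid(net):
    return 1 / (1 + math.exp(-net))

def train_step(x, d, w_hidden, w_out, eta):
    """One forward pass followed by one backward pass."""
    # Forward pass: weights stay fixed while the signal travels left to right.
    h = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    y = sigmoid(sum(wi * hi for wi, hi in zip(w_out, h)))

    # Backward pass: the error signal travels right to left, adjusting weights.
    # delta_out = e * y * (1 - y), the sigmoid-derivative form from above.
    delta_out = (d - y) * y * (1 - y)
    w_out_new = [w + eta * delta_out * hi for w, hi in zip(w_out, h)]
    for j, w in enumerate(w_hidden):
        delta_h = h[j] * (1 - h[j]) * delta_out * w_out[j]
        w_hidden[j] = [wi + eta * delta_h * xi for wi, xi in zip(w, x)]
    return w_hidden, w_out_new, (d - y) ** 2

# Hypothetical toy target: push the output toward 1 for input (1, 0).
w_hidden = [[0.1, 0.2], [-0.1, 0.3]]
w_out = [0.4, -0.2]
errors = []
for _ in range(200):
    w_hidden, w_out, err = train_step([1.0, 0.0], 1.0, w_hidden, w_out, eta=0.5)
    errors.append(err)
print(errors[0] > errors[-1])  # True: the squared error decreases
```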

Learning rates The parameter η denotes the learning rate of the process and regulates the amplitude of the weight changes during training. Small values result in smooth steps along the trajectory toward the function minimum. While a smooth trajectory is an upside of a small learning rate, the trade-off is a highly reduced rate of learning and, in the worst case, stagnation. Larger values of η speed up the overall training, with the risk of making the network unstable. If the learning rate η is too high, the solution will not be able to reach the minimum point, due to its large steps oscillating past it.


Overfitting The potency of artificial neural networks is their ability not only to learn from training, but also to apply that knowledge to new unseen data. A prerequisite for accurately predicting new samples that are slightly different from the training data is the ability to generalize well. A phenomenon essential to monitor is overfitting, seen in Figure 15: the event where a neural network memorizes the data instead of learning from it, resulting in a model with high accuracy on training samples but poor performance on new, unseen and slightly different data. Overfitting, or overtraining, often occurs when an extensive network is exposed to a small data set. The approach of starting with a small network and working the way up to a more complex structure counters the risk of overfitting. An important consideration is the number of parameters in the network: to mitigate overfitting, the number of network parameters should be limited and substantially smaller than the number of data points.

Figure 15. Illustration of the concepts of underfitting and overfitting. The black point illustrates the prediction of new unseen data. (A) A simple model that failed to learn from the data points. (B) A robust model which handles new unseen data points well; it generalizes well and performs well on data that is slightly different from the training data. (C) An overfitted model that memorizes all the training data and fails to generalize to a new unseen data set, performing poorly on data that is slightly different. Overfitting occurs when the network has a higher number of parameters than there are data points in the data set and when the input data consists of a small number of training samples. Underfitting is often the result of a simple model with too few parameters relative to the data points, which hinders it from learning the underlying regularities.

Image study Different problems demand different image preprocessing techniques. For image segmentation problems, contrast enhancement is an important preprocessing step. A traditional contrast enhancement technique is histogram equalization, HE. HE intensifies the contrast by spreading out the most recurring intensity values: a probability distribution is computed for the image's gray levels, after which all the gray levels are remapped according to that distribution (Singh, et al., 2016). The result is seen in Figure 16. HE is a popular method since it is simple to implement, with the trade-off that the average brightness of the image changes and artifacts might arise.
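A minimal sketch of histogram equalization in plain Python, using the standard cumulative-distribution remapping; the eight-pixel toy image is hypothetical:

```python
def equalize(pixels, levels=256):
    """Remap each gray level through the image's cumulative distribution,
    spreading the most frequent intensities over the full gray range."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)  # first non-zero CDF value
    n = len(pixels)
    return [round((cdf[p] - cdf_min) / (n - cdf_min) * (levels - 1))
            for p in pixels]

# A low-contrast image crowded into gray levels 100-103 gets stretched
# over the full 0-255 range.
flat = [100, 100, 101, 101, 102, 102, 103, 103]
print(equalize(flat))  # [0, 0, 85, 85, 170, 170, 255, 255]
```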


Figure 16. Histogram alterations for improved contrast with (A) histogram stretching and (B) histogram equalization.

Segmentation problem Images from the fluorophore channel provide valuable information that can be used for nucleus detection and a valuable first step for future predictions. Segmentation is a procedure where each region of the image is classified into a set of preset classes. During binary segmentation, each domain is classified as either belonging to class0 or class 1. Segmentation of categorical classification with B classes, each region of the image is instead categorized as belonging to one of the B classes. Segmentation of the fluorophore channel is a binary problem where each pixel of the image is either categorized as background or being a pixel belonging to a nucleus, further described in the Methodology section ahead. Creating binary masks A simple way of separating a foreground object from the background in an image is using thresholding; by changing all pixels below a chosen threshold to black, and those pixels with values above the chosen threshold, to white. Depending on the object composition, best results are achieved with objects with colors distinctly different from the background. The challenge is choosing an appropriate threshold level, which is a difficult task when different objects have various properties. Choosing a level of thresholding too high results in loss of information while too low levels can introduce artifacts. The histogram landscape of an image gives clues on which methods that might perform well. A histogram that has two distinct peaks present implies that there are two distinct brightness areas present; one peak is belonging to the object in the foreground and the other one its environment or background. With multiple methods available, the method of choice performs differently depending on the pixel intensities of a given image. 
Some commonly used thresholding methods are described below. The Isodata method first samples the background close to the object in order to find a mean gray level for the background, and then does the same for the foreground. The mean of the two sample means is set as the threshold, and the process is repeated until the threshold converges to the mean brightness of the two areas (Ridler & Calvard, 1978).
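The iterative scheme just described can be sketched in a few lines of NumPy. This is an illustration with a synthetic bimodal image, not the code used in the thesis; the tolerance and intensity values are assumptions:

```python
import numpy as np

def isodata_threshold(image, tol=0.5):
    """Ridler & Calvard iteration: set the threshold to the mean of
    the background and foreground means, repeating until it
    stabilizes within `tol`."""
    t = image.mean()
    while True:
        background = image[image <= t]
        foreground = image[image > t]
        new_t = 0.5 * (background.mean() + foreground.mean())
        if abs(new_t - t) < tol:
            return new_t
        t = new_t

# Hypothetical bimodal image: dark background around intensity 30,
# a patch of bright "nucleus" pixels around 200.
rng = np.random.default_rng(0)
image = rng.normal(30, 5, (64, 64))
image[20:40, 20:40] = rng.normal(200, 5, (20, 20))

t = isodata_threshold(image)
mask = (image > t).astype(np.uint8)   # binary mask: 1 = nucleus
```

For two well-separated peaks the iteration converges to roughly the midpoint between the two class means.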


Otsu’s method assumes that the histogram contains two peaks and calculates the best threshold as the value that separates the two classes, as shown in Figure 29 with the threshold in red; it is explained in more detail in the following section. Yen thresholding systematically selects the threshold according to a maximum correlation criterion. Triangle thresholding first draws a line from the peak of the histogram to its most distant end, and then places the threshold at the point with the largest distance between that line and the histogram. Li thresholding chooses the threshold that minimizes the cross-entropy between the original image and the thresholded image.
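As a concrete sketch of one of these criteria, Otsu's between-class-variance search can be implemented directly on the histogram. This is a minimal NumPy illustration on a hypothetical bimodal image, not the thesis implementation:

```python
import numpy as np

def otsu_threshold(image, bins=256):
    """Otsu's criterion: pick the histogram bin center that
    maximizes the between-class variance of the two classes."""
    hist, edges = np.histogram(image, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    p = hist / hist.sum()
    w0 = np.cumsum(p)                    # weight of the dark class
    w1 = 1.0 - w0                        # weight of the bright class
    mu0 = np.cumsum(p * centers)         # cumulative first moment
    with np.errstate(divide="ignore", invalid="ignore"):
        m0 = mu0 / w0                    # mean of the dark class
        m1 = (mu0[-1] - mu0) / w1        # mean of the bright class
        between = w0 * w1 * (m0 - m1) ** 2
    return centers[np.argmax(np.nan_to_num(between))]

# Hypothetical bimodal image, matching the histogram description above.
rng = np.random.default_rng(1)
image = rng.normal(30, 5, (64, 64))
image[20:40, 20:40] = rng.normal(200, 5, (20, 20))
t = otsu_threshold(image)              # lands between the two peaks
```

The returned threshold separates the dark background pixels from the bright foreground patch.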

Convolution operations

Convolutional neural networks are an essential part of image classification and computer vision and consist essentially of repeated layers of operations that apply activation functions to incoming data (Britz, 2017). Compared with the earlier sections, where computational learning was described as a signal traveling from node to node, the learning of convolutional neural networks occurs through convolutions over multiple layers. Essentially, the overall goal of a classifying convolutional network is to reduce the size of the input data step-wise until it has been merged spatially into a small size. There are four principal operations for achieving this spatially merged input: convolution, an activation function, pooling, and lastly the classification itself. Starting with the convolution operation, its fundamental function is to derive features from the input image. The convolution operation retains the spatial relations among pixels by sliding a filter over patches of the image to learn its features. Conceptually, every image can be expressed as a matrix of pixel values ranging between 0 and 255; Figure 17 shows a simplified example where the background color is defined as 0 and the pixels containing nuclei are defined as 1.

Figure 17. A simplified example of how every image can be expressed as a matrix with different pixel values. Each background pixel is defined as 0, and each pixel belonging to a nucleus is identified as 1.

Each convolution operation consists of a smaller matrix, called a filter or kernel, that slides over the input image while performing element-wise multiplication between each position of the filter and the input image. The multiplication outputs are then summed into a single number, shown step by step in Figure 18. The resulting matrix, created by the element-wise multiplication between the filter and the input image, is called a feature map and contains features of the input image itself. Depending on how the numerical values of the filter matrix are constructed, the final feature maps
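The sliding element-wise multiplication can be made concrete with a small NumPy sketch; the input matrix and filter values here are toy examples, not taken from the thesis figures:

```python
import numpy as np

# Toy 4x4 binary input (1 = nucleus pixel, 0 = background).
image = np.array([[0, 0, 1, 1],
                  [0, 1, 1, 1],
                  [0, 1, 1, 0],
                  [0, 0, 0, 0]])

# A hypothetical 2x2 filter; in a trained network these weights
# would be learned, not hand-picked.
kernel = np.array([[1, 0],
                   [0, 1]])

def conv2d(img, k, stride=1):
    """Slide the filter over the image; at each position, multiply
    element-wise and sum into one element of the feature map."""
    kh, kw = k.shape
    oh = (img.shape[0] - kh) // stride + 1
    ow = (img.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow), dtype=img.dtype)
    for i in range(oh):
        for j in range(ow):
            patch = img[i * stride:i * stride + kh,
                        j * stride:j * stride + kw]
            out[i, j] = (patch * k).sum()
    return out

feature_map = conv2d(image, kernel)    # 3x3 feature map at stride 1
```

Increasing the stride to 2 makes the same filter produce a smaller, 2x2 feature map.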


from the same input image will look different depending on the sliding filter (Britz, 2017). A variety of filters with different numerical values is used in each convolution operation to capture particular features of the input image, such as edges, contours, and rounded features.

Figure 18. The step by step process of creating a feature map by elementwise multiplication of the input image and the 2x2 filter matrix, shown as a pink 2x2 matrix. The multiplication outputs are summed up into a final integer for each step, each comprising a single element of the final output matrix.


A high number of filter matrices yields numerous feature maps, resulting in a network with a greater capability of recognizing patterns and arrangements in future unseen images. The characteristics of the feature map are controlled by three separate parameters: depth, stride, and zero-padding (Britz, 2017). Starting with the depth, which is the number of filters used for each input: every filter is designed to capture different details of the input and thereby teach the model to look for particular patterns in the input. The depth value is ultimately the number of feature maps being produced, as seen in Figure 20.

Figure 19. By altering the numeric values of the filter matrix, different feature maps are produced from each filter variation. Filters can be constructed to distinguish features such as edges and curves of the input image.

The second parameter that decides the size of the feature map is the stride, which is the number of pixels the filter moves during its sliding over the input image. A stride of 1 means the filter moves one pixel at a time for each operation. Figure 18 illustrates a stride of 2, where the filter moves two pixels for each step during the creation of the feature map. The stride directly affects the reduction of the output size, which is also seen in Figure 18, where a 6x8 matrix is reduced to a 3x4 matrix by using a stride of 2. A greater stride results in fewer steps and therefore smaller feature maps.

The last parameter is the padding, which controls the spatial size of the output image. While the stride reduces the produced feature map, adding padding increases the size of the feature map. Figure 21A shows a smaller output size with stride and no padding, and Figure 21B shows how the output map can keep the same size as the input image when zero-padding is added.
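The three parameters combine into a simple size formula: for an input of width n, a filter of width f, stride s and zero-padding p, the output width is (n - f + 2p)/s + 1. A small illustrative helper (not from the thesis):

```python
def output_size(n, f, s=1, p=0):
    """Spatial output size of a convolution: input width n,
    filter width f, stride s, zero-padding p."""
    return (n - f + 2 * p) // s + 1

# Stride shrinks the feature map ...
print(output_size(6, 2, s=2))        # 6-wide input, 2x2 filter, stride 2 -> 3
# ... while zero-padding can preserve the input size:
print(output_size(256, 3, s=1, p=1)) # "same" padding for a 3x3 filter -> 256
```

The first call matches the stride-2 reduction described above; the second shows why a padding of one pixel keeps a 3x3 convolution from shrinking its input.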

Figure 20. One parameter of the feature map size is the depth, which describes the number of filters, corresponding to the number of feature maps produced for each operation.


Figure 21. Adding zero-padding controls the size of the feature map during a convolution operation.

After the first step, the convolution operation, the next procedure is to apply an element-wise activation function. A common activation function in convolutional neural networks is the non-linear Rectified Linear Unit, ReLU. ReLU applies an element-wise operation to each pixel of the feature map, replacing all negative values with zero:

f(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases}

Since the previous convolution operations are of a linear nature, introducing a non-linear activation function makes the model able to perform on data with non-linear patterns, which real-world data mostly consist of. The last step of the convolution layer is to spatially reduce the dimensions of the given feature map. This is accomplished by a pooling step that performs a downsampling operation (Britz, 2017). The operation consists of sliding a given filter matrix over a spatial region and reducing the number of parameters by keeping certain features, as seen in Figure 22.

Figure 22. Reduction of spatial dimensions by a max pooling step. The largest value of the feature matrix within the sliding 2x2 window is saved into a new feature map with reduced dimensionality.
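Both operations take only a few lines of NumPy; the feature map values below are a toy example chosen for illustration:

```python
import numpy as np

# Toy feature map with negative values, e.g. after a convolution.
fmap = np.array([[-1.0,  2.0,  0.5, -3.0],
                 [ 4.0, -0.5,  1.0,  2.5],
                 [ 0.0,  3.0, -2.0,  1.5],
                 [ 1.0,  1.0,  0.5,  0.5]])

# ReLU: element-wise, negatives become zero.
relu = np.maximum(fmap, 0.0)

# 2x2 max pooling with stride 2: keep the largest value in each
# non-overlapping 2x2 window, halving both spatial dimensions.
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[4.  2.5]
                #  [3.  1.5]]
```

The reshape trick groups the 4x4 map into four 2x2 windows and takes the maximum within each, matching the max pooling step in Figure 22.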


In summary, a convolutional neural network consists of layers of convolution operations, activation functions, and pooling layers, with the goal of merging the input image spatially into a smaller version. A schematic illustration of the three stages of downsampling is seen in Figure 23.

Figure 23. Schematic overview of the three stages of downsampling in convolutional neural networks: the convolution operation turning the input image into feature maps, followed by a non-linear activation function, and lastly downsampling by a pooling layer.

U-net

The concept of the U-net is built on a neural network architecture called the fully convolutional network for pixel-wise prediction (Long, et al., 2015), with modifications that allow up-sampling from low resolution to high resolution. The overall architecture consists of two separate paths: a narrowing, contracting path and a reversed side that is responsible for the expansion (Ronneberger, et al., 2015). Every block of the contracting path consists of a convolution operation, max pooling, and an activation function. Each block of the contracting part is a downsampling step where the number of filters is doubled for every step.


Materials and methods

Cell cultivation

U2OS cells were cultivated in McCoy’s 5A modified medium supplemented with 10% fetal bovine serum, L-glutamine, and penicillin-streptomycin. Cultivation was performed in a humidified atmosphere containing 5% CO2 at 37°C.

Sample preparation

The U2OS cells were grown overnight on coverslips before being stained with MitoTracker Red CMXRos. The cells were then incubated for 30 minutes at 37°C before fixation with 4% paraformaldehyde at room temperature for 15 minutes, protected from light. The cells were mounted onto glass slides using VectaShield with DAPI as mounting medium and allowed to set overnight.

Image acquisition

A Leica epifluorescence microscope with a 63x/1.25 NA objective was used for imaging the cells. The slides were imaged with both fluorescent and transmitted illumination. The fluorescent channels were blue, with a 405 nm excitation laser, and red, with a 561 nm excitation laser. Bright-field and phase contrast were used for transmitted illumination. A total of 415 layered image file format (LIFF) images with 2048x2048 resolution were acquired, divided into three separate channels, as seen in Table 5.

Table 5. Number of images for each channel, acquired after imaging.

                   Fluorophore   Bright-field   Phase contrast
Number of images   415           415            415

Image preprocessing

All the acquired images were initially reviewed for artifacts, such as floating cells out of focus, and abnormalities, such as ongoing cell division. Cells that are no longer fixed to the glass slide appear as bright artifacts, as shown in Figure 24, which are at times difficult to interpret when analyzing only the fluorescence images showing the nuclei. When compared with its bright-field counterpart, shown in Figure 25, both abnormalities and artifacts are more easily explained and investigated. In Figure 25a it is clear that the previously bright object is floating cells, while the bright-field image of Figure 25b reveals a stage of cell division. Studying Figure 25b further shows an additional important aspect of analyzing both images of each fluorescent/bright-field pair: a whole cell can be mistaken for a nucleus, as the one shown in the upper far left corner of Figure 25b, showing a cell in the mitosis phase.


Figure 24. A sequence of fluorescent images showing distinctive bright objects, usually being cells no longer fixed to the slide, floating in the media.

Figure 25. By analyzing fluorophore images with their bright-field counterpart, artifacts and abnormalities become easier to locate. (A) Showing the previous bright artifact as a group of dying cells, while (B) clearly shows an ongoing cell division, where the cells are swelling compared to neighboring cells still in interphase.

Initial image analysis revealed that approximately 5% of the images contain cells in some stage of cell division. Multiple examples of varying mitosis stages are shown in the image sequence of Figure 26. A cell that undergoes cell division is distinctly dissimilar to neighboring cells in interphase and might give rise to unexpected effects during further image processing.


Figure 26. A sequence of cells in various stages of mitosis; an estimated 5% of all cells across the dataset undergo cell division.

Different methods of thresholding were applied to a small set of images to evaluate the performance of each technique on different images from the dataset. After visually studying the dataset, the next step was to examine the color and brightness distribution with histogram exploration before further segmentation. Histogram analysis showed a clear, sharp peak for the pixels belonging to the nuclei, seen in Figure 27. As the images mostly consist of a dark background with object colors ranging in grayscale, different thresholding techniques were examined for further exploration.

Figure 27. Histogram representation of input data showing a sharp peak being the pixel intensities of the nuclei in the foreground.


A subset of images was chosen for evaluating the performance of different thresholding methods. The images of the subset were selected to represent both samples forecasted to be difficult to segment and multiple images with good contrast. A sample of this set of images is seen in Figure 28, where each frame illustrates a potential segmentation difficulty: slightly out of focus in Figure 28a, low pixel intensity in Figure 28b, artifacts in Figure 28c, and rare events in Figure 28d.

Figure 28. A sample of images differing from the sample majority in different fashions, used for evaluating different thresholding methods. (A) Image slightly out of focus. (B) Image with low pixel intensities and overall low contrast. (C) Image with a bright artifact. (D) Image with abnormality; a cell during mitosis.

The samples were tested with eight different thresholding methods, all seen in Figure 31. Otsu’s method was chosen for the creation of binary masks; the method places a vertical line between the two pixel-intensity regions that constitutes the threshold value, as shown in Figure 29. A sample of the performance, with overlapping results, is shown in Figure 30, where the generated binary mask is marked as red outlines to the far right. The last preprocessing step was morphological opening and closing operations: first morphological opening with a disk kernel to smooth and round off sharp parts of the shapes. The opening operation also removes any pieces smaller than the used disk. Preprocessing finished with morphological closing to fill gaps and holes.
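The opening-and-closing cleanup can be sketched with SciPy's binary morphology. This is an illustration on a synthetic mask, not the thesis code, and a 3x3 square structuring element stands in for the disk kernel described above:

```python
import numpy as np
from scipy import ndimage

# Hypothetical binary mask after thresholding: a nucleus blob with
# a one-pixel hole inside it and an isolated speck of noise.
mask = np.zeros((14, 14), dtype=bool)
mask[2:11, 2:11] = True      # nucleus blob
mask[6, 6] = False           # small hole inside the blob
mask[0, 13] = True           # isolated noise pixel

structure = np.ones((3, 3), dtype=bool)   # square stand-in for a disk kernel

# Opening removes pieces smaller than the structuring element ...
opened = ndimage.binary_opening(mask, structure=structure)
# ... and closing then fills small gaps and holes.
cleaned = ndimage.binary_closing(opened, structure=structure)
```

After both steps the noise pixel is gone and the hole inside the blob is filled, while the blob itself is preserved.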

Figure 29. Otsu's thresholding method with a vertical line illustrating the calculated threshold.

Figure 30. Illustration of automatic generation of a mask for future use during neural network training as ground truth. The third image illustrates the thresholding performance as an overlay on the original image.


Figure 31. Various thresholding methods tested on original images to evaluate and determine which thresholding method to use further. (A) The original input image. (B) Mean thresholding. (C) Thresholding with the Isodata method. (D) Li’s minimum cross-entropy method. (E) Yen’s method of thresholding. (F) Threshold according to the triangle algorithm. (G) Otsu’s method. (H) Niblack local thresholding.

The first dataset cleanup was the removal of all images that were significantly out of focus, reducing the initial dataset from 415 to 397 images per channel. The next cleanup step was the removal of all frames containing cells in some stage of mitosis, together with images containing artifacts. In total, an additional 75 of the 397 images were removed from the original dataset due to such artifacts and rare events like cells in the mitosis phase. Images, having a resolution of 2048x2048 pixels, were tiled into smaller patches of 256x256 pixels with a step size of 64 pixels, resulting in a total of 1024 tiles for each input image. Figure 32 shows the concept of tiling into smaller patches with a sliding window that moves 64 pixels for each tile. The tile size of 256x256 was chosen to fit approximately one nucleus, or at least part of a nucleus, in each tile.
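The tiling step can be sketched in plain NumPy. This is a hypothetical helper, shown on a smaller 512x512 example with the same 256x256 window and 64-pixel step:

```python
import numpy as np

def tile_image(image, tile=256, step=64):
    """Crop overlapping tiles with a sliding window that moves
    `step` pixels horizontally and vertically."""
    tiles = []
    h, w = image.shape[:2]
    for y in range(0, h - tile + 1, step):
        for x in range(0, w - tile + 1, step):
            tiles.append(image[y:y + tile, x:x + tile])
    return np.stack(tiles)

# A hypothetical 512x512 crop: (512 - 256) / 64 + 1 = 5 window
# positions per axis, i.e. 25 overlapping 256x256 tiles.
image = np.zeros((512, 512), dtype=np.uint8)
tiles = tile_image(image)
print(tiles.shape)   # (25, 256, 256)
```

The same scheme scales to the full-size images; how the borders were handled to reach exactly 1024 tiles per 2048x2048 image is not spelled out in this excerpt.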


Figure 32. The tiling concept: (A) 256x256 tiles, cropped by (B) a sliding window moving 64 pixels at a time horizontally and vertically until the whole image is cropped into a total of 1024 tiles; 32 steps both horizontally and vertically by the sliding window (B) result in 1024 new tiles.

Network architecture

A simplified overview of the network architecture is shown in Figure 33 and consists of a contracting path followed by an expansive part, often called the encoder and decoder parts. The full network architecture is illustrated in detail in Appendix 1. The down-sampling path is a repetition of unpadded 3x3 convolution filters, followed by a non-linear rectified linear unit, ReLU, as activation function, and lastly a 2x2 filter with stride 2 for max pooling (Ronneberger, et al., 2015). For each contracting step the number of filters increases; the smaller network starts at 16 and increases to 32, 64, 96 and 128, while the expanded network instead starts at 32 and increases to 64, 128, 192 and ends at 256 before the upsampling part starts. During the up-sampling part of the architecture, the feature maps are reduced by half for each step with a 2x2 filter. In addition to the up-sampling filter, a concatenation is performed with the feature map from the corresponding step on the downsampling path (Ronneberger, et al., 2015). This concatenation between corresponding steps of the contracting and expanding paths, called skip connections, helps prevent loss of spatial resolution. This is followed by a 3x3 filter performing two convolutions and lastly ReLU as the activation function. A dropout rate of 0.2 is added after each step in both the downsampling and upsampling paths. Finally, the last layer maps each of the 256-component feature vectors to the classes.
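Taking the description literally (two unpadded 3x3 convolutions per block followed by 2x2 max pooling, an assumption about the exact layer arithmetic), the spatial sizes along the contracting path of the smaller network can be traced as follows:

```python
# Spatial bookkeeping for the smaller network's contracting path,
# using the filter progression quoted above (16, 32, 64, 96, 128).
filters = [16, 32, 64, 96, 128]

def conv3x3(size):     # one unpadded 3x3 convolution: lose 2 pixels
    return size - 2

def pool2x2(size):     # 2x2 max pooling with stride 2: halve the size
    return size // 2

size = 256             # one 256x256 input tile
for f in filters[:-1]:
    size = conv3x3(conv3x3(size))   # two convolutions per block
    size = pool2x2(size)
# After four blocks the bottleneck sees 128 filters at 12x12.
print(size)
```

Under these assumptions the tile shrinks 256 -> 126 -> 61 -> 28 -> 12 through the encoder, which the decoder then expands back while concatenating the stored feature maps.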


Figure 33. Simplified illustration of the U-net design. Each downsampling and upsampling step also contains a dropout.

Training

During the training of the network, input data were used together with matching segmentation masks as ground truth. The network was trained with the Nadam optimizer (Dozat, 2016), a development of Adam (Kingma & Lei Ba, 2015) combined with Nesterov momentum, implemented in Keras, a high-level API for TensorFlow. Since the learning rate is a vital hyper-parameter, a learning rate finder (Smith, 2015) was implemented in order to find a suitable learning rate both to start with and to use as the interval for the training. The learning rate finder ran for three epochs, gradually increasing the learning rate from a low base value to a max value. During the gradual increase, the learning rate was plotted against the loss, resulting in a plot from which the optimal learning rate interval could be determined. In addition to the function for finding a suitable starting learning rate, another function was implemented to cycle the learning rate systematically throughout the training process. Cyclical learning reduces the number of experiments needed to find an appropriate learning rate by varying it cyclically between a base value and a top value (Smith, 2015) evaluated by the preceding learning rate finder. Two network configurations were arranged for the training, one smaller network and one slightly larger, with details of both setups seen in Table 6. The larger network was designed to explore whether a larger number of filters would capture more data points in the input images than the slightly smaller network.
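The cyclical schedule described by Smith (2015), in its triangular form, can be sketched as follows. This is an illustration, not the thesis code; the step size is an example number, and the max value is taken from Table 8:

```python
def triangular_clr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate (Smith, 2015): the rate
    climbs linearly from base_lr to max_lr over step_size
    iterations, then descends back, and repeats."""
    cycle = iteration // (2 * step_size)
    x = abs(iteration / step_size - 2 * cycle - 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# One full cycle between the base value 0 and the max value 0.0025
# found for the small DICE experiments, with a step size of 4.
rates = [triangular_clr(i, 0.0, 0.0025, step_size=4) for i in range(9)]
print(rates)
```

The rate rises linearly to the peak at iteration 4 and falls back to the base at iteration 8, after which the cycle repeats.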


Table 6. Training configuration for both the smaller network and the larger one.

                             Smaller network       Larger network
Batch size                   32                    16
Train split ratio            0.7                   0.7
Validation split ratio       0.15                  0.15
Test validation split ratio  0.15                  0.15
Patience                     20                    20
Filter size                  3                     3
Activation                   ReLU                  ReLU
Dropout                      0.2                   0.2
Number of filters per block  16, 32, 64, 96, 128   32, 64, 128, 192, 256
Final activation             Sigmoid               Sigmoid


Results

The aim was to train a convolutional neural network to predict nuclei segmentation from bright-field and phase contrast images, with two different loss functions used to evaluate the training. The designed U-net was used on two separate segmentation tasks: the first being segmentation of nuclei in bright-field microscopic images and the second segmentation of nuclei in phase contrast images. Both experiments used an automatically created segmentation map as ground truth, with white pixels for nuclei and black pixels for the background. In total, eight experiments were carried out, divided between two networks of different sizes. Two loss functions with corresponding metrics were used across the experiments: the Sørensen-Dice index, DICE, and cross entropy, the former with the DICE coefficient as metric and the latter with binary accuracy as metric. An overview of all the experiments is shown in Table 7. The DICE coefficient essentially evaluates the overlap between the output and the ground truth, while cross entropy measures the classification achievement as a probability value between 0 and 1. DICE has the advantage of being robust even on datasets with significant class imbalance (Fidon, et al., 2017), since it takes both false negatives and false positives into consideration, while cross entropy does not explicitly account for imbalanced classes.

Table 7. Overview of all performed experiments across both small and large networks.

                  DICE                               Cross entropy
                  Small network    Large network     Small network    Large network
Bright-field      Experiment 1     Experiment 3      Experiment 5     Experiment 7
Phase contrast    Experiment 2     Experiment 4      Experiment 6     Experiment 8
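The two quantities used as loss functions can be sketched in NumPy. These are illustrative implementations on a toy pixel vector, not the Keras losses used in the experiments:

```python
import numpy as np

def dice_coefficient(pred, truth, eps=1e-7):
    """Sorensen-Dice index: 2|A intersect B| / (|A| + |B|), i.e. the
    overlap between prediction and ground truth (1 = perfect)."""
    intersection = np.sum(pred * truth)
    return (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)

def binary_cross_entropy(prob, truth, eps=1e-7):
    """Mean pixel-wise cross entropy between predicted
    probabilities and the binary ground truth."""
    prob = np.clip(prob, eps, 1 - eps)
    return -np.mean(truth * np.log(prob) + (1 - truth) * np.log(1 - prob))

# Toy example: six pixels of ground truth versus a hard predicted
# mask and a vector of predicted probabilities.
truth = np.array([0, 0, 1, 1, 1, 0], dtype=float)
pred  = np.array([0, 1, 1, 1, 0, 0], dtype=float)
probs = np.array([0.1, 0.6, 0.9, 0.8, 0.2, 0.1])

print(round(float(dice_coefficient(pred, truth)), 3))   # 2*2/(3+3)
print(round(float(binary_cross_entropy(probs, truth)), 3))
```

Because DICE normalizes by the sizes of both masks, a large all-background area does not dominate the score the way it can for pixel-wise cross entropy.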

The intervals from the implemented learning rate finder are seen in Table 8, with a subset of graphical illustrations shown in Figure 34 and Figure 35; the rest are found in Appendix 2. The plots derived from the learning rate finder show how the loss is reduced differently when training with DICE as loss function compared with cross entropy. The difference is seen both between loss functions and between training on bright-field images, Figure 34, and phase contrast images, Figure 35.

Table 8. Overview of the different learning rates that were used in the experiments.

                     DICE                             Cross entropy
                     Small network    Large network   Small network    Large network
Base learning rate   0                0               0                0
Max learning rate    0.0025           0.001           0.0025           0.01


Figure 34. Cyclic learning rate finding on the small network with bright-field images and DICE coefficient as loss function.

Figure 35. Cyclic learning rate finding on the small network with phase contrast images and DICE coefficient as loss function.



The resulting eight models were individually evaluated using an unseen test set of images. The previously described metrics were used for evaluation: precision, recall, and relative overlap, with a summary found in Table 9 and extensive values in Appendix 3. In addition to these metrics, the models were also evaluated with their number of true positives TP, true negatives TN, false positives FP, and false negatives FN on the same test set.

Table 9. Accuracy comparison between the smaller (S) and the larger (L) network. (P) precision, (R) recall, and (RO) relative overlap. All numbers are foreground values, i.e. the evaluation of the predicted pixels of class 1.

     DICE                                      Cross entropy
     Bright-field     Phase contrast          Bright-field     Phase contrast
     P    R    RO     P    R    RO            P    R    RO     P    R    RO
S    82%  84%  11%    91%  94%  12%           76%  69%  9%     91%  91%  12%
L    78%  80%  10%    92%  92%  12%           72%  90%  12%    92%  93%  13%
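Given the confusion counts, the foreground precision and recall follow directly. This is an illustrative helper with hypothetical pixel counts; the relative-overlap metric is omitted since its exact definition is not given in this excerpt:

```python
def foreground_metrics(tp, fp, fn):
    """Precision and recall for the foreground class (pixels
    predicted as 1), from true/false positive and false negative
    counts."""
    precision = tp / (tp + fp)   # fraction of predicted 1s that are correct
    recall = tp / (tp + fn)      # fraction of true 1s that were found
    return precision, recall

# Hypothetical pixel counts for one predicted mask:
p, r = foreground_metrics(tp=820, fp=180, fn=160)
print(round(p, 2), round(r, 2))   # 0.82 0.84
```

The example counts were chosen so the rounded values mirror the small-network bright-field DICE entry in Table 9.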

The yielded numbers indicate models with both high precision and high recall for predicted pixel values of 1. Evaluating the precision and recall for the classes of predicted pixel values yields essential insight into the performance of the different models. The values of models trained with DICE as loss function are mostly similar to, but in some cases slightly higher than, those of models trained with cross entropy as loss function. An intermediate image prediction was performed after each training epoch. This allowed visual evaluation between iterations during the training and gave important insight into the performance even before the model was fully trained. A subset of intermediate predictions is seen in Figure 36; the effect of the cyclical learning rate is particularly visible in the bright-field intermediate subset. The whole set of intermediate training predictions is found in Appendix 4.


Figure 36. Series of intermediate image predictions, performed after each training epoch for visual inspection during the training of the neural networks.

During the model evaluation, predictions were performed on whole 2048x2048 images, with multiple predictions on the test set seen in Figure 37-Figure 44. All figures show a pairwise result for each experiment: a prediction with high precision (A) and a prediction with high recall (B). The red contours on each output show the boundary of the corresponding ground truth mask, revealing how each prediction performed for each nucleus.

DICE coefficient

Predictions on phase contrast images performed better across all experiments, foremost with the small network and with both loss functions. The best performance on phase contrast images with DICE as loss function was achieved with the smaller network and is seen in Figure 38; the red outlines show how well the prediction performs, and some nuclei are perfectly predicted. Predictions by the large network are seen in Figure 40, also showing highly accurate results. Predictions on bright-field images, on the other hand, perform better using DICE as loss function, as seen in Figure 37, which shows the performance of the smaller network trained on bright-field images. The larger network performs with slightly lower precision, as seen in Figure 39.


Cross entropy

Models trained on phase contrast images performed negligibly better with the larger network compared with the smaller network, seen in Figure 42 and Figure 44. Models trained on bright-field images, however, performed noticeably better with the larger network, Figure 43, compared with the smaller network, Figure 41.

Figure 37. Model performance, trained on bright-field images on the small network with DICE as loss function. (A) Showing prediction with high precision. (B) Showing prediction with high recall.

Figure 38. Model performance, trained on phase contrast images on the small network with DICE as loss function. (A) Showing prediction with high precision. (B) Showing prediction with high recall.


Figure 39. Model performance, trained on bright-field images on the large network with DICE as loss function. (A) Showing prediction with high precision. (B) Showing prediction with high recall.

Figure 40. Model performance, trained on phase contrast images on the large network with DICE as loss function. (A) Showing prediction with high precision. (B) Showing prediction with high recall.


Figure 41. Model performance, trained on bright-field images on the small network with cross entropy as loss function. (A) Showing prediction with high precision. (B) Showing prediction with high recall.

Figure 42. Model performance, trained on phase contrast images on the small network with cross entropy as loss function. (A) Showing prediction with high precision. (B) Showing prediction with high recall.


Figure 43. Model performance, trained on bright-field images on the large network with cross entropy as loss function. (A) Showing prediction with high precision. (B) Showing prediction with high recall.

Figure 44. Model performance, trained on phase contrast images on the large network with cross entropy as loss function. (A) Showing prediction with high precision. (B) Showing prediction with high recall.


Discussion

Doubling the number of filters should, in theory, result in a significant improvement, since more filters allow the network to perceive more information in the images. The increased number of filters did, however, not substantially improve the performance. Overfitting is a common problem when training neural networks: the trained model memorizes unintended features instead of significant and general features. The consequence of overfitting is a model that performs well on the training data but fails to generalize when exposed to new, unseen data. The initial explanation for the absent performance improvement was that the larger network, with its increased number of filters, was overfitting, being over-parameterized relative to the similar-looking input dataset. A probable reason for the over-parameterization is that the dataset was aggregated under very similar conditions: taken with the same microscope, at the same position and the same angles. This yields a remarkably consistent dataset, so the network does not need to generalize over many factors to perform well. However, further investigation instead showed that the smaller model already performed at full capacity, its parameter count probably already surpassing the number of data points in the training data, making it unnecessary to increase the network size. Figure 45 and Figure 46 show the relation between training and validation performance; a growing gap between the two larger than 5-10% would indicate overfitting. The absence of a significant gap supports the conclusion that the smaller model probably already has full capacity for the given task. The appendix shows the training history from which this conclusion is drawn.
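As a quick sanity check, the train/validation gap rule of thumb used above can be expressed in a few lines of Python. The histories below are illustrative numbers, not the actual training logs:

```python
def overfitting_gap(train_scores, val_scores, threshold=0.05):
    """Return the final train/validation gap and whether it exceeds the
    threshold (5% by default, the lower end of the 5-10% rule of thumb)."""
    gap = train_scores[-1] - val_scores[-1]
    return gap, gap > threshold

# Illustrative DICE histories (not the thesis's actual logs):
train = [0.70, 0.85, 0.93, 0.96]
val = [0.68, 0.84, 0.92, 0.95]
gap, overfit = overfitting_gap(train, val)
print(gap, overfit)  # a 1% gap does not signal overfitting
```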

Figure 45. Showing a small gap between model performance on training data and validation data in small, original network, indicating that the network reached full capacity for the task. A significant gap between performance on training data and validation data indicates overfitting.

Figure 46. Showing a small gap between model performance on training data and validation data in the large network with an increased number of filters, indicating that the small network probably already reached full capacity for the task. A significant gap between performance on training data and validation data indicates overfitting.

[Plots for Figures 45 and 46: DICE coefficient (0.65-1.05) versus epoch (0-45), training and validation curves for the small network ("Train small", "Val small") and the large network ("Train large", "Val large").]


As expected, due to low contrast and a low signal-to-noise ratio, segmentation of bright-field images shows lower performance across all the experiments compared to predictions on phase contrast images. A histogram comparison between phase contrast images and bright-field images demonstrates a significant difference in pixel intensities, seen below in Figure 47.

Figure 47 - Comparison of pixel intensities between a bright-field image and a phase contrast image. The bright-field histogram displays a low signal-to-noise ratio, with lowered model performance as the outcome compared to models trained with phase contrast images as input.
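The histogram comparison can be reproduced with NumPy. The two arrays below are synthetic stand-ins for the two modalities (not the thesis data): a bright-field image with intensities in a narrow band, and a phase contrast image with a wider spread:

```python
import numpy as np

# Synthetic stand-ins: narrow intensity band (low contrast) versus
# wide intensity spread (higher contrast).
rng = np.random.default_rng(0)
bright_field = rng.normal(128, 5, size=(64, 64)).clip(0, 255)
phase_contrast = rng.normal(128, 40, size=(64, 64)).clip(0, 255)

bf_hist, _ = np.histogram(bright_field, bins=256, range=(0, 255))
pc_hist, _ = np.histogram(phase_contrast, bins=256, range=(0, 255))

# A narrow histogram indicates low contrast: fewer occupied bins.
print((bf_hist > 0).sum(), (pc_hist > 0).sum())
```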

Label-free segmentation of phase contrast images, which have moderate contrast and signal-to-noise ratio, is challenging but doable. Of the loss functions used, DICE performs slightly better than cross entropy across all experiments. In real-world applications, nucleus segmentation is usually a question of accurately segmenting the boundary of the organelle. That being said, a model with higher precision would be preferred over a model with higher recall for accurate boundary segmentation. A model with high precision, figuratively seen in Figure 6, predicts nucleus segmentations with a low number of false positive pixels. The precision trade-off is that the model does not account for false negative pixels, that is, missed pixels where nucleus pixels should have been predicted. The model seen in Figure 48 performs predictions with high precision, since all its predicted pixels lie within the boundaries of the ground truth. The red contours show where the nuclei were supposed to be predicted but the model wrongly classified them as background.

Figure 48 - Illustration showing the phase contrast input image, the ground truth that is the target outcome, and the predicted output image with the ground truth overlay in red. The red contours show where the nuclei were supposed to be predicted; this model performs with high precision since it correctly predicts a high rate of pixels belonging to the cell nucleus with a low rate of false positive pixels.


To sum up, a model that accurately predicts pixels within the right areas but incorrectly predicts parts of nuclei as background can still score a high precision, due to the definition of the precision function:

Precision = |target ∩ prediction| / |prediction|

However, it is crucial not to base the model evaluation entirely on the precision score alone. As seen in Figure 48, the model missed almost a whole nucleus. When performing model evaluation and overall assessment, it is crucial to look at the bigger picture and study the whole set of predicted test images in more detail, in order to assess whether the model solves the problem and to determine whether the predictions are consistent across the whole test set. The other aspect of evaluating the model is the recall, the rate describing how well the model found all of the pixels belonging to cell nuclei:

Recall = |target ∩ prediction| / |target|

Figure 49 shows a model with high recall, since it managed to predict a high rate of the pixels belonging to the cell nucleus while having a low rate of false negative pixels. The trade-off in this situation is that the recall equation does not take false positive pixels into account. Therefore, the contours show areas where the model predicted boundaries outside of the actual nucleus boundaries.

Figure 49 - Illustration showing phase contrast input image, the ground truth that is the target outcome and the predicted output image with the ground truth overlay in red. The red contours show where the nuclei were supposed to be predicted; this model performs with high recall since it correctly predicts a high rate of pixels belonging to the cell nucleus and low rate of false negative pixels.
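The precision and recall definitions above can be sketched for binary masks as follows. The toy masks mimic the Figure 48 situation, where every predicted pixel lies inside the true nucleus, so precision is perfect while recall suffers from the missed nucleus pixels:

```python
import numpy as np

def precision_recall(target, prediction):
    """Pixel-wise precision and recall for binary masks, following the
    definitions in the text:
        precision = |target ∩ prediction| / |prediction|
        recall    = |target ∩ prediction| / |target|"""
    target = target.astype(bool)
    prediction = prediction.astype(bool)
    intersection = np.logical_and(target, prediction).sum()
    return intersection / prediction.sum(), intersection / target.sum()

target = np.zeros((8, 8), dtype=int)
target[2:6, 2:6] = 1        # 16 true nucleus pixels
prediction = np.zeros((8, 8), dtype=int)
prediction[3:5, 3:5] = 1    # 4 predicted pixels, all inside the target
p, r = precision_recall(target, prediction)
print(p, r)  # 1.0 0.25
```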

In the best of all possible worlds, the trained model would have both high precision and high recall simultaneously. However, since real-world data seldom yield a perfect solution, it is important to consider ways to mitigate the consequences of low precision or low recall. There are mainly two scenarios of falsely predicted pixels that are less problematic: noise and artifacts within the boundary of the nucleus (false negatives, with low recall as the outcome), and noise and artifacts that are not adjacent to any nucleus. The first problem can be solved by filling the holes within the boundaries after the prediction, thereby achieving a better final result. The latter can be solved by setting a size threshold, below the average size of a nucleus, that removes smaller artifacts. Figure 50 shows the two scenarios of falsely predicted pixels that could be solved systematically by filling boundaries or filtering out predictions smaller than a particular size.


Figure 50. Two scenarios with falsely predicted pixels can be a result of false negative predictions as seen in (A) and false positive predictions shown in (B). False negatives within the nucleus boundary (A) can be solved with a subsequent filling of boundaries while false positive pixels that are not adjacent to any nucleus (B) can be removed by filtering predictions below a threshold size.
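The two systematic fixes, filling enclosed holes and removing components below a size threshold, could be sketched with SciPy as below. The `min_size` value is an illustrative threshold, not one used in the thesis; in practice it would sit below the average nucleus size:

```python
import numpy as np
from scipy import ndimage

def postprocess(mask, min_size=20):
    """Fill false-negative holes enclosed by nucleus boundaries, then
    drop connected components smaller than min_size pixels (isolated
    noise). min_size is an illustrative threshold."""
    filled = ndimage.binary_fill_holes(mask)
    labels, n = ndimage.label(filled)
    sizes = ndimage.sum(filled, labels, range(1, n + 1))
    keep_labels = 1 + np.flatnonzero(sizes >= min_size)
    return np.isin(labels, keep_labels)

mask = np.zeros((32, 32), dtype=bool)
mask[4:20, 4:20] = True     # a nucleus ...
mask[10:12, 10:12] = False  # ... with a false-negative hole inside
mask[28, 28] = True         # an isolated false-positive noise pixel
clean = postprocess(mask)
print(clean[10, 10], clean[28, 28])  # True False
```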

Two more critical scenarios occur when the model largely predicts a nucleus as background (lowered recall, though precision can still be high) and when noise or artifacts are wrongly predicted adjacent to nuclei. These problems are more difficult to solve automatically after the prediction is carried out, and automatic post-processing might even worsen the outcome. These more complex scenarios are shown in Figure 51.

Figure 51. Two more complex scenarios with falsely predicted pixels: false negative predictions as seen in (A) and false positive predictions shown in (B). False negatives without a clear nucleus boundary (A) are more difficult to correct automatically. Likewise, false positive pixels adjacent to cell nuclei (B) are difficult to remove, since they are connected to ground truth regions.

The robustness of the network can be questioned, since all the images were generated on a single occasion under precisely the same circumstances, and that whole set was used for training, validation and, lastly, testing. A way of improving the robustness of the model would be to add images from multiple occasions and different microscope settings. Another aspect of the ability to generalize is that approximately 18% of the images were removed from the original set because they contained cells in various stages of mitosis. This yields a model that is highly specialized: it segments interphase cells well but is likely not as successful at predicting segmentations of images with dividing cells. Another promising approach is full image-to-image generation, bypassing the need for segmentation and instead predicting the full output cell image. One significant challenge of training neural networks with large images is the limited memory available on the chosen graphics processing unit (GPU). The barrier on the maximum image size arises because the input data is first copied and then all of its intermediate activations are kept in memory during backpropagation. This requires a trade-off between the input image size and the batch size fed to the network during training. Two possible approaches to overcome memory depletion when training with large images are either substantially downsizing the


images or tiling the images, thereby training on separate patches of the whole image. Downsampling images leads to loss of information, while training on patches instead of whole images loses comprehensive spatial information (Pinckaers & Litjens, 2018). All the achieved models were trained on smaller patches/image tiles as previously described and finally used to predict on a full image. All trade-offs considered, the given dataset is relatively homogeneous, consisting of a minority of cell colonies against a majority of dark background. The tiling procedure was therefore hypothesized to still result in models able to generalize well. Real-world use of the acquired models would need additional training for a more generalized design. It is important to note, however, that high accuracy, precision and recall do not always translate into good real-world application performance. The outcome of the model could be adapted to a narrow, specialized application or serve as an initial support system, even though there might be artifacts. Creating solutions that make it usable in a real-world setting is more important than improving precision-recall by a few percentage points, even though the two often correlate.
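The tiling approach can be sketched as follows. The tile size is an illustrative parameter; the patch size actually used is given in Materials and methods:

```python
import numpy as np

def tile_image(image, tile_size):
    """Split an image into non-overlapping square tiles, trading
    per-sample memory footprint against spatial context."""
    h, w = image.shape[:2]
    return [image[y:y + tile_size, x:x + tile_size]
            for y in range(0, h, tile_size)
            for x in range(0, w, tile_size)]

tiles = tile_image(np.zeros((512, 512)), 256)
print(len(tiles), tiles[0].shape)  # 4 (256, 256)
```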

Summary

The results show great promise: nucleus prediction from phase contrast images is precise and quite robust, and predictions on bright-field images perform better than expected given their extremely low signal-to-noise levels. However, given the high requirements on achieving a precise location and boundary prediction of the cell nucleus from an image (high precision), the model needs to be improved before being implemented in a real-world setting or used in the lab.


Future perspectives

Improvements

Even though bright-field images have overall low contrast and a low signal-to-noise ratio, future work on improving the model performance is possible. Reasonable next steps would be contrast enhancement of the input images and additional, more diverse, input data. Compared to the segmentation of phase contrast images, the dataset size makes a more significant difference; hence the conclusion to expand the input data to achieve a more robust model, able to capture the underlying regularities. Another approach to improving the prediction of bright-field images is utilizing multiple focus levels and taking advantage of their intensity shifts (Selinummi, et al., 2009). State-of-the-art deep learning techniques progress very rapidly; although the techniques used in this study are recent, numerous techniques were not tested due to time constraints. A widely used methodology is the use of pretrained weights for the encoder part of the U-net. Using a pretrained encoder makes training faster and increases performance on smaller datasets. Another method that would likely improve performance is varying the learning rate to an even greater extent, by letting different parts of the network train with particular learning rates instead of only using a cyclical learning rate.
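The triangular cyclical learning rate schedule of Smith (2015), referenced above, can be sketched as follows. The rate values are illustrative, not the settings used for training in this work:

```python
def triangular_clr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate (Smith, 2015): the rate sweeps
    linearly between base_lr and max_lr over every 2 * step_size
    iterations."""
    cycle = iteration // (2 * step_size)
    x = abs(iteration / step_size - 2 * cycle - 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

lrs = [triangular_clr(i, 1e-4, 1e-3, step_size=4) for i in range(9)]
print(lrs[0], lrs[4], lrs[8])  # base rate, peak rate, back to base
```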


Acknowledgements

This project would not have been possible without the generous support and assistance of many individuals at Chan Zuckerberg Biohub. A special thanks to Anitha Krishnan and Jenny Folkesson, my external supervisors at Chan Zuckerberg Biohub; I am endlessly thankful for their advice, guidance and feedback, as well as for their tireless support throughout the whole project. My sincere thanks and special gratitude to Manuel Leonetti, who invested his full effort in guiding me towards the goals of this report and provided me with invaluable data. Further, I would like to express my deepest appreciation to Emma Lundberg, my main supervisor at KTH, for giving me support and encouragement and, above all, for giving me the golden opportunity to do this extraordinary project. My thanks and appreciation also go to the whole data science team; each of the members gladly helped me with their feedback during the project. Finally, I wish to thank Andreas Klintberg, who provided unending inspiration and patience.


References

Barbosa, A. D., Savage, D. B. & Siniossoglou, S., 2015. Lipid droplet–organelle interactions: emerging roles in lipid metabolism. Current Opinion in Cell Biology, Volume 35, pp. 91-97.

Britz, D., 2017. Artificial Intelligence, Deep Learning, and NLP. [Online] Available at: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/ [Accessed 20 08 2018].

Christiansen, E. M. et al., 2018. In Silico Labeling: Predicting Fluorescent Labels in Unlabeled Images. Cell, 18 April, 173(3), pp. 792-803.

D’Addabbo, A. & Maglietta, R., 2015. Parallel selective sampling method for imbalanced and large data classification. Pattern Recognition Letters, Volume 62, pp. 61-67.

Dice, L. R., 1945. Measures of the Amount of Ecologic Association Between Species. Ecology, 26(3), pp. 297-302.

Dozat, T., 2016. Incorporating Nesterov momentum into Adam, s.l.: s.n.

Fidon, L. et al., 2017. Generalised Wasserstein Dice Score for Imbalanced Multi-class Segmentation using Holistic Convolutional Networks, London: arXiv:1707.00478v4 [cs.CV] .

Kantardzic, M., 2011. Data Mining: Concepts, Models, Methods, and Algorithms. 2nd Edition ed. New Jersey: John Wiley & Sons.

Kingma, D. P. & Lei Ba, J., 2015. Adam: A method for stochastic optimization, San Diego: arXiv:1412.6980 [cs.LG].

Long, J., Shelhamer, E. & Darrell, T., 2015. Fully Convolutional Networks for Semantic Segmentation, Berkeley: arXiv:1411.4038v2 [cs.CV].

Morra, J. H. et al., 2010. Comparison of AdaBoost and Support Vector Machines for Detecting Alzheimer's Disease Through Automated Hippocampal Segmentation. IEEE Transactions on Medical Imaging, 29(1), pp. 30-43.

Ounkomol, C. et al., 2018. Label-free prediction of three-dimensional fluorescence images from transmitted light microscopy. [Online] Available at: https://doi.org/10.1101/289504 [Accessed 1 June 2018].

Pinckaers, H. & Litjens, G., 2018. Training convolutional neural networks with megapixel images. Amsterdam, Medical Imaging with Deep Learning.

Powers, D., 2008. Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation, Adelaide: School of Informatics and Engineering Flinders University.

Ridler, T. W. & Calvard, S., 1978. Picture Thresholding Using an Iterative Selection Method. IEEE Transactions on Systems, Man, and Cybernetics, 8(8), pp. 630-632.


Ronneberger, O., Fischer, P. & Brox, T., 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation, Freiburg: arXiv:1505.04597v1 [cs.CV].

Selinummi, J. et al., 2009. Bright Field Microscopy as an Alternative to Whole Cell Fluorescence in Automated Analysis of Macrophage Images. PLOS ONE, 4(10), p. e7497.

Singh, K., Vishwakarma, D. K., Walia, G. S. & Kapoor, R., 2016. Contrast enhancement via texture region based histogram equalization. Journal of Modern Optics, 63(15), pp. 1444-1450.

Smith, L. N., 2015. Cyclical Learning Rates for Training Neural Networks, Washington: arXiv:1506.01186 [cs.CV].

Sullivan, D. P. & Lundberg, E., 2018. Seeing More: A Future of Augmented Microscopy. Cell, 19 April, 173(3), pp. 546-548.

Valm, A. M. et al., 2017. Applying systems-level spectral imaging and analysis to reveal the organelle interactome. Nature, 546(7656), pp. 162-167.


Appendix 1 Detailed U-Net architecture.


Appendix 2 Learning rate finder plots.
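The learning rate range test behind these plots can be sketched as follows; `train_step` here is a hypothetical stand-in for one training iteration returning its loss, not a function from the thesis code:

```python
def lr_range_test(train_step, lr_min=1e-6, lr_max=1e-2, steps=8):
    """Sketch of a learning rate finder: run a few training iterations
    while the learning rate grows geometrically from lr_min to lr_max,
    recording the loss observed at each rate."""
    factor = (lr_max / lr_min) ** (1 / (steps - 1))
    lr, lrs, losses = lr_min, [], []
    for _ in range(steps):
        lrs.append(lr)
        losses.append(train_step(lr))
        lr *= factor
    return lrs, losses

# Toy loss behaviour: stable at low rates, diverging at high rates.
lrs, losses = lr_range_test(lambda lr: 1.0 if lr < 1e-3 else 10.0)
```

A useful learning rate is then picked from the region where the loss is still decreasing, just before it diverges.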

Figure 52 - Cyclic learning rate finding on large network with bright-field images and DICE coefficient as loss function.

Figure 53 - Cyclic learning rate finding on large network with phase contrast images and DICE coefficient as loss function.



Figure 54 - Cyclic learning rate finding on small network with bright-field images and cross entropy as loss function.

Figure 55 - Cyclic learning rate finding on large network with bright-field images and cross entropy as loss function.



Figure 56 - Cyclic learning rate finding on small network with phase contrast images and cross entropy as loss function.

Figure 57 - Cyclic learning rate finding on large network with phase contrast images and cross entropy as loss function.



Appendix 3 Individual and average precision, recall and relative overlap for segmented 0 (zeros) and 1 (ones).

Table 10 - Extensive values of precision, recall and relative overlap, both for individual classes (zeros and ones) and weighted average values of both classes.

Large network        DICE           DICE            Cross entropy  Cross entropy
                     Bright-field   Phase contrast  Bright-field   Phase contrast
Precision 0          0.973552153    0.989222318     0.986755851    0.989881312
Precision 1          0.779944317    0.918017509     0.716093346    0.916742896
Precision weighted   0.950823012    0.98072352      0.955755501    0.981073826
Recall 0             0.969905367    0.988858425     0.953792488    0.988483808
Recall 1             0.801908729    0.920509132     0.901030558    0.926196234
Recall weighted      0.950182925    0.980700453     0.947749396    0.980983004
Relative overlap     0.1            0.12            0.12           0.13

Small network        DICE           DICE            Cross entropy  Cross entropy
                     Bright-field   Phase contrast  Bright-field   Phase contrast
Precision 0          0.978761428    0.99168252      0.959722873    0.988495633
Precision 1          0.819422544    0.913436775     0.758668721    0.910681301
Precision weighted   0.960068892    0.982528266     0.936326535    0.979334613
Recall 0             0.975375805    0.988229416     0.971072187    0.98804027
Recall 1             0.840749549    0.937442621     0.690542799    0.913825189
Recall weighted      0.959582383    0.982287685     0.938427448    0.979302987
Relative overlap     0.11           0.12            0.09           0.12


Appendix 4 Intermediate image predictions.

Figure 58 - Intermediate image predictions during training on bright-field images with DICE as loss function on a small network. Each intermediate prediction was performed after each epoch during training, for visual inspection and evaluation during training sessions. The first image shows the input; the remaining images show a prediction for each epoch, trained with a cyclical learning rate.


Figure 59 - Intermediate image predictions during training on phase contrast images with DICE as loss function on a small network. Each intermediate prediction was performed after each epoch during training, for visual inspection and evaluation during training sessions. The first image shows the input; the remaining images show a prediction for each epoch, trained with a cyclical learning rate.


Figure 60 - Intermediate image predictions during training on bright-field images with DICE as loss function on a large network. Each intermediate prediction was performed after each epoch during training, for visual inspection and evaluation during training sessions. The first image shows the input; the remaining images show a prediction for each epoch, trained with a cyclical learning rate.


Figure 61 - Intermediate image predictions during training on phase contrast images with DICE as loss function on a large network. Each intermediate prediction was performed after each epoch during training, for visual inspection and evaluation during training sessions. The first image shows the input; the remaining images show a prediction for each epoch, trained with a cyclical learning rate.


Figure 62 - Intermediate image predictions during training on bright-field images with cross entropy as loss function on a small network. Each intermediate prediction was performed after each epoch during training, for visual inspection and evaluation during training sessions. The first image shows the input; the remaining images show a prediction for each epoch, trained with a cyclical learning rate.


Figure 63 - Intermediate image predictions during training on phase contrast images with cross entropy as loss function on a small network. Each intermediate prediction was performed after each epoch during training, for visual inspection and evaluation during training sessions. The first image shows the input; the remaining images show a prediction for each epoch, trained with a cyclical learning rate.


Figure 64 - Intermediate image predictions during training on bright-field images with cross entropy as loss function on a large network. Each intermediate prediction was performed after each epoch during training, for visual inspection and evaluation during training sessions. The first image shows the input; the remaining images show a prediction for each epoch, trained with a cyclical learning rate.

Figure 65 - Intermediate image predictions during training on phase contrast images with cross entropy as loss function on a large network. Each intermediate prediction was performed after each epoch during training, for visual inspection and evaluation during training sessions. The first image shows the input; the remaining images show a prediction for each epoch, trained with a cyclical learning rate.


Appendix 5 True positives, true negatives, false positives and false negatives as evaluation of the trained neural networks.
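As a consistency check, precision and recall can be recomputed directly from these counts. The example below uses the Figure 66 counts (bright-field, DICE, smaller network), which reproduce the corresponding values in Appendix 3, Table 10:

```python
def precision_recall_from_counts(tp, fp, fn):
    """Precision and recall from confusion matrix counts:
    precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Pixel counts from Figure 66 (bright-field, DICE, smaller network).
tp, fp, fn = 19_857_028, 4_375_925, 3_761_216
precision, recall = precision_recall_from_counts(tp, fp, fn)
print(round(precision, 4), round(recall, 4))
```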

                  Actual True    Actual False
Predicted True    19,857,028     4,375,925
Predicted False   3,761,216      173,332,423

Figure 66 - Bright-field DICE smaller network

                  Actual True    Actual False
Predicted True    18,953,359     5,347,554
Predicted False   4,681,948      172,343,731

Figure 67 - Bright-field DICE large network

                  Actual True    Actual False
Predicted True    22,080,456     2,092,488
Predicted False   1,473,472      175,680,176

Figure 68 - Phase contrast DICE smaller network


                  Actual True    Actual False
Predicted True    22,119,608     1,975,366
Predicted False   1,910,146      175,321,472

Figure 69 - Phase contrast DICE larger network

                  Actual True    Actual False
Predicted True    16,178,065     5,146,216
Predicted False   7,249,976      172,752,335

Figure 70 - Bright-field cross entropy smaller network

                  Actual True    Actual False
Predicted True    20,776,824     8,237,304
Predicted False   2,282,132      170,030,332

Figure 71 - Bright-field cross entropy larger network

                  Actual True    Actual False
Predicted True    21,659,502     2,124,342
Predicted False   2,042,517      175,500,231

Figure 72 - Phase contrast cross entropy smaller network


                  Actual True    Actual False
Predicted True    22,454,871     2,039,315
Predicted False   1,789,312      175,043,094

Figure 73 - Phase contrast cross entropy larger network
