
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Peptide Retention Time Prediction using Artificial Neural Networks

SARA VÄLJAMETS

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES


Peptide Retention Time Prediction using Artificial Neural Networks

SARA VÄLJAMETS

Master’s Thesis in Mathematical Statistics (30 ECTS credits)
Master Programme in Applied and Computational Mathematics (120 credits)
Royal Institute of Technology, year 2016
Supervisors at KTH: Lukas Käll, Timo Koski
Examiner: Timo Koski

TRITA-MAT-E 2016:49
ISRN-KTH/MAT/E--16/49-SE
Royal Institute of Technology
School of Engineering Sciences (SCI)
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

This thesis describes the development and evaluation of an artificial neural network, trained to predict the chromatographic retention times of peptides based on their amino acid sequence. The purpose of accurately predicting retention times is to increase the number of protein identifications in shotgun proteomics and to improve targeted mass spectrometry experiments. The model presented in this thesis is a branched convolutional neural network (CNN) consisting of two convolutional layers followed by three fully connected layers, all with the leaky rectifier as the activation function. Each amino acid sequence is represented by a matrix X ∈ R^{20×25}, with each row corresponding to a certain amino acid and the columns representing the position of the amino acid in the peptide. This model achieves an RMSE corresponding to 3.8% of the total running time of the liquid chromatography and a 95% confidence interval proportional to 14% of the running time, when trained on 20 000 unique peptides from a yeast sample. The CNN predicts retention times slightly more accurately than the software ELUDE when trained on a larger dataset, yet ELUDE performs better on smaller datasets. The CNN does, however, have a considerably shorter training time.


Peptid retentionstids prediktering med artificiella neuronnät

Sammanfattning

This Master's thesis describes the development and evaluation of an artificial neural network trained to predict the chromatographic retention time of peptides based on their amino acid sequence. The purpose of predicting retention times is to identify more peptides in shotgun proteomics experiments and to improve targeted mass spectrometry experiments. The final model in this work is a convolutional neural network (CNN) consisting of two convolutional layers followed by three fully connected layers, all with the leaky rectifier as the activation function. Each amino acid sequence is represented by a matrix X ∈ R^{20×25}, where each row represents a specific amino acid and the columns describe the amino acid's position in the peptide. The model achieves a root mean square error corresponding to 3.8% of the running time of the liquid chromatography and a 95% confidence interval corresponding to 14% of the running time, when the CNN is trained on 20 000 unique peptides from a yeast sample. The CNN performs marginally better than the software ELUDE when both are trained on a large dataset, but for limited datasets ELUDE performs better. The CNN does, however, train considerably faster.


Acknowledgements

I would like to thank my supervisors Lukas Käll and Timo Koski for their help and feedback on my work. I would also like to thank Heydar Maboudi Afkham for his supervision and input throughout the project. I would like to thank Matthew The for his help on running ELUDE. Finally, I would like to thank Marcus for proofreading.


Contents

1 Introduction
  1.1 Motivation & Contribution
  1.2 Outline

2 Artificial Neural Networks
  2.1 Feedforward Neural Networks
  2.2 Activation Functions
  2.3 Convolutional Layers

3 Training Deep Neural Networks
  3.1 Cost Function
  3.2 Gradient Descent and Stochastic Gradient Descent
  3.3 Backpropagation
  3.4 Dropout

4 Data
  4.1 Data Representations
  4.2 Mirror Images
  4.3 Scaling

5 Evaluation

6 Implementation

7 Model Development
  7.1 Layer Architecture
  7.2 Selecting Hyperparameter Values
  7.3 Structure of Final Model, Net 11.3

8 Results

9 Discussion
  9.1 Comparison of the Different Implemented Models
  9.2 Comparison of Net 11.3 and ELUDE
  9.3 Comment on the Structure of Net 11.3
  9.4 Future Work & Possible Improvements

10 Conclusion

A Complete Results from First and Second Round of Hyperparameter Experiments

B Earlier Models
  B.1 Linear Regression
  B.2 AA Frequency & AA Pair Frequency ANN
  B.3 Single Matrix Representation CNN


1 Introduction

The study of proteins, proteomics, is central to enhancing our understanding of biological systems. It is also a field that at the outset faces a series of challenges. Firstly, there is a vast number of different proteins. In the human body alone, over 20 000 protein-coding genes have been identified, some with the ability to encode hundreds of different proteins. Another challenge is the ratio of concentrations of these different proteins. In human plasma it is estimated that these ratios amount to 1 in 10^{12} for certain proteins, yet the scarce proteins may be as important to study as the abundant ones [2].

Mass spectrometry (MS) based methods are today the most popular tools for analysing the protein content of biological mixtures [19]. In MS the specimen to be analysed is ionised and then accelerated through an electric or magnetic field, sorting the ions based on their mass-to-charge ratio. Inferences about the specimen can then be made from the resulting spectrum. Shotgun proteomics combines MS with an initial liquid chromatography (LC) step. First, the proteins are enzymatically broken down into peptides, short chains of amino acids. The peptides are then dissolved in water (solvent A) and poured into a tube (the LC column) filled with hydrophobic silica beads. Due to their generally hydrophobic nature, the peptides bind to the silica beads. A hydrophobic solvent B is then gradually introduced into the column, usually with a linear gradient. Each peptide will release from the silica beads at a certain concentration of solvent B. This allows the peptides to be gradually fed to the MS, as they elute at different times depending on their properties. The time interval from the introduction of solvent B to the elution of a peptide is referred to as the peptide's chromatographic retention time [21].

Accurate prediction of retention times can benefit the field of proteomics both by increasing the number of peptide identifications and by increasing the reliability of those identifications. This thesis explores the possibilities of applying deep neural networks to this task.

1.1 Motivation & Contribution

There are two main applications that make retention time prediction relevant to shotgun proteomics. Firstly, predicting retention times can improve peptide identification. Identifying peptides from MS is nontrivial, and the proportion of spectra from LC/MS analyses that can be confidently matched to a peptide is generally around 50%. This low yield is due to several factors, such as limitations on protein database searches and incorrect mass and charge assignment [2]. Many attempts at dealing with these issues are underway, and utilizing peptide retention time prediction is one such approach. The experimental retention time of the peptide can be compared to the predicted retention time of the identified peptide as an additional verification of the identification [19]. The second application relates to the large variation in abundance between peptides. High-abundance peptides are favoured in most identification methods. As a result,


targeted MS is gaining ground, to improve the identification of scarcer peptides [19]. Knowledge of when certain peptides will elute and enter the mass spectrometer allows the calibration to be changed to fragment the mass-to-charge ratio corresponding to the peptides of interest [19].

For a given setup, the timing of different peptides has proven to be highly consistent between runs [18], [21]. This reproducibility suggests that the factors determining retention times for a given setup relate almost solely to the peptides themselves [21]. A peptide constitutes a sequence of amino acids connected by peptide bonds and in addition has a three-dimensional conformation. The sequence should hold virtually all the information required, although, in particular for longer peptides, the three-dimensional structure may affect the elution.

The elution process is dependent on the choice of solvent gradient, among other features, and retention time is therefore specific to each experimental setup. However, there is little variation in the setup and conditions of LC, making interlaboratory comparisons possible if the settings can be calibrated for [2].

Several attempts at retention time prediction have previously been published. It was first considered in 1980 by Meek [18]. The prediction then entailed summing the contribution of each amino acid residue to the retention time. However, this model had limited predictive power, and it became apparent that the order of the amino acids affects the retention time. This led to the development of a series of predictors based on different methods, all of which included more complex feature engineering, utilizing domain knowledge, to better represent the data [13], [19], [22], [23]. Notable among them is the retention time predictor ELUDE, first described in [21], which uses kernel regression to predict retention times based on 60 features derived from the peptides' amino acid composition. Petritis et al. [18] applied artificial neural networks to this problem, representing the data as nearest-neighbour pairs. However, neither of these predicts with an accuracy close to the high level of consistency observed in retention times for repeated experiments. There is still considerable room for improvement, particularly for longer peptides.

The methods applied to this problem have converged more and more towards the realm of machine learning, where performance is heavily dependent on the choice of data representation used [1]. Recent advances in deep learning have shown promising results from learning a feature representation from the data, rather than engineering the features heuristically [1]. Deep neural networks can discover intrinsic structures in complex data by learning a representation through multiple layers, at increasing levels of abstraction [17]. Perhaps the key challenge in retention time prediction is engineering sufficiently descriptive features. A deep learning approach may therefore have the potential to find a better representation and thereby enable the learning of a model with improved performance. Furthermore, through this representation learning,


artificial neural networks are adept at modelling complex non-linear functions, as long as they are reasonably continuous. A detailed discussion of the particular advantages of applying convolutional networks to this problem will follow in Section 2.3. In short, however, convolutional networks are designed to handle array-structured data, and for such data training is easier because the model uses a smaller number of connections and parameters [12].

1.2 Outline

This report details the work done in this project on applying artificial neural networks to the task of peptide retention time prediction. Firstly, an introduction to artificial neural networks is given. The training of these networks is also discussed at length in Section 3. The data used for training the models is then described, along with the chosen data representations. The evaluation metrics applied when presenting the results in Section 8 are also presented in detail. Section 7 then explains the methodology employed in the model development. The final model structure is illustrated in Section 7.3. Finally, the results of the implemented models are presented and compared to the benchmark model ELUDE, followed by a brief conclusion.


2 Artificial Neural Networks

The development of artificial neural networks was, as the name suggests, inspired by their biological counterpart. The neuroscience-inspired name is furthermore suitable, as artificial neural networks are today viewed as one of the key methods in the field of Artificial Intelligence (AI). Artificial neural networks have, for example, had great success in natural language processing and computer vision, as well as in teaching computers to play games such as Go [3], [12], [25]. However, the resemblance to biological neural networks is in practice limited. The goal of most research on artificial neural networks is not to model the brain; it is rather guided by many mathematical and engineering disciplines [8].

Although artificial neural networks were introduced in the 1950s, it took five decades for these methods to gain the high level of popularity they hold today. This was due to several factors, one among them being the modelling limitations of the first network presented, the one-layer perceptron. However, already in 1989 a standard multilayer feedforward network with as few as one hidden layer (one layer less than the network described in Section 2.1 and illustrated in Figure 1) was proved capable of approximating any Borel measurable function from one finite-dimensional space to another, to any desired degree of accuracy, given that the hidden layer contains sufficiently many units [10]. As such, artificial neural networks can be viewed as a class of universal approximators. There is of course a distinction between what a network can theoretically represent and what learning algorithms are able to learn. Furthermore, there is a question of size to consider: the sufficient number of units stated in the theorem might be enormous [8].

Neither the main principles of these methods, nor the algorithms for training them, have changed greatly since the 1980s, even though the views on these methods have changed significantly. This shift can mainly be attributed to the considerable improvements in computing power that have occurred in recent years [6]. Another contributing factor is the increasing size of available datasets, which has allowed for successful training of networks that generalise well [12]. These two developments, with the aid of a small number of algorithmic changes, have enabled successful training of large neural networks consisting of many layers; deep neural networks, in other words.

It is these deep neural networks which, to a great extent, account for the dramatic improvement in areas such as computer vision and speech recognition [17]. This success is mainly due to the abilities, discussed in Section 1.1, of discovering intrinsic structures in data by learning, from the raw data, the representation required for classification or evaluation. The basic theory of artificial neural networks is presented in the following sections.

Throughout this report the term ’neural networks’ will be used to refer to artificial


neural networks, not biological ones. The terms neural network, artificial neural network, ANN and deep neural network will be used almost interchangeably, with the only distinction being that a deep neural network refers to an artificial neural network with a large number of layers.

2.1 Feedforward Neural Networks

The aim of an artificial neural network is to approximate some function y = f*(x) with ŷ = f(x), where the input x can take many different forms and the target variable y can be either categorical or continuous. These models are called feedforward networks because information flows through the function in a chain [8]. Models that include information being passed "backwards" are called recurrent neural networks and will not be considered in this report.

Neural network layers generally consist of several units, or neurons, performing vector-to-scalar computations in parallel, thus forming a directed interconnected network, as seen in Figure 1.

Most functions used in the hidden layer neurons consist of two components:

z = w^T x + b ,    (1)

φ(z) = f(z) .    (2)

Firstly, a linear transform is applied to the input vector x as in Equation 1, where w is a vector of weights and b is a scalar bias term. A non-linear activation function, Equation 2, is then applied to the result of Equation 1. The function φ(z), generally referred to as the activation function or the squashing function, takes on different shapes and will be discussed in Section 2.2.

A neural network with two hidden layers, each consisting of k units, and a linear output layer will now be described. The structure of this network is shown in Figure 1. The networks implemented for this project contain more layers; however, the principles are the same, and this smaller network is therefore presented for clarity.

Given a data set of n samples, where each sample i takes the form (x_i, y_i) with x_i ∈ R^m and y_i ∈ R, the network will have an input layer as follows:

Z_{10} = 1
Z_{11} = x_{i1}
Z_{12} = x_{i2}
  \vdots
Z_{1m} = x_{im} ,    (3)


Figure 1: Structure of the network described in Section 2.1, with each circle representing a neuron and the arrows representing directed connections between the neurons. The inputs x_{i1}, ..., x_{im} feed the input layer Z_{11}, ..., Z_{1m}, which is followed by two hidden layers Z_{21}, ..., Z_{2k} and Z_{31}, ..., Z_{3k} and the output layer Z_4.


where Z_{10} is included to account for the bias term. The first hidden layer is then

f^{(1)} :
Z_{20} = 1
Z_{21} = φ(w^1_{10} Z_{10} + w^1_{11} Z_{11} + w^1_{12} Z_{12} + \dots + w^1_{1m} Z_{1m})
Z_{22} = φ(w^1_{20} Z_{10} + w^1_{21} Z_{11} + w^1_{22} Z_{12} + \dots + w^1_{2m} Z_{1m})
  \vdots
Z_{2k} = φ(w^1_{k0} Z_{10} + w^1_{k1} Z_{11} + w^1_{k2} Z_{12} + \dots + w^1_{km} Z_{1m}) ,    (4)

where each equation corresponds to a neuron in the layer. This first hidden layer is followed by

f^{(2)} :
Z_{30} = 1
Z_{31} = φ(w^2_{10} Z_{20} + w^2_{11} Z_{21} + w^2_{12} Z_{22} + \dots + w^2_{1k} Z_{2k})
Z_{32} = φ(w^2_{20} Z_{20} + w^2_{21} Z_{21} + w^2_{22} Z_{22} + \dots + w^2_{2k} Z_{2k})
  \vdots
Z_{3k} = φ(w^2_{k0} Z_{20} + w^2_{k1} Z_{21} + w^2_{k2} Z_{22} + \dots + w^2_{kk} Z_{2k}) ,    (5)

f^{(3)} :  Z_4 = w^3_0 Z_{30} + w^3_1 Z_{31} + w^3_2 Z_{32} + \dots + w^3_k Z_{3k} .    (6)

As each neuron in a layer takes the complete output of the previous layer as input, these types of layers are referred to as fully connected layers. Consider these equations in matrix notation, with a weight matrix W^L for each layer L, where W^1 ∈ R^{k×(m+1)} contains the weights from Equations 4, W^2 ∈ R^{k×(k+1)} contains the weights from Equations 5 and W^3 ∈ R^{k+1} contains the weights from Equations 6, with Z_1 ∈ R^{m+1}. The output of the network, Z_4, is thus:

Z_4 = f^{(3)}(f^{(2)}(f^{(1)}(x))) = W^3 Z_3 = W^3 φ(W^2 Z_2) = W^3 φ(W^2 φ(W^1 Z_1)) .

During training, the final output layer f^{(3)} attempts to drive the output Z_4 towards the target value y_i for each given x_i. The previous layers f^{(1)} and f^{(2)}, however, do not have directly specified targets, and as a consequence these layers are referred to as 'hidden'. These layers can be seen as preprocessing of the data in a high-dimensional space, learning a representation to give as input to the linear output layer. Instead of deciding on a kernel or manually engineering the features, the hidden layers of the network learn them. The strength of neural networks comes from these hidden layers [8].
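As a concrete illustration, the composition Z_4 = W^3 φ(W^2 φ(W^1 Z_1)) can be written in a few lines of NumPy. This is a minimal sketch, not the network used in the thesis: the layer sizes are arbitrary and the choice of a sigmoid for φ is a placeholder (Section 2.2 discusses the alternatives).

```python
import numpy as np

def phi(z):
    # placeholder activation function; see Section 2.2 for other choices
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2, W3):
    """Forward pass of the two-hidden-layer network of Section 2.1.

    x  : input vector of length m
    W1 : (k, m+1) weights of the first hidden layer
    W2 : (k, k+1) weights of the second hidden layer
    W3 : (k+1,)   weights of the linear output layer
    The leading 1 prepended to each layer's input plays the role of Z_{L0},
    the bias unit.
    """
    Z1 = np.concatenate(([1.0], x))              # input layer, Z_{10} = 1
    Z2 = np.concatenate(([1.0], phi(W1 @ Z1)))   # first hidden layer
    Z3 = np.concatenate(([1.0], phi(W2 @ Z2)))   # second hidden layer
    return W3 @ Z3                               # linear output Z_4

m, k = 4, 3
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(k, m + 1))
W2 = rng.normal(scale=0.1, size=(k, k + 1))
W3 = rng.normal(scale=0.1, size=k + 1)
print(forward(rng.normal(size=m), W1, W2, W3))
```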


2.2 Activation Functions

This section introduces several commonly used activation functions, variations of φ(z) as seen in Equations 2, 4 and 5.

The first activation function introduced was a simple step function, as shown in Figure 2a. It is part of the perceptron, Equation 7, the simplest neural network, which was presented by Rosenblatt in 1958 [24].

φ(z) = 1 if w^T x + b > 0, and 0 otherwise.    (7)

A multilayer perceptron was the network considered in [10], when neural networks were proved to be universal approximators. However, other architectures have since been developed and have, for certain problems, proven to train faster and to have a greater likelihood of reaching a better and more stable solution [16].

Figure 2: Examples of commonly used activation functions in artificial neural networks: (a) perceptron (step function), (b) sigmoid, (c) hyperbolic tangent, (d) rectifier.

Traditionally, the most commonly used activation function has been the sigmoid function, Equation 8, plotted in Figure 2b. However, in recent years, with the success of deep neural networks, sigmoid activation functions have somewhat fallen out of favour for use in hidden layers. The reason is that the sigmoid function is prone to saturation during training: for a sigmoid to output 0 it has to be pushed to a regime where the gradient approaches zero, thereby preventing gradients from flowing backward and preventing the lower layers from learning useful features [6]. (Training of neural networks will be discussed in Section 3.) Sigmoid functions can, however, still be useful for the right problem and with a suitable initialisation of the weights:

φ(z) = \frac{1}{1 + \exp(-z)} .    (8)

Hyperbolic tangent is another commonly used activation function:

φ(z) = tanh(z) .


As clearly seen in Figure 2c, the hyperbolic tangent is in a linear regime around zero, which is an improvement on the sigmoid function. However, it still has the problem of saturation in other regions.

Studies by Glorot et al. [7], among others, have shown that training of neural networks proceeds better when the neurons are either off or operating mostly in a linear regime. Following these observations, a new activation function has gained popularity, essentially a "sharp" version of the sigmoid. This activation function is called the rectifier and takes the form:

φ(z) = max(0, z) . (9)

This non-linearity has the advantage of considerably shorter training times, as it is not prone to the same saturation problem as the sigmoid and hyperbolic tangent functions [12]. A theoretical objection to this function is that it is not differentiable at 0, which was also one of the reasons why it took some time for it to be widely adopted. In implementations, however, this does not present a challenge, as the value zero can simply be approximated with a value close to, but not equal to, zero [8].

One problem that can arise with rectifier activation functions is that they can "get stuck" on zero slopes, as the gradient-based methods used cannot learn from examples for which the neuron is not activated [8]. A development of the rectifier has been suggested to deal with this disadvantage:

f(z) = max(αz, z). (10)

Equation 10 is called a leaky rectifier, where the parameter α determines the "leakiness". The neuron is then never completely inactive. Plenty of other options for activation functions exist; however, so far no other function has proven to perform significantly better than the rectifier on a wide range of problems [8].
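The functions of Figure 2 and Equations 7-10 are straightforward to write down; a small NumPy sketch follows (the value α = 0.01 is an arbitrary example, not the leakiness used in the thesis models):

```python
import numpy as np

def step(z):                            # perceptron, Equation 7 (with z = w^T x + b)
    return (z > 0).astype(float)

def sigmoid(z):                         # Equation 8
    return 1.0 / (1.0 + np.exp(-z))

def hyperbolic_tangent(z):
    return np.tanh(z)

def rectifier(z):                       # Equation 9
    return np.maximum(0.0, z)

def leaky_rectifier(z, alpha=0.01):     # Equation 10; alpha controls the "leakiness"
    return np.maximum(alpha * z, z)

z = np.linspace(-3.0, 3.0, 7)
print(leaky_rectifier(z))               # negative inputs are scaled by alpha, not zeroed
```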

2.3 Convolutional Layers

Artificial neural networks with at least one convolutional layer are referred to as convolutional neural networks (CNNs), or simply convolutional networks. Convolutional networks are designed to handle data in the form of multiple arrays, be they two-dimensional (2D) or three-dimensional (3D) [17]. These types of layers have had huge successes within computer vision, with the network designed by Krizhevsky et al. for the ImageNet competition being a notable milestone [12], as well as the early success of LeCun with reading hand-written digits, presented in [15]. Convolutional networks have, however, also been successfully applied in other areas, for example for constructing data representations of the game board when teaching computers to play the game Go [25] and for natural language processing [3].

The convolution operation is illustrated in Figure 3, for an input matrix of size X ∈ R^{3×4} and a filter of size 2×2.


Figure 3: Convolution of a 2×2 filter, or kernel, with parameters α, β, γ, δ over a 3×4 input matrix with entries a-l, producing a 2×3 output matrix. Each output element is obtained by applying the filter to a 2×2 region of the input; the first element, obtained when the filter operates on the top left corner of the input matrix, is φ(aα + bβ + eγ + fδ), and the remaining elements are obtained similarly from the other regions.

A convolutional layer will have several such filters. The network then learns different parameter values α, β, γ, δ for the different filters, in the same way as neurons in a standard fully connected layer have different parameter values. These filters can vary in size; [12], for example, reported using filters ranging from 11×11 to 3×3. The difference between these layers and the layers described in Section 2.1 is that several neurons in a convolutional layer share parameters, and that each neuron is only connected to a small region of the input (the input is generally much larger than the filter). When comparing a CNN with a standard ANN of equal size, the CNN therefore has considerably fewer connections and parameters and is consequently easier to train, while its theoretical best performance is usually only slightly worse [12]. This allows for the construction of a larger network, with the potential of improving the performance.
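The operation of Figure 3 can be reproduced directly: sliding a 2×2 filter over a 3×4 input with no padding and stride 1 gives a 2×3 output. The sketch below uses explicit loops for clarity rather than an optimised implementation; the input values and filter weights are arbitrary examples, and the activation φ is left as the identity.

```python
import numpy as np

def conv2d_valid(X, F, phi=lambda z: z):
    """'Valid' 2D convolution (strictly, cross-correlation, as in most CNN
    libraries) of a filter F over an input X, followed by the activation phi."""
    n_rows = X.shape[0] - F.shape[0] + 1
    n_cols = X.shape[1] - F.shape[1] + 1
    out = np.empty((n_rows, n_cols))
    for r in range(n_rows):
        for c in range(n_cols):
            patch = X[r:r + F.shape[0], c:c + F.shape[1]]
            out[r, c] = phi(np.sum(patch * F))
    return out

X = np.arange(12, dtype=float).reshape(3, 4)   # plays the role of a..l in Figure 3
F = np.array([[1.0, 0.0],                      # plays the role of alpha..delta
              [0.0, -1.0]])
print(conv2d_valid(X, F))                      # a 2x3 output matrix
```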

Convolutional layers are usually paired with a subsequent pooling layer, the most common kind being the max-pooling layer. This layer does not perform any learning: for each k×k region of its input it outputs the maximum of that region. Pooling filters with k = 2 are commonly used. Pooling layers reduce the size of the data for each layer, which is particularly useful in computer vision, where the input data is generally large. Pooling can also introduce a degree of invariance to input translations [1].
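A 2×2 max-pooling layer, by contrast with the convolutional layer above, has no parameters to learn; it only downsamples. A minimal sketch, assuming the input dimensions are multiples of the pool size:

```python
import numpy as np

def max_pool(X, k=2):
    """Non-overlapping k x k max pooling; X's dimensions must be multiples of k."""
    n, m = X.shape[0] // k, X.shape[1] // k
    return X[:n * k, :m * k].reshape(n, k, m, k).max(axis=(1, 3))

X = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(X))   # 2x2 matrix of block maxima
```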


3 Training Deep Neural Networks

The training of the networks is handled as a classical supervised machine learning problem: a labelled data set is used, in this case meaning that the retention times are known. A portion of the data is set aside as a test set, whose observed values are compared with the predicted ones to evaluate performance.

Training of neural networks is almost always done with gradient-based algorithms, which require a cost function to be specified. This section describes the cost function used, the training algorithm, and a number of techniques used to improve training.

3.1 Cost Function

As for any optimisation method, a measure quantifying the difference between the predicted value and the target has to be defined. Minimizing this cost function is the aim of the training process. The retention time problem is a regression problem, and consequently the mean squared error is a natural choice of cost function:

c = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ,

where y_i is the known target and \hat{y}_i is the predicted value, with the index i specifying the data point in the data set of n points.

As a measure to avoid overfitting, a regularisation term is added to the cost function, by the same principle as in ridge regression:

c = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + λ \left( \sum_{l=1}^{L-1} \sum_{j=1}^{k} \sum_{p=1}^{m} \big(w^l_{jp}\big)^2 + \sum_{p=1}^{m} \big(w^L_p\big)^2 \right) ,    (11)

where the sum of the squares of all weights in the network is added as a penalty term. Note that these indices are adapted to the network example given in Section 2.1; it is, for example, not necessary for all hidden layers to have the same number of neurons k. The value of the regularisation parameter λ determines the extent to which the regularisation term influences the training.
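In code, Equation 11 is the mean squared error plus λ times the sum of all squared weights. A minimal NumPy sketch, treating the network's weight arrays as a list:

```python
import numpy as np

def regularised_cost(y, y_hat, weights, lam):
    """Mean squared error with an L2 (ridge) penalty, as in Equation 11.

    y, y_hat : arrays of targets and predictions
    weights  : list of the network's weight arrays, one per layer
    lam      : the regularisation parameter lambda
    """
    mse = np.mean((y - y_hat) ** 2)
    penalty = sum(np.sum(W ** 2) for W in weights)
    return mse + lam * penalty
```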

3.2 Gradient Descent and Stochastic Gradient Descent

As an initial step in training a neural network, all weights in the network have to be set to a starting value. This is usually done by randomly initialising the weights to values close to zero. With this random set of parameter values the network can perform predictions for given input data. After this step, however, we need a technique for knowing how to change the weights in order to improve these predictions. Gradient descent is


a simple and commonly used technique for iteratively completing this task. It is based on calculating how small perturbations of the parameter values in each layer will affect the output, and it changes the weights in the direction in weight space that reduces the cost function the most.

For each weight this computation is performed:

w^l_{jp,t+1} ← w^l_{jp,t} - η \frac{∂c}{∂w^l_{jp,t}} ,    (12)

where t is the index of the training iteration and η specifies the learning rate. This straightforward expression becomes more complicated when considering how c depends on w^l_{jp}. As the layers are connected in a chain, Equation 12 requires knowledge of how Z_2 affects Z_3 and how Z_3 affects Z_4. The next section describes how the resulting chains of derivatives are dealt with in practice.

Using the basic gradient descent approach entails passing through the entire training set in each iteration and then updating the weights according to the average gradients. This method is today referred to as batch gradient descent [16]. A more commonly used approach is stochastic gradient descent (SGD), which in each iteration randomly selects one or a small number of data points (a mini batch) on which to update the weights. The main advantage of this method is that it speeds up the learning.

An additional technique often used to further speed up learning is momentum. As indicated by the name, momentum favours subsequent learning steps that go in the same direction in weight space. Denoting the change in the weights at iteration t+1 by Δw_{t+1}, gradient descent with momentum is given by:

Δw_{t+1} = η \frac{∂c}{∂w_t} + m Δw_t ,    (13)

where m is the momentum constant. This means that the weights are updated according to a moving average of gradients, not just the most recent one. This smooths the training process by decreasing oscillations and effectively increasing the learning rate in directions of low curvature. It thereby has the potential to increase the learning speed [16].
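Putting Equations 12 and 13 together, one SGD-with-momentum step per weight array might look as follows. This is a sketch only: it assumes the gradient of the cost on the current mini batch is already available (in practice it is supplied by backpropagation, Section 3.3), and the learning rate and momentum constant shown are arbitrary examples rather than the values used in this project.

```python
def sgd_momentum_step(W, velocity, grad, eta=0.01, m=0.9):
    """One update following Equations 12-13.

    W        : weight array to update
    velocity : accumulated weight change (Delta w) from the previous iteration
    grad     : gradient of the cost w.r.t. W on the current mini batch
    """
    velocity = eta * grad + m * velocity   # Delta w_{t+1}, a moving average of gradients
    W = W - velocity                       # w_{t+1} = w_t - Delta w_{t+1}
    return W, velocity
```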

3.3 Backpropagation

The calculation of the derivative in Equation 12 often requires a summation over several chains of derivatives. The number of weights for which the gradients have to be computed is also generally very large. These factors create the need for an efficient algorithm to compute these derivatives. Backpropagation, an abbreviation of "backward propagation of errors", is by far the most commonly used algorithm for this purpose. Here follows an explanation of backpropagation, using the network example


introduced in Section 2.1, with k = 2. The layer-wise, modular approach of the explanation is inspired by de Freitas' lectures [4]. This approach has been chosen as it is consistent with how the training algorithm is generally implemented.

To start, let us consider the cost function as a final layer of the network described in Equations 3-6. This gives the additional layer Z_5:

c(W) = Z_5 = \sum_{i=1}^{n} \big[y_i - Z_{4,i}\big]^2 = \sum_{i=1}^{n} \big[y_i - W^3 φ(W^2 φ(W^1 Z_1))\big]^2 .

For k = 2 the dependence structure of the cost function c can then be described as follows:

c(W) = Z_5\Big(Z_4\Big[W^3,\ Z_{31}\big(W^2,\ Z_{21}\{W^1_1, Z_1\},\ Z_{22}\{W^1_2, Z_1\}\big),    (14)
                        Z_{32}\big(W^2,\ Z_{21}\{W^1_1, Z_1\},\ Z_{22}\{W^1_2, Z_1\}\big)\Big]\Big) .    (15)

When implementing the network it is naturally the output Z_4 that we are interested in; however, for the purposes of presenting backpropagation, considering the cost function as the layer Z_5 simplifies the calculations. As an example, the derivative of the cost function with respect to the parameters W^1_1 (the weights of the first neuron in the first layer) is:

\frac{∂c(W)}{∂W^1_1} = \frac{∂Z_5}{∂Z_4}\frac{∂Z_4}{∂Z_{31}}\frac{∂Z_{31}}{∂Z_{21}}\frac{∂Z_{21}}{∂W^1_1} + \frac{∂Z_5}{∂Z_4}\frac{∂Z_4}{∂Z_{32}}\frac{∂Z_{32}}{∂Z_{21}}\frac{∂Z_{21}}{∂W^1_1} .    (16)

Even for this simple network, Equation 16 requires the calculation of several partial derivatives. An important fact to note is that \frac{∂Z_{32}}{∂Z_{21}} is used twice. The partial derivative of c with respect to W^1_2 will also, to a large extent, consist of the same partial derivatives. The backpropagation algorithm allows the derivatives to be computed recursively, without having to redo the computation of any single derivative.


Figure 4: Three messages pass through each layer L in the network: a forward pass of function values, a backward pass of derivatives, and, in the case of layers with parameters, the output of gradients with respect to those parameters.

The workings of the algorithm can easily be considered for full layers. Figure 4 shows the flow of information through a layer. For each layer, three separate quantities have to be determined: the function the layer computes, the derivative of the layer with respect to its inputs, and finally the derivatives with respect to its parameters.

The function values are passed from layer L to layer L+1 through the network. This is referred to as the forward pass:

Z^{L+1} = f(Z^L) .

This is the functionality required to compute a prediction from an input. The partial derivatives with respect to the input are passed from layer L+1 to layer L, backwards through the network. This is referred to as the backward pass:

δ^L_i = \frac{∂c(W)}{∂Z^L_i} = \sum_j \frac{∂c(W)}{∂Z^{L+1}_j} \frac{∂Z^{L+1}_j}{∂Z^L_i} = \sum_j δ^{L+1}_j \frac{∂Z^{L+1}_j}{∂Z^L_i} ,

where i is the index of the unit in layer L and the index j specifies the output of the layer. The upper bounds of the summations change depending on the sizes of the layers, as follows:

L = 2 : i ∈ [1,m], j ∈ [1, k]

L = 3 : i ∈ [1, k], j ∈ [1, k]

L = 4 : i ∈ [1, k], j = 1.

Finally, the derivatives with respect to the parameters are computed, as in the example of Equation 16:

\frac{∂c}{∂W^L_i} = \sum_j \frac{∂c}{∂Z^{L+1}_j} \frac{∂Z^{L+1}_j}{∂W^L_i} = \sum_j δ^{L+1}_j \frac{∂Z^{L+1}_j}{∂W^L_i} .

With each layer in the network performing these three functions, all the information required to apply SGD is obtained in an efficient way for each iteration of the training. There is also today a wide range of software that offers implementations of training through backpropagation, hence these derivatives need not be computed by hand [14].
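The modular, "three messages" view of Figure 4 can be sketched as a small layer class. The sketch below is a single-sample fully connected layer with a sigmoid activation, for illustration only; it is not the thesis implementation, which relies on Lasagne/Theano to generate these derivatives automatically.

```python
import numpy as np

class DenseLayer:
    """A fully connected layer as a module with the three functions of Figure 4:
    a forward pass, a backward pass of deltas, and gradients w.r.t. its weights."""

    def __init__(self, n_in, n_out, rng):
        # +1 column for the bias term, as in Section 2.1
        self.W = rng.normal(scale=0.1, size=(n_out, n_in + 1))

    def forward(self, Z):
        self.Z_in = np.concatenate(([1.0], Z))                   # prepend the bias unit
        self.Z_out = 1.0 / (1.0 + np.exp(-self.W @ self.Z_in))   # sigmoid activation
        return self.Z_out

    def backward(self, delta_out):
        # derivative of the sigmoid output w.r.t. its pre-activation
        d_act = self.Z_out * (1.0 - self.Z_out)
        # gradients with respect to the parameters, used by the SGD update
        self.dW = np.outer(delta_out * d_act, self.Z_in)
        # delta passed backwards to the previous layer (bias column dropped)
        return self.W[:, 1:].T @ (delta_out * d_act)
```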

3.4 Dropout

Ensemble learning, the technique of combining several models, is frequently used with different machine learning methods to improve performance and has been proven to significantly reduce variance. However, in the case of deep neural networks it often becomes too computationally expensive to train several networks [12]. Using dropout, first described in [9], is a computationally inexpensive solution that produces effects similar to ensemble learning.

The dropout layer can be viewed as a filtering layer. The output Z^L_j, for each unit j and each hidden layer L, is set to 0 with a probability p (p = 0.5 is usually given as a default). Thus, the network samples a different architecture for each input. This produces a more robust solution, less prone to overfitting, by reducing complex co-adaptations in the network: neurons cannot rely on the presence of others to the same extent [12].


4 Data

The data used in this project has been obtained from LC/MS experiments on a yeast sample. The retention time has been recorded and the observed spectrum has been matched to a peptide. Each peptide is recorded as a sequence of letters, each letter corresponding to an amino acid. The retention time of each peptide is recorded in minutes.

The data set used for the training and evaluation of the models is a collection of results from five runs with the same setup, each containing approximately 14 000 peptides. The running time of the experiment was approximately 263 minutes for each run. There is a considerable overlap of peptides between the runs, which reduces the number of peptides that can be used, as there can be no duplicates between the training and test sets. In total the data set contains 24 953 unique peptides, 4 000 of which were set aside as a test set for the final evaluation of the models.

There are several sources of error in the data sets that are important to consider when evaluating the performance of the models. As previously discussed, there is an element of uncertainty when matching observed spectra to peptides. The data sets used do, however, have a very low false discovery rate: all peptide identifications with a posterior error probability higher than 1% have been removed from the set. Nevertheless, we still expect some misclassified peptides in the set. This is one factor that can be expected to decrease the accuracy of the predictions.

Another issue to consider is in-source fragmentation. During the ionisation process prior to the MS step of the analysis, peptides sometimes break into smaller parts. This means that the peptides analysed in the MS are not the same ones that eluted from the chromatography column [19]. Such smaller peptides have generally been removed from the data set; however, there may still be occurrences.

Analysing the retention times of the peptides identified in all five runs gives an indication of the extent to which these sources of error may affect the results. Figure 5 shows the retention times of the 4866 peptides found in all five runs of the experiment. As expected, the majority of the peptides are aligned between the runs, although with a slight shift. However, the peaks show that some peptides are reported with larger variations in retention time. If we classify outliers as peptides with a retention time that deviates by 10 min or more from the average of the other four runs, the approximate percentage of incorrectly classified peptides is 0.25%, thus lower than the FDR. The root mean square error (Equation 17) of the duplicates shown in Figure 5 is 2.5 min. This can be viewed as the minimum achievable error level.
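The 10-minute outlier criterion can be computed directly once the retention times of the shared peptides are arranged as one row per peptide and one column per run; the sketch below assumes that data layout, which is not the format of the raw files.

```python
import numpy as np

def outlier_fraction(rt, threshold=10.0):
    """rt: array of shape (n_peptides, 5) with the retention times of peptides
    found in all five runs. A measurement is flagged if it deviates by
    `threshold` minutes or more from the mean of the other four runs."""
    n_runs = rt.shape[1]
    totals = rt.sum(axis=1, keepdims=True)
    mean_of_others = (totals - rt) / (n_runs - 1)     # leave-one-out means
    flagged = np.abs(rt - mean_of_others) >= threshold
    return flagged.any(axis=1).mean()                 # fraction of flagged peptides
```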


Figure 5: The retention times of the 4866 peptides that occur in all five runs of the experiment. The peptides have been sorted according to their retention time in the first run, T1.

Perhaps the opposite benchmark to consider is the performance of the null model. For the training data set, predicting all retention times as the sample mean gives an RMSE of 61.2 min. We thus expect the model to have an error higher than 2.5 min and, after successful learning, a test error considerably lower than 61.2 min.

4.1 Data Representations

The performance of a statistical learning model is always heavily dependent on which features, or which more general data representation, it is applied to. Initial tests with amino acid frequency and pair frequency as features were performed. In the case of amino acid frequency the input vector has the dimensions x ∈ R^{20}, one element for each amino acid. In the case of pair frequency, x ∈ R^{210}, one element for each possible amino acid pair. Further details on these representations are found in Appendix B.

However, as the aim of the project was to utilize neural networks' ability to learn representations of the data, a more general form was sought. A matrix representation was therefore applied. For each peptide used to train or evaluate the model, a matrix X ∈ R^{20×k} was constructed, where the 20 rows each represent an amino acid and the k columns represent the position in the sequence.

As a toy example, a peptide with the amino acid sequence EHSENEHKESGK would generate the following matrix:


      a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 ... ak
  H    0  1  0  0  0  0  1  0  0   0   0   0   0 ...  0
  R    0  0  0  0  0  0  0  0  0   0   0   0   0 ...  0
  K    0  0  0  0  0  0  0  1  0   0   0   1   0 ...  0
  N    0  0  0  0  1  0  0  0  0   0   0   0   0 ...  0
  Q    0  0  0  0  0  0  0  0  0   0   0   0   0 ...  0
  G    0  0  0  0  0  0  0  0  0   0   1   0   0 ...  0
  S    0  0  1  0  0  0  0  0  0   1   0   0   0 ...  0
  E    1  0  0  1  0  1  0  0  1   0   0   0   0 ...  0

This representation still leaves two degrees of freedom: the order of the amino acids and the choice of k. Ideally, we wish the rows to be ordered in such a way that amino acids which together in sequence strongly affect hydrophobicity are represented by adjacent rows, the reason being that important patterns in the matrix might then be found more easily with small convolutional filters. A linear combination of the amino acids' independent contributions offers a crude model for predicting a peptide's retention time, but the internal ranking of the elements of the ordinary least squares estimate, each corresponding to a different amino acid, gives an indication of their hydrophobic properties. A comparison of these weights was therefore used to order the amino acids, resulting in the following order,

H R K N Q G S C T E A D P V Y M I L F W .

A more detailed explanation of this sorting is found in Appendix B.1. This is not a perfect solution; however, it is likely better than simply using the alphabetical order, and furthermore we expect the neural networks to learn which patterns are important even if those patterns do not consist of direct diagonals.

The other degree of freedom, the choice of k, determines the maximum peptide length that the model can handle and also the number of columns of each peptide matrix, regardless of sequence length. A more elegant solution would be to have matrices with a variable number of columns. However, the number of methods that can handle input of variable length is unfortunately limited today, and for convolutional networks the size of the input has to be fixed. Figure 6 shows the distribution of peptide lengths in the main dataset used; the second dataset shows a similar distribution. As shown in the figure, the majority of the peptides have lengths in the range k = [7, 20]. Choosing k = 50 would allow the model to handle all peptides in this dataset; however, the model would then have a considerably larger number of parameters and could be expected to train more slowly. At the opposite extreme, selecting k = 10 would give a smaller model with few excess columns in each matrix, but the model could then only use approximately 30% of the data. With these factors in mind, k = 25 was selected as the standard maximum length, allowing for the inclusion of 97.2% of the dataset. It should be noted that


the exclusion of the longest peptides is likely to slightly improve the observed accuracy of the model. Previous studies have shown that retention time predictions are less accurate for longer peptides; removing them from the validation set thereby introduces a slight bias [27].

Figure 6: Distribution of peptide lengths in the training set.

4.2 Mirror Images

A technique often used in computer vision to extend a dataset is to use multiple copies of the same data with added modifications, such as rotating the images or slightly shifting pixel values [12]. One such technique that can be applied to this problem is using mirror images of the sequences, effectively doubling the dataset. The matrix for the mirror image of the peptide (EHSENEHKESGK) shown in the previous section takes the form:

      a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 ... amax
  H    0  0  0  0  0  1  0  0  0   0   1   0   0 ...  0
  R    0  0  0  0  0  0  0  0  0   0   0   0   0 ...  0
  K    1  0  0  0  1  0  0  0  0   0   0   0   0 ...  0
  N    0  0  0  0  0  0  0  1  0   0   0   0   0 ...  0
  Q    0  0  0  0  0  0  0  0  0   0   0   0   0 ...  0
  G    0  1  0  0  0  0  0  0  0   0   0   0   0 ...  0
  S    0  0  1  0  0  0  0  0  0   1   0   0   0 ...  0
  E    0  0  0  1  0  0  1  0  1   0   0   1   0 ...  0

As this is the same peptide, the target retention time during training is the same as for the original matrix.
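Both the matrix representation and its mirror image are straightforward to construct; a minimal sketch follows, using the amino acid row order of Section 4.1 and k = 25 as described above (peptides longer than k are excluded from the dataset, so no truncation is handled here).

```python
import numpy as np

AA_ORDER = "HRKNQGSCTEADPVYMILFW"          # row order from Section 4.1
ROW = {aa: i for i, aa in enumerate(AA_ORDER)}

def peptide_matrix(sequence, k=25):
    """One-hot 20 x k matrix: rows are amino acids, columns are sequence positions."""
    X = np.zeros((20, k))
    for position, aa in enumerate(sequence):
        X[ROW[aa], position] = 1.0
    return X

X = peptide_matrix("EHSENEHKESGK")
X_mirror = peptide_matrix("EHSENEHKESGK"[::-1])   # the mirror image of Section 4.2
```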


4.3 Scaling

Scaling is common practice in most machine learning methods. The target values, that is, the retention times of the peptides, are scaled prior to training. The training set is scaled to have approximately zero mean and unit variance, which is achieved by subtracting the mean of the set and then dividing by the sample standard deviation.
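In practice this amounts to storing the training-set mean and standard deviation so that scaled predictions can be transformed back to minutes; a short sketch (whether the sample standard deviation uses n or n−1 in the denominator is not specified in the thesis, so NumPy's default is used here):

```python
import numpy as np

def fit_scaler(y_train):
    """Return the mean and standard deviation of the training targets."""
    return y_train.mean(), y_train.std()

def scale(y, mean, std):
    return (y - mean) / std

def unscale(y_scaled, mean, std):
    """Map scaled predictions back to retention times in minutes."""
    return y_scaled * std + mean
```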


5 Evaluation

This section presents the measures used to evaluate the performance of the different models in this project.

A natural measure to consider is the root mean square error:

RMSE = \sqrt{\frac{1}{N} \sum_i (y_i - \hat{y}_i)^2} .    (17)

This metric is equivalent to the cost function, if the penalty term is ignored, and it is thus directly correlated with the quantity being minimised in the optimisation. However, following the discussion in Section 4, we expect approximately 1% of the peptides to be incorrectly classified, meaning that the retention time predictions for them may yield considerably larger errors. Such outliers can affect the RMSE. This is one reason why considering a confidence interval, rather than the average result, can also be useful.

A confidence interval with a significance level of α = 0.05 is employed to evaluate the models. This can be expected to exclude outliers resulting from, for example, incorrect identifications. The confidence interval can be estimated by an empirical nonparametric confidence interval based on the whole sample. This measure is also useful in practice when using the retention time predictor to confirm peptide identifications. For a certain time t, let:

S_t(y, \hat{y}) = \{ i : |y_i - \hat{y}_i| < t \} .    (18)

The confidence interval is then given by 2t*, where t* is the smallest t satisfying:

|S_t| = (1 - α) \times n .    (19)

With the conditions of Equations 18 and 19 fulfilled and given α = 0.05, 95% of the predicted retention times will be within t* time units of the actual values. It should be noted that this is a confidence interval for a whole data set; it is not unique to each data point. It is thus a measure of the uncertainty over the whole evaluated set, not for a specific given prediction. The confidence interval can be a useful measure when using the prediction results to verify a peptide match in a new experiment: given that the peptide match is correct, there is then a 95% probability of the predicted retention time falling inside this interval.

To make comparisons of results from different runs easier, it is advisable to scale the measurements. This can be done by considering the above-mentioned measures relative to the difference between the highest and the lowest retention time in the set. Ideally this scaling should be done with respect to a fixed set of peptides marking a known interval; however, using the retention times of the first and last peptides to elute gives an estimate of the running time used. Hence a relative confidence interval can be specified as

Δt^{95\%}_r = \frac{2t^*}{\max(y) - \min(y)} ,


where max(y) and min(y) are the highest and lowest retention times observed in the training set.

A final measure, which has been considered in previous publications on retention time prediction, is the correlation coefficient between predicted and observed retention times [20]. The sample correlation is calculated as follows:

r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}\ \sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} ,    (20)

where \bar{y} is the sample mean:

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i .

These four metrics will be used in the presentation of the results in Section 8 and in the discussion that follows in Section 9.
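All four metrics can be computed from the vectors of observed and predicted retention times. The sketch below treats t* of Equations 18-19 as the 95th percentile of the absolute errors, which is one way of realising the empirical interval; this estimator, like the function itself, is an assumption rather than the exact implementation used in the thesis.

```python
import numpy as np

def evaluate(y, y_hat, alpha=0.05):
    """RMSE, empirical 95% window 2t*, the window relative to the observed
    elution interval, and the sample correlation (Equations 17-20)."""
    err = y_hat - y
    rmse = np.sqrt(np.mean(err ** 2))
    t_star = np.quantile(np.abs(err), 1.0 - alpha)   # 95% of |errors| fall below t*
    window = 2.0 * t_star
    relative_window = window / (y.max() - y.min())
    corr = np.corrcoef(y, y_hat)[0, 1]
    return rmse, window, relative_window, corr
```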


6 Implementation

There are today many software packages for working with neural networks, such as Torch, Theano, and packages for MATLAB. I have chosen to use Lasagne, a lightweight library for building and training neural networks in Theano [14]. Lasagne is an open-source project started in 2014 that offers implementations of different network structures and also includes optimisation packages. Lasagne enables the use of a GPU, through the CUDA toolkit, during training, considerably improving the speed of learning.

All code for the project has been written in Python.
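As an illustration of what this workflow looks like, a minimal Lasagne/Theano sketch of a small convolutional regression network follows. The layer sizes, filter counts and optimiser settings below are placeholders rather than the values used in this project, and the snippet assumes the Lasagne and Theano APIs as they were around 2016.

```python
import theano
import theano.tensor as T
import lasagne

# Symbolic variables: a batch of 20 x 25 peptide matrices and their retention times
input_var = T.tensor4('X')
target_var = T.vector('y')

# A small (unbranched) convolutional network; all sizes here are placeholders
net = lasagne.layers.InputLayer(shape=(None, 1, 20, 25), input_var=input_var)
net = lasagne.layers.Conv2DLayer(net, num_filters=16, filter_size=(3, 3),
                                 nonlinearity=lasagne.nonlinearities.leaky_rectify)
net = lasagne.layers.MaxPool2DLayer(net, pool_size=(2, 2))
net = lasagne.layers.DenseLayer(net, num_units=64,
                                nonlinearity=lasagne.nonlinearities.leaky_rectify)
net = lasagne.layers.DenseLayer(net, num_units=1, nonlinearity=None)  # linear output

prediction = lasagne.layers.get_output(net).flatten()
loss = lasagne.objectives.squared_error(prediction, target_var).mean()
params = lasagne.layers.get_all_params(net, trainable=True)
updates = lasagne.updates.nesterov_momentum(loss, params,
                                            learning_rate=0.01, momentum=0.9)

# One compiled Theano function performs a full SGD-with-momentum training step
train_step = theano.function([input_var, target_var], loss, updates=updates)
```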


7 Model Development

This section describes the development of a convolutional network taking the matrix representation of peptides, discussed in Section 4.1, as input. Models based on simpler feature vectors, developed at earlier stages of this project, will be presented in Section 8 for comparison, but will not be discussed in this section; details of these models are instead found in Appendix B.

The selection of artificial neural networks as the method to be applied to the task of retention time prediction still leaves a large number of choices to be made regarding the structure of the network and the values of the hyperparameters. In general, there are unfortunately not yet many definitive theoretical principles guiding the design of hidden units, although it is an extremely active area of research [8]. Recommendations found in the literature are often based on experimental findings, in essence trial and error. The model development, central to this project, has been based on common practices and a series of experiments. These experiments, evaluated using a validation set consisting of 3906 peptides, are presented in this section and discussed further in Section 9.

The aim of the model development is to determine a structure that can be successfully trained to accurately predict the retention times of peptides. This requires a model that is sufficiently complex to represent the problem, i.e. a network of sufficient size. In addition, the optimisation method has to manage to minimize the cost function sufficiently. The success of these two factors is manifested by a low training error. However, as the goal is to predict retention times for unknown peptides, a third criterion is introduced: the model has to generalise well to new data. This is achieved by employing methods that regularise the model, avoiding overfitting to the training set. We hence arrive at the classical trade-off in statistical learning between bias and variance [11]. A large, poorly regularised model will yield high variance and low bias, whilst a small model with strong regularisation will yield low variance yet high bias. The high variance manifests itself as strong sensitivity to the data used to train the model. The ideal solution is a trade-off between these two. For neural networks the most successful approach, judging by breakthroughs such as [12] and [25], appears to be constructing a large network and then strongly regularising it with convolutional layers, techniques such as dropout, and the use of a large training set.

Two additional factors that have to be considered are limitations on time and computing power. Increasing the network size can often be compensated for by also increasing the training set, thereby improving the regularisation. However, this brute-force approach inevitably increases the required computing power and the time needed to train the network [8]. Even with few limitations on time and access to supercomputers, faster learning requiring fewer resources can only be seen as a positive factor and perhaps also has the added benefit of being more comprehensible and more elegant.


For this project, however, training time has not been a central concern. All attempted network structures have been trained in less than 60 min, even when using the full training set.

7.1 Layer Architecture

As discussed in Section 4 the data can be represented in a 2D array form with limited loss of information. This fact and the benefits of convolutional networks presented in Section 2.3 favour the use of convolutional layers. The common practice for CNNs is to alternate convolutional layers with pooling layers for the first few layers of the network and to have layers of fully connected 'standard' neurons in the final, top layers. This general structure has been the starting point of the model development in this project.

As discussed in Section 4.1, mirror images of the peptide arrays can easily be created, effectively doubling the dataset, with the potential of improving the generalisation of the network. An alternative option, which has been the main focus of this project, is to give both mirror images as input to the network simultaneously. This is possible as they share the same target value. For this purpose a branched network was implemented, as illustrated in Figure 7. Two branches consisting of convolutional layers, followed by fully connected layers, were set up to each take one of the matrices as input. The output of these branches was then jointly fed through a series of fully connected layers before reaching the single linear output neuron. The results of this branched network will be compared to those of the standard unbranched network in Section 8. Advantages that can be expected are improved generalisation and a possible decrease in training time, compared to training the network on twice the number of data points.
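A sketch of how such a branched structure can be expressed in Lasagne is given below. The merge is done with a ConcatLayer before the joint fully connected layers; the layer widths and filter shapes are illustrative placeholders rather than the final values of the project.

```python
# Sketch of a two-branch network taking a peptide matrix and its mirror image
# as simultaneous inputs (layer sizes are illustrative placeholders).
import theano.tensor as T
from lasagne.layers import InputLayer, Conv2DLayer, DenseLayer, ConcatLayer
from lasagne.nonlinearities import leaky_rectify, linear

def build_branch(input_var):
    """One branch: convolutional layers followed by a fully connected layer."""
    l = InputLayer((None, 1, 20, 25), input_var=input_var)
    l = Conv2DLayer(l, num_filters=30, filter_size=(4, 2), nonlinearity=leaky_rectify)
    l = Conv2DLayer(l, num_filters=30, filter_size=(4, 4), nonlinearity=leaky_rectify)
    l = DenseLayer(l, num_units=30, nonlinearity=leaky_rectify)
    return l

x1 = T.tensor4('x1')  # peptide matrix
x2 = T.tensor4('x2')  # mirrored peptide matrix (same target retention time)

branch1 = build_branch(x1)
branch2 = build_branch(x2)

# Merge the two branches and finish with joint fully connected layers.
merged = ConcatLayer([branch1, branch2])
merged = DenseLayer(merged, num_units=15, nonlinearity=leaky_rectify)
output = DenseLayer(merged, num_units=1, nonlinearity=linear)
```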

The activation function to be used in the fully connected layers, as well as in the convolutional layers, also had to be determined. As presented in Section 2.2 there are definite advantages of using the leaky rectifier activation function and consequently the tests in this project focused on implementations involving this nonlinearity. However, results from using other activation functions will be presented as reference in Section 7.2, after the hyperparameter values have been set.

This concludes the discussion of the general structure being tested. What remains is determining the values of all hyperparameters such as the number of layers, convolution filter size and pooling layers. These will be discussed in the next section.


Figure 7: Structure of the branched network, where Xi1 and Xi2 are the two input matrices, representing the two mirror images of a peptide sequence. a) Convolutional and pooling layers. b) Fully connected leaky rectifier layers in each branch. c) Fully connected leaky rectifier layers merging the output of branches 1 and 2. d) Output layer with one linear output neuron. The final linear output layer is present in all network structures.

7.2 Selecting Hyperparameter Values

A summary of the hyperparameters to be determined regarding the structure of the network is given in Table 1, along with the range of values tested. Several factors make this task more challenging. Firstly, these parameters cannot be optimised individually; for example, the best width of the layers is likely to depend on the chosen number of layers and vice versa. This co-dependence naturally makes the process of selecting hyperparameter values more involved. Secondly, as there is a stochastic element to the training, results vary between trials. The average performance over a set of trials should therefore be considered, as well as the variation in performance, to give an indication of the stability of the model. Training multiple instances of each network structure becomes time consuming and the number of structures tested must therefore be limited.

The results for some of the initial test rounds are shown in Table 8 in Appendix A. These tests suggested that the use of convolution and pooling layers did not improve the performance of the model. Increasing the number of convolutional layers decreased the performance and one of the best performing networks was the network without convolutional layers altogether. As a consequence of these results, the next test round focused on one and two convolutional layers and on removing the pooling operations between layers. The structures tested in this round all had the same setup of fully connected layers: 3×30 neurons in each branch, followed by 1×15 neurons in the merged layer (one of the best performing structures from the initial tests). A selection of these results is summarised in Table 9 in Appendix A. In the majority of trials removing the pooling layers improved the performance of the network.
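The evaluation procedure used throughout these rounds can be summarised by the sketch below: each candidate structure is retrained several times and the mean and sample standard deviation of the validation RMSE are recorded. The functions build_network and train_network are hypothetical placeholders for the project code.

```python
# Sketch of the repeated-trial evaluation used to compare network structures.
# build_network and train_network are hypothetical placeholders.
import numpy as np

def evaluate_structure(structure, X_train, y_train, X_val, y_val, n_trials=10):
    rmses = []
    for trial in range(n_trials):
        network, predict_fn = build_network(structure)          # random initialisation
        train_network(network, X_train, y_train, n_epochs=50)   # stochastic training
        residuals = predict_fn(X_val) - y_val
        rmses.append(np.sqrt(np.mean(residuals ** 2)))
    # The mean RMSE indicates performance; the sample std indicates stability.
    return np.mean(rmses), np.std(rmses, ddof=1)
```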

The RMSE of the majority of the networks was in the range of 10-11 min, suggesting that the performance is not particularly sensitive to the choice of convolution filter size.


Hyperparameter | Tested Values | Notation in Figure 7
Number of conv layers | 1 - 4 | a)
No. conv filters per layer | 15, 30, 40 | a)
Filter size | 2×2 - 5×5 | a)
Pooling | Yes / No | a)
Number of ReLU layers in branches | 1 - 4 | b)
Width of ReLU layers in branches | 1 - 50 | b)
Number of final ReLU layers | 1 - 4 | c)
Width of final ReLU layers | 1 - 50 | c)

Table 1: Hyperparameters determining the structure of the network, listed with the range of values tested for each parameter.

However, to proceed, one structure was chosen: that with the best performance in Table 9. This network is presented in Table 2.

Net 11.1 was chosen as it had the lowest RMSE of the tested networks. Using Net 11.1 as a foundation, the effect of changing the number of fully connected layers and the width of these layers was then investigated more thoroughly. Figure 8 summarises the findings by plotting RMSE as a function of the number of layers and by plotting the average RMSE as a function of the width of the layers. Each structure was trained 10 times and the sample standard deviation among these 10 results has also been plotted.

Graph 8a shows that, up until 35 neurons per layer, the performance of the model improves when adding neurons to each layer. The training error continues to decrease slightly even after 35. The variation between model performances after re-training also decreases with the increase in neurons.

Graph 8b shows that the performance of the network is best for 2 - 4 layers. When more layers are added, both the test and training errors increase and greater variations in the results are observed. As depth is generally more challenging for optimisation methods than width, increasing the number of training epochs is one measure that might improve the performance of deeper networks. However, preliminary tests suggest that the networks in many cases start to overfit when more epochs are added.


Hyperparameter | Net 11.1 | Notation in Figure 7
Conv layer 1 | | a)
No. conv filters | 30 | a)
Filter size | 4x2 | a)
Pooling | No | a)
Conv layer 2 | | a)
No. conv filters | 30 | a)
Filter size | 4x4 | a)
Pooling | No | a)
Number of ReLU layers in branches | 3 | b)
Width of ReLU layers in branches | 30 | b)
Number of final ReLU layers | 1 | c)
Width of final ReLU layers | 15 | c)
Final training error | 8.2 (min) |
Test error | 9.7 (min) |

Table 2: The structure of the best performing network from the first two rounds of trials. The reported errors are the average errors of three separate trainings of the network, trained and evaluated on the same data.

Figure 8: Both plots show the results of tests performed on a network with the convolutional layer structure given in Table 2. (a) RMSE plotted as a function of the number of neurons per fully connected layer, in a network containing four fully connected layers. (b) RMSE as a function of the number of fully connected layers of width 30. The layers have alternately been added in the branches and in the final layers (corresponding to b) and c) in Figure 7). Both graphs give the mean RMSE (min) obtained over 10 trials for the training and test sets. The marked grey areas show the sample standard deviation of these 10 trials.


Hyperparameter | Net 11.1 | Net 11.2 | Net 11.3
Conv layer 1
No. conv filters | 30 | 30 | 40
Filter size | 4x2 | 4x2 | 4x2
Conv layer 2
No. conv filters | 30 | 30 | 40
Filter size | 4x4 | 4x4 | 4x4
Number of ReLU layers in branches | 3 | 2 | 2
Width of ReLU layers in branches | 30 | 40 | 50
Number of final ReLU layers | 1 | 2 | 1
Width of final ReLU layers | 15 | 40 | 50
Final training error | 8.1 | 6.2 | 6.0
Test error | 10.1 | 10.1 | 10.0
Test error STD | 0.9 | 0.5 | 0.3

Table 3: The results of the final trials, from which Net 11.3 was selected as the main model.

Given the results in Figure 8, a final round of tests was performed, altering Net 11.1 according to the findings. The results are presented in Table 3. Note that these results are more reliable than those presented in Table 2, as the average RMSE is calculated from the results of 10 trials, instead of three. The difference in RMSE on the test set between the three structures is negligible. However, Net 11.3 has a lower variation between re-trainings and also has the lowest training error, and was consequently chosen as the main model.

There is also a series of hyperparameters pertaining to the training of the networks that can be adjusted. These are summarised in Table 4. These values have been taken from the literature and different values have not been tested to any great extent, with the exception of the dropout level.

All reported results are for networks trained during 50 epochs. This limit was set heuristically, as many structures were found to overfit when running the training for 100 epochs or more. This limit on the number of epochs also has the advantage of decreasing the training time, enabling more tests to be performed.

Figure 9 clearly shows the advantage, for a network of this depth, of using the leaky rectifier (Equation 10) as activation function, rather than the traditional sigmoid function (Equation 8). The training error obtained with the sigmoid network is approximately that of the null model and the test error is slightly higher. This clearly shows that the training of the sigmoid network is unsuccessful.


Hyperparameter | Value
Learning rate, η | 0.01
Momentum, m | 0.9
λ | 0.001
No. of epochs | 50
Dropout level | 0.1 - 0.5

Table 4: Hyperparameters pertaining to the training of the network. η and m are explained in Equations 12 and 13. λ refers to the parameter determining the weight of the L2 regularisation used in the cost function, see Equation 11.
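A sketch of how these training hyperparameters enter the Theano/Lasagne training step is shown below, assuming a mean squared error cost with an added L2 penalty. Here `network` denotes the output layer of an already defined model with input variable input_var and target variable target_var, and iterate_minibatches is a placeholder helper.

```python
# Sketch of the training step with the hyperparameters of Table 4
# (learning rate 0.01, momentum 0.9, L2 weight 0.001). `network`, input_var,
# target_var and iterate_minibatches are assumed to be defined elsewhere.
import theano
import lasagne

prediction = lasagne.layers.get_output(network).flatten()
mse = lasagne.objectives.squared_error(prediction, target_var).mean()
l2_penalty = lasagne.regularization.regularize_network_params(
    network, lasagne.regularization.l2)
cost = mse + 0.001 * l2_penalty

params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.momentum(cost, params, learning_rate=0.01, momentum=0.9)
train_fn = theano.function([input_var, target_var], cost, updates=updates)

# Training loop: 50 epochs over mini-batches (batch size is an illustrative value).
for epoch in range(50):
    for X_batch, y_batch in iterate_minibatches(X_train, y_train, batch_size=128):
        train_fn(X_batch, y_batch)
```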

It should, however, be noted that this is not the best performing sigmoid network for this task. A smaller network performs better than the sigmoid version of Net 11.3, although a ReLU network is generally easier to train.
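For reference, the two activation functions compared in Figure 9 can be sketched as follows, assuming the leaky rectifier takes the common form f(x) = max(αx, x) with a small leakiness α; the exact value of α is given by Equation 10 in the text, and 0.01 below is only an illustrative choice.

```python
# Sketch of the two activation functions compared in Figure 9; alpha = 0.01 is
# an illustrative leakiness, not necessarily the value used in the project.
import numpy as np

def leaky_rectifier(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
```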

Figure 9: Comparison of the learning curves (RMSE in min plotted against training epoch, for the training and test sets) of Net 11.3 with sigmoid and leaky rectifier activation functions.

Finally, tests were performed for different levels of dropout applied to Net 11.3, the results of which are presented in Table 5. As seen in the table, none of the tested levels of dropout improved the performance of the final model. When the number of epochs was increased the results improved slightly; for 200 epochs a test error of 11.6 min was obtained with a 50% dropout level. Yet the performance was still better without dropout. Consequently, no dropout was applied in the training of the final model.


Dropout level | Validation Error | Validation Error STD
0.1 | 10.3 | 0.4
0.4 | 11.5 | 0.5
0.5 | 12.6 | 0.6
0.6 | 14.4 | 1.1

Table 5: Results achieved when applying different levels of dropout between the fully connected layers of Net 11.3.
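The dropout in these tests was inserted between the fully connected layers in the way sketched below; in Lasagne, dropout is disabled at prediction time by requesting a deterministic output. The layer sizes are illustrative and `l` is assumed to be the preceding layer of an already defined network.

```python
# Sketch of inserting dropout between fully connected layers
# (layer sizes illustrative; `l` is assumed to be the preceding layer).
from lasagne.layers import DenseLayer, DropoutLayer, get_output
from lasagne.nonlinearities import leaky_rectify

l = DenseLayer(l, num_units=50, nonlinearity=leaky_rectify)
l = DropoutLayer(l, p=0.5)        # p = probability of dropping a unit
l = DenseLayer(l, num_units=50, nonlinearity=leaky_rectify)
l = DropoutLayer(l, p=0.5)

# Dropout is active when computing the training output; for validation and
# testing a deterministic pass disables it:
# prediction = get_output(l, deterministic=True)
```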

7.3 Structure of Final Model, Net 11.3

Figure 10: Structure of the selected network, Net 11.3. Each branch takes one of the input matrices Xi1 and Xi2 and consists of two convolutional layers (40 filters of size 4x2, followed by 40 filters of size 4x4) and two fully connected ReLU layers of width 50. The branches are merged into a final ReLU layer of width 50, followed by a single linear output neuron. The performance of this network will be evaluated in Section 8.


8 Results

In this section the performance of the model Net 11.3, the development of which was described in Section 7, is presented. Its performance will also be compared to that of earlier stage models implemented for this project and to the performance of the ELUDE software [18].

Model Type | Training Error (min) | Test Error (min)
Null Model | 61.6 | 62.5
Linear Regression | 22.0 | 22.4
AA Frequency ANN | 15.5 | 15.9
AA Pair Frequency ANN | 7.7 | 14.4
Unbranched CNN | 10.4 | 11.7
Branched CNN (Net 11.3) | 6.0 | 10.0

Table 6: The performance of the different model types implemented. The errors are given in terms of RMSE and are the averages of 10 trials. Details on the different models can be viewed in Appendix B. The unbranched CNN has the same structure as the main model Net 11.3, with the exception of only having one branch.

Figure 11: Scatter plot of the predicted retention times (min) using Net 11.3 as a function of the observed retention times (min) for the 3906 peptides in the test set.


Model | ∆t95%r | Correlation | Training Data | Training Time
ELUDE | 17.4-26.6% | 0.92-0.97 | See [19], [20] | -
ELUDE | 21.1% | NA | Yeast 40 cm, 1850 samples | -
ELUDE | 17% | 0.98 | Main set, 20 000 samples | 28 h
Net 11.3 | 23% | 0.97 | Main set, 2000 samples | 40 s
Net 11.3 | 14% | 0.98 | Main set, 20 000 samples | 5 min

Table 7: Comparison of relative confidence intervals and sample correlations between Net 11.3 and ELUDE.
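For reference, the quantities in Table 7 can be computed as sketched below, under the assumption that ∆t95%r denotes the width of the narrowest window containing 95% of the prediction errors, expressed relative to the total running time of the liquid chromatography, and that the correlation is the sample (Pearson) correlation between predicted and observed retention times; these definitions are given in full in the metrics section of the report.

```python
# Sketch of the evaluation metrics, assuming delta_t95_rel is the narrowest
# window containing 95% of the prediction errors, relative to the LC run time.
import numpy as np

def rmse(y_pred, y_obs):
    return np.sqrt(np.mean((y_pred - y_obs) ** 2))

def delta_t95_rel(y_pred, y_obs, run_time):
    errors = np.sort(y_pred - y_obs)
    n = len(errors)
    k = int(np.ceil(0.95 * n))                 # number of errors inside the window
    # Width of the narrowest interval covering k consecutive (sorted) errors.
    widths = errors[k - 1:] - errors[:n - k + 1]
    return widths.min() / run_time

def correlation(y_pred, y_obs):
    return np.corrcoef(y_pred, y_obs)[0, 1]
```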

Figure 12: Histogram showing the deviance of the predicted retention time from the observed elution time for the test set, using Net 11.3 and ELUDE.


Figure 13: Average RMSE (min) of predictions with Net 11.3 and ELUDE, plotted as a function of peptide length (number of amino acids).


Figure 14: RMSE (min) for the training and test sets plotted as a function of the training set size used to train Net 11.3. All network instances were trained over 50 epochs. Adjusting the number of epochs to give a constant level of data exposure during training (e.g. setting the number of epochs to 500 for 2000 training peptides for an even comparison with 50 epochs for 20 000 training peptides) did not significantly change this result. In several cases the results were indeed poorer, due to overfitting of the networks. Training on 2000 peptides over 500 epochs still gave an average RMSE of 15 min.


Figure 15: Relative confidence interval, ∆t95%r, plotted as a function of the training set size, for a constant number of 50 epochs.


9 Discussion

9.1 Comparison of the Different Implemented Models

Table 6 summarises the performances of the different model types implemented, sorted in order of test error. The structures of the earlier stage models of the project are all presented in Appendix B. As expected, all models perform better than the null model. The performance of the linear regression model and the amino acid (AA) frequency ANN shows that retention times can be predicted with reasonable accuracy without having information about the complete structure.

Training the same network on a vector representation of the AA pair frequency noticeably decreases the training error, showing that the model is more adept at representing the problem. The test error is however only slightly lower, suggesting that the model fails to generalise as well as the CNNs. It may however well be possible to further reduce the test error slightly by fine-tuning parameter values. The unbranched CNN further improves the performance on the test set and finally, the best scoring model in Table 6 is the branched CNN Net 11.3, the development of which was described in Section 7. Figure 11 shows the predicted retention time using Net 11.3, plotted as a function of observed retention time.

The branched network was implemented in the hope of improving generalisation by effectively building two representations of each peptide through the branched layers, thereby improving the chance of the network picking up the important structural elements of the data and reducing the risk of overfitting. It is however slightly surprising that the branched structure produced such a low training error compared to the unbranched structure. It was also hoped that the training time would be reduced, compared to feeding the network with twice the amount of data (when adding the mirror images separately). However, the training time was only reduced by 17%. A more effective implementation of the network could probably reduce the training time of the branched structure.

It should be noted that more time was spent optimising the performance of the branched CNN structure than on any of the other models. Early evaluation of the model types, however, suggested that this structure had the greatest potential and, to limit the scope of the project, focus was therefore put on its development. It is unlikely that the ranking of the models would change even if more time was spent attempting to improve the performance of the other models.

9.2 Comparison of Net 11.3 and ELUDE

Table 7 shows how the performance of Net 11.3 compares to the benchmark model ELUDE. With a small training set ELUDE performs better than Net 11.3. However, as seen in Figure 15, the performance of Net 11.3 improves significantly when the training set is extended.


ELUDE is not designed to handle large training sets and does not show the same improvement in accuracy when more data is added. Increasing the training set from approximately 2000 peptides to 20 000 reduces the confidence interval from 21% to 17% for ELUDE. The corresponding change for Net 11.3 is from 23% to 14%. A comparison of the deviance from the observed retention times is found in Figure 12.

Figure 13 shows that ELUDE has a more even accuracy across different peptide lengths, compared to Net 11.3. Net 11.3 performs significantly better for shorter peptides than for longer ones. This explains the decrease in accuracy for peptides with higher observed retention times seen in Figure 11, as longer peptides generally have longer retention times [27]. The limited exposure of the network to longer peptides during training is presumably one reason for the trend seen in Figure 13. As seen in Figure 6, the number of long peptides in this dataset is limited. A potential solution would be simply up-sampling the longer peptides during training, although this method has not been explored.
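Such up-sampling has not been tested in this project, but could be sketched as follows: peptides above a chosen length threshold are simply duplicated in the training set so that the network sees them more often. The threshold and repeat factor below are arbitrary illustrative values.

```python
# Hypothetical sketch of up-sampling long peptides in the training set
# (threshold and repeat factor are arbitrary illustrative choices).
import numpy as np

def upsample_long_peptides(X, y, lengths, length_threshold=20, repeats=3):
    """Duplicate training examples whose peptide length exceeds the threshold."""
    long_idx = np.where(lengths > length_threshold)[0]
    X_extra = np.repeat(X[long_idx], repeats - 1, axis=0)
    y_extra = np.repeat(y[long_idx], repeats - 1)
    return np.concatenate([X, X_extra]), np.concatenate([y, y_extra])
```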

9.3 Comment on the Structure of Net 11.3

During the background research for this project no similar applications of convolutional neural networks were found. There is therefore no clear benchmark with which to compare the structure of this network. The ImageNet CNN presented in [12] contains five convolutional layers, alternated with pooling layers and followed by three fully connected layers. All layers had a larger width than the layers in Net 11.3. The number of fully connected layers found to be optimal is the same; however, the design of the convolutional layers differs.

The absence of pooling layers is perhaps one of the more noticeable features of this network, as they are very common practice. The difference in input dimensions is one explanation for this. The ImageNet input, for example, is a 256 × 256 pixel image where each pixel has an RGB value, whereas our input is a sparse 20×25 matrix. The pooling operation, which each time discards 75% of the information, is likely better suited for larger input dimensions and less sparse data, where the loss of some information throughout the network may even be desirable. It is also better suited for input that is less sensitive to translations, again such as images.

The lack of improvement gained from dropout is another noteworthy finding. Several sources have reported that introducing dropout of 50% between the fully connected layers improves the performance of neural networks. In the case of Net 11.3, however, the performance decreased as the dropout level increased, as reported in Table 5. The most probable explanation for this is that the network is not sufficiently large to benefit from the regularising effect of dropout. It is possible that dropout is more successful for still larger networks, trained over more epochs.


9.4 Future Work & Possible Improvements

The reproducibility of experimental retention times, estimated for this experimental setup to an RMSE of 2.5 min, is a factor of four below the RMSE obtained with Net 11.3. This suggests that it ought to be possible to construct a model that performs even better.

Adding more data to the training set improves the performance of Net 11.3, as seen in Figures 14 and 15. However, the performance appears to start converging at around 18 000 peptides, although unfortunately there is not more data to fully confirm this. Such a convergence could be due to the fact that the number of layers and the width of the layers have been optimised for this amount of data, as seen in Figures 8a and 8b. For a larger dataset, the minimum RMSE may well be obtained with a larger network than the one indicated in Figures 8a and 8b, and such a network might then improve upon this prediction accuracy.

In addition to increasing the training set, an endless amount of time can be spent on optimising the structure hyperparameters as well as testing different values of the learning parameters. Net 11.3 is likely to be suboptimal, even for this given training size. However, most trials during the model development only gave slight variations in performance and the optimal network of this general structure is therefore unlikely to perform considerably better than Net 11.3.

There are also a number of improvements regarding other model capabilities that can be considered. One desired improvement would be the construction of a model which allows for variable input sizes, without requiring a maximum length to be set. However, convolutional networks are as yet unfortunately limited in this respect. This would therefore require the use of some other method or some clever preprocessing of the data.

Another improvement would be having the predictor estimate the uncertainty of each predicted retention time separately, rather than the confidence interval being based on the entire sample. As seen in Figure 13, there is for instance a strong positive correlation between peptide length and RMSE. There are also other factors that are likely to affect the confidence of the estimate, such as the similarity of the peptide to peptides in the training set. The difficulty of the prediction thus depends on the peptide in question, and this fact could potentially be incorporated in the model.

An interesting additional test of this model would be exploring its capability of re-calibrating to new experimental conditions using only a small calibration dataset. This could likely be achieved by initialising the network with the current parameter values of Net 11.3, before starting training on the calibration dataset. This would, however, require access to an additional dataset, obtained from a separate experimental setup.
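Such a re-calibration could, in principle, be sketched with Lasagne's parameter utilities: the trained parameter values of Net 11.3 are stored and used to initialise an identically structured network, which is then trained for a few epochs on the small calibration set. This is a hypothetical sketch only; no such experiment was performed in this project, and build_net113 and train_network are placeholders.

```python
# Hypothetical warm-start sketch: initialise a new network with the parameter
# values of the trained Net 11.3 and continue training on a small calibration set.
import numpy as np
import lasagne

# Store the parameters of the trained network (output layer `trained_net`).
trained_values = lasagne.layers.get_all_param_values(trained_net)
np.savez('net113_params.npz', *trained_values)

# Later: rebuild an identical structure and load the stored values.
new_net = build_net113()                     # placeholder for the model definition
with np.load('net113_params.npz') as f:
    values = [f['arr_%d' % i] for i in range(len(f.files))]
lasagne.layers.set_all_param_values(new_net, values)

# Fine-tune on the calibration data for a small number of epochs.
# train_network(new_net, X_calibration, y_calibration, n_epochs=10)
```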


10 Conclusion

This thesis has applied convolutional neural networks to the task of predicting the liquid chromatographic retention times of peptides, with the aim of increasing the number of protein identifications in shotgun proteomics and improving targeted mass spectrometry experiments. The final network was a branched convolutional neural network consisting of two convolutional layers of 40 filters each, followed by three fully connected layers of width 50 and a final single linear output neuron. Convolutional networks are designed to handle data in array format and each amino acid sequence was therefore represented by a matrix X ∈ R^{20×25}, with each row corresponding to a type of amino acid and the columns representing the positions of the amino acids in the peptide. The two mirror images of this representation were fed simultaneously to the network, through one branch each, merging before the final two layers.

This model achieved an RMSE of 10.0 min, 3.8% of the total running time of the liquid chromatography, and a 95% confidence interval proportional to 14% of the running time, when trained on 20 000 unique peptides from a yeast sample. The software ELUDE, used as a benchmark, produced a confidence interval of 17% when trained and evaluated on the same data. ELUDE is still preferable when the amount of data is limited; however, for larger training sets the presented CNN performs better, particularly when taking into consideration the considerably shorter training time. An interesting extension to this work would be testing the CNN's capability to re-calibrate to a new experimental setup by initialising training on the calibration dataset with the current parameter values of the network.

Although the CNN presented in this thesis improves upon the performance of ELUDE when trained on a large dataset, there is still, theoretically, potential to develop a predictor with higher accuracy, given the reproducibility of results in liquid chromatography.


A Complete Results from First and Second Round of Hyperparameter Experiments

Hyperparameter | Net 1 | Net 2 | Net 3 | Net 4 | Net 5 | Net 6 | Net 7 | Net 8 | Net 9 | Net 10 | Net 11
No. of conv layers | 1 | 4 | 2 | 2 | 3 | 1 | 1 | 1 | 1 | 1 | 0
No. conv filters per layer | 15 | 15 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 0
No. of ReLU layers | 1 | 1 | 1 | 3 | 3 | 3 | 4 | 4 | 2 | 2 | 3
Width of ReLU layers | 15 | 15 | 15 | 15 | 15 | 30 | 30 | 15 | 30 | 30 | 30
No. of final ReLU layers | 1 | 1 | 1 | 3 | 3 | 1 | 1 | 1 | 2 | 1 | 2
Width of final ReLU layers | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15
Final training error | 9.1 | 12.6 | 9.5 | 9.4 | 11.0 | 7.9 | 7.9 | 8.7 | 7.5 | 7.6 | 7.0
Test error | 11.4 | 14.3 | 11.1 | 12.8 | 15.0 | 10.9 | 11.6 | 13.1 | 11.0 | 11.0 | 10.9

Table 8: A selection of the results from the initial hyperparameter tests. The general network structure can be viewed in Figure 7. The errors reported are the mean RMSE (min) from three trials. Convolution filters of size 3×3 were used for all networks, with 2×2 pooling layers after each convolutional layer.

Hyperparameter | Net 1 | Net 2 | Net 3 | Net 4 | Net 5 | Net 6 | Net 7 | Net 8 | Net 9 | Net 11 | Net 12 | Net 13 | Net 14 | Net 15
Filter size conv layer 1 | 3x3 | 2x2 | 3x3 | 2x2 | 2x2 | 2x2 | 4x2 | 4x2 | 4x2 | 4x2 | 5x2 | 5x5 | 5x5 | 3x3
Pooling | No | No | No | No | No | No | No | No | No | No | No | No | No | No
Filter size conv layer 2 | - | - | 3x3 | 2x2 | 3x3 | 3x3 | 2x2 | 4x4 | 2x2 | 4x4 | 4x4 | 5x5 | 5x5 | 5x5
Pooling | - | - | No | No | No | Yes | Yes | Yes | Yes | No | Yes | Yes | No | No
Final training error | 7.4 | 7.0 | 8.2 | 7.8 | 7.8 | 8.4 | 8.4 | 8.5 | 8.0 | 8.2 | 8.6 | 11.6 | 8.8 | 8.1
Test error | 10.3 | 10.4 | 11.6 | 10.7 | 10.9 | 11.0 | 11.1 | 9.8 | 11.6 | 9.7 | 10.2 | 12.8 | 11.5 | 11.0

Table 9: Tests on convolutional layers. All networks in the table have the same structure of ReLU layers: 3 layers of 30 neurons each in the branches, followed by one joint layer consisting of 15 neurons.


B Earlier Models

This section describes the models implemented in this project prior to the final branched structure, to which Section 7 is dedicated. The test and training errors of the models presented in this section are summarised in Table 6 in Section 8.

B.1 Linear Regression

The first two model types implemented have the same data representation. A vector x_i ∈ R^20 is constructed for each peptide i, where each element of x_i corresponds to the count of one of the amino acids:

A R N D C E Q G H I L K M F P S T W Y V,

in this given order. As a toy example, a peptide with the sequence EHSENEHKESGK will give the vector:

\[ x = (0,\ 0,\ 1_{\mathrm{N}},\ 0,\ 0,\ 4_{\mathrm{E}},\ 0,\ 1_{\mathrm{G}},\ 2_{\mathrm{H}},\ 0,\ 0,\ 2_{\mathrm{K}},\ 0,\ 0,\ 0,\ 2_{\mathrm{S}},\ 0,\ 0,\ 0,\ 0). \]

With a standard linear regression the retention time of a peptide can be estimated as:

\[ y_i = \beta^{T} x_i + \mathrm{bias}\,, \]

where β is the ordinary least squares estimate,

\[ \beta = \Big( \sum_{i=1}^{n_{tr}} x_i x_i^{T} \Big)^{-1} \Big( \sum_{i=1}^{n_{tr}} x_i y_i \Big), \]

estimated from all n_tr data points in the training set.

The weight vector β was also used to determine the ordering of the rows in the peptide matrix representation. The relative values of the elements of β indicate the extent to which each amino acid contributes to the hydrophobicity of the peptide. The amino acid with the highest corresponding value in β is represented by the first row in the matrix and the amino acid with the lowest value is subsequently represented by the last row.
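A sketch of this baseline is given below: each peptide sequence is converted to its amino acid count vector and the weights are obtained with ordinary least squares. The column ordering follows the amino acid order given above.

```python
# Sketch of the amino acid frequency baseline: count vectors + ordinary least squares.
import numpy as np

AA_ORDER = 'ARNDCEQGHILKMFPSTWYV'   # order used for the vector elements

def aa_count_vector(sequence):
    """Return the 20-dimensional amino acid count vector of a peptide."""
    return np.array([sequence.count(aa) for aa in AA_ORDER], dtype=float)

def fit_linear_model(sequences, retention_times):
    X = np.array([aa_count_vector(s) for s in sequences])
    X = np.hstack([X, np.ones((len(X), 1))])          # bias term
    # Ordinary least squares solution; lstsq is numerically safer than the
    # explicit normal-equation inverse written in the formula above.
    beta, *_ = np.linalg.lstsq(X, np.array(retention_times), rcond=None)
    return beta

# Example: aa_count_vector('EHSENEHKESGK') has 4 in the E position and 2 in H, K and S.
```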


Hyperparameter | AA Frequency | AA Pair Frequency
Number of ReLU layers | 2 | 2
Width of ReLU layers | 50 | 50
Final training error | 15.5 | 7.7
Test error | 15.9 | 14.4
Test error STD | 0.1 | 0.2
∆t95%r | 24% | 22%

Table 10: The structure of the unbranched ANN, presented for comparison in Table 6, taking either AA frequency or AA pair frequency as input.

B.2 AA Frequency & AA Pair Frequency ANN

An artificial neural network was also designed to take the vector x_i ∈ R^20 as input. From the initial trials, the network structure presented in Table 10 produced the most accurate predictions. This same network was also employed to predict retention times when given the amino acid pair frequency as input, rather than the singular AA count. This gives an input vector x_i ∈ R^210, as there are 210 possible unique combinations of 20 amino acids.
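One possible way of constructing this pair frequency vector is sketched below. It is assumed here that a "pair" refers to two adjacent residues counted without regard to order, which matches the stated dimensionality C(20,2) + 20 = 210; if the thesis intends a different pairing, only the counting loop would change.

```python
# Sketch of an unordered amino-acid pair (bigram) frequency vector of length 210.
# The definition of "pair" as adjacent, unordered residues is an assumption.
from itertools import combinations_with_replacement

AA_ORDER = 'ARNDCEQGHILKMFPSTWYV'
PAIRS = list(combinations_with_replacement(AA_ORDER, 2))   # 210 unordered pairs
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}

def aa_pair_vector(sequence):
    """Return the 210-dimensional vector of adjacent amino acid pair counts."""
    counts = [0] * len(PAIRS)
    for a, b in zip(sequence, sequence[1:]):
        key = tuple(sorted((a, b), key=AA_ORDER.index))
        counts[PAIR_INDEX[key]] += 1
    return counts
```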


Hyperparameter | Single Matrix Representation CNN
Conv layer 1
No. conv filters | 40
Filter size | 4x2
Conv layer 2
No. conv filters | 40
Filter size | 4x4
Number of ReLU layers | 3
Width of ReLU layers | 50
Final training error | 10.4
Test error | 11.7
Test error STD | 0.2
∆t95%r | 17%

Table 11: The structure of the unbranched CNN, presented for comparison in Table 6. This network has an approximate training time of 6 min, compared to the branched structure, which trains in 5 min.

B.3 Single Matrix Representation CNN

A series of convolutional networks of unbranched structure have been implemented and evaluated in this project. However, for the purpose of comparing with the branched network, a similar structure was chosen for the evaluation in Table 6. Furthermore, none of the other tested unbranched CNNs showed significantly better performance.

The network is presented in Table 11.


References

[1] Bengio Y, Courville A, Vincent P, 2013, 'Representation Learning: A Review and New Perspectives', IEEE Transactions on Pattern Analysis and Machine Intelligence 35, pp. 1798-1828.

[2] Chen JY, Lonardi S, 2009, 'Chapter 14: Computational Approaches to Peptide Retention Time Prediction', Biological Data Mining, pp. 337-347.

[3] Collobert R, Weston J, 2008, 'A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning', in International Conference on Machine Learning 2008, pp. 160-167.

[4] de Freitas N, Lecture notes 2015: 'Deep Learning', Department of Computer Science, Oxford University.

[5] Di Palma S, Hennrich ML, Heck AJR, Mohammed S, 2012, 'Recent advances in peptide separation by multidimensional liquid chromatography for proteome analysis', Journal of Proteomics 75(13), pp. 3791-3813.

[6] Glorot X, Bengio Y, 2010, 'Understanding the Difficulty of Training Deep Feedforward Neural Networks', International Conference on Artificial Intelligence and Statistics, pp. 249-256.

[7] Glorot X, Bordes A, Bengio Y, 2011, 'Deep sparse rectifier neural networks', in Proc. 14th International Conference on Artificial Intelligence and Statistics, pp. 315-323.

[8] Goodfellow I, Bengio Y, Courville A, 2016, Deep Learning, Book in preparation for MIT Press, http://www.deeplearningbook.org.

[9] Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR, 2012, 'Improving neural networks by preventing co-adaptation of feature detectors', CoRR, arxiv.org/pdf/1207.0580.

[10] Hornik K, Stinchcombe M, White H, 1989, 'Multilayer Feedforward Networks are Universal Approximators', Neural Networks 2, pp. 359-366.

[11] James G, Witten D, Hastie T, Tibshirani R, 2013, An Introduction to Statistical Learning, New York: Springer.

[12] Krizhevsky A, Sutskever I, Hinton GE, 2012, 'ImageNet Classification with Deep Convolutional Neural Networks', NIPS, pp. 1106-1114.

[13] Krokhin OV, Spicer V, 2009, 'Peptide retention standards and hydrophobicity indexing in reversed-phase high-performance liquid chromatography of peptides', Analytical Chemistry 81, pp. 9522-9530.


[14] Lasagne, http://lasagne.readthedocs.org/en/latest/index.html

[15] LeCun Y, et al., 1990, 'Handwritten digit recognition with a back-propagation network', in Proc. Advances in Neural Information Processing Systems, pp. 396-404.

[16] LeCun Y, Bottou L, Orr GB, Muller KR, 2012, 'Efficient BackProp', in Goos G, Hartmanis J, van Leeuwen J (eds), Neural Networks: Tricks of the Trade, 2nd Ed., pp. 9-48.

[17] LeCun Y, Bengio Y, Hinton G, 2015, 'Deep learning', Nature 521(7553), pp. 436-444.

[18] Meek JL, 1980, 'Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition', Proc Natl Acad Sci USA 77, pp. 1632-1636.

[19] Moruz L, Tomazela D, Kall L, 2010, 'Training, Selection, and Robust Calibration of Retention Time Models for Targeted Proteomics', Journal of Proteome Research 9(10), pp. 5209-5216.

[20] Moruz L, Staes A, Foster JM, Hatzou M, Timmerman E, Martens L, Kall L, 2012, 'Chromatographic retention time prediction for posttranslationally modified peptides', Proteomics 12(8), pp. 1151-1159.

[21] Moruz L, Kall L, 2016, 'Peptide Retention Time Prediction', Mass Spectrometry Reviews, pp. 1-9.

[22] Petritis K, Kangas LJ, Yan B, Monroe ME, Strittmatter EF, Qian WJ, et al., 2006, 'Improved peptide elution time prediction for reversed-phase liquid chromatography-MS by incorporating peptide sequence information', Analytical Chemistry 78(14), pp. 5026-5039.

[23] Pfeifer N, Leinenbach A, Huber C, Kohlbacher O, 2007, 'Statistical learning of peptide retention time behavior in chromatographic separations: A new kernel-based approach for computational proteomics', BMC Bioinformatics 8(468).

[24] Rosenblatt F, 1958, 'The perceptron: A probabilistic model for information storage and organization in the brain', Psychological Review 65(6), pp. 386-408.

[25] Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, Hassabis D, 2016, 'Mastering the game of Go with deep neural networks and tree search', Nature 529(7587), pp. 484-489.

[26] Theano Development Team, 2016, 'Theano: A Python framework for fast computation of mathematical expressions', arXiv e-prints, http://arxiv.org/abs/1605.02688.


[27] Zuvela P, et al., 2016, 'Exploiting non-linear relationships between retention time and molecular structure of peptides originating from proteomes and comparing three multivariate approaches', J Pharm Biomed Analysis, http://dx.doi.org/10.1016/j.jpba.2016.01.055.

