Networks in Multivariate Calibration



    2 Principle of neural networks

NNs stem from the field of artificial intelligence. An early motivation for developing NNs was to mimic some unique characteristics of the human brain, such as the ability to learn general mechanisms from presentation of a reduced set of examples, or to retrieve correct information from missing or distorted input data. NNs currently used in applied sciences have little in common with their human counterparts and the scope of their possible applications is more restricted. Research is still being carried out to establish links between neurobiology and artificial intelligence, but a description of NNs by analogy with biological concepts, although fascinating, can lead to an erroneous perception of NNs as mysterious intelligent machines. In the framework of multivariate calibration, we will consider NNs in a more pragmatic way and in a first approximation define them as non-parametric non-linear regression estimators.7 Non-parametric methods are those methods that are not based on the a priori assumption of a specific model form.

NNs allow one to estimate relationships between one or several input variables, called independent variables or descriptors, and one or several output variables, called dependent variables or responses. Information in an NN is distributed among multiple cells (nodes) and connections between the cells (weights). An example of an MLP is displayed in Fig. 1, for a model with four descriptors x_1, x_2, x_3, x_4 and a single response y.

The descriptors are presented to the NN at the input layer and then weighted by the connections w_ij^A between the input and hidden layer. Hidden layer nodes receive simultaneously weighted signals from input nodes and perform two tasks: a summation of the weighted inputs followed by a projection of this sum on a transfer function f_h, to produce an activation. In turn, hidden node activations are weighted by the connections w_j^B between the hidden and output layer and forwarded towards the nodes of the output layer. Similarly to hidden nodes, output nodes perform a summation of incoming weighted signals and project the sum on their specific transfer function f_o. In Fig. 1 a single response y is modelled and the output layer contains only one node. The output of this node is the estimated response ŷ, which can be expressed as

ŷ = f_o( Σ_{j=1}^{n_h} w_j^B f_h( Σ_{i=1}^{n_d} w_ij^A x_i + θ_j^A ) + θ^B )   (1)

where n_d and n_h are the number of input variables and hidden nodes, respectively.
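Eqn. (1) translates directly into code. The following NumPy sketch (function and variable names are our own illustration, not from the text) computes the output of a one-hidden-layer MLP for a single sample:

```python
import numpy as np

def mlp_forward(x, wA, thetaA, wB, thetaB, f_h=np.tanh, f_o=lambda s: s):
    """Output of the one-hidden-layer MLP of eqn. (1) for a single sample.

    x      : (nd,)     input descriptors
    wA     : (nh, nd)  input-to-hidden weights w_ij^A
    thetaA : (nh,)     hidden-layer biases theta_j^A
    wB     : (nh,)     hidden-to-output weights w_j^B
    thetaB : scalar    output bias theta^B
    """
    activations = f_h(wA @ x + thetaA)     # hidden nodes: weighted sum + transfer
    return f_o(wB @ activations + thetaB)  # output node: weighted sum + transfer

# Topology of Fig. 1: nd = 4 descriptors, here with nh = 3 hidden nodes
rng = np.random.default_rng(0)
x = np.array([0.2, -0.1, 0.4, 0.3])
wA = rng.normal(size=(3, 4))
thetaA = rng.normal(size=3)
wB = rng.normal(size=3)
thetaB = 0.5
y_hat = mlp_forward(x, wA, thetaA, wB, thetaB)
```

With all w_j^B set to zero the output reduces to f_o(θ^B), which makes the role of the output bias as an offset term easy to verify.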

Although NNs can be considered as non-parametric tools, the models that they yield are defined by sets of adjustable parameters determined by an algorithm, not a priori by the user. The adjustable parameters are the weights w_ij^A and w_j^B and the biases θ_j^A and θ^B, which act as offset terms by shifting the transfer functions horizontally. They are determined with an iterative procedure called training or learning. The adjustable parameters are first ascribed initial random values, then training starts and proceeds in two steps. First, a forward pass [Fig. 1(a)] is performed through the NN with a set of training samples with known experimental response y. At the end of the pass, the magnitude of the error between experimental and predicted responses is calculated and used to adjust all weights of the NN, in a back-propagation step [Fig. 1(b)]. These two steps constitute an iteration or epoch. A new forward pass is then performed with the training samples and the optimised parameters. The whole procedure is repeated until convergence is reached, that is, until a pre-specified or acceptably low error level is attained.

Training an NN is an optimisation problem, where one seeks the minimum of an error surface in a multi-dimensional space defined by the adjustable parameters. Such surfaces are characterised by the presence of several local minima, saddle points or canyons. It must be accepted that the NN will probably not find the absolute minimum of the error surface, but a local minimum relatively close to the absolute minimum and acceptable for the problem considered. The most popular algorithm to adjust weights during training is the gradient descent algorithm, based on the estimation of the first derivative of the error with respect to each weight.8
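The training cycle just described (forward pass, back-propagation of the error derivatives, weight update) can be sketched as follows; the toy data set, topology, learning rate and number of epochs are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy non-linear calibration data: one descriptor x, one response y = x**2
X = np.linspace(-1.0, 1.0, 30).reshape(-1, 1)
y = X[:, 0] ** 2

nd, nh = 1, 4                               # descriptors and hidden nodes
wA = rng.normal(scale=0.5, size=(nh, nd))   # input-to-hidden weights w_ij^A
thA = np.zeros(nh)                          # hidden biases theta_j^A
wB = rng.normal(scale=0.5, size=nh)         # hidden-to-output weights w_j^B
thB = 0.0                                   # output bias theta^B
lr = 0.05                                   # learning rate

def mse():
    H = np.tanh(X @ wA.T + thA)
    return float(np.mean((H @ wB + thB - y) ** 2))

err_start = mse()
for epoch in range(500):
    # forward pass: hidden activations, then a linear output node
    H = np.tanh(X @ wA.T + thA)                  # (n, nh)
    e = H @ wB + thB - y                         # prediction errors
    # back-propagation: first derivatives of the mean squared error
    grad_wB = H.T @ e / len(y)
    grad_thB = e.mean()
    delta = (e[:, None] * wB) * (1.0 - H ** 2)   # error propagated to hidden nodes
    grad_wA = delta.T @ X / len(y)
    grad_thA = delta.mean(axis=0)
    # gradient-descent update of all adjustable parameters
    wB -= lr * grad_wB; thB -= lr * grad_thB
    wA -= lr * grad_wA; thA -= lr * grad_thA
err_end = mse()
```

After training, the error err_end is lower than the starting error; in practice one would stop on a pre-specified error level or on a monitoring set rather than on a fixed epoch count.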

The most important feature of NNs applied to regression is that they are universal approximators: they can fit any continuous function defined on a compact domain (a domain defined by bounded inputs) to a pre-defined arbitrary degree of accuracy.9 We will now see why this characteristic can be particularly attractive in analytical chemistry.

    3 Neural networks in multivariate calibration

    3.1 When to use neural networks

For analytical chemists, a calibration model relates a series of instrumental measurements to the concentration or some physico-chemical properties of one or several target analytes.10

NNs can be used to build empirical multivariate calibration models of the form Y = F(X) + e. We will only consider inverse calibration models, for which X designates a matrix of analytical measurements performed on a series of n samples. For a given sample, measurements are described by a set of descriptors x_i, for instance, absorbance values at a given set of wavelengths. Y is a vector or a matrix containing sample responses, for instance, the concentrations of a target analyte in a set of mixtures. Calibration sample responses are often determined experimentally with reference methods such as the wet chemistry Kjeldahl

    Fig. 1 Feed-forward NN training: a, forward pass; b, error back-propagation.

158R Analyst, 1998, 123, 157R–178R


Although they are linear methods, MLR, PCR or PLS can be used for the modelling of some specific types of non-linear data. If the form of the non-linear relationship between the response and the descriptors is known, a model can be linearised by taking the appropriate transform of the original variables,26 or by adding higher order and cross-terms to the regression equation. In practice, the number of situations where these approaches are successful is limited, mainly because the exact form of the non-linear relationships is not known a priori and the number of calibration samples available is not sufficient to fit a complex model with a large number of cross-terms. It is also known to PCR and PLS practitioners that in some cases these methods can accommodate non-linear relationships by using higher order components to correct for partial non-linearities.27 However, there is a risk of introducing a significant amount of irrelevant information into the model.

3.2.2 Non-linear methods. Non-linear variants of PCR or PLS also exist (polynomial PCR,28 quadratic PLS29). Their main limitation is that they are based on the assumption that a simple (e.g., quadratic) relationship exists between the response modelled and the components. This assumption is sometimes violated, since components are already linear combinations of original variables.30 Locally weighted regression (LWR) is based on the decomposition of a global non-linear model into a series of local linear PLS or PCR models. It was found to perform well in multivariate calibration, especially on clustered data sets.31,32 However, in LWR a data set cannot be described with a unique set of components and loadings, since each sample is fitted with a local model built with its nearest neighbours only. One must also accept the risk that local model parameters are less stable than global model parameters, since they are estimated with a reduced set of objects.
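The principle of LWR can be sketched as follows; for brevity the local models here are plain least-squares fits on the k nearest neighbours rather than the local PLS or PCR models of the cited work, and all names are our own illustration:

```python
import numpy as np

def lwr_predict(X_train, y_train, x_new, k=5):
    """Predict one new sample with a local linear model fitted by least
    squares on its k nearest calibration samples (plain local MLR here,
    for brevity, instead of local PLS or PCR)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    d = np.linalg.norm(X_train - x_new, axis=1)   # distances to the new sample
    idx = np.argsort(d)[:k]                       # k nearest neighbours
    Xl = np.c_[np.ones(k), X_train[idx]]          # local design matrix + intercept
    b, *_ = np.linalg.lstsq(Xl, y_train[idx], rcond=None)
    return float(np.r_[1.0, np.ravel(x_new)] @ b)

# Globally non-linear, locally almost linear relationship:
X = np.linspace(0.0, 3.0, 40).reshape(-1, 1)
y = np.sin(X[:, 0])
y_hat = lwr_predict(X, y, np.array([1.5]), k=7)   # close to sin(1.5)
```

Note that each prediction fits its own model, which is exactly why no unique set of components and loadings describes the whole data set.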

Other techniques exist for non-linear regression but they are not yet as popular as NNs and the above-mentioned techniques. A review of non-parametric non-linear regression methods [alternating conditional expectations (ACE), smooth multiple additive regression technique (SMART), non-linear partial least squares (NLPLS), classification and regression trees (CART), multivariate adaptive regression splines (MARS) and spline partial least squares (SPL-PLS)] can be found in the report of Sekulic et al.33 and in Frank's tutorial.34 These methods can perform well on non-linear data but are computationally more complex than linear methods and share with NNs the limitation of being prone to overfitting. Their performance also depends heavily on the amount and quality of data available.34

    3.3 Advantages and limitations of neural networks

3.3.1 Flexibility of neural networks. We have seen that NNs are not the only tools to handle non-linear multivariate data. However, their flexibility is often a decisive asset compared with parametric techniques that require the assumption of a specific hard model form. Hard models cannot be developed with NIR data owing to the significant overlap of combination and overtone bands in the spectra. Other types of analytical data

Table 1 Examples of application of NNs to multivariate calibration

Property modelled | Descriptors | Ref.
Alditols in binary mixtures (%) | 1H NMR spectra | 18
Apparent metabolic energy of barley | Measured physical and chemical characteristics of barley | 60
Components in simulated binary and ternary mixtures | UV/VIS spectra with simulated instrumental perturbations | 12
Active ingredients in drugs | UV/VIS spectra |
Components in simulated binary and ternary mixtures | Absorption spectra with simulated non-linear effects | 30, 67
Components in rhodamine mixtures | UV/VIS spectra |
Components in simulated binary mixtures | UV/VIS spectra | 19
Active ingredients in drugs | UV/VIS spectra |
Protein in wheat | Near-infrared spectra |
[H2O] in meat | Near-infrared spectra | 25
Flex modulus of polymers | Near-infrared spectra |
Ethanol in mixtures containing latex | Near-infrared spectra | 24
Fat in pork meat | Near-infrared spectra |
Mineral charge in polymer | Near-infrared spectra | 70
Gasoline octane number | Near-infrared spectra |
[KOH] in polyether polyols | Near-infrared spectra |
Constituents in paper coatings | Near-infrared spectra | 64, 65
[OH], [NH], grind size in cereals | Simulated near-infrared spectra | 35
Methanol in water mixtures | Near-infrared spectra | 41
Composition of organic extract | Near-infrared spectra |
Hydroxyl in cellulose esters (%) | Near-infrared spectra |
Property of polymer pellets | Near-infrared spectra |
Solvents in aqueous process stream | Near-infrared spectra |
Aromaticity of brown coals | Fourier transform infrared spectra | 72
Colour change in emulsion paints | Measured concentration of oxide ingredients | 61
RNA, DNA or lysozyme in binary mixtures containing glycogen | Pyrolysis mass spectra | 16
Bacteria in ternary mixtures | Pyrolysis mass spectra |
Adulteration of cows' milk with goats' or ewes' milk | Pyrolysis mass spectra | 17
Penicillin and buffer ion concentrations in solutions with different buffer ion concentrations | Measurement signals of enzyme field effect transistor flow injection analysis | 96
Urea and glucose in solutions at different pH | Measurement signals of enzyme field effect transistor flow injection analysis | 23
[SO2] and relative humidity of water vapour in sample gas | Frequency response of a piezoelectric crystal gas sensor | 97
Cu/Zn in simulated two-component system with formation of intermetallic compounds | Anodic stripping voltammograms | 98
Cu/Pb/Cd/Zn in experimental four-component system | Anodic stripping voltammograms |
Ionic concentrations in mixtures | Measurements from ion-selective electrode arrays | 99
Characteristics of the physical structure of polymer yarns | Parameters describing mechanical properties of the yarns | 38
Metals in Fe/Ni/Cr systems | X-ray fluorescence spectra | 86
Fe/Ni in thin films | X-ray fluorescence spectra |
Gasoline octane number | Gas chromatograms | 14


(e.g., UV/VIS spectra) are more easily interpretable from the spectroscopic point of view, but the a priori specification of a hard model rarely incorporates the non-linear effects that may occur in practice. Non-linearity in a data set can be detected with graphical methods but identification of its source is more challenging and sometimes impossible. Thanks to their ability to learn and derive X–Y relationships from the presentation of a set of training samples, NNs avoid the time-consuming and possibly expensive task of hard model identification. In addition, the fundamental principle of distributing information among several weights and nodes renders the NN model robust with respect to random noise in the input data (as already explained) and allows one to have several NNs with different topologies converging to qualitatively equivalent results.

If one is not careful, however, a drawback of the flexibility of NNs is their tendency to overfit calibration data and the resulting lack of generalisation ability, that is, the capability of a model to produce a valid estimate of the correct output when a new input is presented to the NN. Also, the flexibility of NNs can lead to unreliable results in situations of extrapolation. Although NNs proved to perform better than PLS on extrapolated non-linear data in some applications,24 they were found to be equivalent to or less reliable than methods such as MLR, PCR, PLS or LWR in comparative studies of calibration methods where extrapolations occurred.35,36 The dangers of strong extrapolation with NNs are illustrated in Fig. 2(a)–(c), which show results obtained for the modelling of a cosine function with different numbers of hidden nodes and test points (+) outside the calibration domain. The calibration domain contains the X-values in the range [−2, +2].

The NN builds an empirical model to fit objects in the calibration space only, and test points are badly predicted. With analytical data, such strong extrapolations rarely occur and one generally has the situation represented in Fig. 2(d), where the prediction error is less dramatic. It is possible to use NNs to perform small or mild extrapolations on such non-linear data, but NNs should not be considered as generally suitable for extrapolation, any more than other chemometric techniques.

3.3.2 Neural networks and linear models. One may wonder what happens if an NN is used to model a linear data set. For instance, a model may be wrongly considered as non-linear owing to an incorrect estimation of linear PLS or PCR model complexity. It is also tempting to take advantage of the flexibility of NNs and let them do the work with any kind of data, even when the data are likely to be linear.

From the point of view of prediction, if the data are linear, an NN with non-linear transfer functions should nevertheless converge to a solution that approximates a linear model solution, since the linear portion of the transfer functions can be activated in that case (see Fig. 3).
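This behaviour follows from the first-order Taylor expansion tanh(x) ≈ x near the origin; a quick numerical check (our own illustration) makes the near-linear region of the transfer function explicit:

```python
import numpy as np

# Near the origin tanh(x) is almost exactly x (first-order Taylor expansion),
# so an NN whose weighted sums stay small behaves approximately linearly.
x_small = np.linspace(-0.1, 0.1, 5)
max_dev = float(np.max(np.abs(np.tanh(x_small) - x_small)))   # tiny deviation

# Far from the origin the function saturates and the linear approximation fails.
x_large = np.array([2.0, 3.0])
sat_dev = float(np.max(np.abs(np.tanh(x_large) - x_large)))   # large deviation
```

Keeping weights small (e.g. through scaled inputs and small initial values) therefore keeps the network in its quasi-linear regime.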

This was confirmed by the results of a recent comparative study carried out to evaluate the performance of several linear and non-linear modelling methods on real industrial data.32 Each of the four industrial data sets consisted of a series of NIR spectra (X-data) and a specific property to be predicted (Y-data). Some results of this comparative study are listed in Table 2, which contains the root mean square error of prediction (RMSEP) values obtained with stepwise MLR, PCR, PLS and NN.
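For reference, the RMSEP over an independent test set of n samples is sqrt(Σ(y_i − ŷ_i)²/n); a minimal implementation (the function name is ours):

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean square error of prediction over an independent test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

For example, rmsep([1, 2, 3], [1, 2, 5]) equals sqrt(4/3), and a perfect prediction gives 0.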

NNs outperform linear methods for the strongly non-linear data set, which is not surprising, but their performance on slightly non-linear and linear data is comparable to the performance of linear methods such as PLS or PCR. This is in agreement with the observations of Gemperline et al.,12 who stated that artificial neural networks having the appropriate

Fig. 2 NN predictions within and outside calibration space: a–c, cosine function; d, quadratic function. Model with a, three; b, six; c, nine; and d, two hidden nodes. o, actual training; +, actual test; *, predicted.


architecture can be used to develop linear calibration models that perform as well as linear calibration models developed by PCR or PLS.

It has been said that when NNs are used to model linear relationships, they require a long training time, since a non-linear technique is applied to linear data.33 In theory this is true in the sense that the apparently linear portion of the non-linear transfer functions is not perfectly linear, and therefore the learning algorithm must perform continuous adjustments to correct for this slight deviation. For a perfectly linear and noise-free data set, the NN performance tends asymptotically towards the linear model performance and it generally converges to the intrinsic precision of the computer. However, in this case the curve of NN error as a function of the number of iterations is almost perfectly flat and an acceptable solution can be reached relatively early during the training. Moreover, perfectly linear and noise-free data sets are seldom available, so that in practice NNs can reach a performance qualitatively similar to that of linear methods in a reasonably short training time.

In spite of these reassuring results, it does not make sense intuitively to apply a complex and possibly time-consuming method when simpler tools are likely to perform as well. MLR with stepwise variable selection can give excellent prediction results on linear data sets (see Table 2) and its interpretation properties for the analyst are optimal compared with all other methods. In practice, using a highly flexible tool to model linear phenomena can lead to rapid overfitting of the measurement noise. Artefacts can also occur if the topology of the NN is not carefully designed. As an illustration, Fig. 4 shows distortions appearing when a perfectly linear and noise-free model is fitted with an NN containing too many hidden nodes and a non-linear node instead of a linear node in the output layer.

3.3.3 Robustness of the models. NNs are sometimes recommended for their robustness,37 but this term is rarely defined with precision. Unlike analytical procedures, for which official definitions of the term exist, there is no unique definition of the robustness of a multivariate calibration model, as illustrated by some controversial statements.38,39 It seems reasonable to follow Frank and Todeschini's40 definition of robustness in the framework of regression analysis: robust methods are those methods that are insensitive to small deviations from the distributional assumptions. This definition applies in particular to methods designed to cope with outliers present in the calibration set. Methods to detect or handle outliers are presented in Section 4.1.2. Robustness of an NN is also challenged when predictions are performed on new samples outside the calibration domain in the X-space or in the Y-space. We underlined in Section 3.3.1 that NNs often perform relatively poorly in situations of extrapolation.

In all these situations, deviations from a priori assumptions (data set free of outliers and of systematic noise) affect the training samples. Some authors consider robustness from a different perspective, in situations where a model has been developed with training data that fulfil initial assumptions but perturbations affect new objects to be predicted.38,41 Different types of perturbations must be considered. The appearance of higher levels of random noise in the test samples is usually not catastrophic.42 Derks et al.38 related quantitatively the variance of predicted responses to the variance of random noise added to the input variables. The influence of instrumental perturbations that have a more systematic effect than random noise (e.g., baseline or wavelength shift) can be more catastrophic and is difficult to anticipate. Indeed, it depends on a number of parameters: the curvature of the relationship between each descriptor and the response, and the position of the perturbed samples on the descriptor axes. When a strongly non-linear relationship is being modelled, the NN can have either an attenuating effect with respect to perturbations (compared with linear models), because of the squashing effect of the non-linearity, or a catastrophic effect on high leverage points, as illustrated in Fig. 5.

Since the exact shape of the model and the position of future samples in input space cannot always be known, a solution consists in identifying possible sources of degradation and including them either in the training set42 or in the monitoring set.41 This allows one to avoid large prediction errors after the appearance of small perturbations that can be expected in practice.

3.3.4 Black-box aspect of neural networks. NNs can perform at least as well as any other technique in terms of prediction, but a major criticism remains their black-box aspect. To be fair, it should be pointed out that this limitation is not peculiar to NNs only. For instance, it is often impossible to visualise clusters and outliers by projecting scores on component axes in LWR, since the samples belong to local models

    Fig. 3 Usual non-linear transfer functions: hyperbolic tangent; sigmoid.

    Table 2 RMSEP of different multivariate calibration methods applied to industrial data

Property y | Nature of data | MLR | PCR | PLS | NN
Moisture in wheat | Linear | 0.1860 | 0.2147 | 0.2150 | 0.1981
Hydroxyl number of polyether polyol | Linear | 0.90 | 1.15 | 1.31 | 0.88
Octane number of gasoline | Slightly non-linear | 0.1355 | 0.1426 | 0.1461 | 0.1459
Mineral charge in a polymer | Strongly non-linear | 0.0797 | 0.0477 | 0.0445 | 0.0096


defined with different objects. However, model interpretation with an NN is still considered much more complex than with, e.g., PLS or PCR. This is due to the operations (summation and projection on transfer function) performed successively in the hidden and output layers, which prevent one from deriving simple analytical expressions between input and output variables [see eqn. (1)]. In addition, unlike QSAR applications, where input variables are heterogeneous original variables, the input variables used in multivariate calibration are often scores compressing spectral information, which complicates model interpretation even further. Methods to ease model interpretation will be presented in Section 4.3.3, but it is clear that model interpretability remains an active research area for the NN community and the danger of incorrect inference (common to all non-parametric techniques) must not be overlooked.

    4 Development of calibration models

We will now examine in more detail the way in which an NN model should be developed, according to our experience. The different steps in method development are summarised in the flow chart in Fig. 6.

It will come as no surprise that data pre-processing (Fig. 6, left) governs closely the quality of results that can be expected. We propose some tools to help in optimising parameters such as the number of input variables or the number of hidden nodes. NN construction (Fig. 6, right) is based on alternating removal of input and hidden nodes, starting from a large NN. The procedure described in this flow chart is very general and of course other strategies are applicable. Short cuts can be made through the flow chart by including a priori knowledge, or as the user acquires more experience with topology optimisation.

    4.1 Data pre-processing

4.1.1 Detection of non-linearity. As a general rule, one should not try to build an NN model unless the situation is one of those mentioned in Section 3.1. Therefore, some diagnostic tools are necessary to detect the presence of non-linearity in a data set. The simplest approach, which in many cases is sufficient to detect the presence of non-linearity, is to plot the property of interest versus the different measurement variables, or combinations of these variables such as PC scores. If these plots are inconclusive then one should build a linear model with MLR, PCR or PLS. Visual inspection of the residuals (y − ŷ) of the linear model versus each descriptor x_i retained in the model, versus the experimental response y and versus the estimated response ŷ should then be performed to detect non-linearities.

Recently, Centner et al.43 reviewed a number of more sophisticated graphical and numerical methods to detect non-linearities. They cited the Mallows augmented partial residuals plot (MAPRP) combined with a runs test as the most promising approach for non-linearity detection. The MAPRP is the plot of the term (e + b_i x_i + b_ii x_i^2), called augmented residuals, versus x_i. The e are the residuals of the linear regression y = f(x_1, ..., x_i, ..., x_n, x_i^2). The regression should be performed on all variables x_i in the model (original variables or principal component scores). Curvature in the MAPRP indicates that higher variables x_j (j > i) correct for the non-linear (quadratic) nature of the relationship between y and the variable x_i. In that case the variable x_j is undesirable because it makes the model less robust. The runs test is used to detect series of residuals with the same sign, called runs. Long runs indicate the presence of a trend in residuals that may be a systematic bias or non-linearity. From the total number of positive and negative residuals, one calculates a z-value that is compared with a tabulated value. A significant value of |z| indicates a trend in the residuals. As an illustration, we performed the detection of non-linearity between a series of 104 diesel oil NIR spectra and their viscosity. We built a 10-component PCR model, and for each principal component (PC) we looked at the MAPRP combined with the runs test. For PC2, PC3 and PC4, the |z| values indicate a non-linearity between the augmented residuals and the variable (Fig. 7).

A limitation of the MAPRP is that it allows only the detection of non-linearities that can be described or approximated by a quadratic term.
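The runs test itself is straightforward to compute. The sketch below uses the Wald–Wolfowitz form of the expected number of runs and its variance; it is our own hedged illustration, not necessarily the exact implementation of ref. 43:

```python
import numpy as np

def runs_test_z(residuals):
    """z-statistic of the runs test on the signs of the residuals.

    |z| larger than the tabulated critical value (1.96 at the 5% level)
    indicates a trend in the residuals, e.g. non-linearity.
    """
    s = np.sign(residuals)
    s = s[s != 0]                              # discard exact zeros
    n_pos = int(np.sum(s > 0))
    n_neg = int(np.sum(s < 0))
    n = n_pos + n_neg
    runs = 1 + int(np.sum(s[1:] != s[:-1]))    # number of same-sign runs
    mu = 2.0 * n_pos * n_neg / n + 1.0         # expected number of runs
    var = (2.0 * n_pos * n_neg * (2.0 * n_pos * n_neg - n)
           / (n ** 2 * (n - 1)))
    return (runs - mu) / np.sqrt(var)

# Residuals left after fitting a straight line to a parabola form long runs:
x = np.linspace(-1.0, 1.0, 40)
z_curved = runs_test_z(x ** 2 - np.mean(x ** 2))
```

Here the curvature produces only three long runs, far fewer than expected under randomness, hence a strongly significant negative z.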

Centner et al.43 emphasised the need for careful outlier detection before drawing conclusions about the presence of non-linearity in a data set. Outliers with high leverage can pull the regression line and lead to an incorrect estimation of the number of runs. Conversely, some outlier detection methods can wrongly flag as outliers samples that are high leverage points responsible for non-linearity in the data.33 This will be illustrated in the next section.

4.1.2 Detection of outliers. Actually, the term outlier detection encompasses two steps: first, atypical object detection, followed by outlier identification. Although numerical methods allow flagging of samples that are outliers on statistical grounds, the positive identification of an atypical object as a true outlier requires knowledge of the process or data acquisition procedure, or interaction with the person in charge of this acquisition. It is recommended to keep all flagged samples unless they are positively identified as outliers on experimental grounds.

It is beyond the scope of this paper to review all methods for outlier detection proposed in the literature, but we will suggest a few guidelines. One must make a distinction between different types of outliers. Outliers in X can be due to accidental process upsets, experimental errors during acquisition of spectra or transcription errors during the labelling of samples or file

    Fig. 4 Predictions for linear model with incorrect NN topology.

Fig. 5 Attenuation or amplification of Y-prediction error in a non-linear model compared with a linear model, depending on the sign of the error in X.


manipulation. Outliers in Y are due to incorrect measurements of reference values or transcription errors also. Atypical objects, i.e., possible outliers in X or in Y, can be flagged before performing any modelling. By contrast, outliers in the X–Y relationship can only be detected after building a complete model.

The simplest tool to flag atypical objects before modelling is the visual observation of the X and Y data available. One should look at the original set of sample spectra, the vector of responses and score plots on the first PCs. To detect possible outliers in the X-space, it is recommended to examine the leverage of each sample. The leverage of a sample is a measure of its spatial distance to the main body of the samples in X.44 For a given data matrix X, the leverage of sample i is given by the diagonal term p_ii of the prediction matrix P, also called the Hat matrix:

P = X(X^T X)^-1 X^T   (2)

    Fig. 6 Strategy for construction of NN model: left, data handling; right, network construction.


When there are more variables than objects in X, the prediction matrix must be calculated with the matrix T_A of sample scores on the A first significant PCs:

P = T_A (T_A^T T_A)^-1 T_A^T   (3)

High leverage points have large values of p_ii (diagonal elements of the P matrix) and special attention should be paid to these points. They have a strong influence on parameter estimation and can alter the model dramatically if they happen to be true outliers. The limitation of this approach is that it is not


straightforward to determine A. Several methods (see Section 4.2.2) can be applied to perform this determination.45–49
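Eqns. (2) and (3) translate directly into code. The sketch below is our own illustration; for brevity the scores are taken from an SVD of the uncentred matrix, whereas in practice X is usually column-centred before PCA:

```python
import numpy as np

def leverages(X, n_components=None):
    """Diagonal p_ii of the prediction (Hat) matrix, eqn. (2).

    With more variables than objects, pass n_components = A to compute
    the leverages from the scores T_A on the A first PCs, eqn. (3).
    """
    X = np.asarray(X, dtype=float)
    if n_components is not None:
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = U[:, :n_components] * s[:n_components]   # scores T_A
    P = X @ np.linalg.pinv(X.T @ X) @ X.T
    return np.diag(P)

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
X[0] += 8.0                       # one sample far from the main body of the data
h = leverages(X)                  # h[0] stands out as a high leverage point
h_scores = leverages(X, n_components=2)
```

A useful check is that the leverages sum to the rank of the matrix used (here 3, or A = 2 for the scores version), so the sample whose p_ii dominates that budget deserves special attention.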

An alternative approach for the a priori detection of atypical objects in X is to apply Grubbs' test to Rao's statistic.50 Rao's statistic D^2(k) is a value calculated for each sample i and for each PC k. It accumulates all variations described by PCs k + 1 to p. For each k, Rao's statistic is used as input to flag possible outliers in X with the univariate Grubbs' test.
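As an illustration of the flagging step, the univariate Grubbs' test reduces to computing G = max|x_i − x̄|/s and comparing it with a tabulated critical value; the example values below are invented:

```python
import numpy as np

def grubbs_statistic(x):
    """Grubbs' test statistic G = max|x_i - mean| / s for a single outlier.

    G is compared with a tabulated critical value for the given n and
    significance level; a G above the critical value flags the sample.
    """
    x = np.asarray(x, dtype=float)
    z = np.abs(x - x.mean()) / x.std(ddof=1)
    return int(np.argmax(z)), float(z.max())

# Univariate example (invented values, e.g. Rao's statistic for one PC):
values = np.array([2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 6.5])
idx, G = grubbs_statistic(values)   # the last sample gives the largest G
```

For n = 7 the tabulated 5% critical value is about 2.02, so the last sample would be flagged as an atypical object.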

After flagging possible outliers in X or in Y, one must check whether these samples are outliers in the X–Y relationship. Centner et al.50 proposed a procedure based on the development of PLS leave-one-out cross-validation models after flagging possible outliers with a Grubbs' test performed on Rao's statistic. The goal of the cross-validation is to discriminate situations where a true outlier alters the models, resulting in a large cumulative cross-validation error, from situations where the large value of the cross-validation error is simply due to the incorrect prediction of a high leverage point that is not an outlier. A limitation of this approach is that the identification is based on linear cross-validation models (it will be explained in Section 4.1.4 why cross-validation should not be performed with NNs). A sample that is an outlier to a linear model might not be an outlier to a non-linear model.33 The final decision should be made on the basis of a comparison of prediction results for NN models with and without the flagged samples in the training set.

To illustrate the difficulty of outlier detection in non-linear models, we report in Fig. 8 a PC scores plot for the NIR data set used to model the viscosity of diesel oil. Applying Grubbs' test to Rao's statistic, the sample marked with an asterisk was identified as an atypical object. Using leave-one-out cross-validation on PLS models, the flagged sample (which has the highest Y-value in the data set) is positively identified as an outlier to the PLS model. If we compare the PLS and NN test results depending on whether this flagged sample is included in the training set, we obtain the RMSEP values reported in Table 3.

When the flagged sample is included in the training set, the NN performance in prediction improves whereas the PLS performance degrades. This illustrates how non-linear information can be extracted by the NN from a high leverage sample that is not an outlier.

Since outlier detection is not always successful, it is possible to design NNs that can handle outliers present in the training set. For instance, Walczak51 proposed the use of error thresholding functions adjusted iteratively during training with respect to the median of the residuals. Wang et al.52 also applied a thresholding function adjusted with respect to the assumed proportion of outliers among the ranked residuals. In both approaches, the idea is to prevent outlier residuals from influencing weight estimation during training.

4.1.3 Number of samples. The number of samples available is often a limiting factor when using NNs. As with other regression methods, there are constraints concerning the number of samples required to develop an NN model. The number of adjustable parameters is usually such that the training set is rapidly overfitted if too few samples are available. We consider that when this number is less than 30, an alternative modelling technique should be considered. Unfortunately, this is not always obvious to inexperienced users, who can be deceived by the extreme flexibility of NNs, since they can fit the training data with arbitrary precision. It is possible to obtain excellent training results for the modelling of data sets with fewer than 15 samples. However, if these models are validated on new independent samples, a significant degradation of the results is observed, owing to a lack of generalisation ability.

To estimate the minimum number of training samples allowing theoretical generalisation, one can use a parameter called the Vapnik–Chervonenkis dimension (VCdim). For an MLP with one hidden layer, the lower bound of the VCdim is approximated as twice the total number of weights in the NN.53

Fig. 7 Mallows' augmented partial residual plots for PCR models of diesel oil viscosity: PC1; PC2; PC3; PC4.

166R Analyst, 1998, 123, 157R–178R

It is possible to reach good generalisation if the number of training samples is at least equal to this lower bound. When the number of samples available does not fulfil this requirement, an NN can still be used to find an acceptable local minimum close enough to the absolute minimum of the error function. However, the ratio of the number of samples to the number of adjustable parameters should be kept as high as possible, in order to avoid under-determination of the problem. The number of samples is generally imposed or limited by practical constraints, but one can partly solve the under-determination problem by reducing the number of weights in the NN as much as possible, as will be explained in Section 4.1.5.
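The weight count behind this bound is straightforward to compute for a one-hidden-layer MLP. The following small helper (our own naming; biases are counted as adjustable parameters) sketches the calculation:

```python
def n_weights(n_in, n_hidden, n_out=1):
    """Adjustable parameters (weights + biases) of a one-hidden-layer MLP:
    n_hidden connections and one bias per hidden node, plus the output layer."""
    return n_hidden * (n_in + 1) + n_out * (n_hidden + 1)

def min_training_samples(n_in, n_hidden, n_out=1):
    """Rough minimum training-set size: twice the number of weights,
    following the approximate VCdim lower bound cited in the text."""
    return 2 * n_weights(n_in, n_hidden, n_out)
```

For the 4-input, 3-hidden-node, single-output MLP of Fig. 1, this gives 19 adjustable parameters and therefore a minimum of 38 training samples under this criterion.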

4.1.4 Data splitting and validation. An important step in the development of any calibration model is the splitting of the available data into two subsets: a training set (used to estimate model parameters) and a validation set or test set (used to check the generalisation ability of the model on new samples). For NNs the problem is more complex, because they fit the training data to arbitrary precision, provided that the number of hidden nodes is sufficient and the training time long enough. Therefore, an additional monitoring set is necessary to stop the training before the NN learns idiosyncrasies present in the training data.4,54,55 The monitoring set must be representative of the population under study, in order to avoid NN overtraining that leads to overfitting (see Fig. 9).

Ideally, for a number nt of training samples, the monitoring set and the test set (if available) should contain between nt/2 and nt samples each. The repartition of samples between these sets and the terminology used in several papers are the source of much confusion. When prediction errors concerning NNs are reported in the literature, are the authors referring to the training error, monitoring error or validation error? The performance of an NN should not be judged by its performance on training data, which can always be fitted perfectly. Often, the problem is to know whether the reported results have been obtained on a monitoring set or a validation set. Data sets are seldom large enough to be split into three subsets, so authors often report results on a monitoring set that they call the validation set or test set. There is no reason why results obtained on a monitoring set could not be reported, as long as it is made clear that these results were obtained on the data set used to determine the training end-point. One must be aware of the limitations of this approach: a true validation error is a better estimator of the NN generalisation ability than a monitoring error.4 If one decides to favour the modelling power of the NN by using only two subsets (training and monitoring) instead of three subsets of smaller size (training, monitoring and validation), very good results may be obtained on the monitoring set, but the model has not been truly validated, in the sense that the monitoring data were used to optimise one of the model parameters (the number of iterations for training). However, the monitoring results can be considered as indicative of the modelling power to expect from the NN model, and they can be compared with, e.g., PLS results with cross-validation. We summarise the comparison between different situations for PLS and NNs in Fig. 10.

Some authors mention leave-k-out (often k = 1) cross-validation as a way of estimating the generalisation ability of the NN, for instance when only a few calibration samples are available.3,6,56 We believe that this approach is not adapted to NNs37,41,54 and we do not recommend it. The procedure can be suitable for parametric linear models characterised by a quadratic, bowl-shaped, smooth error surface. With such models, the perturbation caused by the removal of one or a few samples from the training set has little influence on the model parameters, and therefore the cumulative cross-validation error obtained is a reliable validation error estimate for the model constructed with all samples. The situation changes for NNs applied to non-linear problems characterised by complex error surfaces.53 Unlike PLS or PCR, which are constrained to produce orthogonal components, no constraint is imposed on the NN adjustable parameters, and the NN tends to perform a point-by-point fit of all training samples. Solutions obtained when two different samples are removed from the training set can differ significantly from each other.4 In this case one cannot consider that the global model is validated, and it is even possible that none of the models developed during cross-validation describes the same region of the error surface as the global model. Therefore, if too few calibration samples are available to create a monitoring set, it is better to consider an alternative method to NNs.

Ideally, the monitoring and validation sets should be independent of each other and of the training set. This can only be achieved if the samples in each of these subsets are selected randomly. However, it is important to include as many sources of variance as possible in the training set. If not, extrapolation may occur in the prediction phase, and this should be avoided with any modelling method. Specific algorithms can be used to select training samples that are representative of the total population and contain high leverage points that carry information about the main sources of variance. A limitation of this approach is that the subsets selected are no longer independent, since mathematical criteria are applied to discriminate the training samples from the other samples. It is important to keep this restriction in mind when results are reported. We will now present some algorithms to perform automatic subset selection.

Fig. 8 Score plot of diesel oil samples.

Table 3 Influence of the presence of a single training sample on RMSEP obtained with PLS and NN models

    Method    RMSEP (flagged sample      RMSEP (flagged sample
              not in training set)       in training set)
    PLS       0.31                       0.39
    NN        0.28                       0.23

Fig. 9 Typical evolution of training and monitoring errors as a function of number of iterations.

The D-optimality criterion selects the n calibration samples that provide regression coefficients with the lowest variance of all the subsets Xn of n samples. Selection is performed by maximising the determinant of the information matrix (Xn^T Xn). When the number of samples available is large, Ferré and Rius57 proposed the use of Fedorov's exchange algorithm to select the D-optimal subset. Samples selected with this criterion are located at the border of the calibration domain. If a small number of samples is retained, the interior of the calibration domain is not appropriately sampled and the set obtained is not representative of the whole population.
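The criterion itself is compact enough to sketch. The toy implementation below (our own naming) searches all subsets exhaustively, which is only feasible for very small data sets; for realistic sizes the Fedorov exchange algorithm cited above is used instead.

```python
import numpy as np
from itertools import combinations

def d_optimal_subset(X, n):
    """Exhaustive D-optimality: pick the n rows of X maximising
    det(Xn^T Xn). Illustration only; use an exchange algorithm in practice."""
    best, best_det = None, -np.inf
    for idx in combinations(range(len(X)), n):
        Xn = X[list(idx)]
        d = np.linalg.det(Xn.T @ Xn)
        if d > best_det:
            best, best_det = list(idx), d
    return best
```

On a small two-dimensional example the selected rows are the extreme points of the domain, which is exactly the border-seeking behaviour described in the text.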

The Kennard–Stone algorithm58 is an alternative method that allows the selection of a subset of representative samples. Samples are selected iteratively by maximising the Euclidean distance between the last selected point and its nearest previously selected neighbour. The first samples selected with this method are generally the same as with the D-optimality criterion, and they describe the border of the calibration domain. As the number of selected samples increases, their repartition becomes more homogeneous and the subset selected is more representative of the global population.

These two algorithms ensure that monitoring and/or validation samples are within the domain covered by the training samples, so that the model does not extrapolate. This type of sample selection does not match the not-so-ideal situation sometimes encountered in practice, where it is not guaranteed that all new samples fall within the calibration domain. The duplex algorithm59 allows a more realistic repartition of samples than the two previous methods. Samples are selected in the same way as with the Kennard–Stone method, but they are alternately assigned to the training set and the validation (or monitoring) set. Thus, not all samples located at the border of the calibration domain are placed in the training set; some are found in the validation set. However, if some samples at the border of the domain are very close to each other, duplex splitting can be misleading, because each training sample will have its nearest neighbour in the validation set. This can lead to overfitting and an over-optimistic estimation of the validation error. For the same reason, with any splitting method all replicates of a sample should be assigned to the same subset.
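The Kennard–Stone selection loop can be sketched in a few lines. This maximin version (our own naming; it seeds the selection with the two most distant samples) follows the description above:

```python
import numpy as np

def kennard_stone(X, n_select):
    """Select n_select representative rows of X: start with the two most
    distant samples, then repeatedly add the candidate whose distance to
    its nearest already-selected neighbour is largest."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    while len(selected) < n_select:
        remaining = [k for k in range(len(X)) if k not in selected]
        # for each candidate, the distance to its nearest selected neighbour
        nearest = d[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(nearest))])
    return selected
```

A duplex split would alternate the points produced by the same loop between the training and validation sets instead of assigning them all to the training set.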

Sample selection is often performed in PC space on the scores matrix T instead of on the original matrix X, which allows one to reduce the computational burden. To illustrate the principle of the three selection methods (Kennard–Stone, D-optimal and duplex), we represented the sets of 30 training samples from a non-linear data set (prediction of viscosity of diesel oil samples from their NIR spectra) selected with each method. We first performed a PCA decomposition of the original X matrix (104 × 795), then the 30 training samples were selected in the subspace spanned by the first ten PCs. Fig. 11 represents the position of the training samples (asterisks) selected in the PC1–PC2 plane.

If one wants to compare the efficiency of several modelling methods, samples can be selected with D-optimal or Kennard–Stone designs. If a model has to be developed for an application for which there is no guarantee that only interpolation will be performed, then a duplex design will lead to more pessimistic but reliable results.

It is also possible to perform the splitting after projecting the samples on a two-dimensional map with a Kohonen NN.60,61 The advantage of such a projection is that an estimation of the relevant number of dimensions is not required and the essential topological features of the data set are preserved in two dimensions, which allows rapid visualisation of the data structure.

    Fig. 10 Repartition of samples for internal and external validation with PLS and NN.


With strongly clustered data, subset selection should be performed on each cluster separately, in order to ensure good representativity between the training and test data. After data splitting, one can apply the methods presented by Jouan-Rimbaud et al.62,63 for estimating numerically the representativity of two data sets. These methods provide indices varying between 0 and 1 to compare the direction, covariance and centroids of two data sets.

4.1.5 Data compression. As pointed out earlier, the ratio of the number of samples to the number of adjustable parameters in the NN should be kept as large as possible. One way of over-determining the problem is to compress the input data, especially when they consist of absorbances recorded at several hundred wavelengths. In addition to reducing the size of the input data, compression allows one to eliminate irrelevant information such as noise or redundancies present in a data matrix. Successful data compression can result in increased training speed, a reduction of memory storage, better generalisation ability of the model, enhanced robustness with respect to noise in the measurements and a simpler model representation.

The latent variables calculated with the PLS algorithm are designed to project data points onto a lower dimensional subspace describing all relevant sources of variance. While PCs are designed to maximise the explained variance in the X-space only, PLS latent variables are built so as to maximise the covariance between X and Y. Some authors have used PLS to calculate input scores for NN training.64 However, the latent variables are designed to conserve information linearly correlated with the response, and some relevant non-linear information might be rejected in higher order latent variables that are not retained in the model.24,65 For this reason, we do not recommend pre-processing data with PLS before NN modelling.

The most popular method for data compression in chemometrics is principal component analysis (PCA). In addition to summarising almost all the variance in the X-matrix on a few axes only (the PCs), it has the property that these axes are mutually orthogonal, which allows inversion of the variance–covariance matrix in linear regression models (PCR). Orthogonality of the input variables is not so critical for NNs, which can handle collinear input data. However, most NN applications in quantitative analysis with spectral data use PC scores as input variables.24,30,41,66–70 For the determination of the optimum number A of input PCs to retain, one can use the same PC selection procedures as for PCR, although the choice is not so critical, since NN models are built iteratively by successive optimisations of the NN topology. One possible approach consists in performing initial calculations with a deliberately large number of PCs and progressively reducing this number. This point will be detailed in Section 4.2.

When compressing data with PCA, one must be aware of some theoretical limitations. PCA is a linear projection method that fails to preserve the structure of a non-linear data set. If there is some non-linearity in X (or between X and Y), this non-linearity can appear as a small perturbation on a linear solution and will not be described by the first PCs, as in a linear case. A non-linear transformation of the X-matrix or of the PC scores matrix can be performed to restore the least-squares approximation property, but the resulting non-linear PCs are strongly dependent upon the pre-selected non-linear form and may not ensure the best representation of distances between points in the original space.71 In practice, PC scores are often successfully used as inputs without transformation, because all relevant information about X is usually contained in the first 15 PCs.

Alternatively, it is possible to use Fourier analysis,35,41 the Hadamard transform72 or wavelet analysis73 to pre-process spectral data before NN modelling. An attractive feature of wavelets is their ability to describe optimally local information in the spectrum, whereas the Fourier decomposition is global. If this localised information is related to the non-linearity present in the data, an improvement can be expected if the input matrix is described with wavelet coefficients instead of PC scores or Fourier coefficients. A difficulty lies in the selection of one of the numerous wavelet bases for spectral decomposition. A scheme based on the optimisation of the minimum description length (MDL) criterion in multivariate calibration was explained by Walczak and Massart.74

Whatever the compression method retained, the new subspace (PCs, Fourier coefficients, wavelet coefficients) for sample description must be determined on the training set only. The monitoring and test samples can then be projected into this subspace to calculate their scores or coefficients.
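For the PCA case, this train-only convention can be sketched as follows (a minimal version with our own naming; the loadings and the centring mean come exclusively from the training matrix, and every other sample is projected with them):

```python
import numpy as np

def fit_pca(X_train, n_pc):
    """Mean-centre the training matrix and return (mean, loadings).
    Rows of Vt from the SVD of the centred data are the PC loadings."""
    mean = X_train.mean(axis=0)
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_pc].T

def project(X, mean, loadings):
    """Score any samples (training, monitoring or test) in the
    subspace determined on the training set."""
    return (X - mean) @ loadings
```

Fitting on the full data set instead would leak information from the monitoring and test samples into the model, which is exactly what the paragraph above warns against.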

4.1.6 Data scaling. Once the input variables have been selected or calculated, one must ensure that they can be used for efficient estimation of the NN parameters. It is not necessary to mean-centre input variables before training, since the biases act as offsets in the model. NN training is not based on variance–covariance maximisation, and therefore it is not necessary to scale the different variables to unit variance, even when they are heterogeneous. This is an advantage over methods such as PCR or PLS, which require auto-scaling when variables are of a different nature. For instance, in process control applications where some variables are continuous and others are binary, the binary variables can be artificially given more weight than the continuous variables because of auto-scaling, and the model interpretation is then incorrect.

Fig. 11 Data splitting: selection of calibration samples (*) in PC space: a, D-optimal design; b, Kennard–Stone design; c, duplex design.

The only constraint for NNs is to scale each input variable so that training starts within the active range of the non-linear transfer functions. Usually, samples are range-scaled with a linear mapping called min–max scaling. The scaling parameters must be determined on the training samples, and all samples must be scaled with respect to these parameters. Let Xmin^train and Xmax^train be the extreme values of variable X in the training set, and let rmin and rmax define the limits of the range to which we want to scale variable X. Any sample Xi (from the training, monitoring or test set) must be scaled to a new value Ai as follows:

    Ai = [(Xi − Xmin^train) / (Xmax^train − Xmin^train)] (rmax − rmin) + rmin    (4)
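Eqn. (4) translates directly into code. The sketch below (our own naming) separates the fit step, which must see training data only, from the apply step, which handles any sample:

```python
import numpy as np

def minmax_fit(X_train, r_min=-1.0, r_max=1.0):
    """Scaling parameters of eqn. (4), taken from the training set only."""
    return X_train.min(axis=0), X_train.max(axis=0), r_min, r_max

def minmax_apply(X, x_min, x_max, r_min, r_max):
    """Map any sample with the training-set extremes; monitoring or test
    samples outside the training range simply fall outside [r_min, r_max]."""
    return (X - x_min) / (x_max - x_min) * (r_max - r_min) + r_min
```

Note that a monitoring or test sample more extreme than the training extremes is scaled outside the target range, which is the intended behaviour: the parameters are never refitted on new data.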

For NNs with sigmoid or hyperbolic tangent transfer functions, rmin and rmax are set to −1 and 1, respectively. One must also ensure that the initial weights wi0 are reasonably small, to avoid saturating the transfer functions in the first iterations. We suggest setting them so that 0 < |wi0|


of NNs with and without direct connections for the development of multivariate calibration models with non-linear simulated data. They found that directly connected NNs learned more quickly in the initial and intermediate training phases, but NNs without direct connections converged to lower calibration and prediction errors. Dolmotova et al.65 recently compared NNs with and without direct connections for the simultaneous determination of the concentrations of the three main components in paper coating. The results obtained with both methods were approximately similar. In theory, an NN without direct connections can achieve the same prediction performance as an NN with direct connections, and we therefore prefer NNs without direct connections, to reduce the number of adjustable parameters.

4.2.2 Number of input and output nodes. Although NNs have the ability to model multiple responses simultaneously, it is recommended that one model only one response at a time and therefore use a single output node. The only exception to this rule is for situations where one wants to predict several correlated responses, such as the concentrations of different constituents of a mixture in a closed system. In that case, all responses can be modelled simultaneously with an NN having one output node per response.

To set the initial number of input nodes, two approaches are possible: the stepwise addition approach consists of starting with a deliberately small number of input variables and adding new variables one at a time until the monitoring and/or prediction performance of the NN no longer improves; the stepwise elimination approach consists of starting with a deliberately large number of input scores and gradually removing (pruning) some of them until the monitoring and/or prediction performance of the NN stops improving. Both approaches are used in practice and no definite recommendation can be given as to which one is better, since both have advantages and limitations. If PCs are selected according to their eigenvalues and the scores are used as inputs, the stepwise addition method often leads to quick and satisfactory results, because all the necessary information is usually contained in the first few PCs. However, it can happen that most of the information is contained in, e.g., PC1 to PC5, but some important additional information is also contained in PC10. During stepwise addition, the NN performance will stagnate or degrade between PC6 and PC9, and there is little chance that PC10 will be included in the final model.

When stepwise elimination is performed, one must include a deliberately large number of input variables in the initial set. Irrelevant variables can be eliminated later, but relevant variables that have not been included in the initial model will not be tested subsequently. Here again, working with PC scores as inputs is advantageous. Using classical techniques (e.g., Malinowski's factor indication function and reduced eigenvalue test45 or cross-validation75), one can estimate the pseudo-rank of the input data matrix. Then, one selects a few additional PCs (five or six) that may account for possible non-linearity, and the NN training can be started with this initial training set. For calibration problems, the size of the initial set should typically vary between 10 and 15 PCs. The drawback of the stepwise elimination approach is that it can be extremely time consuming if input variables are tentatively removed by trial and error, because of the large number of possible combinations.60

In neural computation, the relevance of a variable to a model is called its sensitivity. The optimisation of the set of input variables can be accelerated if a method to estimate the sensitivity of each variable is implemented. Several methods have been proposed. The most common is often referred to as Hinton diagrams. It consists of ascribing to each input variable a sensitivity proportional to the average magnitude of its associated connections in the NN, represented on a two-dimensional map by square boxes of varying size. Candidate variables for deletion are those with the lowest sensitivity. In spite of its popularity, this method has severe theoretical and practical limitations.70,76 It is based on an analogy with the classical MLR approach, where the magnitude of a regression coefficient reflects the importance of the relationship between the associated descriptor and the response. In an NN model, input variables that have a linear contribution to the response will be modelled in the linear portion of the sigmoidal transfer function, associated with small or medium magnitude weights, whereas the non-linear variables will be modelled in the concave portion of the transfer function, associated with large magnitude weights. Therefore, the Hinton diagram ranking method is not based on the intrinsic relevance of a variable to a model, but simply on the nature of its contribution to the response. Linear input variables are systematically flagged as unimportant, even when they contribute explicitly to the model. This approach can only give reliable results when the data set is entirely linear, in which case there is no point in using an NN. For the same reason, we are not in favour of training methods based on the principle of weight decay,4 which consists of adding to the cost function a term penalising large weights.

The approach based on the estimation of saliencies is theoretically more rigorous.76 The saliency of a weight is a measure of the increase in the NN cost function caused by the deletion of this weight. It is estimated at the end of the training. The deletion of an individual weight wi in an NN can generally be considered as a small perturbation. First, the change in the cost function caused by this small perturbation of the weight matrix is approximated by a second-order Taylor series expansion. Ideally, the training is stopped when the NN has converged to a minimum, and therefore the change in the cost function can be described using only Hessian terms (second partial derivatives of the error function with respect to the weights) in the approximation of the change in error. Hassibi and Stork77 proposed calculating the saliency of a weight k as

    sk = wk^2 / (2 [H^−1]kk)    (6)

where H^−1 is the inverse of the Hessian matrix. Once the saliency of each weight in the NN is obtained, we use the sum of the saliencies of the weights connected to input variable i to determine the sensitivity Si of this variable:76

    Si = Σk sk    (7)

The saliency estimation method has already been used to optimise NN topology in multivariate calibration.68 It can lead to unstable results in situations where the assumptions made for the saliency estimation (small magnitude of the weights, training stopped when the training error is at a minimum) are not fulfilled.70
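Given the diagonal of the inverse Hessian, eqns. (6) and (7) reduce to a few lines of array arithmetic. The sketch below (our own naming; obtaining H^−1 itself is the expensive part and is not shown) illustrates the bookkeeping:

```python
import numpy as np

def saliencies(weights, h_inv_diag):
    """Eqn. (6): s_k = w_k^2 / (2 [H^-1]_kk) for each weight."""
    w = np.asarray(weights, dtype=float)
    return w ** 2 / (2.0 * np.asarray(h_inv_diag, dtype=float))

def input_sensitivity(sal, input_index_of_weight, n_inputs):
    """Eqn. (7): sum the saliencies of the weights fanning out of
    each input node to obtain that input's sensitivity S_i."""
    S = np.zeros(n_inputs)
    for s, i in zip(sal, input_index_of_weight):
        S[i] += s
    return S
```

The mapping `input_index_of_weight` simply records which input node each first-layer weight leaves from, so that eqn. (7) sums over the right subset of weights.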

Fig. 13 Example of a three-layer 4–3–1 NN with direct connections.

Two variance-based approaches for input variable sensitivity determination were proposed recently.70 They are designed for situations where the input variables are orthogonal, which is the case with PC scores. The methods are based on the estimation of the individual contribution of each input variable to the variance of the predicted response. In the first approach, this contribution is determined by partial modelling. First, the NN is trained to estimate the parameters of the model:

    y = f(x1, x2, …, xn)    (8)

After training, the sensitivity of each input variable xi is calculated as the variance of the response y(xi) predicted with the trained NN when all input variables except xi are set to zero:

    y(xi) = f(xi)    (9)

    Si = σ²y(xi)    (10)

In the second approach, the separate contribution of each input variable to the variance of the estimated response is derived from a variance propagation equation for non-linear combinations of variables. In the case of a two-variable model (x1, x2), this equation is

    σ²y = (∂y/∂x1)² σ²x1 + (∂y/∂x2)² σ²x2 + 2 (∂y/∂x1)(∂y/∂x2) COV(x1, x2)    (11)

Since PC scores are orthogonal, the covariance term can be neglected and the sensitivity of input variable xi is calculated as

    Si = (∂y/∂xi)² σ²xi    (12)
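The partial-modelling variant (eqns. 8–10) is the simpler of the two to sketch, since it needs only the trained model's prediction function. The code below (our own naming) clamps all inputs but one to zero and measures the variance of the resulting predictions:

```python
import numpy as np

def partial_modelling_sensitivity(predict, X):
    """Sensitivity of each input (eqns. 8-10): variance of the model
    response when all other inputs are clamped to zero.
    `predict` is any fitted model's prediction function."""
    n_samples, n_vars = X.shape
    sens = np.empty(n_vars)
    for i in range(n_vars):
        Xi = np.zeros_like(X)
        Xi[:, i] = X[:, i]          # keep only variable i
        sens[i] = np.var(predict(Xi))
    return sens
```

Because only the prediction function is queried, the same routine works unchanged for an NN, a PLS model or any other regressor, which is convenient when comparing sensitivity rankings across methods.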

Applying the chain rule several times, one obtains an analytical expression that allows one to determine Si at the end of training. The most interesting characteristic of these two variance-based methods (partial modelling and variance propagation) is that they give extremely stable results. When NNs with the same topology are trained with different sets of initial random weights, they can converge to different local minima on the error surface that are qualitatively equally good and close to each other. In that case the two variance-based methods give similar results, which is not always the case with Hinton diagrams or with the saliency estimation method.

Once the sensitivity of each input variable has been estimated, we recommend that one should first try to remove the variable with the lowest sensitivity and retrain the NN. If the monitoring error decreases after removing the flagged variable, it can be considered irrelevant for the model and permanently removed; otherwise it must be replaced and another flagged variable must be tentatively removed. Since parsimonious models should be preferred in multivariate calibration, we propose the following methodology for the stepwise elimination of input variables. Let ME(k) be the monitoring error at the kth trial and ME(k + 1) the monitoring error at the next trial, after removal of a flagged input variable. Then:

    If ME(k + 1) ≤ t × ME(k), then remove the flagged variable
    Else, replace the flagged variable and try to remove the next variable with the lowest sensitivity

Here t is a tolerance factor that can be adjusted to different values; we suggest t = 1.1. Increasing this factor will result in removing more input variables from the model, at the risk of losing some relevant sources of variance; t should not be lower than 1, otherwise the NN could have a poor generalisation ability.
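One pass of this elimination rule can be sketched as follows (a simplified version with our own naming: the candidate variables are assumed to be pre-sorted by increasing sensitivity, and `monitoring_error` stands for a retrain-and-evaluate step):

```python
def stepwise_eliminate(vars_by_sensitivity, monitoring_error, t=1.1):
    """One elimination pass: try removing variables least-relevant-first;
    keep a removal only if the retrained monitoring error satisfies
    ME(k+1) <= t * ME(k)."""
    current = list(vars_by_sensitivity)
    me = monitoring_error(current)
    for v in list(current):
        if len(current) == 1:
            break                      # never empty the input set
        trial = [u for u in current if u != v]
        trial_me = monitoring_error(trial)
        if trial_me <= t * me:         # the elimination rule from the text
            current, me = trial, trial_me
    return current
```

In practice each call to `monitoring_error` involves a full NN retraining, so the cost of a pass grows linearly with the number of candidate variables; this is why the sensitivity ranking is used to order the trials rather than testing all combinations.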

For a given set of input variables, the NN performance will also vary with the number of hidden nodes. Therefore, the optimisation of the number of input variables and of the number of hidden nodes should be performed in conjunction: at each step, one should optimise the number of input variables, then the number of hidden nodes, then optimise again the number of input variables, and proceed in this way until the monitoring error stops decreasing.

4.2.3 Number of hidden nodes. A study performed by Tetko et al.55 suggested a fairly wide tolerance of NNs to the number of hidden nodes, provided that overtraining is avoided with an external validation set. However, an upper bound on the number of hidden nodes is of the order of the number of training samples used.53 It was further proved that an NN with n sigmoidal hidden nodes can approximate the response of 2n − 1 samples.78 These results support the idea that it is not necessary to use large numbers of hidden nodes to fit complex multivariate relationships. On the contrary, large numbers of hidden nodes often accentuate the risk of overfitting.79

To circumvent the problems of overfitting and local minima trapping characteristic of complex networks, Jiang et al.66 proposed a recursive algorithm to add a reasonable number of hidden nodes to an already trained NN. The idea is that an augmented NN is capable of the same approximation as a smaller one, and convergence can be improved with additional hidden nodes. The augmented NN is trained with a modified genetic algorithm (MGA) instead of the usual back-propagation algorithm, to avoid local minima. However, the initial topology to be augmented remains to be determined.

    Conversely, Kanjilal and Banerjee80 presented a strategy forreducing the number of hidden nodes in an NN. The method isbased on orthogonalisation of the hidden layer output matrixwith singular value decomposition (SVD), after a crude

    convergence has been reached. Zhang et al.69 presented analgorithm based on a similar concept, that allows one to use allcalibration samples for NN training without need for amonitoring set. The initial postulate is that NNs with largenumbers of hidden nodes are relatively insensitive to initialconditions, but their generalisation ability is worse than NNswith a hidden layer of reduced size. The proposed schemeconsists of starting NN training with a deliberately large hiddenlayer until an arbitrarily low error is reached, then perform SVDon the hidden layer output matrix H:

    H_{k×h} = U_{k×k} S_{k×h} V^T_{h×h}   (13)

    where h is the number of hidden nodes and k the number of training samples. The number r of dominant singular values in the diagonal S matrix (determined by a variance ratio criterion) is considered as the number of hidden nodes necessary for the NN. A new NN is built, with only r (< k) hidden nodes, and the new initial weight matrices are determined by least squares fit so that the hidden layer output matrix is

    H′ = [U_1 U_2 … U_r]   (14)

    Training is then resumed on this pruned NN with improved generalisation ability.
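    The pruning scheme of eqns. (13) and (14) can be sketched numerically. The example below is a minimal illustration in Python/NumPy, not the authors' code: the 0.99 variance-ratio threshold and the toy hidden-output matrix are assumptions made for the sketch.

```python
import numpy as np

def estimate_hidden_nodes(H, variance_ratio=0.99):
    """Estimate how many hidden nodes the NN really needs from the
    hidden-layer output matrix H (k samples x h nodes): keep the r
    dominant singular values that explain `variance_ratio` of the
    total variance, as in eqn. (13)."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(explained, variance_ratio)) + 1
    # Target hidden-layer output of the pruned NN: the r leading
    # left singular vectors, as in eqn. (14).
    return r, U[:, :r]

# Toy hidden-output matrix: 6 hidden nodes whose outputs are linear
# combinations of only 2 underlying signals, i.e. effective rank 2.
theta = np.linspace(0.0, 2.0 * np.pi, 50)
base = np.column_stack([np.cos(theta), np.sin(theta)])
mix = np.array([[1.0, 0.0, 1.0, 1.0, 0.5, -1.0],
                [0.0, 1.0, 1.0, -1.0, 2.0, 0.5]])
H = base @ mix
r, H_pruned = estimate_hidden_nodes(H)
print(r)  # → 2: a 2-hidden-node NN suffices here
```

    A new NN with r hidden nodes would then be initialised by least squares so that its hidden-layer output reproduces `H_pruned`, before training is resumed.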

    We have studied the influence of the number of hidden nodes on the NN error on four non-linear NIR data sets, for which the optimum set of input variables (PC scores) had previously been identified. The first two data sets consist of diesel oil spectra with their corresponding values of viscosity and pour point (eight and four input variables, respectively). The third data set contains spectra of a polymer and the concentration of a mineral charge in this polymer as dependent variable (three input variables). The fourth data set contains spectra of gasoline samples and their corresponding octane numbers (thirteen input variables). The first three sets can be considered as strongly non-linear, whereas the last one is only slightly non-linear.32

    For each set, models with different numbers of hidden nodes have been designed. Each model was repeated five times to avoid chance correlations due to the random initialisation of the weights. Fig. 14 shows the evolution of average calibration error (CE), monitoring error (ME) and test error (TE) as a function of the number of hidden nodes in the NN, for each of the four data sets.
    For the three highly non-linear data sets [Fig. 14(a)–(c)], there is first a sharp decrease in error as the second and/or third hidden nodes are added to the model, whereas for the modelling of octane number [Fig. 14(d), slightly non-linear], the error

    172R   Analyst, 1998, 123, 157R–178R


    curves remain relatively flat between 1 and 20 hidden nodes. The high initial error values observed in Fig. 14(a)–(c) for one hidden node indicate a situation where the NN is not flexible enough to model highly non-linear relationships. The situation is equivalent to fitting a second- or third-order polynomial with a first-order model. One could think of simply selecting an arbitrarily large number of hidden nodes and keeping it constant, since the error curves in Fig. 14 remain stable for high numbers of hidden nodes. However, the test samples in these examples are all within the calibration domain. The situation changes significantly when NNs are used in extrapolation. For instance, in Fig. 15(a) the CE, ME and TE values are reported for the modelling of diesel oil viscosity, when the test set contains samples with extreme X values.
    The monitoring and test errors increase as more hidden nodes are added, in contrast to what was observed in Fig. 14(a). The main reason is that several samples that describe the non-linearity are now in the test set, and the calibration samples

    Fig. 14   Evolution of NN calibration, monitoring and test error as a function of the number of hidden nodes: a, viscosity data; b, pour point data; c, polymer data; d, gasoline data.

    Fig. 15   Evolution of calibration, monitoring and test errors as a function of the number of hidden nodes for viscosity data, when some test samples are outside calibration space: a, error; b, standard deviation of error.


    mainly describe the linear portion of the viscosity range. One hidden node is sufficient to fit the mild non-linearity present in the calibration set. The fit is slightly better if a second hidden node is added (lower CE), but we already start to overfit the training data, which leads to higher ME and TE values. The situation is now equivalent to fitting a first-order polynomial with a second- or third-order model. If we consider only the TE values, models with one or six hidden nodes give equivalent results, but the one hidden node model has the advantage of producing very stable results: Fig. 15(b) represents the standard deviation of errors on five trials with different initial sets of random weights. A model obtained with one hidden node is quasi-independent of the set of initial weights (standard deviation almost zero). As more hidden nodes are added, different sets of initial random weights can lead to different combinations of transfer functions to build empirical models.81 These models are generally equivalent within the calibration domain, but can lead to different results in extrapolation, as was seen in Fig. 4: when the number of hidden nodes was increased to six or nine, the calibration fit improved slightly but the performance in prediction degraded.

    We therefore recommend systematically reducing the number of hidden nodes as much as possible, in order to achieve simpler and more robust models. It is always a good idea to compare the performance of a one hidden node model with the performance of a more complex model, since many data sets in multivariate calibration are only slightly non-linear. The advantage of models with one hidden node is that the results they produce are stable and independent of the set of initial random weights.81 Moreover, a model with one hidden node reduces to a sigmoidal regression that can be easily interpreted. In an extrapolation calibration study,36 the prediction error of the NN on one data set was reduced by 50% by using one hidden node only.

    4.2.4 Transfer function. Kolmogorov's theorem states that an NN with linear combinations of n × (2n + 1) monotonically increasing non-linear functions of only one variable is able to fit any continuous function of n variables.82 The most commonly used non-linear transfer functions in the hidden layer are the sigmoid or hyperbolic tangent functions, which are bounded, easily differentiable and exhibit a linear-like portion in their centre, so that data sets that are only slightly non-linear can also be modelled (see Fig. 2). These two functions are popular because they allow one to fit a large number of non-linearities, but other functions can be tried. For instance, Gemperline et al.12 performed multivariate calibration with NNs on UV/VIS data using in their hidden layer combinations of linear, sigmoid, hyperbolic tangent and square functions, to accommodate different types of non-linear response in different spectral regions.
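    Two properties of these transfer functions mentioned above can be verified numerically: both are approximately linear near the origin, and the two are related by a simple rescaling. The following sketch (not from the paper) checks this:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: bounded in (0, 1) and easily differentiable."""
    return 1.0 / (1.0 + np.exp(-x))

# Near the origin both functions are almost linear, so slightly
# non-linear data can still be modelled; they are also related exactly
# by tanh(x) = 2*sigmoid(2x) - 1.
x = np.linspace(-0.1, 0.1, 5)
near_linear = np.allclose(np.tanh(x), x, atol=1e-3)
same_family = np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)
print(near_linear, same_family)  # → True True
```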

    The transfer function(s) in the output layer can be linear or non-linear. In many situations, if the number of hidden nodes is sufficient, all modelling is done in the hidden layer. It was observed that in some situations where data were mainly linear, non-linear output transfer functions could introduce distortion in the predicted responses,16 as illustrated in Fig. 3(a). If a linear output transfer function is used, any linear node in the hidden layer can be replaced with a direct connection between the input and output layers (because two successive linear transformations can be reduced to a single one), which reduces the number of adjustable parameters in the NN.
    The safest procedure is to try both types of output transfer functions (linear and non-linear) during topology optimisation and to base the decision on the shape of residuals for models constructed with the same input variables.

    4.3 Training of the network

    4.3.1 Learning algorithms. Two general modes of learning can be distinguished: incremental learning and batch learning. Incremental learning consists of successively updating the weights in the NN after estimating the error associated with the response predicted for each sample, presented in a random order. In the batch learning mode, the errors of all training samples over each iteration are first summed and the parameters are adjusted with respect to this sum. The former approach has the advantage that it superimposes a stochastic component on the weight update. This can help the NN escape from local minima on the error surface in the hyperspace of the weights. A drawback is that the method is prone to the phenomenon of thrashing: the NN can take successive steps in opposite directions that may slow learning. Batch learning provides a more accurate estimate of the gradient vector4 and faster convergence, but it also requires more memory storage capacity. The relative efficiency of both approaches is usually data set dependent. The incremental approach seems particularly suited for very homogeneous training sets21 or for on-line process control applications4 where the composition of the training set is constantly modified.
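    The two learning modes can be contrasted on a toy linear neuron. This is an illustrative sketch only: the learning rates and data are arbitrary, and a real NN would back-propagate the error through non-linear hidden layers rather than fit a linear model.

```python
import numpy as np

def batch_epoch(w, X, y, lr):
    """Batch mode: sum the errors of all training samples, then make
    a single weight update per iteration."""
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def incremental_epoch(w, X, y, lr, rng):
    """Incremental mode: update the weights after each sample,
    presented in random order (adds a stochastic component)."""
    for i in rng.permutation(len(y)):
        err = X[i] @ w - y[i]
        w = w - lr * err * X[i]
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noise-free, so both modes can converge exactly
w_batch = np.zeros(3)
w_incr = np.zeros(3)
for _ in range(200):
    w_batch = batch_epoch(w_batch, X, y, lr=0.1)
    w_incr = incremental_epoch(w_incr, X, y, lr=0.01, rng=rng)
print(np.allclose(w_batch, w_true, atol=1e-2),
      np.allclose(w_incr, w_true, atol=1e-2))  # → True True
```

    Note that the incremental mode makes one small update per sample, so its per-sample learning rate is kept smaller than the batch rate.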

    Training an NN is an optimisation problem, and several methods are available for this task. It is not possible to review in detail all algorithms available, but the main types of algorithms will be summarised and their particularities outlined.

    The gradient descent algorithm performs a steepest-descent minimisation on the error surface in the adjustable parameters

    Fig. 16   Detection of representativity problems between training and monitoring set on r.m.s. error curves: a, lack of representativity; b, chance correlation with initial set of weights.


    hyperspace. This algorithm was described and popularised by Rumelhart and McClelland83 in 1986. The excessively slow convergence of the basic algorithm and its tendency to become trapped in the numerous local minima of the error surface triggered the need for improvements, such as the addition of a momentum term in the weight update, which allows one to smooth the error surface and to attenuate oscillations in the bottom of steep valleys. The speed of the algorithm can be significantly enhanced by using adaptive parameters (learning rate and momentum rate) for each weight in the NN. This is the basis of the delta-bar-delta84 and extended delta-bar-delta85 algorithms, which have been successfully applied in multivariate calibration.30
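    The momentum modification can be sketched as follows (an illustrative example with arbitrary values; an ill-conditioned quadratic stands in for a "steep valley" on the error surface):

```python
import numpy as np

def momentum_step(w, velocity, grad, lr, mu):
    """Gradient-descent update with a momentum term: a fraction `mu` of
    the previous update is retained, attenuating oscillations across a
    steep valley while accelerating along its floor."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# Badly conditioned quadratic cost f(w) = 0.5 * w' A w, whose error
# surface is a narrow valley (curvatures differ by a factor of 50).
A = np.diag([1.0, 50.0])
w = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(500):
    w, v = momentum_step(w, v, A @ w, lr=0.02, mu=0.9)
print(np.linalg.norm(w) < 1e-6)  # → True: converged to the minimum
```

    Without the momentum term, the same learning rate would make the iterates oscillate heavily along the steep direction.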

    Faster convergence can be reached with second-order optimisation methods, based on the determination or approximation of the Hessian matrix of partial second derivatives of the cost function: these methods typically have a convergence time one order of magnitude smaller than the gradient method or its derivatives. In the Newton–Raphson method, the Hessian matrix is used to adjust the descent direction at each step, and convergence is reached in a single step if the error surface is quadratic, with ellipsoidal contours. Currently, one of the most popular and efficient second-order methods for NN training is the Levenberg–Marquardt algorithm,8 which is a compromise between gradient descent and Newton–Raphson optimisation. At each step, an adaptive parameter allows the algorithm to transit smoothly between the gradient direction and the Newton–Raphson direction. The inverse Hessian matrix is only estimated and iteratively updated to avoid tedious calculations. Applications of this algorithm for NN training in multivariate calibration have recently been reported.32,68,70,79 Conjugate gradient optimisation is an alternative second-order technique that also uses the Hessian matrix, but the algorithm is formulated in such a way that the estimation and storage of the Hessian matrix are completely avoided.8 With conjugate gradient optimisation, each new search direction is chosen so as to spoil as little as possible the minimisation achieved by the previous one, in contrast to the winding trajectory observed with the gradient method. This method is guaranteed to locate the minimum of any quadratic function of n variables in at most n steps.

    Genetic algorithms (GA) have been used for NN training.66,86 This global search method allows one to overcome the problem of becoming trapped in local minima, but at the expense of a long computing time, because each individual in the population represents a different NN model. In addition, a number of parameters must be set to define the population size and evolution mode, and therefore this approach cannot be easily implemented.

    Random optimisation consists of taking successive random steps in the weight space and discarding all steps that do not reduce the cost function. In contrast to the classical back-propagation algorithm, random search is guaranteed to find a global minimum,87 but the computation time is so high that the method is never used in practice. Instead, GA or random optimisation can be used as preliminary techniques to optimise the initial set of weights in the NN, then the training is continued with a back-propagation-based method.

    4.3.2 When to stop training. As mentioned previously, a monitoring set has to be used in order to reduce the tendency of NNs to overtrain and therefore overfit the training data. The evolution of the monitoring error must be followed during training. The frequency of monitoring error estimation has to be determined by the user; ideally, it should be performed after each iteration. Consecutive monitoring error values are stored in a vector, and several criteria can be applied to retain the optimum set of weights: train the NN for a pre-defined large number of iterations and retain the set of weights corresponding to the minimum of the monitoring error curve; stop training and retain the last set of weights as soon as the monitoring error is below a pre-specified threshold; or stop training and retain the last set of weights as soon as the decrement between two successive monitoring errors is below a pre-specified threshold.
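    The first of these criteria, retaining the weights at the minimum of the monitoring-error curve, reduces to a simple search over the recorded history. A schematic sketch (the weight entries are stand-in labels, not real weight vectors):

```python
def best_weights_by_monitoring(history):
    """Return the record at the minimum of the monitoring-error curve:
    the first stopping criterion described in the text."""
    return min(history, key=lambda record: record[1])

# Hypothetical monitoring curve that first decreases, then rises again
# as the NN starts to overtrain; records are (iteration, error, weights).
history = [(i, (i - 40) ** 2 / 1000 + 0.05, f"w_{i}") for i in range(0, 100, 10)]
iteration, error, weights = best_weights_by_monitoring(history)
print(iteration, weights)  # → 40 w_40
```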

    One must also check that the training error is reasonably low at the number of iterations retained, and that representativity between the training and the monitoring set is ensured. A lack of representativity between training and monitoring sets can be detected when the r.m.s. error curves for both sets are separated by a large gap in the region where they flatten, as shown in Fig. 16(a).3,88

    Alternatively, it is possible that the optimum monitoring error is reached while the training error is still relatively high [Fig.

    Fig. 17   Visualisation of sample repartitions on hidden node (hn) output maps for ICP data: a, hn1–hn2; b, hn1–hn3; c, hn2–hn3.


    16(b)]. This can be due to chance correlation, for instance when the initial set of random weights brings the model near a local minimum on the monitoring error surface. Chauvin89 demonstrated that in NNs with complex architectures, late validation minima could sometimes be deeper than the first local minimum. In both cases (large gap between monitoring and training error curves, or early minimum for monitoring), a different splitting of data between the two subsets should be considered.

    The sensitivity of the NN solution to initial conditions is a well known issue that was discussed by Kolen and Pollack.81 To overcome effects due to chance correlation, several trials must be performed with different sets of initial random weights.55 At least five trials are recommended. The topology corresponding to the lowest average monitoring error should be retained, provided that the variability of predictions is not significantly higher than with other topologies. Once the topology has been established, any set of weights leading to an acceptable monitoring error can be retained for the final model. It is recommended, however, to test it against a validation set, if available, before performing predictions on unknown samples.
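    The multi-trial procedure can be outlined as follows. This is a schematic sketch: `train_and_monitor` is a hypothetical stand-in for a full training run that returns the monitoring error of one trained network.

```python
import numpy as np

def evaluate_topology(train_and_monitor, n_trials=5, seed=0):
    """Train the same topology `n_trials` times from different random
    initial weights and report the mean and standard deviation of the
    monitoring error, so a single lucky or unlucky initialisation does
    not decide the topology."""
    rng = np.random.default_rng(seed)
    errors = [train_and_monitor(rng) for _ in range(n_trials)]
    return float(np.mean(errors)), float(np.std(errors))

# Stand-in for a real training run: a monitoring error of about 0.10
# with some initialisation-dependent scatter.
mean_err, std_err = evaluate_topology(lambda rng: 0.10 + 0.01 * rng.standard_normal())
print(mean_err, std_err)
```

    The topology with the lowest mean error is retained, unless its standard deviation is markedly worse than that of the alternatives.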

    Some approaches have been presented that avoid the need for a monitoring set, such as the method based on hidden node pruning presented in Section 4.2.3.69,80 Since no overfitting is observed in the later stage of training with this approach, it is claimed that no monitoring set is necessary. This seems particularly attractive for situations where the number of calibration samples is low. In practice, we found that the method gave very good results when no particular overfitting problem was observed with a classical NN, but in situations where we had difficulty with a classical NN, a monitoring set was also necessary with the hidden node pruning approach.

    4.3.3 Model interpretation. NNs have more to offer than a simple empirical model. The sensitivity plots that we have presented earlier describe the relative influence of the different input variables in the final model. In addition, examination of the projection of the samples on the hidden nodes of the NN is often informative.37 We built a calibration model for the quantitative analysis of traces of lead in water, using inductively coupled plasma atomic emission spectrometry (ICP-AES) data as input (14 descriptors). At the end of training, if we display the activation of hidden nodes versus each other, we obtain plots comparable to score plots (Fig. 17). The five measurement replicates marked with asterisks are easily identified as probable outliers. Such plots are instructive and also allow visualisation of clusters present in the data, but they are rarely used. When data must first be compressed, visualisation is performed on the scores before modelling instead.

    We displayed in Fig. 18(a)–(c) the activation of the three hidden nodes at the end of training for the ICP-AES data NN model. Fig. 18(d) and (e) show the activation of the two hidden nodes in the non-linear model for polymer charge concentration. To estimate the relative importance of each hidden node in the final model, we have reported in parentheses the magnitude of the weight between this hidden node and the output node. This is possible because all hidden nodes are connected to one output node only. Therefore, the magnitudes of the connecting weights can be compared directly, which is not the case for weights connected to input nodes.

    The activation of hidden nodes for ICP-AES data indicates that this data set is mainly linear, whereas the transfer functions

    Fig. 18   Visualisation of hidden node activations: a, ICP data, hn1, w = −0.36; b, ICP data, hn2, w = −0.54; c, ICP data, hn3, w = 0.60; d, polymer data, hn1, w = −0.12; e, polymer data, hn2, w = 0.33.


    for the modelling of polymer data are activated in their strongly non-linear portion. Thus we obtain information on the degree of non-linearity of a given data set, even when the exact form of the model is unknown.
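    The kind of diagnostic used in Figs. 17 and 18 only requires a forward pass up to the hidden layer. A minimal sketch (with illustrative random weights, not the trained ICP-AES model):

```python
import numpy as np

def hidden_activations(X, W_in, b_in):
    """Forward pass up to the hidden layer of a one-hidden-layer MLP with
    sigmoid transfer functions; the columns can be plotted against each
    other like score plots."""
    return 1.0 / (1.0 + np.exp(-(X @ W_in + b_in)))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 14))            # e.g. 14 ICP-AES descriptors
W_in = 0.1 * rng.normal(size=(14, 3))    # 3 hidden nodes (illustrative weights)
b_in = np.zeros(3)
H = hidden_activations(X, W_in, b_in)
print(H.shape)  # → (30, 3): one activation per sample and hidden node
```

    Activations clustered in the near-linear centre of the sigmoid (around 0.5) suggest a mainly linear data set; activations pushed towards 0 or 1 indicate that the non-linear portions of the transfer functions are being used.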

    Recently, several groups have investigated the assessment of statistical confidence intervals for predictions with NNs. Dathe and Otto72 derived confidence intervals using the bootstrap method. After finding the optimum topology of the NN, they erase a portion of the calibration matrix and randomly fill it with replicate samples from the remaining portion. An arbitrary number n_sets of calibration matrices is created, and n_sets models are built with the pre-defined topology. An external test set is used to predict the responses with each of the bootstrapped NN models, and standard deviations of predicted responses can be calculated. Derks and Buydens90 also worked on the calculation of confidence intervals and compared three forms of bootstrapping. The advantage of the bootstrap approach is that the derived confidence intervals contain all sources of variability (experimental noise, model errors, effect of different sets of random weights), thus yielding a worst-case estimation. The drawback is that the derived confidence intervals correspond to an NN topology, not to a single model with a fixed set of weights.
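    A minimal sketch of such a bootstrap procedure is given below. This is not the authors' implementation: an ordinary least squares model stands in for the NN so that the example stays short and self-contained, and `fit_predict` is a hypothetical interface for one training-and-prediction run.

```python
import numpy as np

def bootstrap_prediction_sd(fit_predict, X_cal, y_cal, X_test, n_sets=50, seed=0):
    """Resample the calibration set with replacement `n_sets` times,
    refit a model of fixed topology each time, and return the standard
    deviation of the test-set predictions over the resampled models."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_sets):
        idx = rng.integers(0, len(y_cal), size=len(y_cal))
        preds.append(fit_predict(X_cal[idx], y_cal[idx], X_test))
    return np.std(np.array(preds), axis=0)

def ols_fit_predict(Xc, yc, Xt):
    """Stand-in model: ordinary least squares instead of an NN."""
    w, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    return Xt @ w

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=40)
sd = bootstrap_prediction_sd(ols_fit_predict, X, y, X[:5])
print(sd.shape)  # → (5,): one standard deviation per test sample
```

    For an NN, `fit_predict` would retrain the fixed topology (possibly from a new random initialisation) on each bootstrapped calibration set, so the resulting standard deviations also capture initialisation variability.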

    5 Conclusions

    As is often the case in chemometrics, data pre-treatment and presentation (number of samples, detection of outliers, data compression and splitting) are critical issues that should not be overlooked. Experience has proved that several failures of NNs for modelling were indeed due to inappropriate problem formulation. Such issues can be circumvented by focusing on prior model identification, in particular the detection of non-linearity. Proper a priori non-linearity detection is one of the major difficulties, and existing methods often fail in the presence of outliers.

    NNs should become part of the standard toolkit of analytical chemists concerned with multivariate calibration, but it is important to have a clear understanding of their capabilities and limitations. One should not consider NNs as black boxes, but as regression models whose flexibility will depend on the topology defined by the user. In recent years, numerous research efforts have been focused on improving the speed of algorithms used for NN training. With the availability of faster personal computers, the emphasis is no longer on the speed of algorithms but rather on the development of tools to ease topology optimisation, visualisation and model interpretation.

    The design of an optimum topology is certainly critical and time consuming, but this is true also for the optimisation of parameters for other methods (form of the model in polynomial PCR or PLS, complexity of soft models, number of nearest neighbours in LWR, variables to retain/eliminate in methods based on feature selection/elimination), although it is less emphasised. Moreover, the comment that NNs do not allow inference is somewhat unfair. Some simple plots can provide information on the nature and form of the problem tackled and on the presence of possible clusters or outliers.

    Several recent research efforts have aimed at combining the flexibility and auto-adaptive ability of NNs with the superior interpretability and inference capability of PLS models.91–94 So far, it seems that these methods also combine the pitfalls of both approaches, and their application generally requires an optimisation of a large number of parameters. Radial basis function (RBF) networks offer interesting alternatives to MLP in the sense that they allow local training, and the final models can be interpreted in terms of logical rules.38,53,95 Another approach to gain insight into a complex problem is to combine the use of classical MLP (for prediction) with counter-propagation NNs to obtain contour plots of the input and output variables.60,61

    6 Acknowledgements

    The authors are grateful to Vita Centner and Frederic Estienne for fruitful discussions. This work received financial support from the European Commission (SMT Programme contract SMT4-CT95-2031) and the Fonds voor Wetenschappelijk Onderzoek (FWO, Fund for Scientific Research).

    7 References

    1 J. Zupan and J. Gasteiger, Anal. Chim. Acta, 1991, 248, 1.
    2 S. D. Brown, S. T. Sum, F. Despagne and B. K. Lavine, Anal. Chem., 1996, 68, 21R.
    3 J. R. M. Smits, W. J. Melssen, L. M. C. Buydens and G. Kateman, Chemom. Intell. Lab. Syst., 1992, 22, 165.
    4 D. Svozil, V. Kvasnicka and J. Pospichal, Chemom. Intell. Lab. Syst., 1997, 39, 43.
    5 D. A. Cirovic, Trends Anal. Chem., 1997, 16, 148.
    6 M. Bos, A. Bos and W. E. van der Linden, Analyst, 1993, 118, 323.
    7 S. Geman, E. Bienenstock and R. Doursat, Neural Comput., 1992, 4, 1.
    8 R. Fletcher, Practical Methods of Optimisation, Vol. 1: Unconstrained Optimisation, Wiley, New York, 1980.
    9 K. Hornik, M. Stinchcombe and H. White, Neural Networks, 1989, 2, 359.
    10 E. Thomas, Anal. Chem., 1994, 66, 795A.
    11 C. E. Miller, NIR News, 1993, 4, 3.
    12 P. J. Gemperline, J. R. Long and V. G. Gregoriou, Anal. Chem., 1991, 63, 2313.
    13 M. S. Danhoa, S. J. Lister, R. Sanderson and R. J. Barnes, J. Near Infrared Spectrosc., 1994, 2, 43.
    14 J. A. van Leeuwen, R. J. Jonker and R. Gill, Chemom. Intell. Lab. Syst., 1994, 25, 325.
    15 F. Wulfert, W. T. Kok and A. K. Smilde, Anal. Chem., 1998, 70, 1761.
    16 R. Goodacre, M. J. Neal and D. B. Kell, Anal. Chem., 1994, 66, 1070.
    17 R. Goodacre, Appl. Spectrosc., 1997, 51, 1144.
    18 S. R. Amendolia, A. Doppiu, M. L. Ganadu and G. Lubinu, Anal. Chem., 1998, 70, 1249.
    19 J. R. Long, V. G. Gregoriou and P. J. Gemperline, Anal. Chem., 1990, 62, 1791.
    20 T. J. Sejnowski and C. R. Rosenberg, Complex Syst., 1987, 1, 145.
    21 J. Hertz, A. Krogh and R. Palmer, Introduction to the Theory of Neural Computation, Addison Wesley, Redwood City, CA, 1991.
    22 S. Biswas and S. Venkatesh, in Advances in Neural Information Processing Systems, ed. R. P. Lippmann, J. E. Moody and D. S. Touretzky, Morgan Kaufmann, San Mateo, CA, 1991, Vol. III.
    23 B. Hitzmann, A. Ritzka, R. Ulber, T. Scheper and K. Schugerl, Anal. Chim. Acta, 1997, 348, 135.
    24 C. Borggaard and H. H. Thodberg, Anal. Chem., 1992, 64, 545.
    25 T. Naes, K. Kvaal, T. Isaksson and C. Miller, J. Near Infrared Spectrosc., 1993, 1, 1.
    26 J. Verdu-Andres, D. L. Massart,