Wavelength Selection for Multivariate Calibration Using Tikhonov Regularization

FORREST STOUT, JOHN H. KALIVAS,* and KÁROLY HÉBERGER

Department of Chemistry, Idaho State University, Pocatello, Idaho 83209 (F.S. and J.H.K.); and Institute of Chemistry, Chemical Research Center, Hungarian Academy of Sciences, PO Box 17, H-1525, Budapest, Hungary (K.H.)

Received 15 September 2006; accepted 3 November 2006.
* Author to whom correspondence should be sent. E-mail: kalijohn@isu.edu.
Volume 61, Number 1, 2007 APPLIED SPECTROSCOPY 85. 0003-7028/07/6101-0085$2.00/0 © 2007 Society for Applied Spectroscopy

Prediction of sample properties using spectroscopic data with multivariate calibration is often enhanced by wavelength selection. This paper reports on a built-in wavelength selection method in which the estimated regression vector contains zero to near-zero coefficients for undesirable wavelengths. The method is based on Tikhonov regularization with the model 1-norm (TR1) and is applied to simulated and near-infrared (NIR) spectral data. Models are also formed from wavelength subsets determined by the standard method of stepwise regression (SWR). Harmonious (bias/variance tradeoff) and parsimonious considerations are compared with and without wavelength selection for principal component regression (PCR), ridge regression (RR), partial least squares (PLS), and multiple linear regression (MLR). Results show that TR1 models generally contain large baseline regions of near-zero coefficients, thereby essentially achieving built-in wavelength selection. For example, wavelengths with spectral interferences and/or poor signal-to-noise ratios obtain near-zero regression coefficients. Results often improve with TR1 models, compared to full wavelength PCR, RR, and PLS models. The SWR subset results are similar to those for the TR1 models using the NIR data and worse with the simulated spectral situations. In general, wavelength selection improves prediction accuracy at the cost of a potential increase in variance, while parsimony remains nearly equivalent to that of full wavelength models. New insights gained from the reported studies provide useful guidelines on when to use full wavelengths or use wavelength selection methods. Specifically, when a small number of large wavelength effects (good sensitivity and selectivity) exist, subset selection by SWR (with caution) and TR1 do well. With a small to moderate number of large to moderate sized wavelength effects, TR1 is better. Lastly, when a large number of small effects are present, full wavelengths with the methods of PCR, RR, or PLS are best.

Index Headings: Wavelength selection; Multivariate calibration; Tikhonov regularization.

INTRODUCTION

Spectroscopic data can be quantitatively analyzed using the multivariate calibration model:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} \qquad (1)$$

where X contains calibration spectra for m samples measured at w wavelengths, b is the $w \times 1$ regression vector, the $m \times 1$ vector y holds the quantitative information for the analyte, such as concentration, and e is an $m \times 1$ vector representing random error. As shown in Eq. 1, b contains coefficients determining how each respective wavelength is used to predict y ($\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{b}}$). For example, a large positive coefficient implies that the corresponding wavelength has a large positive effect in predicting y. However, this does not imply that the wavelength should be kept, as the noise associated with this wavelength could be large, i.e., a poor signal-to-noise ratio. The model regression vector is commonly estimated by $\hat{\mathbf{b}} = \mathbf{X}^{+}\mathbf{y}$, with $\mathbf{X}^{+}$ being a generalized inverse of X. The generalized inverse can be estimated by a number of modeling methods, including principal component regression (PCR), ridge regression (RR), partial least squares (PLS), multiple linear regression (MLR), etc.1,2

In an effort to improve model quality, wavelengths (columns) are often selected from X prior to estimating the regression vector. Common methods for accomplishing wavelength selection are forward selection, stepwise regression (SWR), genetic algorithms, simulated annealing, etc.3–6 Such methods often require arbitrary user-set optimization parameters. Altering these complicated parameters for a specific algorithm leads to different wavelength subsets. This dilemma is the result of selecting wavelengths with chance correlations.7–9 The greater the ratio of the number of wavelengths to the number of samples, the greater the prospect of the dilemma occurring.10 Additionally, many wavelength selection algorithms3–6 only use prediction bias information in selecting wavelengths, i.e., sequential variation of wavelength subsets followed by model determination using PCR, RR, PLS, MLR, etc., and then computation of an accuracy (bias) diagnostic for model comparisons and final selection of the wavelength subset. Optimization of such a criterion based on only using calibration data, e.g., root mean square error of calibration (RMSEC), results in over-fitting and poor predictions for a new X. Model optimization using only validation data, e.g., root mean square error of validation (RMSEV), makes the validation set part of the modeling process and over-fitting again results. These problems can be resolved by including a prediction variance penalty.

In linear situations, prediction bias and variance are complementary measures in the sense that a decrease in bias results in an increase in variance.11 Thus, both issues need to be examined when determining the best $\hat{\mathbf{b}}$ with wavelength selection, i.e., the model with the proper bias/variance tradeoff (harmonious model) should be sought. By doing so, there is less chance of obtaining an over- or under-fitted model. The method of Tikhonov regularization (TR) described in the next section is based on simultaneous optimization of prediction bias and variance measures.

An alternative to wavelength selection is a built-in wavelength selection approach in which full wavelength regression vectors are obtained with zero or near-zero coefficients for wavelengths with poor predictive relationship to the analyte due to spectral interference, low signal-to-noise ratio, etc. If regression vector coefficients have values of near zero, the corresponding wavelengths have virtually no impact on prediction. Built-in wavelength selection with harmonious considerations has previously been achieved using TR in conjunction with knowledge of the spectral noise structure.12 However, the noise structure is often not known. This study aims to use a different TR approach to perform



built-in wavelength selection on simulated and real spectral data in which the noise structure is not required, although it could also be used if it is known. The approach is accomplished by optimizing the RMSEC simultaneously with the regression vector 1-norm in TR (TR1), rather than the often optimized 2-norm in TR, thereby circumventing the pitfall of using only bias information to obtain the model as well as avoiding empirical determination of optimization parameters required in other wavelength selection algorithms. Thus, characteristic to TR1 is wavelength selection with simultaneous estimation of model parameters compared to the usual approach of sequential variation of wavelengths followed by model determination with PCR, RR, PLS, MLR, etc. However, all forms of TR necessitate using an optimal meta-parameter similar to the number of basis vectors meta-parameter (number of latent vectors, factors, etc.) for PCR and PLS or the ridge value meta-parameter for RR. To avoid determining the TR1 meta-parameter directly, this study uses Pareto L-curve plots of model size against bias indicators.13–15 A proper model is then selected from the Pareto corner region (close to the origin) of the L-shaped curve representing an acceptable trade-off between plotted criteria. A model from this region has key wavelengths that are selected and built into the regression vector and unnecessary wavelengths have regression coefficients of zero to near-zero values.

Another aspect of wavelength selection specifically investigated in this paper is that of obtaining the parsimonious model: the most fully reduced model with the most information. Parsimony is a measure of the degrees of freedom used to fit the model. It is commonly thought that a model with fewer wavelengths is more parsimonious. However, such inter-model assessments become difficult when MLR models based on selected wavelengths are compared to models from projection methods such as PCR, RR, or PLS using full wavelengths or selected wavelengths. By using the effective rank (ER) based on the fitted degrees of freedom, inter-model parsimony evaluations can be accomplished.15–18

In addition to results from full wavelength PCR, RR, and PLS models, results from built-in wavelength selection with TR1 are compared to those from the well-established standard SWR wavelength selection method. In this case, the TR1 models are compared to MLR and PLS models built from the SWR selected wavelengths.

It should be noted that once a TR1 regression vector is formed, a TR1 wavelength subset could be obtained from that regression vector. A near-zero empirical cutoff is needed to judge each coefficient as either zero (coefficient magnitude below the cutoff) or important (coefficient magnitude above the cutoff). Any wavelengths associated with zero coefficients would be removed from X and the remaining wavelengths constitute the wavelength subset, which can then be modeled by PCR, RR, PLS, etc. However, this method would require a series of studies empirically setting the near-zero cutoff value and, hence, results from this approach are not reported in this paper.
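The cutoff-based subset extraction described above can be sketched in a few lines. This is only an illustration: the regression vector values, the function name `select_wavelengths`, and the cutoff of 0.05 are all hypothetical, since the paper does not report a cutoff.

```python
def select_wavelengths(b, cutoff):
    """Return indices of wavelengths whose |coefficient| exceeds the cutoff."""
    return [i for i, coef in enumerate(b) if abs(coef) > cutoff]


# Hypothetical TR1 regression vector over six wavelengths (not from the paper).
b_tr1 = [0.001, -0.8, 0.02, 1.3, 0.0, -0.04]

# Hypothetical near-zero cutoff; in practice it would have to be set empirically.
kept = select_wavelengths(b_tr1, cutoff=0.05)
print(kept)
```

The retained columns of X would then be passed to PCR, RR, or PLS; the result depends directly on the chosen cutoff, which is the empirical burden the text refers to.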

All model comparisons involve calculated bias indicators (RMSEC, RMSEV, and corresponding $R^2$ values), prediction variance indicators (the model 2-norm $\|\mathbf{b}\|_2$ and mean prediction standard deviations $\sigma_y$ from Monte Carlo simulations), and model parsimony (ER).

METHODS

Tikhonov Regularization. Regression vectors for the standard multivariate calibration model in Eq. 1 can be estimated through generalization of the Tikhonov regularization (TR), expressed as:

$$\min \left( \|\mathbf{X}\mathbf{b} - \mathbf{y}\|_a^a + \lambda \|\mathbf{L}\mathbf{b}\|_b^b \right) \qquad (2)$$

where $\|\cdot\|_p$ signifies the vector p-norm, e.g., $p = 2$ is the 2-norm or Euclidean norm, a and b represent the same or different norms in the range $1 \le a, b < \infty$, L denotes a regularization operator that enforces the estimate of the regression vector to belong to corresponding subspaces of well-behaved functions, and $\lambda$ symbolizes the regularization meta-parameter controlling the weight given to the second term relative to the first term, which is the usual single least squares criterion.13,19 The left term has been labeled a bias (accuracy) indicator and the right term reflects the model size and, hence, when $b = 2$, the 2-norm acts as a variance (precision) measure.15 Choices for L are varied, e.g., the identity matrix I, derivative operators for smoothing, spectral error covariance matrix for wavelength selection, instrumental differences for calibration transfer, or spectra of known and/or new interferents.12,13,20,21
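To make the two terms of Eq. 2 concrete, the following minimal sketch evaluates the objective for a given regression vector. The function and argument names (`tikhonov_objective`, `a_norm`, `b_norm`) and the tiny data set are assumptions made for illustration; the sketch only evaluates the criterion and does not minimize it.

```python
def p_norm_pow(v, p):
    """Return ||v||_p^p, i.e. the sum of |v_i|**p."""
    return sum(abs(x) ** p for x in v)


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def tikhonov_objective(X, y, b, lam, a_norm=2, b_norm=2, L=None):
    """Evaluate ||Xb - y||_a^a + lam * ||Lb||_b^b; L is the identity when None."""
    residual = [fit - obs for fit, obs in zip(matvec(X, b), y)]
    Lb = b if L is None else matvec(L, b)
    return p_norm_pow(residual, a_norm) + lam * p_norm_pow(Lb, b_norm)


# Hypothetical three-sample, two-wavelength example (not from the paper).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
b = [1.0, 2.0]  # fits y exactly, so only the penalty term contributes

rr_value = tikhonov_objective(X, y, b, lam=0.5, a_norm=2, b_norm=2)   # RR penalty
tr1_value = tikhonov_objective(X, y, b, lam=0.5, a_norm=2, b_norm=1)  # 1-norm penalty
print(rr_value, tr1_value)
```

With a zero residual, the two calls differ only through the penalty: the squared 2-norm of b is 5 and the 1-norm is 3, each weighted by the same meta-parameter.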

When $\mathbf{L} = \mathbf{I}$ and $a = b = 2$, TR is said to be in standard form and is also known as RR. For a fixed set of wavelengths, it has been shown that the regression vector 2-norm $\|\mathbf{b}\|_2$ in Eq. 2 is proportional to the prediction variance.15,22–25 Therefore, in the case of TR in standard form, optimization of Eq. 2 is concerned with simultaneous minimization of bias and variance indicators. Such an optimization seeks the most harmonious model, one that is the most Pareto (closest to the origin) in an L-curve plot of the regression vector 2-norm against RMSEC. That is, to select the right $\lambda$ value in Eq. 2, a host of models can be formed by varying $\lambda$ and then $\|\mathbf{b}\|_2$, or some other variance indicator, is plotted against RMSEC, $\|\hat{\mathbf{y}} - \mathbf{y}\|_2$, or another bias measure. In such a plot, an L-shaped curve results and the optimal model ($\lambda$) is in the corner of the L-curve near the origin.13,26 This model represents the best compromise for the bias/variance tradeoff, i.e., the most harmonious model. The L-curve (harmonious, Pareto) plot is also applicable to selecting other model meta-parameters such as the number of factors for PCR or PLS. In general, the standard form of TR (RR) L-curves and regression vectors are similar to those found by PCR and PLS.12,14
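The L-curve construction for TR in standard form (RR) can be illustrated as follows. This is a sketch under stated assumptions: the two-wavelength data are hypothetical, the ridge solution uses the closed form $(\mathbf{X}^t\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^t\mathbf{y}$ written out for two columns, and picking the range-scaled point nearest the origin is just one simple way to locate the corner, not necessarily the authors' procedure.

```python
import math


def ridge_two_wavelengths(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^-1 X'y for a two-column X."""
    s11 = sum(r[0] * r[0] for r in X) + lam
    s22 = sum(r[1] * r[1] for r in X) + lam
    s12 = sum(r[0] * r[1] for r in X)
    g1 = sum(r[0] * t for r, t in zip(X, y))
    g2 = sum(r[1] * t for r, t in zip(X, y))
    det = s11 * s22 - s12 * s12
    return [(s22 * g1 - s12 * g2) / det, (s11 * g2 - s12 * g1) / det]


def rmsec(X, y, b):
    """Root mean square error of calibration for regression vector b."""
    res = [t - (r[0] * b[0] + r[1] * b[1]) for r, t in zip(X, y)]
    return math.sqrt(sum(e * e for e in res) / len(y))


# Hypothetical, nearly collinear two-wavelength calibration data.
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
y = [1.0, 2.1, 2.9, 4.2, 5.0]

# One model per ridge value: (RMSEC, ||b||_2, lambda).
curve = []
for lam in [10.0 ** k for k in range(-4, 3)]:
    b = ridge_two_wavelengths(X, y, lam)
    curve.append((rmsec(X, y, b), math.hypot(b[0], b[1]), lam))

# Range-scale both axes by their maxima and take the point nearest the origin.
max_rmsec = max(c[0] for c in curve)
max_norm = max(c[1] for c in curve)
corner = min(curve, key=lambda c: math.hypot(c[0] / max_rmsec, c[1] / max_norm))
print("corner lambda:", corner[2])
```

As the ridge value grows, RMSEC rises while the 2-norm falls, which is what produces the L shape when one is plotted against the other.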

As noted previously, optimization of a bias criterion based on only using calibration data (RMSEC) or validation data (RMSEV) results in over-fitting and, respectively, poor predictions for a new X are obtained or the validation set becomes part of the modeling process. These problems are uniquely resolved by TR, which simultaneously optimizes prediction variance (a penalty for over- or under-fitting) with a bias measure.

In a variation of the TR optimization approach, this study seeks optimization of RMSEC along with the regression vector 1-norm to accomplish bias reduction with built-in wavelength selection, i.e., the 1-norm acts as a penalty like the 2-norm to guard against over- or under-fitting. This TR approach with the model 1-norm (TR1) is achieved by setting $a = 2$, $b = 1$, and $\mathbf{L} = \mathbf{I}$ in Eq. 2, which is also known as the least absolute shrinkage and selection operator (LASSO).27 The 1-norm for the regression vector in TR was first proposed in 1973 in Ref. 28 and has been further studied and applied in several publications since.2,29–40

Insights about the mechanism of the 1-norm can be gained by considering the following example. Like the regression vector 2-norm, the vector 1-norm is a scalar measure of regression vector size (coefficient magnitudes). The norms differ computationally by

$$\|\mathbf{b}\|_2 = \sqrt{\sum_{i=1}^{w} b_i^2} \qquad (3)$$

$$\|\mathbf{b}\|_1 = \sum_{i=1}^{w} |b_i| \qquad (4)$$

The two hypothetical regression vectors $\mathbf{b}_1^t = (5, 1, 0.2)$ and $\mathbf{b}_2^t = (5.1, 1.1, 0)$ both have the same 1-norm values of 6.2, but the 2-norms differ with 5.1 for $\mathbf{b}_1^t$ and 5.2 for $\mathbf{b}_2^t$. Thus, minimizing the 2-norm in Eq. 2 favors $\mathbf{b}_1^t$ with no zero regression coefficients. For minimizing the 1-norm in Eq. 2, the 1-norms are the same and the better regression vector would be the one providing the smaller value for the first term in Eq. 2. Alternatively, consider $\mathbf{b}_3^t = (4, 2.8, 1.5)$ and $\mathbf{b}_1^t$, which both have the same 2-norm values of 5.1, but the 1-norms differ with 6.2 for $\mathbf{b}_1^t$ and 8.3 for $\mathbf{b}_3^t$. In this case, minimizing the 1-norm in Eq. 2 favors $\mathbf{b}_1^t$, where the third coefficient is closer to zero. As exemplified in these situations, 2-norm optimization impedes regression vectors with coefficients at zero, while the opposite is true for 1-norm optimization. Said another way, the 2-norm enforces smoothness in the regression vector and does not construct big variations well while the 1-norm imposes flatness and is able to construct big variations.
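The norm values quoted in this example can be checked directly. The short sketch below computes the 1-norm (sum of absolute coefficients) and the 2-norm (Euclidean length) for the three hypothetical vectors; the helper names `norm1` and `norm2` are illustrative.

```python
import math


def norm1(b):
    """Vector 1-norm: sum of absolute coefficient values."""
    return sum(abs(x) for x in b)


def norm2(b):
    """Vector 2-norm: square root of the sum of squared coefficients."""
    return math.sqrt(sum(x * x for x in b))


b1 = (5.0, 1.0, 0.2)
b2 = (5.1, 1.1, 0.0)
b3 = (4.0, 2.8, 1.5)

for name, b in (("b1", b1), ("b2", b2), ("b3", b3)):
    # Rounded to one decimal place, as quoted in the text.
    print(name, round(norm1(b), 1), round(norm2(b), 1))
```

Rounded to one decimal, this reproduces the 1-norms of 6.2, 6.2, and 8.3 and the 2-norms of 5.1, 5.2, and 5.1 for the first, second, and third vectors, respectively.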

Three previous studies involving TR1 for wavelength selection are incomplete in that only bias information was used to compare models to determine the $\lambda$ meta-parameter.38–40 This approach is inaccurate, for reasons noted previously, and the 1-norm also needs to be simultaneously evaluated for determining $\lambda$. Said another way, because Eq. 2 is based on two measures to judge the model, these two measures and preferably additional measures should be simultaneously evaluated to assess the proper tradeoff. Using only a single criterion does not accomplish this assessment. Another aspect missing from these wavelength selection studies is that artificial data with known spectral situations have not been examined and it is uncertain whether the used algorithms actually converge to optimal wavelengths. Using simulated data also allows discerning the functionality of TR1 for wavelength selection, thereby providing guidelines on wavelength selection, and insights enhancing interpretation of real data results are gained. Therefore, a focus of this project is to inherently use Eq. 2 with $a = 2$, $b = 1$, and $\mathbf{L} = \mathbf{I}$ in conjunction with an L-curve to develop a new approach to model building and apply it to the analytical chemistry problem of wavelength selection, developing guidelines for when wavelength selection is useful.

To simplify the optimization process and avoid determining the $\lambda$ meta-parameter for the optimal model in Eq. 2, multiple models are instead estimated by using a target vector optimization approach as has been accomplished in previous TR work.12,14,15 The approach is accomplished by varying target values using Eq. 5 in the optimization algorithm:

$$\min \left( \sqrt{(\mathbf{a} - \mathbf{t})^t (\mathbf{a} - \mathbf{t})} \right) \qquad (5)$$

where a is a $2 \times 1$ vector containing the RMSEC and regression vector 1-norms for a particular model and t is a vector of corresponding user-provided target values. The optimization algorithm converges to a model minimizing Eq. 5 rather than Eq. 2 by minimizing the Euclidean distance to the target RMSEC and vector 1-norm values. The converged RMSEC and $\|\mathbf{b}\|_1$ values are used to form the L-curve for final model selection. With the L-shaped curve, over- and under-fitted models are clearly delineated and the tradeoff between the model size and RMSEC is well characterized. An L-shaped curve for a host of TR1 models could also be formed by direct optimization of Eq. 2 varying $\lambda$. However, as noted, this process is algorithmically complicated and requires empirical determination of an acceptable range of $\lambda$ values to generate the L-curve, i.e., models well into the over- and under-fitted regions are not needed.
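The criterion in Eq. 5 is simply a Euclidean distance between a model's (RMSEC, 1-norm) pair and a target pair. The sketch below evaluates that distance for a few hypothetical candidate models and picks the closest one. It only ranks fixed candidates; the actual optimization described in the text searches over regression vectors. The candidate values, the target, and the name `target_distance` are assumptions for illustration.

```python
import math


def target_distance(a, t):
    """Eq. 5 criterion: sqrt((a - t)^t (a - t)) for a = (RMSEC, ||b||_1)."""
    return math.sqrt(sum((ai - ti) ** 2 for ai, ti in zip(a, t)))


# Hypothetical scaled (RMSEC, 1-norm) pairs for three candidate models.
candidates = {"A": (0.10, 0.90), "B": (0.20, 0.40), "C": (0.70, 0.10)}

# Hypothetical target: RMSEC of 0 and a scaled 1-norm of 0.5.
target = (0.0, 0.5)

best = min(candidates, key=lambda k: target_distance(candidates[k], target))
print(best, round(target_distance(candidates[best], target), 3))
```

Sweeping the target along one axis while holding the other at zero would trace out different parts of the L-curve, one converged model per target.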

Optimization of TR1 can be biased toward either the left or right terms in Eq. 2 (or corresponding terms in Eq. 5) due to differences in the magnitudes of the two terms. To avoid this situation, the RMSEC and $\|\mathbf{b}\|_1$ are scaled to vary from a small non-zero number to one in Eq. 5 by dividing the values by respective maxima. In this way, similar magnitudes for RMSEC and $\|\mathbf{b}\|_1$ are obtained during an optimization process. An estimated maximum $\|\mathbf{b}\|_1$ value is obtained prior to TR1 optimization from a full-rank PCR or PLS model or an RR model with the ridge value at zero. These values correspond to the least squares model with maximum $\|\mathbf{b}\|_2$. The maximum RMSEC value is estimated a priori from the one factor PCR or PLS model or an RR model with a large ridge value. Considering that PCR, RR, or PLS models form complete L-curves (over-fit, optimal (corner), and under-fit model regions), both the RMSEC and $\|\mathbf{b}\|_1$ values in TR1 can be equivalently divided by respective maxima obtained from PCR, RR, or PLS models to accomplish the range scaling.

In this study, the TR1 program utilized the fminunc.m program from the MatLab Optimization Toolbox with termination tolerances for the function and variables set to $10^{-25}$ and 0, respectively. This program uses a quasi-Newton method with a mixed quadratic and cubic line search. Two sets of 100 scaled target values were used to form L-curves. The first set consisted of a constant target RMSEC value of 0 and target 1-norm values ranging from 0 to 1 at intervals of 0.01. The second set has target RMSEC values ranging from 0 to 1 at intervals of 0.01 and a constant target 1-norm of 0. The first set allows the TR1 optimization to formulate the upper over-fitted region of the L-curve, while the second set focuses on the lower under-fitted region of the L-curve. In order to shorten program running times for each set of target values, the PLS model with the closest scaled RMSEC or 1-norm value to the non-zero target value was used as the initial optimization starting point (the closest PCR or RR model could also be used).

Stepwise Regression. The SWR method is a standard approach for variable selection and often used in comparison to new wavelength selection methods. It is fully described in any linear regression textbook.41 In this study, forward SWR was performed using the stepwise.m program from the MatLab Statistics Toolbox and run with the default parameters (no wavelengths in the initial model and 95% confidence interval for significance testing of regression coefficients). Partial least squares was then performed on the SWR subset in addition to the MLR default model from SWR.

PERFORMANCE MEASURES

While other indicators of variable selection performance have been presented,42–44 the focus of this study is the harmony tradeoff in conjunction with a parsimony consideration. Optimal model selection for PCR, RR, and PLS was based on harmonious L-curve plots of $\|\mathbf{b}\|_2$ against RMSEC and $R^2$. The resultant L-curves from plotting $\|\mathbf{b}\|_1$ against RMSEC and $R^2$ were used to select the final TR1 model.

Harmony. To assess the bias/variance tradeoff, bias and variance indicator values are tabulated for all final models. It has been noted in previous work that using $\|\mathbf{b}\|_2$ as a prediction variance indicator between models with differing number of wavelengths or the same number of wavelengths, but the wavelengths themselves are dissimilar, is not always exact, i.e., the regression vector 2-norm only acts as a prediction variance indicator for models with the same wavelengths and does not always correlate when comparing models with different wavelength subsets.12 This paper further explores this aspect of using $\|\mathbf{b}\|_2$ as a direct measure of potential variances with dissimilar wavelength subsets by using mean model prediction standard deviations estimated by 100 iterations of perturbing the calibration set X and y with homoscedastic noise relative to 1% of respective sample concentrations and spectra maxima. This approach could not be done for TR1 models due to convergence limitations presented by local minima.
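The Monte Carlo estimate of the mean prediction standard deviation can be sketched as below. Several choices here are assumptions rather than details from the paper: a single-wavelength least squares model through the origin stands in for the calibration method, the noise standard deviation is taken as 1% of the largest absolute value in x and in y, and the data values and the name `mean_prediction_sd` are hypothetical.

```python
import random
import statistics


def fit_slope(x, y):
    """Least squares slope through the origin (single-wavelength stand-in model)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def mean_prediction_sd(x, y, x_new, iterations=100, level=0.01, seed=0):
    """Perturb the calibration set repeatedly and average the per-sample prediction SDs."""
    rng = random.Random(seed)
    sd_x = level * max(abs(v) for v in x)  # assumption: 1% of the spectral maximum
    sd_y = level * max(abs(v) for v in y)  # assumption: 1% of the concentration maximum
    preds = [[] for _ in x_new]
    for _ in range(iterations):
        x_pert = [v + rng.gauss(0.0, sd_x) for v in x]
        y_pert = [v + rng.gauss(0.0, sd_y) for v in y]
        b = fit_slope(x_pert, y_pert)
        for k, xn in enumerate(x_new):
            preds[k].append(b * xn)
    return statistics.mean([statistics.stdev(p) for p in preds])


x_cal = [0.1, 0.2, 0.3, 0.4, 0.5]  # hypothetical absorbances
y_cal = [1.0, 2.1, 2.9, 4.0, 5.1]  # hypothetical concentrations
sigma_y = mean_prediction_sd(x_cal, y_cal, x_new=[0.25, 0.45])
print(sigma_y)
```

The same loop applies to any calibration method that can be refit quickly; a method prone to local minima would give unstable refits, which is consistent with the limitation noted for TR1.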

All tabulated model comparisons involve prediction bias indicators (RMSEC, RMSEV, and corresponding $R^2$ values) and prediction variance indicators (the model 2-norm $\|\mathbf{b}\|_2$ and mean prediction standard deviations $\sigma_y$ from the Monte Carlo simulations). With these values, the harmony (tradeoff) between the different modeling methods for different wavelength subsets can be assessed.

Parsimony. An important model quality measure is parsimony, the sparseness of the model. Said another way, parsimony is an assessment of the amount of calibration data information used in building the model and the more parsimonious a model is, the less likely it is that the model is over-fitted. Effective rank (ER) is a measure of such parsimony,17,18 and a direct computation13 and two statistical methods16,45 have been reported as well as applications and comparison studies of these three methods.15,17,18,46 The generalized degrees of freedom (GDF) is the ER measure used in this study.16 The approach measures ER as the sum of the sensitivities of predicted calibration y values ($\hat{\mathbf{y}}$) to noise perturbation on the original calibration y using a Monte Carlo process. The process involves using N iterations with normally distributed noise ($\delta$) applied to y. For each sample i, the sensitivity ($h_i$) of $\hat{y}_i$ to noise added to $y_i$ is estimated as the slope of the line given by

$$\hat{y}_i = a + h_i \delta_{ni}, \quad n = 1, \ldots, N \qquad (6)$$

with a representing the line intercept. The model ER is calculated by summing respective values as shown in Eq. 7:

$$\mathrm{ER} = \sum_{i=1}^{m} h_i \qquad (7)$$

For this study, N is set to 1000 and noise added to y is normally distributed noise with mean zero and standard deviation 0.01 for all data sets. It should be noted that because the slope is used for the GDF calculation of ER, the GDF process is quite insensitive to the noise level used as the slope is generally consistent over a range of noise levels.17 Again, due to the local minima nature of the 1-norm, ER values are not reported for TR1.
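The GDF procedure of Eqs. 6 and 7 can be sketched with a simple stand-in model. Here a straight-line least squares fit (intercept plus slope) replaces PCR, RR, or PLS purely for illustration, and the data and function names are hypothetical. For such a fit the ER is expected to come out close to 2, the number of fitted parameters, which gives a quick sanity check on the procedure.

```python
import random


def ols_slope(u, v):
    """Slope of v regressed on u, with an intercept."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    sxy = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    sxx = sum((a - mu) ** 2 for a in u)
    return sxy / sxx


def fitted_values(x, y):
    """Fitted calibration values from a straight-line least squares model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = ols_slope(x, y)
    return [my + slope * (xi - mx) for xi in x]


def effective_rank_gdf(x, y, n_iter=1000, noise_sd=0.01, seed=0):
    """Monte Carlo GDF: perturb y, refit, and sum the per-sample sensitivities."""
    rng = random.Random(seed)
    m = len(y)
    deltas = [[] for _ in range(m)]
    fits = [[] for _ in range(m)]
    for _ in range(n_iter):
        d = [rng.gauss(0.0, noise_sd) for _ in range(m)]
        y_hat = fitted_values(x, [yi + di for yi, di in zip(y, d)])
        for i in range(m):
            deltas[i].append(d[i])
            fits[i].append(y_hat[i])
    # Eq. 6: h_i is the slope of y_hat_i against delta_i; Eq. 7: ER is the sum of h_i.
    return sum(ols_slope(deltas[i], fits[i]) for i in range(m))


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # hypothetical predictor values
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1]  # hypothetical reference values
er = effective_rank_gdf(x, y)
print(round(er, 2))
```

Because each sensitivity is a slope, rescaling the noise standard deviation rescales both axes of the regression in Eq. 6 together, which is why the estimate is largely insensitive to the noise level for a linear model.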

EXPERIMENTAL

Apparatus/Software. MatLab 7.0 (The MathWorks, Natick, MA) programs for TR1, PCR, RR, PLS, MLR, SWR, and ER were written by the authors. Programs for TR1 and SWR used routines from the MatLab Optimization and Statistics Toolboxes, respectively (see previous corresponding sections for details). All programs ran on an Intel Pentium 4 personal computer.

Data Procedure. All data sets were split into calibration and validation data sets as follows. Samples were sorted by the magnitudes of values in y and every other sample was placed in the validation set with the remaining samples forming the calibration set. All data were mean centered relative to the calibration set prior to modeling.
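The splitting and centering procedure can be sketched as follows. The text does not say whether the first or the second sorted sample starts the validation set, so sending the second, fourth, and so on to validation is an assumption here, as are the function names and the small data set.

```python
def column_means(rows):
    """Mean of each column of a list-of-rows matrix."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def split_and_center(X, y):
    """Sort by y, alternate samples between sets, and center on calibration means."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    cal_idx = order[0::2]
    val_idx = order[1::2]  # assumption: 2nd, 4th, ... sorted samples go to validation
    X_cal, y_cal = [X[i] for i in cal_idx], [y[i] for i in cal_idx]
    X_val, y_val = [X[i] for i in val_idx], [y[i] for i in val_idx]

    # Means come from the calibration set only and are applied to both sets.
    x_mean = column_means(X_cal)
    y_mean = sum(y_cal) / len(y_cal)

    def center(rows):
        return [[v - m for v, m in zip(row, x_mean)] for row in rows]

    return (center(X_cal), [v - y_mean for v in y_cal],
            center(X_val), [v - y_mean for v in y_val])


# Hypothetical four-sample, two-wavelength data set.
X = [[0.3, 0.1], [0.1, 0.2], [0.4, 0.4], [0.2, 0.3]]
y = [3.0, 1.0, 4.0, 2.0]
X_cal, y_cal, X_val, y_val = split_and_center(X, y)
print(y_cal, y_val)
```

Centering the validation set with calibration means keeps the validation samples out of the model-building step.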

Data Sets. Controlled spectral simulations were performed in order to assess the influences of spectral overlap (selectivity), signal-to-noise ratios, and sensitivity on selecting wavelengths by TR1 and SWR. Two simple situations are used and both are based on a chemical system composed of an analyte and an interferent. In one case, pure-component spectra are broad banded, and the other consists of narrow banded spectra. By using the artificial data, essential information is gained to form guidelines and provide better insight into the analysis of the real data sets.

Simulated Spectroscopic Set I. The simulated wavelength selection data set used in Ref. 12 is examined. This data set simulates spectra over 50 wavelength units using Gaussian curves for the analyte and an interferent that partly overlaps with the analyte (see Fig. 1). Random concentrations for both components range from 0 to 1 over 100 samples and respective spectra are formed by the product of these concentrations with the pure-component spectra at unit concentration in Fig. 1. Random homoscedastic noise with a normal distribution mean of zero and standard deviation of one was added to X at 1% of maximum peak response of respective spectra followed by additional noise at 3% of peak maximum to responses at wavelengths 20 through 30.

FIG. 1. Pure-component spectra at unit concentration prior to noise addition for the simulated spectroscopic data set I; (dotted line) analyte and (solid line) interferent.

88 Volume 61, Number 1, 2007

Simulated Spectroscopic Set II. This simulated data set of 61 wavelengths was created using Gaussian peaks five wavelength units wide with a one wavelength baseline separating each peak (see Fig. 2). Analyte peaks are centered at every sixth wavelength from wavelengths 4 to 52. Interferent peaks are centered at wavelengths 22, 28, 34, 40, and 52, producing peaks with perfect selectivity for the analyte centered at wavelengths 4, 10, 16, and 46. Sixty-six (66) sample concentrations for both the analyte and interferent were generated from random numbers with a normal distribution of a mean of zero and a standard deviation of one. Absolute values were used, resulting in analyte concentrations ranging from 4.3 × 10⁻⁵ to 2.3 with a mean of 0.76 and a standard deviation of 0.56 and interferent concentrations ranging from 0.011 to 2.2 with a mean of 0.72 and a standard deviation of 0.49. Sample spectra are from the product of these concentrations with the pure-component spectra at unit concentration pictured in Fig. 2. Random heteroscedastic noise with a normal distribution of a mean of zero and a standard deviation of one was added to X at 1% of respective spectral values for each wavelength.
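The layout of data set II can be reproduced from the peak centers given above. In this sketch the peak heights (all set to one) and the Gaussian width parameter are assumptions, since the actual sensitivities are those pictured in Fig. 2; each peak is truncated to five wavelength units so that one baseline wavelength separates neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)
wl = np.arange(1, 62)                          # wavelengths 1 through 61
analyte_centers = np.arange(4, 53, 6)          # 4, 10, ..., 52
interferent_centers = np.array([22, 28, 34, 40, 52])

def banded_spectrum(centers, sigma=1.0, half_width=2):
    """Sum of unit-height Gaussian peaks, each five wavelength units wide."""
    spec = np.zeros(wl.size)
    for c in centers:
        window = np.abs(wl - c) <= half_width
        spec[window] += np.exp(-0.5 * ((wl[window] - c) / sigma) ** 2)
    return spec

analyte = banded_spectrum(analyte_centers)
interferent = banded_spectrum(interferent_centers)

# Absolute values of standard normal random numbers as concentrations
conc = np.abs(rng.standard_normal((66, 2)))
X_clean = conc @ np.vstack([analyte, interferent])

# Heteroscedastic noise: 1% of the respective spectral value
X = X_clean * (1.0 + 0.01 * rng.standard_normal(X_clean.shape))
y = conc[:, 0]

# Analyte peak centers that are free of interferent response
selective = [int(c) for c in analyte_centers if interferent[c - 1] == 0.0]
```

With proportional noise, baseline wavelengths such as 7 or 56 remain exactly zero in this sketch, which is a consequence of the assumed noise model rather than a statement about the authors' data.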

Gasoline. The gasoline data set consists of 40 gasoline samples with reference octane values.47 Samples were measured by diffuse reflectance near-infrared (NIR) as log(1/R) from 900 to 1700 nm at 2 nm intervals. Every fifth recorded wavelength is used, yielding 81 wavelengths from 900 to 1700 nm.

Corn. The corn data set consists of 80 corn samples with reference moisture values.48 Spectra were measured from 1100 to 2498 nm at 2 nm intervals on the near-infrared (NIR) spectrometer designated m5. Every tenth recorded wavelength was used, resulting in 70 wavelengths in the spectra.
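The wavelength counts can be checked with a short calculation. The sketch assumes that subsampling starts at the first recorded wavelength, which is consistent with the 10 nm and 20 nm grids implied by the SWR subsets reported later (e.g., 910 nm for gasoline and 2480 nm for corn).

```python
import numpy as np

gasoline_wl = np.arange(900, 1701, 2)[::5]    # every fifth of 401 recorded points
corn_wl = np.arange(1100, 2499, 2)[::10]      # every tenth of 700 recorded points

n_gasoline, n_corn = gasoline_wl.size, corn_wl.size
```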

RESULTS AND DISCUSSION

Simulated Spectroscopic Data Set I. The L-curves plotted in Fig. 3a for the TR1, PCR, RR (TR in standard form), PLS, and SWR PLS models show that TR1 is indeed Pareto optimal with respect to the regression vector 1-norm. Plots in Fig. 3b visually demonstrate that optimization of TR with the regression vector 2-norm cannot achieve the models found by TR1. Specifically, the TR1 models form inferior points on the regression vector 2-norm plot relative to PCR, RR, and PLS. Similar patterns to those in Fig. 3 were obtained when RMSEV was used instead of RMSEC. These results agree with previous work for this data set, in which models with wavelengths selected using the regression vector 2-norm and L set to the noise structure in Eq. 2 are not Pareto optimal relative to full wavelength models in the 2-norm plots.12 The reasoning for this inferiority is that models based on wavelength subsets have larger regression vector 2-norms compared to models with the same wavelengths and additional wavelengths.49 Note that in Fig. 3b, the PCR, RR, and PLS harmonious L-curves closely resemble one another, albeit with relatively subtle differences. This result is consistent with previous studies and the other data sets analyzed in this study.14,15,20 Lastly, Fig. 3 demonstrates that a smooth L-curve from 1-norm optimization may not yield a smooth L-curve when plotted with the 2-norm. For example, TR1 models in the corner region with an RMSEC range of approximately 0.005 to 0.015 yield a smooth 1-norm L-shaped curve in Fig. 3a. However, TR1 models in the same RMSEC range for the 2-norm plot in Fig. 3b result in a jagged curve with peaks and troughs and the L-shape is significantly distorted. In this portion of the 2-norm curve, models would not likely be selected as optimal, despite being in the corner region of the 1-norm L-curve. For this study, all TR1 models are selected from 1-norm L-curves, as the intent of TR1 is built-in wavelength selection, which is determined by the 1-norm.
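To make the L-curve comparison concrete, the following sketch computes the plotted quantities for a model and flags which points of a set are Pareto optimal when both axes are minimized. It assumes RMSEC is the root mean square of the calibration residuals without a degrees-of-freedom correction. The helper names and the ridge sweep are illustrative and are not the authors' programs.

```python
import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def l_curve_point(X, y, b):
    """Return (RMSEC, ||b||_1, ||b||_2) for one regression vector."""
    return rmse(y, X @ b), float(np.abs(b).sum()), float(np.linalg.norm(b))

def pareto_mask(points):
    """True for points that no other point dominates (both criteria minimized)."""
    pts = np.asarray(points, dtype=float)
    mask = np.ones(len(pts), dtype=bool)
    for i, p in enumerate(pts):
        dominates_p = np.all(pts <= p, axis=1) & np.any(pts < p, axis=1)
        mask[i] = not dominates_p.any()
    return mask

# Illustrative data (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=30)

ridge_points = []
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    b = np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)
    rmsec, _, norm2 = l_curve_point(X, y, b)
    ridge_points.append((rmsec, norm2))

ridge_is_pareto = pareto_mask(ridge_points)
toy_mask = pareto_mask([(1, 3), (2, 2), (3, 1), (3, 3)])
```

In the 2-norm plane, every ridge model in the sweep is non-dominated, which mirrors the observation that 2-norm optimization traces its own Pareto front and that models from a different criterion can fall off that front.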

As shown in Fig. 4a, the TR1 regression vector has many more near-zero coefficients than the selected PLS 2 factor model (PCR and RR have nearly the same regression vectors as PLS in the same region of their respective L-curves). The TR1 regression vector primarily retains wavelengths 18 and 19, two of the three most sensitive analyte selective (free of spectral overlap) wavelengths. Other analyte selective wavelengths, e.g., 9 through 17, have lower signal-to-noise ratios than wavelengths 18 and 19 and are effectively "zeroed out". The maximum pure-component analyte peak height is at wavelength 20 and hence, this wavelength has the greatest sensitivity. However, wavelength 20 is also the wavelength at which the additional 3% noise is added (wavelengths 20 through 30) and degradation of the signal-to-noise ratio begins. When this additional noise is not added to the same simulated spectra, the optimal TR1 regression vector coefficient with the greatest value is at wavelength 20. From the pure-component spectra plotted in Fig. 1, wavelengths 19 and 21 are noted as being equally sensitive towards the analyte, but wavelength 21, which is zeroed out in the TR1 optimal model, includes interferent overlap and the additional noise. Therefore, it may be concluded that for this data set, TR1 zeros out wavelengths with relatively low signal-to-noise and less selectivity. In order for PCR, RR, or PLS to obtain a regression vector focusing on wavelengths 18 and 19 as does TR1, an over-fitted model must be used. Such a regression vector is depicted in Fig. 4a for the PLS 7 factor model and shows the over-fitting of the nonessential and noisy wavelengths.

FIG. 2. Pure-component spectra at unit concentration prior to noise addition for the simulated spectroscopic data set II; (dotted line) analyte and (solid line) interferent. An offset of 0.3 is added to the analyte for visual clarity.

FIG. 3. Simulated spectroscopic data set I RMSEC L-curves for (a) the 1-norm and (b) the 2-norm. (Solid line) TR1, (open circle) PCR, (asterisks) PLS, (dotted line) RR, and (squares) SWR PLS. Numbers on plot indicate the respective number of basis vectors and the star denotes the TR1 model evaluated.

APPLIED SPECTROSCOPY 89

The selected TR1 model has many more near-zero coefficients than the model obtained in Ref. 12 from optimizing TR with L set to the noise structure vector. Furthermore, some small over-fitting is seen in the model from Ref. 12, e.g., a small positive coefficient at wavelength 32, which contains no analyte information, and this over-fitting is absent from the TR1 model.

In recent statistical work, it has been reported that if there is a group of variables with large pair-wise correlations, then TR1 will tend to select only one variable from the group and the algorithm does not care which one is selected.50,51 However, when wavelength selection is the focus, it appears that in the optimal model region, TR1 prefers the more sensitive correlated wavelengths. For example, plotted in Fig. 5 are the regression coefficients for analyte responding wavelengths 14, 16, 18, 19, and 21 covering the under-fitted, corner, and over-fitted regions of the L-curve in Fig. 3a. Figure 5 shows that in the optimal corner model region of Fig. 3a, the regression coefficients for correlated wavelengths 18 and 19 are the greatest and there are some tradeoffs in magnitudes. As the model becomes over-fitted in Fig. 3a, the corresponding magnitudes for wavelengths 18 and 19 in Fig. 5 begin to decrease and other less sensitive correlated wavelengths 14 and 16 begin to concurrently increase in magnitude. In the under-fitted region of Fig. 3a, respective regression coefficients for wavelengths 18 and 19 in Fig. 5 tend to randomly vary, while wavelengths 14 and 16 are essentially zero. Wavelength 21 is plotted in Fig. 5 as this wavelength has a large regression coefficient value for the PLS 2 factor model displayed in Fig. 4a. The TR1 regression coefficient value for wavelength 21 in Fig. 5 also randomly varies in the associated under-fitted region of Fig. 3a, proceeds to be zeroed out through the corner region of Fig. 3a, and effectively remains zero for the over-fitted region of Fig. 3a. Thus, it appears that in the optimal model region, TR1 is indeed particular about which wavelengths are selected.

Values tabulated in Table I show that in comparison with full wavelength PCR, RR, and PLS models, the TR1 model yields lower RMSEC and RMSEV values (and correspondingly increased R² values) at a cost of a greater ||b||₂ value. These improved TR1 bias values probably stem from less fitting of the interference and additional noise regions past wavelength 19. While the regression vector 2-norm value is higher with the TR1 model, actual estimated prediction standard deviations could not be obtained.

FIG. 4. Simulated spectroscopic data set I regression vectors. (a) (solid) TR1, (asterisks) PLS 2 factors, and (dotted line/closed circle) PLS 7 factor. Offsets of 0.1 and 0.25 are added to, respectively, PLS 2 and 7 factor regression vectors for visual elucidation. (b) (Squares) SWR PLS 2 factor and (X's) SWR MLR. To plot SWR regression vectors, zeroes were inserted for wavelengths not selected by SWR. An offset of 0.1 is added to the SWR PLS 2 factor regression vector for visual clarity.

FIG. 5. Simulated spectroscopic data set I regression vector coefficients for wavelengths (dot dash line) 14, (X's) 16, (solid line) 18, (dashed line/closed circles) 19, and (dotted line) 21. TR1 model regression vector 1-norm is 2.29 (see Table I) and is noted on the plot.

The SWR method selected wavelengths 2, 13, 14, 16, 18, 19, 20, 41, and 45. Note that wavelengths 2, 41, and 45 contain no analyte information and are probably selected based on chance correlations. It is also interesting that SWR did not select analyte wavelengths 15 and 17, although it did select all other analyte wavelengths 13 through 20. Values in Table I show that despite incorporating more analyte selective wavelengths than the TR1 model, the SWR PLS 2 factor and MLR models do not yield lower RMSEV values than the TR1 model does. However, these two SWR models do yield lower RMSEC values than the TR1 model, which is probably the result of over-fitting, albeit the SWR PLS model substantially reduces the over-fitting and provides better validation results than the SWR MLR model. The SWR PLS 2 factor regression vector plotted in Fig. 4b shows some small over-fitting at wavelengths 2, 41, and 45, which do not contain any analyte or interferent information. Figure 4b reveals that such over-fitting is even more pronounced in the SWR MLR model. Additionally, the SWR MLR model yields higher estimated prediction variance (σy) than the full wavelength PCR, RR, and PLS models. The SWR PLS 2 factor model yields comparable estimated prediction variances to the full wavelength models. With SWR, the model used is commonly the MLR model, as it is the modeling method inherent to the algorithm. Results in Table I show that by using PLS with the SWR selected wavelengths instead of MLR, results can be substantially improved as PLS reduces the over-fitted regression coefficient magnitudes compared to the MLR model. The L-curves in Fig. 3a demonstrate that the TR1 models are Pareto and, hence, the SWR MLR and PLS models are bypassed in the TR1 optimization algorithm.

The parsimony for the subset of wavelengths appears to not be noticeably different than the full wavelength models. Specifically, the MLR SWR model has the largest ER even though only nine wavelengths are used compared to 50 wavelengths for the full wavelength models. By using PLS instead of MLR with the SWR selected wavelengths, the ER is reduced to a comparable value to those from the full wavelength models. Thus, even though the full wavelength models use a much larger number of wavelengths, the final parsimony of the models is nearly the same as models based on fewer wavelengths. The full wavelength models also maintain better parsimony than the MLR model with fewer wavelengths.

Overall, wavelength selection appears beneficial for this data set. By using the 1-norm penalty on the regression vector size, TR1 is able to select wavelengths and estimate their regression coefficients simultaneously. This approach is in contrast to the usual method of sequential variation of wavelength selections followed by model determinations using PCR, RR, PLS, MLR, etc. With built-in wavelength selection by TR1, similar results to SWR selection (minus the over-fitting) were obtained. The improvement with TR1 or subset selection by SWR probably results from the fact that this data set has what can be classified as a small to moderate number of large to moderate sized wavelength effects (sensitivity and selectivity). Thus, TR1 and SWR are able to isolate these key wavelengths. While SWR tended to also include random wavelengths, SWR followed by PCR, RR, or PLS can reduce this effect. As will be observed in other data sets, the full wavelength methods perform better when there are a greater number of small wavelength effects.
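The built-in selection effect of a 1-norm penalty can be illustrated with a small example. The sketch below assumes L is the identity matrix and minimizes ||Xb − y||₂² + λ||b||₁ by cyclic coordinate descent with soft thresholding. This is a swapped-in solver chosen for brevity, not the MatLab Optimization Toolbox routine used by the authors, and the data and function names are hypothetical.

```python
import numpy as np

def soft_threshold(rho, t):
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def tr1_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize ||Xb - y||_2^2 + lam * ||b||_1 (L assumed to be the identity)."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    resid = np.asarray(y, dtype=float).copy()    # equals y - X @ b for b = 0
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * b[j]              # remove wavelength j from the fit
            rho = X[:, j] @ resid
            b[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * b[j]              # put the updated term back
    return b

# Illustrative data (assumed): only two "wavelengths" carry analyte information
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 15))
b_true = np.zeros(15)
b_true[3], b_true[7] = 1.0, -0.5
y = X @ b_true + 0.01 * rng.normal(size=60)

b_tr1 = tr1_coordinate_descent(X, y, lam=10.0)
b_rr = np.linalg.solve(X.T @ X + 10.0 * np.eye(15), X.T @ y)
```

The 1-norm solution sets uninformative coefficients exactly to zero in one pass of model building, whereas the 2-norm (RR) solution shrinks all coefficients but leaves every one non-zero.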

It is noteworthy to compare full wavelength PCR, RR, and PLS values listed in Table I where it is observed that RR has the greatest ER but the smallest ||b||₂, PCR has the smallest ER and a ||b||₂ equal to PLS, and the RMSEC and RMSEV values are the largest for PCR and smallest for RR. Thus, in terms of harmony, RR is the most harmonious and PCR is the least for the measures presented in Table I. For the ER parsimony consideration, PCR is the most parsimonious and RR is the least parsimonious. The method of PLS is intermediate in this apparent harmony/parsimony tradeoff. This trend can be observed for the other data sets and stems from how the eigenvector basis set for X is used to form respective regression vectors b. Briefly, PCR weights only those eigenvectors up to the number used and thus, ER is the number of basis eigenvectors used. Partial least squares can weight all rank k eigenvectors where rank(X) = k ≤ min(m, w), but the weighting focus is on the same eigenvectors used in the final PCR model with small non-zero weights on a few more and zero weights on the rest. Like PLS, RR can weight all k eigenvectors and the focus is on the same eigenvectors as PLS with small non-zero weights on the rest. Thus, PLS appears intermediate in how the eigenvector basis set is used and hence, intermediate in harmony and parsimony. References 14, 52, and references therein should be consulted for further details on eigenvector weighting.
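The eigenvector weighting for PCR and RR can be written with filter factors on the singular value decomposition of X. The sketch below is an illustration under the assumption that RR solves (XᵀX + λI)b = Xᵀy, which may scale λ differently from Eq. 2; PLS weights depend on y and are omitted. The sum of the weights is shown only as a rough indicator and is not the GDF-based ER reported in the tables.

```python
import numpy as np

def filter_solution(X, y, factors):
    """b = sum_i f_i * (u_i' y / s_i) * v_i using the SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ (factors(s) * (U.T @ y) / s)

def pcr_factors(k):
    # PCR: weight 1 on the first k eigenvectors and 0 on the rest
    return lambda s: (np.arange(s.size) < k).astype(float)

def rr_factors(lam):
    # RR: every eigenvector receives a weight between 0 and 1
    return lambda s: s ** 2 / (s ** 2 + lam)

# Illustrative data (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = rng.normal(size=20)
s = np.linalg.svd(X, compute_uv=False)

b_pcr = filter_solution(X, y, pcr_factors(2))
b_rr = filter_solution(X, y, rr_factors(0.5))
b_rr_direct = np.linalg.solve(X.T @ X + 0.5 * np.eye(6), X.T @ y)

pcr_weight_sum = float(pcr_factors(2)(s).sum())   # equals the number of factors
rr_weight_sum = float(rr_factors(0.5)(s).sum())   # spread over all eigenvectors
```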

Simulated Spectroscopic Data Set II. Respective L-curves for this data set are similar to those for the simulated spectroscopic data set I shown in Fig. 3. The TR1 regression vector also shows similar trends of modeling wavelengths with the greatest sensitivity and selectivity while giving other wavelengths near zero coefficients, as shown in Fig. 6. Conversely, the plotted PLS model uses all four selective wavelengths. Because more of the selective wavelengths are used, the regression coefficients are more spread out in magnitude with respect to the actual sensitivities seen in pure-component spectra plotted in Fig. 2. Both the TR1 and PLS regression vectors eliminate the interferent peaks.

TABLE I. Model results for simulated spectroscopic data set I.

Subset  Modelᵃ       RMSEC    RMSEV    R² (cal.)  R² (val.)  ||b||₁  ||b||₂  σy cal (×10⁻³)  σy val (×10⁻³)  ER
Full    PCR (2)      0.0213   0.0208   0.9945     0.9940     2.96    0.865   5.33            1.84            2.03
Full    PLS (2)      0.0211   0.0207   0.9946     0.9940     2.99    0.865   5.29            1.83            2.16
Full    RR (0.100)   0.0200   0.0186   0.9959     0.9945     3.09    0.851   4.77            2.22            4.70
Full    TR1          0.0120   0.0114   0.9984     0.9988     2.29    1.60    ···             ···             ···
SWRᵇ    MLR          0.0051   0.0140   0.9997     0.9972     7.02    2.52    6.38            5.13            8.93
SWRᵇ    PLS (2)      0.0101   0.0116   0.9988     0.9984     3.07    1.59    4.69            1.89            2.44

ᵃ Values in parentheses are the number of respective PCR and PLS basis vectors and RR ridge value.
ᵇ SWR subset contains wavelengths 2, 13, 14, 16, 18, 19, 20, 41, and 45.


However, as shown in Table II, the tabulated TR1 model yields inferior results to the other presented full wavelength models.

When plots such as those in Fig. 5 are made for this data set, similar trends are observed. As a TR1 model approaches the corner region of the L-curve from the under-fitted area (small 1-norm values), regression coefficients increase in magnitude for the two selective (and correlated) wavelengths with the greatest sensitivity. As a TR1 model moves out of the corner region of the L-curve and into the over-fitted region (greater 1-norm values), regression coefficients for other selective, and hence correlated, wavelengths with less sensitivity begin to increase in magnitude while regression coefficient values for the more sensitive wavelengths decrease. Spurious non-correlated (baseline noise) wavelengths also begin to receive non-zero coefficient values in the over-fitted region. Thus, TR1 appears to focus on correlated selective wavelengths with the greatest sensitivity, thereby eliminating the less sensitive but selective and useful wavelengths for this data set.

The SWR subset consisting of wavelengths 4, 14, 16, 30, 37, 45, 56, and 59 intuitively appears flawed as it avoids wavelength 10 while including many wavelengths with poor analyte selectivity. Furthermore, this subset includes wavelengths 56 and 59, which do not contain any analyte or interferent information. Consequently, the SWR model presented in Table II yields inferior results in all aspects. Using PLS on the SWR subset did not notably change the prediction results, but the ER did decrease relative to the MLR ER.

In general, wavelength selection by TR1 or SWR does not improve the harmony or parsimony for this data set. As with the previous data set, SWR also selected chance correlated wavelengths. With this data set, the full wavelength approaches tended to perform better as there are numerous wavelengths with small effects (small sensitivities). If the data set only had a few large wavelength effects, e.g., only wavelengths 4 and 10 responding at the same simulated selectivity and sensitivity levels as before, then TR1 and the subset selection methods should provide better results compared to the full wavelength methods.

Gasoline Data. The calibration and validation L-curves follow patterns similar to those for the simulated data sets. Examination of the chosen regression vectors in Fig. 7 reveals a TR1 model with far more near-zero coefficients than the PLS and RR models (PCR is essentially the same), which is consistent with results from the simulated data sets. The TR1 peaks appear where large PLS and RR peaks also occur. At the same time, not every major peak for PLS and RR is accompanied by a TR1 peak, as seen from wavelengths 1620 to 1660 nm. This result is probably due to TR1 wavelength selection subject to factors such as analyte sensitivity, elevated noise levels, and interferent overlap, as seen in the simulated data. Analogous to the simulated data, the RR and PLS regression vectors in Fig. 7 tend to be more spread out over wavelength regions (PCR is similar, but not shown). The wavelength regions highlighted by TR1 correspond to those wavelength regions selected by simulated annealing in another wavelength selection study.53 Perhaps even more interesting is the closer resemblance of TR1 to those wavelength regions identified in Ref. 53 based on the histogram plot for the 50 best four-wavelength subsets with the lowest RMSEC using MLR found from three million random combinations.

Values in Table III show that the TR1 model achieves a lower RMSEV than PCR and PLS, but the selected RR model formed the lowest RMSEV. As observed with the simulated data and in other studies, RR maintains the better harmony, PCR the best parsimony, and PLS is intermediate.

TABLE II. Model results for simulated spectroscopic data set II.

Subset  Modelᵃ       RMSEC    RMSEV    R² (cal.)  R² (val.)  ||b||₁  ||b||₂  σy cal (×10⁻²)  σy val (×10⁻²)  ER
Full    PCR (2)      0.0101   0.0125   0.9997     0.9996     1.81    0.713   1.05            0.494           2.12
Full    PLS (2)      0.0105   0.0125   0.9997     0.9996     1.81    0.713   1.04            0.492           2.19
Full    RR (0.0303)  0.00925  0.0149   0.9999     0.9995     1.99    0.707   0.788           0.576           9.89
Full    TR1          0.0126   0.0133   0.9995     0.9995     1.11    0.810   ···             ···             ···
SWRᵇ    MLR          0.0118   0.0187   0.9995     0.9991     2.10    1.15    0.954           1.01            8.04
SWRᵇ    PLS (2)      0.0121   0.0173   0.9995     0.9992     1.88    1.11    0.932           0.788           2.25

ᵃ Values in parentheses are the number of respective PCR and PLS basis vectors and RR ridge value.
ᵇ SWR subset contains wavelengths 4, 14, 16, 30, 37, 45, 56, and 59.

FIG. 6. Simulated spectroscopic data set II regression vectors for (asterisks) TR1 and (dashed line) PLS 2 factors. An offset of 0.05 is added to PLS regression vector for visual elucidation.

FIG. 7. Gasoline data set regression vectors for (asterisks) TR1, (dashed line) PLS 4 factors, and (dotted line) RR λ = 5.00 × 10⁻⁴. Offsets of 5 are added to both the PLS and RR regression vectors for visual elucidation.

The SWR subset includes wavelengths 910, 1220, and 1400 nm. In this case, the PLS optimal model is the full rank (three factor) model, and hence, it is the same as the MLR model. As does the TR1 model, the SWR MLR model yields better calibration diagnostic values than the full wavelength models and produces equivalent to improved validation values. The ER for the SWR subset substantially improves and is the lowest among all the methods. It appears that for this real data set, SWR was not impacted as much from chance-correlated wavelengths as transpired with the simulated data sets, i.e., the RMSEC for the SWR MLR wavelengths is lower than the TR1 model, but the RMSEV is greater.

Overall, the full wavelength models performed nearly as well as TR1 and SWR. Thus, this data set could be classified as being intermediate between having a large number of small wavelength effects and a moderate number of moderate wavelength effects.

Corn Data. The corn data exhibit the same L-curve trends seen in the simulated and gasoline data. Furthermore, regression vectors shown in Fig. 8 present similar trends to those of the simulated and gasoline data sets. However, with this data set, the TR1 optimal model yields superior bias results compared to those from PCR, RR, and PLS full wavelength models, as shown in Table IV. Comparing the TR1 regression vectors for the corn data (Fig. 8) and the gasoline data (Fig. 7) shows that the model for this data set includes more small coefficients (relative to the size of the largest coefficient) such as those seen at 1420, 1600, 2240, and 2380 nm. However, the improved bias diagnostics come at the expected tradeoff with the larger regression vector 2-norm for TR1. From information learned with the simulated data sets, it appears that a small to moderate number of large to moderate sized wavelength effects are present in the corn data. Thus, TR1 captures the moderate number of key wavelengths for the corn data and zeros out wavelengths with perhaps important but small effects (smaller sensitivity and signal-to-noise ratios).

The SWR subset consists of wavelengths at 1540, 1620, 1900, 1920, 1940, 2100, 2120, 2260, and 2480 nm. As with the gasoline data, the optimal PLS model occurs at the MLR solution and SWR appears to not select wavelengths with chance correlations. The parsimony improves with the SWR selected wavelengths compared to the full wavelength models.

Summarizing, the corn data apparently have a small to moderate number of large to moderate sized wavelength effects. Thus, TR1 and wavelength subset selection by SWR provide improved results compared to the full wavelength methods PCR, RR, and PLS. This result is in contrast to the gasoline data, for which the full wavelength PCR, RR, and PLS results are best due to the apparent presence of a large number of small effects. Between PCR, RR, and PLS, the better harmonious model again lies with RR while PCR is the most parsimonious and PLS is intermediate.

CONCLUSION

Built-in wavelength selection is possible by using the regression vector 1-norm with TR. Such regression vectors contain many near-zero coefficients, especially for wavelengths containing interference and low signal-to-noise ratios. In order to select a properly fitted TR1 regression vector, it is imperative to simultaneously use the two criteria in Eq. 2, not just a bias indicator as in previous studies. For the simulated data, the method of SWR tends to select random wavelengths in baseline noise regions where TR1 does not. However, using PCR, RR, or PLS with the SWR subset rather than the default MLR can sometimes reduce the problem and improve the ER.

Results from this study provide insights allowing the development of useful guidelines on when to use full wavelength methods or wavelength selection methods. Basically, when there are a small number of large wavelength effects (good sensitivity and selectivity), subset selection by TR1 or SWR does well. With a small to moderate number of large to moderate sized wavelength effects, TR1 is better. Lastly, when a large number of small effects are present, the full wavelength methods of PCR, RR, or PLS are best. Intermediate cases will arise from variations of different combinations of these situations. Unfortunately, in order to use these guidelines, one must know the underlying spectral situation, which often is unknown. Thus, it is best to build models with and without wavelengths selected and use the guidelines to explain why wavelength selection improved or degraded results.

TABLE III. Model results for gasoline data set.

Subset  Modelᵃ           RMSEC    RMSEV    R² (cal.)  R² (val.)  ||b||₁  ||b||₂  σy cal  σy val  ER
Full    PCR (5)          0.295    0.296    0.9515     0.9562     311     57.7    0.629   0.368   5.08
Full    PLS (4)          0.258    0.283    0.9629     0.9600     319     58.2    0.828   0.494   6.62
Full    RR (5.00×10⁻⁴)   0.194    0.223    0.9817     0.9801     308     58.2    0.799   0.499   7.34
Full    TR1              0.191    0.236    0.9809     0.9766     184     68.9    ···     ···     ···
SWRᵇ    MLRᶜ             0.163    0.261    0.9853     0.9661     255     159     0.498   0.564   3.02

ᵃ Values in parentheses are number of respective PCR and PLS basis vectors and RR ridge value.
ᵇ SWR subset contains wavelengths 910, 1220, and 1400 nm.
ᶜ PLS model for SWR subset is the MLR model.

FIG. 8. Corn data set regression vectors for (asterisks) TR1 and (dashed line) PLS 8 factors. An offset of 10 is added to the PLS regression vector for visual clarity.

Wavelength selection based on minimizing prediction errors for the calibration or validation data sets will form models over-fitted to the respective data sets. By including a 1-norm penalty in TR, proper models are obtained with essential wavelengths selected. However, with TR1 it is known that if the number of samples (m) is less than the number of wavelengths, as is typical with spectroscopic data, then at most, m wavelengths can be selected.50 This fact could prove to be too restrictive for some data sets, although it was not a problem with the data sets studied in this paper. A possible solution is to incorporate additional criteria into Eq. 2. For example, including both ||b||₁ and ||b||₂, expressed by

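$$\min\left(\lVert \mathbf{X}\mathbf{b} - \mathbf{y} \rVert_2^2 + \lambda_1 \lVert \mathbf{L}\mathbf{b} \rVert_1^1 + \lambda_2 \lVert \mathbf{L}\mathbf{b} \rVert_2^2\right) \qquad (8)$$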

could assist in forming better models. This assistance will be especially true for situations intermediate in the number and size of wavelength effects, such as with the gasoline data. Another possibility is to use Eq. 2 with other norms for b in the interval 0 < b < ∞.19,30,33,34 If the noise structure is known, then including it as L for wavelength selection by using Eqs. 2 and 8 with b = 1 should improve results.
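The combined criterion in Eq. 8 is straightforward to evaluate for a candidate regression vector. The sketch below only computes the objective value and does not minimize it; the function name is hypothetical, `lam1` and `lam2` stand for the two penalty weights, and L defaults to the identity matrix as an assumption.

```python
import numpy as np

def tr_objective(b, X, y, lam1, lam2, L=None):
    """Value of ||Xb - y||_2^2 + lam1*||Lb||_1 + lam2*||Lb||_2^2."""
    L = np.eye(len(b)) if L is None else L
    Lb = L @ b
    resid = X @ b - y
    return float(resid @ resid + lam1 * np.abs(Lb).sum() + lam2 * (Lb @ Lb))

# Small worked example (assumed values)
X = np.eye(2)
y = np.array([1.0, 0.0])
b = np.array([1.0, -2.0])

value = tr_objective(b, X, y, lam1=0.5, lam2=0.1)     # 4 + 1.5 + 0.5
tr1_only = tr_objective(b, X, y, lam1=0.5, lam2=0.0)  # 1-norm penalty only
rr_only = tr_objective(b, X, y, lam1=0.0, lam2=0.1)   # 2-norm penalty only
```

Setting one of the two weights to zero recovers the pure 1-norm or pure 2-norm criterion, so the combined form spans the range between the two.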

A bias/variance/parsimony tradeoff was found to exist for fixed wavelength situations as well as for varying the wavelengths. However, prediction standard deviations and regression vector 2-norms in Tables I through IV show that when the number of wavelengths changes or for a fixed number of wavelengths, but with changing locations, then ||b||₂ is not always comparable as a measure of actual prediction variances expected. When the number and location of wavelengths remain the same, then ||b||₂ works well. For analyte prediction variance expressions,22–25 the homoscedastic measurement error is assumed, which is probably not the case. Using a more complete variance expression should provide a more consistent relationship, but often times many of the terms are not known exactly. Additionally, selecting a subset of wavelengths mathematically increases ||b||₂,49 and hence, an increase is not always a direct measure of a variance increase. Lastly, when the measurement noise increases for a fixed number of wavelengths, ||b||₂ decreases.8 Thus, by including wavelengths with larger measurement error, as can happen with full wavelength situations, a false decrease of ||b||₂ can result when it should have increased to reflect the expected increase in prediction variance.
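As a first-order illustration of why ||b||₂ tracks prediction variance only under homoscedastic noise, consider the contribution of measurement error in a new spectrum x alone, with b treated as fixed. This is a simplified sketch and not the full expressions of Refs. 22–25; the symbols σ_x, σ_j, and Σ_x are introduced here only for illustration.

```latex
\begin{aligned}
\hat{y} &= \mathbf{x}^{\mathsf T}\mathbf{b}, \qquad
\operatorname{Var}(\hat{y}) = \mathbf{b}^{\mathsf T}\boldsymbol{\Sigma}_x\mathbf{b},\\
\boldsymbol{\Sigma}_x = \sigma_x^2\mathbf{I}
  &\;\Rightarrow\; \operatorname{Var}(\hat{y}) = \sigma_x^2\,\lVert\mathbf{b}\rVert_2^2,\\
\boldsymbol{\Sigma}_x = \operatorname{diag}(\sigma_1^2,\dots,\sigma_w^2)
  &\;\Rightarrow\; \operatorname{Var}(\hat{y}) = \sum_{j=1}^{w}\sigma_j^2\, b_j^2 .
\end{aligned}
```

In the heteroscedastic case each coefficient is weighted by the noise variance at its own wavelength, so two regression vectors with the same 2-norm can have different prediction variances when their non-zero coefficients sit at different wavelengths.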

The method of TR is beginning to prove itself as a versatile tool in multivariate calibration with application to a wide range of problems. As well as using TR for basic calibration with the 2-norm on the regression vector,14,15 TR approaches have been used for removing unwanted spectral artifacts (with additional on-going investigations in our laboratory),12,20,21,54 calibration transfer (with additional on-going investigations in our laboratory),55,56 smoothing X,57 smoothing the regression vector,20 and converting raw measured spectra to derivative spectra.58 The results of this study add to this list of TR uses and other applications with the 1-norm on the regression vector.

ACKNOWLEDGMENT

This material is based upon work supported by the National Science Foundation under Grant No. CHE 0400034 and is gratefully acknowledged by the authors.

1. T. Næs, T. Isaksson, T. Fearn, and T. Davies, A User Friendly Guide to Multivariate Calibration and Classification (NIR Publications, Chichester, 2002).
2. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, New York, 2001).
3. P. J. de Groot, H. Swierenga, G. J. Postma, W. J. Melssen, and L. M. C. Buydens, Appl. Spectrosc. 57, 642 (2003), and references therein.
4. J. Jiang, R. J. Berry, H. W. Siesler, and Y. Ozaki, Anal. Chem. 74, 3555 (2003), and references therein.
5. J. H. Kalivas and P. J. Gemperline, in Practical Guide to Chemometrics, P. J. Gemperline, Ed. (CRC Press, Boca Raton, FL, 2006), 2nd ed., and references therein.
6. M. L. Griffiths, D. Svozil, P. Worsfold, S. Denham, and E. H. Evans, J. Anal. At. Spectrom. 17, 800 (2002), and references therein.
7. H. Mark, Appl. Spectrosc. 42, 1427 (1988).
8. H. Mark, Principles and Practice of Spectroscopic Calibration (John Wiley and Sons, New York, 1991).
9. J. G. Topliss and R. P. Edwards, J. Medicinal Chem. 22, 1238 (1979).
10. R. Leardi and A. L. Gonzalez, Chemom. Intell. Lab. Syst. 41, 195 (1998).
11. N. M. Faber, J. Chemom. 13, 185 (1999).
12. J. H. Kalivas, Anal. Chim. Acta 505, 9 (2004).
13. P. C. Hansen, Rank Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion (SIAM, Philadelphia, PA, 1998).
14. J. H. Kalivas and R. L. Green, Appl. Spectrosc. 55, 1645 (2001).
15. J. B. Forrester and J. H. Kalivas, J. Chemom. 18, 372 (2004).
16. J. Ye, J. Am. Stat. Assoc. 93, 120 (1998).
17. H. A. Seipel and J. H. Kalivas, J. Chemom. 18, 306 (2004).
18. H. A. Seipel and J. H. Kalivas, J. Chemom. 19, 64 (2005).
19. A. Dax, SIAM J. Optimization 2, 602 (1992).
20. F. Stout and J. H. Kalivas, J. Chemom., paper in press (2006).
21. R. DiFoggio, J. Chemom. 19, 203 (2005).
22. K. Faber and B. R. Kowalski, J. Chemom. 11, 181 (1997).
23. K. Faber and B. R. Kowalski, Chemom. Intell. Lab. Syst. 34, 283 (1996).
24. N. M. Faber, X. H. Song, and P. K. Hopke, Trends Anal. Chem. 22, 330 (2003).
25. J. A. Fernandez Pierna, L. Jin, F. Wahl, N. M. Faber, and D. L. Massart, Chemom. Intell. Lab. Syst. 65, 281 (2003).
26. C. L. Lawson and R. J. Hanson, Solving Least Squares Problems (SIAM, Philadelphia, PA, 1995).
27. R. Tibshirani, J. R. Statist. Soc. B 58, 267 (1996).
28. J. F. Claerbout and F. Muir, Geophysics 38, 826 (1973).
29. H. L. Taylor, S. C. Banks, and J. F. McCoy, Geophysics 44, 39 (1979).

TABLE IV. Model results for corn data set.

Subset  Model^a             RMSEC    RMSEV    R^2 (cal.)  R^2 (val.)  ||b||_1  ||b||_2  rycal  ryval   ER
Full    PCR (11)            0.0172   0.0257   0.9979      0.9961      451      74.3     0.137  0.0409  11.1
        PLS (8)             0.0152   0.0210   0.9984      0.9974      433      74.6     0.104  0.0946  14.7
        RR (6.06 × 10^-6)   0.0121   0.0310   0.9990      0.9948      405      73.9     0.102  0.0995  15.2
        TR1                 0.00417  0.0115   0.9999      0.9994      261      115      ...    ...     ...
SWR^b   MLR^c               0.00534  0.00874  0.9998      0.9996      264      120      0.148  0.0622  9.22

a Values in parentheses are the number of respective PCR and PLS basis vectors and the RR ridge value.
b SWR subset contains wavelengths 1540, 1620, 1900, 1920, 1940, 2100, 2120, 2260, and 2480 nm.
c PLS model for SWR subset is the MLR model.


30. I. E. Frank and J. H. Friedman, Technometrics 35, 109 (1993).

31. F. Santosa and W. Symes, SIAM J. Sci. Stat. Comput. 7, 1307 (1986).

32. P. C. Hansen, Num. Lin. Alg. Appl. 3, 513 (1996).

33. W. J. Fu, J. Comput. Graph. Statist. 7, 397 (1998).

34. K. Knight and W. Fu, Annals Statist. 28, 1356 (2000).

35. S. van de Geer, Math. Meth. Statist. 10, 355 (2001).

36. J. Fan and R. Li, J. Am. Stat. Assoc. 96, 1348 (2001).

37. K. Vach, Statistica Neerlandica 55, 53 (2001).

38. H. Ojelund, H. Madsen, and P. Thyregod, J. Chemom. 15, 497 (2001).

39. H. Ojelund, H. Madsen, P. J. Brown, and P. Thyregod, Technometrics 44, 369 (2002).

40. I. G. Chong and C. H. Jun, Chemom. Intell. Lab. Syst. 78, 103 (2005).

41. S. Weisberg, Applied Linear Regression (John Wiley and Sons, New York, 2005), 3rd ed.

42. H. van der Voet, J. Chemom. 13, 195 (1999).

43. K. Baumann and N. Stiefl, J. Computer-Aided Molecular Design 18, 549 (2004).

44. K. Baumann, H. Albert, and M. von Korff, J. Chemom. 16, 339 (2002).

45. K. Baumann, M. von Korff, and H. Albert, J. Chemom. 16, 351 (2002).

46. K. Baumann, Trends Anal. Chem. 22, 395 (2003).

47. J. H. Kalivas, Chemom. Intell. Lab. Syst. 37, 255 (1997).

48. B. M. Wise, N. B. Gallagher, R. Bro, and J. M. Shaver, PLS_Toolbox 3.0 for use with MATLAB (Eigenvector Research, Manson, WA, 2003).

49. A. Lorber and B. R. Kowalski, J. Chemom. 2, 67 (1988).

50. H. Zou and T. Hastie, J. R. Statist. Soc. B 67, 301 (2005).

51. B. A. Turlach, W. N. Venables, and S. Wright, Technometrics 47, 349 (2005).

52. J. H. Kalivas, J. Chemom. 13, 111 (1999).

53. J. M. Brenchley, U. Horchner, and J. H. Kalivas, Appl. Spectrosc. 51, 689 (1997).

54. F. Stout and J. H. Kalivas, Anal. Lett., paper in press (2006).

55. M. Westerhaus, in Proc. Third Intl. Conf. Near Infrared Spectrosc., R. Biston and N. Bartiaux-Thill, Eds. (Agricultural Research Centre Publishing, Gembloux, Belgium, 1990), pp. 671–674.

56. P. Tillman, T. Reinhardt, and C. Paul, J. Near Infrared Spectrosc. 8, 103 (2000).

57. P. H. C. Eilers, Anal. Chem. 75, 3631 (2003).

58. Y. L. Yeow and Y. K. Leong, Appl. Spectrosc. 59, 584 (2005).
