A Primer on Multivariate Calibration
Edward V. Thomas, Sandia National Laboratories

Calibration methods allow one to relate instrumental measurements to analytes of interest in industrial, environmental, and biological materials

For centuries the practice of calibration has been widespread throughout science and engineering. The modern application of calibration procedures is very diverse. For example, calibration methods have been used in conjunction with ultrasonic measurements to predict the gestational age of human fetuses (1), and calibrated bicycle wheels have been used to measure marathon courses (2).

Within analytical chemistry and related areas, the field of calibration has evolved into a discipline of its own. In analytical chemistry, calibration is the procedure that relates instrumental measurements to an analyte of interest. In this context, calibration is one of the key steps associated with the analyses of many industrial, environmental, and biological materials. Increased capabilities resulting from advances in instrumentation and computing have stimulated the development of numerous calibration methods. These new methods have helped to broaden the use of analytical techniques (especially those that are spectroscopic in nature) for increasingly difficult problems.
In the simplest situations, models such as y = a + x·b have been used to express the relationship between a single measurement (y) from an instrument (e.g., absorbance of a dilute solution at a single wavelength) and the level (x) of the analyte of interest. Typically, instrumental measurements are obtained from specimens in which the amount (or level) of the analyte has been determined by some independent and inherently accurate assay (e.g., wet chemistry). Together, the instrumental measurements and results from the independent assays are used to construct a model (e.g., estimate a and b) that relates the analyte level to the instrumental measurements. This model is then used to predict the analyte levels associated with future samples based solely on the instrumental measurements.
In the past, data acquisition and analysis were often time-consuming, tedious activities in analytical laboratories. The advent of high-speed digital computers has greatly increased data acquisition and analysis capabilities and has provided the analytical chemist with opportunities to use many measurements (perhaps hundreds) for calibrating an instrument (e.g., absorbances at multiple wavelengths). To take advantage of this technology, however, new methods (i.e., multivariate calibration methods) were needed for analyzing and modeling the experimental data. The purpose of this Report is to introduce several evolving multivariate calibration methods and to present some important issues regarding their use.

0003-2700/94/0366-795A/$04.50/0 © 1994 American Chemical Society
Analytical Chemistry, Vol. 66, No. 15, August 1, 1994, 795 A

Report
Univariate calibration

To understand the evolution of multivariate calibration methods, it is useful to review univariate calibration methods and their limitations. In general, these methods involve the use of a single measurement from an instrument such as a spectrometer for the determination of an analyte. This indirect measurement can have significant advantages over gravimetric or other direct measurements. Foremost among these advantages is the reduction in sample preparation (e.g., chemical separation) that is often required with the use of direct methods. Thus, indirect methods, which can be rapid and inexpensive, have replaced a number of direct methods.
The role of calibration in these analyses is embodied in a two-step procedure: calibration and prediction. In the calibration step, indirect instrumental measurements are obtained from specimens in which the amount of the analyte of interest has been determined by an inherently accurate independent assay. The set of instrumental measurements and results from the independent assays, collectively referred to as the calibration set or training set, is used to construct a model that relates the amount of analyte to the instrumental measurements.
For example, in determining Sb concentration by atomic absorption spectroscopy (AAS), the absorbances of a number of solutions (with known concentrations of Sb) are measured at a strongly absorbing line of elemental Sb (e.g., 217.6 nm). A model relating absorbance and Sb concentration is generated. In this case, model development is straightforward, because
Beer's law can be applied. In other situations, the model may be more complex and lack a straightforward theoretical basis. In general, this step is the most time-consuming and expensive part of the overall calibration procedure because it involves the preparation of reference samples and modeling.
Next, the indirect instrumental measurement of a new specimen (in combination with the model developed in the calibration step) is used to predict its associated analyte level. This prediction step is illustrated in Figure 1, which shows Sb determination by AAS. Usually, this step is repeated many times with new specimens using the model developed in the calibration step.
Even in the simplest case of univariate calibration, when there is a linear relationship between the analyte level (x) and instrumental measurement (y), modeling can be done in different ways. In one approach, often referred to as the classical method, the implied statistical model is

y_i = b_1 · x_i + e_i    (1)
where x_i and y_i are the analyte level and instrument measurement associated with the ith of n specimens in the calibration set. The measurement error associated with y_i is represented by e_i. To simplify this discussion, an intercept is not included in Equation 1. In the calibration step, the model parameter, b_1, is usually estimated by least-squares regression of the instrument measurements on the reference values associated with the specimens composing the calibration set. The estimate of b_1 can be expressed as b̂_1 = (x^T x)^(-1) x^T y, where x = (x_1, x_2, ..., x_n)^T and y = (y_1, y_2, ..., y_n)^T. In this article, the "hat" symbol over a quantity is used to denote an estimate (or prediction) of that quantity. The predicted analyte level associated with a new specimen is x̂ = y*/b̂_1, where y* is the observed measurement associated with the new specimen.

Figure 1. Prediction of the Sb concentration of a new specimen. The calibration model (solid line, derived from the calibration set [dots]) relating the absorbance at 217.6 nm to the Sb concentration and the absorbance of the new specimen are used for prediction.
In another approach, often referred to as the inverse method, the implied statistical model is

x_i = b_2 · y_i + e_i    (2)

where e_i is assumed to be the measurement error associated with the reference value x_i. In the calibration step, the model parameter, b_2, is estimated by least-squares regression of the reference values on the instrument measurements (i.e., b̂_2 = (y^T y)^(-1) y^T x). In the prediction step, x̂ = b̂_2 · y*. In general, predictions obtained by the classical and inverse methods will be different. However, in many cases, these differences will not be important. In the literature, there has been an ongoing debate about which method is preferred (3, 4). When calibrating with a single measurement, the inverse method may be preferred if the instrumental measurements are precise (e.g., as in near-IR spectroscopy).
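As a concrete sketch of the two approaches, the following uses invented absorbance values for a hypothetical Sb calibration set; both the classical and inverse slopes are simple closed-form least-squares estimates:

```python
import numpy as np

# Hypothetical calibration set: reference Sb concentrations (ppm) and
# measured absorbances. All numbers are invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])            # reference analyte levels
y = np.array([0.011, 0.019, 0.032, 0.041, 0.049])  # absorbances

# Classical method: regress y on x through the origin, b1_hat = (x'x)^-1 x'y
b1_hat = (x @ y) / (x @ x)

# Inverse method: regress x on y through the origin, b2_hat = (y'y)^-1 y'x
b2_hat = (y @ x) / (y @ y)

# Prediction for a new specimen with absorbance y_star
y_star = 0.025
x_classical = y_star / b1_hat     # classical: x_hat = y*/b1_hat
x_inverse = b2_hat * y_star       # inverse:   x_hat = b2_hat * y*

print(x_classical, x_inverse)     # similar, but not identical
```

With precise measurements the two slopes nearly invert one another, so the predictions differ little; the gap widens as measurement noise grows.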
The point of this discussion isn't to recommend one method over the other; it is to show that, even in this relatively simple situation, different approaches exist. However, the breadth of applicability of these univariate methods is limited.
For example, let us reconsider the determination of Sb concentration by AAS. Suppose the specimens to be analyzed contain Pb. It is well known that Pb has a strongly absorbing spectral line at 217.0 nm, which is quite close to the primary Sb line at 217.6 nm (5). There are important ramifications of this fact. If an analyst fails to recognize the presence of Pb, the application of univariate calibration using the 217.6-nm line can result in inaccurate predictions for Sb because of the additional absorbance attributable to Pb (see Figure 2). If Pb is recognized as a possible interference, the usual approach is to move to a less intense spectral line for Sb (e.g., 231.2 nm); however, one can expect a poorer detection limit and degraded analytical precision.

Figure 2. Effect of the presence of Pb on predicting Sb concentration.

The preceding example exposes a fundamental weakness of univariate methods: In the absence of an effective method for separating the analyte from the interferences, there is a need for measurements that are highly selective for the analyte of interest. Without a selective measurement, univariate methods may produce unreliable predictions. Furthermore, on the basis of a single measurement it is impossible to detect the presence of unknown interferences or to know when predictions are unreliable.
To apply univariate methods successfully, an analyst will often need a great deal of specific knowledge about the chemical system to be analyzed. Furthermore, unless the system is relatively simple (or unless the analyst has substantial knowledge of the subject matter), a selective measurement with an appropriate level of sensitivity will usually be hard to find.
On the positive side, univariate calibration can often be used with reasonable success in applications where selective measurements can be found (as in AAS) or when the analyte can be effectively separated from interferences. In such cases, the simplicity of a univariate method offers a significant advantage. Even in these ideal settings, however, care is needed to maintain the reliability of predictions based on univariate calibration.
Given the data-rich environment in modern laboratories, numerous selective measurements might be available for analysis. To use univariate methods, it is necessary to specify a single measurement or condense the multiple measurements to a single entity, such as peak area. For example, suppose IR absorption spectroscopy is chosen to determine trace levels of water vapor by using the spectral region displayed in Figure 3. Furthermore, suppose that the water vapor is the only absorbing species in the optical path in this spectral region. A common univariate approach would be to select the single wavelength that exhibits the strongest signal for the analyte, in this case at about 2595 nm. This procedure might form the basis for a usable prediction model. However, as we shall see in the next section, the use of measurements from many wavelengths (in conjunction with multivariate calibration methods) can provide more precise predictions.
In the event that selective measurements are not available (which is frequently the case), univariate methods will not be reliable. For example, in the analysis of multicomponent biological materials by near-IR spectroscopy, the spectral responses of the components frequently overlap and selective measurements for the analyte of interest are unavailable. Modern methods need to be reliable, rapid, and precise—even for difficult applications such as quality control, process monitoring, environmental monitoring, and medical diagnosis, which are important in the chemical, pharmaceutical, oil, microelectronics, and medical industries.
The nature of these applications, which frequently requires in situ, noninvasive, or nondestructive analyses, precludes the use of tedious sample preparation to obtain highly selective measurements or separation. The result is that the materials to be analyzed by analytical instruments are often quite complex and involve a very large number of chemical components, some of which may be unknown. The combination of complex materials and the need for rapid, reliable, accurate, and precise determinations has motivated researchers to develop and use multivariate calibration methods.
Multivariate calibration

In the agricultural and food industries, multivariate calibration methods are used with spectral data to determine protein in wheat (6), water in meat (7), and fat in fish (7). In the manufacturing industries, multivariate calibration methods are often used in process monitoring applications, including the fabrication of semiconductor devices (8). In medical applications, significant developments are being made in producing reagentless and noninvasive instruments for analyzing blood components (9-11). Additional examples have been reviewed by Brown et al. (12).
As is true with univariate calibration, multivariate calibration consists of the calibration step and the prediction step. In the calibration step, multiple instrumental measurements are obtained from numerous specimens. These measurements could be the absorbances of each specimen at each of a number of wavelengths. As with univariate calibration, the level of the analyte in each specimen is determined by independent assay.
By using multiple measurements it is sometimes possible to determine multiple components of interest simultaneously. Often, however, there is only a single analyte of interest. The multivariate instrumental measurements and results from the independent assays form the calibration set and are used to model the level of the analyte.
In the prediction step, the model and the multivariate instrumental measurements of new specimens are used to predict the analyte levels associated with the new specimens. Often, the predicted analyte value for each new specimen (x̂) is obtained by evaluating a particular linear combination of the available instrumental measurements (y_1, y_2, ..., y_q), that is

x̂ = a_0 + a_1·y_1 + a_2·y_2 + ... + a_q·y_q    (3)
Individual calibration methods differ in the values of the coefficients (a_0, a_1, ..., a_q) used to form x̂. More fundamentally, the method used (in the calibration step) to obtain the coefficients in Equation 3 distinguishes the different calibration methods. In some methods (e.g., multiple linear regression, MLR), the number of instrumental measurements (q) that can be used on the right-hand side of Equation 3 is constrained to be no greater than the number of specimens in the calibration set (n). In other methods (e.g., principal components regression, PCR, and partial least squares, PLS), the number of instrumental measurements is unrestricted.
The evolution of multivariate calibration methods has been motivated by the continuing desire to solve increasingly difficult problems. Many advances have occurred in conjunction with the use of spectral methods. Therefore, although the methods can be applied to areas outside spectroscopy, it is convenient to describe them in this context.
Methods such as PCR and PLS, in which an unlimited number of measurements may be used, are often referred to as full-spectrum methods. In the rest of this section, some of the more common calibration methods (all producing predictions from linear combinations of instrumental measurements, as in Equation 3) will be described and illustrated by spectral data analysis. Strengths and weaknesses of each method will be discussed, and the breadth of application for each will be emphasized.
Figure 3. Absorbance spectra of (a) water vapor at 1 ppm concentration (estimated) and (b) a new specimen.

Classical least-squares method. CLS is based on an explicit causal (or hard) model that relates the instrumental measurements to the level of the analyte of interest (and often the levels of interfering components) via a well-understood mechanism (e.g., Beer's law). In the simplest case, CLS uses a linear model (an extension of Equation 1) that relates a single analyte to q instrumental measurements that are selective for that analyte. For example, for the ith specimen,
y_ij = b_j · x_i + e_ij, for j = 1, 2, ..., q    (4)
where e_ij is the measurement error associated with the jth measurement on the ith specimen, y_ij. As in Equation 1, note that an intercept term is not included in this model for the sake of simplicity. The appropriate method for estimating the model parameters, b = (b_1, b_2, ..., b_q)^T, in the calibration step depends somewhat on the nature of the measurement errors. In the prediction step, a linear combination of the measurements from a new specimen in conjunction with the estimated model parameters, b̂ = (b̂_1, b̂_2, ..., b̂_q)^T, is used to predict the analyte level. If the errors across measurements are independent, it is appropriate to express the predicted analyte level as

x̂ = a_1·y_1 + a_2·y_2 + ... + a_q·y_q    (5)

where a_j = b̂_j / (b̂^T b̂). In essence, the predicted value provided by Equation 5 is the least-squares estimate of the slope of the relationship, through the origin, among the various (b̂_j, y_j) pairs. That is, the predicted value is obtained by regressing the y_j's on the b̂_j's.
To illustrate this graphically, let's revisit the example associated with Figure 3a. The spectrum displayed in Figure 3a represents the estimated absorbance spectrum of water in the gas phase at 1 ppm concentration, (b̂_1, b̂_2, ..., b̂_q), that was obtained from the calibration step. Suppose we wish to predict the water concentration of a new specimen that exhibits the absorbance spectrum (y_1, y_2, ..., y_q) illustrated in Figure 3b. The prediction of the water concentration of this new specimen can be visualized by plotting the (b̂_j, y_j) pairs (Figure 4). The predicted value of the water concentration of this new specimen is x̂ = 0.457 ppm, or the estimated slope of the (b̂_j, y_j) relationship.
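The prediction step just described (the least-squares slope, through the origin, of the new spectrum on the estimated unit-concentration spectrum) is one line of algebra. A minimal sketch with an invented six-point "spectrum" standing in for Figure 3a:

```python
import numpy as np

rng = np.random.default_rng(0)

# b_hat: estimated absorbance spectrum at unit (1 ppm) concentration,
# as produced by the calibration step. Values are invented.
b_hat = np.array([0.5, 1.2, 2.3, 1.8, 0.9, 0.4])

# Simulate a new specimen at 0.457 ppm (the article's example value)
true_conc = 0.457
y_new = true_conc * b_hat + rng.normal(0.0, 0.01, b_hat.size)

# Predicted concentration: slope through the origin of y_j on b_hat_j
x_hat = (b_hat @ y_new) / (b_hat @ b_hat)
print(round(x_hat, 3))            # close to 0.457
```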
The reliability and precision of this prediction can be assessed by examining the nature and magnitude of the scatter in the relationship among the (b̂_j, y_j) pairs. In this case, the strong linear relationship among the (b̂_j, y_j) pairs and the constant level of scatter throughout the range of the b̂_j's indicate a reliable prediction. If many of the (b̂_j, y_j) pairs had deviated significantly from the typical relationship among the pairs, one would suspect that an unaccounted interfering species or some other unexpected phenomenon influenced the spectrum of the new specimen. Hence, the selectivity of the measurements and the reliability of the prediction would be questioned.
In the case of univariate calibration, there would be only one (b̂_j, y_j) pair and hence no ability to discover unusual behavior. Thus, one important advantage of using multivariate versus univariate methods is the ability to assess the reliability of predictions and identify outliers.
For our purposes, an outlier is a specimen (a member of the calibration set or a new specimen) that exhibits some form of discordance with the bulk of the specimens in the calibration set. In the calibration set, an outlier specimen could result from an unusually large error in a reference determination. During prediction, a new specimen would be considered an outlier if it contained a chemical component (which affects the instrumental measurements) that was not present in the specimens composing the calibration set.
The consequences of failing to detect an outlier differ, depending on whether the outlier is in the calibration or the prediction set. When outliers are present in the calibration set, the result will likely be a poor model. The performance of such models during prediction will often be adversely affected. When a prediction set specimen is an outlier, the predicted analyte value for that specimen may differ significantly from the true unknown analyte value. Thus, it is very important to identify outliers.
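One simple diagnostic in this spirit (an illustrative sketch, not a procedure prescribed by the article) is to flag a new specimen when any of its spectral residuals, after the CLS fit, is large relative to the robust scatter of the residuals:

```python
import numpy as np

def cls_predict_with_check(b_hat, y, z_limit=3.5):
    """Predict the analyte level by CLS (least-squares slope through the
    origin of y_j on b_hat_j) and flag a possible outlier when one
    spectral residual is large relative to the robust scatter (MAD) of
    the residuals. An invented diagnostic for illustration only."""
    x_hat = (b_hat @ y) / (b_hat @ b_hat)
    resid = y - x_hat * b_hat
    mad = np.median(np.abs(resid - np.median(resid)))
    scale = 1.4826 * mad + 1e-12        # MAD rescaled to estimate a std dev
    return x_hat, bool(np.max(np.abs(resid)) / scale > z_limit)

b_hat = np.array([0.5, 1.2, 2.3, 1.8, 0.9, 0.4])   # hypothetical pure spectrum
clean = 0.5 * b_hat                                 # consistent with the model
tainted = clean.copy()
tainted[2] += 1.0                                   # unmodeled interference

print(cls_predict_with_check(b_hat, clean)[1])      # False
print(cls_predict_with_check(b_hat, tainted)[1])    # True
```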
For many multivariate methods, a number of diagnostics are available (13). Furthermore, the use of multivariate, rather than univariate, methods offers the potential for significantly improving the precision of the predicted analyte values. The example depicted in Figure 3a demonstrates this potential. If, for example, the whole spectral region in Figure 3a is used to model the water concentration, the precision of the predictions generated by using Equations 4 and 5 can be expected to be about four times better than if measurements at a single wavelength (2595 nm) are used with the classical univariate method. That is, the standard deviation of repeated determinations will be about four times smaller for the multivariate method. However, this gain in efficiency will be realized in practice only if the precision of the reference method is sufficiently good.

Figure 4. Absorbance of the new specimen (y axis) versus the estimated absorbance of water vapor at 1 ppm concentration by wavelength (x axis). Each point represents a (b̂_j, y_j) pair.
The CLS method can be further generalized to include multiple components, among them other analytes or interferences (14). In this case the underlying model given in Equation 4 is expanded to account for the effects of other analytes or interferences. The primary advantage of this more general approach is that it does not require selective measurements. However, the successful use of this or any other method based on an explicit model depends on complete knowledge of the chemical system to be analyzed. For example, all interferences must be known; furthermore, with regard to each specimen in the calibration set, the levels of all analytes and interferences must be known.
These requirements greatly restrict the applicability of any method (such as CLS) that is based on an explicit causal model, especially for the analysis of complex materials. However, if an explicit causal model can be developed, a method such as CLS can be quite valuable; it may provide a reasonable basis for extrapolation and understanding of the uncertainty in the predicted values of the analyte (15).
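When the explicit model is available, the multicomponent extension is compact: the calibration step estimates the pure-component spectra from specimens with fully known composition, and the prediction step fits a new spectrum as a mixture of those spectra. A sketch with simulated data (all matrices invented):

```python
import numpy as np

# Sketch of generalized (multicomponent) CLS on simulated data.
rng = np.random.default_rng(1)
q, m, n = 8, 2, 6                       # wavelengths, components, specimens

K_true = rng.uniform(0.1, 2.0, (m, q))  # pure-component spectra (hypothetical)
C = rng.uniform(0.0, 1.0, (n, m))       # known concentrations, calibration set
Y = C @ K_true + rng.normal(0, 0.001, (n, q))

# Calibration: estimate the pure spectra, K_hat = (C'C)^-1 C'Y
K_hat = np.linalg.solve(C.T @ C, C.T @ Y)

# Prediction: least-squares fit of a new spectrum to the estimated spectra
c_new = np.array([0.3, 0.7])
y_new = c_new @ K_true + rng.normal(0, 0.001, q)
c_hat = np.linalg.solve(K_hat @ K_hat.T, K_hat @ y_new)
print(np.round(c_hat, 2))               # close to [0.3, 0.7]
```

Note that the calibration step requires the full composition matrix C; this is exactly the "complete knowledge" requirement discussed above.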
Multiple linear regression. In many applications, analysts lack the knowledge required to use an explicit causal model and instead use methods involving empirically determined (or soft) models that
are driven by correlation rather than causation. Empirical models often relate the characteristic of interest to simple functions (e.g., low-order polynomials) of the instrumental measurements. In calibration problems in analytical chemistry, particularly spectroscopy, the simple functions often consist of linear combinations of instrumental measurements. For example, calibration methods relating the concentration of an analyte to a linear combination of measurements (e.g., absorbance) from several wavelengths were introduced to develop models for a broader range of conditions. Specifically, such methods were designed to be used when the spectral features of the analyte overlap with features of other, perhaps unknown components in the material to be analyzed. These methods, which are based on a generalization of the univariate inverse model (see Equation 2), are of the form
x_i = b_0 + b_1·y_i1 + b_2·y_i2 + ... + b_q·y_iq + e_i    (6)

where y_ij is the jth measurement associated with the ith specimen.
Various methods exist for estimating the model parameters (b_0, b_1, ..., b_q) using data from the calibration set. One method uses MLR (16). The resulting parameter estimates (b̂_0, b̂_1, ..., b̂_q) are used to predict the analyte levels of new specimens using Equation 3, with a_j = b̂_j. Unlike the generalized CLS method, MLR and other soft modeling methods do not explicitly require knowledge of the levels of interferences and other analytes for specimens in the calibration set. A judicious choice of instrumental measurements can compensate for interferences.
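A small simulation illustrates the point: even though each of two measurement channels responds to both the analyte and an interferent, the inverse-model MLR fit (Equation 6) learns coefficients that cancel the interference. The response coefficients and noise levels below are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
analyte = rng.uniform(0, 1, n)          # reference values, calibration set
interferent = rng.uniform(0, 1, n)      # present but never measured directly

# Two channels with overlapping responses plus measurement noise
y1 = 1.0 * analyte + 0.6 * interferent + rng.normal(0, 0.01, n)
y2 = 0.1 * analyte + 0.9 * interferent + rng.normal(0, 0.01, n)

# MLR: estimate (b0, b1, b2) by least squares from the calibration set
Y = np.column_stack([np.ones(n), y1, y2])
b, *_ = np.linalg.lstsq(Y, analyte, rcond=None)

# Prediction for a new specimen (analyte 0.4, interferent 0.8)
y_new = np.array([1.0, 1.0*0.4 + 0.6*0.8, 0.1*0.4 + 0.9*0.8])
x_hat = y_new @ b
print(round(x_hat, 2))                  # near 0.4 despite the interferent
```

The interferent's level was never supplied to the fit; its effect is removed because the second channel carries the information needed to cancel it, just as in the oximeter example that follows.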
For example, transmission (or reflectance) spectroscopy involving two wavelengths in the vis/near-IR region is used to noninvasively monitor the level of oxygen saturation of arterial blood in a surgical or critical care environment (17). One wavelength (in the red portion of the spectrum) is sensitive to the pulsatile blood volume and oxygen saturation of the blood. A second wavelength (in the near-IR region), which is insensitive to the level of oxygen saturation, provides a measure of the pulsatile blood volume. From the total signal associated with the
first wavelength, the contribution of the interfering phenomenon (blood volume) can be effectively removed by using information obtained from the second wavelength. The resulting signal can provide a useful measure of oxygen saturation.
The number of measurements (q) that can be used with MLR is often severely restricted; it is usually in the range of 2-10, depending on the complexity of the materials being analyzed. The strong correlation among the instrumental measurements can introduce instability among the resulting model parameter estimates. Because MLR cannot use a large number of correlated measurements, selecting the appropriate set of instrumental measurements is important. If available, as in the case of the noninvasive oximeter, specific knowledge of the way in which the analyte and interfering components in the sample material affect the instrumental measurements can be used to select instrumental measurements. In the absence of specific information, it is necessary to use empirical selection methods.
PLS, PCR, and related methods. Recently, soft-model-based methods (including PLS and PCR), in which a very large number of instrumental measurements are used simultaneously, have been successfully applied to analytical chemistry, particularly spectroscopy. Stone and Brooks (18) showed that PLS and PCR belong to a general class of statistical methods referred to collectively as continuum regression. In general, the assumed model for all such methods is of the form

x_i = b_0 + b_1·t_i1 + b_2·t_i2 + ... + b_h·t_ih + e_i    (7)

where t_ik is the kth score associated with the ith sample. Each score consists of a linear combination of the original measurements; that is

t_ik = γ_k1·y_i1 + γ_k2·y_i2 + ... + γ_kq·y_iq    (8)

Data from the calibration set are used to obtain the set of coefficients {γ_kj}; the estimated model parameters (b̂_0, b̂_1, ..., b̂_h); and the model size, given by the meta-parameter h, which is usually considerably smaller than the number of measurements, q. In general, the coefficients {γ_kj} are obtained in a manner such that the vectors (t_1k, t_2k, ..., t_nk) and (t_1m, t_2m, ..., t_nm), for k ≠ m, are orthogonal. Thus, the resulting parameter estimates {b̂_0, b̂_1, ..., b̂_h} are stable. During the prediction step, the predicted analyte value for a new specimen is given by

x̂ = b̂_0 + b̂_1·t̂_1 + b̂_2·t̂_2 + ... + b̂_h·t̂_h    (9)

where t̂_k = γ_k1·y_1 + γ_k2·y_2 + ... + γ_kq·y_q. Thus, x̂ is simply a linear combination of the instrumental measurements associated with the new specimen (i.e., x̂ = a_0 + a_1·y_1 + a_2·y_2 + ... + a_q·y_q).
What differentiates these methods from one another is the set of coefficients {a_j} that is used and the way in which the {a_j} are obtained. One important difference between PLS and PCR is the manner in which the {γ_kj} coefficients are obtained. In PCR, the instrumental measurements of the calibration set are used exclusively to obtain the {γ_kj} coefficients; in PLS, the analyte values of the calibration set are used as well.
Although it is beyond the scope of this article to explain precisely how the {γ_kj} and {a_j} coefficients are obtained for these methods, detailed information is available (13, 14, 19, 20). In addition, References 21 and 22 provide comparisons of competing calibration methods. In these and other comparative studies, the prediction performances of PCR and PLS were generally found to be quite similar over a broad range of conditions. Rather than delve into how these methods differ, we will focus on common issues that are critical to the successful use of soft-model-based approaches such as PLS and PCR.
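To make the score-based model concrete, here is a minimal PCR sketch on simulated spectra: the {γ_kj} are taken from the leading principal components of the centered calibration measurements, and the centered analyte values are regressed on the resulting scores. (PLS would choose the γ's using the analyte values as well; this sketch is PCR only, with invented data.)

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, h = 30, 50, 2                     # specimens, wavelengths, model size

# Simulated spectra: two latent spectral factors, one driven by the analyte
analyte = rng.uniform(0, 1, n)
other = rng.uniform(0, 1, n)
s1, s2 = rng.normal(0, 1, q), rng.normal(0, 1, q)
Y = np.outer(analyte, s1) + np.outer(other, s2) + rng.normal(0, 0.01, (n, q))

# Center, then take the h leading right singular vectors as {gamma_kj}
y_mean, x_mean = Y.mean(axis=0), analyte.mean()
Yc = Y - y_mean
_, _, Vt = np.linalg.svd(Yc, full_matrices=False)
gamma = Vt[:h]                          # h x q coefficient matrix
T = Yc @ gamma.T                        # n x h score matrix (Equation 8)

# Regress the centered analyte values on the scores (Equation 7)
b, *_ = np.linalg.lstsq(T, analyte - x_mean, rcond=None)

# Predict a calibration specimen back as a sanity check of the fit
x_hat = x_mean + ((Y[0] - y_mean) @ gamma.T) @ b
print(round(float(x_hat), 2))
```

Because the scores are built from orthogonal singular vectors, the regression on T is stable even though q is much larger than n, which is exactly the property the text attributes to these methods.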
Data pretreatment

Before methods such as PLS and PCR are used, a certain amount of data pretreatment (or transformation) is often performed. The most common and simplest type is centering the data. This operation is usually performed on both the analyte values and the instrumental measurements, taken one at a time. That is, the centered analyte values are given by x'_i = x_i - x̄, where

x̄ = (1/n) Σ_{i=1}^{n} x_i

and the centered values for the jth measurement are given by y'_ij = y_ij - ȳ_j, where

ȳ_j = (1/n) Σ_{i=1}^{n} y_ij
This operation can help make subsequent computations less sensitive to round-off and overflow problems. In addition, this operation generally reduces the size of the resulting model

x'_i = b_1·t'_i1 + b_2·t'_i2 + ... + b_h·t'_ih + e_i    (10)

where

t'_ik = γ_k1·y'_i1 + γ_k2·y'_i2 + ... + γ_kq·y'_iq    (11)

by one factor. During the prediction step, the measurements associated with a new specimen, {y_j}, are similarly translated by the amounts ȳ_j.
Less frequently, the instrumental measurements are differentially weighted. Each centered instrumental measurement is multiplied by a measurement-dependent weight, w_j (i.e., y''_ij = y'_ij · w_j). The centered and weighted instrumental measurements (y''_ij) are then used as the basis for constructing the calibration model.

The purpose of using nonuniform weighting is to modify the relative influence of each measurement on the resulting model. The influence of the jth measurement is raised by increasing the magnitude of its weight, w_j. Sometimes weighting is performed such that the standard deviations of the centered and weighted measurements are identical (autoscaling). That is, the standard deviation of (y''_1j, y''_2j, ..., y''_nj) = 1 for j = 1, 2, ..., q.
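Centering and autoscaling are a few lines of array arithmetic. The key practical point, reflected below, is that a new specimen must be transformed with the calibration set's means and weights, not its own:

```python
import numpy as np

rng = np.random.default_rng(4)
Y = rng.uniform(0, 2, (10, 4))          # 10 specimens, 4 measurements (invented)

y_mean = Y.mean(axis=0)
Yc = Y - y_mean                         # centered: y'_ij = y_ij - ybar_j

w = 1.0 / Yc.std(axis=0)                # autoscaling weights w_j
Ycw = Yc * w                            # centered and weighted: y''_ij

print(np.allclose(Ycw.std(axis=0), 1))  # True: unit std dev per column
print(np.allclose(Ycw.mean(axis=0), 0)) # True: still centered

# A new specimen is transformed with the SAME calibration-set statistics
y_new = rng.uniform(0, 2, 4)
y_new_t = (y_new - y_mean) * w
```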
Although centering promotes numerical stability during the model-building stage, differential weighting can drastically alter the form and performance of the resulting model. For example, if the weights of the least informative measurements were unilaterally increased, one would expect the model performance to suffer. On the other hand, if the weights of the most informative measurements were unilaterally increased, one would expect performance to improve.
The key to using weighting successfully is the ability to identify informative measurements. Without knowledge of the relative information content of the various measurements, weighting is akin to "shooting in the dark."
To a certain extent, differential weighting reduces to variable selection. That is, for measurements that are not selected, the associated weights are set to zero. Because full-spectrum methods (such as PCR and PLS) can use many wavelengths, the prevailing belief among spectroscopists seems to be that measurement (wavelength) selection is unnecessary; thus, they often use all available wavelengths within some broad range. However, in many applications, measurements from many spectral wavelengths are noninformative or are difficult to incorporate in a model because of nonlinearities. Whereas to some degree full-spectrum methods are able to accommodate nonlinearities, the inclusion of noninformative (or difficult) spectral measurements in a model can seriously degrade performance.
For many difficult problems, wavelength selection can greatly improve the performance of full-spectrum methods. Furthermore, in applications outside the laboratory (e.g., determination of components in an in situ setting), physical and economic considerations associated with the measurement apparatus may restrict the number of wavelengths (or measurements) that can be used. Thus, wavelength selection is very important, even when applying methods capable of using a very large number of measurements.
800 A Analytical Chemistry, Vol. 66, No. 15, August 1, 1994
Currently, few empirical procedures for wavelength selection are appropriate for use with full-spectrum methods such as PLS. Most procedures (e.g., stepwise regression) are associated with calibration methods (e.g., MLR) that are capable of using relatively few wavelengths. However, Frank and Friedman (22) showed that stepwise MLR does not seem to perform as well as PLS or PCR using all measurements.
In general, wavelength selection procedures that can be used with full-spectrum methods (e.g., the correlation plot) search for individual wavelengths that empirically exhibit good selectivity, sensitivity, and linearity for the analyte of interest over the training set (23, 24). In order for these methods to be useful, wavelengths specific to the analyte of interest with good S/N are needed. However, the required wavelength specificity is not usually available in difficult applications (e.g., analysis of complex biological materials). Procedures such as the correlation plot, which consider only the relationships between individual wavelengths and the analyte of interest, are ill-equipped for such applications. This has provided the motivation to develop methods that select instrumental measurements on the basis of the collective relationship between candidate measurements and the analyte of interest (25).
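The correlation-plot idea described above can be sketched in a few lines: compute, for each wavelength, the correlation between its measurements and the reference analyte levels over the training set. The spectra and analyte values below are illustrative.

```python
# Sketch of a "correlation plot" for wavelength selection (illustrative data).
# Rows of `spectra` are specimens; columns are wavelengths. `x` holds the
# reference analyte levels determined by an independent assay.

def correlation(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

spectra = [[0.10, 0.52, 0.91],
           [0.12, 1.01, 0.95],
           [0.09, 1.49, 0.88],
           [0.11, 2.02, 0.93]]
x = [1.0, 2.0, 3.0, 4.0]

corr_plot = [correlation([row[j] for row in spectra], x)
             for j in range(len(spectra[0]))]
# Wavelengths with large |correlation| are candidates for selection; here the
# second wavelength tracks the analyte almost perfectly, the others do not.
```

Note that this sketch illustrates exactly the limitation raised in the text: each wavelength is judged in isolation, so it cannot exploit the collective relationship among measurements.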
A number of other procedures exist for data pretreatment, primarily to linearize the relationships between the analyte level and the various instrumental measurements. This is important because of the inherently linear nature of the commonly used multivariate calibration methods. For example, in spectroscopy, optical transmission data are usually converted to absorbance before analysis. This is a natural transformation, given the underlying linear relationship (through Beer's law) between analyte concentration and absorbance.
Other pretreatment methods rely on the ordered nature of the instrumental measurements (e.g., a spectrum). In near-IR spectroscopy, instrumental measurements—which are first converted to reflectance—are often further transformed by procedures such as smoothing and differencing (derivatives). Smoothing reduces the effects of high-frequency noise throughout an ordered set of instrumental measurements, such as a spectrum. It can be effective if the signal present in the instrumental measurements has a smooth (or low-frequency) nature.
Differencing the ordered measurements mitigates problems associated with baseline shifts and overlapping features. Another technique often used in near-IR reflectance spectroscopy is multiplicative signal correction (26), which handles problems introduced by strong scattering effects. The performance of the multivariate calibration methods described earlier can be strongly influenced by data pretreatment.
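Two of the pretreatments mentioned above can be sketched compactly. Differencing is a discrete first derivative along the ordered measurements; multiplicative signal correction is sketched here in a simplified form that regresses each spectrum on the mean spectrum and corrects with the fitted offset a and slope b. Both implementations are illustrative, not the article's.

```python
# Hedged sketches of differencing and multiplicative signal correction (MSC).

def first_difference(spectrum):
    """Discrete first derivative; removes any constant baseline shift."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]

def msc(spectra):
    """Simplified MSC: fit each spectrum s ~ a + b*ref, return (s - a)/b."""
    n_wl = len(spectra[0])
    ref = [sum(s[j] for s in spectra) / len(spectra) for j in range(n_wl)]
    mr = sum(ref) / n_wl
    corrected = []
    for s in spectra:
        ms = sum(s) / n_wl
        b = (sum((r - mr) * (v - ms) for r, v in zip(ref, s))
             / sum((r - mr) ** 2 for r in ref))
        a = ms - b * mr
        corrected.append([(v - a) / b for v in s])
    return corrected

# A constant baseline shift vanishes after differencing:
assert first_difference([1.0, 2.0, 4.0]) == first_difference([11.0, 12.0, 14.0])
```

In the MSC sketch, two spectra that differ only by an offset and a multiplicative scattering factor are mapped onto the same corrected spectrum, which is exactly the scattering problem the method is meant to handle.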
Cross-validation, model size, and model validation
Cross-validation is a general statistical method that can be used to obtain an objective assessment of the magnitude of prediction errors resulting from the use of an empirically based model or rule in complex situations (27). The objectivity of the assessment is obtained by comparing predictions with known analyte values for specimens that are not used in developing the prediction model. In complex situations, it is impossible or inappropriate to use traditional methods of model assessment. In the context of multivariate calibration, cross-validation is used to help identify the optimal size (hopt) for soft-model-based methods such as PLS and PCR. In addition, cross-validation can provide a preliminary assessment of the prediction errors that are to be expected when using the developed model (of optimal size) with instrumental measurements obtained from new specimens.
The cross-validated assessment of the performance of a specific calibration model (fixed method/model size) is based on a very simple concept. First, data from the calibration set are partitioned into a number of mutually exclusive subsets (S1, S2, . . ., SV), with the ith subset (Si) containing the reference values and instrumental measurements associated with ni specimens. Next, V different models are constructed, each using the prescribed method/model size with all except one of the V available data subsets. The ith model, M−i, is constructed by using all data subsets except Si. In turn, each model is used to predict the analyte of interest for specimens whose data were not used in its construction (i.e., M−i is used to predict the specimens in Si). In a sense, this procedure, which can be computing-intensive, simulates the prediction of new specimens. A comparison of predictions obtained in this way with the known reference analyte values provides an objective assessment of the errors associated with predicting the analyte values of new specimens.
Partitioning the calibration set into the various data subsets should be done carefully. Typically, the calibration set is partitioned into subsets of size one (i.e., leave-one-out cross-validation). However, difficulty arises when replicate sets of instrumental measurements are obtained from individual specimens. Many practitioners use leave-one-out cross-validation in this situation as well. Unfortunately, what is left out one at a time is usually a single set of instrumental measurements. In this case, the cross-validated predictions associated with specimens with replicate instrumental measurements will be influenced by the replicate measurements (from the same specimen) used to construct M−i.
Such use of cross-validation does not simulate the prediction of new samples. The likely result is an optimistic assessment of prediction errors. A more realistic assessment of prediction errors would be obtained if the calibration set were partitioned into subsets in which all replicate measurements from a single specimen are included in the same subset.
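The recommended partitioning—all replicates of a specimen kept in the same subset—can be sketched as follows. The function and specimen labels are illustrative.

```python
# Sketch of specimen-grouped cross-validation subsets (illustrative names).
# All replicate measurements of one specimen are held out together, so no
# model is ever validated on replicates of specimens it was trained on.

def specimen_subsets(specimen_ids):
    """Map each distinct specimen to the row indices of its replicates."""
    subsets = {}
    for row, sid in enumerate(specimen_ids):
        subsets.setdefault(sid, []).append(row)
    return list(subsets.values())

# Three specimens, two of them measured in duplicate:
ids = ["A", "A", "B", "C", "C"]
for held_out in specimen_subsets(ids):
    train = [i for i in range(len(ids)) if i not in held_out]
    # build the model M-i on `train`, then predict the rows in `held_out`
    assert not set(train) & set(held_out)  # the two sets never overlap
```

With this partitioning, the cross-validation genuinely simulates the prediction of new specimens rather than of new replicate measurements.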
To select the optimal model size (hopt), the cross-validation procedure is performed using various values of the meta-parameter, h. For each value of h, an appropriate measure of model performance is obtained. A commonly used measure of performance is the root mean squared prediction error based on cross-validation
RMSCV(h) = [(1/n) Σi (xi − x̂i[M−i(h)])²]^1/2    (12)

where x̂i[M−i(h)] represents the predicted value of the ith specimen using a model of size h that was developed without using Si. Sometimes, to establish a baseline performance metric, RMSCV(0) is computed. For this purpose, x̂i[M−i(0)] is defined as the average analyte level in the set of all specimens with the ith specimen removed. Thus, RMSCV(0) provides a measure of how well we would predict on the basis of the average analyte level in the calibration set rather than the instrumental measurements.
Often, practitioners choose hopt as the value of h that yields the minimum value of RMSCV. The shape of RMSCV(h) in Figure 5 is quite common. When h < hopt, the prediction errors are largely a consequence of systematic effects (e.g., interferences) that are unaccounted for. When h > hopt, the prediction errors are primarily attributable to the modeling of noise artifacts (overfitting). Usually, if the model provides a high degree of predictability (as in the case illustrated by Figure 5), the errors caused by overfitting are relatively small compared with those associated with unaccounted-for systematic effects.
Figure 5. Determination of optimal PLS model size, hopt (RMSCV versus number of factors, h). The model relates near-IR spectroscopic measurements to urea concentration (mg/dL) in multicomponent aqueous solutions.
At this point, an optimal (or near-optimal) model size has been selected. RMSCV(hopt) can be used as a rough estimate of the root mean squared prediction error associated with using the selected model with new specimens. This estimate may be somewhat optimistic, given the nature of the model selection process, in which many possible models were under consideration. A more realistic assessment of the magnitude of prediction errors can be obtained by using an external data set for model validation. An ideal strategy for model selection and validation would be to separate the original calibration set into two subsets: one for model selection (i.e., determination of hopt) and one strictly for model validation. Use of this strategy would guarantee that model validation is independent of model selection.
Pitfalls
The primary difficulty associated with using empirically determined models is that they are based on correlation rather than causation. Construction of these models involves finding measurements or combinations of measurements that simply correlate well with the analyte level throughout the calibration set. However, correlation does not imply causation (i.e., a cause-and-effect relationship between the analyte level and the instrumental measurements).
Suppose we find, by empirical means, that a certain instrumental measurement correlates well with the analyte level throughout the calibration set. Does this mean that the analyte level affects that particular instrumental measurement? Not necessarily. Consider Figure 6, which displays a hypothetical relationship between the reference analyte level and the order of measurement (run order) for specimens in the calibration set. Because of the strong relationship between analyte level and run order, it is difficult to separate their effects on the instrumental measurements. Thus, the effects of analyte level and run order are said to be confounded. In this case, simple instrument instability could generate a strong but misleading correlation between analyte level and an instrumental measurement. Fortunately, a useful countermeasure for this type of confounding exists: randomization of the run order with respect to analyte level.
Often, however, more subtle confounding patterns exist. For instance, in a multi-component system, the analyte level may be correlated with the levels of other components or a physical phenomenon such as temperature. In such situations it may be difficult to establish whether, in fact, the model is specific to the analyte of interest. In a tightly controlled laboratory study, where the sample specimens can be formulated by the experimenter, it is possible to design the calibration set (with respect to component concentrations) so that the levels of different components are uncorrelated. However, this countermeasure does not work when an empirical model is being used to predict analyte levels associated with, for example, complex industrial or environmental specimens. In such situations, one rarely has complete knowledge of the components involved, not to mention the physical and chemical interactions among components.
The validity of empirically based models depends heavily on how well the calibration set represents the new specimens in the prediction set. All phenomena (with a chemical, physical, or other basis) that vary in the prediction set and influence the instrumental measurements must also vary in the calibration set over ranges that span the levels of the phenomena occurring in the prediction set. Sometimes the complete prediction set is at hand before the calibration takes place. In such cases, the calibration set can be obtained directly by sampling the prediction set (28). Usually, however, the complete prediction set is not available at the time of calibration, and an unusual (or unaccounted for) phenomenon may be associated with some of the prediction specimens. Outlier detection methods represent only a limited countermeasure against such difficulties; valid predictions for these problematic specimens cannot be obtained.
Sources of prediction errors
Ultimately, the efficacy of an empirical calibration model depends on how well it predicts the analyte level of new specimens that are completely external to the development of the model. If the reference
Figure 6. Relationship between the reference analyte level and run order of the calibration experiment.
values associated with m new specimens (or specimens from an external data set) are available, a useful measure of model performance is given by the standard error of prediction

SEP = [(1/m) Σi (x̂i − xi)²]^1/2    (13)
If these m new specimens comprise a random sample from the prediction set spanning the range of potential analyte values (and interferences), the SEP can provide a good measure of how well, on average, the calibration model performs. Often, however, the performance of the calibration model varies, depending on the analyte level.
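Equation 13 can be sketched directly; the predicted and reference values below are illustrative stand-ins for an external validation set.

```python
# Sketch of Equation 13: standard error of prediction over m new specimens
# (illustrative data; in practice these come from an external validation set).
import math

def sep(predicted, reference):
    """Root mean squared difference between predicted and reference values."""
    m = len(reference)
    return math.sqrt(sum((p - x) ** 2 for p, x in zip(predicted, reference)) / m)

predicted = [10.2, 11.8, 14.3, 15.7]
reference = [10.0, 12.0, 14.0, 16.0]
performance = sep(predicted, reference)  # overall measure of prediction error
```

Because SEP averages over all m specimens, it summarizes performance on average; as noted next, it can mask performance that varies with the analyte level.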
For example, consider Figure 7a, where the standard deviation of the prediction errors (ei = x̂i − xi) increases as the analyte value deviates from the average analyte value found in the training set, x̄. In this case, although the precision of predictions depends on the analyte value, the accuracy is maintained over the range of analyte values. That is, for a particular analyte level, the average prediction error is about zero. The behavior with respect to precision is neither unexpected nor abnormal; the model is often better determined in the vicinity of x̄ than in the extremes.
On the other hand, sometimes there is a systematic bias associated with prediction errors that is dependent on the analyte level (Figure 7b). When the analyte values are less than x̄, the prediction errors are generally positive. Conversely, when analyte values are greater than x̄, the prediction errors are generally negative.
This pattern is indicative of a defective model in which the apparent good predictions in the vicinity of x̄ are attributable primarily to the centering operation that is usually performed during preprocessing in PLS and PCR. That is, predictions based on Equation 10 effectively reduce to x̂i = x̄ + noise if the estimated model coefficients (b̂) are spurious. Spurious model coefficients are obtained if noninformative instrumental measurements are used to construct a model. Thus, one should be wary of models that produce the systematic pattern of prediction errors shown in Figure 7b, regardless of whether the predictions are based on cross-validation or a true external validation set.
Several other factors affect the accuracy and precision of predictions, notably the inherent accuracy and precision of the reference method used. If the reference method produces erroneous analyte values that are consistently low or high, the resulting predictions will reflect that bias. Imprecise (but accurate) reference values will also inflate the magnitude of prediction errors, but in a nonsystematic way. Furthermore, errors in determining the reference values will affect the ability to assess the magnitude of prediction errors. The assessed magnitude of prediction errors can never be less than the magnitude of the reference errors. Thus, it is very important to minimize the errors in the reference analyte values that are used to construct an empirical model.
Other sources of prediction error are related to the repeatability, stability, and reproducibility of the instrumental measurements. Repeatability relates to the ability of the instrument to generate consistent measurements of a specimen using some fixed conditions (without removing the specimen from the instrument), over a relatively short period of time (perhaps seconds or minutes). Stability is similar to repeatability, but it involves a somewhat longer time period (perhaps hours or days). Reproducibility refers to the consistency of instrumental measurements during a small change in conditions, as might occur from multiple insertions of a specimen into an instrument.
Further classification of instrumental measurement errors is possible for cases in which the multiple instrumental measurements are ordered (e.g., by wavelength). It is also possible to decompose instrumental variation into features that are slowly varying (low frequency) and quickly varying (high frequency). Often the focus is only on the high-frequency error component, and usually only in the context of repeatability. This is unfortunate because multivariate methods that are capable of using many measurements are somewhat adept at reducing the effects of high-frequency errors. Practitioners should work to identify and eliminate sources of slowly varying error features.
Other sources of prediction error may be unrelated to the reference method or the analytical instrument. Modeling nonlinear behavior with inherently linear methods can result in model inadequacy. Some researchers thus have adapted multivariate calibration methods in order to accommodate nonlinearities (29, 30).
Figure 7. Relationship between the predicted analyte level and the reference analyte level. (a) In the normal relationship, the average reference analyte level, x̄, is 11 (arbitrary units). The precision of the predicted values depends on the reference analyte level and is best in the vicinity of x̄. (b) In the abnormal relationship, the precision and accuracy of the predicted values depend on the reference analyte level.

The ability to adequately sample and measure specimens in difficult environments can significantly affect the performance of calibration methods. In a laboratory it might be possible to control some of the factors that adversely affect model performance. However, many emerging analytical methods (e.g., noninvasive medical analyses and in situ analyses of industrial and environmental materials) are intended for use outside the traditional laboratory environment, where it might not be possible to control such factors. The ability to overcome these obstacles will largely influence the success or failure of a calibration method in a given application. Thus, practitioners must strive to identify and eliminate the dominant sources of prediction error.
Summary
This article has provided a basic introduction to multivariate calibration methods, with an emphasis on identifying issues that are critical to their effective use. In the future, as increasingly difficult problems arise, these methods will continue to evolve. Regardless of the direction of the evolutionary process, the resulting methods will need to be used carefully, with recognition of the issues presented in this article.
I thank Steven Brown, Bob Easterling, Ries Robinson, and Brian Stallard for their advice on this manuscript. Brian Stallard provided the water vapor spectra.
References
(1) Oman, S. D.; Wax, Y. Biometrics 1984, 40, 947-60.
(2) Smith, R. L.; Corbett, M. Applied Statistics 1987, 36, 283-95.
(3) Krutchkoff, R. G. Technometrics 1967, 9, 425-39.
(4) Williams, E. J. Technometrics 1969, 11, 189-92.
(5) Occupational Safety and Health Administration Salt Lake Technical Center: Metal and Metalloid Particulate in Workplace Atmospheres (Atomic Absorption) (USDOL/OSHA Method No. ID-121). In OSHA Analytical Methods Manual, Part 2, 2nd ed.
(6) Fearn, T. Applied Statistics 1983, 32, 73-79.
(7) Oman, S. D.; Naes, T.; Zube, A. J. Chemom. 1993, 7, 195-212.
(8) Haaland, D. M. Anal. Chem. 1988, 60, 1208-17.
(9) Small, G. A.; Arnold, M. A.; Marquardt, L. A. Anal. Chem. 1993, 65, 3279-89.
(10) Bhandare, P.; Mendelson, Y.; Peura, R. A.; Janatsch, G.; Kruse-Jarres, J. D.; Marbach, R.; Heise, H. M. Appl. Spectrosc. 1993, 47, 1214-21.
(11) Robinson, M. R.; Eaton, R. P.; Haaland, D. M.; Koepp, G. W.; Thomas, E. V.; Stallard, B. R.; Robinson, P. L. Clin. Chem. 1992, 38, 1618-22.
(12) Brown, S. D.; Bear, R. S.; Blank, T. B. Anal. Chem. 1992, 64, 22R-49R.
(13) Martens, H.; Naes, T. Multivariate Calibration; Wiley: Chichester, England, 1989.
(14) Haaland, D. M.; Thomas, E. V. Anal. Chem. 1988, 60, 1193-1202.
(15) Thomas, E. V. Technometrics 1991, 33, 405-14.
(16) Brown, C. W.; Lynch, P. F.; Obremski, R. J.; Lavery, D. S. Anal. Chem. 1982, 54, 1472-79.
(17) Mendelson, Y. Ph.D. Dissertation, Case Western Reserve University, Cleveland, OH, 1983.
(18) Stone, M.; Brooks, R. J. J. Royal Statistical Soc. Series B 1990, 52, 237-69.
(19) Hoskuldsson, A. J. Chemom. 1988, 2, 211-28.
(20) Helland, I. S. Scandinavian J. Statistics 1990, 17, 97-114.
(21) Thomas, E. V.; Haaland, D. M. Anal. Chem. 1990, 62, 1091-99.
(22) Frank, I. E.; Friedman, J. H. Technometrics 1993, 35, 109-48.
(23) Hruschka, W. R. In Near-Infrared Technology in the Agricultural and Food Industries; Williams, P.; Norris, K., Eds.; American Association of Cereal Chemists: St. Paul, MN, 1987; pp. 35-55.
(24) Brown, P. J. J. Chemom. 1992, 6, 151-61.
(25) Li, T.; Lucasius, C. B.; Kateman, G. Anal. Chim. Acta 1992, 268, 123-34.
(26) Isaksson, T.; Naes, T. Appl. Spectrosc. 1988, 42, 1273-84.
(27) Stone, M. J. Royal Statistical Soc. Series B 1974, 36, 111-33.
(28) Naes, T.; Isaksson, T. Appl. Spectrosc. 1989, 43, 328-35.
(29) Hoskuldsson, A. J. Chemom. 1992, 6, 307-34.
(30) Sekulic, S.; Seasholtz, M. B.; Kowalski, B. R.; Lee, S. E.; Holt, B. R. Anal. Chem. 1993, 65, 835A-845A.
Edward V. Thomas is a statistician in the Statistics and Human Factors Department of Sandia National Laboratories, Albuquerque, NM 87185-0829. His B.A. degree in chemistry and M.A. degree and Ph.D. in statistics were all awarded by the University of New Mexico. His research interests include calibration methods, with a focus on their application to problems in analytical chemistry.