Chemometrics for Raman Spectroscopy

Jeremy M. Shaver Eigenvector Research, Inc., Manson, Washington

I. INTRODUCTION

Although Raman spectroscopy has been utilized for chemical analyses for decades, it is the recent advent of commercially available, high-sensitivity integrated Raman systems which has allowed increasing numbers of scientists to adopt it as a commonly invoked tool. As a probe into the vibrational, rotational, and magnetic structure of molecules and condensed phases, Raman spectroscopy has a wealth of advantages over other techniques. These advantages range from the optics required to the selection rules governing the observed modes. When compared to the sibling technique of infrared (IR) absorption spectroscopy (as practiced in the far-, mid-, or near-IR regions), the fundamental information probed using the Raman effect is similar. Both techniques make use of band positions, intensities, and shapes to obtain information on the concentration and chemical or physical nature of the materials being observed. To this extent, the techniques required to extract information from an observed Raman or IR spectrum are also similar. A wealth of chemometric techniques have been developed or proven useful for analysis of IR absorption data and these techniques have relevance to Raman spectroscopy. Additionally, techniques adopted from a wide range of analytical chemistry and other sciences have relevance to the analysis of Raman spectra.

When one considers the fundamental natures of the IR absorption and Raman-scattering processes, it becomes clear that significant differences should be and are observed in the spectral interferences, instrumental effects, and limitations of the two techniques. Low signal is not uncommon in Raman spectra; the average probability of observing a Raman-shifted photon is on the order of 1 in $10^6$. Likewise, background luminescence or other broad-band signals can further obscure Raman signal and add noise to a spectrum. Conversely, selection rules for Raman scattering provide less cluttered spectra, and high-concentration solutions are rarely a problem. Optically, advantages abound. The optical interferences observed with Raman spectroscopy differ from those observed with IR spectroscopy because most Raman analyzers work in a wavelength range different from that of their IR counterparts [ultraviolet (UV) and visible to near-IR as compared to near-IR to far-IR]. Backscatter collection from most samples can be trivial, and quantitative and remote monitoring of samples through fiber optics is common. All of these differences lead to important considerations in data analysis.

Copyright © 2001 by Taylor & Francis Group, LLC

Because a vast number of the common chemometric techniques are applicable to Raman spectroscopy, this chapter will attempt to highlight the most important considerations in the successful practice of chemometrics in Raman spectroscopy. Some of the particular analysis advantages afforded by the use of Raman spectroscopy will also be discussed. References which are particular to Raman analysis will be given as examples in much of the discussion, although in many cases, the original technique was not developed using Raman spectroscopy.

It is worth mentioning the definition of the term chemometrics used in this setting. The term is often used specifically to indicate the use of multivariate mathematical techniques and/or statistics to derive chemical information from data. However, a more general definition is also used: the process of deriving chemical information from data. This second definition includes all steps of the analysis: preprocessing (i.e., signal processing) techniques used on data to make an analysis more stable and/or accurate; interrogation of simple band metrics (width, position, area, etc.); least-squares analyses for calibration curves and kinetic studies; classically multivariate approaches for classification, investigation, and quantification. Because all of these aspects of data analysis are crucial to accurate Raman analyses, the second definition will be the working definition for this discussion.

A. Metrics of Qualitative and Quantitative Analysis

To better understand the perks and pitfalls of Raman data analysis, an understanding of the characteristics of Raman spectra used for analysis is required. If data analyses are classified into two categories, qualitative and quantitative, an easy delineation of goals is provided.

Qualitative analyses are more concerned with identification of the type, location (or timing), and, sometimes, the number of species present in a sample. What is the identity of the inclusion on this silicon wafer? Where is the delamination in this sample? At what point during the reaction is the side product or the intermediate observed? What types of crystal structures are being formed? Often, the easier questions can be answered using simple band characteristics, or metrics. The question to be answered in these types of analyses is: Are particular bands present at particular Raman shifts? Raman spectroscopy often excels in this task because of the narrow, well-resolved bands typical of many Raman spectra. Given some simple previous knowledge (or standard spectra) of the species of interest, many classifications could, in fact, be based only on the presence of one or two bands. In practice, and as the classification task becomes more complicated, multiple bands must be used to verify the identity of a species. As the species of interest become more and more structurally similar or as the required confidence level increases, knowledge of band shapes and relative band intensities will be required. Because of instrumental effects on these secondary band characteristics (see Chapter 3), an analysis requiring their use will be inherently more complicated and susceptible to error. Section III.A.1 will discuss these considerations.

Quantitative analyses usually combine a qualitative question ("find the given species X") with the generic question of "how much is there." The focus of the quantitative question is one which deserves some careful consideration. One unambiguous goal would be the determination of a species' concentration. Such an analysis often relies on the predictability of observed intensity as a function of concentration [1]. Often, however, a more complicated integral property is the actual final goal: the density or percent cure of a polymer [2-4], fuel octane number [5], or diamond film stress [6], for example. In all of these cases, there are multiple band metrics which guide us to the final quantitative answer.

B. Univariate and Multivariate Analyses

Univariate analyses are so named because a single value is used for the calculation of the property of interest. A simple example is the use of a single intensity to calculate concentration. The observed intensity for a species is usually a linear function of the concentration of that species:

$$I_R = a L C I_0 k \tag{1}$$

where $I_R$ is the observed Raman intensity, $a$ is the apparent Raman-scattering efficiency (dependent primarily on species, environment, and excitation wavelength), $L$ is interrogated volume, $C$ is the species' concentration, $I_0$ is incident intensity, and $k$ is instrumental throughput. Note that the term "interrogated volume" is used instead of path length because of the potentially complex shape of the interrogated region (see Chapter 3). Assuming that $a$, $L$, $I_0$, and $k$ are constant, this equation can be simplified as

$$I_R = m C \tag{2}$$

where $m$ is a parameter empirically fit through calibration using known concentration mixtures and observed intensities. After determining $m$, Eq. (2) can be used in conjunction with an observed band intensity for prediction of an unknown concentration. Assuming constant concentration (i.e., pure component or homogeneous mixture), a similar equation could be written using $I_R$ and $a$, which can be related to some property of interest (e.g., crystallinity). This approach can work exceptionally well for simple closed systems [1,7]. It may be necessary to utilize higher-order equations, including $C^2$ or higher terms, to account for nonlinear changes with concentration.
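The calibration and prediction steps built on Eq. (2) can be sketched in a few lines. This is a minimal illustration in plain Python, assuming a line forced through the origin; the function names and the calibration numbers are hypothetical and are not taken from the chapter.

```python
def fit_slope(concentrations, intensities):
    """Least-squares slope m for I = m * C (line forced through the origin)."""
    numerator = sum(c * i for c, i in zip(concentrations, intensities))
    denominator = sum(c * c for c in concentrations)
    return numerator / denominator


def predict_concentration(intensity, m):
    """Invert I = m * C to estimate an unknown concentration."""
    return intensity / m


# Hypothetical calibration standards: known concentrations and observed band intensities.
calib_conc = [0.0, 0.5, 1.0, 2.0]
calib_intensity = [0.0, 51.0, 99.0, 202.0]

m = fit_slope(calib_conc, calib_intensity)
unknown_conc = predict_concentration(150.0, m)
```

With these made-up standards the fitted slope is close to 100 intensity units per concentration unit, so an observed intensity of 150 maps to a concentration near 1.5. A nonlinear response would need the higher-order terms mentioned in the text, which this sketch does not attempt.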

1. Multicomponent Analyses

Typically, multiple points in a spectrum are integrated to increase the signal-to-noise ratio of the observed intensity. However, if two (or more) overlapping bands from different species are integrated, the resulting intensity will be a function of all the species' concentrations. Using only one measured intensity provides no means by which to discriminate the individual contributions. One solution is to measure several bands or regions and use these multiple intensities together to solve for the various concentrations:

$$I_i = \sum_j m_{j,i} C_j \tag{3}$$

where $I_i$ is the intensity observed for region or band $i$, $m_{j,i}$ is the empirically determined weighting factor for species $j$ and band $i$, and $C_j$ is the concentration of species $j$. Equation (3) would be repeated for multiple bands ($i$) using at least as many bands as there are components ($j$) in the system.
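For the smallest useful case of Equation (3), two species measured at two bands, the concentrations follow from solving a 2 x 2 linear system. The sketch below uses Cramer's rule in plain Python; the weighting factors and intensities are hypothetical, and the function name is an assumption made for illustration.

```python
def solve_two_species(weights, intensities):
    """Solve I_i = weights[i][0] * C_0 + weights[i][1] * C_1 for bands i = 0, 1."""
    (a, b), (c, d) = weights
    det = a * d - b * c
    if abs(det) < 1e-12:
        # Both bands respond to the two species in the same proportion,
        # so they carry no information for separating them.
        raise ValueError("bands are not independent; choose different regions")
    i0, i1 = intensities
    return (i0 * d - b * i1) / det, (a * i1 - c * i0) / det


# Hypothetical weighting factors: rows are bands, columns are species.
weights = [[100.0, 20.0],
           [10.0, 80.0]]

# Intensities that a mixture with C_0 = 1.0 and C_1 = 2.0 would give.
observed = [140.0, 170.0]

c0, c1 = solve_two_species(weights, observed)
```

When more bands than species are measured, the same system would normally be solved by least squares rather than exactly, which averages down the noise in the individual band intensities.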

Alternatively, if, instead of integrating, the two bands are fit with curves, then the analysis is utilizing the individuality of the multiple measurements. The band fit takes into account the relationship between the individual points as the rise and fall of the band shape to distinguish between the species. In addition to the deconvolved band intensities, information regarding the band shape and position is also obtained. This information may be even more crucial to the analysis of the system.

Although band fitting is not typically described as a multivariate analysis, the distinct difference between straight integration and band fitting demonstrates the theoretical difference between a univariate and multivariate analysis. Multivariate analyses utilize the relationship between "variables," both within a spectrum (wavenumber to wavenumber) and between separate spectra (same wavenumber, different sample), to more accurately and precisely make a determination. These variables are often individual points in a spectrum. By analyzing a group of related spectra as a whole, patterns which appear in multiple spectra can be isolated. This can be done because bands from a unique species will change together with each other and independently of the bands due to other species. They are said to be correlated. Once the correlated spectral patterns are determined, they can be used in a qualitative or quantitative determination, much in the way concentration can be determined from the intensity of an individual band observed for a single species. The use of multiple pieces of information greatly improves the accuracy and precision of a determination. The effect is similar to using someone's hair color alone for identification (univariate) versus using all of their facial features (multivariate). The means by which correlated features of interest are extracted from the spectral data vary depending on the technique and goal. These aspects will be discussed later in this chapter.

II. DATA PRETREATMENT AND PREPROCESSING

Much of the success or failure of any method of data analysis relies on having appropriate data at the start. Certainly, the foremost concern is experimental design. As with any technique, the data must adequately represent the system being investigated. The complexity of the experimental design will be dictated by the purpose of the investigation or analysis and the particular quantitative or qualitative technique being used. Some of the individual expectations of experimental design will be discussed later, with each calibration and investigative technique. However, common to all techniques is the careful consideration of preprocessing the data prior to further analysis. Most of the inaccuracy and instability of an analysis can be removed through some straightforward preprocessing techniques which deal with the common systematic and random errors present in Raman spectroscopic data.

The preprocessing techniques described in this section are focused on operations on the basic Raman spectrum. Chemometrics sometimes differentiates between such raw spectral preprocessing and preprocessing of metrics derived from a spectrum (e.g., band width or area). With the exception of some issues with normalization, use of these latter operations is not particularly unique to Raman spectroscopy and is left to the discretion of the reader.

A. Smoothing and Denoising

One of the most difficult aspects of Raman spectral analysis is overcoming the often low signal levels present. Fortunately, many sources of noise are easily identifiable and straightforward to correct. Here, we will consider the noise which appears as random variations from point to point in a spectrum and varies from spectrum to spectrum in a consistent manner. Correction of changes in the total observed signal will be addressed in Section II.C.

1. Random Noise Correction

If two spectra are collected for a sample, assuming that there are no changes in the sample or the analyzer, the difference between the two will be the random error associated with the measurement. As discussed in Chapters 3 and 4, the use of cooled, high-quality detectors greatly reduces noise from electrical and thermal sources with most spectrometers. The next most significant source of random noise is shot or statistical noise. This noise arises from the random probability associated with actually observing a photon at a given wavelength. It can be shown that the random variation for a given measurement is the square root of the number of counts measured. Thus, the approximate shot noise associated with a measurement of $n$ counts will be $n^{1/2}$. This is often described in terms of a signal-to-noise ratio (S/N), which is the total number of counts divided by the noise counts, or as relative error, which is simply the reciprocal of S/N. The number of counts of noise can be approximated by $n^{1/2}$ or through direct measurement given constant conditions. Note that whereas the actual variation in counts (noise) increases as the total number of counts increases, the relative error actually decreases because of the square root relationship.

There are a multitude of random-noise removal techniques which can be applied to Raman spectra [8]. They range from single-spectrum smoothing algorithms such as boxcar averaging to multivariate analyses, which identify the similarities (correlation) between an entire set of spectra to isolate signal from noise in each of the spectra.

All of the single-spectrum techniques make use of the relationships between adjacent spectral points to isolate signal from noise. Noise usually appears as high-frequency random fluctuations from one pixel to the next, whereas signal is assumed to be lower frequency, changing more gradually from pixel to pixel. Figure 1 demonstrates the frequency differentiation among noise, Raman signal, and the even lower-frequency background signal. A simple boxcar averaging averages each point in the spectrum with some number of its neighbors. Effectively, the number of counts for each point is increased and, therefore, the relative error is decreased because of the square root relationship described earlier. The assumption is that adjacent points are essentially equal in the signal they report. In practice, bands are finitely changing over often narrow pixel ranges and the band shape will be distorted by the smoothing process. Weighting functions which decrease the influence of further-away points can be used in smoothing to reduce the band-shape changes. Savitzky-Golay polynomial smoothing [9,10] is a related approach which permits some lower-frequency changes to occur over each small window of points. In this case, polynomial functions are fit to a group of points around each point in the spectrum. The resulting function can then be used to isolate the lower-frequency signal-related information.

Figure 1 Discrete frequency analysis of a Raman spectrum (original shown at top). The data were separated through Fourier analysis into three groups containing primarily high-frequency noise (A, highest 33% of frequencies), mid-frequency Raman signal (B, mid 33%-67%), and low-frequency background signal (C, lowest 33%).
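The band-shape distortion caused by boxcar averaging, and the way a Savitzky-Golay window reduces it, can be seen on a small synthetic band. The sketch below is plain Python and rests on assumptions not in the chapter: a five-point window, the standard five-point quadratic Savitzky-Golay smoothing weights (-3, 12, 17, 12, -3)/35, edge points left untouched, and a made-up parabolic band top.

```python
def apply_window(y, weights):
    """Weighted moving average; points too close to either edge are left unchanged."""
    half = len(weights) // 2
    out = list(y)
    for i in range(half, len(y) - half):
        out[i] = sum(w * y[i - half + k] for k, w in enumerate(weights))
    return out


# Five-point boxcar: every neighbor counts equally.
BOXCAR_5 = [1 / 5] * 5

# Five-point Savitzky-Golay weights for a quadratic (or cubic) local fit.
SAVGOL_5 = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]

# Hypothetical noise-free band top with its maximum of 100 at index 4.
band = [84.0, 91.0, 96.0, 99.0, 100.0, 99.0, 96.0, 91.0, 84.0]

boxcar_smoothed = apply_window(band, BOXCAR_5)
savgol_smoothed = apply_window(band, SAVGOL_5)
```

On this band the boxcar lowers the maximum from 100 to about 98, whereas the Savitzky-Golay window returns about 100, because a local quadratic fit reproduces a parabolic band top. Real bands are not exactly parabolic, so some distortion remains with either method.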

Fourier filtering also makes direct use of the discrete frequency of the points in a spectrum [11]. Through a mathematical transformation, the spectrum is converted into a summation of cosine and sine functions of increasing frequency. The higher-frequency functions, assumed to represent only noise, can then be dropped and the lower ones can be used to reconstruct the less noisy spectrum. Unfortunately, any finite set of sines and cosines is only a rough approximation of the shape of a discrete Raman band. As a result, to accurately represent narrower Raman bands, a long series of sines and cosines would have to be used. Removal of the higher-frequency noise functions also removes some of these crucial band-defining functions, and distortions are the common result. Figure 1 shows, in the highest 33% frequency spectrum, the sharp Raman features mistakenly removed along with the noise.
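A low-pass Fourier filter of the kind described here can be sketched with a direct discrete Fourier transform. The code below is a plain-Python illustration, suitable only for short spectra because the transform is computed the slow way; the cutoff parameter `keep` and the synthetic test signal are assumptions made for the example.

```python
import cmath
import math


def dft(y):
    """Direct discrete Fourier transform (O(n^2), adequate for short examples)."""
    n = len(y)
    return [sum(y[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(coeffs):
    """Inverse transform; only the real part is returned because the input spectrum is real."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def fourier_lowpass(y, keep):
    """Zero every Fourier component whose frequency index exceeds `keep`."""
    n = len(y)
    coeffs = dft(y)
    for k in range(n):
        frequency = min(k, n - k)  # indices k and n - k describe the same frequency
        if frequency > keep:
            coeffs[k] = 0
    return inverse_dft(coeffs)


# Synthetic 16-point signal: one slow cosine plus the fastest possible alternation.
n_points = 16
slow = [math.cos(2 * math.pi * t / n_points) for t in range(n_points)]
noisy = [s + 0.5 * (-1) ** t for t, s in enumerate(slow)]

filtered = fourier_lowpass(noisy, keep=4)
```

In this idealized case the alternating component sits entirely above the cutoff and is removed exactly. A narrow Raman band, by contrast, has components spread across many frequencies, which is why the same cutoff would also clip part of the band.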

To decrease band-shape distortions, wavelet filtering, a technique similar to Fourier filtering, makes use of a specially selected set of functions called a "basis set" [12,13]. The basis set is similar in nature to the increasing frequency sine and cosine functions used in Fourier filtering except that the basis set functions are specifically chosen to be better estimators of band shape. Although their use in Raman spectroscopy is relatively recent and limited, there is indication that wavelets permit noise to be better filtered from the signal with less band deformation.

When analyzing large sets of data, multivariate analysis techniques can provide greatly enhanced signal to noise. By analyzing for recurring patterns in the spectra, the noise, which should be random and uncorrelated between spectra, can be isolated from the Raman signal which will usually appear in more than one of the spectra. The result is a very low-noise image of the Raman spectral patterns in the data. Note that some noise is "embedded" and not separable from the signal. For example, when only one spectrum contains a given Raman feature, there is no ability to improve on the signal to noise of that feature because there are no other spectra to which that spectral feature can be compared. When the spectral patterns are used to analyze a new separate spectrum, such as is done in the prediction phase of a partial least squares (PLS) model (see Sec. IV.B.2), the only realized advantages are those from the use of multiple variables (i.e., spectral points) within a spectrum and how well they match the low-noise spectral patterns; that is, whereas the large number of spectra in the "calibration" dataset allow us to determine very well what spectral features we are and are not looking for, the noise in the new, separate spectrum may fundamentally limit our ability to recognize those features.

2. Cosmic Spike Removal

Another nonanalytical source of variation from spectrum to spectrum is cosmic rays which impact the detector during the exposure. Each ray affects only a few pixels on the camera, causing an exceptionally high charge to be created. This appears in the spectrum as a very narrow, high-intensity spike. The high-intensity appearance of these bands can cause catastrophic effects on analyses. Band integration and fitting will be compelled to overweight the cosmic spike over the real Raman signal. Likewise, multivariate analyses which utilize spectral variance to isolate the signals of interest are often badly affected by these high-variance signals. A single spike in a spectrum may cause the numerical analysis to identify the spectrum as an outlier and discard the otherwise useful data.

The narrow nature of cosmic spikes provides the key to the most straightforward way of removing their effect. Although smoothing techniques may be effective for qualitative removal of these spikes, a more quantitatively effective tool is replicate analysis. Two identical spectra are collected and subtracted. Because it is very unlikely that any particular cosmic-ray-induced spikes observed in either spectrum will appear in the exact same location in the other spectrum, large and narrow positive or negative deviations will be observed from the otherwise random noise (assuming no significant changes in the sample between the two measurements). For any point that appears statistically the same in both spectra, an average of the value observed in the two spectra will be used. Points at which cosmic rays appear in one of the spectra will be filled in using the other spectrum only, thereby removing the cosmic effect. This technique is fully described in Chapter 3. It is important to note that the two spectra must match in terms of the sample and instrumental conditions. If any changes occur in the sample between the collection of the two spectra, these changes may be filtered out by mistake or may cause abnormal band shapes.
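The replicate comparison can be sketched as a point-by-point test on two spectra of the same sample. The version below is a plain-Python illustration and makes assumptions that the chapter does not specify: the expected scatter of the difference is taken as the shot noise of two measurements at the lower of the two values, and a point is flagged when the difference exceeds `k` times that scatter. The function name, the threshold, and the spectra are hypothetical.

```python
import math


def despike_pair(spec_a, spec_b, k=5.0):
    """Combine two replicate spectra, discarding points inflated by a cosmic spike."""
    cleaned = []
    for a, b in zip(spec_a, spec_b):
        low = min(a, b)
        # Assumed noise model: shot noise of the difference of two readings near `low` counts.
        sigma_diff = math.sqrt(2.0 * max(low, 1.0))
        if abs(a - b) > k * sigma_diff:
            # A spike only adds counts, so the lower reading is the unaffected one.
            cleaned.append(float(low))
        else:
            cleaned.append((a + b) / 2.0)
    return cleaned


# Hypothetical replicate spectra (counts); the third channel of the first holds a spike.
first = [100, 104, 5000, 98]
second = [96, 100, 102, 101]

combined = despike_pair(first, second)
```

Channels that agree within noise are averaged, which also improves their signal to noise, while the spiked channel keeps only the value from the unaffected spectrum. As the text cautions, a real change in the sample between the two exposures would be flagged in the same way.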

After identification of a cosmic spike, another approach to remove the spike is to use a median filter on several related spectra. The spectra can be related in time, for time- series data as with data taken from a process on-line analyzer, or related in space, in the case of a Raman image dataset in which spectra are taken from spatially segregated locations.
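A median-based repair of an already identified spike might look like the following plain-Python sketch. It assumes the related spectra share the same channel axis and that the spiked spectrum and channel are known in advance; the function name and data are hypothetical.

```python
from statistics import median


def replace_with_median(spectra, spectrum_index, channel):
    """Copy one spectrum, replacing a flagged channel by the median over all related spectra."""
    repaired = list(spectra[spectrum_index])
    repaired[channel] = median(s[channel] for s in spectra)
    return repaired


# Hypothetical time series of three spectra; the middle one has a spike in channel 1.
series = [
    [10, 12, 11],
    [11, 900, 12],
    [12, 13, 10],
]

repaired = replace_with_median(series, spectrum_index=1, channel=1)
```

The median is used rather than the mean because a single extreme value does not shift it, so the spike itself has little influence on the replacement value.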

B. Baseline and Background Removal

Backgrounds in Raman spectroscopy are common and arise from luminescence processes (e.g., fluorescence, phosphorescence), non-laser-induced emissive processes (e.g., room light, sunlight, blackbody radiation, chemiluminescence), or Raman scattering from other-than-analyte sources (e.g., substrate, solvent, optics). Sometimes, experimental modifications such as choice of excitation wavelength, confocal optical design, or sample preparation can preemptively remove the source of backgrounds. Ultimately, many Raman spectra contain some amount of background, although it may not be sufficient to interfere with the measurement.

Possibly the most significant effect of background is the introduction of noise into the spectrum. Previously, we discussed shot noise as the primary source of noise in many Raman-scattering experiments. Returning to the explanation of shot noise and its square root relationship to the number of counts measured, we stated that the higher the number of counts, the better the signal to noise (lower relative error). For example, if 100 counts are measured, the standard deviation will be $(100)^{1/2}$ or 10 counts, providing a relative error of 10%. If 400 counts are measured, the standard deviation will be 20 counts, giving a relative error of only 5%. The problem in the case of a spectrum with background is that signal-to-noise improvement with higher counts assumes that the only source of variation is the counts of primary (Raman) signal. If the aforementioned 400-count Raman band is superimposed on a moderate background of only 4500 counts, then the total standard deviation in counts will be $(4900)^{1/2}$ or 70 counts, which gives a relative error for the Raman-scattering intensity of 17.5% (= 70/400)! Because of this, a background correction technique may remove the broad structure, but the resulting spectrum will still carry the shot noise imparted by the background. The next preprocessing step will likely have to be some sort of noise removal scheme to help recover the relative error. With exceptionally large backgrounds, the Raman signal is often rendered unusable even if the structure of the background can be completely removed from the Raman spectrum. Assuming that this is not the case, the following are some guidelines to background removal for Raman spectral analyses.
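The shot-noise arithmetic in the preceding paragraph can be checked with a short calculation. The helper below is a plain-Python sketch that assumes shot noise is the only noise source; its name is chosen for illustration.

```python
import math


def relative_error(signal_counts, background_counts=0.0):
    """Shot-noise-limited relative error of a Raman signal sitting on a background."""
    noise = math.sqrt(signal_counts + background_counts)  # all detected counts contribute noise
    return noise / signal_counts


no_background_100 = relative_error(100)        # 10 / 100
no_background_400 = relative_error(400)        # 20 / 400
with_background = relative_error(400, 4500)    # 70 / 400
```

The three values correspond to the 10%, 5%, and 17.5% figures quoted in the text. Subtracting the background afterward lowers the baseline but leaves the 70-count noise in place.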

For many qualitative applications, the only need for background removal is to make it easier to distinguish the bands' presence and positions. For example, McCreery and co-workers observed the failure of several library search techniques when the background was not removed from the Raman spectra of several pharmaceuticals [14]. For these applications, if the background is known or easily determined (e.g., Raman spectrum of the solvent or sunlight), then weighted subtraction may be the best route for removal. The amount of background to be removed from any individual spectrum can be determined either by user intervention or by an automatic determination based on spectral smoothness or correlation techniques [15-19]. If the total of the background is not removed via the subtraction, at least the magnitude will be greatly decreased.

If the background is changing or a pure spectrum of the background cannot be obtained, broad approximations of the background across the entire spectrum can be used. Several points in the spectrum which are presumably only background are chosen and an appropriate function (linear, simple polynomials, splines, etc.) can be fit to these points. The function is then used to create an approximation of the background and it is subtracted from the given spectrum.
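The simplest form of this procedure, a straight line fit through user-chosen background points, can be sketched as follows. The code is plain Python; the choice of a linear function, the selected indices, and the synthetic spectrum are all assumptions made for illustration, and a polynomial or spline would follow the same pattern.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def subtract_linear_baseline(shifts, counts, background_idx):
    """Fit a line to the points assumed to be background only and subtract it everywhere."""
    xs = [shifts[i] for i in background_idx]
    ys = [counts[i] for i in background_idx]
    slope, intercept = fit_line(xs, ys)
    return [c - (slope * x + intercept) for x, c in zip(shifts, counts)]


# Synthetic spectrum: a band centered at 1300 on a sloping background of 0.1 * shift + 20.
shifts = [1000, 1100, 1200, 1300, 1400, 1500]
band = [0, 0, 50, 200, 50, 0]
counts = [0.1 * x + 20 + b for x, b in zip(shifts, band)]

corrected = subtract_linear_baseline(shifts, counts, background_idx=[0, 1, 5])
```

The result depends entirely on the chosen points really being free of Raman signal. Including a point on the wing of a band would tilt the fitted line and bias every intensity derived from the corrected spectrum.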

Some automated routines also exist which make use of the typically low-frequency (i.e., broad) nature of the background and remove it using derivatives, Fourier filtering as shown in Fig. 1, or fixed shape techniques [20,21]. Derivatives are the most commonly used of these techniques because of their simplicity. Point difference and Savitzky-Golay (SG) derivative techniques are two of the most commonly used approaches to calculating derivatives. Point difference is done by simply subtracting each spectral point from one of its adjacent points (before or after). SG derivatives make use of a polynomial function to fit some local number of points around each spectral point and then determine the derivative of each point from the function used for that point. Each time a derivative is performed on a spectrum, the offset of the spectrum is removed and subsequently higher order shapes get simplified: linear trends become offsets, quadratic become linear, and so forth. To completely remove a background with any significant structure, a fairly high-order derivative would be needed. Unfortunately, with each subsequent derivative, noise is also amplified unless the procedure used incorporates some filtering mechanism (the use of polynomials in the SG process serves this purpose).
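The way repeated point differences simplify a background can be followed on two small synthetic backgrounds. This plain-Python sketch uses the forward difference (each point subtracted from the one after it); the function name and the data are illustrative only.

```python
def point_difference(y):
    """Forward point difference: one value fewer than the input."""
    return [after - before for before, after in zip(y, y[1:])]


# Linear background: offset 3, slope 2 per point.
linear_background = [3 + 2 * x for x in range(6)]
linear_first = point_difference(linear_background)     # constant offset
linear_second = point_difference(linear_first)          # zero everywhere

# Quadratic background.
quadratic_background = [x * x for x in range(6)]
quadratic_first = point_difference(quadratic_background)   # linear trend
quadratic_second = point_difference(quadratic_first)       # constant offset
```

Each difference removes one order of background structure, matching the description in the text. Because every output value is the difference of two noisy points, the noise grows with each pass, which is the reason for preferring the Savitzky-Golay form on real data.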

It should be noted that shape-fitting background removal techniques (e.g., polynomial or spline fitting) which operate individually on the spectra in a group will often cause problems if followed by a multivariate analysis. These background removal steps may introduce large variations that can overwhelm real small-variance changes. This artificial variance can cause severe problems with multivariate analyses. A set of nearly identical spectra baselined using a polynomial fit on each individual spectrum may have many more differences between the spectra after the baseline than were present before the baselining. These larger differences will, in effect, distract the multivariate analysis from the real differences. For this reason, baselining should be used cautiously with multivariate analyses.

Alternatively, some multivariate techniques can use the independence of the Raman signal and the background to separate the two [22-24]. For this approach, some or all of the background can be left in the data and isolated mathematically after the fact. This is often an effective approach as long as the background changes are not correlated to the Raman signal, such as would be the case if a species giving a particular Raman spectrum was also weakly luminescent or was always in a fixed stoichiometric presence with a luminescent species.

The extent to which a background will interfere with a quantitative analysis depends on what band metrics are being used as well as the analysis technique itself. Clearly, any background not removed from a spectrum when using simple integrated band intensity will artificially increase the determined intensity. Similarly, if curve fitting is used, any unremoved background can substantially bias the determined bandwidth, as shown in Fig. 2. Although the peak is the same in both cases, the width determined through band fitting is different when the peak is superimposed on a broad Gaussian background. A second-order polynomial baseline was used in the curve fit and the residual was well within the expected range ($\chi^2$ of 1.44). Although the answer is inaccurate, it may still be precise. If the background is consistent in its shape, then the bias will be constant. Likewise, multivariate approaches keyed in to bandwidth may be somewhat less susceptible to this kind of bias, assuming expected backgrounds were present in the calibration spectra.

Because of the extreme sensitivity of intensity and quantitative band shape to the presence of background, accurate quantitative analysis hinges on complete removal or consistent correction for background at the band(s) of interest in a spectrum. As a result, full-spectrum approximations of background are often not sufficient. However, if the same baselining techniques are used on smaller portions of the spectrum, far better results may be obtained.

Figure 2 Error observed when determining bandwidth without adequate background removal. The bottom spectrum shows the original band and determined width without background. The upper spectrum is the same band in the presence of background. Bandwidths observed using linear (L) and quadratic (Q) baselines are shown. Horizontal axis: Raman shift (cm⁻¹), 1100 to 1900.

C. Normalization for Correction of Uniform Intensity Changes

One of the next most significant sources of systematic error in Raman spectral measurements arises from total intensity variations. As described earlier (see Sec. I.B), many properties of interest can be simply correlated to observed Raman intensity. We showed that, using Eq. (1), we could relate observed Raman intensity to either concentration or a species' scattering efficiency. However, this relationship assumes that any changes in observed intensity are due only to changes in concentration or cross section. In practice, incident intensity (I₀), interrogated volume (L), and instrumental throughput (k) are subject to change. Changes in laser-intensity output and laser throughput to sample will both change the incident intensity. Changes in a sample's refractive index, opacity (due to suspended particulate, bubbles, or phase separation), position, absorptivity, and density are just a few of the many things that can change the interrogated volume and, thus, the observed intensity. Such changes are nearly always observed with in-line and at-line process analysis applications, although there are exceptions. Finally, despite the stability of many instruments, some changes can occur in the instrumental throughput, especially over long time periods. Clearly, the extent to which changes of the last type will occur depends on the instrumental design. Particular designs can provide information to help correct for the incident intensity and throughput effects. Typically, interrogated volume changes are more difficult to eliminate and must be dealt with in the data analysis stage.

In most cases, classification and identification techniques are not based on absolute intensity and will inherently use a normalization step to fix intensity to a known value. As a result, such qualitative Raman models will not be affected by changes in total intensity. However, if the total intensity is substantially diminished, detection limits may suffer. Quantitative models, on the other hand, often base the determination on intensity and are much more likely to be influenced. One means to avoid these effects is to use band metrics, which are unaffected by total intensity (e.g., bandwidth or position). Creating such quantitative models may be impossible in some cases or may be based on inferential effects (e.g., band-shape changes of the OH stretch as an indicator of acid concentration), which may be subject to significant interference and error. In most cases, normalization of the entire spectrum to some known or constant value is necessary (known as normalization to unit-vector length). In this discussion, the focus will be on quantitative models for concentration although the same arguments and methods apply to calibrations concerned with cross-sectional changes related to some other property of interest.

The most important aspect of choosing a normalization approach is selecting a metric, such as the area of a band or bands, which accurately reflects the total intensity change. One standard approach offered by many spectral analysis packages is normalization to total spectral intensity. If, for a given set of samples, the total spectral intensity remains constant or nearly constant for all samples, then any spectrum can be normalized by integrating the total intensity and dividing by that scalar value. The normalized intensity of any individual band can then be assumed to be a linear function of concentration or scattering efficiency and calculations done as earlier. In practice, however, the total spectral intensity is often not constant across samples.
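As a minimal sketch of normalization to total spectral intensity, the following NumPy example integrates each spectrum and divides by the resulting scalar. The function names, the synthetic Gaussian band, and the scale factors are illustrative assumptions, not taken from any particular analysis package.

```python
import numpy as np

def integrated_intensity(spectra, shifts):
    """Trapezoidal area under each spectrum (one row per spectrum)."""
    spectra = np.asarray(spectra, dtype=float)
    steps = np.diff(shifts)
    return np.sum(0.5 * (spectra[:, 1:] + spectra[:, :-1]) * steps, axis=1)

def normalize_to_total_intensity(spectra, shifts):
    """Divide each spectrum by its own integrated intensity."""
    spectra = np.asarray(spectra, dtype=float)
    return spectra / integrated_intensity(spectra, shifts)[:, None]

# Synthetic data: one band recorded at three arbitrary overall intensities.
shifts = np.linspace(1100.0, 1900.0, 401)
band = np.exp(-0.5 * ((shifts - 1500.0) / 10.0) ** 2)
overall_scale = np.array([1.0, 0.6, 1.7])  # hypothetical throughput changes
spectra = np.outer(overall_scale, band)

normalized = normalize_to_total_intensity(spectra, shifts)
```

In this idealized case the three normalized spectra coincide, because the only difference between them was a uniform scale factor and no background was present.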

Difficulties with this approach include bias from unremoved background and nonconstant total Raman signal. As previously discussed, complete and accurate removal of background from an entire spectrum can be difficult. If the background is from the sample itself, then removal may not be necessary if the source of the background (e.g., fluorescent species in the sample) is unchanging in concentration and spectral shape. In this case, the changes in background intensity will probably be a reasonable approximation of the nonchemical changes in Raman signal and complete background subtraction will not be necessary. If the background varies in intensity or shape or is from an external source (e.g., optics), then it will adversely affect the normalization.

A similar detrimental effect is observed with nonconstant total Raman signal. This situation arises if, for example, a Raman-active species dominates some spectra but not others. The extent to which this will be observed depends on the difference in net scattering efficiency (total for all observed bands). The larger the difference in scattering efficiencies and concentration range, the more significant the change in total integrated intensity. If all species in a sample gave the same Raman intensity for a given concentration, then this would not be observed [4]. However, equal scattering is usually not the case. The result is normalization with a nonconstant metric, which means that the normalized intensity cannot be fit by a linear function as expected from Eq. (1). A simple plot of normalized intensity versus concentration will diagnose this situation. Some unpublished work by the author indicates that nonlinear equations similar to those published by Karstang and Manne [25] effectively model these nonlinearities, permitting the use of normalization to a changing band.
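The nonlinearity described above can be illustrated with a simple two-species sketch. It assumes a binary mixture whose fractions sum to one and whose bands do not overlap; the scattering efficiencies `sigma_a` and `sigma_b` are hypothetical values chosen only for illustration, and this is not the Karstang and Manne formulation.

```python
import numpy as np

def normalized_band_intensity(c_a, sigma_a, sigma_b, scale=1.0):
    """Band intensity of species A after dividing by total Raman signal."""
    c_a = np.asarray(c_a, dtype=float)
    c_b = 1.0 - c_a                      # assumed binary mixture
    band_a = scale * sigma_a * c_a       # signal from species A alone
    total = scale * (sigma_a * c_a + sigma_b * c_b)
    return band_a / total                # `scale` cancels, as intended

def max_linear_residual(x, y):
    """Largest deviation of y from its best straight-line fit against x."""
    slope, intercept = np.polyfit(x, y, 1)
    return np.max(np.abs(y - (slope * x + intercept)))

c_a = np.linspace(0.0, 1.0, 11)
equal = normalized_band_intensity(c_a, sigma_a=1.0, sigma_b=1.0)
unequal = normalized_band_intensity(c_a, sigma_a=5.0, sigma_b=1.0)

residual_equal = max_linear_residual(c_a, equal)
residual_unequal = max_linear_residual(c_a, unequal)
```

With equal scattering efficiencies the normalized intensity is exactly linear in concentration; with unequal efficiencies the straight-line residual becomes large, which is the pattern a plot of normalized intensity versus concentration would reveal.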

Another common normalization approach is to use only a single reference band. The band can be either an internal reference or an external reference. Solvent bands such as bending modes for water (1650 cm⁻¹) or CH₂ (1450 cm⁻¹) or the OH or CH stretching modes (above 2900 cm⁻¹) [26] are typical choices for reference bands. As long as these modes are not significantly affected by other changes in the samples, they will likely provide a more stable metric than total intensity. As with total intensity changes, any changes in the reference band concurrent with concentration changes will appear as nonlinearities.
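Normalization to an internal reference band can be sketched as follows. The band positions, integration windows, concentrations, and throughput factors are synthetic assumptions, and the solvent band is assumed to be unaffected by the analyte.

```python
import numpy as np

def band_area(spectra, shifts, lo, hi):
    """Trapezoidal area of each spectrum between Raman shifts lo and hi."""
    mask = (shifts >= lo) & (shifts <= hi)
    x = shifts[mask]
    y = np.asarray(spectra, dtype=float)[:, mask]
    return np.sum(0.5 * (y[:, 1:] + y[:, :-1]) * np.diff(x), axis=1)

def normalize_to_reference_band(spectra, shifts, lo, hi):
    """Divide each spectrum by the area of its reference band."""
    reference = band_area(spectra, shifts, lo, hi)
    return np.asarray(spectra, dtype=float) / reference[:, None]

shifts = np.linspace(1000.0, 1800.0, 801)

def gaussian(center, width):
    return np.exp(-0.5 * ((shifts - center) / width) ** 2)

solvent = gaussian(1650.0, 15.0)          # stand-in for a constant solvent band
analyte = gaussian(1200.0, 8.0)           # band of the species of interest
conc = np.array([0.2, 0.5, 1.0])          # hypothetical concentrations
throughput = np.array([1.0, 0.7, 1.4])    # uncontrolled intensity changes
spectra = throughput[:, None] * (solvent + conc[:, None] * analyte)

normalized = normalize_to_reference_band(spectra, shifts, 1600.0, 1700.0)
analyte_area = band_area(normalized, shifts, 1170.0, 1230.0)
response = analyte_area / conc            # constant if normalization worked
```

After normalization the analyte band area is proportional to concentration despite the throughput changes. If the reference band itself changed with concentration, this proportionality would be lost.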

An external reference such as vapor bands observed in an intermediate focus in the optics can be used to correct for laser intensity and some of the instrumental throughput factors. In this approach, some reference external to the sample is used to correct for all intensity changes except those associated with the final optics and sample properties. It is crucial to note that such external references do not correct for changes in sample density, absorbance, or any of the other optical effects discussed earlier.

Two alternate approaches to normalization which have been used often with infrared spectroscopy are multiplicative scatter correction (MSC) [19], which uses a target reference spectrum (often the mean of the calibration dataset) for normalization, and standard normal variate (SNV) [27-29], which centers and normalizes to an expected variance. These have also been used in Raman spectroscopy with impressive results [29-31].
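The two corrections can be sketched in a few lines of NumPy. This is a minimal illustration of the commonly described forms of SNV (row-wise centering and scaling) and MSC (regressing each spectrum on a reference and removing the fitted offset and slope); the function names and synthetic data are assumptions.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum."""
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)   # mean of the calibration set
    corrected = np.empty_like(spectra)
    for i, spectrum in enumerate(spectra):
        # Fit spectrum = slope * reference + offset, then undo both terms.
        slope, offset = np.polyfit(reference, spectrum, 1)
        corrected[i] = (spectrum - offset) / slope
    return corrected

# Synthetic data: one spectral shape with different gains and offsets.
shifts = np.linspace(1100.0, 1900.0, 401)
shape = np.exp(-0.5 * ((shifts - 1500.0) / 10.0) ** 2)
gain = np.array([1.0, 0.5, 2.0])
offset = np.array([0.0, 0.3, -0.1])
spectra = gain[:, None] * shape + offset[:, None]

snv_spectra = snv(spectra)
msc_spectra = msc(spectra)
```

Both corrections collapse these three spectra onto a single curve. With real data the result depends on how well a single gain and offset describe the nonchemical variation.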

After all of the normalization options have been examined and the best of them chosen, there may still be a nonlinear relationship between intensity and the property of interest resulting from the normalization step. In these cases, the use of nonlinear equations or multiple factors to fit the response will be necessary. It is important to note that accuracy may suffer in these instances if the model is not a good approximation of the nonlinear response. The use of multiple factors in multivariate models will be discussed in Sections 3 and 4.

D. Difference Raman and Mean Centering

An exceptionally common technique to help isolate and identify only the changing features in a spectrum is the difference method. In its simplest form, a difference spectrum is the one-to-one subtraction of two spectra. Usually, one spectrum is a reference spectrum, which represents a known or fixed condition. Any changes observed in the difference spectrum are then used to determine how the second sample differs from the reference sample. Raman spectral features are uniquely suited to qualitative analyses using difference spectroscopy techniques because of the lower number of usually narrow bands observed for many samples. The less the band overlap, the easier the interpretation of a difference spectrum. Raman Optical Activity [32] is one example of this technique.

Often, a fixed reference spectrum is also used to help improve the precision and accuracy of multivariate models in which many spectra are used together to create a calibration or training set. In these cases, the reference spectrum is the mean of the entire set of Raman spectra used for calibration and it is subtracted from each of the individual calibration spectra as well as any subsequent spectra from which a prediction is to be made. This approach is called "mean centering."
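A short sketch of mean centering follows. The key point it illustrates is that the mean is computed once from the calibration set and then reused, unchanged, for any later spectrum. The helper names and random data are illustrative assumptions.

```python
import numpy as np

def fit_mean_spectrum(calibration):
    """Mean spectrum of the calibration set (one row per spectrum)."""
    return np.asarray(calibration, dtype=float).mean(axis=0)

def mean_center(spectra, mean_spectrum):
    """Subtract the stored calibration mean from one or more spectra."""
    return np.asarray(spectra, dtype=float) - mean_spectrum

rng = np.random.default_rng(0)
calibration = rng.random((5, 8))   # 5 synthetic spectra, 8 points each
new_spectrum = rng.random(8)       # a later spectrum to be predicted

mean_spectrum = fit_mean_spectrum(calibration)
calibration_centered = mean_center(calibration, mean_spectrum)
new_centered = mean_center(new_spectrum, mean_spectrum)
```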

The most crucial aspect of using difference Raman or mean centering is related to the issues discussed in Section II.C (normalization). If any uncorrected change has occurred in the sampling volume or collection efficiency of the system, the absolute difference in Raman intensities will appear in the mean-centered spectrum even though this change may not be a reflection of a true concentration or scattering efficiency change in the sample. The effect will be the same in either a qualitative or quantitative model. Severe failure or incorrect prediction is likely to result. Whenever total intensity variations are likely, the use of spectral normalization is strongly recommended before any mean-centering or difference Raman calculation.
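The following sketch illustrates why normalization should precede mean centering. Three spectra of a chemically identical sample differ only by a hypothetical throughput factor; unit-vector-length normalization is used here as one possible choice.

```python
import numpy as np

shifts = np.linspace(1100.0, 1900.0, 401)
band = np.exp(-0.5 * ((shifts - 1500.0) / 10.0) ** 2)
throughput = np.array([1.0, 0.8, 1.25])   # nonchemical intensity changes
spectra = np.outer(throughput, band)      # identical chemistry in every row

def unit_length(spectra):
    """Scale each spectrum to unit Euclidean (vector) length."""
    return spectra / np.linalg.norm(spectra, axis=1, keepdims=True)

# Mean centering the raw spectra leaves a band-shaped artifact...
raw_centered = spectra - spectra.mean(axis=0)
raw_artifact = np.abs(raw_centered).max()

# ...whereas normalizing first removes it.
normalized = unit_length(spectra)
normalized_centered = normalized - normalized.mean(axis=0)
normalized_artifact = np.abs(normalized_centered).max()
```

Without normalization, the mean-centered spectra contain a residual band that a model could mistake for a concentration change.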

Wavelength axis calibration is also crucial to accurate analysis of difference Raman spectra. Although difference Raman is exceptionally sensitive to chemically caused band shifts, it is equally sensitive to instrument artifact band shifts and results can be easily misinterpreted if the stability of the wavelength axis is overestimated for the time frame over which the experiment occurs. Issues of wavelength axis calibration are discussed further below.

E. Integration, Band Fitting, and Variable Selection

In many cases, the final goal of an analysis is to create a predictive model which can then be applied to future data to determine some condition, composition, or property of interest. To create the model, a dataset in which the property of interest has been varied is analyzed and one of various procedures is used to create the predictive model. Although the model can be used to analyze spectra from unknown samples, it is important to realize that any model has some limited set of information and training from which it must make all future predictions. As long as the chemical and instrumental conditions are similar to those under which the training set was collected, the model should continue to predict accurately. There are, however, many different effects that may alter the observed bandwidths, intensities, and/or positions and may jeopardize the model accuracy. The first step toward the development of a robust model is to include in the analysis only the bands which provide useful information and are not likely to be adversely affected by conditions not included in the training set.

In many cases, band selection can be based on simple observation of the Raman spectrum combined with chemical knowledge of the system. The bands known to be providing the information can be used and the others discarded from the calibration. Most analysis packages allow some sort of manual spectral region selection. Some calibration techniques such as partial least squares (PLS, discussed later) will automatically base their prediction on only the spectral features which provide useful predictive information. Bands which are clearly unrelated to the property of interest are excluded. Shaw and co-workers [33] show an interesting example in which they examine a large variety of peak characteristics for 17 peaks in the spectrum of glucose as it was transformed to ethanol by yeast. Through these characteristics, they selected the bands to pass to a PLS calibration model with very effective results.

These simple band selection approaches are commonly used in spectrographic analysis and work well, especially if the data which are used to train the model were collected under chemical and physical conditions similar to those under which the model will be required to predict. In practice, all the possible experimental conditions may not be available or even known, and the model may be presented in the future with spectra with some unexpected change. Bands which have been shifted or broadened from the calibration set are the most common anomalies observed with Raman spectra. Unfortunately, many models use the individual intensities observed at each Raman shift in a spectrum and their relationship to one another. When using such raw spectral information, models can be very sensitive to changes in band shape, position, and intensity. Whether these band changes are due to some physical or chemical interaction [34] or are due to an instrumental change, a robust model must be capable of either disregarding the change and predicting accurately or recognizing and reporting the discrepancy.

For analyses which can extract the needed information on band intensity alone and do not require the knowledge of exact band position or width, one simple solution is to take advantage of the commonly narrow bands observed with Raman spectroscopy and utilize deresolution [8]. Deresolution adds groups or "bins" of adjacent Raman shifts together to form a single value, which represents the intensity of all the included Raman shifts. This is demonstrated in Fig. 3. As a result of deresolution, intensity is largely decoupled from small changes in the position or width of the band. The resulting binned values are used in an analysis routine in the same manner as would be the raw spectral values; therefore, isolation of single-component or even single bands need not be achieved. The number of Raman shifts and positions of the bins can be determined by grouping every n observations together across the entire spectrum or through guided selection, in which the region around each band or bands of interest is binned together as a single integrated value [26,35]. This simple band integration approach offers improved resistance to spectral changes, but it does not have an inherent means to indicate if an anomalous band is integrated along with the expected band. Any background must also be completely removed prior to deresolution; otherwise this intensity will be included with the intensity of the Raman band without much ability to distinguish between the two.
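Deresolution by grouping every n points can be sketched as below. The example compares how a 2 cm⁻¹ band shift changes the raw intensities versus the binned values; the band width, position, and bin size are illustrative assumptions, and the band was deliberately placed well inside one bin (as guided selection would do).

```python
import numpy as np

def deresolve(spectra, n):
    """Sum every n adjacent points of each spectrum into a single bin."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    n_bins = spectra.shape[1] // n
    trimmed = spectra[:, : n_bins * n]   # drop points that do not fill a bin
    return trimmed.reshape(spectra.shape[0], n_bins, n).sum(axis=2)

shifts = np.arange(1100.0, 1700.0, 1.0)  # 600 points, 1 cm^-1 spacing

def band(center):
    return np.exp(-0.5 * ((shifts - center) / 3.0) ** 2)

original = band(1496.0)
shifted = band(1498.0)                   # same band, shifted by 2 cm^-1

# Relative change seen point by point in the raw spectrum.
raw_change = np.max(np.abs(shifted - original)) / original.max()

# Relative change seen after binning into 24-point (24 cm^-1) windows.
binned_original = deresolve(original, 24)[0]
binned_shifted = deresolve(shifted, 24)[0]
binned_change = (np.max(np.abs(binned_shifted - binned_original))
                 / binned_original.max())
```

In this sketch the raw intensities change by a large fraction of the peak height, while the binned values barely move. A band sitting near a bin edge would not be protected in the same way, which is one reason guided selection of bin positions can be preferable to blind grouping.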

A more adaptable approach which is also more useful in cases of high band overlap is least-squares fitting of the individual bands to some band-shape function. In these cases, an expected function for the shape of the bands which are overlapping is adjusted until all of the intensity of overlapping bands is taken into account. The advantage of this approach is that intensity, width, and position can be measured for a band and used independently of one another. The disadvantage is that band fitting may become unstable when anomalous peaks, excessive noise, or unremoved background are present. In these cases, the quality of fit should provide indication that the results cannot be trusted.

Another approach to desensitizing a model to instrumental effects on the spectrum was presented by Swierenga and co-workers [31,36]. True instrumental changes were included as part of the calibration set and sophisticated selection algorithms were used to select individual Raman shifts which would provide the best predictive ability in the presence of the instrumental effects. This approach differs from the previous two in that it empirically determines which wavelengths to use rather than performing a preprocessing step which desensitizes the model by removing changes. Although these empirical selection techniques have yet to be put into wide trials, the guided selection may prove to be of interest in the future.

Figure 3 Demonstration of deresolution of an epoxy spectrum using window sizes of (A) 6 cm⁻¹, (B) 12 cm⁻¹, and (C) 24 cm⁻¹. Note the decreasing ability to distinguish the shoulder on the 1600-cm⁻¹ band and the increasing insensitivity to absolute band position. Horizontal axis: Raman shift (cm⁻¹), 1100 to 1700.

F. Abscissa and Ordinate Calibrations

Issues relating to instrument calibration are discussed at length in Chapters 3 and 6. Here, we will discuss how these calibration issues affect different analyses. There are three primary calibrations which are typically done on a Raman spectrum: wavenumber axis calibration (including laser wavelength), intensity axis calibration, and laser-line-shape corrections. The first is mandatory for any Raman system; the other two are used selectively, depending on the application and instrumentation.

1. Wavenumber Axis Calibration

Because the identity of any band is strongly keyed to the energy at which it is observed, the accurate calibration of the wavelength or wavenumber axis is crucial. Perhaps even more crucial is an understanding of the stability of the wavenumber axis over time. Although the short-term wavenumber stability may be good enough to collect the spectra required for a calibration dataset, the stability over the time that calibration needs to function may be far different. It is advisable to determine the real stability of any system prior to forcing it to conform to unrealistic expectations. If a dataset need only be internally consistent, then the long-term stability may not matter. Recall, also, that the stability of the Raman spectrum's wavenumber axis will depend on the spectrograph stability as well as the laser wavelength stability and the ability of the analyzer to correct for any changes in either.

The most stringent need for wavenumber axis calibration is in determinations based on band position. For this reason, qualitative analyses are likely to be affected by drifts or inaccuracy in the wavenumber axis [14]. Likewise, quantitative determinations based on band position, such as strain in diamond films [6], will be affected similarly. Other quantitative analyses may also be affected by band-position error. It is common to use the raw spectral intensities (intensity at every wavenumber) in a multivariate analysis. Although this approach can be very powerful, any unexpected shift in wavenumber calibration can cause severe error in the model. In essence, the spectral pattern to which the model has been trained has been shifted. The mathematics of the model are expecting a particular relationship of intensity between adjacent variables (wavenumbers) and cannot usually account for shifts [31]. To some extent, multivariate models can be desensitized to inaccuracy and imprecision by assuring that the calibration samples also exhibit some of the same shifting features, but model sensitivity may suffer as a result. Although not in common use, other deconvolution methods have been introduced which may be applicable to removing shift effects of inaccurate wavenumber calibrations [37].

Another means of desensitizing a model to a shifting x axis is to either integrate band area or to use band fitting prior to model analysis, as both of these techniques remove much of the dependency of intensity on absolute band position. Figure 4 shows the effect of shift on the prediction of octene concentration through integrated intensity and raw PLS spectral analysis. The error associated with the integrated intensity determination is due simply to the "tails" of the bands shifting outside the integration window.

When using multivariate analyses, there are a variety of model metrics which evaluate the goodness of fit of a model to an individual spectrum (not to be confused with spectral band metrics previously discussed, which are simply some observed property of a band in the spectrum) [8]. These metrics should give a good indication of when the shift of a spectrum has passed outside of an acceptable range for an individual model.
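One widely used goodness-of-fit metric is the sum of squared spectral residuals left after projecting a spectrum onto the model factors (often called the Q residual). The chapter does not name a specific metric at this point, so the choice of Q here is an assumption, as are the one-factor PCA model, the function names, and the synthetic data.

```python
import numpy as np

def fit_pca(calibration, n_factors):
    """Return the calibration mean and the first n_factors PCA loadings."""
    mean = calibration.mean(axis=0)
    _, _, vt = np.linalg.svd(calibration - mean, full_matrices=False)
    return mean, vt[:n_factors]

def q_residual(spectrum, mean, loadings):
    """Sum of squared residuals not described by the model factors."""
    centered = spectrum - mean
    reconstructed = (centered @ loadings.T) @ loadings
    return float(np.sum((centered - reconstructed) ** 2))

shifts = np.linspace(1100.0, 1900.0, 401)

def band(center):
    return np.exp(-0.5 * ((shifts - center) / 10.0) ** 2)

rng = np.random.default_rng(0)
intensities = np.linspace(0.5, 1.5, 20)
noise = 0.01 * rng.normal(size=(20, shifts.size))
calibration = np.outer(intensities, band(1500.0)) + noise

mean, loadings = fit_pca(calibration, n_factors=1)

# A new spectrum on the calibrated axis versus one shifted by 5 cm^-1.
in_range = 1.1 * band(1500.0) + 0.01 * rng.normal(size=shifts.size)
shifted = 1.1 * band(1505.0) + 0.01 * rng.normal(size=shifts.size)
q_in_range = q_residual(in_range, mean, loadings)
q_shifted = q_residual(shifted, mean, loadings)
```

The shifted spectrum produces a much larger residual than the in-range spectrum, so a threshold on Q (set from the calibration data) could flag spectra whose wavenumber axis has drifted too far for the model.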

2. Intensity Axis Calibration

No two instruments have identical sensitivity to intensity at all wavelengths. As a result, the spectra measured for an identical sample on two different instruments will vary in the absolute intensities observed across the spectrum. An individual instrument may also change in its throughput over time. Recent solutions to the problem of calibrating a visible-wavelength Raman system are discussed in Chapters 3 and 6. Whether or not these corrective approaches are used, intensity axis inaccuracies affect models much in the way that nonconstant incident intensity or optical changes cause problems: Models utilizing intensity may be rendered invalid. To that extent, use of internal spectral normalization (see Sec. II.C) may reduce the difference between instruments. Unfortunately, the specific optical alignments in an instrument will affect the throughput of different regions of the spectrum to different extents. Although bands close together in a spectrum will likely have a similar factor of difference and could be used in reference to one another to correct for the instrumental differences, the more distant the bands, the more different their throughput is likely to be on the two instruments and the larger the potential error. Models that compare the relative intensities of multiple bands throughout the spectrum such as raw spectral methods will be the most affected. Depending on the extent of the difference between instruments, the various goodness-of-fit metrics should indicate a problem.

Figure 4 Octene prediction error observed with a shifting spectrum. The solid squares represent the prediction from a raw-spectrum PLS model; the open triangles represent prediction from a simple band integration and ordinary least-squares model. Horizontal axis: spectral shift (cm⁻¹).

3. Laser Wavelength and Line-Shape Effects

The final step in calculating a Raman spectrum is calculation of Raman shift in wavenumbers (cm⁻¹) from wavelength. This simple calculation is done by subtracting the absolute wavenumber (the reciprocal of the wavelength) observed for a Raman band from that either known or measured for the laser excitation. To this extent, the vibrational energy is a relative measurement. If the wavelength of the laser changes (ignoring any secondary excitation effects), the resulting spectrum will simply be shifted by an equal amount in absolute wavenumber. Correction of these simple shifts is often trivial and many commercial systems do so automatically by either monitoring the laser line itself and using this wavelength as 0 cm⁻¹ or by setting the shift of a known band to the expected Raman shift and shifting all other bands linearly with the correction.
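The conversion from wavelength to Raman shift, and the uniform error produced by an uncorrected laser drift, can be sketched as follows. Wavelengths are in nanometers, so 10⁷/λ gives absolute wavenumber in cm⁻¹; the 785.0 and 785.1 nm laser values are hypothetical.

```python
import numpy as np

def raman_shift(band_nm, laser_nm):
    """Raman shift (cm^-1) from band and laser wavelengths in nm."""
    return 1.0e7 / laser_nm - 1.0e7 / np.asarray(band_nm, dtype=float)

def band_wavelength(shift_cm1, laser_nm):
    """Wavelength (nm) at which a given Raman shift is observed."""
    return 1.0e7 / (1.0e7 / laser_nm - np.asarray(shift_cm1, dtype=float))

true_shifts = np.array([500.0, 1000.0, 1600.0])

# The laser has drifted to 785.1 nm, but 785.0 nm is still assumed.
observed_nm = band_wavelength(true_shifts, laser_nm=785.1)
apparent_shifts = raman_shift(observed_nm, laser_nm=785.0)
error = apparent_shifts - true_shifts
```

Every band is displaced by the same amount in wavenumber, which is why referencing the measured laser line (or a known band) is enough to correct this kind of drift.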

Several other excitation wavelength phenomena can occur which cannot be corrected by a simple shift. In most cases, the line shape observed for a band is a function of the natural width and shape due to physical and chemical effects on the vibration. These natural linewidths eventually approach some limit at which the spectrograph is incapable of resolving differences. This limit will depend on the resolution of the individual spectrograph and its settings [as well as data collection settings for Fourier transform (FT)-Raman systems]. For a spectral linewidth to become spectrograph limited, one other criterion must be met: The excitation profile of the laser must be narrower than the resolution of the spectrograph. This condition is easy to meet with most gas lasers as well as the common doubled Nd:YAG (532 nm) and similar lasers. It is not always met with diode lasers used for near-IR (NIR) Raman excitation. Some diode lasers will exhibit either an excitation line with non-negligible width or multiple discrete excitation wavelengths simultaneously during the period of a spectral collection. In essence, a separate Raman spectrum arises from each of the excitation wavelengths. In the case of a single broad laser line, the observed Raman line shape can be very notably affected. If multiple excitation wavelengths are observed, the resulting spectrum will contain multiple "ghost" images of the Raman spectrum.

Correction of simple laser-induced broadening effects may not be necessary if the resolution of the spectrograph is three or four times lower (coarser) than the laser bandwidth. If this is not the case and no other steps to correct for the laser profile are attempted, the long-term stability of any model will depend greatly on the consistency of the laser profile. Because some lasers exhibit unusual profiles in an inconsistent manner, repeatability is nearly impossible. The usual method to correct for the laser profile is to mathematically deconvolve it out of the spectrum using either a measured spectrum of a known reference sample or measurement of the laser profile itself. In either case, the mathematics are similar and require some iterative calculation to back-calculate what the "unperturbed" actual spectrum looks like. Some loss in signal to noise from the original spectrum can be expected by nature of the mathematics. Some residual ghost features will also be observed if the deconvolution is incomplete. The effects will usually be some loss in precision due to noise and potential inaccuracy if any ghost features are observed.

III. QUALITATIVE MULTIVARIATE ANALYSIS

The optical sampling abilities of Raman spectroscopy, which allow interrogation of the contents of unopened containers or solid materials and probe volumes as small as a cubic micron, make it optimal for use in a wide variety of qualitative identification tasks. The goals of these analyses are usually to either identify the components of a sample or to utilize differences between samples to group them without prior knowledge. Identification of components is referred to as classification and is performed by first training the model to recognize differences between preclassified samples followed by the prediction of unknown samples using the trained model. Because the classification requires the use of prior knowledge, these are referred to as supervised techniques [8]. Approaches which identify natural groupings or clusters within the data are referred to as unsupervised techniques because no prior information need be presented to the model to perform an analysis. Obviously, unsupervised cluster analysis techniques are useful when calibration samples are not available. They are also of use in simple investigative studies when deducing patterns in multiple spectra.

For more information on the analysis techniques described in the following sections, the reader is directed to several good texts which cover these subjects more thoroughly with a complete development of the mathematics behind the techniques [8,38-40].

Before providing some examples of qualitative analysis techniques, the basis for discrimination should be considered. Given a limited set of samples, all of which display significantly different spectra, it is possible that individual bands could be used to identify each species. In practice, such simple approaches are neither practical nor accurate. For example, the mere observation of a band near 1700 cm⁻¹ does not unambiguously indicate the presence of acetone, as many species exhibit a band in that region. Certainly, a more certain identification can be made by searching for the presence of multiple bands. Add to the determination a requirement that the observed bands have specific intensities relative to one another, and the determination improves greatly. The obvious extension is to utilize all bands present to discriminate the species.

As introduced earlier, such comparisons are often done by considering each point in a spectrum as an individual variable. If the bands were infinitely narrow and appeared as intensity at a single point (variable), then a model could be created which might state, for example, "Species A exhibits intensity at variables 5, 200, and 750." If no intensity is observed in variable 200, the presence of species A could be disproved. Of course, observed Raman lines are often not so discrete and isolated. A range of variables are used as a key for each vibrational mode.

A. Unsupervised Analysis Techniques

The initial investigation of most systems is determining if there are sufficient differences between samples for the analysis technique to be useful. At other times, the task is to find the spectral features which make several samples similar and discount those which make the samples appear different.

1. Classifications: Principal Components

The most common approach to identifying clusters in Raman data has been the use of eigenvector-based methods such as principal component analysis (PCA). PCA starts with an eigenvector decomposition of the data matrix into eigenvectors and eigenvalues. The eigenvectors, or "factors," are mathematical representations of the spectral data. They are determined as orthogonal vectors along n-dimensional axes which are calculated to describe the maximum amount of variance in the data. A useful way to picture the PCA operation for spectroscopy applications is to consider the entire set of spectra being simultaneously analyzed for similar features in the individual spectra. Mathematically, this is achieved by identifying multivariate axes in the data. A multivariate axis (i.e., factor) is similar to a line plotted on a typical x-y plot. It has an intercept and slope relative to the two axes. What makes factors unusual is that they are composed of as many dimensions as there are variables (spectral points) in the spectrum. If we only had two spectral points (a short spectrum indeed), two factors could be plotted on a simple x-y plot as two lines crossing but with some rotation and offset to the normal X and Y axes. If these factors are orthogonal, then they will cross forming a 90° angle. The first factor would fall along the primary trend of the data; the second would be perpendicular to the first. The orthogonality of the factors also gives them a particular mathematical nature, but this is only incidental to the discussion here and will not be discussed further [40].
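The two-point picture can be reproduced numerically. This sketch assumes synthetic "spectra" with only two variables that follow a linear trend of slope 0.5 plus a little noise, and obtains the factors from a singular value decomposition of the mean-centered data (one common way of computing PCA).

```python
import numpy as np

rng = np.random.default_rng(1)
trend = rng.normal(size=50)

# Fifty two-point "spectra": the second point tracks the first (slope 0.5).
data = np.column_stack([trend, 0.5 * trend + 0.05 * rng.normal(size=50)])
data = data + np.array([3.0, 1.0])       # offset from the origin

center = data.mean(axis=0)               # where the two factor lines cross
_, singular_values, factors = np.linalg.svd(data - center,
                                            full_matrices=False)

# Slope of the first factor on the x-y plot, and the angle between factors.
slope_1 = factors[0, 1] / factors[0, 0]
cos_angle = float(factors[0] @ factors[1])
angle_deg = float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
```

The first factor lies along the main trend of the points and the second crosses it at a right angle through the mean of the data.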

As we will demonstrate in a moment, this simple visualization is not possible with real (multipoint) spectra. Instead, each factor is plotted as a "spectrum" to visualize the axes. To interpret the factor spectra, consider that each factor describes some group of spectral features which are present in the spectral dataset. Spectral features which are completely correlated (only occur with one another) must appear together in an individual factor. As each factor is removed from the dataset, the remaining signal is examined for further similarities until none remain. Mathematically, the approach is accomplished in a variety of ways, usually through eigenvector decomposition. Many good explanations of PCA have been previously written [8,38-40], so here we present a brief description of PCA with some specifics to spectral analyses. This description of PCA will also be relevant to the explanation of several other multivariate analysis techniques.

Although a visual inspection of the factors obtained from an eigenanalysis is not strictly necessary, it can help identify which spectral features are changing. It also serves to help understand clustering using PCA. If the spectra have been mean centered (by subtracting the average spectrum, see Sec. II.D), then each factor will represent a change to the mean spectrum. If the spectra have not been mean centered, then one factor will represent some average spectrum and the other principal components will represent changes to this spectrum (it is worth noting that, depending on the structure of the data, this factor may not be the actual geometric mean spectrum but something approximating the mean). One band disappearing relative to another in the spectrum will produce a factor containing one negative and one positive weighted peak, as shown in Fig. 5. Band shifts will be represented in multiple factors as derivative-like shapes, which can be considered as the spectral differences between the observed band positions. If all of the spectra in a dataset are identical except for random noise, then all of the factors will contain only the individual noise patterns of each spectrum, as this is all that is needed to reproduce any spectrum from the mean.

Figure 5 The two factors recovered from a PCA of two simulated bands which change relative to one another. The data were not mean centered prior to PCA. Note that adding these two factors together, with various weightings on each, allows us to reproduce any spectrum containing these two bands individually or in combination.

Typically, the factors are sorted by their contribution to the observed changes in the dataset. A subset of factors which contribute the most to the data are selected; this set is sometimes called the principal factors. Because the changes in Raman signal are mostly orthogonal to (i.e., uncorrelated with) the noise in the spectra, significant spectral changes and random noise will be mostly segregated into different factors. As such, many factors can be discarded, dumping noise without losing useful signal. In this form, only a small number of factors are of real interest. It is important to note that smaller Raman signal changes will be mixed with noise in the latter useful factors and may not be easily detected.

Using these simple decomposition approaches, the first three factors aided in the identification of an unexpected interfacial region for some crystalline linear polyethylene samples analyzed by Shen et al. [41]. The first two factors observed could be identified as representing combinations of the crystalline and amorphous polymers. A third factor, however, indicated an unexpected band which was attributed to partial crystallization.

Because, in practice, each factor is a mathematical mix of various spectral features and changes, it is not always easily understood by inspection. However, used in combination (with the mean spectrum, if mean centering was used), the factors can be used to reproduce any of the spectra in the original dataset; that is, we can describe any spectrum, $\mathbf{x}$, by using some subset of the factors, described as vectors $\mathbf{p}_1$ through $\mathbf{p}_n$, each multiplied by a scalar weighting factor known as a score ($t_i$) or principal component (PC):

$$\mathbf{x} = t_1\mathbf{p}_1 + t_2\mathbf{p}_2 + \cdots + t_n\mathbf{p}_n + \mathbf{e}$$

The vector $\mathbf{e}$ represents the residual signal that was in $\mathbf{x}$ but not modeled by the $n$ factors used.
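As an illustration of this decomposition, the following is a minimal sketch in Python with NumPy. It is not taken from the original work: the simulated two-band spectra, the wavenumber axis, the noise level, and the helper name `band` are all assumptions made for the example, and the singular value decomposition is used here as one common way to obtain the factors and scores.

```python
import numpy as np

rng = np.random.default_rng(0)
shift = np.linspace(800, 1400, 301)      # assumed Raman shift axis, cm^-1

def band(center, width=12.0):
    """Gaussian band shape, used only to simulate data (hypothetical helper)."""
    return np.exp(-0.5 * ((shift - center) / width) ** 2)

# 20 simulated spectra: two bands whose intensities vary independently, plus noise
amounts = rng.uniform(0.2, 1.0, size=(20, 2))
X = amounts @ np.vstack([band(1000), band(1200)])
X += rng.normal(scale=0.01, size=X.shape)

# Mean center, then decompose; rows of Vt are the factors, sorted by variance
mean = X.mean(axis=0)
Xc = X - mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

n = 2                                    # number of factors retained
P = Vt[:n]                               # factors p_1..p_n, one per row
T = Xc @ P.T                             # scores t_1..t_n for every spectrum

# Reproduce the first spectrum from the mean, its scores, and the factors
x_hat = mean + T[0] @ P
e = X[0] - x_hat                         # residual not modeled by the n factors

explained = s**2 / np.sum(s**2)          # fraction of variance per factor
print(explained[:4].round(4), np.abs(e).max())
```

With only two independently varying bands, the first two factors should carry nearly all of the variance, and the residual should be comparable to the simulated noise. Plotting one column of `T` against another gives the PC plot used to look for clusters.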

The PCs are also the key to finding the underlying clusters of similar spectra. The basis for clustering arises from the fact that spectra which contain similar features will require similar amounts of the same factors. The simplest way of investigating for clustering is a PC plot in which the scores for all the spectra in the dataset for a given factor are plotted versus the scores for another factor. Similarities between different samples are indicated by groups of samples which have similar scores for the two factors. It is worth noting that the scores actually contain quantitative information from which we can make qualitative assignments. The quantitative uses of PCA will be discussed in more detail in Section IV.

Principal component analysis has been widely used in Raman spectroscopy. Kokot and co-workers showed the use of PCA to discriminate between several different dye treatments of cotton fabrics [42]. A subrange of the Raman spectra was mean centered and scaled (range of each spectral point scaled to ±1). The resulting data were then analyzed using PCA and PC plots were made identifying subgroups of the spectra for each fabric treatment. Figure 6 shows an example of one of these scores plots. In a similar manner, Allen and co-workers presented PCA results along with other supervised classification technique results (see Sec. III.B) in the analysis of various recyclable plastics [43]. Clusters of the various plastics could be observed through PC plots of the first three factors. Similar grouping ability has been reported using PCA on the Raman spectra of variously cured epoxide polymers [44].

Often, PCA is used as an initial investigative technique to determine if any differences can be identified within a dataset prior to a separate classification or quantitative analysis. PC plots are similarly useful in these cases. The previously mentioned work on recyclable plastics by Allen et al. is one example of this approach. Another example is the PCA analysis of a set of ethylene-vinyl acetate copolymer samples by Shimoyama and co-workers who first identified measurable differences between Raman spectra before creating a quantitative model [45]. PCA was also used by Keen et al. to distinguish fibers for forensic applications [46].

Figure 6 Scores scatterplot showing the identification of cotton fabric samples containing fixed (F and FN) and unfixed (U and UN) dye states. Samples FN and UN were ammonia treated (see Ref. 42 for details). (Axis label: PC2 Scores, 20.5%.) (Reprinted with permission from Ref. 42.)

Principal component analysis has played a key role in the analysis of multidimensional Raman data such as that collected for Raman images. The utility of PCA in analysis of multidimensional Raman data arises from the exceptionally large number of spectra available in combination with redundancy of these data. For example, in two-dimensional Raman imaging, a spectrum is collected at every point within a two-dimensional area for a sample. Because many of these spectra will have many similar band features, the entire set of spectra can often be described by a small number of factors. Because so many spectra (e.g., 2500 from a 50 × 50 point image) are used during the calculations, the noise can be very effectively removed from the Raman signal and the resulting factors will contain high signal-to-noise spectra from the dataset. If the factor scores are calculated for each point in the image, maps can be created which indicate regions of chemical similarity in the imaged sample. These maps are called score images. If the clustering techniques described earlier are used, the location of the spectra which make up any individual cluster could be identified as a chemically distinguishable region [47].
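The construction of score images can be sketched as follows. This is a hedged illustration only: the 10 × 10 image, the two hypothetical species (one band each), the noise level, and the helper name `band` are assumptions for the example and do not come from the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)
shift = np.linspace(800, 1400, 301)      # assumed Raman shift axis, cm^-1

def band(center, width=12.0):
    """Gaussian band shape, used only to simulate data (hypothetical helper)."""
    return np.exp(-0.5 * ((shift - center) / width) ** 2)

# Simulated image cube: one spectrum per pixel, shape (rows, columns, points)
ny, nx = 10, 10
cube = np.empty((ny, nx, shift.size))
cube[:, :5] = band(1000)                 # left half: hypothetical species A
cube[:, 5:] = band(1200)                 # right half: hypothetical species B
cube += rng.normal(scale=0.01, size=cube.shape)

# Unfold the image into a (pixels x points) matrix and run a mean-centered PCA
X = cube.reshape(ny * nx, -1)
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T

# Fold the scores back onto the pixel grid: one score image per factor
score_images = scores.reshape(ny, nx, 2)
print(np.sign(score_images[:, :, 0]).astype(int))
```

In this simulation, the first score image should take one sign over the left half and the opposite sign over the right half, marking the two chemically distinct regions. The overall sign of a factor is arbitrary.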

For either multidimensional or simple datasets, a related technique called factor analysis (FA) provides a more interpretable form of the same factors. Earlier, we discussed the somewhat abstract nature of the factors obtained from a PCA and how their interpretation as spectra can be complicated. Because of this, the interpretation of scores can also be very complicated. One method of simplifying the analysis of the scores is through "rotation" or transformation of the factors [22,24,38,40,48]. Transformation involves finding linear combinations (weighted sums) of the factors which produce meaningful (physically significant) spectra, which are called factors. It is worth warning the reader that in some statistics discussions, a second form of "factor analysis" exists which is different from that used in many chemometric discussions. Additionally, the term "principal factor analysis (PFA)" is sometimes used in place of factor analysis. In this discussion, we will use the term "factor analysis" as defined in works by Malinowski and Beebe et al. [8,38]. Additionally, in some articles, the use of PCs and factors may be mistakenly interchanged.

Mathematically, transformation is generally described as

$$\mathbf{p}' = w_1\mathbf{p}_1 + w_2\mathbf{p}_2 + \cdots + w_n\mathbf{p}_n$$

in which $\mathbf{p}'$ is the rotated factor, $\mathbf{p}_1$ through $\mathbf{p}_n$ are the original factors, and $w_1$ through $w_n$ are the weighting terms which must be determined during transformation. This process is shown graphically in Fig. 5. There are a variety of methods to perform transformation, but in most cases, the goal is to obtain factors which describe individual chemical species. In some cases, PCA is not even the starting point of the analysis, but alternate data decomposition techniques are used to make transformation easier [24]. These decomposition and transformation techniques are collectively called multivariate curve resolution (MCR). MCR techniques can be used as either qualitative or quantitative techniques. Quantitative analyses are discussed in the next section. An example of simple quantitative use is when MCR is applied to image analysis. The factors can be used to calculate score "images," which provide an indication of where those individual chemical species can be found within the imaged region [23,48,49]. Similar results can be obtained by analyzing large datasets obtained while monitoring separations [50,51]. The algorithms designed to perform MCR/FA are far too numerous to list. Malinowski provides a good overview of the general approaches [38]. Many good methods have been designed for a wide range of data formats (images, separations, kinetic studies), although not nearly as many have been applied to Raman analysis.
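One simple way to determine the weighting terms is to project a candidate ("target") spectrum onto the abstract factors, as sketched below. This is an illustrative assumption rather than the specific method of any cited work: the simulated two-band data, the helper `band`, and the choice of a known target shape are all hypothetical, and many practical MCR methods find the weights without a known target.

```python
import numpy as np

shift = np.linspace(800, 1400, 301)      # assumed Raman shift axis, cm^-1

def band(center, width=12.0):
    """Gaussian band shape, used only to simulate data (hypothetical helper)."""
    return np.exp(-0.5 * ((shift - center) / width) ** 2)

# Noise-free mixtures of two pure components (data are not mean centered)
pure = np.vstack([band(1000), band(1200)])
amounts = np.array([[1.0, 0.1], [0.8, 0.3], [0.5, 0.5], [0.3, 0.8], [0.1, 1.0]])
X = amounts @ pure

# Abstract factors: each one mixes both bands and is hard to interpret directly
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:2]

# Weights for a candidate pure spectrum; because the rows of P are orthonormal,
# the least-squares weights are simply the projections of the target onto P
target = band(1000)
w = P @ target                           # w_1, w_2
p_rot = w @ P                            # rotated factor p' = w_1*p_1 + w_2*p_2

print(w, np.abs(p_rot - target).max())
```

If the target really is one of the species present, the rotated factor should reproduce it closely. A poor match suggests the candidate shape is not described by the retained factors.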

Pitfalls of PCA analysis mostly arise from systematic errors, which make a spectrum appear different from the other spectra even if no chemical difference is present. These artificial differences will appear in the PCA results usually as at least one additional factor. As each additional factor decreases the ability to distinguish between useful factors with Raman information and those with noise, clustering could be lost or artificial clustering could be created. Some typical problems are as follows: drifts in either axis (wavenumber or intensity) during the collection of the data; inconsistent background removal; unfiltered cosmic spikes. Because of the exceptionally high intensity of a cosmic spike and the fact that they are completely random in position and occurrence, these anomalies create an additional factor for each occurrence. Transformation of factors in FA is another troublesome point. Mathematical methods which take into account bandwidths and constraints that intensity be "real" (non-negative) provide some aid [23,24].

2. Cluster Analysis and Neural Networks

The PCA approach described classifies samples by starting with the entire set as a single group and progressively looking for differences between the individual spectra by which they can be grouped and subgrouped. Theoretically, this can continue until each spectrum can be individually distinguished from all others. Another approach is to begin with each spectrum as an individual item and look for similarities by which these individuals can be grouped together. Three unrelated methods which can be used to achieve this are k-nearest neighbors (KNN) [52], hierarchical cluster analysis (HCA) [8], and artificial neural networks (ANN) [53].

The KNN method compares all spectra in the dataset through the use of the Euclidean distance (a metric describing the similarity between spectra). Based on this metric, groups of spectra can be defined. Daniel et al. used this approach to help classify different groups of nitro-containing explosive materials based on their Raman spectra.

The HCA method, which uses any of a variety of multivariate distance calculations to identify similar spectra, has found little use in Raman spectroscopy, although it could be of use in the growing analysis of complicated systems in which a large heterogeneous sample set is being analyzed. A study of spruce needles by Krizova et al. [54] and an investigation of cancerous skin lesions by Fendel and Schrader [55] are two examples showing the modest power of HCA.

Similarly, artificial neural networks, which can be either unsupervised or supervised, can be used to identify clusters within datasets of very complicated samples and spectra. ANNs attempt to emulate the processes of the human mind using mathematical "neurons" to classify (or quantify) data [53,56,57]. A single neuron is simply a mathematical function which provides a single output value based on the values of multiple inputs. The function can take a variety of forms but often takes the form of a weighted sum of the input values modified by a transfer function. The details of these functions are beyond the scope of this chapter. The reader is referred to the works by Zupan and Gasteiger for further details on these issues [53,56]. Although a large number of designs exist, simple spectroscopic ANNs often comprise a "hidden layer" of neurons which use spectral data as input. The output values of these neurons are, in turn, connected to another layer of neurons (the "output layer") out of which the classification(s) is made. It is useful to anthropomorphize the hidden layer as a group of specialists, each of whom is trained on specific features of the data. The hidden layer, as a whole, then advises each of the output layer specialists of its observations, and each output specialist makes its own individual assessment of whether or not a given sample belongs to its specific class. The number of layers (i.e., hidden layers), the number of neurons in the layers, the functions used in the neurons, and the form of the data, among other parameters, can be varied to create the best-performing ANN. Liu et al. presented some useful guidelines for design of ANN for spectroscopic data derived from work with Raman spectra [57]. In most cases, data reduction is necessary (i.e., some variable selection needs to occur so that the entire raw Raman spectrum is not passed to the ANN).
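The neuron and layer structure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the logistic transfer function, the layer sizes, the three input features, and the random (untrained) weights are all hypothetical choices, and no training step is shown.

```python
import numpy as np

def sigmoid(z):
    """Logistic transfer function (one common choice, assumed here)."""
    return 1.0 / (1.0 + np.exp(-z))

def layer(inputs, weights, biases):
    """Each row of `weights` is one neuron: a weighted sum of the inputs
    plus a bias, passed through the transfer function."""
    return sigmoid(weights @ inputs + biases)

rng = np.random.default_rng(0)

# Hypothetical reduced inputs, e.g. three selected band areas from one spectrum
features = np.array([0.8, 0.1, 0.4])

# Untrained, random weights: 4 hidden neurons feeding 2 output (class) neurons
W_hidden, b_hidden = rng.normal(size=(4, 3)), np.zeros(4)
W_out, b_out = rng.normal(size=(2, 4)), np.zeros(2)

hidden = layer(features, W_hidden, b_hidden)     # hidden-layer "specialists"
class_outputs = layer(hidden, W_out, b_out)      # one value per class
print(class_outputs)
```

In a supervised network, the weights and biases would be adjusted during training so that the output neuron for the correct class gives the largest value for the known samples.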

One example is the use of an unsupervised form of neural networks to classify wood types through their Raman spectra [58]. In this work, the neural-network approach compares favorably to results obtained previously by other researchers using only individual bands for discrimination. Lewis et al. [59] reported the use of a supervised neural network to classify wood types. Other examples of the use of ANN on Raman spectra can be found for polymers [60] as well as more complicated biological samples [61-63]. It is in these cases, when a large number of different features must all be considered together to provide the best analysis, that ANN are of notable use.

It is important to note that ANNs are also used for classification and quantification analyses. For the ANN to perform in these modes, it must be "trained" to predict a given set of known samples. In the training process, the weightings and other function parameters are adjusted to provide the best predictions.

B. Supervised Classification Techniques

Supervised classification techniques are far more commonly used for qualitative analysis, simply because of the predictive ability of the approaches. The general approach is to collect spectra for a group of known samples and then attempt to predict the identity or composition of an unknown sample based on its similarity to the known. Although commercial libraries of spectra for large numbers of compounds are becoming available, most classification work has been accomplished by development of much smaller specialized models. Because these application-specific models are trained for a particular set of substances, they tend to be more accurate, as they cannot give a "long-shot" assignment to a poor spectrum. Whereas a larger general model may misidentify a sample, a small specific model may simply fail to provide any assignment, thereby warning the user of the difficulty.

The simplest way of assigning an identity to an unknown is by identifying the positions, intensities, and, sometimes, bandwidths of the peaks present in the known and unknown samples. The sample from the calibration set having the most similar bands is considered the match. This approach may be misled by unexpected peaks or a poorly developed library. The next most advanced approach is to use the entire spectrum and do a cosine calculation between the unknown spectrum and the set of known spectra in the library. In this case, the unknown spectrum is compared to each library spectrum one at a time by considering the two spectra as multivariate vectors (as described in the PCA description earlier) and calculating the angle between them. This angle is dependent on both the position and intensity of the bands present in the spectra. The premise works well for Raman spectra with sharp, distinct bands without overlap. As with all multivariate vector approaches which use the raw spectral intensities as individual variables, the sensitivity to band position makes the library search very sensitive to error in the instrument calibration. Additionally, significant overlap can lead to ambiguous assignments. Such considerations have been discussed in the literature [14,64] with considerations of calibration changes on an individual instrument as well as comparisons between different instruments. The latter case of different instruments is certainly the most appealing goal, as one would like to be free to use the library on any instrument. McCreery et al. found that the simple comparison of band positions and intensities was sensitive to wavenumber axis inaccuracies as small as 2-3 cm⁻¹, but that the cosine library search method could tolerate 5-cm⁻¹ inaccuracies without failure. It is important to note that the sensitivity of any technique to wavenumber axis shifts will be highly dependent on the band shapes of samples as well as the resolution of the instrument.
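The cosine library search can be sketched as follows. The two-entry library, the compound names, the band positions and widths, and the helper names `band` and `cosine_search` are hypothetical and chosen only to illustrate the calculation, including its sensitivity to a wavenumber shift.

```python
import numpy as np

shift = np.linspace(800, 1400, 301)      # assumed Raman shift axis, cm^-1

def band(center, width=6.0):
    """Gaussian band shape, used only to simulate data (hypothetical helper)."""
    return np.exp(-0.5 * ((shift - center) / width) ** 2)

# Hypothetical two-entry library; real libraries hold measured spectra
library = {
    "compound A": band(1000) + 0.5 * band(1300),
    "compound B": band(1080) + 0.5 * band(1300),
}

def cosine_search(unknown, library):
    """Return the best-matching library name and the cosine of the angle
    between the unknown and every library spectrum."""
    cosines = {
        name: float(ref @ unknown / (np.linalg.norm(ref) * np.linalg.norm(unknown)))
        for name, ref in library.items()
    }
    return max(cosines, key=cosines.get), cosines

# Compound A measured at 3x intensity with a 2 cm^-1 calibration error
unknown = 3.0 * (band(1002) + 0.5 * band(1300))
best, cosines = cosine_search(unknown, library)

# The same compound with a larger, 10 cm^-1 axis error
_, cosines_shifted = cosine_search(band(1010) + 0.5 * band(1300), library)
print(best, cosines, cosines_shifted)
```

Because the cosine depends only on the angle between the vectors, the overall intensity scaling has no effect on the match. The larger axis shift lowers the cosine for the correct entry, which reflects the calibration sensitivity discussed above for narrow bands.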

One drawback to the traditional use of a library of spectra is that it assumes that a representative spectrum can be obtained for a given sample type. If, however, a sample exhibits a range of spectra depending on pressure, temperature, or other physical property, a single spectrum of that sample at a given set of experimental conditions may not be sufficient to identify that material across the entire range of conditions. Although multiple spectra for the same sample could be included in a library, a more valid analysis can be achieved by utilizing some simple statistical approaches. A large range of these techniques has been reported, but most utilize vector comparisons on either raw spectra or on factors from a PCA analysis of multiple spectra from all the samples of interest. The techniques of k-nearest neighbors (KNN) and soft independent modeling by class analogy (SIMCA) are two such techniques [8]. The supervised version of KNN compares a spectrum to a training set which contains multiple spectra from each type of sample (class). As in unsupervised KNN, the Euclidean distance is calculated although for classification, it is done between the unknown spectrum and all the training set spectra. The class of an unknown is assigned to that of the majority of the k closest training set spectra, where k is a previously determined number of spectra.
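The supervised KNN rule just described can be sketched as follows. The three-point "spectra", the class labels, and the function name `knn_classify` are hypothetical and serve only to show the distance calculation and the majority vote.

```python
import numpy as np
from collections import Counter

def knn_classify(unknown, train_spectra, train_labels, k=3):
    """Assign the class held by the majority of the k training spectra
    closest (in Euclidean distance) to the unknown spectrum."""
    distances = np.linalg.norm(train_spectra - unknown, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: three-point "spectra" for two classes
train_spectra = np.array([
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.1, 0.0, 0.1],   # class "PET"
    [0.1, 1.0, 0.9], [0.0, 0.9, 1.1], [0.2, 1.1, 1.0],   # class "HDPE"
])
train_labels = ["PET", "PET", "PET", "HDPE", "HDPE", "HDPE"]

unknown = np.array([0.95, 0.15, 0.05])
print(knn_classify(unknown, train_spectra, train_labels, k=3))
```

The same function can be applied to PCA scores instead of raw intensities by passing score vectors for both the training set and the unknown.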

In a general SIMCA analysis, a PCA is performed on the training set and the resulting factor scores are then analyzed using statistics to assign "regions" in the multidimensional factor space in which spectra of a given sample are located. Using the same factors, the scores for an unknown spectrum are used to locate the most likely class. Because multiple training spectra are used to define a "statistically likely" region for a given class, the class assignment for an unknown can be given with uncertainty limits.

Despite the success of these techniques with other spectroscopic data, very little has been published on their use with Raman data. The aforementioned work on postconsumer plastic identification by Allen et al. [43] utilized KNN for their analysis, although they present little of the actual classification results. Similarly, Krizova et al. [54] simply state that the SIMCA analysis of Norway spruce needles gave results similar to those of PCA and cluster analysis studies. More detail was given by Daniel et al. [52] when comparing KNN and ANN for analysis of explosive materials.

IV. QUANTITATIVE MULTIVARIATE ANALYSIS

Nearly all chemical analyses involve some level of quantitative determination. Appropriately, there are a significant number of articles describing the use of quantitative chemometric analyses on Raman spectra. In most cases, quantitative analysis for Raman data takes the same form as analyses of other data. This is because there is a fundamental relationship between intensity, band shape, or band position and either concentration or some other property which changes the spectrum of a substance (acidity, crystallinity, substitution, etc.). It is not necessary in all cases that the exact quantitative relationship between the spectral change and the property of interest be discerned. Often, an analysis can be done without performing an absolute calibration. We will discuss several chemometric techniques which take advantage of this fact. Of course, in most cases, a chemically significant value is desired. Many chemometric techniques produce calibrations which relate the spectral changes to the chemical values of interest.

A. Calibrationless Analysis Techniques

When it is only necessary to identify the extent to which some spectral change has occurred and not develop an exact relationship between that change and a given chemical change or property, several simple, yet powerful, techniques can be used. As an example, the determination of the width of a band may be enough to indicate that the density of a polymer sample varies through the sample [3] or the indication of different intensities from different regions of a microscopically examined sample could indicate heterogeneity in the sample [23].

In the preceding sections, we have discussed the use of principal component analysis (PCA) and factor analysis (FA) as two methods used for qualitative analyses. It is a straightforward extension from these, and related curve resolution and decomposition techniques, to quantitative data analysis. When the spectral profiles or factors are used to calculate scores, we are, in fact, calculating a number which provides a quantitative description of how much each factor has to be scaled to explain the represented features in the spectrum. Svensson et al. found that the primary factor and its PC (the principal component explaining the most significant spectral changes) from a PCA analysis of data collected for an ethyl acetate synthesis reaction provided excellent correlation with the reaction progression [29]. Thus, without any subsequent data manipulation, the PCA provided quantitative information on the system. This simplicity is an exception. Often, the many quantitative changes in the dataset will be mixed together and some transformation may be necessary to isolate the meaningful results. The factor analysis of the dilution of a highly concentrated glycine solution provides two primary factors as shown in Fig. 7. These factors require transformation before quantitative information can be interpreted in the scores [22]. Similarly, the factor analysis of two-dimensional Raman data by Andrew and Hancewicz provided both their Raman spectra and score images for a large number of chemical species [24]. These images contain the uncalibrated quantitative information regarding the concentration of the species in the sample, although actual concentration information can only be obtained through the use of reference spectra of known concentrations.

Quantitative use of multicomponent resolution techniques in Raman spectroscopy is primarily limited by the difficulty of isolating the common luminescence backgrounds observed in Raman spectra. The high intensity of some backgrounds and their presence throughout the spectrum sometimes make isolation difficult. This is especially true for techniques, such as SIMPLISMA [65], which require so-called "pure" variables or spectral regions in which only one species is observed. With a broad luminescence background, no single variable will be due to any one single species. For this reason, background subtraction is particularly important when using SIMPLISMA.

Figure 7 The first two factors recovered from a PCA of a serial dilution of supersaturated glycine. The derivative-like shapes in F2 represent shifts in several of the bands observed in F1. The data were not mean centered prior to PCA. (Horizontal axis: Raman shift in cm⁻¹, with ticks from 800 to 1400.)

B. Calibration-Based Analysis Techniques

As we discussed early in this chapter, there is significant value in the ability to predict a meaningful result from a spectrum. Often, we have a set of samples for which values of interest are already known, and the trick becomes using that knowledge together with the spectra of those samples to be able to predict future samples. When introducing Eqs. (2) and (3), it was indicated that a relationship between concentration (or other property) and intensity could be exploited to calibrate and then predict a value for an unknown sample. In practice, the calibration and prediction are performed using either classical or inverse techniques. Complete descriptions of these differences and details of many of the techniques we will be discussing here can also be found in the work by Malinowski and Beebe et al. [8,38].

It is worth mentioning that for any calibration to be successful (either quantitative or qualitative), careful experimental design is critical. The proper data must be provided to the calibration algorithm. These data should account for as many interferences as is possible both to understand how these interferences may affect the calibration and to possibly enhance the calibration to remove the effects. Poor experimental design should not be underestimated as a source of error in many calibrations. The topic is too broad for us to discuss here, but the reader is encouraged to investigate the general literature on it.

1. Classical Quantitative Calibrations

Classical techniques follow the basic form of Eq. (3). Two limitations on the use of classical techniques are that, for the most part, they are only useful for the prediction of concentration and they require that we know the concentrations of all interfering species present in a sample. Classical techniques are often referred to as ordinary or classical least squares (OLS, CLS). These techniques rely on the relationship specified in Eq. (3). In each case, a calibration is done to determine the m scaling terms for the component(s) present and the equation is inverted to predict a concentration from an observed intensity.

The simplest OLS calibrations usually rely on the metric from a single band, often either a peak's height or area. This is by far the most common analysis because the mathematics are simple to understand and implement. Because of the use of only a single value, OLS exhibits the poorest signal-to-noise prediction. It is, however, the least likely to be affected by unrelated spectral changes (bandwidths, positions, spikes), so that what is given up in sensitivity is often gained back in robustness. Another drawback is that multiple component determinations can quickly become too complicated for the simple equations. For every component in the system, another band intensity must be supplied. Some simple multicomponent systems can be analyzed if sufficient information is known and sufficient "isolated" bands can be measured. Although these criteria would be nearly impossible to meet for near-IR or fingerprint-region mid-IR spectra, it is not uncommon for Raman spectra. A typical analysis would start with a baseline or other means to remove background, followed by integration of multiple bands. Often, one of these bands is chosen as a reference band to which the others are normalized. Each normalized band is then used for calibration of the species responsible for the band [26]. Although they do not explicitly state it, many quantitative analyses which have been published to date use some form of OLS.
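A single-band calibration with a reference band might look like the sketch below. Everything in it is simulated under stated assumptions: the band positions, the integration windows, the background-free spectra, and the `power` term standing in for sample-to-sample changes in overall signal are hypothetical, and the helper names are invented for the example.

```python
import numpy as np

shift = np.linspace(800, 1400, 301)      # assumed Raman shift axis, cm^-1
step = shift[1] - shift[0]

def band(center, width=8.0):
    """Gaussian band shape, used only to simulate data (hypothetical helper)."""
    return np.exp(-0.5 * ((shift - center) / width) ** 2)

def band_area(spectrum, lo, hi):
    """Integrate a spectrum between two shifts (simple rectangle rule)."""
    mask = (shift >= lo) & (shift <= hi)
    return spectrum[mask].sum() * step

def simulate(conc, power):
    # Analyte band scales with concentration; the reference band is constant;
    # `power` mimics changes in overall signal level between measurements
    return power * (conc * band(1000) + band(1300))

def band_ratio(spectrum):
    """Analyte band area normalized to the reference band area."""
    return band_area(spectrum, 970, 1030) / band_area(spectrum, 1270, 1330)

# Calibration: fit a straight line of normalized band area against concentration
cal_conc = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
cal_power = np.array([1.0, 0.8, 1.3, 0.9, 1.1])
cal_ratio = np.array([band_ratio(simulate(c, p)) for c, p in zip(cal_conc, cal_power)])
slope, intercept = np.polyfit(cal_conc, cal_ratio, 1)

# Prediction: invert the line for an unknown measured at a different signal level
unknown = simulate(0.42, 0.7)
predicted = (band_ratio(unknown) - intercept) / slope
print(round(predicted, 4))
```

Normalizing to the reference band removes the dependence on overall signal level, which is why the prediction is unaffected by the different `power` values in this noise-free simulation.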

The term "classical least squares" (CLS) is often used to describe an extension of OLS analysis (as described earlier). CLS uses the entire spectrum, with each spectral point (i.e., discrete interval on the wavenumber axis) being considered a separate piece of information for which a calibration/prediction equation is derived. Because it uses multiple pieces of information, it offers better signal to noise over OLS and simpler multicomponent analysis. It does, however, require "pure" component spectra. These pure spectra are either actual spectra of the pure components or calculated spectra based on simple regression for each spectral point [8]. These pure spectra indicate how much a given component contributes to each spectral point. In terms of Eq. (3), the pure spectrum $j$ is a vector of the $m_{j,i}$ values for all wavelengths ($i$).

Although band separation is not as critical for CLS as it is with OLS, unremoved and/or varying backgrounds can devastate a CLS prediction. Possibly the most difficult to meet requirement of CLS or OLS is that the calibration set must represent all the expected components in one or more samples. Any unexpected component will introduce a bias into the prediction of the components which were included in the calibration. Few articles have explicitly used CLS for analysis of Raman spectra. One example can be found in a semiqualitative study of neurotransmitters in which CLS is compared to neural networks [63].
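A CLS prediction, and the bias caused by an unmodeled component, can be sketched as follows. The pure spectra, concentrations, interferent band, and helper name `band` are simulated assumptions for illustration, and `numpy.linalg.lstsq` is used here as one straightforward way to solve the least-squares problem.

```python
import numpy as np

shift = np.linspace(800, 1400, 301)      # assumed Raman shift axis, cm^-1

def band(center, width=12.0):
    """Gaussian band shape, used only to simulate data (hypothetical helper)."""
    return np.exp(-0.5 * ((shift - center) / width) ** 2)

# Pure-component spectra at unit concentration: row j holds the m_{j,i} values
M = np.vstack([
    band(1000) + 0.4 * band(1150),       # hypothetical component 1
    band(1100) + 0.6 * band(1300),       # hypothetical component 2
])

# A mixture spectrum that contains only the modeled components
c_true = np.array([0.7, 0.2])
x = c_true @ M

# CLS prediction: least-squares fit of the pure spectra to the whole spectrum
c_hat, *_ = np.linalg.lstsq(M.T, x, rcond=None)

# The same mixture with an unexpected third species overlapping component 2
x_bad = x + 0.3 * band(1120)
c_bad, *_ = np.linalg.lstsq(M.T, x_bad, rcond=None)

print(c_hat.round(4), c_bad.round(4))
```

The first fit should recover the true concentrations, while the second should overestimate component 2 because part of the unmodeled band is attributed to it.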

2. Inverse Quantitative Calibrations

The previous section alludes to the most common problem in quantitative Raman spectroscopic calibrations: Most models require that all components in a system be known and modeled in the calibration data to accurately predict any one component. Inverse calibration techniques such as inverse multiple linear regression (inverse MLR), principal component regression (PCR), and partial least squares (PLS; also known as projection to latent structures) avoid this problem by forcing the calibration steps to utilize only the spectral features which are either changing (PCR) or directly correlated to the property of interest (PLS). Moreover, not all components in a sample need to be known to perform an inverse calibration. The basic form of an inverse calibration centers around an equation of the form

$$P = \sum_i m_i I_i \qquad (6)$$

in which P is the property to be predicted (e.g., concentration), $I_i$ is a given intensity, and $m_i$ is the weighting value for that intensity which will be calculated during the calibration. Note the distinct difference between this equation and Eq. (3). Whereas Eq. (3) indicates that a given band intensity is due to contributions from the various species, Eq. (6) indicates that a concentration or property can be predicted from a weighted sum of various intensities. Equation (6) simply provides an empirical means by which the quantitative information can be calculated from the observed intensities. The source of the individual intensities, as well as how many will be used, depends on the technique.

The usual implementation of inverse MLR simply utilizes individual points of intensity from the spectrum. The difficulty with this approach is a mathematical one in the calibration step: Theoretically, a large number of calibration samples is needed if a large number of intensity points is used [8], because the regression has a unique solution only when there are at least as many samples as intensities. As the name implies, PCR solves this problem by using factors derived from a PCA (as described earlier in this chapter). The individual factors reduce the number of unique intensities one would use in Eq. (6) by identifying how individual intensities are changing with one another. PLS solves the problem in the same way by using what are called "latent variables," which are analogous to the factors used in PCR except that they are derived through a distinctly different process which takes into account the features which are correlated with concentration (or other chemical information being calibrated for).
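A small numerical sketch may help make Eq. (6) concrete for inverse MLR. Three intensity points are picked from synthetic spectra containing an analyte band and an overlapping interferent band, and the weights are obtained by least squares using only the analyte concentrations. The band shapes, the chosen channels, and the names `measure`, `I_cal`, and `m` are assumptions for this example.

```python
import numpy as np

def gaussian(x, center, sigma):
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

rng = np.random.default_rng(2)
shift = np.arange(900.0, 1101.0)
analyte = gaussian(shift, 1000.0, 10.0)
interferent = gaussian(shift, 1020.0, 10.0)   # overlapping band; never reported to the model

def measure(c_analyte, c_interferent):
    return (c_analyte * analyte + c_interferent * interferent
            + rng.normal(0.0, 1e-3, shift.size))

n_cal = 12
c_a = rng.uniform(0.1, 1.0, n_cal)            # known property P for each calibration sample
c_i = rng.uniform(0.1, 1.0, n_cal)            # varies in the samples but is not used below
spectra = np.array([measure(a, i) for a, i in zip(c_a, c_i)])

# Intensities I_i of Eq. (6): three individual spectral points.
channels = np.searchsorted(shift, [1000.0, 1010.0, 1020.0])
I_cal = spectra[:, channels]

# Calibration: solve P = sum_i m_i * I_i for the weights m_i.
# This needs at least as many samples (12) as intensities (3).
m, *_ = np.linalg.lstsq(I_cal, c_a, rcond=None)

# Prediction for a new sample is a weighted sum of its intensities.
x_new = measure(0.42, 0.77)
P_new = x_new[channels] @ m
print(f"predicted analyte concentration: {P_new:.3f}")
```

The interferent concentration is never supplied, yet the analyte is predicted correctly because the interferent varied within the calibration set. Using all 201 spectral points with only 12 samples would leave the weights underdetermined, which is the limitation that PCR and PLS address.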


a. PCR. For a PCR calibration, a PCA is performed on the calibration dataset and a subset of factors is selected which sufficiently describes the data. The original spectra are then projected onto these factors to calculate scores for each calibration sample. These scores are subsequently used as the intensities in Eq. (6) to calculate the weighting factors ($m_i$). Therefore, there will be i factors used with an equal number of weighting terms. Remember that each factor may describe an entire band or multiple bands, as shown in Fig. 5. To predict for an unknown sample, the previously calculated factors are used with a spectrum of the unknown to calculate new scores and, subsequently, a prediction via Eq. (6). Note that the factors are calculated only once, on the original calibration dataset, and that these factors are then used on all subsequent spectra.
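The calibration and prediction steps just described can be sketched with a singular value decomposition. This is an illustrative implementation on synthetic data, not the procedure of any particular software package. Mean centering before the PCA is an assumption added here (it is common practice but is not stated in the text), and `pcr_fit`, `pcr_predict`, and the simulated bands are hypothetical.

```python
import numpy as np

def gaussian(x, center, sigma):
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

rng = np.random.default_rng(3)
shift = np.arange(800.0, 1201.0)
K = np.vstack([gaussian(shift, 1000.0, 10.0), gaussian(shift, 1020.0, 10.0)])

def measure(conc):
    return conc @ K + rng.normal(0.0, 2e-3, shift.size)

C_cal = rng.uniform(0.1, 1.0, size=(15, 2))
X_cal = np.array([measure(c) for c in C_cal])
y_cal = C_cal[:, 0]                      # only the property of interest is supplied

def pcr_fit(X, y, n_factors):
    x_mean, y_mean = X.mean(axis=0), y.mean()
    # PCA of the calibration spectra: rows of Vt are the factors.
    _, _, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    factors = Vt[:n_factors]
    # Scores: projection of each calibration spectrum onto the factors.
    scores = (X - x_mean) @ factors.T
    # Weights m_i of Eq. (6), with the scores playing the role of intensities.
    m, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    return x_mean, y_mean, factors, m

def pcr_predict(model, x):
    x_mean, y_mean, factors, m = model
    scores = (x - x_mean) @ factors.T    # factors are reused, never recomputed
    return scores @ m + y_mean

model = pcr_fit(X_cal, y_cal, n_factors=2)
y_new = pcr_predict(model, measure(np.array([0.42, 0.77])))
print(f"predicted concentration: {y_new:.3f}")
```

Two factors are used because two species vary in this synthetic set; with real spectra the number of factors would have to be chosen from the data.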

Principal component regression provides excellent signal to noise because of the denoising inherent in PCA. Another inherent advantage of PCR is that spectral changes other than simple integrated band intensity, such as band shifts and width changes, also appear in the factors. The incorporation of all these changes allows the straightforward use of PCR to calibrate for properties other than concentration. To achieve the same result with a classical calibration approach, one would have to know a priori which band features were going to be related to the property or concentration of interest.

Principal component regression suffers from two primary drawbacks. First, PCA produces factors which explain the changes in the data but does not guarantee that those changes provide any useful information regarding the species of interest. As a result, spectral changes not required to calculate the concentration can be incorporated into the model. These features may bring additional noise and, if for some reason they change (e.g., a change in solvent, a temperature change, an additional background), the PCR model will no longer describe all the spectral features present and will fail to predict correctly. A second difficulty with PCR arises if spectral features which were not present in the calibration dataset (background, new bands, etc.) appear in the spectrum to be predicted; these will not have been incorporated into the factors and the PCR model will have no way to account for them. As a result, severe prediction errors will occur. When using PCR with Raman spectra, unexpected luminescent backgrounds are among the most common spoilers.

As mentioned earlier, care should be taken with background subtraction techniques, as some have been observed to leave residual background in the spectra [22]. Fortunately, there are straightforward means to calculate the applicability of the PCR model to the current spectrum based on spectral residuals [8].
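One residual-based check of this kind is the sum of squared spectral residuals left over after a spectrum is reconstructed from the model factors. The sketch below is a simplified stand-in for the methods in Ref. 8: the threshold of three times the largest calibration residual is a crude, hypothetical rule chosen only for illustration, and the simulated luminescence band and the name `q_residual` are likewise assumptions.

```python
import numpy as np

def gaussian(x, center, sigma):
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

rng = np.random.default_rng(4)
shift = np.arange(800.0, 1201.0)
K = np.vstack([gaussian(shift, 1000.0, 10.0), gaussian(shift, 1020.0, 10.0)])

def measure(conc, background=0.0):
    return conc @ K + background + rng.normal(0.0, 2e-3, shift.size)

# PCA model built once from calibration spectra that contain no background.
X_cal = np.array([measure(c) for c in rng.uniform(0.1, 1.0, size=(15, 2))])
x_mean = X_cal.mean(axis=0)
_, _, Vt = np.linalg.svd(X_cal - x_mean, full_matrices=False)
factors = Vt[:2]

def q_residual(x):
    """Sum of squared differences between a spectrum and its reconstruction from the factors."""
    centered = x - x_mean
    reconstructed = (centered @ factors.T) @ factors
    return float(np.sum((centered - reconstructed) ** 2))

q_cal = np.array([q_residual(x) for x in X_cal])
q_limit = 3.0 * q_cal.max()              # crude, hypothetical threshold

conc = np.array([0.42, 0.77])
luminescence = 0.05 * gaussian(shift, 900.0, 150.0)   # broad band absent from calibration
q_good = q_residual(measure(conc))
q_bad = q_residual(measure(conc, luminescence))

print(f"limit {q_limit:.4f}  clean {q_good:.4f}  with background {q_bad:.4f}")
```

A spectrum whose residual greatly exceeds those of the calibration set contains features the factors cannot describe, so any prediction made from it should be treated with suspicion.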

Principal component regression is commonly used. Some good examples of PCR with Raman data are shown in the work by Cooper [4], in which he studied the use of several different chemometric analysis techniques, including PCR, to quantify the progress of two reactions. This work demonstrated some good comparisons between the different techniques and showed how well PCR can work when there is little background and nearly all the observed Raman signal is from the components of interest. In contrast, an early work by Haaland et al. indicated that PCR could be used without problems even in the presence of backgrounds [66]. In these cases, the assumption is that the backgrounds which appear in prediction spectra are similar to those observed in the calibration spectra and actually contain useful information regarding the samples. As a result, including factors with background in the calibrations does not harm the predictive ability of the model. Two other examples of PCR with Raman data include a quantification of the oxidation state of a thin film on an electrode using Raman line transects [67] and a semiqualitative analysis of olive oil [68].


b. PLS. What gives PLS its particular power and utility is the means by which spectral information is "selected" for use in the calibration [8]. In PCR, we decomposed the calibration spectra based solely on spectral change and, after the fact, we calibrated those spectral changes to the desired known values. We were never assured that the changes we observed would be correlated with the known values. In PLS, the known values for the calibration are included in the decomposition step, thereby heavily weighting spectral changes which are correlated with the value of interest. As with PCA and PCR, the noise observed in the spectra is isolated into separate latent variables (LVs) which are left out of the calibration, improving prediction precision. Nonlinear relationships between the properties of interest and intensity can also be accommodated in a PLS model by including multiple LVs. Additionally, PLS benefits from the inverse nature of the equations. Simply put, one need not know all the concentrations or exact conditions in a mixture to calibrate. In general, it is recommended that the calibration set contain as many conditions as are expected to be found in the samples which will be predicted later.
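To show how the known values enter the decomposition, the following sketch implements single-property PLS (often called PLS1) with the NIPALS algorithm. This particular algorithm and the mean centering are assumptions chosen for the illustration, as the text does not specify an implementation; `pls1_fit` and the synthetic two-band data are hypothetical.

```python
import numpy as np

def gaussian(x, center, sigma):
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def pls1_fit(X, y, n_lv):
    """PLS1 by NIPALS; returns the spectral mean, property mean, and regression vector."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = E.T @ f                      # weights: spectral directions that covary with y
        w = w / np.linalg.norm(w)
        t = E @ w                        # scores for this latent variable
        p = E.T @ t / (t @ t)            # spectral loading
        q_k = (f @ t) / (t @ t)          # regression of the property on the scores
        E = E - np.outer(t, p)           # deflate the spectra
        f = f - q_k * t                  # deflate the property
        W.append(w); P.append(p); q.append(q_k)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # regression vector in the original spectral space
    return x_mean, y_mean, b

rng = np.random.default_rng(5)
shift = np.arange(800.0, 1201.0)
K = np.vstack([gaussian(shift, 1000.0, 10.0), gaussian(shift, 1020.0, 10.0)])

def measure(conc):
    return conc @ K + rng.normal(0.0, 2e-3, shift.size)

C_cal = rng.uniform(0.1, 1.0, size=(15, 2))
X_cal = np.array([measure(c) for c in C_cal])
y_cal = C_cal[:, 0]                      # the second species is never reported to the model

x_mean, y_mean, b = pls1_fit(X_cal, y_cal, n_lv=2)
y_new = (measure(np.array([0.42, 0.77])) - x_mean) @ b + y_mean
print(f"predicted concentration: {y_new:.3f}")
```

The only difference from a PCA decomposition is the first line of the loop: each weight vector is computed from the covariance between the spectra and the property, so features correlated with the property are emphasized from the first latent variable onward. Plotting `b` against `shift` is one way to inspect which spectral features the model relies on.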

Although inverse approaches are useful in automatically identifying features which can be used to predict, it is always important to perform a "rationality check" on the features which are being used. This is even more important when using PLS for Raman spectral analysis, as there are many strong spectral features which are not directly associated with the Raman effect and are, therefore, unreliable as true predictors. One example would be the inadvertent use of background or Rayleigh (elastic) scatter to predict the extent of an emulsion reaction. In such a reaction, the opacity of the solution increases over time; as a result, the background luminescence and elastic backscatter increase in proportion to the extent of reaction. Additionally, reflected room light will also increase in intensity. If any of these spectral features are supplied to the PLS algorithm, they will be incorporated into the model. If, however, the collection geometry or reaction conditions change, the opacity of the solution may be altered, thereby throwing off the calibration. A sounder model would be developed if only the polymer product bands (e.g., CH stretching vibrations), solvent bands (e.g., HOH water stretch), and reactant bands were provided. All of the PLS software packages currently on the market allow inspection of the latent variables for spectral features. These should be evaluated based on known spectral features and changes therein.

Inverse calibrations in the form of PLS analysis have been applied to a wide range of Raman prediction tasks. The sheer number of applications precludes listing them here. Instead, we present a few demonstrative examples. Many fuel applications have been attempted using Raman spectroscopy for problems to which infrared (IR) absorption spectroscopy was originally applied. In 1989, one of the first uses of PLS on Raman data was published by Seasholtz et al. [69]. The authors used MLR, PCR, and PLS with Raman spectroscopy to predict the mass percentage of liquid fuel mixtures. They acknowledge that although the signal to noise of the Raman spectra may be lower than that of the equivalent near-IR spectra, this may be offset by the sharp spectral features observed in the Raman spectra. The best results were obtained using PCR and PLS, mainly due to the signal averaging afforded by the data decomposition and the use of multiple factors/latent variables for the prediction. Other work using Raman spectra of liquid fuels developed PLS calibrations to predict weight percent oxygen [70], octane numbers and Reid vapor pressures [4,5], and benzene, toluene, and ethylbenzene concentrations [71]. The 1996 Cooper et al. article observed somewhat less precision at lower percent oxygen levels relative to IR spectroscopy, but attributed this to the lack of normalization and a laser which had a known change in output. It also makes reference to many more PLS-Raman spectroscopy articles by the authors. Walder and Smith also observed less precision for a PLS prediction of xylene isomer concentration from Raman spectra relative to IR spectra, although no mention was made of normalization in their work either [72].

A second industrial field which has often used Raman spectroscopy and PLS analysis for quantitative modeling is the production of polymers. PLS has been used on Raman spectra to predict the density of poly(ethylene terephthalate) [3] and polyethylene [73], and Chalmers and Everall mention crystallinity measurements for polyketones [74]. Similarly, Sano and co-workers presented a density study of linear low-density polyethylene using PLS and Raman spectroscopy [75]. A subset of these authors had previously shown [45] the prediction of vinyl acetate content in ethylene-vinyl acetate copolymers. PLS was used along with PCA to study the rate constants for the synthesis and hydrolysis of ethyl acetate [29].

With the successful application of PLS to such a wide range of tasks and the attractive ability of Raman spectroscopy to measure from remote sampling locations, it is no surprise that many researchers are developing quantitative Raman models for on-line process monitoring and control. These implementations of PLS-Raman models expose the primary limitation of PLS modeling. Although PLS is capable of high-precision predictions by isolating only the spectral features which provide information for the prediction, it is this sensitivity to small changes in the observed spectrum which compromises accuracy when instrumental or physical properties drift into an unexpected range. Specifically, changes in the calibration of the wavelength axis, the laser wavelength, observed bandwidths (whether due to chemical or instrumental effects), and wavelength-dependent absorption effects are all typical problems which can drive a model to inaccurate prediction. All of these are commonly observed in on-line analyses, particularly when a model is transferred to another instrument, when instrumental changes such as a laser replacement occur, or when an unstable instrument is allowed to drift in calibration. Another exceptionally common source of error in PLS predictions is a direct result of the use of absolute intensities. Using the normalization approaches mentioned earlier in this chapter will greatly stabilize a PLS model against unavoidable changes in total intensity.
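The effect of absolute intensities can be demonstrated with a toy inverse model. In the sketch below, a minimum-norm least-squares regression stands in for PLS (an assumption made to keep the example short), and normalization to an isolated solvent band of constant concentration is used as one possible normalization approach. The band positions, the 20% intensity drop, and the names `measure`, `normalize`, and `fit` are hypothetical, and the data are noise-free.

```python
import numpy as np

def gaussian(x, center, sigma):
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

shift = np.arange(800.0, 1601.0)
analyte = gaussian(shift, 1000.0, 10.0)
solvent = gaussian(shift, 1400.0, 10.0)         # isolated band of constant concentration
ref_window = (shift >= 1370.0) & (shift <= 1430.0)

def measure(conc, power=1.0):
    """Noise-free spectrum whose overall intensity scales with laser power."""
    return power * (conc * analyte + solvent)

def normalize(x):
    """Divide each spectrum by its summed intensity in the reference (solvent) window."""
    return x / x[..., ref_window].sum(axis=-1, keepdims=True)

def fit(X, y):
    """Minimum-norm least-squares regression vector (stand-in for a PLS model)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

cal_conc = np.linspace(0.1, 1.0, 10)
X_cal = np.array([measure(c) for c in cal_conc])   # calibration at full laser power
b_raw = fit(X_cal, cal_conc)
b_norm = fit(normalize(X_cal), cal_conc)

# New sample: true concentration 0.5, but total intensity has dropped by 20%.
x_low = measure(0.5, power=0.8)
raw_prediction = x_low @ b_raw
norm_prediction = normalize(x_low) @ b_norm
print(f"absolute intensities: {raw_prediction:.3f}")
print(f"normalized spectra:   {norm_prediction:.3f}")
```

The model built on absolute intensities reads the intensity drop as a drop in concentration, whereas the model built on normalized spectra is unaffected. This protects only against changes in total intensity, not against the band shifts or bandwidth changes mentioned above.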

As the number of Raman spectroscopy-based applications has increased, research into making PLS and other models less sensitive to these effects has also increased. Work discussing variable selection and better experimental design has been published by Swierenga and co-workers [31,36]. In these two articles, it was shown that including variations of instrumental conditions in a PLS calibration model could improve the stability (i.e., long-term accuracy) of a model. Additionally, the use of variable selection through simulated annealing was shown to greatly reduce the effect of band shifting. New methods to correct for spectral shifts are also being developed [37].

V. CONCLUSION

The combination of new preprocessing and analysis techniques along with careful use of the standard approaches can provide a wealth of information. The range of techniques available to the analyst is continually advancing: intelligent modeling systems which automatically create quantitative models from a dynamic library; neural-network expert systems which can identify functional groups and whole molecules; analyses of data from multiple sources and multiple dimensions. Although all of these tools are helping to improve the predictive ability of chemometric analyses, the key to the most successful analyses will always be careful consideration of which spectral changes are going to provide the information and how those changes can best be isolated from the expected interferences. With consideration of the particular problems encountered with Raman spectra, reliable analyses can be performed; when reliable analyses are performed, all of the inherent advantages of the Raman technique can be fully exploited.

REFERENCES

1. JH Giles, DA Gilmore, MB Denton. J Raman Spectrosc 30:767, 1999.
2. N Everall, H Owen, J Slater. Appl Spectrosc 49:610, 1995.
3. N Everall, K Davis, H Owen, MJ Pelletier, J Slater. Appl Spectrosc 50:388, 1996.
4. JB Cooper. Chemometr Intell Lab 46:231, 1999.
5. PE Flecher, WT Welch, S Albin, JB Cooper. Spectrochim Acta A 53:199, 1997.
6. DS Knight, WB White. J Mater Res 4:385, 1989.
7. TM Niemczyk, MM Delgado-Lopez, FS Allen. Anal Chem 70:2762, 1998.
8. KR Beebe, RJ Pell, MB Seasholtz. Chemometrics: A Practical Guide. New York: Wiley, 1998.
9. A Savitzky, MJE Golay. Anal Chem 36:1627, 1964.
10. PD Wilson, SR Polo. J Opt Soc Am 71(5):599, 1981.
11. G Horlick. Anal Chem 44:943, 1972.
12. L Pasti, B Walczak, DL Massart, P Reschiglian. Chemometr Intell Lab Syst 48:21, 1999.
13. B Walczak, B van den Bogaert, DL Massart. Anal Chem 68:1742, 1996.
14. RL McCreery, AJ Horn, J Spencer, E Jefferson. J Pharm Sci 87(1):1, 1998.
15. K Li, S Banerjee. Appl Spectrosc 45:1047, 1991.
16. MA Friese, S Banerjee. Appl Spectrosc 46:246, 1992.
17. K Maquelin, L Choo-Smith, T van Vreeswijk, HP Endtz, B Smith, R Bennett, HA Bruining, GJ Puppels. Anal Chem 72:12, 2000.
18. AJ Berger, TW Koo, I Itzkan, MS Feld. Anal Chem 70:623, 1998.
19. IS Helland, T Naes, T Isaksson. Chemometr Intell Lab Syst 29:233, 1995.
20. SM Haight, DT Schwartz. Appl Spectrosc 51:930, 1997.
21. PA Mosier-Boss, SH Lieberman, R Newbery. Appl Spectrosc 49:630, 1995.
22. JM Shaver, KA Christensen, JA Pezzuti, MD Morris. Appl Spectrosc 52:259, 1998.
23. N Jestel, JM Shaver, MD Morris. Appl Spectrosc 52:64, 1998.
24. JJ Andrew, TM Hancewicz. Appl Spectrosc 52:797, 1998.
25. TV Karstang, R Manne. Chemometr Intell Lab Syst 14:165, 1992.
26. MJ Pelletier, KL Davis, RA Carpio. Electrochem Soc Proc 95(2):282, 1995.
27. RJ Barnes, MS Dhanoa, SJ Lister. Appl Spectrosc 43:772, 1989.
28. MS Dhanoa, SJ Lister, RJ Barnes. J Near-Infrared Spectrosc 2:43, 1994.
29. O Svensson, M Josefson, FW Langkilde. Chemometr Intell Lab Syst 49:49, 1999.
30. TM Hancewicz, C Petty. Spectrochim Acta A 51:2193, 1995.
31. H Swierenga, AP de Weijer, RJ van Wijk, LMC Buydens. Chemometr Intell Lab Syst 49:1, 1999.
32. LA Nafie, D Che. In: M Evans, S Kielich, eds. Theory and Measurement of Raman Optical Activity in Modern Nonlinear Optics, New York: Wiley, 1994, Part 3, pp 105-149.
33. AD Shaw, N Kaderbhai, A Jones, AM Woodward, R Goodacre, JJ Rowland, DB Kell. Appl Spectrosc 53:1419, 1999.
34. MJ Pelletier. Appl Spectrosc 53:1087, 1999.
35. ED Lipp, RL Gross. Appl Spectrosc 52:42, 1998.
36. H Swierenga, AP de Weijer, LMC Buydens. J Chemometr 13:237, 1999.
37. F Westad, H Martens. Chemometr Intell Lab Syst 45:361, 1999.
38. ER Malinowski. Factor Analysis in Chemistry. New York: Wiley, 1991.
39. RA Johnson, DW Wichern. Applied Multivariate Statistical Analysis. 4th ed. Englewood Cliffs, NJ: Prentice-Hall, 1998.
40. R Reyment, KG Joreskog. Applied Factor Analysis in the Natural Sciences. New York: Cambridge University Press, 1996.
41. C Shen, AJ Peacock, RG Alamo, TJ Vickers, L Mandelkern, CK Mann. Appl Spectrosc 46:1226, 1992.
42. S Kokot, NA Tuan, L Rintoul. Appl Spectrosc 51:387, 1997.
43. V Allen, JH Kalivas, RG Rodriguez. Appl Spectrosc 53:672, 1999.
44. JF Aust, KS Booksh, CM Stellman, RS Parnas, ML Myrick. Appl Spectrosc 51:247, 1997.
45. M Shimoyama, H Maeda, K Matsukawa, H Inoue, T Ninomiya, Y Ozaki. Vibrat Spectrosc 14:253, 1997.
46. IP Keen, GW White, PM Fredericks. J Forensic Sci 43(1):82, 1998.
47. CM Stellman, KS Booksh, ML Myrick. Appl Spectrosc 50:552, 1996.
48. CD Hayden, MD Morris. Appl Spectrosc 50:708, 1996.
49. CA Drumm, MD Morris. Appl Spectrosc 49:1331, 1995.
50. PA Walker III, JM Shaver, MD Morris. Appl Spectrosc 51:1394, 1997.
51. KJ Schostack, ER Malinowski. Chemometr Intell Lab Syst 10(3):303, 1991.
52. NW Daniel, IR Lewis, PR Griffiths. Appl Spectrosc 51:1868, 1997.
53. J Zupan, J Gasteiger. Anal Chim Acta 248:1, 1991.
54. J Krizova, P Matejka, G Budinova, K Volka. J Mol Struct 480-481:547, 1999.
55. S Fendel, B Schrader. J Anal Chem 360:609, 1998.
56. J Zupan, J Gasteiger. Neural Networks for Chemists. New York: VCH, 1993.
57. Y Liu, BR Upadhyaya, M Naghedolfeizi. Appl Spectrosc 47:12, 1993.
58. H Yang, IR Lewis, PR Griffiths. Spectrochim Acta A 55:2783, 1999.
59. IR Lewis, NW Daniel, NC Chaffin, PR Griffiths. Spectrochim Acta A 50A:1943, 1994.
60. C Batur, MH Vhora, M Cakmak, T Serhatkulu. ISA Trans 38:139, 1999.
61. R Goodacre, EM Timmins, R Burton, N Kaderbhai, AM Woodward, DB Kell, PJ Rooney. Microbiology (UK) 144:1157, 1998.
62. M Gniadecka, HC Wulf, NN Mortensen, OF Nielsen, DH Christensen. J Raman Spectrosc 28(2-3):125, 1997.
63. HG Schulze, LS Greek, BB Gorzalka, AV Bree, MW Blades, RFB Turner. J Neurosci Methods 56(2):155, 1995.
64. CK Mann, TJ Vickers. Appl Spectrosc 532356, 1999.
65. J Guilment, S Markel, W Windig. Appl Spectrosc 48:320, 1994.
66. DM Haaland, KL Higgins, DR Tallant. Vibrat Spectrosc 1:35, 1990.
67. SM Haight, DT Schwartz, MA Lilga. J Electrochem Soc 146:1866, 1999.
68. V Baeten, M Meurens, MT Morales, R Aparicio. J Agric Food Chem 44:2225, 1996.
69. MB Seasholtz, DD Archibald, A Lorber, BR Kowalski. Appl Spectrosc 43:1067, 1989.
70. JB Cooper, KL Wise, WT Welch, RR Bledsoe, MB Sumner. Appl Spectrosc 50:917, 1996.
71. PE Flecher, JB Cooper, TM Vess, WT Welch. Spectrochim Acta A 52:1235, 1996.
72. FT Walder, MJ Smith. Spectrochim Acta A 47:9, 1991.
73. KPJ Williams, NJ Everall. J Raman Spectrosc 26:427, 1995.
74. JM Chalmers, NJ Everall. Trend Anal Chem 15:18, 1996.
75. K Sano, M Shimoyama, M Ohgane, H Higashiyama, M Watari, M Tomo, T Ninomiya, Y Ozaki. Appl Spectrosc 53:551, 1999.
