orthogonalizationapproaches for data preprocessing with ... · example calibration data...

Orthogonalization Approaches for Data Preprocessing with

Pharmaceutical, Petrochemical and Remote Sensing Applications

Barry M. Wise, Jeremy M. Shaver and Neal B. Gallagher

Eigenvector Research, Inc.

AbstractOver the past dozen years, a number of powerful spectral analysis methods have been published which make use of orthogonalization (i.e. projection followed by weighted subtraction) of interferences or "clutter." These filtering methods provide a means to mitigate the effect of interferences arising from background chemical or physical species, instrumental artifacts, systematic sampling errors and instrument or system drift. They have been used very effectively with complex biological systems, remote sensing applications, chemical process monitoring and calibration transfer problems.

This class of methods includes Orthogonal Partial Least Squares (O-PLS), External Parameter Orthogonalization(EPO), Dynamic Orthogonal Projection (DOP), Orthogonal Signal Correction (OSC), Constrained Principal Spectral Analysis (CPSA), Generalized Least Squares Weighting (GLSW), and Science Based Calibration (SBC) among others. All are based on the orthogonalization premise and each touts a unique ability to improve model performance, robustness, and/or interpretability.

Some relationships between these methods are noted, along with ties to older work. Examples are given of the use of the methods in calibration and classification problems in pharmaceutical, petrochemical and remote sensing applications.

What is an Orthogonalization Filter?

• Removes spectral patterns from data which are "interfering" with signal of interest

• The interfering species are historically called “clutter” (backgrounds, noise, interferents)

• Filters return spectra with features “removed”• Weighted subtraction of one or more vectors• "Soft” orthogonalization is deweighting but

not outright complete subtraction

Some Examples Using Orthogonalization Filters(by Eigenvector)

• In vivo Tissue identification with NIR probe• Cancer detection using in vivo fluorescence• Identification of arthlesclerosis in artery walls

using NIR• Determination of hydroxide concentration in

high-concentration aqueous ion solutions using Raman spectroscopy

• Identification of chemical species in remote sensing

Method 1: Orthogonalization of Model

Method 2: Pre-selection of "clutter"

SOME Orthogonalization Filters

• OSC – Orthogonal Signal Correction (Wold et. al. 1998)

• OPLS – Orthogonal PLS (Trygg, Wold 2002 , patented)

• MOSC – Modified OSC (POSC - Feudale, Tan, S. Brown 2003)

• CPSA - Constrained Principal Spectral Analysis(J. Brown 1990 , patented)

• EPO – External Parameter Orthogonalization(Roger, Chauchard, Bellon-Maurel 2003)

• GLS – Generalized Least Squares(Aitken 1935, Martens et. al. 2003)

• SBC – Science Based Calibration(Marbach 2005, patented (?))

Two General Approaches

Filter

PCADecomposition

ClutterSpectra

Clutter Loadings

ChooseSubset

Method 2: Pre-selection of "clutter"Method 1: Orthogonalization of Model

PCA or PLSDecomposition

CalibrationSpectra

Scores & Loadings

OrthogonalizeTo Y-block

Y-block(Classes)

Clutter Loadings

FilteredSpectra

Repeat for Multiple

Components

Orthogonal Signal Correction (OSC)

• Introduced by Wold in 1998– “OSC paper not the clearest thing” – Johan Trygg,

June 9, 2011

• OSC objective function

OSC Issues

• To the extent the objective function is optimized OSC doesn’t work• Only works if you don’t try too hard!

• Many algorithms (at least 5) with various problems• Factors not orthogonal to y• Factors don’t capture maximum variance in X• Filtered X not in same subspace as original X

• Often implemented prior to cross validation—totally misleading!

O-PLS• Originally formulated as sequential algorithm

(NIPALS based)• Since shown to be obtainable from post-

processing conventional PLS model• Does not improve prediction• Claim is that model is more interpretable

E.K. Kemsley and H.S. Tapp, “OPLS filtered data can be obtained directly from non-orthogonalized PLS1,” J. Chemo, 23, 263-264, 2009R. Ergon, “PLS post-processing by similarity transformation (PLS+ST): a simple alternative to OPLS,” J. Chemo, 19, 1-4, 2005J. Trygg and S. Wold, “Orthogonal Projections to Latent Structures (O-PLS),” J. Chemo, 16, 119-128, 2002.

NIR of Pseudo-gasoline Samples

800 900 1000 1100 1200 1300 1400 1500 1600

0

0.005

0.01

0.015

0.02

0.025

Wavelength (nm)

Abso

rban

ceEstimated Pure Component Spectra

PLS Model on Component 1

10 12 14 16 18 20 22 24 26 28 3010

12

14

16

18

20

22

24

26

28

30

Y Measured 1

Y C

V Pr

edic

ted

1

Samples/Scores Plot of spec1

R2 = 0.9885 Latent VariablesRMSEC = 0.46945RMSECV = 0.68306Calibration Bias = −1.0658e−14CV Bias = 0.0375

Percent Variance Captured by Regression Model -----X-Block----- -----Y-Block-----

Comp This Total This Total ---- ------- ------- ------- -------1 91.17 91.17 8.36 8.36 2 7.40 98.57 7.19 15.55 3 0.93 99.50 32.81 48.36 4 0.46 99.96 26.18 74.54 5 0.02 99.98 24.90 99.44

Regular and O-PLS Filtered Regression Vectors

800 900 1000 1100 1200 1300 1400 1500 1600−2

−1.5

−1

−0.5

0

0.5

1

1.5

Variable

Reg

Vec

tor f

or Y

1

Variables/Loadings Plot for spec1

800 900 1000 1100 1200 1300 1400 1500 1600−25

−20

−15

−10

−5

0

5

10

15

20

25

Variable

Reg

Vec

tor f

or Y

1

Variables/Loadings Plot for spec1

Regular O-PLS Filtered

Interpretation

• Better in previous example, but partly because we know what spectra should look like

• What if problem has discrete variables with signal that could be positive, negative or zero?

• Much harder! (see “On the Interpretability of O-PLS Models”)

• Working on developing better understanding of when it will work and when it won’t

Orthogonalize Model

Pre-selection Methods…

• CPSA - Constrained Principal Spectral Analysis(J. Brown 1990, patented)

• EPO – External Parameter Orthogonalization(Roger, Chauchard, Bellon-Maurel 2003)

• GLS – Generalized Least Squares (Aitken 1935)

• SBC – Science Based Calibration(Marbach 2005, patented (?))

FilterPCADecomposition

ClutterSpectra

Clutter Loadings

ChooseSubset

• Identical• Choose # of PCs

• Quite similar• Down-weight by

scale of eigenvalues

All the same…

19

Clutter Covariance

€

Xc = (X1,c − x 1,c ) + (X2,c − x 2,c ) + ...

€

C =XcTXc

N −1

Clutter source 1 Clutter source 2

20

Covariance to GLS Weighting Matrix

€

C = VS2VT

€

G = VD−1VT

€

di,i−1 =

1si,i2

g2+1

with Large g è 1, dimension unaffectedSmall g è 0, dimension eliminated

Choosing Components

1 2 3 4 5 6 7 8 9 10 11-5.5

-5

-4.5

-4

-3.5

-3

-2.5

-2

-1.5

Principal Component Number

log(

eige

nval

ues)

EPO / CPSAxf = x - xPkPk

T

GLS / SBCxf = x - xPDPT

Eigenvalues of Clutter

One adjustable parameter in each method

k=4 k=5k=3

decreasing D

Other Similar Pre-selection Filters…• Extended Mixture Model (Extended Least

Squares) orthogonal filtering for Classical Least Squares (CLS) models!

Target (Calibration) Spectra

Starget

Clutter Spectra

Sclutter

c = xS(STS)-1

Pseudo-inverse is an orthogonalization!

Equivalent to full-rank EPO / CPSA model

Pre-selecting Clutter

How to get clutter?Look at differences in samples

which should otherwise be the same.

In classification – all samples within a class should nominally be the same!

Use Calibration itself! Filter

PCADecomposition

ClutterSpectra

Clutter Loadings

ChooseSubset

CalibrationSpectra

More on How to Get Clutter

• Pure component spectra of known interferences

• Subspace spanned by – samples where analyte of interest is not present– variation in data that is all of the same class – differences between samples where analyte of

interest is (nearly) the same, e.g. y-gradient– repeat measurement of blanks

• Make it up! e.g. polynomial baseline shapes

Y-gradient Method

• Sort samples by y (reference) values• Take differences between adjacent samples• Weight X-differences by inverse of difference

in y values • Deweight by covariance of differences (GLS) or

orthogonalize against some number of PCs (EPO, ELS, EMM, PA-CLS)

Orthogonalization FiltersFilter Soft/

HardAdj. Params

Clutter source Improves Prediction?

OSC Hard # LVs Part of X orthogonal to y No, but reduces models complexity

O-PLS Hard # LVs Part of X-model space orthogonal to X’y

No, but improves interpretation

MOSC Hard # PCs Part of X orthogonal to y Maybe

CPSA Hard # PCs A priori, includes pathlength adj. Yes

EPO Hard # PCs Classes, y-gradient or a priori Yes

DOP Hard # PCs Synthetic reference samples Yes

GLS Soft Shrinkage a Classes, y-gradient or a priori Yes

SBC Soft # PCs (20?) Repeat samples or blanks Yes

EMM Hard None A priori from known interferents, clutter subspace

Yes, CLS model

ELS Hard # PCs Clutter subspace Yes

PA-CLS Hard None/# PCs Baseline shapes, residuals Yes, CLS model

WLS Soft Regularization Noise measurements Yes

We think it is useful to use Clutter!

Example Classification Data

4000 3500 3000 2500 2000 1500 1000 500

0

0.5

1

1.5

2

2.5

Using these regions only

• Mid-IR spectra of food grade oils• Classify oils, detect adulterated olive oil

Calibration with MSC

�0.2 �0.15 �0.1 �0.05 0 0.05 0.1 0.15 0.2 0.25 0.3�0.1

�0.05

0

0.05

0.1

0.15

0.2

0.25

Scores on PC 1 (84.99%)

Scor

es o

n PC

2 (1

2.71

%)

Samples/Scores Plot of Olive Oil Calibration

Scores on PC 2 (12.71%)CMargCornOliveSaffl

Cal and Test with MSC

�0.2 �0.15 �0.1 �0.05 0 0.05 0.1 0.15 0.2 0.25 0.3�0.1

�0.05

0

0.05

0.1

0.15

0.2

0.25


Scor

es o

n PC

2 (1

2.71

%)

Samples/Scores Plot of Olive Oil Calibration & Oiltest,

With MSC and GLS

�0.06 �0.04 �0.02 0 0.02 0.04 0.06 0.08 0.1�0.04

�0.02

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14


Scor

es o

n PC

2 (3

7.55

%)


Zoom on Olive Oil

�0.06 �0.055 �0.05 �0.045 �0.04 �0.035 �0.03 �0.025 �0.02 �0.015

�0.035

�0.03

�0.025

�0.02

�0.015

�0.01


Scor

es o

n PC

2 (3

7.55

%)


Calibration and test Olive Oil

Adulterated Olive Oil

Zoom on Corn and Safflower Oil

0.04 0.045 0.05 0.055 0.06 0.065 0.07 0.075 0.08

�15

�10

�5

0

5

x 10�3


Scor

es o

n PC

2 (3

7.55

%)


Calibration and test Corn Oil

Calibration and test Safflower Oil

Zoom on Corn Margarine

�0.04 �0.0395 �0.039 �0.0385 �0.038

0.1215

0.122

0.1225

0.123

0.1235


Scor

es o

n PC

2 (3

7.55

%)


With MSC and EPO

�0.2 �0.15 �0.1 �0.05 0 0.05 0.1 0.15 0.2 0.25�0.05

0

0.05

0.1

0.15

0.2


Scor

es o

n PC

2 (1

1.52

%)


Indian Pines Data

• Classic image data set used in many publications

• Crop area near West Lafayette, Indiana• Ground truth identified 16 know crop areas• Data from AVIRIS: Airborne Visible/Infrared

Imaging Spectrometer• 220 channels, 400-2500nm

Indian Pines Image

Soybean Fields

Soybeans no tillSoybeans minSoybeans clean

PLS-DA, Mean-Center Only

Class Probability Image

PLS-DA, EPO 1-PC

Class Probability Image

Example Calibration Data• IDRC-2002 Shootout data• NIR Transflectance of pharmaceutical tablets• Goal is to predict assay value

600 800 1000 1200 1400 1600 18002

2.5

3

3.5

4

4.5

5

5.5

6

6.5

Wavelength (nm)

Sign

al Int

ensit

y

Calibration and Test with MSC & MC

150 160 170 180 190 200 210 220 230 240150

160

170

180

190

200

210

220

230

240

Y Measured 3 assay

Y Pr

edic

ted

3 as

say

Samples/Scores Plot of calibrate_1,c & test_1,

R^2 = 0.9642 Latent VariablesRMSEC = 3.3253RMSEP = 3.3487Calibration Bias = 0Prediction Bias = �0.4224

With MSC, GLS & MC

150 160 170 180 190 200 210 220 230 240150

160

170

180

190

200

210

220

230

240

Y Measured 3 assay

Y Pr

edic

ted

3 as

say


R^2 = 0.9842 Latent VariablesRMSEC = 2.5171RMSEP = 2.159Calibration Bias = �1.1369e�13Prediction Bias = 0.067298

With MSC, EPO & MC

150 160 170 180 190 200 210 220 230 240150

160

170

180

190

200

210

220

230

240

Y Measured 3 assay

Y Pr

edic

ted

3 as

say



With MSC, ELS and MC

150 160 170 180 190 200 210 220 230 240150

160

170

180

190

200

210

220

230

240

250

Y Measured 3 assay

Y Pr

edic

ted

3 as

say



Conclusions• Main differences between methods are– How the clutter is defined– Whether the de-weighting is hard or soft

• Filtering methods are more similar than published statements might have you believe

• Methods achieve similar results, model performance generally improved (except O-PLS, OSC)

• Interpretation of filtered results can be challenging – except OPLS (mostly)

orthogonalizationapproaches for data preprocessing with ... · example calibration data...

Documents