nir shootout 2002 - eigenvector peak...3/26/2015 3 150 160 170 180 190 200 210 220 230 240 150 160...

3/26/2015

1

Automated Peak and Peak-Ratio

Selection for Regression and

Classification Models of

Raman and LIBS Data

Jeremy M. Shaver Eigenvector Research, Inc.

Brian Marquardt, Tom Dearing, Sergey Mozharov, MarqMetrix

NIR Shootout 2002

• 2002 International Diffuse Reflectance Conference (IDRC) "Shootout" data

– NIR spectra

– 654 pharmaceutical tablets

– Calibration Set, Validation Set, Test Set

– Two spectrometers

– Goal: best model with calibration transfer

• Won by Karl Norris using "Norris Regression" –selected peaks and peak ratios including gap-segment derivative

3/26/2015

2

Norris' "Winning" Model

Term 1 Term 2

Wavelength Smooth Gap Wavelength Smooth Gap

Numerator 1142 nm 10 nm 26 nm 1338 nm 0 nm 22 nm

Denominator 920 nm 0 nm 30 nm

600 800 1000 1200 1400 1600 18002

2.5

3

3.5

4

4.5

5

5.5

6

6.5

Wavelength

Interactive manual selection of regions and smooth/gap parameters

150 160 170 180 190 200 210 220 230 240140

150

160

170

180

190

200

210

220

230

240

Y Measured 3 assay

Y P

redic

ted 3

assay

RMSEC: 2.9

RMSECV: 3.0

RMSEP: 2.8 2.8

Results using (Approx.) Norris Regression

Preprocessing: 2nd Derivative (gap: 18 nm, segment: 6 nm) +

Integrate + Autoscale

validation sets

(both instruments)

calibration data

Note: Test set covers same range as calibration data

3/26/2015

3

150 160 170 180 190 200 210 220 230 240150

160

170

180

190

200

210

220

230

240

250

Y Measured 3 assay

Y P

redic

ted 3

assay

Samples/Scores Plot of c & Test

R^2 = 0.9683 Latent VariablesRMSEC = 2.585RMSECV = 2.6755RMSEP = 4.0273Calibration Bias = 0CV Bias = -0.018797Prediction Bias = 1.345

EMSC + 1st Derivative + Mean Centering

RMSEC: 2.6

RMSECV: 2.7

RMSEP: 3.3 4.8

EMSC = Extended Multiplicative Scatter Correction

Tabulated Results

RMSEC RMSECV Val 1 Val 2 Test 1 Test 2

Norris Regression 2.7 2.7 2.8 2.8 3.0 3.3

Expert-Selected

Preprocessing2.6 2.7 3.3 4.8 2.8 4.2

Good Model… but Bad Transfer

3/26/2015

4

Norris Regression – GenericallyNon-linear Regression

(Gap-Segment 1st Derivative)

(Peak Normalization)

(Peak Normalization with

variable- gap 1st derivative)

Binary Encoding of Norris Equations

• Example for 5 variables: [ x1 x2 x3 x4 x5 ]

xi

0 0 0 0 0 0 1 0 0 0 1 0 0 0 10 0 1 0 0 0 0 0 0 0 0 0 0 1 0

xi

x1

xi

x2

xi

x3

xi

x4

xi

x5

x2

x2

x3x1

x3

x5

x3

x4

x5

This much could be done by pre-computing…

but at a big memory cost

(525MB for shootout data)

3/26/2015

5

+ Allow Subtraction…

• Example for 5 variables: [ x1 x2 x3 x4 x5 ]

• One additional group to identify "baseline"

xi

0 0 0 0 0 0 1 0 0 0 1 0 0 0 10 0 1 0 0 0 0 0 0 0 0 0 0 1 0

xi

x1

xi

x2

xi

x3

xi

x4

xi

x5

0 1 0 0 0

-xi

x3 –x2x1 –x2

x3 –x2

x5 –x2

x3 –x2

x4 –x2

x5 –x2

Pre-computation would now require

2 x10201 variables (for the shootout data)

variables = 2n (n2+n)

1050 1100 1150 1200 1250 13002

3

4

5

Variables

Me

an

1050 1100 1150 1200 1250 13002

3

4

5

Variables

Mean

+ Binning to Reduce Dimensionality

5-fold variable bin Similar effect to use of

smoothing in derivatives

650 Variables = 422,500 possible ratios (many quite boring)

130 variables = 16,900 possible ratios

3/26/2015

6

+ Genetic Algorithm to Select Terms

• Try lots of combinations (Calculate variable

ratios and offsets on-the-fly)

• Choose best cross-validated results

• Breed (intermix terms) and repeat

• Will refer to this as "GA-Norris"

• Question: Can this approach approximate

what the interactive Norris approach does?

Tabulated Results

RMSEC RMSECV Val 1 Val 2 Test 1 Test 2

Norris Regression 2.7 2.7 2.8 2.8 3.0 3.3

Expert-Selected

Preprocessing2.6 2.7 3.3 4.8 2.8 4.2

GA Norris (Cal 1 only) 2.4 2.5 3.9 5.0 2.8 3.7

GA Norris (Cal 1 & 2) 2.8 2.9 3.0 3.0 3.0 3.3

Simple GA (Cal 1 & 2) 2.6 2.7 3.7 3.8 3.3 3.5

Selecting Variables based on both

instruments (building model from ONE)

yields GA Norris preprocessing which closely

approximates what Karl Norris did.

3/26/2015

7

0 5 10 15 20 250

0.2

0.4

0.6

0.8

1

1.2x 10

-3

Hotelling T^2 (97.97%)

Samples/Scores Plot of c & Test

0 2 4 6 8 100

2

4

6

8

10

12

Hotelling T^2 (84.10%)

Q R

esid

uals

(15.9

0%

)

GA-Norris Model Expert Preprocessing Model

Outlier Detection Achieved…

Where Else Would Ratios Help?

• Raman – correcting for throughput differences

and offsets

• LIBS – correcting for throughput differences

and for emphasizing the importance of

"relative abundance"

3/26/2015

8

Raman of Octene in Toluene

15

1000 1100 1200 1300 1400 1500 1600 17000

2

4

6

8

10

12

Raman Shift (cm-1)

• Raman spectra measured on 36 solutions of Octene in

Toluene (3 replicates of 12 concentrations)

• Calibration set for on-line monitoring of polymerization

process feed line (octene is comonomer).

• Little interference or other artifacts in calibration data

• EXPECT: throughput errors and spectral shifts

• Calibrate with 24 samples, Validate with 9

Y Predicted Octene (w/w fraction)

3/26/2015

9

RM

SE

P

0

0.05

0.1

0.15

0.2

Au

tosc

ale

No

rma

lize

No

rma

lize

(10

00

cm

-1)

1st

De

riva

tive

Wh

itta

ker

Ba

seli

ne

Ba

seli

ne

+

No

rma

lize

GA

No

rris

(0.89) (0.50)

Prediction Error Vs. Interferences

Autoscale Scale variables to unit standard deviation

Normalize Divide by total intensity

Normalize (1000 cm-1) Divide by intensity at 1000 cm-1 peak

1st Derivative Savitzky-Golay 1st Derivative (15 point)

Whittaker Baseline Automatic baseline subtraction

GA Norris Binning + GA Norris Variable Selection + Ratios

(All methods also include mean centering)

RM

SE

P

0

0.05

0.1

0.15

0.2

No

rma

lize

(10

00

cm

-1)

(0.89) (0.50)

As Measured

w/ 0.1 cm-1 Axis Shift

* Throughput Errors

+ Baseline Errors


3/26/2015

10

RM

SE

P

0

0.05

0.1

0.15

0.2

Au

tosc

ale

No

rma

lize

No

rma

lize

(10

00

cm

-1)

1st

De

riva

tive

Wh

itta

ker

Ba

seli

ne

Ba

seli

ne

+

No

rma

lize

GA

No

rris

(0.89) (0.50)

As Measured

w/ 0.1 cm-1 Axis Shift

* Throughput Errors

+ Baseline Errors


Q Residuals (0.00%)

Outlier Detection: Baseline+Norm

???

Good predictions but Bad Outlier Status

3/26/2015

11

Q Residuals (0.73%)

Outlier Detection: GA Norris

LIBS / Raman Classification

• Mystery classes (natural product, difficult to

separate classes)

• Raman data – not much information

• LIBS data – too much information

• Anticipate Peak Ratios should help greatly in

LIBS!

• Try GA Norris on LIBS

3/26/2015

12

Error Rate

Fewer Selected Peaks

1000 Selection Results

Better

than

using all

variables

Worse

than

using all

variables

Example GA Norris Results

Error Rate

Error Rate

Error Rate

Error Rate

All = 17%

Best = 8%

All = 9%

Best = 2%

All = 19%

Best = 0%

All = 8%

Best = 3%

All Classes

3/26/2015

13

With Randomized Classes (Red)

Non-linear model

+ variable selection

+ large domain

= large chance of over-fit

= use caution & permutation tests

3/26/2015

14

Conclusions

• GA Norris can reproduce Norris Regression

results

• Can be used to achieve similar results to

standard preprocessing (but with less sound

decisions!)

• Large chance of over-fit = use caution &

permutation tests, or standard methods!!

nir shootout 2002 - eigenvector peak...3/26/2015 3 150 160 170 180 190 200 210 220 230 240 150 160...

Documents