nir shootout 2002 - eigenvector peak...3/26/2015 3 150 160 170 180 190 200 210 220 230 240 150 160...
TRANSCRIPT
-
3/26/2015
1
Automated Peak and Peak-Ratio
Selection for Regression and
Classification Models of
Raman and LIBS Data
Jeremy M. Shaver Eigenvector Research, Inc.
Brian Marquardt, Tom Dearing, Sergey Mozharov, MarqMetrix
NIR Shootout 2002
• 2002 International Diffuse Reflectance Conference (IDRC) "Shootout" data
– NIR spectra
– 654 pharmaceutical tablets
– Calibration Set, Validation Set, Test Set
– Two spectrometers
– Goal: best model with calibration transfer
• Won by Karl Norris using "Norris Regression" –selected peaks and peak ratios including gap-segment derivative
-
3/26/2015
2
Norris' "Winning" Model
Term 1 Term 2
Wavelength Smooth Gap Wavelength Smooth Gap
Numerator 1142 nm 10 nm 26 nm 1338 nm 0 nm 22 nm
Denominator 920 nm 0 nm 30 nm
600 800 1000 1200 1400 1600 18002
2.5
3
3.5
4
4.5
5
5.5
6
6.5
Wavelength
Interactive manual selection of regions and smooth/gap parameters
150 160 170 180 190 200 210 220 230 240140
150
160
170
180
190
200
210
220
230
240
Y Measured 3 assay
Y P
redic
ted 3
assay
RMSEC: 2.9
RMSECV: 3.0
RMSEP: 2.8 2.8
Results using (Approx.) Norris Regression
Preprocessing: 2nd Derivative (gap: 18 nm, segment: 6 nm) +
Integrate + Autoscale
validation sets
(both instruments)
calibration data
Note: Test set covers same range as calibration data
-
3/26/2015
3
150 160 170 180 190 200 210 220 230 240150
160
170
180
190
200
210
220
230
240
250
Y Measured 3 assay
Y P
redic
ted 3
assay
Samples/Scores Plot of c & Test
R^2 = 0.9683 Latent VariablesRMSEC = 2.585RMSECV = 2.6755RMSEP = 4.0273Calibration Bias = 0CV Bias = -0.018797Prediction Bias = 1.345
EMSC + 1st Derivative + Mean Centering
RMSEC: 2.6
RMSECV: 2.7
RMSEP: 3.3 4.8
EMSC = Extended Multiplicative Scatter Correction
Tabulated Results
RMSEC RMSECV Val 1 Val 2 Test 1 Test 2
Norris Regression 2.7 2.7 2.8 2.8 3.0 3.3
Expert-Selected
Preprocessing2.6 2.7 3.3 4.8 2.8 4.2
Good Model… but Bad Transfer
-
3/26/2015
4
Norris Regression – GenericallyNon-linear Regression
(Gap-Segment 1st Derivative)
(Peak Normalization)
(Peak Normalization with
variable- gap 1st derivative)
Binary Encoding of Norris Equations
• Example for 5 variables: [ x1 x2 x3 x4 x5 ]
xi
0 0 0 0 0 0 1 0 0 0 1 0 0 0 10 0 1 0 0 0 0 0 0 0 0 0 0 1 0
xi
x1
xi
x2
xi
x3
xi
x4
xi
x5
x2
x2
x3x1
x3
x5
x3
x4
x5
This much could be done by pre-computing…
but at a big memory cost
(525MB for shootout data)
-
3/26/2015
5
+ Allow Subtraction…
• Example for 5 variables: [ x1 x2 x3 x4 x5 ]
• One additional group to identify "baseline"
xi
0 0 0 0 0 0 1 0 0 0 1 0 0 0 10 0 1 0 0 0 0 0 0 0 0 0 0 1 0
xi
x1
xi
x2
xi
x3
xi
x4
xi
x5
0 1 0 0 0
-xi
x3 –x2x1 –x2
x3 –x2
x5 –x2
x3 –x2
x4 –x2
x5 –x2
Pre-computation would now require
2 x10201 variables (for the shootout data)
variables = 2n (n2+n)
1050 1100 1150 1200 1250 13002
3
4
5
Variables
Me
an
1050 1100 1150 1200 1250 13002
3
4
5
Variables
Mean
+ Binning to Reduce Dimensionality
5-fold variable bin Similar effect to use of
smoothing in derivatives
650 Variables = 422,500 possible ratios (many quite boring)
130 variables = 16,900 possible ratios
-
3/26/2015
6
+ Genetic Algorithm to Select Terms
• Try lots of combinations (Calculate variable
ratios and offsets on-the-fly)
• Choose best cross-validated results
• Breed (intermix terms) and repeat
• Will refer to this as "GA-Norris"
• Question: Can this approach approximate
what the interactive Norris approach does?
Tabulated Results
RMSEC RMSECV Val 1 Val 2 Test 1 Test 2
Norris Regression 2.7 2.7 2.8 2.8 3.0 3.3
Expert-Selected
Preprocessing2.6 2.7 3.3 4.8 2.8 4.2
GA Norris (Cal 1 only) 2.4 2.5 3.9 5.0 2.8 3.7
GA Norris (Cal 1 & 2) 2.8 2.9 3.0 3.0 3.0 3.3
Simple GA (Cal 1 & 2) 2.6 2.7 3.7 3.8 3.3 3.5
Selecting Variables based on both
instruments (building model from ONE)
yields GA Norris preprocessing which closely
approximates what Karl Norris did.
-
3/26/2015
7
0 5 10 15 20 250
0.2
0.4
0.6
0.8
1
1.2x 10
-3
Hotelling T^2 (97.97%)
Samples/Scores Plot of c & Test
0 2 4 6 8 100
2
4
6
8
10
12
Hotelling T^2 (84.10%)
Q R
esid
uals
(15.9
0%
)
GA-Norris Model Expert Preprocessing Model
Outlier Detection Achieved…
Where Else Would Ratios Help?
• Raman – correcting for throughput differences
and offsets
• LIBS – correcting for throughput differences
and for emphasizing the importance of
"relative abundance"
-
3/26/2015
8
Raman of Octene in Toluene
15
1000 1100 1200 1300 1400 1500 1600 17000
2
4
6
8
10
12
Raman Shift (cm-1)
• Raman spectra measured on 36 solutions of Octene in
Toluene (3 replicates of 12 concentrations)
• Calibration set for on-line monitoring of polymerization
process feed line (octene is comonomer).
• Little interference or other artifacts in calibration data
• EXPECT: throughput errors and spectral shifts
• Calibrate with 24 samples, Validate with 9
Y Predicted Octene (w/w fraction)
-
3/26/2015
9
RM
SE
P
0
0.05
0.1
0.15
0.2
Au
tosc
ale
No
rma
lize
No
rma
lize
(10
00
cm
-1)
1st
De
riva
tive
Wh
itta
ker
Ba
seli
ne
Ba
seli
ne
+
No
rma
lize
GA
No
rris
(0.89) (0.50)
Prediction Error Vs. Interferences
Autoscale Scale variables to unit standard deviation
Normalize Divide by total intensity
Normalize (1000 cm-1) Divide by intensity at 1000 cm-1 peak
1st Derivative Savitzky-Golay 1st Derivative (15 point)
Whittaker Baseline Automatic baseline subtraction
GA Norris Binning + GA Norris Variable Selection + Ratios
(All methods also include mean centering)
RM
SE
P
0
0.05
0.1
0.15
0.2
No
rma
lize
(10
00
cm
-1)
(0.89) (0.50)
As Measured
w/ 0.1 cm-1 Axis Shift
* Throughput Errors
+ Baseline Errors
Prediction Error Vs. Interferences
-
3/26/2015
10
RM
SE
P
0
0.05
0.1
0.15
0.2
Au
tosc
ale
No
rma
lize
No
rma
lize
(10
00
cm
-1)
1st
De
riva
tive
Wh
itta
ker
Ba
seli
ne
Ba
seli
ne
+
No
rma
lize
GA
No
rris
(0.89) (0.50)
As Measured
w/ 0.1 cm-1 Axis Shift
* Throughput Errors
+ Baseline Errors
Prediction Error Vs. Interferences
Q Residuals (0.00%)
Outlier Detection: Baseline+Norm
???
Good predictions but Bad Outlier Status
-
3/26/2015
11
Q Residuals (0.73%)
Outlier Detection: GA Norris
LIBS / Raman Classification
• Mystery classes (natural product, difficult to
separate classes)
• Raman data – not much information
• LIBS data – too much information
• Anticipate Peak Ratios should help greatly in
LIBS!
• Try GA Norris on LIBS
-
3/26/2015
12
Error Rate
Fewer Selected Peaks
1000 Selection Results
Better
than
using all
variables
Worse
than
using all
variables
Example GA Norris Results
Error Rate
Error Rate
Error Rate
Error Rate
All = 17%
Best = 8%
All = 9%
Best = 2%
All = 19%
Best = 0%
All = 8%
Best = 3%
All Classes
-
3/26/2015
13
With Randomized Classes (Red)
Non-linear model
+ variable selection
+ large domain
= large chance of over-fit
= use caution & permutation tests
-
3/26/2015
14
Conclusions
• GA Norris can reproduce Norris Regression
results
• Can be used to achieve similar results to
standard preprocessing (but with less sound
decisions!)
• Large chance of over-fit = use caution &
permutation tests, or standard methods!!