new data analysis in gcxgc - tecnometrix · 2014. 7. 2. · welcome to the magic world of...
TRANSCRIPT
Data analysis in GCxGCData analysis in GCxGC
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Gabriel Vivó TruyolsAnalytical-chemistry group
Van ‘t Hoff Institute for Molecular SciencesUniversity of Amsterdam
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
IntroductionFrom 1D chromatography to 2D chromatography: what does change?
GCxGC, Riva - 2014
2 4 6 8 10 12 14 16 18Time, min
0x100
4x106
8x106
1x107
Inte
nsity
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
The complexity of the data changesInstruments can be classified according to the order of the tensor of data used to represent a single experiment:
Zero-order instruments Produce Zero-order tensor
(e.g. a number) Example pH - meter
First-order instruments Produce First-order tensor
(e.g. a vector) Example UV-VIS spectrometer
Second-order instruments Produce Second-order tensor
(e.g. a matrix) Example LC-MS, GCxGC
Third-order instrument Produce Third-order tensor
(e.g. a “cube” of data) Example GCxGC-MS
nth-order instrument exist, but they are rare
The changes from 1D to 2D
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
The concept of “peak vicinity” changes
Two-dimensional
1tR2 t R
One-dimensional
tR
Peak of interest
Peak of interest
Neighbors Neighbors
S. Peters, G. Vivó-Truyols, P. Marriott, P.J. Schoenmakers, J. Chromatogr. A. 1146 (2007), 232-241.
The changes from 1D to 2D
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
The concept of “peak resolution” is different
A
B
One-dimensional
tR
Two-dimensional
1tR2 t R
f
g
Valley-to-peak ratio(between A and B) g
fP gfP
AB?
Valley-to-peak ratio(between A and B)?
S. Peters, G. Vivó-Truyols, P. Marriott, P.J. Schoenmakers, J. Chromatogr. A. 1146 (2007), 232-241.GCxGC, Riva - 2014
The changes from 1D to 2D
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
The concept of “peak” changes
One-dimensional
tR
Two-dimensional
The changes from 1D to 2D
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Do the main steps in data processing change?
Ste
p 2
Pre-process
Ste
p 3
MeasureS
tep
1View
• Base-line correction
• Noise filtering• Spike filtering• Alignment• … etc.
• Peak detection / integration
• Calibration• Deconvolution• Pattern
recognition
The changes from 1D to 2D
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Do the main steps in data processing change?
Ste
p 2
Pre-process
Ste
p 3
MeasureS
tep
1View
• Base-line correction
• Noise filtering• Spike filtering• Alignment• … etc.
• Peak detection / integration
• Calibration• Deconvolution• Pattern
recognition• Class separation
The changes from 1D to 2D
GCxGC, Riva - 2014
Basically, the steps are the same, but the algorithms used for each step may be different.
Univariate
Multivariate
• Folding• Phasing
First step: visualizationFirst step: visualization
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
GCxGC, Riva - 2014
Ste
p 2
Pre-process
Ste
p 3
Measure
Ste
p 1
View
• Base-line correction
• Noise filtering• Spike filtering• Alignment• … etc.
• Peak detection / integration
• Calibration• Deconvolution• Pattern
recognition• Class separation
• Folding• Phasing
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
VisualisationRaw data in 2D chromatography
0 10 20 30Time, min
-101234
Abso
rban
ce, A
U
Chromatogram of Glycine preparation (254nm)(data courtesy of University of Valencia)
Two-dimensional chromatographyGCxGC chromatogram of diesel (FID detector)First dimension: non-polar; Second dimension: polar
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
“Folding” the chromatogram Visualisation
GCxGC, Riva - 2014
Mod time (8 seconds)
27002710
27202730
2740
02
46
80
2000
4000
6000
8000
10000
1tR, s2tR, s
FID
sig
nal
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
“Folding” the chromatogram Visualisation
27002720
2740
02
46
80
5000
10000
1tR, s2tR, s
FID
sig
nal
2700 2710 2720 2730 27400
1
2
3
4
5
6
7
8
1tR, s
2 t R, s
0
2000
4000
6000
8000
10000
Mod time (8 seconds)
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
“Folding” the chromatogram: phasing Visualisation
Mod time (8 seconds)
27002720
2740
46
8100
5000
10000
1tR, s2tR, s
FID
sig
nal
2700 2710 2720 2730 2740
3
4
5
6
7
8
9
10
1tR, s
2 t R, s
0
2000
4000
6000
8000
10000
Phase = 0.3
2700 2710 2720 2730 27404
5
6
7
8
9
10
11
12
1tR, s
2 t R, s
0
2000
4000
6000
8000
10000
27002720
2740
46
810
120
5000
10000
1tR, s2tR, s
FID
sig
nal
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
“Folding” the chromatogram: phasing Visualisation
Mod time (8 seconds)
Phase = 0.5
2700 2710 2720 2730 27406
7
8
9
10
11
12
13
14
1tR, s
2 t R, s
0
2000
4000
6000
8000
10000
27002720
2740
68
1012
140
5000
10000
1tR, s2tR, s
FID
sig
nal
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
“Folding” the chromatogram: phasing Visualisation
Mod time (8 seconds)
Phase = 0.75
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Cylindrical coordinates. An alternative way to represent the data.
Visualisation
J.J.A.M. Weusten, E.P.P.A. Derks , J.H.M. Mommers, S. van der Wal, Anal. Chim. Acta 726 (2012), 9
1tR
=2tR
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Interpolation Visualisation
27002720
2740
46
8100
5000
10000
1tR, s2tR, s
FID
sig
nal
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
+ =
Welcome to the magic world of chemometrics!
Interpolation Visualisation
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
“Folding” the chromatogram: final result Visualisation
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Conclusions Visualisation
• Visualising is simple, and gives a lot of information.
• Folding (one-dimensional) data into (2D) image introduces discontinuities in the edges. Other visualization methods (cylindrical coordinates) possibloe.
• Phasing can be of great help.
• Careful with “cosmetic” effects!
GCxGC, Riva - 2014
Second step: Pre-processingSecond step: Pre-processing
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
GCxGC, Riva - 2014
Ste
p 2
Pre-process
Ste
p 3
Measure
Ste
p 1
View
• Base-line correction
• Noise filtering• Spike filtering• Alignment• … etc.
• Peak detection / integration
• Calibration• Deconvolution• Pattern
recognition• Class separation
• Folding• Phasing
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pre-processingTypical problems: base-line drifts and noise
0 10 20 30Time, min
-101234
Abso
rban
ce, A
U
0 10 20 30Time, min
-0.15-0.1
-0.050
0.050.1
Abso
rban
ce, A
U
Base-line drifts Noise
20 20.2 20.4 20.6 20.8 21Time, min
-0.04-0.02
00.020.040.06
Abso
rban
ce, A
U
GCxGC, Riva - 2014
0 10 20 30Time, min
-0.2
-0.1
0
0.1
0.2
Abso
rban
ce, A
U
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pre-processingBase-line drifts.
0 10 20 30Time, min
-0.15-0.1
-0.050
0.050.1
Abso
rban
ce, A
U
Original chromatogram
Corrected chromatogr.
Fitted base-line correction
GCxGC, Riva - 2014
Weighted least squares fitting
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pre-processing
GCxGC, Riva - 2014
Originalchromatogram
Base-line drifts.
Weighted least squares fitting
Fitted base-line correction
Correctedchromatogram
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pre-processing
GCxGC, Riva - 2014
Originalchromatogram
Fitted base-line correction
Correctedchromatogram
Other (more sophisticated) options:
- Use splines- Base-line correction coupled to peak detection- Fourier-transform based approaches- Wavelet-based approaches
Base-line drifts.
Weighted least squares fitting
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pre-processing
GCxGC, Riva - 2014 S.E. Reichenbach, M. Ni, D. Zhang, E.B. Ledford Jr., J. Chromatogr. A, 985 (2003) 47 - 56
Base-line is reached at the (half) upper part
Base-line is reached at the (half) bottom part
Base-line drifts.
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pre-processing
GCxGC, Riva - 2014 S.E. Reichenbach, M. Ni, D. Zhang, E.B. Ledford Jr., J. Chromatogr. A, 985 (2003) 47 - 56
Consider the positions with the smallest values in each half
Estimate local background parameters using neighboring
values
Interpolate the main background trend and
substract it
Base-line drifts.
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pre-processingNoise removal. Smoothing and derivatives.
Savitzky-Golay filter is the most common method
Two parameters should be optmisized
• Window size• Polynomial degree
These parameters govern how much correlated noise is
removed
• Large window sizes and low polynomial degree
Too much noise is removed (chromatograms appear deformed)
• Small window sizes and large polynomial degrees Too much noise remains
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pre-processingNoise removal. Smoothing and derivatives.
Window size = 11Polynomial = 2
Window size = 41Polynomial = 2
Window size = 251Polynomial = 2
Originalchromatogram
Correctedchromatogram
G. Vivó-Truyols, P.J. Schoenmakers, "Automatic selection of optimal Savitzky-Golay smoothing”, Anal. Chem. 78 (2006) 4598-4608.GCxGC, Riva - 2014
FID
sig
nal
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pre-processingNoise removal. Spikes.
A good way of removing spikes consists of passing a median filter(before the Savitsky-Golay filter)
Savitzky-Golay filter
Median filterBase-line correction
Original data
Parameter to tune: window
size
Parameters to tune: window size and
polynomial degreeOptimising three parameters
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pre-processingAlignment.
Two types of alignment
Between-chromatogram alignment
Between-modulation alignment
Alignment is not always necessary, depending on the final objective of the analysis
Rarely doneUsing 2D
techniques(folded data)
Using 1D techniques
(unfolded data)
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
15 different (but related) chromatograms
13.5 13.6 13.7 13.8Time, min
Alignment
Pre-processingAlignment. COW (1D)
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pre-processingAlignment. Using (truly) 2D algorithms
S. Castillo, I. Mattila, J. Miettinen, M. Orešič, T.Hyötyläinen, Anal. Chem. 83 ( 2011) 3058–3067
Score alignment in GCxGC-MS
D. Zhang, X. Huang, F.E. Regnier,M. Zhang, Anal. Chem., 80 (2008) 2664–2671
COW-adapted GCxGC-MS (using
single channel)
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Conclusions Pre-processing
• Pre-processing methods are almost the same: one-dimensional = two-dimensional. Normally done in the (pre-folded) raw data.
• Every case needs a particular solution (it always exists, but some care should be taken!)
GCxGC, Riva - 2014
Third step: measureThird step: measure
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
GCxGC, Riva - 2014
Ste
p 2
Pre-process
Ste
p 3
Measure
Ste
p 1
View
• Base-line correction
• Noise filtering• Spike filtering• Alignment• … etc.
• Peak detection / integration
• Calibration• Deconvolution• Pattern
recognition• Class separation
• Folding• Phasing
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Peak detection in one step: the watershed algorithm Peak detection
Most common software programs use the watershed algorithm to detect peaks in 2D chromatography:
J. De Bock et al., doi 10.1007/11558484
1 2 3 4
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
??? ?Single catchmentbasin?
G. Vivó-Truyols, H.G. Janssen, J. Chromatogr. A, doi:10.1016/j.chroma.2009.12.063
Peak detection in one step: the watershed algorithm
GCxGC, Riva - 2014
Peak detection
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
The first problem using the watershed algorithm.
Peak detection in one step: the watershed algorithm
GCxGC, Riva - 2014
Peak detection
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Critical variability in second-dimension retention time:
2
2 1
pq
tP critfail
22 1
pq
tP critfail
mp
2
1
m
p
2
1
1mq 1mq
q=0.5 (eight cuts per peak)
q=1 (four cuts per peak)
q=2 (two cuts per peak)
q=4 (1 cut per peak)0 0.04 0.08 0.12 0.16 0.2
tR,crit, min
0
0.2
0.4
0.6
0.8
1
Pfa
il
p=5p=20
p=2.5
p=10
Current instruments
Current chromatographic practice
G. Vivó-Truyols, H.G. Janssen, J. Chromatogr. A, doi:10.1016/j.chroma.2009.12.063
Peak detectionin one step: the watershed algorithm
GCxGC, Riva - 2014
Peak detection
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
S. Peters, G. Vivó-Truyols, P.J. Marriott and P.J. Schoenmakers, J. Chromatogr. A 1156 (2007) 14.E.J.C. van der Klift, G. Vivó-Truyols, F.W. Claassen, F.L. van Holthoon, T.A. van Beek, J. Chromatogr. A, 1178 (2008) 43.
Time, arbitrary units (AU)
Ste
p
1 Detect peaks as in one-dimensional chromatography
Use information from derivatives (pre-processing
step)
Peak detectioni in two steps. Peak detection
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Ste
p
2 Merge peaks that belong to the same compound according to 2nd-dimension retention time differences
50100
150
0
5
100
500
1000
1500
4 4.5 5 5.5 60
500
1000
1500
2tR, AU
T: Tolerance criterion
70 90 110 1304.5
5
5.5
1tR, AU
2 t R, A
U
GCxGC, Riva - 2014
Peak detectioni in two steps. Peak detection
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Ste
p
2 Merge peaks that belong to the same compound according to 2nd-dimension retention time differences
50100
150
0
5
100
500
1000
1500
70 90 110 1304.5
5
5.5
1tR, AU
2 t R, A
U
4 4.5 5 5.5 60
500
1000
1500
2tR, AU
D
D>T ?
GCxGC, Riva - 2014
Peak detectioni in two steps. Peak detection
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Ste
p
3 Check unimodality
50100
150
0
5
100
500
1000
1500
4 4.5 5 5.5 60
500
1000
1500
2tR, AU
70 90 110 1304.5
5
5.5
1tR, AU
2 t R, A
U
GCxGC, Riva - 2014
Peak detectioni in two steps. Peak detection
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Conclusions Peak detection
• Two methods available: (inverse) watershed, and two-step peak detection process.
• Peak detection seems to be still a subject of discussion.
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Deconvolution mehtods. Deconvolution
GCxGC, Riva - 2014
Deconvolution methods (always with MS)
Using 1D (unfolded) data
Use truly (folded) 2D data
• ALS or rank annihilation methods• Does not need between-
modulation alignment• Similar to AMDIS
• PARAFAC (needs between-modulation alignment)
• PARAFAC2 (more robust against between-modulation alignment)
Main problem: determine the number of components behind the peak cluster
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Deconvolution mehtods. Deconvolution
GCxGC, Riva - 2014A.E. Sinha, J.L. Hope, B.J. Prazen, C.G. Fraga, E.J. Nilsson, R.E.
Synovec, J. Chromatogr. A, 1056 (2004) 145 - 154
An example of PARAFAC
Third step: measureThird step: measure
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
GCxGC, Riva - 2014
Ste
p 2
Pre-process
Ste
p 3
Measure
Ste
p 1
View
• Base-line correction
• Noise filtering• Spike filtering• Alignment• … etc.
• Peak detection / integration
• Calibration• Deconvolution• Pattern
recognition• Class separation
• Folding• Phasing
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pattern recognition as a variable reductionS
tep
1 Obtain chromatogram(s)
Ste
p
2 Peak detection
Ste
p
3 Pattern recognition
Ste
p
4 Unknown sample characterized
HPLC-MS/MS GCxGC-MS
GC-MS
Base-line correction Alignment
Peak detection
Digital variables
Chemicalvariables
Process/biologicalvariables
From digital variables (chromatogram) to process variables
Raw data
Features
Healthy/sick
Pattern recognition
GCxGC, Riva - 2014
Information
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Is peak detection necessary?
Chromatographic data
Using raw data directly
Ana
lysi
s of
flam
e ac
cele
rant
s us
ing
GC
xGC
Source ASource B
… for example:
Pattern recognition
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pattern recognition as a variable reductionS
tep
1 Obtain chromatogram(s)
Ste
p
2 Peak detection
Ste
p
3 Pattern recognition
Ste
p
4 Unknown sample characterized
HPLC-MS/MS GCxGC-MS
GC-MS
Base-line correction Alignment
Peak detection
Digital variables
Chemicalvariables
Process/biologicalvariables
From digital variables (chromatogram) to process variables
Raw data
Features
Healthy/sick
Pattern recognition
GCxGC, Riva - 2014
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pattern recognition in GCxGC. The options Pattern recognition
GCxGC, Riva - 2014
Pattern recognition in GCxGC
Using raw data Using peak tables
• Alignment is critical• Less chance to miss important
compounds• Normally done with the unfolded
(raw) data, but not always (e.g. N-PLS)
• Alignment not important, but peak tracking is essential (normally MS should be present)
• Chance to miss important compounds (close to the S/N)
• (Truly) 1D method
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pattern recognition in GCxGC. Supervised methods. Pattern recognition
GCxGC, Riva - 2014
Any method will be prone to overfitting
In supervised pattern recognition of GCxGC, a tremendous reduction of variables is performed (form millions to a few tens/hundreds)
Any variable pre-reduction (e.g. using Fisher ratios) should be done within a cross-validation loop
Otherwise the results will be optimistic (a method that seems to work, when in fact it only works for that data)
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pattern recognition in GCxGC. Example of a wrong strategy Pattern recognition
GCxGC, Riva - 2014
Ste
p 2Variable
selection: Fisher ratio on the raw
data Ste
p 3Supervised
pattern recognition: PLS-DA to
separate sick from healthyS
tep
1Obtain GCxGCchromatograms for sick (50) and
healthy (50) Ste
p 4Consider the
coefficients from PLS-DA as indicators of
potential metabolites
Objective: discovering metabolites responsible for cancer tumor
??? ?Aren’t you Overfitting? No, I’ve been cross-
validating the PLS-DA
… but the variable pre-selection has been done with the full data set!!
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Pattern recognition in GCxGC. Example of a wrong strategy Pattern recognition
GCxGC, Riva - 2014
Ste
p 2Variable
selection: Fisher ratio on the raw
data Ste
p 3Supervised
pattern recognition: PLS-DA to
separate sick from healthyS
tep
1Obtain GCxGCchromatograms for sick (50) and
healthy (50) Ste
p 4Consider the
coefficients from PLS-DA as indicators of
potential metabolites
Objective: discovering metabolites responsible for cancer tumor
??? ?Aren’t you Overfitting? No, I’ve been cross-
validating the variable seletion and the PLS-DA
correct
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Orig
inal
dat
a se
t
Ste
p 2Withdraw
subset 1, fit the model with the rest, and test
the model with subset 1 S
tep
3
Do the same for the
other k sectionsS
tep
1Divide the original data
set in k subsets
(randomly) Ste
p 4
Sum up the error of the model in all
validation sets
1
2
…
k
Cal
ibra
tion
Val
Cal
ibra
tion
Val
Cal
Test the model here
Test the model here
Cal
Val
Cal
Pattern recognition in GCxGC. Example of a correct strategy Pattern recognition
“model” = “variable pre-selection (Fisher ratio) + PLS-DA”
Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam
Conclusions Multivariate methods
• Deconvolution: normally done with the unfolded data (less problems with between-modulation alignment)
• Deconvolution: problem to establish the number of compounds (normally done in a manual way)
• Two ways for pattern recognition: with raw data (normally preferred) or with peak table.
• Careful with validation of supervised pattern recognition. Variable pre-selection should be included in the validation loop.
GCxGC, Riva - 2014