new data analysis in gcxgc - tecnometrix · 2014. 7. 2. · welcome to the magic world of...

Data analysis in GCxGC

Gabriel Vivó Truyols
Analytical-chemistry group
Van ‘t Hoff Institute for Molecular Sciences, University of Amsterdam

[email protected]


Introduction. From 1D chromatography to 2D chromatography: what changes?

GCxGC, Riva - 2014

[Figure: one-dimensional chromatogram; intensity vs. time (min)]


The complexity of the data changes

Instruments can be classified according to the order of the tensor of data used to represent a single experiment:

• Zero-order instruments produce a zero-order tensor (e.g. a number). Example: pH meter.
• First-order instruments produce a first-order tensor (e.g. a vector). Example: UV-VIS spectrometer.
• Second-order instruments produce a second-order tensor (e.g. a matrix). Examples: LC-MS, GCxGC.
• Third-order instruments produce a third-order tensor (e.g. a “cube” of data). Example: GCxGC-MS.

nth-order instruments exist, but they are rare.
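The classification by tensor order can be made concrete with array shapes. The sketch below is only an illustration: all sizes and variable names are hypothetical and are not taken from the slides.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_wavelengths = 200   # points in a UV-VIS spectrum
n_modulations = 50    # first-dimension points in a GCxGC run
n_points_2d = 80      # second-dimension points per modulation
n_mz = 10             # mass channels in a GCxGC-MS run

ph_reading = np.float64(7.2)                                  # zero-order: a number
uv_vis_spectrum = np.zeros(n_wavelengths)                     # first-order: a vector
gcxgc_run = np.zeros((n_modulations, n_points_2d))            # second-order: a matrix
gcxgc_ms_run = np.zeros((n_modulations, n_points_2d, n_mz))   # third-order: a cube

# The tensor order is the number of array axes.
orders = [np.ndim(x) for x in (ph_reading, uv_vis_spectrum, gcxgc_run, gcxgc_ms_run)]
print(orders)
```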

The changes from 1D to 2D



The concept of “peak vicinity” changes

[Figure: the peak of interest and its neighbors in a one-dimensional chromatogram (tR) and in a two-dimensional chromatogram (1tR, 2tR)]

S. Peters, G. Vivó-Truyols, P. Marriott, P.J. Schoenmakers, J. Chromatogr. A. 1146 (2007), 232-241.


The concept of “peak resolution” is different

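[Figure: two neighboring peaks, A and B, in a one-dimensional chromatogram (tR) and in a two-dimensional chromatogram (1tR, 2tR)]

One-dimensional: the valley-to-peak ratio P between A and B is obtained from the distances f and g marked in the figure.

Two-dimensional: how should the valley-to-peak ratio between A and B be defined?

S. Peters, G. Vivó-Truyols, P. Marriott, P.J. Schoenmakers, J. Chromatogr. A. 1146 (2007), 232-241.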



The concept of “peak” changes

[Figure: a peak in a one-dimensional chromatogram (tR) and in a two-dimensional chromatogram]


Do the main steps in data processing change?

Step 1: View
• Folding
• Phasing

Step 2: Pre-process
• Base-line correction
• Noise filtering
• Spike filtering
• Alignment
• … etc.

Step 3: Measure
• Univariate: peak detection / integration, calibration
• Multivariate: deconvolution, pattern recognition, class separation

Basically, the steps are the same, but the algorithms used for each step may be different.

First step: visualization








Visualisation. Raw data in 2D chromatography

[Figure: chromatogram of a glycine preparation (254 nm); absorbance (AU) vs. time (min). Data courtesy of University of Valencia.]

[Figure: two-dimensional chromatography. GCxGC chromatogram of diesel (FID detector). First dimension: non-polar; second dimension: polar.]


Visualisation. “Folding” the chromatogram

[Figure: the raw FID signal is cut into segments of one modulation time (8 seconds) and the segments are placed side by side; FID signal vs. 1tR (s) and 2tR (s), shown as a surface and as a two-dimensional map]
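Folding can be sketched as a reshape of the raw detector trace. The example below is a minimal illustration: the 8-second modulation time comes from the slide, while the 100 Hz sampling rate, the function name `fold` and the synthetic signal are assumptions made only for this example.

```python
import numpy as np

def fold(signal, sampling_rate_hz, modulation_time_s):
    """Fold a 1D detector trace into a (modulations x points-per-modulation) matrix."""
    points_per_mod = int(round(sampling_rate_hz * modulation_time_s))
    n_mod = len(signal) // points_per_mod
    # Discard the incomplete last modulation, if any.
    trimmed = np.asarray(signal)[: n_mod * points_per_mod]
    return trimmed.reshape(n_mod, points_per_mod)

# Synthetic trace: 5 modulations of 8 s sampled at a hypothetical 100 Hz.
raw = np.arange(100 * 8 * 5, dtype=float)
folded = fold(raw, sampling_rate_hz=100, modulation_time_s=8)
print(folded.shape)
```

Each row of `folded` is one second-dimension chromatogram; the row index maps to 1tR and the column index to 2tR.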



Visualisation. “Folding” the chromatogram: phasing

[Figures: the same folded chromatogram (FID signal vs. 1tR, s, and 2tR, s), shown as a surface and as a two-dimensional map, for Phase = 0.3, Phase = 0.5 and Phase = 0.75; the window displayed on the 2tR axis shifts with the phase]
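One way to read the phase is as the fraction of a modulation period by which the cutting point is moved before folding. The sketch below follows that reading, which is an assumption; the function name `fold_with_phase` and the toy signal are hypothetical.

```python
import numpy as np

def fold_with_phase(signal, points_per_mod, phase):
    """Fold a 1D trace after moving the cutting point by `phase` modulation periods."""
    shift = int(round(phase * points_per_mod)) % points_per_mod
    shifted = np.asarray(signal)[shift:]          # drop the points before the new cut
    n_mod = len(shifted) // points_per_mod
    folded = shifted[: n_mod * points_per_mod].reshape(n_mod, points_per_mod)
    t2_axis = shift + np.arange(points_per_mod)   # second-dimension axis, in points
    return folded, t2_axis

# Toy trace: 4 modulations of 8 points, with a peak straddling the first cut.
signal = np.zeros(32)
signal[7] = signal[8] = 1.0

folded_0, axis_0 = fold_with_phase(signal, points_per_mod=8, phase=0.0)
folded_half, axis_half = fold_with_phase(signal, points_per_mod=8, phase=0.5)

rows_0 = np.nonzero(folded_0.any(axis=1))[0]        # peak split over two modulations
rows_half = np.nonzero(folded_half.any(axis=1))[0]  # peak kept in one modulation
print(rows_0, rows_half)
```

With phase 0 the toy peak is wrapped around the edge of the image; with phase 0.5 it sits inside a single modulation, which is the effect the phased plots illustrate.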


Visualisation. Cylindrical coordinates. An alternative way to represent the data.

[Figure: chromatogram drawn on a cylinder, with 1tR along the axis and the angle given by 2tR]

J.J.A.M. Weusten, E.P.P.A. Derks, J.H.M. Mommers, S. van der Wal, Anal. Chim. Acta 726 (2012), 9


Visualisation. Interpolation

[Figure: folded chromatogram (FID signal vs. 1tR, s, and 2tR, s) used to illustrate interpolation]

Welcome to the magic world of chemometrics!


Visualisation. “Folding” the chromatogram: final result

Conclusions: Visualisation

• Visualising is simple and gives a lot of information.

• Folding (one-dimensional) data into a (2D) image introduces discontinuities at the edges. Other visualisation methods (cylindrical coordinates) are possible.

• Phasing can be of great help.

• Be careful with “cosmetic” effects!

Second step: Pre-processing








Pre-processing. Typical problems: base-line drifts and noise

[Figures: chromatograms (absorbance, AU, vs. time, min) illustrating base-line drifts over the full run (0 to 30 min) and noise in a zoomed window (20 to 21 min)]


Pre-processing. Base-line drifts.

[Figure: original chromatogram, fitted base-line correction and corrected chromatogram; absorbance (AU) vs. time (min)]

Weighted least squares fitting
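The slide names the technique but not the weighting scheme. The sketch below is one possible reading, not the author's method: a low-degree polynomial is fitted repeatedly, and points lying above the current fit (likely peaks) receive a small weight. The function name `wls_baseline`, the polynomial degree, the weights and the synthetic chromatogram are all assumptions.

```python
import numpy as np

def wls_baseline(t, y, degree=2, n_iter=20, peak_weight=1e-3):
    """Estimate a base-line by iteratively re-weighted polynomial least squares."""
    w = np.ones_like(y, dtype=float)
    for _ in range(n_iter):
        coeffs = np.polyfit(t, y, degree, w=w)   # weighted least squares fit
        baseline = np.polyval(coeffs, t)
        # Points above the fit are treated as peaks and down-weighted.
        w = np.where(y > baseline, peak_weight, 1.0)
    return baseline

# Synthetic example: linear drift plus one Gaussian peak (noise-free).
t = np.linspace(0.0, 30.0, 601)
drift = 0.05 + 0.004 * t
peak = np.exp(-0.5 * ((t - 15.0) / 0.3) ** 2)
y = drift + peak

baseline = wls_baseline(t, y)
corrected = y - baseline
```

On noisy data this asymmetric weighting pulls the fit towards the lower envelope of the signal, so the value of `peak_weight` would need tuning.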


Pre-processing. Base-line drifts.

[Figure: original chromatogram and corrected chromatogram]


Pre-processing. Base-line drifts.

Other (more sophisticated) options:

- Use splines
- Base-line correction coupled to peak detection
- Fourier-transform based approaches
- Wavelet-based approaches



Pre-processing. Base-line drifts.

[Figure annotations: “Base-line is reached at the (half) upper part”; “Base-line is reached at the (half) bottom part”]

S.E. Reichenbach, M. Ni, D. Zhang, E.B. Ledford Jr., J. Chromatogr. A, 985 (2003) 47-56


Pre-processing. Base-line drifts.

• Consider the positions with the smallest values in each half.
• Estimate local background parameters using neighboring values.
• Interpolate the main background trend and subtract it.

S.E. Reichenbach, M. Ni, D. Zhang, E.B. Ledford Jr., J. Chromatogr. A, 985 (2003) 47-56


Pre-processing. Noise removal. Smoothing and derivatives.

The Savitzky-Golay filter is the most common method.

Two parameters should be optimised:

• Window size
• Polynomial degree

These parameters govern how much correlated noise is removed:

• Large window sizes and low polynomial degrees: too much noise is removed (chromatograms appear deformed).
• Small window sizes and large polynomial degrees: too much noise remains.
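The role of the two parameters can be seen in a small NumPy-only sketch of a Savitzky-Golay filter: a polynomial of the chosen degree is fitted by least squares inside each window, and the fitted value (or derivative) at the window centre is kept. The function names, the edge padding and the synthetic signal are assumptions made for this example; the defaults (window 11, degree 2) are the values quoted elsewhere in this deck.

```python
from math import factorial

import numpy as np

def savgol_coeffs(window, degree, deriv=0):
    """Filter weights for the centre point of an odd-length window (unit spacing)."""
    half = window // 2
    x = np.arange(-half, half + 1)
    A = np.vander(x, degree + 1, increasing=True)   # columns x^0 ... x^degree
    # Row `deriv` of the pseudo-inverse gives the least-squares estimate of the
    # coefficient of x^deriv; times deriv! it is the derivative at the centre.
    return np.linalg.pinv(A)[deriv] * factorial(deriv)

def savgol_smooth(y, window=11, degree=2, deriv=0):
    """Smooth (or differentiate) y; the ends are padded by repeating edge values."""
    c = savgol_coeffs(window, degree, deriv)
    padded = np.pad(np.asarray(y, dtype=float), window // 2, mode="edge")
    # np.convolve flips its kernel, so flip c to obtain a sliding dot product.
    return np.convolve(padded, c[::-1], mode="valid")

rng = np.random.default_rng(0)
t = np.linspace(0.0, 8.0, 401)
noisy = np.exp(-0.5 * ((t - 4.0) / 0.2) ** 2) + rng.normal(0.0, 0.02, t.size)
smoothed = savgol_smooth(noisy, window=11, degree=2)
```

Increasing `window` or lowering `degree` removes more noise but flattens narrow peaks, which is the trade-off described above.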


Pre-processing. Noise removal. Smoothing and derivatives.

[Figure: FID signal smoothed with window size = 11 and polynomial degree = 2]

G. Vivó-Truyols, P.J. Schoenmakers, “Automatic selection of optimal Savitzky-Golay smoothing”, Anal. Chem. 78 (2006) 4598-4608.


Pre-processing. Noise removal. Spikes.

A good way of removing spikes is to apply a median filter (before the Savitzky-Golay filter).

Original data → Base-line correction → Median filter → Savitzky-Golay filter

• Median filter. Parameter to tune: window size.
• Savitzky-Golay filter. Parameters to tune: window size and polynomial degree.

In total, three parameters are optimised.
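The median-filter step can be sketched with NumPy alone. This is an illustration under assumptions: the function name `median_filter`, the window of 5 points and the synthetic peak with a single spike are all hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter(y, window=5):
    """Replace each point by the median of an odd-length window centred on it."""
    half = window // 2
    padded = np.pad(np.asarray(y, dtype=float), half, mode="edge")
    return np.median(sliding_window_view(padded, window), axis=-1)

# Synthetic peak with one spike on its rising flank.
idx = np.arange(100)
clean = np.exp(-0.5 * ((idx - 50) / 10.0) ** 2)
spiky = clean.copy()
spiky[20] += 50.0

despiked = median_filter(spiky, window=5)
print(spiky[20], despiked[20])
```

A polynomial smoother applied directly to `spiky` would spread the spike over the whole window, which is why the median filter comes first in the chain.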


Pre-processing. Alignment.

Alignment is not always necessary, depending on the final objective of the analysis.

Two types of alignment:

• Between-chromatogram alignment
  - Using 2D techniques (folded data)
  - Using 1D techniques (unfolded data)
• Between-modulation alignment (rarely done)


Pre-processing. Alignment. COW (1D)

[Figure: alignment of 15 different (but related) chromatograms; time axis 13.5 to 13.8 min]


Pre-processing. Alignment. Using (truly) 2D algorithms

• Score alignment in GCxGC-MS: S. Castillo, I. Mattila, J. Miettinen, M. Orešič, T. Hyötyläinen, Anal. Chem. 83 (2011) 3058–3067
• COW adapted to GCxGC-MS (using a single channel): D. Zhang, X. Huang, F.E. Regnier, M. Zhang, Anal. Chem. 80 (2008) 2664–2671


Conclusions: Pre-processing

• Pre-processing methods are almost the same in one and in two dimensions. They are normally applied to the raw (pre-folded) data.

• Every case needs a particular solution (one always exists, but some care should be taken!)

Third step: measure








Peak detection. Peak detection in one step: the watershed algorithm

Most common software programs use the watershed algorithm to detect peaks in 2D chromatography.

[Figure: the watershed algorithm illustrated in four stages (1 to 4)]

J. De Bock et al., doi 10.1007/11558484
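The idea behind the (inverse) watershed can be sketched with a simplified flooding scheme: pixels are visited from the highest to the lowest, and each one joins the region of its highest already-labelled neighbour, or starts a new region if it has none. This is a toy version written for illustration, not the algorithm of any particular software package; the function name, the threshold and the synthetic image are assumptions.

```python
import numpy as np

def inverse_watershed(image, threshold=0.0):
    """Label every pixel above `threshold` with the local maximum it climbs to."""
    n_rows, n_cols = image.shape
    labels = np.zeros(image.shape, dtype=int)      # 0 = background / unassigned
    order = np.argsort(image, axis=None)[::-1]     # flat indices, highest first
    next_label = 1
    for flat in order:
        r, c = divmod(int(flat), n_cols)
        if image[r, c] <= threshold:
            break                                  # all remaining pixels are lower
        best_label, best_value = 0, -np.inf
        for dr in (-1, 0, 1):                      # scan the 8 neighbours
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr or dc) and 0 <= rr < n_rows and 0 <= cc < n_cols:
                    if labels[rr, cc] and image[rr, cc] > best_value:
                        best_label, best_value = labels[rr, cc], image[rr, cc]
        if best_label:
            labels[r, c] = best_label              # join the highest labelled neighbour
        else:
            labels[r, c] = next_label              # no labelled neighbour: new apex
            next_label += 1
    return labels

# Synthetic folded chromatogram with two well-separated peaks.
rows, cols = np.mgrid[0:30, 0:30]
image = (np.exp(-((rows - 8) ** 2 + (cols - 8) ** 2) / (2 * 2.0 ** 2))
         + np.exp(-((rows - 20) ** 2 + (cols - 22) ** 2) / (2 * 2.0 ** 2)))
labels = inverse_watershed(image, threshold=0.01)
n_peaks = labels.max()
```

On noisy data every small local maximum would start its own region, so the signal is normally smoothed before a scheme of this kind is applied.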


Peak detection. Peak detection in one step: the watershed algorithm

[Figure: a peak spread over several modulations, raising the question: single catchment basin?]

G. Vivó-Truyols, H.G. Janssen, J. Chromatogr. A, doi:10.1016/j.chroma.2009.12.063


Peak detection. Peak detection in one step: the watershed algorithm

The first problem using the watershed algorithm.

Critical variability in second-dimension retention time:

[Equation and figure: probability of failure, Pfail, plotted against the critical variability in second-dimension retention time, tR,crit (0 to 0.2 min), for p = 2.5, 5, 10 and 20 and for q = 0.5 (eight cuts per peak), q = 1 (four cuts per peak), q = 2 (two cuts per peak) and q = 4 (one cut per peak). The regions corresponding to current instruments and to current chromatographic practice are marked.]

G. Vivó-Truyols, H.G. Janssen, J. Chromatogr. A, doi:10.1016/j.chroma.2009.12.063


Peak detection. Peak detection in two steps.

Step 1. Detect peaks as in one-dimensional chromatography. Use information from derivatives (pre-processing step).

[Figure: peaks detected in the one-dimensional signal; time in arbitrary units (AU)]

S. Peters, G. Vivó-Truyols, P.J. Marriott and P.J. Schoenmakers, J. Chromatogr. A 1156 (2007) 14.
E.J.C. van der Klift, G. Vivó-Truyols, F.W. Claassen, F.L. van Holthoon, T.A. van Beek, J. Chromatogr. A, 1178 (2008) 43.


Step 2. Merge peaks that belong to the same compound according to 2nd-dimension retention time differences.

[Figure: peaks detected in neighboring modulations, plotted against 1tR (AU) and 2tR (AU). The second-dimension retention time difference D is compared with the tolerance criterion T: D > T ?]



Step 3. Check unimodality.

[Figure: merged peak profiles plotted against 1tR (AU) and 2tR (AU)]
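Steps 2 and 3 can be sketched on a small peak table. This is a simplified illustration, not the published algorithm: the tuple layout (modulation, 2tR, height), the rule of comparing only consecutive modulations, the function names and the numbers are all assumptions.

```python
def merge_peaks(peaks, tolerance):
    """Group 1D-detected peaks from consecutive modulations whose 2tR differ by <= tolerance.

    peaks: list of (modulation_index, t2, height) tuples.
    """
    clusters = []
    for peak in sorted(peaks):
        mod, t2, _ = peak
        for cluster in clusters:
            last_mod, last_t2, _ = cluster[-1]
            # D <= T and adjacent modulation: same compound.
            if mod - last_mod == 1 and abs(t2 - last_t2) <= tolerance:
                cluster.append(peak)
                break
        else:
            clusters.append([peak])    # D > T for every cluster: start a new one
    return clusters

def is_unimodal(heights):
    """True if the values rise to a single maximum and then fall."""
    apex = heights.index(max(heights))
    rising = all(a <= b for a, b in zip(heights[:apex], heights[1:apex + 1]))
    falling = all(a >= b for a, b in zip(heights[apex:], heights[apex + 1:]))
    return rising and falling

# Hypothetical peak table: (modulation, 2tR, height).
peaks = [(10, 5.00, 100), (11, 5.02, 400), (12, 5.01, 150), (11, 3.00, 80), (20, 5.00, 50)]
clusters = merge_peaks(peaks, tolerance=0.05)
profile = [height for _, _, height in clusters[0]]
print(len(clusters), is_unimodal(profile))
```

A cluster that fails the unimodality check would be a candidate for splitting into two compounds.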



Conclusions: Peak detection

• Two methods are available: the (inverse) watershed algorithm and the two-step peak detection process.

• Peak detection is still a subject of discussion.


Deconvolution. Deconvolution methods.

Deconvolution methods (always with MS)

Using 1D (unfolded) data
• ALS or rank annihilation methods
• Does not need between-modulation alignment
• Similar to AMDIS

Using truly 2D (folded) data
• PARAFAC (needs between-modulation alignment)
• PARAFAC2 (more robust against between-modulation misalignment)

Main problem: determining the number of components behind the peak cluster
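The ALS option can be sketched as follows: the unfolded data matrix D (time x m/z) is modelled as C Sᵀ, and elution profiles C and spectra S are updated in turn by least squares with negative values clipped to zero. This is a minimal sketch under assumptions: the number of components is taken as known (the main problem named above), the initial spectra are taken from two time points assumed to be nearly pure, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def als_deconvolve(D, S0, n_iter=50):
    """Alternating least squares for D ≈ C @ S.T with non-negativity by clipping."""
    S = np.array(S0, dtype=float)                       # (n_channels, n_components)
    for _ in range(n_iter):
        C = np.linalg.lstsq(S, D.T, rcond=None)[0].T    # profiles, given spectra
        C = np.clip(C, 0.0, None)
        S = np.linalg.lstsq(C, D, rcond=None)[0].T      # spectra, given profiles
        S = np.clip(S, 0.0, None)
    return C, S

# Synthetic cluster: two overlapping elution profiles with different spectra.
t = np.arange(100)
c1 = np.exp(-0.5 * ((t - 40) / 6.0) ** 2)
c2 = 0.7 * np.exp(-0.5 * ((t - 55) / 6.0) ** 2)
s1 = np.array([1.0, 0.2, 0.0, 0.5, 0.1, 0.0])
s2 = np.array([0.0, 0.6, 1.0, 0.1, 0.0, 0.3])
D = np.outer(c1, s1) + np.outer(c2, s2)

# Initial spectra: rows of D at two time points assumed to be nearly pure.
S0 = np.column_stack([D[25], D[70]])
C, S = als_deconvolve(D, S0)
relative_residual = np.linalg.norm(D - C @ S.T) / np.linalg.norm(D)
```

With real data the result depends strongly on the initial estimates and on the chosen number of components, so this should be read as an outline of the idea only.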


Deconvolution. Deconvolution methods.

An example of PARAFAC

A.E. Sinha, J.L. Hope, B.J. Prazen, C.G. Fraga, E.J. Nilsson, R.E. Synovec, J. Chromatogr. A, 1056 (2004) 145-154








Pattern recognition. Pattern recognition as a variable reduction

From digital variables (chromatogram) to process variables:

Step 1. Obtain chromatogram(s) (HPLC-MS/MS, GC-MS, GCxGC-MS). Digital variables: raw data.
Step 2. Peak detection (base-line correction, alignment, peak detection). Chemical variables: features.
Step 3. Pattern recognition. Process/biological variables: healthy/sick.
Step 4. Unknown sample characterized. Information.


Pattern recognition. Is peak detection necessary?

Chromatographic data can be analysed using the raw data directly, for example:

[Figure: analysis of flame accelerants using GCxGC; Source A and Source B]




Pattern recognition. Pattern recognition in GCxGC. The options

Using raw data
• Alignment is critical
• Less chance to miss important compounds
• Normally done with the unfolded (raw) data, but not always (e.g. N-PLS)

Using peak tables
• Alignment not important, but peak tracking is essential (normally MS should be present)
• Chance to miss important compounds (close to the S/N)
• (Truly) 1D method


Pattern recognition. Pattern recognition in GCxGC. Supervised methods.

In supervised pattern recognition of GCxGC, a tremendous reduction of variables is performed (from millions to a few tens/hundreds).

Any method will be prone to overfitting.

Any variable pre-reduction (e.g. using Fisher ratios) should be done within a cross-validation loop. Otherwise the results will be optimistic (a method that seems to work, when in fact it only works for that data).


Pattern recognition. Pattern recognition in GCxGC. Example of a wrong strategy

Objective: discovering metabolites responsible for cancer tumor

Step 1. Obtain GCxGC chromatograms for sick (50) and healthy (50).
Step 2. Variable selection: Fisher ratio on the raw data.
Step 3. Supervised pattern recognition: PLS-DA to separate sick from healthy.
Step 4. Consider the coefficients from PLS-DA as indicators of potential metabolites.

“Aren’t you overfitting?”
“No, I’ve been cross-validating the PLS-DA.”
… but the variable pre-selection has been done with the full data set!!


“Aren’t you overfitting?”
“No, I’ve been cross-validating the variable selection and the PLS-DA.”
Correct.


Pattern recognition. Pattern recognition in GCxGC. Example of a correct strategy

“model” = “variable pre-selection (Fisher ratio) + PLS-DA”

Step 1. Divide the original data set into k subsets (randomly).
Step 2. Withdraw subset 1, fit the model with the rest, and test the model with subset 1.
Step 3. Do the same for the other k − 1 subsets.
Step 4. Sum up the error of the model over all validation sets.

[Figure: the original data set split into subsets 1, 2, …, k; in each round one subset is used for validation (“Test the model here”) and the remaining subsets for calibration]
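The difference between the wrong and the correct strategy can be reproduced on pure noise, where no classifier should do better than chance. The sketch below is an illustration under assumptions: a nearest-centroid classifier stands in for PLS-DA to keep the example dependency-free, and the data sizes, the number of retained variables and all function names are hypothetical.

```python
import numpy as np

def fisher_ratio(X, y):
    """Two-class Fisher ratio per variable: squared mean difference over summed variances."""
    a, b = X[y == 0], X[y == 1]
    return (a.mean(0) - b.mean(0)) ** 2 / (a.var(0, ddof=1) + b.var(0, ddof=1))

def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (not PLS-DA): assign each sample to the closer class mean."""
    c0 = X_train[y_train == 0].mean(0)
    c1 = X_train[y_train == 1].mean(0)
    d0 = ((X_test - c0) ** 2).sum(1)
    d1 = ((X_test - c1) ** 2).sum(1)
    return (d1 < d0).astype(int)

def cv_accuracy(X, y, k, n_keep, select_inside, rng):
    """k-fold cross-validation with variable selection inside or outside the loop."""
    folds = np.array_split(rng.permutation(len(y)), k)
    if not select_inside:
        keep_all = np.argsort(fisher_ratio(X, y))[-n_keep:]   # uses the test samples too
    correct = 0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        if select_inside:
            keep = np.argsort(fisher_ratio(X[train], y[train]))[-n_keep:]
        else:
            keep = keep_all
        pred = nearest_centroid_predict(X[train][:, keep], y[train], X[test_idx][:, keep])
        correct += int((pred == y[test_idx]).sum())
    return correct / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5000))    # pure noise: no real class difference
y = np.repeat([0, 1], 20)
wrong = cv_accuracy(X, y, k=5, n_keep=20, select_inside=False, rng=rng)
right = cv_accuracy(X, y, k=5, n_keep=20, select_inside=True, rng=rng)
print(wrong, right)
```

With selection done on the full data set the cross-validated accuracy looks good even though the data contain no information; with selection repeated inside each fold it stays near chance level.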


Conclusions: Multivariate methods

• Deconvolution: normally done with the unfolded data (fewer problems with between-modulation alignment).

• Deconvolution: establishing the number of compounds is a problem (normally done manually).

• Two ways for pattern recognition: with raw data (normally preferred) or with a peak table.

• Be careful with the validation of supervised pattern recognition. Variable pre-selection should be included in the validation loop.
