
Chemometrics and Intelligent Laboratory Systems 81 (2006) 209–217
www.elsevier.com/locate/chemolab

Determination of number of significant components and key variables using genetic algorithms in liquid chromatography-nuclear magnetic resonance spectroscopy and liquid chromatography-diode array detection

Mohammad Wasim, Richard G. Brereton ⁎

School of Chemistry, University of Bristol, Cantock’s Close, Bristol BS8 1TS, United Kingdom

Received 22 February 2005; received in revised form 16 December 2005; accepted 16 December 2005
Available online 8 February 2006

Abstract

A new method is proposed for determination of the number of significant components in hyphenated chromatographic data. The method is based on the application of genetic algorithms (GA). The method is applied to five datasets from on-flow liquid chromatography with nuclear magnetic resonance spectroscopy (LC-NMR) and to three datasets from liquid chromatography with diode array detector (LC-DAD). The effects of different factors, such as the application of the technique in the time or spectral direction, noise level, baseline, chromatographic resolution and relative peak height, have been studied. The method is compared with other chemometric approaches for rank analysis and variable selection. The results produced by GA are promising on both types of data.

© 2006 Published by Elsevier B.V.

Keywords: Coupled chromatography; LC-NMR; Genetic algorithms; Rank analysis

1. Introduction

The introduction of hyphenated techniques such as liquid chromatography with diode array detector (LC-DAD), liquid chromatography with mass spectrometry (LC-MS) and so on provides large quantities of data. It is very difficult for an analytical chemist to analyse this amount of data without mathematical tools. In the analysis of coupled chromatographic data, the first step usually involves identifying the number of compounds in the data, often called rank analysis. There are several techniques in the literature for determining data rank or assessing peak purity, most based on multivariate approaches. Sanchez et al. [1] have categorised these methods into two types: (a) methods based on comparison of chromatograms and (b) methods based on comparison of spectra. The first category includes ratiograms [2] and evolving principal component innovation analysis (EPCIA) [3]. In the second category are approaches such as methods based on Gram–Schmidt orthogonalisation [4], evolving factor analysis (EFA) [5], fixed size window evolving factor analysis (FSW-EFA) [6] and correlation and derivative plots [7]. There are also several methods that do not fit in either category or belong to both [8]. Some methods provide key variables [9] along with the data rank. On the basis of the output generated, these methods can be categorised into three groups: (a) methods that provide only the rank, (b) methods that provide only key variables and (c) methods that provide both rank and key variables. In the first category are methods based on eigenvalue analysis such as error indicator functions developed by Malinowski [10], residual percent variance [11], the Exner function [12], the F-test [13], cross-validation [14], EFA, FSW-EFA and EPCIA. The second category includes methods that do not calculate rank but provide key variables; techniques such as the simplified Borgen method (SBM) [15], needle search [16] and PC-plots [17] belong to this category. The third category comprises the orthogonal projection approach [1,18] and SIMPLISMA [19].

⁎ Corresponding author. Tel.: +44 117 9287658; fax: +44 117 9251295.
E-mail address: [email protected] (R.G. Brereton).

0169-7439/$ - see front matter © 2006 Published by Elsevier B.V.
doi:10.1016/j.chemolab.2005.12.006

Genetic algorithms (GA) were introduced by John Holland [20] as an evolutionary problem solving method that follows the lines of the natural evolution process. GA have been used in various fields for search and optimisation. GA have several distinct advantages over classical gradient-based optimisation approaches, the most prominent being their ability to find a global solution and their independence from the initial guess. They have been used in chemistry in many applications such as parameter optimisation [21–23], calibration [24–29], quantitative structure activity relationships (QSAR) [30] and classification [31].

The present paper describes a new approach based on genetic algorithms for the selection of variables in coupled chromatography. By selecting variables, it is possible to determine which spectroscopic or chromatographic variables correspond to each pure component in a mixture, and by determining the number of key variables, it is possible to estimate the data rank. The methods in this paper are applied to two common forms of chromatographic data, LC-DAD and LC-NMR (liquid chromatography-nuclear magnetic resonance) [32–34], and will benefit curve resolution approaches that depend on finding pure or key variables. They are compared to alternative multivariate methods for finding key variables.

2. Experimental

Two different types of datasets are analysed: (i) experimental and (ii) simulated. A list of the compounds used to create the real and simulated datasets is presented in Table 1. The simulated datasets are created by using previously recorded real spectra and modelling the chromatographic peak shapes.

2.1. Experimental data

The sum of the chromatographic profiles for each of the eight experimental datasets (LC-NMR and LC-DAD) analysed in this paper is presented in Fig. 1. The number of compounds cannot be read clearly from these plots, except in the case of data6 and data7.

Table 1. Names of the compounds, labels used in the text, and suppliers

| Compound label | Compound name | Purity and supplier |
|---|---|---|
| I | Phenyl acetate | 99%, Aldrich Chemical, Milwaukee, WI, USA |
| II | Methyl p-toluenesulphonate | 98%, Lancaster, Morecambe, UK |
| III | Methyl benzoate | 99%, Lancaster |
| IV | 2,6-dihydroxynaphthalene | 98%, Avocado Research Chemicals Ltd., Heysham, UK |
| V | 1,3-dihydroxynaphthalene | 99%, Lancaster |
| VI | 2,3-dihydroxynaphthalene | 99%, Aldrich, Steinheim, Germany |
| VII | 1,6-dihydroxynaphthalene | 98%, Acros Organics, Geel, Belgium, UK |
| VIII | 1,2-diethoxybenzene | 98%, Lancaster |
| IX | 1,4-diethoxybenzene | 98%, Lancaster |
| X | 1,3-diethoxybenzene | 95%, Lancaster |
| XI | Diethyl maleate | 98%, Avocado |
| XII | Diethyl fumarate | 97%, Avocado |
| XIII | Nicotinic acid | 98–100%, Sigma, Steinheim, Germany |
| XIV | p-hydroxy benzoic acid | 99%, Sigma |
| XV | Benzoic acid | 99.5%, Sigma |
| XVI | Pyridine-3,5-dicarboxylic acid | 98%, Lancaster |
| XVII | Anthracene | 99%, Sigma |
| XVIII | Pyrene | 98%, Aldrich |
| XIX | 1,2:5,6-Dibenzanthracene | 97%, Aldrich |

2.1.1. LC-NMR

Five mixtures were created by using different compounds.

Data1: Three-component mixture of I, II and III, 125 mM each.

Data2: Three-component mixture of IV (50 mM), V (100 mM) and VI (50 mM).

Data3: Four-component mixture of IV, V, VI and VII, 100 mM each.

Data4: Seven-component mixture of II, III, IV, VII, VIII, IX and X, 50 mM each.

Data5: Eight-component mixture of IV, VI, VII, VIII, IX, X, XI and XII, 50 mM each.

Further experimental details are provided elsewhere [8,9,32–34].

2.1.2. LC-DAD

Three datasets were created.

Data6: Three-component mixture of IV (0.5 μM), V (1 μM) and VI (0.5 μM).

Data7: Four-component mixture of XIII (36.6 μM), XIV (18.6 μM), XV (50 μM) and XVI (25.6 μM).

Data8: Eight-component mixture of IV, VI, VII, VIII, IX, X, XI and XII, 3 μM each.

2.2. Simulated data

Various datasets, each containing three compounds, were produced to represent LC-NMR and LC-DAD.

2.2.1. LC-NMR datasets

Several three-component mixtures consisting of IV, V and VI were created. Details about modelling chromatographic peaks and bilinear data are presented in Section 2.2.3. Pure NMR spectra were acquired using the same experimental conditions as stated for data2 in Section 2.1.1.

2.2.2. LC-DAD datasets

Several three-component mixtures consisting of XVII (11.1 μM), XVIII (9.7 μM) and XIX (7.0 μM) were created. Details about modelling chromatographic peaks and bilinear data are presented in Section 2.2.3. Pure UV–vis spectra were acquired using a UV–vis spectrometer (Pharmacia LKB Biochrom Ultrospec III, Pharmacia, Uppsala, Sweden) at a resolution of 1 nm.

Fig. 1. Sum of all chromatographic profiles in all datasets (panels Data1 to Data8; horizontal axis: time (min)).

2.2.3. Simulation

The chromatographic direction was simulated by using a polynomially modified Gaussian (PMG) model [35]. The PMG with two peak shape parameters is defined as

$$a(t) = A \exp\left( -\frac{0.5\,(t - t_r)^2}{\left(\sigma_0 + \sigma_1 (t - t_r)\right)^2} \right) \qquad (1)$$

where $a(t)$ is the peak intensity at time $t$, $t_r$ is the retention time, $A$ is the peak height at the maximum and $\sigma_0$ and $\sigma_1$ are peak shape parameters. In the modelling, the following parameters were employed.

LC-NMR data: Each chromatographic profile was modelled over 200 points in time with a resolution of 1 s and 989 frequencies in the range 6.4–8.1 ppm with a resolution of 0.002 ppm. The peak shape parameters were $\sigma_0 = 10$ and $\sigma_1 = 0.1$.

LC-DAD data: Each chromatographic profile was modelled over 98 points in time with a resolution of 0.16 s and 151 wavelengths in the range 200–350 nm with a resolution of 1 nm. The peak shape parameters were $\sigma_0 = 0.07$ and $\sigma_1 = 0.1$.

Peak intensities at the maximum ($A$) and the retention times ($t_r$) were varied from dataset to dataset and are described in Section 4. Bilinear data matrices were created by

$$\mathbf{X} = \mathbf{C}\mathbf{S} \qquad (2)$$

where X is the data matrix, C is the matrix of concentration profiles and S is the matrix of spectral profiles. Further changes were made to the data in retention time, noise level, peak heights and baseline, which will be explained in the Results section.
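The simulation in Eqs. (1) and (2) can be sketched in a few lines. The authors' programs were written in MATLAB; the Python below is only an illustration, and the retention times, peak heights and the tiny four-channel "spectra" are invented placeholders, not values from the paper.

```python
import math

def pmg_peak(times, height, t_r, sigma0, sigma1):
    """Polynomially modified Gaussian profile, Eq. (1)."""
    profile = []
    for t in times:
        width = sigma0 + sigma1 * (t - t_r)
        if width == 0.0:
            # The model is undefined where the width vanishes; use zero intensity there
            profile.append(0.0)
        else:
            profile.append(height * math.exp(-0.5 * (t - t_r) ** 2 / width ** 2))
    return profile

def bilinear(C, S):
    """X = C S, Eq. (2), for C (M x K) and S (K x N) given as nested lists."""
    M, K, N = len(C), len(S), len(S[0])
    return [[sum(C[m][k] * S[k][n] for k in range(K)) for n in range(N)]
            for m in range(M)]

# 200 time points at 1 s resolution with sigma0 = 10, sigma1 = 0.1 (LC-NMR settings)
times = [float(t) for t in range(200)]
# Hypothetical heights and retention times for three compounds
peaks = [pmg_peak(times, h, tr, 10.0, 0.1)
         for h, tr in [(1.0, 80.0), (0.5, 100.0), (1.0, 120.0)]]
C = [[peaks[k][m] for k in range(3)] for m in range(200)]  # M x K
# Hypothetical toy spectra (K x N); real spectra would have hundreds of channels
S = [[1.0, 0.0, 0.2, 0.0],
     [0.0, 1.0, 0.3, 0.0],
     [0.0, 0.0, 0.1, 1.0]]
X = bilinear(C, S)
```

Each row of `X` is then a mixture spectrum at one elution time, matching the M×N layout used in Section 3.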

2.3. Software

The data analysis was performed with computer programs written in MATLAB by the authors of this paper, except for the pre-processing of LC-NMR data, which was carried out with in-house software called ‘LCNMR’ [34].

3. Data analysis

The two-way data matrix is denoted by X (M×N), with M rows corresponding to spectra at a given elution time and N columns representing chromatographic profiles at a given wavelength or spectral frequency; m is used as the row index and n as the column index. K is used for the total number of compounds and k as its index.

3.1. Data pre-processing

As experimental LC-NMR data were obtained in the time domain as free induction decays (FIDs), these were apodised, Fourier transformed and phase corrected. In order to remove errors due to quadrature detection, which result in a regular oscillation of intensity, a moving average of each frequency over every four points in time was performed in the chromatographic direction [36]. No pre-processing was performed on the experimental LC-DAD data or the simulated datasets.
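A four-point moving average along the time direction might look like the sketch below. The paper does not say how the first and last time points are handled, so dropping the incomplete windows is an assumption made here, and `moving_average_time` is a hypothetical name.

```python
def moving_average_time(X, window=4):
    """Average each column (frequency) over `window` consecutive rows (time points).

    Assumption: only complete windows are kept, so the output has
    len(X) - window + 1 rows.
    """
    M, N = len(X), len(X[0])
    smoothed = []
    for m in range(M - window + 1):
        smoothed.append([sum(X[m + j][n] for j in range(window)) / window
                         for n in range(N)])
    return smoothed

# A regular two-point oscillation (0, 4, 0, 4, ...) is flattened to its mean
oscillating = [[0.0], [4.0], [0.0], [4.0], [0.0]]
flattened = moving_average_time(oscillating)
```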

3.2. Genetic algorithms

GA methods are based on the concept of “chromosomes”. These consist of a set of variables, which may, for example, be a series of chromatographic readings or spectral wavelengths; their length depends on how many variables it is desired to search. Variables can be coded in a variety of ways. In this paper we use binary encoding, in which a chromosome consists of concatenated strings of 0s and 1s, one string per variable; the alternative, real-value encoding, stores real numbers.

GA start with a set of possible solutions, called a “population”, in contrast to classical gradient-based methods, which start with one guess. The size of the population chosen is a trade-off between computation time and the desired accuracy of the results. Once a random population is created, each solution is evaluated with an objective function and each chromosome is assigned a fitness value. A termination condition is then assessed. If the termination criterion is not satisfied, a new population is created from the current one, analogous to natural evolution, with the help of three operations applied iteratively: selection, crossover and mutation. The generation counter is incremented to indicate that one generation (or one iteration) of the GA has been completed. The methods stop when they reach a preset number of iterations, called the number of generations, or when another stopping condition is met. More details can be found in textbooks [37–39], tutorials and review articles [40–43].
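The loop described above can be sketched as follows. This is a generic illustration rather than the authors' MATLAB program: the toy objective (count of 1 bits), the function name `run_ga` and the per-chromosome application of the mutation probability are assumptions, while the population size, operator types and the retention of the two best chromosomes follow the values reported later in the paper.

```python
import random

def run_ga(objective, n_bits, rng, pop_size=20, generations=200,
           p_cross=0.9, p_mut=0.01, n_elite=2):
    """Binary GA: rank scaling, roulette wheel, single-point crossover, mutation."""
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scored = sorted(pop, key=objective)              # worst ... best
        history.append(objective(scored[-1]))
        # Rank-based fitness: worst gets 0, best gets 2
        weights = [2.0 * i / (pop_size - 1) for i in range(pop_size)]
        new_pop = [c[:] for c in scored[-n_elite:]]      # keep the best unchanged
        while len(new_pop) < pop_size:
            a, b = rng.choices(scored, weights=weights, k=2)   # roulette wheel
            a, b = a[:], b[:]
            if rng.random() < p_cross:                   # single-point crossover
                site = rng.randint(1, n_bits - 1)
                a, b = a[:site] + b[site:], b[:site] + a[site:]
            for child in (a, b):
                if rng.random() < p_mut:                 # flip one random bit
                    s = rng.randrange(n_bits)
                    child[s] = 1 - child[s]
                if len(new_pop) < pop_size:
                    new_pop.append(child)
        pop = new_pop
    best = max(pop, key=objective)
    history.append(objective(best))
    return best, history

best, history = run_ga(sum, 16, random.Random(0))
```

Because the best chromosomes are copied unchanged into each new generation, the best objective value recorded in `history` can never decrease.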

3.2.1. Objective function

The objective function in our study is similar to that used in OPA [1], with the difference that the vectors selected in this method are found by a GA search. The method starts with the identification of two variables (columns), which correspond to two profiles x1 and x2. If the number of variables in the original space is not a power of 2, a new search space of length $2^K - 1$ is obtained for the purpose of the GA (the exponent being the number of bits used to code one variable); the variables in the new space are mapped onto the nearest variable in the old space, as discussed in more detail in Section 3.2.2. A matrix Y is constructed as

$$\mathbf{Y} = [\mathbf{x}_1 \; \mathbf{x}_2] \qquad (3)$$

Using genetic algorithms, as described below, a search is performed to maximise the dissimilarity, denoted by $d^2$, between the two profiles using the function

$$d^2 = \det(\mathbf{Y}'\mathbf{Y}) \qquad (4)$$

An explanation of the dissimilarity measure is as follows. The determinant of a 2×2 matrix measures the area of the parallelogram formed by the column vectors of the matrix [44]; correspondingly, $d^2$, the determinant of the 2×2 matrix Y′Y, is the squared area of the parallelogram spanned by the unscaled vectors x1 and x2. The more dissimilar two spectra are, the greater the value of this determinant. Thus maximising $d^2$ will locate the two most dissimilar spectra. On adding new spectra to Y, the dissimilarity will increase until the number of spectra in Y becomes equal to the rank of the matrix. After that, the increase in dissimilarity will be smaller than in the previous steps, and this can be detected by taking the second difference of d. A similar trend should be observed if the vectors x1 and x2 are normalised to unit length; in that case the dissimilarity will decrease with the addition of more vectors. If two identical vectors are selected in the search, the value of the determinant becomes zero.
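As a small check of Eq. (4), the sketch below evaluates det(Y′Y) for a handful of candidate vectors in plain Python. The helper names are hypothetical, and the determinant routine is an ordinary Gaussian elimination, not something taken from the paper.

```python
def gram(columns):
    """Y'Y, where each entry of `columns` is one column vector of Y."""
    return [[sum(a * b for a, b in zip(u, v)) for v in columns] for u in columns]

def det(A):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in A]
    n, d = len(A), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        if A[p][i] == 0.0:
            return 0.0
        if p != i:
            A[i], A[p] = A[p], A[i]
            d = -d                      # a row swap flips the sign
        d *= A[i][i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= f * A[i][c]
    return d

def dissimilarity_sq(columns):
    """d^2 = det(Y'Y) for the selected vectors, Eq. (4)."""
    return det(gram(columns))

orthogonal = dissimilarity_sq([[3.0, 0.0], [0.0, 2.0]])  # area 6, so d^2 = 36
identical = dissimilarity_sq([[1.0, 2.0], [1.0, 2.0]])   # zero area
```

In a GA run, `dissimilarity_sq` would play the role of the objective function, with the chromosome deciding which columns of X are passed in.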

The GA search is then repeated for 3 to K+3 components, where K is a value that should ideally exceed the total number of components expected; the exact value is not especially important so long as it is quite large.

After searching for K+3 components, a second difference on the vector d is calculated by

$$SD_k = \log(d_{k-1}) - 2\log(d_k) + \log(d_{k+1}), \quad k = 3 \text{ to } K+3 \qquad (5)$$

The data rank is estimated as the value of k at which $SD_k$ has its minimum. The selected vectors [x1, x2, …, xk] provide the key variables. The method can be applied in the spectral as well as the time dimension.
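The rank rule of Eq. (5) is easy to try on made-up numbers. In the sketch below the dissimilarity values are invented to mimic growth that levels off after four components, base-10 logarithms are assumed (the base does not change where the minimum falls), and $SD_k$ is evaluated only where both neighbours of $d_k$ are available.

```python
import math

def estimate_rank(d, first_k=2):
    """d[i] is the dissimilarity found with first_k + i selected variables."""
    logs = [math.log10(v) for v in d]
    sd = {}
    for i in range(1, len(d) - 1):
        # Second difference of log(d), Eq. (5)
        sd[first_k + i] = logs[i - 1] - 2.0 * logs[i] + logs[i + 1]
    rank = min(sd, key=sd.get)
    return rank, sd

# Invented values for k = 2..7: large gains up to k = 4, small gains afterwards
d_values = [1e1, 1e2, 1e3, 1.1e3, 1.2e3, 1.25e3]
rank, sd = estimate_rank(d_values)
```

Here the sharpest drop in growth occurs at k = 4, so the estimated rank is 4.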

3.2.2. Coding

The objective of this study is to find the most dissimilar spectra or chromatograms present in the data. The variables are the indices of rows or columns, so the search space contains integers corresponding to the row or column numbers. The lower and upper bounds on the variables depend on whether the aim is to find the spectra or the chromatographic elution times that have the most dissimilar characteristics. If the search is performed in the chromatographic direction, the lower and upper bounds are 1 and M; otherwise they are 1 and N (spectral direction).

The coding applied for feature selection in multivariate calibration [26] differs from the coding system used in our study. In our study the aim is to find perhaps 4 or 5 key variables (each corresponding to a spectral frequency or elution time that best characterises a pure compound). In multivariate calibration the aim may be, for example, to find 20 or 30 frequencies that are best for calibration, so a different approach is necessary.

In our study, binary coding is used, where each chromosome is represented by a binary string b = (b1, b2, …, bK), where K represents the number of key variables. For the first iteration (2 key variables or components), the string b consists of two parts, b1 and b2. The first bits (b1) correspond to the first key variable, that is, a position (datapoint) in time or frequency in the coupled chromatogram. The variables are coded according to the number of datapoints in the dimension to be searched; for example, if there are 1024 possible spectral frequencies, b1 will have length 10 and will refer to x1 (see Section 3.2.1). If the number of datapoints is not a power of 2 (although in NMR the number of points in the frequency domain usually is), the number of bits is chosen so that the coded range exceeds the number of datapoints; for example, 10-bit coding will be employed if there are 1000 datapoints in one dimension. The translation between coded and true values involves finding the nearest true variable (in the original space, which is not of length $2^k$) that corresponds to its binary coded equivalent (in the new space between 0 and $2^k - 1$).

Hence the length of the chromosome depends on the number of key variables and the number of datapoints in the spectrum or chromatogram.
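One way to read the coding scheme is sketched below. The linear rescaling used to find the "nearest true variable" is an assumption, since the paper does not give the exact mapping, and the function names are hypothetical.

```python
def bits_per_variable(n_points):
    """Smallest number of bits whose coded range covers n_points values."""
    return max(1, (n_points - 1).bit_length())

def decode(chromosome, n_vars, n_points):
    """Split a bit list into n_vars genes and map each to an index in 1..n_points."""
    nbits = len(chromosome) // n_vars
    top = 2 ** nbits - 1
    indices = []
    for v in range(n_vars):
        gene = chromosome[v * nbits:(v + 1) * nbits]
        value = int("".join(str(b) for b in gene), 2)   # 0 .. 2^nbits - 1
        # Assumed mapping: rescale linearly and round to the nearest index
        indices.append(1 + round(value * (n_points - 1) / top))
    return indices

# Two key variables over 1000 datapoints need 2 x 10 bits
nbits = bits_per_variable(1000)
extremes = decode([0] * nbits + [1] * nbits, 2, 1000)
```

With this mapping the all-zero gene decodes to the first datapoint and the all-one gene to the last, so every coded value lands on a valid index.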

Table 2. Parameters used for GA

| Parameter | Value |
|---|---|
| No. of generations | 5000 |
| Population size | 20 |
| Selection probability (%) | 90 |
| Crossover probability (%) | 90 |
| Mutation probability (%) | 1 |
| Selection operator | Roulette Wheel |
| Crossover operator | Single point crossover |
| Mutation operator | Single point bit-wise mutation |

3.2.3. Selection

The purpose of the selection operator is to make multiple copies of good chromosomes and eliminate bad chromosomes, while keeping the population size constant. In this study the Roulette Wheel [37] method is applied, where selection is based on running the roulette wheel a number of times equal to the population size. Each time, a uniformly distributed random number is generated from the range [0,1] and compared with the cumulative fitness probability of each chromosome. The first chromosome whose cumulative probability is higher than the random number is selected. The probability of selection for each chromosome is calculated as

$$P_i = \frac{\text{fitness}_i}{\text{total fitness}} \qquad (6)$$

so the cumulative probability $q_i$ for each chromosome is calculated as

$$q_i = \sum_{j=1}^{i} P_j \qquad (7)$$

If fitness values are the same as objective function values, the solutions with high fitness values will dominate the mating pool (which is defined as the pool of chromosomes selected in the selection process). Normally, solutions are therefore scaled before selection; this process is called ranking. In our study, the objective function values are sorted from the worst (i = 0) to the best (i = I−1) and the fitness value of each solution is set to

$$\text{fitness}_i = \frac{2i}{I-1} \qquad (8)$$

where i is the population index and I is the total number of chromosomes in a population.
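Equations (6) to (8) translate into the following sketch of rank scaling followed by roulette wheel selection. Function names are hypothetical, and ties in the objective values are simply ranked in their order of appearance.

```python
import random

def rank_fitness(objective_values):
    """Eq. (8): fitness 2i/(I-1), with i = 0 for the worst and I-1 for the best."""
    I = len(objective_values)
    order = sorted(range(I), key=lambda j: objective_values[j])
    fitness = [0.0] * I
    for rank, j in enumerate(order):
        fitness[j] = 2.0 * rank / (I - 1)
    return fitness

def roulette_select(fitness, rng):
    """Spin the wheel once per population member; return the chosen indices."""
    total = sum(fitness)
    cumulative, running = [], 0.0
    for f in fitness:
        running += f / total            # P_i (Eq. 6) accumulated into q_i (Eq. 7)
        cumulative.append(running)
    chosen = []
    for _ in range(len(fitness)):
        r = rng.random()
        # First chromosome whose cumulative probability exceeds r
        idx = next((i for i, q in enumerate(cumulative) if q > r), len(fitness) - 1)
        chosen.append(idx)
    return chosen

fit = rank_fitness([5.0, 1.0, 3.0])
picks = roulette_select(fit, random.Random(0))
```

Under this scaling the worst chromosome has zero fitness and is never selected, while the best is twice as likely to be picked as the median one.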

The population size is a trade-off between calculation time and exploration; a very large population normally involves lengthy computing times. In this paper the population size used was 20.

3.2.4. Crossover

The crossover operator is mainly responsible for the creation of new chromosomes from old ones, thus increasing the probability of finding a suitable solution. At each step in the algorithm new chromosomes are created in order to search for better solutions. In binary coding, two suitable chromosomes are picked randomly and one or more crossover sites are located randomly. One or more parts on the right side of the crossover site are exchanged to create two new chromosomes. The number of chromosomes participating in the process depends on a user-defined crossover probability. The number of crossover sites selected in a chromosome can vary from one to the size of the chromosome.
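For the single-site case, the exchange of the right-hand parts can be illustrated as below. This is a minimal sketch; the function name and the handling of pairs that skip crossover (returned as unchanged copies) are assumptions.

```python
import random

def single_point_crossover(parent_a, parent_b, rng, p_cross=0.9):
    """Swap the bits to the right of one random site with probability p_cross."""
    if rng.random() >= p_cross or len(parent_a) < 2:
        return parent_a[:], parent_b[:]
    site = rng.randint(1, len(parent_a) - 1)   # never at the very ends
    child_a = parent_a[:site] + parent_b[site:]
    child_b = parent_b[:site] + parent_a[site:]
    return child_a, child_b

child_a, child_b = single_point_crossover([0] * 8, [1] * 8,
                                          random.Random(0), p_cross=1.0)
```

Starting from an all-zero and an all-one parent makes the crossover site visible directly in the children.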

3.2.5. Mutation

Mutation works in a manner similar to crossover by generating new chromosomes from the old ones, but it also creates diversity in the population. It helps to pull the solution out of local optima in the search space. Mutation involves selecting a mutation site randomly and changing 1 to 0 or vice versa at this site. Mutation also operates with a user-defined mutation probability.
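A single-site bit flip can be written as follows. Whether the mutation probability applies per chromosome or per bit is not stated in the text; the sketch assumes per chromosome, and `mutate` is a hypothetical name.

```python
import random

def mutate(chromosome, rng, p_mut=0.01):
    """With probability p_mut, flip the bit at one randomly chosen site."""
    child = chromosome[:]               # leave the parent untouched
    if rng.random() < p_mut:
        site = rng.randrange(len(child))
        child[site] = 1 - child[site]
    return child

parent = [0, 1, 0, 1, 1, 0]
mutant = mutate(parent, random.Random(0), p_mut=1.0)
```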

3.3. Key variable selection using multivariate methods

In order to assess the effectiveness of the GA methods in this paper, they are compared with four other approaches commonly employed in chemometrics, namely the Orthogonal Projection Approach (OPA) [1], SIMPLe-to-use Interactive Self-modelling Mixture Analysis (SIMPLISMA) [19], the Simplified Borgen Method (SBM) [15] and the Needle Search [16]. All the methods used in the comparison are described elsewhere and descriptions are omitted for brevity.

4. Results and discussions

The parameters used for the genetic algorithms are presented in Table 2. A selection probability of 90% was applied and the 2 best-fit chromosomes out of 20 were retained in each subsequent generation.

4.1. Rank analysis

4.1.1. Experimental data

Since variables can be selected in the time as well as the spectral domain, GAs were applied in both dimensions for comparison. The results reveal that the correct rank is determined only when variable selection is performed in the time direction. The results are presented as plots in Fig. 2; the minimum provides an estimate of the data rank. For each dataset the GA was applied three times, with different starting populations, and little or no variation was observed in the variables selected in the time direction, in contrast to the frequency or wavelength direction. The variables selected each time for data5 are presented in Table 3. It seems that there are fewer local minima in the space spanned by the time variables than in the space spanned by the frequency variables. Since the GA produced close results in all three runs on time variables in all datasets, repeat determinations are not reported further.

Rank analysis by GA is compared with the results obtained by OPA and SIMPLISMA. The offset parameter used by SIMPLISMA was 15% for the LC-NMR datasets and 5% for LC-DAD [45]. It is evident from Table 4 that GA produces better results than the two other methods. For SIMPLISMA a suitable value of the offset is required and sometimes the data need to be truncated at the start and at the end of the time axis to exclude variables with low signal to noise ratios. This kind of parameter adjustment or data truncation is not required for GA.

Fig. 2. Plots of second difference (SD(k)) against the number of components for Data1 to Data8; the minimum point in each plot (square with cross inside) shows the data rank.

4.1.1.1. Effect of pre-processing. Rank analysis was performed on the raw, mean-centred and standardised datasets. Mean centring was performed as

$${}^{mc}x_{mn} = x_{mn} - \bar{x}_n \qquad (9)$$

where $\bar{x}_n$ is the average of the nth column. The standardisation was performed as

$${}^{std}x_{mn} = \frac{x_{mn} - \bar{x}_n}{\sqrt{\sum_{m=1}^{M} (x_{mn} - \bar{x}_n)^2 / M}} \qquad (10)$$
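Column-wise mean centring and standardisation with the population standard deviation (division by M, as in Eq. (10)) can be sketched as below. The function names are hypothetical, and a constant column would cause a division by zero, which this sketch does not guard against.

```python
import math

def mean_centre(X):
    """Eq. (9): subtract each column mean."""
    M, N = len(X), len(X[0])
    means = [sum(X[m][n] for m in range(M)) / M for n in range(N)]
    return [[X[m][n] - means[n] for n in range(N)] for m in range(M)]

def standardise(X):
    """Eq. (10): mean-centre, then divide by the column standard deviation (divisor M)."""
    M, N = len(X), len(X[0])
    centred = mean_centre(X)
    sds = [math.sqrt(sum(centred[m][n] ** 2 for m in range(M)) / M)
           for n in range(N)]
    return [[centred[m][n] / sds[n] for n in range(N)] for m in range(M)]

X_demo = [[1.0, 10.0], [3.0, 30.0]]
centred_demo = mean_centre(X_demo)
standard_demo = standardise(X_demo)
```

After standardisation both columns of `X_demo` take the same values, which shows how the step removes differences in scale between variables, including variables that contain only noise.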

Since LC-NMR data contain many variables where only noise exists, good results from standardised data were not expected; LC-DAD data, however, benefit from mean centring and standardisation, as all variables in the datasets analysed contain significant intensity. The results indicated that for LC-NMR the correct rank was determined only by using raw or mean-centred data, whereas for LC-DAD the correct rank was obtained only with standardised data. The results presented in Fig. 2 were calculated by using the raw data for LC-NMR and standardised data for LC-DAD in the time dimension.

Table 3. Variable selection by the GA method for LC-NMR data5 containing 8 compounds (three runs in each direction)

| Variable | Time (min), run 1 | Time (min), run 2 | Time (min), run 3 | Frequency (ppm), run 1 | Frequency (ppm), run 2 | Frequency (ppm), run 3 |
|---|---|---|---|---|---|---|
| Variable 1 | 3.03 | 3.14 | 3.03 | 1.253 | 1.248 | 1.248 |
| Variable 2 | 3.61 | 3.66 | 3.61 | 1.277 | 1.281 | 1.281 |
| Variable 3 | 4.19 | 4.19 | 4.19 | 1.320 | 1.320 | 1.363 |
| Variable 4 | 4.67 | 4.72 | 4.67 | 1.340 | 1.340 | 2.452 |
| Variable 5 | 5.41 | 5.41 | 5.41 | 3.725 | 3.725 | 4.013 |
| Variable 6 | 6.36 | 6.36 | 6.36 | 4.055 | 6.941 | 6.844 |
| Variable 7 | 7.42 | 7.42 | 7.42 | 7.100 | 7.100 | 7.182 |
| Variable 8 | 7.89 | 7.89 | 7.89 | 7.182 | 7.182 | 7.570 |

4.1.2. Simulated data

The application of GA for rank determination on simulated data was performed in a different way. The effects of five factors, each at three levels, were estimated for both types of dataset. The LC-NMR results were compiled for raw data and the LC-DAD results for standardised data, and variable selection was performed on the time variables only.

4.1.2.1. Effect of different factors. Five factors, namely white noise, baseline, heteroscedastic noise, relative peak heights of concentration profiles and chromatographic resolution, were studied.

Normally distributed white noise was added to the raw data,the amount of noise was measured in percent of maximumsignal height at three levels. The noise levels for LC-NMR(1%, 2.5%, 5%) were kept higher than LC-DAD (0.1%, 0.5%,1%) to reflect the lower sensitivity of LC-NMR. Chromato-graphic peak resolution between two adjacent peaks wasapproximately 0.5 unless otherwise stated and relative peakheights at their maxima were 1 :0.5 :1. The results for allsimulated LC-DAD datasets were correct while GA determined

Page 7: Determination of number of significant components and key variables using genetic algorithms in liquid chromatography-nuclear magnetic resonance spectroscopy and liquid chromatography-diode

1234567

II OH

OH

1234567

III

H

O

H

1234567

IVCH3

CH3

O

O

S

1234567

V

O

O

VI O

O

CH

H

H

3

CH3

CH3

CH3

CH3

CH3

1234567

I

OH

OH

O

O

Table 4A comparison of rank estimation by three methods: GA, OPA and SIMPLISAfor LC-NMR data5 containing 8 compounds

Dataset Truerank

Estimated rank (time dimension)

GA OPA SIMPLISMA

Data1 3 3 4 3Data2 3 3 4 3Data3 4 4 4 4Data4 7 7 9 9Data5 8 8 10 N8Data6 3 3 3 3Data7 4 4 N4 4Data8 8 8 10 10

215M. Wasim, R.G. Brereton / Chemometrics and Intelligent Laboratory Systems 81 (2006) 209–217

the incorrect rank for LC-NMR simulated data at 2.5% and 5% noise levels.
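The white-noise step can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the percentage is read as the standard deviation of the noise relative to the maximum signal, and the function name and toy data are hypothetical.

```python
import numpy as np

def add_white_noise(X, percent, rng):
    """Add zero-mean normal noise with SD = percent % of the maximum signal."""
    sigma = (percent / 100.0) * X.max()   # assumed meaning of "percent"
    return X + rng.normal(0.0, sigma, size=X.shape)

rng = np.random.default_rng(1)
# Toy rank-1 data: one elution profile times one spectrum.
X = np.outer(np.hanning(100), np.linspace(0.2, 1.0, 20))

# Noise levels quoted in the text for each technique.
levels = {"LC-NMR": (1.0, 2.5, 5.0), "LC-DAD": (0.1, 0.5, 1.0)}
noisy = {(name, p): add_white_noise(X, p, rng)
         for name, ps in levels.items() for p in ps}
```

Each entry of `noisy` is one simulated dataset, for example `noisy[("LC-NMR", 5.0)]` for the highest LC-NMR noise level.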

To study the effect of baseline, a baseline was modelled as follows [1]:

$$ {}^{bl}x_{mn} = x_{mn} + \beta_1 + \frac{\beta_2}{M-1}\, t(m) \qquad (11) $$

where $^{bl}x_{mn}$ is the baseline-added signal value and β1 and β2 are constants. In our study both are equal to a multiple of the maximum signal height. Chromatographic resolution between two adjacent peaks was fixed at 0.5, relative peak heights at 1:0.5:1 and white noise at 1%. The three levels of baseline (0.1%, 0.5%, 1% of maximum signal height) are the same in LC-NMR and LC-DAD, and the results are correct for both datasets, which indicates that baseline at these levels does not affect the results.
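A minimal sketch of the baseline model of Eq. (11) follows. Two points are assumptions rather than statements from the text: t(m) is taken as the row index 0, 1, ..., M-1, so the baseline rises linearly from β1 in the first row to β1 + β2 in the last, and β1 = β2 is set to the stated percentage of the maximum signal. The function name and toy data are hypothetical.

```python
import numpy as np

def add_baseline(X, percent):
    """Add the linear baseline beta1 + beta2 / (M - 1) * t(m) to each column."""
    M = X.shape[0]
    beta = (percent / 100.0) * X.max()   # assumed: beta1 = beta2 = beta
    t = np.arange(M)                     # assumed: t(m) = 0, 1, ..., M-1
    drift = beta + beta / (M - 1) * t
    return X + drift[:, np.newaxis]      # same offset for all N columns

# Toy data: one Gaussian elution profile times a flat spectrum.
time_axis = np.linspace(0.0, 10.0, 101)
X = np.outer(np.exp(-0.5 * ((time_axis - 5.0) / 0.5) ** 2), np.ones(8))

# The three baseline levels quoted in the text.
X_bl = {p: add_baseline(X, p) for p in (0.1, 0.5, 1.0)}
```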

Heteroscedastic noise [1] was generated with a standard deviation equal to a multiple of the square root of the signal, as given by

$$ {}^{het}x_{mn} = x_{mn} + \beta_3 \sqrt{x_{mn}}\, \rho \qquad (12) $$

where $^{het}x_{mn}$ is the heteroscedastic-noise-added signal, ρ is a normally distributed random number and β3 is a number relating to the level of the noise. Three different levels (0.01%, 0.1%, 1% of maximum signal height) were implemented for LC-NMR and (0.01%, 0.05%, 0.1%) for LC-DAD. In LC-NMR the levels were higher than in LC-DAD, to simulate the lower sensitivity of LC-NMR. The fixed parameters were resolution between two

Table 5
A comparison of variable selection by four methods: GA, SBM, OPA and SIMPLISMA for LC-NMR data5 containing 8 compounds

Variable     GA (ppm)   SBM (ppm)   OPA (ppm)   SIMPLISMA (ppm)
Variable 1   1.253      1.255       1.253       1.241
Variable 2   1.277      1.279       1.279       1.279
Variable 3   1.320      1.322       1.323       1.322
Variable 4   1.340      1.340       1.340       1.363
Variable 5   3.725      1.359       1.357       2.453
Variable 6   4.055      2.452       2.452       6.840
Variable 7   7.100      7.099       7.099       7.099
Variable 8   7.182      7.184       7.182       7.182

chromatographic peaks 0.5, relative peak heights 1:0.5:1 and white noise 1%. The rank was predicted correctly in all cases.
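The heteroscedastic noise of Eq. (12) can be sketched as below. Assumptions not stated in the text: β3 is taken as the quoted percentage of the maximum signal, and negative intensities are clipped to zero before the square root. The function name and toy data are hypothetical.

```python
import numpy as np

def add_heteroscedastic_noise(X, percent, rng):
    """Return X + beta3 * sqrt(X) * rho, with rho drawn from N(0, 1)."""
    beta3 = (percent / 100.0) * X.max()       # assumed scaling of beta3
    rho = rng.standard_normal(X.shape)
    root = np.sqrt(np.clip(X, 0.0, None))     # guard: no sqrt of negatives
    return X + beta3 * root * rho

rng = np.random.default_rng(2)
# Toy rank-1 data; the first and last rows are zero.
X = np.outer(np.hanning(101), np.linspace(0.2, 1.0, 20))
X_het = add_heteroscedastic_noise(X, 1.0, rng)
```

Because the noise scales with the square root of the signal, it vanishes where the signal is zero and is largest at the peak maximum, unlike the white noise above.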

To determine the effect of relative peak heights, three levels (1:1:1, 1:0.5:1, 1:0.25:1 relative peak heights) were created by varying the height of the central peak. All levels were used at fixed parameters of resolution between two adjacent chromatographic peaks 0.5 and white noise 1%. The rank was predicted correctly for all datasets.

The last factor was chromatographic peak resolution. Three levels (0.35, 0.5, 0.7 resolution between two adjacent peaks) were applied for both instrumental methods at relative peak heights 1:0.5:1 and white noise 1%.

In this simulation GA determined the correct rank for all instrumental datasets except for LC-NMR when the white noise level was high (more than 1%).
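The peak-height and resolution factors can be illustrated with a small three-component simulation. This sketch makes several assumptions of its own: symmetric Gaussian peaks of equal width stand in for the peak shape used in the paper, resolution is defined as peak separation divided by 4σ, and the spectra are random numbers used only to form a bilinear matrix X = C S^T. All names and default values are hypothetical.

```python
import numpy as np

def elution_profiles(t, resolution, heights, sigma=0.2, t_first=4.0):
    """Equal-width Gaussian peaks; adjacent spacing = 4 * sigma * resolution."""
    spacing = 4.0 * sigma * resolution
    centres = t_first + spacing * np.arange(len(heights))
    C = np.exp(-0.5 * ((t[:, None] - centres[None, :]) / sigma) ** 2)
    return C * np.asarray(heights)[None, :]

t = np.linspace(0.0, 10.0, 200)

# Default condition in the text: resolution 0.5, heights 1:0.5:1.
C = elution_profiles(t, resolution=0.5, heights=(1.0, 0.5, 1.0))

rng = np.random.default_rng(3)
S = rng.random((40, 3))     # hypothetical spectra, random for illustration
X = C @ S.T                 # noise-free bilinear data, 200 x 40

rank = np.linalg.matrix_rank(X)
```

Without noise the numerical rank equals the number of components. The rank analysis problem studied in the paper arises once noise, baseline and poor resolution blur that distinction.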

Fig. 3. Pure LC-NMR spectra of all compounds in data5 (compounds I–VIII; horizontal axis: chemical shift (ppm)).


Table 6
A comparison of time variable selection by five methods: GA, SBM, OPA, SIMPLISMA and Needle Search for data5; row index is used instead of time for the purpose of comparison

Variable     GA    SBM   OPA   SIMPLISMA   Needle Search
Variable 1   58    57    58    57          58
Variable 2   69    66    67    70          69
Variable 3   80    83    84    83          80
Variable 4   89    91    91    91          89
Variable 5   103   101   102   102         104
Variable 6   121   122   121   121         121
Variable 7   141   142   144   145         141
Variable 8   150   156   151   164         150
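The row indices in Table 6 can be compared numerically. The sketch below sums the absolute differences between each method's selected rows and those of Needle Search. This summary measure is an illustrative choice, not one used in the paper.

```python
import numpy as np

# Row indices copied from Table 6.
table6 = {
    "GA":            [58, 69, 80, 89, 103, 121, 141, 150],
    "SBM":           [57, 66, 83, 91, 101, 122, 142, 156],
    "OPA":           [58, 67, 84, 91, 102, 121, 144, 151],
    "SIMPLISMA":     [57, 70, 83, 91, 102, 121, 145, 164],
    "Needle Search": [58, 69, 80, 89, 104, 121, 141, 150],
}

reference = np.array(table6["Needle Search"])

# Total absolute row-index difference from Needle Search, per method.
total_abs_diff = {
    method: int(np.abs(np.array(rows) - reference).sum())
    for method, rows in table6.items()
    if method != "Needle Search"
}
```

On this measure GA differs from Needle Search by a single row in total (variable 5), while the totals for OPA, SBM and SIMPLISMA are 14, 20 and 27 rows respectively.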


4.2. Key variable selection

Rank analysis and variable selection are performed in a single step by the method of Section 3.2. The results produced for the selection of time variables (spectra) are very consistent. The results for frequency selection (chromatograms), however, showed some variation. This variation in the frequency or wavelength direction is an advantage: GA finds additional key variables for some compounds, which cannot be obtained by OPA, SIMPLISMA or SBM. Once GA is applied in the frequency or wavelength direction, a number of useful variables (frequencies or wavelengths) are obtained. All selected variables can then be analysed and the better key variables retained.

The results of key frequency selection are compared with three other methods: OPA, SIMPLISMA and SBM. The results for data5 are presented in Table 5, which shows few variations: most of the variables obtained by GA in one run are similar to those of the other methods, except the variables at 3.725 and 4.055 ppm. Pure LC-NMR spectra for data5 are presented in Fig. 3 for the purpose of comparison. Elution profiles at the frequencies selected by the different methods (SBM, OPA, SIMPLISMA and three runs of GA) are plotted in Fig. 4. The pure profiles, representing the elution of a single compound, are indicated by an arrow for each method. It can be observed that GA produced more pure variables than the other methods. When GA is run a few more times, there is a greater chance that new pure variables are identified, as revealed by Fig. 4. Table 5 shows three frequency variables obtained by GA (1.253, 1.277 and 1.340 ppm) that have elution profiles similar to those from the other methods; the minor differences in chemical shift are not significant. In contrast to elution time variables, GA produces slightly different variables in the frequency or wavelength direction in separate runs but, as stated earlier, variation in the selected frequency variables generates more information.

Fig. 4. Elution profiles at key frequencies selected by different methods for data5 (panels: SBM, OPA, SIMPLISMA and three GA runs; horizontal axis: time (min)); arrows indicate profiles representing the elution of a single compound.

The results of variable selection in the time direction are compared with OPA, SIMPLISMA, SBM and Needle Search. The results are presented in Table 6, which shows that the results produced by GA are closer to those of Needle Search than to those of the other methods.

5. Conclusions

The use of genetic algorithms provides a valuable new tool for rank analysis and key variable selection in chemometrics and this paper illustrates their application in coupled


chromatography. Their application to two diverse types of dataset has been demonstrated successfully, and it is expected that they will work with data from other instruments. GA performed better not only for rank determination but also for the selection of pure (or key) variables. GA takes a long time in the search but provides very stable results, especially when applied to finding pure variables in the time (or chromatographic) direction. Genetic algorithms are a promising new approach and an alternative to conventional methods such as OPA or SIMPLISMA.

Notations used

a        Peak height
A        Maximum peak height
t        Time
tr       Retention time
σ0       Peak shape parameter related to peak width
σ1       Peak shape parameter related to tailing
X        Data matrix
mcX      Mean centred data matrix
stdX     Standardized data matrix
C        Matrix of concentration profile
S        Matrix of spectral profile
M        Number of rows in X
N        Number of columns in X
K        Number of total compounds in X
I        Size of population in genetic algorithm
i        Population index of chromosomes
Y        A matrix used in OPA and SIMPLISMA
SD       Second difference
d        Dissimilarity
Pi       Probability of selection for a chromosome
qi       Cumulative probability of selection for a chromosome
β1, β2   Factors used in baseline modelling
β3       A factor for heteroscedastic noise modelling
ρ        Normally distributed random number

Acknowledgements

M.W. gratefully acknowledges the Ministry of Science and Technology, Government of Pakistan for providing funds for PhD and the support of PAEC, Pakistan. M.W. is also thankful to H. Duan for providing data7 and Y. Xu and L. Zhu for valuable discussions about GA.

References

[1] F.C. Sanchez, J. Toft, B. van den Bogaert, D.L. Massart, Anal. Chem. 68 (1996) 79–85.
[2] A. Bylina, D. Sybilska, Z.R. Grabowski, J. Koszewski, J. Chromatogr. 83 (1973) 357–362.
[3] S.J. Vanslyke, P.D. Wentzell, Anal. Chem. 63 (1991) 2512–2519.
[4] F.C. Sanchez, M.S. Khots, D.L. Massart, Anal. Chim. Acta 285 (1994) 181–192.
[5] M. Maeder, Anal. Chem. 59 (1987) 527–530.
[6] H.R. Keller, D.L. Massart, Anal. Chim. Acta 246 (1991) 379–390.
[7] M. Wasim, M.S. Hassan, R.G. Brereton, Analyst 128 (2003) 1082–1090.
[8] M. Wasim, R.G. Brereton, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 72 (2004) 133–151.
[9] E.R. Malinowski, Anal. Chim. Acta 134 (1982) 129–137.
[10] E.R. Malinowski, Factor Analysis in Chemistry, third ed., Wiley, New York, 1991.
[11] R.B. Cattell, Multivariate Behav. Res. 1 (1966) 245–276.
[12] H.H. Kindsvater, P.H. Weiner, T.J. Klingen, Anal. Chem. 46 (1974) 982–988.
[13] E.R. Malinowski, J. Chemom. 3 (1988) 49–60.
[14] S. Wold, Technometrics 20 (1978) 397–405.
[15] B.-V. Grande, R. Manne, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 50 (2000) 19–33.
[16] A. de Juan, B. Van den Bogaert, F.C. Sanchez, D.L. Massart, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 33 (1996) 133–145.
[17] O.M. Kvalheim, Y.-Z. Liang, Anal. Chem. 64 (1992) 936–946.
[18] F.C. Sanchez, M.S. Khots, D.L. Massart, J.O. De Beer, Anal. Chim. Acta 285 (1994) 181–192.
[19] W. Windig, J. Guilment, Anal. Chem. 63 (1991) 1425–1432.
[20] J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975; revised print: MIT Press, Cambridge, MA, 1992.
[21] D.B. Hibbert, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 19 (1993) 319–329.
[22] G. Vivo-Truyols, J.R. Torres-Lapasio, M.C. García-Alvarez-Coque, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 59 (2001) 89–106.
[23] G. Vivo-Truyols, J.R. Torres-Lapasio, A. Garrido-Frenich, M.C. Garcia-Alvarez-Coque, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 59 (2001) 107–120.
[24] D.J. Rimbaud, D.L. Massart, Anal. Chem. 67 (1995) 4295–4301.
[25] D. Broadhurst, R. Goodacre, A. Jones, J.J. Rowland, D.B. Kell, Anal. Chim. Acta 348 (1997) 71–86.
[26] R. Leardi, A.L. Gonzalez, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 41 (1998) 195–207.
[27] Q. Ding, G.W. Small, A.M. Arnold, Anal. Chem. 70 (1998) 4472–4479.
[28] H.C. Goicoechea, A.C. Olivieri, J. Chem. Inf. Comput. Sci. 42 (2002) 1146–1153.
[29] H.C. Goicoechea, A.C. Olivieri, J. Chemom. 17 (2003) 338–345.
[30] K. Hasegawa, T. Kimura, K. Funatsu, Quant. Struct.-Act. Relat. Pharmacol. Chem. Biol. 18 (1999) 262–272.
[31] Q. Guo, W. Wu, D.L. Massart, C. Boucon, S. de Jong, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 61 (2002) 123–132.
[32] H. Shen, C.Y. Airiau, R.G. Brereton, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 62 (2002) 61–78.
[33] H. Shen, C.Y. Airiau, R.G. Brereton, J. Chemom. 16 (2002) 165–175.
[34] H. Shen, C.Y. Airiau, R.G. Brereton, J. Chemom. 16 (2002) 469–481.
[35] J.R. Torres-Lapasio, J.J. Baeza-Baeza, M.C. Garcia-Alvarez-Coque, Anal. Chem. 69 (1997) 3822–3831.
[36] J.C. Hoch, A.S. Stern, NMR Data Processing, Wiley-Liss, New York, 1996.
[37] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, third ed., Springer-Verlag, Berlin Heidelberg, 1996.
[38] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.
[39] R. Leardi (Ed.), Nature-inspired Methods in Chemometrics: Genetic Algorithms and Artificial Neural Networks, Elsevier, Amsterdam, 2003.
[40] C.B. Lucasius, G. Kateman, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 19 (1993) 1–33.
[41] D.B. Hibbert, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 19 (1993) 277–293.
[42] C.B. Lucasius, G. Kateman, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 25 (1994) 99–145.
[43] R. Leardi, J. Chemom. 15 (2001) 559–569.
[44] S.I. Grossman, Elementary Linear Algebra, fifth ed., Saunders College Publishing, Fort Worth, 1994.
[45] M. Wasim, R.G. Brereton, J. Chromatogr. A 1096 (2005) 2–15.