
QUANTITATIVE PATTERN RECOGNITION USING NONLINEAR MODEL–BASED ANALYSIS

A Dissertation Presented for the Doctor of Philosophy Degree

The University of Tennessee, Knoxville

Martin A. Hunt

May 1998


Copyright Martin Anthony Hunt, 1998. All rights reserved.


DEDICATION

This work is dedicated to my wife, Elizabeth, and children, Caroline and Robert. You all are the joy and love of my life. I am forever grateful for your support and appreciate the sacrifices you made for me to pursue this endeavor.


ACKNOWLEDGMENTS

I thank Dr. Mongi Abidi, my major professor and principal investigator on the university robotics technology development program (URTDP), for his initial offer to be a member of the chemical analysis automation (CAA) team, for his technical advice given over the course of my association with him as both a professor and advisor, and for the opportunity to gain experience in both the academic and management aspects of a major research program. I also appreciate his sensitivity and consideration of my responsibilities to my family and Oak Ridge National Lab (ORNL). For their time and comments in the review of this dissertation, I thank the additional members of my Doctoral committee, Dr. Don Bouldin, Dr. Paul Crilly, Dr. Michael Roberts, and Dr. Belle Upadhyaya.

I am grateful to Dr. Leon Klatt, site manager of the ORNL CAA program and data interpretation module (DIM) functional leader, for his valuable guidance in the world of analytical chemistry. He provided insight on the analysis requirements and interpretation of the gas chromatography data. I also give credit to Jim Younkin for development of the communications toolkit used in the DIM and his and Dave Thompson’s assistance in the integration and testing of the DIM.

Many people at ORNL have been instrumental in affording me the opportunity to work on this dissertation in conjunction with my normal work responsibilities. I thank Dr. Kenneth Tobin for supporting my requests for reduced hours under the university study and the education sabbatical programs. I thank the I&C division management and the Executive committee for approving my participation in these programs.


Several people in the imaging, robotics, and intelligent systems lab have assisted in various stages of this research and I thank them for their efforts. Michael Williams developed the artificial neural network approach to analyte concentration and Thomas Lewis developed the corresponding graphical user interface. Laurana Wong and Khuloud Al-hend assisted in the early development of the nonlinear least squares algorithms. Katie Jager Petito and Melissa Cox have provided excellent administrative support. Finally, I would like to say thanks to the lab gang for keeping me laughing throughout this entire process.

I would like to acknowledge the encouragement, motivation, and understanding provided by my extended family and friends during this endeavor. My parents have given both emotional and physical support ranging from child care to rejuvenating sailing vacations.

This work was sponsored by the DOE’s University Research Program in Robotics (Universities of Florida, Michigan, New Mexico, Tennessee, and Texas) under grant DOE–DE–FG02–86NE37968. Portions of the work were performed at Oak Ridge National Laboratory, managed by Lockheed Martin Energy Research Corp. for the U.S. Department of Energy under contract DE–AC05–96OR22464.


ABSTRACT

A nonlinear model–based approach is taken to quantitatively analyze time series data generated by analytical instruments. An automated system is presented which takes as input an Analytical Instrument Association (AIA) network common data format (NetCDF) data file and generates an estimate of the concentrations of specific analytes of interest. The system consists of three primary modules which, when combined, provide accurate and precise knowledge about unknown sample matrices, especially difficult–to–analyze mixture samples. A preprocessing module extracts peak parameter estimates for the exponentially–modified Gaussian (EMG) model from the raw signal and utilizes nonlinear optimization techniques to fit the model to the observed data. A novel sliding window approach ensures that the influence of neighboring peaks is included in the model fitting without the requirement for arbitrarily established peak endpoints. Modeled peak parameters are available for both instrument performance assessment and use in the analysis module. Several traditional analysis algorithms are implemented in parallel on the raw and extracted data. A complete analyte–based model–analysis algorithm is also developed for the analysis of complex mixture samples. This algorithm utilizes concentration dependent, complete analyte models derived from calibration standards to model the observed signal in a unified manner. Each analysis algorithm generates analyte concentration and confidence estimates and an additional performance measure. The third module utilizes a fuzzy logic inference system to fuse the results of the multiple analysis algorithms into a single comprehensive sample characterization. Software modules implement the described algorithms and interface to the supervisory controller of an automated chemical analysis system. Experimental results from gas chromatography data generated from simulated, standard and actual environmental samples are presented and conclusions drawn regarding the increase in accuracy and performance of the system over traditional methods.


Contents

CHAPTER 1 Introduction
1.1 General problem definition
1.2 Specific problem scope and definition
1.3 Application area
1.4 Related Work
1.5 Narrative Organization

CHAPTER 2 Gas Chromatography
2.1 Separation
2.2 Columns
2.3 Instruments
2.4 Detectors

CHAPTER 3 Peak Model
3.1 Derivation of Exponentially Modified Gaussian
3.2 Parameter estimates
3.3 Support of peak model
3.4 Derivatives of peak model
3.5 Sliding window approach to large peak sets

CHAPTER 4 Concentration Mixture Model
4.1 Calibration
4.2 Total concentration model fitting

CHAPTER 5 Nonlinear Least Squares Data Modeling
5.1 Least squares formulation
5.2 Minimization algorithm
5.3 Line search and estimation of LM scaling factor
5.4 Initial estimates
5.5 Refinement of model
5.6 Constrained nonlinear minimization

CHAPTER 6 Results Fusion
6.1 Fusion methods
6.2 Fuzzy logic based fusion

CHAPTER 7 System integration
7.1 Functional description of the integrated system
7.2 Software structure
7.3 Implementation
7.4 Off-line analysis tool

CHAPTER 8 Results
8.1 Preprocessing and filtering
8.2 Baseline estimation
8.3 Peak modeling
8.4 Analyte modeling
8.5 Chromatogram modeling
8.6 Results fusion

CHAPTER 9 Conclusions
9.1 Summary
9.2 Contributions
9.3 Developed software and tools
9.4 Future research

BIBLIOGRAPHY

APPENDICES
APPENDIX A. Derivation of EMG function partial derivatives
APPENDIX B. Table of individual results

Vita


List of Tables

Table 8.1: Average absolute error, model algorithm
Table 8.2: Average absolute error, Commercial system
Table 8.3: Average percent error at SNR = 0
Table 8.4: Comparison of analytical methods for a single sample
Table 8.5: Summary results of individual methods
Table 8.6: Reported concentrations and fused results for a single sample
Table 8.7: RMS error over entire data set for different methods
Table B.1: Concentration values for sample set


List of Figures

Figure 1.1 Schematic of various physical phenomenon measurement systems.
Figure 1.2 Model of the overall measurement process.
Figure 1.3 Schematic flow of data and processing steps.
Figure 1.4 Schematic of the target processes the CAA program will automate.
Figure 1.5 Typical peak profile for a gas chromatogram.
Figure 2.1 Schematic of stationary and mobile phases in GC.
Figure 2.2 Schematic of a GC instrument.
Figure 2.3 Typical chromatogram of a PCB sample.
Figure 2.4 Enlarged section of the chromatogram in Fig. 2.3.
Figure 3.1 Typical chromatogram peak and Gaussian peak shape.
Figure 3.2 Convolution of Gaussian peak function with exponential decay function.
Figure 3.3 Typical chromatogram peak and EMG peak.
Figure 3.4 Typical EMG peak shown with the measured parameters.
Figure 3.5 Flow chart of peak fitting operation.
Figure 4.1 Flow diagram of the off–line chromatography calibration processing.
Figure 4.2 Flow diagram of the on–line chromatography analysis processing.
Figure 5.1 Processing flow for peak detection algorithm.
Figure 5.2 Example of baseline estimation.
Figure 5.3 Graphic description of vectors associated with constrained optimization.
Figure 6.1 Block diagram of the results fusion module.
Figure 6.2 Example of two Gaussian based membership functions.
Figure 6.3 Complete results fusion architecture.
Figure 6.4 Example of the fuzzy logic based combination of two measured inputs.
Figure 7.1 Top level schematic of the DIM CC functional blocks.
Figure 7.2 Main screen of the GUI chromfit tool.
Figure 8.1 Comparison of the power spectral density estimates.
Figure 8.2 Power spectrum estimate used for filter selection.
Figure 8.3 Frequency response of FIR filter used to suppress high frequency noise.
Figure 8.4 Example of the approximated baseline.
Figure 8.5 Example of signal derivative estimates.
Figure 8.6 Chromatogram segment with peaks indicated by vertical stem plots.
Figure 8.7 Sum of squares error surface for a single peak.
Figure 8.8 Contour plot of the error surface for the single peak modeling operation.
Figure 8.9 Plots showing the accuracy of the peak modeling process.
Figure 8.10 Stem plot of the peak times and corresponding areas.
Figure 8.11 Example of a typical simulated chromatogram.
Figure 8.12 Example of the modeling results for noisy data.
Figure 8.13 Example of the raw data and the modeled peak at a SNR of 40.
Figure 8.14 Comparison of the average RT error for a range of SNRs.
Figure 8.15 Comparison of the average area error for a range of SNRs.
Figure 8.16 Plot of the peak areas vs. analyte concentration for three different peaks.
Figure 8.17 Comparison between quadratic and linear models for a set of peak areas.
Figure 8.18 Image indicating the peaks included in the three analyte models.
Figure 8.19 Model and measured signal for Aroclor 1242 standard.

Figure 8.22 Simulated chromatogram for mixture sample.
Figure 8.23 Accuracy comparison of a complete chromatogram model.
Figure 8.24 Zoomed in area of a difficult to fit mixture sample.
Figure 8.25 Input membership functions.
Figure 8.26 Comparison of accuracy and precision for results fusion.


CHAPTER 1

INTRODUCTION

Many types of analytical systems generate significant amounts of raw data during their course of operation. These raw data generated by the sensor must be interpreted to obtain meaningful high level knowledge about the target phenomenon of interest. Due to the complex interactions between sensed phenomena and the sheer volume of individual analyses that must be made, automation of the interpretation process is becoming essential. This work focuses on utilizing advanced signal processing and pattern recognition algorithms to generate quantitative knowledge from one–dimensional time series generated by analytical instruments.

1.1 General problem definition

Relating the measurable outputs of one or more sensors to physical attributes of interest is a fundamental task of any system that contains detectors or sensors measuring a physical phenomenon. In the ideal situation the sensor will have a scalar response which is selective to a single physical phenomenon of interest. This selectivity results in a one-to-one correspondence between sensor output and measured attribute that has no correlation to other possible physical conditions. An example of such a system would be a humidity sensor that generates an output voltage that is linearly proportional to the humidity regardless of the temperature, pressure, etc. The schematic of a general measurement system with such a response is shown in Fig. 1.1a.

Figure 1.1 Schematic of various physical phenomenon measurement systems. a.) Schematic of a sensor system with a scalar output. Two possible output relationships between sensor output and physical phenomenon - selective and linear; correlated and nonlinear; b.) Schematic of a sensor system with a vector output. Selective response generates a single peak with an area that is linearly related to a physical phenomenon.
[Panel a: environment, probe, sensor, scalar response output; plots of relative humidity vs. sensor output (v) for the selective and correlated cases. Panel b: environment, probe, sensor, vector response output; plots of detector units vs. time (min.) and concentration vs. peak area (counts-sec).]

Figure 1.1 c.) Schematic of a sensor system with a vector output. Non-selective response generates many peaks and sensor output consists of multiple components, each related to a physical phenomenon.
[Panel c: non-selective, multivariate vector response; environment, probe, sensor, output; plot of detector units vs. time (min.) showing component 1, component 2, and the sensor output.]

Another, more realistic mapping between the physical attribute and the sensor response is one in which the output is influenced by a physical phenomenon other than the targeted phenomenon. For the previous example, the humidity sensor response would be a function of both humidity and temperature. In this case, an additional sensor which measures temperature is required to get an accurate measurement of the humidity. To complete this system, a model of the relationship between sensor response, temperature, and humidity would be developed to estimate the humidity given the sensor response and temperature. Another possible deviation from the ideal circumstances is a sensor response that does not have a simple first–order linear relationship to the physical phenomenon of interest. The model developed for this system is required to have the flexibility for higher order or nonlinear components to account for the actual sensor response. This scenario is considered a correlated or non-orthogonal, nonlinear sensor response and forms one component of the basic problem this research will address. The selective, linear response and the correlated, nonlinear response are shown in Fig. 1.1a.
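The humidity example can be made concrete with a small sketch. The forward model below is hypothetical: it assumes the sensor voltage is quadratic in relative humidity and linear in temperature, with made-up coefficients, and recovers humidity from a voltage and a separately measured temperature by bisection. It is an illustration of a correlated, nonlinear response, not a model taken from this work.

```python
def sensor_voltage(humidity, temperature, coeffs=(0.5, 0.08, 0.0004, 0.02)):
    """Hypothetical forward model: voltage for a relative humidity (%) and temperature (deg C)."""
    a0, a1, a2, a3 = coeffs  # illustrative values only
    return a0 + a1 * humidity + a2 * humidity ** 2 + a3 * temperature


def estimate_humidity(voltage, temperature, lo=0.0, hi=100.0, tol=1e-9):
    """Invert the forward model for humidity by bisection on [lo, hi].

    Assumes the model is monotonic in humidity over the interval.
    """
    f_lo = sensor_voltage(lo, temperature) - voltage
    f_hi = sensor_voltage(hi, temperature) - voltage
    if f_lo * f_hi > 0:
        raise ValueError("voltage is outside the modeled humidity range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = sensor_voltage(mid, temperature) - voltage
        if f_lo * f_mid <= 0:
            hi = mid            # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # root lies in the upper half
    return 0.5 * (lo + hi)


# The same voltage maps to different humidities at different temperatures.
v = sensor_voltage(45.0, 30.0)
print(estimate_humidity(v, 30.0))  # uses the correct temperature
print(estimate_humidity(v, 20.0))  # wrong temperature biases the estimate
```

Using the wrong temperature in the second call shows why the additional temperature measurement is needed when the response is correlated.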

Some sensors generate a vector response for a given measurement as depicted in Fig. 1.1b. This type of output, which may be a function of wavelength or time (but not necessarily measuring a time varying physical phenomenon), requires several additional processing steps to obtain the relationship between the physical phenomenon and the observed sensor response. Signal processing and pattern recognition techniques are used in this case to extract the desired relationships. If the sensor generates a particular pattern over time, for a given condition, it might be possible to determine the unique contribution of a single measurable physical phenomenon related to this pattern. One method to accomplish this is to recognize the component of the measured sensor response that matches the particular signature of interest, such as a peak at a specific time. This method would have a one-to-one mapping between the time series vector and the physical quantity of interest if there were no additional sensor response due to other physical phenomena. This scenario is depicted in Fig. 1.1b. In practice it is difficult to build a sensor which is selective (generates a unique response) to the physical phenomenon of interest. Typically there exist response components due to other phenomena, interferences, drifts, and noise. The observed signal from the sensor is the combination (either linearly or nonlinearly) of all these possible components into one observable signal. An example of this multivariate nonselective response is shown in Fig. 1.1c.

Consider these simple examples as a means to express the fundamental aspects of measurement systems in which the observed instrument response is a complicated mixture of several physical phenomena, has global and local noise or interference patterns, and exhibits potentially nonlinear behavior. The fundamental objective of this research will be to recognize and accurately quantitate the relationship between an observed time series pattern and a physical phenomenon of interest using pattern recognition and signal processing theory.

1.2 Specific problem scope and definition

This research work will focus specifically on the analysis of time–series signals for the quantitative determination of the amounts of multiple components giving rise to the measured time signal. Such time–series signals are generated by gas chromatography (GC) systems performing chemical separations.


The input time signal consists of a train of peaks whose area and center time are characteristics of the components present at the sensor generating the signal and whose shape is similar to a Gaussian probability density function. This time series is not a periodic function and thus many of the traditional signal processing techniques do not directly apply in the analysis process. An estimation or optimization approach based on the stochastic nature of the signal can be used to extract the relevant information from the measured signals. An analysis of the analytical instrument which generates the time signals reveals that random variations are introduced at several locations in the system. The underlying chemical process produces random variations in both the retention time at which the fundamental peaks are generated and the areas of these peaks. In addition, the detector is essentially an electron counter which experiences random variations in the electron count. Finally, the analog electronics and analog-to-digital conversion process introduce a degree of randomness. However, these random effects are typically much smaller in magnitude than the deterministic response of the instrument due to the presence of a chemical compound.
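The signal description above can be illustrated with a small simulation. The sketch below is not taken from this work: it assumes purely Gaussian peak shapes, normally distributed jitter in retention time and relative area (standing in for the process variations), and additive normal noise at each sample (standing in for detector and electronics noise). All function names and parameter values are hypothetical.

```python
import math
import random


def gaussian_peak(t, area, center, sigma):
    """Gaussian peak whose integral over time equals `area`."""
    norm = area / (sigma * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-0.5 * ((t - center) / sigma) ** 2)


def simulate_peak_train(times, peaks, rt_sd=0.0, area_sd=0.0, noise_sd=0.0, seed=0):
    """Simulate one run: peaks is a list of (area, retention_time, sigma) tuples."""
    rng = random.Random(seed)
    # Process variation: perturb each peak's area (relative) and retention time once per run.
    perturbed = [
        (area * (1.0 + rng.gauss(0.0, area_sd)), center + rng.gauss(0.0, rt_sd), sigma)
        for area, center, sigma in peaks
    ]
    # Measurement variation: independent additive noise at every sample.
    return [
        sum(gaussian_peak(t, a, c, s) for a, c, s in perturbed) + rng.gauss(0.0, noise_sd)
        for t in times
    ]


dt = 0.01
times = [i * dt for i in range(1501)]            # 0 to 15 minutes
peaks = [(2.0, 4.0, 0.10), (1.0, 9.0, 0.15)]     # illustrative (area, time, width)
clean = simulate_peak_train(times, peaks)
noisy = simulate_peak_train(times, peaks, rt_sd=0.02, area_sd=0.03, noise_sd=0.01, seed=1)
print(sum(clean) * dt)  # total area of the noise-free signal
```

With the jitter and noise set to zero, the summed signal integrates to the total of the peak areas, which is the deterministic response that the random effects perturb.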

Therefore, a fundamental approach of model–based estimation of the underlying component signals will yield an optimal solution in the presence of these random variations. The schematic shown in Fig. 1.2 is a depiction of the system which generates a measured time signal and the process of estimating the model parameters. In this schematic there are several models that are required for the estimation process including the model of the process which analyzes the sample, the model of the sensor response, and the data acquisition system.
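One way to write down the measurement model sketched in Fig. 1.2 is given below. The symbols a(·), b(·), P, W, V, R_w, and R_v are taken from the figure legend. The intermediate signal x, the observed signal y, the additive placement of the two uncertainties, and the least squares form of the estimator are assumptions made here for illustration, not a formulation stated at this point in the text.

```latex
\begin{align}
x &= a(P) + W, & W &\sim N(0, R_w) \\
y &= b(x) + V, & V &\sim N(0, R_v) \\
\hat{P} &= \arg\min_{P} \; \lVert y - b(a(P)) \rVert^{2}
\end{align}
```

Under these assumptions, the estimation block in the figure inverts the cascade of process and sensor models to recover the process parameters P from the observed signal.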


This research effort addresses the issues touched on in the preceding paragraphs in an attempt to further the state of the art in GC data analysis. Specifically the following problems are addressed in an original, unified, and systems-based approach:

1. Accurate chromatogram baseline estimation;

2. Accurate and robust peak–area determination;

3. Generation of analytical instrument performance assessment measures;

4. Mixture concentration determination using pattern recognition and quantitative modeling; and

5. Results fusion for increased accuracy and precision.

The complete process of sample assessment using GC is shown schematically in Fig. 1.3. This research will address all the elements required for end–to–end processing of this type of data. Significant and novel contributions to the field are made in items 1 - 4 of the above list in addition to the system level contribution of an end-to-end data interpretation system.

Figure 1.2 Model of the overall measurement process. Model based representation of the signal generation system, measurement instrument, and filtering or estimation operation.
[Block diagram: sample, process a(•), sensor b(•), measured signal, estimation, results; grouped into the analytical instrument and the data analysis system. Legend: a(•), model of process; b(•), model of sensor; N(0,Rw), estimate of process uncertainty; N(0,Rv), estimate of data acquisition uncertainty; P, process parameters; W, process uncertainty; V, data acquisition uncertainty.]

Figure 1.3 Schematic flow of data and processing steps. Shaded blocks represent areas in which this work makes contributions to the field. Additionally, the automated end-to-end processing has significant impact in the practical application of the developed technology.
[Flow diagram, data interpretation process. Block labels: raw time series; preprocessing; assess input signal quality; parameter estimation; baseline estimation; peak parameter estimation; analysis; principal component regression; artificial neural network; modeled component pattern recognition; postprocessing; analytical results fusion; quantitative assessment.]

1.3 Application area

Chemical analysis using separation science is a well established field of research both in the public and private sectors. In particular the field of chromatography has grown tremendously since the classical theories were presented by Martin and Synge [65] in 1941. A general definition of chromatography is the process of separating constituents of a mixture by permitting a solution of the mixture to flow through a column of adsorbent on which the different substances are selectively separated into distinct bands. Chromatography has demonstrated its power and flexibility in many areas including the analysis of permanent and light hydrocarbon gases, gasolines, organic acids in human tissues, steroids, organic pollutants, and aromas [41].

In gas chromatography (GC), a column is swept by a flow of gas (the mobile phase) that carries the sample from a dedicated sampling port to an on-line detector. The separation process in this case takes place in the interaction of the mobile and stationary phases and is made evident in the output of the on-line detector rather than as bands in the column itself. James and Martin [47] published the first description of this technology and the results of their experiments, which revolutionized the entire field of chromatography. The resulting signal from the on-line detector has a wealth of information on the compounds present in the sample, their concentrations, and information on the thermodynamics and kinetics of the molecular interactions between the column and the sample [41]. In the forty years since its introduction, GC has become a very popular and common analytical technique used by industry, government, and universities around the world. Despite this history, there are areas which require additional applied research such as the area of complex mixture analysis.

The use of gas chromatography (GC) for analyzing a variety of volatile compounds is standard practice in many analytical laboratories [22]. In the environmental restoration field, the analysis of samples for contamination by regulated substances makes extensive use of GC. The United States (U.S.) Environmental Protection Agency (EPA) method 8080A [78] describes the use of GC for the analysis of organochlorine compounds and polychlorinated biphenyls (PCBs). Environmental laboratories will use GC as the primary technique for sample analysis due to the reduced complexity, lower capital equipment cost, and automation of sample introduction, all of which contribute to a lower cost per sample analysis. Although sample introduction into the GC equipment is highly automated, interpretation of the raw chromatogram is still a labor–intensive task and accounts for up to 50% of the total sample analysis cost.

The United States Department of Energy (DOE) has recognized that the sample analysis needs of its remediation activities are very large and will increase dramatically in the coming years. In 1995 DOE was making 2 - 3 million chemical, biological, and radioactive determinations per year, with a projected 10 million by 1997 [75]. In addition to the added costs related to these sample analyses, the human resources required, including experienced chemical analysts, exceed the projected resources in the workforce. The Chemical Analysis Automation (CAA) program was initiated as a result of these needs to automate the chemical analysis process from the sample preparation to the data interpretation and reporting phases in a modular, integrated system. The primary goals of the data interpretation module (DIM) within the CAA program are to reduce the time required to analyze the results of the measurement device and to increase the accuracy and precision of the analysis. Remediation costs are directly affected by the results of such sample analysis; thus the higher accuracy can save additional money by reducing unnecessary cleanup operations (typically the amount of contamination is conservatively estimated for complicated mixture samples).

The primary method for extracting chemical knowledge in the form of an identifica-

tion and quantitation of component(s) is a linear regression of the chromatogram peak area

onto calibrated standards of the compound of interest. The steps involved in this regres-

sion are to first identify which peak(s) in the chromatogram will be used, extract an esti-

mate of the peak area, then generate a set of calibration data [20]. This method works well

under conditions where the response of the GC detector is linear, the baseline variations

are minimal, and a single compound with unique peak patterns exists. Unfortunately, some

sample analyses will not meet these conditions completely and alternate methods must be

employed to obtain accurate quantification.
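The single-peak calibration described above can be sketched in a few lines. This is a minimal illustration, assuming the peak has already been identified and its area extracted; the function names and the calibration values are hypothetical and are not taken from the cited work.

```python
def fit_calibration(areas, concentrations):
    """Least-squares straight line: concentration = b0 + b1 * area."""
    n = len(areas)
    if n < 2 or n != len(concentrations):
        raise ValueError("need at least two matched (area, concentration) pairs")
    mean_x = sum(areas) / n
    mean_y = sum(concentrations) / n
    sxy = sum(x * y for x, y in zip(areas, concentrations)) - n * mean_x * mean_y
    sxx = sum(x * x for x in areas) - n * mean_x * mean_x
    if sxx == 0:
        raise ValueError("all calibration areas are identical")
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


def predict_concentration(b0, b1, area):
    """Estimate the concentration that produced a measured peak area."""
    return b0 + b1 * area


# Hypothetical calibration standards: peak areas and known concentrations.
standard_areas = [10.0, 20.0, 30.0, 40.0]
standard_concs = [1.0, 2.0, 3.0, 4.0]
b0, b1 = fit_calibration(standard_areas, standard_concs)
unknown = predict_concentration(b0, b1, 25.0)
```

The sketch only holds under the conditions listed above: a linear detector response, a negligible baseline, and a peak that belongs to a single compound.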

A majority of samples to be analyzed with GC contain multiple components and possi-

bly background interferences. This results in a complex chromatogram which might not

have unique or orthogonal peaks and it is therefore difficult to identify completely or to

quantitate the contributing components. The peak areas extracted may consist of contribu-

tions from several sources in these cases. A method which can explicitly use and account

for these conditions is necessary for accurate analysis of this type of sample.


1.4 Related Work

In the following sections, several areas of research related to the interpretation of time

series signals generated by analytical instruments are reviewed. The initial sections are

brief introductions to the specific application area of this research. The sections on peak

detection, deconvolution, baseline estimation, and peak measurement comprise the pre-

processing aspect of the interpretation process. These sections provide a review of current

approaches and form a background for the discussion of alternative approaches to charac-

terizing individual peaks.

The final sections review literature on the topics of qualitative analysis, analytical

methods, and modeling. These topics address the analysis and postprocessing stages of the

data interpretation process. From this review the need for additional methods which han-

dle a specific class of interpretation is identified.

1.4.1 Chromatography

The theory of chromatography is well documented in the literature and continues to be

a fertile area of research. Several books have been written which chronicle developments

in the chemical aspects of this field, including Gas Chromatography by Golay [38], Chromatography by Giddings [36], and Quantitative Gas Chromatography for Laboratory

Analysis and On-Line Process Control by Guiochon and Guillemin [40]. Guiochon and

Guillemin have also written an excellent review article [41] which summarizes the state of

gas chromatography and presents examples of the breadth of possible applications. A sig-

nificant advancement in the separation capability came with the introduction of the capil-

lary column which used an open, tubular column with a stationary film on the inner wall.


This technology was made practical by Dandeneau with the introduction of flexible fused

silica as the column tubing material [19].

1.4.2 Chemical Analysis Automation

The CAA program has defined a number of standards for the automated processing of

soil samples following the protocols of U.S. EPA solid waste analysis (SW-846 [79]) with a set of modular Standard Laboratory Modules (SLMs). Erkkila and Hollen [23] give an

overview of the state of the program and the overall goals. The concept and operation of

the basic units of the system, SLMs, and their interaction with a supervisory control architecture are presented by Salit et al. [71]. Several components of the DIM have been

completed and will be discussed in the Analytical Methods section below. Combining the

results from the best analytical methods into a more accurate and precise concentration

analysis, the ultimate goal for the DIM, has not been achieved to date. Several papers have

reported the progress to date on the DIM, including a paper by Elling and Klatt [21]. The

processes that the CAA program will automate are shown in Fig. 1.4.

1.4.3 Peak detection

Most typical chromatography applications use a series of thresholds on the raw sig-

nal’s amplitude and possibly the first derivative to automatically delineate the presence of

a peak. These methods are highly dependent on the estimation of the underlying signal

baseline and thus tend to perform poorly when peaks are overlapped, the baseline is hard

to estimate, or the signal noise level is large. In practical applications the parameters asso-

ciated with the GC system are manipulated to obtain a chromatogram with an isolated,

well defined peak representing the analyte of interest. If this approach fails to completely


resolve the peaks or if the analyte is a mixture with many peaks, signal processing tech-

niques can be used to extract the peak(s) of interest from the chromatogram.
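A minimal sketch of the amplitude-threshold detection described above is given here. It assumes a known constant baseline and uses only the amplitude test (the first-derivative test is omitted); the function name, the trace values, and the threshold are hypothetical.

```python
def detect_peaks(signal, baseline, threshold):
    """Return (start, apex, end) index triples for runs above baseline + threshold."""
    peaks = []
    start = None
    for i, value in enumerate(signal):
        above = value - baseline > threshold
        if above and start is None:
            start = i                      # rising edge: a peak begins
        elif not above and start is not None:
            apex = max(range(start, i), key=lambda k: signal[k])
            peaks.append((start, apex, i - 1))
            start = None                   # falling edge: the peak ends
    if start is not None:                  # trace ended while still above threshold
        apex = max(range(start, len(signal)), key=lambda k: signal[k])
        peaks.append((start, apex, len(signal) - 1))
    return peaks


# Hypothetical detector trace with two well-separated peaks.
trace = [0.1, 0.1, 0.5, 2.0, 0.6, 0.1, 0.1, 0.7, 3.0, 3.5, 0.8, 0.1]
found = detect_peaks(trace, baseline=0.1, threshold=0.3)
```

The weaknesses noted in the text are visible in the code: two overlapped peaks that never drop below the threshold are reported as a single run, and every result depends on the baseline value supplied.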

The use of a Kalman filter to resolve peaks with partial overlap in a high-performance

liquid chromatography application is given by Hayashi et al. [44]. Characterizing the filter as a linear recursive least-squares algorithm, the authors explain the advantages of the least squares distance measure in the Gauss space rather than the Euclidean space. The results

of this approach were to filter the noise-contaminated raw data so that slightly overlapped

peaks could be decomposed into their constituent peaks. Adaptive Kalman filtering has

also been proposed by Brown and Bear to correct empirical models of overlapped peaks

and voltammetry simulations [9][3]. Another filter–based approach for increased accuracy

in peak detection under noisy chromatograms is the matched filter [7]. The matched filter

uses models of both the signal and the noise to generate a filter which optimizes the sig-

nal–to–noise ratio. A Gaussian peak model and band-limited white noise were used to

derive the filter and the results showed an improved signal–to–noise ratio. Drawbacks to this method were the forced assumption of a symmetric peak model and the assumption of a single peak.

Figure 1.4 Schematic of the target processes the CAA program will automate. Robotics and software algorithms are combined to achieve a high degree of automation. (CAA project document, used by permission from Leon Klatt) [Diagram: Sample In, then Sample Preparation + Analysis + Data Interpretation, then Knowledge Out.]

Finally, the application of zero-area digital filters has been shown to be effective for

peak detection under conditions of statistical noise, significant background, and interfer-

ences [49]. Three zero-area filters, square, triangular, and Gaussian, were evaluated by

convolving the filter functions with the original raw signal. This essentially had the effect

of high–pass filtering the input signal. Conclusions drawn from this research indicated that

this type of filtering could enhance apparent resolution, thus aiding in peak detection.
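The effect of a zero-area filter can be illustrated with a short sketch. The square kernel below (a positive central lobe flanked by two negative lobes of equal total weight) is one common construction and is an assumption here, as are the function names and the synthetic signal; the filters evaluated in [49] may be parameterized differently.

```python
import math


def zero_area_square_kernel(lobe_width):
    """Symmetric square kernel whose coefficients sum to zero."""
    if lobe_width < 1 or lobe_width % 2 == 0:
        raise ValueError("lobe_width must be a positive odd integer")
    return [-1.0] * lobe_width + [2.0] * lobe_width + [-1.0] * lobe_width


def apply_filter(signal, kernel):
    """Centered convolution; samples where the kernel overhangs the ends are left at 0."""
    half = len(kernel) // 2
    out = [0.0] * len(signal)
    for i in range(half, len(signal) - half):
        out[i] = sum(kernel[j] * signal[i - half + j] for j in range(len(kernel)))
    return out


kernel = zero_area_square_kernel(3)

# Hypothetical trace: sloping baseline plus one Gaussian peak centered at sample 20.
baseline_only = [1.0 + 0.05 * i for i in range(41)]
signal = [b + 5.0 * math.exp(-0.5 * ((i - 20) / 2.0) ** 2)
          for i, b in enumerate(baseline_only)]

flat = apply_filter(baseline_only, kernel)   # baseline alone
filtered = apply_filter(signal, kernel)      # baseline plus peak
apex_index = max(range(len(filtered)), key=lambda k: filtered[k])
```

Because the kernel is symmetric and sums to zero, a constant or linearly sloping background produces no output, while the peak produces a positive response at its apex. A thresholding step is still needed afterward to turn the filtered trace into peak locations.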

Overall, these methods are useful in either detecting or enhancing individual peaks but

still require some of the fundamental operations of differentiation or thresholding in a

post–processing stage to automatically extract the peak parameters. The presented

research will address this issue by using a nonlinear fitting of the raw data to a flexible

model which only needs rough estimates on the peak parameters as initial input.

1.4.4 Deconvolution

A successful approach to the resolution of overlapped (or coeluting) peaks has been to

deconvolve the observed chromatogram signal with a model of the system impulse

response. Several methods have been presented in the literature to accomplish the decon-

volution with the primary focus on Fourier transform and iterative relaxation methods. In

both of these methods some a priori knowledge of the nominal peak shape must be measured or assumed based on the first principles of the chromatographic system.

In the Fourier deconvolution approach the properties of the Fourier transform which

relate convolution in the time domain to multiplication in the frequency domain are used


to recover a representation of the true response of the system. The theoretical derivation of

frequency–domain peak sharpening was developed by Kirmse and Westerberg for a sym-

metrical peak shape (Gaussian) [54]. This theory was later refined for an asymmetrical

peak shape (exponentially modified Gaussian - EMG) by Felinger [26]. The disadvantages

in the Fourier–based deconvolution are the sensitivity to noise, determination of the cutoff

frequency and smoothing window.
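The convolution property that underlies Fourier deconvolution can be demonstrated on a small noise-free example. The sketch below uses a naive discrete Fourier transform and circular convolution; the function names and the three-point impulse response are hypothetical, and no cutoff frequency or smoothing window is applied, so it is not the peak-sharpening procedure of [54] or [26].

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def circular_convolve(x, h):
    """Circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]


def fourier_deconvolve(observed, impulse_response):
    """Divide the spectra: convolution in time is multiplication in frequency."""
    spectrum = dft(observed)
    response = dft(impulse_response)
    estimate = [o / h for o, h in zip(spectrum, response)]
    return [v.real for v in dft(estimate, inverse=True)]


# Hypothetical "true" signal: two narrow components, blurred by a short response.
true_signal = [0.0] * 16
true_signal[4] = 1.0
true_signal[6] = 0.5
impulse_response = [0.6, 0.3, 0.1] + [0.0] * 13
observed = circular_convolve(true_signal, impulse_response)
recovered = fourier_deconvolve(observed, impulse_response)
```

The division is exact here only because the data are noise-free and the chosen response has no spectral zeros. With measured data, small values of the response spectrum amplify noise, which is why a cutoff frequency and smoothing window must be chosen in practice.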

The iterative relaxation methods have generally proven to be more stable and power-

ful. A method proposed by Jansson [50] has been successfully used to resolve peaks in gas

chromatography data that has moderate noise levels and coeluting peaks. Extensions and

refinements on the base method have been given by Crilly [16],[17],[18] for cases of low

signal–to–noise ratios and peak overlap. These methods applied various filtering techniques including the n-point polynomial smoothing filter of Savitzky and Golay [62] and a

cross-correlation or “reblurring” filter.

These deconvolution methods primarily result in an enhanced chromatogram which

still requires some form of integration to determine the peak area and other peak parame-

ters. The problem still remains, although significantly easier in many cases, to automati-

cally locate and extract parameters from the relevant peaks.

1.4.5 Baseline estimation

A fundamental component of any peak detection and measurement operation is the

accurate estimate of the underlying baseline of the chromatogram signal. The most typical

method for baseline estimation is to detect local minima in the signal and then fit these

points to either a piecewise linear or polynomial function [13]. This method is susceptible


to signal level deviations which cause the fitted low order polynomial to vary significantly

from the true baseline between the ordinate data points. A Fourier series approach has

been presented which is less sensitive to extreme values of individual points [85]. In this

approach a piecewise linear approximation is obtained using local minima, then a Fourier series approximation of the piecewise linear approximation is computed. A smoothed

version of the piecewise linear approximation can be obtained by taking only a few of the

low order terms of the series and reconstructing a new time series.
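The local-minimum baseline described at the start of this section can be sketched as follows. Treating both end points of the trace as anchors is an assumption made for this illustration, and the function names and data are hypothetical; the Fourier-series smoothing step of [85] is not included.

```python
def local_minima(signal):
    """Indices of interior local minima, with both end points added as anchors."""
    anchors = [0]
    for i in range(1, len(signal) - 1):
        if signal[i] <= signal[i - 1] and signal[i] < signal[i + 1]:
            anchors.append(i)
    anchors.append(len(signal) - 1)
    return anchors


def piecewise_linear_baseline(signal):
    """Connect consecutive anchor points with straight line segments."""
    if len(signal) < 2:
        raise ValueError("need at least two samples")
    anchors = local_minima(signal)
    baseline = [0.0] * len(signal)
    for left, right in zip(anchors, anchors[1:]):
        span = right - left
        for i in range(left, right + 1):
            frac = (i - left) / span
            baseline[i] = signal[left] + frac * (signal[right] - signal[left])
    return baseline


# Hypothetical trace: three peaks riding on a baseline that rises late in the run.
trace = [2.0, 5.0, 2.0, 6.0, 3.0, 7.0, 4.0]
baseline = piecewise_linear_baseline(trace)
corrected = [s - b for s, b in zip(trace, baseline)]
```

Every anchor is taken at face value, so a single noisy minimum or a valley between two overlapped peaks pulls the baseline with it. That is the sensitivity to individual points noted in the text.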

Schechter takes a combined approach to the determination of the baseline and the con-

centration of a monovariable chemical system [73]. In this work a model is developed

based on two reference signals with known concentrations of the single analyte of interest.

The model is based on the observed signal being the sum of three components: the analyte

of interest, all other components, and a polynomial baseline. Using the known parameters

from the two reference signals, an unknown concentration of the analyte can be estimated

using a least-squares minimization. This approach to baseline estimation does not in fact

extract a baseline in the usual sense for subsequent peak processing, but rather it effec-

tively isolates the contribution to the overall observed signal using a model. This funda-

mental approach has some similarities to the research proposed in this work.

1.4.6 Peak measurements

Based on the chemical and physical properties of the GC separation process, the reten-

tion time and area under a given peak in the GC chromatogram can be related to a specific

compound with a given concentration that generated the peak under controlled GC condi-

tions [39]. Early attempts to use this relationship relied on manual methods to determine


the area of the peaks, as plotted using a strip chart recorder. One method used a planimeter

to trace the plot of the peak and determine the area. Another method involved cutting the

peak portion of the paper plot out and weighing to determine peak area [40]. With the

advent of the analog to digital converter (ADC), the chromatogram could be represented

as a one-dimensional time series of the magnitude of the GC detector at a fixed sampling

interval. This advancement led to the manipulation of the chromatographic data by mathe-

matical algorithms implemented on a computer. Initially the digital peak processing

required operator interaction to delineate the part of the signal to integrate using a cursor

overlay. Later advancements used ad hoc methods to delineate the peak and then calculate

the area with integration [24]. Some of these methods include techniques such as perpen-

dicular drop to baseline from peak valleys, triangle approximation based on maximum

height and half-height width, and tangent skim for baseline determination of peaks that are

superimposed on the tail of a larger peak [13].
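Two of the area estimates mentioned above can be compared on a synthetic peak. The sketch assumes a baseline-corrected, isolated Gaussian peak and takes the triangle approximation to be height times width at half height, which is a common textbook form and an assumption about the variant meant here; all names and values are hypothetical.

```python
import math


def trapezoid_area(times, values, start, end):
    """Integrate values over sample indices start..end with the trapezoidal rule."""
    area = 0.0
    for i in range(start, end):
        area += 0.5 * (values[i] + values[i + 1]) * (times[i + 1] - times[i])
    return area


def _crossing(times, values, i_low, i_high, level):
    """Time at which the trace crosses `level` between two adjacent samples."""
    frac = (level - values[i_low]) / (values[i_high] - values[i_low])
    return times[i_low] + frac * (times[i_high] - times[i_low])


def half_height_width(times, values):
    """Return (peak height, width at half height) for a single isolated peak."""
    apex = max(range(len(values)), key=lambda k: values[k])
    if values[apex] <= 0:
        raise ValueError("expected a positive, baseline-corrected peak")
    half = values[apex] / 2.0
    left = apex
    while left > 0 and values[left] > half:
        left -= 1
    right = apex
    while right < len(values) - 1 and values[right] > half:
        right += 1
    if values[left] > half or values[right] > half:
        raise ValueError("peak is truncated above half height")
    t_left = _crossing(times, values, left, left + 1, half)
    t_right = _crossing(times, values, right, right - 1, half)
    return values[apex], t_right - t_left


# Synthetic Gaussian peak: height 2, center 5 min, standard deviation 0.5 min.
times = [i * 0.01 for i in range(1001)]
values = [2.0 * math.exp(-0.5 * ((t - 5.0) / 0.5) ** 2) for t in times]

full_area = trapezoid_area(times, values, 0, len(times) - 1)
height, width = half_height_width(times, values)
triangle_area = height * width
```

For a Gaussian peak the height-times-width estimate recovers only about 94% of the integrated area, so it needs a shape-dependent correction. Both estimates also assume that the integration limits and baseline are already known, which is the difficulty illustrated in Fig. 1.5.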

A review of the perpendicular drop and tangent skim methods was presented by Papas

and Tougas which included an accuracy comparison of the two methods [68]. The exam-

ple section from a simulated chromatogram shown in Fig. 1.5 illustrates a scenario that

would lead to erroneous estimates of peak area using the standard peak area estimation

methods. In this time series the peaks overlap, so the points in time to start and stop inte-

gration are not distinct. In addition, the baseline appears to be varying either due to the

overlapping peaks or true change in output baseline. Foley presents some of the inherent

inaccuracies of these approximation methods [29].

Automation of the qualification and quantitation of complex analytes, including mix-

tures, has been difficult due to the complications associated with peak detection/delinea-


tion, baseline estimation and overlapped peaks. Significant activity in the literature has

focused on these problem areas and will be discussed below.

1.4.7 Qualification by retention time

A qualitative approach to the analysis and data interpretation of an unknown sample

can be taken if the presence or absence of an analyte is the primary concern. This approach

simplifies the needs listed in the previous sections for peak parameter estimates. The pri-

mary feature used in this approach is the retention time at which a peak or group of peaks

occurs. One potential use of this approach is for input into a results–fusion module which

would base the method of fusion on the results of a qualitative estimation of the unknown

sample contents.

Figure 1.5 Typical peak profile for a gas chromatogram. This signal shows three individual peaks with characteristics that make estimating the peak parameters difficult. [Plot: detector units (0.5 to 5.5) versus time (min., 0 to 10).]


Fundamental studies have been performed on the prediction of the retention times of individual peaks associated with each of the 209 possible isomers (congeners) of a PCB [70]. Another study used the retention times to classify the toxicity of the PCB sample based on the congeners present [56]. Felinger has presented several approaches to

the qualitative analysis of multicomponent samples including Fourier [25] and statistical

[27] based methods. In the first work the method was successful in determining the num-

ber of single components contributing to the multicomponent mixture. The second

method, which extends on the original work, gives examples of the ability to determine the

presence and number of complex compounds in real chromatography data. The next step

taken to complete the analytical and interpretation process is the quantification of the

unknown sample.
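Qualification by retention time can be sketched as a window match against a library of expected retention times. The library entries, tolerance, and function name below are hypothetical and serve only to illustrate the presence/absence decision; they do not come from the cited studies.

```python
def match_by_retention_time(peak_times, library, tolerance):
    """For each library compound, return the closest observed peak time
    within +/- tolerance of its expected retention time, or None."""
    report = {}
    for name, expected in library.items():
        candidates = [t for t in peak_times if abs(t - expected) <= tolerance]
        if candidates:
            report[name] = min(candidates, key=lambda t: abs(t - expected))
        else:
            report[name] = None
    return report


# Hypothetical library of expected retention times in minutes.
library = {"compound_A": 2.40, "compound_B": 5.10, "compound_C": 7.85}
observed_peaks = [2.43, 5.32, 7.80]
report = match_by_retention_time(observed_peaks, library, tolerance=0.10)
```

A compound is reported as present only if some peak falls inside its window, so the tolerance trades false negatives (retention-time drift) against false positives (a coeluting interference).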

1.4.8 Analytical methods

Analytical methods for the determination of analyte concentration range from simple

linear regression, using either single peak height or area, to multiple input nonlinear meth-

ods such as artificial neural networks. In this section the details of several methods will be

covered along with their advantages and potential limitations in the quantitative determi-

nation of samples with a potential mixture of complex analytes.

Straight linear regression on individual peaks is the most widely used method in commercially

available systems [13], [77]. In this approach a linear model is obtained which relates the

observed peak parameter (area or height) to the analyte’s concentration using a least


squares solution and a series of calibrated peak parameters [20]. The model is generally given as

\[ Y = X\beta + \varepsilon, \tag{1.1} \]

where Y is the n x 1 vector of observations, X is the n x 2 matrix of independent variables augmented with a column of ones in the first column (if the intercept term is desired), β is the 2 x 1 vector of parameters to be estimated, and ε is an n x 1 vector of zero mean, independent, random variables with normal distribution and variance $\sigma^2$. For simple linear regression the β vector represents the intercept and slope of a straight line (first-order linear regression). A least squares procedure is used to solve for estimates, b, of the unknown β given at least two pairs of observations. For the first-order linear regression case, the individual elements of b can be solved for directly using the relationships below.

\[ b_0 = \bar{Y} - b_1 \bar{X} \tag{1.2} \]

\[ b_1 = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} \tag{1.3} \]

$\bar{Y}$ is the mean of the vector Y, $\bar{X}$ is the mean of the vector X, and n is the total number of observations (length of vector X). The estimation of an unknown concentration of an analyte, $\hat{Y}$, based on a measured peak area generated by a sample, X, would be given by

\[ \hat{Y} = b_0 + b_1 X. \tag{1.4} \]

This model can be extended to include higher order terms in the matrix X such as a column of $X^2$ terms (this would be a second order model). In this case b would be a 3 x 1


vector and could be solved for using the following relation obtained by solving the normal equations

\[ b = (X'X)^{-1} X'Y \tag{1.5} \]

where X' is the transpose of matrix X and the superscript -1 indicates matrix inversion.

If additional independent (or predictor) variables are added to the above model the result is termed multiple linear regression. For this model the columns of X are now each one of the independent variables to be included in the model. The solution for the parameters of the model, b, is the same as given in Eqn. 1.5.

A recent focus has been on biased methods such as principal component regression (PCR) and partial least squares (PLS) [10],[42],[61]. These methods are considered biased because some of the original data are discarded prior to the least-squares estimate of the model parameters in the attempt to obtain a simpler model that is more sensitive to the component of the signal that contains relevant information. In PCR the data (i.e. independent variables) matrix, X (n x p), is decomposed into a mean term plus the product of a score matrix and a loading matrix plus a residual matrix, given by

\[ X = \mathbf{1}\,x_{mean} + TP + E \tag{1.6} \]

where $\mathbf{1}$ is an n x 1 vector of ones, $x_{mean}$ is a 1 x p vector of the column means of X, and E (n x p) is the residual matrix. The matrix T (n x f, f < p) is a lower dimensional orthogonal projection based on the eigenvectors of the covariance matrix derived from X, and P (f x p) is the linear transform back into the p-dimensional space of X. The goals of this process are to reduce the dimensionality and complexity of the multivariate data and to separate the signal from the noise [59].


The PLS formulation is very similar to PCR in that the eigen analysis is performed, but additional information from the observations is used to correlate the direction of the eigenvectors (principal components) with the observations (concentrations) [58]. Advantages of this method are that, if a weak signal component is present in the original data, the projections will be in a direction related to this signal component, and that the number of factors (f) can potentially be smaller than in PCR.

All of the methods listed above are strictly linear in nature and only make transforms

of the original independent–variable data space based on linear orthogonal projections

which maximize the total variance in the data in a reduced dimensional data space. These

methods work well under conditions where a true linear relationship exists between

independent variables and observations (e.g. peak areas and analyte concentration). In

many cases the response of sensors used in the measurement process is approximated

with linear functions over a limited dynamic range, but are inherently nonlinear. Another

weakness of these methods is that they have limited capability to decompose the contribu-

tions in the independent variables from complex mixtures with overlapping signal

response. Some efforts have been presented to overcome the linearity constraints includ-

ing nonlinear PCR and PLS and artificial neural networks.
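As a concrete reference point for the linear methods reviewed in this section, the normal-equation solution can be sketched for a second order calibration model (columns of ones, X, and X squared). The helper names and the synthetic data are hypothetical, and a small Gaussian elimination routine stands in for an explicit matrix inverse.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def solve(a, y):
    """Solve a * b = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    aug = [list(row) + [rhs] for row, rhs in zip(a, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * b[c] for c in range(r + 1, n))
        b[r] = (aug[r][n] - tail) / aug[r][r]
    return b


def least_squares(design, observations):
    """Estimate b from the normal equations (X'X) b = X'Y."""
    xt = transpose(design)
    xtx = matmul(xt, design)
    xty = [sum(row[k] * observations[k] for k in range(len(observations)))
           for row in xt]
    return solve(xtx, xty)


# Hypothetical noise-free data generated from y = 1 + 2x + 3x^2.
x_values = [0.0, 1.0, 2.0, 3.0, 4.0]
y_values = [1.0 + 2.0 * x + 3.0 * x * x for x in x_values]
design = [[1.0, x, x * x] for x in x_values]
b = least_squares(design, y_values)
```

The model is second order in X but still linear in the parameters b, so it is solved with the same machinery as the straight-line case. Adding further predictor columns to the design matrix gives multiple linear regression without any change to `least_squares`.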

A review of the techniques to modify the basic PCR and PLS to obtain the nonlinear

version is given by Sekulic et al. [74]. In general the T matrix of Eqn. 1.6 is augmented

with quadratic terms during the estimation phase. Another review by Gemperline, Long,

and Gregoriou also includes a nonlinear version of PCR called quadratic PCR (QPCR)

which is compared to other linear and nonlinear multivariate methods [34]. Both of these

reviews came to the conclusion that the nonlinear methods performed better in many


cases. The final nonlinear technique which has seen growing popularity is the artificial

neural network (ANN). Both of the reviews listed above included an ANN based model

which compared favorably to the nonlinear PCR and PLS and was much better than the

linear methods. Two different ANN strategies, backpropagation and counterpropagation,

were investigated by Majcen et al. in the multicomponent analysis of color differences in

paint [63]. Williams has investigated a multilayer perceptron architecture ANN using tanh

and radial basis function transfer functions [81]. Results showed that the ANN performed

much better than conventional first-order linear regression models and slightly better than

PCR on multicomponent PCB GC data. These techniques can effectively model the non-

linearities present in some data but they still lack the ability to directly model the separate

contribution of individual components in a composite mixture.

1.4.9 Modeling

A study of the theoretical response from a GC system based on the fundamental physical and chemical principles is given by Jaulmes et al. [51]. In this work the authors derive

a model for the elution peaks obtained in gas chromatography which accounts for many of

the typical variations seen in actual chromatograms. This work focused on the dynamics

of the GC system by looking at one isolated peak shape in the resulting chromatogram. No

attempt was made to model the response of multiple components.

A data handling system is presented by Gerth et al. which includes some chromato-

gram modeling along with raw data acquisition, file I/O, and data display [35]. This sys-

tem does not attempt to model multiple complex compounds or calibrate to known


standards. The individual peak modeling has some similarities to the research being pro-

posed here.

Ceipidor used a sum of individual peaks to model the spectra from X-ray photoelec-

tron spectroscopy [12]. This work used the Gauss-Newton method to minimize the least

squares function. The peaks were modeled with a weighted sum of Gaussian and Lorentz-

ian functions. The results for individual analytes were good and exhibited several advan-

tages over traditional approaches.
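A peak built from a weighted sum of Gaussian and Lorentzian shapes, and a signal built from a sum of such peaks, can be sketched as below. The parameterization (a shared center and half width at half maximum, with a mixing fraction) is an assumption made for illustration; the exact form used in [12] is not given here, and the peak values are hypothetical.

```python
import math


def mixed_peak(t, height, center, half_width, gauss_fraction):
    """Weighted sum of unit-height Gaussian and Lorentzian shapes.

    Both shapes share the same center and the same half width at half maximum.
    """
    u = (t - center) / half_width
    gauss = math.exp(-math.log(2.0) * u * u)   # equals 0.5 at |u| = 1
    lorentz = 1.0 / (1.0 + u * u)              # equals 0.5 at |u| = 1
    return height * (gauss_fraction * gauss + (1.0 - gauss_fraction) * lorentz)


def model_signal(t, peaks):
    """Sum of individual peaks; each peak is (height, center, half_width, gauss_fraction)."""
    return sum(mixed_peak(t, *p) for p in peaks)


# Two hypothetical overlapping peaks.
peaks = [(2.0, 5.0, 0.5, 0.7), (1.0, 6.0, 0.4, 0.3)]
at_first_center = mixed_peak(5.0, *peaks[0])
at_half_width = mixed_peak(5.5, *peaks[0])
total_at_5 = model_signal(5.0, peaks)
```

The Lorentzian term has much heavier tails than the Gaussian, so the mixing fraction mainly controls how far each peak contributes to its neighbors in the summed signal.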

1.4.10 Related Work Synopsis

Based on the prior research presented in the previous sections, the following conclu-

sions and future research directions can be observed.

• The research performed to date in the area of quantitative analysis of gas chromatogra-

phy data has focused on single compounds with primarily isolated, orthogonal peaks.

Much work has been done to estimate the attributes, e.g. area, height, of these single

peaks. The commercial offerings focus on estimating peak height directly from the

raw data points and area from summation of raw data values with a rough baseline

removed. These approaches continue to be plagued by two primary problems: locating the start and stop (in time) of the peak of interest and finding the baseline level of the signal at the location of the peak. Signal processing techniques such as filters (Kalman,

matched, zero-area) and deconvolution have been applied to make the task of isolating

individual peaks easier. Baseline estimation is currently based on either piece-wise lin-

ear segments located between valley points or low order basis function (polynomial,

sinusoidal) fit to the same valley points.


• Some research has been done in the area of modeling the peak function with the focus

primarily on single peaks. Several modeling functions have been investigated, but the

EMG function seems to have the best fundamental support based on the underlying

chemistry. However, additional research work needs to be performed to effectively uti-

lize this function due to the sensitivity of several of the model parameters.

• The analytical methods discussed in the research centered on straight linear calibration

on single variables (e.g. single peak area) or multiple variables (e.g. multiple peak

areas). These approaches work well for single analytes but cannot represent multiple–analyte samples. Alternative methods for multiple–input/multiple–analyte cases were

based on principal component analysis as a preprocessing step to obtain variables

which could be used in linear regression methods. Since these methods are also linear

transformations of variables, nonlinear relationships cannot be represented in this

framework. This work will focus on extending the research in this area by using a

summation of fundamental peaks to model an analyte and then combining multiple analyte models to represent the observed raw signal generated by the gas chromatography

system.

• Very little research has been performed on the specific topic of analytical method

results fusion. Fusion of other types of data is reported heavily in the literature, especially in the areas of robotics and computer vision. A specific method of fusion, fuzzy

logic, will be investigated in this work to determine its efficacy for the results–fusion

component.
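The EMG function mentioned in the synopsis above can be written compactly. The sketch uses the standard closed form (a Gaussian convolved with a one-sided exponential decay) scaled by peak area; the parameter names and numerical values are assumptions for illustration and need not match the notation adopted later in this work.

```python
import math


def emg(t, area, center, sigma, tau):
    """Exponentially modified Gaussian.

    A Gaussian (center, sigma) convolved with an exponential decay of
    time constant tau, scaled so that the total area equals `area`.
    """
    z = (center + sigma * sigma / tau - t) / (math.sqrt(2.0) * sigma)
    exponent = sigma * sigma / (2.0 * tau * tau) - (t - center) / tau
    return area / (2.0 * tau) * math.exp(exponent) * math.erfc(z)


# Hypothetical tailing peak: area 3, Gaussian center 5 min, sigma 0.2, tau 0.5.
dt = 0.005
ts = [i * dt for i in range(4001)]              # 0 to 20 min
ys = [emg(t, 3.0, 5.0, 0.2, 0.5) for t in ts]

area_check = sum(ys) * dt                       # numerical area
mean_check = sum(t * y for t, y in zip(ts, ys)) * dt / area_check
```

The numerical area returns the `area` parameter, and the first moment sits at `center + tau` rather than at `center`, which is the tailing asymmetry a symmetric Gaussian cannot represent. The exponential factor grows quickly when tau is small relative to sigma, which is one practical source of the parameter sensitivity noted above.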


1.5 Narrative Organization

The narrative of this work in comprehensive gas chromatography analysis will follow

five primary thrusts and will combine related chromatography research being performed

concurrently with this research. The primary goal of the research will be to develop a

method for the accurate and precise estimation of analyte concentration in complex envi-

ronmental analysis samples using a combination of mathematical and signal processing

techniques. Documentation will be presented in the following chapters: gas chromatogra-

phy theory, peak modeling, mixture concentration modeling, nonlinear least squares opti-

mization, results fusion, and system integration. Each of these chapters will be briefly

discussed in the following sections in the larger context of the overall research goals. In

addition, a chapter presenting experimental results of the system applied to simulated and

real data follows the theoretical chapters. The final chapter presents conclusions drawn

from the theoretical and experimental sections.

1.5.1 Gas Chromatography Theory

In the analysis of potential environmental contamination, gas chromatography is an

indispensable tool. Any comprehensive quantitative interpretation of the data generated by

this type of system should include some fundamental understanding of the chemical and

physical process. Much research has been done in this area and it will be utilized as refer-

ence for any subsequent data and signal processing.

1.5.2 Peak Modeling

A single peak representing the presence of some chemical compound is the expected

result from the first principles of the GC separation process. Using this knowledge the


observed chromatogram time signal can be decomposed into a summation of many indi-

vidual peaks that carry important qualitative and quantitative information about the con-

stituents of the sample. Many methods have been developed to estimate these peak

parameters, but most suffer from the reliance on empirical baseline estimation and end-

point delineation techniques. The chapter will focus on an integrated approach to simulta-

neously estimating both the key peak parameters and the underlying baseline. This

approach will be applied across the chromatogram in the calibration phase to accurately

characterize the response of the GC system for inputs with known analyte concentrations.

1.5.3 Concentration mixture model

A concentration model will be developed using the results of the peak modeling proce-

dure applied to a set of calibration standards. This model will relate the concentration of

the analyte to the peak parameters over the entire chromatogram. In complex mixtures the

analytes of interest generate response peaks across the entire time interval of the chro-

matographic process. The general magnitude of the response of different analytes follows

a characteristic distribution, but in an unknown the observed response could be any combi-

nation of analytes of interest, interferences, and the nominal baseline. The observed

response will be reconstructed from a combination of the response functions of each ana-

lyte of interest by using a linear (but not necessarily first order) regression of the concen-

tration model to the unknown sample’s observed response chromatogram. The resulting

regression will yield a unique solution because each analyte’s model adjusts the peak

parameters in unison based on concentration thereby retaining each analyte’s characteris-


tic distribution. The output of this process will be the concentrations and confidence inter-

vals of the analytes of interest.
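The idea of reconstructing an observed response from per-analyte response functions can be illustrated with a deliberately simplified sketch. It assumes that each analyte's multi-peak pattern scales in direct proportion to concentration (a first order special case of the model described above), that there is no baseline or interference, and that only two analytes are present; the peak patterns and names are hypothetical and confidence intervals are not computed.

```python
import math


def gaussian(t, height, center, sigma):
    return height * math.exp(-0.5 * ((t - center) / sigma) ** 2)


def analyte_response(t, pattern):
    """Unit-concentration response: a sum of peaks with fixed relative heights."""
    return sum(gaussian(t, h, c, s) for h, c, s in pattern)


def estimate_concentrations(times, observed, pattern_a, pattern_b):
    """Least-squares fit of observed = ca * response_a + cb * response_b."""
    ra = [analyte_response(t, pattern_a) for t in times]
    rb = [analyte_response(t, pattern_b) for t in times]
    saa = sum(x * x for x in ra)
    sbb = sum(x * x for x in rb)
    sab = sum(x * y for x, y in zip(ra, rb))
    say = sum(x * y for x, y in zip(ra, observed))
    sby = sum(x * y for x, y in zip(rb, observed))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("analyte patterns are not distinguishable")
    ca = (say * sbb - sab * sby) / det    # Cramer's rule on the 2 x 2 normal equations
    cb = (saa * sby - sab * say) / det
    return ca, cb


# Hypothetical patterns (height, center, sigma) with overlapping peaks.
pattern_a = [(1.0, 2.0, 0.3), (0.6, 4.0, 0.3), (0.3, 6.5, 0.3)]
pattern_b = [(0.8, 2.3, 0.3), (1.0, 5.0, 0.3), (0.5, 6.8, 0.3)]

times = [0.05 * i for i in range(201)]
observed = [2.0 * analyte_response(t, pattern_a) + 0.5 * analyte_response(t, pattern_b)
            for t in times]
conc_a, conc_b = estimate_concentrations(times, observed, pattern_a, pattern_b)
```

Although individual peaks of the two analytes overlap, the concentrations are recovered because every peak of an analyte is tied to a single concentration parameter, which is the mechanism the text relies on for a unique solution.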

1.5.4 Nonlinear least squares optimization

The peak modeling described in the previous section will be cast into a nonlinear least

squares optimization framework by expressing the optimization function as the square of

the difference between the model function and the observed chromatogram data. This

problem is a nonlinear one due to the form of the peak model which includes several expo-

nential terms with peak parameters. The optimization process will begin with determining

initial estimates of all the peak model parameters using a combination of traditional

signal processing techniques. This will be followed by an efficient and robust method for

solving for the unknown peak and baseline parameters. Constraints on the range of peak

parameter values will be used to prevent the minimization procedure from generating

infeasible physical values. The estimated peak parameters will be evaluated and, if

needed, the initial estimates will be refined and the optimization repeated. Stopping crite-

ria will be developed for the termination of this refinement.
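The optimization loop outlined above (initial estimates, iterative solution, range constraints, stopping criterion) can be sketched for a single peak. A damped Gauss-Newton iteration and a plain Gaussian peak are used here as stand-ins, since the specific solver and peak model are developed in later chapters; the function names, bounds, and data are hypothetical.

```python
import math


def gaussian(t, h, c, s):
    return h * math.exp(-0.5 * ((t - c) / s) ** 2)


def sse(times, data, p):
    """Sum of squared differences between the data and the model."""
    return sum((y - gaussian(t, *p)) ** 2 for t, y in zip(times, data))


def solve_small(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def clamp(p, lower, upper):
    """Keep every parameter inside its physically feasible range."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(p, lower, upper)]


def fit_peak(times, data, start, lower, upper, max_iter=50, tol=1e-12):
    p = clamp(start, lower, upper)
    current = sse(times, data, p)
    for _ in range(max_iter):
        h, c, s = p
        jtj = [[0.0] * 3 for _ in range(3)]
        jtr = [0.0] * 3
        for t, y in zip(times, data):
            e = math.exp(-0.5 * ((t - c) / s) ** 2)
            # Partial derivatives of the model with respect to h, c, s.
            row = [e, h * e * (t - c) / s ** 2, h * e * (t - c) ** 2 / s ** 3]
            r = y - h * e
            for i in range(3):
                jtr[i] += row[i] * r
                for j in range(3):
                    jtj[i][j] += row[i] * row[j]
        step = solve_small(jtj, jtr)
        scale = 1.0
        while scale > 1e-6:                # halve the step until the fit improves
            trial = clamp([v + scale * d for v, d in zip(p, step)], lower, upper)
            trial_sse = sse(times, data, trial)
            if trial_sse < current:
                break
            scale *= 0.5
        else:
            return p                       # no improving step was found
        improvement = current - trial_sse
        p, current = trial, trial_sse
        if improvement < tol:              # stopping criterion
            break
    return p


# Synthetic noise-free peak and rough initial estimates (about 10% off).
times = [0.1 * i for i in range(101)]
data = [gaussian(t, 2.0, 5.0, 1.0) for t in times]
fitted = fit_peak(times, data, start=[1.8, 5.2, 1.1],
                  lower=[0.0, 0.0, 0.05], upper=[10.0, 10.0, 5.0])
```

The clamping step is a simple projection onto the allowed ranges and is only one of several ways to impose the constraints mentioned in the text. Only rough starting values are required, although a poor starting point can still lead any local method of this kind to a different minimum.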

1.5.5 Results fusion

The final phase of the research investigates methods to fuse the results of several ana-

lytical methods with the objective of obtaining a higher degree of accuracy and precision

than would be available from any individual method. Several methods have been devel-

oped under the DIM functional area and each one may perform better or worse under cer-

tain scenarios. The fusion task determines the optimal combination of results from each

individual method based on the confidence measure reported by each method.
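A simple baseline for confidence-based fusion is an inverse-variance weighted mean. This is not the fuzzy logic approach investigated in this work. It is a conventional reference point that assumes each method reports an unbiased estimate with a known, independent standard deviation, and the names and numbers are hypothetical.

```python
def fuse_estimates(estimates):
    """Inverse-variance weighted mean of (value, standard_deviation) pairs."""
    if not estimates:
        raise ValueError("need at least one estimate")
    weights = [1.0 / (sd * sd) for _, sd in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    fused_sd = (1.0 / total) ** 0.5
    return value, fused_sd


# Hypothetical concentration estimates from two analytical methods.
equal_confidence = fuse_estimates([(10.0, 1.0), (12.0, 1.0)])
unequal_confidence = fuse_estimates([(10.0, 1.0), (20.0, 3.0)])
```

The fused value is pulled toward the more confident method, and under the stated assumptions the fused standard deviation is never larger than the smallest input. Those assumptions are strong, which is part of the motivation for considering other fusion schemes.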

1.5.6 System integration

The motivation of this work is to develop a complete system which automates the

entire chemical analysis process – including the data analysis component. The CAA pro-

gram has implemented a hardware and software system that enables functional modules to

be linked together to perform desired sample analysis tasks. A DIM has been designed and

implemented on a UNIX workstation as part of this system. This chapter describes the

software that links the various processing modules and the master task sequence controller

(TSC).

1.5.7 Results and conclusions

These final two chapters present the experimental results obtained by applying the

algorithms and concepts detailed in the preceding chapters. Results are given for simulated

and actual data over a range of initial conditions. The results are very favorable and

indicate an increase in performance over approaches considered. Finally, overall conclu-

sions are drawn and significant contributions to the field are defined.

CHAPTER 2

GAS CHROMATOGRAPHY

Gas chromatography (GC) is a fundamental separation technology that was originally

developed by Martin and Synge [65]; the specific gas-liquid method was developed later by James

and Martin [47]. The primary operation in chromatography is the separation of substances

on the basis of their differential migration velocities in a biphasic system [41]. The follow-

ing sections will explain the principal components of a GC system.

2.1 Separation

Separation in a GC system results from the interaction between a mobile phase and a

stationary phase based on the rate constant of the kinetics of mass transfer between the

two phases. The specific type of separation used in GC is elution. In elution the mobile gas

phase, which carries the sample, is swept past a nonvolatile liquid stationary phase coated

on the inner wall of a long narrow tube or column. The schematic in Fig. 2.1 shows a cross

section of this configuration of mobile and stationary phases. The rate of migration of the

compound through the column is controlled by the differing equilibrium constants of each

compound with respect to the stationary phase. The result is that a sample composed of

different compounds is injected into the column and the compounds exit the column at dif-

ferent times. The separation properties of the components in a mixture are constant under

constant conditions, and therefore once determined they can be used to identify and quan-

tify each of the components.

Figure 2.1 Schematic of stationary and mobile phases in GC. Compound A will travel slower through the column than B because of the differing equilibrium constants.

The retention time of a specific compound, $t_R$, can be determined in a first order sense

using the following relationship

$$ t_R = (1 + k')\,\frac{L}{u} \tag{2.1} $$

with

$$ k' = \frac{R T \rho}{\gamma P^0 M}\,\frac{V_L}{V_G}, \tag{2.2} $$

where $k'$ is the retention factor (or column capacity factor), $L$ is the column length, $u$ the

mobile phase velocity, $R$ the universal gas constant, $T$ the absolute column temperature, $\rho$

the density of the stationary phase and $M$ its molecular weight, $\gamma$ the activity coefficient of

the compound in solution in the stationary phase at infinite dilution, $P^0$ its vapor pressure

at the column temperature, $V_L$ the volume of the liquid phase in the column, and $V_G$ the

volume available to the gas phase [41]. The equation for the actual detector response can

be approximated by solving the differential equations governing the mass balance. Jaulmes

et al. give a solution as

$$ C(t) = \frac{2}{\lambda U}\sqrt{\frac{D'}{\pi t}}\;\frac{\exp\left[-\dfrac{(t_R - t)^2\,t}{2\sigma_t^2}\right]}{\coth\dfrac{\mu}{2} + \operatorname{erf}\left[\dfrac{(t_R - t)\sqrt{t}}{\sqrt{2}\,\sigma_t}\right]}. \tag{2.3} $$

The variables in this equation are rather complex terms related to the kinetics of adsorption-desorption; their definitions can be found in [51]. The important point to notice is the form of the equation and the various nonlinear terms such as the exponential, hyperbolic cotangent, and error function. The relationship between the peak model developed in a subsequent section and Eqn. 2.3 is supported by the theoretical relationship derived from first principles.

2.2 Columns

Gas chromatography columns are of two designs: packed or capillary. Packed columns are typically a glass or stainless steel coil (ranging from 1-5 m total length and 5 mm inner diameter) that is filled with the stationary phase. Capillary columns are a thin fused-silica (purified silicate glass) capillary (typically 10-100 m in length and 25 - 5.0µm inner diameter) that has the stationary phase coated on the inner surface. Capillary columns provide much higher separation efficiency than packed columns but are more easily overloaded by too much sample.

2.3 Instruments

Mobile phases are generally inert gases such as helium, argon, or nitrogen. The injec-

tion port consists of a rubber septum through which a syringe needle is inserted to inject

the sample. The injection port is maintained at a higher temperature than the boiling point

of the least volatile component in the sample mixture. Since the partitioning behavior is

dependent on temperature, the separation column is usually contained in a thermostat-con-

trolled oven. Separating components with a wide range of boiling points is accomplished

by starting at a low oven temperature and increasing the temperature over time to elute the

high-boiling point components. Most columns contain a liquid stationary phase on a solid

support. Separation of low-molecular weight gases is accomplished with a solid adsorbent.

A schematic of the entire GC instrument is shown in Fig. 2.2.

Figure 2.2 Schematic of a GC instrument. Components include a. carrier gas, b. gas valve, c. flowrate controller, d. manometer, e. injection port at temperature θ1, f. column at temperature θ2(t), g. electron capture detector at temperature θ3, and h. the electronic signal from the detector, which is passed to the computer data acquisition system.

2.4 Detectors

The sensor of the GC instrument is the detector which transforms changes in the eluant

composition into a voltage or current that can be measured. Detectors have different prop-

erties based on the sensing method employed and can be optimized for the analyte under

investigation. A primary property is the selectivity of the detector - nonselective detectors

respond equally to all compounds and selective detectors respond only to some com-

pounds or class of compounds. Detectors also introduce several undesirable effects into

the GC process including both low and high frequency noise, peak–altering lowpass filter-

ing, and nonlinear response outside a specified dynamic range.

There are two classes of detectors in GC systems based on their response functions.

The first class has a response which is proportional to the concentration of the eluate in the

carrier gas and includes the thermal conductivity, gas density, and electron capture detectors

(ECD). A second class has a response function which is proportional to the mass flow

rate of the eluate to the detector. The flame ionization, flame photometric, and thermionic

detectors and the mass spectrometer are all in this second class. The ECD is used

in many environmental analyses and is discussed below.

The ECD uses a radioactive beta emitter (a source of electrons) to ionize some of the carrier gas

and produce a current between a biased pair of electrodes. When organic molecules that

contain electronegative functional groups, such as halogens, phosphorus, and nitro

groups, pass by the detector, they capture some of the electrons and reduce the current

measured between the electrodes. The current between the electrodes is kept constant by

pulsing the detector with short pulses at a variable frequency. The pulse period is a mea-

sure of the concentration of the detected species in the column effluent. A typical plot of a

gas chromatogram for a PCB analysis is shown in Fig. 2.3. The plot in Fig. 2.4 shows an

enlarged view of the time interval between 15 and 23 minutes of the same chromatogram

shown in Fig. 2.3.

Figure 2.3 Typical chromatogram of a PCB sample. Axes: time (min.) versus detector counts.

Figure 2.4 Enlarged section of the chromatogram in Fig. 2.3. Higher magnification shows greater detail, including the individual peaks representing specific compounds. Axes: time (min.) versus detector counts.

CHAPTER 3

PEAK MODEL

The fundamental element of any chromatogram is ideally a peak in the observed detec-

tor time series signal. This peak is indicative of the presence of a compound in the carrier

gas that is being swept past the detector as described in Sec. 2.1. In the ideal case the

shape of this peak would be an impulse whose amplitude would be linearly related to the

compound’s concentration in the injected volume and time would be directly correlated to

the specific compound. In reality, the general shape of this peak, based on a first order

approximation to the theoretical chemical and thermodynamical properties of the GC sys-

tem, is Gaussian. Several factors contribute to the distortion of the ideal peak shape

including diffusion rates, dead volumes, flow anomalies, and finite response times of

detectors, amplifiers, and conversion electronics. The following sections will present the

details of an empirical peak shape which has been justified theoretically [37],[55] and has

been shown to model the measured peaks very well [31].

3.1 Derivation of Exponentially Modified Gaussian

The primary attribute that distinguishes the typical observed chromatogram peak from

the nominal Gaussian shape is the asymmetrical skewing, or “tailing” of the peak. This

tailing can be likened to the effects of a lowpass filtering operation caused by an electronic

RC circuit [76]. A typical chromatographic peak is shown in Fig. 3.1 along with a least–

squares–fitted Gaussian peak for comparison.

Figure 3.1 Typical chromatogram peak and Gaussian peak shape. The Gaussian peak represents a best fit to the observed signal. Axes: time (min.) versus counts; legend: observed data, Gaussian peak.

A better analytical function for representing the chromatographic peaks is the exponentially modified Gaussian (EMG) function. This function is the convolution of a regular Gaussian function with an exponential decay function of unit area. The exponential decay can be thought of as the impulse response (or system transfer function) of the chromatographic system including the injection, column, detector, and conversion electronics. A mathematical representation of the EMG function, $h_{EMG}(t)$, can be obtained using the traditional convolution equation

$$ h_{EMG}(t) = \int_0^t G(t')\,H(t - t')\,dt', \tag{3.1} $$

where $t'$ is a dummy variable of integration, $G(t)$ is the Gaussian function, and $H(t)$ is the exponential decay term. The Gaussian function is given by

$$ G(t) = \frac{A}{\sigma\sqrt{2\pi}}\exp\left[-\left(\frac{t - t_g}{\sqrt{2}\,\sigma}\right)^2\right], \tag{3.2} $$

where $A$ is the area of the Gaussian peak, $\sigma$ is the standard deviation of the Gaussian, and $t_g$ is the center of the peak. The exponential decay term is given by

$$ H(t) = \begin{cases} \dfrac{1}{\tau}\exp\left(-\dfrac{t}{\tau}\right), & t \ge 0 \\ 0, & t < 0, \end{cases} \tag{3.3} $$

where $\tau$ is the time constant of the exponential decay. The solution of Eqn. 3.1 is given by

$$ h_{EMG}(t) = \frac{A}{\tau}\exp\left[\frac{1}{2}\left(\frac{\sigma}{\tau}\right)^2 - \frac{t - t_g}{\tau}\right]\int_{-\infty}^{z}\frac{\exp\left(-y^2/2\right)}{\sqrt{2\pi}}\,dy, \tag{3.4} $$

where $z$, the upper limit on the integral term, is $z = (t - t_g)/\sigma - \sigma/\tau$. Numerical approximations, such as the erf function or series expansions [5], of the indefinite integral term in Eqn. 3.4 can be used to actually compute the value of the EMG function at a given time, $t$. The relationship in Eqn. 3.4 can be expressed in terms of an erf function as

$$ h_{EMG}(t) = \frac{A}{2\tau}\exp\left[\frac{1}{2}\left(\frac{\sigma}{\tau}\right)^2 - \frac{t - t_g}{\tau}\right]\left[1 + \operatorname{erf}\left(\frac{z}{\sqrt{2}}\right)\right], \tag{3.5} $$

with the erf defined as

$$ \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x \exp\left(-y^2\right)dy. \tag{3.6} $$

The plot in Fig. 3.2 shows a standard Gaussian function, an exponential decay, and the result of the convolution of these two functions. Using the EMG function and the actual chromatogram data shown in Fig. 3.1, a least squares fit results in the plot shown in Fig. 3.3.

Figure 3.2 Convolution of Gaussian peak function with exponential decay function. The resulting signal is an exponentially modified Gaussian (EMG) peak function. Axes: time (min.) versus counts; curves: EMG function, Gaussian function, exponential decay.

Figure 3.3 Typical chromatogram peak and EMG peak. The EMG peak represents a best fit to the observed signal. Axes: time (min.) versus counts; legend: observed data, EMG peak.

The EMG function given in Eqn. 3.5 has several attributes which make its practical use in an optimization procedure problematic. Peak parameters which appear in the denominator terms of the equation cannot smoothly approach a value of zero without causing the numerical evaluation to become indeterminate. This can be easily seen in the first term, which has the peak skew parameter, $\tau$, in the denominator. One alternative calculation method is to generate an infinite impulse response (IIR) filter based on the impulse response of the exponential decay and then apply this filter to a Gaussian peak. The IIR filter is implemented using the difference equation

$$ y(n) = y(n-1) + \frac{x(n) - y(n-1)}{C}, \tag{3.7} $$

where $y(n)$ is the resulting EMG approximation, $x(n)$ is the input Gaussian peak function, and $C$ is given by

$$ C = \frac{1}{1 - e^{-\Delta t/\tau}}, \tag{3.8} $$

with $\Delta t$ defined as the sampling period for the data acquisition system. Collecting terms, Eqn. 3.7 can be rewritten as

$$ y(n) = \left(1 - \frac{1}{C}\right)y(n-1) + \frac{1}{C}\,x(n). \tag{3.9} $$

3.2 Parameter estimates

Several techniques have been presented in the literature to obtain estimates of the four parameters of the EMG function from an observed chromatogram signal [52],[83]. These estimates are based on measuring (either manually on a printed or displayed chromatogram plot or automatically with a search algorithm on the electronic data) the observable parameters such as peak height, retention time at peak, width of peak at a given fraction of peak height (each side of the peak individually), and retention times at the width measurement points. These measurements are then related to the EMG function parameters in a sequential process of first determining estimates of $A$ and $\sigma$, then $\tau$. A potential drawback of these methods is that they rely on a simple constant background level and well isolated peaks to determine the peak height and width.
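The two ways of evaluating the EMG peak described in Sec. 3.1 can be compared numerically. The following sketch, which assumes uniformly sampled data and hypothetical parameter values, evaluates the closed form of Eqn. 3.5 and the IIR recursion of Eqn. 3.9 applied to a sampled Gaussian. The function names are illustrative and do not come from the original software.

```python
import math

def gaussian(t, area, sigma, tg):
    """Base Gaussian of Eqn. 3.2."""
    return area / (sigma * math.sqrt(2.0 * math.pi)) * math.exp(
        -((t - tg) / (math.sqrt(2.0) * sigma)) ** 2)

def emg(t, area, sigma, tg, tau):
    """Closed-form EMG of Eqn. 3.5 (requires tau > 0)."""
    z = (t - tg) / sigma - sigma / tau
    return (area / (2.0 * tau)
            * math.exp(0.5 * (sigma / tau) ** 2 - (t - tg) / tau)
            * (1.0 + math.erf(z / math.sqrt(2.0))))

def emg_iir(samples, dt, tau):
    """Apply the recursion of Eqn. 3.9 to a sampled Gaussian."""
    inv_c = 1.0 - math.exp(-dt / tau)   # 1/C from Eqn. 3.8
    out, prev = [], 0.0
    for x in samples:
        prev = (1.0 - inv_c) * prev + inv_c * x
        out.append(prev)
    return out

# Hypothetical sampling period (min.) and peak parameters.
dt = 0.001
times = [1.5 + i * dt for i in range(2001)]
area, sigma, tg, tau = 1.0, 0.03, 2.0, 0.1
closed = [emg(t, area, sigma, tg, tau) for t in times]
recursive = emg_iir([gaussian(t, area, sigma, tg) for t in times], dt, tau)
```

The recursion never divides by a quantity that vanishes with the peak parameters other than through `dt / tau`, which is the practical motivation for the filter form; its accuracy depends on the sampling period being small relative to the time constant.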

The approach taken in this work will be to first estimate the baseline with a general polynomial function, then identify candidate peak centers using an estimate of the second derivative, and finally calculate individual peak parameter estimates. In this section the relationships between the easily measurable properties of the peak and the terms of Eqn. 3.3 will be given. A more complete description of the processing steps to generate the initial peak location will be given in Sec. 5.4.

The results of the derivative analysis generate the probable peak locations (retention time of peak maximum negative curvature) for the entire chromatogram. A single peak with the peak maximum, $h_p$, and the corresponding retention time, $t_{hp}$, labeled is shown in Fig. 3.4. There are four parameters that must be specified for each peak: $t_g$, $A$, $\sigma$, and $\tau$. Two methods have been identified to obtain the necessary parameter estimates for each peak.

One method to determine the initial estimates is to select several points on the actual signal based on their amplitude as compared to a fraction of the peak maximum, e.g. 50% of the peak maximum. The retention time and magnitude of these points are used in an empirical relationship to determine $\sigma$ and $M_2$ (second moment) first; then an analytical relationship is used to determine $\tau$ and $t_g$ [52],[29]. Finally, the area can be obtained using an empirical relationship to the peak height, asymmetry factor, and width [30].

The first step in this process is to determine the peak height, $h_p$, that corresponds to the peak retention time, $t_{hp}$, found with the derivative analysis. The peak asymmetry, $b/a$, is then obtained as the ratio of the time between the peak maximum and the time the signal value reaches a fraction of the peak height prior to or after the peak. These values are annotated on a typical EMG peak shown in Fig. 3.4 for 75% and 25% of the peak height.

Figure 3.4 Typical EMG peak shown with the measured parameters. The parameters are needed for estimation of the peak model parameters. Peak height, $h_p$, at retention time $t_{hp}$; peak width at 75% and 25% of the peak height, $W_1 = a_1 + b_1$ and $W_2 = a_2 + b_2$ respectively; and peak asymmetry, $a_1/b_1$ and $a_2/b_2$. Axes: time (min.) versus counts.

Using these measured values the standard deviation of the base Gaussian is given by

$$ \sigma_G = \frac{W_r}{f\left(\frac{b}{a}\right)_r}, \tag{3.10} $$

where $W_r$ is the total measured width at the peak height fraction, $r$, i.e. $W_r = a_r + b_r$, and $f\left(\frac{b}{a}\right)_r$ is an empirically derived quadratic function. The exponential decay, $\tau$, is calculated using

$$ \tau = \sqrt{M_2 - \sigma_G^2}, \tag{3.11} $$

where

$$ M_2 = W_r^2\, f\left(\frac{b}{a}\right)_r. \tag{3.12} $$

Finally, the area of the base Gaussian function is given by

$$ A = 1.64\, h_p\, W_{0.75} \left(\frac{b}{a}\right)^{0.717}. \tag{3.13} $$

Another approach for determining the EMG model parameters is based on measuring the peak retention time, peak height, area, and first moment [83],[84]. In this approach an iterative scheme is used in combination with a set of derived relationships between the measured peak properties and the model parameters. The main drawback of this method is that it will only work on completely resolved peaks.
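As an illustration of the width-based area estimate in Eqn. 3.13, the sketch below measures the leading and trailing half-widths of an isolated, baseline-free peak at 75% of its height and applies the empirical relationship. It assumes that $a$ is the leading half-width and $b$ the trailing half-width, which the text does not state explicitly; the helper names and the test peak are hypothetical.

```python
import math

def crossing(times, signal, start, step, level):
    """Walk from index `start` in direction `step` until the signal drops
    below `level`; return the linearly interpolated crossing time."""
    i = start
    while signal[i + step] >= level:
        i += step
    t0, t1 = times[i], times[i + step]
    y0, y1 = signal[i], signal[i + step]
    return t0 + (level - y0) * (t1 - t0) / (y1 - y0)

def area_estimate(times, signal):
    """Estimate peak area with the form of Eqn. 3.13 (widths at 75% height)."""
    peak = max(range(len(signal)), key=signal.__getitem__)
    hp = signal[peak]
    level = 0.75 * hp
    a = times[peak] - crossing(times, signal, peak, -1, level)  # leading side
    b = crossing(times, signal, peak, +1, level) - times[peak]  # trailing side
    return 1.64 * hp * (a + b) * (b / a) ** 0.717

# Hypothetical isolated Gaussian peak of unit area on a flat zero baseline.
sigma, center = 0.05, 0.5
times = [i * 0.0005 for i in range(2001)]
signal = [math.exp(-((t - center) ** 2) / (2 * sigma ** 2))
          / (sigma * math.sqrt(2 * math.pi)) for t in times]
estimate = area_estimate(times, signal)
```

For this symmetric test peak the estimate lands within about one percent of the true unit area. As noted above, the approach depends on a flat background and a well isolated peak, so the walk to the 75% level would fail on overlapped peaks.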

3.3 Support of peak model

The support of both the Gaussian and EMG peaks is infinite based on the definition of each of these functions in the previous sections. In practice the amplitude of these functions is negligible a finite distance from the center value. In this section the practical extent of the peak functions will be determined for use in the actual computation of the model values. The time extent over which the peak function is calculated is derived based on either the percentage of the theoretical total area or a percentage of the peak maximum magnitude.

The base Gaussian peak function has a relationship between the total area under the curve and the number of standard deviations from the mean. An analytical expression relating these parameters is given by

$$ \Delta = \frac{2}{\sqrt{2}}\operatorname{erf}^{-1}(\varphi), \tag{3.14} $$

where $\Delta$ is the number of standard deviations away from the mean, $\Delta = (t - t_g)/\sigma$, and $\varphi$ is the fraction of the total peak area. For the EMG peak defined in Eqn. 3.5, a fraction of the maximum peak magnitude is used due to the erf function in the definition. The EMG definition contains three parts: a constant term, an exponential term, and a term based on the erf function, with only the exponential and erf terms dependent on the time variable. If each of these terms is considered separately, one sees that the erf term controls the magnitude of the leading tail of the peak and the exponential term controls the trailing tail. By setting each of these terms equal to a fraction of the maximum peak magnitude the following relationships are obtained,

$$ t_l = t_g + \frac{\sigma^2}{\tau} - \sqrt{2}\,\sigma\operatorname{erf}^{-1}(1 - 2\Theta) \tag{3.15} $$

and

$$ t_u = t_g + \frac{\sigma^2}{\tau} - \tau\log(2\Theta), \tag{3.16} $$

where $t_l$ is the lower time limit, $t_u$ is the upper time limit, and $\Theta$ is the fraction of the maximum peak magnitude. Using these relationships the numerical computation overhead associated with evaluating the functions can be significantly reduced. This individual peak reduction significantly decreases the total time required for the iterative least squares process.

3.4 Derivatives of peak model

The primary technique underlying many of the iterative least squares algorithms is the use of the partial derivatives to determine the change in the free variables that decreases the error between the model and the true values. This technique is further described in Chap. 5. This section details the derivation of the required partial derivatives for both the base Gaussian and EMG functions.

We begin with the base Gaussian as defined in Eqn. 3.2 and take the partial derivatives with respect to each of the three parameters, $A$, $\sigma$, and $t_g$. The partial w.r.t. area is given by

$$ \frac{\partial}{\partial A}G(t) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\left(\frac{t - t_g}{\sqrt{2}\,\sigma}\right)^2\right], \tag{3.17} $$

the partial w.r.t. $\sigma$ is given by

$$ \frac{\partial}{\partial \sigma}G(t) = \frac{-A}{\sigma^2\sqrt{2\pi}}\exp\left[-\left(\frac{t - t_g}{\sqrt{2}\,\sigma}\right)^2\right] + \frac{A}{\sigma^4\sqrt{2\pi}}(t - t_g)^2\exp\left[-\left(\frac{t - t_g}{\sqrt{2}\,\sigma}\right)^2\right], \tag{3.18} $$

and the partial w.r.t. the center time, $t_g$, is given by

$$ \frac{\partial}{\partial t_g}G(t) = \frac{A}{\sigma^3\sqrt{2\pi}}(t - t_g)\exp\left[-\left(\frac{t - t_g}{\sqrt{2}\,\sigma}\right)^2\right]. \tag{3.19} $$

In a similar manner the partial derivatives of the EMG function defined in Eqn. 3.5 can be derived. The erf function adds a level of complexity to the derivations and thus the intermediate steps are shown in Appendix A. The resulting partials are given by

$$ \frac{\partial}{\partial A}h_{EMG}(t) = \frac{1}{2\tau}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\exp\left(\frac{\sigma^2}{2\tau^2} - \frac{t - t_g}{\tau}\right), \tag{3.20} $$

$$ \frac{\partial}{\partial \sigma}h_{EMG}(t) = \frac{-A}{\tau\sqrt{\pi}}\exp\left[-\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)^2\right]\left(\frac{1}{\sqrt{2}\,\tau} + \frac{t - t_g}{\sqrt{2}\,\sigma^2}\right)\exp\left(\frac{\sigma^2}{2\tau^2} - \frac{t - t_g}{\tau}\right) + \frac{A\sigma}{2\tau^3}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\exp\left(\frac{\sigma^2}{2\tau^2} - \frac{t - t_g}{\tau}\right), \tag{3.21} $$

$$ \frac{\partial}{\partial t_g}h_{EMG}(t) = \frac{-A}{\tau\sigma\sqrt{2\pi}}\exp\left[-\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)^2\right]\exp\left(\frac{\sigma^2}{2\tau^2} - \frac{t - t_g}{\tau}\right) + \frac{A}{2\tau^2}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\exp\left(\frac{\sigma^2}{2\tau^2} - \frac{t - t_g}{\tau}\right), \tag{3.22} $$

$$ \frac{\partial}{\partial \tau}h_{EMG}(t) = -\frac{A}{2\tau^2}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\exp\left(\frac{\sigma^2}{2\tau^2} - \frac{t - t_g}{\tau}\right) + \frac{A\sigma}{\tau^3\sqrt{2\pi}}\exp\left[-\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)^2\right]\exp\left(\frac{\sigma^2}{2\tau^2} - \frac{t - t_g}{\tau}\right) + \frac{A}{2\tau}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\left(\frac{-\sigma^2}{\tau^3} + \frac{t - t_g}{\tau^2}\right)\exp\left(\frac{\sigma^2}{2\tau^2} - \frac{t - t_g}{\tau}\right). \tag{3.23} $$

All of these partial derivatives contain the erf function and the $\tau$ parameter in the denominator terms. As a result, the same caveats of the base EMG function apply to their numerical evaluation using binary arithmetic on a computer.
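Analytic partials of this kind are easy to get wrong, so a finite-difference comparison is a useful check before they are used in an iterative solver. The sketch below does this for the base Gaussian partials of Eqns. 3.17 through 3.19; the parameter values, step size, and function names are assumptions chosen for illustration.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def gaussian(t, area, sigma, tg):
    """Base Gaussian of Eqn. 3.2."""
    return area / (sigma * SQRT_2PI) * math.exp(
        -((t - tg) ** 2) / (2.0 * sigma ** 2))

def gaussian_partials(t, area, sigma, tg):
    """Analytic partials w.r.t. (A, sigma, tg), following Eqns. 3.17-3.19."""
    e = math.exp(-((t - tg) ** 2) / (2.0 * sigma ** 2))
    d_area = e / (sigma * SQRT_2PI)
    d_sigma = (-area / (sigma ** 2 * SQRT_2PI) * e
               + area / (sigma ** 4 * SQRT_2PI) * (t - tg) ** 2 * e)
    d_tg = area / (sigma ** 3 * SQRT_2PI) * (t - tg) * e
    return d_area, d_sigma, d_tg

def numeric_partials(t, params, h=1e-6):
    """Central finite differences over the same three parameters."""
    result = []
    for k in range(len(params)):
        up, down = list(params), list(params)
        up[k] += h
        down[k] -= h
        result.append((gaussian(t, *up) - gaussian(t, *down)) / (2.0 * h))
    return tuple(result)

params = (2.0, 0.05, 1.0)   # hypothetical A, sigma, tg
analytic = gaussian_partials(1.03, *params)
numeric = numeric_partials(1.03, params)
```

The same comparison can be applied to the EMG partials of Eqns. 3.20 through 3.23 by substituting the EMG function and its four parameters.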

3.5 Sliding window approach to large peak sets

A typical chromatogram contains on the order of 80 peaks distributed across the measured time series. The model fitting process is practically limited by the number of free parameters that can be simultaneously determined (on the order of 20). Therefore, an approach to parse the fitting operation is required to fit the entire sequence of estimated peak parameters. The advantages of fitting the peaks with a model, including not having to determine hard points where the peak starts and ends and the ability to accurately determine the parameters of overlapped peaks, should be retained in the approach used to fit the entire sequence of peaks.

The approach presented here is based on the same concept as a convolution operation that slides a convolution kernel with finite support across the original signal, e.g.

$$ y(t) = \int_{-\tau_l}^{\tau_h} x(\tau)\,h(t - \tau)\,d\tau. \tag{3.24} $$

In this case there is no actual output signal but rather a set of fitted parameters for the peaks contained within the extent of a rectangular “kernel.” The algorithm consists of several stages that can be broken into selection, fitting, and evaluation components. The flow chart in Fig. 3.5 describes the top level components of the algorithm.

Figure 3.5 Flow chart of peak fitting operation. This process is applied to an entire chromatogram. Box labels: Start; Establish master parameter list (based on estimates); Identify RT markers; Compute each peak’s fit measure; Select first N peaks; Calculate the extent of neighbor peaks; Overlap with neighbor peaks? (No/Yes); Add peaks and enlarge data window; Fit peak model to data; Compute fit measure for peaks in orig. window; Check peak fit measure (Better/Worse); Select next N peaks, include overlap factor; Stop.

Unlike the definition for convolution, the support or rectangular window is not a constant width, but rather a variable width which is based on the extent and separation of the peaks included in the window. The algorithm is driven by how many peak parameters can be simultaneously determined by the nonlinear least squares algorithm and typically is set at four or five peaks. A nominal number of peaks is selected and then the raw estimated peak parameters are used to determine the extent of the peaks in the original time series. In addition, neighbor peaks on either side of the determined data window are examined to determine the overlap in the defined data segment. If an appreciable fraction of the neighboring peak’s area is contained in the data segment, that peak will be included in the fitting process.

The defined data segment and model parameters extracted from the master parameter list are passed to the nonlinear least squares algorithm. Once the fitting is complete, the new fit parameters are used to calculate an individual peak-fit measure, as opposed to the global residual obtained during the fitting algorithm. The peak fit measure is the sum squared residual between the peak and the raw data. The master parameter list is updated using the following relation

$$ \text{master\_parameter}(i) = \begin{cases} \text{fit\_parameter}(i), & \forall\, i \in \{\text{fit\_measure}(i) < \text{master\_measure}(i)\} \\ \text{master\_parameter}(i), & \forall\, i \notin \{\text{fit\_measure}(i) < \text{master\_measure}(i)\} \end{cases} \tag{3.25} $$

where $i$ is the peak index and only ranges over the indices considered within the original data segment window.

The final step in the process is to select the next set of $N$ peaks on which to repeat the fitting process. A user-defined overlap fraction determines the degree of overlap between consecutive fitting operations. A typical overlap fraction of 0.5 will guarantee that each peak will be fit a minimum of two times. This typically provides for an excellent fit and reduces the complexity of the fit because one group of parameters has already gone through the fitting process once. When all the peaks have been processed, the master parameter list is written to the binary Analytical Instrument Association (AIA) Network Common Data Format (NetCDF) data file for the chromatogram under investigation.
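The selection and update steps of the sliding window procedure in Sec. 3.5 can be sketched as follows. This is a simplified illustration under stated assumptions: windows hold a fixed nominal number of peaks, the enlargement of a window to include overlapping neighbor peaks is omitted, and the data layout (a mapping from peak index to a parameter tuple and fit measure) and function names are hypothetical.

```python
def window_starts(n_peaks, n_window, overlap):
    """Start index of each group of n_window peaks, advancing by the
    non-overlapping portion of the window each time."""
    step = max(1, int(round(n_window * (1.0 - overlap))))
    starts, start = [], 0
    while True:
        starts.append(start)
        if start + n_window >= n_peaks:
            return starts
        start += step

def update_master(master, fit):
    """Apply the rule of Eqn. 3.25: keep a new fit only where its
    per-peak fit measure is smaller than the stored one.

    Both arguments map peak index -> (parameters, fit_measure).
    """
    for i, (params, measure) in fit.items():
        if measure < master[i][1]:
            master[i] = (params, measure)
    return master

starts = window_starts(n_peaks=10, n_window=4, overlap=0.5)

# Hypothetical (tg, A, sigma, tau) tuples with sum squared residuals.
master = {0: ((2.0, 1.0, 0.03, 0.1), 5.0), 1: ((2.4, 3.0, 0.03, 0.1), 1.0)}
fit = {0: ((2.01, 1.1, 0.03, 0.09), 2.0), 1: ((2.5, 2.0, 0.04, 0.1), 4.0)}
master = update_master(master, fit)
```

Here peak 0 takes the new parameters because its fit measure improved, while peak 1 keeps its stored parameters because the new fit was worse.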

CHAPTER 4

CONCENTRATION MIXTURE MODEL

The analysis of an unknown time signature typically consists of an offline model cali-

bration process and an online model fitting process. Offline calibration determines the

relationship between the measured signal of a calibrated standard and the signal model

parameters. The online process entails fitting the calibrated model to measured signals

from unknown samples. During the first step, individual time series signals will be

obtained for standard samples with single analytes over a range of known concentrations.

The peak modeling described in the previous chapters is applied to these time series and a

composite model of each analyte of interest is generated. These composite models are

then used as the fitting function for a time series generated by an unknown sample. The

following two sections will describe the analyte model generation during the calibration

phase and the final model fitting for an unknown sample.

4.1 Calibration

Calibration can be defined as the process of fixing the relationship between a known

quantity and the variables of a model relating the known quantity and the measured

attributes. The calibration process first requires that a model of the input/output relation-

ship be generated based either on the first principles of the process or experimental obser-

vations. In the case of peak area measurements from gas chromatograms, the relationship

between peak area and analyte concentration is thought to be first order linear over a lim-

ited dynamic range. However, a statistical analysis of the linear and the quadratic models

reveals that in many cases the first order linear model suffers from a lack of fit. The fol-

lowing sections will present the derivation of the analyte model and the procedure used in

determining the model parameters.

4.1.1 Analyte model

A unique approach to determine analyte quantification is the use of a complete model

consisting of all the peaks found in a time series that meet a certain level of statistical rela-

tionship to the given analyte of interest. Previous approaches have only considered a few

peak areas in an independent linear regression analysis and manually selected which

peak(s) to use for actual quantification. Another approach considers numerous peak areas

(or the raw chromatogram) in a principal component analysis. By establishing the relation-

ship between analyte concentration and the entire peak distribution, the unique contribu-

tion of each analyte in an unknown sample can be determined using a global model fitting

operation.

Using the EMG peak function derived in Eqn. 3.4, the analyte model can be expressed

as the summation ofN EMG peaks shifted to the appropriate retention times. This can be

expressed by using

, 4.1

to represent thejth peak of theith analyte (z as defined in Eqn. 3.4) and

, 4.2

EMG ij t c,( )A c( )ij

τ j--------------exp

12---

σ j

τ j-----

2 t tgj–

τ j-------------

–exp

y2

–2

--------

2π-----------------

yd

∞–

z

∫=

A c( )ij λ0i jλ1i j

ci⋅ λ2i jci

2⋅+ +=

58

where $c_i$ is the concentration of the ith analyte and the λ’s are the calibration parameters that are discussed in the next section. With this definition of the individual peaks based on the concentration, the complete analyte is given by

\[ \mathrm{Analyte}_i(t,c) = \sum_{j=1}^{N} \mathrm{EMG}_{ij}(t,c) \quad \forall j : A(c)_{ij} \in s , \tag{4.3} \]

where s is the set of peaks that pass the statistical tests for the relationship between peak area and analyte concentration. This general model of an analyte’s chromatography response can be used for any number of analytes. The various parameters for the analyte models can be stored in a three dimensional matrix with each row containing a specific analyte, each column a peak, and in the third dimension the peak parameters ($t_g$ - center time, λ’s - area coefficients, σ - Gaussian standard deviation, and τ - exponential time constant). This matrix is populated during the calibration procedure described in Sec. 4.1.4. The strength of this model becomes evident when one considers that in mixtures of multiple analytes, the unique contribution of each analyte to the observed signal can be determined.

4.1.2 Background model

The complete chromatographic response generated by the detector includes both the response due to the analytes and the baseline response due to noise and carrier gas. For a complete chromatogram model this background signal must also be included. In analyzing blank (no sample or standard injected) chromatographic signals, the general signal characteristics are slowly varying and relatively constant. Both the temperature profile of the column and the carrier gas flow rate affect the baseline signal. Based on the low frequency nature of the baseline, a low order polynomial can be used to model the baseline component of the chromatogram.

The general form of the background signal is a polynomial of fairly low order (5 - 7) given by

\[ \mathrm{Base}(t,\alpha) = \sum_{l=0}^{N} \alpha_l t^l , \tag{4.4} \]

where N is the order of the polynomial and $\alpha_l$ is the lth coefficient in the polynomial. The coefficient terms are determined by extracting candidate baseline points and fitting the selected polynomial to the observed data. This procedure is described in Sec. 5.4.

4.1.3 Interference model

The third component of an unknown chromatogram signal is interferences. Interference peaks are typically of the same form as the individual peaks described in Eqn. 4.1, but their area is uncorrelated to any of the analytes of interest. The model for a single interference peak is given by

\[ \mathrm{interference}_j(t) = \frac{A_j}{\tau_j} \exp\left[ \frac{1}{2}\left(\frac{\sigma_j}{\tau_j}\right)^2 - \frac{t - t_{g_j}}{\tau_j} \right] \int_{-\infty}^{z} \frac{\exp(-y^2/2)}{\sqrt{2\pi}}\, dy , \tag{4.5} \]

where in this case the area, $A_j$, is not concentration dependent. The complete interference model is the summation of all the peaks which are not included in any of the analyte models, given by

\[ \mathrm{Inter}(t) = \sum_{j=1}^{N} \mathrm{EMG}_j(t) \quad \forall j : A_j \notin r , \tag{4.6} \]

where $r \equiv s_1 \cup s_2 \cup \dots \cup s_n$ and $s_i$ is the set of peaks included in analyte i. By combining all three major model components, analytes, baseline, and interferences, a complete model of the observed chromatogram signal can be generated.

4.1.4 Calibration procedure

The calibration procedure consists of several steps which result in a set of complete analyte models. A flowchart of these steps is shown in Fig. 4.1 and each major processing block will be discussed below.

Figure 4.1 Flow diagram of the off-line chromatography calibration processing. The result of this process is a set of analyte-specific chromatogram signal models. (Flowchart blocks: Start; Generate raw chromatograms; Estimate signal noise level; Lowpass filter; Estimate baseline; Locate retention time end markers; Locate peaks and estimate parameters; Processed all of the calibration set? (No / Yes); Determine statistical significance of peak areas; Generate analyte-specific models; End.)

The first two steps in the procedure are used to reduce high frequency noise contained in the input chromatogram signal. This noise is usually generated by the detector and associated electronic components and will cause problems in subsequent derivative based processing steps. An estimate of the complete signal power spectral density (PSD) is computed using Welch’s averaged periodogram method [80]. A lowpass filter cutoff frequency is then obtained by one of two methods and a finite impulse response (FIR) filter is realized. One method of automatically determining the cutoff frequency is to specify the total signal power to retain after filtering. In this approach the total area under the PSD is computed. Then the area of each frequency bin is summed, starting at the lowest frequency, until a specified fraction of the total is reached. A second approach estimates an average baseline of the PSD and then adds a noise margin to this baseline. The cutoff frequency is selected based on the location at which the PSD value exceeds the set noise margin threshold.

The next step in the calibration process is to estimate the signal baseline, locate the retention time end markers, and estimate the peak parameters. This process is described in greater detail in Sec. 5.4. The preceding steps are repeated for the entire calibration set, which consists of at least two samples per analyte.
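The first cutoff selection rule described in this section (retain a specified fraction of the total signal power) can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes the PSD estimate is already available as a list of uniformly spaced bins, and the function name, the example PSD values, and the 0.9 fraction are all hypothetical.

```python
def cutoff_by_power_fraction(freqs, psd, fraction=0.95):
    """Return the lowest frequency at which the cumulative PSD area
    reaches `fraction` of the total (uniform bin width is assumed)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    total = sum(psd)
    running = 0.0
    for f, p in zip(freqs, psd):
        running += p  # accumulate bin areas from the lowest frequency up
        if running >= fraction * total:
            return f
    return freqs[-1]


# Hypothetical PSD estimate: most of the power sits in the lowest bins.
freqs = [0.0, 1.0, 2.0, 3.0, 4.0]
psd = [50.0, 30.0, 15.0, 4.0, 1.0]
cutoff = cutoff_by_power_fraction(freqs, psd, fraction=0.9)
```

With these numbers the cumulative power is 50, 80, 95, so the 90 percent level is first reached in the bin at frequency 2.0, which would then serve as the FIR lowpass cutoff.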

The final two steps in the calibration process use the three dimensional matrix of peak information (rows correspond to samples, columns to peaks, and peak parameters stored in the z axis direction) to generate the analyte models. The first step analyzes the statistical correlation to the desired analytes’ concentration and selects the peaks to include in the analyte model. Then a linear regression is performed for each selected peak to determine the λ’s of Eqn. 4.2.

Statistical hypothesis tests will be used to determine if a given peak should be included in a specific analyte model. The approach taken will be a sequential pairwise test of models with increasing polynomial order. The F test can be used to determine the critical value for rejecting the hypothesis that the model of higher order gives a better fit to the data [33]. Application of the F test requires the assumption that the errors in the data are zero mean, have constant variance, $\sigma^2$, and follow a normal distribution. A model of order zero will be considered as the baseline model (this model is the average peak area across the calibration set and indicates no correlation to analyte concentration). The specific test is defined as the ratio of the sum of squares between model 1 and model 2 and is given by

\[ \frac{S_1}{S_2} = \frac{n_x F(\alpha, n_x, m-n)}{m-n} + 1 , \tag{4.7} \]

where $S = r^T r$, r is the residual vector, $n_x$ is the increase in free parameters in model 2, α is the significance level, m is the number of calibration points, and n is the number of free parameters in model 1. If the ratio, $S_1/S_2$, is less than the critical value on the right hand side of Eqn. 4.7, model 2 is rejected as giving a better fit than model 1. This process is repeated up

to a quadratic model to determine the best fit. A peak is not included in the analyte model if the zero order model is the best fit.

A second test, described as the γ criterion, is used to determine if the regression model is “useful” as distinct from “significant.” This test is also based on the F statistic and involves computing a multiplier, γ, for the standard F-ratio significance level [8]. For this test the F value is calculated with

\[ F = \frac{MS_{Reg}}{s^2} , \tag{4.8} \]

where $MS_{Reg}$ is the mean square due to the regression and $s^2$ is the mean square due to residual variation. The critical value, $F_0$, is given by

\[ F_0 \approx (1 + \gamma_0^2)\, F(v_0, v_r, 1-\alpha) , \tag{4.9} \]

where $v_r$ is the number of residual degrees of freedom and where

\[ v_0 = \frac{v_m (1 + \gamma_0^2)^2}{(1 + 2\gamma_0^2)} . \tag{4.10} \]

Based on a selected γ factor, typically in the range from 2 - 4, the model is considered useful if the ratio between $F_0$ and F calculated in Eqn. 4.8 is greater than ten, i.e.,

\[ \frac{F_0}{F(v_m, v_r, 1-\alpha)} \ge 10 , \tag{4.11} \]

where $v_m$ is the regression degrees of freedom. If the concentration vs. peak area regression model is deemed both significant and useful by these two tests, it will be included in the complete analyte model given in Eqn. 4.3.
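The arithmetic of Eqns. 4.7 and 4.10 can be illustrated with a short sketch. The Python standard library has no F distribution, so the critical F value is passed in as a number looked up from a table; the function names and all numeric inputs below are hypothetical and chosen only to show the calculation.

```python
def critical_ratio(n_x, m, n, f_crit):
    """Right hand side of Eqn. 4.7; f_crit = F(alpha, n_x, m - n) from a table."""
    return n_x * f_crit / (m - n) + 1.0


def higher_order_accepted(s1, s2, n_x, m, n, f_crit):
    """Model 2 is kept only if S1/S2 reaches the critical ratio."""
    return s1 / s2 >= critical_ratio(n_x, m, n, f_crit)


def gamma_dof(v_m, gamma0):
    """Adjusted degrees of freedom v0 of Eqn. 4.10."""
    g2 = gamma0 ** 2
    return v_m * (1.0 + g2) ** 2 / (1.0 + 2.0 * g2)


# Hypothetical case: 6 calibration points, order 0 (n = 1) versus order 1
# (n_x = 1); 6.61 is the tabulated 5% point of F with (1, 5) degrees of freedom.
ratio = critical_ratio(n_x=1, m=6, n=1, f_crit=6.61)
keep_linear = higher_order_accepted(s1=40.0, s2=4.0, n_x=1, m=6, n=1, f_crit=6.61)
v0 = gamma_dof(v_m=1, gamma0=2.0)
```

Here the critical ratio is 2.322, so a tenfold reduction in the residual sum of squares would accept the first order model over the zero order one, and γ0 = 2 with one regression degree of freedom gives v0 = 25/9.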

4.1.5 Complete mixture model

The calibration procedure described in the previous sections results in the individual analyte models. In the analysis of an unknown sample, these models, in combination with a baseline and possible interference peaks, will be combined in an additive complete-mixture model. This model will be fit to the chromatogram of the unknown sample using methods described in Sec. 4.2. The form of this model is given by

\[ \mathrm{chrom}(t, c_1, \dots, c_n, \alpha) = \mathrm{Base}(t,\alpha) + \mathrm{Analyte}(t,c_1) + \dots + \mathrm{Analyte}(t,c_n) + \mathrm{Inter}(t) , \tag{4.12} \]

where the last term on the right hand side represents any interference peaks not accounted for by the analyte models.

4.2 Total concentration model fitting

The power of the proposed method can now be applied to an unknown sample’s chromatogram using the chromatogram model generated during the calibration process. The steps involved in fitting the unknown chromatogram are depicted in the flow chart shown in Fig. 4.2. Many of the initial steps are similar to those discussed in Sec. 4.1.4 with the primary difference being the steps of estimating the analyte concentration and the iterative complete chromatogram model fitting. The following sections will discuss these steps in greater detail.

Figure 4.2 Flow diagram of the on-line chromatography analysis processing. The result of this process is a quantitative estimate of the desired analyte concentrations. (Flowchart blocks: Start; Estimate signal noise level; Lowpass filter; Estimate baseline; Locate retention time end markers; Locate peaks and estimate parameters; Use multivariate analysis for initial conc. estimates; Generate unknown sample model; Fit unknown model to raw signal; Test residual error (Pass / Fail); Analyze error; Update model; Report analyte concentration & peak parameters; End.)

4.2.1 Multiple linear regression derived estimates

As depicted in Fig. 4.2, an initial estimate of the analyte’s concentration must be made prior to the final analyte fitting operation. One method to generate this initial estimate is multiple linear regression. Multiple linear regression (MLR) attempts to find the optimal, in a least-squared-error sense, linear combination of the analyte’s concentrations which match the observed peak response generated by the calibration samples [58]. This relationship is given in matrix form by

\[ R = CS + E , \tag{4.13} \]

where R is an m x p matrix of the GC responses (peak areas), C is an m x n matrix of analyte concentrations, S is an n x p matrix of sensitivities, and E is an m x p matrix of errors. For analyte concentration estimation the pseudo inverse of S is taken (since p may be greater than m) to give

\[ \hat{c} = r S (S'S)^{-1} , \tag{4.14} \]

where $\hat{c}$ is the estimated vector of analyte concentrations, r is the vector of measured peak areas, and S is derived from the normal equations and is given by

\[ S = R'C (C'C)^{-1} . \tag{4.15} \]

Note that S as written in Eqns. 4.14 and 4.15 is a p x n matrix, the transpose of the sensitivity matrix of Eqn. 4.13. The initial analyte concentrations can be obtained from the measured peak areas of an unknown sample using these relationships.

4.2.2 Model fitting process

The final model fitting to the measured chromatogram can begin with the form of the analyte model determined from the calibration process and the initial analyte concentrations. This fitting operation is an iterative process that performs a nonlinear least squares fit of the complete chromatogram model given in Eqn. 4.12 to the raw measured chromatogram. The baseline and interference terms of Eqn. 4.12 are not dependent on analyte concentration, therefore the baseline and interference peaks (including the retention time

markers) are subtracted from the observed chromatogram prior to the fitting process. The function to be minimized is given by

\[ \underset{c \in \Re^n}{\text{minimize}} \; \left( g_c(t) - \mathrm{chrom}(t, c_1, \dots, c_n, \alpha) \right)^2 , \tag{4.16} \]

subject to $c_l \le c \le c_u$, where $c_l$ is the lower bound on concentration and $c_u$ is the upper bound on concentration. The nonlinear least squares fitting method is an iterative method based on successive approximation of the solution. This process will be discussed in detail in Chap. 5.

4.2.3 Confidence interval estimate

Ancillary information regarding the accuracy and sensitivity of the analyte concentration estimates is useful in the complete characterization of an unknown sample. A typical measure of these quantities is the confidence interval associated with the concentration estimates. The general methodology is based on estimating the covariance matrix of the concentration estimates.

The covariance matrix is derived by considering the objective function of the least-squares minimization process. This function is dependent on both the concentration estimates and the data values and is represented by $\Phi(C, Y)$. At the estimated solution, $C^*$, to the minimization process the following relationship holds

\[ \frac{\partial \Phi(C^*, Y)}{\partial C} = 0 , \tag{4.17} \]

with $C^*$ representing the actual solution. If one perturbs the data, Y, by a small amount, the solution C will also be perturbed by some small amount giving

\[ \frac{\partial \Phi(C^* + \delta C, Y + \delta W)}{\partial C} = 0 , \tag{4.18} \]

which is the new solution to the perturbed data. Expand Eqn. 4.18 with a Taylor series and truncate terms greater than first order, then subtract Eqn. 4.17 to get

\[ \frac{\partial^2 \Phi}{\partial C^2}\, \delta C^* + \frac{\partial^2 \Phi}{\partial C\, \partial W}\, \delta W \cong 0 . \tag{4.19} \]

Solve for $\delta C^*$ to obtain

\[ \delta C^* = -\mathrm{inv}\left( \left. \frac{\partial^2 \Phi}{\partial C^2} \right|_{C = C^*} \right) \frac{\partial^2 \Phi}{\partial C\, \partial W}\, \delta W . \tag{4.20} \]

The covariance matrix, $V_C$, can then be defined by

\[ V_C \equiv E\left( \delta C^*\, \delta C^{*T} \right) . \tag{4.21} \]

The Gauss method is used to approximate the Hessian term with $\partial^2 \Phi / \partial C^2 \approx 2 J(\Phi)^T J(\Phi)$, where J is the Jacobian of $\Phi(C, Y)$ evaluated at the solution, $C^*$ [2]. With this substitution and evaluating the expected value operator, the implementation of the covariance matrix is given by

\[ V_C = \sigma^2 (J^T J)^{-1} , \tag{4.22} \]

where $\sigma^2$ is estimated from the residual data by $\sigma^2 = r^T r / (m - n)$ (m = number of data points, n = number of parameters).

Now, using the diagonal terms of the covariance matrix, the confidence interval can be formed using the Student t distribution to equate a confidence level to a multiple of standard deviations. The final interval calculation is given by

\[ \Delta C = \sqrt{\mathrm{diag}(V_C)}\; t_v^{0.975} , \tag{4.23} \]

where $t_v^{0.975}$ is the 0.975 point of the Student t distribution with v degrees of freedom.

4.2.4 Residual evaluation

At the completion of the first iteration of chromatogram model fitting, a functional module tests the residual error between the model and the raw chromatogram to determine if any systematic error is present. These errors will typically indicate that the model being fitted is deficient in some way and modifications of the model might generate a better fit. Initially, a simple test determines if the absolute value of the residual error exceeds a preset threshold based on a fraction of the maximum signal value or an estimate of the input signal’s noise level. The module applies a second level of analysis if the residual exceeds this threshold. This second phase analyzes the correlation between the location of the noise excursion and existing peaks contained in the model. At locations where correlation exists, the corresponding peak parameters will be allowed to change during the following iteration of fitting. If no correlation to an existing peak is found, an additional interference peak is added to the complete chromatogram model. The top level model fitting operation is repeated using the final results of the previous fit and any model additions or modifications.
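The two-level residual test of Sec. 4.2.4 can be sketched as follows. This is an illustration under stated assumptions rather than the module used in this work: the "correlation" between an excursion and an existing peak is approximated here by simple proximity in time, each sample above the threshold is treated separately (a practical version would group contiguous samples into one excursion), and the function name, tolerance, and data are hypothetical.

```python
def evaluate_residual(times, residual, peak_centers, threshold, tolerance):
    """Classify residual excursions as belonging to a known peak or not.

    Returns (peaks_to_free, new_peak_times): indices of model peaks whose
    parameters should be released, and times for new interference peaks.
    """
    peaks_to_free = set()
    new_peak_times = []
    for t, r in zip(times, residual):
        if abs(r) <= threshold:
            continue  # first-level test passed at this sample
        # Second-level test: is the excursion close to an existing peak?
        near = [k for k, c in enumerate(peak_centers) if abs(t - c) <= tolerance]
        if near:
            peaks_to_free.update(near)
        else:
            new_peak_times.append(t)
    return sorted(peaks_to_free), new_peak_times


times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
residual = [0.1, 2.0, 0.2, 0.1, -1.5, 0.0]
free, new_peaks = evaluate_residual(times, residual, peak_centers=[1.2, 3.0],
                                    threshold=1.0, tolerance=0.5)
```

In this example the excursion at t = 1.0 lies near the model peak centered at 1.2, so that peak would be released for refitting, while the excursion at t = 4.0 matches no peak and would trigger an additional interference peak.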

CHAPTER 5

NONLINEAR LEAST SQUARES DATA MODELING

The method of least squares is a process of obtaining the best fit between an analytical model and observed data. In most applications the measure of the fit between the modeled data and the observed data is the sum of squares of deviations between the two signals, which is desired to be minimized. Thus, the general method of least squares is a procedure which either explicitly determines or iteratively changes the values of the model parameters in order to minimize the sum of square residuals. In this application, the form of the peak model is nonlinear, so an iterative method is required to determine the model parameters. The following sections will give additional detail on several of the key aspects of applying nonlinear least squares to the problem of fitting EMG peak functions to the GC signal.
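As a concrete reference for the function being fitted, the EMG peak of Eqn. 4.1, the area calibration of Eqn. 4.2, and the analyte sum of Eqn. 4.3 can be evaluated as in the following minimal sketch. The expression used for z is the common EMG form and is an assumption here, since Eqn. 3.4 is not repeated in this chapter; the function names and parameter layout are illustrative only.

```python
import math


def emg_peak(t, area, t_g, sigma, tau):
    """Exponentially modified Gaussian of Eqn. 4.1 evaluated at time t."""
    # Assumed form of the upper integration limit z (see Eqn. 3.4).
    z = (t - t_g) / sigma - sigma / tau
    expo = 0.5 * (sigma / tau) ** 2 - (t - t_g) / tau
    # Standard normal integral from -inf to z via the complementary error function.
    phi = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return (area / tau) * math.exp(expo) * phi


def peak_area(c, lam0, lam1, lam2):
    """Quadratic area calibration of Eqn. 4.2."""
    return lam0 + lam1 * c + lam2 * c * c


def analyte_response(t, c, peaks):
    """Sum of EMG peaks for one analyte (Eqn. 4.3).

    `peaks` is a list of (lams, t_g, sigma, tau) tuples, one per selected peak.
    """
    return sum(emg_peak(t, peak_area(c, *lams), t_g, sigma, tau)
               for lams, t_g, sigma, tau in peaks)
```

A useful sanity check on such a function is that the peak integrates to its area parameter, which is what allows the λ coefficients to tie peak area to concentration.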

5.1 Least squares formulation

The formal description of the least squares procedure begins with the concept of generating a quantitative measure of the total difference between two sampled signals. One method of generating such a single numeric description is to sum the squared difference between the two signals at each discrete sample point. This error measure, S, can be expressed by

\[ S = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( \frac{Y_i - \hat{Y}_i}{w_i} \right)^2 , \tag{5.1} \]

where $w_i$ is a weight factor based on the variance of the signal and ε is the difference or error between an observed signal, Y, and an analytical model of the observation, $\hat{Y}$. We drop the weight factor, w, in Eqn. 5.1 due to the lack of any knowledge that the variance of the signal varies over time or amplitude. In situations where the analytical model of the observed data is a simple linear function, the value, S, can be minimized by taking the partial derivatives of Eqn. 5.1 with respect to each model parameter and equating them to zero. This results in a set of normal equations which can then be solved with linear algebra methods. In the current application, the base analytical model, $h_{EMG}(t)$ (Eqn. 3.4), of the observed data is a nonlinear function in several of the model parameters. The minimization of the complete chromatogram model given in Eqn. 4.16 will also require a nonlinear minimization algorithm. As a result, the minimization of S must proceed as an iterative successive approximation of the set of model parameters. In the case of the linear relationship, the global minimum can be reached from any initial model parameters; but in the nonlinear case, the quality of the initial model parameters plays a significant role in the convergence of the method to the global minimum.

5.2 Minimization algorithm

The development of an algorithm for the minimization of a nonlinear function, such as the least squares error measure, is typically built around expanding the function as a truncated Taylor series about some point in the function’s parameter space. This expansion takes the following form

\[ y_i^{calc} \approx y_i^{obs} + \sum_{j=1,n} \frac{\partial f_i}{\partial p_j^{init}}\, \Delta p_j , \tag{5.2} \]

where $f_i$ is the modeling function evaluated at point i and p is a parameter of the function f. This expansion makes two approximations about the underlying function, namely that the higher order terms in the Taylor series can be ignored and the infinitesimal $dp_i$ can be replaced with $\Delta p_j$. In practice these approximations are very good in the limit when $\Delta p_j$ becomes small or equivalently when $y_i^{calc}$ is close to $y_i^{obs}$. Using this representation of the calculated value of the modeling function, the least squares minimization can be cast in the following form. The expression in Eqn. 5.1 can be rewritten as

\[ S = r^T W r , \tag{5.3} \]

where the residuals are $r_i = y_i^{obs} - y_i^{calc}$, or in vector form $r = y^{obs} - y^{calc}$, and W is a weight matrix that will be assumed to initially be the identity matrix. Using the Taylor series expansion the residual becomes

\[ r_i = y_i^{obs} - \left( y_i^{calc} + \sum_{j=1,n} \frac{\partial f_i}{\partial p_j^{init}}\, \Delta p_j \right) , \qquad r = \Delta y - J \Delta p , \tag{5.4} \]

where $\Delta y = y^{obs} - y^{calc}$ and J is the Jacobian of the modeling function, f. If the partial derivative with respect to the model parameters of the objective function, S, is equated to zero, the nonlinear version of the normal equations results. These quantities are given by

\[ \frac{\partial S}{\partial p} = -2 J^T W r , \qquad J^T W (\Delta y - J \Delta p) = 0 , \tag{5.5} \]

\[ J^T W J\, \Delta p = J^T W\, \Delta y . \tag{5.6} \]

From Eqn. 5.6 the method for updating p can be obtained as

\[ \Delta p = (J^T W J)^{-1} J^T W\, \Delta y . \tag{5.7} \]
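The update of Eqn. 5.7 can be illustrated on a small problem. The sketch below is not the EMG fit itself: it uses a hypothetical two-parameter decay model, y = a exp(-b t), with W taken as the identity, so that the 2 x 2 normal equations can be solved in closed form. All names and data are assumptions made for the example.

```python
import math


def model(t, a, b):
    return a * math.exp(-b * t)


def gauss_step(ts, ys, a, b):
    """One update of Eqn. 5.7 with W = I for the two-parameter model above."""
    jtj = [[0.0, 0.0], [0.0, 0.0]]   # J^T J
    jty = [0.0, 0.0]                 # J^T dy
    for t, y in zip(ts, ys):
        e = math.exp(-b * t)
        row = (e, -a * t * e)        # df/da, df/db at the current parameters
        dy = y - model(t, a, b)      # y_obs - y_calc
        for i in range(2):
            jty[i] += row[i] * dy
            for j in range(2):
                jtj[i][j] += row[i] * row[j]
    # Solve the 2 x 2 normal equations by Cramer's rule.
    det = jtj[0][0] * jtj[1][1] - jtj[0][1] * jtj[1][0]
    da = (jty[0] * jtj[1][1] - jtj[0][1] * jty[1]) / det
    db = (jtj[0][0] * jty[1] - jtj[1][0] * jty[0]) / det
    return a + da, b + db


ts = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
ys = [model(t, 2.0, 0.7) for t in ts]   # noise-free synthetic data
a, b = 1.8, 0.6                          # starting guess near the solution
for _ in range(10):
    a, b = gauss_step(ts, ys, a, b)
```

Starting close to the true values (2.0, 0.7), the iteration recovers them in a few steps; the same update started far from the solution is not guaranteed to converge.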

The above method is commonly referred to as the Gauss method of nonlinear, unconstrained optimization, which is an approximation to the Newton method. In practice this method can be problematic due to several factors including divergence, oscillation and ill conditioned inverses. These conditions arise due to poor initial guesses which are far from the true solution and limitations of using the truncated Taylor series expansion. Several approaches have been developed to avoid the problems that might be encountered with the straight Gauss method including the Levenberg-Marquardt (LM) method [60][64]. In this method the robustness and speed of convergence is increased by altering the direction and length of the shift vector, $\Delta p$.

This method introduces an additional term to the normal equations of Eqn. 5.6. The modified version of the normal equations is given by

\[ (J^T W J + \lambda D)\, \Delta p = J^T W\, \Delta y , \tag{5.8} \]

where λ is a scaling factor and D is a diagonal matrix that may be the identity matrix or the diagonal elements of $J^T W J$. This term enables the method to shift between the standard Gauss method and the steepest descent, or gradient, method. It has been shown that in cases where the Gauss method diverges (objective function increases), λ can be increased to ensure convergence. The Levenberg-Marquardt algorithm details a procedure for determining the optimal value for λ for each iteration step. This method is generally much faster than the steepest descent method and more robust than the Gauss method. The primary method for controlling the value of λ is to estimate the nonlinearity of f, the function to minimize, using a linear prediction of the value of f at the next iteration and a cubically interpolated estimate of the minimum of f.
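The damped normal equations of Eqn. 5.8 can be sketched on the same kind of toy problem. This example assumes a hypothetical two-parameter decay model, W = I, and D equal to the diagonal of $J^T J$. The rule used to adjust λ (divide by ten after an accepted step, multiply by ten after a rejected one) is a common simple heuristic and is a stand-in for, not a reproduction of, the interpolation-based control described in this work.

```python
import math


def sum_sq(ts, ys, a, b):
    return sum((y - a * math.exp(-b * t)) ** 2 for t, y in zip(ts, ys))


def lm_fit(ts, ys, a, b, lam=1e-2, iters=50):
    """Fit y = a * exp(-b t) using the damped normal equations of Eqn. 5.8."""
    s = sum_sq(ts, ys, a, b)
    for _ in range(iters):
        h = [[0.0, 0.0], [0.0, 0.0]]   # J^T J
        g = [0.0, 0.0]                 # J^T dy
        for t, y in zip(ts, ys):
            e = math.exp(-b * t)
            row = (e, -a * t * e)
            dy = y - a * e
            for i in range(2):
                g[i] += row[i] * dy
                for j in range(2):
                    h[i][j] += row[i] * row[j]
        # D = diag(J^T J): the damping scales each diagonal element.
        m00 = h[0][0] * (1.0 + lam)
        m11 = h[1][1] * (1.0 + lam)
        det = m00 * m11 - h[0][1] * h[1][0]
        da = (g[0] * m11 - h[0][1] * g[1]) / det
        db = (m00 * g[1] - h[1][0] * g[0]) / det
        s_new = sum_sq(ts, ys, a + da, b + db)
        if s_new < s:      # accepted: behave more like the Gauss method
            a, b, s, lam = a + da, b + db, s_new, lam / 10.0
        else:              # rejected: behave more like steepest descent
            lam *= 10.0
    return a, b


ts = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
ys = [2.0 * math.exp(-0.7 * t) for t in ts]   # noise-free synthetic data
a_fit, b_fit = lm_fit(ts, ys, a=1.0, b=0.3)
```

Because a step is only accepted when it lowers the sum of squares, the objective never increases, which is the robustness property attributed to the LM method above.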

5.3 Line search and estimation of LM scaling factor

The scaling factor λ must be determined for the implementation of the LM algorithm as noted in the previous section. In addition, a line search algorithm can be incorporated in most nonlinear optimization routines to increase convergence speed by adjusting the step length based on the values of the function along the line (direction) indicated by the base algorithm.

The most basic line search method is the bisection method used in root finding. When the magnitude and the gradient of the function are taken into account, better line searches result in faster convergence rates. The approach taken in this work is to use a cubic polynomial in the step length parameter, α, and solve for the minimum of this polynomial to estimate the optimal $\alpha^{(k)}$.

In general, the following steps are taken to implement the line search: (a) determine a search direction, $s^{(k)}$, using a method like LM, (b) find a distance to move, $\alpha^{(k)}$, in the direction $s^{(k)}$, that will minimize $f(x^{(k)} + \alpha s^{(k)})$ with respect to α, where f is the least squares residual function, (c) update the parameter estimates using $x^{(k+1)} = x^{(k)} + \alpha^{(k)} s^{(k)}$, return to step (a) and repeat. By taking the derivative of the polynomial with respect to α and equating it to zero, the optimal step length can be determined. In practice the default step length is one and this is used as a starting point at each iteration of the line search procedure.

The results of previous line search procedures can be used to determine the appropriate value of the LM scaling factor λ. An estimate of the nonlinearity of the problem is

made by comparing the linearly predicted sum of squared errors and a cubically interpolated estimate as proposed by Moré [66]. The linear predicted sum of squares, $f_p$, is given by

\[ f_p(x_k) = \left( J(x_{k-1}) \right)^T s_{k-1} + F(x) , \tag{5.9} \]

where F(x) is the vector of residuals at each data point. The cubically interpolated estimate, $f_k$, is determined as described in the previous sections. The following relations are used to update λ

\[ \lambda_k = \lambda_{k-1} + \frac{f_k(x^*) - f_p(x_k)}{\alpha^*} \quad \text{if } f_p(x_k) < f_k(x_k) , \]
\[ \lambda_k = \frac{\lambda_{k-1}}{1 + \alpha^*} \quad \text{if } f_p(x_k) \ge f_k(x_k) , \tag{5.10} \]

where * indicates the optimum found using the cubic interpolation.

5.4 Initial estimates

The success of any modeling approach using nonlinear optimization is strongly dependent on the initial parameter values of the model. If these initial values are not “close” to the true values, many nonlinear optimization algorithms will diverge from the true values. There are two strategies to prevent this from occurring during the course of model fitting: good initial parameter values and evaluation of the residuals between the model and the observed data. This section will describe the procedure for generating the initial model estimates and the next section will describe the process of residual evaluation.

The objective of the procedure that generates the initial model parameters is to take, as input, the raw time series signal and first locate all the potential peaks, and then estimate the model parameters for each peak and a nominal baseline model. Specific parameters for

the peaks are location (center), amplitude, width, and skew. The baseline is modeled as either a third order polynomial or a cubic spline.

Several methods exist to detect peaks in time series, and this work uses a combination of several standard techniques. The typical approaches range from maximum within a time window [72], variations on the detection of zero crossings of the first derivative [11][14][49], minimum of the second derivative [15], and pattern recognition approaches [43]. A variant of the minimum of the second derivative method was utilized in order to detect poorly resolved peaks.

This method locates local minima in the smoothed second derivative (difference) of the input signal. Figure 5.1 shows the basic processing steps used in the detection of candidate peaks in the input time series. The logic involved in the final step controls the region of the signal to perform the minimum search and enables multiple peaks to be located on a leading or trailing edge of a larger peak.
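The idea of locating peaks at local minima of the smoothed second difference can be sketched as below. This is a simplified illustration and not the detector used in this work: the smoothing is a plain moving average, the second difference is computed in one step rather than as two first differences, the region-control logic is omitted, and the function names, window width, and threshold are hypothetical.

```python
import math


def moving_average(x, width):
    """Centered moving average; edges use the samples that are available."""
    half = width // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def second_difference(x):
    """Central second difference, padded with zeros at both ends."""
    inner = [x[i - 1] - 2.0 * x[i] + x[i + 1] for i in range(1, len(x) - 1)]
    return [0.0] + inner + [0.0]


def detect_peaks(signal, smooth_width=3, threshold=-0.02):
    """Indices of local minima of the smoothed second difference below threshold."""
    d2 = moving_average(second_difference(moving_average(signal, smooth_width)),
                        smooth_width)
    return [i for i in range(1, len(d2) - 1)
            if d2[i] < threshold and d2[i] < d2[i - 1] and d2[i] <= d2[i + 1]]


# Synthetic test signal: two Gaussian peaks, each 3 samples wide (sigma).
signal = [math.exp(-(i - 20) ** 2 / 18.0) + 0.6 * math.exp(-(i - 45) ** 2 / 18.0)
          for i in range(70)]
peaks = detect_peaks(signal, smooth_width=3, threshold=-0.02)
```

The second difference is most negative where the signal is most sharply curved downward, so a shoulder on the flank of a larger peak can still produce a distinct minimum even when it produces no maximum in the signal itself.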

Figure 5.1 Processing flow for peak detection algorithm. (Block diagram: input; Lowpass; 1st Difference; smooth; 1st Difference; smooth; Threshold; Minimum detect; Logic.)

Once the candidate peaks have been detected, the various parameters of each peak must be estimated. This is accomplished by first estimating the signal baseline and then calculating the peak amplitude and width from the raw signal. Several techniques exist for baseline estimation, including truncated Fourier series, polynomial fit, and spline fit. The polynomial fit will be described below.

The first step in the baseline estimation process is to locate the candidate baseline points. This can be accomplished by either searching for the signal minimum within non-overlapping segments or locating minima between peaks. Next these points will be used to compute a linear least squares fit to a polynomial function (nominal order 5 - 7). An example of this process is shown for a segment of a chromatogram in Fig. 5.2. In this example the points used for the polynomial fit were the local minimum at the beginning and end of the time series and the minimum amplitude within non-overlapping segments. The time value for each point is represented as the midpoint of the segment.
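The segment-minimum baseline procedure described above can be sketched as follows. The example is a minimal illustration with hypothetical names and data: it solves the normal equations directly with a small Gaussian elimination routine, which is adequate for the first order fit shown but would be poorly conditioned for the order 5 - 7 polynomials mentioned in the text, where an orthogonal factorization would be a safer choice.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_baseline(times, signal, segment_len, order):
    """Fit a polynomial to the minimum of each non-overlapping segment."""
    pts = []
    for start in range(0, len(signal), segment_len):
        seg = signal[start:start + segment_len]
        seg_t = times[start:start + segment_len]
        # Each minimum is assigned the midpoint time of its segment.
        pts.append((0.5 * (seg_t[0] + seg_t[-1]), min(seg)))
    # Normal equations (V^T V) alpha = V^T y for the Vandermonde matrix V.
    ata = [[sum(t ** (i + j) for t, _ in pts) for j in range(order + 1)]
           for i in range(order + 1)]
    aty = [sum(y * t ** i for t, y in pts) for i in range(order + 1)]
    return solve_linear(ata, aty)


# Hypothetical signal: flat baseline of 2.0 with one narrow peak at t = 5.
times = [0.1 * k for k in range(100)]
signal = [2.0 + 5.0 * math.exp(-((t - 5.0) ** 2) / 0.02) for t in times]
alpha = fit_baseline(times, signal, segment_len=20, order=1)
```

Because each segment is wider than the peak, every segment minimum falls on the baseline and the fitted coefficients come out as approximately [2.0, 0.0], unaffected by the peak.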

Figure 5.2 Example of baseline estimation. A fifth order polynomial is fit to the minimum signal level within a non-overlapping segment. (Plot axes: Time (sec.), 0 to 20; Amplitude (counts), 0 to 12.)

5.5 Refinement of model

Once a good set of initial parameter estimates is determined, the fitting of the model can begin with the nonlinear least squares minimization procedure. Several steps can be included in this process including checking the values of the parameters during the fitting process (constraining the parameter values to known acceptable ranges), residual analysis, model refinement, and iterative application of the fitting procedure as described in Sec. 4.2.2.

Based on a priori knowledge of the fundamental peak shapes some constraints can be applied to the parameter estimates. For example the amplitude, width (σ), and skew (τ) of the peak are positive real numbers. Another constraint is that the center of a given peak should not take a value less than that of the preceding peak or greater than that of the following peak. These constraints are applied to the parameters at each step of the least squares minimization procedure.

5.6 Constrained nonlinear minimization

In several cases of performing the peak modeling operation it became necessary to add constraints to the model parameters. In particular there were cases where the peak centers were changing order, that is, a peak at a lower retention time would switch with a peak at a higher retention time. Due to the algorithm which determines if the new model parameters should be accepted, this behavior could not be tolerated. In addition, some peaks with small area were increasing in width to the point of trying to compensate for a baseline offset.

To correct this situation, constraints were established that prevented the peak center parameter from taking on values less than the preceding peak or greater than the subsequent peak. Upper and lower bounds were placed on other peak parameters based on the general characteristics of a given GC configuration.
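The effect of these constraints can be illustrated with a simple clamping sketch. Clamping a proposed update back into its allowed range is only a stand-in used here to show what the ordering and bound constraints mean; it is not the constrained minimization method developed in the remainder of this section, and the function names and numbers are hypothetical.

```python
def clamp(value, lower, upper):
    """Limit value to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


def constrain_centers(proposed, previous):
    """Keep each proposed peak center between its neighbours' previous centers."""
    out = []
    for k, c in enumerate(proposed):
        lower = previous[k - 1] if k > 0 else float("-inf")
        upper = previous[k + 1] if k + 1 < len(proposed) else float("inf")
        out.append(clamp(c, lower, upper))
    return out


# The middle peak tries to jump past its right-hand neighbour at 9.0.
previous = [2.0, 5.0, 9.0]
proposed = [1.8, 9.5, 9.2]
centers = constrain_centers(proposed, previous)

# A width that collapses to zero is held at a hypothetical lower bound.
width = clamp(0.0, 0.05, 2.0)
```

The middle center is held at 9.0, so the retention time order of the peaks is preserved from one iteration to the next.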

The general problem we wish to solve is given by

\[ \begin{aligned} \text{minimize } \; & f(x), \quad x \in \Re^n \\ \text{subject to } \; & c_i(x) = 0, \quad i \in E \\ & c_i(x) \ge 0, \quad i \in I , \end{aligned} \tag{5.11} \]

where f(x) is the objective function (least squares residual) and $c_i(x)$ are the constraint functions. Using a Taylor series approximation with respect to the constraint functions we obtain

\[ c_i(x^* + \delta) = c_i^* + \delta^T a_i^* + o(\delta) , \tag{5.12} \]

with $a_i = \nabla c_i$. Next, assume that $x^*$ is a local minimum of the objective function and that there are no feasible descent directions at $x^*$ with respect to the constraints. Using Eqn. 5.12 we can say that $c_i(x^* + \delta) = c_i^* = 0$, which implies that any feasible incremental step lies along a feasible direction s such that

\[ s^T a_i^* = 0 . \tag{5.13} \]

If f(x) were to have negative slope along the direction s, then

\[ s^T g^* < 0 . \tag{5.14} \]

Based on our initial assumption that $x^*$ is a local minimizer, both Eqn. 5.13 and Eqn. 5.14 can’t be true and no s can satisfy both equations. We have arrived at the necessary condition for a constrained local minimizer of Eqn. 5.11 given by

\[ g^* = \sum_{i \in E} a_i^* \lambda_i^* = A^* \lambda^* , \tag{5.15} \]

where λ is often referred to as a Lagrange multiplier. These relationships are shown graphically in Fig. 5.3.

A typical approach to solving Eqn. 5.15 is to use the Lagrangian function, L(x, λ), to combine the least squares function and the constraint functions. Written with the sign convention that is consistent with Eqn. 5.15 and with the equations that follow, the function is given by

\[ L(x, \lambda) = f(x) - \sum_{i=1}^{m} \lambda_i \cdot c_i(x) , \tag{5.16} \]

where $\lambda_i$ is the Lagrange multiplier and $c_i(x)$ is the ith constraint. Using this function the condition in Eqn. 5.15 can be expressed as

\[ \nabla L(x^*, \lambda^*) = 0 , \tag{5.17} \]

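with

\[ \nabla = \begin{bmatrix} \nabla_x \\ \nabla_\lambda \end{bmatrix} . \tag{5.18} \]

Figure 5.3 Graphic description of vectors associated with constrained optimization. The two points, x' and x*, show the initial and final alignment of g* and a* at the solution point. (Diagram labels: contours of f(x); constraint c(x) = 0; points x' and x*; vectors g', a', g*, a*, and δ.)

Following the usual Taylor series approach to generating a method to iteratively approach the optimal solution we have

\[ \nabla L(x + \delta x, \lambda + \delta\lambda) = \nabla L(x, \lambda) + \nabla^2 L(x, \lambda) \begin{bmatrix} \delta x \\ \delta\lambda \end{bmatrix} + \dots . \tag{5.19} \]

If the higher order terms are neglected and Eqn. 5.19 is equated to zero we obtain

\[ \nabla^2 L(x, \lambda) \begin{bmatrix} \delta x \\ \delta\lambda \end{bmatrix} = -\nabla L(x, \lambda) . \tag{5.20} \]

Expanding Eqn. 5.20 using the definition given in Eqn. 5.16 we obtain

\[ \begin{bmatrix} W^{(k)} & -A^{(k)} \\ -A^{(k)T} & 0 \end{bmatrix} \begin{bmatrix} \delta x \\ \delta\lambda \end{bmatrix} = \begin{bmatrix} -g^{(k)} + A^{(k)} \lambda^{(k)} \\ c^{(k)} \end{bmatrix} , \tag{5.21} \]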

where the Hessian of $L(x, \lambda)$ is defined as

\[ W^{(k)} = \nabla^2 f(x^{(k)}) - \sum_i \lambda_i^{(k)} \nabla^2 c_i(x^{(k)}) , \tag{5.22} \]

and $A^{(k)}$ is the Jacobian of the constraints evaluated at $x^{(k)}$. If we let $\lambda^{(k+1)} = \lambda^{(k)} + \delta\lambda$ and $\delta^{(k)} = \delta x$, Eqn. 5.21 can be written as

\[ \begin{bmatrix} W^{(k)} & -A^{(k)} \\ -A^{(k)T} & 0 \end{bmatrix} \begin{bmatrix} \delta^{(k)} \\ \lambda^{(k+1)} \end{bmatrix} = \begin{bmatrix} -g^{(k)} \\ c^{(k)} \end{bmatrix} . \tag{5.23} \]

We therefore solve Eqn. 5.23 for $\delta^{(k)}$ and $\lambda^{(k+1)}$ and then use the following relation to update x

\[ x^{(k+1)} = x^{(k)} + \delta^{(k)} . \tag{5.24} \]

The approach given can be formulated in a manner that leads to a slightly different form of the result which can be easier to implement in practice. This formulation breaks the original problem into a series of quadratic programming problems which have robust algorithms for estimating/updating the Hessian matrix [28].
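A single step of Eqns. 5.23 and 5.24 can be worked through on a toy problem. The example below is an assumption made for illustration and is unrelated to the chromatogram model: minimize f(x) = x1^2 + x2^2 subject to c(x) = x1 + x2 - 1 = 0. Since f is quadratic and c is linear, W = 2I, A is the column vector [1, 1], and one step reaches the solution exactly.

```python
def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def kkt_step(x):
    """One step of Eqn. 5.23 for f = x1^2 + x2^2, c = x1 + x2 - 1 = 0."""
    g = [2.0 * x[0], 2.0 * x[1]]      # gradient of f
    c = x[0] + x[1] - 1.0             # constraint value
    # Block matrix [[W, -A], [-A^T, 0]] with W = 2 I and A = [1, 1]^T.
    kkt = [[2.0, 0.0, -1.0],
           [0.0, 2.0, -1.0],
           [-1.0, -1.0, 0.0]]
    d1, d2, lam = solve_linear(kkt, [-g[0], -g[1], c])
    return [x[0] + d1, x[1] + d2], lam   # Eqn. 5.24 update and new multiplier


x_new, lam = kkt_step([3.0, -1.0])
```

The step lands on x = (0.5, 0.5) with λ = 1, and at that point the gradient g = (1, 1) equals Aλ, which is the condition stated in Eqn. 5.15.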

CHAPTER 6

RESULTS FUSION

The aggregation of the measurements into a more accurate and precise analysis is desired in many applications involving redundant measurements of a single physical phenomenon. This type of operation is generally referred to as information or sensor fusion and is especially useful where potential conflicts exist between the measurements, or a priori information is known about the reliability of information under certain conditions [1][69]. Several methods analyze the measured chromatogram time series in this application and each generates an estimate of the analyte concentrations in the unknown sample. Each method has its strengths and weaknesses, and the goal of combining the multiple results is to utilize confidence measures reported by each method to control the aggregation. In general the two primary advantages of fusion that are relevant to this application are redundancy of information and complementary information. The following two sections will present several basic fusion methods and a detailed description of the fuzzy logic approach to fusion.

6.1 Fusion methods

The most basic approach to analytical results integration is the use of a statistical moment such as the mean to combine the reported analytes’ concentrations and variances. This approach is not optimal in a statistical sense but is computationally simple and requires few constraints and little prior knowledge of the information being combined [82]. The mean operator can be replaced with nonlinear operators such as min, max, and median. A Kalman filter extends this general concept to incorporate the estimated statistical characteristics of the measurements and then to generate an optimal filter for the fusion of the low–level sensor readings.
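The simple operators named above can be sketched in a few lines of Python. The function name and the example concentrations are hypothetical and serve only to show how the choice of operator changes the fused value when one method disagrees with the others.

```python
from statistics import mean, median


def fuse_basic(estimates, operator="mean"):
    """Combine redundant estimates with one of the simple operators."""
    operators = {"mean": mean, "median": median, "min": min, "max": max}
    return operators[operator](estimates)


# Hypothetical concentrations reported by three methods for one analyte.
concentrations = [1.0, 1.2, 5.0]
fused_mean = fuse_basic(concentrations, "mean")      # pulled toward the outlier
fused_median = fuse_basic(concentrations, "median")  # ignores the outlier
```

None of these operators uses the confidence information reported by the methods, which is the motivation for the weighted schemes discussed in the remainder of this chapter.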

Another class of fusion operators based on probabilistic models includes Bayesian rea-

soning and evidence theory. With the Bayesian approaches, the prior and estimated condi-

tional probability distributions of the measurements are used to reduce uncertainty using

the formal Bayes statistical combination theorems. This approach requires either a large

amount of data to generate the probability distributions or a means of reliably estimating

the distributions [45]. In this application of results fusion, there are not any discrete classes

or events to assign probabilities, but rather a linguistic description of conditions and com-

bination rules. Dempster-Shafer (DS) evidence theory has also been applied to fusing

uncertain information using mass functions to represent sensor information and Demp-

ster’s rule of combination to combine the information sources [6]. A potential disadvan-

tage of DS is that the theory assumes the information sources to be independent which is

unlikely for the fusion of different analytical methods applied to the same raw data.

Fuzzy logic, or set theory, is the final approach considered for the fusion of results.

Zadeh first proposed fuzzy set theory as a means to represent inexact, incomplete, and

uncertain information in a mathematical framework [86]. Fuzzy set theory generalizes the

binary valuation of set membership to the real interval [0, 1] and, in doing so, enables set

membership to be represented in degrees. This set theory also includes the traditional

union and intersection operators, defined for fuzzy sets, that are used in combining information in this framework. This approach has generally been successful in applications that contain imprecise information and in encapsulating expert knowledge in a rule based system. The fuzzy logic approach will be pursued as one possible solution to the task of combining the results from several analytical methods.

6.2 Fuzzy logic based fusion

The final step in the data interpretation process is the combination of the individual

analyte concentrations determined by each of the analytical methods as shown in Fig. 1.3.

Each method generates a concentration and confidence interval for each analyte of interest

and a scalar “importance” measure of how well a synthetic chromatogram (or individual

peak areas) matches the measured chromatogram (peak areas). The synthetic information

is generated using the reported concentrations and a set of concentration normalized

parameters derived from the calibration set. The results fusion module uses these inputs to

generate a final, reported concentration and confidence interval for each analyte as shown

in the block diagram of Fig. 6.1. In the following sections the basic definitions of fuzzy

logic are given and the specific implementation of results fusion using the underlying

framework of fuzzy logic is presented.

6.2.1 Definitions

Figure 6.1 Block diagram of the results fusion module. Inputs to the module are analyte concentrations and confidence intervals, and an importance measure. Outputs are a single combined set of analyte concentrations and confidence intervals. [Diagram labels: Method 1, Method 2, ..., Method N, each supplying conc., conf., and import. to the Results Fusion block, which outputs conc. and conf.]

Membership functions (MF). A typical way to denote a fuzzy set A in the universe of discourse X is with a set of ordered pairs

$$A = \{ (x, \mu_A(x)) \mid x \in X \}, \tag{6.1}$$

where $\mu_A(x)$ is the grade of membership or membership function of x in A which maps X to the membership space M. If M contains only two elements, 0 and 1, A is nonfuzzy and $\mu_A(x)$ is identical to the characteristic function of a nonfuzzy set. As stated in the introduction, inexactitude can be represented by a fuzzy set. Three types of inexactitude are generally significant: (1) generality, i.e. a concept applies to a variety of situations; (2) ambiguity, i.e. a concept describes more than one distinguishable subconcept; and (3) vagueness, i.e. a concept does not have precise boundaries. These types of inexactitude are represented by fuzzy subsets in the following way: (1) generality, the universe of discourse X is not just one point; (2) ambiguity, the membership function $\mu_A(x)$ has more than one local maximum for $x \in X$; and (3) vagueness, $\mu_A(x)$ takes values other than 0 and 1 [53].

One way to define $\mu_A(x)$ is with a continuous standard function such as a Gaussian, sigmoid, or polynomial. An example of such a membership function is given by

$$\mu_A(x; \sigma, c) = e^{-\frac{(x - c)^2}{2\sigma^2}}, \tag{6.2}$$

where c is the center of the membership function, at which the degree of membership is one, and σ controls the rate of decreasing membership as $|x - c|$ increases. The graph in Fig. 6.2 shows several membership functions based on single and multiple functions of the form given in Eqn. 6.2.

In this application the outputs from the analytical methods form the bases of two universes of discourse which will be labeled C and I for concentration and importance respectively. The fuzzy sets within the universe C include present and absent and the sets in the universe I include low, medium, and high. Membership in each of these sets is determined in a fuzzification procedure which maps a crisp input value, x, to a fuzzy membership value, $\mu_A(x)$, in each set contained in the given universe.
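The fuzzification step can be sketched in Python using the Gaussian membership function of Eqn. 6.2. The two-sided variant follows the description of Fig. 6.2b (full membership between two centers). The centers and widths assigned to the low, medium, and high sets below are hypothetical values chosen only for illustration.

```python
import math


def gaussian_mf(x, c, sigma):
    """Gaussian membership function of Eqn. 6.2."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def two_sided_gaussian_mf(x, c_left, sigma_left, c_right, sigma_right):
    """Two Gaussians with full membership between the two centers."""
    if x < c_left:
        return gaussian_mf(x, c_left, sigma_left)
    if x > c_right:
        return gaussian_mf(x, c_right, sigma_right)
    return 1.0


def fuzzify(x, fuzzy_sets):
    """Map a crisp value to its membership grade in every set of a universe."""
    return {name: mf(x) for name, mf in fuzzy_sets.items()}


# Hypothetical sets for the importance universe I, defined on [0, 1].
importance_sets = {
    "low": lambda x: gaussian_mf(x, 0.0, 0.2),
    "medium": lambda x: gaussian_mf(x, 0.5, 0.2),
    "high": lambda x: gaussian_mf(x, 1.0, 0.2),
}
grades = fuzzify(0.5, importance_sets)
```

A crisp importance of 0.5 receives full membership in medium and small, equal grades in low and high, which is the kind of partial membership the rules in the following sections operate on.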


Figure 6.2 Example of two Gaussian based membership functions. Plots depict MFs with varying parameters. a.) Single Gaussian with c = 5.0 and σ = 2.0; b.) Two Gaussians with c = 3.0 and σ = 1.0 for the left Gaussian, c = 4.0 and σ = 3.0 for the right, and full membership between the two centers. [Both panels plot degree of membership (0 to 1) against x (0 to 10).]

Logical operations. A fundamental set of operations in any set theory is the union, intersection, and negation of sets and their parallel logic operations or, and, and not. In fuzzy set theory these operations are typically defined based on nonlinear operations on the membership values of each set. The theoretical derivations of the operators for intersection and union have been justified by Bellman and Giertz [4] and by Fung and Fu [32].

Three basic operations of set theory: union, intersection, and complement can be defined for fuzzy sets. Let A and B be fuzzy sets of X. The union of fuzzy sets A and B is denoted by $A \cup B$ and is defined by

$$A \cup B = \int_x \frac{\mu_A(x) \vee \mu_B(x)}{x}, \tag{6.3}$$

where ∨ is the symbol for maximum. The intersection of A and B is denoted by $A \cap B$ and is defined by

$$A \cap B = \int_x \frac{\mu_A(x) \wedge \mu_B(x)}{x}, \tag{6.4}$$

where ∧ is the symbol for minimum. The complement of A is denoted by $\bar{A}$ and is defined by

$$\bar{A} = \int_x \frac{1 - \mu_A(x)}{x}. \tag{6.5}$$

These equations reduce to a pointwise maximum, minimum, and complement for discrete members of the fuzzy sets.

6.2.2 Combination Rules

Considering these basic definitions, the next step in the process of combining results is defining a set of combination rules. These rules state the conditions under which the operators described in the previous section are applied to input values. A typical rule is of the logical form “if antecedent then consequent” where the antecedent usually contains the operators listed above and the consequent is membership in another fuzzy set. An example of such a rule is “if concentration is A and importance is B then output weight is C,” where concentration, importance, and weight are universes (input/output variables) and A, B, and C are fuzzy sets. The complete fuzzy combination system consists of the defined set of membership functions and operations, a set of combination rules and an aggregation method for the results of all the rules. The next section will outline how these components combine to accomplish the multiple analytical results integration.
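For a single input value, the operators of Eqns. 6.3 to 6.5 reduce to maximum, minimum, and one minus the membership grade. The short Python sketch below applies them to the antecedent of one rule of the form given above. The set names (present, high) and the membership grades are hypothetical.

```python
def fuzzy_or(mu_a, mu_b):
    """Union (Eqn. 6.3) evaluated at a single point: maximum."""
    return max(mu_a, mu_b)


def fuzzy_and(mu_a, mu_b):
    """Intersection (Eqn. 6.4) evaluated at a single point: minimum."""
    return min(mu_a, mu_b)


def fuzzy_not(mu_a):
    """Complement (Eqn. 6.5) evaluated at a single point."""
    return 1.0 - mu_a


# Hypothetical rule: if concentration is present and importance is high
# then output weight is C. The antecedent strength is the minimum grade.
mu_present = 0.8
mu_high = 0.3
firing_strength = fuzzy_and(mu_present, mu_high)
```

The resulting firing strength limits how strongly the consequent set C contributes to the aggregated output.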


6.2.3 Fusion of results

The complete process proposed for the combination of the individual results from each

analytic method is based on the fuzzy principles described above and will be explained in

this section. The schematic shown in Fig. 6.3 depicts the architecture of the proposed sys-

tem.

The first step in this process is to run the analytical methods on the raw data and gener-

ate a set of output data consisting of the analytes’ estimated concentrations, confidence

intervals, and overall importance of the results. These results are then input to the method–

specific fuzzy inference systems and the crisp values are mapped to their appropriate

fuzzy set using the defined membership functions. All of the rules evaluate the input fuzzy

sets and produce a fuzzy output set. The results from each rule are combined into a single

fuzzy set using aggregation methods such as maximum or sum. A final step in the fuzzy

inference process is to defuzzify the output membership function using an operation such

as the centroid or area bisector. This process is depicted graphically in Fig. 6.4.
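The inference steps described above (rule evaluation, aggregation by maximum, and centroid defuzzification) can be sketched as follows. This is a minimal illustration under stated assumptions: the four rules, the linear output sets low and high, and the sampling of the output universe on 101 points are all hypothetical and do not reproduce the rule base used in this work.

```python
def low(w):
    """Hypothetical output set 'low weight' on [0, 1]."""
    return 1.0 - w


def high(w):
    """Hypothetical output set 'high weight' on [0, 1]."""
    return w


def infer_weight(mu_present, mu_absent, mu_imp_low, mu_imp_high, n_points=101):
    """Evaluate four rules, aggregate by maximum, defuzzify by centroid."""
    # Firing strengths: "and" in each antecedent is the minimum operator.
    rules = [
        (min(mu_present, mu_imp_high), high),
        (min(mu_present, mu_imp_low), low),
        (min(mu_absent, mu_imp_high), low),
        (min(mu_absent, mu_imp_low), low),
    ]
    numerator = 0.0
    denominator = 0.0
    for i in range(n_points):
        w = i / (n_points - 1)
        # Clip each consequent at its firing strength, then take the maximum.
        mu = max(min(strength, mf(w)) for strength, mf in rules)
        numerator += w * mu
        denominator += mu
    # Returning 0.0 when no rule fires is an assumption of this sketch.
    return numerator / denominator if denominator > 0.0 else 0.0


weight_confident = infer_weight(1.0, 0.0, 0.0, 1.0)
weight_doubtful = infer_weight(0.0, 1.0, 1.0, 0.0)
```

With these hypothetical rules, a method reporting a clearly present analyte with high importance receives a larger weight than one reporting an absent analyte with low importance.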

Figure 6.3 Complete results fusion architecture. The architecture includes the generation of fuzzy membership values for each method, rule evaluation, rule aggregation, and combination weight generation. [Diagram labels: Initial Results; Method A, Method B, ..., Method K; Fuzzy set membership; Combination Rules (Rule 1 ... Rule N-2); Rule Aggregation (Σ); Fuzzy inference system; Final Results.]

Figure 6.4 Example of the fuzzy logic based combination of two measured inputs. This fuzzy inference system uses four rules and two fuzzy sets per input. The input measurements are mapped to membership values and each rule is evaluated. The resulting output is obtained by combining the results from all rules and then taking the centroid. [Diagram labels: rules 1 to 4; input1, input2, output1; “If”, “AND”, “Then”; Maximum; Centroid; Measured Inputs; Combined Output; plotted values 0.294, 0.194, and 0.316.]

The output of each fuzzy inference system is a weight factor, $w_i$, between 0.0 and 1.0 which is used in a weighted average of the concentration and confidence interval for each analyte. Specifically the concentration is given by

$$conc(anal)_{final} = \frac{\sum_i w_i \, conc_i(anal)}{\sum_i w_i}, \tag{6.6}$$

where i ranges over the set of all analytical methods and $w_i$ is the weight result derived from the fuzzy system. The confidence interval is combined in a similar manner using

$$conf(anal)_{final} = \frac{\sum_i w_i \, (conf_i(anal))^2}{\sum_i w_i}. \tag{6.7}$$

Several items must be configured prior to the use of the results fusion including the definition of membership functions and generation of the combination rule set. These are typically defined based on the expert knowledge of the system designer. Some work has been done by Jang on automating the generation of the rule base in a fuzzy system [48].
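The weighted average of Eqn. 6.6 is straightforward to express in code. The following Python sketch assumes that the weights have already been produced by the fuzzy inference systems; the function name and the example values are hypothetical.

```python
def fuse_concentration(concentrations, weights):
    """Weighted average of per-method concentrations (Eqn. 6.6)."""
    total_weight = sum(weights)
    if total_weight == 0.0:
        # Behavior when every method receives zero weight is not specified
        # in the text; raising an error is an assumption of this sketch.
        raise ValueError("all weights are zero")
    weighted_sum = sum(w * c for w, c in zip(weights, concentrations))
    return weighted_sum / total_weight


# Hypothetical concentrations from three methods and their fuzzy weights.
final_conc = fuse_concentration([1.0, 2.0, 4.0], [0.5, 0.25, 0.25])
```

A method with a weight near zero has almost no influence on the reported value, so a poorly matching synthetic chromatogram effectively removes that method from the fusion.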


CHAPTER 7

SYSTEM INTEGRATION

The data interpretation module (DIM) is one of the standard laboratory modules (SLM) used in the automation of PCB sample analysis [23][75]. This module consists of software that takes as input the raw chromatogram produced by the analytical instrument (gas chromatography system) and produces an estimate of the concentration of specific analytes under investigation. All of the steps in this process are initiated and controlled by the task sequence controller (TSC) via an electronic communications link.

The DIM consists of several distinct pieces including the UNIX–executable control and

communication (CC) program, the UNIX–executable MATLAB interface program, the

MATLAB programs which perform the analytical computations, and the off-line model–

building tools.

This chapter focuses on the components of the DIM related primarily to the interface

with the TSC and the control and sequencing of the MATLAB analysis functions. The

chapter will be organized into a functional description section, a description of the soft-

ware structure, and a description of the off-line analysis tool.

7.1 Functional description of the integrated system

The primary function of the CC component of the DIM is to translate commands

issued from the TSC into the appropriate MATLAB function calls and combine the results

from various analytical methods. A modular approach can be taken to meet the required


functionality with the primary modules being the TSC-DIM communications interface,

DIM-MATLAB interface, data management, and results fusion. Each of these functional

modules will be described in the following sections. The schematic in Fig. 7.1 shows a

diagram of the functional modules of the DIM CC.

7.1.1 TSC-DIM interface

Figure 7.1 Top level schematic of the DIM CC functional blocks. [Diagram labels: TSC; Analytical Instrument Module; Networked Data Storage; and, within the DIM, Control & Communication, MATLAB Interface, MATLAB Analytical Methods, QA/QC, Results Fusion.]

The TSC is the master controller of the standard analysis method (SAM) and each SLM must respond to commands issued as part of the SAM processing script. During the analysis of a sample the TSC will instruct the appropriate SLMs to perform their respective functions. The last step in this process is the analysis of the raw chromatogram generated by the analytical instrument module (AIM) to extract chemical knowledge about the sample. The functional requirements of the interface between the TSC and the DIM are listed below.

1. Establish communication with the TSC. When the DIM CC program starts, the first task is to establish a communications channel and identify itself by sending the relevant information about the DIM to the TSC. The SLM interface tool kit is used as the underlying code to configure and maintain the socket-based connection between the TSC and DIM.

2. Respond to all TSC requests. All commands issued by the TSC must be acknowledged

by the DIM, then progress and completion messages must be sent back to the TSC as

the requested task is executed. Two levels of commands will be issued by the TSC:

laboratory unit operations (LUO) and intra-LUO (ILUO) operations. The LUO com-

mands are queued in a first in, first out (FIFO) buffer and do not interrupt an ongoing

command execution while the ILUO commands require an immediate response.

3. Define a common access file system. The DIM needs to be able to copy the raw chro-

matogram file generated by the AIM to a batch-processing directory. The TSC will

supply a filename to the DIM and the DIM will copy the file to the currently defined

batch processing directory based on the Analytical Instruments Association (AIA) net-

work common data format (NetCDF) sample type.
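The two command levels in requirement 2 can be illustrated with a small Python sketch: LUO commands wait in a FIFO buffer, while ILUO commands are answered at once without disturbing the queue. The class name, the log format, and the "status" command are hypothetical; only the queueing behavior follows the description above.

```python
from collections import deque


class CommandDispatcher:
    """Hypothetical dispatcher: LUO commands are queued, ILUO handled at once."""

    def __init__(self):
        self.luo_queue = deque()  # first in, first out buffer for LUO commands
        self.log = []             # messages that would be sent back to the TSC

    def receive(self, kind, name):
        # Every command is acknowledged on receipt.
        self.log.append(("ack", name))
        if kind == "ILUO":
            # Intra-LUO commands require an immediate response.
            self.log.append(("done", name))
        else:
            self.luo_queue.append(name)

    def run_next_luo(self):
        """Execute the oldest queued LUO command, if any."""
        if not self.luo_queue:
            return None
        name = self.luo_queue.popleft()
        self.log.append(("done", name))
        return name


dispatcher = CommandDispatcher()
dispatcher.receive("LUO", "start")
dispatcher.receive("LUO", "read")
dispatcher.receive("ILUO", "status")  # completes before either queued LUO
first_executed = dispatcher.run_next_luo()
```

In this sketch the ILUO command completes while both LUO commands are still waiting, and the LUO commands then run in arrival order.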

7.1.2 DIM-MATLAB interface

The majority of the analytical computations used to convert the raw chromatogram

signal to useful chemical knowledge of the sample contents are performed in the MAT-

LAB software environment. This environment is typically accessed via a command line or


graphical interface in which the user types commands or makes appropriate user interface

selections to execute the desired functions. In the automated processing scenario these

commands must be generated by a controlling interface to the MATLAB processing

engine. The primary functional requirements of this interface are to establish a message–

passing queue between the DIM CC and the MATLAB interface program, open a MAT-

LAB processing engine, generate the appropriate MATLAB function calls and arguments,

and parse the return arguments from the MATLAB functions. Each of these tasks will be

described in greater detail below.

1. Communications between DIM CC and MATLAB interface. MATLAB provides a

mechanism to start a processing engine from a user–written executable program. The

MATLAB engine function enables text strings to be constructed and passed to the

engine in a manner similar to the way the user would type commands at the MATLAB

prompt. Because of the ILUO response requirements, the DIM CC must not be

blocked by the call to the MATLAB engine. Therefore a stand-alone executable pro-

gram with a non-blocking communication link is required to interface between the

DIM CC and the MATLAB processing engine.

2. The communications link between the DIM CC and the MATLAB interface can be a simple message queue. A message queue is a first in, first out (FIFO) queue that has a user defined message structure. A library of support calls exists that enables the opening, formatting, sending, and receiving of messages between two independent executable code modules.

3. Open a MATLAB processing engine. In order to execute the processing algorithms


implemented in MATLAB code a MATLAB engine must be opened. A separate pro-

gram is used to interface with the MATLAB engine using a set of library functions

provided by MATLAB. These functions start the MATLAB process and provide a han-

dle for other functions to use in subsequent processing. This program is in a continu-

ous loop, which waits until a message is available on the queue, passes the command

string to the MATLAB engine, waits for completion, composes a result message, and

sends the message back to the CC.

4. Generation of MATLAB function calls. The interface between the DIM CC and the

MATLAB interface must compose a text string containing the necessary function

name and required arguments. Based on the command received from the TSC, an

appropriate MATLAB function will be selected and the arguments extracted from data

provided by the TSC. This information will be formatted and put into a message struc-

ture to be sent to the MATLAB engine interface program.

5. Parsing return arguments. At the completion of MATLAB functions the return argu-

ments are placed in the MATLAB environment memory and must be retrieved into the

interface’s memory space. The return variables are then parsed and formatted into a

message to be sent to the DIM CC program. After this message has been sent, the

input message queue is checked and the next MATLAB function request is processed.

7.1.3 Data management

The DIM CC is required to manage both the batch processing specific information and

the per–sample information during on-line processing. Upon start-up the CC will initialize

an internal database of parameters with default values (this state can also be reached by an


initialization command). The TSC is responsible for communicating information regard-

ing the current batch processing information prior to any processing of samples. This

information will be retained across all samples until a new batch is defined. The TSC will

also send sample–specific information prior to issuing the processing commands. Finally,

sample specific information is generated during the quality and analysis processing. The

details of the data management function are listed below.

1. Establish and maintain DIM CC database. A defined data structure with the required fields is generated in memory during the start-up of the DIM CC. The TSC will issue commands containing information that is stored in the data structure and subsequently used during sample processing. Results from the sample processing will also be stored in the data structure. Checks will be made on the required fields prior to executing MATLAB functions, and fields that change on a per-sample basis will be cleared upon completion of each sample’s processing.

2. Generation of ASCII results file. At the completion of each sample’s processing the

sample data structure is appended to an ASCII file located at the top level of the batch-

processing directory. This file contains a single line for each sample processed, and

database fields are separated with blank spaces.
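The results file described in item 2 (one line per sample, fields separated by blank spaces) can be sketched in Python as follows. The field names, the file name, and the use of a temporary directory in place of the batch-processing directory are assumptions made for illustration; the actual DIM CC is written in C.

```python
import os
import tempfile


def append_sample_record(results_path, record, field_order):
    """Append one sample as a single space-separated ASCII line."""
    values = [str(record[name]) for name in field_order]
    if any(" " in value for value in values):
        # A space inside a value would break the one-field-per-token layout.
        raise ValueError("field values must not contain spaces")
    with open(results_path, "a", encoding="ascii") as handle:
        handle.write(" ".join(values) + "\n")


# Hypothetical field names and values.
FIELDS = ["sample_id", "analyte", "conc", "conf"]
batch_dir = tempfile.mkdtemp()  # stands in for the batch-processing directory
results_path = os.path.join(batch_dir, "results.txt")
append_sample_record(
    results_path,
    {"sample_id": "S001", "analyte": "PCB", "conc": 1.25, "conf": 0.1},
    FIELDS,
)
append_sample_record(
    results_path,
    {"sample_id": "S002", "analyte": "PCB", "conc": 0.5, "conf": 0.2},
    FIELDS,
)
with open(results_path, encoding="ascii") as handle:
    lines = handle.read().splitlines()
```

Opening the file in append mode keeps earlier samples intact, so the file grows by exactly one line per processed sample.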

7.2 Software structure

As described in Sec. 7.1 the primary software modules in the DIM are the TSC-DIM

communications interface, DIM-MATLAB interface, and analytical computation/data

management. Each of these modules consists of either C/C++ code or MATLAB code.

The structure of the DIM software will be described in the following sections for each major functional module. Each section will describe the components that make up the overall functional module and the layout of the actual code.

7.2.1 DIM_SLM program

The executable program dim_slm is the primary program of the DIM and is built from several source files. The main program file is “dim_slm.c” and the additional support files are “dim_util.c,” “dim_cmds.c,” and a header file “dim_slm.h.” This program uses the generic CAA communications toolkit library and thus consists of a main program that calls the library function “Tkexecutive.” The remaining code consists of functions that execute under specified TSC commands such as initialize, start, read, etc. One important attribute located in the main section of the code is the forking (or creation) of a child process which starts the program “matlab_eng.” Upon execution of the main program, the program splits: the child process starts the program which interfaces with the MATLAB processing engine, and the parent process sets up communication with the “matlab_eng” program and then enters the toolkit executive.

The individual functions that are registered with the toolkit are diminit, dimstart, dimset, dimread, idle, bypass, intraLUO, and error. Most of the action is initiated via one of the first four functions and most of the time is spent in the idle function waiting for new commands or for commands to complete. A global variable, slmstate, is used to enable specific action to be taken within any of the above–listed toolkit support functions. The variable can be set to the current action of the DIM when a command is started and cleared when the command completes. A typical execution of a TSC command would proceed in the following sequence:

1. the appropriate action function would be called (e.g. dimstart),

2. within the function a message is generated and sent to the “matlab_eng” program (if MATLAB processing is required to complete the action),

3. the function sets the slmstate variable to indicate the processing state,

4. the idle function is entered and the message queue is checked to determine if the MATLAB processing has completed,

5. if a message is in the queue, the slmstate is set to “IDLE” and the complete message is returned to the TSC.

7.2.2 MATLAB_ENG program

The executable program matlab_eng is a C program that interfaces between the dim_slm program and the MATLAB computational engine. This program links these two components in a non-blocking manner so that asynchronous communication can occur between the dim_slm program and the TSC. An additional layer of software was required because the standard library function that executes MATLAB code blocks the calling program until the MATLAB code completes.

The matlab_eng program utilizes message passing to receive commands from and transmit results to the dim_slm program. A MATLAB command string is generated by the dim_slm program, packaged into the body of a message, and placed into the queue. The matlab_eng program continually polls the receive–message queue and parses commands when a message is received. A call to the MATLAB engine starts the desired processing. Once the dim_slm program has sent the message it is free to listen for commands from the TSC and return messages from the matlab_eng program. When the MATLAB engine returns, the result is formatted into a message and sent to the dim_slm program.

The matlab_eng program thus runs in a continuous loop of waiting for a message to arrive, passing the message on to MATLAB, waiting for completion, and sending the results out in a message. The messaging facility provides the capability for multiple messages to be stored in the queue and processed in a first in, first out (FIFO) order.
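The non-blocking arrangement between the two programs can be sketched in Python. This is only an analogy under stated assumptions: a thread and two in-memory queues stand in for the separate matlab_eng process and its message queue, the function fake_engine replaces the blocking call into the MATLAB engine, and the command string is hypothetical. The actual programs are written in C.

```python
import queue
import threading
import time


def engine_worker(request_queue, result_queue, evaluate):
    """Loop in the style of matlab_eng: wait, evaluate, send the result back."""
    while True:
        command = request_queue.get()  # blocks only this worker
        if command is None:            # shutdown sentinel (assumption of this sketch)
            break
        result_queue.put((command, evaluate(command)))


def fake_engine(command):
    """Hypothetical stand-in for the blocking MATLAB engine call."""
    return "completed " + command


request_queue = queue.Queue()  # FIFO, so several commands may be pending
result_queue = queue.Queue()
worker = threading.Thread(
    target=engine_worker,
    args=(request_queue, result_queue, fake_engine),
    daemon=True,
)
worker.start()

state = {"slmstate": "IDLE"}


def start_command(command):
    """Send a command string and record that processing is under way."""
    request_queue.put(command)
    state["slmstate"] = "PROCESSING"


def idle():
    """Non-blocking check for a finished command, as in the idle function."""
    try:
        reply = result_queue.get_nowait()
    except queue.Empty:
        return None
    state["slmstate"] = "IDLE"
    return reply


start_command("dim_preprocess('sample1.cdf')")
reply = None
for _ in range(500):  # bounded polling; the caller stays free between checks
    reply = idle()
    if reply is not None:
        break
    time.sleep(0.01)
request_queue.put(None)
worker.join(timeout=5.0)
```

The caller never waits inside the engine call itself, which is the property that allows ILUO commands from the TSC to be answered while a long analysis is running.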

7.2.3 MATLAB functions

The analytical computation and some data management are performed by functions written in MATLAB. This computing environment enables robust and efficient algorithm development for analytical chromatogram processing. As described in the preceding sections, the MATLAB function calls are formulated by the control and communication program and passed to the MATLAB environment via the MATLAB interface program. The primary functions include chromatogram preprocessing, QA assessment, analytical methods, and results fusion. MATLAB functions are similar to those of other procedural languages, with a calling format of “[output1, output2, … output n] = function name(input1, input2, … input n).” Each of the functions includes a description of the processing performed in the header area of the actual code.

7.3 Implementation

All of the software described in this chapter has been implemented on UNIX worksta-

tions using the native C/C++ compiler and the MATLAB development environment. The

system has been run on several machines at ORNL and UT with good performance. In one

case part of the system ran on a machine at ORNL (TSC) and the other on a machine at


UT (DIM) over the internet connection. Several demonstrations and a significant beta test

of the entire CAA SAM system have been completed using the DIM software. The beta

test involved processing approximately sixteen actual soil samples from an environmental site in South Knoxville every day for a month. A processing script was developed and used to automatically process the samples from start to finish, sample extraction to DIM results. In addition the DIM software has been used to process a batch of stored chromatograms in an off-line mode. This process involves running only the TSC, human computer interface (HCI), and the DIM (no sample extraction, analytical instrument, or robot modules).

7.4 Off-line analysis tool

A graphical user interface (GUI) has been developed for access to the library of analy-

sis functions presented in the body of this document. This GUI enables the user to interac-

tively explore and configure the signal processing and modeling operations used for the

analysis of chromatograms. The base layout of the tool is shown in Fig. 7.2. In the main

window is a large plot which can display a number of raw and processed signals. The

magnification of the main view is controlled by a viewport window that has interactive

mouse control of the zoom limits. A scrolling table is displayed on the far right of the

screen. This table is used to report results of the individual processing steps and is dynam-

ically linked to the graphical displays. For instance, if a single peak (or group) is selected

from the list, the base peaks will be displayed in the pop-up peak view window. The tool

enables the user to experiment with algorithm parameter values, view the results, and save

these for use in on-line analysis.

Figure 7.2 Main screen of the GUI chromfit tool.

CHAPTER 8

RESULTS

The previous chapters have presented an approach to the analysis of gas chromatography that is based on quantitative pattern recognition using nonlinear model–based methods. This

section will explore the preliminary results of applying this approach to real and simulated

time series. Results in this section will be broken down into components which parallel the

description given in the preceding sections.

8.1 Preprocessing and filtering

Preprocessing the measured chromatographic time–series signal enables a more robust

operation of the subsequent processing operations. The least–squares modeling operation

itself is robust to noise but the derivative–based peak parameter estimation process can be

biased by high frequency noise. The preprocessing consists primarily of a noise level esti-

mation and a lowpass frequency filtering operation. An estimate of the power spectral den-

sity (PSD) of a baseline chromatogram (blank injection or inert carrier gas) and a typical

chromatogram are shown in Fig. 8.1. The PSD will be used to determine the cutoff fre-

quency for the lowpass filtering operation.

8.1.1 Noise estimation

Figure 8.1 Comparison of the power spectral density estimates. Plots represent various conditions: a.) Baseline chromatogram and b.) typical sample chromatogram. Power spectral density estimates for the time series in a. and b. are shown in c. and d. respectively. [Panels a. and b. plot counts against time (min); panels c. and d. plot power spectrum magnitude (dB) against frequency.]

The first step in the preprocessing phase is to determine an estimate of the high frequency noise component of the input time series. This is accomplished by computing the PSD on the baseline chromatogram and then taking the average value as the nominal noise magnitude. The dashed line in the PSD plot shown in Fig. 8.2 represents the signal power for a blank chromatogram with added white noise (zero mean, 0.3 std. dev., 100 counts amp.). As shown in the plot, the noise level is approximately 30 dB. A noise threshold level is determined by adding a 6 dB margin to the computed noise level and is shown as a horizontal line in Fig. 8.2. The lowpass cutoff frequency is determined by the cross-over point at which the low frequency PSD crosses the noise threshold. For the data shown in Fig. 8.2, the normalized cutoff frequency (1 corresponds to the Nyquist frequency) is 0.1.
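The threshold and cutoff selection can be sketched in Python as follows. The sketch assumes that the PSD estimates are already available in dB on a common normalized frequency grid, that the noise level is the plain average of the baseline PSD values in dB, and that the cross-over is the first frequency at which the sample PSD falls below the threshold. The function name and the example values are hypothetical.

```python
def lowpass_cutoff(freqs, sample_psd_db, baseline_psd_db, margin_db=6.0):
    """Return (cutoff frequency, noise threshold) from PSD estimates in dB."""
    # Nominal noise level: average of the baseline PSD.
    noise_db = sum(baseline_psd_db) / len(baseline_psd_db)
    threshold_db = noise_db + margin_db
    # Cross-over: first frequency where the sample PSD drops below threshold.
    for freq, power_db in zip(freqs, sample_psd_db):
        if power_db < threshold_db:
            return freq, threshold_db
    # If the PSD never crosses, fall back to the highest frequency
    # (an assumption of this sketch).
    return freqs[-1], threshold_db


# Hypothetical PSD values on a coarse normalized frequency grid.
freqs = [0.0, 0.05, 0.1, 0.15, 0.2]
sample_psd_db = [80.0, 60.0, 35.0, 31.0, 30.0]
baseline_psd_db = [30.0, 30.0, 30.0, 30.0, 30.0]
cutoff, threshold_db = lowpass_cutoff(freqs, sample_psd_db, baseline_psd_db)
```

With a 30 dB noise floor the threshold is 36 dB, and in this example the sample PSD first falls below it at a normalized frequency of 0.1.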

8.1.2 Filtering

A finite impulse response (FIR) digital filter is used to perform the lowpass operation.

Filter coefficients are obtained using the classical method of windowed linear-phase FIR

design [46]. The plot in Fig. 8.3 shows the magnitude and phase of the frequency response

of the filter design using an order of 16 and normalized cutoff frequency of 0.1. The base

FIR filter is applied both in the forward and reverse directions to generate an effective zero

phase, order 32 filter [67]. Results of applying this filter are shown in the right plot of Fig.

8.2.
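A minimal pure-Python sketch of the filtering step is given below: a windowed linear-phase lowpass design followed by forward and reverse application. The Hamming window is an assumption, since the text does not name the window, and no edge padding is applied, so samples within one filter length of either end contain start-up transients.

```python
import math


def design_lowpass_fir(order, cutoff):
    """Hamming-windowed sinc lowpass; cutoff is normalized so 1.0 is Nyquist."""
    mid = order / 2.0
    taps = []
    for n in range(order + 1):
        t = n - mid
        if t == 0:
            ideal = cutoff
        else:
            ideal = math.sin(math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / order)
        taps.append(ideal * window)
    gain = sum(taps)
    return [h / gain for h in taps]  # normalize to unity gain at DC


def fir_filter(taps, x):
    """Causal FIR filtering with zero initial conditions."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * x[n - k]
        y.append(acc)
    return y


def zero_phase_filter(taps, x):
    """Filter forward, then filter the reversed output and reverse again."""
    forward = fir_filter(taps, x)
    backward = fir_filter(taps, forward[::-1])
    return backward[::-1]


taps = design_lowpass_fir(16, 0.1)  # order and cutoff quoted in the text
impulse = [0.0] * 101
impulse[50] = 1.0
smoothed = zero_phase_filter(taps, impulse)
```

Applying the filter in both directions cancels the phase delay of the base filter, so the response to the impulse stays centered on its original sample.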

Figure 8.2 Power spectrum estimate used for filter selection. The PSD estimates for a baseline (dashed line) and a sample (solid line) chromatogram with added noise are shown in the left plot (power spectrum magnitude in dB versus frequency). The horizontal line indicates the threshold calculated by adding 6 dB to the mean signal level of the baseline spectrum. A segment of the noisy sample and the lowpass filtered chromatogram is shown in the right plot (counts versus time in minutes).

Figure 8.3 Frequency response of the FIR filter used to suppress high frequency noise (phase in degrees and magnitude response in dB versus normalized frequency, Nyquist = 1). The filter has zero phase due to forward and reverse directional application of the base linear-phase FIR filter.

8.2 Baseline estimation

The baseline estimation procedure described in Sec. 5.4 has been applied to several sample chromatograms and performs well. A fifth order polynomial fits the typical chromatogram baseline trend without over-fitting. Knot points for the fit are obtained by segmenting the chromatogram into ten regions and determining the minimum signal value within each segment. The abscissa values are the midpoints of the regions. Additional knot points are added at the beginning of the first segment and at the end of the last segment. A typical baseline approximation is shown in Fig. 8.4.
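
The knot selection and polynomial fit described in this section can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Sec. 5.4 itself: the two additional knots are taken to be the first and last samples, the time axis is rescaled before fitting for numerical conditioning, and the test chromatogram is synthetic.

```python
import numpy as np

def baseline_knots(t, y, n_segments=10):
    """Segment minima placed at segment midpoints, plus one knot at each end."""
    edges = np.linspace(0, len(y), n_segments + 1).astype(int)
    knot_t, knot_y = [t[0]], [y[0]]              # extra knot at the start (assumed: first sample)
    for lo, hi in zip(edges[:-1], edges[1:]):
        knot_t.append(0.5 * (t[lo] + t[hi - 1])) # abscissa: midpoint of the region
        knot_y.append(y[lo:hi].min())            # ordinate: minimum within the region
    knot_t.append(t[-1])                         # extra knot at the end (assumed: last sample)
    knot_y.append(y[-1])
    return np.array(knot_t), np.array(knot_y)

def fit_baseline(t, y, order=5, n_segments=10):
    """Least squares polynomial through the knot points, evaluated on t."""
    knot_t, knot_y = baseline_knots(t, y, n_segments)
    centre, scale = t.mean(), t.std()            # rescale time for a well conditioned fit
    coeffs = np.polyfit((knot_t - centre) / scale, knot_y, order)
    return np.polyval(coeffs, (t - centre) / scale)

# Synthetic chromatogram: slowly varying baseline plus narrow peaks.
t = np.linspace(10.0, 30.0, 2001)
true_baseline = 6000.0 + 10.0 * (t - 10.0) + 0.5 * (t - 20.0) ** 2
peaks = sum(2000.0 * np.exp(-0.5 * ((t - c) / 0.05) ** 2)
            for c in np.arange(10.5, 30.0, 1.0))
y = true_baseline + peaks
estimate = fit_baseline(t, y)
```

Taking the segment minimum biases each knot slightly below the true baseline when the baseline slopes within a segment, so the estimate tends to sit just under the signal floor rather than through the peaks.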

8.3 Peak modeling

The fundamental building block of a typical chromatogram is a roughly Gaussian shaped peak function. Peaks in the chromatogram are indicative of a specific chemical component, and their elution time and area are quantitative measures of the component. The steps leading to the determination of the concentration of a complex mixture (one that contains many such fundamental peaks) require first estimating the model parameters for each observed peak in the chromatogram. This section presents the results of the methods outlined in Chap. 3 applied to simulated data and to real data from measured chromatograms.

8.3.1 Initial estimates

The initial estimates of the peak model parameters are obtained by analyzing the actual chromatogram time series, the estimated baseline, and estimates of the derivatives of the chromatogram. Local negative minima of the second derivative indicate the presence of a peak and quantify the retention time. Several checks are used to confirm the presence of the peaks, including a negative threshold that establishes the bounds within which to search for local minima. The plots in Fig. 8.5 show a segment of a real chromatogram and the associated derivatives.

Figure 8.4 Example of the approximated baseline (counts versus time in minutes). A fifth order polynomial function is used as the model. Baseline variations are due to the changing equilibrium condition of the GC during an elution (primarily temperature).

Figure 8.5 Example of signal derivative estimates. The plots show segments of the original lowpass filtered chromatogram (top plot), the numerical estimate of the first derivative (middle plot), and the second derivative (bottom plot), all versus time in minutes. The derivatives are used to locate candidate peaks.

The remaining peak model parameters are estimated based on the actual chromatogram signal, once the locations of the peaks are established. First the peak amplitude above baseline is obtained, then the width at various fractions of the maximum amplitude. These values are then substituted into the equations of Sec. 3.2 to compute the peak model parameters. The plot in Fig. 8.6 shows the peaks found in the segment shown in Fig. 8.5 and estimates for the location, amplitude, and width of each peak.
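
The initial estimation steps of this section can be sketched in a few lines. The sketch assumes that peaks are flagged at negative local minima of a numerically estimated second derivative and that only the width at half maximum is measured; the function name, the threshold value, and the synthetic data are hypothetical, and the conversion to model parameters through the equations of Sec. 3.2 is not reproduced.

```python
import numpy as np

def initial_peak_estimates(t, y, baseline, d2_threshold):
    """Locate peaks at negative local minima of the second derivative."""
    d1 = np.gradient(y, t)
    d2 = np.gradient(d1, t)
    estimates = []
    for i in range(1, len(y) - 1):
        is_local_min = d2[i] < d2[i - 1] and d2[i] <= d2[i + 1]
        if not (is_local_min and d2[i] < d2_threshold):   # threshold is negative
            continue
        amplitude = y[i] - baseline[i]                    # height above baseline
        half_level = baseline + 0.5 * amplitude
        lo = i
        while lo > 0 and y[lo] > half_level[lo]:          # walk left to half maximum
            lo -= 1
        hi = i
        while hi < len(y) - 1 and y[hi] > half_level[hi]: # walk right to half maximum
            hi += 1
        estimates.append({"time": t[i], "amplitude": amplitude, "fwhm": t[hi] - t[lo]})
    return estimates

# Two synthetic Gaussian peaks on a flat baseline.
t = np.linspace(10.0, 13.0, 601)
baseline = np.full_like(t, 5000.0)
y = (baseline
     + 1000.0 * np.exp(-0.5 * ((t - 11.0) / 0.05) ** 2)
     + 500.0 * np.exp(-0.5 * ((t - 12.0) / 0.05) ** 2))
found = initial_peak_estimates(t, y, baseline, d2_threshold=-1.0e4)
```

The negative threshold rejects the shallow minima that numerical noise produces in flat regions of the second derivative, and it also rejects the positive-valued minimum that appears between two neighbouring peaks.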

8.3.2 Nonlinear least squares modeling

A fundamental component of the modeling process described in this research is least squares minimization. The model fitting process proceeds from the initial estimates obtained as shown in the previous section by using the Levenberg-Marquardt algorithm described in Sec. 5.2. In this section a simple example of the least squares procedure applied to a single peak with two free parameters will be given. The results from the procedure applied to a segment of a real chromatogram will also be shown. Finally, the results of applying this method in a sequential manner across an entire chromatogram are presented.

Figure 8.6 Chromatogram segment with peaks indicated by vertical stem plots (counts versus time in minutes). Location (elution time) and amplitude of the peak are indicated by a vertical line. Peak width at half maximum is indicated by a horizontal line segment.

Figure 8.7 Sum of squares error surface for a single peak (error versus area and time). The plot was generated by computing the sum of squared error between a true peak and peaks with parameters specified by the two axes. The minimum in this surface can be seen at Time = 6.4 and Area = 18.

The fitting of a Gaussian shaped peak model to a set of observed data is a nonlinear, iterative process because some of the model parameters are contained in the exponential component of the model. A simple simulation has been performed in which a time series is generated using a single peak function and then the least-squares procedure is used to recover the true model parameters from an initial guess. For illustrative purposes, only two of the model parameters are free to change, thus allowing the error surface to be displayed using a mesh plot. A time series of 256 points was generated using a peak center time of 6.4 units, an area of 18 units, and a sigma of 0.5. The surface in Fig. 8.7 shows the square root of the sum of squared errors between the true peak function and an array of candidate peaks with a range of parameters (Time in [2, 10], Area in [0, 30]). The goal of the least-squares minimization is to adjust the initial guess parameters in an iterative manner to reach the minimum of the surface.
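
An error surface of this kind can be computed with a small grid search. The sketch below assumes an area-normalized Gaussian peak and a time axis running from 0 to 12.8 units (only the number of points is stated in the text); the grid spacing is also an illustrative choice.

```python
import numpy as np

def gaussian_peak(t, center, area, sigma):
    """Area-normalized Gaussian peak."""
    return area / (sigma * np.sqrt(2.0 * np.pi)) * np.exp(-0.5 * ((t - center) / sigma) ** 2)

t = np.linspace(0.0, 12.8, 256)              # assumed time axis with 256 points
truth = gaussian_peak(t, 6.4, 18.0, 0.5)

centers = np.linspace(2.0, 10.0, 81)         # candidate peak times
areas = np.linspace(0.0, 30.0, 61)           # candidate peak areas
surface = np.empty((areas.size, centers.size))
for i, a in enumerate(areas):
    for j, c in enumerate(centers):
        resid = truth - gaussian_peak(t, c, a, 0.5)
        surface[i, j] = np.sqrt(np.sum(resid ** 2))   # root of the sum of squared errors

i_min, j_min = np.unravel_index(np.argmin(surface), surface.shape)
best_area, best_center = areas[i_min], centers[j_min]
```

A grid search like this is only practical for two free parameters; it is useful for visualizing the surface, while the iterative minimization is what scales to the many-parameter case.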

In this simple example the error surface is smooth with a single local minimum, but it does include two low-slope areas. Several initial guesses were randomly selected, and the progress of the model parameter values was tracked; two of these cases are plotted in Fig. 8.8. In both of these cases convergence was reached on the order of tens of function and gradient evaluations.
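
The iteration can be illustrated with a compact Levenberg-Marquardt loop for the two-parameter problem. This is a generic textbook form with Marquardt diagonal scaling, written as a sketch under assumptions (time axis from 0 to 12.8, damping factors of 10, one of the start points listed for Fig. 8.8); it is not the implementation described in Sec. 5.2.

```python
import numpy as np

SIGMA = 0.5
t = np.linspace(0.0, 12.8, 256)                  # assumed time axis

def unit_peak(center):
    """Unit-area Gaussian with fixed sigma."""
    return np.exp(-0.5 * ((t - center) / SIGMA) ** 2) / (SIGMA * np.sqrt(2.0 * np.pi))

def model(p):
    center, area = p
    return area * unit_peak(center)

def jacobian(p):
    """Columns: d(model)/d(center), d(model)/d(area)."""
    center, area = p
    u = unit_peak(center)
    return np.column_stack([area * u * (t - center) / SIGMA ** 2, u])

def levenberg_marquardt(p0, y, lam=1e-2, max_iter=200, tol=1e-10):
    p = np.asarray(p0, dtype=float)
    cost = np.sum((y - model(p)) ** 2)
    path = [p.copy()]
    for _ in range(max_iter):
        r = y - model(p)
        J = jacobian(p)
        JtJ = J.T @ J
        damped = JtJ + lam * np.diag(np.diag(JtJ)) + 1e-12 * np.eye(2)
        step = np.linalg.solve(damped, J.T @ r)
        if np.linalg.norm(step) < tol:
            break
        trial = p + step
        trial_cost = np.sum((y - model(trial)) ** 2)
        if trial_cost < cost:                    # accept: move and relax the damping
            p, cost, lam = trial, trial_cost, lam * 0.1
            path.append(p.copy())
        else:                                    # reject: increase the damping
            lam *= 10.0
    return p, path

observed = model((6.4, 18.0))
p_hat, path = levenberg_marquardt((5.9, 5.0), observed)
```

The list `path` holds the accepted intermediate estimates, which is the kind of trajectory drawn over the contour plot in Fig. 8.8.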

In the analysis of a sample chromatogram, the nonlinear least-squares model fitting approach is applied to overlapping segments with four to seven peaks per segment. This operation is much more complicated due to the simultaneous modification of approximately 30 model parameters. An example of this process is shown in Fig. 8.9 for a segment of a chromatogram. The plot in Fig. 8.10 shows the results for an entire chromatogram. A comparison to two popular commercial products shows that the model fitting method gives comparable results. An absolute comparison is not possible because the true peak areas of this actual environmental sample are unknown. A simulated chromatogram will be used in the next section to compare the accuracy of the model fitting method to a commercial offering (HP Chemstation [13]).

Figure 8.8 Contour plot of the error surface for the single peak modeling operation (area versus time, with the start point and solution marked). The plot on the left shows an initial start point of (5.9, 5) and the right plot has an initial start point of (4.5, 24). The solid line with circle symbols represents the sequence of intermediate parameter estimates obtained during the least squares minimization.

Figure 8.9 Plots showing the accuracy of the peak modeling process. The top plot shows actual data points from a sample chromatogram (+) and the fitted model consisting of the sum of seven individual peaks (solid line), in counts versus time in minutes. The percent error between the actual and modeled signals is shown in the lower plot.

Figure 8.10 Stem plot of the peak times and corresponding areas (area in count-min versus time in minutes). The solid stems and circles indicate the model fitting approach, “X” the commercial Target results, and “*” the commercial TurboChrome results.

8.3.3 Noise analysis and accuracy comparison

The accuracy of any method designed to extract peak information is influenced by the level of signal noise in the measured time series. This section describes an analysis of the accuracy of the modeling approach and a commercial product under varying signal-to-noise ratios (SNRs). As mentioned in the previous section, the ground truth values for the peak parameters of actual peaks generated with a GC are not available. Therefore, the peak model described in Chap. 3 is used to generate a simulated chromatogram that is used as the base time series for analysis. The simulated chromatogram includes 100 EMG peaks that are centered at one-minute intervals (0.5, 1.5, ..., 99.5) over a total run time of 100 minutes. The Gaussian sigma and exponential decay, tau, are each varied in ten steps, over ranges of 0.01 to 0.055 (0.01, 0.015, ..., 0.055) and 0.0 to 0.09 (0, 0.01, ..., 0.09) respectively. In addition, five separate chromatograms with area values of 20, 40, 60, 80, and 100 count-min were generated. An example of a chromatogram with a constant peak area of 80 count-min is shown in Fig. 8.11.
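
A simulated chromatogram of this kind can be generated as sketched below. The sketch uses the standard closed form of the exponentially modified Gaussian (EMG), which may be parameterized differently from the peak model of Chap. 3. The pairing of sigma and tau values with peak positions is an assumption, as is the sampling interval of 1/300 min (about 3.33e-3 min).

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc, otypes=[float])

def emg_peak(t, center, area, sigma, tau):
    """Exponentially modified Gaussian; a plain Gaussian when tau == 0."""
    out = np.zeros_like(t, dtype=float)
    offset = t - center
    # Evaluate only near the peak; this also avoids overflow far to the left.
    near = (offset > -10.0 * sigma) & (offset < 10.0 * sigma + 40.0 * tau)
    d = offset[near]
    if tau == 0.0:
        out[near] = area / (sigma * math.sqrt(2.0 * math.pi)) * np.exp(-0.5 * (d / sigma) ** 2)
    else:
        expo = sigma ** 2 / (2.0 * tau ** 2) - d / tau
        z = (sigma / tau - d / sigma) / math.sqrt(2.0)
        out[near] = area / (2.0 * tau) * np.exp(expo) * _erfc(z)
    return out

step = 1.0 / 300.0                               # assumed sampling interval (min.)
t = np.arange(30000) * step                      # 100 minute run
sigmas = 0.01 + 0.005 * np.arange(10)            # 0.01, 0.015, ..., 0.055
taus = 0.01 * np.arange(10)                      # 0.0, 0.01, ..., 0.09
params = [(s, tau) for s in sigmas for tau in taus]   # pairing order is an assumption
centers = 0.5 + np.arange(100)                   # 0.5, 1.5, ..., 99.5

chrom = np.zeros_like(t)
for c, (s, tau) in zip(centers, params):
    chrom += emg_peak(t, c, 80.0, s, tau)        # constant area of 80 count-min
```

Since every peak carries the same area, the integral of the whole series should be close to 100 times the chosen area, which gives a quick sanity check on the generator.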

Gaussian white noise was added to these base chromatograms at levels chosen to obtain SNRs of 0.0, 20.0, 40.0, and 60.0 dB. Twenty individual chromatograms were generated in total (five areas at four SNRs), containing 2000 individual peaks. This large sample size was beneficial in assessing the range of performance for both the modeling algorithm and the commercial system. All of the resulting chromatograms were saved in the AIA NetCDF format to facilitate subsequent processing.
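
The noise addition can be sketched as follows. The sketch assumes that SNR is defined as ten times the base-10 logarithm of the ratio of mean signal power to noise power over the whole series; the definition actually used in this work may differ, and the clean signal here is a synthetic stand-in.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise so that 10*log10(P_signal / P_noise) equals snr_db."""
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / 10.0 ** (snr_db / 10.0)
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)

def measured_snr_db(clean, noisy):
    """SNR recovered from a clean/noisy pair, using the same power-ratio definition."""
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))

rng = np.random.default_rng(1998)
t = np.arange(30000) / 300.0
clean = sum(1000.0 * np.exp(-0.5 * ((t - c) / 0.03) ** 2) for c in (20.5, 50.5, 80.5))
noisy = {snr: add_white_noise(clean, snr, rng) for snr in (0.0, 20.0, 40.0, 60.0)}
```

Under this definition an SNR of 0 dB means that the noise power equals the mean signal power of the series.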

The set of chromatograms was processed using the model-fitting algorithms discussed, and the results were tabulated. Since the true model parameters were known, the actual (signed) and absolute errors were included in the tabulation. The set was also processed using the automatic integrator included in the HP Chemstation software, and the results from this system were included in the tabulation. In this case only the peak retention time (RT) and area were reported due to the limitations of the software (a peak width is generated, but it does not correspond to the sigma model parameter). At the highest noise level (SNR = 0 dB), the model-based approach performed very well in comparison to the commercial offering.

Figure 8.11 Example of a typical simulated chromatogram (counts versus time in minutes). These chromatograms are used in the noise and accuracy analysis. The peak area is 80 count-min. and the sigma and tau values range from 0.01 to 0.055 and 0 to 0.09 respectively.

A single peak from the chromatogram with area equal to 80 and SNR equal to 0.0 is shown in Fig. 8.12. Even at this high noise level the model converged to a very close approximation to the true underlying signal. Another example of a typical peak from a chromatogram with a SNR of 40 is shown in Fig. 8.13. This is approximately the level of noise encountered in most of the chromatograms obtained from ORNL.

Summary results from this analysis have been tabulated and show that the modeling approach is robust to noise and considerably more accurate than the commercial results. In most cases the error of the model-estimated RT was an order of magnitude less than the sampling interval (3.33e-3 min.). In contrast, the commercial system generated RT errors approaching an order of magnitude greater than the sampling interval (7-8 times). In all cases the RT estimate generated with the model approach was more accurate than that of the commercial system. These results are shown in Table 8.1 and Table 8.2 and graphically in Fig. 8.14 and Fig. 8.15.

Table 8.1: Average absolute error, model algorithm

SNR (dB)   RT (min.)   Area (count-min.)   Sigma (min.)   Tau (min.)
0          0.01570     10.39444            0.01044        0.02685
20         0.00239     0.59100             0.00062        0.00213
40         0.00024     0.04013             0.00004        0.00038
60         0.00004     0.00519             0.00001        0.00010

Table 8.2: Average absolute error, commercial system

SNR (dB)   RT (min.)   Area (count-min.)
0          0.024546    256.5699896
20         0.023334    28.02913724
40         0.02348     1.806819218
60         0.021918    0.155671765

Figure 8.12 Example of the modeling results for noisy data (counts versus time in minutes). The plot shows the raw data, the ideal (true) signal, and the modeled peak at a SNR of 0.

Figure 8.13 Example of the raw data and the modeled peak at a SNR of 40 (counts versus time in minutes). This is approximately the level of noise encountered in chromatograms generated at ORNL on standard samples.

Figure 8.14 Comparison of the average RT error for a range of SNRs (average absolute error in minutes versus SNR in dB, commercial and modeled). The error level decreases as the noise level decreases for the model approach but stays constant for the commercial system.

Figure 8.15 Comparison of the average area error for a range of SNRs (average absolute error in count-min versus SNR in dB, commercial and modeled). The error level is significantly less with the model based approach, especially at the high noise levels.

A more significant difference between the two methods was seen in the estimates of the peak area at every noise level. At the highest noise level the average absolute error was approximately twenty-five times greater for the commercial system than for the model based approach. That trend continued across all the noise levels. A comparison of the percent area error across all the area levels at a SNR of zero is shown in Table 8.3.

Two other trends were observed in analyzing these results: the commercial system detected a large number of spurious peaks, and its error in the area parameter increased as the true peak tailing increased (larger tau). For the highest noise chromatograms, the commercial system generated three times the actual number of peaks and between 110 and 150% more peaks than were in the base chromatogram. The average area error for the commercial system was linearly related to the value of tau, and the change represented approximately 50% of the base error at tau equal to zero (pure Gaussian). In contrast, the peak parameter errors obtained from the model based approach were consistent across the range of RT, area, sigma, and tau. As expected, the errors decreased as the SNR increased.

Table 8.3: Average percent error at SNR = 0

Area (count-min.)   Model algorithm   Commercial system
20                  14.32935635       258.25753
40                  17.6865644        378.76231
60                  21.66566152       329.43760
80                  15.47984855       328.80415
100                 16.6484185        618.98764

8.4 Analyte modeling

A key component of the integrated chromatogram modeling analysis approach is the generation of individual analyte models. These models will be used to generate the complete observed chromatogram as described in Sec. 4.1. In this section the results of performing the calibration process on a set of calibration standards will be given. In addition, the results of the statistical tests used to determine which peaks should be included in the model are given. Based on the results presented in this section, the relationship of peak area and concentration can be determined and modeled on a per peak basis to generate a complete analyte model.

8.4.1 Calibration process

The off-line calibration procedure proceeds from the data obtained by extracting the peak area estimates from a set of calibration standards of known concentration. These data can be represented in a two-dimensional matrix with the number of rows equal to the total number of standards in the group and the number of columns equal to the union of all peaks found across the calibration set. A similar matrix of the known analyte concentrations in each standard can be created where the columns are the concentrations of each analyte of interest. One requirement of this procedure is that any given standard contain only a single analyte.

In the next step the peak areas for a given analyte are regressed onto the known concentrations using models of zero, first, and second order. The statistical test described in Sec. 4.1.4 compares the ratio of the two models' residual sums of squares to an F statistic to determine whether the more complex model is justified. The plot in Fig. 8.16 shows how three different peaks exhibit different relationships to concentration. One peak has a small constant peak area for all concentration amounts, while the second and third peaks have linear and quadratic relationships respectively. An example of a set of peak areas with a quadratic relationship is shown in Fig. 8.17. For this peak, the correlation coefficient is 0.9971 for the linear fit and 0.9999 for the quadratic fit. While both of these are relatively high values, the residual sum of squared errors is 654 for the linear fit and 8 for the quadratic fit, and the maximum error is on the order of 5% for the linear fit.
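
The model order selection can be illustrated with the common extra-sum-of-squares form of the F test for nested polynomial models. This form may differ in detail from the test of Sec. 4.1.4, and the concentrations and areas below are made-up values, not calibration data from this work.

```python
import numpy as np

def residual_ss(x, y, order):
    """Residual sum of squares of a least squares polynomial fit."""
    coeffs = np.polyfit(x, y, order)
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

def f_statistic(x, y, low_order, high_order):
    """Extra-sum-of-squares F statistic for two nested polynomial models."""
    n = len(x)
    rss_low = residual_ss(x, y, low_order)
    rss_high = residual_ss(x, y, high_order)
    df_num = high_order - low_order              # extra parameters in the larger model
    df_den = n - (high_order + 1)                # residual degrees of freedom
    return ((rss_low - rss_high) / df_num) / (rss_high / df_den)

# Hypothetical five-point calibration with a mildly quadratic response.
conc = np.array([0.05, 0.1, 0.2, 0.4, 0.8])
area = 900.0 * conc - 300.0 * conc ** 2 + np.array([1.0, -1.5, 0.5, 1.0, -1.0])

f_lin_vs_quad = f_statistic(conc, area, 1, 2)
F_CRIT_1_2 = 18.51                               # tabulated 95% point of F(1, 2)
quadratic_justified = f_lin_vs_quad > F_CRIT_1_2
```

With only five standards per analyte the denominator has two degrees of freedom, so the critical value is large and the more complex model is accepted only when it reduces the residual substantially.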

Figure 8.16 Plot of the peak areas vs. analyte concentration for three different peaks (area in counts-min versus concentration in ng/ml). The constant peak shows no relationship to concentration and the other peaks exhibit linear and quadratic relationships.

8.4.2 Significant peak determination

A final step in the analyte modeling is the determination of which peaks should be included in the model. A γ criterion is used for this test and the results are shown in the image in Fig. 8.18 for three analytes. This image represents the peaks that are included in the analyte model as gray rectangular patches at their corresponding retention times. Note the general trend of the bulk of the included peaks to increase in retention time with each analyte. Another point to notice is that many of the peaks are included in multiple analyte models. The result of this complete analysis is a set of model coefficients that will enable the generation of a complete analyte chromatogram based on a single concentration specification.

Figure 8.17 Comparison between quadratic and linear models for a set of peak areas (data points, linear model, and quadratic model; area in counts-min versus concentration in ng/ml). The statistical test indicated that a quadratic model was significantly better than the linear model.

8.4.3 Example analyte models

A calibration process was performed on a set of fifteen standards (five standards for each analyte: Aroclors 1242, 1254, and 1260) generated by the ORNL analytical services organization. Peak parameters were estimated and used to determine the relationship between analyte concentration and peak area. Models for each of the three analytes were built and comparisons to the original standards performed. The results, shown in Fig. 8.19, Fig. 8.20, and Fig. 8.21, demonstrate that the analyte model-building process worked well. In all cases the maximum peak difference between the measured signal and the model is under 8% and the average error is approximately 0.2%.

Figure 8.18 Image indicating the peaks included in the three analyte models (rows: Analyte 1, Analyte 2, Analyte 3; horizontal axis: time in minutes). The gray areas represent a peak at the corresponding retention time that has passed the γ test.

Figure 8.19 Model and measured signal for Aroclor 1242 standard. The standard represents a concentration of 200 ppb. The lower plot shows the percent difference between the measured and modeled signals.

[Figure 8.20: measured and model signals (counts) and percent difference versus time (min.)]

[Figure 8.21: measured and model signals (counts) and percent difference versus time (min.)]

8.5 Chromatogram modeling

The final objective in the analysis of gas chromatogram time series is to quantitatively determine the concentration of various analytes of interest. In this section the techniques established for the modeling of the fundamental components of a chromatogram will be combined to provide a means of determining concentration for analytes contained in unknown mixtures. The results presented in this section show that the hypothesis of modeling a sample chromatogram to determine analyte concentrations is valid. Examples of a synthesized chromatogram will show how the complete chromatogram model can be used to generate an artificial chromatogram based on a desired analyte concentration, interference peaks, and baseline. An example result of fitting a complete chromatogram model to an actual chromatogram will also be shown.

8.5.1 Complete model

A complete chromatogram model has been developed using the model parameter data obtained from the processing shown in the previous sections. This model includes three analytes and a baseline function. A simulated chromatogram with a concentration of 200 ppb of Aroclor 1242 and 400 ppb of Aroclor 1260 is shown in Fig. 8.22. This artificial chromatogram was generated by specifying only the analyte concentrations, peak times, and baseline polynomial coefficients. The ability to simulate chromatograms at any arbitrary concentration and with any baseline or interferences will enable a quantitative comparison of the various analysis techniques. In addition, these simulated chromatograms can be used to generate an inexpensive training set for an artificial neural network concentration determination method or other methods that might need a larger training set than is typically available.
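
The way a complete model turns a concentration specification into a time series can be sketched as follows. Everything specific in the sketch is hypothetical: the analyte names, peak times, widths, area polynomials, and baseline coefficients are placeholders, and plain Gaussian peaks stand in for the peak model of Chap. 3.

```python
import numpy as np

def gaussian(t, center, area, sigma):
    """Area-normalized Gaussian peak (a simplification of the full peak model)."""
    return area / (sigma * np.sqrt(2.0 * np.pi)) * np.exp(-0.5 * ((t - center) / sigma) ** 2)

# Hypothetical analyte models: (retention time, sigma, area polynomial in concentration),
# with polynomial coefficients ordered from the highest power down.
ANALYTES = {
    "A": [(12.0, 0.05, [0.0, 2.0, 0.0]), (14.5, 0.06, [-0.001, 1.5, 0.0])],
    "B": [(22.0, 0.05, [0.0, 1.0, 0.0]), (24.0, 0.07, [0.0, 0.8, 5.0])],
}

def synthesize(t, concentrations, baseline_coeffs):
    """Baseline polynomial plus every peak of every analyte at its modeled area."""
    y = np.polyval(baseline_coeffs, t)
    for name, conc in concentrations.items():
        for center, sigma, area_poly in ANALYTES[name]:
            area = np.polyval(area_poly, conc)   # per peak area versus concentration model
            y = y + gaussian(t, center, area, sigma)
    return y

step = 1.0 / 300.0
t = np.arange(3000, 9001) * step                 # 10 to 30 minutes
baseline_coeffs = [0.5, -20.0, 6200.0]           # placeholder baseline polynomial
y = synthesize(t, {"A": 200.0, "B": 400.0}, baseline_coeffs)
peak_area_total = float(np.sum(y - np.polyval(baseline_coeffs, t)) * step)
```

Because each peak area is driven by its own polynomial in concentration, changing a single concentration value rescales every peak of that analyte in a peak-specific way rather than uniformly.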

Figure 8.22 Simulated chromatogram for mixture sample (counts versus time in minutes). The mixture consists of 200 ppb Aroclor 1242 and 400 ppb Aroclor 1260. The chromatogram model was developed with individual standard chromatograms.

8.5.2 Least squares fit

The complete model can also be used in a least-squares error minimization procedure to adjust the analyte concentrations to obtain the best fit to an unknown chromatogram. As a first order example of this process, a mixture with specified concentrations of 50 ppb Aroclor 1242 and 100 ppb Aroclor 1260 was selected from a set of data generated at Oak Ridge National Laboratory. The model developed in the previous section was used as the basis for the fit. This operation yielded concentrations of 57 ppb and 108 ppb for Aroclors 1242 and 1260 respectively. A comparison to several of the other methods for the specified sample is shown in Table 8.4. As can be seen from the data in the table, the modeling approach performed well on this single sample. It should be noted that the actual concentration of the Aroclors is probably only accurate to 5% due to the manual preparation of the standard samples. Also note the “importance” measure on the last line of the table. This is a normalized measure of the Euclidean distance between the original time series and the series generated with the model. The value is calculated using

$$\mathrm{Imp} = 1 - \frac{\sum (\mathrm{act} - \mathrm{model})^2}{\sum \mathrm{act}^2}, \qquad (8.1)$$

which is a scalar measure of how well the modeled data fit the actual data.

Table 8.4: Comparison of analytical methods for a single sample

Analyte        Actual (ppb)   Model   PCR raw   PCR peak   MLR peak   ANN peak   SLE raw
Aroclor 1242   50             57      87        69         61         60         66
Aroclor 1254   0              0       -14       29         81         1          0
Aroclor 1260   100            108     136       95         127        107        120

Importance: 0.98 0.88 0.74 0.20 -0.97 0.83 0.90

A segment of the actual chromatogram and the fitted chromatogram is shown in Fig. 8.23.

The fit is very good, with a maximum peak percent difference of less than five percent.
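
The importance measure of Eq. 8.1 and a simplified version of the concentration fit can be sketched together. The sketch assumes a baseline-subtracted signal and analyte responses that scale linearly with concentration, so that ordinary linear least squares applies; the complete model uses per peak area relationships and a nonlinear fit, and the responses below are synthetic.

```python
import numpy as np

def importance(actual, model):
    """Eq. 8.1: one minus the normalized squared distance between the two series."""
    return 1.0 - np.sum((actual - model) ** 2) / np.sum(actual ** 2)

def fit_concentrations(actual, unit_responses):
    """Least squares concentrations; each column is a response at unit concentration."""
    conc, *_ = np.linalg.lstsq(unit_responses, actual, rcond=None)
    return conc

t = np.linspace(10.0, 30.0, 2001)

def peak(center, sigma=0.05):
    """Unit-area Gaussian used to build the hypothetical analyte responses."""
    return np.exp(-0.5 * ((t - center) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

resp_a = peak(12.0) + 0.6 * peak(14.5)           # hypothetical analyte A pattern
resp_b = peak(22.0) + 0.8 * peak(24.0)           # hypothetical analyte B pattern
unit_responses = np.column_stack([resp_a, resp_b])

# "Measured" signal: 50 and 100 units of A and B plus a small unmodeled disturbance.
actual = 50.0 * resp_a + 100.0 * resp_b + 0.5 * np.sin(3.0 * t)
conc = fit_concentrations(actual, unit_responses)
imp = importance(actual, unit_responses @ conc)
```

A value of 1 means that the model reproduces the series exactly, 0 corresponds to an all-zero model, and negative values arise when the model is further from the data than zero would be.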

Figure 8.23 Accuracy comparison of a complete chromatogram model. The solid line represents the actual data, the dotted line is the initial MLR derived estimate, and the dashed line is the model (counts versus time in minutes). The bottom plot shows the percent difference between the measured and modeled signals.

Figure 8.24 Zoomed in area of a difficult to fit mixture sample. The top plot shows a comparison of an actual sample chromatogram and the least squares fitted model (counts versus time in minutes). The solid line represents the actual data and the dashed line the model. The bottom plot shows the percent difference between the measured and modeled signals. The poor fit in the tail could be a result of the tau parameter being constrained to an artificially low value.

The procedure used to process the sample above was applied to a much larger set of samples generated by ORNL. The set includes thirty samples composed primarily of mixture samples of the various Aroclors. One of the more difficult samples, a mixture of 0.2, 0.05, and 0.8 ug/ml of Aroclors 1242, 1254, and 1260 respectively, produced a result of 0.209, 0.037, and 0.840 ug/ml. The area of the worst fit in this chromatogram is shown in Fig. 8.24. The error is primarily due to lack of fit in the peak tail. This is a difficult sample to analyze with traditional methods due to the overlap of the 1242 and 1260 peaks into the 1254 region. The other algorithms reported artificially high values for Aroclor 1254. Results from the entire set were tabulated and are listed in Appendix B. Several summary statistics from the entire set are given in Table 8.5. As can be seen from these results, the model-based approach performed on average better than the other methods. The RMS measure was significantly better for the 1254 Aroclor. It should be noted that the model-based method is intended to complement the existing methods rather than replace them. If one or several methods perform well, their results will be included in the final result based on the results fusion module.

Table 8.5: Summary results of individual methods

          Average error (ug/ml)             RMS error (ug/ml)
Method    1242       1254       1260        1242       1254       1260
PCR-R     0.010772   0.006432   0.018157    0.024692   0.040045   0.027843
PCR-P     0.005157   0.03647    -0.00402    0.026472   0.044724   0.026932
MLR-P     -0.00244   0.17531    0.161284    0.019989   0.229429   0.231752
SLE-R     0.004425   -0.24653   -0.15033    0.01917    0.406205   0.38624
Model     0.00503    0.000453   0.012567    0.019089   0.026582   0.024284

8.6 Results fusion

The final step in the complete data interpretation process is to intelligently combine the results obtained from the various concentration determination methods. As described in Chap. 6, a fuzzy logic based results fusion technique combines the individual concentrations and confidence intervals generated by each method. A complete set of 45 chromatograms has been analyzed with five analytical methods and the results input into the fusion algorithm. This section will discuss the generation of the weighting factors for each method and provide performance data.

8.6.1 Weight determination

The membership functions and rules required for the fuzzy inference were defined with an analytical chemist at ORNL acting as the expert. For the membership functions the chemist provided reasonable values and curve shapes for the various fuzzy sets, such as absent and present for the analyte concentration and low, medium, and high for the importance. The defined membership functions are shown in Fig. 8.25. An example of a typical rule in the inference system is “IF Anal_1 is present AND Anal_2 is present AND Anal_3 is absent AND Importance is medium THEN Weight is zero.” This particular rule encapsulates some expert knowledge that the analytical method does not perform well when two adjacent analytes are present.
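
The evaluation of such a rule can be sketched as follows. The breakpoints of the membership functions are invented for illustration (the expert-defined shapes are those of Fig. 8.25), and the AND operator is assumed to be the minimum, which is a common but not the only choice.

```python
import numpy as np

def ramp_up(x, a, b):
    """Membership that is 0 below a, 1 above b, and linear in between."""
    return float(np.clip((x - a) / (b - a), 0.0, 1.0))

def ramp_down(x, a, b):
    """Complement of ramp_up."""
    return 1.0 - ramp_up(x, a, b)

def triangle(x, a, b, c):
    """Triangular membership rising from a, peaking at b, and falling to c."""
    return float(max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0))

# Hypothetical fuzzy sets for concentration (ug/ml) and importance.
def present(conc):
    return ramp_up(conc, 0.02, 0.10)

def absent(conc):
    return ramp_down(conc, 0.02, 0.10)

def medium(imp):
    return triangle(imp, 0.0, 0.5, 1.0)

def rule_strength(anal_1, anal_2, anal_3, imp):
    """Firing strength of: Anal_1 present AND Anal_2 present AND Anal_3 absent AND Importance medium."""
    return min(present(anal_1), present(anal_2), absent(anal_3), medium(imp))

strong = rule_strength(0.2, 0.2, 0.0, 0.5)       # all antecedents fully satisfied
partial = rule_strength(0.2, 0.06, 0.0, 0.5)     # Anal_2 only partly present
```

In a full inference system the firing strengths of all rules would be combined and defuzzified to give the crisp weight for the method; only the antecedent evaluation is shown here.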

Figure 8.25 Input membership functions. These MFs map analyte concentration (fuzzy sets absent and present) and importance factor (fuzzy sets low, medium, and high) to a degree of membership in their respective fuzzy sets.

The fuzzy combination weights for a single sample are given in Table 8.6. This table shows the result of executing a method specific fuzzy inference system on the results from the respective method. A crisp weighting factor is listed in the second column and will be used in the weighted average of the concentrations and confidence intervals. The last two rows list the combined results and the “ground truth” values.
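
The weighted average of the concentrations can be checked directly against the values of Table 8.6. The sketch reads the table with one MLR-P weight per analyte and with the SLE-R entries taken as concentrations without confidence values; the confidence columns are not treated here.

```python
# (fuzzy weight, reported concentration in ppb) for each method, from Table 8.6.
# Order: PCR-R, PCR-P, MLR-P, ANN-P, SLE-R.
analyte_1 = [(0.74, 789), (0.50, 764), (0.85, 780), (0.81, 790), (0.94, 784)]
analyte_2 = [(0.74, 27), (0.50, 123), (0.46, 429), (0.81, 49), (0.94, 20)]
analyte_3 = [(0.74, 823), (0.50, 752), (0.89, 840), (0.81, 814), (0.94, 816)]

def fuse(pairs):
    """Weighted average of the reported concentrations."""
    total_weight = sum(w for w, _ in pairs)
    return sum(w * c for w, c in pairs) / total_weight

fused = [round(fuse(p)) for p in (analyte_1, analyte_2, analyte_3)]
```

Rounded to the nearest ppb, the three weighted averages agree with the concentrations listed in the Fuzzy row of the table (783, 98, and 814 ppb).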

8.6.2 Sample analysis

A general trend of improved accuracy and performance was observed for the analyzed data set, especially for the mixture samples. An example of such a mixture is shown in Fig. 8.26. In this example the first abscissa value represents a single method, MLR, the second represents a straight average, and the third is the fuzzy weighted average. The reported fuzzy fused concentrations are closer to the true values, and the confidence interval is smaller than for the other methods.

Table 8.6: Reported concentrations and fused results for a single sample

                  Analyte 1           Analyte 2           Analyte 3
Method   Fuzzy    Conc.     Conf.     Conc.     Conf.     Conc.     Conf.
         weight   (ppb)               (ppb)               (ppb)
PCR-R    0.74     789       31        27        35        823       28
PCR-P    0.50     764       46        123       47        752       52
MLR-P    0.85     780       42
MLR-P    0.46                         429       23
MLR-P    0.89                                             840       21
ANN-P    0.81     790       9         49        27        814       14
SLE-R    0.94     784                 20                  816
Fuzzy             783       34        98        33        814       29
Actual            800                 50                  800

Figure 8.26 Comparison of accuracy and precision for results fusion (panels for Aroclor 1242, Aroclor 1254, and Aroclor 1260; abscissa positions MLR, Average, and Fuzzy). A single analytical method and two methods of results fusion are shown for a mixture sample. The fuzzy logic based fusion has greater accuracy and precision.

A root mean square error for each Aroclor was calculated to compare the overall performance of the fuzzy based fusion. The results are listed in Table 8.7 and indicate the improved performance of the fuzzy method.

Table 8.7: RMS error over entire data set for different methods

Method   Analyte 1 (ppb)   Analyte 2 (ppb)   Analyte 3 (ppb)
MLR      20                229               232
Avg.     17                65                79
Fuzzy    16                41                47

CHAPTER 9

CONCLUSIONS

In this chapter an overview of the research is given and placed in the context of the contributions to the field of quantitative analysis and pattern recognition. Results from this work have demonstrated improved processing capabilities for complex, mixture based chromatography analysis. Several significant software systems have been developed and tested both in simulated and real world environments. Finally, the research has spurred other potential areas of investigation which could build on the foundation laid by this work.

9.1 Summary

A unified system for the complete, automated, and quantitative analysis of time series data generated by gas chromatography has been developed. This system takes as input the raw time series, run-specific information, and specific analytes of interest, and it generates a quantitative measure of the amount, with a confidence interval about the result, for each of the analytes of interest. All of the intermediate steps required for this processing have been addressed in this research.

The system level control and communication software necessary for the automatic processing of data files has been developed and implemented in a standard UNIX environment. This software enables a remote supervisory and task sequencing program to issue high level requests and instructions to the data processing module. The system level software provides the framework for orchestration of the necessary computational modules, the data management, and the exchange of information required for automatic processing. Industry standard communication protocols are used for the communication and data transfer between the master controller and other standard laboratory modules (e.g., analytical instruments).

A suite of signal processing algorithms has been implemented to facilitate the extrac-

tion of information from the raw signals in a novel way. The first step in the processing is

to estimate the noise level in the signal and to filter based on the level. Using knowledge

that the relevant signal information is contained in the lower frequencies, the cutoff fre-

quency is determined using an estimate of the power spectral density. A zero–phase FIR

filter is then applied to the input signal. The retention time end markers are then located

and modeled, based on a predefined acceptance region. Once the region of interest has

been established, the baseline and peak parameter estimation processing is initiated. A

unique–region–based method is used to select the knot points for a low order polynomial

estimation of the baseline. Approximations of the first and second derivatives of the fil-

tered input are calculated and used to estimate peak locations. Empirical relations between

measurable signal attributes and peak parameters are used to generate the initial estimates

of the nonlinear peak–modeling algorithm. A sliding window approach is used to simulta-

neously fit multiple peaks across the time series using a robust nonlinear least squares

minimization algorithm.
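Two of the steps above, zero–phase filtering and derivative based peak location, can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the MATLAB implementation. The kernel is a hypothetical triangular smoother rather than a filter designed from the power spectral density estimate, and the names `zero_phase_fir` and `find_peaks` are invented for this example.

```python
import math

def zero_phase_fir(signal, taps):
    """Apply a symmetric, odd-length FIR kernel centred on each sample.

    A centred symmetric kernel adds no phase shift, so peak positions are
    preserved. The ends are padded by repeating the first and last samples.
    """
    if len(taps) % 2 == 0 or list(taps) != list(reversed(taps)):
        raise ValueError("taps must be symmetric with odd length")
    half = len(taps) // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [sum(c * padded[i + k] for k, c in enumerate(taps))
            for i in range(len(signal))]

def find_peaks(y, dt, min_height):
    """Return indices where the first derivative changes from positive to
    non-positive, the second derivative is negative, and y >= min_height."""
    n = len(y)
    d1 = [0.0] * n
    d2 = [0.0] * n
    for i in range(1, n - 1):
        d1[i] = (y[i + 1] - y[i - 1]) / (2 * dt)          # central difference
        d2[i] = (y[i + 1] - 2 * y[i] + y[i - 1]) / dt ** 2
    peaks = []
    for i in range(1, n - 2):
        if d1[i] > 0 and d1[i + 1] <= 0:
            k = i if y[i] >= y[i + 1] else i + 1          # larger of the pair
            if d2[k] < 0 and y[k] >= min_height:
                peaks.append(k)
    return peaks

# Synthetic two-peak signal (hypothetical values, arbitrary units)
dt = 0.01
t = [i * dt for i in range(701)]
raw = [math.exp(-(x - 2.0) ** 2 / 0.08) + 0.5 * math.exp(-(x - 5.0) ** 2 / 0.08)
       for x in t]
smooth = zero_phase_fir(raw, [1 / 9, 2 / 9, 3 / 9, 2 / 9, 1 / 9])
print([round(t[k], 2) for k in find_peaks(smooth, dt, min_height=0.1)])
```

Because the kernel is symmetric and centred, the located maxima of the smoothed signal fall on the same samples as those of the raw signal, which is the property the zero–phase filter is chosen for.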

The resulting estimates of the key peak parameters are then used for both the off-line analyte model building process and the initial analyte concentration estimates for the complete chromatogram modeling operation. Statistical measures are used to determine the relationship between a given peak and analyte. Then a per–peak quadratic model is generated to relate the concentration of an analyte to the peak’s area. A complete analyte model is then generated using the summation of the included peaks. Multiple linear regression using the estimated peak areas is performed to generate the initial analyte concentrations, followed by a least–squares fit of the multiple analyte models to the raw measured signal.
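The structure of an analyte model can be illustrated with the Python sketch below. It assumes that each peak's area is a quadratic function of concentration and that each peak has the EMG shape of Appendix A; the direction of the quadratic relation, the coefficient values, and all function names are assumptions made for illustration only.

```python
import math

def emg(t, area, t_g, sigma, tau):
    """Exponentially modified Gaussian with total area `area` (Eqn. 0.1 form)."""
    expo = sigma ** 2 / (2 * tau ** 2) - (t - t_g) / tau
    arg = sigma / (math.sqrt(2) * tau) - (t - t_g) / (math.sqrt(2) * sigma)
    return area / (2 * tau) * math.exp(expo) * math.erfc(arg)  # erfc = 1 - erf

def peak_area(conc, coeffs):
    """Hypothetical per-peak quadratic calibration: area = c0 + c1*x + c2*x^2."""
    c0, c1, c2 = coeffs
    return c0 + c1 * conc + c2 * conc ** 2

def analyte_signal(t, conc, peaks):
    """Analyte model: sum of the analyte's peaks at concentration `conc`."""
    return sum(emg(t, peak_area(conc, p["coeffs"]), p["t_g"], p["sigma"], p["tau"])
               for p in peaks)

def chromatogram(t, concs, analytes):
    """Complete model: sum of all analyte models at time t."""
    return sum(analyte_signal(t, c, peaks) for c, peaks in zip(concs, analytes))

# Two hypothetical peaks belonging to one analyte
peaks = [
    {"coeffs": (0.0, 2.0, 0.01), "t_g": 3.0, "sigma": 0.1, "tau": 0.3},
    {"coeffs": (0.0, 1.0, 0.0), "t_g": 6.0, "sigma": 0.1, "tau": 0.3},
]
print(chromatogram(3.2, [10.0], [peaks]))
```

In a fitting step, the concentrations passed to `chromatogram` would be the free parameters adjusted by the least–squares routine; that step is omitted here.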

The final phase in the processing is the fusion of the results from multiple analytical methods. Several analytical processing algorithms have been implemented by CAA team members and included in this processing system. The goal is to increase the accuracy and precision of the final analyte concentration estimates by intelligently combining the results of multiple methods, each with advantages under certain sample scenarios. A fuzzy logic based fusion has been implemented to calculate the relative weight of each method’s result in the final combined result. This fuzzy logic system uses the actual concentration reported by the method and an “importance” measure to determine the output weighting. The final result generated from these processing steps is an estimate of the concentration and confidence interval for each of the analytes of interest.

9.2 Contributions

Several novel contributions to the field of quantitative chromatography analysis have

been made by this research. The first is a system and method for the automatic analysis of

gas chromatography data. This is the first system which implements two-way communica-

tion with a supervisory controller that can alter the processing flow of a sample based on

the results of an analysis. In addition the system is not tied to any specific vendor’s data

format or hardware and can run remotely across a standard IEEE 802 Ethernet network.


A second major contribution is the novel use of nonlinear least squares minimization within a sliding convolution window for the simultaneous modeling of a large number of peak parameters. In previous modeling attempts the user was limited in the number of parameters that could be estimated by the convergence of the algorithm and the processing time required. With the approach presented in this research, a manageable subset of the entire data set is modeled (including neighboring peaks whose tails overlap the subset) and then the subset is redefined as the next group of peaks in the time series, with possible overlap. A per–peak fit measure is used to determine when a master parameter list should be updated. This approach enables a consistent treatment of overlapped peaks and reduces the processing time.
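The bookkeeping of the sliding window can be sketched as follows in Python. The nonlinear fit itself is omitted, and the sketch assumes that the per–peak fit measure is an error for which lower is better; the function names, window size, and overlap are hypothetical.

```python
def sliding_windows(n_peaks, size, overlap):
    """Yield index ranges of consecutive peaks; adjacent windows share
    `overlap` peaks so that overlapped peaks are fitted more than once."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    start = 0
    while start < n_peaks:
        yield range(start, min(start + size, n_peaks))
        if start + size >= n_peaks:
            break
        start += step

def update_master(master, window_fit):
    """Keep, for each peak index, the parameters with the lowest fit error.

    `window_fit` maps peak index -> (parameters, fit_error) for one window.
    """
    for idx, (params, err) in window_fit.items():
        if idx not in master or err < master[idx][1]:
            master[idx] = (params, err)

print([list(w) for w in sliding_windows(7, size=3, overlap=1)])
# [[0, 1, 2], [2, 3, 4], [4, 5, 6]]
```

Peaks 2 and 4 appear in two windows each, so `update_master` would retain whichever of the two fits has the smaller error.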

The use of analyte models in a complete chromatogram fitting process is also a new approach to determining analyte concentration. Many forms of regression on a set of linearly transformed variables have been presented in the literature, but all of these methods can only change the weight of the influence of directly overlapped signatures. They cannot completely separate the contribution of a possibly large concentration of one analyte from an average concentration of another analyte. The approach presented in this research forces the best fit of each entire analyte model simultaneously across the entire chromatogram.

The final contribution is the intelligent fusion of results obtained from multiple analyt-

ical methods. A mathematical framework of fuzzy logic was used to combine the results

of each method based on a set of rules derived by an expert with knowledge of the

strengths and weaknesses of each method under various sample mixture scenarios. This fusion generated a more accurate and precise estimate than was obtained by any one individual method.

9.3 Developed software and tools

This research work has generated several software modules which can operate stand-

alone or as a system. The major software components are the control and communication

module, MATLAB data processing functions, and a graphical user interface (GUI) for

interactive off-line parameter evaluation. The control and communication software was

written in C and C++ and the compiled executable runs on a Sun Microsystems UNIX

workstation. This module must run with the CAA task sequence controller and the analyt-

ical instrument module (AIM). All of the MATLAB functions can be called directly with

the proper arguments from the MATLAB environment. The functions are modular in

design to enable stand-alone use of individual components (e.g., baseline estimation and peak modeling). Finally, the MATLAB based GUI provides a convenient, interactive interface for the manipulation of the various algorithm parameters.

9.4 Future research

Several areas of potential new research and implementation issues have become appar-

ent in the course of performing this research. One of the most interesting is the use of a

quantitative pattern recognition system to provide input into the results fusion process. A

factor analysis of the raw data could be used to generate a lower–dimensional feature set

that then could be used in a traditional multiclass pattern classification algorithm. The resulting presence/absence decisions for each analyte could be used to better determine which analytical method should be used in the fusion process.

The automatic generation of the rules used in the fuzzy logic inference system is a logical extension of the manually–generated rules. Some research has been done in this area with a neural network used to generate the rules. The network was trained using example scenarios and then the range of possible fuzzy set combinations was input to the network for rule generation. The rule generation process becomes important due to the large number of rules required when the number of inputs and distinct fuzzy sets increases.
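The growth in rule count can be made concrete with a small Python sketch. It assumes a complete rule base with one rule per combination of fuzzy sets, so the count is the product of the number of sets per input; the input names and set labels are hypothetical.

```python
from itertools import product

def rule_antecedents(fuzzy_sets_per_input):
    """Enumerate every combination of fuzzy sets, one set per input."""
    return list(product(*fuzzy_sets_per_input))

# Hypothetical labels for two inputs (concentration and importance)
two_inputs = [["low", "medium", "high"], ["low", "high"]]
print(len(rule_antecedents(two_inputs)))      # 6 = 3 * 2

# Adding a third input with three sets triples the rule count
three_inputs = two_inputs + [["low", "medium", "high"]]
print(len(rule_antecedents(three_inputs)))    # 18 = 3 * 2 * 3
```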

Another possible area of further research is the use of an index system for the labeling of peaks rather than their absolute retention time. One existing index system, proposed by Kovats, could be investigated [57]. Using an index would enable more robust peak correspondence in the calibration and analysis phases of the processing. Currently, normalized retention time is used during the calibration phase.
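As an illustration of such an index, the Python sketch below evaluates the commonly stated isothermal form of the Kovats retention index, which interpolates the logarithm of the adjusted retention time between the two n-alkanes that bracket the peak. Whether this form or a temperature-programmed variant would suit the system described here is not established by this work; the function name and the retention times are hypothetical.

```python
import math

def kovats_index(t_x, t_n, t_n1, n):
    """Isothermal Kovats retention index.

    t_x  : adjusted retention time of the peak of interest
    t_n  : adjusted retention time of the n-alkane with n carbons
    t_n1 : adjusted retention time of the n-alkane with n + 1 carbons
    """
    if not 0 < t_n < t_n1 or not t_n <= t_x <= t_n1:
        raise ValueError("require 0 < t_n <= t_x <= t_n1 and t_n < t_n1")
    fraction = (math.log(t_x) - math.log(t_n)) / (math.log(t_n1) - math.log(t_n))
    return 100.0 * (n + fraction)

# Hypothetical adjusted retention times in minutes
print(kovats_index(6.0, 4.0, 9.0, n=10))
```

A peak eluting with the lower alkane has index 100n and one eluting with the upper alkane has index 100(n + 1), so the label depends on the peak's position relative to the reference compounds rather than on its absolute retention time.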

A study into the detector response to determine if nonlinearities are present would help

validate the linear summation of analytes model. If a significant nonlinear response is

seen, it should be factored into the complete chromatogram modeling process.

A final area that could be addressed is optimizing the nonlinear least squares minimi-

zation code. The current implementation in MATLAB performs satisfactorily on a UNIX

workstation and may run faster on a PC based system. An implementation in C would per-

form better and would enable multiple GC systems to be serviced from one data interpre-

tation computer.


BIBLIOGRAPHY


[1] M. A. Abidi and R. C. Gonzalez, eds. Data Fusion In Robotics and Machine Intelligence, Academic Press, San Diego, CA, 1992.

[2] Y. Bard. Nonlinear Parameter Estimation, Academic Press, New York, NY, 1974.

[3] R. S. Bear and S. D. Brown. “Kalman Filter-Optimized Simulation for Step Voltammetry,” Analytical Chemistry, 65(8): 1061 - 1068, 1993.

[4] R. Bellman and M. Giertz. “On the analytical formalism of the theory of fuzzy sets,” Information Sciences, 5: 149 - 156, 1973.

[5] A. Berthod. “Mathematical Series for Signal Modeling Using Exponentially Modified Functions,” Analytical Chemistry, 63(17): 1879 - 1884, September 1, 1991.

[6] I. Bloch. “Information combination operators for data fusion: A comparative review with classification,” IEEE Trans. on Sys., Man and Cybernetics - Part A: Sys. and Humans, 26(1): 52 - 67, 1996.

[7] B. van den Bogaert, H. F. M. Boelens, and H. C. Smit. “Quantification of chromatographic data using a matched filter: robustness towards noise model errors,” Analytica Chimica Acta, 274: 87 - 97, 1993.

[8] G. E. P. Box and J. Wetz. “Criteria for judging adequacy of estimation by an approximating response function,” University of Wisconsin Statistics Department Technical Report No. 9, 1973.

[9] S. D. Brown and S. C. Rutan. “Adaptive Kalman Filtering,” Journal of Research of the National Bureau of Standards, 90(6): 403 - 407, 1985.

[10] W. P. Carey, L. E. Wangen, and J. T. Dyke. “Spectrophotometric Method for the Analysis of Plutonium and Nitric Acid Using Partial Least-Squares Regression,” Analytical Chemistry, 61(15): 1667 - 1669, 1989.

[11] S. Carrato and A. Contin. “Application of a Peak Detection Algorithm for the Shape Analysis of Partial Discharges Amplitude Distributions,” IEEE Intl. Symp. on Electrical Insulation, Pittsburgh, PA, USA, 288 - 291, 1994.

[12] U. B. Ceipidor, T. R. I. Cataldi, E. Desimoni, and A. M. Salvi. “Non-linear Least Squares Refinement with Constraints. Evaluation Through Curve Fitting on Emulated XPS-like Spectra and Application to the Analysis of Carbon Fibres,” Journal of Chemometrics, 8(3): 221 - 239, 1994.


[13] Chemstation Users Manual (Product # G2090AA). Hewlett Packard, 1994.

[14] K. Chopra and D. D. Woods. “A Maximum Likelihood Peak Detecting Channel,” IEEE Transactions on Magnetics, 27(6): 4819 - 4821.

[15] C. Creemers, D. Royer, and P. Schryvers. “Automatic Analysis of ISS-Spectra,” Surface and Interface Analysis, 20: 233 - 242, 1993.

[16] P. B. Crilly. “Numerical Deconvolution of Gas Chromatography Peaks Using Jansson’s Method,” Journal of Chemometrics, 1: 79 - 90, 1987.

[17] P. B. Crilly. “Enhancing the Deconvolution of Noisy Chromatographic Data by Jansson’s Method,” Journal of Chemometrics, 4: 51 - 59, 1990.

[18] P. B. Crilly. “The Use of a Cross-Correlation Technique to Enhance Jansson’s Deconvolution Procedure,” Journal of Chemometrics, 4: 291 - 298, 1990.

[19] R. D. Dandeneau and E. H. Zerenner. “An Investigation of Glasses for Capillary Chromatography,” High Resolution Chromatography and Capillary Chromatography, 2: 351 - 356, 1979.

[20] N. R. Draper and H. Smith. Applied Regression Analysis, 2nd ed. Wiley, New York, NY, 1981.

[21] J. W. Elling, L. N. Klatt, and W. P. Unruh. “Automated Data Interpretation in an Automated Environmental Laboratory,” Laboratory Robotics and Automation, 6(2): 73 - 78, April 1994.

[22] M. D. Erickson. Analytical Chemistry of PCB, Butterworth Publishers, Boston, MA, 1986, reprinted by Lewis Publishers, Chelsea, MI, 1992.

[23] T. H. Erkkila, R. M. Hollen, and T. J. Beugelsdijk. “The Standard Laboratory Module: An Integrated Approach to Standardization in the Analytical Laboratory,” Laboratory Robotics and Automation, 6(2): 57 - 64, April 1994.

[24] J. L. Excoffier and G. Guiochon. “Automatic Peak Detection in Chromatography,” Chromatographia, 15(9): 543 - 545, 1982.

[25] A. Felinger. “Fourier Analysis of Multicomponent Chromatograms. Recognition of Retention Patterns,” Analytical Chemistry, 64(18): 2164 - 2174, 1992.

[26] A. Felinger. “Deconvolution of Overlapping Skewed Peaks,” Analytical Chemistry, 66(19): 3066 - 3072, 1994.


[27] A. Felinger. “Superposition of Chromatographic Retention Patterns,” Analytical Chemistry, 67(13): 2078 - 2087, 1995.

[28] R. Fletcher. Practical methods of optimization, Wiley, Chichester, England, 1987.

[29] J. P. Foley. “Systematic Errors in the Measurement of Peak Area and Peak Height for Overlapping Peaks,” Journal of Chromatography, 384: 301 - 313, 1987.

[30] J. P. Foley. “Equations for chromatographic modeling and calculation of peak area,” Analytical Chemistry, 59(15): 1984 - 1987, 1987.

[31] J. P. Foley and J. G. Dorsey. “A Review of the Exponentially Modified Gaussian (EMG) Function: Evaluation and Subsequent Calculation of Universal Data,” Journal of Chromatographic Science, 22: 40 - 46, January, 1984.

[32] L. W. Fung and K. S. Fu. “The K’th optimal policy algorithm for decision-making in fuzzy environments,” Identification and System Parameter Estimation (P. Eykhoff, Ed.), North Holland, pp. 1025 - 1059, 1974.

[33] P. Gans. Data Fitting in the Chemical Sciences, Wiley, Chichester, England, 1992.

[34] P. J. Gemperline, J. R. Long, and V. G. Gregoriou. “Nonlinear Multivariate Calibration Using Principal Components Regression and Artificial Neural Networks,” Analytical Chemistry, 63(20): 2313 - 2323, October 15, 1991.

[35] D. J. Gerth, T. Howell, and S. I. Shupack. “An Object-oriented Data Handling System for Spectral or Chromatographic Data Acquisition and Analysis,” Computers in Chemistry, 16(1): 35 - 39, 1992.

[36] J. C. Giddings. Chromatography, 3rd ed., edited by E. Heftmann, Van Nostrand Reinhold, New York, NY, 1975.

[37] J. C. Giddings. Dynamics of Chromatography: vol. 1, edited by J. C. Giddings and R. A. Keller, Marcel Dekker, New York, 1965.

[38] M. J. E. Golay. Gas Chromatography, edited by D. H. Desty, Butterworths, London, UK, 1956.

[39] B. J. Gudzinowicz, M. J. Gudzinowicz, and H. F. Martin. Fundamentals of Integrated GC-MS: Vol. 1, GC, Marcel Dekker, New York, NY, 1976.

[40] G. Guiochon and C. L. Guillemin. Quantitative Gas Chromatography for Laboratory Analysis and On-Line Process Control, Elsevier, Amsterdam, The Netherlands, 1988.


[41] G. Guiochon and C. L. Guillemin. “Gas Chromatography,” Review of Scientific Instrumentation, 61(11), 1990.

[42] D. Haaland and E. Thomas. “Partial Least-Squares Methods for Spectral Analyses. 2. Application to Simulated and Glass Spectral Data,” Analytical Chemistry, 60(11): 1202 - 1208, 1988.

[43] P. S. Hamilton and W. J. Tompkins. “Quantitative Investigation of QRS Detection Rules Using the MIT/BIH Arrhythmia Database,” IEEE Transactions on Biomedical Engineering, 33(12): 1157 - 1164, 1986.

[44] Y. Hayashi, T. Shibazaki, and M. Uchiyama. “Resolution of overlapped chromatograms by means of the Kalman filter,” Analytica Chimica Acta, 202: 187 - 197, 1987.

[45] S. J. Henkind and M. C. Harrison. “An analysis of four uncertainty calculi,” IEEE Trans. on Sys., Man and Cybernetics, SMC-18(5): 700 - 714, 1988.

[46] IEEE. Programs for Digital Signal Processing, Algorithm 5.2. IEEE Press, John Wiley & Sons, New York, 1979.

[47] A. T. James and A. J. P. Martin. “Gas-liquid Partition Chromatography: the Separation and Micro-estimation of Volatile Fatty Acids from Formic Acid to Dodecanoic Acid,” Biochemical Journal, 50: 679 - 690, 1952.

[48] J.-S. R. Jang. “ANFIS: Adaptive-Network-based Fuzzy Inference System,” IEEE Trans. on Systems, Man, and Cybernetics, 23(3): 665 - 685, 1993.

[49] F. Janssens and J. Francois. “Evaluation of Three Zero-Area Digital Filters for Peak Recognition and Interference Detection in Automated Spectral Data Analysis,” Analytical Chemistry, 63(4): 320 - 331, 1991.

[50] P. Jansson. Deconvolution With Applications in Spectroscopy, Academic Press, New York, 1984.

[51] A. Jaulmes, C. Vidal-Madjar, A. Ladurelli, and G. Guiochon. “Study of Peak Profiles in Nonlinear Gas Chromatography. 1. Derivation of a Theoretical Model,” Journal of Physical Chemistry, 88(22): 5379 - 5385, 1984.

[52] M. S. Jeansonne and J. P. Foley. “Improved equations for the calculation of chromatographic figures of merit for ideal and skewed chromatographic peaks,” Journal of Chromatography, 594: 1 - 8, 1992.

[53] A. Kandel. Fuzzy mathematical techniques with applications, Addison-Wesley, Reading, MA, 1986.


[54] D. W. Kirmse and A. W. Westerberg. “Resolution Enhancement of Chromatograph Peaks,” Analytical Chemistry, 43(8): 1035 - 1039, 1971.

[55] P. T. Kissinger, L. J. Felice, D. J. Miner, C. R. Reddy, and R. E. Shoup. “Detectors for Trace Organic Analysis of Liquid Chromatography: Principles and Applications,” in Contemporary Topics in Analytical and Clinical Chemistry, Vol. 2, D. M. Hercules et al., eds., Plenum Press, New York, 1978.

[56] S. A. Klappa and G. R. Long. “Computer assisted determination of the biological activity of polychlorinated biphenyls using gas chromatographic retention indices as molecular descriptors,” Analytica Chimica Acta, 259: 89 - 93, 1992.

[57] E. Kovats. Advances in Chromatography, vol. I, J. C. Giddings and R. A. Keller, eds., Marcel Dekker, New York, 1965.

[58] B. R. Kowalski and B. Seasholtz. “Recent Developments in Multivariate Calibration,” Journal of Chemometrics, 5(3): 129 - 145, 1991.

[59] B. K. Lavine, A. Stine, and H. T. Mayfield. “Gas chromatography-pattern recognition techniques in pollution monitoring,” Analytica Chimica Acta, 277: 357 - 367, 1993.

[60] K. Levenberg. “A Method for the Solution of Certain Non-linear Problems in Least Squares,” Quart. Applied Math, 2: 164 - 168, 1944.

[61] W. Lindberg et al. “A Simple and Robust Flow Injection Analysis Method for Determination of Free Acid and Metal Concentrations in Hydrolyzable Metal Solutions,” Analytical Chemistry, 62(): 849 - , 1990.

[62] H. H. Madden. “Comments on the Savitzky-Golay Convolution Method for Least-Squares Fit Smoothing and Differentiation of Digital Data,” Analytical Chemistry, 50(9): 1383 - 1385, 1978.

[63] N. Majcen, K. Rajer-Kanduc, M. Novic, and J. Zupan. “Modeling of Property Prediction from Multicomponent Analytical Data Using Different Neural Networks,” Analytical Chemistry, 67(13): 2154 - 2161, July 1, 1995.

[64] D. W. Marquardt. “An Algorithm for Least-squares Estimation of Nonlinear Parameters,” SIAM Journal of Applied Mathematics, 11: 431 - 441, 1963.

[65] A. J. P. Martin and R. L. M. Synge. “Separation of the Higher Monoamino-Acids by Counter-Current Liquid-Liquid Extraction: The Amino-Acid Composition of Wool,” Biochemical Journal, 35(1): 91 - 121, 1941.


[66] J. J. Moré. “The Levenberg-Marquardt Algorithm: Implementation and Theory,” Numerical Analysis, ed. G. A. Watson, Lecture Notes in Mathematics 630, Springer Verlag, 105 - 116, 1977.

[67] A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing, pp. 311 - 312. Prentice-Hall, Englewood Cliffs, NJ, 1989.

[68] A. N. Papas and T. P. Tougas. “Accuracy of Peak Deconvolution Algorithms within Chromatographic Integrators,” Analytical Chemistry, 62(3): 234 - 239, 1990.

[69] L. F. Pau. “Sensor data fusion,” Journal of Intelligent and Robotic Systems, 1: 103 - 116, 1988.

[70] A. Robbat, Jr., G. Xyrafas, and D. Marshall. “Prediction of Gas Chromatographic Retention Characteristic of Polychlorinated Biphenyls,” Analytical Chemistry, 60(10): 982 - 985, 1988.

[71] M. L. Salit et al. “Integrating Automated Systems with Modular Architecture,” Analytical Chemistry, 66(6): 361A - 367A, 1994.

[72] I. Schechter, R. Wisbrun, R. Niessner, H. Schroder, and K. L. Kompa. “Signal Processing Algorithm for Simultaneous Multi-element Analysis by Laser-produced Plasma Spectroscopy,” SPIE Proceedings on Substance Identification Analytics, Innsbruck, Austria, 2093: 310 - 321, 1994.

[73] I. Schechter. “Correction for Nonlinear Fluctuating Background in Monovariable Analytical Systems,” Analytical Chemistry, 67(15): 2580 - 2585, 1995.

[74] S. Sekulic, M. B. Seasholtz, Z. Wang, B. R. Kowalski, S. E. Lee, and B. R. Holt. “Nonlinear Multivariate Calibration Methods in Analytical Chemistry,” Analytical Chemistry, 65(19): 835A - 845A, October 1, 1993.

[75] F. A. Settle, Jr., R. Hollen, and L. W. Yarbrough. “The Contaminant Analysis Automation Project,” American Laboratory, April 1995.

[76] J. C. Sternberg. “Extracolumn contributions to Chromatographic band broadening,” Advances in Chromatography, vol. 2, J. C. Giddings and R. A. Keller, eds., Marcel Dekker, New York, 1966.

[77] Target Software (version 3). Thru-Put Systems Inc., Orlando, FL.

[78] United States Environmental Protection Agency, Office of Solid Waste and Emergency Response, Washington DC, Method 8080A.


[79] United States Environmental Protection Agency, Office of Solid Waste and Emergency Response, Washington DC, Solid Waste 846.

[80] P. D. Welch. “The use of the fast Fourier transform for the estimation of power spectra: A method based on time averaging over short modified periodograms,” IEEE Trans. on Audio Electroacoustics, AU-15: 70 - 73, 1967.

[81] M. A. Williams. “Application of artificial neural networks in the quantitative analysis of gas chromatograms,” M.S. Thesis, University of Tennessee, Knoxville, TN, May 1996.

[82] R. R. Yager. “A general approach to the fusion of imprecise information,” Intl. J. of Intelligent Systems, 12(1): 1 - 29, 1997.

[83] W. W. Yau and J. J. Kirkland. “Improved computer algorithm for characterizing skewed chromatographic band broadening: I. Method,” Journal of Chromatography, 556: 111 - 118, 1991.

[84] W. W. Yau, S. W. Rementer, J. M. Boyajian, J. J. DeStefano, J. F. Graff, K. B. Lim, and J. J. Kirkland. “Improved computer algorithm for characterizing skewed chromatographic band broadening: II. Results and comparisons,” Journal of Chromatography, 630: 69 - 77, 1993.

[85] J. Zart. “Low order Fourier series baseline approximation,” Internal Los Alamos National Lab report, 1996.

[86] L. A. Zadeh. Fuzzy Sets and Applications: Selected Papers by L. A. Zadeh, John Wiley & Sons, New York, 1987.

[87] H. J. Zimmermann. Fuzzy Set Theory and Its Applications, 2nd ed. Kluwer Academic, Boston, 1991.


APPENDICES


APPENDIX A.

DERIVATION OF EMG FUNCTION PARTIAL DERIVATIVES

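We start with the definition of the EMG function given by

\[
h_{EMG}(t) = \frac{A}{2\tau}\exp\left[\frac{1}{2}\left(\frac{\sigma}{\tau}\right)^{2} - \frac{t - t_g}{\tau}\right]\left[1 + \operatorname{erf}\left(\frac{Z}{\sqrt{2}}\right)\right], \qquad Z = \frac{t - t_g}{\sigma} - \frac{\sigma}{\tau}. \tag{0.1}
\]

First we will take the partial derivative of Eqn. 0.1 with respect to the peak area variable A. This is a simple operation because the area term only appears as an overall multiplier.

\[
\frac{\partial}{\partial A} h_{EMG}(t) = \frac{1}{2\tau}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right) \tag{0.2}
\]

The remaining partials require the application of the product and chain rules. The derivative of the erf function is also required and is given by

\[
\frac{d}{dx}\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\exp\left(-x^{2}\right). \tag{0.3}
\]

We start with the partial with respect to σ. Taking the first term yields

\[
\frac{A\sigma}{2\tau^{3}}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right). \tag{0.4}
\]

The second term (erf function) results in

\[
\frac{-2A}{2\sqrt{\pi}\,\tau}\left[\frac{1}{\sqrt{2}\,\tau} + \frac{t - t_g}{\sqrt{2}\,\sigma^{2}}\right]\exp\left[-\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)^{2}\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right). \tag{0.5}
\]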


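Combining these two terms and simplifying yields

\[
\frac{\partial}{\partial \sigma} h_{EMG}(t) = \frac{-A}{\tau\sqrt{\pi}}\left[\frac{1}{\sqrt{2}\,\tau} + \frac{t - t_g}{\sqrt{2}\,\sigma^{2}}\right]\exp\left[-\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)^{2}\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right) + \frac{A\sigma}{2\tau^{3}}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right). \tag{0.6}
\]

Considering the peak center term, t_g, we again consider the first term in Eqn. 0.1 to obtain

\[
\frac{A}{2\tau^{2}}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right). \tag{0.7}
\]

The second term results in

\[
\frac{-A}{\sqrt{2\pi}\,\tau\sigma}\exp\left[-\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)^{2}\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right). \tag{0.8}
\]

Combining these two terms and simplifying yields

\[
\frac{\partial}{\partial t_g} h_{EMG}(t) = \frac{-A}{\tau\sigma\sqrt{2\pi}}\exp\left[-\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)^{2}\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right) + \frac{A}{2\tau^{2}}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right). \tag{0.9}
\]

Finally, we consider the partial of Eqn. 0.1 with respect to the exponential decay term, τ. Again we start with the first term, which has two components because τ appears in both the leading factor 1/(2τ) and the exponential:

\[
\frac{-A}{2\tau^{2}}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right), \tag{0.10}
\]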


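and

\[
\frac{A}{2\tau}\left(\frac{-\sigma^{2}}{\tau^{3}} + \frac{t - t_g}{\tau^{2}}\right)\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right). \tag{0.11}
\]

The second term results in

\[
\frac{A\sigma}{\sqrt{2\pi}\,\tau^{3}}\exp\left[-\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)^{2}\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right). \tag{0.12}
\]

Combining yields

\[
\frac{\partial}{\partial \tau} h_{EMG}(t) = -\frac{A}{2\tau^{2}}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right) + \frac{A\sigma}{\tau^{3}\sqrt{2\pi}}\exp\left[-\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)^{2}\right]\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right) + \frac{A}{2\tau}\left[1 - \operatorname{erf}\left(\frac{\sigma}{\sqrt{2}\,\tau} - \frac{t - t_g}{\sqrt{2}\,\sigma}\right)\right]\left(\frac{-\sigma^{2}}{\tau^{3}} + \frac{t - t_g}{\tau^{2}}\right)\exp\left(\frac{\sigma^{2}}{2\tau^{2}} - \frac{t - t_g}{\tau}\right) \tag{0.13}
\]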
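The partial derivatives derived in this appendix can be checked numerically. The Python sketch below is an illustration and not the MATLAB implementation used in this work: it evaluates the EMG function and its four analytic partials, written with erfc(x) = 1 - erf(x), and compares each with a central finite difference. The function names, parameter values, and step size are assumptions chosen for the example.

```python
import math

SQRT2 = math.sqrt(2.0)

def emg(t, A, sigma, t_g, tau):
    """EMG function of Eqn. 0.1, using erfc(u) = 1 - erf(u)."""
    u = sigma / (SQRT2 * tau) - (t - t_g) / (SQRT2 * sigma)
    E = math.exp(sigma ** 2 / (2 * tau ** 2) - (t - t_g) / tau)
    return A / (2 * tau) * E * math.erfc(u)

def emg_partials(t, A, sigma, t_g, tau):
    """Analytic partials with respect to A, sigma, t_g, and tau."""
    u = sigma / (SQRT2 * tau) - (t - t_g) / (SQRT2 * sigma)
    E = math.exp(sigma ** 2 / (2 * tau ** 2) - (t - t_g) / tau)
    erfc_u = math.erfc(u)
    gauss = math.exp(-u * u)
    d_A = erfc_u * E / (2 * tau)
    d_sigma = (-A / (tau * math.sqrt(math.pi))
               * (1 / (SQRT2 * tau) + (t - t_g) / (SQRT2 * sigma ** 2)) * gauss * E
               + A * sigma / (2 * tau ** 3) * erfc_u * E)
    d_tg = (-A / (tau * sigma * math.sqrt(2 * math.pi)) * gauss * E
            + A / (2 * tau ** 2) * erfc_u * E)
    d_tau = (-A / (2 * tau ** 2) * erfc_u * E
             + A * sigma / (tau ** 3 * math.sqrt(2 * math.pi)) * gauss * E
             + A / (2 * tau) * erfc_u * (-sigma ** 2 / tau ** 3 + (t - t_g) / tau ** 2) * E)
    return d_A, d_sigma, d_tg, d_tau

def numeric_partial(index, t, params, step=1e-6):
    """Central finite difference of emg with respect to params[index]."""
    hi = list(params)
    lo = list(params)
    hi[index] += step
    lo[index] -= step
    return (emg(t, *hi) - emg(t, *lo)) / (2 * step)

# Hypothetical parameter values: (A, sigma, t_g, tau)
params = (2.0, 0.2, 5.0, 0.4)
for name, analytic, i in zip(("A", "sigma", "t_g", "tau"),
                             emg_partials(5.3, *params), range(4)):
    print(name, analytic, numeric_partial(i, 5.3, params))
```

Agreement between the two columns indicates that the reconstructed expressions are mutually consistent at the test point; it does not by itself validate the convergence behavior of the fitting routine.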


APPENDIX B.

TABLE OF INDIVIDUAL RESULTS

Table B.1: Concentration values for sample set

Id  Aroclor 1242 (ug/ml)  Aroclor 1254 (ug/ml)  Aroclor 1260 (ug/ml)  1242 confidence interval (ug/ml)  1254 confidence interval (ug/ml)  1260 confidence interval (ug/ml)  Importance

20 0.074 0.0755 0.0663 0.0018 0.0019 0.0015 0.98507

21 0.1189 0.1045 0.1351 0.0026 0.0027 0.0022 0.98029

22 0.1935 0.2049 0.2045 0.0048 0.0049 0.004 0.96734

23 0.3699 0.4518 0.3493 0.0147 0.0143 0.011 0.92272

24 0.7772 0.7231 0.8099 0.0218 0.0195 0.0168 0.92997

25 0 0 0.8316 0.0039 0.0038 0.0042 0.97601

26 0.0115 0.8031 0.0223 0.008 0.0111 0.0065 0.94472

27 0.8409 0 0 0.0025 0.0016 0.0014 0.98826

28 0.792 0.7854 0.0814 0.0128 0.0119 0.0072 0.9515

29 0 0.0629 0.1112 0.0016 0.002 0.0016 0.98428

30 0.0701 0 0.125 0.0016 0.0016 0.0013 0.98716

31 0.2214 0.767 0.7906 0.0156 0.0191 0.0162 0.9267

32 0.2094 0.0374 0.8486 0.005 0.0047 0.0052 0.97154

33 0.1368 0.7546 0.3861 0.0119 0.0154 0.011 0.93289

34 0.4068 0.8224 0.0277 0.0095 0.0109 0.0064 0.95099

35 0.7916 0.0816 0.2105 0.0064 0.0045 0.0038 0.97149

36 0.7805 0.0209 0.831 0.0079 0.0055 0.006 0.96983

37 0.0852 0.8759 0.0139 0.0086 0.0115 0.0065 0.94689

38 0.0765 0.1912 0.8664 0.0063 0.0067 0.0071 0.96244

39 0.0067 0.4203 0.0673 0.0043 0.0057 0.0041 0.96297


40 0.759 0.2032 0.2271 0.0075 0.0056 0.0046 0.96697

41 0.2066 0 0 0.0014 0.0012 0.0012 0.98907

42 0.0013 0.2104 0 0.0021 0.0026 0.002 0.97941

43 0 0 0.2072 0.0016 0.0017 0.0015 0.98628

44 0.2088 0 0 0.0012 0.0011 0.0011 0.99024

45 0.0023 0.2088 0.0002 0.0021 0.0026 0.0019 0.97987

46 0 0 0.2089 0.0015 0.0016 0.0014 0.98703

47 0.2077 0 0 0.0013 0.0011 0.0012 0.98963

48 0.0023 0.2087 0 0.0023 0.0026 0.0019 0.97949

49 0 0 0.2049 0.0018 0.0017 0.0015 0.98615


VITA

Martin Anthony Hunt was born in Knoxville, Tennessee on October 26, 1963. He was

graduated from Farragut High School in June 1981. He received the Bachelor of Science

degree in Electrical Engineering from Tennessee Technological University, Cookeville,

Tennessee, in June 1985. As an undergraduate he was initiated as a member of Tau Beta

Pi, Eta Kappa Nu, and Mortar Board honor societies. He continued his education at Vanderbilt University, Nashville, Tennessee, and received a Master of Science degree in Electrical Engineering in May, 1987. During the graduate program he worked as a research

assistant in Dr. Richard Shiavi’s Gait Analysis Laboratory.

He accepted a development staff member position with the Instrumentation and Con-

trols division of Oak Ridge National Laboratory in June, 1987. He has been a member of

the Image Science and Machine Vision group since 1988 and continues to perform

research and development in the areas of computer vision, signal processing, and pattern

recognition. He began the part time pursuit of the Doctor of Philosophy degree at the University of Tennessee in 1990. He will receive the Doctor of Philosophy degree in Electrical Engineering in May 1998.

He married Mary Elizabeth McSpadden of Kingsport, Tennessee in December of

1986. He has two children, Caroline, age 7 years and Robert, age 4 years.
