STATISTICAL SIGNAL PROCESSING AND ITS APPLICATIONS TO
DETECTION, MODEL ORDER SELECTION, AND CLASSIFICATION
BY
QUAN DING
A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
IN
ELECTRICAL ENGINEERING
UNIVERSITY OF RHODE ISLAND
2011
DOCTOR OF PHILOSOPHY DISSERTATION
OF
QUAN DING
APPROVED:
Dissertation Committee:
Major Professor
DEAN OF THE GRADUATE SCHOOL
UNIVERSITY OF RHODE ISLAND
2011
ABSTRACT
This dissertation focuses on topics in statistical signal processing, including
detection and estimation theory, information fusion, and model order selection,
as well as their applications to standoff detection.
Model order selection is a very common problem in statistical signal processing.
In composite multiple hypothesis testing, the maximum likelihood rule will
always choose the hypothesis with the largest order if the parameters in each
candidate hypothesis are hierarchically nested. Hence, many methods have been
proposed to offset this overestimating tendency by introducing a penalty term.
Two popular methods are the minimum description length (MDL) and the Akaike
information criterion (AIC). It has been shown that the MDL is consistent and the
AIC tends to overestimate the model order as the sample size goes to infinity. In this
dissertation, we show that for a fixed sample size, the MDL and the AIC are
inconsistent as the noise variance goes to zero. This result is surprising since,
intuitively, a good model order selection criterion should choose the correct model
when the noise is small enough. Moreover, it is proved that the exponentially
embedded family (EEF) criterion is consistent as the noise variance goes to zero.
Standoff detection aims to detect hazardous substances in an effort to keep
people away from potential damage and danger. Our work in standoff detection
develops algorithms for the detection and classification of surface chemical
agents using Raman spectra. We use an autoregressive model to fit the Raman
spectra, develop an unsupervised detection algorithm followed by a classification
scheme, and control the false alarm rate to a low level while maintaining
very good detection and classification performance.
In information fusion and sensor integration, multiple sensors of the same or
different types are deployed in order to obtain more information and make a better
decision than with a single sensor. A common and simple approach is to assume that
the measurements of the sensors are independent, so that the joint probability
density function (PDF) is the product of the marginal PDFs. However, this
assumption does not hold if the measurements are correlated. We have proposed a novel
method of constructing the joint PDF using the exponential family. This method
combines all the available information in a multi-sensor setting from a statistical
standpoint. It is shown that this method is asymptotically optimal in minimizing
the Kullback-Leibler divergence, and it attains detection/classification
performance comparable to that of existing methods.
The maximum likelihood estimator (MLE) is the most popular method of
parameter estimation. It is asymptotically optimal in that it approximates the
minimum variance unbiased (MVU) estimator for large data records. Under a
misspecified model, it is well known that the MLE still converges to a well-defined
limit as the sample size goes to infinity. We have proved that, under some
regularity conditions, the MLE under a misspecified model also converges to a
well-defined limit at high signal-to-noise ratio (SNR). This result enables important
performance analysis of the MLE under a misspecified model.
ACKNOWLEDGMENTS
First of all, I would like to thank my advisor Dr. Steven Kay for his guidance,
support, patience and understanding during my five-year graduate studies at URI.
I thank him for sending me to many mathematics classes, which helped me develop
mathematical skills for my research. I also thank him for such careful proofreading
of all my papers. I always had a lot of typos and grammatical errors in the draft
versions of papers. It was a great pleasure to work with him on a variety of topics
in statistical signal processing. It was he who led me into such an interesting area
and taught me how to do research. It was really an honor to be his student.
I am also grateful to the faculty of the Department of Electrical, Computer,
and Biomedical Engineering and the Department of Mathematics, especially my
committee members Dr. Kay, Dr. Swaszek, Dr. Pakula, Dr. He, and Dr. Merino
for their help and efforts in participating in my comprehensive exam and disserta-
tion defense.
I must also thank Meredith Leach Sanders for taking care of all my paperwork.
She keeps everything in mind and never forgets to send us a friendly reminder. The
department would not be able to run without her.
I would also like to thank Dr. Pakula of the Department of Mathematics for
his inspiring classes. As an engineering student, I really like the way he teaches a
math class.
I would like to thank all my friends for their encouragement and help.
Finally, I am thankful for my family who always support me with their love
and trust. I thank my parents for everything they have done for me since I was
born. I thank my girlfriend Xiaorong. She makes my life much more beautiful. I
am so thankful that I met her at URI. I also thank her parents for raising such a
nice, wonderful, decent girl.
PREFACE
This dissertation is organized in the manuscript format, consisting of seven
manuscripts. The topics and publications of the manuscripts are as follows:
Manuscript 1: (Model order selection)
Q. Ding and S. Kay, “Inconsistency of the MDL: On the Performance of
Model Order Selection Criteria with Increasing Signal-to-Noise Ratio,” to be
published in IEEE Transactions on Signal Processing.
Manuscript 2: (Standoff detection and classification)
Q. Ding, S. Kay, C. Xu, and D. Emge, “Autoregressive Modeling of Ra-
man Spectra for Detection and Classification of Surface Chemicals,” to be
published in IEEE Transactions on Aerospace and Electronic Systems.
Manuscript 3: (Sensor integration)
S. Kay, Q. Ding, and M. Rangaswamy, “Sensor Integration for Distributed
Detection and Classification,” submitted to IEEE Transactions on Aerospace
and Electronic Systems.
Manuscript 4: (Parameter estimation)
Q. Ding and S. Kay, “Maximum Likelihood Estimator under Misspecified
Model with High Signal-to-Noise Ratio,” submitted to IEEE Transactions
on Signal Processing.
Manuscript 5: (Sensor integration)
S. Kay and Q. Ding, “Exponentially Embedded Families for Multimodal
Sensor Processing,” in Proc. IEEE International Conference on Acoustics,
Speech, and Signal Processing, Mar. 2010, pp. 3770-3773.
Manuscript 6: (Sensor integration)
S. Kay, Q. Ding, and D. Emge, “Joint PDF Construction for Sensor Fusion
and Distributed Detection,” in Proc. International Conference on Informa-
tion Fusion, Jun. 2010.
(This paper has been awarded Runner up for the Best Student Paper Award
at the 13th International Conference on Information Fusion.)
Manuscript 7: (Sensor integration)
S. Kay, Q. Ding, and M. Rangaswamy, “Sensor Integration for Classification,”
in Proc. Asilomar Conference on Signals, Systems, and Computers, Nov.
2010.
TABLE OF CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . iv
PREFACE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
TABLE OF CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
MANUSCRIPT
1 Inconsistency of the MDL: On the Performance of Model Order Selection Criteria with Increasing Signal-to-Noise Ratio 1
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Inconsistency of the MDL and the AIC . . . . . . . . . . . . . . 4
1.3.1 The Linear Model . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Inconsistency of the MDL . . . . . . . . . . . . . . . . . 5
1.4 Consistency of the EEF . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 Consistency of the EEF for the Linear Model . . . . . . . 7
1.4.2 Consistency of the EEF in General . . . . . . . . . . . . 10
1.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.1 Linear Signal . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.2 Non-Linear Signal . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Appendix 1A - Derivation of the Distribution of yj’s for j ≥ p . . . . 18
Appendix 1B - Derivation of the Distribution of yj’s for j < p . . . . 21
Appendix 1C - Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . 22
Appendix 1D - Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . 25
Appendix 1E - Proof of Theorem 5 . . . . . . . . . . . . . . . . . . . 26
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2 Autoregressive Modeling of Raman Spectra for Detection and Classification of Surface Chemicals . . . . . . . . . . . . . . 29
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Problem Statement and Rationale of Approach . . . . . . . . . . 31
2.3 Spectral Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Detection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.1 Test Statistic . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.2 Overall Detection Algorithm . . . . . . . . . . . . . . . . 39
2.5 Experimental Detection Performance for Field Background Data 42
2.6 Experimental False Alarm Rate Performance . . . . . . . . . . . 45
2.7 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.7.1 Classification if Only One of M Chemicals Is Present . . 47
2.7.2 Classification if K out of M Chemicals Are Present . . . 48
2.7.3 Model Order Selection on How Many Chemicals Are Present in the Mixture . . . . . . . . . . . . . . . . . . . 49
2.8 Experimental Classification Performance for Field Background Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Appendix 2A - Derivation of Estimating the AR Model Order . . . . 57
Appendix 2B - Derivation of Test Statistic for Detection . . . . . . . 60
Appendix 2C - Derivation of Probability of Detection Statistic Threshold Crossing for Given False Alarm Rate . . . . . . . . . . 61
Appendix 2D - Derivation of LMP Test Statistic for Classification . . 63
Appendix 2E - Derivation of The Asymptotic Likelihood Function Method for Classification of Mixture of Chemicals . . . . . . 66
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3 Sensor Integration for Distributed Detection and Classification 72
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3 Joint PDF Construction by Exponential Family and Its Application in Distributed Systems . . . . . . . . . . . . . . . . . 76
3.4 KL Divergence Between The True PDF and The Constructed PDF 78
3.5 Examples-Distributed Detection . . . . . . . . . . . . . . . . . . 80
3.5.1 Partially Observed Linear Model with Gaussian Noise . . 81
3.5.2 Partially Observed Linear Model with Gaussian Mixture Noise . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.6 Examples-Distributed Classification . . . . . . . . . . . . . . . . 88
3.6.1 Linear Model with Known Variance . . . . . . . . . . . . 89
3.6.2 Linear Model with Unknown Variance . . . . . . . . . . . 92
3.6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.7 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.7.1 Distributed Detection . . . . . . . . . . . . . . . . . . . . 93
3.7.2 Distributed Classification . . . . . . . . . . . . . . . . . . 95
3.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4 Maximum Likelihood Estimator under Misspecified Model with High Signal-to-Noise Ratio . . . . . . . . . . . . . . . . . . 100
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.2 White’s Results: QMLE for Large Data Records . . . . . . . . . 102
4.3 QMLE with High SNR . . . . . . . . . . . . . . . . . . . . . . . 103
4.3.1 Misspecified Observation Model . . . . . . . . . . . . . . 103
4.3.2 Performance of QMLE as σ2 → 0 . . . . . . . . . . . . . 104
4.4 A Misspecified Linear Model Example . . . . . . . . . . . . . . . 107
4.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5 Exponentially Embedded Families for Multimodal Sensor Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.2 EEF and Its Properties . . . . . . . . . . . . . . . . . . . . . . . 117
5.3 EEF for Hypothesis Testing . . . . . . . . . . . . . . . . . . . . 120
5.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6 Joint PDF Construction for Sensor Fusion and Distributed Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.3 Construction of Joint PDF for Detection . . . . . . . . . . . . . 130
6.4 KL Divergence Between The True PDF and The Constructed PDF . . 132
6.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.5.1 Partially Observed Linear Model with Gaussian Noise . . 133
6.5.2 Partially Observed Linear Model with Non-Gaussian Noise . . 135
6.6 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7 Sensor Integration for Classification . . . . . . . . . . . . . . . . 141
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.3 Joint PDF Construction and Its Application in Classification . . 143
7.4 A Linear Model Example . . . . . . . . . . . . . . . . . . . . . . 146
7.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
LIST OF TABLES
Table Page
3.1 Comparison of our test statistic and the clairvoyant GLRT . . . . . 88
3.2 Comparison of our test statistic and the estimated MAP classifier . 93
LIST OF FIGURES
Figure Page
1.1 Performance of MDL, AIC and EEF for the linear model when H1
is true (M=2, N=20). . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Performance of MDL, AIC and EEF in estimating the polynomial model order when H3 is true (M=4, N=20). . . . . . . . . . . 16
1.3 Probability of correct selection for MDL, AIC and EEF in estimating the number of sinusoids when H2 is true (M=3, N=20). . . 19
2.1 AR spectral estimate and background spectral data for asphalt surface (Fc = 3300). . . . . . . . . . . . . . . . . . . . . . . . 36
2.2 AR spectrum for asphalt surface plus an artificial signal (Fc = 3300). 36
2.3 Spectra of the chemicals that are used in simulations. . . . . . . . . 43
2.4 Probability PDp of detecting chemical 15 versus SNR based on a single pulse. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5 Probability PDp of detecting chemical 31 versus SNR based on a single pulse. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6 False alarms for a concrete background and fixed threshold. . . . . . 46
2.7 Probability of correct single pulse classification versus SNR. Chemical 15 is present. . . . . . . . . . . . . . . . . . . . . . . . . 51
2.8 Probability of correct single pulse classification versus SNR. Chemical 31 is present. . . . . . . . . . . . . . . . . . . . . . . . . 51
2.9 Probability of correct single pulse classification versus SNR. Chemical 45 is present. . . . . . . . . . . . . . . . . . . . . . . . . 52
2.10 Probability of correct single pulse classification versus SNR. Chemicals 15 and 16 are present. . . . . . . . . . . . . . . . . . . 53
2.11 Probability of correct single pulse classification versus SNR. Chemicals 15 and 16 are present. Chemical 29 is removed from the library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.12 Probability of correct single pulse classification versus SNR. Chemicals 20 and 45 are present. . . . . . . . . . . . . . . . . . . 54
2.13 Probability of correct single pulse classification versus SNR. Chemicals 31 and 45 are present. . . . . . . . . . . . . . . . . . . 54
2.14 Probability of correct single pulse classification versus SNR using EEF and MDL. Chemicals 15, 56 and 58 are present. . . . . . 55
2.15 Probability of correct single pulse classification versus SNR using EEF and MDL. Chemicals 15, 31 and 45 are present. . . . . . 56
2.16 Probability of correct single pulse classification versus SNR using EEF and MDL. Chemicals 20 and 45 are present. . . . . . . . 56
2.17 Probability P1 of at most one false alarm per two hours versus PFAb. 63
2.18 Probability of at most one false alarm per two hours versus PFAp . . 64
3.1 Distributed detection/classification system with two sensors . . . . 75
3.2 ROC curves for the GLRT using the constructed PDF and the clairvoyant GLRT with uncorrelated Gaussian mixture noise. . . . 95
3.3 ROC curves for the GLRT using the constructed PDF and the clairvoyant GLRT with correlated Gaussian mixture noise. . . . . 95
3.4 Probability of correct classification for both methods. . . . . . . . . 97
3.5 Probability of correct classification for both methods. . . . . . . . . 98
4.1 The periodogram It(f) = (1/N)|Σ_{n=0}^{N−1} st[n] exp(−j2πfn)|². In this case, f∗ ≈ f2 = 0.33. . . . . . . . . . . . . . . . . . . 110
4.2 Convergence of A, f , φ as σ2 → 0. . . . . . . . . . . . . . . . . . . . 111
4.3 The periodogram It(f) = (1/N)|Σ_{n=0}^{N−1} st[n] exp(−j2πfn)|². In this case, f1 < f∗ < f2. . . . . . . . . . . . . . . . . . . . 112
4.4 Convergence of A, f , φ as σ2 → 0. . . . . . . . . . . . . . . . . . . . 113
4.5 Test statistics of Lilliefors test for A, f , φ as σ2 → 0. We have 1600 realizations of {A, f , φ} for each σ2. . . . . . . . . . . . 114
4.6 Histogram of f . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.1 ROC curves for different detectors. . . . . . . . . . . . . . . . . . . 124
6.1 Distributed detection system with two sensors . . . . . . . . . . . . 130
6.2 ROC curves for the GLRT using the constructed PDF and the clairvoyant GLRT with uncorrelated Gaussian mixture noise. . . . 138
6.3 ROC curves for the GLRT using the constructed PDF and the clairvoyant GLRT with correlated Gaussian mixture noise. . . . . 138
7.1 Distributed classification system with two sensors. . . . . . . . . . . 144
7.2 Probability of correct classification for both methods. . . . . . . . . 150
MANUSCRIPT 1
Inconsistency of the MDL: On the Performance of Model Order Selection Criteria with Increasing Signal-to-Noise Ratio
Abstract
In the problem of model order selection, it is well known that the widely
used minimum description length (MDL) criterion is consistent as the sample size
N → ∞. But the consistency as the noise variance σ2 → 0 has not been studied.
In this paper, we find that the MDL is inconsistent as σ2 → 0. The result shows
that the MDL has a tendency to overestimate the model order. We also prove
that another criterion, the exponentially embedded family (EEF), is consistent as
σ2 → 0. Therefore, in a high signal-to-noise ratio (SNR) scenario, the EEF provides a
better criterion to use for model order selection.
1.1 Introduction
Model order selection is a fundamental problem in signal processing. It has
many practical applications such as radar, computer vision and biomedical systems.
Model order selection is essentially a composite hypothesis testing problem, for which
the probability density functions (PDFs) are known except for some parameters.
Without the knowledge of those parameters, there exists no optimal solution. A
simple and common approach is the generalized likelihood ratio test (GLRT) which
replaces the unknown parameters by their maximum likelihood estimates (MLEs).
However in the case when the model orders are hierarchically nested, the GLRT
philosophy does not work since it will always choose the largest candidate order
(see [1] for a simple example). Many methods have been proposed to offset this
overestimating tendency based on different information criteria, such as the Akaike
information criterion (AIC) [2], the MDL [3], [4], and the EEF [5]. The reader may
wish to read [6] for a review of information criterion rules for model order selection.
One would prefer a criterion that always chooses the true model order given a
large enough number of samples. The consistency of the MDL and the inconsistency
of the AIC as the sample size N → ∞ were shown in [7]: the MDL picks the true
order with probability one, while the AIC tends to overestimate the model order
as N → ∞. The consistency of the EEF as N → ∞ is shown in [8].
Beyond the above consistency as N → ∞, one would also wish a criterion
to possess another type of consistency, which we call consistency as σ2 → 0: the
criterion chooses the true model order in probability as the noise level decreases
to zero. This is the consistency that we will discuss throughout this paper. Fisher
consistency [9] is the same as consistency as σ2 → 0 in parameter estimation in
curved exponential families [10]. To our knowledge, no work has been done on
consistency as σ2 → 0 for model order selection criteria. In
this paper, we will show that the MDL and the AIC are inconsistent as the noise
variance σ2 → 0. This means that even under high SNR conditions, the MDL and
the AIC still tend to overestimate the model order. Note that the overestimation of
the MDL and the AIC has also been noticed in [11], [12] for some array processing
problems. We then show that the EEF is consistent as σ2 → 0. Simulation results
are provided to support our analysis.
The paper is organized as follows. Section 1.2 presents the problem and the
model order selection criteria. Then we introduce a linear model and show the
inconsistency as σ2 → 0 for the MDL and the AIC in Section 1.3. In Section 1.4,
we prove that the EEF is consistent as σ2 → 0. Simulation results are given in
Section 1.5 to justify our derivation. Finally, Section 1.6 concludes the paper.
1.2 Problem Statement
Consider the multiple composite hypothesis testing problem where we have
M candidate models. Under each model Hi, we have
Hi : x = si(θi) + w = si(θi) + σu (1.1)
for i = 1, 2, . . . , M. Here x is an N × 1 vector of samples, the N × 1 signal si(θi) is
known except for the unknown i × 1 vector of parameters θi, w = σu is the N × 1
noise vector with known variance σ2, and u has a well-defined PDF. So each Hi is
described by a PDF p(x; θi). We assume that the model orders are hierarchically
nested, i.e., the signal si(θi) can be written as

si(θi) = s([θ1, . . . , θi, 0, . . . , 0]^T)    (1.2)

where s is a function of an M × 1 vector, for i = 1, 2, . . . , M. Thus the unknown
parameters of a higher-order signal contain all of those of a lower-order model.
Let H0 be a reference hypothesis with s([0, 0, . . . , 0]^T) = 0, so that the PDF p(x; θ0)
is completely known as noise only. Then the MDL, AIC and EEF rules choose the
model order that maximizes, respectively,

−MDL(i) = lGi(x) − i ln N

−AIC(i) = lGi(x) − 2i

EEF(i) = ( lGi(x) − i [ ln( lGi(x)/i ) + 1 ] ) u( lGi(x)/i − 1 )

for i = 1, 2, . . . , M, where u(·) is the unit step function and
lGi(x) = 2 ln [ p(x; θ̂i) / p(x; θ0) ], with θ̂i the MLE of θi. Note that the inclusion
of the term −2 ln p(x; θ0) does not affect the maximization, so we use the
log-likelihood ratio instead of the more usual log-likelihood for the MDL and the
AIC. Note also that we assume a real signal model in (1.1); the results in this paper
can be easily extended to a complex signal model. In the next section we will
implement these rules in the linear model to show the inconsistency of the MDL
and the AIC as σ2 → 0.
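As a concrete illustration (not part of the paper), the three selection rules can be scored directly once the generalized likelihood ratio statistics lGi(x) are available. The sketch below uses purely hypothetical lGi values:

```python
import numpy as np

def neg_mdl(lG, i, N):
    # -MDL(i) = lGi(x) - i*ln(N); the MDL picks the order maximizing this
    return lG - i * np.log(N)

def neg_aic(lG, i):
    # -AIC(i) = lGi(x) - 2i
    return lG - 2.0 * i

def eef(lG, i):
    # EEF(i) = (lG - i*[ln(lG/i) + 1]) * u(lG/i - 1), with u the unit step
    if lG / i <= 1.0:
        return 0.0
    return lG - i * (np.log(lG / i) + 1.0)

# Hypothetical GLR statistics lGi(x) for candidate orders i = 1..4
# (illustrative numbers only, not from the paper)
lG = {1: 50.0, 2: 90.0, 3: 91.5, 4: 93.0}
N = 20

best_mdl = max(lG, key=lambda i: neg_mdl(lG[i], i, N))
best_aic = max(lG, key=lambda i: neg_aic(lG[i], i))
best_eef = max(lG, key=lambda i: eef(lG[i], i))
print(best_mdl, best_aic, best_eef)   # all three select order 2 here
```

With these numbers the small gains in lGi beyond i = 2 do not outweigh any of the penalty terms, so all three rules agree; the analysis that follows concerns regimes where they do not.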
1.3 Inconsistency of the MDL and the AIC
Without causing any confusion, we will use consistency to mean consistency
as σ2 → 0 for the rest of the paper unless otherwise mentioned. In this section,
we will limit the derivation to the MDL. We will start by introducing the linear
model with Gaussian noise, from which we derive the performance of the MDL.
Then the inconsistency of the MDL is readily seen. Although the inconsistency of
the MDL is proved here for the linear model with Gaussian noise, the MDL should
be expected to be inconsistent in general (for non-linear, non-Gaussian models). The
inconsistency of the AIC follows directly from the analysis of the MDL.
1.3.1 The Linear Model
Consider the following linear model:
Hi : x = Hiθi + w for i = 1, 2, . . . , M
where M is the maximum order of all the candidate models, Hi = [h1,h2, . . . ,hi]
is an N × i (with N > M) known observation matrix with full column rank,
θi = [θ1, θ2, . . . , θi]T is an i × 1 unknown parameter vector of the amplitudes, and
w is an N × 1 white Gaussian noise vector with known variance σ2. For the linear
model, lGi(x) = x^T Pi x / σ2, where Pi = Hi (Hi^T Hi)^{-1} Hi^T is the projection
matrix that projects x onto the subspace Vi generated by h1, h2, . . . , hi [13]. So the
MDL rule chooses the model order that minimizes

MDL(i) = −x^T Pi x / σ2 + i ln N    for i = 1, 2, . . . , M

Let yi = x^T Pi+1 x / σ2 − x^T Pi x / σ2 for i = 1, 2, . . . , M − 1; we then have the
following theorem.
(See Appendix 1A for the proof of Theorem 1)
Theorem 1 (PDF of yj for j ≥ p). If the true model order is Hp (p ≤ M), that
is, θi = 0 for all i > p, then the yj's for all j ≥ p do not depend on θp or σ2,
and they are independent and identically distributed (IID), each with a chi-square
distribution with 1 degree of freedom.
As we will show next, this theorem gives us a way to find a lower bound of
the probability that the MDL will choose the wrong model order.
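Theorem 1 also lends itself to a quick numerical check. The sketch below is a Monte Carlo illustration with an arbitrarily drawn observation matrix, amplitudes, and noise variance (none of these numbers come from the paper); for j ≥ p the yj samples should exhibit the χ²_1 mean of 1 and variance of 2:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, p = 20, 4, 2                      # samples, max order, true order (hypothetical)
sigma2 = 0.5
H = rng.standard_normal((N, M))         # randomly drawn full-rank observation matrix
theta_p = np.array([1.0, -2.0])         # true amplitudes under H_p

def proj(Hi):
    # projection matrix onto the column space of Hi
    return Hi @ np.linalg.solve(Hi.T @ Hi, Hi.T)

P = [proj(H[:, :i]) for i in range(1, M + 1)]

# Monte Carlo samples of y_j = x^T P_{j+1} x / sigma^2 - x^T P_j x / sigma^2
trials = 4000
y = np.empty((trials, M - 1))
for t in range(trials):
    x = H[:, :p] @ theta_p + np.sqrt(sigma2) * rng.standard_normal(N)
    q = [x @ Pi @ x / sigma2 for Pi in P]
    y[t] = np.diff(q)

# For j >= p, Theorem 1 predicts chi^2_1: sample mean near 1, variance near 2
print(y[:, p - 1:].mean(axis=0))
print(y[:, p - 1:].var(axis=0))
```

The key mechanism is visible in the code: for j ≥ p the signal lies in Vj, so (Pj+1 − Pj)x involves only the noise, and the yj statistics no longer depend on θp or σ2.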
1.3.2 Inconsistency of the MDL
We will show that the probability of overestimation does not converge to zero
as σ2 → 0.
If Hp (p < M) is true, then the probability that the MDL will choose the wrong
model order is

Pe = Pr{Hj, j ≠ p | Hp}
   = 1 − Pr{MDL(p) < MDL(j) for all j ≠ p | Hp}
   ≥ 1 − Pr{MDL(p) < MDL(j) for all j > p | Hp}
   = Pr{MDL(p) ≥ MDL(j) for some j > p | Hp}    (1.3)

Since MDL(j) − MDL(j+1) = yj − ln N and, for j > p,
MDL(p) − MDL(j) = Σ_{i=p}^{j−1} yi − (j − p) ln N, we have

Pr{MDL(p) ≥ MDL(j) for some j > p | Hp}
   = Pr{ yp ≥ ln N or yp + yp+1 ≥ 2 ln N or · · · or Σ_{i=p}^{M−1} yi ≥ (M − p) ln N | Hp }    (1.4)

By Theorem 1, yj ∼ χ²_1 and the yj's are independent for j ≥ p. So the probability in
(1.4) can be found analytically, although this may not be easy. Alternatively, we can
find a lower bound of (1.4) which is much easier to calculate. Notice that

Pr{ yp ≥ ln N or yp + yp+1 ≥ 2 ln N or · · · or Σ_{i=p}^{M−1} yi ≥ (M − p) ln N | Hp }
   ≥ Pr{ yp ≥ ln N | Hp }
   = Pr{ X ≥ √(ln N) or X ≤ −√(ln N) }
   = 2Q(√(ln N))    (1.5)
where X is a standard Gaussian random variable (since yp ∼ χ²_1 under Hp) and
Q(x) is the right-tail probability of the standard Gaussian distribution, that is,
Q(x) = ∫_x^∞ (1/√(2π)) exp(−t²/2) dt. So 2Q(√(ln N)) is also a lower bound on
the probability of error Pe for the MDL. Note that this lower bound decreases
slowly as N increases. For example, in order to have Pe ≤ 0.01, we require that
2Q(√(ln N)) ≤ 0.01, and we need as many as N = 761 samples. This lower
bound depends only on the number of samples N. So when N is fixed, this lower
bound stays fixed even as σ2 → 0. This shows that the MDL is inconsistent. Since
Pr{MDL(p) ≥ MDL(j) for some j > p | Hp} is bounded below by a fixed bound,
the MDL has a tendency to overestimate the model order.
For the AIC, we just need to replace ln N by 2, so the lower bound is 2Q(√2).
Hence the AIC is also inconsistent. Notice that 2Q(√(ln N)) → 0 as N → ∞,
while 2Q(√2) is a constant. This agrees with the results in [7]: since the MDL is
consistent as N → ∞, its lower bound 2Q(√(ln N)) should decrease to 0, whereas
the constant lower bound 2Q(√2) shows that the AIC is inconsistent even as
N → ∞.
1.4 Consistency of the EEF
As a complement to Section 1.3, we will first show that the EEF is consistent
for the linear model. Next, we will prove that the EEF is consistent in general.
1.4.1 Consistency of the EEF for the Linear Model
The next theorem will be used to prove the consistency of the EEF for the
linear model. (See Appendix 1B for the proof of Theorem 2)
Theorem 2 (PDF of yj for j < p). If the true model order is Hp, then
for j < p, yj has a noncentral chi-square distribution with 1 degree of freedom
and noncentrality parameter λj = αj/σ2, where Hj+1,p = [hj+1, hj+2, . . . , hp],
θj+1,p = [θj+1, θj+2, . . . , θp]^T, and αj = (Hj+1,p θj+1,p)^T (Pj+1 − Pj) Hj+1,p θj+1,p.
Furthermore, the yj's are independent for all j.
The EEF chooses the model order that maximizes

EEF(i) = ( lGi(x) − i [ ln( lGi(x)/i ) + 1 ] ) u( lGi(x)/i − 1 )
       = ( x^T Pi x/σ2 − i [ ln( x^T Pi x/(iσ2) ) + 1 ] ) u( x^T Pi x/(iσ2) − 1 )    (1.6)

If Hp is true, it is well known that [1]

lGp(x) = x^T Pp x/σ2 ∼ χ′²_p(λ)    (1.7)

where λ = ‖Hp θp‖²/σ2. In order to prove the consistency of the EEF in probability,
we need to show that

Pr{ arg max_i EEF(i) = p } → 1

as σ2 → 0. We start by comparing EEF(j) with EEF(p) as σ2 → 0, for j > p
and for j < p.
For j > p, we know that [1]

lGj(x) = x^T Pj x/σ2 ∼ χ′²_j(λ)    (1.8)

where λ is the same as in (1.7). The lemma in [8] shows that if Y is distributed
according to χ′²_ν(an), where a is a positive constant, then as n → ∞, Y/n converges
to a in probability, or in symbols, Y/n →P a. Replacing n by 1/σ2 and using (1.7)
and (1.8), we have, as σ2 → 0,

σ2 lGj(x) →P ‖Hp θp‖²    for j ≥ p    (1.9)
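The scaling behind (1.9) can be illustrated directly by drawing noncentral chi-square samples; the sketch below uses arbitrary values a = 4 (standing in for ‖Hp θp‖²) and df = 3 (standing in for j), neither of which comes from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
a, df = 4.0, 3   # a plays the role of ||H_p theta_p||^2, df the role of j

# Y ~ chi'^2_df(a / sigma^2); by the lemma, sigma^2 * Y -> a in probability,
# so the scaled samples should concentrate around a as sigma^2 shrinks
for sigma2 in (1e-1, 1e-3, 1e-5):
    y = rng.noncentral_chisquare(df, a / sigma2, size=2000)
    print(sigma2, (sigma2 * y).mean(), (sigma2 * y).std())
```

The printed standard deviations shrink roughly like √σ², which is the concentration that the convergence-in-probability statement captures.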
By the definition of convergence in probability, we have

    Pr{ |σ² l_{G_j}(x) − ‖H_p θ_p‖²| < ε } → 1 for j ≥ p    (1.10)

as σ² → 0 for all ε > 0. Since σ² → 0, we can find σ² small enough such that j < (‖H_p θ_p‖² − ε)/σ². Hence, we have

    Pr{ l_{G_j}(x) > j } ≥ Pr{ l_{G_j}(x) > (‖H_p θ_p‖² − ε)/σ² }
                        ≥ Pr{ |σ² l_{G_j}(x) − ‖H_p θ_p‖²| < ε } for j ≥ p    (1.11)
Therefore, as a result of (1.10) and (1.11),

    Pr{ l_{G_j}(x)/j − 1 > 0 } → 1 for j ≥ p    (1.12)

as σ² → 0 and we can discard the unit step function. As a result,

    EEF(p) − EEF(j) = l_{G_p}(x) − l_{G_j}(x) − p ln l_{G_p}(x) + j ln l_{G_j}(x) + p ln p − j ln j − p + j
                    = l_{G_p}(x) − l_{G_j}(x) − p ln( σ² l_{G_p}(x) ) + j ln( σ² l_{G_j}(x) ) + c    (1.13)

where

    c = (p − j) ln σ² + p ln p − j ln j − p + j    (1.14)
By Theorem 1,

    l_{G_p}(x) − l_{G_j}(x) ∼ −χ²_{j−p}    (1.15)

As a result of (1.9), by the continuity of the logarithm we have [14]

    ln( σ² l_{G_j}(x) ) →P ln ‖H_p θ_p‖² for j ≥ p    (1.16)
We divide (1.13) by c and get

    (EEF(p) − EEF(j))/c = [ l_{G_p}(x) − l_{G_j}(x) − p ln( σ² l_{G_p}(x) ) + j ln( σ² l_{G_j}(x) ) ]/c + 1    (1.17)

Since 1/c → 0⁺ for j > p as σ² → 0, as a result of (1.15) and (1.16), we have (see Theorems 2.3.3 and 2.3.5 on pages 70-71 in [14] and Theorem (4)(a) on page 310 in [15])

    [ l_{G_p}(x) − l_{G_j}(x) − p ln( σ² l_{G_p}(x) ) + j ln( σ² l_{G_j}(x) ) ]/c →P 0    (1.18)

and hence

    (EEF(p) − EEF(j))/c →P 1    (1.19)

for j > p. This shows that as σ² → 0, Pr{EEF(p) > EEF(j)} → 1.
For j < p, similar to the derivation in Appendix 1B, the distribution of l_{G_j}(x) = x^T P_j x/σ² can be found as

    l_{G_j}(x) ∼ χ'²_j(λ')    (1.20)

where λ' = (H_p θ_p)^T P_j H_p θ_p / σ². So we also have

    Pr{ l_{G_j}(x)/j − 1 > 0 } → 1 for j ≤ p    (1.21)

as σ² → 0. Thus we can also omit the unit step function and have

    EEF(p) − EEF(j) = l_{G_p}(x) − l_{G_j}(x) − p ln( σ² l_{G_p}(x) ) + j ln( σ² l_{G_j}(x) ) + c    (1.22)

where

    c = (p − j) ln σ² + p ln p − j ln j − p + j    (1.23)
Now by Theorem 2,

    l_{G_p}(x) − l_{G_j}(x) ∼ χ'²_{p−j}( Σ_{i=j}^{p−1} λ_i ) = χ'²_{p−j}( Σ_{i=j}^{p−1} α_i/σ² )    (1.24)

so that by the lemma in [8], we have

    σ²( l_{G_p}(x) − l_{G_j}(x) ) →P Σ_{i=j}^{p−1} α_i    (1.25)
Similarly to the above analysis, we have

    ln( σ² l_{G_p}(x) ) →P ln ‖H_p θ_p‖²
    ln( σ² l_{G_j}(x) ) →P ln( (H_p θ_p)^T P_j H_p θ_p ) for j < p    (1.26)

Hence, with σ² → 0 we have

    σ² ln( σ² l_{G_j}(x) ) →P 0 for j ≤ p    (1.27)

Obviously, σ²c → 0. So by (1.22), (1.25) and (1.27), we have

    σ²( EEF(p) − EEF(j) ) = σ²( l_{G_p}(x) − l_{G_j}(x) ) − pσ² ln( σ² l_{G_p}(x) ) + jσ² ln( σ² l_{G_j}(x) ) + σ²c
                          →P Σ_{i=j}^{p−1} α_i > 0    (1.28)

for j < p. This means that Pr{EEF(p) > EEF(j)} → 1 as σ² → 0.
Finally, we have shown that Pr{EEF(p) > EEF(j)} → 1 for all j ≠ p. Since Pr{A₁ ∩ A₂} → 1 if Pr{A₁} → 1 and Pr{A₂} → 1 [14], as a result,

    Pr{ arg max_i EEF(i) = p } → 1

as σ² → 0. This completes the proof that the EEF is consistent for the linear model.
1.4.2 Consistency of the EEF in General
In the general case, the signal s(θ_i) does not have to be a linear transformation of θ_i, and the noise w does not have to be Gaussian. To prove the consistency of the EEF in general, we first write the model in (1.1) as

    H_i : x = s_i(θ_i) + σ_n u    (1.29)

where the N × 1 signal s_i(θ_i) depends on the i × 1 unknown parameter vector θ_i, u has a well-defined PDF, and {σ_n} is an arbitrary positive sequence that converges to 0. We use the sequence {σ_n} because if we consider the probability of correct model order selection Pc as a function of σ², then the following conditions are equivalent [16]:

Condition 1) lim_{σ²→0} Pc(σ²) = 1

Condition 2) lim_{n→∞} Pc(σ_n²) = 1 for any sequence {σ_n²} that converges to 0

Hence we will prove Condition 2) to show the consistency of the EEF.
Let us assume the following.

Assumption 1): s_i(θ_i) is Lipschitz continuous, i.e., there exists K > 0 such that ‖s_i(θ_i^1) − s_i(θ_i^2)‖ ≤ K‖θ_i^1 − θ_i^2‖ for all θ_i^1, θ_i^2.

Note that the linear signal s_i(θ_i) = H_i θ_i is Lipschitz continuous since s_i(θ_i) is a linear transformation of θ_i [17].

Assumption 2): The PDF p_U(u) of u satisfies

    p_U(u_n)/p_U(v_n) → ∞ if ‖v_n‖ − ‖u_n‖ → ∞

where {u_n}, {v_n} are deterministic sequences, and

    ln p_U(u) is Lipschitz continuous on the set {u : ‖u‖ ≤ l} for any l > 0,

i.e., for any l > 0, there exists L > 0 such that |ln p_U(u₁) − ln p_U(u₂)| ≤ L‖u₁ − u₂‖ for all u₁, u₂ with ‖u₁‖ ≤ l, ‖u₂‖ ≤ l.
Note that the Gaussian and Gaussian mixture PDFs satisfy Assumption 2). For example, let the Gaussian mixture PDF be

    p_U(u) = Σ_{i=1}^m ( α_i/√(2πσ_i²) ) e^{ −‖u‖²/(2σ_i²) }

where α_i > 0 and Σ_{i=1}^m α_i = 1. Let σ²_max = max{σ₁², . . . , σ_m²}, σ²_min = min{σ₁², . . . , σ_m²}, and let α be the α_i that corresponds to σ²_max. Then we have

    p_U(u)/p_U(v) = [ Σ_{i=1}^m ( α_i/√(2πσ_i²) ) e^{ −‖u‖²/(2σ_i²) } ] / [ Σ_{i=1}^m ( α_i/√(2πσ_i²) ) e^{ −‖v‖²/(2σ_i²) } ]
                  > [ ( α/√(2πσ²_max) ) e^{ −‖u‖²/(2σ²_max) } ] / [ ( 1/√(2πσ²_min) ) e^{ −‖v‖²/(2σ²_max) } ]
                  = α √(σ²_min/σ²_max) exp( (‖v‖² − ‖u‖²)/(2σ²_max) )    (1.30)

So if ‖v_n‖ − ‖u_n‖ → ∞, it follows that ‖v_n‖² − ‖u_n‖² → ∞ and hence p_U(u_n)/p_U(v_n) → ∞.
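The bound in (1.30) is easy to verify numerically. A minimal sketch follows (the mixture weights and variances are arbitrary illustrative choices, not values from the text); since the densities depend on u only through ‖u‖, scalars stand in for the norms.

```python
import numpy as np

# Arbitrary illustrative two-component mixture (m = 2).
alphas = np.array([0.3, 0.7])
vars_ = np.array([1.0, 4.0])

def p_U(norm_u):
    # mixture density evaluated through the norm ||u||
    return np.sum(alphas / np.sqrt(2 * np.pi * vars_)
                  * np.exp(-norm_u**2 / (2 * vars_)))

s2max, s2min = vars_.max(), vars_.min()
alpha = alphas[np.argmax(vars_)]        # the alpha_i paired with sigma^2_max

def bound(nu, nv):
    # right-hand side of (1.30)
    return alpha * np.sqrt(s2min / s2max) * np.exp((nv**2 - nu**2) / (2 * s2max))

# the density ratio dominates the bound, which grows with ||v|| - ||u||
for nu, nv in [(1.0, 3.0), (1.0, 6.0), (2.0, 10.0)]:
    assert p_U(nu) / p_U(nv) > bound(nu, nv)
```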
Let H_p be the true model. With the above assumptions, the following theorems are proved in Appendices 1C-1E.

Theorem 3 (l_{G_j}(x) unbounded in probability for j ≥ p). There exists a sequence {N_n} with N_n → ∞ such that Pr{l_{G_j}(x) > N_n} → 1 as σ_n → 0 for j ≥ p.

Note that each {N_n} implicitly depends on σ_n. For example, in the linear model for j ≥ p,

    l_{G_j}(x) = x^T P_j x/σ_n² ∼ χ'²_j(λ)

where λ = ‖H_p θ_p‖²/σ_n². If we choose N_n = ‖H_p θ_p‖²/(2σ_n²), (1.9) implies that Pr{l_{G_j}(x) > N_n} → 1 as σ_n → 0.
Theorem 4 (l_{G_j}(x) − l_{G_p}(x) bounded in probability for j > p). For any sequence {m_n}, Pr{l_{G_j}(x) − l_{G_p}(x) < m_n} → 1 as m_n → ∞ for j > p.

Here the sequence {m_n} can be an arbitrary sequence with m_n → ∞, so m_n does not depend on σ_n. For example, in the linear model for j > p,

    l_{G_j}(x) − l_{G_p}(x) ∼ χ²_{j−p}

So for any {m_n}, Pr{l_{G_j}(x) − l_{G_p}(x) < m_n} → 1 as m_n → ∞ for j > p.
Theorem 5 (l_{G_p}(x) − l_{G_j}(x) unbounded in probability for j < p). There exists a sequence {M_n} with M_n → ∞ such that Pr{l_{G_p}(x) − l_{G_j}(x) > M_n} → 1 as σ_n → 0 for j < p.

Note that each M_n also implicitly depends on σ_n. For example, in the linear model for j < p, by (1.24),

    l_{G_p}(x) − l_{G_j}(x) ∼ χ'²_{p−j}( Σ_{i=j}^{p−1} α_i/σ_n² )

If we choose M_n = Σ_{i=j}^{p−1} α_i/(2σ_n²), it can be shown that Pr{l_{G_p}(x) − l_{G_j}(x) > M_n} → 1 as σ_n → 0.
First we consider when j > p. For each σ_n, let D_n^j = {u : l_{G_j}(x) > N_n}, D_n^p = {u : l_{G_p}(x) > N_n}, E_n = {u : l_{G_j}(x) − l_{G_p}(x) < m_n}, and F_n = {u : EEF(p) > EEF(j)}. Then for any u ∈ D_n^j ∩ D_n^p ∩ E_n, since N_n → ∞, we can omit the unit step function in the EEF. So we have

    EEF(p) − EEF(j) = l_{G_p}(x) − p[ ln( l_{G_p}(x)/p ) + 1 ] − l_{G_j}(x) + j[ ln( l_{G_j}(x)/j ) + 1 ]
                    = p ln( l_{G_j}(x)/l_{G_p}(x) ) + (j − p) ln l_{G_j}(x) − ( l_{G_j}(x) − l_{G_p}(x) ) + p ln p − j ln j − p + j    (1.31)

Note that l_{G_j}(x)/l_{G_p}(x) ≥ 1, ln l_{G_j}(x) > ln N_n, and l_{G_j}(x) − l_{G_p}(x) < m_n. Since m_n is arbitrary, we can choose m_n < (j − p) ln N_n + p ln p − j ln j − p + j but still with m_n → ∞ so that EEF(p) − EEF(j) > 0. This shows that D_n^j ∩ D_n^p ∩ E_n ⊆ F_n. By Theorems 3 and 4, we have Pr{D_n^j} → 1, Pr{D_n^p} → 1 and Pr{E_n} → 1, and hence Pr{D_n^j ∩ D_n^p ∩ E_n} → 1. This shows that Pr{F_n} → 1 as σ_n → 0, i.e., Pr{EEF(p) > EEF(j)} → 1 as σ_n → 0 for j > p.
Next, when j < p, let D_n^p = {u : l_{G_p}(x) > N_n}, G_n = {u : l_{G_p}(x) − l_{G_j}(x) > M_n}, and H_n = {u : EEF(p) > EEF(j)} for each σ_n. Note that H_n and F_n are different since the former is for j < p and the latter is for j > p. For any u ∈ D_n^p ∩ G_n, we have

    EEF(p) − EEF(j) = ( l_{G_p}(x) − l_{G_j}(x) ) + j ln l_{G_j}(x) − p ln l_{G_p}(x) + p ln p − j ln j − p + j    (1.32)

Since x − p ln x increases as x increases for x > p, we can find N_n and M_n such that EEF(p) − EEF(j) > 0. This shows that D_n^p ∩ G_n ⊆ H_n. By Theorem 3 with j = p and Theorem 5, the rest of the proof is the same as for j > p.

Since we have shown that Pr{EEF(p) > EEF(j)} → 1 for all j ≠ p, we have Pr{arg max_i EEF(i) = p} → 1 as σ² → 0 using the property that Pr{A₁ ∩ A₂} → 1 if Pr{A₁} → 1 and Pr{A₂} → 1 [14].
1.5 Simulation Results

1.5.1 Linear Signal
For the linear model when M = 2:

    H₁ : x = h₁θ₁ + w
    H₂ : x = [h₁ h₂][θ₁, θ₂]^T + w = H₂θ₂ + w
If H₁ is true, by (1.4) and (1.5), the probability that the MDL will choose H₂ is

    Pr{H₂|H₁} = Pr{MDL(1) ≥ MDL(2)|H₁} = Pr{y₁ ≥ ln N|H₁} = 2Q(√(ln N))    (1.33)

So in this case, the lower bound is exactly the probability of overestimation error for the MDL. For the AIC, the lower bound 2Q(√2) is also exactly the probability of overestimation error. Hence the probabilities of correct model order selection Pc (note here that there is no underestimation error since the correct order is k = 1) for the MDL and the AIC are

    Pc(MDL) = 1 − 2Q(√(ln N))
    Pc(AIC) = 1 − 2Q(√2)
For the simulation, we use N = 20, h₁ = [1, 1, . . . , 1]^T, h₂ = [1, −1, 1, −1, . . . , 1, −1]^T, θ₁ = 1 and θ₂ = 0. We plot Pc versus 1/σ². It can be expected that Pc(MDL) = 1 − 2Q(√(ln 20)) = 0.917 and Pc(AIC) = 1 − 2Q(√2) = 0.843, and Figure 1.1 verifies our result. We can see that the EEF appears to be consistent in accordance with the theory, while the MDL and the AIC are inconsistent. Also, the performances of the MDL and the AIC do not depend on σ².
Figure 1.1. Performance of MDL, AIC and EEF for the linear model when H₁ is true (M = 2, N = 20); Pc is plotted versus 1/σ².
Next we consider polynomial order estimation, which is essentially a linear model. We assume that M = 4, N = 20 and the true model order is H₃ with the nth element of s(θ₃) being s[n] = 0.1 + 0.3n + 0.1n² for n = 0, 1, . . . , N − 1. Pc is plotted versus 1/σ². As shown in Figure 1.2, the EEF is consistent and the MDL and the AIC are inconsistent. In this case, we cannot find Pc explicitly for the MDL and the AIC, but we can see that their performances are bounded above by 1 − 2Q(√(ln 20)) = 0.917 and 1 − 2Q(√2) = 0.843, respectively.
Figure 1.2. Performance of MDL, AIC and EEF in estimating the polynomial model order when H₃ is true (M = 4, N = 20); the probability of correct selection is plotted versus 1/σ².
1.5.2 Non-Linear Signal

We consider the problem of estimating the number of sinusoids. Suppose that under the ith model, the signal consists of i sinusoids embedded in white Gaussian noise. That is,

    H_i : x[n] = Σ_{j=1}^i A_j cos(2πf_j n + φ_j) + w[n]

for n = 0, 1, . . . , N − 1, i = 1, 2, . . . , M, where the amplitudes A_j, the frequencies f_j and the phases φ_j are unknown. To make the problem identifiable, we assume that A_j > 0, 0 < f_j < 1/2, and 0 ≤ φ_j < 2π. It can be easily checked that Assumptions 1) and 2) are satisfied for this example. Notice that if the frequencies f_j are known, the model can be reduced to the linear model [13]
    H_i : x = H_i α_i + w    (1.34)

where

    H_i = [ 1                0                · · ·   1                0
            cos 2πf₁         sin 2πf₁         · · ·   cos 2πf_i        sin 2πf_i
            ⋮                ⋮                        ⋮                ⋮
            cos(2πf₁(N−1))   sin(2πf₁(N−1))   · · ·   cos(2πf_i(N−1))  sin(2πf_i(N−1)) ]

is an N × 2i observation matrix for the ith model, and

    α_i = [A₁ cos φ₁, −A₁ sin φ₁, . . . , A_i cos φ_i, −A_i sin φ_i]^T

is a one-to-one transformation of the amplitudes A_j and phases φ_j. As a result, the MLEs of the A_j and φ_j can be found from the MLE of α_i according to the linear model in (1.34), whose observation matrix H_i depends on the f_j. So the MLE of α_i is

    α̂_i = (H_i^T H_i)^{−1} H_i^T x    (1.35)

which is a function of f_j for j = 1, 2, . . . , i.
If the frequencies f_j are unknown, as a result of (1.35), the MLEs of the f_j can be found by maximizing the following over the f_j:

    g(f₁, f₂, . . . , f_i) = x^T H_i (H_i^T H_i)^{−1} H_i^T x    (1.36)

Note that (1.36) is a function of the f_j because H_i depends on the f_j.
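For i = 1 the maximization of (1.36) reduces to a one-dimensional grid search. A minimal sketch (hypothetical function name, assuming NumPy) is:

```python
import numpy as np

def freq_mle_1sin(x, grid):
    """Grid-search MLE of a single sinusoid frequency (i = 1): maximize
    g(f) = x^T H (H^T H)^{-1} H^T x over the candidate frequencies."""
    n = np.arange(len(x))
    best_f, best_g = None, -np.inf
    for f in grid:
        H = np.column_stack([np.cos(2 * np.pi * f * n),
                             np.sin(2 * np.pi * f * n)])
        g = x @ H @ np.linalg.solve(H.T @ H, H.T @ x)   # projection statistic
        if g > best_g:
            best_f, best_g = f, g
    return best_f
```

For a noiseless sinusoid whose frequency lies on the grid, the search recovers it exactly, since g(f) ≤ ‖x‖² with equality only when x lies in the column space of H(f).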
We denote the observation matrix H_i corresponding to the MLE of the f_j as Ĥ_i. Note that the number of unknown parameters is 3i under H_i. Similar to the previous subsection, the MDL, the AIC and the EEF choose the model order with the largest of the following, respectively:

    −MDL(i) = x^T Ĥ_i (Ĥ_i^T Ĥ_i)^{−1} Ĥ_i^T x / σ² − 3i ln N

    −AIC(i) = x^T Ĥ_i (Ĥ_i^T Ĥ_i)^{−1} Ĥ_i^T x / σ² − 6i

    EEF(i) = ( x^T Ĥ_i (Ĥ_i^T Ĥ_i)^{−1} Ĥ_i^T x / σ² − 3i[ ln( x^T Ĥ_i (Ĥ_i^T Ĥ_i)^{−1} Ĥ_i^T x / (3iσ²) ) + 1 ] )
             · u( x^T Ĥ_i (Ĥ_i^T Ĥ_i)^{−1} Ĥ_i^T x / (3iσ²) − 1 )    (1.37)
In the simulation, we assume that M = 3, N = 20 and the true model order is H₂ with s[n] = cos(2π(0.1)n) + 0.8 cos(2π(0.3)n + π/5) for n = 0, 1, . . . , N − 1. The MLEs of the f_j that maximize (1.36) are found by grid search. In Figure 1.3, we also observe the consistency of the EEF and the inconsistency of the MDL and the AIC as σ² → 0. The probabilities of correct selection appear to have upper bounds for the MDL and the AIC, although no explicit bounds are calculated in this non-linear signal case.
1.6 Conclusion

The inconsistency of the MDL and the AIC as σ² → 0 has been shown, and a simple lower bound on their overestimation probability has been provided. The consistency of the EEF as σ² → 0 has also been proved. Simulation results show that the EEF performs essentially perfectly under small noise while the MDL and the AIC do not.
Figure 1.3. Probability of correct selection for MDL, AIC and EEF in estimating the number of sinusoids when H₂ is true (M = 3, N = 20).

Appendix 1A - Derivation of the Distribution of the y_j's for j ≥ p

We need the following lemma to derive the distribution of the y_j's.

Lemma 1. P_{j+1} − P_j has rank 1.

Proof. Suppose that for the subspace V_j generated by h₁, h₂, . . . , h_j, we have an orthonormal basis {v₁, v₂, . . . , v_j}. Then for the subspace V_{j+1} generated by h₁, h₂, . . . , h_{j+1}, we can have an orthonormal basis {v₁, v₂, . . . , v_j, v_{j+1}}. Since P_j is the projection matrix onto the subspace V_j, for any N × 1 vector x, we have
    P_j x = Σ_{i=1}^j ⟨x, v_i⟩ v_i    (1.38)

where ⟨x, v_i⟩ is the inner product defined by ⟨x, v_i⟩ = x^T v_i. Similarly, we also have

    P_{j+1} x = Σ_{i=1}^{j+1} ⟨x, v_i⟩ v_i    (1.39)

So (1.38) and (1.39) tell us that for any x,

    (P_{j+1} − P_j) x = ⟨x, v_{j+1}⟩ v_{j+1} = α v_{j+1}    (1.40)

for a scalar α. This shows that P_{j+1} − P_j has rank 1 since it projects any x onto the 1-dimensional subspace generated by v_{j+1}.
Since we assume under H_p that x = H_p θ_p + w,

    y_p = (H_p θ_p + w)^T (P_{p+1} − P_p)(H_p θ_p + w) / σ².    (1.41)

Since H_p θ_p = Σ_{i=1}^p θ_i h_i ∈ V_p, the projection of H_p θ_p onto V_p remains the same. That is,

    P_p H_p θ_p = H_p θ_p.

Also H_p θ_p = Σ_{i=1}^p θ_i h_i + 0·h_{p+1} ∈ V_{p+1}, thus P_{p+1} H_p θ_p = H_p θ_p. So we have

    (P_{p+1} − P_p) H_p θ_p = 0

and hence

    y_p = w^T (P_{p+1} − P_p) w / σ² = u^T (P_{p+1} − P_p) u    (1.42)

where u = w/σ is an N × 1 white Gaussian noise vector with unit variance. For j > p, we can think of H_p θ_p as H_j θ_j where θ_j = [θ₁, θ₂, . . . , θ_p, 0, . . . , 0]^T. By the same derivation as above, we can also show that

    y_j = u^T (P_{j+1} − P_j) u.    (1.43)

It is well known that P_j is a symmetric idempotent matrix and P_{j+1} P_j = P_j (see page 231 in [13]). So

    (P_{j+1} − P_j)(P_{j+1} − P_j) = P_{j+1} − P_j.

This says that P_{j+1} − P_j is also idempotent. By Lemma 1, P_{j+1} − P_j has rank 1, so by [1]

    y_j = u^T (P_{j+1} − P_j) u ∼ χ²₁ for all j ≥ p    (1.44)

where χ²₁ is the chi-square distribution with 1 degree of freedom.
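The projection-difference properties used above are easy to verify numerically. The sketch below (arbitrary random h_i, assuming NumPy) checks that P_{j+1} − P_j is idempotent with rank 1, which is what makes u^T(P_{j+1} − P_j)u a χ²₁ quadratic form for white Gaussian u.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
H = rng.standard_normal((N, 3))   # generic columns h1, h2, h3

def proj(Hj):
    # orthogonal projection matrix onto the column space of Hj
    return Hj @ np.linalg.solve(Hj.T @ Hj, Hj.T)

for j in (1, 2):
    D = proj(H[:, :j + 1]) - proj(H[:, :j])
    assert np.allclose(D @ D, D)                      # idempotent
    assert np.linalg.matrix_rank(D, tol=1e-10) == 1   # rank one (Lemma 1)
```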
We still need to show the independence of the y_j's for all j ≥ p. Let z_j = (P_{j+1} − P_j)u. Since z_j is a linear transform of u, z_j is also Gaussian with zero mean. For any l > 0, we will show next that z_j and z_{j+l} are independent for any j ≥ p. Let

    [z_j ; z_{j+l}] = [P_{j+1} − P_j ; P_{j+l+1} − P_{j+l}] u,

whose covariance matrix is

    C_{z_j, z_{j+l}} = [ (P_{j+1} − P_j)(P_{j+1} − P_j)         (P_{j+1} − P_j)(P_{j+l+1} − P_{j+l})
                         (P_{j+l+1} − P_{j+l})(P_{j+1} − P_j)   (P_{j+l+1} − P_{j+l})(P_{j+l+1} − P_{j+l}) ].

By the property of P_j that P_m P_{m+n} = P_m for n > 0 [13], we have

    (P_{j+1} − P_j)(P_{j+l+1} − P_{j+l}) = P_{j+1} P_{j+l+1} − P_j P_{j+l+1} − P_{j+1} P_{j+l} + P_j P_{j+l}
                                         = P_{j+1} − P_j − P_{j+1} + P_j
                                         = 0_{N×N}.

This shows that z_j and z_{j+l} are uncorrelated and hence independent by Gaussianity. Also by Gaussianity, pairwise independence leads to the independence of all the z_j's. Since y_j = z_j^T z_j, the y_j's are independent for all j ≥ p.
Appendix 1B - Derivation of the Distribution of the y_j's for j < p

If H_p is true, for j < p we still have

    y_j = (H_p θ_p + w)^T (P_{j+1} − P_j)(H_p θ_p + w) / σ².    (1.45)

But when j < p,

    (P_{j+1} − P_j) H_p θ_p ≠ 0

so we cannot reduce (1.45) as in (1.43). However, we can write y_j as

    y_j = ( H_p θ_p/σ + u )^T (P_{j+1} − P_j) ( H_p θ_p/σ + u )
        = ( (P_{j+1} − P_j) H_p θ_p/σ + z_j )^T ( (P_{j+1} − P_j) H_p θ_p/σ + z_j )    (1.46)

where u = w/σ and z_j = (P_{j+1} − P_j)u as in Appendix 1A. Since we have shown that z_j^T z_j ∼ χ²₁, we have

    y_j ∼ χ'²₁(λ_j)    (1.47)

where χ'²₁(λ_j) is the noncentral chi-square distribution with 1 degree of freedom and noncentrality parameter λ_j = ‖(P_{j+1} − P_j) H_p θ_p‖²/σ² = (H_p θ_p)^T (P_{j+1} − P_j) H_p θ_p/σ² > 0. If we let H_{j+1,p} = [h_{j+1}, h_{j+2}, . . . , h_p] and θ_{j+1,p} = [θ_{j+1}, θ_{j+2}, . . . , θ_p]^T, then since (P_{j+1} − P_j) H_j θ_j = 0, we have

    λ_j = (H_j θ_j + H_{j+1,p} θ_{j+1,p})^T (P_{j+1} − P_j)(H_j θ_j + H_{j+1,p} θ_{j+1,p}) / σ²
        = (H_{j+1,p} θ_{j+1,p})^T (P_{j+1} − P_j) H_{j+1,p} θ_{j+1,p} / σ²

So λ_j does not depend on the first j θ_i's in θ_p. Since the proof of the independence of the z_j's in Appendix 1A does not depend on whether j ≥ p or j < p, the z_j's are independent for all j. Hence so are the y_j's.
Appendix 1C - Proof of Theorem 3

Theorem 3 (l_{G_j}(x) unbounded in probability for j ≥ p). There exists a sequence {N_n} with N_n → ∞ such that Pr{l_{G_j}(x) > N_n} → 1 as σ_n → 0 for j ≥ p.

First we will prove the next lemma.

Lemma 2. Under the true model, s_p(θ̂_p) →P s_p(θ_p) as σ_n → 0. That is, for any ε > 0, Pr{ ‖s_p(θ̂_p) − s_p(θ_p)‖ < ε } → 1 as σ_n → 0.

Proof. First we will introduce the work in [18], which considers the characteristics of the MLE under high SNR. Let

    f(θ_p, u) = [f₁(θ_p, u), . . . , f_p(θ_p, u)]^T = ∂p_U( (x(u) − s_p(θ_p))/σ_n ) / ∂θ_p

where we consider x as a function of u. Then the MLE of θ_p is found by solving

    f(θ_p, u) = 0

If the f_i(θ_p, u) for i = 1, . . . , p are differentiable functions on a neighborhood of a point (θ_p⁰, u⁰) with f(θ_p⁰, u⁰) = 0, and the Jacobian matrix Φ with respect to u is nonsingular at (θ_p⁰, u⁰), then by the implicit function theorem, we have

    (θ̂_p − θ_p)/σ_n →P −Φ⁻¹Ψu    (1.48)

where Φ and Ψ are deterministic matrices with

    Φ = [ ∂f/∂u₁ |_{(θ_p⁰, u⁰)}, . . . , ∂f/∂u_N |_{(θ_p⁰, u⁰)} ]
    Ψ = [ ∂f/∂θ₁ |_{(θ_p⁰, u⁰)}, . . . , ∂f/∂θ_p |_{(θ_p⁰, u⁰)} ]
Although only Gaussian noise is considered in [18], (1.48) still holds for non-
Gaussian noise by the implicit function theorem.
It has been shown in [14] that if {X_n} is a sequence of random variables that converges to X in probability and {c_n} is a deterministic sequence that converges to c, then c_n X_n →P cX. As a result of (1.48), since σ_n → 0, we have

    θ̂_p − θ_p = σ_n (θ̂_p − θ_p)/σ_n →P 0    (1.49)

Then by Assumption 1), ‖s_p(θ̂_p) − s_p(θ_p)‖ →P 0. This completes the proof of Lemma 2.
When the true model is H_p, for j > p, the MLE for θ_j is still under the true model if we write θ_j as θ_j = [θ_p^T, 0, . . . , 0]^T. So from (1.49), we have θ̂_j →P θ_j, i.e.,

    [θ̂₁, θ̂₂, . . . , θ̂_j]^T →P [θ₁, . . . , θ_p, 0, . . . , 0]^T

Hence Lemma 2 still holds for j > p, and it extends to

    ‖s_j(θ̂_j) − s_j(θ_j)‖ →P 0 for all j ≥ p    (1.50)
So we have

    l_{G_j}(x) = 2 ln [ p_U( (x − s_j(θ̂_j))/σ_n ) (1/σ_n) / ( p_U( x/σ_n ) (1/σ_n) ) ]
               = 2 ln [ p_U( (s_j(θ_j) + σ_n u − s_j(θ̂_j))/σ_n ) / p_U( (s_j(θ_j) + σ_n u)/σ_n ) ]    (1.51)
Since p_U(u) is a well-defined PDF and hence has a valid cumulative distribution function (CDF), we have

    Pr{ ‖u‖ < l_n } → 1    (1.52)

for any sequence {l_n} with l_n → ∞. Let A_n = {u : ‖s_j(θ̂_j) − s_j(θ_j)‖ < ε} and B_n = {u : ‖u‖ < l_n} for each σ_n. Since l_n and ε are arbitrary, we let l_n = ‖s_j(θ_j)‖/(3σ_n) and ε = ‖s_j(θ_j)‖/6. Then for each u ∈ A_n ∩ B_n, we have

    ‖s_j(θ_j) + σ_n u − s_j(θ̂_j)‖/σ_n ≤ ‖s_j(θ_j) − s_j(θ̂_j)‖/σ_n + ‖u‖ < ε/σ_n + l_n    (1.53)
Hence

    ‖s_j(θ_j) + σ_n u‖/σ_n − ‖s_j(θ_j) + σ_n u − s_j(θ̂_j)‖/σ_n
        > ( ‖s_j(θ_j)‖/σ_n − ‖u‖ ) − ( ε/σ_n + l_n )
        > ‖s_j(θ_j)‖/σ_n − 2l_n − ε/σ_n
        = ‖s_j(θ_j)‖/(6σ_n) → ∞    (1.54)

as σ_n → 0. By Assumption 2), this shows that l_{G_j}(x) → ∞ as σ_n → 0 for each u ∈ A_n ∩ B_n. Let C = {u : l_{G_j}(x) → ∞ as σ_n → 0}. The previous analysis shows that A_n ∩ B_n ⊆ C. By (1.50) and (1.52), Pr{A_n} → 1 and Pr{B_n} → 1 as σ_n → 0. Hence Pr{A_n ∩ B_n} → 1. Note that A_n ∩ B_n ⊆ C, and thus Pr{C} = 1. From this "almost sure" event, the "in probability" statement follows, i.e., for any ε > 0 and any M, there exists an integer K such that Pr{l_{G_j}(x) ≤ M} < ε for all n ≥ K. Next, the existence of a sequence {N_n} with N_n → ∞ such that Pr{l_{G_j}(x) > N_n} → 1 as σ_n → 0 for j ≥ p will be shown by constructing such a sequence {N_n}.

Let {M_m} be any positive sequence that goes to ∞. For each M_m, there exists K_m such that Pr{l_{G_j}(x) ≤ M_m} < ε for all n ≥ K_m. We construct {N_n} as

    {N_n} = 0, . . . , 0, M₁, . . . , M₁, M₂, . . .

where the first term is 0, M₁ first appears as the K₁th term, and M₂ first appears as the K₂th term. So N_n → ∞ since M_m → ∞. For any n, we can find an m such that K_m ≤ n < K_{m+1}, and N_n = M_m by the above construction of {N_n}. Hence Pr{l_{G_j}(x) ≤ N_n} = Pr{l_{G_j}(x) ≤ M_m} < ε for all n. This proves the existence of a sequence {N_n} with N_n → ∞ such that Pr{l_{G_j}(x) > N_n} → 1 as σ_n → 0 for j ≥ p.
Appendix 1D - Proof of Theorem 4

Theorem 4 (l_{G_j}(x) − l_{G_p}(x) bounded in probability for j > p). For any sequence {m_n}, Pr{l_{G_j}(x) − l_{G_p}(x) < m_n} → 1 as m_n → ∞ for j > p.

For j > p,

    l_{G_j}(x) − l_{G_p}(x) = 2 ln p_U( (s_j(θ_j) + σ_n u − s_j(θ̂_j))/σ_n ) − 2 ln p_U( (s_p(θ_p) + σ_n u − s_p(θ̂_p))/σ_n )    (1.55)

Note that we can consider θ_j as θ_j = [θ_p^T, 0, . . . , 0]^T, and so we have s_j(θ_j) = s_p(θ_p). By (1.48) and Assumption 1),

    ‖s_j(θ̂_j) − s_p(θ̂_p)‖/σ_n ≤ ‖s_j(θ̂_j) − s_j(θ_j)‖/σ_n + ‖s_p(θ_p) − s_p(θ̂_p)‖/σ_n
        ≤ K‖θ̂_j − θ_j‖/σ_n + K‖θ̂_p − θ_p‖/σ_n →P 2K‖Φ⁻¹Ψu‖    (1.56)
By the Lipschitz continuity of ln p_U(u), there exists L such that (1.55) can be written as

    l_{G_j}(x) − l_{G_p}(x) = | 2 ln p_U( (s_j(θ_j) + σ_n u − s_j(θ̂_j))/σ_n ) − 2 ln p_U( (s_p(θ_p) + σ_n u − s_p(θ̂_p))/σ_n ) |
        ≤ 2L ‖s_j(θ̂_j) − s_p(θ̂_p)‖/σ_n
        ≤ 2LK‖θ̂_j − θ_j‖/σ_n + 2LK‖θ̂_p − θ_p‖/σ_n →P 4LK‖Φ⁻¹Ψu‖    (1.57)

where the second inequality is by (1.56). Similar to (1.52), we have

    Pr{ ‖Φ⁻¹Ψu‖ < l_n } → 1    (1.58)

and hence

    Pr{ l_{G_j}(x) − l_{G_p}(x) < 4LK l_n } → 1    (1.59)

as l_n → ∞. Since {l_n} is an arbitrary sequence with l_n → ∞, we have Pr{l_{G_j}(x) − l_{G_p}(x) < m_n} → 1 as m_n → ∞ for any sequence {m_n}.
Appendix 1E - Proof of Theorem 5

Theorem 5 (l_{G_p}(x) − l_{G_j}(x) unbounded in probability for j < p). There exists a sequence {M_n} with M_n → ∞ such that Pr{l_{G_p}(x) − l_{G_j}(x) > M_n} → 1 as σ_n → 0 for j < p.

For j < p,

    l_{G_p}(x) − l_{G_j}(x) = 2 ln p_U( (s_p(θ_p) + σ_n u − s_p(θ̂_p))/σ_n ) − 2 ln p_U( (s_p(θ_p) + σ_n u − s_j(θ̂_j))/σ_n )    (1.60)

Note that we do not have s_j(θ_j) = s_p(θ_p) as in the j > p case, because the model is misspecified when j < p. This means that we cannot find θ_j such that s_j(θ_j) = s_p(θ_p) or such that s_j(θ_j) is arbitrarily close to s_p(θ_p). So we assume that there exists δ > 0 such that ‖s_j(θ_j) − s_p(θ_p)‖ > δ for all θ_j. Then the rest of the proof follows similarly to the proof of Theorem 3 in Appendix 1C using Lemma 2 and Assumption 2).
List of References

[1] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.

[2] H. Akaike, "A new look at the statistical model identification," IEEE Trans. Autom. Control, vol. 19, pp. 716-723, Dec. 1974.

[3] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465-471, 1978.

[4] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.

[5] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333-345, Jan. 2005.

[6] P. Stoica and Y. Selen, "Model-order selection: A review of information criterion rules," IEEE Signal Process. Mag., vol. 21, pp. 36-47, Jul. 2004.

[7] M. Wax and T. Kailath, "Detection of signals by information theoretic criteria," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, pp. 387-392, Apr. 1985.

[8] C. Xu and S. Kay, "Source enumeration via the EEF criterion," IEEE Signal Process. Lett., vol. 15, pp. 569-572, 2008.

[9] R. Fisher, "On the mathematical foundations of theoretical statistics," Philos. Trans. Royal Soc. London, vol. 222, no. 594-604, pp. 309-368, Jan. 1922.

[10] R. Kass and P. Vos, Geometrical Foundations of Asymptotic Inference. Wiley, 1997.

[11] W. Xu and M. Kaveh, "Analysis of the performance and sensitivity of eigendecomposition-based detectors," IEEE Trans. Signal Process., vol. 43, pp. 1413-1426, Jun. 1995.

[12] A. Liavas and P. Regalia, "On the behavior of information theoretic criteria for model order selection," IEEE Trans. Signal Process., vol. 49, pp. 1689-1695, Aug. 2001.

[13] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[14] E. Lehmann, Elements of Large-Sample Theory. Springer, 1998.

[15] G. Grimmett and D. Stirzaker, Probability and Random Processes, 3rd ed. Oxford University Press, 2001.

[16] W. Rudin, Functional Analysis. McGraw-Hill, 1991.

[17] K. Eriksson, D. Estep, and C. Johnson, Applied Mathematics, Body and Soul: Calculus in Several Dimensions. Springer, 2004.

[18] A. Renaux, P. Forster, E. Chaumette, and P. Larzabal, "On the high-SNR conditional maximum-likelihood estimator full statistical characterization," IEEE Trans. Signal Process., vol. 54, pp. 4840-4843, Dec. 2006.
MANUSCRIPT 2
Autoregressive Modeling of Raman Spectra for Detection andClassification of Surface Chemicals
Abstract
This paper considers the problem of detecting and classifying surface chemicals
by analyzing the Raman spectrum of scattered laser pulses received from a
moving vehicle. An autoregressive (AR) model is proposed to model the spectrum
and a two-stage (detection followed by classification) scheme is used to control the
false alarm rate. The detector decides whether the received spectrum is from pure
background only or background plus some chemicals. The classification is made
among a library of possible chemicals. The problem of mixtures of chemicals is also
addressed. Simulation results using field background data have shown excellent
performance of the proposed approach when the signal-to-noise ratio (SNR) is at
least -10 dB.
2.1 Introduction
Raman spectroscopy has been widely used in detection and classification of
chemical agents in the presence of a material, termed the background [1, 2, 3, 4].
Many spectral data analysis techniques have been developed for this application.
Supervised approaches such as regression analysis [5] and the generalized likelihood
ratio test (GLRT) [6, 7] can be used when the background spectrum is known
since this is a standard subspace detection problem. Unsupervised approaches
such as independent component analysis (ICA) [8], canonical correlation [9, 10]
and a correlation scheme based on a Gaussian filter [11] can be used when the
background spectrum is unknown or varies due to noise.
In this paper, we study the unsupervised problem of detecting and classifying
surface chemicals based on Raman spectral returns received from a moving vehicle.
The Raman spectral data are collected by the laser interrogation of surface agents
(LISA) system developed by ITT Industries. LISA provides standoff detection and
identification of surface-deposited chemical agents based on short-range Raman
sensing (see [12] for more information about the system). This detection and
classification problem is complicated by many factors. Some of these are:
1. A background surface whose spectrum is unknown a priori and is changing with
time.
2. Target chemicals that, even if present, are presented to the detector only a
fraction of the time. This is due to an uneven and incomplete distribution
of deposited surface chemicals.
3. The energy in the target return varying with the amount of chemical, the type
of chemical, and the range to the chemical.
4. The possible presence of more than one chemical, i.e., a chemical mixture.
5. Impurities in the background that present themselves as unknown chemical
deposits.
In order to design algorithms that are able to handle this multitude of unknown
situations we rely heavily on adaptive processing. The approaches to be described
take advantage of any information that is known and that can reasonably be as-
sured to be valid in an operational environment. For the remaining uncertainties
the algorithms will estimate on-line the information necessary for their successful
implementation. We will first discuss detection and classification (identification)
of a single chemical from a library of possible chemicals. Next, we will extend the
results to the mixture problem, i.e., when one, two, or possibly three chemical
targets may be present in a single scattered spectrum.
The paper is organized as follows. Section 2.2 describes the problem and
the two-step detection followed by classification scheme that is proposed. An AR
model that models the Raman spectrum is described in Section 2.3. In Section
2.4 we derive the detection test statistic and the overall algorithm in order to
maintain a low false alarm rate which [13] did not consider. The experimental
detection performance for field background data is shown in Section 2.5. Simulation
results in Section 2.6 show that a very low false alarm rate can be obtained. The
classification algorithm is derived in Section 2.7. Here we extend the case of a
single chemical present to mixtures of chemicals, which was not treated in [13]. In
Section 2.8, we present the classification performance for field background data.
Finally, Section 2.9 draws the conclusion.
2.2 Problem Statement and Rationale of Approach
Consider the case when a moving vehicle is equipped with a Raman spectroscopy unit that probes the ground surface at short time intervals (40 milliseconds in our case). A Raman spectrum, or a pulse I_i(F), is received at the ith probe, and consecutive Raman spectra of the road surface are received as the vehicle moves. Each Raman spectrum is an N_f × 1 vector given at equally spaced wavenumbers F. We assume that the background is relatively stationary in composition, that is, it is a road of the same type for a certain time interval. Some of M possible target chemicals may also be present on the background. As a result, the received spectrum at the ith probe could be from background plus noise, or background plus noise and one or several chemicals. We wish to design a testing procedure that decides if no chemicals are present or, if chemicals are present, which chemicals are deposited on the background.
Current approaches to the detection problem have been plagued with high
false alarm rates. Indeed for any operational system the false alarm rate must
be controlled or else the system is deemed unreliable and cannot be used. Nearly
identical considerations arise in sonar [14] and radar [15]. It has been generally
accepted, and this philosophy is reflected in the design of these systems, that one
first performs a decision of either a detection or no detection and then follows this
with a classification. In this way the false alarm rate can be controlled since the
initial step does not consider which target may be present but only that some
target is present. This initial binary hypothesis test then allows one to control
the false alarm rate and to reduce it to a reasonable level. This is in contrast to
attempting to decide whether no target is present versus a subset of M possible
targets. The latter approach requires one to formulate a decision strategy that can
decide among multiple hypotheses, for which an error rate or false alarm rate will
be much higher.
2.3 Spectral Modeling
As mentioned in Section 2.1 the background spectrum is unknown and can
change in time. For the algorithms to accommodate this uncertainty, it is necessary
to estimate the spectrum on-line. To do so, we use a spectral estimator that can estimate the spectrum accurately from a single pulse and with computation modest enough to allow a real-time implementation: the autoregressive (AR) spectral estimator [16].
Similar approaches have been used in radar [17] and sonar [18]. To implement
this estimator it is assumed that spectral data from the output of the Raman
spectroscopy unit is available over a spatial frequency band, i.e., wavenumber band,
which by letting F denote spatial frequency, extends from F = 0 to F = Fc, the
cutoff frequency. This spectral data I(F ) is also called the periodogram in analogy
with Fourier based methods of spectral estimation. Given I(F ) for 0 ≤ F ≤ Fc,
the AR spectral estimate is found as follows, with details given in [16]:
1. Assume a model order, denoted by p, for the AR spectral estimate. This order
is an integer, with smaller values preferred since it relates to the number of
parameters in the model and hence the number of unknowns to be estimated.
2. Based on I(F ) find the real-valued autocorrelation sequence, denoted as
{r[0], r[1], . . . , r[p]}, which is a sampled version (at a rate of 1/Δ samples
per sec) of the inverse continuous-time Fourier transform of I(F ) as
\[
r[k] = \int_0^{2F_c} I(F) \exp(j2\pi F k \Delta)\, dF, \qquad k = 0, 1, \ldots, p \tag{2.1}
\]
where Δ is the interval in time between successive samples of the autocorrela-
tion function. The sample interval should be chosen to be less than 1/(2Fc).
Note that since the spectral data I(F) is one-sided, we
let I(F) = I(2Fc − F) for Fc ≤ F ≤ 2Fc. In this way I(F) can be viewed
as one period of a periodic spectrum, and therefore r[k] becomes real-valued.
The implied sampling rate is then 2Fc.
3. Solve the Yule-Walker equations to estimate the AR filter parameters
{a[1], a[2], . . . , a[p]} from
\[
\begin{bmatrix}
r[0] & r[-1] & \cdots & r[-(p-1)] \\
r[1] & r[0] & \cdots & r[-(p-2)] \\
\vdots & \vdots & \ddots & \vdots \\
r[p-1] & r[p-2] & \cdots & r[0]
\end{bmatrix}
\begin{bmatrix} a[1] \\ a[2] \\ \vdots \\ a[p] \end{bmatrix}
= -\begin{bmatrix} r[1] \\ r[2] \\ \vdots \\ r[p] \end{bmatrix}
\tag{2.2}
\]
and then use these estimated filter parameters to find the excitation noise
variance \(\sigma_u^2\) as
\[
\sigma_u^2 = r[0] + \sum_{k=1}^{p} a[k]\, r[-k]. \tag{2.3}
\]
Note that the matrix is symmetric and Toeplitz since r[−k] = r[k].
4. Once the parameters \(\{a[1], a[2], \ldots, a[p], \sigma_u^2\}\) have been found, the estimated AR
spectrum is
\[
P(F) = \frac{\sigma_u^2 \Delta}{\left|1 + a[1]\exp(-j2\pi F\Delta) + \cdots + a[p]\exp(-j2\pi pF\Delta)\right|^2} \tag{2.4}
\]
for 0 ≤ F ≤ Fc.
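The four steps above can be sketched numerically. The following is a minimal illustration (a hypothetical helper, not the code used in this work) that forms r[k] from the mirrored one-sided periodogram, solves the Toeplitz Yule-Walker system, and evaluates (2.4); it assumes a uniform frequency grid and the implied choice Δ = 1/(2Fc), for which the mirrored exponentials pair into cosines.

```python
import numpy as np

def ar_spectrum_from_periodogram(I, Fc, p):
    """Estimate an AR(p) spectrum from one-sided periodogram samples
    I(F_k) on [0, Fc).  Illustrative sketch of steps 1-4."""
    Nf = len(I)
    F = np.linspace(0.0, Fc, Nf, endpoint=False)   # uniform frequency grid
    delta = 1.0 / (2.0 * Fc)                       # implied sample interval
    dF = Fc / Nf
    # Step 2: r[k] from (2.1); with I(F) mirrored about Fc the complex
    # exponentials pair into cosines, so r[k] is real.
    r = np.array([2.0 * np.sum(I * np.cos(2.0*np.pi*F*k*delta)) * dF
                  for k in range(p + 1)])
    # Step 3: Yule-Walker equations (2.2); the matrix is symmetric Toeplitz.
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:p+1])
    sigma2_u = r[0] + np.dot(a, r[1:p+1])          # (2.3), using r[-k] = r[k]
    # Step 4: AR spectrum (2.4) on the same grid.
    A = 1.0 + sum(a[k-1]*np.exp(-2j*np.pi*F*k*delta) for k in range(1, p+1))
    P = sigma2_u * delta / np.abs(A)**2
    return P, a, sigma2_u
```

A quick sanity check: for a flat periodogram the estimator should return an essentially flat spectrum at the same level.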
Note that this procedure estimates the AR spectrum for a given AR model order
p. However in practice, we also need to estimate the appropriate order p since a
large p will cause overfitting and a small p will cause underfitting. We next assume
that the frequencies have been normalized to discrete frequencies as f = FΔ so
that 0 ≤ f ≤ 1 and hence digital techniques can be used. Clearly, the upper cutoff
frequency Fc corresponds to f = 1/2. The AR model order can be estimated as
follows (see Appendix 2.9 for the derivation):
1. For the spectral data I(f), and for each model order p, estimate the AR filter
parameters or {a[1], a[2], . . . , a[p]} using (2.1) and (2.2), and then estimate
the AR filter frequency response as
Ap(f) = 1 + a[1] exp(−j2πf) + · · · + a[p] exp(−j2πfp) (2.5)
2. Calculate the generalized likelihood ratio \(l_{G_p}(x)\) for each model order p by
\[
l_{G_p}(x) = -N \ln \frac{\sum_{k=1}^{N_f} |A_p(f_k)|^2 I(f_k)\,\Delta f}{\sum_{k=1}^{N_f} I(f_k)\,\Delta f} \tag{2.6}
\]
where N is the unknown number of samples in the time domain since x is
fictitious. We will use N = 2Nf, which produces good results.
3. Choose the model order with the largest of the following:
\[
\mathrm{EEF}(p) =
\begin{cases}
l_{G_p}(x) - p\left[\ln\!\left(\dfrac{l_{G_p}(x)}{p}\right) + 1\right] & \text{if } \dfrac{l_{G_p}(x)}{p} > 1 \\[2mm]
0 & \text{if } \dfrac{l_{G_p}(x)}{p} \le 1
\end{cases} \tag{2.7}
\]
This is the exponentially embedded families (EEF) model order selection
criterion, which has been recently proposed [19].
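The order-selection loop can be sketched as below: a simplified illustration on normalized frequencies f = FΔ (assumed sampled on a uniform grid in (0, 1/2]), re-fitting the Yule-Walker equations for each candidate order and applying (2.6) and (2.7). The function name and grid convention are ours, not from the original.

```python
import numpy as np

def eef_model_order(I, p_max, N=None):
    """Select the AR order via the EEF rule (2.5)-(2.7).
    I: periodogram samples on Nf normalized frequencies in (0, 1/2]."""
    Nf = len(I)
    f = (np.arange(Nf) + 1) / (2.0 * Nf)       # normalized frequency grid
    N = 2 * Nf if N is None else N             # fictitious time-domain count
    df = 1.0 / (2.0 * Nf)
    # real autocorrelation of the mirrored one-sided periodogram
    r = np.array([2.0 * np.sum(I * np.cos(2.0*np.pi*f*k)) * df
                  for k in range(p_max + 1)])
    denom = np.sum(I) * df
    best_p, best_eef = 0, -np.inf
    for p in range(1, p_max + 1):
        R = np.array([[r[abs(i-j)] for j in range(p)] for i in range(p)])
        a = np.linalg.solve(R, -r[1:p+1])      # Yule-Walker (2.2)
        # AR filter response (2.5) and GLR statistic (2.6)
        Ap = 1.0 + sum(a[k-1]*np.exp(-2j*np.pi*f*k) for k in range(1, p+1))
        lG = -N * np.log(np.sum(np.abs(Ap)**2 * I) * df / denom)
        eef = lG - p*(np.log(lG/p) + 1.0) if lG/p > 1.0 else 0.0   # (2.7)
        if eef > best_eef:
            best_p, best_eef = p, eef
    return best_p
```

On a noiseless periodogram generated by a strongly resonant AR(2) model, the penalty term keeps the selected order near the true one.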
As an example, using an estimated model order of p = 40 and a single pulse
from a background of asphalt, the AR spectral estimate and original periodogram
data are shown in Figure 2.1. Note that the AR spectral estimate is able to model
the general shape of the data spectrum as well as the prominent peaks and valleys.
Additionally, if an artificial signal is included in the spectral data, then the AR
spectral estimate (with a different estimated AR model order of p = 44) appears
as in Figure 2.2. Similar results have been obtained for other surfaces such as
gravel and grass. What this says is that the AR spectral model with appropriate
order p is adequate for representing the main details of a spectrum using Raman
spectroscopy. This includes the cases of background only being present as well as
a target chemical deposited on a background. Consequently, for the development
of signal processing algorithms it allows us to consider the spectral data as having
been obtained from a hypothetical AR time series that has been Fourier transformed
and magnitude-squared. If we further assume that this hypothetical time series is
Gaussian, then many of the powerful techniques of statistical signal processing
[20], [6] can be brought to bear upon this problem. As we will see later, the Gaus-
sian assumption is not entirely accurate but algorithms based on it still perform
exceptionally well.
2.4 Detection Algorithm
The detection algorithm consists of two parts. This is necessary to avoid a
high false alarm rate as described previously. It is assumed that when a chemical
is present it must be present in a certain percentage of the returned pulses. A
detector based upon a single pulse with a reasonably low false alarm rate would
require a high threshold and hence a poorer probability of detection. Hence, the
chemical present condition is defined to be in effect when a certain percentage
of successive pulse returns indicate a chemical. The pulse returns that do not
indicate a chemical, when indeed a chemical present condition is in effect, result
from the absence of a chemical in the illuminated area of the laser imaging
Figure 2.1. AR spectral estimate and background spectral data for asphalt surface (Fc = 3300, AR order = 40).
Figure 2.2. AR spectrum for asphalt surface plus an artificial signal (Fc = 3300, AR order = 44).
system. Thus, we have designed a detection system that
1. Examines each successive pulse for a threshold crossing of a test statistic.
2. Registers a chemical present condition when a suitable number of threshold
crossings are present over a fixed interval of time.
We next examine each of these procedures in detail.
2.4.1 Test Statistic
The test statistic is computed for each pulse return, i.e., sequentially in time. To
estimate the background, we will need spectral data from the MB previous pulse
returns that do not have a threshold crossing. The choice of MB is made to ensure
that the background has not changed over this time period. For example, if MB =
25, then for a laser firing rate of 25 pulses/sec, we have effectively assumed that
the background spectral shape is stationary over the time interval of MB/25 = 1
second. Analysis of field data supports this assumption. However, it has also
been found that although the background spectral shape is stationary over a short
period of time, its overall level may change significantly from pulse to pulse. This
necessitates basing any test statistic on the shape of the spectrum rather than its
total power. This can be done by assuming for the background a fixed set of
AR filter parameters from pulse to pulse but with a time varying excitation noise
variance. Also, if some of the previous pulse returns have threshold crossings of
the test statistic, then we exclude them from the MB pulses used in estimating the
background. The test statistic is computed as follows (see Appendix 2.9 for the
derivation and explicit statistical assumptions):
1. Using the previous MB pulses that do not have threshold crossings, compute
the average Raman spectrum. Because the overall background power level
can change from pulse to pulse we must first normalize the power before we
average. To do so we set the total power of each pulse to one by scaling
appropriately. Let IBi(f) represent the Raman spectrum for the ith pulse
after power normalization. Then, we compute the sample average of the
background spectral data as
\[
I_B(f_k) = \frac{1}{M_B} \sum_{i=1}^{M_B} I_{B_i}(f_k) \tag{2.8}
\]
for k = 1, 2, . . . , Nf , where Nf is the number of spectral data points of the
Raman spectrum. Also, IBi(fk) is the Raman spectral data for the ith pulse
at frequency fk, assuming that it previously did not produce a threshold
crossing.
2. Estimate the AR model order p using the procedure described in (2.5), (2.6)
and (2.7). For this estimated order p, use the procedure described in (2.1)
and (2.2) to find the AR filter parameters of IB(fk). These are denoted by
{aB[1], aB[2], . . . , aB[p]}, where the subscript refers to the background
spectral model. Note that these may change in time and therefore will have
to be updated periodically. The estimated background AR filter frequency
response then becomes
AB(f) = 1 + aB[1] exp(−j2πf) + · · · + aB[p] exp(−j2πfp) (2.9)
3. Using the Raman spectrum for the return pulse under consideration, which
we denote as IT (f) and where T refers to a potential target, estimate the
AR model order q using the procedure described in (2.5), (2.6) and (2.7).
Compute the AR parameters, again using (2.1) and (2.2). Note that power
normalization is not needed since only the AR filter parameters are esti-
mated. This produces the AR filter parameters or {aT [1], aT [2], . . . , aT [q]}
and the estimated AR filter frequency response for the current pulse under
consideration as
AT(f) = 1 + aT[1] exp(−j2πf) + · · · + aT[q] exp(−j2πfq) (2.10)
4. The generalized likelihood ratio test (GLRT) statistic is finally computed as
\[
T_D = \ln \frac{\sum_{k=1}^{N_f} |A_B(f_k)|^2 I_T(f_k)}{\sum_{k=1}^{N_f} |A_T(f_k)|^2 I_T(f_k)} \tag{2.11}
\]
which yields values TD ≥ 0. Note that power normalization is not required for
IT(f) since TD does not depend on the scaling of IT(f).
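The statistic (2.11) can be sketched as below, with illustrative helper names; a_B and a_T are the background and current-pulse AR coefficient vectors, and the spectra are assumed sampled on a common normalized-frequency grid.

```python
import numpy as np

def glrt_statistic(I_T, a_B, a_T):
    """Sketch of the GLRT statistic (2.11).  I_T: Raman spectrum of the
    current pulse at Nf normalized frequencies; a_B, a_T: AR coefficients
    for the background model and the current-pulse model."""
    Nf = len(I_T)
    f = (np.arange(Nf) + 1) / (2.0 * Nf)
    def A(f, a):                       # AR filter frequency response
        return 1.0 + sum(a[k-1]*np.exp(-2j*np.pi*f*k)
                         for k in range(1, len(a)+1))
    num = np.sum(np.abs(A(f, a_B))**2 * I_T)   # background whitening power
    den = np.sum(np.abs(A(f, a_T))**2 * I_T)   # current-pulse whitening power
    return np.log(num / den)
```

By construction the statistic is antisymmetric in the two filters, and it is exactly zero when the two AR models coincide.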
This test statistic, which may be viewed as an anomaly detector, will indicate
when the return from any pulse produces a spectrum significantly different from
the background. No information, however, is obtained about the type of departure
and hence of a particular chemical. A threshold crossing, which occurs if TD > γ
for a threshold γ, indicates that the spectrum of the current pulse does not match
the background spectrum. As an example, impurities in the surface will also cause
a threshold crossing. Hopefully, however, these will be isolated occurrences and not
produce a chemical present condition. If this is not the case, then a classification
indicating impurities will be needed.
2.4.2 Overall Detection Algorithm
The test statistic given by (2.11) is computed for each pulse. A threshold
crossing indicates a possible chemical detection in that pulse. In order to declare a
chemical present, however, we expect a certain percentage of the pulse returns to
have a chemical in them. This percentage is currently set to 10%. For example, if
a chemical is present, then for 100 pulses, we expect 10 or more of them to produce
threshold crossings, assuming the test statistic always produces a threshold crossing
when a chemical is present in the pulse return. The remaining 90 test statistics
will not have a threshold crossing since they are based on data for which the laser
did not illuminate the chemical, as explained previously. With this assumption we
can now set the desired threshold for TD. We assume that a chemical is present if
10% or more of the test statistics in a given block of pulse data produce threshold
crossings. These threshold crossings need not be sequential, but can be scattered
anywhere within the block. For example, if the block consists of 100 successive
pulse returns, then a chemical is declared to be present if at least 10 of the test
statistics produce a threshold crossing. This block of 100 successive pulses is
assumed to “slide along” in time. For the example described below, the blocks are
overlapped by 50%, although other overlaps can be used.
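The block-level decision described above can be illustrated with a hypothetical helper (the name and defaults are ours; the text uses 100-pulse blocks, 50% overlap, and a 10% crossing fraction):

```python
import numpy as np

def chemical_present(crossings, block=100, overlap=0.5, frac=0.10):
    """Declare a chemical present in a block if at least frac*block
    threshold crossings occur; blocks slide with the given overlap.
    crossings: per-pulse boolean (or 0/1) array of threshold crossings."""
    crossings = np.asarray(crossings, dtype=int)
    step = int(block * (1.0 - overlap))           # 50 pulses for 50% overlap
    hits = []
    for start in range(0, len(crossings) - block + 1, step):
        count = crossings[start:start + block].sum()
        hits.append(count >= frac * block)        # 10 or more out of 100
    return np.array(hits)
```

The crossings need not be sequential; any 10 scattered within a 100-pulse block trigger a declaration, exactly as in the text.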
Next in order to ensure a fixed false alarm rate, we need to set the threshold,
which we call γ, for TD appropriately. Thus, we must specify the probability of a
threshold crossing for TD, which is PFAp = Pr[TD > γ|H0], and is the probability
of false alarm for a single pulse. Then, once PFAp is found, the threshold γ can be
specified. It is shown in Appendix B how PFAp can be found so that the overall
false alarm rate is less than one false alarm per h hours.
First we find PFAb, which is the probability of a false alarm for a single block,
and is given as the solution of
\[
(1 - P_{FA_b})^L + L\, P_{FA_b} (1 - P_{FA_b})^{L-1} = 0.99 \tag{2.12}
\]
where L = 1800h is the number of blocks analyzed in h hours. This value for
L assumes a pulse rate of 25 per second, a block size of 100 pulses, and a 50%
block overlap. Each block is therefore 4 sec long with an overlap of 2 sec. This
can be solved for \(P_{FA_b}\). Once \(P_{FA_b}\) is found, we can determine \(P_{FA_p}\) by solving the
equation
\[
P_{FA_b} = 1 - \sum_{i=0}^{9} \binom{100}{i} P_{FA_p}^{\,i} \left(1 - P_{FA_p}\right)^{100-i}. \tag{2.13}
\]
This is just the probability that a false alarm occurs in a block, which is defined
as 10 or more threshold crossings out of 100 possible ones. For example, if h = 2
hours and therefore L = 3600, then from (2.12) we have that \(P_{FA_b} = 5 \times 10^{-5}\).
Using this value in the left-hand side of (2.13) we can solve for \(P_{FA_p}\), which is
about \(P_{FA_p} = 0.02\). The details are given in Appendix B. As a result we need to
find the threshold γ so that the probability that TD > γ for a single pulse is 0.02.
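The threshold-setting calculation can be reproduced numerically. The sketch below (illustrative parameter names) solves (2.12) for the per-block probability by bisection and then inverts (2.13) for the per-pulse probability; it recovers values close to the quoted \(P_{FA_b} \approx 5 \times 10^{-5}\) and \(P_{FA_p} \approx 0.02\).

```python
import math

def bisect(fn, lo, hi, iters=200):
    """Simple bisection root finder; assumes fn(lo) and fn(hi) bracket 0."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fn(lo) * fn(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def per_pulse_pfa(h=2.0, pulse_rate=25, block=100, overlap=0.5):
    """Solve (2.12) for the per-block false alarm probability, then (2.13)
    for the per-pulse one.  Defaults follow the numbers in the text."""
    L = int(h * 3600 * pulse_rate / (block * (1 - overlap)))  # blocks in h hours
    # (2.12): P(at most one block false alarm in L blocks) = 0.99
    g = lambda p: (1-p)**L + L*p*(1-p)**(L-1) - 0.99
    p_block = bisect(g, 1e-12, 1e-2)
    # (2.13): a block false alarm = 10 or more crossings out of 100 pulses
    def block_pfa(pp):
        return 1.0 - sum(math.comb(block, i) * pp**i * (1-pp)**(block-i)
                         for i in range(10))
    p_pulse = bisect(lambda pp: block_pfa(pp) - p_block, 1e-6, 0.1)
    return p_block, p_pulse
```

With h = 2 the block count is L = 3600, matching L = 1800h for the stated pulse rate and overlap.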
Theoretically, the GLRT statistic TD should have a chi-squared probability
density function (PDF) with q degrees of freedom [6], which would allow us to
determine γ. It has been found through analysis of field data, however, that this
theoretical PDF is not sufficiently accurate. (This is why, as mentioned earlier,
the Gaussian assumption for the fictitious time series is not always accurate.) As a
result, it is necessary to estimate the PDF of TD when background only is present
and then use this to set the threshold. It is conceivable that this threshold will
depend upon the background statistics, which are unknown. We next indicate how
this is done on-line.
Assume that we have I independent and identically distributed test statistics
\(T_{D_i}\) for i = 1, 2, . . . , I. We can estimate on-line the right-tail probability of the
PDF by using an AR model for the PDF [21], [22]. The procedure is as follows:
1. Normalize the test statistics by a constant equal to the maximum value of the
\(T_{D_i}\)'s. If we denote this as \(T_{max} = \max_{i=1,\ldots,I} T_{D_i}\), then we form the new data
set \(\tilde{T}_{D_i} = T_{D_i}/T_{max}\). Thus all values are now in the range [0, 1] since \(T_{D_i} \ge 0\).
2. Next we use the AR spectral estimator, but as a PDF estimator, with the
“estimated autocorrelation” sequence (actually the estimated characteristic
function)
\[
r[k] = \frac{1}{I} \sum_{i=1}^{I} \exp\!\left(j2\pi k \tilde{T}_{D_i}\right) \tag{2.14}
\]
for k = 0, 1, . . . , p, which will in general be complex-valued. The AR parameters
are estimated using (2.2) and (2.3) but with r[−k] = r∗[k]. The estimated
PDF of TD then becomes
\[
p_{T_D}(t) = \frac{\sigma_u^2}{\left|1 + a[1]\exp(-j2\pi t/T_{max}) + \cdots + a[p]\exp(-j2\pi pt/T_{max})\right|^2} \tag{2.15}
\]
for \(0 \le t \le T_{max}\), where \(\sigma_u^2\) is real-valued and \(\sigma_u^2 > 0\), and the a[k]'s are
complex-valued.
3. Determine the threshold by numerical integration as the value of γ that solves
\[
\int_{\gamma}^{\infty} p_{T_D}(t)\, dt = P_{FA_p}. \tag{2.16}
\]
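Steps 1-3 can be sketched as follows, with the complex "autocorrelation" (2.14) filling a Hermitian Toeplitz system and the threshold obtained by numerically integrating the AR-model PDF of (2.15)-(2.16). The order p = 6, the grid size, and the numerical renormalization of the PDF are illustrative choices, not from the original.

```python
import numpy as np

def estimate_threshold(T, p=6, pfa=0.02, grid=4000):
    """Estimate the threshold gamma of (2.16) from samples T of the test
    statistic, using the AR-model PDF estimator of (2.14)-(2.15)."""
    T = np.asarray(T, float)
    Tmax = T.max()
    Tn = T / Tmax                                    # step 1: scale to [0, 1]
    # step 2: "autocorrelation" = empirical characteristic function (2.14)
    r = np.array([np.mean(np.exp(2j*np.pi*k*Tn)) for k in range(p + 1)])
    R = np.array([[r[i-j] if i >= j else np.conj(r[j-i])
                   for j in range(p)] for i in range(p)])  # Hermitian Toeplitz
    a = np.linalg.solve(R, -r[1:p+1])
    sigma2 = np.real(r[0] + np.sum(a * np.conj(r[1:p+1])))  # (2.3), r[-k]=r*[k]
    # AR-model PDF (2.15) on a grid of t in [0, Tmax]
    t = np.linspace(0.0, Tmax, grid)
    A = 1.0 + sum(a[k-1]*np.exp(-2j*np.pi*k*t/Tmax) for k in range(1, p+1))
    pdf = sigma2 / np.abs(A)**2
    dt = t[1] - t[0]
    pdf /= pdf.sum() * dt                            # renormalize numerically
    # step 3: gamma with right-tail probability pfa (2.16)
    tail = 1.0 - np.cumsum(pdf) * dt
    return t[np.argmax(tail <= pfa)]
```

Since the estimated PDF is strictly positive, the tail probability is strictly decreasing and the threshold grows as the desired false alarm probability shrinks.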
2.5 Experimental Detection Performance for Field Background Data
The following results make use of 10,000 pulses of concrete field background
data to which chemical signatures obtained in the laboratory were added using a
computer. The first 500 pulses of background data only are used for initialization
so that the background spectrum can be estimated as needed for |AB(fk)|2 in
(2.11). Also, using the same 500 pulses the threshold γ is found for the detector
using (2.14–2.16). The threshold is then fixed for the entire remaining 9500 pulses.
From the results to be presented it is found that the threshold will have to be
periodically updated. After the initialization period a chemical signature is added
to the background at a rate of 10% in a random manner. As explained previously,
a window of 25 previous pulses without threshold crossings is used to update the
background spectrum.
As an example of the detection performance for concrete field background
data we set the threshold so that the false alarm rate is at most 1 per 2 hours, as
explained previously. Then, we plot the probability of detection PD versus signal-
to-noise ratio (SNR) for a single chemical. The SNR is defined as the broadband
SNR. This is
\[
\mathrm{SNR} = 10 \log_{10} \frac{\sum_{k=1}^{N_f} \theta P_s(f_k)}{\sum_{k=1}^{N_f} P_B(f_k)} \tag{2.17}
\]
where PB(f) is the PSD of the background, Ps(f) is the known spectral signature
for the chemical, and θ is a scaling factor that produces the desired SNR. We
next plot the probability of detection PDp based on a single pulse, which is the
probability of a threshold crossing, versus SNR. This is done by examining the
pulses where we know that a chemical has been added throughout the 9500 pulses.
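For a given signature and background, the scale factor θ achieving a desired broadband SNR follows directly by inverting (2.17); a one-line illustration (hypothetical helper name):

```python
import numpy as np

def scale_for_snr(Ps, PB, snr_db):
    """Scale factor theta from (2.17) giving the desired broadband SNR in dB
    for signature Ps over background PSD PB, both sampled at the f_k."""
    return 10.0**(snr_db / 10.0) * np.sum(PB) / np.sum(Ps)
```

For flat, equal-power spectra and a target of −10 dB this returns θ = 0.1, as can be verified by substituting back into (2.17).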
The spectra of all the chemicals that are used in the simulations are plotted in
Figure 2.3 (one or more chemicals are added to the background for either detection
performance or classification performance). When chemical 15 is added to the
background, the probability of detection is shown in Figure 2.4. It is seen that
the probability of detection is perfect for SNRs in excess of -10 dB. If we instead
add chemical 31, the results are as shown in Figure 2.5. Again the detection
performance is nearly perfect at a fairly low SNR.
Figure 2.3. Spectra of the chemicals that are used in simulations (panels show chemicals 15, 16, 20, 29, 31, 45, 56, and 58).
Figure 2.4. Probability PDp of detecting chemical 15 versus SNR based on a single pulse.
Figure 2.5. Probability PDp of detecting chemical 31 versus SNR based on a single pulse.
2.6 Experimental False Alarm Rate Performance
Since the threshold is critical to maintaining a reasonable false alarm rate, we
performed an experiment to determine whether the computed threshold was reasonable. For
the same 10,000 pulses (6.67 minutes of data), we used the first 500 pulses for
initialization. Then, for a concrete background (no added chemical) we implemented
the detection algorithm previously described. A false alarm occurs if
the number of threshold crossings for a block of 100 pulses is 10 or more. The same
threshold as found from the first 500 pulses was used throughout the remaining
9500 pulses. It was found that for blocks that are 50% overlapped, as assumed in
the analysis, there were 3 false alarms as shown in Figure 2.6. However, two of the
false alarms are close together and so can be considered as the same one. Hence,
there are 2 false alarms. This is still higher than predicted. In a two-hour period
there would be on the average 36 false alarms, instead of the prediction of 1. This
would imply that the background is not stationary over this time interval. Thus
we should update the threshold periodically.
In this example we then updated the threshold every 500 pulses. That is,
if no detection is declared for any block within 500 pulses, we update the
threshold using all 500 test statistics TD and use the updated threshold for
the next 500 pulses. Otherwise, if a detection is declared for some blocks in
these 500 pulses, we use the test statistics TD from the blocks within these 500 pulses
that do not declare detections to update the threshold. It was found that for blocks
that are 50% overlapped, there were no false alarms for the remaining 9500 pulses.
When we updated the threshold every 500 pulses, even for successive blocks, there
were no false alarms either. This supports our derivation of the threshold
needed to control the false alarm rate.
Figure 2.6. False alarms for a concrete background and fixed threshold (50% overlapping blocks).
2.7 Classification
For the purpose of this paper, we constrain ourselves to single pulse classifica-
tion, that is, we perform a classification based on a single pulse that has a threshold
crossing. Once the detection algorithm declares that some chemicals are present
in a block of data (of say 100 pulses, for which 10 or more have had threshold
crossings), we proceed with single pulse classification on those pulses that have
threshold crossings within this block, and choose the chemicals that appear most
often in single pulse classification. In subsection 2.7.1, it is assumed that only
one of M chemicals may be present. In subsection 2.7.2, we consider the problem
of mixtures of chemicals, i.e., the case when two or possibly three chemicals are
present in a single scattered pulse. To do so, we initially assume in subsection 2.7.2
that we know there are K out of M chemicals on the background, so we just need to
decide which combination of K chemicals is present. Then, in subsection 2.7.3, we
will use a model order selection criterion to decide how many chemicals are present,
i.e., the value of K.
2.7.1 Classification if Only One of M Chemicals Is Present
To determine which chemical is present we compute M test statistics and
choose the chemical with the largest value of the test statistic. The test statistic
that is used is that associated with a locally most powerful (LMP) test [6]. It
can also be interpreted as an estimate of the chemical amplitude normalized by its
standard deviation. The overall classification procedure is as follows (see Appendix
2.9 for the derivation and detailed model):
1. For the pulse I(f) that has a threshold crossing, estimate the background by
using the previous 25 pulses that did not have threshold crossings. To do
so first normalize the power in each of these pulses to have a total power
of one and then average the spectra to yield I(f). Then, estimate the AR
parameters to obtain PB(f) as given by (2.4). Next normalize I(f) to make
\(\sum_{k=1}^{N_f} I(f_k) = \sum_{k=1}^{N_f} P_B(f_k)\). By doing this we make sure the pulse has the same
power as the background. Since the chemical signature power is assumed to
be small, this also guarantees the pulse has about the same background
power, which satisfies the assumption of the M -ary hypothesis test as in
(2.30) in Appendix 2.9. (Note also that by this assumption, the background
normalization needed to form PB(f) should not be affected by the chemical
present.)
2. For each chemical signature Psi(f), use the estimate of the background PB(f)
and the pulse data to be classified I(f) to compute the classification test
statistic
\[
T_{C_i} = \frac{\displaystyle\sum_{k=1}^{N_f} \frac{P_{s_i}(f_k)}{P_B(f_k)} \left(\frac{I(f_k)}{P_B(f_k)} - 1\right)}{\sqrt{\displaystyle\sum_{k=1}^{N_f} \frac{P_{s_i}^2(f_k)}{P_B^2(f_k)}}} \tag{2.18}
\]
Note that the chemical signature \(P_{s_i}(f)\) need not be power normalized since
\(T_{C_i}\) does not depend on the scaling of \(P_{s_i}(f)\).
3. Repeat step 2 for i = 1, 2, . . . , M .
4. Choose the chemical that produces the largest TCi.
Preliminary results indicate that even with a single pulse a nearly perfect classification
can be made, as described in Section 2.8.
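The M-ary procedure reduces to computing (2.18) for each library signature and taking the largest; a minimal sketch (illustrative names, all spectra assumed sampled on the same frequency grid):

```python
import numpy as np

def classify_single_pulse(I, PB, signatures):
    """Sketch of the LMP classification statistic (2.18): compute T_Ci for
    each library signature and pick the largest.  signatures: list of
    arrays Ps_i(f_k); I and PB are the pulse and background spectra."""
    I, PB = np.asarray(I, float), np.asarray(PB, float)
    resid = I / PB - 1.0                    # normalized spectral residual
    scores = []
    for Ps in signatures:
        w = Ps / PB
        scores.append(np.sum(w * resid) / np.sqrt(np.sum(w**2)))
    return int(np.argmax(scores)), scores
```

As the text notes, the statistic is invariant to scaling of each signature, so no power normalization of the library is needed.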
2.7.2 Classification if K out of M Chemicals Are Present
This problem is more complicated since we now need to pick K out of M
chemicals instead of just picking one out of M . The total number of possible
combinations is \(\binom{M}{K}\). An asymptotic likelihood function method is proposed. The
idea is that we first find the asymptotic maximum likelihood estimate (MLE) of the
unknown powers for chemical signatures and plug it into the corresponding log-
likelihood function of this hypothesis. The chemical combination that produces
the largest log-likelihood is chosen. The classification procedure is as follows (see
Appendix 2.9 for the derivation and detailed model):
1. The first step is the same as in the previous subsection. Obtain the average
spectrum of the chemical plus background I(f).
2. For each chemical combination hypothesis, compute the asymptotic MLE of
the chemical signature powers by
\[
\hat{\theta} = I^{-1}(0) \left.\frac{\partial \ln p(x; \theta)}{\partial \theta}\right|_{\theta = 0} \tag{2.19}
\]
where
\[
\left.\frac{\partial \ln p(x; \theta)}{\partial \theta_i}\right|_{\theta = 0} = \frac{N}{2} \sum_{k=1}^{N_f} \frac{P_{s_{k_i}}(f_k)}{P_B(f_k)} \left(\frac{I(f_k)}{P_B(f_k)} - 1\right) \Delta f \tag{2.20}
\]
and
\[
[I(0)]_{ij} = \frac{N}{2} \sum_{k=1}^{N_f} \frac{P_{s_{k_i}}(f_k)\, P_{s_{k_j}}(f_k)}{\left(P_B(f_k)\right)^2}\, \Delta f. \tag{2.21}
\]
If there is at least one negative element in \(\hat{\theta}\), set the log-likelihood of this
hypothesis to −∞. Otherwise plug \(\hat{\theta}\) into the following log-likelihood function:
\[
\ln p(x; \hat{\theta}) = -\frac{N}{2} \sum_{k=1}^{N_f} \left[ \ln\!\left( \sum_{i=1}^{K} \hat{\theta}_i P_{s_{k_i}}(f_k) + P_B(f_k) \right) + \frac{I(f_k)}{\sum_{i=1}^{K} \hat{\theta}_i P_{s_{k_i}}(f_k) + P_B(f_k)} \right] \Delta f - \frac{N}{2}\ln(2\pi) \tag{2.22}
\]
Also note that the chemical signatures \(P_{s_{k_i}}\) do not need to be power normalized
since the ith element \(\hat{\theta}_i\) of \(\hat{\theta}\) from (2.19) is proportional to \(1/P_{s_{k_i}}\).
Thus \(\hat{\theta}_i P_{s_{k_i}}\) in (2.22) does not depend on the scaling of \(P_{s_{k_i}}\).
3. Repeat step 2 for all the \(\binom{M}{K}\) hypotheses.
4. Choose the chemical combination that corresponds to the hypothesis having the
largest log-likelihood. Note that the number of data samples N in the time
domain is unknown. However, we can compare the log-likelihoods without
knowing N. This is because \(\hat{\theta}\) does not depend on N, since N cancels
in (2.19), and N is just a scaling factor in (2.22).
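The combination search can be sketched directly from (2.19)-(2.22); an illustrative implementation (names and the flat Δf are ours) that omits the −(N/2) ln(2π) constant, since it is common to all hypotheses:

```python
import numpy as np
from itertools import combinations

def classify_mixture(I, PB, signatures, K, N=2048, df=1.0):
    """Pick the best K-of-M chemical combination: asymptotic MLE of the
    signature powers via (2.19)-(2.21), plugged into (2.22)."""
    I, PB = np.asarray(I, float), np.asarray(PB, float)
    resid = I / PB - 1.0
    best, best_ll = None, -np.inf
    for combo in combinations(range(len(signatures)), K):
        S = np.array([signatures[i] / PB for i in combo])      # K x Nf
        grad = (N/2.0) * (S @ resid) * df                      # (2.20)
        Fisher = (N/2.0) * (S @ S.T) * df                      # (2.21)
        theta = np.linalg.solve(Fisher, grad)                  # (2.19)
        if np.any(theta < 0):
            continue                    # log-likelihood set to -inf
        P = theta @ np.array([signatures[i] for i in combo]) + PB
        ll = -(N/2.0) * np.sum(np.log(P) + I / P) * df         # (2.22)
        if ll > best_ll:
            best, best_ll = combo, ll
    return best
```

For signatures with disjoint support over a flat background the MLE recovers the true powers exactly, so the correct pair wins.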
2.7.3 Model Order Selection on How Many Chemicals Are Present in the Mixture
We have considered the case when we know the number of chemicals that are
present. But in practice, this information is unknown a-priori. Thus, we need to
select the model order, i.e., how many chemicals are present. Again, we will use
the EEF as the model order selection criterion. For each hypothesis, the EEF can
be calculated by
\[
\mathrm{EEF} =
\begin{cases}
l_G(x) - K\left[\ln\!\left(\dfrac{l_G(x)}{K}\right) + 1\right] & \text{if } \dfrac{l_G(x)}{K} > 1 \\[2mm]
0 & \text{if } \dfrac{l_G(x)}{K} \le 1
\end{cases} \tag{2.23}
\]
where K is the assumed number of chemicals deposited on the background and
\[
l_G(x) = 2 \ln \frac{p(x; \hat{\theta})}{p(x; 0)}.
\]
The log-likelihood functions \(\ln p(x; \hat{\theta})\) and \(\ln p(x; 0)\) can be found by plugging
\(\hat{\theta}\) from (2.19) and \(\theta = 0\) into (2.22), respectively. We choose the hypothesis with the largest EEF
value.
Since the EEF is increasing in lG(x), for the same model order K the largest
lG(x) corresponds to the largest EEF. So for each K, we just need to find lG(x)
for all the \(\binom{M}{K}\) hypotheses, choose the largest lG(x), and plug it into (2.23). Then we
compare the EEFs for different K's and choose the model with the largest EEF.
We select the hypothesis with the largest lG(x) for the model order that has been
chosen.
Again, we need the number of data samples N in the time domain in computing
the EEF, since (2.22) depends on N. We will assume the same number of
samples in the time domain as in the frequency domain. Since we have Nf = 1024
samples equally spaced on half a period in the frequency domain, we will use
N = 2Nf = 2048. By simulation we have seen that the performance is excellent
with N = 2048.
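Adding the EEF rule (2.23) on top of the asymptotic MLE gives a sketch of the full order-selection loop; a self-contained illustration under simplified assumptions (flat Δf, N = 2048, illustrative names), with the common −(N/2) ln(2π) term cancelling inside lG(x):

```python
import numpy as np
from itertools import combinations

def select_num_chemicals(I, PB, signatures, K_max=3, N=2048, df=1.0):
    """EEF rule (2.23) for the number K of chemicals: for each K keep the
    hypothesis with the largest l_G(x), then compare EEF values."""
    I, PB = np.asarray(I, float), np.asarray(PB, float)
    resid = I / PB - 1.0
    ll0 = -(N/2.0) * np.sum(np.log(PB) + I/PB) * df        # ln p(x; 0)
    best_K, best_eef = 0, 0.0
    for K in range(1, K_max + 1):
        lG_best = -np.inf
        for combo in combinations(range(len(signatures)), K):
            S = np.array([signatures[i] / PB for i in combo])
            theta = np.linalg.solve((N/2.0)*(S @ S.T)*df,     # (2.19)-(2.21)
                                    (N/2.0)*(S @ resid)*df)
            if np.any(theta < 0):
                continue              # hypothesis assigned -inf likelihood
            P = theta @ np.array([signatures[i] for i in combo]) + PB
            ll = -(N/2.0) * np.sum(np.log(P) + I/P) * df      # (2.22)
            lG_best = max(lG_best, 2.0*(ll - ll0))            # l_G(x)
        if lG_best / K > 1.0:
            eef = lG_best - K*(np.log(lG_best/K) + 1.0)       # (2.23)
            if eef > best_eef:
                best_K, best_eef = K, eef
    return best_K
```

With two chemicals actually present, the third coefficient estimates to zero at K = 3, so lG(x) stops growing and the larger penalty makes the EEF correctly select K = 2.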
2.8 Experimental Classification Performance for Field Background Data
For the same data conditions as for the detection experiment, we isolate all
the pulses that have had threshold crossings. The probability of a correct single
pulse classification is found by
\[
P_C = \frac{\text{number of correct classifications for the pulses that have the added chemical}}{\text{number of pulses that have the added chemical}}.
\]
First we consider the case when there is only one chemical present. Using a library
of M = 60 possible chemicals, we classify the pulses with threshold crossings as per
the discussion in subsection 2.7.1. The results for chemicals 15, 31 and 45 are shown in
Figures 2.7, 2.8 and 2.9 respectively. Again nearly perfect results are obtained for
an SNR in excess of -10 dB.
Figure 2.7. Probability of correct single pulse classification versus SNR. Chemical 15 is present.
Figure 2.8. Probability of correct single pulse classification versus SNR. Chemical 31 is present.
Next we added the two chemicals 15 and 16 to the background, each chemical
with the same SNR. We assume that we know the number of chemicals present. In
Figure 2.9. Probability of correct single pulse classification versus SNR. Chemical 45 is present.
the simulation, we have found that the classifier will sometimes choose chemicals
16 and 29. In Figure 2.10, we see that the probability of choosing chemicals 15
and 16 does not go to 1 as SNR increases. But if we consider chemicals 15 and 29
to be the same, the performance is much improved. This is because the spectrum
of chemical 15 is very similar to that of chemical 29 as shown in Figure 2.3. The
correlation between the spectra of the two chemicals is 0.968, which means that
they are approximately linearly dependent. In this case, it is hard to distinguish
between these two chemicals. Two approaches are possible. We can either treat
chemicals 15 and 29 as the same, or remove chemical 29 from the library. When
the classifier chooses chemical 15 in the case where chemical 29 is removed, a
second-stage classification can be performed to further discriminate between chemical 15
and chemical 29. A similar approach is considered in [10], where one spectrum
of each spectrum pair whose correlation is greater than a threshold is removed.
The performance is shown in Figure 2.11 when chemical 29 is removed from the
library. The simulation results for the chemical 20 and 45 combination and for
the chemical 31 and 45 combination are shown in Figure 2.12 and Figure 2.13
respectively. These combinations are easily classified.
Figure 2.10. Probability of correct single pulse classification versus SNR. Chemicals 15 and 16 are present. (Curves: chemicals 15 and 29 treated as the same vs. as different.)
Figure 2.11. Probability of correct single pulse classification versus SNR. Chemicals 15 and 16 are present; chemical 29 is removed from the library.
Figure 2.12. Probability of correct single pulse classification versus SNR. Chemicals 20 and 45 are present.
Figure 2.13. Probability of correct single pulse classification versus SNR. Chemicals 31 and 45 are present.
Next we would like to ascertain how the EEF works for the case when the
number of chemicals in the mixture is unknown. We assume that there are at
most 3 chemicals present. Thus, we need to compare the EEF for K = 1, 2, 3.
Chemicals 15, 56 and 58 with the same SNR are added to the background. The
performance of the EEF is compared to that of the minimum description length
(MDL) criterion. The MDL is based on coding arguments [23] and can also be
derived by an asymptotic Bayesian procedure [24]. We still consider chemicals 15
and 29 as the same chemical because of the high correlation between them. The
resulting probability of correct classification versus SNR is shown in Figure 2.14.
The result for the chemical 15, 31 and 45 combination is shown in Figure 2.15. The
result for the chemical 20 and 45 combination is shown in Figure 2.16. Comparing
Figure 2.12 and Figure 2.16, we see that the former produces a slightly higher
probability of correct classification. This is because for Figure 2.12, we assume
that we know the number of chemicals, but for Figure 2.16, we need to estimate
the number of chemicals.
Figure 2.14. Probability of correct single pulse classification versus SNR using the EEF and the MDL. Chemicals 15, 56 and 58 are present.
As we have seen, some of the target chemicals in the library are highly cor-
related. As a result, we need to remove some of them from the library or else
Figure 2.15. Probability of correct single pulse classification versus SNR using the EEF and the MDL. Chemicals 15, 31 and 45 are present.
Figure 2.16. Probability of correct single pulse classification versus SNR using EEF and MDL. Chemicals 20 and 45 are present.
treat them as a group of similar chemicals. A further consideration is that a linear
combination of some chemicals might appear similar to another single chemical.
Then the classifier would not perform well if that single chemical were present,
since we might choose the chemicals that form the equivalent linear combination
instead. A future paper will address this issue.
2.9 Conclusion
An AR model has been proposed for chemical detection and classification based on Raman spectra. A detection procedure followed by a classification scheme is used to control the false alarm rate. This is an unsupervised approach which estimates the statistics of the nonstationary background data on-line. Experiments with field background data have shown excellent performance of both the detector and the classifier.
Appendix 2A - Derivation of the AR Model Order Estimator
The basic assumption is that the spectral data obtained through the action of the Raman spectroscopy unit can be modeled as a periodogram of real-valued Gaussian data. This implies certain statistics of the spectral data which, although not completely satisfied, allow us to derive a detector that will perform well in practice. As an example of this modeling discrepancy, in analyzing field-obtained spectral data it has been found that the probability density function of the spectral data is not chi-squared with two degrees of freedom, as the Gaussian model implies. Hence, algorithms which push the Gaussian assumption too far may not work as predicted. Fortunately, for the problem at hand the algorithms so derived appear to perform exceedingly well.
Assume that N samples {x[0], x[1], . . . , x[N − 1]} in the time domain of the
Gaussian AR random process are observed (this is fictitious). We assume that we
have the same number of samples in the time domain as in the frequency domain
(as in a discrete Fourier transform). Since Nf is the number of samples equally
spaced on half a period in the frequency domain, we have N = 2Nf . We need
to estimate the order of the AR process. This is a multiple hypothesis testing
problem with
H_0 : a[1] = 0, a[2] = 0, ..., a[p_M] = 0, σ_u² > 0
H_1 : a[1] ≠ 0, a[2] = 0, ..., a[p_M] = 0, σ_u² > 0
...
H_{p_M} : a[1] ≠ 0, a[2] ≠ 0, ..., a[p_M] ≠ 0, σ_u² > 0
where p_M is the largest candidate model order. That is, for the AR process of order p, only the first p AR parameters are nonzero. Let p(x; a_p, σ_u², H_p) denote the PDF under H_p, where x denotes the random process data vector and a_p is the p × 1 vector of the first p nonzero AR filter parameters. Note that under H_0 the AR process of order 0 is white Gaussian noise, so we write the PDF under H_0 as p(x; σ_u², H_0).
To estimate the order, we resort to the exponentially embedded families (EEF) criterion, a recently proposed model order selection method [19]. It has been shown that asymptotically the EEF minimizes the divergence between the true PDF and the estimated one. For each hypothesis H_p, the EEF is computed as
EEF(p) = { l_{G_p}(x) − p [ ln( l_{G_p}(x)/p ) + 1 ],   if l_{G_p}(x)/p > 1
         { 0,                                           if l_{G_p}(x)/p ≤ 1      (2.24)
where l_{G_p}(x) is the generalized likelihood ratio for H_p [6] with

l_{G_p}(x) = 2 ln [ p(x; â_p, σ̂²_{u_p}, H_p) / p(x; σ̂²_{u_0}, H_0) ]      (2.25)

Here â_p and σ̂²_{u_p} are the maximum likelihood estimators (MLEs) of a_p and σ_u² under H_p, and σ̂²_{u_0} is the MLE of σ_u² under H_0. The EEF criterion chooses the hypothesis with the largest EEF value.
The PDF can be written in the frequency domain (and hence the time series data can be replaced by the spectral data) as [20]

ln p(x; a_p, σ_u², H_p) = −(N/2) ln 2π − (N/2) ∫₀¹ [ ln P_p(f) + I(f)/P_p(f) ] df      (2.26)
where I(f) is the periodogram of the data and P_p(f) is the true power spectral density (PSD) of the AR process with parameters a_p, σ_u². Since

P_p(f) = σ_u² / |A_p(f)|²

the log-PDF can be written as

ln p(x; a_p, σ_u², H_p) = −(N/2) ln 2π − (N/2) ∫₀¹ [ ln( σ_u²/|A_p(f)|² ) + I(f)|A_p(f)|²/σ_u² ] df
                        = −(N/2) ln 2π − (N/2) ∫₀¹ [ ln σ_u² + |A_p(f)|² I(f)/σ_u² ] df
since it can be shown that ∫₀¹ ln |A_p(f)|² df = 0 [16]. Next we maximize the log-PDF over σ_u² to obtain the MLE

σ̂²_{u_p} = ∫₀¹ |A_p(f)|² I(f) df

and substituting back into ln p(x; a_p, σ̂²_{u_p}, H_p) yields

ln p(x; a_p, σ̂²_{u_p}, H_p) = −(N/2) ln 2π − (N/2) ln ∫₀¹ |A_p(f)|² I(f) df − N/2.
Finally, we need to maximize this over a_p to obtain â_p. It can be shown that this maximization requires one to use the Yule-Walker equations to estimate the AR filter parameters. Denoting the resultant MLE of A_p(f) under H_p as Â_p(f), we have

ln p(x; â_p, σ̂²_{u_p}, H_p) = −(N/2) ln 2π − (N/2) ln ∫₀¹ |Â_p(f)|² I(f) df − N/2
Note that since we have white Gaussian noise under H_0, we have A_0(f) = 1. Maximizing the log-PDF over σ_u² for A_0(f) = 1 yields

σ̂²_{u_0} = ∫₀¹ I(f) df

and hence

ln p(x; σ̂²_{u_0}, H_0) = −(N/2) ln 2π − (N/2) ln ∫₀¹ I(f) df − N/2
As a result,

l_{G_p}(x) = 2 ln [ p(x; â_p, σ̂²_{u_p}, H_p) / p(x; σ̂²_{u_0}, H_0) ] = −N ln [ ∫₀¹ |Â_p(f)|² I(f) df / ∫₀¹ I(f) df ]      (2.27)
When this is discretized over the band 0 ≤ f ≤ 1/2 we have

l_{G_p}(x) = −N ln [ Σ_{k=1}^{N_f} |Â_p(f_k)|² I(f_k) Δf / Σ_{k=1}^{N_f} I(f_k) Δf ]      (2.28)
Finally we choose the AR model with the largest EEF calculated by (2.24).
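As a concrete illustration of this appendix, the following numpy sketch computes the EEF of (2.24) with l_{G_p} from (2.28) and selects the AR order. The function names are ours, and the symmetric extension of the periodogram samples and the direct Yule-Walker solve are simplifying assumptions, not the exact implementation used in the dissertation.

```python
import numpy as np

def ar_fit_from_periodogram(I, p):
    """Yule-Walker AR(p) fit from N_f periodogram samples on [0, 1/2).

    Autocorrelations are estimated by inverse-DFT of the periodogram
    (mirrored to a full period); a[0] = 1 by convention.
    """
    full = np.concatenate([I, I[::-1]])          # approximate symmetric extension
    r = np.fft.ifft(full).real[: p + 1]          # autocorrelation estimates
    if p == 0:
        return np.array([1.0]), r[0]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1 : p + 1])        # Yule-Walker equations
    sigma2 = r[0] + a @ r[1 : p + 1]             # excitation noise variance
    return np.concatenate([[1.0], a]), sigma2

def eef_order(I, p_max):
    """Choose the AR order maximizing the EEF of (2.24)."""
    Nf = len(I)
    N = 2 * Nf
    f = np.arange(Nf) / N                        # N_f samples on [0, 1/2)
    total = I.sum()                              # proportional to ∫ I(f) df
    best_p, best_eef = 0, 0.0                    # EEF(0) = 0
    for p in range(1, p_max + 1):
        a, _ = ar_fit_from_periodogram(I, p)
        E = np.exp(-2j * np.pi * np.outer(f, np.arange(p + 1)))
        A2 = np.abs(E @ a) ** 2                  # |A_p(f_k)|^2
        lG = -N * np.log((A2 * I).sum() / total)     # eq. (2.28)
        eef = lG - p * (np.log(lG / p) + 1.0) if lG > p else 0.0
        if eef > best_eef:
            best_p, best_eef = p, eef
    return best_p
```

For an AR(1)-shaped spectrum, the fit recovers the pole and the EEF picks order 1, since higher orders gain almost nothing in l_{G_p} but pay a larger penalty.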
Appendix 2B - Derivation of Test Statistic for Detection
To begin, we assume that the background random process (in the time domain) is a real-valued Gaussian AR process with parameters {a_B[1], a_B[2], ..., a_B[p], σ_u²}. These parameters, as well as the order p, are estimated using the sample average of the background spectral data as in (2.8). Under H_0, which is background only, the AR filter parameters are assumed known but the excitation noise variance is not. Under H_1, the AR filter parameters and the excitation noise variance are both unknown. Let q be the estimated AR model order under H_1 using the observed spectral data I(f). Then, we set up the hypothesis test
H_0 : AR parameters are a_B[1], a_B[2], ..., a_B[p], σ_u² > 0
H_1 : AR parameters are a[1], a[2], ..., a[q], σ_u² > 0
This effectively says that under H_0 (no signal present) the spectrum is just the known background spectrum, although with an unspecified σ_u². Under H_1 the shape of the spectrum is changed due to the change in the AR filter parameters, which is caused by the presence of a signal added to the background. We also assume that the fictitious N time samples {x[0], x[1], ..., x[N − 1]} of the Gaussian AR random process are observed. Let p(x; a_B, σ_u², H_0) denote the PDF under H_0 and p(x; a, σ_u², H_1) the PDF under H_1, where x denotes the random process data vector, a_B is the known p × 1 vector of AR filter parameters, and a is the unknown q × 1 vector of AR filter parameters. The generalized
likelihood ratio test (GLRT) statistic is [6]

l_G(x) = ln [ p(x; â, σ̂²_{u_1}, H_1) / p(x; a_B, σ̂²_{u_0}, H_0) ]      (2.29)

where â and σ̂²_{u_1} are the maximum likelihood estimators (MLEs) of a and σ_u² under H_1, and σ̂²_{u_0} is the MLE of σ_u² under H_0.
From derivations similar to those in Appendix 2A, and denoting the resultant MLE of A(f) under H_1 as Â_T(f), we have

ln p(x; a_B, σ̂²_{u_0}, H_0) = −(N/2) ln 2π − (N/2) ln ∫₀¹ |A_B(f)|² I(f) df − N/2

ln p(x; â, σ̂²_{u_1}, H_1) = −(N/2) ln 2π − (N/2) ln ∫₀¹ |Â_T(f)|² I(f) df − N/2
and finally from (2.29)

l_G(x) = (N/2) ln [ ∫₀¹ |A_B(f)|² I(f) df / ∫₀¹ |Â_T(f)|² I(f) df ].
When this is discretized over the band 0 ≤ f ≤ 1/2 we have

l_G(x) = (N/2) ln [ Σ_{k=1}^{N_f} |A_B(f_k)|² I(f_k) / Σ_{k=1}^{N_f} |Â_T(f_k)|² I(f_k) ]

and omitting the N/2 factor, we finally have (2.11).
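The discretized detection statistic above can be sketched in a few lines of numpy. This is our own minimal illustration (function names are ours): in practice Â_T(f) would come from an AR fit to the observed spectrum, e.g., via the Yule-Walker procedure of Appendix 2A.

```python
import numpy as np

def whitened_power(I, a):
    """Sum_k |A(f_k)|^2 I(f_k) for A(f) = 1 + a[1] e^{-j2πf} + ..."""
    f = np.arange(len(I)) / (2 * len(I))          # N_f samples on [0, 1/2)
    E = np.exp(-2j * np.pi * np.outer(f, np.arange(len(a))))
    return np.sum(np.abs(E @ np.asarray(a)) ** 2 * I)

def detection_statistic(I, a_B, a_T):
    """Log-ratio of background-whitened to target-model-whitened power
    (the N/2 factor is omitted, as in (2.11))."""
    return np.log(whitened_power(I, a_B) / whitened_power(I, a_T))
```

When the observed spectrum matches the background model exactly, the statistic is zero; a target-model fit that whitens the data better than the background model drives it positive.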
Appendix 2C - Derivation of Probability of Detection Statistic Threshold Crossing for Given False Alarm Rate
We declare that a chemical has been detected if at least 10% of the pulses in a given block produce threshold crossings. As an example, we consider the block to consist of 100 pulses, and hence a detection occurs if at least 10 threshold crossings are observed. Also, we assume an operational requirement of one false alarm per two hours. Computing the desired probabilities exactly is difficult because successive blocks, which differ by only one sample, are heavily dependent. As an approximation we assume that the blocks overlap by 50% (which may be necessary in practice to avoid excessive computation and is commonly done) and therefore that the data in each block are approximately independent. Then, in two hours we have examined
L = (2 × 3600 × 25)/50 = 3600 blocks
for a 10% threshold crossing rate. Hence, the probability of false alarm for each block is obtained as follows. Let P_{FA_b} be the probability of a false alarm for a single block. Then, the probability of at most one false alarm in L independent blocks is
P_1 = P[at most one false alarm in L blocks]
    = P[no false alarms in L blocks] + P[one false alarm in L blocks]
    = (1 − P_{FA_b})^L + L P_{FA_b} (1 − P_{FA_b})^{L−1}
since this is a binomial type of probability. We want the probability of at most one false alarm per two hours to be large, say 0.99. Hence, we solve for P_{FA_b} by finding the value that satisfies

(1 − P_{FA_b})^L + L P_{FA_b} (1 − P_{FA_b})^{L−1} = 0.99.
In general, for at most one false alarm per h hours we should use L = 1800h. For the example of h = 2 we plot the probability of at most one false alarm per two hours versus P_{FA_b} in Figure 2.17. It is seen that we should require P_{FA_b} = 4 × 10⁻⁵, which is the probability of a false alarm for a single block of 100
Figure 2.17. Probability P_1 of at most one false alarm per two hours versus P_{FA_b}.
pulses. Next, since we declare a chemical present if there are at least 10 threshold crossings out of a possible 100, the probability of a false alarm for a single block is

P_{FA_b} = Σ_{k=10}^{100} C(100, k) P_{FA_p}^k (1 − P_{FA_p})^{100−k} = 1 − Σ_{k=0}^{9} C(100, k) P_{FA_p}^k (1 − P_{FA_p})^{100−k}
where P_{FA_p} is the probability of a threshold crossing, i.e., the probability of a false alarm for a single pulse. In Figure 2.18 we plot P_{FA_b} versus P_{FA_p}. For P_{FA_b} = 4 × 10⁻⁵ = −44 dB, we require from Figure 2.18 that P_{FA_p} = 0.02. Hence, the threshold γ of the test statistic given by (2.11) should be set so that the probability of T_D exceeding γ is 0.02.
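The two numbers read off Figures 2.17 and 2.18 can be checked numerically. The sketch below (our own helper names) bisects the two monotone equations of this appendix: the at-most-one-false-alarm constraint for P_{FA_b}, and the binomial tail for P_{FA_p}.

```python
from math import comb

def block_pfa(P1_target, L):
    """Bisect (1-p)^L + L p (1-p)^(L-1) = P1_target for the per-block P_FAb."""
    lo, hi = 0.0, 1e-2
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        P1 = (1 - mid) ** L + L * mid * (1 - mid) ** (L - 1)
        if P1 > P1_target:          # P1 decreases as p grows, so push p upward
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def pulse_pfa(pfa_block, n=100, k_min=10):
    """Bisect the binomial tail P(X >= k_min), X ~ Bin(n, q), to match P_FAb."""
    def tail(q):
        return 1.0 - sum(comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(k_min))
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if tail(mid) < pfa_block:   # the tail grows with q
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With L = 3600 and P_1 = 0.99, the first solve gives roughly 4 × 10⁻⁵, and matching that block probability gives a per-pulse probability of about 0.02, consistent with the figures.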
Appendix 2D - Derivation of LMP Test Statistic for Classification
It is assumed that one of M chemicals is present. The spectral data is assumed to be of the form P(f) = θ_i P_{s_i}(f) + P_B(f) if the ith chemical is present. As usual P_B(f) is the PSD of the background, P_{s_i}(f) is the known spectral signature of the ith chemical, and θ_i is an unknown scaling factor that accounts for the unknown power of the chemical. To decide which chemical is present we set up an M-ary
Figure 2.18. Probability of a false alarm for a single block, P_{FA_b} (dB), versus P_{FA_p}.
hypothesis test as
H_1 : P(f) = θ_1 P_{s_1}(f) + P_B(f)
H_2 : P(f) = θ_2 P_{s_2}(f) + P_B(f)
...
H_M : P(f) = θ_M P_{s_M}(f) + P_B(f).
The θ_i's are positive but otherwise unknown. They are assumed to be small so that a locally most powerful (LMP) approach can be used. The LMP classification test statistic decides chemical k is present if, among

T_{C_i}(x) = [ ∂ ln p(x; H_i)/∂θ_i |_{θ_i=0} ] / √( I_{F_i}(0) )      (2.30)

for i = 1, 2, ..., M, T_{C_k}(x) is the maximum. In (2.30), I_{F_i}(0) is the Fisher information for θ_i evaluated at θ_i = 0. To evaluate the test statistics, we first note that the log-PDF of the spectral data is given, as in Appendix 2A, by

ln p(x) = −(N/2) ln 2π − (N/2) ∫₀¹ [ ln P(f) + I(f)/P(f) ] df.
Using P(f) = θP_s(f) + P_B(f) we have

ln p(x) = −(N/2) ln 2π − (N/2) ∫₀¹ [ ln( θP_s(f) + P_B(f) ) + I(f)/( θP_s(f) + P_B(f) ) ] df

and differentiating produces

∂ ln p(x)/∂θ = −(N/2) ∫₀¹ [ P_s(f)/( θP_s(f) + P_B(f) ) − I(f)P_s(f)/( θP_s(f) + P_B(f) )² ] df      (2.31)
which when evaluated at θ = 0 yields

∂ ln p(x)/∂θ |_{θ=0} = −(N/2) ∫₀¹ [ P_s(f)/P_B(f) − I(f)P_s(f)/P_B²(f) ] df
                     = (N/2) ∫₀¹ [ P_s(f)/P_B(f) ] [ I(f)/P_B(f) − 1 ] df.      (2.32)
To determine the Fisher information we differentiate (2.31) a second time to produce

∂² ln p(x)/∂θ² = −(N/2) ∫₀¹ [ −P_s²(f)/( θP_s(f) + P_B(f) )² + 2I(f)P_s²(f)/( θP_s(f) + P_B(f) )³ ] df.

Taking the expected value and noting that E[I(f)] = P(f) = θP_s(f) + P_B(f) produces

E[ ∂² ln p(x)/∂θ² ] = −(N/2) ∫₀¹ [ −P_s²(f)/( θP_s(f) + P_B(f) )² + 2( θP_s(f) + P_B(f) )P_s²(f)/( θP_s(f) + P_B(f) )³ ] df
                    = −(N/2) ∫₀¹ [ −P_s²(f)/( θP_s(f) + P_B(f) )² + 2P_s²(f)/( θP_s(f) + P_B(f) )² ] df
                    = −(N/2) ∫₀¹ P_s²(f)/( θP_s(f) + P_B(f) )² df.
Setting θ = 0 and taking the negative produces

I_F(0) = (N/2) ∫₀¹ P_s²(f)/P_B²(f) df.      (2.33)
Therefore, from (2.32) and (2.33), the LMP statistic becomes

T_C = √(N/2) ∫₀¹ [ P_s(f)/P_B(f) ] [ I(f)/P_B(f) − 1 ] df / √( ∫₀¹ P_s²(f)/P_B²(f) df ).      (2.34)
When discretized over the band 0 ≤ f ≤ 1/2, this becomes

√(N/2) Σ_{k=1}^{N_f} [ P_s(f_k)/P_B(f_k) ] [ I(f_k)/P_B(f_k) − 1 ] Δf / √( Σ_{k=1}^{N_f} P_s²(f_k)/P_B²(f_k) Δf )

and ignoring a scaling factor, which does not affect the maximum, we finally have

T_C = Σ_{k=1}^{N_f} [ P_s(f_k)/P_B(f_k) ] [ I(f_k)/P_B(f_k) − 1 ] / √( Σ_{k=1}^{N_f} P_s²(f_k)/P_B²(f_k) )
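The final discretized statistic is a normalized correlation of the background-normalized periodogram excess with each signature, which makes it simple to implement. A minimal numpy sketch (function names are ours):

```python
import numpy as np

def lmp_statistic(I, Ps, PB):
    """Discretized T_C above (constant scaling omitted)."""
    num = np.sum((Ps / PB) * (I / PB - 1.0))
    den = np.sqrt(np.sum((Ps / PB) ** 2))
    return num / den

def classify(I, signatures, PB):
    """Decide the chemical whose LMP statistic is largest."""
    return int(np.argmax([lmp_statistic(I, Ps, PB) for Ps in signatures]))
```

With well-separated signatures, a spectrum containing one chemical correlates strongly only with that chemical's signature, so the argmax picks it out.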
Appendix 2E - Derivation of the Asymptotic Likelihood Function Method for Classification of a Mixture of Chemicals
We assume that K of the M chemicals are present and that they are additive. Hence, the spectral data is of the form P(f) = Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) if chemicals k_1, k_2, ..., k_K are present. The total number of candidate hypotheses is C(M, K). Let the unknown parameter vector be θ = [θ_{k_1}, θ_{k_2}, ..., θ_{k_K}]ᵀ. Asymptotically [6],

θ̂ = θ_0 + I⁻¹(θ_0) ∂ ln p(x; θ)/∂θ |_{θ=θ_0}

or, since in our problem θ_0 = 0,

θ̂ = I⁻¹(0) ∂ ln p(x; θ)/∂θ |_{θ=0}.      (2.35)
For each candidate hypothesis,

ln p(x; θ) = −(N/2) ∫₀¹ [ ln( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) ) + I(f)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) ) ] df − (N/2) ln 2π      (2.36)
and

∂ ln p(x; θ)/∂θ_{k_i} = −(N/2) ∫₀¹ [ P_{s_{k_i}}(f)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) ) − I(f)P_{s_{k_i}}(f)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) )² ] df

so that

∂ ln p(x; θ)/∂θ_{k_i} |_{θ=0} = (N/2) ∫₀¹ [ P_{s_{k_i}}(f)/P_B(f) ] [ I(f)/P_B(f) − 1 ] df.      (2.37)
The second derivative is

∂² ln p(x; θ)/∂θ_{k_i}∂θ_{k_j} = −(N/2) ∫₀¹ [ −P_{s_{k_i}}(f)P_{s_{k_j}}(f)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) )² + 2I(f)P_{s_{k_i}}(f)P_{s_{k_j}}(f)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) )³ ] df
and therefore, the (i, j)th element of the Fisher information matrix is

I_{ij}(θ) = −E[ ∂² ln p(x; θ)/∂θ_{k_i}∂θ_{k_j} ].

Moving the expectation inside the integral and using E[I(f)] = Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f), the two terms combine exactly as in Appendix 2D to give

I_{ij}(θ) = (N/2) ∫₀¹ P_{s_{k_i}}(f)P_{s_{k_j}}(f)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) )² df

or

I_{ij}(0) = (N/2) ∫₀¹ P_{s_{k_i}}(f)P_{s_{k_j}}(f)/P_B²(f) df.      (2.38)
When discretized over the band 0 ≤ f ≤ 1/2, (2.36), (2.37) and (2.38) become

ln p(x; θ) = −(N/2) Σ_{k=1}^{N_f} [ ln( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f_k) + P_B(f_k) ) + I(f_k)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f_k) + P_B(f_k) ) ] Δf − (N/2) ln 2π      (2.39)

∂ ln p(x; θ)/∂θ_{k_i} |_{θ=0} = (N/2) Σ_{k=1}^{N_f} [ P_{s_{k_i}}(f_k)/P_B(f_k) ] [ I(f_k)/P_B(f_k) − 1 ] Δf      (2.40)

I_{ij}(0) = (N/2) Σ_{k=1}^{N_f} P_{s_{k_i}}(f_k)P_{s_{k_j}}(f_k)/P_B²(f_k) Δf.      (2.41)
Now we have the MLE of θ from (2.35), (2.40) and (2.41). The asymptotic likelihood function approach then substitutes θ̂ for θ in (2.39) and chooses the hypothesis with the largest likelihood.
One important issue is our assumption that θ_{k_i} ≥ 0 for i = 1, 2, ..., K. The MLE of θ without these nonnegativity constraints may produce negative solutions; however, this is easily resolved by the Kuhn-Tucker conditions.

From the Kuhn-Tucker conditions, we know that if the MLE without the nonnegativity constraints has negative components, then the MLE under these constraints will have at least one θ_{k_i} = 0 [25]. The hypothesis is then reduced to at most a (K − 1)th order model. Any other Kth order hypothesis that contains the same chemical signatures as the reduced (K − 1)th order hypothesis plus one arbitrary additional signature has a likelihood no less than that of the reduced (K − 1)th order hypothesis. For example, consider a hypothesis H_1 with chemical signatures P_{s_1}(f), P_{s_2}(f), ..., P_{s_K}(f). If the unconstrained MLE of θ has negative components, then by the Kuhn-Tucker conditions the MLE under the nonnegativity constraints has at least one θ_i = 0, say θ_1 = 0. Thus, any other hypothesis that includes P_{s_2}(f), P_{s_3}(f), ..., P_{s_K}(f) and one other chemical signature has a likelihood no less than that of H_1, since we could at least attain the same likelihood by reusing the constrained MLE of H_1. This argument implies that we can ignore any hypothesis whose unconstrained MLE has a negative component; the greatest likelihood must correspond to a hypothesis with a nonnegative unconstrained MLE.
Since there are as many as C(M, K) candidate hypotheses, we need not be concerned with the case in which every hypothesis has at least one negative component in its unconstrained MLE; in that case we may conclude that fewer than K chemicals are present, and we should decrease the value of K.
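The core computation of this appendix, (2.35) with the discretized gradient (2.40) and Fisher matrix (2.41), can be sketched compactly. This is our own minimal illustration (function names ours); per the Kuhn-Tucker argument above, a hypothesis whose estimate has a negative component would simply be discarded.

```python
import numpy as np

def asymptotic_mle(I, sigs, PB, N):
    """theta_hat = I(0)^{-1} * gradient, per (2.35), (2.40), (2.41)."""
    df = 1.0 / N                                  # frequency spacing Δf
    S = np.asarray(sigs) / PB                     # rows: P_sk(f_k)/P_B(f_k)
    g = 0.5 * N * df * (S @ (I / PB - 1.0))       # gradient (2.40) at θ = 0
    F = 0.5 * N * df * (S @ S.T)                  # Fisher matrix (2.41)
    return np.linalg.solve(F, g)

def asymptotic_loglike(I, sigs, PB, theta, N):
    """Asymptotic log-likelihood (2.39) for a candidate hypothesis."""
    df = 1.0 / N
    P = np.asarray(sigs).T @ theta + PB           # modeled PSD
    return -0.5 * N * df * np.sum(np.log(P) + I / P) - 0.5 * N * np.log(2 * np.pi)
```

Each candidate hypothesis is scored by evaluating (2.39) at its own θ̂, and the hypothesis with the largest likelihood (and a nonnegative θ̂) is chosen.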
List of References
[1] K. Kneipp, H. Kneipp, I. Itzkan, R. Dasari, and M. Feld, "Ultrasensitive chemical analysis by Raman spectroscopy," Chemical Reviews, vol. 99, pp. 2957-2975, 1999.

[2] R. Frost, D. Henry, and K. Erickson, "Raman spectroscopic detection of wyartite in the presence of rabejacite," Journal of Raman Spectroscopy, vol. 35, pp. 255-260, 2004.

[3] N. Hayazawa, M. Motohashi, Y. Saito, and S. Kawata, "Highly sensitive strain detection in strained silicon by surface-enhanced Raman spectroscopy," Applied Physics Letters, vol. 86, pp. 263114-1-263114-3, 2005.

[4] A. Portnov, S. Rosenwaks, and I. Bar, "Detection of particles of explosives via backward coherent anti-Stokes Raman spectroscopy," Applied Physics Letters, vol. 93, pp. 041115-1-041115-3, 2008.

[5] D. Manolakis, D. Marden, and G. Shaw, "Hyperspectral image processing for automatic target detection applications," Lincoln Laboratory Journal, vol. 14, no. 1, pp. 79-116, 2003.

[6] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.

[7] L. Scharf and B. Friedlander, "Matched subspace detectors," IEEE Trans. Signal Process., vol. 42, no. 8, pp. 2146-2157, Aug. 1994.

[8] W. Wang and T. Adali, "Constrained ICA and its application to Raman spectroscopy," in Proc. Antennas and Propagation Society International Symposium, Jul. 2005, pp. 109-112.

[9] W. Wang, T. Adali, and D. Emge, "Unsupervised detection using canonical correlation analysis and its application to Raman spectroscopy," in Proc. IEEE Workshop on Machine Learning for Signal Processing, Aug. 2007.

[10] W. Wang, T. Adali, and D. Emge, "Subspace partitioning for target detection and identification," IEEE Trans. Signal Process., vol. 57, no. 4, pp. 1250-1259, Apr. 2009.

[11] M. Alam, M. Nazrul Islam, A. Bal, and M. Karim, "Hyperspectral target detection using Gaussian filter and post-processing," Optics and Lasers in Engineering, vol. 46, pp. 817-822, Nov. 2008.

[12] T. Chyba, N. Higdon, W. Armstrong, C. Lobb, P. Ponsardin, D. Richter, B. Kelly, Q. Bui, R. Babnick, M. Boysworth, A. Sedlacek, and S. Christesen, "Field tests of the laser interrogation of surface agents (LISA) system for on-the-move standoff sensing of chemical agents," in Proc. Int. Symp. Spectral Sensing Research, 2003.

[13] S. Kay, C. Xu, and D. Emge, "Chemical detection and classification in Raman spectra," in Proceedings of the SPIE, vol. 6969, Mar. 2008, pp. 4-12.

[14] W. Knight, R. Pridham, and S. Kay, "Digital signal processing for sonar," Proceedings of the IEEE, Nov. 1981, pp. 1451-1506.

[15] R. Wiley, ELINT: The Interception and Analysis of Radar Signals. Boston, MA: Artech House, 2006.

[16] S. Kay, Modern Spectral Estimation: Theory and Application. Englewood Cliffs, NJ: Prentice-Hall, 1988.

[17] D. Bowyer, P. Rajasekaran, and W. Gebhart, "Adaptive clutter filtering using autoregressive spectral estimation," IEEE Trans. Aerosp. Electron. Syst., pp. 538-546, Jul. 1979.

[18] S. Kay and J. Salisbury, "Improved active sonar detection using autoregressive prewhiteners," J. Acoustical Soc. of America, pp. 1603-1611, Apr. 1990.

[19] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333-345, Jan. 2005.

[20] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[21] A. Pages-Zamora and M. Lagunas, "New approaches in non-linear signal processing: Estimation of the probability density function by spectral estimation methods," in IEEE Workshop on Higher Order Statistics, 1995.

[22] S. Kay, "Model based probability density function estimation," IEEE Signal Process. Lett., pp. 318-320, Dec. 1998.

[23] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465-471, 1978.

[24] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.

[25] C. Lawson and R. Hanson, Solving Least Squares Problems. Philadelphia, PA: SIAM, 1995.
MANUSCRIPT 3
Sensor Integration for Distributed Detection and Classification
Abstract
We investigate the problem of sensor integration to combine all the available
information in a multi-sensor setting from a statistical standpoint. Specifically, in
this paper, we propose a novel method of constructing the joint probability density
function (PDF) of the measurements from all the sensors based on the exponential
family. This method does not require the knowledge of the marginal PDFs and
hence is useful in many practical cases. We prove that our method is asymptotically
optimal in Kullback-Leibler (KL) divergence. Our method requires less informa-
tion compared to existing methods and attains comparable detection/classification
performance.
3.1 Introduction
Distributed systems and information fusion have been widely studied and used in engineering, finance, and scientific research. Applications include radar, sonar, biomedical analysis, stock prediction, weather forecasting, and chemical, biological, radiological, and nuclear (CBRN) detection, to name a few. If the joint probability density functions (PDFs) under each candidate hypothesis are known, we can easily obtain the optimal performance by the Neyman-Pearson rule for detection (binary hypothesis testing) and by the maximum a posteriori probability (MAP) rule for classification (multiple hypothesis testing) [1]. In practice, however, this information may not be available. This usually happens when the dimensionality of the sample space is high and we do not have enough training samples to estimate the joint PDF accurately; the problem is exacerbated by onerous environmental and system constraints in radar and sonar applications. This is also recognized as the "curse of dimensionality" in pattern recognition and machine learning. Hence, it is important to efficiently approximate the unknown joint PDF using limited training data.
the measurements from different sensors are independent [2], [3]. This approach
has been widely used due to its simplicity, since the joint PDF is then the product
of the marginal PDFs. This is also known as the “product rule” in combining
classifiers [4]. In spite of its popularity, the independence assumption may not be
a good one if the measurements are actually correlated. Furthermore, as stated in
[4], the product rule is severe because “it is sufficient for a single recognition engine
to inhibit a particular interpretation by outputting a close to zero probability for
it”. Hence researchers have studied other methods that consider the correlation
among the measurements. However, the problem does not have a unique solution
when the data is non-Gaussian. A copula based framework is proposed in [5], [6]
to construct the joint PDF. The exponentially embedded families (EEFs) are used
in [7] to estimate the joint PDF that is asymptotically closest to the true one in
Kullback-Leibler (KL) divergence.
Note that the above methods all require the knowledge of marginal PDFs.
In this paper, we consider the case when the marginal PDFs are not available or
accurate, which can happen due to a high-dimensional sample space and insuffi-
cient training data. We present a new way of constructing the joint PDF without
the knowledge of marginal PDFs but only a reference PDF. The constructed joint
PDF takes the form of the exponential family and incorporates all the available
information. The maximum likelihood estimator (MLE) [8] of the unknown pa-
rameters can be easily solved based on the properties of the exponential family. It
is shown that the constructed PDF is asymptotically the optimal one in the sense
that it is asymptotically closest to the true PDF in KL divergence. Since there is
no Gaussian distribution assumption on the reference PDF, this method can be
very useful when the underlying distributions are non-Gaussian. We start with the
detection problem, and then extend our method to the classification problem. For
detection, it is shown that under some conditions, our detection statistics are the
same as the clairvoyant generalized likelihood ratio test (GLRT). For classifica-
tion, our classifier also has the same performance as the estimated MAP classifier.
Both the clairvoyant GLRT and the estimated MAP classifier assume that the true
PDFs under each candidate hypothesis are known except for the usual unknown
parameters.
The paper is organized as follows. In Section 3.2, we introduce a distributed
detection/classification problem. In Section 3.3, we construct the joint PDF by an
exponential family and apply it to the problem in Section 3.2. The KL divergence
between the true PDF and the constructed PDF is examined in Section 3.4, and
the result shows that the constructed PDF is asymptotically optimal. Examples
for distributed detection are given in Section 3.5, and examples for distributed
classification are given in Section 3.6. Simulation results to compare the perfor-
mance of our method with existing methods are shown in Section 3.7. In Section
3.8, we draw the conclusions.
3.2 Problem Statement
Consider the distributed detection/classification problem in which we observe the outputs of two sensors, T1(x) and T2(x), which are transformations of the underlying samples x. The samples themselves are unobservable at the central processor, as shown in Figure 3.1. We choose two sensors for simplicity; all the results in this paper are valid for multiple sensors. For detection, we want to distinguish between two hypotheses H0 and H1 based on the outputs of the two sensors, and for classification, we have M candidate hypotheses Hi for i = 1, 2, ..., M.
Assume that we have enough training data T1_i(x)'s and T2_i(x)'s under H0, i.e., when there is no signal present. Hence we have a good estimate of the joint PDF of T1 and T2 under H0 [8], and thus we assume p_{T1,T2}(t1, t2; H0) is completely known. Under H1, or under Hi for i = 1, 2, ..., M, when a signal is present, we may not even have enough training data to estimate the marginal PDFs. This is especially the case in the radar scenario, where the target is present for only a small portion of the time. So our goal is to use the available information to construct an appropriate p_{T1,T2}(t1, t2; H1) under H1 for detection, or p_{T1,T2}(t1, t2; Hi) under each Hi for classification. A simple illustration is shown in Figure 3.1.
Figure 3.1. Distributed detection/classification system with two sensors.
3.3 Joint PDF Construction by Exponential Family and Its Application in Distributed Systems
To start with, we consider the detection problem, where we wish to construct p_{T1,T2}(t1, t2; H1). The result will then be extended to the classification problem. Since p_{T1,T2}(t1, t2; H1) cannot be uniquely specified based on p_{T1,T2}(t1, t2; H0), we need the following reasonable assumptions to construct the joint PDF.

1) Under H1 the signal is small and thus p_{T1,T2}(t1, t2; H1) is close to p_{T1,T2}(t1, t2; H0).

2) p_{T1,T2}(t1, t2; H1) can be parameterized by some signal parameters θ such that

p_{T1,T2}(t1, t2; H1) = p_{T1,T2}(t1, t2; θ)
p_{T1,T2}(t1, t2; H0) = p_{T1,T2}(t1, t2; 0)

Note that since θ represents signal amplitudes, θ ≠ 0 under H1. Therefore, the detection problem is to select between

H0 : θ = 0
H1 : θ ≠ 0
To simplify the notation, let
T =
[T1
T2
]
so that we can write the joint PDF pT1,T2(t1, t2; θ) as pT(t; θ). With the small
signal assumptions, it has been shown in [9] that by using a first order Taylor
expansion on the log-likelihood function ln pT(t; θ) about θ = 0, we can construct
the PDF of T as
pT(t; θ) = exp[
θT t − K(θ) + ln pT(t;0)]
(3.1)
where

K(θ) = ln E_0[ exp(θᵀT) ]      (3.2)

is the cumulant generating function of p_T(t; 0); it normalizes the PDF to integrate to one. Since T is a sufficient statistic for the constructed exponential PDF in (3.1), this PDF incorporates all the information from the two sensors. Note that only p_T(t; 0) is required in (3.1) to construct p_T(t; θ), and it is assumed that p_T(t; 0) is available or can be estimated with reasonable accuracy. Also note that if T1 and T2 are statistically dependent under H0, they will also be dependent under H1.
The next step is to estimate the unknown parameter θ. We resort to the MLE [10] by maximizing (3.1) over θ. Note that K(θ) is convex by Hölder's inequality [11]. Since maximizing (3.1) is equivalent to maximizing θᵀt − K(θ), this becomes a convex optimization problem and many existing methods can be readily utilized [12], [13]. Also, the MLE of θ will satisfy

t = ∂K(θ)/∂θ      (3.3)

When the MLE θ̂ is found, we use p_T(t; θ̂) as our estimated PDF under H1. Hence, similar to the GLRT [1], we decide H1 if

ln [ p_T(t; θ̂) / p_T(t; 0) ] = θ̂ᵀt − K(θ̂) > τ      (3.4)

where τ is a threshold. We will show in the next section that p_T(t; θ̂) is asymptotically optimal in the sense of KL divergence.
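The construction and test of (3.1)-(3.4) can be illustrated with a toy numerical sketch. All names here are ours, and the correlated-Gaussian reference PDF is only an assumption chosen so the answer can be checked in closed form; in practice T0 would be training samples of (T1, T2) recorded under H0, and no Gaussian assumption is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference samples of T = (T1, T2) under H0 (assumed correlated Gaussian
# here only for checkability; any distribution we can sample works).
cov0 = np.array([[1.0, 0.5], [0.5, 1.0]])
T0 = rng.multivariate_normal([0.0, 0.0], cov0, size=20000)

def K_hat(theta):
    """Monte Carlo estimate of the cumulant generating function (3.2)."""
    return np.log(np.mean(np.exp(T0 @ theta)))

def mle_and_glrt(t, steps=500, lr=0.1):
    """Maximize theta^T t - K(theta) by gradient ascent; the gradient is
    t - E_theta[T], estimated by exponentially tilting the H0 samples."""
    theta = np.zeros(2)
    for _ in range(steps):
        w = np.exp(T0 @ theta)                       # tilting weights
        theta = theta + lr * (t - (T0 * w[:, None]).sum(0) / w.sum())
    return theta, theta @ t - K_hat(theta)           # MLE and statistic (3.4)
```

For the Gaussian reference with covariance Σ, the exact answers are θ̂ = Σ⁻¹t and a statistic of tᵀΣ⁻¹t/2, which the Monte Carlo sketch reproduces to within sampling error.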
To extend our method to classification, the above two assumptions are simply modified as

1) The signal is small under each Hi and hence p_{T1,T2}(t1, t2; Hi) is close to p_{T1,T2}(t1, t2; H0).

2) Under each Hi, the joint PDF can be parameterized by some signal parameters θi so that

p_{T1,T2}(t1, t2; Hi) = p_{T1,T2}(t1, t2; θi)
p_{T1,T2}(t1, t2; H0) = p_{T1,T2}(t1, t2; 0)
Similar to (3.1), as shown in [14], we can construct the PDF of T under Hi as

p_T(t; θi) = exp[ θiᵀt − K(θi) + ln p_T(t; 0) ]      (3.5)

where

K(θi) = ln E_0[ exp(θiᵀT) ]      (3.6)
is the cumulant generating function of p_T(t; 0) that normalizes the constructed PDF. When the MLE θ̂i is found by maximizing p_T(t; θi) over θi, we take p_T(t; θ̂i) as our estimate of p_T(t; Hi). Hence, similar to the MAP rule [1], we decide the Hi for which the following is maximum over i:

p(Hi | t) = p_T(t; Hi) p(Hi) / p_T(t) = p_T(t; θ̂i) p(Hi) / p_T(t)      (3.7)
When we assume that the prior probabilities of the candidate hypotheses are equal, i.e., p(H1) = ... = p(HM) = 1/M, p(Hi) cancels and we can equivalently decide the Hi for which the following is maximum over i:

ln [ p_T(t; θ̂i) / p_T(t; 0) ] = θ̂iᵀt − K(θ̂i)      (3.8)
3.4 KL Divergence Between The True PDF and The Constructed PDF
The KL divergence is a non-symmetric measure of the difference between two PDFs. For two PDFs p1 and p0, it is defined as

D(p1 ‖ p0) = ∫ p1(x) ln [ p1(x)/p0(x) ] dx

It is well known that D(p1 ‖ p0) ≥ 0 with equality if and only if p1 = p0 almost everywhere [15]. By Stein's lemma [16], the KL divergence measures the asymptotic performance for detection; an extension to classification has recently been presented in [17]. Next we will show that p_T(t; θ̂) is optimal under both hypotheses. That is, under H0, p_T(t; θ̂) = p_T(t; 0) asymptotically, and under H1, p_T(t; θ̂) is asymptotically the closest PDF to the true one in KL divergence. Similar results and arguments have been shown in [7, 18].
Assume that we observe independent and identically distributed (IID) Ti's with

T_i = \begin{bmatrix} T_{1i} \\ T_{2i} \end{bmatrix}

for i = 1, 2, . . . , M. Shortening the notation, we write pT1,T2,...,TM(t1, t2, . . . , tM; θ) as p(t1, t2, . . . , tM; θ). The constructed PDF extends easily as (see (7.1))

p(t_1, t_2, \ldots, t_M;\theta) = \exp\left[\theta^T \sum_{i=1}^{M} t_i - MK(\theta) + \ln p(t_1, t_2, \ldots, t_M;0)\right] \qquad (3.9)
so we wish to maximize

\frac{1}{M}\ln\frac{p(t_1, t_2, \ldots, t_M;\theta)}{p(t_1, t_2, \ldots, t_M;0)} = \frac{1}{M}\theta^T \sum_{i=1}^{M} t_i - K(\theta) \qquad (3.10)

and θ̂ is found by solving

\frac{1}{M}\sum_{i=1}^{M} t_i = \frac{\partial K(\theta)}{\partial \theta} \qquad (3.11)
Now we consider two cases. First, for the true PDF under H0, by the law of large numbers, it follows that

\frac{1}{M}\sum_{i=1}^{M} t_i \to E_0(t)

as M → ∞. Note that

\left.\frac{\partial K(\theta)}{\partial \theta}\right|_{\theta=0} = E_0(t)

Since the solution of (3.11) is unique, asymptotically we have θ̂ = 0, and hence p(t1, t2, . . . , tM; θ̂) = p(t1, t2, . . . , tM; 0).
Secondly, for the true PDF under H1, by the law of large numbers, it follows that

\frac{1}{M}\sum_{i=1}^{M} t_i \to E_1(t)

as M → ∞. From (3.10), we are asymptotically maximizing

\theta^T E_1(t) - K(\theta) \qquad (3.12)
To avoid confusion, we denote the underlying true PDF under H1 as p(t1, t2, . . . , tM; H1) and our constructed PDF as p(t1, t2, . . . , tM; θ̂). Since from (3.9)

\ln\frac{p(t_1, \ldots, t_M;H_1)}{p(t_1, \ldots, t_M;\hat{\theta})} = -\left(\hat{\theta}^T \sum_{i=1}^{M} t_i - MK(\hat{\theta})\right) + \ln\frac{p(t_1, \ldots, t_M;H_1)}{p(t_1, \ldots, t_M;0)}

the KL divergence between the true PDF and the constructed one is

D\left(p(t_1, \ldots, t_M;H_1)\,\|\,p(t_1, \ldots, t_M;\hat{\theta})\right)
= E_{H_1}\left[-\left(\hat{\theta}^T \sum_{i=1}^{M} t_i - MK(\hat{\theta})\right) + \ln\frac{p(t_1, \ldots, t_M;H_1)}{p(t_1, \ldots, t_M;0)}\right]
= -M\left[\hat{\theta}^T E_1(t) - K(\hat{\theta})\right] + D\left(p(t_1, \ldots, t_M;H_1)\,\|\,p(t_1, \ldots, t_M;0)\right) \qquad (3.13)

Since D(p(t1, . . . , tM; H1)‖p(t1, . . . , tM; 0)) is fixed, D(p(t1, . . . , tM; H1)‖p(t1, . . . , tM; θ̂)) is minimized by maximizing (3.12). This shows that p(t1, . . . , tM; θ̂) is asymptotically the closest to p(t1, . . . , tM; H1) in KL divergence.
3.5 Examples-Distributed Detection

In this section, we compare our method with the clairvoyant GLRT for a specific detection problem. The clairvoyant GLRT, which provides an upper bound on GLRT performance, assumes that we know the true PDF of T under H1 except for the underlying unknown parameters α. It decides H1 if

\ln\frac{p_T(t;\hat{\alpha})}{p_T(t;0)} > \tau \qquad (3.14)
3.5.1 Partially Observed Linear Model with Gaussian Noise

Suppose we have the linear model

x = H\alpha + w \qquad (3.15)

with

H_0: \alpha = 0
H_1: \alpha \neq 0

where x is an N × 1 vector of the underlying unobservable samples, H is an N × p observation matrix with full column rank, α is a p × 1 vector of unknown signal amplitudes, and w is an N × 1 vector of white Gaussian noise samples with known variance σ². We observe two sensor outputs

T_1(x) = H_1^T x
T_2(x) = H_2^T x \qquad (3.16)

where H1 is N × q1 and H2 is N × q2. Note that [H1 H2] does not have to be H. This model is called a partially observed linear model.
Let G = [H1 H2]. We assume that G has full column rank so that there are no perfectly redundant sensor measurements. Then we have

T = \begin{bmatrix} T_1(x) \\ T_2(x) \end{bmatrix} = \begin{bmatrix} H_1^T x \\ H_2^T x \end{bmatrix} = G^T x \qquad (3.17)

Thus, T is also Gaussian and

T \sim \mathcal{N}\left(0, \sigma^2 G^T G\right) \quad \text{under } H_0
Let q = q1 + q2, so that T is q × 1. As a result, we construct the PDF as in (7.1) with

K(\theta) = \ln E_0\left[\exp\left(\theta^T t\right)\right] = \frac{1}{2}\sigma^2\theta^T G^T G\theta \qquad (3.18)
Hence the constructed PDF is

p_T(t;\theta) = \exp\left[\theta^T t - K(\theta) + \ln p_T(t;0)\right]
= \frac{1}{(2\pi\sigma^2)^{q/2}\det^{1/2}(G^T G)}\exp\left(-\frac{t^T(G^T G)^{-1}t}{2\sigma^2}\right)\cdot\exp\left[\theta^T t - \frac{1}{2}\sigma^2\theta^T G^T G\theta\right] \qquad (3.19)

which can be simplified as

T \sim \mathcal{N}\left(\sigma^2 G^T G\theta, \sigma^2 G^T G\right) \quad \text{under } H_1 \qquad (3.20)
Note that θ is the vector of unknown parameters in the constructed PDF, and it is different from the truly unknown parameters α. From (6.7) and (3.18), the MLE of θ satisfies

t = \frac{\partial K(\theta)}{\partial \theta} = \sigma^2 G^T G\theta

So

\hat{\theta} = \frac{1}{\sigma^2}\left(G^T G\right)^{-1} t

and the test statistic becomes

\hat{\theta}^T t - K(\hat{\theta}) = \frac{1}{2\sigma^2}\, t^T\left(G^T G\right)^{-1} t \qquad (3.21)
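The closed form above can be checked numerically. The following is an illustrative sketch with made-up G, t, and σ² (none of these values come from the dissertation): substituting the MLE θ̂ = (1/σ²)(GᵀG)⁻¹t into θᵀt − K(θ) recovers (3.21).

```python
import numpy as np

# Illustrative check (assumed values, not from the text): the MLE
# theta_hat = (1/sigma^2)(G^T G)^{-1} t turns theta^T t - K(theta) into (3.21).
rng = np.random.default_rng(0)
N, q, sigma2 = 20, 3, 2.0
G = rng.standard_normal((N, q))          # full column rank with probability 1
t = rng.standard_normal(q)
GtG = G.T @ G

K = lambda th: 0.5 * sigma2 * th @ GtG @ th   # cumulant gen. function (3.18)
theta_hat = np.linalg.solve(GtG, t) / sigma2  # solves t = sigma^2 G^T G theta

stat_eef = theta_hat @ t - K(theta_hat)
stat_closed = t @ np.linalg.solve(GtG, t) / (2 * sigma2)   # right side of (3.21)
assert np.isclose(stat_eef, stat_closed)
```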
Next we consider the clairvoyant GLRT, that is, the GLRT when we know the true PDF of T under H1 except for the underlying unknown parameters α. It is the suboptimal test obtained by plugging the MLE of α into the true PDF parameterized by α. Since the constructed PDF may not be the true PDF, the clairvoyant GLRT requires more information than our method. From (6.11) we know that

T \sim \mathcal{N}\left(G^T H\alpha, \sigma^2 G^T G\right) \quad \text{under } H_1 \qquad (3.22)

Note that (3.20) is the constructed PDF while (3.22) is the true PDF. In either case we need to estimate θ in (3.20) or α in (3.22) to implement the PDF. We write the true PDF under H1 as pT(t; α). The MLE of α is found by maximizing the true PDF given by (3.22):

\ln\frac{p_T(t;\alpha)}{p_T(t;0)} = -\frac{1}{2\sigma^2}\left(t - G^T H\alpha\right)^T\left(G^T G\right)^{-1}\left(t - G^T H\alpha\right) + \frac{1}{2\sigma^2}\, t^T\left(G^T G\right)^{-1} t
If q ≤ p, i.e., the length of t is less than or equal to the length of α, then the MLE α̂ may not be unique. However, since (t − GᵀHα)ᵀ(GᵀG)⁻¹(t − GᵀHα) ≥ 0, we can always find α̂ such that t = GᵀHα̂ and hence (t − GᵀHα̂)ᵀ(GᵀG)⁻¹(t − GᵀHα̂) = 0. Hence the clairvoyant GLRT statistic becomes

\ln\frac{p_T(t;\hat{\alpha})}{p_T(t;0)} = \frac{1}{2\sigma^2}\, t^T\left(G^T G\right)^{-1} t \qquad (3.23)

which is the same as our test statistic (see (6.13)) when q ≤ p.
If q > p, it can be shown that

\hat{\alpha} = \left(H^T G\left(G^T G\right)^{-1} G^T H\right)^{-1} H^T G\left(G^T G\right)^{-1} t

and the clairvoyant GLRT statistic becomes

\ln\frac{p_T(t;\hat{\alpha})}{p_T(t;0)} = \frac{t^T\left(G^T G\right)^{-1} G^T H\left(H^T G\left(G^T G\right)^{-1} G^T H\right)^{-1} H^T G\left(G^T G\right)^{-1} t}{2\sigma^2} \qquad (3.24)
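The q ≤ p equivalence can be checked numerically. This sketch uses assumed dimensions and random matrices (it also assumes GᵀH has full row rank, which holds almost surely here): any α̂ satisfying GᵀHα̂ = t zeroes the first quadratic form, so the clairvoyant statistic (3.23) matches the EEF statistic (3.21).

```python
import numpy as np

# Sketch with assumed values: for q <= p the clairvoyant GLRT statistic
# equals the EEF statistic, since an exact fit G^T H alpha_hat = t exists.
rng = np.random.default_rng(1)
N, p, q, sigma2 = 20, 4, 2, 1.5
H = rng.standard_normal((N, p))
G = rng.standard_normal((N, q))
t = rng.standard_normal(q)
GtG = G.T @ G

M = G.T @ H                                        # q x p with q <= p
alpha_hat = np.linalg.lstsq(M, t, rcond=None)[0]   # min-norm exact solution
resid = t - M @ alpha_hat
assert np.allclose(resid, 0, atol=1e-8)            # residual is zero

stat_clair = (t @ np.linalg.solve(GtG, t)
              - resid @ np.linalg.solve(GtG, resid)) / (2 * sigma2)
stat_ours = t @ np.linalg.solve(GtG, t) / (2 * sigma2)
assert np.isclose(stat_clair, stat_ours)
```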
3.5.2 Partially Observed Linear Model with Gaussian Mixture Noise

The partially observed linear model remains the same as in the previous subsection, except that instead of assuming that w is white Gaussian, we assume that w has a Gaussian mixture distribution with two components, i.e.,

w \sim \pi\mathcal{N}(0, \sigma_1^2 I) + (1 - \pi)\mathcal{N}(0, \sigma_2^2 I) \qquad (3.25)

where π, σ₁² and σ₂² are known (0 < π < 1). The following derivation extends easily to the case w ∼ Σ_{i=1}^{L} π_i N(0, σ_i² I).

Since w has a Gaussian mixture distribution, T = GᵀX is also Gaussian mixture distributed and

T \sim \pi\mathcal{N}(0, \sigma_1^2 G^T G) + (1 - \pi)\mathcal{N}(0, \sigma_2^2 G^T G) \quad \text{under } H_0
So we have

K(\theta) = \ln E_0\left[\exp\left(\theta^T t\right)\right] = \ln\left(\pi e^{\frac{1}{2}\sigma_1^2\theta^T G^T G\theta} + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\theta^T G^T G\theta}\right) \qquad (3.26)
Hence the constructed PDF is

p_T(t;\theta) = \exp\left[\theta^T t - K(\theta) + \ln p_T(t;0)\right]
= \left[\frac{\pi}{(2\pi\sigma_1^2)^{q/2}\det^{1/2}(G^T G)}\exp\left(-\frac{t^T(G^T G)^{-1}t}{2\sigma_1^2}\right) + \frac{1 - \pi}{(2\pi\sigma_2^2)^{q/2}\det^{1/2}(G^T G)}\exp\left(-\frac{t^T(G^T G)^{-1}t}{2\sigma_2^2}\right)\right]
\cdot \exp\left(\theta^T t\right)\Big/\left(\pi e^{\frac{1}{2}\sigma_1^2\theta^T G^T G\theta} + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\theta^T G^T G\theta}\right) \qquad (3.27)
Although this constructed PDF cannot be further simplified, we can still find the MLE by solving

t = \frac{\partial K(\theta)}{\partial \theta} = \frac{\pi e^{\frac{1}{2}\sigma_1^2\theta^T G^T G\theta}\,\sigma_1^2 G^T G\theta + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\theta^T G^T G\theta}\,\sigma_2^2 G^T G\theta}{\pi e^{\frac{1}{2}\sigma_1^2\theta^T G^T G\theta} + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\theta^T G^T G\theta}} \qquad (3.28)
Our test statistic is just

\hat{\theta}^T t - K(\hat{\theta}) = \hat{\theta}^T t - \ln\left(\pi e^{\frac{1}{2}\sigma_1^2\hat{\theta}^T G^T G\hat{\theta}} + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\hat{\theta}^T G^T G\hat{\theta}}\right) \qquad (3.29)

where θ̂ satisfies (3.28). Although no analytical solution of the MLE of θ exists, it can be found using convex optimization techniques [12, 13]. Moreover, an analytical solution exists as ||θ|| → 0. To see this, we will show that

\lim_{\|\theta\|\to 0} \frac{\partial K(\theta)}{\partial \theta}\; ./ \left(\pi\sigma_1^2 G^T G\theta + (1 - \pi)\sigma_2^2 G^T G\theta\right) = \mathbf{1} \qquad (3.30)

where ./ denotes element-by-element division.
To prove (3.30), we have

\lim_{\|\theta\|\to 0}\left(\pi e^{\frac{1}{2}\sigma_1^2\theta^T G^T G\theta} + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\theta^T G^T G\theta}\right) = 1 \qquad (3.31)

and

\lim_{\|\theta\|\to 0}\left(\pi e^{\frac{1}{2}\sigma_1^2\theta^T G^T G\theta}\,\sigma_1^2 G^T G\theta + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\theta^T G^T G\theta}\,\sigma_2^2 G^T G\theta\right) ./ \left(\pi\sigma_1^2 G^T G\theta + (1 - \pi)\sigma_2^2 G^T G\theta\right) = \mathbf{1} \qquad (3.32)

by L'Hôpital's rule. Dividing (3.32) by (3.31) and using (3.28), (3.30) is proved. As a result of (3.28) and (3.30), the MLE of θ satisfies

t = \pi\sigma_1^2 G^T G\hat{\theta} + (1 - \pi)\sigma_2^2 G^T G\hat{\theta}
85
as ||θ̂|| → 0, and θ̂ is easily found as

\hat{\theta} = \frac{1}{\pi\sigma_1^2 + (1 - \pi)\sigma_2^2}\left(G^T G\right)^{-1} t \qquad (3.33)

Since

\lim_{\|\theta\|\to 0} K(\theta)\Big/\left(\frac{1}{2}\pi\sigma_1^2\theta^T G^T G\theta + \frac{1}{2}(1 - \pi)\sigma_2^2\theta^T G^T G\theta\right) = 1

by using L'Hôpital's rule twice, as ||θ̂|| → 0 our test statistic becomes (see (6.15))

\hat{\theta}^T t - \left(\frac{1}{2}\pi\sigma_1^2\hat{\theta}^T G^T G\hat{\theta} + \frac{1}{2}(1 - \pi)\sigma_2^2\hat{\theta}^T G^T G\hat{\theta}\right) = \frac{1}{2\left(\pi\sigma_1^2 + (1 - \pi)\sigma_2^2\right)}\, t^T\left(G^T G\right)^{-1} t \qquad (3.34)
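One practical way to solve (3.28) is a damped fixed-point iteration; this is my own sketch with assumed parameters, not the dissertation's method. Equation (3.28) has the form t = c(θ)GᵀGθ, where c(θ) is a weighted average of σ₁² and σ₂², suggesting the update θ ← (GᵀG)⁻¹t / c(θ); for small t the solution should approach the closed form (3.33).

```python
import numpy as np

# Sketch with assumed parameters: damped fixed point for the MLE in (3.28),
# compared against the small-signal closed form (3.33).
rng = np.random.default_rng(2)
N, q = 20, 3
pi_, s1, s2 = 0.7, 1.0, 4.0              # assumed mixture weight and variances
G = rng.standard_normal((N, q))
GtG = G.T @ G

def mle_theta(t, iters=200):
    u = np.linalg.solve(GtG, t)
    th = u / (pi_ * s1 + (1 - pi_) * s2)     # small-signal starting point
    for _ in range(iters):
        w1 = pi_ * np.exp(0.5 * s1 * th @ GtG @ th)
        w2 = (1 - pi_) * np.exp(0.5 * s2 * th @ GtG @ th)
        c = (w1 * s1 + w2 * s2) / (w1 + w2)  # weighted variance in (3.28)
        th = 0.5 * th + 0.5 * u / c          # damped update
    return th

t_small = 1e-4 * rng.standard_normal(q)      # small-signal regime
th_hat = mle_theta(t_small)
th_closed = np.linalg.solve(GtG, t_small) / (pi_ * s1 + (1 - pi_) * s2)  # (3.33)
assert np.allclose(th_hat, th_closed, rtol=1e-3)
```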
To find the clairvoyant GLRT statistic, we know that under H1 the true PDF is

p_T(t;\alpha) = \frac{\pi}{(2\pi)^{q/2}\det^{1/2}(\sigma_1^2 G^T G)}\exp\left[-\frac{1}{2\sigma_1^2}(t - G^T H\alpha)^T\left(G^T G\right)^{-1}(t - G^T H\alpha)\right]
+ \frac{1 - \pi}{(2\pi)^{q/2}\det^{1/2}(\sigma_2^2 G^T G)}\exp\left[-\frac{1}{2\sigma_2^2}(t - G^T H\alpha)^T\left(G^T G\right)^{-1}(t - G^T H\alpha)\right] \qquad (3.35)
Note the difference between (3.27) and (3.35): (3.27) is the constructed PDF while (3.35) is the true PDF. The MLE of α is found by maximizing (3.35) over α.

When q ≤ p, the MLE of α may not be unique but satisfies t = GᵀHα̂. As a result, pT(t; α̂) is a constant and the clairvoyant GLRT statistic becomes

-\ln p_T(t;0)

Since pT(t; 0) is decreasing as tᵀ(GᵀG)⁻¹t increases, the clairvoyant GLRT statistic is equivalent to

t^T\left(G^T G\right)^{-1} t \qquad (3.36)

which is the same as our test statistic (up to a positive scale factor) as ||θ̂|| → 0 (see (6.13)).
When q > p, it can be shown that

\hat{\alpha} = \left(H^T G\left(G^T G\right)^{-1} G^T H\right)^{-1} H^T G\left(G^T G\right)^{-1} t

and the clairvoyant GLRT statistic becomes

\frac{\pi}{(\sigma_1^2)^{q/2}}\exp\left[-\frac{1}{2\sigma_1^2}(t - G^T H\hat{\alpha})^T\left(G^T G\right)^{-1}(t - G^T H\hat{\alpha})\right]
+ \frac{1 - \pi}{(\sigma_2^2)^{q/2}}\exp\left[-\frac{1}{2\sigma_2^2}(t - G^T H\hat{\alpha})^T\left(G^T G\right)^{-1}(t - G^T H\hat{\alpha})\right] \qquad (3.37)
Note that the noise in (6.14) is uncorrelated but not independent. We next consider a more general case in which the noise can be correlated, with Gaussian mixture

w \sim \pi\mathcal{N}(0, C_1) + (1 - \pi)\mathcal{N}(0, C_2) \qquad (3.38)

It can be shown that, similar to (6.15), our test statistic is

\hat{\theta}^T t - \ln\left(\pi e^{\frac{1}{2}\hat{\theta}^T G^T C_1 G\hat{\theta}} + (1 - \pi)e^{\frac{1}{2}\hat{\theta}^T G^T C_2 G\hat{\theta}}\right) \qquad (3.39)

and the clairvoyant GLRT statistic is

-\ln\left(\frac{\pi}{\det^{1/2}(C_1)}\exp\left[-\frac{1}{2}\, t^T\left(G^T C_1 G\right)^{-1} t\right] + \frac{1 - \pi}{\det^{1/2}(C_2)}\exp\left[-\frac{1}{2}\, t^T\left(G^T C_2 G\right)^{-1} t\right]\right) \qquad (3.40)

when q ≤ p.
When q > p, the MLE of α is not in closed form, and hence we write the clairvoyant GLRT statistic as

\max_{\alpha}\left[\frac{\pi}{\det^{1/2}(G^T C_1 G)}\exp\left[-\frac{1}{2}\left(t - G^T H\alpha\right)^T\left(G^T C_1 G\right)^{-1}\left(t - G^T H\alpha\right)\right] + \frac{1 - \pi}{\det^{1/2}(G^T C_2 G)}\exp\left[-\frac{1}{2}\left(t - G^T H\alpha\right)^T\left(G^T C_2 G\right)^{-1}\left(t - G^T H\alpha\right)\right]\right] \qquad (3.41)
Table 3.1. Comparison of our test statistic and the clairvoyant GLRT

Our Method:
- Gaussian noise: tᵀ(GᵀG)⁻¹t
- Uncorrelated non-Gaussian noise: max_θ [θᵀt − ln(π e^{½σ₁²θᵀGᵀGθ} + (1 − π) e^{½σ₂²θᵀGᵀGθ})]
- Correlated non-Gaussian noise: max_θ [θᵀt − ln(π e^{½θᵀGᵀC₁Gθ} + (1 − π) e^{½θᵀGᵀC₂Gθ})]

Clairvoyant GLRT (q ≤ p):
- Gaussian noise: tᵀ(GᵀG)⁻¹t
- Uncorrelated non-Gaussian noise: tᵀ(GᵀG)⁻¹t
- Correlated non-Gaussian noise: −ln(π det^{−1/2}(C₁) exp[−½ tᵀ(GᵀC₁G)⁻¹t] + (1 − π) det^{−1/2}(C₂) exp[−½ tᵀ(GᵀC₂G)⁻¹t])

Clairvoyant GLRT (q > p):
- Gaussian noise: tᵀ(GᵀG)⁻¹GᵀH(HᵀG(GᵀG)⁻¹GᵀH)⁻¹HᵀG(GᵀG)⁻¹t
- Uncorrelated non-Gaussian noise: π(σ₁²)^{−q/2} exp[−(1/2σ₁²)(t − GᵀHα̂)ᵀ(GᵀG)⁻¹(t − GᵀHα̂)] + (1 − π)(σ₂²)^{−q/2} exp[−(1/2σ₂²)(t − GᵀHα̂)ᵀ(GᵀG)⁻¹(t − GᵀHα̂)]
- Correlated non-Gaussian noise: max_α [π det^{−1/2}(GᵀC₁G) exp[−½(t − GᵀHα)ᵀ(GᵀC₁G)⁻¹(t − GᵀHα)] + (1 − π) det^{−1/2}(GᵀC₂G) exp[−½(t − GᵀHα)ᵀ(GᵀC₂G)⁻¹(t − GᵀHα)]]
3.5.3 Summary
We have considered a partially observed linear model with both Gaussian and
non-Gaussian noise. Table 3.1 compares our test statistic with the clairvoyant
GLRT.
1) In Gaussian noise, w ∼ N(0, σ²I), the test statistics are exactly the same for q ≤ p.

2) In uncorrelated non-Gaussian noise, w ∼ πN(0, σ₁²I) + (1 − π)N(0, σ₂²I), the test statistics are the same as θ̂ → 0 for q ≤ p.

3) In correlated non-Gaussian noise, w ∼ πN(0, C₁) + (1 − π)N(0, C₂), although we cannot show the equivalence of the two test statistics, we will see in Section 3.7 that their performances appear to be the same.
3.6 Examples-Distributed Classification

In this section, we compare our method with the estimated MAP classifier for some classification problems. The estimated MAP classifier assumes that the PDF of T under Hi is known except for some unknown underlying parameters αi. We assume equal prior probabilities for the candidate hypotheses, i.e., p(H1) = · · · = p(HM) = 1/M. So the estimated MAP classifier reduces to the estimated maximum likelihood classifier [1], which finds the MLE of αi and chooses Hi for which the following is maximum over i:

p_T(t;\hat{\alpha}_i) \qquad (3.42)

where α̂i is the MLE of αi.
3.6.1 Linear Model with Known Variance

Consider the following classification model:

H_i: x = A_i s_i + w \qquad (3.43)

where si is an N × 1 known signal vector with the same length as x, Ai is the unknown signal amplitude, and w is white Gaussian noise with known variance σ². Assume that instead of observing x, we can only observe the measurements of two sensors

T_1 = H_1^T x
T_2 = H_2^T x \qquad (3.44)

where H1 is N × q1 and H2 is N × q2. Here q1 and q2 are the lengths of the vectors T1 and T2, respectively. We can write (7.7) as

T = G^T x \qquad (3.45)

by letting

T = \begin{bmatrix} T_1 \\ T_2 \end{bmatrix} \quad \text{and} \quad G = [H_1 \ H_2]
where G is N × (q1 + q2) with q1 + q2 ≤ N . We assume that G has full column
rank so that there are no perfectly redundant measurements of the sensors. Note
that G can be any matrix with full column rank.
Let H0 be the reference hypothesis when there is noise only, i.e.,

H_0: x = w \qquad (3.46)

Since x is Gaussian under H0, according to (7.8), T is also Gaussian and T ∼ N(0, σ²GᵀG) under H0. We construct the PDF under Hi as in (7.1) with

K(\theta_i) = \ln E_0\left[\exp\left(\theta_i^T T\right)\right] = \frac{1}{2}\sigma^2\theta_i^T G^T G\theta_i \qquad (3.47)
Hence the constructed PDF is

p_T(t;\theta_i) = \exp\left[\theta_i^T t - K(\theta_i) + \ln p_T(t;0)\right]
= \frac{1}{(2\pi\sigma^2)^{(q_1+q_2)/2}\det^{1/2}(G^T G)}\exp\left(-\frac{t^T(G^T G)^{-1}t}{2\sigma^2}\right)\cdot\exp\left[\theta_i^T t - \frac{1}{2}\sigma^2\theta_i^T G^T G\theta_i\right] \qquad (3.48)

which can be simplified as

T \sim \mathcal{N}\left(\sigma^2 G^T G\theta_i, \sigma^2 G^T G\right) \quad \text{under } H_i \qquad (3.49)
The next step is to find the MLE of θi, which is obtained by maximizing θiᵀt − K(θi) over θi. If this optimization were carried out without any constraints, then θ̂i would be the same for all i. Hence we need some implicit constraints in finding the MLE. Since θi represents the signal under Hi, we should have

\theta_i = A_i G^T s_i = E_{H_i}(T) \qquad (3.50)

which is the mean of T under Hi. As a result, (7.12) can be written as

T \sim \mathcal{N}\left(\sigma^2 A_i G^T G G^T s_i, \sigma^2 G^T G\right) \quad \text{under } H_i \qquad (3.51)

Thus, instead of finding the MLE of θi by maximizing

\theta_i^T t - K(\theta_i) = \theta_i^T t - \frac{1}{2}\sigma^2\theta_i^T G^T G\theta_i \qquad (3.52)

with the constraint in (7.13), we can find the MLE of Ai in (7.14) (since si is assumed known) and then plug it into (7.13). It can be shown that

\hat{A}_i = \frac{s_i^T G t}{\sigma^2 s_i^T G G^T G G^T s_i} \qquad (3.53)

and

\hat{\theta}_i = \frac{G^T s_i s_i^T G t}{\sigma^2 s_i^T G G^T G G^T s_i} \qquad (3.54)

Hence, dropping constant factors, the test statistic of our classifier for Hi is

\frac{\left(s_i^T G t\right)^2}{\left(G^T s_i\right)^T G^T G\left(G^T s_i\right)} \qquad (3.55)
according to (3.8).
Next we consider the estimated MAP classifier. In this case, we assume that we know the true PDF except for Ai:

T \sim \mathcal{N}\left(A_i G^T s_i, \sigma^2 G^T G\right) \quad \text{under } H_i \qquad (3.56)

Note that (7.19) is the true PDF of T under Hi and (7.14) is the constructed PDF. It can be shown that the MLE of Ai in the true PDF under Hi is

\hat{A}_i = \frac{s_i^T G\left(G^T G\right)^{-1} t}{s_i^T G\left(G^T G\right)^{-1} G^T s_i} \qquad (3.57)

Dropping constant terms, the test statistic of the estimated MAP classifier for Hi is

\frac{\left(s_i^T G\left(G^T G\right)^{-1} t\right)^2}{\left(G^T s_i\right)^T\left(G^T G\right)^{-1}\left(G^T s_i\right)} \qquad (3.58)

according to (3.42). Note that (7.16) and (7.20) are different because (7.16) is the MLE of Ai under the constructed PDF and (7.20) is the MLE of Ai under the true PDF. Also note that if GᵀG is a scaled identity matrix, the test statistics in (7.18) and (7.21) are equivalent, and hence our method coincides with the estimated MAP classifier.
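The scaled-identity equivalence is easy to confirm numerically. This illustrative sketch builds a G with orthonormal columns scaled by a constant (my own construction, not from the text), so that GᵀG = cI and the statistics (3.55) and (3.58) coincide for every si.

```python
import numpy as np

# Sketch under the stated assumption G^T G = c I (here c = 4): the statistics
# (3.55) and (3.58) are then equal, so the two classifiers pick the same H_i.
rng = np.random.default_rng(3)
N, M = 24, 3
Q, _ = np.linalg.qr(rng.standard_normal((N, 4)))
G = 2.0 * Q                              # orthonormal columns scaled: G^T G = 4 I
GtG = G.T @ G
s = [rng.standard_normal(N) for _ in range(M)]
t = G.T @ rng.standard_normal(N)         # an observed (stacked) sensor output

def stat_ours(si):                       # (3.55); note s_i^T G t = (G^T s_i)^T t
    g = G.T @ si
    return (g @ t) ** 2 / (g @ GtG @ g)

def stat_map(si):                        # (3.58)
    g = G.T @ si
    return (g @ np.linalg.solve(GtG, t)) ** 2 / (g @ np.linalg.solve(GtG, g))

ours = np.array([stat_ours(si) for si in s])
emap = np.array([stat_map(si) for si in s])
assert np.allclose(ours, emap)           # identical statistics when G^T G = cI
```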
3.6.2 Linear Model with Unknown Variance

To extend the above example, we consider the same linear model with unknown noise variance σ². As shown in (7.14), the constructed PDF is still

T \sim \mathcal{N}\left(\sigma^2 A_i G^T G G^T s_i, \sigma^2 G^T G\right) \quad \text{under } H_i \qquad (3.59)

except that σ² is unknown. Letting Bi = σ²Ai, we have

T \sim \mathcal{N}\left(B_i G^T G G^T s_i, \sigma^2 G^T G\right) \quad \text{under } H_i \qquad (3.60)

Instead of finding the MLEs of Ai and σ², we can equivalently find the MLEs of Bi and σ². Let hi = GᵀGGᵀsi and C = GᵀG. It can be shown that

\hat{B}_i = \left(h_i^T C^{-1} h_i\right)^{-1} h_i^T C^{-1} t \qquad (3.61)

and

\hat{\sigma}^2 = \frac{1}{q_1 + q_2}\left(t - h_i\hat{B}_i\right)^T C^{-1}\left(t - h_i\hat{B}_i\right) \qquad (3.62)

Dropping constant factors, it can also be shown that the test statistic is equivalent to

\frac{t^T C^{-1} h_i\left(h_i^T C^{-1} h_i\right)^{-1} h_i^T C^{-1} t}{t^T\left[C^{-1} - C^{-1} h_i\left(h_i^T C^{-1} h_i\right)^{-1} h_i^T C^{-1}\right] t} \qquad (3.63)

Next we consider the estimated MAP classifier. The true PDF is still

T \sim \mathcal{N}\left(A_i G^T s_i, \sigma^2 G^T G\right) \quad \text{under } H_i \qquad (3.64)
Table 3.2. Comparison of our test statistic and the estimated MAP classifier

Known σ²:
- Our Method: (siᵀGt)² / [(Gᵀsi)ᵀGᵀG(Gᵀsi)]
- Estimated MAP: (siᵀG(GᵀG)⁻¹t)² / [(Gᵀsi)ᵀ(GᵀG)⁻¹(Gᵀsi)]

Unknown σ²:
- Our Method: tᵀC⁻¹hi(hiᵀC⁻¹hi)⁻¹hiᵀC⁻¹t / tᵀ[C⁻¹ − C⁻¹hi(hiᵀC⁻¹hi)⁻¹hiᵀC⁻¹]t
- Estimated MAP: tᵀC⁻¹gi(giᵀC⁻¹gi)⁻¹giᵀC⁻¹t / tᵀ[C⁻¹ − C⁻¹gi(giᵀC⁻¹gi)⁻¹giᵀC⁻¹]t

where hi = GᵀGGᵀsi, gi = Gᵀsi, and C = GᵀG.
but with unknown Ai and σ². Let gi = Gᵀsi and C = GᵀG. Similar to (3.61), (3.62) and (3.63), it can be shown that

\hat{A}_i = \left(g_i^T C^{-1} g_i\right)^{-1} g_i^T C^{-1} t \qquad (3.65)

\hat{\sigma}^2 = \frac{1}{q_1 + q_2}\left(t - g_i\hat{A}_i\right)^T C^{-1}\left(t - g_i\hat{A}_i\right) \qquad (3.66)

and the test statistic of the estimated MAP classifier is

\frac{t^T C^{-1} g_i\left(g_i^T C^{-1} g_i\right)^{-1} g_i^T C^{-1} t}{t^T\left[C^{-1} - C^{-1} g_i\left(g_i^T C^{-1} g_i\right)^{-1} g_i^T C^{-1}\right] t} \qquad (3.67)
Note that if GTG is a scaled identity matrix, since hi = GTGgi, the test statistics
in (3.63) and (3.67) are equivalent. Hence our method is exactly the same as the
estimated MAP classifier if GTG is a scaled identity matrix.
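The unknown-variance equivalence can also be sketched numerically (again my own construction with GᵀG a scaled identity, not from the text): since hi = (GᵀG)gi = c·gi, and the common statistic below is invariant to scaling its direction vector, (3.63) and (3.67) produce identical values.

```python
import numpy as np

# Sketch under the assumption G^T G = c I (here c = 9): then h_i = c g_i and
# the shared form of (3.63)/(3.67) is scale-invariant in its direction vector.
rng = np.random.default_rng(4)
N = 24
Q, _ = np.linalg.qr(rng.standard_normal((N, 4)))
G = 3.0 * Q                              # G^T G = 9 I
C = G.T @ G
Cinv = np.linalg.inv(C)
t = G.T @ rng.standard_normal(N)

def stat(v):
    # common form of (3.63)/(3.67) with direction vector v (h_i or g_i)
    num = (t @ Cinv @ v) ** 2 / (v @ Cinv @ v)
    return num / (t @ Cinv @ t - num)

si = rng.standard_normal(N)
gi = G.T @ si
hi = C @ gi                              # h_i = G^T G g_i = 9 g_i here
assert np.isclose(stat(hi), stat(gi))
```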
3.6.3 Summary

We have considered a linear model with both known and unknown noise variance. Table 3.2 compares our test statistic with the estimated MAP classifier. If GᵀG is a scaled identity matrix, our method and the estimated MAP classifier are identical. Note that this is the case when all the columns of G are orthogonal and have the same power, as in the demodulation of M-ary orthogonal signals in communication theory.
3.7 Simulations

3.7.1 Distributed Detection
Since our test statistic coincides with the clairvoyant GLRT under Gaussian noise for q ≤ p, as shown in Subsection 3.5.1, we only compare the performances under non-Gaussian noise (both uncorrelated noise as in (6.14) and correlated noise as in (6.19)). Consider the model
x[n] = A_1 + A_2 r^n + A_3\cos(2\pi f n + \phi) + w[n] \qquad (3.68)

for n = 0, 1, . . . , N − 1 with known base r ∈ (0, 1) and frequency f but unknown amplitudes A1, A2, A3 and phase φ. This is a linear model as in (6.9) with

H = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 1 & r & \cos(2\pi f) & \sin(2\pi f) \\ \vdots & \vdots & \vdots & \vdots \\ 1 & r^{N-1} & \cos(2\pi f(N-1)) & \sin(2\pi f(N-1)) \end{bmatrix}

and α = [A₁  A₂  A₃cos φ  −A₃sin φ]ᵀ.
Let w have an uncorrelated Gaussian mixture distribution as in (6.14). For the partially observed linear model, we observe two sensor outputs as in (6.10). We compare the GLRT in (6.15) with the clairvoyant GLRT in (6.18). Note that the MLE of θ in (6.15) is found numerically, not by the asymptotic approximation in (6.16). In the simulation, we use N = 20, A1 = 2, A2 = 3, A3 = 4, φ = π/4, r = 0.95, f = 0.34, π = 0.9, σ₁² = 50, σ₂² = 500, and H1 and H2 are the first and third columns of H, respectively, i.e., H1 = [1 1 . . . 1]ᵀ, H2 = [1 cos(2πf) . . . cos(2πf(N − 1))]ᵀ. Hence, only the DC level is sensed by one sensor and the in-phase component of the sinusoid by the other. As shown in Figure 3.2, the performances are almost the same, which justifies their equivalence under the small signal assumption shown in Section 3.5.
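A comparison of this kind can be sketched with a small Monte Carlo experiment. The parameters below are my own (smaller) choices, not the dissertation's: two scalar sensors, uncorrelated Gaussian mixture noise, the EEF statistic (3.29) maximized via a damped fixed point on (3.28), and, since q ≤ p here, the clairvoyant GLRT reduced to tᵀ(GᵀG)⁻¹t.

```python
import numpy as np

# Monte Carlo sketch with assumed parameters: both detectors should achieve
# high AUC at this (assumed) strong signal level.
rng = np.random.default_rng(5)
N, trials, f = 20, 400, 0.34
H = np.column_stack([np.ones(N), np.cos(2 * np.pi * f * np.arange(N))])
alpha = np.array([2.0, 3.0])
G = H                                   # each sensor observes one column of H
GtG = G.T @ G
pi_, s1, s2 = 0.9, 1.0, 10.0            # assumed mixture parameters

def noise(shape):
    comp = rng.random(shape[0]) < pi_   # one mixture component per trial vector
    sd = np.where(comp, np.sqrt(s1), np.sqrt(s2))
    return sd[:, None] * rng.standard_normal(shape)

def eef_stat(t):
    u = np.linalg.solve(GtG, t)
    th = u / (pi_ * s1 + (1 - pi_) * s2)
    for _ in range(100):                # damped fixed point for (3.28)
        q1 = 0.5 * s1 * th @ GtG @ th
        q2 = 0.5 * s2 * th @ GtG @ th
        m = max(q1, q2)                 # stabilize the exponentials
        w1, w2 = pi_ * np.exp(q1 - m), (1 - pi_) * np.exp(q2 - m)
        th = 0.5 * th + 0.5 * u * (w1 + w2) / (w1 * s1 + w2 * s2)
    return th @ t - np.logaddexp(np.log(pi_) + 0.5 * s1 * th @ GtG @ th,
                                 np.log(1 - pi_) + 0.5 * s2 * th @ GtG @ th)

T0 = noise((trials, N)) @ G                       # H0: noise only
T1 = (H @ alpha + noise((trials, N))) @ G         # H1: signal present
clair = lambda t: t @ np.linalg.solve(GtG, t)     # clairvoyant, q <= p
aucs = {}
for name, stat in [("ours", eef_stat), ("clairvoyant", clair)]:
    s0 = np.array([stat(t) for t in T0])
    s1_ = np.array([stat(t) for t in T1])
    aucs[name] = (s1_[:, None] > s0[None, :]).mean()
assert aucs["ours"] > 0.8 and aucs["clairvoyant"] > 0.8
```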
Next, for the same model in (6.22), let w have a correlated Gaussian mixture distribution as in (6.19). We compare the performance of the GLRT using the constructed PDF as in (6.20) with the clairvoyant GLRT as in (6.21). We use N = 20, A1 = 3, A2 = 4, A3 = 3, φ = π/7, r = 0.9, f = 0.46, π = 0.7, H1 = [1, 1, . . . , 1]ᵀ, H2 = [1, cos(2πf), . . . , cos(2πf(N − 1))]ᵀ. The covariance matrices C1, C2 are
Figure 3.2. ROC curves (probability of detection versus probability of false alarm) for the GLRT using the constructed PDF and the clairvoyant GLRT with uncorrelated Gaussian mixture noise.
generated using C₁ = R₁ᵀR₁, C₂ = R₂ᵀR₂, where R₁, R₂ are full-rank N × N matrices. As shown in Figure 3.3, the performances are still very similar.
Figure 3.3. ROC curves for the GLRT using the constructed PDF and the clairvoyant GLRT with correlated Gaussian mixture noise.
3.7.2 Distributed Classification

For the model in (7.6),

H_i: x = A_i s_i + w

we first consider a case where GᵀG is approximately a scaled identity matrix. Let A1 = 0.4, A2 = 1.2, A3 = 0.9 and

s_1(n) = \cos(2\pi f_1 n)
s_2(n) = \cos(2\pi f_2 n)
s_3(n) = \cos(2\pi f_3 n)

where n = 0, 1, . . . , N − 1 with N = 25, and f1 = 0.14, f2 = 0.34, f3 = 0.41. Let
p(H1) = p(H2) = p(H3) = 1/3. Assume that there are two sensors, each with an
observation matrix as follows respectively:
H_1 = \begin{bmatrix} 1 & \cos(2\pi f_1) & \cdots & \cos(2\pi f_1(N-1)) \\ 1 & \cos(2\pi f_2) & \cdots & \cos(2\pi f_2(N-1)) \end{bmatrix}^T
H_2 = \begin{bmatrix} 1 & \cos(2\pi f_3) & \cdots & \cos(2\pi f_3(N-1)) \end{bmatrix}^T
We use (7.18) and (7.21) as the test statistics of the two methods, respectively, when σ² is known. The test statistics in (3.63) and (3.67) are used when σ² is unknown. The probabilities of correct classification are plotted versus ln(1/σ²) in Figure 3.4. We see that our method has the same performance as the estimated MAP classifier with known or unknown σ², and the probability of correct classification goes to 1 as σ² → 0.
Next we consider a case where GᵀG is not a scaled identity matrix. Let A1 = 0.5, A2 = 1, A3 = 1 and

s_1(n) = \cos(2\pi f_1 n) + 1
s_2(n) = \cos(2\pi f_2 n) + 0.5
s_3(n) = \cos(2\pi f_3 n)

where n = 0, 1, . . . , N − 1 with N = 20, and f1 = 0.17, f2 = 0.28, f3 = 0.45.
Let p(H1) = p(H2) = p(H3) = 1/3. Assume that there are three sensors (this is
an extension of the two sensor assumption), each with an observation matrix as
Figure 3.4. Probability of correct classification Pc versus ln(1/σ²) for both methods (estimated MAP and our method, with known and unknown σ²).
follows, respectively:

H_1 = \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}^T
H_2 = \begin{bmatrix} 1 & \cos(2\pi f_1) & \cdots & \cos(2\pi f_1(N-1)) \\ 1 & \cos(2\pi f_2) & \cdots & \cos(2\pi f_2(N-1)) \end{bmatrix}^T
H_3 = \begin{bmatrix} 1 & \cos(2\pi(f_3+0.02)) & \cdots & \cos(2\pi(f_3+0.02)(N-1)) \end{bmatrix}^T
Note that in H3 we set the frequency to f3 + 0.02. This models the case when the knowledge of the frequency is not accurate. We again see, in Figure 3.5, that the performances of the two methods are the same with known or unknown σ², and the probability of correct classification goes to 1 as σ² → 0.
3.8 Conclusions

A novel method of constructing the joint PDF of the measurements from a distributed multiple-sensor system has been proposed. Only a reference PDF is needed in the construction. The constructed PDF is asymptotically optimal in KL divergence. The performance of our method has been shown to be as good as that of the clairvoyant GLRT and the estimated MAP classifier for detection and classification, respectively, while less information is needed for our method.
Figure 3.5. Probability of correct classification Pc versus ln(1/σ²) for both methods (estimated MAP and our method, with known and unknown σ²).
List of References

[1] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.

[2] S. Thomopoulos, R. Viswanathan, and D. Bougoulias, "Optimal distributed decision fusion," IEEE Trans. Aerosp. Electron. Syst., vol. 25, pp. 761–765, Sep. 1989.

[3] Z. Chair and P. Varshney, "Optimal data fusion in multiple sensor detection systems," IEEE Trans. Aerosp. Electron. Syst., vol. 22, pp. 98–101, Jan. 1986.

[4] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, pp. 226–239, Mar. 1998.

[5] A. Sundaresan, P. Varshney, and N. Rao, "Distributed detection of a nuclear radioactive source using fusion of correlated decisions," in Proc. 10th International Conference on Information Fusion, 2007, pp. 1–7.

[6] S. Iyengar, P. Varshney, and T. Damarla, "A parametric copula based framework for multimodal signal processing," in Proc. ICASSP, 2009, pp. 1893–1896.

[7] S. Kay and Q. Ding, "Exponentially embedded families for multimodal sensor processing," in Proc. ICASSP, Mar. 2010, pp. 3770–3773.

[8] S. Kay, A. Nuttall, and P. Baggenstoss, "Multidimensional probability density function approximations for detection, classification, and model order selection," IEEE Trans. Signal Process., vol. 49, pp. 2240–2252, Oct. 2001.

[9] S. Kay, Q. Ding, and D. Emge, "Joint PDF construction for sensor fusion and distributed detection," in Proc. International Conference on Information Fusion, Jun. 2010.

[10] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[11] L. Brown, Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, 1986.

[12] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[13] D. Luenberger, Linear and Nonlinear Programming, 2nd ed. Springer, 2003.

[14] S. Kay, Q. Ding, and M. Rangaswamy, "Sensor integration for classification," in Proc. Asilomar Conference on Signals, Systems, and Computers, Nov. 2010.

[15] S. Kullback, Information Theory and Statistics, 2nd ed. Courier Dover Publications, 1997.

[16] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed. John Wiley and Sons, 2006.

[17] M. Westover, "Asymptotic geometry of multiple hypothesis testing," IEEE Trans. Inf. Theory, vol. 54, no. 7, pp. 3327–3329, Jul. 2008.

[18] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333–345, Jan. 2005.
MANUSCRIPT 4

Maximum Likelihood Estimator under Misspecified Model with High Signal-to-Noise Ratio

Abstract

It is well known that the maximum likelihood estimator (MLE) under a misspecified model converges to a well defined limit, and that it is asymptotically Gaussian as the sample size goes to infinity. In this correspondence, we fully characterize the asymptotic performance of the MLE under a misspecified model at high signal-to-noise ratio (SNR). We show that, under some regularity conditions, it converges to a well defined limit and is asymptotically Gaussian at high SNR.
4.1 Introduction
In estimating unknown parameters, the most popular method is the maximum likelihood estimator (MLE). One important reason is that the MLE is asymptotically optimal in that it approximates the minimum variance unbiased (MVU)
estimator for large data records [1]. This is the case when the number of samples
goes to infinity. Another asymptotic case is when the signal-to-noise ratio (SNR)
goes to infinity, i.e., the number of samples is fixed with decreasing noise variance.
The asymptotic efficiency and Gaussianity of the MLE with high SNR have re-
cently been proved in [2]. Hence, under some regularity conditions, the MLE at
high SNR has similar performance to the large sample size case.
The above results are all based on the assumption that the model is cor-
rectly specified. However, we may have a misspecified model in practice, i.e., the
samples are generated from a distribution which cannot be parameterized by the
assumed model. In this case, the MLE under a misspecified model is called the
quasi-maximum likelihood estimator (QMLE). Thus, it is natural to consider the
properties of the QMLE. Thanks to White’s fundamental result in [3], the asymp-
totic performance of the QMLE as the sample size goes to infinity is well known
in both the statistics and signal processing communities. It is proved in [3] that
the QMLE converges to a limit which minimizes the Kullback-Leibler (KL) diver-
gence between the true probability density function (PDF) and the misspecified
PDF, and it is asymptotically Gaussian for large data records. Note that the KL
divergence is a non-symmetric measure of difference between two PDFs. For two
PDFs p1 and p0, it is defined as

D(p_1\|p_0) = \int p_1(x)\ln\frac{p_1(x)}{p_0(x)}\,dx
It is well known that D (p1 ‖p0 ) ≥ 0 with equality if and only if p1 = p0 almost
everywhere [4]. White’s results have been applied to the problem of estimating
direction of arrival (DOA) with unknown number of signals in [5] and [6] for a
deterministic signal model and stochastic signal model, respectively. Analogous to
the results in [2], it is expected that with high SNR, the QMLE will have similar
performance to White’s results. In this correspondence, we prove that this is true
for a deterministic signal in additive Gaussian noise. A simple misspecified linear
model is presented to illustrate our results. Simulation results are provided to
verify our analysis.
The paper is organized as follows. We start by presenting White’s results [3]
in Section 4.2 so that we can compare our results with his later on. In Section
4.3, we show that the QMLE is asymptotically Gaussian and it converges to a
well defined limit with high SNR. In Section 4.4, we use a misspecified linear
model to illustrate our analysis. Section 4.5 provides some simulation results of the
asymptotic performance of the QMLE. Finally, Section 4.6 offers some conclusions.
4.2 White’s Results: QMLE for Large Data Records
Suppose that we have N independent and identically distributed (IID) sample vectors xn for n = 0, 1, . . . , N − 1. The xn's are generated from a PDF pt(x), which we call the true PDF. For the misspecified model, we assume that the xn's are generated from a PDF p(x; θ) parameterized by some unknown parameters θ. So the QMLE of θ is

\hat{\theta} = \arg\max_{\theta}\sum_{n=0}^{N-1}\ln p(x_n;\theta) \qquad (4.1)

where the xn's are generated from the true PDF pt(x).
Assume that the KL divergence between the true PDF and the misspecified PDF,

D\left(p_t(x)\,\|\,p(x;\theta)\right) = \int_{x} p_t(x)\ln\frac{p_t(x)}{p(x;\theta)}\,dx \qquad (4.2)

has a unique minimum at θ*. Under some regularity conditions, it is proved in [3] that

a) Consistency: θ̂ exists and θ̂ \xrightarrow{a.s.} θ* as N → ∞, where \xrightarrow{a.s.} stands for almost sure convergence.
b) Asymptotic Gaussianity: Define the matrices

[A(\theta)]_{i,j} = E_t\left(\frac{\partial^2\ln p(x;\theta)}{\partial\theta_i\,\partial\theta_j}\right)
[B(\theta)]_{i,j} = E_t\left(\frac{\partial\ln p(x;\theta)}{\partial\theta_i}\,\frac{\partial\ln p(x;\theta)}{\partial\theta_j}\right) \qquad (4.3)
C(\theta) = A(\theta)^{-1}B(\theta)A(\theta)^{-1}

where Et(·) denotes expectation with respect to the true PDF pt(x). Then we have

\sqrt{N}\left(\hat{\theta} - \theta^*\right)\xrightarrow{D}\mathcal{N}(0, C(\theta^*)) \qquad (4.4)

as N → ∞, where \xrightarrow{D} stands for convergence in distribution.

c) If the model is correctly specified, i.e., there exists θ0 such that p(x; θ0) = pt(x), then θ* = θ0 and A(θ0) = −B(θ0), so that

C(\theta_0) = -A(\theta_0)^{-1} = B(\theta_0)^{-1} \qquad (4.5)

where −A(θ0) is the Fisher information matrix.
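White's sandwich covariance can be illustrated with a toy Monte Carlo (my own example, not from the text): true data are Laplace(0, b) but the assumed model is N(θ, 1). Then the QMLE is the sample mean, θ* = 0, A = −1, and B = Var_t(x) = 2b², so C = A⁻¹BA⁻¹ = 2b², which differs from the correct-model value −A⁻¹ = 1.

```python
import numpy as np

# Toy sandwich-covariance check with assumed b, N, reps: the empirical
# variance of sqrt(N) * theta_hat should be near C = 2 b^2 (= 4.5 here),
# not the naive -A^{-1} = 1.
rng = np.random.default_rng(6)
b, N, reps = 1.5, 400, 2000
xbar = rng.laplace(0.0, b, size=(reps, N)).mean(axis=1)  # QMLE per replicate
var_scaled = N * xbar.var()     # empirical variance of sqrt(N) * theta_hat
assert abs(var_scaled - 2 * b * b) < 0.5
```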
We can interpret a) as follows. Since

\frac{1}{N}\sum_{n=0}^{N-1}\ln p(x_n;\theta)\xrightarrow{P} E_t\left(\ln p(x;\theta)\right) \qquad (4.6)

as N → ∞, where \xrightarrow{P} stands for convergence in probability, (1/N)Σ_{n=0}^{N−1} ln p(xn; θ) is a natural estimator of Et(ln p(x; θ)). Note that

D\left(p_t(x)\,\|\,p(x;\theta)\right) = E_t\left(\ln p_t(x)\right) - E_t\left(\ln p(x;\theta)\right) \qquad (4.7)

Hence, the θ* that minimizes D(pt(x)‖p(x; θ)) also maximizes Et(ln p(x; θ)). As θ̂ maximizes (1/N)Σ_{n=0}^{N−1} ln p(xn; θ), we can consider θ̂ a natural estimator of θ* [7].
4.3 QMLE with High SNR

4.3.1 Misspecified Observation Model
Consider the case when the true observation model is

x = s_t + w_1 \qquad (4.8)

where x is a real N × 1 vector of samples, st is the N × 1 true signal vector, and w1 is an N × 1 vector of additive Gaussian noise samples with zero mean and covariance matrix σ²Ct. However, the misspecified model is

x = s(\theta) + w_2 \qquad (4.9)

where the N × 1 signal s(θ) is known except for the unknown p × 1 vector of parameters θ, and w2 is an N × 1 vector of additive Gaussian noise samples with zero mean and covariance matrix σ²C. It is assumed that σ² is unknown and C is known. As a result, the QMLE of θ is found as

\hat{\theta} = \arg\min_{\theta}\left\{(x - s(\theta))^T C^{-1}(x - s(\theta))\right\} \qquad (4.10)

Hence, we will study the performance of the QMLE of θ in (4.10), when x is distributed as in (4.8), as σ² → 0. Note that this real signal model can be easily extended to a complex signal model (see Chapter 15 in [8]).
4.3.2 Performance of QMLE as σ² → 0

The analysis in this subsection is similar to that in [2], where the consistency and asymptotic Gaussianity of the MLE under the correctly specified model with high SNR are proved.

First, we find the θ* that minimizes the KL divergence between the true PDF and the misspecified PDF. We denote the PDFs specified by (4.8) and (4.9) as pt(x) and p(x; θ), respectively. For Gaussian distributions, the KL divergence between the true PDF pt(x) and the misspecified PDF p(x; θ) is [4]

D\left(p_t(x)\,\|\,p(x;\theta)\right) = \frac{1}{2}\ln\frac{\det(C)}{\det(C_t)} + \frac{1}{2}\mathrm{tr}\left(C_t C^{-1}\right) - \frac{N}{2} + \frac{1}{2\sigma^2}(s_t - s(\theta))^T C^{-1}(s_t - s(\theta)) \qquad (4.11)

We assume that D(pt(x)‖p(x; θ)) has a unique minimum at θ*. Since only the last term in (4.11) depends on θ, θ* also minimizes (st − s(θ))ᵀC⁻¹(st − s(θ)). Hence, we write

\theta^* = \arg\min_{\theta}\,(s_t - s(\theta))^T C^{-1}(s_t - s(\theta)) \qquad (4.12)

By setting the gradient with respect to θ to zero, we have

\left.\left(\frac{\partial s(\theta)}{\partial\theta}\right)^T C^{-1}(s_t - s(\theta))\right|_{\theta=\theta^*} = 0 \qquad (4.13)

where

\left[\frac{\partial s(\theta)}{\partial\theta}\right]_{i,j} = \frac{\partial s_i(\theta)}{\partial\theta_j} \quad \text{for } 1 \le i \le N,\ 1 \le j \le p \qquad (4.14)
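For the special case of a linear assumed signal s(θ) = Hθ (my own choice for illustration), (4.12) becomes weighted least squares, so θ* = (HᵀC⁻¹H)⁻¹HᵀC⁻¹st, and (4.13) reduces to the normal-equation orthogonality condition, which can be verified numerically:

```python
import numpy as np

# Sketch with assumed H, C, s_t: the weighted-least-squares theta* satisfies
# the stationarity condition (4.13) with s(theta) = H theta.
rng = np.random.default_rng(7)
N, p = 15, 3
H = rng.standard_normal((N, p))
R = rng.standard_normal((N, N))
C = R @ R.T + N * np.eye(N)              # a positive definite known covariance
Cinv = np.linalg.inv(C)
s_t = rng.standard_normal(N)             # true signal, generally not in range(H)

theta_star = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ s_t)
grad = H.T @ Cinv @ (s_t - H @ theta_star)   # left-hand side of (4.13)
assert np.allclose(grad, 0, atol=1e-8)
```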
Next, we examine the asymptotic performance of the QMLE of θ using the im-
plicit function theorem. Since the QMLE of θ depends on x which is distributed
according to (4.8), we write (4.10) as
\hat{\theta} = \arg\min_{\theta} (s_t - s(\theta) + w_1)^T C^{-1} (s_t - s(\theta) + w_1)    (4.15)
so that θ̂ is an implicit function of w1. The solution of (4.15) is found by setting the gradient of (4.15) with respect to θ to zero. Hence, we need to solve the following p equations:

\left(\frac{\partial s(\theta)}{\partial \theta}\right)^T C^{-1} (s_t - s(\theta) + w_1) = \mathbf{0}    (4.16)

where ∂s(θ)/∂θ is given as in (4.14).
Let f(θ, w1) = [f1(θ, w1) f2(θ, w1) . . . fp(θ, w1)]^T = (∂s(θ)/∂θ)^T C^{-1} (s_t − s(θ) + w1). Note that from (4.13), we have

f(\theta^*, \mathbf{0}) = \mathbf{0}    (4.17)
We further assume that:
Assumption 1): fi(θ, w1) is differentiable in a neighborhood of the point (θ∗, 0) in R^p × R^N for i = 1, 2, . . . , p.

Assumption 2): The p × p Jacobian matrix ∂f(θ, w1)/∂θ of f(θ, w1) with respect to θ is nonsingular at (θ∗, 0).
Then by the implicit function theorem [9], there is a unique mapping ϕ : V → U, where V is a neighborhood of 0 in R^N and U is a neighborhood of θ∗ in R^p, such that

ϕ(0) = θ∗
f(ϕ(w1), w1) = 0   for all w1 ∈ V    (4.18)
Furthermore, we have
ϕ(w1) − θ∗ = −Φ−1Ψ(w1 − 0) + r(w1 − 0) (4.19)
where r(w1) = o(||w1||),

\Phi = \frac{\partial f(\theta, w_1)}{\partial \theta}\Big|_{(\theta^*, \mathbf{0})}    (4.20)

and

\Psi = \frac{\partial f(\theta, w_1)}{\partial w_1}\Big|_{(\theta^*, \mathbf{0})}    (4.21)
Note that (4.18) implies that θ̂ = ϕ(w1) for w1 ∈ V. Hence, from (4.19) we have

\hat{\theta} - \theta^* = -\Phi^{-1}\Psi w_1 + r(w_1)    (4.22)

Note that the deterministic little-o notation r(w1) = o(||w1||) implies the stochastic little-o notation r(w1) = o_P(||w1||), i.e., r(w1)/||w1|| → 0 in probability as ||w1|| → 0 in probability [10]. Since w1 ∼ N(0, σ²Ct), we have w1 → 0 and hence r(w1) → 0 in probability as σ² → 0. As a result, we have proved that

\hat{\theta} \xrightarrow{P} \theta^*    (4.23)

as σ² → 0.
Next, we will prove the asymptotic Gaussianity of θ̂. Dividing (4.22) by σ, we have

\frac{\hat{\theta} - \theta^*}{\sigma} = -\Phi^{-1}\Psi \frac{w_1}{\sigma} + \frac{r(w_1)}{\sigma}    (4.24)

We write r(w1)/σ as

\frac{r(w_1)}{\sigma} = \frac{r(w_1)}{\|w_1\|} \cdot \frac{\|w_1\|}{\sigma}    (4.25)

Since r(w1)/||w1|| → 0 in probability as ||w1|| → 0 in probability (i.e., as σ² → 0), and ||w1||/σ follows a distribution that does not depend on σ, we have (see Theorems 2.3.3 and 2.3.5 on pages 70-71 in [11] and Theorem (4)(a) on page 310 in [12])

\frac{r(w_1)}{\sigma} \xrightarrow{P} 0    (4.26)
Note that −Φ^{-1}Ψ(w1/σ) ∼ N(0, Φ^{-1}ΨCtΨ^TΦ^{-1}) since w1/σ ∼ N(0, Ct). Hence, we have

\frac{\hat{\theta} - \theta^*}{\sigma} \xrightarrow{D} \mathcal{N}\left(\mathbf{0},\ \Phi^{-1}\Psi C_t \Psi^T \Phi^{-1}\right)    (4.27)
From (4.20) and (4.21), it can be shown that

\Phi = 2\sigma^2 A(\theta^*)    (4.28)

and

\Psi C_t \Psi^T = 4\sigma^2 B(\theta^*)    (4.29)

where A(θ∗) and B(θ∗) are defined as in (4.3). Hence, Φ^{-1}ΨCtΨ^TΦ^{-1} = (1/σ²) A(θ∗)^{-1} B(θ∗) A(θ∗)^{-1}. As a result, (4.23) and (4.27) correspond to a) and b) of White's results in Section 4.2. Note that the results in [2] correspond to c) of White's results, in which case the model is correctly specified.
4.4 A Misspecified Linear Model Example
Consider the misspecified linear model where the samples are generated from
the true observation model:
x = st + w1    (4.30)

where x is a real N × 1 vector of samples, st is the N × 1 true signal vector, and w1 is an N × 1 vector of additive Gaussian noise samples with zero mean and covariance matrix σ²Ct. The misspecified model is

x = Hθ + w2    (4.31)

where H is the known N × p observation matrix, θ is the p × 1 vector of unknown parameters, and w2 is an N × 1 vector of additive Gaussian noise samples with zero mean and covariance matrix σ²C. It is assumed that σ² is unknown and C is known.
From (4.12), we have

\theta^* = \arg\min_{\theta} (s_t - H\theta)^T C^{-1} (s_t - H\theta)    (4.32)

It can be shown that

\theta^* = (H^T C^{-1} H)^{-1} H^T C^{-1} s_t    (4.33)
It is well known that the QMLE of θ is [8]

\hat{\theta} = (H^T C^{-1} H)^{-1} H^T C^{-1} x    (4.34)

Since x is distributed according to (4.30), we have

x \sim \mathcal{N}(s_t, \sigma^2 C_t)    (4.35)

As a result, we have

\hat{\theta} \sim \mathcal{N}\left((H^T C^{-1} H)^{-1} H^T C^{-1} s_t,\ \sigma^2 (H^T C^{-1} H)^{-1} H^T C^{-1} C_t C^{-1} H (H^T C^{-1} H)^{-1}\right)    (4.36)

From (4.33) and (4.36), we see that

\hat{\theta} \xrightarrow{P} \theta^* \ \text{as } \sigma^2 \to 0    (4.37)
and

\frac{\hat{\theta} - \theta^*}{\sigma} \sim \mathcal{N}\left(\mathbf{0},\ (H^T C^{-1} H)^{-1} H^T C^{-1} C_t C^{-1} H (H^T C^{-1} H)^{-1}\right)    (4.38)

Note that in (4.38), (θ̂ − θ∗)/σ has a Gaussian distribution not just as σ² → 0 but for all σ². For this misspecified linear model, from (4.20) and (4.21), it can be shown that

\Phi = H^T C^{-1} H    (4.39)

and

\Psi = H^T C^{-1}    (4.40)

Hence, we can write (4.38) as

\frac{\hat{\theta} - \theta^*}{\sigma} \sim \mathcal{N}\left(\mathbf{0},\ \Phi^{-1}\Psi C_t \Psi^T \Phi^{-1}\right)    (4.41)

As a result, (4.37) and (4.41) match our results in (4.23) and (4.27), respectively.
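The closed forms (4.33)–(4.36) make this model convenient for a numerical check. In the sketch below (our illustration; H, st, C, Ct, and σ² are arbitrary choices), θ̂ from (4.34) stays close to θ∗ from (4.33) when σ² is small.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 30, 2
n = np.arange(N)
H = np.column_stack([np.ones(N), n / N])     # known N x p observation matrix
s_t = np.sin(0.2 * n)                        # true signal, not exactly in range(H)
C = np.eye(N)                                # assumed noise covariance
Ct = np.diag(1.0 + 0.5 * np.sin(n))          # true noise covariance

Cinv = np.linalg.inv(C)
W = np.linalg.inv(H.T @ Cinv @ H) @ H.T @ Cinv
theta_star = W @ s_t                         # (4.33)

sigma = 1e-3
x = s_t + sigma * rng.multivariate_normal(np.zeros(N), Ct)
theta_hat = W @ x                            # (4.34)
print(np.abs(theta_hat - theta_star))        # small for small sigma^2

cov = sigma ** 2 * W @ Ct @ W.T              # covariance of theta_hat in (4.36)
```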
4.5 Simulation Results
Consider the problem where the true model is
x[n] = A1 cos(2πf1n + φ1) + A2 cos(2πf2n + φ2) + w1[n] (4.42)
for n = 0, 1, . . . , N − 1, where w1 = [w1[0] w1[1] . . . w1[N − 1]]^T ∼ N(0, σ²Ct).
The misspecified model is
x[n] = A cos(2πfn + φ) + w2[n] (4.43)
where A > 0, 0 < f < 1/2, 0 ≤ φ < 2π are unknown, and the w2[n]'s are IID with w2[n] ∼ N(0, σ²) for n = 0, 1, . . . , N − 1. The QMLEs of A, f, and φ are found as follows (see Example 7.16 in [8]):
\hat{f} = \arg\max_f I(f) = \arg\max_f \frac{1}{N}\left|\sum_{n=0}^{N-1} x[n] \exp(-j2\pi f n)\right|^2

\hat{A} = \frac{2}{N}\left|\sum_{n=0}^{N-1} x[n] \exp(-j2\pi \hat{f} n)\right|

\hat{\phi} = \arctan\frac{-\sum_{n=0}^{N-1} x[n]\sin(2\pi \hat{f} n)}{\sum_{n=0}^{N-1} x[n]\cos(2\pi \hat{f} n)}
Here we use the Newton-Raphson method to find f̂, and the initial point is found by a global search for the maximum of the periodogram I(f) = (1/N)|Σ_{n=0}^{N−1} x[n] exp(−j2πfn)|² over a fine grid of f to ensure convergence (see Section 7.7 in [8]). Similarly, the A∗, f∗, φ∗ which minimize the KL divergence between the true PDF and the misspecified PDF can be found as
f^* = \arg\max_f I_t(f) = \arg\max_f \frac{1}{N}\left|\sum_{n=0}^{N-1} s_t[n] \exp(-j2\pi f n)\right|^2

A^* = \frac{2}{N}\left|\sum_{n=0}^{N-1} s_t[n] \exp(-j2\pi f^* n)\right|

\phi^* = \arctan\frac{-\sum_{n=0}^{N-1} s_t[n]\sin(2\pi f^* n)}{\sum_{n=0}^{N-1} s_t[n]\cos(2\pi f^* n)}
where st[n] = A1 cos(2πf1n + φ1) + A2 cos(2πf2n + φ2).
In the simulation, we choose A1 = 0.8, f1 = 0.11, φ1 = 0.3, A2 = 1.2, f2 = 0.33, φ2 = 0.47, N = 20, and Ct as a 20 × 20 diagonal matrix with the first 10 diagonal elements equal to 2 and the last 10 diagonal elements equal to 1. Note that in this case f1 and f2 are far apart. Thus the maximum of the periodogram It(f) = (1/N)|Σ_{n=0}^{N−1} st[n] exp(−j2πfn)|² will be near f2, since the "leakage" from the first sinusoidal component is comparatively small at f2. We see in Figure 4.1 that the maximum of It(f) is at about f = 0.33 = f2. Hence, we have f∗ ≈ f2 in this case. We generate 1000 realizations of {Â, f̂, φ̂} and plot the sample means of (Â − A∗)², (f̂ − f∗)², (φ̂ − φ∗)² versus ln(1/σ²) in Figure 4.2. As expected, they all converge to zero as σ² → 0.
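The location of f∗ in Figure 4.1 can be reproduced directly from the true signal; the following is an illustrative sketch of that computation.

```python
import numpy as np

A1, f1, phi1 = 0.8, 0.11, 0.3
A2, f2, phi2 = 1.2, 0.33, 0.47
N = 20
n = np.arange(N)
st = A1 * np.cos(2 * np.pi * f1 * n + phi1) + A2 * np.cos(2 * np.pi * f2 * n + phi2)

# Evaluate It(f) on a fine grid and locate its maximum
fgrid = np.arange(0.001, 0.5, 0.001)
It = np.abs(np.exp(-2j * np.pi * np.outer(fgrid, n)) @ st) ** 2 / N
f_star = fgrid[np.argmax(It)]
print(f_star)     # near f2 = 0.33, as in Figure 4.1
```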
Figure 4.1. The periodogram It(f) = (1/N)|Σ_{n=0}^{N−1} st[n] exp(−j2πfn)|². In this case, f∗ ≈ f2 = 0.33.
Now suppose f1 and f2 are close; for example, we change f2 to 0.13 in the above example. We see in Figure 4.3 that the peaks of the two sinusoidal components merge into one, and hence the maximum of It(f) is between f1 = 0.11 and f2 = 0.13.
Therefore, f∗ does not match either f1 or f2. For this case, we also plot the sample
Figure 4.2. Convergence of Â, f̂, φ̂ as σ² → 0.
means of (Â − A∗)², (f̂ − f∗)², (φ̂ − φ∗)² over 1000 realizations versus ln(1/σ²). We see in Figure 4.4 that Â − A∗, f̂ − f∗, φ̂ − φ∗ all converge to zero as σ² → 0.
Next, we want to show that the QMLE is asymptotically Gaussian. We still choose A1 = 0.8, f1 = 0.11, φ1 = 0.3, A2 = 1.2, f2 = 0.33, φ2 = 0.47, N = 20, and Ct as a 20 × 20 diagonal matrix with the first 10 diagonal elements equal to 2 and the last 10 diagonal elements equal to 1. For each σ², we generate 1600 realizations of {Â, f̂, φ̂}. We use the Lilliefors test to test the null hypothesis that the samples come from a Gaussian distribution with unspecified mean and variance against the alternative hypothesis that they do not [13]. The test first estimates the mean and variance from the samples,
and then, as in the Kolmogorov-Smirnov test, computes the test statistic tstat, the maximum discrepancy between the empirical cumulative distribution function and the Gaussian cumulative distribution function specified by the estimated mean and variance. The test statistic is compared to the critical value, which for 1600 realizations is τ = 0.886/√1600 = 0.02215 at significance level α = 0.05. If tstat < τ, the Lilliefors test accepts the null hypothesis that the samples are generated from a Gaussian distribution with unspecified mean and variance; otherwise, it rejects the null hypothesis. We plot the test statistic tstat of the Lilliefors test versus ln(1/σ²) and compare it with the critical value τ = 0.02215. As shown in Figure 4.5, the test statistics are below the critical value for ln(1/σ²) ≥ 4. As a result, the Lilliefors test decides that the estimates are Gaussian as σ² → 0. Note that the test statistic for f̂ has large values when 0 ≤ ln(1/σ²) ≤ 3. This is because when ln(1/σ²) = −5, the noise is so large that the samples resemble noise only, so that f̂ is approximately uniformly distributed between 0 and 0.5. When ln(1/σ²) = 1, the noise is reduced to a level at which f̂ is centered at f∗ but with some outliers near the frequency of the first sinusoidal component f = f1 = 0.11, which makes the Lilliefors test statistic larger. Histograms of f̂ are plotted for ln(1/σ²) = −5 and ln(1/σ²) = 1 in Figure 4.6.
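The Lilliefors statistic itself is simple to compute: fit the mean and variance, then take the Kolmogorov-Smirnov distance to the fitted Gaussian CDF. The sketch below is our illustration, not the dissertation's code.

```python
import numpy as np
from math import erf

def lilliefors_stat(x):
    """Maximum discrepancy between the empirical CDF of x and the Gaussian CDF
    with mean and variance estimated from x (the Lilliefors test statistic)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=1)
    F = np.array([0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in z])  # fitted CDF
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    return max(np.max(ecdf_hi - F), np.max(F - ecdf_lo))

rng = np.random.default_rng(2)
tau = 0.886 / np.sqrt(1600)                   # critical value at alpha = 0.05
print(lilliefors_stat(rng.standard_normal(1600)), tau)    # typically below tau
print(lilliefors_stat(rng.uniform(0.0, 0.5, 1600)), tau)  # well above tau
```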
Figure 4.4. Convergence of Â, f̂, φ̂ as σ² → 0.
4.6 Conclusions
We have derived the asymptotic performance of the QMLE at high SNR. It has been shown that for a deterministic signal in additive Gaussian noise, the QMLE converges to a well-defined limit, and it is asymptotically Gaussian as σ² → 0. The results are analogous to White's results on the QMLE for a large number of samples. Simulation results have been provided to verify our analysis.
List of References
[1] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.
[2] A. Renaux, P. Forster, E. Chaumette, and P. Larzabal, "On the high-SNR conditional maximum-likelihood estimator full statistical characterization," IEEE Trans. Signal Process., vol. 54, pp. 4840–4843, Dec. 2006.
[3] H. White, "Maximum likelihood estimation of misspecified models," Econometrica, vol. 50, no. 1, pp. 1–25, Jan. 1982.
[4] S. Kullback, Information Theory and Statistics, 2nd ed. Courier Dover Publications, 1997.
Figure 4.5. Test statistics of the Lilliefors test for Â, f̂, φ̂ as σ² → 0, with 1600 realizations of {Â, f̂, φ̂} for each σ².
[5] P.-J. Chung, "ML estimation under misspecified number of signals," in the 39th Asilomar Conference on Signals, Systems, and Computers, Nov. 2005.
[6] P.-J. Chung, "Stochastic maximum likelihood estimation under misspecified numbers of signals," IEEE Trans. Signal Process., vol. 55, pp. 4726–4731, Sep. 2007.
[7] H. Akaike, "Information theory and an extension of the likelihood principle," in Proceedings of the Second International Symposium on Information Theory, 1973.
[8] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[9] W. Rudin, Principles of Mathematical Analysis, 3rd ed. McGraw-Hill, 1976.
[10] A. W. van der Vaart, Asymptotic Statistics. Cambridge University Press, 2000.
[11] E. Lehmann, Elements of Large-Sample Theory. Springer, 1998.
[12] G. Grimmett and D. Stirzaker, Probability and Random Processes, 3rd ed. Oxford University Press, 2001.
[13] H. Lilliefors, "On the Kolmogorov-Smirnov test for normality with mean and variance unknown," Journal of the American Statistical Association, vol. 62, pp. 399–402, 1967.
Figure 4.6. Histograms of f̂ for (a) ln(1/σ²) = −5 and (b) ln(1/σ²) = 1.
MANUSCRIPT 5
Exponentially Embedded Families for Multimodal Sensor Processing
Abstract
The exponential embedding of two or more probability density functions (PDFs) is proposed for multimodal sensor processing. It approximates the unknown PDF by exponentially embedding the known PDFs. Such an embedding forms an exponential family indexed by some parameters, and hence inherits many nice properties of the exponential family. It is shown that the approximating PDF is asymptotically the one that is closest to the unknown PDF in Kullback-Leibler (KL) divergence. Applied to hypothesis testing, this approach shows improved performance compared to existing methods for cases of practical importance where the sensor outputs are not independent.
5.1 Introduction
Distributed detection systems have many applications such as radar and sonar,
medical diagnosis, weather prediction, and financial analysis. To obtain optimal
performance, we require the joint PDF of the sensor outputs, which is not al-
ways available. One common approach [1], [2] is to assume that the PDFs of the
sensor outputs are independent, and hence the joint PDF is the product of the
marginal PDFs. However, this assumption may not be satisfied, since the sensor measurements could be correlated due to the common source and the relative sensor locations. This correlation is addressed in [3], [4], where a copula-based framework is proposed to estimate the joint PDF from the marginal PDFs. In this work, we approximate the joint PDF by exponentially embedded families (EEFs), in the sense that the approximation asymptotically minimizes the KL divergence between the true PDF and the estimated one. For two PDFs p1 and p0, the KL divergence is defined as
D(p_1 \| p_0) = \int p_1(x) \ln\frac{p_1(x)}{p_0(x)}\, dx

It is always nonnegative and equals zero if and only if p1 = p0 almost everywhere. By Stein's lemma [5], the KL divergence is a measure of the asymptotic performance of binary hypothesis testing.
The term "exponentially embedded family" follows that in [6], where it is used for model order estimation. The embedded PDF belongs to an exponential family indexed by one or more parameters, and so has many nice properties of that family. From a differential-geometric point of view, the EEF forms a manifold in log-PDF space. In the one-dimensional case, the EEF is the PDF that minimizes D(p ‖ p0) subject to the constraint that D(p ‖ p0) − D(p ‖ p1) = θ [5], [7]. Here we focus on the problem of binary hypothesis testing, and we assume the presence of two sensors. Similar results hold for multiple hypothesis testing and multiple sensors.
The paper is organized as follows. Section 5.2 defines the EEF and discusses its properties. Section 5.3 applies it to hypothesis testing. An example is given in Section 5.4. In Section 5.5, we show simulation results by comparing the ROC curves of different approaches. Conclusions are drawn in Section 5.6.
5.2 EEF and Its Properties
Assume that a source produces the underlying samples x, which are unobservable, and that we have two sensors whose outputs are the statistics t1(x) and t2(x) of x.
Consider the binary hypothesis testing problem where we know the reference PDF
pX(x;H0), but not pX(x;H1). So we can find the joint PDF pT1,T2(t1, t2;H0), but
do not know pT1,T2(t1, t2;H1). We assume that the marginal PDFs pT1(t1;H1)
and pT2(t2;H1) are known. So the problem is to test between H0 and H1 where
we know the joint PDF under H0 and the marginal PDFs under H1. The EEF is
defined as

p_X(x; \eta) = \frac{\left(\frac{p_{T_1}(t_1(x); H_1)}{p_{T_1}(t_1(x); H_0)}\right)^{\eta_1} \left(\frac{p_{T_2}(t_2(x); H_1)}{p_{T_2}(t_2(x); H_0)}\right)^{\eta_2} p_X(x; H_0)}{\int \left(\frac{p_{T_1}(t_1(x); H_1)}{p_{T_1}(t_1(x); H_0)}\right)^{\eta_1} \left(\frac{p_{T_2}(t_2(x); H_1)}{p_{T_2}(t_2(x); H_0)}\right)^{\eta_2} p_X(x; H_0)\, dx}    (5.1)

where η = [η1, η2]^T are the embedding parameters with the constraints

\eta \in S = \{\eta : \eta_1, \eta_2 \ge 0,\ \eta_1 + \eta_2 \le 1\}    (5.2)
Notice that pX(x; η) does not require knowledge of pX(x;H1). So in practice, we need to estimate only pX(x;H0) and the PDFs of T1 and T2 under H1 from training data (see also [8]). The reason for the constraints in (5.2) will be explained later. The next theorem is an extension of Kullback's results [5], [7].
Theorem 6. The PDF of x as in (5.1) is the one that minimizes D(pX(x) ‖ pX(x;H0)) subject to the constraints

D(p_{T_i}(t_i) \| p_{T_i}(t_i; H_0)) - D(p_{T_i}(t_i) \| p_{T_i}(t_i; H_1)) = \theta_i

for i = 1, 2, where pT1(t1) and pT2(t2) are the PDFs of T1 and T2 corresponding to pX(x).
Proof. Since

D(p_{T_i}(t_i) \| p_{T_i}(t_i; H_0)) - D(p_{T_i}(t_i) \| p_{T_i}(t_i; H_1)) = \int p_X(x) \ln\frac{p_{T_i}(t_i(x); H_1)}{p_{T_i}(t_i(x); H_0)}\, dx \quad \text{for } i = 1, 2

using Lagrange multipliers for the minimization gives

J(p_X(x)) = \int p_X(x) \ln\frac{p_X(x)}{p_X(x; H_0)}\, dx + \lambda_1 \int p_X(x) \ln\frac{p_{T_1}(t_1(x); H_1)}{p_{T_1}(t_1(x); H_0)}\, dx + \lambda_2 \int p_X(x) \ln\frac{p_{T_2}(t_2(x); H_1)}{p_{T_2}(t_2(x); H_0)}\, dx + \lambda_3 \int p_X(x)\, dx
Differentiating with respect to pX(x) and setting the result to zero, we have

\ln\frac{p_X(x)}{p_X(x; H_0)} + 1 + \lambda_1 \ln\frac{p_{T_1}(t_1(x); H_1)}{p_{T_1}(t_1(x); H_0)} + \lambda_2 \ln\frac{p_{T_2}(t_2(x); H_1)}{p_{T_2}(t_2(x); H_0)} + \lambda_3 = 0

Solving this equation and letting η1 = −λ1 and η2 = −λ2, the pX(x) that minimizes D(pX(x) ‖ pX(x;H0)) has the form in (5.1), where η1 and η2 are chosen to meet the constraints.
By letting

K(\eta) = \ln \int \left(\frac{p_{T_1}(t_1(x); H_1)}{p_{T_1}(t_1(x); H_0)}\right)^{\eta_1} \left(\frac{p_{T_2}(t_2(x); H_1)}{p_{T_2}(t_2(x); H_0)}\right)^{\eta_2} p_X(x; H_0)\, dx    (5.3)

l_{T_1}(x) = \ln\frac{p_{T_1}(t_1(x); H_1)}{p_{T_1}(t_1(x); H_0)}, \quad l_{T_2}(x) = \ln\frac{p_{T_2}(t_2(x); H_1)}{p_{T_2}(t_2(x); H_0)}    (5.4)

(5.1) can be written as

p_X(x; \eta) = \exp\left[\eta_1 l_{T_1}(x) + \eta_2 l_{T_2}(x) - K(\eta) + \ln p_X(x; H_0)\right]    (5.5)

which is a two-parameter exponential family [9]. K(η) is recognized as the cumulant generating function of lT1(x), lT2(x) when the PDF of x is pX(x;H0). Since (5.5) is of an exponential family, the EEF inherits some useful properties, which we discuss in the following (refer to [9], [10], and [11] for details).
1) If the PDF of x is pX(x; η), then the joint PDF of T1 and T2 is [11]

p_{T_1,T_2}(t_1, t_2; \eta) = \exp\left[\eta_1 l_{T_1} + \eta_2 l_{T_2} - K(\eta) + \ln p_{T_1,T_2}(t_1, t_2; H_0)\right]    (5.6)

where

l_{T_1} = \ln\frac{p_{T_1}(t_1; H_1)}{p_{T_1}(t_1; H_0)}, \quad l_{T_2} = \ln\frac{p_{T_2}(t_2; H_1)}{p_{T_2}(t_2; H_0)}    (5.7)

This can also be easily proved using surface integral techniques [12]. Notice that in (5.6), T1 and T2 are not independent unless they are independent under H0.
2) K(η) is convex by Hölder's inequality [9]. If we assume that lT1 and lT2 are linearly independent [13], then η is identifiable, and hence K(η) is strictly convex [10].

3) Let Eη(lTi) be the expected value of lTi for i = 1, 2 and Cη be the covariance matrix of [lT1, lT2]^T when x is distributed according to pX(x; η). We have

\frac{\partial K(\eta)}{\partial \eta_i} = E_\eta(l_{T_i})    (5.8)

\begin{bmatrix} \frac{\partial^2 K(\eta)}{\partial \eta_1^2} & \frac{\partial^2 K(\eta)}{\partial \eta_1 \partial \eta_2} \\ \frac{\partial^2 K(\eta)}{\partial \eta_2 \partial \eta_1} & \frac{\partial^2 K(\eta)}{\partial \eta_2^2} \end{bmatrix} = C_\eta    (5.9)

Notice that (5.9) also shows that K(η) is convex.

4) [lT1, lT2]^T is a minimal and complete sufficient statistic for η. Hence [lT1, lT2]^T can be used to discriminate between pX(x;H1) and pX(x;H0).

5) K(η) is finite on S. To see this, K(η) > −∞ by definition. Obviously, K(η) = 0 for η = [0, 0]^T, [1, 0]^T, [0, 1]^T. Since K(η) is convex, we have K(η) ≤ 0 < ∞ for η ∈ S. But when η is outside S, there is no guarantee that K(η) is finite in general. This explains the constraints in (5.2).
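Property 5 can be checked by Monte Carlo in a simple case. In the sketch below (our illustration; all values arbitrary), the sensor statistics are scalars that are marginally N(0, 1) under H0 and N(ai, 1) under H1, with correlation ρ0 under H0.

```python
import numpy as np

rng = np.random.default_rng(3)
a1, a2, rho0 = 0.5, 0.8, 0.6
R0 = np.array([[1.0, rho0], [rho0, 1.0]])
t = rng.multivariate_normal([0.0, 0.0], R0, size=200000)   # samples under H0

def lT(ti, a):
    # log marginal likelihood ratio of N(a, 1) versus N(0, 1), cf. (5.4)
    return a * ti - a * a / 2.0

def K(eta):
    # Monte Carlo estimate of the cumulant generating function (5.3)
    return np.log(np.mean(np.exp(eta[0] * lT(t[:, 0], a1) + eta[1] * lT(t[:, 1], a2))))

# Property 5: K(eta) = 0 at the corners of S and K(eta) <= 0 inside S
for eta in ([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.25]):
    print(eta, K(eta))
```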
5.3 EEF for Hypothesis Testing
For binary hypothesis testing, we decide H1 if

\max_{\eta} \ln\frac{p_X(x; \eta)}{p_X(x; H_0)} > \tau    (5.10)

where τ is a threshold. This test statistic actually does not depend on x but only on t1 and t2, since

g(\eta) = \ln\frac{p_X(x; \eta)}{p_X(x; H_0)} = \eta_1 l_{T_1} + \eta_2 l_{T_2} - K(\eta)    (5.11)
The reason why we choose this test statistic, as we will show next, is that asymptotically max_η pX(x; η) is the closest to the unknown pX(x;H1) in KL divergence.

Assume that there are a large number of independent and identically distributed (IID) unobservable xi's for i = 1, 2, . . . , N, which result in IID t1i's and IID t2i's. We want to maximize

\frac{1}{N}\sum_{i=1}^{N} \ln\frac{p_X(x_i; \eta)}{p_X(x_i; H_0)} = \eta_1 \frac{1}{N}\sum_{i=1}^{N} l_{T_{1i}} + \eta_2 \frac{1}{N}\sum_{i=1}^{N} l_{T_{2i}} - K(\eta)    (5.12)

By the law of large numbers, under H1,

\frac{1}{N}\sum_{i=1}^{N} l_{T_{1i}} \to E_{H_1}(l_{T_1}) = D(p_{T_1}(t_1; H_1) \| p_{T_1}(t_1; H_0))

\frac{1}{N}\sum_{i=1}^{N} l_{T_{2i}} \to E_{H_1}(l_{T_2}) = D(p_{T_2}(t_2; H_1) \| p_{T_2}(t_2; H_0))

as N → ∞. So we are asymptotically maximizing

\eta_1 D(p_{T_1}(t_1; H_1) \| p_{T_1}(t_1; H_0)) + \eta_2 D(p_{T_2}(t_2; H_1) \| p_{T_2}(t_2; H_0)) - K(\eta)    (5.13)
Since

\ln\frac{p_X(x; H_1)}{p_X(x; \eta)} = -\eta_1 l_{T_1} - \eta_2 l_{T_2} + K(\eta) + \ln\frac{p_X(x; H_1)}{p_X(x; H_0)}

the KL divergence between pX(x;H1) and pX(x; η) is

D(p_X(x; H_1) \| p_X(x; \eta)) = E_{H_1}\left[-\eta_1 l_{T_1} - \eta_2 l_{T_2} + K(\eta) + \ln\frac{p_X(x; H_1)}{p_X(x; H_0)}\right]
= -\eta_1 D(p_{T_1}(t_1; H_1) \| p_{T_1}(t_1; H_0)) - \eta_2 D(p_{T_2}(t_2; H_1) \| p_{T_2}(t_2; H_0)) + K(\eta) + D(p_X(x; H_1) \| p_X(x; H_0))    (5.14)
This shows that D(pX(x;H1) ‖ pX(x; η)) is minimized by maximizing (5.13). A similar result is shown in [6] using a Pythagorean-like theorem. Also, if T1 and/or T2 are sufficient statistics for deciding between H0 and H1, it can be shown that pX(x; η̂) = pX(x;H1); thus, the true PDF under H1 is recovered [14].

To implement (5.10), we require the maximum likelihood estimate (MLE) of η. Let η∗ be the MLE of η without the constraints in (5.2). Since g(η) is strictly concave, η∗ is unique. Taking partial derivatives of g(η) and setting them to zero, we have

l_{T_1} = \frac{\partial K(\eta)}{\partial \eta_1}\Big|_{\eta^*}, \quad l_{T_2} = \frac{\partial K(\eta)}{\partial \eta_2}\Big|_{\eta^*}    (5.15)

Let η̂ be the MLE of η with the constraints. If η∗ is in the constraint set S, then η̂ = η∗. Otherwise, η̂ is unique and lies on the boundary of S, since −g(η) is strictly convex and S is convex [15]; hence we can simply search the boundary of S to find η̂.
5.4 Example
Since only T1 and T2 are used in hypothesis testing, we only need to specify their distributions. Consider the case when T1 and T2 are scalars (we write them as T1 and T2) with distributions

\begin{bmatrix} T_1 \\ T_2 \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \sigma^2 \begin{bmatrix} 1 & \rho_0 \\ \rho_0 & 1 \end{bmatrix}\right) \text{ under } H_0

\begin{bmatrix} T_1 \\ T_2 \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} A_1 \\ A_2 \end{bmatrix}, \sigma^2 \begin{bmatrix} 1 & \rho_1 \\ \rho_1 & 1 \end{bmatrix}\right) \text{ under } H_1

where ρ0 is known but ρ1 is unknown (we do not need the joint PDF of T1 and T2 under H1). We have
K(\eta) = \ln E_{H_0}[\exp(\eta_1 l_{T_1} + \eta_2 l_{T_2})]
= \ln E_{H_0}\left[\exp\left(\eta_1 \frac{2t_1 A_1 - A_1^2}{2\sigma^2} + \eta_2 \frac{2t_2 A_2 - A_2^2}{2\sigma^2}\right)\right]
= -\eta_1 \frac{A_1^2}{2\sigma^2} - \eta_2 \frac{A_2^2}{2\sigma^2} + \ln E_{H_0}\left[\exp\left(\frac{\eta_1 t_1 A_1 + \eta_2 t_2 A_2}{\sigma^2}\right)\right]

Let \phi = [\eta_1 A_1/\sigma^2,\ \eta_2 A_2/\sigma^2]^T and t = [t_1, t_2]^T. Then

E_{H_0}\left[\exp\left(\frac{\eta_1 t_1 A_1 + \eta_2 t_2 A_2}{\sigma^2}\right)\right] = E_{H_0}\left[\exp(\phi^T t)\right] = \exp\left(\frac{1}{2}\phi^T C_0 \phi\right)

where C_0 = \sigma^2 \begin{bmatrix} 1 & \rho_0 \\ \rho_0 & 1 \end{bmatrix}, and hence

K(\eta) = -\eta_1 \frac{A_1^2}{2\sigma^2} - \eta_2 \frac{A_2^2}{2\sigma^2} + \frac{1}{2}\phi^T C_0 \phi

So

g(\eta) = \eta_1 l_{T_1} + \eta_2 l_{T_2} - K(\eta)
= \eta_1 \frac{2t_1 A_1 - A_1^2}{2\sigma^2} + \eta_2 \frac{2t_2 A_2 - A_2^2}{2\sigma^2} - K(\eta)
= \frac{\eta_1 A_1 t_1}{\sigma^2} + \frac{\eta_2 A_2 t_2}{\sigma^2} - \frac{1}{2}\phi^T C_0 \phi
= t^T \phi - \frac{1}{2}\phi^T C_0 \phi

Differentiating and setting the result to zero, the global maximum is found at

\phi^* = C_0^{-1} t = \frac{1}{\sigma^2(1-\rho_0^2)} \begin{bmatrix} t_1 - \rho_0 t_2 \\ t_2 - \rho_0 t_1 \end{bmatrix}

or

\eta^* = \frac{1}{1-\rho_0^2} \begin{bmatrix} (t_1 - \rho_0 t_2)/A_1 \\ (t_2 - \rho_0 t_1)/A_2 \end{bmatrix}
If η∗ ∈ S, then we decide H1 if g(η∗) = (1/2) t^T C_0^{-1} t > τ; otherwise we search for η̂ on the boundary of S and decide H1 if g(η̂) > τ.

When we observe N IID t1i's and IID t2i's, it follows from (5.12) that [t1, t2]^T is replaced by the sample mean [(1/N)Σ_{i=1}^N t1i, (1/N)Σ_{i=1}^N t2i]^T, and everything else remains the same.
5.5 Simulation Results
For the above example, we set N = 20, A1 = 0.3, A2 = 0.35, σ² = 1, ρ0 = 0.6, and ρ1 = 0.7. We compare the EEF approach with the clairvoyant detector (ρ1 is known, so its performance is an upper bound), the detector assuming independence of t1 and t2, and the copula-based method. The copula method estimates the linear correlation coefficient ρ1 using a non-parametric rank correlation measure, Kendall's τ. We use the Gaussian copula as in [3]. The simulation is repeated for 5000 trials, and the receiver operating characteristic (ROC) curves are plotted. As seen in Figure 5.1, the EEF is outperformed only by the clairvoyant detector, and performs better than the other two methods.
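For the Gaussian copula, Kendall's τ determines the linear correlation through the standard inversion ρ = sin(πτ/2); a minimal sketch of this estimation step (our illustration, with arbitrary sample size) is:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(4)
rho1 = 0.7
C1 = np.array([[1.0, rho1], [rho1, 1.0]])
t = rng.multivariate_normal([0.3, 0.35], C1, size=5000)   # samples under H1

tau_k, _ = kendalltau(t[:, 0], t[:, 1])
rho_hat = np.sin(np.pi * tau_k / 2.0)    # Gaussian-copula inversion of Kendall's tau
print(rho_hat)                            # close to rho1 = 0.7
```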
Figure 5.1. ROC curves for different detectors (clairvoyant, EEF, copula, and assumed independence).
5.6 Conclusion
The EEF-based approach is proposed for the problem of multimodal signal processing when the sensor outputs are not independent. It exponentially embeds two or more PDFs to approximate an unknown PDF. Such embedding is closely related to the KL divergence, and many of its properties have been discussed. Examples are given to illustrate the application of this method. Compared to some existing approaches, better performance is observed for the proposed method. The connections among η, K(η), and the KL divergence, and more of its theoretical properties, will be investigated in the future.
List of References
[1] S. Thomopoulos, R. Viswanathan, and D. Bougoulias, "Optimal distributed decision fusion," IEEE Trans. Aerosp. Electron. Syst., vol. 25, pp. 761–765, Sep. 1989.
[2] Z. Chair and P. Varshney, "Optimal data fusion in multiple sensor detection systems," IEEE Trans. Aerosp. Electron. Syst., vol. 22, pp. 98–101, Jan. 1986.
[3] A. Sundaresan, P. Varshney, and N. Rao, "Distributed detection of a nuclear radioactive source using fusion of correlated decisions," in 10th International Conference on Information Fusion, 2007, pp. 1–7.
[4] S. Iyengar, P. Varshney, and T. Damarla, "A parametric copula based framework for multimodal signal processing," in ICASSP, 2009, pp. 1893–1896.
[5] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed. John Wiley and Sons, 2006.
[6] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333–345, Jan. 2005.
[7] S. Kullback, Information Theory and Statistics, 2nd ed. Courier Dover Publications, 1997.
[8] S. Kay, A. Nuttall, and P. Baggenstoss, "Multidimensional probability density function approximations for detection, classification, and model order selection," IEEE Trans. Signal Process., vol. 49, pp. 2240–2252, Oct. 2001.
[9] L. Brown, Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, 1986.
[10] P. Bickel and K. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics. Pearson Prentice Hall, 2006, vol. 1.
[11] E. Lehmann and J. Romano, Testing Statistical Hypotheses, 3rd ed. Springer, 2005.
[12] J. Higgins, "Some surface integral techniques in statistics," The American Statistician, vol. 29, pp. 43–46, Feb. 1975.
[13] J. Pfanzagl and W. Wefelmeyer, Contributions to a General Asymptotic Statistical Theory, ser. Lecture Notes in Statistics. Springer-Verlag, 1982, vol. 13.
[14] S. Kay, "Asymptotically optimal approximation of multidimensional pdf's by lower dimensional pdf's," IEEE Trans. Signal Process., vol. 55, pp. 725–729, Feb. 2007.
[15] D. Luenberger, Linear and Nonlinear Programming, 2nd ed. Springer, 2003.
MANUSCRIPT 6
Joint PDF Construction for Sensor Fusion and Distributed Detection
Abstract
A novel method of constructing a joint PDF under H1, when the joint PDF
under H0 is known, is developed. It has direct application in distributed detection
systems. The construction is based on the exponential family and it is shown
that asymptotically the constructed PDF is optimal. The generalized likelihood
ratio test (GLRT) is derived based on this method for the partially observed linear
model. Interestingly, the test statistic is equivalent to the clairvoyant GLRT, which
uses the true PDF under H1, even if the noise is non-Gaussian.
6.1 Introduction
Data fusion or sensor fusion in distributed detection systems has been widely
studied over the years. By combining the data from different sensors, better per-
formance can be expected than using a single sensor alone. The optimal detection
performance can be obtained if the joint probability density function (PDF) of the
measurements from different sensors under each hypothesis is completely known.
However in practice, this joint PDF is usually not available. So a key issue in this
area is how to construct the joint PDF of the measurements from different sen-
sors. One common approach is to assume that the measurements are independent
[1], [2]. This approach has been widely used due to its simplicity, since the joint
PDF is then the product of the marginal PDFs. This leads to the product rule
in combining classifiers, and it is effectively a severe rule as stated in [3] that “it
is sufficient for a single recognition engine to inhibit a particular interpretation
by outputting a close to zero probability for it”. Moreover, the independence is a
strong assumption and the measurements can be correlated in many cases. The
dependence between measurements has been considered in [4, 5, 6]. A copula based
framework is used in [4, 5] to estimate the joint PDF from the marginal PDFs.
The exponentially embedded families (EEFs) are proposed in [6] to asymptotically
minimize the Kullback-Leibler (KL) divergence between the true PDF and the
estimated one.
Note that all the above methods are based on the assumption that we know
the marginal PDFs of the measurements. But in many cases, the marginal PDFs
may not be available or accurate. This could happen when we do not have enough
training data. In this paper, we present a new way of constructing a joint PDF without knowledge of the marginal PDFs, using only a reference PDF. The constructed joint PDF takes the form of an exponential family, and the maximum likelihood estimate (MLE) of the unknown parameters can be easily found based on the exponential family. Since no Gaussian assumption is placed on the reference PDF, this method can be very useful when the underlying distributions are non-Gaussian. In the examples where we apply this method to the detection problem, under some conditions the detection statistic can be shown to be the same as that of the clairvoyant generalized likelihood ratio test (GLRT), which is the test when the true PDF under H1 is known except for the usual unknown parameters.
The paper is organized as follows. Section 6.2 formulates the detection problem. The construction of the joint PDF is presented and applied to the detection problem in Section 6.3. The KL divergence between the true PDF and the constructed PDF is examined in Section 6.4. We give two examples in Section 6.5. In Section 6.6, some simulation results are shown. Conclusions are given in Section 6.7.
6.2 Problem Statement
Consider the detection problem when we observe the outputs of two sensors, T1(x) and T2(x), which are transformations of the underlying samples x that are unobservable (see Figure 6.1). All the results are valid for any number of sensors; we choose two for simplicity. Assume that we have enough training data T1i(x)'s and T2i(x)'s under H0, when no signal is present. Hence we have a good estimate of the joint PDF of T1 and T2 under H0 (see [7]), and thus we assume pT1,T2(t1, t2;H0) is completely known. Under H1, when a signal is present, we may not have enough training data to estimate the joint PDF. So our goal is to construct an appropriate pT1,T2(t1, t2;H1) and use it for detection. Since pT1,T2(t1, t2;H1) cannot be uniquely specified based on pT1,T2(t1, t2;H0), we need the following reasonable assumptions to construct the joint PDF.

1) Under H1 the signal is small, and thus pT1,T2(t1, t2;H1) is close to pT1,T2(t1, t2;H0).

2) pT1,T2(t1, t2;H1) depends on signal parameters θ, so that

p_{T_1,T_2}(t_1, t_2; H_1) = p_{T_1,T_2}(t_1, t_2; \theta)

and

p_{T_1,T_2}(t_1, t_2; H_0) = p_{T_1,T_2}(t_1, t_2; \mathbf{0})

Note that since θ represents signal amplitudes, θ ≠ 0 under H1. Therefore, the detection problem is

H_0 : \theta = \mathbf{0}
H_1 : \theta \ne \mathbf{0}
Figure 6.1. Distributed detection system with two sensors.
6.3 Construction of Joint PDF for Detection
To simplify the notation, let T = [T1, T2]^T, so that the joint PDF pT1,T2(t1, t2; θ) can be written as pT(t; θ). Since we assume that ||θ|| is small, we expand the log-likelihood function using a first order Taylor expansion:

\ln p_{\mathbf{T}}(t; \theta) = \ln p_{\mathbf{T}}(t; \mathbf{0}) + \theta^T \frac{\partial \ln p_{\mathbf{T}}(t; \theta)}{\partial \theta}\Big|_{\theta=\mathbf{0}} + o(\|\theta\|)    (6.1)
We omit the o(||θ||) term, but in order for pT(t; θ) to be a valid PDF, we normalize it to integrate to one:

p_{\mathbf{T}}(t; \theta) = \exp\left[\theta^T \frac{\partial \ln p_{\mathbf{T}}(t; \theta)}{\partial \theta}\Big|_{\theta=\mathbf{0}} - K(\theta) + \ln p_{\mathbf{T}}(t; \mathbf{0})\right]    (6.2)

where

K(\theta) = \ln E_0\left[\exp\left(\theta^T \frac{\partial \ln p_{\mathbf{T}}(t; \theta)}{\partial \theta}\Big|_{\theta=\mathbf{0}}\right)\right]    (6.3)
Here E0 denotes the expected value under H0.
Next we assume that the sensor outputs are the score functions, i.e.,
t = ∂ln pT(t; θ)/∂θ |_{θ=0}   (6.4)
and are sufficient statistics for the constructed PDF under H1. This will be true
if pT(t; θ) is in the exponential family with
pT(t; θ) = exp[ θ^T t − K(θ) + ln pT(t; 0) ]   (6.5)
where
K(θ) = ln E0[ exp( θ^T T ) ]   (6.6)
and E0(T) = 0. This can be easily verified since by (6.5), we have
∂ln pT(t; θ)/∂θ |_{θ=0} = t − ∂K(θ)/∂θ |_{θ=0}
and
∂K(θ)/∂θ |_{θ=0} = E0(T)
as well-known properties of the exponential family. Note that even if E0(T) ≠ 0,
we still have
t − E0(T) = ∂ln pT(t; θ)/∂θ |_{θ=0}
We can use t − E0(T) instead of t as the sensor outputs and hence still satisfy
(6.4) and (6.5). As a result, we will use (6.5) as our constructed PDF. This implies
that t is a sufficient statistic for the constructed exponential PDF, and hence this
PDF incorporates all the sensor information. Note that if T1 and T2 are statistically
dependent under H0, they will also be dependent under H1. Also note that only
pT(t; 0) is required in (6.5). It is assumed that in practice this can be estimated
or found analytically [7] with reasonable accuracy.
Since θ is unknown, the GLRT is used for detection [8]. We want to maximize
pT(t; θ), or equivalently ln[pT(t; θ)/pT(t; 0)] = θ^T t − K(θ), over θ. This is a
convex optimization problem since K(θ) is convex by Hölder's inequality [9].
Hence many convex optimization techniques can be utilized [10, 11]. By taking
the derivative with respect to θ, the MLE of θ is found by solving
t = ∂K(θ)/∂θ   (6.7)
Also, because K(θ) is strictly convex, the MLE θ̂ is unique. Then we decide H1 if
ln[pT(t; θ̂)/pT(t; 0)] = θ̂^T t − K(θ̂) > τ   (6.8)
where τ is a threshold.
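The maximization of θ^T t − K(θ) can be sketched numerically even when K(θ) is only available through H0 training data. The following is a minimal illustration, not the implementation used in this work: it assumes a toy two-dimensional Gaussian reference, estimates the cumulant generating function and its gradient (the exponentially tilted mean of T) by Monte Carlo from an H0 sample, and runs plain gradient ascent with arbitrarily chosen step size and iteration count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy H0 reference: T ~ N(0, C); in practice T0 would be the H0 training outputs.
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
T0 = rng.standard_normal((200000, 2)) @ np.linalg.cholesky(C).T

def K_hat(theta):
    # Monte Carlo estimate of K(theta) = ln E0[exp(theta^T T)] in (6.6).
    return np.log(np.mean(np.exp(T0 @ theta)))

def glrt_stat(t, steps=500, mu=0.05):
    # Gradient ascent on the concave objective theta^T t - K(theta); the
    # gradient is t minus the tilted mean of T, estimated from the H0 sample.
    theta = np.zeros(T0.shape[1])
    for _ in range(steps):
        w = np.exp(T0 @ theta)
        w /= w.sum()
        theta += mu * (t - T0.T @ w)
    return theta @ t - K_hat(theta), theta

# For a Gaussian reference, K(theta) = theta^T C theta / 2 exactly, so the
# maximizer is C^{-1} t and the statistic is t^T C^{-1} t / 2.
t = np.array([1.0, -0.5])
stat, theta_hat = glrt_stat(t)
print(stat, t @ np.linalg.solve(C, t) / 2)
```

For the Gaussian toy reference the numerical result matches the closed-form answer to within Monte Carlo error, which is how a sketch like this can be validated before applying it to an empirical pT(t; 0).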
6.4 KL Divergence Between The True PDF and The Constructed PDF
The KL divergence is a non-symmetric measure of the difference between two
PDFs. For two PDFs p1 and p0, it is defined as
D(p1 || p0) = ∫ p1(x) ln[p1(x)/p0(x)] dx
It is well known that D(p1 || p0) ≥ 0 with equality if and only if p1 = p0 [12]. By
Stein's lemma [13], the KL divergence measures the asymptotic performance for
detection.
It can be shown that pT(t; θ̂) is optimal under both hypotheses. That is,
under H0, pT(t; θ̂) = pT(t; 0) asymptotically, and under H1, pT(t; θ̂)
is asymptotically the closest to the true PDF in KL divergence. Similar results
and arguments have been shown in [6, 14].
6.5 Examples
In this section, we apply the constructed PDF of (6.5) to some detection
problems. We start with the simple case of Gaussian noise and then extend the
result to the more general case of Gaussian mixture noise.
6.5.1 Partially Observed Linear Model with Gaussian Noise
Suppose we have the linear model
x = Hα + w   (6.9)
with
H0 : α = 0
H1 : α ≠ 0
where x is an N × 1 vector of the underlying unobservable samples, H is an N × p
observation matrix with full column rank, α is a p × 1 vector of the unknown
signal amplitudes, and w is an N × 1 vector of white Gaussian noise with known
variance σ^2. We observe two sensor outputs
T1(x) = H1^T x
T2(x) = H2^T x   (6.10)
where H1 and H2 could be any subsets of the columns of H. Note that [H1, H2]
does not have to be H. This model is called a partially observed linear model. Note
that a sufficient statistic is H^T x, so there is some information loss over the case
when x is observed, unless H = [H1, H2].
Let G = [H1, H2]. Then we have
T = [T1(x)^T, T2(x)^T]^T = [(H1^T x)^T, (H2^T x)^T]^T = G^T x   (6.11)
Therefore, T is also Gaussian with PDF
T ∼ N(0, σ^2 G^T G)   under H0
and T1, T2 are seen to be correlated for H1^T H2 ≠ 0. As a result, we construct the
PDF as in (6.5) with
K(θ) = ln E0[ exp( θ^T T ) ] = (σ^2/2) θ^T G^T G θ   (6.12)
Note that θ is the vector of the unknown parameters in the constructed PDF, and
it is different from the unknown parameters α in the linear model.
By (6.7) and (6.12), the MLE of θ satisfies
t = ∂K(θ)/∂θ = σ^2 G^T G θ
so that
θ̂ = (1/σ^2) (G^T G)^{-1} t
and the test statistic becomes
θ̂^T t − K(θ̂) = (1/(2σ^2)) t^T (G^T G)^{-1} t   (6.13)
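A quick numerical check of (6.13): the statistic obtained by plugging the MLE into θ^T t − K(θ) must equal the closed form. The dimensions, σ^2, and the random H below are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(1)

N, p, sigma2 = 20, 4, 2.0
H = rng.standard_normal((N, p))
G = H[:, [0, 2]]                    # two sensors observing columns of H

x = np.sqrt(sigma2) * rng.standard_normal(N)   # data under H0
t = G.T @ x                                    # stacked sensor outputs (6.11)

# MLE from t = dK/dtheta = sigma^2 G^T G theta, then the statistic
# theta_hat^T t - K(theta_hat) with K(theta) = (sigma^2/2) theta^T G^T G theta.
theta_hat = np.linalg.solve(sigma2 * (G.T @ G), t)
stat = theta_hat @ t - 0.5 * sigma2 * theta_hat @ (G.T @ G) @ theta_hat
stat_closed = t @ np.linalg.solve(G.T @ G, t) / (2 * sigma2)
print(stat, stat_closed)   # the two agree, consistent with (6.13)
```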
Next we consider the clairvoyant GLRT, that is, the GLRT when we know
the true PDF of T under H1 except for the underlying unknown parameters α.
From (6.11) we know that
T ∼ N(G^T H α, σ^2 G^T G)   under H1
We write the true PDF under H1 as pT(t; α). The MLE of α is found by maximizing
ln[pT(t; α)/pT(t; 0)] = −(1/(2σ^2)) (t − G^T H α)^T (G^T G)^{-1} (t − G^T H α)
+ (1/(2σ^2)) t^T (G^T G)^{-1} t
Let t be q × 1. If q ≤ p, i.e., the length of t is no greater than the length of α,
then the MLE α̂ may not be unique. Since
(t − G^T H α)^T (G^T G)^{-1} (t − G^T H α) ≥ 0, we can always find α̂ such that
t = G^T H α̂ and hence (t − G^T H α̂)^T (G^T G)^{-1} (t − G^T H α̂) = 0. Hence
the clairvoyant GLRT statistic becomes
ln[pT(t; α̂)/pT(t; 0)] = (1/(2σ^2)) t^T (G^T G)^{-1} t
which is the same as the GLRT on our constructed PDF (see (6.13)) when q ≤ p.
6.5.2 Partially Observed Linear Model with Non-Gaussian Noise
The partially observed linear model remains the same as in the previous
subsection, except that instead of assuming that w is white Gaussian, we assume
that w has a Gaussian mixture distribution with two components, i.e.,
w ∼ π N(0, σ1^2 I) + (1 − π) N(0, σ2^2 I)   (6.14)
where π, σ1^2, and σ2^2 are known (0 < π < 1). The following derivation is easily
extended to w ∼ Σ_{i=1}^{L} πi N(0, σi^2 I).
Since w has a Gaussian mixture distribution, T = G^T x is also Gaussian
mixture distributed and
T ∼ π N(0, σ1^2 G^T G) + (1 − π) N(0, σ2^2 G^T G)   under H0
It can be shown that the GLRT statistic is
max_θ [ θ^T t − ln( π e^{(σ1^2/2) θ^T G^T G θ} + (1 − π) e^{(σ2^2/2) θ^T G^T G θ} ) ]   (6.15)
Although no analytical solution for the MLE of θ exists, it can be found using
convex optimization techniques [10, 11]. Moreover, an analytical solution exists as
||θ|| → 0. It can be shown that
θ̂ = (1/(π σ1^2 + (1 − π) σ2^2)) (G^T G)^{-1} t   (6.16)
and the GLRT statistic becomes
(1/(2(π σ1^2 + (1 − π) σ2^2))) t^T (G^T G)^{-1} t   (6.17)
as ||θ|| → 0.
The clairvoyant GLRT statistic can be shown to be equivalent to
t^T (G^T G)^{-1} t   (6.18)
when q ≤ p. Hence the clairvoyant GLRT coincides with the GLRT using the
constructed PDF as ||θ|| → 0.
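A sketch of (6.15)-(6.17) with illustrative values (the matrix A stands in for G^T G, and the step size and iteration count of the gradient ascent are arbitrary): the concave objective of (6.15) is maximized numerically and, for a small t, the result agrees with the asymptotic statistic (6.17).

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # plays the role of G^T G (toy values)
pi_, s1, s2 = 0.9, 50.0, 500.0      # mixture weight and variances

def objective_and_grad(theta, t):
    # Objective of (6.15) and its gradient (concave in theta).
    q = theta @ A @ theta
    a1, a2 = 0.5 * s1 * q, 0.5 * s2 * q
    m = max(a1, a2)                  # log-sum-exp for numerical stability
    K = m + np.log(pi_ * np.exp(a1 - m) + (1.0 - pi_) * np.exp(a2 - m))
    w1 = pi_ * np.exp(a1 - m)
    w2 = (1.0 - pi_) * np.exp(a2 - m)
    sbar = (w1 * s1 + w2 * s2) / (w1 + w2)   # mixture-weighted variance
    return theta @ t - K, t - sbar * (A @ theta)

def glrt(t, steps=2000, mu=1e-4):
    # Simple gradient ascent; any convex solver would do.
    theta = np.zeros_like(t)
    val, g = objective_and_grad(theta, t)
    for _ in range(steps):
        theta += mu * g
        val, g = objective_and_grad(theta, t)
    return val, theta

t = np.array([0.02, -0.01])          # small t, so (6.16)/(6.17) should apply
val, theta_hat = glrt(t)
sbar0 = pi_ * s1 + (1.0 - pi_) * s2
approx = t @ np.linalg.solve(A, t) / (2.0 * sbar0)   # statistic (6.17)
print(val, approx)
```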
Note that the noise samples in (6.14) are uncorrelated but not independent.
We also consider the general case in which the noise can be correlated, with PDF
w ∼ π N(0, C1) + (1 − π) N(0, C2)   (6.19)
It can be shown that for the GLRT using the constructed PDF, the test statistic is
max_θ [ θ^T t − ln( π e^{(1/2) θ^T G^T C1 G θ} + (1 − π) e^{(1/2) θ^T G^T C2 G θ} ) ]   (6.20)
and the clairvoyant GLRT statistic is
−ln( (π/det^{1/2}(C1)) exp[ −(1/2) t^T (G^T C1 G)^{-1} t ]
+ ((1 − π)/det^{1/2}(C2)) exp[ −(1/2) t^T (G^T C2 G)^{-1} t ] )   (6.21)
when q ≤ p.
6.6 Simulations
Since the GLRT using the constructed PDF coincides with the clairvoyant
GLRT under Gaussian noise as shown in subsection 6.5.1, we will only compare
the performances under non-Gaussian noise (both uncorrelated noise as in (6.14)
and correlated noise as in (6.19)).
Consider the model where
x[n] = A1 + A2 r^n + A3 cos(2πfn + φ) + w[n]   (6.22)
for n = 0, 1, . . . , N − 1 with known r and frequency f but unknown amplitudes
A1, A2, A3 and phase φ. This is a linear model as in (6.9), where
H =
[ 1    1        1                O
  1    r        cos(2πf)         sin(2πf)
  ...  ...      ...              ...
  1    r^{N−1}  cos(2πf(N−1))    sin(2πf(N−1)) ]
(the second entry of the first row of the third and fourth columns is cos(0) = 1
and sin(0) = 0, respectively) and α = [A1, A2, A3 cos φ, −A3 sin φ]^T.
Let w have an uncorrelated Gaussian mixture distribution as in (6.14). For
the partially observed linear model, we observe two sensor outputs as in (6.10).
We compare the GLRT in (6.15) with the clairvoyant GLRT in (6.18). Note that
the MLE of θ in (6.15) is found numerically, not by the asymptotic approximation
in (6.16). In the simulation, we use N = 20, A1 = 2, A2 = 3, A3 = 4, φ = π/4,
r = 0.95, f = 0.34, π = 0.9, σ1^2 = 50, σ2^2 = 500, and H1 and H2 are the first
and third columns of H respectively, i.e., H1 = [1, 1, . . . , 1]^T and
H2 = [1, cos(2πf), . . . , cos(2πf(N − 1))]^T. As shown in Figure 6.2, the
performances are almost the same, which justifies their equivalence under the
small-signal assumption shown in Section 6.5.
Next, for the same model in (6.22), let w have a correlated Gaussian mixture
distribution as in (6.19). We compare the performances of the GLRT using the
constructed PDF as in (6.20) and the clairvoyant GLRT as in (6.21). We use N = 20,
Figure 6.2. ROC curves (probability of detection versus probability of false alarm)
for the GLRT using the constructed PDF and the clairvoyant GLRT with
uncorrelated Gaussian mixture noise.
A1 = 3, A2 = 4, A3 = 3, φ = π/7, r = 0.9, f = 0.46, π = 0.7, H1 = [1, 1, . . . , 1]^T,
H2 = [1, cos(2πf), . . . , cos(2πf(N − 1))]^T. The covariance matrices C1, C2 are
generated as C1 = R1^T R1 and C2 = R2^T R2, where R1, R2 are full-rank N × N
matrices. As shown in Figure 6.3, the performances are still very similar.
Figure 6.3. ROC curves (probability of detection versus probability of false alarm)
for the GLRT using the constructed PDF and the clairvoyant GLRT with
correlated Gaussian mixture noise.
6.7 Conclusions
A novel method of combining sensor outputs for detection based on the
exponential family has been proposed. It does not require the joint PDF under
H1. The constructed PDF has been shown to be optimal in KL divergence. The
GLRT statistic based on this method has been shown to be equivalent to the
clairvoyant GLRT statistic for the partially observed linear model with both
Gaussian and non-Gaussian noise. The equivalence is also confirmed in simulations.
List of References
[1] S. Thomopoulos, R. Viswanathan, and D. Bougoulias, "Optimal distributed decision fusion," IEEE Trans. Aerosp. Electron. Syst., vol. 25, pp. 761–765, Sep. 1989.
[2] Z. Chair and P. Varshney, "Optimal data fusion in multiple sensor detection systems," IEEE Trans. Aerosp. Electron. Syst., vol. 22, pp. 98–101, Jan. 1986.
[3] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, pp. 226–239, Mar. 1998.
[4] A. Sundaresan, P. Varshney, and N. Rao, "Distributed detection of a nuclear radioactive source using fusion of correlated decisions," in Proc. 10th International Conference on Information Fusion, 2007, pp. 1–7.
[5] S. Iyengar, P. Varshney, and T. Damarla, "A parametric copula based framework for multimodal signal processing," in ICASSP, 2009, pp. 1893–1896.
[6] S. Kay and Q. Ding, "Exponentially embedded families for multimodal sensor processing," in ICASSP, Mar. 2010, pp. 3770–3773.
[7] S. Kay, A. Nuttall, and P. Baggenstoss, "Multidimensional probability density function approximations for detection, classification, and model order selection," IEEE Trans. Signal Process., vol. 49, pp. 2240–2252, Oct. 2001.
[8] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.
[9] L. Brown, Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, 1986.
[10] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[11] D. Luenberger, Linear and Nonlinear Programming, 2nd ed. Springer, 2003.
[12] S. Kullback, Information Theory and Statistics, 2nd ed. Courier Dover Publications, 1997.
[13] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed. John Wiley and Sons, 2006.
[14] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333–345, Jan. 2005.
MANUSCRIPT 7
Sensor Integration for Classification
Abstract
In the problem of sensor integration, an important issue is estimating the
joint PDF of the sensor measurements. In practice, however, we may not have
enough training data to obtain a good estimate. In this paper, we construct
the joint PDF using an exponential family for classification. This method only
requires the PDF under a reference hypothesis. Its performance is shown to be
as good as that of the estimated maximum a posteriori probability classifier,
which requires more information. Because less information is needed than with
existing methods, our method is widely applicable to classification problems.
7.1 Introduction
Distributed detection/classification systems have been widely used in many
applications such as radar, sonar, wireless sensor networks, and medical diagnosis.
Since multiple sensors will collect more information than a single sensor does, a
better decision is expected to be made. In classification, it is well known that
the maximum a posteriori probability (MAP) classifier minimizes the probability
of error [1]. However, the MAP rule requires the complete knowledge of the joint
probability density functions (PDFs) of the measurements from sensors under each
hypothesis, which in practice may not be available. Hence, it is important in sensor
integration to find appropriate estimates of the joint PDFs under each hypothesis,
and the estimates should contain all the available information.
In many works, people assume that the marginal PDFs of the measurements
from each sensor are known. One commonly used method is to simply assume
that the measurements are independent, and the joint PDF is just the product of
the marginal PDFs [2], [3]. This is equivalent to the product rule in combining
classifiers, and it is a severe rule as shown in [4]. Another concern is that the
correlation among the measurements is neglected by assuming independence. So
some approaches that consider the dependence among the measurements have
been proposed. A copula based method that estimates the joint PDF from the
marginal PDFs is used in [5], [6]. The exponentially embedded families (EEFs),
which asymptotically minimize the Kullback-Leibler (KL) divergence between the
true PDF and the estimated PDF, are proposed in [7].
Note that the marginal PDFs are required in the above mentioned approaches.
However, we may not even have enough training data in practice to have an ac-
curate estimate of the marginal PDFs, especially when the sensor outputs have
high dimensions. In this paper, we construct the joint PDF using an exponential
family. The construction only requires a reference PDF and it incorporates all the
available information. It can be shown that the constructed PDF is asymptotically
the optimal one in the sense that it is asymptotically closest to the true PDF in
KL divergence.
By maximizing the constructed PDF over the signal parameters, our classifier
can be easily derived. The performance of our method is compared to that of the
estimated MAP classifier, which assumes that the true joint PDF is known except
for the unknown parameters. We present an example in which their performances
appear to be the same. Note that our method assumes less information than the
estimated MAP classifier does. This shows that our method has many applications
for distributed systems in practice.
The paper is organized as follows. In Section 7.2, we introduce a distributed
classification problem. In Section 7.3, we construct the joint PDF by an expo-
nential family and apply it to the classification problem. An example is given in
Section 7.4. In Section 7.5, the performances of our method and the estimated
MAP classifier are compared via simulation. Conclusions are drawn in Section 7.6.
7.2 Problem Statement
Consider the classification problem where we have two distributed sensors
whose outputs T1(x) and T2(x) are transformations of the underlying samples x
that are unobservable. We need to decide from among M candidate hypotheses
Hi for i = 1, 2, . . . , M . Assume that there is a reference hypothesis H0 (usually it
is the hypothesis with noise only) and we have enough training data T1n(x)’s and
T2n(x)’s under H0 to accurately estimate the joint PDF of T1 and T2 under H0
[8]. We assume that pT1,T2(t1, t2;H0) is completely known. However, under Hi
(i = 1, 2, . . . , M) when a signal is present, we may not even have enough training
samples to accurately estimate the marginal PDFs under Hi. This is especially
the case in the radar scenario, where the target is present for only a small portion
of the time. Hence, we want to construct appropriate joint PDFs under each Hi
with as much of the available information as possible, and make a classification using the
constructed PDFs. A simple illustration is shown in Figure 7.1. Note that the
result in this paper can be easily extended to the general multiple-sensor case.
7.3 Joint PDF Construction and Its Application in Classification
Since pT1,T2(t1, t2;H0) is the only information available, in order to specify
the joint PDF pT1,T2(t1, t2;Hi), we need the following assumptions [9].
1) The signal is small under each Hi and hence pT1,T2(t1, t2;Hi) is close to
pT1,T2(t1, t2;H0).
2) Under each Hi, the joint PDF can be parameterized by some signal param-
eters θi so that
pT1,T2(t1, t2;Hi) = pT1,T2(t1, t2; θi)
Figure 7.1. Distributed classification system with two sensors.
and pT1,T2(t1, t2;H0) = pT1,T2(t1, t2;0)
Hence the classification problem is to choose from
Hi : θ = θi for i = 1, . . . , M
Let
T = [T1^T, T2^T]^T
so that the joint PDF pT1,T2(t1, t2; θi) can be written as pT(t; θi). As shown in
[9], with a first-order Taylor expansion of the log-likelihood function under each
Hi, we can construct the PDF of T under Hi as
pT(t; θi) = exp[ θi^T t − K(θi) + ln pT(t; 0) ]   (7.1)
where
K(θi) = ln E0[ exp( θi^T T ) ]   (7.2)
is the cumulant generating function of pT(t;0), and it normalizes the PDF to
integrate to 1. Note that it is assumed that pT(t;0) is available or it can be
estimated with reasonable accuracy.
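A one-dimensional sanity check of (7.1) and (7.2) can be run numerically; the sketch below assumes a standard normal reference (so K(θ) = θ^2/2 analytically), with an illustrative grid and θ value, and confirms that K(θ) normalizes the constructed PDF to integrate to one.

```python
import numpy as np

# Scalar check of (7.1)-(7.2): with a standard normal reference pT(t;0),
# K(theta) = theta^2/2 and the exponentially tilted PDF integrates to one.
t = np.linspace(-10.0, 10.0, 200001)
dt = t[1] - t[0]
p0 = np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

theta = 1.3
K = np.log(np.sum(np.exp(theta * t) * p0) * dt)   # numeric E0[exp(theta*T)]
p_theta = np.exp(theta * t - K) * p0              # constructed PDF (7.1)
print(K, np.sum(p_theta) * dt)                    # K ~ theta^2/2, integral ~ 1
```

Here the tilted PDF is simply N(θ, 1), the classical exponential tilting of a Gaussian.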
In order to estimate the unknown parameters θi in pT(t; θi), we use the
maximum likelihood estimate (MLE) [10]. We see that the constructed PDF in
(7.1) is in the form of an exponential family, which has the following nice
properties:
1. T is a sufficient statistic for the constructed PDF, and hence this PDF
incorporates all the sensor information.
2. K(θi) is convex by Hölder's inequality [11]. Since maximizing pT(t; θi) is
equivalent to maximizing θi^T t − K(θi), this becomes a convex optimization
problem and many existing methods can be readily utilized [12], [13].
3. It can be shown that by maximizing pT(t; θi) over θi, the resulting PDF is
asymptotically the closest to the true PDF pT(t; Hi) in KL divergence [9]. Similar
arguments have been given in [7, 14].
For classification, if we assume equal prior probabilities for the hypotheses,
i.e., p(H1) = p(H2) = · · · = p(HM), the MAP rule reduces to the maximum
likelihood (ML) rule [1]. When the MLE θ̂i is found by maximizing pT(t; θi)
over θi, we take pT(t; θ̂i) as our estimate of pT(t; Hi). Hence, similar to the
ML rule, we decide Hi for which the following is maximum over i:
pT(t; θ̂i)   (7.3)
By the monotonicity of the log function, we can equivalently decide Hi for which
the following is maximum over i:
ln[pT(t; θ̂i)/pT(t; 0)] = θ̂i^T t − K(θ̂i)   (7.4)
We will compare the performance of our classifier to that of the estimated
MAP classifier. The estimated MAP classifier assumes that the PDF of T under
Hi is known except for some unknown underlying parameters αi. We still assume
that p(H1) = p(H2) = · · · = p(HM). So the estimated MAP classifier finds the
MLE of αi and chooses Hi for which the following is maximum over i:
pT(t; α̂i)   (7.5)
where α̂i is the MLE of αi. Note that for the estimated MAP classifier, the αi are
the unknown parameters in the true PDF under Hi, while the θi are the unknown
parameters in the constructed PDF under Hi. Since the constructed PDF may or
may not be the true PDF, the estimated MAP classifier assumes more information
than our classifier.
7.4 A Linear Model Example
Consider the following classification model:
Hi : x = Ai si + w   (7.6)
where si is an N × 1 known signal vector with the same length as x, Ai is the
unknown signal amplitude, and w is white Gaussian noise with known variance
σ^2. Assume that instead of observing x, we can only observe the measurements of
two sensors
T1 = H1^T x
T2 = H2^T x   (7.7)
where H1 is N × p1 and H2 is N × p2. Here p1 and p2 are the lengths of the vectors
T1 and T2, respectively. We can write (7.7) as
T = G^T x   (7.8)
by letting
T = [T1^T, T2^T]^T
and
G = [H1 H2]
where G is N × (p1 + p2) with p1 + p2 ≤ N. We assume that G has full column
rank so that there are no redundant sensor measurements. Note that G
can be any matrix with full column rank.
Let H0 be the reference hypothesis when there is noise only, i.e.,
H0 : x = w   (7.9)
Since x is Gaussian under H0, according to (7.8), T is also Gaussian and
T ∼ N(0, σ^2 G^T G)
under H0. We construct the PDF under Hi as in (7.1) with
K(θi) = ln E0[ exp( θi^T T ) ] = (σ^2/2) θi^T G^T G θi   (7.10)
Hence the constructed PDF is
pT(t; θi) = exp[ θi^T t − K(θi) + ln pT(t; 0) ]
= 1/( (2πσ^2)^{(p1+p2)/2} det^{1/2}(G^T G) ) exp( −t^T (G^T G)^{-1} t / (2σ^2) )
· exp[ θi^T t − (σ^2/2) θi^T G^T G θi ]   (7.11)
which can be simplified as
T ∼ N(σ^2 G^T G θi, σ^2 G^T G)   under Hi   (7.12)
The next step is to find the MLE of θi. Note that the MLE of θi is found by
maximizing θi^T t − K(θi) over θi. If this optimization were carried out without
any constraint, then θ̂i would be the same for all i. Hence we need an implicit
constraint in finding the MLE. Since θi represents the signal under Hi, we should
have
θi = Ai G^T si = E_{Hi}(T)   (7.13)
which is the mean of T under Hi. As a result, (7.12) can be written as
T ∼ N(σ^2 Ai G^T G G^T si, σ^2 G^T G)   under Hi   (7.14)
Thus, instead of finding the MLE of θi by maximizing
θi^T t − K(θi) = θi^T t − (σ^2/2) θi^T G^T G θi   (7.15)
with the constraint in (7.13), we can find the MLE of Ai in (7.14) and then plug
it into (7.13). It can be found that
Âi = si^T G t / ( σ^2 si^T G G^T G G^T si )   (7.16)
and
θ̂i = G^T si si^T G t / ( σ^2 si^T G G^T G G^T si )   (7.17)
Hence, after removing constant factors, the test statistic of our classifier for Hi is
(si^T G t)^2 / ( (G^T si)^T G^T G (G^T si) )   (7.18)
Next we consider the estimated MAP classifier. In this case, we assume that
we know
T ∼ N(Ai G^T si, σ^2 G^T G)   under Hi   (7.19)
Note that (7.19) is the true PDF of T under Hi, while (7.14) is the constructed
PDF. It can be found that the MLE of Ai in the true PDF under Hi is
Âi = si^T G (G^T G)^{-1} t / ( si^T G (G^T G)^{-1} G^T si )   (7.20)
After removing constant terms, the test statistic of the estimated MAP classifier
for Hi is
( si^T G (G^T G)^{-1} t )^2 / ( (G^T si)^T (G^T G)^{-1} (G^T si) )   (7.21)
Note that (7.16) and (7.20) are different because (7.16) is the MLE of Ai under
the constructed PDF and (7.20) is the MLE of Ai under the true PDF.
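A matching numerical check for (7.20) and (7.21): the maximized true log-likelihood ratio under (7.19) equals (7.21) divided by 2σ^2. Again, the signal and sensor matrix are illustrative toy choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def stat_map(t, G, s):
    # Estimated-MAP statistic (7.21) for the hypothesis with signal vector s.
    W = np.linalg.inv(G.T @ G)
    Gs = G.T @ s
    return (Gs @ W @ t) ** 2 / (Gs @ W @ Gs)

N, sigma2 = 20, 0.5
n = np.arange(N)
s = np.cos(2 * np.pi * 0.28 * n) + 0.5
G = np.column_stack([np.ones(N), np.cos(2 * np.pi * 0.17 * n)])

x = 1.0 * s + np.sqrt(sigma2) * rng.standard_normal(N)
t = G.T @ x

# Consistency check: the MLE (7.20) maximizes the true log-likelihood ratio
# under (7.19), and the maximized value equals (7.21) / (2 sigma^2).
W = np.linalg.inv(G.T @ G)
Gs = G.T @ s
A_hat = (Gs @ W @ t) / (Gs @ W @ Gs)                  # MLE (7.20)
r = t - A_hat * Gs
llr = (t @ W @ t - r @ W @ r) / (2.0 * sigma2)
print(stat_map(t, G, s), 2.0 * sigma2 * llr)          # equal
```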
7.5 Simulation Results
For the model in (7.6)
Hi : x = Aisi + w
let A1 = 0.5, A2 = 1, A3 = 1 and
s1(n) = cos(2πf1n) + 1
s2(n) = cos(2πf2n) + 0.5
s3(n) = cos(2πf3n)
where n = 0, 1, . . . , N − 1 with N = 20, and f1 = 0.17, f2 = 0.28, f3 = 0.45.
Let p(H1) = p(H2) = p(H3) = 1/3. Assume that there are three sensors (this is
an extension of the two sensor assumption), each with an observation matrix as
follows, respectively:
H1 = [1 1 · · · 1]^T
H2 = [1 cos(2πf1) · · · cos(2πf1(N − 1));
      1 cos(2πf2) · · · cos(2πf2(N − 1))]^T
H3 = [1 cos(2π(f3 + 0.02)) · · · cos(2π(f3 + 0.02)(N − 1))]^T
Note that in H3, we set the frequency to f3 + 0.02. This is the case when the
knowledge of the frequency is not accurate.
The test statistics in (7.18) and (7.21) are used for the two methods,
respectively. The probabilities of correct classification are plotted versus
ln(1/σ^2) in Figure 7.2. We see that their performances appear to be the same,
and the probability of correct classification goes to 1 as σ^2 → 0.
7.6 Conclusion
A novel method of constructing the joint PDF of sensor outputs for classifica-
tion has been proposed. Only a reference PDF is needed in the construction. The
Figure 7.2. Probability of correct classification versus ln(1/σ^2) for the estimated
MAP classifier and our method.
constructed PDF is asymptotically the closest to the true PDF in KL divergence,
and hence it is asymptotically optimal. When applied to distributed classification,
its performance is shown to be as good as that of the estimated MAP classifier,
which assumes more information than our classifier.
List of References
[1] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.
[2] S. Thomopoulos, R. Viswanathan, and D. Bougoulias, "Optimal distributed decision fusion," IEEE Trans. Aerosp. Electron. Syst., vol. 25, pp. 761–765, Sep. 1989.
[3] Z. Chair and P. Varshney, "Optimal data fusion in multiple sensor detection systems," IEEE Trans. Aerosp. Electron. Syst., vol. 22, pp. 98–101, Jan. 1986.
[4] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, pp. 226–239, Mar. 1998.
[5] A. Sundaresan, P. Varshney, and N. Rao, "Distributed detection of a nuclear radioactive source using fusion of correlated decisions," in Proc. 10th International Conference on Information Fusion, 2007, pp. 1–7.
[6] S. Iyengar, P. Varshney, and T. Damarla, "A parametric copula based framework for multimodal signal processing," in ICASSP, 2009, pp. 1893–1896.
[7] S. Kay and Q. Ding, "Exponentially embedded families for multimodal sensor processing," in ICASSP, Mar. 2010, pp. 3770–3773.
[8] S. Kay, A. Nuttall, and P. Baggenstoss, "Multidimensional probability density function approximations for detection, classification, and model order selection," IEEE Trans. Signal Process., vol. 49, pp. 2240–2252, Oct. 2001.
[9] S. Kay, Q. Ding, and D. Emge, "Joint pdf construction for sensor fusion and distributed detection," in International Conference on Information Fusion, Jun. 2010.
[10] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[11] L. Brown, Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, 1986.
[12] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[13] D. Luenberger, Linear and Nonlinear Programming, 2nd ed. Springer, 2003.
[14] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333–345, Jan. 2005.
BIBLIOGRAPHY
Akaike, H., “Information theory and an extension of the likelihood principle,” inProceedings of the Second International Symposium of Information Theory,1973.
Akaike, H., “A new look at the statistical model identification,” IEEE Trans.Autom. Control, vol. 19, pp. 716–723, Dec. 1974.
Alam, M., Nazrul Islam, M., Bal, A., and Karim, M., “Hyperspectral target de-tection using gaussian filter and post-processing,” Optics and Lasers in Engi-neering, vol. 46, pp. 817–822, Nov. 2008.
Bickel, P. and Doksum, K., Mathematical Statistics: Basic Ideas and SelectedTopics. Pearson Prentice Hall, 2006, vol. 1.
Bowyer, D., Rajasekaran, P., and Gebhart, W., “Adaptive clutter filtering usingautoregressive spectral estimation,” IEEE Trans. Aerosp. Electron. Syst., pp.538–546, Jul. 1979.
Boyd, S. and L.Vandenberghe, Convex Optimization. Cambridge University Press,2004.
Brown, L., Fundamentals of Statistical Exponential Families. Institute of Math-ematical Statistics, 1986.
Chair, Z. and Varshney, P., “Optimal data fusion in multiple sensor detectionsystems,” IEEE Trans. Aerosp. Electron. Syst., vol. 22, pp. 98–101, Jan. 1986.
Chung, P.-J., “ML estimation under misspecified number of signals,” in the 39thAsilomar Conference on Signals, Systems, and Computers, Nov. 2005.
Chung, P.-J., “Stochastic maximum likelihood estimation under misspecified num-bers of signals,” IEEE Trans. Signal Process., vol. 55, pp. 4726–4731, Sep.2007.
Chyba, T., Higdon, N., Armstrong, W., Lobb, C., Ponsardin, P., Richter, D.,Kelly, B., Bui, Q., Babnick, R., Boysworth, M., Sedlacek, A., and Christesen,S., “Field tests of the laser interrogation of surface agents (lisa) system foron-the-move standoff sensing of chemical agents,” in Proc. Int. Symp. SpectralSensing Research, 2003.
Cover, T. and Thomas, J., Elements of Information Theory, 2nd ed. John Wileyand Sons, 2006.
152
Eriksson, K., Estep, D., and Johnson, C., Applied Mathematics, Body and Soul:Calculus in Several Dimensions. Springer, 2004.
Fisher, R., “On the mathematical foundations of theoretical statistics,” Philos.Trans. Royal Soc. London, vol. 222, no. 594-604, pp. 309–368, Jan. 1922.
Frost, R., Henry, D., and Erickson, K., “Raman spectroscopic detection of wyartitein the presence of rabejacite,” Journal of Raman Spectroscopy, vol. 35, pp.255–260, 2004.
Grimmett, G. and Stirzaker, D., Probability and Random Processes, 3rd ed. Ox-ford University Press, 2001.
Hayazawa, N., Motohashi, M., Saito, Y., and Kawata, S., “Highly sensitive straindetection in strained silicon by surface-enhanced raman spectroscopy,” AppliedPhysics Letters, vol. 86, pp. 263 114 – 263 114–3, 2005.
Higgins, J., “Some surface integral techniques in statistics,” The American Statis-tician, vol. 29, pp. 43–46, Feb. 1975.
Iyengar, S., Varshney, P., and Damarla, T., “A parametric copula based frameworkfor multimodal signal processing,” in ICASSP, 2009, pp. 1893–1896.
Kass, R. and Vos, P., Geometrical Foundations of Asymptotic Inference. Wiley,1997.
Kay, S., Modern Spectral Estimation: Theory and Application. Englewood Cliffs,NJ: Prentice-Hall, 1988.
Kay, S., Fundamentals of Statistical Signal Processing: Estimation Theory. En-glewood Cliffs, NJ: Prentice-Hall, 1993.
Kay, S., Fundamentals of Statistical Signal Processing: Detection Theory. Engle-wood Cliffs, NJ: Prentice-Hall, 1998.
Kay, S., “Model based probability density function estimation,” IEEE Signal Pro-cess. Lett., pp. 318–320, Dec. 1998.
Kay, S., “Exponentially embedded families - new approaches to model order es-timation,” IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333–345, Jan.2005.
Kay, S., “Asymptotically optimal approximation of multidimensional pdf’s bylower dimensional pdf’s,” IEEE Trans. Signal Process., vol. 55, pp. 725–729,Feb. 2007.
Kay, S. and Ding, Q., “Exponentially embedded families for multimodal sensorprocessing,” in ICASSP, Mar. 2010, pp. 3770–3773.
153
Kay, S., Ding, Q., and Emge, D., “Joint pdf construction for sensor fusion anddistributed detection,” in International Conference on Information Fusion,Jun. 2010.
Kay, S., Ding, Q., and Rangaswamy, M., “Sensor integration for classification,” inAsilomar Conference on Signals, Systems, and Computers, Nov. 2010.
Kay, S., Nuttall, A., and Baggenstoss, P., “Multidimensional probability densityfunction approximations for detection, classification, and model order selec-tion,” IEEE Trans. Signal Process., vol. 49, pp. 2240–2252, Oct. 2001.
Kay, S. and Salisbury, J., “Improved active sonar detection using autoregressiveprewhiteners,” J. Acoustical Soc. of America, pp. 1603–1611, Apr. 1990.
Kay, S., Xu, C., and Emge, D., “Chemical detection and classification in ramanspectra,” in Proceedings of the SPIE, vol. 6969, Mar. 2008, pp. 4–12.
Kittler, J., Hatef, M., Duin, R., and Matas, J., “On combining classifiers,” IEEETrans. Pattern Anal. Mach. Intell., vol. 20, pp. 226–239, Mar. 1998.
Kneipp, K., Kneipp, H., Itzkan, I., Dasari, R., and Feld, M., “Ultrasensitive chemi-cal analysis by raman spectroscopy,” Chemical Reviews, vol. 99, p. 2957C2975,1999.
Knight, W., Pridham, R., and Kay, S., “Digital signal processing for sonar,” inProceedings of the IEEE, Nov. 1981, pp. 1451–1506.
Kullback, S., Information Theory and Statistics, 2nd ed. Courier Dover Publica-tions, 1997.
Lawson, C. and Hanson, R., Solving Least Squares Problems. SIAM, 1995.
Lehmann, E., Elements of Large-Sample Theory. Springer, 1998.
Lehmann, E. and Romano, J., Testing Statistical Hypotheses, 3rd ed. Springer,2005.
Liavas, A. and Regalia, P., “On the behavior of information theoretic criteria formodel order selection,” IEEE Trans. Signal Process., vol. 49, pp. 1689–1695,Aug. 2001.
Lilliefors, H., “On the kolmogorov-smirnov test for normality with mean and vari-ance unknown,” Journal of the American Statistical Association, vol. 62, pp.399–402, 1967.
Luenberger, D., Linear and Nonlinear Programming, 2nd ed. Springer, 2003.
154
Manolakis, D., Marden, D., and Shaw, G., "Hyperspectral image processing for automatic target detection applications," Lincoln Laboratory Journal, vol. 14, no. 1, pp. 79–116, 2003.
Pages-Zamora, A. and Lagunas, M., "New approaches in non-linear signal processing: Estimation of the probability density function by spectral estimation methods," in IEEE Workshop on Higher Order Statistics, 1995.
Pfanzagl, J. and Wefelmeyer, W., Contributions to a General Asymptotic Statistical Theory, ser. Lecture Notes in Statistics. Springer-Verlag, 1982, vol. 13.
Portnov, A., Rosenwaks, S., and Bar, I., "Detection of particles of explosives via backward coherent anti-Stokes Raman spectroscopy," Applied Physics Letters, vol. 93, pp. 041115–041115-3, 2008.
Renaux, A., Forster, P., Chaumette, E., and Larzabal, P., "On the high-SNR conditional maximum-likelihood estimator full statistical characterization," IEEE Trans. Signal Process., vol. 54, pp. 4840–4843, Dec. 2006.
Rissanen, J., "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, 1978.
Rudin, W., Principles of Mathematical Analysis, 3rd ed. McGraw-Hill, 1976.
Rudin, W., Functional Analysis. McGraw-Hill, 1991.
Scharf, L. and Friedlander, B., "Matched subspace detectors," IEEE Trans. Signal Process., vol. 42, no. 8, pp. 2146–2157, Aug. 1994.
Schwarz, G., "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
Stoica, P. and Selen, Y., "Model-order selection: A review of information criterion rules," IEEE Signal Process. Mag., vol. 21, pp. 36–47, Jul. 2004.
Sundaresan, A., Varshney, P., and Rao, N., "Distributed detection of a nuclear radioactive source using fusion of correlated decisions," in Proc. 10th International Conference on Information Fusion, 2007, pp. 1–7.
Thomopoulos, S., Viswanathan, R., and Bougoulias, D., "Optimal distributed decision fusion," IEEE Trans. Aerosp. Electron. Syst., vol. 25, pp. 761–765, Sep. 1989.
van der Vaart, A. W., Asymptotic Statistics. Cambridge University Press, 2000.
Wang, W. and Adali, T., "Constrained ICA and its application to Raman spectroscopy," in Proc. Antennas and Propagation Society International Symposium, Jul. 2005, pp. 109–112.
Wang, W., Adali, T., and Emge, D., "Unsupervised detection using canonical correlation analysis and its application to Raman spectroscopy," in Proc. IEEE Workshop on Machine Learning for Signal Processing, Aug. 2007.
Wang, W., Adali, T., and Emge, D., "Subspace partitioning for target detection and identification," IEEE Trans. Signal Process., vol. 57, no. 4, pp. 1250–1259, Apr. 2009.
Wax, M. and Kailath, T., "Detection of signals by information theoretic criteria," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, pp. 387–392, Apr. 1985.
Westover, M., "Asymptotic geometry of multiple hypothesis testing," IEEE Trans. Inf. Theory, vol. 54, no. 7, pp. 3327–3329, Jul. 2008.
White, H., "Maximum likelihood estimation of misspecified models," Econometrica, vol. 50, no. 1, pp. 1–25, Jan. 1982.
Wiley, R., ELINT: The Interception and Analysis of Radar Signals. Boston, MA: Artech House, 2006.
Xu, C. and Kay, S., "Source enumeration via the EEF criterion," IEEE Signal Process. Lett., vol. 15, pp. 569–572, 2008.
Xu, W. and Kaveh, M., "Analysis of the performance and sensitivity of eigendecomposition-based detectors," IEEE Trans. Signal Process., vol. 43, pp. 1413–1426, Jun. 1995.