STATISTICAL SIGNAL PROCESSING AND ITS APPLICATIONS TO
DETECTION, MODEL ORDER SELECTION, AND CLASSIFICATION
BY
QUAN DING
A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
IN
ELECTRICAL ENGINEERING
UNIVERSITY OF RHODE ISLAND
2011
DOCTOR OF PHILOSOPHY DISSERTATION
OF
QUAN DING
APPROVED:
Dissertation Committee:
Major Professor
DEAN OF THE GRADUATE SCHOOL
UNIVERSITY OF RHODE ISLAND
2011
ABSTRACT
This dissertation focuses on topics in statistical signal processing, including
detection and estimation theory, information fusion, and model order selection,
as well as their applications to standoff detection.
Model order selection is a very common problem in statistical signal processing.
In composite multiple hypothesis testing, the maximum likelihood rule will
always choose the hypothesis with the largest order if the parameters in each
candidate hypothesis are hierarchically nested. Hence, many methods have been
proposed to offset this overestimating tendency by introducing a penalty term.
Two popular methods are the minimum description length (MDL) and the Akaike
information criterion (AIC). It has been shown that the MDL is consistent and the
AIC tends to overestimate the model order as the sample size goes to infinity. In this
dissertation, we show that for a fixed sample size, the MDL and the AIC are
inconsistent as the noise variance goes to zero. This result is surprising since,
intuitively, a good model order selection criterion should choose the correct model
when the noise is small enough. Moreover, it is proved that the exponentially
embedded family (EEF) criterion is consistent as the noise variance goes to zero.
Standoff detection aims to detect hazardous substances in an effort to keep
people away from potential damage and danger. Our work in standoff detection
develops algorithms for the detection and classification of surface chemical
agents using Raman spectra. We use an autoregressive model to fit the Raman
spectra, develop an unsupervised detection algorithm followed by a classification
scheme, and control the false alarm rate to a low level while maintaining
very good detection and classification performance.
In information fusion and sensor integration, multiple sensors of the same or
different types are deployed in order to obtain more information and make a better
decision than with a single sensor. A common and simple approach is to assume that
the measurements of the sensors are independent, so that the joint probability
density function (PDF) is the product of the marginal PDFs. However, this
assumption does not hold if the measurements are correlated. We have proposed a novel
method of constructing the joint PDF using the exponential family. This method
combines all the available information in a multi-sensor setting from a statistical
standpoint. It is shown that this method is asymptotically optimal in minimizing
the Kullback-Leibler divergence, and it attains detection/classification
performance comparable to that of existing methods.
The maximum likelihood estimator (MLE) is the most popular method of
parameter estimation. It is asymptotically optimal in that it approximates the
minimum variance unbiased (MVU) estimator for large data records. Under a
misspecified model, it is well known that the MLE still converges to a well-defined
limit as the sample size goes to infinity. We have proved that, under some
regularity conditions, the MLE under a misspecified model also converges to a
well-defined limit at high signal-to-noise ratio (SNR). This result enables important
performance analysis of the MLE under a misspecified model.
ACKNOWLEDGMENTS
First of all, I would like to thank my advisor Dr. Steven Kay for his guidance,
support, patience and understanding during my five-year graduate studies at URI.
I thank him for sending me to many mathematics classes, which helped me develop
mathematical skills for my research. I also thank him for such careful proofreading
of all my papers. I always had a lot of typos and grammatical errors in the draft
versions of papers. It was a great pleasure to work with him on a variety of topics
in statistical signal processing. It was he who led me into such an interesting area
and taught me how to do research. It was really an honor to be his student.
I am also grateful to the faculty of the Department of Electrical, Computer,
and Biomedical Engineering and the Department of Mathematics, especially my
committee members Dr. Kay, Dr. Swaszek, Dr. Pakula, Dr. He, and Dr. Merino
for their help and efforts in participating in my comprehensive exam and disserta-
tion defense.
I must also thank Meredith Leach Sanders for taking care of all my paperwork.
She keeps everything in mind and never forgets to send us a friendly reminder. The
department would not be able to run without her.
I would also like to thank Dr. Pakula of the Department of Mathematics for
his inspiring classes. As an engineering student, I really like the way he teaches a
math class.
I would like to thank all my friends for their encouragement and help.
Finally, I am thankful for my family who always support me with their love
and trust. I thank my parents for everything they have done for me since I was
born. I thank my girlfriend Xiaorong. She makes my life much more beautiful. I
am so thankful that I met her at URI. I also thank her parents for raising such a
nice, wonderful, decent girl.
PREFACE
This dissertation is organized in the manuscript format, consisting of seven
manuscripts. The topics and publications of the manuscripts are as follows:
Manuscript 1: (Model order selection)
Q. Ding and S. Kay, “Inconsistency of the MDL: On the Performance of
Model Order Selection Criteria with Increasing Signal-to-Noise Ratio,” to be
published in IEEE Transactions on Signal Processing.
Manuscript 2: (Standoff detection and classification)
Q. Ding, S. Kay, C. Xu, and D. Emge, “Autoregressive Modeling of Ra-
man Spectra for Detection and Classification of Surface Chemicals,” to be
published in IEEE Transactions on Aerospace and Electronic Systems.
Manuscript 3: (Sensor integration)
S. Kay, Q. Ding, and M. Rangaswamy, “Sensor Integration for Distributed
Detection and Classification,” submitted to IEEE Transactions on Aerospace
and Electronic Systems.
Manuscript 4: (Parameter estimation)
Q. Ding and S. Kay, “Maximum Likelihood Estimator under Misspecified
Model with High Signal-to-Noise Ratio,” submitted to IEEE Transactions
on Signal Processing.
Manuscript 5: (Sensor integration)
S. Kay and Q. Ding, “Exponentially Embedded Families for Multimodal
Sensor Processing,” in Proc. IEEE International Conference on Acoustics,
Speech, and Signal Processing, Mar. 2010, pp. 3770-3773.
Manuscript 6: (Sensor integration)
S. Kay, Q. Ding, and D. Emge, “Joint PDF Construction for Sensor Fusion
and Distributed Detection,” in Proc. International Conference on Informa-
tion Fusion, Jun. 2010.
(This paper has been awarded Runner up for the Best Student Paper Award
at the 13th International Conference on Information Fusion.)
Manuscript 7: (Sensor integration)
S. Kay, Q. Ding, and M. Rangaswamy, “Sensor Integration for Classification,”
in Proc. Asilomar Conference on Signals, Systems, and Computers, Nov.
2010.
TABLE OF CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . iv
PREFACE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
TABLE OF CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
MANUSCRIPT
1 Inconsistency of the MDL: On the Performance of Model Order Selection Criteria with Increasing Signal-to-Noise Ratio 1
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Inconsistency of the MDL and the AIC . . . . . . . . . . . . . . 4
1.3.1 The Linear Model . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Inconsistency of the MDL . . . . . . . . . . . . . . . . . 5
1.4 Consistency of the EEF . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 Consistency of the EEF for the Linear Model . . . . . . . 7
1.4.2 Consistency of the EEF in General . . . . . . . . . . . . 10
1.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.1 Linear Signal . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.2 Non-Linear Signal . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Appendix 1A - Derivation of the Distribution of yj’s for j ≥ p . . . . 18
Appendix 1B - Derivation of the Distribution of yj’s for j < p . . . . 21
Appendix 1C - Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . 22
Appendix 1D - Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . 25
Appendix 1E - Proof of Theorem 5 . . . . . . . . . . . . . . . . . . . 26
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2 Autoregressive Modeling of Raman Spectra for Detection and Classification of Surface Chemicals . . . . . . . . . . . . . . 29
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Problem Statement and Rationale of Approach . . . . . . . . . . 31
2.3 Spectral Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Detection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.1 Test Statistic . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.2 Overall Detection Algorithm . . . . . . . . . . . . . . . . 39
2.5 Experimental Detection Performance for Field Background Data 42
2.6 Experimental False Alarm Rate Performance . . . . . . . . . . . 45
2.7 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.7.1 Classification if Only One of M Chemicals Is Present . . 47
2.7.2 Classification if K out of M Chemicals Are Present . . . 48
2.7.3 Model Order Selection on How Many Chemicals Are Present in the Mixture . . . . . . . . . . . . . . . . . . . 49
2.8 Experimental Classification Performance for Field Background Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Appendix 2A - Derivation of Estimating the AR Model Order . . . . 57
Appendix 2B - Derivation of Test Statistic for Detection . . . . . . . 60
Appendix 2C - Derivation of Probability of Detection Statistic Threshold Crossing for Given False Alarm Rate . . . . . . . . . . 61
Appendix 2D - Derivation of LMP Test Statistic for Classification . . 63
Appendix 2E - Derivation of The Asymptotic Likelihood Function Method for Classification of Mixture of Chemicals . . . . . . 66
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3 Sensor Integration for Distributed Detection and Classification 72
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3 Joint PDF Construction by Exponential Family and Its Application in Distributed Systems . . . . . . . . . . . . . . . . . 76
3.4 KL Divergence Between The True PDF and The Constructed PDF 78
3.5 Examples-Distributed Detection . . . . . . . . . . . . . . . . . . 80
3.5.1 Partially Observed Linear Model with Gaussian Noise . . 81
3.5.2 Partially Observed Linear Model with Gaussian Mixture Noise . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.6 Examples-Distributed Classification . . . . . . . . . . . . . . . . 88
3.6.1 Linear Model with Known Variance . . . . . . . . . . . . 89
3.6.2 Linear Model with Unknown Variance . . . . . . . . . . . 92
3.6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.7 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.7.1 Distributed Detection . . . . . . . . . . . . . . . . . . . . 93
3.7.2 Distributed Classification . . . . . . . . . . . . . . . . . . 95
3.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4 Maximum Likelihood Estimator under Misspecified Model with High Signal-to-Noise Ratio . . . . . . . . . . . . . . . . . . 100
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.2 White’s Results: QMLE for Large Data Records . . . . . . . . . 102
4.3 QMLE with High SNR . . . . . . . . . . . . . . . . . . . . . . . 103
4.3.1 Misspecified Observation Model . . . . . . . . . . . . . . 103
4.3.2 Performance of QMLE as σ2 → 0 . . . . . . . . . . . . . 104
4.4 A Misspecified Linear Model Example . . . . . . . . . . . . . . . 107
4.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5 Exponentially Embedded Families for Multimodal Sensor Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.2 EEF and Its Properties . . . . . . . . . . . . . . . . . . . . . . . 117
5.3 EEF for Hypothesis Testing . . . . . . . . . . . . . . . . . . . . 120
5.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6 Joint PDF Construction for Sensor Fusion and Distributed Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.3 Construction of Joint PDF for Detection . . . . . . . . . . . . . 130
6.4 KL Divergence Between The True PDF and The Constructed PDF . . 132
6.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.5.1 Partially Observed Linear Model with Gaussian Noise . . 133
6.5.2 Partially Observed Linear Model with Non-Gaussian Noise . . 135
6.6 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7 Sensor Integration for Classification . . . . . . . . . . . . . . . . 141
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.3 Joint PDF Construction and Its Application in Classification . . 143
7.4 A Linear Model Example . . . . . . . . . . . . . . . . . . . . . . 146
7.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
LIST OF TABLES
Table Page
3.1 Comparison of our test statistic and the clairvoyant GLRT . . . . . 88
3.2 Comparison of our test statistic and the estimated MAP classifier . 93
LIST OF FIGURES
Figure Page
1.1 Performance of MDL, AIC and EEF for the linear model when H1
is true (M=2, N=20). . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Performance of MDL, AIC and EEF in estimating the polynomial model order when H3 is true (M=4, N=20). . . . . . . . . . . 16
1.3 Probability of correct selection for MDL, AIC and EEF in estimating the number of sinusoids when H2 is true (M=3, N=20). . . 19
2.1 AR spectral estimate and background spectral data for asphalt surface (Fc = 3300). . . . . . . . . . . . . . . . . . . . . . . . 36
2.2 AR spectrum for asphalt surface plus an artificial signal (Fc = 3300). 36
2.3 Spectra of the chemicals that are used in simulations. . . . . . . . . 43
2.4 Probability PDp of detecting chemical 15 versus SNR based on a single pulse. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5 Probability PDp of detecting chemical 31 versus SNR based on a single pulse. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6 False alarms for a concrete background and fixed threshold. . . . . . 46
2.7 Probability of correct single pulse classification versus SNR. Chemical 15 is present. . . . . . . . . . . . . . . . . . . . . . . . . 51
2.8 Probability of correct single pulse classification versus SNR. Chemical 31 is present. . . . . . . . . . . . . . . . . . . . . . . . . 51
2.9 Probability of correct single pulse classification versus SNR. Chemical 45 is present. . . . . . . . . . . . . . . . . . . . . . . . . 52
2.10 Probability of correct single pulse classification versus SNR. Chemicals 15 and 16 are present. . . . . . . . . . . . . . . . . . . 53
2.11 Probability of correct single pulse classification versus SNR. Chemicals 15 and 16 are present. Chemical 29 is removed from the library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.12 Probability of correct single pulse classification versus SNR. Chemicals 20 and 45 are present. . . . . . . . . . . . . . . . . . . 54
2.13 Probability of correct single pulse classification versus SNR. Chemicals 31 and 45 are present. . . . . . . . . . . . . . . . . . . 54
2.14 Probability of correct single pulse classification versus SNR using EEF and MDL. Chemicals 15, 56 and 58 are present. . . . . . 55
2.15 Probability of correct single pulse classification versus SNR using EEF and MDL. Chemicals 15, 31 and 45 are present. . . . . . 56
2.16 Probability of correct single pulse classification versus SNR using EEF and MDL. Chemicals 20 and 45 are present. . . . . . . . 56
2.17 Probability P1 of at most one false alarm per two hours versus PFAb. 63
2.18 Probability of at most one false alarm per two hours versus PFAp . . 64
3.1 Distributed detection/classification system with two sensors . . . . 75
3.2 ROC curves for the GLRT using the constructed PDF and the clairvoyant GLRT with uncorrelated Gaussian mixture noise. . . . 95
3.3 ROC curves for the GLRT using the constructed PDF and the clairvoyant GLRT with correlated Gaussian mixture noise. . . . . 95
3.4 Probability of correct classification for both methods. . . . . . . . . 97
3.5 Probability of correct classification for both methods. . . . . . . . . 98
4.1 The periodogram It(f) = (1/N)|Σ_{n=0}^{N−1} st[n] exp(−j2πfn)|². In this case, f∗ ≈ f2 = 0.33. . . . . . . . . . . . . . . . . . . 110
4.2 Convergence of A, f , φ as σ2 → 0. . . . . . . . . . . . . . . . . . . . 111
4.3 The periodogram It(f) = (1/N)|Σ_{n=0}^{N−1} st[n] exp(−j2πfn)|². In this case, f1 < f∗ < f2. . . . . . . . . . . . . . . . . . . . 112
4.4 Convergence of A, f , φ as σ2 → 0. . . . . . . . . . . . . . . . . . . . 113
4.5 Test statistics of Lilliefors test for A, f , φ as σ2 → 0. We have 1600 realizations of {A, f , φ} for each σ2. . . . . . . . . . . . 114
4.6 Histogram of f . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.1 ROC curves for different detectors. . . . . . . . . . . . . . . . . . . 124
6.1 Distributed detection system with two sensors . . . . . . . . . . . . 130
6.2 ROC curves for the GLRT using the constructed PDF and the clairvoyant GLRT with uncorrelated Gaussian mixture noise. . . . 138
6.3 ROC curves for the GLRT using the constructed PDF and the clairvoyant GLRT with correlated Gaussian mixture noise. . . . . 138
7.1 Distributed classification system with two sensors. . . . . . . . . . . 144
7.2 Probability of correct classification for both methods. . . . . . . . . 150
MANUSCRIPT 1
Inconsistency of the MDL: On the Performance of Model Order Selection Criteria with Increasing Signal-to-Noise Ratio
Abstract
In the problem of model order selection, it is well known that the widely
used minimum description length (MDL) criterion is consistent as the sample size
N → ∞. But the consistency as the noise variance σ2 → 0 has not been studied.
In this paper, we find that the MDL is inconsistent as σ2 → 0. The result shows
that the MDL has a tendency to overestimate the model order. We also prove
that another criterion, the exponentially embedded family (EEF), is consistent as
σ2 → 0. Therefore, in a high signal-to-noise ratio (SNR) scenario, the EEF provides a
better criterion to use for model order selection.
1.1 Introduction
Model order selection is a fundamental problem in signal processing. It has
many practical applications such as radar, computer vision and biomedical systems.
Model order selection is essentially a composite hypothesis testing problem, for which
the probability density functions (PDFs) are known except for some parameters.
Without the knowledge of those parameters, there exists no optimal solution. A
simple and common approach is the generalized likelihood ratio test (GLRT) which
replaces the unknown parameters by their maximum likelihood estimates (MLEs).
However in the case when the model orders are hierarchically nested, the GLRT
philosophy does not work since it will always choose the largest candidate order
(see [1] for a simple example). Many methods have been proposed to offset this
overestimating tendency based on different information criteria, such as the Akaike
information criterion (AIC) [2], the MDL [3], [4], and the EEF [5]. The reader may
wish to read [6] for a review of information criterion rules for model order selection.
One would prefer a criterion that always chooses the true model order given a
large enough number of samples. The consistency of the MDL and the inconsistency
of the AIC as the sample size N → ∞ were shown in [7]: the MDL picks the true
order with probability one, while the AIC tends to overestimate the model order
as N → ∞. The consistency of the EEF as N → ∞ is shown in [8].
Beyond the above consistency as N → ∞, one would also wish a criterion
to possess another type of consistency, which we call consistency as σ2 → 0: the
criterion chooses the true model order in probability as the noise level decreases
to zero. This is the consistency that we will discuss throughout this paper. Fisher
consistency [9] is the same as consistency as σ2 → 0 in parameter estimation in
curved exponential families [10]. To our knowledge, no work has been done on
consistency as σ2 → 0 for model order selection criteria. In
this paper, we will show that the MDL and the AIC are inconsistent as the noise
variance σ2 → 0. This means that even under high SNR conditions, the MDL and
the AIC still tend to overestimate the model order. Note that the overestimation of
the MDL and the AIC has also been noticed in [11], [12] for some array processing
problems. We then show that the EEF is consistent as σ2 → 0. Simulation results
are provided to support our analysis.
The paper is organized as follows. Section 1.2 presents the problem and the
model order selection criteria. Then we introduce a linear model and show the
inconsistency as σ2 → 0 for the MDL and the AIC in Section 1.3. In Section 1.4,
we prove that the EEF is consistent as σ2 → 0. Simulation results are given in
Section 1.5 to justify our derivation. Finally, Section 1.6 concludes the paper.
1.2 Problem Statement
Consider the multiple composite hypothesis testing problem where we have
M candidate models. Under each model Hi, we have
Hi : x = si(θi) + w = si(θi) + σu (1.1)
for i = 1, 2, . . . , M. Here x is an N × 1 vector of samples, the N × 1 signal si(θi) is
known except for the unknown i × 1 vector of parameters θi, w = σu is the N × 1
noise vector with known variance σ2, and u has a well-defined PDF. So each Hi is
described by a PDF p(x; θi). We assume that the model orders are hierarchically
nested, i.e., the signal si(θi) can be written as

si(θi) = s([θ1, . . . , θi, 0, . . . , 0]^T)    (1.2)

where s is a function of an M × 1 vector, for i = 1, 2, . . . , M. Thus the unknown
parameters of a higher-order signal contain all of those of a lower-order model.
Let H0 be a reference hypothesis with s([0, 0, . . . , 0]^T) = 0, so that the PDF p(x; θ0)
is completely known as noise only. Then the MDL, AIC and EEF rules choose the
model order that maximizes, respectively,

−MDL(i) = lGi(x) − i ln N

−AIC(i) = lGi(x) − 2i

EEF(i) = ( lGi(x) − i [ ln( lGi(x)/i ) + 1 ] ) u( lGi(x)/i − 1 )

for i = 1, 2, . . . , M, where u(·) is the unit step function and
lGi(x) = 2 ln [ p(x; θ̂i) / p(x; θ0) ], with θ̂i the MLE of θi. Note that the inclusion
of the term −2 ln p(x; θ0) does not affect the maximization, so we use the
log-likelihood ratio instead of the more usual log-likelihood for the MDL and the
AIC. Note also that we assume a real signal model in (1.1); the results in this paper
can be easily extended to a complex signal model. In the next section we will
implement these rules in the linear model to show the inconsistency of the MDL
and the AIC as σ2 → 0.
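As a concrete illustration (not part of the paper), the three selection rules can be scored directly once the generalized likelihood ratio statistics lGi(x) are available. The sketch below uses purely hypothetical lGi values:

```python
import numpy as np

def neg_mdl(lG, i, N):
    # -MDL(i) = lGi(x) - i*ln(N); the MDL picks the order maximizing this
    return lG - i * np.log(N)

def neg_aic(lG, i):
    # -AIC(i) = lGi(x) - 2i
    return lG - 2.0 * i

def eef(lG, i):
    # EEF(i) = (lG - i*[ln(lG/i) + 1]) * u(lG/i - 1), with u the unit step
    if lG / i <= 1.0:
        return 0.0
    return lG - i * (np.log(lG / i) + 1.0)

# Hypothetical GLR statistics lGi(x) for candidate orders i = 1..4
# (illustrative numbers only, not from the paper)
lG = {1: 50.0, 2: 90.0, 3: 91.5, 4: 93.0}
N = 20

best_mdl = max(lG, key=lambda i: neg_mdl(lG[i], i, N))
best_aic = max(lG, key=lambda i: neg_aic(lG[i], i))
best_eef = max(lG, key=lambda i: eef(lG[i], i))
print(best_mdl, best_aic, best_eef)   # all three select order 2 here
```

With these numbers the small gains in lGi beyond i = 2 do not outweigh any of the penalty terms, so all three rules agree; the analysis that follows concerns regimes where they do not.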
1.3 Inconsistency of the MDL and the AIC
Without causing any confusion, we will use consistency to mean consistency
as σ2 → 0 for the rest of the paper unless otherwise mentioned. In this section,
we will limit the derivation to the MDL. We will start by introducing the linear
model with Gaussian noise, from which we derive the performance of the MDL.
Then the inconsistency of the MDL is readily seen. Although the inconsistency of
the MDL is proved here for the linear model with Gaussian noise, the MDL should
be expected to be inconsistent in general (for non-linear, non-Gaussian models). The
inconsistency of the AIC follows directly from the analysis of the MDL.
1.3.1 The Linear Model
Consider the following linear model:
Hi : x = Hiθi + w for i = 1, 2, . . . , M
where M is the maximum order of all the candidate models, Hi = [h1,h2, . . . ,hi]
is an N × i (with N > M) known observation matrix with full column rank,
θi = [θ1, θ2, . . . , θi]T is an i × 1 unknown parameter vector of the amplitudes, and
w is an N × 1 white Gaussian noise vector with known variance σ2. For the linear
model, lGi(x) = x^T Pi x / σ2, where Pi = Hi (Hi^T Hi)^{-1} Hi^T is the projection
matrix that projects x onto the subspace Vi generated by h1, h2, . . . , hi [13]. So the
MDL rule chooses the model order that minimizes

MDL(i) = −x^T Pi x / σ2 + i ln N    for i = 1, 2, . . . , M

Let yi = x^T Pi+1 x / σ2 − x^T Pi x / σ2 for i = 1, 2, . . . , M − 1; we then have the
following theorem.
(See Appendix 1A for the proof of Theorem 1)
Theorem 1 (PDF of yj for j ≥ p). If the true model order is Hp (p ≤ M), that
is, θi = 0 for all i > p, then the yj's for all j ≥ p do not depend on θp or σ2,
and they are independent and identically distributed (IID), each with a chi-square
distribution with 1 degree of freedom.
As we will show next, this theorem gives us a way to find a lower bound of
the probability that the MDL will choose the wrong model order.
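Theorem 1 also lends itself to a quick numerical check. The sketch below is a Monte Carlo illustration with an arbitrarily drawn observation matrix, amplitudes, and noise variance (none of these numbers come from the paper); for j ≥ p the yj samples should exhibit the χ²_1 mean of 1 and variance of 2:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, p = 20, 4, 2                      # samples, max order, true order (hypothetical)
sigma2 = 0.5
H = rng.standard_normal((N, M))         # randomly drawn full-rank observation matrix
theta_p = np.array([1.0, -2.0])         # true amplitudes under H_p

def proj(Hi):
    # projection matrix onto the column space of Hi
    return Hi @ np.linalg.solve(Hi.T @ Hi, Hi.T)

P = [proj(H[:, :i]) for i in range(1, M + 1)]

# Monte Carlo samples of y_j = x^T P_{j+1} x / sigma^2 - x^T P_j x / sigma^2
trials = 4000
y = np.empty((trials, M - 1))
for t in range(trials):
    x = H[:, :p] @ theta_p + np.sqrt(sigma2) * rng.standard_normal(N)
    q = [x @ Pi @ x / sigma2 for Pi in P]
    y[t] = np.diff(q)

# For j >= p, Theorem 1 predicts chi^2_1: sample mean near 1, variance near 2
print(y[:, p - 1:].mean(axis=0))
print(y[:, p - 1:].var(axis=0))
```

The key mechanism is visible in the code: for j ≥ p the signal lies in Vj, so (Pj+1 − Pj)x involves only the noise, and the yj statistics no longer depend on θp or σ2.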
1.3.2 Inconsistency of the MDL
We will show that the probability of overestimation does not converge to zero
as σ2 → 0.
If Hp (p < M) is true, then the probability that the MDL will choose the wrong
model order is

Pe = Pr{Hj, j ≠ p | Hp}
   = 1 − Pr{MDL(p) < MDL(j) for all j ≠ p | Hp}
   ≥ 1 − Pr{MDL(p) < MDL(j) for all j > p | Hp}
   = Pr{MDL(p) ≥ MDL(j) for some j > p | Hp}    (1.3)

Since MDL(j) − MDL(j+1) = yj − ln N and, for j > p,
MDL(p) − MDL(j) = Σ_{i=p}^{j−1} yi − (j − p) ln N, we have

Pr{MDL(p) ≥ MDL(j) for some j > p | Hp}
   = Pr{ yp ≥ ln N or yp + yp+1 ≥ 2 ln N or · · · or Σ_{i=p}^{M−1} yi ≥ (M − p) ln N | Hp }    (1.4)

By Theorem 1, yj ∼ χ²_1 and the yj's are independent for j ≥ p. So the probability in
(1.4) can be found analytically, although this may not be easy. Alternatively, we can
find a lower bound of (1.4) which is much easier to calculate. Notice that

Pr{ yp ≥ ln N or yp + yp+1 ≥ 2 ln N or · · · or Σ_{i=p}^{M−1} yi ≥ (M − p) ln N | Hp }
   ≥ Pr{ yp ≥ ln N | Hp }
   = Pr{ X ≥ √(ln N) or X ≤ −√(ln N) }
   = 2Q(√(ln N))    (1.5)
where X is a standard Gaussian random variable (since yp ∼ χ²_1 under Hp) and
Q(x) is the right-tail probability of the standard Gaussian distribution, that is,
Q(x) = ∫_x^∞ (1/√(2π)) exp(−t²/2) dt. So 2Q(√(ln N)) is also a lower bound on
the probability of error Pe for the MDL. Note that this lower bound decreases
slowly as N increases. For example, in order to have Pe ≤ 0.01, we require that
2Q(√(ln N)) ≤ 0.01, and we need as many as N = 761 samples. This lower
bound depends only on the number of samples N. So when N is fixed, this lower
bound stays fixed even as σ2 → 0. This shows that the MDL is inconsistent. Since
Pr{MDL(p) ≥ MDL(j) for some j > p | Hp} is bounded below by a fixed bound,
the MDL has a tendency to overestimate the model order.
For the AIC, we just need to replace ln N by 2, so the lower bound is 2Q(√2).
Hence the AIC is also inconsistent. Notice that 2Q(√(ln N)) → 0 as N → ∞,
while 2Q(√2) is a constant. This agrees with the results in [7]: since the MDL is
consistent as N → ∞, its lower bound 2Q(√(ln N)) should decrease to 0, whereas
the constant lower bound 2Q(√2) shows that the AIC is inconsistent even as
N → ∞.
1.4 Consistency of the EEF
As a complement to Section 1.3, we will first show that the EEF is consistent
for the linear model. Next, we will prove that the EEF is consistent in general.
1.4.1 Consistency of the EEF for the Linear Model
The next theorem will be used to prove the consistency of the EEF for the
linear model. (See Appendix 1B for the proof of Theorem 2)
Theorem 2 (PDF of yj for j < p). If the true model order is Hp, then
for j < p, yj has a noncentral chi-square distribution with 1 degree of freedom
and noncentrality parameter λj = αj/σ2, where Hj+1,p = [hj+1, hj+2, . . . , hp],
θj+1,p = [θj+1, θj+2, . . . , θp]^T, and αj = (Hj+1,p θj+1,p)^T (Pj+1 − Pj) Hj+1,p θj+1,p.
Furthermore, the yj's are independent for all j.
The EEF chooses the model order that maximizes

EEF(i) = ( lGi(x) − i [ ln( lGi(x)/i ) + 1 ] ) u( lGi(x)/i − 1 )
       = ( x^T Pi x/σ2 − i [ ln( x^T Pi x/(iσ2) ) + 1 ] ) u( x^T Pi x/(iσ2) − 1 )    (1.6)

If Hp is true, it is well known that [1]

lGp(x) = x^T Pp x/σ2 ∼ χ′²_p(λ)    (1.7)

where λ = ‖Hp θp‖²/σ2. In order to prove the consistency of the EEF in probability,
we need to show that

Pr{ arg max_i EEF(i) = p } → 1

as σ2 → 0. We start by comparing EEF(j) with EEF(p) as σ2 → 0, for j > p
and for j < p.
For j > p, we know that [1]

lGj(x) = x^T Pj x/σ2 ∼ χ′²_j(λ)    (1.8)

where λ is the same as in (1.7). The lemma in [8] shows that if Y is distributed
according to χ′²_ν(an), where a is a positive constant, then as n → ∞, Y/n converges
to a in probability, or in symbols, Y/n →P a. Replacing n by 1/σ2 and using (1.7)
and (1.8), we have, as σ2 → 0,

σ2 lGj(x) →P ‖Hp θp‖²    for j ≥ p    (1.9)
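The scaling behind (1.9) can be illustrated directly by drawing noncentral chi-square samples; the sketch below uses arbitrary values a = 4 (standing in for ‖Hp θp‖²) and df = 3 (standing in for j), neither of which comes from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
a, df = 4.0, 3   # a plays the role of ||H_p theta_p||^2, df the role of j

# Y ~ chi'^2_df(a / sigma^2); by the lemma, sigma^2 * Y -> a in probability,
# so the scaled samples should concentrate around a as sigma^2 shrinks
for sigma2 in (1e-1, 1e-3, 1e-5):
    y = rng.noncentral_chisquare(df, a / sigma2, size=2000)
    print(sigma2, (sigma2 * y).mean(), (sigma2 * y).std())
```

The printed standard deviations shrink roughly like √σ², which is the concentration that the convergence-in-probability statement captures.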
By the definition of convergence in probability, we have

    Pr{ |σ² l_{G_j}(x) − ‖H_p θ_p‖²| < ε } → 1 for j ≥ p    (1.10)

as σ² → 0 for all ε > 0. Since σ² → 0, we can find σ² small enough such that j < (‖H_p θ_p‖² − ε)/σ². Hence, we have

    Pr{ l_{G_j}(x) > j } ≥ Pr{ l_{G_j}(x) > (‖H_p θ_p‖² − ε)/σ² }
                        ≥ Pr{ |σ² l_{G_j}(x) − ‖H_p θ_p‖²| < ε } for j ≥ p    (1.11)
Therefore, as a result of (1.10) and (1.11),

    Pr{ l_{G_j}(x)/j − 1 > 0 } → 1 for j ≥ p    (1.12)

as σ² → 0 and we can discard the unit step function. As a result,

    EEF(p) − EEF(j) = l_{G_p}(x) − l_{G_j}(x) − p ln l_{G_p}(x) + j ln l_{G_j}(x) + p ln p − j ln j − p + j
                    = l_{G_p}(x) − l_{G_j}(x) − p ln( σ² l_{G_p}(x) ) + j ln( σ² l_{G_j}(x) ) + c    (1.13)

where

    c = (p − j) ln σ² + p ln p − j ln j − p + j    (1.14)
By Theorem 1,

    l_{G_p}(x) − l_{G_j}(x) ∼ −χ²_{j−p}    (1.15)

As a result of (1.9), by the continuity of the logarithm we have [14]

    ln( σ² l_{G_j}(x) ) →P ln ‖H_p θ_p‖² for j ≥ p    (1.16)
We divide (1.13) by c and get

    (EEF(p) − EEF(j))/c = [ l_{G_p}(x) − l_{G_j}(x) − p ln( σ² l_{G_p}(x) ) + j ln( σ² l_{G_j}(x) ) ]/c + 1    (1.17)

Since 1/c → 0⁺ for j > p as σ² → 0, as a result of (1.15) and (1.16), we have (see Theorems 2.3.3 and 2.3.5 on pages 70-71 in [14] and Theorem (4)(a) on page 310 in [15])

    [ l_{G_p}(x) − l_{G_j}(x) − p ln( σ² l_{G_p}(x) ) + j ln( σ² l_{G_j}(x) ) ]/c →P 0    (1.18)

and hence

    (EEF(p) − EEF(j))/c →P 1    (1.19)

for j > p. This shows that as σ² → 0, Pr{EEF(p) > EEF(j)} → 1.
For j < p, similar to the derivation in Appendix 1B, the distribution of l_{G_j}(x) = x^T P_j x/σ² can be found as

    l_{G_j}(x) ∼ χ'²_j(λ')    (1.20)

where λ' = (H_p θ_p)^T P_j H_p θ_p / σ². So we also have

    Pr{ l_{G_j}(x)/j − 1 > 0 } → 1 for j ≤ p    (1.21)

as σ² → 0. Thus we can also omit the unit step function and have

    EEF(p) − EEF(j) = l_{G_p}(x) − l_{G_j}(x) − p ln( σ² l_{G_p}(x) ) + j ln( σ² l_{G_j}(x) ) + c    (1.22)

where

    c = (p − j) ln σ² + p ln p − j ln j − p + j    (1.23)
Now by Theorem 2,

    l_{G_p}(x) − l_{G_j}(x) ∼ χ'²_{p−j}( Σ_{i=j}^{p−1} λ_i ) = χ'²_{p−j}( Σ_{i=j}^{p−1} α_i/σ² )    (1.24)

so that by the lemma in [8], we have

    σ²( l_{G_p}(x) − l_{G_j}(x) ) →P Σ_{i=j}^{p−1} α_i    (1.25)
Similarly to the above analysis, we have

    ln( σ² l_{G_p}(x) ) →P ln ‖H_p θ_p‖²
    ln( σ² l_{G_j}(x) ) →P ln( (H_p θ_p)^T P_j H_p θ_p ) for j < p    (1.26)

Hence, with σ² → 0 we have

    σ² ln( σ² l_{G_j}(x) ) →P 0 for j ≤ p    (1.27)

Obviously, σ²c → 0. So by (1.22), (1.25) and (1.27), we have

    σ²( EEF(p) − EEF(j) ) = σ²( l_{G_p}(x) − l_{G_j}(x) ) − pσ² ln( σ² l_{G_p}(x) ) + jσ² ln( σ² l_{G_j}(x) ) + σ²c
                          →P Σ_{i=j}^{p−1} α_i > 0    (1.28)

for j < p. This means that Pr{EEF(p) > EEF(j)} → 1 as σ² → 0.
Finally, we have shown that Pr{EEF(p) > EEF(j)} → 1 for all j ≠ p. Since Pr{A₁ ∩ A₂} → 1 if Pr{A₁} → 1 and Pr{A₂} → 1 [14], as a result,

    Pr{ arg max_i EEF(i) = p } → 1

as σ² → 0. This completes the proof that the EEF is consistent for the linear model.
1.4.2 Consistency of the EEF in General
In the general case, the signal s(θ_i) does not have to be a linear transformation of θ_i, and the noise w does not have to be Gaussian. To prove the consistency of the EEF in general, we first write the model in (1.1) as

    H_i : x = s_i(θ_i) + σ_n u    (1.29)

where the N × 1 signal s_i(θ_i) depends on the i × 1 unknown parameter vector θ_i, u has a well-defined PDF, and {σ_n} is an arbitrary positive sequence that converges to 0. We use the sequence {σ_n} because if we consider the probability of correct model order selection Pc as a function of σ², then the following conditions are equivalent [16]:

Condition 1) lim_{σ²→0} Pc(σ²) = 1

Condition 2) lim_{n→∞} Pc(σ_n²) = 1 for any sequence {σ_n²} that converges to 0

Hence we will prove Condition 2) to show the consistency of the EEF.
Let us assume the following.

Assumption 1): s_i(θ_i) is Lipschitz continuous, i.e., there exists K > 0 such that ‖s_i(θ_i^1) − s_i(θ_i^2)‖ ≤ K‖θ_i^1 − θ_i^2‖ for all θ_i^1, θ_i^2.

Note that the linear signal s_i(θ_i) = H_i θ_i is Lipschitz continuous since s_i(θ_i) is a linear transformation of θ_i [17].

Assumption 2): The PDF p_U(u) of u satisfies

    p_U(u_n)/p_U(v_n) → ∞ if ‖v_n‖ − ‖u_n‖ → ∞

where {u_n}, {v_n} are deterministic sequences, and

    ln p_U(u) is Lipschitz continuous on the set {u : ‖u‖ ≤ l} for any l > 0,

i.e., for any l > 0, there exists L > 0 such that |ln p_U(u₁) − ln p_U(u₂)| ≤ L‖u₁ − u₂‖ for all u₁, u₂ with ‖u₁‖ ≤ l, ‖u₂‖ ≤ l.
Note that the Gaussian and Gaussian mixture PDFs satisfy Assumption 2). For example, let the Gaussian mixture PDF be

    p_U(u) = Σ_{i=1}^m ( α_i/√(2πσ_i²) ) e^{ −‖u‖²/(2σ_i²) }

where α_i > 0 and Σ_{i=1}^m α_i = 1. Let σ²_max = max{σ₁², . . . , σ_m²}, σ²_min = min{σ₁², . . . , σ_m²}, and let α be the α_i that corresponds to σ²_max. Then we have

    p_U(u)/p_U(v) = [ Σ_{i=1}^m ( α_i/√(2πσ_i²) ) e^{ −‖u‖²/(2σ_i²) } ] / [ Σ_{i=1}^m ( α_i/√(2πσ_i²) ) e^{ −‖v‖²/(2σ_i²) } ]
                  > [ ( α/√(2πσ²_max) ) e^{ −‖u‖²/(2σ²_max) } ] / [ ( 1/√(2πσ²_min) ) e^{ −‖v‖²/(2σ²_max) } ]
                  = α √(σ²_min/σ²_max) exp( (‖v‖² − ‖u‖²)/(2σ²_max) )    (1.30)

So if ‖v_n‖ − ‖u_n‖ → ∞, it follows that ‖v_n‖² − ‖u_n‖² → ∞ and hence p_U(u_n)/p_U(v_n) → ∞.
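The bound in (1.30) is easy to verify numerically. A minimal sketch follows (the mixture weights and variances are arbitrary illustrative choices, not values from the text); since the densities depend on u only through ‖u‖, scalars stand in for the norms.

```python
import numpy as np

# Arbitrary illustrative two-component mixture (m = 2).
alphas = np.array([0.3, 0.7])
vars_ = np.array([1.0, 4.0])

def p_U(norm_u):
    # mixture density evaluated through the norm ||u||
    return np.sum(alphas / np.sqrt(2 * np.pi * vars_)
                  * np.exp(-norm_u**2 / (2 * vars_)))

s2max, s2min = vars_.max(), vars_.min()
alpha = alphas[np.argmax(vars_)]        # the alpha_i paired with sigma^2_max

def bound(nu, nv):
    # right-hand side of (1.30)
    return alpha * np.sqrt(s2min / s2max) * np.exp((nv**2 - nu**2) / (2 * s2max))

# the density ratio dominates the bound, which grows with ||v|| - ||u||
for nu, nv in [(1.0, 3.0), (1.0, 6.0), (2.0, 10.0)]:
    assert p_U(nu) / p_U(nv) > bound(nu, nv)
```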
Let H_p be the true model. With the above assumptions, the following theorems are proved in Appendices 1C-1E.

Theorem 3 (l_{G_j}(x) unbounded in probability for j ≥ p). There exists a sequence {N_n} with N_n → ∞ such that Pr{l_{G_j}(x) > N_n} → 1 as σ_n → 0 for j ≥ p.

Note that each {N_n} implicitly depends on σ_n. For example, in the linear model for j ≥ p,

    l_{G_j}(x) = x^T P_j x/σ_n² ∼ χ'²_j(λ)

where λ = ‖H_p θ_p‖²/σ_n². If we choose N_n = ‖H_p θ_p‖²/(2σ_n²), (1.9) implies that Pr{l_{G_j}(x) > N_n} → 1 as σ_n → 0.
Theorem 4 (l_{G_j}(x) − l_{G_p}(x) bounded in probability for j > p). For any sequence {m_n}, Pr{l_{G_j}(x) − l_{G_p}(x) < m_n} → 1 as m_n → ∞ for j > p.

Here the sequence {m_n} can be an arbitrary sequence with m_n → ∞, so m_n does not depend on σ_n. For example, in the linear model for j > p,

    l_{G_j}(x) − l_{G_p}(x) ∼ χ²_{j−p}

So for any {m_n}, Pr{l_{G_j}(x) − l_{G_p}(x) < m_n} → 1 as m_n → ∞ for j > p.
Theorem 5 (l_{G_p}(x) − l_{G_j}(x) unbounded in probability for j < p). There exists a sequence {M_n} with M_n → ∞ such that Pr{l_{G_p}(x) − l_{G_j}(x) > M_n} → 1 as σ_n → 0 for j < p.

Note that each M_n also implicitly depends on σ_n. For example, in the linear model for j < p, by (1.24),

    l_{G_p}(x) − l_{G_j}(x) ∼ χ'²_{p−j}( Σ_{i=j}^{p−1} α_i/σ_n² )

If we choose M_n = Σ_{i=j}^{p−1} α_i/(2σ_n²), it can be shown that Pr{l_{G_p}(x) − l_{G_j}(x) > M_n} → 1 as σ_n → 0.
First we consider when j > p. For each σ_n, let D_n^j = {u : l_{G_j}(x) > N_n}, D_n^p = {u : l_{G_p}(x) > N_n}, E_n = {u : l_{G_j}(x) − l_{G_p}(x) < m_n}, and F_n = {u : EEF(p) > EEF(j)}. Then for any u ∈ D_n^j ∩ D_n^p ∩ E_n, since N_n → ∞, we can omit the unit step function in the EEF. So we have

    EEF(p) − EEF(j) = l_{G_p}(x) − p[ ln( l_{G_p}(x)/p ) + 1 ] − l_{G_j}(x) + j[ ln( l_{G_j}(x)/j ) + 1 ]
                    = p ln( l_{G_j}(x)/l_{G_p}(x) ) + (j − p) ln l_{G_j}(x) − ( l_{G_j}(x) − l_{G_p}(x) ) + p ln p − j ln j − p + j    (1.31)

Note that l_{G_j}(x)/l_{G_p}(x) ≥ 1, ln l_{G_j}(x) > ln N_n, and l_{G_j}(x) − l_{G_p}(x) < m_n. Since m_n is arbitrary, we can choose m_n < (j − p) ln N_n + p ln p − j ln j − p + j but still with m_n → ∞ so that EEF(p) − EEF(j) > 0. This shows that D_n^j ∩ D_n^p ∩ E_n ⊆ F_n. By Theorems 3 and 4, we have Pr{D_n^j} → 1, Pr{D_n^p} → 1 and Pr{E_n} → 1, and hence Pr{D_n^j ∩ D_n^p ∩ E_n} → 1. This shows that Pr{F_n} → 1 as σ_n → 0, i.e., Pr{EEF(p) > EEF(j)} → 1 as σ_n → 0 for j > p.
Next, when j < p, let D_n^p = {u : l_{G_p}(x) > N_n}, G_n = {u : l_{G_p}(x) − l_{G_j}(x) > M_n}, and H_n = {u : EEF(p) > EEF(j)} for each σ_n. Note that H_n and F_n are different since the former is for j < p and the latter is for j > p. For any u ∈ D_n^p ∩ G_n, we have

    EEF(p) − EEF(j) = ( l_{G_p}(x) − l_{G_j}(x) ) + j ln l_{G_j}(x) − p ln l_{G_p}(x) + p ln p − j ln j − p + j    (1.32)

Since x − p ln x increases as x increases for x > p, we can find N_n and M_n such that EEF(p) − EEF(j) > 0. This shows that D_n^p ∩ G_n ⊆ H_n. By Theorem 3 with j = p and Theorem 5, the rest of the proof is the same as for j > p.

Since we have shown that Pr{EEF(p) > EEF(j)} → 1 for all j ≠ p, we have Pr{arg max_i EEF(i) = p} → 1 as σ² → 0 using the property that Pr{A₁ ∩ A₂} → 1 if Pr{A₁} → 1 and Pr{A₂} → 1 [14].
1.5 Simulation Results

1.5.1 Linear Signal
For the linear model when M = 2:

    H₁ : x = h₁θ₁ + w
    H₂ : x = [h₁ h₂][θ₁, θ₂]^T + w = H₂θ₂ + w
If H₁ is true, by (1.4) and (1.5), the probability that the MDL will choose H₂ is

    Pr{H₂|H₁} = Pr{MDL(1) ≥ MDL(2)|H₁} = Pr{y₁ ≥ ln N|H₁} = 2Q(√(ln N))    (1.33)

So in this case, the lower bound is exactly the probability of overestimation error for the MDL. For the AIC, the lower bound 2Q(√2) is also exactly the probability of overestimation error. Hence the probabilities of correct model order selection Pc (note here that there is no underestimation error since the correct order is k = 1) for the MDL and the AIC are

    Pc(MDL) = 1 − 2Q(√(ln N))
    Pc(AIC) = 1 − 2Q(√2)
For the simulation, we use N = 20, h₁ = [1, 1, . . . , 1]^T, h₂ = [1, −1, 1, −1, . . . , 1, −1]^T, θ₁ = 1 and θ₂ = 0. We plot Pc versus 1/σ². It can be expected that Pc(MDL) = 1 − 2Q(√(ln 20)) = 0.917 and Pc(AIC) = 1 − 2Q(√2) = 0.843, and Figure 1.1 verifies our result. We can see that the EEF appears to be consistent in accordance with the theory, while the MDL and the AIC are inconsistent. Also, the performances of the MDL and the AIC do not depend on σ².
Figure 1.1. Performance of MDL, AIC and EEF for the linear model when H₁ is true (M = 2, N = 20); Pc is plotted versus 1/σ².
Next we consider polynomial order estimation, which is essentially a linear model. We assume that M = 4, N = 20 and the true model order is H₃ with the nth element of s(θ₃) being s[n] = 0.1 + 0.3n + 0.1n² for n = 0, 1, . . . , N − 1. Pc is plotted versus 1/σ². As shown in Figure 1.2, the EEF is consistent and the MDL and the AIC are inconsistent. In this case, we cannot find Pc explicitly for the MDL and the AIC, but we can see that their performances are bounded above by 1 − 2Q(√(ln 20)) = 0.917 and 1 − 2Q(√2) = 0.843, respectively.
Figure 1.2. Performance of MDL, AIC and EEF in estimating the polynomial model order when H₃ is true (M = 4, N = 20); the probability of correct selection is plotted versus 1/σ².
1.5.2 Non-Linear Signal

We consider the problem of estimating the number of sinusoids. Suppose that under the ith model, the signal consists of i sinusoids embedded in white Gaussian noise. That is,

    H_i : x[n] = Σ_{j=1}^i A_j cos(2πf_j n + φ_j) + w[n]

for n = 0, 1, . . . , N − 1, i = 1, 2, . . . , M, where the amplitudes A_j, the frequencies f_j and the phases φ_j are unknown. To make the problem identifiable, we assume that A_j > 0, 0 < f_j < 1/2, and 0 ≤ φ_j < 2π. It can be easily checked that Assumptions 1) and 2) are satisfied for this example. Notice that if the frequencies f_j are known, the model can be reduced to the linear model [13]
    H_i : x = H_i α_i + w    (1.34)

where

    H_i = [ 1                0                · · ·   1                0
            cos 2πf₁         sin 2πf₁         · · ·   cos 2πf_i        sin 2πf_i
            ⋮                ⋮                        ⋮                ⋮
            cos(2πf₁(N−1))   sin(2πf₁(N−1))   · · ·   cos(2πf_i(N−1))  sin(2πf_i(N−1)) ]

is an N × 2i observation matrix for the ith model, and

    α_i = [A₁ cos φ₁, −A₁ sin φ₁, . . . , A_i cos φ_i, −A_i sin φ_i]^T

is a one-to-one transformation of the amplitudes A_j and phases φ_j. As a result, the MLEs of the A_j and φ_j can be found from the MLE of α_i according to the linear model in (1.34), whose observation matrix H_i depends on the f_j. So the MLE of α_i is

    α̂_i = (H_i^T H_i)^{−1} H_i^T x    (1.35)

which is a function of f_j for j = 1, 2, . . . , i.
If the frequencies f_j are unknown, as a result of (1.35), the MLEs of the f_j can be found by maximizing the following over the f_j:

    g(f₁, f₂, . . . , f_i) = x^T H_i (H_i^T H_i)^{−1} H_i^T x    (1.36)

Note that (1.36) is a function of the f_j because H_i depends on the f_j.
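For i = 1 the maximization of (1.36) reduces to a one-dimensional grid search. A minimal sketch (hypothetical function name, assuming NumPy) is:

```python
import numpy as np

def freq_mle_1sin(x, grid):
    """Grid-search MLE of a single sinusoid frequency (i = 1): maximize
    g(f) = x^T H (H^T H)^{-1} H^T x over the candidate frequencies."""
    n = np.arange(len(x))
    best_f, best_g = None, -np.inf
    for f in grid:
        H = np.column_stack([np.cos(2 * np.pi * f * n),
                             np.sin(2 * np.pi * f * n)])
        g = x @ H @ np.linalg.solve(H.T @ H, H.T @ x)   # projection statistic
        if g > best_g:
            best_f, best_g = f, g
    return best_f
```

For a noiseless sinusoid whose frequency lies on the grid, the search recovers it exactly, since g(f) ≤ ‖x‖² with equality only when x lies in the column space of H(f).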
We denote the observation matrix H_i corresponding to the MLE of the f_j as Ĥ_i. Note that the number of unknown parameters is 3i under H_i. Similar to the previous subsection, the MDL, the AIC and the EEF choose the model order with the largest of the following, respectively:

    −MDL(i) = x^T Ĥ_i (Ĥ_i^T Ĥ_i)^{−1} Ĥ_i^T x / σ² − 3i ln N

    −AIC(i) = x^T Ĥ_i (Ĥ_i^T Ĥ_i)^{−1} Ĥ_i^T x / σ² − 6i

    EEF(i) = ( x^T Ĥ_i (Ĥ_i^T Ĥ_i)^{−1} Ĥ_i^T x / σ² − 3i[ ln( x^T Ĥ_i (Ĥ_i^T Ĥ_i)^{−1} Ĥ_i^T x / (3iσ²) ) + 1 ] )
             · u( x^T Ĥ_i (Ĥ_i^T Ĥ_i)^{−1} Ĥ_i^T x / (3iσ²) − 1 )    (1.37)
In the simulation, we assume that M = 3, N = 20 and the true model order is H₂ with s[n] = cos(2π(0.1)n) + 0.8 cos(2π(0.3)n + π/5) for n = 0, 1, . . . , N − 1. The MLEs of the f_j that maximize (1.36) are found by grid search. In Figure 1.3, we also observe the consistency of the EEF and the inconsistency of the MDL and the AIC as σ² → 0. The probabilities of correct selection appear to have upper bounds for the MDL and the AIC, although no explicit bounds are calculated in this non-linear signal case.
1.6 Conclusion

The inconsistency of the MDL and the AIC as σ² → 0 has been shown, and a simple lower bound on their overestimation probability has been provided. The consistency of the EEF as σ² → 0 has also been proved. Simulation results show that the EEF performs essentially perfectly under small noise while the MDL and the AIC do not.
Figure 1.3. Probability of correct selection for MDL, AIC and EEF in estimating the number of sinusoids when H₂ is true (M = 3, N = 20).

Appendix 1A - Derivation of the Distribution of the y_j's for j ≥ p

We need the following lemma to derive the distribution of the y_j's.

Lemma 1. P_{j+1} − P_j has rank 1.

Proof. Suppose that for the subspace V_j generated by h₁, h₂, . . . , h_j, we have an orthonormal basis {v₁, v₂, . . . , v_j}. Then for the subspace V_{j+1} generated by h₁, h₂, . . . , h_{j+1}, we can have an orthonormal basis {v₁, v₂, . . . , v_j, v_{j+1}}. Since P_j is the projection matrix onto the subspace V_j, for any N × 1 vector x, we have
    P_j x = Σ_{i=1}^j ⟨x, v_i⟩ v_i    (1.38)

where ⟨x, v_i⟩ is the inner product defined by ⟨x, v_i⟩ = x^T v_i. Similarly, we also have

    P_{j+1} x = Σ_{i=1}^{j+1} ⟨x, v_i⟩ v_i    (1.39)

So (1.38) and (1.39) tell us that for any x,

    (P_{j+1} − P_j) x = ⟨x, v_{j+1}⟩ v_{j+1} = α v_{j+1}    (1.40)

for a scalar α. This shows that P_{j+1} − P_j has rank 1 since it projects any x onto the 1-dimensional subspace generated by v_{j+1}.
Since we assume under H_p that x = H_p θ_p + w,

    y_p = (H_p θ_p + w)^T (P_{p+1} − P_p)(H_p θ_p + w) / σ².    (1.41)

Since H_p θ_p = Σ_{i=1}^p θ_i h_i ∈ V_p, the projection of H_p θ_p onto V_p remains the same. That is,

    P_p H_p θ_p = H_p θ_p.

Also H_p θ_p = Σ_{i=1}^p θ_i h_i + 0·h_{p+1} ∈ V_{p+1}, thus P_{p+1} H_p θ_p = H_p θ_p. So we have

    (P_{p+1} − P_p) H_p θ_p = 0

and hence

    y_p = w^T (P_{p+1} − P_p) w / σ² = u^T (P_{p+1} − P_p) u    (1.42)

where u = w/σ is an N × 1 white Gaussian noise vector with unit variance. For j > p, we can think of H_p θ_p as H_j θ_j where θ_j = [θ₁, θ₂, . . . , θ_p, 0, . . . , 0]^T. By the same derivation as above, we can also show that

    y_j = u^T (P_{j+1} − P_j) u.    (1.43)

It is well known that P_j is a symmetric idempotent matrix and P_{j+1} P_j = P_j (see page 231 in [13]). So

    (P_{j+1} − P_j)(P_{j+1} − P_j) = P_{j+1} − P_j.

This says that P_{j+1} − P_j is also idempotent. By Lemma 1, P_{j+1} − P_j has rank 1, so by [1]

    y_j = u^T (P_{j+1} − P_j) u ∼ χ²₁ for all j ≥ p    (1.44)

where χ²₁ is the chi-square distribution with 1 degree of freedom.
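The projection-difference properties used above are easy to verify numerically. The sketch below (arbitrary random h_i, assuming NumPy) checks that P_{j+1} − P_j is idempotent with rank 1, which is what makes u^T(P_{j+1} − P_j)u a χ²₁ quadratic form for white Gaussian u.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
H = rng.standard_normal((N, 3))   # generic columns h1, h2, h3

def proj(Hj):
    # orthogonal projection matrix onto the column space of Hj
    return Hj @ np.linalg.solve(Hj.T @ Hj, Hj.T)

for j in (1, 2):
    D = proj(H[:, :j + 1]) - proj(H[:, :j])
    assert np.allclose(D @ D, D)                      # idempotent
    assert np.linalg.matrix_rank(D, tol=1e-10) == 1   # rank one (Lemma 1)
```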
We still need to show the independence of the y_j's for all j ≥ p. Let z_j = (P_{j+1} − P_j)u. Since z_j is a linear transform of u, z_j is also Gaussian with zero mean. For any l > 0, we will show next that z_j and z_{j+l} are independent for any j ≥ p. Let

    [z_j ; z_{j+l}] = [P_{j+1} − P_j ; P_{j+l+1} − P_{j+l}] u,

whose covariance matrix is

    C_{z_j, z_{j+l}} = [ (P_{j+1} − P_j)(P_{j+1} − P_j)         (P_{j+1} − P_j)(P_{j+l+1} − P_{j+l})
                         (P_{j+l+1} − P_{j+l})(P_{j+1} − P_j)   (P_{j+l+1} − P_{j+l})(P_{j+l+1} − P_{j+l}) ].

By the property of P_j that P_m P_{m+n} = P_m for n > 0 [13], we have

    (P_{j+1} − P_j)(P_{j+l+1} − P_{j+l}) = P_{j+1} P_{j+l+1} − P_j P_{j+l+1} − P_{j+1} P_{j+l} + P_j P_{j+l}
                                         = P_{j+1} − P_j − P_{j+1} + P_j
                                         = 0_{N×N}.

This shows that z_j and z_{j+l} are uncorrelated and hence independent by Gaussianity. Also by Gaussianity, pairwise independence leads to the independence of all the z_j's. Since y_j = z_j^T z_j, the y_j's are independent for all j ≥ p.
Appendix 1B - Derivation of the Distribution of the y_j's for j < p

If H_p is true, for j < p we still have

    y_j = (H_p θ_p + w)^T (P_{j+1} − P_j)(H_p θ_p + w) / σ².    (1.45)

But when j < p,

    (P_{j+1} − P_j) H_p θ_p ≠ 0

so we cannot reduce (1.45) as in (1.43). However, we can write y_j as

    y_j = ( H_p θ_p/σ + u )^T (P_{j+1} − P_j) ( H_p θ_p/σ + u )
        = ( (P_{j+1} − P_j) H_p θ_p/σ + z_j )^T ( (P_{j+1} − P_j) H_p θ_p/σ + z_j )    (1.46)

where u = w/σ and z_j = (P_{j+1} − P_j)u as in Appendix 1A. Since we have shown that z_j^T z_j ∼ χ²₁, we have

    y_j ∼ χ'²₁(λ_j)    (1.47)

where χ'²₁(λ_j) is the noncentral chi-square distribution with 1 degree of freedom and noncentrality parameter λ_j = ‖(P_{j+1} − P_j) H_p θ_p‖²/σ² = (H_p θ_p)^T (P_{j+1} − P_j) H_p θ_p/σ² > 0. If we let H_{j+1,p} = [h_{j+1}, h_{j+2}, . . . , h_p] and θ_{j+1,p} = [θ_{j+1}, θ_{j+2}, . . . , θ_p]^T, then since (P_{j+1} − P_j) H_j θ_j = 0, we have

    λ_j = (H_j θ_j + H_{j+1,p} θ_{j+1,p})^T (P_{j+1} − P_j)(H_j θ_j + H_{j+1,p} θ_{j+1,p}) / σ²
        = (H_{j+1,p} θ_{j+1,p})^T (P_{j+1} − P_j) H_{j+1,p} θ_{j+1,p} / σ²

So λ_j does not depend on the first j θ_i's in θ_p. Since the proof of the independence of the z_j's in Appendix 1A does not depend on whether j ≥ p or j < p, the z_j's are independent for all j. Hence so are the y_j's.
Appendix 1C - Proof of Theorem 3

Theorem 3 (l_{G_j}(x) unbounded in probability for j ≥ p). There exists a sequence {N_n} with N_n → ∞ such that Pr{l_{G_j}(x) > N_n} → 1 as σ_n → 0 for j ≥ p.

First we will prove the next lemma.

Lemma 2. Under the true model, s_p(θ̂_p) →P s_p(θ_p) as σ_n → 0. That is, for any ε > 0, Pr{ ‖s_p(θ̂_p) − s_p(θ_p)‖ < ε } → 1 as σ_n → 0.

Proof. First we will introduce the work in [18], which considers the characteristics of the MLE under high SNR. Let

    f(θ_p, u) = [f₁(θ_p, u), . . . , f_p(θ_p, u)]^T = ∂p_U( (x(u) − s_p(θ_p))/σ_n ) / ∂θ_p

where we consider x as a function of u. Then the MLE of θ_p is found by solving

    f(θ_p, u) = 0

If the f_i(θ_p, u) for i = 1, . . . , p are differentiable functions on a neighborhood of a point (θ_p⁰, u⁰) with f(θ_p⁰, u⁰) = 0, and the Jacobian matrix Φ with respect to u is nonsingular at (θ_p⁰, u⁰), then by the implicit function theorem, we have

    (θ̂_p − θ_p)/σ_n →P −Φ⁻¹Ψu    (1.48)

where Φ and Ψ are deterministic matrices with

    Φ = [ ∂f/∂u₁ |_{(θ_p⁰, u⁰)}, . . . , ∂f/∂u_N |_{(θ_p⁰, u⁰)} ]
    Ψ = [ ∂f/∂θ₁ |_{(θ_p⁰, u⁰)}, . . . , ∂f/∂θ_p |_{(θ_p⁰, u⁰)} ]
Although only Gaussian noise is considered in [18], (1.48) still holds for non-
Gaussian noise by the implicit function theorem.
It has been shown in [14] that if {X_n} is a sequence of random variables that converges to X in probability and {c_n} is a deterministic sequence that converges to c, then c_n X_n →P cX. As a result of (1.48), since σ_n → 0, we have

    θ̂_p − θ_p = σ_n (θ̂_p − θ_p)/σ_n →P 0    (1.49)

Then by Assumption 1), ‖s_p(θ̂_p) − s_p(θ_p)‖ →P 0. This completes the proof of Lemma 2.
When the true model is H_p, for j > p, the MLE for θ_j is still under the true model if we write θ_j as θ_j = [θ_p^T, 0, . . . , 0]^T. So from (1.49), we have θ̂_j →P θ_j, i.e.,

    [θ̂₁, θ̂₂, . . . , θ̂_j]^T →P [θ₁, . . . , θ_p, 0, . . . , 0]^T

Hence Lemma 2 still holds for j > p, and it extends to

    ‖s_j(θ̂_j) − s_j(θ_j)‖ →P 0 for all j ≥ p    (1.50)
So we have

    l_{G_j}(x) = 2 ln [ p_U( (x − s_j(θ̂_j))/σ_n ) (1/σ_n) / ( p_U( x/σ_n ) (1/σ_n) ) ]
               = 2 ln [ p_U( (s_j(θ_j) + σ_n u − s_j(θ̂_j))/σ_n ) / p_U( (s_j(θ_j) + σ_n u)/σ_n ) ]    (1.51)
Since p_U(u) is a well-defined PDF and hence has a valid cumulative distribution function (CDF), we have

    Pr{ ‖u‖ < l_n } → 1    (1.52)

for any sequence {l_n} with l_n → ∞. Let A_n = {u : ‖s_j(θ̂_j) − s_j(θ_j)‖ < ε} and B_n = {u : ‖u‖ < l_n} for each σ_n. Since l_n and ε are arbitrary, we let l_n = ‖s_j(θ_j)‖/(3σ_n) and ε = ‖s_j(θ_j)‖/6. Then for each u ∈ A_n ∩ B_n, we have

    ‖s_j(θ_j) + σ_n u − s_j(θ̂_j)‖/σ_n ≤ ‖s_j(θ_j) − s_j(θ̂_j)‖/σ_n + ‖u‖ < ε/σ_n + l_n    (1.53)
Hence

    ‖s_j(θ_j) + σ_n u‖/σ_n − ‖s_j(θ_j) + σ_n u − s_j(θ̂_j)‖/σ_n
        > ( ‖s_j(θ_j)‖/σ_n − ‖u‖ ) − ( ε/σ_n + l_n )
        > ‖s_j(θ_j)‖/σ_n − 2l_n − ε/σ_n
        = ‖s_j(θ_j)‖/(6σ_n) → ∞    (1.54)

as σ_n → 0. By Assumption 2), this shows that l_{G_j}(x) → ∞ as σ_n → 0 for each u ∈ A_n ∩ B_n. Let C = {u : l_{G_j}(x) → ∞ as σ_n → 0}. The previous analysis shows that A_n ∩ B_n ⊆ C. By (1.50) and (1.52), Pr{A_n} → 1 and Pr{B_n} → 1 as σ_n → 0. Hence Pr{A_n ∩ B_n} → 1. Note that A_n ∩ B_n ⊆ C, and thus Pr{C} = 1. From this "almost sure" event, the "in probability" statement follows, i.e., for any ε > 0 and any M, there exists an integer K such that Pr{l_{G_j}(x) ≤ M} < ε for all n ≥ K. Next, the existence of a sequence {N_n} with N_n → ∞ such that Pr{l_{G_j}(x) > N_n} → 1 as σ_n → 0 for j ≥ p will be shown by constructing such a sequence {N_n}.

Let {M_m} be any positive sequence that goes to ∞. For each M_m, there exists K_m such that Pr{l_{G_j}(x) ≤ M_m} < ε for all n ≥ K_m. We construct {N_n} as

    {N_n} = 0, . . . , 0, M₁, . . . , M₁, M₂, . . .

where the first term is 0, M₁ first appears as the K₁th term, and M₂ first appears as the K₂th term. So N_n → ∞ since M_m → ∞. For any n, we can find an m such that K_m ≤ n < K_{m+1}, and N_n = M_m by the above construction of {N_n}. Hence Pr{l_{G_j}(x) ≤ N_n} = Pr{l_{G_j}(x) ≤ M_m} < ε for all n. This proves the existence of a sequence {N_n} with N_n → ∞ such that Pr{l_{G_j}(x) > N_n} → 1 as σ_n → 0 for j ≥ p.
Appendix 1D - Proof of Theorem 4

Theorem 4 (l_{G_j}(x) − l_{G_p}(x) bounded in probability for j > p). For any sequence {m_n}, Pr{l_{G_j}(x) − l_{G_p}(x) < m_n} → 1 as m_n → ∞ for j > p.

For j > p,

    l_{G_j}(x) − l_{G_p}(x) = 2 ln p_U( (s_j(θ_j) + σ_n u − s_j(θ̂_j))/σ_n ) − 2 ln p_U( (s_p(θ_p) + σ_n u − s_p(θ̂_p))/σ_n )    (1.55)

Note that we can consider θ_j as θ_j = [θ_p^T, 0, . . . , 0]^T, and so we have s_j(θ_j) = s_p(θ_p). By (1.48) and Assumption 1),

    ‖s_j(θ̂_j) − s_p(θ̂_p)‖/σ_n ≤ ‖s_j(θ̂_j) − s_j(θ_j)‖/σ_n + ‖s_p(θ_p) − s_p(θ̂_p)‖/σ_n
        ≤ K‖θ̂_j − θ_j‖/σ_n + K‖θ̂_p − θ_p‖/σ_n →P 2K‖Φ⁻¹Ψu‖    (1.56)
By the Lipschitz continuity of ln p_U(u), there exists L such that (1.55) can be written as

    l_{G_j}(x) − l_{G_p}(x) = | 2 ln p_U( (s_j(θ_j) + σ_n u − s_j(θ̂_j))/σ_n ) − 2 ln p_U( (s_p(θ_p) + σ_n u − s_p(θ̂_p))/σ_n ) |
        ≤ 2L ‖s_j(θ̂_j) − s_p(θ̂_p)‖/σ_n
        ≤ 2LK‖θ̂_j − θ_j‖/σ_n + 2LK‖θ̂_p − θ_p‖/σ_n →P 4LK‖Φ⁻¹Ψu‖    (1.57)

where the second inequality is by (1.56). Similar to (1.52), we have

    Pr{ ‖Φ⁻¹Ψu‖ < l_n } → 1    (1.58)

and hence

    Pr{ l_{G_j}(x) − l_{G_p}(x) < 4LK l_n } → 1    (1.59)

as l_n → ∞. Since {l_n} is an arbitrary sequence with l_n → ∞, we have Pr{l_{G_j}(x) − l_{G_p}(x) < m_n} → 1 as m_n → ∞ for any sequence {m_n}.
Appendix 1E - Proof of Theorem 5

Theorem 5 (l_{G_p}(x) − l_{G_j}(x) unbounded in probability for j < p). There exists a sequence {M_n} with M_n → ∞ such that Pr{l_{G_p}(x) − l_{G_j}(x) > M_n} → 1 as σ_n → 0 for j < p.

For j < p,

    l_{G_p}(x) − l_{G_j}(x) = 2 ln p_U( (s_p(θ_p) + σ_n u − s_p(θ̂_p))/σ_n ) − 2 ln p_U( (s_p(θ_p) + σ_n u − s_j(θ̂_j))/σ_n )    (1.60)

Note that we do not have s_j(θ_j) = s_p(θ_p) as in the j > p case, because the model is misspecified when j < p. This means that we cannot find θ_j such that s_j(θ_j) = s_p(θ_p) or such that s_j(θ_j) is arbitrarily close to s_p(θ_p). So we assume that there exists δ > 0 such that ‖s_j(θ_j) − s_p(θ_p)‖ > δ for all θ_j. Then the rest of the proof follows similarly to the proof of Theorem 3 in Appendix 1C using Lemma 2 and Assumption 2).
List of References

[1] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.

[2] H. Akaike, "A new look at the statistical model identification," IEEE Trans. Autom. Control, vol. 19, pp. 716-723, Dec. 1974.

[3] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465-471, 1978.

[4] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.

[5] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333-345, Jan. 2005.

[6] P. Stoica and Y. Selen, "Model-order selection: A review of information criterion rules," IEEE Signal Process. Mag., vol. 21, pp. 36-47, Jul. 2004.

[7] M. Wax and T. Kailath, "Detection of signals by information theoretic criteria," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, pp. 387-392, Apr. 1985.

[8] C. Xu and S. Kay, "Source enumeration via the EEF criterion," IEEE Signal Process. Lett., vol. 15, pp. 569-572, 2008.

[9] R. Fisher, "On the mathematical foundations of theoretical statistics," Philos. Trans. Royal Soc. London, vol. 222, no. 594-604, pp. 309-368, Jan. 1922.

[10] R. Kass and P. Vos, Geometrical Foundations of Asymptotic Inference. Wiley, 1997.

[11] W. Xu and M. Kaveh, "Analysis of the performance and sensitivity of eigendecomposition-based detectors," IEEE Trans. Signal Process., vol. 43, pp. 1413-1426, Jun. 1995.

[12] A. Liavas and P. Regalia, "On the behavior of information theoretic criteria for model order selection," IEEE Trans. Signal Process., vol. 49, pp. 1689-1695, Aug. 2001.

[13] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[14] E. Lehmann, Elements of Large-Sample Theory. Springer, 1998.

[15] G. Grimmett and D. Stirzaker, Probability and Random Processes, 3rd ed. Oxford University Press, 2001.

[16] W. Rudin, Functional Analysis. McGraw-Hill, 1991.

[17] K. Eriksson, D. Estep, and C. Johnson, Applied Mathematics, Body and Soul: Calculus in Several Dimensions. Springer, 2004.

[18] A. Renaux, P. Forster, E. Chaumette, and P. Larzabal, "On the high-SNR conditional maximum-likelihood estimator full statistical characterization," IEEE Trans. Signal Process., vol. 54, pp. 4840-4843, Dec. 2006.
MANUSCRIPT 2
Autoregressive Modeling of Raman Spectra for Detection andClassification of Surface Chemicals
Abstract
This paper considers the problem of detecting and classifying surface chemicals
by analyzing the Raman spectrum of scattered laser pulses received from a
moving vehicle. An autoregressive (AR) model is proposed to model the spectrum
and a two-stage (detection followed by classification) scheme is used to control the
false alarm rate. The detector decides whether the received spectrum is from pure
background only or background plus some chemicals. The classification is made
among a library of possible chemicals. The problem of mixtures of chemicals is also
addressed. Simulation results using field background data have shown excellent
performance of the proposed approach when the signal-to-noise ratio (SNR) is at
least -10 dB.
2.1 Introduction
Raman spectroscopy has been widely used in detection and classification of
chemical agents in the presence of a material, termed the background [1, 2, 3, 4].
Many spectral data analysis techniques have been developed for this application.
Supervised approaches such as regression analysis [5] and the generalized likelihood
ratio test (GLRT) [6, 7] can be used when the background spectrum is known
since this is a standard subspace detection problem. Unsupervised approaches
such as independent component analysis (ICA) [8], canonical correlation [9, 10]
and a correlation scheme based on a Gaussian filter [11] can be used when the
background spectrum is unknown or varies due to noise.
In this paper, we study the unsupervised problem of detecting and classifying
surface chemicals based on Raman spectral returns received from a moving vehicle.
The Raman spectral data are collected by the laser interrogation of surface agents
(LISA) system developed by ITT Industries. LISA provides standoff detection and
identification of surface-deposited chemical agents based on short-range Raman
sensing (see [12] for more information about the system). This detection and
classification problem is complicated by many factors. Some of these are:
1. A background surface whose spectrum is unknown a priori and is changing with
time.
2. Target chemicals that, even if present, are presented to the detector only a
fraction of the time. This is due to an uneven and incomplete distribution
of deposited surface chemicals.
3. The energy in the target return varying with the amount of chemical, the type
of chemical, and the range to the chemical.
4. The possible presence of more than one chemical, i.e., a chemical mixture.
5. Impurities in the background that present themselves as unknown chemical
deposits.
In order to design algorithms that are able to handle this multitude of unknown
situations we rely heavily on adaptive processing. The approaches to be described
take advantage of any information that is known and that can reasonably be as-
sured to be valid in an operational environment. For the remaining uncertainties
the algorithms will estimate on-line the information necessary for their successful
implementation. We will first discuss detection and classification (identification)
of a single chemical from a library of possible chemicals. Next, we will extend the
results to the mixture problem, i.e., when one, two, or possibly three chemical
targets may be present in a single scattered spectrum.
The paper is organized as follows. Section 2.2 describes the problem and
the two-step detection followed by classification scheme that is proposed. An AR
model that models the Raman spectrum is described in Section 2.3. In Section
2.4 we derive the detection test statistic and the overall algorithm in order to
maintain a low false alarm rate which [13] did not consider. The experimental
detection performance for field background data is shown in Section 2.5. Simulation
results in Section 2.6 show that a very low false alarm rate can be obtained. The
classification algorithm is derived in Section 2.7. Here we extend the case of a
single chemical present to mixtures of chemicals, which was not treated in [13]. In
Section 2.8, we present the classification performance for field background data.
Finally, Section 2.9 draws the conclusion.
2.2 Problem Statement and Rationale of Approach
Consider the case when a moving vehicle is equipped with a Raman spectroscopy unit that probes the ground surface at short time intervals (40 milliseconds in our case). A Raman spectrum, or a pulse I_i(F), is received at the ith probe, and consecutive Raman spectra of the road surface are received as the vehicle moves. Each Raman spectrum is an N_f × 1 vector given at equally spaced wavenumbers F. We assume that the background is relatively stationary in composition, that is, it is a road of the same type for a certain time interval. Some of M possible target chemicals may also be present on the background. As a result, the received spectrum at the ith probe could be from background plus noise, or background plus noise and one or several chemicals. We wish to design a testing procedure that decides if no chemicals are present or, if chemicals are present, which chemicals are deposited on the background.
Current approaches to the detection problem have been plagued with high
false alarm rates. Indeed for any operational system the false alarm rate must
be controlled or else the system is deemed unreliable and cannot be used. Nearly
identical considerations arise in sonar [14] and radar [15]. It has been generally
accepted, and this philosophy is reflected in the design of these systems, that one
first performs a decision of either a detection or no detection and then follows this
with a classification. In this way the false alarm rate can be controlled since the
initial step does not consider which target may be present but only that some
target is present. This initial binary hypothesis test then allows one to control
the false alarm rate and to reduce it to a reasonable level. This is in contrast to
attempting to decide whether no target is present versus a subset of M possible
targets. The latter approach requires one to formulate a decision strategy that can
decide among multiple hypotheses, for which an error rate or false alarm rate will
be much higher.
2.3 Spectral Modeling
As mentioned in Section 2.1 the background spectrum is unknown and can
change in time. For the algorithms to accommodate this uncertainty, it is necessary
to estimate the spectrum on-line. To do so, we use a spectral estimator that can estimate the spectrum accurately from a single pulse and with computation modest enough to allow a real-time implementation: the autoregressive (AR) spectral estimator [16].
Similar approaches have been used in radar [17] and sonar [18]. To implement
this estimator it is assumed that spectral data from the output of the Raman
spectroscopy unit is available over a spatial frequency band, i.e., wavenumber band,
which by letting F denote spatial frequency, extends from F = 0 to F = Fc, the
cutoff frequency. This spectral data I(F ) is also called the periodogram in analogy
with Fourier based methods of spectral estimation. Given I(F ) for 0 ≤ F ≤ Fc,
the AR spectral estimate is found as follows, with details given in [16]:
1. Assume a model order, denoted by p, for the AR spectral estimate. This order
is an integer, with smaller values preferred since it relates to the number of
parameters in the model and hence the number of unknowns to be estimated.
2. Based on I(F ) find the real-valued autocorrelation sequence, denoted as
{r[0], r[1], . . . , r[p]}, which is a sampled version (at a rate of 1/Δ samples
per sec) of the inverse continuous-time Fourier transform of I(F ) as
\[
r[k] = \int_0^{2F_c} I(F) \exp(j2\pi F k \Delta)\, dF, \qquad k = 0, 1, \ldots, p \tag{2.1}
\]
where Δ is the interval in time between successive samples of the autocorrela-
tion function. The sample interval should be chosen to be less than 1/(2Fc).
Note that since the spectral data I(F) is one-sided, we
let I(F) = I(2Fc − F) for Fc ≤ F ≤ 2Fc. In this way I(F) can be viewed
as one period of a periodic spectrum, and therefore r[k] becomes real-valued.
The implied sampling rate is then 2Fc.
3. Solve the Yule-Walker equations to estimate the AR filter parameters
{a[1], a[2], . . . , a[p]} from
\[
\begin{bmatrix}
r[0] & r[-1] & \cdots & r[-(p-1)] \\
r[1] & r[0] & \cdots & r[-(p-2)] \\
\vdots & \vdots & \ddots & \vdots \\
r[p-1] & r[p-2] & \cdots & r[0]
\end{bmatrix}
\begin{bmatrix} a[1] \\ a[2] \\ \vdots \\ a[p] \end{bmatrix}
= -\begin{bmatrix} r[1] \\ r[2] \\ \vdots \\ r[p] \end{bmatrix}
\tag{2.2}
\]
and then use these estimated filter parameters to find the excitation noise
variance \(\sigma_u^2\) as
\[
\sigma_u^2 = r[0] + \sum_{k=1}^{p} a[k]\, r[-k]. \tag{2.3}
\]
Note that the matrix is symmetric and Toeplitz since r[−k] = r[k].
4. Once the parameters \(\{a[1], a[2], \ldots, a[p], \sigma_u^2\}\) have been found, the estimated AR
spectrum is
\[
P(F) = \frac{\sigma_u^2 \Delta}{\left|1 + a[1]\exp(-j2\pi F\Delta) + \cdots + a[p]\exp(-j2\pi pF\Delta)\right|^2} \tag{2.4}
\]
for 0 ≤ F ≤ Fc.
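The four steps above can be sketched numerically. The following is a minimal illustration (a hypothetical helper, not the code used in this work) that forms r[k] from the mirrored one-sided periodogram, solves the Toeplitz Yule-Walker system, and evaluates (2.4); it assumes a uniform frequency grid and the implied choice Δ = 1/(2Fc), for which the mirrored exponentials pair into cosines.

```python
import numpy as np

def ar_spectrum_from_periodogram(I, Fc, p):
    """Estimate an AR(p) spectrum from one-sided periodogram samples
    I(F_k) on [0, Fc).  Illustrative sketch of steps 1-4."""
    Nf = len(I)
    F = np.linspace(0.0, Fc, Nf, endpoint=False)   # uniform frequency grid
    delta = 1.0 / (2.0 * Fc)                       # implied sample interval
    dF = Fc / Nf
    # Step 2: r[k] from (2.1); with I(F) mirrored about Fc the complex
    # exponentials pair into cosines, so r[k] is real.
    r = np.array([2.0 * np.sum(I * np.cos(2.0*np.pi*F*k*delta)) * dF
                  for k in range(p + 1)])
    # Step 3: Yule-Walker equations (2.2); the matrix is symmetric Toeplitz.
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:p+1])
    sigma2_u = r[0] + np.dot(a, r[1:p+1])          # (2.3), using r[-k] = r[k]
    # Step 4: AR spectrum (2.4) on the same grid.
    A = 1.0 + sum(a[k-1]*np.exp(-2j*np.pi*F*k*delta) for k in range(1, p+1))
    P = sigma2_u * delta / np.abs(A)**2
    return P, a, sigma2_u
```

A quick sanity check: for a flat periodogram the estimator should return an essentially flat spectrum at the same level.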
Note that this procedure estimates the AR spectrum for a given AR model order
p. However in practice, we also need to estimate the appropriate order p since a
large p will cause overfitting and a small p will cause underfitting. We next assume
that the frequencies have been normalized to discrete frequencies as f = FΔ so
that 0 ≤ f ≤ 1 and hence digital techniques can be used. Clearly, the upper cutoff
frequency Fc corresponds to f = 1/2. The AR model order can be estimated as
follows (see Appendix 2.9 for the derivation):
1. For the spectral data I(f), and for each model order p, estimate the AR filter
parameters or {a[1], a[2], . . . , a[p]} using (2.1) and (2.2), and then estimate
the AR filter frequency response as
Ap(f) = 1 + a[1] exp(−j2πf) + · · · + a[p] exp(−j2πfp) (2.5)
2. Calculate the generalized likelihood ratio \(l_{G_p}(x)\) for each model order p by
\[
l_{G_p}(x) = -N \ln \frac{\sum_{k=1}^{N_f} |A_p(f_k)|^2 I(f_k)\,\Delta f}{\sum_{k=1}^{N_f} I(f_k)\,\Delta f} \tag{2.6}
\]
where N is the unknown number of samples in the time domain since x is
fictitious. We will use N = 2Nf, which produces good results.
3. Choose the model order with the largest of the following:
\[
\mathrm{EEF}(p) =
\begin{cases}
l_{G_p}(x) - p\left[\ln\!\left(\dfrac{l_{G_p}(x)}{p}\right) + 1\right] & \text{if } \dfrac{l_{G_p}(x)}{p} > 1 \\[2mm]
0 & \text{if } \dfrac{l_{G_p}(x)}{p} \le 1
\end{cases} \tag{2.7}
\]
This is the exponentially embedded families (EEF) model order selection
criterion, which has been recently proposed [19].
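The order-selection loop can be sketched as below: a simplified illustration on normalized frequencies f = FΔ (assumed sampled on a uniform grid in (0, 1/2]), re-fitting the Yule-Walker equations for each candidate order and applying (2.6) and (2.7). The function name and grid convention are ours, not from the original.

```python
import numpy as np

def eef_model_order(I, p_max, N=None):
    """Select the AR order via the EEF rule (2.5)-(2.7).
    I: periodogram samples on Nf normalized frequencies in (0, 1/2]."""
    Nf = len(I)
    f = (np.arange(Nf) + 1) / (2.0 * Nf)       # normalized frequency grid
    N = 2 * Nf if N is None else N             # fictitious time-domain count
    df = 1.0 / (2.0 * Nf)
    # real autocorrelation of the mirrored one-sided periodogram
    r = np.array([2.0 * np.sum(I * np.cos(2.0*np.pi*f*k)) * df
                  for k in range(p_max + 1)])
    denom = np.sum(I) * df
    best_p, best_eef = 0, -np.inf
    for p in range(1, p_max + 1):
        R = np.array([[r[abs(i-j)] for j in range(p)] for i in range(p)])
        a = np.linalg.solve(R, -r[1:p+1])      # Yule-Walker (2.2)
        # AR filter response (2.5) and GLR statistic (2.6)
        Ap = 1.0 + sum(a[k-1]*np.exp(-2j*np.pi*f*k) for k in range(1, p+1))
        lG = -N * np.log(np.sum(np.abs(Ap)**2 * I) * df / denom)
        eef = lG - p*(np.log(lG/p) + 1.0) if lG/p > 1.0 else 0.0   # (2.7)
        if eef > best_eef:
            best_p, best_eef = p, eef
    return best_p
```

On a noiseless periodogram generated by a strongly resonant AR(2) model, the penalty term keeps the selected order near the true one.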
As an example, using an estimated model order of p = 40 and a single pulse
from a background of asphalt, the AR spectral estimate and original periodogram
data are shown in Figure 2.1. Note that the AR spectral estimate is able to model
the general shape of the data spectrum as well as the prominent peaks and valleys.
Additionally, if an artificial signal is included in the spectral data, then the AR
spectral estimate (with a different estimated AR model order of p = 44) appears
as in Figure 2.2. Similar results have been obtained for other surfaces such as
gravel and grass. What this says is that the AR spectral model with appropriate
order p is adequate for representing the main details of a spectrum using Raman
spectroscopy. This includes the cases of background only being present as well as
a target chemical deposited on a background. Consequently, for the development
of signal processing algorithms it allows us to consider the spectral data as having
been obtained from a hypothetical AR time series that has been Fourier transformed
and magnitude-squared. If we further assume that this hypothetical time series is
Gaussian, then many of the powerful techniques of statistical signal processing
[20], [6] can be brought to bear upon this problem. As we will see later, the Gaus-
sian assumption is not entirely accurate but algorithms based on it still perform
exceptionally well.
2.4 Detection Algorithm
The detection algorithm consists of two parts. This is necessary to avoid a
high false alarm rate as described previously. It is assumed that when a chemical
is present it must be present in a certain percentage of the returned pulses. A
detector based upon a single pulse with a reasonably low false alarm rate would
require a high threshold and hence a poorer probability of detection. Hence, the
chemical present condition is defined to be in effect when a certain percentage
of successive pulse returns indicate a chemical. The pulse returns that do not
indicate a chemical, when indeed a chemical present condition is in effect, result
from the absence of a chemical in the illuminated area of the laser imaging
Figure 2.1. AR spectral estimate and background spectral data for asphalt surface (Fc = 3300, AR order = 40).
Figure 2.2. AR spectrum for asphalt surface plus an artificial signal (Fc = 3300, AR order = 44).
system. Thus, we have designed a detection system that
1. Examines each successive pulse for a threshold crossing of a test statistic.
2. Registers a chemical present condition when a suitable number of threshold
crossings are present over a fixed interval of time.
We next examine each of these procedures in detail.
2.4.1 Test Statistic
The test statistic is computed for each pulse return, i.e., sequentially in time. To
estimate the background, we will need spectral data from the MB previous pulse
returns that do not have a threshold crossing. The choice of MB is made to ensure
that the background has not changed over this time period. For example, if MB =
25, then for a laser firing rate of 25 pulses/sec, we have effectively assumed that
the background spectral shape is stationary over the time interval of MB/25 = 1
second. Analysis of field data supports this assumption. However, it has also
been found that although the background spectral shape is stationary over a short
period of time, its overall level may change significantly from pulse to pulse. This
necessitates basing any test statistic on the shape of the spectrum rather than its
total power. This can be done by assuming for the background a fixed set of
AR filter parameters from pulse to pulse but with a time varying excitation noise
variance. Also, if some of the previous pulse returns have threshold crossings of
the test statistic, then we exclude them from the MB pulses used in estimating the
background. The test statistic is computed as follows (see Appendix 2.9 for the
derivation and explicit statistical assumptions):
1. Using the previous MB pulses that do not have threshold crossings, compute
the average Raman spectrum. Because the overall background power level
can change from pulse to pulse we must first normalize the power before we
average. To do so we set the total power of each pulse to one by scaling
appropriately. Let IBi(f) represent the Raman spectrum for the ith pulse
after power normalization. Then, we compute the sample average of the
background spectral data as
\[
I_B(f_k) = \frac{1}{M_B} \sum_{i=1}^{M_B} I_{B_i}(f_k) \tag{2.8}
\]
for k = 1, 2, . . . , Nf , where Nf is the number of spectral data points of the
Raman spectrum. Also, IBi(fk) is the Raman spectral data for the ith pulse
at frequency fk, assuming that it previously did not produce a threshold
crossing.
2. Estimate the AR model order p using the procedure described in (2.5), (2.6)
and (2.7). For this estimated order p, use the procedure described in (2.1)
and (2.2) to find the AR filter parameters of IB(fk). These are denoted by
{aB[1], aB[2], . . . , aB[p]}, where the subscript refers to the background
spectral model. Note that these may change in time and therefore will have
to be updated periodically. The estimated background AR filter frequency
response then becomes
AB(f) = 1 + aB[1] exp(−j2πf) + · · · + aB[p] exp(−j2πfp) (2.9)
3. Using the Raman spectrum for the return pulse under consideration, which
we denote as IT (f) and where T refers to a potential target, estimate the
AR model order q using the procedure described in (2.5), (2.6) and (2.7).
Compute the AR parameters, again using (2.1) and (2.2). Note that power
normalization is not needed since only the AR filter parameters are esti-
mated. This produces the AR filter parameters or {aT [1], aT [2], . . . , aT [q]}
and the estimated AR filter frequency response for the current pulse under
consideration as
AT(f) = 1 + aT[1] exp(−j2πf) + · · · + aT[q] exp(−j2πfq) (2.10)
4. The generalized likelihood ratio test (GLRT) statistic is finally computed as
\[
T_D = \ln \frac{\sum_{k=1}^{N_f} |A_B(f_k)|^2 I_T(f_k)}{\sum_{k=1}^{N_f} |A_T(f_k)|^2 I_T(f_k)} \tag{2.11}
\]
which yields values TD ≥ 0. Note that power normalization is not required for
IT(f) since TD does not depend on the scaling of IT(f).
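The statistic (2.11) can be sketched as below, with illustrative helper names; a_B and a_T are the background and current-pulse AR coefficient vectors, and the spectra are assumed sampled on a common normalized-frequency grid.

```python
import numpy as np

def glrt_statistic(I_T, a_B, a_T):
    """Sketch of the GLRT statistic (2.11).  I_T: Raman spectrum of the
    current pulse at Nf normalized frequencies; a_B, a_T: AR coefficients
    for the background model and the current-pulse model."""
    Nf = len(I_T)
    f = (np.arange(Nf) + 1) / (2.0 * Nf)
    def A(f, a):                       # AR filter frequency response
        return 1.0 + sum(a[k-1]*np.exp(-2j*np.pi*f*k)
                         for k in range(1, len(a)+1))
    num = np.sum(np.abs(A(f, a_B))**2 * I_T)   # background whitening power
    den = np.sum(np.abs(A(f, a_T))**2 * I_T)   # current-pulse whitening power
    return np.log(num / den)
```

By construction the statistic is antisymmetric in the two filters, and it is exactly zero when the two AR models coincide.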
This test statistic, which may be viewed as an anomaly detector, will indicate
when the return from any pulse produces a spectrum significantly different from
the background. No information, however, is obtained about the type of departure
and hence of a particular chemical. A threshold crossing, which occurs if TD > γ
for a threshold γ, indicates that the spectrum of the current pulse does not match
the background spectrum. As an example, impurities in the surface will also cause
a threshold crossing. Hopefully, however, these will be isolated occurrences and not
produce a chemical present condition. If this is not the case, then a classification
indicating impurities will be needed.
2.4.2 Overall Detection Algorithm
The test statistic given by (2.11) is computed for each pulse. A threshold
crossing indicates a possible chemical detection in that pulse. In order to declare a
chemical present, however, we expect a certain percentage of the pulse returns to
have a chemical in them. This percentage is currently set to 10%. For example, if
a chemical is present, then for 100 pulses, we expect 10 or more of them to produce
threshold crossings, assuming the test statistic always produces a threshold crossing
when a chemical is present in the pulse return. The remaining 90 test statistics
will not have a threshold crossing since they are based on data for which the laser
did not illuminate the chemical, as explained previously. With this assumption we
can now set the desired threshold for TD. We assume that a chemical is present if
10% or more of the test statistics in a given block of pulse data produce threshold
crossings. These threshold crossings need not be sequential, but can be scattered
anywhere within the block. For example, if the block consists of 100 successive
pulse returns, then a chemical is declared to be present if at least 10 of the test
statistics produce a threshold crossing. This block of 100 successive pulses is
assumed to “slide along” in time. For the example described below, the blocks are
overlapped by 50%, although other overlaps can be used.
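The block-level decision described above can be illustrated with a hypothetical helper (the name and defaults are ours; the text uses 100-pulse blocks, 50% overlap, and a 10% crossing fraction):

```python
import numpy as np

def chemical_present(crossings, block=100, overlap=0.5, frac=0.10):
    """Declare a chemical present in a block if at least frac*block
    threshold crossings occur; blocks slide with the given overlap.
    crossings: per-pulse boolean (or 0/1) array of threshold crossings."""
    crossings = np.asarray(crossings, dtype=int)
    step = int(block * (1.0 - overlap))           # 50 pulses for 50% overlap
    hits = []
    for start in range(0, len(crossings) - block + 1, step):
        count = crossings[start:start + block].sum()
        hits.append(count >= frac * block)        # 10 or more out of 100
    return np.array(hits)
```

The crossings need not be sequential; any 10 scattered within a 100-pulse block trigger a declaration, exactly as in the text.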
Next in order to ensure a fixed false alarm rate, we need to set the threshold,
which we call γ, for TD appropriately. Thus, we must specify the probability of a
threshold crossing for TD, which is PFAp = Pr[TD > γ|H0], and is the probability
of false alarm for a single pulse. Then, once PFAp is found, the threshold γ can be
specified. It is shown in Appendix B how PFAp can be found so that the overall
false alarm rate is less than one false alarm per h hours.
First we find PFAb, which is the probability of a false alarm for a single block,
and is given as the solution of
\[
(1 - P_{FA_b})^L + L\, P_{FA_b} (1 - P_{FA_b})^{L-1} = 0.99 \tag{2.12}
\]
where L = 1800h is the number of blocks analyzed in h hours. This value for
L assumes a pulse rate of 25 per second, a block size of 100 pulses, and a 50%
block overlap. Each block is therefore 4 sec long with an overlap of 2 sec. This
can be solved for \(P_{FA_b}\). Once \(P_{FA_b}\) is found, we can determine \(P_{FA_p}\) by solving the
equation
\[
P_{FA_b} = 1 - \sum_{i=0}^{9} \binom{100}{i} P_{FA_p}^{\,i} \left(1 - P_{FA_p}\right)^{100-i}. \tag{2.13}
\]
This is just the probability that a false alarm occurs in a block, which is defined
as 10 or more threshold crossings out of 100 possible ones. For example, if h = 2
hours and therefore L = 3600, then from (2.12) we have that \(P_{FA_b} = 5 \times 10^{-5}\).
Using this value in the left-hand side of (2.13) we can solve for \(P_{FA_p}\), which is
about \(P_{FA_p} = 0.02\). The details are given in Appendix B. As a result we need to
find the threshold γ so that the probability that TD > γ for a single pulse is 0.02.
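The threshold-setting calculation can be reproduced numerically. The sketch below (illustrative parameter names) solves (2.12) for the per-block probability by bisection and then inverts (2.13) for the per-pulse probability; it recovers values close to the quoted \(P_{FA_b} \approx 5 \times 10^{-5}\) and \(P_{FA_p} \approx 0.02\).

```python
import math

def bisect(fn, lo, hi, iters=200):
    """Simple bisection root finder; assumes fn(lo) and fn(hi) bracket 0."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fn(lo) * fn(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def per_pulse_pfa(h=2.0, pulse_rate=25, block=100, overlap=0.5):
    """Solve (2.12) for the per-block false alarm probability, then (2.13)
    for the per-pulse one.  Defaults follow the numbers in the text."""
    L = int(h * 3600 * pulse_rate / (block * (1 - overlap)))  # blocks in h hours
    # (2.12): P(at most one block false alarm in L blocks) = 0.99
    g = lambda p: (1-p)**L + L*p*(1-p)**(L-1) - 0.99
    p_block = bisect(g, 1e-12, 1e-2)
    # (2.13): a block false alarm = 10 or more crossings out of 100 pulses
    def block_pfa(pp):
        return 1.0 - sum(math.comb(block, i) * pp**i * (1-pp)**(block-i)
                         for i in range(10))
    p_pulse = bisect(lambda pp: block_pfa(pp) - p_block, 1e-6, 0.1)
    return p_block, p_pulse
```

With h = 2 the block count is L = 3600, matching L = 1800h for the stated pulse rate and overlap.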
Theoretically, the GLRT statistic TD should have a chi-squared probability
density function (PDF) with q degrees of freedom [6], which would allow us to
determine γ. It has been found through analysis of field data, however, that this
theoretical PDF is not sufficiently accurate. (This is why, as mentioned earlier,
the Gaussian assumption for the fictitious time series is not always accurate.) As a
result, it is necessary to estimate the PDF of TD when background only is present
and then use this to set the threshold. It is conceivable that this threshold will
depend upon the background statistics, which are unknown. We next indicate how
this is done on-line.
Assume that we have I independent and identically distributed test statistics
\(T_{D_i}\) for i = 1, 2, . . . , I. We can estimate on-line the right-tail probability of the
PDF by using an AR model for the PDF [21], [22]. The procedure is as follows:
1. Normalize the test statistics by a constant equal to the maximum value of the
\(T_{D_i}\)'s. If we denote this as \(T_{max} = \max_{i=1,\ldots,I} T_{D_i}\), then we form the new data
set \(\tilde{T}_{D_i} = T_{D_i}/T_{max}\). Thus all values are now in the range [0, 1] since \(T_{D_i} \ge 0\).
2. Next we use the AR spectral estimator, but as a PDF estimator, with the
“estimated autocorrelation” sequence (actually the estimated characteristic
function)
\[
r[k] = \frac{1}{I} \sum_{i=1}^{I} \exp\!\left(j2\pi k \tilde{T}_{D_i}\right) \tag{2.14}
\]
for k = 0, 1, . . . , p, which will in general be complex-valued. The AR parameters
are estimated using (2.2) and (2.3) but with r[−k] = r∗[k]. The estimated
PDF of TD then becomes
\[
p_{T_D}(t) = \frac{\sigma_u^2}{\left|1 + a[1]\exp(-j2\pi t/T_{max}) + \cdots + a[p]\exp(-j2\pi pt/T_{max})\right|^2} \tag{2.15}
\]
for \(0 \le t \le T_{max}\), where \(\sigma_u^2\) is real-valued and \(\sigma_u^2 > 0\), and the a[k]'s are
complex-valued.
3. Determine the threshold by numerical integration as the value of γ that solves
\[
\int_{\gamma}^{\infty} p_{T_D}(t)\, dt = P_{FA_p}. \tag{2.16}
\]
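Steps 1-3 can be sketched as follows, with the complex "autocorrelation" (2.14) filling a Hermitian Toeplitz system and the threshold obtained by numerically integrating the AR-model PDF of (2.15)-(2.16). The order p = 6, the grid size, and the numerical renormalization of the PDF are illustrative choices, not from the original.

```python
import numpy as np

def estimate_threshold(T, p=6, pfa=0.02, grid=4000):
    """Estimate the threshold gamma of (2.16) from samples T of the test
    statistic, using the AR-model PDF estimator of (2.14)-(2.15)."""
    T = np.asarray(T, float)
    Tmax = T.max()
    Tn = T / Tmax                                    # step 1: scale to [0, 1]
    # step 2: "autocorrelation" = empirical characteristic function (2.14)
    r = np.array([np.mean(np.exp(2j*np.pi*k*Tn)) for k in range(p + 1)])
    R = np.array([[r[i-j] if i >= j else np.conj(r[j-i])
                   for j in range(p)] for i in range(p)])  # Hermitian Toeplitz
    a = np.linalg.solve(R, -r[1:p+1])
    sigma2 = np.real(r[0] + np.sum(a * np.conj(r[1:p+1])))  # (2.3), r[-k]=r*[k]
    # AR-model PDF (2.15) on a grid of t in [0, Tmax]
    t = np.linspace(0.0, Tmax, grid)
    A = 1.0 + sum(a[k-1]*np.exp(-2j*np.pi*k*t/Tmax) for k in range(1, p+1))
    pdf = sigma2 / np.abs(A)**2
    dt = t[1] - t[0]
    pdf /= pdf.sum() * dt                            # renormalize numerically
    # step 3: gamma with right-tail probability pfa (2.16)
    tail = 1.0 - np.cumsum(pdf) * dt
    return t[np.argmax(tail <= pfa)]
```

Since the estimated PDF is strictly positive, the tail probability is strictly decreasing and the threshold grows as the desired false alarm probability shrinks.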
2.5 Experimental Detection Performance for Field Background Data
The following results make use of 10,000 pulses of concrete field background
data to which chemical signatures obtained in the laboratory were added using a
computer. The first 500 pulses of background data only are used for initialization
so that the background spectrum can be estimated as needed for |AB(fk)|2 in
(2.11). Also, using the same 500 pulses the threshold γ is found for the detector
using (2.14–2.16). The threshold is then fixed for the entire remaining 9500 pulses.
From the results to be presented it is found that the threshold will have to be
periodically updated. After the initialization period a chemical signature is added
to the background at a rate of 10% in a random manner. As explained previously,
a window of 25 previous pulses without threshold crossings is used to update the
background spectrum.
As an example of the detection performance for concrete field background
data we set the threshold so that the false alarm rate is at most 1 per 2 hours, as
explained previously. Then, we plot the probability of detection PD versus signal-
to-noise ratio (SNR) for a single chemical. The SNR is defined as the broadband
SNR. This is
\[
\mathrm{SNR} = 10 \log_{10} \frac{\sum_{k=1}^{N_f} \theta P_s(f_k)}{\sum_{k=1}^{N_f} P_B(f_k)} \tag{2.17}
\]
where PB(f) is the PSD of the background, Ps(f) is the known spectral signature
for the chemical, and θ is a scaling factor that produces the desired SNR. We
next plot the probability of detection PDp based on a single pulse, which is the
probability of a threshold crossing, versus SNR. This is done by examining the
pulses where we know that a chemical has been added throughout the 9500 pulses.
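For a given signature and background, the scale factor θ achieving a desired broadband SNR follows directly by inverting (2.17); a one-line illustration (hypothetical helper name):

```python
import numpy as np

def scale_for_snr(Ps, PB, snr_db):
    """Scale factor theta from (2.17) giving the desired broadband SNR in dB
    for signature Ps over background PSD PB, both sampled at the f_k."""
    return 10.0**(snr_db / 10.0) * np.sum(PB) / np.sum(Ps)
```

For flat, equal-power spectra and a target of −10 dB this returns θ = 0.1, as can be verified by substituting back into (2.17).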
The spectra of all the chemicals that are used in the simulations are plotted in
Figure 2.3 (one or more chemicals are added to the background for either detection
performance or classification performance). When chemical 15 is added to the
background, the probability of detection is shown in Figure 2.4. It is seen that
the probability of detection is perfect for SNRs in excess of -10 dB. If we instead
add chemical 31, the results are as shown in Figure 2.5. Again the detection
performance is nearly perfect at a fairly low SNR.
Figure 2.3. Spectra of the chemicals that are used in simulations (panels show chemicals 15, 16, 20, 29, 31, 45, 56, and 58).
Figure 2.4. Probability PDp of detecting chemical 15 versus SNR based on a single pulse.
Figure 2.5. Probability PDp of detecting chemical 31 versus SNR based on a single pulse.
2.6 Experimental False Alarm Rate Performance
Since the threshold is critical to maintaining a reasonable false alarm rate, we
performed an experiment to determine whether the computed threshold was reasonable. For
the same 10,000 pulses (6.67 minutes of data), we used the first 500 pulses for
initialization. Then, for a concrete background (no added chemical) we implemented
the detection algorithm previously described. A false alarm occurs if
the number of threshold crossings for a block of 100 pulses is 10 or more. The same
threshold as found from the first 500 pulses was used throughout the remaining
9500 pulses. It was found that for blocks that are 50% overlapped, as assumed in
the analysis, there were 3 false alarms as shown in Figure 2.6. However, two of the
false alarms are close together and so can be considered as the same one. Hence,
there are 2 false alarms. This is still higher than predicted. In a two-hour period
there would be on the average 36 false alarms, instead of the prediction of 1. This
would imply that the background is not stationary over this time interval. Thus
we should update the threshold periodically.
In this example we then updated the threshold every 500 pulses. That is,
if no detection is declared for any block within 500 pulses, we update the
threshold using all 500 test statistics TD and use the updated threshold for
the next 500 pulses. Otherwise, if a detection is declared for some blocks in
these 500 pulses, we use the test statistics TD from the blocks within these 500 pulses
that do not declare detections to update the threshold. It was found that for blocks
that are 50% overlapped, there were no false alarms for the remaining 9500 pulses.
When we updated the threshold every 500 pulses, even for successive blocks, there
were no false alarms either. This supports our derivation of the threshold
needed to control the false alarm rate.
Figure 2.6. False alarms for a concrete background and fixed threshold (50% overlapping blocks).
2.7 Classification
For the purpose of this paper, we constrain ourselves to single pulse classifica-
tion, that is, we perform a classification based on a single pulse that has a threshold
crossing. Once the detection algorithm declares that some chemicals are present
in a block of data (of say 100 pulses, for which 10 or more have had threshold
crossings), we proceed with single pulse classification on those pulses that have
threshold crossings within this block, and choose the chemicals that appear most
often in single pulse classification. In subsection 2.7.1, it is assumed that only
one of M chemicals may be present. In subsection 2.7.2, we consider the problem
of mixtures of chemicals, i.e., the case when two or possibly three chemicals are
present in a single scattered pulse. To do so, we initially assume in subsection 2.7.2
that we know there are K out of M chemicals on the background, so we just need to
decide which combination of K chemicals is present. Then, in subsection 2.7.3, we
will use a model order selection criterion to decide how many chemicals are present,
i.e., the value of K.
2.7.1 Classification if Only One of M Chemicals Is Present
To determine which chemical is present we compute M test statistics and
choose the chemical with the largest value of the test statistic. The test statistic
that is used is that associated with a locally most powerful (LMP) test [6]. It
can also be interpreted as an estimate of the chemical amplitude normalized by its
standard deviation. The overall classification procedure is as follows (see Appendix
2.9 for the derivation and detailed model):
1. For the pulse I(f) that has a threshold crossing, estimate the background by
using the previous 25 pulses that did not have threshold crossings. To do
so first normalize the power in each of these pulses to have a total power
of one and then average the spectra to yield I(f). Then, estimate the AR
parameters to obtain PB(f) as given by (2.4). Next normalize I(f) to make
\(\sum_{k=1}^{N_f} I(f_k) = \sum_{k=1}^{N_f} P_B(f_k)\). By doing this we make sure the pulse has the same
power as the background. Since the chemical signature power is assumed to
be small, this also guarantees the pulse has about the same background
power, which satisfies the assumption of the M -ary hypothesis test as in
(2.30) in Appendix 2.9. (Note also that by this assumption, the background
normalization needed to form PB(f) should not be affected by the chemical
present.)
2. For each chemical signature Psi(f), use the estimate of the background PB(f)
and the pulse data to be classified I(f) to compute the classification test
statistic
\[
T_{C_i} = \frac{\displaystyle\sum_{k=1}^{N_f} \frac{P_{s_i}(f_k)}{P_B(f_k)} \left(\frac{I(f_k)}{P_B(f_k)} - 1\right)}{\sqrt{\displaystyle\sum_{k=1}^{N_f} \frac{P_{s_i}^2(f_k)}{P_B^2(f_k)}}} \tag{2.18}
\]
Note that the chemical signature \(P_{s_i}(f)\) need not be power normalized since
\(T_{C_i}\) does not depend on the scaling of \(P_{s_i}(f)\).
3. Repeat step 2 for i = 1, 2, . . . , M .
4. Choose the chemical that produces the largest TCi.
Preliminary results indicate that even with a single pulse a nearly perfect classification
can be made, as described in Section 2.8.
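The M-ary procedure reduces to computing (2.18) for each library signature and taking the largest; a minimal sketch (illustrative names, all spectra assumed sampled on the same frequency grid):

```python
import numpy as np

def classify_single_pulse(I, PB, signatures):
    """Sketch of the LMP classification statistic (2.18): compute T_Ci for
    each library signature and pick the largest.  signatures: list of
    arrays Ps_i(f_k); I and PB are the pulse and background spectra."""
    I, PB = np.asarray(I, float), np.asarray(PB, float)
    resid = I / PB - 1.0                    # normalized spectral residual
    scores = []
    for Ps in signatures:
        w = Ps / PB
        scores.append(np.sum(w * resid) / np.sqrt(np.sum(w**2)))
    return int(np.argmax(scores)), scores
```

As the text notes, the statistic is invariant to scaling of each signature, so no power normalization of the library is needed.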
2.7.2 Classification if K out of M Chemicals Are Present
This problem is more complicated since we now need to pick K out of M
chemicals instead of just picking one out of M . The total number of possible
combinations is \(\binom{M}{K}\). An asymptotic likelihood function method is proposed. The
idea is that we first find the asymptotic maximum likelihood estimate (MLE) of the
unknown powers for chemical signatures and plug it into the corresponding log-
likelihood function of this hypothesis. The chemical combination that produces
the largest log-likelihood is chosen. The classification procedure is as follows (see
Appendix 2.9 for the derivation and detailed model):
1. The first step is the same as in the previous subsection. Obtain the average
spectrum of the chemical plus background I(f).
2. For each chemical combination hypothesis, compute the asymptotic MLE of
the chemical signature powers by
\[
\hat{\theta} = I^{-1}(0) \left.\frac{\partial \ln p(x; \theta)}{\partial \theta}\right|_{\theta = 0} \tag{2.19}
\]
where
\[
\left.\frac{\partial \ln p(x; \theta)}{\partial \theta_i}\right|_{\theta = 0} = \frac{N}{2} \sum_{k=1}^{N_f} \frac{P_{s_{k_i}}(f_k)}{P_B(f_k)} \left(\frac{I(f_k)}{P_B(f_k)} - 1\right) \Delta f \tag{2.20}
\]
and
\[
[I(0)]_{ij} = \frac{N}{2} \sum_{k=1}^{N_f} \frac{P_{s_{k_i}}(f_k)\, P_{s_{k_j}}(f_k)}{\left(P_B(f_k)\right)^2}\, \Delta f. \tag{2.21}
\]
If there is at least one negative element in \(\hat{\theta}\), set the log-likelihood of this
hypothesis to −∞. Otherwise plug \(\hat{\theta}\) into the following log-likelihood function:
\[
\ln p(x; \hat{\theta}) = -\frac{N}{2} \sum_{k=1}^{N_f} \left[ \ln\!\left( \sum_{i=1}^{K} \hat{\theta}_i P_{s_{k_i}}(f_k) + P_B(f_k) \right) + \frac{I(f_k)}{\sum_{i=1}^{K} \hat{\theta}_i P_{s_{k_i}}(f_k) + P_B(f_k)} \right] \Delta f - \frac{N}{2}\ln(2\pi) \tag{2.22}
\]
Also note that the chemical signatures \(P_{s_{k_i}}\) do not need to be power normalized
since the ith element \(\hat{\theta}_i\) of \(\hat{\theta}\) from (2.19) is proportional to \(1/P_{s_{k_i}}\).
Thus \(\hat{\theta}_i P_{s_{k_i}}\) in (2.22) does not depend on the scaling of \(P_{s_{k_i}}\).
3. Repeat step 2 for all the \(\binom{M}{K}\) hypotheses.
4. Choose the chemical combination that corresponds to the hypothesis having the
largest log-likelihood. Note that the number of data samples N in the time
domain is unknown. However, we can compare the log-likelihoods without
knowing N. This is because \(\hat{\theta}\) does not depend on N, since N cancels
in (2.19), and N is just a scaling factor in (2.22).
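The combination search can be sketched directly from (2.19)-(2.22); an illustrative implementation (names and the flat Δf are ours) that omits the −(N/2) ln(2π) constant, since it is common to all hypotheses:

```python
import numpy as np
from itertools import combinations

def classify_mixture(I, PB, signatures, K, N=2048, df=1.0):
    """Pick the best K-of-M chemical combination: asymptotic MLE of the
    signature powers via (2.19)-(2.21), plugged into (2.22)."""
    I, PB = np.asarray(I, float), np.asarray(PB, float)
    resid = I / PB - 1.0
    best, best_ll = None, -np.inf
    for combo in combinations(range(len(signatures)), K):
        S = np.array([signatures[i] / PB for i in combo])      # K x Nf
        grad = (N/2.0) * (S @ resid) * df                      # (2.20)
        Fisher = (N/2.0) * (S @ S.T) * df                      # (2.21)
        theta = np.linalg.solve(Fisher, grad)                  # (2.19)
        if np.any(theta < 0):
            continue                    # log-likelihood set to -inf
        P = theta @ np.array([signatures[i] for i in combo]) + PB
        ll = -(N/2.0) * np.sum(np.log(P) + I / P) * df         # (2.22)
        if ll > best_ll:
            best, best_ll = combo, ll
    return best
```

For signatures with disjoint support over a flat background the MLE recovers the true powers exactly, so the correct pair wins.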
2.7.3 Model Order Selection on How Many Chemicals Are Present in the Mixture
We have considered the case when we know the number of chemicals that are
present. But in practice, this information is unknown a-priori. Thus, we need to
select the model order, i.e., how many chemicals are present. Again, we will use
the EEF as the model order selection criterion. For each hypothesis, the EEF can
be calculated by
\[
\mathrm{EEF} =
\begin{cases}
l_G(x) - K\left[\ln\!\left(\dfrac{l_G(x)}{K}\right) + 1\right] & \text{if } \dfrac{l_G(x)}{K} > 1 \\[2mm]
0 & \text{if } \dfrac{l_G(x)}{K} \le 1
\end{cases} \tag{2.23}
\]
where K is the assumed number of chemicals deposited on the background and
\[
l_G(x) = 2 \ln \frac{p(x; \hat{\theta})}{p(x; 0)}.
\]
The log-likelihood functions \(\ln p(x; \hat{\theta})\) and \(\ln p(x; 0)\) can be found by plugging
\(\hat{\theta}\) from (2.19) and \(\theta = 0\) into (2.22), respectively. We choose the hypothesis with the largest EEF
value.
Since the EEF is increasing in lG(x), for the same model order K the largest
lG(x) corresponds to the largest EEF. So for each K, we just need to find lG(x)
for all the \(\binom{M}{K}\) hypotheses, choose the largest lG(x), and plug it into (2.23). Then we
compare the EEFs for different K's and choose the model with the largest EEF.
We select the hypothesis with the largest lG(x) for the model order that has been
chosen.
Again, we need the number of data samples N in the time domain in computing
the EEF, since (2.22) depends on N. We will assume the same number of
samples in the time domain as in the frequency domain. Since we have Nf = 1024
samples equally spaced on half a period in the frequency domain, we will use
N = 2Nf = 2048. By simulation we have seen that the performance is excellent
with N = 2048.
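Adding the EEF rule (2.23) on top of the asymptotic MLE gives a sketch of the full order-selection loop; a self-contained illustration under simplified assumptions (flat Δf, N = 2048, illustrative names), with the common −(N/2) ln(2π) term cancelling inside lG(x):

```python
import numpy as np
from itertools import combinations

def select_num_chemicals(I, PB, signatures, K_max=3, N=2048, df=1.0):
    """EEF rule (2.23) for the number K of chemicals: for each K keep the
    hypothesis with the largest l_G(x), then compare EEF values."""
    I, PB = np.asarray(I, float), np.asarray(PB, float)
    resid = I / PB - 1.0
    ll0 = -(N/2.0) * np.sum(np.log(PB) + I/PB) * df        # ln p(x; 0)
    best_K, best_eef = 0, 0.0
    for K in range(1, K_max + 1):
        lG_best = -np.inf
        for combo in combinations(range(len(signatures)), K):
            S = np.array([signatures[i] / PB for i in combo])
            theta = np.linalg.solve((N/2.0)*(S @ S.T)*df,     # (2.19)-(2.21)
                                    (N/2.0)*(S @ resid)*df)
            if np.any(theta < 0):
                continue              # hypothesis assigned -inf likelihood
            P = theta @ np.array([signatures[i] for i in combo]) + PB
            ll = -(N/2.0) * np.sum(np.log(P) + I/P) * df      # (2.22)
            lG_best = max(lG_best, 2.0*(ll - ll0))            # l_G(x)
        if lG_best / K > 1.0:
            eef = lG_best - K*(np.log(lG_best/K) + 1.0)       # (2.23)
            if eef > best_eef:
                best_K, best_eef = K, eef
    return best_K
```

With two chemicals actually present, the third coefficient estimates to zero at K = 3, so lG(x) stops growing and the larger penalty makes the EEF correctly select K = 2.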
2.8 Experimental Classification Performance for Field Background Data
For the same data conditions as for the detection experiment, we isolate all
the pulses that have had threshold crossings. The probability of a correct single
pulse classification is found by
\[
P_C = \frac{\text{number of correct classifications for the pulses that have the added chemical}}{\text{number of pulses that have the added chemical}}.
\]
First we consider the case when there is only one chemical present. Using a library
of M = 60 possible chemicals, we classify the pulses with threshold crossings as per
the discussion in subsection 2.7.1. The results for chemicals 15, 31 and 45 are shown in
Figures 2.7, 2.8 and 2.9 respectively. Again nearly perfect results are obtained for
an SNR in excess of -10 dB.
Figure 2.7. Probability of correct single pulse classification versus SNR. Chemical 15 is present.
Figure 2.8. Probability of correct single pulse classification versus SNR. Chemical 31 is present.
Next we added the two chemicals 15 and 16 to the background, each chemical
with the same SNR. We assume that we know the number of chemicals present. In
Figure 2.9. Probability of correct single pulse classification versus SNR. Chemical 45 is present.
the simulation, we have found that the classifier will sometimes choose chemicals
16 and 29. In Figure 2.10, we see that the probability of choosing chemicals 15
and 16 does not go to 1 as SNR increases. But if we consider chemicals 15 and 29
to be the same, the performance is much improved. This is because the spectrum
of chemical 15 is very similar to that of chemical 29 as shown in Figure 2.3. The
correlation between the spectra of the two chemicals is 0.968, which means that
they are approximately linearly dependent. In this case, it is hard to distinguish
between these two chemicals. Two approaches are possible. We can either treat
chemicals 15 and 29 as the same, or remove chemical 29 from the library. When
the classifier chooses chemical 15 in the case where chemical 29 is removed, a
second-stage classification can be performed to further discriminate between chemical 15
and chemical 29. A similar approach is considered in [10], where one spectrum
of each spectrum pair whose correlation is greater than a threshold is removed.
The performance is shown in Figure 2.11 when chemical 29 is removed from the
library. The simulation results for the chemical 20 and 45 combination and for
the chemical 31 and 45 combination are shown in Figure 2.12 and Figure 2.13
respectively. These combinations are easily classified.
Figure 2.10. Probability of correct single pulse classification versus SNR. Chemicals 15 and 16 are present. (Curves: chemicals 15 and 29 treated as the same vs. as different.)
Figure 2.11. Probability of correct single pulse classification versus SNR. Chemicals 15 and 16 are present; chemical 29 is removed from the library.
Figure 2.12. Probability of correct single pulse classification versus SNR. Chemicals 20 and 45 are present.
Figure 2.13. Probability of correct single pulse classification versus SNR. Chemicals 31 and 45 are present.
Next we would like to ascertain how the EEF works for the case when the
number of chemicals in the mixture is unknown. We assume that there are at
most 3 chemicals present. Thus, we need to compare the EEF for K = 1, 2, 3.
Chemicals 15, 56 and 58 with the same SNR are added to the background. The
performance of the EEF is compared to that of the minimum description length
(MDL) criterion. The MDL is based on coding arguments [23] and can also be
derived by an asymptotic Bayesian procedure [24]. We still consider chemicals 15
and 29 as the same chemical because of the high correlation between them. The
resulting probability of correct classification versus SNR is shown in Figure 2.14.
The result for the chemical 15, 31 and 45 combination is shown in Figure 2.15. The
result for the chemical 20 and 45 combination is shown in Figure 2.16. Comparing
Figure 2.12 and Figure 2.16, we see that the former produces a slightly higher
probability of correct classification. This is because for Figure 2.12, we assume
that we know the number of chemicals, but for Figure 2.16, we need to estimate
the number of chemicals.
Figure 2.14. Probability of correct single pulse classification versus SNR using the EEF and the MDL. Chemicals 15, 56 and 58 are present.
As we have seen, some of the target chemicals in the library are highly cor-
related. As a result, we need to remove some of them from the library or else
Figure 2.15. Probability of correct single pulse classification versus SNR using the EEF and the MDL. Chemicals 15, 31 and 45 are present.
Figure 2.16. Probability of correct single pulse classification versus SNR using EEF and MDL. Chemicals 20 and 45 are present.
treat them as a group of similar chemicals. A further consideration is that a linear
combination of some chemicals might appear similar to another single chemical.
Then the classifier would not perform well if that single chemical were present,
since we might choose the chemicals that form the equivalent linear combination
instead. A future paper will address this issue.
2.9 Conclusion
An AR model has been proposed for chemical detection and classification based on Raman spectra. A detection procedure followed by a classification scheme is used to control the false alarm rate. This is an unsupervised approach which estimates the statistics of the nonstationary background data on-line. Experiments with field background data have shown excellent performance of both the detector and the classifier.
Appendix 2A - Derivation of the AR Model Order Estimator
The basic assumption is that the spectral data obtained through the action of the Raman spectroscopy unit can be modeled as a periodogram of real-valued Gaussian data. This implies certain statistics of the spectral data which, although not completely satisfied, allow us to derive a detector that will perform well in practice. As an example of this modeling discrepancy, in analyzing field-obtained spectral data it has been found that the probability density function of the spectral data is not chi-squared with two degrees of freedom, as the Gaussian model implies. Hence, algorithms which push the Gaussian assumption too far may not work as predicted. Fortunately, for the problem at hand the algorithms so derived appear to perform exceedingly well.
Assume that N samples {x[0], x[1], . . . , x[N − 1]} in the time domain of the
Gaussian AR random process are observed (this is fictitious). We assume that we
have the same number of samples in the time domain as in the frequency domain
(as in a discrete Fourier transform). Since Nf is the number of samples equally
spaced on half a period in the frequency domain, we have N = 2Nf . We need
to estimate the order of the AR process. This is a multiple hypothesis testing
problem with
H_0 : a[1] = 0, a[2] = 0, ..., a[p_M] = 0, σ_u² > 0
H_1 : a[1] ≠ 0, a[2] = 0, ..., a[p_M] = 0, σ_u² > 0
...
H_{p_M} : a[1] ≠ 0, a[2] ≠ 0, ..., a[p_M] ≠ 0, σ_u² > 0
where p_M is the largest candidate model order. That is, for the AR process of order p, only the first p AR parameters are nonzero. Let p(x; a_p, σ_u², H_p) denote the PDF under H_p, where x denotes the random process data vector and a_p is the p × 1 vector of the first p nonzero AR filter parameters. Note that under H_0 the AR process of order 0 is white Gaussian noise, so we write the PDF under H_0 as p(x; σ_u², H_0).
To estimate the order, we resort to the exponentially embedded families (EEF) criterion, a recently proposed model order selection method [19]. It has been shown that asymptotically the EEF minimizes the divergence between the true PDF and the estimated one. For each hypothesis H_p, the EEF is computed as
EEF(p) = { l_{G_p}(x) − p [ ln( l_{G_p}(x)/p ) + 1 ],   if l_{G_p}(x)/p > 1
         { 0,                                           if l_{G_p}(x)/p ≤ 1      (2.24)
where l_{G_p}(x) is the generalized likelihood ratio for H_p [6] with

l_{G_p}(x) = 2 ln [ p(x; â_p, σ̂²_{u_p}, H_p) / p(x; σ̂²_{u_0}, H_0) ]      (2.25)

Here â_p and σ̂²_{u_p} are the maximum likelihood estimators (MLEs) of a_p and σ_u² under H_p, and σ̂²_{u_0} is the MLE of σ_u² under H_0. The EEF criterion chooses the hypothesis with the largest EEF value.
The PDF can be written in the frequency domain (and hence the time series data can be replaced by the spectral data) as [20]

ln p(x; a_p, σ_u², H_p) = −(N/2) ln 2π − (N/2) ∫₀¹ [ ln P_p(f) + I(f)/P_p(f) ] df      (2.26)
where I(f) is the periodogram of the data and P_p(f) is the true power spectral density (PSD) of the AR process with parameters a_p, σ_u². Since

P_p(f) = σ_u² / |A_p(f)|²

the log-PDF can be written as

ln p(x; a_p, σ_u², H_p) = −(N/2) ln 2π − (N/2) ∫₀¹ [ ln( σ_u²/|A_p(f)|² ) + I(f)|A_p(f)|²/σ_u² ] df
                        = −(N/2) ln 2π − (N/2) ∫₀¹ [ ln σ_u² + |A_p(f)|² I(f)/σ_u² ] df
since it can be shown that ∫₀¹ ln |A_p(f)|² df = 0 [16]. Next we maximize the log-PDF over σ_u² to obtain the MLE

σ̂²_{u_p} = ∫₀¹ |A_p(f)|² I(f) df

and substituting back into ln p(x; a_p, σ̂²_{u_p}, H_p) yields

ln p(x; a_p, σ̂²_{u_p}, H_p) = −(N/2) ln 2π − (N/2) ln ∫₀¹ |A_p(f)|² I(f) df − N/2.
Finally, we need to maximize this over a_p to obtain â_p. It can be shown that this maximization requires one to use the Yule-Walker equations to estimate the AR filter parameters. Denoting the resultant MLE of A_p(f) under H_p as Â_p(f), we have

ln p(x; â_p, σ̂²_{u_p}, H_p) = −(N/2) ln 2π − (N/2) ln ∫₀¹ |Â_p(f)|² I(f) df − N/2
Note that since we have white Gaussian noise under H_0, we have A_0(f) = 1. Maximizing the log-PDF over σ_u² for A_0(f) = 1 yields

σ̂²_{u_0} = ∫₀¹ I(f) df

and hence

ln p(x; σ̂²_{u_0}, H_0) = −(N/2) ln 2π − (N/2) ln ∫₀¹ I(f) df − N/2
As a result,

l_{G_p}(x) = 2 ln [ p(x; â_p, σ̂²_{u_p}, H_p) / p(x; σ̂²_{u_0}, H_0) ] = −N ln [ ∫₀¹ |Â_p(f)|² I(f) df / ∫₀¹ I(f) df ]      (2.27)
When this is discretized over the band 0 ≤ f ≤ 1/2 we have

l_{G_p}(x) = −N ln [ Σ_{k=1}^{N_f} |Â_p(f_k)|² I(f_k) Δf / Σ_{k=1}^{N_f} I(f_k) Δf ]      (2.28)
Finally we choose the AR model with the largest EEF calculated by (2.24).
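As a concrete illustration of this appendix, the following numpy sketch computes the EEF of (2.24) with l_{G_p} from (2.28) and selects the AR order. The function names are ours, and the symmetric extension of the periodogram samples and the direct Yule-Walker solve are simplifying assumptions, not the exact implementation used in the dissertation.

```python
import numpy as np

def ar_fit_from_periodogram(I, p):
    """Yule-Walker AR(p) fit from N_f periodogram samples on [0, 1/2).

    Autocorrelations are estimated by inverse-DFT of the periodogram
    (mirrored to a full period); a[0] = 1 by convention.
    """
    full = np.concatenate([I, I[::-1]])          # approximate symmetric extension
    r = np.fft.ifft(full).real[: p + 1]          # autocorrelation estimates
    if p == 0:
        return np.array([1.0]), r[0]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1 : p + 1])        # Yule-Walker equations
    sigma2 = r[0] + a @ r[1 : p + 1]             # excitation noise variance
    return np.concatenate([[1.0], a]), sigma2

def eef_order(I, p_max):
    """Choose the AR order maximizing the EEF of (2.24)."""
    Nf = len(I)
    N = 2 * Nf
    f = np.arange(Nf) / N                        # N_f samples on [0, 1/2)
    total = I.sum()                              # proportional to ∫ I(f) df
    best_p, best_eef = 0, 0.0                    # EEF(0) = 0
    for p in range(1, p_max + 1):
        a, _ = ar_fit_from_periodogram(I, p)
        E = np.exp(-2j * np.pi * np.outer(f, np.arange(p + 1)))
        A2 = np.abs(E @ a) ** 2                  # |A_p(f_k)|^2
        lG = -N * np.log((A2 * I).sum() / total)     # eq. (2.28)
        eef = lG - p * (np.log(lG / p) + 1.0) if lG > p else 0.0
        if eef > best_eef:
            best_p, best_eef = p, eef
    return best_p
```

For an AR(1)-shaped spectrum, the fit recovers the pole and the EEF picks order 1, since higher orders gain almost nothing in l_{G_p} but pay a larger penalty.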
Appendix 2B - Derivation of Test Statistic for Detection
To begin, we assume that the background random process (in the time domain) is a real-valued Gaussian AR process with parameters {a_B[1], a_B[2], ..., a_B[p], σ_u²}. These parameters, as well as the order p, are estimated using the sample average of the background spectral data as in (2.8). Under H_0, which is background only, the AR filter parameters are assumed known but the excitation noise variance is not. Under H_1, the AR filter parameters and the excitation noise variance are both unknown. Let q be the estimated AR model order under H_1 using the observed spectral data I(f). Then, we set up the hypothesis test
H_0 : AR parameters are a_B[1], a_B[2], ..., a_B[p], σ_u² > 0
H_1 : AR parameters are a[1], a[2], ..., a[q], σ_u² > 0
This effectively says that under H_0 (no signal present) the spectrum is just the known background spectrum, although with an unspecified σ_u². Under H_1 the shape of the spectrum is changed due to the change in the AR filter parameters, which is caused by the presence of a signal added to the background. We also assume that the fictitious N time samples {x[0], x[1], ..., x[N − 1]} of the Gaussian AR random process are observed. Let p(x; a_B, σ_u², H_0) denote the PDF under H_0 and p(x; a, σ_u², H_1) the PDF under H_1, where x denotes the random process data vector, a_B is the known p × 1 vector of AR filter parameters, and a is the unknown q × 1 vector of AR filter parameters. The generalized
likelihood ratio test (GLRT) statistic is [6]

l_G(x) = ln [ p(x; â, σ̂²_{u_1}, H_1) / p(x; a_B, σ̂²_{u_0}, H_0) ]      (2.29)

where â and σ̂²_{u_1} are the maximum likelihood estimators (MLEs) of a and σ_u² under H_1, and σ̂²_{u_0} is the MLE of σ_u² under H_0.
From derivations similar to those in Appendix 2A, and denoting the resultant MLE of A(f) under H_1 as Â_T(f), we have

ln p(x; a_B, σ̂²_{u_0}, H_0) = −(N/2) ln 2π − (N/2) ln ∫₀¹ |A_B(f)|² I(f) df − N/2

ln p(x; â, σ̂²_{u_1}, H_1) = −(N/2) ln 2π − (N/2) ln ∫₀¹ |Â_T(f)|² I(f) df − N/2
and finally from (2.29)

l_G(x) = (N/2) ln [ ∫₀¹ |A_B(f)|² I(f) df / ∫₀¹ |Â_T(f)|² I(f) df ].
When this is discretized over the band 0 ≤ f ≤ 1/2 we have

l_G(x) = (N/2) ln [ Σ_{k=1}^{N_f} |A_B(f_k)|² I(f_k) / Σ_{k=1}^{N_f} |Â_T(f_k)|² I(f_k) ]

and omitting the N/2 factor, we finally have (2.11).
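The discretized detection statistic above can be sketched in a few lines of numpy. This is our own minimal illustration (function names are ours): in practice Â_T(f) would come from an AR fit to the observed spectrum, e.g., via the Yule-Walker procedure of Appendix 2A.

```python
import numpy as np

def whitened_power(I, a):
    """Sum_k |A(f_k)|^2 I(f_k) for A(f) = 1 + a[1] e^{-j2πf} + ..."""
    f = np.arange(len(I)) / (2 * len(I))          # N_f samples on [0, 1/2)
    E = np.exp(-2j * np.pi * np.outer(f, np.arange(len(a))))
    return np.sum(np.abs(E @ np.asarray(a)) ** 2 * I)

def detection_statistic(I, a_B, a_T):
    """Log-ratio of background-whitened to target-model-whitened power
    (the N/2 factor is omitted, as in (2.11))."""
    return np.log(whitened_power(I, a_B) / whitened_power(I, a_T))
```

When the observed spectrum matches the background model exactly, the statistic is zero; a target-model fit that whitens the data better than the background model drives it positive.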
Appendix 2C - Derivation of Probability of Detection Statistic Threshold Crossing for Given False Alarm Rate
We declare that a chemical has been detected if at least 10% of the pulses in a given block produce threshold crossings. As an example, we consider the block to consist of 100 pulses, and hence a detection occurs if at least 10 threshold crossings are observed. Also, we assume an operational requirement of one false alarm per two hours. Computing the desired probabilities exactly is difficult because successive blocks, which differ by only one sample, are heavily dependent. As an approximation we assume that the blocks overlap by 50% (which may be necessary in practice to avoid excessive computation and is commonly done) and therefore that the data in each block are approximately independent. Then, in two hours we have examined
L = (2 × 3600 × 25)/50 = 3600 blocks
for a 10% threshold crossing rate. Hence, the probability of false alarm for each block is obtained as follows. Let P_{FA_b} be the probability of a false alarm for a single block. Then, the probability of at most one false alarm in L independent blocks is
P_1 = P[at most one false alarm in L blocks]
    = P[no false alarms in L blocks] + P[one false alarm in L blocks]
    = (1 − P_{FA_b})^L + L P_{FA_b} (1 − P_{FA_b})^{L−1}
since this is a binomial type of probability. We want the probability of at most one false alarm per two hours to be large, say 0.99. Hence, we solve for P_{FA_b} by finding the value that satisfies

(1 − P_{FA_b})^L + L P_{FA_b} (1 − P_{FA_b})^{L−1} = 0.99.
In general, for at most one false alarm per h hours we should use L = 1800h. For the example of h = 2 we plot the probability of at most one false alarm per two hours versus P_{FA_b} in Figure 2.17. It is seen that we should require P_{FA_b} = 4 × 10⁻⁵, which is the probability of a false alarm for a single block of 100
Figure 2.17. Probability P_1 of at most one false alarm per two hours versus P_{FA_b}.
pulses. Next, since we declare a chemical present if there are at least 10 threshold crossings out of a possible 100, the probability of a false alarm for a single block is

P_{FA_b} = Σ_{k=10}^{100} C(100, k) P_{FA_p}^k (1 − P_{FA_p})^{100−k} = 1 − Σ_{k=0}^{9} C(100, k) P_{FA_p}^k (1 − P_{FA_p})^{100−k}
where P_{FA_p} is the probability of a threshold crossing, i.e., the probability of a false alarm for a single pulse. In Figure 2.18 we plot P_{FA_b} versus P_{FA_p}. For P_{FA_b} = 4 × 10⁻⁵ = −44 dB, we require from Figure 2.18 that P_{FA_p} = 0.02. Hence, the threshold γ of the test statistic given by (2.11) should be set so that the probability of T_D exceeding γ is 0.02.
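The two numbers read off Figures 2.17 and 2.18 can be checked numerically. The sketch below (our own helper names) bisects the two monotone equations of this appendix: the at-most-one-false-alarm constraint for P_{FA_b}, and the binomial tail for P_{FA_p}.

```python
from math import comb

def block_pfa(P1_target, L):
    """Bisect (1-p)^L + L p (1-p)^(L-1) = P1_target for the per-block P_FAb."""
    lo, hi = 0.0, 1e-2
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        P1 = (1 - mid) ** L + L * mid * (1 - mid) ** (L - 1)
        if P1 > P1_target:          # P1 decreases as p grows, so push p upward
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def pulse_pfa(pfa_block, n=100, k_min=10):
    """Bisect the binomial tail P(X >= k_min), X ~ Bin(n, q), to match P_FAb."""
    def tail(q):
        return 1.0 - sum(comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(k_min))
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if tail(mid) < pfa_block:   # the tail grows with q
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With L = 3600 and P_1 = 0.99, the first solve gives roughly 4 × 10⁻⁵, and matching that block probability gives a per-pulse probability of about 0.02, consistent with the figures.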
Appendix 2D - Derivation of LMP Test Statistic for Classification
It is assumed that one of M chemicals is present. The spectral data is assumed to be of the form P(f) = θ_i P_{s_i}(f) + P_B(f) if the ith chemical is present. As usual P_B(f) is the PSD of the background, P_{s_i}(f) is the known spectral signature of the ith chemical, and θ_i is an unknown scaling factor that accounts for the unknown power of the chemical. To decide which chemical is present we set up an M-ary
Figure 2.18. Probability of a false alarm for a single block, P_{FA_b} (dB), versus P_{FA_p}.
hypothesis test as
H_1 : P(f) = θ_1 P_{s_1}(f) + P_B(f)
H_2 : P(f) = θ_2 P_{s_2}(f) + P_B(f)
...
H_M : P(f) = θ_M P_{s_M}(f) + P_B(f).
The θ_i's are positive but otherwise unknown. They are assumed to be small so that a locally most powerful (LMP) approach can be used. The LMP classification test statistic decides chemical k is present if, among

T_{C_i}(x) = [ ∂ ln p(x; H_i)/∂θ_i |_{θ_i=0} ] / √( I_{F_i}(0) )      (2.30)

for i = 1, 2, ..., M, T_{C_k}(x) is the maximum. In (2.30), I_{F_i}(0) is the Fisher information for θ_i evaluated at θ_i = 0. To evaluate the test statistics, we first note that the log-PDF of the spectral data is given, as in Appendix 2A, by

ln p(x) = −(N/2) ln 2π − (N/2) ∫₀¹ [ ln P(f) + I(f)/P(f) ] df.
Using P(f) = θP_s(f) + P_B(f) we have

ln p(x) = −(N/2) ln 2π − (N/2) ∫₀¹ [ ln( θP_s(f) + P_B(f) ) + I(f)/( θP_s(f) + P_B(f) ) ] df

and differentiating produces

∂ ln p(x)/∂θ = −(N/2) ∫₀¹ [ P_s(f)/( θP_s(f) + P_B(f) ) − I(f)P_s(f)/( θP_s(f) + P_B(f) )² ] df      (2.31)
which when evaluated at θ = 0 yields

∂ ln p(x)/∂θ |_{θ=0} = −(N/2) ∫₀¹ [ P_s(f)/P_B(f) − I(f)P_s(f)/P_B²(f) ] df
                     = (N/2) ∫₀¹ [ P_s(f)/P_B(f) ] [ I(f)/P_B(f) − 1 ] df.      (2.32)
To determine the Fisher information we differentiate (2.31) a second time to produce

∂² ln p(x)/∂θ² = −(N/2) ∫₀¹ [ −P_s²(f)/( θP_s(f) + P_B(f) )² + 2I(f)P_s²(f)/( θP_s(f) + P_B(f) )³ ] df.

Taking the expected value and noting that E[I(f)] = P(f) = θP_s(f) + P_B(f) produces

E[ ∂² ln p(x)/∂θ² ] = −(N/2) ∫₀¹ [ −P_s²(f)/( θP_s(f) + P_B(f) )² + 2( θP_s(f) + P_B(f) )P_s²(f)/( θP_s(f) + P_B(f) )³ ] df
                    = −(N/2) ∫₀¹ [ −P_s²(f)/( θP_s(f) + P_B(f) )² + 2P_s²(f)/( θP_s(f) + P_B(f) )² ] df
                    = −(N/2) ∫₀¹ P_s²(f)/( θP_s(f) + P_B(f) )² df.
Setting θ = 0 and taking the negative produces

I_F(0) = (N/2) ∫₀¹ P_s²(f)/P_B²(f) df.      (2.33)
Therefore, from (2.32) and (2.33), the LMP statistic becomes

T_C = √(N/2) ∫₀¹ [ P_s(f)/P_B(f) ] [ I(f)/P_B(f) − 1 ] df / √( ∫₀¹ P_s²(f)/P_B²(f) df ).      (2.34)
When discretized over the band 0 ≤ f ≤ 1/2, this becomes

√(N/2) Σ_{k=1}^{N_f} [ P_s(f_k)/P_B(f_k) ] [ I(f_k)/P_B(f_k) − 1 ] Δf / √( Σ_{k=1}^{N_f} P_s²(f_k)/P_B²(f_k) Δf )

and ignoring a scaling factor, which does not affect the maximum, we finally have

T_C = Σ_{k=1}^{N_f} [ P_s(f_k)/P_B(f_k) ] [ I(f_k)/P_B(f_k) − 1 ] / √( Σ_{k=1}^{N_f} P_s²(f_k)/P_B²(f_k) )
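The final discretized statistic is a normalized correlation of the background-normalized periodogram excess with each signature, which makes it simple to implement. A minimal numpy sketch (function names are ours):

```python
import numpy as np

def lmp_statistic(I, Ps, PB):
    """Discretized T_C above (constant scaling omitted)."""
    num = np.sum((Ps / PB) * (I / PB - 1.0))
    den = np.sqrt(np.sum((Ps / PB) ** 2))
    return num / den

def classify(I, signatures, PB):
    """Decide the chemical whose LMP statistic is largest."""
    return int(np.argmax([lmp_statistic(I, Ps, PB) for Ps in signatures]))
```

With well-separated signatures, a spectrum containing one chemical correlates strongly only with that chemical's signature, so the argmax picks it out.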
Appendix 2E - Derivation of the Asymptotic Likelihood Function Method for Classification of a Mixture of Chemicals
We assume that K of the M chemicals are present and that they are additive. Hence, the spectral data is of the form P(f) = Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) if chemicals k_1, k_2, ..., k_K are present. The total number of candidate hypotheses is C(M, K). Let the unknown parameter vector be θ = [θ_{k_1}, θ_{k_2}, ..., θ_{k_K}]ᵀ. Asymptotically [6],

θ̂ = θ_0 + I⁻¹(θ_0) ∂ ln p(x; θ)/∂θ |_{θ=θ_0}

or, since in our problem θ_0 = 0,

θ̂ = I⁻¹(0) ∂ ln p(x; θ)/∂θ |_{θ=0}.      (2.35)
For each candidate hypothesis,

ln p(x; θ) = −(N/2) ∫₀¹ [ ln( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) ) + I(f)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) ) ] df − (N/2) ln 2π      (2.36)
and

∂ ln p(x; θ)/∂θ_{k_i} = −(N/2) ∫₀¹ [ P_{s_{k_i}}(f)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) ) − I(f)P_{s_{k_i}}(f)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) )² ] df

so that

∂ ln p(x; θ)/∂θ_{k_i} |_{θ=0} = (N/2) ∫₀¹ [ P_{s_{k_i}}(f)/P_B(f) ] [ I(f)/P_B(f) − 1 ] df.      (2.37)
The second derivative is

∂² ln p(x; θ)/∂θ_{k_i}∂θ_{k_j} = −(N/2) ∫₀¹ [ −P_{s_{k_i}}(f)P_{s_{k_j}}(f)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) )² + 2I(f)P_{s_{k_i}}(f)P_{s_{k_j}}(f)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) )³ ] df
and therefore, the (i, j)th element of the Fisher information matrix is

I_{ij}(θ) = −E[ ∂² ln p(x; θ)/∂θ_{k_i}∂θ_{k_j} ].

Moving the expectation inside the integral and using E[I(f)] = Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f), the two terms combine exactly as in Appendix 2D to give

I_{ij}(θ) = (N/2) ∫₀¹ P_{s_{k_i}}(f)P_{s_{k_j}}(f)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f) + P_B(f) )² df

or

I_{ij}(0) = (N/2) ∫₀¹ P_{s_{k_i}}(f)P_{s_{k_j}}(f)/P_B²(f) df.      (2.38)
When discretized over the band 0 ≤ f ≤ 1/2, (2.36), (2.37) and (2.38) become

ln p(x; θ) = −(N/2) Σ_{k=1}^{N_f} [ ln( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f_k) + P_B(f_k) ) + I(f_k)/( Σ_{i=1}^{K} θ_{k_i} P_{s_{k_i}}(f_k) + P_B(f_k) ) ] Δf − (N/2) ln 2π      (2.39)

∂ ln p(x; θ)/∂θ_{k_i} |_{θ=0} = (N/2) Σ_{k=1}^{N_f} [ P_{s_{k_i}}(f_k)/P_B(f_k) ] [ I(f_k)/P_B(f_k) − 1 ] Δf      (2.40)

I_{ij}(0) = (N/2) Σ_{k=1}^{N_f} P_{s_{k_i}}(f_k)P_{s_{k_j}}(f_k)/P_B²(f_k) Δf.      (2.41)
Now we have the MLE of θ from (2.35), (2.40) and (2.41). The asymptotic likelihood function approach then substitutes θ̂ for θ in (2.39) and chooses the hypothesis with the largest likelihood.
One important issue is our assumption that θ_{k_i} ≥ 0 for i = 1, 2, ..., K. The MLE of θ without these nonnegativity constraints may produce negative solutions; however, this is easily resolved by the Kuhn-Tucker conditions.

From the Kuhn-Tucker conditions, we know that if the MLE without the nonnegativity constraints has negative components, then the MLE under these constraints will have at least one θ_{k_i} = 0 [25]. The hypothesis is then reduced to at most a (K − 1)th order model. Any other Kth order hypothesis that contains the same chemical signatures as the reduced (K − 1)th order hypothesis plus one arbitrary additional signature has a likelihood no less than that of the reduced (K − 1)th order hypothesis. For example, consider a hypothesis H_1 with chemical signatures P_{s_1}(f), P_{s_2}(f), ..., P_{s_K}(f). If the unconstrained MLE of θ has negative components, then by the Kuhn-Tucker conditions the MLE under the nonnegativity constraints has at least one θ_i = 0, say θ_1 = 0. Thus, any other hypothesis that includes P_{s_2}(f), P_{s_3}(f), ..., P_{s_K}(f) and one other chemical signature has a likelihood no less than that of H_1, since we could at least attain the same likelihood by reusing the constrained MLE of H_1. This argument implies that we can ignore any hypothesis whose unconstrained MLE has a negative component; the greatest likelihood must correspond to a hypothesis with a nonnegative unconstrained MLE.
Since there are as many as C(M, K) candidate hypotheses, we need not be concerned with the case in which every hypothesis has at least one negative component in its unconstrained MLE; in that case we may conclude that fewer than K chemicals are present, and we should decrease the value of K.
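The core computation of this appendix, (2.35) with the discretized gradient (2.40) and Fisher matrix (2.41), can be sketched compactly. This is our own minimal illustration (function names ours); per the Kuhn-Tucker argument above, a hypothesis whose estimate has a negative component would simply be discarded.

```python
import numpy as np

def asymptotic_mle(I, sigs, PB, N):
    """theta_hat = I(0)^{-1} * gradient, per (2.35), (2.40), (2.41)."""
    df = 1.0 / N                                  # frequency spacing Δf
    S = np.asarray(sigs) / PB                     # rows: P_sk(f_k)/P_B(f_k)
    g = 0.5 * N * df * (S @ (I / PB - 1.0))       # gradient (2.40) at θ = 0
    F = 0.5 * N * df * (S @ S.T)                  # Fisher matrix (2.41)
    return np.linalg.solve(F, g)

def asymptotic_loglike(I, sigs, PB, theta, N):
    """Asymptotic log-likelihood (2.39) for a candidate hypothesis."""
    df = 1.0 / N
    P = np.asarray(sigs).T @ theta + PB           # modeled PSD
    return -0.5 * N * df * np.sum(np.log(P) + I / P) - 0.5 * N * np.log(2 * np.pi)
```

Each candidate hypothesis is scored by evaluating (2.39) at its own θ̂, and the hypothesis with the largest likelihood (and a nonnegative θ̂) is chosen.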
List of References
[1] K. Kneipp, H. Kneipp, I. Itzkan, R. Dasari, and M. Feld, "Ultrasensitive chemical analysis by Raman spectroscopy," Chemical Reviews, vol. 99, pp. 2957-2975, 1999.

[2] R. Frost, D. Henry, and K. Erickson, "Raman spectroscopic detection of wyartite in the presence of rabejacite," Journal of Raman Spectroscopy, vol. 35, pp. 255-260, 2004.

[3] N. Hayazawa, M. Motohashi, Y. Saito, and S. Kawata, "Highly sensitive strain detection in strained silicon by surface-enhanced Raman spectroscopy," Applied Physics Letters, vol. 86, pp. 263114-1-263114-3, 2005.

[4] A. Portnov, S. Rosenwaks, and I. Bar, "Detection of particles of explosives via backward coherent anti-Stokes Raman spectroscopy," Applied Physics Letters, vol. 93, pp. 041115-1-041115-3, 2008.

[5] D. Manolakis, D. Marden, and G. Shaw, "Hyperspectral image processing for automatic target detection applications," Lincoln Laboratory Journal, vol. 14, no. 1, pp. 79-116, 2003.

[6] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.

[7] L. Scharf and B. Friedlander, "Matched subspace detectors," IEEE Trans. Signal Process., vol. 42, no. 8, pp. 2146-2157, Aug. 1994.

[8] W. Wang and T. Adali, "Constrained ICA and its application to Raman spectroscopy," in Proc. Antennas and Propagation Society International Symposium, Jul. 2005, pp. 109-112.

[9] W. Wang, T. Adali, and D. Emge, "Unsupervised detection using canonical correlation analysis and its application to Raman spectroscopy," in Proc. IEEE Workshop on Machine Learning for Signal Processing, Aug. 2007.

[10] W. Wang, T. Adali, and D. Emge, "Subspace partitioning for target detection and identification," IEEE Trans. Signal Process., vol. 57, no. 4, pp. 1250-1259, Apr. 2009.

[11] M. Alam, M. Nazrul Islam, A. Bal, and M. Karim, "Hyperspectral target detection using Gaussian filter and post-processing," Optics and Lasers in Engineering, vol. 46, pp. 817-822, Nov. 2008.

[12] T. Chyba, N. Higdon, W. Armstrong, C. Lobb, P. Ponsardin, D. Richter, B. Kelly, Q. Bui, R. Babnick, M. Boysworth, A. Sedlacek, and S. Christesen, "Field tests of the laser interrogation of surface agents (LISA) system for on-the-move standoff sensing of chemical agents," in Proc. Int. Symp. Spectral Sensing Research, 2003.

[13] S. Kay, C. Xu, and D. Emge, "Chemical detection and classification in Raman spectra," in Proceedings of the SPIE, vol. 6969, Mar. 2008, pp. 4-12.

[14] W. Knight, R. Pridham, and S. Kay, "Digital signal processing for sonar," Proceedings of the IEEE, Nov. 1981, pp. 1451-1506.

[15] R. Wiley, ELINT: The Interception and Analysis of Radar Signals. Boston, MA: Artech House, 2006.

[16] S. Kay, Modern Spectral Estimation: Theory and Application. Englewood Cliffs, NJ: Prentice-Hall, 1988.

[17] D. Bowyer, P. Rajasekaran, and W. Gebhart, "Adaptive clutter filtering using autoregressive spectral estimation," IEEE Trans. Aerosp. Electron. Syst., pp. 538-546, Jul. 1979.

[18] S. Kay and J. Salisbury, "Improved active sonar detection using autoregressive prewhiteners," J. Acoustical Soc. of America, pp. 1603-1611, Apr. 1990.

[19] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333-345, Jan. 2005.

[20] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[21] A. Pages-Zamora and M. Lagunas, "New approaches in non-linear signal processing: Estimation of the probability density function by spectral estimation methods," in IEEE Workshop on Higher Order Statistics, 1995.

[22] S. Kay, "Model based probability density function estimation," IEEE Signal Process. Lett., pp. 318-320, Dec. 1998.

[23] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465-471, 1978.

[24] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.

[25] C. Lawson and R. Hanson, Solving Least Squares Problems. Philadelphia, PA: SIAM, 1995.
MANUSCRIPT 3
Sensor Integration for Distributed Detection and Classification
Abstract
We investigate the problem of sensor integration to combine all the available
information in a multi-sensor setting from a statistical standpoint. Specifically, in
this paper, we propose a novel method of constructing the joint probability density
function (PDF) of the measurements from all the sensors based on the exponential
family. This method does not require the knowledge of the marginal PDFs and
hence is useful in many practical cases. We prove that our method is asymptotically
optimal in Kullback-Leibler (KL) divergence. Our method requires less informa-
tion compared to existing methods and attains comparable detection/classification
performance.
3.1 Introduction
Distributed systems and information fusion have been widely studied and used in engineering, finance, and scientific research. Applications include radar, sonar, biomedical analysis, stock prediction, weather forecasting, and chemical, biological, radiological, and nuclear (CBRN) detection, to name a few. If the joint probability density functions (PDFs) under each candidate hypothesis are known, we can easily obtain the optimal performance by the Neyman-Pearson rule for detection (binary hypothesis testing) and by the maximum a posteriori probability (MAP) rule for classification (multiple hypothesis testing) [1]. In practice, however, this information may not be available. This usually happens when the dimensionality of the sample space is high and we do not have enough training samples to estimate the joint PDF accurately; the problem is exacerbated by onerous environmental and system constraints in radar and sonar applications. This is also recognized as the "curse of dimensionality" in pattern recognition and machine learning. Hence, it is important to efficiently approximate the unknown joint PDF using limited training data.
the measurements from different sensors are independent [2], [3]. This approach
has been widely used due to its simplicity, since the joint PDF is then the product
of the marginal PDFs. This is also known as the “product rule” in combining
classifiers [4]. In spite of its popularity, the independence assumption may not be
a good one if the measurements are actually correlated. Furthermore, as stated in
[4], the product rule is severe because “it is sufficient for a single recognition engine
to inhibit a particular interpretation by outputting a close to zero probability for
it”. Hence researchers have studied other methods that consider the correlation
among the measurements. However, the problem does not have a unique solution
when the data is non-Gaussian. A copula based framework is proposed in [5], [6]
to construct the joint PDF. The exponentially embedded families (EEFs) are used
in [7] to estimate the joint PDF that is asymptotically closest to the true one in
Kullback-Leibler (KL) divergence.
Note that the above methods all require the knowledge of marginal PDFs.
In this paper, we consider the case when the marginal PDFs are not available or
accurate, which can happen due to a high-dimensional sample space and insuffi-
cient training data. We present a new way of constructing the joint PDF without
the knowledge of marginal PDFs but only a reference PDF. The constructed joint
PDF takes the form of the exponential family and incorporates all the available
information. The maximum likelihood estimator (MLE) [8] of the unknown pa-
rameters can be easily solved based on the properties of the exponential family. It
is shown that the constructed PDF is asymptotically the optimal one in the sense
that it is asymptotically closest to the true PDF in KL divergence. Since there is
no Gaussian distribution assumption on the reference PDF, this method can be
very useful when the underlying distributions are non-Gaussian. We start with the
detection problem, and then extend our method to the classification problem. For
detection, it is shown that under some conditions, our detection statistics are the
same as the clairvoyant generalized likelihood ratio test (GLRT). For classifica-
tion, our classifier also has the same performance as the estimated MAP classifier.
Both the clairvoyant GLRT and the estimated MAP classifier assume that the true
PDFs under each candidate hypothesis are known except for the usual unknown
parameters.
The paper is organized as follows. In Section 3.2, we introduce a distributed
detection/classification problem. In Section 3.3, we construct the joint PDF by an
exponential family and apply it to the problem in Section 3.2. The KL divergence
between the true PDF and the constructed PDF is examined in Section 3.4, and
the result shows that the constructed PDF is asymptotically optimal. Examples
for distributed detection are given in Section 3.5, and examples for distributed
classification are given in Section 3.6. Simulation results to compare the perfor-
mance of our method with existing methods are shown in Section 3.7. In Section
3.8, we draw the conclusions.
3.2 Problem Statement
Consider the distributed detection/classification problem in which we observe the outputs of two sensors, T1(x) and T2(x), which are transformations of the underlying samples x. The samples themselves are unobservable at the central processor, as shown in Figure 3.1. We choose two sensors for simplicity; all the results in this paper are valid for multiple sensors. For detection, we want to distinguish between two hypotheses H0 and H1 based on the outputs of the two sensors, and for classification, we have M candidate hypotheses Hi for i = 1, 2, ..., M.
Assume that we have enough training data T1_i(x)'s and T2_i(x)'s under H0, i.e., when there is no signal present. Hence we have a good estimate of the joint PDF of T1 and T2 under H0 [8], and thus we assume p_{T1,T2}(t1, t2; H0) is completely known. Under H1, or under Hi for i = 1, 2, ..., M, when a signal is present, we may not even have enough training data to estimate the marginal PDFs. This is especially the case in the radar scenario, where the target is present for only a small portion of the time. So our goal is to use the available information to construct an appropriate p_{T1,T2}(t1, t2; H1) under H1 for detection, or p_{T1,T2}(t1, t2; Hi) under each Hi for classification. A simple illustration is shown in Figure 3.1.
Figure 3.1. Distributed detection/classification system with two sensors.
3.3 Joint PDF Construction by Exponential Family and Its Application in Distributed Systems
To start with, we consider the detection problem, where we wish to construct p_{T1,T2}(t1, t2; H1). The result will then be extended to the classification problem. Since p_{T1,T2}(t1, t2; H1) cannot be uniquely specified based on p_{T1,T2}(t1, t2; H0), we need the following reasonable assumptions to construct the joint PDF.

1) Under H1 the signal is small and thus p_{T1,T2}(t1, t2; H1) is close to p_{T1,T2}(t1, t2; H0).

2) p_{T1,T2}(t1, t2; H1) can be parameterized by some signal parameters θ such that

p_{T1,T2}(t1, t2; H1) = p_{T1,T2}(t1, t2; θ)
p_{T1,T2}(t1, t2; H0) = p_{T1,T2}(t1, t2; 0)

Note that since θ represents signal amplitudes, θ ≠ 0 under H1. Therefore, the detection problem is to select between

H0 : θ = 0
H1 : θ ≠ 0
To simplify the notation, let
T =
[T1
T2
]
so that we can write the joint PDF pT1,T2(t1, t2; θ) as pT(t; θ). With the small
signal assumptions, it has been shown in [9] that by using a first order Taylor
expansion on the log-likelihood function ln pT(t; θ) about θ = 0, we can construct
the PDF of T as
pT(t; θ) = exp[
θT t − K(θ) + ln pT(t;0)]
(3.1)
where

K(θ) = ln E_0[ exp(θᵀT) ]      (3.2)

is the cumulant generating function of p_T(t; 0); it normalizes the PDF to integrate to one. Since T is a sufficient statistic for the constructed exponential PDF in (3.1), this PDF incorporates all the information from the two sensors. Note that only p_T(t; 0) is required in (3.1) to construct p_T(t; θ), and it is assumed that p_T(t; 0) is available or can be estimated with reasonable accuracy. Also note that if T1 and T2 are statistically dependent under H0, they will also be dependent under H1.
The next step is to estimate the unknown parameter θ. We resort to the MLE [10] by maximizing (3.1) over θ. Note that K(θ) is convex by Hölder's inequality [11]. Since maximizing (3.1) is equivalent to maximizing θᵀt − K(θ), this becomes a convex optimization problem and many existing methods can be readily utilized [12], [13]. Also, the MLE of θ will satisfy

t = ∂K(θ)/∂θ      (3.3)

When the MLE θ̂ is found, we use p_T(t; θ̂) as our estimated PDF under H1. Hence, similar to the GLRT [1], we decide H1 if

ln [ p_T(t; θ̂) / p_T(t; 0) ] = θ̂ᵀt − K(θ̂) > τ      (3.4)

where τ is a threshold. We will show in the next section that p_T(t; θ̂) is asymptotically optimal in the sense of KL divergence.
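The construction and test of (3.1)-(3.4) can be illustrated with a toy numerical sketch. All names here are ours, and the correlated-Gaussian reference PDF is only an assumption chosen so the answer can be checked in closed form; in practice T0 would be training samples of (T1, T2) recorded under H0, and no Gaussian assumption is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference samples of T = (T1, T2) under H0 (assumed correlated Gaussian
# here only for checkability; any distribution we can sample works).
cov0 = np.array([[1.0, 0.5], [0.5, 1.0]])
T0 = rng.multivariate_normal([0.0, 0.0], cov0, size=20000)

def K_hat(theta):
    """Monte Carlo estimate of the cumulant generating function (3.2)."""
    return np.log(np.mean(np.exp(T0 @ theta)))

def mle_and_glrt(t, steps=500, lr=0.1):
    """Maximize theta^T t - K(theta) by gradient ascent; the gradient is
    t - E_theta[T], estimated by exponentially tilting the H0 samples."""
    theta = np.zeros(2)
    for _ in range(steps):
        w = np.exp(T0 @ theta)                       # tilting weights
        theta = theta + lr * (t - (T0 * w[:, None]).sum(0) / w.sum())
    return theta, theta @ t - K_hat(theta)           # MLE and statistic (3.4)
```

For the Gaussian reference with covariance Σ, the exact answers are θ̂ = Σ⁻¹t and a statistic of tᵀΣ⁻¹t/2, which the Monte Carlo sketch reproduces to within sampling error.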
To extend our method to classification, the above two assumptions are simply modified as

1) The signal is small under each Hi and hence p_{T1,T2}(t1, t2; Hi) is close to p_{T1,T2}(t1, t2; H0).

2) Under each Hi, the joint PDF can be parameterized by some signal parameters θi so that

p_{T1,T2}(t1, t2; Hi) = p_{T1,T2}(t1, t2; θi)
p_{T1,T2}(t1, t2; H0) = p_{T1,T2}(t1, t2; 0)
Similar to (3.1), as shown in [14], we can construct the PDF of T under Hi as

p_T(t; θi) = exp[ θiᵀt − K(θi) + ln p_T(t; 0) ]      (3.5)

where

K(θi) = ln E_0[ exp(θiᵀT) ]      (3.6)
is the cumulant generating function of p_T(t; 0) that normalizes the constructed PDF. When the MLE θ̂i is found by maximizing p_T(t; θi) over θi, we take p_T(t; θ̂i) as our estimate of p_T(t; Hi). Hence, similar to the MAP rule [1], we decide the Hi for which the following is maximum over i:

p(Hi | t) = p_T(t; Hi) p(Hi) / p_T(t) = p_T(t; θ̂i) p(Hi) / p_T(t)      (3.7)
When we assume that the prior probabilities of the candidate hypotheses are equal, i.e., p(H1) = ... = p(HM) = 1/M, p(Hi) cancels and we can equivalently decide the Hi for which the following is maximum over i:

ln [ p_T(t; θ̂i) / p_T(t; 0) ] = θ̂iᵀt − K(θ̂i)      (3.8)
3.4 KL Divergence Between The True PDF and The Constructed PDF
The KL divergence is a non-symmetric measure of the difference between two PDFs. For two PDFs p1 and p0, it is defined as

D(p1 ‖ p0) = ∫ p1(x) ln [ p1(x)/p0(x) ] dx

It is well known that D(p1 ‖ p0) ≥ 0 with equality if and only if p1 = p0 almost everywhere [15]. By Stein's lemma [16], the KL divergence measures the asymptotic performance for detection; an extension to classification has recently been presented in [17]. Next we will show that p_T(t; θ̂) is optimal under both hypotheses. That is, under H0, p_T(t; θ̂) = p_T(t; 0) asymptotically, and under H1, p_T(t; θ̂) is asymptotically the closest PDF to the true one in KL divergence. Similar results and arguments have been shown in [7, 18].
Assume that we observe independent and identically distributed (IID) Ti's with

T_i = \begin{bmatrix} T_{1i} \\ T_{2i} \end{bmatrix}

for i = 1, 2, . . . , M. Shortening the notation, we write pT1,T2,...,TM(t1, t2, . . . , tM; θ) as p(t1, t2, . . . , tM; θ). The constructed PDF extends easily as (see (7.1))

p(t_1, t_2, \ldots, t_M;\theta) = \exp\left[\theta^T \sum_{i=1}^{M} t_i - MK(\theta) + \ln p(t_1, t_2, \ldots, t_M;0)\right] \qquad (3.9)
so we wish to maximize

\frac{1}{M}\ln\frac{p(t_1, t_2, \ldots, t_M;\theta)}{p(t_1, t_2, \ldots, t_M;0)} = \frac{1}{M}\theta^T \sum_{i=1}^{M} t_i - K(\theta) \qquad (3.10)

and θ̂ is found by solving

\frac{1}{M}\sum_{i=1}^{M} t_i = \frac{\partial K(\theta)}{\partial \theta} \qquad (3.11)
Now we consider two cases. First, for the true PDF under H0, by the law of large numbers, it follows that

\frac{1}{M}\sum_{i=1}^{M} t_i \to E_0(t)

as M → ∞. Note that

\left.\frac{\partial K(\theta)}{\partial \theta}\right|_{\theta=0} = E_0(t)

Since the solution of (3.11) is unique, asymptotically we have θ̂ = 0, and hence p(t1, t2, . . . , tM; θ̂) = p(t1, t2, . . . , tM; 0).
Secondly, for the true PDF under H1, by the law of large numbers, it follows that

\frac{1}{M}\sum_{i=1}^{M} t_i \to E_1(t)

as M → ∞. From (3.10), we are asymptotically maximizing

\theta^T E_1(t) - K(\theta) \qquad (3.12)
To avoid confusion, we denote the underlying true PDF under H1 as p(t1, t2, . . . , tM; H1) and our constructed PDF as p(t1, t2, . . . , tM; θ̂). Since from (3.9)

\ln\frac{p(t_1, \ldots, t_M;H_1)}{p(t_1, \ldots, t_M;\hat{\theta})} = -\left(\hat{\theta}^T \sum_{i=1}^{M} t_i - MK(\hat{\theta})\right) + \ln\frac{p(t_1, \ldots, t_M;H_1)}{p(t_1, \ldots, t_M;0)}

the KL divergence between the true PDF and the constructed one is

D\left(p(t_1, \ldots, t_M;H_1)\,\|\,p(t_1, \ldots, t_M;\hat{\theta})\right)
= E_{H_1}\left[-\left(\hat{\theta}^T \sum_{i=1}^{M} t_i - MK(\hat{\theta})\right) + \ln\frac{p(t_1, \ldots, t_M;H_1)}{p(t_1, \ldots, t_M;0)}\right]
= -M\left[\hat{\theta}^T E_1(t) - K(\hat{\theta})\right] + D\left(p(t_1, \ldots, t_M;H_1)\,\|\,p(t_1, \ldots, t_M;0)\right) \qquad (3.13)

Since D(p(t1, . . . , tM; H1)‖p(t1, . . . , tM; 0)) is fixed, D(p(t1, . . . , tM; H1)‖p(t1, . . . , tM; θ̂)) is minimized by maximizing (3.12). This shows that p(t1, . . . , tM; θ̂) is asymptotically the closest to p(t1, . . . , tM; H1) in KL divergence.
3.5 Examples-Distributed Detection

In this section, we compare our method with the clairvoyant GLRT for a specific detection problem. The clairvoyant GLRT, which provides an upper bound on GLRT performance, assumes that we know the true PDF of T under H1 except for the underlying unknown parameters α. It decides H1 if

\ln\frac{p_T(t;\hat{\alpha})}{p_T(t;0)} > \tau \qquad (3.14)
3.5.1 Partially Observed Linear Model with Gaussian Noise

Suppose we have the linear model

x = H\alpha + w \qquad (3.15)

with

H_0: \alpha = 0
H_1: \alpha \neq 0

where x is an N × 1 vector of the underlying unobservable samples, H is an N × p observation matrix with full column rank, α is a p × 1 vector of unknown signal amplitudes, and w is an N × 1 vector of white Gaussian noise samples with known variance σ². We observe two sensor outputs

T_1(x) = H_1^T x
T_2(x) = H_2^T x \qquad (3.16)

where H1 is N × q1 and H2 is N × q2. Note that [H1 H2] does not have to be H. This model is called a partially observed linear model.
Let G = [H1 H2]. We assume that G has full column rank so that there are no perfectly redundant sensor measurements. Then we have

T = \begin{bmatrix} T_1(x) \\ T_2(x) \end{bmatrix} = \begin{bmatrix} H_1^T x \\ H_2^T x \end{bmatrix} = G^T x \qquad (3.17)

Thus, T is also Gaussian and

T \sim \mathcal{N}\left(0, \sigma^2 G^T G\right) \quad \text{under } H_0
Let q = q1 + q2, so that T is q × 1. As a result, we construct the PDF as in (7.1) with

K(\theta) = \ln E_0\left[\exp\left(\theta^T t\right)\right] = \frac{1}{2}\sigma^2\theta^T G^T G\theta \qquad (3.18)
Hence the constructed PDF is

p_T(t;\theta) = \exp\left[\theta^T t - K(\theta) + \ln p_T(t;0)\right]
= \frac{1}{(2\pi\sigma^2)^{q/2}\det^{1/2}(G^T G)}\exp\left(-\frac{t^T(G^T G)^{-1}t}{2\sigma^2}\right)\cdot\exp\left[\theta^T t - \frac{1}{2}\sigma^2\theta^T G^T G\theta\right] \qquad (3.19)

which can be simplified as

T \sim \mathcal{N}\left(\sigma^2 G^T G\theta, \sigma^2 G^T G\right) \quad \text{under } H_1 \qquad (3.20)
Note that θ is the vector of unknown parameters in the constructed PDF, and it is different from the truly unknown parameters α. From (6.7) and (3.18), the MLE of θ satisfies

t = \frac{\partial K(\theta)}{\partial \theta} = \sigma^2 G^T G\theta

So

\hat{\theta} = \frac{1}{\sigma^2}\left(G^T G\right)^{-1} t

and the test statistic becomes

\hat{\theta}^T t - K(\hat{\theta}) = \frac{1}{2\sigma^2}\, t^T\left(G^T G\right)^{-1} t \qquad (3.21)
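The closed form above can be checked numerically. The following is an illustrative sketch with made-up G, t, and σ² (none of these values come from the dissertation): substituting the MLE θ̂ = (1/σ²)(GᵀG)⁻¹t into θᵀt − K(θ) recovers (3.21).

```python
import numpy as np

# Illustrative check (assumed values, not from the text): the MLE
# theta_hat = (1/sigma^2)(G^T G)^{-1} t turns theta^T t - K(theta) into (3.21).
rng = np.random.default_rng(0)
N, q, sigma2 = 20, 3, 2.0
G = rng.standard_normal((N, q))          # full column rank with probability 1
t = rng.standard_normal(q)
GtG = G.T @ G

K = lambda th: 0.5 * sigma2 * th @ GtG @ th   # cumulant gen. function (3.18)
theta_hat = np.linalg.solve(GtG, t) / sigma2  # solves t = sigma^2 G^T G theta

stat_eef = theta_hat @ t - K(theta_hat)
stat_closed = t @ np.linalg.solve(GtG, t) / (2 * sigma2)   # right side of (3.21)
assert np.isclose(stat_eef, stat_closed)
```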
Next we consider the clairvoyant GLRT, that is, the GLRT when we know the true PDF of T under H1 except for the underlying unknown parameters α. It is the suboptimal test obtained by plugging the MLE of α into the true PDF parameterized by α. Since the constructed PDF may not be the true PDF, the clairvoyant GLRT requires more information than our method. From (6.11) we know that

T \sim \mathcal{N}\left(G^T H\alpha, \sigma^2 G^T G\right) \quad \text{under } H_1 \qquad (3.22)

Note that (3.20) is the constructed PDF while (3.22) is the true PDF. In either case we need to estimate θ in (3.20) or α in (3.22) to implement the PDF. We write the true PDF under H1 as pT(t; α). The MLE of α is found by maximizing the true PDF given by (3.22):

\ln\frac{p_T(t;\alpha)}{p_T(t;0)} = -\frac{1}{2\sigma^2}\left(t - G^T H\alpha\right)^T\left(G^T G\right)^{-1}\left(t - G^T H\alpha\right) + \frac{1}{2\sigma^2}\, t^T\left(G^T G\right)^{-1} t
If q ≤ p, i.e., the length of t is less than or equal to the length of α, then the MLE α̂ may not be unique. However, since (t − GᵀHα)ᵀ(GᵀG)⁻¹(t − GᵀHα) ≥ 0, we can always find α̂ such that t = GᵀHα̂ and hence (t − GᵀHα̂)ᵀ(GᵀG)⁻¹(t − GᵀHα̂) = 0. Hence the clairvoyant GLRT statistic becomes

\ln\frac{p_T(t;\hat{\alpha})}{p_T(t;0)} = \frac{1}{2\sigma^2}\, t^T\left(G^T G\right)^{-1} t \qquad (3.23)

which is the same as our test statistic (see (6.13)) when q ≤ p.
If q > p, it can be shown that

\hat{\alpha} = \left(H^T G\left(G^T G\right)^{-1} G^T H\right)^{-1} H^T G\left(G^T G\right)^{-1} t

and the clairvoyant GLRT statistic becomes

\ln\frac{p_T(t;\hat{\alpha})}{p_T(t;0)} = \frac{t^T\left(G^T G\right)^{-1} G^T H\left(H^T G\left(G^T G\right)^{-1} G^T H\right)^{-1} H^T G\left(G^T G\right)^{-1} t}{2\sigma^2} \qquad (3.24)
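The q ≤ p equivalence can be checked numerically. This sketch uses assumed dimensions and random matrices (it also assumes GᵀH has full row rank, which holds almost surely here): any α̂ satisfying GᵀHα̂ = t zeroes the first quadratic form, so the clairvoyant statistic (3.23) matches the EEF statistic (3.21).

```python
import numpy as np

# Sketch with assumed values: for q <= p the clairvoyant GLRT statistic
# equals the EEF statistic, since an exact fit G^T H alpha_hat = t exists.
rng = np.random.default_rng(1)
N, p, q, sigma2 = 20, 4, 2, 1.5
H = rng.standard_normal((N, p))
G = rng.standard_normal((N, q))
t = rng.standard_normal(q)
GtG = G.T @ G

M = G.T @ H                                        # q x p with q <= p
alpha_hat = np.linalg.lstsq(M, t, rcond=None)[0]   # min-norm exact solution
resid = t - M @ alpha_hat
assert np.allclose(resid, 0, atol=1e-8)            # residual is zero

stat_clair = (t @ np.linalg.solve(GtG, t)
              - resid @ np.linalg.solve(GtG, resid)) / (2 * sigma2)
stat_ours = t @ np.linalg.solve(GtG, t) / (2 * sigma2)
assert np.isclose(stat_clair, stat_ours)
```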
3.5.2 Partially Observed Linear Model with Gaussian Mixture Noise

The partially observed linear model remains the same as in the previous subsection, except that instead of assuming that w is white Gaussian, we assume that w has a Gaussian mixture distribution with two components, i.e.,

w \sim \pi\mathcal{N}(0, \sigma_1^2 I) + (1 - \pi)\mathcal{N}(0, \sigma_2^2 I) \qquad (3.25)

where π, σ₁² and σ₂² are known (0 < π < 1). The following derivation extends easily to the case w ∼ Σ_{i=1}^{L} π_i N(0, σ_i² I).

Since w has a Gaussian mixture distribution, T = GᵀX is also Gaussian mixture distributed and

T \sim \pi\mathcal{N}(0, \sigma_1^2 G^T G) + (1 - \pi)\mathcal{N}(0, \sigma_2^2 G^T G) \quad \text{under } H_0
So we have

K(\theta) = \ln E_0\left[\exp\left(\theta^T t\right)\right] = \ln\left(\pi e^{\frac{1}{2}\sigma_1^2\theta^T G^T G\theta} + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\theta^T G^T G\theta}\right) \qquad (3.26)
Hence the constructed PDF is

p_T(t;\theta) = \exp\left[\theta^T t - K(\theta) + \ln p_T(t;0)\right]
= \left[\frac{\pi}{(2\pi\sigma_1^2)^{q/2}\det^{1/2}(G^T G)}\exp\left(-\frac{t^T(G^T G)^{-1}t}{2\sigma_1^2}\right) + \frac{1 - \pi}{(2\pi\sigma_2^2)^{q/2}\det^{1/2}(G^T G)}\exp\left(-\frac{t^T(G^T G)^{-1}t}{2\sigma_2^2}\right)\right]
\cdot \exp\left(\theta^T t\right)\Big/\left(\pi e^{\frac{1}{2}\sigma_1^2\theta^T G^T G\theta} + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\theta^T G^T G\theta}\right) \qquad (3.27)
Although this constructed PDF cannot be further simplified, we can still find the MLE by solving

t = \frac{\partial K(\theta)}{\partial \theta} = \frac{\pi e^{\frac{1}{2}\sigma_1^2\theta^T G^T G\theta}\,\sigma_1^2 G^T G\theta + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\theta^T G^T G\theta}\,\sigma_2^2 G^T G\theta}{\pi e^{\frac{1}{2}\sigma_1^2\theta^T G^T G\theta} + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\theta^T G^T G\theta}} \qquad (3.28)
Our test statistic is just

\hat{\theta}^T t - K(\hat{\theta}) = \hat{\theta}^T t - \ln\left(\pi e^{\frac{1}{2}\sigma_1^2\hat{\theta}^T G^T G\hat{\theta}} + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\hat{\theta}^T G^T G\hat{\theta}}\right) \qquad (3.29)

where θ̂ satisfies (3.28). Although no analytical solution of the MLE of θ exists, it can be found using convex optimization techniques [12, 13]. Moreover, an analytical solution exists as ||θ|| → 0. To see this, we will show that

\lim_{\|\theta\|\to 0} \frac{\partial K(\theta)}{\partial \theta}\; ./ \left(\pi\sigma_1^2 G^T G\theta + (1 - \pi)\sigma_2^2 G^T G\theta\right) = \mathbf{1} \qquad (3.30)

where ./ denotes element-by-element division.
To prove (3.30), we have

\lim_{\|\theta\|\to 0}\left(\pi e^{\frac{1}{2}\sigma_1^2\theta^T G^T G\theta} + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\theta^T G^T G\theta}\right) = 1 \qquad (3.31)

and

\lim_{\|\theta\|\to 0}\left(\pi e^{\frac{1}{2}\sigma_1^2\theta^T G^T G\theta}\,\sigma_1^2 G^T G\theta + (1 - \pi)e^{\frac{1}{2}\sigma_2^2\theta^T G^T G\theta}\,\sigma_2^2 G^T G\theta\right) ./ \left(\pi\sigma_1^2 G^T G\theta + (1 - \pi)\sigma_2^2 G^T G\theta\right) = \mathbf{1} \qquad (3.32)

by L'Hôpital's rule. Dividing (3.32) by (3.31) and using (3.28), (3.30) is proved. As a result of (3.28) and (3.30), the MLE of θ satisfies

t = \pi\sigma_1^2 G^T G\hat{\theta} + (1 - \pi)\sigma_2^2 G^T G\hat{\theta}
85
as ||θ̂|| → 0, and θ̂ is easily found as

\hat{\theta} = \frac{1}{\pi\sigma_1^2 + (1 - \pi)\sigma_2^2}\left(G^T G\right)^{-1} t \qquad (3.33)

Since

\lim_{\|\theta\|\to 0} K(\theta)\Big/\left(\frac{1}{2}\pi\sigma_1^2\theta^T G^T G\theta + \frac{1}{2}(1 - \pi)\sigma_2^2\theta^T G^T G\theta\right) = 1

by using L'Hôpital's rule twice, as ||θ̂|| → 0 our test statistic becomes (see (6.15))

\hat{\theta}^T t - \left(\frac{1}{2}\pi\sigma_1^2\hat{\theta}^T G^T G\hat{\theta} + \frac{1}{2}(1 - \pi)\sigma_2^2\hat{\theta}^T G^T G\hat{\theta}\right) = \frac{1}{2\left(\pi\sigma_1^2 + (1 - \pi)\sigma_2^2\right)}\, t^T\left(G^T G\right)^{-1} t \qquad (3.34)
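One practical way to solve (3.28) is a damped fixed-point iteration; this is my own sketch with assumed parameters, not the dissertation's method. Equation (3.28) has the form t = c(θ)GᵀGθ, where c(θ) is a weighted average of σ₁² and σ₂², suggesting the update θ ← (GᵀG)⁻¹t / c(θ); for small t the solution should approach the closed form (3.33).

```python
import numpy as np

# Sketch with assumed parameters: damped fixed point for the MLE in (3.28),
# compared against the small-signal closed form (3.33).
rng = np.random.default_rng(2)
N, q = 20, 3
pi_, s1, s2 = 0.7, 1.0, 4.0              # assumed mixture weight and variances
G = rng.standard_normal((N, q))
GtG = G.T @ G

def mle_theta(t, iters=200):
    u = np.linalg.solve(GtG, t)
    th = u / (pi_ * s1 + (1 - pi_) * s2)     # small-signal starting point
    for _ in range(iters):
        w1 = pi_ * np.exp(0.5 * s1 * th @ GtG @ th)
        w2 = (1 - pi_) * np.exp(0.5 * s2 * th @ GtG @ th)
        c = (w1 * s1 + w2 * s2) / (w1 + w2)  # weighted variance in (3.28)
        th = 0.5 * th + 0.5 * u / c          # damped update
    return th

t_small = 1e-4 * rng.standard_normal(q)      # small-signal regime
th_hat = mle_theta(t_small)
th_closed = np.linalg.solve(GtG, t_small) / (pi_ * s1 + (1 - pi_) * s2)  # (3.33)
assert np.allclose(th_hat, th_closed, rtol=1e-3)
```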
To find the clairvoyant GLRT statistic, we know that under H1 the true PDF is

p_T(t;\alpha) = \frac{\pi}{(2\pi)^{q/2}\det^{1/2}(\sigma_1^2 G^T G)}\exp\left[-\frac{1}{2\sigma_1^2}(t - G^T H\alpha)^T\left(G^T G\right)^{-1}(t - G^T H\alpha)\right]
+ \frac{1 - \pi}{(2\pi)^{q/2}\det^{1/2}(\sigma_2^2 G^T G)}\exp\left[-\frac{1}{2\sigma_2^2}(t - G^T H\alpha)^T\left(G^T G\right)^{-1}(t - G^T H\alpha)\right] \qquad (3.35)
Note the difference between (3.27) and (3.35): (3.27) is the constructed PDF while (3.35) is the true PDF. The MLE of α is found by maximizing (3.35) over α.

When q ≤ p, the MLE of α may not be unique but satisfies t = GᵀHα̂. As a result, pT(t; α̂) is a constant and the clairvoyant GLRT statistic becomes

-\ln p_T(t;0)

Since pT(t; 0) is decreasing as tᵀ(GᵀG)⁻¹t increases, the clairvoyant GLRT statistic is equivalent to

t^T\left(G^T G\right)^{-1} t \qquad (3.36)

which is the same as our test statistic (up to a positive scale factor) as ||θ̂|| → 0 (see (6.13)).
When q > p, it can be shown that

\hat{\alpha} = \left(H^T G\left(G^T G\right)^{-1} G^T H\right)^{-1} H^T G\left(G^T G\right)^{-1} t

and the clairvoyant GLRT statistic becomes

\frac{\pi}{(\sigma_1^2)^{q/2}}\exp\left[-\frac{1}{2\sigma_1^2}(t - G^T H\hat{\alpha})^T\left(G^T G\right)^{-1}(t - G^T H\hat{\alpha})\right]
+ \frac{1 - \pi}{(\sigma_2^2)^{q/2}}\exp\left[-\frac{1}{2\sigma_2^2}(t - G^T H\hat{\alpha})^T\left(G^T G\right)^{-1}(t - G^T H\hat{\alpha})\right] \qquad (3.37)
Note that the noise in (6.14) is uncorrelated but not independent. We next consider a more general case in which the noise can be correlated, with Gaussian mixture

w \sim \pi\mathcal{N}(0, C_1) + (1 - \pi)\mathcal{N}(0, C_2) \qquad (3.38)

It can be shown that, similar to (6.15), our test statistic is

\hat{\theta}^T t - \ln\left(\pi e^{\frac{1}{2}\hat{\theta}^T G^T C_1 G\hat{\theta}} + (1 - \pi)e^{\frac{1}{2}\hat{\theta}^T G^T C_2 G\hat{\theta}}\right) \qquad (3.39)

and the clairvoyant GLRT statistic is

-\ln\left(\frac{\pi}{\det^{1/2}(C_1)}\exp\left[-\frac{1}{2}\, t^T\left(G^T C_1 G\right)^{-1} t\right] + \frac{1 - \pi}{\det^{1/2}(C_2)}\exp\left[-\frac{1}{2}\, t^T\left(G^T C_2 G\right)^{-1} t\right]\right) \qquad (3.40)

when q ≤ p.
When q > p, the MLE of α is not in closed form, and hence we write the clairvoyant GLRT statistic as

\max_{\alpha}\left[\frac{\pi}{\det^{1/2}(G^T C_1 G)}\exp\left[-\frac{1}{2}\left(t - G^T H\alpha\right)^T\left(G^T C_1 G\right)^{-1}\left(t - G^T H\alpha\right)\right] + \frac{1 - \pi}{\det^{1/2}(G^T C_2 G)}\exp\left[-\frac{1}{2}\left(t - G^T H\alpha\right)^T\left(G^T C_2 G\right)^{-1}\left(t - G^T H\alpha\right)\right]\right] \qquad (3.41)
Table 3.1. Comparison of our test statistic and the clairvoyant GLRT

Our Method:
- Gaussian noise: tᵀ(GᵀG)⁻¹t
- Uncorrelated non-Gaussian noise: max_θ [θᵀt − ln(π e^{½σ₁²θᵀGᵀGθ} + (1 − π) e^{½σ₂²θᵀGᵀGθ})]
- Correlated non-Gaussian noise: max_θ [θᵀt − ln(π e^{½θᵀGᵀC₁Gθ} + (1 − π) e^{½θᵀGᵀC₂Gθ})]

Clairvoyant GLRT (q ≤ p):
- Gaussian noise: tᵀ(GᵀG)⁻¹t
- Uncorrelated non-Gaussian noise: tᵀ(GᵀG)⁻¹t
- Correlated non-Gaussian noise: −ln(π det^{−1/2}(C₁) exp[−½ tᵀ(GᵀC₁G)⁻¹t] + (1 − π) det^{−1/2}(C₂) exp[−½ tᵀ(GᵀC₂G)⁻¹t])

Clairvoyant GLRT (q > p):
- Gaussian noise: tᵀ(GᵀG)⁻¹GᵀH(HᵀG(GᵀG)⁻¹GᵀH)⁻¹HᵀG(GᵀG)⁻¹t
- Uncorrelated non-Gaussian noise: π(σ₁²)^{−q/2} exp[−(1/2σ₁²)(t − GᵀHα̂)ᵀ(GᵀG)⁻¹(t − GᵀHα̂)] + (1 − π)(σ₂²)^{−q/2} exp[−(1/2σ₂²)(t − GᵀHα̂)ᵀ(GᵀG)⁻¹(t − GᵀHα̂)]
- Correlated non-Gaussian noise: max_α [π det^{−1/2}(GᵀC₁G) exp[−½(t − GᵀHα)ᵀ(GᵀC₁G)⁻¹(t − GᵀHα)] + (1 − π) det^{−1/2}(GᵀC₂G) exp[−½(t − GᵀHα)ᵀ(GᵀC₂G)⁻¹(t − GᵀHα)]]
3.5.3 Summary
We have considered a partially observed linear model with both Gaussian and
non-Gaussian noise. Table 3.1 compares our test statistic with the clairvoyant
GLRT.
1) In Gaussian noise, w ∼ N(0, σ²I), the test statistics are exactly the same for q ≤ p.

2) In uncorrelated non-Gaussian noise, w ∼ πN(0, σ₁²I) + (1 − π)N(0, σ₂²I), the test statistics are the same as θ̂ → 0 for q ≤ p.

3) In correlated non-Gaussian noise, w ∼ πN(0, C₁) + (1 − π)N(0, C₂), although we cannot show the equivalence of the two test statistics, we will see in Section 3.7 that their performances appear to be the same.
3.6 Examples-Distributed Classification

In this section, we compare our method with the estimated MAP classifier for some classification problems. The estimated MAP classifier assumes that the PDF of T under Hi is known except for some unknown underlying parameters αi. We assume equal prior probabilities for the candidate hypotheses, i.e., p(H1) = · · · = p(HM) = 1/M. So the estimated MAP classifier reduces to the estimated maximum likelihood classifier [1], which finds the MLE of αi and chooses Hi for which the following is maximum over i:

p_T(t;\hat{\alpha}_i) \qquad (3.42)

where α̂i is the MLE of αi.
3.6.1 Linear Model with Known Variance

Consider the following classification model:

H_i: x = A_i s_i + w \qquad (3.43)

where si is an N × 1 known signal vector with the same length as x, Ai is the unknown signal amplitude, and w is white Gaussian noise with known variance σ². Assume that instead of observing x, we can only observe the measurements of two sensors

T_1 = H_1^T x
T_2 = H_2^T x \qquad (3.44)

where H1 is N × q1 and H2 is N × q2. Here q1 and q2 are the lengths of the vectors T1 and T2, respectively. We can write (7.7) as

T = G^T x \qquad (3.45)

by letting

T = \begin{bmatrix} T_1 \\ T_2 \end{bmatrix} \quad \text{and} \quad G = [H_1 \ H_2]
where G is N × (q1 + q2) with q1 + q2 ≤ N . We assume that G has full column
rank so that there are no perfectly redundant measurements of the sensors. Note
that G can be any matrix with full column rank.
Let H0 be the reference hypothesis when there is noise only, i.e.,

H_0: x = w \qquad (3.46)

Since x is Gaussian under H0, according to (7.8), T is also Gaussian and T ∼ N(0, σ²GᵀG) under H0. We construct the PDF under Hi as in (7.1) with

K(\theta_i) = \ln E_0\left[\exp\left(\theta_i^T T\right)\right] = \frac{1}{2}\sigma^2\theta_i^T G^T G\theta_i \qquad (3.47)
Hence the constructed PDF is

p_T(t;\theta_i) = \exp\left[\theta_i^T t - K(\theta_i) + \ln p_T(t;0)\right]
= \frac{1}{(2\pi\sigma^2)^{(q_1+q_2)/2}\det^{1/2}(G^T G)}\exp\left(-\frac{t^T(G^T G)^{-1}t}{2\sigma^2}\right)\cdot\exp\left[\theta_i^T t - \frac{1}{2}\sigma^2\theta_i^T G^T G\theta_i\right] \qquad (3.48)

which can be simplified as

T \sim \mathcal{N}\left(\sigma^2 G^T G\theta_i, \sigma^2 G^T G\right) \quad \text{under } H_i \qquad (3.49)
The next step is to find the MLE of θi, which is obtained by maximizing θiᵀt − K(θi) over θi. If this optimization were carried out without any constraints, then θ̂i would be the same for all i. Hence we need some implicit constraints in finding the MLE. Since θi represents the signal under Hi, we should have

\theta_i = A_i G^T s_i = E_{H_i}(T) \qquad (3.50)

which is the mean of T under Hi. As a result, (7.12) can be written as

T \sim \mathcal{N}\left(\sigma^2 A_i G^T G G^T s_i, \sigma^2 G^T G\right) \quad \text{under } H_i \qquad (3.51)

Thus, instead of finding the MLE of θi by maximizing

\theta_i^T t - K(\theta_i) = \theta_i^T t - \frac{1}{2}\sigma^2\theta_i^T G^T G\theta_i \qquad (3.52)

with the constraint in (7.13), we can find the MLE of Ai in (7.14) (since si is assumed known) and then plug it into (7.13). It can be shown that

\hat{A}_i = \frac{s_i^T G t}{\sigma^2 s_i^T G G^T G G^T s_i} \qquad (3.53)

and

\hat{\theta}_i = \frac{G^T s_i s_i^T G t}{\sigma^2 s_i^T G G^T G G^T s_i} \qquad (3.54)

Hence, dropping constant factors, the test statistic of our classifier for Hi is

\frac{\left(s_i^T G t\right)^2}{\left(G^T s_i\right)^T G^T G\left(G^T s_i\right)} \qquad (3.55)
according to (3.8).
Next we consider the estimated MAP classifier. In this case, we assume that we know the true PDF except for Ai:

T \sim \mathcal{N}\left(A_i G^T s_i, \sigma^2 G^T G\right) \quad \text{under } H_i \qquad (3.56)

Note that (7.19) is the true PDF of T under Hi and (7.14) is the constructed PDF. It can be shown that the MLE of Ai in the true PDF under Hi is

\hat{A}_i = \frac{s_i^T G\left(G^T G\right)^{-1} t}{s_i^T G\left(G^T G\right)^{-1} G^T s_i} \qquad (3.57)

Dropping constant terms, the test statistic of the estimated MAP classifier for Hi is

\frac{\left(s_i^T G\left(G^T G\right)^{-1} t\right)^2}{\left(G^T s_i\right)^T\left(G^T G\right)^{-1}\left(G^T s_i\right)} \qquad (3.58)

according to (3.42). Note that (7.16) and (7.20) are different because (7.16) is the MLE of Ai under the constructed PDF and (7.20) is the MLE of Ai under the true PDF. Also note that if GᵀG is a scaled identity matrix, the test statistics in (7.18) and (7.21) are equivalent, and hence our method coincides with the estimated MAP classifier.
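The scaled-identity equivalence is easy to confirm numerically. This illustrative sketch builds a G with orthonormal columns scaled by a constant (my own construction, not from the text), so that GᵀG = cI and the statistics (3.55) and (3.58) coincide for every si.

```python
import numpy as np

# Sketch under the stated assumption G^T G = c I (here c = 4): the statistics
# (3.55) and (3.58) are then equal, so the two classifiers pick the same H_i.
rng = np.random.default_rng(3)
N, M = 24, 3
Q, _ = np.linalg.qr(rng.standard_normal((N, 4)))
G = 2.0 * Q                              # orthonormal columns scaled: G^T G = 4 I
GtG = G.T @ G
s = [rng.standard_normal(N) for _ in range(M)]
t = G.T @ rng.standard_normal(N)         # an observed (stacked) sensor output

def stat_ours(si):                       # (3.55); note s_i^T G t = (G^T s_i)^T t
    g = G.T @ si
    return (g @ t) ** 2 / (g @ GtG @ g)

def stat_map(si):                        # (3.58)
    g = G.T @ si
    return (g @ np.linalg.solve(GtG, t)) ** 2 / (g @ np.linalg.solve(GtG, g))

ours = np.array([stat_ours(si) for si in s])
emap = np.array([stat_map(si) for si in s])
assert np.allclose(ours, emap)           # identical statistics when G^T G = cI
```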
3.6.2 Linear Model with Unknown Variance

To extend the above example, we consider the same linear model with unknown noise variance σ². As shown in (7.14), the constructed PDF is still

T \sim \mathcal{N}\left(\sigma^2 A_i G^T G G^T s_i, \sigma^2 G^T G\right) \quad \text{under } H_i \qquad (3.59)

except that σ² is unknown. Letting Bi = σ²Ai, we have

T \sim \mathcal{N}\left(B_i G^T G G^T s_i, \sigma^2 G^T G\right) \quad \text{under } H_i \qquad (3.60)

Instead of finding the MLEs of Ai and σ², we can equivalently find the MLEs of Bi and σ². Let hi = GᵀGGᵀsi and C = GᵀG. It can be shown that

\hat{B}_i = \left(h_i^T C^{-1} h_i\right)^{-1} h_i^T C^{-1} t \qquad (3.61)

and

\hat{\sigma}^2 = \frac{1}{q_1 + q_2}\left(t - h_i\hat{B}_i\right)^T C^{-1}\left(t - h_i\hat{B}_i\right) \qquad (3.62)

Dropping constant factors, it can also be shown that the test statistic is equivalent to

\frac{t^T C^{-1} h_i\left(h_i^T C^{-1} h_i\right)^{-1} h_i^T C^{-1} t}{t^T\left[C^{-1} - C^{-1} h_i\left(h_i^T C^{-1} h_i\right)^{-1} h_i^T C^{-1}\right] t} \qquad (3.63)

Next we consider the estimated MAP classifier. The true PDF is still

T \sim \mathcal{N}\left(A_i G^T s_i, \sigma^2 G^T G\right) \quad \text{under } H_i \qquad (3.64)
Table 3.2. Comparison of our test statistic and the estimated MAP classifier

Known σ²:
- Our Method: (siᵀGt)² / [(Gᵀsi)ᵀGᵀG(Gᵀsi)]
- Estimated MAP: (siᵀG(GᵀG)⁻¹t)² / [(Gᵀsi)ᵀ(GᵀG)⁻¹(Gᵀsi)]

Unknown σ²:
- Our Method: tᵀC⁻¹hi(hiᵀC⁻¹hi)⁻¹hiᵀC⁻¹t / tᵀ[C⁻¹ − C⁻¹hi(hiᵀC⁻¹hi)⁻¹hiᵀC⁻¹]t
- Estimated MAP: tᵀC⁻¹gi(giᵀC⁻¹gi)⁻¹giᵀC⁻¹t / tᵀ[C⁻¹ − C⁻¹gi(giᵀC⁻¹gi)⁻¹giᵀC⁻¹]t

where hi = GᵀGGᵀsi, gi = Gᵀsi, and C = GᵀG.
but with unknown Ai and σ². Let gi = Gᵀsi and C = GᵀG. Similar to (3.61), (3.62) and (3.63), it can be shown that

\hat{A}_i = \left(g_i^T C^{-1} g_i\right)^{-1} g_i^T C^{-1} t \qquad (3.65)

\hat{\sigma}^2 = \frac{1}{q_1 + q_2}\left(t - g_i\hat{A}_i\right)^T C^{-1}\left(t - g_i\hat{A}_i\right) \qquad (3.66)

and the test statistic of the estimated MAP classifier is

\frac{t^T C^{-1} g_i\left(g_i^T C^{-1} g_i\right)^{-1} g_i^T C^{-1} t}{t^T\left[C^{-1} - C^{-1} g_i\left(g_i^T C^{-1} g_i\right)^{-1} g_i^T C^{-1}\right] t} \qquad (3.67)
Note that if GTG is a scaled identity matrix, since hi = GTGgi, the test statistics
in (3.63) and (3.67) are equivalent. Hence our method is exactly the same as the
estimated MAP classifier if GTG is a scaled identity matrix.
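The unknown-variance equivalence can also be sketched numerically (again my own construction with GᵀG a scaled identity, not from the text): since hi = (GᵀG)gi = c·gi, and the common statistic below is invariant to scaling its direction vector, (3.63) and (3.67) produce identical values.

```python
import numpy as np

# Sketch under the assumption G^T G = c I (here c = 9): then h_i = c g_i and
# the shared form of (3.63)/(3.67) is scale-invariant in its direction vector.
rng = np.random.default_rng(4)
N = 24
Q, _ = np.linalg.qr(rng.standard_normal((N, 4)))
G = 3.0 * Q                              # G^T G = 9 I
C = G.T @ G
Cinv = np.linalg.inv(C)
t = G.T @ rng.standard_normal(N)

def stat(v):
    # common form of (3.63)/(3.67) with direction vector v (h_i or g_i)
    num = (t @ Cinv @ v) ** 2 / (v @ Cinv @ v)
    return num / (t @ Cinv @ t - num)

si = rng.standard_normal(N)
gi = G.T @ si
hi = C @ gi                              # h_i = G^T G g_i = 9 g_i here
assert np.isclose(stat(hi), stat(gi))
```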
3.6.3 Summary

We have considered a linear model with both known and unknown noise variance. Table 3.2 compares our test statistic with the estimated MAP classifier. If GᵀG is a scaled identity matrix, our method and the estimated MAP classifier are identical. Note that this is the case when all the columns of G are orthogonal and have the same power, as in the demodulation of M-ary orthogonal signals in communication theory.
3.7 Simulations

3.7.1 Distributed Detection
Since our test statistic coincides with the clairvoyant GLRT under Gaussian noise for q ≤ p, as shown in Subsection 3.5.1, we only compare the performances under non-Gaussian noise (both uncorrelated noise as in (6.14) and correlated noise as in (6.19)). Consider the model
x[n] = A_1 + A_2 r^n + A_3\cos(2\pi f n + \phi) + w[n] \qquad (3.68)

for n = 0, 1, . . . , N − 1 with known base r ∈ (0, 1) and frequency f but unknown amplitudes A1, A2, A3 and phase φ. This is a linear model as in (6.9) with

H = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 1 & r & \cos(2\pi f) & \sin(2\pi f) \\ \vdots & \vdots & \vdots & \vdots \\ 1 & r^{N-1} & \cos(2\pi f(N-1)) & \sin(2\pi f(N-1)) \end{bmatrix}

and α = [A₁  A₂  A₃cos φ  −A₃sin φ]ᵀ.
Let w have an uncorrelated Gaussian mixture distribution as in (6.14). For the partially observed linear model, we observe two sensor outputs as in (6.10). We compare the GLRT in (6.15) with the clairvoyant GLRT in (6.18). Note that the MLE of θ in (6.15) is found numerically, not by the asymptotic approximation in (6.16). In the simulation, we use N = 20, A1 = 2, A2 = 3, A3 = 4, φ = π/4, r = 0.95, f = 0.34, π = 0.9, σ₁² = 50, σ₂² = 500, and H1 and H2 are the first and third columns of H, respectively, i.e., H1 = [1 1 . . . 1]ᵀ, H2 = [1 cos(2πf) . . . cos(2πf(N − 1))]ᵀ. Hence, only the DC level is sensed by one sensor and the in-phase component of the sinusoid by the other. As shown in Figure 3.2, the performances are almost the same, which justifies their equivalence under the small signal assumption shown in Section 3.5.
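A comparison of this kind can be sketched with a small Monte Carlo experiment. The parameters below are my own (smaller) choices, not the dissertation's: two scalar sensors, uncorrelated Gaussian mixture noise, the EEF statistic (3.29) maximized via a damped fixed point on (3.28), and, since q ≤ p here, the clairvoyant GLRT reduced to tᵀ(GᵀG)⁻¹t.

```python
import numpy as np

# Monte Carlo sketch with assumed parameters: both detectors should achieve
# high AUC at this (assumed) strong signal level.
rng = np.random.default_rng(5)
N, trials, f = 20, 400, 0.34
H = np.column_stack([np.ones(N), np.cos(2 * np.pi * f * np.arange(N))])
alpha = np.array([2.0, 3.0])
G = H                                   # each sensor observes one column of H
GtG = G.T @ G
pi_, s1, s2 = 0.9, 1.0, 10.0            # assumed mixture parameters

def noise(shape):
    comp = rng.random(shape[0]) < pi_   # one mixture component per trial vector
    sd = np.where(comp, np.sqrt(s1), np.sqrt(s2))
    return sd[:, None] * rng.standard_normal(shape)

def eef_stat(t):
    u = np.linalg.solve(GtG, t)
    th = u / (pi_ * s1 + (1 - pi_) * s2)
    for _ in range(100):                # damped fixed point for (3.28)
        q1 = 0.5 * s1 * th @ GtG @ th
        q2 = 0.5 * s2 * th @ GtG @ th
        m = max(q1, q2)                 # stabilize the exponentials
        w1, w2 = pi_ * np.exp(q1 - m), (1 - pi_) * np.exp(q2 - m)
        th = 0.5 * th + 0.5 * u * (w1 + w2) / (w1 * s1 + w2 * s2)
    return th @ t - np.logaddexp(np.log(pi_) + 0.5 * s1 * th @ GtG @ th,
                                 np.log(1 - pi_) + 0.5 * s2 * th @ GtG @ th)

T0 = noise((trials, N)) @ G                       # H0: noise only
T1 = (H @ alpha + noise((trials, N))) @ G         # H1: signal present
clair = lambda t: t @ np.linalg.solve(GtG, t)     # clairvoyant, q <= p
aucs = {}
for name, stat in [("ours", eef_stat), ("clairvoyant", clair)]:
    s0 = np.array([stat(t) for t in T0])
    s1_ = np.array([stat(t) for t in T1])
    aucs[name] = (s1_[:, None] > s0[None, :]).mean()
assert aucs["ours"] > 0.8 and aucs["clairvoyant"] > 0.8
```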
Next, for the same model in (6.22), let w have a correlated Gaussian mixture distribution as in (6.19). We compare the performance of the GLRT using the constructed PDF as in (6.20) with the clairvoyant GLRT as in (6.21). We use N = 20, A1 = 3, A2 = 4, A3 = 3, φ = π/7, r = 0.9, f = 0.46, π = 0.7, H1 = [1, 1, . . . , 1]ᵀ, H2 = [1, cos(2πf), . . . , cos(2πf(N − 1))]ᵀ. The covariance matrices C1, C2 are
Figure 3.2. ROC curves (probability of detection versus probability of false alarm) for the GLRT using the constructed PDF and the clairvoyant GLRT with uncorrelated Gaussian mixture noise.
generated using C₁ = R₁ᵀR₁, C₂ = R₂ᵀR₂, where R₁, R₂ are full-rank N × N matrices. As shown in Figure 3.3, the performances are still very similar.
Figure 3.3. ROC curves for the GLRT using the constructed PDF and the clairvoyant GLRT with correlated Gaussian mixture noise.
3.7.2 Distributed Classification

For the model in (7.6),

H_i: x = A_i s_i + w

we first consider a case where GᵀG is approximately a scaled identity matrix. Let A1 = 0.4, A2 = 1.2, A3 = 0.9 and

s_1(n) = \cos(2\pi f_1 n)
s_2(n) = \cos(2\pi f_2 n)
s_3(n) = \cos(2\pi f_3 n)

where n = 0, 1, . . . , N − 1 with N = 25, and f1 = 0.14, f2 = 0.34, f3 = 0.41. Let
p(H1) = p(H2) = p(H3) = 1/3. Assume that there are two sensors, each with an
observation matrix as follows respectively:
H_1 = \begin{bmatrix} 1 & \cos(2\pi f_1) & \cdots & \cos(2\pi f_1(N-1)) \\ 1 & \cos(2\pi f_2) & \cdots & \cos(2\pi f_2(N-1)) \end{bmatrix}^T
H_2 = \begin{bmatrix} 1 & \cos(2\pi f_3) & \cdots & \cos(2\pi f_3(N-1)) \end{bmatrix}^T
We use (7.18) and (7.21) as the test statistics of the two methods, respectively, when σ² is known. The test statistics in (3.63) and (3.67) are used when σ² is unknown. The probabilities of correct classification are plotted versus ln(1/σ²) in Figure 3.4. We see that our method has the same performance as the estimated MAP classifier with known or unknown σ², and the probability of correct classification goes to 1 as σ² → 0.
Next we consider a case where GᵀG is not a scaled identity matrix. Let A1 = 0.5, A2 = 1, A3 = 1 and

s_1(n) = \cos(2\pi f_1 n) + 1
s_2(n) = \cos(2\pi f_2 n) + 0.5
s_3(n) = \cos(2\pi f_3 n)

where n = 0, 1, . . . , N − 1 with N = 20, and f1 = 0.17, f2 = 0.28, f3 = 0.45.
Let p(H1) = p(H2) = p(H3) = 1/3. Assume that there are three sensors (this is
an extension of the two sensor assumption), each with an observation matrix as
Figure 3.4. Probability of correct classification Pc versus ln(1/σ²) for both methods (estimated MAP and our method, with known and unknown σ²).
follows, respectively:

H_1 = \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}^T
H_2 = \begin{bmatrix} 1 & \cos(2\pi f_1) & \cdots & \cos(2\pi f_1(N-1)) \\ 1 & \cos(2\pi f_2) & \cdots & \cos(2\pi f_2(N-1)) \end{bmatrix}^T
H_3 = \begin{bmatrix} 1 & \cos(2\pi(f_3+0.02)) & \cdots & \cos(2\pi(f_3+0.02)(N-1)) \end{bmatrix}^T
Note that in H3 we set the frequency to f3 + 0.02. This models the case when the knowledge of the frequency is not accurate. We again see, in Figure 3.5, that the performances of the two methods are the same with known or unknown σ², and the probability of correct classification goes to 1 as σ² → 0.
3.8 Conclusions

A novel method of constructing the joint PDF of the measurements from a distributed multiple-sensor system has been proposed. Only a reference PDF is needed in the construction. The constructed PDF is asymptotically optimal in KL divergence. The performance of our method has been shown to be as good as that of the clairvoyant GLRT and the estimated MAP classifier for detection and classification, respectively, while less information is needed for our method.
Figure 3.5. Probability of correct classification Pc versus ln(1/σ²) for both methods (estimated MAP and our method, with known and unknown σ²).
List of References

[1] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.

[2] S. Thomopoulos, R. Viswanathan, and D. Bougoulias, "Optimal distributed decision fusion," IEEE Trans. Aerosp. Electron. Syst., vol. 25, pp. 761–765, Sep. 1989.

[3] Z. Chair and P. Varshney, "Optimal data fusion in multiple sensor detection systems," IEEE Trans. Aerosp. Electron. Syst., vol. 22, pp. 98–101, Jan. 1986.

[4] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, pp. 226–239, Mar. 1998.

[5] A. Sundaresan, P. Varshney, and N. Rao, "Distributed detection of a nuclear radioactive source using fusion of correlated decisions," in Proc. 10th International Conference on Information Fusion, 2007, pp. 1–7.

[6] S. Iyengar, P. Varshney, and T. Damarla, "A parametric copula based framework for multimodal signal processing," in Proc. ICASSP, 2009, pp. 1893–1896.

[7] S. Kay and Q. Ding, "Exponentially embedded families for multimodal sensor processing," in Proc. ICASSP, Mar. 2010, pp. 3770–3773.

[8] S. Kay, A. Nuttall, and P. Baggenstoss, "Multidimensional probability density function approximations for detection, classification, and model order selection," IEEE Trans. Signal Process., vol. 49, pp. 2240–2252, Oct. 2001.

[9] S. Kay, Q. Ding, and D. Emge, "Joint PDF construction for sensor fusion and distributed detection," in Proc. International Conference on Information Fusion, Jun. 2010.

[10] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[11] L. Brown, Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, 1986.

[12] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[13] D. Luenberger, Linear and Nonlinear Programming, 2nd ed. Springer, 2003.

[14] S. Kay, Q. Ding, and M. Rangaswamy, "Sensor integration for classification," in Proc. Asilomar Conference on Signals, Systems, and Computers, Nov. 2010.

[15] S. Kullback, Information Theory and Statistics, 2nd ed. Courier Dover Publications, 1997.

[16] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed. John Wiley and Sons, 2006.

[17] M. Westover, "Asymptotic geometry of multiple hypothesis testing," IEEE Trans. Inf. Theory, vol. 54, no. 7, pp. 3327–3329, Jul. 2008.

[18] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333–345, Jan. 2005.
MANUSCRIPT 4

Maximum Likelihood Estimator under Misspecified Model with High Signal-to-Noise Ratio

Abstract

It is well known that the maximum likelihood estimator (MLE) under a misspecified model converges to a well defined limit, and that it is asymptotically Gaussian as the sample size goes to infinity. In this correspondence, we fully characterize the asymptotic performance of the MLE under a misspecified model at high signal-to-noise ratio (SNR). We show that, under some regularity conditions, it converges to a well defined limit and is asymptotically Gaussian at high SNR.
4.1 Introduction
In estimating unknown parameters, the most popular method is the maximum likelihood estimator (MLE). One important reason is that the MLE is asymptotically optimal in that it approximates the minimum variance unbiased (MVU)
estimator for large data records [1]. This is the case when the number of samples
goes to infinity. Another asymptotic case is when the signal-to-noise ratio (SNR)
goes to infinity, i.e., the number of samples is fixed with decreasing noise variance.
The asymptotic efficiency and Gaussianity of the MLE with high SNR have re-
cently been proved in [2]. Hence, under some regularity conditions, the MLE at
high SNR has similar performance to the large sample size case.
The above results are all based on the assumption that the model is cor-
rectly specified. However, we may have a misspecified model in practice, i.e., the
samples are generated from a distribution which cannot be parameterized by the
assumed model. In this case, the MLE under a misspecified model is called the
quasi-maximum likelihood estimator (QMLE). Thus, it is natural to consider the
properties of the QMLE. Thanks to White’s fundamental result in [3], the asymp-
totic performance of the QMLE as the sample size goes to infinity is well known
in both the statistics and signal processing communities. It is proved in [3] that
the QMLE converges to a limit which minimizes the Kullback-Leibler (KL) diver-
gence between the true probability density function (PDF) and the misspecified
PDF, and it is asymptotically Gaussian for large data records. Note that the KL
divergence is a non-symmetric measure of difference between two PDFs. For two
PDFs p1 and p0, it is defined as

D(p_1\|p_0) = \int p_1(x)\ln\frac{p_1(x)}{p_0(x)}\,dx
It is well known that D (p1 ‖p0 ) ≥ 0 with equality if and only if p1 = p0 almost
everywhere [4]. White’s results have been applied to the problem of estimating
direction of arrival (DOA) with unknown number of signals in [5] and [6] for a
deterministic signal model and stochastic signal model, respectively. Analogous to
the results in [2], it is expected that with high SNR, the QMLE will have similar
performance to White’s results. In this correspondence, we prove that this is true
for a deterministic signal in additive Gaussian noise. A simple misspecified linear
model is presented to illustrate our results. Simulation results are provided to
verify our analysis.
The paper is organized as follows. We start by presenting White’s results [3]
in Section 4.2 so that we can compare our results with his later on. In Section
4.3, we show that the QMLE is asymptotically Gaussian and it converges to a
well defined limit with high SNR. In Section 4.4, we use a misspecified linear
model to illustrate our analysis. Section 4.5 provides some simulation results of the
asymptotic performance of the QMLE. Finally, Section 4.6 offers some conclusions.
4.2 White’s Results: QMLE for Large Data Records
Suppose that we have N independent and identically distributed (IID) sample vectors xn for n = 0, 1, . . . , N − 1. The xn's are generated from a PDF pt(x), which we call the true PDF. For the misspecified model, we assume that the xn's are generated from a PDF p(x; θ) parameterized by some unknown parameters θ. So the QMLE of θ is

\hat{\theta} = \arg\max_{\theta}\sum_{n=0}^{N-1}\ln p(x_n;\theta) \qquad (4.1)

where the xn's are generated from the true PDF pt(x).
Assume that the KL divergence between the true PDF and the misspecified PDF,

D\left(p_t(x)\,\|\,p(x;\theta)\right) = \int_{x} p_t(x)\ln\frac{p_t(x)}{p(x;\theta)}\,dx \qquad (4.2)

has a unique minimum at θ*. Under some regularity conditions, it is proved in [3] that

a) Consistency: θ̂ exists and θ̂ \xrightarrow{a.s.} θ* as N → ∞, where \xrightarrow{a.s.} stands for almost sure convergence.
b) Asymptotic Gaussianity: Define the matrices

[A(\theta)]_{i,j} = E_t\left(\frac{\partial^2\ln p(x;\theta)}{\partial\theta_i\,\partial\theta_j}\right)
[B(\theta)]_{i,j} = E_t\left(\frac{\partial\ln p(x;\theta)}{\partial\theta_i}\,\frac{\partial\ln p(x;\theta)}{\partial\theta_j}\right) \qquad (4.3)
C(\theta) = A(\theta)^{-1}B(\theta)A(\theta)^{-1}

where Et(·) denotes expectation with respect to the true PDF pt(x). Then we have

\sqrt{N}\left(\hat{\theta} - \theta^*\right)\xrightarrow{D}\mathcal{N}(0, C(\theta^*)) \qquad (4.4)

as N → ∞, where \xrightarrow{D} stands for convergence in distribution.

c) If the model is correctly specified, i.e., there exists θ0 such that p(x; θ0) = pt(x), then θ* = θ0 and A(θ0) = −B(θ0), so that

C(\theta_0) = -A(\theta_0)^{-1} = B(\theta_0)^{-1} \qquad (4.5)

where −A(θ0) is the Fisher information matrix.
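White's sandwich covariance can be illustrated with a toy Monte Carlo (my own example, not from the text): true data are Laplace(0, b) but the assumed model is N(θ, 1). Then the QMLE is the sample mean, θ* = 0, A = −1, and B = Var_t(x) = 2b², so C = A⁻¹BA⁻¹ = 2b², which differs from the correct-model value −A⁻¹ = 1.

```python
import numpy as np

# Toy sandwich-covariance check with assumed b, N, reps: the empirical
# variance of sqrt(N) * theta_hat should be near C = 2 b^2 (= 4.5 here),
# not the naive -A^{-1} = 1.
rng = np.random.default_rng(6)
b, N, reps = 1.5, 400, 2000
xbar = rng.laplace(0.0, b, size=(reps, N)).mean(axis=1)  # QMLE per replicate
var_scaled = N * xbar.var()     # empirical variance of sqrt(N) * theta_hat
assert abs(var_scaled - 2 * b * b) < 0.5
```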
We can interpret a) as follows. Since

\frac{1}{N}\sum_{n=0}^{N-1}\ln p(x_n;\theta)\xrightarrow{P} E_t\left(\ln p(x;\theta)\right) \qquad (4.6)

as N → ∞, where \xrightarrow{P} stands for convergence in probability, (1/N)Σ_{n=0}^{N−1} ln p(xn; θ) is a natural estimator of Et(ln p(x; θ)). Note that

D\left(p_t(x)\,\|\,p(x;\theta)\right) = E_t\left(\ln p_t(x)\right) - E_t\left(\ln p(x;\theta)\right) \qquad (4.7)

Hence, the θ* that minimizes D(pt(x)‖p(x; θ)) also maximizes Et(ln p(x; θ)). As θ̂ maximizes (1/N)Σ_{n=0}^{N−1} ln p(xn; θ), we can consider θ̂ a natural estimator of θ* [7].
4.3 QMLE with High SNR

4.3.1 Misspecified Observation Model
Consider the case when the true observation model is

x = s_t + w_1 \qquad (4.8)

where x is a real N × 1 vector of samples, st is the N × 1 true signal vector, and w1 is an N × 1 vector of additive Gaussian noise samples with zero mean and covariance matrix σ²Ct. However, the misspecified model is

x = s(\theta) + w_2 \qquad (4.9)

where the N × 1 signal s(θ) is known except for the unknown p × 1 vector of parameters θ, and w2 is an N × 1 vector of additive Gaussian noise samples with zero mean and covariance matrix σ²C. It is assumed that σ² is unknown and C is known. As a result, the QMLE of θ is found as

\hat{\theta} = \arg\min_{\theta}\left\{(x - s(\theta))^T C^{-1}(x - s(\theta))\right\} \qquad (4.10)

Hence, we will study the performance of the QMLE of θ in (4.10), when x is distributed as in (4.8), as σ² → 0. Note that this real signal model can be easily extended to a complex signal model (see Chapter 15 in [8]).
4.3.2 Performance of QMLE as σ² → 0

The analysis in this subsection is similar to that in [2], where the consistency and asymptotic Gaussianity of the MLE under the correctly specified model with high SNR are proved.

First, we find the θ* that minimizes the KL divergence between the true PDF and the misspecified PDF. We denote the PDFs specified by (4.8) and (4.9) as pt(x) and p(x; θ), respectively. For Gaussian distributions, the KL divergence between the true PDF pt(x) and the misspecified PDF p(x; θ) is [4]

D\left(p_t(x)\,\|\,p(x;\theta)\right) = \frac{1}{2}\ln\frac{\det(C)}{\det(C_t)} + \frac{1}{2}\mathrm{tr}\left(C_t C^{-1}\right) - \frac{N}{2} + \frac{1}{2\sigma^2}(s_t - s(\theta))^T C^{-1}(s_t - s(\theta)) \qquad (4.11)

We assume that D(pt(x)‖p(x; θ)) has a unique minimum at θ*. Since only the last term in (4.11) depends on θ, θ* also minimizes (st − s(θ))ᵀC⁻¹(st − s(θ)). Hence, we write

\theta^* = \arg\min_{\theta}\,(s_t - s(\theta))^T C^{-1}(s_t - s(\theta)) \qquad (4.12)

By setting the gradient with respect to θ to zero, we have

\left.\left(\frac{\partial s(\theta)}{\partial\theta}\right)^T C^{-1}(s_t - s(\theta))\right|_{\theta=\theta^*} = 0 \qquad (4.13)

where

\left[\frac{\partial s(\theta)}{\partial\theta}\right]_{i,j} = \frac{\partial s_i(\theta)}{\partial\theta_j} \quad \text{for } 1 \le i \le N,\ 1 \le j \le p \qquad (4.14)
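For the special case of a linear assumed signal s(θ) = Hθ (my own choice for illustration), (4.12) becomes weighted least squares, so θ* = (HᵀC⁻¹H)⁻¹HᵀC⁻¹st, and (4.13) reduces to the normal-equation orthogonality condition, which can be verified numerically:

```python
import numpy as np

# Sketch with assumed H, C, s_t: the weighted-least-squares theta* satisfies
# the stationarity condition (4.13) with s(theta) = H theta.
rng = np.random.default_rng(7)
N, p = 15, 3
H = rng.standard_normal((N, p))
R = rng.standard_normal((N, N))
C = R @ R.T + N * np.eye(N)              # a positive definite known covariance
Cinv = np.linalg.inv(C)
s_t = rng.standard_normal(N)             # true signal, generally not in range(H)

theta_star = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ s_t)
grad = H.T @ Cinv @ (s_t - H @ theta_star)   # left-hand side of (4.13)
assert np.allclose(grad, 0, atol=1e-8)
```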
Next, we examine the asymptotic performance of the QMLE of θ using the im-
plicit function theorem. Since the QMLE of θ depends on x which is distributed
according to (4.8), we write (4.10) as
\hat{\theta} = \arg\min_{\theta} (s_t - s(\theta) + w_1)^T C^{-1} (s_t - s(\theta) + w_1)    (4.15)
so that θ̂ is an implicit function of w1. The solution of (4.15) is found by setting the gradient of (4.15) with respect to θ to zero. Hence, we need to solve the following p equations:

\left(\frac{\partial s(\theta)}{\partial \theta}\right)^T C^{-1} (s_t - s(\theta) + w_1) = \mathbf{0}    (4.16)

where ∂s(θ)/∂θ is given as in (4.14).
Let f(θ, w1) = [f1(θ, w1) f2(θ, w1) . . . fp(θ, w1)]^T = (∂s(θ)/∂θ)^T C^{-1} (s_t − s(θ) + w1). Note that from (4.13), we have

f(\theta^*, \mathbf{0}) = \mathbf{0}    (4.17)
We further assume that:
Assumption 1): fi(θ, w1) is differentiable in a neighborhood of the point (θ∗, 0) in R^p × R^N for i = 1, 2, . . . , p.

Assumption 2): The p × p Jacobian matrix ∂f(θ, w1)/∂θ of f(θ, w1) with respect to θ is nonsingular at (θ∗, 0).
Then by the implicit function theorem [9], there is a unique mapping ϕ : V → U, where V is a neighborhood of 0 in R^N and U is a neighborhood of θ∗ in R^p, such that

ϕ(0) = θ∗
f(ϕ(w1), w1) = 0   for all w1 ∈ V    (4.18)
Furthermore, we have
ϕ(w1) − θ∗ = −Φ−1Ψ(w1 − 0) + r(w1 − 0) (4.19)
where r(w1) = o(||w1||),

\Phi = \frac{\partial f(\theta, w_1)}{\partial \theta}\Big|_{(\theta^*, \mathbf{0})}    (4.20)

and

\Psi = \frac{\partial f(\theta, w_1)}{\partial w_1}\Big|_{(\theta^*, \mathbf{0})}    (4.21)
Note that (4.18) implies that θ̂ = ϕ(w1) for w1 ∈ V. Hence, from (4.19) we have

\hat{\theta} - \theta^* = -\Phi^{-1}\Psi w_1 + r(w_1)    (4.22)

Note that the deterministic little-o notation r(w1) = o(||w1||) implies the stochastic little-o notation r(w1) = o_P(||w1||), i.e., r(w1)/||w1|| → 0 in probability as ||w1|| → 0 in probability [10]. Since w1 ∼ N(0, σ²Ct), we have w1 → 0 and hence r(w1) → 0 in probability as σ² → 0. As a result, we have proved that

\hat{\theta} \xrightarrow{P} \theta^*    (4.23)

as σ² → 0.
Next, we will prove the asymptotic Gaussianity of θ̂. Dividing (4.22) by σ, we have

\frac{\hat{\theta} - \theta^*}{\sigma} = -\Phi^{-1}\Psi \frac{w_1}{\sigma} + \frac{r(w_1)}{\sigma}    (4.24)

We write r(w1)/σ as

\frac{r(w_1)}{\sigma} = \frac{r(w_1)}{\|w_1\|} \cdot \frac{\|w_1\|}{\sigma}    (4.25)

Since r(w1)/||w1|| → 0 in probability as ||w1|| → 0 in probability (i.e., as σ² → 0), and ||w1||/σ follows a distribution that does not depend on σ, we have (see Theorems 2.3.3 and 2.3.5 on pages 70-71 in [11] and Theorem (4)(a) on page 310 in [12])

\frac{r(w_1)}{\sigma} \xrightarrow{P} 0    (4.26)
Note that −Φ^{-1}Ψ(w1/σ) ∼ N(0, Φ^{-1}ΨCtΨ^TΦ^{-1}) since w1/σ ∼ N(0, Ct). Hence, we have

\frac{\hat{\theta} - \theta^*}{\sigma} \xrightarrow{D} \mathcal{N}\left(\mathbf{0},\ \Phi^{-1}\Psi C_t \Psi^T \Phi^{-1}\right)    (4.27)
From (4.20) and (4.21), it can be shown that

\Phi = 2\sigma^2 A(\theta^*)    (4.28)

and

\Psi C_t \Psi^T = 4\sigma^2 B(\theta^*)    (4.29)

where A(θ∗) and B(θ∗) are defined as in (4.3). Hence, Φ^{-1}ΨCtΨ^TΦ^{-1} = (1/σ²) A(θ∗)^{-1} B(θ∗) A(θ∗)^{-1}. As a result, (4.23) and (4.27) correspond to a) and b) of White's results in Section 4.2. Note that the results in [2] correspond to c) of White's results, in which case the model is correctly specified.
4.4 A Misspecified Linear Model Example
Consider the misspecified linear model where the samples are generated from
the true observation model:
x = st + w1    (4.30)

where x is a real N × 1 vector of samples, st is the N × 1 true signal vector, and w1 is an N × 1 vector of additive Gaussian noise samples with zero mean and covariance matrix σ²Ct. The misspecified model is

x = Hθ + w2    (4.31)

where H is the known N × p observation matrix, θ is the p × 1 vector of unknown parameters, and w2 is an N × 1 vector of additive Gaussian noise samples with zero mean and covariance matrix σ²C. It is assumed that σ² is unknown and C is known.
From (4.12), we have

\theta^* = \arg\min_{\theta} (s_t - H\theta)^T C^{-1} (s_t - H\theta)    (4.32)

It can be shown that

\theta^* = (H^T C^{-1} H)^{-1} H^T C^{-1} s_t    (4.33)
It is well known that the QMLE of θ is [8]

\hat{\theta} = (H^T C^{-1} H)^{-1} H^T C^{-1} x    (4.34)

Since x is distributed according to (4.30), we have

x \sim \mathcal{N}(s_t, \sigma^2 C_t)    (4.35)

As a result, we have

\hat{\theta} \sim \mathcal{N}\left((H^T C^{-1} H)^{-1} H^T C^{-1} s_t,\ \sigma^2 (H^T C^{-1} H)^{-1} H^T C^{-1} C_t C^{-1} H (H^T C^{-1} H)^{-1}\right)    (4.36)

From (4.33) and (4.36), we see that

\hat{\theta} \xrightarrow{P} \theta^* \ \text{as } \sigma^2 \to 0    (4.37)
and

\frac{\hat{\theta} - \theta^*}{\sigma} \sim \mathcal{N}\left(\mathbf{0},\ (H^T C^{-1} H)^{-1} H^T C^{-1} C_t C^{-1} H (H^T C^{-1} H)^{-1}\right)    (4.38)

Note that in (4.38), (θ̂ − θ∗)/σ has a Gaussian distribution not just as σ² → 0 but for all σ². For this misspecified linear model, from (4.20) and (4.21), it can be shown that

\Phi = H^T C^{-1} H    (4.39)

and

\Psi = H^T C^{-1}    (4.40)

Hence, we can write (4.38) as

\frac{\hat{\theta} - \theta^*}{\sigma} \sim \mathcal{N}\left(\mathbf{0},\ \Phi^{-1}\Psi C_t \Psi^T \Phi^{-1}\right)    (4.41)

As a result, (4.37) and (4.41) match our results in (4.23) and (4.27), respectively.
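The closed forms (4.33)–(4.36) make this model convenient for a numerical check. In the sketch below (our illustration; H, st, C, Ct, and σ² are arbitrary choices), θ̂ from (4.34) stays close to θ∗ from (4.33) when σ² is small.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 30, 2
n = np.arange(N)
H = np.column_stack([np.ones(N), n / N])     # known N x p observation matrix
s_t = np.sin(0.2 * n)                        # true signal, not exactly in range(H)
C = np.eye(N)                                # assumed noise covariance
Ct = np.diag(1.0 + 0.5 * np.sin(n))          # true noise covariance

Cinv = np.linalg.inv(C)
W = np.linalg.inv(H.T @ Cinv @ H) @ H.T @ Cinv
theta_star = W @ s_t                         # (4.33)

sigma = 1e-3
x = s_t + sigma * rng.multivariate_normal(np.zeros(N), Ct)
theta_hat = W @ x                            # (4.34)
print(np.abs(theta_hat - theta_star))        # small for small sigma^2

cov = sigma ** 2 * W @ Ct @ W.T              # covariance of theta_hat in (4.36)
```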
4.5 Simulation Results
Consider the problem where the true model is
x[n] = A1 cos(2πf1n + φ1) + A2 cos(2πf2n + φ2) + w1[n] (4.42)
for n = 0, 1, . . . , N − 1, where w1 = [w1[0] w1[1] . . . w1[N − 1]]^T ∼ N(0, σ²Ct).
The misspecified model is
x[n] = A cos(2πfn + φ) + w2[n] (4.43)
where A > 0, 0 < f < 1/2, 0 ≤ φ < 2π are unknown, and the w2[n]'s are IID with w2[n] ∼ N(0, σ²) for n = 0, 1, . . . , N − 1. The QMLEs of A, f, and φ are found as follows (see Example 7.16 in [8]):
\hat{f} = \arg\max_f I(f) = \arg\max_f \frac{1}{N}\left|\sum_{n=0}^{N-1} x[n] \exp(-j2\pi f n)\right|^2

\hat{A} = \frac{2}{N}\left|\sum_{n=0}^{N-1} x[n] \exp(-j2\pi \hat{f} n)\right|

\hat{\phi} = \arctan\frac{-\sum_{n=0}^{N-1} x[n]\sin(2\pi \hat{f} n)}{\sum_{n=0}^{N-1} x[n]\cos(2\pi \hat{f} n)}
Here we use the Newton-Raphson method to find f̂, and the initial point is found by a global search for the maximum of the periodogram I(f) = (1/N)|Σ_{n=0}^{N−1} x[n] exp(−j2πfn)|² over a fine grid of f to ensure convergence (see Section 7.7 in [8]). Similarly, the A∗, f∗, φ∗ which minimize the KL divergence between the true PDF and the misspecified PDF can be found as
f^* = \arg\max_f I_t(f) = \arg\max_f \frac{1}{N}\left|\sum_{n=0}^{N-1} s_t[n] \exp(-j2\pi f n)\right|^2

A^* = \frac{2}{N}\left|\sum_{n=0}^{N-1} s_t[n] \exp(-j2\pi f^* n)\right|

\phi^* = \arctan\frac{-\sum_{n=0}^{N-1} s_t[n]\sin(2\pi f^* n)}{\sum_{n=0}^{N-1} s_t[n]\cos(2\pi f^* n)}
where st[n] = A1 cos(2πf1n + φ1) + A2 cos(2πf2n + φ2).
In the simulation, we choose A1 = 0.8, f1 = 0.11, φ1 = 0.3, A2 = 1.2, f2 = 0.33, φ2 = 0.47, N = 20, and Ct as a 20 × 20 diagonal matrix with the first 10 diagonal elements equal to 2 and the last 10 diagonal elements equal to 1. Note that in this case f1 and f2 are far apart. Thus the maximum of the periodogram It(f) = (1/N)|Σ_{n=0}^{N−1} st[n] exp(−j2πfn)|² will be near f2, since the "leakage" from the first sinusoidal component is comparatively small at f2. We see in Figure 4.1 that the maximum of It(f) is at about f = 0.33 = f2. Hence, we have f∗ ≈ f2 in this case. We generate 1000 realizations of {Â, f̂, φ̂} and plot the sample means of (Â − A∗)², (f̂ − f∗)², (φ̂ − φ∗)² versus ln(1/σ²) in Figure 4.2. As expected, they all converge to zero as σ² → 0.
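The location of f∗ in Figure 4.1 can be reproduced directly from the true signal; the following is an illustrative sketch of that computation.

```python
import numpy as np

A1, f1, phi1 = 0.8, 0.11, 0.3
A2, f2, phi2 = 1.2, 0.33, 0.47
N = 20
n = np.arange(N)
st = A1 * np.cos(2 * np.pi * f1 * n + phi1) + A2 * np.cos(2 * np.pi * f2 * n + phi2)

# Evaluate It(f) on a fine grid and locate its maximum
fgrid = np.arange(0.001, 0.5, 0.001)
It = np.abs(np.exp(-2j * np.pi * np.outer(fgrid, n)) @ st) ** 2 / N
f_star = fgrid[np.argmax(It)]
print(f_star)     # near f2 = 0.33, as in Figure 4.1
```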
Figure 4.1. The periodogram It(f) = (1/N)|Σ_{n=0}^{N−1} st[n] exp(−j2πfn)|². In this case, f∗ ≈ f2 = 0.33.
Now suppose f1 and f2 are close; for example, we change f2 to 0.13 in the above example. We see in Figure 4.3 that the peaks of the two sinusoidal components merge into one, and hence the maximum of It(f) is between f1 = 0.11 and f2 = 0.13.
Therefore, f∗ does not match either f1 or f2. For this case, we also plot the sample
Figure 4.2. Convergence of Â, f̂, φ̂ as σ² → 0.
means of (Â − A∗)², (f̂ − f∗)², (φ̂ − φ∗)² over 1000 realizations versus ln(1/σ²). We see in Figure 4.4 that Â − A∗, f̂ − f∗, φ̂ − φ∗ all converge to zero as σ² → 0.
Next, we want to show that the QMLE is asymptotically Gaussian. We still choose A1 = 0.8, f1 = 0.11, φ1 = 0.3, A2 = 1.2, f2 = 0.33, φ2 = 0.47, N = 20, and Ct as a 20 × 20 diagonal matrix with the first 10 diagonal elements equal to 2 and the last 10 diagonal elements equal to 1. For each σ², we generate 1600 realizations of {Â, f̂, φ̂}. We use the Lilliefors test to test the null hypothesis that the samples come from a Gaussian distribution with unspecified mean and variance against the alternative hypothesis that they do not [13]. The test first estimates the mean and variance from the samples,
and then, as in the Kolmogorov-Smirnov test, computes the test statistic tstat, the maximum discrepancy between the empirical cumulative distribution function and the Gaussian cumulative distribution function specified by the estimated mean and variance. The test statistic is compared to the critical value, which for 1600 realizations is τ = 0.886/√1600 = 0.02215 at significance level α = 0.05. If tstat < τ, the Lilliefors test accepts the null hypothesis that the samples are generated from a Gaussian distribution with unspecified mean and variance; otherwise, it rejects the null hypothesis. We plot the test statistic tstat of the Lilliefors test versus ln(1/σ²) and compare it with the critical value τ = 0.02215. As shown in Figure 4.5, the test statistics are below the critical value for ln(1/σ²) ≥ 4. As a result, the Lilliefors test decides that the estimates are Gaussian as σ² → 0. Note that the test statistic for f̂ has large values when 0 ≤ ln(1/σ²) ≤ 3. This is because when ln(1/σ²) = −5, the noise is so large that the samples resemble noise only, so that f̂ is approximately uniformly distributed between 0 and 0.5. When ln(1/σ²) = 1, the noise is reduced to a level at which f̂ is centered at f∗ but with some outliers near the frequency of the first sinusoidal component f = f1 = 0.11, which makes the Lilliefors test statistic larger. Histograms of f̂ are plotted for ln(1/σ²) = −5 and ln(1/σ²) = 1 in Figure 4.6.
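The Lilliefors statistic itself is simple to compute: fit the mean and variance, then take the Kolmogorov-Smirnov distance to the fitted Gaussian CDF. The sketch below is our illustration, not the dissertation's code.

```python
import numpy as np
from math import erf

def lilliefors_stat(x):
    """Maximum discrepancy between the empirical CDF of x and the Gaussian CDF
    with mean and variance estimated from x (the Lilliefors test statistic)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=1)
    F = np.array([0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in z])  # fitted CDF
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    return max(np.max(ecdf_hi - F), np.max(F - ecdf_lo))

rng = np.random.default_rng(2)
tau = 0.886 / np.sqrt(1600)                   # critical value at alpha = 0.05
print(lilliefors_stat(rng.standard_normal(1600)), tau)    # typically below tau
print(lilliefors_stat(rng.uniform(0.0, 0.5, 1600)), tau)  # well above tau
```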
Figure 4.4. Convergence of Â, f̂, φ̂ as σ² → 0.
4.6 Conclusions
We have derived the asymptotic performance of the QMLE at high SNR. It has been shown that for a deterministic signal in additive Gaussian noise, the QMLE converges to a well-defined limit, and it is asymptotically Gaussian as σ² → 0. The results are analogous to White's results on the QMLE for a large number of samples. Simulation results have been provided to verify our analysis.
List of References
[1] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.
[2] A. Renaux, P. Forster, E. Chaumette, and P. Larzabal, "On the high-SNR conditional maximum-likelihood estimator full statistical characterization," IEEE Trans. Signal Process., vol. 54, pp. 4840–4843, Dec. 2006.
[3] H. White, "Maximum likelihood estimation of misspecified models," Econometrica, vol. 50, no. 1, pp. 1–25, Jan. 1982.
[4] S. Kullback, Information Theory and Statistics, 2nd ed. Courier Dover Publications, 1997.
Figure 4.5. Test statistics of the Lilliefors test for Â, f̂, φ̂ as σ² → 0, with 1600 realizations of {Â, f̂, φ̂} for each σ².
[5] P.-J. Chung, "ML estimation under misspecified number of signals," in the 39th Asilomar Conference on Signals, Systems, and Computers, Nov. 2005.
[6] P.-J. Chung, "Stochastic maximum likelihood estimation under misspecified numbers of signals," IEEE Trans. Signal Process., vol. 55, pp. 4726–4731, Sep. 2007.
[7] H. Akaike, "Information theory and an extension of the likelihood principle," in Proceedings of the Second International Symposium on Information Theory, 1973.
[8] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[9] W. Rudin, Principles of Mathematical Analysis, 3rd ed. McGraw-Hill, 1976.
[10] A. W. van der Vaart, Asymptotic Statistics. Cambridge University Press, 2000.
[11] E. Lehmann, Elements of Large-Sample Theory. Springer, 1998.
[12] G. Grimmett and D. Stirzaker, Probability and Random Processes, 3rd ed. Oxford University Press, 2001.
[13] H. Lilliefors, "On the Kolmogorov-Smirnov test for normality with mean and variance unknown," Journal of the American Statistical Association, vol. 62, pp. 399–402, 1967.
Figure 4.6. Histograms of f̂ for (a) ln(1/σ²) = −5 and (b) ln(1/σ²) = 1.
MANUSCRIPT 5
Exponentially Embedded Families for Multimodal Sensor Processing
Abstract
The exponential embedding of two or more probability density functions (PDFs) is proposed for multimodal sensor processing. It approximates the unknown PDF by exponentially embedding the known PDFs. Such an embedding forms an exponential family indexed by some parameters, and hence inherits many nice properties of the exponential family. It is shown that the approximating PDF is asymptotically the one that is closest to the unknown PDF in Kullback-Leibler (KL) divergence. Applied to hypothesis testing, this approach shows improved performance compared to existing methods for cases of practical importance where the sensor outputs are not independent.
5.1 Introduction
Distributed detection systems have many applications such as radar and sonar,
medical diagnosis, weather prediction, and financial analysis. To obtain optimal
performance, we require the joint PDF of the sensor outputs, which is not al-
ways available. One common approach [1], [2] is to assume that the PDFs of the
sensor outputs are independent, and hence the joint PDF is the product of the
marginal PDFs. However, this assumption may not be satisfied, since the sensor measurements could be correlated due to the common source and the relative sensor locations. This correlation is addressed in [3], [4], where a copula-based framework is proposed to estimate the joint PDF from the marginal PDFs. In this work, we approximate the joint PDF by exponentially embedded families (EEFs), in the sense that the approximation asymptotically minimizes the KL divergence between the true PDF and the estimated one. For two PDFs p1 and p0, the KL divergence is defined as
D(p_1 \| p_0) = \int p_1(x) \ln\frac{p_1(x)}{p_0(x)}\, dx

It is always nonnegative and equals zero if and only if p1 = p0 almost everywhere. By Stein's lemma [5], the KL divergence is a measure of the asymptotic performance of binary hypothesis testing.
The term "exponentially embedded family" follows that in [6], where it is used for model order estimation. The embedded PDF belongs to an exponential family indexed by one or more parameters, and so has many nice properties of that family. From a differential-geometric point of view, the EEF forms a manifold in log-PDF space. In the one-dimensional case, the EEF is the PDF that minimizes D(p ‖ p0) subject to the constraint that D(p ‖ p0) − D(p ‖ p1) = θ [5], [7]. Here we focus on the problem of binary hypothesis testing, and we assume the presence of two sensors. Similar results hold for multiple hypothesis testing and multiple sensors.
The paper is organized as follows. Section 5.2 defines the EEF and discusses its properties. Section 5.3 applies it to hypothesis testing. An example is given in Section 5.4. In Section 5.5, we show simulation results by comparing the ROC curves of different approaches. Conclusions are drawn in Section 5.6.
5.2 EEF and Its Properties
Assume that a source produces the underlying samples x, which are unobservable, and that we have two sensors whose outputs are the statistics t1(x) and t2(x) of x.
Consider the binary hypothesis testing problem where we know the reference PDF
pX(x;H0), but not pX(x;H1). So we can find the joint PDF pT1,T2(t1, t2;H0), but
do not know pT1,T2(t1, t2;H1). We assume that the marginal PDFs pT1(t1;H1)
and pT2(t2;H1) are known. So the problem is to test between H0 and H1 where
we know the joint PDF under H0 and the marginal PDFs under H1. The EEF is
defined as

p_X(x; \eta) = \frac{\left(\frac{p_{T_1}(t_1(x); H_1)}{p_{T_1}(t_1(x); H_0)}\right)^{\eta_1} \left(\frac{p_{T_2}(t_2(x); H_1)}{p_{T_2}(t_2(x); H_0)}\right)^{\eta_2} p_X(x; H_0)}{\int \left(\frac{p_{T_1}(t_1(x); H_1)}{p_{T_1}(t_1(x); H_0)}\right)^{\eta_1} \left(\frac{p_{T_2}(t_2(x); H_1)}{p_{T_2}(t_2(x); H_0)}\right)^{\eta_2} p_X(x; H_0)\, dx}    (5.1)

where η = [η1, η2]^T are the embedding parameters with the constraints

\eta \in S = \{\eta : \eta_1, \eta_2 \ge 0,\ \eta_1 + \eta_2 \le 1\}    (5.2)
Notice that pX(x; η) does not require knowledge of pX(x;H1). So in practice, we need to estimate only pX(x;H0) and the PDFs of T1 and T2 under H1 from training data (see also [8]). The reason for the constraints in (5.2) will be explained later. The next theorem is an extension of Kullback's results [5], [7].
Theorem 6. The PDF of x as in (5.1) is the one that minimizes D(pX(x) ‖ pX(x;H0)) subject to the constraints

D(p_{T_i}(t_i) \| p_{T_i}(t_i; H_0)) - D(p_{T_i}(t_i) \| p_{T_i}(t_i; H_1)) = \theta_i

for i = 1, 2, where pT1(t1) and pT2(t2) are the PDFs of T1 and T2 corresponding to pX(x).
Proof. Since

D(p_{T_i}(t_i) \| p_{T_i}(t_i; H_0)) - D(p_{T_i}(t_i) \| p_{T_i}(t_i; H_1)) = \int p_X(x) \ln\frac{p_{T_i}(t_i(x); H_1)}{p_{T_i}(t_i(x); H_0)}\, dx \quad \text{for } i = 1, 2

using Lagrange multipliers for the minimization gives

J(p_X(x)) = \int p_X(x) \ln\frac{p_X(x)}{p_X(x; H_0)}\, dx + \lambda_1 \int p_X(x) \ln\frac{p_{T_1}(t_1(x); H_1)}{p_{T_1}(t_1(x); H_0)}\, dx + \lambda_2 \int p_X(x) \ln\frac{p_{T_2}(t_2(x); H_1)}{p_{T_2}(t_2(x); H_0)}\, dx + \lambda_3 \int p_X(x)\, dx
Differentiating with respect to pX(x) and setting the result to zero, we have

\ln\frac{p_X(x)}{p_X(x; H_0)} + 1 + \lambda_1 \ln\frac{p_{T_1}(t_1(x); H_1)}{p_{T_1}(t_1(x); H_0)} + \lambda_2 \ln\frac{p_{T_2}(t_2(x); H_1)}{p_{T_2}(t_2(x); H_0)} + \lambda_3 = 0

Solving this equation and letting η1 = −λ1 and η2 = −λ2, the pX(x) that minimizes D(pX(x) ‖ pX(x;H0)) has the form in (5.1), where η1 and η2 are chosen to meet the constraints.
By letting

K(\eta) = \ln \int \left(\frac{p_{T_1}(t_1(x); H_1)}{p_{T_1}(t_1(x); H_0)}\right)^{\eta_1} \left(\frac{p_{T_2}(t_2(x); H_1)}{p_{T_2}(t_2(x); H_0)}\right)^{\eta_2} p_X(x; H_0)\, dx    (5.3)

l_{T_1}(x) = \ln\frac{p_{T_1}(t_1(x); H_1)}{p_{T_1}(t_1(x); H_0)}, \quad l_{T_2}(x) = \ln\frac{p_{T_2}(t_2(x); H_1)}{p_{T_2}(t_2(x); H_0)}    (5.4)

(5.1) can be written as

p_X(x; \eta) = \exp\left[\eta_1 l_{T_1}(x) + \eta_2 l_{T_2}(x) - K(\eta) + \ln p_X(x; H_0)\right]    (5.5)

which is a two-parameter exponential family [9]. K(η) is recognized as the cumulant generating function of lT1(x), lT2(x) when the PDF of x is pX(x;H0). Since (5.5) is of an exponential family, the EEF inherits some useful properties, which we discuss in the following (refer to [9], [10], and [11] for details).
1) If the PDF of x is pX(x; η), then the joint PDF of T1 and T2 is [11]

p_{T_1,T_2}(t_1, t_2; \eta) = \exp\left[\eta_1 l_{T_1} + \eta_2 l_{T_2} - K(\eta) + \ln p_{T_1,T_2}(t_1, t_2; H_0)\right]    (5.6)

where

l_{T_1} = \ln\frac{p_{T_1}(t_1; H_1)}{p_{T_1}(t_1; H_0)}, \quad l_{T_2} = \ln\frac{p_{T_2}(t_2; H_1)}{p_{T_2}(t_2; H_0)}    (5.7)

This can also be easily proved using surface integral techniques [12]. Notice that in (5.6), T1 and T2 are not independent unless they are independent under H0.
2) K(η) is convex by Hölder's inequality [9]. If we assume that lT1 and lT2 are linearly independent [13], then η is identifiable, and hence K(η) is strictly convex [10].

3) Let Eη(lTi) be the expected value of lTi for i = 1, 2 and Cη be the covariance matrix of [lT1, lT2]^T when x is distributed according to pX(x; η). We have

\frac{\partial K(\eta)}{\partial \eta_i} = E_\eta(l_{T_i})    (5.8)

\begin{bmatrix} \frac{\partial^2 K(\eta)}{\partial \eta_1^2} & \frac{\partial^2 K(\eta)}{\partial \eta_1 \partial \eta_2} \\ \frac{\partial^2 K(\eta)}{\partial \eta_2 \partial \eta_1} & \frac{\partial^2 K(\eta)}{\partial \eta_2^2} \end{bmatrix} = C_\eta    (5.9)

Notice that (5.9) also shows that K(η) is convex.

4) [lT1, lT2]^T is a minimal and complete sufficient statistic for η. Hence [lT1, lT2]^T can be used to discriminate between pX(x;H1) and pX(x;H0).

5) K(η) is finite on S. To see this, K(η) > −∞ by definition. Obviously, K(η) = 0 for η = [0, 0]^T, [1, 0]^T, [0, 1]^T. Since K(η) is convex, we have K(η) ≤ 0 < ∞ for η ∈ S. But when η is outside S, there is no guarantee that K(η) is finite in general. This explains the constraints in (5.2).
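Property 5 can be checked by Monte Carlo in a simple case. In the sketch below (our illustration; all values arbitrary), the sensor statistics are scalars that are marginally N(0, 1) under H0 and N(ai, 1) under H1, with correlation ρ0 under H0.

```python
import numpy as np

rng = np.random.default_rng(3)
a1, a2, rho0 = 0.5, 0.8, 0.6
R0 = np.array([[1.0, rho0], [rho0, 1.0]])
t = rng.multivariate_normal([0.0, 0.0], R0, size=200000)   # samples under H0

def lT(ti, a):
    # log marginal likelihood ratio of N(a, 1) versus N(0, 1), cf. (5.4)
    return a * ti - a * a / 2.0

def K(eta):
    # Monte Carlo estimate of the cumulant generating function (5.3)
    return np.log(np.mean(np.exp(eta[0] * lT(t[:, 0], a1) + eta[1] * lT(t[:, 1], a2))))

# Property 5: K(eta) = 0 at the corners of S and K(eta) <= 0 inside S
for eta in ([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.25]):
    print(eta, K(eta))
```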
5.3 EEF for Hypothesis Testing
For binary hypothesis testing, we decide H1 if

\max_{\eta} \ln\frac{p_X(x; \eta)}{p_X(x; H_0)} > \tau    (5.10)

where τ is a threshold. This test statistic actually does not depend on x but only on t1 and t2, since

g(\eta) = \ln\frac{p_X(x; \eta)}{p_X(x; H_0)} = \eta_1 l_{T_1} + \eta_2 l_{T_2} - K(\eta)    (5.11)
The reason why we choose this test statistic, as we will show next, is that asymptotically max_η pX(x; η) is the closest to the unknown pX(x;H1) in KL divergence.

Assume that there are a large number of independent and identically distributed (IID) unobservable xi's for i = 1, 2, . . . , N, which result in IID t1i's and IID t2i's. We want to maximize

\frac{1}{N}\sum_{i=1}^{N} \ln\frac{p_X(x_i; \eta)}{p_X(x_i; H_0)} = \eta_1 \frac{1}{N}\sum_{i=1}^{N} l_{T_{1i}} + \eta_2 \frac{1}{N}\sum_{i=1}^{N} l_{T_{2i}} - K(\eta)    (5.12)

By the law of large numbers, under H1,

\frac{1}{N}\sum_{i=1}^{N} l_{T_{1i}} \to E_{H_1}(l_{T_1}) = D(p_{T_1}(t_1; H_1) \| p_{T_1}(t_1; H_0))

\frac{1}{N}\sum_{i=1}^{N} l_{T_{2i}} \to E_{H_1}(l_{T_2}) = D(p_{T_2}(t_2; H_1) \| p_{T_2}(t_2; H_0))

as N → ∞. So we are asymptotically maximizing

\eta_1 D(p_{T_1}(t_1; H_1) \| p_{T_1}(t_1; H_0)) + \eta_2 D(p_{T_2}(t_2; H_1) \| p_{T_2}(t_2; H_0)) - K(\eta)    (5.13)
Since

\ln\frac{p_X(x; H_1)}{p_X(x; \eta)} = -\eta_1 l_{T_1} - \eta_2 l_{T_2} + K(\eta) + \ln\frac{p_X(x; H_1)}{p_X(x; H_0)}

the KL divergence between pX(x;H1) and pX(x; η) is

D(p_X(x; H_1) \| p_X(x; \eta)) = E_{H_1}\left[-\eta_1 l_{T_1} - \eta_2 l_{T_2} + K(\eta) + \ln\frac{p_X(x; H_1)}{p_X(x; H_0)}\right]
= -\eta_1 D(p_{T_1}(t_1; H_1) \| p_{T_1}(t_1; H_0)) - \eta_2 D(p_{T_2}(t_2; H_1) \| p_{T_2}(t_2; H_0)) + K(\eta) + D(p_X(x; H_1) \| p_X(x; H_0))    (5.14)
This shows that D(pX(x;H1) ‖ pX(x; η)) is minimized by maximizing (5.13). A similar result is shown in [6] using a Pythagorean-like theorem. Also, if T1 and/or T2 are sufficient statistics for deciding between H0 and H1, it can be shown that pX(x; η̂) = pX(x;H1); thus, the true PDF under H1 is recovered [14].

To implement (5.10), we require the maximum likelihood estimate (MLE) of η. Let η∗ be the MLE of η without the constraints in (5.2). Since g(η) is strictly concave, η∗ is unique. Taking partial derivatives of g(η) and setting them to zero, we have

l_{T_1} = \frac{\partial K(\eta)}{\partial \eta_1}\Big|_{\eta^*}, \quad l_{T_2} = \frac{\partial K(\eta)}{\partial \eta_2}\Big|_{\eta^*}    (5.15)

Let η̂ be the MLE of η with the constraints. If η∗ is in the constraint set S, then η̂ = η∗. Otherwise, η̂ is unique and lies on the boundary of S, since −g(η) is strictly convex and S is convex [15]; hence we can simply search the boundary of S to find η̂.
5.4 Example
Since only T1 and T2 are used in hypothesis testing, we only need to specify their distributions. Consider the case when T1 and T2 are scalars (we write them as T1 and T2) with distributions

\begin{bmatrix} T_1 \\ T_2 \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \sigma^2 \begin{bmatrix} 1 & \rho_0 \\ \rho_0 & 1 \end{bmatrix}\right) \text{ under } H_0

\begin{bmatrix} T_1 \\ T_2 \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} A_1 \\ A_2 \end{bmatrix}, \sigma^2 \begin{bmatrix} 1 & \rho_1 \\ \rho_1 & 1 \end{bmatrix}\right) \text{ under } H_1

where ρ0 is known but ρ1 is unknown (we do not need the joint PDF of T1 and T2 under H1). We have
K(\eta) = \ln E_{H_0}[\exp(\eta_1 l_{T_1} + \eta_2 l_{T_2})]
= \ln E_{H_0}\left[\exp\left(\eta_1 \frac{2t_1 A_1 - A_1^2}{2\sigma^2} + \eta_2 \frac{2t_2 A_2 - A_2^2}{2\sigma^2}\right)\right]
= -\eta_1 \frac{A_1^2}{2\sigma^2} - \eta_2 \frac{A_2^2}{2\sigma^2} + \ln E_{H_0}\left[\exp\left(\frac{\eta_1 t_1 A_1 + \eta_2 t_2 A_2}{\sigma^2}\right)\right]

Let \phi = [\eta_1 A_1/\sigma^2,\ \eta_2 A_2/\sigma^2]^T and t = [t_1, t_2]^T. Then

E_{H_0}\left[\exp\left(\frac{\eta_1 t_1 A_1 + \eta_2 t_2 A_2}{\sigma^2}\right)\right] = E_{H_0}\left[\exp(\phi^T t)\right] = \exp\left(\frac{1}{2}\phi^T C_0 \phi\right)

where C_0 = \sigma^2 \begin{bmatrix} 1 & \rho_0 \\ \rho_0 & 1 \end{bmatrix}, and hence

K(\eta) = -\eta_1 \frac{A_1^2}{2\sigma^2} - \eta_2 \frac{A_2^2}{2\sigma^2} + \frac{1}{2}\phi^T C_0 \phi

So

g(\eta) = \eta_1 l_{T_1} + \eta_2 l_{T_2} - K(\eta)
= \eta_1 \frac{2t_1 A_1 - A_1^2}{2\sigma^2} + \eta_2 \frac{2t_2 A_2 - A_2^2}{2\sigma^2} - K(\eta)
= \frac{\eta_1 A_1 t_1}{\sigma^2} + \frac{\eta_2 A_2 t_2}{\sigma^2} - \frac{1}{2}\phi^T C_0 \phi
= t^T \phi - \frac{1}{2}\phi^T C_0 \phi

Differentiating and setting the result to zero, the global maximum is found at

\phi^* = C_0^{-1} t = \frac{1}{\sigma^2(1-\rho_0^2)} \begin{bmatrix} t_1 - \rho_0 t_2 \\ t_2 - \rho_0 t_1 \end{bmatrix}

or

\eta^* = \frac{1}{1-\rho_0^2} \begin{bmatrix} (t_1 - \rho_0 t_2)/A_1 \\ (t_2 - \rho_0 t_1)/A_2 \end{bmatrix}
If η∗ ∈ S, then we decide H1 if g(η∗) = (1/2) t^T C_0^{-1} t > τ; otherwise we search for η̂ on the boundary of S and decide H1 if g(η̂) > τ.

When we observe N IID t1i's and IID t2i's, it follows from (5.12) that [t1, t2]^T is replaced by the sample mean [(1/N)Σ_{i=1}^N t1i, (1/N)Σ_{i=1}^N t2i]^T, and everything else remains the same.
5.5 Simulation Results
For the above example, we set N = 20, A1 = 0.3, A2 = 0.35, σ² = 1, ρ0 = 0.6, and ρ1 = 0.7. We compare the EEF approach with the clairvoyant detector (ρ1 is known, so its performance is an upper bound), the detector assuming independence of t1 and t2, and the copula-based method. The copula method estimates the linear correlation coefficient ρ1 using a non-parametric rank correlation measure, Kendall's τ. We use the Gaussian copula as in [3]. The simulation is repeated for 5000 trials, and the receiver operating characteristic (ROC) curves are plotted. As seen in Figure 5.1, the EEF is outperformed only by the clairvoyant detector, and performs better than the other two methods.
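For the Gaussian copula, Kendall's τ determines the linear correlation through the standard inversion ρ = sin(πτ/2); a minimal sketch of this estimation step (our illustration, with arbitrary sample size) is:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(4)
rho1 = 0.7
C1 = np.array([[1.0, rho1], [rho1, 1.0]])
t = rng.multivariate_normal([0.3, 0.35], C1, size=5000)   # samples under H1

tau_k, _ = kendalltau(t[:, 0], t[:, 1])
rho_hat = np.sin(np.pi * tau_k / 2.0)    # Gaussian-copula inversion of Kendall's tau
print(rho_hat)                            # close to rho1 = 0.7
```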
Figure 5.1. ROC curves for different detectors (clairvoyant, EEF, copula, and assumed independence).
5.6 Conclusion
The EEF-based approach is proposed for the problem of multimodal signal processing when the sensor outputs are not independent. It exponentially embeds two or more PDFs to approximate an unknown PDF. Such embedding is closely related to the KL divergence, and many of its properties have been discussed. Examples are given to illustrate the application of this method. Compared to some existing approaches, better performance is observed for the proposed method. The connections among η, K(η), and the KL divergence, and more of its theoretical properties, will be investigated in the future.
List of References
[1] S. Thomopoulos, R. Viswanathan, and D. Bougoulias, "Optimal distributed decision fusion," IEEE Trans. Aerosp. Electron. Syst., vol. 25, pp. 761–765, Sep. 1989.
[2] Z. Chair and P. Varshney, "Optimal data fusion in multiple sensor detection systems," IEEE Trans. Aerosp. Electron. Syst., vol. 22, pp. 98–101, Jan. 1986.
[3] A. Sundaresan, P. Varshney, and N. Rao, "Distributed detection of a nuclear radioactive source using fusion of correlated decisions," in 10th International Conference on Information Fusion, 2007, pp. 1–7.
[4] S. Iyengar, P. Varshney, and T. Damarla, "A parametric copula based framework for multimodal signal processing," in ICASSP, 2009, pp. 1893–1896.
[5] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed. John Wiley and Sons, 2006.
[6] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333–345, Jan. 2005.
[7] S. Kullback, Information Theory and Statistics, 2nd ed. Courier Dover Publications, 1997.
[8] S. Kay, A. Nuttall, and P. Baggenstoss, "Multidimensional probability density function approximations for detection, classification, and model order selection," IEEE Trans. Signal Process., vol. 49, pp. 2240–2252, Oct. 2001.
[9] L. Brown, Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, 1986.
[10] P. Bickel and K. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics. Pearson Prentice Hall, 2006, vol. 1.
[11] E. Lehmann and J. Romano, Testing Statistical Hypotheses, 3rd ed. Springer, 2005.
[12] J. Higgins, "Some surface integral techniques in statistics," The American Statistician, vol. 29, pp. 43–46, Feb. 1975.
[13] J. Pfanzagl and W. Wefelmeyer, Contributions to a General Asymptotic Statistical Theory, ser. Lecture Notes in Statistics. Springer-Verlag, 1982, vol. 13.
[14] S. Kay, "Asymptotically optimal approximation of multidimensional pdf's by lower dimensional pdf's," IEEE Trans. Signal Process., vol. 55, pp. 725–729, Feb. 2007.
[15] D. Luenberger, Linear and Nonlinear Programming, 2nd ed. Springer, 2003.
MANUSCRIPT 6
Joint PDF Construction for Sensor Fusion and Distributed Detection
Abstract
A novel method of constructing a joint PDF under H1, when the joint PDF
under H0 is known, is developed. It has direct application in distributed detection
systems. The construction is based on the exponential family and it is shown
that asymptotically the constructed PDF is optimal. The generalized likelihood
ratio test (GLRT) is derived based on this method for the partially observed linear
model. Interestingly, the test statistic is equivalent to the clairvoyant GLRT, which
uses the true PDF under H1, even if the noise is non-Gaussian.
6.1 Introduction
Data fusion or sensor fusion in distributed detection systems has been widely
studied over the years. By combining the data from different sensors, better per-
formance can be expected than using a single sensor alone. The optimal detection
performance can be obtained if the joint probability density function (PDF) of the
measurements from different sensors under each hypothesis is completely known.
However in practice, this joint PDF is usually not available. So a key issue in this
area is how to construct the joint PDF of the measurements from different sen-
sors. One common approach is to assume that the measurements are independent
[1], [2]. This approach has been widely used due to its simplicity, since the joint
PDF is then the product of the marginal PDFs. This leads to the product rule
in combining classifiers, and it is effectively a severe rule as stated in [3] that “it
is sufficient for a single recognition engine to inhibit a particular interpretation
by outputting a close to zero probability for it”. Moreover, the independence is a
strong assumption and the measurements can be correlated in many cases. The
dependence between measurements has been considered in [4, 5, 6]. A copula based
framework is used in [4, 5] to estimate the joint PDF from the marginal PDFs.
The exponentially embedded families (EEFs) are proposed in [6] to asymptotically
minimize the Kullback-Leibler (KL) divergence between the true PDF and the
estimated one.
Note that all the above methods are based on the assumption that we know
the marginal PDFs of the measurements. But in many cases, the marginal PDFs
may not be available or accurate. This could happen when we do not have enough
training data. In this paper, we present a new way of constructing a joint PDF without knowledge of the marginal PDFs, using only a reference PDF. The constructed joint PDF takes the form of an exponential family, and the maximum likelihood estimate (MLE) of the unknown parameters can be easily found based on the exponential family. Since no Gaussian assumption is placed on the reference PDF, this method can be very useful when the underlying distributions are non-Gaussian. In the examples where we apply this method to the detection problem, under some conditions the detection statistic can be shown to be the same as that of the clairvoyant generalized likelihood ratio test (GLRT), which is the test when the true PDF under H1 is known except for the usual unknown parameters.
The paper is organized as follows. Section 6.2 formulates the detection problem. The construction of the joint PDF is presented and applied to the detection problem in Section 6.3. The KL divergence between the true PDF and the constructed PDF is examined in Section 6.4. We give two examples in Section 6.5. In Section 6.6, some simulation results are shown. Conclusions are given in Section 6.7.
6.2 Problem Statement
Consider the detection problem when we observe the outputs of two sensors, T1(x) and T2(x), which are transformations of the underlying samples x that are unobservable (see Figure 6.1). All the results are valid for any number of sensors; we choose two for simplicity. Assume that we have enough training data T1i(x)'s and T2i(x)'s under H0, when no signal is present. Hence we have a good estimate of the joint PDF of T1 and T2 under H0 (see [7]), and thus we assume pT1,T2(t1, t2;H0) is completely known. Under H1, when a signal is present, we may not have enough training data to estimate the joint PDF. So our goal is to construct an appropriate pT1,T2(t1, t2;H1) and use it for detection. Since pT1,T2(t1, t2;H1) cannot be uniquely specified based on pT1,T2(t1, t2;H0), we need the following reasonable assumptions to construct the joint PDF.

1) Under H1 the signal is small, and thus pT1,T2(t1, t2;H1) is close to pT1,T2(t1, t2;H0).

2) pT1,T2(t1, t2;H1) depends on signal parameters θ, so that

p_{T_1,T_2}(t_1, t_2; H_1) = p_{T_1,T_2}(t_1, t_2; \theta)

and

p_{T_1,T_2}(t_1, t_2; H_0) = p_{T_1,T_2}(t_1, t_2; \mathbf{0})

Note that since θ represents signal amplitudes, θ ≠ 0 under H1. Therefore, the detection problem is

H_0 : \theta = \mathbf{0}
H_1 : \theta \ne \mathbf{0}
Figure 6.1. Distributed detection system with two sensors.
6.3 Construction of Joint PDF for Detection
To simplify the notation, let T = [T1, T2]^T, so that the joint PDF pT1,T2(t1, t2; θ) can be written as pT(t; θ). Since we assume that ||θ|| is small, we expand the log-likelihood function using a first order Taylor expansion:

\ln p_{\mathbf{T}}(t; \theta) = \ln p_{\mathbf{T}}(t; \mathbf{0}) + \theta^T \frac{\partial \ln p_{\mathbf{T}}(t; \theta)}{\partial \theta}\Big|_{\theta=\mathbf{0}} + o(\|\theta\|)    (6.1)
We omit the o(||θ||) term, but in order for pT(t; θ) to be a valid PDF, we normalize it to integrate to one:

p_{\mathbf{T}}(t; \theta) = \exp\left[\theta^T \frac{\partial \ln p_{\mathbf{T}}(t; \theta)}{\partial \theta}\Big|_{\theta=\mathbf{0}} - K(\theta) + \ln p_{\mathbf{T}}(t; \mathbf{0})\right]    (6.2)

where

K(\theta) = \ln E_0\left[\exp\left(\theta^T \frac{\partial \ln p_{\mathbf{T}}(t; \theta)}{\partial \theta}\Big|_{\theta=\mathbf{0}}\right)\right]    (6.3)
Here E0 denotes the expected value under H0.
Next we assume that the sensor outputs are the score functions, i.e.,
t = ∂ln pT(t; θ)/∂θ |_{θ=0}   (6.4)
and are sufficient statistics for the constructed PDF under H1. This will be true
if pT(t; θ) is in the exponential family with
pT(t; θ) = exp[ θ^T t − K(θ) + ln pT(t; 0) ]   (6.5)
where
K(θ) = ln E0[ exp( θ^T T ) ]   (6.6)
and E0(T) = 0. This can be easily verified since by (6.5), we have
∂ln pT(t; θ)/∂θ |_{θ=0} = t − ∂K(θ)/∂θ |_{θ=0}
and
∂K(θ)/∂θ |_{θ=0} = E0(T)
as well-known properties of the exponential family. Note that even if E0(T) ≠ 0,
we still have
t − E0(T) = ∂ln pT(t; θ)/∂θ |_{θ=0}
We can use t − E0(T) instead of t as the sensor outputs and hence still satisfy
(6.4) and (6.5). As a result, we will use (6.5) as our constructed PDF. This implies
that t is a sufficient statistic for the constructed exponential PDF, and hence this
PDF incorporates all the sensor information. Note that if T1 and T2 are statistically
dependent under H0, they will also be dependent under H1. Also note that only
pT(t; 0) is required in (6.5). It is assumed that in practice this can be estimated
or found analytically [7] with reasonable accuracy.
Since θ is unknown, the GLRT is used for detection [8]. We want to maximize
pT(t; θ), or equivalently ln[pT(t; θ)/pT(t; 0)] = θ^T t − K(θ), over θ. This is a
convex optimization problem since K(θ) is convex by Hölder's inequality [9].
Hence many convex optimization techniques can be utilized [10, 11]. By taking
the derivative with respect to θ, the MLE of θ is found by solving
t = ∂K(θ)/∂θ   (6.7)
Also, because K(θ) is strictly convex, the MLE θ̂ is unique. Then we decide H1 if
ln[pT(t; θ̂)/pT(t; 0)] = θ̂^T t − K(θ̂) > τ   (6.8)
where τ is a threshold.
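The maximization of θ^T t − K(θ) can be sketched numerically even when K(θ) is only available through H0 training data. The following is a minimal illustration, not the implementation used in this work: it assumes a toy two-dimensional Gaussian reference, estimates the cumulant generating function and its gradient (the exponentially tilted mean of T) by Monte Carlo from an H0 sample, and runs plain gradient ascent with arbitrarily chosen step size and iteration count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy H0 reference: T ~ N(0, C); in practice T0 would be the H0 training outputs.
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
T0 = rng.standard_normal((200000, 2)) @ np.linalg.cholesky(C).T

def K_hat(theta):
    # Monte Carlo estimate of K(theta) = ln E0[exp(theta^T T)] in (6.6).
    return np.log(np.mean(np.exp(T0 @ theta)))

def glrt_stat(t, steps=500, mu=0.05):
    # Gradient ascent on the concave objective theta^T t - K(theta); the
    # gradient is t minus the tilted mean of T, estimated from the H0 sample.
    theta = np.zeros(T0.shape[1])
    for _ in range(steps):
        w = np.exp(T0 @ theta)
        w /= w.sum()
        theta += mu * (t - T0.T @ w)
    return theta @ t - K_hat(theta), theta

# For a Gaussian reference, K(theta) = theta^T C theta / 2 exactly, so the
# maximizer is C^{-1} t and the statistic is t^T C^{-1} t / 2.
t = np.array([1.0, -0.5])
stat, theta_hat = glrt_stat(t)
print(stat, t @ np.linalg.solve(C, t) / 2)
```

For the Gaussian toy reference the numerical result matches the closed-form answer to within Monte Carlo error, which is how a sketch like this can be validated before applying it to an empirical pT(t; 0).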
6.4 KL Divergence Between The True PDF and The Constructed PDF
The KL divergence is a non-symmetric measure of the difference between two
PDFs. For two PDFs p1 and p0, it is defined as
D(p1 || p0) = ∫ p1(x) ln[p1(x)/p0(x)] dx
It is well known that D(p1 || p0) ≥ 0 with equality if and only if p1 = p0 [12]. By
Stein's lemma [13], the KL divergence measures the asymptotic performance for
detection.
It can be shown that pT(t; θ̂) is optimal under both hypotheses. That is,
under H0, pT(t; θ̂) = pT(t; 0) asymptotically, and under H1, pT(t; θ̂)
is asymptotically the closest to the true PDF in KL divergence. Similar results
and arguments have been shown in [6, 14].
6.5 Examples
In this section, we apply the constructed PDF of (6.5) to some detection
problems. We start with the simple case of Gaussian noise and then extend the
result to the more general case of Gaussian mixture noise.
6.5.1 Partially Observed Linear Model with Gaussian Noise
Suppose we have the linear model
x = Hα + w   (6.9)
with
H0 : α = 0
H1 : α ≠ 0
where x is an N × 1 vector of the underlying unobservable samples, H is an N × p
observation matrix with full column rank, α is a p × 1 vector of the unknown
signal amplitudes, and w is an N × 1 vector of white Gaussian noise with known
variance σ^2. We observe two sensor outputs
T1(x) = H1^T x
T2(x) = H2^T x   (6.10)
where H1 and H2 could be any subsets of the columns of H. Note that [H1, H2]
does not have to be H. This model is called a partially observed linear model. Note
that a sufficient statistic is H^T x, so there is some information loss over the case
when x is observed, unless H = [H1, H2].
Let G = [H1, H2]. Then we have
T = [T1(x)^T, T2(x)^T]^T = [(H1^T x)^T, (H2^T x)^T]^T = G^T x   (6.11)
Therefore, T is also Gaussian with PDF
T ∼ N(0, σ^2 G^T G)   under H0
and T1, T2 are seen to be correlated for H1^T H2 ≠ 0. As a result, we construct the
PDF as in (6.5) with
K(θ) = ln E0[ exp( θ^T T ) ] = (σ^2/2) θ^T G^T G θ   (6.12)
Note that θ is the vector of the unknown parameters in the constructed PDF, and
it is different from the unknown parameters α in the linear model.
By (6.7) and (6.12), the MLE of θ satisfies
t = ∂K(θ)/∂θ = σ^2 G^T G θ
so that
θ̂ = (1/σ^2) (G^T G)^{-1} t
and the test statistic becomes
θ̂^T t − K(θ̂) = (1/(2σ^2)) t^T (G^T G)^{-1} t   (6.13)
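A quick numerical check of (6.13): the statistic obtained by plugging the MLE into θ^T t − K(θ) must equal the closed form. The dimensions, σ^2, and the random H below are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(1)

N, p, sigma2 = 20, 4, 2.0
H = rng.standard_normal((N, p))
G = H[:, [0, 2]]                    # two sensors observing columns of H

x = np.sqrt(sigma2) * rng.standard_normal(N)   # data under H0
t = G.T @ x                                    # stacked sensor outputs (6.11)

# MLE from t = dK/dtheta = sigma^2 G^T G theta, then the statistic
# theta_hat^T t - K(theta_hat) with K(theta) = (sigma^2/2) theta^T G^T G theta.
theta_hat = np.linalg.solve(sigma2 * (G.T @ G), t)
stat = theta_hat @ t - 0.5 * sigma2 * theta_hat @ (G.T @ G) @ theta_hat
stat_closed = t @ np.linalg.solve(G.T @ G, t) / (2 * sigma2)
print(stat, stat_closed)   # the two agree, consistent with (6.13)
```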
Next we consider the clairvoyant GLRT, that is, the GLRT when we know
the true PDF of T under H1 except for the underlying unknown parameters α.
From (6.11) we know that
T ∼ N(G^T H α, σ^2 G^T G)   under H1
We write the true PDF under H1 as pT(t; α). The MLE of α is found by maximizing
ln[pT(t; α)/pT(t; 0)] = −(1/(2σ^2)) (t − G^T H α)^T (G^T G)^{-1} (t − G^T H α)
+ (1/(2σ^2)) t^T (G^T G)^{-1} t
Let t be q × 1. If q ≤ p, i.e., the length of t is no greater than the length of α,
then the MLE α̂ may not be unique. Since
(t − G^T H α)^T (G^T G)^{-1} (t − G^T H α) ≥ 0, we can always find α̂ such that
t = G^T H α̂ and hence (t − G^T H α̂)^T (G^T G)^{-1} (t − G^T H α̂) = 0. Hence
the clairvoyant GLRT statistic becomes
ln[pT(t; α̂)/pT(t; 0)] = (1/(2σ^2)) t^T (G^T G)^{-1} t
which is the same as the GLRT on our constructed PDF (see (6.13)) when q ≤ p.
6.5.2 Partially Observed Linear Model with Non-Gaussian Noise
The partially observed linear model remains the same as in the previous
subsection, except that instead of assuming that w is white Gaussian, we assume
that w has a Gaussian mixture distribution with two components, i.e.,
w ∼ π N(0, σ1^2 I) + (1 − π) N(0, σ2^2 I)   (6.14)
where π, σ1^2, and σ2^2 are known (0 < π < 1). The following derivation is easily
extended to w ∼ Σ_{i=1}^{L} πi N(0, σi^2 I).
Since w has a Gaussian mixture distribution, T = G^T x is also Gaussian
mixture distributed and
T ∼ π N(0, σ1^2 G^T G) + (1 − π) N(0, σ2^2 G^T G)   under H0
It can be shown that the GLRT statistic is
max_θ [ θ^T t − ln( π e^{(σ1^2/2) θ^T G^T G θ} + (1 − π) e^{(σ2^2/2) θ^T G^T G θ} ) ]   (6.15)
Although no analytical solution for the MLE of θ exists, it can be found using
convex optimization techniques [10, 11]. Moreover, an analytical solution exists as
||θ|| → 0. It can be shown that
θ̂ = (1/(π σ1^2 + (1 − π) σ2^2)) (G^T G)^{-1} t   (6.16)
and the GLRT statistic becomes
(1/(2(π σ1^2 + (1 − π) σ2^2))) t^T (G^T G)^{-1} t   (6.17)
as ||θ|| → 0.
The clairvoyant GLRT statistic can be shown to be equivalent to
t^T (G^T G)^{-1} t   (6.18)
when q ≤ p. Hence the clairvoyant GLRT coincides with the GLRT using the
constructed PDF as ||θ|| → 0.
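A sketch of (6.15)-(6.17) with illustrative values (the matrix A stands in for G^T G, and the step size and iteration count of the gradient ascent are arbitrary): the concave objective of (6.15) is maximized numerically and, for a small t, the result agrees with the asymptotic statistic (6.17).

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # plays the role of G^T G (toy values)
pi_, s1, s2 = 0.9, 50.0, 500.0      # mixture weight and variances

def objective_and_grad(theta, t):
    # Objective of (6.15) and its gradient (concave in theta).
    q = theta @ A @ theta
    a1, a2 = 0.5 * s1 * q, 0.5 * s2 * q
    m = max(a1, a2)                  # log-sum-exp for numerical stability
    K = m + np.log(pi_ * np.exp(a1 - m) + (1.0 - pi_) * np.exp(a2 - m))
    w1 = pi_ * np.exp(a1 - m)
    w2 = (1.0 - pi_) * np.exp(a2 - m)
    sbar = (w1 * s1 + w2 * s2) / (w1 + w2)   # mixture-weighted variance
    return theta @ t - K, t - sbar * (A @ theta)

def glrt(t, steps=2000, mu=1e-4):
    # Simple gradient ascent; any convex solver would do.
    theta = np.zeros_like(t)
    val, g = objective_and_grad(theta, t)
    for _ in range(steps):
        theta += mu * g
        val, g = objective_and_grad(theta, t)
    return val, theta

t = np.array([0.02, -0.01])          # small t, so (6.16)/(6.17) should apply
val, theta_hat = glrt(t)
sbar0 = pi_ * s1 + (1.0 - pi_) * s2
approx = t @ np.linalg.solve(A, t) / (2.0 * sbar0)   # statistic (6.17)
print(val, approx)
```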
Note that the noise samples in (6.14) are uncorrelated but not independent.
We also consider the general case in which the noise can be correlated, with PDF
w ∼ π N(0, C1) + (1 − π) N(0, C2)   (6.19)
It can be shown that for the GLRT using the constructed PDF, the test statistic is
max_θ [ θ^T t − ln( π e^{(1/2) θ^T G^T C1 G θ} + (1 − π) e^{(1/2) θ^T G^T C2 G θ} ) ]   (6.20)
and the clairvoyant GLRT statistic is
−ln( (π/det^{1/2}(C1)) exp[ −(1/2) t^T (G^T C1 G)^{-1} t ]
+ ((1 − π)/det^{1/2}(C2)) exp[ −(1/2) t^T (G^T C2 G)^{-1} t ] )   (6.21)
when q ≤ p.
6.6 Simulations
Since the GLRT using the constructed PDF coincides with the clairvoyant
GLRT under Gaussian noise as shown in subsection 6.5.1, we will only compare
the performances under non-Gaussian noise (both uncorrelated noise as in (6.14)
and correlated noise as in (6.19)).
Consider the model where
x[n] = A1 + A2 r^n + A3 cos(2πfn + φ) + w[n]   (6.22)
for n = 0, 1, . . . , N − 1 with known r and frequency f but unknown amplitudes
A1, A2, A3 and phase φ. This is a linear model as in (6.9), where
H =
[ 1    1        1                O
  1    r        cos(2πf)         sin(2πf)
  ...  ...      ...              ...
  1    r^{N−1}  cos(2πf(N−1))    sin(2πf(N−1)) ]
(the second entry of the first row of the third and fourth columns is cos(0) = 1
and sin(0) = 0, respectively) and α = [A1, A2, A3 cos φ, −A3 sin φ]^T.
Let w have an uncorrelated Gaussian mixture distribution as in (6.14). For
the partially observed linear model, we observe two sensor outputs as in (6.10).
We compare the GLRT in (6.15) with the clairvoyant GLRT in (6.18). Note that
the MLE of θ in (6.15) is found numerically, not by the asymptotic approximation
in (6.16). In the simulation, we use N = 20, A1 = 2, A2 = 3, A3 = 4, φ = π/4,
r = 0.95, f = 0.34, π = 0.9, σ1^2 = 50, σ2^2 = 500, and H1 and H2 are the first
and third columns of H respectively, i.e., H1 = [1, 1, . . . , 1]^T and
H2 = [1, cos(2πf), . . . , cos(2πf(N − 1))]^T. As shown in Figure 6.2, the
performances are almost the same, which justifies their equivalence under the
small-signal assumption shown in Section 6.5.
Next, for the same model in (6.22), let w have a correlated Gaussian mixture
distribution as in (6.19). We compare the performances of the GLRT using the
constructed PDF as in (6.20) and the clairvoyant GLRT as in (6.21). We use N = 20,
Figure 6.2. ROC curves (probability of detection versus probability of false alarm)
for the GLRT using the constructed PDF and the clairvoyant GLRT with
uncorrelated Gaussian mixture noise.
A1 = 3, A2 = 4, A3 = 3, φ = π/7, r = 0.9, f = 0.46, π = 0.7, H1 = [1, 1, . . . , 1]^T,
H2 = [1, cos(2πf), . . . , cos(2πf(N − 1))]^T. The covariance matrices C1, C2 are
generated as C1 = R1^T R1 and C2 = R2^T R2, where R1, R2 are full-rank N × N
matrices. As shown in Figure 6.3, the performances are still very similar.
Figure 6.3. ROC curves (probability of detection versus probability of false alarm)
for the GLRT using the constructed PDF and the clairvoyant GLRT with
correlated Gaussian mixture noise.
6.7 Conclusions
A novel method of combining sensor outputs for detection based on the
exponential family has been proposed. It does not require the joint PDF under
H1. The constructed PDF has been shown to be optimal in KL divergence. The
GLRT statistic based on this method has been shown to be equivalent to the
clairvoyant GLRT statistic for the partially observed linear model with both
Gaussian and non-Gaussian noise. The equivalence is also confirmed in simulations.
List of References
[1] S. Thomopoulos, R. Viswanathan, and D. Bougoulias, "Optimal distributed decision fusion," IEEE Trans. Aerosp. Electron. Syst., vol. 25, pp. 761–765, Sep. 1989.
[2] Z. Chair and P. Varshney, "Optimal data fusion in multiple sensor detection systems," IEEE Trans. Aerosp. Electron. Syst., vol. 22, pp. 98–101, Jan. 1986.
[3] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, pp. 226–239, Mar. 1998.
[4] A. Sundaresan, P. Varshney, and N. Rao, "Distributed detection of a nuclear radioactive source using fusion of correlated decisions," in Proc. 10th International Conference on Information Fusion, 2007, pp. 1–7.
[5] S. Iyengar, P. Varshney, and T. Damarla, "A parametric copula based framework for multimodal signal processing," in ICASSP, 2009, pp. 1893–1896.
[6] S. Kay and Q. Ding, "Exponentially embedded families for multimodal sensor processing," in ICASSP, Mar. 2010, pp. 3770–3773.
[7] S. Kay, A. Nuttall, and P. Baggenstoss, "Multidimensional probability density function approximations for detection, classification, and model order selection," IEEE Trans. Signal Process., vol. 49, pp. 2240–2252, Oct. 2001.
[8] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.
[9] L. Brown, Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, 1986.
[10] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[11] D. Luenberger, Linear and Nonlinear Programming, 2nd ed. Springer, 2003.
[12] S. Kullback, Information Theory and Statistics, 2nd ed. Courier Dover Publications, 1997.
[13] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed. John Wiley and Sons, 2006.
[14] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333–345, Jan. 2005.
MANUSCRIPT 7
Sensor Integration for Classification
Abstract
In the problem of sensor integration, an important issue is estimating the
joint PDF of the sensor measurements. In practice, however, we may not have
enough training data to obtain a good estimate. In this paper, we construct
the joint PDF using an exponential family for classification. This method only
requires the PDF under a reference hypothesis. Its performance is shown to be
as good as that of the estimated maximum a posteriori probability classifier,
which requires more information. Because less information is needed than with
existing methods, our method is widely applicable to classification problems.
7.1 Introduction
Distributed detection/classification systems have been widely used in many
applications such as radar, sonar, wireless sensor networks, and medical diagnosis.
Since multiple sensors will collect more information than a single sensor does, a
better decision is expected to be made. In classification, it is well known that
the maximum a posteriori probability (MAP) classifier minimizes the probability
of error [1]. However, the MAP rule requires the complete knowledge of the joint
probability density functions (PDFs) of the measurements from sensors under each
hypothesis, which in practice may not be available. Hence, it is important in sensor
integration to find appropriate estimates of the joint PDFs under each hypothesis,
and the estimates should contain all the available information.
In many works, people assume that the marginal PDFs of the measurements
from each sensor are known. One commonly used method is to simply assume
that the measurements are independent, and the joint PDF is just the product of
the marginal PDFs [2], [3]. This is equivalent to the product rule in combining
classifiers, and it is a severe rule as shown in [4]. Another concern is that the
correlation among the measurements is neglected by assuming independence. So
some approaches that consider the dependence among the measurements have
been proposed. A copula based method that estimates the joint PDF from the
marginal PDFs is used in [5], [6]. The exponentially embedded families (EEFs),
which asymptotically minimize the Kullback-Leibler (KL) divergence between the
true PDF and the estimated PDF, are proposed in [7].
Note that the marginal PDFs are required in the above mentioned approaches.
However, we may not even have enough training data in practice to have an ac-
curate estimate of the marginal PDFs, especially when the sensor outputs have
high dimensions. In this paper, we construct the joint PDF using an exponential
family. The construction only requires a reference PDF and it incorporates all the
available information. It can be shown that the constructed PDF is asymptotically
the optimal one in the sense that it is asymptotically closest to the true PDF in
KL divergence.
By maximizing the constructed PDF over the signal parameters, our classifier
can be easily derived. The performance of our method is compared to that of the
estimated MAP classifier, which assumes that the true joint PDF is known except
for the unknown parameters. We present an example in which their performances
appear to be the same. Note that our method assumes less information than the
estimated MAP classifier does. This shows that our method has many applications
for distributed systems in practice.
The paper is organized as follows. In Section 7.2, we introduce a distributed
classification problem. In Section 7.3, we construct the joint PDF by an expo-
nential family and apply it to the classification problem. An example is given in
Section 7.4. In Section 7.5, the performances of our method and the estimated
MAP classifier are compared via simulation. Conclusions are drawn in Section 7.6.
7.2 Problem Statement
Consider the classification problem where we have two distributed sensors
whose outputs T1(x) and T2(x) are transformations of the underlying samples x
that are unobservable. We need to decide from among M candidate hypotheses
Hi for i = 1, 2, . . . , M . Assume that there is a reference hypothesis H0 (usually it
is the hypothesis with noise only) and we have enough training data T1n(x)’s and
T2n(x)’s under H0 to accurately estimate the joint PDF of T1 and T2 under H0
[8]. We assume that pT1,T2(t1, t2;H0) is completely known. However, under Hi
(i = 1, 2, . . . , M) when a signal is present, we may not even have enough training
samples to accurately estimate the marginal PDFs under Hi. This is especially
the case in the radar scenario, where the target is present for only a small portion
of the time. Hence, we want to construct appropriate joint PDFs under each Hi
with as much of the available information as possible, and make a classification using the
constructed PDFs. A simple illustration is shown in Figure 7.1. Note that the
result in this paper can be easily extended to the general multiple-sensor case.
7.3 Joint PDF Construction and Its Application in Classification
Since pT1,T2(t1, t2;H0) is the only information available, in order to specify
the joint PDF pT1,T2(t1, t2;Hi), we need the following assumptions [9].
1) The signal is small under each Hi and hence pT1,T2(t1, t2;Hi) is close to
pT1,T2(t1, t2;H0).
2) Under each Hi, the joint PDF can be parameterized by some signal param-
eters θi so that
pT1,T2(t1, t2;Hi) = pT1,T2(t1, t2; θi)
Figure 7.1. Distributed classification system with two sensors.
and pT1,T2(t1, t2;H0) = pT1,T2(t1, t2;0)
Hence the classification problem is to choose from
Hi : θ = θi for i = 1, . . . , M
Let
T = [T1^T, T2^T]^T
so that the joint PDF pT1,T2(t1, t2; θi) can be written as pT(t; θi). As shown in
[9], with a first-order Taylor expansion of the log-likelihood function under each
Hi, we can construct the PDF of T under Hi as
pT(t; θi) = exp[ θi^T t − K(θi) + ln pT(t; 0) ]   (7.1)
where
K(θi) = ln E0[ exp( θi^T T ) ]   (7.2)
is the cumulant generating function of pT(t;0), and it normalizes the PDF to
integrate to 1. Note that it is assumed that pT(t;0) is available or it can be
estimated with reasonable accuracy.
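A one-dimensional sanity check of (7.1) and (7.2) can be run numerically; the sketch below assumes a standard normal reference (so K(θ) = θ^2/2 analytically), with an illustrative grid and θ value, and confirms that K(θ) normalizes the constructed PDF to integrate to one.

```python
import numpy as np

# Scalar check of (7.1)-(7.2): with a standard normal reference pT(t;0),
# K(theta) = theta^2/2 and the exponentially tilted PDF integrates to one.
t = np.linspace(-10.0, 10.0, 200001)
dt = t[1] - t[0]
p0 = np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

theta = 1.3
K = np.log(np.sum(np.exp(theta * t) * p0) * dt)   # numeric E0[exp(theta*T)]
p_theta = np.exp(theta * t - K) * p0              # constructed PDF (7.1)
print(K, np.sum(p_theta) * dt)                    # K ~ theta^2/2, integral ~ 1
```

Here the tilted PDF is simply N(θ, 1), the classical exponential tilting of a Gaussian.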
In order to estimate the unknown parameters θi in pT(t; θi), we use the
maximum likelihood estimate (MLE) [10]. We see that the constructed PDF in
(7.1) is in the form of an exponential family, which has the following nice
properties:
1. T is a sufficient statistic for the constructed PDF, and hence this PDF
incorporates all the sensor information.
2. K(θi) is convex by Hölder's inequality [11]. Since maximizing pT(t; θi) is
equivalent to maximizing θi^T t − K(θi), this becomes a convex optimization
problem and many existing methods can be readily utilized [12], [13].
3. It can be shown that by maximizing pT(t; θi) over θi, the resulting PDF is
asymptotically the closest to the true PDF pT(t; Hi) in KL divergence [9]. Similar
arguments have been given in [7, 14].
For classification, if we assume equal prior probabilities for the hypotheses,
i.e., p(H1) = p(H2) = · · · = p(HM), the MAP rule reduces to the maximum
likelihood (ML) rule [1]. When the MLE θ̂i is found by maximizing pT(t; θi)
over θi, we take pT(t; θ̂i) as our estimate of pT(t; Hi). Hence, similar to the
ML rule, we decide Hi for which the following is maximum over i:
pT(t; θ̂i)   (7.3)
By the monotonicity of the log function, we can equivalently decide Hi for which
the following is maximum over i:
ln[pT(t; θ̂i)/pT(t; 0)] = θ̂i^T t − K(θ̂i)   (7.4)
We will compare the performance of our classifier to that of the estimated
MAP classifier. The estimated MAP classifier assumes that the PDF of T under
Hi is known except for some unknown underlying parameters αi. We still assume
that p(H1) = p(H2) = · · · = p(HM). So the estimated MAP classifier finds the
MLE of αi and chooses Hi for which the following is maximum over i:
pT(t; α̂i)   (7.5)
where α̂i is the MLE of αi. Note that for the estimated MAP classifier, the αi are
the unknown parameters in the true PDF under Hi, while the θi are the unknown
parameters in the constructed PDF under Hi. Since the constructed PDF may or
may not be the true PDF, the estimated MAP classifier assumes more information
than our classifier.
7.4 A Linear Model Example
Consider the following classification model:
Hi : x = Ai si + w   (7.6)
where si is an N × 1 known signal vector with the same length as x, Ai is the
unknown signal amplitude, and w is white Gaussian noise with known variance
σ^2. Assume that instead of observing x, we can only observe the measurements of
two sensors
T1 = H1^T x
T2 = H2^T x   (7.7)
where H1 is N × p1 and H2 is N × p2. Here p1 and p2 are the lengths of the vectors
T1 and T2, respectively. We can write (7.7) as
T = G^T x   (7.8)
by letting
T = [T1^T, T2^T]^T
and
G = [H1 H2]
where G is N × (p1 + p2) with p1 + p2 ≤ N. We assume that G has full column
rank so that there are no redundant sensor measurements. Note that G
can be any matrix with full column rank.
Let H0 be the reference hypothesis when there is noise only, i.e.,
H0 : x = w   (7.9)
Since x is Gaussian under H0, according to (7.8), T is also Gaussian and
T ∼ N(0, σ^2 G^T G)
under H0. We construct the PDF under Hi as in (7.1) with
K(θi) = ln E0[ exp( θi^T T ) ] = (σ^2/2) θi^T G^T G θi   (7.10)
Hence the constructed PDF is
pT(t; θi) = exp[ θi^T t − K(θi) + ln pT(t; 0) ]
= 1/( (2πσ^2)^{(p1+p2)/2} det^{1/2}(G^T G) ) exp( −t^T (G^T G)^{-1} t / (2σ^2) )
· exp[ θi^T t − (σ^2/2) θi^T G^T G θi ]   (7.11)
which can be simplified as
T ∼ N(σ^2 G^T G θi, σ^2 G^T G)   under Hi   (7.12)
The next step is to find the MLE of θi. Note that the MLE of θi is found by
maximizing θi^T t − K(θi) over θi. If this optimization were carried out without
any constraint, then θ̂i would be the same for all i. Hence we need an implicit
constraint in finding the MLE. Since θi represents the signal under Hi, we should
have
θi = Ai G^T si = E_{Hi}(T)   (7.13)
which is the mean of T under Hi. As a result, (7.12) can be written as
T ∼ N(σ^2 Ai G^T G G^T si, σ^2 G^T G)   under Hi   (7.14)
Thus, instead of finding the MLE of θi by maximizing
θi^T t − K(θi) = θi^T t − (σ^2/2) θi^T G^T G θi   (7.15)
with the constraint in (7.13), we can find the MLE of Ai in (7.14) and then plug
it into (7.13). It can be found that
Âi = si^T G t / ( σ^2 si^T G G^T G G^T si )   (7.16)
and
θ̂i = G^T si si^T G t / ( σ^2 si^T G G^T G G^T si )   (7.17)
Hence, after removing constant factors, the test statistic of our classifier for Hi is
(si^T G t)^2 / ( (G^T si)^T G^T G (G^T si) )   (7.18)
Next we consider the estimated MAP classifier. In this case, we assume that
we know
T ∼ N(Ai G^T si, σ^2 G^T G)   under Hi   (7.19)
Note that (7.19) is the true PDF of T under Hi, while (7.14) is the constructed
PDF. It can be found that the MLE of Ai in the true PDF under Hi is
Âi = si^T G (G^T G)^{-1} t / ( si^T G (G^T G)^{-1} G^T si )   (7.20)
After removing constant terms, the test statistic of the estimated MAP classifier
for Hi is
( si^T G (G^T G)^{-1} t )^2 / ( (G^T si)^T (G^T G)^{-1} (G^T si) )   (7.21)
Note that (7.16) and (7.20) are different because (7.16) is the MLE of Ai under
the constructed PDF and (7.20) is the MLE of Ai under the true PDF.
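A matching numerical check for (7.20) and (7.21): the maximized true log-likelihood ratio under (7.19) equals (7.21) divided by 2σ^2. Again, the signal and sensor matrix are illustrative toy choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def stat_map(t, G, s):
    # Estimated-MAP statistic (7.21) for the hypothesis with signal vector s.
    W = np.linalg.inv(G.T @ G)
    Gs = G.T @ s
    return (Gs @ W @ t) ** 2 / (Gs @ W @ Gs)

N, sigma2 = 20, 0.5
n = np.arange(N)
s = np.cos(2 * np.pi * 0.28 * n) + 0.5
G = np.column_stack([np.ones(N), np.cos(2 * np.pi * 0.17 * n)])

x = 1.0 * s + np.sqrt(sigma2) * rng.standard_normal(N)
t = G.T @ x

# Consistency check: the MLE (7.20) maximizes the true log-likelihood ratio
# under (7.19), and the maximized value equals (7.21) / (2 sigma^2).
W = np.linalg.inv(G.T @ G)
Gs = G.T @ s
A_hat = (Gs @ W @ t) / (Gs @ W @ Gs)                  # MLE (7.20)
r = t - A_hat * Gs
llr = (t @ W @ t - r @ W @ r) / (2.0 * sigma2)
print(stat_map(t, G, s), 2.0 * sigma2 * llr)          # equal
```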
7.5 Simulation Results
For the model in (7.6)
Hi : x = Aisi + w
let A1 = 0.5, A2 = 1, A3 = 1 and
s1(n) = cos(2πf1n) + 1
s2(n) = cos(2πf2n) + 0.5
s3(n) = cos(2πf3n)
where n = 0, 1, . . . , N − 1 with N = 20, and f1 = 0.17, f2 = 0.28, f3 = 0.45.
Let p(H1) = p(H2) = p(H3) = 1/3. Assume that there are three sensors (this is
an extension of the two sensor assumption), each with an observation matrix as
follows, respectively:
H1 = [1 1 · · · 1]^T
H2 = [1 cos(2πf1) · · · cos(2πf1(N − 1));
      1 cos(2πf2) · · · cos(2πf2(N − 1))]^T
H3 = [1 cos(2π(f3 + 0.02)) · · · cos(2π(f3 + 0.02)(N − 1))]^T
Note that in H3, we set the frequency to f3 + 0.02. This is the case when the
knowledge of the frequency is not accurate.
The test statistics in (7.18) and (7.21) are used for the two methods,
respectively. The probabilities of correct classification are plotted versus
ln(1/σ^2) in Figure 7.2. We see that their performances appear to be the same,
and the probability of correct classification goes to 1 as σ^2 → 0.
7.6 Conclusion
A novel method of constructing the joint PDF of sensor outputs for classifica-
tion has been proposed. Only a reference PDF is needed in the construction. The
Figure 7.2. Probability of correct classification versus ln(1/σ^2) for the estimated
MAP classifier and our method.
constructed PDF is asymptotically the closest to the true PDF in KL divergence,
and hence it is asymptotically optimal. When applied to distributed classification,
its performance is shown to be as good as that of the estimated MAP classifier,
which assumes more information than our classifier.
List of References
[1] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.
[2] S. Thomopoulos, R. Viswanathan, and D. Bougoulias, "Optimal distributed decision fusion," IEEE Trans. Aerosp. Electron. Syst., vol. 25, pp. 761–765, Sep. 1989.
[3] Z. Chair and P. Varshney, "Optimal data fusion in multiple sensor detection systems," IEEE Trans. Aerosp. Electron. Syst., vol. 22, pp. 98–101, Jan. 1986.
[4] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, pp. 226–239, Mar. 1998.
[5] A. Sundaresan, P. Varshney, and N. Rao, "Distributed detection of a nuclear radioactive source using fusion of correlated decisions," in Proc. 10th International Conference on Information Fusion, 2007, pp. 1–7.
[6] S. Iyengar, P. Varshney, and T. Damarla, "A parametric copula based framework for multimodal signal processing," in ICASSP, 2009, pp. 1893–1896.
[7] S. Kay and Q. Ding, "Exponentially embedded families for multimodal sensor processing," in ICASSP, Mar. 2010, pp. 3770–3773.
[8] S. Kay, A. Nuttall, and P. Baggenstoss, "Multidimensional probability density function approximations for detection, classification, and model order selection," IEEE Trans. Signal Process., vol. 49, pp. 2240–2252, Oct. 2001.
[9] S. Kay, Q. Ding, and D. Emge, "Joint pdf construction for sensor fusion and distributed detection," in International Conference on Information Fusion, Jun. 2010.
[10] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[11] L. Brown, Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, 1986.
[12] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[13] D. Luenberger, Linear and Nonlinear Programming, 2nd ed. Springer, 2003.
[14] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333–345, Jan. 2005.
BIBLIOGRAPHY
Akaike, H., “Information theory and an extension of the likelihood principle,” inProceedings of the Second International Symposium of Information Theory,1973.
Akaike, H., “A new look at the statistical model identification,” IEEE Trans.Autom. Control, vol. 19, pp. 716–723, Dec. 1974.
Alam, M., Nazrul Islam, M., Bal, A., and Karim, M., “Hyperspectral target de-tection using gaussian filter and post-processing,” Optics and Lasers in Engi-neering, vol. 46, pp. 817–822, Nov. 2008.
Bickel, P. and Doksum, K., Mathematical Statistics: Basic Ideas and SelectedTopics. Pearson Prentice Hall, 2006, vol. 1.
Bowyer, D., Rajasekaran, P., and Gebhart, W., “Adaptive clutter filtering usingautoregressive spectral estimation,” IEEE Trans. Aerosp. Electron. Syst., pp.538–546, Jul. 1979.
Boyd, S. and L.Vandenberghe, Convex Optimization. Cambridge University Press,2004.
Brown, L., Fundamentals of Statistical Exponential Families. Institute of Math-ematical Statistics, 1986.
Chair, Z. and Varshney, P., “Optimal data fusion in multiple sensor detectionsystems,” IEEE Trans. Aerosp. Electron. Syst., vol. 22, pp. 98–101, Jan. 1986.
Chung, P.-J., “ML estimation under misspecified number of signals,” in the 39thAsilomar Conference on Signals, Systems, and Computers, Nov. 2005.
Chung, P.-J., “Stochastic maximum likelihood estimation under misspecified num-bers of signals,” IEEE Trans. Signal Process., vol. 55, pp. 4726–4731, Sep.2007.
Chyba, T., Higdon, N., Armstrong, W., Lobb, C., Ponsardin, P., Richter, D.,Kelly, B., Bui, Q., Babnick, R., Boysworth, M., Sedlacek, A., and Christesen,S., “Field tests of the laser interrogation of surface agents (lisa) system foron-the-move standoff sensing of chemical agents,” in Proc. Int. Symp. SpectralSensing Research, 2003.
Cover, T. and Thomas, J., Elements of Information Theory, 2nd ed. John Wileyand Sons, 2006.
152
Eriksson, K., Estep, D., and Johnson, C., Applied Mathematics, Body and Soul:Calculus in Several Dimensions. Springer, 2004.
Fisher, R., “On the mathematical foundations of theoretical statistics,” Philos.Trans. Royal Soc. London, vol. 222, no. 594-604, pp. 309–368, Jan. 1922.
Frost, R., Henry, D., and Erickson, K., “Raman spectroscopic detection of wyartitein the presence of rabejacite,” Journal of Raman Spectroscopy, vol. 35, pp.255–260, 2004.
Grimmett, G. and Stirzaker, D., Probability and Random Processes, 3rd ed. Ox-ford University Press, 2001.
Hayazawa, N., Motohashi, M., Saito, Y., and Kawata, S., “Highly sensitive straindetection in strained silicon by surface-enhanced raman spectroscopy,” AppliedPhysics Letters, vol. 86, pp. 263 114 – 263 114–3, 2005.
Higgins, J., “Some surface integral techniques in statistics,” The American Statis-tician, vol. 29, pp. 43–46, Feb. 1975.
Iyengar, S., Varshney, P., and Damarla, T., “A parametric copula based frameworkfor multimodal signal processing,” in ICASSP, 2009, pp. 1893–1896.
Kass, R. and Vos, P., Geometrical Foundations of Asymptotic Inference. Wiley,1997.
Kay, S., Modern Spectral Estimation: Theory and Application. Englewood Cliffs,NJ: Prentice-Hall, 1988.
Kay, S., Fundamentals of Statistical Signal Processing: Estimation Theory. En-glewood Cliffs, NJ: Prentice-Hall, 1993.
Kay, S., Fundamentals of Statistical Signal Processing: Detection Theory. Engle-wood Cliffs, NJ: Prentice-Hall, 1998.
Kay, S., “Model based probability density function estimation,” IEEE Signal Pro-cess. Lett., pp. 318–320, Dec. 1998.
Kay, S., “Exponentially embedded families - new approaches to model order es-timation,” IEEE Trans. Aerosp. Electron. Syst., vol. 41, pp. 333–345, Jan.2005.
Kay, S., “Asymptotically optimal approximation of multidimensional pdf’s bylower dimensional pdf’s,” IEEE Trans. Signal Process., vol. 55, pp. 725–729,Feb. 2007.
Kay, S. and Ding, Q., “Exponentially embedded families for multimodal sensorprocessing,” in ICASSP, Mar. 2010, pp. 3770–3773.
153
Kay, S., Ding, Q., and Emge, D., “Joint pdf construction for sensor fusion anddistributed detection,” in International Conference on Information Fusion,Jun. 2010.
Kay, S., Ding, Q., and Rangaswamy, M., “Sensor integration for classification,” inAsilomar Conference on Signals, Systems, and Computers, Nov. 2010.
Kay, S., Nuttall, A., and Baggenstoss, P., “Multidimensional probability densityfunction approximations for detection, classification, and model order selec-tion,” IEEE Trans. Signal Process., vol. 49, pp. 2240–2252, Oct. 2001.
Kay, S. and Salisbury, J., “Improved active sonar detection using autoregressiveprewhiteners,” J. Acoustical Soc. of America, pp. 1603–1611, Apr. 1990.
Kay, S., Xu, C., and Emge, D., “Chemical detection and classification in ramanspectra,” in Proceedings of the SPIE, vol. 6969, Mar. 2008, pp. 4–12.
Kittler, J., Hatef, M., Duin, R., and Matas, J., “On combining classifiers,” IEEETrans. Pattern Anal. Mach. Intell., vol. 20, pp. 226–239, Mar. 1998.
Kneipp, K., Kneipp, H., Itzkan, I., Dasari, R., and Feld, M., “Ultrasensitive chemi-cal analysis by raman spectroscopy,” Chemical Reviews, vol. 99, p. 2957C2975,1999.
Knight, W., Pridham, R., and Kay, S., “Digital signal processing for sonar,” inProceedings of the IEEE, Nov. 1981, pp. 1451–1506.
Kullback, S., Information Theory and Statistics, 2nd ed. Courier Dover Publica-tions, 1997.
Lawson, C. and Hanson, R., Solving Least Squares Problems. SIAM, 1995.
Lehmann, E., Elements of Large-Sample Theory. Springer, 1998.
Lehmann, E. and Romano, J., Testing Statistical Hypotheses, 3rd ed. Springer,2005.
Liavas, A. and Regalia, P., “On the behavior of information theoretic criteria formodel order selection,” IEEE Trans. Signal Process., vol. 49, pp. 1689–1695,Aug. 2001.
Lilliefors, H., “On the kolmogorov-smirnov test for normality with mean and vari-ance unknown,” Journal of the American Statistical Association, vol. 62, pp.399–402, 1967.
Luenberger, D., Linear and Nonlinear Programming, 2nd ed. Springer, 2003.
154
Manolakis, D., Marden, D., and Shaw, G., "Hyperspectral image processing for automatic target detection applications," Lincoln Laboratory Journal, vol. 14, no. 1, pp. 79–116, 2003.
Pages-Zamora, A. and Lagunas, M., "New approaches in non-linear signal processing: Estimation of the probability density function by spectral estimation methods," in IEEE Workshop on Higher Order Statistics, 1995.
Pfanzagl, J. and Wefelmeyer, W., Contributions to a General Asymptotic Statistical Theory, ser. Lecture Notes in Statistics. Springer-Verlag, 1982, vol. 13.
Portnov, A., Rosenwaks, S., and Bar, I., "Detection of particles of explosives via backward coherent anti-Stokes Raman spectroscopy," Applied Physics Letters, vol. 93, pp. 041115–041115-3, 2008.
Renaux, A., Forster, P., Chaumette, E., and Larzabal, P., "On the high-SNR conditional maximum-likelihood estimator full statistical characterization," IEEE Trans. Signal Process., vol. 54, pp. 4840–4843, Dec. 2006.
Rissanen, J., "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, 1978.
Rudin, W., Principles of Mathematical Analysis, 3rd ed. McGraw-Hill, 1976.
Rudin, W., Functional Analysis. McGraw-Hill, 1991.
Scharf, L. and Friedlander, B., "Matched subspace detectors," IEEE Trans. Signal Process., vol. 42, no. 8, pp. 2146–2157, Aug. 1994.
Schwarz, G., "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
Stoica, P. and Selen, Y., "Model-order selection: A review of information criterion rules," IEEE Signal Process. Mag., vol. 21, pp. 36–47, Jul. 2004.
Sundaresan, A., Varshney, P., and Rao, N., "Distributed detection of a nuclear radioactive source using fusion of correlated decisions," in Proc. 10th International Conference on Information Fusion, 2007, pp. 1–7.
Thomopoulos, S., Viswanathan, R., and Bougoulias, D., "Optimal distributed decision fusion," IEEE Trans. Aerosp. Electron. Syst., vol. 25, pp. 761–765, Sep. 1989.
van der Vaart, A. W., Asymptotic Statistics. Cambridge University Press, 2000.
Wang, W. and Adali, T., "Constrained ICA and its application to Raman spectroscopy," in Proc. Antennas and Propagation Society International Symposium, Jul. 2005, pp. 109–112.
Wang, W., Adali, T., and Emge, D., "Unsupervised detection using canonical correlation analysis and its application to Raman spectroscopy," in Proc. IEEE Workshop on Machine Learning for Signal Processing, Aug. 2007.
Wang, W., Adali, T., and Emge, D., "Subspace partitioning for target detection and identification," IEEE Trans. Signal Process., vol. 57, no. 4, pp. 1250–1259, Apr. 2009.
Wax, M. and Kailath, T., "Detection of signals by information theoretic criteria," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, pp. 387–392, Apr. 1985.
Westover, M., "Asymptotic geometry of multiple hypothesis testing," IEEE Trans. Inf. Theory, vol. 54, no. 7, pp. 3327–3329, Jul. 2008.
White, H., "Maximum likelihood estimation of misspecified models," Econometrica, vol. 50, no. 1, pp. 1–25, Jan. 1982.
Wiley, R., ELINT: The Interception and Analysis of Radar Signals. Boston, MA: Artech House, 2006.
Xu, C. and Kay, S., "Source enumeration via the EEF criterion," IEEE Signal Process. Lett., vol. 15, pp. 569–572, 2008.
Xu, W. and Kaveh, M., "Analysis of the performance and sensitivity of eigendecomposition-based detectors," IEEE Trans. Signal Process., vol. 43, pp. 1413–1426, Jun. 1995.