Detection - Estimation
Ali Mohammad-Djafari
A Graduated Course
Department of Electrical Engineering
University of Notre Dame
Notre Dame, IN 46555, USA
Draft: July 12, 2005
File: nd1
Contents
1 Introduction 9
1.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Summary of notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Basic concepts of binary hypothesis testing 15
2.1 Binary hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Bayesian binary hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Minimax binary hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Neyman-Pearson hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . 23
3 Basic concepts of general hypothesis testing 25
3.1 A general M -ary hypothesis testing problem . . . . . . . . . . . . . . . . . . 25
3.1.1 Deterministic or probabilistic decision rules . . . . . . . . . . . . . . 26
3.1.2 Conditional, A priori and Joint Probability Distributions . . . . . . 27
3.1.3 Probabilities of false and correct detection . . . . . . . . . . . . . . . 28
3.1.4 Penalty or costs coefficients . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.5 Conditional and Bayes risks . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.6 Bayesian and non Bayesian hypothesis testing . . . . . . . . . . . . . 29
3.1.7 Admissible decision rules and stopping rule . . . . . . . . . . . . . . 29
3.1.8 Classification of simple hypothesis testing schemes . . . . . . . . . . 30
3.2 Composite hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Penalty or cost functions . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Case of binary hypothesis testing . . . . . . . . . . . . . . . . . . . . 33
3.2.3 Classification of hypothesis testing schemes for a parametrically known stochastic process . . . . . . 33
3.3 Classification of parameter estimation schemes . . . . . . . . . . . . . . . . 36
3.4 Summary of notations and abbreviations . . . . . . . . . . . . . . . . . . . . 40
4 Bayesian hypothesis testing 43
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Optimization problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.1 Radar applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Detection of a known signal in an additive noise . . . . . . . . . . . 49
4.4 Binary channel transmission . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5 Signal detection and structure of optimal detectors 55
5.1 Bayesian composite hypothesis testing . . . . . . . . . . . . . . . . . . . . . 55
5.1.1 Case of binary composite hypothesis testing . . . . . . . . . . . . . . 56
5.2 Uniformly most powerful (UMP) test . . . . . . . . . . . . . . . . . . . . . 57
5.3 Locally most powerful (LMP) test . . . . . . . . . . . . . . . . . . . . . . . 57
5.4 Maximum likelihood test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5 Examples of signal detection schemes . . . . . . . . . . . . . . . . . . . . . . 59
5.5.1 Case of Gaussian noise . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.5.2 Laplacian noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.5.3 Locally optimal detectors . . . . . . . . . . . . . . . . . . . . . . . . 61
5.6 Detection of signals with unknown parameter . . . . . . . . . . . . . . . . . 63
5.7 Sequential detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.8 Robust detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6 Elements of parameter estimation 75
6.1 Bayesian parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.1.1 Minimum-Mean-Squared-Error . . . . . . . . . . . . . . . . . . . . . 76
6.1.2 Minimum-Mean-Absolute-Error . . . . . . . . . . . . . . . . . . . . . 76
6.1.3 Maximum A Posteriori (MAP) estimation . . . . . . . . . . . . . . . 77
6.2 Other cost functions and related estimators . . . . . . . . . . . . . . . . . . 82
6.3 Examples of posterior calculation . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4 Estimation of vector parameters . . . . . . . . . . . . . . . . . . . . . . . . 84
6.4.1 Minimum-Mean-Squared-Error . . . . . . . . . . . . . . . . . . . . . 84
6.4.2 Minimum-Mean-Absolute-Error . . . . . . . . . . . . . . . . . . . . . 84
6.4.3 Marginal Maximum A Posteriori (MAP) estimation . . . . . . . . . 85
6.4.4 Maximum A Posteriori (MAP) estimation . . . . . . . . . . . . . . . 85
6.4.5 Estimation of a Gaussian vector parameter from jointly Gaussian observation . . . . . . 86
6.4.6 Case of linear models . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5.1 Curve fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7 Elements of signal estimation 101
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.2 Kalman filtering : General linear case . . . . . . . . . . . . . . . . . . . . . 103
7.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.3.1 1D case: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.3.2 Track-While-Scan (TWS) Radar . . . . . . . . . . . . . . . . . . . . 108
7.3.3 Track-While-Scan (TWS) Radar with dependent acceleration sequences . . . . . . 108
7.4 Fast Kalman filter equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.5 Kalman filter equations for signal deconvolution . . . . . . . . . . . . . . . . 112
7.5.1 AR, MA and ARMA Models . . . . . . . . . . . . . . . . . . . . . . 116
8 Some complements to Bayesian estimation 117
8.1 Choice of a prior law in the Bayesian estimation . . . . . . 117
8.1.1 Invariance principles . . . . . . 117
8.2 Conjugate priors . . . . . . 120
8.3 Non informative priors based on Fisher information . . . . . . 126
9 Linear Estimation 129
9.1 Introduction . . . . . . 129
9.2 One step prediction . . . . . . 131
9.3 Levinson algorithm . . . . . . 131
9.4 Vector observation case . . . . . . 132
9.5 Wiener-Kolmogorov filtering . . . . . . 133
9.5.1 Non causal Wiener-Kolmogorov . . . . . . 133
9.6 Causal Wiener-Kolmogorov . . . . . . 135
9.7 Rational spectra . . . . . . 139
A Annexes 143
A.1 Summary of Bayesian inference . . . . . . 144
A.2 Summary of probability distributions . . . . . . 169
B Exercises 179
B.1 Exercise 1: Signal detection and parameter estimation . . . . . . 180
B.2 Exercise 2: Discrete deconvolution . . . . . . 182
Preface
During my visit to the Electrical Engineering Department of the University of Notre Dame, I had to teach a course on Detection-Estimation.
I could not really find a complete and convenient textbook covering all the material that an electrical engineer has to know on the subject. Some books are too mathematical; others are too oriented toward applications and not rigorous enough on the mathematics.
The purpose of these notes is to fill this gap: to introduce the reader to the basic theory of hypothesis testing and estimation using the main probability and statistics tools, and also to give them the basic theory of signal detection and estimation as used in practical applications of electrical engineering.
The contents of these notes are mainly covered by two books:
– Detection and Estimation, by D. Kazakos and P. Papantoni-Kazakos, and
– An Introduction to Signal Detection and Estimation, by H. Vincent Poor.
Ali Mohammad-Djafari
Chapter 1
Introduction
Generally speaking, signal detection and estimation is the area of study that deals with information processing: conversion, transmission, observation and information extraction. The main areas of application of detection and estimation theory are radar, sonar, and analog or digital communications, but detection and estimation theory has also become a main tool in other areas such as radioastronomy, geophysics, medical imaging, biological data processing, etc.
In general, detection and estimation applications involve making inferences from observations that are distorted or corrupted in some unknown way, or that are too complicated to be modelled deterministically. Moreover, sometimes even the information that one wishes to extract from such observations is not well determined. Thus, it is very useful to cast detection and estimation problems in the framework of probability theory and statistical inference. However, using probability theory and statistical inference tools does not necessarily mean that the corresponding physical phenomena are random.
In statistical inference, the goal is not to make an immediate decision, but instead to provide a summary of the statistical evidence which future users can easily incorporate into their decision process. The task of decision making is then left to decision theory.
Signal detection is inherently a decision-making task, and in signal estimation we also often need to make decisions. So, for detection and estimation we need not only probability theory and statistical inference tools but also decision and hypothesis testing tools. The main common tool with which we have to start is then the probabilistic and stochastic description of the observations and the unknown quantities.
Once again, a probabilistic or stochastic description models the effect of causes whose origin and nature are either unknown or too complex to be described deterministically.
The simplest probabilistic model for a quantity is a scalar random variable X, which is fully described by its probability distribution F(x) = Pr {X ≤ x}. The next simplest model is a random vector X = [X1, · · · , Xn]t, where the {Xj} are random variables; a random vector is fully described by its probability distribution F(x) = Pr {X ≤ x}. The next, and most general, stochastic model for a quantity is a random function X(r), where r is a finite dimensional independent variable and where, for every fixed value r = rj, the scalar quantity Xj = X(rj) is a scalar random variable. For example, when r = (x, y) represents the spatial coordinates in a plane, then X(x, y) is called a random field, and when r = t represents the time variable, then X(t) is called a stochastic process. In the rest of these notes, we consider only this last model.
A stochastic process X(t) is completely described by the probability distribution
F(x1, · · · , xn; t1, · · · , tn) = F(x; t) = Pr {X(tj) ≤ xj , j = 1, · · · , n}
for every n and every set of time instants {tj}. The stochastic process is discrete-time if it is described only by its realizations on a countable set {tj} of time instants. Then time is counted by the indices j, and the stochastic process is fully described by the random vectors Xj = [Xj, Xj+1, · · · , Xj+n]t.
A stochastic process X(t) is said to be well known if the distribution F(x1, · · · , xn; t1, · · · , tn) is precisely known for all n, every set {tj} and every vector value x. The process is instead said to be parametrically known if there exists a finite dimensional parameter vector θ = [θ1, · · · , θm]t such that the conditional distribution F(x1, · · · , xn; t1, · · · , tn|θ) is precisely known for all n, every set {tj}, every vector value x and a fixed given value of θ. A stochastic process X(t) is nonparametrically described if there is no vector parameter θ
of finite dimensionality such that the distribution F(x, t|θ) is completely described for all values of the vector θ and for all n, t and x. As an example, a stationary, discrete-time process {Xi} where the random variables Xi have finite variances is a nonparametrically described process. In fact, this description represents a whole class of stochastic processes. If we assume now that this process is also Gaussian, then it becomes parametrically known, since only its mean and spectral density functions are needed for its full description. When these two quantities are also provided, the process becomes well known.
From now on, we have the main ingredients necessary to give the general scope of detection and estimation theory. Let us consider a case where the observed quantity is modelled by a stochastic process X(t) and the observed signal x(t) is considered as a realization of the process, i.e., an observed waveform generated by X(t).
1.1 Basic definitions
• Probability spaces:
Probability theory starts by defining an observation set Γ and a class G of subsets of it, called observation events, to which we wish to assign probabilities. The pair (Γ, G) is termed the observation space.
For analytical reasons we will always assume that the collection G is a σ-algebra; that is, G contains all complements relative to Γ and denumerable unions of its members, i.e.,

if A ∈ G then Ac ∈ G, and if A1, A2, ... ∈ G then ∪i Ai ∈ G.   (1.1)
Two special cases are of interest:
– Discrete case: Γ = {γ1, γ2, · · ·}.
In this case G is the set of all subsets of Γ, which is usually denoted by 2Γ and is called the power set of Γ. For this case, probabilities can be assigned to subsets of Γ in terms of a probability mass function, p : Γ −→ [0, 1], by

P(A) = Σ_{γi ∈ A} p(γi),  A ∈ 2Γ.   (1.2)

Any function mapping Γ to [0, 1] can be a probability mass function provided that it satisfies the normalization condition

Σ_i p(γi) = 1.   (1.3)
– Continuous case: Γ = IRn, the set of n-dimensional vectors with real components.
In this case we want to assign probabilities to the sets

{x = (x1, · · · , xn) ∈ IRn | a1 < x1 < b1, · · · , an < xn < bn}   (1.4)

where the ai's and bi's are arbitrary real numbers. So, in this case, G is the smallest σ-algebra containing all of these sets with the ai's and bi's ranging throughout the reals. This σ-algebra is usually denoted Bn and is called the class of Borel sets in IRn. In this case probabilities can be assigned in terms of a probability density function, p : IRn −→ IR+, by

P(A) = ∫_A p(x) dx,  A ∈ Bn.   (1.5)

Any integrable function mapping IRn to IR+ can be a probability density function provided that it satisfies the condition

∫_{IRn} p(x) dx = 1.   (1.6)
• For compactness, we may use the term density for both probability density functions and probability mass functions, and use the following notation when necessary:

P(A) = ∫_A p(x) µ(dx)   (1.7)

for both the summation equation (1.2) and the integration equation (1.5).
• Random variable:
X = X(ω) is a function ω 7→ IR, where ω represents elements of the probability space.

• Probability distribution:
F(x) is a function IR 7→ [0, 1] such that

F(x) = Pr {X ≤ x}.   (1.8)

• Probability density function:
f(x) is a function IR 7→ IR+ such that

f(x) = ∂F(x)/∂x,   F(x) = Pr {X ≤ x} = ∫_{−∞}^{x} f(t) dt.   (1.9)
• For a real function g of the random variable X, the expected value of g(X), denoted E [g(X)], is defined by any of the following:

E [g(X)] = Σ_i g(γi) p(γi)   (1.10)

E [g(X)] = ∫_{IR} g(x) p(x) dx   (1.11)

E [g(X)] = ∫_Γ g(x) p(x) µ(dx)   (1.12)
• Random vector, or a vector of random variables:
X = [X1, · · · , Xn], where the Xi are scalar random variables.
• Joint probability distribution:

F(x1, · · · , xn) = Pr {X1 ≤ x1, · · · , Xn ≤ xn},   F(x) = Pr {X ≤ x}   (1.13)
• Stochastic process: X(t) = X(t, ω), where X(t, ω) is a scalar random variable for allt.
• A stochastic process is completely defined by

F(x1, · · · , xn; t1, · · · , tn) = Pr {X(ti) ≤ xi, i = 1, · · · , n},   F(x; t) = Pr {X(t) ≤ x}   (1.14)
• A stochastic process is stationary if

F(x1, · · · , xn; t1, · · · , tn) = F(x1, · · · , xn; (t1 + τ), · · · , (tn + τ)), i.e., F(x; t) = F(x; t + τ1)   (1.15)
• A stochastic process is memoryless or white if

F(x1, · · · , xn; t1, · · · , tn) = Π_i F(xi; ti)   (1.16)
• Discrete time stochastic process:

F(x) = Pr {X ≤ x}   (1.17)

• Memoryless discrete time stochastic process:

F(x) = Pr {X ≤ x} = Π_{i=1}^{n} Fi(xi)   (1.18)

• Memoryless and stationary discrete time stochastic process:

F(x) = Π_{i=1}^{n} Fi(xi) and Fi(x) = Fj(x) = F(x), ∀i, j   (1.19)
• A memoryless and stationary discrete time stochastic process generates, over time, independent and identically distributed (i.i.d.) random variables.
• Well known stochastic process:A stochastic process is well known if the distribution F (x, t) is known for all n, tand x.
• Parametrically well known stochastic process:
A stochastic process is parametrically well known if there exists a finite dimensional vector parameter θ = [θ1, · · · , θm] such that the conditional distribution F(x, t|θ) is known for all n, t and x.
• Non parametric description of a stochastic process:A stochastic process X(t) is non parametrically described, if there is no vector pa-rameter θ of finite dimensionality such that the distribution F (x, t|θ) is completelydescribed for every given vector θ and for all n, t and x.
• Observed data:
samples of x(t), a realization of X(t), over some time interval [0, T].
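As a small numerical illustration of these definitions, here is a sketch of the expectation formula (1.11) using a midpoint Riemann sum. The standard normal density and g(x) = x² are illustrative choices, not taken from the notes; the exact value of E[X²] is 1 for a standard normal.

```python
import math

# Midpoint Riemann-sum approximation of E[g(X)] = integral of g(x) p(x) dx,
# i.e., formula (1.11), truncated to [lo, hi].
def expect(g, p, lo=-10.0, hi=10.0, n=20_000):
    h = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * h) * p(lo + (i + 0.5) * h)
               for i in range(n)) * h

# Standard normal density (illustrative choice of p)
p = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
print(expect(lambda x: x * x, p))   # close to 1.0
```

The same routine with g(x) = 1 recovers the normalization condition (1.6).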
1.2 Summary of notations
X                      A random variable
x                      A realization of a random variable
x = {x1, · · · , xn}   n samples (realizations) of a random variable
X(t)                   A random function or a stochastic process
x(t)                   A realization of a random function
X                      A random vector or a discrete-time stochastic process
x                      A realization of a random vector or a discrete-time stochastic process
xn = {x1, · · · , xn}  n samples (realizations) of a random vector or a discrete-time stochastic process
F(x)                   A probability distribution of a scalar random variable
f(x)                   A probability density function of a scalar random variable
Fθ(x)                  A parametric probability distribution
fθ(x)                  A parametric probability density function
F(x|θ)                 A conditional probability distribution
f(x|θ)                 A conditional probability density function
F(θ|x)                 A posterior probability distribution of θ conditioned on the observations x
f(θ|x)                 A posterior probability density function of θ conditioned on the observations x
Θ                      A random scalar parameter
θ                      A sample of Θ
Θ                      A random vector of parameters
θ                      A sample of Θ
Γ                      The space of possible values of x (scalar or vector)
T                      The space of possible values of θ (scalar or vector)
Chapter 2
Basic concepts of binary hypothesis testing
Most signal detection problems can be cast in the framework of M-ary hypothesis testing, where from some observations (data) we wish to decide among M possible situations. For example, in a communication system, the receiver observes an electric waveform that consists of one of the M possible signals corrupted by channel or receiver noise, and we wish to decide which of the M possible signals is present during the observation. Obviously, for any given decision problem, there are a number of possible decision strategies or rules that can be applied; however, we would like to choose a decision rule that is optimal in some sense. There are several classical and useful criteria of optimality for such problems. The main object of this chapter is to give all the basic definitions necessary to define these criteria and their practical significance. Before going to the general case of the M-ary hypothesis testing problem, let us start with the particular problem of binary (M = 2) hypothesis testing, which allows us to introduce the main ideas more easily.
2.1 Binary hypothesis testing
The primary problem that we consider as an introduction is the simple hypothesis testing problem, in which we assume that the observed data come from one of only two possible processes with well known probability distributions P0 and P1:

H0 : X ∼ P0
H1 : X ∼ P1   (2.1)
where “X ∼ P” denotes “X has distribution P” or “the data come from a stochastic process whose distribution is P”. The hypotheses H0 and H1 are respectively referred to as the null and alternative hypotheses. A decision rule δ for H0 versus H1 is any partition of the observation space Γ into Γ1 and Γ0 = Γ1c such that we choose H1 when x ∈ Γ1 and H0 when x ∈ Γ0. The sets Γ1 and Γ0 are respectively called the rejection and acceptance regions. So, we can think of the decision rule δ as a function on Γ such that

δ(x) = 1 if x ∈ Γ1, and δ(x) = 0 if x ∈ Γ0 = Γ1c,   (2.2)
so that the value of δ for a given x is the index of the hypothesis accepted by the decision rule δ.
We can also think of the decision rule δ(x) as a probability distribution {δ0, δ1} on the space D of all possible decisions, where δj is the probability of deciding Hj in the light of the data x. In both cases δ0 + δ1 = 1.
We would like to choose H0 or H1 in some optimal way and, with this in mind, we may assign costs to our decisions. In particular, we may assign a cost cij ≥ 0 to pay if we make the decision Hi while the true hypothesis was Hj. With the partition Γ = {Γ0, Γ1} of the observation set, we can then define the conditional probabilities

Pij = Pr {X = x ∈ Γi | H = Hj} = Pj(Γi) = ∫_{Γi} pj(x) dx   (2.3)
and then the average or expected conditional risks Rj(δ) for each hypothesis as
Rj(δ) =1∑
i=0
cijPij = c1jPj(Γ1) + c0jPj(Γ0), j = 0, 1 (2.4)
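The bookkeeping in (2.3)-(2.4) can be made concrete on a toy discrete observation set; all numbers below (the two pmfs, the partition, the uniform costs) are illustrative choices, not taken from the notes.

```python
# Toy check of the conditional probabilities (2.3) and conditional
# risks (2.4) on a 3-point observation set with a fixed partition.
p0 = {0: 0.7, 1: 0.2, 2: 0.1}     # pmf under H0
p1 = {0: 0.1, 1: 0.3, 2: 0.6}     # pmf under H1
Gamma1 = {1, 2}                    # decide H1 here; Gamma0 is the complement
c = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}   # uniform costs c[i, j]

P = {}                             # P[i, j] = Pr{x in Gamma_i | H_j}
for j, pj in enumerate((p0, p1)):
    P[1, j] = sum(w for x, w in pj.items() if x in Gamma1)
    P[0, j] = 1.0 - P[1, j]

R = [c[0, j] * P[0, j] + c[1, j] * P[1, j] for j in (0, 1)]   # eq. (2.4)
print(R)   # with uniform costs, R0 = P10 and R1 = P01
```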
2.2 Bayesian binary hypothesis testing
Now assume that we can assign prior probabilities π0 and π1 = 1 − π0 to the hypotheses H0 and H1, either to translate our preferences or to translate our prior knowledge about these hypotheses. Note that πj is the probability that Hj is true, unconditional of (or independent of) the observed data x of X. This is why they are called prior or a priori probabilities. For given priors {π0, π1} we can define the posterior or a posteriori probabilities

πj(x) = Pr {H = Hj | X = x} = pj(x) πj / m(x)   (2.5)

where

m(x) = Σ_j pj(x) πj   (2.6)

is the overall density of X. We can also define an average or Bayes risk r(δ) as the overall average cost incurred by the decision rule δ:

r(δ) = Σ_j πj Rj(δ).   (2.7)
We may now use this quantity to define an optimum decision rule as the one that minimizes the Bayes risk over all decision rules. Such a decision rule is known as a Bayes decision rule.
To go a little further into the details, let us combine (2.4) and (2.7) to give

r(δ) = Σ_j πj Rj(δ) = Σ_j πj Σ_i cij Pj(Γi)
     = Σ_j [πj c0j Pj(Γ0) + πj c1j Pj(Γ1)]
     = Σ_j [πj c0j (1 − Pj(Γ1)) + πj c1j Pj(Γ1)]
     = Σ_j πj c0j + Σ_j πj (c1j − c0j) Pj(Γ1)
     = Σ_j πj c0j + ∫_{Γ1} Σ_j πj (c1j − c0j) pj(x) dx   (2.8)
Thus, we see that r(δ) is minimized over all Γ1 if we choose

Γ1 = { x ∈ Γ | Σ_j (c1j − c0j) πj pj(x) ≤ 0 }   (2.9)
   = { x ∈ Γ | (c11 − c01) π1 p1(x) ≤ (c00 − c10) π0 p0(x) }   (2.10)
In general, cjj < cij for i ≠ j, which means that the cost of deciding correctly is less than the cost of deciding incorrectly. Then, (2.10) can be written

Γ1 = { x ∈ Γ | π1 p1(x) / (π0 p0(x)) ≥ τ1 = (c10 − c00)/(c01 − c11) }   (2.11)
   = { x ∈ Γ | p1(x)/p0(x) ≥ τ2 = (π0/π1)(c10 − c00)/(c01 − c11) }   (2.12)
   = { x ∈ Γ | c10 π0(x) + c11 π1(x) ≤ c00 π0(x) + c01 π1(x) }   (2.13)
This decision rule is known as a likelihood-ratio test or a posterior probability ratio test, due to the fact that L(x) = p1(x)/p0(x) is the ratio of the likelihoods and π1(x)/π0(x) = π1 p1(x)/(π0 p0(x)) is the ratio of the posterior probabilities.
Note also that the quantity ci0 π0(x) + ci1 π1(x) is the average cost incurred by choosing the hypothesis Hi given that X = x. This quantity is called the posterior cost of choosing Hi given the observation X = x. Thus, the Bayes rule makes its decision by choosing the hypothesis that yields the minimum posterior cost.
This test plays a central role in the theory of hypothesis testing. It computes the likelihood ratio L(x) for a given observed value X = x and then makes its decision by comparing this ratio to the threshold τ2, i.e.,

δ(x) = 1 if L(x) ≥ τ2, and δ(x) = 0 if L(x) < τ2.   (2.14)
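The optimality of the region (2.9)-(2.10) can be sanity-checked by brute force on a small discrete observation set, comparing the Bayes region against every possible partition. All numbers (pmfs, priors, costs) are illustrative choices, not taken from the notes.

```python
from itertools import product

# Brute-force check of the Bayes region (2.9)-(2.10) on a 3-point set.
xs = [0, 1, 2]
p0 = {0: 0.7, 1: 0.2, 2: 0.1}           # density under H0
p1 = {0: 0.1, 1: 0.3, 2: 0.6}           # density under H1
pi0, pi1 = 0.6, 0.4                      # priors
c = {(0, 0): 0.0, (0, 1): 2.0, (1, 0): 1.0, (1, 1): 0.0}   # costs c[i, j]

def bayes_risk(G1):
    """r(delta) of the rule 'decide H1 iff x in G1', via (2.4) and (2.7)."""
    r = 0.0
    for j, (pi, pj) in enumerate(((pi0, p0), (pi1, p1))):
        for x in xs:
            i = 1 if x in G1 else 0
            r += pi * c[i, j] * pj[x]
    return r

# Region (2.10): (c11 - c01) pi1 p1(x) <= (c00 - c10) pi0 p0(x)
bayes = {x for x in xs
         if (c[1, 1] - c[0, 1]) * pi1 * p1[x] <= (c[0, 0] - c[1, 0]) * pi0 * p0[x]}

# Exhaustive search over all 2^3 candidate regions
best = min((frozenset(x for x, b in zip(xs, bits) if b)
            for bits in product([0, 1], repeat=len(xs))), key=bayes_risk)
print(sorted(bayes), bayes_risk(bayes))  # the Bayes region attains the minimum
```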
A commonly used cost assignment is the uniform cost assignment given by

cij = 0 if i = j, and cij = 1 if i ≠ j.   (2.15)
The Bayes risk r(δ) then becomes

r(δ) = π0 P0(Γ1) + π1 P1(Γ0) = π0 P10 + π1 P01.   (2.16)

Noting that Pj(Γi) is the probability of choosing Hi when Hj is true, r(δ) in this case becomes the average probability of error incurred by the rule δ. This decision rule is then a minimum probability of error decision scheme.
Note also that with the uniform cost coefficients (2.15) the decision rule can be rewritten as

δ(x) = 1 if π1(x) ≥ π0(x), and δ(x) = 0 if π1(x) < π0(x).   (2.17)

This test is called the maximum a posteriori (MAP) decision scheme.
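A minimal sketch of this equivalence: with known densities, priors (π0, π1) and uniform costs, the likelihood-ratio test (2.14) and the MAP rule (2.17) make the same decision at every x. The Gaussian densities and the prior values below are illustrative choices.

```python
import math

def lrt(x, p0, p1, pi0, pi1):
    tau = pi0 / pi1                       # uniform-cost threshold from (2.12)
    return 1 if p1(x) / p0(x) >= tau else 0

def map_rule(x, p0, p1, pi0, pi1):
    # Compare (unnormalized) posteriors pi_j * p_j(x), as in (2.17)
    return 1 if pi1 * p1(x) >= pi0 * p0(x) else 0

def gauss(m, s):
    return lambda x: math.exp(-(x - m)**2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

p0, p1 = gauss(0.0, 1.0), gauss(1.0, 1.0)   # illustrative densities
for x in [-2.0, -0.5, 0.0, 0.3, 1.0, 2.5]:
    assert lrt(x, p0, p1, 0.4, 0.6) == map_rule(x, p0, p1, 0.4, 0.6)
```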
Example: Detection of a constant signal in Gaussian noise.
Let us consider

H0 : X = ε
H1 : X = µ + ε   (2.18)

where ε is a Gaussian random variable with zero mean and known variance σ², and where µ > 0 is a known constant. In terms of distributions we can rewrite these hypotheses as

H0 : X ∼ N(0, σ²)
H1 : X ∼ N(µ, σ²)   (2.19)

where X ∼ N(µ, σ²) means that X has the density

p(x) = (2πσ²)^{−1/2} exp[ −(x − µ)² / (2σ²) ].   (2.20)
It is then easy to calculate the likelihood ratio L(x):

L(x) = p1(x)/p0(x) = exp[ (µ/σ²)(x − µ/2) ].   (2.21)
Thus, the Bayes test for these hypotheses becomes

δ(x) = 1 if L(x) ≥ τ, and δ(x) = 0 if L(x) < τ,   (2.22)

where τ is an appropriate threshold. We can remark that L(x) is a strictly increasing function of x, so comparing L(x) to a threshold τ is equivalent to comparing x itself to another threshold τ′ = L^{−1}(τ) = (σ²/µ) log(τ) + µ/2:

δ(x) = 1 if x ≥ τ′, and δ(x) = 0 if x < τ′,   (2.23)

where L^{−1} is the inverse function of L.
Figure 2.1: Location testing with Gaussian errors, uniform costs and equal priors: the densities N(0, σ²) and N(µ, σ²) with decision regions Γ0 = {x < µ/2} and Γ1 = {x ≥ µ/2}.
In the special case of uniform costs and equal priors, we have τ = 1 and so τ′ = µ/2. Then, it is not difficult to show that the conditional probabilities are

P1j = Pr {X = x ∈ Γ1 | H = Hj} = Pj(Γ1) = ∫_{τ′}^{∞} pj(x) dx = 1 − Φ(−µ/(2σ)) for j = 1, and 1 − Φ(µ/(2σ)) for j = 0,   (2.24)

and the minimum Bayes risk r(δ) is

r(δ) = 1 − Φ(µ/(2σ)).   (2.25)

This is a decreasing function of µ/σ, a quantity which is a simple version of the signal-to-noise ratio.
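The closed form (2.25) is easy to confirm by Monte Carlo simulation of the rule (2.23) with τ′ = µ/2; the values µ = σ = 1 and the sample size are illustrative choices.

```python
import math, random

# Monte Carlo check of (2.25): uniform costs, equal priors, rule x >= mu/2.
mu, sigma, n = 1.0, 1.0, 200_000
random.seed(0)
errors = 0
for _ in range(n):
    h1 = random.random() < 0.5                    # true hypothesis (H1 w.p. 1/2)
    x = (mu if h1 else 0.0) + random.gauss(0.0, sigma)
    errors += ((x >= mu / 2.0) != h1)             # decision rule (2.23)

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
r_theory = 1.0 - Phi(mu / (2.0 * sigma))          # eq. (2.25), about 0.3085
print(errors / n, r_theory)
```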
Summary of notations for binary hypothesis testing

Hypotheses:               H0, H1
Processes:                P0, P1
Conditional densities (likelihood functions):  p0(x), p1(x)
Observation space partition:  Γ = Γ0 ∪ Γ1
Decision rule:            δ(x) = 1 if x ∈ Γ1, δ(x) = 0 if x ∈ Γ0
Conditional probabilities:  Pij = ∫_{Γi} pj(x) dx :  P00, P01, P10, P11
Costs:                    cij :  c00, c01, c10, c11
Conditional risks:        Rj = Σ_i cij Pij :  R0 = c00 P00 + c10 P10,  R1 = c01 P01 + c11 P11
Prior probabilities:      π0, π1
Posterior probabilities:  πj(x) = pj(x) πj / m(x) :  π0(x), π1(x)
Joint probabilities:      Qij = πj Pij :  Q00, Q01, Q10, Q11
Posterior costs:          ci(x) = Σ_j cij πj(x) :  c0(x) = c00 π0(x) + c01 π1(x),  c1(x) = c10 π0(x) + c11 π1(x)
Bayes risk:               r(δ) = Σ_j πj Rj(δ)
Likelihood ratio:         L(x) = p1(x)/p0(x)
Posterior ratio:          π1(x)/π0(x) = π1 p1(x) / (π0 p0(x))
Posterior cost ratio:     c1(x)/c0(x) = (c10 π0(x) + c11 π1(x)) / (c00 π0(x) + c01 π1(x))

The following equivalent tests minimize the Bayes risk r(δ):

Likelihood ratio test:    p1(x)/p0(x) ≥ τ2 = (π0/π1)(c10 − c00)/(c01 − c11)
Posterior ratio test:     π1(x)/π0(x) ≥ τ1 = (c10 − c00)/(c01 − c11)
Posterior cost ratio test:  c1(x)/c0(x) ≤ 1
Special case of uniform costs binary hypothesis testing

Costs:                    cij = 0 if i = j, 1 if i ≠ j :  c00 = 0, c01 = 1, c10 = 1, c11 = 0
Conditional risks:        Rj = Σ_i cij Pij :  R0 = P10,  R1 = P01
Prior probabilities:      π0, π1
Posterior probabilities:  πj(x) = pj(x) πj / m(x) :  π0(x), π1(x)
Joint probabilities:      Qij = πj Pij :  Q00, Q01, Q10, Q11
Posterior costs:          ci(x) = Σ_j cij πj(x) :  c0(x) = π1(x),  c1(x) = π0(x)
Bayes risk:               r(δ) = Σ_j πj Rj(δ) = π0 P10 + π1 P01
Likelihood ratio:         L(x) = p1(x)/p0(x)
Posterior ratio:          π1(x)/π0(x) = π1 p1(x) / (π0 p0(x))
Posterior cost ratio:     c1(x)/c0(x) = π0(x)/π1(x)

The following equivalent tests minimize the Bayes risk r(δ):

Likelihood ratio test:    p1(x)/p0(x) ≥ τ2 = π0/π1
Posterior ratio test:     π1(x)/π0(x) ≥ 1
Posterior cost ratio test:  c1(x)/c0(x) = π0(x)/π1(x) ≤ 1
2.3 Minimax binary hypothesis testing
In the previous section, we saw how Bayesian hypothesis testing gives us a complete procedure for hypothesis testing problems. However, in some applications, we may not be able to assign the prior probabilities {π0, π1}. One approach is then to choose arbitrarily π0 = π1 = 1/2 and follow the Bayesian procedure as in the last section. An alternative approach is to choose a design criterion other than the expected penalty r(δ). For example, we may use the conditional risks R0(δ) and R1(δ) and design a decision rule that minimizes, over all δ, the criterion

max{R0(δ), R1(δ)}.   (2.26)

The decision rule based on this criterion is known as the minimax rule. To design this decision rule, it is useful to consider the function r(π0, δ), defined for a given prior π0 ∈ [0, 1] and a given decision rule δ as the average risk

r(π0, δ) = π0 R0(δ) + (1 − π0) R1(δ).   (2.27)
Noting that r(π0, δ) is a linear function of π0, for fixed δ its maximum occurs at either π0 = 0 or π0 = 1, with maximum value R1(δ) or R0(δ) respectively. So, the problem of minimizing the criterion (2.26) over δ is equivalent to minimizing over δ the quantity

max_{π0 ∈ [0,1]} r(π0, δ).   (2.28)

Now, for each prior π0, let δπ0 denote a Bayes rule corresponding to that prior, and let V(π0) = r(π0, δπ0); that is, V(π0) is the minimum Bayes risk for the prior π0. Then, it is not difficult to show that V(π0) is a concave function of π0 with V(0) = c11 and V(1) = c00.
Now consider the function r(π0, δπ′0), which is a straight line tangent to V(π0) at π0 = π′0 (see Figure 2.2). From this figure, we can see that only Bayes rules can possibly be minimax rules. Indeed, the minimax rule in this case is the Bayes rule for the prior value π0 = πL that maximizes V; for this prior, r(π0, δπL) is constant over π0, and so R0(δπL) = R1(δπL). This decision rule (with equal conditional risks) is called an equalizer rule. Because πL maximizes the minimum Bayes risk, it is called the least-favorable prior. Thus, in this case, a minimax decision rule is the Bayes rule for the least-favorable prior πL.
Even if we arrived at this conclusion through an example, it can be proved that this fact is true in all practical situations. This result is stated as the following proposition:

Suppose that πL is a solution of V(πL) = max_{π0 ∈ [0,1]} V(π0). Assume further that either R0(δπL) = R1(δπL) or πL ∈ {0, 1}. Then δπL is a minimax rule (see V. Poor for the proof). We will come back to the minimax rule in more detail in chapter x.
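A numerical sketch of the least-favorable prior for the Gaussian location example of Section 2.2, assuming uniform costs (so c00 = c11 = 0) and the illustrative values µ = σ = 1: for each prior π0 we form the Bayes threshold, evaluate the conditional risks, and maximize V(π0) over a grid. By the symmetry of this problem, πL = 1/2.

```python
import math

mu, sigma = 1.0, 1.0
Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def V(pi0):
    """Minimum Bayes risk V(pi0) = r(pi0, delta_pi0) for uniform costs."""
    if pi0 <= 0.0 or pi0 >= 1.0:
        return 0.0                               # V(0) = c11 = 0, V(1) = c00 = 0
    pi1 = 1.0 - pi0
    # Bayes rule threshold on x: t = (sigma^2/mu) log(pi0/pi1) + mu/2
    t = (sigma**2 / mu) * math.log(pi0 / pi1) + mu / 2.0
    R0 = 1.0 - Phi(t / sigma)                    # P0(Gamma1): false alarm
    R1 = Phi((t - mu) / sigma)                   # P1(Gamma0): miss
    return pi0 * R0 + pi1 * R1

grid = [i / 1000.0 for i in range(1001)]
pi_L = max(grid, key=V)                          # least-favorable prior
print(pi_L, V(pi_L))   # pi_L = 0.5 by symmetry; V(0.5) = 1 - Phi(mu/(2*sigma))
```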
Figure 2.2: Illustration of the minimax rule: the concave minimum Bayes risk V(π0) over π0 ∈ [0, 1] (with V(0) = c11 and V(1) = c00), the tangent lines r(π0, δπ′0), and the least-favorable prior πL, at which r(π0, δπL) is constant and R0(δπL) = R1(δπL).
2.4 Neyman-Pearson hypothesis testing
In the previous sections, we first examined Bayesian hypothesis testing, where optimality was defined through the overall expected cost r(δ). Then, we considered the case where the prior probabilities {π0, π1} are not available and described the minimax decision rule in terms of the maximum of the conditional risks R0(δ) and R1(δ).
In both cases, we need to define the costs cij. In some applications, imposing a particular cost structure on the decisions may not be possible or desirable. In such cases, an alternative criterion, known as the Neyman-Pearson criterion, is used, which is based on the probability of making a false decision. The main idea of this procedure is to choose one of the hypotheses as the main hypothesis and to test the other hypotheses against it. For example, in testing H0 against H1, two kinds of errors can be made:
• Falsely rejecting H0 (in this case, falsely detecting H1). This error is called a Type I error, a false alarm, or a false detection.

• Falsely rejecting H1 (in this case, falsely detecting H0). This error is called a Type II error or a miss.
The terms “false alarm” and “miss” come from radar applications in which H0 and H1
usually represent the absence and presence of a target.
For a decision rule δ, the probability of a Type I error is known as the false alarm probability and is denoted by PF (δ). Similarly, the probability of a Type II error is called the miss probability and is denoted by PM (δ). The quantity PD(δ) = 1 − PM (δ) is called the detection probability, or the power, of δ.
The Neyman-Pearson criterion is based on these quantities. It places a bound on the false alarm probability and minimizes the miss probability within this constraint, i.e.,

max_δ PD(δ)  subject to  PF (δ) ≤ α     (2.29)

where α is known as the significance level of the test. Thus the Neyman-Pearson criterion is to find the most powerful α-level test of H0 against H1.
Note that, in the Neyman-Pearson test, as opposed to the Bayesian and minimax tests, the two hypotheses are not treated symmetrically.
The general form of the Neyman-Pearson decision rule is

δ(x) = 1 if L(x) > τ;  γ(x) if L(x) = τ;  0 if L(x) < τ     (2.30)
where τ is a threshold. The false alarm probability and the detection probability of a decision rule δ can be calculated respectively by

PF (δ) = E0 {δ(X)} = ∫_Γ δ(x) p0(x) dx     (2.31)

PD(δ) = E1 {δ(X)} = ∫_Γ δ(x) p1(x) dx     (2.32)
A parametric plot of PD(δ) as a function of PF (δ) is called the receiver operating characteristic (ROC).
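As an illustration (with invented numbers, not an example from the text), one point of the ROC and the most powerful α-level threshold can be computed for a Gaussian shift-in-mean problem. The densities, the mean shift and the level α below are assumptions of this sketch:

```python
from statistics import NormalDist

# Hypothetical example: H0: X ~ N(0,1), H1: X ~ N(1,1).
# The likelihood ratio L(x) is increasing in x, so the Neyman-Pearson test
# "decide H1 iff x > tau", with tau chosen so that P_F = alpha, is optimal.
alpha = 0.1
h0, h1 = NormalDist(0.0, 1.0), NormalDist(1.0, 1.0)

tau = h0.inv_cdf(1.0 - alpha)        # threshold: Pr{X > tau | H0} = alpha
p_false_alarm = 1.0 - h0.cdf(tau)    # P_F(delta), equals alpha by construction
p_detection = 1.0 - h1.cdf(tau)      # P_D(delta): one point on the ROC curve

print(tau, p_false_alarm, p_detection)
```

Sweeping α over (0, 1) and plotting p_detection against p_false_alarm traces out the full ROC of this detector.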
Chapter 3
Basic concepts of general hypothesis testing
In the previous chapters, we introduced the basics of the simple binary hypothesis testing problem. In this chapter, we consider the more general case of M-ary hypothesis testing. First, we give the basic definitions for the case of simple hypothesis testing with well-known stochastic processes. Then, we consider the case of composite hypothesis testing, where the stochastic processes are only parametrically known. In each case, we try to give a simple classification of the different decision rules, describing their optimality criteria and their performances.
3.1 A general M-ary hypothesis testing problem
To consider a general M-ary hypothesis testing problem, let us consider the necessary steps that any decision-making procedure has to follow:
1. Get the data: observe x(t), a realization of X(t), in some time interval [0, T].

2. Define a library of hypotheses {Hi, i = 1, · · · ,M}, where Hi is the hypothesis that the data x(t) come from a stochastic process Xi(t), with either finite or infinite membership.

3. Define a performance criterion for evaluating the decisions {δi, i = 1, · · · ,M}.

4. If possible, define a probability measure determining the a priori probabilities of the stochastic processes in the library.

5. Use all the available assets to formulate a general optimization problem whose solution is a decision.
Evidently, the nature of the optimization problem and the subsequent decisions vary significantly with the specifics of the library of stochastic processes, with the availability of an a priori probability distribution on these stochastic processes, with the performance criterion to optimize, and with the possibility of controlling the observation time [0, T] dynamically.
Figure 3.1: Partition of the observation space and deterministic or probabilistic decision rules: on the left, disjoint regions Γj → Hj and Γk → Hk; on the right, an overlap Γj ∪ Γk in which Hj is decided with probability qj and Hk with probability qk, with qj + qk = 1.
For any fixed specifications on the above issues, the decision then depends on the data.This is what is called the decision rule or test.
We can distinguish the following special cases:
• If the library has a finite number of members, the decision process is classified as hypothesis testing.

• If this number is two, the decision process is called detection.

• If the stochastic processes are well known, the hypothesis testing is called simple. If they are defined parametrically, the decision process is called parametric; if not, it is called nonparametric.
3.1.1 Deterministic or probabilistic decision rules
A decision rule δ = {δj(x), j = 1, · · · ,M} subdivides the observation space Γ into M subspaces {Γj , j = 1, · · · ,M}. One can distinguish two types of decision rules:
• Deterministic decision rule: If these subspaces are all disjoint and, for a given data set x, the hypothesis Hj is decided with probability one, the decision rule is called deterministic.

• Probabilistic decision rule: If some of these subspaces overlap and, for a given data set x, no hypothesis can be decided with probability one, the decision rule is called probabilistic. That is, given x, the hypothesis Hk is decided with probability qk and Hj with probability qj , with ∑_j qj = 1.
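The mechanics of a probabilistic decision rule can be sketched as follows; the probabilities qj , qk and the sample size are invented for illustration:

```python
import random

# Hypothetical sketch: in an overlap region the rule decides hypothesis h
# with probability q[h], where the q[h] sum to one.
def randomized_decision(q, rng):
    """Draw one hypothesis index according to the probabilities q."""
    u, acc = rng.random(), 0.0
    for hypothesis, q_h in enumerate(q):
        acc += q_h
        if u < acc:
            return hypothesis
    return len(q) - 1  # guard against rounding in the cumulative sum

rng = random.Random(0)                # fixed seed for reproducibility
q = (0.3, 0.7)                        # q_j = 0.3, q_k = 0.7 (illustrative values)
draws = [randomized_decision(q, rng) for _ in range(100_000)]
freq_k = draws.count(1) / len(draws)  # empirical frequency, close to q_k
print(freq_k)
```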
3.1.2 Conditional, A priori and Joint Probability Distributions
For a given decision rule δ, we can define the following probability distributions:
• Conditional probability distribution:
Pki(δ) is the conditional probability that Hk is chosen given that Hi is true. These probabilities can be calculated from the probability distribution of the stochastic process:

Pki(δ) = Pr {Hk decided by rule δ | Hi true}
       = ∫_Γ dPr {Hk decided and x observed | Hi true}
       = ∫_Γ Pr {Hk decided | x observed} dPr {x observed | Hi true}
       = ∫_Γ δk(x) dFi(x) = ∫_Γ δk(x) fi(x) dx     (3.1)
Note that, in this derivation, we have used the theorem of total probability, the Bayes rule, and the fact that the decision induced by the decision rule δ is independent of the true hypothesis. Note also that the decision rule δ consists of probabilities such that

∑_{j=1}^{M} δj(x) = 1   ∀x ∈ Γ     (3.2)

Using this fact, it is easy to show that

∑_{k=1}^{M} Pki = 1   ∀i = 1, · · · ,M     (3.3)

This can be interpreted as: given the true hypothesis Hi, the decision induced by the decision rule δ is restricted to one of the M hypotheses.
• A priori probability distribution:
{πi, i = 1, · · · ,M} is a prior probability distribution on the hypotheses {Hi, i = 1, · · · ,M}. Naturally we have

∑_{i=1}^{M} πi = 1     (3.4)
• Joint probability distribution:
Using the conditional probabilities Pki(δ) and the prior probabilities πi, we can calculate the joint probabilities Qki(δ), denoting the probability that Hk is decided by the decision rule δ while Hi is true. We then have

Qki(δ) = πi Pki(δ) = πi ∫_Γ δk(x) fi(x) dx     (3.5)

∑_{k=1}^{M} Qki = πi   ∀i = 1, · · · ,M     (3.6)
∑_{k=1}^{M} ∑_{i=1}^{M} Qki = 1     (3.7)
3.1.3 Probabilities of false and correct detection
For a given decision rule δ, we can define the following probabilities:
• Probability of false decision:
Pe(δ) is the probability that the decision induced by δ is erroneous, i.e., Hk is decided while Hi, i ≠ k, is true:

Pe(δ) = ∑_{k≠i} Qki(δ)     (3.9)
• Probability of correct decision:
Pd(δ) is the probability that the decision induced by δ is correct, i.e., Hk is decided while Hk is true:

Pd(δ) = ∑_k Qkk(δ) = 1 − Pe(δ)     (3.10)
• One can use these probabilities to define optimal decision procedures:
Pe(δ∗) ≤ Pe(δ), ∀δ ∈ D (3.11)
Pd(δ∗) ≥ Pd(δ), ∀δ ∈ D (3.12)
3.1.4 Penalty or cost coefficients
In addition to the M known hypotheses and their a priori probabilities, the analyst may be equipped with a set of real cost or penalty coefficients cki such that

cki ≥ 0 and cki ≥ ckk  ∀k, i = 1, · · · ,M,     (3.13)

where cki is the penalty paid when Hk is decided while Hi is true. The implication of the condition that each coefficient is nonnegative is that there is no gain associated with any decision, hence the term penalty; in general, the penalties cki are chosen greater than ckk.
3.1.5 Conditional and Bayes risks
For a given decision rule δ and a given set of cost coefficients cki, one can calculate the following quantities:

• Conditional expected penalties or conditional risks:

Ri(δ) = ∑_{k=1}^{M} cki Pki(δ)     (3.14)
• Expected penalty or Bayes risk:

r(δ) = ∑_{k=1}^{M} ∑_{i=1}^{M} cki Qki(δ)     (3.15)
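The quantities defined in this and the previous subsections can be tied together on a small example; the priors, conditional densities, costs and rule below are all invented for illustration:

```python
# Illustrative toy problem (all numbers invented): M = 2 hypotheses and a
# discrete observation space Gamma = {0, 1, 2}.
M = 2
pi = [0.6, 0.4]                       # priors pi_i
f = [[0.7, 0.2, 0.1],                 # f_0(x): distribution of x under H_0
     [0.1, 0.3, 0.6]]                 # f_1(x): distribution of x under H_1
c = [[0.0, 1.0],                      # cost c_ki of deciding H_k when H_i true
     [2.0, 0.0]]

def delta(x):                         # deterministic rule: decide H_0 iff x = 0
    return 0 if x == 0 else 1

# Conditional probabilities P_ki = Pr{decide H_k | H_i true}       (eq. 3.1)
P = [[sum(f[i][x] for x in range(3) if delta(x) == k) for i in range(M)]
     for k in range(M)]
# Joint probabilities Q_ki = pi_i * P_ki                           (eq. 3.5)
Q = [[pi[i] * P[k][i] for i in range(M)] for k in range(M)]

Pe = sum(Q[k][i] for k in range(M) for i in range(M) if k != i)    # eq. 3.9
Pd = sum(Q[k][k] for k in range(M))                                # eq. 3.10
R = [sum(c[k][i] * P[k][i] for k in range(M)) for i in range(M)]   # eq. 3.14
r = sum(c[k][i] * Q[k][i] for k in range(M) for i in range(M))     # eq. 3.15

print(Pe, Pd, R, r)
```

Note that the Bayes risk also satisfies r(δ) = ∑_i πi Ri(δ), which the example confirms numerically.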
3.1.6 Bayesian and non Bayesian hypothesis testing
Different optimization problems which are classically used to define a decision process are:
• Bayesian hypothesis testing:
– If a specific cost function that penalizes wrong decisions is provided, then theminimization of the expected penalty or the Bayes risk is chosen as the perfor-mance criterion.
– If not, the probability of making a decision error is minimized instead. This decision process is called the ideal observer test.
• Non Bayesian hypothesis testing:
When an a priori probability distribution is unavailable, then

– If a specific cost function is available, first a least-favorable a priori probability distribution is defined, and then the expected penalty under this least-favorable distribution is minimized to obtain a decision rule. This decision process is called the minimax decision process.

– If a cost function is not available, first one of the hypotheses is selected in advance as the most important, and then the performance criterion is the maximization of the probability of detection of that hypothesis, subject to the constraint that its false alarm probability does not exceed a given value α. This is called the Neyman-Pearson test procedure.
3.1.7 Admissible decision rules and stopping rule
• Admissible decision rules:
It may happen that, for a given set of optimality criteria, more than one best decision rule satisfies the performance criterion; such rules are called admissible.

• When a decision rule is designed, one may want to know how it performs as a function of the observation time interval [0, T], i.e., of the number of data. The study of this behavior leads to the notion of a stopping rule.
All the above test procedures take a dynamic form if the observation time interval[0, T ] can be controlled dynamically.
3.1.8 Classification of simple hypothesis testing schemes
To summarize, let us again list the richest possible set of assets available to the analyst:

i. A library of M distinct hypotheses {Hi, i = 1, · · · ,M};

ii. A set of data x which is assumed to be a realization of a well-known stochastic process under only one of these hypotheses;

iii. A prior probability distribution {πi, i = 1, · · · ,M} for the M hypotheses;

iv. A set of penalty coefficients {cki, k, i = 1, · · · ,M}.

The minimum set of assets that is (or must be) always available consists of those in i and ii, and the performance criterion will suffer limitations as the number of remaining available assets decreases.
Now, to continue, first assume that all assets in i to iv are available. Then an optimal rule δ∗ is such that its expected penalty r(δ) is lower than that of any other rule, i.e.,

δ∗ : r(δ∗) ≤ r(δ)  ∀δ ∈ D     (3.16)

This rule guarantees a minimum average cost due to wrong decisions. Note that this rule may not be unique. When uniqueness is not satisfied, there exist a number of admissible rules, among which we can choose the one that is simplest to implement.
If assets in i to iii are available, then an optimal rule δ∗ can be defined by using the induced probability of error Pe(δ), or the probability of detection, i.e.,

δ∗ : Pe(δ∗) ≤ Pe(δ)  ∀δ ∈ D     (3.17)

or

δ∗ : Pd(δ∗) ≥ Pd(δ)  ∀δ ∈ D     (3.18)

Again, note that there may not exist a unique decision rule, but a set of admissible rules, among which we can choose the one that is simplest to implement.
The hypothesis testing rules based on the above criteria are called Bayesian, due to the basic ingredient which is the availability of the prior probabilities πi on the hypothesis space. When asset iii is not available, the decision rules are called non Bayesian.
Now assume that assets i, ii and iv are available. Then the analyst can choose an arbitrary prior probability distribution π = {πi} and calculate the induced conditional expected penalty R(δ, π). Then the decision rules

δ∗ : sup_π R(δ∗, π) ≤ sup_π R(δ, π)  ∀δ ∈ D     (3.19)

are admissible. The analyst can then choose among these admissible rules the one with the lowest complexity. This procedure, when successful, isolates the decision rule that induces the minimum of the maximum value of the conditional expected penalty and protects the analyst against the most costly case. This formalism and procedure is called minimax.
Finally, assume that only the assets in i and ii are available. Then the main idea is to select one of the hypotheses as the principal one and use the notion of the power function P(δ).
General Hypothesis Testing Schemes

Scheme       | A priori | Cost | Decision rule
Bayesian     | Yes      | Yes  | Minimization of the expected penalty r(δ)
Bayesian     | Yes      | No   | Minimization of the error probability Pe(δ)
Non Bayesian | No       | Yes  | Minimax test rule using the conditional risks Rj(δ)
Non Bayesian | No       | No   | Neyman-Pearson test rule using Pe(δ) and Pd(δ)

Classes of Hypothesis Testing Schemes for Well-Known Stochastic Processes

Scheme       | Assets used    | Optimization function      | Optimal decision rule δ∗       | Specific name
Bayesian     | i, ii, iii, iv | r(δ)                       | r(δ∗) ≤ r(δ)                   | Bayesian
Bayesian     | i, ii, iii     | Pe(δ)                      | Pe(δ∗) ≤ Pe(δ)                 | Bayesian
Non Bayesian | i, ii, iv      | sup_π R(δ, π)              | sup_π R(δ∗, π) ≤ sup_π R(δ, π) | Minimax
Non Bayesian | i, ii          | Pd(δ) subject to Pe(δ) ≤ α | Pd(δ∗) ≥ Pd(δ) and Pe(δ∗) ≤ α  | Neyman-Pearson
3.2 Composite hypothesis
Now assume that the hypothesis Hi means that x is a realization of the process Xi, but the process Xi is only parametrically known, i.e., its probability distribution is known up to a set of unknown parameters θ, so that the prior probabilities {πi} depend on the parameter θ. Assume also that the partitioning underlying the decision rule is due to a partition of the parameter space T of possible values of θ, i.e.,
T = {T1, · · · ,TM} , ∪iTi = T (3.20)
Assume also that, for each value of θ, the stochastic process is defined through its probability distribution fθ(x), and assume that we can define a probability density function π(θ) over the space T. Then we have

πi = ∫_{Ti} π(θ) dθ     (3.21)

and

Pk,θ(δ) = ∫_Γ δk(x) fθ(x) dx     (3.22)

where

∑_{k=1}^{M} Pk,θ(δ) = 1  ∀θ ∈ T     (3.23)
We can now calculate the conditional probabilities Pki(δ):

Pki(δ) = (1/πi) ∫_{Ti} Pk,θ(δ) π(θ) dθ
       = [∫_{Ti} π(θ) dθ]^{-1} ∫_Γ dx δk(x) ∫_{Ti} fθ(x) π(θ) dθ     (3.24)
and the joint probabilities Qki as follows:

Qki(δ) = πi Pki(δ) = ∫_{Ti} Pk,θ(δ) π(θ) dθ
       = ∫_Γ dx δk(x) ∫_{Ti} fθ(x) π(θ) dθ     (3.25)
3.2.1 Penalty or cost functions
In addition to the M parametrically known hypotheses, we may be equipped with a set of real penalty or cost functions ck(θ).
We can then calculate:
• Conditional expected penalty or conditional risk function:

R(δ, θ) = ∑_{k=1}^{M} ck(θ) Pk,θ(δ)     (3.26)
        = ∫_Γ dx ∑_{k=1}^{M} δk(x) ck(θ) fθ(x)     (3.27)
• Expected penalty or Bayes risk:
If π(θ) is available, then the expected penalty can be calculated by

r(δ) = ∫_T R(δ, θ) π(θ) dθ = ∑_{k=1}^{M} ∫_T ck(θ) Pk,θ(δ) π(θ) dθ     (3.28)
     = ∫_Γ dx ∑_{k=1}^{M} δk(x) ∫_T ck(θ) fθ(x) π(θ) dθ     (3.29)
3.2.2 Case of binary hypothesis testing
Now, consider the particular case of binary hypothesis testing, where there are only two hypotheses, and assume that they are determined through a single parametrically known stochastic process and two disjoint subdivisions of the parameter space T. Let us denote these two hypotheses by H0 and H1, and the decision rules by δ0(x) and δ1(x), with δ0(x) + δ1(x) = 1 ∀x ∈ Γ. Now, if we emphasize the hypothesis H1 (detection), then δ0(x) = 1 − δ1(x), so that we can drop the indices in the decision rule and write δ(x) = δ1(x). We then have
Pθ(δ) = ∫_Γ δ(x) fθ(x) dx     (3.30)
This expression represents the probability that the emphasized hypothesis H1 is decided, conditioned on the value θ of the parameter vector. It is called the power function of the decision rule, because it gives the probability with which the emphasized hypothesis is decided for each fixed parameter value θ.
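A minimal sketch of a power function, assuming a Gaussian observation X ~ N(θ, 1) and a simple threshold rule; both the model and the threshold are assumptions of this example, not taken from the text:

```python
from statistics import NormalDist

# Hypothetical model: X ~ N(theta, 1), rule delta(x) = 1{x > tau} with tau = 1.
# The power function P_theta(delta) = Pr{X > tau | theta} of eq. (3.30).
tau = 1.0

def power(theta):
    return 1.0 - NormalDist(theta, 1.0).cdf(tau)

thetas = [-1.0, 0.0, 1.0, 2.0]
powers = [power(t) for t in thetas]
print(powers)   # increases monotonically in theta for this one-sided rule
```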
3.2.3 Classification of hypothesis testing schemes for a parametrically known stochastic process
As before, we now summarize the richest possible set of assets available for a composite hypothesis testing problem:

i. A library of M distinct hypotheses {Hi, i = 1, · · · ,M};

ii. A set of data x which is assumed to be a realization of a parametrically known stochastic process, the hypotheses {Hi, i = 1, · · · ,M} corresponding to M disjoint subdivisions of the parameter space T;

iii. A prior probability distribution π(θ) on the parameter space T;

iv. A set of penalty functions {ck(θ), k = 1, · · · ,M} defined on T.
First, assume that all assets in i to iv are available. Then an optimal rule δ∗ is such that its expected penalty r(δ) is lower than that of any other rule, i.e.,
δ∗ : r(δ∗) ≤ r(δ) ∀δ ∈ D (3.31)
If assets in i to iii are available, then an optimal rule δ∗ can be defined by using the induced probability of error Pe(δ), or the probability of detection, i.e.,
δ∗ : Pe(δ∗) ≤ Pe(δ) ∀δ ∈ D (3.32)
or

δ∗ : Pd(δ∗) ≥ Pd(δ)  ∀δ ∈ D     (3.33)
Again, note that there may not exist a unique decision rule, but a set of admissible rules,among which we can choose the one which is the simplest to implement.
Now assume that assets i, ii and iv are available. Then the analyst can use the induced conditional expected penalty R(δ, θ). An optimal rule would induce relatively low values of R(δ, θ) for all θ ∈ T. So, if there exist two rules δ(1) and δ(2) such that

R(δ(1), θ) ≤ R(δ(2), θ)  ∀θ ∈ T     (3.34)

then δ(2) should be rejected in the presence of δ(1). The rule δ(1) is said to be uniformly superior to the rule δ(2). But it may happen that R(δ(1), θ) ≤ R(δ(2), θ) for some values of θ and R(δ(1), θ) > R(δ(2), θ) for others. In this case, we may prefer δ(1) to δ(2) if
sup_{θ∈T} R(δ(1), θ) ≤ sup_{θ∈T} R(δ(2), θ)     (3.35)
Thus, the selection procedure has, in general, two steps: first reject all the uniformly inadmissible rules, and then, among the remaining ones, define the optimal rules

δ∗ : sup_{θ∈T} R(δ∗, θ) ≤ sup_{θ∈T} R(δ, θ)  ∀δ ∈ D     (3.36)

which are admissible. Finally, the analyst can choose among these admissible rules the one with the lowest complexity. This procedure, when successful, isolates the decision rule that induces the minimum of the maximum value of the conditional expected penalty and protects the analyst against the most costly case. This formalism and procedure is called minimax.
Figure 3.2: Two decision rules δ(1) and δ(2) and their respective risk functions R(δ, θ). In both cases δ(2) is rejected in the presence of δ(1).
Finally, assume that only the assets in i and ii are available. Then the main idea is to select one of the hypotheses as the principal one and use the notion of the power function Pθ(δ). Let H1 denote the emphasized hypothesis, T1 its associated region in T, and Pθ(δ) the power function associated with it. It is then desirable that Pθ(δ) for any θ ∈ T1 have a value higher than its value for the other hypotheses, i.e.,
P_{θ∈T1}(δ) ≥ P_{θ∈Tj}(δ)     (3.37)
The quantity sup_{θ∈T0} Pθ(δ) is the false alarm probability induced by δ. The value of Pθ(δ) for a given θ ∈ T1 is the power induced by the decision rule δ. If the subspaces T0 and T1 are fixed, then the best decision rule δ∗ is the one that induces the highest power subject to a false alarm constraint, i.e.,

δ∗ : Pθ(δ∗) ≥ Pθ(δ)  ∀θ ∈ T1, ∀δ ∈ D,  subject to  sup_{θ∈T0} Pθ(δ∗) ≤ α     (3.38)
The procedure, as in the minimax scheme, may have more than one solution.
Figure 3.3: Two decision rules δ(1) and δ(2) and their respective power functions Pθ(δ) over T0 and T1. In this case δ(1) is preferred to δ(2).
Classes of Hypothesis Testing Schemes for Parametrically Known Stochastic Processes

Scheme       | Assets used    | Optimization function                       | Optimal decision rule δ∗                              | Specific name
Bayesian     | i, ii, iii, iv | r(δ)                                        | r(δ∗) ≤ r(δ)                                          | Bayesian
Bayesian     | i, ii, iii     | Pe(δ)                                       | Pe(δ∗) ≤ Pe(δ)                                        | Bayesian
Non Bayesian | i, ii, iv      | sup_{θ∈T} R(δ, θ)                           | sup_{θ∈T} R(δ∗, θ) ≤ sup_{θ∈T} R(δ, θ)                | Minimax
Non Bayesian | i, ii          | P_{θ∈T1}(δ) subject to sup_{θ∈T0} Pθ(δ) ≤ α | P_{θ∈T1}(δ∗) ≥ P_{θ∈T1}(δ) and sup_{θ∈T0} Pθ(δ∗) ≤ α | Neyman-Pearson
3.3 Classification of parameter estimation schemes
The basic ingredient that distinguishes parameter estimation from hypothesis testing is the dimension of the hypothesis space and the nature of the stochastic process corresponding to each alternative. In hypothesis testing, the dimension of the hypothesis space is finite and each of the M alternatives is represented by one stochastic process. In parameter estimation, we face an infinite number of alternatives, represented by some m-dimensional vector parameter θ that takes its values in T.
The basic elements of parameter estimation are then the vector parameter θ and a stochastic process X(t) parameterized by θ, and we can still distinguish two cases:

• If, for a fixed θ, the stochastic process is well known, then we have a parametric parameter estimation scheme.

• If, for a fixed θ, the stochastic process is a member of some class Fθ of processes, then we have a nonparametric or robust parameter estimation scheme.
In both cases, the main assumption is that the value of the parameter, and thus the nature of the stochastic process, remains unchanged during the observation time [0, T]. The main objective is then to determine the active value of the parameter θ. Given a set of data x, the solution is denoted θ̂(x) and is called the parameter estimate.
Among the different criteria to measure the performance of an estimate θ̂(x), one can mention the following:
• Bias:
For a real-valued parameter vector θ, the Euclidean norm

‖θ − E[θ̂(X)]‖

is called the bias of the estimate θ̂(x) at the process. If the bias is zero for all θ ∈ T, then the estimate θ̂(x) is called unbiased at the process.
• Conditional variance:
The quantity

E[ ‖θ̂(X) − E[θ̂(X)]‖² | θ ]

is called the conditional variance of the estimate θ̂(x).
In general, the bias and the conditional variance present a tradeoff. Indeed, an unbiasedestimate may induce a relatively large variance, and very often, admitting a small biasmay result in a significant reduction of the conditional variance.
A parameter estimate θ̂(x) is called efficient at the process if its conditional variance equals a lower bound known as the Cramér-Rao bound.
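The bias/variance tradeoff mentioned above can be seen on the classical variance estimators for a Gaussian sample; this Monte Carlo sketch uses invented sample sizes and a fixed seed, and is an illustration rather than anything from the text:

```python
import random
import statistics

# Estimate sigma^2 = 1 of an N(0,1) sample of size n with the unbiased
# estimator (divide by n - 1) and the biased one (divide by n).
rng = random.Random(42)
n, trials, true_var = 5, 20_000, 1.0

unbiased, biased = [], []
for _ in range(trials):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    s2 = statistics.variance(x)        # divides by n - 1 (unbiased)
    unbiased.append(s2)
    biased.append(s2 * (n - 1) / n)    # divides by n (biased, smaller variance)

bias_u = statistics.fmean(unbiased) - true_var
bias_b = statistics.fmean(biased) - true_var
mse_u = statistics.fmean((e - true_var) ** 2 for e in unbiased)
mse_b = statistics.fmean((e - true_var) ** 2 for e in biased)
print(bias_u, bias_b, mse_u, mse_b)
```

For this Gaussian case the biased estimator trades a bias of about −σ²/n for a lower variance, and ends up with the smaller mean squared error.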
A more general criterion here also is the expected penalty, obtained by defining a penalty function c[θ̂(x), θ], a scalar, nonnegative function whose values vary as θ̂ and θ vary in T. We can then define:
• Conditional expected penalty or conditional risk function:

R(θ̂, θ) = E[ c[θ̂(X), θ] | θ ] = ∫_Γ c[θ̂(x), θ] fθ(x) dx     (3.39)

where fθ(x) is the probability density function of the stochastic process defined by θ at the point x.
• Expected penalty or Bayes risk function:
When an a priori probability density function π(θ) is available, we can calculate the expected value of R(θ̂, θ) for all θ ∈ T, and thus define the total expected penalty or the Bayes risk function by

r(θ̂) = r(θ̂, π) = ∫_T R(θ̂, θ) π(θ) dθ     (3.40)
Now, let us try to classify the parameter estimation schemes. For this, we list the richest possible set of assets:
i. A parametric or nonparametric description of a stochastic process depending on a finite-dimensional parameter vector θ;

ii. A set of data x which is assumed to be a realization of the active stochastic process, with the implicit assumption that this process remains unchanged during the observation time;

iii. A parameter space T where θ takes its values;

iv. An a priori probability distribution π(θ) defined on the parameter space T;

v. A penalty function c[θ̂(x), θ] defined for each data sequence x, parameter vector θ, and estimated parameter vector θ̂(x).
Here also, some of the assets listed above may not be available, and we will see how different schemes come out of the partial availability of these assets. The minimum set of assets that is (or must be) always available consists of those in i, ii and iii, and the performance criterion of the estimation will suffer limitations as the number of remaining available assets decreases.
First, we assume that a parametric description of the stochastic process is available. When all the assets are available, we have the Bayesian parameter estimation scheme, where the performance criterion is the expected penalty or Bayes risk

r(θ̂) = E[ c[θ̂(X), θ] ] = ∫_T ∫_Γ c[θ̂(x), θ] fθ(x) π(θ) dx dθ     (3.41)

minimized with respect to the estimate θ̂(x) for all x in the observation space Γ.
The Bayesian optimal estimate is then defined as the estimate θ̂∗(x) which minimizes the expected penalty, i.e.,

θ̂∗(x) : r(θ̂∗(x)) ≤ r(θ̂(x)) for every estimate θ̂(x)     (3.42)
If assets in i to iv are available, then we can calculate the posterior probability density function of θ using the Bayes rule:

π(θ|x) = p(x|θ) π(θ) / m(x) = fθ(x) π(θ) / m(x)     (3.43)

where

m(x) = ∫_T fθ(x) π(θ) dθ     (3.44)
and define an estimate, called the maximum a posteriori (MAP) estimate, by

θ̂∗(x) : π(θ̂∗|x) ≥ π(θ|x)  ∀θ ∈ T     (3.45)

or, written differently,

θ̂∗ = arg max_{θ∈T} {π(θ|x)}     (3.46)
If assets in i to iii and v are available, then an optimal estimate θ̂∗(x) can be defined by using the conditional expected penalty

R(θ̂, θ) = E[ c[θ̂(X), θ] | θ ] = ∫_Γ c[θ̂(x), θ] fθ(x) dx     (3.47)

where fθ(x) is the probability density function of the stochastic process defined by θ at the point x. We are then in the minimax parameter estimation scheme, which is based on the saddle-point game formalization, with the expected penalty r(θ̂, π) as payoff function and the parameter estimate θ̂ and the a priori probability density function π as variables.
In summary, we can say that, if a minimax estimate θ̂∗ exists, it is an optimal Bayesian estimate at some least-favorable a priori probability distribution π0, i.e.,

θ̂∗(x) : ∃π0 : r[θ̂∗(x), π] ≤ r[θ̂∗(x), π0] ≤ r[θ̂(x), π0]  for every estimate θ̂ and every π     (3.48)
When only the assets i to iii are available, the analyst can use the induced conditional probability density function fθ(x). The scheme is called maximum likelihood: the main idea is to view fθ(x) as a function l(θ) = fθ(x), called the likelihood of the vector parameter θ, and to define the maximum likelihood (ML) estimate θ̂∗(x) as the one that maximizes the likelihood l(θ), i.e.,

θ̂∗(x) : l(θ̂∗) ≥ l(θ)  ∀θ ∈ T     (3.49)
All the above schemes comprise the class of parametric parameter estimation procedures, with the common characteristic of the assumption that the stochastic process generating the data is parametrically well known.
When, for a given vector parameter θ, the stochastic process is nonparametrically described, the parameter estimation is called nonparametric or sometimes robust. As in the minimax scheme, the robust estimation scheme uses a saddle-point game procedure, but here the payoff function originates from the likelihood. So, in this scheme, in addition to the nonparametric description of the stochastic process, only the assets ii and iii are used to define a performance criterion using the likelihood function. We will return to this scheme in more detail in future chapters.
Classes of Parameter Estimation Schemes for Parametrically Known Stochastic Processes

Scheme       | A priori | Cost | Decision rule
Bayesian     | Yes      | Yes  | Minimization of the expected penalty r(θ̂)
Bayesian     | Yes      | No   | Maximization of the posterior probability π(θ|x)
Non Bayesian | No       | Yes  | Minimax estimation using r(θ̂, π)
Non Bayesian | No       | No   | Maximum likelihood estimation using l(θ)

Classes of Parameter Estimation Schemes

Scheme             | Assets used       | Optimization function | Optimal estimate θ̂∗
Bayesian           | i, ii, iii, iv, v | r(θ̂)                 | θ̂∗ : r(θ̂∗) ≤ r(θ̂) for every θ̂
MAP                | i, ii, iii, iv    | π(θ|x)                | θ̂∗ : π(θ̂∗|x) ≥ π(θ|x) ∀θ ∈ T
Minimax            | i, ii, iii, v     | R(θ̂, θ)              | θ̂∗ : sup_{θ∈T} R(θ̂∗, θ) ≤ sup_{θ∈T} R(θ̂, θ)
Maximum likelihood | i, ii, iii        | l(θ)                  | θ̂∗ : l(θ̂∗) ≥ l(θ) ∀θ ∈ T
Robust estimation  | ii, iii, plus a nonparametric description of the stochastic process | based on l(θ) | appropriate saddle-point optimization
3.4 Summary of notations and abbreviations
• δ = {δk, k = 1, · · · ,M}A decision rule (or a set of possible actions)
• Γ = {Γk, k = 1, · · · ,M}The partitions of the observation space Γ corresponding to the hypotheses {Hk} andthe decision rule δ
• T = {Tk, k = 1, · · · ,M}The partitions of the parameter space T corresponding to the hypotheses {Hk} andthe decision rule δ
• {πi}A prior probability distribution for the hypotheses {Hi}
• π(θ)A prior probability density function for a scalar parameter θ
• π(θ)A prior probability density function for a vector parameter θ
• {πi(θ)}Conditional prior probability density functions for the vector parameter θ under thehypothesis Hi
• {ri(θ)} = {πi πi(θ)}Unconditional prior probability density functions for the vector parameter θ underthe hypothesis Hi
• fθ(x) = p(x|θ)Conditional probability density function of the observations for a given θ
• l(θ) = fθ(x) = p(x|θ)Likelihood function of θ for a given data x
• π(θ|x) = fθ(x) π(θ) / m(x) = p(x|θ) π(θ) / m(x)
Posterior probability density function of θ given the observations x

• m(x) = ∫ p(x|θ) π(θ) dθ
Marginal distribution of the observations x
• {cki}Penalty coefficients
• {cki(θ)} or {ck(θ)}Penalty functions
• {Pki(δ)}Conditional probabilities of the decision rule δ for a well known stochastic process
• {Pki,θ(δ)} or {Pk,θ(δ)}
Conditional probabilities of the decision rule δ for a parametrically known stochastic process
• {Qki(δ) = πiPki(δ)}Probabilities of the decisions in the decision rule δ
• Pe(δ) = ∑_{k≠i} Qki(δ)
Probability of error due to the decision rule δ

• Pd(δ) = ∑_k Qkk(δ) = 1 − Pe(δ)
Probability of correct detection due to the decision rule δ
• Pfa(δ) = Q10(δ)Probability of false alarm in a binary hypothesis testing
• Pfd(δ) = Q01(δ)Probability of false detection in a binary hypothesis testing
• Pe(δ) = Q01(δ) +Q10(δ)Probability of the error due to the decision rule δ in a binary hypothesis testing
• Pd(δ) = Q00(δ) +Q11(δ)Probability of the correct detection due to the decision rule δ in a binary hypothesistesting
• Conditional expected penalty or risk function

Ri(δ) = ∑_{k=1}^{M} cki Pki(δ) for a well-known stochastic process;

Rθ(δ) = ∑_{k=1}^{M} ck(θ) Pk,θ(δ) for a parametrically known stochastic process
• Expected penalty or Bayes risk function

ri(δ) = ∑_{k=1}^{M} cki Qki(δ) for a well-known stochastic process;

rθ(δ) = ∑_{k=1}^{M} ck(θ) Qk,θ(δ) for a parametrically known stochastic process
Chapter 4
Bayesian hypothesis testing
4.1 Introduction
Let us start by recalling and making precise the notations and definitions. First, we consider a general case and assume that there exists a parameter vector θ of finite dimension m and M stochastic processes, such that for every fixed value of θ ∈ T, the conditional distributions {Fθ,i(x) = Fi(x|θ), i = 1, · · · ,M} and their corresponding densities {fθ,i(x) = fi(x|θ), i = 1, · · · ,M} are well known for all values x ∈ Γ.
We also assume that we know the conditional prior probability distributions

{πi(θ) = π(θ|Hi), i = 1, · · · ,M},

their corresponding unconditional prior distributions

{ri(θ) = P (H = Hi) πi(θ) = πi πi(θ), i = 1, · · · ,M},

and the penalty functions {cki(θ)}. For a given decision rule δ(x) = {δj(x), j = 1, · · · ,M} we define the expected penalty
r(δ) = ∫_Γ dx ∑_{k=1}^{M} δk(x) ∫_T ∑_{i=1}^{M} cki(θ) fθ,i(x) ri(θ) dθ     (4.1)
Note that this general case reduces to the following two special cases:
• When the M stochastic processes corresponding to the M hypotheses are described through a single parametric stochastic process with M disjoint subdivisions {T1, · · · ,TM} of the parameter space T, then the quantities πi(θ) and ri(θ) both reduce to π(θ), and the quantities {fθ,i(x)} and {cki(θ)} reduce to {fθ(x)} and {ck(θ)}. We then have
r(δ) =
∫∫
Γdx
M∑
k=1
δk(x)
∫∫
Tck(θ)fθ(x)π(θ) dθ (4.2)
• When the M stochastic processes corresponding to the M hypotheses are all well known, we can eliminate θ from the above equations, and the quantities f_{θ,i}(x), r_i(θ) and {c_{ki}(θ)} reduce respectively to {f_i(x)}, π_i = Pr{H_i} and {c_{ki}}. We then have
\[ r(\delta) = \int_\Gamma dx \sum_{k=1}^{M} \delta_k(x) \sum_{i=1}^{M} c_{ki}\, f_i(x)\, \pi_i \tag{4.3} \]
Note that in all three cases above we can write
\[ r(\delta) = \int_\Gamma dx \sum_{k=1}^{M} \delta_k(x)\, g_k(x) \tag{4.4} \]
where g_k(x) is given by one of the following equations:
\[ g_k(x) = \int_T \sum_{i=1}^{M} c_{ki}(\theta)\, f_{\theta,i}(x)\, r_i(\theta)\, d\theta \tag{4.5} \]
\[ g_k(x) = \int_T c_k(\theta)\, f_\theta(x)\, \pi(\theta)\, d\theta \tag{4.6} \]
\[ g_k(x) = \sum_{i=1}^{M} c_{ki}\, f_i(x)\, \pi_i \tag{4.7} \]
4.2 Optimization problem
Now we have all the ingredients to write down the optimization problem of Bayesian hypothesis testing. Before starting, recall that any decision rule δ = {δ_j(x), j = 1, ..., M} satisfies
\[ \delta_k(x) \ge 0,\ k = 1, \dots, M \quad\text{and}\quad \sum_{k=1}^{M} \delta_k(x) = 1 \quad \forall x \in \Gamma \tag{4.8} \]
Now consider the following optimization problem:
\[ \text{Minimize } r(\delta) = \int_\Gamma dx \sum_{k=1}^{M} \delta_k(x)\, g_k(x) \tag{4.9} \]
\[ \text{subject to } \delta_k(x) \ge 0,\ k = 1, \dots, M, \quad \sum_{k=1}^{M} \delta_k(x) = 1,\ \forall x \in \Gamma \tag{4.10} \]
where {g_k(x), k = 1, ..., M} are nonnegative functions defined on Γ whose expression is given by either (4.5), (4.6) or (4.7).
This optimization problem does not, in general, have a unique solution, and there may exist a whole class D* of equivalent decision rules. Recall that two decision rules δ*_1 and δ*_2 are equivalent if
\[ r(\delta^*_1) = r(\delta^*_2) \le r(\delta) \quad \forall \delta \in D \tag{4.11} \]
The class D* includes both random and deterministic decision rules. For simplicity, one generally chooses a nonrandom decision rule. The δ_k are then either 0 or 1, depending on whether x ∈ Γ_k or x ∉ Γ_k, and the expression (4.9) of r(δ) becomes
\[ r(\delta) = \sum_k \int_{\Gamma_k} g_k(x)\, dx \tag{4.12} \]
Now, assuming that g_k(x) ≥ 0 ∀x ∈ Γ_k, the Bayesian hypothesis testing scheme consists of the following steps:
• Given the observations x, compute
\[ t(x) = \min_k g_k(x); \tag{4.13} \]
• Select a single k* = k(x) such that g_{k^*(x)}(x) = t(x);
• Define
\[ \delta^*_j(x) = \begin{cases} 1 & j = k^*(x) \\ 0 & j \ne k^*(x) \end{cases} \tag{4.14} \]
The function t(x) together with the index k(x) is called the test, and the statistical behavior of the pair [t(X), k(X)] is called the test statistics.
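The three steps above can be sketched in code. This is an illustrative sketch, not from the course text: the names `densities`, `priors` and `costs` are hypothetical, and the example uses two Gaussian hypotheses with uniform (0-1) costs, for which the rule reduces to picking the larger likelihood.

```python
import numpy as np

def bayes_decision(x, densities, priors, costs):
    # g_k(x) = sum_i c_{ki} f_i(x) pi_i  (eq. 4.7); the decision is k* = argmin_k g_k(x)
    f = np.array([fi(x) for fi in densities])   # f_i(x) for each hypothesis
    g = costs @ (f * priors)                    # vector of g_k(x), k = 1..M
    return int(np.argmin(g))                    # t(x) = min_k g_k(x) is attained at k*

# Toy example: H_0: N(0,1) vs H_1: N(2,1), uniform priors and 0-1 costs (eq. 4.23).
norm = lambda m: (lambda x: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi))
densities = [norm(0.0), norm(2.0)]
priors = np.array([0.5, 0.5])
costs = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
print(bayes_decision(-0.3, densities, priors, costs))  # -> 0
print(bayes_decision(1.8, densities, priors, costs))   # -> 1
```

With 0-1 costs, g_0(x) is proportional to f_1(x) and g_1(x) to f_0(x), so minimizing g_k is the same as maximizing the likelihood of the selected hypothesis.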
To go into more detail, we consider some examples, from general cases to more specific ones.
Let us start with a general case. Here we assume that during any n observations the transmitted signal is exactly one of the M possible signals {s_i, i = 0, ..., M−1}. Then we have M hypotheses:
\[ H_i : x \sim f_i(x), \quad i = 0, \dots, M-1 \tag{4.15} \]
Then we have
\[ g_k(x) = \sum_{i=1}^{M} c_{ki}\, f_i(x)\, \pi_i \tag{4.16} \]
and
\[ t(x) = \min_k g_k(x) = \min_k \sum_{i=1}^{M} c_{ki}\, f_i(x)\, \pi_i \tag{4.17} \]
Given the observation x, the search for an index k(x) that satisfies (4.17) can be realized via the differences
\[ \sum_{i=1}^{M} (c_{ki} - c_{li})\, f_i(x)\, \pi_i \tag{4.18} \]
The optimal index k*(x) is such that
\[ k^*(x) : \sum_{i=1}^{M} (c_{k^*i} - c_{li})\, f_i(x)\, \pi_i \le 0 \quad \forall l \tag{4.19} \]
If x is such that f_0(x) > 0 and if π_0 > 0, then we have
\[ k^*(x) : \sum_{i=1}^{M} (c_{k^*i} - c_{li})\, \frac{\pi_i}{\pi_0}\, \frac{f_i(x)}{f_0(x)} \le 0 \quad \forall l \tag{4.20} \]
The ratios {f_i(x)/f_0(x)} are called the likelihood ratios, so that the procedure to obtain the decisions now consists in comparing the likelihood ratios against some thresholds. The test induced by (4.20) consists of a weighted sum of the likelihood ratios; this weighted sum is called the test function. So, in general, the test function is compared with the threshold zero, which is independent of the observation.
Note also that we can rewrite (4.20) in the two following forms:
\[ k^*(x) : \sum_{i=1}^{M} (c_{k^*i} - c_{li})\, \frac{\pi_i(x)}{\pi_0(x)} \le 0 \quad \forall l \tag{4.21} \]
or
\[ k^*(x) : \sum_{i=1}^{M} c_{k^*i}\, \pi_i(x) \le \sum_{i=1}^{M} c_{li}\, \pi_i(x) \quad \forall l \tag{4.22} \]
The fractions π_i(x)/π_0(x) are the posterior likelihood ratios and c_l(x) = \sum_{i=1}^{M} c_{li}\, \pi_i(x) are the expected posterior penalties.
Further simplifications can be achieved with uniform cost functions
\[ c_{ki} = \begin{cases} 0 & \text{if } k = i \\ 1 & \text{if } k \ne i \end{cases} \tag{4.23} \]
and with uniform priors
\[ \pi_1 = \pi_2 = \cdots = \pi_M = \frac{1}{M}. \tag{4.24} \]
With these assumptions, the Bayesian decision rule becomes
\[ k^*(x) : L_{k^*}(x) \ge L_l(x) \quad \forall l \tag{4.25} \]
where
\[ L_i(x) = \frac{f_i(x)}{f_0(x)} \tag{4.26} \]
Now let us consider some special cases.
4.3 Examples
4.3.1 Radar applications
One of the oldest areas where detection-estimation theory has been used is radar. The main problem is to detect the presence of one of M known signals transmitted through the atmosphere. The transmission channel is the atmosphere, and it is assumed to be statistically well known. A simple model for the received signal is then
X(t) = S(t) +N(t) (4.27)
where N(t) is the additive noise due to the channel. In the discrete case, where we assume we observe n samples of the received signal over the time period [0, T], this model becomes
\[ X_j = S_j + N_j, \quad j = 1, \dots, n \tag{4.28} \]
or, in vector form,
\[ x = s + n. \tag{4.29} \]
Consider now the case where one of the signals, say s_0, is null. Then the M hypotheses become:
\[ \begin{cases} H_0 : x = n \\ H_i : x = s_i + n, & i = 1, \dots, M-1 \end{cases} \tag{4.30} \]
Assume also that we know the conditional probability density functions f_0(x) and f_i(x) of the received signal under the hypotheses H_0 and H_i. For zero-mean white Gaussian noise of variance σ², the likelihood ratios L_i(x) become
\[ L_i(x) = \frac{f_i(x)}{f_0(x)} = \frac{\exp\left[-\frac{1}{2\sigma^2}(x - s_i)^t (x - s_i)\right]}{\exp\left[-\frac{1}{2\sigma^2}\, x^t x\right]} = \exp\left[-\frac{1}{2\sigma^2}\left(-2 s_i^t x + s_i^t s_i\right)\right] \tag{4.31} \]
and k*(x) satisfies:
\[ k^*(x) : s_{k^*}^t x - \frac{1}{2}\, s_{k^*}^t s_{k^*} > s_l^t x - \frac{1}{2}\, s_l^t s_l \quad \forall l \tag{4.32} \]
Figure 4.1 shows the structure of this optimal test. If, moreover, all the signals have the same energy |s_i|² = s_i^t s_i, then we have
\[ k^*(x) : s_{k^*}^t x > s_l^t x \quad \forall l \tag{4.33} \]
Figure 4.2 shows the structure of this simplified test.
Figure 4.1: General structure of a Bayesian optimal detector: the observation x is passed through a bank of weighting correlators producing s_1^t x, ..., s_M^t x, constants c_1, ..., c_M are added, and a select-largest stage outputs k*(x).
Figure 4.2: Simplified structure of a Bayesian optimal detector: correlators s_1^t x, ..., s_M^t x followed directly by a select-largest stage.
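The select-largest structure of Figures 4.1-4.2 can be sketched as follows. This is an illustration, not from the text: the bank of three sinusoids and the noise level are hypothetical choices.

```python
import numpy as np

def correlate_and_select(x, signals):
    # Statistic of eq. (4.32): s_k^t x - s_k^t s_k / 2, followed by "select largest"
    stats = [s @ x - 0.5 * (s @ s) for s in signals]
    return int(np.argmax(stats))

rng = np.random.default_rng(0)
n = 64
t = np.arange(n)
signals = [np.sin(2 * np.pi * k * t / n) for k in (1, 2, 3)]  # three known waveforms
x = signals[1] + 0.5 * rng.standard_normal(n)                 # waveform 1 plus noise
print(correlate_and_select(x, signals))                       # -> 1
```

Since the three sinusoids are orthogonal over a full period, the correlator output for the transmitted waveform dominates by a wide margin at this noise level.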
4.3.2 Detection of a known signal in an additive noise
Consider now the case where there is only one signal. So that, we have a binary detectionproblem: {
H0 : x = n −→ f0(x) = f(x)H1 : x = s + n −→ f1(x) = f(x − s)
(4.34)
Then, we have:
L(x) =f1(x)
f0(x)(4.35)
General case:
The general optimal Bayesian detector compares the likelihood ratio L(x) with a threshold τ, deciding H_1 if L(x) > τ, H_0 if L(x) < τ, and either one at equality:

Figure 4.3: General Bayesian detector.
Case of white noise:
Now, if we assume that the noise is white, we have:
\[ \begin{cases} H_0 : x = n & \longrightarrow\ f_0(x) = \prod_j f(x_j) \\ H_1 : x = s + n & \longrightarrow\ f_1(x) = \prod_j f(x_j - s_j) \end{cases} \tag{4.36} \]
and consequently
\[ L(x) = \prod_j L_j(x_j) = \prod_j \frac{f(x_j - s_j)}{f(x_j)} \tag{4.37} \]
Figure 4.4: Bayesian detector in the case of i.i.d. data: compute log L_j(x_j) for each sample, sum over j = 1, ..., n, and compare the sum with the threshold.
Case of Gaussian noise:
In this case we have
\[ \begin{cases} H_0 : x = n & \longrightarrow\ f_0(x) \propto \exp\left[-\frac{1}{2\sigma^2}\, x^t x\right] \\ H_1 : x = s + n & \longrightarrow\ f_1(x) \propto \exp\left[-\frac{1}{2\sigma^2}\, (x-s)^t (x-s)\right] \end{cases} \tag{4.38} \]
\[ L(x) = \frac{f_1(x)}{f_0(x)} = \frac{\exp\left[-\frac{1}{2\sigma^2}(x-s)^t(x-s)\right]}{\exp\left[-\frac{1}{2\sigma^2}\, x^t x\right]} = \exp\left[-\frac{1}{2\sigma^2}\left(-2 s^t x + s^t s\right)\right] \tag{4.39} \]
We then have, with τ_1 = σ² log τ,
\[ \delta_B(x) = \begin{cases} 1 & \text{if } s^t\left(x - \frac{s}{2}\right) > \tau_1 \\ 0/1 & \text{if } s^t\left(x - \frac{s}{2}\right) = \tau_1 \\ 0 & \text{if } s^t\left(x - \frac{s}{2}\right) < \tau_1 \end{cases} \tag{4.40} \]
The detector then has the following structure: each sample x_j is shifted by −s_j/2, multiplied by s_j, the results are summed over j = 1, ..., n, and the sum is compared with τ_1.

Figure 4.5: General Bayesian detector in the case of i.i.d. Gaussian data.
Note that we can rewrite (4.40) as:
\[ \delta_B(x) = \begin{cases} 1 & \text{if } s^t x > \tau' \\ 0/1 & \text{if } s^t x = \tau' \\ 0 & \text{if } s^t x < \tau' \end{cases} \tag{4.41} \]
with the modified threshold τ' = τ_1 + ½ s^t s.

Figure 4.6: Simplified Bayesian detector in the case of i.i.d. Gaussian data: correlate x_j with s_j, sum over j, and compare with the threshold.
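The correlation detector (4.41) can be sketched as follows. This is a minimal illustration with hypothetical names; deterministic inputs are used so that both outcomes are easy to check.

```python
import numpy as np

def gaussian_detector(x, s, sigma2, tau=1.0):
    # Decide H_1 iff s^t x > tau', with tau' = sigma^2 log(tau) + s^t s / 2 (eq. 4.41)
    tau_prime = sigma2 * np.log(tau) + 0.5 * (s @ s)
    return int(s @ x > tau_prime)

s = np.ones(100)                                        # known signal
print(gaussian_detector(np.zeros(100), s, sigma2=1.0))  # -> 0 ("noise" at its mean)
print(gaussian_detector(s, s, sigma2=1.0))              # -> 1 (noise-free signal)
```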
Case of Laplacian noise:
\[ H_0 : x = n \longrightarrow f_0(x) = \prod_{j=1}^{n} \frac{\alpha}{2} \exp\left[-\alpha |x_j|\right] = \left(\frac{\alpha}{2}\right)^n \exp\left[-\alpha \sum_{j=1}^{n} |x_j|\right] \tag{4.42} \]
\[ H_1 : x = s + n \longrightarrow f_1(x) = \left(\frac{\alpha}{2}\right)^n \exp\left[-\alpha \sum_{j=1}^{n} |x_j - s_j|\right] \tag{4.43} \]
\[ L(x) = \prod_{j=1}^{n} L_j(x_j) \tag{4.44} \]
with
\[ L_j(x_j) = \frac{f_1(x_j)}{f_0(x_j)} = \exp\left[-\alpha |x_j - s_j| + \alpha |x_j|\right] = \begin{cases} \exp[-\alpha |s_j|] & \text{if } \operatorname{sgn}(s_j)\, x_j < 0 \\ \exp[\alpha\, \operatorname{sgn}(s_j)(2x_j - s_j)] & \text{if } 0 < \operatorname{sgn}(s_j)\, x_j < |s_j| \\ \exp[\alpha |s_j|] & \text{if } \operatorname{sgn}(s_j)\, x_j > |s_j| \end{cases} \tag{4.45} \]
where
\[ \operatorname{sgn}(x) = \begin{cases} +1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases} \tag{4.46} \]
Considering the two cases s_j < 0 and s_j > 0, we obtain the following structure: each sample x_j is shifted by −s_j/2, passed through a limiter nonlinearity, multiplied by sgn(s_j), summed over j, and the sum is compared with τ.

Figure 4.7: Bayesian detector in the case of i.i.d. Laplacian data.
4.4 Binary channel transmission
In digital signal transmission we generally have to transmit binary sequences. If we assume that the channel transmits each bit separately in a memoryless fashion, and that each bit s_j is transmitted correctly with probability q, then we can describe this channel graphically as in Figure 4.8: input bit 0 maps to output bit 0 with probability q and to output bit 1 with probability 1 − q, and symmetrically for input bit 1.

Figure 4.8: A binary channel.

A useful binary, memoryless and symmetric channel should have a probability of correct transmission higher than the probability of incorrect transmission, i.e. q > 0.5.
With these assumptions on the channel we can easily calculate the probability of observing x conditional on the transmitted sequence s:
\[ \Pr\{x|s\} = \prod_{j=1}^{n} \Pr\{x[j] \,|\, s[j]\} = \prod_{j=1}^{n} q^{1 - (x[j] \oplus s[j])} (1-q)^{x[j] \oplus s[j]} = q^n \left(\frac{1-q}{q}\right)^{\sum_{j=1}^{n} (x[j] \oplus s[j])} \tag{4.47} \]
where ⊕ denotes modulo-2 (binary) addition.
Now assume that, during each observation period, exactly one of M well-known binary sequences s_k (called codewords) is transmitted. We have received the binary sequence x and we want to know which codeword was transmitted.
Indeed, if we assume p_1 = p_2 = ... = p_M = 1/M and if we denote by
\[ H(s_k, s_l) = \frac{1}{n} \sum_{j=1}^{n} s_k[j] \oplus s_l[j] \tag{4.48} \]
the Hamming distance between the two binary words s_k and s_l, then the likelihood ratios have the following form:
\[ \frac{\Pr\{x|s_k\}}{\Pr\{x|s_l\}} = \left(\frac{1-q}{q}\right)^{\sum_{j=1}^{n} (x[j] \oplus s_k[j]) - \sum_{j=1}^{n} (x[j] \oplus s_l[j])} \tag{4.49} \]
and the Bayesian optimal test becomes:
\[ k^*(x) : \left(\frac{1-q}{q}\right)^{\sum_{j=1}^{n} (x[j] \oplus s_{k^*}[j]) - \sum_{j=1}^{n} (x[j] \oplus s_l[j])} \ge 1 \quad \forall l = 1, \dots, M \tag{4.50} \]
Taking the logarithm of both sides we obtain the following condition on k*(x):
\[ k^*(x) : \left[\sum_{j=1}^{n} (x[j] \oplus s_{k^*}[j]) - \sum_{j=1}^{n} (x[j] \oplus s_l[j])\right] \log\left(\frac{1-q}{q}\right) \ge 0 \quad \forall l = 1, \dots, M \tag{4.51} \]
We can then distinguish two cases:
• Case 1: Let q > 0.5, which means that the channel has a higher probability of transmitting correctly than incorrectly. Then (1 − q)/q < 1, so log((1 − q)/q) < 0, and k*(x) satisfies
\[ k^*(x) : \sum_{j=1}^{n} (x[j] \oplus s_{k^*}[j]) \le \sum_{j=1}^{n} (x[j] \oplus s_l[j]), \quad \forall l = 1, \dots, M \tag{4.52} \]
or equivalently
\[ k^* : H(x, s_{k^*}) \le H(x, s_l), \quad \forall l = 1, \dots, M \tag{4.53} \]
The test clearly decides in favor of the codeword s_{k*} whose Hamming distance from the observed sequence x is minimal. This is why this detector is called the minimum distance decoding scheme.
Let now the M codewords be designed so that the Hamming distance between any two of them equals (2d+1)/n, where d is a positive integer, i.e.
\[ H(s_k, s_l) = (2d+1)/n, \quad \forall k \ne l,\ k, l = 1, \dots, M \tag{4.54} \]
with d such that
\[ \sum_{i=0}^{d} \binom{n}{i} \le 2^n / M \tag{4.55} \]
Then, via the minimum distance decoding scheme, if the distance between the received word x and the codeword s_k is at most d/n, the codeword s_k is correctly detected, and we have
\[ P_d(s_k) \ge \sum_{i=0}^{d} \binom{n}{i} q^{n-i} (1-q)^i \tag{4.56} \]
\[ P_d = \sum_{k=1}^{M} (1/M)\, P_d(s_k) \ge \sum_{i=0}^{d} \binom{n}{i} q^{n-i} (1-q)^i \tag{4.57} \]
\[ P_e = 1 - P_d \le 1 - \sum_{i=0}^{d} \binom{n}{i} q^{n-i} (1-q)^i \tag{4.58} \]
• Case 2: In the case of q < 0.5, by the same analysis, the Bayesian detection scheme decides in favor of the codeword whose Hamming distance from the observed sequence is maximal. This is not surprising: if q < 0.5, then with probability 1 − q > 0.5 each bit is flipped in transmission, so more than half of the codeword bits are expected to be changed.
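Both the minimum distance decoder (4.53) and the bounds (4.56)-(4.58) are easy to check numerically. This sketch is illustrative: the repetition code and the values of n, d and q are arbitrary choices, not from the text.

```python
from math import comb

def hamming_distance(u, v):
    # Number of differing positions (n times H(u, v) of eq. 4.48)
    return sum(a != b for a, b in zip(u, v))

def min_distance_decode(x, codewords):
    # Decide in favor of the codeword closest to x in Hamming distance (eq. 4.53)
    return min(range(len(codewords)), key=lambda k: hamming_distance(x, codewords[k]))

def prob_correct_lower_bound(n, d, q):
    # P(at most d bit errors) = sum_{i<=d} C(n,i) q^(n-i) (1-q)^i  (eq. 4.56)
    return sum(comb(n, i) * q ** (n - i) * (1 - q) ** i for i in range(d + 1))

codewords = [(0, 0, 0, 0, 0), (1, 1, 1, 1, 1)]          # repetition code: n = 5, d = 2
print(min_distance_decode((0, 1, 0, 0, 1), codewords))  # two flips -> decoded as 0
Pd = prob_correct_lower_bound(n=5, d=2, q=0.95)
print(Pd, 1.0 - Pd)                                     # bounds (4.56) and (4.58)
```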
Chapter 5
Signal detection and structure of optimal detectors
In previous chapters we discussed some basic optimality criteria and design methods for general hypothesis testing problems. In this chapter we apply them to derive optimal procedures for the detection of signals corrupted by noise. We consider only the discrete case.
First, we summarize Bayesian composite hypothesis testing and focus on the binary case. Then we describe other related tests in this particular case. Finally, through some examples with different models for the signal and the noise, we derive the optimum detector structures.
At the end, we give some basic elements of robust, sequential and nonparametric detection.
5.1 Bayesian composite hypothesis testing
Consider the following composite hypothesis testing problem:
\[ X \sim f_\theta(x) \tag{5.1} \]
and define the decision rule δ(x) and its associated cost function c[δ(x), θ]. Then the conditional risk function is given by
\[ R_\theta(\delta) = E_\theta\{c[\delta(X), \theta]\} = E[c[\delta(X), \Theta] \,|\, \Theta = \theta] = \int_\Gamma c[\delta(x), \theta]\, f_\theta(x)\, dx \tag{5.2} \]
and the Bayes risk by
\[ r(\delta) = E[R_\Theta(\delta(X))] = E\big[E[c[\delta(X), \Theta] \,|\, \Theta]\big] = \int_\tau \int_\Gamma c[\delta(x), \theta]\, f_\theta(x)\, \pi(\theta)\, dx\, d\theta = \int_\Gamma \left[\int_\tau c[\delta(x), \theta]\, \pi(\theta|x)\, d\theta\right] f(x)\, dx = E\big[E[c[\delta(X), \Theta] \,|\, X]\big] \tag{5.3} \]
From this relation, and the fact that the cost function is in general positive, we can deduce that minimizing r(δ) over δ is equivalent to minimizing, for every x ∈ Γ, the mean posterior cost
\[ c[x] = E[c[\delta(X), \Theta] \,|\, X = x] = \int_\tau c[\delta(x), \theta]\, \pi(\theta|x)\, d\theta. \tag{5.4} \]
5.1.1 Case of binary composite hypothesis testing
In this case, we have
\[ \delta_B(x) = \begin{cases} 1 & \text{if } E[c[0,\theta] \,|\, X=x] > E[c[1,\theta] \,|\, X=x] \\ 0/1 & \text{if } E[c[0,\theta] \,|\, X=x] = E[c[1,\theta] \,|\, X=x] \\ 0 & \text{otherwise} \end{cases} \tag{5.5} \]
that is, we decide in favor of the hypothesis with the smaller posterior cost. If the two hypotheses correspond to two disjoint partitions of the parameter space τ = {T_0, T_1}, we have
\[ c[i, \theta] = c_{ij} \quad \text{if } \theta \in T_j \tag{5.6} \]
and the rule becomes
\[ \delta_B(x) = \begin{cases} 1 & \text{if } \frac{\Pr\{\theta \in T_1 | X = x\}}{\Pr\{\theta \in T_0 | X = x\}} > \frac{c_{10} - c_{00}}{c_{01} - c_{11}} \\ 0/1 & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.7} \]
Now, using Bayes' rule, we have
\[ \Pr\{\theta \in T_i | X = x\} = \frac{\Pr\{X = x | \theta \in T_i\}\, \Pr\{\theta \in T_i\}}{\sum_{i=0}^{1} \Pr\{X = x | \theta \in T_i\}\, \Pr\{\theta \in T_i\}}, \quad i = 0, 1. \tag{5.8} \]
The decision rule (5.7) becomes
\[ \delta_B(x) = \begin{cases} 1 & \text{if } L(x) = \frac{\Pr\{X = x | \theta \in T_1\}}{\Pr\{X = x | \theta \in T_0\}} > \frac{\pi_0}{\pi_1}\, \frac{c_{10} - c_{00}}{c_{01} - c_{11}} \\ 0/1 & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.9} \]
where
\[ \pi_i = \Pr\{\theta \in T_i\}, \quad i = 0, 1. \tag{5.10} \]
The conditional probability density functions of θ are denoted r_i(θ) and given by
\[ r_i(\theta) = \begin{cases} 0 & \text{if } \theta \notin T_i \\ p(\theta)/\pi_i & \text{if } \theta \in T_i \end{cases} \tag{5.11} \]
The expression of Pr{X = x | θ ∈ T_i} is then
\[ \Pr\{X = x \,|\, \theta \in T_i\} = \int_\tau f_\theta(x)\, r_i(\theta)\, d\theta = \frac{1}{\pi_i} \int_{T_i} f_\theta(x)\, p(\theta)\, d\theta = \frac{1}{\pi_i}\, E_i\{f_\Theta(x)\} \tag{5.12} \]
where E_i{f_Θ(x)} stands for the expectation under the hypothesis H_i. We can then rewrite (5.9) as:
\[ \delta_B(x) = \begin{cases} 1 & \text{if } L(x) = \frac{E_1\{f_\Theta(x)\}}{E_0\{f_\Theta(x)\}} > \frac{c_{10} - c_{00}}{c_{01} - c_{11}} \\ 0/1 & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.13} \]
With such hypotheses, the false alarm and correct detection probabilities become respectively
\[ P_F(\delta, \theta) = E_\theta\{\delta(X)\} \quad \text{for } \theta \in T_0 \tag{5.14} \]
\[ P_D(\delta, \theta) = E_\theta\{\delta(X)\} \quad \text{for } \theta \in T_1 \tag{5.15} \]
5.2 Uniformly most powerful (UMP) test
The α-level uniformly most powerful (UMP) test is defined as the solution of
\[ \max_\delta P_D(\delta, \theta) \quad \text{s.t.}\quad P_F(\delta, \theta) \le \alpha \tag{5.16} \]
Unfortunately, this optimization problem may not have a solution. In those cases, we can try to design a test by following the Neyman-Pearson scheme, which is summarized below.

Neyman-Pearson lemma:
Suppose the hypothesis H_0 is simple (T_0 has only one element θ_0), the hypothesis H_1 is composite, and we have a parametrically defined probability density function p(θ) for θ ∈ T_1. Then the most powerful α-level test for H_0 against H_1 has a unique critical region given by
\[ \Gamma_\theta = \{x \in \Gamma \,|\, f_\theta(x) > \tau\, f_{\theta_0}(x)\} \tag{5.17} \]
where τ depends on α. The corresponding test is given by
\[ \delta(x) = \begin{cases} 1 & \text{if } f_\theta(x) > \tau\, f_{\theta_0}(x) \\ 0/1 & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.18} \]
5.3 Locally most powerful (LMP) test
The UMP requirement is too strong, and the optimization problem may not have a solution in more general cases. To illustrate this test, let us consider the following case:
\[ \begin{cases} H_0 : \theta = \theta_0 \\ H_1 : \theta > \theta_0 \end{cases} \tag{5.19} \]
The α-level locally most powerful (LMP) test is based on the Taylor expansion of P_D(δ, θ) around the simple hypothesis parameter value θ_0:
\[ P_D(\delta, \theta) \simeq P_D(\delta, \theta_0) + (\theta - \theta_0) \left.\frac{\partial P_D(\delta, \theta)}{\partial \theta}\right|_{\theta=\theta_0} + O\big((\theta - \theta_0)^2\big) \tag{5.20} \]
Noting that P_F(δ) = P_D(δ, θ_0), the Neyman-Pearson problem
\[ \max P_D(\delta, \theta) \quad \text{s.t.}\quad P_F(\delta) \le \alpha \tag{5.21} \]
becomes
\[ \max \left.\frac{\partial P_D(\delta, \theta)}{\partial \theta}\right|_{\theta=\theta_0} \quad \text{s.t.}\quad P_F(\delta) \le \alpha \tag{5.22} \]
Noting that
\[ P_D(\delta, \theta) = E_\theta\{\delta(X)\} = \int_\Gamma \delta(x)\, f_\theta(x)\, dx \tag{5.23} \]
and assuming that f_θ(x) is sufficiently regular in the neighbourhood of θ_0, we can calculate
\[ P'_D(\delta, \theta_0) = \left.\frac{\partial P_D(\delta, \theta)}{\partial \theta}\right|_{\theta=\theta_0} = \int_\Gamma \delta(x) \left.\frac{\partial f_\theta(x)}{\partial \theta}\right|_{\theta=\theta_0} dx \tag{5.24} \]
In conclusion, the α-level locally most powerful (LMP) test is obtained in the same way as the α-level most powerful test, replacing f_θ(x) by f'_{θ_0}(x) = ∂f_θ(x)/∂θ |_{θ=θ_0}. The critical region of H_0 against H_1 is then given by
\[ \Gamma_\theta = \{x \in \Gamma \,|\, f'_{\theta_0}(x) > \tau\, f_{\theta_0}(x)\} \tag{5.25} \]
and the test becomes
\[ \delta(x) = \begin{cases} 1 & \text{if } f'_{\theta_0}(x) > \tau\, f_{\theta_0}(x) \\ \eta & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.26} \]
where τ and η depend on α.
5.4 Maximum likelihood test
In the absence of the aforementioned optimal tests, we can design a test based directly on the likelihood ratios:
\[ \delta(x) = \begin{cases} 1 & \text{if } \frac{\max_{\theta \in T_1} f_\theta(x)}{\max_{\theta \in T_0} f_\theta(x)} > \tau \\ \eta & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.27} \]
5.5 Examples of signal detection schemes
To illustrate the common structure of these test designs, consider the following signal detection problem:
\[ \begin{cases} H_0 : X_j = N_j \\ H_1 : X_j = S_j + N_j \end{cases} \quad j = 1, \dots, n \tag{5.28} \]
Assume that the noise is i.i.d. with known probability density function f(x):
\[ \begin{cases} H_0 : X_j = N_j & \longrightarrow\ f_0(x) = \prod_j f(x_j) \\ H_1 : X_j = S_j + N_j & \longrightarrow\ f_1(x) = \prod_j f(x_j - s_j) \end{cases} \quad j = 1, \dots, n \tag{5.29} \]
The likelihood ratio becomes
\[ L(x) = \prod_j L_j(x_j) \quad\text{with}\quad L_j(x_j) = \frac{f(x_j - s_j)}{f(x_j)} \tag{5.30} \]
and the test becomes
\[ \delta(x) = \begin{cases} 1 & \text{if } \sum_j \log L_j(x_j) > \log\tau \\ 0/1 & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.31} \]
Figure 5.1: The structure of the optimal detector for an i.i.d. noise model: compute log L_j(x_j) for each sample, sum over j = 1, ..., n, and compare the sum with the threshold.
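The structure of Figure 5.1 can be written generically, with the noise model entering only through its log-density. This is an illustrative sketch (the names are hypothetical); any additive constant in the log-density cancels in the ratio.

```python
import numpy as np

def llr_detector(x, s, log_f, log_tau=0.0):
    # Decide H_1 iff sum_j [log f(x_j - s_j) - log f(x_j)] > log(tau)  (eq. 5.31)
    llr = np.sum(log_f(x - s) - log_f(x))
    return int(llr > log_tau)

log_f_gauss = lambda u: -0.5 * u ** 2      # Gaussian log-density, sigma = 1, up to a constant
s = 0.5 * np.ones(50)
print(llr_detector(np.zeros(50), s, log_f_gauss))  # -> 0 (data at the H_0 mean)
print(llr_detector(s, s, log_f_gauss))             # -> 1 (data at the H_1 mean)
```

Swapping in a Laplacian or Cauchy log-density changes only the `log_f` argument, not the detector structure.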
5.5.1 Case of Gaussian noise
In this case we have
\[ f(x_j) \propto \exp\left[-\frac{1}{2\sigma^2}\, x_j^2\right] \tag{5.32} \]
and the log likelihood ratios become
\[ \log L_j(x_j) = \log\frac{f(x_j - s_j)}{f(x_j)} = \frac{1}{\sigma^2}\left[s_j\left(x_j - \frac{s_j}{2}\right)\right] \tag{5.33} \]
Setting τ_1 = σ² log τ, we have the following structure for the optimal detector:
Figure 5.2: The structure of the optimal detector for an i.i.d. Gaussian noise model: shift each x_j by −s_j/2, multiply by s_j, sum over j, and compare with τ_1.
Absorbing the constant term ½ Σ_j s_j² into the threshold, we set τ_2 = τ_1 + ½ Σ_{j=1}^{n} s_j² and obtain the scheme of Figure 5.3.

Figure 5.3: The simplified structure of the optimal detector for an i.i.d. Gaussian noise model: correlate x_j with s_j, sum over j, and compare with τ_2.
5.5.2 Laplacian noise
In this case we have
\[ f(x_j) \propto \exp\left[-\alpha |x_j|\right] \tag{5.34} \]
and the log likelihood ratios become
\[ \log L_j(x_j) = \log\frac{f(x_j - s_j)}{f(x_j)} = -\alpha |x_j - s_j| + \alpha |x_j| = \begin{cases} -\alpha |s_j| & \text{if } \operatorname{sgn}(s_j)\, x_j < 0 \\ \alpha\, \operatorname{sgn}(s_j)(2x_j - s_j) & \text{if } 0 < \operatorname{sgn}(s_j)\, x_j < |s_j| \\ \alpha |s_j| & \text{if } \operatorname{sgn}(s_j)\, x_j > |s_j| \end{cases} \tag{5.35} \]

Figure 5.4: Bayesian detector in the case of i.i.d. Laplacian data: shift each x_j by −s_j/2, pass through a limiter, multiply by sgn(s_j), sum over j, and compare with τ.
5.5.3 Locally optimal detectors
Consider the following problem:
\[ \begin{cases} H_0 : X_j = N_j \\ H_1 : X_j = N_j + \theta s_j, & \theta > 0 \end{cases} \tag{5.36} \]
Recall that the α-level uniformly optimal test for this problem is:
\[ \delta(x) = \begin{cases} 1 & \text{if } L_\theta(x) > \tau \\ \eta & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.37} \]
and the α-level locally optimal test for this problem is:
\[ \delta(x) = \begin{cases} 1 & \text{if } \left.\frac{\partial L_\theta(x)}{\partial \theta}\right|_{\theta=\theta_0} > \tau \\ \eta & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.38} \]
where τ and η depend on α. For the i.i.d. noise model we have
\[ L_\theta(x) = \prod_{j=1}^{n} \frac{f(x_j - \theta s_j)}{f(x_j)} \tag{5.39} \]
Then it is easy to show that
\[ \left.\frac{\partial \log L_\theta(x)}{\partial \theta}\right|_{\theta=\theta_0} = \sum_{j=1}^{n} s_j\, g_{lo}(x_j) \tag{5.40} \]
where
\[ g_{lo}(x) = -\frac{\partial f(x)/\partial x}{f(x)} = -\frac{f'(x)}{f(x)} \tag{5.41} \]
and the structure of a locally optimal detector is given in the following figure.
Figure 5.5: The structure of a locally optimal detector for an i.i.d. noise model: pass each x_j through the nonlinearity g_lo(·), multiply by s_j, sum over j, and compare with τ.
It is clear from (5.41) that if we know the expression of the noise probability density function f(x), we can easily determine the expression of g_lo(x). For example:
• For a Gaussian noise model we have
\[ f(x) \propto \exp\left[-\frac{1}{2\sigma^2}\, x^2\right] \longrightarrow g_{lo}(x) = \frac{1}{\sigma^2}\, x \tag{5.42} \]
Figure 5.6: The structure of a locally optimal detector for an i.i.d. Gaussian noise model.
• For a Laplacian noise model we have
\[ f(x) \propto \exp\left[-\alpha |x|\right] \longrightarrow g_{lo}(x) = \alpha\, \operatorname{sgn}(x) \tag{5.43} \]
Figure 5.7: The structure of a locally optimal detector for an i.i.d. Laplacian noise model.
• For a Cauchy noise model we have
\[ f(x) \propto \frac{1}{1 + x^2} \longrightarrow g_{lo}(x) = \frac{2x}{1 + x^2} \tag{5.44} \]
Figure 5.8: The structure of a locally optimal detector for an i.i.d. Cauchy noise model.
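The three nonlinearities (5.42)-(5.44) differ mainly in how they treat large samples; a small numerical sketch (the outlier value is an arbitrary illustration):

```python
import numpy as np

def g_lo_gaussian(x, sigma2=1.0):
    return x / sigma2                      # eq. (5.42): linear

def g_lo_laplacian(x, alpha=1.0):
    return alpha * np.sign(x)              # eq. (5.43): hard limiter

def g_lo_cauchy(x):
    return 2.0 * x / (1.0 + x ** 2)        # eq. (5.44): redescending

def lo_statistic(x, s, g_lo):
    # Locally optimal statistic sum_j s_j g_lo(x_j)  (eq. 5.40)
    return float(np.sum(s * g_lo(x)))

x = np.array([0.1, -0.2, 10.0])            # the last sample is an outlier
s = np.ones(3)
print(lo_statistic(x, s, g_lo_gaussian))   # the outlier dominates the statistic
print(lo_statistic(x, s, g_lo_cauchy))     # the outlier contributes almost nothing
```

This is the practical reason heavy-tailed noise models lead to more robust detectors: the induced nonlinearity bounds or shrinks the influence of extreme samples.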
5.6 Detection of signals with unknown parameter
Here we consider the case where the transmitted signals depend on some unknown parameter θ:
\[ \begin{cases} H_0 : X_j = N_j + S_{0j}(\theta) \\ H_1 : X_j = N_j + S_{1j}(\theta) \end{cases}, \quad N \sim f(n) \tag{5.45} \]
The main quantity that we need to calculate is
\[ L(x) = \frac{E_1\{f(x - s_1(\theta))\}}{E_0\{f(x - s_0(\theta))\}} \tag{5.46} \]
In the following we consider the particular case of s_0 = 0, s_1 = s(θ) and i.i.d. noise. Then we have
\[ \begin{cases} H_0 : X_j = N_j \\ H_1 : X_j = N_j + S_j(\theta) \end{cases}, \quad N_j \sim f(n_j) \tag{5.47} \]
\[ L(x) = \int_\tau \frac{f(x - s(\theta))}{f(x)}\, p(\theta)\, d\theta = \int_\tau \prod_{j=1}^{n} \frac{f(x_j - s_j(\theta))}{f(x_j)}\, p(\theta)\, d\theta = \int_\tau L_\theta(x)\, p(\theta)\, d\theta \tag{5.48} \]
Example: Noncoherent detection.
Consider an amplitude modulated signal
\[ s_j(\theta) = a_j \sin[(j-1)\omega_c T_s + \theta], \quad j = 1, \dots, n, \qquad \omega_c T_s = m\, \frac{2\pi}{n} \tag{5.49} \]
where ω_c is the carrier frequency, T_s is the sampling period, m is the number of periods in the observation time [0, T = nT_s], and n/m is the number of samples per cycle. The carrier phase θ is unknown. With a uniform prior p(θ) = 1/(2π) and the i.i.d. Gaussian noise assumption we have
\[ L(x) = \frac{1}{2\pi} \int_0^{2\pi} \exp\left[\frac{1}{\sigma^2}\left(\sum_{j=1}^{n} x_j s_j(\theta) - \frac{1}{2}\sum_{j=1}^{n} s_j^2(\theta)\right)\right] d\theta \tag{5.50} \]
Using the trigonometric relations
\[ \sin(a+b) = \cos a \sin b + \sin a \cos b \tag{5.51} \]
\[ \sin^2(a) = \frac{1}{2} - \frac{1}{2}\cos(2a) \tag{5.52} \]
we obtain
\[ \sum_{j=1}^{n} x_j s_j(\theta) = x_c \sin\theta + x_s \cos\theta \tag{5.53} \]
with
\[ x_c \stackrel{\text{def}}{=} \sum_{j=1}^{n} a_j x_j \cos[(j-1)\omega_c T_s] \tag{5.54} \]
\[ x_s \stackrel{\text{def}}{=} \sum_{j=1}^{n} a_j x_j \sin[(j-1)\omega_c T_s] \tag{5.55} \]
From (5.49) and (5.52) we have
\[ \sum_{j=1}^{n} s_j^2(\theta) = \frac{1}{2}\sum_{j=1}^{n} a_j^2 - \frac{1}{2}\sum_{j=1}^{n} a_j^2 \cos[2(j-1)\omega_c T_s + 2\theta] \tag{5.56} \]
The second term is in general either equal to zero or negligible with respect to the first term.
Setting \overline{a^2} = \frac{1}{n}\sum_{j=1}^{n} a_j^2, we obtain
\[ L(x) = \exp\left[-\frac{n\overline{a^2}}{4\sigma^2}\right] \frac{1}{2\pi} \int_0^{2\pi} \exp\left[\frac{1}{\sigma^2}\left(x_c \sin\theta + x_s \cos\theta\right)\right] d\theta = \exp\left[-\frac{n\overline{a^2}}{4\sigma^2}\right] I_0\left(\frac{r}{\sigma^2}\right) \tag{5.57-5.58} \]
where r² = x_c² + x_s² and I_0 is the zeroth-order modified Bessel function, which is monotone increasing. Then the detection rule
\[ \delta(x) = \begin{cases} 1 & \text{if } L(x) > \tau \\ \gamma & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \longrightarrow \delta(x) = \begin{cases} 1 & \text{if } r > \tau' = \sigma^2\, I_0^{-1}\left(\tau \exp\left[\frac{n\overline{a^2}}{4\sigma^2}\right]\right) \\ \gamma & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.59} \]
The structure of this detector is given in the following figure.
Figure 5.9: Noncoherent (quadrature) detector: a local oscillator supplies cos[(j−1)ω_c T_s] and its π/2-shifted version sin[(j−1)ω_c T_s]; each branch multiplies by a_j x_j, sums over j, and squares; the squared sums are added to form r², which is compared with τ'².
Performance analysis:
We need to calculate probabilities of the form
\[ P_j(R > \tau') = P_j(R^2 > \tau'^2), \quad j = 0, 1 \tag{5.60} \]
with R² = X_c² + X_s².
Note that X_c and X_s are linear combinations of the X_j. So, if the X_j are Gaussian, X_c and X_s are Gaussian too. Under the hypothesis H_0, we have
\[ E[X_c|H_0] = E[X_s|H_0] = 0 \tag{5.61} \]
\[ \operatorname{Var}[X_c|H_0] = \operatorname{Var}[X_s|H_0] = \frac{n\sigma^2\overline{a^2}}{2} \tag{5.62} \]
\[ \operatorname{Cov}\{X_s, X_c|H_0\} = 0 \tag{5.63} \]
and
\[ P_0(\Gamma_1) = \iint_{x_c^2 + x_s^2 \ge \tau'^2} \frac{1}{n\pi\sigma^2\overline{a^2}} \exp\left[-\frac{1}{n\sigma^2\overline{a^2}}\left(x_c^2 + x_s^2\right)\right] dx_c\, dx_s \tag{5.64} \]
With the cartesian-to-polar coordinate change (x_c, x_s) → (r, θ) we obtain
\[ P_0(\Gamma_1) = \frac{1}{n\pi\sigma^2\overline{a^2}} \int_0^{2\pi} \int_{\tau'}^{\infty} r \exp\left[-\frac{r^2}{n\sigma^2\overline{a^2}}\right] dr\, d\theta = \exp\left[-\frac{\tau'^2}{n\sigma^2\overline{a^2}}\right] \tag{5.65} \]
Under the hypothesis H_1, noting that for a given value of θ, x|θ ∼ N(s(θ), σ²I), we have
\[ E[X_c|H_1, \Theta = \theta] = \frac{n\overline{a^2}}{2}\sin\theta \tag{5.66} \]
\[ E[X_s|H_1, \Theta = \theta] = \frac{n\overline{a^2}}{2}\cos\theta \tag{5.67} \]
\[ \operatorname{Var}[X_c|H_1, \Theta = \theta] = \operatorname{Var}[X_s|H_1, \Theta = \theta] = \frac{n\sigma^2\overline{a^2}}{2} \tag{5.68} \]
\[ \operatorname{Cov}\{X_s, X_c|H_1, \Theta = \theta\} = 0 \tag{5.69} \]
and
\[ p(x_c, x_s|H_1) = \frac{1}{2\pi} \int_0^{2\pi} \frac{1}{n\pi\sigma^2\overline{a^2}} \exp\left[-\frac{1}{n\sigma^2\overline{a^2}}\, q(x_c, x_s; n\overline{a^2}/2, \theta)\right] d\theta = p(x_c, x_s|H_0) \exp\left[-\frac{n\overline{a^2}}{4\sigma^2}\right] I_0\left(\frac{r}{\sigma^2}\right) \tag{5.70-5.71} \]
The detection probability then becomes
\[ P_D(\delta) = P_1(\Gamma_1) = \iint_{x_c^2 + x_s^2 \ge \tau'^2} p(x_c, x_s|H_1)\, dx_c\, dx_s = \frac{\exp\left[-\frac{n\overline{a^2}}{4\sigma^2}\right]}{n\pi\sigma^2\overline{a^2}} \int_0^{2\pi} \int_{\tau'}^{\infty} r \exp\left[-\frac{r^2}{n\sigma^2\overline{a^2}}\right] I_0\left(\frac{r}{\sigma^2}\right) dr\, d\theta \tag{5.72-5.73} \]
Setting b² = n\overline{a^2}/(2σ²) and τ_0 = τ'/(bσ²), and changing the variable to x = r/(bσ²), we obtain
\[ P_D(\delta) = P_1(\Gamma_1) = \int_{\tau_0}^{\infty} x \exp\left[-\frac{1}{2}\left(x^2 + b^2\right)\right] I_0(bx)\, dx \stackrel{\text{def}}{=} Q(b, \tau_0) \tag{5.74} \]
Q(b, τ_0) is called Marcum's Q-function. Note also that P_F(δ) = Q(0, τ_0). So for an α-level Neyman-Pearson detection test we have τ' = [nσ²\overline{a^2} \log(1/α)]^{1/2} and the probability of detection is given by
\[ P_D(\delta) = Q\left[b,\ \left(2\log(1/\alpha)\right)^{1/2}\right] \tag{5.75} \]
Note also that
\[ E\left[\frac{1}{n}\sum_{j=1}^{n} s_j^2(\theta)\right] = \overline{a^2}/2 \tag{5.76} \]
so b² = n\overline{a^2}/(2σ²) is a measure of the signal-to-noise ratio.
5.7 Sequential detection
\[ \Gamma = \{X_j,\ j = 1, 2, \dots\} \tag{5.77} \]
\[ \begin{cases} H_0 : X_j \sim P_0 \\ H_1 : X_j \sim P_1 \end{cases} \tag{5.78} \]
A sequential decision rule is a pair of sequences (Δ, δ) where:
• Δ = {Δ_j, j = 0, 1, 2, ...} is the sequence of stopping rules;
• δ = {δ_j, j = 0, 1, 2, ...} is the sequence of decision rules;
• Δ_j(x_1, ..., x_j) is a function from IR^j to {0, 1};
• δ_j(x_1, ..., x_j) is a decision rule on (IR^j, B^j);
• if Δ_n(x_1, ..., x_n) = 0, we take another sample;
• if Δ_n(x_1, ..., x_n) = 1, we stop sampling and make a decision;
• N = min{n | Δ_n(x_1, ..., x_n) = 1} is the stopping time;
• δ_N(x_1, ..., x_N) is the terminal decision rule;
• (Δ_0, δ_0) corresponds to the situation where we have not yet observed any data: Δ_0 = 0 means take at least one sample before making a decision, and Δ_0 = 1 means make a decision without taking any sample.
Note that N is a random variable depending on the data sequence. The terminal decision rule δ_N(x_1, ..., x_N) tells us which decision to make when we stop sampling.
The fixed-sample-size N decision rule can be defined as the following sequential detection rule:
\[ \Delta_j(x_1, \dots, x_j) = \begin{cases} 0 & \text{if } j \ne N \\ 1 & \text{if } j = N \end{cases} \tag{5.79} \]
\[ \delta_j(x_1, \dots, x_j) = \begin{cases} \delta(x_1, \dots, x_N) & \text{if } j = N \\ \text{arbitrary} & \text{if } j \ne N \end{cases} \tag{5.80} \]
In the following we consider only binary hypothesis testing, and we analyse the Bayesian approach with the prior distribution {π_0 = 1 − π_1, π_1} and the uniform cost function. We assume that an infinite number of i.i.d. observations is at our disposal; however, we assign a cost c > 0 to each sample, so that the cost of taking n samples is nc.
With these assumptions, the conditional risks for a given sequential decision rule are:
\[ R_0(\Delta, \delta) = E_0\{\delta_N(X_1, \dots, X_N)\} + c\, E_0\{N\} \tag{5.81} \]
\[ R_1(\Delta, \delta) = 1 - E_1\{\delta_N(X_1, \dots, X_N)\} + c\, E_1\{N\} \tag{5.82} \]
where the subscripts denote the hypothesis under which the expectation is computed and N is the stopping time. The Bayes risk is thus given by
\[ r(\Delta, \delta) = (1 - \pi_1)\, R_0(\Delta, \delta) + \pi_1\, R_1(\Delta, \delta) \tag{5.83} \]
and the sequential Bayesian rule is the one which minimizes r(Δ, δ). To analyse the structure of this optimal rule we define
\[ V^*(\pi_1) \stackrel{\text{def}}{=} \min_{(\Delta, \delta):\, \Delta_0 = 0} r(\Delta, \delta), \quad 0 \le \pi_1 \le 1. \tag{5.84} \]

Figure 5.10: Sequential detection: the concave curve V*(π_1), with V*(0) = V*(1) = c, intersects the two no-sample risk lines at the abscissae π_L and π_U.
Since Δ_0 = 0 means that the test does not stop with zero observations, V*(π_1) corresponds to the minimum Bayes risk over all sequential tests that take at least one sample. V*(π_1) is in general concave and continuous, and V*(0) = V*(1) = c. Now let us plot this function together with the two specific sequential tests:
• take no sample and decide H_0, i.e., Δ_0 = 1, δ_0 = 0; and
• take no sample and decide H_1, i.e., Δ_0 = 1, δ_0 = 1.
The Bayes risks for these tests are
\[ r(\Delta, \delta)|_{\Delta_0=1,\, \delta_0=0} = \pi_1, \qquad r(\Delta, \delta)|_{\Delta_0=1,\, \delta_0=1} = 1 - \pi_1 \]
since deciding H_0 without sampling is wrong exactly when H_1 holds, and conversely. These are the only two Bayesian tests not included in the minimization of (5.84). We denote by π_L and π_U respectively the abscissae of the intersections of these two risk lines with V*(π_1).
Now, by inspection of these plots, we see that the Bayes rule with a fixed given prior π_1 is:
• (Δ_0 = 1, δ_0 = 0) if π_1 ≤ π_L;
• (Δ_0 = 1, δ_0 = 1) if π_1 ≥ π_U;
• the decision rule which minimizes the Bayes risk among all tests with Δ_0 = 0, if π_L ≤ π_1 ≤ π_U.
In the first two cases the test stops. In the third, we know that the optimal test takes at least one more sample. After doing so, we face a similar situation as before, except that we now have more information due to the additional sample. In particular, the prior π_1 is replaced by π_1(x_1) = Pr{H = H_1 | X_1 = x_1}, the posterior probability of H_1 given the observation X_1 = x_1. We can apply this argument to any number of samples. We then have the following rules:
• Stopping rule:
\[ \Delta_n(x_1, \dots, x_n) = \begin{cases} 0 & \text{if } \pi_L < \pi_1(x_1, \dots, x_n) < \pi_U \\ 1 & \text{otherwise.} \end{cases} \tag{5.85} \]
• Terminal decision rule:
\[ \delta_n(x_1, \dots, x_n) = \begin{cases} 0 & \text{if } \pi_1(x_1, \dots, x_n) \le \pi_L \\ 1 & \text{if } \pi_1(x_1, \dots, x_n) \ge \pi_U. \end{cases} \tag{5.86} \]
It has been proved that under mild conditions the posterior probability π_1(x_1, ..., x_n) converges almost surely to 1 under H_1 and to 0 under H_0; thus the test terminates with probability one. Knowledge of the probabilities π_L and π_U, together with an algorithm to compute π_1(x_1, ..., x_n), is sufficient to define this rule. The computation of π_1(x_1, ..., x_n) is quite easy, but unfortunately it is very difficult to obtain π_L and π_U exactly.
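The posterior π_1(x_1, ..., x_n) just mentioned can be computed directly from the likelihood ratio, as in (5.87) below. The following sketch is illustrative (the Bernoulli model and its parameters are hypothetical choices):

```python
def posterior_h1(xs, f0, f1, pi1):
    # pi_1(x_1..x_n) = pi_1 lambda_n / (pi_0 + pi_1 lambda_n)  (eqs. 5.87-5.88)
    lam = 1.0
    for x in xs:
        lam *= f1(x) / f0(x)               # running likelihood ratio lambda_n
    return pi1 * lam / ((1.0 - pi1) + pi1 * lam)

# Bernoulli observations: H0: P(x=1) = 0.2, H1: P(x=1) = 0.8.
f0 = lambda x: 0.2 if x == 1 else 0.8
f1 = lambda x: 0.8 if x == 1 else 0.2
print(posterior_h1([1, 1, 0, 1], f0, f1, pi1=0.5))  # mostly ones: posterior favors H1
```

In a long run, the likelihood ratio should be accumulated in log form to avoid underflow; the plain product is kept here for readability.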
Now consider the case where the two processes P_0 and P_1 have densities f_0 and f_1. Then the Bayes formula yields
\[ \pi_1(x_1, \dots, x_n) = \frac{\pi_1 \prod_{j=1}^{n} f_1(x_j)}{\pi_0 \prod_{j=1}^{n} f_0(x_j) + \pi_1 \prod_{j=1}^{n} f_1(x_j)} = \frac{\pi_1\, \lambda_n(x_1, \dots, x_n)}{\pi_0 + \pi_1\, \lambda_n(x_1, \dots, x_n)} \tag{5.87} \]
where
\[ \lambda_n(x_1, \dots, x_n) = \prod_{j=1}^{n} \frac{f_1(x_j)}{f_0(x_j)} \tag{5.88} \]
is the likelihood ratio based on n samples. Noting that π_1(x_1, ..., x_n) is an increasing function of λ_n(x_1, ..., x_n), we can rewrite (5.85) and (5.86) as:
Figure 5.11: Stopping rule in sequential detection: sampling continues while the posterior probability π_1(x_1, ..., x_n) remains between π_L and π_U, and stops (here at N = 6) when it crosses a boundary.
• Stopping rule:
\[ \Delta_n(x_1, \dots, x_n) = \begin{cases} 0 & \text{if } \underline{\pi} < \lambda_n(x_1, \dots, x_n) < \overline{\pi} \\ 1 & \text{otherwise.} \end{cases} \tag{5.89} \]
• Terminal decision rule:
\[ \delta_n(x_1, \dots, x_n) = \begin{cases} 0 & \text{if } \lambda_n(x_1, \dots, x_n) \le \underline{\pi} \\ 1 & \text{if } \lambda_n(x_1, \dots, x_n) \ge \overline{\pi}. \end{cases} \tag{5.90} \]
where
\[ \underline{\pi} \stackrel{\text{def}}{=} \frac{\pi_0\, \pi_L}{\pi_1 (1 - \pi_L)} \quad\text{and}\quad \overline{\pi} \stackrel{\text{def}}{=} \frac{\pi_0\, \pi_U}{\pi_1 (1 - \pi_U)}. \tag{5.91} \]
In conclusion, the Bayesian sequential test takes samples as long as the likelihood ratio λ_n(x_1, ..., x_n) stays inside the interval [\underline{\pi}, \overline{\pi}], and decides H_0 or H_1 as soon as it falls outside this interval.
The main problem in practical situations is to fix the values of the boundaries a = \underline{\pi} and b = \overline{\pi}. This test is called the sequential probability ratio test with boundaries a and b and is denoted SPRT(a, b).
The following theorem gives some of the optimality properties of SPRT(a, b).

Figure 5.12: Stopping rule in SPRT(a, b): sampling continues while the likelihood ratio λ_n(x_1, ..., x_n) remains between a and b (here the test stops at N = 6).

Wald-Wolfowitz theorem: Denote by
\[ N(\Delta) = \min\{n \,|\, \Delta_n(x_1, \dots, x_n) = 1\} \]
the stopping time and by
\[ P_F(\Delta, \delta) = \Pr\{\delta_N(x_1, \dots, x_N) = 1 \,|\, H = H_0\}, \qquad P_M(\Delta, \delta) = \Pr\{\delta_N(x_1, \dots, x_N) = 0 \,|\, H = H_1\} \]
the false alarm and miss probabilities, and let (Δ*, δ*) be the SPRT(a, b). Then, for any sequential decision rule (Δ, δ) for which
\[ P_F(\Delta, \delta) \le P_F(\Delta^*, \delta^*) \quad\text{and}\quad P_M(\Delta, \delta) \le P_M(\Delta^*, \delta^*) \]
we have
\[ E[N(\Delta) \,|\, H = H_j] \ge E[N(\Delta^*) \,|\, H = H_j], \quad j = 0, 1. \]
The validity of the Wald-Wolfowitz theorem is a consequence of the Bayes optimality of SPRT(a, b). The results of this theorem and other related theorems are summarized in the following items:
• For a given performance, no sequential decision rule has a smaller expected sample size than the SPRT(a, b) achieving that performance.
• The average sample size of SPRT(a, b) is not greater than the sample size of a fixed-sample-size test with the same performance.
• For a given expected sample size, no sequential decision rule has smaller error probabilities than the SPRT(a, b).
Two main questions remains:
• How to choose a and b to yield a desired level of performance?
• How to evaluate the expected sample size of a sequential detector?
The following result gives an answer to the first one.
Let (∆, δ) = SPART (a, b) with a < 1 < b, and let α = P_F(∆, δ), γ = P_M(∆, δ) and N = N(∆). The rejection region of (∆, δ) is

    Γ_1 = { x ∈ IR^∞ : λ_N(x_1, ..., x_N) ≥ b } = ∪_{n=1}^∞ Q_n    (5.92)

with

    Q_n = { x ∈ IR^∞ : N = n, λ_n(x_1, ..., x_n) ≥ b }    (5.93)

Since Q_n and Q_m are mutually exclusive sets for m ≠ n, we have

    α = Pr{ λ_N(x_1, ..., x_N) ≥ b | H = H_0 } = ∑_{n=1}^∞ ∫_{Q_n} ∏_{j=1}^n f_0(x_j) dx_j    (5.94)

On Q_n we have

    ∏_{j=1}^n f_0(x_j) ≤ (1/b) ∏_{j=1}^n f_1(x_j)    (5.95)

So we obtain

    α ≤ (1/b) ∑_{n=1}^∞ ∫_{Q_n} ∏_{j=1}^n f_1(x_j) dx_j = (1/b) Pr{ λ_N(x_1, ..., x_N) ≥ b | H = H_1 } = (1/b)(1 − γ)    (5.96)

and in the same manner we obtain

    γ = Pr{ λ_N(x_1, ..., x_N) ≤ a | H = H_1 } ≤ a (1 − α)    (5.97)

From these two relations we deduce

    b ≤ (1 − γ)/α,   a ≥ γ/(1 − α)    (5.98)

The following choice is called Wald's approximation:

    b ≃ (1 − γ)/α,   a ≃ γ/(1 − α)    (5.99)
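Wald's approximation can be checked by simulation. The following sketch (not from the text; the Gaussian shift pair H0: N(0, 1) versus H1: N(1, 1) and all numeric values are illustrative choices) runs the sequential test with the thresholds of (5.99) and estimates the resulting error probabilities, which should come out near (in fact, because of the overshoot, at or below) the targets.

```python
import math, random

def sprt(sample_fn, a, b, mu0=0.0, mu1=1.0, max_n=100000):
    """Sequential test with thresholds (a, b) for H0: N(mu0,1) vs H1: N(mu1,1).
    Returns 1 (accept H1, lambda_n >= b) or 0 (accept H0, lambda_n <= a)."""
    log_lam = 0.0
    for _ in range(max_n):
        x = sample_fn()
        # increment of log lambda_n: log f1(x) - log f0(x) for unit variance
        log_lam += (mu1 - mu0) * x - 0.5 * (mu1**2 - mu0**2)
        if log_lam >= math.log(b):
            return 1
        if log_lam <= math.log(a):
            return 0
    return 1 if log_lam >= 0 else 0   # safety fallback, essentially never hit

alpha_t, gamma_t = 0.05, 0.05        # target P_F and P_M
b = (1 - gamma_t) / alpha_t          # Wald's approximation (5.99)
a = gamma_t / (1 - alpha_t)

rng = random.Random(0)
trials = 2000
false_alarms = sum(sprt(lambda: rng.gauss(0.0, 1.0), a, b) == 1
                   for _ in range(trials))
misses = sum(sprt(lambda: rng.gauss(1.0, 1.0), a, b) == 0
             for _ in range(trials))
print(false_alarms / trials, misses / trials)
```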
To be completed later
5.8 Robust detection
To be completed later
Chapter 6
Elements of parameter estimation
6.1 Bayesian parameter estimation
Throughout this chapter we assume that the data are samples of a parametrically known process {P_θ; θ ∈ τ}, where P_θ denotes a distribution on the observation space (Γ, G):

    X ∼ P_θ(x)    (6.1)

The goal of the parameter estimation problem is to find a function θ̂(x) : Γ → τ such that θ̂(x) is the best guess of the true value of θ. Of course, the solution depends on a goodness criterion. As in the hypothesis testing problems, we have to define a cost function c[θ̂(X), θ] : τ × τ → IR+ such that c[a, θ] is the cost of estimating the true value θ by a.
Then, as in the hypothesis testing problems, we can define the conditional risk function

    R_θ(θ̂) = E_θ{ c[θ̂(X), θ] } = E[ c[θ̂(X), Θ] | Θ = θ ] = ∫_Γ c[θ̂(x), θ] f_θ(x) dx    (6.2)

and the Bayes risk

    r(θ̂) = E[ R_Θ(θ̂(X)) ]
         = ∫_τ ∫_Γ c[θ̂(x), θ] f_θ(x) π(θ) dx dθ
         = ∫_Γ [ ∫_τ c[θ̂(x), θ] π(θ|x) dθ ] m(x) dx
         = E[ E[ c[θ̂(X), Θ] | X = x ] ]    (6.3)

where m(x) = ∫_τ f_θ(x) π(θ) dθ is the marginal distribution of X. From this relation, and the fact that in general the cost function is positive, we see that r(θ̂) is minimized over θ̂ when, for any x ∈ Γ, the mean posterior cost

    c̄[x] = E[ c[θ̂(X), Θ] | X = x ] = ∫_τ c[θ̂(x), θ] π(θ|x) dθ    (6.4)

is minimized.
It is clear that the resulting estimate depends on the choice of the cost function. In the following sections we first consider the case of a scalar parameter and then extend it to the vector parameter case.
6.1.1 Minimum-Mean-Squared-Error
In the case where τ = IR, a commonly used cost function is

    c[a, θ] = c(a − θ) = (a − θ)²,   (a, θ) ∈ IR²    (6.5)

The corresponding Bayes risk is E[(θ̂(X) − Θ)²], a quantity known as the Mean-Squared-Error (MSE). The corresponding Bayes estimate is called the Minimum-Mean-Squared-Error (MMSE) estimator.
The posterior cost is given by

    E[(θ̂(X) − Θ)² | X = x] = E[θ̂²(X) | X = x] − 2 E[θ̂(X) Θ | X = x] + E[Θ² | X = x]
                            = [θ̂(x)]² − 2 θ̂(x) E[Θ | X = x] + E[Θ² | X = x]    (6.6)

This expression is a quadratic function of θ̂(x) and its minimum is attained at

    θ̂_MMSE(x) = E[Θ | X = x]    (6.7)

Thus the MMSE estimate is the mean of the posterior probability density function. This estimate is also called the posterior mean (PM) estimate.
6.1.2 Minimum-Mean-Absolute-Error
In the case where τ = IR, another commonly used cost function is

    c[a, θ] = c(a − θ) = |a − θ|,   (a, θ) ∈ IR²    (6.8)

The corresponding Bayes risk is E[|θ̂(X) − Θ|], a quantity known as the Mean-Absolute-Error (MAE). The corresponding Bayes estimate is called the Minimum-Mean-Absolute-Error (MMAE) estimator.
The posterior cost is given by

    E[|θ̂(x) − Θ| | X = x] = ∫_0^∞ Pr{ |θ̂(x) − Θ| > z | X = x } dz
                          = ∫_0^∞ Pr{ Θ > z + θ̂(x) | X = x } dz + ∫_0^∞ Pr{ Θ < −z + θ̂(x) | X = x } dz    (6.9)

Making the change of variable t = z + θ̂(x) in the first integral and t = −z + θ̂(x) in the second one, we obtain

    E[|θ̂(x) − Θ| | X = x] = ∫_{θ̂(x)}^∞ Pr{ Θ > t | X = x } dt + ∫_{−∞}^{θ̂(x)} Pr{ Θ < t | X = x } dt    (6.10)

This expression is differentiable with respect to θ̂(x) and

    ∂E[|θ̂(x) − Θ| | X = x] / ∂θ̂(x) = Pr{ Θ < θ̂(x) | X = x } − Pr{ Θ > θ̂(x) | X = x }    (6.11)

This derivative is a nondecreasing function of θ̂(x) which approaches −1 as θ̂(x) → −∞ and +1 as θ̂(x) → +∞. The minimum of (6.10) is therefore achieved at the point θ̂(x) where the derivative vanishes. Consequently, the Bayes estimate satisfies

    Pr{ Θ < t | X = x } ≤ Pr{ Θ > t | X = x },   t < θ̂(x)

and

    Pr{ Θ < t | X = x } ≥ Pr{ Θ > t | X = x },   t > θ̂(x)

or

    Pr{ Θ < θ̂(x) | X = x } = Pr{ Θ > θ̂(x) | X = x }    (6.12)

that is, θ̂(x) is the median of the posterior distribution of Θ given X = x:

    θ̂_MMAE(x) = median of π(θ | X = x)    (6.13)
6.1.3 Maximum A Posteriori (MAP) estimation
Another commonly used cost function in the case where τ = IR is

    c[a, θ] = c(a − θ) = { 0 if |a − θ| ≤ ∆ ;  1 if |a − θ| > ∆ }    (6.14)

where ∆ is a positive real number. The corresponding posterior cost is given by

    E[ c[θ̂(X) − Θ] | X = x ] = Pr{ |θ̂(x) − Θ| > ∆ | X = x } = 1 − Pr{ |θ̂(x) − Θ| ≤ ∆ | X = x }    (6.15)

To minimize this expression we consider two cases:

• Θ is a discrete random variable taking its values in a finite set τ = {θ_1, ..., θ_M} such that |θ_i − θ_j| > ∆ for any i ≠ j. Then we have

    E[ c[θ̂(X), Θ] | X = x ] = 1 − Pr{ Θ = θ̂(x) | X = x } = 1 − π(θ̂(x) | x)    (6.16)

where π(θ | x) is the posterior distribution of Θ given X = x. The estimate is the value of Θ which has the maximum a posteriori probability:

    θ̂_MAP = arg max_{θ ∈ τ} { π(θ | x) }    (6.17)

• Θ is a continuous random variable. In this case, we have

    E[ c[θ̂(x), Θ] | X = x ] = 1 − ∫_{θ̂(x)−∆}^{θ̂(x)+∆} π(θ | x) dθ    (6.18)

If we assume that the posterior probability distribution π(θ | x) is continuous and smooth and that ∆ is sufficiently small, then we can write

    E[ c[θ̂(x), Θ] | X = x ] ≃ 1 − 2∆ π(θ̂(x) | x)    (6.19)

and again we have

    θ̂_MAP = arg max_{θ ∈ τ} { π(θ | x) }    (6.20)
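The three estimates of this section (posterior mean, posterior median and posterior mode) can be computed numerically from any posterior given on a grid. A minimal sketch (not from the text; the Gamma-shaped posterior below is an arbitrary illustration):

```python
import math

def bayes_estimates(theta, post):
    """theta: grid points; post: unnormalized posterior values on the grid.
    Returns (MMSE, MMAE, MAP) = (posterior mean, median, mode)."""
    d = theta[1] - theta[0]
    z = sum(post) * d
    p = [v / z for v in post]                        # normalize the posterior
    mmse = sum(t * v for t, v in zip(theta, p)) * d  # posterior mean (6.7)
    acc, mmae = 0.0, theta[-1]
    for t, v in zip(theta, p):                       # posterior median (6.13)
        acc += v * d
        if acc >= 0.5:
            mmae = t
            break
    mmap = max(zip(theta, p), key=lambda tv: tv[1])[0]  # posterior mode (6.20)
    return mmse, mmae, mmap

# example: pi(theta|x) proportional to theta*exp(-2*theta), a Gamma(2, 2) shape
grid = [i * 0.001 for i in range(1, 10000)]
post = [t * math.exp(-2 * t) for t in grid]
mmse, mmae, mmap = bayes_estimates(grid, post)
print(mmse, mmae, mmap)   # mean = 1.0, median ~ 0.839, mode = 0.5
```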
Example 1: Estimation of the parameter of an exponential distribution.
Suppose both distributions f_θ(x) and π(θ) are exponential:

    f_θ(x) = { θ exp[−θx] if x > 0 ;  0 otherwise }    (6.21)

and

    π(θ) = { α exp[−αθ] if θ > 0 ;  0 otherwise }    (6.22)

Note that

    E[X | θ] = 1/θ,   Var{X | θ} = E[(X − 1/θ)² | θ] = 1/θ²    (6.23)

and

    E[Θ] = 1/α,   Var{Θ} = E[(Θ − 1/α)²] = 1/α²    (6.24)

Then, we can calculate the joint distribution φ(x, θ):

    φ(x, θ) = { α θ exp[−θx − αθ] if θ > 0, x > 0 ;  0 otherwise }    (6.25)

The marginal distribution m(x) is given by

    m(x) = { ∫_0^∞ α θ exp[−(α + x)θ] dθ = α/(α + x)² if x > 0 ;  0 otherwise }    (6.26)

and the posterior distribution π(θ|x) is given by

    π(θ|x) = { α θ exp[−(α + x)θ] / m(x) = (α + x)² θ exp[−(α + x)θ] if θ > 0 ;  0 otherwise }    (6.27)

Note that we have

    E[Θ | X = x] = 2/(α + x),   Var{Θ | X = x} = 2/(α + x)²    (6.28)

The MMSE estimator is given by

    θ̂_MMSE(x) = E[Θ | X = x] = ∫_0^∞ θ π(θ|x) dθ = ∫_0^∞ (α + x)² θ² exp[−(α + x)θ] dθ = 2/(α + x)    (6.29)

and the corresponding minimum Bayes risk is

    MMSE = r(θ̂_MMSE) = E[Var{Θ | X}] = ∫_0^∞ [2/(α + x)²] m(x) dx = ∫_0^∞ 2α/(α + x)^4 dx = 2/(3α²)    (6.30)

The MMAE estimate θ̂_ABS(x) is such that

    ∫_{θ̂_ABS}^∞ π(θ|x) dθ = [1 + (α + x) θ̂_ABS] exp[−(α + x) θ̂_ABS] = 1/2    (6.31)

It is easily shown that

    θ̂_ABS(x) = T_0/(α + x)    (6.32)

where T_0 is the solution of

    (1 + T_0) exp[−T_0] = 1/2   →   T_0 ≃ 1.68

To calculate θ̂_MAP(x), we can remark that

    ∂ log π(θ|x)/∂θ = 1/θ − (α + x),   ∂² log π(θ|x)/∂θ² = −1/θ² < 0    (6.33)

So we have

    θ̂_MAP(x) = 1/(α + x)    (6.34)

The following table summarizes these estimates:

    θ̂_MMSE(x) = 2/(α + x),   θ̂_ABS(x) = T_0/(α + x),   θ̂_MAP(x) = 1/(α + x)    (6.35)
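The closed forms of (6.35) can be verified numerically. The following sketch (with arbitrary values for α and x) solves for T_0 by bisection and checks that the median estimate really splits the Gamma(2, α + x) posterior mass in half:

```python
import math

def posterior_cdf(t, s):
    # integral_0^t s^2 u exp(-s u) du = 1 - (1 + s t) exp(-s t)
    return 1.0 - (1.0 + s * t) * math.exp(-s * t)

def solve_T0(lo=0.0, hi=10.0):
    # bisection for (1 + T0) exp(-T0) = 1/2
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if (1 + mid) * math.exp(-mid) > 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha, x = 1.0, 3.0              # illustrative values
s = alpha + x
T0 = solve_T0()
theta_mmse = 2.0 / s             # posterior mean, eq. (6.29)
theta_abs = T0 / s               # posterior median, eq. (6.32)
theta_map = 1.0 / s              # posterior mode, eq. (6.34)
print(T0, theta_mmse, theta_abs, theta_map)
print(posterior_cdf(theta_abs, s))   # ~ 0.5: the median splits the mass
```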
Example 2: Estimation of the location parameter of a Gaussian distribution.
Assume X|Θ and Θ both have Gaussian distributions:

    X | Θ = θ ∼ f_θ(x) = N(θ, σ²)
    Θ ∼ π(θ) = N(θ_0, σ_θ²)

Then the joint distribution φ(x, θ) = f_θ(x) π(θ) is jointly Gaussian, and the marginal distribution m(x) and the posterior distribution π(θ|x) are

    X ∼ m(x) = N(θ_0, σ_x²),   with σ_x² = σ² + σ_θ²
    Θ | X = x ∼ π(θ|x) = N(θ̄, σ̄²)

with

    θ̄ = (σ²/σ_x²) θ_0 + (σ_θ²/σ_x²) x,   σ̄² = σ² σ_θ² / σ_x²

In this case all the estimators are equal and we have

    θ̂_MMSE(x) = θ̂_ABS(x) = θ̂_MAP(x) = θ̄ = (σ²/(σ² + σ_θ²)) θ_0 + (σ_θ²/(σ² + σ_θ²)) x    (6.36)

They also have the same posterior variance

    σ̄² = σ² σ_θ² / σ_x² = σ² σ_θ² / (σ² + σ_θ²)    (6.39)

Note also the following limit cases:

    when σ² → 0 then θ̄ → x;   when σ_θ² → 0 then θ̄ → θ_0.
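The weighting in (6.36) can be sanity-checked by self-normalized importance sampling: draw θ from the prior, weight by the likelihood, and compare the weighted moments with the closed-form posterior mean and variance. A sketch with invented numbers:

```python
import math, random

sigma2, sigma2_t, theta0 = 0.5, 2.0, 1.0     # sigma^2, sigma_theta^2, prior mean
x_obs = 3.0                                  # one observation

w = sigma2_t / (sigma2 + sigma2_t)           # weight on x in (6.36)
post_mean = w * x_obs + (1 - w) * theta0
post_var = sigma2 * sigma2_t / (sigma2 + sigma2_t)

rng = random.Random(1)
num = num2 = den = 0.0
for _ in range(200000):
    th = rng.gauss(theta0, math.sqrt(sigma2_t))        # theta ~ prior
    lik = math.exp(-0.5 * (x_obs - th) ** 2 / sigma2)  # f(x|theta), unnormalized
    num += th * lik
    num2 += th * th * lik
    den += lik
mc_mean = num / den
mc_var = num2 / den - mc_mean ** 2
print(post_mean, mc_mean)   # both ~ 2.6 for these values
print(post_var, mc_var)     # both ~ 0.4
```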
Example 3: Estimation of the parameter of a Binomial distribution.
Assume that

    X | θ ∼ f(x|θ) = Bin(x|n, θ) = C_n^x θ^x (1 − θ)^{n−x}
    Θ ∼ π(θ) = Beta(θ|α, β) = (1/B(α, β)) θ^{α−1} (1 − θ)^{β−1}

Then the joint distribution φ(x, θ), the marginal distribution m(x) and the posterior distribution π(θ|x) are

    (X, Θ) ∼ φ(x, θ) = (C_n^x / B(α, β)) θ^{α+x−1} (1 − θ)^{β+n−x−1}
    X ∼ m(x) = (C_n^x / B(α, β)) B(α + x, β + n − x)
             = C_n^x [Γ(α + β) / (Γ(α) Γ(β))] [Γ(α + x) Γ(β + n − x) / Γ(α + β + n)]
    Θ | X = x ∼ π(θ|x) = (1/B(α + x, β + n − x)) θ^{α+x−1} (1 − θ)^{β+n−x−1}
                       = Beta(θ|α + x, β + n − x)
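The conjugacy claim of Example 3 can be checked directly: applying Bayes' rule on a grid should reproduce the Beta(α + x, β + n − x) density. A sketch with assumed example values:

```python
import math

def beta_pdf(t, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return t ** (a - 1) * (1 - t) ** (b - 1) / B

alpha, beta, n, x = 2.0, 3.0, 10, 4        # illustrative values
grid = [i / 1000 for i in range(1, 1000)]

# unnormalized posterior = binomial likelihood * Beta prior on the grid
unnorm = [math.comb(n, x) * t**x * (1 - t)**(n - x) * beta_pdf(t, alpha, beta)
          for t in grid]
z = sum(unnorm) / 1000                     # crude normalizing integral
closed = [beta_pdf(t, alpha + x, beta + n - x) for t in grid]

err = max(abs(u / z - c) for u, c in zip(unnorm, closed))
print(err)   # small: the normalized product is the Beta(6, 9) density
```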
6.2 Other cost functions and related estimators

Here are some other cost functions and the corresponding Bayesian estimator expressions.

    Name                    c[a, θ]                                            θ̂
    Quadratic               q (a − θ)²                                         θ̂ = E[Θ|x] = ∫ θ π(θ|x) dθ
    Weighted quadratic      ω(θ) (a − θ)²                                      θ̂ = E[ω(Θ) Θ|x] / E[ω(Θ)|x] = ∫ ω(θ) θ π(θ|x) dθ / ∫ ω(θ) π(θ|x) dθ
    Absolute                |a − θ|                                            ∫_{−∞}^{θ̂} π(θ|x) dθ = ∫_{θ̂}^{+∞} π(θ|x) dθ = 1/2
    Nonsymmetric absolute   k_1 (a − θ) if θ ≤ a ;  k_2 (θ − a) if θ ≥ a       k_1 ∫_{−∞}^{θ̂} π(θ|x) dθ = k_2 ∫_{θ̂}^{+∞} π(θ|x) dθ
    Linex                   β [exp[α(a − θ)] − α(a − θ) − 1],  α ≠ 0, β > 0    θ̂ = −(1/α) log( E[exp[−αΘ]|x] ) = −(1/α) log[ ∫ exp[−αθ] π(θ|x) dθ ]
    —                       (a − θ)²/θ                                         θ̂ = 1/E[1/Θ|x] = 1 / ∫ (1/θ) π(θ|x) dθ
    —                       (a − θ)²/a                                         θ̂ = sqrt( E[Θ²|x] ) = sqrt( ∫ θ² π(θ|x) dθ )
    —                       −ln[a(θ)]                                          â(θ) = π(θ|x)

Table 6.1: Some cost functions and the corresponding Bayes estimators.
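Two of the less familiar rows of Table 6.1 can be checked numerically: the Linex estimate −(1/α) log E[e^{−αΘ}|x] and the estimate 1/E[1/Θ|x] for the cost (a − θ)²/θ. The sketch below (with an arbitrary Gamma-shaped grid posterior and illustrative α) verifies that each formula beats small perturbations of itself under its own posterior cost:

```python
import math

# an arbitrary grid posterior: Gamma(3, 2) shape, pi(theta|x) ~ t^2 exp(-2t)
h = 0.001
grid = [i * h for i in range(1, 15000)]
w = [t**2 * math.exp(-2 * t) for t in grid]
z = sum(w) * h
p = [v / z for v in w]

def post_mean(f):
    """E[f(Theta)|x] on the grid posterior."""
    return sum(f(t) * v for t, v in zip(grid, p)) * h

alpha = 0.7
theta_linex = -(1 / alpha) * math.log(post_mean(lambda t: math.exp(-alpha * t)))
theta_inv = 1.0 / post_mean(lambda t: 1.0 / t)

def linex_risk(a):
    return post_mean(lambda t: math.exp(alpha * (a - t)) - alpha * (a - t) - 1)

def inv_quad_risk(a):
    return post_mean(lambda t: (a - t)**2 / t)

# each estimate should beat small perturbations of itself
for th, risk in ((theta_linex, linex_risk), (theta_inv, inv_quad_risk)):
    assert risk(th) <= risk(th + 0.05) and risk(th) <= risk(th - 0.05)
print(theta_linex, theta_inv)
```

For a Gamma(3, 2) posterior the closed forms give E[1/Θ|x] = 1, so `theta_inv` is 1, and the Linex estimate is (3/α) ln(1 + α/2) ≈ 1.286 for α = 0.7.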
6.3 Examples of posterior calculation

In the previous sections, we saw that the computation of Bayesian estimates requires the posterior probability distribution. The following table gives a summary of the expressions of the posterior probability distributions in some classical cases.

    Observation law f(x|θ)           Prior law π(θ)          Marginal law m(x) = ∫ f(x|θ) π(θ) dθ        Posterior law π(θ|x) = f(x|θ) π(θ) / m(x)

    Discrete variables
    Binomial Bin(x|n, θ)             Beta Bet(θ|α, β)        Binomial-Beta BinBet(x|α, β, n)             Beta Bet(θ|α + x, β + n − x)
    Negative Binomial NegBin(x|n, θ) Beta Bet(θ|α, β)        Negative Binomial-Beta NegBinBet(x|α, β, n) Beta Bet(θ|α + n, β + x)
    Poisson Pn(x|θ)                  Gamma Gam(θ|α, β)       Poisson-Gamma PnGam(x|α, β, 1)              Gamma Gam(θ|α + x, β + 1)

    Continuous variables
    Gamma Gam(x|ν, θ)                Gamma Gam(θ|α, β)       Gamma-Gamma GamGam(x|α, β, ν)               Gamma Gam(θ|α + ν, β + x)
    Exponential Ex(x|θ)              Gamma Gam(θ|α, β)       Pareto Par(x|α, β)                          Gamma Gam(θ|α + 1, β + x)
    Normal N(x|θ, σ²)                Normal N(θ|µ, τ²)       Normal N(x|µ, σ² + τ²)                      Normal N(θ | (µσ² + τ²x)/(σ² + τ²), σ²τ²/(σ² + τ²))
    Normal N(x|µ, λθ)                Gamma Gam(θ|α/2, α/2)   Student (t) St(x|µ, λ, α)                   Gamma Gam(θ | (α + 1)/2, α/2 + (µ − x)²/2)

Table 6.2: Relations between the data, a priori, marginal and a posteriori distributions.
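Any row of Table 6.2 can be verified the same way as in Example 3. The sketch below (with arbitrary example values) checks the Poisson-Gamma row: a Poisson observation with a Gamma(α, β) prior yields a Gamma(α + x, β + 1) posterior.

```python
import math

def gamma_pdf(t, a, b):
    """Gamma density with shape a and rate b."""
    return b**a * t**(a - 1) * math.exp(-b * t) / math.gamma(a)

alpha, beta, x = 3.0, 2.0, 5               # illustrative values
grid = [i * 0.01 for i in range(1, 2001)]

# unnormalized posterior = Poisson likelihood * Gamma prior on the grid
unnorm = [(t**x * math.exp(-t) / math.factorial(x)) * gamma_pdf(t, alpha, beta)
          for t in grid]
z = sum(unnorm) * 0.01                     # crude normalizing integral
closed = [gamma_pdf(t, alpha + x, beta + 1) for t in grid]

err = max(abs(u / z - c) for u, c in zip(unnorm, closed))
print(err)   # small: the normalized product matches Gam(alpha+x, beta+1)
```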
6.4 Estimation of vector parameters

In the case where we have a vector parameter θ = [θ_1, ..., θ_m]^t, we have to define a cost function c[a, θ] : IR^m × IR^m → IR+. Then it is again possible to define the Bayes risk. In many cases the cost function is of the form

    c[a, θ] = ∑_{i=1}^m c_i[a_i, θ_i]    (6.40)

We then have

    E[ c[θ̂(x), θ] | X = x ] = ∑_{i=1}^m E[ c_i[θ̂_i(x), θ_i] | X = x ]    (6.41)

Hereafter, we consider some common cost functions.

6.4.1 Minimum-Mean-Squared-Error

In the case where τ = IR^m, a commonly used cost function is

    c[a, θ] = ‖a − θ‖² = ∑_{i=1}^m (a_i − θ_i)²    (6.42)

The corresponding Bayes risk is E[‖θ̂(X) − Θ‖²] and the corresponding Bayes estimate is the Minimum-Mean-Squared-Error (MMSE) estimator:

    θ̂_MMSE(x) = E[Θ | X = x]    (6.43)

Thus the MMSE estimate is the mean of the posterior probability density function. It is also called the posterior mean (PM) estimate.
Note that, as in the scalar case, the weighted quadratic cost function

    c[a, θ] = ‖a − θ‖²_Q = [a − θ]^t Q [a − θ]    (6.44)

gives the same estimate as (6.43), i.e. the MMSE estimate does not depend on the weighting matrix Q. However, the corresponding minimum Bayes risks are different and we have

    E[‖θ̂(X) − Θ‖²_Q] = tr{ Q E[Cov{Θ | X = x}] }    (6.45)

6.4.2 Minimum-Mean-Absolute-Error

In the case where τ = IR^m, another commonly used cost function is

    c[a, θ] = ∑_{i=1}^m |a_i − θ_i|    (6.46)

The corresponding estimate is such that

    Pr{ Θ_i < θ̂_i(x) | X = x } = Pr{ Θ_i > θ̂_i(x) | X = x }    (6.47)

which means that θ̂_i(x) is the median of the marginal posterior distribution of Θ_i given X = x, i.e. π(θ_i | X = x).
6.4.3 Marginal Maximum A Posteriori (MMAP) estimation

Another commonly used cost function in the case where τ = IR^m is

    c[a, θ] = ∑_{i=1}^m c[a_i − θ_i]   with   c[a_i − θ_i] = { 0 if |a_i − θ_i| ≤ ∆ ;  1 if |a_i − θ_i| > ∆ }    (6.48)

where ∆ is a positive real number. If ∆ is sufficiently small, the corresponding estimate is given by

    θ̂_i = arg max_{θ_i} { π(θ_i | x) }    (6.49)

6.4.4 Maximum A Posteriori (MAP) estimation

Two other cost functions which give the same estimates are

    c[a, θ] = { 0 if max_i |a_i − θ_i| ≤ ∆ ;  1 if max_i |a_i − θ_i| > ∆ }    (6.50)

and

    c[a, θ] = { 0 if ‖a − θ‖² ≤ ∆ ;  1 if ‖a − θ‖² > ∆ }    (6.51)

where ∆ is a positive real number. In both cases, if the posterior distribution π(θ | x) is continuous and smooth enough, we obtain the MAP estimate:

    θ̂_MAP = arg max_{θ ∈ τ} { π(θ | x) }    (6.52)

Note that the MMAP estimate in (6.49) and the MAP estimate in (6.52) may be very different.
6.4.5 Estimation of a Gaussian vector parameter from jointly Gaussian observations

The estimation of a Gaussian vector parameter θ ∈ IR^m from a jointly Gaussian observation x ∈ IR^n is a very useful example and is used in many applications.
Suppose Θ and X have the following distributions:

    Θ ∼ N(θ_0, R_Θ),   X ∼ N(x_0, R_X)

and jointly

    (Θ, X) ∼ N( (θ_0, x_0), [ R_Θ  R_ΘX ;  R_XΘ  R_X ] )   with R_ΘX = R_XΘ^t.

It is easy to show that the posterior law is also Gaussian:

    Θ | X = x ∼ N(θ̂, R̂)    (6.53)

with

    θ̂ = θ_0 + R_ΘX R_X^{−1} (x − x_0)    (6.54)
    R̂ = R_Θ − R_ΘX R_X^{−1} R_XΘ    (6.55)

We also have

    E[Θ | X = x] = θ̂    (6.56)
    Cov{Θ | X = x} = R̂    (6.57)

The corresponding minimum Bayes risk (for the weighted quadratic cost) is

    r(θ̂) = tr{ Q R̂ } = tr{ Q R_Θ } − tr{ Q R_ΘX R_X^{−1} R_XΘ }    (6.58)
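Equations (6.54)-(6.55) can be checked numerically: the estimation error Θ − θ̂(X) must be uncorrelated with X (orthogonality), and its covariance must equal the Schur complement R̂. A sketch with a randomly generated joint covariance (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
M = rng.standard_normal((m + n, m + n))
C = M @ M.T + (m + n) * np.eye(m + n)      # a random joint covariance matrix
R_T  = C[:m, :m]                           # R_Theta
R_TX = C[:m, m:]                           # R_{Theta,X}
R_X  = C[m:, m:]                           # R_X

K = R_TX @ np.linalg.inv(R_X)              # "gain" in theta_hat of (6.54)
R_hat = R_T - K @ R_TX.T                   # posterior covariance (6.55)

# orthogonality: Cov(Theta - K X, X) = R_TX - K R_X = 0
orth_err = np.abs(R_TX - K @ R_X).max()
# Cov(Theta - K X) = R_Theta - K R_X K^t = R_hat
cov_err = np.abs((R_T - K @ R_X @ K.T) - R_hat).max()
print(orth_err, cov_err)   # both ~ 0 up to round-off
```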
6.4.6 Case of linear models

When the observation vector is related to the vector parameter θ by a linear model, we have

    X_i = ∑_{j=1}^m h_{i,j} Θ_j + N_i,   i = 1, ..., n    (6.59)

or in matrix form

    X = HΘ + N    (6.60)

with

    Θ ∼ N(θ_0, R_Θ),   N ∼ N(0, R_N)

Then we have

    X | Θ = θ ∼ N(Hθ, R_N)
    R_X = H R_Θ H^t + H R_ΘN + R_NΘ H^t + R_N
    R_XΘ = H R_Θ + R_NΘ,   R_ΘX = R_XΘ^t

If we assume that the noise N and the vector parameter Θ are independent, we have

    R_X = H R_Θ H^t + R_N    (6.61)
    R_XΘ = H R_Θ,   R_ΘX = R_Θ H^t    (6.62)
    R_X^{−1} = R_N^{−1} − R_N^{−1} H ( R_Θ^{−1} + H^t R_N^{−1} H )^{−1} H^t R_N^{−1}    (6.63)

and

    Θ | X = x ∼ N(θ̂, R̂)    (6.64)

with

    θ̂ = E[Θ | x] = θ_0 + R_ΘX R_X^{−1} (x − Hθ_0)
       = θ_0 + R_Θ H^t [H R_Θ H^t + R_N]^{−1} (x − Hθ_0)
       = θ_0 + [H^t R_N^{−1} H + R_Θ^{−1}]^{−1} H^t R_N^{−1} (x − Hθ_0)
    R̂ = E[ (θ − θ̂)(θ − θ̂)^t | x ] = R_Θ − R_ΘX R_X^{−1} R_XΘ
       = R_Θ − R_Θ H^t [H R_Θ H^t + R_N]^{−1} H R_Θ
       = [H^t R_N^{−1} H + R_Θ^{−1}]^{−1}    (6.65)

Consider now the particular case of R_N = σ_b² I, R_Θ = σ_x² (D^t D)^{−1} and θ_0 = 0. We then have

    θ̂ = (H^t H + λ D^t D)^{−1} H^t x,   R̂ = σ_b² (H^t H + λ D^t D)^{−1},   with λ = σ_b²/σ_x²    (6.66)
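The two equivalent forms for θ̂ and R̂ in (6.65), related by the matrix inversion lemma, can be checked on random matrices. A sketch (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 3
H = rng.standard_normal((n, m))
A = rng.standard_normal((m, m)); R_T = A @ A.T + np.eye(m)   # R_Theta > 0
B = rng.standard_normal((n, n)); R_N = B @ B.T + np.eye(n)   # R_N > 0
theta0 = rng.standard_normal(m)
x = rng.standard_normal(n)

S = H @ R_T @ H.T + R_N                      # = R_X of (6.61)
# first form of (6.65)
form1 = theta0 + R_T @ H.T @ np.linalg.solve(S, x - H @ theta0)
# second form of (6.65), via the "information" matrix
P = np.linalg.inv(H.T @ np.linalg.inv(R_N) @ H + np.linalg.inv(R_T))
form2 = theta0 + P @ H.T @ np.linalg.solve(R_N, x - H @ theta0)

Rh1 = R_T - R_T @ H.T @ np.linalg.solve(S, H @ R_T)
mean_err = np.abs(form1 - form2).max()
cov_err = np.abs(Rh1 - P).max()
print(mean_err, cov_err)   # both ~ 0 up to round-off
```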
6.5 Examples

6.5.1 Curve fitting

We consider here the classical problem of curve fitting that almost every engineer faces at one time or another. We analyse this problem as a parameter estimation problem: given a set of data {(x_i, t_i), i = 1, ..., n}, estimate the parameters of an algebraic curve that best fits these data. Among the different families of curves, polynomials are very commonly used.
A polynomial model of degree p relating x_i = x(t_i) to t_i is

    x_i = x(t_i) = θ_0 + θ_1 t_i + θ_2 t_i² + ··· + θ_p t_i^p,   i = 1, ..., n    (6.67)

Noting that this relation is linear in the θ_i, we can rewrite it in the following form:

    [ x_1 ]   [ 1  t_1  t_1²  ···  t_1^p ] [ θ_0 ]
    [ x_2 ] = [ 1  t_2  t_2²  ···  t_2^p ] [ θ_1 ]    (6.68)
    [  ⋮  ]   [ ⋮    ⋮    ⋮          ⋮   ] [  ⋮  ]
    [ x_n ]   [ 1  t_n  t_n²  ···  t_n^p ] [ θ_p ]

or

    x = Hθ    (6.69)

The matrix H is called the Vandermonde matrix. It is entirely determined by the vector t = [t_1, t_2, ..., t_n]^t.
In the case where n = p + 1, this matrix is invertible iff t_i ≠ t_j for all i ≠ j. In general, however, we have more data than unknowns, i.e. n > p + 1.
Note that the matrix H^t H is a Hankel matrix:

    [H^t H]_{kl} = ∑_{i=1}^n t_i^{k−1} t_i^{l−1} = ∑_{i=1}^n t_i^{k+l−2},   k, l = 1, ..., p + 1    (6.70)

and the vector H^t x is such that

    [H^t x]_k = ∑_{i=1}^n t_i^{k−1} x_i,   k = 1, ..., p + 1    (6.71)
Line fitting is the particular case p = 1:

    [ x_1 ]   [ 1  t_1 ]
    [ x_2 ] = [ 1  t_2 ] ( θ_0 )    (6.72)
    [  ⋮  ]   [ ⋮   ⋮  ] ( θ_1 )
    [ x_n ]   [ 1  t_n ]

In this case we have

    H^t H = [ n               ∑_{i=1}^n t_i  ;
              ∑_{i=1}^n t_i   ∑_{i=1}^n t_i² ]    (6.73)

and

    H^t x = [ ∑_{i=1}^n x_i  ;  ∑_{i=1}^n t_i x_i ]    (6.74)

[Figure 6.1: Curve fitting: the data points (t_i, x_i) and the fitted curve x_i = θ_0 + θ_1 t_i + ··· + θ_p t_i^p, with e_i the error at the point t_i.]
In the following we consider the line fitting case and will see how different assumptions about the problem can give different solutions.

Model 1:
The easiest model is to assume that the t_i are perfectly known, and that we only have uncertainties on the x_i, i.e.

    x_i = x(t_i) = θ_0 + θ_1 t_i + e_i,   i = 1, ..., n    (6.75)

where e_i represents the error on x_i. In a geometric language, e_i is the signed vertical distance between the point (t_i, x_i) and the point (t_i, θ_0 + θ_1 t_i) (see Figure 6.2).
Here, we have x = Hθ + e with θ = [θ_0, θ_1]^t. The matrix H is perfectly known. Note that in this model, if we assume that the e_i are zero mean, white and Gaussian,

    e_i = x_i − θ_0 − θ_1 t_i ∼ N(0, σ_e²)    (6.76)

then the likelihood function becomes

    f(x | θ) = N(Hθ, σ_e² I) ∝ exp[ −(1/(2σ_e²)) ‖x − Hθ‖² ]    (6.77)

and the maximum likelihood estimate is

    θ̂ = arg min_θ { ‖x − Hθ‖² }    (6.78)
[Figure 6.2: Line fitting, model 1: e_i = x_i − (θ_0 + θ_1 t_i).]
If H^t H is invertible, θ̂ is given by

    θ̂ = [H^t H]^{−1} H^t x    (6.79)

To define a Bayesian estimate, we have to assign a prior probability law to θ. Let us assume that θ_0 and θ_1 are independent and

    θ_0 ∼ N(0, σ_0²),   θ_1 ∼ N(1, σ_1²)    (6.80)

or

    θ = (θ_0, θ_1)^t ∼ N( (0, 1)^t, diag(σ_0², σ_1²) ) = N(θ_0, Σ_θ)    (6.81)

(here the boldface θ_0 = (0, 1)^t denotes the prior mean vector, not the intercept θ_0).

Exercise 1:

• Write the complete expressions of f(x_i | θ), f(x | θ), π(θ) and π(θ | x).

• Show that the posterior law π(θ | x) is Gaussian, i.e.

    π(θ | x) ∼ N(θ̂, Σ̂)    (6.82)

and give the expressions of θ̂ and Σ̂.

• Show that the MAP estimate is obtained by

    θ̂ = arg min_θ { J(θ) }    (6.83)

with

    J(θ) = (1/σ_e²) ‖x − Hθ‖² + (θ − θ_0)^t Σ_θ^{−1} (θ − θ_0)
         = (1/σ_e²) ∑_{i=1}^n (x_i − θ_0 − θ_1 t_i)² + (θ − θ_0)^t Σ_θ^{−1} (θ − θ_0)

• Show that, in this case, an explicit expression of θ̂ is available and is given by

    θ̂ = [ (1/σ_e²) H^t H + Σ_θ^{−1} ]^{−1} [ (1/σ_e²) H^t x + Σ_θ^{−1} θ_0 ]    (6.84)

• Compare this solution to the ML solution (6.79), which is equal to the least squares (LS) solution.
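One way to carry out this comparison numerically: the MAP formula (6.84) shrinks toward the prior mean for a tight prior and recovers the least-squares solution (6.79) as the prior becomes flat. A sketch (the data, noise level and prior below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 20)
H = np.column_stack([np.ones_like(t), t])
theta_true = np.array([0.3, 1.2])
x = H @ theta_true + 0.05 * rng.standard_normal(20)

sig2_e = 0.05**2
theta0 = np.array([0.0, 1.0])          # prior mean of (6.81)

def theta_map(sig2_prior):
    """MAP estimate (6.84) with Sigma_theta = sig2_prior * I."""
    S_inv = np.eye(2) / sig2_prior
    A = H.T @ H / sig2_e + S_inv
    return np.linalg.solve(A, H.T @ x / sig2_e + S_inv @ theta0)

theta_ls = np.linalg.solve(H.T @ H, H.T @ x)   # ML = LS solution (6.79)
print(theta_map(1e-6))   # ~ theta0: a very tight prior dominates
print(theta_map(1e6))    # ~ theta_ls: a very flat prior is ignored
```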
Model 2:
A slightly more complex model is

    x_i = x(t_i) = θ_0 + θ_1 t_i + e_i,   i = 1, ..., n    (6.85)

with the assumption that

    r_i = e_i cos φ = e_i / √(1 + θ_1²) = (x_i − θ_0 − θ_1 t_i) / √(1 + θ_1²),

the distance of the point (t_i, x_i) to the line x(t) = θ_0 + θ_1 t, is zero mean, white and Gaussian with known variance σ_r² (see Figure 6.3).
Note that r_i is no longer a linear function of θ_1.
[Figure 6.3: Line fitting, model 2: r_i = e_i cos φ = (x_i − θ_0 − θ_1 t_i)/√(1 + θ_1²).]
Exercise 2: With this model, and assuming that the r_i are zero mean, white and Gaussian with known variance σ_r²:

• Write the expressions of f(x_i | θ), f(x | θ), π(θ) and π(θ | x).

• Show that the MAP estimate is obtained by

    θ̂ = arg min_θ { J(θ) }    (6.86)

with

    J(θ) = n ln( 2π(1 + θ_1²) σ_r² ) + 1/((1 + θ_1²) σ_r²) ∑_{i=1}^n (x_i − θ_0 − θ_1 t_i)² + (θ − θ_0)^t Σ_θ^{−1} (θ − θ_0)    (6.87)

where θ = [θ_0, θ_1]^t.

• Is it possible to obtain explicit expressions for θ̂_0 and θ̂_1?
Model 3:

A slightly different model assumes that the t_i are also uncertain, i.e.

    x_i = x(t_i) = θ_0 + θ_1 (t_i + ε_i) + e_i,   i = 1, ..., n    (6.88)

where ε_i represents the error on t_i. Here also, we have x = Hθ + e with θ = [θ_0, θ_1]^t, but the matrix H is now uncertain.
Note that we have

    x_i = x(t_i) = θ_0 + θ_1 t_i + θ_1 ε_i + e_i,   i = 1, ..., n    (6.89)

which can also be written as

    x = H_0 θ + H_ε θ + e

with

    H_0 = [ 1  t_1 ;  1  t_2 ;  ⋮  ;  1  t_n ],   H_ε = [ 0  ε_1 ;  0  ε_2 ;  ⋮  ;  0  ε_n ]

Exercise 3: With this model, and assuming that the ε_i are zero mean, white and Gaussian with known variance σ_ε² and that the e_i are also zero mean, white and Gaussian with known variance σ_e²:

• Write the expressions of f(x_i | θ), f(x | θ), π(θ) and π(θ | x).

• Give the expressions of the ML and the MAP estimators.

• Compare them to the solutions of the previous cases.
[Figure 6.4: Line fitting, model 3.]
Model 4:

This is the combination of models 2 and 3, i.e.

    x_i = x(t_i) = θ_0 + θ_1 (t_i + ε_i) + e_i,   i = 1, ..., n    (6.90)

where the

    r_i = e_i cos φ = e_i / √(1 + θ_1²) = (x_i − θ_0 − θ_1 t_i) / √(1 + θ_1²),

the distances of the points (t_i, x_i) to the line x(t) = θ_0 + θ_1 t, are assumed zero mean, white and Gaussian.

Exercise 4: With this model, and assuming that the ε_i are zero mean, white and Gaussian with known variance σ_ε² and that the r_i are also zero mean, white and Gaussian with known variance σ_r²:

• Write the expressions of f(x_i | θ), f(x | θ), π(θ) and π(θ | x).

• Give the expressions of the ML and the MAP estimators.

• Compare them to the solutions of the previous examples.
[Figure 6.5: Line fitting, models 1 and 2. Model 1: e_i = x_i − (θ_0 + θ_1 t_i). Model 2: r_i = [x_i − (θ_0 + θ_1 t_i)]/√(1 + θ_1²).]
The following summary gathers the posterior laws for the four line-fitting models.

General linear Gaussian model:

    x | θ ∼ N(Hθ, R_N),   θ ∼ N(θ_0, R_Θ)   ⟹   θ | x ∼ N(θ̂, R̂)

    θ̂ = θ_0 + R_Θ H^t [H R_Θ H^t + R_N]^{−1} (x − Hθ_0)
      = θ_0 + [H^t R_N^{−1} H + R_Θ^{−1}]^{−1} H^t R_N^{−1} (x − Hθ_0)
    R̂ = R_Θ − R_Θ H^t [H R_Θ H^t + R_N]^{−1} H R_Θ
      = [H^t R_N^{−1} H + R_Θ^{−1}]^{−1}

Model 1:

    e_i ∼ N(0, σ_e²),   x_i | θ ∼ N(θ_0 + θ_1 t_i, σ_e²),   x | θ ∼ N(Hθ, σ_e² I)

    θ̂ = θ_0 + [H^t H + σ_e² R_Θ^{−1}]^{−1} H^t (x − Hθ_0)
    R̂ = σ_e² [H^t H + σ_e² R_Θ^{−1}]^{−1}

Model 2:

    r_i = e_i/√(1 + θ_1²) ∼ N(0, σ_r²),   e_i ∼ N(0, (1 + θ_1²) σ_r²)
    x_i | θ ∼ N(θ_0 + θ_1 t_i, (1 + θ_1²) σ_r²),   x | θ ∼ N(Hθ, (1 + θ_1²) σ_r² I)

    π(θ|x) = (1/m(x)) [2π(1 + θ_1²)σ_r²]^{−n/2} exp[ −‖x − Hθ‖²/(2(1 + θ_1²)σ_r²) − (1/2)(θ − θ_0)^t R_Θ^{−1} (θ − θ_0) ]

    θ̂_MAP = arg min_θ { J(θ) },
    J(θ) = (n/2) ln[2π(1 + θ_1²)σ_r²] + ‖x − Hθ‖²/(2(1 + θ_1²)σ_r²) + (1/2)(θ − θ_0)^t R_Θ^{−1} (θ − θ_0)

Model 3:

    ε_i ∼ N(0, σ_ε²),   e_i ∼ N(0, σ_e²)
    x_i | θ ∼ N(θ_0 + θ_1 t_i, θ_1² σ_ε² + σ_e²),   x | θ ∼ N(Hθ, (θ_1² σ_ε² + σ_e²) I)

    π(θ|x) = (1/m(x)) [2π(θ_1² σ_ε² + σ_e²)]^{−n/2} exp[ −‖x − Hθ‖²/(2(θ_1² σ_ε² + σ_e²)) − (1/2)(θ − θ_0)^t R_Θ^{−1} (θ − θ_0) ]

    θ̂_MAP = arg min_θ { J(θ) },
    J(θ) = (n/2) ln[2π(θ_1² σ_ε² + σ_e²)] + ‖x − Hθ‖²/(2(θ_1² σ_ε² + σ_e²)) + (1/2)(θ − θ_0)^t R_Θ^{−1} (θ − θ_0)

Model 4:

    ε_i ∼ N(0, σ_ε²),   e_i ∼ N(0, (1 + θ_1²) σ_r²)
    x_i | θ ∼ N(θ_0 + θ_1 t_i, θ_1² σ_ε² + (1 + θ_1²) σ_r²) = N(θ_0 + θ_1 t_i, θ_1² (σ_ε² + σ_r²) + σ_r²)
    x | θ ∼ N(Hθ, (θ_1² (σ_ε² + σ_r²) + σ_r²) I)

    π(θ|x) = (1/m(x)) [2π(θ_1² (σ_ε² + σ_r²) + σ_r²)]^{−n/2} exp[ −‖x − Hθ‖²/(2(θ_1² (σ_ε² + σ_r²) + σ_r²)) − (1/2)(θ − θ_0)^t R_Θ^{−1} (θ − θ_0) ]

    θ̂_MAP = arg min_θ { J(θ) },
    J(θ) = (n/2) ln[2π(θ_1² (σ_ε² + σ_r²) + σ_r²)] + ‖x − Hθ‖²/(2(θ_1² (σ_ε² + σ_r²) + σ_r²)) + (1/2)(θ − θ_0)^t R_Θ^{−1} (θ − θ_0)

(In each case J(θ) = −ln π(θ|x) up to an additive constant, so that θ̂_MAP = arg min_θ J(θ).)
Remark: To do these calculations easily we need the following relations:

• If A, B and A + B are invertible, then we have

    [A^{−1} + B^{−1}]^{−1} = A [A + B]^{−1} B = B [A + B]^{−1} A    (6.91)
    [A + B]^{−1} = A^{−1} [A^{−1} + B^{−1}]^{−1} B^{−1} = B^{−1} [A^{−1} + B^{−1}]^{−1} A^{−1}

• If A and C are invertible matrices, then we have (the matrix inversion lemma)

    [A + BCD]^{−1} = A^{−1} − A^{−1} B [D A^{−1} B + C^{−1}]^{−1} D A^{−1}    (6.92)

• A special case, very useful in system theory:

    [I + B (sI − C)^{−1} D]^{−1} = I − B [sI − C + DB]^{−1} D    (6.93)

• If A is invertible, then (rank-one update)

    [A + u v^t]^{−1} = A^{−1} − (A^{−1} u)(v^t A^{−1}) / (1 + v^t A^{−1} u)    (6.94)

• If A is a block matrix

    A = [ A_11  A_12 ;  A_21  A_22 ]

then B = A^{−1} is also a block matrix

    B = [ B_11  B_12 ;  B_21  B_22 ]

and:

  – If A_22^{−1} exists, then

        A = [ I  A_12 A_22^{−1} ;  0  I ] [ A_11 − A_12 A_22^{−1} A_21   0 ;  0   A_22 ] [ I  0 ;  A_22^{−1} A_21  I ]

    and we have

        rank{A} = rank{A_11 − A_12 A_22^{−1} A_21} + rank{A_22}

    A^{−1} exists iff the matrix T = A_11 − A_12 A_22^{−1} A_21 is invertible. Then we have

        B_11 = T^{−1} = (A_11 − A_12 A_22^{−1} A_21)^{−1}
        B_22 = A_22^{−1} + A_22^{−1} A_21 B_11 A_12 A_22^{−1}
        B_12 = −B_11 A_12 A_22^{−1}
        B_21 = −A_22^{−1} A_21 B_11

    Written differently, we have

        [ A_11  A_12 ;  A_21  A_22 ]^{−1} = [ (A_11 − A_12 A_22^{−1} A_21)^{−1}   −B_11 A_12 A_22^{−1} ;
                                              −A_22^{−1} A_21 B_11   A_22^{−1} + A_22^{−1} A_21 B_11 A_12 A_22^{−1} ]

  – If A_11^{−1} exists, then we have

        B_11 = A_11^{−1} + A_11^{−1} A_12 B_22 A_21 A_11^{−1}
        B_22 = D^{−1} = (A_22 − A_21 A_11^{−1} A_12)^{−1}
        B_12 = −A_11^{−1} A_12 B_22
        B_21 = −B_22 A_21 A_11^{−1}

    The matrices T and D are called the Schur complements of the matrix A.

• Particular case 1: If A is upper block-triangular, i.e. A_21 = 0, then B is also upper block-triangular:

    B = [ A_11  A_12 ;  0  A_22 ]^{−1} = [ A_11^{−1}   −A_11^{−1} A_12 A_22^{−1} ;  0   A_22^{−1} ]

• Particular case 2: If A is a lower block-triangular matrix, i.e. A_12 = 0, then B is also lower block-triangular:

    B = [ A_11  0 ;  A_21  A_22 ]^{−1} = [ A_11^{−1}   0 ;  −A_22^{−1} A_21 A_11^{−1}   A_22^{−1} ]

• Particular case 3: If A_22 = y is a scalar and A_12 = x and A_21^t = z are vectors, we have

    A = [ A_11  x ;  z^t  y ],   B = A^{−1} = (1/α) [ α A_11^{−1} + w v^t   w ;  v^t   1 ]

where α, w and v are given by

    α = y − z^t A_11^{−1} x = |A| / |A_11|,   w = −A_11^{−1} x,   v = −A_11^{−t} z

• If A is an (N, P) matrix,

    I_N ± A A^t = [I_N ± A R^{−1} A^t] [I_N ± A R^{−1} A^t]^t

where R = I_P + [I_P ± A^t A]^{1/2}.

• If x is a vector and u(x) a scalar function of x, and if we define the gradient vector ∇u = ∂u/∂x = [∂u/∂x_i], then we have the following relations:

  – If u = θ^t x then ∇u = ∂u/∂x = θ
  – If u = x^t A x (with A symmetric) then ∇u = ∂u/∂x = 2Ax
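The matrix inversion lemma (6.92) and the rank-one update (6.94) are easy to verify numerically on random, well-conditioned matrices. A sketch (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned A
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, p)) + p * np.eye(p)
D = rng.standard_normal((p, n))
inv = np.linalg.inv

# (6.92): [A + BCD]^-1 = A^-1 - A^-1 B [D A^-1 B + C^-1]^-1 D A^-1
lhs = inv(A + B @ C @ D)
rhs = inv(A) - inv(A) @ B @ inv(D @ inv(A) @ B + inv(C)) @ D @ inv(A)
err_woodbury = np.abs(lhs - rhs).max()

# (6.94): rank-one update [A + u v^t]^-1
u = rng.standard_normal(n); v = rng.standard_normal(n)
Ai = inv(A)
lhs1 = inv(A + np.outer(u, v))
rhs1 = Ai - np.outer(Ai @ u, v @ Ai) / (1.0 + v @ Ai @ u)
err_rank1 = np.abs(lhs1 - rhs1).max()

print(err_woodbury, err_rank1)   # both ~ 0 up to round-off
```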
Generalized Gaussian

    p(x) = [p^{1−1/p} / (2σ Γ(1/p))] exp[ −(1/p) |x − x_0|^p / σ^p ]

    p = 1  →  p(x) = (1/(2σ)) exp[ −|x − x_0|/σ ]
    p = 2  →  p(x) = (1/√(2πσ²)) exp[ −(1/2) |x − x_0|²/σ² ]
    p = ∞  →  p(x) = { 1/(2σ) if |x − x_0| < σ ;  0 otherwise }

Centered case x_0 = 0:

    p(x) = [p^{1−1/p} / (2σ Γ(1/p))] exp[ −(1/p) |x|^p / σ^p ]

    p = 1  →  p(x) = (1/(2σ)) exp[ −|x|/σ ]
    p = 2  →  p(x) = (1/√(2πσ²)) exp[ −(1/2) |x|²/σ² ]
    p = ∞  →  p(x) = { 1/(2σ) if |x| < σ ;  0 otherwise }

Multivariable case:

Separable:

    p(x) = [p^{n−n/p} / (2^n σ^n Γ^n(1/p))] exp[ −(1/(p σ^p)) ∑_{i=1}^n |x_i|^p ]

Correlated: Markov models

    p(x) = (1/Z(α)) exp[ −α ∑_{i=1}^n ∑_{j ∼ i} φ(x_i − x_j) ],
    Z(α) = ∫ exp[ −α ∑_{i=1}^n ∑_{j ∼ i} φ(x_i − x_j) ] dx

where j ∼ i denotes the neighbours of the site i.

Example: φ(x) = x², with neighbours j = i − 1:

    p(x) = (1/Z(α)) exp[ −α x_1² − α ∑_{i=2}^n (x_i − x_{i−1})² ]

This can be written as

    p(x) = (1/Z(α)) exp[ −α x^t D^t D x ]

with

    D = [ 1   0   ···  0 ;
         −1   1   ···  0 ;
              ⋱   ⋱     ;
          0  ···  −1   1 ]

and

    Z(α) = (2π)^{n/2} (2α)^{−n/2} |D^t D|^{−1/2}

Extension:

    p(x) = (1/Z(α)) exp[ −α φ(x_1) − α ∑_{i=2}^n φ(x_i − x_{i−1}) ] = (1/Z(α)) exp[ −α ∑_{i=1}^n φ([Dx]_i) ]

with φ(x) = |x|^p. The questions are: does Z(α) exist, and can we obtain an analytical expression for it?
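The normalizing constant of the scalar generalized Gaussian can be checked numerically: for any p the density above should integrate to one. A sketch for the centered case with σ = 1 (the grid and the values of p are arbitrary):

```python
import math

def gg_pdf(x, p, sigma=1.0):
    """Centered generalized Gaussian density with exponent p and scale sigma."""
    c = p ** (1 - 1.0 / p) / (2 * sigma * math.gamma(1.0 / p))
    return c * math.exp(-abs(x) ** p / (p * sigma ** p))

totals = {}
for p in (1.0, 1.5, 2.0, 4.0):
    h, lim = 0.001, 20.0
    totals[p] = sum(gg_pdf(-lim + i * h, p) for i in range(int(2 * lim / h))) * h
print(totals)   # each total ~ 1
```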
Chapter 7

Elements of signal estimation

In the previous chapters we discussed methods for designing estimators for static parameter estimation. In this chapter we consider the case of dynamic or time-varying parameters (signal estimation).

7.1 Introduction

In many time-varying systems, the physical quantities of interest x can be modeled as obeying a dynamic equation

    x_{n+1} = f_n(x_n, u_n)    (7.1)

where

• x_0, x_1, ..., is a sequence of vectors in IR^N, called the state of the system, representing the unknown quantities of interest;

• u_0, u_1, ..., is a sequence of vectors in IR^M, called the state input of the system, representing the influencing quantities acting on x_n;

• f_0, f_1, ..., is a sequence of functions mapping IR^N × IR^M to IR^N, called the state equation of the system, representing the dynamic model relating x_n and u_n.

A dynamic system is such that, for any fixed k and l, x_k is completely determined from the state at time l and the inputs from times l up to k − 1. So, complete determination of x_n, n = 1, 2, ..., requires not only the inputs u_n, n = 0, 1, 2, ..., but also the initial condition x_0.
Equation (7.1) is called the state equation. Associated to this equation is the observation equation

    z_n = h_n(x_n, v_n)    (7.2)

where

• z_0, z_1, ..., is a sequence of vectors in IR^P representing the observable quantities;

• v_0, v_1, ..., is a sequence of vectors in IR^P representing the errors on the observations;

• h_0, h_1, ..., is a sequence of functions mapping IR^N × IR^P to IR^P representing the observation model.
The main problem then is to estimate the state vector x_k from the observations z_0, z_1, ..., z_l.

Example 1: One-dimensional motion

Consider a moving target subjected to an acceleration A_t for t > 0. Its position X_t and its velocity V_t at time t satisfy

    V_t = dX_t/dt,   A_t = dV_t/dt    (7.3)

Assume that we can measure the position X_t at the time instants t_n = nT and we wish to write a model of type (7.1) describing the motion. Assuming T is small, a Taylor series approximation allows us to write

    X_{n+1} ≃ X_n + T V_n,   V_{n+1} ≃ V_n + T A_n    (7.4)

From these equations we see that the two quantities X_n and V_n are necessary to describe the motion. So, defining

    x_n = (X_n, V_n)^t,   U_n = A_n,   Z_n = X_n + V_n    (7.5)

(where V_n in the last equation denotes the measurement noise of (7.2), not the velocity), we can write

    x_{n+1} = F x_n + G U_n,   Z_n = H x_n + V_n,   with  F = [ 1  T ;  0  1 ],  G = (0, T)^t,  H = (1  0)    (7.6)

i.e.

    f_n(x, u) = Fx + Gu,   h_n(x, v) = Hx + v    (7.7)

In this example, we assumed that we can measure directly the position of the moving target. In general, however, we may observe a quantity z(n) related to the unknown quantity x(n) by a linear transformation:

    x(n)  →  [ Linear System ]  →  z(n)
    (not observable)               (observable)

and we want to estimate x(n) from the observed values of {z(n), n = 1, ..., k}. The estimate x̂(n) is then a function of the data {z(n), n = 1, ..., k} and we write

    x̂(n | z(1), z(2), ..., z(k)) =: x̂(n | k)

Three cases may occur:

• we may want to estimate x(n + k) from the past observations. The estimate x̂(n + k | n) is called the k-step prediction of x(n + k) and the estimation procedure is called prediction;

• we may want to estimate x(n) from present and past observations. The estimate x̂(n | n) is the filtered value of x(n) and the estimation procedure is called filtering;

• we may want to estimate x(n) from past, present and future observations. The estimate x̂(n | n + l) is the smoothed value of x(n) and the estimation procedure is called smoothing.
7.2 Kalman filtering: general linear case

In this section we consider linear systems with finite dimensions described by the following equations:

    x_{k+1} = F_k x_k + G_k u_k     (state equation)
    z_k = H_k x_k + v_k             (observation equation)

where

• k = 0, 1, 2, ... represents the discrete time;

• x_k is an N-dimensional vector called the state vector of the system;

• z_k is a P-dimensional vector containing the observations (output of the system);

• v_k is a P-dimensional vector containing the observation errors (output noise of the system);

• u_k is an M-dimensional vector representing the state representation error (state-space noise process);

• F_k, G_k and H_k, with respective dimensions (N, N), (N, M) and (P, N), are the state transition, the state input and the observation matrices, and are assumed to be known;

• the noise sequences {u_k} and {v_k} are assumed to be centered, white and jointly Gaussian;

• the initial state x_0 is also assumed to be Gaussian and independent of {u_k} and {v_k}:

    E[ (v_k ; x_0 ; u_k) (v_l^t, x_0^t, u_l^t) ] = [ R_k δ_kl  0  0 ;  0  P_0  0 ;  0  0  Q_k δ_kl ]

where R_k is the covariance matrix of the observation noise vector v_k, Q_k is the covariance matrix of the state noise vector u_k, and P_0 is the covariance matrix of the initial state x_0.

Remember that the aim is to find a best estimate x̂_{k|l} of x_k from the observations z_1, z_2, ..., z_l. Depending on the relative position of k with respect to l we have:

• if k > l: prediction;

• if k = l: filtering;

• if k < l: smoothing.

Of course, for fixed k and l, this problem is no different from the vector parameter estimation of the last chapter. However, we are usually interested in producing the estimates either in real time or at least on-line for increasing k.
Three different approaches can be used to obtain the Kalman filtering equations:

• Linear Mean Square (LMS) estimation:

    x̂_{k|l} := LMS(x_k | z_1, ..., z_l)

which minimizes

    E[ [x_k − x̂_{k|l}]^t W_k [x_k − x̂_{k|l}] ]

• Maximum A Posteriori (MAP) estimation:

    x̂_{k|l} = arg max_x { p(x_k | z_1, ..., z_l) }

• Bayesian MSE estimation:

    x̂_{k|l} = E[ x_k | z_1, ..., z_l ]

We know that, for linear relations and under the Gaussian assumption, all these estimates are equivalent. We consider here the last approach.
The main procedure is to apply the Bayes rule recursively to find the expression of the posterior law p(x_k | z_1, ..., z_l). Note that we can obtain this expression easily thanks to the following facts:

1. all variables are assumed Gaussian;

2. {u_k} and {v_k} are assumed white, Gaussian and mutually independent;

3. the state and the observation models are linear;

4. all the conditional laws such as p(z_{k+1} | x_{k+1}) and p(z_{k+1} | z_{1:k}) are Gaussian, so the posterior law

    p(x_{k+1} | z_{1:k+1}) = p(x_{k+1} | z_{1:k}) p(z_{k+1} | x_{k+1}) / p(z_{k+1} | z_{1:k})

is also Gaussian.
To obtain the equations in the general case we denote by
• $\hat{x}_{k|k}$ the estimate of the state vector at time $k$ from the observations up to time $k$;
• $\hat{x}_{k+1|k}$ the estimate of the state vector at time $k+1$ from the observations up to the instant $k$;
• $e_{k+1} = z_{k+1} - H_{k+1}\, \hat{x}_{k+1|k}$ the innovation process of the observations at the instant $k+1$;
• the covariance matrix of the innovation
$$R^e_{k+1} = E\left[e_{k+1}\, e^t_{k+1}\right]$$
(the innovation sequence is white, i.e., uncorrelated in time);
• the covariance matrix of the prediction error
$$P_{k+1|k} = E\left[[x_{k+1} - \hat{x}_{k+1|k}]\,[x_{k+1} - \hat{x}_{k+1|k}]^t\right];$$
• the posterior covariance matrix of the estimation error
$$P_{k+1|k+1} = E\left[[x_{k+1} - \hat{x}_{k+1|k+1}]\,[x_{k+1} - \hat{x}_{k+1|k+1}]^t\right]$$
which is also called the covariance matrix of the filtering error.
With these definitions, we obtain easily:
$$E[x_k|z_{1:k}] \stackrel{\text{def}}{=} \hat{x}_{k|k}, \qquad \mathrm{Cov}\{x_k|z_{1:k}\} \stackrel{\text{def}}{=} P_{k|k}$$
$$E[x_{k+1}|z_{1:k}] \stackrel{\text{def}}{=} \hat{x}_{k+1|k}, \qquad \mathrm{Cov}\{x_{k+1}|z_{1:k}\} \stackrel{\text{def}}{=} P_{k+1|k}$$
$$E[z_{k+1}|x_{k+1}] = H_{k+1}\, x_{k+1}, \qquad \mathrm{Cov}\{z_{k+1}|x_{k+1}\} = R_{k+1}$$
$$E[x_{k+1}|z_{1:k}] = F_k\, \hat{x}_{k|k}, \qquad \mathrm{Cov}\{x_{k+1}|z_{1:k}\} = F_k\, P_{k|k}\, F^t_k + G_k\, Q_k\, G^t_k$$
$$E[z_{k+1}|z_{1:k}] = H_{k+1}\, F_k\, \hat{x}_{k|k}, \qquad \mathrm{Cov}\{z_{k+1}|z_{1:k}\} = H_{k+1}\, P_{k+1|k}\, H^t_{k+1} + R_{k+1}$$
Replacing the expressions of $p(z_{k+1}|x_{k+1})$ and $p(z_{k+1}|z_{1:k})$ we obtain:
$$p(x_{k+1}|z_{1:k+1}) = A \exp\left[-\frac{1}{2}\,[x_{k+1} - \hat{x}_{k+1|k+1}]^t\, P^{-1}_{k+1|k+1}\,[x_{k+1} - \hat{x}_{k+1|k+1}]\right]$$
with
$$A = (2\pi)^{-n/2}\, \left|H_{k+1}\, P_{k+1|k}\, H^t_{k+1} + R_{k+1}\right|^{1/2} |R_{k+1}|^{-1/2}\, \left|P_{k+1|k}\right|^{-1/2}$$
$$\hat{x}_{k+1|k+1} = F_k\, \hat{x}_{k|k} + P_{k+1|k}\, H^t_{k+1}\left[R_{k+1} + H_{k+1}\, P_{k+1|k}\, H^t_{k+1}\right]^{-1}\left[z_{k+1} - H_{k+1}\, F_k\, \hat{x}_{k|k}\right]$$
$$P_{k+1|k} = F_k\, P_{k|k}\, F^t_k + G_k\, Q_k\, G^t_k$$
$$P^{-1}_{k+1|k+1} = P^{-1}_{k+1|k} + H^t_{k+1}\, R^{-1}_{k+1}\, H_{k+1}$$
These equations can be rewritten in many different ways. Here are two of them:
• Prediction-Correction form:
– Prediction (time update):
$$\hat{x}_{k+1|k} = F_k\, \hat{x}_{k|k}$$
$$P_{k+1|k} = F_k\, P_{k|k}\, F^t_k + G_k\, Q_k\, G^t_k$$
– Correction (measurement update):
$$\hat{x}_{k+1|k+1} = \hat{x}_{k+1|k} + K^g_{k+1}\left[z_{k+1} - H_{k+1}\, \hat{x}_{k+1|k}\right]$$
$$K^g_{k+1} = P_{k+1|k}\, H^t_{k+1}\, (R^e_{k+1})^{-1}$$
$$R^e_{k+1} = R_{k+1} + H_{k+1}\, P_{k+1|k}\, H^t_{k+1}$$
$$P_{k+1|k+1} = \left[I - K^g_{k+1}\, H_{k+1}\right] P_{k+1|k}$$
• Compact form for prediction:
$$\hat{x}_{k+1|k} = F_k\, \hat{x}_{k|k-1} + K_k\, (R^e_k)^{-1}\left[z_k - H_k\, \hat{x}_{k|k-1}\right]$$
$$R^e_k = R_k + H_k\, P_{k|k-1}\, H^t_k$$
$$K_k = F_k\, P_{k|k-1}\, H^t_k$$
$$P_{k+1|k} = F_k\, P_{k|k-1}\, F^t_k + G_k\, Q_k\, G^t_k - K_k\, (R^e_k)^{-1}\, K^t_k$$
• Compact form for filtering:
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K^g_k\left[z_k - H_k\, \hat{x}_{k|k-1}\right]$$
$$K^g_k = P_{k|k-1}\, H^t_k\left[R_k + H_k\, P_{k|k-1}\, H^t_k\right]^{-1}$$
$$P_{k|k} = P_{k|k-1} - K^g_k\, H_k\, P_{k|k-1}$$
$$P_{k+1|k} = F_k\, P_{k|k}\, F^t_k + G_k\, Q_k\, G^t_k$$
• Very compact form for prediction:
$$\hat{x}_{k+1|k} = F_k\, \hat{x}_{k|k-1} + K^g_k\left[z_k - H_k\, \hat{x}_{k|k-1}\right]$$
$$K^g_k = F_k\, P_{k|k-1}\, H^t_k\left[R_k + H_k\, P_{k|k-1}\, H^t_k\right]^{-1}$$
$$P_{k+1|k} = F_k\, P_{k|k-1}\, F^t_k - K^g_k\, H_k\, P_{k|k-1}\, F^t_k + G_k\, Q_k\, G^t_k$$
where $K_k$ is called the Kalman filter gain and $K^g_k = K_k\, (R^e_k)^{-1}$ the generalized Kalman filter gain.
In all cases the initialization is:
$$\hat{x}_{0|-1} = 0, \qquad P_{0|-1} = P_0$$
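The prediction-correction form above can be sketched in a few lines of code. The following is a minimal illustration and is not taken from the text: `kalman_step` is a hypothetical helper name, and the constant-velocity model values at the bottom are arbitrary.

```python
import numpy as np

def kalman_step(x, P, z, F, G, H, Q, R):
    """One prediction-correction cycle: from (x_{k|k}, P_{k|k}) and z_{k+1}
    to (x_{k+1|k+1}, P_{k+1|k+1})."""
    # Prediction (time update)
    x_pred = F @ x
    P_pred = F @ P @ F.T + G @ Q @ G.T
    # Correction (measurement update)
    Re = R + H @ P_pred @ H.T                 # innovation covariance R^e_{k+1}
    Kg = P_pred @ H.T @ np.linalg.inv(Re)     # generalized gain K^g_{k+1}
    x_new = x_pred + Kg @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - Kg @ H) @ P_pred
    return x_new, P_new

# Illustrative constant-velocity model with scalar position measurements
F = np.array([[1.0, 1.0], [0.0, 1.0]])
G = np.eye(2)
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[1.0]])
x, P = np.zeros(2), 10.0 * np.eye(2)          # x_{0|-1} = 0, P_{0|-1} = P_0
for z in [1.0, 2.1, 2.9, 4.2]:
    x, P = kalman_step(x, P, np.array([z]), F, G, H, Q, R)
```

With a large initial covariance the filter follows the measurements closely at first, then weights its own prediction more as $P$ shrinks.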
7.3 Examples
7.3.1 1D case
$$\begin{cases} x_{n+1} = f\, x_n + u_n \\ z_n = h\, x_n + v_n \end{cases} \qquad (7.8)$$
where $u_n$ and $v_n$ are assumed independent, zero-mean, white and Gaussian with known variances $q$ and $r$ respectively. $x_0$ is also assumed Gaussian with known mean $m_0$ and known variance $p_0$:
$$u_n \sim \mathcal{N}(0, q), \qquad v_n \sim \mathcal{N}(0, r), \qquad x_0 \sim \mathcal{N}(m_0, p_0) \qquad (7.9)$$
The equations in this case reduce to
$$\begin{cases} \hat{x}_{n+1|n} = f\, \hat{x}_{n|n} \\[4pt] \hat{x}_{n|n} = \hat{x}_{n|n-1} + k_n\, (z_n - h\, \hat{x}_{n|n-1}) \\[4pt] k_n = \dfrac{p_{n|n-1}\, h}{h^2\, p_{n|n-1} + r} = \dfrac{1}{h}\; \dfrac{p_{n|n-1}}{p_{n|n-1} + r/h^2} \end{cases} \qquad (7.10)$$
The role of the Kalman gain in the measurement update is easily seen from these expressions: $p_{n|n-1}$ is the MSE incurred in the estimation of $x_n$ from $z_{0:n-1}$, and the ratio $r/h^2$ is a measure of the noisiness of the observations. It is interesting to compare these equations with the Bayesian estimation of the signal amplitude as described in Example ().
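As a sanity check, the scalar recursions (7.10) can be run on data simulated from (7.8)-(7.9). The sketch below is not from the text; the values of $f$, $h$, $q$, $r$ are illustrative. It verifies that the filtered estimates beat the raw observations in mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
f, h, q, r = 0.9, 1.0, 0.1, 0.5           # illustrative model values
m0, p0 = 0.0, 1.0

# Simulate the model (7.8)-(7.9)
N = 200
x = np.zeros(N)
x[0] = rng.normal(m0, np.sqrt(p0))
for n in range(N - 1):
    x[n + 1] = f * x[n] + rng.normal(0, np.sqrt(q))
z = h * x + rng.normal(0, np.sqrt(r), N)

# Scalar Kalman recursions (7.10)
xp, p = m0, p0                             # x_{0|-1}, p_{0|-1}
est = np.zeros(N)
for n in range(N):
    k = p * h / (h**2 * p + r)             # gain k_n
    xf = xp + k * (z[n] - h * xp)          # x_{n|n}
    pf = p - k * h * p                     # p_{n|n}
    est[n] = xf
    xp = f * xf                            # x_{n+1|n}
    p = f**2 * pf + q                      # p_{n+1|n}
```

The empirical MSE of `est` against the true state is close to the steady-state value of $p_{n|n}$, well below the observation noise variance $r$.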
For this particular time-invariant model, we have
$$\begin{cases} p_{n+1|n} = f^2\, p_{n|n} + q \\[4pt] p_{n|n} = \dfrac{p_{n|n-1}}{\frac{h^2}{r}\, p_{n|n-1} + 1} \end{cases} \qquad (7.11)$$
We can eliminate the coupling between these equations and obtain
$$p_{n+1|n} = \frac{f^2\, p_{n|n-1}}{\frac{h^2}{r}\, p_{n|n-1} + 1} + q \qquad (7.12)$$
We see here that as $n$ increases, $p_{n+1|n}$ and so the gain $k_n$ approach a constant. Note that if $p_{n+1|n}$ does approach a constant, say $p_\infty$, then $p_\infty$ must satisfy
$$p_\infty = \frac{f^2\, p_\infty}{\frac{h^2}{r}\, p_\infty + 1} + q \qquad (7.13)$$
This equation is quadratic and has a unique positive solution
$$p_\infty = \frac{1}{2}\left\{\left[\left(\frac{r}{h^2}(1-f^2) - q\right)^2 + \frac{4rq}{h^2}\right]^{1/2} - \frac{r}{h^2}(1-f^2) + q\right\} \qquad (7.14)$$
Moreover,
$$\left|p_{n+1|n} - p_\infty\right| = f^2 \left|\frac{p_{n|n-1}}{\frac{h^2}{r}\, p_{n|n-1} + 1} - \frac{p_\infty}{\frac{h^2}{r}\, p_\infty + 1}\right| \le f^2 \left|p_{n|n-1} - p_\infty\right|$$
and, iterating this inequality,
$$\left|p_{n+1|n} - p_\infty\right| \le f^{2(n+1)} \left|p_0 - p_\infty\right|.$$
This means that if $|f| < 1$ then $p_{n+1|n}$ converges to $p_\infty$. So $|f| < 1$ is a sufficient condition for the Kalman filter to approach a steady state.
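Both the iterated Riccati recursion (7.12) and the closed-form root (7.14) can be checked numerically. In the sketch below (model values illustrative, $|f| < 1$), the iteration converges to the same $p_\infty$ from an arbitrary starting point.

```python
import numpy as np

f, h, q, r = 0.8, 1.0, 0.2, 1.0       # illustrative values, |f| < 1

# Iterate the Riccati recursion (7.12) from an arbitrary starting point
p = 5.0                                # arbitrary p_{0|-1}
for _ in range(200):
    p = f**2 * p / ((h**2 / r) * p + 1.0) + q

# Closed-form positive root (7.14) of the quadratic fixed-point equation
b = (r / h**2) * (1 - f**2) - q
p_inf = 0.5 * (np.sqrt(b**2 + 4 * r * q / h**2) - b)
```

Since the recursion is a contraction with factor $f^2 < 1$, a couple of hundred iterations is far more than needed for agreement to machine precision.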
7.3.2 Track-While-Scan (TWS) Radar
Let us consider the example of a one-dimensional moving target and assume that the target is subject to a random acceleration $A_n$. Then we have
$$\begin{pmatrix} X_{n+1} \\ V_{n+1} \end{pmatrix} = \begin{pmatrix} 1 & T \\ 0 & 1 \end{pmatrix} \begin{pmatrix} X_n \\ V_n \end{pmatrix} + \begin{pmatrix} 0 \\ T \end{pmatrix} A_n$$
$$Z_n = \begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} X_n \\ V_n \end{pmatrix} + e_n$$
For a more general case in 3D we have a state vector with 6 components (3 positions and 3 velocities). But, if we assume that the measurement noises in the 3 dimensions are independent of one another and independent of the components of the acceleration, the problem can be treated as 3 independent one-dimensional moving-target problems.
The Kalman equations for this simple model become
$$\begin{pmatrix} \hat{X}_{n+1|n} \\ \hat{V}_{n+1|n} \end{pmatrix} = \begin{pmatrix} \hat{X}_{n|n} + T\, \hat{V}_{n|n} \\ \hat{V}_{n|n} \end{pmatrix}$$
$$\begin{pmatrix} \hat{X}_{n|n} \\ \hat{V}_{n|n} \end{pmatrix} = \begin{pmatrix} \hat{X}_{n|n-1} \\ \hat{V}_{n|n-1} \end{pmatrix} + \begin{pmatrix} K_{n,1} \\ K_{n,2} \end{pmatrix}\left(Z_n - \hat{X}_{n|n-1}\right)$$
$$\begin{pmatrix} K_{n,1} \\ K_{n,2} \end{pmatrix} = \begin{pmatrix} \dfrac{P(1,1)}{P(1,1)+r} \\[8pt] \dfrac{P(2,1)}{P(1,1)+r} \end{pmatrix}$$
where $P(k,l)$ is the $(k,l)$-th component of the matrix $P_{n|n-1}$.
To reduce the computation, the time-varying elements of the Kalman gain vector can be replaced with some constants (the steady-state values) to obtain
$$\begin{pmatrix} \hat{X}_{n|n} \\ \hat{V}_{n|n} \end{pmatrix} = \begin{pmatrix} \hat{X}_{n|n-1} \\ \hat{V}_{n|n-1} \end{pmatrix} + \begin{pmatrix} \alpha \\ \beta/T \end{pmatrix}\left(Z_n - \hat{X}_{n|n-1}\right) \qquad (7.15)$$
with constant values for $\alpha$ and $\beta$.
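A minimal sketch of the constant-gain tracker (7.15). The gains and noise levels below are illustrative choices, not values derived in the text; even with fixed gains the filter reduces the position error well below the raw measurement error.

```python
import numpy as np

rng = np.random.default_rng(1)
T, r, sigma_a = 1.0, 4.0, 0.05         # illustrative sampling period and noise levels
alpha, beta = 0.5, 0.2                 # fixed (steady-state-like) gains

# Simulate a target with white random acceleration
N = 300
X = np.zeros(N); V = np.zeros(N)
for n in range(N - 1):
    A = rng.normal(0, sigma_a)
    X[n + 1] = X[n] + T * V[n]
    V[n + 1] = V[n] + T * A
Z = X + rng.normal(0, np.sqrt(r), N)

# Constant-gain (alpha-beta) tracking, equation (7.15)
x_hat = v_hat = 0.0
err = np.zeros(N)
for n in range(N):
    x_pred = x_hat + T * v_hat          # prediction
    v_pred = v_hat
    innov = Z[n] - x_pred               # innovation
    x_hat = x_pred + alpha * innov
    v_hat = v_pred + (beta / T) * innov
    err[n] = x_hat - X[n]
```

The price of freezing the gains is a small loss of optimality compared with the exact time-varying Kalman gains, in exchange for two multiplications per update.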
7.3.3 Track-While-Scan (TWS) Radar with dependent acceleration sequences
The simple model of the previous example is not very realistic, because it assumed that the target is subject to an independent random acceleration at each instant. For a heavy target, we can do a little better by assuming that the acceleration $A_n$ is modeled as
$$A_{n+1} = \rho\, A_n + W_n, \qquad n = 0, 1, \ldots$$
a first-order autoregressive (AR) model. The value of $\rho$ can be chosen according to the target: $\rho$ near 0 means a very low inertia target and $\rho$ near 1 means a high inertia target.
To account for this equation, we can extend the state vector and we obtain
$$\begin{pmatrix} X_{n+1} \\ V_{n+1} \\ A_{n+1} \end{pmatrix} = \begin{pmatrix} 1 & T & 0 \\ 0 & 1 & T \\ 0 & 0 & \rho \end{pmatrix} \begin{pmatrix} X_n \\ V_n \\ A_n \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} W_n$$
$$Z_n = \begin{pmatrix} 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} X_n \\ V_n \\ A_n \end{pmatrix} + e_n$$
We again can apply the Kalman equations to this model and obtain:
$$\begin{pmatrix} \hat{X}_{n+1|n} \\ \hat{V}_{n+1|n} \\ \hat{A}_{n+1|n} \end{pmatrix} = \begin{pmatrix} \hat{X}_{n|n} + T\, \hat{V}_{n|n} \\ \hat{V}_{n|n} + T\, \hat{A}_{n|n} \\ \rho\, \hat{A}_{n|n} \end{pmatrix}$$
$$\begin{pmatrix} \hat{X}_{n|n} \\ \hat{V}_{n|n} \\ \hat{A}_{n|n} \end{pmatrix} = \begin{pmatrix} \hat{X}_{n|n-1} \\ \hat{V}_{n|n-1} \\ \hat{A}_{n|n-1} \end{pmatrix} + \begin{pmatrix} K_{n,1} \\ K_{n,2} \\ K_{n,3} \end{pmatrix}\left(Z_n - \hat{X}_{n|n-1}\right)$$
$$\begin{pmatrix} K_{n,1} \\ K_{n,2} \\ K_{n,3} \end{pmatrix} = \begin{pmatrix} \dfrac{P(1,1)}{P(1,1)+r} \\[8pt] \dfrac{P(2,1)}{P(1,1)+r} \\[8pt] \dfrac{P(3,1)}{P(1,1)+r} \end{pmatrix}$$
where $P(k,l)$ is the $(k,l)$-th component of the matrix $P_{n|n-1}$. Again, to reduce the computation, the time-varying elements of the Kalman gain vector can be replaced with some constants related to their steady-state values, to obtain
$$\begin{pmatrix} \hat{X}_{n|n} \\ \hat{V}_{n|n} \\ \hat{A}_{n|n} \end{pmatrix} = \begin{pmatrix} \hat{X}_{n|n-1} \\ \hat{V}_{n|n-1} \\ \hat{A}_{n|n-1} \end{pmatrix} + \begin{pmatrix} \alpha \\ \beta/T \\ \gamma/T^2 \end{pmatrix}\left(Z_n - \hat{X}_{n|n-1}\right) \qquad (7.16)$$
with constant values for $\alpha$, $\beta$ and $\gamma$.
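The augmented three-state model can be run directly through the general prediction-correction equations. The sketch below is illustrative (the values of $T$, $\rho$, $r$ and the driving-noise level are arbitrary assumptions): it builds $F$, $G$, $H$ and filters simulated data.

```python
import numpy as np

T, rho, r, sw = 1.0, 0.9, 1.0, 0.1     # illustrative parameters
F = np.array([[1.0, T,   0.0],
              [0.0, 1.0, T  ],
              [0.0, 0.0, rho]])
G = np.array([[0.0], [0.0], [1.0]])    # W_n drives the acceleration only
H = np.array([[1.0, 0.0, 0.0]])
Q = np.array([[sw**2]])
R = np.array([[r]])

rng = np.random.default_rng(2)
x, P = np.zeros(3), 10.0 * np.eye(3)   # filter state
xt = np.zeros(3)                       # true state
for _ in range(100):
    xt = F @ xt + (G * rng.normal(0, sw)).ravel()
    z = H @ xt + rng.normal(0, np.sqrt(r), 1)
    # prediction
    xp = F @ x
    Pp = F @ P @ F.T + G @ Q @ G.T
    # correction
    K = Pp @ H.T / (H @ Pp @ H.T + R)
    x = xp + (K * (z - H @ xp)).ravel()
    P = (np.eye(3) - K @ H) @ Pp
```

Since only the position is observed, the velocity and acceleration estimates are inferred entirely through the model coupling in $F$.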
7.4 Fast Kalman filter equations
The general equations of Kalman filtering do not assume any stationarity of the system: all the matrices of the system $F$, $G$ and $H$, and also $R$ and $Q$, may depend on the index $k$.
Through the simple example in the previous section, we saw that, when these quantities are independent of $k$, i.e.
$$\hat{x}_{k+1|k} = F\, \hat{x}_{k|k-1} + K_k\, (R^e_k)^{-1}\left[z_k - H\, \hat{x}_{k|k-1}\right]$$
$$R^e_k = R + H\, P_{k|k-1}\, H^t$$
$$K_k = F\, P_{k|k-1}\, H^t$$
$$P_{k+1|k} = F\, P_{k|k-1}\, F^t + G\, Q\, G^t - K_k\, (R^e_k)^{-1}\, K^t_k$$
the system can reach a stationary point, and we can use the steady-state values of the gain, or an approximation to them, to reduce the calculation cost. But doing so does not give an optimal estimate and may not give a very satisfactory solution. Here, we briefly present a slightly better way to obtain fast algorithms without losing too much optimality.
Let us assume a constant system and denote by $\delta P_k$, $\delta K^g_k$ and $\delta R^e_k$ the increments
$$\delta P_k = P_{k|k-1} - P_{k-1|k-2}, \qquad \delta K^g_k = K^g_k - K^g_{k-1}, \qquad \delta R^e_k = R^e_k - R^e_{k-1}$$
Then, it can be shown that $\delta P_{k+1}$ can be factorized as
$$\delta P_{k+1} = \left[F - K^g_{k-1} H\right]\left[\delta P_k - \delta P_k\, H^t\, (R^e_k)^{-1}\, H\, \delta P_k\right]\left[F - K^g_{k-1} H\right]^t$$
$$\phantom{\delta P_{k+1}} = \left[F - K^g_k H\right]\left[\delta P_k + \delta P_k\, H^t\, (R^e_{k-1})^{-1}\, H\, \delta P_k\right]\left[F - K^g_k H\right]^t$$
Then, it is possible to reduce the cost of the calculations by noting that if $\delta P_1$ can be factorized as
$$\delta P_1 = z_0\, M_0\, z^t_0,$$
then $\delta P_{k+1}$ can also be factorized as
$$\delta P_{k+1} = z_k\, M_k\, z^t_k$$
and we obtain the following equations:
$$z_k = \left[F - K^g_k H\right] z_{k-1}$$
$$M_k = M_{k-1} + M_{k-1}\, z^t_{k-1}\, H^t\, (R^e_{k-1})^{-1}\, H\, z_{k-1}\, M_{k-1}$$
$$R^e_{k+1} = R^e_k + H\, z_k\, M_k\, z^t_k\, H^t$$
$$K_{k+1} = K^g_{k+1}\, R^e_{k+1} = K_k + F\, z_k\, M_k\, z^t_k\, H^t$$
$$P_{k|k-1} = P_0 + \sum_{j=0}^{k-1} z_j\, M_j\, z^t_j$$
These are called the Chandrasekhar equations. Note that if $\alpha = \mathrm{rank}\{\delta P_1\}$, where
$$\delta P_1 = F\, P_0\, F^t + G\, Q\, G^t - K_0\, (R^e_0)^{-1}\, K^t_0 - P_0,$$
then $z_k$ has dimensions $(N,\alpha)$ and $M_k$ has dimensions $(\alpha,\alpha)$. So, in place of updating at each time $k$ the matrix $P_k$ with dimensions $(N,N)$, we only have to update the matrices $z_k$ and $M_k$ with dimensions $(N,\alpha)$ and $(\alpha,\alpha)$.
Note also that $M_0$ is the signature matrix of $\delta P_1$ and that the value of $\alpha$ depends on the choice of the initial covariance matrix $P_0$. It is not unusual to have $\alpha = 1$, which greatly reduces the computation cost.
7.5 Kalman filter equations for signal deconvolution
Starting from the convolution equation
$$z(k) = \sum_{i=0}^{p-1} h(i)\, x(k-i) + b(k)$$
rewritten in matrix form
$$\begin{pmatrix} z(1) \\ \vdots \\ z(k) \\ \vdots \\ z(M) \end{pmatrix} = \begin{pmatrix} h(p-1) & \cdots & h(0) & & & \\ & \ddots & & \ddots & & \\ & & h(p-1) & \cdots & h(0) & \\ & & & \ddots & & \ddots \\ & & & & h(p-1) & \cdots & h(0) \end{pmatrix} \begin{pmatrix} x(-p) \\ \vdots \\ x(0) \\ \vdots \\ x(M) \end{pmatrix} + \begin{pmatrix} b(1) \\ \vdots \\ b(k) \\ \vdots \\ b(M) \end{pmatrix}$$
we can propose the following models:
• Constant state vector model
$$\begin{cases} x_{k+1} = x_k = x = [x_{-p}, \ldots, x_{-1}, x_0, x_1, \ldots, x_M]^t \\ z_k = h^t_k\, x_k + v_k \end{cases}$$
$$h_k = \begin{pmatrix} 0 & \cdots & 0 & h_{p-1} & \cdots & h_0 & 0 & \cdots & 0 \end{pmatrix}^t$$
where the coefficient $h_0$ is in the $k$-th position. Then we have
$$u_k = 0, \qquad F_k = G_k = I, \qquad h_{k+1} = D\, h_k$$
with the down-shift matrix
$$D = \begin{pmatrix} 0 & \cdots & \cdots & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & & & \vdots \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & & 1 & 0 \end{pmatrix}$$
If we denote
$$E[x] = x_0, \qquad E\left[[x - x_0][x - x_0]^t\right] = P_0, \qquad E[v_k] = 0, \qquad E[v_k v_j] = r\, \delta_{kj},$$
we obtain
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\, (r^e_k)^{-1}\left[z_k - h^t_k\, \hat{x}_{k|k-1}\right]$$
$$r^e_k = r + h^t_k\, P_{k|k-1}\, h_k$$
$$K_k = P_{k|k-1}\, h_k$$
$$P_{k+1|k} = P_{k|k-1} - K_k\, (r^e_k)^{-1}\, K^t_k$$
Note that the observations $z_k$ are scalar, so $r^e_k$ is also a scalar quantity; but $x$ is an $N$-dimensional vector, and so the covariance matrix $P$ has dimensions $(N \times N)$.
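For the constant-state model, the recursion above is just a recursive least-squares update with scalar observations. The following sketch uses an illustrative FIR response and noise level, with index conventions simplified for compactness (the exact alignment of $h_0$ within $h_k$ is an assumption of this demo):

```python
import numpy as np

rng = np.random.default_rng(3)
p_len, M = 3, 60
h = np.array([0.5, 1.0, 0.3])          # illustrative FIR coefficients
N = M + p_len                          # length of the unknown input vector x
x_true = rng.normal(0, 1, N)
r = 0.01                               # observation noise variance

# Constant-state Kalman: x_{k+1} = x_k, z_k = h_k^t x + v_k
x_hat = np.zeros(N)                    # prior mean x_0 = 0
P = np.eye(N)                          # prior covariance P_0 = I
for k in range(M):
    hk = np.zeros(N)
    hk[k:k + p_len] = h                # shifted copy of the impulse response
    z = hk @ x_true + rng.normal(0, np.sqrt(r))
    re = r + hk @ P @ hk               # scalar innovation variance r^e_k
    K = P @ hk / re                    # gain K_k (r^e_k)^{-1}
    x_hat = x_hat + K * (z - hk @ x_hat)
    P = P - np.outer(K, hk @ P)        # P_{k+1|k} = P - K (r^e)^{-1} K^t
```

Each update costs $O(N^2)$ for the covariance, which is why the structured (Toeplitz/circulant) formulations discussed next matter in practice.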
• Non-constant state space model
We can choose
$$h = [h_0, \ldots, h_{p-1}]^t, \qquad x_k = [x_k, x_{k-1}, \ldots, x_{k-p+1}]^t$$
$$z(k) = h^t\, x_k + b(k), \qquad \text{dimension of } x_k = p \le N.$$
But now we need to introduce a generating state space model for $x_k$. One such model is an AR model:
$$x(n+1) = \sum_{i=1}^{q} a(i)\, x(n-i+1) + u(n+1)$$
where
$$E[u_n] = 0, \qquad E\left[|u_n|^2\right] = \beta^2, \qquad E[u_m u_n] = 0, \quad m \ne n.$$
It is easy then to see that we can write
$$\begin{pmatrix} x(n+1) \\ x(n) \\ \vdots \\ x(n-q+2) \end{pmatrix} = \begin{pmatrix} a_1 & a_2 & \cdots & \cdots & a_q \\ 1 & 0 & \cdots & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & & 1 & 0 \end{pmatrix} \begin{pmatrix} x(n) \\ x(n-1) \\ \vdots \\ x(n-q+1) \end{pmatrix} + \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} u(n+1)$$
Thus we have
$$\begin{cases} x_{k+1} = F\, x_k + G\, u_{k+1} \\ z_k = h^t\, x_k + b_k \end{cases}$$
with $F$ the companion matrix above and $G = [1, 0, \ldots, 0]^t$.
This model has the advantage that $F$, $G$ and $H$ are constant, so we can use fast algorithms; but the main drawback in practical applications is the determination of $q$ and $a_k$, $k = 1, \ldots, q$.
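Building the companion-form pair $(F, G)$ from AR coefficients is mechanical. The helper below, `ar_state_space`, is a hypothetical name introduced for this sketch; the short check confirms that the first state component reproduces the direct AR recursion.

```python
import numpy as np

def ar_state_space(a):
    """Companion-form (F, G) for x(n+1) = sum_i a[i] x(n-i) + u(n+1)."""
    q = len(a)
    F = np.zeros((q, q))
    F[0, :] = a                        # first row: a_1, ..., a_q
    F[1:, :-1] = np.eye(q - 1)         # sub-diagonal shifts the delayed samples
    G = np.zeros((q, 1))
    G[0, 0] = 1.0
    return F, G

F, G = ar_state_space([0.5, -0.3])     # illustrative AR(2) coefficients

# Drive the state model and record the first component
rng = np.random.default_rng(4)
u = rng.normal(size=20)
s = np.zeros((2, 1))
traj = []
for un in u:
    s = F @ s + G * un
    traj.append(s[0, 0])

# Direct recursion for comparison: x(n) = 0.5 x(n-1) - 0.3 x(n-2) + u(n)
xs = [0.0, 0.0]
for un in u:
    xs.append(0.5 * xs[-1] - 0.3 * xs[-2] + un)
```

The two trajectories agree exactly, since the state model is just a bookkeeping rewrite of the recursion.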
Consider the following convolution model (input $f(t)$ filtered by a system $H$ and corrupted by additive noise $b(t)$ to give the output $g(t)$):
$$g(t) = \int f(t')\, h(t - t')\, \mathrm{d}t' + b(t) = \int h(t')\, f(t - t')\, \mathrm{d}t' + b(t)$$
and assume the following hypotheses:
• The signals $f(t)$, $g(t)$, $h(t)$ are discretized with the same sampling period $\Delta T = 1$;
• The impulse response is finite (FIR): $h(t) = 0$ for $t < -q\,\Delta T$ or $t > p\,\Delta T$.
Then we have:
$$g(m) = \sum_{k=-q}^{p} h(k)\, f(m-k) + b(m), \qquad m = 0, \ldots, M$$
or in matrix form
$$\begin{pmatrix} g(0) \\ g(1) \\ \vdots \\ g(M) \end{pmatrix} = \begin{pmatrix} h(p) & \cdots & h(0) & \cdots & h(-q) & 0 & \cdots & 0 \\ 0 & \ddots & & \ddots & & \ddots & & \vdots \\ \vdots & & h(p) & \cdots & h(0) & \cdots & h(-q) & \vdots \\ \vdots & & & \ddots & & \ddots & & 0 \\ 0 & \cdots & \cdots & 0 & h(p) & \cdots & h(0) & \cdots \end{pmatrix} \begin{pmatrix} f(-p) \\ \vdots \\ f(0) \\ f(1) \\ \vdots \\ f(M) \\ f(M+1) \\ \vdots \\ f(M+q) \end{pmatrix}$$
or
$$g = H f + b$$
Note that $g$ is an $(M+1)$-dimensional vector, $f$ has dimension $M+p+q+1$, $h = [h(p), \ldots, h(0), \ldots, h(-q)]$ has dimension $(p+q+1)$, and the matrix $H$ has dimensions $(M+1) \times (M+p+q+1)$.
Now, if we assume that the system is causal ($q = 0$) we obtain
$$\begin{pmatrix} g(0) \\ g(1) \\ \vdots \\ g(M) \end{pmatrix} = \begin{pmatrix} h(p) & \cdots & h(0) & 0 & \cdots & 0 \\ 0 & \ddots & & \ddots & & \vdots \\ \vdots & & h(p) & \cdots & h(0) & \vdots \\ 0 & \cdots & 0 & h(p) & \cdots & h(0) \end{pmatrix} \begin{pmatrix} f(-p) \\ \vdots \\ f(0) \\ f(1) \\ \vdots \\ f(M) \end{pmatrix}$$
If the input signal is also assumed to be causal, we obtain:
$$\begin{pmatrix} g(0) \\ g(1) \\ \vdots \\ g(M) \end{pmatrix} = \begin{pmatrix} h(0) & & & & \\ h(1) & h(0) & & & \\ \vdots & & \ddots & & \\ h(p) & \cdots & h(0) & & \\ & \ddots & & \ddots & \\ 0 & \cdots & h(p) & \cdots & h(0) \end{pmatrix} \begin{pmatrix} f(0) \\ f(1) \\ \vdots \\ f(M) \end{pmatrix} \qquad (7.17)$$
and finally, if $p = M$, we have:
$$\begin{pmatrix} g(0) \\ g(1) \\ \vdots \\ g(M) \end{pmatrix} = \begin{pmatrix} h(0) & & & \\ h(1) & h(0) & & \\ \vdots & & \ddots & \\ h(M) & \cdots & h(1) & h(0) \end{pmatrix} \begin{pmatrix} f(0) \\ f(1) \\ \vdots \\ f(M) \end{pmatrix}$$
Remark that, in all cases, the matrix $H$ is Toeplitz. In the case where the input signal and the system are both causal, (7.17) can be rewritten as
$$\begin{pmatrix} g(0) \\ g(1) \\ \vdots \\ g(M) \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} h(0) & 0 & \cdots & 0 & h(p) & \cdots & h(1) \\ h(1) & \ddots & & & \ddots & & \vdots \\ \vdots & & & & & & h(p) \\ h(p) & \cdots & h(0) & & & & 0 \\ 0 & \ddots & & \ddots & & & \vdots \\ \vdots & & h(p) & \cdots & h(0) & & 0 \\ 0 & \cdots & & 0 & h(p) & \cdots & h(0) \end{pmatrix} \begin{pmatrix} f(0) \\ f(1) \\ \vdots \\ f(M) \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$
where $f$ and $g$ have been artificially completed with some zeros. This operation is called zero-padding (zero-filling), and its main advantage is that the matrix $H$ is now a circulant matrix.
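The circulant structure is what makes zero-padding useful: a circulant matrix is diagonalized by the DFT, so the product $Hf$ can be computed with FFTs in $O(L \log L)$ instead of $O(L^2)$. A small numerical check (sizes and impulse response are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
M, p = 16, 2
h = np.array([1.0, 0.6, 0.2])            # h(0), h(1), h(2): causal FIR, p = 2
f = np.zeros(M + 1 + p)                   # f(0..M) completed with p zeros
f[:M + 1] = rng.normal(size=M + 1)
L = len(f)

# Circulant matrix whose first column is the zero-padded impulse response
c = np.zeros(L)
c[:p + 1] = h
H = np.array([[c[(i - j) % L] for j in range(L)] for i in range(L)])
g = H @ f

# Diagonalization by the DFT: H f = IDFT( DFT(c) * DFT(f) )
g_fft = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(f)))
```

Because $f$ ends with at least $p$ zeros, the circular convolution coincides with the linear (Toeplitz) one on the first $M+1$ samples.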
Starting from the Kalman filter equations for the state space model
$$\begin{cases} z_k = H_k\, x_k + v_k & \text{(observation equation)} \\ x_{k+1} = F_k\, x_k + G_k\, u_k & \text{(state equation)} \end{cases}$$
we have
$$\hat{x}_{k+1|k} = F_k\, \hat{x}_{k|k}$$
$$P_{k+1|k} = F_k\, P_{k|k}\, F^t_k + G_k\, Q_k\, G^t_k$$
$$\hat{x}_{k+1|k+1} = \hat{x}_{k+1|k} + K^f_{k+1}\left[z_{k+1} - H_{k+1}\, \hat{x}_{k+1|k}\right]$$
$$K^f_{k+1} = P_{k+1|k}\, H^t_{k+1}\, (R^e_{k+1})^{-1}$$
$$R^e_{k+1} = R_{k+1} + H_{k+1}\, P_{k+1|k}\, H^t_{k+1}$$
$$P_{k+1|k+1} = \left[I - K^f_{k+1}\, H_{k+1}\right] P_{k+1|k}$$
7.5.1 AR, MA and ARMA Models
AR model:
$$u(n) = \sum_{k=1}^{p} a(k)\, u(n-k) + \epsilon(n), \qquad \forall n$$
$$E[\epsilon(n)] = 0, \qquad E\left[|\epsilon(n)|^2\right] = \beta^2, \qquad E[\epsilon(n)\, u(m)] = 0, \quad m \ne n$$
$$\epsilon(n) \;\longrightarrow\; \boxed{H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a(k)\, z^{-k}}} \;\longrightarrow\; u(n)$$
MA model:
$$u(n) = \sum_{k=0}^{q} b(k)\, \epsilon(n-k), \qquad \forall n$$
$$\epsilon(n) \;\longrightarrow\; \boxed{B(z) = \sum_{k=0}^{q} b(k)\, z^{-k}} \;\longrightarrow\; u(n)$$
ARMA model:
$$u(n) = \sum_{k=1}^{p} a(k)\, u(n-k) + \sum_{l=0}^{q} b(l)\, \epsilon(n-l)$$
$$\epsilon(n) \;\longrightarrow\; \boxed{H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{q} b(k)\, z^{-k}}{1 - \sum_{k=1}^{p} a(k)\, z^{-k}}} \;\longrightarrow\; u(n)$$
which can also be seen as the cascade
$$\epsilon(n) \;\longrightarrow\; \boxed{B_q(z)} \;\longrightarrow\; \boxed{\frac{1}{A_p(z)}} \;\longrightarrow\; u(n)$$
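The equivalence of the direct ARMA recursion and the $B_q(z)$ followed by $1/A_p(z)$ cascade can be checked numerically. The coefficients below are illustrative, and the sketch uses the sign convention $u(n) = \sum_k a(k)\, u(n-k) + \sum_l b(l)\, \epsilon(n-l)$:

```python
import numpy as np

rng = np.random.default_rng(6)
a = np.array([0.6, -0.2])               # AR part (illustrative, stable)
b = np.array([1.0, 0.4])                # MA part (illustrative)
N = 500
eps = rng.normal(size=N)

# Direct ARMA recursion
u = np.zeros(N)
for n in range(N):
    u[n] = sum(a[k] * u[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0) \
         + sum(b[l] * eps[n - l] for l in range(len(b)) if n - l >= 0)

# Cascade: MA filter B(z) first, then the all-pole filter 1/A(z)
w = np.zeros(N)                          # w = B(z) eps
for n in range(N):
    w[n] = sum(b[l] * eps[n - l] for l in range(len(b)) if n - l >= 0)
u2 = np.zeros(N)
for n in range(N):
    u2[n] = w[n] + sum(a[k] * u2[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
```

Since both paths realize the same LTI transfer function $B(z)/A(z)$ from the same initial rest conditions, the outputs are identical.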
In a dynamic system, in general, we are interested in a physical quantity $x$ through the observation of a quantity $z$ related to $x$ by the following system of equations:
$$\begin{cases} x_{n+1} = f_n(x_n, u_n) \\ z_n = h_n(x_n, v_n) \end{cases} \qquad (7.18)$$
Chapter 8
Some complements to Bayesian estimation
8.1 Choice of a prior law in the Bayesian estimation
One of the main difficulties in the application of Bayesian theory in practice is the choice or the attribution of the direct probabilities $f(x|\theta)$ and $\pi(\theta)$. In general, $f(x|\theta)$ is obtained via an appropriate model relating the observable quantity $X$ to the parameters $\theta$ and is well accepted. The choice or the attribution of the prior $\pi(\theta)$ has been, and still is, the main subject of discussion and controversy between Bayesian and orthodox statisticians.
Here, I will try to give a brief summary of the different approaches and tools that can be used to attribute a prior probability distribution. There are mainly four tools:
• use of some invariance principles;
• use of the maximum entropy (ME) principle;
• use of conjugate and reference priors;
• use of other information criteria.
8.1.1 Invariance principles
Definition 1 [Group invariance] A probability model $f(x|\theta)$ is said to be invariant (or closed) under the action of a group of transformations $\mathcal{G}$ if, for every $g \in \mathcal{G}$, there exists a unique $\theta^* = \bar{g}(\theta) \in \mathcal{T}$ such that $y = g(x)$ is distributed according to $f(y|\theta^*)$.
Example 1 Any probability density function of the form $f(x|\theta) = f(x - \theta)$ is invariant under the translation group
$$\mathcal{G} : \{g_c(x) : g_c(x) = x + c,\; c \in \mathbb{R}\} \qquad (8.1)$$
This can be verified as follows:
$$x \sim f(x-\theta) \;\longrightarrow\; y = x + c \sim f(y - \theta^*) \quad \text{with } \theta^* = \theta + c$$
Example 2 Any probability density function of the form $f(x|\theta) = \frac{1}{\theta} f\!\left(\frac{x}{\theta}\right)$ is invariant under the multiplicative or scale transformation group
$$\mathcal{G} : \{g_s(x) : g_s(x) = s\,x,\; s > 0\} \qquad (8.2)$$
This can be verified as follows:
$$x \sim \frac{1}{\theta} f\!\left(\frac{x}{\theta}\right) \;\longrightarrow\; y = s\,x \sim \frac{1}{\theta^*} f\!\left(\frac{y}{\theta^*}\right) \quad \text{with } \theta^* = s\,\theta$$
Example 3 Any probability density function of the form $f(x|\theta_1, \theta_2) = \frac{1}{\theta_2} f\!\left(\frac{x-\theta_1}{\theta_2}\right)$ is invariant under the affine transformation group
$$\mathcal{G} : \{g_{a,b}(x) : g_{a,b}(x) = a\,x + b,\; a > 0,\; b \in \mathbb{R}\} \qquad (8.3)$$
This can be verified as follows:
$$x \sim \frac{1}{\theta_2} f\!\left(\frac{x-\theta_1}{\theta_2}\right) \;\longrightarrow\; y = a\,x + b \sim \frac{1}{\theta^*_2} f\!\left(\frac{y-\theta^*_1}{\theta^*_2}\right) \quad \text{with } \theta^*_2 = a\,\theta_2,\; \theta^*_1 = a\,\theta_1 + b.$$
Example 4 Any multivariate probability density function of the form $f(\mathbf{x}|\boldsymbol{\theta}) = f(\mathbf{x} - \boldsymbol{\theta})$ is invariant under the translation group
$$\mathcal{G} : \{g_{\mathbf{c}}(\mathbf{x}) : g_{\mathbf{c}}(\mathbf{x}) = \mathbf{x} - \mathbf{c},\; \mathbf{c} \in \mathbb{R}^n\} \qquad (8.4)$$
Example 5 Any multivariate probability density function of the form $f(\mathbf{x}) = f(\|\mathbf{x}\|)$ is invariant under the orthogonal transformation group
$$\mathcal{G} : \left\{g_A(\mathbf{x}) : g_A(\mathbf{x}) = A\mathbf{x},\; A^t A = A A^t = I\right\} \qquad (8.5)$$
Example 6 Any multivariate probability density function of the form $f(\mathbf{x}|\theta) = \frac{1}{\theta} f\!\left(\frac{\|\mathbf{x}\|}{\theta}\right)$ is invariant under the following transformation group
$$\mathcal{G} : \left\{g_{A,s}(\mathbf{x}) : g_{A,s}(\mathbf{x}) = s\,A\mathbf{x},\; A^t A = A A^t = I,\; s > 0\right\} \qquad (8.6)$$
This can be verified as follows:
$$\mathbf{x} \sim \frac{1}{\theta} f\!\left(\frac{\|\mathbf{x}\|}{\theta}\right) \;\longrightarrow\; \mathbf{y} = s\,A\mathbf{x} \sim \frac{1}{\theta^*} f\!\left(\frac{\|\mathbf{y}\|}{\theta^*}\right) \quad \text{with } \theta^* = s\,\theta.$$
From these examples we also see that any invariance transformation group $\mathcal{G}$ on $x \in \mathcal{X}$ induces a corresponding transformation group $\bar{\mathcal{G}}$ on $\theta \in \mathcal{T}$. For example, the translation invariance $\mathcal{G}$ on $x \in \mathcal{X}$ induces the following translation group on $\theta \in \mathcal{T}$:
$$\bar{\mathcal{G}} : \{\bar{g}_c(\theta) : \bar{g}_c(\theta) = \theta + c,\; c \in \mathbb{R}\} \qquad (8.7)$$
and the scale invariance $\mathcal{G}$ on $x \in \mathcal{X}$ induces the following scale group on $\theta \in \mathcal{T}$:
$$\bar{\mathcal{G}} : \{\bar{g}_s(\theta) : \bar{g}_s(\theta) = s\,\theta,\; s > 0\} \qquad (8.8)$$
We have just seen that for an invariant family of $f(x|\theta)$ we have a corresponding invariant family of prior laws $\pi(\theta)$. To be complete, we also have to consider the cost function to be able to define the Bayesian estimate.
Definition 2 [Invariant cost functions] Assume a probability model $f(x|\theta)$ is invariant under the action of the group of transformations $\mathcal{G}$. Then the cost function $c[\hat{\theta}, \theta]$ is said to be invariant under $\mathcal{G}$ if, for every $g \in \mathcal{G}$ and $\hat{\theta} \in \mathcal{T}$, there exists a unique $\hat{\theta}^* = \tilde{g}(\hat{\theta}) \in \mathcal{T}$ with $\tilde{g} \in \tilde{\mathcal{G}}$ such that
$$c[\hat{\theta}, \theta] = c[\hat{\theta}^*, \bar{g}(\theta)] \quad \text{for every } \theta \in \mathcal{T}.$$
Definition 3 [Invariant estimate] For a probability model $f(x|\theta)$ invariant under the group of transformations $\mathcal{G}$ and a cost function $c[\hat{\theta}, \theta]$ invariant under the corresponding group of transformations $\bar{\mathcal{G}}$, an estimate $\hat{\theta}$ is said to be invariant or equivariant if
$$\hat{\theta}(g(x)) = \bar{g}\left(\hat{\theta}(x)\right)$$
Example 7 Estimation of $\theta$ from data coming from any model of the kind $f(x|\theta) = f(x - \theta)$ with the quadratic cost function $c[\hat{\theta}, \theta] = (\hat{\theta} - \theta)^2$ is equivariant, and we have
$$\mathcal{G} = \bar{\mathcal{G}} = \tilde{\mathcal{G}} = \{g_c(x) : g_c(x) = x - c,\; c \in \mathbb{R}\}$$
Example 8 Estimation of $\theta$ from data coming from any model of the kind $f(x|\theta) = \frac{1}{\theta} f\!\left(\frac{x}{\theta}\right)$ with the entropy cost function
$$c[\hat{\theta}, \theta] = \frac{\hat{\theta}}{\theta} - \ln\!\left(\frac{\hat{\theta}}{\theta}\right) - 1$$
is equivariant, and we have
$$\mathcal{G} = \{g_s(x) : g_s(x) = s\,x,\; s > 0\}, \qquad \bar{\mathcal{G}} = \tilde{\mathcal{G}} = \{\bar{g}_s(\theta) : \bar{g}_s(\theta) = s\,\theta,\; s > 0\}$$
Proposition 1 [Invariant Bayesian estimate] Suppose that a probability model $f(x|\theta)$ is invariant under the group of transformations $\mathcal{G}$ and that there exists a probability distribution $\pi^*(\theta)$ on $\mathcal{T}$ which is invariant under the group of transformations $\bar{\mathcal{G}}$, i.e.,
$$\pi^*(\bar{g}(A)) = \pi^*(A)$$
for any measurable set $A \subset \mathcal{T}$. Then the Bayes estimator associated with $\pi^*$, noted $\hat{\theta}^*$, minimizes
$$\int R(\theta, \hat{\theta})\, \pi^*(\theta)\, \mathrm{d}\theta = \int R\!\left(\theta, \bar{g}(\hat{\theta})\right) \pi^*(\theta)\, \mathrm{d}\theta = \int E\left[c\!\left[\theta, \tilde{g}\!\left(\hat{\theta}(X)\right)\right]\right] \pi^*(\theta)\, \mathrm{d}\theta$$
over $\hat{\theta}$. If this Bayes estimator is unique, it satisfies
$$\hat{\theta}^*(x) = \bar{g}^{-1}\!\left(\hat{\theta}^*(g(x))\right)$$
Therefore, a Bayes estimator associated with an invariant prior and a strictly convex invariant cost function is almost equivariant.
Actually, invariant probability distributions are rare. The following are some examples:
Example 9 If $\pi(\theta)$ is invariant under the translation group $\mathcal{G}_c$, it satisfies $\pi(\theta) = \pi(\theta + c)$ for every $\theta$ and every $c$, which implies that $\pi(\theta) = \pi(0)$ uniformly on $\mathbb{R}$; this leads to the Lebesgue measure as an invariant measure.
Example 10 If $\theta > 0$ and $\pi(\theta)$ is invariant under the scale group $\mathcal{G}_s$, it satisfies $\pi(\theta) = s\, \pi(s\theta)$ for every $\theta > 0$ and every $s > 0$, which implies that $\pi(\theta) \propto 1/\theta$.
Note that in both cases the invariant laws are improper.
8.2 Conjugate priors
The conjugate prior concept is tightly related to sufficient statistics and exponential families.
Definition 4 [Sufficient statistics] When $X \sim P_\theta(x)$, a function $h(X)$ is said to be a sufficient statistic for $\{P_\theta(x), \theta \in \mathcal{T}\}$ if the distribution of $X$ conditioned on $h(X)$ does not depend on $\theta$ for $\theta \in \mathcal{T}$.
Definition 5 [Minimal sufficiency] A function $h(X)$ is said to be minimal sufficient for $\{P_\theta(x), \theta \in \mathcal{T}\}$ if it is a function of every other sufficient statistic for $P_\theta(x)$.
A minimal sufficient statistic contains the whole information brought by the observation $X = x$ about $\theta$.
Proposition 2 [Factorization theorem] Suppose that $\{P_\theta(x), \theta \in \mathcal{T}\}$ has a corresponding family of densities $\{p_\theta(x), \theta \in \mathcal{T}\}$. A statistic $T$ is sufficient for $\theta$ if and only if there exist functions $g_\theta$ and $h$ such that
$$p_\theta(x) = g_\theta(T(x))\, h(x) \qquad (8.9)$$
for all $x \in \Gamma$ and $\theta \in \mathcal{T}$.
Example 11 If $X \sim \mathcal{N}(\theta, 1)$ then $T(x) = x$ can be chosen as a sufficient statistic.
Example 12 If $\{X_1, X_2, \ldots, X_n\}$ are i.i.d. and $X_i \sim \mathcal{N}(\theta, 1)$ then
$$f(x|\theta) = (2\pi)^{-n/2} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2\right] = \exp\left[-\frac{1}{2}\sum_{i=1}^{n} x_i^2\right] (2\pi)^{-n/2} \exp\left[-\frac{n}{2}\theta^2\right] \exp\left[\theta \sum_{i=1}^{n} x_i\right]$$
and we have $T(x) = \sum_{i=1}^{n} x_i$.
Note that, in this case, we need to know $n$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Note also that we can write
$$f(x|\theta) = a(x)\, g(\theta)\, \exp[\theta\, T(x)]$$
where
$$g(\theta) = (2\pi)^{-n/2} \exp\left[-\frac{n}{2}\theta^2\right] \quad \text{and} \quad a(x) = \exp\left[-\frac{1}{2}\sum_{i=1}^{n} x_i^2\right]$$
Example 13 If $X \sim \mathcal{N}(0, \theta)$ then $T(x) = x^2$ can be chosen as a sufficient statistic.
Example 14 If $X \sim \mathcal{N}(\theta_1, \theta_2)$ then $T_1(x) = x^2$ and $T_2(x) = x$ can be chosen as a set of sufficient statistics.
Example 15 If $\{X_1, X_2, \ldots, X_n\}$ are i.i.d. and $X_i \sim \mathcal{N}(\theta_1, \theta_2)$ then
$$f(x|\theta_1, \theta_2) = (2\pi)^{-n/2}\, \theta_2^{-n/2} \exp\left[-\frac{1}{2\theta_2}\sum_{i=1}^{n}(x_i - \theta_1)^2\right]$$
$$= (2\pi)^{-n/2}\, \theta_2^{-n/2} \exp\left[-\frac{n\theta_1^2}{2\theta_2}\right] \exp\left[-\frac{1}{2\theta_2}\sum_{i=1}^{n} x_i^2 + \frac{\theta_1}{\theta_2}\sum_{i=1}^{n} x_i\right]$$
and we have $T_1(x) = \sum_{i=1}^{n} x_i$ and $T_2(x) = \sum_{i=1}^{n} x_i^2$.
Note also that we can write
$$f(x|\theta) = a(x)\, g(\theta_1, \theta_2)\, \exp\left[\frac{\theta_1}{\theta_2}\, T_1(x) - \frac{1}{2\theta_2}\, T_2(x)\right]$$
where
$$g(\theta_1, \theta_2) = (2\pi)^{-n/2}\, \theta_2^{-n/2} \exp\left[-\frac{n\theta_1^2}{2\theta_2}\right] \quad \text{and} \quad a(x) = 1.$$
In this case, $\frac{\theta_1}{\theta_2}$ and $-\frac{1}{2\theta_2}$ are called the canonical parametrization. It is also usual to use $n$, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\overline{x^2} = \frac{1}{n}\sum_{i=1}^{n} x_i^2$ as the sufficient statistics.
Example 16 If $X \sim \mathrm{Gam}(\alpha, \theta)$ then $T(x) = x$ can be chosen as a sufficient statistic.
Example 17 If $X \sim \mathrm{Gam}(\theta, \beta)$ then $T(x) = \ln x$ can be chosen as a sufficient statistic.
Example 18 If $X \sim \mathrm{Gam}(\theta_1, \theta_2)$ then $T_1(x) = \ln x$ and $T_2(x) = x$ can be chosen as a set of sufficient statistics.
Example 19 If $\{X_1, X_2, \ldots, X_n\}$ are i.i.d. and $X_i \sim \mathrm{Gam}(\theta_1, \theta_2)$ then it is easy to show that $T_1(x) = \sum_{i=1}^{n} \ln x_i$ and $T_2(x) = \sum_{i=1}^{n} x_i$.
Definition 6 [Exponential family] A class of distributions $\{P_\theta(x), \theta \in \mathcal{T}\}$ is said to be an exponential family if there exist $a(x)$ a function from $\Gamma$ to $\mathbb{R}$, $g(\theta)$ a function from $\mathcal{T}$ to $\mathbb{R}^+$, $\phi_k(\theta)$ functions from $\mathcal{T}$ to $\mathbb{R}$, and $h_k(x)$ functions from $\Gamma$ to $\mathbb{R}$ such that
$$p_\theta(x) = p(x|\theta) = a(x)\, g(\theta) \exp\left[\sum_{k=1}^{K} \phi_k(\theta)\, h_k(x)\right] = a(x)\, g(\theta) \exp\left[\boldsymbol{\phi}^t(\theta)\, \mathbf{h}(x)\right]$$
for all $\theta \in \mathcal{T}$ and $x \in \Gamma$. This family is entirely determined by $a(x)$, $g(\theta)$, and $\{\phi_k(\theta), h_k(x), k = 1, \ldots, K\}$ and is noted $\mathrm{Exf}(x|a, g, \boldsymbol{\phi}, \mathbf{h})$.
Particular cases:
• When $a(x) = 1$ and $g(\theta) = \exp[-b(\theta)]$ we have
$$p(x|\theta) = \exp\left[\boldsymbol{\phi}^t(\theta)\, \mathbf{h}(x) - b(\theta)\right]$$
and it is noted $\mathrm{CExf}(x|b, \boldsymbol{\phi}, \mathbf{h})$.
• Natural exponential family: when $a(x) = 1$, $g(\theta) = \exp[-b(\theta)]$, $\mathbf{h}(x) = x$ and $\boldsymbol{\phi}(\theta) = \theta$ we have
$$p(x|\theta) = \exp\left[\theta^t x - b(\theta)\right]$$
and it is noted $\mathrm{NExf}(x|b)$.
• Scalar random variable with a vector parameter:
$$p(x|\theta) = a(x)\, g(\theta) \exp\left[\sum_{k=1}^{K} \phi_k(\theta)\, h_k(x)\right] = a(x)\, g(\theta) \exp\left[\boldsymbol{\phi}^t(\theta)\, \mathbf{h}(x)\right]$$
and it is noted $\mathrm{Exf}_k(x|a, g, \boldsymbol{\phi}, \mathbf{h})$.
• Scalar random variable with a scalar parameter:
$$p(x|\theta) = a(x)\, g(\theta) \exp[\phi(\theta)\, h(x)]$$
and it is noted $\mathrm{Exf}(x|a, g, \phi, h)$.
• Simple scalar exponential family:
$$p(x|\theta) = \theta \exp[-\theta x] = \exp[-\theta x + \ln\theta], \qquad x \ge 0,\; \theta \ge 0.$$
Definition 7 [Conjugate distributions] A family $\mathcal{F}$ of probability distributions $\pi(\theta)$ on $\mathcal{T}$ is said to be conjugate (or closed under sampling) if, for every $\pi(\theta) \in \mathcal{F}$, the posterior distribution $\pi(\theta|x)$ also belongs to $\mathcal{F}$.
The main argument for the development of conjugate priors is the following: when the observation of a variable $X$ with a probability law $f(x|\theta)$ modifies the prior $\pi(\theta)$ to a posterior $\pi(\theta|x)$, the information conveyed by $x$ about $\theta$ is obviously limited; therefore it should not lead to a modification of the whole structure of $\pi(\theta)$, but only of its parameters.
Definition 8 [Conjugate priors] Assume that $f(x|\theta) = l(\theta|x) = l(\theta|t(x))$, where $t = \{n, s\} = \{n, s_1, \ldots, s_k\}$ is a vector of dimension $k+1$ and is a sufficient statistic for $f(x|\theta)$. Then, if there exists a vector $\{\tau_0, \boldsymbol{\tau}\} = \{\tau_0, \tau_1, \ldots, \tau_k\}$ such that
$$\pi(\theta|\boldsymbol{\tau}) = \frac{f(s = (\tau_1, \ldots, \tau_k)\,|\,\theta,\, n = \tau_0)}{\int f(s = (\tau_1, \ldots, \tau_k)\,|\,\theta',\, n = \tau_0)\, \mathrm{d}\theta'}$$
exists and defines a family $\mathcal{F}$ of distributions for $\theta \in \mathcal{T}$, then the posterior $\pi(\theta|x, \boldsymbol{\tau})$ will remain in the same family $\mathcal{F}$. The prior distribution $\pi(\theta|\boldsymbol{\tau})$ is then a conjugate prior for the sampling distribution $f(x|\theta)$.
Proposition 3 [Sufficient statistics for the exponential family] For a set of $n$ i.i.d. samples $\{x_1, \ldots, x_n\}$ of a random variable $X \sim \mathrm{Exf}(x|a, g, \boldsymbol{\phi}, \mathbf{h})$ we have
$$f(x|\theta) = \prod_{j=1}^{n} f(x_j|\theta) = [g(\theta)]^n \left[\prod_{j=1}^{n} a(x_j)\right] \exp\left[\sum_{k=1}^{K} \phi_k(\theta) \sum_{j=1}^{n} h_k(x_j)\right] = g^n(\theta)\, a(x) \exp\left[\boldsymbol{\phi}^t(\theta) \sum_{j=1}^{n} \mathbf{h}(x_j)\right],$$
where $a(x) = \prod_{j=1}^{n} a(x_j)$. Then, using the factorization theorem, it is easy to see that
$$t = \left(n,\; \sum_{j=1}^{n} h_1(x_j),\; \ldots,\; \sum_{j=1}^{n} h_K(x_j)\right)$$
is a sufficient statistic for $\theta$.
Proposition 4 [Conjugate priors of the exponential family] A conjugate prior family for the exponential family
$$f(x|\theta) = a(x)\, g(\theta) \exp\left[\sum_{k=1}^{K} \phi_k(\theta)\, h_k(x)\right]$$
is given by
$$\pi(\theta|\tau_0, \boldsymbol{\tau}) = z(\boldsymbol{\tau})\, [g(\theta)]^{\tau_0} \exp\left[\sum_{k=1}^{K} \tau_k\, \phi_k(\theta)\right]$$
The associated posterior law is
$$\pi(\theta|x, \tau_0, \boldsymbol{\tau}) \propto [g(\theta)]^{n+\tau_0}\, a(x)\, z(\boldsymbol{\tau}) \exp\left[\sum_{k=1}^{K}\left(\tau_k + \sum_{j=1}^{n} h_k(x_j)\right) \phi_k(\theta)\right].$$
We can rewrite this in a more compact way: if
$$f(x|\theta) = \mathrm{Exf}(x|a(x), g(\theta), \boldsymbol{\phi}, \mathbf{h}),$$
then a conjugate prior family is
$$\pi(\theta|\boldsymbol{\tau}) = \mathrm{Exf}(\theta|g^{\tau_0}, z(\boldsymbol{\tau}), \boldsymbol{\tau}, \boldsymbol{\phi}),$$
and the associated posterior law is
$$\pi(\theta|x, \boldsymbol{\tau}) = \mathrm{Exf}(\theta|g^{n+\tau_0}, a(x)\, z(\boldsymbol{\tau}), \boldsymbol{\tau}', \boldsymbol{\phi})$$
where
$$\tau'_k = \tau_k + \sum_{j=1}^{n} h_k(x_j), \qquad \text{i.e.} \qquad \boldsymbol{\tau}' = \boldsymbol{\tau} + \bar{\mathbf{h}}, \quad \text{with } \bar{h}_k = \sum_{j=1}^{n} h_k(x_j).$$
Definition 9 [Conjugate priors of the natural exponential family] If
$$f(x|\theta) = a(x) \exp\left[\theta^t x - b(\theta)\right]$$
then a conjugate prior family is
$$\pi(\theta|\boldsymbol{\tau}_0) = g(\theta) \exp\left[\boldsymbol{\tau}_0^t\, \theta - d(\boldsymbol{\tau}_0)\right]$$
and the corresponding posterior is
$$\pi(\theta|x, \boldsymbol{\tau}_0) = g(\theta) \exp\left[\boldsymbol{\tau}_n^t\, \theta - d(\boldsymbol{\tau}_n)\right] \quad \text{with} \quad \boldsymbol{\tau}_n = \boldsymbol{\tau}_0 + n\, \bar{\mathbf{x}}_n$$
where
$$\bar{\mathbf{x}}_n = \frac{1}{n} \sum_{j=1}^{n} x_j$$
A slightly more general notation, which gives some more explicit properties of the conjugate priors of the natural exponential family, is the following: if
$$f(x|\theta) = a(x) \exp\left[\theta^t x - b(\theta)\right]$$
then a conjugate prior family is
$$\pi(\theta|\alpha_0, \boldsymbol{\tau}_0) = g(\alpha_0, \boldsymbol{\tau}_0) \exp\left[\alpha_0\, \boldsymbol{\tau}_0^t\, \theta - \alpha_0\, b(\theta)\right]$$
The posterior is
$$\pi(\theta|\alpha_0, \boldsymbol{\tau}_0, x) = g(\alpha, \boldsymbol{\tau}) \exp\left[\alpha\, \boldsymbol{\tau}^t\, \theta - \alpha\, b(\theta)\right]$$
with
$$\alpha = \alpha_0 + n \quad \text{and} \quad \boldsymbol{\tau} = \frac{\alpha_0\, \boldsymbol{\tau}_0 + n\, \bar{\mathbf{x}}_n}{\alpha_0 + n}$$
and we have the following properties:
$$E[X|\theta] = \nabla b(\theta)$$
$$E[\nabla b(\Theta)|\alpha_0, \boldsymbol{\tau}_0] = \boldsymbol{\tau}_0$$
$$E[\nabla b(\Theta)|\alpha_0, \boldsymbol{\tau}_0, x] = \frac{n\, \bar{\mathbf{x}}_n + \alpha_0\, \boldsymbol{\tau}_0}{\alpha_0 + n} = \pi\, \bar{\mathbf{x}}_n + (1 - \pi)\, \boldsymbol{\tau}_0, \quad \text{with } \pi = \frac{n}{\alpha_0 + n}$$
Conjugate priors
Discrete variables:
• Binomial $\mathrm{Bin}(x|n,\theta)$; prior: Beta $\mathrm{Bet}(\theta|\alpha,\beta)$; posterior: Beta $\mathrm{Bet}(\theta|\alpha+x,\; \beta+n-x)$
• Negative Binomial $\mathrm{NegBin}(x|n,\theta)$; prior: Beta $\mathrm{Bet}(\theta|\alpha,\beta)$; posterior: Beta $\mathrm{Bet}(\theta|\alpha+n,\; \beta+x)$
• Multinomial $\mathrm{M}_k(x|\theta_1,\ldots,\theta_k)$; prior: Dirichlet $\mathrm{Di}_k(\theta|\alpha_1,\ldots,\alpha_k)$; posterior: Dirichlet $\mathrm{Di}_k(\theta|\alpha_1+x_1,\ldots,\alpha_k+x_k)$
• Poisson $\mathrm{Pn}(x|\theta)$; prior: Gamma $\mathrm{Gam}(\theta|\alpha,\beta)$; posterior: Gamma $\mathrm{Gam}(\theta|\alpha+x,\; \beta+1)$
Continuous variables:
• Gamma $\mathrm{Gam}(x|\nu,\theta)$; prior: Gamma $\mathrm{Gam}(\theta|\alpha,\beta)$; posterior: Gamma $\mathrm{Gam}(\theta|\alpha+\nu,\; \beta+x)$
• Beta $\mathrm{Bet}(x|\alpha,\theta)$; prior: Exponential $\mathrm{Ex}(\theta|\lambda)$; posterior: Exponential $\mathrm{Ex}(\theta|\lambda - \log(1-x))$
• Normal $\mathrm{N}(x|\theta,\sigma^2)$; prior: Normal $\mathrm{N}(\theta|\mu,\tau^2)$; posterior: Normal $\mathrm{N}\!\left(\theta \,\middle|\, \frac{\mu\sigma^2 + \tau^2 x}{\sigma^2 + \tau^2},\; \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}\right)$
• Normal $\mathrm{N}(x|\mu,1/\theta)$; prior: Gamma $\mathrm{Gam}(\theta|\alpha,\beta)$; posterior: Gamma $\mathrm{Gam}\!\left(\theta|\alpha+\frac{1}{2},\; \beta+\frac{1}{2}(\mu-x)^2\right)$
• Normal $\mathrm{N}(x|\theta,\theta^2)$; prior: Generalized inverse Normal $\mathrm{INg}(\theta|\alpha,\mu,\sigma) \propto |\theta|^{-\alpha}\exp\!\left[-\frac{1}{2\sigma^2}\left(\frac{1}{\theta}-\mu\right)^2\right]$; posterior: Generalized inverse Normal $\mathrm{INg}(\theta|\alpha_n,\mu_n,\sigma_n)$
Table 8.1: Relation between the sampling distributions $p(x|\theta)$, their associated conjugate priors $p(\theta|\tau)$ and their corresponding posteriors $p(\theta|x,\tau) \propto p(\theta|\tau)\, p(x|\theta)$
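Two rows of Table 8.1 written out numerically; all parameter values below are illustrative choices.

```python
# Beta-Binomial row: Bin(x|n, theta) with prior Bet(theta|alpha, beta)
alpha, beta_, n, x = 2.0, 3.0, 10, 7
alpha_post, beta_post = alpha + x, beta_ + n - x          # Bet(theta|alpha+x, beta+n-x)

# Normal-Normal row: N(x|theta, sigma2) with prior N(theta|mu, tau2)
mu, tau2, sigma2, xobs = 0.0, 4.0, 1.0, 2.5
mu_post = (mu * sigma2 + tau2 * xobs) / (sigma2 + tau2)   # posterior mean
var_post = sigma2 * tau2 / (sigma2 + tau2)                # posterior variance
```

In both cases the posterior stays in the prior's family, and only the hyperparameters are updated, which is exactly the point of conjugacy.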
8.3 Non informative priors based on Fisher information
Another notion of information, related to maximum likelihood estimation, is the Fisher information. In this section we first give some definitions and results related to this notion, and then we see how it is used to define non informative priors.
Proposition 5 [Information Inequality] Let $\hat{\theta}$ be an estimate of the parameter $\theta$ in a family $\{P_\theta;\ \theta \in \mathcal{T}\}$ and assume that the following conditions hold:
1. The family $\{P_\theta;\ \theta \in \mathcal{T}\}$ has a corresponding family of densities $\{p_\theta(x);\ \theta \in \mathcal{T}\}$, all with the same support.
2. $p_\theta(x)$ is differentiable for all $\theta \in \mathcal{T}$ and all $x$ in its support.
3. The integral
$$g(\theta) = \int_\Gamma h(x)\, p_\theta(x)\, \mu(\mathrm{d}x)$$
exists and is differentiable for $\theta \in \mathcal{T}$, for $h(x) = \hat{\theta}(x)$ and for $h(x) = 1$, and
$$\frac{\partial g(\theta)}{\partial \theta} = \int_\Gamma h(x)\, \frac{\partial p_\theta(x)}{\partial \theta}\, \mu(\mathrm{d}x)$$
Then
$$\mathrm{Var}_\theta[\hat{\theta}(X)] \ge \frac{\left[\frac{\partial}{\partial\theta}\, E_\theta\{\hat{\theta}(X)\}\right]^2}{I_\theta} \qquad (8.10)$$
where
$$I_\theta \stackrel{\text{def}}{=} E_\theta\left\{\left[\frac{\partial}{\partial\theta} \ln p_\theta(X)\right]^2\right\} \qquad (8.11)$$
Furthermore, if $\frac{\partial^2}{\partial\theta^2}\, p_\theta(x)$ exists for all $\theta \in \mathcal{T}$ and all $x$ in the support of $p_\theta(x)$, and if
$$\int \frac{\partial^2}{\partial\theta^2}\, p_\theta(x)\, \mu(\mathrm{d}x) = \frac{\partial^2}{\partial\theta^2} \int p_\theta(x)\, \mu(\mathrm{d}x),$$
then $I_\theta$ can be computed via
$$I_\theta = -E_\theta\left\{\frac{\partial^2}{\partial\theta^2} \ln p_\theta(X)\right\} \qquad (8.12)$$
The quantity defined in (8.11) is known as Fisher's information for estimating $\theta$ from $X$, and (8.10) is called the information inequality.
For the particular case in which $\hat{\theta}$ is unbiased, $E_\theta\{\hat{\theta}(X)\} = \theta$, the information inequality becomes
$$\mathrm{Var}_\theta[\hat{\theta}(X)] \ge \frac{1}{I_\theta} \qquad (8.13)$$
The expression $\frac{1}{I_\theta}$ is known as the Cramer-Rao lower bound (CRLB).
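The CRLB can be checked by Monte Carlo. For $n$ i.i.d. samples $X_i \sim \mathcal{N}(\theta, 1)$ the Fisher information of the sample is $I_\theta = n$, and the sample mean is unbiased and attains the bound $1/n$. The values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, trials = 1.5, 20, 5000

# Empirical variance of the sample mean over many repeated experiments
est = np.array([rng.normal(theta, 1.0, n).mean() for _ in range(trials)])
var_hat = est.var()
# CRLB for an unbiased estimator of theta is 1 / I_theta = 1 / n
```

The empirical variance of the estimator matches $1/n$ up to Monte Carlo error, confirming that the sample mean is efficient here.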
Example 20 [The information inequality for exponential families] Assume that $\mathcal{T}$ is open and $p_\theta$ is given by
$$p_\theta(x) = a(x)\, g(\theta) \exp[\phi(\theta)\, h(x)]$$
Then it can be shown that
$$I_\theta \stackrel{\text{def}}{=} E_\theta\left\{\left[\frac{\partial}{\partial\theta} \ln p_\theta(X)\right]^2\right\} = \left|\phi'(\theta)\right|^2\, \mathrm{Var}_\theta(h(X)) \qquad (8.14)$$
and
$$\frac{\partial}{\partial\theta}\, E_\theta\{h(X)\} = \phi'(\theta)\, \mathrm{Var}_\theta(h(X)) \qquad (8.15)$$
and thus, if we choose $\hat{\theta}(x) = h(x)$, we attain the lower bound in the information inequality (8.10):
$$\mathrm{Var}_\theta[\hat{\theta}(X)] = \frac{\left[\frac{\partial}{\partial\theta}\, E_\theta\{\hat{\theta}(X)\}\right]^2}{I_\theta} \qquad (8.16)$$
Definition 10 [Non informative priors] Assume $X \sim f(x|\theta) = p_\theta(x)$ and assume that
$$I_\theta \stackrel{\text{def}}{=} E_\theta\left\{\left[\frac{\partial}{\partial\theta} \ln p_\theta(X)\right]^2\right\} = -E_\theta\left\{\frac{\partial^2}{\partial\theta^2} \ln p_\theta(X)\right\} \qquad (8.17)$$
Then, a non informative prior $\pi(\theta)$ is defined as
$$\pi(\theta) \propto I_\theta^{1/2} \qquad (8.18)$$
Definition 11 [Non informative priors, case of vector parameters] Assume $X \sim f(x|\boldsymbol{\theta}) = p_{\boldsymbol{\theta}}(x)$ and assume that
$$I_{ij}(\boldsymbol{\theta}) \stackrel{\text{def}}{=} -E_{\boldsymbol{\theta}}\left\{\frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \ln p_{\boldsymbol{\theta}}(X)\right\} \qquad (8.19)$$
Then, a non informative prior $\pi(\boldsymbol{\theta})$ is defined as
$$\pi(\boldsymbol{\theta}) \propto |I(\boldsymbol{\theta})|^{1/2} \qquad (8.20)$$
where $I(\boldsymbol{\theta})$ is the Fisher information matrix with elements $I_{ij}(\boldsymbol{\theta})$.
Example 21 If
$$f(x|\theta) = a(x) \exp\left[\theta^t x - b(\theta)\right]$$
then
$$I(\theta) = \nabla\nabla^t\, b(\theta)$$
and
$$\pi(\theta) \propto |I(\theta)|^{1/2} = \left|\nabla\nabla^t\, b(\theta)\right|^{1/2}$$
Example 22 If
$$f(x|\theta) = \mathcal{N}(\mu, \sigma^2), \qquad \theta = (\mu, \sigma),$$
then
$$I(\theta) = E_\theta\left\{\begin{pmatrix} \dfrac{1}{\sigma^2} & \dfrac{2(X-\mu)}{\sigma^3} \\[8pt] \dfrac{2(X-\mu)}{\sigma^3} & \dfrac{3(X-\mu)^2}{\sigma^4} - \dfrac{1}{\sigma^2} \end{pmatrix}\right\} = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\[6pt] 0 & \dfrac{2}{\sigma^2} \end{pmatrix}$$
and
$$\pi(\theta) = \pi(\mu, \sigma) \propto |I(\theta)|^{1/2} \propto \frac{1}{\sigma^2}$$
Chapter 9
Linear Estimation
In previous chapters we saw that the optimum estimate, in the MMSE sense, of an unknown signal $X_t$, given the observations $Y_{a:b} = \{Y_a, \ldots, Y_b\}$ of a related quantity, is given by $\hat{X}_t = E[X_t|Y_{a:b}]$. This estimate is not, in general, a linear function of the data, and its computation needs the knowledge of the joint distribution of $\{X_t, Y_{a:b}\}$. Only when this joint distribution is Gaussian and when $X_t$ is related to $Y_{a:b}$ by a linear relation is this optimal estimate a linear function of the data. Even in this case, its computation needs the inversion of the covariance matrix of the data $\Sigma_Y$, whose dimensions increase with the number of data.
One way to circumvent these drawbacks is, from the first step, to constrain the estimate to be a linear function of the data. Doing so, as we will see below, we no longer need the joint distribution of $\{X_t, Y_{a:b}\}$ but only its second order statistics. Furthermore, we will see that, in this case, we can develop real-time or on-line algorithms with lower complexity and lower cost, if we assume the data to be stationary.
9.1 Introduction
Assume that we want to obtain an estimate $\hat{X}_t$ of a quantity $X_t$ which is a linear (or more precisely an affine) function of the data $Y_{a:b} = \{Y_a, \ldots, Y_b\}$, i.e.
$$\hat{X}_t = \sum_{n=a}^{b} h_{t,n}\, Y_n + c_t \qquad (9.1)$$
where, in general, $a$ can be either $-\infty$ or finite and $b$ can also be either finite or $\infty$. When $a$ and $b$ are finite, the meaning of the summation is clear. For the cases where $a = -\infty$ or $b = \infty$, these and all the following summations have to be understood in the MMSE sense; for example, for the case $a = -\infty$,
$$\lim_{m \to -\infty} E\left[\left(\sum_{n=m}^{b} h_{t,n}\, Y_n + c_t - \hat{X}_t\right)^2\right] = 0 \qquad (9.2)$$
In these cases we also need to assume that
$$E\left[X_n^2\right] < \infty \quad \text{and} \quad E\left[Y_n^2\right] < \infty.$$
The following propositions summarize all we need to develop linear estimation theory.
Proposition 1 Assume $\hat{X}_t \in \mathcal{H}^b_a$, where $\mathcal{H}^b_a$ is the Hilbert space generated by the affine transforms (9.1). Then $E[\hat{X}_t^2] < \infty$, and if $Z$ is a random variable satisfying $E[Z^2] < \infty$, then
$$E\left[Z\, \hat{X}_t\right] = \sum_{n=a}^{b} h_{t,n}\, E[Z\, Y_n] + c_t\, E[Z]$$
Proposition 2 (Orthogonality principle) $\hat{X}_t \in \mathcal{H}^b_a$ solves
$$\min_{\hat{X}_t \in \mathcal{H}^b_a} E\left[(\hat{X}_t - X_t)^2\right] \qquad (9.3)$$
if and only if
$$E\left[(\hat{X}_t - X_t)\, Z\right] = 0 \quad \forall Z \in \mathcal{H}^b_a. \qquad (9.4)$$
In other words, $\hat{X}_t$ is a MMSE linear estimate of $X_t$ given $Y_{a:b}$ if and only if the estimation error $(\hat{X}_t - X_t)$ is orthogonal to every linear function of the observations $Y_{a:b}$.
Considering the particular cases of Z = 1 and Z = Yl, a ≤ l ≤ b we can rewrite thisproposition in the following way
Proposition 3 Xt ∈ Hba solves (9.3) if and only if
E[Xt
]= E [Xt] (9.5)
andE[(Xt −Xt)Yl
]= 0 ∀a ≤ l ≤ b. (9.6)
Now, replacing (9.1) in (9.6), we obtain
\[ E\left[(X_t - \hat X_t)Y_l\right] = E\left[\left(X_t - \sum_{n=a}^{b} h_{t,n} Y_n - c_t\right)Y_l\right] = 0 \quad \forall a \le l \le b. \qquad (9.7) \]
To go further more easily and without any loss of generality, we assume $E[Y_l] = 0$, $\forall a \le l \le b$. Then, since $E[X_t] = c_t$, the previous equation becomes
\[ \mathrm{Cov}\{X_t, Y_l\} = \sum_{n=a}^{b} h_{t,n}\,\mathrm{Cov}\{Y_n, Y_l\} \quad \forall a \le l \le b, \qquad (9.8) \]
which is known as the Wiener-Hopf equation. Writing this in matrix form we have
\[ \sigma_{XY}(t) = \Sigma_Y\, h_t \qquad (9.9) \]
where
\[ \sigma_{XY}(t) \stackrel{\mathrm{def}}{=} \left[\mathrm{Cov}\{X_t,Y_a\},\dots,\mathrm{Cov}\{X_t,Y_b\}\right]^t, \qquad \Sigma_Y \stackrel{\mathrm{def}}{=} \left[\mathrm{Cov}\{Y_n,Y_l\}\right], \qquad h_t \stackrel{\mathrm{def}}{=} [h_{t,a},\dots,h_{t,b}]^t. \]
So, theoretically we have
\[ h_t = \Sigma_Y^{-1}\,\sigma_{XY}(t). \qquad (9.10) \]
The main difficulty, however, is the computation of $\Sigma_Y^{-1}$. Note that this matrix is symmetric and positive definite, so, theoretically, it is not singular. However, its inversion cost grows as the cube of the number of data. In the following we will see how the stationarity assumption helps to reduce this cost.
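As a minimal numerical sketch of (9.9)-(9.10), the Wiener-Hopf system can be solved directly with a general linear solver; the second-order statistics below are hypothetical, not from the text.

```python
import numpy as np

# Hypothetical second-order statistics for 4 zero-mean observations Y_a..Y_b
Sigma_Y = np.array([[1.0, 0.5, 0.2, 0.1],
                    [0.5, 1.0, 0.5, 0.2],
                    [0.2, 0.5, 1.0, 0.5],
                    [0.1, 0.2, 0.5, 1.0]])   # Cov{Y_n, Y_l}, symmetric p.d.
sigma_XY = np.array([0.4, 0.6, 0.8, 0.9])    # Cov{X_t, Y_n}

# Solve the Wiener-Hopf system Sigma_Y h_t = sigma_XY(t), eq. (9.10)
h_t = np.linalg.solve(Sigma_Y, sigma_XY)

# The estimate is then Xhat_t = sum_n h_{t,n} Y_n (c_t = 0 for zero-mean data)
print(h_t)
```

For a general (non-structured) covariance matrix this solve costs $O(n^3)$, which motivates the stationarity assumption discussed next.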
9.2 One step prediction
Consider the case where $a=0$, $b=t$ and $X_t = Y_{t+1}$, and assume that $Y_l$ is wide sense stationary, i.e., $E[Y_l]=0$ and $\mathrm{Cov}\{Y_l,Y_m\} = C_Y(l-m)$. Then we have
\[ \mathrm{Cov}\{X_t,Y_l\} = \mathrm{Cov}\{Y_{t+1},Y_l\} = C_Y(t+1-l) \qquad (9.11) \]
and the Wiener-Hopf equation becomes
\[
\begin{bmatrix} C_Y(t+1) \\ C_Y(t) \\ \vdots \\ C_Y(1) \end{bmatrix}
=
\begin{bmatrix}
C_Y(0) & C_Y(1) & \cdots & C_Y(t) \\
C_Y(1) & \ddots & & \vdots \\
\vdots & & \ddots & C_Y(1) \\
C_Y(t) & \cdots & C_Y(1) & C_Y(0)
\end{bmatrix}
\begin{bmatrix} h_{t,0} \\ h_{t,1} \\ \vdots \\ h_{t,t} \end{bmatrix}
\qquad (9.12)
\]
called the Yule-Walker equation.

Note that the w.s.s. hypothesis on the data $Y_n$ leads to a covariance matrix which is Toeplitz. Unlike the general case, the cost of the inversion of this matrix is only $O(n^2)$, against $O(n^3)$ for the general case, where $n$ is the number of data. Thus, in any linear MMSE estimation problem, the w.s.s. assumption can reduce the complexity of the computation of the coefficients by a factor equal to the number of data.
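This Toeplitz structure can be exploited directly in software. Assuming SciPy is available, `scipy.linalg.solve_toeplitz` solves systems like (9.12) with a Levinson-type $O(n^2)$ recursion; the autocovariance below is a hypothetical AR(1)-type example, for which the one-step predictor should reduce to $\hat Y_{t+1} = 0.8\, Y_t$.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

# Hypothetical autocovariance of a w.s.s. sequence: C_Y(k) = 0.8**|k|
t = 5                                   # predict Y_{t+1} from Y_0..Y_t
C = 0.8 ** np.arange(t + 2)             # C_Y(0), ..., C_Y(t+1)

# Yule-Walker system (9.12): Toeplitz(C_Y(0..t)) h = [C_Y(t+1),...,C_Y(1)]^t
rhs = C[t + 1:0:-1]                     # C_Y(t+1), C_Y(t), ..., C_Y(1)
h = solve_toeplitz(C[:t + 1], rhs)      # O(n^2) Levinson-type solver

print(h)                                # ~ [0, 0, 0, 0, 0, 0.8]
```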
In the following, we will see that we can go still further and use the specific structure of the Yule-Walker equation to reduce this cost even more.
9.3 Levinson algorithm
The Levinson algorithm uses the special structure of the Yule-Walker equation for the one-step prediction problem, where the left-hand-side vector of this equation is equal to the last column of the covariance matrix shifted by one time unit.
Rewriting this equation as
\[ \hat Y_{t+1} = \sum_{n=0}^{t} h_{t,n} Y_n = -\sum_{n=0}^{t} a_{t+1,t+1-n} Y_n, \qquad (9.13) \]
the coefficients $a_{t,1},\dots,a_{t,t}$ can be updated recursively in $t$ through the Levinson algorithm:
\[ a_{t+1,k} = a_{t,k} - k_t\, a_{t,t+1-k}, \quad k = 1,\dots,t \]
\[ a_{t+1,t+1} = -k_t \]
where $k_t$ itself is generated recursively, with $\epsilon_t \stackrel{\mathrm{def}}{=} E[(Y_t - \hat Y_t)^2]$, via
\[ k_t = \frac{1}{\epsilon_t}\left[ C_Y(t+1) + \sum_{k=1}^{t} a_{t,k}\, C_Y(t+1-k) \right] \]
\[ \epsilon_{t+1} = (1 - k_t^2)\,\epsilon_t \]
with the initialization $k_0 = C_Y(1)/C_Y(0)$ and $\epsilon_0 = C_Y(0)$.
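The recursion above can be transcribed directly. The sketch below follows the text's sign conventions and checks the result on a hypothetical AR(1) autocovariance $C_Y(k) = r^{|k|}$, for which the predictor must reduce to $a_{t,1} = -r$ with all higher coefficients zero.

```python
import numpy as np

def levinson(C, order):
    """Levinson recursion for the one-step predictor coefficients a_{t,1..t}.

    C[k] = C_Y(k) for k = 0..order.  Returns (a, eps) where the predictor is
    Yhat_t = -sum_k a[k-1] * Y_{t-k} and eps is the prediction MSE.
    """
    a = np.zeros(order)
    k0 = C[1] / C[0]                 # initialization k_0 = C_Y(1)/C_Y(0)
    a[0] = -k0
    eps = (1.0 - k0 ** 2) * C[0]     # eps_1 = (1 - k_0^2) eps_0
    for t in range(1, order):
        # reflection coefficient k_t = [C_Y(t+1) + sum a_{t,k} C_Y(t+1-k)]/eps_t
        kt = (C[t + 1] + np.dot(a[:t], C[t:0:-1])) / eps
        a_new = a[:t] - kt * a[t - 1::-1]   # a_{t+1,k} = a_{t,k} - k_t a_{t,t+1-k}
        a[:t] = a_new
        a[t] = -kt                          # a_{t+1,t+1} = -k_t
        eps *= (1.0 - kt ** 2)
    return a, eps

# Hypothetical AR(1) check: C_Y(k) = 0.8**|k| gives a = [-0.8, 0, ..., 0]
C = 0.8 ** np.arange(6)
a, eps = levinson(C, 5)
print(a, eps)
```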
The coefficients $k_t$ are called reflection coefficients, or partial correlation (PARCOR) coefficients.
9.4 Vector observation case
Linear estimation can be extended to the case where both the observation sequence and the quantity to be estimated are vectors. This extension is straightforward and we have:
\[ \hat{\mathbf X}_t = \sum_{n=a}^{b} \mathbf H_{t,n}\,\mathbf Y_n + \mathbf c_t \qquad (9.14) \]
where $\mathbf H_{t,n}$ is a sequence of matrices. When $a$ or $b$ are infinite, the summations are understood in the MSE sense. For example, when $a=-\infty$, we have
\[ \lim_{m\to-\infty} E\left[ \left\| \sum_{n=m}^{b} \mathbf H_{t,n}\mathbf Y_n + \mathbf c_t - \mathbf X_t \right\|^2 \right] = 0 \qquad (9.15) \]
where $\|\mathbf x\|^2 \stackrel{\mathrm{def}}{=} \mathbf x^t \mathbf x$.
The orthogonality principle becomes:

Proposition 4 (Orthogonality principle) $\hat{\mathbf X}_t \in H_a^b$ solves
\[ \min_{\hat{\mathbf X}_t \in H_a^b} E\left[ \|\mathbf X_t - \hat{\mathbf X}_t\|^2 \right] \qquad (9.16) \]
if and only if
\[ E\left[ (\mathbf X_t - \hat{\mathbf X}_t)^t Z \right] = 0 \quad \forall Z \in H_a^b. \qquad (9.17) \]
Writing this last equation for $Z = 1$ and for $Z = \mathbf Y_l^t$ we obtain
\[ E[\mathbf X_t] = E[\hat{\mathbf X}_t], \qquad E\left[(\mathbf X_t - \hat{\mathbf X}_t)\mathbf Y_l^t\right] = [0], \quad \forall a \le l \le b. \]
Using these relations we obtain
\[ E\left[(\mathbf X_t - \hat{\mathbf X}_t)\mathbf Y_l^t\right] = E\left[ \left( \mathbf X_t - \sum_{n=a}^{b} \mathbf H_{t,n}\mathbf Y_n - \mathbf c_t \right) \mathbf Y_l^t \right] = [0], \quad \forall a \le l \le b, \qquad (9.18) \]
where $[0]$ means a matrix all of whose elements are equal to zero. The Wiener-Hopf equation becomes:
\[ \mathbf C_{XY}(t,l) = \sum_{n=a}^{b} \mathbf H_{t,n}\,\mathbf C_Y(n,l), \quad \forall a \le l \le b, \]
where $\mathbf C_{XY}(t,l) \stackrel{\mathrm{def}}{=} \mathrm{Cov}\{\mathbf X_t, \mathbf Y_l\}$ is the cross-covariance of $\{\mathbf X_n\}_{n=-\infty}^{\infty}$ and $\{\mathbf Y_l\}_{l=-\infty}^{\infty}$, and $\mathbf C_Y(n,l) \stackrel{\mathrm{def}}{=} \mathrm{Cov}\{\mathbf Y_n, \mathbf Y_l\}$ is the auto-covariance of $\{\mathbf Y_n\}_{n=-\infty}^{\infty}$. Note that $\mathbf C_{XY}(t,l)$ and $\mathbf C_Y(n,l)$ are $(m \times k)$ and $(k \times k)$ matrices respectively, where $k$ and $m$ are respectively the dimensions of the vectors $\mathbf Y_n$ and $\mathbf X_t$.
9.5 Wiener-Kolmogorov filtering
We assume here that $Y_n$ is wide sense stationary (w.s.s.) and that there is an infinite number of observations.

Two cases are of interest: the non-causal case, where $(a = -\infty,\, b = \infty)$, and the causal case, where $(a = -\infty,\, b = t)$.
9.5.1 Non causal Wiener-Kolmogorov
Without losing any generality, we assume $E[Y_n] = E[X_n] = 0$. Then we have
\[ \hat X_t = \sum_{n=-\infty}^{\infty} h_{t,n} Y_n. \qquad (9.19) \]
The Wiener-Hopf equation becomes
\[ C_{XY}(t-l) = \sum_{n=-\infty}^{\infty} h_{t,n}\, C_Y(n-l). \qquad (9.20) \]
Changing variable $\tau = t-l$ we obtain
\[ C_{XY}(\tau) = \sum_{n=-\infty}^{\infty} h_{t,n}\, C_Y(n+\tau-t). \qquad (9.21) \]
Changing now the summation variable $n = t-\alpha$ we obtain
\[ C_{XY}(\tau) = \sum_{\alpha=-\infty}^{\infty} h_{t,t-\alpha}\, C_Y(\tau-\alpha). \qquad (9.22) \]
In this summation $t$ appears only in $h_{t,t-\alpha}$. This means that we can choose it to be independent of $t$; i.e., if this equation has a solution, it can be chosen to be time-invariant, with coefficients $h_{t,t-\alpha} = h_{0,-\alpha} \stackrel{\mathrm{def}}{=} h_\alpha$. Then we have
\[ C_{XY}(\tau) = \sum_{\alpha=-\infty}^{\infty} h_\alpha\, C_Y(\tau-\alpha), \qquad (9.23) \]
which is a convolution equation. Using then the following discrete-time Fourier transforms
\[ H(\omega) \stackrel{\mathrm{def}}{=} \sum_{n=-\infty}^{\infty} h_n\, e^{-j\omega n}, \quad -\pi < \omega < \pi \]
\[ S_{XY}(\omega) \stackrel{\mathrm{def}}{=} \sum_{n=-\infty}^{\infty} C_{XY}(n)\, e^{-j\omega n}, \quad -\pi < \omega < \pi \]
\[ S_Y(\omega) \stackrel{\mathrm{def}}{=} \sum_{n=-\infty}^{\infty} C_Y(n)\, e^{-j\omega n}, \quad -\pi < \omega < \pi \]
we have
\[ S_{XY}(\omega) = H(\omega)\, S_Y(\omega) \]
and finally
\[ H(\omega) = \frac{S_{XY}(\omega)}{S_Y(\omega)}. \qquad (9.24) \]
The coefficients $h_n$ can then be obtained by inverse Fourier transform
\[ h_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{S_{XY}(\omega)}{S_Y(\omega)}\, e^{j\omega n}\, d\omega, \quad n \in \mathbb Z. \qquad (9.25) \]
It is interesting to see that we have
\[ E[X_t \hat X_t] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{|S_{XY}(\omega)|^2}{S_Y(\omega)}\, d\omega \]
\[ E[X_t^2] = C_X(0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_X(\omega)\, d\omega \]
\[ E\left[(X_t - \hat X_t)^2\right] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left[ S_X(\omega) - \frac{|S_{XY}(\omega)|^2}{S_Y(\omega)} \right] d\omega. \]
This last equation can be written
\[ \mathrm{MMSE} = E\left[(X_t - \hat X_t)^2\right] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left[ 1 - \frac{|S_{XY}(\omega)|^2}{S_X(\omega)S_Y(\omega)} \right] S_X(\omega)\, d\omega. \qquad (9.26) \]
Noting that $|S_{XY}(\omega)|^2 \le S_X(\omega)S_Y(\omega)$, with equality if $\{X_t\}_{t=-\infty}^{\infty}$ and $\{Y_n\}_{n=-\infty}^{\infty}$ are perfectly correlated, we can conclude that the MMSE ranges from $E[X_t^2]$ to zero as the relationship between $\{X_t\}_{t=-\infty}^{\infty}$ and $\{Y_n\}_{n=-\infty}^{\infty}$ ranges from independence to perfect correlation.
Example 1 (Noise filtering) Consider the model
\[ Y_n = S_n + N_n, \quad n \in \mathbb Z, \]
where $S_n$ and $N_n$ are assumed uncorrelated, zero mean and w.s.s. Suppose that we want to estimate $X_t = S_{t+\lambda}$ for some integer $\lambda$. The problem represents filtering, prediction and smoothing respectively when $\lambda = 0$, $\lambda > 0$ and $\lambda < 0$. To obtain the necessary equations, it is straightforward to show
\[ S_Y(\omega) = S_X(\omega) + S_N(\omega), \qquad S_{XY}(\omega) = e^{j\omega\lambda}\, S_X(\omega), \qquad S_X(\omega) = S_S(\omega). \]
So, the transfer function of the optimum non-causal filter is
\[ H(\omega) = \frac{e^{j\omega\lambda}\, S_S(\omega)}{S_S(\omega) + S_N(\omega)}. \]
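Numerically, (9.25) can be approximated on a frequency grid by an inverse FFT. A minimal sketch for the noise-filtering case $\lambda = 0$, with a hypothetical AR(1)-type signal spectrum and white noise (the parameter values are assumptions, not from the text):

```python
import numpy as np

N = 1024                                   # frequency grid size
w = 2 * np.pi * np.fft.fftfreq(N)          # grid of frequencies in (-pi, pi)

r, sig2, sigN2 = 0.8, 1.0, 0.5             # hypothetical model parameters
S_S = sig2 * (1 - r**2) / (1 - 2*r*np.cos(w) + r**2)   # signal spectrum
S_N = sigN2 * np.ones(N)                   # white-noise spectrum

H = S_S / (S_S + S_N)                      # optimum non-causal filter, lambda = 0
h = np.real(np.fft.ifft(H))                # h_n ~ (1/2pi) int H(w) e^{jwn} dw

# The gain lies in (0, 1): frequencies where noise dominates are attenuated
print(h[:3])                               # h_0, h_1, h_2
```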
Example 2 (Deconvolution) Consider the model
\[ Y_n = \sum_{k=0}^{p} h_k S_{n-k} + N_n, \quad n \in \mathbb Z, \]
where $S_n$ and $N_n$ are assumed uncorrelated, zero mean and w.s.s. Suppose that we want to estimate $X_t = S_{t+\lambda}$ for some integer $\lambda$. Again, it is straightforward to show
\[ S_Y(\omega) = |H(\omega)|^2 S_X(\omega) + S_N(\omega), \qquad S_{XY}(\omega) = e^{j\omega\lambda}\, H^*(\omega)\, S_X(\omega), \qquad S_X(\omega) = S_S(\omega), \]
where the channel transfer function is
\[ H(\omega) = \sum_{n=0}^{p} h_n\, e^{-jn\omega}, \quad -\pi < \omega < \pi. \]
So, the transfer function $G(\omega)$ of the optimum non-causal filter is
\[ G(\omega) = \frac{e^{j\omega\lambda}\, H^*(\omega)\, S_S(\omega)}{|H(\omega)|^2 S_S(\omega) + S_N(\omega)}. \]
For $\lambda = 0$ we have
\[ G(\omega) = \frac{H^*(\omega)\, S_S(\omega)}{|H(\omega)|^2 S_S(\omega) + S_N(\omega)} = \frac{1}{H(\omega)} \cdot \frac{|H(\omega)|^2}{|H(\omega)|^2 + \dfrac{S_N(\omega)}{S_S(\omega)}}. \]
9.6 Causal Wiener-Kolmogorov
To develop causal Wiener-Kolmogorov filtering, let us denote by $\tilde X_t$ the non-causal Wiener-Kolmogorov solution and by $\hat X_t$ the causal one, i.e.
\[ \tilde X_t = \sum_{n=-\infty}^{\infty} h_{t-n} Y_n, \qquad \hat X_t = \sum_{n=-\infty}^{t} g_{t-n} Y_n, \]
where $g_n$ denotes the coefficients of the causal filter.
Note also that $H_{-\infty}^{t}$ is a subset of $H_{-\infty}^{\infty}$. So, if the solution to the non-causal Wiener-Kolmogorov problem happens to be causal, it also solves the causal Wiener-Kolmogorov problem. Unfortunately, this happens only in very special cases. However, there is a relation between these two solutions.

To obtain this relation we start by writing
\[ (X_t - \hat X_t) = (X_t - \tilde X_t) + (\tilde X_t - \hat X_t). \]
So, for any $Z \in H_{-\infty}^{t}$ we have
\[ E\left[(X_t - \hat X_t)Z\right] = E\left[(X_t - \tilde X_t)Z\right] + E\left[(\tilde X_t - \hat X_t)Z\right]. \]
The first right-hand term $E[(X_t - \tilde X_t)Z]$ is zero due to the orthogonality principle applied to $\tilde X_t$ (since $Z \in H_{-\infty}^{t} \subset H_{-\infty}^{\infty}$). The second right-hand term $E[(\tilde X_t - \hat X_t)Z]$ is zero due to the orthogonality principle applied to $\hat X_t$, taken as the projection of $\tilde X_t$ onto $H_{-\infty}^{t}$. So we have
\[ E\left[(X_t - \hat X_t)Z\right] = 0, \quad \forall Z \in H_{-\infty}^{t}, \]
which means that $\hat X_t$ is the MMSE estimate of $X_t$ among all estimates in $H_{-\infty}^{t}$. In other words, $\hat X_t$, which is the projection of $X_t$ onto $H_{-\infty}^{t}$, can be obtained by first projecting $X_t$ onto $H_{-\infty}^{\infty}$ to get $\tilde X_t$, and then projecting $\tilde X_t$ onto $H_{-\infty}^{t}$.
Now, let us define the simple truncation
\[ \check X_t \stackrel{\mathrm{def}}{=} \sum_{n=-\infty}^{t} h_{t-n} Y_n \]
and consider the error
\[ \tilde X_t - \check X_t = \sum_{n=-\infty}^{\infty} h_{t-n} Y_n - \sum_{n=-\infty}^{t} h_{t-n} Y_n = \sum_{n=t+1}^{\infty} h_{t-n} Y_n. \]
If this error were orthogonal to $Y_m$ for all $m \le t$, we could consider $\check X_t$ as the projection of $\tilde X_t$ onto $H_{-\infty}^{t}$. But this is not the case in general. It would be the case if $\{Y_n\}_{n=-\infty}^{\infty}$ were a sequence of uncorrelated random variables, because then we would have
\[ E\left[(\tilde X_t - \check X_t)Y_m\right] = E\left[ \left( \sum_{n=t+1}^{\infty} h_{t-n} Y_n \right) Y_m \right] = \sum_{n=t+1}^{\infty} h_{t-n}\, E[Y_n Y_m] = \sigma^2 \sum_{n=t+1}^{\infty} h_{t-n}\,\delta_{n,m} = 0, \quad \forall\, m \le t, \]
where $\delta_{n,m}$ is the Kronecker delta ($\delta_{n,m} = 1$ if $n = m$ and $\delta_{n,m} = 0$ if $n \ne m$) and $\sigma^2 = E[Y_n^2]$.
From the above discussion we see that, if we first transform the data $\{Y_n\}_{n=-\infty}^{\infty}$ into an equivalent w.s.s. and uncorrelated sequence $\{Z_n\}_{n=-\infty}^{\infty}$, then the causal Wiener-Kolmogorov filter coefficients can be obtained by simple truncation of the non-causal Wiener-Kolmogorov filter, i.e.
\[ \hat X_t = \sum_{n=-\infty}^{t} h_{t-n} Z_n, \]
where
\[ h_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{XZ}(\omega)\, e^{jn\omega}\, d\omega = C_{XZ}(n), \quad n \ge 0, \]
and $S_{XZ}(\omega)$ and $C_{XZ}(n)$ are, respectively, the cross-spectrum and the cross-covariance of the sequences $\{X_n\}_{n=-\infty}^{\infty}$ and $\{Z_n\}_{n=-\infty}^{\infty}$ (note that $S_Z(\omega) = 1$, so the non-causal filter for estimating $X_t$ from $Z_n$ is just $S_{XZ}(\omega)$).
The aim now is to obtain $\{Z_n\}_{n=-\infty}^{\infty}$ from $\{Y_n\}_{n=-\infty}^{\infty}$. The analysis of the one-step prediction problem in the previous section gives us the solution. If we denote by $\hat Y_{t|t-1}$ the one-step prediction of $Y_t$ from the data $\{Y_n\}_{n=-\infty}^{t-1}$, and by $\sigma_t^2 = E[(Y_t - \hat Y_{t|t-1})^2]$, then, defining
\[ Z_n = \frac{Y_n - \hat Y_{n|n-1}}{\sigma_n}, \quad n \in \mathbb Z, \]
we can verify that $E[Z_n] = 0$, $E[Z_n^2] = 1$ and $\mathrm{Cov}\{Z_n, Z_m\} = 0$ for $n \ne m$. So $Z_n$ has all the properties that we need. It remains to show that $\{Z_n\}_{n=-\infty}^{\infty}$ is equivalent to $\{Y_n\}_{n=-\infty}^{\infty}$ for the purpose of MMSE estimation. To do this we need the result of the following theorem.
Theorem 1 (Spectral Factorization) Assume $Y_n$ has a power spectral density satisfying the Paley-Wiener condition
\[ \frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S_Y(\omega)\, d\omega > -\infty. \]
Then
\[ S_Y(\omega) = S_Y^-(\omega)\, S_Y^+(\omega), \quad -\pi < \omega < \pi, \qquad (9.27) \]
where
\[ S_Y(\omega) = |S_Y^-(\omega)|^2 = |S_Y^+(\omega)|^2 \qquad (9.28) \]
and
\[ \frac{1}{2\pi}\int_{-\pi}^{\pi} S_Y^+(\omega)\, e^{jn\omega}\, d\omega = 0, \quad n < 0, \qquad \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{S_Y^+(\omega)}\, e^{jn\omega}\, d\omega = 0, \quad n < 0, \]
\[ \frac{1}{2\pi}\int_{-\pi}^{\pi} S_Y^-(\omega)\, e^{jn\omega}\, d\omega = 0, \quad n > 0, \qquad \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{S_Y^-(\omega)}\, e^{jn\omega}\, d\omega = 0, \quad n > 0. \]
That is, $S_Y^+$ and $1/S_Y^+$ are causal, while $S_Y^-$ and $1/S_Y^-$ are anticausal.
Now consider the time-invariant filter with transfer function $H(\omega) = 1/S_Y^+(\omega)$. This filter, by the theorem above, is causal. The output of this filter, with the wide sense stationary input sequence $Y_n$, will be another wide sense stationary sequence $Z_n$ with
\[ S_Z(\omega) = \left| \frac{1}{S_Y^+(\omega)} \right|^2 S_Y(\omega) = 1, \quad -\pi < \omega < \pi. \qquad (9.29) \]
Since $S_Z(\omega) = 1$ corresponds to a white sequence, the filter is called a whitening filter. Note also that the input sequence $Y_n$ can be recovered causally from the output $Z_n$ by
\[ Y_t = \sum_{n=-\infty}^{t} f_{t-n} Z_n, \quad t \in \mathbb Z, \qquad (9.30) \]
where
\[ f_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_Y^+(\omega)\, e^{jn\omega}\, d\omega, \quad n \ge 0. \]
Now consider the $\lambda$-step prediction of $Y_n$ from $Z_n$:
\[ Y_{t+\lambda} = \sum_{n=-\infty}^{t+\lambda} f_{t+\lambda-n} Z_n = \sum_{n=-\infty}^{t} f_{t+\lambda-n} Z_n + \sum_{n=t+1}^{t+\lambda} f_{t+\lambda-n} Z_n. \]
But $Z_n$ is white, so $Z_{t+1},\dots,Z_{t+\lambda}$ are orthogonal to $\{Z_n\}_{n=-\infty}^{t}$. So the best estimate $\hat Y_{t+\lambda}$ of $Y_{t+\lambda}$ from $\{Y_n\}_{n=-\infty}^{t}$ is
\[ \hat Y_{t+\lambda} = \sum_{n=-\infty}^{t} f_{t+\lambda-n} Z_n. \]
Combining the whitening filter and this prediction we obtain the cascade
\[ \cdots, Y_{t-2}, Y_{t-1}, Y_t \;\longrightarrow\; \frac{1}{S_Y^+(\omega)} \;\longrightarrow\; Z_n \;\longrightarrow\; \left[ e^{j\omega\lambda}\, S_Y^+(\omega) \right]_+ \;\longrightarrow\; \hat Y_{t+\lambda} \]
where the operator $[H(\omega)]_+$ is defined as
\[ [H(\omega)]_+ = \sum_{n=0}^{\infty} h_n\, e^{-j\omega n}, \quad \text{where}\quad h_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} H(\omega)\, e^{jn\omega}\, d\omega. \]
We have thus obtained a procedure for causally prewhitening a stationary sequence of observations, and the solution of the general causal Wiener-Kolmogorov problem follows immediately. Assuming that $\{Y_n\}_{n=-\infty}^{\infty}$ satisfies the Paley-Wiener condition, it can be transformed equivalently into $\{Z_n\}_{n=-\infty}^{t}$ by passing it through the causal filter $1/S_Y^+(\omega)$:
\[ Y_n \;\longrightarrow\; \frac{1}{S_Y^+(\omega)} \;\longrightarrow\; Z_n \]
Then we need to find the cross-spectrum $S_{XZ}(\omega)$ and pass $\{Z_n\}_{n=-\infty}^{t}$ through the causal filter $[S_{XZ}(\omega)]_+$ to obtain the required result:
\[ Z_n \;\longrightarrow\; [S_{XZ}(\omega)]_+ \;\longrightarrow\; \hat X_t \]
Knowing that
\[ S_{XZ}(\omega) = \frac{S_{XY}(\omega)}{[S_Y^+(\omega)]^*} = \frac{S_{XY}(\omega)}{S_Y^-(\omega)}, \]
we obtain
\[ Y_n \;\longrightarrow\; \frac{1}{S_Y^+(\omega)} \;\longrightarrow\; Z_n \;\longrightarrow\; \left[ \frac{S_{XY}(\omega)}{S_Y^-(\omega)} \right]_+ \;\longrightarrow\; \hat X_t \]
It is interesting to compare this filter with the non-causal Wiener-Kolmogorov filter
\[ \frac{S_{XY}(\omega)}{S_Y(\omega)} = \frac{1}{S_Y^+(\omega)} \left[ \frac{S_{XY}(\omega)}{S_Y^-(\omega)} \right]. \]
The following diagrams summarize this comparison:
\[ \text{Causal:}\quad Y_n \;\longrightarrow\; \frac{1}{S_Y^+(\omega)} \;\longrightarrow\; \left[ \frac{S_{XY}(\omega)}{S_Y^-(\omega)} \right]_+ \;\longrightarrow\; \hat X_t \]
\[ \text{Non-causal:}\quad Y_n \;\longrightarrow\; \frac{1}{S_Y^+(\omega)} \;\longrightarrow\; \frac{S_{XY}(\omega)}{S_Y^-(\omega)} \;\longrightarrow\; \tilde X_t \]
9.7 Rational spectra
Definition 1 (Rational spectra) $S_Y(\omega)$ is said to be rational if it can be written as the ratio of two real trigonometric polynomials
\[ S_Y(\omega) = \frac{n_0 + 2\sum_{k=1}^{p} n_k \cos k\omega}{d_0 + 2\sum_{k=1}^{m} d_k \cos k\omega}, \qquad n_k, d_k \in \mathbb R. \qquad (9.31) \]
Using the relation $2\cos k\omega = e^{jk\omega} + e^{-jk\omega}$ we deduce
\[ S_Y(\omega) = \frac{N(e^{j\omega})}{D(e^{j\omega})}. \]
Setting $z = e^{j\omega}$ we have
\[ N(z) = \sum_{k=-p}^{p} n_{|k|}\, z^{-k}, \qquad D(z) = \sum_{k=-m}^{m} d_{|k|}\, z^{-k}. \]
$z^p N(z)$ is a $(2p)$-th order polynomial, so we can write
\[ N(z) = n_p\, z^{-p} \prod_{k=1}^{2p} (z - z_k). \]
Since $N(z) = N(1/z)$, the roots $z_k$ come in reciprocal pairs. If we order them in such a way that $|z_1| \ge |z_2| \ge \cdots \ge |z_{2p}|$, we have
\[ |z_{2p}| = \frac{1}{|z_1|}, \quad |z_{2p-1}| = \frac{1}{|z_2|}, \quad \cdots, \quad |z_{p+1}| = \frac{1}{|z_p|}. \]
We deduce that the roots $z_1,\dots,z_p$ are outside or on the unit circle. Due to the reciprocity of the roots we can write
\[ N(z) = B(z)\, B(1/z) \qquad (9.32) \]
where
\[ B(z) = \sqrt{(-1)^p\, n_p \Big/ \prod_{k=1}^{p} z_k}\;\; \prod_{k=1}^{p} \left( z^{-1} - z_k \right). \qquad (9.33) \]
So $B(z)$ is a polynomial of degree $p$ in $z^{-1}$ and can be expanded as
\[ B(z) = \sum_{k=0}^{p} b_k\, z^{-k}. \qquad (9.34) \]
Similarly, we can do exactly the same analysis for the denominator $D(z)$:
\[ D(z) = A(z)\, A(1/z) \qquad (9.35) \]
where
\[ A(z) = \sum_{k=0}^{m} a_k\, z^{-k}. \qquad (9.36) \]
Putting these together we have
\[ S_Y(\omega) = \frac{B(e^{j\omega})\, B(e^{-j\omega})}{A(e^{j\omega})\, A(e^{-j\omega})} = \frac{B(e^{j\omega})}{A(e^{j\omega})} \cdot \frac{B(e^{-j\omega})}{A(e^{-j\omega})}. \]
Assuming that none of the roots of $B$ or $A$ is on the unit circle $|z| = 1$, we have
\[ S_Y^+(\omega) = \frac{B(e^{j\omega})}{A(e^{j\omega})}, \qquad S_Y^-(\omega) = \frac{B(e^{-j\omega})}{A(e^{-j\omega})}. \]
Now consider the whitening filter of the last section and assume that the power spectrum of the data $S_Y(\omega)$ is a rational fraction:
\[ Y_n \;\longrightarrow\; \frac{1}{S_Y^+(\omega)} = \frac{A(z)}{B(z)} = \frac{\sum_{k=0}^{m} a_k z^{-k}}{\sum_{k=0}^{p} b_k z^{-k}} \;\longrightarrow\; Z_n \]
Then we can easily see that we have
\[ \sum_{k=0}^{m} a_k\, Y_{n-k} = \sum_{k=0}^{p} b_k\, Z_{n-k}. \qquad (9.37) \]
We can rewrite this equation in two other equivalent forms:
\[ b_0 Z_n = -\sum_{k=1}^{p} b_k\, Z_{n-k} + \sum_{k=0}^{m} a_k\, Y_{n-k} \]
\[ a_0 Y_n = -\sum_{k=1}^{m} a_k\, Y_{n-k} + \sum_{k=0}^{p} b_k\, Z_{n-k}. \]
A sequence $Y_n$ satisfying (9.37) with a white sequence $Z_n$ is called an autoregressive moving-average (ARMA) sequence of order $(m,p)$. For $p = 0$ we have an autoregressive (AR) sequence, and for $m = 0$ we have a moving-average (MA) sequence.
Example 3 (Wide-sense Markov sequences) A simple and useful model for the correlation structure of a stationary random sequence is the so-called wide-sense Markov model:
\[ C_Y(n) = \sigma^2 r^{|n|}, \quad n \in \mathbb Z, \qquad (9.38) \]
where $|r| < 1$. The power spectrum of such a sequence is
\[ S_Y(\omega) = \frac{\sigma^2(1-r^2)}{1 - 2r\cos\omega + r^2}, \]
which is a rational fraction, and we can easily see that we can write it as
\[ S_Y(\omega) = \frac{\sigma^2(1-r^2)}{(1 - r e^{-j\omega})(1 - r e^{+j\omega})} = \frac{1}{A(e^{-j\omega})\, A(e^{+j\omega})} = \frac{1}{A(z)\, A(z^{-1})} \]
where
\[ A(z) = a_0 + a_1 z^{-1}, \]
with $a_0 = 1/\sqrt{\sigma^2(1-r^2)}$ and $a_1 = -r\, a_0$.

We conclude that a wide-sense Markov sequence with the covariance structure (9.38) is an AR(1) sequence.
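The factorization can be checked numerically on a frequency grid; the sketch below uses hypothetical values $r = 0.8$, $\sigma^2 = 2$ and the normalization $a_0 = 1/\sqrt{\sigma^2(1-r^2)}$, which makes $|A(e^{j\omega})|^2 = 1/S_Y(\omega)$.

```python
import numpy as np

# Wide-sense Markov / AR(1) example: C_Y(n) = sigma2 * r**|n|
r, sigma2 = 0.8, 2.0
a0 = 1.0 / np.sqrt(sigma2 * (1 - r**2))    # A(z) = a0 + a1 z^{-1}
a1 = -r * a0

w = np.linspace(-np.pi, np.pi, 1001)
S_Y = sigma2 * (1 - r**2) / (1 - 2*r*np.cos(w) + r**2)   # rational spectrum
A = a0 + a1 * np.exp(-1j * w)                            # A(e^{jw})

# Spectral factorization check: S_Y(w) = 1 / |A(e^{jw})|^2
print(np.max(np.abs(S_Y - 1.0 / np.abs(A)**2)))
```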
Example 4 (Prediction of a wide-sense Markov sequence) Consider now the prediction problem where we wish to predict $Y_{n+\lambda}$ from the sequence $\{Y_k\}_{k=-\infty}^{n}$. Using the relations we obtained in the last section, this can be done through a causal filter:
\[ Y_n \;\longrightarrow\; H(\omega) = \frac{1}{S_Y^+(\omega)} \left[ \frac{S_{XY}(\omega)}{S_Y^-(\omega)} \right]_+ \;\longrightarrow\; \hat Y_{t+\lambda} \]
Here we have
\[ H(\omega) = A(e^{j\omega}) \left[ \frac{e^{j\omega\lambda}}{A(e^{j\omega})} \right]_+. \]
Using the geometric series relations
\[ \sum_{k=0}^{\infty} x^k = \frac{1}{1-x}, \qquad \sum_{k=1}^{\infty} x^k = \frac{x}{1-x}, \quad |x| < 1, \]
we easily obtain
\[ \frac{1}{A(z)} = \frac{1}{a_0} \sum_{n=0}^{\infty} r^n z^{-n} \]
and
\[ \left[ \frac{e^{j\omega\lambda}}{A(e^{j\omega})} \right]_+ = \left[ \frac{1}{a_0} \sum_{n=0}^{\infty} r^n e^{-j\omega(n-\lambda)} \right]_+ = \frac{1}{a_0} \sum_{n=\lambda}^{\infty} r^n e^{-j\omega(n-\lambda)} = \frac{1}{a_0} \sum_{l=0}^{\infty} r^{l+\lambda} e^{-jl\omega} = \frac{r^\lambda}{A(e^{j\omega})}. \]
Finally, we obtain
\[ H(\omega) = A(e^{j\omega})\, \frac{r^\lambda}{A(e^{j\omega})} = r^\lambda, \]
which is a pure gain:
\[ \hat Y_{t+\lambda} = r^\lambda\, Y_t. \]
It is also easy to show that in this case we have
\[ \mathrm{MSE} = E\left[(Y_{t+\lambda} - \hat Y_{t+\lambda})^2\right] = \sigma^2(1 - r^{2\lambda}), \]
which means that the prediction error increases monotonically from $\sigma^2(1-r^2)$ to $\sigma^2$ as $\lambda$ increases from $1$ to $\infty$.
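This pure-gain predictor and its error can be checked by simulation. A minimal Monte Carlo sketch, with hypothetical values $r = 0.8$, $\sigma^2 = 1$, $\lambda = 3$:

```python
import numpy as np

# Monte Carlo check of the WS-Markov predictor Yhat_{t+lam} = r**lam * Y_t
# and its MSE sigma2 * (1 - r**(2*lam)).
rng = np.random.default_rng(0)
r, sigma2, lam, T = 0.8, 1.0, 3, 100_000

# Simulate a stationary AR(1) with C_Y(n) = sigma2 * r**|n|
e = rng.normal(0.0, np.sqrt(sigma2 * (1 - r**2)), T)
Y = np.zeros(T)
Y[0] = rng.normal(0.0, np.sqrt(sigma2))
for t in range(1, T):
    Y[t] = r * Y[t - 1] + e[t]

err = Y[lam:] - r**lam * Y[:-lam]      # prediction error Y_{t+lam} - r^lam Y_t
print(np.mean(err**2), sigma2 * (1 - r**(2 * lam)))
```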
Appendix A
Annexes
A.1 Summary of Bayesian inference
Observation model:
\[ X_i \sim f(x_i|\theta), \qquad z = \{x_1,\dots,x_n\}, \quad z_n = \{x_1,\dots,x_n\}, \quad z_{n+1} = \{x_1,\dots,x_n,x_{n+1}\} \]

Likelihood and sufficient statistics:
\[ l(\theta|z) = f(x|\theta) = \prod_{i=1}^{n} f(x_i|\theta), \qquad l(\theta|z) = l(\theta|t(z)), \qquad p(t(z)|\theta) = \frac{l(\theta|t(z))}{\iint l(\theta|t(z))\, d\theta} \]

Inference with any prior law $\pi(\theta)$:
\[ f(x_i) = \int f(x_i|\theta)\,\pi(\theta)\, d\theta, \qquad f(x) = \iint f(x|\theta)\,\pi(\theta)\, d\theta, \qquad p(t(z)) = \iint p(t(z)|\theta)\,\pi(\theta)\, d\theta \]
\[ p(z,\theta) = p(z|\theta)\,\pi(\theta), \qquad p(z) = \iint p(z|\theta)\,\pi(\theta)\, d\theta \quad \text{(prior predictive)} \]
\[ \pi(\theta|z) = \frac{p(z|\theta)\,\pi(\theta)}{p(z)}, \qquad E[\theta|z] = \iint \theta\,\pi(\theta|z)\, d\theta \]
\[ f(x|z) = \frac{p(z,x)}{p(z)} = \frac{p(z_{n+1})}{p(z_n)} \quad \text{(posterior predictive)}, \qquad E[x|z] = \int x\, f(x|z)\, dx \]

Inference with conjugate priors:
\[ \pi(\theta|\tau_0) = \frac{p(t=\tau_0|\theta)}{\iint p(t=\tau_0|\theta)\, d\theta} \in \mathcal F_{\tau_0}(\theta), \qquad \pi(\theta|z,\tau) \in \mathcal F_{\tau}(\theta), \quad \text{with } \tau = g(\tau_0, n, z) \]
Inference with conjugate priors and generalized exponential family:
If
\[ f(x_i|\theta) = a(x_i)\, g(\theta)\, \exp\left[ \sum_{k=1}^{K} c_k\,\phi_k(\theta)\, h_k(x_i) \right] \]
then
\[ t_k(x) = \sum_{j=1}^{n} h_k(x_j), \quad k = 1,\dots,K \]
\[ \pi(\theta|\tau_0) = [g(\theta)]^{\tau_0}\, z(\tau)\, \exp\left[ \sum_{k=1}^{K} \tau_k\,\phi_k(\theta) \right] \]
\[ \pi(\theta|x,\tau) = [g(\theta)]^{n+\tau_0}\, a(x)\, Z(\tau)\, \exp\left[ \sum_{k=1}^{K} c_k\,\phi_k(\theta)\,(\tau_k + t_k(x)) \right]. \]
Inference with conjugate priors and natural exponential family:
If $f(x_i|\theta) = a(x_i)\exp[\theta x_i - b(\theta)]$, then
\[ t(x) = \sum_{i=1}^{n} x_i \]
\[ \pi(\theta|\tau_0) = c(\theta)\exp[\tau_0\theta - d(\tau_0)], \qquad \pi(\theta|x,\tau_0) = c(\theta)\exp[\tau_n\theta - d(\tau_n)] \quad \text{with } \tau_n = \tau_0 + \bar x, \]
where $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Inference with conjugate priors and natural exponential family, multivariable case:
If $f(x_i|\theta) = a(x_i)\exp[\theta^t x_i - b(\theta)]$, then
\[ t_k(x) = \sum_{i=1}^{n} x_{ki}, \quad k = 1,\dots,K \]
\[ \pi(\theta|\tau_0) = c(\theta)\exp[\tau_0^t\theta - d(\tau_0)], \qquad \pi(\theta|x,\tau_0) = c(\theta)\exp[\tau_n^t\theta - d(\tau_n)] \quad \text{with } \tau_n = \tau_0 + \bar x, \]
where $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Bernoulli model:
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \{0,1\}, \quad r = \sum x_i:\ \text{number of 1s}, \quad n-r:\ \text{number of 0s} \]
\[ f(x_i|\theta) = \mathrm{Ber}(x_i|\theta), \quad 0 < \theta < 1 \]

Likelihood and sufficient statistics:
\[ l(\theta|z) = \prod_{i=1}^{n} \mathrm{Ber}(x_i|\theta) = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i} = \theta^r(1-\theta)^{n-r} \]
\[ t(z) = r = \sum_{i=1}^{n} x_i, \qquad l(\theta|r) = \theta^r(1-\theta)^{n-r}, \qquad p(r|\theta) = \mathrm{Bin}(r|\theta, n) \]

Inference with conjugate priors:
\[ \pi(\theta) = \mathrm{Bet}(\theta|\alpha,\beta), \qquad f(x) = \mathrm{BinBet}(x|\alpha,\beta,1), \qquad p(r) = \mathrm{BinBet}(r|\alpha,\beta,n) \]
\[ \pi(\theta|z) = \mathrm{Bet}(\theta|\alpha+r,\ \beta+n-r), \qquad E[\theta|z] = \frac{\alpha+r}{\alpha+\beta+n} \]
\[ f(x|z) = \mathrm{BinBet}(x|\alpha+r,\ \beta+n-r,\ 1), \qquad E[x|z] = \frac{\alpha+r}{\alpha+\beta+n} \]

Inference with reference priors:
\[ \pi(\theta) = \mathrm{Bet}(\theta|\tfrac12,\tfrac12), \qquad \pi(x) = \mathrm{BinBet}(x|\tfrac12,\tfrac12,1), \qquad \pi(r) = \mathrm{BinBet}(r|\tfrac12,\tfrac12,n) \]
\[ \pi(\theta|z) = \mathrm{Bet}(\theta|\tfrac12+r,\ \tfrac12+n-r), \qquad \pi(x|z) = \mathrm{BinBet}(x|\tfrac12+r,\ \tfrac12+n-r,\ 1) \]
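The Bernoulli conjugate update is a one-line computation; a minimal sketch with the reference prior $\mathrm{Bet}(\tfrac12,\tfrac12)$ on hypothetical data ($r = 7$, $n = 10$):

```python
import numpy as np

def beta_bernoulli_update(x, alpha=0.5, beta=0.5):
    """Conjugate update for the Bernoulli model with a Bet(alpha, beta) prior.

    Returns the posterior parameters of Bet(alpha + r, beta + n - r) and the
    posterior mean E[theta|z] = (alpha + r) / (alpha + beta + n).
    """
    x = np.asarray(x)
    n, r = x.size, int(x.sum())          # sufficient statistic t(z) = r
    a_n, b_n = alpha + r, beta + n - r
    return a_n, b_n, a_n / (a_n + b_n)

a_n, b_n, mean = beta_bernoulli_update([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
print(a_n, b_n, mean)   # 7.5 3.5 0.6818...
```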
Binomial model:
\[ z = \{x_1,\dots,x_n\}, \quad x_i = 0,1,2,\dots,m, \qquad f(x_i|\theta,m) = \mathrm{Bin}(x_i|\theta,m), \quad 0<\theta<1, \ m = 0,1,2,\dots \]

Likelihood and sufficient statistics:
\[ l(\theta|z) = \prod_{i=1}^{n} \mathrm{Bin}(x_i|\theta,m), \qquad t(z) = r = \sum_{i=1}^{n} x_i, \qquad p(r|\theta) = \mathrm{Bin}(r|\theta, nm) \]

Inference with conjugate priors:
\[ \pi(\theta) = \mathrm{Bet}(\theta|\alpha,\beta), \qquad f(x) = \mathrm{BinBet}(x|\alpha,\beta,m), \qquad p(r) = \mathrm{BinBet}(r|\alpha,\beta,nm) \]
\[ \pi(\theta|z) = \mathrm{Bet}(\theta|\alpha+r,\ \beta+nm-r), \qquad E[\theta|z] = \frac{\alpha+r}{\alpha+\beta+nm} \]
\[ f(x|z) = \mathrm{BinBet}(x|\alpha+r,\ \beta+nm-r,\ m) \]

Inference with reference priors:
\[ \pi(\theta) = \mathrm{Bet}(\theta|\tfrac12,\tfrac12), \qquad \pi(x) = \mathrm{BinBet}(x|\tfrac12,\tfrac12,m), \qquad \pi(r) = \mathrm{BinBet}(r|\tfrac12,\tfrac12,nm) \]
\[ \pi(\theta|z) = \mathrm{Bet}(\theta|\tfrac12+r,\ \tfrac12+nm-r), \qquad \pi(x|z) = \mathrm{BinBet}(x|\tfrac12+r,\ \tfrac12+nm-r,\ m) \]
Poisson model:
\[ z = \{x_1,\dots,x_n\}, \quad x_i = 0,1,2,\dots, \qquad f(x_i|\lambda) = \mathrm{Pn}(x_i|\lambda), \quad \lambda \ge 0 \]

Likelihood and sufficient statistics:
\[ l(\lambda|z) = \prod_{i=1}^{n} \mathrm{Pn}(x_i|\lambda), \qquad t(z) = r = \sum_{i=1}^{n} x_i, \qquad p(r|\lambda) = \mathrm{Pn}(r|n\lambda) \]

Inference with conjugate priors:
\[ p(\lambda) = \mathrm{Gam}(\lambda|\alpha,\beta), \qquad f(x) = \mathrm{PnGam}(x|\alpha,\beta,1), \qquad p(r) = \mathrm{PnGam}(r|\alpha,\beta,n) \]
\[ p(\lambda|z) = \mathrm{Gam}(\lambda|\alpha+r,\ \beta+n), \qquad E[\lambda|z] = \frac{\alpha+r}{\beta+n}, \qquad f(x|z) = \mathrm{PnGam}(x|\alpha+r,\ \beta+n,\ 1) \]

Inference with reference priors:
\[ \pi(\lambda) \propto \lambda^{-1/2} = \mathrm{Gam}(\lambda|\tfrac12, 0), \qquad \pi(x) = \mathrm{PnGam}(x|\tfrac12,0,1), \qquad \pi(r) = \mathrm{PnGam}(r|\tfrac12,0,n) \]
\[ \pi(\lambda|z) = \mathrm{Gam}(\lambda|\tfrac12+r,\ n), \qquad \pi(x|z) = \mathrm{PnGam}(x|\tfrac12+r,\ n,\ 1) \]
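The Gamma-Poisson update follows the same pattern; a minimal sketch with the reference prior $\mathrm{Gam}(\tfrac12, 0)$ on hypothetical counts:

```python
import numpy as np

def gamma_poisson_update(x, alpha=0.5, beta=0.0):
    """Conjugate update for the Poisson model with a Gam(alpha, beta) prior.

    The posterior is Gam(alpha + r, beta + n) with r = sum(x); returns its
    parameters and the posterior mean E[lambda|z] = (alpha + r)/(beta + n).
    """
    x = np.asarray(x)
    n, r = x.size, int(x.sum())
    return alpha + r, beta + n, (alpha + r) / (beta + n)

a_n, b_n, mean = gamma_poisson_update([2, 0, 3, 1, 4])
print(a_n, b_n, mean)   # 10.5 5.0 2.1
```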
Negative Binomial model:
\[ z = \{x_1,\dots,x_n\}, \quad x_i = 0,1,2,\dots, \qquad f(x_i|\theta,r) = \mathrm{NegBin}(x_i|\theta,r), \quad 0<\theta<1, \ r = 1,2,\dots \]

Likelihood and sufficient statistics:
\[ l(\theta|z) = \prod_{i=1}^{n} \mathrm{NegBin}(x_i|\theta,r), \qquad t(z) = s = \sum_{i=1}^{n} x_i, \qquad p(s|\theta) = \mathrm{NegBin}(s|\theta, nr) \]

Inference with conjugate priors:
\[ \pi(\theta) = \mathrm{Bet}(\theta|\alpha,\beta), \qquad f(x) = \mathrm{NegBinBet}(x|\alpha,\beta,r), \qquad p(s) = \mathrm{NegBinBet}(s|\alpha,\beta,nr) \]
\[ \pi(\theta|z) = \mathrm{Bet}(\theta|\alpha+nr,\ \beta+s), \qquad E[\theta|z] = \frac{\alpha+nr}{\alpha+nr+\beta+s} \]
\[ f(x|z) = \mathrm{NegBinBet}(x|\alpha+nr,\ \beta+s,\ nr) \]

Inference with reference priors:
\[ \pi(\theta) \propto \theta^{-1}(1-\theta)^{-1/2} = \mathrm{Bet}(\theta|0,\tfrac12), \qquad \pi(x) = \mathrm{NegBinBet}(x|0,\tfrac12,r), \qquad \pi(s) = \mathrm{NegBinBet}(s|0,\tfrac12,nr) \]
\[ \pi(\theta|z) = \mathrm{Bet}(\theta|nr,\ s+\tfrac12), \qquad \pi(x|z) = \mathrm{NegBinBet}(x|nr,\ s+\tfrac12,\ nr) \]
Exponential model:
\[ z = \{x_1,\dots,x_n\}, \quad 0 < x_i < \infty, \qquad f(x_i|\lambda) = \mathrm{Ex}(x_i|\lambda), \quad \lambda > 0 \]

Likelihood and sufficient statistics:
\[ l(\lambda|z) = \prod_{i=1}^{n} \mathrm{Ex}(x_i|\lambda), \qquad t(z) = t = \sum_{i=1}^{n} x_i, \qquad p(t|\lambda) = \mathrm{Gam}(t|n,\lambda) \]

Inference with conjugate priors:
\[ p(\lambda) = \mathrm{Gam}(\lambda|\alpha,\beta), \qquad f(x) = \mathrm{GamGam}(x|\alpha,\beta,1), \qquad p(t) = \mathrm{GamGam}(t|\alpha,\beta,n) \]
\[ p(\lambda|z) = \mathrm{Gam}(\lambda|\alpha+n,\ \beta+t), \qquad E[\lambda|z] = \frac{\alpha+n}{\beta+t}, \qquad f(x|z) = \mathrm{GamGam}(x|\alpha+n,\ \beta+t,\ 1) \]

Inference with reference priors:
\[ \pi(\lambda) \propto \lambda^{-1} = \mathrm{Gam}(\lambda|0,0), \qquad \pi(x) = \mathrm{GamGam}(x|0,0,1), \qquad \pi(t) = \mathrm{GamGam}(t|0,0,n) \]
\[ \pi(\lambda|z) = \mathrm{Gam}(\lambda|n,t), \qquad \pi(x|z) = \mathrm{GamGam}(x|n,t,1) \]
Uniform model:
\[ z = \{x_1,\dots,x_n\}, \quad 0 < x_i < \theta, \qquad f(x_i|\theta) = \mathrm{Uni}(x_i|0,\theta), \quad \theta > 0 \]

Likelihood and sufficient statistics:
\[ l(\theta|z) = \prod_{i=1}^{n} \mathrm{Uni}(x_i|0,\theta), \qquad t(z) = t = \max\{x_1,\dots,x_n\}, \qquad p(t|\theta) = \mathrm{IPar}(t|n,\theta^{-1}) \]

Inference with conjugate priors:
\[ \pi(\theta) = \mathrm{Par}(\theta|\alpha,\beta) \]
\[ f(x) = \begin{cases} \frac{\alpha}{\alpha+1}\,\mathrm{Uni}(x|0,\beta), & \text{if } x \le \beta \\ \frac{1}{\alpha+1}\,\mathrm{Par}(x|\alpha,\beta), & \text{if } x > \beta \end{cases} \qquad
p(t) = \begin{cases} \frac{\alpha}{\alpha+n}\,\mathrm{IPar}(t|n,\beta^{-1}), & \text{if } t \le \beta \\ \frac{n}{\alpha+n}\,\mathrm{Par}(t|\alpha,\beta), & \text{if } t > \beta \end{cases} \]
\[ \pi(\theta|z) = \mathrm{Par}(\theta|\alpha+n,\ \beta_n), \qquad \beta_n = \max\{\beta, t\} \]
\[ f(x|z) = \begin{cases} \frac{\alpha+n}{\alpha+n+1}\,\mathrm{Uni}(x|0,\beta_n), & \text{if } x \le \beta_n \\ \frac{1}{\alpha+n+1}\,\mathrm{Par}(x|\alpha+n,\beta_n), & \text{if } x > \beta_n \end{cases} \]

Inference with reference priors:
\[ \pi(\theta) \propto \theta^{-1} = \mathrm{Par}(\theta|0,0), \qquad \pi(\theta|z) = \mathrm{Par}(\theta|n,t) \]
\[ \pi(x|z) = \begin{cases} \frac{n}{n+1}\,\mathrm{Uni}(x|0,t), & \text{if } x \le t \\ \frac{1}{n+1}\,\mathrm{Par}(x|n,t), & \text{if } x > t \end{cases} \]
Normal with known precision $\lambda = 1/\sigma^2 > 0$ (estimation of $\mu$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R, \quad x_i = \mu + b_i, \quad b_i \sim N(b_i|0,\lambda), \qquad f(x_i|\mu,\lambda) = N(x_i|\mu,\lambda), \quad \mu \in \mathbb R \]

Likelihood and sufficient statistics:
\[ l(\mu|z) = \prod_{i=1}^{n} N(x_i|\mu,\lambda), \qquad t(z) = \bar x = \frac1n\sum_{i=1}^{n} x_i, \qquad p(\bar x|\mu,\lambda) = N(\bar x|\mu, n\lambda) \]

Inference with conjugate priors:
\[ p(\mu) = N(\mu|\mu_0,\lambda_0), \qquad f(x) = N\left(x\,\Big|\,\mu_0,\ \frac{\lambda\lambda_0}{\lambda+\lambda_0}\right) \]
\[ f(x_1,\dots,x_n) = N_n\left(x\,\Big|\,\mu_0\mathbf 1,\ \left(\frac1\lambda \mathbf I + \frac1{\lambda_0}\mathbf 1\mathbf 1^t\right)^{-1}\right) \]
\[ p(\bar x) = N\left(\bar x\,\Big|\,\mu_0,\ \frac{n\lambda\lambda_0}{\lambda_n}\right), \quad \lambda_n = \lambda_0 + n\lambda \]
\[ p(\mu|z) = N(\mu|\mu_n,\lambda_n), \quad \mu_n = \frac{\lambda_0\mu_0 + n\lambda\bar x}{\lambda_n}, \qquad f(x|z) = N\left(x\,\Big|\,\mu_n,\ \frac{\lambda\lambda_n}{\lambda+\lambda_n}\right) \]

Inference with reference priors:
\[ \pi(\mu) = \text{constant}, \qquad \pi(\mu|z) = N(\mu|\bar x, n\lambda), \qquad \pi(x|z) = N\left(x\,\Big|\,\bar x,\ \frac{n\lambda}{n+1}\right) \]

Inference with other prior laws:
\[ \pi(\mu) = \mathrm{St}(\mu|0,\tau^2,\alpha), \quad \text{obtained as the scale mixture } \pi_1(\mu|\rho)\,\pi_2(\rho|\alpha) \text{ with} \]
\[ \pi_1(\mu|\rho) = N(\mu|0,\tau^2\rho), \qquad \pi_2(\rho|\alpha) = \mathrm{IGam}(\rho|\alpha/2,\alpha/2), \]
\[ \pi(\mu|z,\rho) = N\left(\mu\,\Big|\,\frac{1}{1+\tau^2\rho}\bar x,\ \frac{\tau^2\rho}{1+\tau^2\rho}\right), \qquad \pi(\rho|z) \propto (1+\tau^2\rho)^{-1/2}\exp\left[\frac{-1}{2(1+\tau^2\rho)}\, x^t x\right]\pi_2(\rho) \]
Normal with known variance $\sigma^2 > 0$ (estimation of $\mu$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R, \quad x_i = \mu + b_i, \quad b_i \sim N(b_i|0,\sigma^2), \qquad f(x_i|\mu,\sigma^2) = N(x_i|\mu,\sigma^2), \quad \mu \in \mathbb R \]

Likelihood and sufficient statistics:
\[ l(\mu|z) = \prod_{i=1}^{n} N(x_i|\mu,\sigma^2), \qquad t(z) = \bar x = \frac1n\sum_{i=1}^{n} x_i, \qquad p(\bar x|\mu,\sigma^2) = N\left(\bar x\,\Big|\,\mu, \frac1n\sigma^2\right) \]

Inference with conjugate priors:
\[ p(\mu) = N(\mu|\mu_0,\sigma_0^2), \qquad f(x) = N(x|\mu_0,\ \sigma_0^2+\sigma^2), \qquad f(x_1,\dots,x_n) = N_n(x|\mu_0\mathbf 1,\ \sigma^2\mathbf I + \sigma_0^2\mathbf 1\mathbf 1^t) \]
\[ p(\bar x) = N\left(\bar x\,\Big|\,\mu_0,\ \sigma_0^2 + \frac1n\sigma^2\right) \]
\[ p(\mu|z) = N(\mu|\mu_n,\sigma_n^2), \quad \mu_n = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2}\left(\frac{1}{\sigma_0^2}\mu_0 + \frac{n}{\sigma^2}\bar x\right), \quad \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2} \]
\[ f(x|z) = N(x|\mu_n,\ \sigma^2+\sigma_n^2) \]

Inference with reference priors:
\[ \pi(\mu) = \text{constant}, \qquad \pi(\mu|z) = N\left(\mu\,\Big|\,\bar x, \frac1n\sigma^2\right), \qquad \pi(x|z) = N\left(x\,\Big|\,\bar x,\ \frac{n+1}{n}\sigma^2\right) \]

Inference with other prior laws:
\[ \pi(\mu) = \mathrm{St}(\mu|0,\tau^2,\alpha), \quad \text{obtained as the scale mixture } \pi_1(\mu|\rho)\,\pi_2(\rho|\alpha) \text{ with} \]
\[ \pi_1(\mu|\rho) = N(\mu|0,\tau^2\rho), \qquad \pi_2(\rho|\alpha) = \mathrm{IGam}(\rho|\alpha/2,\alpha/2), \]
\[ \pi(\mu|z,\rho) = N\left(\mu\,\Big|\,\frac{1}{1+\tau^2\rho}\bar x,\ \frac{\tau^2\rho}{1+\tau^2\rho}\right), \qquad \pi(\rho|z) \propto (1+\tau^2\rho)^{-1/2}\exp\left[\frac{-1}{2(1+\tau^2\rho)}\, x^t x\right]\pi_2(\rho) \]
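The conjugate Normal-mean update above is easy to verify numerically; a minimal sketch with hypothetical data, $\sigma^2 = 1$ and a weak prior $N(0, 100)$:

```python
import numpy as np

def normal_known_var_update(x, sigma2, mu0, sigma02):
    """Conjugate update for a Normal mean with known variance sigma2.

    Prior N(mu0, sigma02); returns posterior mean mu_n and variance sigma_n^2:
      sigma_n^2 = sigma02*sigma2 / (n*sigma02 + sigma2)
      mu_n      = sigma_n^2 * (mu0/sigma02 + n*xbar/sigma2)
    """
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    sigma_n2 = sigma02 * sigma2 / (n * sigma02 + sigma2)
    mu_n = sigma_n2 * (mu0 / sigma02 + n * xbar / sigma2)
    return mu_n, sigma_n2

mu_n, s_n2 = normal_known_var_update([1.2, 0.8, 1.0, 1.4], 1.0, 0.0, 100.0)
print(mu_n, s_n2)       # posterior mean pulled slightly toward the prior mean 0
```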
Normal with known mean $\mu \in \mathbb R$ (estimation of $\lambda$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i = \mu + b_i, \quad b_i \sim N(b_i|0,\lambda), \qquad f(x_i|\mu,\lambda) = N(x_i|\mu,\lambda), \quad \lambda > 0 \]

Likelihood and sufficient statistics:
\[ l(\lambda|z) = \prod_{i=1}^{n} N(x_i|\mu,\lambda), \qquad t(z) = t = \sum_{i=1}^{n}(x_i-\mu)^2 \]
\[ p(t|\mu,\lambda) = \mathrm{Gam}\left(t\,\Big|\,\frac n2, \frac\lambda2\right), \qquad p(\lambda t|\mu,\lambda) = \mathrm{Chi2}(\lambda t|n) \]

Inference with conjugate priors:
\[ p(\lambda) = \mathrm{Gam}(\lambda|\alpha,\beta), \qquad f(x) = \mathrm{St}(x|\mu,\ \alpha/\beta,\ 2\alpha), \qquad p(t) = \mathrm{GamGam}\left(t\,\Big|\,\alpha, 2\beta, \frac n2\right) \]
\[ p(\lambda|z) = \mathrm{Gam}\left(\lambda\,\Big|\,\alpha+\frac n2,\ \beta+\frac t2\right), \qquad f(x|z) = \mathrm{St}\left(x\,\Big|\,\mu,\ \frac{\alpha+\frac n2}{\beta+\frac t2},\ 2\alpha+n\right) \]

Inference with reference priors:
\[ \pi(\lambda) \propto \lambda^{-1} = \mathrm{Gam}(\lambda|0,0), \qquad \pi(\lambda|z) = \mathrm{Gam}\left(\lambda\,\Big|\,\frac n2, \frac t2\right), \qquad \pi(x|z) = \mathrm{St}\left(x\,\Big|\,\mu,\ \frac nt,\ n\right) \]
Normal with known mean $\mu \in \mathbb R$ (estimation of $\sigma^2$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i = \mu + b_i, \quad b_i \sim N(b_i|0,\sigma^2), \qquad f(x_i|\mu,\sigma^2) = N(x_i|\mu,\sigma^2), \quad \sigma^2 > 0 \]

Likelihood and sufficient statistics:
\[ l(\sigma^2|z) = \prod_{i=1}^{n} N(x_i|\mu,\sigma^2), \qquad t(z) = t = \sum_{i=1}^{n}(x_i-\mu)^2 \]
\[ p(t|\mu,\sigma^2) = \mathrm{Gam}\left(t\,\Big|\,\frac n2, \frac{1}{2\sigma^2}\right), \qquad p\left(\frac{t}{\sigma^2}\,\Big|\,\mu,\sigma^2\right) = \mathrm{Chi2}\left(\frac{t}{\sigma^2}\,\Big|\,n\right) \]

Inference with conjugate priors:
\[ p(\sigma^2) = \mathrm{IGam}(\sigma^2|\alpha,\beta), \qquad f(x) = \mathrm{St}(x|\mu,\ \alpha/\beta,\ 2\alpha), \qquad p(t) = \mathrm{GamGam}\left(t\,\Big|\,\alpha, 2\beta, \frac n2\right) \]
\[ p(\sigma^2|z) = \mathrm{IGam}\left(\sigma^2\,\Big|\,\alpha+\frac n2,\ \beta+\frac t2\right), \qquad f(x|z) = \mathrm{St}\left(x\,\Big|\,\mu,\ \frac{\alpha+\frac n2}{\beta+\frac t2},\ 2\alpha+n\right) \]

Inference with reference priors:
\[ \pi(\sigma^2) \propto \frac{1}{\sigma^2} = \mathrm{IGam}(\sigma^2|0,0), \qquad \pi(\sigma^2|z) = \mathrm{IGam}\left(\sigma^2\,\Big|\,\frac n2, \frac t2\right), \qquad \pi(x|z) = \mathrm{St}\left(x\,\Big|\,\mu,\ \frac nt,\ n\right) \]
Normal with both parameters unknown; estimation of mean and precision $(\mu,\lambda)$:
\[ z = \{x_1,\dots,x_n\}, \quad x_i = \mu + b_i, \quad b_i \sim N(b_i|0,\lambda), \qquad f(x_i|\mu,\lambda) = N(x_i|\mu,\lambda), \quad \mu \in \mathbb R, \ \lambda > 0 \]

Likelihood and sufficient statistics:
\[ l(\mu,\lambda|z) = \prod_{i=1}^{n} N(x_i|\mu,\lambda), \qquad t(z) = (\bar x, s^2), \quad \bar x = \frac1n\sum_{i=1}^{n} x_i, \quad s^2 = \frac1n\sum_{i=1}^{n}(x_i-\bar x)^2 \]
\[ p(\bar x|\mu,\lambda) = N(\bar x|\mu, n\lambda), \qquad p(ns^2|\mu,\lambda) = \mathrm{Gam}(ns^2|(n-1)/2,\ \lambda/2), \qquad p(\lambda ns^2|\mu,\lambda) = \mathrm{Chi2}(\lambda ns^2|n-1) \]

Inference with conjugate priors:
\[ p(\mu,\lambda) = \mathrm{NGam}(\mu,\lambda|\mu_0,n_0,\alpha,\beta) = N(\mu|\mu_0,n_0\lambda)\,\mathrm{Gam}(\lambda|\alpha,\beta) \]
\[ p(\mu) = \mathrm{St}\left(\mu\,\Big|\,\mu_0,\ n_0\frac\alpha\beta,\ 2\alpha\right), \qquad p(\lambda) = \mathrm{Gam}(\lambda|\alpha,\beta) \]
\[ f(x) = \mathrm{St}\left(x\,\Big|\,\mu_0,\ \frac{n_0}{n_0+1}\frac\alpha\beta,\ 2\alpha\right), \qquad p(\bar x) = \mathrm{St}\left(\bar x\,\Big|\,\mu_0,\ \frac{n_0 n}{n_0+n}\frac\alpha\beta,\ 2\alpha\right) \]
\[ p(ns^2) = \mathrm{GamGam}\left(ns^2\,\Big|\,\alpha, 2\beta, \frac{n-1}{2}\right) \]
\[ p(\mu|z) = \mathrm{St}\left(\mu\,\big|\,\mu_n,\ (n+n_0)\,\alpha_n\beta_n^{-1},\ 2\alpha_n\right), \]
\[ \alpha_n = \alpha+\frac n2, \qquad \mu_n = \frac{n_0\mu_0+n\bar x}{n_0+n}, \qquad \beta_n = \beta+\frac{ns^2}{2}+\frac12\frac{n_0 n}{n_0+n}(\mu_0-\bar x)^2 \]
\[ p(\lambda|z) = \mathrm{Gam}(\lambda|\alpha_n,\beta_n), \qquad f(x|z) = \mathrm{St}\left(x\,\Big|\,\mu_n,\ \frac{n+n_0}{n+n_0+1}\frac{\alpha_n}{\beta_n},\ 2\alpha_n\right) \]

Inference with reference priors:
\[ \pi(\mu,\lambda) = \pi(\lambda,\mu) \propto \lambda^{-1}, \quad n > 1 \]
\[ \pi(\mu|z) = \mathrm{St}(\mu|\bar x,\ (n-1)s^{-2},\ n-1), \qquad \pi(\lambda|z) = \mathrm{Gam}(\lambda|(n-1)/2,\ ns^2/2) \]
\[ \pi(x|z) = \mathrm{St}\left(x\,\Big|\,\bar x,\ \frac{n-1}{n+1}s^{-2},\ n-1\right) \]
Normal with both parameters unknown; estimation of mean and variance $(\mu,\sigma^2)$:
\[ z = \{x_1,\dots,x_n\}, \quad x_i = \mu + b_i, \quad b_i \sim N(b_i|0,\sigma^2), \qquad f(x_i|\mu,\sigma^2) = N(x_i|\mu,\sigma^2), \quad \mu \in \mathbb R, \ \sigma^2 > 0 \]

Likelihood and sufficient statistics:
\[ l(\mu,\sigma^2|z) = \prod_{i=1}^{n} N(x_i|\mu,\sigma^2), \qquad t(z) = (\bar x, s^2), \quad \bar x = \frac1n\sum_{i=1}^{n} x_i, \quad s^2 = \frac1n\sum_{i=1}^{n}(x_i-\bar x)^2 \]
\[ p(\bar x|\mu,\sigma^2) = N\left(\bar x\,\Big|\,\mu, \frac1n\sigma^2\right), \qquad p\left(\frac{ns^2}{\sigma^2}\,\Big|\,\mu,\sigma^2\right) = \mathrm{Chi2}\left(\frac{ns^2}{\sigma^2}\,\Big|\,n-1\right) \]

Inference with conjugate priors:
\[ p(\mu,\sigma^2) = \mathrm{NIGam}(\mu,\sigma^2|\mu_0,n_0,\alpha,\beta) = N(\mu|\mu_0,n_0\sigma^2)\,\mathrm{IGam}(\sigma^2|\alpha,\beta) \]
\[ p(\mu) = \mathrm{St}\left(\mu\,\Big|\,\mu_0,\ n_0\frac\alpha\beta,\ 2\alpha\right), \qquad p(\sigma^2) = \mathrm{IGam}(\sigma^2|\alpha,\beta) \]
\[ f(x) = \mathrm{St}\left(x\,\Big|\,\mu_0,\ \frac{n_0}{n_0+1}\frac\alpha\beta,\ 2\alpha\right), \qquad p(\bar x) = \mathrm{St}\left(\bar x\,\Big|\,\mu_0,\ \frac{n_0 n}{n_0+n}\frac\alpha\beta,\ 2\alpha\right) \]
\[ p(ns^2) = \mathrm{GamGam}\left(ns^2\,\Big|\,\alpha, 2\beta, \frac{n-1}{2}\right) \]
\[ p(\mu|z) = \mathrm{St}\left(\mu\,\big|\,\mu_n,\ (n+n_0)\,\alpha_n\beta_n^{-1},\ 2\alpha_n\right), \]
\[ \alpha_n = \alpha+\frac n2, \qquad \mu_n = \frac{n_0\mu_0+n\bar x}{n_0+n}, \qquad \beta_n = \beta+\frac{ns^2}{2}+\frac12\frac{n_0 n}{n_0+n}(\mu_0-\bar x)^2 \]
\[ p(\sigma^2|z) = \mathrm{IGam}(\sigma^2|\alpha_n,\beta_n), \qquad f(x|z) = \mathrm{St}\left(x\,\Big|\,\mu_n,\ \frac{n+n_0}{n+n_0+1}\frac{\alpha_n}{\beta_n},\ 2\alpha_n\right) \]

Inference with reference priors:
\[ \pi(\mu,\sigma^2) = \pi(\sigma^2,\mu) \propto \frac{1}{\sigma^2}, \quad n > 1 \]
\[ \pi(\mu|z) = \mathrm{St}(\mu|\bar x,\ (n-1)s^{-2},\ n-1), \qquad \pi(\sigma^2|z) = \mathrm{IGam}(\sigma^2|(n-1)/2,\ ns^2/2) \]
\[ \pi(x|z) = \mathrm{St}\left(x\,\Big|\,\bar x,\ \frac{n-1}{n+1}s^{-2},\ n-1\right) \]
Multinomial model:
\[ z = \{r_1,\dots,r_k,n\}, \quad r_i = 0,1,2,\dots, \quad \sum_{i=1}^{k} r_i \le n, \]
\[ p(r_i|\theta_i,n) = \mathrm{Bin}(r_i|\theta_i,n), \qquad p(z|\theta,n) = \mathrm{Mu}_k(z|\theta,n), \quad 0<\theta_i<1, \ \sum_{i=1}^{k}\theta_i \le 1 \]

Likelihood and sufficient statistics:
\[ l(\theta|z) = \mathrm{Mu}_k(z|\theta,n), \qquad t(z) = (r,n), \quad r = \{r_1,\dots,r_k\}, \qquad p(r|\theta) = \mathrm{Mu}_k(r|\theta,n) \]

Inference with conjugate priors:
\[ \pi(\theta) = \mathrm{Di}_k(\theta|\alpha), \quad \alpha = \{\alpha_1,\dots,\alpha_{k+1}\}, \qquad p(r) = \mathrm{Mu}_k(r|\alpha,n) \]
\[ \pi(\theta|z) = \mathrm{Di}_k\left(\theta\,\Big|\,\alpha_1+r_1,\dots,\alpha_k+r_k,\ \alpha_{k+1}+n-\sum_{i=1}^{k} r_i\right) \]
\[ f(x|z) = \mathrm{Di}_k\left(\theta\,\Big|\,\alpha_1+r_1,\dots,\alpha_k+r_k,\ \alpha_{k+1}+n-\sum_{i=1}^{k} r_i\right) \]

Inference with reference priors:
\[ \pi(\theta) \propto\ ??, \qquad \pi(\theta|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Multivariate Normal with known precision matrix $\Lambda$ (estimation of the mean $\mu$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R^k, \quad x_i = \mu + b_i, \quad b_i \sim N_k(b_i|0,\Lambda) \]
\[ f(x_i|\mu,\Lambda) = N_k(x_i|\mu,\Lambda), \quad \mu \in \mathbb R^k, \quad \Lambda \ \text{a p.d. matrix of dimensions } k\times k \]

Likelihood and sufficient statistics:
\[ l(\mu|z) = \prod_{i=1}^{n} N_k(x_i|\mu,\Lambda), \qquad t(z) = \bar x = \frac1n\sum_{i=1}^{n} x_i, \qquad p(\bar x|\mu,\Lambda) = N_k(\bar x|\mu, n\Lambda) \]

Inference with conjugate priors:
\[ p(\mu) = N_k(\mu|\mu_0,\Lambda_0), \qquad f(x) = N_k\left(x\,\big|\,\mu_0,\ (\Lambda_0\Lambda)\Lambda_1^{-1}\right), \quad \Lambda_1 = \Lambda_0 + \Lambda \]
\[ p(\mu|z) = N_k(\mu|\mu_n,\Lambda_n), \quad \Lambda_n = \Lambda_0 + n\Lambda, \quad \mu_n = \Lambda_n^{-1}(\Lambda_0\mu_0 + n\Lambda\bar x) \]
\[ f(x|z) = N_k\left(x\,\big|\,\mu_n,\ (\Lambda\Lambda_n)(\Lambda+\Lambda_n)^{-1}\right) \]

Inference with reference priors:
\[ \pi(\mu) =\ ??, \qquad \pi(\mu|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Multivariate Normal with known covariance matrix $\Sigma$ (estimation of the mean $\mu$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R^k, \quad x_i = \mu + b_i, \quad b_i \sim N_k(b_i|0,\Sigma) \]
\[ f(x_i|\mu,\Sigma) = N_k(x_i|\mu,\Sigma), \quad \mu \in \mathbb R^k, \quad \Sigma \ \text{a p.d. matrix of dimensions } k\times k \]

Likelihood and sufficient statistics:
\[ l(\mu|z) = \prod_{i=1}^{n} N_k(x_i|\mu,\Sigma), \qquad t(z) = \bar x = \frac1n\sum_{i=1}^{n} x_i, \qquad p(\bar x|\mu,\Sigma) = N_k\left(\bar x\,\Big|\,\mu, \frac1n\Sigma\right) \]

Inference with conjugate priors:
\[ p(\mu) = N_k(\mu|\mu_0,\Sigma_0), \qquad f(x) = N_k(x|\mu_0,\Sigma_1), \quad \Sigma_1 = \Sigma_0 + \Sigma \]
\[ p(\mu|z) = N_k(\mu|\mu_n,\Sigma_n), \quad \Sigma_n = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1}, \quad \mu_n = \Sigma_n\left(\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar x\right) \]
\[ f(x|z) = N_k(x|\mu_n,\ \Sigma+\Sigma_n) \]

Inference with reference priors:
\[ \pi(\mu) =\ ??, \qquad \pi(\mu|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Multivariate Normal with known mean $\mu$ (estimation of the precision matrix $\Lambda$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R^k, \quad x_i = \mu + b_i, \quad b_i \sim N_k(b_i|0,\Lambda), \qquad f(x_i|\mu,\Lambda) = N_k(x_i|\mu,\Lambda) \]

Likelihood and sufficient statistics:
\[ l(\Lambda|z) = \prod_{i=1}^{n} N_k(x_i|\mu,\Lambda), \qquad t(z) = S = \sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^t, \qquad p(S|\Lambda) = \mathrm{Wi}_k(S|(n-1)/2,\ \Lambda/2) \]

Inference with conjugate priors:
\[ p(\Lambda) = \mathrm{Wi}_k(\Lambda|\alpha,\beta) \]
\[ f(x) = \mathrm{St}_k\left(x\,\Big|\,\mu_0,\ \frac{n_0}{n_0+1}\left(\alpha-\frac{k-1}{2}\right)\beta^{-1},\ 2\alpha-k+1\right) \]
\[ p(\Lambda|z) = \mathrm{Wi}_k(\Lambda|\alpha_n,\beta_n), \]
\[ \alpha_n = \alpha+\frac n2-\frac{k-1}{2}, \qquad \mu_n = \frac{n_0\mu_0+n\bar x}{n_0+n}, \qquad \beta_n = \beta+\frac12 S+\frac12\frac{n_0 n}{n_0+n}(\mu_0-\bar x)(\mu_0-\bar x)^t \]
\[ f(x|z) = \mathrm{St}_k\left(x\,\Big|\,\mu_n,\ \frac{n+n_0}{n+n_0+1}\alpha_n\beta_n^{-1},\ 2\alpha_n\right) \]

Inference with reference priors:
\[ \pi(\Lambda) =\ ??, \qquad \pi(\Lambda|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Multivariate Normal with known mean $\mu$ (estimation of the covariance matrix $\Sigma$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R^k, \quad x_i = \mu + b_i, \quad b_i \sim N_k(b_i|0,\Sigma), \qquad f(x_i|\mu,\Sigma) = N_k(x_i|\mu,\Sigma) \]

Likelihood and sufficient statistics:
\[ l(\Sigma|z) = \prod_{i=1}^{n} N_k(x_i|\mu,\Sigma), \qquad t(z) = S = \sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^t, \qquad p(S|\Sigma) = \mathrm{Wi}_k(S|(n-1)/2,\ \Sigma/2) \]

Inference with conjugate priors:
\[ p(\Sigma) = \mathrm{IWi}_k(\Sigma|\alpha,\beta) \]
\[ f(x) = \mathrm{St}_k\left(x\,\Big|\,\mu_0,\ \frac{n_0}{n_0+1}\left(\alpha-\frac{k-1}{2}\right)\beta^{-1},\ 2\alpha-k+1\right) \]
\[ p(\Sigma|z) = \mathrm{IWi}_k(\Sigma|\alpha_n,\beta_n), \]
\[ \alpha_n = \alpha+\frac n2-\frac{k-1}{2}, \qquad \mu_n = \frac{n_0\mu_0+n\bar x}{n_0+n}, \qquad \beta_n = \beta+\frac12 S+\frac12\frac{n_0 n}{n_0+n}(\mu_0-\bar x)(\mu_0-\bar x)^t \]
\[ f(x|z) = \mathrm{St}_k\left(x\,\Big|\,\mu_n,\ \frac{n+n_0}{n+n_0+1}\alpha_n\beta_n^{-1},\ 2\alpha_n\right) \]

Inference with reference priors:
\[ \pi(\Sigma) =\ ??, \qquad \pi(\Sigma|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Multivariate Normal with both parameters unknown; estimation of mean and precision matrix $(\mu,\Lambda)$:
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R^k, \quad x_i = \mu + b_i, \quad b_i \sim N_k(b_i|0,\Lambda), \qquad f(x_i|\mu,\Lambda) = N_k(x_i|\mu,\Lambda) \]

Likelihood and sufficient statistics:
\[ l(\mu,\Lambda|z) = \prod_{i=1}^{n} N_k(x_i|\mu,\Lambda), \qquad t(z) = (\bar x, S), \quad \bar x = \frac1n\sum_i x_i, \quad S = \sum_i (x_i-\bar x)(x_i-\bar x)^t \]
\[ p(\bar x|\mu,\Lambda) = N_k(\bar x|\mu, n\Lambda), \qquad p(S|\Lambda) = \mathrm{Wi}_k(S|(n-1)/2,\ \Lambda/2) \]

Inference with conjugate priors:
\[ p(\mu,\Lambda) = \mathrm{NWi}_k(\mu,\Lambda|\mu_0,n_0,\alpha,\beta) = N_k(\mu|\mu_0,n_0\Lambda)\,\mathrm{Wi}_k(\Lambda|\alpha,\beta) \]
\[ p(\mu) = \mathrm{St}_k\left(\mu\,\big|\,\mu_0,\ n_0\alpha\beta^{-1},\ 2\alpha\right)\ ??, \qquad p(\Lambda) = \mathrm{Wi}_k(\Lambda|\alpha,\beta)\ ?? \]
\[ f(x) = \mathrm{St}_k\left(x\,\Big|\,\mu_0,\ \frac{n_0}{n_0+1}\left(\alpha-\frac{k-1}{2}\right)\beta^{-1},\ 2\alpha-k+1\right) \]
\[ p(\mu|z) = \mathrm{St}_k\left(\mu\,\big|\,\mu_n,\ (n+n_0)\,\alpha_n\beta_n^{-1},\ 2\alpha_n\right), \]
\[ \alpha_n = \alpha+\frac n2-\frac{k-1}{2}, \qquad \mu_n = \frac{n_0\mu_0+n\bar x}{n_0+n}, \qquad \beta_n = \beta+\frac12 S+\frac12\frac{n_0 n}{n_0+n}(\mu_0-\bar x)(\mu_0-\bar x)^t \]
\[ p(\Lambda|z) = \mathrm{Wi}_k(\Lambda|\alpha_n,\beta_n), \qquad f(x|z) = \mathrm{St}_k\left(x\,\Big|\,\mu_n,\ \frac{n+n_0}{n+n_0+1}\alpha_n\beta_n^{-1},\ 2\alpha_n\right) \]

Inference with reference priors:
\[ \pi(\mu,\Lambda) =\ ??, \qquad \pi(\mu|z) =\ ??, \qquad \pi(\Lambda|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Multivariate Normal with both parameters unknown; estimation of mean and covariance matrix $(\mu,\Sigma)$:
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R^k, \quad x_i = \mu + b_i, \quad b_i \sim N_k(b_i|0,\Sigma), \qquad f(x_i|\mu,\Sigma) = N_k(x_i|\mu,\Sigma) \]

Likelihood and sufficient statistics:
\[ l(\mu,\Sigma|z) = \prod_{i=1}^{n} N_k(x_i|\mu,\Sigma), \qquad t(z) = (\bar x, S), \quad \bar x = \frac1n\sum_i x_i, \quad S = \sum_i (x_i-\bar x)(x_i-\bar x)^t \]
\[ p(\bar x|\mu,\Sigma) = N_k\left(\bar x\,\Big|\,\mu, \frac1n\Sigma\right), \qquad p(S|\Sigma) = \mathrm{Wi}_k(S|(n-1)/2,\ \Sigma/2) \]

Inference with conjugate priors:
\[ p(\mu,\Sigma) = \mathrm{NIWi}_k(\mu,\Sigma|\mu_0,n_0,\alpha,\beta) = N_k(\mu|\mu_0,n_0\Sigma)\,\mathrm{IWi}_k(\Sigma|\alpha,\beta) \]
\[ p(\mu) = \mathrm{St}_k\left(\mu\,\big|\,\mu_0,\ n_0\alpha\beta^{-1},\ 2\alpha\right)\ ??, \qquad p(\Sigma) = \mathrm{IWi}_k(\Sigma|\alpha,\beta)\ ?? \]
\[ f(x) = \mathrm{St}_k\left(x\,\Big|\,\mu_0,\ \frac{n_0}{n_0+1}\left(\alpha-\frac{k-1}{2}\right)\beta^{-1},\ 2\alpha-k+1\right) \]
\[ p(\mu|z) = \mathrm{St}_k\left(\mu\,\big|\,\mu_n,\ (n+n_0)\,\alpha_n\beta_n^{-1},\ 2\alpha_n\right), \]
\[ \alpha_n = \alpha+\frac n2-\frac{k-1}{2}, \qquad \mu_n = \frac{n_0\mu_0+n\bar x}{n_0+n}, \qquad \beta_n = \beta+\frac12 S+\frac12\frac{n_0 n}{n_0+n}(\mu_0-\bar x)(\mu_0-\bar x)^t \]
\[ p(\Sigma|z) = \mathrm{IWi}_k(\Sigma|\alpha_n,\beta_n), \qquad f(x|z) = \mathrm{St}_k\left(x\,\Big|\,\mu_n,\ \frac{n+n_0}{n+n_0+1}\alpha_n\beta_n^{-1},\ 2\alpha_n\right) \]

Inference with reference priors:
\[ \pi(\mu,\Sigma) =\ ??, \qquad \pi(\mu|z) =\ ??, \qquad \pi(\Sigma|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Linear regression:
z = (y, X),  y = {y_1, ..., y_n} ∈ IR^n,
x_i = {x_i1, ..., x_ik} ∈ IR^k,  X = (x_ij),
θ = {θ_1, ..., θ_k} ∈ IR^k,  y_i = x_i^t θ = θ^t x_i
p(y|X, θ, λ) = N_n(y|Xθ, λI_n),  θ ∈ IR^k,  λ > 0

Likelihood and sufficient statistics:
l(θ|z) = N_n(y|Xθ, λI_n),  t(z) = (X^tX, X^ty)

Inference with conjugate priors:
π(θ, λ) = NGam_k(θ, λ|θ_0, Λ_0, α, β) = N_k(θ|θ_0, λΛ_0) Gam(λ|α, β)
π(θ|λ) = N_k(θ|θ_0, λΛ_0),  E[θ|λ] = θ_0,  Var[θ|λ] = (λΛ_0)^{-1}
π(λ|α, β) = Gam(λ|α, β)
π(θ) = St_k(θ|θ_0, (α/β) Λ_0, 2α),  E[θ] = θ_0,  Var[θ] = (α/(α − 2)) Λ_0^{-1}
p(y_i|x_i) = St(y_i|x_i^t θ_0, (α/β) f(x_i), 2α),  with f(x_i) = 1 − x_i^t(Λ_0 + x_i x_i^t)^{-1} x_i
π(θ, λ|z) = NGam_k(θ, λ|θ_n, Λ_n, α_n, β_n) = N_k(θ|θ_n, λΛ_n) Gam(λ|α_n, β_n)
π(θ|z) = St_k(θ|θ_n, (Λ_0 + X^tX) (α_n/β_n), 2α_n)
  α_n = α + n/2,
  θ_n = (Λ_0 + X^tX)^{-1}(Λ_0 θ_0 + X^ty) = (I − Λ_n)θ_0 + Λ_n θ̂,
  β_n = β + (1/2)(y − Xθ_n)^t y + (1/2)(θ_0 − θ_n)^t Λ_0 θ_0
      = β + (1/2) y^t y + (1/2) θ_0^t Λ_0 θ_0 − (1/2) θ_n^t (Λ_0 + X^tX) θ_n,
  θ̂ = (X^tX)^{-1}X^ty,  Λ_n = (Λ_0 + X^tX)^{-1}X^tX
E[θ|z] = θ_n,  Var[θ|z] = (Λ_0 + X^tX)^{-1}
π(λ|z) = Gam(λ|α_n, β_n)
p(y_i|x_i, z) = St(y_i|x_i^t θ_n, (α_n/β_n) f_n(x_i), 2α_n),
  f_n(x_i) = 1 − x_i^t(X^tX + Λ_0 + x_i x_i^t)^{-1} x_i

Inference with reference priors:
π(θ, λ) ∝ λ^{-(k+1)/2}
π(θ|z) = St_k(θ|θ_n, ((n − k)/(2β_n)) X^tX, n − k)
  θ_n = (X^tX)^{-1}X^ty,
  β_n = (1/2)(y − Xθ_n)^t(y − Xθ_n)
π(λ|z) = Gam(λ|(n − k)/2, β_n)
p(y_i|x_i, z) = St(y_i|x_i^t θ_n, ((n − k)/(2β_n)) f_n(x_i), n − k),
  f_n(x_i) = 1 − x_i^t(X^tX + x_i x_i^t)^{-1} x_i
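As a sanity check of the conjugate update (θ_n, α_n, β_n) above, here is a minimal sketch for the one-dimensional case (k = 1, scalar θ and Λ_0 = λ_0); the data values are illustrative, not from the course.

```python
def ng_regression_posterior(x, y, theta0, lam0, alpha, beta):
    """Conjugate Normal-Gamma update for y_i = theta * x_i + e_i (k = 1):
       theta_n = (lam0 + sum x_i^2)^{-1} (lam0*theta0 + sum x_i y_i)
       alpha_n = alpha + n/2
       beta_n  = beta + (1/2)(y - X theta_n)^t y + (1/2)(theta0 - theta_n) lam0 theta0
    """
    n = len(x)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    theta_n = (lam0 * theta0 + sxy) / (lam0 + sxx)
    alpha_n = alpha + n / 2
    beta_n = (beta
              + 0.5 * sum((yi - theta_n * xi) * yi for xi, yi in zip(x, y))
              + 0.5 * (theta0 - theta_n) * lam0 * theta0)
    return theta_n, alpha_n, beta_n

theta_n, alpha_n, beta_n = ng_regression_posterior(
    [1.0, 2.0], [2.0, 4.0], theta0=0.0, lam0=1.0, alpha=1.0, beta=1.0)
```

With these numbers θ_n = 10/6 ≈ 1.667 (shrunk from the least-squares value 2 toward the prior mean 0), α_n = 2 and β_n = 8/3.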
Inverse problems:
z = Hx + b,  z = {z_1, ..., z_n} ∈ IR^n,  h_i = {h_i1, ..., h_ik} ∈ IR^k,  H = (h_ij),
x = {x_1, ..., x_k} ∈ IR^k
p(z|H, x, λ) = N_n(z|Hx, λI_n),  x ∈ IR^k,  λ > 0

Likelihood and sufficient statistics:
l(x|z) = N_n(z|Hx, λI_n),  t(z) = (H^tH, H^tz)

Inference with conjugate priors:
π(x, λ) = NGam_k(x, λ|x_0, Λ_0, α, β) = N_k(x|x_0, λΛ_0) Gam(λ|α, β)
π(x|λ) = N_k(x|x_0, λΛ_0),  E[x|λ] = x_0,  Var[x|λ] = (λΛ_0)^{-1}
π(λ|α, β) = Gam(λ|α, β)
π(x) = St_k(x|x_0, (α/β) Λ_0, 2α),  E[x] = x_0,  Var[x] = (α/(α − 2)) Λ_0^{-1}
p(z_i|h_i) = St(z_i|h_i^t x_0, (α/β) f(h_i), 2α),
  f(h_i) = 1 − h_i^t(Λ_0 + h_i h_i^t)^{-1} h_i
π(x, λ|z) = NGam_k(x, λ|x_n, Λ_n, α_n, β_n) = N_k(x|x_n, λΛ_n) Gam(λ|α_n, β_n)
f(x|z) = St_k(x|x_n, (Λ_0 + H^tH) (α_n/β_n), 2α_n)
  α_n = α + n/2,
  x_n = (Λ_0 + H^tH)^{-1}(Λ_0 x_0 + H^tz) = (I − Λ_n)x_0 + Λ_n x̂,
  β_n = β + (1/2)(z − Hx_n)^t z + (1/2)(x_0 − x_n)^t Λ_0 x_0
      = β + (1/2) z^t z + (1/2) x_0^t Λ_0 x_0 − (1/2) x_n^t (Λ_0 + H^tH) x_n,
  x̂ = (H^tH)^{-1}H^tz,  Λ_n = (Λ_0 + H^tH)^{-1}H^tH
E[x|z] = x_n,  Var[x|z] = (Λ_0 + H^tH)^{-1}
π(λ|z) = Gam(λ|α_n, β_n)
p(z_i|h_i, z) = St(z_i|h_i^t x_n, (α_n/β_n) f_n(h_i), 2α_n),
  f_n(h_i) = 1 − h_i^t(H^tH + Λ_0 + h_i h_i^t)^{-1} h_i

Inference with reference priors:
π(x, λ) ∝ λ^{-(k+1)/2}
π(x|z) = St_k(x|x_n, ((n − k)/(2β_n)) H^tH, n − k)
  x_n = (H^tH)^{-1}H^tz,
  β_n = (1/2)(z − Hx_n)^t(z − Hx_n)
π(λ|z) = Gam(λ|(n − k)/2, β_n)
p(z_i|h_i, z) = St(z_i|h_i^t x_n, ((n − k)/(2β_n)) f_n(h_i), n − k),
  f_n(h_i) = 1 − h_i^t(H^tH + h_i h_i^t)^{-1} h_i
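The posterior mean x_n = (Λ_0 + H^tH)^{-1}(Λ_0 x_0 + H^tz) is the familiar regularized least-squares solution. A minimal numerical sketch (the matrix H, data z and prior values are illustrative, not from the course):

```python
import numpy as np

def posterior_mean(H, z, x0, Lam0):
    """x_n = (Lam0 + H^t H)^{-1} (Lam0 x0 + H^t z), solved without explicit inverse."""
    A = Lam0 + H.T @ H
    return np.linalg.solve(A, Lam0 @ x0 + H.T @ z)

H = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
x_true = np.array([1.0, -1.0])
z = H @ x_true                       # noiseless data for the check
Lam0 = 1e-8 * np.eye(2)              # nearly flat prior -> ordinary least squares
x_n = posterior_mean(H, z, np.zeros(2), Lam0)
```

With a nearly flat prior the estimate recovers x_true; increasing Λ_0 pulls x_n toward x_0, which is the Bayesian reading of Tikhonov regularization.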
A.2 Summary of probability distributions
Probability laws for discrete random variables

Bernoulli  Ber(x|θ) = θ^x (1 − θ)^{1−x},
  0 < θ < 1,  x ∈ {0, 1}

Binomial  Bin(x|θ, n) = c θ^x (1 − θ)^{n−x},  c = C(n, x),
  0 < θ < 1,  n = 1, 2, ...,  x = 0, 1, ..., n

Hypergeometric  HypGeo(x|N, M, n) = c C(N, x) C(M, n − x),  c = C(N + M, n)^{-1},
  N, M = 1, 2, ...,  n = 1, ..., N + M,
  x = a, a + 1, ..., b,  with a = max{0, n − M}, b = min{N, n}

Negative Binomial  NegBin(x|θ, r) = c C(r + x − 1, r − 1) (1 − θ)^x,  c = θ^r,
  0 < θ < 1,  r = 1, 2, ...,  x = 0, 1, 2, ...

Poisson  Pn(x|λ) = c λ^x / x!,  c = exp[−λ],
  λ > 0,  x = 0, 1, 2, ...

Binomial-Beta  BinBet(x|α, β, n) = c C(n, x) Γ(α + x) Γ(β + n − x),
  c = Γ(α + β) / (Γ(α) Γ(β) Γ(α + β + n)),
  α, β > 0,  n = 1, 2, ...,  x = 0, ..., n
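The normalization constants in this table are easy to verify numerically. For instance the Binomial-Beta pmf, with c exactly as given above, should sum to one over its support (parameter values below are illustrative):

```python
import math

def binbet_pmf(x, a, b, n):
    """Binomial-Beta pmf as defined in the summary:
       c * C(n, x) * Gamma(a + x) * Gamma(b + n - x),
       c = Gamma(a+b) / (Gamma(a) Gamma(b) Gamma(a+b+n))."""
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b) * math.gamma(a + b + n))
    return c * math.comb(n, x) * math.gamma(a + x) * math.gamma(b + n - x)

total = sum(binbet_pmf(x, 2.0, 3.0, 10) for x in range(11))
```

The sum equals 1 up to floating-point error, confirming the stated constant c.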
Probability laws for discrete random variables (cont.)

Negative Binomial-Beta  NegBinBet(x|α, β, r) = c C(r + x − 1, r − 1) Γ(β + x) / Γ(α + β + r + x),
  c = Γ(α + β) Γ(α + r) / (Γ(α) Γ(β)),
  α, β > 0,  r = 1, 2, ...,  x = 0, 1, 2, ...

Poisson-Gamma  PnGam(x|α, β, n) = c (Γ(α + x) / x!) n^x / (β + n)^{α+x},  c = β^α / Γ(α),
  α, β > 0,  n = 0, 1, 2, ...,  x = 0, 1, 2, ...

Composite Poisson  Pnc(x|λ, µ) = exp[−λ] ∑_{n=0}^∞ ((nµ)^x exp[−nµ] / x!) (λ^n / n!),
  λ, µ > 0,  x = 0, 1, 2, ...

Geometric  Geo(x|θ) = c (1 − θ)^{x−1},  c = θ,
  0 < θ < 1,  x = 1, 2, ...

Pascal  Pas(x|m, θ) = C(x − 1, m − 1) θ^m (1 − θ)^{x−m},
  m > 0,  0 < θ < 1,  x = m, m + 1, ...
Probability laws for real random variables

Beta  Bet(x|α, β) = c x^{α−1}(1 − x)^{β−1},  c = Γ(α + β)/(Γ(α)Γ(β)),
  α, β > 0,  0 < x < 1

Gamma  Gam(x|α, β) = c x^{α−1} exp[−βx],  c = β^α/Γ(α),
  α, β > 0,  x > 0

Inverse Gamma  IGam(x|α, β) = c x^{−(α+1)} exp[−β/x],  c = β^α/Γ(α),
  α, β > 0,  x > 0

Gamma-Gamma  GamGam(x|α, β, n) = c x^{n−1}/(β + x)^{α+n},  c = (β^α/Γ(α)) (Γ(α + n)/Γ(n)),
  α, β > 0,  n = 1, 2, ...,  x > 0

Pareto  Par(x|α, β) = c x^{−(α+1)},  c = αβ^α,
  α, β > 0,  x ≥ β

Normal  N(x|µ, λ) = c exp[−(λ/2)(x − µ)²],  c = (λ/(2π))^{1/2},
  µ ∈ IR,  λ > 0,  x ∈ IR

Normal  N(x|µ, σ) = c exp[−(x − µ)²/(2σ²)],  c = 1/√(2πσ²),
  µ ∈ IR,  σ > 0,  x ∈ IR
Probability laws for real random variables (cont.)

Logistic  Lo(x|α, β) = c exp[−β^{-1}(x − α)] / (1 + exp[−β^{-1}(x − α)])²,  c = β^{-1},
  α ∈ IR,  β > 0,  x ∈ IR

Student (t)  St(x|µ, λ, α) = c [1 + (λ/α)(x − µ)²]^{−(α+1)/2},
  c = (Γ((α + 1)/2)/(Γ(α/2)Γ(1/2))) (λ/α)^{1/2},
  µ ∈ IR,  λ, α > 0,  x ∈ IR

Fisher-Snedecor  FS(x|α, β) = c x^{α/2−1}/(β + αx)^{(α+β)/2},
  c = (Γ((α + β)/2)/(Γ(α/2)Γ(β/2))) α^{α/2} β^{β/2},
  α, β > 0,  x > 0

Uniform  Uni(x|θ_1, θ_2) = c,  c = 1/(θ_2 − θ_1),
  θ_2 > θ_1,  θ_1 < x < θ_2

Exponential  Ex(x|λ) = c exp[−λx],  c = λ,
  λ > 0,  x > 0

Inverse Gamma-1/2  IGam_{1/2}(x|α, β) = c x^{−(2α+1)} exp[−β/x²],  c = 2β^α/Γ(α),
  α, β > 0,  x > 0

Inverse Pareto  IPar(x|α, β) = c x^{α−1},  c = αβ^α,
  α, β > 0,  0 < x < β^{-1}
Probability laws for real random variables (cont.)

Cauchy  Cau(x|λ) = (1/(πλ)) / (1 + (x/λ)²),
  λ > 0,  x ∈ IR

Rayleigh  Ray(x|θ) = c x exp[−x²/θ],  c = 2/θ,
  θ > 0,  x > 0

Log-Normal  LogN(x|µ, Λ) = c exp[−(ln x − µ)²/(2Λ²)],  c = 1/(Λ√(2π) x),
  µ ∈ IR,  Λ > 0,  x > 0

Generalized Normal  Ngen(x|α, β) = c x^{α−1} exp[−βx²],  c = 2β^{α/2}/Γ(α/2),
  α, β > 0,  x > 0

Weibull  Wei(x|α, β) = c x^{α−1} exp[−x^α/β],  c = α/β,
  α, β > 0,  x > 0

Double Exponential  Exd(x|λ) = c exp[−λ|x|],  c = λ/2,
  λ > 0,  x ∈ IR

Truncated Exponential  Ext(x|λ) = exp[−(x − λ)],
  λ > 0,  x > λ
Probability laws for real random variables (cont.)
Khi Chi (x|n) = c xn/2−1 exp
[−x
2
],
c =12
n/2
Γ(n/2),
n > 0, x > 0
Khi-squared Chi2 (x|n) = c xn/2−1 exp
[−x
2
],
c =12
n/2
Γ(n/2),
n > 0, x > 0
Non centered Khi-squared Chi2 (x|ν, λ) =∞∑
i=0
Pn
(x|λ
2
)χ2 (x|ν + 2i) ,
ν, λ > 0,x > 0
Inverse Khi IChi (x|ν) = c x−(ν/2+1) exp
[− 1
2x2
],
c =12
ν/2
Γ(ν/2,
ν > 0, x > 0
Probability laws for real random variables (cont.)
Generalized Exponentialwith one parameters
Exf (x|a, g, φ, h, θ) = a(x) g(θ) exp [h(θ)φ(x)] ,
Generalized Exponentialwith K parameters
Exfk (x|a, g,φ,h,θ) = a(x) g(θ) exp
[K∑
k=1
hk(θ)φk(x)
],
Probability laws for two real random variables

Normal-Gamma  NGam(x, y|µ, λ, α, β) = N(x|µ, λy) Gam(y|α, β),
  µ ∈ IR,  λ, α, β > 0,  x ∈ IR,  y > 0

Bi-variate Pareto  Par2(x, y|α, β_0, β_1) = c (y − x)^{−(α+2)},  c = α(α + 1)(β_1 − β_0)^α,
  (β_0, β_1) ∈ IR²,  β_0 < β_1,  α > 0,
  (x, y) ∈ IR²,  x < β_0,  y > β_1
Probability laws with n discrete variables

Multinomial  Mu_k(x|θ, n) = (n!/∏_{l=1}^{k+1} x_l!) ∏_{l=1}^{k+1} θ_l^{x_l},
  x_{k+1} = n − ∑_{l=1}^k x_l,  θ_{k+1} = 1 − ∑_{l=1}^k θ_l,
  0 < θ_l < 1,  ∑ θ_l < 1,  n = 1, 2, ...,  x_l = 0, 1, 2, ...,  ∑ x_l ≤ n

Dirichlet  Di_k(x|α) = c ∏_{l=1}^{k+1} x_l^{α_l−1},  c = Γ(∑_{l=1}^{k+1} α_l)/∏_{l=1}^{k+1} Γ(α_l),
  α_l > 0,  l = 1, ..., k + 1,
  0 < x_l < 1,  l = 1, ..., k + 1,  x_{k+1} = 1 − ∑_{l=1}^k x_l

Multinomial-Dirichlet  MuDi_k(x|α, n) = c ∏_{l=1}^{k+1} α_l^{[x_l]}/x_l!,
  c = n!/(∑_{l=1}^{k+1} α_l)^{[n]},  α^{[s]} = ∏_{l=1}^s (α + l − 1),  x_{k+1} = n − ∑_{l=1}^k x_l,
  α_l > 0,  n = 1, 2, ...,  x_l = 0, 1, 2, ...,  ∑_{l=1}^k x_l ≤ n
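The Dirichlet normalization constant can be checked in the simplest case k = 1, where Di_1(x|α_1, α_2) reduces to the Beta density c x^{α_1−1}(1 − x)^{α_2−1}. A quick quadrature sketch (the parameter values are illustrative):

```python
import math

a1, a2 = 2.5, 4.0
# Di_1 constant: Gamma(a1 + a2) / (Gamma(a1) * Gamma(a2)), i.e. 1/B(a1, a2)
c = math.gamma(a1 + a2) / (math.gamma(a1) * math.gamma(a2))

# midpoint-rule integral of the unnormalized density over (0, 1)
m = 200_000
integral = sum(((i + 0.5) / m) ** (a1 - 1) * (1.0 - (i + 0.5) / m) ** (a2 - 1)
               for i in range(m)) / m
```

The product c * integral is 1 up to quadrature error, confirming that c is the reciprocal of the Beta function B(α_1, α_2).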
Probability laws for n real variables

Canonical Exponential  Exf_n(x|b, φ, h, θ)
Generalized Exponential  Exf_n(x|a, g, φ, h, θ)

Normal  N_k(x|µ, Λ) = c exp[−(1/2)(x − µ)^t Λ (x − µ)],  c = |Λ|^{1/2} (2π)^{−k/2},
  µ ∈ IR^k,  Λ > 0,  x ∈ IR^k

Normal  N_k(x|µ, Σ) = c exp[−(1/2)(x − µ)^t Σ^{-1} (x − µ)],  c = |Σ|^{−1/2} (2π)^{−k/2},
  µ ∈ IR^k,  Σ > 0,  x ∈ IR^k

Student  St_k(x|µ, Λ, α) = c [1 + (1/α)(x − µ)^t Λ (x − µ)]^{−(α+k)/2},
  c = (Γ((α + k)/2)/(Γ(α/2)(απ)^{k/2})) |Λ|^{1/2},
  µ ∈ IR^k,  Λ > 0,  α > 0,  x ∈ IR^k

Wishart  Wi_k(X|α, Λ) = c |X|^{α−(k+1)/2} exp[−tr(ΛX)],  c = |Λ|^α/Γ_k(α),
  Λ a matrix of dimensions k × k,
  X a symmetric p.d. matrix of dimensions k × k,  X_{i,j} = X_{j,i},  i, j = 1, ..., k,
  2α > k − 1

Probability laws for n + 1 real variables

Normal-Gamma  NGam_k(x, y|µ, Λ, α, β) = N_k(x|µ, yΛ) Gam(y|α, β),
  µ ∈ IR^k,  Λ > 0,  α, β > 0,  x ∈ IR^k,  y > 0

Normal-Wishart  NWi_k(x, Y|µ, λ, α, B) = N_k(x|µ, λY) Wi_k(Y|α, B),
  µ ∈ IR^k,  λ > 0,  2α > k − 1,  B > 0,
  x ∈ IR^k,  Y_{i,j} = Y_{j,i},  i, j = 1, ..., k
Link between different distributions

Bin(x|θ, 1) = Ber(x|θ)
NegBin(x|θ, 1) = Geo(x|θ) = Pas(x|1, θ)
BinBet(x|1, 1, n) = Unid(x|n) = 1/(n + 1),  x = 0, 1, ..., n
Bet(x|1, 1) = Uni(x|0, 1)
Gam(x|1, β) = Ex(x|β)
Gam(x|α, 1) = Erl(x|α)
Gam(x|ν/2, 1/2) = Chi2(x|ν)
IGam(x|ν/2, 1/2) = IChi2(x|ν)
St(x|µ, λ, 1) = Cau(x|µ, λ)
Mu_1(x|θ, n) = Bin(x|θ, n)
Di_1(x|α_1, α_2) = Bet(x|α_1, α_2)
Wi_1(x|α, β) = Gam(x|α, β)
St_1(x|µ, λ, α) = St(x|µ, λ, α)
N_1(x|µ, λ) = N(x|µ, λ)
BinBet(x|α, β, n) = ∫_0^1 Bin(x|θ, n) Bet(θ|α, β) dθ
NegBinBet(x|α, β, r) = ∫_0^1 NegBin(x|θ, r) Bet(θ|α, β) dθ
PnGam(x|α, β, n) = ∫_0^∞ Pn(x|nλ) Gam(λ|α, β) dλ
Par(x|α, β) = ∫_0^∞ Ex(x − β|λ) Gam(λ|α, β) dλ
St(x|µ, λ, α) = ∫_0^∞ N(x|µ, λy) Gam(y|α/2, α/2) dy
Chi2nc(x|ν, λ) = ∑_{i=0}^∞ Pn(i|λ/2) Chi2(x|ν + 2i)
lim_{α→∞} St(x|µ, λ, α) = N(x|µ, λ)
St(x|0, 1, α) = St(x|α), the standard Student distribution
If x ∼ Gam(x|α, β) then y = 1/x ∼ IGam(y|α, β).
If x_i ∼ Gam(x_i|α_i, β) then y = ∑_{i=1}^n x_i ∼ Gam(y|∑_{i=1}^n α_i, β).
If x ∼ N(x|0, 1) and y ∼ Chi2(y|ν) then z = x/√(y/ν) ∼ St(z|0, 1, ν).
If x ∼ Chi2(x|ν_1) and y ∼ Chi2(y|ν_2) then z = (x/ν_1)/(y/ν_2) ∼ FS(z|ν_1, ν_2).
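These identities can be verified pointwise from the densities in this summary. For instance Gam(x|ν/2, 1/2) = Chi2(x|ν), using both formulas exactly as tabulated (evaluation points chosen arbitrarily):

```python
import math

def gam_pdf(x, a, b):
    # Gam(x|alpha, beta) = beta^alpha / Gamma(alpha) * x^(alpha-1) * exp(-beta x)
    return b ** a / math.gamma(a) * x ** (a - 1) * math.exp(-b * x)

def chi2_pdf(x, nu):
    # Chi2(x|nu) = (1/2)^(nu/2) / Gamma(nu/2) * x^(nu/2-1) * exp(-x/2)
    return 0.5 ** (nu / 2) / math.gamma(nu / 2) * x ** (nu / 2 - 1) * math.exp(-x / 2)

max_gap = max(abs(gam_pdf(x, 2.5, 0.5) - chi2_pdf(x, 5))
              for x in (0.5, 1.0, 3.0, 7.5))
```

The two densities coincide, as the identity requires.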
Family of invariant location-scale distributions

p(x|µ, β) = (1/β) f(t) with t = (x − µ)/β;  Var[x] = β² v;  −∫ p(x) ln p(x) dx = log β + h.

Family        f(t)                               v        h
Normal        (2π)^{−1/2} exp[−t²/2]             1        (1/2) log(2πe)
Gumbel        exp[−t] exp[−exp[−t]]              π²/6     1 + γ
Laplace       (1/2) exp[−|t|]                    2        1 + log 2
Logistic      exp[−t]/(1 + exp[−t])²             π²/3     2
Exponential   exp[−t],  x > 0                    1        1
Uniform       1,  µ − β/2 < x < µ + β/2          1/12     0
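The v and h columns can be checked numerically. The sketch below verifies the Laplace row (f(t) = e^{−|t|}/2, v = 2, h = 1 + log 2) with a midpoint-rule quadrature; the truncation interval and grid size are illustrative choices.

```python
import math

def f(t):
    # standard Laplace density
    return 0.5 * math.exp(-abs(t))

m, L = 200_000, 40.0                 # midpoint rule on [-L, L]
dt = 2 * L / m
ts = [-L + (i + 0.5) * dt for i in range(m)]
v = sum(t * t * f(t) for t in ts) * dt            # variance (mean is 0)
h = -sum(f(t) * math.log(f(t)) for t in ts) * dt  # differential entropy
```

The quadrature reproduces v = 2 and h = 1 + log 2 ≈ 1.6931 to high accuracy.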
Family of invariant shape-scale distributions

p(x|α, β) = (1/β) f(t; α) with t = x/β;  Var[x] = β² v(α);  −∫ p(x) ln p(x) dx = log β + h(α).

Generalized Gaussian (α = 2: Rayleigh; α = 3: Maxwell-Boltzmann; α = ν: khi)
  f(t) = (2/Γ(α/2)) t^{α−1} exp[−t²]
  v = α/2 − Γ²((α + 1)/2)/Γ²(α/2)
  h = log[Γ(α/2)/2] + ((1 − α)/2) ψ(α/2) + α/2

Inverse Generalized Gaussian (α = ν, β = √2: inverse khi)
  f(t) = (2/Γ(α/2)) t^{−α−1} exp[−t^{−2}]
  v = 2/(α − 2) − Γ²((α − 1)/2)/Γ²(α/2)
  h = log[Γ(α/2)/2] − ((1 + α)/2) ψ(α/2) + α/2

Gamma (α = 1: Exponential; α = ν/2, β = 2: khi-squared; α = ν: Erlang)
  f(t) = (1/Γ(α)) t^{α−1} exp[−t]
  v = α
  h = log[Γ(α)] + (1 − α) ψ(α) + α

Inverse Gamma (α = ν/2, β = 1/2: khi-squared inverse)
  f(t) = (1/Γ(α)) t^{−α−1} exp[−t^{−1}]
  v = 1/((α − 1)²(α − 2)),  α > 2
  h = log[Γ(α)] − (1 + α) ψ(α) + α

Pareto
  f(t) = α t^{−α−1},  t > 1
  v = α/((α − 1)²(α − 2)),  α > 2
  h = 1/α − log α + 1

Weibull
  f(t) = α t^{α−1} exp[−t^α]
  v = Γ(1 + 2/α) − Γ²(1 + 1/α)
  h = γ (α − 1)/α − log α + 1
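As with the location-scale table, these entries can be checked by quadrature. The sketch below verifies the Gamma row for α = 3: v = α and h = log Γ(α) + (1 − α)ψ(α) + α, with the digamma ψ approximated by a central difference of lgamma (grid and interval sizes are illustrative).

```python
import math

alpha = 3.0

def f(t):
    # Gamma(alpha) density with unit scale
    return t ** (alpha - 1) * math.exp(-t) / math.gamma(alpha)

m, L = 200_000, 50.0                 # midpoint rule on (0, L]
dt = L / m
ts = [(i + 0.5) * dt for i in range(m)]
mean = sum(t * f(t) for t in ts) * dt
var = sum((t - mean) ** 2 * f(t) for t in ts) * dt
ent = -sum(f(t) * math.log(f(t)) for t in ts) * dt

eps = 1e-5                            # central-difference digamma
psi = (math.lgamma(alpha + eps) - math.lgamma(alpha - eps)) / (2 * eps)
h_formula = math.lgamma(alpha) + (1 - alpha) * psi + alpha
```

The numerical variance matches v = 3 and the numerical entropy matches the tabulated closed form.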
Appendix A
Exercises
A.1 Exercise 1: Signal detection and parameter estimation
Assume z_i = s_i + e_i, where s_i = a cos(ωt_i + φ) is the transmitted signal, e_i is the noise and z_i is the received signal. Assume that we have received n independent samples z_i, i = 1, ..., n. Assume also that ∑_{i=1}^n z_i = 0.
1. Assume that e_i ∼ N(0, θ) and that s_i is perfectly known. Design a Bayesian optimal detector with the uniform prior and the uniform cost coefficients.
2. Repeat (1) with the assumption that e_i ∼ (1/(2θ)) exp[−|e_i|/θ], θ > 0.
3. Repeat (1) but assume now that θ is unknown and distributed as θ ∼ π(θ) ∝ θ^α.
4. Assume that ei ∼ N (0, θ), but now a is not known.
• Give the expression of the ML estimate aML(z) of a.
• Give the expressions of the MAP estimate aMAP(z) and the Bayes optimal estimate aB(z) if we assume that a ∼ N(0, 1).
• Give the expressions of the MAP estimate aMAP(z) and the Bayes optimal estimate aB(z) if we assume that a ∼ (1/(2α)) exp[−|a|/α], α > 0.
5. Repeat (4) but assume now that θ is unknown and distributed as θ ∼ π(θ) ∝ θ^α.
6. Assume that ei ∼ N (0, θ) with known θ and that a is known, but now ω is unknown.
• Give the expression of the ML estimate ωML(z) of ω.
• Give the expressions of the MAP estimate ωMAP(z) and the Bayes optimal estimate ωB(z), if we assume that ω ∼ Uni(0, 1).
7. Assume that e_i ∼ N(0, θ) with known θ and that a and ω are known, but now φ is unknown.
• Give the expression of the ML estimate φML(z) of φ.
• Give the expressions of the MAP estimate φMAP(z) and the Bayes optimal estimate φB(z), if we assume that φ ∼ Uni(0, 2π).
8. Assume that ei ∼ N (0, θ) with known θ, but now both a and ω are unknown.
• Give the expressions of the ML estimates aML(z) of a and ωML(z) of ω.
• Give the expressions of the Bayes optimal estimates aB(z) of a and ωB(z) of ω if we assume that a ∼ N(0, 1) and ω ∼ Uni(0, 1).
• Give the expressions of the MAP estimates aMAP (z) of a and ωMAP (z) of ω ifwe assume that a ∼ N (0, 1) and ω ∼ Uni(0, 1).
9. Assume now that ω is known and we know that a can only take the values {−1, 0, +1}. Design an optimal detector for a.
10. Assume now that

s_i = ∑_{k=1}^K a_k cos(ω_k t + φ_k),  ω_k ≠ ω_l, ∀ k ≠ l
• Give the expressions of the ML estimates ak(z) assuming ωk and φk known.
• Give the expressions of the ML estimates ωk(z) assuming ak and φk known.
• Give the expressions of the ML estimates φk(z) assuming ak and ωk known.
• Give the expressions of the joint ML estimates ak(z) and ωk(z) assuming φk
unknown.
• Discuss the possibility of the joint estimation of ak(z), ωk(z) and φk(z).
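As a sketch of what item 4 is after (not an official solution): with e_i ∼ N(0, θ) the log-likelihood is quadratic in a, so a_ML = ∑ z_i c_i / ∑ c_i² with c_i = cos(ωt_i + φ), and the Gaussian prior a ∼ N(0, 1) only adds θ to the denominator, a_MAP = ∑ z_i c_i / (θ + ∑ c_i²). The values below are synthetic, noiseless data used purely for illustration.

```python
import math

def amplitude_estimates(z, t, w, phi, theta):
    """ML and MAP (a ~ N(0, 1)) amplitude estimates for z_i = a*cos(w*t_i + phi) + e_i."""
    c = [math.cos(w * ti + phi) for ti in t]
    num = sum(zi * ci for zi, ci in zip(z, c))
    den = sum(ci * ci for ci in c)
    return num / den, num / (theta + den)   # (a_ML, a_MAP)

t = [i * 0.1 for i in range(50)]
z = [2.0 * math.cos(3.0 * ti + 0.5) for ti in t]   # true amplitude a = 2, no noise
a_ml, a_map = amplitude_estimates(z, t, 3.0, 0.5, theta=1.0)
```

On noiseless data a_ML recovers a = 2 exactly, while a_MAP is shrunk toward the prior mean 0, which illustrates the role of the prior in item 4.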
A.2 Exercise 2: Discrete deconvolution
Assume zi = si + ei where
s_i = ∑_{k=0}^p h_k x(i − k)
and where
• h = [h0, h1, . . . , hp]t represents the finite impulse response of a channel,
• x = [x(0), . . . , x(n)]t the finite input sequence,
• z = [z(p), . . . , z(p + n)]t the received signal, and
• e = [e(p), . . . , e(p + n)]t the channel noise.
1. Construct the matrices H and X in such a way that z = Hx + e and z = Xh + e.
2. Assume that e_i ∼ N(0, θ) and that we know perfectly h and the input sequence x. Design a Bayesian optimal detector with the uniform prior and the uniform cost coefficients.
3. Repeat (2) with the assumption that e_i ∼ (1/(2θ)) exp[−|e_i|/θ], θ > 0.
4. Repeat (2) but assume now that θ is unknown and distributed as θ ∼ π(θ) ∝ θ^α.
5. Assume that ei ∼ N (0, θ), but now x is unknown.
• Give the expression of the ML estimate xML(z) of x.
• Give the expressions of the MAP estimate xMAP(z) and the Bayes optimal estimate xB(z) if we assume that x ∼ N(0, I).
6. Repeat (5) but assume now that θ is unknown and distributed as θ ∼ π(θ) ∝ θ^α.
7. Assume that ei ∼ N (0, θ) with known θ, and that x is known but h is unknown.
• Give the expression of the ML estimate hML(z) of h.
• Give the expressions of the MAP estimate hMAP(z) and the Bayes optimal estimate hB(z) if we assume that h ∼ N(0, I).
8. Repeat (7) but assume now that θ is unknown and distributed as θ ∼ π(θ) ∝ θ^α.
9. Assume that ei ∼ N (0, θ) with known θ, but now both h and x are unknown.
• Give the expressions of the ML estimates hML(z) of h and xML(z) of x.
• Give the expressions of the MAP and the Bayes optimal estimates xMAP(z) and xB(z) of x and hMAP(z) and hB(z) of h if we assume that x ∼ N(0, I) and h ∼ N(0, I).

• Give the expressions of the MAP and the Bayes optimal estimates xMAP(z) and xB(z) of x and hMAP(z) and hB(z) of h if we assume that x ∼ N(0, σ_x² D_x^t D_x) and h ∼ N(0, σ_h² D_h^t D_h).
10. Assume now that h is known and we know that x(k) can only take the values {0, 1}.
• First assume that the x(k) are independent, with π0 = P(x(k) = 0) and π1 = P(x(k) = 1) = 1 − π0, π0 known. Design an optimal detector for x(k).
• Now assume x(k) can be modelled as a first-order Markov chain and that we know the transition probabilities

π00 = P(x(k + 1) = 0 | x(k) = 0),  π01 = P(x(k + 1) = 1 | x(k) = 0) = 1 − π00,
π11 = P(x(k + 1) = 1 | x(k) = 1),  π10 = P(x(k + 1) = 0 | x(k) = 1) = 1 − π11.
Design an optimal detector for x(k).
• Repeat these two last items, assuming now that h is unknown and h ∼ N(0, σ_h² D_h^t D_h).
List of Figures
2.1 Location testing with Gaussian errors, uniform costs and equal priors. . . . 19
2.2 Illustration of minimax rule. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Partition of the observation space and deterministic or probabilistic decisionrules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Two decision rules δ(1) and δ(2) and their respective risk functions. . . . . 34
3.3 Two decision rules δ(1) and δ(2) and their respective power functions. . . . 35
4.1 General structure of a Bayesian optimal detector. . . . . . . . . . . . . . . . 48
4.2 Simplified structure of a Bayesian optimal detector. . . . . . . . . . . . . . . 48
4.3 General Bayesian detector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Bayesian detector in the case of i.i.d. data. . . . . . . . . . . . . . . . . . . 49
4.5 General Bayesian detector in the case of i.i.d. Gaussian data. . . . . . . . . 50
4.6 Simplified Bayesian detector in the case of i.i.d. Gaussian data. . . . . . . . 50
4.7 Bayesian detector in the case of i.i.d. Laplacian data. . . . . . . . . . . . . . 51
4.8 A binary channel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.1 The structure of the optimal detector for an i.i.d. noise model. . . . . . . . 59
5.2 The structure of the optimal detector for an i.i.d. Gaussian noise model. . . 60
5.3 The simplified structure of the optimal detector for an i.i.d. Gaussian noisemodel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4 Bayesian detector in the case of i.i.d. Laplacian data. . . . . . . . . . . . . . 60
5.5 The structure of a locally optimal detector for an i.i.d. noise model. . . . . 61
5.6 The structure of a locally optimal detector for an i.i.d. Gaussian noise model. 62
5.7 The structure of a locally optimal detector for an i.i.d. Laplacian noise model. 62
5.8 The structure of a locally optimal detector for an i.i.d. Cauchy noise model. 62
5.9 Coherent detector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.10 Sequential detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.11 Stopping rule in sequential detection. . . . . . . . . . . . . . . . . . . . . . . 71
5.12 Stopping rule in SPART (a, b). . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.1 Curve fitting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Line fitting: model 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.3 Line fitting: model 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4 Line fitting: model 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5 Line fitting: models 1 and 2. . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Bibliography
Index
Admissible decision rules, 29
AR, 116
ARMA, 116
Bayesian binary hypothesis testing, 16
Bayesian composite hypothesis testing, 55
Bayesian estimation, 117
Bayesian MSE estimate, 104
Bayesian parameter estimation, 75
Bernoulli, 146
Binary channel transmission, 52
Binary composite hypothesis testing, 56
Binary hypothesis testing, 15
Binomial, 147
Causal Wiener-Kolmogorov, 135
Conjugate distributions, 122
Conjugate priors, 120, 122, 123, 145
Curve fitting, 88
Deconvolution, 112, 135
Exponential family, 121, 145
Exponential model, 150
Factorization theorem, 120
False alarm, 28
Fast Kalman filter, 110
Fisher information, 126
Gaussian vector, 86
Group invariance, 117
Inference, 145
Information inequality, 126
Invariance principles, 117
Invariant Bayesian estimate, 119
Inverse problems, 166
Joint probability distribution, 12
Jointly Gaussian, 86
Kalman filter, 112
Kalman filtering, 103
Laplacian noise, 50, 60
Linear estimation, 129
Linear Mean Square (LMS), 104
Linear models, 87
Linear regression, 165
Locally most powerful (LMP) test, 57
Locally optimal detectors, 61
Location estimation, 18
MA, 116
Maximum A Posteriori (MAP), 104
Maximum A Posteriori (MAP) estimation, 77, 85
Memoryless stochastic process, 13
Minimal sufficiency, 120
Minimax binary hypothesis testing, 22
Minimum-Mean-Absolute-Error, 76, 84
Minimum-Mean-Squared-Error, 76, 84
Miss, 28
Multi-variable Normal, 159–164
Multinomial, 158
Natural exponential family, 123
Negative Binomial, 149
Neyman-Pearson hypothesis testing, 23
Noise filtering, 134
Non-causal Wiener-Kolmogorov filtering, 133
Non-informative priors, 126, 127
Non-parametric description of a stochastic process, 13
Normal, 152–157
One-step prediction, 131
Optimal detectors, 55
Optimization, 44
Orthogonality principle, 132
Parametrically well-known stochastic process, 13
Performance analysis, 66
Poisson, 148
Prediction, 131
Prediction-Correction, 106
Prior law, 117
Probabilistic description, 9
Probability density, 12
Probability distribution, 9, 12
Probability spaces, 11
Radar, 108
Random variable, 9, 12
Random vector, 12
Rational spectra, 139
Robust detection, 74
Sequential detection, 68, 70
Signal detection, 55
Signal estimation, 101
SPART, 71
Stationary stochastic process, 13
Statistical inference, 9
Stochastic process, 9
Stopping rule, 29, 70
Sufficient statistics, 120
Time-invariant, 107
Track-While-Scan (TWS) Radar, 108
Uniform, 151
Uniform most powerful (UMP) test, 57
Wald-Wolfowitz theorem, 71
Wide-Sense Markov sequences, 141
Wiener-Kolmogorov filtering, 133
Wald-Wolfowitz theorem, 71Wide-Sense Markov sequences, 141