Detection - Estimation
Ali Mohammad-Djafari
A Graduated Course
Department of Electrical Engineering
University of Notre Dame
Notre Dame, IN 46555, USA
Draft: July 12, 2005
File: nd1
Contents
1 Introduction 9
1.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Summary of notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Basic concepts of binary hypothesis testing 15
2.1 Binary hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Bayesian binary hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Minimax binary hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Neyman-Pearson hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . 23
3 Basic concepts of general hypothesis testing 25
3.1 A general M -ary hypothesis testing problem . . . . . . . . . . . . . . . . . . 25
3.1.1 Deterministic or probabilistic decision rules . . . . . . . . . . . . . . 26
3.1.2 Conditional, A priori and Joint Probability Distributions . . . . . . 27
3.1.3 Probabilities of false and correct detection . . . . . . . . . . . . . . . 28
3.1.4 Penalty or costs coefficients . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.5 Conditional and Bayes risks . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.6 Bayesian and non Bayesian hypothesis testing . . . . . . . . . . . . . 29
3.1.7 Admissible decision rules and stopping rule . . . . . . . . . . . . . . 29
3.1.8 Classification of simple hypothesis testing schemes . . . . . . . . . . 30
3.2 Composite hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Penalty or cost functions . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Case of binary hypothesis testing . . . . . . . . . . . . . . . . . . . . 33
3.2.3 Classification of hypothesis testing schemes for a parametrically known stochastic process . . . . . . 33
3.3 Classification of parameter estimation schemes . . . . . . . . . . . . . . . . 36
3.4 Summary of notations and abbreviations . . . . . . . . . . . . . . . . . . . . 40
4 Bayesian hypothesis testing 43
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Optimization problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.1 Radar applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Detection of a known signal in an additive noise . . . . . . . . . . . 49
4.4 Binary channel transmission . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5 Signal detection and structure of optimal detectors 55
5.1 Bayesian composite hypothesis testing . . . . . . . . . . . . . . . . . . . . . 55
5.1.1 Case of binary composite hypothesis testing . . . . . . . . . . . . . . 56
5.2 Uniformly most powerful (UMP) test . . . . . . . . . . . . . . . . . . . . . 57
5.3 Locally most powerful (LMP) test . . . . . . . . . . . . . . . . . . . . . . . 57
5.4 Maximum likelihood test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5 Examples of signal detection schemes . . . . . . . . . . . . . . . . . . . . . . 59
5.5.1 Case of Gaussian noise . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.5.2 Laplacian noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.5.3 Locally optimal detectors . . . . . . . . . . . . . . . . . . . . . . . . 61
5.6 Detection of signals with unknown parameter . . . . . . . . . . . . . . . . . 63
5.7 Sequential detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.8 Robust detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6 Elements of parameter estimation 75
6.1 Bayesian parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.1.1 Minimum-Mean-Squared-Error . . . . . . . . . . . . . . . . . . . . . 76
6.1.2 Minimum-Mean-Absolute-Error . . . . . . . . . . . . . . . . . . . . . 76
6.1.3 Maximum A Posteriori (MAP) estimation . . . . . . . . . . . . . . . 77
6.2 Other cost functions and related estimators . . . . . . . . . . . . . . . . . . 82
6.3 Examples of posterior calculation . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4 Estimation of vector parameters . . . . . . . . . . . . . . . . . . . . . . . . 84
6.4.1 Minimum-Mean-Squared-Error . . . . . . . . . . . . . . . . . . . . . 84
6.4.2 Minimum-Mean-Absolute-Error . . . . . . . . . . . . . . . . . . . . . 84
6.4.3 Marginal Maximum A Posteriori (MAP) estimation . . . . . . . . . 85
6.4.4 Maximum A Posteriori (MAP) estimation . . . . . . . . . . . . . . . 85
6.4.5 Estimation of a Gaussian vector parameter from jointly Gaussian observation . . . . . . 86
6.4.6 Case of linear models . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5.1 Curve fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7 Elements of signal estimation 101
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.2 Kalman filtering : General linear case . . . . . . . . . . . . . . . . . . . . . 103
7.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.3.1 1D case: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.3.2 Track-While-Scan (TWS) Radar . . . . . . . . . . . . . . . . . . . . 108
7.3.3 Track-While-Scan (TWS) Radar with dependent acceleration sequences . . . . . . 108
7.4 Fast Kalman filter equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.5 Kalman filter equations for signal deconvolution . . . . . . . . . . . . . . . . 112
7.5.1 AR, MA and ARMA Models . . . . . . . . . . . . . . . . . . . . . . 116
8 Some complements to Bayesian estimation 117
8.1 Choice of a prior law in the Bayesian estimation . . . . . . 117
8.1.1 Invariance principles . . . . . . 117
8.2 Conjugate priors . . . . . . 120
8.3 Non informative priors based on Fisher information . . . . . . 126
9 Linear Estimation 129
9.1 Introduction . . . . . . 129
9.2 One step prediction . . . . . . 131
9.3 Levinson algorithm . . . . . . 131
9.4 Vector observation case . . . . . . 132
9.5 Wiener-Kolmogorov filtering . . . . . . 133
9.5.1 Non causal Wiener-Kolmogorov . . . . . . 133
9.6 Causal Wiener-Kolmogorov . . . . . . 135
9.7 Rational spectra . . . . . . 139
A Annexes 143
A.1 Summary of Bayesian inference . . . . . . 144
A.2 Summary of probability distributions . . . . . . 169
B Exercises 179
B.1 Exercise 1: Signal detection and parameter estimation . . . . . . 180
B.2 Exercise 2: Discrete deconvolution . . . . . . 182
Preface
During my visit to the Electrical Engineering Department of the University of Notre Dame, I had to teach a course on Detection-Estimation.
I could not really find a complete and convenient textbook covering all the material that an electrical engineer has to know on the subject. Some books are too mathematical; others are too oriented toward applications and not rigorous enough on the mathematics.
The purpose of these notes is to fill this gap: to introduce the reader to the basic theory of hypothesis testing and estimation using the main probability and statistics tools, and also to give them the basic theory of signal detection and estimation as used in practical applications of electrical engineering.
The contents of these notes are mainly covered by two books:
– Detection and Estimation, by D. Kazakos and P. Papantoni-Kazakos, and
– An Introduction to Signal Detection and Estimation, by H. Vincent Poor.
Ali Mohammad-Djafari
Chapter 1
Introduction
Generally speaking, signal detection and estimation is the area of study that deals with information processing: conversion, transmission, observation and information extraction. The main areas of application of detection and estimation theory are radar, sonar, and analog or digital communications, but detection and estimation theory has also become a main tool in other areas such as radioastronomy, geophysics, medical imaging, biological data processing, etc.
In general, detection and estimation applications involve making inferences from observations that are distorted or corrupted in some unknown way, or that are too complicated to be modelled deterministically. Moreover, sometimes even the information that one wishes to extract from such observations is not well determined. Thus, it is very useful to cast detection and estimation problems in the framework of probability theory and statistical inference. However, using probability theory and statistical inference tools does not necessarily mean that the corresponding physical phenomena are random.
In statistical inference, the goal is not to make an immediate decision, but instead to provide a summary of the statistical evidence which future users can easily incorporate into their decision process. The task of decision making is then left to decision theory.
Signal detection is inherently a decision-making task, and in signal estimation we also often need to make decisions. So, for detection and estimation we need not only probability theory and statistical inference tools but also decision and hypothesis testing tools. The main common tool with which we have to start is then the probabilistic and stochastic description of the observations and the unknown quantities.
Once again, a probabilistic or stochastic description models the effect of causes whose origin and nature are either unknown or too complex to be described deterministically.
The simplest probabilistic model for a quantity is a scalar random variable X, which is fully described by its probability distribution F(x) = Pr {X ≤ x}. The next simplest model is a random vector X = [X1, · · · , Xn]t, where the {Xj} are random variables; a random vector is fully described by its probability distribution F(x) = Pr {X ≤ x}. The next, and most general, stochastic model for a quantity is a random function X(r), where r is a finite dimensional independent variable and where, for every fixed value r = rj, the scalar quantity Xj = X(rj) is a scalar random variable. For example, when r = (x, y) represents the spatial coordinates in a plane, then X(x, y) is called a random field, and when r = t represents the time variable, then X(t) is called a stochastic process. In the rest of these notes, we consider only this last model.
A stochastic process X(t) is completely described by the probability distribution
F(x1, · · · , xn; t1, · · · , tn) = F(x; t) = Pr {X(tj) ≤ xj , j = 1, · · · , n}
for every n and every set of time instants {tj}. The stochastic process is discrete-time if it is described only by its realizations on a countable set {tj} of time instants. Then time is counted by the indices j, and the stochastic process is fully described by the random vectors Xj = [Xj, Xj+1, · · · , Xj+n]t.
A stochastic process X(t) is said to be well known if the distribution F(x1, · · · , xn; t1, · · · , tn) is precisely known for all n, every set {tj} and every vector value x. The process is instead said to be parametrically known if there exists a finite dimensional parameter vector θ = [θ1, · · · , θm]t such that the conditional distribution F(x1, · · · , xn; t1, · · · , tn|θ) is precisely known for all n, every set {tj}, every vector value x and a fixed given value of θ. A stochastic process X(t) is nonparametrically described if there is no vector parameter θ
of finite dimensionality such that the distribution F(x, t|θ) is completely described for all values of the vector θ and for all n, t and x. As an example, a stationary, discrete-time process {Xi} where the random variables Xi have finite variances is a nonparametrically described process. In fact, this description represents a whole class of stochastic processes. If we assume now that this process is also Gaussian, then it becomes parametrically known, since only its mean and spectral density functions are needed for its full description. When these two quantities are also provided, the process becomes well known.
From now on, we have the main ingredients necessary to give the general scope of detection and estimation theory. Let us consider a case where the observed quantity is modelled by a stochastic process X(t) and the observed signal x(t) is considered as a realization of the process, i.e., an observed waveform generated by X(t).
1.1 Basic definitions
• Probability spaces:
Probability theory starts by defining an observation set Γ and a class G of subsets of it, called observation events, to which we wish to assign probabilities. The pair (Γ, G) is termed the observation space.
For analytical reasons we will always assume that the collection G is a σ-algebra; that is, G contains all complements relative to Γ and denumerable unions of its members, i.e.,

if A ∈ G then Ac ∈ G, and if A1, A2, ... ∈ G then ∪i Ai ∈ G.   (1.1)
Two special cases are of interest:
– Discrete case: Γ = {γ1, γ2, · · ·}.
In this case G is the set of all subsets of Γ, which is usually denoted by 2Γ and is called the power set of Γ. For this case, probabilities can be assigned to subsets of Γ in terms of a probability mass function, p : Γ −→ [0, 1], by

P(A) = Σ_{γi ∈ A} p(γi),  A ∈ 2Γ.   (1.2)

Any function mapping Γ to [0, 1] can be a probability mass function provided that it satisfies the normalization condition

Σ_i p(γi) = 1.   (1.3)
– Continuous case: Γ = IRn, the set of n-dimensional vectors with real components.
In this case we want to assign probabilities to the sets

{x = (x1, · · · , xn) ∈ IRn | a1 < x1 < b1, · · · , an < xn < bn}   (1.4)

where the ai's and bi's are arbitrary real numbers. So, in this case, G is the smallest σ-algebra containing all of these sets with the ai's and bi's ranging throughout the reals. This σ-algebra is usually denoted Bn and is called the class of Borel sets in IRn. In this case probabilities can be assigned in terms of a probability density function, p : IRn −→ IR+, by

P(A) = ∫_A p(x) dx,  A ∈ Bn.   (1.5)

Any integrable function mapping IRn to IR+ can be a probability density function provided that it satisfies the condition

∫_{IRn} p(x) dx = 1.   (1.6)
• For compactness, we may use the term density for both probability density functions and probability mass functions, and use the following notation when necessary:

P(A) = ∫_A p(x) µ(dx)   (1.7)

for both the summation equation (1.2) and the integration equation (1.5).
• Random variable:
X = X(ω) is a function ω 7→ IR, where ω represents elements of the probability space.

• Probability distribution:
F(x) is a function IR 7→ [0, 1] such that

F(x) = Pr {X ≤ x}.   (1.8)

• Probability density function:
f(x) is a function IR 7→ IR+ such that

f(x) = ∂F(x)/∂x,   F(x) = Pr {X ≤ x} = ∫_{−∞}^{x} f(t) dt.   (1.9)
• For a real function g of the random variable X, the expected value of g(X), denoted E [g(X)], is defined by any of the following:

E [g(X)] = Σ_i g(γi) p(γi)   (1.10)

E [g(X)] = ∫_{IR} g(x) p(x) dx   (1.11)

E [g(X)] = ∫_Γ g(x) p(x) µ(dx)   (1.12)
• Random vector, or a vector of random variables:
X = [X1, · · · , Xn], where the Xi are scalar random variables.
• Joint probability distribution:

F(x1, · · · , xn) = Pr {X1 ≤ x1, · · · , Xn ≤ xn},   F(x) = Pr {X ≤ x}   (1.13)
• Stochastic process: X(t) = X(t, ω), where X(t, ω) is a scalar random variable for allt.
• A stochastic process is completely defined by

F(x1, · · · , xn; t1, · · · , tn) = Pr {X(ti) ≤ xi, i = 1, · · · , n},   F(x; t) = Pr {X(t) ≤ x}   (1.14)
• A stochastic process is stationary if

F(x1, · · · , xn; t1, · · · , tn) = F(x1, · · · , xn; (t1 + τ), · · · , (tn + τ)), i.e., F(x; t) = F(x; t + τ1)   (1.15)
• A stochastic process is memoryless or white if

F(x1, · · · , xn; t1, · · · , tn) = Π_i F(xi; ti)   (1.16)
• Discrete time stochastic process:

F(x) = Pr {X ≤ x}   (1.17)

• Memoryless discrete time stochastic process:

F(x) = Pr {X ≤ x} = Π_{i=1}^{n} Fi(xi)   (1.18)

• Memoryless and stationary discrete time stochastic process:

F(x) = Π_{i=1}^{n} Fi(xi) and Fi(x) = Fj(x) = F(x), ∀i, j   (1.19)
• A memoryless and stationary discrete time stochastic process generates, over time, independent and identically distributed (i.i.d.) random variables.
• Well known stochastic process:A stochastic process is well known if the distribution F (x, t) is known for all n, tand x.
• Parametrically well known stochastic process:
A stochastic process is parametrically well known if there exists a finite dimensional vector parameter θ = [θ1, · · · , θm] such that the conditional distribution F(x, t|θ) is known for all n, t and x.
• Non parametric description of a stochastic process:A stochastic process X(t) is non parametrically described, if there is no vector pa-rameter θ of finite dimensionality such that the distribution F (x, t|θ) is completelydescribed for every given vector θ and for all n, t and x.
• Observed data:
samples of x(t), a realization of X(t), over some time interval [0, T].
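As a small numerical illustration of these definitions, here is a sketch of the expectation formula (1.11) using a midpoint Riemann sum. The standard normal density and g(x) = x² are illustrative choices, not taken from the notes; the exact value of E[X²] is 1 for a standard normal.

```python
import math

# Midpoint Riemann-sum approximation of E[g(X)] = integral of g(x) p(x) dx,
# i.e., formula (1.11), truncated to [lo, hi].
def expect(g, p, lo=-10.0, hi=10.0, n=20_000):
    h = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * h) * p(lo + (i + 0.5) * h)
               for i in range(n)) * h

# Standard normal density (illustrative choice of p)
p = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
print(expect(lambda x: x * x, p))   # close to 1.0
```

The same routine with g(x) = 1 recovers the normalization condition (1.6).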
1.2 Summary of notations
X                      A random variable
x                      A realization of a random variable
x = {x1, · · · , xn}   n samples (realizations) of a random variable
X(t)                   A random function or a stochastic process
x(t)                   A realization of a random function
X                      A random vector or a discrete-time stochastic process
x                      A realization of a random vector or a discrete-time stochastic process
xn = {x1, · · · , xn}  n samples (realizations) of a random vector or a discrete-time stochastic process
F(x)                   A probability distribution of a scalar random variable
f(x)                   A probability density function of a scalar random variable
Fθ(x)                  A parametric probability distribution
fθ(x)                  A parametric probability density function
F(x|θ)                 A conditional probability distribution
f(x|θ)                 A conditional probability density function
F(θ|x)                 A posterior probability distribution of θ conditioned on the observations x
f(θ|x)                 A posterior probability density function of θ conditioned on the observations x
Θ                      A random scalar parameter
θ                      A sample of Θ
Θ                      A random vector of parameters
θ                      A sample of Θ
Γ                      The space of possible values of x (scalar or vector)
T                      The space of possible values of θ (scalar or vector)
Chapter 2
Basic concepts of binary hypothesis testing
Most signal detection problems can be cast in the framework of M-ary hypothesis testing, where from some observations (data) we wish to decide among M possible situations. For example, in a communication system, the receiver observes an electric waveform that consists of one of the M possible signals corrupted by channel or receiver noise, and we wish to decide which of the M possible signals is present during the observation. Obviously, for any given decision problem, there are a number of possible decision strategies or rules that can be applied; however, we would like to choose a decision rule that is optimal in some sense. There are several classical and useful criteria of optimality for such problems. The main object of this chapter is to give all the basic definitions necessary to define these criteria and their practical significance. Before going to the general case of the M-ary hypothesis testing problem, let us start with the particular problem of binary (M = 2) hypothesis testing, which allows us to introduce the main ideas more easily.
2.1 Binary hypothesis testing
The primary problem that we consider as an introduction is the simple hypothesis testing problem, in which we assume that the observed data come from one of only two possible processes with well known probability distributions P0 and P1:

H0 : X ∼ P0
H1 : X ∼ P1   (2.1)
where “X ∼ P” denotes “X has distribution P” or “the data come from a stochastic process whose distribution is P”. The hypotheses H0 and H1 are respectively referred to as the null and alternative hypotheses. A decision rule δ for H0 versus H1 is any partition of the observation space Γ into Γ1 and Γ0 = Γ1c such that we choose H1 when x ∈ Γ1 and H0 when x ∈ Γ0. The sets Γ1 and Γ0 are respectively called the rejection and acceptance regions. So, we can think of the decision rule δ as a function on Γ such that

δ(x) = 1 if x ∈ Γ1, and δ(x) = 0 if x ∈ Γ0 = Γ1c,   (2.2)
so that the value of δ for a given x is the index of the hypothesis accepted by the decision rule δ.
We can also think of the decision rule δ(x) as a probability distribution {δ0, δ1} on the space D of all possible decisions, where δj is the probability of deciding Hj in the light of the data x. In both cases δ0 + δ1 = 1.
We would like to choose H0 or H1 in some optimal way and, with this in mind, we may assign costs to our decisions. In particular, we may assign a cost cij ≥ 0 to pay if we make the decision Hi while the true hypothesis was Hj. With the partition Γ = {Γ0, Γ1} of the observation set, we can then define the conditional probabilities

Pij = Pr {X = x ∈ Γi | H = Hj} = Pj(Γi) = ∫_{Γi} pj(x) dx   (2.3)
and then the average or expected conditional risks Rj(δ) for each hypothesis as
Rj(δ) =1∑
i=0
cijPij = c1jPj(Γ1) + c0jPj(Γ0), j = 0, 1 (2.4)
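The bookkeeping in (2.3)-(2.4) can be made concrete on a toy discrete observation set; all numbers below (the two pmfs, the partition, the uniform costs) are illustrative choices, not taken from the notes.

```python
# Toy check of the conditional probabilities (2.3) and conditional
# risks (2.4) on a 3-point observation set with a fixed partition.
p0 = {0: 0.7, 1: 0.2, 2: 0.1}     # pmf under H0
p1 = {0: 0.1, 1: 0.3, 2: 0.6}     # pmf under H1
Gamma1 = {1, 2}                    # decide H1 here; Gamma0 is the complement
c = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}   # uniform costs c[i, j]

P = {}                             # P[i, j] = Pr{x in Gamma_i | H_j}
for j, pj in enumerate((p0, p1)):
    P[1, j] = sum(w for x, w in pj.items() if x in Gamma1)
    P[0, j] = 1.0 - P[1, j]

R = [c[0, j] * P[0, j] + c[1, j] * P[1, j] for j in (0, 1)]   # eq. (2.4)
print(R)   # with uniform costs, R0 = P10 and R1 = P01
```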
2.2 Bayesian binary hypothesis testing
Now assume that we can assign prior probabilities π0 and π1 = 1 − π0 to the hypotheses H0 and H1, either to translate our preferences or to translate our prior knowledge about these hypotheses. Note that πj is the probability that Hj is true, unconditional of (or independent of) the observed data x of X. This is why they are called prior or a priori probabilities. For given priors {π0, π1} we can define the posterior or a posteriori probabilities

πj(x) = Pr {H = Hj | X = x} = pj(x) πj / m(x)   (2.5)

where

m(x) = Σ_j pj(x) πj   (2.6)

is the overall density of X. We can also define an average or Bayes risk r(δ) as the overall average cost incurred by the decision rule δ:

r(δ) = Σ_j πj Rj(δ).   (2.7)
We may now use this quantity to define an optimum decision rule as the one that minimizes the Bayes risk over all decision rules. Such a decision rule is known as a Bayes decision rule.
To go a little further into the details, let us combine (2.4) and (2.7) to give

r(δ) = Σ_j πj Rj(δ) = Σ_j πj Σ_i cij Pj(Γi)
     = Σ_j [πj c0j Pj(Γ0) + πj c1j Pj(Γ1)]
     = Σ_j [πj c0j (1 − Pj(Γ1)) + πj c1j Pj(Γ1)]
     = Σ_j πj c0j + Σ_j πj (c1j − c0j) Pj(Γ1)
     = Σ_j πj c0j + ∫_{Γ1} Σ_j πj (c1j − c0j) pj(x) dx   (2.8)
Thus, we see that r(δ) is minimized over all Γ1 if we choose

Γ1 = { x ∈ Γ | Σ_j (c1j − c0j) πj pj(x) ≤ 0 }   (2.9)
   = { x ∈ Γ | (c11 − c01) π1 p1(x) ≤ (c00 − c10) π0 p0(x) }   (2.10)
In general, cjj < cij for i ≠ j, which means that the cost of deciding correctly is less than the cost of deciding incorrectly. Then, (2.10) can be written

Γ1 = { x ∈ Γ | π1 p1(x) / (π0 p0(x)) ≥ τ1 = (c10 − c00)/(c01 − c11) }   (2.11)
   = { x ∈ Γ | p1(x)/p0(x) ≥ τ2 = (π0/π1)(c10 − c00)/(c01 − c11) }   (2.12)
   = { x ∈ Γ | c10 π0(x) + c11 π1(x) ≤ c00 π0(x) + c01 π1(x) }   (2.13)
This decision rule is known as a likelihood-ratio test or a posterior probability ratio test, due to the fact that L(x) = p1(x)/p0(x) is the ratio of the likelihoods and π1(x)/π0(x) = π1 p1(x)/(π0 p0(x)) is the ratio of the posterior probabilities.
Note also that the quantity ci0 π0(x) + ci1 π1(x) is the average cost incurred by choosing the hypothesis Hi given that X = x. This quantity is called the posterior cost of choosing Hi given the observation X = x. Thus, the Bayes rule makes its decision by choosing the hypothesis that yields the minimum posterior cost.
This test plays a central role in the theory of hypothesis testing. It computes the likelihood ratio L(x) for a given observed value X = x and then makes its decision by comparing this ratio to the threshold τ2, i.e.,

δ(x) = 1 if L(x) ≥ τ2, and δ(x) = 0 if L(x) < τ2.   (2.14)
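The optimality of the region (2.9)-(2.10) can be sanity-checked by brute force on a small discrete observation set, comparing the Bayes region against every possible partition. All numbers (pmfs, priors, costs) are illustrative choices, not taken from the notes.

```python
from itertools import product

# Brute-force check of the Bayes region (2.9)-(2.10) on a 3-point set.
xs = [0, 1, 2]
p0 = {0: 0.7, 1: 0.2, 2: 0.1}           # density under H0
p1 = {0: 0.1, 1: 0.3, 2: 0.6}           # density under H1
pi0, pi1 = 0.6, 0.4                      # priors
c = {(0, 0): 0.0, (0, 1): 2.0, (1, 0): 1.0, (1, 1): 0.0}   # costs c[i, j]

def bayes_risk(G1):
    """r(delta) of the rule 'decide H1 iff x in G1', via (2.4) and (2.7)."""
    r = 0.0
    for j, (pi, pj) in enumerate(((pi0, p0), (pi1, p1))):
        for x in xs:
            i = 1 if x in G1 else 0
            r += pi * c[i, j] * pj[x]
    return r

# Region (2.10): (c11 - c01) pi1 p1(x) <= (c00 - c10) pi0 p0(x)
bayes = {x for x in xs
         if (c[1, 1] - c[0, 1]) * pi1 * p1[x] <= (c[0, 0] - c[1, 0]) * pi0 * p0[x]}

# Exhaustive search over all 2^3 candidate regions
best = min((frozenset(x for x, b in zip(xs, bits) if b)
            for bits in product([0, 1], repeat=len(xs))), key=bayes_risk)
print(sorted(bayes), bayes_risk(bayes))  # the Bayes region attains the minimum
```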
A commonly used cost assignment is the uniform cost assignment given by

cij = 0 if i = j, and cij = 1 if i ≠ j.   (2.15)
The Bayes risk r(δ) then becomes

r(δ) = π0 P0(Γ1) + π1 P1(Γ0) = π0 P10 + π1 P01.   (2.16)

Noting that Pj(Γi) is the probability of choosing Hi when Hj is true, r(δ) in this case becomes the average probability of error incurred by the rule δ. This decision rule is then a minimum probability of error decision scheme.
Note also that with the uniform cost coefficients (2.15) the decision rule can be rewritten as

δ(x) = 1 if π1(x) ≥ π0(x), and δ(x) = 0 if π1(x) < π0(x).   (2.17)

This test is called the maximum a posteriori (MAP) decision scheme.
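A minimal sketch of this equivalence: with known densities, priors (π0, π1) and uniform costs, the likelihood-ratio test (2.14) and the MAP rule (2.17) make the same decision at every x. The Gaussian densities and the prior values below are illustrative choices.

```python
import math

def lrt(x, p0, p1, pi0, pi1):
    tau = pi0 / pi1                       # uniform-cost threshold from (2.12)
    return 1 if p1(x) / p0(x) >= tau else 0

def map_rule(x, p0, p1, pi0, pi1):
    # Compare (unnormalized) posteriors pi_j * p_j(x), as in (2.17)
    return 1 if pi1 * p1(x) >= pi0 * p0(x) else 0

def gauss(m, s):
    return lambda x: math.exp(-(x - m)**2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

p0, p1 = gauss(0.0, 1.0), gauss(1.0, 1.0)   # illustrative densities
for x in [-2.0, -0.5, 0.0, 0.3, 1.0, 2.5]:
    assert lrt(x, p0, p1, 0.4, 0.6) == map_rule(x, p0, p1, 0.4, 0.6)
```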
Example: Detection of a constant signal in Gaussian noise.
Let us consider

H0 : X = ε
H1 : X = µ + ε   (2.18)

where ε is a Gaussian random variable with zero mean and known variance σ², and where µ > 0 is a known constant. In terms of distributions we can rewrite these hypotheses as

H0 : X ∼ N(0, σ²)
H1 : X ∼ N(µ, σ²)   (2.19)

where X ∼ N(µ, σ²) means that X has the density

p(x) = (2πσ²)^{−1/2} exp[ −(x − µ)² / (2σ²) ].   (2.20)
It is then easy to calculate the likelihood ratio L(x):

L(x) = p1(x)/p0(x) = exp[ (µ/σ²)(x − µ/2) ].   (2.21)
Thus, the Bayes test for these hypotheses becomes

δ(x) = 1 if L(x) ≥ τ, and δ(x) = 0 if L(x) < τ,   (2.22)

where τ is an appropriate threshold. We can remark that L(x) is a strictly increasing function of x, so comparing L(x) to a threshold τ is equivalent to comparing x itself to another threshold τ′ = L^{−1}(τ) = (σ²/µ) log(τ) + µ/2:

δ(x) = 1 if x ≥ τ′, and δ(x) = 0 if x < τ′,   (2.23)

where L^{−1} is the inverse function of L.
Figure 2.1: Location testing with Gaussian errors, uniform costs and equal priors: the densities N(0, σ²) and N(µ, σ²) with decision regions Γ0 = {x < µ/2} and Γ1 = {x ≥ µ/2}.
In the special case of uniform costs and equal priors, we have τ = 1 and so τ′ = µ/2. Then, it is not difficult to show that the conditional probabilities are

P1j = Pr {X = x ∈ Γ1 | H = Hj} = Pj(Γ1) = ∫_{τ′}^{∞} pj(x) dx = 1 − Φ(−µ/(2σ)) for j = 1, and 1 − Φ(µ/(2σ)) for j = 0,   (2.24)

and the minimum Bayes risk r(δ) is

r(δ) = 1 − Φ(µ/(2σ)).   (2.25)

This is a decreasing function of µ/σ, a quantity which is a simple version of the signal-to-noise ratio.
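The closed form (2.25) is easy to confirm by Monte Carlo simulation of the rule (2.23) with τ′ = µ/2; the values µ = σ = 1 and the sample size are illustrative choices.

```python
import math, random

# Monte Carlo check of (2.25): uniform costs, equal priors, rule x >= mu/2.
mu, sigma, n = 1.0, 1.0, 200_000
random.seed(0)
errors = 0
for _ in range(n):
    h1 = random.random() < 0.5                    # true hypothesis (H1 w.p. 1/2)
    x = (mu if h1 else 0.0) + random.gauss(0.0, sigma)
    errors += ((x >= mu / 2.0) != h1)             # decision rule (2.23)

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
r_theory = 1.0 - Phi(mu / (2.0 * sigma))          # eq. (2.25), about 0.3085
print(errors / n, r_theory)
```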
Summary of notations for binary hypothesis testing

Hypotheses:               H0, H1
Processes:                P0, P1
Conditional densities (likelihood functions):  p0(x), p1(x)
Observation space partition:  Γ = Γ0 ∪ Γ1
Decision rule:            δ(x) = 1 if x ∈ Γ1, δ(x) = 0 if x ∈ Γ0
Conditional probabilities:  Pij = ∫_{Γi} pj(x) dx :  P00, P01, P10, P11
Costs:                    cij :  c00, c01, c10, c11
Conditional risks:        Rj = Σ_i cij Pij :  R0 = c00 P00 + c10 P10,  R1 = c01 P01 + c11 P11
Prior probabilities:      π0, π1
Posterior probabilities:  πj(x) = pj(x) πj / m(x) :  π0(x), π1(x)
Joint probabilities:      Qij = πj Pij :  Q00, Q01, Q10, Q11
Posterior costs:          ci(x) = Σ_j cij πj(x) :  c0(x) = c00 π0(x) + c01 π1(x),  c1(x) = c10 π0(x) + c11 π1(x)
Bayes risk:               r(δ) = Σ_j πj Rj(δ)
Likelihood ratio:         L(x) = p1(x)/p0(x)
Posterior ratio:          π1(x)/π0(x) = π1 p1(x) / (π0 p0(x))
Posterior cost ratio:     c1(x)/c0(x) = (c10 π0(x) + c11 π1(x)) / (c00 π0(x) + c01 π1(x))

The following equivalent tests minimize the Bayes risk r(δ):

Likelihood ratio test:    p1(x)/p0(x) ≥ τ2 = (π0/π1)(c10 − c00)/(c01 − c11)
Posterior ratio test:     π1(x)/π0(x) ≥ τ1 = (c10 − c00)/(c01 − c11)
Posterior cost ratio test:  c1(x)/c0(x) ≤ 1
Special case of uniform costs binary hypothesis testing

Costs:                    cij = 0 if i = j, 1 if i ≠ j :  c00 = 0, c01 = 1, c10 = 1, c11 = 0
Conditional risks:        Rj = Σ_i cij Pij :  R0 = P10,  R1 = P01
Prior probabilities:      π0, π1
Posterior probabilities:  πj(x) = pj(x) πj / m(x) :  π0(x), π1(x)
Joint probabilities:      Qij = πj Pij :  Q00, Q01, Q10, Q11
Posterior costs:          ci(x) = Σ_j cij πj(x) :  c0(x) = π1(x),  c1(x) = π0(x)
Bayes risk:               r(δ) = Σ_j πj Rj(δ) = π0 P10 + π1 P01
Likelihood ratio:         L(x) = p1(x)/p0(x)
Posterior ratio:          π1(x)/π0(x) = π1 p1(x) / (π0 p0(x))
Posterior cost ratio:     c1(x)/c0(x) = π0(x)/π1(x)

The following equivalent tests minimize the Bayes risk r(δ):

Likelihood ratio test:    p1(x)/p0(x) ≥ τ2 = π0/π1
Posterior ratio test:     π1(x)/π0(x) ≥ 1
Posterior cost ratio test:  c1(x)/c0(x) = π0(x)/π1(x) ≤ 1
2.3 Minimax binary hypothesis testing
In the previous section, we saw how Bayesian hypothesis testing gives us a complete procedure for hypothesis testing problems. However, in some applications, we may not be able to assign the prior probabilities {π0, π1}. One approach is then to choose arbitrarily π0 = π1 = 1/2 and follow the Bayesian procedure as in the last section. An alternative approach is to choose a design criterion other than the expected penalty r(δ). For example, we may use the conditional risks R0(δ) and R1(δ) and design a decision rule that minimizes, over all δ, the criterion

max{R0(δ), R1(δ)}.   (2.26)

The decision rule based on this criterion is known as the minimax rule. To design this decision rule, it is useful to consider the function r(π0, δ), defined for a given prior π0 ∈ [0, 1] and a given decision rule δ as the average risk

r(π0, δ) = π0 R0(δ) + (1 − π0) R1(δ).   (2.27)
Noting that r(π0, δ) is a linear function of π0, for fixed δ its maximum occurs at either π0 = 0 or π0 = 1, with maximum value R1(δ) or R0(δ) respectively. So, the problem of minimizing the criterion (2.26) over δ is equivalent to minimizing over δ the quantity

max_{π0 ∈ [0,1]} r(π0, δ).   (2.28)

Now, for each prior π0, let δπ0 denote a Bayes rule corresponding to that prior, and let V(π0) = r(π0, δπ0); that is, V(π0) is the minimum Bayes risk for the prior π0. Then, it is not difficult to show that V(π0) is a concave function of π0 with V(0) = c11 and V(1) = c00.
Now consider the function r(π0, δπ′0), which is a straight line tangent to V(π0) at π0 = π′0 (see Figure 2.2). From this figure, we can see that only Bayes rules can possibly be minimax rules. Indeed, the minimax rule in this case is the Bayes rule for the prior value π0 = πL that maximizes V; for this prior, r(π0, δπL) is constant over π0, and so R0(δπL) = R1(δπL). This decision rule (with equal conditional risks) is called an equalizer rule. Because πL maximizes the minimum Bayes risk, it is called the least-favorable prior. Thus, in this case, a minimax decision rule is the Bayes rule for the least-favorable prior πL.
Even if we arrived at this conclusion through an example, it can be proved that this fact is true in all practical situations. This result is stated as the following proposition:

Suppose that πL is a solution of V(πL) = max_{π0 ∈ [0,1]} V(π0). Assume further that either R0(δπL) = R1(δπL) or πL ∈ {0, 1}. Then δπL is a minimax rule (see V. Poor for the proof). We will come back to the minimax rule in more detail in chapter x.
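A numerical sketch of the least-favorable prior for the Gaussian location example of Section 2.2, assuming uniform costs (so c00 = c11 = 0) and the illustrative values µ = σ = 1: for each prior π0 we form the Bayes threshold, evaluate the conditional risks, and maximize V(π0) over a grid. By the symmetry of this problem, πL = 1/2.

```python
import math

mu, sigma = 1.0, 1.0
Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def V(pi0):
    """Minimum Bayes risk V(pi0) = r(pi0, delta_pi0) for uniform costs."""
    if pi0 <= 0.0 or pi0 >= 1.0:
        return 0.0                               # V(0) = c11 = 0, V(1) = c00 = 0
    pi1 = 1.0 - pi0
    # Bayes rule threshold on x: t = (sigma^2/mu) log(pi0/pi1) + mu/2
    t = (sigma**2 / mu) * math.log(pi0 / pi1) + mu / 2.0
    R0 = 1.0 - Phi(t / sigma)                    # P0(Gamma1): false alarm
    R1 = Phi((t - mu) / sigma)                   # P1(Gamma0): miss
    return pi0 * R0 + pi1 * R1

grid = [i / 1000.0 for i in range(1001)]
pi_L = max(grid, key=V)                          # least-favorable prior
print(pi_L, V(pi_L))   # pi_L = 0.5 by symmetry; V(0.5) = 1 - Phi(mu/(2*sigma))
```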
Figure 2.2: Illustration of the minimax rule: the concave minimum Bayes risk V(π0) over π0 ∈ [0, 1] (with V(0) = c11 and V(1) = c00), the tangent lines r(π0, δπ′0), and the least-favorable prior πL, at which r(π0, δπL) is constant and R0(δπL) = R1(δπL).
2.4 Neyman-Pearson hypothesis testing
In the previous sections, we first examined Bayesian hypothesis testing, where optimality was defined through the overall expected cost r(δ). Then, we considered the case where the prior probabilities {π0, π1} are not available and described the minimax decision rule in terms of the maximum of the conditional risks R0(δ) and R1(δ).
In both cases, we need to define the costs cij. In some applications, imposing a particular cost structure on the decisions may not be possible or desirable. In such cases, an alternative criterion, known as the Neyman-Pearson criterion, is used, which is based on the probability of making a false decision. The main idea of this procedure is to choose one of the hypotheses as the main hypothesis and to test the other hypotheses against it. For example, in testing H0 against H1, two kinds of errors can be made:
• Falsely rejecting H0 (in this case, falsely detecting H1). This error is called a Type I error, a false alarm, or a false detection.

• Falsely rejecting H1 (in this case, falsely detecting H0). This error is called a Type II error or a miss.
The terms “false alarm” and “miss” come from radar applications in which H0 and H1
usually represent the absence and presence of a target.
For a decision rule δ, the probability of a Type I error is known as the false alarm probability and is denoted by PF (δ). Similarly, the probability of a Type II error is called the miss probability and is denoted by PM (δ). The quantity PD(δ) = 1 − PM (δ) is called the detection probability, or the power, of δ.
The Neyman-Pearson criterion is based on these quantities. It places a bound on the false alarm probability and minimizes the miss probability within this constraint, i.e.,

max_δ PD(δ)  subject to  PF (δ) ≤ α     (2.29)

where α is known as the significance level of the test. Thus the Neyman-Pearson criterion is to find the most powerful α-level test of H0 against H1.
Note that, in the Neyman-Pearson test, as opposed to the Bayesian and minimax tests, the two hypotheses are not treated symmetrically.
The general form of the Neyman-Pearson decision rule is

δ(x) = 1 if L(x) > τ;  γ(x) if L(x) = τ;  0 if L(x) < τ     (2.30)
where τ is a threshold. The false alarm probability and the detection probability of a decision rule δ can be calculated respectively by

PF (δ) = E0 {δ(X)} = ∫_Γ δ(x) p0(x) dx     (2.31)

PD(δ) = E1 {δ(X)} = ∫_Γ δ(x) p1(x) dx     (2.32)
A parametric plot of PD(δ) as a function of PF (δ) is called the receiver operating characteristic (ROC).
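As an illustration (with invented numbers, not an example from the text), one point of the ROC and the most powerful α-level threshold can be computed for a Gaussian shift-in-mean problem. The densities, the mean shift and the level α below are assumptions of this sketch:

```python
from statistics import NormalDist

# Hypothetical example: H0: X ~ N(0,1), H1: X ~ N(1,1).
# The likelihood ratio L(x) is increasing in x, so the Neyman-Pearson test
# "decide H1 iff x > tau", with tau chosen so that P_F = alpha, is optimal.
alpha = 0.1
h0, h1 = NormalDist(0.0, 1.0), NormalDist(1.0, 1.0)

tau = h0.inv_cdf(1.0 - alpha)        # threshold: Pr{X > tau | H0} = alpha
p_false_alarm = 1.0 - h0.cdf(tau)    # P_F(delta), equals alpha by construction
p_detection = 1.0 - h1.cdf(tau)      # P_D(delta): one point on the ROC curve

print(tau, p_false_alarm, p_detection)
```

Sweeping α over (0, 1) and plotting p_detection against p_false_alarm traces out the full ROC of this detector.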
Chapter 3
Basic concepts of general hypothesis testing
In the previous chapters, we introduced the basics of the simple binary hypothesis testing problem. In this chapter, we consider the more general case of M-ary hypothesis testing. First, we give the basic definitions for the case of simple hypothesis testing with well-known stochastic processes. Then, we consider the case of composite hypothesis testing, where the stochastic processes are only parametrically known. In each case, we try to give a simple classification of the different decision rules, describing their optimality criteria and their performances.
3.1 A general M-ary hypothesis testing problem
To consider a general M-ary hypothesis testing problem, let us consider the necessary steps that any decision-making procedure has to follow:
1. Get the data: observe x(t), a realization of X(t), in some time interval [0, T].

2. Define a library of hypotheses {Hi, i = 1, · · · ,M}, where Hi is the hypothesis that the data x(t) come from a stochastic process Xi(t), with either finite or infinite membership.

3. Define a performance criterion for evaluating the decisions {δi, i = 1, · · · ,M}.

4. If possible, define a probability measure determining the a priori probabilities of the stochastic processes in the library.

5. Use all the available assets to formulate a general optimization problem whose solution is a decision.
Evidently, the nature of the optimization problem and the subsequent decisions vary significantly with the specifics of the library of stochastic processes, with the availability of an a priori probability distribution on these stochastic processes, with the performance criterion to optimize, and with the possibility of controlling the observation time [0, T] dynamically.
Figure 3.1: Partition of the observation space and deterministic or probabilistic decision rules: on the left, disjoint regions Γj → Hj and Γk → Hk; on the right, an overlap Γj ∪ Γk in which Hj is decided with probability qj and Hk with probability qk, with qj + qk = 1.
For any fixed specifications on the above issues, the decision then depends on the data.This is what is called the decision rule or test.
We can distinguish the following special cases:
• If the library has a finite number of members, the decision process is classified as hypothesis testing.

• If this number is two, the decision process is called detection.

• If the stochastic processes are well known, the hypothesis testing is called simple. If they are defined parametrically, the decision process is called parametric; if not, it is called nonparametric.
3.1.1 Deterministic or probabilistic decision rules
A decision rule δ = {δj(x), j = 1, · · · ,M} subdivides the observation space Γ into M subspaces {Γj , j = 1, · · · ,M}. One can distinguish two types of decision rules:
• Deterministic decision rule: If these subspaces are all disjoint and, for a given data set x, the hypothesis Hj is decided with probability one, the decision rule is called deterministic.

• Probabilistic decision rule: If some of these subspaces overlap and, for a given data set x, no hypothesis can be decided with probability one, the decision rule is called probabilistic. That is, given x, the hypothesis Hk is decided with probability qk and Hj with probability qj , with ∑_j qj = 1.
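The mechanics of a probabilistic decision rule can be sketched as follows; the probabilities qj , qk and the sample size are invented for illustration:

```python
import random

# Hypothetical sketch: in an overlap region the rule decides hypothesis h
# with probability q[h], where the q[h] sum to one.
def randomized_decision(q, rng):
    """Draw one hypothesis index according to the probabilities q."""
    u, acc = rng.random(), 0.0
    for hypothesis, q_h in enumerate(q):
        acc += q_h
        if u < acc:
            return hypothesis
    return len(q) - 1  # guard against rounding in the cumulative sum

rng = random.Random(0)                # fixed seed for reproducibility
q = (0.3, 0.7)                        # q_j = 0.3, q_k = 0.7 (illustrative values)
draws = [randomized_decision(q, rng) for _ in range(100_000)]
freq_k = draws.count(1) / len(draws)  # empirical frequency, close to q_k
print(freq_k)
```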
3.1.2 Conditional, A priori and Joint Probability Distributions
For a given decision rule δ, we can define the following probability distributions:
• Conditional probability distribution:
Pki(δ) is the conditional probability that Hk is chosen given that Hi is true. These probabilities can be calculated from the probability distribution of the stochastic process:

Pki(δ) = Pr {Hk decided by rule δ | Hi true}
       = ∫_Γ dPr {Hk decided and x observed | Hi true}
       = ∫_Γ Pr {Hk decided | x observed} dPr {x observed | Hi true}
       = ∫_Γ δk(x) dFi(x) = ∫_Γ δk(x) fi(x) dx     (3.1)
Note that, in this derivation, we have used the theorem of total probability, the Bayes rule, and the fact that the decision induced by the decision rule δ is independent of the true hypothesis. Note also that the decision rule δ consists of probabilities such that

∑_{j=1}^{M} δj(x) = 1   ∀x ∈ Γ     (3.2)

Using this fact, it is easy to show that

∑_{k=1}^{M} Pki = 1   ∀i = 1, · · · ,M     (3.3)

This can be interpreted as: given the true hypothesis Hi, the decision induced by the decision rule δ is restricted to one of the M hypotheses.
• A priori probability distribution:
{πi, i = 1, · · · ,M} is a prior probability distribution on the hypotheses {Hi, i = 1, · · · ,M}. Naturally we have

∑_{i=1}^{M} πi = 1     (3.4)
• Joint probability distribution:
Using the conditional probabilities Pki(δ) and the prior probabilities πi, we can calculate the joint probabilities Qki(δ), denoting the probability that Hk is decided by the decision rule δ while Hi is true. We then have

Qki(δ) = πi Pki(δ) = πi ∫_Γ δk(x) fi(x) dx     (3.5)

∑_{k=1}^{M} Qki = πi   ∀i = 1, · · · ,M     (3.6)
∑_{k=1}^{M} ∑_{i=1}^{M} Qki = 1     (3.7)
3.1.3 Probabilities of false and correct detection
For a given decision rule δ, we can define the following probabilities:
• Probability of false decision:
Pe(δ) is the probability that the decision induced by δ is erroneous, i.e., Hk is decided while Hi, i ≠ k, is true:

Pe(δ) = ∑_{k≠i} Qki(δ)     (3.9)
• Probability of correct decision:
Pd(δ) is the probability that the decision induced by δ is correct, i.e., Hk is decided while Hk is true:

Pd(δ) = ∑_k Qkk(δ) = 1 − Pe(δ)     (3.10)
• One can use these probabilities to define optimal decision procedures:
Pe(δ∗) ≤ Pe(δ), ∀δ ∈ D (3.11)
Pd(δ∗) ≥ Pd(δ), ∀δ ∈ D (3.12)
3.1.4 Penalty or cost coefficients
In addition to the M known hypotheses and their a priori probabilities, the analyst may be equipped with a set of real cost or penalty coefficients cki such that

cki ≥ 0 and cki ≥ ckk  ∀k, i = 1, · · · ,M,     (3.13)

where cki is the penalty paid when Hk is decided while Hi is true. The implication of the condition that each coefficient is nonnegative is that there is no gain associated with any decision, hence the term penalty; in general, the penalties cki are chosen greater than ckk.
3.1.5 Conditional and Bayes risks
For a given decision rule δ and a given set of cost coefficients cki, one can calculate the following quantities:

• Conditional expected penalties or conditional risks:

Ri(δ) = ∑_{k=1}^{M} cki Pki(δ)     (3.14)
• Expected penalty or Bayes risk:

r(δ) = ∑_{k=1}^{M} ∑_{i=1}^{M} cki Qki(δ)     (3.15)
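The quantities defined in this and the previous subsections can be tied together on a small example; the priors, conditional densities, costs and rule below are all invented for illustration:

```python
# Illustrative toy problem (all numbers invented): M = 2 hypotheses and a
# discrete observation space Gamma = {0, 1, 2}.
M = 2
pi = [0.6, 0.4]                       # priors pi_i
f = [[0.7, 0.2, 0.1],                 # f_0(x): distribution of x under H_0
     [0.1, 0.3, 0.6]]                 # f_1(x): distribution of x under H_1
c = [[0.0, 1.0],                      # cost c_ki of deciding H_k when H_i true
     [2.0, 0.0]]

def delta(x):                         # deterministic rule: decide H_0 iff x = 0
    return 0 if x == 0 else 1

# Conditional probabilities P_ki = Pr{decide H_k | H_i true}       (eq. 3.1)
P = [[sum(f[i][x] for x in range(3) if delta(x) == k) for i in range(M)]
     for k in range(M)]
# Joint probabilities Q_ki = pi_i * P_ki                           (eq. 3.5)
Q = [[pi[i] * P[k][i] for i in range(M)] for k in range(M)]

Pe = sum(Q[k][i] for k in range(M) for i in range(M) if k != i)    # eq. 3.9
Pd = sum(Q[k][k] for k in range(M))                                # eq. 3.10
R = [sum(c[k][i] * P[k][i] for k in range(M)) for i in range(M)]   # eq. 3.14
r = sum(c[k][i] * Q[k][i] for k in range(M) for i in range(M))     # eq. 3.15

print(Pe, Pd, R, r)
```

Note that the Bayes risk also satisfies r(δ) = ∑_i πi Ri(δ), which the example confirms numerically.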
3.1.6 Bayesian and non Bayesian hypothesis testing
Different optimization problems which are classically used to define a decision process are:
• Bayesian hypothesis testing:
– If a specific cost function that penalizes wrong decisions is provided, then theminimization of the expected penalty or the Bayes risk is chosen as the perfor-mance criterion.
– If not, the probability of making a decision error is minimized instead. This decision process is called the ideal observer test.
• Non Bayesian hypothesis testing:
When an a priori probability distribution is unavailable, then

– If a specific cost function is available, first a least-favorable a priori probability distribution is defined, and then the expected penalty under this least-favorable distribution is minimized to obtain a decision rule. This decision process is called the minimax decision process.

– If a cost function is not available, first one of the hypotheses is selected in advance as the most important, and then the performance criterion is the maximization of the probability of detection of that hypothesis, subject to the constraint that its false alarm probability does not exceed a given value α. This is called the Neyman-Pearson test procedure.
3.1.7 Admissible decision rules and stopping rule
• Admissible decision rules:
It may happen that, for a given set of optimality criteria, more than one best decision rule satisfies the performance criterion; such rules are called admissible.

• When a decision rule is designed, one may want to know how it performs as a function of the observation time interval [0, T], i.e., of the number of data. The study of this behavior leads to the notion of a stopping rule.
All the above test procedures take a dynamic form if the observation time interval[0, T ] can be controlled dynamically.
3.1.8 Classification of simple hypothesis testing schemes
To summarize, let us again list the richest possible set of assets available to the analyst:

i. A library of M distinct hypotheses {Hi, i = 1, · · · ,M};

ii. A set of data x which is assumed to be a realization of a well-known stochastic process under only one of these hypotheses;

iii. A prior probability distribution {πi, i = 1, · · · ,M} for the M hypotheses;

iv. A set of penalty coefficients {cki, k, i = 1, · · · ,M}.

The minimum set of assets that is (or must be) always available consists of those in i and ii, and the performance criterion will suffer limitations as the number of remaining available assets decreases.
Now, to continue, first assume that all assets in i to iv are available. Then an optimal rule δ∗ is such that its expected penalty r(δ) is lower than that of any other rule, i.e.,

δ∗ : r(δ∗) ≤ r(δ)  ∀δ ∈ D     (3.16)

This rule guarantees a minimum average cost due to wrong decisions. Note that this rule may not be unique. When uniqueness is not satisfied, there exist a number of admissible rules, among which we can choose the one that is simplest to implement.
If assets in i to iii are available, then an optimal rule δ∗ can be defined by using the induced probability of error Pe(δ), or the probability of detection, i.e.,

δ∗ : Pe(δ∗) ≤ Pe(δ)  ∀δ ∈ D     (3.17)

or

δ∗ : Pd(δ∗) ≥ Pd(δ)  ∀δ ∈ D     (3.18)

Again, note that there may not exist a unique decision rule, but a set of admissible rules, among which we can choose the one that is simplest to implement.
The hypothesis testing rules based on the above criteria are called Bayesian, due to the basic ingredient which is the availability of the prior probabilities πi on the hypothesis space. When asset iii is not available, the decision rules are called non Bayesian.
Now assume that assets i, ii and iv are available. Then the analyst can choose an arbitrary prior probability distribution π = {πi} and calculate the induced conditional expected penalty R(δ, π). Then the decision rules

δ∗ : sup_π R(δ∗, π) ≤ sup_π R(δ, π)  ∀δ ∈ D     (3.19)

are admissible. The analyst can then choose among these admissible rules the one with the lowest complexity. This procedure, when successful, isolates the decision rule that induces the minimum of the maximum value of the conditional expected penalty and protects the analyst against the most costly case. This formalism and procedure is called minimax.
Finally, assume that only the assets in i and ii are available. Then the main idea is to select one of the hypotheses as the principal one and use the notion of the power function P(δ).
General Hypothesis Testing Schemes

Scheme       | A priori | Cost | Decision rule
Bayesian     | Yes      | Yes  | Minimization of the expected penalty r(δ)
Bayesian     | Yes      | No   | Minimization of the error probability Pe(δ)
Non Bayesian | No       | Yes  | Minimax test rule using the conditional risks Rj(δ)
Non Bayesian | No       | No   | Neyman-Pearson test rule using Pe(δ) and Pd(δ)

Classes of Hypothesis Testing Schemes for Well-Known Stochastic Processes

Scheme       | Assets used    | Optimization function      | Optimal decision rule δ∗       | Specific name
Bayesian     | i, ii, iii, iv | r(δ)                       | r(δ∗) ≤ r(δ)                   | Bayesian
Bayesian     | i, ii, iii     | Pe(δ)                      | Pe(δ∗) ≤ Pe(δ)                 | Bayesian
Non Bayesian | i, ii, iv      | sup_π R(δ, π)              | sup_π R(δ∗, π) ≤ sup_π R(δ, π) | Minimax
Non Bayesian | i, ii          | Pd(δ) subject to Pe(δ) ≤ α | Pd(δ∗) ≥ Pd(δ) and Pe(δ∗) ≤ α  | Neyman-Pearson
3.2 Composite hypothesis
Now assume that the hypothesis Hi means that x is a realization of the process Xi, but the process Xi is only parametrically known, i.e., its probability distribution is known up to a set of unknown parameters θ, so that the prior probabilities {πi} depend on the parameter θ. Assume also that the partitioning underlying the decision rule is due to a partition of the parameter space T of possible values of θ, i.e.,
T = {T1, · · · ,TM} , ∪iTi = T (3.20)
Assume also that, for each value of θ, the stochastic process is defined through its probability distribution fθ(x), and assume that we can define a probability density function π(θ) over the space T. Then we have

πi = ∫_{Ti} π(θ) dθ     (3.21)

and

Pk,θ(δ) = ∫_Γ δk(x) fθ(x) dx     (3.22)

where

∑_{k=1}^{M} Pk,θ(δ) = 1  ∀θ ∈ T     (3.23)
We can now calculate the conditional probabilities Pki(δ):

Pki(δ) = (1/πi) ∫_{Ti} Pk,θ(δ) π(θ) dθ
       = [∫_{Ti} π(θ) dθ]^{-1} ∫_Γ dx δk(x) ∫_{Ti} fθ(x) π(θ) dθ     (3.24)
and the joint probabilities Qki as follows:

Qki(δ) = πi Pki(δ) = ∫_{Ti} Pk,θ(δ) π(θ) dθ
       = ∫_Γ dx δk(x) ∫_{Ti} fθ(x) π(θ) dθ     (3.25)
3.2.1 Penalty or cost functions
In addition to the M parametrically known hypotheses, we may be equipped with a set of real penalty or cost functions ck(θ).
We can then calculate:
• Conditional expected penalty or conditional risk function:

R(δ, θ) = ∑_{k=1}^{M} ck(θ) Pk,θ(δ)     (3.26)
        = ∫_Γ dx ∑_{k=1}^{M} δk(x) ck(θ) fθ(x)     (3.27)
• Expected penalty or Bayes risk:
If π(θ) is available, then the expected penalty can be calculated by

r(δ) = ∫_T R(δ, θ) π(θ) dθ = ∑_{k=1}^{M} ∫_T ck(θ) Pk,θ(δ) π(θ) dθ     (3.28)
     = ∫_Γ dx ∑_{k=1}^{M} δk(x) ∫_T ck(θ) fθ(x) π(θ) dθ     (3.29)
3.2.2 Case of binary hypothesis testing
Now, consider the particular case of binary hypothesis testing, where there are only two hypotheses, and assume that they are determined through a single parametrically known stochastic process and two disjoint subdivisions of the parameter space T. Let us denote these two hypotheses by H0 and H1, and the decision rules by δ0(x) and δ1(x), with δ0(x) + δ1(x) = 1 ∀x ∈ Γ. Now, if we emphasize the hypothesis H1 (detection), then δ0(x) = 1 − δ1(x), so that we can drop the indices in the decision rule and write δ(x) = δ1(x). We then have
Pθ(δ) = ∫_Γ δ(x) fθ(x) dx     (3.30)
This expression represents the probability that the emphasized hypothesis H1 is decided, conditioned on the value θ of the parameter vector. It is called the power function of the decision rule, because it gives the probability with which the emphasized hypothesis is decided for each fixed parameter value θ.
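A minimal sketch of a power function, assuming a Gaussian observation X ~ N(θ, 1) and a simple threshold rule; both the model and the threshold are assumptions of this example, not taken from the text:

```python
from statistics import NormalDist

# Hypothetical model: X ~ N(theta, 1), rule delta(x) = 1{x > tau} with tau = 1.
# The power function P_theta(delta) = Pr{X > tau | theta} of eq. (3.30).
tau = 1.0

def power(theta):
    return 1.0 - NormalDist(theta, 1.0).cdf(tau)

thetas = [-1.0, 0.0, 1.0, 2.0]
powers = [power(t) for t in thetas]
print(powers)   # increases monotonically in theta for this one-sided rule
```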
3.2.3 Classification of hypothesis testing schemes for a parametrically known stochastic process
As before, we now summarize the richest possible set of assets available for a composite hypothesis testing problem:

i. A library of M distinct hypotheses {Hi, i = 1, · · · ,M};

ii. A set of data x which is assumed to be a realization of a parametrically known stochastic process, the hypotheses {Hi, i = 1, · · · ,M} corresponding to M disjoint subdivisions of the parameter space T;

iii. A prior probability distribution π(θ) on the parameter space T;

iv. A set of penalty functions {ck(θ), k = 1, · · · ,M} defined on T.
First, assume that all assets in i to iv are available. Then an optimal rule δ∗ is such that its expected penalty r(δ) is lower than that of any other rule, i.e.,
δ∗ : r(δ∗) ≤ r(δ) ∀δ ∈ D (3.31)
If assets in i to iii are available, then an optimal rule δ∗ can be defined by using the induced probability of error Pe(δ), or the probability of detection, i.e.,
δ∗ : Pe(δ∗) ≤ Pe(δ) ∀δ ∈ D (3.32)
or

δ∗ : Pd(δ∗) ≥ Pd(δ)  ∀δ ∈ D     (3.33)
Again, note that there may not exist a unique decision rule, but a set of admissible rules,among which we can choose the one which is the simplest to implement.
Now assume that assets i, ii and iv are available. Then the analyst can use the induced conditional expected penalty R(δ, θ). An optimal rule would induce relatively low values of R(δ, θ) for all θ ∈ T. So, if there exist two rules δ(1) and δ(2) such that

R(δ(1), θ) ≤ R(δ(2), θ)  ∀θ ∈ T     (3.34)

then δ(2) should be rejected in the presence of δ(1). The rule δ(1) is said to be uniformly superior to the rule δ(2). But it may happen that R(δ(1), θ) ≤ R(δ(2), θ) for some values of θ and R(δ(1), θ) > R(δ(2), θ) for others. In this case, we may prefer δ(1) to δ(2) if
sup_{θ∈T} R(δ(1), θ) ≤ sup_{θ∈T} R(δ(2), θ)     (3.35)
Thus, the selection procedure has, in general, two steps: first reject all the uniformly inadmissible rules, and then, among the remaining ones, define the optimal rules

δ∗ : sup_{θ∈T} R(δ∗, θ) ≤ sup_{θ∈T} R(δ, θ)  ∀δ ∈ D     (3.36)

which are admissible. Finally, the analyst can choose among these admissible rules the one with the lowest complexity. This procedure, when successful, isolates the decision rule that induces the minimum of the maximum value of the conditional expected penalty and protects the analyst against the most costly case. This formalism and procedure is called minimax.
Figure 3.2: Two decision rules δ(1) and δ(2) and their respective risk functions R(δ, θ). In both cases δ(2) is rejected in the presence of δ(1).
Finally, assume that only the assets in i and ii are available. Then the main idea is to select one of the hypotheses as the principal one and use the notion of the power function Pθ(δ). Let H1 denote the emphasized hypothesis, T1 its associated region in T, and Pθ(δ) the power function associated with it. It is then desirable that Pθ(δ) for any θ ∈ T1 have a value higher than its value for the other hypotheses, i.e.,
P_{θ∈T1}(δ) ≥ P_{θ∈Tj}(δ)     (3.37)
The quantity sup_{θ∈T0} Pθ(δ) is the false alarm probability induced by δ. The value of Pθ(δ) for a given θ ∈ T1 is the power induced by the decision rule δ. If the subspaces T0 and T1 are fixed, then the best decision rule δ∗ is the one that induces the highest power subject to a false alarm constraint, i.e.,

δ∗ : Pθ(δ∗) ≥ Pθ(δ)  ∀θ ∈ T1, ∀δ ∈ D,  subject to  sup_{θ∈T0} Pθ(δ∗) ≤ α     (3.38)
The procedure, as in the minimax scheme, may have more than one solution.
Figure 3.3: Two decision rules δ(1) and δ(2) and their respective power functions Pθ(δ) over T0 and T1. In this case δ(1) is preferred to δ(2).
Classes of Hypothesis Testing Schemes for Parametrically Known Stochastic Processes

Scheme       | Assets used    | Optimization function                       | Optimal decision rule δ∗                              | Specific name
Bayesian     | i, ii, iii, iv | r(δ)                                        | r(δ∗) ≤ r(δ)                                          | Bayesian
Bayesian     | i, ii, iii     | Pe(δ)                                       | Pe(δ∗) ≤ Pe(δ)                                        | Bayesian
Non Bayesian | i, ii, iv      | sup_{θ∈T} R(δ, θ)                           | sup_{θ∈T} R(δ∗, θ) ≤ sup_{θ∈T} R(δ, θ)                | Minimax
Non Bayesian | i, ii          | P_{θ∈T1}(δ) subject to sup_{θ∈T0} Pθ(δ) ≤ α | P_{θ∈T1}(δ∗) ≥ P_{θ∈T1}(δ) and sup_{θ∈T0} Pθ(δ∗) ≤ α | Neyman-Pearson
3.3 Classification of parameter estimation schemes
The basic ingredient that distinguishes parameter estimation from hypothesis testing is the dimension of the hypothesis space and the nature of the stochastic process corresponding to each alternative. In hypothesis testing, the dimension of the hypothesis space is finite and each of the M alternatives is represented by one stochastic process. In parameter estimation, we face an infinite number of alternatives, represented by some m-dimensional vector parameter θ that takes its values in T.
The basic elements of parameter estimation are then the vector parameter θ and a stochastic process X(t) parameterized by θ, and we can still distinguish two cases:

• If, for a fixed θ, the stochastic process is well known, then we have a parametric parameter estimation scheme.

• If, for a fixed θ, the stochastic process is a member of some class Fθ of processes, then we have a nonparametric or robust parameter estimation scheme.
In both cases, the main assumption is that the value of the parameter, and thus the nature of the stochastic process, remains unchanged during the observation time [0, T]. The main objective is then to determine the active value of the parameter θ. Given a set of data x, the solution is denoted θ̂(x) and is called the parameter estimate.
Among the different criteria to measure the performance of an estimate θ̂(x), one can mention the following:
• Bias:
For a real-valued parameter vector θ, the Euclidean norm

‖θ − E[θ̂(X)]‖

is called the bias of the estimate θ̂(x) at the process. If the bias is zero for all θ ∈ T, then the estimate θ̂(x) is called unbiased at the process.
• Conditional variance:
The quantity

E[ ‖θ̂(X) − E[θ̂(X)]‖² | θ ]

is called the conditional variance of the estimate θ̂(x).
In general, the bias and the conditional variance present a tradeoff. Indeed, an unbiasedestimate may induce a relatively large variance, and very often, admitting a small biasmay result in a significant reduction of the conditional variance.
A parameter estimate θ̂(x) is called efficient at the process if its conditional variance equals a lower bound known as the Cramér-Rao bound.
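The bias/variance tradeoff mentioned above can be seen on the classical variance estimators for a Gaussian sample; this Monte Carlo sketch uses invented sample sizes and a fixed seed, and is an illustration rather than anything from the text:

```python
import random
import statistics

# Estimate sigma^2 = 1 of an N(0,1) sample of size n with the unbiased
# estimator (divide by n - 1) and the biased one (divide by n).
rng = random.Random(42)
n, trials, true_var = 5, 20_000, 1.0

unbiased, biased = [], []
for _ in range(trials):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    s2 = statistics.variance(x)        # divides by n - 1 (unbiased)
    unbiased.append(s2)
    biased.append(s2 * (n - 1) / n)    # divides by n (biased, smaller variance)

bias_u = statistics.fmean(unbiased) - true_var
bias_b = statistics.fmean(biased) - true_var
mse_u = statistics.fmean((e - true_var) ** 2 for e in unbiased)
mse_b = statistics.fmean((e - true_var) ** 2 for e in biased)
print(bias_u, bias_b, mse_u, mse_b)
```

For this Gaussian case the biased estimator trades a bias of about −σ²/n for a lower variance, and ends up with the smaller mean squared error.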
A more general criterion here also is the expected penalty, obtained by defining a penalty function c[θ̂(x), θ], a scalar, nonnegative function whose values vary as θ̂ and θ vary in T. We can then define:
• Conditional expected penalty or conditional risk function:

R(θ̂, θ) = E[ c[θ̂(X), θ] | θ ] = ∫_Γ c[θ̂(x), θ] fθ(x) dx     (3.39)

where fθ(x) is the probability density function of the stochastic process defined by θ at the point x.
• Expected penalty or Bayes risk function:
When an a priori probability density function π(θ) is available, we can calculate the expected value of R(θ̂, θ) for all θ ∈ T, and thus define the total expected penalty or the Bayes risk function by

r(θ̂) = r(θ̂, π) = ∫_T R(θ̂, θ) π(θ) dθ     (3.40)
Now, let us try to classify the parameter estimation schemes. For this, we list the richest possible set of assets:
i. A parametric or nonparametric description of a stochastic process depending on a finite-dimensional parameter vector θ;

ii. A set of data x which is assumed to be a realization of the active stochastic process, with the implicit assumption that this process remains unchanged during the observation time;

iii. A parameter space T where θ takes its values;

iv. An a priori probability distribution π(θ) defined on the parameter space T;

v. A penalty function c[θ̂(x), θ] defined for each data sequence x, parameter vector θ, and estimated parameter vector θ̂(x).
Here also, some of the assets listed above may not be available, and we will see how different schemes come out of the partial availability of these assets. The minimum set of assets that is (or must be) always available consists of those in i, ii and iii, and the performance criterion of the estimation will suffer limitations as the number of remaining available assets decreases.
First, we assume that a parametric description of the stochastic process is available. When all the assets are available, we have the Bayesian parameter estimation scheme, where the performance criterion is the expected penalty or Bayes risk

r(θ̂) = E[ c[θ̂(X), θ] ] = ∫_T ∫_Γ c[θ̂(x), θ] fθ(x) π(θ) dx dθ     (3.41)

minimized with respect to the estimate θ̂(x) for all x in the observation space Γ.
The Bayesian optimal estimate is then defined as the estimate θ̂∗(x) which minimizes the expected penalty, i.e.,

θ̂∗(x) : r(θ̂∗(x)) ≤ r(θ̂(x)) for every estimate θ̂(x)     (3.42)
If assets in i to iv are available, then we can calculate the posterior probability density function of θ using the Bayes rule:

π(θ|x) = p(x|θ) π(θ) / m(x) = fθ(x) π(θ) / m(x)     (3.43)

where

m(x) = ∫_T fθ(x) π(θ) dθ     (3.44)
and define an estimate, called the maximum a posteriori (MAP) estimate, by

θ̂∗(x) : π(θ̂∗|x) ≥ π(θ|x)  ∀θ ∈ T     (3.45)

or, written differently,

θ̂∗ = arg max_{θ∈T} {π(θ|x)}     (3.46)
If assets in i to iii and v are available, then an optimal estimate θ̂∗(x) can be defined by using the conditional expected penalty

R(θ̂, θ) = E[ c[θ̂(X), θ] | θ ] = ∫_Γ c[θ̂(x), θ] fθ(x) dx     (3.47)

where fθ(x) is the probability density function of the stochastic process defined by θ at the point x. We are then in the minimax parameter estimation scheme, which is based on the saddle-point game formalization, with the expected penalty r(θ̂, π) as payoff function and the parameter estimate θ̂ and the a priori probability density function π as variables.
In summary, we can say that, if a minimax estimate θ̂∗ exists, it is an optimal Bayesian estimate at some least-favorable a priori probability distribution π0, i.e.,

θ̂∗(x) : ∃π0 : r[θ̂∗(x), π] ≤ r[θ̂∗(x), π0] ≤ r[θ̂(x), π0]  for every estimate θ̂ and every π     (3.48)
When only the assets i to iii are available, the analyst can use the induced conditional probability density function fθ(x). The scheme is called maximum likelihood: the main idea is to view fθ(x) as a function l(θ) = fθ(x), called the likelihood of the vector parameter θ, and to define the maximum likelihood (ML) estimate θ̂∗(x) as the one that maximizes the likelihood l(θ), i.e.,

θ̂∗(x) : l(θ̂∗) ≥ l(θ)  ∀θ ∈ T     (3.49)
All the above schemes comprise the class of parametric parameter estimation procedures, with the common characteristic of the assumption that the stochastic process generating the data is parametrically well known.
When, for a given vector parameter θ, the stochastic process is nonparametrically described, the parameter estimation is called nonparametric or sometimes robust. As in the minimax scheme, the robust estimation scheme uses a saddle-point game procedure, but here the payoff function originates from the likelihood. So, in this scheme, in addition to the nonparametric description of the stochastic process, only the assets ii and iii are used to define a performance criterion using the likelihood function. We will return to this scheme in more detail in future chapters.
Classes of Parameter Estimation Schemes for Parametrically Known Stochastic Processes

Scheme       | A priori | Cost | Decision rule
Bayesian     | Yes      | Yes  | Minimization of the expected penalty r(θ̂)
Bayesian     | Yes      | No   | Maximization of the posterior probability π(θ|x)
Non Bayesian | No       | Yes  | Minimax estimation using r(θ̂, π)
Non Bayesian | No       | No   | Maximum likelihood estimation using l(θ)

Classes of Parameter Estimation Schemes

Scheme             | Assets used       | Optimization function | Optimal estimate θ̂∗
Bayesian           | i, ii, iii, iv, v | r(θ̂)                 | θ̂∗ : r(θ̂∗) ≤ r(θ̂) for every θ̂
MAP                | i, ii, iii, iv    | π(θ|x)                | θ̂∗ : π(θ̂∗|x) ≥ π(θ|x) ∀θ ∈ T
Minimax            | i, ii, iii, v     | R(θ̂, θ)              | θ̂∗ : sup_{θ∈T} R(θ̂∗, θ) ≤ sup_{θ∈T} R(θ̂, θ)
Maximum likelihood | i, ii, iii        | l(θ)                  | θ̂∗ : l(θ̂∗) ≥ l(θ) ∀θ ∈ T
Robust estimation  | ii, iii, plus a nonparametric description of the stochastic process | based on l(θ) | appropriate saddle-point optimization
3.4 Summary of notations and abbreviations
• δ = {δk, k = 1, · · · ,M}A decision rule (or a set of possible actions)
• Γ = {Γk, k = 1, · · · ,M}The partitions of the observation space Γ corresponding to the hypotheses {Hk} andthe decision rule δ
• T = {Tk, k = 1, · · · ,M}The partitions of the parameter space T corresponding to the hypotheses {Hk} andthe decision rule δ
• {πi}A prior probability distribution for the hypotheses {Hi}
• π(θ)A prior probability density function for a scalar parameter θ
• π(θ)A prior probability density function for a vector parameter θ
• {πi(θ)}Conditional prior probability density functions for the vector parameter θ under thehypothesis Hi
• {ri(θ)} = {πi πi(θ)}Unconditional prior probability density functions for the vector parameter θ underthe hypothesis Hi
• fθ(x) = p(x|θ)Conditional probability density function of the observations for a given θ
• l(θ) = fθ(x) = p(x|θ)Likelihood function of θ for a given data x
• π(θ|x) = fθ(x) π(θ) / m(x) = p(x|θ) π(θ) / m(x)
Posterior probability density function of θ given the observations x

• m(x) = ∫ p(x|θ) π(θ) dθ
Marginal distribution of the observations x
• {cki}Penalty coefficients
• {cki(θ)} or {ck(θ)}Penalty functions
• {Pki(δ)}Conditional probabilities of the decision rule δ for a well known stochastic process
• {Pki,θ(δ)} or {Pk,θ(δ)}
Conditional probabilities of the decision rule δ for a parametrically known stochastic process
• {Qki(δ) = πiPki(δ)}Probabilities of the decisions in the decision rule δ
• Pe(δ) = ∑_{k≠i} Qki(δ)
Probability of error due to the decision rule δ

• Pd(δ) = ∑_k Qkk(δ) = 1 − Pe(δ)
Probability of correct detection due to the decision rule δ
• Pfa(δ) = Q10(δ)Probability of false alarm in a binary hypothesis testing
• Pfd(δ) = Q01(δ)Probability of false detection in a binary hypothesis testing
• Pe(δ) = Q01(δ) +Q10(δ)Probability of the error due to the decision rule δ in a binary hypothesis testing
• Pd(δ) = Q00(δ) +Q11(δ)Probability of the correct detection due to the decision rule δ in a binary hypothesistesting
• Conditional expected penalty or risk function

Ri(δ) = ∑_{k=1}^{M} cki Pki(δ) for a well-known stochastic process;

Rθ(δ) = ∑_{k=1}^{M} ck(θ) Pk,θ(δ) for a parametrically known stochastic process
• Expected penalty or Bayes risk function

ri(δ) = ∑_{k=1}^{M} cki Qki(δ) for a well-known stochastic process;

rθ(δ) = ∑_{k=1}^{M} ck(θ) Qk,θ(δ) for a parametrically known stochastic process
Chapter 4
Bayesian hypothesis testing
4.1 Introduction
Let us start by recalling and making precise the notations and definitions. First, we consider a general case and assume that there exists a parameter vector θ of finite dimension m and M stochastic processes, such that for every fixed value of θ ∈ T, the conditional distributions {Fθ,i(x) = Fi(x|θ), i = 1, · · · ,M} and their corresponding densities {fθ,i(x) = fi(x|θ), i = 1, · · · ,M} are well known for all values x ∈ Γ.
We also assume that we know the conditional prior probability distributions

{πi(θ) = π(θ|Hi), i = 1, · · · ,M},

their corresponding unconditional prior distributions

{ri(θ) = P (H = Hi) πi(θ) = πi πi(θ), i = 1, · · · ,M},

and the penalty functions {cki(θ)}. For a given decision rule δ(x) = {δj(x), j = 1, · · · ,M} we define the expected penalty
r(δ) = ∫_Γ dx ∑_{k=1}^{M} δk(x) ∫_T ∑_{i=1}^{M} cki(θ) fθ,i(x) ri(θ) dθ     (4.1)
Note that this general case reduces to the following two special cases:
• When the M stochastic processes corresponding to the M hypotheses are described through a single parametric stochastic process with M disjoint subdivisions {T1, · · · ,TM} of the parameter space T, then the quantities πi(θ) and ri(θ) both reduce to π(θ), and the quantities {fθ,i(x)} and {cki(θ)} reduce to {fθ(x)} and {ck(θ)}. We then have
r(δ) =
∫∫
Γdx
M∑
k=1
δk(x)
∫∫
Tck(θ)fθ(x)π(θ) dθ (4.2)
• When the M stochastic processes corresponding to the M hypotheses are all well known, we can eliminate θ from the above equations, and the quantities f_{θ,i}(x), r_i(θ) and {c_{ki}(θ)} reduce respectively to {f_i(x)}, π_i = Pr{H_i} and {c_{ki}}. We then have
\[ r(\delta) = \int_\Gamma dx \sum_{k=1}^{M} \delta_k(x) \sum_{i=1}^{M} c_{ki}\, f_i(x)\, \pi_i \tag{4.3} \]
Note that in all three cases above we can write
\[ r(\delta) = \int_\Gamma dx \sum_{k=1}^{M} \delta_k(x)\, g_k(x) \tag{4.4} \]
where g_k(x) is given by one of the following equations:
\[ g_k(x) = \int_T \sum_{i=1}^{M} c_{ki}(\theta)\, f_{\theta,i}(x)\, r_i(\theta)\, d\theta \tag{4.5} \]
\[ g_k(x) = \int_T c_k(\theta)\, f_\theta(x)\, \pi(\theta)\, d\theta \tag{4.6} \]
\[ g_k(x) = \sum_{i=1}^{M} c_{ki}\, f_i(x)\, \pi_i \tag{4.7} \]
4.2 Optimization problem
Now we have all the ingredients to write down the optimization problem of Bayesian hypothesis testing. Before starting, recall that any decision rule δ = {δ_j(x), j = 1, ..., M} satisfies
\[ \delta_k(x) \ge 0,\ k = 1, \dots, M \quad\text{and}\quad \sum_{k=1}^{M} \delta_k(x) = 1 \quad \forall x \in \Gamma \tag{4.8} \]
Now consider the following optimization problem:
\[ \text{Minimize } r(\delta) = \int_\Gamma dx \sum_{k=1}^{M} \delta_k(x)\, g_k(x) \tag{4.9} \]
\[ \text{subject to } \delta_k(x) \ge 0,\ k = 1, \dots, M, \quad \sum_{k=1}^{M} \delta_k(x) = 1,\ \forall x \in \Gamma \tag{4.10} \]
where {g_k(x), k = 1, ..., M} are nonnegative functions defined on Γ whose expression is given by either (4.5), (4.6) or (4.7).
This optimization problem does not, in general, have a unique solution, and there may exist a whole class D* of equivalent decision rules. Recall that two decision rules δ*_1 and δ*_2 are equivalent if
\[ r(\delta^*_1) = r(\delta^*_2) \le r(\delta) \quad \forall \delta \in D \tag{4.11} \]
The class D* includes both random and deterministic decision rules. For simplicity, one generally chooses a nonrandom decision rule. The δ_k are then either 0 or 1, depending on whether x ∈ Γ_k or x ∉ Γ_k, and the expression (4.9) of r(δ) becomes
\[ r(\delta) = \sum_k \int_{\Gamma_k} g_k(x)\, dx \tag{4.12} \]
Now, assuming that g_k(x) ≥ 0 ∀x ∈ Γ_k, the Bayesian hypothesis testing scheme consists of the following steps:
• Given the observations x, compute
\[ t(x) = \min_k g_k(x); \tag{4.13} \]
• Select a single k* = k(x) such that g_{k^*(x)}(x) = t(x);
• Define
\[ \delta^*_j(x) = \begin{cases} 1 & j = k^*(x) \\ 0 & j \ne k^*(x) \end{cases} \tag{4.14} \]
The function t(x) together with the index k(x) is called the test, and the statistical behavior of the pair [t(X), k(X)] is called the test statistics.
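The three steps above can be sketched in code. This is an illustrative sketch, not from the course text: the names `densities`, `priors` and `costs` are hypothetical, and the example uses two Gaussian hypotheses with uniform (0-1) costs, for which the rule reduces to picking the larger likelihood.

```python
import numpy as np

def bayes_decision(x, densities, priors, costs):
    # g_k(x) = sum_i c_{ki} f_i(x) pi_i  (eq. 4.7); the decision is k* = argmin_k g_k(x)
    f = np.array([fi(x) for fi in densities])   # f_i(x) for each hypothesis
    g = costs @ (f * priors)                    # vector of g_k(x), k = 1..M
    return int(np.argmin(g))                    # t(x) = min_k g_k(x) is attained at k*

# Toy example: H_0: N(0,1) vs H_1: N(2,1), uniform priors and 0-1 costs (eq. 4.23).
norm = lambda m: (lambda x: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi))
densities = [norm(0.0), norm(2.0)]
priors = np.array([0.5, 0.5])
costs = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
print(bayes_decision(-0.3, densities, priors, costs))  # -> 0
print(bayes_decision(1.8, densities, priors, costs))   # -> 1
```

With 0-1 costs, g_0(x) is proportional to f_1(x) and g_1(x) to f_0(x), so minimizing g_k is the same as maximizing the likelihood of the selected hypothesis.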
To go into more detail, we consider some examples, from general cases to more specific ones.
Let us start with a general case. Here we assume that during any n observations the transmitted signal is exactly one of the M possible signals {s_i, i = 0, ..., M−1}. Then we have M hypotheses:
\[ H_i : x \sim f_i(x), \quad i = 0, \dots, M-1 \tag{4.15} \]
Then we have
\[ g_k(x) = \sum_{i=1}^{M} c_{ki}\, f_i(x)\, \pi_i \tag{4.16} \]
and
\[ t(x) = \min_k g_k(x) = \min_k \sum_{i=1}^{M} c_{ki}\, f_i(x)\, \pi_i \tag{4.17} \]
Given the observation x, the search for an index k(x) that satisfies (4.17) can be realized via the differences
\[ \sum_{i=1}^{M} (c_{ki} - c_{li})\, f_i(x)\, \pi_i \tag{4.18} \]
The optimal index k*(x) is such that
\[ k^*(x) : \sum_{i=1}^{M} (c_{k^*i} - c_{li})\, f_i(x)\, \pi_i \le 0 \quad \forall l \tag{4.19} \]
If x is such that f_0(x) > 0 and if π_0 > 0, then we have
\[ k^*(x) : \sum_{i=1}^{M} (c_{k^*i} - c_{li})\, \frac{\pi_i}{\pi_0}\, \frac{f_i(x)}{f_0(x)} \le 0 \quad \forall l \tag{4.20} \]
The ratios {f_i(x)/f_0(x)} are called the likelihood ratios, so that the procedure to obtain the decisions now consists in comparing the likelihood ratios against some thresholds. The test induced by (4.20) consists of a weighted sum of the likelihood ratios; this weighted sum is called the test function. So, in general, the test function is compared with the threshold zero, which is independent of the observation.
Note also that we can rewrite (4.20) in the two following forms:
\[ k^*(x) : \sum_{i=1}^{M} (c_{k^*i} - c_{li})\, \frac{\pi_i(x)}{\pi_0(x)} \le 0 \quad \forall l \tag{4.21} \]
or
\[ k^*(x) : \sum_{i=1}^{M} c_{k^*i}\, \pi_i(x) \le \sum_{i=1}^{M} c_{li}\, \pi_i(x) \quad \forall l \tag{4.22} \]
The fractions π_i(x)/π_0(x) are the posterior likelihood ratios and c_l(x) = \sum_{i=1}^{M} c_{li}\, \pi_i(x) are the expected posterior penalties.
Further simplifications can be achieved with uniform cost functions
\[ c_{ki} = \begin{cases} 0 & \text{if } k = i \\ 1 & \text{if } k \ne i \end{cases} \tag{4.23} \]
and with uniform priors
\[ \pi_1 = \pi_2 = \cdots = \pi_M = \frac{1}{M}. \tag{4.24} \]
With these assumptions, the Bayesian decision rule becomes
\[ k^*(x) : L_{k^*}(x) \ge L_l(x) \quad \forall l \tag{4.25} \]
where
\[ L_i(x) = \frac{f_i(x)}{f_0(x)} \tag{4.26} \]
Now let us consider some special cases.
4.3 Examples
4.3.1 Radar applications
One of the oldest areas where detection-estimation theory has been used is radar. The main problem is to detect the presence of one of M known signals transmitted through the atmosphere. The transmission channel is the atmosphere, and it is assumed to be statistically well known. A simple model for the received signal is then
X(t) = S(t) +N(t) (4.27)
where N(t) is the additive noise due to the channel. In the discrete case, where we assume we observe n samples of the received signal over the time period [0, T], this model becomes
\[ X_j = S_j + N_j, \quad j = 1, \dots, n \tag{4.28} \]
or, in vector form,
\[ x = s + n. \tag{4.29} \]
Consider now the case where one of the signals, say s_0, is null. Then the M hypotheses become:
\[ \begin{cases} H_0 : x = n \\ H_i : x = s_i + n, & i = 1, \dots, M-1 \end{cases} \tag{4.30} \]
Assume also that we know the conditional probability density functions f_0(x) and f_i(x) of the received signal under the hypotheses H_0 and H_i. For zero-mean white Gaussian noise of variance σ², the likelihood ratios L_i(x) become
\[ L_i(x) = \frac{f_i(x)}{f_0(x)} = \frac{\exp\left[-\frac{1}{2\sigma^2}(x - s_i)^t (x - s_i)\right]}{\exp\left[-\frac{1}{2\sigma^2}\, x^t x\right]} = \exp\left[-\frac{1}{2\sigma^2}\left(-2 s_i^t x + s_i^t s_i\right)\right] \tag{4.31} \]
and k*(x) satisfies:
\[ k^*(x) : s_{k^*}^t x - \frac{1}{2}\, s_{k^*}^t s_{k^*} > s_l^t x - \frac{1}{2}\, s_l^t s_l \quad \forall l \tag{4.32} \]
Figure 4.1 shows the structure of this optimal test. If, moreover, all the signals have the same energy |s_i|² = s_i^t s_i, then we have
\[ k^*(x) : s_{k^*}^t x > s_l^t x \quad \forall l \tag{4.33} \]
Figure 4.2 shows the structure of this simplified test.
Figure 4.1: General structure of a Bayesian optimal detector: the observation x is passed through a bank of weighting correlators producing s_1^t x, ..., s_M^t x, constants c_1, ..., c_M are added, and a select-largest stage outputs k*(x).
Figure 4.2: Simplified structure of a Bayesian optimal detector: correlators s_1^t x, ..., s_M^t x followed directly by a select-largest stage.
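The select-largest structure of Figures 4.1-4.2 can be sketched as follows. This is an illustration, not from the text: the bank of three sinusoids and the noise level are hypothetical choices.

```python
import numpy as np

def correlate_and_select(x, signals):
    # Statistic of eq. (4.32): s_k^t x - s_k^t s_k / 2, followed by "select largest"
    stats = [s @ x - 0.5 * (s @ s) for s in signals]
    return int(np.argmax(stats))

rng = np.random.default_rng(0)
n = 64
t = np.arange(n)
signals = [np.sin(2 * np.pi * k * t / n) for k in (1, 2, 3)]  # three known waveforms
x = signals[1] + 0.5 * rng.standard_normal(n)                 # waveform 1 plus noise
print(correlate_and_select(x, signals))                       # -> 1
```

Since the three sinusoids are orthogonal over a full period, the correlator output for the transmitted waveform dominates by a wide margin at this noise level.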
4.3.2 Detection of a known signal in an additive noise
Consider now the case where there is only one signal. So that, we have a binary detectionproblem: {
H0 : x = n −→ f0(x) = f(x)H1 : x = s + n −→ f1(x) = f(x − s)
(4.34)
Then, we have:
L(x) =f1(x)
f0(x)(4.35)
General case:
The general optimal Bayesian detector compares the likelihood ratio L(x) with a threshold τ, deciding H_1 if L(x) > τ, H_0 if L(x) < τ, and either one at equality:

Figure 4.3: General Bayesian detector.
Case of white noise:
Now, if we assume that the noise is white, we have:
\[ \begin{cases} H_0 : x = n & \longrightarrow\ f_0(x) = \prod_j f(x_j) \\ H_1 : x = s + n & \longrightarrow\ f_1(x) = \prod_j f(x_j - s_j) \end{cases} \tag{4.36} \]
and consequently
\[ L(x) = \prod_j L_j(x_j) = \prod_j \frac{f(x_j - s_j)}{f(x_j)} \tag{4.37} \]
Figure 4.4: Bayesian detector in the case of i.i.d. data: compute log L_j(x_j) for each sample, sum over j = 1, ..., n, and compare the sum with the threshold.
Case of Gaussian noise:
In this case we have
\[ \begin{cases} H_0 : x = n & \longrightarrow\ f_0(x) \propto \exp\left[-\frac{1}{2\sigma^2}\, x^t x\right] \\ H_1 : x = s + n & \longrightarrow\ f_1(x) \propto \exp\left[-\frac{1}{2\sigma^2}\, (x-s)^t (x-s)\right] \end{cases} \tag{4.38} \]
\[ L(x) = \frac{f_1(x)}{f_0(x)} = \frac{\exp\left[-\frac{1}{2\sigma^2}(x-s)^t(x-s)\right]}{\exp\left[-\frac{1}{2\sigma^2}\, x^t x\right]} = \exp\left[-\frac{1}{2\sigma^2}\left(-2 s^t x + s^t s\right)\right] \tag{4.39} \]
We then have, with τ_1 = σ² log τ,
\[ \delta_B(x) = \begin{cases} 1 & \text{if } s^t\left(x - \frac{s}{2}\right) > \tau_1 \\ 0/1 & \text{if } s^t\left(x - \frac{s}{2}\right) = \tau_1 \\ 0 & \text{if } s^t\left(x - \frac{s}{2}\right) < \tau_1 \end{cases} \tag{4.40} \]
The detector then has the following structure: each sample x_j is shifted by −s_j/2, multiplied by s_j, the results are summed over j = 1, ..., n, and the sum is compared with τ_1.

Figure 4.5: General Bayesian detector in the case of i.i.d. Gaussian data.
Note that we can rewrite (4.40) as:
\[ \delta_B(x) = \begin{cases} 1 & \text{if } s^t x > \tau' \\ 0/1 & \text{if } s^t x = \tau' \\ 0 & \text{if } s^t x < \tau' \end{cases} \tag{4.41} \]
with the modified threshold τ' = τ_1 + ½ s^t s.

Figure 4.6: Simplified Bayesian detector in the case of i.i.d. Gaussian data: correlate x_j with s_j, sum over j, and compare with the threshold.
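The correlation detector (4.41) can be sketched as follows. This is a minimal illustration with hypothetical names; deterministic inputs are used so that both outcomes are easy to check.

```python
import numpy as np

def gaussian_detector(x, s, sigma2, tau=1.0):
    # Decide H_1 iff s^t x > tau', with tau' = sigma^2 log(tau) + s^t s / 2 (eq. 4.41)
    tau_prime = sigma2 * np.log(tau) + 0.5 * (s @ s)
    return int(s @ x > tau_prime)

s = np.ones(100)                                        # known signal
print(gaussian_detector(np.zeros(100), s, sigma2=1.0))  # -> 0 ("noise" at its mean)
print(gaussian_detector(s, s, sigma2=1.0))              # -> 1 (noise-free signal)
```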
Case of Laplacian noise:
\[ H_0 : x = n \longrightarrow f_0(x) = \prod_{j=1}^{n} \frac{\alpha}{2} \exp\left[-\alpha |x_j|\right] = \left(\frac{\alpha}{2}\right)^n \exp\left[-\alpha \sum_{j=1}^{n} |x_j|\right] \tag{4.42} \]
\[ H_1 : x = s + n \longrightarrow f_1(x) = \left(\frac{\alpha}{2}\right)^n \exp\left[-\alpha \sum_{j=1}^{n} |x_j - s_j|\right] \tag{4.43} \]
\[ L(x) = \prod_{j=1}^{n} L_j(x_j) \tag{4.44} \]
with
\[ L_j(x_j) = \frac{f_1(x_j)}{f_0(x_j)} = \exp\left[-\alpha |x_j - s_j| + \alpha |x_j|\right] = \begin{cases} \exp[-\alpha |s_j|] & \text{if } \operatorname{sgn}(s_j)\, x_j < 0 \\ \exp[\alpha\, \operatorname{sgn}(s_j)(2x_j - s_j)] & \text{if } 0 < \operatorname{sgn}(s_j)\, x_j < |s_j| \\ \exp[\alpha |s_j|] & \text{if } \operatorname{sgn}(s_j)\, x_j > |s_j| \end{cases} \tag{4.45} \]
where
\[ \operatorname{sgn}(x) = \begin{cases} +1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases} \tag{4.46} \]
Considering the two cases s_j < 0 and s_j > 0, we obtain the following structure: each sample x_j is shifted by −s_j/2, passed through a limiter nonlinearity, multiplied by sgn(s_j), summed over j, and the sum is compared with τ.

Figure 4.7: Bayesian detector in the case of i.i.d. Laplacian data.
4.4 Binary channel transmission
In digital signal transmission we generally have to transmit binary sequences. If we assume that the channel transmits each bit separately in a memoryless fashion, and that each bit s_j is transmitted correctly with probability q, then we can describe this channel graphically as in Figure 4.8: input bit 0 maps to output bit 0 with probability q and to output bit 1 with probability 1 − q, and symmetrically for input bit 1.

Figure 4.8: A binary channel.

A useful binary, memoryless and symmetric channel should have a probability of correct transmission higher than the probability of incorrect transmission, i.e. q > 0.5.
With these assumptions on the channel we can easily calculate the probability of observing x conditional on the transmitted sequence s:
\[ \Pr\{x|s\} = \prod_{j=1}^{n} \Pr\{x[j] \,|\, s[j]\} = \prod_{j=1}^{n} q^{1 - (x[j] \oplus s[j])} (1-q)^{x[j] \oplus s[j]} = q^n \left(\frac{1-q}{q}\right)^{\sum_{j=1}^{n} (x[j] \oplus s[j])} \tag{4.47} \]
where ⊕ denotes modulo-2 (binary) addition.
Now assume that, during each observation period, exactly one of M well-known binary sequences s_k (called codewords) is transmitted. We have received the binary sequence x and we want to know which codeword was transmitted.
Indeed, if we assume p_1 = p_2 = ... = p_M = 1/M and if we denote by
\[ H(s_k, s_l) = \frac{1}{n} \sum_{j=1}^{n} s_k[j] \oplus s_l[j] \tag{4.48} \]
the Hamming distance between the two binary words s_k and s_l, then the likelihood ratios have the following form:
\[ \frac{\Pr\{x|s_k\}}{\Pr\{x|s_l\}} = \left(\frac{1-q}{q}\right)^{\sum_{j=1}^{n} (x[j] \oplus s_k[j]) - \sum_{j=1}^{n} (x[j] \oplus s_l[j])} \tag{4.49} \]
and the Bayesian optimal test becomes:
\[ k^*(x) : \left(\frac{1-q}{q}\right)^{\sum_{j=1}^{n} (x[j] \oplus s_{k^*}[j]) - \sum_{j=1}^{n} (x[j] \oplus s_l[j])} \ge 1 \quad \forall l = 1, \dots, M \tag{4.50} \]
Taking the logarithm of both sides we obtain the following condition on k*(x):
\[ k^*(x) : \left[\sum_{j=1}^{n} (x[j] \oplus s_{k^*}[j]) - \sum_{j=1}^{n} (x[j] \oplus s_l[j])\right] \log\left(\frac{1-q}{q}\right) \ge 0 \quad \forall l = 1, \dots, M \tag{4.51} \]
We can then distinguish two cases:
• Case 1: Let q > 0.5, which means that the channel has a higher probability of transmitting correctly than incorrectly. Then (1 − q)/q < 1, so log((1 − q)/q) < 0, and k*(x) satisfies
\[ k^*(x) : \sum_{j=1}^{n} (x[j] \oplus s_{k^*}[j]) \le \sum_{j=1}^{n} (x[j] \oplus s_l[j]), \quad \forall l = 1, \dots, M \tag{4.52} \]
or equivalently
\[ k^* : H(x, s_{k^*}) \le H(x, s_l), \quad \forall l = 1, \dots, M \tag{4.53} \]
The test clearly decides in favor of the codeword s_{k*} whose Hamming distance from the observed sequence x is minimal. This is why this detector is called the minimum distance decoding scheme.
Let now the M codewords be designed so that the Hamming distance between any two of them equals (2d+1)/n, where d is a positive integer, i.e.
\[ H(s_k, s_l) = (2d+1)/n, \quad \forall k \ne l,\ k, l = 1, \dots, M \tag{4.54} \]
with d such that
\[ \sum_{i=0}^{d} \binom{n}{i} \le 2^n / M \tag{4.55} \]
Then, via the minimum distance decoding scheme, if the distance between the received word x and the codeword s_k is at most d/n, the codeword s_k is correctly detected, and we have
\[ P_d(s_k) \ge \sum_{i=0}^{d} \binom{n}{i} q^{n-i} (1-q)^i \tag{4.56} \]
\[ P_d = \sum_{k=1}^{M} (1/M)\, P_d(s_k) \ge \sum_{i=0}^{d} \binom{n}{i} q^{n-i} (1-q)^i \tag{4.57} \]
\[ P_e = 1 - P_d \le 1 - \sum_{i=0}^{d} \binom{n}{i} q^{n-i} (1-q)^i \tag{4.58} \]
• Case 2: In the case of q < 0.5, by the same analysis, the Bayesian detection scheme decides in favor of the codeword whose Hamming distance from the observed sequence is maximal. This is not surprising: if q < 0.5, then with probability 1 − q > 0.5 each bit is flipped in transmission, so more than half of the codeword bits are expected to be changed.
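Both the minimum distance decoder (4.53) and the bounds (4.56)-(4.58) are easy to check numerically. This sketch is illustrative: the repetition code and the values of n, d and q are arbitrary choices, not from the text.

```python
from math import comb

def hamming_distance(u, v):
    # Number of differing positions (n times H(u, v) of eq. 4.48)
    return sum(a != b for a, b in zip(u, v))

def min_distance_decode(x, codewords):
    # Decide in favor of the codeword closest to x in Hamming distance (eq. 4.53)
    return min(range(len(codewords)), key=lambda k: hamming_distance(x, codewords[k]))

def prob_correct_lower_bound(n, d, q):
    # P(at most d bit errors) = sum_{i<=d} C(n,i) q^(n-i) (1-q)^i  (eq. 4.56)
    return sum(comb(n, i) * q ** (n - i) * (1 - q) ** i for i in range(d + 1))

codewords = [(0, 0, 0, 0, 0), (1, 1, 1, 1, 1)]          # repetition code: n = 5, d = 2
print(min_distance_decode((0, 1, 0, 0, 1), codewords))  # two flips -> decoded as 0
Pd = prob_correct_lower_bound(n=5, d=2, q=0.95)
print(Pd, 1.0 - Pd)                                     # bounds (4.56) and (4.58)
```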
Chapter 5
Signal detection and structure of optimal detectors
In previous chapters we discussed some basic optimality criteria and design methods for general hypothesis testing problems. In this chapter we apply them to derive optimal procedures for the detection of signals corrupted by noise. We consider only the discrete case.
First, we summarize Bayesian composite hypothesis testing and focus on the binary case. Then we describe other related tests in this particular case. Finally, through some examples with different models for the signal and the noise, we derive the optimum detector structures.
At the end, we give some basic elements of robust, sequential and nonparametric detection.
5.1 Bayesian composite hypothesis testing
Consider the following composite hypothesis testing problem:
\[ X \sim f_\theta(x) \tag{5.1} \]
and define the decision rule δ(x) and its associated cost function c[δ(x), θ]. Then the conditional risk function is given by
\[ R_\theta(\delta) = E_\theta\{c[\delta(X), \theta]\} = E[c[\delta(X), \Theta] \,|\, \Theta = \theta] = \int_\Gamma c[\delta(x), \theta]\, f_\theta(x)\, dx \tag{5.2} \]
and the Bayes risk by
\[ r(\delta) = E[R_\Theta(\delta(X))] = E\big[E[c[\delta(X), \Theta] \,|\, \Theta]\big] = \int_\tau \int_\Gamma c[\delta(x), \theta]\, f_\theta(x)\, \pi(\theta)\, dx\, d\theta = \int_\Gamma \left[\int_\tau c[\delta(x), \theta]\, \pi(\theta|x)\, d\theta\right] f(x)\, dx = E\big[E[c[\delta(X), \Theta] \,|\, X]\big] \tag{5.3} \]
From this relation, and the fact that the cost function is in general positive, we can deduce that minimizing r(δ) over δ is equivalent to minimizing, for every x ∈ Γ, the mean posterior cost
\[ c[x] = E[c[\delta(X), \Theta] \,|\, X = x] = \int_\tau c[\delta(x), \theta]\, \pi(\theta|x)\, d\theta. \tag{5.4} \]
5.1.1 Case of binary composite hypothesis testing
In this case, we have
\[ \delta_B(x) = \begin{cases} 1 & \text{if } E[c[0,\theta] \,|\, X=x] > E[c[1,\theta] \,|\, X=x] \\ 0/1 & \text{if } E[c[0,\theta] \,|\, X=x] = E[c[1,\theta] \,|\, X=x] \\ 0 & \text{otherwise} \end{cases} \tag{5.5} \]
that is, we decide in favor of the hypothesis with the smaller posterior cost. If the two hypotheses correspond to two disjoint partitions of the parameter space τ = {T_0, T_1}, we have
\[ c[i, \theta] = c_{ij} \quad \text{if } \theta \in T_j \tag{5.6} \]
and the rule becomes
\[ \delta_B(x) = \begin{cases} 1 & \text{if } \frac{\Pr\{\theta \in T_1 | X = x\}}{\Pr\{\theta \in T_0 | X = x\}} > \frac{c_{10} - c_{00}}{c_{01} - c_{11}} \\ 0/1 & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.7} \]
Now, using Bayes' rule, we have
\[ \Pr\{\theta \in T_i | X = x\} = \frac{\Pr\{X = x | \theta \in T_i\}\, \Pr\{\theta \in T_i\}}{\sum_{i=0}^{1} \Pr\{X = x | \theta \in T_i\}\, \Pr\{\theta \in T_i\}}, \quad i = 0, 1. \tag{5.8} \]
The decision rule (5.7) becomes
\[ \delta_B(x) = \begin{cases} 1 & \text{if } L(x) = \frac{\Pr\{X = x | \theta \in T_1\}}{\Pr\{X = x | \theta \in T_0\}} > \frac{\pi_0}{\pi_1}\, \frac{c_{10} - c_{00}}{c_{01} - c_{11}} \\ 0/1 & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.9} \]
where
\[ \pi_i = \Pr\{\theta \in T_i\}, \quad i = 0, 1. \tag{5.10} \]
The conditional probability density functions of θ are denoted r_i(θ) and given by
\[ r_i(\theta) = \begin{cases} 0 & \text{if } \theta \notin T_i \\ p(\theta)/\pi_i & \text{if } \theta \in T_i \end{cases} \tag{5.11} \]
The expression of Pr{X = x | θ ∈ T_i} is then
\[ \Pr\{X = x \,|\, \theta \in T_i\} = \int_\tau f_\theta(x)\, r_i(\theta)\, d\theta = \frac{1}{\pi_i} \int_{T_i} f_\theta(x)\, p(\theta)\, d\theta = \frac{1}{\pi_i}\, E_i\{f_\Theta(x)\} \tag{5.12} \]
where E_i{f_Θ(x)} stands for the expectation under the hypothesis H_i. We can then rewrite (5.9) as:
\[ \delta_B(x) = \begin{cases} 1 & \text{if } L(x) = \frac{E_1\{f_\Theta(x)\}}{E_0\{f_\Theta(x)\}} > \frac{c_{10} - c_{00}}{c_{01} - c_{11}} \\ 0/1 & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.13} \]
With such hypotheses, the false alarm and correct detection probabilities become respectively
\[ P_F(\delta, \theta) = E_\theta\{\delta(X)\} \quad \text{for } \theta \in T_0 \tag{5.14} \]
\[ P_D(\delta, \theta) = E_\theta\{\delta(X)\} \quad \text{for } \theta \in T_1 \tag{5.15} \]
5.2 Uniformly most powerful (UMP) test
The α-level uniformly most powerful (UMP) test is defined as the solution of
\[ \max_\delta P_D(\delta, \theta) \quad \text{s.t.}\quad P_F(\delta, \theta) \le \alpha \tag{5.16} \]
Unfortunately, this optimization problem may not have a solution. In those cases, we can try to design a test by following the Neyman-Pearson scheme, which is summarized below.

Neyman-Pearson lemma:
Suppose the hypothesis H_0 is simple (T_0 has only one element θ_0), the hypothesis H_1 is composite, and we have a parametrically defined probability density function p(θ) for θ ∈ T_1. Then the most powerful α-level test for H_0 against H_1 has a unique critical region given by
\[ \Gamma_\theta = \{x \in \Gamma \,|\, f_\theta(x) > \tau\, f_{\theta_0}(x)\} \tag{5.17} \]
where τ depends on α. The corresponding test is given by
\[ \delta(x) = \begin{cases} 1 & \text{if } f_\theta(x) > \tau\, f_{\theta_0}(x) \\ 0/1 & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.18} \]
5.3 Locally most powerful (LMP) test
The UMP requirement is too strong, and the optimization problem may not have a solution in more general cases. To illustrate this test, let us consider the following case:
\[ \begin{cases} H_0 : \theta = \theta_0 \\ H_1 : \theta > \theta_0 \end{cases} \tag{5.19} \]
The α-level locally most powerful (LMP) test is based on the Taylor expansion of P_D(δ, θ) around the simple hypothesis parameter value θ_0:
\[ P_D(\delta, \theta) \simeq P_D(\delta, \theta_0) + (\theta - \theta_0) \left.\frac{\partial P_D(\delta, \theta)}{\partial \theta}\right|_{\theta=\theta_0} + O\big((\theta - \theta_0)^2\big) \tag{5.20} \]
Noting that P_F(δ) = P_D(δ, θ_0), the Neyman-Pearson problem
\[ \max P_D(\delta, \theta) \quad \text{s.t.}\quad P_F(\delta) \le \alpha \tag{5.21} \]
becomes
\[ \max \left.\frac{\partial P_D(\delta, \theta)}{\partial \theta}\right|_{\theta=\theta_0} \quad \text{s.t.}\quad P_F(\delta) \le \alpha \tag{5.22} \]
Noting that
\[ P_D(\delta, \theta) = E_\theta\{\delta(X)\} = \int_\Gamma \delta(x)\, f_\theta(x)\, dx \tag{5.23} \]
and assuming that f_θ(x) is sufficiently regular in the neighbourhood of θ_0, we can calculate
\[ P'_D(\delta, \theta_0) = \left.\frac{\partial P_D(\delta, \theta)}{\partial \theta}\right|_{\theta=\theta_0} = \int_\Gamma \delta(x) \left.\frac{\partial f_\theta(x)}{\partial \theta}\right|_{\theta=\theta_0} dx \tag{5.24} \]
In conclusion, the α-level locally most powerful (LMP) test is obtained in the same way as the α-level most powerful test, replacing f_θ(x) by f'_{θ_0}(x) = ∂f_θ(x)/∂θ |_{θ=θ_0}. The critical region of H_0 against H_1 is then given by
\[ \Gamma_\theta = \{x \in \Gamma \,|\, f'_{\theta_0}(x) > \tau\, f_{\theta_0}(x)\} \tag{5.25} \]
and the test becomes
\[ \delta(x) = \begin{cases} 1 & \text{if } f'_{\theta_0}(x) > \tau\, f_{\theta_0}(x) \\ \eta & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.26} \]
where τ and η depend on α.
5.4 Maximum likelihood test
In the absence of the aforementioned optimal tests, we can design a test based directly on the likelihood ratios:
\[ \delta(x) = \begin{cases} 1 & \text{if } \frac{\max_{\theta \in T_1} f_\theta(x)}{\max_{\theta \in T_0} f_\theta(x)} > \tau \\ \eta & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.27} \]
5.5 Examples of signal detection schemes
To illustrate the common structure of these test designs, consider the following signal detection problem:
\[ \begin{cases} H_0 : X_j = N_j \\ H_1 : X_j = S_j + N_j \end{cases} \quad j = 1, \dots, n \tag{5.28} \]
Assume that the noise is i.i.d. with known probability density function f(x):
\[ \begin{cases} H_0 : X_j = N_j & \longrightarrow\ f_0(x) = \prod_j f(x_j) \\ H_1 : X_j = S_j + N_j & \longrightarrow\ f_1(x) = \prod_j f(x_j - s_j) \end{cases} \quad j = 1, \dots, n \tag{5.29} \]
The likelihood ratio becomes
\[ L(x) = \prod_j L_j(x_j) \quad\text{with}\quad L_j(x_j) = \frac{f(x_j - s_j)}{f(x_j)} \tag{5.30} \]
and the test becomes
\[ \delta(x) = \begin{cases} 1 & \text{if } \sum_j \log L_j(x_j) > \log\tau \\ 0/1 & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.31} \]
Figure 5.1: The structure of the optimal detector for an i.i.d. noise model: compute log L_j(x_j) for each sample, sum over j = 1, ..., n, and compare the sum with the threshold.
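The structure of Figure 5.1 can be written generically, with the noise model entering only through its log-density. This is an illustrative sketch (the names are hypothetical); any additive constant in the log-density cancels in the ratio.

```python
import numpy as np

def llr_detector(x, s, log_f, log_tau=0.0):
    # Decide H_1 iff sum_j [log f(x_j - s_j) - log f(x_j)] > log(tau)  (eq. 5.31)
    llr = np.sum(log_f(x - s) - log_f(x))
    return int(llr > log_tau)

log_f_gauss = lambda u: -0.5 * u ** 2      # Gaussian log-density, sigma = 1, up to a constant
s = 0.5 * np.ones(50)
print(llr_detector(np.zeros(50), s, log_f_gauss))  # -> 0 (data at the H_0 mean)
print(llr_detector(s, s, log_f_gauss))             # -> 1 (data at the H_1 mean)
```

Swapping in a Laplacian or Cauchy log-density changes only the `log_f` argument, not the detector structure.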
5.5.1 Case of Gaussian noise
In this case we have
\[ f(x_j) \propto \exp\left[-\frac{1}{2\sigma^2}\, x_j^2\right] \tag{5.32} \]
and the log likelihood ratios become
\[ \log L_j(x_j) = \log\frac{f(x_j - s_j)}{f(x_j)} = \frac{1}{\sigma^2}\left[s_j\left(x_j - \frac{s_j}{2}\right)\right] \tag{5.33} \]
Setting τ_1 = σ² log τ, we have the following structure for the optimal detector:
Figure 5.2: The structure of the optimal detector for an i.i.d. Gaussian noise model: shift each x_j by −s_j/2, multiply by s_j, sum over j, and compare with τ_1.
Absorbing the constant term ½ Σ_j s_j² into the threshold, we set τ_2 = τ_1 + ½ Σ_{j=1}^{n} s_j² and obtain the scheme of Figure 5.3.

Figure 5.3: The simplified structure of the optimal detector for an i.i.d. Gaussian noise model: correlate x_j with s_j, sum over j, and compare with τ_2.
5.5.2 Laplacian noise
In this case we have
\[ f(x_j) \propto \exp\left[-\alpha |x_j|\right] \tag{5.34} \]
and the log likelihood ratios become
\[ \log L_j(x_j) = \log\frac{f(x_j - s_j)}{f(x_j)} = -\alpha |x_j - s_j| + \alpha |x_j| = \begin{cases} -\alpha |s_j| & \text{if } \operatorname{sgn}(s_j)\, x_j < 0 \\ \alpha\, \operatorname{sgn}(s_j)(2x_j - s_j) & \text{if } 0 < \operatorname{sgn}(s_j)\, x_j < |s_j| \\ \alpha |s_j| & \text{if } \operatorname{sgn}(s_j)\, x_j > |s_j| \end{cases} \tag{5.35} \]

Figure 5.4: Bayesian detector in the case of i.i.d. Laplacian data: shift each x_j by −s_j/2, pass through a limiter, multiply by sgn(s_j), sum over j, and compare with τ.
5.5.3 Locally optimal detectors
Consider the following problem:
\[ \begin{cases} H_0 : X_j = N_j \\ H_1 : X_j = N_j + \theta s_j, & \theta > 0 \end{cases} \tag{5.36} \]
Recall that the α-level uniformly optimal test for this problem is:
\[ \delta(x) = \begin{cases} 1 & \text{if } L_\theta(x) > \tau \\ \eta & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.37} \]
and the α-level locally optimal test for this problem is:
\[ \delta(x) = \begin{cases} 1 & \text{if } \left.\frac{\partial L_\theta(x)}{\partial \theta}\right|_{\theta=\theta_0} > \tau \\ \eta & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.38} \]
where τ and η depend on α. For the i.i.d. noise model we have
\[ L_\theta(x) = \prod_{j=1}^{n} \frac{f(x_j - \theta s_j)}{f(x_j)} \tag{5.39} \]
Then it is easy to show that
\[ \left.\frac{\partial \log L_\theta(x)}{\partial \theta}\right|_{\theta=\theta_0} = \sum_{j=1}^{n} s_j\, g_{lo}(x_j) \tag{5.40} \]
where
\[ g_{lo}(x) = -\frac{\partial f(x)/\partial x}{f(x)} = -\frac{f'(x)}{f(x)} \tag{5.41} \]
and the structure of a locally optimal detector is given in the following figure.
Figure 5.5: The structure of a locally optimal detector for an i.i.d. noise model: pass each x_j through the nonlinearity g_lo(·), multiply by s_j, sum over j, and compare with τ.
It is clear from (5.41) that if we know the expression of the noise probability density function f(x), we can easily determine the expression of g_lo(x). For example:
• For a Gaussian noise model we have
\[ f(x) \propto \exp\left[-\frac{1}{2\sigma^2}\, x^2\right] \longrightarrow g_{lo}(x) = \frac{1}{\sigma^2}\, x \tag{5.42} \]
Figure 5.6: The structure of a locally optimal detector for an i.i.d. Gaussian noise model.
• For a Laplacian noise model we have
\[ f(x) \propto \exp\left[-\alpha |x|\right] \longrightarrow g_{lo}(x) = \alpha\, \operatorname{sgn}(x) \tag{5.43} \]
Figure 5.7: The structure of a locally optimal detector for an i.i.d. Laplacian noise model.
• For a Cauchy noise model we have
\[ f(x) \propto \frac{1}{1 + x^2} \longrightarrow g_{lo}(x) = \frac{2x}{1 + x^2} \tag{5.44} \]
Figure 5.8: The structure of a locally optimal detector for an i.i.d. Cauchy noise model.
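The three nonlinearities (5.42)-(5.44) differ mainly in how they treat large samples; a small numerical sketch (the outlier value is an arbitrary illustration):

```python
import numpy as np

def g_lo_gaussian(x, sigma2=1.0):
    return x / sigma2                      # eq. (5.42): linear

def g_lo_laplacian(x, alpha=1.0):
    return alpha * np.sign(x)              # eq. (5.43): hard limiter

def g_lo_cauchy(x):
    return 2.0 * x / (1.0 + x ** 2)        # eq. (5.44): redescending

def lo_statistic(x, s, g_lo):
    # Locally optimal statistic sum_j s_j g_lo(x_j)  (eq. 5.40)
    return float(np.sum(s * g_lo(x)))

x = np.array([0.1, -0.2, 10.0])            # the last sample is an outlier
s = np.ones(3)
print(lo_statistic(x, s, g_lo_gaussian))   # the outlier dominates the statistic
print(lo_statistic(x, s, g_lo_cauchy))     # the outlier contributes almost nothing
```

This is the practical reason heavy-tailed noise models lead to more robust detectors: the induced nonlinearity bounds or shrinks the influence of extreme samples.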
5.6 Detection of signals with unknown parameter
Here we consider the case where the transmitted signals depend on some unknown parameter θ:
\[ \begin{cases} H_0 : X_j = N_j + S_{0j}(\theta) \\ H_1 : X_j = N_j + S_{1j}(\theta) \end{cases}, \quad N \sim f(n) \tag{5.45} \]
The main quantity that we need to calculate is
\[ L(x) = \frac{E_1\{f(x - s_1(\theta))\}}{E_0\{f(x - s_0(\theta))\}} \tag{5.46} \]
In the following we consider the particular case of s_0 = 0, s_1 = s(θ) and i.i.d. noise. Then we have
\[ \begin{cases} H_0 : X_j = N_j \\ H_1 : X_j = N_j + S_j(\theta) \end{cases}, \quad N_j \sim f(n_j) \tag{5.47} \]
\[ L(x) = \int_\tau \frac{f(x - s(\theta))}{f(x)}\, p(\theta)\, d\theta = \int_\tau \prod_{j=1}^{n} \frac{f(x_j - s_j(\theta))}{f(x_j)}\, p(\theta)\, d\theta = \int_\tau L_\theta(x)\, p(\theta)\, d\theta \tag{5.48} \]
Example: Noncoherent detection.
Consider an amplitude modulated signal
\[ s_j(\theta) = a_j \sin[(j-1)\omega_c T_s + \theta], \quad j = 1, \dots, n, \qquad \omega_c T_s = m\, \frac{2\pi}{n} \tag{5.49} \]
where ω_c is the carrier frequency, T_s is the sampling period, m is the number of periods in the observation time [0, T = nT_s], and n/m is the number of samples per cycle. The carrier phase θ is unknown. With a uniform prior p(θ) = 1/(2π) and the i.i.d. Gaussian noise assumption we have
\[ L(x) = \frac{1}{2\pi} \int_0^{2\pi} \exp\left[\frac{1}{\sigma^2}\left(\sum_{j=1}^{n} x_j s_j(\theta) - \frac{1}{2}\sum_{j=1}^{n} s_j^2(\theta)\right)\right] d\theta \tag{5.50} \]
Using the trigonometric relations
\[ \sin(a+b) = \cos a \sin b + \sin a \cos b \tag{5.51} \]
\[ \sin^2(a) = \frac{1}{2} - \frac{1}{2}\cos(2a) \tag{5.52} \]
we obtain
\[ \sum_{j=1}^{n} x_j s_j(\theta) = x_c \sin\theta + x_s \cos\theta \tag{5.53} \]
with
\[ x_c \stackrel{\text{def}}{=} \sum_{j=1}^{n} a_j x_j \cos[(j-1)\omega_c T_s] \tag{5.54} \]
\[ x_s \stackrel{\text{def}}{=} \sum_{j=1}^{n} a_j x_j \sin[(j-1)\omega_c T_s] \tag{5.55} \]
From (5.49) and (5.52) we have
\[ \sum_{j=1}^{n} s_j^2(\theta) = \frac{1}{2}\sum_{j=1}^{n} a_j^2 - \frac{1}{2}\sum_{j=1}^{n} a_j^2 \cos[2(j-1)\omega_c T_s + 2\theta] \tag{5.56} \]
The second term is in general either equal to zero or negligible with respect to the first term.
Setting \overline{a^2} = \frac{1}{n}\sum_{j=1}^{n} a_j^2, we obtain
\[ L(x) = \exp\left[-\frac{n\overline{a^2}}{4\sigma^2}\right] \frac{1}{2\pi} \int_0^{2\pi} \exp\left[\frac{1}{\sigma^2}\left(x_c \sin\theta + x_s \cos\theta\right)\right] d\theta = \exp\left[-\frac{n\overline{a^2}}{4\sigma^2}\right] I_0\left(\frac{r}{\sigma^2}\right) \tag{5.57-5.58} \]
where r² = x_c² + x_s² and I_0 is the zeroth-order modified Bessel function, which is monotone increasing. Then the detection rule
\[ \delta(x) = \begin{cases} 1 & \text{if } L(x) > \tau \\ \gamma & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \longrightarrow \delta(x) = \begin{cases} 1 & \text{if } r > \tau' = \sigma^2\, I_0^{-1}\left(\tau \exp\left[\frac{n\overline{a^2}}{4\sigma^2}\right]\right) \\ \gamma & \text{if equality holds} \\ 0 & \text{otherwise} \end{cases} \tag{5.59} \]
The structure of this detector is given in the following figure.
Figure 5.9: Noncoherent (quadrature) detector: a local oscillator supplies cos[(j−1)ω_c T_s] and its π/2-shifted version sin[(j−1)ω_c T_s]; each branch multiplies by a_j x_j, sums over j, and squares; the squared sums are added to form r², which is compared with τ'².
Performance analysis:
We need to calculate probabilities of the form
\[ P_j(R > \tau') = P_j(R^2 > \tau'^2), \quad j = 0, 1 \tag{5.60} \]
with R² = X_c² + X_s².
Note that X_c and X_s are linear combinations of the X_j. So, if the X_j are Gaussian, X_c and X_s are Gaussian too. Under the hypothesis H_0, we have
\[ E[X_c|H_0] = E[X_s|H_0] = 0 \tag{5.61} \]
\[ \operatorname{Var}[X_c|H_0] = \operatorname{Var}[X_s|H_0] = \frac{n\sigma^2\overline{a^2}}{2} \tag{5.62} \]
\[ \operatorname{Cov}\{X_s, X_c|H_0\} = 0 \tag{5.63} \]
and
\[ P_0(\Gamma_1) = \iint_{x_c^2 + x_s^2 \ge \tau'^2} \frac{1}{n\pi\sigma^2\overline{a^2}} \exp\left[-\frac{1}{n\sigma^2\overline{a^2}}\left(x_c^2 + x_s^2\right)\right] dx_c\, dx_s \tag{5.64} \]
With the cartesian-to-polar coordinate change (x_c, x_s) → (r, θ) we obtain
\[ P_0(\Gamma_1) = \frac{1}{n\pi\sigma^2\overline{a^2}} \int_0^{2\pi} \int_{\tau'}^{\infty} r \exp\left[-\frac{r^2}{n\sigma^2\overline{a^2}}\right] dr\, d\theta = \exp\left[-\frac{\tau'^2}{n\sigma^2\overline{a^2}}\right] \tag{5.65} \]
Under the hypothesis H_1, noting that for a given value of θ, x|θ ∼ N(s(θ), σ²I), we have
\[ E[X_c|H_1, \Theta = \theta] = \frac{n\overline{a^2}}{2}\sin\theta \tag{5.66} \]
\[ E[X_s|H_1, \Theta = \theta] = \frac{n\overline{a^2}}{2}\cos\theta \tag{5.67} \]
\[ \operatorname{Var}[X_c|H_1, \Theta = \theta] = \operatorname{Var}[X_s|H_1, \Theta = \theta] = \frac{n\sigma^2\overline{a^2}}{2} \tag{5.68} \]
\[ \operatorname{Cov}\{X_s, X_c|H_1, \Theta = \theta\} = 0 \tag{5.69} \]
and
\[ p(x_c, x_s|H_1) = \frac{1}{2\pi} \int_0^{2\pi} \frac{1}{n\pi\sigma^2\overline{a^2}} \exp\left[-\frac{1}{n\sigma^2\overline{a^2}}\, q(x_c, x_s; n\overline{a^2}/2, \theta)\right] d\theta = p(x_c, x_s|H_0) \exp\left[-\frac{n\overline{a^2}}{4\sigma^2}\right] I_0\left(\frac{r}{\sigma^2}\right) \tag{5.70-5.71} \]
The detection probability then becomes
\[ P_D(\delta) = P_1(\Gamma_1) = \iint_{x_c^2 + x_s^2 \ge \tau'^2} p(x_c, x_s|H_1)\, dx_c\, dx_s = \frac{\exp\left[-\frac{n\overline{a^2}}{4\sigma^2}\right]}{n\pi\sigma^2\overline{a^2}} \int_0^{2\pi} \int_{\tau'}^{\infty} r \exp\left[-\frac{r^2}{n\sigma^2\overline{a^2}}\right] I_0\left(\frac{r}{\sigma^2}\right) dr\, d\theta \tag{5.72-5.73} \]
Setting b² = n\overline{a^2}/(2σ²) and τ_0 = τ'/(bσ²), and changing the variable to x = r/(bσ²), we obtain
\[ P_D(\delta) = P_1(\Gamma_1) = \int_{\tau_0}^{\infty} x \exp\left[-\frac{1}{2}\left(x^2 + b^2\right)\right] I_0(bx)\, dx \stackrel{\text{def}}{=} Q(b, \tau_0) \tag{5.74} \]
Q(b, τ_0) is called Marcum's Q-function. Note also that P_F(δ) = Q(0, τ_0). So for an α-level Neyman-Pearson detection test we have τ' = [nσ²\overline{a^2} \log(1/α)]^{1/2} and the probability of detection is given by
\[ P_D(\delta) = Q\left[b,\ \left(2\log(1/\alpha)\right)^{1/2}\right] \tag{5.75} \]
Note also that
\[ E\left[\frac{1}{n}\sum_{j=1}^{n} s_j^2(\theta)\right] = \overline{a^2}/2 \tag{5.76} \]
so b² = n\overline{a^2}/(2σ²) is a measure of the signal-to-noise ratio.
5.7 Sequential detection
\[ \Gamma = \{X_j,\ j = 1, 2, \dots\} \tag{5.77} \]
\[ \begin{cases} H_0 : X_j \sim P_0 \\ H_1 : X_j \sim P_1 \end{cases} \tag{5.78} \]
A sequential decision rule is a pair of sequences (Δ, δ) where:
• Δ = {Δ_j, j = 0, 1, 2, ...} is the sequence of stopping rules;
• δ = {δ_j, j = 0, 1, 2, ...} is the sequence of decision rules;
• Δ_j(x_1, ..., x_j) is a function from IR^j to {0, 1};
• δ_j(x_1, ..., x_j) is a decision rule on (IR^j, B^j);
• if Δ_n(x_1, ..., x_n) = 0, we take another sample;
• if Δ_n(x_1, ..., x_n) = 1, we stop sampling and make a decision;
• N = min{n | Δ_n(x_1, ..., x_n) = 1} is the stopping time;
• δ_N(x_1, ..., x_N) is the terminal decision rule;
• (Δ_0, δ_0) corresponds to the situation where we have not yet observed any data: Δ_0 = 0 means take at least one sample before making a decision, and Δ_0 = 1 means make a decision without taking any sample.
Note that N is a random variable depending on the data sequence. The terminal decision rule δ_N(x_1, ..., x_N) tells us which decision to make when we stop sampling.
The fixed-sample-size N decision rule can be defined as the following sequential detection rule:
\[ \Delta_j(x_1, \dots, x_j) = \begin{cases} 0 & \text{if } j \ne N \\ 1 & \text{if } j = N \end{cases} \tag{5.79} \]
\[ \delta_j(x_1, \dots, x_j) = \begin{cases} \delta(x_1, \dots, x_N) & \text{if } j = N \\ \text{arbitrary} & \text{if } j \ne N \end{cases} \tag{5.80} \]
In the following we consider only binary hypothesis testing, and we analyse the Bayesian approach with the prior distribution {π_0 = 1 − π_1, π_1} and the uniform cost function. We assume that an infinite number of i.i.d. observations is at our disposal; however, we assign a cost c > 0 to each sample, so that the cost of taking n samples is nc.
With these assumptions, the conditional risks for a given sequential decision rule are:
\[ R_0(\Delta, \delta) = E_0\{\delta_N(X_1, \dots, X_N)\} + c\, E_0\{N\} \tag{5.81} \]
\[ R_1(\Delta, \delta) = 1 - E_1\{\delta_N(X_1, \dots, X_N)\} + c\, E_1\{N\} \tag{5.82} \]
where the subscripts denote the hypothesis under which the expectation is computed and N is the stopping time. The Bayes risk is thus given by
\[ r(\Delta, \delta) = (1 - \pi_1)\, R_0(\Delta, \delta) + \pi_1\, R_1(\Delta, \delta) \tag{5.83} \]
and the sequential Bayesian rule is the one which minimizes r(Δ, δ). To analyse the structure of this optimal rule we define
\[ V^*(\pi_1) \stackrel{\text{def}}{=} \min_{(\Delta, \delta):\, \Delta_0 = 0} r(\Delta, \delta), \quad 0 \le \pi_1 \le 1. \tag{5.84} \]

Figure 5.10: Sequential detection: the concave curve V*(π_1), with V*(0) = V*(1) = c, intersects the two no-sample risk lines at the abscissae π_L and π_U.
Since Δ_0 = 0 means that the test does not stop with zero observations, V*(π_1) corresponds to the minimum Bayes risk over all sequential tests that take at least one sample. V*(π_1) is in general concave and continuous, and V*(0) = V*(1) = c. Now let us plot this function together with the two specific sequential tests:
• take no sample and decide H_0, i.e., Δ_0 = 1, δ_0 = 0; and
• take no sample and decide H_1, i.e., Δ_0 = 1, δ_0 = 1.
The Bayes risks for these tests are
\[ r(\Delta, \delta)|_{\Delta_0=1,\, \delta_0=0} = \pi_1, \qquad r(\Delta, \delta)|_{\Delta_0=1,\, \delta_0=1} = 1 - \pi_1 \]
since deciding H_0 without sampling is wrong exactly when H_1 holds, and conversely. These are the only two Bayesian tests not included in the minimization of (5.84). We denote by π_L and π_U respectively the abscissae of the intersections of these two risk lines with V*(π_1).
Now, by inspection of these plots, we see that the Bayes rule with a fixed given prior π_1 is:
• (Δ_0 = 1, δ_0 = 0) if π_1 ≤ π_L;
• (Δ_0 = 1, δ_0 = 1) if π_1 ≥ π_U;
• the decision rule which minimizes the Bayes risk among all tests with Δ_0 = 0, if π_L ≤ π_1 ≤ π_U.
In the first two cases the test stops. In the third, we know that the optimal test takes at least one more sample. After doing so, we face a similar situation as before, except that we now have more information due to the additional sample. In particular, the prior π_1 is replaced by π_1(x_1) = Pr{H = H_1 | X_1 = x_1}, the posterior probability of H_1 given the observation X_1 = x_1. We can apply this argument to any number of samples. We then have the following rules:
• Stopping rule:
\[ \Delta_n(x_1, \dots, x_n) = \begin{cases} 0 & \text{if } \pi_L < \pi_1(x_1, \dots, x_n) < \pi_U \\ 1 & \text{otherwise.} \end{cases} \tag{5.85} \]
• Terminal decision rule:
\[ \delta_n(x_1, \dots, x_n) = \begin{cases} 0 & \text{if } \pi_1(x_1, \dots, x_n) \le \pi_L \\ 1 & \text{if } \pi_1(x_1, \dots, x_n) \ge \pi_U. \end{cases} \tag{5.86} \]
It has been proved that under mild conditions the posterior probability π_1(x_1, ..., x_n) converges almost surely to 1 under H_1 and to 0 under H_0; thus the test terminates with probability one. Knowledge of the probabilities π_L and π_U, together with an algorithm to compute π_1(x_1, ..., x_n), is sufficient to define this rule. The computation of π_1(x_1, ..., x_n) is quite easy, but unfortunately it is very difficult to obtain π_L and π_U exactly.
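The posterior π_1(x_1, ..., x_n) just mentioned can be computed directly from the likelihood ratio, as in (5.87) below. The following sketch is illustrative (the Bernoulli model and its parameters are hypothetical choices):

```python
def posterior_h1(xs, f0, f1, pi1):
    # pi_1(x_1..x_n) = pi_1 lambda_n / (pi_0 + pi_1 lambda_n)  (eqs. 5.87-5.88)
    lam = 1.0
    for x in xs:
        lam *= f1(x) / f0(x)               # running likelihood ratio lambda_n
    return pi1 * lam / ((1.0 - pi1) + pi1 * lam)

# Bernoulli observations: H0: P(x=1) = 0.2, H1: P(x=1) = 0.8.
f0 = lambda x: 0.2 if x == 1 else 0.8
f1 = lambda x: 0.8 if x == 1 else 0.2
print(posterior_h1([1, 1, 0, 1], f0, f1, pi1=0.5))  # mostly ones: posterior favors H1
```

In a long run, the likelihood ratio should be accumulated in log form to avoid underflow; the plain product is kept here for readability.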
Now consider the case where the two processes P_0 and P_1 have densities f_0 and f_1. Then the Bayes formula yields
\[ \pi_1(x_1, \dots, x_n) = \frac{\pi_1 \prod_{j=1}^{n} f_1(x_j)}{\pi_0 \prod_{j=1}^{n} f_0(x_j) + \pi_1 \prod_{j=1}^{n} f_1(x_j)} = \frac{\pi_1\, \lambda_n(x_1, \dots, x_n)}{\pi_0 + \pi_1\, \lambda_n(x_1, \dots, x_n)} \tag{5.87} \]
where
\[ \lambda_n(x_1, \dots, x_n) = \prod_{j=1}^{n} \frac{f_1(x_j)}{f_0(x_j)} \tag{5.88} \]
is the likelihood ratio based on n samples. Noting that π_1(x_1, ..., x_n) is an increasing function of λ_n(x_1, ..., x_n), we can rewrite (5.85) and (5.86) as:
Figure 5.11: Stopping rule in sequential detection: sampling continues while the posterior probability π_1(x_1, ..., x_n) remains between π_L and π_U, and stops (here at N = 6) when it crosses a boundary.
• Stopping rule:
\[ \Delta_n(x_1, \dots, x_n) = \begin{cases} 0 & \text{if } \underline{\pi} < \lambda_n(x_1, \dots, x_n) < \overline{\pi} \\ 1 & \text{otherwise.} \end{cases} \tag{5.89} \]
• Terminal decision rule:
\[ \delta_n(x_1, \dots, x_n) = \begin{cases} 0 & \text{if } \lambda_n(x_1, \dots, x_n) \le \underline{\pi} \\ 1 & \text{if } \lambda_n(x_1, \dots, x_n) \ge \overline{\pi}. \end{cases} \tag{5.90} \]
where
\[ \underline{\pi} \stackrel{\text{def}}{=} \frac{\pi_0\, \pi_L}{\pi_1 (1 - \pi_L)} \quad\text{and}\quad \overline{\pi} \stackrel{\text{def}}{=} \frac{\pi_0\, \pi_U}{\pi_1 (1 - \pi_U)}. \tag{5.91} \]
In conclusion, the Bayesian sequential test takes samples as long as the likelihood ratio λ_n(x_1, ..., x_n) stays inside the interval [\underline{\pi}, \overline{\pi}], and decides H_0 or H_1 as soon as it falls outside this interval.
The main problem in practical situations is to fix the values of the boundaries a = \underline{\pi} and b = \overline{\pi}. This test is called the sequential probability ratio test with boundaries a and b and is denoted SPRT(a, b).
The following theorem gives some of the optimality properties of SPRT(a, b).

Figure 5.12: Stopping rule in SPRT(a, b): sampling continues while the likelihood ratio λ_n(x_1, ..., x_n) remains between a and b (here the test stops at N = 6).

Wald-Wolfowitz theorem: Denote by
\[ N(\Delta) = \min\{n \,|\, \Delta_n(x_1, \dots, x_n) = 1\} \]
the stopping time and by
\[ P_F(\Delta, \delta) = \Pr\{\delta_N(x_1, \dots, x_N) = 1 \,|\, H = H_0\}, \qquad P_M(\Delta, \delta) = \Pr\{\delta_N(x_1, \dots, x_N) = 0 \,|\, H = H_1\} \]
the false alarm and miss probabilities, and let (Δ*, δ*) be the SPRT(a, b). Then, for any sequential decision rule (Δ, δ) for which
\[ P_F(\Delta, \delta) \le P_F(\Delta^*, \delta^*) \quad\text{and}\quad P_M(\Delta, \delta) \le P_M(\Delta^*, \delta^*) \]
we have
\[ E[N(\Delta) \,|\, H = H_j] \ge E[N(\Delta^*) \,|\, H = H_j], \quad j = 0, 1. \]
The validity of the Wald-Wolfowitz theorem is a consequence of the Bayes optimality of SPRT(a, b). The results of this theorem and other related theorems are summarized in the following items:
• For a given performance, no sequential decision rule has a smaller expected sample size than the SPRT(a, b) achieving that performance.
• The average sample size of SPRT(a, b) is not greater than the sample size of a fixed-sample-size test with the same performance.
• For a given expected sample size, no sequential decision rule has smaller error probabilities than the SPRT(a, b).
Two main questions remains:
• How to choose a and b to yield a desired level of performance?
• How to evaluate the expected sample size of a sequential detector?
The following result gives an answer to the first one.
Let (∆, δ) = SPART (a, b) with a < 1 < b, and let α = P_F(∆, δ), γ = P_M(∆, δ) and N = N(∆). The rejection region of (∆, δ) is

    Γ_1 = { x ∈ IR^∞ : λ_N(x_1, ..., x_N) ≥ b } = ∪_{n=1}^∞ Q_n    (5.92)

with

    Q_n = { x ∈ IR^∞ : N = n, λ_n(x_1, ..., x_n) ≥ b }    (5.93)

Since Q_n and Q_m are mutually exclusive sets for m ≠ n, we have

    α = Pr{ λ_N(x_1, ..., x_N) ≥ b | H = H_0 } = ∑_{n=1}^∞ ∫_{Q_n} ∏_{j=1}^n f_0(x_j) dx_j    (5.94)

On Q_n we have

    ∏_{j=1}^n f_0(x_j) ≤ (1/b) ∏_{j=1}^n f_1(x_j)    (5.95)

So we obtain

    α ≤ (1/b) ∑_{n=1}^∞ ∫_{Q_n} ∏_{j=1}^n f_1(x_j) dx_j = (1/b) Pr{ λ_N(x_1, ..., x_N) ≥ b | H = H_1 } = (1/b)(1 − γ)    (5.96)

and in the same manner we obtain

    γ = Pr{ λ_N(x_1, ..., x_N) ≤ a | H = H_1 } ≤ a (1 − α)    (5.97)

From these two relations we deduce

    b ≤ (1 − γ)/α,   a ≥ γ/(1 − α)    (5.98)

The following choice is called Wald's approximation:

    b ≃ (1 − γ)/α,   a ≃ γ/(1 − α)    (5.99)
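Wald's approximation can be checked by simulation. The following sketch (not from the text; the Gaussian shift pair H0: N(0, 1) versus H1: N(1, 1) and all numeric values are illustrative choices) runs the sequential test with the thresholds of (5.99) and estimates the resulting error probabilities, which should come out near (in fact, because of the overshoot, at or below) the targets.

```python
import math, random

def sprt(sample_fn, a, b, mu0=0.0, mu1=1.0, max_n=100000):
    """Sequential test with thresholds (a, b) for H0: N(mu0,1) vs H1: N(mu1,1).
    Returns 1 (accept H1, lambda_n >= b) or 0 (accept H0, lambda_n <= a)."""
    log_lam = 0.0
    for _ in range(max_n):
        x = sample_fn()
        # increment of log lambda_n: log f1(x) - log f0(x) for unit variance
        log_lam += (mu1 - mu0) * x - 0.5 * (mu1**2 - mu0**2)
        if log_lam >= math.log(b):
            return 1
        if log_lam <= math.log(a):
            return 0
    return 1 if log_lam >= 0 else 0   # safety fallback, essentially never hit

alpha_t, gamma_t = 0.05, 0.05        # target P_F and P_M
b = (1 - gamma_t) / alpha_t          # Wald's approximation (5.99)
a = gamma_t / (1 - alpha_t)

rng = random.Random(0)
trials = 2000
false_alarms = sum(sprt(lambda: rng.gauss(0.0, 1.0), a, b) == 1
                   for _ in range(trials))
misses = sum(sprt(lambda: rng.gauss(1.0, 1.0), a, b) == 0
             for _ in range(trials))
print(false_alarms / trials, misses / trials)
```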
To be completed later
5.8 Robust detection
To be completed later
Chapter 6
Elements of parameter estimation
6.1 Bayesian parameter estimation
Throughout this chapter we assume that the data are samples of a parametrically known process {P_θ; θ ∈ τ}, where P_θ denotes a distribution on the observation space (Γ, G):

    X ∼ P_θ(x)    (6.1)

The goal of the parameter estimation problem is to find a function θ̂(x) : Γ → τ such that θ̂(x) is the best guess of the true value of θ. Of course, the solution depends on a goodness criterion. As in the hypothesis testing problems, we have to define a cost function c[θ̂(X), θ] : τ × τ → IR+ such that c[a, θ] is the cost of estimating the true value θ by a.
Then, as in the hypothesis testing problems, we can define the conditional risk function

    R_θ(θ̂) = E_θ{ c[θ̂(X), θ] } = E[ c[θ̂(X), Θ] | Θ = θ ] = ∫_Γ c[θ̂(x), θ] f_θ(x) dx    (6.2)

and the Bayes risk

    r(θ̂) = E[ R_Θ(θ̂(X)) ]
         = ∫_τ ∫_Γ c[θ̂(x), θ] f_θ(x) π(θ) dx dθ
         = ∫_Γ [ ∫_τ c[θ̂(x), θ] π(θ|x) dθ ] m(x) dx
         = E[ E[ c[θ̂(X), Θ] | X = x ] ]    (6.3)

where m(x) = ∫_τ f_θ(x) π(θ) dθ is the marginal distribution of X. From this relation, and the fact that in general the cost function is positive, we see that r(θ̂) is minimized over θ̂ when, for any x ∈ Γ, the mean posterior cost

    c̄[x] = E[ c[θ̂(X), Θ] | X = x ] = ∫_τ c[θ̂(x), θ] π(θ|x) dθ    (6.4)

is minimized.
It is clear that the resulting estimate depends on the choice of the cost function. In the following sections we first consider the case of a scalar parameter and then extend it to the vector parameter case.
6.1.1 Minimum-Mean-Squared-Error
In the case where τ = IR, a commonly used cost function is

    c[a, θ] = c(a − θ) = (a − θ)²,   (a, θ) ∈ IR²    (6.5)

The corresponding Bayes risk is E[(θ̂(X) − Θ)²], a quantity known as the Mean-Squared-Error (MSE). The corresponding Bayes estimate is called the Minimum-Mean-Squared-Error (MMSE) estimator.
The posterior cost is given by

    E[(θ̂(X) − Θ)² | X = x] = E[θ̂²(X) | X = x] − 2 E[θ̂(X) Θ | X = x] + E[Θ² | X = x]
                            = [θ̂(x)]² − 2 θ̂(x) E[Θ | X = x] + E[Θ² | X = x]    (6.6)

This expression is a quadratic function of θ̂(x) and its minimum is attained at

    θ̂_MMSE(x) = E[Θ | X = x]    (6.7)

Thus the MMSE estimate is the mean of the posterior probability density function. This estimate is also called the posterior mean (PM) estimate.
6.1.2 Minimum-Mean-Absolute-Error
In the case where τ = IR, another commonly used cost function is

    c[a, θ] = c(a − θ) = |a − θ|,   (a, θ) ∈ IR²    (6.8)

The corresponding Bayes risk is E[|θ̂(X) − Θ|], a quantity known as the Mean-Absolute-Error (MAE). The corresponding Bayes estimate is called the Minimum-Mean-Absolute-Error (MMAE) estimator.
The posterior cost is given by

    E[|θ̂(x) − Θ| | X = x] = ∫_0^∞ Pr{ |θ̂(x) − Θ| > z | X = x } dz
                          = ∫_0^∞ Pr{ Θ > z + θ̂(x) | X = x } dz + ∫_0^∞ Pr{ Θ < −z + θ̂(x) | X = x } dz    (6.9)

Making the change of variable t = z + θ̂(x) in the first integral and t = −z + θ̂(x) in the second one, we obtain

    E[|θ̂(x) − Θ| | X = x] = ∫_{θ̂(x)}^∞ Pr{ Θ > t | X = x } dt + ∫_{−∞}^{θ̂(x)} Pr{ Θ < t | X = x } dt    (6.10)

This expression is differentiable with respect to θ̂(x) and

    ∂E[|θ̂(x) − Θ| | X = x] / ∂θ̂(x) = Pr{ Θ < θ̂(x) | X = x } − Pr{ Θ > θ̂(x) | X = x }    (6.11)

This derivative is a nondecreasing function of θ̂(x) which approaches −1 as θ̂(x) → −∞ and +1 as θ̂(x) → +∞. The minimum of (6.10) is therefore achieved at the point θ̂(x) where the derivative vanishes. Consequently, the Bayes estimate satisfies

    Pr{ Θ < t | X = x } ≤ Pr{ Θ > t | X = x },   t < θ̂(x)

and

    Pr{ Θ < t | X = x } ≥ Pr{ Θ > t | X = x },   t > θ̂(x)

or

    Pr{ Θ < θ̂(x) | X = x } = Pr{ Θ > θ̂(x) | X = x }    (6.12)

that is, θ̂(x) is the median of the posterior distribution of Θ given X = x:

    θ̂_MMAE(x) = median of π(θ | X = x)    (6.13)
6.1.3 Maximum A Posteriori (MAP) estimation
Another commonly used cost function in the case where τ = IR is

    c[a, θ] = c(a − θ) = { 0 if |a − θ| ≤ ∆ ;  1 if |a − θ| > ∆ }    (6.14)

where ∆ is a positive real number. The corresponding posterior cost is given by

    E[ c[θ̂(X) − Θ] | X = x ] = Pr{ |θ̂(x) − Θ| > ∆ | X = x } = 1 − Pr{ |θ̂(x) − Θ| ≤ ∆ | X = x }    (6.15)

To minimize this expression we consider two cases:

• Θ is a discrete random variable taking its values in a finite set τ = {θ_1, ..., θ_M} such that |θ_i − θ_j| > ∆ for any i ≠ j. Then we have

    E[ c[θ̂(X), Θ] | X = x ] = 1 − Pr{ Θ = θ̂(x) | X = x } = 1 − π(θ̂(x) | x)    (6.16)

where π(θ | x) is the posterior distribution of Θ given X = x. The estimate is the value of Θ which has the maximum a posteriori probability:

    θ̂_MAP = arg max_{θ ∈ τ} { π(θ | x) }    (6.17)

• Θ is a continuous random variable. In this case, we have

    E[ c[θ̂(x), Θ] | X = x ] = 1 − ∫_{θ̂(x)−∆}^{θ̂(x)+∆} π(θ | x) dθ    (6.18)

If we assume that the posterior probability distribution π(θ | x) is continuous and smooth and that ∆ is sufficiently small, then we can write

    E[ c[θ̂(x), Θ] | X = x ] ≃ 1 − 2∆ π(θ̂(x) | x)    (6.19)

and again we have

    θ̂_MAP = arg max_{θ ∈ τ} { π(θ | x) }    (6.20)
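The three estimates of this section (posterior mean, posterior median and posterior mode) can be computed numerically from any posterior given on a grid. A minimal sketch (not from the text; the Gamma-shaped posterior below is an arbitrary illustration):

```python
import math

def bayes_estimates(theta, post):
    """theta: grid points; post: unnormalized posterior values on the grid.
    Returns (MMSE, MMAE, MAP) = (posterior mean, median, mode)."""
    d = theta[1] - theta[0]
    z = sum(post) * d
    p = [v / z for v in post]                        # normalize the posterior
    mmse = sum(t * v for t, v in zip(theta, p)) * d  # posterior mean (6.7)
    acc, mmae = 0.0, theta[-1]
    for t, v in zip(theta, p):                       # posterior median (6.13)
        acc += v * d
        if acc >= 0.5:
            mmae = t
            break
    mmap = max(zip(theta, p), key=lambda tv: tv[1])[0]  # posterior mode (6.20)
    return mmse, mmae, mmap

# example: pi(theta|x) proportional to theta*exp(-2*theta), a Gamma(2, 2) shape
grid = [i * 0.001 for i in range(1, 10000)]
post = [t * math.exp(-2 * t) for t in grid]
mmse, mmae, mmap = bayes_estimates(grid, post)
print(mmse, mmae, mmap)   # mean = 1.0, median ~ 0.839, mode = 0.5
```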
Example 1: Estimation of the parameter of an exponential distribution.
Suppose both distributions f_θ(x) and π(θ) are exponential:

    f_θ(x) = { θ exp[−θx] if x > 0 ;  0 otherwise }    (6.21)

and

    π(θ) = { α exp[−αθ] if θ > 0 ;  0 otherwise }    (6.22)

Note that

    E[X | θ] = 1/θ,   Var{X | θ} = E[(X − 1/θ)² | θ] = 1/θ²    (6.23)

and

    E[Θ] = 1/α,   Var{Θ} = E[(Θ − 1/α)²] = 1/α²    (6.24)

Then, we can calculate the joint distribution φ(x, θ):

    φ(x, θ) = { α θ exp[−θx − αθ] if θ > 0, x > 0 ;  0 otherwise }    (6.25)

The marginal distribution m(x) is given by

    m(x) = { ∫_0^∞ α θ exp[−(α + x)θ] dθ = α/(α + x)² if x > 0 ;  0 otherwise }    (6.26)

and the posterior distribution π(θ|x) is given by

    π(θ|x) = { α θ exp[−(α + x)θ] / m(x) = (α + x)² θ exp[−(α + x)θ] if θ > 0 ;  0 otherwise }    (6.27)

Note that we have

    E[Θ | X = x] = 2/(α + x),   Var{Θ | X = x} = 2/(α + x)²    (6.28)

The MMSE estimator is given by

    θ̂_MMSE(x) = E[Θ | X = x] = ∫_0^∞ θ π(θ|x) dθ = ∫_0^∞ (α + x)² θ² exp[−(α + x)θ] dθ = 2/(α + x)    (6.29)

and the corresponding minimum Bayes risk is

    MMSE = r(θ̂_MMSE) = E[Var{Θ | X}] = ∫_0^∞ [2/(α + x)²] m(x) dx = ∫_0^∞ 2α/(α + x)^4 dx = 2/(3α²)    (6.30)

The MMAE estimate θ̂_ABS(x) is such that

    ∫_{θ̂_ABS}^∞ π(θ|x) dθ = [1 + (α + x) θ̂_ABS] exp[−(α + x) θ̂_ABS] = 1/2    (6.31)

It is easily shown that

    θ̂_ABS(x) = T_0/(α + x)    (6.32)

where T_0 is the solution of

    (1 + T_0) exp[−T_0] = 1/2   →   T_0 ≃ 1.68

To calculate θ̂_MAP(x), we can remark that

    ∂ log π(θ|x)/∂θ = 1/θ − (α + x),   ∂² log π(θ|x)/∂θ² = −1/θ² < 0    (6.33)

So we have

    θ̂_MAP(x) = 1/(α + x)    (6.34)

The following table summarizes these estimates:

    θ̂_MMSE(x) = 2/(α + x),   θ̂_ABS(x) = T_0/(α + x),   θ̂_MAP(x) = 1/(α + x)    (6.35)
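The closed forms of (6.35) can be verified numerically. The following sketch (with arbitrary values for α and x) solves for T_0 by bisection and checks that the median estimate really splits the Gamma(2, α + x) posterior mass in half:

```python
import math

def posterior_cdf(t, s):
    # integral_0^t s^2 u exp(-s u) du = 1 - (1 + s t) exp(-s t)
    return 1.0 - (1.0 + s * t) * math.exp(-s * t)

def solve_T0(lo=0.0, hi=10.0):
    # bisection for (1 + T0) exp(-T0) = 1/2
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if (1 + mid) * math.exp(-mid) > 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha, x = 1.0, 3.0              # illustrative values
s = alpha + x
T0 = solve_T0()
theta_mmse = 2.0 / s             # posterior mean, eq. (6.29)
theta_abs = T0 / s               # posterior median, eq. (6.32)
theta_map = 1.0 / s              # posterior mode, eq. (6.34)
print(T0, theta_mmse, theta_abs, theta_map)
print(posterior_cdf(theta_abs, s))   # ~ 0.5: the median splits the mass
```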
Example 2: Estimation of the location parameter of a Gaussian distribution.
Assume X|Θ and Θ both have Gaussian distributions:

    X | Θ = θ ∼ f_θ(x) = N(θ, σ²)
    Θ ∼ π(θ) = N(θ_0, σ_θ²)

Then the joint distribution φ(x, θ) = f_θ(x) π(θ) is jointly Gaussian, and the marginal distribution m(x) and the posterior distribution π(θ|x) are

    X ∼ m(x) = N(θ_0, σ_x²),   with σ_x² = σ² + σ_θ²
    Θ | X = x ∼ π(θ|x) = N(θ̄, σ̄²)

with

    θ̄ = (σ²/σ_x²) θ_0 + (σ_θ²/σ_x²) x,   σ̄² = σ² σ_θ² / σ_x²

In this case all the estimators are equal and we have

    θ̂_MMSE(x) = θ̂_ABS(x) = θ̂_MAP(x) = θ̄ = (σ²/(σ² + σ_θ²)) θ_0 + (σ_θ²/(σ² + σ_θ²)) x    (6.36)

They also have the same posterior variance

    σ̄² = σ² σ_θ² / σ_x² = σ² σ_θ² / (σ² + σ_θ²)    (6.39)

Note also the following limit cases:

    when σ² → 0 then θ̄ → x;   when σ_θ² → 0 then θ̄ → θ_0.
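The weighting in (6.36) can be sanity-checked by self-normalized importance sampling: draw θ from the prior, weight by the likelihood, and compare the weighted moments with the closed-form posterior mean and variance. A sketch with invented numbers:

```python
import math, random

sigma2, sigma2_t, theta0 = 0.5, 2.0, 1.0     # sigma^2, sigma_theta^2, prior mean
x_obs = 3.0                                  # one observation

w = sigma2_t / (sigma2 + sigma2_t)           # weight on x in (6.36)
post_mean = w * x_obs + (1 - w) * theta0
post_var = sigma2 * sigma2_t / (sigma2 + sigma2_t)

rng = random.Random(1)
num = num2 = den = 0.0
for _ in range(200000):
    th = rng.gauss(theta0, math.sqrt(sigma2_t))        # theta ~ prior
    lik = math.exp(-0.5 * (x_obs - th) ** 2 / sigma2)  # f(x|theta), unnormalized
    num += th * lik
    num2 += th * th * lik
    den += lik
mc_mean = num / den
mc_var = num2 / den - mc_mean ** 2
print(post_mean, mc_mean)   # both ~ 2.6 for these values
print(post_var, mc_var)     # both ~ 0.4
```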
Example 3: Estimation of the parameter of a Binomial distribution.
Assume that

    X | θ ∼ f(x|θ) = Bin(x|n, θ) = C_n^x θ^x (1 − θ)^{n−x}
    Θ ∼ π(θ) = Beta(θ|α, β) = (1/B(α, β)) θ^{α−1} (1 − θ)^{β−1}

Then the joint distribution φ(x, θ), the marginal distribution m(x) and the posterior distribution π(θ|x) are

    (X, Θ) ∼ φ(x, θ) = (C_n^x / B(α, β)) θ^{α+x−1} (1 − θ)^{β+n−x−1}
    X ∼ m(x) = (C_n^x / B(α, β)) B(α + x, β + n − x)
             = C_n^x [Γ(α + β) / (Γ(α) Γ(β))] [Γ(α + x) Γ(β + n − x) / Γ(α + β + n)]
    Θ | X = x ∼ π(θ|x) = (1/B(α + x, β + n − x)) θ^{α+x−1} (1 − θ)^{β+n−x−1}
                       = Beta(θ|α + x, β + n − x)
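The conjugacy claim of Example 3 can be checked directly: applying Bayes' rule on a grid should reproduce the Beta(α + x, β + n − x) density. A sketch with assumed example values:

```python
import math

def beta_pdf(t, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return t ** (a - 1) * (1 - t) ** (b - 1) / B

alpha, beta, n, x = 2.0, 3.0, 10, 4        # illustrative values
grid = [i / 1000 for i in range(1, 1000)]

# unnormalized posterior = binomial likelihood * Beta prior on the grid
unnorm = [math.comb(n, x) * t**x * (1 - t)**(n - x) * beta_pdf(t, alpha, beta)
          for t in grid]
z = sum(unnorm) / 1000                     # crude normalizing integral
closed = [beta_pdf(t, alpha + x, beta + n - x) for t in grid]

err = max(abs(u / z - c) for u, c in zip(unnorm, closed))
print(err)   # small: the normalized product is the Beta(6, 9) density
```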
6.2 Other cost functions and related estimators

Here are some other cost functions and the corresponding Bayesian estimator expressions.

    Name                    c[a, θ]                                            θ̂
    Quadratic               q (a − θ)²                                         θ̂ = E[Θ|x] = ∫ θ π(θ|x) dθ
    Weighted quadratic      ω(θ) (a − θ)²                                      θ̂ = E[ω(Θ) Θ|x] / E[ω(Θ)|x] = ∫ ω(θ) θ π(θ|x) dθ / ∫ ω(θ) π(θ|x) dθ
    Absolute                |a − θ|                                            ∫_{−∞}^{θ̂} π(θ|x) dθ = ∫_{θ̂}^{+∞} π(θ|x) dθ = 1/2
    Nonsymmetric absolute   k_1 (a − θ) if θ ≤ a ;  k_2 (θ − a) if θ ≥ a       k_1 ∫_{−∞}^{θ̂} π(θ|x) dθ = k_2 ∫_{θ̂}^{+∞} π(θ|x) dθ
    Linex                   β [exp[α(a − θ)] − α(a − θ) − 1],  α ≠ 0, β > 0    θ̂ = −(1/α) log( E[exp[−αΘ]|x] ) = −(1/α) log[ ∫ exp[−αθ] π(θ|x) dθ ]
    —                       (a − θ)²/θ                                         θ̂ = 1/E[1/Θ|x] = 1 / ∫ (1/θ) π(θ|x) dθ
    —                       (a − θ)²/a                                         θ̂ = sqrt( E[Θ²|x] ) = sqrt( ∫ θ² π(θ|x) dθ )
    —                       −ln[a(θ)]                                          â(θ) = π(θ|x)

Table 6.1: Some cost functions and the corresponding Bayes estimators.
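Two of the less familiar rows of Table 6.1 can be checked numerically: the Linex estimate −(1/α) log E[e^{−αΘ}|x] and the estimate 1/E[1/Θ|x] for the cost (a − θ)²/θ. The sketch below (with an arbitrary Gamma-shaped grid posterior and illustrative α) verifies that each formula beats small perturbations of itself under its own posterior cost:

```python
import math

# an arbitrary grid posterior: Gamma(3, 2) shape, pi(theta|x) ~ t^2 exp(-2t)
h = 0.001
grid = [i * h for i in range(1, 15000)]
w = [t**2 * math.exp(-2 * t) for t in grid]
z = sum(w) * h
p = [v / z for v in w]

def post_mean(f):
    """E[f(Theta)|x] on the grid posterior."""
    return sum(f(t) * v for t, v in zip(grid, p)) * h

alpha = 0.7
theta_linex = -(1 / alpha) * math.log(post_mean(lambda t: math.exp(-alpha * t)))
theta_inv = 1.0 / post_mean(lambda t: 1.0 / t)

def linex_risk(a):
    return post_mean(lambda t: math.exp(alpha * (a - t)) - alpha * (a - t) - 1)

def inv_quad_risk(a):
    return post_mean(lambda t: (a - t)**2 / t)

# each estimate should beat small perturbations of itself
for th, risk in ((theta_linex, linex_risk), (theta_inv, inv_quad_risk)):
    assert risk(th) <= risk(th + 0.05) and risk(th) <= risk(th - 0.05)
print(theta_linex, theta_inv)
```

For a Gamma(3, 2) posterior the closed forms give E[1/Θ|x] = 1, so `theta_inv` is 1, and the Linex estimate is (3/α) ln(1 + α/2) ≈ 1.286 for α = 0.7.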
6.3 Examples of posterior calculation

In the previous sections, we saw that the computation of Bayesian estimates requires the posterior probability distribution. The following table gives a summary of the expressions of the posterior probability distributions in some classical cases.

    Observation law f(x|θ)           Prior law π(θ)          Marginal law m(x) = ∫ f(x|θ) π(θ) dθ        Posterior law π(θ|x) = f(x|θ) π(θ) / m(x)

    Discrete variables
    Binomial Bin(x|n, θ)             Beta Bet(θ|α, β)        Binomial-Beta BinBet(x|α, β, n)             Beta Bet(θ|α + x, β + n − x)
    Negative Binomial NegBin(x|n, θ) Beta Bet(θ|α, β)        Negative Binomial-Beta NegBinBet(x|α, β, n) Beta Bet(θ|α + n, β + x)
    Poisson Pn(x|θ)                  Gamma Gam(θ|α, β)       Poisson-Gamma PnGam(x|α, β, 1)              Gamma Gam(θ|α + x, β + 1)

    Continuous variables
    Gamma Gam(x|ν, θ)                Gamma Gam(θ|α, β)       Gamma-Gamma GamGam(x|α, β, ν)               Gamma Gam(θ|α + ν, β + x)
    Exponential Ex(x|θ)              Gamma Gam(θ|α, β)       Pareto Par(x|α, β)                          Gamma Gam(θ|α + 1, β + x)
    Normal N(x|θ, σ²)                Normal N(θ|µ, τ²)       Normal N(x|µ, σ² + τ²)                      Normal N(θ | (µσ² + τ²x)/(σ² + τ²), σ²τ²/(σ² + τ²))
    Normal N(x|µ, λθ)                Gamma Gam(θ|α/2, α/2)   Student (t) St(x|µ, λ, α)                   Gamma Gam(θ | (α + 1)/2, α/2 + (µ − x)²/2)

Table 6.2: Relations between the data, a priori, marginal and a posteriori distributions.
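Any row of Table 6.2 can be verified the same way as in Example 3. The sketch below (with arbitrary example values) checks the Poisson-Gamma row: a Poisson observation with a Gamma(α, β) prior yields a Gamma(α + x, β + 1) posterior.

```python
import math

def gamma_pdf(t, a, b):
    """Gamma density with shape a and rate b."""
    return b**a * t**(a - 1) * math.exp(-b * t) / math.gamma(a)

alpha, beta, x = 3.0, 2.0, 5               # illustrative values
grid = [i * 0.01 for i in range(1, 2001)]

# unnormalized posterior = Poisson likelihood * Gamma prior on the grid
unnorm = [(t**x * math.exp(-t) / math.factorial(x)) * gamma_pdf(t, alpha, beta)
          for t in grid]
z = sum(unnorm) * 0.01                     # crude normalizing integral
closed = [gamma_pdf(t, alpha + x, beta + 1) for t in grid]

err = max(abs(u / z - c) for u, c in zip(unnorm, closed))
print(err)   # small: the normalized product matches Gam(alpha+x, beta+1)
```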
6.4 Estimation of vector parameters

In the case where we have a vector parameter θ = [θ_1, ..., θ_m]^t, we have to define a cost function c[a, θ] : IR^m × IR^m → IR+. Then it is again possible to define the Bayes risk. In many cases the cost function is of the form

    c[a, θ] = ∑_{i=1}^m c_i[a_i, θ_i]    (6.40)

We then have

    E[ c[θ̂(x), θ] | X = x ] = ∑_{i=1}^m E[ c_i[θ̂_i(x), θ_i] | X = x ]    (6.41)

Hereafter, we consider some common cost functions.

6.4.1 Minimum-Mean-Squared-Error

In the case where τ = IR^m, a commonly used cost function is

    c[a, θ] = ‖a − θ‖² = ∑_{i=1}^m (a_i − θ_i)²    (6.42)

The corresponding Bayes risk is E[‖θ̂(X) − Θ‖²] and the corresponding Bayes estimate is the Minimum-Mean-Squared-Error (MMSE) estimator:

    θ̂_MMSE(x) = E[Θ | X = x]    (6.43)

Thus the MMSE estimate is the mean of the posterior probability density function. It is also called the posterior mean (PM) estimate.
Note that, as in the scalar case, the weighted quadratic cost function

    c[a, θ] = ‖a − θ‖²_Q = [a − θ]^t Q [a − θ]    (6.44)

gives the same estimate as (6.43), i.e. the MMSE estimate does not depend on the weighting matrix Q. However, the corresponding minimum Bayes risks are different and we have

    E[‖θ̂(X) − Θ‖²_Q] = tr{ Q E[Cov{Θ | X = x}] }    (6.45)

6.4.2 Minimum-Mean-Absolute-Error

In the case where τ = IR^m, another commonly used cost function is

    c[a, θ] = ∑_{i=1}^m |a_i − θ_i|    (6.46)

The corresponding estimate is such that

    Pr{ Θ_i < θ̂_i(x) | X = x } = Pr{ Θ_i > θ̂_i(x) | X = x }    (6.47)

which means that θ̂_i(x) is the median of the marginal posterior distribution of Θ_i given X = x, i.e. π(θ_i | X = x).
6.4.3 Marginal Maximum A Posteriori (MMAP) estimation

Another commonly used cost function in the case where τ = IR^m is

    c[a, θ] = ∑_{i=1}^m c[a_i − θ_i]   with   c[a_i − θ_i] = { 0 if |a_i − θ_i| ≤ ∆ ;  1 if |a_i − θ_i| > ∆ }    (6.48)

where ∆ is a positive real number. If ∆ is sufficiently small, the corresponding estimate is given by

    θ̂_i = arg max_{θ_i} { π(θ_i | x) }    (6.49)

6.4.4 Maximum A Posteriori (MAP) estimation

Two other cost functions which give the same estimates are

    c[a, θ] = { 0 if max_i |a_i − θ_i| ≤ ∆ ;  1 if max_i |a_i − θ_i| > ∆ }    (6.50)

and

    c[a, θ] = { 0 if ‖a − θ‖² ≤ ∆ ;  1 if ‖a − θ‖² > ∆ }    (6.51)

where ∆ is a positive real number. In both cases, if the posterior distribution π(θ | x) is continuous and smooth enough, we obtain the MAP estimate:

    θ̂_MAP = arg max_{θ ∈ τ} { π(θ | x) }    (6.52)

Note that the MMAP estimate in (6.49) and the MAP estimate in (6.52) may be very different.
6.4.5 Estimation of a Gaussian vector parameter from jointly Gaussian observations

The estimation of a Gaussian vector parameter θ ∈ IR^m from a jointly Gaussian observation x ∈ IR^n is a very useful example and is used in many applications.
Suppose Θ and X have the following distributions:

    Θ ∼ N(θ_0, R_Θ),   X ∼ N(x_0, R_X)

and jointly

    (Θ, X) ∼ N( (θ_0, x_0), [ R_Θ  R_ΘX ;  R_XΘ  R_X ] )   with R_ΘX = R_XΘ^t.

It is easy to show that the posterior law is also Gaussian:

    Θ | X = x ∼ N(θ̂, R̂)    (6.53)

with

    θ̂ = θ_0 + R_ΘX R_X^{−1} (x − x_0)    (6.54)
    R̂ = R_Θ − R_ΘX R_X^{−1} R_XΘ    (6.55)

We also have

    E[Θ | X = x] = θ̂    (6.56)
    Cov{Θ | X = x} = R̂    (6.57)

The corresponding minimum Bayes risk (for the weighted quadratic cost) is

    r(θ̂) = tr{ Q R̂ } = tr{ Q R_Θ } − tr{ Q R_ΘX R_X^{−1} R_XΘ }    (6.58)
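Equations (6.54)-(6.55) can be checked numerically: the estimation error Θ − θ̂(X) must be uncorrelated with X (orthogonality), and its covariance must equal the Schur complement R̂. A sketch with a randomly generated joint covariance (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
M = rng.standard_normal((m + n, m + n))
C = M @ M.T + (m + n) * np.eye(m + n)      # a random joint covariance matrix
R_T  = C[:m, :m]                           # R_Theta
R_TX = C[:m, m:]                           # R_{Theta,X}
R_X  = C[m:, m:]                           # R_X

K = R_TX @ np.linalg.inv(R_X)              # "gain" in theta_hat of (6.54)
R_hat = R_T - K @ R_TX.T                   # posterior covariance (6.55)

# orthogonality: Cov(Theta - K X, X) = R_TX - K R_X = 0
orth_err = np.abs(R_TX - K @ R_X).max()
# Cov(Theta - K X) = R_Theta - K R_X K^t = R_hat
cov_err = np.abs((R_T - K @ R_X @ K.T) - R_hat).max()
print(orth_err, cov_err)   # both ~ 0 up to round-off
```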
6.4.6 Case of linear models

When the observation vector is related to the vector parameter θ by a linear model, we have

    X_i = ∑_{j=1}^m h_{i,j} Θ_j + N_i,   i = 1, ..., n    (6.59)

or in matrix form

    X = HΘ + N    (6.60)

with

    Θ ∼ N(θ_0, R_Θ),   N ∼ N(0, R_N)

Then we have

    X | Θ = θ ∼ N(Hθ, R_N)
    R_X = H R_Θ H^t + H R_ΘN + R_NΘ H^t + R_N
    R_XΘ = H R_Θ + R_NΘ,   R_ΘX = R_XΘ^t

If we assume that the noise N and the vector parameter Θ are independent, we have

    R_X = H R_Θ H^t + R_N    (6.61)
    R_XΘ = H R_Θ,   R_ΘX = R_Θ H^t    (6.62)
    R_X^{−1} = R_N^{−1} − R_N^{−1} H ( R_Θ^{−1} + H^t R_N^{−1} H )^{−1} H^t R_N^{−1}    (6.63)

and

    Θ | X = x ∼ N(θ̂, R̂)    (6.64)

with

    θ̂ = E[Θ | x] = θ_0 + R_ΘX R_X^{−1} (x − Hθ_0)
       = θ_0 + R_Θ H^t [H R_Θ H^t + R_N]^{−1} (x − Hθ_0)
       = θ_0 + [H^t R_N^{−1} H + R_Θ^{−1}]^{−1} H^t R_N^{−1} (x − Hθ_0)
    R̂ = E[ (θ − θ̂)(θ − θ̂)^t | x ] = R_Θ − R_ΘX R_X^{−1} R_XΘ
       = R_Θ − R_Θ H^t [H R_Θ H^t + R_N]^{−1} H R_Θ
       = [H^t R_N^{−1} H + R_Θ^{−1}]^{−1}    (6.65)

Consider now the particular case of R_N = σ_b² I, R_Θ = σ_x² (D^t D)^{−1} and θ_0 = 0. We then have

    θ̂ = (H^t H + λ D^t D)^{−1} H^t x,   R̂ = σ_b² (H^t H + λ D^t D)^{−1},   with λ = σ_b²/σ_x²    (6.66)
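The two equivalent forms for θ̂ and R̂ in (6.65), related by the matrix inversion lemma, can be checked on random matrices. A sketch (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 3
H = rng.standard_normal((n, m))
A = rng.standard_normal((m, m)); R_T = A @ A.T + np.eye(m)   # R_Theta > 0
B = rng.standard_normal((n, n)); R_N = B @ B.T + np.eye(n)   # R_N > 0
theta0 = rng.standard_normal(m)
x = rng.standard_normal(n)

S = H @ R_T @ H.T + R_N                      # = R_X of (6.61)
# first form of (6.65)
form1 = theta0 + R_T @ H.T @ np.linalg.solve(S, x - H @ theta0)
# second form of (6.65), via the "information" matrix
P = np.linalg.inv(H.T @ np.linalg.inv(R_N) @ H + np.linalg.inv(R_T))
form2 = theta0 + P @ H.T @ np.linalg.solve(R_N, x - H @ theta0)

Rh1 = R_T - R_T @ H.T @ np.linalg.solve(S, H @ R_T)
mean_err = np.abs(form1 - form2).max()
cov_err = np.abs(Rh1 - P).max()
print(mean_err, cov_err)   # both ~ 0 up to round-off
```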
6.5 Examples

6.5.1 Curve fitting

We consider here the classical problem of curve fitting that almost every engineer faces at one time or another. We analyse this problem as a parameter estimation problem: given a set of data {(x_i, t_i), i = 1, ..., n}, estimate the parameters of an algebraic curve that best fits these data. Among the different families of curves, polynomials are very commonly used.
A polynomial model of degree p relating x_i = x(t_i) to t_i is

    x_i = x(t_i) = θ_0 + θ_1 t_i + θ_2 t_i² + ··· + θ_p t_i^p,   i = 1, ..., n    (6.67)

Noting that this relation is linear in the θ_i, we can rewrite it in the following form:

    [ x_1 ]   [ 1  t_1  t_1²  ···  t_1^p ] [ θ_0 ]
    [ x_2 ] = [ 1  t_2  t_2²  ···  t_2^p ] [ θ_1 ]    (6.68)
    [  ⋮  ]   [ ⋮    ⋮    ⋮          ⋮   ] [  ⋮  ]
    [ x_n ]   [ 1  t_n  t_n²  ···  t_n^p ] [ θ_p ]

or

    x = Hθ    (6.69)

The matrix H is called the Vandermonde matrix. It is entirely determined by the vector t = [t_1, t_2, ..., t_n]^t.
In the case where n = p + 1, this matrix is invertible iff t_i ≠ t_j for all i ≠ j. In general, however, we have more data than unknowns, i.e. n > p + 1.
Note that the matrix H^t H is a Hankel matrix:

    [H^t H]_{kl} = ∑_{i=1}^n t_i^{k−1} t_i^{l−1} = ∑_{i=1}^n t_i^{k+l−2},   k, l = 1, ..., p + 1    (6.70)

and the vector H^t x is such that

    [H^t x]_k = ∑_{i=1}^n t_i^{k−1} x_i,   k = 1, ..., p + 1    (6.71)
Line fitting is the particular case p = 1:

    [ x_1 ]   [ 1  t_1 ]
    [ x_2 ] = [ 1  t_2 ] ( θ_0 )    (6.72)
    [  ⋮  ]   [ ⋮   ⋮  ] ( θ_1 )
    [ x_n ]   [ 1  t_n ]

In this case we have

    H^t H = [ n               ∑_{i=1}^n t_i  ;
              ∑_{i=1}^n t_i   ∑_{i=1}^n t_i² ]    (6.73)

and

    H^t x = [ ∑_{i=1}^n x_i  ;  ∑_{i=1}^n t_i x_i ]    (6.74)

[Figure 6.1: Curve fitting: the data points (t_i, x_i) and the fitted curve x_i = θ_0 + θ_1 t_i + ··· + θ_p t_i^p, with e_i the error at the point t_i.]
In the following we consider the line fitting case and will see how different assumptions about the problem can give different solutions.

Model 1:
The easiest model is to assume that the t_i are perfectly known, and that we only have uncertainties on the x_i, i.e.

    x_i = x(t_i) = θ_0 + θ_1 t_i + e_i,   i = 1, ..., n    (6.75)

where e_i represents the error on x_i. In a geometric language, e_i is the signed vertical distance between the point (t_i, x_i) and the point (t_i, θ_0 + θ_1 t_i) (see Figure 6.2).
Here, we have x = Hθ + e with θ = [θ_0, θ_1]^t. The matrix H is perfectly known. Note that in this model, if we assume that the e_i are zero mean, white and Gaussian,

    e_i = x_i − θ_0 − θ_1 t_i ∼ N(0, σ_e²)    (6.76)

then the likelihood function becomes

    f(x | θ) = N(Hθ, σ_e² I) ∝ exp[ −(1/(2σ_e²)) ‖x − Hθ‖² ]    (6.77)

and the maximum likelihood estimate is

    θ̂ = arg min_θ { ‖x − Hθ‖² }    (6.78)
[Figure 6.2: Line fitting, model 1: e_i = x_i − (θ_0 + θ_1 t_i).]
If H^t H is invertible, θ̂ is given by

    θ̂ = [H^t H]^{−1} H^t x    (6.79)

To define a Bayesian estimate, we have to assign a prior probability law to θ. Let us assume that θ_0 and θ_1 are independent and

    θ_0 ∼ N(0, σ_0²),   θ_1 ∼ N(1, σ_1²)    (6.80)

or

    θ = (θ_0, θ_1)^t ∼ N( (0, 1)^t, diag(σ_0², σ_1²) ) = N(θ_0, Σ_θ)    (6.81)

(here the boldface θ_0 = (0, 1)^t denotes the prior mean vector, not the intercept θ_0).

Exercise 1:

• Write the complete expressions of f(x_i | θ), f(x | θ), π(θ) and π(θ | x).

• Show that the posterior law π(θ | x) is Gaussian, i.e.

    π(θ | x) ∼ N(θ̂, Σ̂)    (6.82)

and give the expressions of θ̂ and Σ̂.

• Show that the MAP estimate is obtained by

    θ̂ = arg min_θ { J(θ) }    (6.83)

with

    J(θ) = (1/σ_e²) ‖x − Hθ‖² + (θ − θ_0)^t Σ_θ^{−1} (θ − θ_0)
         = (1/σ_e²) ∑_{i=1}^n (x_i − θ_0 − θ_1 t_i)² + (θ − θ_0)^t Σ_θ^{−1} (θ − θ_0)

• Show that, in this case, an explicit expression of θ̂ is available and is given by

    θ̂ = [ (1/σ_e²) H^t H + Σ_θ^{−1} ]^{−1} [ (1/σ_e²) H^t x + Σ_θ^{−1} θ_0 ]    (6.84)

• Compare this solution to the ML solution (6.79), which is equal to the least squares (LS) solution.
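One way to carry out this comparison numerically: the MAP formula (6.84) shrinks toward the prior mean for a tight prior and recovers the least-squares solution (6.79) as the prior becomes flat. A sketch (the data, noise level and prior below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 20)
H = np.column_stack([np.ones_like(t), t])
theta_true = np.array([0.3, 1.2])
x = H @ theta_true + 0.05 * rng.standard_normal(20)

sig2_e = 0.05**2
theta0 = np.array([0.0, 1.0])          # prior mean of (6.81)

def theta_map(sig2_prior):
    """MAP estimate (6.84) with Sigma_theta = sig2_prior * I."""
    S_inv = np.eye(2) / sig2_prior
    A = H.T @ H / sig2_e + S_inv
    return np.linalg.solve(A, H.T @ x / sig2_e + S_inv @ theta0)

theta_ls = np.linalg.solve(H.T @ H, H.T @ x)   # ML = LS solution (6.79)
print(theta_map(1e-6))   # ~ theta0: a very tight prior dominates
print(theta_map(1e6))    # ~ theta_ls: a very flat prior is ignored
```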
Model 2:
A slightly more complex model is

    x_i = x(t_i) = θ_0 + θ_1 t_i + e_i,   i = 1, ..., n    (6.85)

with the assumption that

    r_i = e_i cos φ = e_i / √(1 + θ_1²) = (x_i − θ_0 − θ_1 t_i) / √(1 + θ_1²),

the distance of the point (t_i, x_i) to the line x(t) = θ_0 + θ_1 t, is zero mean, white and Gaussian with known variance σ_r² (see Figure 6.3).
Note that r_i is no longer a linear function of θ_1.
[Figure 6.3: Line fitting, model 2: r_i = e_i cos φ = (x_i − θ_0 − θ_1 t_i)/√(1 + θ_1²).]
Exercise 2: With this model, and assuming that the r_i are zero mean, white and Gaussian with known variance σ_r²:

• Write the expressions of f(x_i | θ), f(x | θ), π(θ) and π(θ | x).

• Show that the MAP estimate is obtained by

    θ̂ = arg min_θ { J(θ) }    (6.86)

with

    J(θ) = n ln( 2π(1 + θ_1²) σ_r² ) + 1/((1 + θ_1²) σ_r²) ∑_{i=1}^n (x_i − θ_0 − θ_1 t_i)² + (θ − θ_0)^t Σ_θ^{−1} (θ − θ_0)    (6.87)

where θ = [θ_0, θ_1]^t.

• Is it possible to obtain explicit expressions for θ̂_0 and θ̂_1?
Model 3:

A slightly different model assumes that the t_i are also uncertain, i.e.

    x_i = x(t_i) = θ_0 + θ_1 (t_i + ε_i) + e_i,   i = 1, ..., n    (6.88)

where ε_i represents the error on t_i. Here also, we have x = Hθ + e with θ = [θ_0, θ_1]^t, but the matrix H is now uncertain.
Note that we have

    x_i = x(t_i) = θ_0 + θ_1 t_i + θ_1 ε_i + e_i,   i = 1, ..., n    (6.89)

which can also be written as

    x = H_0 θ + H_ε θ + e

with

    H_0 = [ 1  t_1 ;  1  t_2 ;  ⋮  ;  1  t_n ],   H_ε = [ 0  ε_1 ;  0  ε_2 ;  ⋮  ;  0  ε_n ]

Exercise 3: With this model, and assuming that the ε_i are zero mean, white and Gaussian with known variance σ_ε² and that the e_i are also zero mean, white and Gaussian with known variance σ_e²:

• Write the expressions of f(x_i | θ), f(x | θ), π(θ) and π(θ | x).

• Give the expressions of the ML and the MAP estimators.

• Compare them to the solutions of the previous cases.
[Figure 6.4: Line fitting, model 3.]
Model 4:

This is the combination of models 2 and 3, i.e.

    x_i = x(t_i) = θ_0 + θ_1 (t_i + ε_i) + e_i,   i = 1, ..., n    (6.90)

where the

    r_i = e_i cos φ = e_i / √(1 + θ_1²) = (x_i − θ_0 − θ_1 t_i) / √(1 + θ_1²),

the distances of the points (t_i, x_i) to the line x(t) = θ_0 + θ_1 t, are assumed zero mean, white and Gaussian.

Exercise 4: With this model, and assuming that the ε_i are zero mean, white and Gaussian with known variance σ_ε² and that the r_i are also zero mean, white and Gaussian with known variance σ_r²:

• Write the expressions of f(x_i | θ), f(x | θ), π(θ) and π(θ | x).

• Give the expressions of the ML and the MAP estimators.

• Compare them to the solutions of the previous examples.
[Figure 6.5: Line fitting, models 1 and 2. Model 1: e_i = x_i − (θ_0 + θ_1 t_i). Model 2: r_i = [x_i − (θ_0 + θ_1 t_i)]/√(1 + θ_1²).]
The following summary gathers the posterior laws for the four line-fitting models.

General linear Gaussian model:

    x | θ ∼ N(Hθ, R_N),   θ ∼ N(θ_0, R_Θ)   ⟹   θ | x ∼ N(θ̂, R̂)

    θ̂ = θ_0 + R_Θ H^t [H R_Θ H^t + R_N]^{−1} (x − Hθ_0)
      = θ_0 + [H^t R_N^{−1} H + R_Θ^{−1}]^{−1} H^t R_N^{−1} (x − Hθ_0)
    R̂ = R_Θ − R_Θ H^t [H R_Θ H^t + R_N]^{−1} H R_Θ
      = [H^t R_N^{−1} H + R_Θ^{−1}]^{−1}

Model 1:

    e_i ∼ N(0, σ_e²),   x_i | θ ∼ N(θ_0 + θ_1 t_i, σ_e²),   x | θ ∼ N(Hθ, σ_e² I)

    θ̂ = θ_0 + [H^t H + σ_e² R_Θ^{−1}]^{−1} H^t (x − Hθ_0)
    R̂ = σ_e² [H^t H + σ_e² R_Θ^{−1}]^{−1}

Model 2:

    r_i = e_i/√(1 + θ_1²) ∼ N(0, σ_r²),   e_i ∼ N(0, (1 + θ_1²) σ_r²)
    x_i | θ ∼ N(θ_0 + θ_1 t_i, (1 + θ_1²) σ_r²),   x | θ ∼ N(Hθ, (1 + θ_1²) σ_r² I)

    π(θ|x) = (1/m(x)) [2π(1 + θ_1²)σ_r²]^{−n/2} exp[ −‖x − Hθ‖²/(2(1 + θ_1²)σ_r²) − (1/2)(θ − θ_0)^t R_Θ^{−1} (θ − θ_0) ]

    θ̂_MAP = arg min_θ { J(θ) },
    J(θ) = (n/2) ln[2π(1 + θ_1²)σ_r²] + ‖x − Hθ‖²/(2(1 + θ_1²)σ_r²) + (1/2)(θ − θ_0)^t R_Θ^{−1} (θ − θ_0)

Model 3:

    ε_i ∼ N(0, σ_ε²),   e_i ∼ N(0, σ_e²)
    x_i | θ ∼ N(θ_0 + θ_1 t_i, θ_1² σ_ε² + σ_e²),   x | θ ∼ N(Hθ, (θ_1² σ_ε² + σ_e²) I)

    π(θ|x) = (1/m(x)) [2π(θ_1² σ_ε² + σ_e²)]^{−n/2} exp[ −‖x − Hθ‖²/(2(θ_1² σ_ε² + σ_e²)) − (1/2)(θ − θ_0)^t R_Θ^{−1} (θ − θ_0) ]

    θ̂_MAP = arg min_θ { J(θ) },
    J(θ) = (n/2) ln[2π(θ_1² σ_ε² + σ_e²)] + ‖x − Hθ‖²/(2(θ_1² σ_ε² + σ_e²)) + (1/2)(θ − θ_0)^t R_Θ^{−1} (θ − θ_0)

Model 4:

    ε_i ∼ N(0, σ_ε²),   e_i ∼ N(0, (1 + θ_1²) σ_r²)
    x_i | θ ∼ N(θ_0 + θ_1 t_i, θ_1² σ_ε² + (1 + θ_1²) σ_r²) = N(θ_0 + θ_1 t_i, θ_1² (σ_ε² + σ_r²) + σ_r²)
    x | θ ∼ N(Hθ, (θ_1² (σ_ε² + σ_r²) + σ_r²) I)

    π(θ|x) = (1/m(x)) [2π(θ_1² (σ_ε² + σ_r²) + σ_r²)]^{−n/2} exp[ −‖x − Hθ‖²/(2(θ_1² (σ_ε² + σ_r²) + σ_r²)) − (1/2)(θ − θ_0)^t R_Θ^{−1} (θ − θ_0) ]

    θ̂_MAP = arg min_θ { J(θ) },
    J(θ) = (n/2) ln[2π(θ_1² (σ_ε² + σ_r²) + σ_r²)] + ‖x − Hθ‖²/(2(θ_1² (σ_ε² + σ_r²) + σ_r²)) + (1/2)(θ − θ_0)^t R_Θ^{−1} (θ − θ_0)

(In each case J(θ) = −ln π(θ|x) up to an additive constant, so that θ̂_MAP = arg min_θ J(θ).)
Remark: To do these calculations easily we need the following relations:

• If A, B and A + B are invertible, then we have

    [A^{−1} + B^{−1}]^{−1} = A [A + B]^{−1} B = B [A + B]^{−1} A    (6.91)
    [A + B]^{−1} = A^{−1} [A^{−1} + B^{−1}]^{−1} B^{−1} = B^{−1} [A^{−1} + B^{−1}]^{−1} A^{−1}

• If A and C are invertible matrices, then we have (the matrix inversion lemma)

    [A + BCD]^{−1} = A^{−1} − A^{−1} B [D A^{−1} B + C^{−1}]^{−1} D A^{−1}    (6.92)

• A special case, very useful in system theory:

    [I + B (sI − C)^{−1} D]^{−1} = I − B [sI − C + DB]^{−1} D    (6.93)

• If A is invertible, then (rank-one update)

    [A + u v^t]^{−1} = A^{−1} − (A^{−1} u)(v^t A^{−1}) / (1 + v^t A^{−1} u)    (6.94)

• If A is a block matrix

    A = [ A_11  A_12 ;  A_21  A_22 ]

then B = A^{−1} is also a block matrix

    B = [ B_11  B_12 ;  B_21  B_22 ]

and:

  – If A_22^{−1} exists, then

        A = [ I  A_12 A_22^{−1} ;  0  I ] [ A_11 − A_12 A_22^{−1} A_21   0 ;  0   A_22 ] [ I  0 ;  A_22^{−1} A_21  I ]

    and we have

        rank{A} = rank{A_11 − A_12 A_22^{−1} A_21} + rank{A_22}

    A^{−1} exists iff the matrix T = A_11 − A_12 A_22^{−1} A_21 is invertible. Then we have

        B_11 = T^{−1} = (A_11 − A_12 A_22^{−1} A_21)^{−1}
        B_22 = A_22^{−1} + A_22^{−1} A_21 B_11 A_12 A_22^{−1}
        B_12 = −B_11 A_12 A_22^{−1}
        B_21 = −A_22^{−1} A_21 B_11

    Written differently, we have

        [ A_11  A_12 ;  A_21  A_22 ]^{−1} = [ (A_11 − A_12 A_22^{−1} A_21)^{−1}   −B_11 A_12 A_22^{−1} ;
                                              −A_22^{−1} A_21 B_11   A_22^{−1} + A_22^{−1} A_21 B_11 A_12 A_22^{−1} ]

  – If A_11^{−1} exists, then we have

        B_11 = A_11^{−1} + A_11^{−1} A_12 B_22 A_21 A_11^{−1}
        B_22 = D^{−1} = (A_22 − A_21 A_11^{−1} A_12)^{−1}
        B_12 = −A_11^{−1} A_12 B_22
        B_21 = −B_22 A_21 A_11^{−1}

    The matrices T and D are called the Schur complements of the matrix A.

• Particular case 1: If A is upper block-triangular, i.e. A_21 = 0, then B is also upper block-triangular:

    B = [ A_11  A_12 ;  0  A_22 ]^{−1} = [ A_11^{−1}   −A_11^{−1} A_12 A_22^{−1} ;  0   A_22^{−1} ]

• Particular case 2: If A is a lower block-triangular matrix, i.e. A_12 = 0, then B is also lower block-triangular:

    B = [ A_11  0 ;  A_21  A_22 ]^{−1} = [ A_11^{−1}   0 ;  −A_22^{−1} A_21 A_11^{−1}   A_22^{−1} ]

• Particular case 3: If A_22 = y is a scalar and A_12 = x and A_21^t = z are vectors, we have

    A = [ A_11  x ;  z^t  y ],   B = A^{−1} = (1/α) [ α A_11^{−1} + w v^t   w ;  v^t   1 ]

where α, w and v are given by

    α = y − z^t A_11^{−1} x = |A| / |A_11|,   w = −A_11^{−1} x,   v = −A_11^{−t} z

• If A is an (N, P) matrix,

    I_N ± A A^t = [I_N ± A R^{−1} A^t] [I_N ± A R^{−1} A^t]^t

where R = I_P + [I_P ± A^t A]^{1/2}.

• If x is a vector and u(x) a scalar function of x, and if we define the gradient vector ∇u = ∂u/∂x = [∂u/∂x_i], then we have the following relations:

  – If u = θ^t x then ∇u = ∂u/∂x = θ
  – If u = x^t A x (with A symmetric) then ∇u = ∂u/∂x = 2Ax
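The matrix inversion lemma (6.92) and the rank-one update (6.94) are easy to verify numerically on random, well-conditioned matrices. A sketch (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned A
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, p)) + p * np.eye(p)
D = rng.standard_normal((p, n))
inv = np.linalg.inv

# (6.92): [A + BCD]^-1 = A^-1 - A^-1 B [D A^-1 B + C^-1]^-1 D A^-1
lhs = inv(A + B @ C @ D)
rhs = inv(A) - inv(A) @ B @ inv(D @ inv(A) @ B + inv(C)) @ D @ inv(A)
err_woodbury = np.abs(lhs - rhs).max()

# (6.94): rank-one update [A + u v^t]^-1
u = rng.standard_normal(n); v = rng.standard_normal(n)
Ai = inv(A)
lhs1 = inv(A + np.outer(u, v))
rhs1 = Ai - np.outer(Ai @ u, v @ Ai) / (1.0 + v @ Ai @ u)
err_rank1 = np.abs(lhs1 - rhs1).max()

print(err_woodbury, err_rank1)   # both ~ 0 up to round-off
```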
Generalized Gaussian

    p(x) = [p^{1−1/p} / (2σ Γ(1/p))] exp[ −(1/p) |x − x_0|^p / σ^p ]

    p = 1  →  p(x) = (1/(2σ)) exp[ −|x − x_0|/σ ]
    p = 2  →  p(x) = (1/√(2πσ²)) exp[ −(1/2) |x − x_0|²/σ² ]
    p = ∞  →  p(x) = { 1/(2σ) if |x − x_0| < σ ;  0 otherwise }

Centered case x_0 = 0:

    p(x) = [p^{1−1/p} / (2σ Γ(1/p))] exp[ −(1/p) |x|^p / σ^p ]

    p = 1  →  p(x) = (1/(2σ)) exp[ −|x|/σ ]
    p = 2  →  p(x) = (1/√(2πσ²)) exp[ −(1/2) |x|²/σ² ]
    p = ∞  →  p(x) = { 1/(2σ) if |x| < σ ;  0 otherwise }

Multivariable case:

Separable:

    p(x) = [p^{n−n/p} / (2^n σ^n Γ^n(1/p))] exp[ −(1/(p σ^p)) ∑_{i=1}^n |x_i|^p ]

Correlated: Markov models

    p(x) = (1/Z(α)) exp[ −α ∑_{i=1}^n ∑_{j ∼ i} φ(x_i − x_j) ],
    Z(α) = ∫ exp[ −α ∑_{i=1}^n ∑_{j ∼ i} φ(x_i − x_j) ] dx

where j ∼ i denotes the neighbours of the site i.

Example: φ(x) = x², with neighbours j = i − 1:

    p(x) = (1/Z(α)) exp[ −α x_1² − α ∑_{i=2}^n (x_i − x_{i−1})² ]

This can be written as

    p(x) = (1/Z(α)) exp[ −α x^t D^t D x ]

with

    D = [ 1   0   ···  0 ;
         −1   1   ···  0 ;
              ⋱   ⋱     ;
          0  ···  −1   1 ]

and

    Z(α) = (2π)^{n/2} (2α)^{−n/2} |D^t D|^{−1/2}

Extension:

    p(x) = (1/Z(α)) exp[ −α φ(x_1) − α ∑_{i=2}^n φ(x_i − x_{i−1}) ] = (1/Z(α)) exp[ −α ∑_{i=1}^n φ([Dx]_i) ]

with φ(x) = |x|^p. The questions are: does Z(α) exist, and can we obtain an analytical expression for it?
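The normalizing constant of the scalar generalized Gaussian can be checked numerically: for any p the density above should integrate to one. A sketch for the centered case with σ = 1 (the grid and the values of p are arbitrary):

```python
import math

def gg_pdf(x, p, sigma=1.0):
    """Centered generalized Gaussian density with exponent p and scale sigma."""
    c = p ** (1 - 1.0 / p) / (2 * sigma * math.gamma(1.0 / p))
    return c * math.exp(-abs(x) ** p / (p * sigma ** p))

totals = {}
for p in (1.0, 1.5, 2.0, 4.0):
    h, lim = 0.001, 20.0
    totals[p] = sum(gg_pdf(-lim + i * h, p) for i in range(int(2 * lim / h))) * h
print(totals)   # each total ~ 1
```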
Chapter 7

Elements of signal estimation

In the previous chapters we discussed methods for designing estimators for static parameter estimation. In this chapter we consider the case of dynamic or time-varying parameters (signal estimation).

7.1 Introduction

In many time-varying systems, the physical quantities of interest x can be modeled as obeying a dynamic equation

    x_{n+1} = f_n(x_n, u_n)    (7.1)

where

• x_0, x_1, ..., is a sequence of vectors in IR^N, called the state of the system, representing the unknown quantities of interest;

• u_0, u_1, ..., is a sequence of vectors in IR^M, called the state input of the system, representing the influencing quantities acting on x_n;

• f_0, f_1, ..., is a sequence of functions mapping IR^N × IR^M to IR^N, called the state equation of the system, representing the dynamic model relating x_n and u_n.

A dynamic system is such that, for any fixed k and l, x_k is completely determined from the state at time l and the inputs from times l up to k − 1. So, complete determination of x_n, n = 1, 2, ..., requires not only the inputs u_n, n = 0, 1, 2, ..., but also the initial condition x_0.
Equation (7.1) is called the state equation. Associated to this equation is the observation equation

    z_n = h_n(x_n, v_n)    (7.2)

where

• z_0, z_1, ..., is a sequence of vectors in IR^P representing the observable quantities;

• v_0, v_1, ..., is a sequence of vectors in IR^P representing the errors on the observations;

• h_0, h_1, ..., is a sequence of functions mapping IR^N × IR^P to IR^P representing the observation model.
The main problem then is to estimate the state vector x_k from the observations z_0, z_1, ..., z_l.

Example 1: One-dimensional motion

Consider a moving target subjected to an acceleration A_t for t > 0. Its position X_t and its velocity V_t at time t satisfy

    V_t = dX_t/dt,   A_t = dV_t/dt    (7.3)

Assume that we can measure the position X_t at the time instants t_n = nT and we wish to write a model of type (7.1) describing the motion. Assuming T is small, a Taylor series approximation allows us to write

    X_{n+1} ≃ X_n + T V_n,   V_{n+1} ≃ V_n + T A_n    (7.4)

From these equations we see that the two quantities X_n and V_n are necessary to describe the motion. So, defining

    x_n = (X_n, V_n)^t,   U_n = A_n,   Z_n = X_n + V_n    (7.5)

(where V_n in the last equation denotes the measurement noise of (7.2), not the velocity), we can write

    x_{n+1} = F x_n + G U_n,   Z_n = H x_n + V_n,   with  F = [ 1  T ;  0  1 ],  G = (0, T)^t,  H = (1  0)    (7.6)

i.e.

    f_n(x, u) = Fx + Gu,   h_n(x, v) = Hx + v    (7.7)

In this example, we assumed that we can measure directly the position of the moving target. In general, however, we may observe a quantity z(n) related to the unknown quantity x(n) by a linear transformation:

    x(n)  →  [ Linear System ]  →  z(n)
    (not observable)               (observable)

and we want to estimate x(n) from the observed values of {z(n), n = 1, ..., k}. The estimate x̂(n) is then a function of the data {z(n), n = 1, ..., k} and we write

    x̂(n | z(1), z(2), ..., z(k)) =: x̂(n | k)

Three cases may occur:

• we may want to estimate x(n + k) from the past observations. The estimate x̂(n + k | n) is called the k-step prediction of x(n + k) and the estimation procedure is called prediction;

• we may want to estimate x(n) from present and past observations. The estimate x̂(n | n) is the filtered value of x(n) and the estimation procedure is called filtering;

• we may want to estimate x(n) from past, present and future observations. The estimate x̂(n | n + l) is the smoothed value of x(n) and the estimation procedure is called smoothing.
7.2 Kalman filtering: general linear case

In this section we consider linear systems with finite dimensions described by the following equations:

    x_{k+1} = F_k x_k + G_k u_k     (state equation)
    z_k = H_k x_k + v_k             (observation equation)

where

• k = 0, 1, 2, ... represents the discrete time;

• x_k is an N-dimensional vector called the state vector of the system;

• z_k is a P-dimensional vector containing the observations (output of the system);

• v_k is a P-dimensional vector containing the observation errors (output noise of the system);

• u_k is an M-dimensional vector representing the state representation error (state-space noise process);

• F_k, G_k and H_k, with respective dimensions (N, N), (N, M) and (P, N), are the state transition, the state input and the observation matrices, and are assumed to be known;

• the noise sequences {u_k} and {v_k} are assumed to be centered, white and jointly Gaussian;

• the initial state x_0 is also assumed to be Gaussian and independent of {u_k} and {v_k}:

    E[ (v_k ; x_0 ; u_k) (v_l^t, x_0^t, u_l^t) ] = [ R_k δ_kl  0  0 ;  0  P_0  0 ;  0  0  Q_k δ_kl ]

where R_k is the covariance matrix of the observation noise vector v_k, Q_k is the covariance matrix of the state noise vector u_k, and P_0 is the covariance matrix of the initial state x_0.

Remember that the aim is to find a best estimate x̂_{k|l} of x_k from the observations z_1, z_2, ..., z_l. Depending on the relative position of k with respect to l we have:

• if k > l: prediction;

• if k = l: filtering;

• if k < l: smoothing.

Of course, for fixed k and l, this problem is no different from the vector parameter estimation of the last chapter. However, we are usually interested in producing the estimates either in real time or at least on-line for increasing k.
Three different approaches can be used to obtain the Kalman filtering equations:

• Linear Mean Square (LMS) estimation:

    x̂_{k|l} := LMS(x_k | z_1, ..., z_l)

which minimizes

    E[ [x_k − x̂_{k|l}]^t W_k [x_k − x̂_{k|l}] ]

• Maximum A Posteriori (MAP) estimation:

    x̂_{k|l} = arg max_x { p(x_k | z_1, ..., z_l) }

• Bayesian MSE estimation:

    x̂_{k|l} = E[ x_k | z_1, ..., z_l ]

We know that, for linear relations and under the Gaussian assumption, all these estimates are equivalent. We consider here the last approach.
The main procedure is to apply the Bayes rule recursively to find the expression of the posterior law p(x_k | z_1, ..., z_l). Note that we can obtain this expression easily thanks to the following facts:

1. all variables are assumed Gaussian;

2. {u_k} and {v_k} are assumed white, Gaussian and mutually independent;

3. the state and the observation models are linear;

4. all the conditional laws such as p(z_{k+1} | x_{k+1}) and p(z_{k+1} | z_{1:k}) are Gaussian, so the posterior law

    p(x_{k+1} | z_{1:k+1}) = p(x_{k+1} | z_{1:k}) p(z_{k+1} | x_{k+1}) / p(z_{k+1} | z_{1:k})

is also Gaussian.
To obtain the equations in the general case we denote by
• $\hat{x}_{k|k}$ the estimate of the state vector at time $k$ from the observations up to time $k$;
• $\hat{x}_{k+1|k}$ the estimate of the state vector at time $k+1$ from the observations up to the instant $k$;
• $e_{k+1} = z_{k+1} - H_{k+1}\, \hat{x}_{k+1|k}$ the innovation process of the observations at the instant $k+1$;
• the covariance matrix of the innovation
$$R^e_{k+1} = E\left[e_{k+1}\, e^t_{k+1}\right]$$
(the innovation sequence is white, i.e., uncorrelated in time);
• the covariance matrix of the prediction error
$$P_{k+1|k} = E\left[[x_{k+1} - \hat{x}_{k+1|k}]\,[x_{k+1} - \hat{x}_{k+1|k}]^t\right];$$
• the posterior covariance matrix of the estimation error
$$P_{k+1|k+1} = E\left[[x_{k+1} - \hat{x}_{k+1|k+1}]\,[x_{k+1} - \hat{x}_{k+1|k+1}]^t\right]$$
which is also called the covariance matrix of the filtering error.
With these definitions, we obtain easily:
$$E[x_k|z_{1:k}] \stackrel{\text{def}}{=} \hat{x}_{k|k}, \qquad \mathrm{Cov}\{x_k|z_{1:k}\} \stackrel{\text{def}}{=} P_{k|k}$$
$$E[x_{k+1}|z_{1:k}] \stackrel{\text{def}}{=} \hat{x}_{k+1|k}, \qquad \mathrm{Cov}\{x_{k+1}|z_{1:k}\} \stackrel{\text{def}}{=} P_{k+1|k}$$
$$E[z_{k+1}|x_{k+1}] = H_{k+1}\, x_{k+1}, \qquad \mathrm{Cov}\{z_{k+1}|x_{k+1}\} = R_{k+1}$$
$$E[x_{k+1}|z_{1:k}] = F_k\, \hat{x}_{k|k}, \qquad \mathrm{Cov}\{x_{k+1}|z_{1:k}\} = F_k\, P_{k|k}\, F^t_k + G_k\, Q_k\, G^t_k$$
$$E[z_{k+1}|z_{1:k}] = H_{k+1}\, F_k\, \hat{x}_{k|k}, \qquad \mathrm{Cov}\{z_{k+1}|z_{1:k}\} = H_{k+1}\, P_{k+1|k}\, H^t_{k+1} + R_{k+1}$$
Replacing the expressions of $p(z_{k+1}|x_{k+1})$ and $p(z_{k+1}|z_{1:k})$ we obtain:
$$p(x_{k+1}|z_{1:k+1}) = A \exp\left[-\frac{1}{2}\,[x_{k+1} - \hat{x}_{k+1|k+1}]^t\, P^{-1}_{k+1|k+1}\,[x_{k+1} - \hat{x}_{k+1|k+1}]\right]$$
with
$$A = (2\pi)^{-n/2}\, \left|H_{k+1}\, P_{k+1|k}\, H^t_{k+1} + R_{k+1}\right|^{1/2} |R_{k+1}|^{-1/2}\, \left|P_{k+1|k}\right|^{-1/2}$$
$$\hat{x}_{k+1|k+1} = F_k\, \hat{x}_{k|k} + P_{k+1|k}\, H^t_{k+1}\left[R_{k+1} + H_{k+1}\, P_{k+1|k}\, H^t_{k+1}\right]^{-1}\left[z_{k+1} - H_{k+1}\, F_k\, \hat{x}_{k|k}\right]$$
$$P_{k+1|k} = F_k\, P_{k|k}\, F^t_k + G_k\, Q_k\, G^t_k$$
$$P^{-1}_{k+1|k+1} = P^{-1}_{k+1|k} + H^t_{k+1}\, R^{-1}_{k+1}\, H_{k+1}$$
These equations can be rewritten in many different ways. Here are two of them:
• Prediction-Correction form:
– Prediction (time update):
$$\hat{x}_{k+1|k} = F_k\, \hat{x}_{k|k}$$
$$P_{k+1|k} = F_k\, P_{k|k}\, F^t_k + G_k\, Q_k\, G^t_k$$
– Correction (measurement update):
$$\hat{x}_{k+1|k+1} = \hat{x}_{k+1|k} + K^g_{k+1}\left[z_{k+1} - H_{k+1}\, \hat{x}_{k+1|k}\right]$$
$$K^g_{k+1} = P_{k+1|k}\, H^t_{k+1}\, (R^e_{k+1})^{-1}$$
$$R^e_{k+1} = R_{k+1} + H_{k+1}\, P_{k+1|k}\, H^t_{k+1}$$
$$P_{k+1|k+1} = \left[I - K^g_{k+1}\, H_{k+1}\right] P_{k+1|k}$$
• Compact form for prediction:
$$\hat{x}_{k+1|k} = F_k\, \hat{x}_{k|k-1} + K_k\, (R^e_k)^{-1}\left[z_k - H_k\, \hat{x}_{k|k-1}\right]$$
$$R^e_k = R_k + H_k\, P_{k|k-1}\, H^t_k$$
$$K_k = F_k\, P_{k|k-1}\, H^t_k$$
$$P_{k+1|k} = F_k\, P_{k|k-1}\, F^t_k + G_k\, Q_k\, G^t_k - K_k\, (R^e_k)^{-1}\, K^t_k$$
• Compact form for filtering:
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K^g_k\left[z_k - H_k\, \hat{x}_{k|k-1}\right]$$
$$K^g_k = P_{k|k-1}\, H^t_k\left[R_k + H_k\, P_{k|k-1}\, H^t_k\right]^{-1}$$
$$P_{k|k} = P_{k|k-1} - K^g_k\, H_k\, P_{k|k-1}$$
$$P_{k+1|k} = F_k\, P_{k|k}\, F^t_k + G_k\, Q_k\, G^t_k$$
• Very compact form for prediction:
$$\hat{x}_{k+1|k} = F_k\, \hat{x}_{k|k-1} + K^g_k\left[z_k - H_k\, \hat{x}_{k|k-1}\right]$$
$$K^g_k = F_k\, P_{k|k-1}\, H^t_k\left[R_k + H_k\, P_{k|k-1}\, H^t_k\right]^{-1}$$
$$P_{k+1|k} = F_k\, P_{k|k-1}\, F^t_k - K^g_k\, H_k\, P_{k|k-1}\, F^t_k + G_k\, Q_k\, G^t_k$$
where $K_k$ is called the Kalman filter gain and $K^g_k = K_k\, (R^e_k)^{-1}$ the generalized Kalman filter gain.
In all cases the initialization is:
$$\hat{x}_{0|-1} = 0, \qquad P_{0|-1} = P_0$$
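The prediction-correction form above can be sketched in a few lines of code. The following is a minimal illustration and is not taken from the text: `kalman_step` is a hypothetical helper name, and the constant-velocity model values at the bottom are arbitrary.

```python
import numpy as np

def kalman_step(x, P, z, F, G, H, Q, R):
    """One prediction-correction cycle: from (x_{k|k}, P_{k|k}) and z_{k+1}
    to (x_{k+1|k+1}, P_{k+1|k+1})."""
    # Prediction (time update)
    x_pred = F @ x
    P_pred = F @ P @ F.T + G @ Q @ G.T
    # Correction (measurement update)
    Re = R + H @ P_pred @ H.T                 # innovation covariance R^e_{k+1}
    Kg = P_pred @ H.T @ np.linalg.inv(Re)     # generalized gain K^g_{k+1}
    x_new = x_pred + Kg @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - Kg @ H) @ P_pred
    return x_new, P_new

# Illustrative constant-velocity model with scalar position measurements
F = np.array([[1.0, 1.0], [0.0, 1.0]])
G = np.eye(2)
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[1.0]])
x, P = np.zeros(2), 10.0 * np.eye(2)          # x_{0|-1} = 0, P_{0|-1} = P_0
for z in [1.0, 2.1, 2.9, 4.2]:
    x, P = kalman_step(x, P, np.array([z]), F, G, H, Q, R)
```

With a large initial covariance the filter follows the measurements closely at first, then weights its own prediction more as $P$ shrinks.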
7.3 Examples
7.3.1 1D case
$$\begin{cases} x_{n+1} = f\, x_n + u_n \\ z_n = h\, x_n + v_n \end{cases} \qquad (7.8)$$
where $u_n$ and $v_n$ are assumed independent, zero-mean, white and Gaussian with known variances $q$ and $r$ respectively. $x_0$ is also assumed Gaussian with known mean $m_0$ and known variance $p_0$:
$$u_n \sim \mathcal{N}(0, q), \qquad v_n \sim \mathcal{N}(0, r), \qquad x_0 \sim \mathcal{N}(m_0, p_0) \qquad (7.9)$$
The equations in this case reduce to
$$\begin{cases} \hat{x}_{n+1|n} = f\, \hat{x}_{n|n} \\[4pt] \hat{x}_{n|n} = \hat{x}_{n|n-1} + k_n\, (z_n - h\, \hat{x}_{n|n-1}) \\[4pt] k_n = \dfrac{p_{n|n-1}\, h}{h^2\, p_{n|n-1} + r} = \dfrac{1}{h}\; \dfrac{p_{n|n-1}}{p_{n|n-1} + r/h^2} \end{cases} \qquad (7.10)$$
The role of the Kalman gain in the measurement update is easily seen from these expressions: $p_{n|n-1}$ is the MSE incurred in the estimation of $x_n$ from $z_{0:n-1}$, and the ratio $r/h^2$ is a measure of the noisiness of the observations. It is interesting to compare these equations with the Bayesian estimation of the signal amplitude as described in Example ().
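As a sanity check, the scalar recursions (7.10) can be run on data simulated from (7.8)-(7.9). The sketch below is not from the text; the values of $f$, $h$, $q$, $r$ are illustrative. It verifies that the filtered estimates beat the raw observations in mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
f, h, q, r = 0.9, 1.0, 0.1, 0.5           # illustrative model values
m0, p0 = 0.0, 1.0

# Simulate the model (7.8)-(7.9)
N = 200
x = np.zeros(N)
x[0] = rng.normal(m0, np.sqrt(p0))
for n in range(N - 1):
    x[n + 1] = f * x[n] + rng.normal(0, np.sqrt(q))
z = h * x + rng.normal(0, np.sqrt(r), N)

# Scalar Kalman recursions (7.10)
xp, p = m0, p0                             # x_{0|-1}, p_{0|-1}
est = np.zeros(N)
for n in range(N):
    k = p * h / (h**2 * p + r)             # gain k_n
    xf = xp + k * (z[n] - h * xp)          # x_{n|n}
    pf = p - k * h * p                     # p_{n|n}
    est[n] = xf
    xp = f * xf                            # x_{n+1|n}
    p = f**2 * pf + q                      # p_{n+1|n}
```

The empirical MSE of `est` against the true state is close to the steady-state value of $p_{n|n}$, well below the observation noise variance $r$.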
For this particular time-invariant model, we have
$$\begin{cases} p_{n+1|n} = f^2\, p_{n|n} + q \\[4pt] p_{n|n} = \dfrac{p_{n|n-1}}{\frac{h^2}{r}\, p_{n|n-1} + 1} \end{cases} \qquad (7.11)$$
We can eliminate the coupling between these equations and obtain
$$p_{n+1|n} = \frac{f^2\, p_{n|n-1}}{\frac{h^2}{r}\, p_{n|n-1} + 1} + q \qquad (7.12)$$
We see here that as $n$ increases, $p_{n+1|n}$ and so the gain $k_n$ approach a constant. Note that if $p_{n+1|n}$ does approach a constant, say $p_\infty$, then $p_\infty$ must satisfy
$$p_\infty = \frac{f^2\, p_\infty}{\frac{h^2}{r}\, p_\infty + 1} + q \qquad (7.13)$$
This equation is quadratic and has a unique positive solution
$$p_\infty = \frac{1}{2}\left\{\left[\left(\frac{r}{h^2}(1-f^2) - q\right)^2 + \frac{4rq}{h^2}\right]^{1/2} - \frac{r}{h^2}(1-f^2) + q\right\} \qquad (7.14)$$
Moreover,
$$\left|p_{n+1|n} - p_\infty\right| = f^2 \left|\frac{p_{n|n-1}}{\frac{h^2}{r}\, p_{n|n-1} + 1} - \frac{p_\infty}{\frac{h^2}{r}\, p_\infty + 1}\right| \le f^2 \left|p_{n|n-1} - p_\infty\right|$$
and, iterating this inequality,
$$\left|p_{n+1|n} - p_\infty\right| \le f^{2(n+1)} \left|p_0 - p_\infty\right|.$$
This means that if $|f| < 1$ then $p_{n+1|n}$ converges to $p_\infty$. So $|f| < 1$ is a sufficient condition for the Kalman filter to approach a steady state.
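Both the iterated Riccati recursion (7.12) and the closed-form root (7.14) can be checked numerically. In the sketch below (model values illustrative, $|f| < 1$), the iteration converges to the same $p_\infty$ from an arbitrary starting point.

```python
import numpy as np

f, h, q, r = 0.8, 1.0, 0.2, 1.0       # illustrative values, |f| < 1

# Iterate the Riccati recursion (7.12) from an arbitrary starting point
p = 5.0                                # arbitrary p_{0|-1}
for _ in range(200):
    p = f**2 * p / ((h**2 / r) * p + 1.0) + q

# Closed-form positive root (7.14) of the quadratic fixed-point equation
b = (r / h**2) * (1 - f**2) - q
p_inf = 0.5 * (np.sqrt(b**2 + 4 * r * q / h**2) - b)
```

Since the recursion is a contraction with factor $f^2 < 1$, a couple of hundred iterations is far more than needed for agreement to machine precision.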
7.3.2 Track-While-Scan (TWS) Radar
Let us consider the example of a one-dimensional moving target and assume that the target is subject to a random acceleration $A_n$. Then we have
$$\begin{pmatrix} X_{n+1} \\ V_{n+1} \end{pmatrix} = \begin{pmatrix} 1 & T \\ 0 & 1 \end{pmatrix} \begin{pmatrix} X_n \\ V_n \end{pmatrix} + \begin{pmatrix} 0 \\ T \end{pmatrix} A_n$$
$$Z_n = \begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} X_n \\ V_n \end{pmatrix} + e_n$$
For a more general case in 3D we have a state vector with 6 components (3 positions and 3 velocities). But, if we assume that the measurement noises in the 3 dimensions are independent of one another and independent of the components of the acceleration, the problem can be treated as 3 independent one-dimensional moving-target problems.
The Kalman equations for this simple model become
$$\begin{pmatrix} \hat{X}_{n+1|n} \\ \hat{V}_{n+1|n} \end{pmatrix} = \begin{pmatrix} \hat{X}_{n|n} + T\, \hat{V}_{n|n} \\ \hat{V}_{n|n} \end{pmatrix}$$
$$\begin{pmatrix} \hat{X}_{n|n} \\ \hat{V}_{n|n} \end{pmatrix} = \begin{pmatrix} \hat{X}_{n|n-1} \\ \hat{V}_{n|n-1} \end{pmatrix} + \begin{pmatrix} K_{n,1} \\ K_{n,2} \end{pmatrix}\left(Z_n - \hat{X}_{n|n-1}\right)$$
$$\begin{pmatrix} K_{n,1} \\ K_{n,2} \end{pmatrix} = \begin{pmatrix} \dfrac{P(1,1)}{P(1,1)+r} \\[8pt] \dfrac{P(2,1)}{P(1,1)+r} \end{pmatrix}$$
where $P(k,l)$ is the $(k,l)$-th component of the matrix $P_{n|n-1}$.
To reduce the computation, the time-varying elements of the Kalman gain vector can be replaced with some constants (the steady-state values) to obtain
$$\begin{pmatrix} \hat{X}_{n|n} \\ \hat{V}_{n|n} \end{pmatrix} = \begin{pmatrix} \hat{X}_{n|n-1} \\ \hat{V}_{n|n-1} \end{pmatrix} + \begin{pmatrix} \alpha \\ \beta/T \end{pmatrix}\left(Z_n - \hat{X}_{n|n-1}\right) \qquad (7.15)$$
with constant values for $\alpha$ and $\beta$.
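A minimal sketch of the constant-gain tracker (7.15). The gains and noise levels below are illustrative choices, not values derived in the text; even with fixed gains the filter reduces the position error well below the raw measurement error.

```python
import numpy as np

rng = np.random.default_rng(1)
T, r, sigma_a = 1.0, 4.0, 0.05         # illustrative sampling period and noise levels
alpha, beta = 0.5, 0.2                 # fixed (steady-state-like) gains

# Simulate a target with white random acceleration
N = 300
X = np.zeros(N); V = np.zeros(N)
for n in range(N - 1):
    A = rng.normal(0, sigma_a)
    X[n + 1] = X[n] + T * V[n]
    V[n + 1] = V[n] + T * A
Z = X + rng.normal(0, np.sqrt(r), N)

# Constant-gain (alpha-beta) tracking, equation (7.15)
x_hat = v_hat = 0.0
err = np.zeros(N)
for n in range(N):
    x_pred = x_hat + T * v_hat          # prediction
    v_pred = v_hat
    innov = Z[n] - x_pred               # innovation
    x_hat = x_pred + alpha * innov
    v_hat = v_pred + (beta / T) * innov
    err[n] = x_hat - X[n]
```

The price of freezing the gains is a small loss of optimality compared with the exact time-varying Kalman gains, in exchange for two multiplications per update.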
7.3.3 Track-While-Scan (TWS) Radar with dependent acceleration sequences
The simple model of the previous example is not very realistic, because it assumed that the target is subject to an independent random acceleration at each instant. For a heavy target, we can do a little better by assuming that the acceleration $A_n$ is modeled as
$$A_{n+1} = \rho\, A_n + W_n, \qquad n = 0, 1, \ldots$$
a first-order autoregressive (AR) model. The value of $\rho$ can be chosen according to the target: $\rho$ near 0 means a very low inertia target and $\rho$ near 1 means a high inertia target.
To account for this equation, we can extend the state vector and we obtain
$$\begin{pmatrix} X_{n+1} \\ V_{n+1} \\ A_{n+1} \end{pmatrix} = \begin{pmatrix} 1 & T & 0 \\ 0 & 1 & T \\ 0 & 0 & \rho \end{pmatrix} \begin{pmatrix} X_n \\ V_n \\ A_n \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} W_n$$
$$Z_n = \begin{pmatrix} 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} X_n \\ V_n \\ A_n \end{pmatrix} + e_n$$
We again can apply the Kalman equations to this model and obtain:
$$\begin{pmatrix} \hat{X}_{n+1|n} \\ \hat{V}_{n+1|n} \\ \hat{A}_{n+1|n} \end{pmatrix} = \begin{pmatrix} \hat{X}_{n|n} + T\, \hat{V}_{n|n} \\ \hat{V}_{n|n} + T\, \hat{A}_{n|n} \\ \rho\, \hat{A}_{n|n} \end{pmatrix}$$
$$\begin{pmatrix} \hat{X}_{n|n} \\ \hat{V}_{n|n} \\ \hat{A}_{n|n} \end{pmatrix} = \begin{pmatrix} \hat{X}_{n|n-1} \\ \hat{V}_{n|n-1} \\ \hat{A}_{n|n-1} \end{pmatrix} + \begin{pmatrix} K_{n,1} \\ K_{n,2} \\ K_{n,3} \end{pmatrix}\left(Z_n - \hat{X}_{n|n-1}\right)$$
$$\begin{pmatrix} K_{n,1} \\ K_{n,2} \\ K_{n,3} \end{pmatrix} = \begin{pmatrix} \dfrac{P(1,1)}{P(1,1)+r} \\[8pt] \dfrac{P(2,1)}{P(1,1)+r} \\[8pt] \dfrac{P(3,1)}{P(1,1)+r} \end{pmatrix}$$
where $P(k,l)$ is the $(k,l)$-th component of the matrix $P_{n|n-1}$. Again, to reduce the computation, the time-varying elements of the Kalman gain vector can be replaced with some constants related to their steady-state values, to obtain
$$\begin{pmatrix} \hat{X}_{n|n} \\ \hat{V}_{n|n} \\ \hat{A}_{n|n} \end{pmatrix} = \begin{pmatrix} \hat{X}_{n|n-1} \\ \hat{V}_{n|n-1} \\ \hat{A}_{n|n-1} \end{pmatrix} + \begin{pmatrix} \alpha \\ \beta/T \\ \gamma/T^2 \end{pmatrix}\left(Z_n - \hat{X}_{n|n-1}\right) \qquad (7.16)$$
with constant values for $\alpha$, $\beta$ and $\gamma$.
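The augmented three-state model can be run directly through the general prediction-correction equations. The sketch below is illustrative (the values of $T$, $\rho$, $r$ and the driving-noise level are arbitrary assumptions): it builds $F$, $G$, $H$ and filters simulated data.

```python
import numpy as np

T, rho, r, sw = 1.0, 0.9, 1.0, 0.1     # illustrative parameters
F = np.array([[1.0, T,   0.0],
              [0.0, 1.0, T  ],
              [0.0, 0.0, rho]])
G = np.array([[0.0], [0.0], [1.0]])    # W_n drives the acceleration only
H = np.array([[1.0, 0.0, 0.0]])
Q = np.array([[sw**2]])
R = np.array([[r]])

rng = np.random.default_rng(2)
x, P = np.zeros(3), 10.0 * np.eye(3)   # filter state
xt = np.zeros(3)                       # true state
for _ in range(100):
    xt = F @ xt + (G * rng.normal(0, sw)).ravel()
    z = H @ xt + rng.normal(0, np.sqrt(r), 1)
    # prediction
    xp = F @ x
    Pp = F @ P @ F.T + G @ Q @ G.T
    # correction
    K = Pp @ H.T / (H @ Pp @ H.T + R)
    x = xp + (K * (z - H @ xp)).ravel()
    P = (np.eye(3) - K @ H) @ Pp
```

Since only the position is observed, the velocity and acceleration estimates are inferred entirely through the model coupling in $F$.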
7.4 Fast Kalman filter equations
The general equations of Kalman filtering do not assume any stationarity of the system: all the matrices of the system $F$, $G$ and $H$, and also $R$ and $Q$, may depend on the index $k$.
Through the simple example in the previous section, we saw that, when these quantities are independent of $k$, i.e.
$$\hat{x}_{k+1|k} = F\, \hat{x}_{k|k-1} + K_k\, (R^e_k)^{-1}\left[z_k - H\, \hat{x}_{k|k-1}\right]$$
$$R^e_k = R + H\, P_{k|k-1}\, H^t$$
$$K_k = F\, P_{k|k-1}\, H^t$$
$$P_{k+1|k} = F\, P_{k|k-1}\, F^t + G\, Q\, G^t - K_k\, (R^e_k)^{-1}\, K^t_k$$
the system can reach a stationary point, and we can use the steady-state values of the gain, or an approximation to them, to reduce the calculation cost. But doing so does not give an optimal estimate and may not give a very satisfactory solution. Here, we briefly present a slightly better way to obtain fast algorithms without losing too much optimality.
Let us assume a constant system and denote by $\delta P_k$, $\delta K^g_k$ and $\delta R^e_k$ the increments
$$\delta P_k = P_{k|k-1} - P_{k-1|k-2}, \qquad \delta K^g_k = K^g_k - K^g_{k-1}, \qquad \delta R^e_k = R^e_k - R^e_{k-1}$$
Then, it can be shown that $\delta P_{k+1}$ can be factorized as
$$\delta P_{k+1} = \left[F - K^g_{k-1} H\right]\left[\delta P_k - \delta P_k\, H^t\, (R^e_k)^{-1}\, H\, \delta P_k\right]\left[F - K^g_{k-1} H\right]^t$$
$$\phantom{\delta P_{k+1}} = \left[F - K^g_k H\right]\left[\delta P_k + \delta P_k\, H^t\, (R^e_{k-1})^{-1}\, H\, \delta P_k\right]\left[F - K^g_k H\right]^t$$
Then, it is possible to reduce the cost of the calculations by noting that if $\delta P_1$ can be factorized as
$$\delta P_1 = z_0\, M_0\, z^t_0,$$
then $\delta P_{k+1}$ can also be factorized as
$$\delta P_{k+1} = z_k\, M_k\, z^t_k$$
and we obtain the following equations:
$$z_k = \left[F - K^g_k H\right] z_{k-1}$$
$$M_k = M_{k-1} + M_{k-1}\, z^t_{k-1}\, H^t\, (R^e_{k-1})^{-1}\, H\, z_{k-1}\, M_{k-1}$$
$$R^e_{k+1} = R^e_k + H\, z_k\, M_k\, z^t_k\, H^t$$
$$K_{k+1} = K^g_{k+1}\, R^e_{k+1} = K_k + F\, z_k\, M_k\, z^t_k\, H^t$$
$$P_{k|k-1} = P_0 + \sum_{j=0}^{k-1} z_j\, M_j\, z^t_j$$
These are called the Chandrasekhar equations. Note that if $\alpha = \mathrm{rank}\{\delta P_1\}$, where
$$\delta P_1 = F\, P_0\, F^t + G\, Q\, G^t - K_0\, (R^e_0)^{-1}\, K^t_0 - P_0,$$
then $z_k$ has dimensions $(N,\alpha)$ and $M_k$ has dimensions $(\alpha,\alpha)$. So, in place of updating at each time $k$ the matrix $P_k$ with dimensions $(N,N)$, we only have to update the matrices $z_k$ and $M_k$ with dimensions $(N,\alpha)$ and $(\alpha,\alpha)$.
Note also that $M_0$ is the signature matrix of $\delta P_1$ and that the value of $\alpha$ depends on the choice of the initial covariance matrix $P_0$. It is not unusual to have $\alpha = 1$, which greatly reduces the computation cost.
7.5 Kalman filter equations for signal deconvolution
Starting from the convolution equation
$$z(k) = \sum_{i=0}^{p-1} h(i)\, x(k-i) + b(k)$$
rewritten in matrix form
$$\begin{pmatrix} z(1) \\ \vdots \\ z(k) \\ \vdots \\ z(M) \end{pmatrix} = \begin{pmatrix} h(p-1) & \cdots & h(0) & & & \\ & \ddots & & \ddots & & \\ & & h(p-1) & \cdots & h(0) & \\ & & & \ddots & & \ddots \\ & & & & h(p-1) & \cdots & h(0) \end{pmatrix} \begin{pmatrix} x(-p) \\ \vdots \\ x(0) \\ \vdots \\ x(M) \end{pmatrix} + \begin{pmatrix} b(1) \\ \vdots \\ b(k) \\ \vdots \\ b(M) \end{pmatrix}$$
we can propose the following models:
• Constant state vector model
$$\begin{cases} x_{k+1} = x_k = x = [x_{-p}, \ldots, x_{-1}, x_0, x_1, \ldots, x_M]^t \\ z_k = h^t_k\, x_k + v_k \end{cases}$$
$$h_k = \begin{pmatrix} 0 & \cdots & 0 & h_{p-1} & \cdots & h_0 & 0 & \cdots & 0 \end{pmatrix}^t$$
where the coefficient $h_0$ is in the $k$-th position. Then we have
$$u_k = 0, \qquad F_k = G_k = I, \qquad h_{k+1} = D\, h_k$$
with the down-shift matrix
$$D = \begin{pmatrix} 0 & \cdots & \cdots & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & & & \vdots \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & & 1 & 0 \end{pmatrix}$$
If we denote
$$E[x] = x_0, \qquad E\left[[x - x_0][x - x_0]^t\right] = P_0, \qquad E[v_k] = 0, \qquad E[v_k v_j] = r\, \delta_{kj},$$
we obtain
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\, (r^e_k)^{-1}\left[z_k - h^t_k\, \hat{x}_{k|k-1}\right]$$
$$r^e_k = r + h^t_k\, P_{k|k-1}\, h_k$$
$$K_k = P_{k|k-1}\, h_k$$
$$P_{k+1|k} = P_{k|k-1} - K_k\, (r^e_k)^{-1}\, K^t_k$$
Note that the observations $z_k$ are scalar, so $r^e_k$ is also a scalar quantity; but $x$ is an $N$-dimensional vector, and so the covariance matrix $P$ has dimensions $(N \times N)$.
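For the constant-state model, the recursion above is just a recursive least-squares update with scalar observations. The following sketch uses an illustrative FIR response and noise level, with index conventions simplified for compactness (the exact alignment of $h_0$ within $h_k$ is an assumption of this demo):

```python
import numpy as np

rng = np.random.default_rng(3)
p_len, M = 3, 60
h = np.array([0.5, 1.0, 0.3])          # illustrative FIR coefficients
N = M + p_len                          # length of the unknown input vector x
x_true = rng.normal(0, 1, N)
r = 0.01                               # observation noise variance

# Constant-state Kalman: x_{k+1} = x_k, z_k = h_k^t x + v_k
x_hat = np.zeros(N)                    # prior mean x_0 = 0
P = np.eye(N)                          # prior covariance P_0 = I
for k in range(M):
    hk = np.zeros(N)
    hk[k:k + p_len] = h                # shifted copy of the impulse response
    z = hk @ x_true + rng.normal(0, np.sqrt(r))
    re = r + hk @ P @ hk               # scalar innovation variance r^e_k
    K = P @ hk / re                    # gain K_k (r^e_k)^{-1}
    x_hat = x_hat + K * (z - hk @ x_hat)
    P = P - np.outer(K, hk @ P)        # P_{k+1|k} = P - K (r^e)^{-1} K^t
```

Each update costs $O(N^2)$ for the covariance, which is why the structured (Toeplitz/circulant) formulations discussed next matter in practice.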
• Non-constant state space model
We can choose
$$h = [h_0, \ldots, h_{p-1}]^t, \qquad x_k = [x_k, x_{k-1}, \ldots, x_{k-p+1}]^t$$
$$z(k) = h^t\, x_k + b(k), \qquad \text{dimension of } x_k = p \le N.$$
But now we need to introduce a generating state space model for $x_k$. One such model is an AR model:
$$x(n+1) = \sum_{i=1}^{q} a(i)\, x(n-i+1) + u(n+1)$$
where
$$E[u_n] = 0, \qquad E\left[|u_n|^2\right] = \beta^2, \qquad E[u_m u_n] = 0, \quad m \ne n.$$
It is easy then to see that we can write
$$\begin{pmatrix} x(n+1) \\ x(n) \\ \vdots \\ x(n-q+2) \end{pmatrix} = \begin{pmatrix} a_1 & a_2 & \cdots & \cdots & a_q \\ 1 & 0 & \cdots & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & & 1 & 0 \end{pmatrix} \begin{pmatrix} x(n) \\ x(n-1) \\ \vdots \\ x(n-q+1) \end{pmatrix} + \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} u(n+1)$$
Thus we have
$$\begin{cases} x_{k+1} = F\, x_k + G\, u_{k+1} \\ z_k = h^t\, x_k + b_k \end{cases}$$
with $F$ the companion matrix above and $G = [1, 0, \ldots, 0]^t$.
This model has the advantage that $F$, $G$ and $H$ are constant, so we can use fast algorithms; but the main drawback in practical applications is the determination of $q$ and $a_k$, $k = 1, \ldots, q$.
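Building the companion-form pair $(F, G)$ from AR coefficients is mechanical. The helper below, `ar_state_space`, is a hypothetical name introduced for this sketch; the short check confirms that the first state component reproduces the direct AR recursion.

```python
import numpy as np

def ar_state_space(a):
    """Companion-form (F, G) for x(n+1) = sum_i a[i] x(n-i) + u(n+1)."""
    q = len(a)
    F = np.zeros((q, q))
    F[0, :] = a                        # first row: a_1, ..., a_q
    F[1:, :-1] = np.eye(q - 1)         # sub-diagonal shifts the delayed samples
    G = np.zeros((q, 1))
    G[0, 0] = 1.0
    return F, G

F, G = ar_state_space([0.5, -0.3])     # illustrative AR(2) coefficients

# Drive the state model and record the first component
rng = np.random.default_rng(4)
u = rng.normal(size=20)
s = np.zeros((2, 1))
traj = []
for un in u:
    s = F @ s + G * un
    traj.append(s[0, 0])

# Direct recursion for comparison: x(n) = 0.5 x(n-1) - 0.3 x(n-2) + u(n)
xs = [0.0, 0.0]
for un in u:
    xs.append(0.5 * xs[-1] - 0.3 * xs[-2] + un)
```

The two trajectories agree exactly, since the state model is just a bookkeeping rewrite of the recursion.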
Consider the following convolution model (input $f(t)$ filtered by a system $H$ and corrupted by additive noise $b(t)$ to give the output $g(t)$):
$$g(t) = \int f(t')\, h(t - t')\, \mathrm{d}t' + b(t) = \int h(t')\, f(t - t')\, \mathrm{d}t' + b(t)$$
and assume the following hypotheses:
• The signals $f(t)$, $g(t)$, $h(t)$ are discretized with the same sampling period $\Delta T = 1$;
• The impulse response is finite (FIR): $h(t) = 0$ for $t < -q\,\Delta T$ or $t > p\,\Delta T$.
Then we have:
$$g(m) = \sum_{k=-q}^{p} h(k)\, f(m-k) + b(m), \qquad m = 0, \ldots, M$$
or in matrix form
$$\begin{pmatrix} g(0) \\ g(1) \\ \vdots \\ g(M) \end{pmatrix} = \begin{pmatrix} h(p) & \cdots & h(0) & \cdots & h(-q) & 0 & \cdots & 0 \\ 0 & \ddots & & \ddots & & \ddots & & \vdots \\ \vdots & & h(p) & \cdots & h(0) & \cdots & h(-q) & \vdots \\ \vdots & & & \ddots & & \ddots & & 0 \\ 0 & \cdots & \cdots & 0 & h(p) & \cdots & h(0) & \cdots \end{pmatrix} \begin{pmatrix} f(-p) \\ \vdots \\ f(0) \\ f(1) \\ \vdots \\ f(M) \\ f(M+1) \\ \vdots \\ f(M+q) \end{pmatrix}$$
or
$$g = H f + b$$
Note that $g$ is an $(M+1)$-dimensional vector, $f$ has dimension $M+p+q+1$, $h = [h(p), \ldots, h(0), \ldots, h(-q)]$ has dimension $(p+q+1)$, and the matrix $H$ has dimensions $(M+1) \times (M+p+q+1)$.
Now, if we assume that the system is causal ($q = 0$) we obtain
$$\begin{pmatrix} g(0) \\ g(1) \\ \vdots \\ g(M) \end{pmatrix} = \begin{pmatrix} h(p) & \cdots & h(0) & 0 & \cdots & 0 \\ 0 & \ddots & & \ddots & & \vdots \\ \vdots & & h(p) & \cdots & h(0) & \vdots \\ 0 & \cdots & 0 & h(p) & \cdots & h(0) \end{pmatrix} \begin{pmatrix} f(-p) \\ \vdots \\ f(0) \\ f(1) \\ \vdots \\ f(M) \end{pmatrix}$$
If the input signal is also assumed to be causal, we obtain:
$$\begin{pmatrix} g(0) \\ g(1) \\ \vdots \\ g(M) \end{pmatrix} = \begin{pmatrix} h(0) & & & & \\ h(1) & h(0) & & & \\ \vdots & & \ddots & & \\ h(p) & \cdots & h(0) & & \\ & \ddots & & \ddots & \\ 0 & \cdots & h(p) & \cdots & h(0) \end{pmatrix} \begin{pmatrix} f(0) \\ f(1) \\ \vdots \\ f(M) \end{pmatrix} \qquad (7.17)$$
and finally, if $p = M$, we have:
$$\begin{pmatrix} g(0) \\ g(1) \\ \vdots \\ g(M) \end{pmatrix} = \begin{pmatrix} h(0) & & & \\ h(1) & h(0) & & \\ \vdots & & \ddots & \\ h(M) & \cdots & h(1) & h(0) \end{pmatrix} \begin{pmatrix} f(0) \\ f(1) \\ \vdots \\ f(M) \end{pmatrix}$$
Remark that, in all cases, the matrix $H$ is Toeplitz. In the case where the input signal and the system are both causal, (7.17) can be rewritten as
$$\begin{pmatrix} g(0) \\ g(1) \\ \vdots \\ g(M) \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} h(0) & 0 & \cdots & 0 & h(p) & \cdots & h(1) \\ h(1) & \ddots & & & \ddots & & \vdots \\ \vdots & & & & & & h(p) \\ h(p) & \cdots & h(0) & & & & 0 \\ 0 & \ddots & & \ddots & & & \vdots \\ \vdots & & h(p) & \cdots & h(0) & & 0 \\ 0 & \cdots & & 0 & h(p) & \cdots & h(0) \end{pmatrix} \begin{pmatrix} f(0) \\ f(1) \\ \vdots \\ f(M) \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$
where $f$ and $g$ have been artificially completed with some zeros. This operation is called zero-padding (zero-filling), and its main advantage is that the matrix $H$ is now a circulant matrix.
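The circulant structure is what makes zero-padding useful: a circulant matrix is diagonalized by the DFT, so the product $Hf$ can be computed with FFTs in $O(L \log L)$ instead of $O(L^2)$. A small numerical check (sizes and impulse response are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
M, p = 16, 2
h = np.array([1.0, 0.6, 0.2])            # h(0), h(1), h(2): causal FIR, p = 2
f = np.zeros(M + 1 + p)                   # f(0..M) completed with p zeros
f[:M + 1] = rng.normal(size=M + 1)
L = len(f)

# Circulant matrix whose first column is the zero-padded impulse response
c = np.zeros(L)
c[:p + 1] = h
H = np.array([[c[(i - j) % L] for j in range(L)] for i in range(L)])
g = H @ f

# Diagonalization by the DFT: H f = IDFT( DFT(c) * DFT(f) )
g_fft = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(f)))
```

Because $f$ ends with at least $p$ zeros, the circular convolution coincides with the linear (Toeplitz) one on the first $M+1$ samples.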
Starting from the Kalman filter equations for the state space model
$$\begin{cases} z_k = H_k\, x_k + v_k & \text{(observation equation)} \\ x_{k+1} = F_k\, x_k + G_k\, u_k & \text{(state equation)} \end{cases}$$
we have
$$\hat{x}_{k+1|k} = F_k\, \hat{x}_{k|k}$$
$$P_{k+1|k} = F_k\, P_{k|k}\, F^t_k + G_k\, Q_k\, G^t_k$$
$$\hat{x}_{k+1|k+1} = \hat{x}_{k+1|k} + K^f_{k+1}\left[z_{k+1} - H_{k+1}\, \hat{x}_{k+1|k}\right]$$
$$K^f_{k+1} = P_{k+1|k}\, H^t_{k+1}\, (R^e_{k+1})^{-1}$$
$$R^e_{k+1} = R_{k+1} + H_{k+1}\, P_{k+1|k}\, H^t_{k+1}$$
$$P_{k+1|k+1} = \left[I - K^f_{k+1}\, H_{k+1}\right] P_{k+1|k}$$
7.5.1 AR, MA and ARMA Models
AR model:
$$u(n) = \sum_{k=1}^{p} a(k)\, u(n-k) + \epsilon(n), \qquad \forall n$$
$$E[\epsilon(n)] = 0, \qquad E\left[|\epsilon(n)|^2\right] = \beta^2, \qquad E[\epsilon(n)\, u(m)] = 0, \quad m \ne n$$
$$\epsilon(n) \;\longrightarrow\; \boxed{H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a(k)\, z^{-k}}} \;\longrightarrow\; u(n)$$
MA model:
$$u(n) = \sum_{k=0}^{q} b(k)\, \epsilon(n-k), \qquad \forall n$$
$$\epsilon(n) \;\longrightarrow\; \boxed{B(z) = \sum_{k=0}^{q} b(k)\, z^{-k}} \;\longrightarrow\; u(n)$$
ARMA model:
$$u(n) = \sum_{k=1}^{p} a(k)\, u(n-k) + \sum_{l=0}^{q} b(l)\, \epsilon(n-l)$$
$$\epsilon(n) \;\longrightarrow\; \boxed{H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{q} b(k)\, z^{-k}}{1 - \sum_{k=1}^{p} a(k)\, z^{-k}}} \;\longrightarrow\; u(n)$$
which can also be seen as the cascade
$$\epsilon(n) \;\longrightarrow\; \boxed{B_q(z)} \;\longrightarrow\; \boxed{\frac{1}{A_p(z)}} \;\longrightarrow\; u(n)$$
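The equivalence of the direct ARMA recursion and the $B_q(z)$ followed by $1/A_p(z)$ cascade can be checked numerically. The coefficients below are illustrative, and the sketch uses the sign convention $u(n) = \sum_k a(k)\, u(n-k) + \sum_l b(l)\, \epsilon(n-l)$:

```python
import numpy as np

rng = np.random.default_rng(6)
a = np.array([0.6, -0.2])               # AR part (illustrative, stable)
b = np.array([1.0, 0.4])                # MA part (illustrative)
N = 500
eps = rng.normal(size=N)

# Direct ARMA recursion
u = np.zeros(N)
for n in range(N):
    u[n] = sum(a[k] * u[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0) \
         + sum(b[l] * eps[n - l] for l in range(len(b)) if n - l >= 0)

# Cascade: MA filter B(z) first, then the all-pole filter 1/A(z)
w = np.zeros(N)                          # w = B(z) eps
for n in range(N):
    w[n] = sum(b[l] * eps[n - l] for l in range(len(b)) if n - l >= 0)
u2 = np.zeros(N)
for n in range(N):
    u2[n] = w[n] + sum(a[k] * u2[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
```

Since both paths realize the same LTI transfer function $B(z)/A(z)$ from the same initial rest conditions, the outputs are identical.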
In a dynamic system, in general, we are interested in a physical quantity $x$ through the observation of a quantity $z$ related to $x$ by the following system of equations:
$$\begin{cases} x_{n+1} = f_n(x_n, u_n) \\ z_n = h_n(x_n, v_n) \end{cases} \qquad (7.18)$$
Chapter 8
Some complements to Bayesian estimation
8.1 Choice of a prior law in the Bayesian estimation
One of the main difficulties in the application of Bayesian theory in practice is the choice or the attribution of the direct probabilities $f(x|\theta)$ and $\pi(\theta)$. In general, $f(x|\theta)$ is obtained via an appropriate model relating the observable quantity $X$ to the parameters $\theta$ and is well accepted. The choice or the attribution of the prior $\pi(\theta)$ has been, and still is, the main subject of discussion and controversy between Bayesian and orthodox statisticians.
Here, I will try to give a brief summary of the different approaches and tools that can be used to attribute a prior probability distribution. There are mainly four tools:
• use of some invariance principles;
• use of the maximum entropy (ME) principle;
• use of conjugate and reference priors;
• use of other information criteria.
8.1.1 Invariance principles
Definition 1 [Group invariance] A probability model $f(x|\theta)$ is said to be invariant (or closed) under the action of a group of transformations $\mathcal{G}$ if, for every $g \in \mathcal{G}$, there exists a unique $\theta^* = \bar{g}(\theta) \in \mathcal{T}$ such that $y = g(x)$ is distributed according to $f(y|\theta^*)$.
Example 1 Any probability density function of the form $f(x|\theta) = f(x - \theta)$ is invariant under the translation group
$$\mathcal{G} : \{g_c(x) : g_c(x) = x + c,\; c \in \mathbb{R}\} \qquad (8.1)$$
This can be verified as follows:
$$x \sim f(x-\theta) \;\longrightarrow\; y = x + c \sim f(y - \theta^*) \quad \text{with } \theta^* = \theta + c$$
Example 2 Any probability density function of the form $f(x|\theta) = \frac{1}{\theta} f\!\left(\frac{x}{\theta}\right)$ is invariant under the multiplicative or scale transformation group
$$\mathcal{G} : \{g_s(x) : g_s(x) = s\,x,\; s > 0\} \qquad (8.2)$$
This can be verified as follows:
$$x \sim \frac{1}{\theta} f\!\left(\frac{x}{\theta}\right) \;\longrightarrow\; y = s\,x \sim \frac{1}{\theta^*} f\!\left(\frac{y}{\theta^*}\right) \quad \text{with } \theta^* = s\,\theta$$
Example 3 Any probability density function of the form $f(x|\theta_1, \theta_2) = \frac{1}{\theta_2} f\!\left(\frac{x-\theta_1}{\theta_2}\right)$ is invariant under the affine transformation group
$$\mathcal{G} : \{g_{a,b}(x) : g_{a,b}(x) = a\,x + b,\; a > 0,\; b \in \mathbb{R}\} \qquad (8.3)$$
This can be verified as follows:
$$x \sim \frac{1}{\theta_2} f\!\left(\frac{x-\theta_1}{\theta_2}\right) \;\longrightarrow\; y = a\,x + b \sim \frac{1}{\theta^*_2} f\!\left(\frac{y-\theta^*_1}{\theta^*_2}\right) \quad \text{with } \theta^*_2 = a\,\theta_2,\; \theta^*_1 = a\,\theta_1 + b.$$
Example 4 Any multivariate probability density function of the form $f(\mathbf{x}|\boldsymbol{\theta}) = f(\mathbf{x} - \boldsymbol{\theta})$ is invariant under the translation group
$$\mathcal{G} : \{g_{\mathbf{c}}(\mathbf{x}) : g_{\mathbf{c}}(\mathbf{x}) = \mathbf{x} - \mathbf{c},\; \mathbf{c} \in \mathbb{R}^n\} \qquad (8.4)$$
Example 5 Any multivariate probability density function of the form $f(\mathbf{x}) = f(\|\mathbf{x}\|)$ is invariant under the orthogonal transformation group
$$\mathcal{G} : \left\{g_A(\mathbf{x}) : g_A(\mathbf{x}) = A\mathbf{x},\; A^t A = A A^t = I\right\} \qquad (8.5)$$
Example 6 Any multivariate probability density function of the form $f(\mathbf{x}|\theta) = \frac{1}{\theta} f\!\left(\frac{\|\mathbf{x}\|}{\theta}\right)$ is invariant under the following transformation group
$$\mathcal{G} : \left\{g_{A,s}(\mathbf{x}) : g_{A,s}(\mathbf{x}) = s\,A\mathbf{x},\; A^t A = A A^t = I,\; s > 0\right\} \qquad (8.6)$$
This can be verified as follows:
$$\mathbf{x} \sim \frac{1}{\theta} f\!\left(\frac{\|\mathbf{x}\|}{\theta}\right) \;\longrightarrow\; \mathbf{y} = s\,A\mathbf{x} \sim \frac{1}{\theta^*} f\!\left(\frac{\|\mathbf{y}\|}{\theta^*}\right) \quad \text{with } \theta^* = s\,\theta.$$
From these examples we also see that any invariance transformation group $\mathcal{G}$ on $x \in \mathcal{X}$ induces a corresponding transformation group $\bar{\mathcal{G}}$ on $\theta \in \mathcal{T}$. For example, the translation invariance $\mathcal{G}$ on $x \in \mathcal{X}$ induces the following translation group on $\theta \in \mathcal{T}$:
$$\bar{\mathcal{G}} : \{\bar{g}_c(\theta) : \bar{g}_c(\theta) = \theta + c,\; c \in \mathbb{R}\} \qquad (8.7)$$
and the scale invariance $\mathcal{G}$ on $x \in \mathcal{X}$ induces the following scale group on $\theta \in \mathcal{T}$:
$$\bar{\mathcal{G}} : \{\bar{g}_s(\theta) : \bar{g}_s(\theta) = s\,\theta,\; s > 0\} \qquad (8.8)$$
We have just seen that for an invariant family of $f(x|\theta)$ we have a corresponding invariant family of prior laws $\pi(\theta)$. To be complete, we also have to consider the cost function to be able to define the Bayesian estimate.
Definition 2 [Invariant cost functions] Assume a probability model $f(x|\theta)$ is invariant under the action of the group of transformations $\mathcal{G}$. Then the cost function $c[\hat{\theta}, \theta]$ is said to be invariant under $\mathcal{G}$ if, for every $g \in \mathcal{G}$ and $\hat{\theta} \in \mathcal{T}$, there exists a unique $\hat{\theta}^* = \tilde{g}(\hat{\theta}) \in \mathcal{T}$ with $\tilde{g} \in \tilde{\mathcal{G}}$ such that
$$c[\hat{\theta}, \theta] = c[\hat{\theta}^*, \bar{g}(\theta)] \quad \text{for every } \theta \in \mathcal{T}.$$
Definition 3 [Invariant estimate] For a probability model $f(x|\theta)$ invariant under the group of transformations $\mathcal{G}$ and a cost function $c[\hat{\theta}, \theta]$ invariant under the corresponding group of transformations $\bar{\mathcal{G}}$, an estimate $\hat{\theta}$ is said to be invariant or equivariant if
$$\hat{\theta}(g(x)) = \bar{g}\left(\hat{\theta}(x)\right)$$
Example 7 Estimation of $\theta$ from data coming from any model of the kind $f(x|\theta) = f(x - \theta)$ with the quadratic cost function $c[\hat{\theta}, \theta] = (\hat{\theta} - \theta)^2$ is equivariant, and we have
$$\mathcal{G} = \bar{\mathcal{G}} = \tilde{\mathcal{G}} = \{g_c(x) : g_c(x) = x - c,\; c \in \mathbb{R}\}$$
Example 8 Estimation of $\theta$ from data coming from any model of the kind $f(x|\theta) = \frac{1}{\theta} f\!\left(\frac{x}{\theta}\right)$ with the entropy cost function
$$c[\hat{\theta}, \theta] = \frac{\hat{\theta}}{\theta} - \ln\!\left(\frac{\hat{\theta}}{\theta}\right) - 1$$
is equivariant, and we have
$$\mathcal{G} = \{g_s(x) : g_s(x) = s\,x,\; s > 0\}, \qquad \bar{\mathcal{G}} = \tilde{\mathcal{G}} = \{\bar{g}_s(\theta) : \bar{g}_s(\theta) = s\,\theta,\; s > 0\}$$
Proposition 1 [Invariant Bayesian estimate] Suppose that a probability model $f(x|\theta)$ is invariant under the group of transformations $\mathcal{G}$ and that there exists a probability distribution $\pi^*(\theta)$ on $\mathcal{T}$ which is invariant under the group of transformations $\bar{\mathcal{G}}$, i.e.,
$$\pi^*(\bar{g}(A)) = \pi^*(A)$$
for any measurable set $A \subset \mathcal{T}$. Then the Bayes estimator associated with $\pi^*$, noted $\hat{\theta}^*$, minimizes
$$\int R(\theta, \hat{\theta})\, \pi^*(\theta)\, \mathrm{d}\theta = \int R\!\left(\theta, \bar{g}(\hat{\theta})\right) \pi^*(\theta)\, \mathrm{d}\theta = \int E\left[c\!\left[\theta, \tilde{g}\!\left(\hat{\theta}(X)\right)\right]\right] \pi^*(\theta)\, \mathrm{d}\theta$$
over $\hat{\theta}$. If this Bayes estimator is unique, it satisfies
$$\hat{\theta}^*(x) = \bar{g}^{-1}\!\left(\hat{\theta}^*(g(x))\right)$$
Therefore, a Bayes estimator associated with an invariant prior and a strictly convex invariant cost function is almost equivariant.
Actually, invariant probability distributions are rare. The following are some examples:
Example 9 If $\pi(\theta)$ is invariant under the translation group $\mathcal{G}_c$, it satisfies $\pi(\theta) = \pi(\theta + c)$ for every $\theta$ and every $c$, which implies that $\pi(\theta) = \pi(0)$ uniformly on $\mathbb{R}$; this leads to the Lebesgue measure as an invariant measure.
Example 10 If $\theta > 0$ and $\pi(\theta)$ is invariant under the scale group $\mathcal{G}_s$, it satisfies $\pi(\theta) = s\, \pi(s\theta)$ for every $\theta > 0$ and every $s > 0$, which implies that $\pi(\theta) \propto 1/\theta$.
Note that in both cases the invariant laws are improper.
8.2 Conjugate priors
The conjugate prior concept is tightly related to sufficient statistics and exponential families.
Definition 4 [Sufficient statistics] When $X \sim P_\theta(x)$, a function $h(X)$ is said to be a sufficient statistic for $\{P_\theta(x), \theta \in \mathcal{T}\}$ if the distribution of $X$ conditioned on $h(X)$ does not depend on $\theta$ for $\theta \in \mathcal{T}$.
Definition 5 [Minimal sufficiency] A function $h(X)$ is said to be minimal sufficient for $\{P_\theta(x), \theta \in \mathcal{T}\}$ if it is a function of every other sufficient statistic for $P_\theta(x)$.
A minimal sufficient statistic contains the whole information brought by the observation $X = x$ about $\theta$.
Proposition 2 [Factorization theorem] Suppose that $\{P_\theta(x), \theta \in \mathcal{T}\}$ has a corresponding family of densities $\{p_\theta(x), \theta \in \mathcal{T}\}$. A statistic $T$ is sufficient for $\theta$ if and only if there exist functions $g_\theta$ and $h$ such that
$$p_\theta(x) = g_\theta(T(x))\, h(x) \qquad (8.9)$$
for all $x \in \Gamma$ and $\theta \in \mathcal{T}$.
Example 11 If $X \sim \mathcal{N}(\theta, 1)$ then $T(x) = x$ can be chosen as a sufficient statistic.
Example 12 If $\{X_1, X_2, \ldots, X_n\}$ are i.i.d. and $X_i \sim \mathcal{N}(\theta, 1)$ then
$$f(x|\theta) = (2\pi)^{-n/2} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2\right] = \exp\left[-\frac{1}{2}\sum_{i=1}^{n} x_i^2\right] (2\pi)^{-n/2} \exp\left[-\frac{n}{2}\theta^2\right] \exp\left[\theta \sum_{i=1}^{n} x_i\right]$$
and we have $T(x) = \sum_{i=1}^{n} x_i$.
Note that, in this case, we need to know $n$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Note also that we can write
$$f(x|\theta) = a(x)\, g(\theta)\, \exp[\theta\, T(x)]$$
where
$$g(\theta) = (2\pi)^{-n/2} \exp\left[-\frac{n}{2}\theta^2\right] \quad \text{and} \quad a(x) = \exp\left[-\frac{1}{2}\sum_{i=1}^{n} x_i^2\right]$$
Example 13 If $X \sim \mathcal{N}(0, \theta)$ then $T(x) = x^2$ can be chosen as a sufficient statistic.
Example 14 If $X \sim \mathcal{N}(\theta_1, \theta_2)$ then $T_1(x) = x^2$ and $T_2(x) = x$ can be chosen as a set of sufficient statistics.
Example 15 If $\{X_1, X_2, \ldots, X_n\}$ are i.i.d. and $X_i \sim \mathcal{N}(\theta_1, \theta_2)$ then
$$f(x|\theta_1, \theta_2) = (2\pi)^{-n/2}\, \theta_2^{-n/2} \exp\left[-\frac{1}{2\theta_2}\sum_{i=1}^{n}(x_i - \theta_1)^2\right]$$
$$= (2\pi)^{-n/2}\, \theta_2^{-n/2} \exp\left[-\frac{n\theta_1^2}{2\theta_2}\right] \exp\left[-\frac{1}{2\theta_2}\sum_{i=1}^{n} x_i^2 + \frac{\theta_1}{\theta_2}\sum_{i=1}^{n} x_i\right]$$
and we have $T_1(x) = \sum_{i=1}^{n} x_i$ and $T_2(x) = \sum_{i=1}^{n} x_i^2$.
Note also that we can write
$$f(x|\theta) = a(x)\, g(\theta_1, \theta_2)\, \exp\left[\frac{\theta_1}{\theta_2}\, T_1(x) - \frac{1}{2\theta_2}\, T_2(x)\right]$$
where
$$g(\theta_1, \theta_2) = (2\pi)^{-n/2}\, \theta_2^{-n/2} \exp\left[-\frac{n\theta_1^2}{2\theta_2}\right] \quad \text{and} \quad a(x) = 1.$$
In this case, $\frac{\theta_1}{\theta_2}$ and $-\frac{1}{2\theta_2}$ are called the canonical parametrization. It is also usual to use $n$, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\overline{x^2} = \frac{1}{n}\sum_{i=1}^{n} x_i^2$ as the sufficient statistics.
Example 16 If $X \sim \mathrm{Gam}(\alpha, \theta)$ then $T(x) = x$ can be chosen as a sufficient statistic.
Example 17 If $X \sim \mathrm{Gam}(\theta, \beta)$ then $T(x) = \ln x$ can be chosen as a sufficient statistic.
Example 18 If $X \sim \mathrm{Gam}(\theta_1, \theta_2)$ then $T_1(x) = \ln x$ and $T_2(x) = x$ can be chosen as a set of sufficient statistics.
Example 19 If $\{X_1, X_2, \ldots, X_n\}$ are i.i.d. and $X_i \sim \mathrm{Gam}(\theta_1, \theta_2)$ then it is easy to show that $T_1(x) = \sum_{i=1}^{n} \ln x_i$ and $T_2(x) = \sum_{i=1}^{n} x_i$.
Definition 6 [Exponential family] A class of distributions $\{P_\theta(x), \theta \in \mathcal{T}\}$ is said to be an exponential family if there exist $a(x)$ a function from $\Gamma$ to $\mathbb{R}$, $g(\theta)$ a function from $\mathcal{T}$ to $\mathbb{R}^+$, $\phi_k(\theta)$ functions from $\mathcal{T}$ to $\mathbb{R}$, and $h_k(x)$ functions from $\Gamma$ to $\mathbb{R}$ such that
$$p_\theta(x) = p(x|\theta) = a(x)\, g(\theta) \exp\left[\sum_{k=1}^{K} \phi_k(\theta)\, h_k(x)\right] = a(x)\, g(\theta) \exp\left[\boldsymbol{\phi}^t(\theta)\, \mathbf{h}(x)\right]$$
for all $\theta \in \mathcal{T}$ and $x \in \Gamma$. This family is entirely determined by $a(x)$, $g(\theta)$, and $\{\phi_k(\theta), h_k(x), k = 1, \ldots, K\}$ and is noted $\mathrm{Exf}(x|a, g, \boldsymbol{\phi}, \mathbf{h})$.
Particular cases:
• When $a(x) = 1$ and $g(\theta) = \exp[-b(\theta)]$ we have
$$p(x|\theta) = \exp\left[\boldsymbol{\phi}^t(\theta)\, \mathbf{h}(x) - b(\theta)\right]$$
and it is noted $\mathrm{CExf}(x|b, \boldsymbol{\phi}, \mathbf{h})$.
• Natural exponential family: when $a(x) = 1$, $g(\theta) = \exp[-b(\theta)]$, $\mathbf{h}(x) = x$ and $\boldsymbol{\phi}(\theta) = \theta$ we have
$$p(x|\theta) = \exp\left[\theta^t x - b(\theta)\right]$$
and it is noted $\mathrm{NExf}(x|b)$.
• Scalar random variable with a vector parameter:
$$p(x|\theta) = a(x)\, g(\theta) \exp\left[\sum_{k=1}^{K} \phi_k(\theta)\, h_k(x)\right] = a(x)\, g(\theta) \exp\left[\boldsymbol{\phi}^t(\theta)\, \mathbf{h}(x)\right]$$
and it is noted $\mathrm{Exf}_k(x|a, g, \boldsymbol{\phi}, \mathbf{h})$.
• Scalar random variable with a scalar parameter:
$$p(x|\theta) = a(x)\, g(\theta) \exp[\phi(\theta)\, h(x)]$$
and it is noted $\mathrm{Exf}(x|a, g, \phi, h)$.
• Simple scalar exponential family:
$$p(x|\theta) = \theta \exp[-\theta x] = \exp[-\theta x + \ln\theta], \qquad x \ge 0,\; \theta \ge 0.$$
Definition 7 [Conjugate distributions] A family $\mathcal{F}$ of probability distributions $\pi(\theta)$ on $\mathcal{T}$ is said to be conjugate (or closed under sampling) if, for every $\pi(\theta) \in \mathcal{F}$, the posterior distribution $\pi(\theta|x)$ also belongs to $\mathcal{F}$.
The main argument for the development of conjugate priors is the following: when the observation of a variable $X$ with a probability law $f(x|\theta)$ modifies the prior $\pi(\theta)$ to a posterior $\pi(\theta|x)$, the information conveyed by $x$ about $\theta$ is obviously limited; therefore it should not lead to a modification of the whole structure of $\pi(\theta)$, but only of its parameters.
Definition 8 [Conjugate priors] Assume that $f(x|\theta) = l(\theta|x) = l(\theta|t(x))$, where $t = \{n, s\} = \{n, s_1, \ldots, s_k\}$ is a vector of dimension $k+1$ and is a sufficient statistic for $f(x|\theta)$. Then, if there exists a vector $\{\tau_0, \boldsymbol{\tau}\} = \{\tau_0, \tau_1, \ldots, \tau_k\}$ such that
$$\pi(\theta|\boldsymbol{\tau}) = \frac{f(s = (\tau_1, \ldots, \tau_k)\,|\,\theta,\, n = \tau_0)}{\int f(s = (\tau_1, \ldots, \tau_k)\,|\,\theta',\, n = \tau_0)\, \mathrm{d}\theta'}$$
exists and defines a family $\mathcal{F}$ of distributions for $\theta \in \mathcal{T}$, then the posterior $\pi(\theta|x, \boldsymbol{\tau})$ will remain in the same family $\mathcal{F}$. The prior distribution $\pi(\theta|\boldsymbol{\tau})$ is then a conjugate prior for the sampling distribution $f(x|\theta)$.
Proposition 3 [Sufficient statistics for the exponential family] For a set of $n$ i.i.d. samples $\{x_1, \ldots, x_n\}$ of a random variable $X \sim \mathrm{Exf}(x|a, g, \boldsymbol{\phi}, \mathbf{h})$ we have
$$f(x|\theta) = \prod_{j=1}^{n} f(x_j|\theta) = [g(\theta)]^n \left[\prod_{j=1}^{n} a(x_j)\right] \exp\left[\sum_{k=1}^{K} \phi_k(\theta) \sum_{j=1}^{n} h_k(x_j)\right] = g^n(\theta)\, a(x) \exp\left[\boldsymbol{\phi}^t(\theta) \sum_{j=1}^{n} \mathbf{h}(x_j)\right],$$
where $a(x) = \prod_{j=1}^{n} a(x_j)$. Then, using the factorization theorem, it is easy to see that
$$t = \left(n,\; \sum_{j=1}^{n} h_1(x_j),\; \ldots,\; \sum_{j=1}^{n} h_K(x_j)\right)$$
is a sufficient statistic for $\theta$.
Proposition 4 [Conjugate priors of the exponential family] A conjugate prior family for the exponential family
$$f(x|\theta) = a(x)\, g(\theta) \exp\left[\sum_{k=1}^{K} \phi_k(\theta)\, h_k(x)\right]$$
is given by
$$\pi(\theta|\tau_0, \boldsymbol{\tau}) = z(\boldsymbol{\tau})\, [g(\theta)]^{\tau_0} \exp\left[\sum_{k=1}^{K} \tau_k\, \phi_k(\theta)\right]$$
The associated posterior law is
$$\pi(\theta|x, \tau_0, \boldsymbol{\tau}) \propto [g(\theta)]^{n+\tau_0}\, a(x)\, z(\boldsymbol{\tau}) \exp\left[\sum_{k=1}^{K}\left(\tau_k + \sum_{j=1}^{n} h_k(x_j)\right) \phi_k(\theta)\right].$$
We can rewrite this in a more compact way: if
$$f(x|\theta) = \mathrm{Exf}(x|a(x), g(\theta), \boldsymbol{\phi}, \mathbf{h}),$$
then a conjugate prior family is
$$\pi(\theta|\boldsymbol{\tau}) = \mathrm{Exf}(\theta|g^{\tau_0}, z(\boldsymbol{\tau}), \boldsymbol{\tau}, \boldsymbol{\phi}),$$
and the associated posterior law is
$$\pi(\theta|x, \boldsymbol{\tau}) = \mathrm{Exf}(\theta|g^{n+\tau_0}, a(x)\, z(\boldsymbol{\tau}), \boldsymbol{\tau}', \boldsymbol{\phi})$$
where
$$\tau'_k = \tau_k + \sum_{j=1}^{n} h_k(x_j), \qquad \text{i.e.} \qquad \boldsymbol{\tau}' = \boldsymbol{\tau} + \bar{\mathbf{h}}, \quad \text{with } \bar{h}_k = \sum_{j=1}^{n} h_k(x_j).$$
Definition 9 [Conjugate priors of the natural exponential family] If
$$f(x|\theta) = a(x) \exp\left[\theta^t x - b(\theta)\right]$$
then a conjugate prior family is
$$\pi(\theta|\boldsymbol{\tau}_0) = g(\theta) \exp\left[\boldsymbol{\tau}_0^t\, \theta - d(\boldsymbol{\tau}_0)\right]$$
and the corresponding posterior is
$$\pi(\theta|x, \boldsymbol{\tau}_0) = g(\theta) \exp\left[\boldsymbol{\tau}_n^t\, \theta - d(\boldsymbol{\tau}_n)\right] \quad \text{with} \quad \boldsymbol{\tau}_n = \boldsymbol{\tau}_0 + n\, \bar{\mathbf{x}}_n$$
where
$$\bar{\mathbf{x}}_n = \frac{1}{n} \sum_{j=1}^{n} x_j$$
A slightly more general notation, which gives some more explicit properties of the conjugate priors of the natural exponential family, is the following: if
$$f(x|\theta) = a(x) \exp\left[\theta^t x - b(\theta)\right]$$
then a conjugate prior family is
$$\pi(\theta|\alpha_0, \boldsymbol{\tau}_0) = g(\alpha_0, \boldsymbol{\tau}_0) \exp\left[\alpha_0\, \boldsymbol{\tau}_0^t\, \theta - \alpha_0\, b(\theta)\right]$$
The posterior is
$$\pi(\theta|\alpha_0, \boldsymbol{\tau}_0, x) = g(\alpha, \boldsymbol{\tau}) \exp\left[\alpha\, \boldsymbol{\tau}^t\, \theta - \alpha\, b(\theta)\right]$$
with
$$\alpha = \alpha_0 + n \quad \text{and} \quad \boldsymbol{\tau} = \frac{\alpha_0\, \boldsymbol{\tau}_0 + n\, \bar{\mathbf{x}}_n}{\alpha_0 + n}$$
and we have the following properties:
$$E[X|\theta] = \nabla b(\theta)$$
$$E[\nabla b(\Theta)|\alpha_0, \boldsymbol{\tau}_0] = \boldsymbol{\tau}_0$$
$$E[\nabla b(\Theta)|\alpha_0, \boldsymbol{\tau}_0, x] = \frac{n\, \bar{\mathbf{x}}_n + \alpha_0\, \boldsymbol{\tau}_0}{\alpha_0 + n} = \pi\, \bar{\mathbf{x}}_n + (1 - \pi)\, \boldsymbol{\tau}_0, \quad \text{with } \pi = \frac{n}{\alpha_0 + n}$$
Conjugate priors
Discrete variables:
• Binomial $\mathrm{Bin}(x|n,\theta)$; prior: Beta $\mathrm{Bet}(\theta|\alpha,\beta)$; posterior: Beta $\mathrm{Bet}(\theta|\alpha+x,\; \beta+n-x)$
• Negative Binomial $\mathrm{NegBin}(x|n,\theta)$; prior: Beta $\mathrm{Bet}(\theta|\alpha,\beta)$; posterior: Beta $\mathrm{Bet}(\theta|\alpha+n,\; \beta+x)$
• Multinomial $\mathrm{M}_k(x|\theta_1,\ldots,\theta_k)$; prior: Dirichlet $\mathrm{Di}_k(\theta|\alpha_1,\ldots,\alpha_k)$; posterior: Dirichlet $\mathrm{Di}_k(\theta|\alpha_1+x_1,\ldots,\alpha_k+x_k)$
• Poisson $\mathrm{Pn}(x|\theta)$; prior: Gamma $\mathrm{Gam}(\theta|\alpha,\beta)$; posterior: Gamma $\mathrm{Gam}(\theta|\alpha+x,\; \beta+1)$
Continuous variables:
• Gamma $\mathrm{Gam}(x|\nu,\theta)$; prior: Gamma $\mathrm{Gam}(\theta|\alpha,\beta)$; posterior: Gamma $\mathrm{Gam}(\theta|\alpha+\nu,\; \beta+x)$
• Beta $\mathrm{Bet}(x|\alpha,\theta)$; prior: Exponential $\mathrm{Ex}(\theta|\lambda)$; posterior: Exponential $\mathrm{Ex}(\theta|\lambda - \log(1-x))$
• Normal $\mathrm{N}(x|\theta,\sigma^2)$; prior: Normal $\mathrm{N}(\theta|\mu,\tau^2)$; posterior: Normal $\mathrm{N}\!\left(\theta \,\middle|\, \frac{\mu\sigma^2 + \tau^2 x}{\sigma^2 + \tau^2},\; \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}\right)$
• Normal $\mathrm{N}(x|\mu,1/\theta)$; prior: Gamma $\mathrm{Gam}(\theta|\alpha,\beta)$; posterior: Gamma $\mathrm{Gam}\!\left(\theta|\alpha+\frac{1}{2},\; \beta+\frac{1}{2}(\mu-x)^2\right)$
• Normal $\mathrm{N}(x|\theta,\theta^2)$; prior: Generalized inverse Normal $\mathrm{INg}(\theta|\alpha,\mu,\sigma) \propto |\theta|^{-\alpha}\exp\!\left[-\frac{1}{2\sigma^2}\left(\frac{1}{\theta}-\mu\right)^2\right]$; posterior: Generalized inverse Normal $\mathrm{INg}(\theta|\alpha_n,\mu_n,\sigma_n)$
Table 8.1: Relation between the sampling distributions $p(x|\theta)$, their associated conjugate priors $p(\theta|\tau)$ and their corresponding posteriors $p(\theta|x,\tau) \propto p(\theta|\tau)\, p(x|\theta)$
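Two rows of Table 8.1 written out numerically; all parameter values below are illustrative choices.

```python
# Beta-Binomial row: Bin(x|n, theta) with prior Bet(theta|alpha, beta)
alpha, beta_, n, x = 2.0, 3.0, 10, 7
alpha_post, beta_post = alpha + x, beta_ + n - x          # Bet(theta|alpha+x, beta+n-x)

# Normal-Normal row: N(x|theta, sigma2) with prior N(theta|mu, tau2)
mu, tau2, sigma2, xobs = 0.0, 4.0, 1.0, 2.5
mu_post = (mu * sigma2 + tau2 * xobs) / (sigma2 + tau2)   # posterior mean
var_post = sigma2 * tau2 / (sigma2 + tau2)                # posterior variance
```

In both cases the posterior stays in the prior's family, and only the hyperparameters are updated, which is exactly the point of conjugacy.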
8.3 Non informative priors based on Fisher information
Another notion of information, related to maximum likelihood estimation, is the Fisher information. In this section we first give some definitions and results related to this notion, and then we see how it is used to define non informative priors.
Proposition 5 [Information Inequality] Let $\hat{\theta}$ be an estimate of the parameter $\theta$ in a family $\{P_\theta;\ \theta \in \mathcal{T}\}$ and assume that the following conditions hold:
1. The family $\{P_\theta;\ \theta \in \mathcal{T}\}$ has a corresponding family of densities $\{p_\theta(x);\ \theta \in \mathcal{T}\}$, all with the same support.
2. $p_\theta(x)$ is differentiable for all $\theta \in \mathcal{T}$ and all $x$ in its support.
3. The integral
$$g(\theta) = \int_\Gamma h(x)\, p_\theta(x)\, \mu(\mathrm{d}x)$$
exists and is differentiable for $\theta \in \mathcal{T}$, for $h(x) = \hat{\theta}(x)$ and for $h(x) = 1$, and
$$\frac{\partial g(\theta)}{\partial \theta} = \int_\Gamma h(x)\, \frac{\partial p_\theta(x)}{\partial \theta}\, \mu(\mathrm{d}x)$$
Then
$$\mathrm{Var}_\theta[\hat{\theta}(X)] \ge \frac{\left[\frac{\partial}{\partial\theta}\, E_\theta\{\hat{\theta}(X)\}\right]^2}{I_\theta} \qquad (8.10)$$
where
$$I_\theta \stackrel{\text{def}}{=} E_\theta\left\{\left[\frac{\partial}{\partial\theta} \ln p_\theta(X)\right]^2\right\} \qquad (8.11)$$
Furthermore, if $\frac{\partial^2}{\partial\theta^2}\, p_\theta(x)$ exists for all $\theta \in \mathcal{T}$ and all $x$ in the support of $p_\theta(x)$, and if
$$\int \frac{\partial^2}{\partial\theta^2}\, p_\theta(x)\, \mu(\mathrm{d}x) = \frac{\partial^2}{\partial\theta^2} \int p_\theta(x)\, \mu(\mathrm{d}x),$$
then $I_\theta$ can be computed via
$$I_\theta = -E_\theta\left\{\frac{\partial^2}{\partial\theta^2} \ln p_\theta(X)\right\} \qquad (8.12)$$
The quantity defined in (8.11) is known as Fisher's information for estimating $\theta$ from $X$, and (8.10) is called the information inequality.
For the particular case in which $\hat{\theta}$ is unbiased, $E_\theta\{\hat{\theta}(X)\} = \theta$, the information inequality becomes
$$\mathrm{Var}_\theta[\hat{\theta}(X)] \ge \frac{1}{I_\theta} \qquad (8.13)$$
The expression $\frac{1}{I_\theta}$ is known as the Cramer-Rao lower bound (CRLB).
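The CRLB can be checked by Monte Carlo. For $n$ i.i.d. samples $X_i \sim \mathcal{N}(\theta, 1)$ the Fisher information of the sample is $I_\theta = n$, and the sample mean is unbiased and attains the bound $1/n$. The values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, trials = 1.5, 20, 5000

# Empirical variance of the sample mean over many repeated experiments
est = np.array([rng.normal(theta, 1.0, n).mean() for _ in range(trials)])
var_hat = est.var()
# CRLB for an unbiased estimator of theta is 1 / I_theta = 1 / n
```

The empirical variance of the estimator matches $1/n$ up to Monte Carlo error, confirming that the sample mean is efficient here.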
Example 20 [The information inequality for exponential families] Assume that $\mathcal{T}$ is open and $p_\theta$ is given by
$$p_\theta(x) = a(x)\, g(\theta) \exp[\phi(\theta)\, h(x)]$$
Then it can be shown that
$$I_\theta \stackrel{\text{def}}{=} E_\theta\left\{\left[\frac{\partial}{\partial\theta} \ln p_\theta(X)\right]^2\right\} = \left|\phi'(\theta)\right|^2\, \mathrm{Var}_\theta(h(X)) \qquad (8.14)$$
and
$$\frac{\partial}{\partial\theta}\, E_\theta\{h(X)\} = \phi'(\theta)\, \mathrm{Var}_\theta(h(X)) \qquad (8.15)$$
and thus, if we choose $\hat{\theta}(x) = h(x)$, we attain the lower bound in the information inequality (8.10):
$$\mathrm{Var}_\theta[\hat{\theta}(X)] = \frac{\left[\frac{\partial}{\partial\theta}\, E_\theta\{\hat{\theta}(X)\}\right]^2}{I_\theta} \qquad (8.16)$$
Definition 10 [Non informative priors] Assume $X \sim f(x|\theta) = p_\theta(x)$ and assume that
$$I_\theta \stackrel{\text{def}}{=} E_\theta\left\{\left[\frac{\partial}{\partial\theta} \ln p_\theta(X)\right]^2\right\} = -E_\theta\left\{\frac{\partial^2}{\partial\theta^2} \ln p_\theta(X)\right\} \qquad (8.17)$$
Then, a non informative prior $\pi(\theta)$ is defined as
$$\pi(\theta) \propto I_\theta^{1/2} \qquad (8.18)$$
Definition 11 [Non informative priors, case of vector parameters] Assume $X \sim f(x|\boldsymbol{\theta}) = p_{\boldsymbol{\theta}}(x)$ and assume that
$$I_{ij}(\boldsymbol{\theta}) \stackrel{\text{def}}{=} -E_{\boldsymbol{\theta}}\left\{\frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \ln p_{\boldsymbol{\theta}}(X)\right\} \qquad (8.19)$$
Then, a non informative prior $\pi(\boldsymbol{\theta})$ is defined as
$$\pi(\boldsymbol{\theta}) \propto |I(\boldsymbol{\theta})|^{1/2} \qquad (8.20)$$
where $I(\boldsymbol{\theta})$ is the Fisher information matrix with elements $I_{ij}(\boldsymbol{\theta})$.
Example 21 If
$$f(x|\theta) = a(x) \exp\left[\theta^t x - b(\theta)\right]$$
then
$$I(\theta) = \nabla\nabla^t\, b(\theta)$$
and
$$\pi(\theta) \propto |I(\theta)|^{1/2} = \left|\nabla\nabla^t\, b(\theta)\right|^{1/2}$$
Example 22 If
$$f(x|\theta) = \mathcal{N}(\mu, \sigma^2), \qquad \theta = (\mu, \sigma),$$
then
$$I(\theta) = E_\theta\left\{\begin{pmatrix} \dfrac{1}{\sigma^2} & \dfrac{2(X-\mu)}{\sigma^3} \\[8pt] \dfrac{2(X-\mu)}{\sigma^3} & \dfrac{3(X-\mu)^2}{\sigma^4} - \dfrac{1}{\sigma^2} \end{pmatrix}\right\} = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\[6pt] 0 & \dfrac{2}{\sigma^2} \end{pmatrix}$$
and
$$\pi(\theta) = \pi(\mu, \sigma) \propto |I(\theta)|^{1/2} \propto \frac{1}{\sigma^2}$$
Chapter 9
Linear Estimation
In previous chapters we saw that the optimum estimate, in the MMSE sense, of an unknown signal $X_t$, given the observations $Y_{a:b} = \{Y_a, \ldots, Y_b\}$ of a related quantity, is given by $\hat{X}_t = E[X_t|Y_{a:b}]$. This estimate is not, in general, a linear function of the data, and its computation needs the knowledge of the joint distribution of $\{X_t, Y_{a:b}\}$. Only when this joint distribution is Gaussian and when $X_t$ is related to $Y_{a:b}$ by a linear relation is this optimal estimate a linear function of the data. Even in this case, its computation needs the inversion of the covariance matrix of the data $\Sigma_Y$, whose dimensions increase with the number of data.
One way to circumvent these drawbacks is, from the first step, to constrain the estimate to be a linear function of the data. Doing so, as we will see below, we no longer need the joint distribution of $\{X_t, Y_{a:b}\}$ but only its second order statistics. Furthermore, we will see that, in this case, we can develop real-time or on-line algorithms with lower complexity and lower cost, if we assume the data to be stationary.
9.1 Introduction
Assume that we want to obtain an estimate $\hat{X}_t$ of a quantity $X_t$ which is a linear (or more precisely an affine) function of the data $Y_{a:b} = \{Y_a, \ldots, Y_b\}$, i.e.
$$\hat{X}_t = \sum_{n=a}^{b} h_{t,n}\, Y_n + c_t \qquad (9.1)$$
where, in general, $a$ can be either $-\infty$ or finite and $b$ can also be either finite or $\infty$. When $a$ and $b$ are finite, the meaning of the summation is clear. For the cases where $a = -\infty$ or $b = \infty$, these and all the following summations have to be understood in the MMSE sense; for example, for the case $a = -\infty$,
$$\lim_{m \to -\infty} E\left[\left(\sum_{n=m}^{b} h_{t,n}\, Y_n + c_t - \hat{X}_t\right)^2\right] = 0 \qquad (9.2)$$
In these cases we also need to assume that
$$E\left[X_n^2\right] < \infty \quad \text{and} \quad E\left[Y_n^2\right] < \infty.$$
The following propositions summarize all we need to develop linear estimation theory.
Proposition 1 Assume $\hat{X}_t \in \mathcal{H}^b_a$, where $\mathcal{H}^b_a$ is the Hilbert space generated by the affine transforms (9.1). Then $E[\hat{X}_t^2] < \infty$, and if $Z$ is a random variable satisfying $E[Z^2] < \infty$, then
$$E\left[Z\, \hat{X}_t\right] = \sum_{n=a}^{b} h_{t,n}\, E[Z\, Y_n] + c_t\, E[Z]$$
Proposition 2 (Orthogonality principle) $\hat{X}_t \in \mathcal{H}^b_a$ solves
$$\min_{\hat{X}_t \in \mathcal{H}^b_a} E\left[(\hat{X}_t - X_t)^2\right] \qquad (9.3)$$
if and only if
$$E\left[(\hat{X}_t - X_t)\, Z\right] = 0 \quad \forall Z \in \mathcal{H}^b_a. \qquad (9.4)$$
In other words, $\hat{X}_t$ is a MMSE linear estimate of $X_t$ given $Y_{a:b}$ if and only if the estimation error $(\hat{X}_t - X_t)$ is orthogonal to every linear function of the observations $Y_{a:b}$.
Considering the particular cases of Z = 1 and Z = Yl, a ≤ l ≤ b we can rewrite thisproposition in the following way
Proposition 3 Xt ∈ Hba solves (9.3) if and only if
E[Xt
]= E [Xt] (9.5)
andE[(Xt −Xt)Yl
]= 0 ∀a ≤ l ≤ b. (9.6)
Now, replacing (9.1) in (9.6), we obtain
\[ E\left[(X_t - \hat X_t)Y_l\right] = E\left[\left(X_t - \sum_{n=a}^{b} h_{t,n} Y_n - c_t\right)Y_l\right] = 0 \quad \forall a \le l \le b. \qquad (9.7) \]
To go further more easily and without any loss of generality, we assume $E[Y_l] = 0$, $\forall a \le l \le b$. Then, since $E[X_t] = c_t$, the previous equation becomes
\[ \mathrm{Cov}\{X_t, Y_l\} = \sum_{n=a}^{b} h_{t,n}\,\mathrm{Cov}\{Y_n, Y_l\} \quad \forall a \le l \le b, \qquad (9.8) \]
which is known as the Wiener-Hopf equation. Writing this in matrix form we have
\[ \sigma_{XY}(t) = \Sigma_Y\, h_t \qquad (9.9) \]
where
\[ \sigma_{XY}(t) \stackrel{\mathrm{def}}{=} \left[\mathrm{Cov}\{X_t,Y_a\},\dots,\mathrm{Cov}\{X_t,Y_b\}\right]^t, \qquad \Sigma_Y \stackrel{\mathrm{def}}{=} \left[\mathrm{Cov}\{Y_n,Y_l\}\right], \qquad h_t \stackrel{\mathrm{def}}{=} [h_{t,a},\dots,h_{t,b}]^t. \]
So, theoretically we have
\[ h_t = \Sigma_Y^{-1}\,\sigma_{XY}(t). \qquad (9.10) \]
The main difficulty, however, is the computation of $\Sigma_Y^{-1}$. Note that this matrix is symmetric and positive definite, so, theoretically, it is not singular. However, its inversion cost grows as the cube of the number of data. In the following we will see how the stationarity assumption helps to reduce this cost.
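As a minimal numerical sketch of (9.9)-(9.10), the Wiener-Hopf system can be solved directly with a general linear solver; the second-order statistics below are hypothetical, not from the text.

```python
import numpy as np

# Hypothetical second-order statistics for 4 zero-mean observations Y_a..Y_b
Sigma_Y = np.array([[1.0, 0.5, 0.2, 0.1],
                    [0.5, 1.0, 0.5, 0.2],
                    [0.2, 0.5, 1.0, 0.5],
                    [0.1, 0.2, 0.5, 1.0]])   # Cov{Y_n, Y_l}, symmetric p.d.
sigma_XY = np.array([0.4, 0.6, 0.8, 0.9])    # Cov{X_t, Y_n}

# Solve the Wiener-Hopf system Sigma_Y h_t = sigma_XY(t), eq. (9.10)
h_t = np.linalg.solve(Sigma_Y, sigma_XY)

# The estimate is then Xhat_t = sum_n h_{t,n} Y_n (c_t = 0 for zero-mean data)
print(h_t)
```

For a general (non-structured) covariance matrix this solve costs $O(n^3)$, which motivates the stationarity assumption discussed next.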
9.2 One step prediction
Consider the case where $a=0$, $b=t$ and $X_t = Y_{t+1}$, and assume that $Y_l$ is wide sense stationary, i.e., $E[Y_l]=0$ and $\mathrm{Cov}\{Y_l,Y_m\} = C_Y(l-m)$. Then we have
\[ \mathrm{Cov}\{X_t,Y_l\} = \mathrm{Cov}\{Y_{t+1},Y_l\} = C_Y(t+1-l) \qquad (9.11) \]
and the Wiener-Hopf equation becomes
\[
\begin{bmatrix} C_Y(t+1) \\ C_Y(t) \\ \vdots \\ C_Y(1) \end{bmatrix}
=
\begin{bmatrix}
C_Y(0) & C_Y(1) & \cdots & C_Y(t) \\
C_Y(1) & \ddots & & \vdots \\
\vdots & & \ddots & C_Y(1) \\
C_Y(t) & \cdots & C_Y(1) & C_Y(0)
\end{bmatrix}
\begin{bmatrix} h_{t,0} \\ h_{t,1} \\ \vdots \\ h_{t,t} \end{bmatrix}
\qquad (9.12)
\]
called the Yule-Walker equation.

Note that the w.s.s. hypothesis on the data $Y_n$ leads to a covariance matrix which is Toeplitz. Unlike the general case, the cost of the inversion of this matrix is only $O(n^2)$, against $O(n^3)$ for the general case, where $n$ is the number of data. Thus, in any linear MMSE estimation problem, the w.s.s. assumption can reduce the complexity of the computation of the coefficients by a factor equal to the number of data.
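This Toeplitz structure can be exploited directly in software. Assuming SciPy is available, `scipy.linalg.solve_toeplitz` solves systems like (9.12) with a Levinson-type $O(n^2)$ recursion; the autocovariance below is a hypothetical AR(1)-type example, for which the one-step predictor should reduce to $\hat Y_{t+1} = 0.8\, Y_t$.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

# Hypothetical autocovariance of a w.s.s. sequence: C_Y(k) = 0.8**|k|
t = 5                                   # predict Y_{t+1} from Y_0..Y_t
C = 0.8 ** np.arange(t + 2)             # C_Y(0), ..., C_Y(t+1)

# Yule-Walker system (9.12): Toeplitz(C_Y(0..t)) h = [C_Y(t+1),...,C_Y(1)]^t
rhs = C[t + 1:0:-1]                     # C_Y(t+1), C_Y(t), ..., C_Y(1)
h = solve_toeplitz(C[:t + 1], rhs)      # O(n^2) Levinson-type solver

print(h)                                # ~ [0, 0, 0, 0, 0, 0.8]
```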
In the following, we will see that we can go still further and use the specific structure of the Yule-Walker equation to reduce this cost even more.
9.3 Levinson algorithm
The Levinson algorithm uses the special structure of the Yule-Walker equation for the one-step prediction problem, where the left-hand-side vector of this equation is equal to the last column of the covariance matrix shifted by one time unit.
Rewriting this equation as
\[ \hat Y_{t+1} = \sum_{n=0}^{t} h_{t,n} Y_n = -\sum_{n=0}^{t} a_{t+1,t+1-n} Y_n, \qquad (9.13) \]
the coefficients $a_{t,1},\dots,a_{t,t}$ can be updated recursively in $t$ through the Levinson algorithm:
\[ a_{t+1,k} = a_{t,k} - k_t\, a_{t,t+1-k}, \quad k = 1,\dots,t \]
\[ a_{t+1,t+1} = -k_t \]
where $k_t$ itself is generated recursively, with $\epsilon_t \stackrel{\mathrm{def}}{=} E[(Y_t - \hat Y_t)^2]$, via
\[ k_t = \frac{1}{\epsilon_t}\left[ C_Y(t+1) + \sum_{k=1}^{t} a_{t,k}\, C_Y(t+1-k) \right] \]
\[ \epsilon_{t+1} = (1 - k_t^2)\,\epsilon_t \]
with the initialization $k_0 = C_Y(1)/C_Y(0)$ and $\epsilon_0 = C_Y(0)$.
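The recursion above can be transcribed directly. The sketch below follows the text's sign conventions and checks the result on a hypothetical AR(1) autocovariance $C_Y(k) = r^{|k|}$, for which the predictor must reduce to $a_{t,1} = -r$ with all higher coefficients zero.

```python
import numpy as np

def levinson(C, order):
    """Levinson recursion for the one-step predictor coefficients a_{t,1..t}.

    C[k] = C_Y(k) for k = 0..order.  Returns (a, eps) where the predictor is
    Yhat_t = -sum_k a[k-1] * Y_{t-k} and eps is the prediction MSE.
    """
    a = np.zeros(order)
    k0 = C[1] / C[0]                 # initialization k_0 = C_Y(1)/C_Y(0)
    a[0] = -k0
    eps = (1.0 - k0 ** 2) * C[0]     # eps_1 = (1 - k_0^2) eps_0
    for t in range(1, order):
        # reflection coefficient k_t = [C_Y(t+1) + sum a_{t,k} C_Y(t+1-k)]/eps_t
        kt = (C[t + 1] + np.dot(a[:t], C[t:0:-1])) / eps
        a_new = a[:t] - kt * a[t - 1::-1]   # a_{t+1,k} = a_{t,k} - k_t a_{t,t+1-k}
        a[:t] = a_new
        a[t] = -kt                          # a_{t+1,t+1} = -k_t
        eps *= (1.0 - kt ** 2)
    return a, eps

# Hypothetical AR(1) check: C_Y(k) = 0.8**|k| gives a = [-0.8, 0, ..., 0]
C = 0.8 ** np.arange(6)
a, eps = levinson(C, 5)
print(a, eps)
```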
The coefficients $k_t$ are called reflection coefficients, or partial correlation (PARCOR) coefficients.
9.4 Vector observation case
Linear estimation can be extended to the case where both the observation sequence and the quantity to be estimated are vectors. This extension is straightforward and we have:
\[ \hat{\mathbf X}_t = \sum_{n=a}^{b} \mathbf H_{t,n}\,\mathbf Y_n + \mathbf c_t \qquad (9.14) \]
where $\mathbf H_{t,n}$ is a sequence of matrices. When $a$ or $b$ are infinite, the summations are understood in the MSE sense. For example, when $a=-\infty$, we have
\[ \lim_{m\to-\infty} E\left[ \left\| \sum_{n=m}^{b} \mathbf H_{t,n}\mathbf Y_n + \mathbf c_t - \mathbf X_t \right\|^2 \right] = 0 \qquad (9.15) \]
where $\|\mathbf x\|^2 \stackrel{\mathrm{def}}{=} \mathbf x^t \mathbf x$.
The orthogonality principle becomes:

Proposition 4 (Orthogonality principle) $\hat{\mathbf X}_t \in H_a^b$ solves
\[ \min_{\hat{\mathbf X}_t \in H_a^b} E\left[ \|\mathbf X_t - \hat{\mathbf X}_t\|^2 \right] \qquad (9.16) \]
if and only if
\[ E\left[ (\mathbf X_t - \hat{\mathbf X}_t)^t Z \right] = 0 \quad \forall Z \in H_a^b. \qquad (9.17) \]
Writing this last equation for $Z = 1$ and for $Z = \mathbf Y_l^t$ we obtain
\[ E[\mathbf X_t] = E[\hat{\mathbf X}_t], \qquad E\left[(\mathbf X_t - \hat{\mathbf X}_t)\mathbf Y_l^t\right] = [0], \quad \forall a \le l \le b. \]
Using these relations we obtain
\[ E\left[(\mathbf X_t - \hat{\mathbf X}_t)\mathbf Y_l^t\right] = E\left[ \left( \mathbf X_t - \sum_{n=a}^{b} \mathbf H_{t,n}\mathbf Y_n - \mathbf c_t \right) \mathbf Y_l^t \right] = [0], \quad \forall a \le l \le b, \qquad (9.18) \]
where $[0]$ means a matrix all of whose elements are equal to zero. The Wiener-Hopf equation becomes:
\[ \mathbf C_{XY}(t,l) = \sum_{n=a}^{b} \mathbf H_{t,n}\,\mathbf C_Y(n,l), \quad \forall a \le l \le b, \]
where $\mathbf C_{XY}(t,l) \stackrel{\mathrm{def}}{=} \mathrm{Cov}\{\mathbf X_t, \mathbf Y_l\}$ is the cross-covariance of $\{\mathbf X_n\}_{n=-\infty}^{\infty}$ and $\{\mathbf Y_l\}_{l=-\infty}^{\infty}$, and $\mathbf C_Y(n,l) \stackrel{\mathrm{def}}{=} \mathrm{Cov}\{\mathbf Y_n, \mathbf Y_l\}$ is the auto-covariance of $\{\mathbf Y_n\}_{n=-\infty}^{\infty}$. Note that $\mathbf C_{XY}(t,l)$ and $\mathbf C_Y(n,l)$ are $(m \times k)$ and $(k \times k)$ matrices respectively, where $k$ and $m$ are respectively the dimensions of the vectors $\mathbf Y_n$ and $\mathbf X_t$.
9.5 Wiener-Kolmogorov filtering
We assume here that $Y_n$ is wide sense stationary (w.s.s.) and that there is an infinite number of observations.

Two cases are of interest: the non-causal case, where $(a = -\infty,\, b = \infty)$, and the causal case, where $(a = -\infty,\, b = t)$.
9.5.1 Non causal Wiener-Kolmogorov
Without losing any generality, we assume $E[Y_n] = E[X_n] = 0$. Then we have
\[ \hat X_t = \sum_{n=-\infty}^{\infty} h_{t,n} Y_n. \qquad (9.19) \]
The Wiener-Hopf equation becomes
\[ C_{XY}(t-l) = \sum_{n=-\infty}^{\infty} h_{t,n}\, C_Y(n-l). \qquad (9.20) \]
Changing variable $\tau = t-l$ we obtain
\[ C_{XY}(\tau) = \sum_{n=-\infty}^{\infty} h_{t,n}\, C_Y(n+\tau-t). \qquad (9.21) \]
Changing now the summation variable $n = t-\alpha$ we obtain
\[ C_{XY}(\tau) = \sum_{\alpha=-\infty}^{\infty} h_{t,t-\alpha}\, C_Y(\tau-\alpha). \qquad (9.22) \]
In this summation $t$ appears only in $h_{t,t-\alpha}$. This means that we can choose it to be independent of $t$; i.e., if this equation has a solution, it can be chosen to be time-invariant, with coefficients $h_{t,t-\alpha} = h_{0,-\alpha} \stackrel{\mathrm{def}}{=} h_\alpha$. Then we have
\[ C_{XY}(\tau) = \sum_{\alpha=-\infty}^{\infty} h_\alpha\, C_Y(\tau-\alpha), \qquad (9.23) \]
which is a convolution equation. Using then the following discrete-time Fourier transforms
\[ H(\omega) \stackrel{\mathrm{def}}{=} \sum_{n=-\infty}^{\infty} h_n\, e^{-j\omega n}, \quad -\pi < \omega < \pi \]
\[ S_{XY}(\omega) \stackrel{\mathrm{def}}{=} \sum_{n=-\infty}^{\infty} C_{XY}(n)\, e^{-j\omega n}, \quad -\pi < \omega < \pi \]
\[ S_Y(\omega) \stackrel{\mathrm{def}}{=} \sum_{n=-\infty}^{\infty} C_Y(n)\, e^{-j\omega n}, \quad -\pi < \omega < \pi \]
we have
\[ S_{XY}(\omega) = H(\omega)\, S_Y(\omega) \]
and finally
\[ H(\omega) = \frac{S_{XY}(\omega)}{S_Y(\omega)}. \qquad (9.24) \]
The coefficients $h_n$ can then be obtained by inverse Fourier transform
\[ h_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{S_{XY}(\omega)}{S_Y(\omega)}\, e^{j\omega n}\, d\omega, \quad n \in \mathbb Z. \qquad (9.25) \]
It is interesting to see that we have
\[ E[X_t \hat X_t] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{|S_{XY}(\omega)|^2}{S_Y(\omega)}\, d\omega \]
\[ E[X_t^2] = C_X(0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_X(\omega)\, d\omega \]
\[ E\left[(X_t - \hat X_t)^2\right] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left[ S_X(\omega) - \frac{|S_{XY}(\omega)|^2}{S_Y(\omega)} \right] d\omega. \]
This last equation can be written
\[ \mathrm{MMSE} = E\left[(X_t - \hat X_t)^2\right] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left[ 1 - \frac{|S_{XY}(\omega)|^2}{S_X(\omega)S_Y(\omega)} \right] S_X(\omega)\, d\omega. \qquad (9.26) \]
Noting that $|S_{XY}(\omega)|^2 \le S_X(\omega)S_Y(\omega)$, with equality if $\{X_t\}_{t=-\infty}^{\infty}$ and $\{Y_n\}_{n=-\infty}^{\infty}$ are perfectly correlated, we can conclude that the MMSE ranges from $E[X_t^2]$ to zero as the relationship between $\{X_t\}_{t=-\infty}^{\infty}$ and $\{Y_n\}_{n=-\infty}^{\infty}$ ranges from independence to perfect correlation.
Example 1 (Noise filtering) Consider the model
\[ Y_n = S_n + N_n, \quad n \in \mathbb Z, \]
where $S_n$ and $N_n$ are assumed uncorrelated, zero mean and w.s.s. Suppose that we want to estimate $X_t = S_{t+\lambda}$ for some integer $\lambda$. The problem represents filtering, prediction and smoothing respectively when $\lambda = 0$, $\lambda > 0$ and $\lambda < 0$. To obtain the necessary equations, it is straightforward to show
\[ S_Y(\omega) = S_X(\omega) + S_N(\omega), \qquad S_{XY}(\omega) = e^{j\omega\lambda}\, S_X(\omega), \qquad S_X(\omega) = S_S(\omega). \]
So, the transfer function of the optimum non-causal filter is
\[ H(\omega) = \frac{e^{j\omega\lambda}\, S_S(\omega)}{S_S(\omega) + S_N(\omega)}. \]
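Numerically, (9.25) can be approximated on a frequency grid by an inverse FFT. A minimal sketch for the noise-filtering case $\lambda = 0$, with a hypothetical AR(1)-type signal spectrum and white noise (the parameter values are assumptions, not from the text):

```python
import numpy as np

N = 1024                                   # frequency grid size
w = 2 * np.pi * np.fft.fftfreq(N)          # grid of frequencies in (-pi, pi)

r, sig2, sigN2 = 0.8, 1.0, 0.5             # hypothetical model parameters
S_S = sig2 * (1 - r**2) / (1 - 2*r*np.cos(w) + r**2)   # signal spectrum
S_N = sigN2 * np.ones(N)                   # white-noise spectrum

H = S_S / (S_S + S_N)                      # optimum non-causal filter, lambda = 0
h = np.real(np.fft.ifft(H))                # h_n ~ (1/2pi) int H(w) e^{jwn} dw

# The gain lies in (0, 1): frequencies where noise dominates are attenuated
print(h[:3])                               # h_0, h_1, h_2
```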
Example 2 (Deconvolution) Consider the model
\[ Y_n = \sum_{k=0}^{p} h_k S_{n-k} + N_n, \quad n \in \mathbb Z, \]
where $S_n$ and $N_n$ are assumed uncorrelated, zero mean and w.s.s. Suppose that we want to estimate $X_t = S_{t+\lambda}$ for some integer $\lambda$. Again, it is straightforward to show
\[ S_Y(\omega) = |H(\omega)|^2 S_X(\omega) + S_N(\omega), \qquad S_{XY}(\omega) = e^{j\omega\lambda}\, H^*(\omega)\, S_X(\omega), \qquad S_X(\omega) = S_S(\omega), \]
where the channel transfer function is
\[ H(\omega) = \sum_{n=0}^{p} h_n\, e^{-jn\omega}, \quad -\pi < \omega < \pi. \]
So, the transfer function $G(\omega)$ of the optimum non-causal filter is
\[ G(\omega) = \frac{e^{j\omega\lambda}\, H^*(\omega)\, S_S(\omega)}{|H(\omega)|^2 S_S(\omega) + S_N(\omega)}. \]
For $\lambda = 0$ we have
\[ G(\omega) = \frac{H^*(\omega)\, S_S(\omega)}{|H(\omega)|^2 S_S(\omega) + S_N(\omega)} = \frac{1}{H(\omega)} \cdot \frac{|H(\omega)|^2}{|H(\omega)|^2 + \dfrac{S_N(\omega)}{S_S(\omega)}}. \]
9.6 Causal Wiener-Kolmogorov
To develop causal Wiener-Kolmogorov filtering, let us denote by $\tilde X_t$ the non-causal Wiener-Kolmogorov solution and by $\hat X_t$ the causal one, i.e.
\[ \tilde X_t = \sum_{n=-\infty}^{\infty} h_{t-n} Y_n, \qquad \hat X_t = \sum_{n=-\infty}^{t} g_{t-n} Y_n, \]
where $g_n$ denotes the coefficients of the causal filter.
Note also that $H_{-\infty}^{t}$ is a subset of $H_{-\infty}^{\infty}$. So, if the solution to the non-causal Wiener-Kolmogorov problem happens to be causal, it also solves the causal Wiener-Kolmogorov problem. Unfortunately, this happens only in very special cases. However, there is a relation between these two solutions.

To obtain this relation we start by writing
\[ (X_t - \hat X_t) = (X_t - \tilde X_t) + (\tilde X_t - \hat X_t). \]
So, for any $Z \in H_{-\infty}^{t}$ we have
\[ E\left[(X_t - \hat X_t)Z\right] = E\left[(X_t - \tilde X_t)Z\right] + E\left[(\tilde X_t - \hat X_t)Z\right]. \]
The first right-hand term $E[(X_t - \tilde X_t)Z]$ is zero due to the orthogonality principle applied to $\tilde X_t$ (since $Z \in H_{-\infty}^{t} \subset H_{-\infty}^{\infty}$). The second right-hand term $E[(\tilde X_t - \hat X_t)Z]$ is zero due to the orthogonality principle applied to $\hat X_t$, taken as the projection of $\tilde X_t$ onto $H_{-\infty}^{t}$. So we have
\[ E\left[(X_t - \hat X_t)Z\right] = 0, \quad \forall Z \in H_{-\infty}^{t}, \]
which means that $\hat X_t$ is the MMSE estimate of $X_t$ among all estimates in $H_{-\infty}^{t}$. In other words, $\hat X_t$, which is the projection of $X_t$ onto $H_{-\infty}^{t}$, can be obtained by first projecting $X_t$ onto $H_{-\infty}^{\infty}$ to get $\tilde X_t$, and then projecting $\tilde X_t$ onto $H_{-\infty}^{t}$.
Now, let us define the simple truncation
\[ \check X_t \stackrel{\mathrm{def}}{=} \sum_{n=-\infty}^{t} h_{t-n} Y_n \]
and consider the error
\[ \tilde X_t - \check X_t = \sum_{n=-\infty}^{\infty} h_{t-n} Y_n - \sum_{n=-\infty}^{t} h_{t-n} Y_n = \sum_{n=t+1}^{\infty} h_{t-n} Y_n. \]
If this error were orthogonal to $Y_m$ for all $m \le t$, we could consider $\check X_t$ as the projection of $\tilde X_t$ onto $H_{-\infty}^{t}$. But this is not the case in general. It would be the case if $\{Y_n\}_{n=-\infty}^{\infty}$ were a sequence of uncorrelated random variables, because then we would have
\[ E\left[(\tilde X_t - \check X_t)Y_m\right] = E\left[ \left( \sum_{n=t+1}^{\infty} h_{t-n} Y_n \right) Y_m \right] = \sum_{n=t+1}^{\infty} h_{t-n}\, E[Y_n Y_m] = \sigma^2 \sum_{n=t+1}^{\infty} h_{t-n}\,\delta_{n,m} = 0, \quad \forall\, m \le t, \]
where $\delta_{n,m}$ is the Kronecker delta ($\delta_{n,m} = 1$ if $n = m$ and $\delta_{n,m} = 0$ if $n \ne m$) and $\sigma^2 = E[Y_n^2]$.
From the above discussion we see that, if we first transform the data $\{Y_n\}_{n=-\infty}^{\infty}$ into an equivalent w.s.s. and uncorrelated sequence $\{Z_n\}_{n=-\infty}^{\infty}$, then the causal Wiener-Kolmogorov filter coefficients can be obtained by simple truncation of the non-causal Wiener-Kolmogorov filter, i.e.
\[ \hat X_t = \sum_{n=-\infty}^{t} h_{t-n} Z_n, \]
where
\[ h_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{XZ}(\omega)\, e^{jn\omega}\, d\omega = C_{XZ}(n), \quad n \ge 0, \]
and $S_{XZ}(\omega)$ and $C_{XZ}(n)$ are, respectively, the cross-spectrum and the cross-covariance of the sequences $\{X_n\}_{n=-\infty}^{\infty}$ and $\{Z_n\}_{n=-\infty}^{\infty}$ (note that $S_Z(\omega) = 1$, so the non-causal filter for estimating $X_t$ from $Z_n$ is just $S_{XZ}(\omega)$).
The aim now is to obtain $\{Z_n\}_{n=-\infty}^{\infty}$ from $\{Y_n\}_{n=-\infty}^{\infty}$. The analysis of the one-step prediction problem in the previous section gives us the solution. If we denote by $\hat Y_{t|t-1}$ the one-step prediction of $Y_t$ from the data $\{Y_n\}_{n=-\infty}^{t-1}$, and by $\sigma_t^2 = E[(Y_t - \hat Y_{t|t-1})^2]$, then, defining
\[ Z_n = \frac{Y_n - \hat Y_{n|n-1}}{\sigma_n}, \quad n \in \mathbb Z, \]
we can verify that $E[Z_n] = 0$, $E[Z_n^2] = 1$ and $\mathrm{Cov}\{Z_n, Z_m\} = 0$ for $n \ne m$. So $Z_n$ has all the properties that we need. It remains to show that $\{Z_n\}_{n=-\infty}^{\infty}$ is equivalent to $\{Y_n\}_{n=-\infty}^{\infty}$ for the purpose of MMSE estimation. To do this we need the result of the following theorem.
Theorem 1 (Spectral Factorization) Assume $Y_n$ has a power spectral density satisfying the Paley-Wiener condition
\[ \frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S_Y(\omega)\, d\omega > -\infty. \]
Then
\[ S_Y(\omega) = S_Y^-(\omega)\, S_Y^+(\omega), \quad -\pi < \omega < \pi, \qquad (9.27) \]
where
\[ S_Y(\omega) = |S_Y^-(\omega)|^2 = |S_Y^+(\omega)|^2 \qquad (9.28) \]
and
\[ \frac{1}{2\pi}\int_{-\pi}^{\pi} S_Y^+(\omega)\, e^{jn\omega}\, d\omega = 0, \quad n < 0, \qquad \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{S_Y^+(\omega)}\, e^{jn\omega}\, d\omega = 0, \quad n < 0, \]
\[ \frac{1}{2\pi}\int_{-\pi}^{\pi} S_Y^-(\omega)\, e^{jn\omega}\, d\omega = 0, \quad n > 0, \qquad \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{S_Y^-(\omega)}\, e^{jn\omega}\, d\omega = 0, \quad n > 0. \]
That is, $S_Y^+$ and $1/S_Y^+$ are causal, while $S_Y^-$ and $1/S_Y^-$ are anticausal.
Now consider the time-invariant filter with transfer function $H(\omega) = 1/S_Y^+(\omega)$. This filter, by the theorem above, is causal. The output of this filter, with the wide sense stationary input sequence $Y_n$, will be another wide sense stationary sequence $Z_n$ with
\[ S_Z(\omega) = \left| \frac{1}{S_Y^+(\omega)} \right|^2 S_Y(\omega) = 1, \quad -\pi < \omega < \pi. \qquad (9.29) \]
Since $S_Z(\omega) = 1$ corresponds to a white sequence, the filter is called a whitening filter. Note also that the input sequence $Y_n$ can be recovered causally from the output $Z_n$ by
\[ Y_t = \sum_{n=-\infty}^{t} f_{t-n} Z_n, \quad t \in \mathbb Z, \qquad (9.30) \]
where
\[ f_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_Y^+(\omega)\, e^{jn\omega}\, d\omega, \quad n \ge 0. \]
Now consider the $\lambda$-step prediction of $Y_n$ from $Z_n$:
\[ Y_{t+\lambda} = \sum_{n=-\infty}^{t+\lambda} f_{t+\lambda-n} Z_n = \sum_{n=-\infty}^{t} f_{t+\lambda-n} Z_n + \sum_{n=t+1}^{t+\lambda} f_{t+\lambda-n} Z_n. \]
But $Z_n$ is white, so $Z_{t+1},\dots,Z_{t+\lambda}$ are orthogonal to $\{Z_n\}_{n=-\infty}^{t}$. So the best estimate $\hat Y_{t+\lambda}$ of $Y_{t+\lambda}$ from $\{Y_n\}_{n=-\infty}^{t}$ is
\[ \hat Y_{t+\lambda} = \sum_{n=-\infty}^{t} f_{t+\lambda-n} Z_n. \]
Combining the whitening filter and this prediction we obtain the cascade
\[ \cdots, Y_{t-2}, Y_{t-1}, Y_t \;\longrightarrow\; \frac{1}{S_Y^+(\omega)} \;\longrightarrow\; Z_n \;\longrightarrow\; \left[ e^{j\omega\lambda}\, S_Y^+(\omega) \right]_+ \;\longrightarrow\; \hat Y_{t+\lambda} \]
where the operator $[H(\omega)]_+$ is defined as
\[ [H(\omega)]_+ = \sum_{n=0}^{\infty} h_n\, e^{-j\omega n}, \quad \text{where}\quad h_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} H(\omega)\, e^{jn\omega}\, d\omega. \]
We have thus obtained a procedure for causally prewhitening a stationary sequence of observations, and the solution of the general causal Wiener-Kolmogorov problem follows immediately. Assuming that $\{Y_n\}_{n=-\infty}^{\infty}$ satisfies the Paley-Wiener condition, it can be transformed equivalently into $\{Z_n\}_{n=-\infty}^{t}$ by passing it through the causal filter $1/S_Y^+(\omega)$:
\[ Y_n \;\longrightarrow\; \frac{1}{S_Y^+(\omega)} \;\longrightarrow\; Z_n \]
Then we need to find the cross-spectrum $S_{XZ}(\omega)$ and pass $\{Z_n\}_{n=-\infty}^{t}$ through the causal filter $[S_{XZ}(\omega)]_+$ to obtain the required result:
\[ Z_n \;\longrightarrow\; [S_{XZ}(\omega)]_+ \;\longrightarrow\; \hat X_t \]
Knowing that
\[ S_{XZ}(\omega) = \frac{S_{XY}(\omega)}{[S_Y^+(\omega)]^*} = \frac{S_{XY}(\omega)}{S_Y^-(\omega)}, \]
we obtain
\[ Y_n \;\longrightarrow\; \frac{1}{S_Y^+(\omega)} \;\longrightarrow\; Z_n \;\longrightarrow\; \left[ \frac{S_{XY}(\omega)}{S_Y^-(\omega)} \right]_+ \;\longrightarrow\; \hat X_t \]
It is interesting to compare this filter with the non-causal Wiener-Kolmogorov filter
\[ \frac{S_{XY}(\omega)}{S_Y(\omega)} = \frac{1}{S_Y^+(\omega)} \left[ \frac{S_{XY}(\omega)}{S_Y^-(\omega)} \right]. \]
The following diagrams summarize this comparison:
\[ \text{Causal:}\quad Y_n \;\longrightarrow\; \frac{1}{S_Y^+(\omega)} \;\longrightarrow\; \left[ \frac{S_{XY}(\omega)}{S_Y^-(\omega)} \right]_+ \;\longrightarrow\; \hat X_t \]
\[ \text{Non-causal:}\quad Y_n \;\longrightarrow\; \frac{1}{S_Y^+(\omega)} \;\longrightarrow\; \frac{S_{XY}(\omega)}{S_Y^-(\omega)} \;\longrightarrow\; \tilde X_t \]
9.7 Rational spectra
Definition 1 (Rational spectra) $S_Y(\omega)$ is said to be rational if it can be written as the ratio of two real trigonometric polynomials
\[ S_Y(\omega) = \frac{n_0 + 2\sum_{k=1}^{p} n_k \cos k\omega}{d_0 + 2\sum_{k=1}^{m} d_k \cos k\omega}, \qquad n_k, d_k \in \mathbb R. \qquad (9.31) \]
Using the relation $2\cos k\omega = e^{jk\omega} + e^{-jk\omega}$ we deduce
\[ S_Y(\omega) = \frac{N(e^{j\omega})}{D(e^{j\omega})}. \]
Setting $z = e^{j\omega}$ we have
\[ N(z) = \sum_{k=-p}^{p} n_{|k|}\, z^{-k}, \qquad D(z) = \sum_{k=-m}^{m} d_{|k|}\, z^{-k}. \]
$z^p N(z)$ is a $(2p)$-th order polynomial, so we can write
\[ N(z) = n_p\, z^{-p} \prod_{k=1}^{2p} (z - z_k). \]
Since $N(z) = N(1/z)$, the roots $z_k$ come in reciprocal pairs. If we order them in such a way that $|z_1| \ge |z_2| \ge \cdots \ge |z_{2p}|$, we have
\[ |z_{2p}| = \frac{1}{|z_1|}, \quad |z_{2p-1}| = \frac{1}{|z_2|}, \quad \cdots, \quad |z_{p+1}| = \frac{1}{|z_p|}. \]
We deduce that the roots $z_1,\dots,z_p$ are outside or on the unit circle. Due to the reciprocity of the roots we can write
\[ N(z) = B(z)\, B(1/z) \qquad (9.32) \]
where
\[ B(z) = \sqrt{(-1)^p\, n_p \Big/ \prod_{k=1}^{p} z_k}\;\; \prod_{k=1}^{p} \left( z^{-1} - z_k \right). \qquad (9.33) \]
So $B(z)$ is a polynomial of degree $p$ in $z^{-1}$ and can be expanded as
\[ B(z) = \sum_{k=0}^{p} b_k\, z^{-k}. \qquad (9.34) \]
Similarly, we can do exactly the same analysis for the denominator $D(z)$:
\[ D(z) = A(z)\, A(1/z) \qquad (9.35) \]
where
\[ A(z) = \sum_{k=0}^{m} a_k\, z^{-k}. \qquad (9.36) \]
Putting these together we have
\[ S_Y(\omega) = \frac{B(e^{j\omega})\, B(e^{-j\omega})}{A(e^{j\omega})\, A(e^{-j\omega})} = \frac{B(e^{j\omega})}{A(e^{j\omega})} \cdot \frac{B(e^{-j\omega})}{A(e^{-j\omega})}. \]
Assuming that none of the roots of $B$ or $A$ is on the unit circle $|z| = 1$, we have
\[ S_Y^+(\omega) = \frac{B(e^{j\omega})}{A(e^{j\omega})}, \qquad S_Y^-(\omega) = \frac{B(e^{-j\omega})}{A(e^{-j\omega})}. \]
Now consider the whitening filter of the last section and assume that the power spectrum of the data $S_Y(\omega)$ is a rational fraction:
\[ Y_n \;\longrightarrow\; \frac{1}{S_Y^+(\omega)} = \frac{A(z)}{B(z)} = \frac{\sum_{k=0}^{m} a_k z^{-k}}{\sum_{k=0}^{p} b_k z^{-k}} \;\longrightarrow\; Z_n \]
Then we can easily see that we have
\[ \sum_{k=0}^{m} a_k\, Y_{n-k} = \sum_{k=0}^{p} b_k\, Z_{n-k}. \qquad (9.37) \]
We can rewrite this equation in two other equivalent forms:
\[ b_0 Z_n = -\sum_{k=1}^{p} b_k\, Z_{n-k} + \sum_{k=0}^{m} a_k\, Y_{n-k} \]
\[ a_0 Y_n = -\sum_{k=1}^{m} a_k\, Y_{n-k} + \sum_{k=0}^{p} b_k\, Z_{n-k}. \]
A sequence $Y_n$ satisfying (9.37) with a white sequence $Z_n$ is called an autoregressive moving-average (ARMA) sequence of order $(m,p)$. For $p = 0$ we have an autoregressive (AR) sequence, and for $m = 0$ we have a moving-average (MA) sequence.
Example 3 (Wide-sense Markov sequences) A simple and useful model for the correlation structure of a stationary random sequence is the so-called wide-sense Markov model:
\[ C_Y(n) = \sigma^2 r^{|n|}, \quad n \in \mathbb Z, \qquad (9.38) \]
where $|r| < 1$. The power spectrum of such a sequence is
\[ S_Y(\omega) = \frac{\sigma^2(1-r^2)}{1 - 2r\cos\omega + r^2}, \]
which is a rational fraction, and we can easily see that we can write it as
\[ S_Y(\omega) = \frac{\sigma^2(1-r^2)}{(1 - r e^{-j\omega})(1 - r e^{+j\omega})} = \frac{1}{A(e^{-j\omega})\, A(e^{+j\omega})} = \frac{1}{A(z)\, A(z^{-1})} \]
where
\[ A(z) = a_0 + a_1 z^{-1}, \]
with $a_0 = 1/\sqrt{\sigma^2(1-r^2)}$ and $a_1 = -r\, a_0$.

We conclude that a wide-sense Markov sequence with the covariance structure (9.38) is an AR(1) sequence.
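The factorization can be checked numerically on a frequency grid; the sketch below uses hypothetical values $r = 0.8$, $\sigma^2 = 2$ and the normalization $a_0 = 1/\sqrt{\sigma^2(1-r^2)}$, which makes $|A(e^{j\omega})|^2 = 1/S_Y(\omega)$.

```python
import numpy as np

# Wide-sense Markov / AR(1) example: C_Y(n) = sigma2 * r**|n|
r, sigma2 = 0.8, 2.0
a0 = 1.0 / np.sqrt(sigma2 * (1 - r**2))    # A(z) = a0 + a1 z^{-1}
a1 = -r * a0

w = np.linspace(-np.pi, np.pi, 1001)
S_Y = sigma2 * (1 - r**2) / (1 - 2*r*np.cos(w) + r**2)   # rational spectrum
A = a0 + a1 * np.exp(-1j * w)                            # A(e^{jw})

# Spectral factorization check: S_Y(w) = 1 / |A(e^{jw})|^2
print(np.max(np.abs(S_Y - 1.0 / np.abs(A)**2)))
```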
Example 4 (Prediction of a wide-sense Markov sequence) Consider now the prediction problem where we wish to predict $Y_{n+\lambda}$ from the sequence $\{Y_k\}_{k=-\infty}^{n}$. Using the relations we obtained in the last section, this can be done through a causal filter:
\[ Y_n \;\longrightarrow\; H(\omega) = \frac{1}{S_Y^+(\omega)} \left[ \frac{S_{XY}(\omega)}{S_Y^-(\omega)} \right]_+ \;\longrightarrow\; \hat Y_{t+\lambda} \]
Here we have
\[ H(\omega) = A(e^{j\omega}) \left[ \frac{e^{j\omega\lambda}}{A(e^{j\omega})} \right]_+. \]
Using the geometric series relations
\[ \sum_{k=0}^{\infty} x^k = \frac{1}{1-x}, \qquad \sum_{k=1}^{\infty} x^k = \frac{x}{1-x}, \quad |x| < 1, \]
we easily obtain
\[ \frac{1}{A(z)} = \frac{1}{a_0} \sum_{n=0}^{\infty} r^n z^{-n} \]
and
\[ \left[ \frac{e^{j\omega\lambda}}{A(e^{j\omega})} \right]_+ = \left[ \frac{1}{a_0} \sum_{n=0}^{\infty} r^n e^{-j\omega(n-\lambda)} \right]_+ = \frac{1}{a_0} \sum_{n=\lambda}^{\infty} r^n e^{-j\omega(n-\lambda)} = \frac{1}{a_0} \sum_{l=0}^{\infty} r^{l+\lambda} e^{-jl\omega} = \frac{r^\lambda}{A(e^{j\omega})}. \]
Finally, we obtain
\[ H(\omega) = A(e^{j\omega})\, \frac{r^\lambda}{A(e^{j\omega})} = r^\lambda, \]
which is a pure gain:
\[ \hat Y_{t+\lambda} = r^\lambda\, Y_t. \]
It is also easy to show that in this case we have
\[ \mathrm{MSE} = E\left[(Y_{t+\lambda} - \hat Y_{t+\lambda})^2\right] = \sigma^2(1 - r^{2\lambda}), \]
which means that the prediction error increases monotonically from $\sigma^2(1-r^2)$ to $\sigma^2$ as $\lambda$ increases from $1$ to $\infty$.
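This pure-gain predictor and its error can be checked by simulation. A minimal Monte Carlo sketch, with hypothetical values $r = 0.8$, $\sigma^2 = 1$, $\lambda = 3$:

```python
import numpy as np

# Monte Carlo check of the WS-Markov predictor Yhat_{t+lam} = r**lam * Y_t
# and its MSE sigma2 * (1 - r**(2*lam)).
rng = np.random.default_rng(0)
r, sigma2, lam, T = 0.8, 1.0, 3, 100_000

# Simulate a stationary AR(1) with C_Y(n) = sigma2 * r**|n|
e = rng.normal(0.0, np.sqrt(sigma2 * (1 - r**2)), T)
Y = np.zeros(T)
Y[0] = rng.normal(0.0, np.sqrt(sigma2))
for t in range(1, T):
    Y[t] = r * Y[t - 1] + e[t]

err = Y[lam:] - r**lam * Y[:-lam]      # prediction error Y_{t+lam} - r^lam Y_t
print(np.mean(err**2), sigma2 * (1 - r**(2 * lam)))
```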
Appendix A
Annexes
A.1 Summary of Bayesian inference
Observation model:
\[ X_i \sim f(x_i|\theta), \qquad z = \{x_1,\dots,x_n\}, \quad z_n = \{x_1,\dots,x_n\}, \quad z_{n+1} = \{x_1,\dots,x_n,x_{n+1}\} \]

Likelihood and sufficient statistics:
\[ l(\theta|z) = f(x|\theta) = \prod_{i=1}^{n} f(x_i|\theta), \qquad l(\theta|z) = l(\theta|t(z)), \qquad p(t(z)|\theta) = \frac{l(\theta|t(z))}{\iint l(\theta|t(z))\, d\theta} \]

Inference with any prior law $\pi(\theta)$:
\[ f(x_i) = \int f(x_i|\theta)\,\pi(\theta)\, d\theta, \qquad f(x) = \iint f(x|\theta)\,\pi(\theta)\, d\theta, \qquad p(t(z)) = \iint p(t(z)|\theta)\,\pi(\theta)\, d\theta \]
\[ p(z,\theta) = p(z|\theta)\,\pi(\theta), \qquad p(z) = \iint p(z|\theta)\,\pi(\theta)\, d\theta \quad \text{(prior predictive)} \]
\[ \pi(\theta|z) = \frac{p(z|\theta)\,\pi(\theta)}{p(z)}, \qquad E[\theta|z] = \iint \theta\,\pi(\theta|z)\, d\theta \]
\[ f(x|z) = \frac{p(z,x)}{p(z)} = \frac{p(z_{n+1})}{p(z_n)} \quad \text{(posterior predictive)}, \qquad E[x|z] = \int x\, f(x|z)\, dx \]

Inference with conjugate priors:
\[ \pi(\theta|\tau_0) = \frac{p(t=\tau_0|\theta)}{\iint p(t=\tau_0|\theta)\, d\theta} \in \mathcal F_{\tau_0}(\theta), \qquad \pi(\theta|z,\tau) \in \mathcal F_{\tau}(\theta), \quad \text{with } \tau = g(\tau_0, n, z) \]
Inference with conjugate priors and generalized exponential family:
If
\[ f(x_i|\theta) = a(x_i)\, g(\theta)\, \exp\left[ \sum_{k=1}^{K} c_k\,\phi_k(\theta)\, h_k(x_i) \right] \]
then
\[ t_k(x) = \sum_{j=1}^{n} h_k(x_j), \quad k = 1,\dots,K \]
\[ \pi(\theta|\tau_0) = [g(\theta)]^{\tau_0}\, z(\tau)\, \exp\left[ \sum_{k=1}^{K} \tau_k\,\phi_k(\theta) \right] \]
\[ \pi(\theta|x,\tau) = [g(\theta)]^{n+\tau_0}\, a(x)\, Z(\tau)\, \exp\left[ \sum_{k=1}^{K} c_k\,\phi_k(\theta)\,(\tau_k + t_k(x)) \right]. \]
Inference with conjugate priors and natural exponential family:
If $f(x_i|\theta) = a(x_i)\exp[\theta x_i - b(\theta)]$, then
\[ t(x) = \sum_{i=1}^{n} x_i \]
\[ \pi(\theta|\tau_0) = c(\theta)\exp[\tau_0\theta - d(\tau_0)], \qquad \pi(\theta|x,\tau_0) = c(\theta)\exp[\tau_n\theta - d(\tau_n)] \quad \text{with } \tau_n = \tau_0 + \bar x, \]
where $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Inference with conjugate priors and natural exponential family, multivariable case:
If $f(x_i|\theta) = a(x_i)\exp[\theta^t x_i - b(\theta)]$, then
\[ t_k(x) = \sum_{i=1}^{n} x_{ki}, \quad k = 1,\dots,K \]
\[ \pi(\theta|\tau_0) = c(\theta)\exp[\tau_0^t\theta - d(\tau_0)], \qquad \pi(\theta|x,\tau_0) = c(\theta)\exp[\tau_n^t\theta - d(\tau_n)] \quad \text{with } \tau_n = \tau_0 + \bar x, \]
where $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Bernoulli model:
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \{0,1\}, \quad r = \sum x_i:\ \text{number of 1s}, \quad n-r:\ \text{number of 0s} \]
\[ f(x_i|\theta) = \mathrm{Ber}(x_i|\theta), \quad 0 < \theta < 1 \]

Likelihood and sufficient statistics:
\[ l(\theta|z) = \prod_{i=1}^{n} \mathrm{Ber}(x_i|\theta) = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i} = \theta^r(1-\theta)^{n-r} \]
\[ t(z) = r = \sum_{i=1}^{n} x_i, \qquad l(\theta|r) = \theta^r(1-\theta)^{n-r}, \qquad p(r|\theta) = \mathrm{Bin}(r|\theta, n) \]

Inference with conjugate priors:
\[ \pi(\theta) = \mathrm{Bet}(\theta|\alpha,\beta), \qquad f(x) = \mathrm{BinBet}(x|\alpha,\beta,1), \qquad p(r) = \mathrm{BinBet}(r|\alpha,\beta,n) \]
\[ \pi(\theta|z) = \mathrm{Bet}(\theta|\alpha+r,\ \beta+n-r), \qquad E[\theta|z] = \frac{\alpha+r}{\alpha+\beta+n} \]
\[ f(x|z) = \mathrm{BinBet}(x|\alpha+r,\ \beta+n-r,\ 1), \qquad E[x|z] = \frac{\alpha+r}{\alpha+\beta+n} \]

Inference with reference priors:
\[ \pi(\theta) = \mathrm{Bet}(\theta|\tfrac12,\tfrac12), \qquad \pi(x) = \mathrm{BinBet}(x|\tfrac12,\tfrac12,1), \qquad \pi(r) = \mathrm{BinBet}(r|\tfrac12,\tfrac12,n) \]
\[ \pi(\theta|z) = \mathrm{Bet}(\theta|\tfrac12+r,\ \tfrac12+n-r), \qquad \pi(x|z) = \mathrm{BinBet}(x|\tfrac12+r,\ \tfrac12+n-r,\ 1) \]
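The Bernoulli conjugate update is a one-line computation; a minimal sketch with the reference prior $\mathrm{Bet}(\tfrac12,\tfrac12)$ on hypothetical data ($r = 7$, $n = 10$):

```python
import numpy as np

def beta_bernoulli_update(x, alpha=0.5, beta=0.5):
    """Conjugate update for the Bernoulli model with a Bet(alpha, beta) prior.

    Returns the posterior parameters of Bet(alpha + r, beta + n - r) and the
    posterior mean E[theta|z] = (alpha + r) / (alpha + beta + n).
    """
    x = np.asarray(x)
    n, r = x.size, int(x.sum())          # sufficient statistic t(z) = r
    a_n, b_n = alpha + r, beta + n - r
    return a_n, b_n, a_n / (a_n + b_n)

a_n, b_n, mean = beta_bernoulli_update([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
print(a_n, b_n, mean)   # 7.5 3.5 0.6818...
```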
Binomial model:
\[ z = \{x_1,\dots,x_n\}, \quad x_i = 0,1,2,\dots,m, \qquad f(x_i|\theta,m) = \mathrm{Bin}(x_i|\theta,m), \quad 0<\theta<1, \ m = 0,1,2,\dots \]

Likelihood and sufficient statistics:
\[ l(\theta|z) = \prod_{i=1}^{n} \mathrm{Bin}(x_i|\theta,m), \qquad t(z) = r = \sum_{i=1}^{n} x_i, \qquad p(r|\theta) = \mathrm{Bin}(r|\theta, nm) \]

Inference with conjugate priors:
\[ \pi(\theta) = \mathrm{Bet}(\theta|\alpha,\beta), \qquad f(x) = \mathrm{BinBet}(x|\alpha,\beta,m), \qquad p(r) = \mathrm{BinBet}(r|\alpha,\beta,nm) \]
\[ \pi(\theta|z) = \mathrm{Bet}(\theta|\alpha+r,\ \beta+nm-r), \qquad E[\theta|z] = \frac{\alpha+r}{\alpha+\beta+nm} \]
\[ f(x|z) = \mathrm{BinBet}(x|\alpha+r,\ \beta+nm-r,\ m) \]

Inference with reference priors:
\[ \pi(\theta) = \mathrm{Bet}(\theta|\tfrac12,\tfrac12), \qquad \pi(x) = \mathrm{BinBet}(x|\tfrac12,\tfrac12,m), \qquad \pi(r) = \mathrm{BinBet}(r|\tfrac12,\tfrac12,nm) \]
\[ \pi(\theta|z) = \mathrm{Bet}(\theta|\tfrac12+r,\ \tfrac12+nm-r), \qquad \pi(x|z) = \mathrm{BinBet}(x|\tfrac12+r,\ \tfrac12+nm-r,\ m) \]
Poisson model:
\[ z = \{x_1,\dots,x_n\}, \quad x_i = 0,1,2,\dots, \qquad f(x_i|\lambda) = \mathrm{Pn}(x_i|\lambda), \quad \lambda \ge 0 \]

Likelihood and sufficient statistics:
\[ l(\lambda|z) = \prod_{i=1}^{n} \mathrm{Pn}(x_i|\lambda), \qquad t(z) = r = \sum_{i=1}^{n} x_i, \qquad p(r|\lambda) = \mathrm{Pn}(r|n\lambda) \]

Inference with conjugate priors:
\[ p(\lambda) = \mathrm{Gam}(\lambda|\alpha,\beta), \qquad f(x) = \mathrm{PnGam}(x|\alpha,\beta,1), \qquad p(r) = \mathrm{PnGam}(r|\alpha,\beta,n) \]
\[ p(\lambda|z) = \mathrm{Gam}(\lambda|\alpha+r,\ \beta+n), \qquad E[\lambda|z] = \frac{\alpha+r}{\beta+n}, \qquad f(x|z) = \mathrm{PnGam}(x|\alpha+r,\ \beta+n,\ 1) \]

Inference with reference priors:
\[ \pi(\lambda) \propto \lambda^{-1/2} = \mathrm{Gam}(\lambda|\tfrac12, 0), \qquad \pi(x) = \mathrm{PnGam}(x|\tfrac12,0,1), \qquad \pi(r) = \mathrm{PnGam}(r|\tfrac12,0,n) \]
\[ \pi(\lambda|z) = \mathrm{Gam}(\lambda|\tfrac12+r,\ n), \qquad \pi(x|z) = \mathrm{PnGam}(x|\tfrac12+r,\ n,\ 1) \]
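The Gamma-Poisson update follows the same pattern; a minimal sketch with the reference prior $\mathrm{Gam}(\tfrac12, 0)$ on hypothetical counts:

```python
import numpy as np

def gamma_poisson_update(x, alpha=0.5, beta=0.0):
    """Conjugate update for the Poisson model with a Gam(alpha, beta) prior.

    The posterior is Gam(alpha + r, beta + n) with r = sum(x); returns its
    parameters and the posterior mean E[lambda|z] = (alpha + r)/(beta + n).
    """
    x = np.asarray(x)
    n, r = x.size, int(x.sum())
    return alpha + r, beta + n, (alpha + r) / (beta + n)

a_n, b_n, mean = gamma_poisson_update([2, 0, 3, 1, 4])
print(a_n, b_n, mean)   # 10.5 5.0 2.1
```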
Negative Binomial model:
\[ z = \{x_1,\dots,x_n\}, \quad x_i = 0,1,2,\dots, \qquad f(x_i|\theta,r) = \mathrm{NegBin}(x_i|\theta,r), \quad 0<\theta<1, \ r = 1,2,\dots \]

Likelihood and sufficient statistics:
\[ l(\theta|z) = \prod_{i=1}^{n} \mathrm{NegBin}(x_i|\theta,r), \qquad t(z) = s = \sum_{i=1}^{n} x_i, \qquad p(s|\theta) = \mathrm{NegBin}(s|\theta, nr) \]

Inference with conjugate priors:
\[ \pi(\theta) = \mathrm{Bet}(\theta|\alpha,\beta), \qquad f(x) = \mathrm{NegBinBet}(x|\alpha,\beta,r), \qquad p(s) = \mathrm{NegBinBet}(s|\alpha,\beta,nr) \]
\[ \pi(\theta|z) = \mathrm{Bet}(\theta|\alpha+nr,\ \beta+s), \qquad E[\theta|z] = \frac{\alpha+nr}{\alpha+nr+\beta+s} \]
\[ f(x|z) = \mathrm{NegBinBet}(x|\alpha+nr,\ \beta+s,\ nr) \]

Inference with reference priors:
\[ \pi(\theta) \propto \theta^{-1}(1-\theta)^{-1/2} = \mathrm{Bet}(\theta|0,\tfrac12), \qquad \pi(x) = \mathrm{NegBinBet}(x|0,\tfrac12,r), \qquad \pi(s) = \mathrm{NegBinBet}(s|0,\tfrac12,nr) \]
\[ \pi(\theta|z) = \mathrm{Bet}(\theta|nr,\ s+\tfrac12), \qquad \pi(x|z) = \mathrm{NegBinBet}(x|nr,\ s+\tfrac12,\ nr) \]
Exponential model:
\[ z = \{x_1,\dots,x_n\}, \quad 0 < x_i < \infty, \qquad f(x_i|\lambda) = \mathrm{Ex}(x_i|\lambda), \quad \lambda > 0 \]

Likelihood and sufficient statistics:
\[ l(\lambda|z) = \prod_{i=1}^{n} \mathrm{Ex}(x_i|\lambda), \qquad t(z) = t = \sum_{i=1}^{n} x_i, \qquad p(t|\lambda) = \mathrm{Gam}(t|n,\lambda) \]

Inference with conjugate priors:
\[ p(\lambda) = \mathrm{Gam}(\lambda|\alpha,\beta), \qquad f(x) = \mathrm{GamGam}(x|\alpha,\beta,1), \qquad p(t) = \mathrm{GamGam}(t|\alpha,\beta,n) \]
\[ p(\lambda|z) = \mathrm{Gam}(\lambda|\alpha+n,\ \beta+t), \qquad E[\lambda|z] = \frac{\alpha+n}{\beta+t}, \qquad f(x|z) = \mathrm{GamGam}(x|\alpha+n,\ \beta+t,\ 1) \]

Inference with reference priors:
\[ \pi(\lambda) \propto \lambda^{-1} = \mathrm{Gam}(\lambda|0,0), \qquad \pi(x) = \mathrm{GamGam}(x|0,0,1), \qquad \pi(t) = \mathrm{GamGam}(t|0,0,n) \]
\[ \pi(\lambda|z) = \mathrm{Gam}(\lambda|n,t), \qquad \pi(x|z) = \mathrm{GamGam}(x|n,t,1) \]
Uniform model:
\[ z = \{x_1,\dots,x_n\}, \quad 0 < x_i < \theta, \qquad f(x_i|\theta) = \mathrm{Uni}(x_i|0,\theta), \quad \theta > 0 \]

Likelihood and sufficient statistics:
\[ l(\theta|z) = \prod_{i=1}^{n} \mathrm{Uni}(x_i|0,\theta), \qquad t(z) = t = \max\{x_1,\dots,x_n\}, \qquad p(t|\theta) = \mathrm{IPar}(t|n,\theta^{-1}) \]

Inference with conjugate priors:
\[ \pi(\theta) = \mathrm{Par}(\theta|\alpha,\beta) \]
\[ f(x) = \begin{cases} \frac{\alpha}{\alpha+1}\,\mathrm{Uni}(x|0,\beta), & \text{if } x \le \beta \\ \frac{1}{\alpha+1}\,\mathrm{Par}(x|\alpha,\beta), & \text{if } x > \beta \end{cases} \qquad
p(t) = \begin{cases} \frac{\alpha}{\alpha+n}\,\mathrm{IPar}(t|n,\beta^{-1}), & \text{if } t \le \beta \\ \frac{n}{\alpha+n}\,\mathrm{Par}(t|\alpha,\beta), & \text{if } t > \beta \end{cases} \]
\[ \pi(\theta|z) = \mathrm{Par}(\theta|\alpha+n,\ \beta_n), \qquad \beta_n = \max\{\beta, t\} \]
\[ f(x|z) = \begin{cases} \frac{\alpha+n}{\alpha+n+1}\,\mathrm{Uni}(x|0,\beta_n), & \text{if } x \le \beta_n \\ \frac{1}{\alpha+n+1}\,\mathrm{Par}(x|\alpha+n,\beta_n), & \text{if } x > \beta_n \end{cases} \]

Inference with reference priors:
\[ \pi(\theta) \propto \theta^{-1} = \mathrm{Par}(\theta|0,0), \qquad \pi(\theta|z) = \mathrm{Par}(\theta|n,t) \]
\[ \pi(x|z) = \begin{cases} \frac{n}{n+1}\,\mathrm{Uni}(x|0,t), & \text{if } x \le t \\ \frac{1}{n+1}\,\mathrm{Par}(x|n,t), & \text{if } x > t \end{cases} \]
Normal with known precision $\lambda = 1/\sigma^2 > 0$ (estimation of $\mu$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R, \quad x_i = \mu + b_i, \quad b_i \sim N(b_i|0,\lambda), \qquad f(x_i|\mu,\lambda) = N(x_i|\mu,\lambda), \quad \mu \in \mathbb R \]

Likelihood and sufficient statistics:
\[ l(\mu|z) = \prod_{i=1}^{n} N(x_i|\mu,\lambda), \qquad t(z) = \bar x = \frac1n\sum_{i=1}^{n} x_i, \qquad p(\bar x|\mu,\lambda) = N(\bar x|\mu, n\lambda) \]

Inference with conjugate priors:
\[ p(\mu) = N(\mu|\mu_0,\lambda_0), \qquad f(x) = N\left(x\,\Big|\,\mu_0,\ \frac{\lambda\lambda_0}{\lambda+\lambda_0}\right) \]
\[ f(x_1,\dots,x_n) = N_n\left(x\,\Big|\,\mu_0\mathbf 1,\ \left(\frac1\lambda \mathbf I + \frac1{\lambda_0}\mathbf 1\mathbf 1^t\right)^{-1}\right) \]
\[ p(\bar x) = N\left(\bar x\,\Big|\,\mu_0,\ \frac{n\lambda\lambda_0}{\lambda_n}\right), \quad \lambda_n = \lambda_0 + n\lambda \]
\[ p(\mu|z) = N(\mu|\mu_n,\lambda_n), \quad \mu_n = \frac{\lambda_0\mu_0 + n\lambda\bar x}{\lambda_n}, \qquad f(x|z) = N\left(x\,\Big|\,\mu_n,\ \frac{\lambda\lambda_n}{\lambda+\lambda_n}\right) \]

Inference with reference priors:
\[ \pi(\mu) = \text{constant}, \qquad \pi(\mu|z) = N(\mu|\bar x, n\lambda), \qquad \pi(x|z) = N\left(x\,\Big|\,\bar x,\ \frac{n\lambda}{n+1}\right) \]

Inference with other prior laws:
\[ \pi(\mu) = \mathrm{St}(\mu|0,\tau^2,\alpha), \quad \text{obtained as the scale mixture } \pi_1(\mu|\rho)\,\pi_2(\rho|\alpha) \text{ with} \]
\[ \pi_1(\mu|\rho) = N(\mu|0,\tau^2\rho), \qquad \pi_2(\rho|\alpha) = \mathrm{IGam}(\rho|\alpha/2,\alpha/2), \]
\[ \pi(\mu|z,\rho) = N\left(\mu\,\Big|\,\frac{1}{1+\tau^2\rho}\bar x,\ \frac{\tau^2\rho}{1+\tau^2\rho}\right), \qquad \pi(\rho|z) \propto (1+\tau^2\rho)^{-1/2}\exp\left[\frac{-1}{2(1+\tau^2\rho)}\, x^t x\right]\pi_2(\rho) \]
Normal with known variance $\sigma^2 > 0$ (estimation of $\mu$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R, \quad x_i = \mu + b_i, \quad b_i \sim N(b_i|0,\sigma^2), \qquad f(x_i|\mu,\sigma^2) = N(x_i|\mu,\sigma^2), \quad \mu \in \mathbb R \]

Likelihood and sufficient statistics:
\[ l(\mu|z) = \prod_{i=1}^{n} N(x_i|\mu,\sigma^2), \qquad t(z) = \bar x = \frac1n\sum_{i=1}^{n} x_i, \qquad p(\bar x|\mu,\sigma^2) = N\left(\bar x\,\Big|\,\mu, \frac1n\sigma^2\right) \]

Inference with conjugate priors:
\[ p(\mu) = N(\mu|\mu_0,\sigma_0^2), \qquad f(x) = N(x|\mu_0,\ \sigma_0^2+\sigma^2), \qquad f(x_1,\dots,x_n) = N_n(x|\mu_0\mathbf 1,\ \sigma^2\mathbf I + \sigma_0^2\mathbf 1\mathbf 1^t) \]
\[ p(\bar x) = N\left(\bar x\,\Big|\,\mu_0,\ \sigma_0^2 + \frac1n\sigma^2\right) \]
\[ p(\mu|z) = N(\mu|\mu_n,\sigma_n^2), \quad \mu_n = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2}\left(\frac{1}{\sigma_0^2}\mu_0 + \frac{n}{\sigma^2}\bar x\right), \quad \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2} \]
\[ f(x|z) = N(x|\mu_n,\ \sigma^2+\sigma_n^2) \]

Inference with reference priors:
\[ \pi(\mu) = \text{constant}, \qquad \pi(\mu|z) = N\left(\mu\,\Big|\,\bar x, \frac1n\sigma^2\right), \qquad \pi(x|z) = N\left(x\,\Big|\,\bar x,\ \frac{n+1}{n}\sigma^2\right) \]

Inference with other prior laws:
\[ \pi(\mu) = \mathrm{St}(\mu|0,\tau^2,\alpha), \quad \text{obtained as the scale mixture } \pi_1(\mu|\rho)\,\pi_2(\rho|\alpha) \text{ with} \]
\[ \pi_1(\mu|\rho) = N(\mu|0,\tau^2\rho), \qquad \pi_2(\rho|\alpha) = \mathrm{IGam}(\rho|\alpha/2,\alpha/2), \]
\[ \pi(\mu|z,\rho) = N\left(\mu\,\Big|\,\frac{1}{1+\tau^2\rho}\bar x,\ \frac{\tau^2\rho}{1+\tau^2\rho}\right), \qquad \pi(\rho|z) \propto (1+\tau^2\rho)^{-1/2}\exp\left[\frac{-1}{2(1+\tau^2\rho)}\, x^t x\right]\pi_2(\rho) \]
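The conjugate Normal-mean update above is easy to verify numerically; a minimal sketch with hypothetical data, $\sigma^2 = 1$ and a weak prior $N(0, 100)$:

```python
import numpy as np

def normal_known_var_update(x, sigma2, mu0, sigma02):
    """Conjugate update for a Normal mean with known variance sigma2.

    Prior N(mu0, sigma02); returns posterior mean mu_n and variance sigma_n^2:
      sigma_n^2 = sigma02*sigma2 / (n*sigma02 + sigma2)
      mu_n      = sigma_n^2 * (mu0/sigma02 + n*xbar/sigma2)
    """
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    sigma_n2 = sigma02 * sigma2 / (n * sigma02 + sigma2)
    mu_n = sigma_n2 * (mu0 / sigma02 + n * xbar / sigma2)
    return mu_n, sigma_n2

mu_n, s_n2 = normal_known_var_update([1.2, 0.8, 1.0, 1.4], 1.0, 0.0, 100.0)
print(mu_n, s_n2)       # posterior mean pulled slightly toward the prior mean 0
```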
Normal with known mean $\mu \in \mathbb R$ (estimation of $\lambda$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i = \mu + b_i, \quad b_i \sim N(b_i|0,\lambda), \qquad f(x_i|\mu,\lambda) = N(x_i|\mu,\lambda), \quad \lambda > 0 \]

Likelihood and sufficient statistics:
\[ l(\lambda|z) = \prod_{i=1}^{n} N(x_i|\mu,\lambda), \qquad t(z) = t = \sum_{i=1}^{n}(x_i-\mu)^2 \]
\[ p(t|\mu,\lambda) = \mathrm{Gam}\left(t\,\Big|\,\frac n2, \frac\lambda2\right), \qquad p(\lambda t|\mu,\lambda) = \mathrm{Chi2}(\lambda t|n) \]

Inference with conjugate priors:
\[ p(\lambda) = \mathrm{Gam}(\lambda|\alpha,\beta), \qquad f(x) = \mathrm{St}(x|\mu,\ \alpha/\beta,\ 2\alpha), \qquad p(t) = \mathrm{GamGam}\left(t\,\Big|\,\alpha, 2\beta, \frac n2\right) \]
\[ p(\lambda|z) = \mathrm{Gam}\left(\lambda\,\Big|\,\alpha+\frac n2,\ \beta+\frac t2\right), \qquad f(x|z) = \mathrm{St}\left(x\,\Big|\,\mu,\ \frac{\alpha+\frac n2}{\beta+\frac t2},\ 2\alpha+n\right) \]

Inference with reference priors:
\[ \pi(\lambda) \propto \lambda^{-1} = \mathrm{Gam}(\lambda|0,0), \qquad \pi(\lambda|z) = \mathrm{Gam}\left(\lambda\,\Big|\,\frac n2, \frac t2\right), \qquad \pi(x|z) = \mathrm{St}\left(x\,\Big|\,\mu,\ \frac nt,\ n\right) \]
Normal with known mean $\mu \in \mathbb R$ (estimation of $\sigma^2$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i = \mu + b_i, \quad b_i \sim N(b_i|0,\sigma^2), \qquad f(x_i|\mu,\sigma^2) = N(x_i|\mu,\sigma^2), \quad \sigma^2 > 0 \]

Likelihood and sufficient statistics:
\[ l(\sigma^2|z) = \prod_{i=1}^{n} N(x_i|\mu,\sigma^2), \qquad t(z) = t = \sum_{i=1}^{n}(x_i-\mu)^2 \]
\[ p(t|\mu,\sigma^2) = \mathrm{Gam}\left(t\,\Big|\,\frac n2, \frac{1}{2\sigma^2}\right), \qquad p\left(\frac{t}{\sigma^2}\,\Big|\,\mu,\sigma^2\right) = \mathrm{Chi2}\left(\frac{t}{\sigma^2}\,\Big|\,n\right) \]

Inference with conjugate priors:
\[ p(\sigma^2) = \mathrm{IGam}(\sigma^2|\alpha,\beta), \qquad f(x) = \mathrm{St}(x|\mu,\ \alpha/\beta,\ 2\alpha), \qquad p(t) = \mathrm{GamGam}\left(t\,\Big|\,\alpha, 2\beta, \frac n2\right) \]
\[ p(\sigma^2|z) = \mathrm{IGam}\left(\sigma^2\,\Big|\,\alpha+\frac n2,\ \beta+\frac t2\right), \qquad f(x|z) = \mathrm{St}\left(x\,\Big|\,\mu,\ \frac{\alpha+\frac n2}{\beta+\frac t2},\ 2\alpha+n\right) \]

Inference with reference priors:
\[ \pi(\sigma^2) \propto \frac{1}{\sigma^2} = \mathrm{IGam}(\sigma^2|0,0), \qquad \pi(\sigma^2|z) = \mathrm{IGam}\left(\sigma^2\,\Big|\,\frac n2, \frac t2\right), \qquad \pi(x|z) = \mathrm{St}\left(x\,\Big|\,\mu,\ \frac nt,\ n\right) \]
Normal with both parameters unknown; estimation of mean and precision $(\mu,\lambda)$:
\[ z = \{x_1,\dots,x_n\}, \quad x_i = \mu + b_i, \quad b_i \sim N(b_i|0,\lambda), \qquad f(x_i|\mu,\lambda) = N(x_i|\mu,\lambda), \quad \mu \in \mathbb R, \ \lambda > 0 \]

Likelihood and sufficient statistics:
\[ l(\mu,\lambda|z) = \prod_{i=1}^{n} N(x_i|\mu,\lambda), \qquad t(z) = (\bar x, s^2), \quad \bar x = \frac1n\sum_{i=1}^{n} x_i, \quad s^2 = \frac1n\sum_{i=1}^{n}(x_i-\bar x)^2 \]
\[ p(\bar x|\mu,\lambda) = N(\bar x|\mu, n\lambda), \qquad p(ns^2|\mu,\lambda) = \mathrm{Gam}(ns^2|(n-1)/2,\ \lambda/2), \qquad p(\lambda ns^2|\mu,\lambda) = \mathrm{Chi2}(\lambda ns^2|n-1) \]

Inference with conjugate priors:
\[ p(\mu,\lambda) = \mathrm{NGam}(\mu,\lambda|\mu_0,n_0,\alpha,\beta) = N(\mu|\mu_0,n_0\lambda)\,\mathrm{Gam}(\lambda|\alpha,\beta) \]
\[ p(\mu) = \mathrm{St}\left(\mu\,\Big|\,\mu_0,\ n_0\frac\alpha\beta,\ 2\alpha\right), \qquad p(\lambda) = \mathrm{Gam}(\lambda|\alpha,\beta) \]
\[ f(x) = \mathrm{St}\left(x\,\Big|\,\mu_0,\ \frac{n_0}{n_0+1}\frac\alpha\beta,\ 2\alpha\right), \qquad p(\bar x) = \mathrm{St}\left(\bar x\,\Big|\,\mu_0,\ \frac{n_0 n}{n_0+n}\frac\alpha\beta,\ 2\alpha\right) \]
\[ p(ns^2) = \mathrm{GamGam}\left(ns^2\,\Big|\,\alpha, 2\beta, \frac{n-1}{2}\right) \]
\[ p(\mu|z) = \mathrm{St}\left(\mu\,\big|\,\mu_n,\ (n+n_0)\,\alpha_n\beta_n^{-1},\ 2\alpha_n\right), \]
\[ \alpha_n = \alpha+\frac n2, \qquad \mu_n = \frac{n_0\mu_0+n\bar x}{n_0+n}, \qquad \beta_n = \beta+\frac{ns^2}{2}+\frac12\frac{n_0 n}{n_0+n}(\mu_0-\bar x)^2 \]
\[ p(\lambda|z) = \mathrm{Gam}(\lambda|\alpha_n,\beta_n), \qquad f(x|z) = \mathrm{St}\left(x\,\Big|\,\mu_n,\ \frac{n+n_0}{n+n_0+1}\frac{\alpha_n}{\beta_n},\ 2\alpha_n\right) \]

Inference with reference priors:
\[ \pi(\mu,\lambda) = \pi(\lambda,\mu) \propto \lambda^{-1}, \quad n > 1 \]
\[ \pi(\mu|z) = \mathrm{St}(\mu|\bar x,\ (n-1)s^{-2},\ n-1), \qquad \pi(\lambda|z) = \mathrm{Gam}(\lambda|(n-1)/2,\ ns^2/2) \]
\[ \pi(x|z) = \mathrm{St}\left(x\,\Big|\,\bar x,\ \frac{n-1}{n+1}s^{-2},\ n-1\right) \]
Normal with both parameters unknown; estimation of mean and variance $(\mu,\sigma^2)$:
\[ z = \{x_1,\dots,x_n\}, \quad x_i = \mu + b_i, \quad b_i \sim N(b_i|0,\sigma^2), \qquad f(x_i|\mu,\sigma^2) = N(x_i|\mu,\sigma^2), \quad \mu \in \mathbb R, \ \sigma^2 > 0 \]

Likelihood and sufficient statistics:
\[ l(\mu,\sigma^2|z) = \prod_{i=1}^{n} N(x_i|\mu,\sigma^2), \qquad t(z) = (\bar x, s^2), \quad \bar x = \frac1n\sum_{i=1}^{n} x_i, \quad s^2 = \frac1n\sum_{i=1}^{n}(x_i-\bar x)^2 \]
\[ p(\bar x|\mu,\sigma^2) = N\left(\bar x\,\Big|\,\mu, \frac1n\sigma^2\right), \qquad p\left(\frac{ns^2}{\sigma^2}\,\Big|\,\mu,\sigma^2\right) = \mathrm{Chi2}\left(\frac{ns^2}{\sigma^2}\,\Big|\,n-1\right) \]

Inference with conjugate priors:
\[ p(\mu,\sigma^2) = \mathrm{NIGam}(\mu,\sigma^2|\mu_0,n_0,\alpha,\beta) = N(\mu|\mu_0,n_0\sigma^2)\,\mathrm{IGam}(\sigma^2|\alpha,\beta) \]
\[ p(\mu) = \mathrm{St}\left(\mu\,\Big|\,\mu_0,\ n_0\frac\alpha\beta,\ 2\alpha\right), \qquad p(\sigma^2) = \mathrm{IGam}(\sigma^2|\alpha,\beta) \]
\[ f(x) = \mathrm{St}\left(x\,\Big|\,\mu_0,\ \frac{n_0}{n_0+1}\frac\alpha\beta,\ 2\alpha\right), \qquad p(\bar x) = \mathrm{St}\left(\bar x\,\Big|\,\mu_0,\ \frac{n_0 n}{n_0+n}\frac\alpha\beta,\ 2\alpha\right) \]
\[ p(ns^2) = \mathrm{GamGam}\left(ns^2\,\Big|\,\alpha, 2\beta, \frac{n-1}{2}\right) \]
\[ p(\mu|z) = \mathrm{St}\left(\mu\,\big|\,\mu_n,\ (n+n_0)\,\alpha_n\beta_n^{-1},\ 2\alpha_n\right), \]
\[ \alpha_n = \alpha+\frac n2, \qquad \mu_n = \frac{n_0\mu_0+n\bar x}{n_0+n}, \qquad \beta_n = \beta+\frac{ns^2}{2}+\frac12\frac{n_0 n}{n_0+n}(\mu_0-\bar x)^2 \]
\[ p(\sigma^2|z) = \mathrm{IGam}(\sigma^2|\alpha_n,\beta_n), \qquad f(x|z) = \mathrm{St}\left(x\,\Big|\,\mu_n,\ \frac{n+n_0}{n+n_0+1}\frac{\alpha_n}{\beta_n},\ 2\alpha_n\right) \]

Inference with reference priors:
\[ \pi(\mu,\sigma^2) = \pi(\sigma^2,\mu) \propto \frac{1}{\sigma^2}, \quad n > 1 \]
\[ \pi(\mu|z) = \mathrm{St}(\mu|\bar x,\ (n-1)s^{-2},\ n-1), \qquad \pi(\sigma^2|z) = \mathrm{IGam}(\sigma^2|(n-1)/2,\ ns^2/2) \]
\[ \pi(x|z) = \mathrm{St}\left(x\,\Big|\,\bar x,\ \frac{n-1}{n+1}s^{-2},\ n-1\right) \]
Multinomial model:
\[ z = \{r_1,\dots,r_k,n\}, \quad r_i = 0,1,2,\dots, \quad \sum_{i=1}^{k} r_i \le n, \]
\[ p(r_i|\theta_i,n) = \mathrm{Bin}(r_i|\theta_i,n), \qquad p(z|\theta,n) = \mathrm{Mu}_k(z|\theta,n), \quad 0<\theta_i<1, \ \sum_{i=1}^{k}\theta_i \le 1 \]

Likelihood and sufficient statistics:
\[ l(\theta|z) = \mathrm{Mu}_k(z|\theta,n), \qquad t(z) = (r,n), \quad r = \{r_1,\dots,r_k\}, \qquad p(r|\theta) = \mathrm{Mu}_k(r|\theta,n) \]

Inference with conjugate priors:
\[ \pi(\theta) = \mathrm{Di}_k(\theta|\alpha), \quad \alpha = \{\alpha_1,\dots,\alpha_{k+1}\}, \qquad p(r) = \mathrm{Mu}_k(r|\alpha,n) \]
\[ \pi(\theta|z) = \mathrm{Di}_k\left(\theta\,\Big|\,\alpha_1+r_1,\dots,\alpha_k+r_k,\ \alpha_{k+1}+n-\sum_{i=1}^{k} r_i\right) \]
\[ f(x|z) = \mathrm{Di}_k\left(\theta\,\Big|\,\alpha_1+r_1,\dots,\alpha_k+r_k,\ \alpha_{k+1}+n-\sum_{i=1}^{k} r_i\right) \]

Inference with reference priors:
\[ \pi(\theta) \propto\ ??, \qquad \pi(\theta|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Multivariate Normal with known precision matrix $\Lambda$ (estimation of the mean $\mu$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R^k, \quad x_i = \mu + b_i, \quad b_i \sim N_k(b_i|0,\Lambda) \]
\[ f(x_i|\mu,\Lambda) = N_k(x_i|\mu,\Lambda), \quad \mu \in \mathbb R^k, \quad \Lambda \ \text{a p.d. matrix of dimensions } k\times k \]

Likelihood and sufficient statistics:
\[ l(\mu|z) = \prod_{i=1}^{n} N_k(x_i|\mu,\Lambda), \qquad t(z) = \bar x = \frac1n\sum_{i=1}^{n} x_i, \qquad p(\bar x|\mu,\Lambda) = N_k(\bar x|\mu, n\Lambda) \]

Inference with conjugate priors:
\[ p(\mu) = N_k(\mu|\mu_0,\Lambda_0), \qquad f(x) = N_k\left(x\,\big|\,\mu_0,\ (\Lambda_0\Lambda)\Lambda_1^{-1}\right), \quad \Lambda_1 = \Lambda_0 + \Lambda \]
\[ p(\mu|z) = N_k(\mu|\mu_n,\Lambda_n), \quad \Lambda_n = \Lambda_0 + n\Lambda, \quad \mu_n = \Lambda_n^{-1}(\Lambda_0\mu_0 + n\Lambda\bar x) \]
\[ f(x|z) = N_k\left(x\,\big|\,\mu_n,\ (\Lambda\Lambda_n)(\Lambda+\Lambda_n)^{-1}\right) \]

Inference with reference priors:
\[ \pi(\mu) =\ ??, \qquad \pi(\mu|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Multivariate Normal with known covariance matrix $\Sigma$ (estimation of the mean $\mu$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R^k, \quad x_i = \mu + b_i, \quad b_i \sim N_k(b_i|0,\Sigma) \]
\[ f(x_i|\mu,\Sigma) = N_k(x_i|\mu,\Sigma), \quad \mu \in \mathbb R^k, \quad \Sigma \ \text{a p.d. matrix of dimensions } k\times k \]

Likelihood and sufficient statistics:
\[ l(\mu|z) = \prod_{i=1}^{n} N_k(x_i|\mu,\Sigma), \qquad t(z) = \bar x = \frac1n\sum_{i=1}^{n} x_i, \qquad p(\bar x|\mu,\Sigma) = N_k\left(\bar x\,\Big|\,\mu, \frac1n\Sigma\right) \]

Inference with conjugate priors:
\[ p(\mu) = N_k(\mu|\mu_0,\Sigma_0), \qquad f(x) = N_k(x|\mu_0,\Sigma_1), \quad \Sigma_1 = \Sigma_0 + \Sigma \]
\[ p(\mu|z) = N_k(\mu|\mu_n,\Sigma_n), \quad \Sigma_n = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1}, \quad \mu_n = \Sigma_n\left(\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar x\right) \]
\[ f(x|z) = N_k(x|\mu_n,\ \Sigma+\Sigma_n) \]

Inference with reference priors:
\[ \pi(\mu) =\ ??, \qquad \pi(\mu|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Multivariate Normal with known mean $\mu$ (estimation of the precision matrix $\Lambda$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R^k, \quad x_i = \mu + b_i, \quad b_i \sim N_k(b_i|0,\Lambda), \qquad f(x_i|\mu,\Lambda) = N_k(x_i|\mu,\Lambda) \]

Likelihood and sufficient statistics:
\[ l(\Lambda|z) = \prod_{i=1}^{n} N_k(x_i|\mu,\Lambda), \qquad t(z) = S = \sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^t, \qquad p(S|\Lambda) = \mathrm{Wi}_k(S|(n-1)/2,\ \Lambda/2) \]

Inference with conjugate priors:
\[ p(\Lambda) = \mathrm{Wi}_k(\Lambda|\alpha,\beta) \]
\[ f(x) = \mathrm{St}_k\left(x\,\Big|\,\mu_0,\ \frac{n_0}{n_0+1}\left(\alpha-\frac{k-1}{2}\right)\beta^{-1},\ 2\alpha-k+1\right) \]
\[ p(\Lambda|z) = \mathrm{Wi}_k(\Lambda|\alpha_n,\beta_n), \]
\[ \alpha_n = \alpha+\frac n2-\frac{k-1}{2}, \qquad \mu_n = \frac{n_0\mu_0+n\bar x}{n_0+n}, \qquad \beta_n = \beta+\frac12 S+\frac12\frac{n_0 n}{n_0+n}(\mu_0-\bar x)(\mu_0-\bar x)^t \]
\[ f(x|z) = \mathrm{St}_k\left(x\,\Big|\,\mu_n,\ \frac{n+n_0}{n+n_0+1}\alpha_n\beta_n^{-1},\ 2\alpha_n\right) \]

Inference with reference priors:
\[ \pi(\Lambda) =\ ??, \qquad \pi(\Lambda|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Multivariate Normal with known mean $\mu$ (estimation of the covariance matrix $\Sigma$):
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R^k, \quad x_i = \mu + b_i, \quad b_i \sim N_k(b_i|0,\Sigma), \qquad f(x_i|\mu,\Sigma) = N_k(x_i|\mu,\Sigma) \]

Likelihood and sufficient statistics:
\[ l(\Sigma|z) = \prod_{i=1}^{n} N_k(x_i|\mu,\Sigma), \qquad t(z) = S = \sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^t, \qquad p(S|\Sigma) = \mathrm{Wi}_k(S|(n-1)/2,\ \Sigma/2) \]

Inference with conjugate priors:
\[ p(\Sigma) = \mathrm{IWi}_k(\Sigma|\alpha,\beta) \]
\[ f(x) = \mathrm{St}_k\left(x\,\Big|\,\mu_0,\ \frac{n_0}{n_0+1}\left(\alpha-\frac{k-1}{2}\right)\beta^{-1},\ 2\alpha-k+1\right) \]
\[ p(\Sigma|z) = \mathrm{IWi}_k(\Sigma|\alpha_n,\beta_n), \]
\[ \alpha_n = \alpha+\frac n2-\frac{k-1}{2}, \qquad \mu_n = \frac{n_0\mu_0+n\bar x}{n_0+n}, \qquad \beta_n = \beta+\frac12 S+\frac12\frac{n_0 n}{n_0+n}(\mu_0-\bar x)(\mu_0-\bar x)^t \]
\[ f(x|z) = \mathrm{St}_k\left(x\,\Big|\,\mu_n,\ \frac{n+n_0}{n+n_0+1}\alpha_n\beta_n^{-1},\ 2\alpha_n\right) \]

Inference with reference priors:
\[ \pi(\Sigma) =\ ??, \qquad \pi(\Sigma|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Multivariate Normal with both parameters unknown; estimation of mean and precision matrix $(\mu,\Lambda)$:
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R^k, \quad x_i = \mu + b_i, \quad b_i \sim N_k(b_i|0,\Lambda), \qquad f(x_i|\mu,\Lambda) = N_k(x_i|\mu,\Lambda) \]

Likelihood and sufficient statistics:
\[ l(\mu,\Lambda|z) = \prod_{i=1}^{n} N_k(x_i|\mu,\Lambda), \qquad t(z) = (\bar x, S), \quad \bar x = \frac1n\sum_i x_i, \quad S = \sum_i (x_i-\bar x)(x_i-\bar x)^t \]
\[ p(\bar x|\mu,\Lambda) = N_k(\bar x|\mu, n\Lambda), \qquad p(S|\Lambda) = \mathrm{Wi}_k(S|(n-1)/2,\ \Lambda/2) \]

Inference with conjugate priors:
\[ p(\mu,\Lambda) = \mathrm{NWi}_k(\mu,\Lambda|\mu_0,n_0,\alpha,\beta) = N_k(\mu|\mu_0,n_0\Lambda)\,\mathrm{Wi}_k(\Lambda|\alpha,\beta) \]
\[ p(\mu) = \mathrm{St}_k\left(\mu\,\big|\,\mu_0,\ n_0\alpha\beta^{-1},\ 2\alpha\right)\ ??, \qquad p(\Lambda) = \mathrm{Wi}_k(\Lambda|\alpha,\beta)\ ?? \]
\[ f(x) = \mathrm{St}_k\left(x\,\Big|\,\mu_0,\ \frac{n_0}{n_0+1}\left(\alpha-\frac{k-1}{2}\right)\beta^{-1},\ 2\alpha-k+1\right) \]
\[ p(\mu|z) = \mathrm{St}_k\left(\mu\,\big|\,\mu_n,\ (n+n_0)\,\alpha_n\beta_n^{-1},\ 2\alpha_n\right), \]
\[ \alpha_n = \alpha+\frac n2-\frac{k-1}{2}, \qquad \mu_n = \frac{n_0\mu_0+n\bar x}{n_0+n}, \qquad \beta_n = \beta+\frac12 S+\frac12\frac{n_0 n}{n_0+n}(\mu_0-\bar x)(\mu_0-\bar x)^t \]
\[ p(\Lambda|z) = \mathrm{Wi}_k(\Lambda|\alpha_n,\beta_n), \qquad f(x|z) = \mathrm{St}_k\left(x\,\Big|\,\mu_n,\ \frac{n+n_0}{n+n_0+1}\alpha_n\beta_n^{-1},\ 2\alpha_n\right) \]

Inference with reference priors:
\[ \pi(\mu,\Lambda) =\ ??, \qquad \pi(\mu|z) =\ ??, \qquad \pi(\Lambda|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Multivariate Normal with both parameters unknown; estimation of mean and covariance matrix $(\mu,\Sigma)$:
\[ z = \{x_1,\dots,x_n\}, \quad x_i \in \mathbb R^k, \quad x_i = \mu + b_i, \quad b_i \sim N_k(b_i|0,\Sigma), \qquad f(x_i|\mu,\Sigma) = N_k(x_i|\mu,\Sigma) \]

Likelihood and sufficient statistics:
\[ l(\mu,\Sigma|z) = \prod_{i=1}^{n} N_k(x_i|\mu,\Sigma), \qquad t(z) = (\bar x, S), \quad \bar x = \frac1n\sum_i x_i, \quad S = \sum_i (x_i-\bar x)(x_i-\bar x)^t \]
\[ p(\bar x|\mu,\Sigma) = N_k\left(\bar x\,\Big|\,\mu, \frac1n\Sigma\right), \qquad p(S|\Sigma) = \mathrm{Wi}_k(S|(n-1)/2,\ \Sigma/2) \]

Inference with conjugate priors:
\[ p(\mu,\Sigma) = \mathrm{NIWi}_k(\mu,\Sigma|\mu_0,n_0,\alpha,\beta) = N_k(\mu|\mu_0,n_0\Sigma)\,\mathrm{IWi}_k(\Sigma|\alpha,\beta) \]
\[ p(\mu) = \mathrm{St}_k\left(\mu\,\big|\,\mu_0,\ n_0\alpha\beta^{-1},\ 2\alpha\right)\ ??, \qquad p(\Sigma) = \mathrm{IWi}_k(\Sigma|\alpha,\beta)\ ?? \]
\[ f(x) = \mathrm{St}_k\left(x\,\Big|\,\mu_0,\ \frac{n_0}{n_0+1}\left(\alpha-\frac{k-1}{2}\right)\beta^{-1},\ 2\alpha-k+1\right) \]
\[ p(\mu|z) = \mathrm{St}_k\left(\mu\,\big|\,\mu_n,\ (n+n_0)\,\alpha_n\beta_n^{-1},\ 2\alpha_n\right), \]
\[ \alpha_n = \alpha+\frac n2-\frac{k-1}{2}, \qquad \mu_n = \frac{n_0\mu_0+n\bar x}{n_0+n}, \qquad \beta_n = \beta+\frac12 S+\frac12\frac{n_0 n}{n_0+n}(\mu_0-\bar x)(\mu_0-\bar x)^t \]
\[ p(\Sigma|z) = \mathrm{IWi}_k(\Sigma|\alpha_n,\beta_n), \qquad f(x|z) = \mathrm{St}_k\left(x\,\Big|\,\mu_n,\ \frac{n+n_0}{n+n_0+1}\alpha_n\beta_n^{-1},\ 2\alpha_n\right) \]

Inference with reference priors:
\[ \pi(\mu,\Sigma) =\ ??, \qquad \pi(\mu|z) =\ ??, \qquad \pi(\Sigma|z) =\ ??, \qquad \pi(x|z) =\ ?? \]
Linear regression:
z = (y, X),  y = {y_1, ..., y_n} ∈ IR^n,
x_i = {x_i1, ..., x_ik} ∈ IR^k,  X = (x_ij),
θ = {θ_1, ..., θ_k} ∈ IR^k,  y_i = x_i^t θ = θ^t x_i
p(y|X, θ, λ) = N_n(y|Xθ, λI_n),  θ ∈ IR^k,  λ > 0

Likelihood and sufficient statistics:
l(θ|z) = N_n(y|Xθ, λI_n),  t(z) = (X^tX, X^ty)

Inference with conjugate priors:
π(θ, λ) = NGam_k(θ, λ|θ_0, Λ_0, α, β) = N_k(θ|θ_0, λΛ_0) Gam(λ|α, β)
π(θ|λ) = N_k(θ|θ_0, λΛ_0),  E[θ|λ] = θ_0,  Var[θ|λ] = (λΛ_0)^{-1}
π(λ|α, β) = Gam(λ|α, β)
π(θ) = St_k(θ|θ_0, (α/β) Λ_0, 2α),  E[θ] = θ_0,  Var[θ] = (α/(α − 2)) Λ_0^{-1}
p(y_i|x_i) = St(y_i|x_i^t θ_0, (α/β) f(x_i), 2α),  with f(x_i) = 1 − x_i^t(Λ_0 + x_i x_i^t)^{-1} x_i
π(θ, λ|z) = NGam_k(θ, λ|θ_n, Λ_n, α_n, β_n) = N_k(θ|θ_n, λΛ_n) Gam(λ|α_n, β_n)
π(θ|z) = St_k(θ|θ_n, (Λ_0 + X^tX) (α_n/β_n), 2α_n)
  α_n = α + n/2,
  θ_n = (Λ_0 + X^tX)^{-1}(Λ_0 θ_0 + X^ty) = (I − Λ_n)θ_0 + Λ_n θ̂,
  β_n = β + (1/2)(y − Xθ_n)^t y + (1/2)(θ_0 − θ_n)^t Λ_0 θ_0
      = β + (1/2) y^t y + (1/2) θ_0^t Λ_0 θ_0 − (1/2) θ_n^t (Λ_0 + X^tX) θ_n,
  θ̂ = (X^tX)^{-1}X^ty,  Λ_n = (Λ_0 + X^tX)^{-1}X^tX
E[θ|z] = θ_n,  Var[θ|z] = (Λ_0 + X^tX)^{-1}
π(λ|z) = Gam(λ|α_n, β_n)
p(y_i|x_i, z) = St(y_i|x_i^t θ_n, (α_n/β_n) f_n(x_i), 2α_n),
  f_n(x_i) = 1 − x_i^t(X^tX + Λ_0 + x_i x_i^t)^{-1} x_i

Inference with reference priors:
π(θ, λ) ∝ λ^{-(k+1)/2}
π(θ|z) = St_k(θ|θ_n, ((n − k)/(2β_n)) X^tX, n − k)
  θ_n = (X^tX)^{-1}X^ty,
  β_n = (1/2)(y − Xθ_n)^t(y − Xθ_n)
π(λ|z) = Gam(λ|(n − k)/2, β_n)
p(y_i|x_i, z) = St(y_i|x_i^t θ_n, ((n − k)/(2β_n)) f_n(x_i), n − k),
  f_n(x_i) = 1 − x_i^t(X^tX + x_i x_i^t)^{-1} x_i
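As a sanity check of the conjugate update (θ_n, α_n, β_n) above, here is a minimal sketch for the one-dimensional case (k = 1, scalar θ and Λ_0 = λ_0); the data values are illustrative, not from the course.

```python
def ng_regression_posterior(x, y, theta0, lam0, alpha, beta):
    """Conjugate Normal-Gamma update for y_i = theta * x_i + e_i (k = 1):
       theta_n = (lam0 + sum x_i^2)^{-1} (lam0*theta0 + sum x_i y_i)
       alpha_n = alpha + n/2
       beta_n  = beta + (1/2)(y - X theta_n)^t y + (1/2)(theta0 - theta_n) lam0 theta0
    """
    n = len(x)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    theta_n = (lam0 * theta0 + sxy) / (lam0 + sxx)
    alpha_n = alpha + n / 2
    beta_n = (beta
              + 0.5 * sum((yi - theta_n * xi) * yi for xi, yi in zip(x, y))
              + 0.5 * (theta0 - theta_n) * lam0 * theta0)
    return theta_n, alpha_n, beta_n

theta_n, alpha_n, beta_n = ng_regression_posterior(
    [1.0, 2.0], [2.0, 4.0], theta0=0.0, lam0=1.0, alpha=1.0, beta=1.0)
```

With these numbers θ_n = 10/6 ≈ 1.667 (shrunk from the least-squares value 2 toward the prior mean 0), α_n = 2 and β_n = 8/3.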
Inverse problems:
z = Hx + b,  z = {z_1, ..., z_n} ∈ IR^n,  h_i = {h_i1, ..., h_ik} ∈ IR^k,  H = (h_ij),
x = {x_1, ..., x_k} ∈ IR^k
p(z|H, x, λ) = N_n(z|Hx, λI_n),  x ∈ IR^k,  λ > 0

Likelihood and sufficient statistics:
l(x|z) = N_n(z|Hx, λI_n),  t(z) = (H^tH, H^tz)

Inference with conjugate priors:
π(x, λ) = NGam_k(x, λ|x_0, Λ_0, α, β) = N_k(x|x_0, λΛ_0) Gam(λ|α, β)
π(x|λ) = N_k(x|x_0, λΛ_0),  E[x|λ] = x_0,  Var[x|λ] = (λΛ_0)^{-1}
π(λ|α, β) = Gam(λ|α, β)
π(x) = St_k(x|x_0, (α/β) Λ_0, 2α),  E[x] = x_0,  Var[x] = (α/(α − 2)) Λ_0^{-1}
p(z_i|h_i) = St(z_i|h_i^t x_0, (α/β) f(h_i), 2α),
  f(h_i) = 1 − h_i^t(Λ_0 + h_i h_i^t)^{-1} h_i
π(x, λ|z) = NGam_k(x, λ|x_n, Λ_n, α_n, β_n) = N_k(x|x_n, λΛ_n) Gam(λ|α_n, β_n)
f(x|z) = St_k(x|x_n, (Λ_0 + H^tH) (α_n/β_n), 2α_n)
  α_n = α + n/2,
  x_n = (Λ_0 + H^tH)^{-1}(Λ_0 x_0 + H^tz) = (I − Λ_n)x_0 + Λ_n x̂,
  β_n = β + (1/2)(z − Hx_n)^t z + (1/2)(x_0 − x_n)^t Λ_0 x_0
      = β + (1/2) z^t z + (1/2) x_0^t Λ_0 x_0 − (1/2) x_n^t (Λ_0 + H^tH) x_n,
  x̂ = (H^tH)^{-1}H^tz,  Λ_n = (Λ_0 + H^tH)^{-1}H^tH
E[x|z] = x_n,  Var[x|z] = (Λ_0 + H^tH)^{-1}
π(λ|z) = Gam(λ|α_n, β_n)
p(z_i|h_i, z) = St(z_i|h_i^t x_n, (α_n/β_n) f_n(h_i), 2α_n),
  f_n(h_i) = 1 − h_i^t(H^tH + Λ_0 + h_i h_i^t)^{-1} h_i

Inference with reference priors:
π(x, λ) ∝ λ^{-(k+1)/2}
π(x|z) = St_k(x|x_n, ((n − k)/(2β_n)) H^tH, n − k)
  x_n = (H^tH)^{-1}H^tz,
  β_n = (1/2)(z − Hx_n)^t(z − Hx_n)
π(λ|z) = Gam(λ|(n − k)/2, β_n)
p(z_i|h_i, z) = St(z_i|h_i^t x_n, ((n − k)/(2β_n)) f_n(h_i), n − k),
  f_n(h_i) = 1 − h_i^t(H^tH + h_i h_i^t)^{-1} h_i
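The posterior mean x_n = (Λ_0 + H^tH)^{-1}(Λ_0 x_0 + H^tz) is the familiar regularized least-squares solution. A minimal numerical sketch (the matrix H, data z and prior values are illustrative, not from the course):

```python
import numpy as np

def posterior_mean(H, z, x0, Lam0):
    """x_n = (Lam0 + H^t H)^{-1} (Lam0 x0 + H^t z), solved without explicit inverse."""
    A = Lam0 + H.T @ H
    return np.linalg.solve(A, Lam0 @ x0 + H.T @ z)

H = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
x_true = np.array([1.0, -1.0])
z = H @ x_true                       # noiseless data for the check
Lam0 = 1e-8 * np.eye(2)              # nearly flat prior -> ordinary least squares
x_n = posterior_mean(H, z, np.zeros(2), Lam0)
```

With a nearly flat prior the estimate recovers x_true; increasing Λ_0 pulls x_n toward x_0, which is the Bayesian reading of Tikhonov regularization.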
A.2 Summary of probability distributions
Probability laws for discrete random variables

Bernoulli  Ber(x|θ) = θ^x (1 − θ)^{1−x},
  0 < θ < 1,  x ∈ {0, 1}

Binomial  Bin(x|θ, n) = c θ^x (1 − θ)^{n−x},  c = C(n, x),
  0 < θ < 1,  n = 1, 2, ...,  x = 0, 1, ..., n

Hypergeometric  HypGeo(x|N, M, n) = c C(N, x) C(M, n − x),  c = C(N + M, n)^{-1},
  N, M = 1, 2, ...,  n = 1, ..., N + M,
  x = a, a + 1, ..., b,  with a = max{0, n − M}, b = min{N, n}

Negative Binomial  NegBin(x|θ, r) = c C(r + x − 1, r − 1) (1 − θ)^x,  c = θ^r,
  0 < θ < 1,  r = 1, 2, ...,  x = 0, 1, 2, ...

Poisson  Pn(x|λ) = c λ^x / x!,  c = exp[−λ],
  λ > 0,  x = 0, 1, 2, ...

Binomial-Beta  BinBet(x|α, β, n) = c C(n, x) Γ(α + x) Γ(β + n − x),
  c = Γ(α + β) / (Γ(α) Γ(β) Γ(α + β + n)),
  α, β > 0,  n = 1, 2, ...,  x = 0, ..., n
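The normalization constants in this table are easy to verify numerically. For instance the Binomial-Beta pmf, with c exactly as given above, should sum to one over its support (parameter values below are illustrative):

```python
import math

def binbet_pmf(x, a, b, n):
    """Binomial-Beta pmf as defined in the summary:
       c * C(n, x) * Gamma(a + x) * Gamma(b + n - x),
       c = Gamma(a+b) / (Gamma(a) Gamma(b) Gamma(a+b+n))."""
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b) * math.gamma(a + b + n))
    return c * math.comb(n, x) * math.gamma(a + x) * math.gamma(b + n - x)

total = sum(binbet_pmf(x, 2.0, 3.0, 10) for x in range(11))
```

The sum equals 1 up to floating-point error, confirming the stated constant c.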
Probability laws for discrete random variables (cont.)

Negative Binomial-Beta  NegBinBet(x|α, β, r) = c C(r + x − 1, r − 1) Γ(β + x) / Γ(α + β + r + x),
  c = Γ(α + β) Γ(α + r) / (Γ(α) Γ(β)),
  α, β > 0,  r = 1, 2, ...,  x = 0, 1, 2, ...

Poisson-Gamma  PnGam(x|α, β, n) = c (Γ(α + x) / x!) n^x / (β + n)^{α+x},  c = β^α / Γ(α),
  α, β > 0,  n = 0, 1, 2, ...,  x = 0, 1, 2, ...

Composite Poisson  Pnc(x|λ, µ) = exp[−λ] ∑_{n=0}^∞ ((nµ)^x exp[−nµ] / x!) (λ^n / n!),
  λ, µ > 0,  x = 0, 1, 2, ...

Geometric  Geo(x|θ) = c (1 − θ)^{x−1},  c = θ,
  0 < θ < 1,  x = 1, 2, ...

Pascal  Pas(x|m, θ) = C(x − 1, m − 1) θ^m (1 − θ)^{x−m},
  m > 0,  0 < θ < 1,  x = m, m + 1, ...
Probability laws for real random variables

Beta  Bet(x|α, β) = c x^{α−1}(1 − x)^{β−1},  c = Γ(α + β)/(Γ(α)Γ(β)),
  α, β > 0,  0 < x < 1

Gamma  Gam(x|α, β) = c x^{α−1} exp[−βx],  c = β^α/Γ(α),
  α, β > 0,  x > 0

Inverse Gamma  IGam(x|α, β) = c x^{−(α+1)} exp[−β/x],  c = β^α/Γ(α),
  α, β > 0,  x > 0

Gamma-Gamma  GamGam(x|α, β, n) = c x^{n−1}/(β + x)^{α+n},  c = (β^α/Γ(α)) (Γ(α + n)/Γ(n)),
  α, β > 0,  n = 1, 2, ...,  x > 0

Pareto  Par(x|α, β) = c x^{−(α+1)},  c = αβ^α,
  α, β > 0,  x ≥ β

Normal  N(x|µ, λ) = c exp[−(λ/2)(x − µ)²],  c = (λ/(2π))^{1/2},
  µ ∈ IR,  λ > 0,  x ∈ IR

Normal  N(x|µ, σ) = c exp[−(x − µ)²/(2σ²)],  c = 1/√(2πσ²),
  µ ∈ IR,  σ > 0,  x ∈ IR
Probability laws for real random variables (cont.)

Logistic  Lo(x|α, β) = c exp[−β^{-1}(x − α)] / (1 + exp[−β^{-1}(x − α)])²,  c = β^{-1},
  α ∈ IR,  β > 0,  x ∈ IR

Student (t)  St(x|µ, λ, α) = c [1 + (λ/α)(x − µ)²]^{−(α+1)/2},
  c = (Γ((α + 1)/2)/(Γ(α/2)Γ(1/2))) (λ/α)^{1/2},
  µ ∈ IR,  λ, α > 0,  x ∈ IR

Fisher-Snedecor  FS(x|α, β) = c x^{α/2−1}/(β + αx)^{(α+β)/2},
  c = (Γ((α + β)/2)/(Γ(α/2)Γ(β/2))) α^{α/2} β^{β/2},
  α, β > 0,  x > 0

Uniform  Uni(x|θ_1, θ_2) = c,  c = 1/(θ_2 − θ_1),
  θ_2 > θ_1,  θ_1 < x < θ_2

Exponential  Ex(x|λ) = c exp[−λx],  c = λ,
  λ > 0,  x > 0

Inverse Gamma-1/2  IGam_{1/2}(x|α, β) = c x^{−(2α+1)} exp[−β/x²],  c = 2β^α/Γ(α),
  α, β > 0,  x > 0

Inverse Pareto  IPar(x|α, β) = c x^{α−1},  c = αβ^α,
  α, β > 0,  0 < x < β^{-1}
Probability laws for real random variables (cont.)

Cauchy  Cau(x|λ) = (1/(πλ)) / (1 + (x/λ)²),
  λ > 0,  x ∈ IR

Rayleigh  Ray(x|θ) = c x exp[−x²/θ],  c = 2/θ,
  θ > 0,  x > 0

Log-Normal  LogN(x|µ, Λ) = c exp[−(ln x − µ)²/(2Λ²)],  c = 1/(Λ√(2π) x),
  µ ∈ IR,  Λ > 0,  x > 0

Generalized Normal  Ngen(x|α, β) = c x^{α−1} exp[−βx²],  c = 2β^{α/2}/Γ(α/2),
  α, β > 0,  x > 0

Weibull  Wei(x|α, β) = c x^{α−1} exp[−x^α/β],  c = α/β,
  α, β > 0,  x > 0

Double Exponential  Exd(x|λ) = c exp[−λ|x|],  c = λ/2,
  λ > 0,  x ∈ IR

Truncated Exponential  Ext(x|λ) = exp[−(x − λ)],
  λ > 0,  x > λ
Probability laws for real random variables (cont.)
Khi Chi (x|n) = c xn/2−1 exp
[−x
2
],
c =12
n/2
Γ(n/2),
n > 0, x > 0
Khi-squared Chi2 (x|n) = c xn/2−1 exp
[−x
2
],
c =12
n/2
Γ(n/2),
n > 0, x > 0
Non centered Khi-squared Chi2 (x|ν, λ) =∞∑
i=0
Pn
(x|λ
2
)χ2 (x|ν + 2i) ,
ν, λ > 0,x > 0
Inverse Khi IChi (x|ν) = c x−(ν/2+1) exp
[− 1
2x2
],
c =12
ν/2
Γ(ν/2,
ν > 0, x > 0
Probability laws for real random variables (cont.)
Generalized Exponentialwith one parameters
Exf (x|a, g, φ, h, θ) = a(x) g(θ) exp [h(θ)φ(x)] ,
Generalized Exponentialwith K parameters
Exfk (x|a, g,φ,h,θ) = a(x) g(θ) exp
[K∑
k=1
hk(θ)φk(x)
],
Probability laws for two real random variables

Normal-Gamma  NGam(x, y|µ, λ, α, β) = N(x|µ, λy) Gam(y|α, β),
  µ ∈ IR,  λ, α, β > 0,  x ∈ IR,  y > 0

Bi-variate Pareto  Par2(x, y|α, β_0, β_1) = c (y − x)^{−(α+2)},  c = α(α + 1)(β_1 − β_0)^α,
  (β_0, β_1) ∈ IR²,  β_0 < β_1,  α > 0,
  (x, y) ∈ IR²,  x < β_0,  y > β_1
Probability laws with n discrete variables

Multinomial  Mu_k(x|θ, n) = (n!/∏_{l=1}^{k+1} x_l!) ∏_{l=1}^{k+1} θ_l^{x_l},
  x_{k+1} = n − ∑_{l=1}^k x_l,  θ_{k+1} = 1 − ∑_{l=1}^k θ_l,
  0 < θ_l < 1,  ∑ θ_l < 1,  n = 1, 2, ...,  x_l = 0, 1, 2, ...,  ∑ x_l ≤ n

Dirichlet  Di_k(x|α) = c ∏_{l=1}^{k+1} x_l^{α_l−1},  c = Γ(∑_{l=1}^{k+1} α_l)/∏_{l=1}^{k+1} Γ(α_l),
  α_l > 0,  l = 1, ..., k + 1,
  0 < x_l < 1,  l = 1, ..., k + 1,  x_{k+1} = 1 − ∑_{l=1}^k x_l

Multinomial-Dirichlet  MuDi_k(x|α, n) = c ∏_{l=1}^{k+1} α_l^{[x_l]}/x_l!,
  c = n!/(∑_{l=1}^{k+1} α_l)^{[n]},  α^{[s]} = ∏_{l=1}^s (α + l − 1),  x_{k+1} = n − ∑_{l=1}^k x_l,
  α_l > 0,  n = 1, 2, ...,  x_l = 0, 1, 2, ...,  ∑_{l=1}^k x_l ≤ n
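The Dirichlet normalization constant can be checked in the simplest case k = 1, where Di_1(x|α_1, α_2) reduces to the Beta density c x^{α_1−1}(1 − x)^{α_2−1}. A quick quadrature sketch (the parameter values are illustrative):

```python
import math

a1, a2 = 2.5, 4.0
# Di_1 constant: Gamma(a1 + a2) / (Gamma(a1) * Gamma(a2)), i.e. 1/B(a1, a2)
c = math.gamma(a1 + a2) / (math.gamma(a1) * math.gamma(a2))

# midpoint-rule integral of the unnormalized density over (0, 1)
m = 200_000
integral = sum(((i + 0.5) / m) ** (a1 - 1) * (1.0 - (i + 0.5) / m) ** (a2 - 1)
               for i in range(m)) / m
```

The product c * integral is 1 up to quadrature error, confirming that c is the reciprocal of the Beta function B(α_1, α_2).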
Probability laws for n real variables

Canonical Exponential  Exf_n(x|b, φ, h, θ)
Generalized Exponential  Exf_n(x|a, g, φ, h, θ)

Normal  N_k(x|µ, Λ) = c exp[−(1/2)(x − µ)^t Λ (x − µ)],  c = |Λ|^{1/2} (2π)^{−k/2},
  µ ∈ IR^k,  Λ > 0,  x ∈ IR^k

Normal  N_k(x|µ, Σ) = c exp[−(1/2)(x − µ)^t Σ^{-1} (x − µ)],  c = |Σ|^{−1/2} (2π)^{−k/2},
  µ ∈ IR^k,  Σ > 0,  x ∈ IR^k

Student  St_k(x|µ, Λ, α) = c [1 + (1/α)(x − µ)^t Λ (x − µ)]^{−(α+k)/2},
  c = (Γ((α + k)/2)/(Γ(α/2)(απ)^{k/2})) |Λ|^{1/2},
  µ ∈ IR^k,  Λ > 0,  α > 0,  x ∈ IR^k

Wishart  Wi_k(X|α, Λ) = c |X|^{α−(k+1)/2} exp[−tr(ΛX)],  c = |Λ|^α/Γ_k(α),
  Λ a matrix of dimensions k × k,
  X a symmetric p.d. matrix of dimensions k × k,  X_{i,j} = X_{j,i},  i, j = 1, ..., k,
  2α > k − 1

Probability laws for n + 1 real variables

Normal-Gamma  NGam_k(x, y|µ, Λ, α, β) = N_k(x|µ, yΛ) Gam(y|α, β),
  µ ∈ IR^k,  Λ > 0,  α, β > 0,  x ∈ IR^k,  y > 0

Normal-Wishart  NWi_k(x, Y|µ, λ, α, B) = N_k(x|µ, λY) Wi_k(Y|α, B),
  µ ∈ IR^k,  λ > 0,  2α > k − 1,  B > 0,
  x ∈ IR^k,  Y_{i,j} = Y_{j,i},  i, j = 1, ..., k
Link between different distributions

Bin(x|θ, 1) = Ber(x|θ)
NegBin(x|θ, 1) = Geo(x|θ) = Pas(x|1, θ)
BinBet(x|1, 1, n) = Unid(x|n) = 1/(n + 1),  x = 0, 1, ..., n
Bet(x|1, 1) = Uni(x|0, 1)
Gam(x|1, β) = Ex(x|β)
Gam(x|α, 1) = Erl(x|α)
Gam(x|ν/2, 1/2) = Chi2(x|ν)
IGam(x|ν/2, 1/2) = IChi2(x|ν)
St(x|µ, λ, 1) = Cau(x|µ, λ)
Mu_1(x|θ, n) = Bin(x|θ, n)
Di_1(x|α_1, α_2) = Bet(x|α_1, α_2)
Wi_1(x|α, β) = Gam(x|α, β)
St_1(x|µ, λ, α) = St(x|µ, λ, α)
N_1(x|µ, λ) = N(x|µ, λ)
BinBet(x|α, β, n) = ∫_0^1 Bin(x|θ, n) Bet(θ|α, β) dθ
NegBinBet(x|α, β, r) = ∫_0^1 NegBin(x|θ, r) Bet(θ|α, β) dθ
PnGam(x|α, β, n) = ∫_0^∞ Pn(x|nλ) Gam(λ|α, β) dλ
Par(x|α, β) = ∫_0^∞ Ex(x − β|λ) Gam(λ|α, β) dλ
St(x|µ, λ, α) = ∫_0^∞ N(x|µ, λy) Gam(y|α/2, α/2) dy
Chi2nc(x|ν, λ) = ∑_{i=0}^∞ Pn(i|λ/2) Chi2(x|ν + 2i)
lim_{α→∞} St(x|µ, λ, α) = N(x|µ, λ)
St(x|0, 1, α) = St(x|α), the standard Student distribution
If x ∼ Gam(x|α, β) then y = 1/x ∼ IGam(y|α, β).
If x_i ∼ Gam(x_i|α_i, β) then y = ∑_{i=1}^n x_i ∼ Gam(y|∑_{i=1}^n α_i, β).
If x ∼ N(x|0, 1) and y ∼ Chi2(y|ν) then z = x/√(y/ν) ∼ St(z|0, 1, ν).
If x ∼ Chi2(x|ν_1) and y ∼ Chi2(y|ν_2) then z = (x/ν_1)/(y/ν_2) ∼ FS(z|ν_1, ν_2).
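These identities can be verified pointwise from the densities in this summary. For instance Gam(x|ν/2, 1/2) = Chi2(x|ν), using both formulas exactly as tabulated (evaluation points chosen arbitrarily):

```python
import math

def gam_pdf(x, a, b):
    # Gam(x|alpha, beta) = beta^alpha / Gamma(alpha) * x^(alpha-1) * exp(-beta x)
    return b ** a / math.gamma(a) * x ** (a - 1) * math.exp(-b * x)

def chi2_pdf(x, nu):
    # Chi2(x|nu) = (1/2)^(nu/2) / Gamma(nu/2) * x^(nu/2-1) * exp(-x/2)
    return 0.5 ** (nu / 2) / math.gamma(nu / 2) * x ** (nu / 2 - 1) * math.exp(-x / 2)

max_gap = max(abs(gam_pdf(x, 2.5, 0.5) - chi2_pdf(x, 5))
              for x in (0.5, 1.0, 3.0, 7.5))
```

The two densities coincide, as the identity requires.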
Family of invariant location-scale distributions

p(x|µ, β) = (1/β) f(t) with t = (x − µ)/β;  Var[x] = β² v;  −∫ p(x) ln p(x) dx = log β + h.

Family        f(t)                               v        h
Normal        (2π)^{−1/2} exp[−t²/2]             1        (1/2) log(2πe)
Gumbel        exp[−t] exp[−exp[−t]]              π²/6     1 + γ
Laplace       (1/2) exp[−|t|]                    2        1 + log 2
Logistic      exp[−t]/(1 + exp[−t])²             π²/3     2
Exponential   exp[−t],  x > 0                    1        1
Uniform       1,  µ − β/2 < x < µ + β/2          1/12     0
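The v and h columns can be checked numerically. The sketch below verifies the Laplace row (f(t) = e^{−|t|}/2, v = 2, h = 1 + log 2) with a midpoint-rule quadrature; the truncation interval and grid size are illustrative choices.

```python
import math

def f(t):
    # standard Laplace density
    return 0.5 * math.exp(-abs(t))

m, L = 200_000, 40.0                 # midpoint rule on [-L, L]
dt = 2 * L / m
ts = [-L + (i + 0.5) * dt for i in range(m)]
v = sum(t * t * f(t) for t in ts) * dt            # variance (mean is 0)
h = -sum(f(t) * math.log(f(t)) for t in ts) * dt  # differential entropy
```

The quadrature reproduces v = 2 and h = 1 + log 2 ≈ 1.6931 to high accuracy.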
Family of invariant shape-scale distributions

p(x|α, β) = (1/β) f(t; α) with t = x/β;  Var[x] = β² v(α);  −∫ p(x) ln p(x) dx = log β + h(α).

Generalized Gaussian (α = 2: Rayleigh; α = 3: Maxwell-Boltzmann; α = ν: khi)
  f(t) = (2/Γ(α/2)) t^{α−1} exp[−t²]
  v = α/2 − Γ²((α + 1)/2)/Γ²(α/2)
  h = log[Γ(α/2)/2] + ((1 − α)/2) ψ(α/2) + α/2

Inverse Generalized Gaussian (α = ν, β = √2: inverse khi)
  f(t) = (2/Γ(α/2)) t^{−α−1} exp[−t^{−2}]
  v = 2/(α − 2) − Γ²((α − 1)/2)/Γ²(α/2)
  h = log[Γ(α/2)/2] − ((1 + α)/2) ψ(α/2) + α/2

Gamma (α = 1: Exponential; α = ν/2, β = 2: khi-squared; α = ν: Erlang)
  f(t) = (1/Γ(α)) t^{α−1} exp[−t]
  v = α
  h = log[Γ(α)] + (1 − α) ψ(α) + α

Inverse Gamma (α = ν/2, β = 1/2: khi-squared inverse)
  f(t) = (1/Γ(α)) t^{−α−1} exp[−t^{−1}]
  v = 1/((α − 1)²(α − 2)),  α > 2
  h = log[Γ(α)] − (1 + α) ψ(α) + α

Pareto
  f(t) = α t^{−α−1},  t > 1
  v = α/((α − 1)²(α − 2)),  α > 2
  h = 1/α − log α + 1

Weibull
  f(t) = α t^{α−1} exp[−t^α]
  v = Γ(1 + 2/α) − Γ²(1 + 1/α)
  h = γ (α − 1)/α − log α + 1
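As with the location-scale table, these entries can be checked by quadrature. The sketch below verifies the Gamma row for α = 3: v = α and h = log Γ(α) + (1 − α)ψ(α) + α, with the digamma ψ approximated by a central difference of lgamma (grid and interval sizes are illustrative).

```python
import math

alpha = 3.0

def f(t):
    # Gamma(alpha) density with unit scale
    return t ** (alpha - 1) * math.exp(-t) / math.gamma(alpha)

m, L = 200_000, 50.0                 # midpoint rule on (0, L]
dt = L / m
ts = [(i + 0.5) * dt for i in range(m)]
mean = sum(t * f(t) for t in ts) * dt
var = sum((t - mean) ** 2 * f(t) for t in ts) * dt
ent = -sum(f(t) * math.log(f(t)) for t in ts) * dt

eps = 1e-5                            # central-difference digamma
psi = (math.lgamma(alpha + eps) - math.lgamma(alpha - eps)) / (2 * eps)
h_formula = math.lgamma(alpha) + (1 - alpha) * psi + alpha
```

The numerical variance matches v = 3 and the numerical entropy matches the tabulated closed form.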
Appendix A
Exercises
A.1 Exercise 1: Signal detection and parameter estimation
Assume z_i = s_i + e_i, where s_i = a cos(ωt_i + φ) is the transmitted signal, e_i is the noise and z_i is the received signal. Assume that we have received n independent samples z_i, i = 1, ..., n. Assume also that ∑_{i=1}^n z_i = 0.
1. Assume that e_i ∼ N(0, θ) and that s_i is perfectly known. Design a Bayesian optimal detector with the uniform prior and the uniform cost coefficients.
2. Repeat (1) with the assumption that e_i ∼ (1/(2θ)) exp[−|e_i|/θ], θ > 0.
3. Repeat (1) but assume now that θ is unknown and distributed as θ ∼ π(θ) ∝ θ^α.
4. Assume that ei ∼ N (0, θ), but now a is not known.
• Give the expression of the ML estimate aML(z) of a.
• Give the expressions of the MAP estimate aMAP(z) and the Bayes optimal estimate aB(z) if we assume that a ∼ N(0, 1).
• Give the expressions of the MAP estimate aMAP(z) and the Bayes optimal estimate aB(z) if we assume that a ∼ (1/(2α)) exp[−|a|/α], α > 0.
5. Repeat (4) but assume now that θ is unknown and distributed as θ ∼ π(θ) ∝ θ^α.
6. Assume that ei ∼ N (0, θ) with known θ and that a is known, but now ω is unknown.
• Give the expression of the ML estimate ωML(z) of ω.
• Give the expressions of the MAP estimate ωMAP(z) and the Bayes optimal estimate ωB(z), if we assume that ω ∼ Uni(0, 1).
7. Assume that e_i ∼ N(0, θ) with known θ and that a and ω are known, but now φ is unknown.
• Give the expression of the ML estimate φML(z) of φ.
• Give the expressions of the MAP estimate φMAP(z) and the Bayes optimal estimate φB(z), if we assume that φ ∼ Uni(0, 2π).
8. Assume that ei ∼ N (0, θ) with known θ, but now both a and ω are unknown.
• Give the expressions of the ML estimates aML(z) of a and ωML(z) of ω.
• Give the expressions of the Bayes optimal estimates aB(z) of a and ωB(z) of ω if we assume that a ∼ N(0, 1) and ω ∼ Uni(0, 1).
• Give the expressions of the MAP estimates aMAP (z) of a and ωMAP (z) of ω ifwe assume that a ∼ N (0, 1) and ω ∼ Uni(0, 1).
9. Assume now that ω is known and we know that a can only take the values {−1, 0, +1}. Design an optimal detector for a.
10. Assume now that

s_i = ∑_{k=1}^K a_k cos(ω_k t + φ_k),  ω_k ≠ ω_l, ∀ k ≠ l
• Give the expressions of the ML estimates ak(z) assuming ωk and φk known.
• Give the expressions of the ML estimates ωk(z) assuming ak and φk known.
• Give the expressions of the ML estimates φk(z) assuming ak and ωk known.
• Give the expressions of the joint ML estimates ak(z) and ωk(z) assuming φk
unknown.
• Discuss the possibility of the joint estimation of ak(z), ωk(z) and φk(z).
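As a sketch of what item 4 is after (not an official solution): with e_i ∼ N(0, θ) the log-likelihood is quadratic in a, so a_ML = ∑ z_i c_i / ∑ c_i² with c_i = cos(ωt_i + φ), and the Gaussian prior a ∼ N(0, 1) only adds θ to the denominator, a_MAP = ∑ z_i c_i / (θ + ∑ c_i²). The values below are synthetic, noiseless data used purely for illustration.

```python
import math

def amplitude_estimates(z, t, w, phi, theta):
    """ML and MAP (a ~ N(0, 1)) amplitude estimates for z_i = a*cos(w*t_i + phi) + e_i."""
    c = [math.cos(w * ti + phi) for ti in t]
    num = sum(zi * ci for zi, ci in zip(z, c))
    den = sum(ci * ci for ci in c)
    return num / den, num / (theta + den)   # (a_ML, a_MAP)

t = [i * 0.1 for i in range(50)]
z = [2.0 * math.cos(3.0 * ti + 0.5) for ti in t]   # true amplitude a = 2, no noise
a_ml, a_map = amplitude_estimates(z, t, 3.0, 0.5, theta=1.0)
```

On noiseless data a_ML recovers a = 2 exactly, while a_MAP is shrunk toward the prior mean 0, which illustrates the role of the prior in item 4.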
A.2 Exercise 2: Discrete deconvolution
Assume zi = si + ei where
s_i = ∑_{k=0}^p h_k x(i − k)
and where
• h = [h0, h1, . . . , hp]t represents the finite impulse response of a channel,
• x = [x(0), . . . , x(n)]t the finite input sequence,
• z = [z(p), . . . , z(p + n)]t the received signal, and
• e = [e(p), . . . , e(p + n)]t the channel noise.
1. Construct the matrices H and X in such a way that z = Hx + e and z = Xh + e.
2. Assume that e_i ∼ N(0, θ) and that we know perfectly h and the input sequence x. Design a Bayesian optimal detector with the uniform prior and the uniform cost coefficients.
3. Repeat (2) with the assumption that e_i ∼ (1/(2θ)) exp[−|e_i|/θ], θ > 0.
4. Repeat (2) but assume now that θ is unknown and distributed as θ ∼ π(θ) ∝ θ^α.
5. Assume that ei ∼ N (0, θ), but now x is unknown.
• Give the expression of the ML estimate xML(z) of x.
• Give the expressions of the MAP estimate xMAP(z) and the Bayes optimal estimate xB(z) if we assume that x ∼ N(0, I).
6. Repeat (5) but assume now that θ is unknown and distributed as θ ∼ π(θ) ∝ θ^α.
7. Assume that ei ∼ N (0, θ) with known θ, and that x is known but h is unknown.
• Give the expression of the ML estimate hML(z) of h.
• Give the expressions of the MAP estimate hMAP(z) and the Bayes optimal estimate hB(z) if we assume that h ∼ N(0, I).
8. Repeat (7) but assume now that θ is unknown and distributed as θ ∼ π(θ) ∝ θ^α.
9. Assume that ei ∼ N (0, θ) with known θ, but now both h and x are unknown.
• Give the expressions of the ML estimates hML(z) of h and xML(z) of x.
• Give the expressions of the MAP and the Bayes optimal estimates xMAP(z) and xB(z) of x and hMAP(z) and hB(z) of h if we assume that x ∼ N(0, I) and h ∼ N(0, I).

• Give the expressions of the MAP and the Bayes optimal estimates xMAP(z) and xB(z) of x and hMAP(z) and hB(z) of h if we assume that x ∼ N(0, σ_x² D_x^t D_x) and h ∼ N(0, σ_h² D_h^t D_h).
10. Assume now that h is known and we know that x(k) can only take the values {0, 1}.
• First assume that the x(k) are independent, with π0 = P(x(k) = 0) and π1 = P(x(k) = 1) = 1 − π0, π0 known. Design an optimal detector for x(k).
• Now assume x(k) can be modelled as a first-order Markov chain and that we know the transition probabilities

π00 = P(x(k + 1) = 0 | x(k) = 0),  π01 = P(x(k + 1) = 1 | x(k) = 0) = 1 − π00,
π11 = P(x(k + 1) = 1 | x(k) = 1),  π10 = P(x(k + 1) = 0 | x(k) = 1) = 1 − π11.
Design an optimal detector for x(k).
• Repeat these two last items, assuming now that h is unknown and h ∼ N(0, σ_h² D_h^t D_h).
List of Figures
2.1 Location testing with Gaussian errors, uniform costs and equal priors. . . . 19
2.2 Illustration of minimax rule. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Partition of the observation space and deterministic or probabilistic decisionrules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Two decision rules δ(1) and δ(2) and their respective risk functions. . . . . 34
3.3 Two decision rules δ(1) and δ(2) and their respective power functions. . . . 35
4.1 General structure of a Bayesian optimal detector. . . . . . . . . . . . . . . . 48
4.2 Simplified structure of a Bayesian optimal detector. . . . . . . . . . . . . . . 48
4.3 General Bayesian detector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Bayesian detector in the case of i.i.d. data. . . . . . . . . . . . . . . . . . . 49
4.5 General Bayesian detector in the case of i.i.d. Gaussian data. . . . . . . . . 50
4.6 Simplified Bayesian detector in the case of i.i.d. Gaussian data. . . . . . . . 50
4.7 Bayesian detector in the case of i.i.d. Laplacian data. . . . . . . . . . . . . . 51
4.8 A binary channel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.1 The structure of the optimal detector for an i.i.d. noise model. . . . . . . . 59
5.2 The structure of the optimal detector for an i.i.d. Gaussian noise model. . . 60
5.3 The simplified structure of the optimal detector for an i.i.d. Gaussian noisemodel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4 Bayesian detector in the case of i.i.d. Laplacian data. . . . . . . . . . . . . . 60
5.5 The structure of a locally optimal detector for an i.i.d. noise model. . . . . 61
5.6 The structure of a locally optimal detector for an i.i.d. Gaussian noise model. 62
5.7 The structure of a locally optimal detector for an i.i.d. Laplacian noise model. 62
5.8 The structure of a locally optimal detector for an i.i.d. Cauchy noise model. 62
5.9 Coherent detector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.10 Sequential detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.11 Stopping rule in sequential detection. . . . . . . . . . . . . . . . . . . . . . . 71
5.12 Stopping rule in SPART (a, b). . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.1 Curve fitting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Line fitting: model 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.3 Line fitting: model 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4 Line fitting: model 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5 Line fitting: models 1 and 2. . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Bibliography
Index
Admissible decision rules, 29
AR, 116
ARMA, 116
Bayesian binary hypothesis testing, 16
Bayesian composite hypothesis testing, 55
Bayesian estimation, 117
Bayesian MSE estimate, 104
Bayesian parameter estimation, 75
Bernoulli, 146
Binary channel transmission, 52
Binary composite hypothesis testing, 56
Binary hypothesis testing, 15
Binomial, 147
Causal Wiener-Kolmogorov, 135
Conjugate distributions, 122
Conjugate priors, 120, 122, 123, 145
Curve fitting, 88
Deconvolution, 112, 135
Exponential family, 121, 145
Exponential model, 150
Factorization theorem, 120
False alarm, 28
Fast Kalman filter, 110
Fisher information, 126
Gaussian vector, 86
Group invariance, 117
Inference, 145
Information inequality, 126
Invariance principles, 117
Invariant Bayesian estimate, 119
Inverse problems, 166
Joint probability distribution, 12
Jointly Gaussian, 86
Kalman filter, 112
Kalman filtering, 103
Laplacian noise, 50, 60
Linear estimation, 129
Linear Mean Square (LMS), 104
Linear models, 87
Linear regression, 165
Locally most powerful (LMP) test, 57
Locally optimal detectors, 61
Location estimation, 18
MA, 116
Maximum A Posteriori (MAP), 104
Maximum A Posteriori (MAP) estimation, 77, 85
Memoryless stochastic process, 13
Minimal sufficiency, 120
Minimax binary hypothesis testing, 22
Minimum-Mean-Absolute-Error, 76, 84
Minimum-Mean-Squared-Error, 76, 84
Miss, 28
Multi-variable Normal, 159–164
Multinomial, 158
Natural exponential family, 123
Negative Binomial, 149
Neyman-Pearson hypothesis testing, 23
Noise filtering, 134
Non-causal Wiener-Kolmogorov filtering, 133
Non-informative priors, 126, 127
Non-parametric description of a stochastic process, 13
Normal, 152–157
One-step prediction, 131
Optimal detectors, 55
Optimization, 44
Orthogonality principle, 132
Parametrically well-known stochastic process, 13
Performance analysis, 66
Poisson, 148
Prediction, 131
Prediction-Correction, 106
Prior law, 117
Probabilistic description, 9
Probability density, 12
Probability distribution, 9, 12
Probability spaces, 11
Radar, 108
Random variable, 9, 12
Random vector, 12
Rational spectra, 139
Robust detection, 74
Sequential detection, 68, 70
Signal detection, 55
Signal estimation, 101
SPART, 71
Stationary stochastic process, 13
Statistical inference, 9
Stochastic process, 9
Stopping rule, 29, 70
Sufficient statistics, 120
Time-invariant, 107
Track-While-Scan (TWS) Radar, 108
Uniform, 151
Uniform most powerful (UMP) test, 57
Wald-Wolfowitz theorem, 71
Wide-Sense Markov sequences, 141
Wiener-Kolmogorov filtering, 133
Wald-Wolfowitz theorem, 71Wide-Sense Markov sequences, 141