estimation

4.1 Outline

In this module, the principle of maximum likelihood estimation is discussed. It is the most popular approach for obtaining practical estimators. The maximum likelihood approach is very desirable in situations where the MVUE does not exist or cannot be found even though it does exist. The attractive feature of the maximum likelihood estimator (MLE) is that it can always be found by following a definite procedure, which allows it to be used for complex estimation problems. Additionally, the MLE is asymptotically optimal for large data records. The salient topics discussed in this module are:
• Basic Procedure of MLE
• MLE for Transformed Parameters
• MLE for General Linear Model
• Asymptotic Property of MLE

4.2.1 Basic Procedure of MLE

In some cases the MVUE may not exist or cannot be found by any of the methods discussed so far. The maximum likelihood estimation (MLE) approach is an alternative method applicable whenever the PDF or PMF of the data is known. This PDF or PMF involves the unknown parameter θ and, viewed as a function of θ for the observed data, is called the likelihood function. With MLE the unknown parameter is estimated by maximizing the likelihood function for the observed data. The MLE is defined as:

where x is the vector of observed data (of N samples). It can be shown that the MLE θ̂ is asymptotically unbiased:

and asymptotically efficient:

An important result is that if an MVUE exists, then the MLE procedure will produce it.

Proof
Assume the scalar parameter case. If an MVUE exists, then the derivative of the log-likelihood function can be factorized as

where g(x) is the MVUE and I(θ) is the Fisher information. Maximizing the likelihood function by setting this derivative to zero then yields the MLE

Another important observation is that, unlike the previous estimators, the MLE does not require an explicit analytical expression for p(x; θ). Indeed, given only a plot or numerical evaluation of the PDF as a function of θ, one can numerically search for the θ that maximizes it.

4.2.2 Example

Consider the problem of a DC signal embedded in noise:

where w[n] is WGN with zero mean and known variance σ². We know that the MVU estimator for A is the sample mean. To see that this is also the MLE, we consider the PDF:

and maximize the log-likelihood function by setting its derivative with respect to A to zero:

Thus Â = (1/N) Σ x[n], which is the sample mean.
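As a quick numerical illustration of this example (an added sketch, not part of the original notes, with arbitrary values for A and σ²), the following NumPy code compares a brute-force grid maximization of the log-likelihood with the analytical MLE, the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Model assumptions (illustrative values): x[n] = A + w[n], w[n] ~ N(0, sigma2)
A_true, sigma2, N = 1.5, 2.0, 1000
x = A_true + rng.normal(0.0, np.sqrt(sigma2), N)

def log_likelihood(A, x, sigma2):
    # ln p(x; A) for WGN, up to the constant -N/2 * ln(2*pi*sigma2)
    return -np.sum((x - A) ** 2) / (2.0 * sigma2)

# Numerical search over a grid of candidate values of A
grid = np.linspace(-5.0, 5.0, 10001)
A_mle_numeric = grid[np.argmax([log_likelihood(A, x, sigma2) for A in grid])]

# Analytical MLE: the sample mean
A_mle_analytic = np.mean(x)

print(A_mle_numeric, A_mle_analytic)  # both close to A_true
```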

4.2.3 Example

Consider the problem of a DC signal embedded in noise:

where w[n] is WGN with zero mean but unknown variance which is also A; that is, the unknown parameter θ = A manifests itself both as the unknown signal level and as the variance of the noise. Although this is a highly unlikely scenario, this simple example demonstrates the power of the MLE approach, since finding the MVUE by the procedures discussed earlier is not easy. The likelihood function of x is given by:

Now consider p(x; A) as a function of A; it is then a likelihood function, and we need to maximize it with respect to A. For Gaussian PDFs it is easier to find the maximum of the log-likelihood function (since the logarithm is a monotonic function):

On differentiating we have:

Setting the derivative to zero and solving for A produces the MLE:

where we have assumed A > 0.
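Since the closed-form estimator is not reproduced here, a simple way to see the MLE at work is to maximize the log-likelihood of the model x[n] = A + w[n], w[n] ~ N(0, A), numerically; the sketch below (an added illustration with made-up values) does exactly that over a grid of positive A values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: both the mean and the variance of x[n] equal A
A_true, N = 3.0, 5000
x = A_true + rng.normal(0.0, np.sqrt(A_true), N)

def log_likelihood(A, x):
    # ln p(x; A) = -N/2 ln(2*pi*A) - (1/(2A)) * sum (x[n] - A)^2
    N = len(x)
    return -0.5 * N * np.log(2.0 * np.pi * A) - np.sum((x - A) ** 2) / (2.0 * A)

grid = np.linspace(0.1, 10.0, 2000)          # A > 0 is assumed
A_mle = grid[np.argmax([log_likelihood(A, x) for A in grid])]
print(A_mle)                                  # close to A_true for large N
```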

It can be shown that:

and:

4.3.1 MLE for Transformed Parameters

The MLE of the transformed parameter α = g(θ) is given by:

where θ̂ is the MLE of θ. If g is not a one-to-one function (i.e., not invertible), then α̂ is obtained by maximizing the transformed likelihood function pT(x; α), which is defined as:

4.3.2 Example

In this example we demonstrate the finding of the MLE of a transformed parameter. In the context of the previous example, consider two different parameter transformations: (i) α = exp(A) and (ii) α = A².

Case (i): From the previous example, the PDF parameterized by the parameter θ = A can be given as

Since α = exp(A) is a one-to-one transformation of A, the PDF parameterized in terms of the transformed parameter α can be given as

Thus pT(x; α) is the PDF of the data set,

Now, to find the MLE of α, setting the derivative of pT(x; α) with respect to α to zero yields

or

But Â being the MLE of A, we have α̂ = exp(Â). Thus the MLE of the transformed parameter is found by substituting the MLE of the original parameter into the transformation function. This is known as the invariance property of the MLE.

Case (ii): Since α = A², the transformation is not one-to-one in A (both A = +√α and A = -√α map to the same α). If we take only one of the two roots, then some possible PDFs will be missing. To characterize all possible PDFs, we need to consider two sets of PDFs

The MLE of α is the value of α that yields the maximum of pT1(x; α) and pT2(x; α), or

The maximum can be found in two steps:
1. For a given value of α, say α0, determine whether pT1(x; α0) or pT2(x; α0) is larger. If, for example, pT1(x; α0) > pT2(x; α0), then denote the value of pT1(x; α0) as the maximized likelihood at α0. Repeat for all α > 0 to form the maximized likelihood function, which at each α equals the larger of pT1(x; α) and pT2(x; α).
2. The MLE is given as the α that maximizes this function over α > 0.
Thus the MLE is

Again the invariance property holds.
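To make the invariance property concrete, the following added sketch (with assumed values) checks numerically that the MLE of α = exp(A), obtained by maximizing the likelihood directly over α, coincides with exp(Â), where Â is the sample mean from the earlier DC-level example.

```python
import numpy as np

rng = np.random.default_rng(2)

# DC level in WGN with known variance (illustrative values)
A_true, sigma2, N = 0.8, 1.0, 500
x = A_true + rng.normal(0.0, np.sqrt(sigma2), N)

def log_likelihood_A(A):
    return -np.sum((x - A) ** 2) / (2.0 * sigma2)

# MLE of A and the transformed estimate exp(A_hat)
A_hat = np.mean(x)
alpha_via_invariance = np.exp(A_hat)

# Direct maximization over alpha = exp(A), i.e. A = ln(alpha)
alpha_grid = np.exp(np.linspace(-3.0, 3.0, 20001))
alpha_hat_direct = alpha_grid[np.argmax([log_likelihood_A(np.log(a)) for a in alpha_grid])]

print(alpha_via_invariance, alpha_hat_direct)  # agree up to the grid resolution
```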

4.3.3 MLE for General Linear Model

Consider the general linear model of the form:

x = Hθ + w

where H is a known N × p matrix, x is an N × 1 observation vector of N samples, and w is an N × 1 noise vector with PDF N(0, C). The PDF of the observed data is:

and the MLE of θ is found by differentiating the log-likelihood, which can be shown to yield:

which upon simplification and setting to zero becomes:

and this yields the MLE of θ as:

which turns out to be the same as the MVU estimator.
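A short numerical sketch of this estimator is given below (an added illustration with an arbitrary H, C and θ); it implements the standard closed form θ̂ = (HT C⁻¹ H)⁻¹ HT C⁻¹ x for data simulated from the general linear model.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative general linear model: x = H @ theta + w, w ~ N(0, C)
N, p = 200, 3
H = rng.normal(size=(N, p))
theta_true = np.array([1.0, -2.0, 0.5])
C = 0.5 * np.eye(N)                       # noise covariance (white here for simplicity)
w = rng.multivariate_normal(np.zeros(N), C)
x = H @ theta_true + w

# MLE for the general linear model: theta_hat = (H^T C^-1 H)^-1 H^T C^-1 x
Ci = np.linalg.inv(C)
theta_hat = np.linalg.solve(H.T @ Ci @ H, H.T @ Ci @ x)
print(theta_hat)                           # close to theta_true
```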

Lecture 12 : Properties of MLE

4.4.1 Asymptotic Normality Property of MLE

The asymptotic property of the MLE can be stated as follows. If the PDF p(x; θ) of the data x satisfies some regularity conditions, then the MLE of the unknown parameter θ is asymptotically (i.e., for large data records) distributed according to

Proof
In the following, the proof of this important property is outlined for the scalar parameter case. Assume that the observations are IID and that the regularity condition E[∂ ln p(x; θ)/∂θ] = 0 holds. Further assume that the first-order and second-order derivatives of the likelihood function are defined. Before deriving the asymptotic PDF, it is first shown that the MLE is a consistent estimator. For this, we use the Kullback-Leibler information inequality

or

with equality if and only if θ1 = θ2, so that the right-hand side of the above inequality is maximized for θ = θ0. As the data are IID, maximizing the log-likelihood function is equivalent to maximizing

But for N → ∞, this converges to the expected value by the law of large numbers. Hence, if θ0 is the true value of θ, we have

By a continuity argument, the normalized log-likelihood function is also maximized at θ = θ0; that is, as N → ∞, the MLE θ̂ → θ0. Thus the MLE is consistent.

To derive the asymptotic PDF of the MLE, using the mean value theorem we have

where θ̄ is a point between θ̂ and θ0. But by the definition of the MLE, the left-hand side of the above relation is zero, so that

Now considering √N(θ̂ - θ0), from the above relation we have

Due to the IID assumption,

Since θ̂ → θ0, we must also have θ̄ → θ0 due to the consistency of the MLE. Hence

where the last convergence is due to the law of large numbers and i(θ0) denotes the Fisher information for a single sample. Also, the numerator term is

Now let ξn denote the summand ∂ ln p(x[n]; θ0)/∂θ, a random variable that is a function of x[n]. Additionally, since the x[n]'s are IID, so are the ξn's. By the central limit theorem, the numerator term has a PDF that converges to a Gaussian with mean

and the variance

due to the independence of the random variables. We now apply Slutsky's theorem, which says that if a sequence of random variables xn has the asymptotic PDF of the random variable x, and the sequence of random variables yn converges to a constant c, then xn/yn has the same asymptotic PDF as the random variable x/c. Thus in this case

So that

or equivalently

or finally

Thus the distribution of the MLE of a parameter is asymptotically normal, with mean equal to the true value of the parameter and variance equal to the inverse of the Fisher information.
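The following Monte Carlo sketch (an added illustration) visualizes this result for the DC level in WGN: over many realizations, the MLE Â (the sample mean) has an empirical mean close to A and an empirical variance close to σ²/N, the inverse of the Fisher information N/σ².

```python
import numpy as np

rng = np.random.default_rng(4)

# DC level in WGN (illustrative values); the MLE of A is the sample mean
A_true, sigma2, N, trials = 1.0, 2.0, 100, 20000
x = A_true + rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
A_mle = x.mean(axis=1)

# Empirical mean/variance vs. the asymptotic values A and 1/I(A) = sigma2/N
print(A_mle.mean(), A_mle.var())
print(A_true, sigma2 / N)
```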

Module 5 : Least Squares Estimator

5.1 Outline

In this module, a class of estimators is introduced which, unlike the optimal (MVU) or asymptotically optimal (MLE) estimators discussed earlier, has no optimality property in general. In this approach no probabilistic assumptions about the data are made; only a signal model is assumed. The least squares estimator (LSE) is determined by minimizing the least squares error and is widely used in practice due to its ease of implementation. The salient topics discussed in this module are:
• Basic Procedure of LSE
• Linear Least Squares
• Geometrical Interpretations of the LS Approach
• Constrained Least Squares

5.2.1 Basic Procedure of LSE

The MVUE, BLUE, and MLE developed previously required an expression for the PDF p(x; θ) in order to estimate the unknown parameter θ in some optimal manner. An alternative approach is to assume a signal model (rather than making probabilistic assumptions about the data) and achieve a design goal assuming this model. With the least squares (LS) approach we assume that the signal model is a function of the unknown parameter θ and produces a signal:

where s(n; θ) is a function of n parameterized by θ. Due to measurement noise and model inaccuracies w[n], only a noisy version x[n] of the true signal s[n] can be observed, as shown in Fig. 5.1.

Figure 5.1: Signal model employed in least squares estimation.

Unlike the previous approaches, no assumption is made about the probability distribution of w[n]. We only state that what we observe contains an error e[n] = x[n] - s[n] which, with an appropriate choice of θ, should be minimized in a least-squares sense. Thus we choose θ = θ̂ so that the cost function:

is minimized over the N observation samples of interest, and we call θ̂ the LSE of θ. More precisely we have:

and the minimum LS error is given by:

An important assumption needed to produce a meaningful unbiased estimate is that the noise and model inaccuracies, w[n], have zero mean. However, no other probabilistic assumption about the data is made (i.e., the LSE is valid for both Gaussian and non-Gaussian noise). At the same time, we cannot make any optimality claims for the LSE (as these would depend on the distribution of the noise and modeling errors). A problem that arises from assuming the signal model function s(n; θ) rather than knowledge of p(x; θ) is the need to choose an appropriate signal model. Then again, in order to obtain a closed-form or parametric expression for p(x; θ) one usually needs to know the underlying model and noise characteristics anyway.

5.2.2 Example

Consider observations x[n] arising from a DC-level signal model, s[n] = s(n; θ) = θ:

where θ is the unknown parameter to be estimated. Then we have:

On differentiating with respect to θ and setting the result to zero:

and hence θ̂ = (1/N) Σ x[n], which is the sample mean. We also have:

5.3.1 Linear Least Squares

Although there are no restrictions on the form of the assumed signal model in the LSE, it is often assumed that the signal model is a linear function of the parameters to be estimated:

s = Hθ

where s = [s[0], s[1], s[2], ..., s[N - 1]]T and H is a known N × p matrix, with θ = [θ1, θ2, ..., θp]T. Now

x = Hθ + w

and with x = [x[0], x[1], x[2], ..., x[N - 1]]T we have:

On differentiating and setting to zero:

this yields the required LSE:

which, surprisingly, is identical in functional form to the MVU estimator for the linear model. An interesting extension of linear LS is weighted LS, in which the contribution to the error from each data sample can be weighted in importance by using a different form of the error criterion:

where W is an N × N positive definite (symmetric) weighting matrix. The weighting matrix W is generally diagonal, and its main purpose is to emphasize the contribution of those data samples that are deemed to be more reliable.
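A compact numerical sketch of both estimators is given below (an added illustration with simulated data); it computes the ordinary LSE, which minimizes (x - Hθ)T(x - Hθ), and a weighted LSE minimizing (x - Hθ)TW(x - Hθ), using a diagonal W that down-weights the noisier samples.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated linear model x = H @ theta + w (illustrative values)
N, p = 100, 2
H = np.column_stack([np.ones(N), np.arange(N)])        # DC level + slope
theta_true = np.array([2.0, 0.05])
noise_std = np.where(np.arange(N) < N // 2, 0.1, 1.0)  # second half is noisier
x = H @ theta_true + noise_std * rng.normal(size=N)

# Ordinary LSE: (H^T H)^-1 H^T x
theta_ls = np.linalg.solve(H.T @ H, H.T @ x)

# Weighted LSE: (H^T W H)^-1 H^T W x, with W = diag(1/variance) emphasizing reliable samples
W = np.diag(1.0 / noise_std**2)
theta_wls = np.linalg.solve(H.T @ W @ H, H.T @ W @ x)

print(theta_ls, theta_wls)
```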

5.3.2 Example

Reconsider the acoustic echo cancellation problem discussed earlier, whose signal flow diagram is shown again in Figure 5.2 for ease of reference. We now show how the linear LS approach can be used to estimate the required echo path impulse response in this problem.

Figure 5.2: Acoustic echo cancellation problem

Recall that the output vector at the microphone is expressed as

y = Uh + v

Any linear estimator of h can be written as a linear function of the microphone signal vector y as

Minimization of the least squares (LS) cost function:

would result in the LS estimator:

Recall that for this problem, the earlier derived BLUE is of the form:

Note that the LSE does not take into account the near-end signal characteristics (R = E[vvT]) and therefore, in practice, it is not found to be as effective as the BLUE.

5.4.1 Geometrical Interpretations of the LS Approach

The geometrical perspective of the LS approach helps provide insight into the estimator and also reveals other useful properties. Consider a general linear signal model s = Hθ. Denoting the ith column of H by hi, the signal model can be seen as a linear combination of the signal vectors as

The LS error was defined to be

Defining the Euclidean length of an N × 1 vector ξ as ||ξ|| = (ξTξ)^(1/2), the LS error can also be written as

Figure 5.3: Geometrical visualization of linear least squares in 3-dimensional (R³) space

We now note that the linear LS approach attempts to minimize the square of the distance from the data vector x to a signal vector ŝ, which must be a linear combination of the columns of H. The data vector can lie anywhere in an N-dimensional space, termed R^N, while all possible signal vectors, being linear combinations of p < N vectors, must lie in a p-dimensional subspace of R^N, termed S^p. The assumption that H has full rank ensures that its columns are linearly independent and hence that the subspace is truly p-dimensional. For N = 3 and p = 2 this is illustrated in Figure 5.3. Note that all possible choices of θ1, θ2 (where we assume -∞ < θ1 < ∞ and -∞ < θ2 < ∞) produce signal vectors constrained to lie in the subspace S, and that in general x does not lie in this subspace. It is intuitively obvious that the vector ŝ that lies in S and is closest to x in the Euclidean sense is the component of x in S; in other words, ŝ is the orthogonal projection of x onto S. This makes the error vector x - ŝ orthogonal (perpendicular) to all vectors in S. Two vectors x and y in R^N are defined to be orthogonal if xTy = 0. For the considered example, we can determine the appropriate θ̂ by using the orthogonality condition as

Letting ŝ = θ̂1h1 + θ̂2h2, we have

On combining the two equations and using the matrix form, we have

or

Finally, we get the LSE as

Note that if ε = x - Hθ̂ denotes the error vector, then

The error vector must be orthogonal to the columns of H. This is the well-known orthogonality principle. In effect, the error represents the part of x that cannot be described by the signal model. The minimum LS error Jmin can be given as

Figure 5.4: Effect of non-orthogonality of the columns of the observation matrix H

As illustrated in Figure 5.4(a), if the signal vectors h1 and h2 were orthogonal, then θ̂ could easily be found, because the component of ŝ along h1, namely θ̂1h1, contains no component along h2. If this is not the case, then we have the situation shown in Figure 5.4(b). Making the orthogonality assumption and also assuming that ||h1|| = ||h2|| = 1 (orthonormal vectors), we have

where hiTx is the length of the vector x along hi. In matrix notation this is

so that

This result follows from the orthonormal columns of H, for which HTH = I, and therefore

In general, the columns of H will not be orthogonal, so that the signal vector estimate is obtained as

The signal estimate ŝ is the orthogonal projection of x onto the p-dimensional subspace. The N × N matrix P = H(HTH)-1HT is known as the projection matrix. It is symmetric (PT = P), idempotent (P2 = P), and singular (for independent columns of H, it has rank p).
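A quick numerical check of these properties (an added illustration with a random full-rank H) is sketched below: it forms P = H(HTH)⁻¹HT and verifies symmetry, idempotence and rank p, and confirms that Px equals the LS signal estimate Hθ̂.

```python
import numpy as np

rng = np.random.default_rng(6)

N, p = 8, 3
H = rng.normal(size=(N, p))          # assumed full-rank observation matrix
x = rng.normal(size=N)

# Projection matrix onto the column space of H
P = H @ np.linalg.inv(H.T @ H) @ H.T

print(np.allclose(P, P.T))                    # symmetric
print(np.allclose(P @ P, P))                  # idempotent
print(np.linalg.matrix_rank(P))               # rank p (singular for p < N)

theta_hat = np.linalg.solve(H.T @ H, H.T @ x)
print(np.allclose(P @ x, H @ theta_hat))      # P x is the LS signal estimate
```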

5.5.1 Constrained Least Squares

In some LS estimation problems the unknown parameters are constrained. Suppose that we wish to estimate the amplitudes of a number of signals but it is known a priori that some of the signals have the same amplitude. In this case, the number of parameters can be reduced to take advantage of the prior knowledge. If the parameters are linearly related, this leads to a least squares problem with linear constraints, which can be solved easily.

Consider a least squares estimation problem with parameter θ subject to r < p linear constraints. Assuming the constraints are independent, we can summarize them as

Aθ = b

where A is a known r × p matrix and b is a known r × 1 vector. To find the LSE subject to the constraints we set up a Lagrangian and determine the constrained LSE by minimizing the Lagrangian

where λ is an r × 1 vector of Lagrange multipliers. Taking the derivative of Jc(θ) with respect to θ and setting it to zero produces

where θ̂ is the unconstrained LSE and λ is yet to be determined. To find λ we impose the constraint, so that

and hence

Substituting λ into the earlier expression for θ̂c results in the final estimator

Remark: Note that the constrained LSE is a corrected version of the unconstrained LSE. If the constraint happens to be satisfied by the unconstrained LSE, i.e., Aθ̂ = b, then according to the above relation the LSE and the constrained LSE are identical. Such is usually not the case, however.
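The resulting correction has the standard closed form θ̂c = θ̂ - (HTH)⁻¹AT[A(HTH)⁻¹AT]⁻¹(Aθ̂ - b); the sketch below (an added illustration with arbitrary simulated data) computes it and verifies that the result satisfies Aθ̂c = b.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative data: x = H @ theta + w with the true theta satisfying A @ theta = b
N, p, r = 50, 3, 1
H = rng.normal(size=(N, p))
theta_true = np.array([1.0, 1.0, -0.5])          # first two amplitudes equal
A = np.array([[1.0, -1.0, 0.0]])                 # constraint: theta1 - theta2 = 0
b = np.zeros(r)
x = H @ theta_true + 0.2 * rng.normal(size=N)

# Unconstrained LSE
G = np.linalg.inv(H.T @ H)
theta_u = G @ H.T @ x

# Constrained LSE: correct the unconstrained estimate toward the constraint set
correction = G @ A.T @ np.linalg.solve(A @ G @ A.T, A @ theta_u - b)
theta_c = theta_u - correction

print(theta_u, theta_c, A @ theta_c)             # A @ theta_c is (numerically) b
```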

5.5.2 Example

In this example we explain the effect of constraints on the LSE. Consider the signal model

If we observe {x[0], x[1], x[2]}, find the LSE. The signal vector can be expressed as

The unconstrained LSE and signal estimate can be obtained as

Now assume that it is known a priori that θ1 = θ2. Expressing this constraint in matrix form, we have [1 -1]θ = 0, so that A = [1 -1] and b = 0. Noting that HTH = I, we can obtain the constrained LSE as

With some matrix algebra we can show that the constrained LSE and the corresponding signal estimate are

Since θ1 = θ2, the two corresponding observations are averaged, which is intuitively reasonable. In this simple problem, we can easily incorporate the constraint into the given signal model

Note that the parameter to be estimated reduces to a single scalar. Estimating the unconstrained LSE of this reduced parameter using the reduced signal model would have produced the same result. As with ordinary least squares, it is interesting to view the constrained least squares estimation problem geometrically, as is done in Figure 5.5 in the context of this example.

Figure 5.5: Graphical representation of unconstrained and constrained least squares

Bayesian Estimation

6.1 Outline

In this module, a new class of estimators is introduced. This class departs from the classical approach to statistical estimation, in which the unknown parameter of interest is assumed to be a deterministic but unknown constant. Instead, it is assumed that the unknown parameter is a random variable whose particular realization is to be estimated. This approach makes direct use of Bayes' theorem and so is commonly termed the Bayesian approach. The two main advantages of this approach are the incorporation of prior knowledge about the parameter to be estimated and the provision of an alternative to the MVUE when it cannot be found. First, the minimum mean square error (MMSE) estimator is discussed. It is followed by an introduction to the Bayesian linear model, which allows the MMSE estimator to be found with ease. The salient topics discussed in this module are:
• Minimum mean square error estimator
• Bayesian linear model
• General Bayesian estimators

Lecture 17 : Bayesian Estimation

6.2.1 Minimum Mean Square Error (MMSE) Estimator

The classical approach we have used so far has assumed that the parameter θ is unknown but deterministic. Thus the optimal estimator is optimal irrespective of, and independent of, the actual value of θ. But in cases where the actual value or prior knowledge of θ could be a factor (e.g., where the MVU estimator does not exist for certain values or where prior knowledge would improve the estimation performance), the classical approach would not work effectively. In the Bayesian approach θ is treated as a random variable with a known prior PDF, p(θ). Such prior knowledge concerning the distribution of the parameter should provide better estimators than the deterministic case.

In the classical approach the MVU estimator is derived by first considering minimization of the mean square error, i.e., θ̂ = arg min mse(θ̂), where mse(θ̂) = E[(θ - θ̂)²] and p(x; θ) is the PDF of x parameterized by θ. In the Bayesian approach, the estimator is similarly derived by minimizing θ̂ = arg min Bmse(θ̂), where Bmse(θ̂) = E[(θ - θ̂)²] is the Bayesian mean square error and the expectation is now taken with respect to p(x, θ), the joint PDF of x and θ (since θ is now a random variable). Note that the squared error (θ - θ̂)² is identical in both the Bayesian and the classical MSE. The minimum Bmse(θ̂), or MMSE, estimator is derived by differentiating the expression for Bmse(θ̂) with respect to θ̂ and setting this to zero, which yields θ̂ = E(θ|x), where the posterior PDF, p(θ|x), is given by Bayes' rule. Thus the MMSE estimator is the conditional expectation of the parameter given the observations x.

Apart from the computational (and analytical!) requirements of deriving an expression for the posterior PDF and then evaluating the expectation E(θ|x), there is also the problem of finding an appropriate prior PDF. The usual choice is to assume that the joint PDF, p(x, θ), is Gaussian, and hence both the prior PDF, p(θ), and the posterior PDF, p(θ|x), are also Gaussian (this property implies the Gaussian PDF is a conjugate prior distribution). Thus the form of the PDFs remains the same and all that changes are the means and the variances.
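As a concrete illustration (an added sketch with assumed prior and noise values), the following code computes the MMSE estimate E(A|x) by brute-force numerical integration of the posterior for a DC level with a Gaussian prior, and compares it with the classical sample mean; the Bayesian estimate is pulled toward the prior mean.

```python
import numpy as np

rng = np.random.default_rng(8)

# DC level in WGN with a Gaussian prior on A (illustrative values)
mu_A, var_A = 0.0, 0.25       # prior mean and variance
sigma2, N = 1.0, 10
A_true = rng.normal(mu_A, np.sqrt(var_A))
x = A_true + rng.normal(0.0, np.sqrt(sigma2), N)

# Posterior p(A|x) proportional to p(x|A) p(A), evaluated on a grid
A_grid = np.linspace(-4, 4, 4001)
log_post = (-np.array([np.sum((x - A) ** 2) for A in A_grid]) / (2 * sigma2)
            - (A_grid - mu_A) ** 2 / (2 * var_A))
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, A_grid)

A_mmse = np.trapz(A_grid * post, A_grid)      # MMSE = posterior mean
print(A_mmse, np.mean(x))                     # Bayesian estimate vs. classical sample mean
```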

6.2.2 Example

Consider a signal embedded in noise:

where, as before, w[n] ~ N(0, σ²) is a WGN process and the unknown parameter θ = A is to be estimated. However, in the Bayesian approach we also assume that the parameter A is a random variable with a prior PDF, which in this case is the Gaussian PDF p(A) = N(μA, σA²). We also have p(x|A) = N(A, σ²), and we can assume that A and x are jointly Gaussian. Thus the posterior PDF:

is also a Gaussian PDF, and after the required simplification we have:

and hence the MMSE is:

where α = σA²/(σA² + σ²/N). Upon closer examination of the MMSE estimator we observe the following:
1. With few data (N small), α → 0 and Â → μA; that is, the MMSE estimator tends towards the mean of the prior PDF and effectively ignores the contribution of the data. Also p(A|x) → N(μA, σA²).
2. With large amounts of data (N large), α → 1 and Â → x̄; that is, the MMSE estimator tends towards the sample mean and effectively ignores the contribution of the prior information.

Conditional PDF of a Multivariate Gaussian: If x and y are jointly Gaussian, where x is k × 1 and y is l × 1, with mean vector [E(x)T, E(y)T]T and partitioned covariance matrix

then the conditional PDF, p(y|x), is also Gaussian, and the conditional mean vector and covariance matrix are given by:

This result can be used for MMSE estimation involving a jointly Gaussian parameter vector and data vector.

Lecture 18 : Properties of Bayesian Estimator

6.3.1 Bayesian Linear Model

Now consider the Bayesian linear model:

x = Hθ + w

where θ is the unknown parameter vector to be estimated, with prior PDF N(μθ, Cθ), and w is WGN with PDF N(0, Cw). The MMSE estimator is provided by the expression for E(y|x), where we identify y with θ. We have:

and we can show that:

and hence, since x and θ are jointly Gaussian, we have:

6.3.2 Nuisance Parameters

Suppose that both θ and α are unknown parameters but we are interested only in θ. Then α is a nuisance parameter, which we can deal with by integrating it out of the way. Consider Bayes' rule for the posterior PDF:

Now p(x|θ) is, in reality, p(x|θ, α), but we can obtain the true p(x|θ) by:

and if θ and α are independent, then:

6.3.3 Relation with Classical Estimation

In classical estimation we do not make any assumption about the prior; thus all possible values of θ have to be considered. The equivalent prior would be a flat, non-informative distribution (effectively one of infinite variance). This non-informative prior PDF will yield the classical estimator, where such an estimator is defined.

6.3.4 Example

Consider again the signal embedded in noise problem:

where we have already shown that the MMSE is:

where α = σA²/(σA² + σ²/N). If the prior is non-informative, then σA² → ∞ and α → 1, so that Â → x̄, which is the classical estimator.
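The limiting behaviour is easy to see numerically. The sketch below (an added illustration, assuming the Gaussian-prior update Â = αx̄ + (1 - α)μA with α = σA²/(σA² + σ²/N)) shows the estimate moving from the prior mean toward the sample mean as N grows, and essentially coinciding with the sample mean when the prior variance is made very large.

```python
import numpy as np

rng = np.random.default_rng(9)

mu_A, sigma2, A_true = 5.0, 1.0, 1.0   # prior mean deliberately far from the truth

def mmse_dc(x, mu_A, var_A, sigma2):
    # Gaussian-prior MMSE estimate of a DC level (assumed closed form)
    N = len(x)
    alpha = var_A / (var_A + sigma2 / N)
    return alpha * np.mean(x) + (1.0 - alpha) * mu_A

for N in (1, 10, 100, 10000):
    x = A_true + rng.normal(0.0, np.sqrt(sigma2), N)
    print(N, mmse_dc(x, mu_A, 0.2, sigma2), mmse_dc(x, mu_A, 1e9, sigma2), np.mean(x))
    # small prior variance: pulled toward mu_A; huge prior variance: ~ the sample mean
```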

Lecture 19 : General Bayesian Estimator

6.4.1 General Bayesian Estimators

The Bmse(θ̂), given by:

is one specific case of a general estimator that attempts to minimize the average of a cost function C(ε), that is, the Bayes risk R = E[C(ε)], where ε = θ - θ̂. Figure 6.1 shows plots of three cost functions of wide interest; which central tendency of the posterior PDF is emphasized by each choice of cost function is discussed below:
1. Quadratic: C(ε) = ε², which yields R = Bmse(θ̂). It has already been shown that the estimate minimizing R = Bmse(θ̂) is:

which is the mean of the posterior PDF.
2. Absolute: C(ε) = |ε|. The estimate θ̂ that minimizes R = E[|θ - θ̂|] satisfies:

which is the median of the posterior PDF.
3. Hit-or-miss: C(ε) = 0 for |ε| < δ and C(ε) = 1 for |ε| > δ, where δ is a very small threshold. Note that a uniform penalty is assigned here to all errors greater than δ. The estimate that minimizes the Bayes risk can be shown to be:

which is the mode of the posterior PDF, i.e., the value that maximizes the PDF.

Figure 6.1: Common cost functions used in finding the Bayesian estimator

For a Gaussian posterior PDF, it should be noted that the mean, the median and the mode are identical. Of most interest are the quadratic and hit-or-miss cost functions which, together with a special case of the latter, yield the following three important classes of estimators:
1. MMSE Estimator: the minimum mean square error (MMSE) estimator, which has already been introduced as the mean of the posterior PDF.

2. MAP Estimator: The maximum a posteriori (MAP) estimator which is the mode (or maximum) of the posterior PDF.

3. Bayesian ML Estimator: the Bayesian maximum likelihood estimator, which is the special case of the MAP estimator in which the prior PDF, p(θ), is uniform or non-informative:

Noting that the conditional PDF of x given θ, p(x|θ), is essentially equivalent to the PDF of x parameterized by θ, p(x; θ), the Bayesian ML estimator is equivalent to the classical MLE.

Comparison among the three types of Bayesian estimators: The MMSE estimator is preferred due to its squared-error cost function, but it is also the most difficult to derive and compute, because an expression for the posterior PDF p(θ|x) is needed in order to evaluate the integral defining its mean.

The hit-or-miss cost function used in the MAP estimator, though less precise, makes the estimator much easier to derive, since there is no need to integrate; one only has to find the maximum of the posterior PDF p(θ|x), which can be done either analytically or numerically.

The Bayesian ML estimator is equivalent in preference to the MAP estimator only when the prior is non-informative; otherwise it is a sub-optimal estimator.

As with the classical MLE, an expression for the conditional PDF p(x|θ) is easier to obtain than one for the posterior PDF p(θ|x). Since in most cases knowledge of the prior is not available, it is not surprising that classical MLEs tend to be more prevalent. However, it may not always be prudent to assume that the prior is uniform, especially in cases where prior knowledge about the parameter is available even though its exact PDF is unknown. In these cases a MAP estimate may perform better even if an artificial prior PDF is assumed (e.g., a Gaussian prior, which has the added benefit of yielding a Gaussian posterior).

7.1 Outline

The Bayesian estimators discussed in the previous module are difficult to implement in practice, as they involve multi-dimensional integration for the MMSE estimator and multi-dimensional maximization for the MAP estimator. In general these estimators are very difficult to derive in closed form except under the jointly Gaussian assumption. Whenever the Gaussian assumption is not valid, an alternative approach is required. In this module we introduce Bayesian estimators derived under a linearity constraint, which depend only on the first two moments of the PDF. This approach is analogous to the BLUE in the classical estimation case. These estimators are also termed Wiener filters and find extensive use in practice. The salient topics discussed in this module are:
• Linear Minimum Mean Square Error (LMMSE) Estimator
• Bayesian Gauss-Markov Theorem
• Wiener Filtering and Prediction

7.2.1 Linear Minimum Mean Square Error (LMMSE) Estimator

Assume that the parameter θ is to be estimated based on the data set x = [x[0], x[1], ..., x[N - 1]]T. Rather than assuming any specific form for the joint PDF p(x, θ), we consider the class of all affine estimators of the form:

where a = [a1, a2, ..., aN-1]T. The estimation problem is now to choose the weight coefficients ai to minimize the Bayesian MSE:

The resultant estimator is termed the linear minimum mean square error (LMMSE) estimator. Note that the LMMSE estimator will be sub-optimal unless the MMSE estimator happens to be linear, as it is when θ and x are jointly Gaussian. If the Bayesian linear model is applicable, we can write

x = Hθ + w

The weight coefficients are obtained by setting the derivative of the Bayesian MSE with respect to each ai to zero, for i = 0, 1, ..., N - 1; this yields:

where Cxx is the N × N covariance matrix of x and Cxθ is the N × 1 cross-covariance vector. Thus the LMMSE estimator is:

where we note that Cθx = CxθT. For a vector parameter θ an equivalent expression for the LMMSE estimator is derived as:

and the Bayesian MSE matrix is:

where Cθθ = E[(θ - E(θ))(θ - E(θ))T] is the p × p covariance matrix of θ.
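The vector LMMSE estimator θ̂ = E(θ) + Cθx Cxx⁻¹ (x - E(x)) can be exercised numerically; the sketch below (an added illustration using a Bayesian linear model with arbitrary values) builds the required covariances from H, Cθθ and Cw and compares the estimate with the realized parameter.

```python
import numpy as np

rng = np.random.default_rng(10)

# Bayesian linear model x = H @ theta + w (illustrative sizes and covariances)
N, p = 50, 3
H = rng.normal(size=(N, p))
mu_theta = np.zeros(p)
C_theta = np.diag([1.0, 0.5, 2.0])     # prior covariance of theta
C_w = 0.1 * np.eye(N)                  # noise covariance

theta = rng.multivariate_normal(mu_theta, C_theta)
x = H @ theta + rng.multivariate_normal(np.zeros(N), C_w)

# Covariances implied by the model, then the LMMSE estimate
C_xx = H @ C_theta @ H.T + C_w
C_thetax = C_theta @ H.T
theta_lmmse = mu_theta + C_thetax @ np.linalg.solve(C_xx, x - H @ mu_theta)

print(theta, theta_lmmse)
```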

7.2.2 Example

Let us revisit the acoustic echo cancellation problem; the purpose is to show that the variance of the estimator of the echo path impulse response can be lowered further by dropping the unbiasedness constraint. The signal flow diagram of the problem is shown again in Figure 7.1 for ease of reference.

Figure 7.1: Acoustic echo cancellation problem

Recall that the output vector at the microphone is expressed as

y = Uh + v

Let us assume that the echo path (room) impulse response is a random vector ho about which some prior knowledge is available, i.e., ho has the prior PDF p(ho) with first and second moments given as:

Any linear estimator minimizing the Bayesian MSE

where ho denotes the true value of the impulse response, is then given by the mean of the posterior PDF p(ho|y)

If the prior knowledge is constructed with a zero prior mean, then

It is worth comparing the form of the LMMSE estimator with that of the earlier derived BLUE:

Note that in the MMSE criterion of the Bayesian framework, the variance of the estimator and the squared bias are weighted equally. Thus the variance of the estimator can be reduced below that of the MVUE if the estimator is no longer constrained to be unbiased.

Lecture 21 : Wiener Smoother

7.3.1 Bayesian Gauss-Markov Theorem

If the data are described by the Bayesian linear model:

x = Hθ + w

where x is an N × 1 data vector, H is a known N × p observation matrix, θ is a p × 1 random vector of parameters with mean E(θ) and covariance matrix Cθθ, and w is an N × 1 noise vector with zero mean and covariance matrix Cw that is uncorrelated with θ (the joint PDF p(w, θ), and hence also p(x, θ), are otherwise arbitrary). Noting that:

Therefore, the LMMSE estimator is:

We now assume N samples of time-series data x = [x[0], x[1], ..., x[N - 1]]T which are wide-sense stationary (WSS). Further, as E(x) = 0, the N × N covariance matrix takes the symmetric Toeplitz form:

where rxx[k] = E(x[n]x[n - k]) is the autocorrelation function (ACF) of the x[n] process and Rxx denotes the autocorrelation matrix. Note that since x[n] is WSS, the expectation E(x[n]x[n - k]) is independent of the absolute time index n. In signal processing practice the estimated ACF is used, given by

Both the data x and the parameter θ to be estimated are assumed to be zero mean. Thus the LMMSE estimator is:

Application of LMMSE estimation to the three signal processing problems of smoothing, filtering and prediction gives rise to the different kinds of Wiener filters discussed in the following sections.

7.3.2 Wiener Smoothing

The problem is to estimate the signal θ = s = [s[0], s[1], ..., s[N - 1]]T based on the noisy data x = [x[0], x[1], ..., x[N - 1]]T, where

x = s + w

and w = [w[0], w[1], ..., w[N - 1]]T is the noise process. An important difference between smoothing and filtering is that the signal estimate ŝ[n] can use the entire data set: the past values (x[0], x[1], ..., x[n - 1]), the present value x[n] and the future values (x[n + 1], x[n + 2], ..., x[N - 1]). This means that the solution cannot be cast as a filtering problem, since we cannot apply a causal filter to the data. We assume that the signal and noise processes are uncorrelated. Hence,

and thus

Also

Hence the LMMSE estimator (also called Wiener estimator) is,

and the N × N matrix

is referred to as the Wiener smoothing matrix.

Non-causal Wiener filtering

Consider a smoothing problem in which a signal s[n] is to be estimated given a noisy signal x[n] of infinite length, i.e., {..., x[-2], x[-1], x[0], x[1], x[2], ...}, or x[k] for all k. In such cases, the smoothing estimator takes the form

Now, by letting h[k] denote the weight applied to x[n - k], the above estimator can be expressed as the convolution sum

where h[k] can be identified as the impulse response of an infinite-length, two-sided, time-invariant filter. Analogous to the LSE case (refer to 5.4.1), the orthogonality principle also holds for the LMMSE case, i.e., the estimation error is always orthogonal (in the statistical sense) to the observed data {..., x[-1], x[0], x[1], ...}. This can be expressed mathematically as

Using the definition of the orthogonality principle, we have

Hence,

Thus the equations required to be solved for determining the infinite Wiener filter impulse response, also referred to as Wiener-Hopf equations, are given by

On taking the Fourier transform of both sides of the above equation we have

where H(ω) is the frequency response of the infinite Wiener smoother, and Pxx(ω) and Pss(ω) are the power spectral densities of the noisy and clean signals, respectively. As the signal and noise are assumed to be uncorrelated, the frequency response of the Wiener smoother can be expressed as

Remarks: Since the power spectral densities are real and even functions of frequency, the impulse response also turns out to be real and even. This means that the designed filter is non-causal, which is consistent with the fact that the signal is estimated using future as well as present and past data.
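A small frequency-domain sketch of such a smoother is given below (an added illustration assuming an AR(1)-type signal in white noise); it forms H(ω) = Pss(ω)/(Pss(ω) + Pww(ω)) on an FFT grid from the known PSDs and applies it to the noisy data (a circular-convolution approximation, adequate for illustration).

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(11)

# Assumed example: AR(1) signal s[n] = 0.9 s[n-1] + u[n] plus white noise w[n]
N, a, sigma_u2, sigma_w2 = 4096, 0.9, 1.0, 1.0
s = lfilter([1.0], [1.0, -a], rng.normal(0.0, np.sqrt(sigma_u2), N))
x = s + rng.normal(0.0, np.sqrt(sigma_w2), N)

# Known PSDs on the FFT grid and the Wiener smoother H(w) = Pss / (Pss + Pww)
w = 2.0 * np.pi * np.fft.fftfreq(N)
Pss = sigma_u2 / np.abs(1.0 - a * np.exp(-1j * w)) ** 2
H = Pss / (Pss + sigma_w2)

# Apply the smoother in the frequency domain (circular approximation for illustration)
s_hat = np.real(np.fft.ifft(H * np.fft.fft(x)))

print(np.mean((x - s) ** 2), np.mean((s_hat - s) ** 2))   # smoothing reduces the error
```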

7.3.3 Example

Consider a noisy signal x[n] consisting of a desired clean signal s[n] corrupted by additive white noise w[n]. The autocorrelation functions (ACFs) of the signal and noise are

respectively. Assume that the signal and noise are uncorrelated and zero mean. Find the non-causal optimal Wiener filter for estimating the clean signal from its noisy version. On computing the z-transforms of the ACFs, the power spectral densities of the signal and the noise can be derived as

The z-transform of the non-causal optimal Wiener filter can be given as

Given Hopt(z), the impulse response of the optimal stable filter turns out as

7.4.1 Wiener Filtering

The problem is to estimate the signal θ = s[n] based only on the present and past noisy data x = [x[0], x[1], ..., x[n]]T. As n increases, this allows us to view the estimation process as the application of a causal filter to the data, and we need to cast the LMMSE estimator expression in the form of a filter. Assuming the signal and noise processes are uncorrelated, we have,

where Cxx is the (n + 1) × (n + 1) autocorrelation matrix. Note that Cθx is a 1 × (n + 1) row vector.

The LMMSE estimator is given by,

where a is the (n + 1) × 1 vector of weights. Note that the check notation is used to denote time reversal. Thus we have

The process of forming the estimator as time evolves can be interpreted as a filtering operation. Specifically, we let h(n)[k], the time-varying impulse response, be the response of the filter at time n to an impulse applied k samples before (i.e., at time n - k). We note that each weight can be interpreted as the response of the filter at time n to the signal (or impulse) applied at time i = n - k. Thus we can make the following correspondence,

Then:

We define the vector h = [h(n)[0], h(n)[1], ..., h(n)[n]]T. Then h is a time-reversed version of a. To explicitly find the impulse response h we note that, since

then it is also true that,

When written out we get the Wiener-Hopf filtering equations:

where rxx[k] = rss[k] + rww[k].

Remark: A computationally efficient method for solving these equations exists, known as the Levinson recursion, which solves them recursively and avoids re-solving them for each value of n.

Causal Wiener Filtering

On using the property rxx[k] = rxx[-k], the Wiener-Hopf equations can be written as

For the large data record case (n → ∞), the time-varying impulse response h(n)[k] can be replaced by its time-invariant version h[k], and we have

This is termed the infinite Wiener filter. The determination of the causal Wiener filter involves the use of the spectral factorization theorem, as explained in the following. The one-sided z-transform of a sequence x[n] is defined as

Now we could write the Wiener-Hopf equation as

where the filter impulse response h[n] is constrained to be causal. The two-sided z-transform that satisfies the Wiener-Hopf equation could be written as

If Pxx(z) has no zeros on the unit circle, then it can be factorized as

where the first factor is the z-transform of a causal sequence and the second factor is the z-transform of an anticausal sequence. Thus we have

Now letting , so that

Noting that is the z-transform of a causal sequence, it can be shown that

7.4.3 Example

Consider a signal s[n] corrupted by additive white noise w[n]. The signal and noise are assumed to be zero mean and uncorrelated. The autocorrelation functions (ACFs) of the signal and noise are

Find the causal optimal Wiener filter to estimate the signal from its noisy observations. As the signal and noise are uncorrelated, we have

and on taking z-transform, we have

The z-transform of signal ACF, after some manipulation, is

The z-transform of noise ACF is

Thus, the z-transform of noisy signal is given by

On factorizing Pxx(z) into causal and anticausal parts, we have,

Note that the first factor corresponds to a right-handed (causal) sequence while the second corresponds to a left-handed (anticausal) sequence. Figure 7.2 shows plots of the sequences corresponding to these components, as well as that of the optimal non-causal filter obtained by combining them. From these plots it is straightforward to determine the causal sequence corresponding to the optimal filter, which is also shown in Figure 7.2. Thus

Figure 7.2: Plot showing the two-sided and the causal optimal filters for the causal Wiener filter example.

The z-transform of optimal causal Wiener filter can be given as

7.5.1 Wiener Prediction

The problem is to estimate a future sample θ = x[N - 1 + l], for l ≥ 1, based on the current and past data x = [x[0], x[1], ..., x[N - 1]]T. The resulting estimator is termed the l-step linear predictor. As before we have Cxx = Rxx, where Rxx is the N × N autocorrelation matrix, and:

Then, the LMMSE estimator is:

where . We can interpret the process of forming the estimator as filtering operation where

Therefore,

Defining h as before, we can find an explicit expression for it by noting that:

where rxx = [rxx[l], rxx[l + 1], ..., rxx[l + N - 1]]T is the time-reversed cross-covariance vector. When written out, we get the Wiener-Hopf prediction equations:

As pointed out earlier, the Levinson recursion is a computationally efficient procedure for solving these equations recursively. The special case l = 1, the one-step predictor, covers two important cases in signal processing:
• The values -h[n] are termed the linear prediction coefficients (LPC), which are used extensively in speech coding. For example, a 10th-order (N = 10) linear predictor is commonly used in speech coding and is given by

• The resulting Wiener-Hopf equations are identical to the Yule-Walker equations used to solve for the autoregressive (AR) filter parameters of an AR(N) process.
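As an added illustration, the sketch below sets up and solves the one-step (l = 1) Wiener-Hopf prediction equations for an assumed AR(1)-like ACF, using a Toeplitz solver that exploits the same structure as the Levinson recursion; the resulting coefficients show that only the most recent sample is needed for this process.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

# Assumed AR(1) example: r_xx[k] = a^|k| (up to a scale factor)
a, N = 0.9, 10
r = a ** np.arange(N + 1)          # r[0], r[1], ..., r[N]

# One-step (l = 1) Wiener-Hopf prediction equations: Toeplitz(r[0:N]) @ h = r[1:N+1]
h = solve_toeplitz(r[:N], r[1:])

print(h)   # ~ [0.9, 0, 0, ...]: predicting x[n] from x[n-1] alone is optimal for AR(1)
```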

7.5.2 Example

Consider a real wide-sense stationary (WSS) random process x[n] with autocorrelation sequence

Find the coefficients of the second-order optimal linear predictor filter. The first few values of the autocorrelation sequence rxx[k] are:

The second-order optimal linear predictor is defined as

where the ai's are the predictor coefficients. These coefficients are obtained by solving the Wiener-Hopf prediction equations for N = 2:

or

On solving we have a1 = -56 and a2 = 16. Thus, the optimal predictor polynomial is given by

or