
8. Introduction to Adaptive Filtering

8.1 Introduction

8.2 The adaptive linear combiner

8.3 The performance function

8.4 Finding the minimum MSE analytically

8.5 Finding the minimum MSE empirically

8.6 The least mean squares (LMS) adaptation algorithm

8.7 The recursive least squares (RLS) adaptation algorithm

8.8 The adaptive lattice algorithm

Classical discrete-time filters have fixed coefficients that are determined to meet predefined specifications such as passband and stopband ripple, critical frequencies, etc. This chapter describes the structure and function of simple adaptive filters, which have coefficients that vary with time in order to realize a performance goal such as noise suppression or signal prediction. The optimal values of these time-varying coefficients are typically not known a priori, or they may vary with time as environmental conditions change.

8.1 Introduction

Conventional filters with fixed coefficients are designed with a particular goal in mind, such as attenuating all frequencies above a particular "cutoff" frequency. The design procedure frequently consists of choosing the set of coefficients that would best approximate a desired frequency response. There are multiple well-developed techniques to accomplish this, as was discussed in basic digital signal processing courses.

Adaptive filters, in contrast, have coefficients that are allowed to vary over time. They are used when the filter response that best accomplishes a particular task is not known a priori, or when the nature of the operating environment is expected to change over time. A typical system goal would be for a filter to suppress undesired noise to the greatest extent possible while leaving the target signal (which frequently is speech or music) intact to the extent possible.¹

For various historical reasons, the notational conventions commonly used in the literature to describe adaptive filters and the signals that pass through them are different from the standard notation that we have used above to describe frequency analysis and conventional filters. Specifically, the system input and output are frequently denoted as x_k and y_k, respectively, rather than x[n] and y[n] as in previous chapters. (Note that the time variable is now a subscript rather than a function argument, and that we are now using k rather than n to denote time.)

¹The material in this chapter is a condensation of the chapter on adaptive filters by Stearns in the text edited by Lim and Oppenheim. The Stearns chapter is in turn an abstraction of many topics in the book Adaptive Signal Processing published in 1985 by B. Widrow and S. D. Stearns.

Figure 8.1 below is a diagram of the standard adaptive signal processor. Using the modified notation just introduced, x_k represents the input to and y_k represents the output from the adaptive processor. The signal d_k represents the so-called desired signal. The signal e_k = d_k - y_k is referred to as the error signal, which is the difference between the desired signal (which is what we presumably want) and the adaptive processor's output (which is what we are presently getting). The coefficients of the adaptive processor are manipulated in a fashion that minimizes the mean value of the squared error signal, ξ^2 = E[e_k^2], the mean-square error or MSE. This tends to make the signal y_k at the output of the adaptive processor as much as possible like the desired signal d_k. The diagonal arrow through the adaptive processor box indicates that the coefficients of the adaptive processor are variable, and that their values are controlled by the error signal e_k = d_k - y_k.


Figure 8.1: Block diagram of a standard adaptive signal processor.


Figure 8.2: Block diagram of a standard noise cancellation system.

While it may not be obvious why the structure in Fig. 8.1 is potentially useful, the purpose of the various elements becomes more evident when considered in the context of the common adaptive noise cancellation structure in Fig. 8.2 above. The primary channel consists of the sum of a target speech or music signal s_k to which a noise signal n_k is added. (The term "noise signal" refers to any kind of noise or interference (including echoes) that is added to the target signal; n_k could, for example, represent background speech or music that is simply not what is desired.) The reference channel (which is also sometimes called the secondary channel) contains a second noise signal n′_k that is correlated with n_k.


As an example, consider one of the earliest applications of adaptive noise cancellation: communication among fighter pilots wearing oxygen masks in military aircraft. The audio signal input to the primary channel is detected by a microphone inside the mask, and it consists of the pilot's speech plus a filtered version of the very loud ambient noise inside the aircraft. The reference channel signal n′_k is obtained from a microphone within the cockpit but outside the mask. This signal consists of the ambient cockpit noise without the filtering that occurs when the cabin noise propagates through the pilot's mask, and essentially none of the pilot's speech (which is at a much lower intensity level compared to the noise in the cabin). The adaptive processor attempts to filter the noise signal n′_k in the reference channel so that its output y_k most closely approximates the noise component n_k in the primary channel. If y_k were to closely approximate n_k, the two would almost cancel each other out, and the system output would consist mostly of the target signal. In practice, the amount of noise cancellation that is obtained depends on the degree of correlation between the noise in the primary and reference channels (greater correlation provides greater cancellation). In addition, it is very important that the target signal not leak into the reference channel and that the target and noise signals be statistically independent.

8.2 The adaptive linear combiner


Figure 8.3: FIR filter structure as used in adaptive filtering.

While there are a number of different types of adaptive processors, the easiest type to study is the adaptive linear combiner (ALC). The output of an ALC can be described by the equation

y_k = Σ_{l=0}^{L-1} x_{lk} w_{lk}    (8.1)

where the {w_{lk}} represent the variable weights of the adaptive filter and the signals {x_{lk}} are the inputs to those weights. While this equation can be applied directly to any system for which the output is obtained by computing a linear combination of the outputs of an array of sensors, the most common implementation of the adaptive processor is that of an FIR filter with variable coefficients. This structure is depicted in Fig. 8.3 above with the notational conventions commonly used to describe adaptive systems. We note that the filter coefficients (or "weights") are now variables with the variable name w_{lk}, where the first subscript identifies the individual filter coefficient under consideration and the second subscript indicates the time index. The diagonal arrows through the multiplication symbols again indicate that the filter coefficients vary over time. Another important variable shown in the picture is the observation vector, which is by definition the set of signal inputs to the variable weights. If the adaptive processor is in the form of an FIR filter, the observation vector components x_{lk} are equal to delayed versions of the original input, with x_{lk} = x_{k-l}, which approximates n′_k for the application in Fig. 8.2.²

²The multiple use of the variable x in this equation is unfortunate. The variable x with a single subscript denotes delayed versions of the input; the variable x with double subscripts denotes individual components of the observation vector.

It is frequently convenient to represent the inputs {x_{lk}} and the variable filter coefficients {w_{lk}} as column vectors:

X_k = [x_{0k}, x_{1k}, x_{2k}, ..., x_{L-1,k}]^T   and   W_k = [w_{0k}, w_{1k}, w_{2k}, ..., w_{L-1,k}]^T    (8.2)

Using the notation of linear algebra it is easy to show that

y_k = Σ_{l=0}^{L-1} x_{lk} w_{lk} = X_k^T W_k = W_k^T X_k    (8.3)

where the T in the equation above denotes the matrix transpose operator.
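As a concrete illustration (not part of the original text), the following Python sketch shows how Eq. (8.3) is evaluated when the ALC is realized as the FIR structure of Fig. 8.3: the observation vector X_k is simply a reversed slice of the input delay line. The function and variable names are illustrative.

    import numpy as np

    def alc_output(X_k, W_k):
        """Adaptive linear combiner output y_k = X_k^T W_k (Eq. 8.3)."""
        return np.dot(X_k, W_k)

    # Build the observation vector from a delay line and compute y_k.
    L = 4
    x = np.array([0.5, -1.0, 2.0, 0.25, 1.5])   # input samples x_0, ..., x_4
    W = np.array([0.1, 0.2, 0.3, 0.4])          # current weights w_0k, ..., w_3k
    k = 4
    X = x[k::-1][:L]                            # X_k = [x_k, x_{k-1}, x_{k-2}, x_{k-3}]
    y = alc_output(X, W)                        # y_k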

8.3 The performance function

In order to develop algorithms for minimizing the mean square error we must first develop an analytical expression for what that error is. As before we define the error as

e_k = d_k - y_k = d_k - X_k^T W_k = d_k - W_k^T X_k    (8.4)

and

e_k^2 = d_k^2 + W_k^T X_k X_k^T W_k - 2 d_k W_k^T X_k    (8.5)

If d_k, e_k, and x_k are all zero mean and wide-sense stationary, and statistically independent of one another (which in real life is normally not true), then

ξ^2 = E[e_k^2] = E[d_k^2] + W_k^T E[X_k X_k^T] W_k - 2 E[d_k X_k^T] W_k    (8.6)

To simplify the notation, let us define

R = E[X_k X_k^T]   and   P = E[d_k X_k]    (8.7)

Note that R is an L × L matrix and that P is an L × 1 vector. Hence it follows directly that

ξ^2 = E[e_k^2] = E[d_k^2] + W^T R W - 2 P^T W    (8.8)

As noted above, the coefficients in the adaptive filter are normally manipulated to minimize the statistically-averaged squared value of the error signal, ξ^2. Note that the expression for the MSE ξ^2 above is a quadratic function of the weight vector W. As a consequence, there is only a single minimum value of ξ^2 over all the possible values of the coefficients in W.

For purposes of illustration, let us consider an adaptive filter that has only two variable coefficients, w_0 and w_1. Figure 8.4 depicts a typical example of the "performance surface" for such an adaptive filter, which is a plot of the MSE ξ^2 as a function of w_0 and w_1. As can be seen, the performance surface is bowl-shaped: in two dimensions the surface is a paraboloid (and with more than two weights it is a hyperparaboloid). As noted above, every quadratic performance surface has a single minimum, which in this example is at w_0 = 40 and w_1 = 60. These coefficient values are the best possible values for the (imaginary) filter under consideration. In principle, the shape of the performance surface depends on the specific optimal values of the coefficients for a particular application (40 and 60 in this case), the correlations between the components of the observation vector X_k and the desired signal d_k, and (especially) the correlations of the components of the observations X_k with each other, which is 0.7 in the example in Fig. 8.4.


Figure 8.4: Plot of the mean-square error ξ^2 as a joint function of the two filter coefficients, w_0 and w_1.

Another way of describing the performance surface is in terms of its "contour lines" as in Fig. 8.5. The left panel of Fig. 8.5 depicts a performance surface for a system in which the optimal adaptive filter coefficients are 40 and 60, and the correlation ρ between the two components x_0 and x_1 of the observations is equal to 0.7, as in Fig. 8.4. The right panel of Fig. 8.5 depicts the contour lines of a similar performance surface, but in this case the correlation between the components of the observations equals 0.0 (i.e., there is no correlation between x_0 and x_1). In general, the contour lines of the performance surface are circles if x_0 and x_1 are uncorrelated (as in the right panel of the figure), and they become increasingly-elongated ellipses as the correlation increases. (In linear algebra terms, increasingly-elongated ellipsoidal contour lines reflect increasing spread of the eigenvalues of the correlation matrix of the observations.) The degree of correlation of the observations is important, as will be discussed below.



Figure 8.5: Contour plots of mean-squared error for two values of the correlation ρ between x_0 and x_1: 0.7 (left panel) and 0.0 (right panel). The minimum value of ξ^2 is obtained when the filter coefficients w_0 and w_1 are equal to 40 and 60, respectively, in both cases.
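The performance surface of Figs. 8.4 and 8.5 can be reproduced directly from Eq. (8.8). The short Python sketch below (not from the original text) evaluates ξ^2 for a hypothetical two-weight example with the optimum placed at (40, 60) and with correlation ρ between the two observation components; the numbers are chosen only to mimic the figures.

    import numpy as np

    rho = 0.7                            # correlation between x_0 and x_1 (0.0 for the right panel)
    R = np.array([[1.0, rho],
                  [rho, 1.0]])           # R = E[X_k X_k^T]
    W_opt = np.array([40.0, 60.0])       # desired optimum W*
    P = R @ W_opt                        # choose P = R W* so that W* = R^{-1} P
    E_d2 = P @ W_opt + 10.0              # E[d_k^2]; the extra 10 makes the minimum MSE equal 10

    def mse(w0, w1):
        """Quadratic performance function xi^2 = E[d_k^2] + W^T R W - 2 P^T W (Eq. 8.8)."""
        W = np.array([w0, w1])
        return E_d2 + W @ R @ W - 2.0 * P @ W

    # Evaluating mse() on a grid of (w0, w1) values and drawing its level sets gives the
    # elliptical (rho = 0.7) or circular (rho = 0.0) contours of Fig. 8.5.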

8.4 Finding the minimum MSE analytically

If there were only a single weight w, we would obtain the value of w that minimizes ξ^2 by differentiating with respect to w, setting the derivative equal to zero, and solving for w. With the vector W we perform a similar series of operations, working from the gradient operator. Define the gradient of ξ^2 to be

∇ = ∂ξ^2/∂W = [∂ξ^2/∂w_0, ∂ξ^2/∂w_1, ..., ∂ξ^2/∂w_{L-1}]^T    (8.9)

Recall from Eq. (8.8) above that

ξ^2 = E[e_k^2] = E[d_k^2] + W^T R W - 2 P^T W

Differentiating with respect to the elements of W, it follows directly that

∇ = ∂ξ^2/∂W = 2 R W - 2 P    (8.10)

Let W* represent the value of W that minimizes ξ^2. W* is easily obtained by solving

∇ = 0 = 2 R W* - 2 P,   which produces   W* = R^{-1} P    (8.11)

We note that this is the familiar Wiener-Hopf equation, Eq. (7.15), which you have encountered before in conjunction with the discussion on linear prediction. We can also easily obtain an expression for the minimum MSE by plugging W* into the expression for the MSE:

ξ^2_min = E[d_k^2] + W*^T R W* - 2 P^T W*
        = E[d_k^2] + (R^{-1} P)^T R R^{-1} P - 2 P^T R^{-1} P    (8.12)

Combining terms we obtain

ξ^2_min = E[d_k^2] + P^T R^{-1} P - 2 P^T R^{-1} P
        = E[d_k^2] - P^T R^{-1} P
        = E[d_k^2] - P^T W*    (8.13)

Note that the MMSE depends only on E[d_k^2], R, and P. And please remember that the asterisk in W* denotes the optimal coefficient vector, not complex conjugation.
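A minimal Python sketch of this analytical solution (not from the original text; the function name is illustrative): R and P are replaced by sample averages, after which Eqs. (8.11) and (8.13) give W* and ξ^2_min directly.

    import numpy as np

    def wiener_solution(X, d):
        """Compute W* = R^{-1} P (Eq. 8.11) and xi^2_min = E[d_k^2] - P^T W* (Eq. 8.13)
        from sample estimates of R and P.

        X : array of shape (K, L), one observation vector X_k per row
        d : array of shape (K,), the desired signal d_k
        """
        K = X.shape[0]
        R = X.T @ X / K                    # sample estimate of E[X_k X_k^T]
        P = X.T @ d / K                    # sample estimate of E[d_k X_k]
        W_opt = np.linalg.solve(R, P)      # W* = R^{-1} P
        mmse = np.mean(d**2) - P @ W_opt   # minimum mean-square error
        return W_opt, mmse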

8.5 Finding the minimum MSE empirically

In this section we will discuss two methods to find the minimum MSE empirically: Newton's method and gradient descent.

8.5.1 Newton’s method

Recall that we discussed the gradient in the section above:

∇ = ∂ξ^2/∂W = [∂ξ^2/∂w_0, ∂ξ^2/∂w_1, ..., ∂ξ^2/∂w_{L-1}]^T = 2 R W - 2 P    (8.14)

Pre-multiplying both sides of the above equation by (1/2) R^{-1} produces

(1/2) R^{-1} ∇ = (1/2) R^{-1} [2 R W - 2 P] = W - R^{-1} P = W - W*    (8.15)

Rearranging the terms of the above equation gives us a different way of thinking about W*:

W* = W - (1/2) R^{-1} ∇    (8.16)

This is essentially Newton's root-finding method set up to find the zero of the gradient operator. What this means in principle is that if we have perfect knowledge of R, we can move to the "bottom of the bowl" (i.e., the minimum value of ξ^2) in one step, because knowing the position and the slope is sufficient to locate the bottom of the bowl when the surface is quadratic and the statistics are known. Of course, neither of these conditions holds in real life: we do not know R and we do not know ∇; both must be estimated empirically.


8.5.2 Gradient descent

One way to "hedge our bets" about the value of W that provides the MMSE is to move toward the bottom of the bowl in small increments, rather than in one big step as in the case of Newton's method. Specifically, we can move toward the MMSE value at the bottom of the bowl one step at a time using the relationship

W_{k+1} = W_k - μ R^{-1} ∇    (8.17)

Keep in mind that the gradient operator produces a vector that points in the direction of "up", so what this equation teaches us to do is locate the direction on the plane that corresponds to "down" and walk a little ways in the direction of "down". The parameter μ in the equation above is the stepsize parameter. Clearly, larger values of μ would enable us to get to the bottom of the bowl faster, but the residual error once we are close to the bottom would be greater for larger values of μ, as the coefficients are likely to oscillate around their true values because of the larger stepsizes.
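For concreteness, here is a short Python sketch (not from the original text) of the iteration in Eq. (8.17), assuming that the true statistics R and P are available; with R^{-1} included, every step moves the same fraction of the remaining distance toward W* = R^{-1} P regardless of the eigenvalue spread.

    import numpy as np

    def gradient_descent(R, P, W0, mu, n_iter):
        """Iterate W_{k+1} = W_k - mu R^{-1} grad, with grad = 2 R W_k - 2 P (Eq. 8.17)."""
        R_inv = np.linalg.inv(R)
        W = np.asarray(W0, dtype=float).copy()
        track = [W.copy()]
        for _ in range(n_iter):
            grad = 2.0 * R @ W - 2.0 * P          # gradient of xi^2 at W_k (Eq. 8.10)
            W = W - mu * R_inv @ grad             # walk a little ways "down"
            track.append(W.copy())
        return np.array(track)                    # coefficient trajectory W_0, W_1, ...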

Since ∇ = 2 R W_k - 2 P, we can rewrite the iterative equation above as

W_{k+1} = W_k - μ R^{-1} (2 R W_k - 2 P) = (1 - 2μ) W_k + 2μ W*    (8.18)

We can get some idea of how the coefficient trajectories evolve by assuming an arbitrary initial coefficient vector W_0 and iterating Eq. (8.18) a few times:

W_1 = (1 - 2μ) W_0 + 2μ W*    (8.19)
W_2 = (1 - 2μ) W_1 + 2μ W* = (1 - 2μ) [(1 - 2μ) W_0 + 2μ W*] + 2μ W*    (8.20)
    = (1 - 2μ)^2 W_0 + 2μ W* (1 + (1 - 2μ))    (8.21)
Similarly,  W_3 = (1 - 2μ)^3 W_0 + 2μ W* (1 + (1 - 2μ) + (1 - 2μ)^2)    (8.22)
and  W_k = (1 - 2μ)^k W_0 + 2μ W* Σ_{i=0}^{k-1} (1 - 2μ)^i    (8.23)

Using the formula for the finite sum of exponentials we obtain

W_k = (1 - 2μ)^k W_0 + 2μ W* [1 - (1 - 2μ)^k] / [1 - (1 - 2μ)]    (8.24)
    = (1 - 2μ)^k W_0 + W* (1 - (1 - 2μ)^k)    (8.25)

While it may not be obvious, (1 - 2μ)^k in the first term is a sampled decaying exponential and (1 - (1 - 2μ)^k) in the second term is a sampled rising exponential. Equation (8.25) would have been more transparent had it been written in the form

W_k = e^{-t_k/τ} W_0 + (1 - e^{-t_k/τ}) W*    (8.26)

where t_k = k in this expression. We can expand the exponential in a Taylor series:

e^{-k/τ} = (e^{-1/τ})^k,   where   e^{-1/τ} = 1 - 1/τ + (-1/τ)^2/2! + (-1/τ)^3/3! + ...    (8.27)

Keeping only the first two terms produces

e^{-k/τ} ≈ (1 - 1/τ)^k    (8.28)


In other words,

e^{-k/τ} = (e^{-1/τ})^k ≈ (1 - 1/τ)^k,   which corresponds to   (1 - 2μ)^k    (8.29)

Matching terms provides the correspondence

1/τ = 2μ   or   τ = 1/(2μ)    (8.30)

Summarizing, this tells us that the recursion for W_k causes the coefficient vector to move exponentially from W_0 toward the true optimal coefficient vector W*, and that the time constant τ is inversely proportional to the stepsize parameter μ. The average error curve for the coefficients also decays exponentially.
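A small numerical check (not part of the original text, with hypothetical values) confirms that the closed form of Eq. (8.25) matches the recursion of Eq. (8.18), and illustrates the time constant τ = 1/(2μ) of Eq. (8.30).

    import numpy as np

    mu = 0.05                                     # stepsize; tau = 1/(2*mu) = 10 iterations
    W0 = np.array([40.0, 10.0])                   # arbitrary initial weights
    W_opt = np.array([40.0, 60.0])                # optimal weights W*
    K = 30

    W = W0.copy()
    for _ in range(K):
        W = (1 - 2 * mu) * W + 2 * mu * W_opt     # recursion of Eq. (8.18)

    closed_form = (1 - 2 * mu)**K * W0 + (1 - (1 - 2 * mu)**K) * W_opt   # Eq. (8.25)
    assert np.allclose(W, closed_form)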

8.6 The least mean squares (LMS) adaptation algorithm

The gradient descent solution developed in Eq. (8.17) in the previous section

W_{k+1} = W_k - μ R^{-1} ∇

has two problems: we don't know the values of the statistics R and P, and we don't know the value of ∇ either. We get around these problems by making two changes:

1. Ignore the factor of R^{-1} in the above equation.

2. Replace the average statistics R and P by their instantaneous values. Specifically, we replace R = E[X_k X_k^T] by X_k X_k^T and we replace P = E[d_k X_k] by d_k X_k.

In other words, we replace

W_{k+1} = W_k - μ R^{-1} ∇
        = W_k - μ R^{-1} (2 R W_k - 2 P)
        = W_k - μ R^{-1} (2 E[X_k X_k^T] W_k - 2 E[d_k X_k])

by

W_{k+1} = W_k - 2μ (X_k X_k^T W_k - d_k X_k)    (8.31)

Based on our definitions for y_k and e_k, we can rewrite this expression as

W_{k+1} = W_k - 2μ X_k (y_k - d_k)    (8.32)
or
W_{k+1} = W_k + 2μ X_k e_k    (8.33)

This extremely simple expression is the least-mean-squares or LMS algorithm, developed by Widrow and Hoff in 1960. It is one of the most widely used adaptation algorithms. We note that the coefficient trajectories produced by the LMS algorithm become a noisy approximation to the ideal trajectories that would have been obtained using the gradient descent algorithm if the actual statistics R and P were known. For this reason, the LMS algorithm is sometimes called a "stochastic gradient" algorithm.
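Below is a minimal Python sketch of an LMS adaptive FIR filter implementing Eq. (8.33) (not from the original text; the function and variable names are illustrative). In the noise-cancellation configuration of Fig. 8.2, x would be the reference-channel signal, d the primary-channel signal, and the returned error e_k would approximate the target signal s_k.

    import numpy as np

    def lms(x, d, L, mu):
        """LMS adaptive FIR filter: W_{k+1} = W_k + 2 mu e_k X_k (Eq. 8.33).

        x  : input signal x_k, shape (K,)
        d  : desired signal d_k, shape (K,)
        L  : number of filter weights
        mu : stepsize parameter
        Returns the error signal e and the final weight vector W.
        """
        W = np.zeros(L)
        X = np.zeros(L)                              # observation vector [x_k, ..., x_{k-L+1}]
        e = np.zeros(len(x))
        for k in range(len(x)):
            X = np.concatenate(([x[k]], X[:-1]))     # shift the FIR delay line
            y = X @ W                                # filter output y_k = X_k^T W_k
            e[k] = d[k] - y                          # error e_k = d_k - y_k
            W = W + 2 * mu * e[k] * X                # coefficient update
        return e, W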

As an example, Fig. 8.6 compares the coefficient tracks produced by the LMS algorithm (irregular black curves) with the corresponding gradient-descent tracks (smooth red curves) for the two performance surfaces with correlations of 0.7 and 0.0 between the two coefficients. As can be seen, the actual tracks show a degree of randomness and irregularity because the calculations are based on instantaneous observations rather than the corresponding mathematically-correct statistical averages. Nevertheless, it can also be seen that the tracks converge to the correct values, despite the fact that the journey is more irregular. This occurs because calculating the coefficient values iteratively provides parameter estimates that are averaged over time, which provides reasonable accuracy.


Figure 8.6: Comparison of tracks of steepest descent (smooth red curves) from initial coefficient values of 40 and 10 to final coefficient values of 40 and 60 with actual LMS coefficient tracks (irregular black curves). Values of the correlation between the components of the observations are 0.7 (left panel) and 0.0 (right panel), as in the two previous figures.

The LMS algorithm is remarkably simple and quite effective. Nevertheless, it does suffer from potentially slow convergence when the components of the observation vector are highly correlated with one another, as in the left panel of Fig. 8.6. As we have discussed, the LMS algorithm causes the tracks of the filter coefficients to follow (on average) trajectories that are perpendicular to the contour lines, which is not the most direct path to the bottom of the bowl when the observations are correlated. This is of particular concern when the FIR filter structure is used for the adaptive filter, as in this discussion, because in this case the observation components are merely successive delayed values of the original input signal.

Several modifications of the original LMS formulation using FIR adaptive filters have been proposed to ameliorate the problem of slow convergence when the observations are correlated. These include:

1. The use of alternate adaptation algorithms. Other coefficient-update algorithms have been proposed that converge more rapidly than the LMS algorithm when the observations are correlated, at the expense of increased computational complexity. The most widely-used such approach is the recursive least-squares (or RLS) algorithm. These algorithms attempt to decorrelate the observations by iteratively estimating and applying the inverse of the correlation matrix.

2. Alternate filter structures. There are other filter structures, such as the FIR lattice structure, which provide intrinsic decorrelation of the observations.


3. Frequency-domain approaches. The convolution performed by the adaptive FIR filter can also be performed in the frequency domain using fast Fourier transform techniques. Performing the adaptation in the frequency domain as well has the further advantage that the Fourier transform operation is intrinsically orthogonalizing, and hence will decorrelate the observations naturally.

4. Subband approaches. Adapting separate frequency bands in the frequency domain in parallel also reduces the correlation in the observations because the power spectrum in local regions tends to be flatter (which implies reduced eigenvalue spread) than the entire power spectrum.

We will discuss the RLS and lattice implementations in Secs. 8.7 and 8.8 below.

8.7 The recursive least squares (RLS) adaptation algorithm

As noted above, the LMS algorithm is widely used, but it is subject to slow convergence because its coefficient trajectories follow the path of steepest descent, which can be quite indirect if the observations {X_k} are highly correlated with one another. Recall that the original iterative gradient descent algorithm, derived directly from the Newton solution in Eq. (8.17), was

W_{k+1} = W_k - μ R_k^{-1} ∇_k = W_k + 2μ e_k R_k^{-1} X_k    (8.34)

The contribution of the matrix R^{-1} in Eq. (8.17) is to decorrelate or "whiten" the observations, and indeed it is the removal of R^{-1} in formulating the LMS algorithm that causes the coefficient trajectories to move indirectly to the minimum error by following the perpendicular to the contour lines, rather than moving directly to the optimal coefficient values regardless of whether the eigenvalue spread is small or large.


Figure 8.7: Upper panel: conventional LSI filter that implements exponential smoothing. Lower panel: implementation that realizes the estimated correlation matrix R_k.

The recursive least squares (RLS) algorithm, which is also referred to in the Stearns chapter in Lim and Oppenheim as the "LMS-Newton" algorithm, produces a coefficient trajectory that moves directly to the optimal coefficients W* by iteratively estimating the matrix R^{-1} and the stochastic gradient simultaneously. We can approach the iterative estimation of R_k by first considering the adaptive estimation of a parameter by exponentially-weighted smoothing. For example, consider the very simple one-dimensional LSI system depicted in the upper panel of Fig. 8.7, which has the unit sample response h[n] = α(1 - α)^n u[n], where 0 < α < 1. It is easy to show that the difference equation that relates the output to the input is

y[n] = (1 - α) y[n - 1] + α x[n]    (8.35)

The lower panel of Fig. 8.7 depicts a similar LSI system with the same sample response, but using the notation of the Stearns chapter in Lim and Oppenheim. This filter operates element by element on its input, which is the L × L matrix X_k X_k^T, producing as its output R_k, which is a running exponential estimate of the autocorrelation matrix R = E[X_k X_k^T]. The corresponding difference equation relating output to input in the lower panel is

R_k = (1 - α) R_{k-1} + α X_k X_k^T    (8.36)
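A one-line Python sketch of this running estimate (not from the original text): each element of R_k is the output of the exponential smoother of Eq. (8.35) driven by the corresponding element of X_k X_k^T.

    import numpy as np

    def update_R(R_prev, X_k, alpha):
        """One step of R_k = (1 - alpha) R_{k-1} + alpha X_k X_k^T (Eq. 8.36)."""
        return (1 - alpha) * R_prev + alpha * np.outer(X_k, X_k)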

Let us now premultiply Eq. (8.36) by R_k^{-1} and postmultiply it by R_{k-1}^{-1}:

R_k^{-1} R_k R_{k-1}^{-1} = (1 - α) R_k^{-1} R_{k-1} R_{k-1}^{-1} + α R_k^{-1} X_k X_k^T R_{k-1}^{-1}    (8.37)

Cleaning up Eq. (8.37) produces

R_{k-1}^{-1} = (1 - α) R_k^{-1} + α R_k^{-1} X_k X_k^T R_{k-1}^{-1}    (8.38)

Postmultiplying Eq. (8.38) by X_k produces

R_{k-1}^{-1} X_k = (1 - α) R_k^{-1} X_k + α R_k^{-1} X_k X_k^T R_{k-1}^{-1} X_k    (8.39)

or

R_{k-1}^{-1} X_k = R_k^{-1} X_k [(1 - α) + α X_k^T R_{k-1}^{-1} X_k]    (8.40)

Now, let S_k = R_{k-1}^{-1} X_k. We can then write trivially

S_k = R_k^{-1} X_k [(1 - α) + α X_k^T S_k]    (8.41)

Because the matrices R_k and R_k^{-1} are symmetric, we can write S_k^T = X_k^T R_{k-1}^{-1}. Dividing both sides of Eq. (8.41) by [1 - α + α X_k^T S_k] and postmultiplying by S_k^T = X_k^T R_{k-1}^{-1} produces

R_k^{-1} X_k X_k^T R_{k-1}^{-1} = (S_k S_k^T) / (1 - α + α X_k^T S_k)    (8.42)

where again S_k = R_{k-1}^{-1} X_k.

Now ... substituting (8.42) into (8.38) produces

R_{k-1}^{-1} = (1 - α) R_k^{-1} + α (S_k S_k^T) / (1 - α + α X_k^T S_k)

which is easily rewritten as

R_k^{-1} = (1 / (1 - α)) [ R_{k-1}^{-1} - α (S_k S_k^T) / (1 - α + α X_k^T S_k) ]    (8.43)

This equation is (finally!) the update equation for R_k^{-1} that has been the missing piece in the RLS algorithm. Please note that the only new concept is the recursive estimation of R_k in Eq. (8.36). Everything after that was just a series of algebraic operations that enabled us to convert the iterative estimate of R_k into an iterative estimate of R_k^{-1}. We can now write the complete RLS update equations as:


1. Let R_0^{-1} = R^{-1} if R is known, or (1/σ^2) I if it is not, where σ^2 is the variance of the input signal x_k.

2. Let S_k = R_{k-1}^{-1} X_k.

3. Update the estimated inverse covariance matrix:

   R_k^{-1} = (1 / (1 - α)) [ R_{k-1}^{-1} - α (S_k S_k^T) / (1 - α + α X_k^T S_k) ]

4. Calculate the error function from the filter output: e_k = d_k - X_k^T W_k

5. Update the coefficients: W_{k+1} = W_k + 2μ e_k R_k^{-1} X_k

We iterate Steps 2-5 in the normal update of the filter coefficients. As you will recall, the LMS algorithm in contrast is simply

W_{k+1} = W_k + 2μ e_k X_k

As discussed in class, the RLS algorithm converges faster than the LMS algorithm when the performance surface is elongated in shape, which happens when the components of X_k are correlated or (alternatively) the matrix R has large eigenvalue spread.
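The following Python sketch assembles Steps 1-5 above into a complete RLS loop (not from the original text; function and variable names are illustrative, and the initialization assumes the input variance σ^2 is known or estimated).

    import numpy as np

    def rls(x, d, L, mu, alpha, sigma2=1.0):
        """RLS adaptive FIR filter following Steps 1-5 above."""
        W = np.zeros(L)
        R_inv = np.eye(L) / sigma2                   # Step 1: R_0^{-1} = (1/sigma^2) I
        X = np.zeros(L)
        e = np.zeros(len(x))
        for k in range(len(x)):
            X = np.concatenate(([x[k]], X[:-1]))     # observation vector X_k
            S = R_inv @ X                            # Step 2: S_k = R_{k-1}^{-1} X_k
            denom = 1 - alpha + alpha * (X @ S)
            R_inv = (R_inv - alpha * np.outer(S, S) / denom) / (1 - alpha)   # Step 3 (Eq. 8.43)
            e[k] = d[k] - X @ W                      # Step 4: e_k = d_k - X_k^T W_k
            W = W + 2 * mu * e[k] * (R_inv @ X)      # Step 5: coefficient update
        return e, W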

8.8 The adaptive lattice algorithm

The adaptive lattice filter is an alternate way to achieve fast adaptation, by exploiting the decorrelation of the signals passed to subsequent stages of the adaptive lattice. Of course, we have already discussed lattice filters in Secs. 7.3 and 7.4 in our discussion of linear prediction, so we will not repeat most of the basics other than to point out notational changes. The process of obtaining the adaptive algorithm parallels the development of the LMS algorithm, so it should be easy to understand.

Let us begin by comparing the notational conventions for the lattice structure from our discussion of linear prediction with the notational conventions used by Stearns in Fig. 8.8 below. The figure compares the notational conventions used to describe the (fixed) FIR lattice filter in the text by Rabiner and Schafer and also in Secs. 7.3 and 7.4 and the (adaptive) FIR lattice filter developed in Chapter 5 of the Lim and Oppenheim text, which we will adopt in this chapter. It is important to keep in mind that the coefficients k_i in the adaptive lattice filter are actually time varying (i.e., k_i = k_i[k]). Consequently we must take some care not to confuse the (unsubscripted) symbol k that indicates the time index with the (subscripted) k_i[k], which are the time-varying reflection coefficients. Also note the change in letter symbol used to denote the forward error, the reversal in sign of the reflection coefficients, and the change in indexing of the reflection coefficients. The lattice update equations using the notational conventions of Stearns are:

f_{i+1}[k] = f_i[k] + k_i b_i[k - 1]    (8.44)
b_{i+1}[k] = k_i f_i[k] + b_i[k - 1]    (8.45)

Let us consider the MSE of the forward error signal:

ξ_f^2 = E[f_{i+1}^2[k]] = E[(f_i[k] + k_i b_i[k - 1])^2]    (8.46)


Figure 8.8: Comparison of fixed lattice filters from Rabiner and Schafer with adaptive lattice filters of Stearns.

As you will recall, the free parameters of the lattice filter are the reflection coefficients {k_i[k]}. Proceeding as before, we will first obtain the optimal solution for the error assuming that the statistics are known, by differentiating the MSE with respect to the parameter k_i, setting the derivative to zero, and solving for k_i.

∂ξ_f^2 / ∂k_i = E[2 (f_i[k] + k_i b_i[k - 1]) b_i[k - 1]] = 0    (8.47)

Solving for ki :

E[f_i[k] b_i[k - 1]] = -k_i E[b_i^2[k - 1]]    (8.48)

k_i* = -E[f_i[k] b_i[k - 1]] / E[b_i^2[k - 1]]    (8.49)

This solution, of course, is exactly the same as the PARCOR method of estimating reflection coefficients that we have seen before in our discussion of linear prediction. The only differences (including the sign change) are a consequence of the changes in notation.
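As an illustration (not from the original text), the Python sketch below estimates the reflection coefficient of one lattice stage by replacing the expectations in Eq. (8.49) with sample averages over a block of data, and then forms the next-stage errors with Eqs. (8.44)-(8.45); the fully adaptive (sample-by-sample) version would instead update k_i recursively.

    import numpy as np

    def lattice_stage(f_i, b_i):
        """One lattice stage: estimate k_i* (Eq. 8.49) from sample averages and
        propagate the forward and backward errors (Eqs. 8.44-8.45).

        f_i, b_i : forward and backward error sequences f_i[k], b_i[k] of stage i
        """
        b_prev = b_i[:-1]                      # b_i[k-1]
        f_curr = f_i[1:]                       # f_i[k], aligned with b_i[k-1]
        k_i = -np.mean(f_curr * b_prev) / np.mean(b_prev**2)   # Eq. (8.49)
        f_next = f_curr + k_i * b_prev         # f_{i+1}[k]  (Eq. 8.44)
        b_next = k_i * f_curr + b_prev         # b_{i+1}[k]  (Eq. 8.45)
        return k_i, f_next, b_next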

Now let us consider the correlation across subsequent stages when the optimal value of k_i is used. We will consider the forward errors:

E[f_{i+1}[k] f_i[k]] = E[(f_i[k] + k_i b_i[k - 1]) f_i[k]]    (8.50)

Plugging in the optimal value of k_i* we observe

E[f_{i+1}[k] f_i[k]] = E[ ( f_i[k] - (f_i[k] b_i[k - 1] / b_i^2[k - 1]) b_i[k - 1] ) f_i[k] ] = 0    (8.51)

In other words, the forward error stages are uncorrelated when the coefficients are fully adapted. The same is true for the backward error stages.