

A Simplified Derivation of Linear Least Square Smoothing and Prediction Theory*

H. W. BODE†, SENIOR MEMBER, IRE, AND C. E. SHANNON†, FELLOW, IRE

Summary-The central results of the Wiener-Kolmogoroff smoothing and prediction theory for stationary time series are developed by a new method. The approach is motivated by physical considerations based on electric circuit theory and does not involve integral equations or the autocorrelation function. The cases treated are the "infinite lag" smoothing problem, the case of pure prediction (without noise), and the general smoothing-prediction problem. Finally, the basic assumptions of the theory are discussed in order to clarify the question of when the theory will be appropriate, and to avoid possible misapplication.

I. INTRODUCTION

IN A CLASSIC REPORT written for the National Defense Research Council,¹ Wiener has developed a mathematical theory of smoothing and prediction of considerable importance in communication theory. A similar theory was independently developed by Kolmogoroff² at about the same time. Unfortunately the work of Kolmogoroff and Wiener involves some rather formidable mathematics (Wiener's yellow-bound report soon came to be known among bewildered engineers as "The Yellow Peril"), and this has prevented the wide circulation and use that the theory deserves. In this paper the chief results of smoothing theory will be developed by a new method which, while not as rigorous or general as the methods of Wiener and Kolmogoroff, has the advantage of greater simplicity, particularly for readers with a background of electric circuit theory. The mathematical steps in the present derivation have, for the most part, a direct physical interpretation, which enables one to see intuitively what the mathematics is doing.

* Decimal classification: 510. Original manuscript received by the Institute, July 13, 1949; revised manuscript received, January 17, 1950.
† Bell Telephone Laboratories, Inc., Murray Hill, N. J.
¹ N. Wiener, "The Interpolation, Extrapolation, and Smoothing of Stationary Time Series," National Defense Research Committee; reprinted as a book, together with two expository papers by N. Levinson, published by John Wiley and Sons, Inc., New York, N. Y., 1949.
² A. Kolmogoroff, "Interpolation und Extrapolation von stationären zufälligen Folgen," Bull. Acad. Sci. (URSS) Ser. Math. 5, pp. 3-14; 1941.

II. THE PROBLEM AND BASIC ASSUMPTIONS

The main problem to be considered may be formulated as follows. We are given a perturbed signal f(t) which is the sum of a true signal s(t) and a perturbing noise n(t):

$$f(t) = s(t) + n(t).$$

It is desired to operate on f(t) in such a way as to obtain, as well as possible, the true signal s(t). More generally, one may wish to combine this smoothing operation with prediction, i.e., to operate on f(t) in such a way as to obtain a good approximation to what s(t) will be in the future, say α seconds from now, or to what it was in the past, α seconds ago. In these cases we wish to approximate s(t+α), with α positive or negative, respectively. The situation is indicated schematically in Fig. 1; the problem is that of filling the box marked "?".
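To fix ideas numerically, the following minimal sketch (ours, not from the paper) builds a band-limited "true" signal, adds wide-band noise, and measures the mean-square error left by the trivial choice g(t) = f(t); the later sections are about doing better than this.

```python
import numpy as np

# Minimal numerical model of Fig. 1 (our construction, not from the paper):
# a band-limited "true" signal s(t), a wide-band perturbing noise n(t),
# and the observed f(t) = s(t) + n(t).
rng = np.random.default_rng(0)
T = 10_000
s = np.convolve(rng.normal(size=T), np.ones(50) / 50, mode="same")  # slow signal
n = 0.3 * rng.normal(size=T)                                        # fast noise
f = s + n

def mean_square_error(estimate, target):
    # Time average standing in for the ensemble average of [g(t) - s(t+a)]^2.
    return np.mean((estimate - target) ** 2)

# Doing nothing at all (g = f) leaves the full noise power as error:
print(mean_square_error(f, s))   # about 0.09, the variance of n(t)
```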


It will be seen that this problem and its generalizations are of wide application, not only in communication theory, but also in such diverse fields as economic prediction, weather forecasting, gunnery, statistics, and the like.

Fig. 1-The smoothing and prediction problem. [The perturbed signal f(t) = s(t) + n(t) enters the box marked "?", whose output g(t) is to approximate s(t+α).]

The Wiener-Kolmogoroff theory rests on three main assumptions which determine the range of application of the results. These assumptions are:

1. The time series represented by the signal s(t) and the noise n(t) are stationary. This means essentially that the statistical properties of the signal and of the noise do not change with time. The theory cannot properly be applied, for example, to long-term economic effects, since the statistics of, say, the stock market were not the same in 1850 as they are today.

2. The criterion of error of approximation is taken to be the mean-square discrepancy between the actual output and the desired output. In Fig. 1 this means that we fill the box "?" in such a way as to minimize the mean-square error $\overline{[g(t)-s(t+\alpha)]^2}$, the average being taken over all possible signal and noise functions, with each weighted according to its probability of occurrence. This is called the ensemble average.

3. The operation to be used for prediction and smoothing is assumed to be a linear operation on the available information, or, in communication terms, the box is to be filled with a linear, physically realizable filter. The available information consists of the past history of the perturbed signal, i.e., the function f(t) with t < t₁, where t₁ is the present time. A linear, physically realizable filter performs a linear operation on f(t) over just this range, as we will see later in connection with equations (3) and (4).

The theory may therefore be described as linear least square prediction and smoothing of stationary time series. It should be clearly realized that the theory applies only when these three assumptions are satisfied, or at least are approximately satisfied. If any one of the conditions is changed or eliminated, the prediction and smoothing problem becomes very difficult mathematically, and little is known about usable explicit solutions. Some of the limitations imposed by these assumptions will be discussed later.

How is it possible to predict at all the future behavior

of a function when all that is known is a perturbed version of its past history? This question is closely associated with the problems of causality and induction in philosophy and with the significance of physical laws. In general, physical prediction depends basically on an assumption that regularities which have been observed in the past will obtain in the future. This assumption can never be proved deductively, i.e., by purely mathematical argument, since we can easily conceive mathematical universes in which the assumption fails. Neither can it be established inductively, i.e., by a generalization from experiments, for this very generalization would assume the proposition we were attempting to establish. The assumption can be regarded only as a central postulate of physics.

Classical physics attempted to reduce the physical world to a set of strict causal laws. The future behavior of a physical system is then exactly predictable from a knowledge of its past history, and in fact all that is required is a knowledge of the present state of the system. Modern quantum physics has forced us to abandon this view as untenable. The laws of physics are now believed to be only statistical laws, and the only predictions are statistical predictions. The "exact" laws of classical physics are subject to uncertainties which are small when the objects involved are large, but are relatively large for objects on the atomic scale.

Linear least square smoothing and prediction theory is based on statistical prediction. The basic assumption that statistical regularities of the past will hold in the future appears in the mathematics as the assumption that the signal and noise are stationary time series. This implies, for example, that a statistical parameter of the signal averaged over the past will give the same value as this parameter averaged over the future.

The prediction depends essentially on the existence of correlations between the future value of the signal s(t₁+α), where t₁ is the present time, and the known data f(t) = s(t) + n(t) for t ≤ t₁. The assumption that the prediction is to be done by a linear operation implies that the only type of correlation that can be used is linear correlation, i.e., $\overline{s(t_1+\alpha)f(t)}$. If this correlation were zero for all t ≤ t₁, no significant linear prediction would be possible, as will appear later. The best mean-square estimate of s(t₁+α) would then be zero.
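This dependence on correlation is easy to exhibit numerically. The sketch below (ours, not from the paper) uses the simplest possible linear predictor, a single weight applied to the present sample, with the weight chosen by the least square rule that appears later as equation (29):

```python
import numpy as np

# Sketch (ours): the simplest linear predictor, a single weight applied to the
# present sample, g(t1) = a * f(t1). The least square weight is the correlation
# ratio a = mean(s(t+k) f(t)) / mean(f(t)^2), the one-variable case of the
# principle derived as (29) later in the paper.
rng = np.random.default_rng(1)
T, k = 100_000, 20                        # k = prediction lead in samples
s = np.convolve(rng.normal(size=T), np.ones(50) / 50, mode="same")
f = s                                     # pure prediction: no noise

a = np.mean(s[k:] * f[:-k]) / np.mean(f**2)
relative_error = np.mean((s[k:] - a * f[:-k]) ** 2) / np.mean(s**2)
print(a, relative_error)                  # strong correlation: error well below 1

white = rng.normal(size=T)                # an uncorrelated series for contrast:
a0 = np.mean(white[k:] * white[:-k]) / np.mean(white**2)
print(a0)                                 # about 0, so the best estimate is ~0
```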

III. PROPERTIES OF LINEAR FILTERS

In this section, a number of well-known results concerning filters will be summarized for easy reference. A linear filter can be characterized in two different but equivalent ways. The first and most common description is in terms of the complex transfer function Y(ω). If a pure sine wave of angular frequency ω₁ and amplitude E is used as input to the filter, the output is also a pure sine wave of frequency ω₁ and amplitude |Y(ω₁)| E. The phase of the output is advanced by the angle of Y(ω₁), the phase of the filter at this frequency. It is frequently convenient to write the complex transfer function Y(ω) in the form

$$Y(\omega) = e^{A(\omega)}\, e^{iB(\omega)}$$

where A(ω) = log |Y(ω)| is the gain, and B(ω) = angle[Y(ω)] is the phase.

Since we will assume that the filter can contain an ideal amplifier as well as passive elements, we can add any constant to A to make the absolute level of the gain as high as we please.


The second characterization of a filter is in terms of time functions. Let K(t) be the inverse Fourier transform of Y(ω):

$$K(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} Y(\omega)\, e^{i\omega t}\, d\omega. \tag{1}$$

Then Y(ω) is the direct Fourier transform of K(t):

$$Y(\omega) = \int_{-\infty}^{\infty} K(t)\, e^{-i\omega t}\, dt. \tag{2}$$

Knowledge of K(t) is completely equivalent to knowledge of Y(ω); either of these may be calculated if the other is known.

The time function K(t) is equal to the output obtained from the filter in response to a unit impulse impressed upon its input at time t = 0, as illustrated by Fig. 2. From this relation we can readily obtain the response

Fig. 2-Impulsive response of a network. [A unit impulse into the filter Y(ω) produces the output K(t).]

of the filter to any arbitrary input f(t). It is merely necessary to divide the input wave into a large number of thin vertical slices, as shown by Fig. 3. Each slice can be regarded as an impulse of strength f(t)dt, which will produce a response f(t)dt·K(t₁−t) at any subsequent time t₁. Upon adding together the contributions of all the slices we have the well-known formula

$$g(t_1) = \int_{-\infty}^{\infty} f(t)\, K(t_1 - t)\, dt \tag{3}$$

for the total response at t₁.

Fig. 3-Response to an arbitrary input as a sum of impulsive responses.

For the study of smoothing theory, (3) can conveniently be replaced by a slightly different expression. Setting τ = t₁ − t, we have

$$g(t_1) = \int_{0}^{\infty} f(t_1 - \tau)\, K(\tau)\, d\tau. \tag{4}$$

In this formulation, τ stands for the age of the data, so that f(t₁−τ) represents the value of the input wave τ seconds ago. K(τ) is a function like the impulsive admittance, but projecting into the past rather than the future, as shown by Fig. 4. It is evidently a weighting function by which the voltage inputs in the past must be multiplied to determine their contributions to the present output.

Fig. 4-Response as a weighted average of the past input.

Criteria for physical realizability can be given in terms of either the K function or Y. In terms of the impulsive response K(t), it is necessary that K(t) be zero for t < 0; that is, the network cannot respond to an impulse before the impulse arrives. Furthermore, K(t) must approach zero (with reasonable rapidity) as t → +∞. Thus the effect of an impulse at the present time should eventually die out.

These requirements are also meaningful in terms of the interpretation of K as a weighting function. Thus, the filter cannot apply a weighting to parts of the input that have yet to occur; hence K(τ) = 0 for τ < 0. Also the effect of the very remote past should gradually die out, so that K(τ) should approach zero as τ → ∞. It may also be noted that these conditions are also sufficient for physical realizability in the sense that any impulsive response K(t) satisfying them can be approximated as closely as desired with a passive lumped element network, together with a single amplifier.
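As a quick numerical illustration of (4) (ours, not from the paper), the convolution below applies a causal exponential weighting K(τ) = e^(−τ), which satisfies the realizability conditions just stated, to a record of past inputs:

```python
import numpy as np

# Sketch (ours) of equation (4): the output g(t1) as a weighted average of the
# past of the input, using the causal weighting K(tau) = exp(-tau), tau >= 0,
# which satisfies the realizability conditions just stated.
dt = 0.01
tau = np.arange(0, 10, dt)           # ages of the data, tau >= 0 only
K = np.exp(-tau)                     # weighting applied to data tau seconds old

rng = np.random.default_rng(2)
f = rng.normal(size=5_000)           # an arbitrary input wave

# np.convolve accumulates f(t1 - tau) K(tau) over tau >= 0, so each output
# sample depends only on present and past inputs.
g = np.convolve(f, K)[: len(f)] * dt
print(g[:5])
```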

In terms of frequency response, the principal condition for physical realizability is that Y(ω), considered as a function of the complex variable ω, must be an analytic function in the half plane defined by Im(ω) < 0. In addition, the function must behave on the real frequency axis³ in such a way that

$$\int_{-\infty}^{\infty} \frac{\left|\log |Y(\omega)|\right|}{1+\omega^2}\, d\omega \tag{5}$$

is a finite number.

The requirements of physical realizability lead to the well-known loss-phase relations. For a given gain A(ω) = log |Y(ω)| satisfying (5), there is a minimum possible phase characteristic. This phase is given by

$$B(\omega_0) = \frac{2\omega_0}{\pi} \int_{0}^{\infty} \frac{A(\omega) - A(\omega_0)}{\omega^2 - \omega_0^2}\, d\omega. \tag{6}$$

If the square of the prescribed gain, |Y(ω)|² = Y(ω)Ȳ(ω), is a rational function of ω, say P₁(ω)/P₂(ω) where P₁(ω) and P₂(ω) are polynomials, the minimum phase characteristic can be found as follows: Calculate the roots of P₁(ω) and P₂(ω) and write |Y(ω)|² as

$$|Y(\omega)|^2 = k^2\, \frac{(\omega-\alpha_1)(\omega-\bar\alpha_1)(\omega-\alpha_2)(\omega-\bar\alpha_2)\cdots}{(\omega-\beta_1)(\omega-\bar\beta_1)(\omega-\beta_2)(\omega-\bar\beta_2)\cdots} \tag{7}$$

³ Including the point at infinity. Actual physical networks will, of course, always have zero gain at infinite frequency, and the above requirement shows that the approach to zero cannot be too rapid. An approach of the type ω⁻ⁿ (6n db per octave) is possible, but e^(−|ω|) or e^(−ω²) causes the integral in (5) to diverge and is physically unrealizable.


where α₁, α₂, ⋯ and β₁, β₂, ⋯ all have imaginary parts > 0. That is, these are the roots and poles of |Y(ω)|² in the upper-half plane, and the conjugate terms are the corresponding roots and poles in the lower-half plane. The minimum phase network then has the transfer function

$$Y(\omega) = k\, \frac{(\omega-\alpha_1)(\omega-\alpha_2)\cdots}{(\omega-\beta_1)(\omega-\beta_2)\cdots}. \tag{8}$$

A minimum phase network has the important property that its inverse, with the transfer function Y⁻¹(ω), is also physically realizable.⁴ If we pass a signal f(t) through the filter Y(ω), we can recover it in its original form by passing it through the inverse filter. Moreover, the recovery takes place without loss of time. On the other hand, there is no physically realizable exact inverse for a nonminimum phase network. The best we can do is to provide a structure which has all the properties of the theoretical inverse, except for an extra phase lag. The extra phase lag can be equalized to give a constant delay by the addition of a suitable phase equalizer, but it cannot be eliminated. Thus, if we transmit a signal through a nonminimum phase network, we can recover it only after a delay; that is, we obtain f(t−a) for some positive a.

⁴ If the original function has a zero at infinity, so that the required inverse has a pole there, there are complications, but an adequate approximation can be obtained in physical cases.
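The root-sorting recipe of (7) and (8) is mechanical enough to hand to a computer. The sketch below (ours, not from the paper) factors the example gain squared (ω²+1)/(ω²+4); numpy supplies the polynomial roots, and the minimum phase factor keeps those in the upper-half plane:

```python
import numpy as np

# Sketch (ours) of the factorization (7)-(8) for the example
# |Y(w)|^2 = (w^2 + 1) / (w^2 + 4), i.e. P1(w) = w^2 + 1, P2(w) = w^2 + 4.
P1 = np.array([1.0, 0.0, 1.0])    # roots +i and -i
P2 = np.array([1.0, 0.0, 4.0])    # roots +2i and -2i

alphas = [r for r in np.roots(P1) if r.imag > 0]   # upper-half-plane roots
betas = [r for r in np.roots(P2) if r.imag > 0]    # upper-half-plane poles

def Y_min_phase(w, k=1.0):
    # Equation (8): Y(w) = k (w - alpha_1)(w - alpha_2).../(w - beta_1)...
    num = np.prod([w - a for a in alphas], axis=0)
    den = np.prod([w - b for b in betas], axis=0)
    return k * num / den

# Check that |Y(w)|^2 reproduces the prescribed gain squared on the real axis.
w = np.linspace(-5.0, 5.0, 11)
print(np.allclose(np.abs(Y_min_phase(w)) ** 2, (w**2 + 1) / (w**2 + 4)))  # True
```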

IV. GENERAL EXPRESSION FOR THE MEAN-SQUARE ERROR

Suppose we use for the predicting-smoothing filter in Fig. 1 a filter with transfer characteristic Y(ω). What is the mean-square error in the prediction? Since different frequencies are incoherent, we can calculate the average power in the error function

$$e(t) = s(t + \alpha) - g(t) \tag{9}$$

by adding the contributions due to different frequencies. Consider the components of the signal and noise of a particular frequency ω₁. It will be assumed that the signal and noise are incoherent at all frequencies. Then, at frequency ω₁ there will be a contribution to the error due to noise equal to N(ω₁)|Y(ω₁)|², where N(ω₁) is the average noise power at that frequency.

There is also a contribution to the error due to the failure of components of the signal, after passing through the filter, to be correct. A component of frequency ω₁ should be advanced in phase by αω₁, and the amplitude of the output should be that of the input. Hence there will be a power error

$$\left| Y(\omega_1) - e^{i\alpha\omega_1} \right|^2 P(\omega_1) \tag{10}$$

where P(ω₁) is the power in the signal at frequency ω₁.

The total mean-square error due to components of frequency ω₁ is the sum of these two errors, or

$$E = |Y(\omega_1)|^2 N(\omega_1) + \left| Y(\omega_1) - e^{i\alpha\omega_1} \right|^2 P(\omega_1), \tag{11}$$


and the total mean-square error for all frequencies is

$$E = \int_{-\infty}^{\infty} \left[\, |Y(\omega)|^2 N(\omega) + \left| Y(\omega) - e^{i\alpha\omega} \right|^2 P(\omega)\, \right] d\omega. \tag{12}$$

The problem is to minimize E by proper choice of Y(ω), remembering that Y(ω) must be physically realizable.

Several important conclusions can be drawn merely from an inspection of (12). The only way in which the signal and noise enter this equation is through their power spectra. Hence, the only statistics of the signal and noise that are needed to solve the problem are these spectra. Two different types of signal with the same spectrum will lead to the same optimal prediction filter and to the same mean-square error. For example, if the signal is speech it will be predicted by the same filter as would be used for prediction of a thermal noise which has been passed through a filter to give it the same power spectrum as speech.

Speaking somewhat loosely, this means that a linear filter can make use only of statistical data pertaining to the amplitudes of the different frequency components; the statistics of the relative phase angles of these components cannot be used. Only by going to nonlinear prediction can such statistical effects be used to improve the prediction.

It is also clear that in the linear least square problem we can, if we choose, replace the signal and noise by any desired time series which have the same power spectra. This will not change the optimal filter or the mean-square error in any way.
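Equation (12) is straightforward to evaluate numerically for any candidate Y(ω). In the sketch below (ours, with an assumed rational signal spectrum and an assumed flat noise spectrum), the all-pass choice Y = 1 is compared with the frequency weighting P/(P+N) that Section V will show to be optimal when realizability is waived:

```python
import numpy as np

# Sketch (ours): numerical evaluation of the error integral (12) for a candidate
# Y(w), with an assumed signal spectrum P, an assumed flat noise spectrum N,
# and prediction lead a (here a = 0, pure smoothing).
w = np.linspace(-50.0, 50.0, 200_001)
dw = w[1] - w[0]
P = 1.0 / (1.0 + w**2)
N = 0.1 * np.ones_like(w)
a = 0.0

def error(Y):
    # E = integral of |Y|^2 N + |Y - e^{i a w}|^2 P over w, equation (12).
    return np.sum(np.abs(Y) ** 2 * N + np.abs(Y - np.exp(1j * a * w)) ** 2 * P) * dw

print(error(np.ones_like(w, dtype=complex)))   # Y = 1 passes all the noise: E = 10
print(error((P / (P + N)).astype(complex)))    # the Section V optimum: E ~ 0.95
```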

V. THE PURE SMOOTHING PROBLEM

The chief difficulty in minimizing (12) for the mean-square error lies in properly introducing the condition that Y(ω) must be a physically realizable transfer function. We will first solve the problem with this constraint waived, and then from this solution construct the best physically realizable filter.

Waiving the condition of physical realizability is equivalent to admitting any Y(ω), or, equivalently, any impulsive response K(t). Thus, K(t) is not necessarily zero for t < 0, and we are allowing a weighting function to be applied to both the past and future of f(t). In other words, we assume that the entire function f(t) = s(t)+n(t) from t = −∞ to t = +∞ is available for use in prediction.

In (12), suppose

$$Y(\omega) = C(\omega)\, e^{iB(\omega)} \tag{13}$$

with C(ω) and B(ω) real. Then (12) becomes

$$E = \int_{-\infty}^{\infty} \left[\, C^2 N + P\left( C^2 + 1 - 2C \cos(\alpha\omega - B) \right) \right] d\omega \tag{14}$$

where C(ω), N(ω), and the like are written as C, N, and so forth, for short. Clearly, the best choice of B(ω) is B(ω) = αω, since this maximizes cos(αω − B(ω)). Then (14) becomes


$$E = \int_{-\infty}^{\infty} \left[\, C^2(P + N) - 2PC + P\, \right] d\omega. \tag{15}$$

Adding and subtracting P²/(P+N), we obtain

$$E = \int_{-\infty}^{\infty} \left[\, C^2(P+N) - 2PC + \frac{P^2}{P+N} + P - \frac{P^2}{P+N}\, \right] d\omega \tag{16}$$

or

$$E = \int_{-\infty}^{\infty} \left[\, (P+N)\left( C - \frac{P}{P+N} \right)^2 + \frac{PN}{P+N}\, \right] d\omega. \tag{17}$$

The bracketed term is the square of a real number, and therefore positive or zero. Clearly, to minimize E we choose C to make this term everywhere zero, thus

$$C(\omega) = \frac{P(\omega)}{P(\omega) + N(\omega)}$$

and

$$Y(\omega) = \frac{P(\omega)}{P(\omega) + N(\omega)}\, e^{i\alpha\omega}. \tag{18}$$

With this choice of Y(ω) the mean-square error will be, from (17),

$$E = \int_{-\infty}^{\infty} \frac{P(\omega)N(\omega)}{P(\omega) + N(\omega)}\, d\omega. \tag{19}$$

The best weighting function is given by the inverse Fourier transform of (18):

$$K(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{P(\omega)}{P(\omega) + N(\omega)}\, e^{i\omega(t+\alpha)}\, d\omega. \tag{20}$$

This K(t) will, in general, extend from t = −∞ to t = +∞. It does not represent the impulsive response of a physical filter. However, it is a perfectly good weighting function. If we could wait until all the function s(t)+n(t) is available, it would be the proper one to apply in estimating s(t+α).

To put the question in another way, the weighting K(τ) can be obtained in a physical filter if sufficient delay is allowed, so that K(τ) is substantially zero for the future. Thus we have solved here the "infinite lag" smoothing problem. Although Y(ω) in (18) is nonphysical, Y(ω)e^(−iβω) will be physical, or nearly so, if β is taken sufficiently large.
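A numerical rendering of (18)-(20), as a sketch (ours) under the spectra assumed earlier: the inverse transform is approximated by a Riemann sum, and the resulting K(t) is visibly two-sided, confirming that the ideal smoother is nonphysical without added delay.

```python
import numpy as np

# Sketch (ours) of equation (20): the ideal "infinite lag" weighting function
# K(t), computed as a Riemann-sum inverse transform of P/(P+N) (a = 0).
n_pts, w_max = 16_384, 200.0
w = np.linspace(-w_max, w_max, n_pts, endpoint=False)
dw = w[1] - w[0]
P = 1.0 / (1.0 + w**2)
N = 0.1 * np.ones_like(w)

Y_opt = P / (P + N)                       # equation (18) with a = 0; nonphysical

t = np.linspace(-3.0, 3.0, 13)
K = (np.exp(1j * np.outer(t, w)) @ Y_opt) * dw / (2.0 * np.pi)
print(np.round(K.real, 4))   # two-sided and symmetric in t: K(t) != 0 for t < 0,
                             # so a physical filter needs delay to realize it
```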

VI. THE PURE PREDICTION PROBLEM

We will now consider another special case, that in which there is no perturbing noise. The problem is then one of pure prediction. What is the best estimate of s(t+α) when we know s(t) from t = −∞ up to t = 0?

We have seen that the solution will depend only on the power spectra of the signal and noise, and since we are now assuming the noise to be identically zero, the solution depends only on the power spectrum P(ω) of the signal. This being the case, we may replace the actual signal by any other having the same spectrum. The solution of the best predicting filter will be the same for the altered problem as for the original problem.

Any desired spectrum P(ω) can be obtained by passing wide-band resistance noise or "white" noise through a shaping filter whose gain characteristic is √P(ω). The spectrum of resistance noise is flat (at least out to frequencies higher than any of importance in communication work), and the filter merely multiplies this constant spectrum by the square of the filter gain, P(ω). The phase characteristic of the filter can be chosen in any way consistent with the conditions of physical realizability. Let us choose the phase characteristic so that the filter is minimum phase for the gain √P(ω). Then the filter has a phase characteristic given by

$$B(\omega_0) = \frac{\omega_0}{\pi} \int_{0}^{\infty} \frac{\log P(\omega) - \log P(\omega_0)}{\omega^2 - \omega_0^2}\, d\omega. \tag{21}$$

Furthermore, this minimum phase network has a physically realizable inverse.

We have now reduced the problem to the form shown in Fig. 5. What is actually available is the function s(t) up to t = 0. However, this is equivalent to a knowledge of the resistance noise h(t) up to t = 0, since the filter Y has a physically realizable inverse and we can pass the available function s(t) through the inverse Y⁻¹ to obtain h(t).

Fig. 5-Construction of actual signal spectrum from resistance noise. [Resistance noise h(t) drives the minimum phase filter Y(ω), of gain √P(ω), to produce the actual signal s(t).]

The problem, therefore, is equivalent to asking what is the best operation to apply to h(t) in order to approximate s(t+α) in the least square sense. The question is easily answered. A resistance noise can be thought of as made up of a large number of closely spaced and very short impulses, as indicated in Fig. 6. The impulses have a Gaussian distribution of amplitudes and are statistically independent of each other. Each of these impulses entering the filter Y produces an output corresponding to the impulsive response of the filter, as shown at the right of Fig. 6, and the signal s(t) is the sum of these elementary responses.

Fig. 6-Result of resistance noise input. [A series of individual impulses enters Y(ω); s(t) is the sum of the individual responses K(t).]

What is known is h(t) up to the present; that is, we know effectively the impulses up to t = 0 and nothing about those after t = 0; these have not yet occurred. The future signal s(t+α) is thus made up of two parts: the


tails of responses due to impulses that have already occurred, and a part due to impulses which will occur between the present time and time t = α. The first part is completely predictable, while the second part is entirely unpredictable, being statistically independent of our available information at the present time.

The total result of the first part can be obtained by constructing a filter whose impulsive response is the tail of the impulsive response of filter Y moved ahead α seconds. This is shown in Fig. 7, where K₁(t) is the new impulsive response and K(t) the old one. The new filter responds to an impulse entering now as the filter Y will respond in α seconds. It responds to an impulse that entered one second ago as Y will respond in α seconds to one that entered it one second ago. In short, if h(t) is used as input to this new filter Y₁, the output now will be the predictable part of the future response of Y to the same input α seconds from now.

Fig. 7-Construction of the physical response K₁(t) from K(t).

The second, or unpredictable, part of the future response, corresponding to impulses yet to occur, cannot, of course, be constructed. We know, however, that the mean value of this part must be zero, since future impulses are as likely to be of one sign as the other. Thus the arithmetic average, or center of gravity, of the possible future responses is the predictable part given by the output of Y₁. But it is well known that the arithmetic mean of any distribution is the point about which the mean-square error is the least. The output of Y₁ is thus the desired prediction of s(t+α).

In constructing Y₁ we assumed that we had available the white noise h(t). Actually, however, our given data is the signal s(t). Consequently, the best operation on the given data is Y₁(ω)Y⁻¹(ω), the factor Y⁻¹(ω) reducing the function s(t) to the white noise h(t), and the second operation Y₁(ω) performing the best prediction based on h(t).

The solution may be summarized as follows:

1. Determine the minimum phase network having the gain characteristic √P(ω). Let the complex transfer characteristic of this filter be Y(ω), and its impulsive response K(t).

2. Construct a filter whose impulsive response is

$$K_1(t) = \begin{cases} K(t+\alpha) & t \ge 0 \\ 0 & t < 0. \end{cases} \tag{22}$$

Let the transfer characteristic of this network be Y₁(ω).

3. The optimal least square predicting filter then has the characteristic

$$Y_1(\omega)\, Y^{-1}(\omega). \tag{23}$$

The mean-square error E in the prediction is easily calculated. The error is due to impulses occurring from time t = 0 to t = α. Since these impulses are uncorrelated, the mean-square sum of the errors is the sum of the individual mean-square errors. The individual pulses are effective in causing mean-square error in proportion to the square of K(α−t). Hence, the total mean-square error will be given by

$$E = p \int_{0}^{\alpha} K^2(\alpha - t)\, dt = p \int_{0}^{\alpha} K^2(t)\, dt \tag{24}$$

where p = ∫P(ω)dω is the mean-square signal. By a similar argument the mean-square value of s(t+α) will be

$$\sigma^2 = p \int_{0}^{\infty} K^2(\tau)\, d\tau \tag{25}$$

and the relative error of the prediction may be measured by the ratio of the root-mean-square error to the root-mean-square value of s(t+α), i.e.,

$$\left[ \frac{\int_{0}^{\alpha} K^2(t)\, dt}{\int_{0}^{\infty} K^2(t)\, dt} \right]^{1/2}. \tag{26}$$

The prediction will be relatively poor if the area under the curve K(t)² out to α is large compared to the total area, and good if it is small compared to the total. It is evident from (26) that the relative error starts at zero for α = 0 and is a monotonically increasing function of α which approaches unity as α → ∞.
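As a worked illustration (ours; the paper does not treat a specific spectrum here), take P(ω) = 1/(1+ω²). The minimum phase filter with gain √P(ω) is Y(ω) = 1/(1+iω), with impulsive response K(t) = e^(−t) for t ≥ 0. Step 2 then gives K₁(t) = K(t+α) = e^(−α)e^(−t), so Y₁(ω) = e^(−α)Y(ω), and the predicting filter (23) collapses to the constant

$$Y_1(\omega)\, Y^{-1}(\omega) = e^{-\alpha};$$

the best least square prediction is simply the present value of the signal attenuated by e^(−α). The relative error (26) is

$$\left[ \frac{\int_0^{\alpha} e^{-2t}\, dt}{\int_0^{\infty} e^{-2t}\, dt} \right]^{1/2} = \sqrt{1 - e^{-2\alpha}},$$

which rises from 0 at α = 0 toward 1 as α → ∞, exactly as described above.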

There is an important special case in which a great deal more can be shown by the argument just given. In our analysis, the actual problem was replaced by one in which the signal was a Gaussian type of time series, derived from a resistance noise by passing it through a filter with a gain √P(ω). Suppose the signal is already a time series of this type. Then the error in prediction, due to the tails of impulses occurring between t = 0 and t = α, will have a Gaussian distribution. This follows from the fact that each impulse has a Gaussian distribution of amplitudes, and the sum of any number of effects, each Gaussian, will also be Gaussian. The standard deviation of this distribution of errors is just the root-mean-square error √E obtained from (24).

Stated another way, on the basis of the available data, that is, s(t) for t < 0, the future value of the signal s(t+α) is distributed according to a Gaussian distribution. The best linear predictor selects the center of this distribution for the predicted value. The actual future value will differ from this as indicated in Fig. 8, where the future value is plotted horizontally, and the probability density for various values of s(t+α) is plotted vertically.

It is clear that in this special case the linear prediction method is in a sense the best possible. The center of the Gaussian distribution remains the natural point to choose if we replace the least square criterion of the best prediction by almost any other reasonable criterion, such as the median value or the most probable value. Thus in this case a nonlinear computation would offer nothing which the linear process does not already provide. In the general case, on the other hand, the distribution of future values will not be Gaussian, and the shape of the distribution curve may vary from point to point depending upon the particular past history of the curve. Under these circumstances, a nonlinear scheme may offer improvements upon the linear process, and the exact characteristics of the optimal procedure will depend critically upon the criterion adopted for the best prediction.

Fig. 8-Distribution of prediction errors in the Gaussian case. [Probability density of the future value s(t+α), centered on the predicted value; the standard deviation is the root-mean-square error of (24).]

VII. PREDICTION IN THE PRESENCE OF NOISE

Now consider the general prediction and smoothing problem with noise present. The best estimate of s(t+α) is required when the function s(t)+n(t) is known from t = −∞ to the present. If s(t)+n(t) is passed through a filter whose gain is [P(ω)+N(ω)]^(−1/2), the result will be a flat spectrum which we can identify with white noise. Let Y₁(ω) be the transfer function of a filter having this gain characteristic and the associated minimum phase. Then both Y₁(ω) and the inverse Y₁⁻¹(ω) are physically realizable networks. Evidently, knowledge of the input of Y₁ and knowledge of its output are equivalent. The best linear operation on the output will give the same prediction as the corresponding best linear operation on the input.

If we knew the entire function s(t)+n(t) from t = −∞ to t = +∞, the best operation to apply to the input of Y₁(ω) would be that specified by (18). If we let B(ω) be the phase component of Y₁, this corresponds to the equivalent operation

$$Y_2(\omega) = \frac{P(\omega)}{\left[ P(\omega) + N(\omega) \right]^{1/2}}\, e^{i[\alpha\omega - B(\omega)]} \tag{27}$$

on the "white noise" output of Yi.Let the impulse response obtained from (27) be

K2(t). As illustrated by Fig. 9, K2(t) will, in general, con-tain tails extending to both t = + oo and t = - oo, thejunction between the two halves of the curve being dis-placed from the origin by the prediction time a. Theassociated K2(r) of Fig. 10 is, of course, the idealweighting function to be applied to the "white noise"output of Y1. But the only data actually available atr =0 are the impulses which may be thought of as oc-

Fig. 9-Possible function K2(t).

Fig. 10-Weighting function K2(r), corresponding to Fig. 9.

prediction? It seems natural to weight these as onewould if all data were available, and to weight the futurevalues zero (as we must to keep the filter physical).The fact that this is actually correct weighting when thevarious input impulses are statistically independent willnow be shown as a consequence of a general statisticalprinciple.Suppose we have a number of chance variables,

x₁, x₂, ⋯, xₙ which are statistically independent, or at least have the property that the mean product of any two, $\overline{x_m x_n}$, is equal to zero. These variables are to be interpreted as the amplitudes of the individual white noise impulses to which we are attempting to apply the weighting function of Fig. 10.

Let y be another chance variable, correlated with x₁, ⋯, xₙ, which we wish to estimate in the least square sense by performing a linear operation on x₁, ⋯, xₙ. In the problem at hand, y is the actual signal s(t) at the time α seconds from now. The predicted value will be

$$y' = \sum_{i=1}^{n} a_i x_i$$

and the mean-square error is

$$E = \overline{(y - y')^2} = \overline{\Big( y - \sum_i a_i x_i \Big)^2} = \overline{y^2} - 2\sum_{i=1}^{n} a_i \overline{x_i y} + \sum_{i,j=1}^{n} a_i a_j \overline{x_i x_j} = \overline{y^2} - 2\sum_{i=1}^{n} a_i \overline{x_i y} + \sum_{i=1}^{n} a_i^2 \overline{x_i^2} \tag{28}$$

since all terms in the double sum vanish except those for which i = j. We seek to minimize E by proper choice of the aᵢ. Setting the partial derivatives with respect to aᵢ equal to zero, we have

$$\frac{\partial E}{\partial a_i} = -2\overline{x_i y} + 2a_i \overline{x_i^2} = 0$$


or

$$a_i = \frac{\overline{x_i y}}{\overline{x_i^2}}. \tag{29}$$

The important fact about this calculation is that each of the n minimizing equations involves only the aᵢ in question; ∂E/∂a₁ involves only a₁, etc. In other words, minimizing E on all the aᵢ is equivalent to minimizing separately on the individual aᵢ; a₁ should have the value $\overline{x_1 y}/\overline{x_1^2}$ whatever values are assigned to the other a's.
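The decoupling expressed by (29) is easy to confirm numerically. In this sketch (ours), the coefficients computed one at a time agree with a jointly optimized least squares fit, precisely because the x's are independent:

```python
import numpy as np

# Sketch (ours) of equation (29): with independent x_i, each weight
# a_i = mean(x_i y) / mean(x_i^2) may be found separately, and the set agrees
# with a jointly optimized linear least squares fit.
rng = np.random.default_rng(3)
m, n = 200_000, 4
X = rng.normal(size=(m, n)) * np.array([1.0, 2.0, 0.5, 3.0])   # independent columns
y = X @ np.array([0.7, -0.2, 1.5, 0.05]) + rng.normal(size=m)  # y correlated with x_i

a_separate = (X * y[:, None]).mean(axis=0) / (X**2).mean(axis=0)   # equation (29)
a_joint, *_ = np.linalg.lstsq(X, y, rcond=None)                    # minimize E jointly

print(np.round(a_separate, 3))   # ~ [0.7, -0.2, 1.5, 0.05]
print(np.round(a_joint, 3))      # the same, since cross terms x_i x_j average to 0
```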

Returning now to the prediction and smoothing problem, the function K₂(τ) gives the proper weighting to be attached to the impulses if we could use them all. Requirements of physical realizability demand that future impulses, corresponding to τ < 0, be given weight zero. From the above statistical principle, those occurring in the past should still be given the weighting K₂(τ). In other words, the proper filter to apply to the input white noise has an impulse response zero for t < 0 and K₂(t) for t > 0.

To summarize, the solution consists of the following steps:

1. Calculate the minimum phase transfer function for the gain [P(ω)+N(ω)]^(−1/2). Let this be Y₁(ω).

2. Let

$$Y_2(\omega) = \frac{P(\omega)}{P(\omega) + N(\omega)}\, Y_1^{-1}(\omega).$$

This is a nonphysical transfer function. Let its Fourier transform be K₂(t).

3. Set K₃(t) = K₂(t+α) for t ≥ 0 and K₃(t) = 0 for t < 0. That is, cut off the first α seconds of K₂(t) and shift the remaining tail over to t = 0. This is the impulse response of a physical network, and is the optimal operation on the past history of the white noise input. Let the corresponding transfer function be Y₃(ω).

4. Construct Y₄(ω) = Y₃(ω)Y₁(ω). This is the optimal smoothing and prediction filter, as applied to the actual given s(t)+n(t).

As in the pure prediction problem, if the signal and noise happen to be Gaussian time series, the linear prediction is an absolute optimum among all prediction operations, linear or not. Furthermore, the distribution of values of s(t+α), when f(t) is known for t < 0, is a Gaussian distribution.
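For readers who want to see the four steps run end to end, here is a discrete-time sketch (ours throughout; the cepstral method used for the minimum phase factorization, and the spectra P and N, are our choices, not the paper's, which works with rational spectra and the root-sorting of Section III):

```python
import numpy as np

# Sketch (ours) of steps 1-4 in discrete time.
n, alpha = 4096, 5                      # FFT grid size; prediction lead in samples
w = 2.0 * np.pi * np.arange(n) / n
P = 1.0 / (1.25 - np.cos(w))            # assumed signal spectrum
N = 0.5 * np.ones(n)                    # assumed flat noise spectrum

def min_phase_factor(S):
    # G(w) with |G(w)|^2 = S(w) and G minimum phase, by folding the cepstrum.
    c = np.fft.ifft(0.5 * np.log(S)).real
    fold = np.zeros(n)
    fold[0], fold[n // 2] = c[0], c[n // 2]
    fold[1 : n // 2] = 2.0 * c[1 : n // 2]
    return np.exp(np.fft.fft(fold))

G = min_phase_factor(P + N)             # step 1: Y1 = 1/G whitens s + n
Y2 = P / (P + N) * G                    # step 2: Y2 = P/(P+N) * Y1^{-1}, nonphysical
K2 = np.fft.ifft(Y2).real               # its two-sided impulse response

K3 = np.zeros(n)                        # step 3: K3(t) = K2(t + alpha) for t >= 0
half = n // 2                           # indices >= half hold the t < 0 tail
K3[: half - alpha] = K2[alpha : half]
Y3 = np.fft.fft(K3)

Y4 = Y3 / G                             # step 4: Y4 = Y3 Y1, acting on s + n
print(np.round(np.fft.ifft(Y4).real[:5], 4))   # a causal impulse response
```

Applying Y4 to a simulated record of s + n then yields the least square estimate of the signal alpha samples ahead.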

VIII. GENERALIZATIONS

This theory is capable of generalization in several directions. These generalizations will be mentioned only briefly, but can all be obtained by methods similar to those used above.

In the first place, we assumed the true signal and the noise to be uncorrelated. A relatively simple extension of the argument used in Section IV allows one to account for correlation between these time series.

A second generalization is to the case where there are several correlated time series, say f₁(t), f₂(t), ⋯, fₙ(t). It is desired to predict, say, s₁(t+α) from a knowledge of f₁, f₂, ⋯, fₙ.

Finally, the desired quantity may not be s(t+α) but, for example, s′(t+α), the future derivative of the true signal. In such a case, one may effectively reduce the problem to that already solved by taking derivatives throughout. The function f(t) is passed through a differentiator to produce g(t) = f′(t). The best linear prediction for g(t) is then determined.

IX. DISCUSSION OF THE BASIC ASSUMPTIONS

A result in applied mathematics is only as reliable as the assumptions from which it is derived. The theory developed above is especially subject to misapplication because of the difficulty in deciding, in any particular instance, whether the basic assumptions are a reasonable description of the physical situation. Anyone using the theory should carefully consider each of the three main assumptions with regard to the particular smoothing or prediction problem involved.

The assumption that the signal and noise are stationary is perhaps the most innocuous of the three, for it is usually evident from the general nature of the problem when this assumption is violated. The determination of the required power spectra P(ω) and N(ω) will often disclose any time variation of the statistical structure of the time series. If the variation is slow compared to the other time constants involved, such nonstationary problems may still be solvable on a quasi-stationary basis. A linear predictor may be designed whose transfer function varies slowly in such a way as to be optimal for the "local" statistics.

The least square assumption is more troublesome, for it involves questions of values rather than questions of fact. When we minimize the mean-square error we are, in effect, paying principal attention to the very large errors. The prediction chosen is one which, on the whole, makes these errors as small as possible, without much regard to relatively minor errors. In many circumstances, however, it is more important to make as many very accurate predictions as possible, even if we make occasional gross errors as a consequence. When the distribution of future events is Gaussian, it does not matter which criterion is used, since the most probable event is also the one with respect to which the mean-square error is the least. With lopsided or multimodal distributions, however, a real question is involved.

As a simple example, consider the problem of predicting whether tomorrow will be a clear day. Since clear days are in the majority, and there are no days with negative precipitation to balance days when it rains, we are concerned here with a very lopsided distribution. With such a curve, the average point, which is the one given by a prediction minimizing the mean-square error, might be represented by a day with a light drizzle. To a man planning a picnic, however, such a prediction would have no value. He is interested in the probability that the weather will really be clear. If the picnic must be called off because it in fact rains, the actual amount of precipitation is of comparatively little consequence.


As a second example, consider the problem of intercepting a bandit car attempting to flee down a network of roads. If the road on which the bandit car happens to be forks just ahead, it is clear that a would-be interceptor should station himself on one fork or the other, making the choice at random if necessary. The mean-square error in the interception would be least, however, if he placed himself in the fields beyond the fork. Problems similar to these may also arise in gunnery, where, in general, we are usually interested in the number of actual hits and "a miss is as good as a mile."

The third assumption, that of linearity, is neither a question of fact nor of evaluation, but a self-imposed limitation on the types of operations or devices to be used in prediction. The mathematical reason for this assumption is clear; linear problems are always much simpler than their nonlinear generalizations. In certain applications the linear assumption may be justified for one or another of the following reasons:

1. The linear predictor may be an absolutely optimal method, as in the Gaussian time series mentioned above.

2. Linear prediction may be dictated by the simplicity of mechanization. Linear filters are easy to synthesize, and there is an extensive relevant theory, with no corresponding theory for nonlinear systems.

3. One may use the linear theory merely because of the lack of any better approach. An incomplete solution is better than none at all.

How much is lost by restricting ourselves to linear prediction? The fact that nonlinear effects may be important in a prediction can be illustrated by returning to the problem of forecasting tomorrow's weather. We are all familiar with the fact that the pattern of events over a period of time may be more important than the happenings taken individually in determining what will come. For example, the sequence of events in the passage of a cold or warm front is characteristic. Moreover, the significance of a given happening may depend largely upon the intensity with which it occurs. Thus, a sharp dip in the barometer may mean that moderately unpleasant weather is coming. Twice as great a drop in the same time, on the other hand, may not indicate that the weather will be merely twice as unpleasant; it may indicate a hurricane.

As a final point, we may notice that the requirement that the prediction be obtained from a linear device and the objective of minimizing the mean-square error are not, in all problems, quite compatible with one another. The absolute best mean-square prediction (ignoring the assumption of linearity) would, of course, always pick the mean of the future distribution, i.e., the "center of gravity," since in any case this minimizes the mean-square error. In general, however, the position of this center of gravity will be a nonlinear function of the past history. When we require that the prediction be a linear operation on the past history, the mathematics is forced to compromise among the conflicting demands of various possible past histories. The compromise amounts essentially to averaging over all relative phases of the various components of the signal; any pertinent information contained in the relative phases cannot be used properly.

This can be illustrated by the familiar statistical problem of calculating a line or plane of regression to provide a linear least square estimate of one variable y from the knowledge of a set of variables correlated with y.⁵ The simplest such problem occurs when there is just one known variable x and one unknown variable y to be estimated from x. Fig. 11 shows three of the "scatter diagrams" used in statistics. The variable x may be, for example, a man's weight and y his height. A large population is sampled and plotted.

Fig. 11-Some scatter diagrams with lines and curves of regression. [(a) a roughly normal cloud with its line of regression; (b) and (c) distributions for which a curve of regression fits far better than the line of regression.]

It is then desired to estimate, or predict, a man's height, knowing only his weight. If we agree to use only linear operations, y must be calculated in the form y = ax. The best choice of a for least square prediction is $\overline{xy}/\overline{x^2}$, and the corresponding straight line is known as the line of regression. The case of a normal distribution corresponds to the Gaussian type noise in which the linear prediction is an absolute optimum.

Figs. 11(b) and 11(c) are scatter diagrams for other distributions of two variables. The lines of regression are now not nearly as good in predicting y as they were in Fig. 11(a). The requirement that the predicted value be a linear function of the known data requires a compromise which may be very serious. It is obvious in Figs. 11(b) and 11(c) that a much better estimate of y could be formed if we allowed nonlinear operations on x. In particular, functions of the form ax²+b and cx³+dx would be more suitable.
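The contrast is easy to reproduce. In this sketch (ours), y depends on x through x², as in Fig. 11(b); the line of regression is nearly useless, while the curve of regression ax²+b recovers the relation:

```python
import numpy as np

# Sketch (ours) of Fig. 11(b): the line of regression y = a x, with
# a = mean(xy)/mean(x^2), against the curve of regression a x^2 + b.
rng = np.random.default_rng(4)
x = rng.normal(size=100_000)
y = x**2 + 0.1 * rng.normal(size=100_000)     # a lopsided scatter, as in Fig. 11(b)

a = np.mean(x * y) / np.mean(x**2)            # best straight line through the origin
print(a, np.mean((y - a * x) ** 2))           # a ~ 0; the line leaves error ~2.0

coef = np.polyfit(x**2, y, 1)                 # the form a x^2 + b suggested above
print(coef, np.mean((y - np.polyval(coef, x**2)) ** 2))   # error drops to ~0.01
```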

In predicting y from two known variables x₁ and x₂ we can construct a scatter diagram in three dimensions. The linear prediction requires fitting the points with a plane of regression. If there are n known quantities x₁, x₂, ⋯, xₙ, we need an (n+1)-dimensional space, and the linear theory corresponds to a hyperplane of n dimensions.

The problem of smoothing and prediction for time series is analogous. What we are now dealing with, however, is the function space defined by all the values of f(t) for t < 0. The optimal linear predictor corresponds to a hyperplane in this function space.

⁵ P. G. Hoel, "Introduction to Mathematical Statistics," John Wiley and Sons, Inc., New York, N. Y.; 1947.
