Econometric Notes
30 December 2014
Houston H. Stokes
Department of Economics
University of Illinois at Chicago
[email protected]
An Overview of Econometrics
  Objective of Notes
  1. Purpose of statistics
  2. Role of statistics
  3. Basic Statistics
  4. More complex setup to illustrate Matlab to estimate the Model
  5. Review of Linear Algebra and Introduction to Programming Regression Calculations
     Figure 5.1 X'X for a random Matrix X
     Figure 5.2 3D plot of 50 by 50 X'X matrix where X is a random matrix
  6. A Sample Multiple Input Regression Model Dataset
     Figure 6.1 2-D Plots of Textile Data
     Figure 6.2 3-D Plot of Theil (1971) Textile Data
  7. Advanced Regression analysis
     Figure 7.1 Analysis of residuals of the YMA model
     Figure 7.2 Recursively estimated X1 and X3 coefficients for X1 Sorted Data
     Figure 7.3 CUSUM test of model estimated with sorted data
     Figure 7.4 CUMSQ test of y model estimated with sorted data
     Figure 7.5 Quandt Likelihood Ratio tests of y model estimated with sorted data
  8. Advanced concepts
  9. Summary
Objective of Notes
The objective of these notes is to introduce students to the basics of applied regression calculation using Stata setups of a number of very simple models. Computer code is shown to allow students to "get going" ASAP. More advanced sections show Matlab code to make the calculations. The notes are organized around the estimation of regression models and the use of basic statistical concepts. The textbooks Introduction to Econometrics by Christopher Dougherty (4th Edition, Oxford, 2011) or
Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge 5th Edition, South-Western Cengage 2013 can be used to provide added information. A number of examples from this book will be shown. Statistical analysis will be treated, both as a means by which the data can be summarized, and as a means by which it is possible to accept or reject a specific hypothesis. Four simple datasets are initially discussed:
- The Price vs Age of Cars dataset illustrates a simple 2 variable OLS model where graphics and correlation analysis can be used to detect relationships.
- The Theil (1971) Textile data set illustrates the use of log transformations and contrasts 2D and 3D graphic analysis of data. A variable with a low correlation was shown to enter an OLS model only in the presence of another variable.
- The Brownlee (1965) Stack Loss data set illustrates how in a multiple regression context, variables with "significant" correlation may not enter a full model.
- The Brownlee (1965) Stress data set illustrates the dangers of relying on correlation analysis.
Finally a number of statistical problems and procedures that might be used are discussed.
1. Purpose of statistics
- Summarize data
- Test models
- Allow one to generalize from a sample to the wider population.
2. Role of statistics
Quote from Stanley (1856) in a presidential address to Section F of the British Association for the Advancement of Science:
"The axiom on which ....(statistics) is based may be stated thus: that the laws by which nature is governed, and more especially those laws which operate on the moral and physical condition of the human race, are consistent, and are, in all cases best discoverable - in some cases only discoverable - by the investigation and comparison of phenomena extending over a very large number of individual instances. In dealing with MAN in the aggregate, results may be calculated with precision and accuracy of a mathematical problem... This then is the first characteristic of statistics as a science: that it proceeds wholly by the accumulation and comparison of registered facts; - that from these facts alone, properly classified, it seeks to deduce general principles, and that it rejects all a priori reasoning, employing hypothesis, if at all, only in a tentative manner, and subject to future verification"
(Note: underlining entered by H. H. Stokes)
3. Basic Statistics
Key concepts:
- Mean: x̄ = (1/N) Σ x_i
- Median = middle data value
- Mode = data value with most cases
- Population Variance: σ² = (1/N) Σ (x_i − μ)²
- Sample Variance: s² = (1/(N−1)) Σ (x_i − x̄)²
- Population Standard Deviation: σ
- Sample Standard Deviation: s
- Confidence Interval with k% => a range of data values
- Correlation: r
- Regression: y = Xβ + e, where X is an N by K matrix of explanatory variables
- Percentile
- Quartile
- Z score
- t test
- SE of the mean
- Central Limit Theorem
Statistics attempts to generalize about a population from a sample. For the purposes of this discussion assume the population is all men in the US. A 1/1000 sample from this population would be a randomly selected sample of men such that the sample contained only one male for every 1000 in the population. The task of statistics is to be able to draw meaningful generalizations from the sample about the population. It is costly, and often impossible, to examine all the measurements in the population of interest. A sample must be selected in such a manner that it is representative of the population.
In a famous example of the potential for problems in sample selection, during the depression in the 1932 presidential election the Literary Digest attempted to sample the electorate. A staff was selected and numbers to call were randomly selected from the phone book in New York. In each call the question was asked "Who will you vote for, Mr. Roosevelt or President Hoover?" Those called, for the most part, supported President Hoover being re-elected. When Mr. Roosevelt won the election, the question was asked: what went wrong in the sampling process? The assumption that
those who had phones were representative of the population of voters proved to be the problem. Those without phones in that period disproportionately went for Mr. Roosevelt, biasing the results of the study.
In summary, statistics allows us to use the information contained in a representative sample to correctly make inferences about the population. For example if one were interested in ascertaining how long the light bulbs produced by a certain company last, one could hardly test them all. Sampling would be necessary. The bootstrap can be used to test the distribution of statistics estimated from a sample whose distribution is not known.
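The bootstrap idea mentioned above can be sketched in a few lines. The following Python sketch (these notes otherwise use Stata and Matlab; the light-bulb lifetimes are a hypothetical data vector) resamples the data with replacement and reads a percentile confidence interval off the resampled statistics:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, reps=2000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for a statistic whose sampling
    distribution is unknown: resample with replacement, recompute the
    statistic, and take the empirical alpha/2 and 1-alpha/2 percentiles."""
    rng = random.Random(seed)
    boot = sorted(stat(rng.choices(data, k=len(data))) for _ in range(reps))
    lo = boot[int((alpha / 2) * reps)]
    hi = boot[int((1 - alpha / 2) * reps) - 1]
    return lo, hi

# Hypothetical light-bulb lifetimes in hours.
data = [1100, 980, 1210, 1050, 995, 1150, 1020, 1075, 990, 1130]
lo, hi = bootstrap_ci(data)
print(lo, statistics.mean(data), hi)
```

The interval brackets the sample mean without any distributional assumption, which is exactly why the bootstrap is useful when the distribution of the statistic is not known.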
In addition to sampling correctly, it is important to be able to detect a shift in the underlying population. The usual practice is to draw a sample from the population to be able to make inferences about the underlying population. If the population is shifting, such samples will give biased information. For example assume a reservoir. If a rain comes and adds to and stirs up the water in the reservoir, samples of water would have to be taken more frequently than if there had been no rain and there was no change in water usage. The interesting question is how do you know when to start increasing the sampling rate? A possible approach would be to increase the sampling rate when the water quality of previous samples begins to fall outside normal ranges for the focus variable. In this example, it is not possible to use the population of all the water in the reservoir to test the water. A number of key concepts are listed next.
Measures of Central Tendency. The mean is a measure of central tendency. Assume a vector x containing N observations. The mean is defined as
x̄ = (1/N) Σ x_i    (3-1)

Assuming x_i = (1 2 3 4 5 6 7 8 9), then N = 9 and x̄ = 5. The mean is often written as x̄ or E(x), the expected value of x. The problem with the mean as a measure of central tendency is that it is affected by all observations. If instead of making x_9 = 9, we make x_9 = 99, then x̄ = 15, which is bigger than all x_i values except x_9. The median, defined as the middle term of an odd number of terms or the average of the two middle terms when the terms have been arranged in increasing order, is not affected by outlier terms. In the above example the median is 5 no matter whether x_9 = 9 or x_9 = 99. The final measure of central tendency is the mode, the value which has the highest frequency. The mode may not be unique. In the above example, it does not exist.
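These definitions are easy to check numerically; a short Python sketch (Python is used here only for illustration) reproduces the example:

```python
import statistics

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(statistics.mean(x))    # 5
print(statistics.median(x))  # 5

# Replace x9 = 9 with x9 = 99: the mean jumps but the median does not.
x[-1] = 99
print(statistics.mean(x))    # 15
print(statistics.median(x))  # still 5
```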
Variation. It has been reported that a poor statistician once drowned in a stream with a mean depth of 6 inches. How could this occur? To summarize the data, we also need to check on variation, something that can be done by looking at the standard deviation and variance. The population variance of a vector x is defined as
σ² = (1/N) Σ (x_i − μ)²    (3-2)
while the sample variance is
s² = (1/(N−1)) Σ (x_i − x̄)²    (3-3)
The population standard deviation is the square root of the population variance. For the purposes of these notes, the standard deviation will mean the sample standard deviation. There are alternative formulas for these values that may be easier to use. As an alternative to (3-2) and (3-3)
σ² = (1/N) Σ x_i² − μ²    (3-4)

s² = (Σ x_i² − N x̄²)/(N − 1)    (3-5)
For implementing the variance in a computer program, (3-2) is more accurate than (3-4). Why is this the case? (Hint: when the mean is large relative to the variance, (3-4) subtracts two large, nearly equal numbers and loses precision.)
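The numerical issue can be seen directly. The Python sketch below (for illustration only) compares the two-pass deviations formula with the one-pass sum-of-squares alternative on data with a large mean; the one-pass form suffers catastrophic cancellation:

```python
def var_two_pass(x):
    """Sample variance via the deviations formula: compute the mean first,
    then sum squared deviations from it."""
    n = len(x)
    m = sum(x) / n
    return sum((xi - m) ** 2 for xi in x) / (n - 1)

def var_one_pass(x):
    """Sample variance via the one-pass formula: sum of squares minus
    N times the squared mean."""
    n = len(x)
    m = sum(x) / n
    return (sum(xi * xi for xi in x) - n * m * m) / (n - 1)

data = [1e8 + 1, 1e8 + 2, 1e8 + 3]   # true sample variance is exactly 1.0
print(var_two_pass(data))   # 1.0
print(var_one_pass(data))   # inaccurate: the subtraction cancels most digits
```

On well-scaled data the two agree; on the shifted data only the two-pass version recovers the correct answer.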
If the data are approximately normally distributed, a general rule is that an observation will lie within ±3 standard deviations of the mean 99% of the time, within ±2 standard deviations 95% of the time, and within ±1 standard deviation 68% of the time. Given a vector of numbers, it is important to determine where a certain number might lie. There are 4 quartile positions of a series. Quartile 1 is the top of the lower 25%, quartile 2 the top of the lower 50%, or the median. Quartile 3 is the top of the lower 75%. The standard deviation gives information concerning where observations lie. Assume x̄ = 10, s = 5 and N = 300. The question asked is how likely a value > 14 will occur. To answer this question requires putting the data in Z form where
Z = (x − x̄)/s    (3-6)
Think of Z as a normalized deviation. Once we get Z, we can enter tables and determine how likely this will occur. In this case Z = (14-10)/5 = .8.
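The tail probability for Z = .8 can be computed rather than looked up in a table. A small Python sketch (for illustration) gets the standard normal upper tail from the complementary error function:

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal variate, via the complementary
    error function: 1 - Phi(z) = 0.5 * erfc(z / sqrt(2))."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

xbar, s = 10.0, 5.0
z = (14.0 - xbar) / s
print(z)               # 0.8
print(normal_tail(z))  # about 0.212: a value > 14 occurs roughly 21% of the time
```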
Distribution of the mean. It often is desirable to know how the sample mean is distributed. Assuming each observation is drawn from the same distribution with finite variance and that the observations are mutually independent, the Central Limit Theorem states that whatever that distribution, with mean μ and variance σ², the distribution of x̄ approaches the normal distribution with mean μ and variance σ²/N as the sample size N increases. Note that the standard deviation of the mean
is defined as

s_x̄ = s/√N    (3-7)

Given x̄ and s_x̄, the 95% confidence interval around x̄ is

x̄ ± 1.96 s/√N    (3-8)

For small samples (N < 30) the formula is

x̄ ± t_{.025,N−1} s/√N    (3-9)
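Applied to the car-value data used later in these notes (N = 6, so the small-sample formula applies), a Python sketch; the critical value t(.025, 5) = 2.571 is taken from a standard t table:

```python
import math
import statistics

y = [1995, 875, 695, 345, 595, 1795]   # car values, N = 6
n = len(y)
ybar = statistics.mean(y)              # 1050
s = statistics.stdev(y)                # sample standard deviation, about 679.5
se_mean = s / math.sqrt(n)             # standard error of the mean

t_crit = 2.571                         # t(.025, 5) from a t table
lo, hi = ybar - t_crit * se_mean, ybar + t_crit * se_mean
print(ybar, se_mean)
print(lo, hi)                          # 95% confidence interval for the mean
```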
Tests of two means. Assume two vectors x and y where we know the population variances σ_x² and σ_y². The simplest test of whether the means differ is

Z = (x̄ − ȳ)/√(σ_x²/N_x + σ_y²/N_y)    (3-10)

where the small sample approximation, assuming the two samples have the same population standard deviation, is

t = (x̄ − ȳ)/(s_p √(1/N_x + 1/N_y))    (3-11)

s_p² = ((N_x − 1)s_x² + (N_y − 1)s_y²)/(N_x + N_y − 2)    (3-12)

Note that s_p² is a pooled estimate of the population variance.
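A worked Python sketch of the small-sample test with two hypothetical samples:

```python
import math
import statistics

def pooled_t(x, y):
    """Small-sample test of two means assuming equal population variances:
    t = (xbar - ybar) / sqrt(sp2 * (1/Nx + 1/Ny)),
    where sp2 is the pooled variance estimate."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)) \
          / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

x = [1, 2, 3, 4, 5]     # hypothetical sample, mean 3
y = [3, 4, 5, 6, 7]     # hypothetical sample, mean 5, same variance
print(pooled_t(x, y))   # about -2.0
```

Both samples have variance 2.5, so the pooled variance is 2.5 and the t statistic is (3 − 5)/1 = −2.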
Correlation. If two variables are thought to be related, a possible summary measure would be the correlation coefficient r. Most calculators or statistical computer programs will make the calculation. The standard error of r is √((1 − r²)/(N − 2)) for small samples and 1/√N for large samples. This means that r divided by its standard error is distributed as a t statistic with asymptotic percentages as given above. The correlation coefficient is defined as

r = (E(xy) − E(x)E(y))/(σ_x σ_y)    (3-13)
Perfect positive correlation is 1.0, perfect negative correlation is -1.0. The SE of r converges to 0.0 as N → ∞. If N were 101, the SE of r would be 1/10 or .1. |r| must be at least .2 to be significant at or better than the 95% level. Correlation is a major tool of analysis that allows a person to formalize what is shown in an x-y plot of the data. A simple data set will be used to illustrate these concepts and introduce OLS models as well as show the flaws of correlation analysis as a diagnostic tool.
Single Equation OLS Regression Model. Data was obtained on 6 observations on age and value of cars (from Freund [1960] Modern Elementary Statistics, page 332), two variables that are thought to be related. Table One lists this data and gives means, correlation between age and value and a simple regression value=f(age). We expect the relationship to be negative and significant.
Table 1. Age of cars
Obs   Age   Value
1     1     1995
2     3     875
3     6     695
4     10    345
5     5     595
6     2     1795

Mean          4.5       1050
Variance      10.7      461750
Correlation   -0.85884
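Table 1's correlation can be reproduced directly from the definition; a Python sketch (for illustration, since the notes use Stata and Matlab for the actual analysis):

```python
import math

age   = [1, 3, 6, 10, 5, 2]
value = [1995, 875, 695, 345, 595, 1795]
n = len(age)
ma = sum(age) / n
mv = sum(value) / n

# Correlation: sum of cross-deviations over the geometric mean of the
# sums of squared deviations.
sxy = sum((a - ma) * (v - mv) for a, v in zip(age, value))
sxx = sum((a - ma) ** 2 for a in age)
syy = sum((v - mv) ** 2 for v in value)
r = sxy / math.sqrt(sxx * syy)
print(r)   # about -0.85884, matching Table 1
```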
Next we show the Stata command files to obtain analysis of this data. Assume you have a file car_age_data.do
input double x
0.1E+01
0.3E+01
0.6E+01
0.1E+02
0.5E+01
0.2E+01
end
label variable x "AGE OF CARS "
input double y
0.1995E+04
0.8750E+03
0.6950E+03
0.3450E+03
0.5950E+03
0.1795E+04
end
label variable y "PRICE OF CARS "
// Comment
run car_age_data.do
describe
summarize
list
correlate (x y)
regress y x
twoway (scatter y x)
Edited output is:
clear
. run car_age_data.do
. describe
Contains data
  obs:             6
 vars:             2
 size:            96
-----------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------------
x               double  %10.0g                AGE OF CARS
y               double  %10.0g                PRICE OF CARS
-----------------------------------------------------------------------------------
Sorted by:
     Note: dataset has changed since last saved
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |         6         4.5    3.271085          1         10
           y |         6        1050    679.5219        345       1995
. list
     +-----------+
     |  x      y |
     |-----------|
  1. |  1   1995 |
  2. |  3    875 |
  3. |  6    695 |
  4. | 10    345 |
  5. |  5    595 |
     |-----------|
  6. |  2   1795 |
     +-----------+
. correlate (x y)
(obs=6)
             |        x        y
-------------+------------------
           x |   1.0000
           y |  -0.8588   1.0000
. regress y x
      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  1,     4) =   11.24
       Model |  1702935.05     1  1702935.05           Prob > F      =  0.0285
    Residual |  605814.953     4  151453.738           R-squared     =  0.7376
-------------+------------------------------           Adj R-squared =  0.6720
       Total |     2308750     5      461750           Root MSE      =  389.17
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |  -178.4112   53.20631    -3.35   0.028    -326.1356   -30.68683
       _cons |    1852.85   287.3469     6.45   0.003     1055.048    2650.653
------------------------------------------------------------------------------
. twoway (scatter y x)
. end of do-file
[Scatter plot: PRICE OF CARS (vertical axis, 0 to 2000) versus AGE OF CARS (horizontal axis, 0 to 10)]
From the plot we see that the ten year old car appears to have a larger than expected value for its age. For this reason, more variables and observations are needed.
Remark: When there are two series, correlation and plots can be used effectively to determine the model. However, when there are more than two series, plots and correlation analysis are less useful and in many cases can give the wrong impression. This will be illustrated later. In cases where there is more than one explanatory variable, regression is the appropriate approach, although this approach has many problems.
A regression tries to write the dependent variable y as a linear function of the explanatory variables. In this case we have estimated a model of the form
y_t = β₀ + β₁ x_t + e_t    (3-14)
where y=value is the price of the car in period t, x=age is the age in period t and e is the error term.
Regression output produces
value = 1852.8505 - 178.41121*age    (3-15)
         (6.45)     (-3.35)

Adjusted R² = .672, SEE = 389.17, e'e = 605814.953

which can be verified from the printout. Note that SEE = √(e'e/(N − K)) = √(605814.953/4) = 389.17.
The regression model suggests that for every year older a car gets, its value drops significantly, by $178.41. A car one year old should have a value of 1852.8505 - (1)*178.41121 = 1674.4. In the sample data set the one year old car in fact had a value of 1995. For this observation the error was 320.56. Using the estimated equation (3-15) we have
Age   Actual Value   Estimated Value   Error
1     1995           1674.4            320.56
3     875            1317.6            -442.62
6     695            782.38            -87.383
10    345.0          68.738            276.26
5     595            960.79            -365.79
2     1795           1496              298.97
t scores have been placed under the estimated coefficients. Since for both coefficients |t| > 2, we can state that given the assumptions of the linear regression model, both coefficients are significant. Before turning to an in-depth discussion of the regression model, we look at a few optional topics.
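The estimates in (3-15) can also be verified by hand with the standard two-variable OLS formulas, slope = S_xy/S_xx and intercept = ȳ − slope·x̄; a short Python sketch (for illustration only, the notes' own check is the Matlab program in the next section):

```python
import math

age   = [1, 3, 6, 10, 5, 2]
value = [1995, 875, 695, 345, 595, 1795]
n = len(age)

xbar = sum(age) / n
ybar = sum(value) / n
sxy = sum((a - xbar) * (v - ybar) for a, v in zip(age, value))
sxx = sum((a - xbar) ** 2 for a in age)

b1 = sxy / sxx            # slope, about -178.41
b0 = ybar - b1 * xbar     # intercept, about 1852.85

resid = [v - (b0 + b1 * a) for a, v in zip(age, value)]
sse = sum(e * e for e in resid)    # e'e, about 605814.95
see = math.sqrt(sse / (n - 2))     # SEE, about 389.17
print(b0, b1, sse, see)
```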
4. More complex setup to illustrate Matlab to estimate the Model. Optional Topic.
This optional topic implements the key ideas in Appendix E of Wooldridge (2013) that show how a linear econometric model can be estimated by OLS. As discussed in the text, a linear OLS model selects the coefficients so as to minimize the sum of squared errors. Define X as an N by K matrix where N is the number of observations on K series. The OLS coefficient vector is β̂ = (X'X)⁻¹X'y, where y is the left hand side vector. The error vector is e = y − Xβ̂. Standard errors of the coefficients can be obtained from the square roots of the diagonal elements of σ̂²(X'X)⁻¹, where σ̂² = e'e/(N − K).
As an alternative to the Stata regress command that was shown above, the self contained MATLAB program that is listed next can be used to estimate the model.
%% Cars Example using Matlab
% Load data
x=[1,1; 1,3; 1,6; 1,10; 1,5; 1,2];
y=[1995 875 695 345 595 1795];
y=y';
value=y;
disp('Mean of dependent (Value) and Independent Variable (Age)')
disp([mean(y),mean(x(:,2))])
age=x(:,2);
disp('Small and Large Variances for Age and Value')
disp([var(age,0),var(age,1),var(value,0),var(value,1)])
disp('Correlation using formula and built in function')
cor=(mean(age.*y)-(mean(age)*mean(y)))/(sqrt(var(age,1))*sqrt(var(value,1)))
% using built in function
cor=corr([age,value])
%% Estimate the model
% Logic works for any sized problem!!
% for large # of obs put ; at end of [y,yhat,res] line
beta=inv(x'*x)*x'*y;
yhat=x*beta;
res=y-yhat;
disp('  Value    Yhat    Res')
[y,yhat,res]
ssr=res'*res;
disp('Sum of squared residuals')
disp(ssr)
df=size(x,1)-size(x,2);
se=sqrt(diag((ssr/df)*inv(x'*x)));
disp('  Beta    se    t')
t=beta./se;
[beta,se,t]
plot(res)
% plot(age,y,age,yhat)
disp('Durbin Watson')
i=1:1:5;
dw=((res(i+1)-res(i))'*(res(i+1)-res(i)))/(res'*res);
disp(dw)
Which produces output:
Mean of dependent (Value) and Independent Variable (Age)
        1050          4.5
Small and Large Variances for Age and Value
        10.7       8.9167  4.6175e+005  3.8479e+005
Correlation using formula and built in function
cor =
   -0.85884
cor =
            1     -0.85884
     -0.85884            1
   Value     Yhat      Res
ans =
         1995       1674.4       320.56
          875       1317.6      -442.62
          695       782.38      -87.383
          345       68.738       276.26
          595       960.79      -365.79
         1795         1496       298.97
Sum of squared residuals
  6.0581e+005
   Beta     se      t
ans =
       1852.9       287.35       6.4481
      -178.41       53.206      -3.3532
Durbin Watson
    2.7979
which matches what was produced by the Stata regress command, a command that can otherwise give the user the impression of a "black box." Our findings indicate that on average each additional year of age lowers the value of a car by $178.41.
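The Durbin–Watson statistic at the end of the Matlab run is just the sum of squared first differences of the residuals over the residual sum of squares. A Python sketch (for illustration) using the residuals printed above:

```python
def durbin_watson(res):
    """DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2."""
    num = sum((res[t] - res[t - 1]) ** 2 for t in range(1, len(res)))
    return num / sum(e * e for e in res)

# Residuals from the car-value regression, as printed in the Matlab output.
res = [320.56, -442.62, -87.383, 276.26, -365.79, 298.97]
print(durbin_watson(res))   # about 2.80, matching the Matlab value 2.7979
```

Values near 2 indicate little first-order serial correlation in the residuals; with only 6 observations the statistic is, of course, very imprecise.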
Remark: Econometric calculations can easily be programmed using 4th generation languages without detailed knowledge of Fortran or C. This allows new techniques to be implemented without waiting for software developers to "hard wire" these procedures.
5. Review of Linear Algebra and Introduction to Programming Regression Calculations. Optional Topic for those with the right math background.
Assume a problem where there are multiple x variables, all possibly related to y, and there is some relationship among the x variables (multicollinearity). The proposed solution is to fit a linear model of the form:

y = Xβ + e,    (5-1)

where y and e are N element column vectors, β_i is the coefficient of x_i, and β_0 is the intercept of the equation. A linear model such as (5-1) can be estimated by OLS (ordinary least squares), which will minimize Σ e_i², a good measure of the fit of the model. OLS is one of many methods to fit a line, others discussed being L1, which minimizes Σ |e_i|, and minimax, which minimizes the largest element in e. After the coefficients are calculated, it is a good idea to estimate and report standard errors, which allow significance tests on the estimates of the parameters. OLS models can be estimated using matrix algebra directly or using pre-programmed procedures like the regression command in Excel. There are, however, a number of ways to calculate the estimated parameters. Before this occurs we first illustrate a number of linear algebra calculations that include the LU factorization, eigenvalue analysis, the Cholesky factorization, the QR factorization, the Schur factorization (which always works for cases where eigenvalue routines may not) and the SVD calculation.

The LU factorization is the appropriate way to invert a general matrix. Eigenvalue analysis decomposes Z = VΛV⁻¹, where Λ is a diagonal matrix of eigenvalues and Z is a general matrix. For the positive definite case Z = VΛV', since here V⁻¹ = V'. Inspection of the diagonal elements of Λ indicates whether Zᵏ explodes if we note Zᵏ = VΛᵏV⁻¹. The sum of the diagonal elements of Λ is the trace of Z, while their product is det(Z). If Z is positive definite (all diagonal elements of Λ > 0) the Cholesky factorization writes Z = R'R where R is upper triangular. The Schur factorization writes Z = USU' where U is orthogonal and S is block upper triangular. Unlike the eigenvalue transformation, all elements of the Schur factorization are real for a general real matrix. The QR factorization writes X = QR where Q is orthogonal and R is the Cholesky factor of X'X, calculated accurately since it uses X, not X'X. The SVD calculates X = UΣV', where U and V are both orthogonal, N by K and K by K respectively, and Σ is a K by K diagonal matrix whose elements are the square roots of the eigenvalues of X'X. The Matlab script listed below self documents these calculations and shows Figure 5.1 graphically, where X was 100 by 50. How would this graph look if X were not a random matrix, where by assumption the off-diagonal elements of E(X'X) are near zero? How might it be used?
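To see what a factorization routine actually does, here is a minimal pure-Python Cholesky sketch (illustration only; the notes use Matlab's chol, and a real program would use a tuned library). It writes a positive definite Z as R'R with R upper triangular:

```python
import math

def chol_upper(z):
    """Return upper-triangular R with R'R = Z, for positive definite Z
    (textbook Cholesky algorithm)."""
    n = len(z)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = z[i][j] - sum(r[k][i] * r[k][j] for k in range(i))
            if i == j:
                r[i][j] = math.sqrt(s)   # fails here if Z is not positive definite
            else:
                r[i][j] = s / r[i][i]
    return r

z = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
r = chol_upper(z)
# Check that R'R reproduces Z.
back = [[sum(r[k][i] * r[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
print(r)
print(back)
```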
%% Linear Algebra Useful for Econometrics in Matlab
disp(' Short course in Math using Matlab(c)')
% 2 December 2006 Version
disp(' Houston H. Stokes')
disp(' All Matlab commands are indented. Cut and paste from this')
disp(' document into Matlab and execute.')
disp(' ')
disp(' If ; is left off result will print.')
disp(' Define x as a n by n matrix of random numbers.')
disp(' x = rand(n) ')
disp(' define x as a n by n matrix of random normal numbers')
disp(' xn = randn(n)')
disp(' Do a LU factorization and test answer')
disp(' Inverse using LU ')
disp(' ')
disp(' x = rand(n) ')
disp(' [l u] = lu(x) ')
disp(' test = l*u ')
disp(' error = l*u - x ')
disp(' ix = inv(x) ')
disp(' ix2 = inv(u)*inv(l) ')
disp(' error = ix - ix2 ')
 n=3
 x = rand(n)
 [l u] = lu(x)
 test = l*u
 error = l*u - x
 ix = inv(x)
 ix2 = inv(u)*inv(l)
 error = ix - ix2
disp(' Form PD Matrix and look at it. ')
disp(' xx = randn(100,10); ')
disp(' xpx = xx`*xx ')
disp(' mesh(xpx) ')
 xx = randn(100,50);
 xpx = xx'*xx;
 mesh(xpx)
disp(' Factor PD matrix into R(t)*R and test')
disp(' xx = randn(100,n); ')
disp(' xpx = xx(t)*xx ')
disp(' r = chol(xpx) ')
disp(' test1 = r(t)*r ')
disp(' mesh(r) ')
disp(' error = r(t)*r - xpx ')
 xx = randn(100,n);
 xpx = xx'*xx
 r = chol(xpx)
 test1 = r'*r
 error = r'*r - xpx
disp(' Eigen and svd analysis. For pd matrix s = landa')
disp(' xx = randn(100,n); ')
disp(' xpx = xx(t)*xx ')
disp(' lamda = eig(xpx) ')
 xx = randn(100,n);
 xpx = xx'*xx
 lamda = eig(xpx)
disp(' show trace = sum eigen')
disp(' det = prod(e) ')
disp(' trace1 = trace(xpx) ')
disp(' det1 = det(xpx) ')
disp(' trace2 = sum(lamda) ')
disp(' det2 = prod(lamda) ')
 trace1 = trace(xpx)
 det1 = det(xpx)
 trace2 = sum(lamda)
 det2 = prod(lamda)
disp(' Test SVD')
disp(' s = svd(xpx) ')
disp(' [u ss v] = svd(xpx) ')
disp(' test = u*ss*v(t)')
disp(' error = xpx-test ')
 s = svd(xpx)
 [u ss v] = svd(xpx)
 test = u*ss*v'
 error = xpx-test
disp(' Does X*V = V*Lamda')
disp(' xx = randn(100,n); ')
disp(' xpx = xx(t)*xx ')
disp(' [v lamda] = eig(xpx) ')
disp(' test = v*lamda*inv(v)')
disp(' error = xpx-test ')
disp(' vpv = v(t)*v ')
disp(' s = svd(xpx) ')
 xx = randn(100,n);
 xpx = xx'*xx
 [v lamda] = eig(xpx)
 test = v*lamda*inv(v)
 error = xpx-test
 vpv = v'*v
 s = svd(xpx)
disp(' Schur Factorization X = U S U(t) where U is orthogonal and')
disp(' S is block upper triangural with 1 by 1 and 2 by 2 on the')
disp(' diagonal. All elements of a Schur factorization real')
disp(' xx = randn(100,n); ')
disp(' xpx = xx(t)*xx ')
disp(' [U,S] = schur(xpx) ')
disp(' test = U*S*U(t) ')
disp(' error = xpx-test ')
 xx = randn(100,n);
 xpx = xx'*xx
 [U,S] = schur(xpx)
 test = U*S*U'
 error = xpx-test
disp(' Schur Factorization')
disp(' xx = randn(n,n) ')
disp(' [U,S] = schur(xx) ')
disp(' test = U*S*U(t) ')
disp(' error = xx-test ')
 xx = randn(n,n)
 [U,S] = schur(xx)
 test = U*S*U'
 error = xx-test
disp(' QR Factorization preserves length and angles and does not magnify')
disp(' errors. We express X = Q*R where Q is orthogonal and R is upper')
disp(' triangular ')
disp(' x = randn(n,n) ')
disp(' [Q R] = qr(x) ')
disp(' test1 = Q(t)*Q ')
disp(' test2 = Q*R ')
disp(' error = x - test2 ')
 x = randn(n,n)
 [Q R] = qr(x)
 test1 = Q'*Q
 test2 = Q*R
 error = x - test2
and produces output:
 Short course in Math using Matlab(c)
 Houston H. Stokes
 All Matlab commands are indented. Cut and paste from this
 document into Matlab and execute.

 If ; is left off result will print.
 Define x as a n by n matrix of random numbers.
 x = rand(n)
 define x as a n by n matrix of random normal numbers
 xn = randn(n)
 Do a LU factorization and test answer
 Inverse using LU

 x = rand(n)
 [l u] = lu(x)
 test = l*u
 error = l*u - x
 ix = inv(x)
 ix2 = inv(u)*inv(l)
 error = ix - ix2
n =
     3
x =
      0.84622      0.67214      0.68128
      0.52515      0.83812      0.37948
      0.20265      0.01964       0.8318
l =
            1            0            0
      0.62059            1            0
      0.23947     -0.33568            1
u =
      0.84622      0.67214      0.68128
            0        0.421     -0.04331
            0            0      0.65411
test =
      0.84622      0.67214      0.68128
      0.52515      0.83812      0.37948
      0.20265      0.01964       0.8318
error =
            0            0            0
            0            0            0
            0  -6.9389e-018            0
ix =
       2.9596      -2.3417      -1.3557
      -1.5445       2.4281      0.15727
     -0.68458      0.51318       1.5288
ix2 =
       2.9596      -2.3417      -1.3557
      -1.5445       2.4281      0.15727
     -0.68458      0.51318       1.5288
error =
            0            0            0
            0            0            0
 -1.1102e-016            0            0
 Form PD Matrix and look at it.
 xx = randn(100,10);
 xpx = xx`*xx
 mesh(xpx)
 Factor PD matrix into R(t)*R and test
 xx = randn(100,n);
 xpx = xx(t)*xx
 r = chol(xpx)
 test1 = r(t)*r
 mesh(r)
 error = r(t)*r - xpx
xpx =
        98.02       17.334      0.14022
       17.334       104.66      -7.2052
      0.14022      -7.2052       114.22
r =
       9.9005       1.7508     0.014163
            0        10.08     -0.71729
            0            0       10.663
test1 =
        98.02       17.334      0.14022
       17.334       104.66      -7.2052
      0.14022      -7.2052       114.22
error =
  1.4211e-014            0            0
            0            0            0
            0            0            0
 Eigen and svd analysis. For pd matrix s = landa
 xx = randn(100,n);
 xpx = xx(t)*xx
 lamda = eig(xpx)
xpx =
       95.217      -3.5453       12.006
      -3.5453       96.003      -3.9312
       12.006      -3.9312       92.989
lamda =
       82.025
       93.783
        108.4
 show trace = sum eigen
 det = prod(e)
 trace1 = trace(xpx)
 det1 = det(xpx)
 trace2 = sum(lamda)
 det2 = prod(lamda)
trace1 =
       284.21
det1 =
  8.3388e+005
trace2 =
       284.21
det2 =
  8.3388e+005
 Test SVD
 s = svd(xpx)
 [u ss v] = svd(xpx)
 test = u*ss*v(t)
 error = xpx-test
s =
        108.4
       93.783
       82.025
u =
     -0.67492       0.3165      0.66657
      0.39135      0.91936    -0.040277
     -0.62557      0.23368     -0.74435
ss =
        108.4            0            0
            0       93.783            0
            0            0       82.025
v =
     -0.67492       0.3165      0.66657
      0.39135      0.91936    -0.040277
     -0.62557      0.23368     -0.74435
test =
       95.217      -3.5453       12.006
      -3.5453       96.003      -3.9312
       12.006      -3.9312       92.989
error =
 -2.8422e-014   -1.199e-014 -4.0856e-014
 -9.3703e-014 -2.8422e-014  5.9064e-014
 -4.7962e-014 -5.3291e-015  1.4211e-014
 Does X*V = V*Lamda
 xx = randn(100,n);
 xpx = xx(t)*xx
 [v lamda] = eig(xpx)
 test = v*lamda*inv(v)
 error = xpx-test
 vpv = v(t)*v
 s = svd(xpx)
xpx =
       98.321     -0.36605       1.9557
     -0.36605       127.52      -2.4594
       1.9557      -2.4594       112.74
v =
      0.99127       0.1298     0.022941
    0.0013134       0.1643     -0.98641
     -0.13181      0.97783       0.1627
lamda =
       98.061            0            0
            0       112.59            0
            0            0       127.93
test =
       98.321     -0.36605       1.9557
     -0.36605       127.52      -2.4594
       1.9557      -2.4594       112.74
error =
 -1.4211e-014  2.7645e-014  2.2204e-015
  2.9421e-014 -4.2633e-014  5.7732e-015
  1.7764e-015 -1.3323e-015            0
vpv =
            1  2.7756e-017 -2.0817e-017
  2.7756e-017            1  -2.498e-016
 -2.0817e-017  -2.498e-016            1
s =
       127.93
       112.59
       98.061
 Schur Factorization X = U S U(t) where U is orthogonal and
 S is block upper triangural with 1 by 1 and 2 by 2 on the
 diagonal. All elements of a Schur factorization real
 xx = randn(100,n);
 xpx = xx(t)*xx
 [U,S] = schur(xpx)
 test = U*S*U(t)
 error = xpx-test
xpx =
       75.062       11.465      -4.6863
       11.465       135.28       7.6196
      -4.6863       7.6196       87.647
U =
     -0.91599     -0.36457     -0.16747
      0.20355    -0.062606     -0.97706
     -0.34572      0.92907     -0.13156
S =
       70.745            0            0
            0       88.973            0
            0            0       138.27
test =
       75.062       11.465      -4.6863
       11.465       135.28       7.6196
      -4.6863       7.6196       87.647
error =
  1.4211e-014  2.1316e-014 -1.4211e-014
  2.4869e-014 -8.5265e-014  7.1054e-015
 -1.0658e-014  5.3291e-015  1.4211e-014
 Schur Factorization
 xx = randn(n,n)
 [U,S] = schur(xx)
 test = U*S*U(t)
 error = xx-test
xx =
        2.095      0.93943     -0.45994
      0.34979    -0.047081      0.64722
       2.0142      -1.4799      -1.8411
U =
     -0.89939     -0.19282      0.39233
     -0.24726     -0.51574     -0.82029
      -0.3605      0.83477     -0.41617
S =
       2.1689       1.4404      -1.1939
            0     -0.98103       2.3141
            0     -0.42339     -0.98103
test =
        2.095      0.93943     -0.45994
      0.34979    -0.047081      0.64722
       2.0142      -1.4799      -1.8411
error =
  8.8818e-016 -2.2204e-016 -1.6653e-016
  1.1102e-016    9.09e-016  8.8818e-016
  4.4409e-016  3.3307e-015  8.8818e-016
 QR Factorization preserves length and angles and does not magnify
 errors. We express X = Q*R where Q is orthogonal and R is upper
 triangular
 x = randn(n,n)
 [Q R] = qr(x)
 test1 = Q(t)*Q
 test2 = Q*R
 error = x - test2
x =
      -0.9756      0.55997      0.88166
     0.028304      0.62542      0.15174
    -0.050706      0.53695    -0.017682
Q =
     -0.99823    0.0094729    -0.058658
     0.028961     -0.78444     -0.61953
    -0.051883     -0.62013      0.78278
R =
      0.97733     -0.56872     -0.87479
            0     -0.81828    -0.099712
            0            0     -0.15956
test1 =
            1            0  6.9389e-018
            0            1 -1.6653e-016
  6.9389e-018 -1.6653e-016            1
test2 =
      -0.9756      0.55997      0.88166
     0.028304      0.62542      0.15174
    -0.050706      0.53695    -0.017682
error =
            0            0 -4.4409e-016
  3.4694e-018  2.2204e-016  8.3267e-017
            0            0  2.0817e-017
[3-D mesh plot: X'X where X was 100 by 50; both axes run 0 to 50, values roughly -40 to 140]
Figure 5.1 X'X for a random Matrix X
These ideas are illustrated using the Theil dataset discussed in more detail in the next section.
%% Use of Theil Data to Illustrate various ways to get Beta
% For more detail on these calculations see Stokes (200x) Chapter 10
disp('Theil (1971) data on Year CT RP Income')
data=[1923  99.2  96.7 101.0;
      1924  99.0  98.1 100.1;
      1925 100.0 100.0 100.0;
      1926 111.6 104.6  90.6;
      1927 122.2 104.9  86.5;
      1928 117.6 109.5  89.7;
      1929 121.1 110.8  90.6;
      1930 136.0 112.3  82.8;
      1931 154.2 109.3  70.1;
      1932 153.6 105.3  65.4;
      1933 158.5 101.7  61.3;
      1934 140.6  95.4  62.5;
      1935 136.2  96.4  63.6;
      1936 168.0  97.6  52.6;
      1937 154.3 102.4  59.7;
      1938 149.0 101.6  59.5;
      1939 165.5 103.8  61.3]
y=data(:,2);
x=[ones(size(data,1),1),data(:,3),data(:,4)];
disp('Beta using Inverse')
beta1=inv(x'*x)*x'*y
%% QR
disp('Using QR approach')
[q,r]=qr(x,0)
disp('Testing q')
q'*q
beta2=inv(r)*q'*y
yhat=q*q'*y;
resid=y-yhat;
disp('Y Yhat Residual')
[y,yhat,resid]
%% Testing R from QR and R from Cholesky
disp('Inverse (xpx) = inv(r)*transpose(inv(r))')
inv(x'*x)
inv(r)*(inv(r))'
r
cholr=chol(x'*x)
%% SVD approach that includes PC Regression
disp('SVD approach')
[u,s,v]=svd(x,0)
pc_coef=u'*y
beta3=inv(v')*inv(s)*pc_coef
Output produced is:
Theil (1971) data on Year CT RP Income
data =
         1923         99.2         96.7          101
         1924           99         98.1        100.1
         1925          100          100          100
         1926        111.6        104.6         90.6
         1927        122.2        104.9         86.5
         1928        117.6        109.5         89.7
         1929        121.1        110.8         90.6
         1930          136        112.3         82.8
         1931        154.2        109.3         70.1
         1932        153.6        105.3         65.4
         1933        158.5        101.7         61.3
         1934        140.6         95.4         62.5
         1935        136.2         96.4         63.6
         1936          168         97.6         52.6
         1937        154.3        102.4         59.7
         1938          149        101.6         59.5
         1939        165.5        103.8         61.3
Beta using Inverse
beta1 =
       130.23
       1.0659
      -1.3822
Using QR approach
q =
     -0.24254      -0.2958     -0.42465
     -0.24254      -0.2297     -0.39928
     -0.24254     -0.13999     -0.38173
     -0.24254     0.077214     -0.20134
     -0.24254     0.091379     -0.13707
     -0.24254      0.30858     -0.14641
     -0.24254      0.36996     -0.14898
     -0.24254      0.44079    -0.018862
     -0.24254      0.29913      0.14704
     -0.24254      0.11027      0.18403
     -0.24254    -0.059716      0.21536
     -0.24254     -0.35718      0.14409
     -0.24254     -0.30997      0.13597
     -0.24254     -0.25331      0.31174
     -0.24254    -0.026664      0.24537
     -0.24254    -0.064438      0.24162
     -0.24254      0.03944       0.2331
r =
      -4.1231      -424.53      -314.64
            0       21.179       11.878
            0            0      -66.411
Testing q
ans =
            1  8.1532e-017  1.3878e-017
  8.1532e-017            1 -1.1796e-016
  1.3878e-017 -1.1796e-016            1
beta2 =
       130.23
       1.0659
      -1.3822
Y Yhat Residual
ans =
         99.2       93.704       5.4962
           99        96.44         2.56
          100       98.603       1.3965
        111.6        116.5      -4.8995
        122.2       122.49     -0.28637
        117.6       122.97      -5.3664
        121.1       123.11      -2.0081
          136       135.49      0.51173
        154.2       149.84       4.3553
        153.6       152.08       1.5225
        158.5       153.91       4.5927
        140.6       145.53      -4.9335
        136.2       145.08      -8.8789
          168       161.56       6.4376
        154.3       156.87       -2.565
          149       156.29      -7.2887
        165.5       156.15       9.3542
Inverse (xpx) = inv(r)*transpose(inv(r))
ans =
       23.773      -0.2272   -0.0042094
      -0.2272    0.0023008  -0.00012716
   -0.0042094  -0.00012716   0.00022673
ans =
       23.773      -0.2272   -0.0042094
      -0.2272    0.0023008  -0.00012716
   -0.0042094  -0.00012716   0.00022673
r =
      -4.1231      -424.53      -314.64
            0       21.179       11.878
            0            0      -66.411
cholr =
       4.1231       424.53       314.64
            0       21.179       11.878
            0            0       66.411
SVD approach
u =
      0.26014     -0.42317      0.28267
      0.26123     -0.39389      0.21821
      0.26398     -0.37096      0.12977
      0.26026     -0.17816    -0.076467
      0.25606     -0.11332    -0.086908
      0.26662      -0.1094     -0.30402
      0.26959     -0.10823     -0.36537
      0.26301     0.025612     -0.42853
       0.2441      0.18215     -0.27778
      0.23275      0.20748    -0.087337
      0.22268      0.22834      0.08395
      0.21455      0.13929      0.37648
       0.2173      0.13408      0.32893
      0.20664      0.31251      0.28251
      0.22192      0.26022     0.052713
      0.22049      0.25419     0.090163
      0.22584      0.25202    -0.013903
s =
       530.48            0            0
            0       53.304            0
            0            0      0.20509
v =
    0.0077424    0.0056046      0.99995
        0.799      0.60125   -0.0095564
      0.60128     -0.79904  -0.00017699
pc_coef =
       545.81
       131.94
       26.706
beta3 =
       130.23
       1.0659
      -1.3822
Remark: This section shows how to implement the basic linear algebra relationships that are useful in understanding modern econometric methods and calculations. In many cases these approaches are required for complex and multicollinear datasets.
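The QR route to OLS used in the Matlab code above can be cross-checked in plain Python. The sketch below (the 5-by-3 matrix, the y vector, and the helper names qr_mgs, back_solve, and solve are all invented for illustration) computes the coefficients once from the normal equations and once from a thin QR factorization and confirms the two agree, just as beta1 and beta2 agreed above:

```python
def qr_mgs(X):
    """Thin QR via modified Gram-Schmidt; returns Q as a list of columns, plus R."""
    n, k = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(k)]
    Q, R = [], [[0.0] * k for _ in range(k)]
    for j in range(k):
        v = cols[j][:]
        for i in range(j):
            R[i][j] = sum(Q[i][t] * v[t] for t in range(n))
            v = [v[t] - R[i][j] * Q[i][t] for t in range(n)]
        R[j][j] = sum(z * z for z in v) ** 0.5
        Q.append([z / R[j][j] for z in v])
    return Q, R

def back_solve(R, b):
    """Solve the upper-triangular system R x = b."""
    k = len(b)
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (b[r] - sum(R[r][c] * x[c] for c in range(r + 1, k))) / R[r][r]
    return x

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

# Made-up data: a constant and two regressors
X = [[1.0, 2.0, 1.0], [1.0, 3.0, 2.0], [1.0, 5.0, 2.5],
     [1.0, 7.0, 3.0], [1.0, 8.0, 5.0]]
y = [3.1, 4.2, 6.0, 7.9, 10.1]

# beta from QR: solve R beta = Q'y
Q, R = qr_mgs(X)
qty = [sum(Q[j][t] * y[t] for t in range(len(y))) for j in range(len(Q))]
beta_qr = back_solve(R, qty)

# beta from the normal equations: (X'X) beta = X'y
k = len(X[0])
XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(k)]
beta_ne = solve(XtX, Xty)
```

The QR route never forms X'X, which is why it is preferred for ill-conditioned (multicollinear) problems: the condition number of X'X is the square of the condition number of X.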
6. A Sample Multiple Input Regression Model Dataset
In sections 3 and 4 we introduced a small (6-observation) dataset that relates the age of cars to
their value. We observed that since there are so few observations in this example, the correlation coefficient must be relatively large to be significant. The small sample standard error of the
correlation coefficient is calculated using (3.3), which in this case is 1/√6 ≈ .41. Since the absolute value of the correlation coefficient (-.85884) is about 2 times the standard error, we can state that at about the 95% level the correlation coefficient is significant. The problem with correlation analysis is that it is hard to make direct predictions. What is wanted is a relationship where, given only the age of a car, we can make some prediction of its price. To obtain an answer to the prediction problem requires more advanced statistical techniques. Its solution will be discussed further below.
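This arithmetic can be checked directly. The sketch below assumes (3.3) is the usual small-sample approximation SE(r) ≈ 1/√n; the equation itself appears in an earlier section, so the form used here is an assumption:

```python
import math

n = 6                    # observations in the car age/value example
r = -0.85884             # correlation reported in the text
se_r = 1 / math.sqrt(n)  # assumed form of (3.3): small-sample SE of r
ratio = abs(r) / se_r    # roughly a t-like ratio; about 2 here
```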
As discussed earlier, when more complicated models are deemed appropriate or when predictions are required, correlation analysis, which restricts attention to two variables at a time, is no longer the best way to proceed. In the highly unlikely situation where all the variables influencing y (the x's) were unrelated among themselves (i.e., were orthogonal), correlation analysis would give the correct sign of the relationship between each x variable and y. This situation would occur if the x's were principal components. Some of these possibilities will be illustrated later using generated data.
Table Two lists data on the consumption of textiles in the Netherlands from Theil ([1971] Principles of Econometrics, page 102), which was used as an example in the Matlab code in section 5. This example will be shown to provide a better fit than the previous example and, in addition, illustrates multiple input regression models. (It should be noted that not all economics examples work this well.) Time series models usually have higher R² than cross-section models because of the serial correlation (relationship between the error terms across time) implicit in most time series. In this example from Theil (1971) the consumption of textiles in the Netherlands (CT) between 1923-1939 is modeled as a function of income (Y) and the relative price of textiles (RP). The maintained hypothesis is that as income increases, the consumption of textiles should increase, and as the relative price of textiles increases, the consumption of textiles should decrease. Two models are tried, one with the raw data and one with data logged to the base 10. The linear model asserts
Table Two Consumption of Textiles in the Netherlands: 1923-1939
Year     CT      Y     RP
1923    99.2   96.7  101.0
1924    99.0   98.1  100.1
1925   100.0  100.0  100.0
1926   111.6  104.6   90.6
1927   122.2  104.9   86.5
1928   117.6  109.5   89.7
1929   121.1  110.8   90.6
1930   136.0  112.3   82.8
1931   154.2  109.3   70.1
1932   153.6  105.3   65.4
1933   158.5  101.7   61.3
1934   140.6   95.4   62.5
1935   136.2   96.4   63.6
1936   168.0   97.6   52.6
1937   154.3  102.4   59.7
1938   149.0  101.6   59.5
1939   165.5  103.8   61.3
CT = consumption of textiles.
Y  = income.
RP = relative price of textiles.
CT_t = β1 + β2 Y_t + β3 RP_t + e_t                                   (6-1)
while the log form assumes the error is multiplicative or that
CT_t = 10^β1 · Y_t^β2 · RP_t^β3 · 10^u_t                             (6-2)
(6-2) can be estimated in log form as
log10(CT_t) = β1 + β2 log10(Y_t) + β3 log10(RP_t) + u_t              (6-3)
Actual estimates of the alternative models were
CT_t        = 130.71 + 1.0617 Y_t - 1.3830 RP_t              adjusted R² = .9443
              (4.82)   (3.98)      (-16.50)

log10(CT_t) = 1.3739 + 1.1432 log10(Y_t) - 0.8288 log10(RP_t)   adjusted R² = .9707
              (4.49)   (7.33)            (-22.95)
                                                                     (6-4)
(t statistics in parentheses)
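The two fits in (6-4) can be reproduced in plain Python from the normal equations. This is a sketch, not the B34S or SAS runs themselves; the helper names solve and ols are made up, and the income series is taken as it appears in the SAS cards below:

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """beta = (X'X)^(-1) X'y via the normal equations."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(k)]
    return solve(XtX, Xty)

# Theil (1971) textile data, 1923-1939, income as in the SAS cards
ct = [99.2, 99, 100, 111.6, 122.2, 117.6, 121.1, 136, 154.2,
      153.6, 158.5, 140.6, 136.2, 168, 154.3, 149, 165.5]
inc = [96.7, 98.1, 100, 104.9, 104.9, 109.5, 110.8, 112.3, 109.3,
       105.3, 101.7, 95.4, 96.4, 97.6, 102.4, 101.6, 103.8]
rp = [101, 100.1, 100, 90.6, 86.5, 89.7, 90.6, 82.8, 70.1,
      65.4, 61.3, 62.5, 63.6, 52.6, 59.7, 59.5, 61.3]

beta_lin = ols([[1.0, inc[i], rp[i]] for i in range(17)], ct)
beta_log = ols([[1.0, math.log10(inc[i]), math.log10(rp[i])] for i in range(17)],
               [math.log10(v) for v in ct])
```

In the log form the slope coefficients are elasticities, which is why the log model is often preferred for demand equations like this one.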
Prior to estimation, raw correlations were computed and plots were made. The log transformation was also tried in an attempt to make the time series data stationary. B34S and SAS commands to analyze these data are shown next.
Note that when using the B34S data step, variables to be created with gen statements must be explicitly declared with the build statement. This allows checking of the variable names in the gen statements. For SAS, the following commands would be used.
data theil;
INPUT CT Y RP ;
LABEL CT      = 'CONSUMPTION OF TEXTILES' ;
LABEL LOG10CT = 'LOG10 OF CONSUMPTION' ;
LABEL Y       = 'INCOME' ;
LABEL LOG10Y  = 'LOG10 OF INCOME' ;
LABEL RP      = 'RELATIVE PRICE OF TEXTILES' ;
LABEL LOG10RP = 'LOG10 OF RELATIVE PRICE' ;
LOG10CT = LOG10(CT) ;
LOG10RP = LOG10(RP) ;
LOG10Y  = LOG10(Y) ;
CARDS;
99.2 96.7 101
99 98.1 100.1
100 100 100
111.6 104.9 90.6
122.2 104.9 86.5
117.6 109.5 89.7
121.1 110.8 90.6
136 112.3 82.8
154.2 109.3 70.1
153.6 105.3 65.4
158.5 101.7 61.3
140.6 95.4 62.5
136.2 96.4 63.6
168 97.6 52.6
154.3 102.4 59.7
149 101.6 59.5
165.5 103.8 61.3
;
proc reg;     MODEL CT = Y RP; run;
proc reg;     MODEL LOG10CT = LOG10Y LOG10RP; run;
proc autoreg; MODEL LOG10CT = LOG10Y LOG10RP / nlag=1 method=ml; run;
proc autoreg; MODEL LOG10CT = LOG10Y / nlag=1 method=ml; run;
Edited output from B34S discussed below is:
Variable # Label Mean Std. Dev. Variance Maximum Minimum
CT 1 CONSUMPTION OF TEXTILES 134.506 23.5773 555.891 168.000 99.0000 Y 2 INCOME 102.982 5.30097 28.1003 112.300 95.4000 RP 3 RELATIVE PRICE OF TEXTILES 76.3118 16.8662 284.470 101.000 52.6000 LOG10CT 4 LOG10 OF CONSUMPTION 2.12214 0.791131E-01 0.625889E-02 2.22531 1.99564 LOG10Y 5 LOG10 OF INCOME 2.01222 0.222587E-01 0.495451E-03 2.05038 1.97955 LOG10RP 6 LOG10 OF RELATIVE PRICE 1.87258 0.961571E-01 0.924619E-02 2.00432 1.72099 CONSTANT 7 1.00000 0.00000 0.00000 1.00000 1.00000
Data file contains 17 observations on 7 variables. Current missing value code is 0.1000000000000000E+32B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 16:14:15 DATA STEP PAGE 2
Correlation Matrix
1 Y Var 2 0.61769E-01
1 2 RP Var 3 -0.94664 0.17885
1 2 3 LOG10CT Var 4 0.99744 0.93936E-01 -0.94836
1 2 3 4 LOG10Y Var 5 0.66213E-01 0.99973 0.17511 0.97862E-01
1 2 3 4 5 LOG10RP Var 6 -0.93820 0.22599 0.99750 -0.93596 0.22212
1 2 3 4 5 6 CONSTANT Var 7 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
*************** Problem Number 4 Subproblem Number 1
F to enter 0.99999998E-02 F to remove 0.49999999E-02 Tolerance 0.10000000E-04 Maximum no of steps 3
Dependent variable X( 1). Variable Name CT
Standard Error of Y = 23.577332 for degrees of freedom = 16.
............. Step Number 3 Analysis of Variance for reduction in SS due to variable entering Variable Entering 2 Source DF SS MS F F Sig. Multiple R 0.975337 Due Regression 2 8460.9 4230.5 136.68 1.000000 Std Error of Y.X 5.56336 Dev. from Reg. 14 433.31 30.951 R Square 0.951282 Total 16 8894.2 555.89
Multiple Regression Equation Variable Coefficient Std. Error T Val. T Sig. P. Cor. Elasticity Partial Cor. for Var. not in equation CT = Variable Coefficient F for selection Y X- 2 1.061710 0.2666740 3.981 0.99863 0.7287 0.8129 RP X- 3 -1.382985 0.8381426E-01 -16.50 1.00000 -0.9752 -0.7846 CONSTANT X- 7 130.7066 27.09429 4.824 0.99973
Adjusted R Square 0.944321908495049 -2 * ln(Maximum of Likelihood Function) 103.294108058298 Akaike Information Criterion (AIC) 111.294108058298 Scwartz Information Criterion (SIC) 114.626961434523 Akaike (1970) Finite Prediction Error 36.4128553394184 Generalized Cross Validation 37.5832685467569 Hannan & Quinn (1979) HQ 36.8112662258895 Shibata (1981) 34.4851159390962 Rice (1984) 39.3920889580981 Residual Variance 30.9509270385056
Order of entrance (or deletion) of the variables = 7 3 2
Estimate of computational error in coefficients = 1 -0.1889E-13 2 -0.2396E-14 3 0.7430E-11
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 2 Y 0.71115004E-01
Row 2 Variable X- 3 RP -0.39974169E-02 0.70248306E-02
Row 3 Variable X- 7 CONSTANT -7.0185405 -0.12441382 734.10069
Program terminated. All variables put in.
Residual Statistics for... Original Data
Von Neumann Ratio 1 ... 2.14471 Durbin-Watson TEST..... 2.01855 Von Neumann Ratio 2 ... 2.14471
For D. F. 14 t(.9999)= 5.3624, t(.999)= 4.1403, t(.99)= 2.9768, t(.95)= 2.1448, t(.90)= 1.7613, t(.80)= 1.3450
Skewness test (Alpha 3) = -.232914E-01, Peakedness test (Alpha 4)= 1.37826
Normality Test -- Extended grid cell size = 1.70 t Stat Infin 1.761 1.345 1.076 0.868 0.692 0.537 0.393 0.258 0.128 Cell No. 0 2 2 4 2 0 2 2 1 2 Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100 Act Per 1.000 1.000 0.882 0.765 0.529 0.412 0.412 0.294 0.176 0.118
Normality Test -- Small sample grid cell size = 3.40 Cell No. 2 6 2 4 3 Interval 1.000 0.800 0.600 0.400 0.200 Act Per 1.000 0.882 0.529 0.412 0.176
Extended grid normality test - Prob of rejecting normality assumption Chi= 7.118 Chi Prob= 0.4760 F(8, 14)= 0.889706 F Prob =0.450879
Small sample normality test - Large grid Chi= 3.294 Chi Prob= 0.6515 F(3, 14)= 1.09804 F Prob =0.617396
Autocorrelation function of residuals
1) -0.1546 2) -0.2529 3) 0.2272 4) -0.3925
F( 6, 6) = 0.3219 1/F = 3.106 Heteroskedasticity at 0.9032 level
Sum of squared residuals 433.3 Mean squared residual 25.49
Gen. Least Squares ended by satisfying tolerance.
*************** Problem Number 4 Subproblem Number 2
F to enter 0.99999998E-02
F to remove 0.49999999E-02 Tolerance 0.10000000E-04 Maximum no of steps 3 Dependent variable X( 4). Variable Name LOG10CT
Standard Error of Y = 0.79113140E-01 for degrees of freedom = 16.
............. Step Number 3 Analysis of Variance for reduction in SS due to variable entering Variable Entering 5 Source DF SS MS F F Sig. Multiple R 0.987097 Due Regression 2 0.97575E-01 0.48787E-01 266.02 1.000000 Std Error of Y.X 0.135425E-01 Dev. from Reg. 14 0.25676E-02 0.18340E-03 R Square 0.974361 Total 16 0.10014 0.62589E-02
Multiple Regression Equation Variable Coefficient Std. Error T Val. T Sig. P. Cor. Elasticity Partial Cor. for Var. not in equation LOG10CT = Variable Coefficient F for selection LOG10Y X- 5 1.143156 0.1560002 7.328 1.00000 0.8906 1.084 LOG10RP X- 6 -0.8288375 0.3611136E-01 -22.95 1.00000 -0.9870 -0.7314 CONSTANT X- 7 1.373914 0.3060903 4.489 0.99949
Adjusted R Square 0.970697895872232 -2 * ln(Maximum of Likelihood Function) -101.322167384484 Akaike Information Criterion (AIC) -93.3221673844844 Scwartz Information Criterion (SIC) -89.9893140082595 Akaike (1970) Finite Prediction Error 0.215763077505479D-003 Generalized Cross Validation 0.222698319282440D-003 Hannan & Quinn (1979) HQ 0.218123847024249D-003 Shibata (1981) 0.204340326343424D-003 Rice (1984) 0.233416420210472D-003 Residual Variance 0.183398615879657D-003
Order of entrance (or deletion) of the variables = 7 6 5
Estimate of computational error in coefficients = 1 0.5793E-11 2 0.2356E-12 3 0.2547E-11
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 5 LOG10Y 0.24336056E-01
Row 2 Variable X- 6 LOG10RP -0.12513115E-02 0.13040301E-02
Row 3 Variable X- 7 CONSTANT -0.46626424E-01 0.76017246E-04 0.93691270E-01
Program terminated. All variables put in.
Residual Statistics for... Original Data
Von Neumann Ratio 1 ... 2.04710 Durbin-Watson TEST..... 1.92669 Von Neumann Ratio 2 ... 2.04710
For D. F. 14 t(.9999)= 5.3624, t(.999)= 4.1403, t(.99)= 2.9768, t(.95)= 2.1448, t(.90)= 1.7613, t(.80)= 1.3450
Skewness test (Alpha 3) = -.159503 , Peakedness test (Alpha 4)= 1.44345
Normality Test -- Extended grid cell size = 1.70 t Stat Infin 1.761 1.345 1.076 0.868 0.692 0.537 0.393 0.258 0.128 Cell No. 1 1 1 5 1 1 3 1 2 1 Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100 Act Per 1.000 0.941 0.882 0.824 0.529 0.471 0.412 0.235 0.176 0.059
Normality Test -- Small sample grid cell size = 3.40 Cell No. 2 6 2 4 3 Interval 1.000 0.800 0.600 0.400 0.200 Act Per 1.000 0.882 0.529 0.412 0.176
Extended grid normality test - Prob of rejecting normality assumption Chi= 9.471 Chi Prob= 0.6958 F(8, 14)= 1.18382 F Prob =0.626481
Small sample normality test - Large grid Chi= 3.294 Chi Prob= 0.6515 F(3, 14)= 1.09804 F Prob =0.617396
Autocorrelation function of residuals
1) -0.0990 2) -0.1061 3) 0.0862 4) -0.3157
F( 6, 6) = 0.5544 1/F = 1.804 Heteroskedasticity at 0.7544 level
Sum of squared residuals 0.2568E-02 Mean squared residual 0.1510E-03
Gen. Least Squares ended by satisfying tolerance.
We first show plots of the data.
[Two-panel line plot. Top panel, "Linear Theil Data": CT, RP, and Y plotted against observation number, vertical scale 60 to 160. Bottom panel, "Log Theil Data": LOG10CT, LOG10RP, and LOG10Y plotted against observation number, vertical scale 1.75 to 2.20.]
Figure 6.1 2 D Plots of Textile Data
Two dimensional plots of this dataset do not capture the full relationships. From the plots in Figure 6.1 it appears that the consumption of textiles increases when the relative price of textiles falls and that Y has little effect. Figure 6.2, which is based on a three dimensional extrapolation about each point, gives a better picture of the true relationship. This figure clearly shows that LOG10RP has the most effect on LOG10CT, which is on the Z axis, but that LOG10Y does have a positive effect. The OLS regression model attempts to capture this surface.
Remark: A 2-D plot may lead one to drop a variable that is in fact significant in a multi-dimensional context. A 3-D plot can help in cases where K=3, but may be less useful for larger problems.
The plots of CT against RP and LOG10CT against LOG10RP suggest a negative relationship, which is consistent with the economic theory that quantity demanded of a good will increase as its relative price falls. The correlations between these two sets of variables are negative (-.94664 and -.93596) and highly significant (at the .0001 level for both correlations). The plot between CT and Y and the plot between LOG10CT and LOG10Y do not show much of a relationship. The raw correlations are small (.06177 and .09786, respectively) and not significant. The preliminary finding might be that Y was not a good variable to use on the right-hand side of a model predicting CT. It will be shown later that such a conclusion would be wrong.
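The "small raw correlation, significant partial effect" point can be shown numerically. The sketch below (plain Python; the helper names corr and ols are invented, and the income series is taken as in the SAS cards) confirms that the raw correlation between CT and Y is tiny while the Y coefficient in the trivariate regression is large and positive:

```python
import math

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((a[i] - ma) * (b[i] - mb) for i in range(n))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((x - mb) ** 2 for x in b))
    return num / den

def ols(X, y):
    """OLS via the normal equations, solved by Gaussian elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(X[t][i] * y[t] for t in range(len(X))) for i in range(k)]
    M = [A[i][:] + [b[i]] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (M[r][k] - sum(M[r][j] * x[j] for j in range(r + 1, k))) / M[r][r]
    return x

ct = [99.2, 99, 100, 111.6, 122.2, 117.6, 121.1, 136, 154.2,
      153.6, 158.5, 140.6, 136.2, 168, 154.3, 149, 165.5]
inc = [96.7, 98.1, 100, 104.9, 104.9, 109.5, 110.8, 112.3, 109.3,
       105.3, 101.7, 95.4, 96.4, 97.6, 102.4, 101.6, 103.8]
rp = [101, 100.1, 100, 90.6, 86.5, 89.7, 90.6, 82.8, 70.1,
      65.4, 61.3, 62.5, 63.6, 52.6, 59.7, 59.5, 61.3]

r_ct_y = corr(ct, inc)  # near zero: bivariate analysis would discard Y
b_y = ols([[1.0, inc[i], rp[i]] for i in range(17)], ct)[1]  # but Y matters given RP
```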
[Three dimensional surface plot, "Log Theil Textile Data": log10ct on the Z axis plotted against log10y and log10rp.]
Figure 6.2 3-D Plot of Theil (1971) Textile Data
Remark: The preliminary estimation of a model CT = f(constant, Y, RP) indicates that the coefficients on Y and RP are 1.0617 (t = 3.98) and -1.383 (t = -16.5), respectively. The results support the maintained hypothesis that CT is positively related to income and negatively related to relative price. The Y variable, which was not significantly correlated with CT, was found to be significant when included in a regression controlling for RP. This demonstrates that it is important to go beyond raw cross-correlation analysis. If proposed variables are "prescreened out" by correlation analysis and never tried in regression models, many important variables may be incorrectly dropped from the analysis. It is important not to prematurely drop a proposed, theoretically plausible variable from a regression model specification, even if in preliminary specifications it does not enter significantly. Later an example will be presented to illustrate that, if other important variables are omitted from an equation, a variable that truly belongs in the equation may not show up as significant (omitted variable bias). The preceding discussion suggests that regression analysis requires careful use of diagnostic tests before the results are used in a production environment.
A possible problem with the above formulation is that the error process might exhibit heteroskedasticity, or nonconstant variance, because the time series values for CT are
increasing over time. If all the variables in the model are transformed into logs (to the base 10), some of the potential for difficulty may be avoided. If heteroskedasticity were to be present, the estimated standard errors of the coefficients would be biased. In addition, the estimated standard error of the
model, from equation (6-4), would be misleading, since it would be an average, and, assuming the variance of the error was increasing, would overstate the error at the beginning of the data set and understate the error at the end of the data set.
Log transforms to the base 10 are made and the model is estimated again and reported in the bottom equation of (6-4). The results indicate the log linear form of the model fits better (the adjusted R² is now .9707) and all coefficients, except for the constant, are more significant. Comparison of the estimated values with the actual values shows surprisingly good results, considering there are only two explanatory variables in the model.
One of the assumptions of an OLS regression is that the error process follows a random normal distribution with no serial correlation or heteroskedasticity (nonconstant variance). If the error process is normal but serially correlated or heteroskedastic, the estimated coefficients will still be unbiased, but the standard errors of the estimated coefficients will be biased.
Another important assumption of OLS is that the error terms are not related. If e_t is the error term of the estimated model and u_t is a random error, and the model

e_t = ρ_1 e_{t-1} + ρ_2 e_{t-2} + ... + ρ_K e_{t-K} + u_t            (6-5)

is estimated, no autocorrelation up to order K implies that the ρ_i for i = 1, ..., K are not significant. First-order serial correlation can be tested by the Durbin-Watson test statistic. If the Durbin-Watson statistic is around 2.0, there is no problem. If it is substantially below (above) 2.0 there is positive (negative) autocorrelation. This can be seen from the formula for the Durbin-Watson statistic,

DW = [ Σ_{t=2..T} (e_t - e_{t-1})² ] / [ Σ_{t=1..T} e_t² ]  ≈  2(1 - ρ_1)    (6-6)
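The approximation in (6-6) can be checked numerically. In the sketch below the residual series is invented purely for illustration; the Durbin-Watson statistic and the first-order autocorrelation are computed directly from their definitions:

```python
# Illustrative residual series (numbers invented for the demonstration)
e = [0.5, 0.3, 0.4, -0.2, -0.4, -0.1, 0.2, 0.3, -0.3, 0.1]

sse = sum(x * x for x in e)
# Durbin-Watson: sum of squared first differences over the sum of squares
dw = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e))) / sse
# First-order autocorrelation of the residuals
rho1 = sum(e[t] * e[t - 1] for t in range(1, len(e))) / sse
```

Expanding the squared difference shows DW = 2 - 2ρ_1 up to end-effect terms, which is why DW near 2 signals no first-order serial correlation.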
If serial correlation is found, the appropriate procedure is generalized least squares, which involves a transformation of the data. If heteroskedasticity is found, there are other procedures that can be used to remove the problem. To illustrate GLS, assume
y_t = β_1 + β_2 x_t + e_t                                            (6-7)
where t refers to the time period of the observation. If e_t = ρ e_{t-1} + u_t and model (6-5) is estimated for the residuals of (6-7) and ρ is significant, the appropriate procedure is to lag the original equation, multiply through by ρ, and subtract the result from the original equation. This would give

y_t - ρ y_{t-1} = β_1(1 - ρ) + β_2 (x_t - ρ x_{t-1}) + u_t           (6-9)

which will give unbiased estimates of β_1 and β_2 and their standard errors, since from (6-5) u_t = e_t - ρ e_{t-1}, and by assumption u_t does not contain serial correlation.
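The transformation in (6-9) can be illustrated with a contrived example (all numbers invented): take y_t = 2 + 3 x_t + e_t where the error is exactly AR(1) with ρ = .8 and no further shocks after the first period. Then ρ-differencing removes the error term completely, and simple OLS on the transformed data recovers β_2 = 3 and the transformed intercept β_1(1 - ρ) = 0.4:

```python
# Contrived data: exact AR(1) error, rho = 0.8, u_t = 0 after t = 1
rho = 0.8
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
e = [1.0]
for _ in range(len(x) - 1):
    e.append(rho * e[-1])            # e_t = rho * e_{t-1}
y = [2.0 + 3.0 * x[t] + e[t] for t in range(len(x))]

# rho-difference both sides, as in (6-9)
ys = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
xs = [x[t] - rho * x[t - 1] for t in range(1, len(x))]

# Simple OLS of y* on x* recovers beta2 and beta1*(1 - rho)
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((xs[t] - mx) * (ys[t] - my) for t in range(n)) / sum((v - mx) ** 2 for v in xs)
intercept = my - slope * mx
```

In practice ρ is unknown and must itself be estimated from the OLS residuals, which is exactly what the two-pass B34S procedure and the SAS ML estimator below do.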
As a test, a misspecified model (to induce serial correlation) containing only LOG10Y is run in both B34S and SAS. This model finds LOG10Y not significant and shows evidence of serial correlation, as measured by the low Durbin-Watson test statistic (.241). In the presence of serial correlation, the best course of action is to attempt to add new variables to explain the serial correlation. The B34S reg command output is shown first; the SAS autoreg command output follows.
REG Command. Version 1 February 1997
Real*8 space available 9000000 Real*8 space used 43
OLS Estimation Dependent variable LOG10CT Adjusted R**2 -5.645122564288263E-02 Standard Error of Estimate 8.131550225838252E-02 Sum of Squared Residuals 9.918316361299519E-02 Model Sum of Squares 9.590596648930139E-04 Total Sum of Squares 0.1001422232778882 F( 1, 15) 0.1450437196128150 F Significance 0.2913428904662156 1/Condition of XPX 1.523468705487359E-05 Number of Observations 17 Durbin-Watson 0.2414802718079813
Variable Coefficient Std. Error t LOG10Y { 0} 0.34782649 0.91329943 0.38084606 CONSTANT { 0} 1.4222303 1.8378693 0.77384737
SAS output next: The AUTOREG Procedure
Dependent Variable LOG10CT LOG10 OF CONSUMPTION
Ordinary Least Squares Estimates
SSE 0.00256758 DFE 14 MSE 0.0001834 Root MSE 0.01354 SBC -92.822527 AIC -95.322167 Regress R-Square 0.9744 Total R-Square 0.9744 Durbin-Watson 1.9267
Standard Approx
Variable DF Estimate Error t Value Pr > |t| Variable Label
Intercept 1 1.3739 0.3061 4.49 0.0005 LOG10Y 1 1.1432 0.1560 7.33 <.0001 LOG10 OF INCOME LOG10RP 1 -0.8288 0.0361 -22.95 <.0001 LOG10 OF RELATIVE PRICE
Estimates of Autocorrelations
Lag Covariance Correlation -1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1
0 0.000151 1.000000 | |********************| 1 -0.00001 -0.093221 | **| |
Preliminary MSE 0.000150
Estimates of Autoregressive Parameters
Standard Lag Coefficient Error t Value
1 0.093221 0.276142 0.34
Algorithm converged.
The SAS System 10:37 Wednesday, December 6, 2006 4
The AUTOREG Procedure
Maximum Likelihood Estimates
SSE 0.0025352 DFE 13 MSE 0.0001950 Root MSE 0.01396 SBC -90.189374 AIC -93.522227 Regress R-Square 0.9789 Total R-Square 0.9747 Durbin-Watson 1.7932
Standard Approx Variable DF Estimate Error t Value Pr > |t| Variable Label
Intercept 1 1.3592 0.2941 4.62 0.0005 LOG10Y 1 1.1487 0.1516 7.58 <.0001 LOG10 OF INCOME LOG10RP 1 -0.8271 0.0343 -24.09 <.0001 LOG10 OF RELATIVE PRICE AR1 1 0.1248 0.3186 0.39 0.7017
Autoregressive parameters assumed given.
Standard Approx Variable DF Estimate Error t Value Pr > |t| Variable Label
Intercept 1 1.3592 0.2875 4.73 0.0004 LOG10Y 1 1.1487 0.1471 7.81 <.0001 LOG10 OF INCOME LOG10RP 1 -0.8271 0.0338 -24.47 <.0001 LOG10 OF RELATIVE PRICE
The SAS System 10:37 Wednesday, December 6, 2006 5
The AUTOREG Procedure
Dependent Variable LOG10CT LOG10 OF CONSUMPTION
Ordinary Least Squares Estimates
SSE 0.09918316 DFE 15 MSE 0.00661 Root MSE 0.08132 SBC -33.537669 AIC -35.204096 Regress R-Square 0.0096 Total R-Square 0.0096 Durbin-Watson 0.2415
Standard Approx Variable DF Estimate Error t Value Pr > |t| Variable Label
Intercept 1 1.4222 1.8379 0.77 0.4510 LOG10Y 1 0.3478 0.9133 0.38 0.7087 LOG10 OF INCOME
Estimates of Autocorrelations
Lag Covariance Correlation -1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1
0 0.00583 1.000000 | |********************| 1 0.00447 0.765305 | |*************** |
Preliminary MSE 0.00242
Estimates of Autoregressive Parameters
Standard Lag Coefficient Error t Value
1 -0.765305 0.172027 -4.45
Algorithm converged.
The SAS System 10:37 Wednesday, December 6, 2006 6
The AUTOREG Procedure
Maximum Likelihood Estimates
SSE 0.02423721 DFE 14 MSE 0.00173 Root MSE 0.04161 SBC -53.034484 AIC -55.534124 Regress R-Square 0.0564 Total R-Square 0.7580 Durbin-Watson 1.6157
Standard Approx Variable DF Estimate Error t Value Pr > |t| Variable Label
Intercept 1 0.6643 1.6320 0.41 0.6901 LOG10Y 1 0.7229 0.8167 0.89 0.3910 LOG10 OF INCOME AR1 1 -0.8961 0.1312 -6.83 <.0001
Autoregressive parameters assumed given.
Standard Approx Variable DF Estimate Error t Value Pr > |t| Variable Label
Intercept 1 0.6643 1.5885 0.42 0.6821 LOG10Y 1 0.7229 0.7910 0.91 0.3762 LOG10 OF INCOME
The SAS output first shows the complete model, where the DW was 1.9267, indicating GLS was
not needed. If GLS is nevertheless run, the AR(1) parameter is estimated as .1248 and the DW falls to 1.79. For the misspecified equation the DW was .2415 before GLS and 1.6157 after GLS, with the AR(1) parameter estimated as -.8961. For B34S, which uses the two-pass method to do GLS, the results were:
Problem Number 1 Subproblem Number 3 F to enter 9.999999776482582E-03 F to remove 4.999999888241291E-03 Tolerance (1.-R**2) for including a variable 1.000000000000000E-05 Maximum Number of Variables Allowed 2 Internal Number of dependent variable 4 Dependent Variable LOG10CT Standard Error of Y 7.911314021618683E-02 Degrees of Freedom 16
............. Step Number 2 Analysis of Variance for reduction in SS due to variable entering Variable Entering 5 Source DF SS MS F F Sig. Multiple R 0.978620E-01 Due Regression 1 0.95906E-03 0.95906E-03 0.14504 0.291343 Std Error of Y.X 0.813155E-01 Dev. from Reg. 15 0.99183E-01 0.66122E-02 R Square 0.957698E-02 Total 16 0.10014 0.62589E-02
Multiple Regression Equation Variable Coefficient Std. Error T Val. T Sig. P. Cor. Elasticity Partial Cor. for Var. not in equation LOG10CT = Variable Coefficient F for selection LOG10Y X- 5 0.3478265 0.9132994 0.3808 0.29134 0.0979 0.3298 CONSTANT X- 7 1.422230 1.837869 0.7738 0.54895
Adjusted R Square -5.645122564294430E-02 -2 * ln(Maximum of Likelihood Function) -39.20409573256165 Akaike Information Criterion (AIC) -33.20409573256165 Scwartz Information Criterion (SIC) -30.70445570039300 Akaike (1970) Finite Prediction Error 7.390118073123232E-03 Generalized Cross Validation 7.493839028535489E-03 Hannan & Quinn (1979) HQ 7.454314110419272E-03 Shibata (1981) 7.207081092983957E-03 Rice (1984) 7.629474124074592E-03 Residual Variance 6.612210907531313E-03
Order of entrance (or deletion) of the variables = 7 5 Estimate of Computational Error in Coefficients
1 2 0.00000 0.00000
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 5 LOG10Y 0.83411584
Row 2 Variable X- 7 CONSTANT -1.6784283 3.3777634
Program terminated. All variables put in.
Residual Statistics for Original data
Von Neumann Ratio 1 ... 0.25657 Durbin-Watson TEST..... 0.24148 Von Neumann Ratio 2 ... 0.25657
For D. F. 15 t(.9999)= 5.2391, t(.999)= 4.0728, t(.99)= 2.9467, t(.95)= 2.1314, t(.90)= 1.7531, t(.80)= 1.3406
Skewness test (Alpha 3) = -.233040 , Peakedness test (Alpha 4)= 1.30008
Normality Test -- Extended grid cell size = 1.70 t Stat Infin 1.753 1.341 1.074 0.866 0.691 0.536 0.393 0.258 0.128 Cell No. 0 4 1 2 4 2 2 1 0 1 Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100 Act Per 1.000 1.000 0.765 0.706 0.588 0.353 0.235 0.118 0.059 0.059
Normality Test -- Small sample grid cell size = 3.40 Cell No. 4 3 6 3 1 Interval 1.000 0.800 0.600 0.400 0.200 Act Per 1.000 0.765 0.588 0.235 0.059
Extended grid normality test - Prob of rejecting normality assumption Chi= 10.65 Chi Prob= 0.7775 F(8, 15)= 1.33088 F Prob =0.698738
Small sample normality test - Large grid Chi= 3.882 Chi Prob= 0.7255 F(3, 15)= 1.29412 F Prob =0.687249
Autocorrelation function of residuals
1 2 3 4 0.813137 0.658160 0.545551 0.332369
F( 6, 6) = 1.730 1/F = 0.5781 Heteroskedasticity at 0.7389 level
Sum of squared residuals 9.918316361299512E-02 Mean squared residual 5.834303741940890E-03
B34S 8.10Z (D:M:Y) 6/12/06 (H:M:S) 11: 8:24 REGRESSION STEP PAGE 10
Doing Gen. Least Squares using residual Dif. Eq. of order 1 Lag Coefficients
1 0.842413
Standard Error of Y 0.2288578856184614 Degrees of Freedom 15
............. Step Number 2 Analysis of Variance for reduction in SS due to variable entering Variable Entering 5 Source DF SS MS F F Sig. Multiple R 0.105883 Due Regression 1 0.88080E-02 0.88080E-02 0.15874 0.303669 Std Error of Y.X 0.235559 Dev. from Reg. 14 0.77683 0.55488E-01 R Square 0.112113E-01 Total 15 0.78564 0.52376E-01
Multiple Regression Equation Variable Coefficient Std. Error T Val. T Sig. P. Cor. Elasticity Partial Cor. for Var. not in equation LOG10CT = Variable Coefficient F for selection LOG10Y X- 5 0.2959532 0.7428190 0.3984 0.30367 0.1059 0.2806 CONSTANT X- 7 1.605192 1.504752 1.067 0.69586
Adjusted R Square -5.941647786252678E-02 -2 * ln(Maximum of Likelihood Function) -2.995906752503885 Akaike Information Criterion (AIC) 3.004093247496115 Scwartz Information Criterion (SIC) 5.321859414215458 Akaike (1970) Finite Prediction Error 6.242391585298818E-02 Generalized Cross Validation 6.341477166017846E-02 Hannan & Quinn (1979) HQ 6.265098482400155E-02 Shibata (1981) 6.068991819040517E-02 Rice (1984) 6.473591273643219E-02 Residual Variance 5.548792520265616E-02
Order of entrance (or deletion) of the variables = 7 5
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 5 LOG10Y 0.55178009
Row 2 Variable X- 7 CONSTANT -1.1169023 2.2642794
Program terminated. All variables put in.
Residual Statistics for Smoothed Original data
For GLS Y and Y estimate scaled by 0.1575867198767030
Von Neumann Ratio 1 ... 2.30745 Durbin-Watson TEST..... 2.16324 Von Neumann Ratio 2 ... 2.30745
For D. F. 14 t(.9999)= 5.3634, t(.999)= 4.1405, t(.99)= 2.9768, t(.95)= 2.1448, t(.90)= 1.7613, t(.80)= 1.3450
Skewness test (Alpha 3) = 0.512095 , Peakedness test (Alpha 4)= 1.97237
Normality Test -- Extended grid cell size = 1.60 t Stat Infin 1.761 1.345 1.076 0.868 0.692 0.537 0.393 0.258 0.128 Cell No. 1 0 3 3 2 1 3 2 0 1 Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100 Act Per 1.000 0.938 0.938 0.750 0.562 0.438 0.375 0.188 0.062 0.062
Normality Test -- Small sample grid cell size = 3.20 Cell No. 1 6 3 5 1 Interval 1.000 0.800 0.600 0.400 0.200 Act Per 1.000 0.938 0.562 0.375 0.062
Extended grid normality test - Prob of rejecting normality assumption Chi= 7.750 Chi Prob= 0.5417 F(8, 14)= 0.968750 F Prob =0.503304
Small sample normality test - Large grid Chi= 6.500 Chi Prob= 0.9103 F(3, 14)= 2.16667 F Prob =0.862453
Autocorrelation function of residuals
1 2 3 4 -0.159923 -0.479294 0.253850 -0.348167E-01
F( 5, 5) = 0.3909 1/F = 2.558 Heteroskedasticity at 0.8371 level
Sum of squared residuals 0.7768309528371908 Mean squared residual 4.855193455232443E-02
Gen. Least Squares ended by satisfying tolerance.
Here the estimated ρ was .8424, and after GLS the DW was 2.16, which is higher than found with SAS. Note the change in sign of the AR(1) parameter in the SAS output: although there is positive serial correlation (the first-order autocorrelation of the residuals was .7653), SAS reports an AR1 estimate of -.8961 because PROC AUTOREG parameterizes the autoregressive term with the opposite sign. The coefficient on the insignificant LOG10Y term is now found to be .2959, in place of the .7229 found with SAS, but very close to the OLS value of .3478.
Remarks: What can we conclude from the preceding results? Serial correlation was not the reason that LOG10Y was not significant (as measured by the low t value) in the OLS equation containing just LOG10Y on the right-hand side. In this equation, LOG10Y was not significant because of omitted variable bias. The B34S two-pass GLS procedure was able to remove more serial correlation than the SAS ML approach. We found that LOG10Y is a significant variable in a properly specified equation. This problem illustrates how it would be a mistake to remove LOG10Y from consideration as a potentially important variable just because it does not enter significantly into a serial correlation-free equation that does not contain all the appropriate variables on the right.
An example having different problems is illustrated by a dataset from the engineering literature (from Brownlee [1965] Statistical Theory and Methodology, page 454) that is presented in Table Three. Here the maintained hypothesis is that the stack loss of ingoing ammonia (Y) is related to the operation of a factory that converts ammonia to nitric acid by the process of oxidation. There are data on three variables for 21 days of plant operation: X1 = air flow, X2 = cooling water inlet temperature, X3 = acid concentration, and Y = stack loss of ammonia.

Table Three
Brownlee Engineering Stack Loss Data
Obs   X1   X2   X3    Y
  1   80   27   89   42
  2   80   27   88   37
  3   75   25   90   37
  4   62   24   87   28
  5   62   22   87   18
  6   62   23   87   18
  7   62   24   93   19
  8   62   24   93   20
  9   58   23   87   15
 10   58   18   80   14
 11   58   18   89   14
 12   58   17   88   13
 13   58   18   82   11
 14   58   19   93   12
 15   50   18   89    8
 16   50   18   86    7
 17   50   19   72    8
 18   50   19   79    8
 19   50   20   80    9
 20   56   20   82   15
 21   70   20   91   15
X1 = air flow.
X2 = cooling water inlet temperature.
X3 = acid concentration.
Y  = stack loss of ammonia.
The following B34S commands will load the data and perform the required analysis.
/$ Sample Data # 3
/$ Data from Brownlee (1965) page 454
b34sexec data corr$
INPUT X1 X2 X3 Y$
LABEL X1 = 'AIR FLOW'$
LABEL X2 = 'COOLING WATER INLET TEMPERATURE'$
LABEL X3 = 'ACID CONCENTRATION'$
LABEL Y  = 'STACK LOSS' $
DATACARDS$
80 27 89 42
80 27 88 37
75 25 90 37
62 24 87 28
62 22 87 18
62 23 87 18
62 24 93 19
62 24 93 20
58 23 87 15
58 18 80 14
58 18 89 14
58 17 88 13
58 18 82 11
58 19 93 12
50 18 89 8
50 18 86 7
50 19 72 8
50 19 79 8
50 20 80 9
56 20 82 15
70 20 91 15
b34sreturn$
b34seend$
b34sexec regression maxgls=2 residuala$
 MODEL Y = X1 X2 X3 $
b34seend$
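As a cross-check on the B34S run (a plain-Python sketch via the normal equations, not B34S itself; the helper name ols is invented), fitting Y on a constant, X1, X2, and X3 should reproduce the coefficients reported in the output below:

```python
def ols(X, y):
    """OLS via the normal equations, solved by Gaussian elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(X[t][i] * y[t] for t in range(len(X))) for i in range(k)]
    M = [A[i][:] + [b[i]] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    out = [0.0] * k
    for r in range(k - 1, -1, -1):
        out[r] = (M[r][k] - sum(M[r][j] * out[j] for j in range(r + 1, k))) / M[r][r]
    return out

# Brownlee (1965) stack loss data, 21 days of plant operation
x1 = [80, 80, 75, 62, 62, 62, 62, 62, 58, 58, 58, 58, 58, 58, 50, 50, 50, 50, 50, 56, 70]
x2 = [27, 27, 25, 24, 22, 23, 24, 24, 23, 18, 18, 17, 18, 19, 18, 18, 19, 19, 20, 20, 20]
x3 = [89, 88, 90, 87, 87, 87, 93, 93, 87, 80, 89, 88, 82, 93, 89, 86, 72, 79, 80, 82, 91]
yy = [42, 37, 37, 28, 18, 18, 19, 20, 15, 14, 14, 13, 11, 12, 8, 7, 8, 8, 9, 15, 15]

# beta = [constant, X1, X2, X3]
beta = ols([[1.0, x1[i], x2[i], x3[i]] for i in range(21)], yy)
```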
The results of the OLS model fit are reported next.

B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 21:47:54 DATA STEP PAGE 1
Variable # Label Mean Std. Dev. Variance Maximum Minimum
X1        1   AIR FLOW                          60.4286   9.16827   84.0571   80.0000   50.0000
X2        2   COOLING WATER INLET TEMPERATURE   21.0952   3.16077   9.99048   27.0000   17.0000
X3        3   ACID CONCENTRATION                86.2857   5.35857   28.7143   93.0000   72.0000
Y         4   STACK LOSS                        17.5238   10.1716   103.462   42.0000   7.00000
CONSTANT  5                                     1.00000   0.00000   0.00000   1.00000   1.00000
Data file contains 21 observations on 5 variables. Current missing value code is 0.1000000000000000E+32
B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 21:47:54 DATA STEP PAGE 2
Correlation Matrix
                    1         2         3         4
X2       Var 2    0.78185
X3       Var 3    0.50014   0.39094
Y        Var 4    0.91966   0.87550   0.39983
CONSTANT Var 5    0.0000    0.0000    0.0000    0.0000
B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 21:47:54 REGRESSION STEP PAGE 3
*************** Problem Number 1 Subproblem Number 1
F to enter            0.99999998E-02
F to remove           0.49999999E-02
Tolerance             0.10000000E-04
Maximum no of steps   4
Dependent variable    X( 4). Variable Name Y
Standard Error of Y = 10.171623 for degrees of freedom = 20.
Step Number 4     Analysis of Variance for reduction in SS due to variable entering
Variable Entering 3

                              Source           DF   SS       MS       F        F Sig.
Multiple R        0.955812    Due Regression    3   1890.4   630.14   59.902   1.000000
Std Error of Y.X  3.24336     Dev. from Reg.   17   178.83   10.519
R Square          0.913577    Total            20   2069.2   103.46
Multiple Regression Equation
Y =
Variable          Coefficient     Std. Error      T Val.    T Sig.    P. Cor.   Elasticity
X1       X- 1     0.7156402       0.1348582        5.307    0.99994    0.7897    2.468
X2       X- 2     1.295286        0.3680243        3.520    0.99737    0.6492    1.559
X3       X- 3    -0.1521225       0.1562940       -0.9733   0.65595   -0.2297   -0.7490
CONSTANT X- 5   -39.91967        11.89600         -3.356    0.99625
Adjusted R Square                         0.898325769953741
-2 * ln(Maximum of Likelihood Function)   104.575591004800
Akaike Information Criterion (AIC)        114.575591004800
Scwartz Information Criterion (SIC)       119.798203193417
Akaike (1970) Finite Prediction Error     12.5231065545072
Generalized Cross Validation              12.9945646836181
Hannan & Quinn (1979) HQ                  13.0142387900995
Shibata (1981)                            11.7597933930896
Rice (1984)                               13.7561508921817
Residual Variance                         10.5194095057860
Order of entrance (or deletion) of the variables = 1 5 2 3
Estimate of computational error in coefficients = 1 0.3461E-10 2 0.1208E-10 3 0.1959E-09 4 0.1472E-15
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.18186730E-01
Row 2 Variable X- 2 X2 -0.36510675E-01 0.13544186
Row 3 Variable X- 3 X3 -0.71435215E-02 0.10476827E-04 0.24427828E-01
Row 4 Variable X- 5 CONSTANT 0.28758711 -0.65179437 -1.6763208 141.51474
Program terminated. All variables put in.
Residual Statistics for... Original Data
Von Neumann Ratio 1 ... 1.55939 Durbin-Watson TEST..... 1.48513 Von Neumann Ratio 2 ... 1.55939
For D. F. 17 t(.9999)= 5.0433, t(.999)= 3.9650, t(.99)= 2.8982, t(.95)= 2.1098, t(.90)= 1.7396, t(.80)= 1.3334
Skewness test (Alpha 3) = -.140452 , Peakedness test (Alpha 4)= 2.03637
Normality Test -- Extended grid cell size = 2.10
t Stat   Infin 1.740 1.333 1.069 0.863 0.689 0.534 0.392 0.257 0.128
Cell No.     2     1     0     3     4     1     5     2     2     1
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.905 0.857 0.857 0.714 0.524 0.476 0.238 0.143 0.048

Normality Test -- Small sample grid cell size = 4.20
Cell No.     3     3     5     7     3
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.857 0.714 0.476 0.143
Extended grid normality test - Prob of rejecting normality assumption Chi= 9.952 Chi Prob= 0.7316 F(8, 17)= 1.24405 F Prob =0.666586
Small sample normality test - Large grid Chi= 3.048 Chi Prob= 0.6157 F(3, 17)= 1.01587 F Prob =0.589919
Autocorrelation function of residuals
1) 0.0858 2) -0.1149 3) -0.0409 4) -0.0064
F( 7, 7) = 1.336 1/F = 0.7485 Heteroskedasticity at 0.6440 level
Sum of squared residuals 178.8 Mean squared residual 8.516
Gen. Least Squares ended by satisfying tolerance.
Y = -39.91967 + .7156402*X1 + 1.295286*X2 - .1521225*X3
    (-3.356)    (5.307)       (3.520)       (-.9733)

Adjusted R-squared = .8983, Sum of squared residuals = 178.83        (6-10)
Two of the three variables (in addition to the constant) are found to be significant (significantly different from zero at the 95% level or better). The raw correlations with Y were .9197, .8755 and .3998, respectively. Of the three variables, X3 (acid concentration) was not significant at the 95% level or better, because its t statistic was less than 2 in absolute value. The variable X1 (air flow) was found to be positively related to stack loss, and the variable X2 (cooling water inlet temperature) was also found to be positively related to stack loss. In this model, 89.83% of the variance is explained by the three variables on the right. Clearly, stack loss can be lowered if X1 and X2 are lowered. The X3 variable (acid concentration) was not significant, even though the raw correlations show some relationship (correlation = .39983)1. The OLS equation was found to have a Durbin-Watson statistic of 1.4851, showing some serial correlation. First-order GLS was requested but was not executed, since the residual correlation was less than the tolerance.
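Equation (6-10) can be verified outside B34S. The following Python sketch (numpy is assumed to be available; the array names are ours, not part of the B34S run) reproduces the OLS coefficients and the residual sum of squares from the Table Three data:

```python
import numpy as np

# Brownlee (1965) stack loss data as listed in Table Three
x1 = [80,80,75,62,62,62,62,62,58,58,58,58,58,58,50,50,50,50,50,56,70]
x2 = [27,27,25,24,22,23,24,24,23,18,18,17,18,19,18,18,19,19,20,20,20]
x3 = [89,88,90,87,87,87,93,93,87,80,89,88,82,93,89,86,72,79,80,82,91]
y  = [42,37,37,28,18,18,19,20,15,14,14,13,11,12, 8, 7, 8, 8, 9,15,15]

# Design matrix with a constant term, matching the B34S MODEL statement
X = np.column_stack([x1, x2, x3, np.ones(len(y))]).astype(float)
y = np.asarray(y, float)

b, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS coefficients
resid = y - X @ b
rss = float(resid @ resid)                  # sum of squared residuals
print(b)      # approx [0.7156, 1.2953, -0.1521, -39.92]
print(rss)    # approx 178.83
```

The least squares solution agrees with the B34S output to the printed precision.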
Remark: A nearly significant raw correlation is no assurance that the variable will be significant in a more fully specified model. From an economist's point of view, the results reported in the above paragraph suggest that the tradeoffs of a lower air flow and lower cooling water inlet temperature
1 Depending on whether the large or small sample SE is used the value is
must be weighed against absorption technology changes that would lower the constant. While engineering considerations are clearly paramount in the decision process, the regression results, which can be readily obtained with a modern PC, can help summarize the data and highlight the relationships between the variables of interest. Of course, it is important to select the appropriate data to use in the study. If data on key variables are omitted, the results of the study could be called into question. However, the problem may not be as bad as it seems. If an important variable was inadvertently omitted, its effect should be visible in the error process unless it was random. In the last analysis, the value of a model lies in how well it works. Inspection of the results is a key aspect of the validation of a model.
Table Four shows a dataset (taken from Brownlee [1965], op. cit., page 463). Here the number of deaths from heart disease per 100,000 males aged 55-59 years in a number of countries is related to the number of telephones per head (X1) (presumably a measure of stress and/or income), the percentage of total calories from fat (X2), and the percentage of total calories from animal protein (X3).
Table Four
Brownlee Health Data
Obs   X1  X2  X3   Y
  1  124  33   8  81
  2   49  31   6  55
  3  181  38   8  80
  4    4  17   2  24
  5   22  20   4  78
  6  152  39   6  52
  7   75  30   7  88
  8   54  29   7  45
  9   43  35   6  50
 10   41  31   5  69
 11   17  23   4  66
 12   22  21   3  45
 13   16   8   3  24
 14   10  23   3  43
 15   63  37   6  38
 16  170  40   8  72
 17  125  38   6  41
 18   15  25   4  38
 19  221  39   7  52
 20  171  33   7  52
 21   97  38   6  66
 22  254  39   8  89
X1 = 1000 * telephones per head.
X2 = fat calories as a % of total calories.
X3 = animal protein as a % of total calories.
Y = 100 * log number of deaths per 1000 males 55-59 years.
The B34S commands to analyze this data are:
/$ Sample Data # 4
/$ From Brownlee (1965) page 463
b34sexec data corr$
INPUT X1 X2 X3 Y$
LABEL X1 = '1000 * TELEPHONES PER HEAD'$
LABEL X2 = ' FAT CALORIES AS A % OF TOTAL CALORIES'$
LABEL X3 = 'ANIMAL PROTEIN AS A % TO TOTAL CALORIES'$
LABEL Y = '100 * LOG # DEATHS PER 1GMALES 55-59'$
DATACARDS$
124 33 8 81
 49 31 6 55
181 38 8 80
  4 17 2 24
 22 20 4 78
152 39 6 52
 75 30 7 88
 54 29 7 45
 43 35 6 50
 41 31 5 69
 17 23 4 66
 22 21 3 45
 16  8 3 24
 10 23 3 43
 63 37 6 38
170 40 8 72
125 38 6 41
 15 25 4 38
221 39 7 52
171 33 7 52
 97 38 6 66
254 39 8 89
b34sreturn$
b34seend$
b34sexec regression residuala$
MODEL Y = X1 X2 X3 $
b34seend$
The raw correlation results show X1, X2 and X3 positively related to Y, with correlations of .46875, .44628 and .62110, respectively. The variable X3 appears to be the most important.
Y = 23.9306 - .0067849*X1 - .478240*X2 + 8.496616*X3
    (1.499)   (-.0833)      (-.6315)     (2.21)

Adjusted R-squared = .3017, Sum of squared residuals = 4686        (6-11)
The OLS results indicate that only 30.17% of the variance can be explained and that the animal protein variable (X3) is the only significant variable. Clearly, this finding is interesting, but the large unexplained component suggests that more data need to be collected to improve the model. It may well be the case that the animal protein variable (X3) is related to other unspecified variables and
interpreting it without qualification would be dangerous. This will have to be investigated in future research if more data is available.
Remark: This dataset shows a case where correlation analysis suggested a result that did not stand up in a multiple regression model. This is in contrast to the Theil dataset, where correlation analysis did not suggest a relationship that was found only with a regression model.
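As a cross-check on (6-11), the Table Four regression and the raw correlations can be reproduced with a short numpy sketch (illustrative only; the variable names are ours):

```python
import numpy as np

# Brownlee (1965) health data from Table Four
x1 = [124,49,181,4,22,152,75,54,43,41,17,22,16,10,63,170,125,15,221,171,97,254]
x2 = [33,31,38,17,20,39,30,29,35,31,23,21,8,23,37,40,38,25,39,33,38,39]
x3 = [8,6,8,2,4,6,7,7,6,5,4,3,3,3,6,8,6,4,7,7,6,8]
y  = [81,55,80,24,78,52,88,45,50,69,66,45,24,43,38,72,41,38,52,52,66,89]

X = np.column_stack([x1, x2, x3, np.ones(len(y))]).astype(float)
y = np.asarray(y, float)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS with a constant
rss = float(np.sum((y - X @ coef) ** 2))
print(coef)                          # approx [-0.00678, -0.478, 8.497, 23.93]
print(round(rss, 1))                 # approx 4685.5
print(np.corrcoef(x3, y)[0, 1])      # approx 0.6211: X3 has the largest raw correlation
```

Only X3 survives in the multiple regression even though all three raw correlations are positive, which is the point of the remark above.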
B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 21:58:58 DATA STEP PAGE 1

Variable # Label Mean Std. Dev. Variance Maximum Minimum

X1        1   1000 * TELEPHONES PER HEAD                 87.5455   75.4212   5688.35   254.000   4.00000
X2        2   FAT CALORIES AS A % OF TOTAL CALORIES      30.3182   8.68708   75.4654   40.0000   8.00000
X3        3   ANIMAL PROTEIN AS A % TO TOTAL CALORIES    5.63636   1.86562   3.48052   8.00000   2.00000
Y         4   100 * LOG # DEATHS PER 1GMALES 55-59       56.7273   19.3075   372.779   89.0000   24.0000
CONSTANT  5                                              1.00000   0.00000   0.00000   1.00000   1.00000
Data file contains 22 observations on 5 variables. Current missing value code is 0.1000000000000000E+32
B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 21:58:58 DATA STEP PAGE 2
Correlation Matrix
                    1         2         3         4
X2       Var 2    0.75915
X3       Var 3    0.80220   0.83018
Y        Var 4    0.46875   0.44628   0.62110
CONSTANT Var 5    0.0000    0.0000    0.0000    0.0000
B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 21:58:58 REGRESSION STEP PAGE 3
*************** Problem Number 3 Subproblem Number 1
F to enter            0.99999998E-02
F to remove           0.49999999E-02
Tolerance             0.10000000E-04
Maximum no of steps   4
Dependent variable    X( 4). Variable Name Y
Standard Error of Y = 19.307491 for degrees of freedom = 21.
Step Number 4     Analysis of Variance for reduction in SS due to variable entering
Variable Entering 1

                              Source           DF   SS       MS       F        F Sig.
Multiple R        0.633617    Due Regression    3   3142.9   1047.6   4.0246   0.976444
Std Error of Y.X  16.1340     Dev. from Reg.   18   4685.5   260.31
R Square          0.401471    Total            21   7828.4   372.78
Multiple Regression Equation
Y =
Variable          Coefficient       Std. Error       T Val.       T Sig.    P. Cor.   Elasticity
X1       X- 1    -0.6784908E-02     0.8144097E-01   -0.8331E-01   0.06548   -0.0196   -0.1047E-01
X2       X- 2    -0.4782399         0.7572547       -0.6315       0.46438   -0.1472   -0.2556
X3       X- 3     8.496616          3.844121         2.210        0.95973    0.4620    0.8442
CONSTANT X- 5    23.93061          15.96606          1.499        0.84875
Adjusted R Square                         0.301715772931645
-2 * ln(Maximum of Likelihood Function)   180.379400452936
Akaike Information Criterion (AIC)        190.379400452936
Scwartz Information Criterion (SIC)       195.834612719728
Akaike (1970) Finite Prediction Error     307.634186421501
Generalized Cross Validation              318.151594504287
Hannan & Quinn (1979) HQ                  321.036004555605
Shibata (1981)                            290.423882286032
Rice (1984)                               334.678950062951
Residual Variance                         260.305850048962
Order of entrance (or deletion) of the variables = 3 5 2 1
Estimate of computational error in coefficients =
1 0.2532E-11 2 -0.5376E-11 3 -0.7851E-12 4 -0.9443E-13
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.66326323E-02
Row 2 Variable X- 2 X2 -0.17265284E-01 0.57343470
Row 3 Variable X- 3 X3 -0.14835665 -1.6567911 14.777267
Row 4 Variable X- 5 CONSTANT 0.77898727 -6.5357233 -20.071205 254.91515
Program terminated. All variables put in.
Residual Statistics for... Original Data
Von Neumann Ratio 1 ... 2.21784 Durbin-Watson TEST..... 2.11703 Von Neumann Ratio 2 ... 2.21784
For D. F. 18 t(.9999)= 4.9654, t(.999)= 3.9216, t(.99)= 2.8784, t(.95)= 2.1009, t(.90)= 1.7341, t(.80)= 1.3304
Skewness test (Alpha 3) = 0.145227 , Peakedness test (Alpha 4)= 1.39268
Normality Test -- Extended grid cell size = 2.20
t Stat   Infin 1.734 1.330 1.067 0.862 0.688 0.534 0.392 0.257 0.127
Cell No.     1     2     5     2     1     2     3     4     1     1
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.955 0.864 0.636 0.545 0.500 0.409 0.273 0.091 0.045

Normality Test -- Small sample grid cell size = 4.40
Cell No.     3     7     3     7     2
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.864 0.545 0.409 0.091
Extended grid normality test - Prob of rejecting normality assumption Chi= 8.000 Chi Prob= 0.5665 F(8, 18)= 1.00000 F Prob =0.531050
Small sample normality test - Large grid Chi= 5.273 Chi Prob= 0.8471 F(3, 18)= 1.75758 F Prob =0.808728
Autocorrelation function of residuals
1) -0.0991 2) 0.1355 3) -0.4051 4) 0.1520
F( 7, 7) = 1.432 1/F = 0.6985 Heteroskedasticity at 0.6762 level
Sum of squared residuals 4686. Mean squared residual 213.0
The preceding sections have outlined some of the things that can be done with simple regression analysis. In the next section of the paper, data will be generated that will better illustrate problems of omitted variables and "hidden" nonlinearity.
7. Advanced Regression analysis
The B34S code listed below shows how 250 observations on a number of series are generated. The B34S regression models are also shown.
/;
/; nonlinearity and serial correlation in generated data
/;
b34sexec data noob=250 maxlag=1
/; corr;
* b0=1 b1=100 b2=-100 b3=80 $
* generate three output variables with different characteristics$
build x1 x2 x3 y ynlin yma e$
gen x1 = rn()$
gen x2 = x1*x1$
gen x3 = lag1(x1)$
/; gen e = 100.*rn()$
gen e = rn()$
* ;
* build three variables $
* y=f(x1,x2 x3) ;
* ynlin=f(x1, x3);
* yma =f(x1,x2,x3) + theta*lag(et);
* ;
gen y = 1.0 + 1.*x1 - 1.*x2 + .8*x3 + e $
gen ynlin = y - .8*x3 $
* generate an ma model;
gen yma = y + (-.95*lag1(e));
b34srun$
/;
/; end of data building
/;
/; b34sexec list iend=20$ b34seend$
b34sexec reg$ model y = x1 x2 $ b34seend$
b34sexec reg$ model y = x1 $ b34seend$
b34sexec reg$ model ynlin = x1 $ b34seend$
b34sexec reg$ model ynlin = x1 x2 $ b34seend$
b34sexec reg$ model yma = x1 x2 x3$ b34seend$
/$ do gls
b34sexec regression residuala maxgls=4$ model yma=x1 x2 x3 $ b34seend$
b34sexec matrix;
call loaddata;
call load(rrplots);
call load(data2acf);
call olsq(yma x1 x2 x3 :print);
call data2acf(%res,'Model yma=f(x1, x2, x3)',12,'yma_res_acf.wmf');
b34srun;
/; sort data by variable we suspect is nonlinear
/; Then do RR analysis
/;
b34sexec sort $ by x1$ b34seend$
/; b34sexec list iend=20$ b34seend$
b34sexec reg$ model y = x1 $ b34seend$
b34sexec reg$ model y = x1 x3 $ b34seend$
b34sexec reg$ model ynlin = x1 $ b34seend$
/;
/; recursive residual analysis
/; x2, which is a nonlinear x1 term, is missing. Can RR detect it?
/;
b34sexec matrix;call loaddata;call load(rrplots);
/; call print(rrplots);
call olsq(y x1 x3 :rr 1 :print);
/; call tabulate(%rrobs,%ssr1,%ssr2,%rr,%rrstd,%res);
call print('Sum of squares of std RR ',sumsq(goodrow(%rrstd)):);
call print('Sum of squares of OLS RES ',sumsq(goodrow(%res)):);
/; call print(%rrcoef,%rrcoeft);
/; call rrplots(%rrstd,%rss,%nob,%k,%ssr1,%ssr2,1);
call rrplots(%rrstd,%rss,%nob,%k,%ssr1,%ssr2,0);
/; call names(all);
x1_coef=%rrcoef(,1);
x3_coef=%rrcoef(,2);
call graph(x1_coef,x3_coef :file 'coef_bias.wmf' :nolabel
 :heading 'Omitted Variable Bias x1 and x3 coef');
b34srun;

The above code builds three models.
y(t) = 1.0 + x1(t) - x2(t) + .8 x3(t) + e(t)        (7-1)

ynlin(t) = y(t) - .8 x3(t) = 1.0 + x1(t) - x2(t) + e(t)        (7-2)

yma(t) = y(t) - .95 e(t-1)        (7-3)

By construction of the data and ignoring subscripts where there is no confusion:

x2(t) = x1(t)**2        (7-4)

x3(t) = x1(t-1)        (7-5)
Since x1 is a serially uncorrelated random variable, there is no correlation between x1 and its lag x3. Because of (7-4) there is correlation between x1 and x2. The purpose of the generated dataset is to illustrate the conditions under which an omitted variable will and will not bias the coefficients estimated for an incomplete model, and to show a detection strategy. The yma series illustrates the relationship between AR (autoregressive) and MA (moving average) error processes.
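The omitted-variable argument can be made concrete with a small simulation. The sketch below (numpy assumed; the seed and sample size are arbitrary, so the estimates vary from sample to sample) generates data following (7-1), (7-4) and (7-5) and checks the exact in-sample omitted-variable identity: the short-regression coefficient equals the long-regression coefficient plus the omitted variable's coefficient times the auxiliary-regression slope.

```python
import numpy as np

rng = np.random.default_rng(7)           # arbitrary seed; the identity below holds for any sample
n = 249
x1 = rng.standard_normal(n)
x2 = x1 ** 2                             # (7-4): x2 is x1 squared
x3 = np.concatenate(([0.0], x1[:-1]))    # (7-5): x3 is x1 lagged one period
e  = rng.standard_normal(n)
y  = 1.0 + x1 - x2 + 0.8 * x3 + e        # (7-1)

def ols(yv, *cols):
    # OLS with a constant appended as the last column
    X = np.column_stack(cols + (np.ones(len(yv)),))
    return np.linalg.lstsq(X, yv, rcond=None)[0]

b_long  = ols(y, x1, x2)    # x3 omitted: harmless, x3 is (population-)uncorrelated with x1 and x2
b_short = ols(y, x1)        # x2 omitted as well
d       = ols(x2, x1)       # auxiliary regression of the omitted x2 on the included x1

# Exact in-sample identity: short coef = long coef + (omitted coef) * (auxiliary slope)
print(b_short[0])
print(b_long[0] + b_long[1] * d[0])      # prints the same number
```

The identity is pure least-squares algebra, so it holds whatever the true data-generating process is; the bias is material only when the auxiliary slope d is non-negligible.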
Assume the lag operator L is defined such that L x(t) = x(t-1). A simple OLS model with an MA error process is defined as
y(t) = Xb + theta(L)e(t),        (7-6)

where theta(L) is a polynomial in the lag operator L. A simple OLS model with an AR process is defined as
y(t) = Xb + (1/phi(L))e(t),        (7-7)

where phi(L) is a polynomial in the lag operator L. If we assume further that the maximum order in theta(L) is 1, i. e.
theta(L) = 1 - theta1*L        (7-8)
It can be proved that a first-order MA model (MA(1)) is equal to an infinite-order AR model if |theta1| < 1. This can be seen if we note that

1/(1 - theta1*L) = 1 + theta1*L + (theta1**2)L**2 + (theta1**3)L**3 + ...        (7-9)
where |theta1| < 1. The importance of equation (7-9) is that it shows that if equation (7-3) is estimated with GLS, which is implicitly an AR error-correction technique, more than first-order GLS will be required to remove the serial correlation in the error term. In a transfer function model of the form
y(t) = Xb + (theta(L)/phi(L))e(t)        (7-10)
then only one MA term (7-8) would be needed and phi(L) = 1. An OLS model is a transfer function model that constrains theta(L) = phi(L) = 1. GLS allows phi(L) to differ from 1 while keeping theta(L) = 1.
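The inversion in (7-9) can be checked numerically. A small sketch (numpy assumed; theta = .95 matches the yma construction) multiplies (1 - theta*L) by a truncated version of its inverse and shows that all interior terms cancel, leaving only a truncation remainder theta**(J+1) that dies off slowly because theta is close to one. This is why a low-order AR (GLS) correction cannot fully whiten an MA(1) error:

```python
import numpy as np

theta = 0.95                     # MA(1) coefficient used to build yma
J = 50                           # truncation order for the AR expansion

ar_weights = theta ** np.arange(J + 1)    # 1, theta, theta**2, ..., theta**J from (7-9)
ma_poly = np.array([1.0, -theta])         # coefficients of (1 - theta*L)

# Polynomial multiplication: (1 - theta*L)(1 + theta*L + ... + theta**J L**J)
product = np.convolve(ma_poly, ar_weights)

print(product[0])                         # leading term: exactly 1
print(np.abs(product[1:J + 1]).max())     # interior terms cancel (up to rounding)
print(product[-1])                        # remainder -theta**(J+1): still about -0.07 at J = 50,
                                          # so low-order AR corrections leave serial correlation behind
```

With a smaller theta the remainder would vanish after a few lags and one GLS pass would suffice.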
The means of the data generated in accordance with equations (7-1) - (7-3) and OLS estimation of a number of models are given next.
Variable # Cases Mean Std Deviation Variance Maximum Minimum
X1   1   249   0.1508079470   1.047903751   1.098102271   3.422285173   -2.584990017
X2   2   249   1.116435259    1.532920056   2.349843898   11.71203580    0.2973256533E-04
X3   3   249   0.1609627505   1.054029274   1.110977710   3.422285173   -2.584990017
Y          4   249   0.1975673548       2.283588868   5.214778119   5.013192154   -9.876622239
YNLIN      5   249   0.6879715443E-01   2.085180439   4.347977464   4.117746229   -9.568155797
YMA        6   249   0.1594119944       2.399542772   5.757805512   6.108808152   -10.57614307
E          7   249   0.3442446593E-01   1.059199304   1.121903166   2.876329588   -3.278246463
CONSTANT   8   249   1.000000000        0.000000000   0.000000000   1.000000000   1.000000000
Number of observations in data file 249 Current missing variable code 1.000000000000000E+31
The output listed below shows that the coefficients for x1 and x2 are close to their population values of 1.0 and -1.0, even though x3 is missing from the model. This is because the omitted variable x3 is not correlated with any included variable.

REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 638
OLS Estimation
Dependent variable           Y
Adjusted R**2                0.6416762467329375
Standard Error of Estimate   1.366959717042574
Sum of Squared Residuals     459.6704015322101
Model Sum of Squares         833.5945719535468
Total Sum of Squares         1293.264973485757
F( 2, 246)                   223.0557634525042
F Significance               1.000000000000000
1/Condition of XPX           0.1334523794627398
Number of Observations       249
Durbin-Watson                1.889314291485050
Variable          Coefficient        Std. Error         t
X1       { 0}     1.1228155          0.83599150E-01     13.430944
X2       { 0}    -1.0266583          0.57148357E-01    -17.964792
CONSTANT { 0}     1.1744354          0.10731666         10.943645
B34S 8.10Z (D:M:Y) 10/12/06 (H:M:S) 8:14:43 REG STEP PAGE 3
Since the omitted variable x2 is related to the included variable x1, the estimated coefficient for x1 is biased.
REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 508
OLS Estimation
Dependent variable           Y
Adjusted R**2                0.1749360075645998
Standard Error of Estimate   2.074253035297188
Sum of Squared Residuals     1062.723836646581
Model Sum of Squares         230.5411368391758
Total Sum of Squares         1293.264973485757
F( 1, 247)                   53.58274542797675
F Significance               0.9999999999965166
1/Condition of XPX           0.7121866785207466
Number of Observations       249
Durbin-Watson                1.944404933534759
Variable          Coefficient        Std. Error         t
X1       { 0}     0.92008294         0.12569399         7.3200236
CONSTANT { 0}     0.58811535E-01     0.13281015         0.44282410
The model for ynlin does not contain x3. Here the omission of x2 shows up as a bias on the included variable x1. The fact that x1 appears highly significant (t = 7.55) may fool the researcher. The task ahead is to investigate model specification in a systematic manner using simple tests.
REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 508
OLS Estimation
Dependent variable           YNLIN
Adjusted R**2                0.1841644277111127
Standard Error of Estimate   1.883410386043661
Sum of Squared Residuals     876.1669665175114
Model Sum of Squares         202.1314444376089
Total Sum of Squares         1078.298410955120
F( 1, 247)                   56.98282254868781
F Significance               0.9999999999991491
1/Condition of XPX           0.7121866785207466
Number of Observations       249
Durbin-Watson                1.922026461982424
Variable          Coefficient        Std. Error         t
X1       { 0}     0.86152861         0.11412945         7.5486967
CONSTANT { 0}    -0.61128207E-01     0.12059089        -0.50690568
Before beginning the analysis, note that the correct model for ynlin shows coefficients close to their population values.

REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 638
OLS Estimation
Dependent variable           YNLIN
Adjusted R**2                0.7410517802311660
Standard Error of Estimate   1.061084833449131
Sum of Squared Residuals     276.9716518488394
Model Sum of Squares         801.3267591062809
Total Sum of Squares         1078.298410955120
F( 2, 246)                   355.8602142571058
F Significance               1.000000000000000
1/Condition of XPX           0.1334523794627398
Number of Observations       249
Durbin-Watson                1.933912998767355
Variable          Coefficient        Std. Error         t
X1       { 0}     1.0636116          0.64892761E-01     16.390297
X2       { 0}    -1.0233690          0.44360674E-01    -23.069283
CONSTANT { 0}     1.0509212          0.83303169E-01     12.615622
The model for yma shows negative serial correlation (DW = 2.872) even though all variables are in the model.

REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 771
OLS Estimation
Dependent variable           YMA
Adjusted R**2                0.6413612625787487
Standard Error of Estimate   1.437001078383063
Sum of Squared Residuals     505.9181643221510
Model Sum of Squares         922.0176027460155
Total Sum of Squares         1427.935767068167
F( 3, 245)                   148.8345537566244
F Significance               1.000000000000000
1/Condition of XPX           0.1270036486757258
Number of Observations       249
Durbin-Watson                2.871725884170242
Variable          Coefficient        Std. Error         t
X1       { 0}     1.0907591          0.88117138E-01     12.378513
X2       { 0}    -0.94687924         0.60077630E-01    -15.760929
X3       { 0}     0.75276712         0.86803876E-01     8.6720450
CONSTANT { 0}     0.93087875         0.11360868         8.1937292
GLS will be attempted.

Problem Number 1 Subproblem Number 1
F to enter                                     9.999999776482582E-03
F to remove                                    4.999999888241291E-03
Tolerance (1.-R**2) for including a variable   1.000000000000000E-05
Maximum Number of Variables Allowed            4
Internal Number of dependent variable          6
Dependent Variable                             YMA
Standard Error of Y                            2.399542771523700
Degrees of Freedom                             248
Step Number 4     Analysis of Variance for reduction in SS due to variable entering
Variable Entering 8

                              Source           DF    SS       MS       F        F Sig.
Multiple R        0.803554    Due Regression     3   922.02   307.34   148.83   1.000000
Std Error of Y.X  1.43700     Dev. from Reg.   245   505.92   2.0650
R Square          0.645700    Total            248   1427.9   5.7578
Multiple Regression Equation
YMA =
Variable          Coefficient       Std. Error       T Val.   T Sig.    P. Cor.   Elasticity
X1       X- 1     1.090759          0.8811714E-01    12.38    1.00000    0.6203    1.032
X2       X- 2    -0.9468792         0.6007763E-01   -15.76    1.00000   -0.7095   -6.631
X3       X- 3     0.7527671         0.8680388E-01     8.672   1.00000    0.4846    0.7601
CONSTANT X- 8     0.9308788         0.1136087         8.194   1.00000
Adjusted R Square                         0.6413612625787489
-2 * ln(Maximum of Likelihood Function)   883.1529747953530
Akaike Information Criterion (AIC)        893.1529747953530
Scwartz Information Criterion (SIC)       910.7402392776766
Akaike (1970) Finite Prediction Error     2.098144341832704
Generalized Cross Validation              2.098685929466314
Hannan & Quinn (1979) HQ                  2.146406058401069
Shibata (1981)                            2.097078566971383
Rice (1984)                               2.099245495112659
Residual Variance                         2.064972099274085
Order of entrance (or deletion) of the variables = 1 2 3 8
Estimate of Computational Error in Coefficients
1 2 3 4 -0.255313E-15 -0.119643E-15 0.190567E-16 -0.196339E-16
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.77646301E-02
Row 2 Variable X- 2 X2 -0.71499452E-03 0.36093216E-02
Row 3 Variable X- 3 X3 -0.55762009E-03 0.30981359E-04 0.75349130E-02
Row 4 Variable X- 8 CONSTANT -0.28296676E-03 -0.39267339E-02 -0.11633355E-02 0.12906932E-01
Program terminated. All variables put in.
Residual Statistics for Original data
Von Neumann Ratio 1 ... 2.88331   Durbin-Watson TEST..... 2.87173   Von Neumann Ratio 2 ... 2.88331
For D. F. 245 t(.9999)= 3.9556, t(.999)= 3.3307, t(.99)= 2.5960, t(.95)= 1.9697, t(.90)= 1.6511, t(.80)= 1.2850
Skewness test (Alpha 3) = 0.113402 , Peakedness test (Alpha 4)= 2.84927
Normality Test -- Extended grid cell size = 24.90
t Stat   Infin 1.651 1.285 1.039 0.843 0.675 0.525 0.386 0.254 0.126
Cell No.    25    20    29    26    23    30    27    24    21    24
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.900 0.819 0.703 0.598 0.506 0.386 0.277 0.181 0.096

Normality Test -- Small sample grid cell size = 49.80
Cell No.    45    55    53    51    45
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.819 0.598 0.386 0.181

Extended grid normality test - Prob of rejecting normality assumption
Chi= 3.731 Chi Prob= 0.1195   F(8, 245)= 0.466365   F Prob =0.120880
Small sample normality test - Large grid
Chi= 1.703 Chi Prob= 0.3637   F(3, 245)= 0.567604   F Prob =0.363150
Autocorrelation function of residuals
1 2 3 4 5 -0.438023 -0.874684E-01 0.759865E-01 -0.113471 0.809694E-01
F( 83, 83) = 1.121 1/F = 0.8919 Heteroskedasticity at 0.6983 level
Sum of squared residuals 505.9181643221511 Mean squared residual 2.031799856715466
Note the ACF values of -.438, -.087 for the OLS model. GLS is now attempted:
Doing Gen. Least Squares using residual Dif. Eq. of order 1 Lag Coefficients
1 -0.436471
Standard Error of Y 1.806380831086763 Degrees of Freedom 247
Step Number 4     Analysis of Variance for reduction in SS due to variable entering
Variable Entering 3

                              Source           DF    SS       MS        F        F Sig.
Multiple R        0.868408    Due Regression     3   607.80   202.60    249.47   1.000000
Std Error of Y.X  0.901185    Dev. from Reg.   244   198.16   0.81213
R Square          0.754132    Total            247   805.96   3.2630
Multiple Regression Equation
YMA =
Variable          Coefficient       Std. Error       T Val.   T Sig.    P. Cor.   Elasticity
X1       X- 1     1.053314          0.7820969E-01    13.47    1.00000    0.6530    0.9965
X2       X- 2    -0.9569885         0.5047854E-01   -18.96    1.00000   -0.7718   -6.702
X3       X- 3     0.7796353         0.7773981E-01    10.03    1.00000    0.5403    0.7872
CONSTANT X- 8     0.9431791         0.8032375E-01    11.74    1.00000
Adjusted R Square                         0.7511089134685829
-2 * ln(Maximum of Likelihood Function)   648.1547627678506
Akaike Information Criterion (AIC)        658.1547627678506
Scwartz Information Criterion (SIC)       675.7219064986755
Akaike (1970) Finite Prediction Error     0.8252334731172152
Generalized Cross Validation              0.8254482099043910
Hannan & Quinn (1979) HQ                  0.8442731002246441
Shibata (1981)                            0.8248109265359979
Rice (1984)                               0.8256701045844729
Residual Variance                         0.8121345290994816
Order of entrance (or deletion) of the variables = 1 2 8 3
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.61167560E-02
Row 2 Variable X- 2 X2 -0.42933638E-03 0.25480832E-02
Row 3 Variable X- 3 X3 -0.26031711E-02 -0.11668626E-03 0.60434773E-02
Row 4 Variable X- 8 CONSTANT -0.43170371E-04 -0.27722114E-02 -0.41214721E-03 0.64519047E-02
Program terminated. All variables put in.
Residual Statistics for Smoothed Original data
For GLS Y and Y estimate scaled by 1.436471170926504
Von Neumann Ratio 1 ... 2.30772   Durbin-Watson TEST..... 2.29841   Von Neumann Ratio 2 ... 2.30772
For D. F. 244 t(.9999)= 3.9559, t(.999)= 3.3308, t(.99)= 2.5961, t(.95)= 1.9697, t(.90)= 1.6511, t(.80)= 1.2850
Skewness test (Alpha 3) = 0.110831 , Peakedness test (Alpha 4)= 2.81493
Normality Test -- Extended grid cell size = 24.80
t Stat   Infin 1.651 1.285 1.039 0.843 0.675 0.525 0.386 0.254 0.126
Cell No.    21    33    20    25    25    25    34    23    21    21
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.915 0.782 0.702 0.601 0.500 0.399 0.262 0.169 0.085

Normality Test -- Small sample grid cell size = 49.60
Cell No.    54    45    50    57    42
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.782 0.601 0.399 0.169

Extended grid normality test - Prob of rejecting normality assumption
Chi= 8.935 Chi Prob= 0.6522   F(8, 244)= 1.11694   F Prob =0.647735

Small sample normality test - Large grid
Chi= 3.089 Chi Prob= 0.6219   F(3, 244)= 1.02957   F Prob =0.619884
Autocorrelation function of residuals
1 2 3 4 -0.151211 -0.321877 -0.413296E-03 -0.787568E-01
F( 83, 83) = 1.028 1/F = 0.9724 Heteroskedasticity at 0.5505 level
Sum of squared residuals 198.1608251002743 Mean squared residual 0.7990355850817513
The DW, now 2.298, is close to 2.0. The ACF shows a spike at lag 2, so GLS of order 2 is attempted.
Doing Gen. Least Squares using residual Dif. Eq. of order 2 Lag Coefficients
1 2 -0.587393 -0.343873
Standard Error of Y 1.488647529013689 Degrees of Freedom 246
Step Number 4     Analysis of Variance for reduction in SS due to variable entering
Variable Entering 3

                              Source           DF    SS       MS        F        F Sig.
Multiple R        0.906985    Due Regression     3   448.46   149.49    375.65   1.000000
Std Error of Y.X  0.630820    Dev. from Reg.   243   96.698   0.39793
R Square          0.822623    Total            246   545.15   2.2161
Multiple Regression Equation
YMA =
Variable          Coefficient       Std. Error       T Val.   T Sig.    P. Cor.   Elasticity
X1       X- 1     1.064045          0.7171594E-01    14.84    1.00000    0.6894    1.007
X2       X- 2    -0.9423114         0.4273665E-01   -22.05    1.00000   -0.8165   -6.599
X3       X- 3     0.7593212         0.7132141E-01    10.65    1.00000    0.5640    0.7667
CONSTANT X- 8     0.9305895         0.6243077E-01    14.91    1.00000
Adjusted R Square                         0.8204327863250724
-2 * ln(Maximum of Likelihood Function)   469.3198834070857
Akaike Information Criterion (AIC)        479.3198834070857
Scwartz Information Criterion (SIC)       496.8668250902256
Akaike (1970) Finite Prediction Error     0.4043780501040350
Generalized Cross Validation              0.4044841286507808
Hannan & Quinn (1979) HQ                  0.4137361553163259
Shibata (1981)                            0.4041693287529482
Rice (1984)                               0.4045937579438610
Residual Variance                         0.3979337783892296
Order of entrance (or deletion) of the variables = 1 2 8 3
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.51431761E-02
Row 2 Variable X- 2 X2 -0.30211664E-03 0.18264214E-02
Row 3 Variable X- 3 X3 -0.29466037E-02 -0.27120489E-04 0.50867433E-02
Row 4 Variable X- 8 CONSTANT -0.60309115E-05 -0.19976715E-02 -0.29290972E-03 0.38976015E-02
Program terminated. All variables put in.
Residual Statistics for Smoothed Original data
For GLS Y and Y estimate scaled by 1.931266064605667
Von Neumann Ratio 1 ... 2.12282   Durbin-Watson TEST..... 2.11423   Von Neumann Ratio 2 ... 2.12282
For D. F. 243 t(.9999)= 3.9561, t(.999)= 3.3310, t(.99)= 2.5962, t(.95)= 1.9698, t(.90)= 1.6511, t(.80)= 1.2850
Skewness test (Alpha 3) = 0.726358E-01, Peakedness test (Alpha 4)= 2.77604
Normality Test -- Extended grid cell size = 24.70
t Stat   Infin 1.651 1.285 1.039 0.843 0.676 0.525 0.386 0.254 0.126
Cell No.    24    29    23    25    23    24    27    30    25    17
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.903 0.785 0.692 0.591 0.498 0.401 0.291 0.170 0.069

Normality Test -- Small sample grid cell size = 49.40
Cell No.    53    48    47    57    42
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.785 0.591 0.401 0.170

Extended grid normality test - Prob of rejecting normality assumption
Chi= 4.781 Chi Prob= 0.2193   F(8, 243)= 0.597672   F Prob =0.220557

Small sample normality test - Large grid
Chi= 2.696 Chi Prob= 0.5592   F(3, 243)= 0.898785   F Prob =0.557561
Autocorrelation function of residuals
1 2 3 4 5 -0.645124E-01 -0.148837 -0.240427 -0.992257E-01 0.728148E-01
F( 82, 82) = 1.162 1/F = 0.8609 Heteroskedasticity at 0.7505 level
Sum of squared residuals 96.69790814858062 Mean squared residual 0.3914895066744155
The DW is now 2.114. GLS of orders 3 and 4 was attempted with little gain in the DW, but the GLS lag coefficients show the slow decline that would be expected in view of (7-9).
Doing Gen. Least Squares using residual Dif. Eq. of order 3
Lag           1          2          3
Coefficient  -0.648299  -0.447794  -0.176326
Standard Error of Y 1.356570786867690 Degrees of Freedom 245
.............Step Number 4
Analysis of Variance for reduction in SS due to variable entering
Variable Entering      3

Source           DF    SS        MS        F        F Sig.
Due Regression    3    384.23    128.08    465.10   1.000000
Dev. from Reg.  242    66.641    0.27538
Total           245    450.87    1.8403

Multiple R        0.923144
Std Error of Y.X  0.524763
R Square          0.852194
Multiple Regression Equation
Variable          Coefficient   Std. Error       T Val.   T Sig.    P. Cor.   Elasticity
YMA      =
X1       X- 1      1.027797     0.7160766E-01     14.35   1.00000    0.6781    0.9723
X2       X- 2     -0.9485962    0.3945060E-01    -24.05   1.00000   -0.8396   -6.643
X3       X- 3      0.7885052    0.7105830E-01     11.10   1.00000    0.5807    0.7962
CONSTANT X- 8      0.9428557    0.5552124E-01     16.98   1.00000
Adjusted R Square                          0.8503620346592192
-2 * ln(Maximum of Likelihood Function)    376.8392476702688
Akaike Information Criterion (AIC)         386.8392476702688
Scwartz Information Criterion (SIC)        404.3659053499306
Akaike (1970) Finite Prediction Error      0.2798540632805742
Generalized Cross Validation               0.2799280742725161
Hannan & Quinn (1979) HQ                   0.2863502020183029
Shibata (1981)                             0.2797084481582168
Rice (1984)                                0.2800045730288931
Residual Variance                          0.2753763982680850
Order of entrance (or deletion) of the variables = 1 2 8 3
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.51276574E-02
Row 2 Variable X- 2 X2 -0.23263544E-03 0.15563500E-02
Row 3 Variable X- 3 X3 -0.33414602E-02 0.11967882E-04 0.50492818E-02
Row 4 Variable X- 8 CONSTANT -0.35430730E-04 -0.17098083E-02 -0.26379732E-03 0.30826085E-02
Program terminated. All variables put in.
Residual Statistics for Smoothed Original data
For GLS Y and Y estimate scaled by 2.272418943212623
Von Neumann Ratio 1 ... 2.12109    Durbin-Watson TEST..... 2.11247    Von Neumann Ratio 2 ... 2.12109
For D. F. 242 t(.9999)= 3.9564, t(.999)= 3.3312, t(.99)= 2.5963, t(.95)= 1.9698, t(.90)= 1.6512, t(.80)= 1.2851
Skewness test (Alpha 3) = 0.628738E-01, Peakedness test (Alpha 4)= 2.66896
Normality Test -- Extended grid, cell size = 24.60
t Stat   Infin 1.651 1.285 1.039 0.843 0.676 0.525 0.386 0.254 0.126
Cell No.    23    30    24    24    23    30    22    28    25    17
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.907 0.785 0.687 0.589 0.496 0.374 0.285 0.171 0.069
Normality Test -- Small sample grid, cell size = 49.20
Cell No.    53    48    53    50    42
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.785 0.589 0.374 0.171
Extended grid normality test - Prob of rejecting normality assumption
Chi= 5.707  Chi Prob= 0.3200  F(8, 242)= 0.713415  F Prob = 0.320389
Small sample normality test - Large grid
Chi= 1.683  Chi Prob= 0.3593  F(3, 242)= 0.560976  F Prob = 0.358735
Autocorrelation function of residuals
Lag      1              2          3          4          5              6
ACF     -0.577779E-01  -0.116476  -0.153224  -0.211519   0.509156E-01   0.347658E-02
F( 82, 82) = 1.185 1/F = 0.8435 Heteroskedasticity at 0.7786 level
Sum of squared residuals 66.64108838087667 Mean squared residual 0.2708987332555962
Doing Gen. Least Squares using residual Dif. Eq. of order 4
Lag           1          2          3          4
Coefficient  -0.694654  -0.564775  -0.345213  -0.259654
Standard Error of Y 1.203012334750675 Degrees of Freedom 244
.............Step Number 4
Analysis of Variance for reduction in SS due to variable entering
Variable Entering      3

Source           DF    SS        MS        F        F Sig.
Due Regression    3    314.11    104.70    646.68   1.000000
Dev. from Reg.  241    39.020    0.16191
Total           244    353.13    1.4472

Multiple R        0.943134
Std Error of Y.X  0.402377
R Square          0.889502
Multiple Regression Equation
Variable          Coefficient   Std. Error       T Val.   T Sig.    P. Cor.   Elasticity
YMA      =
X1       X- 1      1.036793     0.6936194E-01     14.95   1.00000    0.6936    0.9808
X2       X- 2     -0.9666933    0.3536745E-01    -27.33   1.00000   -0.8695   -6.770
X3       X- 3      0.7804025    0.6903331E-01     11.30   1.00000    0.5887    0.7880
CONSTANT X- 8      0.9634053    0.4764657E-01     20.22   1.00000
Adjusted R Square                          0.8881265220663394
-2 * ln(Maximum of Likelihood Function)    245.1681832743955
Akaike Information Criterion (AIC)         255.1681832743955
Scwartz Information Criterion (SIC)        272.6744743271191
Akaike (1970) Finite Prediction Error      0.1645510140428232
Generalized Cross Validation               0.1645948877321811
Hannan & Quinn (1979) HQ                   0.1683823670820125
Shibata (1981)                             0.1644646992743718
Rice (1984)                                0.1646402423899563
Residual Variance                          0.1619076242590027
Order of entrance (or deletion) of the variables = 1 2 8 3
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.48110784E-02
Row 2 Variable X- 2 X2 -0.16111651E-03 0.12508569E-02
Row 3 Variable X- 3 X3 -0.35012367E-02 0.74455260E-04 0.47655984E-02
Row 4 Variable X- 8 CONSTANT -0.40532334E-04 -0.13901832E-02 -0.26943072E-03 0.22701954E-02
Program terminated. All variables put in.
Residual Statistics for Smoothed Original data
For GLS Y and Y estimate scaled by 2.864295541231388
Von Neumann Ratio 1 ... 2.11504    Durbin-Watson TEST..... 2.10640    Von Neumann Ratio 2 ... 2.11504
For D. F. 241 t(.9999)= 3.9567, t(.999)= 3.3314, t(.99)= 2.5964, t(.95)= 1.9699, t(.90)= 1.6512, t(.80)= 1.2851
Skewness test (Alpha 3) = -.879541E-01, Peakedness test (Alpha 4)= 2.53942
Normality Test -- Extended grid, cell size = 24.50
t Stat   Infin 1.651 1.285 1.039 0.843 0.676 0.525 0.386 0.254 0.126
Cell No.    22    30    30    20    21    22    27    23    22    28
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.910 0.788 0.665 0.584 0.498 0.408 0.298 0.204 0.114
Normality Test -- Small sample grid, cell size = 49.00
Cell No.    52    50    43    50    50
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.788 0.584 0.408 0.204
Extended grid normality test - Prob of rejecting normality assumption
Chi= 5.408  Chi Prob= 0.2868  F(8, 241)= 0.676020  F Prob = 0.287518
Small sample normality test - Large grid
Chi= 0.9796  Chi Prob= 0.1938  F(3, 241)= 0.326531  F Prob = 0.193819
Autocorrelation function of residuals
Lag      1              2              3              4          5          6              7
ACF     -0.580007E-01  -0.654729E-01  -0.913505E-01  -0.110200  -0.165341  -0.324259E-01  -0.258273E-01
F( 82, 82) = 1.215 1/F = 0.8233 Heteroskedasticity at 0.8097 level
Sum of squared residuals 39.01973744641884 Mean squared residual 0.1592642344751790
Gen. Least Squares ended by max. order reached.
The classic MA residual ACF pattern is shown in Figure 7.1. There is one ACF spike, but the PACF suggests a longer AR model, which was shown to be captured by the GLS model above.
[Figure: residual plot for the model yma = f(x1, x2, x3) over observations 20-240, with the ACF and PACF of the residual series (lags 1-12) plotted below it.]
Figure 7.1 Analysis of residuals of the YMA model.
Remark: A low-order autoregressive structure in the error term is usually easily captured by a GLS model. However, a simple MA residual structure, which might occur in an overshooting situation, often requires a high-order GLS model to clean the residual. The problem is that with maximum likelihood GLS the autoregressive parameters are often hard to estimate because they are related, as is seen in (7-9).
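The remark can be made concrete with a short numerical sketch (Python with numpy; the MA parameter value is assumed for illustration, not taken from the text's model). An invertible MA(1) error e_t = u_t + theta*u_{t-1} has the infinite autoregressive representation e_t = sum_j a_j e_{t-j} + u_t with a_j = -(-theta)**j, so the GLS "clean-up" needs many slowly declining lag coefficients:

```python
import numpy as np

# Hedged sketch: implied AR coefficients for an assumed MA(1) parameter.
# For theta = -0.7 the a_j are all negative and decline only geometrically,
# which is why a high-order GLS model is needed to clean an MA residual.
theta = -0.7
a = np.array([-(-theta) ** j for j in range(1, 9)])
print(np.round(a, 4))
assert all(abs(a[j]) > abs(a[j + 1]) for j in range(len(a) - 1))
```

The slow geometric decline of these implied lag coefficients mirrors the slowly declining GLS lag coefficients reported in the output above.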
Recall that the models estimated earlier produced biased coefficients for the constant and x1, and for the constant, x1 and x3, respectively. How might one test such a model for an excluded variable (x2) that is related to an included variable (x1)? One way to proceed is to sort the data with respect to one variable (x1 in the example to be shown) and inspect the Durbin-Watson statistic. Nonlinearity will be reflected in a low DW. This approach applies time series methods to cross section data. This is shown next.
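A minimal simulation of this sorting idea, using an invented data-generating process rather than the text's dataset, shows the Durbin-Watson statistic dropping well below 2 when an omitted variable x2 is related to the included, sorted variable x1:

```python
import numpy as np

# Hedged simulation: x2 is related to x1 but omitted from the regression.
# After sorting the sample by x1, the omitted-variable problem shows up as
# serially correlated residuals and hence a low Durbin-Watson statistic.
rng = np.random.default_rng(0)
n = 250
x1 = rng.normal(size=n)
x2 = x1 ** 2 + rng.normal(size=n)        # related to x1, omitted below
y = 1.0 + x1 + x2 + rng.normal(size=n)

order = np.argsort(x1)                   # sort everything against x1
X = np.column_stack([np.ones(n), x1])[order]
ys = y[order]
b, *_ = np.linalg.lstsq(X, ys, rcond=None)
e = ys - X @ b
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(round(dw, 3))
assert dw < 1.6                          # misspecification lowers the DW
```

With the sample left unsorted, the same regression would give a DW near 2, which is why the sort is the essential step.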
REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 508
OLS Estimation
Dependent variable            Y
Adjusted R**2                 0.1749360075645995
Standard Error of Estimate    2.074253035297189
Sum of Squared Residuals      1062.723836646582
Model Sum of Squares          230.5411368391754
Total Sum of Squares          1293.264973485757
F( 1, 247)                    53.58274542797663
F Significance                0.9999999999965213
1/Condition of XPX            0.7121866785207469
Number of Observations        249
Durbin-Watson                 0.9304383379985703

Variable       Coefficient       Std. Error      t
X1       { 0}   0.92008294       0.12569399      7.3200236
CONSTANT { 0}   0.58811535E-01   0.13281015      0.44282410
REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 638
OLS Estimation
Dependent variable            Y
Adjusted R**2                 0.3171457852038800
Standard Error of Estimate    1.887043512405973
Sum of Squared Residuals      875.9895715575144
Model Sum of Squares          417.2754019282424
Total Sum of Squares          1293.264973485757
F( 2, 246)                    58.59073681198955
F Significance                1.000000000000000
1/Condition of XPX            0.6446864807753573
Number of Observations        249
Durbin-Watson                 0.7352270766892632

Variable       Coefficient       Std. Error       t
X1       { 0}   0.85966646       0.11465356       7.4979481
X3       { 0}   0.82544163       0.11398725       7.2415260
CONSTANT { 0}  -0.64942535E-01   0.12202611      -0.53220196
REG Command. Version 1 February 1997
Real*8 space available 8000000   Real*8 space used 508

OLS Estimation
Dependent variable            YNLIN
Adjusted R**2                 0.1841644277111131
Standard Error of Estimate    1.883410386043660
Sum of Squared Residuals      876.1669665175107
Model Sum of Squares          202.1314444376096
Total Sum of Squares          1078.298410955120
F( 1, 247)                    56.98282254868798
F Significance                0.9999999999991508
1/Condition of XPX            0.7121866785207469
Number of Observations        249
Durbin-Watson                 0.7345600764854415

Variable       Coefficient       Std. Error       t
X1       { 0}   0.86152861       0.11412945       7.5486967
CONSTANT { 0}  -0.61128207E-01   0.12059089      -0.50690568
The Durbin-Watson tests for the three models were .9304, .7352 and .7346, respectively. The above results show how the Durbin-Watson test, which was developed for time series models, can be used effectively in cross section models to test for equation misspecification. The results suggest that if a nonlinearity is suspected, the data should be sorted against each suspected variable in turn and the recursive coefficients analyzed. The recursively estimated coefficients for x1 and x3, when the data was sorted against x1, are displayed in Figure 7.2. The omitted variable bias is clearly shown by the movement in the x1 coefficient as higher and higher values of x1 are added to the sample.
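The recursive-coefficient idea can be sketched as follows, again on invented data with an omitted variable related to x1; the x1 coefficient drifts as higher x1 values enter the sorted sample:

```python
import numpy as np

# Hedged sketch: recursive OLS on data sorted against x1.  The coefficient
# on x1 is re-estimated as each observation enters; drift in that
# coefficient signals the omitted-variable / nonlinearity problem.
rng = np.random.default_rng(1)
n = 250
x1 = np.sort(rng.normal(size=n))         # data sorted against x1
x2 = x1 ** 2 + rng.normal(size=n)        # omitted, related to x1
y = 1.0 + x1 + x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])

coefs = []
for t in range(20, n + 1):               # start from a small subsample
    b, *_ = np.linalg.lstsq(X[:t], y[:t], rcond=None)
    coefs.append(b[1])
coefs = np.array(coefs)
print(round(coefs[0], 2), round(coefs[-1], 2))
assert abs(coefs[-1] - coefs[0]) > 0.5   # the x1 coefficient moves
```

Plotting `coefs` against the subsample size reproduces the kind of drift displayed in Figure 7.2.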
[Figure: recursively estimated coefficients X1_COEF and X3_COEF plotted against observation number, titled "Omitted Variable Bias x1 and x3 coef".]
Figure 7.2 Recursively estimated X1 and X3 coefficients for X1 Sorted Data
[Figure: plot of the CUSUM test against observation number, with upper and lower 10%, 5% and 1% significance bounds (U10/U5/U1, L10/L5/L1).]
Figure 7.3 CUSUM test of the model estimated with sorted data
Figures 7.3-7.5 show, respectively, the CUSUM, CUSUMSQ and Quandt likelihood ratio tests. Further detail on these tests is contained in Stokes, Specifying and Diagnostically Testing Econometric Models (1997; see also the third edition drafts), Chapter 9. Here we only sketch their use.
Brown, Durbin and Evans (1975) proposed the CUSUM test as a summary measure of whether there is parameter stability. The test consists of plotting the quantity

   W_t = (1/s) * sum_{j=k+1}^{t} w_j,   t = k+1, ..., T,                 (7-11)

where w_j is the normalized recursive residual, s is the standard error of the regression fitted to all T observations, and k is the number of estimated coefficients. The CUSUM test is particularly good at detecting systematic departure of the coefficients, while the CUSUMSQ test is useful when the departure of the coefficients from constancy is haphazard rather than systematic. The CUSUMSQ test involves a plot of s_t, defined as

   s_t = sum_{j=k+1}^{t} w_j^2 / sum_{j=k+1}^{T} w_j^2,   t = k+1, ..., T.      (7-12)

Approximate bounds for W_t and s_t are given in Brown, Durbin and Evans (1975). Assuming a rectangular plot, the upper-right-hand value is 1.0 and the lower-left-hand value is 0.0. A regression with stable coefficients will generate a plot up the diagonal. If the plot lies above the diagonal, the implication is that the regression is tracking poorly in the early subsample in comparison with the total sample. A plot below the diagonal suggests the reverse, namely, that the regression is tracking better in the early subsample than in the complete sample.
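A hedged sketch of the CUSUM and CUSUMSQ quantities in (7-11) and (7-12), computed from one-step-ahead recursive residuals on simulated stable data:

```python
import numpy as np

# Hedged sketch: w_t are normalized recursive residuals (one-step-ahead
# prediction errors scaled by their forecast variance factor).  With stable
# coefficients, CUSUM wanders near zero and CUSUMSQ climbs the diagonal.
rng = np.random.default_rng(2)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

w = []
for t in range(k, n):
    b, *_ = np.linalg.lstsq(X[:t], y[:t], rcond=None)
    f = 1.0 + X[t] @ np.linalg.inv(X[:t].T @ X[:t]) @ X[t]
    w.append((y[t] - X[t] @ b) / np.sqrt(f))   # normalized recursive residual
w = np.array(w)
s = np.sqrt(np.sum(w ** 2) / (n - k))
cusum = np.cumsum(w) / s                       # CUSUM, as in (7-11)
wsq = np.cumsum(w ** 2)
cusumsq = wsq / wsq[-1]                        # CUSUMSQ, as in (7-12)
print(round(cusum[-1], 2), round(cusumsq[-1], 2))
assert cusumsq[-1] == 1.0                      # CUSUMSQ ends at 1.0 by construction
```

Plotting `cusum` against the significance bounds, and `cusumsq` against the diagonal, reproduces the displays in Figures 7.3 and 7.4.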
The Quandt log-likelihood ratio test involves the calculation of log lambda_i, defined as

   log lambda_i = (i/2) log s1^2 + ((T-i)/2) log s2^2 - (T/2) log s^2,      (7-13)

where s1^2, s2^2 and s^2 are the variances of regressions fitted to the first i observations, the last T - i observations and the whole T observations, respectively. The minimum of the plot of log lambda_i can be used to select the "break" in the sample. Although no specific significance tables are available for log lambda_i, the information suggested by the plot can be tested with the multiperiod Chow test, which is discussed next.
If structural change is suspected, a homogeneity test (Chow) over n equal segments can be performed. Given that S is the residual sum of squares from a regression calculated from all T observations and S_j is the residual sum of squares from a regression fitted to the j-th segment alone, the appropriate statistic is distributed as F(k(n-1), T-nk) and defined as

   F = [ (S - sum_{j=1}^{n} S_j) / (k(n-1)) ] / [ sum_{j=1}^{n} S_j / (T-nk) ],      (7-14)

where the numerator degrees of freedom are

   k(n-1),                                                               (7-15)

and the denominator degrees of freedom are

   T - nk,                                                               (7-16)

with k the number of estimated coefficients.
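A sketch of the homogeneity test on simulated data with stable coefficients, splitting the sample into four equal segments; the F statistic stays small, as expected under the null of no structural change (all data here are invented):

```python
import numpy as np

# Hedged sketch of the multi-segment Chow (homogeneity) test: compare the
# pooled residual sum of squares with the sum over separately fitted
# segments.  Under stable coefficients F ~ F(k(n-1), T-nk).
rng = np.random.default_rng(4)
T, k, nseg = 240, 2, 4
x = rng.normal(size=T)
y = 1.0 + 0.5 * x + rng.normal(size=T)   # stable model: no structural change
X = np.column_stack([np.ones(T), x])

def rss(Xs, ys):
    b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    e = ys - Xs @ b
    return e @ e

S = rss(X, y)                            # pooled RSS
seg = T // nseg
S_i = sum(rss(X[j * seg:(j + 1) * seg], y[j * seg:(j + 1) * seg])
          for j in range(nseg))
F = ((S - S_i) / (k * (nseg - 1))) / (S_i / (T - nseg * k))
print(round(F, 3))
assert 0 < F < 5                         # no evidence of change at usual levels
```

Re-running the same sketch on data with an actual break (as in the Quandt example) drives F far above the F(6, 232) critical values.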
[Figure: plot of the CUSUMSQ test against observation number, running from 0 to 1.]
Figure 7.4 CUSUMSQ test of the y model estimated with sorted data.
[Figure: plot of the Quandt likelihood ratio against observation number.]
Figure 7.5 Quandt Likelihood Ratio tests of y model estimated with sorted data.
Remark: If an inadvertently excluded variable is correlated with an included variable, substantial bias in the estimated coefficients can occur. In cross section analysis, if the data is sorted by the included variables, what are usually thought of as time series techniques can be used to determine the nature of the problem. For more complex models, "automatic" techniques such as GAM, MARS and ACE can be employed. These are far too complex to discuss in this introductory analysis.
8. Advanced concepts

A problem with simple OLS models is that there may be situations where the estimated coefficients, or the estimated standard errors, are biased. While space precludes a detailed treatment, some of the problems and their solutions are outlined below.
_________________________________________________________
Table Five
Some Problems and Their Solution

Problem                                      Solution
Y a 0-1 variable.                            PROBIT, LOGIT
Y a bounded variable.                        TOBIT
X's not independent (i.e., X's not           2SLS, 3SLS, LIML, FIML
  orthogonal to e in the population).
Relationship not linear.                     Reparameterize model and/or NLS,
                                             MARS, GAM, ACE
Error not random.                            GLS, weighted least squares
Coefficients changing from changing          Recursive Residual Analysis
  population.
Time series problems.                        ARIMA model, transfer function
                                             model, vector model
Outlier problems.                            L1 & MINIMAX estimation
___________________________________________________________
The 0-1 left-hand variable problem arises if there are only two states for Y. For example, if Y is coded 0 = alive, 1 = dead, then a regression model that predicts more than dead (YHAT > 1) or less than alive (YHAT < 0) is clearly not using all of the information at hand. While the coefficients of an OLS model can be interpreted as partial derivatives, in the 0-1 case this interpretation breaks down. Assume that you have a number of variables and that high values are associated with a high probability of death before 45 years of age. Clearly, since you cannot be more than dead, if all variables are high, an additional unit of one of them will not have the same effect as when all variables were low. For such problems, the appropriate procedure is LOGIT or PROBIT analysis. Due to space and time limitations, these techniques are not illustrated.
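A hedged illustration of the point: on simulated data (all values invented), OLS fitted values for a 0-1 outcome escape the unit interval, while a logit fit by Newton-Raphson maximum likelihood keeps every fitted probability strictly between 0 and 1:

```python
import numpy as np

# Hedged sketch comparing the "linear probability" OLS fit with a logit
# fit for a 0-1 left-hand variable.  Data-generating values are assumed.
rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n, scale=2.0)
p = 1.0 / (1.0 + np.exp(-(0.25 + 1.0 * x)))      # true probabilities
yb = (rng.uniform(size=n) < p).astype(float)     # 0-1 outcome
X = np.column_stack([np.ones(n), x])

# OLS fitted values can leave [0, 1]
b_ols, *_ = np.linalg.lstsq(X, yb, rcond=None)
fit_ols = X @ b_ols

# Logit by Newton-Raphson maximum likelihood
b = np.zeros(2)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ b))
    W = mu * (1.0 - mu)
    b = b + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (yb - mu))
fit_logit = 1.0 / (1.0 + np.exp(-X @ b))

print(fit_ols.min() < 0 or fit_ols.max() > 1)    # OLS leaves the unit interval
assert fit_logit.min() > 0 and fit_logit.max() < 1
```

The logit coefficients are no longer constant partial derivatives: the marginal effect of x varies with the fitted probability, exactly the "cannot be more than dead" point made above.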
A left-hand variable can be bounded on the upper or lower side. Examples of the former include scores on tests; of the latter, money spent on cars. Assume a model where the score on a test (S) is a function of a number of variables, such as study time (ST), age (A), health (H) and experience (E). Clearly, one is going to run into diminishing returns regarding study time. If the number of hours were increased from 200 to 210, the increase in the score would not be the same as if the hours had been increased from 0 to 10. Such problems require TOBIT procedures, which are not illustrated here due to space and time constraints. If an OLS model were fit to such data, the coefficient for the study time variable would understate the effect of study time on exam scores for relatively low total study hours and overstate it for relatively high total study hours.
An important assumption of OLS is that the right-hand variables are independent of the error term. By this we mean that a right-hand variable can be changed without inducing an offsetting change elsewhere in the system. On the other hand, if the system is of the form

   y1 = a0 + a1 y2 + a2 x + e1,                                          (8-1)
   y2 = b0 + b1 y1 + b2 z + e2,                                          (8-2)

then one cannot use a1 as a measure of how y1 will change for a one-unit change in y2, since there will be a feedback effect on y2 from the change in y1 in the second equation, which will occur as y2 changes. Such problems require two-stage least squares (2SLS) or limited information maximum likelihood (LIML) estimation. In addition, if the possible relationship between the error terms e1 and e2 is taken into consideration, three-stage least squares (3SLS) and/or full information maximum likelihood (FIML) estimation procedures should be used. These more advanced techniques will not be discussed further here except to say that the appropriate procedures are available.
In OLS estimation, there is always the danger that an estimated linear model is being used to capture a nonlinear process. Over a short data range, a nonlinear process can look like a linear process. In the preliminary research on how to kill live poliovirus in order to make a killed vaccine, a graph was used to show that increased percentages of the virus were killed as more heat was applied. A straight line was fit and the appropriate temperature was selected. Much to the surprise of the researchers, it was later determined that the relationship was not linear; in fact, proportionately more heat was required the lower the percentage of live virus remaining. Poor statistical methodology resulted in people unexpectedly getting polio from the vaccine.
One way to determine if the relationship is nonlinear is to put power and interaction terms in the regression. The problem is that it is easy to exhaust available CPU time and researcher time before all possibilities have been tested. The recursive residual procedure, which involves starting from a model estimated on a small sample and recursively re-estimating as observations are added, provides a way to detect if there are problems in the initial specification. More detail on this approach is provided in Stokes, Specifying and Diagnostically Testing Econometric Models (1997), Chapter 9.
A brief introduction was given in section 7. The essential idea is that if the data set is sorted against one of the right-hand variables and regressions are run, adding one observation at a time, a plot or list of the coefficients will indicate whether they are stable for different ranges of the sorted variable. If other coefficients change, it indicates the need for interaction terms. If the sorted variable coefficient changes, it indicates that there is a nonlinear relationship. If time is the variable for which the sort is made, it suggests that over time the coefficients are shifting.
The OLS model for time series data can be shown to be a special case of the more general ARIMA, transfer function and vector autoregressive moving average models. A preliminary look at some of these models and their uses is presented in the paper "The Box-Jenkins Approach-When Is It a Cost-effective Alternative," which I wrote with Hugh Neuburger in the Columbia Journal of World Business (Vol. XI, No. 4, Winter 1976). As noted, if we were to write the model as
   y_t = [ω(B)/δ(B)] x_t + [θ(B)/φ(B)] e_t,                              (8-3)

where B is the lag operator, a more complex lag structure can be modeled than in the simple OLS case. If ω(B) = 0, then we have an ARIMA model and are modeling y_t as a function of past shocks alone. If θ(B)/φ(B) = 1, then we have a rational distributed lag model. If neither restriction holds, we have a transfer function model. Systems of transfer function type models can be estimated using simultaneous transfer function estimation techniques or vector model estimation procedures. Space limits more comprehensive discussion of this important and general class of models beyond this brief treatment.
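A simple special case of (8-3) can be simulated directly; here the noise term is white, giving a rational distributed lag with geometrically declining weights on past x (all parameter values are assumed for illustration):

```python
import numpy as np

# Hedged sketch of a rational distributed lag, a special case of (8-3):
# y_t = [omega/(1 - delta*B)] x_t + e_t, where B is the lag operator.
# The implied weights on lagged x decline geometrically as omega*delta**j.
rng = np.random.default_rng(7)
n, omega, delta = 400, 1.0, 0.6
x = rng.normal(size=n)
e = rng.normal(size=n, scale=0.5)
ystar = np.zeros(n)
for t in range(1, n):
    ystar[t] = delta * ystar[t - 1] + omega * x[t]   # (1 - delta*B) y* = omega*x
y = ystar + e

weights = omega * delta ** np.arange(5)              # implied lag weights
print(np.round(weights, 3))
assert abs(np.corrcoef(y, x)[0, 1]) > 0.5
```

Setting omega to zero and giving e its own ARMA structure recovers the pure ARIMA case, in which y_t depends on past shocks alone.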
9. Summary
These short notes have attempted to outline the scope of elementary applied statistics. Students are encouraged to experiment with the sample data sets to perform further analysis.
* Editorial assistance was provided by Diana A. Stokes. Important suggestions were made by Evelyn Lehrer on a prior draft. I am responsible for any errors or omissions.