Econometric Notes
30 December 2014
Houston H. Stokes
Department of Economics
University of Illinois at Chicago
[email protected]
An Overview of Econometrics
  Objective of Notes
  1. Purpose of statistics
  2. Role of statistics
  3. Basic Statistics
  4. More complex setup to illustrate Matlab to estimate the Model
  5. Review of Linear Algebra and Introduction to Programming Regression Calculations
     Figure 5.1 X'X for a random Matrix X
     Figure 5.2 3D plot of 50 by 50 X'X matrix where X is a random matrix
  6. A Sample Multiple Input Regression Model Dataset
     Figure 6.1 2-D Plots of Textile Data
     Figure 6.2 3-D Plot of Theil (1971) Textile Data
  7. Advanced Regression analysis
     Figure 7.1 Analysis of residuals of the YMA model
     Figure 7.2 Recursively estimated X1 and X3 coefficients for X1 Sorted Data
     Figure 7.3 CUSUM test of model estimated with sorted data
     Figure 7.4 CUMSQ test of y model estimated with sorted data
     Figure 7.5 Quandt Likelihood Ratio tests of y model estimated with sorted data
  8. Advanced concepts
  9. Summary
Objective of Notes
The objective of these notes is to introduce students to the basics of applied regression calculation using Stata setups of a number of very simple models. Computer code is shown to allow students to "get going" ASAP. More advanced sections show Matlab code to make the calculations. The notes are organized around the estimation of regression models and the use of basic statistical concepts. The textbooks Introduction to Econometrics by Christopher Dougherty (4th Edition, Oxford, 2011) or
Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge 5th Edition, South-Western Cengage 2013 can be used to provide added information. A number of examples from this book will be shown. Statistical analysis will be treated, both as a means by which the data can be summarized, and as a means by which it is possible to accept or reject a specific hypothesis. Four simple datasets are initially discussed:
- The Price vs Age of Cars dataset illustrates a simple 2 variable OLS model where graphics and correlation analysis can be used to detect relationships.
- The Theil (1971) Textile data set illustrates the use of log transformations and contrasts 2D and 3D graphic analysis of data. A variable with a low correlation was shown to enter an OLS model only in the presence of another variable.
- The Brownlee (1965) Stack Loss data set illustrates how in a multiple regression context, variables with "significant" correlation may not enter a full model.
- The Brownlee (1965) Stress data set illustrates the dangers of relying on correlation analysis.
Finally a number of statistical problems and procedures that might be used are discussed.
1. Purpose of statistics
- Summarize data
- Test models
- Allow one to generalize from a sample to the wider population.
2. Role of statistics
Quote from Stanley (1856) in a presidential address to Section F of the British Association for the Advancement of Science:
"The axiom on which ....(statistics) is based may be stated thus: that the laws by which nature is governed, and more especially those laws which operate on the moral and physical condition of the human race, are consistent, and are, in all cases best discoverable - in some cases only discoverable - by the investigation and comparison of phenomena extending over a very large number of individual instances. In dealing with MAN in the aggregate, results may be calculated with precision and accuracy of a mathematical problem... This then is the first characteristic of statistics as a science: that it proceeds wholly by the accumulation and comparison of registered facts; - that from these facts alone, properly classified, it seeks to deduce general principles, and that it rejects all a priori reasoning, employing hypothesis, if at all, only in a tentative manner, and subject to future verification"
(Note: underlining entered by H. H. Stokes)
3. Basic Statistics
Key concepts:
- Mean: x̄ = (1/N) Σ x_i
- Median = middle data value
- Mode = data value with most cases
- Population Variance: σ² = (1/N) Σ (x_i − μ)²
- Sample Variance: s² = (1/(N−1)) Σ (x_i − x̄)²
- Population Standard Deviation: σ
- Sample Standard Deviation: s
- Confidence Interval with k% => a range of data values
- Correlation: r
- Regression: y = Xβ + e, where X is an N by K matrix of explanatory variables
- Percentile
- Quartile
- Z score
- t test
- SE of the mean
- Central Limit Theorem
Statistics attempts to generalize about a population from a sample. For the purposes of this discussion assume the population is all men in the US. A 1/1000 sample from this population would be a randomly selected sample of men such that the sample contained only one male for every 1000 in the population. The task of statistics is to be able to draw meaningful generalizations from the sample about the population. It is costly, and often impossible, to examine all the measurements in the population of interest. A sample must be selected in such a manner that it is representative of the population.
In a famous example of the potential for problems in sample selection, during the depression in the 1932 presidential election the Literary Digest attempted to sample the electorate. A staff was selected and numbers to call were randomly selected from the phone book in New York. In each call the question was asked "Who will you vote for, Mr. Roosevelt or President Hoover?" Those called, for the most part, supported President Hoover being re-elected. When Mr. Roosevelt won the election, the question was asked: what went wrong in the sampling process? The assumption that
those who had phones were representative of the population of voters proved to be the problem. Those without phones in that period disproportionately went for Mr. Roosevelt, biasing the results of the study.
In summary, statistics allows us to use the information contained in a representative sample to correctly make inferences about the population. For example if one were interested in ascertaining how long the light bulbs produced by a certain company last, one could hardly test them all. Sampling would be necessary. The bootstrap can be used to test the distribution of statistics estimated from a sample whose distribution is not known.
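The bootstrap idea mentioned above can be sketched in a few lines. The following Python sketch (these notes otherwise use Stata and Matlab; the light-bulb lifetimes are a hypothetical data vector) resamples the data with replacement and reads a percentile confidence interval off the resampled statistics:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, reps=2000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for a statistic whose sampling
    distribution is unknown: resample with replacement, recompute the
    statistic, and take the empirical alpha/2 and 1-alpha/2 percentiles."""
    rng = random.Random(seed)
    boot = sorted(stat(rng.choices(data, k=len(data))) for _ in range(reps))
    lo = boot[int((alpha / 2) * reps)]
    hi = boot[int((1 - alpha / 2) * reps) - 1]
    return lo, hi

# Hypothetical light-bulb lifetimes in hours.
data = [1100, 980, 1210, 1050, 995, 1150, 1020, 1075, 990, 1130]
lo, hi = bootstrap_ci(data)
print(lo, statistics.mean(data), hi)
```

The interval brackets the sample mean without any distributional assumption, which is exactly why the bootstrap is useful when the distribution of the statistic is not known.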
In addition to sampling correctly, it is important to be able to detect a shift in the underlying population. The usual practice is to draw a sample from the population to be able to make inferences about the underlying population. If the population is shifting, such samples will give biased information. For example assume a reservoir. If a rain comes and adds to and stirs up the water in the reservoir, samples of water would have to be taken more frequently than if there had been no rain and there was no change in water usage. The interesting question is how do you know when to start increasing the sampling rate? A possible approach would be to increase the sampling rate when the water quality of previous samples begins to fall outside normal ranges for the focus variable. In this example, it is not possible to use the population of all the water in the reservoir to test the water. A number of key concepts are listed next.
Measures of Central Tendency. The mean is a measure of central tendency. Assume a vector x containing N observations. The mean is defined as
x̄ = (1/N) Σ x_i    (3-1)

Assuming x_i = (1 2 3 4 5 6 7 8 9), then N = 9 and x̄ = 5. The mean is often written as x̄ or E(x), the expected value of x. The problem with the mean as a measure of central tendency is that it is affected by all observations. If instead of making x_9 = 9, we make x_9 = 99, then x̄ = 15, which is bigger than all x_i values except x_9. The median, defined as the middle term of an odd number of terms or the average of the two middle terms when the terms have been arranged in increasing order, is not affected by outlier terms. In the above example the median is 5 no matter whether x_9 = 9 or x_9 = 99. The final measure of central tendency is the mode, the value which has the highest frequency. The mode may not be unique. In the above example, it does not exist.
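These definitions are easy to check numerically; a short Python sketch (Python is used here only for illustration) reproduces the example:

```python
import statistics

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(statistics.mean(x))    # 5
print(statistics.median(x))  # 5

# Replace x9 = 9 with x9 = 99: the mean jumps but the median does not.
x[-1] = 99
print(statistics.mean(x))    # 15
print(statistics.median(x))  # still 5
```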
Variation. It has been reported that a poor statistician once drowned in a stream with a mean depth of 6 inches. How could this occur? To summarize the data, we also need to check on variation, something that can be done by looking at the standard deviation and variance. The population variance of a vector x is defined as
σ² = (1/N) Σ (x_i − μ)²    (3-2)
while the sample variance is
s² = (1/(N−1)) Σ (x_i − x̄)²    (3-3)
The population standard deviation is the square root of the population variance. For the purposes of these notes, the standard deviation will mean the sample standard deviation. There are alternative formulas for these values that may be easier to use. As an alternative to (3-2) and (3-3)
σ² = (1/N) Σ x_i² − μ²    (3-4)

s² = (Σ x_i² − N x̄²)/(N − 1)    (3-5)
For implementing the variance in a computer program, (3-2) is more accurate than (3-4). Why is this the case? (Hint: when the mean is large relative to the variance, (3-4) subtracts two large, nearly equal numbers and loses precision.)
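The numerical issue can be seen directly. The Python sketch below (for illustration only) compares the two-pass deviations formula with the one-pass sum-of-squares alternative on data with a large mean; the one-pass form suffers catastrophic cancellation:

```python
def var_two_pass(x):
    """Sample variance via the deviations formula: compute the mean first,
    then sum squared deviations from it."""
    n = len(x)
    m = sum(x) / n
    return sum((xi - m) ** 2 for xi in x) / (n - 1)

def var_one_pass(x):
    """Sample variance via the one-pass formula: sum of squares minus
    N times the squared mean."""
    n = len(x)
    m = sum(x) / n
    return (sum(xi * xi for xi in x) - n * m * m) / (n - 1)

data = [1e8 + 1, 1e8 + 2, 1e8 + 3]   # true sample variance is exactly 1.0
print(var_two_pass(data))   # 1.0
print(var_one_pass(data))   # inaccurate: the subtraction cancels most digits
```

On well-scaled data the two agree; on the shifted data only the two-pass version recovers the correct answer.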
If the data are approximately normally distributed, a general rule is that an observation will lie within ±3 standard deviations of the mean 99% of the time, within ±2 standard deviations 95% of the time, and within ±1 standard deviation 68% of the time. Given a vector of numbers, it is important to determine where a certain number might lie. There are 4 quartile positions of a series. Quartile 1 is the top of the lower 25%, quartile 2 the top of the lower 50%, or the median. Quartile 3 is the top of the lower 75%. The standard deviation gives information concerning where observations lie. Assume x̄ = 10, s = 5 and N = 300. The question asked is how likely a value > 14 will occur. To answer this question requires putting the data in Z form where
Z = (x − x̄)/s    (3-6)
Think of Z as a normalized deviation. Once we get Z, we can enter tables and determine how likely this will occur. In this case Z = (14-10)/5 = .8.
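The tail probability for Z = .8 can be computed rather than looked up in a table. A small Python sketch (for illustration) gets the standard normal upper tail from the complementary error function:

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal variate, via the complementary
    error function: 1 - Phi(z) = 0.5 * erfc(z / sqrt(2))."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

xbar, s = 10.0, 5.0
z = (14.0 - xbar) / s
print(z)               # 0.8
print(normal_tail(z))  # about 0.212: a value > 14 occurs roughly 21% of the time
```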
Distribution of the mean. It often is desirable to know how the sample mean is distributed. Assuming each observation is drawn from the same distribution with finite variance and that the observations are mutually independent, the Central Limit Theorem states that whatever that distribution, with mean μ and variance σ², the distribution of x̄ approaches the normal distribution with mean μ and variance σ²/N as the sample size N increases. Note that the standard deviation of the mean
is defined as

s_x̄ = s/√N    (3-7)

Given x̄ and s_x̄, the 95% confidence interval around x̄ is

x̄ ± 1.96 s/√N    (3-8)

For small samples (N < 30) the formula is

x̄ ± t_{.025,N−1} s/√N    (3-9)
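Applied to the car-value data used later in these notes (N = 6, so the small-sample formula applies), a Python sketch; the critical value t(.025, 5) = 2.571 is taken from a standard t table:

```python
import math
import statistics

y = [1995, 875, 695, 345, 595, 1795]   # car values, N = 6
n = len(y)
ybar = statistics.mean(y)              # 1050
s = statistics.stdev(y)                # sample standard deviation, about 679.5
se_mean = s / math.sqrt(n)             # standard error of the mean

t_crit = 2.571                         # t(.025, 5) from a t table
lo, hi = ybar - t_crit * se_mean, ybar + t_crit * se_mean
print(ybar, se_mean)
print(lo, hi)                          # 95% confidence interval for the mean
```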
Tests of two means. Assume two vectors x and y where we know the population variances σ_x² and σ_y². The simplest test of whether the means differ is

Z = (x̄ − ȳ)/√(σ_x²/N_x + σ_y²/N_y)    (3-10)

where the small sample approximation, assuming the two samples have the same population standard deviation, is

t = (x̄ − ȳ)/(s_p √(1/N_x + 1/N_y))    (3-11)

s_p² = ((N_x − 1)s_x² + (N_y − 1)s_y²)/(N_x + N_y − 2)    (3-12)

Note that s_p² is a pooled estimate of the population variance.
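A worked Python sketch of the small-sample test with two hypothetical samples:

```python
import math
import statistics

def pooled_t(x, y):
    """Small-sample test of two means assuming equal population variances:
    t = (xbar - ybar) / sqrt(sp2 * (1/Nx + 1/Ny)),
    where sp2 is the pooled variance estimate."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)) \
          / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

x = [1, 2, 3, 4, 5]     # hypothetical sample, mean 3
y = [3, 4, 5, 6, 7]     # hypothetical sample, mean 5, same variance
print(pooled_t(x, y))   # about -2.0
```

Both samples have variance 2.5, so the pooled variance is 2.5 and the t statistic is (3 − 5)/1 = −2.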
Correlation. If two variables are thought to be related, a possible summary measure would be the correlation coefficient r. Most calculators or statistical computer programs will make the calculation. The standard error of r is √((1 − r²)/(N − 2)) for small samples and 1/√N for large samples. This means that r divided by its standard error is distributed as a t statistic with asymptotic percentages as given above. The correlation coefficient is defined as

r = (E(xy) − E(x)E(y))/(σ_x σ_y)    (3-13)
Perfect positive correlation is 1.0, perfect negative correlation is -1.0. The SE of r converges to 0.0 as N → ∞. If N were 101, the SE of r would be 1/10 or .1. |r| must be at least .2 to be significant at or better than the 95% level. Correlation is a major tool of analysis that allows a person to formalize what is shown in an x-y plot of the data. A simple data set will be used to illustrate these concepts and introduce OLS models as well as show the flaws of correlation analysis as a diagnostic tool.
Single Equation OLS Regression Model. Data was obtained on 6 observations on age and value of cars (from Freund [1960] Modern Elementary Statistics, page 332), two variables that are thought to be related. Table One lists this data and gives means, correlation between age and value and a simple regression value=f(age). We expect the relationship to be negative and significant.
Table 1. Age of cars
Obs   Age   Value
1     1     1995
2     3     875
3     6     695
4     10    345
5     5     595
6     2     1795

Mean          4.5       1050
Variance      10.7      461750
Correlation   -0.85884
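Table 1's correlation can be reproduced directly from the definition; a Python sketch (for illustration, since the notes use Stata and Matlab for the actual analysis):

```python
import math

age   = [1, 3, 6, 10, 5, 2]
value = [1995, 875, 695, 345, 595, 1795]
n = len(age)
ma = sum(age) / n
mv = sum(value) / n

# Correlation: sum of cross-deviations over the geometric mean of the
# sums of squared deviations.
sxy = sum((a - ma) * (v - mv) for a, v in zip(age, value))
sxx = sum((a - ma) ** 2 for a in age)
syy = sum((v - mv) ** 2 for v in value)
r = sxy / math.sqrt(sxx * syy)
print(r)   # about -0.85884, matching Table 1
```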
Next we show the Stata command files to obtain analysis of this data. Assume you have a file car_age_data.do
input double x
0.1E+01
0.3E+01
0.6E+01
0.1E+02
0.5E+01
0.2E+01
end
label variable x "AGE OF CARS "
input double y
0.1995E+04
0.8750E+03
0.6950E+03
0.3450E+03
0.5950E+03
0.1795E+04
end
label variable y "PRICE OF CARS "
// Comment
run car_age_data.do
describe
summarize
list
correlate (x y)
regress y x
twoway (scatter y x)
Edited output is:
clear
. run car_age_data.do
. describe
Contains data
  obs:             6
 vars:             2
 size:            96
-----------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------------
x               double  %10.0g                AGE OF CARS
y               double  %10.0g                PRICE OF CARS
-----------------------------------------------------------------------------------
Sorted by:
     Note: dataset has changed since last saved
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |         6         4.5    3.271085          1         10
           y |         6        1050    679.5219        345       1995
. list
     +-----------+
     |  x      y |
     |-----------|
  1. |  1   1995 |
  2. |  3    875 |
  3. |  6    695 |
  4. | 10    345 |
  5. |  5    595 |
     |-----------|
  6. |  2   1795 |
     +-----------+
. correlate (x y)
(obs=6)
             |        x        y
-------------+------------------
           x |   1.0000
           y |  -0.8588   1.0000
. regress y x
      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  1,     4) =   11.24
       Model |  1702935.05     1  1702935.05           Prob > F      =  0.0285
    Residual |  605814.953     4  151453.738           R-squared     =  0.7376
-------------+------------------------------           Adj R-squared =  0.6720
       Total |     2308750     5      461750           Root MSE      =  389.17
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |  -178.4112   53.20631    -3.35   0.028    -326.1356   -30.68683
       _cons |    1852.85   287.3469     6.45   0.003     1055.048    2650.653
------------------------------------------------------------------------------
. twoway (scatter y x)
. end of do-file
[Scatter plot: PRICE OF CARS (vertical axis, 0 to 2000) versus AGE OF CARS (horizontal axis, 0 to 10)]
From the plot we see that the ten year old car appears to have a larger than expected value for its age. For this reason, more variables and observations are needed.
Remark: When there are two series, correlation and plots can be used effectively to determine the model. However, when there are more than two series, plots and correlation analysis are less useful and in many cases can give the wrong impression. This will be illustrated later. In cases where there is more than one explanatory variable, regression is the appropriate approach, although this approach has many problems.
A regression tries to write the dependent variable y as a linear function of the explanatory variables. In this case we have estimated a model of the form
y_t = β₀ + β₁ x_t + e_t    (3-14)
where y=value is the price of the car in period t, x=age is the age in period t and e is the error term.
Regression output produces
value = 1852.8505 - 178.41121*age    (3-15)
         (6.45)     (-3.35)

Adjusted R² = .672, SEE = 389.17, e'e = 605814.953

which can be verified from the printout. Note that SEE = √(e'e/(N − K)) = √(605814.953/4) = 389.17.
The regression model suggests that for every year older a car gets, its value drops significantly, by $178.41. A car one year old should have a value of 1852.8505 - (1)*178.41121 = 1674.4. In the sample data set the one year old car in fact had a value of 1995. For this observation the error was 320.56. Using the estimated equation (3-15) we have
Age   Actual Value   Estimated Value   Error
1     1995           1674.4            320.56
3     875            1317.6            -442.62
6     695            782.38            -87.383
10    345.0          68.738            276.26
5     595            960.79            -365.79
2     1795           1496              298.97
t scores have been placed under the estimated coefficients. Since for both coefficients |t| > 2, we can state that given the assumptions of the linear regression model, both coefficients are significant. Before turning to an in-depth discussion of the regression model, we look at a few optional topics.
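The estimates in (3-15) can also be verified by hand with the standard two-variable OLS formulas, slope = S_xy/S_xx and intercept = ȳ − slope·x̄; a short Python sketch (for illustration only, the notes' own check is the Matlab program in the next section):

```python
import math

age   = [1, 3, 6, 10, 5, 2]
value = [1995, 875, 695, 345, 595, 1795]
n = len(age)

xbar = sum(age) / n
ybar = sum(value) / n
sxy = sum((a - xbar) * (v - ybar) for a, v in zip(age, value))
sxx = sum((a - xbar) ** 2 for a in age)

b1 = sxy / sxx            # slope, about -178.41
b0 = ybar - b1 * xbar     # intercept, about 1852.85

resid = [v - (b0 + b1 * a) for a, v in zip(age, value)]
sse = sum(e * e for e in resid)    # e'e, about 605814.95
see = math.sqrt(sse / (n - 2))     # SEE, about 389.17
print(b0, b1, sse, see)
```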
4. More complex setup to illustrate Matlab to estimate the Model. Optional Topic.
This optional topic implements the key ideas in Appendix E of Wooldridge (2013) that show how a linear econometric model can be estimated by OLS. As discussed in the text, a linear OLS model selects the coefficients so as to minimize the sum of squared errors. Define X as an N by K matrix where N is the number of observations on K series. The OLS coefficient vector is β̂ = (X'X)⁻¹X'y, where y is the left hand side vector. The error vector is e = y − Xβ̂. Standard errors of the coefficients can be obtained from the square roots of the diagonal elements of σ̂²(X'X)⁻¹, where σ̂² = e'e/(N − K).
As an alternative to the Stata regress command that was shown above, the self contained MATLAB program that is listed next can be used to estimate the model.
%% Cars Example using Matlab
% Load data
x=[1,1; 1,3; 1,6; 1,10; 1,5; 1,2];
y=[1995 875 695 345 595 1795];
y=y';
value=y;
disp('Mean of dependent (Value) and Independent Variable (Age)')
disp([mean(y),mean(x(:,2))])
age=x(:,2);
disp('Small and Large Variances for Age and Value')
disp([var(age,0),var(age,1),var(value,0),var(value,1)])
disp('Correlation using formula and built in function')
cor=(mean(age.*y)-(mean(age)*mean(y)))/(sqrt(var(age,1))*sqrt(var(value,1)))
% using built in function
cor=corr([age,value])
%% Estimate the model
% Logic works for any sized problem!!
% for large # of obs put ; at end of [y,yhat,res] line
beta=inv(x'*x)*x'*y;
yhat=x*beta;
res=y-yhat;
disp('  Value    Yhat    Res')
[y,yhat,res]
ssr=res'*res;
disp('Sum of squared residuals')
disp(ssr)
df=size(x,1)-size(x,2);
se=sqrt(diag((ssr/df)*inv(x'*x)));
disp('  Beta    se    t')
t=beta./se;
[beta,se,t]
plot(res)
% plot(age,y,age,yhat)
disp('Durbin Watson')
i=1:1:5;
dw=((res(i+1)-res(i))'*(res(i+1)-res(i)))/(res'*res);
disp(dw)
Which produces output:
Mean of dependent (Value) and Independent Variable (Age)
        1050          4.5
Small and Large Variances for Age and Value
        10.7       8.9167  4.6175e+005  3.8479e+005
Correlation using formula and built in function
cor =
   -0.85884
cor =
            1     -0.85884
     -0.85884            1
   Value     Yhat      Res
ans =
         1995       1674.4       320.56
          875       1317.6      -442.62
          695       782.38      -87.383
          345       68.738       276.26
          595       960.79      -365.79
         1795         1496       298.97
Sum of squared residuals
  6.0581e+005
   Beta     se      t
ans =
       1852.9       287.35       6.4481
      -178.41       53.206      -3.3532
Durbin Watson
    2.7979
which matches what was produced by the Stata regress command, a command that can otherwise give the user the impression of a "black box." Our findings indicate that on average each additional year of age lowers the value of a car by $178.41.
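The Durbin–Watson statistic at the end of the Matlab run is just the sum of squared first differences of the residuals over the residual sum of squares. A Python sketch (for illustration) using the residuals printed above:

```python
def durbin_watson(res):
    """DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2."""
    num = sum((res[t] - res[t - 1]) ** 2 for t in range(1, len(res)))
    return num / sum(e * e for e in res)

# Residuals from the car-value regression, as printed in the Matlab output.
res = [320.56, -442.62, -87.383, 276.26, -365.79, 298.97]
print(durbin_watson(res))   # about 2.80, matching the Matlab value 2.7979
```

Values near 2 indicate little first-order serial correlation in the residuals; with only 6 observations the statistic is, of course, very imprecise.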
Remark: Econometric calculations can easily be programmed using 4th generation languages without detailed knowledge of Fortran or C. This allows new techniques to be implemented without waiting for software developers to "hard wire" these procedures.
5. Review of Linear Algebra and Introduction to Programming Regression Calculations. Optional Topic for those with the right math background.
Assume a problem where there are multiple x variables, all possibly related to y, and there is some relationship among the x variables (multicollinearity). The proposed solution is to fit a linear model of the form:

y = Xβ + e,    (5-1)

where y and e are N element column vectors, β_i is the coefficient of x_i, and β_0 is the intercept of the equation. A linear model such as (5-1) can be estimated by OLS (ordinary least squares), which will minimize Σ e_i², a good measure of the fit of the model. OLS is one of many methods to fit a line, others discussed being L1, which minimizes Σ |e_i|, and minimax, which minimizes the largest element in e. After the coefficients are calculated, it is a good idea to estimate and report standard errors, which allow significance tests on the estimates of the parameters. OLS models can be estimated using matrix algebra directly or using pre-programmed procedures like the regression command in Excel. There are, however, a number of ways to calculate the estimated parameters. Before this occurs we first illustrate a number of linear algebra calculations that include the LU factorization, eigenvalue analysis, the Cholesky factorization, the QR factorization, the Schur factorization (which always works for cases where eigenvalue routines may not) and the SVD calculation.

The LU factorization is the appropriate way to invert a general matrix. Eigenvalue analysis decomposes Z = VΛV⁻¹, where Λ is a diagonal matrix of eigenvalues and Z is a general matrix. For the positive definite case Z = VΛV', since here V⁻¹ = V'. Inspection of the diagonal elements of Λ indicates whether Zᵏ explodes if we note Zᵏ = VΛᵏV⁻¹. The sum of the diagonal elements of Λ is the trace of Z, while their product is det(Z). If Z is positive definite (all diagonal elements of Λ > 0) the Cholesky factorization writes Z = R'R where R is upper triangular. The Schur factorization writes Z = USU' where U is orthogonal and S is block upper triangular. Unlike the eigenvalue transformation, all elements of the Schur factorization are real for a general real matrix. The QR factorization writes X = QR where Q is orthogonal and R is the Cholesky factor of X'X, calculated accurately since it uses X, not X'X. The SVD calculates X = UΣV', where U and V are both orthogonal, N by K and K by K respectively, and Σ is a K by K diagonal matrix whose elements are the square roots of the eigenvalues of X'X. The Matlab script listed below self documents these calculations and shows Figure 5.1 graphically, where X was 100 by 50. How would this graph look if X were not a random matrix, where by assumption the off-diagonal elements of E(X'X) are near zero? How might it be used?
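To see what a factorization routine actually does, here is a minimal pure-Python Cholesky sketch (illustration only; the notes use Matlab's chol, and a real program would use a tuned library). It writes a positive definite Z as R'R with R upper triangular:

```python
import math

def chol_upper(z):
    """Return upper-triangular R with R'R = Z, for positive definite Z
    (textbook Cholesky algorithm)."""
    n = len(z)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = z[i][j] - sum(r[k][i] * r[k][j] for k in range(i))
            if i == j:
                r[i][j] = math.sqrt(s)   # fails here if Z is not positive definite
            else:
                r[i][j] = s / r[i][i]
    return r

z = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
r = chol_upper(z)
# Check that R'R reproduces Z.
back = [[sum(r[k][i] * r[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
print(r)
print(back)
```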
%% Linear Algebra Useful for Econometrics in Matlab
disp(' Short course in Math using Matlab(c)')
% 2 December 2006 Version
disp(' Houston H. Stokes')
disp(' All Matlab commands are indented. Cut and paste from this')
disp(' document into Matlab and execute.')
disp(' ')
disp(' If ; is left off result will print.')
disp(' Define x as a n by n matrix of random numbers.')
disp(' x = rand(n) ')
disp(' define x as a n by n matrix of random normal numbers')
disp(' xn = randn(n)')
disp(' Do a LU factorization and test answer')
disp(' Inverse using LU ')
disp(' ')
disp(' x = rand(n) ')
disp(' [l u] = lu(x) ')
disp(' test = l*u ')
disp(' error = l*u - x ')
disp(' ix = inv(x) ')
disp(' ix2 = inv(u)*inv(l) ')
disp(' error = ix - ix2 ')
 n=3
 x = rand(n)
 [l u] = lu(x)
 test = l*u
 error = l*u - x
 ix = inv(x)
 ix2 = inv(u)*inv(l)
 error = ix - ix2
disp(' Form PD Matrix and look at it. ')
disp(' xx = randn(100,10); ')
disp(' xpx = xx`*xx ')
disp(' mesh(xpx) ')
 xx = randn(100,50);
 xpx = xx'*xx;
 mesh(xpx)
disp(' Factor PD matrix into R(t)*R and test')
disp(' xx = randn(100,n); ')
disp(' xpx = xx(t)*xx ')
disp(' r = chol(xpx) ')
disp(' test1 = r(t)*r ')
disp(' mesh(r) ')
disp(' error = r(t)*r - xpx ')
 xx = randn(100,n);
 xpx = xx'*xx
 r = chol(xpx)
 test1 = r'*r
 error = r'*r - xpx
disp(' Eigen and svd analysis. For pd matrix s = landa')
disp(' xx = randn(100,n); ')
disp(' xpx = xx(t)*xx ')
disp(' lamda = eig(xpx) ')
 xx = randn(100,n);
 xpx = xx'*xx
 lamda = eig(xpx)
disp(' show trace = sum eigen')
disp(' det = prod(e) ')
disp(' trace1 = trace(xpx) ')
disp(' det1 = det(xpx) ')
disp(' trace2 = sum(lamda) ')
disp(' det2 = prod(lamda) ')
 trace1 = trace(xpx)
 det1 = det(xpx)
 trace2 = sum(lamda)
 det2 = prod(lamda)
disp(' Test SVD')
disp(' s = svd(xpx) ')
disp(' [u ss v] = svd(xpx) ')
disp(' test = u*ss*v(t)')
disp(' error = xpx-test ')
 s = svd(xpx)
 [u ss v] = svd(xpx)
 test = u*ss*v'
 error = xpx-test
disp(' Does X*V = V*Lamda')
disp(' xx = randn(100,n); ')
disp(' xpx = xx(t)*xx ')
disp(' [v lamda] = eig(xpx) ')
disp(' test = v*lamda*inv(v)')
disp(' error = xpx-test ')
disp(' vpv = v(t)*v ')
disp(' s = svd(xpx) ')
 xx = randn(100,n);
 xpx = xx'*xx
 [v lamda] = eig(xpx)
 test = v*lamda*inv(v)
 error = xpx-test
 vpv = v'*v
 s = svd(xpx)
disp(' Schur Factorization X = U S U(t) where U is orthogonal and')
disp(' S is block upper triangural with 1 by 1 and 2 by 2 on the')
disp(' diagonal. All elements of a Schur factorization real')
disp(' xx = randn(100,n); ')
disp(' xpx = xx(t)*xx ')
disp(' [U,S] = schur(xpx) ')
disp(' test = U*S*U(t) ')
disp(' error = xpx-test ')
 xx = randn(100,n);
 xpx = xx'*xx
 [U,S] = schur(xpx)
 test = U*S*U'
 error = xpx-test
disp(' Schur Factorization')
disp(' xx = randn(n,n) ')
disp(' [U,S] = schur(xx) ')
disp(' test = U*S*U(t) ')
disp(' error = xx-test ')
 xx = randn(n,n)
 [U,S] = schur(xx)
 test = U*S*U'
 error = xx-test
disp(' QR Factorization preserves length and angles and does not magnify')
disp(' errors. We express X = Q*R where Q is orthogonal and R is upper')
disp(' triangular ')
disp(' x = randn(n,n) ')
disp(' [Q R] = qr(x) ')
disp(' test1 = Q(t)*Q ')
disp(' test2 = Q*R ')
disp(' error = x - test2 ')
 x = randn(n,n)
 [Q R] = qr(x)
 test1 = Q'*Q
 test2 = Q*R
 error = x - test2
and produces output:
 Short course in Math using Matlab(c)
 Houston H. Stokes
 All Matlab commands are indented. Cut and paste from this
 document into Matlab and execute.

 If ; is left off result will print.
 Define x as a n by n matrix of random numbers.
 x = rand(n)
 define x as a n by n matrix of random normal numbers
 xn = randn(n)
 Do a LU factorization and test answer
 Inverse using LU

 x = rand(n)
 [l u] = lu(x)
 test = l*u
 error = l*u - x
 ix = inv(x)
 ix2 = inv(u)*inv(l)
 error = ix - ix2
n =
     3
x =
      0.84622      0.67214      0.68128
      0.52515      0.83812      0.37948
      0.20265      0.01964       0.8318
l =
            1            0            0
      0.62059            1            0
      0.23947     -0.33568            1
u =
      0.84622      0.67214      0.68128
            0        0.421     -0.04331
            0            0      0.65411
test =
      0.84622      0.67214      0.68128
      0.52515      0.83812      0.37948
      0.20265      0.01964       0.8318
error =
            0            0            0
            0            0            0
            0  -6.9389e-018            0
ix =
       2.9596      -2.3417      -1.3557
      -1.5445       2.4281      0.15727
     -0.68458      0.51318       1.5288
ix2 =
       2.9596      -2.3417      -1.3557
      -1.5445       2.4281      0.15727
     -0.68458      0.51318       1.5288
error =
            0            0            0
            0            0            0
 -1.1102e-016            0            0
 Form PD Matrix and look at it.
 xx = randn(100,10);
 xpx = xx`*xx
 mesh(xpx)
 Factor PD matrix into R(t)*R and test
 xx = randn(100,n);
 xpx = xx(t)*xx
 r = chol(xpx)
 test1 = r(t)*r
 mesh(r)
 error = r(t)*r - xpx
xpx =
        98.02       17.334      0.14022
       17.334       104.66      -7.2052
      0.14022      -7.2052       114.22
r =
       9.9005       1.7508     0.014163
            0        10.08     -0.71729
            0            0       10.663
test1 =
        98.02       17.334      0.14022
       17.334       104.66      -7.2052
      0.14022      -7.2052       114.22
error =
  1.4211e-014            0            0
            0            0            0
            0            0            0
 Eigen and svd analysis. For pd matrix s = landa
 xx = randn(100,n);
 xpx = xx(t)*xx
 lamda = eig(xpx)
xpx =
       95.217      -3.5453       12.006
      -3.5453       96.003      -3.9312
       12.006      -3.9312       92.989
lamda =
       82.025
       93.783
        108.4
 show trace = sum eigen
 det = prod(e)
 trace1 = trace(xpx)
 det1 = det(xpx)
 trace2 = sum(lamda)
 det2 = prod(lamda)
trace1 =
       284.21
det1 =
  8.3388e+005
trace2 =
       284.21
det2 =
  8.3388e+005
 Test SVD
 s = svd(xpx)
 [u ss v] = svd(xpx)
 test = u*ss*v(t)
 error = xpx-test
s =
        108.4
       93.783
       82.025
u =
     -0.67492       0.3165      0.66657
      0.39135      0.91936    -0.040277
     -0.62557      0.23368     -0.74435
ss =
        108.4            0            0
            0       93.783            0
            0            0       82.025
v =
     -0.67492       0.3165      0.66657
      0.39135      0.91936    -0.040277
     -0.62557      0.23368     -0.74435
test =
       95.217      -3.5453       12.006
      -3.5453       96.003      -3.9312
       12.006      -3.9312       92.989
error =
 -2.8422e-014   -1.199e-014 -4.0856e-014
 -9.3703e-014 -2.8422e-014  5.9064e-014
 -4.7962e-014 -5.3291e-015  1.4211e-014
 Does X*V = V*Lamda
 xx = randn(100,n);
 xpx = xx(t)*xx
 [v lamda] = eig(xpx)
 test = v*lamda*inv(v)
 error = xpx-test
 vpv = v(t)*v
 s = svd(xpx)
xpx =
       98.321     -0.36605       1.9557
     -0.36605       127.52      -2.4594
       1.9557      -2.4594       112.74
v =
      0.99127       0.1298     0.022941
    0.0013134       0.1643     -0.98641
     -0.13181      0.97783       0.1627
lamda =
       98.061            0            0
            0       112.59            0
            0            0       127.93
test =
       98.321     -0.36605       1.9557
     -0.36605       127.52      -2.4594
       1.9557      -2.4594       112.74
error =
 -1.4211e-014  2.7645e-014  2.2204e-015
  2.9421e-014 -4.2633e-014  5.7732e-015
  1.7764e-015 -1.3323e-015            0
vpv =
            1  2.7756e-017 -2.0817e-017
  2.7756e-017            1  -2.498e-016
 -2.0817e-017  -2.498e-016            1
s =
       127.93
       112.59
       98.061
 Schur Factorization X = U S U(t) where U is orthogonal and
 S is block upper triangural with 1 by 1 and 2 by 2 on the
 diagonal. All elements of a Schur factorization real
 xx = randn(100,n);
 xpx = xx(t)*xx
 [U,S] = schur(xpx)
 test = U*S*U(t)
 error = xpx-test
xpx =
       75.062       11.465      -4.6863
       11.465       135.28       7.6196
      -4.6863       7.6196       87.647
U =
     -0.91599     -0.36457     -0.16747
      0.20355    -0.062606     -0.97706
     -0.34572      0.92907     -0.13156
S =
       70.745            0            0
            0       88.973            0
            0            0       138.27
test =
       75.062       11.465      -4.6863
       11.465       135.28       7.6196
      -4.6863       7.6196       87.647
error =
  1.4211e-014  2.1316e-014 -1.4211e-014
  2.4869e-014 -8.5265e-014  7.1054e-015
 -1.0658e-014  5.3291e-015  1.4211e-014
 Schur Factorization
 xx = randn(n,n)
 [U,S] = schur(xx)
 test = U*S*U(t)
 error = xx-test
xx =
        2.095      0.93943     -0.45994
      0.34979    -0.047081      0.64722
       2.0142      -1.4799      -1.8411
U =
     -0.89939     -0.19282      0.39233
     -0.24726     -0.51574     -0.82029
      -0.3605      0.83477     -0.41617
S =
       2.1689       1.4404      -1.1939
            0     -0.98103       2.3141
            0     -0.42339     -0.98103
test =
        2.095      0.93943     -0.45994
      0.34979    -0.047081      0.64722
       2.0142      -1.4799      -1.8411
error =
  8.8818e-016 -2.2204e-016 -1.6653e-016
  1.1102e-016    9.09e-016  8.8818e-016
  4.4409e-016  3.3307e-015  8.8818e-016
 QR Factorization preserves length and angles and does not magnify
 errors. We express X = Q*R where Q is orthogonal and R is upper
 triangular
 x = randn(n,n)
 [Q R] = qr(x)
 test1 = Q(t)*Q
 test2 = Q*R
 error = x - test2
x =
      -0.9756      0.55997      0.88166
     0.028304      0.62542      0.15174
    -0.050706      0.53695    -0.017682
Q =
     -0.99823    0.0094729    -0.058658
     0.028961     -0.78444     -0.61953
    -0.051883     -0.62013      0.78278
R =
      0.97733     -0.56872     -0.87479
            0     -0.81828    -0.099712
            0            0     -0.15956
test1 =
            1            0  6.9389e-018
            0            1 -1.6653e-016
  6.9389e-018 -1.6653e-016            1
test2 =
      -0.9756      0.55997      0.88166
     0.028304      0.62542      0.15174
    -0.050706      0.53695    -0.017682
error =
            0            0 -4.4409e-016
  3.4694e-018  2.2204e-016  8.3267e-017
            0            0  2.0817e-017
[3-D mesh plot: X'X where X was 100 by 50; both axes run 0 to 50, values roughly -40 to 140]
Figure 5.1 X'X for a random Matrix X
These ideas are illustrated using the Theil dataset discussed in more detail in the next section.
%% Use of Theil Data to Illustrate various ways to get Beta
% For more detail on these calculations see Stokes (200x) Chapter 10
disp('Theil (1971) data on Year CT RP Income')
data=[1923  99.2  96.7 101.0;
      1924  99.0  98.1 100.1;
      1925 100.0 100.0 100.0;
      1926 111.6 104.6  90.6;
      1927 122.2 104.9  86.5;
      1928 117.6 109.5  89.7;
      1929 121.1 110.8  90.6;
      1930 136.0 112.3  82.8;
      1931 154.2 109.3  70.1;
      1932 153.6 105.3  65.4;
      1933 158.5 101.7  61.3;
      1934 140.6  95.4  62.5;
      1935 136.2  96.4  63.6;
      1936 168.0  97.6  52.6;
      1937 154.3 102.4  59.7;
      1938 149.0 101.6  59.5;
      1939 165.5 103.8  61.3]
y=data(:,2);
x=[ones(size(data,1),1),data(:,3),data(:,4)];
disp('Beta using Inverse')
beta1=inv(x'*x)*x'*y
%% QR
disp('Using QR approach')
[q,r]=qr(x,0)
disp('Testing q')
q'*q
beta2=inv(r)*q'*y
yhat=q*q'*y;
resid=y-yhat;
disp('Y Yhat Residual')
[y,yhat,resid]
%% Testing R from QR and R from Cholesky
disp('Inverse (xpx) = inv(r)*transpose(inv(r))')
inv(x'*x)
inv(r)*(inv(r))'
r
cholr=chol(x'*x)
%% SVD approach that includes PC Regression
disp('SVD approach')
[u,s,v]=svd(x,0)
pc_coef=u'*y
beta3=inv(v')*inv(s)*pc_coef
Output produced is:
Theil (1971) data on Year CT RP Income
data =
         1923         99.2         96.7          101
         1924           99         98.1        100.1
         1925          100          100          100
         1926        111.6        104.6         90.6
         1927        122.2        104.9         86.5
         1928        117.6        109.5         89.7
         1929        121.1        110.8         90.6
         1930          136        112.3         82.8
         1931        154.2        109.3         70.1
         1932        153.6        105.3         65.4
         1933        158.5        101.7         61.3
         1934        140.6         95.4         62.5
         1935        136.2         96.4         63.6
         1936          168         97.6         52.6
         1937        154.3        102.4         59.7
         1938          149        101.6         59.5
         1939        165.5        103.8         61.3
Beta using Inverse
beta1 =
       130.23
       1.0659
      -1.3822
Using QR approach
q =
     -0.24254      -0.2958     -0.42465
     -0.24254      -0.2297     -0.39928
     -0.24254     -0.13999     -0.38173
     -0.24254     0.077214     -0.20134
     -0.24254     0.091379     -0.13707
     -0.24254      0.30858     -0.14641
     -0.24254      0.36996     -0.14898
     -0.24254      0.44079    -0.018862
     -0.24254      0.29913      0.14704
     -0.24254      0.11027      0.18403
     -0.24254    -0.059716      0.21536
     -0.24254     -0.35718      0.14409
     -0.24254     -0.30997      0.13597
     -0.24254     -0.25331      0.31174
     -0.24254    -0.026664      0.24537
     -0.24254    -0.064438      0.24162
     -0.24254      0.03944       0.2331
r =
      -4.1231      -424.53      -314.64
            0       21.179       11.878
            0            0      -66.411
Testing q
ans =
            1  8.1532e-017  1.3878e-017
  8.1532e-017            1 -1.1796e-016
  1.3878e-017 -1.1796e-016            1
beta2 =
       130.23
       1.0659
      -1.3822
Y Yhat Residual
ans =
         99.2       93.704       5.4962
           99        96.44         2.56
          100       98.603       1.3965
        111.6        116.5      -4.8995
        122.2       122.49     -0.28637
        117.6       122.97      -5.3664
        121.1       123.11      -2.0081
          136       135.49      0.51173
        154.2       149.84       4.3553
        153.6       152.08       1.5225
        158.5       153.91       4.5927
        140.6       145.53      -4.9335
        136.2       145.08      -8.8789
          168       161.56       6.4376
        154.3       156.87       -2.565
          149       156.29      -7.2887
        165.5       156.15       9.3542
Inverse (xpx) = inv(r)*transpose(inv(r))
ans =
       23.773      -0.2272   -0.0042094
      -0.2272    0.0023008  -0.00012716
   -0.0042094  -0.00012716   0.00022673
ans =
       23.773      -0.2272   -0.0042094
      -0.2272    0.0023008  -0.00012716
   -0.0042094  -0.00012716   0.00022673
r =
      -4.1231      -424.53      -314.64
            0       21.179       11.878
            0            0      -66.411
cholr =
       4.1231       424.53       314.64
            0       21.179       11.878
            0            0       66.411
SVD approach
u =
      0.26014     -0.42317      0.28267
      0.26123     -0.39389      0.21821
      0.26398     -0.37096      0.12977
      0.26026     -0.17816    -0.076467
      0.25606     -0.11332    -0.086908
      0.26662      -0.1094     -0.30402
      0.26959     -0.10823     -0.36537
      0.26301     0.025612     -0.42853
       0.2441      0.18215     -0.27778
      0.23275      0.20748    -0.087337
      0.22268      0.22834      0.08395
      0.21455      0.13929      0.37648
       0.2173      0.13408      0.32893
      0.20664      0.31251      0.28251
      0.22192      0.26022     0.052713
      0.22049      0.25419     0.090163
      0.22584      0.25202    -0.013903
s =
       530.48            0            0
            0       53.304            0
            0            0      0.20509
v =
    0.0077424    0.0056046      0.99995
        0.799      0.60125   -0.0095564
      0.60128     -0.79904  -0.00017699
pc_coef =
       545.81
       131.94
       26.706
beta3 =
       130.23
       1.0659
      -1.3822
Remark: This section shows how to implement the basic linear algebra relationships that are useful in understanding modern econometric methods and calculations. In many cases these approaches are required for complex and multicollinear datasets.
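The QR route to OLS used in the Matlab code above can be cross-checked in plain Python. The sketch below (the 5-by-3 matrix, the y vector, and the helper names qr_mgs, back_solve, and solve are all invented for illustration) computes the coefficients once from the normal equations and once from a thin QR factorization and confirms the two agree, just as beta1 and beta2 agreed above:

```python
def qr_mgs(X):
    """Thin QR via modified Gram-Schmidt; returns Q as a list of columns, plus R."""
    n, k = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(k)]
    Q, R = [], [[0.0] * k for _ in range(k)]
    for j in range(k):
        v = cols[j][:]
        for i in range(j):
            R[i][j] = sum(Q[i][t] * v[t] for t in range(n))
            v = [v[t] - R[i][j] * Q[i][t] for t in range(n)]
        R[j][j] = sum(z * z for z in v) ** 0.5
        Q.append([z / R[j][j] for z in v])
    return Q, R

def back_solve(R, b):
    """Solve the upper-triangular system R x = b."""
    k = len(b)
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (b[r] - sum(R[r][c] * x[c] for c in range(r + 1, k))) / R[r][r]
    return x

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

# Made-up data: a constant and two regressors
X = [[1.0, 2.0, 1.0], [1.0, 3.0, 2.0], [1.0, 5.0, 2.5],
     [1.0, 7.0, 3.0], [1.0, 8.0, 5.0]]
y = [3.1, 4.2, 6.0, 7.9, 10.1]

# beta from QR: solve R beta = Q'y
Q, R = qr_mgs(X)
qty = [sum(Q[j][t] * y[t] for t in range(len(y))) for j in range(len(Q))]
beta_qr = back_solve(R, qty)

# beta from the normal equations: (X'X) beta = X'y
k = len(X[0])
XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(k)]
beta_ne = solve(XtX, Xty)
```

The QR route never forms X'X, which is why it is preferred for ill-conditioned (multicollinear) problems: the condition number of X'X is the square of the condition number of X.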
6. A Sample Multiple Input Regression Model Dataset
In sections 3 and 4 we introduced a small (6-observation) dataset that relates the age of cars to
their value. We observed that since there are so few observations in this example, the correlation coefficient must be relatively large to be significant. The small sample standard error of the
correlation coefficient is calculated using (3.3), which in this case is 1/√6 ≈ .41. Since the absolute value of the correlation coefficient (-.85884) is about 2 times the standard error, we can state that at about the 95% level the correlation coefficient is significant. The problem with correlation analysis is that it is hard to make direct predictions. What is wanted is a relationship where, given only the age of a car, we can make some prediction of its price. To obtain an answer to the prediction problem requires more advanced statistical techniques. Its solution will be discussed further below.
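This arithmetic can be checked directly. The sketch below assumes (3.3) is the usual small-sample approximation SE(r) ≈ 1/√n; the equation itself appears in an earlier section, so the form used here is an assumption:

```python
import math

n = 6                    # observations in the car age/value example
r = -0.85884             # correlation reported in the text
se_r = 1 / math.sqrt(n)  # assumed form of (3.3): small-sample SE of r
ratio = abs(r) / se_r    # roughly a t-like ratio; about 2 here
```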
As discussed earlier, when more complicated models are deemed appropriate or when predictions are required, correlation analysis, which restricts attention to two variables at a time, is no longer the best way to proceed. In the highly unlikely situation where all the variables influencing y (the x's) were unrelated among themselves (i.e., were orthogonal), correlation analysis would give the correct sign of the relationship between each x variable and y. This situation would occur if the x's were principal components. Some of these possibilities will be illustrated later using generated data.
Table Two lists data on the consumption of textiles in the Netherlands from Theil ([1971] Principles of Econometrics, page 102), which was used as an example in the Matlab code in section 5. This example will be shown to provide a better fit than the previous example and, in addition, illustrates multiple input regression models. (It should be noted that not all economics examples work this well.) Time series models usually have higher R² than cross-section models because of the serial correlation (relationship between the error terms across time) implicit in most time series. In this example from Theil (1971) the consumption of textiles in the Netherlands (CT) between 1923-1939 is modeled as a function of income (Y) and the relative price of textiles (RP). The maintained hypothesis is that as income increases, the consumption of textiles should increase, and as the relative price of textiles increases, the consumption of textiles should decrease. Two models are tried, one with the raw data and one with data logged to the base 10. The linear model asserts
Table Two Consumption of Textiles in the Netherlands: 1923-1939
Year     CT      Y     RP
1923    99.2   96.7  101.0
1924    99.0   98.1  100.1
1925   100.0  100.0  100.0
1926   111.6  104.6   90.6
1927   122.2  104.9   86.5
1928   117.6  109.5   89.7
1929   121.1  110.8   90.6
1930   136.0  112.3   82.8
1931   154.2  109.3   70.1
1932   153.6  105.3   65.4
1933   158.5  101.7   61.3
1934   140.6   95.4   62.5
1935   136.2   96.4   63.6
1936   168.0   97.6   52.6
1937   154.3  102.4   59.7
1938   149.0  101.6   59.5
1939   165.5  103.8   61.3
CT = consumption of textiles.
Y  = income.
RP = relative price of textiles.
CT_t = β1 + β2 Y_t + β3 RP_t + e_t                                   (6-1)
while the log form assumes the error is multiplicative or that
CT_t = 10^β1 · Y_t^β2 · RP_t^β3 · 10^u_t                             (6-2)
(6-2) can be estimated in log form as
log10(CT_t) = β1 + β2 log10(Y_t) + β3 log10(RP_t) + u_t              (6-3)
Actual estimates of the alternative models were
CT_t        = 130.71 + 1.0617 Y_t - 1.3830 RP_t              adjusted R² = .9443
              (4.82)   (3.98)      (-16.50)

log10(CT_t) = 1.3739 + 1.1432 log10(Y_t) - 0.8288 log10(RP_t)   adjusted R² = .9707
              (4.49)   (7.33)            (-22.95)
                                                                     (6-4)
(t statistics in parentheses)
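The two fits in (6-4) can be reproduced in plain Python from the normal equations. This is a sketch, not the B34S or SAS runs themselves; the helper names solve and ols are made up, and the income series is taken as it appears in the SAS cards below:

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """beta = (X'X)^(-1) X'y via the normal equations."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(k)]
    return solve(XtX, Xty)

# Theil (1971) textile data, 1923-1939, income as in the SAS cards
ct = [99.2, 99, 100, 111.6, 122.2, 117.6, 121.1, 136, 154.2,
      153.6, 158.5, 140.6, 136.2, 168, 154.3, 149, 165.5]
inc = [96.7, 98.1, 100, 104.9, 104.9, 109.5, 110.8, 112.3, 109.3,
       105.3, 101.7, 95.4, 96.4, 97.6, 102.4, 101.6, 103.8]
rp = [101, 100.1, 100, 90.6, 86.5, 89.7, 90.6, 82.8, 70.1,
      65.4, 61.3, 62.5, 63.6, 52.6, 59.7, 59.5, 61.3]

beta_lin = ols([[1.0, inc[i], rp[i]] for i in range(17)], ct)
beta_log = ols([[1.0, math.log10(inc[i]), math.log10(rp[i])] for i in range(17)],
               [math.log10(v) for v in ct])
```

In the log form the slope coefficients are elasticities, which is why the log model is often preferred for demand equations like this one.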
Prior to estimation, raw correlations were computed and plots were made. The log transformation was also tried in an attempt to make the time series data stationary. B34S and SAS commands to analyze these data are shown next.
Note that when using the B34S data step, variables to be created with gen statements must be explicitly declared with the build statement. This allows checking of the variable names in the gen statements. For SAS, the following commands would be used.
data theil;
INPUT CT Y RP ;
LABEL CT      = 'CONSUMPTION OF TEXTILES' ;
LABEL LOG10CT = 'LOG10 OF CONSUMPTION' ;
LABEL Y       = 'INCOME' ;
LABEL LOG10Y  = 'LOG10 OF INCOME' ;
LABEL RP      = 'RELATIVE PRICE OF TEXTILES' ;
LABEL LOG10RP = 'LOG10 OF RELATIVE PRICE' ;
LOG10CT = LOG10(CT) ;
LOG10RP = LOG10(RP) ;
LOG10Y  = LOG10(Y) ;
CARDS;
99.2 96.7 101
99 98.1 100.1
100 100 100
111.6 104.9 90.6
122.2 104.9 86.5
117.6 109.5 89.7
121.1 110.8 90.6
136 112.3 82.8
154.2 109.3 70.1
153.6 105.3 65.4
158.5 101.7 61.3
140.6 95.4 62.5
136.2 96.4 63.6
168 97.6 52.6
154.3 102.4 59.7
149 101.6 59.5
165.5 103.8 61.3
;
proc reg;     MODEL CT = Y RP; run;
proc reg;     MODEL LOG10CT = LOG10Y LOG10RP; run;
proc autoreg; MODEL LOG10CT = LOG10Y LOG10RP / nlag=1 method=ml; run;
proc autoreg; MODEL LOG10CT = LOG10Y / nlag=1 method=ml; run;
Edited output from B34S discussed below is:
Variable # Label Mean Std. Dev. Variance Maximum Minimum
CT 1 CONSUMPTION OF TEXTILES 134.506 23.5773 555.891 168.000 99.0000 Y 2 INCOME 102.982 5.30097 28.1003 112.300 95.4000 RP 3 RELATIVE PRICE OF TEXTILES 76.3118 16.8662 284.470 101.000 52.6000 LOG10CT 4 LOG10 OF CONSUMPTION 2.12214 0.791131E-01 0.625889E-02 2.22531 1.99564 LOG10Y 5 LOG10 OF INCOME 2.01222 0.222587E-01 0.495451E-03 2.05038 1.97955 LOG10RP 6 LOG10 OF RELATIVE PRICE 1.87258 0.961571E-01 0.924619E-02 2.00432 1.72099 CONSTANT 7 1.00000 0.00000 0.00000 1.00000 1.00000
Data file contains 17 observations on 7 variables. Current missing value code is 0.1000000000000000E+32B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 16:14:15 DATA STEP PAGE 2
Correlation Matrix
1 Y Var 2 0.61769E-01
1 2 RP Var 3 -0.94664 0.17885
1 2 3 LOG10CT Var 4 0.99744 0.93936E-01 -0.94836
1 2 3 4 LOG10Y Var 5 0.66213E-01 0.99973 0.17511 0.97862E-01
1 2 3 4 5 LOG10RP Var 6 -0.93820 0.22599 0.99750 -0.93596 0.22212
1 2 3 4 5 6 CONSTANT Var 7 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
*************** Problem Number 4 Subproblem Number 1
F to enter 0.99999998E-02 F to remove 0.49999999E-02 Tolerance 0.10000000E-04 Maximum no of steps 3
Dependent variable X( 1). Variable Name CT
Standard Error of Y = 23.577332 for degrees of freedom = 16.
............. Step Number 3 Analysis of Variance for reduction in SS due to variable entering Variable Entering 2 Source DF SS MS F F Sig. Multiple R 0.975337 Due Regression 2 8460.9 4230.5 136.68 1.000000 Std Error of Y.X 5.56336 Dev. from Reg. 14 433.31 30.951 R Square 0.951282 Total 16 8894.2 555.89
Multiple Regression Equation Variable Coefficient Std. Error T Val. T Sig. P. Cor. Elasticity Partial Cor. for Var. not in equation CT = Variable Coefficient F for selection Y X- 2 1.061710 0.2666740 3.981 0.99863 0.7287 0.8129 RP X- 3 -1.382985 0.8381426E-01 -16.50 1.00000 -0.9752 -0.7846 CONSTANT X- 7 130.7066 27.09429 4.824 0.99973
Adjusted R Square 0.944321908495049 -2 * ln(Maximum of Likelihood Function) 103.294108058298 Akaike Information Criterion (AIC) 111.294108058298 Scwartz Information Criterion (SIC) 114.626961434523 Akaike (1970) Finite Prediction Error 36.4128553394184 Generalized Cross Validation 37.5832685467569 Hannan & Quinn (1979) HQ 36.8112662258895 Shibata (1981) 34.4851159390962 Rice (1984) 39.3920889580981 Residual Variance 30.9509270385056
Order of entrance (or deletion) of the variables = 7 3 2
Estimate of computational error in coefficients = 1 -0.1889E-13 2 -0.2396E-14 3 0.7430E-11
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 2 Y 0.71115004E-01
Row 2 Variable X- 3 RP -0.39974169E-02 0.70248306E-02
Row 3 Variable X- 7 CONSTANT -7.0185405 -0.12441382 734.10069
Program terminated. All variables put in.
Residual Statistics for... Original Data
Von Neumann Ratio 1 ... 2.14471 Durbin-Watson TEST..... 2.01855 Von Neumann Ratio 2 ... 2.14471
For D. F. 14 t(.9999)= 5.3624, t(.999)= 4.1403, t(.99)= 2.9768, t(.95)= 2.1448, t(.90)= 1.7613, t(.80)= 1.3450
Skewness test (Alpha 3) = -.232914E-01, Peakedness test (Alpha 4)= 1.37826
Normality Test -- Extended grid cell size = 1.70 t Stat Infin 1.761 1.345 1.076 0.868 0.692 0.537 0.393 0.258 0.128 Cell No. 0 2 2 4 2 0 2 2 1 2 Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100 Act Per 1.000 1.000 0.882 0.765 0.529 0.412 0.412 0.294 0.176 0.118
Normality Test -- Small sample grid cell size = 3.40 Cell No. 2 6 2 4 3 Interval 1.000 0.800 0.600 0.400 0.200 Act Per 1.000 0.882 0.529 0.412 0.176
Extended grid normality test - Prob of rejecting normality assumption Chi= 7.118 Chi Prob= 0.4760 F(8, 14)= 0.889706 F Prob =0.450879
Small sample normality test - Large grid Chi= 3.294 Chi Prob= 0.6515 F(3, 14)= 1.09804 F Prob =0.617396
Autocorrelation function of residuals
1) -0.1546 2) -0.2529 3) 0.2272 4) -0.3925
F( 6, 6) = 0.3219 1/F = 3.106 Heteroskedasticity at 0.9032 level
Sum of squared residuals 433.3 Mean squared residual 25.49
Gen. Least Squares ended by satisfying tolerance.
*************** Problem Number 4 Subproblem Number 2
F to enter 0.99999998E-02
F to remove 0.49999999E-02 Tolerance 0.10000000E-04 Maximum no of steps 3 Dependent variable X( 4). Variable Name LOG10CT
Standard Error of Y = 0.79113140E-01 for degrees of freedom = 16.
............. Step Number 3 Analysis of Variance for reduction in SS due to variable entering Variable Entering 5 Source DF SS MS F F Sig. Multiple R 0.987097 Due Regression 2 0.97575E-01 0.48787E-01 266.02 1.000000 Std Error of Y.X 0.135425E-01 Dev. from Reg. 14 0.25676E-02 0.18340E-03 R Square 0.974361 Total 16 0.10014 0.62589E-02
Multiple Regression Equation Variable Coefficient Std. Error T Val. T Sig. P. Cor. Elasticity Partial Cor. for Var. not in equation LOG10CT = Variable Coefficient F for selection LOG10Y X- 5 1.143156 0.1560002 7.328 1.00000 0.8906 1.084 LOG10RP X- 6 -0.8288375 0.3611136E-01 -22.95 1.00000 -0.9870 -0.7314 CONSTANT X- 7 1.373914 0.3060903 4.489 0.99949
Adjusted R Square 0.970697895872232 -2 * ln(Maximum of Likelihood Function) -101.322167384484 Akaike Information Criterion (AIC) -93.3221673844844 Scwartz Information Criterion (SIC) -89.9893140082595 Akaike (1970) Finite Prediction Error 0.215763077505479D-003 Generalized Cross Validation 0.222698319282440D-003 Hannan & Quinn (1979) HQ 0.218123847024249D-003 Shibata (1981) 0.204340326343424D-003 Rice (1984) 0.233416420210472D-003 Residual Variance 0.183398615879657D-003
Order of entrance (or deletion) of the variables = 7 6 5
Estimate of computational error in coefficients = 1 0.5793E-11 2 0.2356E-12 3 0.2547E-11
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 5 LOG10Y 0.24336056E-01
Row 2 Variable X- 6 LOG10RP -0.12513115E-02 0.13040301E-02
Row 3 Variable X- 7 CONSTANT -0.46626424E-01 0.76017246E-04 0.93691270E-01
Program terminated. All variables put in.
Residual Statistics for... Original Data
Von Neumann Ratio 1 ... 2.04710 Durbin-Watson TEST..... 1.92669 Von Neumann Ratio 2 ... 2.04710
For D. F. 14 t(.9999)= 5.3624, t(.999)= 4.1403, t(.99)= 2.9768, t(.95)= 2.1448, t(.90)= 1.7613, t(.80)= 1.3450
Skewness test (Alpha 3) = -.159503 , Peakedness test (Alpha 4)= 1.44345
Normality Test -- Extended grid cell size = 1.70 t Stat Infin 1.761 1.345 1.076 0.868 0.692 0.537 0.393 0.258 0.128 Cell No. 1 1 1 5 1 1 3 1 2 1 Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100 Act Per 1.000 0.941 0.882 0.824 0.529 0.471 0.412 0.235 0.176 0.059
Normality Test -- Small sample grid cell size = 3.40 Cell No. 2 6 2 4 3 Interval 1.000 0.800 0.600 0.400 0.200 Act Per 1.000 0.882 0.529 0.412 0.176
Extended grid normality test - Prob of rejecting normality assumption Chi= 9.471 Chi Prob= 0.6958 F(8, 14)= 1.18382 F Prob =0.626481
Small sample normality test - Large grid Chi= 3.294 Chi Prob= 0.6515 F(3, 14)= 1.09804 F Prob =0.617396
Autocorrelation function of residuals
1) -0.0990 2) -0.1061 3) 0.0862 4) -0.3157
F( 6, 6) = 0.5544 1/F = 1.804 Heteroskedasticity at 0.7544 level
Sum of squared residuals 0.2568E-02 Mean squared residual 0.1510E-03
Gen. Least Squares ended by satisfying tolerance.
We first show plots of the data.
[Two-panel line plot. Top panel, "Linear Theil Data": CT, RP, and Y plotted against observation number, vertical scale 60 to 160. Bottom panel, "Log Theil Data": LOG10CT, LOG10RP, and LOG10Y plotted against observation number, vertical scale 1.75 to 2.20.]
Figure 6.1 2 D Plots of Textile Data
Two dimensional plots of this dataset do not capture the full relationships. From the plots in Figure 6.1 it appears that the consumption of textiles increases when the relative price of textiles falls and that Y has little effect. Figure 6.2, which is based on a three dimensional extrapolation about each point, gives a better picture of the true relationship. This figure clearly shows that LOG10RP has the most effect on LOG10CT, which is on the Z axis, but that LOG10Y does have a positive effect. The OLS regression model attempts to capture this surface.
Remark: A 2-D plot may lead one to drop a variable that is in fact significant in a multi-dimensional context. A 3-D plot can help in cases where K=3, but may be less useful for larger problems.
The plots of CT against RP and LOG10CT against LOG10RP suggest a negative relationship, which is consistent with the economic theory that quantity demanded of a good will increase as its relative price falls. The correlations between these two sets of variables are negative (-.94664 and -.93596) and highly significant (at the .0001 level for both correlations). The plot between CT and Y and the plot between LOG10CT and LOG10Y do not show much of a relationship. The raw correlations are small (.06177 and .09786, respectively) and not significant. The preliminary finding might be that Y was not a good variable to use on the right-hand side of a model predicting CT. It will be shown later that such a conclusion would be wrong.
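The "small raw correlation, significant partial effect" point can be shown numerically. The sketch below (plain Python; the helper names corr and ols are invented, and the income series is taken as in the SAS cards) confirms that the raw correlation between CT and Y is tiny while the Y coefficient in the trivariate regression is large and positive:

```python
import math

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((a[i] - ma) * (b[i] - mb) for i in range(n))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((x - mb) ** 2 for x in b))
    return num / den

def ols(X, y):
    """OLS via the normal equations, solved by Gaussian elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(X[t][i] * y[t] for t in range(len(X))) for i in range(k)]
    M = [A[i][:] + [b[i]] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (M[r][k] - sum(M[r][j] * x[j] for j in range(r + 1, k))) / M[r][r]
    return x

ct = [99.2, 99, 100, 111.6, 122.2, 117.6, 121.1, 136, 154.2,
      153.6, 158.5, 140.6, 136.2, 168, 154.3, 149, 165.5]
inc = [96.7, 98.1, 100, 104.9, 104.9, 109.5, 110.8, 112.3, 109.3,
       105.3, 101.7, 95.4, 96.4, 97.6, 102.4, 101.6, 103.8]
rp = [101, 100.1, 100, 90.6, 86.5, 89.7, 90.6, 82.8, 70.1,
      65.4, 61.3, 62.5, 63.6, 52.6, 59.7, 59.5, 61.3]

r_ct_y = corr(ct, inc)  # near zero: bivariate analysis would discard Y
b_y = ols([[1.0, inc[i], rp[i]] for i in range(17)], ct)[1]  # but Y matters given RP
```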
[Three dimensional surface plot, "Log Theil Textile Data": log10ct on the Z axis plotted against log10y and log10rp.]
Figure 6.2 3-D Plot of Theil (1971) Textile Data
Remark: The preliminary estimation of a model CT = f(constant, Y, RP) indicates that the coefficients on Y and RP are 1.0617 (t = 3.98) and -1.383 (t = -16.5), respectively. The results support the maintained hypothesis that CT is positively related to income and negatively related to relative price. The Y variable, which was not significantly correlated with CT, was found to be significant when included in a regression controlling for RP. This demonstrates that it is important to go beyond raw cross-correlation analysis. If proposed variables are "prescreened out" by correlation analysis and never tried in regression models, many important variables may be incorrectly dropped from the analysis. It is important not to prematurely drop a proposed, theoretically plausible variable from a regression model specification, even if in preliminary specifications it does not enter significantly. Later an example will be presented to illustrate that, if other important variables are omitted from an equation, a variable that truly belongs in the equation may not show up as significant (omitted variable bias). The preceding discussion suggests that regression analysis requires careful use of diagnostic tests before the results are used in a production environment.
A possible problem with the above formulation is that the error process might exhibit heteroskedasticity, or nonconstant variance, because the time series values for CT are
increasing over time. If all the variables in the model are transformed into logs (to the base 10), some of the potential for difficulty may be avoided. If heteroskedasticity were to be present, the estimated standard errors of the coefficients would be biased. In addition, the estimated standard error of the
model, from equation (6-4), would be misleading, since it would be an average, and, assuming the variance of the error was increasing, would overstate the error at the beginning of the data set and understate the error at the end of the data set.
Log transforms to the base 10 are made and the model is estimated again and reported in the bottom equation of (6-4). The results indicate the log linear form of the model fits better (the adjusted R² is now .9707) and all coefficients, except for the constant, are more significant. Comparison of the estimated values with the actual values shows surprisingly good results, considering there are only two explanatory variables in the model.
One of the assumptions of an OLS regression is that the error process follows a random normal distribution with no serial correlation or heteroskedasticity (nonconstant variance). If the error process is normal but serially correlated or heteroskedastic, the estimated coefficients will still be unbiased, but the standard errors of the estimated coefficients will be biased.
Another important assumption of OLS is that the error terms are not related. If e_t is the error term of the estimated model and u_t is a random error, and the model

e_t = ρ_1 e_{t-1} + ρ_2 e_{t-2} + ... + ρ_K e_{t-K} + u_t            (6-5)

is estimated, no autocorrelation up to order K implies that the ρ_i for i = 1, ..., K are not significant. First-order serial correlation can be tested by the Durbin-Watson test statistic. If the Durbin-Watson statistic is around 2.0, there is no problem. If it is substantially below (above) 2.0 there is positive (negative) autocorrelation. This can be seen from the formula for the Durbin-Watson statistic,

DW = [ Σ_{t=2..T} (e_t - e_{t-1})² ] / [ Σ_{t=1..T} e_t² ]  ≈  2(1 - ρ_1)    (6-6)
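The approximation in (6-6) can be checked numerically. In the sketch below the residual series is invented purely for illustration; the Durbin-Watson statistic and the first-order autocorrelation are computed directly from their definitions:

```python
# Illustrative residual series (numbers invented for the demonstration)
e = [0.5, 0.3, 0.4, -0.2, -0.4, -0.1, 0.2, 0.3, -0.3, 0.1]

sse = sum(x * x for x in e)
# Durbin-Watson: sum of squared first differences over the sum of squares
dw = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e))) / sse
# First-order autocorrelation of the residuals
rho1 = sum(e[t] * e[t - 1] for t in range(1, len(e))) / sse
```

Expanding the squared difference shows DW = 2 - 2ρ_1 up to end-effect terms, which is why DW near 2 signals no first-order serial correlation.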
If serial correlation is found, the appropriate procedure is generalized least squares, which involves a transformation of the data. If heteroskedasticity is found, there are other procedures that can be used to remove the problem. To illustrate GLS, assume
y_t = β_1 + β_2 x_t + e_t                                            (6-7)
where t refers to the time period of the observation. If e_t = ρ e_{t-1} + u_t and model (6-5) is estimated for the residuals of (6-7) and ρ is significant, the appropriate procedure is to lag the original equation, multiply through by ρ, and subtract the result from the original equation. This would give

y_t - ρ y_{t-1} = β_1(1 - ρ) + β_2 (x_t - ρ x_{t-1}) + u_t           (6-9)

which will give unbiased estimates of β_1 and β_2 and their standard errors, since from (6-5) u_t = e_t - ρ e_{t-1}, and by assumption u_t does not contain serial correlation.
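The transformation in (6-9) can be illustrated with a contrived example (all numbers invented): take y_t = 2 + 3 x_t + e_t where the error is exactly AR(1) with ρ = .8 and no further shocks after the first period. Then ρ-differencing removes the error term completely, and simple OLS on the transformed data recovers β_2 = 3 and the transformed intercept β_1(1 - ρ) = 0.4:

```python
# Contrived data: exact AR(1) error, rho = 0.8, u_t = 0 after t = 1
rho = 0.8
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
e = [1.0]
for _ in range(len(x) - 1):
    e.append(rho * e[-1])            # e_t = rho * e_{t-1}
y = [2.0 + 3.0 * x[t] + e[t] for t in range(len(x))]

# rho-difference both sides, as in (6-9)
ys = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
xs = [x[t] - rho * x[t - 1] for t in range(1, len(x))]

# Simple OLS of y* on x* recovers beta2 and beta1*(1 - rho)
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((xs[t] - mx) * (ys[t] - my) for t in range(n)) / sum((v - mx) ** 2 for v in xs)
intercept = my - slope * mx
```

In practice ρ is unknown and must itself be estimated from the OLS residuals, which is exactly what the two-pass B34S procedure and the SAS ML estimator below do.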
As a test, a misspecified model (to induce serial correlation) containing only LOG10Y is run in both B34S and SAS. This model finds LOG10Y not significant and shows evidence of serial correlation, as measured by the low Durbin-Watson test statistic (.241). In the presence of serial correlation, the best course of action is to attempt to add new variables to explain the serial correlation. The B34S reg command output is shown first; the SAS autoreg command output follows.
REG Command. Version 1 February 1997
Real*8 space available 9000000 Real*8 space used 43
OLS Estimation Dependent variable LOG10CT Adjusted R**2 -5.645122564288263E-02 Standard Error of Estimate 8.131550225838252E-02 Sum of Squared Residuals 9.918316361299519E-02 Model Sum of Squares 9.590596648930139E-04 Total Sum of Squares 0.1001422232778882 F( 1, 15) 0.1450437196128150 F Significance 0.2913428904662156 1/Condition of XPX 1.523468705487359E-05 Number of Observations 17 Durbin-Watson 0.2414802718079813
Variable Coefficient Std. Error t LOG10Y { 0} 0.34782649 0.91329943 0.38084606 CONSTANT { 0} 1.4222303 1.8378693 0.77384737
SAS output next: The AUTOREG Procedure
Dependent Variable LOG10CT LOG10 OF CONSUMPTION
Ordinary Least Squares Estimates
SSE 0.00256758 DFE 14 MSE 0.0001834 Root MSE 0.01354 SBC -92.822527 AIC -95.322167 Regress R-Square 0.9744 Total R-Square 0.9744 Durbin-Watson 1.9267
Standard Approx
Variable DF Estimate Error t Value Pr > |t| Variable Label
Intercept 1 1.3739 0.3061 4.49 0.0005 LOG10Y 1 1.1432 0.1560 7.33 <.0001 LOG10 OF INCOME LOG10RP 1 -0.8288 0.0361 -22.95 <.0001 LOG10 OF RELATIVE PRICE
Estimates of Autocorrelations
Lag Covariance Correlation -1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1
0 0.000151 1.000000 | |********************| 1 -0.00001 -0.093221 | **| |
Preliminary MSE 0.000150
Estimates of Autoregressive Parameters
Standard Lag Coefficient Error t Value
1 0.093221 0.276142 0.34
Algorithm converged.
The SAS System 10:37 Wednesday, December 6, 2006 4
The AUTOREG Procedure
Maximum Likelihood Estimates
SSE 0.0025352 DFE 13 MSE 0.0001950 Root MSE 0.01396 SBC -90.189374 AIC -93.522227 Regress R-Square 0.9789 Total R-Square 0.9747 Durbin-Watson 1.7932
Standard Approx Variable DF Estimate Error t Value Pr > |t| Variable Label
Intercept 1 1.3592 0.2941 4.62 0.0005 LOG10Y 1 1.1487 0.1516 7.58 <.0001 LOG10 OF INCOME LOG10RP 1 -0.8271 0.0343 -24.09 <.0001 LOG10 OF RELATIVE PRICE AR1 1 0.1248 0.3186 0.39 0.7017
Autoregressive parameters assumed given.
Standard Approx Variable DF Estimate Error t Value Pr > |t| Variable Label
Intercept 1 1.3592 0.2875 4.73 0.0004 LOG10Y 1 1.1487 0.1471 7.81 <.0001 LOG10 OF INCOME LOG10RP 1 -0.8271 0.0338 -24.47 <.0001 LOG10 OF RELATIVE PRICE
The SAS System 10:37 Wednesday, December 6, 2006 5
The AUTOREG Procedure
Dependent Variable LOG10CT LOG10 OF CONSUMPTION
Ordinary Least Squares Estimates
SSE 0.09918316 DFE 15 MSE 0.00661 Root MSE 0.08132 SBC -33.537669 AIC -35.204096 Regress R-Square 0.0096 Total R-Square 0.0096 Durbin-Watson 0.2415
Standard Approx Variable DF Estimate Error t Value Pr > |t| Variable Label
Intercept 1 1.4222 1.8379 0.77 0.4510 LOG10Y 1 0.3478 0.9133 0.38 0.7087 LOG10 OF INCOME
Estimates of Autocorrelations
Lag Covariance Correlation -1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1
0 0.00583 1.000000 | |********************| 1 0.00447 0.765305 | |*************** |
Preliminary MSE 0.00242
Estimates of Autoregressive Parameters
Standard Lag Coefficient Error t Value
1 -0.765305 0.172027 -4.45
Algorithm converged.
The SAS System 10:37 Wednesday, December 6, 2006 6
The AUTOREG Procedure
Maximum Likelihood Estimates
SSE 0.02423721 DFE 14 MSE 0.00173 Root MSE 0.04161 SBC -53.034484 AIC -55.534124 Regress R-Square 0.0564 Total R-Square 0.7580 Durbin-Watson 1.6157
Standard Approx Variable DF Estimate Error t Value Pr > |t| Variable Label
Intercept 1 0.6643 1.6320 0.41 0.6901 LOG10Y 1 0.7229 0.8167 0.89 0.3910 LOG10 OF INCOME AR1 1 -0.8961 0.1312 -6.83 <.0001
Autoregressive parameters assumed given.
Standard Approx Variable DF Estimate Error t Value Pr > |t| Variable Label
Intercept 1 0.6643 1.5885 0.42 0.6821 LOG10Y 1 0.7229 0.7910 0.91 0.3762 LOG10 OF INCOME
The SAS output first shows the complete model, where the DW was 1.9267, indicating GLS was
not needed. If GLS is nevertheless run, the AR(1) parameter is estimated as .1248 and the DW falls to 1.79. For the misspecified equation the DW was .2415 before GLS and 1.6157 after GLS, with the AR(1) parameter estimated as -.8961. For B34S, which uses the two-pass method to do GLS, the results were:
Problem Number 1 Subproblem Number 3 F to enter 9.999999776482582E-03 F to remove 4.999999888241291E-03 Tolerance (1.-R**2) for including a variable 1.000000000000000E-05 Maximum Number of Variables Allowed 2 Internal Number of dependent variable 4 Dependent Variable LOG10CT Standard Error of Y 7.911314021618683E-02 Degrees of Freedom 16
............. Step Number 2 Analysis of Variance for reduction in SS due to variable entering Variable Entering 5 Source DF SS MS F F Sig. Multiple R 0.978620E-01 Due Regression 1 0.95906E-03 0.95906E-03 0.14504 0.291343 Std Error of Y.X 0.813155E-01 Dev. from Reg. 15 0.99183E-01 0.66122E-02 R Square 0.957698E-02 Total 16 0.10014 0.62589E-02
Multiple Regression Equation Variable Coefficient Std. Error T Val. T Sig. P. Cor. Elasticity Partial Cor. for Var. not in equation LOG10CT = Variable Coefficient F for selection LOG10Y X- 5 0.3478265 0.9132994 0.3808 0.29134 0.0979 0.3298 CONSTANT X- 7 1.422230 1.837869 0.7738 0.54895
Adjusted R Square -5.645122564294430E-02 -2 * ln(Maximum of Likelihood Function) -39.20409573256165 Akaike Information Criterion (AIC) -33.20409573256165 Scwartz Information Criterion (SIC) -30.70445570039300 Akaike (1970) Finite Prediction Error 7.390118073123232E-03 Generalized Cross Validation 7.493839028535489E-03 Hannan & Quinn (1979) HQ 7.454314110419272E-03 Shibata (1981) 7.207081092983957E-03 Rice (1984) 7.629474124074592E-03 Residual Variance 6.612210907531313E-03
Order of entrance (or deletion) of the variables = 7 5 Estimate of Computational Error in Coefficients
1 2 0.00000 0.00000
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 5 LOG10Y 0.83411584
Row 2 Variable X- 7 CONSTANT -1.6784283 3.3777634
Program terminated. All variables put in.
Residual Statistics for Original data
Von Neumann Ratio 1 ... 0.25657 Durbin-Watson TEST..... 0.24148 Von Neumann Ratio 2 ... 0.25657
For D. F. 15 t(.9999)= 5.2391, t(.999)= 4.0728, t(.99)= 2.9467, t(.95)= 2.1314, t(.90)= 1.7531, t(.80)= 1.3406
Skewness test (Alpha 3) = -.233040 , Peakedness test (Alpha 4)= 1.30008
Normality Test -- Extended grid cell size = 1.70 t Stat Infin 1.753 1.341 1.074 0.866 0.691 0.536 0.393 0.258 0.128 Cell No. 0 4 1 2 4 2 2 1 0 1 Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100 Act Per 1.000 1.000 0.765 0.706 0.588 0.353 0.235 0.118 0.059 0.059
Normality Test -- Small sample grid cell size = 3.40 Cell No. 4 3 6 3 1 Interval 1.000 0.800 0.600 0.400 0.200 Act Per 1.000 0.765 0.588 0.235 0.059
Extended grid normality test - Prob of rejecting normality assumption Chi= 10.65 Chi Prob= 0.7775 F(8, 15)= 1.33088 F Prob =0.698738
Small sample normality test - Large grid Chi= 3.882 Chi Prob= 0.7255 F(3, 15)= 1.29412 F Prob =0.687249
Autocorrelation function of residuals
1 2 3 4 0.813137 0.658160 0.545551 0.332369
F( 6, 6) = 1.730 1/F = 0.5781 Heteroskedasticity at 0.7389 level
Sum of squared residuals 9.918316361299512E-02 Mean squared residual 5.834303741940890E-03
B34S 8.10Z (D:M:Y) 6/12/06 (H:M:S) 11: 8:24 REGRESSION STEP PAGE 10
Doing Gen. Least Squares using residual Dif. Eq. of order 1 Lag Coefficients
1 0.842413
Standard Error of Y 0.2288578856184614 Degrees of Freedom 15
............. Step Number 2 Analysis of Variance for reduction in SS due to variable entering Variable Entering 5 Source DF SS MS F F Sig. Multiple R 0.105883 Due Regression 1 0.88080E-02 0.88080E-02 0.15874 0.303669 Std Error of Y.X 0.235559 Dev. from Reg. 14 0.77683 0.55488E-01 R Square 0.112113E-01 Total 15 0.78564 0.52376E-01
Multiple Regression Equation Variable Coefficient Std. Error T Val. T Sig. P. Cor. Elasticity Partial Cor. for Var. not in equation LOG10CT = Variable Coefficient F for selection LOG10Y X- 5 0.2959532 0.7428190 0.3984 0.30367 0.1059 0.2806 CONSTANT X- 7 1.605192 1.504752 1.067 0.69586
Adjusted R Square -5.941647786252678E-02 -2 * ln(Maximum of Likelihood Function) -2.995906752503885 Akaike Information Criterion (AIC) 3.004093247496115 Scwartz Information Criterion (SIC) 5.321859414215458 Akaike (1970) Finite Prediction Error 6.242391585298818E-02 Generalized Cross Validation 6.341477166017846E-02 Hannan & Quinn (1979) HQ 6.265098482400155E-02 Shibata (1981) 6.068991819040517E-02 Rice (1984) 6.473591273643219E-02 Residual Variance 5.548792520265616E-02
Order of entrance (or deletion) of the variables = 7 5
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 5 LOG10Y 0.55178009
Row 2 Variable X- 7 CONSTANT -1.1169023 2.2642794
Program terminated. All variables put in.
Residual Statistics for Smoothed Original data
For GLS Y and Y estimate scaled by 0.1575867198767030
Von Neumann Ratio 1 ... 2.30745 Durbin-Watson TEST..... 2.16324 Von Neumann Ratio 2 ... 2.30745
For D. F. 14 t(.9999)= 5.3634, t(.999)= 4.1405, t(.99)= 2.9768, t(.95)= 2.1448, t(.90)= 1.7613, t(.80)= 1.3450
Skewness test (Alpha 3) = 0.512095 , Peakedness test (Alpha 4)= 1.97237
Normality Test -- Extended grid cell size = 1.60 t Stat Infin 1.761 1.345 1.076 0.868 0.692 0.537 0.393 0.258 0.128 Cell No. 1 0 3 3 2 1 3 2 0 1 Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100 Act Per 1.000 0.938 0.938 0.750 0.562 0.438 0.375 0.188 0.062 0.062
Normality Test -- Small sample grid cell size = 3.20 Cell No. 1 6 3 5 1 Interval 1.000 0.800 0.600 0.400 0.200 Act Per 1.000 0.938 0.562 0.375 0.062
Extended grid normality test - Prob of rejecting normality assumption Chi= 7.750 Chi Prob= 0.5417 F(8, 14)= 0.968750 F Prob =0.503304
Small sample normality test - Large grid Chi= 6.500 Chi Prob= 0.9103 F(3, 14)= 2.16667 F Prob =0.862453
Autocorrelation function of residuals
1 2 3 4 -0.159923 -0.479294 0.253850 -0.348167E-01
F( 5, 5) = 0.3909 1/F = 2.558 Heteroskedasticity at 0.8371 level
Sum of squared residuals 0.7768309528371908 Mean squared residual 4.855193455232443E-02
Gen. Least Squares ended by satisfying tolerance.
Here the estimated ρ was .8424, and after GLS the DW was 2.16, which is higher than found with SAS. Note the change in sign of the AR(1) parameter in the SAS output: although there is positive serial correlation (the first-order autocorrelation of the residuals was .7653), SAS reports an AR1 estimate of -.8961 because PROC AUTOREG parameterizes the autoregressive term with the opposite sign. The coefficient on the insignificant LOG10Y term is now found to be .2959, in place of the .7229 found with SAS, but very close to the OLS value of .3478.
Remarks: What can we conclude from the preceding results? Serial correlation was not the reason that LOG10Y was not significant (as measured by the low t value) in the OLS equation containing just LOG10Y on the right-hand side. In this equation, LOG10Y was not significant because of omitted variable bias. The B34S two-pass GLS procedure was able to remove more serial correlation than the SAS ML approach. We found that LOG10Y is a significant variable in a properly specified equation. This problem illustrates how it would be a mistake to remove LOG10Y from consideration as a potentially important variable just because it does not enter significantly into a serial correlation-free equation that does not contain all the appropriate variables on the right.
An example having different problems is illustrated by a dataset from the engineering literature (from Brownlee [1965] Statistical Theory and Methodology, page 454) that is presented in Table Three. Here the maintained hypothesis is that the stack loss of ingoing ammonia (Y) is related to the operation of a factory that converts ammonia to nitric acid by the process of oxidation. There are data on three variables for 21 days of plant operation: X1 = air flow, X2 = cooling water inlet temperature, X3 = acid concentration, and Y = stack loss of ammonia.

Table Three
Brownlee Engineering Stack Loss Data
Obs   X1   X2   X3    Y
  1   80   27   89   42
  2   80   27   88   37
  3   75   25   90   37
  4   62   24   87   28
  5   62   22   87   18
  6   62   23   87   18
  7   62   24   93   19
  8   62   24   93   20
  9   58   23   87   15
 10   58   18   80   14
 11   58   18   89   14
 12   58   17   88   13
 13   58   18   82   11
 14   58   19   93   12
 15   50   18   89    8
 16   50   18   86    7
 17   50   19   72    8
 18   50   19   79    8
 19   50   20   80    9
 20   56   20   82   15
 21   70   20   91   15
X1 = air flow.
X2 = cooling water inlet temperature.
X3 = acid concentration.
Y  = stack loss of ammonia.
The following B34S commands will load the data and perform the required analysis.
/$ Sample Data # 3
/$ Data from Brownlee (1965) page 454
b34sexec data corr$
INPUT X1 X2 X3 Y$
LABEL X1 = 'AIR FLOW'$
LABEL X2 = 'COOLING WATER INLET TEMPERATURE'$
LABEL X3 = 'ACID CONCENTRATION'$
LABEL Y  = 'STACK LOSS' $
DATACARDS$
80 27 89 42
80 27 88 37
75 25 90 37
62 24 87 28
62 22 87 18
62 23 87 18
62 24 93 19
62 24 93 20
58 23 87 15
58 18 80 14
58 18 89 14
58 17 88 13
58 18 82 11
58 19 93 12
50 18 89 8
50 18 86 7
50 19 72 8
50 19 79 8
50 20 80 9
56 20 82 15
70 20 91 15
b34sreturn$
b34seend$
b34sexec regression maxgls=2 residuala$
 MODEL Y = X1 X2 X3 $
b34seend$
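As a cross-check on the B34S run (a plain-Python sketch via the normal equations, not B34S itself; the helper name ols is invented), fitting Y on a constant, X1, X2, and X3 should reproduce the coefficients reported in the output below:

```python
def ols(X, y):
    """OLS via the normal equations, solved by Gaussian elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(X[t][i] * y[t] for t in range(len(X))) for i in range(k)]
    M = [A[i][:] + [b[i]] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    out = [0.0] * k
    for r in range(k - 1, -1, -1):
        out[r] = (M[r][k] - sum(M[r][j] * out[j] for j in range(r + 1, k))) / M[r][r]
    return out

# Brownlee (1965) stack loss data, 21 days of plant operation
x1 = [80, 80, 75, 62, 62, 62, 62, 62, 58, 58, 58, 58, 58, 58, 50, 50, 50, 50, 50, 56, 70]
x2 = [27, 27, 25, 24, 22, 23, 24, 24, 23, 18, 18, 17, 18, 19, 18, 18, 19, 19, 20, 20, 20]
x3 = [89, 88, 90, 87, 87, 87, 93, 93, 87, 80, 89, 88, 82, 93, 89, 86, 72, 79, 80, 82, 91]
yy = [42, 37, 37, 28, 18, 18, 19, 20, 15, 14, 14, 13, 11, 12, 8, 7, 8, 8, 9, 15, 15]

# beta = [constant, X1, X2, X3]
beta = ols([[1.0, x1[i], x2[i], x3[i]] for i in range(21)], yy)
```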
The results of the OLS model fit are reported next.

B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 21:47:54 DATA STEP PAGE 1
Variable # Label Mean Std. Dev. Variance Maximum Minimum
X1        1   AIR FLOW                          60.4286   9.16827   84.0571   80.0000   50.0000
X2        2   COOLING WATER INLET TEMPERATURE   21.0952   3.16077   9.99048   27.0000   17.0000
X3        3   ACID CONCENTRATION                86.2857   5.35857   28.7143   93.0000   72.0000
Y         4   STACK LOSS                        17.5238   10.1716   103.462   42.0000   7.00000
CONSTANT  5                                     1.00000   0.00000   0.00000   1.00000   1.00000
Data file contains 21 observations on 5 variables. Current missing value code is 0.1000000000000000E+32
B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 21:47:54 DATA STEP PAGE 2
Correlation Matrix
                    1         2         3         4
X2       Var 2    0.78185
X3       Var 3    0.50014   0.39094
Y        Var 4    0.91966   0.87550   0.39983
CONSTANT Var 5    0.0000    0.0000    0.0000    0.0000
B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 21:47:54 REGRESSION STEP PAGE 3
*************** Problem Number 1 Subproblem Number 1
F to enter            0.99999998E-02
F to remove           0.49999999E-02
Tolerance             0.10000000E-04
Maximum no of steps   4
Dependent variable    X( 4). Variable Name Y
Standard Error of Y = 10.171623 for degrees of freedom = 20.
Step Number 4     Analysis of Variance for reduction in SS due to variable entering
Variable Entering 3

                              Source           DF   SS       MS       F        F Sig.
Multiple R        0.955812    Due Regression    3   1890.4   630.14   59.902   1.000000
Std Error of Y.X  3.24336     Dev. from Reg.   17   178.83   10.519
R Square          0.913577    Total            20   2069.2   103.46
Multiple Regression Equation
Y =
Variable          Coefficient     Std. Error      T Val.    T Sig.    P. Cor.   Elasticity
X1       X- 1     0.7156402       0.1348582        5.307    0.99994    0.7897    2.468
X2       X- 2     1.295286        0.3680243        3.520    0.99737    0.6492    1.559
X3       X- 3    -0.1521225       0.1562940       -0.9733   0.65595   -0.2297   -0.7490
CONSTANT X- 5   -39.91967        11.89600         -3.356    0.99625
Adjusted R Square                         0.898325769953741
-2 * ln(Maximum of Likelihood Function)   104.575591004800
Akaike Information Criterion (AIC)        114.575591004800
Scwartz Information Criterion (SIC)       119.798203193417
Akaike (1970) Finite Prediction Error     12.5231065545072
Generalized Cross Validation              12.9945646836181
Hannan & Quinn (1979) HQ                  13.0142387900995
Shibata (1981)                            11.7597933930896
Rice (1984)                               13.7561508921817
Residual Variance                         10.5194095057860
Order of entrance (or deletion) of the variables = 1 5 2 3
Estimate of computational error in coefficients = 1 0.3461E-10 2 0.1208E-10 3 0.1959E-09 4 0.1472E-15
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.18186730E-01
Row 2 Variable X- 2 X2 -0.36510675E-01 0.13544186
Row 3 Variable X- 3 X3 -0.71435215E-02 0.10476827E-04 0.24427828E-01
Row 4 Variable X- 5 CONSTANT 0.28758711 -0.65179437 -1.6763208 141.51474
Program terminated. All variables put in.
Residual Statistics for... Original Data
Von Neumann Ratio 1 ... 1.55939 Durbin-Watson TEST..... 1.48513 Von Neumann Ratio 2 ... 1.55939
For D. F. 17 t(.9999)= 5.0433, t(.999)= 3.9650, t(.99)= 2.8982, t(.95)= 2.1098, t(.90)= 1.7396, t(.80)= 1.3334
Skewness test (Alpha 3) = -.140452 , Peakedness test (Alpha 4)= 2.03637
Normality Test -- Extended grid cell size = 2.10
t Stat   Infin 1.740 1.333 1.069 0.863 0.689 0.534 0.392 0.257 0.128
Cell No.     2     1     0     3     4     1     5     2     2     1
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.905 0.857 0.857 0.714 0.524 0.476 0.238 0.143 0.048

Normality Test -- Small sample grid cell size = 4.20
Cell No.     3     3     5     7     3
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.857 0.714 0.476 0.143
Extended grid normality test - Prob of rejecting normality assumption Chi= 9.952 Chi Prob= 0.7316 F(8, 17)= 1.24405 F Prob =0.666586
Small sample normality test - Large grid Chi= 3.048 Chi Prob= 0.6157 F(3, 17)= 1.01587 F Prob =0.589919
Autocorrelation function of residuals
1) 0.0858 2) -0.1149 3) -0.0409 4) -0.0064
F( 7, 7) = 1.336 1/F = 0.7485 Heteroskedasticity at 0.6440 level
Sum of squared residuals 178.8 Mean squared residual 8.516
Gen. Least Squares ended by satisfying tolerance.
Y = -39.91967 + .7156402*X1 + 1.295286*X2 - .1521225*X3
    (-3.356)    (5.307)       (3.520)       (-.9733)

Adjusted R-squared = .8983, Sum of squared residuals = 178.83        (6-10)
Two of the three variables (in addition to the constant) are found to be significant (significantly different from zero at the 95% level or better). The raw correlations with Y were .9197, .8755 and .3998, respectively. Of the three variables, X3 (acid concentration) was not significant at the 95% level or better, because its t statistic was less than 2 in absolute value. The variable X1 (air flow) was found to be positively related to stack loss, and the variable X2 (cooling water inlet temperature) was also found to be positively related to stack loss. In this model, 89.83% of the variance is explained by the three variables on the right. Clearly, stack loss can be lowered if X1 and X2 are lowered. The X3 variable (acid concentration) was not significant, even though the raw correlations show some relationship (correlation = .39983)1. The OLS equation was found to have a Durbin-Watson statistic of 1.4851, showing some serial correlation. First-order GLS was requested but was not executed, since the residual correlation was less than the tolerance.
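Equation (6-10) can be verified outside B34S. The following Python sketch (numpy is assumed to be available; the array names are ours, not part of the B34S run) reproduces the OLS coefficients and the residual sum of squares from the Table Three data:

```python
import numpy as np

# Brownlee (1965) stack loss data as listed in Table Three
x1 = [80,80,75,62,62,62,62,62,58,58,58,58,58,58,50,50,50,50,50,56,70]
x2 = [27,27,25,24,22,23,24,24,23,18,18,17,18,19,18,18,19,19,20,20,20]
x3 = [89,88,90,87,87,87,93,93,87,80,89,88,82,93,89,86,72,79,80,82,91]
y  = [42,37,37,28,18,18,19,20,15,14,14,13,11,12, 8, 7, 8, 8, 9,15,15]

# Design matrix with a constant term, matching the B34S MODEL statement
X = np.column_stack([x1, x2, x3, np.ones(len(y))]).astype(float)
y = np.asarray(y, float)

b, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS coefficients
resid = y - X @ b
rss = float(resid @ resid)                  # sum of squared residuals
print(b)      # approx [0.7156, 1.2953, -0.1521, -39.92]
print(rss)    # approx 178.83
```

The least squares solution agrees with the B34S output to the printed precision.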
Remark: A nearly significant raw correlation is no assurance that the variable will be significant in a more fully specified model. From an economist's point of view, the results reported in the above paragraph suggest that the tradeoffs of a lower air flow and lower cooling water inlet temperature
1 Depending on whether the large or small sample SE is used the value is
must be weighed against absorption technology changes that would lower the constant. While engineering considerations are clearly paramount in the decision process, the regression results, which can be readily obtained with a modern PC, can help summarize the data and highlight the relationships between the variables of interest. Of course, it is important to select the appropriate data to use in the study. If data on key variables are omitted, the results of the study could be called into question. However, the problem may not be as bad as it seems. If an important variable was inadvertently omitted, its effect should be visible in the error process unless it was random. In the last analysis, the value of a model lies in how well it works. Inspection of the results is a key aspect of the validation of a model.
Table Four shows a dataset (taken from Brownlee [1965], op. cit., page 463). Here the number of deaths from heart disease per 100,000 males aged 55-59 years in a number of countries is related to the number of telephones per head (X1) (presumably a measure of stress and/or income), the percentage of total calories from fat (X2), and the percentage of total calories from animal protein (X3).
Table Four
Brownlee Health Data
Obs   X1  X2  X3   Y
  1  124  33   8  81
  2   49  31   6  55
  3  181  38   8  80
  4    4  17   2  24
  5   22  20   4  78
  6  152  39   6  52
  7   75  30   7  88
  8   54  29   7  45
  9   43  35   6  50
 10   41  31   5  69
 11   17  23   4  66
 12   22  21   3  45
 13   16   8   3  24
 14   10  23   3  43
 15   63  37   6  38
 16  170  40   8  72
 17  125  38   6  41
 18   15  25   4  38
 19  221  39   7  52
 20  171  33   7  52
 21   97  38   6  66
 22  254  39   8  89
X1 = 1000 * telephones per head.
X2 = fat calories as a % of total calories.
X3 = animal protein as a % of total calories.
Y = 100 * log number of deaths per 1000 males 55-59 years.
The B34S commands to analyze this data are:
/$ Sample Data # 4
/$ From Brownlee (1965) page 463
b34sexec data corr$
INPUT X1 X2 X3 Y$
LABEL X1 = '1000 * TELEPHONES PER HEAD'$
LABEL X2 = ' FAT CALORIES AS A % OF TOTAL CALORIES'$
LABEL X3 = 'ANIMAL PROTEIN AS A % TO TOTAL CALORIES'$
LABEL Y = '100 * LOG # DEATHS PER 1GMALES 55-59'$
DATACARDS$
124 33 8 81
 49 31 6 55
181 38 8 80
  4 17 2 24
 22 20 4 78
152 39 6 52
 75 30 7 88
 54 29 7 45
 43 35 6 50
 41 31 5 69
 17 23 4 66
 22 21 3 45
 16  8 3 24
 10 23 3 43
 63 37 6 38
170 40 8 72
125 38 6 41
 15 25 4 38
221 39 7 52
171 33 7 52
 97 38 6 66
254 39 8 89
b34sreturn$
b34seend$
b34sexec regression residuala$
MODEL Y = X1 X2 X3 $
b34seend$
The raw correlation results show X1, X2 and X3 positively related to Y, with correlations of .46875, .44628 and .62110, respectively. The variable X3 appears to be the most important.
Y = 23.9306 - .0067849*X1 - .478240*X2 + 8.496616*X3
    (1.499)   (-.0833)      (-.6315)     (2.21)

Adjusted R-squared = .3017, Sum of squared residuals = 4686        (6-11)
The OLS results indicate that only 30.17% of the variance can be explained and that the animal protein variable (X3) is the only significant variable. Clearly, this finding is interesting, but the large unexplained component suggests that more data need to be collected to improve the model. It may well be the case that the animal protein variable (X3) is related to other unspecified variables and
interpreting it without qualification would be dangerous. This will have to be investigated in future research if more data is available.
Remark: This dataset shows a case where correlation analysis suggested a result that did not stand up in a multiple regression model. This is in contrast to the Theil dataset, where correlation analysis did not suggest a relationship that was found only with a regression model.
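As a cross-check on (6-11), the Table Four regression and the raw correlations can be reproduced with a short numpy sketch (illustrative only; the variable names are ours):

```python
import numpy as np

# Brownlee (1965) health data from Table Four
x1 = [124,49,181,4,22,152,75,54,43,41,17,22,16,10,63,170,125,15,221,171,97,254]
x2 = [33,31,38,17,20,39,30,29,35,31,23,21,8,23,37,40,38,25,39,33,38,39]
x3 = [8,6,8,2,4,6,7,7,6,5,4,3,3,3,6,8,6,4,7,7,6,8]
y  = [81,55,80,24,78,52,88,45,50,69,66,45,24,43,38,72,41,38,52,52,66,89]

X = np.column_stack([x1, x2, x3, np.ones(len(y))]).astype(float)
y = np.asarray(y, float)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS with a constant
rss = float(np.sum((y - X @ coef) ** 2))
print(coef)                          # approx [-0.00678, -0.478, 8.497, 23.93]
print(round(rss, 1))                 # approx 4685.5
print(np.corrcoef(x3, y)[0, 1])      # approx 0.6211: X3 has the largest raw correlation
```

Only X3 survives in the multiple regression even though all three raw correlations are positive, which is the point of the remark above.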
B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 21:58:58 DATA STEP PAGE 1

Variable # Label Mean Std. Dev. Variance Maximum Minimum

X1        1   1000 * TELEPHONES PER HEAD                 87.5455   75.4212   5688.35   254.000   4.00000
X2        2   FAT CALORIES AS A % OF TOTAL CALORIES      30.3182   8.68708   75.4654   40.0000   8.00000
X3        3   ANIMAL PROTEIN AS A % TO TOTAL CALORIES    5.63636   1.86562   3.48052   8.00000   2.00000
Y         4   100 * LOG # DEATHS PER 1GMALES 55-59       56.7273   19.3075   372.779   89.0000   24.0000
CONSTANT  5                                              1.00000   0.00000   0.00000   1.00000   1.00000
Data file contains 22 observations on 5 variables. Current missing value code is 0.1000000000000000E+32
B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 21:58:58 DATA STEP PAGE 2
Correlation Matrix
                    1         2         3         4
X2       Var 2    0.75915
X3       Var 3    0.80220   0.83018
Y        Var 4    0.46875   0.44628   0.62110
CONSTANT Var 5    0.0000    0.0000    0.0000    0.0000
B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 21:58:58 REGRESSION STEP PAGE 3
*************** Problem Number 3 Subproblem Number 1
F to enter            0.99999998E-02
F to remove           0.49999999E-02
Tolerance             0.10000000E-04
Maximum no of steps   4
Dependent variable    X( 4). Variable Name Y
Standard Error of Y = 19.307491 for degrees of freedom = 21.
Step Number 4     Analysis of Variance for reduction in SS due to variable entering
Variable Entering 1

                              Source           DF   SS       MS       F        F Sig.
Multiple R        0.633617    Due Regression    3   3142.9   1047.6   4.0246   0.976444
Std Error of Y.X  16.1340     Dev. from Reg.   18   4685.5   260.31
R Square          0.401471    Total            21   7828.4   372.78
Multiple Regression Equation
Y =
Variable          Coefficient       Std. Error       T Val.       T Sig.    P. Cor.   Elasticity
X1       X- 1    -0.6784908E-02     0.8144097E-01   -0.8331E-01   0.06548   -0.0196   -0.1047E-01
X2       X- 2    -0.4782399         0.7572547       -0.6315       0.46438   -0.1472   -0.2556
X3       X- 3     8.496616          3.844121         2.210        0.95973    0.4620    0.8442
CONSTANT X- 5    23.93061          15.96606          1.499        0.84875
Adjusted R Square                         0.301715772931645
-2 * ln(Maximum of Likelihood Function)   180.379400452936
Akaike Information Criterion (AIC)        190.379400452936
Scwartz Information Criterion (SIC)       195.834612719728
Akaike (1970) Finite Prediction Error     307.634186421501
Generalized Cross Validation              318.151594504287
Hannan & Quinn (1979) HQ                  321.036004555605
Shibata (1981)                            290.423882286032
Rice (1984)                               334.678950062951
Residual Variance                         260.305850048962
Order of entrance (or deletion) of the variables = 3 5 2 1
Estimate of computational error in coefficients =
1 0.2532E-11 2 -0.5376E-11 3 -0.7851E-12 4 -0.9443E-13
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.66326323E-02
Row 2 Variable X- 2 X2 -0.17265284E-01 0.57343470
Row 3 Variable X- 3 X3 -0.14835665 -1.6567911 14.777267
Row 4 Variable X- 5 CONSTANT 0.77898727 -6.5357233 -20.071205 254.91515
Program terminated. All variables put in.
Residual Statistics for... Original Data
Von Neumann Ratio 1 ... 2.21784 Durbin-Watson TEST..... 2.11703 Von Neumann Ratio 2 ... 2.21784
For D. F. 18 t(.9999)= 4.9654, t(.999)= 3.9216, t(.99)= 2.8784, t(.95)= 2.1009, t(.90)= 1.7341, t(.80)= 1.3304
Skewness test (Alpha 3) = 0.145227 , Peakedness test (Alpha 4)= 1.39268
Normality Test -- Extended grid cell size = 2.20
t Stat   Infin 1.734 1.330 1.067 0.862 0.688 0.534 0.392 0.257 0.127
Cell No.     1     2     5     2     1     2     3     4     1     1
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.955 0.864 0.636 0.545 0.500 0.409 0.273 0.091 0.045

Normality Test -- Small sample grid cell size = 4.40
Cell No.     3     7     3     7     2
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.864 0.545 0.409 0.091
Extended grid normality test - Prob of rejecting normality assumption Chi= 8.000 Chi Prob= 0.5665 F(8, 18)= 1.00000 F Prob =0.531050
Small sample normality test - Large grid Chi= 5.273 Chi Prob= 0.8471 F(3, 18)= 1.75758 F Prob =0.808728
Autocorrelation function of residuals
1) -0.0991 2) 0.1355 3) -0.4051 4) 0.1520
F( 7, 7) = 1.432 1/F = 0.6985 Heteroskedasticity at 0.6762 level
Sum of squared residuals 4686. Mean squared residual 213.0
The preceding sections have outlined some of the things that can be done with simple regression analysis. In the next section of the paper, data will be generated that will better illustrate problems of omitted variables and "hidden" nonlinearity.
7. Advanced Regression analysis
The B34S code listed below shows how 250 observations on a number of series are generated. The B34S regression models are also shown.
/;
/; nonlinearity and serial correlation in generated data
/;
b34sexec data noob=250 maxlag=1
/; corr;
* b0=1 b1=100 b2=-100 b3=80 $
* generate three output variables with different characteristics$
build x1 x2 x3 y ynlin yma e$
gen x1 = rn()$
gen x2 = x1*x1$
gen x3 = lag1(x1)$
/; gen e = 100.*rn()$
gen e = rn()$
* ;
* build three variables $
* y=f(x1,x2 x3) ;
* ynlin=f(x1, x3);
* yma =f(x1,x2,x3) + theta*lag(et);
* ;
gen y = 1.0 + 1.*x1 - 1.*x2 + .8*x3 + e $
gen ynlin = y - .8*x3 $
* generate an ma model;
gen yma = y + (-.95*lag1(e));
b34srun$
/;
/; end of data building
/;
/; b34sexec list iend=20$ b34seend$
b34sexec reg$ model y = x1 x2 $ b34seend$
b34sexec reg$ model y = x1 $ b34seend$
b34sexec reg$ model ynlin = x1 $ b34seend$
b34sexec reg$ model ynlin = x1 x2 $ b34seend$
b34sexec reg$ model yma = x1 x2 x3$ b34seend$
/$ do gls
b34sexec regression residuala maxgls=4$ model yma=x1 x2 x3 $ b34seend$
b34sexec matrix;
call loaddata;
call load(rrplots);
call load(data2acf);
call olsq(yma x1 x2 x3 :print);
call data2acf(%res,'Model yma=f(x1, x2, x3)',12,'yma_res_acf.wmf');
b34srun;
/; sort data by variable we suspect is nonlinear
/; Then do RR analysis
/;
b34sexec sort $ by x1$ b34seend$
/; b34sexec list iend=20$ b34seend$
b34sexec reg$ model y = x1 $ b34seend$
b34sexec reg$ model y = x1 x3 $ b34seend$
b34sexec reg$ model ynlin = x1 $ b34seend$
/;
/; recursive residual analysis
/; x2, which is a nonlinear x1 term, is missing. Can RR detect it?
/;
b34sexec matrix;call loaddata;call load(rrplots);
/; call print(rrplots);
call olsq(y x1 x3 :rr 1 :print);
/; call tabulate(%rrobs,%ssr1,%ssr2,%rr,%rrstd,%res);
call print('Sum of squares of std RR ',sumsq(goodrow(%rrstd)):);
call print('Sum of squares of OLS RES ',sumsq(goodrow(%res)):);
/; call print(%rrcoef,%rrcoeft);
/; call rrplots(%rrstd,%rss,%nob,%k,%ssr1,%ssr2,1);
call rrplots(%rrstd,%rss,%nob,%k,%ssr1,%ssr2,0);
/; call names(all);
x1_coef=%rrcoef(,1);
x3_coef=%rrcoef(,2);
call graph(x1_coef,x3_coef :file 'coef_bias.wmf' :nolabel
 :heading 'Omitted Variable Bias x1 and x3 coef');
b34srun;

The above code builds three models.
y(t) = 1.0 + x1(t) - x2(t) + .8 x3(t) + e(t)        (7-1)

ynlin(t) = y(t) - .8 x3(t) = 1.0 + x1(t) - x2(t) + e(t)        (7-2)

yma(t) = y(t) - .95 e(t-1)        (7-3)

By construction of the data and ignoring subscripts where there is no confusion:

x2(t) = x1(t)**2        (7-4)

x3(t) = x1(t-1)        (7-5)
Since x1 is a serially uncorrelated random variable, there is no correlation between x1 and its lag x3. Because of (7-4) there is correlation between x1 and x2. The purpose of the generated dataset is to illustrate the conditions under which an omitted variable will and will not bias the coefficients estimated for an incomplete model, and to show a detection strategy. The yma series illustrates the relationship between AR (autoregressive) and MA (moving average) error processes.
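The omitted-variable argument can be made concrete with a small simulation. The sketch below (numpy assumed; the seed and sample size are arbitrary, so the estimates vary from sample to sample) generates data following (7-1), (7-4) and (7-5) and checks the exact in-sample omitted-variable identity: the short-regression coefficient equals the long-regression coefficient plus the omitted variable's coefficient times the auxiliary-regression slope.

```python
import numpy as np

rng = np.random.default_rng(7)           # arbitrary seed; the identity below holds for any sample
n = 249
x1 = rng.standard_normal(n)
x2 = x1 ** 2                             # (7-4): x2 is x1 squared
x3 = np.concatenate(([0.0], x1[:-1]))    # (7-5): x3 is x1 lagged one period
e  = rng.standard_normal(n)
y  = 1.0 + x1 - x2 + 0.8 * x3 + e        # (7-1)

def ols(yv, *cols):
    # OLS with a constant appended as the last column
    X = np.column_stack(cols + (np.ones(len(yv)),))
    return np.linalg.lstsq(X, yv, rcond=None)[0]

b_long  = ols(y, x1, x2)    # x3 omitted: harmless, x3 is (population-)uncorrelated with x1 and x2
b_short = ols(y, x1)        # x2 omitted as well
d       = ols(x2, x1)       # auxiliary regression of the omitted x2 on the included x1

# Exact in-sample identity: short coef = long coef + (omitted coef) * (auxiliary slope)
print(b_short[0])
print(b_long[0] + b_long[1] * d[0])      # prints the same number
```

The identity is pure least-squares algebra, so it holds whatever the true data-generating process is; the bias is material only when the auxiliary slope d is non-negligible.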
Assume the lag operator L is defined such that L x(t) = x(t-1). A simple OLS model with an MA error process is defined as
y(t) = Xb + theta(L)e(t),        (7-6)

where theta(L) is a polynomial in the lag operator L. A simple OLS model with an AR process is defined as
y(t) = Xb + (1/phi(L))e(t),        (7-7)

where phi(L) is a polynomial in the lag operator L. If we assume further that the maximum order in theta(L) is 1, i. e.
theta(L) = 1 - theta1*L        (7-8)
It can be proved that a first-order MA model (MA(1)) is equal to an infinite-order AR model if |theta1| < 1. This can be seen if we note that

1/(1 - theta1*L) = 1 + theta1*L + (theta1**2)L**2 + (theta1**3)L**3 + ...        (7-9)
where |theta1| < 1. The importance of equation (7-9) is that it shows that if equation (7-3) is estimated with GLS, which is implicitly an AR error-correction technique, more than first-order GLS will be required to remove the serial correlation in the error term. In a transfer function model of the form
y(t) = Xb + (theta(L)/phi(L))e(t)        (7-10)
then only one MA term (7-8) would be needed and phi(L) = 1. An OLS model is a transfer function model that constrains theta(L) = phi(L) = 1. GLS allows phi(L) to differ from 1 while keeping theta(L) = 1.
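The inversion in (7-9) can be checked numerically. A small sketch (numpy assumed; theta = .95 matches the yma construction) multiplies (1 - theta*L) by a truncated version of its inverse and shows that all interior terms cancel, leaving only a truncation remainder theta**(J+1) that dies off slowly because theta is close to one. This is why a low-order AR (GLS) correction cannot fully whiten an MA(1) error:

```python
import numpy as np

theta = 0.95                     # MA(1) coefficient used to build yma
J = 50                           # truncation order for the AR expansion

ar_weights = theta ** np.arange(J + 1)    # 1, theta, theta**2, ..., theta**J from (7-9)
ma_poly = np.array([1.0, -theta])         # coefficients of (1 - theta*L)

# Polynomial multiplication: (1 - theta*L)(1 + theta*L + ... + theta**J L**J)
product = np.convolve(ma_poly, ar_weights)

print(product[0])                         # leading term: exactly 1
print(np.abs(product[1:J + 1]).max())     # interior terms cancel (up to rounding)
print(product[-1])                        # remainder -theta**(J+1): still about -0.07 at J = 50,
                                          # so low-order AR corrections leave serial correlation behind
```

With a smaller theta the remainder would vanish after a few lags and one GLS pass would suffice.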
The means of the data generated in accordance with equations (7-1) - (7-3) and OLS estimation of a number of models are given next.
Variable # Cases Mean Std Deviation Variance Maximum Minimum
X1   1   249   0.1508079470   1.047903751   1.098102271   3.422285173   -2.584990017
X2   2   249   1.116435259    1.532920056   2.349843898   11.71203580    0.2973256533E-04
X3   3   249   0.1609627505   1.054029274   1.110977710   3.422285173   -2.584990017
Y          4   249   0.1975673548       2.283588868   5.214778119   5.013192154   -9.876622239
YNLIN      5   249   0.6879715443E-01   2.085180439   4.347977464   4.117746229   -9.568155797
YMA        6   249   0.1594119944       2.399542772   5.757805512   6.108808152   -10.57614307
E          7   249   0.3442446593E-01   1.059199304   1.121903166   2.876329588   -3.278246463
CONSTANT   8   249   1.000000000        0.000000000   0.000000000   1.000000000   1.000000000
Number of observations in data file 249 Current missing variable code 1.000000000000000E+31
The output listed below shows that the coefficients for x1 and x2 are close to their population values of 1.0 and -1.0, even though x3 is missing from the model. This is because the omitted variable x3 is not correlated with any included variable.

REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 638
OLS Estimation
Dependent variable           Y
Adjusted R**2                0.6416762467329375
Standard Error of Estimate   1.366959717042574
Sum of Squared Residuals     459.6704015322101
Model Sum of Squares         833.5945719535468
Total Sum of Squares         1293.264973485757
F( 2, 246)                   223.0557634525042
F Significance               1.000000000000000
1/Condition of XPX           0.1334523794627398
Number of Observations       249
Durbin-Watson                1.889314291485050
Variable          Coefficient        Std. Error         t
X1       { 0}     1.1228155          0.83599150E-01     13.430944
X2       { 0}    -1.0266583          0.57148357E-01    -17.964792
CONSTANT { 0}     1.1744354          0.10731666         10.943645
B34S 8.10Z (D:M:Y) 10/12/06 (H:M:S) 8:14:43 REG STEP PAGE 3
Since the omitted variable x2 is related to the included variable x1, the estimated coefficient for x1 is biased.
REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 508
OLS Estimation
Dependent variable           Y
Adjusted R**2                0.1749360075645998
Standard Error of Estimate   2.074253035297188
Sum of Squared Residuals     1062.723836646581
Model Sum of Squares         230.5411368391758
Total Sum of Squares         1293.264973485757
F( 1, 247)                   53.58274542797675
F Significance               0.9999999999965166
1/Condition of XPX           0.7121866785207466
Number of Observations       249
Durbin-Watson                1.944404933534759
Variable          Coefficient        Std. Error         t
X1       { 0}     0.92008294         0.12569399         7.3200236
CONSTANT { 0}     0.58811535E-01     0.13281015         0.44282410
The model for ynlin does not contain x3. Here the omission of x2 shows up as a bias on the included variable x1. The fact that x1 appears highly significant (t = 7.55) may fool the researcher. The task ahead is to investigate model specification in a systematic manner using simple tests.
REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 508
OLS Estimation
Dependent variable           YNLIN
Adjusted R**2                0.1841644277111127
Standard Error of Estimate   1.883410386043661
Sum of Squared Residuals     876.1669665175114
Model Sum of Squares         202.1314444376089
Total Sum of Squares         1078.298410955120
F( 1, 247)                   56.98282254868781
F Significance               0.9999999999991491
1/Condition of XPX           0.7121866785207466
Number of Observations       249
Durbin-Watson                1.922026461982424
Variable          Coefficient        Std. Error         t
X1       { 0}     0.86152861         0.11412945         7.5486967
CONSTANT { 0}    -0.61128207E-01     0.12059089        -0.50690568
Before beginning the analysis, note that the correct model for ynlin shows coefficients close to their population values.

REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 638
OLS Estimation
Dependent variable           YNLIN
Adjusted R**2                0.7410517802311660
Standard Error of Estimate   1.061084833449131
Sum of Squared Residuals     276.9716518488394
Model Sum of Squares         801.3267591062809
Total Sum of Squares         1078.298410955120
F( 2, 246)                   355.8602142571058
F Significance               1.000000000000000
1/Condition of XPX           0.1334523794627398
Number of Observations       249
Durbin-Watson                1.933912998767355
Variable          Coefficient        Std. Error         t
X1       { 0}     1.0636116          0.64892761E-01     16.390297
X2       { 0}    -1.0233690          0.44360674E-01    -23.069283
CONSTANT { 0}     1.0509212          0.83303169E-01     12.615622
The model for yma shows negative serial correlation (DW = 2.872) even though all variables are in the model.

REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 771
OLS Estimation
Dependent variable           YMA
Adjusted R**2                0.6413612625787487
Standard Error of Estimate   1.437001078383063
Sum of Squared Residuals     505.9181643221510
Model Sum of Squares         922.0176027460155
Total Sum of Squares         1427.935767068167
F( 3, 245)                   148.8345537566244
F Significance               1.000000000000000
1/Condition of XPX           0.1270036486757258
Number of Observations       249
Durbin-Watson                2.871725884170242
Variable          Coefficient        Std. Error         t
X1       { 0}     1.0907591          0.88117138E-01     12.378513
X2       { 0}    -0.94687924         0.60077630E-01    -15.760929
X3       { 0}     0.75276712         0.86803876E-01     8.6720450
CONSTANT { 0}     0.93087875         0.11360868         8.1937292
GLS will be attempted.

Problem Number 1 Subproblem Number 1
F to enter                                     9.999999776482582E-03
F to remove                                    4.999999888241291E-03
Tolerance (1.-R**2) for including a variable   1.000000000000000E-05
Maximum Number of Variables Allowed            4
Internal Number of dependent variable          6
Dependent Variable                             YMA
Standard Error of Y                            2.399542771523700
Degrees of Freedom                             248
Step Number 4     Analysis of Variance for reduction in SS due to variable entering
Variable Entering 8

                              Source           DF    SS       MS       F        F Sig.
Multiple R        0.803554    Due Regression     3   922.02   307.34   148.83   1.000000
Std Error of Y.X  1.43700     Dev. from Reg.   245   505.92   2.0650
R Square          0.645700    Total            248   1427.9   5.7578
Multiple Regression Equation
YMA =
Variable          Coefficient       Std. Error       T Val.   T Sig.    P. Cor.   Elasticity
X1       X- 1     1.090759          0.8811714E-01    12.38    1.00000    0.6203    1.032
X2       X- 2    -0.9468792         0.6007763E-01   -15.76    1.00000   -0.7095   -6.631
X3       X- 3     0.7527671         0.8680388E-01     8.672   1.00000    0.4846    0.7601
CONSTANT X- 8     0.9308788         0.1136087         8.194   1.00000
Adjusted R Square                         0.6413612625787489
-2 * ln(Maximum of Likelihood Function)   883.1529747953530
Akaike Information Criterion (AIC)        893.1529747953530
Scwartz Information Criterion (SIC)       910.7402392776766
Akaike (1970) Finite Prediction Error     2.098144341832704
Generalized Cross Validation              2.098685929466314
Hannan & Quinn (1979) HQ                  2.146406058401069
Shibata (1981)                            2.097078566971383
Rice (1984)                               2.099245495112659
Residual Variance                         2.064972099274085
Order of entrance (or deletion) of the variables = 1 2 3 8
Estimate of Computational Error in Coefficients
1 2 3 4 -0.255313E-15 -0.119643E-15 0.190567E-16 -0.196339E-16
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.77646301E-02
Row 2 Variable X- 2 X2 -0.71499452E-03 0.36093216E-02
Row 3 Variable X- 3 X3 -0.55762009E-03 0.30981359E-04 0.75349130E-02
Row 4 Variable X- 8 CONSTANT -0.28296676E-03 -0.39267339E-02 -0.11633355E-02 0.12906932E-01
Program terminated. All variables put in.
Residual Statistics for Original data
Von Neumann Ratio 1 ... 2.88331   Durbin-Watson TEST..... 2.87173   Von Neumann Ratio 2 ... 2.88331
For D. F. 245 t(.9999)= 3.9556, t(.999)= 3.3307, t(.99)= 2.5960, t(.95)= 1.9697, t(.90)= 1.6511, t(.80)= 1.2850
Skewness test (Alpha 3) = 0.113402 , Peakedness test (Alpha 4)= 2.84927
Normality Test -- Extended grid cell size = 24.90
t Stat   Infin 1.651 1.285 1.039 0.843 0.675 0.525 0.386 0.254 0.126
Cell No.    25    20    29    26    23    30    27    24    21    24
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.900 0.819 0.703 0.598 0.506 0.386 0.277 0.181 0.096

Normality Test -- Small sample grid cell size = 49.80
Cell No.    45    55    53    51    45
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.819 0.598 0.386 0.181

Extended grid normality test - Prob of rejecting normality assumption
Chi= 3.731 Chi Prob= 0.1195   F(8, 245)= 0.466365   F Prob =0.120880
Small sample normality test - Large grid
Chi= 1.703 Chi Prob= 0.3637   F(3, 245)= 0.567604   F Prob =0.363150
Autocorrelation function of residuals
1 2 3 4 5 -0.438023 -0.874684E-01 0.759865E-01 -0.113471 0.809694E-01
F( 83, 83) = 1.121 1/F = 0.8919 Heteroskedasticity at 0.6983 level
Sum of squared residuals 505.9181643221511 Mean squared residual 2.031799856715466
Note the ACF values of -.438, -.087 for the OLS model. GLS is now attempted:
Doing Gen. Least Squares using residual Dif. Eq. of order 1 Lag Coefficients
1 -0.436471
Standard Error of Y 1.806380831086763 Degrees of Freedom 247
Step Number 4     Analysis of Variance for reduction in SS due to variable entering
Variable Entering 3

                              Source           DF    SS       MS        F        F Sig.
Multiple R        0.868408    Due Regression     3   607.80   202.60    249.47   1.000000
Std Error of Y.X  0.901185    Dev. from Reg.   244   198.16   0.81213
R Square          0.754132    Total            247   805.96   3.2630
Multiple Regression Equation
YMA =
Variable          Coefficient       Std. Error       T Val.   T Sig.    P. Cor.   Elasticity
X1       X- 1     1.053314          0.7820969E-01    13.47    1.00000    0.6530    0.9965
X2       X- 2    -0.9569885         0.5047854E-01   -18.96    1.00000   -0.7718   -6.702
X3       X- 3     0.7796353         0.7773981E-01    10.03    1.00000    0.5403    0.7872
CONSTANT X- 8     0.9431791         0.8032375E-01    11.74    1.00000
Adjusted R Square                         0.7511089134685829
-2 * ln(Maximum of Likelihood Function)   648.1547627678506
Akaike Information Criterion (AIC)        658.1547627678506
Scwartz Information Criterion (SIC)       675.7219064986755
Akaike (1970) Finite Prediction Error     0.8252334731172152
Generalized Cross Validation              0.8254482099043910
Hannan & Quinn (1979) HQ                  0.8442731002246441
Shibata (1981)                            0.8248109265359979
Rice (1984)                               0.8256701045844729
Residual Variance                         0.8121345290994816
Order of entrance (or deletion) of the variables = 1 2 8 3
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.61167560E-02
Row 2 Variable X- 2 X2 -0.42933638E-03 0.25480832E-02
Row 3 Variable X- 3 X3 -0.26031711E-02 -0.11668626E-03 0.60434773E-02
Row 4 Variable X- 8 CONSTANT -0.43170371E-04 -0.27722114E-02 -0.41214721E-03 0.64519047E-02
Program terminated. All variables put in.
Residual Statistics for Smoothed Original data
For GLS Y and Y estimate scaled by 1.436471170926504
Von Neumann Ratio 1 ... 2.30772   Durbin-Watson TEST..... 2.29841   Von Neumann Ratio 2 ... 2.30772
For D. F. 244 t(.9999)= 3.9559, t(.999)= 3.3308, t(.99)= 2.5961, t(.95)= 1.9697, t(.90)= 1.6511, t(.80)= 1.2850
Skewness test (Alpha 3) = 0.110831 , Peakedness test (Alpha 4)= 2.81493
Normality Test -- Extended grid cell size = 24.80
t Stat   Infin 1.651 1.285 1.039 0.843 0.675 0.525 0.386 0.254 0.126
Cell No.    21    33    20    25    25    25    34    23    21    21
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.915 0.782 0.702 0.601 0.500 0.399 0.262 0.169 0.085

Normality Test -- Small sample grid cell size = 49.60
Cell No.    54    45    50    57    42
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.782 0.601 0.399 0.169

Extended grid normality test - Prob of rejecting normality assumption
Chi= 8.935 Chi Prob= 0.6522   F(8, 244)= 1.11694   F Prob =0.647735

Small sample normality test - Large grid
Chi= 3.089 Chi Prob= 0.6219   F(3, 244)= 1.02957   F Prob =0.619884
Autocorrelation function of residuals
1 2 3 4 -0.151211 -0.321877 -0.413296E-03 -0.787568E-01
F( 83, 83) = 1.028 1/F = 0.9724 Heteroskedasticity at 0.5505 level
Sum of squared residuals 198.1608251002743 Mean squared residual 0.7990355850817513
The DW, now 2.298, is close to 2.0. The ACF shows a spike at lag 2, so GLS of order 2 is attempted.
Doing Gen. Least Squares using residual Dif. Eq. of order 2 Lag Coefficients
1 2 -0.587393 -0.343873
Standard Error of Y 1.488647529013689 Degrees of Freedom 246
Step Number 4     Analysis of Variance for reduction in SS due to variable entering
Variable Entering 3

                              Source           DF    SS       MS        F        F Sig.
Multiple R        0.906985    Due Regression     3   448.46   149.49    375.65   1.000000
Std Error of Y.X  0.630820    Dev. from Reg.   243   96.698   0.39793
R Square          0.822623    Total            246   545.15   2.2161
Multiple Regression Equation
YMA =
Variable          Coefficient       Std. Error       T Val.   T Sig.    P. Cor.   Elasticity
X1       X- 1     1.064045          0.7171594E-01    14.84    1.00000    0.6894    1.007
X2       X- 2    -0.9423114         0.4273665E-01   -22.05    1.00000   -0.8165   -6.599
X3       X- 3     0.7593212         0.7132141E-01    10.65    1.00000    0.5640    0.7667
CONSTANT X- 8     0.9305895         0.6243077E-01    14.91    1.00000
Adjusted R Square                         0.8204327863250724
-2 * ln(Maximum of Likelihood Function)   469.3198834070857
Akaike Information Criterion (AIC)        479.3198834070857
Scwartz Information Criterion (SIC)       496.8668250902256
Akaike (1970) Finite Prediction Error     0.4043780501040350
Generalized Cross Validation              0.4044841286507808
Hannan & Quinn (1979) HQ                  0.4137361553163259
Shibata (1981)                            0.4041693287529482
Rice (1984)                               0.4045937579438610
Residual Variance                         0.3979337783892296
Order of entrance (or deletion) of the variables = 1 2 8 3
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.51431761E-02
Row 2 Variable X- 2 X2 -0.30211664E-03 0.18264214E-02
Row 3 Variable X- 3 X3 -0.29466037E-02 -0.27120489E-04 0.50867433E-02
Row 4 Variable X- 8 CONSTANT -0.60309115E-05 -0.19976715E-02 -0.29290972E-03 0.38976015E-02
Program terminated. All variables put in.
Residual Statistics for Smoothed Original data
For GLS Y and Y estimate scaled by 1.931266064605667
Von Neumann Ratio 1 ... 2.12282   Durbin-Watson TEST..... 2.11423   Von Neumann Ratio 2 ... 2.12282
For D. F. 243 t(.9999)= 3.9561, t(.999)= 3.3310, t(.99)= 2.5962, t(.95)= 1.9698, t(.90)= 1.6511, t(.80)= 1.2850
Skewness test (Alpha 3) = 0.726358E-01, Peakedness test (Alpha 4)= 2.77604
Normality Test -- Extended grid cell size = 24.70
t Stat   Infin 1.651 1.285 1.039 0.843 0.676 0.525 0.386 0.254 0.126
Cell No.    24    29    23    25    23    24    27    30    25    17
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.903 0.785 0.692 0.591 0.498 0.401 0.291 0.170 0.069

Normality Test -- Small sample grid cell size = 49.40
Cell No.    53    48    47    57    42
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.785 0.591 0.401 0.170

Extended grid normality test - Prob of rejecting normality assumption
Chi= 4.781 Chi Prob= 0.2193   F(8, 243)= 0.597672   F Prob =0.220557

Small sample normality test - Large grid
Chi= 2.696 Chi Prob= 0.5592   F(3, 243)= 0.898785   F Prob =0.557561
Autocorrelation function of residuals
1 2 3 4 5 -0.645124E-01 -0.148837 -0.240427 -0.992257E-01 0.728148E-01
F( 82, 82) = 1.162 1/F = 0.8609 Heteroskedasticity at 0.7505 level
Sum of squared residuals 96.69790814858062 Mean squared residual 0.3914895066744155
The DW is now 2.114. GLS of orders 3 and 4 was attempted with little gain in the DW, but the GLS lag coefficients show the slow decline that would be expected in view of (7-9).
Doing Gen. Least Squares using residual Dif. Eq. of order 3
Lag           1          2          3
Coefficient  -0.648299  -0.447794  -0.176326
Standard Error of Y 1.356570786867690 Degrees of Freedom 245
.............Step Number 4
Analysis of Variance for reduction in SS due to variable entering
Variable Entering      3

Source           DF    SS        MS        F        F Sig.
Due Regression    3    384.23    128.08    465.10   1.000000
Dev. from Reg.  242    66.641    0.27538
Total           245    450.87    1.8403

Multiple R        0.923144
Std Error of Y.X  0.524763
R Square          0.852194
Multiple Regression Equation
Variable          Coefficient   Std. Error       T Val.   T Sig.    P. Cor.   Elasticity
YMA      =
X1       X- 1      1.027797     0.7160766E-01     14.35   1.00000    0.6781    0.9723
X2       X- 2     -0.9485962    0.3945060E-01    -24.05   1.00000   -0.8396   -6.643
X3       X- 3      0.7885052    0.7105830E-01     11.10   1.00000    0.5807    0.7962
CONSTANT X- 8      0.9428557    0.5552124E-01     16.98   1.00000
Adjusted R Square                          0.8503620346592192
-2 * ln(Maximum of Likelihood Function)    376.8392476702688
Akaike Information Criterion (AIC)         386.8392476702688
Scwartz Information Criterion (SIC)        404.3659053499306
Akaike (1970) Finite Prediction Error      0.2798540632805742
Generalized Cross Validation               0.2799280742725161
Hannan & Quinn (1979) HQ                   0.2863502020183029
Shibata (1981)                             0.2797084481582168
Rice (1984)                                0.2800045730288931
Residual Variance                          0.2753763982680850
Order of entrance (or deletion) of the variables = 1 2 8 3
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.51276574E-02
Row 2 Variable X- 2 X2 -0.23263544E-03 0.15563500E-02
Row 3 Variable X- 3 X3 -0.33414602E-02 0.11967882E-04 0.50492818E-02
Row 4 Variable X- 8 CONSTANT -0.35430730E-04 -0.17098083E-02 -0.26379732E-03 0.30826085E-02
Program terminated. All variables put in.
Residual Statistics for Smoothed Original data
For GLS Y and Y estimate scaled by 2.272418943212623
Von Neumann Ratio 1 ... 2.12109    Durbin-Watson TEST..... 2.11247    Von Neumann Ratio 2 ... 2.12109
For D. F. 242 t(.9999)= 3.9564, t(.999)= 3.3312, t(.99)= 2.5963, t(.95)= 1.9698, t(.90)= 1.6512, t(.80)= 1.2851
Skewness test (Alpha 3) = 0.628738E-01, Peakedness test (Alpha 4)= 2.66896
Normality Test -- Extended grid, cell size = 24.60
t Stat   Infin 1.651 1.285 1.039 0.843 0.676 0.525 0.386 0.254 0.126
Cell No.    23    30    24    24    23    30    22    28    25    17
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.907 0.785 0.687 0.589 0.496 0.374 0.285 0.171 0.069
Normality Test -- Small sample grid, cell size = 49.20
Cell No.    53    48    53    50    42
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.785 0.589 0.374 0.171
Extended grid normality test - Prob of rejecting normality assumption
Chi= 5.707  Chi Prob= 0.3200  F(8, 242)= 0.713415  F Prob = 0.320389
Small sample normality test - Large grid
Chi= 1.683  Chi Prob= 0.3593  F(3, 242)= 0.560976  F Prob = 0.358735
Autocorrelation function of residuals
Lag      1              2          3          4          5              6
ACF     -0.577779E-01  -0.116476  -0.153224  -0.211519   0.509156E-01   0.347658E-02
F( 82, 82) = 1.185 1/F = 0.8435 Heteroskedasticity at 0.7786 level
Sum of squared residuals 66.64108838087667 Mean squared residual 0.2708987332555962
Doing Gen. Least Squares using residual Dif. Eq. of order 4
Lag           1          2          3          4
Coefficient  -0.694654  -0.564775  -0.345213  -0.259654
Standard Error of Y 1.203012334750675 Degrees of Freedom 244
.............Step Number 4
Analysis of Variance for reduction in SS due to variable entering
Variable Entering      3

Source           DF    SS        MS        F        F Sig.
Due Regression    3    314.11    104.70    646.68   1.000000
Dev. from Reg.  241    39.020    0.16191
Total           244    353.13    1.4472

Multiple R        0.943134
Std Error of Y.X  0.402377
R Square          0.889502
Multiple Regression Equation
Variable          Coefficient   Std. Error       T Val.   T Sig.    P. Cor.   Elasticity
YMA      =
X1       X- 1      1.036793     0.6936194E-01     14.95   1.00000    0.6936    0.9808
X2       X- 2     -0.9666933    0.3536745E-01    -27.33   1.00000   -0.8695   -6.770
X3       X- 3      0.7804025    0.6903331E-01     11.30   1.00000    0.5887    0.7880
CONSTANT X- 8      0.9634053    0.4764657E-01     20.22   1.00000
Adjusted R Square                          0.8881265220663394
-2 * ln(Maximum of Likelihood Function)    245.1681832743955
Akaike Information Criterion (AIC)         255.1681832743955
Scwartz Information Criterion (SIC)        272.6744743271191
Akaike (1970) Finite Prediction Error      0.1645510140428232
Generalized Cross Validation               0.1645948877321811
Hannan & Quinn (1979) HQ                   0.1683823670820125
Shibata (1981)                             0.1644646992743718
Rice (1984)                                0.1646402423899563
Residual Variance                          0.1619076242590027
Order of entrance (or deletion) of the variables = 1 2 8 3
Covariance Matrix of Regression Coefficients
Row 1 Variable X- 1 X1 0.48110784E-02
Row 2 Variable X- 2 X2 -0.16111651E-03 0.12508569E-02
Row 3 Variable X- 3 X3 -0.35012367E-02 0.74455260E-04 0.47655984E-02
Row 4 Variable X- 8 CONSTANT -0.40532334E-04 -0.13901832E-02 -0.26943072E-03 0.22701954E-02
Program terminated. All variables put in.
Residual Statistics for Smoothed Original data
For GLS Y and Y estimate scaled by 2.864295541231388
Von Neumann Ratio 1 ... 2.11504    Durbin-Watson TEST..... 2.10640    Von Neumann Ratio 2 ... 2.11504
For D. F. 241 t(.9999)= 3.9567, t(.999)= 3.3314, t(.99)= 2.5964, t(.95)= 1.9699, t(.90)= 1.6512, t(.80)= 1.2851
Skewness test (Alpha 3) = -.879541E-01, Peakedness test (Alpha 4)= 2.53942
Normality Test -- Extended grid, cell size = 24.50
t Stat   Infin 1.651 1.285 1.039 0.843 0.676 0.525 0.386 0.254 0.126
Cell No.    22    30    30    20    21    22    27    23    22    28
Interval 1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100
Act Per  1.000 0.910 0.788 0.665 0.584 0.498 0.408 0.298 0.204 0.114
Normality Test -- Small sample grid, cell size = 49.00
Cell No.    52    50    43    50    50
Interval 1.000 0.800 0.600 0.400 0.200
Act Per  1.000 0.788 0.584 0.408 0.204
Extended grid normality test - Prob of rejecting normality assumption
Chi= 5.408  Chi Prob= 0.2868  F(8, 241)= 0.676020  F Prob = 0.287518
Small sample normality test - Large grid
Chi= 0.9796  Chi Prob= 0.1938  F(3, 241)= 0.326531  F Prob = 0.193819
Autocorrelation function of residuals
Lag      1              2              3              4          5          6              7
ACF     -0.580007E-01  -0.654729E-01  -0.913505E-01  -0.110200  -0.165341  -0.324259E-01  -0.258273E-01
F( 82, 82) = 1.215 1/F = 0.8233 Heteroskedasticity at 0.8097 level
Sum of squared residuals 39.01973744641884 Mean squared residual 0.1592642344751790
Gen. Least Squares ended by max. order reached.
The classic MA residual ACF pattern is shown in Figure 7.1. There is one ACF spike, but the PACF suggests a longer AR model, which was shown to be captured by the GLS model above.
[Figure: residual plot for the model yma = f(x1, x2, x3) over observations 20-240, with the ACF and PACF of the residual series (lags 1-12) plotted below it.]
Figure 7.1 Analysis of residuals of the YMA model.
Remark: A low-order autoregressive structure in the error term is usually easily captured by a GLS model. However, a simple MA residual structure, which might occur in an overshooting situation, often requires a high-order GLS model to clean the residual. The problem is that with maximum likelihood GLS the autoregressive parameters are often hard to estimate because they are related, as is seen in (7-9).
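The remark can be made concrete with a short numerical sketch (Python with numpy; the MA parameter value is assumed for illustration, not taken from the text's model). An invertible MA(1) error e_t = u_t + theta*u_{t-1} has the infinite autoregressive representation e_t = sum_j a_j e_{t-j} + u_t with a_j = -(-theta)**j, so the GLS "clean-up" needs many slowly declining lag coefficients:

```python
import numpy as np

# Hedged sketch: implied AR coefficients for an assumed MA(1) parameter.
# For theta = -0.7 the a_j are all negative and decline only geometrically,
# which is why a high-order GLS model is needed to clean an MA residual.
theta = -0.7
a = np.array([-(-theta) ** j for j in range(1, 9)])
print(np.round(a, 4))
assert all(abs(a[j]) > abs(a[j + 1]) for j in range(len(a) - 1))
```

The slow geometric decline of these implied lag coefficients mirrors the slowly declining GLS lag coefficients reported in the output above.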
Recall that the models estimated earlier produced biased coefficients for the constant and x1, and for the constant, x1 and x3, respectively. How might one test such a model for an excluded variable (x2) that is related to an included variable (x1)? One way to proceed is to sort the data with respect to one variable (x1 in the example to be shown) and inspect the Durbin-Watson statistic. Nonlinearity will be reflected in a low DW. This approach applies time series methods to cross section data. This is shown next.
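A minimal simulation of this sorting idea, using an invented data-generating process rather than the text's dataset, shows the Durbin-Watson statistic dropping well below 2 when an omitted variable x2 is related to the included, sorted variable x1:

```python
import numpy as np

# Hedged simulation: x2 is related to x1 but omitted from the regression.
# After sorting the sample by x1, the omitted-variable problem shows up as
# serially correlated residuals and hence a low Durbin-Watson statistic.
rng = np.random.default_rng(0)
n = 250
x1 = rng.normal(size=n)
x2 = x1 ** 2 + rng.normal(size=n)        # related to x1, omitted below
y = 1.0 + x1 + x2 + rng.normal(size=n)

order = np.argsort(x1)                   # sort everything against x1
X = np.column_stack([np.ones(n), x1])[order]
ys = y[order]
b, *_ = np.linalg.lstsq(X, ys, rcond=None)
e = ys - X @ b
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(round(dw, 3))
assert dw < 1.6                          # misspecification lowers the DW
```

With the sample left unsorted, the same regression would give a DW near 2, which is why the sort is the essential step.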
REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 508
OLS Estimation
Dependent variable            Y
Adjusted R**2                 0.1749360075645995
Standard Error of Estimate    2.074253035297189
Sum of Squared Residuals      1062.723836646582
Model Sum of Squares          230.5411368391754
Total Sum of Squares          1293.264973485757
F( 1, 247)                    53.58274542797663
F Significance                0.9999999999965213
1/Condition of XPX            0.7121866785207469
Number of Observations        249
Durbin-Watson                 0.9304383379985703

Variable       Coefficient       Std. Error      t
X1       { 0}   0.92008294       0.12569399      7.3200236
CONSTANT { 0}   0.58811535E-01   0.13281015      0.44282410
REG Command. Version 1 February 1997
Real*8 space available 8000000 Real*8 space used 638
OLS Estimation
Dependent variable            Y
Adjusted R**2                 0.3171457852038800
Standard Error of Estimate    1.887043512405973
Sum of Squared Residuals      875.9895715575144
Model Sum of Squares          417.2754019282424
Total Sum of Squares          1293.264973485757
F( 2, 246)                    58.59073681198955
F Significance                1.000000000000000
1/Condition of XPX            0.6446864807753573
Number of Observations        249
Durbin-Watson                 0.7352270766892632

Variable       Coefficient       Std. Error       t
X1       { 0}   0.85966646       0.11465356       7.4979481
X3       { 0}   0.82544163       0.11398725       7.2415260
CONSTANT { 0}  -0.64942535E-01   0.12202611      -0.53220196
REG Command. Version 1 February 1997
Real*8 space available 8000000   Real*8 space used 508

OLS Estimation
Dependent variable            YNLIN
Adjusted R**2                 0.1841644277111131
Standard Error of Estimate    1.883410386043660
Sum of Squared Residuals      876.1669665175107
Model Sum of Squares          202.1314444376096
Total Sum of Squares          1078.298410955120
F( 1, 247)                    56.98282254868798
F Significance                0.9999999999991508
1/Condition of XPX            0.7121866785207469
Number of Observations        249
Durbin-Watson                 0.7345600764854415

Variable       Coefficient       Std. Error       t
X1       { 0}   0.86152861       0.11412945       7.5486967
CONSTANT { 0}  -0.61128207E-01   0.12059089      -0.50690568
The Durbin-Watson tests for the three models were .9304, .7352 and .7346, respectively. The above results show how the Durbin-Watson test, which was developed for time series models, can be used effectively in cross section models to test for equation misspecification. The results suggest that if a nonlinearity is suspected, the data should be sorted against each suspected variable in turn and the recursive coefficients analyzed. The recursively estimated coefficients for x1 and x3, when the data was sorted against x1, are displayed in Figure 7.2. The omitted variable bias is clearly shown by the movement in the x1 coefficient as higher and higher values of x1 are added to the sample.
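The recursive-coefficient idea can be sketched as follows, again on invented data with an omitted variable related to x1; the x1 coefficient drifts as higher x1 values enter the sorted sample:

```python
import numpy as np

# Hedged sketch: recursive OLS on data sorted against x1.  The coefficient
# on x1 is re-estimated as each observation enters; drift in that
# coefficient signals the omitted-variable / nonlinearity problem.
rng = np.random.default_rng(1)
n = 250
x1 = np.sort(rng.normal(size=n))         # data sorted against x1
x2 = x1 ** 2 + rng.normal(size=n)        # omitted, related to x1
y = 1.0 + x1 + x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])

coefs = []
for t in range(20, n + 1):               # start from a small subsample
    b, *_ = np.linalg.lstsq(X[:t], y[:t], rcond=None)
    coefs.append(b[1])
coefs = np.array(coefs)
print(round(coefs[0], 2), round(coefs[-1], 2))
assert abs(coefs[-1] - coefs[0]) > 0.5   # the x1 coefficient moves
```

Plotting `coefs` against the subsample size reproduces the kind of drift displayed in Figure 7.2.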
[Figure: recursively estimated coefficients X1_COEF and X3_COEF plotted against observation number, titled "Omitted Variable Bias x1 and x3 coef".]
Figure 7.2 Recursively estimated X1 and X3 coefficients for X1 Sorted Data
[Figure: plot of the CUSUM test against observation number, with upper and lower 10%, 5% and 1% significance bounds (U10/U5/U1, L10/L5/L1).]
Figure 7.3 CUSUM test of the model estimated with sorted data
Figures 7.3-7.5 show, respectively, the CUSUM, CUSUMSQ and Quandt likelihood ratio tests. Further detail on these tests is contained in Stokes, Specifying and Diagnostically Testing Econometric Models (1997; see also the third edition drafts), Chapter 9. Here we only sketch their use.
Brown, Durbin and Evans (1975) proposed the CUSUM test as a summary measure of whether there is parameter stability. The test consists of plotting the quantity

   W_t = (1/s) * sum_{j=k+1}^{t} w_j,   t = k+1, ..., T,                 (7-11)

where w_j is the normalized recursive residual, s is the standard error of the regression fitted to all T observations, and k is the number of estimated coefficients. The CUSUM test is particularly good at detecting systematic departure of the coefficients, while the CUSUMSQ test is useful when the departure of the coefficients from constancy is haphazard rather than systematic. The CUSUMSQ test involves a plot of s_t, defined as

   s_t = sum_{j=k+1}^{t} w_j^2 / sum_{j=k+1}^{T} w_j^2,   t = k+1, ..., T.      (7-12)

Approximate bounds for W_t and s_t are given in Brown, Durbin and Evans (1975). Assuming a rectangular plot, the upper-right-hand value is 1.0 and the lower-left-hand value is 0.0. A regression with stable coefficients will generate a plot up the diagonal. If the plot lies above the diagonal, the implication is that the regression is tracking poorly in the early subsample in comparison with the total sample. A plot below the diagonal suggests the reverse, namely, that the regression is tracking better in the early subsample than in the complete sample.
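A hedged sketch of the CUSUM and CUSUMSQ quantities in (7-11) and (7-12), computed from one-step-ahead recursive residuals on simulated stable data:

```python
import numpy as np

# Hedged sketch: w_t are normalized recursive residuals (one-step-ahead
# prediction errors scaled by their forecast variance factor).  With stable
# coefficients, CUSUM wanders near zero and CUSUMSQ climbs the diagonal.
rng = np.random.default_rng(2)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

w = []
for t in range(k, n):
    b, *_ = np.linalg.lstsq(X[:t], y[:t], rcond=None)
    f = 1.0 + X[t] @ np.linalg.inv(X[:t].T @ X[:t]) @ X[t]
    w.append((y[t] - X[t] @ b) / np.sqrt(f))   # normalized recursive residual
w = np.array(w)
s = np.sqrt(np.sum(w ** 2) / (n - k))
cusum = np.cumsum(w) / s                       # CUSUM, as in (7-11)
wsq = np.cumsum(w ** 2)
cusumsq = wsq / wsq[-1]                        # CUSUMSQ, as in (7-12)
print(round(cusum[-1], 2), round(cusumsq[-1], 2))
assert cusumsq[-1] == 1.0                      # CUSUMSQ ends at 1.0 by construction
```

Plotting `cusum` against the significance bounds, and `cusumsq` against the diagonal, reproduces the displays in Figures 7.3 and 7.4.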
The Quandt log-likelihood ratio test involves the calculation of log lambda_i, defined as

   log lambda_i = (i/2) log s1^2 + ((T-i)/2) log s2^2 - (T/2) log s^2,      (7-13)

where s1^2, s2^2 and s^2 are the variances of regressions fitted to the first i observations, the last T - i observations and the whole T observations, respectively. The minimum of the plot of log lambda_i can be used to select the "break" in the sample. Although no specific significance tables are available for log lambda_i, the information suggested by the plot can be tested with the multiperiod Chow test, which is discussed next.
If structural change is suspected, a homogeneity test (Chow) over n equal segments can be performed. Given that S is the residual sum of squares from a regression calculated from all T observations and S_j is the residual sum of squares from a regression fitted to the j-th segment alone, the appropriate statistic is distributed as F(k(n-1), T-nk) and defined as

   F = [ (S - sum_{j=1}^{n} S_j) / (k(n-1)) ] / [ sum_{j=1}^{n} S_j / (T-nk) ],      (7-14)

where the numerator degrees of freedom are

   k(n-1),                                                               (7-15)

and the denominator degrees of freedom are

   T - nk,                                                               (7-16)

with k the number of estimated coefficients.
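A sketch of the homogeneity test on simulated data with stable coefficients, splitting the sample into four equal segments; the F statistic stays small, as expected under the null of no structural change (all data here are invented):

```python
import numpy as np

# Hedged sketch of the multi-segment Chow (homogeneity) test: compare the
# pooled residual sum of squares with the sum over separately fitted
# segments.  Under stable coefficients F ~ F(k(n-1), T-nk).
rng = np.random.default_rng(4)
T, k, nseg = 240, 2, 4
x = rng.normal(size=T)
y = 1.0 + 0.5 * x + rng.normal(size=T)   # stable model: no structural change
X = np.column_stack([np.ones(T), x])

def rss(Xs, ys):
    b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    e = ys - Xs @ b
    return e @ e

S = rss(X, y)                            # pooled RSS
seg = T // nseg
S_i = sum(rss(X[j * seg:(j + 1) * seg], y[j * seg:(j + 1) * seg])
          for j in range(nseg))
F = ((S - S_i) / (k * (nseg - 1))) / (S_i / (T - nseg * k))
print(round(F, 3))
assert 0 < F < 5                         # no evidence of change at usual levels
```

Re-running the same sketch on data with an actual break (as in the Quandt example) drives F far above the F(6, 232) critical values.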
[Figure: plot of the CUSUMSQ test against observation number, running from 0 to 1.]
Figure 7.4 CUSUMSQ test of the y model estimated with sorted data.
[Figure: plot of the Quandt likelihood ratio against observation number.]
Figure 7.5 Quandt Likelihood Ratio tests of y model estimated with sorted data.
Remark: If an inadvertently excluded variable is correlated with an included variable, substantial bias in the estimated coefficients can occur. In cross section analysis, if the data is sorted by the included variables, what are usually thought of as time series techniques can be used to determine the nature of the problem. For more complex models, "automatic" techniques such as GAM, MARS and ACE can be employed. These are far too complex to discuss in this introductory analysis.
8. Advanced concepts

A problem with simple OLS models is that there may be situations where the estimated coefficients, or the estimated standard errors, are biased. While space precludes a detailed treatment, some of the problems and their solutions are outlined below.
_________________________________________________________
Table Five
Some Problems and Their Solution

Problem                                      Solution
Y a 0-1 variable.                            PROBIT, LOGIT
Y a bounded variable.                        TOBIT
X's not independent (i.e., X's not           2SLS, 3SLS, LIML, FIML
  orthogonal to e in the population).
Relationship not linear.                     Reparameterize model and/or NLS,
                                             MARS, GAM, ACE
Error not random.                            GLS, weighted least squares
Coefficients changing from changing          Recursive Residual Analysis
  population.
Time series problems.                        ARIMA model, transfer function
                                             model, vector model
Outlier problems.                            L1 & MINIMAX estimation
___________________________________________________________
The 0-1 left-hand variable problem arises if there are only two states for Y. For example, if Y is coded 0 = alive, 1 = dead, then a regression model that predicts more than dead (YHAT > 1) or less than alive (YHAT < 0) is clearly not using all of the information at hand. While the coefficients of an OLS model can be interpreted as partial derivatives, in the 0-1 case this interpretation breaks down. Assume that you have a number of variables and that high values are associated with a high probability of death before 45 years of age. Clearly, since you cannot be more than dead, if all variables are high, an additional unit of one of them will not have the same effect as when all variables were low. For such problems, the appropriate procedure is LOGIT or PROBIT analysis. Due to space and time limitations, these techniques are not illustrated.
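A hedged illustration of the point: on simulated data (all values invented), OLS fitted values for a 0-1 outcome escape the unit interval, while a logit fit by Newton-Raphson maximum likelihood keeps every fitted probability strictly between 0 and 1:

```python
import numpy as np

# Hedged sketch comparing the "linear probability" OLS fit with a logit
# fit for a 0-1 left-hand variable.  Data-generating values are assumed.
rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n, scale=2.0)
p = 1.0 / (1.0 + np.exp(-(0.25 + 1.0 * x)))      # true probabilities
yb = (rng.uniform(size=n) < p).astype(float)     # 0-1 outcome
X = np.column_stack([np.ones(n), x])

# OLS fitted values can leave [0, 1]
b_ols, *_ = np.linalg.lstsq(X, yb, rcond=None)
fit_ols = X @ b_ols

# Logit by Newton-Raphson maximum likelihood
b = np.zeros(2)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ b))
    W = mu * (1.0 - mu)
    b = b + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (yb - mu))
fit_logit = 1.0 / (1.0 + np.exp(-X @ b))

print(fit_ols.min() < 0 or fit_ols.max() > 1)    # OLS leaves the unit interval
assert fit_logit.min() > 0 and fit_logit.max() < 1
```

The logit coefficients are no longer constant partial derivatives: the marginal effect of x varies with the fitted probability, exactly the "cannot be more than dead" point made above.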
A left-hand variable can be bounded on the upper or lower side. Examples of the former include scores on tests; of the latter, money spent on cars. Assume a model where the score on a test (S) is a function of a number of variables, such as study time (ST), age (A), health (H) and experience (E). Clearly, one is going to run into diminishing returns regarding study time. If the number of hours were increased from 200 to 210, the increase in the score would not be the same as if the hours had been increased from 0 to 10. Such problems require TOBIT procedures, which are not illustrated here due to space and time constraints. If an OLS model were fit to such data, the coefficient for the study time variable would understate the effect of study time on exam scores for relatively low total study hours and overstate it for relatively high total study hours.
An important assumption of OLS is that the right-hand variables are independent of the error term. By this we mean that a right-hand variable can be changed without inducing an offsetting change elsewhere in the system. On the other hand, if the system is of the form

   y1 = a0 + a1 y2 + a2 x + e1,                                          (8-1)
   y2 = b0 + b1 y1 + b2 z + e2,                                          (8-2)

then one cannot use a1 as a measure of how y1 will change for a one-unit change in y2, since there will be a feedback effect on y2 from the change in y1 in the second equation, which will occur as y2 changes. Such problems require two-stage least squares (2SLS) or limited information maximum likelihood (LIML) estimation. In addition, if the possible relationship between the error terms e1 and e2 is taken into consideration, three-stage least squares (3SLS) and/or full information maximum likelihood (FIML) estimation procedures should be used. These more advanced techniques will not be discussed further here except to say that the appropriate procedures are available.
In OLS estimation, there is always the danger that an estimated linear model is being used to capture a nonlinear process. Over a short data range, a nonlinear process can look like a linear process. In the preliminary research on how to kill live poliovirus in order to make a killed vaccine, a graph was used to show that increased percentages of the virus were killed as more heat was applied. A straight line was fit and the appropriate temperature was selected. Much to the surprise of the researchers, it was later determined that the relationship was not linear; in fact, proportionately more heat was required the lower the percentage of live virus remaining. Poor statistical methodology resulted in people unexpectedly getting polio from the vaccine.
One way to determine if the relationship is nonlinear is to put power and interaction terms in the regression. The problem is that it is easy to exhaust available CPU time and researcher time before all possibilities have been tested. The recursive residual procedure, which involves starting from a model estimated on a small sample and recursively re-estimating as observations are added, provides a way to detect if there are problems in the initial specification. More detail on this approach is provided in Stokes, Specifying and Diagnostically Testing Econometric Models (1997), Chapter 9.
A brief introduction was given in section 7. The essential idea is that if the data set is sorted against one of the right-hand variables and regressions are run, adding one observation at a time, a plot or list of the coefficients will indicate whether they are stable for different ranges of the sorted variable. If other coefficients change, it indicates the need for interaction terms. If the sorted variable coefficient changes, it indicates that there is a nonlinear relationship. If time is the variable for which the sort is made, it suggests that over time the coefficients are shifting.
The OLS model for time series data can be shown to be a special case of the more general ARIMA, transfer function and vector autoregressive moving average models. A preliminary look at some of these models and their uses is presented in the paper "The Box-Jenkins Approach-When Is It a Cost-effective Alternative," which I wrote with Hugh Neuburger in the Columbia Journal of World Business (Vol. XI, No. 4, Winter 1976). As noted, if we were to write the model as
   y_t = [ω(B)/δ(B)] x_t + [θ(B)/φ(B)] e_t,                              (8-3)

where B is the lag operator, a more complex lag structure can be modeled than in the simple OLS case. If ω(B) = 0, then we have an ARIMA model and are modeling y_t as a function of past shocks alone. If θ(B)/φ(B) = 1, then we have a rational distributed lag model. If neither restriction holds, we have a transfer function model. Systems of transfer function type models can be estimated using simultaneous transfer function estimation techniques or vector model estimation procedures. Space limits more comprehensive discussion of this important and general class of models beyond this brief treatment.
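A simple special case of (8-3) can be simulated directly; here the noise term is white, giving a rational distributed lag with geometrically declining weights on past x (all parameter values are assumed for illustration):

```python
import numpy as np

# Hedged sketch of a rational distributed lag, a special case of (8-3):
# y_t = [omega/(1 - delta*B)] x_t + e_t, where B is the lag operator.
# The implied weights on lagged x decline geometrically as omega*delta**j.
rng = np.random.default_rng(7)
n, omega, delta = 400, 1.0, 0.6
x = rng.normal(size=n)
e = rng.normal(size=n, scale=0.5)
ystar = np.zeros(n)
for t in range(1, n):
    ystar[t] = delta * ystar[t - 1] + omega * x[t]   # (1 - delta*B) y* = omega*x
y = ystar + e

weights = omega * delta ** np.arange(5)              # implied lag weights
print(np.round(weights, 3))
assert abs(np.corrcoef(y, x)[0, 1]) > 0.5
```

Setting omega to zero and giving e its own ARMA structure recovers the pure ARIMA case, in which y_t depends on past shocks alone.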
9. Summary
These short notes have attempted to outline the scope of elementary applied statistics. Students are encouraged to experiment with the sample data sets to perform further analysis.
* Editorial assistance was provided by Diana A. Stokes. Important suggestions were made by Evelyn Lehrer on a prior draft. I am responsible for any errors or omissions.