TRANSCRIPT
Sloan School of Management
Massachusetts Institute of Technology
Cambridge 39, Massachusetts
December, 1964
Multicollinearity in Regression Analysis
The Problem Revisited
105-64
D. E. Farrar and R. R. Glauber*
This paper is a draft for private circulation and comment. It should not be cited, quoted, or reproduced without permission of the authors. The research has been supported by the Institute of Naval Studies, and by grants from the Ford Foundation to the Sloan School of Management, and the Harvard Business School.
*Sloan School of Management, and Graduate School of Business Administration, Harvard University, respectively.
CONTENTS
Multicollinearity Problem
Nature and Effects
Estimation
Illustration
Specification
Historical Approaches
Econometric
Computer Programming
Problem Revisited
Definition
Diagnosis
General
Specific
Illustration
Summary
MULTICOLLINEARITY IN REGRESSION ANALYSIS
THE PROBLEM REVISITED
To most economists the single equation least squares regression model, like an old friend, is tried and true. Its properties and limitations have been extensively studied, documented and are, for the most part, well known. Any good text in econometrics can lay out the assumptions on which the model is based and provide a reasonably coherent, perhaps even a lucid, discussion of problems that arise as particular assumptions are violated. A short bibliography of definitive papers on such classical problems as non-normality, heteroscedasticity, serial correlation, feedback, etc., completes the job.
As with most old friends, however, the longer one knows least squares, the more one learns about it. An admiration for its robustness under departures from many assumptions is sure to grow. The admiration must be tempered, however, by an appreciation of the model's sensitivity to certain other conditions. The requirement that independent variables be truly independent of one another is one of these.
Proper treatment of the model's classical problems ordinarily involves two separate stages, detection and correction. The Durbin-Watson test for serial correlation, combined with Cochrane and Orcutt's suggested first differencing procedure, is an obvious example.* Bartlett's test for variance heterogeneity followed by a data transformation to restore homoscedasticity is another.* No such "proper treatment" has been developed, however, for problems that arise as multicollinearity is encountered in regression analysis.

*J. Durbin and G. S. Watson, "Testing for Serial Correlation in Least Squares Regression," Biometrika, 37-8, 1950-1; D. Cochrane and G. H. Orcutt, "Application of Least Squares Regressions to Relationships Containing Auto-Correlated Error Terms," Journal of the American Statistical Association, 44, 1949.
Our attention here will focus on what we consider to be the first step in a proper treatment of the problem: its detection, or diagnosis. Economists generally agree that the second step, correction, requires the generation of additional information. Just how this information is to be obtained depends largely on the tastes of an investigator and on the specifics of a particular problem. It may involve additional primary data collection, the use of extraneous parameter estimates from secondary data sources, or the application of subjective information through constrained regression, or through Bayesian estimation procedures. Whatever its source, however, selectivity, and thereby efficiency, in generating the added information requires a systematic procedure for detecting its need, i.e., for detecting the existence, measuring the extent, and pinpointing the location and causes of multicollinearity within a set of independent variables. Measures are proposed here that, in our opinion, fill this need.
The paper's basic organization can be outlined briefly as follows. In the next section the multicollinearity problem's basic, formal nature is developed and illustrated. A discussion of historical approaches to the problem follows. With this as background, an attempt is made to define multicollinearity in terms of departures from a hypothesized statistical condition, and to fashion a series of hierarchical measures, at each of three levels of detail, for its presence, severity, and location within a set of data. Tests are developed in terms of a generalized, multivariate normal, linear model. A pragmatic interpretation of resulting statistics as dimensionless measures of correspondence between hypothesized and sample properties, rather than in terms of classical probability levels, is advocated. A numerical example and a summary complete the exposition.

*F. David and J. Neyman, "Extension of the Markoff Theorem on Least Squares," Statistical Research Memoirs, II, London, 1938.

**J. Johnston, Econometric Methods, McGraw-Hill, 1963, p. 207; J. Meyer and R. Glauber, Investment Decisions, Economic Forecasting and Public Policy, Division of Research, Graduate School of Business Administration, Harvard University, 1964, p. 181 ff.
THE MULTICOLLINEARITY PROBLEM
NATURE AND EFFECTS
The purpose of regression analysis is to estimate the parameters of a dependency, not of an interdependency, relationship. We define first

Y, X as observed values, measured as standardized deviates, of the dependent and independent variables,
β as the true (structural) coefficients,
u as the true (unobserved) error term, with distributional properties specified by the general linear model,* and
σᵤ² as the underlying, population variance of u;

and presume that Y and X are related to one another through the linear form

Y = Xβ + u
Least squares regression analysis leads to estimates

β̂ = (X'X)⁻¹X'Y

with variance-covariance matrix

σᵤ²(X'X)⁻¹

that, in a variety of senses, best reproduce the hypothesized dependency relationship.

*See for example, J. Johnston, op. cit., Ch. 4; or F. Graybill, An Introduction to Linear Statistical Models, McGraw-Hill, 1961.
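The estimator and its variance-covariance matrix can be checked numerically. A minimal sketch in Python with numpy, using synthetic standardized data (the variables, sample size, and coefficients below are illustrative assumptions, not data from the paper):

```python
import numpy as np

# Synthetic, standardized data (illustrative only, not from the paper).
rng = np.random.default_rng(0)
n = 50
X = rng.standard_normal((n, 2))
u = 0.5 * rng.standard_normal(n)
Y = X @ np.array([0.7, 0.3]) + u

# Standardize, matching the paper's "standardized deviates" convention.
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean()) / Y.std()

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y        # estimates (X'X)^-1 X'Y

resid = Y - X @ beta_hat
s2 = resid @ resid / (n - 2)        # sample estimate of the error variance
cov_beta = s2 * XtX_inv             # variance-covariance matrix s^2 (X'X)^-1
```

The defining property of the least squares solution, that the residuals are orthogonal to every column of X, holds here to machine precision.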
Multicollinearity, on the other hand, is viewed here as an interdependency condition. It is defined in terms of a lack of independence, or of the presence of interdependence, signified by high intercorrelations (X'X) within a set of variables, and under this view can exist quite apart from the nature, or even the existence, of a dependency relationship between X and a dependent variable Y. Multicollinearity is not important to the statistician for its own sake. Its significance, as contrasted with its definition, comes from the effect of interdependence in X on the dependency relationship whose parameters are desired. Multicollinearity constitutes a threat, and often a very serious threat, both to the proper specification and to the effective estimation of the type of structural relationships commonly sought through the use of regression techniques.
The single equation, least squares regression model is not well equipped to cope with interdependent explanatory variables. In its original and most simple form the problem is not even conceived. Values of X are presumed to be the pre-selected, controlled elements of a classical, laboratory experiment.* Least squares models are not limited, however, to simple, fixed variate, or fully controlled experimental situations.** Partially controlled or completely uncontrolled experiments, in which X as well as Y is subject to random variation, and therefore also to multicollinearity, may provide the data on which perfectly legitimate regression analyses are based.

*Kendall puts his finger on the essence of the simple, fixed variate, regression model when he remarks that standard multiple regression is not multivariate at all, but is univariate. Only Y is presumed to be generated by a process that includes stochastic elements. M. G. Kendall, A Course in Multivariate Analysis, pp. 68-9.

**See Model 3, Graybill, op. cit., p. 104.
Though not limited to fully controlled, fixed variate experiments, the regression model, like any other analytical tool, is limited to good experiments if good results are to be insured. And a good experiment must provide as many dimensions of independent variation in the data it generates as there are in the hypothesis it handles. An n dimensional hypothesis, implied by a regression equation containing n independent variables, can be neither properly estimated nor properly tested, in its entirety, by data that contains fewer than n significant dimensions.
In most cases an analyst may not be equally concerned about the structural integrity of each parameter in an equation. Indeed, it is often suggested that for certain forecasting applications structural integrity anywhere, although always nice to have, may be of secondary importance. This point will be considered later. For the moment it must suffice to emphasize that multicollinearity is both a symptom and a facet of poor experimental design. In a laboratory, poor design occurs only through the improper use of control. In life, where most economic experiments take place, control is minimal at best, and multicollinearity is an ever present and seriously disturbing fact of life.
Estimation
Difficulties associated with a multicollinear set of data depend, of course, on the severity with which the problem is encountered. As interdependence among explanatory variables X grows, the correlation matrix (X'X) approaches singularity, and elements of the inverse matrix (X'X)⁻¹ explode. In the limit, perfect linear dependence within an independent variable set leads to perfect singularity on the part of (X'X) and to a completely indeterminate set of parameter estimates. In a formal sense, diagonal elements of the inverse matrix (X'X)⁻¹ that correspond to linearly dependent members of X become infinite. Variances for the affected variables' regression coefficients, σᵤ²(X'X)⁻¹, accordingly, also become infinite.
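The explosion of the inverse can be seen directly. In the two variable case with standardized data, (X'X) is the 2x2 correlation matrix of the regressors and the diagonal of its inverse equals 1/(1 - r²), where r is their simple correlation. A short numpy sketch (the values of r are illustrative):

```python
import numpy as np

# As r -> 1 the diagonal of (X'X)^-1, 1/(1 - r^2), and with it the
# coefficient variances sigma_u^2 (X'X)^-1, grow without bound.
for r in (0.0, 0.9, 0.99, 0.999):
    XtX = np.array([[1.0, r],
                    [r, 1.0]])
    diag = np.diag(np.linalg.inv(XtX))[0]   # equals 1 / (1 - r^2)
    print(f"r = {r:5.3f}   diagonal of (X'X)^-1 = {diag:8.2f}")
```

At r = .9 the coefficient variance is already about five times its orthogonal-data value; at r = .999 it is some five hundred times larger.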
The mathematics, in its brute and tactless way, tells us that explained variance can be allocated completely arbitrarily between linearly dependent members of a completely singular set of variables, and almost arbitrarily between members of an almost singular set. Alternatively, the large variances on regression coefficients produced by multicollinear independent variables indicate, quite properly, the low information content of observed data and, accordingly, the low quality of resulting parameter estimates. It emphasizes one's inability to distinguish the independent contribution to explained variance of an explanatory variable that exhibits little or no truly independent variation.

In many ways a person whose independent variable set is completely interdependent may be more fortunate than one whose data is almost so; for the former's inability to base his model on data that cannot support its information requirements will be discovered by a purely mechanical inability to invert the singular matrix (X'X), while the latter's problem, in most cases, will never be fully realized.
Difficulties encountered in the application of regression techniques to highly multicollinear independent variables can be discussed at great length, and in many ways. One can state that the parameter estimates obtained are highly sensitive to changes in model specification, to changes in sample coverage, or even to changes in the direction of minimization; but lacking a simple illustration, it is difficult to endow such statements with much meaning.

Illustrations, unfortunately, are plentiful in economic research. A simple Cobb-Douglas production function provides an almost classical example of the instability of least squares parameter estimates when derived from collinear data.
Illustration
The Cobb-Douglas production function* can be expressed most simply as

P = L^β₁ C^β₂ e^(a+u)

where

P is production or output,
L is labor input,
C is capital input,
β₁, β₂ are parameters to be estimated, and
u is an error or residual term.

Should β₁ + β₂ = 1, proportionate changes in inputs generate equal, proportionate changes in expected output, the production function is linear, homogeneous, and several desirable conditions of welfare economics are satisfied. Cobb and Douglas set out to test the homogeneity hypothesis empirically. Structural estimates, accordingly, are desired.
Twenty-four annual observations on aggregate employment, capital stock and output for the manufacturing sector of the U.S. economy, 1899-1922, are collected. β₂ is set equal to 1 - β₁ and the value of the labor coefficient, β₁ = .75, is estimated by constrained least squares regression analysis. By virtue of the constraint, β₂ = 1 - .75 = .25. Cobb and Douglas are satisfied that structural estimates of labor and capital coefficients for the manufacturing sector of the economy have been obtained.
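The constrained procedure can be sketched in a few lines. Imposing β₁ + β₂ = 1 in log P = a + β₁ log L + β₂ log C + u amounts to regressing log(P/C) on log(L/C). Synthetic series stand in below for the 1899-1922 data, which are not reproduced in the text; all numbers are illustrative assumptions:

```python
import numpy as np

# Synthetic stand-ins for the 24 annual series (illustrative only).
rng = np.random.default_rng(1)
t = np.arange(24.0)
logL = 0.03 * t + 0.10 * rng.standard_normal(24)
logC = 0.06 * t + 0.10 * rng.standard_normal(24)
logP = 0.75 * logL + 0.25 * logC + 0.05 * rng.standard_normal(24)

# Imposing b1 + b2 = 1 reduces the model to log(P/C) = a + b1 log(L/C) + u.
y = logP - logC
x = logL - logC
A = np.column_stack([np.ones_like(x), x])
a_hat, b1 = np.linalg.lstsq(A, y, rcond=None)[0]
b2 = 1.0 - b1                       # recovered from the constraint
print(b1, b2)                       # near the 0.75 / 0.25 split built in above
```

The constraint buys one dimension: only the single composite regressor log(L/C) must vary independently, which is why the constrained fit is better behaved than the unconstrained one discussed below.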
*C. W. Cobb and P. H. Douglas, "A Theory of Production," American Economic Review, XVIII, Supplement, March 1928. See H. Mendershausen, "On the Significance of Professor Douglas' Production Function," Econometrica, 6, April 1938; and D. Durand, "Some Thoughts on Marginal Productivity with Special Reference to Professor Douglas' Analysis," Journal of Political Economy, 45, 1937.
Ten years later Mendershausen reproduces Cobb and Douglas' work and demonstrates, quite vividly, both the collinearity of their data and the sensitivity of results to sample coverage and direction of minimization. Using the same data we have attempted to reconstruct both Cobb and Douglas' original calculations and Mendershausen's replication of same. A demonstration of least squares' sensitivity to model specification, i.e., to the composition of an equation's independent variable set, is added. By being unable to reproduce several of Mendershausen's results, we inadvertently (and facetiously) add "sensitivity to computation error" to the table of pitfalls commonly associated with multicollinearity in regression analysis. Our own computations are summarized below.
Parameter estimates for the Cobb-Douglas model are linear in logarithms of the variables P, L, and C. Table 1 contains simple correlations between these variables, in addition to an arithmetic trend, t. Multicollinearity within the data is indicated by the high intercorrelations between all variables, a common occurrence when aggregative time series data are used.

The sensitivity of parameter estimates to virtually any change in the delicate balance obtained by a particular sample of observations, and a particular model specification, may be illustrated forcefully by the wildly fluctuating array of labor and capital coefficients, β₁ and β₂, summarized in Table 2.
Equations (a) and (b) replicate Cobb and Douglas' original, constrained estimates of labor and capital coefficients with and without a trend to pick up the impact on productivity of technological change.* Cobb and Douglas comment on the increase in correlation that results from adding a trend to their production function,** but neglect to report the change in production coefficients that accompany the new specification (equation b, Table 2).

*The capital coefficient β₂ is constrained to equal 1 - β₁ by regressing the logarithm of P/C on the logarithm of L/C, with and without a term for trend, β₃t.
Table 1
Simple Correlations, Cobb-Douglas Production Function
[correlation matrix among the logarithms of P, L, C, and the trend t; individual entries lost in transcription]
To four places, r_Ct = .99683.
The sensitivity of coefficient estimates to changes in model specification is demonstrated even more strikingly, however, by equations (c) and (d), Table 2. Here estimates of the logarithm of production are based directly on logarithms of labor and capital, with and without a term for trend. β₁ and β₂, accordingly, are not constrained to add to unity. The relationship obtained in equation (c) is very nearly linear, homogeneous. Cobb and Douglas, clearly, would be delighted by this specification. Unconstrained parameters closely resemble their own constrained estimates of labor and capital coefficients, and the observed relationship between dependent and independent variables is strong, both individually (as indicated by t-ratios), and jointly (as measured by R², the coefficient of determination).

**Cobb, Douglas, op. cit., p. 154.
By adding a trend to the independent variable set, however, the whole system rotates wildly. The capital coefficient β₂ = .23 that makes equation (c) so reassuring becomes β₂ = … in equation (d). Labor and technological change, of course, pick up and divide between themselves much of capital's former explanatory power. Neither, however, assumes a value that, on a priori grounds, can be dismissed as outrageous; and, it must be noted, each variable's individual contribution to explained variance (measured by t-ratios) continues to be strong. Despite the fact that both trend and capital, and labor too, for that matter, carry large standard errors, no "danger light" is flashed by conventional statistical measures of goodness of fit. Indeed, R² and t-ratios for individual coefficients are sufficiently large in either equation to lull a great many econometricians into a wholly unwarranted sense of security.
Evidence of the instability of multicollinear regression estimates under changes in the direction of minimization is illustrated by equations (e) - (f), Table 2. Here capital and labor, respectively, serve as dependent variables for estimation purposes, after which the desired (labor and capital) coefficients are derived algebraically. Labor coefficients, β₁ = .05 and …, and capital coefficients, β₂ = .60 and .01, are derived from the same data that produces Cobb and Douglas' β₁ = .75, β₂ = .25 division of product between labor and capital.
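The direction-of-minimization effect is easy to reproduce in miniature. A sketch on synthetic data (illustrative, not the Cobb-Douglas series): regressing y on x minimizes errors in y, while regressing x on y and inverting the slope minimizes errors in x, and with noisy, weakly determined data the two answers diverge sharply:

```python
import numpy as np

# Two-variable sketch of sensitivity to the direction of minimization.
rng = np.random.default_rng(2)
x = rng.standard_normal(200)
y = 0.5 * x + rng.standard_normal(200)      # substantial residual variation

b_forward = (x @ y) / (x @ x)   # slope from minimizing errors in y
b_reverse = (y @ y) / (x @ y)   # slope implied by the x-on-y regression
print(b_forward, b_reverse)     # the second is several times the first
```

The two slopes coincide only when the correlation is perfect; the lower the correlation between the variables, the wider the wedge between the forward and reverse estimates.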
Following Mendershausen's lead, equations (g) - (i), Table 2, illustrate the model's sensitivity to the few non-collinear observations in the sample. By omitting a single year (1908, …

…factory representation of postwar United States consumer behavior. Multicollinearity, unfortunately, contributes to difficulty in the specification as well as in the estimation of economic relationships.
Model specification ordinarily begins in the model builder's mind. From a combination of theory, prior information, and just plain hunch, variables are chosen to explain the behavior of a given dependent variable. The job, however, does not end with the first tentative specification. Before an equation is judged acceptable it must be tested on a body of empirical data. Should it be deficient in any of several respects, the specification, and thereby the model builder's "prior hypothesis," is modified and tried again. The process may go on for some time. Eventually discrepancies between prior and sample information are reduced to tolerable levels and an equation acceptable in both respects is produced.

In concept the process is sound. In practice, however, the econometrician's mind is more fertile than his data, and the process of modifying a hypothesis consists largely of paring down rather than of building up model complexity. Having little confidence in the validity of his prior information, the economist tends to yield too easily to a temptation to reduce his model's scope to that of his data.
Each sample, of course, covers only a limited range of experience. A relatively small number of forces are likely to be operative over, or during, the subset of reality on which a particular set of observations is based. As the number of variables extracted from the sample increases, each tends to measure different nuances of the same, few, basic factors that are present. The sample's basic information is simply spread more and more thinly over a larger and larger number of increasingly multicollinear independent variables.

However real the dependency relationship between Y and each member of a relatively large independent variable set X may be, the growth of interdependence within X as its size increases rapidly decreases the stability, and therefore the sample significance, of each independent variable's contribution to explained variance. As Liu points out, data limitations rather than theoretical limitations are largely responsible for a persistent tendency to underspecify, or to oversimplify, econometric models.* The increase in sample standard errors for multicollinear regression coefficients virtually assures a tendency for relevant variables to be discarded incorrectly from regression equations.
The econometrician, then, is in a box. Whether his goal is to estimate complex structural relationships in order to distinguish between alternative hypotheses, or to develop reliable forecasts, the number of variables required is likely to be large, and past experience demonstrates with depressing regularity that large numbers of economic variables from a single sample space are almost certain to be highly intercorrelated. Regardless of the particular application, then, the essence of the multicollinearity problem is the same. There exists a substantial difference between the amount of information required for the satisfactory estimation of a model and the information contained in the data at hand.
If the model is to be retained in all its complexity, solution of the multicollinearity problem requires an augmentation of existing data to include additional information. Parameter estimates for an n dimensional model (e.g., a two dimensional production function) cannot properly be based on data that contains fewer significant dimensions. Neither can such data provide a basis for discriminating between alternative formulations of the model. Even for forecasting purposes the econometrician whose data is multicollinear is in an extremely exposed position. Successful forecasts with multicollinear variables require not only the perpetuation of a stable dependency relationship between Y and X, but also the perpetuation of stable interdependency relationships within X. The second condition, unfortunately, is met only in a context in which the forecasting problem is all but trivial.

*T. C. Liu, "Underidentification, Structural Estimation and Forecasting," Econometrica, 28, October 1960, p. 856.
The alternative of scaling down each model to fit the dimensionality of a given set of data appears equally unpromising. A set of substantially orthogonal independent variables can in general be specified only by discarding much of the prior theoretical information that a researcher brings to his problem. Time series analyses containing more than one or two independent variables would virtually disappear, and forecasting models too simple to provide reliable forecasts would become the order of the day. Consumption functions that include either income or liquid assets, but not both, provide an appropriate warning.
There is, perhaps, a middle ground. All the variables in a model are seldom of equal interest. Theoretical questions ordinarily focus on a relatively small portion of the independent variable set. Cobb and Douglas, for example, are interested only in the magnitude of labor and capital coefficients, not in the impact on output of technological change. Disputes concerning alternative consumption, investment, and cost of capital models similarly focus on the relevance of, at most, one or two disputed variables. Similarly, forecasting models rely for success mainly on the structural integrity of those variables whose behavior is expected to change. In each case certain variables are strategically important to a particular application while others are not.
Multicollinearity, then, constitutes a problem only if it undermines that portion of the independent variable set that is crucial to the analysis in question: labor and capital for Cobb and Douglas; income and liquid assets for postwar consumption forecasts. Should these variables be multicollinear, corrective action is necessary. New information must be obtained. Perhaps it can be extracted by stratifying or otherwise reworking existing data. Perhaps entirely new data is required. Wherever information is to be sought, however, insight into the pattern of interdependence that undermines present data is necessary if the new information is not to be similarly affected.
Current procedures and summary statistics do not provide effective indications of multicollinearity's presence in a set of data, let alone the insight into its location, pattern, and severity that is required if a remedy in the form of selective additions to information is to be obtained. The current paper attempts to provide appropriate "diagnostics" for this purpose.

Historical approaches to the problem will both facilitate exposition and complete the necessary background for the present approach to multicollinearity in regression analysis.
HISTORICAL APPROACHES
Historical approaches to multicollinearity may be organized in any of a number of ways. A very convenient organization reflects the tastes and backgrounds of two types of persons who have worked actively in the area. Econometricians tend to view the problem in a relatively abstract manner. Computer programmers, on the other hand, see multicollinearity as just one of a relatively large number of contingencies that must be anticipated and treated. Theoretical statisticians, drawing their training, experience and data from the controlled world of the laboratory experiment, are noticeably uninterested in the problem altogether.
Econometric
Econometricians typically view multicollinearity in a very matter-of-fact, if slightly schizophrenic, fashion. They point out on the one hand that least squares coefficient estimates,

β̂ = β + (X'X)⁻¹X'u,

are "best linear unbiased," since the expectation of the last term is zero regardless of the degree of multicollinearity inherent in X, if the model is properly specified and feedback is absent. Rigorously demonstrated, this proposition is often a source of great comfort to the embattled practitioner. At times it may justify complacency.
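The "unbiased yet unstable" character of least squares under collinearity shows up clearly in a small Monte Carlo sketch (all numbers below are illustrative assumptions): across repeated samples the estimates average out to the true coefficients, but any single estimate may sit far from them.

```python
import numpy as np

# Monte Carlo: highly collinear X, true beta = (1, 1).
rng = np.random.default_rng(3)
beta = np.array([1.0, 1.0])
draws = []
for _ in range(2000):
    z = rng.standard_normal(30)
    X = np.column_stack([z, z + 0.05 * rng.standard_normal(30)])  # r near 1
    Y = X @ beta + rng.standard_normal(30)
    draws.append(np.linalg.lstsq(X, Y, rcond=None)[0])
draws = np.asarray(draws)
print(draws.mean(axis=0))   # close to (1, 1): the estimates are unbiased
print(draws.std(axis=0))    # but the spread is enormous: any one draw is unreliable
```

Unbiasedness, in other words, is a statement about the average over hypothetical repeated samples; it offers no protection at all to the practitioner who holds exactly one sample.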
On the other hand, we have seen that multicollinearity imparts a substantial bias toward incorrect model specification.* It has also been shown that poor specification undermines the "best linear unbiased" character of parameter estimates over multicollinear, independent variable sets.* Complacency, then, tends to be short lived, giving way alternatively to despair as the econometrician recognizes that non-experimental data, in general, is multicollinear and that in principle nothing can be done about it.* Or, to use Jack Johnston's words, one is in the statistical position of not being able to make bricks without straw.* Data that does not possess the information required by an equation cannot be expected to yield it. Admonitions that new data or additional a priori information are required to "break" the multicollinearity are hardly reassuring, for the gap between information on hand and information required is so often immense.

*Liu, idem.

Together, the combination of complacency and despair that characterizes traditional views tends virtually to paralyze efforts to deal with multicollinearity as a legitimate and difficult, yet tractable, econometric problem. There are, of course, exceptions. Two are discussed below.
Artificial Orthogonalization. The first is proposed by Kendall,* and illustrated with data from a demand study by Stone.* Employed correctly, the method is an example of a solution to the multicollinearity problem that proceeds by reducing a model's information requirements to the information content of existing data. On the other hand, perverse applications lead to parameter estimates that are even less satisfactory than those based on the original set of data.

*H. Theil, "Specification Errors and the Estimation of Economic Relationships," Review of the International Statistical Institute, 1957.
*H. Theil, Economic Forecasts and Policy, North Holland, 1962; p. 216.
*J. Johnston, op. cit., p. 207.
*J. Johnston, idem., and H. Theil, op. cit., p. 217.
*M. G. Kendall, op. cit., pp. 70-75.
*R. N. Stone, "The Analysis of Market Demand," Journal of the Royal Statistical Society, CVIII, III, 1945; pp. 286-382.
Given a set of interdependent explanatory variables X, and a hypothesized dependency relationship

Y = Xβ + u,

Kendall proposes "[to throw] new light on certain old but unsolved problems; particularly (a) how many variables do we take? (b) how do we discard the unimportant ones? and (c) how do we get rid of multicollinearities in …?"*

His solution, briefly, runs as follows. Defining

X as a matrix of observations on n multicollinear explanatory variables,
F as a set of m ≤ n orthogonal components or common factors,
U as a matrix of n derived residual (or unique) components, and
Λ as the constructed set of (m x n) normalized factor loadings,

Kendall decomposes X into a set of statistically significant orthogonal common factors F and residual components U such that

X = FΛ + U

exhausts the sample's observed variation. Regressing Y on the significant factors yields estimators B*; one may then return from factor to variable space by transforming Λ and B* into estimates

B** = Λ'B*

of the structural parameters

Y = Xβ + u

originally sought.

*Kendall, op. cit., p. 70.
In the special (component analysis) case in which all m = n factors are obtained, and retained in the regression equation, "[nothing has been lost] by the transformation except the time spent on the arithmetical labor of finding it."* By the same token, however, nothing has been gained, for the Gauss-Markoff theorem insures that coefficient estimates B** are identical to the estimates β̂ that would be obtained by the direct application of least squares to the original, highly unstable, set of variables. Moreover, all m = n factors will be found significant and retained only in those instances in which the independent variable set, in fact, is not seriously multicollinear.
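The "nothing lost, nothing gained" point is easy to verify numerically. A sketch on synthetic data (the component analysis case, with U = 0 and all components retained; data and dimensions are illustrative): regressing Y on every principal component of X and mapping the coefficients back to variable space reproduces direct least squares exactly.

```python
import numpy as np

# Synthetic, collinear, standardized data (illustrative only).
rng = np.random.default_rng(4)
n = 40
z = rng.standard_normal(n)
X = np.column_stack([z + 0.1 * rng.standard_normal(n) for _ in range(3)])
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = rng.standard_normal(n)

# Orthogonal components F = XV, with V the eigenvectors of X'X.
_, V = np.linalg.eigh(X.T @ X)
F = X @ V
b_star = np.linalg.lstsq(F, Y, rcond=None)[0]   # regression in factor space
b_back = V @ b_star                              # transformed back to variable space
b_ols = np.linalg.lstsq(X, Y, rcond=None)[0]    # direct least squares
print(np.allclose(b_back, b_ols))               # True: identical estimates
```

Because V is orthogonal, V(V'X'XV)⁻¹V'X'Y collapses algebraically to (X'X)⁻¹X'Y; the round trip through factor space changes nothing unless some components are actually discarded.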
In general, therefore, Kendall's procedure derives n parameter estimates,

B** = Λ'B*,

from an m dimensional independent variable set,

Y = FB* + e* = (X-U)Λ'B* + e*,

whose total information content is both lower and less well defined than for the original set of variables. The rank of X-U, clearly, is never greater, and usually is smaller, than the rank of X. Multicollinearity, therefore, is intensified rather than alleviated by the series of transformations. Indeed, by discarding the residual, or perhaps the "unique," portion of an independent variable's variation, one is seriously in danger of throwing out the baby rather than the bath, i.e., the independent rather than the redundant dimensions of information.

*Kendall, op. cit., p. 70.
Kendall's approach i s not without attractions . Should
factors permit identif ication and use as variable s in their own
right,the transformation provide s a somewhat defensible solu
tion to the multicollinearity problem . The di screpancy between
apparent and significant dimensions ( in model and data,re sp ec
tively ) i s eliminated by a meaningful reduction in the number of
the model's parameters . Even where factors cannot be used
directly , their derivation provides insight into the pattern of
interdependence that undermine s the structural stability of
e stimate s based on the original set of variable s .
The shortcoming of this approach lies in its prescriptions for handling those situations in which the data do not suggest a reformulation that reduces the model's information requirements, i.e., where components cannot be interpreted directly as economic variables. In such circumstances, solution of the multicollinearity problem requires the application of additional information, rather than the further reduction of existing information. Methods that retain a model's full complexity while reducing the information content of existing data aggravate rather than alleviate the multicollinearity problem.
Rules of Thumb. A second and more pragmatic line of attack recognizes the need to live with poorly conditioned, non-experimental data, and seeks to develop rules of thumb by which "acceptable" departures from orthogonality may be distinguished from "harmful" degrees of multicollinearity.
The term "harmful multicollinearity" is generally defined only symptomatically, as the cause of wrong signs or other symptoms of nonsense regressions. Such a practice's inadequacy may be illustrated, perhaps, by the ease with which the same argument can be used to explain right signs and sensible regressions from the same basic set of data. An operational definition of harmful multicollinearity, however inadequate it may be, is clearly preferable to the methodological sleight-of-hand that symptomatic definitions make possible.
The simplest operational definition of unacceptable collinearity makes no pretense to theoretical validity. An admittedly arbitrary rule of thumb is established to constrain simple correlations between explanatory variables to less than, say, r = .8 or .9. The most obvious type of pairwise sample interdependence, of course, can be avoided in this fashion.
More elaborate and apparently sophisticated rules of thumb also exist. One, that has lingered in the background of econometrics for many years, has recently gained sufficient stature to be included in an elementary text.† The rule holds, essentially, that "intercorrelation or multicollinearity is not necessarily a problem unless it is high relative to the over-all degree of multiple correlation." Or, more specifically, if

    r_12  is the (simple) correlation between two independent variables, and

    R_y   is the multiple correlation between dependent and independent variables,

† Klein, An Introduction to Econometrics, Prentice-Hall, p. 101.
multicollinearity is said to be "harmful" if

    r_12 ≥ R_y .

By this criterion the Cobb-Douglas production function is not seriously collinear, for multiple correlation R_y = .98 is comfortably greater than the simple correlation between (logarithms of) labor and capital, r_12 = .91. Ironically, this is the application chosen by the textbook to illustrate the rule's validity.†
"
Although its origin is unknown, the rule's intuitive appeal appears to rest on the geometrical concept of a triangle formed by the end points of three vectors (representing variables Y, X1, and X2, respectively) in N dimensional observation space (reduced to three dimensions in Figure 1). Ŷ is represented by the perpendicular (least squares) reflection of Y onto the X1, X2 plane. Multiple correlation R_y is defined by the
[Figure 1: the vectors Y, X1, and X2 in N dimensional observation space, with Ŷ the projection of Y onto the X1, X2 plane]
† Idem.
direction cosine between Y and Ŷ, while simple correlation r_12 is the direction cosine between X1 and X2. Should multiple correlation be greater than simple correlation, the triangle's base X1X2 is greater than its height YŶ, and the dependency relationship appears to be "stable."
Despite its intuitive appeal, the concept may be attacked on at least two grounds. First, on extension to multiple dimensions it breaks down entirely. Complete multicollinearity, i.e., perfect singularity within a set of explanatory variables, is quite consistent with very small simple correlations between members of X. A set of dummy variables whose non-zero elements accidentally exhaust the sample space is an obvious, and an aggravatingly common, example. Second, the Cobb-Douglas production function provides a convincing counter-example. If this set of data is not "harmfully collinear," the term has no meaning.
The rule's conceptual appeal may be rescued from absurdities of the first type by extending the concept of simple correlation between independent variables to multiple correlation within the independent variable set. A variable, X_i, then, would be said to be "harmfully multicollinear" only if its multiple correlation with other members of the independent variable set were greater than the dependent variable's multiple correlation with the entire set.
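The extended rule can be checked mechanically: compute R_y, and for each X_i its multiple correlation R_i on the remaining explanatory variables, flagging any R_i that exceeds R_y. A minimal sketch, with data and helper function of our own invention:

```python
import numpy as np

def multiple_corr(Z, target):
    """Multiple correlation of `target` on the columns of Z (with intercept)."""
    Z1 = np.column_stack([np.ones(len(Z)), Z])
    fit, *_ = np.linalg.lstsq(Z1, target, rcond=None)
    resid = target - Z1 @ fit
    return np.sqrt(1 - resid.var() / target.var())

rng = np.random.default_rng(1)
N = 100
x1 = rng.normal(size=N)
x2 = x1 + 0.3 * rng.normal(size=N)          # x2 heavily dependent on x1
X = np.column_stack([x1, x2])
y = x1 - x2 + rng.normal(size=N)            # weak overall dependence on X

R_y = multiple_corr(X, y)
flags = []
for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    R_i = multiple_corr(others, X[:, i])
    flags.append(R_i > R_y)
    print(f"x{i + 1}: R_i = {R_i:.2f} vs R_y = {R_y:.2f}"
          + ("  (flagged)" if R_i > R_y else ""))
```

In this constructed case both variables are flagged: each is far more closely tied to the other than Y is to the set as a whole.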
The Cobb-Douglas counter-example remains, however, to indicate that multicollinearity is basically an interdependency, not a dependency, condition. Should (X^t X) be singular, or virtually so, even tight sample dependence between Y and X cannot assure the structural integrity of least squares parameter estimates.
Computer Programming
The development of large scale, high speed digital computers has had a well-recognized, virtually revolutionary impact on econometric applications. By bringing new persons into contact with the field, the computer also is having a perceptible, if less dramatic, impact on econometric methodology. The phenomenon is not new. Technical specialists have called attention to matters of theoretical interest in the past; Professor Viner's famous draftsman, Mr. Wong, is a notable example.† More recently, the computer programmer's approach to singularity in regression analysis has begun to shape the econometrician's view of the problem as well.
Specifically, the numerical estimation of parameters for a standard regression equation requires the inversion of a matrix of variance-covariance or correlation coefficients for the independent variable set. Estimates of both slope coefficients,

    β̂ = (X^t X)^-1 X^t Y ,

and variances,

    var(β̂) = σ_u² (X^t X)^-1 ,

require the operation. Should the independent variable set X be perfectly multicollinear, (X^t X), of course, is singular, and a determinate solution does not exist.
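The two formulas above can be sketched directly; the data below are our own invention, and the final lines show how a perfectly multicollinear set makes the required inversion break down:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 40, 2
X = rng.normal(size=(N, n))
y = X @ np.array([2.0, -1.0]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)       # the inversion the text refers to
b = XtX_inv @ X.T @ y                  # slope estimates (X'X)^-1 X'y
resid = y - X @ b
s2 = resid @ resid / (N - n)           # estimate of sigma_u^2
var_b = s2 * np.diag(XtX_inv)          # estimated coefficient variances

# A perfectly multicollinear set: the determinant of X'X vanishes,
# so no determinate solution exists.
X_bad = np.column_stack([X[:, 0], 2.0 * X[:, 0]])
print(np.linalg.det(X_bad.T @ X_bad))  # ~0: singular
```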
The programmer, accordingly, is required to build checks for non-singularity into standard regression routines. The test most commonly used relies on the property that the determinant of a singular matrix is zero. Defining a small, positive test value, δ > 0, a solution is attempted only if the determinant

    |X^t X| > δ ;

otherwise, computations are halted and a premature exit is called.
† J. Viner, "Cost Curves and Supply Curves," Zeitschrift für Nationalökonomie, III, 1931. Reprinted in A.E.A. Readings in Price Theory, Irwin, 1952.
Checks for singularity may be kept internal to a computer program, well out of the user's sight. Recently, however, the determinant has tended to join β coefficients, t-ratios, F-tests and other summary statistics as routine elements of printed output. Remembering that the determinant |X^t X| is based on a normalized, correlation matrix, its position on the scale

    0 ≤ |X^t X| ≤ 1

yields at least heuristic insight into the degree of interdependence within the independent variable set. As X approaches singularity, of course, |X^t X| approaches zero. Conversely, |X^t X| close to one implies a nearly orthogonal independent variable set. Unfortunately, the gradient between extremes is not well defined. As an ordinal measure of the relative orthogonality of similar sets of independent variables, however, the statistic has attracted a certain amount of well-deserved attention and use.
A single, overall measure of the degree of interdependence within an independent variable set, although useful in its own right, provides little information on which corrective action can be based. Near singularity may result from strong, pairwise sample correlation between independent variables, or from a more subtle and complex linkage between several members of the set. The problem's cure, of course, depends on the nature of the interaction. The determinant per se, unfortunately, gives no information about that interaction.
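The determinant's behavior at the two ends of its scale is easy to exhibit on synthetic data (the data and function name below are our own):

```python
import numpy as np

def corr_det(X):
    """|X'X| computed from the normalized (correlation) matrix:
    near 1 for an orthogonal column set, near 0 at singularity."""
    return np.linalg.det(np.corrcoef(X, rowvar=False))

rng = np.random.default_rng(3)
N = 200
ortho = rng.normal(size=(N, 3))                    # nearly independent columns
collinear = ortho.copy()
collinear[:, 2] = collinear[:, 0] + 0.05 * rng.normal(size=N)

print(corr_det(ortho))      # close to 1
print(corr_det(collinear))  # close to 0
```

As the text notes, the two numbers order the sets correctly but say nothing about which variables, or which linkage, drive the smaller value.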
In at least one case an attempt has been made to localize multicollinearity by building directly into a multiple regression program an index of each explanatory variable's dependence on other members of the independent variable set.† Recalling that, as X_i approaches complete dependence on other members of the set, the corresponding diagonal element of the inverse correlation matrix, r^ii, and thereby the var(β̂_i) = σ² r^ii, explodes, such an index pinpoints not only the existence, but also the location, of singularity within an independent variable set.

† A. E. Beaton and R. R. Glauber, Statistical Laboratory Ultimate Regression Package, Harvard Statistical Laboratory, 1962.
As originally conceived, diagonal elements of the inverse correlation matrix were checked internally by computer programs only to identify completely singular independent variables. More recently they have joined other statistics as standard elements of regression output.† Even though the spectrum

    1 ≤ r^ii < ∞

is little explored, diagonal elements, by their size, give heuristic insight into the relative severity, as well as the location, of redundancies within an independent variable set.
Armed with such basic (albeit crude) diagnostics, the investigator may begin to deal with the multicollinearity problem. First, of course, the determinant |X^t X| alerts him to its existence. Next, diagonal elements r^ii give sufficient insight into the problem's location, and therefore into its cause, to suggest the selective additions of information that are required for stable, least squares, parameter estimates.

† Idem.
THE PROBLEM REVISITED
Many persons, clearly, have examined one or more aspects of the multicollinearity problem. Each, however, has focused on one facet to the exclusion of others. Few have attempted to synthesize, or even to distinguish between, either multicollinearity's nature and effects, or its diagnosis and cure. Klein, for example, defines the problem in terms that include both nature and effects; while Kendall attempts to produce a solution without concern for the problem's nature, effects, or diagnosis. Those who do concern themselves with a definition of multicollinearity tend to think of the problem in terms of a discrete condition that either exists or does not exist, rather than as a continuous phenomenon whose severity may be measured.
A good deal of confusion and some inconsistency emerges from this picture. Cohesion requires, first of all, a clear distinction between multicollinearity's nature and effects, and, second, a definition in terms of the former on which diagnosis, and subsequent correction, can be based.
DEFINITION
Econometric problems are ordinarily defined in terms of statistically significant discrepancies between the properties of hypothesized and sample variates. Non-normality, heteroscedasticity and autocorrelation, for example, are defined in terms of differences between the behavior of hypothesized and observed residuals. Such definitions lead directly to the development of test statistics on which detection, and an evaluation of the problem's nature and severity, can be based. Once an investigator is alerted to a problem's existence and character, of course, corrective action ordinarily constitutes a separate and often quite straightforward step. Such a definition would seem to be both possible and desirable for multicollinearity.
Let us define the multicollinearity problem, therefore, in terms of departures from parental orthogonality in an independent variable set. Such a definition has at least two advantages.

First, it distinguishes clearly between the problem's essential nature, which consists of a lack of independence, or the presence of interdependence, in an independent variable set X, and the symptoms, or effects on the dependency relationship between Y and X, that it produces.
Second, parental orthogonality lends itself easily to formulation as a statistical hypothesis and, as such, leads directly to the development of test statistics, adjusted for numbers of variables and observations in X, against which the problem's severity can be calibrated. Developed in sufficient detail, such statistics may provide a great deal of insight into the location and pattern, as well as the severity, of interdependence that undermines the experimental quality of a given set of data.
DIAGNOSIS
Once a definition is in hand, multicollinearity ceases to be so inscrutable. Add a set of distributional properties, and hypotheses of parental orthogonality can be developed and tested in a variety of ways, at several levels of detail. Statistics whose distributions are known (and tabulated) under appropriate assumptions, of course, must be obtained. Their values for a particular sample provide probabilistic measures of the extent of correspondence or non-correspondence between hypothesized and sample characteristics; in this case, between hypothesized and sample orthogonality.
To derive test statistics with known distributions, specific assumptions are required about the nature of the population that generates sample values of X. Because existing distribution theory is based almost entirely on assumptions that X is multivariate normal, it is convenient to retain the assumption here as well. Common versions of least squares regression models, and tests of significance based thereon, also are based on multivariate normality. Questions of dependence and interdependence in regression analysis, therefore, may be examined within the same statistical framework.

Should the assumption prove unnecessarily severe, its probabilistic implications can be relaxed informally. For formal purposes, however, multivariate normality's strength and convenience is essential, and underlies everything that follows.
General
The heuristic relationship between orthogonality and the determinant of a matrix of sample first order correlation coefficients,

    0 ≤ |X^t X| ≤ 1 ,

has been discussed under computer programming approaches to singularity, above. Should it be possible to attach distributional properties, under an assumption of parental orthogonality, to the determinant |X^t X|, or to a convenient transformation of |X^t X|, the resulting statistic could provide a useful first measure of the presence and severity of multicollinearity within an independent variable set.
Presuming X to be multivariate normal, such properties are close at hand. As shown by Wishart, sample variances and covariances are jointly distributed according to the frequency function that now bears his name.† Working from the Wishart distribution, Wilks, in an analytical tour de force, is able to derive the moments and distribution (in open form) of the determinant of sample covariance matrices.‡ Employing the additional assumption of parental orthogonality, he then obtains the moments and distribution of determinants for sample correlation matrices |X^t X| as well.§ Specifically, the kth moment of |X^t X| is shown to be

    (2)    E(|X^t X|^k) = [ Γ((N-1)/2) / Γ((N-1)/2 + k) ]^n · Π_{i=1}^{n} Γ((N-i)/2 + k) / Γ((N-i)/2)

where, as before, N is sample size and n the number of variables.
In theory, one ought to be able to derive the frequency function for |X^t X| from (2), and in open form it is indeed possible. For n > 2, however, explicit solutions for the distribution of |X^t X| have not been obtained.

Bartlett, however, by comparing the lower moments of (2) with those of the Chi Square distribution, obtains a transformation of |X^t X|,

    (3)    χ²(v) = -[ N - 1 - (1/6)(2n + 5) ] log_e |X^t X| ,
† Wishart, J., "The Generalised Product Moment Distribution in Samples from a Multivariate Normal Population," Biometrika, 1928.
‡ Wilks, S. S., "Certain Generalizations in the Analysis of Variance," Biometrika, 24, 1932; p. 477.
§ Op. cit., p. 492.
that is distributed approximately as Chi Square with v = ½ n(n-1) degrees of freedom.
In this light the determinant of intercorrelations within an independent variable set takes on new meaning. No longer is interpretation limited to extremes of the range

    0 ≤ |X^t X| ≤ 1 .

By transforming |X^t X| into an approximate Chi Square statistic, a meaningful scale is provided against which departures from hypothesized orthogonality, and hence the gradient between singularity and orthogonality, can be calibrated. Should one accept the multivariate normality assumption, of course, probability levels provide a cardinal measure of the extent to which X is interdependent. Even without such a scale, transformation to a variable whose distribution is known, even approximately, by standardizing for sample size and number of variables, offers a generalized, ordinal measure of the extent to which quite different sets of independent variables are undermined by multicollinearity.
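Bartlett's transformation (3) is simple to compute; a sketch on invented data (function name and samples are our own):

```python
import numpy as np

def bartlett_chi2(X):
    """Transformation (3): the correlation determinant recast as an
    approximate Chi Square statistic with v = n(n-1)/2 degrees of freedom,
    under the hypothesis of parental orthogonality."""
    N, n = X.shape
    det = np.linalg.det(np.corrcoef(X, rowvar=False))
    stat = -(N - 1 - (2 * n + 5) / 6) * np.log(det)
    return stat, n * (n - 1) // 2

rng = np.random.default_rng(4)
N = 100
ortho = rng.normal(size=(N, 3))
collinear = ortho.copy()
collinear[:, 2] = collinear[:, 0] + 0.2 * rng.normal(size=N)

# Tabulated Chi Square critical value, 3 d.f., 5 per cent level: 7.81
stats = {}
for name, Z in [("orthogonal", ortho), ("collinear", collinear)]:
    stats[name], v = bartlett_chi2(Z)
    print(f"{name:10s}  chi2({v}) = {stats[name]:8.1f}")
```

The collinear set produces a statistic far beyond the 5 per cent critical value, while the near-orthogonal set produces a far smaller one, which is exactly the calibration the text describes.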
Specific
Determining that a set of explanatory variables departs substantially from internal orthogonality is the first, but only the logical first, step in an analysis of multicollinearity as defined here. If information is to be applied efficiently to alleviate the problem, localization measures are required to specify accurately the variables most severely undermined by interdependence.
To find the basis for one such measure we return to notions developed both by computer programmers and by econometricians. As indicated earlier, both use diagonal elements of the inverse correlation matrix, r^ii, in some form, as measures of the extent to which particular explanatory variables are affected by multicollinearity.
Intuition suggests that our definition of hypothesized parental orthogonality be tested through this statistic. Elements of the necessary statistical theory are developed by Wilks, who obtains the distribution of numerous determinantal ratios of variables from a multivariate normal distribution.†

Specifically, for the matrix (X^t X), Wilks defines h principal minors |X^t X|_i, for i = 1, ..., h, such that no two contain the same diagonal element, r_ii, but each r_ii enters one principal minor, and considers the ratio

    (6)    z = |X^t X| / ( |X^t X|_1 · |X^t X|_2 · ... · |X^t X|_h ) .

For any arbitrary set of h principal minors he then obtains both the moments and distribution of z. For the special case of interest here (employing the notation of p. 29 above), let us form principal minors such that h = 2, |X^t X|_1 = |r_ii| = 1, and |X^t X|_2 = |X_(2)^t X_(2)|; then

    z = |X^t X| / |X_(2)^t X_(2)| = 1 / r^ii ,

† S. S. Wilks, op. cit., esp. pp. 480-2, 491-2.
which can be recognized as the F-distribution with v1 and v2 degrees of freedom.† The transformation

    (9)    w = (r^ii - 1) (N - n)/(n - 1) ,

then, can be seen to be distributed as F with N-n and n-1 degrees of freedom. Defining R_i² as the squared multiple correlation between X_i and the other members of X, this result can be most easily understood by recalling that

    r^ii = 1 / (1 - R_i²) .

Therefore, (r^ii - 1) equals R_i²/(1 - R_i²), and w (as defined in (6) and (9) above), except for a term involving degrees of freedom, is the ratio of explained to unexplained variance; it is not surprising, then, to see w distributed as F.
As regards the distribution of w, the same considerations discussed in the preceding section are relevant. If X is jointly normal, (9) is distributed exactly as F, and its magnitude therefore provides a cardinal measure of the extent to which individual variables in an independent variable set are affected by multicollinearity. If normality cannot be assumed, (9) still provides an ordinal measure, adjusted for degrees of freedom, of X_i's dependence on other variables in X.
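The statistic (9), and its link to R_i² through r^ii = 1/(1 - R_i²), can be sketched as follows (data and function name are our own invention):

```python
import numpy as np

def diagonal_w(X):
    """Diagonal elements r^ii of the inverse correlation matrix, and the
    statistic (9), w = (r^ii - 1)(N - n)/(n - 1), distributed as
    F(N-n, n-1) under parental orthogonality."""
    N, n = X.shape
    r_ii = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
    return r_ii, (r_ii - 1.0) * (N - n) / (n - 1)

rng = np.random.default_rng(5)
N = 60
x1, x2, x4 = rng.normal(size=(3, N))
x3 = x1 + x2 + 0.1 * rng.normal(size=N)      # x3 built from x1 and x2
X = np.column_stack([x1, x2, x3, x4])        # x4 unrelated to the rest

r_ii, w = diagonal_w(X)
R_sq = 1.0 - 1.0 / r_ii                       # recovers each R_i^2
for i in range(4):
    print(f"x{i + 1}:  r^ii = {r_ii[i]:8.1f}   w = {w[i]:9.1f}"
          f"   R_i^2 = {R_sq[i]:.3f}")
```

The explosive r^ii (and w) values localize the problem to x1, x2 and x3, while the unrelated x4 sits near the orthogonal benchmark r^ii = 1.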
Having established which variables in X are substantially
† F. Graybill, op. cit., p. 31.
multicollinear, it generally proves useful to determine in greater detail the pattern of interdependence between affected members of the independent variable set. An example, perhaps, will illustrate the information's importance. Suppose (9) is large only for X1, X2, X3, and X4, indicating these variables to be significantly multicollinear, but only with each other, the remaining variables in X being essentially uncorrelated both with each other and with X1, ..., X4. Suppose further that all four variables, X1, ..., X4, are substantially intercorrelated with each of the others. If well-determined estimates are desired for this subset of variables, additional information must be obtained on at least three of the four.
Alternatively, suppose that X1 and X2 are highly correlated, and X3 and X4 also are highly correlated, but all other intercorrelations among the four, and with other members of X, are small. In this case, additional information must be obtained for only two variables: X1 or X2, and X3 or X4. Clearly, then, the efficient solution of multicollinearity problems requires detailed information about the pattern, as well as the existence, severity, and location, of intercorrelations within a subset of interdependent variables.
To gain insight into the pattern of interdependence in X, a straightforward transformation of off-diagonal elements of the inverse correlation matrix (X^t X)^-1 is both effective and convenient. Its development may be summarized briefly, as follows.
Consider a partition of the independent variable set X such that variables X_i and X_j constitute X_(1), and the remaining n-2 variables X_(2). The corresponding matrix of zero order correlation coefficients, then, is partitioned such that

    X^t X = | R_11  R_12 |
            | R_21  R_22 |

where R_11, containing variables X_i and X_j, is of dimension 2 x 2, and R_22 is (n-2) x (n-2). Elements of the inverse correlation matrix r^ij corresponding to X_(1), then, can be expressed without loss of generality† as the elements of

    (R_11 - R_12 R_22^-1 R_21)^-1 .
Before inversion, the single off-diagonal element of

    (R_11 - R_12 R_22^-1 R_21)

may be recognized as the partial covariance of X_i and X_j, holding constant X_(2), the other members of the independent variable set. On normalizing, i.e., dividing by square roots of corresponding diagonal elements in the usual fashion, partial correlation coefficients between X_i and X_j can be obtained.
For the special case considered here, where X_(1) contains only 2 variables and R_11, accordingly, is 2 x 2, it can also be shown that corresponding normalized off-diagonal elements of (R_11 - R_12 R_22^-1 R_21) and its inverse (R_11 - R_12 R_22^-1 R_21)^-1 differ

† G. Hadley, Linear Algebra, Addison-Wesley, 1961; pp. 107, 108.
‡ ... Analysis, Wiley, 1958.
from one another only by sign. It follows, therefore, that, by a change of sign, normalized off-diagonal elements of the inverse correlation matrix (X^t X)^-1 yield partial correlations among members of the independent variable set. That is, defining r_ij. as the coefficient of partial correlation between X_i and X_j, other members of X held constant, and r^ij as elements of (X^t X)^-1 above, it follows that

    r_ij. = - r^ij / (r^ii r^jj)^1/2 .
Distributional properties under a hypothesis of parental orthogonality, of course, are required to tie up the bundle. Carrying forward the assumption of multivariate normality, such properties are close at hand. In a manner exactly analogous to the simple (zero order) correlation coefficient, the statistic

    t_ij = r_ij. (N - n)^1/2 / (1 - r_ij.²)^1/2

may be shown to be distributed as Student's t with v = N-n degrees of freedom.†
An exact, cardinal interpretation of interdependence between X_i and X_j, as members of X, of course, requires exact satisfaction of multivariate normal distributional properties. As with the determinant and diagonal elements of (X^t X)^-1 that precede it, however, off-diagonal elements transformed to r_ij. or t_ij provide useful ordinal measures of collinearity even in the absence of such rigid assumptions.

† Graybill, op. cit., pp. 215, 208.
Illustration
A three stage hierarchy of increasingly detailed tests for the presence, location, and pattern of interdependence within an independent variable set X has been proposed. In order, the stages are:

1. Test for the presence and severity of multicollinearity anywhere in X, based on the approximate distribution (3) of determinants of sample correlation matrices, |X^t X|, from an orthogonal parent population.

2. Test for the dependence of particular variables on other members of X, based on the exact distribution, under parental orthogonality, of diagonal elements of the inverse correlation matrix (X^t X)^-1.

3. Examine the pattern of interdependence among X through the distribution, under parental independence, of off-diagonal elements of the inverse correlation matrix, (X^t X)^-1.
In many ways such an analysis, based entirely on statistics that are routinely generated during standard regression computations, may serve as a substitute for the formal, thorough (and time-consuming) factor analysis of an independent variable set. It provides the insight required to detect, and if present to identify, multicollinearity in X. Accordingly, it may serve as a starting point from which the additional information required for stable, least squares estimation can be sought. An illustration, perhaps, will help to clarify the procedure's mechanics and purpose; both of which are quite straightforward.
In a series of statistical cost analyses for the U.S. Navy, an attempt has been made to measure the effect on maintenance cost of such factors as ship age, size, intensity of usage (measured by fuel consumption), time between successive overhauls, and such discrete, qualitative characteristics as propulsion mode (steam, diesel, nuclear), complexity (radar picket, guided missile), and conversion under a recent (Fleet Rehabilitation and Modernization, FRAM) program. Equations have been specified and estimated on various samples from the Atlantic Fleet destroyer force that relate logarithms of repair costs to logarithms of age, displacement, overhaul cycle and fuel consumption, and to discrete (0, 1) dummy variables for diesel propulsion, radar picket, and FRAM conversion.†
Stability under changes in specification, direction of minimization, and sample coverage has been examined heuristically by comparing regression coefficients, determinants of correlation matrices |X^t X|, and diagonal elements of (X^t X)^-1, from different equations. The sensitivity of certain parameters under such changes (e.g., fuel consumption) and the stability of others (e.g., age, overhaul cycle) have been noted in the past.

By performing an explicit analysis for interdependence in X, such information could have been obtained more quickly, directly, and in greater detail. Consider, for example, the seven variable equation summarized in Table 3. Multiple correlation and associated F statistics, with t-ratios for the relationship between dependent and independent variables, show dependence between Y and X to be substantial.
† D. E. Farrar and R. E. Apple, "Some Factors that Affect the Overhaul Cost of Ships," Naval Research Logistics Quarterly, 10, 4, 1963; and "Economic Considerations in Establishing an Overhaul Cycle for Ships," Naval Engineers Journal, 77, 6, 1964.
parameter estimates from secondary data sources, or the direct application of subjective information, may be necessary. In any case, efficient corrective action requires selectivity, and selectivity requires information about the nature of the problem to be handled. The procedure outlined here provides such information. It produces detailed diagnostics that can support the selective acquisition of information required for effective treatment of the multicollinearity problem.
SUMMARY
A point of view, as well as a collection of techniques, is advocated here. The techniques, in this case a series of diagnostics, can be formulated and illustrated explicitly. The spirit in which they are developed, however, is more difficult to convey. Given a point of view, techniques that support it may be replaced quite easily; the inverse is seldom true. An effort will be made, therefore, to summarize our approach to multicollinearity and to contrast it with alternative views of the problem.

Multicollinearity, as defined here, is a statistical, rather than a mathematical, condition. As such, one thinks, and speaks, in terms of the problem's severity rather than of its existence or non-existence.
As viewed here, multicollinearity is a property of the independent variable set alone. No account whatever is taken of the extent, or even the existence, of dependence between Y and X. It is true, of course, that the effect on estimation and specification of interdependence in X, reflected by variances of estimated regression coefficients, also depends partly on the strength of dependence between Y and X. In order to treat the problem, however, it is important to distinguish between nature and effects, and to develop diagnostics based on the former. In our view an independent variable set X is not less multicollinear if related to one dependent variable than if related to another, even though its effects may be more serious in one case than the other.
Of multicollinearity's effects on the structural integrity of estimated econometric models, estimation instability and structural misspecification, the latter, in our view, is the more serious. Sensitivity of parameter estimates to changes in specification, sample coverage, etc., is reflected at least partially in standard deviations of estimated regression coefficients. No indication at all exists, however, of the bias imparted to coefficient estimates by incorrectly omitting a relevant, yet multicollinear, variable from an independent variable set.
Historical approaches to multicollinearity are almost unanimous in presuming the problem's solution to lie in deciding which variables to keep and which to drop from an independent variable set. The thought that the gap between a model's information requirements and the data's information content can be reduced by increasing available information, as well as by reducing model complexity, is seldom considered.†
A major aim of the present approach, on the other hand, is to provide sufficiently detailed insight into the location and pattern of interdependence among a set of independent variables that strategic additions of information become not only a theoretical possibility but also a practically feasible solution for the multicollinearity problem.
Selectivity, however, is emphasized. This is not a counsel of perfection. The purpose of regression analysis is to estimate the structure of a dependent variable Y's dependence on a pre-selected set of independent variables X, not to select an orthogonal independent variable set.‡

† H. Theil, op. cit., p. 217; and J. Johnston, op. cit., p. 207, are notable exceptions.
‡ Indeed, should a completely orthogonal set of economic variables appear in the literature, one would suspect it either to be too small to explain properly a moderately complex dependent variable, or to have been chosen with internal orthogonality rather than relevance to the dependent variable in mind.
Structural integrity over an entire set, admittedly, requires both complete specification and internal orthogonality. One cannot obtain reliable estimates for an entire n dimensional structure, or distinguish between competing n dimensional hypotheses, with fewer than n significant dimensions of independent variation. Yet all variables are seldom equally important. Only one, or at most two or three, strategically important variables are ordinarily present in a regression equation. With complete specification and detailed insight into the location and pattern of interdependence in X, structural instability within the critical subset can be evaluated and, if necessary, corrected. Multicollinearity among non-critical variables can be tolerated. Should critical variables also be affected, additional information to provide coefficient estimates, either for strategic variables directly, or for those members of the set on which they are principally dependent, is required. Detailed diagnostics for the pattern of interdependence that undermines the experimental quality of X permit such information to be developed and applied both frugally and effectively.
Insight into the pattern of interdependence that affects
an independent variable set can be provided in many ways. The
entire field of factor analysis, for example, is designed to handle
such problems. Advantages of the measures proposed here are twofold.
The first is pragmatic; while factor analysis involves extensive
separate computations, the present set of measures relies
entirely on transformations of statistics, such as the determinant
|X'X| and the elements of the inverse correlation matrix (X'X)^-1,
that are generated routinely during standard regression computations.
The second is symmetry; questions of dependence and interdependence
in regression analysis are handled in the same conceptual and
statistical framework. Variables that are internal to a set X
for one purpose are viewed as external to it for another. In
this vein, tests of interdependence are approached as successive
tests of each independent variable's dependence on the other members
of the set. The conceptual and computational apparatus of
regression analysis, accordingly, is used to provide a quick and
simple, yet serviceable, substitute for the factor analysis of
an independent variable set.
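The successive-dependence tests described above can be sketched in a short modern calculation (a hypothetical illustration in Python, not part of the original paper; the data and variable names are invented). The determinant of the correlation matrix of the standardized independent variables plays the role of |X'X|, and each diagonal element of its inverse equals 1 / (1 - R_j^2), where R_j^2 is the squared multiple correlation of variable j regressed on the other members of the set.

```python
import numpy as np

# Hypothetical data: x2 is nearly collinear with x1, x3 is unrelated.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Correlation matrix of the standardized independent variables.
R = np.corrcoef(X, rowvar=False)

# A determinant near zero signals interdependence in the set as a whole.
det_R = np.linalg.det(R)

# Each diagonal element of R^-1 equals 1 / (1 - R_j^2), so R_j^2 -- the
# squared multiple correlation of variable j on the other members of the
# set -- falls out of the inverse with no separate regressions.
R_inv = np.linalg.inv(R)
r_squared = 1.0 - 1.0 / np.diag(R_inv)

print(f"|R| = {det_R:.4f}")
for j, r2 in enumerate(r_squared, start=1):
    print(f"R^2 of x{j} on the others: {r2:.3f}")
```

Run on this sample, the determinant is close to zero and the first two R_j^2 values are large, flagging x1 and x2 as the locus of interdependence, while x3's small R_j^2 shows it is essentially external to the collinear cluster — exactly the variable-by-variable localization the text describes, obtained entirely from quantities a standard regression computation already produces.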
It would be pleasant to conclude on a note of triumph that
the problem has been solved and that no further "revisits" are
necessary. Such a feeling, clearly, would be misleading. Diagnosis,
although a necessary first step, does not insure a cure. No
miraculous "instant orthogonalization" can be offered.
We do, however, close on a note of optimism. The diagnostics
described here offer the econometrician a place to begin. In
combination with a spirit of selectivity in obtaining and applying
additional information, multicollinearity may return from
the realm of impossible to that of difficult, but tractable,
econometric problems.