The Parameters of Cross-Validation

Recommended citation: Herzberg, P. A. (1969). The Parameters of Cross-Validation (Psychometric Monograph No. 16). Richmond, VA: Psychometric Society. Retrieved from http://www.psychometrika.org/journal/online/MN16.pdf


TRANSCRIPT



THE PARAMETERS OF CROSS-VALIDATION

Paul A. Herzberg

YORK UNIVERSITY

TORONTO


Copyright 1969 by the Psychometric Society

Printed by The William Byrd Press, Richmond, Virginia


PREFACE

This monograph was submitted as a doctoral dissertation at the University of Illinois, Urbana. The author is indebted to Professors L. R. Tucker and J. S. Wiggins for their generous help. The work on this project was partially supported by the Office of Naval Research under contracts Nonr 1834(39) and US NAVY/00014-67-A-0305-0003 with the University of Illinois.


TABLE OF CONTENTS

Chapter

1 INTRODUCTION

1.1 Multiple Regression and Cross-Validation
1.2 Estimates of Validity
1.3 Improvement of Prediction
1.4 The Comparison of Prediction Methods

2 THE MATHEMATICAL MODEL AND SAMPLE CALCULATIONS

2.1 Notation
2.2 The General Model for Predictors and Criteria
2.3 Computer Generation of the Model
2.4 Computer Generation of Data Samples
2.5 Multiple Regression on the Predictors
2.6 Multiple Regression on the Principal Components
2.7 Multiple Regression on the Principal Predictors
2.8 Cross-Validities

3 SIMULATION RESULTS WITH ONE CRITERION VARIABLE

3.1 Distribution of the Correlation Statistics When ρ = 0
3.2 Dependence of the Correlation Statistics on Σ_xx When ρ ≠ 0
3.3 Dependence of the Correlation Statistics on n, N, and ρ²
3.4 Prediction from the Principal Components

4 STUDIES WITH SEVERAL CRITERIA

4.1 Simulation Results
4.2 Study of Real Data
4.3 Simulation of Real Data

5 SUMMARY AND CONCLUSIONS

REFERENCES


CHAPTER 1

INTRODUCTION

1.1 Multiple Regression and Cross-Validation

A problem common to many areas of psychology is the prediction of a person's score on one variable from his scores on a number of other variables. The variable that is to be predicted is called the criterion and the other variables are called predictors. Many methods have been developed to combine predictor scores in order to optimize the prediction of the criterion. A common procedure is to obtain a sample of subjects with known predictor and criterion scores (the derivation sample) and to calculate the linear combination of the predictor scores that best predicts the criterion scores. By "best" is usually meant "least squared error," which means that the sum (over subjects) of the squared deviations of the observed from the predicted criterion score is a minimum. The optimizing coefficients of the predictor scores are called the multiple regression weights and are calculated from the normal equations which express the minimization conditions [Anderson, 1958; Kendall & Stuart, 1961].

When multiple regression is used to compute predictor weights, a multiple correlation may be calculated. The multiple correlation is the Pearson product-moment correlation, in the sample, between the optimal linear combination of the predictors and the criterion variable. The multiple correlation is thus a measure of the degree of relationship between the predictors and the criterion. However, the multiple correlation is a biased estimate of this relationship and is generally larger than the true population multiple correlation. The bias occurs because the process of minimizing the average squared error in prediction is equivalent to maximizing the correlation between the linear combination of the predictors and the criterion. Due to the finite size of the sample, the optimizing linear combination will be fitted to the idiosyncrasies of the sample and will generally result in a higher multiple correlation than the population multiple correlation.

One problem in the application of multiple correlation techniques is therefore the estimation of the true multiple correlation from the biased sample multiple correlation. In the next section it will be shown that there are two population correlations which must be distinguished. A number of formulas for correcting the sample multiple correlation are known. However, these formulas require assumptions which are often difficult to satisfy and, therefore, many early investigators estimated the population correlation by applying to a second sample the regression weights calculated in an original sample.


They found that the correlation between the regression function and the criterion in the second sample was less than the original sample multiple correlation. This technique became known as cross-validation of the predictor weights or simply as cross-validation [Mosier, 1951]. The correlation in the second sample is called the cross-validity. The first sample is known as the derivation sample; the second is the validation sample. An obvious addition to the cross-validation method is to repeat the calculations, interchanging the roles of the first and second sample. We shall call this technique double cross-validation.

This study was designed to investigate:

(a) the accuracy of the cross-validity as an estimate of the population correlation,

(b) the effectiveness of two reduced rank methods for estimating predictor weights, and

(c) the effect of the variation of some parameters of the population distribution on the results of (a) and (b).

The estimation of the population correlation is described in more detail in Section 1.2. The reduced rank methods are introduced in Section 1.3. Finally, the study of the effect of variation of population parameters by a simulation technique is introduced in Section 1.4.

1.2 Estimates of Validity

Let the predictor variables be x_1, x_2, …, x_n and let the criterion variable be y. Then the regression function in the population is

(1.2.1) \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \beta_0.

The constant term in the equation is β_0; the β_i are called the regression weights. Two models for the predictors are possible, the regression model and the correlation model [Ezekiel & Fox, 1959, pp. 279–281]. In the regression model, the values of the predictor variables are fixed and only the criterion is a random variable. A more realistic model for most multivariate work in psychology is to assume that both the predictors and the criterion are random variables (the correlation model). Under the null hypothesis of zero multiple correlation the distributional theory is identical for the two models. However, when the null hypothesis is not true the distributions are different under the two models. Since the distributional theory is much more complicated under the correlation model, most investigators in psychology [e.g. Burket, 1964] have continued to use the regression model hoping that there will be little practical difference between the two models.

Regression equations can also differ in whether the constant term, β_0, is included. In the case of the regression model, the constant term is really indistinguishable from the other terms in the equation since a predictor variable, x_0, may be defined as the constant 1.0. Then the constant term may be written as β_0 x_0. Therefore, formulas developed for the constant β_0 = 0 case

(1.2.2) \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n

may be modified for the constant β_0 ≠ 0 case (1.2.1) by simply replacing n by n + 1.

In the correlation model this simple correspondence between the zero and non-zero constant cases does not hold since x_1, …, x_n are random variables while x_0 is fixed. Inclusion of a constant term does not affect the multiple correlation or the correlation between the regression function and any other variable. Thus in studies such as the present one which emphasize correlation measures, it is simplest to set the constant term to zero. However, if the mean square error of prediction is used as a measure of accuracy of prediction, it is very important to state whether the constant term is included in the regression.

Let ρ be the population multiple correlation, σ_y the population standard deviation of y, and σ_(y−ŷ) the population standard deviation of the error in prediction (y − ŷ), where ŷ is the regression function (1.2.1) with weights calculated from the normal equations. Then

(1.2.3) \rho^2 = 1 - \frac{\sigma^2_{(y-\hat y)}}{\sigma^2_y}.

A similar equation holds in the sample, relating the squared sample correlation, r², to the mean squared error of prediction, MSE, and the standard deviation of the sample, s_y:

(1.2.4) r^2 = 1 - \frac{MSE}{s^2_y}.

The first estimation of ρ² in the psychological literature [Larson, 1931] is by the following formula:

(1.2.5) \operatorname{Est}(\rho^2) = 1 - \frac{N}{N-n}\,(1 - r^2),

where N is the sample size. Larson does not give a derivation of this formula but Wherry [1931] showed that it follows (in the regression model) from estimating σ²_y by s²_y and estimating σ²_(y−ŷ) by (MSE)N/(N − n). The substitution of these two estimates into (1.2.3) and the use of (1.2.4) gives (1.2.5). In order to improve this estimate, Wherry estimated σ²_y by s²_y N/(N − 1) rather than by s²_y. The resulting formula is

(1.2.6) \operatorname{Est}(\rho^2) = 1 - \frac{N-1}{N-n}\,(1 - r^2).


Larson and Wherry compared their estimates with cross-validities and Wherry showed that (1.2.6) is superior to (1.2.5).

It is not entirely clear how Larson and Wherry handled the constant term in the regression function. Formula (1.2.6) is strictly applicable to a zero constant term. When the constant term is not zero, the unbiased estimate of σ²_(y−ŷ) is (MSE)N/(N − n − 1), so that the estimate of ρ² is

(1.2.7) \operatorname{Est}(\rho^2) = 1 - \frac{N-1}{N-n-1}\,(1 - r^2).

Formula (1.2.7) is often referred to as Wherry's formula even though his original formula was (1.2.6). Formula (1.2.7) is not an unbiased estimate of ρ² since the ratio of two unbiased estimates is not unbiased. However, unbiased estimates of ρ² are not always desirable, for, if the true ρ² = 0, an unbiased estimate must take on both negative and positive values even though a multiple correlation is always positive.

The multiple correlation, ρ, is the correlation, in the population, of the criterion and the regression function calculated in the population. In applications, the population regression function can never be known and one is more interested in how effective the sample regression function is in other samples. A measure of this effectiveness is r_c, the sample cross-validity. For any given regression function, r_c will vary from validation sample to validation sample. The average value of r_c will be approximately equal to the correlation, in the population, of the sample regression function with the criterion. This correlation is the population cross-validity, ρ_c. Wherry's formula estimates ρ rather than ρ_c. Lord [1950] and Nicholson [1960] derived an unbiased estimate of the population mean square error of a sample regression function. Using this estimate of MSE, an estimate of ρ_c² is

(1.2.8) \operatorname{Est}(\rho_c^2) = 1 - \frac{N-1}{N-n-1}\cdot\frac{N+n+1}{N}\,(1 - r^2).

This formula applies to the regression model with a constant term. Darlington [1968] modified this formula for the correlation model with a constant term. His formula is

(1.2.9) \operatorname{Est}(\rho_c^2) = 1 - \frac{N-1}{N-n-1}\cdot\frac{N-2}{N-n-2}\cdot\frac{N+1}{N}\,(1 - r^2).

This formula is based on the assumption that the predictors and criterion have a multivariate normal distribution.
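As a concrete illustration, the estimates (1.2.5) through (1.2.9) can be computed directly from N, n, and r². The short sketch below is in Python (obviously not part of the 1969 monograph); the function names and example values are ours.

```python
import numpy as np

def larson(r2, N, n):
    # (1.2.5)
    return 1 - (N / (N - n)) * (1 - r2)

def wherry_1931(r2, N, n):
    # (1.2.6), Wherry's original formula
    return 1 - ((N - 1) / (N - n)) * (1 - r2)

def wherry(r2, N, n):
    # (1.2.7), the formula usually called "Wherry's formula"
    return 1 - ((N - 1) / (N - n - 1)) * (1 - r2)

def lord_nicholson(r2, N, n):
    # (1.2.8), estimate of the population cross-validity rho_c^2 (regression model)
    return 1 - ((N - 1) / (N - n - 1)) * ((N + n + 1) / N) * (1 - r2)

def darlington(r2, N, n):
    # (1.2.9), estimate of rho_c^2 for the correlation model
    return (1 - ((N - 1) / (N - n - 1)) * ((N - 2) / (N - n - 2))
              * ((N + 1) / N) * (1 - r2))

if __name__ == "__main__":
    N, n, r2 = 100, 10, 0.40          # illustrative values
    for f in (larson, wherry_1931, wherry, lord_nicholson, darlington):
        print(f"{f.__name__:15s}{f(r2, N, n): .4f}")
```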

It would be possible to derive a similar formula for Est(ρ_c²) in the multivariate normal case with no constant term. It is not, however, the purpose of this study to derive the best estimate of ρ_c², for it is not clear what properties such an estimator should have, particularly since an unbiased estimator has the defects mentioned above. It is more interesting to study the accuracy of the cross-validity as an estimate of ρ_c and ρ.

Returning to estimating ρ, Wishart [1931] calculated the moments of the distribution of r² for the multivariate normal distribution. The expected value of r² is

(1.2.10) E(r^2) = 1 - \frac{N-n-1}{N-1}\,(1 - \rho^2)\,F\!\left(1,\,1,\,\frac{N+1}{2},\,\rho^2\right),

where F(a, b, c, x) is the hypergeometric function. Using the first two terms of the expansion of this function, (1.2.10) reduces to

(1.2.11) E(r^2) \approx 1 - \frac{N-n-1}{N-1}\,(1 - \rho^2) - \frac{N-n-1}{N-1}\cdot\frac{2}{N+1}\,\rho^2(1 - \rho^2).

Olkin and Pratt [1958] showed that an unbiased estimate of ρ² is

(1.2.12) \operatorname{Est}(\rho^2) = 1 - \frac{N-3}{N-n-1}\,(1 - r^2)\,F\!\left(1,\,1,\,\frac{N-n+1}{2},\,1 - r^2\right),

which, neglecting terms in 1/N², is

(1.2.13) \operatorname{Est}(\rho^2) \approx 1 - \frac{N-3}{N-n-1}\,(1 - r^2) - \frac{N-3}{N-n-1}\cdot\frac{2}{N-n+1}\,(1 - r^2)^2.

The Wherry estimate (1.2.7) is almost identical to the first two terms of this series.
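The Olkin–Pratt estimate (1.2.12) and its approximation (1.2.13) can likewise be evaluated numerically. The sketch below uses SciPy's Gauss hypergeometric function hyp2f1 for the exact form and compares both with the Wherry formula (1.2.7); the numerical comparison is our illustration, not part of the monograph.

```python
import numpy as np
from scipy.special import hyp2f1

def olkin_pratt(r2, N, n):
    # (1.2.12): unbiased estimate of rho^2 in the multivariate normal case
    return 1 - ((N - 3) / (N - n - 1)) * (1 - r2) * hyp2f1(1, 1, (N - n + 1) / 2, 1 - r2)

def olkin_pratt_approx(r2, N, n):
    # (1.2.13): the same estimate with terms in 1/N^2 neglected
    lead = ((N - 3) / (N - n - 1)) * (1 - r2)
    return 1 - lead - lead * (2 / (N - n + 1)) * (1 - r2)

def wherry(r2, N, n):
    # (1.2.7), for comparison
    return 1 - ((N - 1) / (N - n - 1)) * (1 - r2)

if __name__ == "__main__":
    N, n = 60, 8                      # illustrative values
    for r2 in (0.1, 0.3, 0.5):
        print(r2,
              round(olkin_pratt(r2, N, n), 4),
              round(olkin_pratt_approx(r2, N, n), 4),
              round(wherry(r2, N, n), 4))
```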

Darlington [1968] has carefully distinguished the four correlations ρ_c, ρ, r, and r_c. The smallest of these, ρ_c and r_c, are the validity of the sample regression function in the population and in another sample, respectively. The average, over many samples, of the cross-validity, r_c, will be approximately equal to ρ_c. The next smallest correlation is ρ, the population multiple correlation or the validity of the population regression function in the population. On the average, the largest correlation is the sample multiple correlation r, which is the validity of the sample regression function in the derivation sample. The relationships may be summarized as follows:

(1.2.14) E(r_c) \approx \rho_c < \rho < E(r).

Empirical confirmation of (1.2.14) is presented in Section 3.3.
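The ordering (1.2.14) is easy to reproduce in a small Monte Carlo sketch of our own (it is not the monograph's simulation program): derivation and validation samples are drawn from a simple multivariate normal population, the weights are fitted by least squares, and the average r², r_c², and ρ_c² are compared with ρ².

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, reps = 8, 40, 2000

# Population: predictors with identity covariance, criterion built from them.
beta = rng.normal(size=n)
sigma_e = 1.0                                        # error standard deviation
rho2 = beta @ beta / (beta @ beta + sigma_e ** 2)    # population rho^2

r2s, rc2s, pc2s = [], [], []
for _ in range(reps):
    X = rng.normal(size=(N, n))
    y = X @ beta + sigma_e * rng.normal(size=N)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)        # derivation-sample weights
    r2s.append(np.corrcoef(X @ b, y)[0, 1] ** 2)     # sample multiple correlation

    Xv = rng.normal(size=(N, n))                     # validation sample
    yv = Xv @ beta + sigma_e * rng.normal(size=N)
    rc2s.append(np.corrcoef(Xv @ b, yv)[0, 1] ** 2)  # cross-validity

    # population cross-validity of the sample weights, cf. (2.8.2)
    pc2s.append((b @ beta) ** 2 / ((b @ b) * (beta @ beta + sigma_e ** 2)))

print("rho^2        ", round(rho2, 3))
print("mean r^2     ", round(np.mean(r2s), 3))       # largest on average
print("mean r_c^2   ", round(np.mean(rc2s), 3))      # roughly equal to mean rho_c^2
print("mean rho_c^2 ", round(np.mean(pc2s), 3))
```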

1.3 Improvement of Prediction

It is well known that adding predictors to a regression equation increases both the sample and population multiple correlations. However, the greater the number of predictors, n, the more unstable are the sample regression weights and the lower are the sample cross-validity, r_c, and the population cross-validity, ρ_c. The decrease in estimated ρ_c follows from (1.2.9).

A second difficulty with a large number of predictors in multiple regression is that a subset of them would probably do just as well, if the subset could be determined. For predictions in applied psychology, e.g. personnel selection, it is undesirable to have to make a large number of measurements on each individual in order to make accurate predictions. Furthermore, the weights for a subset of predictors would be more stable in future samples due to the smaller n.

There are several ways to select a subset of predictors. The best selection procedure is stepwise regression, in which predictors are added to the regression, one at a time, until there is no significant additional prediction. Other selection procedures are shown by Darlington [1968] to be inferior to the stepwise method.

Another way to reduce the number of predictors in the regression function is to use a few linear combinations of the predictors rather than the predictors themselves. Two such methods, called reduced rank methods, are considered in this study. In the first method [Horst, 1941], the largest principal components of the predictors [Anderson, 1958] are entered in the regression function. Since the principal components may be expressed as linear combinations of the predictors, the regression function may be transformed to a linear combination of the predictors. Hence the full set of predictor variables is used but only through the intermediary of a few principal components. These components may be interpreted psychologically, and it may be possible to select predictors loading highly on the components as a subset to use in future prediction. In this way reduced rank prediction can lead to a reduction of the size of the predictor battery.

In his 1941 paper, Horst also suggested that the predictors could be represented as a linear function of common and unique factors rather than as a linear function of the principal components. This factor analytic model is more difficult to treat because of the difficulty of estimating the factors as linear combinations of the predictors. Unlike Horst's first method, factor analysis is not a reduced rank method. The factor analytic model for regression calculations was studied by Leiman [1951] with some success but will not be considered further in this study.

Before outlining the second reduced rank procedure, let us consider a study by Burket [1964] comparing a number of regression methods in a large data sample. He compared two stepwise selection procedures [Efroymson, 1960; Horst & MacEwan, 1960], the largest principal components method, the smallest principal components method [Guttman, 1958], and the criterion-related principal components method [Hotelling, 1957; Massy, 1965]. Guttman proposed the use of the smallest principal components since the solution for the multiple regression weights depends on the inverse of the predictor intercorrelation matrix, and the largest components of the inverse are the smallest components of the original matrix. Hotelling and Massy suggested that the principal components which are entered into the regression function should be those components correlating maximally with the criterion rather than those of largest variance. Burket compared these five methods for several criteria and in several subsamples of his total sample. He found that the largest principal components method was consistently superior to the other four methods. One purpose of the present study is to show under what conditions this superiority can be expected to hold.

The second reduced rank method, prediction from the principal predictors, was developed from the following considerations. The principal components of the predictors may not be highly related to the criterion since the components are determined solely from the intercorrelations of the predictors. It would be desirable to find linear combinations of the predictors which are strongly related to the criterion. The Hotelling and Massy method employed by Burket finds these linear combinations by computing the correlation of each principal component with the criterion and entering into the multiple regression only those components with the highest correlations. However, a more effective procedure might be to find those linear combinations of the predictors (not necessarily the principal components of the predictors) which are maximally correlated with the criterion.

In the single criterion case, this problem is trivial since there is only one linear combination of the predictors maximally correlated with the criterion and all other orthogonal combinations are uncorrelated with the criterion. This combination is simply the regression function, i.e. the predicted criterion, using multiple regression on all the predictors. Therefore, in the single criterion case, nothing new is found by considering linear combinations of the predictors maximally correlated with the criterion.

Consider, however, prediction of several criteria from a common set of predictors. Examples of such multiple criteria are the prediction of success in several academic curricula by using a battery of aptitude tests or the prediction of a number of social criteria using scales from a personality test [Hase & Goldberg, 1967]. In particular, let us suppose that we wish to predict each of the criteria equally well. Then Tucker [1957] has developed a method which discovers those linear combinations of the predictors maximally related to the set of criteria. These combinations are called the principal predictors. The largest of the principal predictors may be entered into the regression equations for each criterion. The principal predictors have the property that, for a fixed number of linear combinations entered into each regression, the average squared multiple correlation is greater for the principal predictors than for any other linear combinations entered into the regression.

The principal predictors were developed by Tucker as a convenient way to summarize a large number of predictor scores by a few criterion-related predictor scores. The principal predictors also provide a useful conceptualization of the relationship of a set of predictors to a set of criteria. In the present study, on the other hand, the principal predictors are compared with the principal components as reduced rank prediction methods.

In any prediction calculation, each criterion variable may be divided into two parts: one part is predictable from the set of predictors and the other part is unpredictable from these predictors. When there are several criteria the predictable parts of the criteria are themselves a set of variables which have principal components. These principal components are the principal predictors. It is important not to confuse the principal components of the predictors, previously discussed, with the principal components of the predictable parts of the criteria, which are called the principal predictors. The largest principal predictor accounts for the largest portion of the predictable variation in the criteria. The next largest principal predictor accounts for the next largest portion, and so on. Therefore a few of the principal predictors account for most of the predictable variation in the criteria.

The largest principal predictors may be used as predictors themselves. Then the principal predictors, like the principal components of the predictors, can be expressed in terms of the original predictors. The weights for the original predictors can therefore be calculated. Also, the principal predictors may be interpreted psychologically, which may lead to greater understanding of the relationship between the predictors and the criteria. The principal predictors, unlike the principal components of the predictors, are criterion-related so that variation in the predictors which is unrelated to the criteria is not represented in the principal predictors. In some cases, the predictor variation which is unrelated to the criteria could be large enough to dominate the principal components of the predictors. But this variation is not useful for prediction. This is the reason that prediction from the largest principal predictors may be superior to prediction from the largest principal components.

The two methods, prediction from the principal components of the predictors and from the principal predictors, are called reduced rank methods since, in both cases, a correlation matrix may be approximated by a matrix of lower rank using the largest principal components or the largest principal predictors. In the first method, the correlation matrix of the predictors is approximated, while in the second method, the correlation matrix of the predictable parts of the criteria is approximated. These statements are made more precise in Chapter 2.

Of the prediction methods discussed in this section only three will be considered further in this study:

(a) prediction from the full set of predictors,
(b) prediction from the largest principal components of the predictors, and
(c) prediction from the largest principal predictors.

The last method is possible only when there are several criteria. The first two methods may be used for one or several criteria.

1.4 The Comparison of Prediction Methods

In order to evaluate and compare the prediction methods described in the preceding section it would be desirable to employ mathematical techniques. However, the problems are so complex that multivariate statistical theory is unable to solve most of them.

Another approach to these problems has been to apply the different prediction methods to a common body of data and to compare the results [Burket, 1964; Leiman, 1951]. There are definite advantages to this approach. Any conclusions are based on real data and do not depend on the assumptions in a theoretical development being valid. However, there is a major drawback to such empirical techniques. If two or more studies, using different data, disagree in their conclusions, it is difficult to determine what properties of the data sets differ enough between the studies to produce the varied conclusions. Similarly, it is difficult to evaluate the generality of conclusions found in a single study using one set of data.

It is therefore desirable to compare the prediction methods on a wide variety of data sets, differing in a known way in certain parameters. Since it is hard to satisfy this condition with real data, it is proposed that some useful conclusions may be made from the study of artificial or simulated data sets. Such data sets can be readily generated on a computer. The parameters specifying the properties of a data set can be input to the computer and a wide variety of data sets can be generated by varying these input parameters. The prediction methods can then be compared on these data. Such a simulation procedure is described in this study.

The simulation experiments consist of four stages of calculations:

Generation of the model. A combined predictor and criterion population covariance matrix is generated subject to certain input parameters. The population model and its parameters are described in Sections 2.2 and 2.3. In the model the predictors and criteria are expressed in terms of the principal predictors.

Generation of two samples. Two samples, each of size N, are obtained from the population generated in the preceding stage. The samples are obtained by generating sample covariance matrices; the method is outlined in Section 2.4. The two samples are used in double cross-validation.

Calculation of predictor weights. The predictor weights are calculated, in each sample, by one or more of the following methods: (a) multiple regression on the predictors, (b) multiple regression on the principal components, and (c) multiple regression on the principal predictors. The calculation of these weights is described in Sections 2.5, 2.6, and 2.7, respectively.

Cross-validation of the weights. The weights for each sample (and each method) are cross-validated on the other sample (r_c) and on the population itself (ρ_c). The formulas for the validities are presented in Section 2.8.


CHAPTER 2

THE MATHEMATICAL MODEL AND SAMPLE CALCULATIONS

2.1 Notation

Scalars are denoted by lower case letters (m, p). The only exceptions to this convention are N for sample size and the elements of matrices. Scalars may be either numbers or random variables. Column vectors are denoted by lower case italicized letters (x, a). These vectors may be either random variable vectors or vectors of numbers. Row vectors are transposed column vectors, transposition being indicated by priming (x′, a′). Matrices are denoted by upper case letters (A, Σ). Transposed matrices are indicated by a prime (A′, Σ′). The (i, j) element of a matrix is denoted by A_ij. The identity matrix is I.

The matrix consisting of the first t columns of a matrix A is denoted by (A)_t. The first t rows of A are _t(A). The vector consisting of the first t elements of a vector b is denoted by _t(b).

The population covariance matrix of two random vectors x and y is denoted by Σ_xy. The corresponding sample covariance matrix is C_xy. When y is known to have only one component, the covariance matrices are column vectors denoted by σ_xy and c_xy. The variance of a scalar y is denoted by σ_yy or c_yy. The abbreviation Var( ) is used to denote the variance of the random variable enclosed in parentheses.

The univariate normal distribution with mean = m and variance = v is denoted by N(m, v). The multivariate normal distribution with mean vector = a and covariance matrix = Σ is represented by N(a, Σ). Fisher's F distribution, Student's t distribution, and the chi distribution are denoted by F(n_1, n_2), t(n), and χ(n), respectively, with the indicated degrees of freedom. The chi distribution is the square root of the chi-squared distribution.

2.2 The General Model for Predictors and Criteria

Let x (n components) be n random variables, called predictors, and let y (m components) be m random variables, called criteria. Let m be less than n, as is usually the case in practice. For convenience, normalize all variables x and y to unit variance. Let x and y have a joint multivariate normal distribution with null mean vector and arbitrary covariance matrix

(2.2.1) \Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}.

It can be shown [Herzberg, 1967] that x and y can be written in a special way in terms of (n + m) independent unit variance random variables w. Let w be partitioned as

(2.2.2) w’ = (w;w~w~)

where w_1 has m components, w_2 has (n − m) components, and w_3 has m components. Then

(2.2.3) \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} S_1 & S_2 & 0 \\ F & 0 & E \end{pmatrix}\begin{pmatrix} w_1 \\ w_2 \\ w_3 \end{pmatrix}, \qquad \text{column widths } m,\ (n-m),\ m.

(The numbers of rows or columns in the partitioned matrices are appended to the matrix expression above.)

Thus the submatrices forming Σ may be written as

(2.2.4) \Sigma_{xx} = S_1S_1' + S_2S_2',

(2.2.5) \Sigma_{xy} = S_1F',

(2.2.6) \Sigma_{yy} = FF' + EE'.

This representation of x and y may be understood in the following way. Let y be written as the sum of two terms

(2.2.7) y = \hat{y} + e,

where ŷ is the linear least squares prediction of y from the predictors x. That is,

(2.2.8) \hat{y} = B'x.

The weight matrix is written as B′ rather than B so that when m = 1, B′ = b′, the transpose of the weight vector b; each component of ŷ is the best predictor of the corresponding component of y. Now the variables ŷ may be transformed to independent unit variance variables w_1 by

(2.2.9) \hat{y} = Fw_1,

where F is orthogonal by columns, so that

(2.2.10) F’F = D2 (diagonal)

with the diagonal elements of D² in decreasing order. The variables w_1 are called the principal predictors and are, except for normalization, the principal components of ŷ. The diagonal elements of D² are the eigenvalues of Σ_ŷŷ and are

(2.2.11) D_{kk}^2 = \sum_{j=1}^{m} F_{jk}^2.


The variables e may also be transformed to independent unit variance variables w_3 by

(2.2.12) e = Ew_3.

This latter transformation may be done in many ways. The representation of y is now complete.

The predictors x are also represented as the sum of two terms,

(2.2.13) x = \hat{x} + d,

where x̂ is the part of x linearly predictable from the principal predictors w_1. The relationship of x̂ to the principal predictors is

(2.2.14) \hat{x} = S_1 w_1.

The residual vector d is related to (n − m) independent unit variance variables w_2 by

(2.2.15) d = S_2 w_2.

The fact that the n-vector d is of rank (n − m) is shown in Herzberg [1967]. In summary then, the criteria y are written as the sum of a linear transformation of the principal predictors w_1 and a transformation of m other independent variables w_3. Similarly, the predictors x are written as the sum of a linear transformation of the same principal predictors w_1 and a transformation of (n − m) other independent variables w_2. The association between y and x is expressed through their dependence on the principal predictors w_1. The non-associated parts of x and y are expressed in terms of independent variables w_2 (for x) and w_3 (for y). The total set of variables w′ = (w_1′ w_2′ w_3′) are independent unit variance normal.
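A minimal numerical sketch of the representation (2.2.3), using small illustrative matrices of our own choosing (and without the unit-variance normalization imposed in the monograph): it assembles the covariance of (x, y) from the partitioned transformation of w and confirms the submatrix formulas (2.2.4) to (2.2.6).

```python
import numpy as np

n, m = 4, 2                    # predictors, criteria (illustrative sizes)
rng = np.random.default_rng(1)

# Illustrative matrices; in the monograph they are generated subject to
# the input parameters of Section 2.3.
S1 = rng.uniform(-0.5, 0.5, size=(n, m))
S2 = rng.uniform(-0.5, 0.5, size=(n, n - m))
F  = rng.uniform(-0.5, 0.5, size=(m, m))
E  = rng.uniform(-0.5, 0.5, size=(m, m))

# (2.2.3): (x, y)' = [[S1, S2, 0], [F, 0, E]] (w1, w2, w3)'
T = np.block([[S1, S2, np.zeros((n, m))],
              [F, np.zeros((m, n - m)), E]])

Sigma = T @ T.T                 # covariance of (x, y), since Cov(w) = I
Sxx, Sxy, Syy = Sigma[:n, :n], Sigma[:n, n:], Sigma[n:, n:]

assert np.allclose(Sxx, S1 @ S1.T + S2 @ S2.T)   # (2.2.4)
assert np.allclose(Sxy, S1 @ F.T)                # (2.2.5)
assert np.allclose(Syy, F @ F.T + E @ E.T)       # (2.2.6)
print("submatrices of Sigma agree with (2.2.4)-(2.2.6)")
```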

The matrices S_1 and F are central to the description of the dependence of the criteria on the predictors. Let us first consider F. The squared multiple correlation of the jth criterion y_j with the n predictors is (recall that y_j has unit variance)

(2.2.16) \rho_j^2 = \operatorname{Var}(\hat{y}_j) = (B'\Sigma_{xx}B)_{jj} = (FF')_{jj}.

The average, over criteria, of the squared multiple correlation is

(2.2.17) \bar\rho^2 = (1/m)\sum_{j=1}^{m}\rho_j^2 = (1/m)\sum_{j=1}^{m}\sum_{k=1}^{m}F_{jk}^2 = (1/m)\sum_{k=1}^{m}D_{kk}^2,

using (2.2.11). The size or importance of the kth principal predictor for predicting the criteria y can be measured by the sum of squares of the coefficients of the kth principal predictor in (2.2.9). The coefficients are the kth column of F. The sum of squares is

(2.2.18) \sum_{j=1}^{m} F_{jk}^2 = D_{kk}^2.

Hence the eigenvalues D²_kk can be interpreted as the total variance of the criteria accounted for by the kth principal predictor. The D²_kk are, by definition, decreasing with k, so that the first principal predictor accounts for more of the variance of the criteria than any other principal predictor. If, in a certain population, the first eigenvalue D²_11 is very large and the others small, this indicates that most of the prediction of y from x is derived from only one linear combination of the x variables, namely the first principal predictor. On the other hand, if several of the D²_kk are large, then several independent linear combinations of the predictors are needed in order to get maximum prediction of y from x.

The relation of x to the principal predictors w_1 through the matrix S_1 is quite independent of the matrix F and the quantities D²_kk. Let us define the (n × n) matrix S as the supermatrix

(2.2.19) S = (S_1\; S_2).

Thus

(2.2.20) x = \hat{x} + d = S_1 w_1 + S_2 w_2 = S\begin{pmatrix} w_1 \\ w_2 \end{pmatrix}.

Since the variables ~t’ :~re independent and of unit variance, and the pr~dictors x are of unit variance, the sum of squares of each row of S is 1.0.The sum of squares of the ith row of S~ is then less than 1.0 and is Var(xi),

the magnitude of the dependence of the ith predictor on the principal pr~dictors:

(2.2.21) Var (:~,) = ~ S~

Turning to the columns of S, let q²_k (k = 1, …, m) denote the sum of squares of the kth column of S_1:

(2.2.22) q_k^2 = \sum_{i=1}^{n} S_{ik}^2.

q²_k is a measure of the total dependence or relation of the predictors x to the kth principal predictor. The q²_k are analogous to the D²_kk since they represent, respectively, the total dependence of the predictors and the criteria on the kth principal predictor. (Var(x̂_i) and ρ²_j are also analogous.) We may average the Var(x̂_i) in the same way as the ρ²_j were averaged in (2.2.17):

(2.2.23) \bar\sigma^2 = (1/n)\sum_{i=1}^{n}\operatorname{Var}(\hat{x}_i) = (1/n)\sum_{k=1}^{m} q_k^2.

σ̄² is the average, over predictors, of the predictor variance related to the principal predictors. It is a measure of the dependence of the predictors on the principal predictors. For brevity, σ̄² will be called the average criterion-related predictor variance. This description should not imply that σ̄² is an average multiple correlation of x predicted from y (roles of predictors and criteria reversed). σ̄² is, however, the average multiple correlation of the x variables predicted from the principal predictors w_1. Note that σ̄² is also the average of the q²_k (but the division is by n, not m, so that σ̄² has a maximum value of 1.0).
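The summary parameters just defined follow directly from F and S_1. The sketch below (our construction; F is built as VD with a random orthonormal V so that (2.2.10) holds) simply evaluates (2.2.16) to (2.2.18), (2.2.22), and (2.2.23).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 2

# Build F = VD as in (2.3.6): V orthonormal (from a QR decomposition),
# D^2 the input eigenvalues; S1 is an illustrative loading matrix.
D2_in = np.array([0.6, 0.3])
V, _ = np.linalg.qr(rng.normal(size=(m, m)))
F = V @ np.diag(np.sqrt(D2_in))
S1 = rng.uniform(-0.4, 0.4, size=(n, m))

rho_j2 = np.sum(F ** 2, axis=1)            # (2.2.16): diag(FF'), squared multiple correlations
rho_bar2 = rho_j2.mean()                   # (2.2.17): average squared multiple correlation
q2 = np.sum(S1 ** 2, axis=0)               # (2.2.22): column sums of squares of S1
sigma_bar2 = q2.sum() / n                  # (2.2.23): average criterion-related predictor variance

print(np.allclose(np.sum(F ** 2, axis=0), D2_in))    # (2.2.18): column SS of F equal the D^2_kk
print(np.isclose(rho_bar2, D2_in.sum() / m))          # consistency with (2.2.17)
print(q2, sigma_bar2)
```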

Consider now two populations, each with the same average squared multiple correlation ρ̄². One population might have a small value of σ̄² and the second a large value of σ̄². In the first population the predictors depend very little on the principal predictors while in the second the dependence is greater. Nevertheless the prediction of the criteria is the same in each population. This paradoxical situation can be understood by first noting that we are considering for the moment prediction in the population, not in finite samples. The prediction of y is solely via the principal predictors w_1, and the random vector w_1 is an exact linear combination of the predictors x since S is square and non-singular:

(2.2.24) w_1 = {}_m(S^{-1})\,x.

As additional verification that the average multiple correlation is independent of σ̄², note that ρ̄² depends only on F in (2.2.17).

The parameter σ̄² will have an effect on prediction in finite samples, however. Consider prediction from the principal components of the predictors. When σ̄² is large, the largest principal components will lie largely in the space of the principal predictors and will contribute to prediction. However, when σ̄² is small, the first principal component will be unrelated to the principal predictor space and will be a very poor predictor. The effectiveness of prediction from the principal components will thus depend on the size of σ̄². Empirical confirmation of this phenomenon will be demonstrated in Section 3.4.

2.3 Computer Generation of the Model

Since any set of n predictors and m criteria may be written in the form (2.2.3), it is possible to generate an arbitrary population distribution by specifying the matrices S, F, and E. The covariance matrices Σ_xx, Σ_xy, and Σ_yy may then be calculated by (2.2.4) to (2.2.6), where S is related to S_1 and S_2 by (2.2.19).

Rather than allowing the matrices S, F, and E to be completely arbitrary, a few basic parameters may be fixed arbitrarily and the matrices then generated essentially randomly subject to these given parameters. These arbitrary parameters are called "input parameters" since they are input to the computer program that generates the model.

The major input parameters are:

1. n = number of predictors.
2. m = number of criteria.
3. (D²_kk, k = 1, …, m), the eigenvalues of Σ_ŷŷ.
4. (q²_k, k = 1, …, m), the dependencies of the predictors on the principal predictors.

Note that once D²_kk and q²_k are specified for all k, the average squared multiple correlation ρ̄² and the average criterion-related predictor variance σ̄² are fixed by (2.2.17) and (2.2.23). In particular, when m = 1 as in Chapter 3, ρ² = D²_11 and σ̄² = (1/n)q²_1.

Two additional parameters, related to n and m, are:

1a. n_s = number of columns of S.
2a. m_d = number of duplicate criteria.

In the model described in Section 2.2, S = (S_1 S_2) is an (n × n) square matrix. S_1 has m columns and S_2 has (n − m) columns. In some of the experiments described in Sections 3.1 and 3.2, S has more than n columns, namely n_s columns, so that S_2 has (n_s − m) columns and S_1 still has m columns.

All the experiments in Chapter 3 involve only one criterion (m = 1). However, several duplicate criteria are allowed and the number of such duplicate criteria is denoted by m_d. Duplicate criteria are described further in Chapter 3.

Four minor parameters are needed to complete the input for the model:

5. v_x = variance of the generated Var(x̂_i).
6. e_x = tolerance on this variance.
7. v_y = variance of the generated multiple correlations ρ²_j.
8. e_y = tolerance on this variance.

The computer program generates matrices F and E so that the m squared multiple correlations ρ²_j calculated from them have mean exactly equal to the average of the D²_kk and variance equal to v_y within a maximum error of e_y. That is,

(2.3.1) \bar\rho^2 = (1/m)\sum_{j=1}^{m}\rho_j^2 = (1/m)\sum_{k=1}^{m}D_{kk}^2


and

(2.3.2) \left|(1/m)\sum_{j=1}^{m}(\rho_j^2 - \bar\rho^2)^2 - v_y\right| < e_y.

Matrix S, composed of S_1 and S_2, is generated so that the mean of the variances of x̂_i calculated from S_1 is exactly equal to (1/n)Σ_k q²_k and the variance of these variances is equal to v_x within a maximum error of e_x. That is,

(2.3.3) \bar\sigma^2 = (1/n)\sum_{i=1}^{n}\operatorname{Var}(\hat{x}_i) = (1/n)\sum_{k=1}^{m}q_k^2

and

(2.3.4) \left|\operatorname{Var}[\operatorname{Var}(\hat{x}_i)] - v_x\right| < e_x.

The two occurrences of "Var" in the preceding formula refer to different types of variances. Var(x̂_i) means the variance of the random variable x̂_i in the population. Let the constants Var(x̂_i) = v_i temporarily. Then Var(v_i) is simply shorthand notation for

(2.3.5) \operatorname{Var}(v_i) = (1/n)\sum_{i=1}^{n}(v_i - \bar\sigma^2)^2.

Note that σ̄² is the mean of the v_i.

Generation of F

Since F is orthogonal by columns (equation (2.2.10)), it may be written as a product of an orthonormal matrix V and a diagonal matrix D consisting of the square roots of the eigenvalues D²_kk:

(2.3.6) F = VD.

The matrix D² is input, so that generation of F reduces to the generation of an orthonormal V satisfying the two restrictions (2.3.1) and (2.3.2) on the squared multiple correlations ρ²_j. ρ²_j may be expressed in terms of V and D by

(2.3.7) \rho_j^2 = \operatorname{Var}(\hat{y}_j) = (FF')_{jj} = (VD^2V')_{jj}.

When ρ²_j is calculated in this way with V orthonormal, (2.3.1) is automatically satisfied. The restrictions on V are then

(a) V must be orthonormal,
(b) all ρ²_j calculated from (2.3.7) must be less than 1.0, and
(c) the variance of the ρ²_j must satisfy (2.3.2).


An algorithmic procedure to generate a V satisfying these three conditions, for arbitrary parameters m, (D²_kk, k = 1, …, m), v_y, and e_y, is outlined in Herzberg [1967].

Generation of E

The elements of the (m × m) matrix E are first generated randomly from N(0, 1) and then the rows of E are normalized so that, for all j,

(2.3.8) (EE’)~j = ~ E~k = 1 --

where the ρ²_j are calculated from (2.3.7). This normalization ensures that the criteria y are normalized to unit variance. The methods used to generate normal random numbers, as well as other random numbers discussed in this chapter, follow Box and Muller [1958] and are described in Herzberg [1967].

Generation of S

Two different methods for generating S were developed for the experiments described in Chapters 3 and 4. The first is called the e_x = 0 method since the variance of Var(x̂_i) is exactly equal to v_x. This method was used for the single criterion case (m = 1) in Chapter 3. The method does not generalize well to the m > 1 case and therefore a second method, called the e_x ≠ 0 method, was used for the several-criterion calculations in Chapter 4. This latter method could have been employed in Chapter 3 except that e_x cannot be set to zero in this method.

When m = 1, σ̄² can be calculated directly from the input parameters as

(2.3.9) \bar\sigma^2 = (1/n)\,q_1^2.

Then, in the e_x = 0 method, n numbers are generated randomly from N(σ̄², v_x) subject to the restriction that no number be more than 1.0 or less than 0.0. The numbers are rescaled after generation so that their mean is exactly σ̄² and their variance is v_x. If one of the numbers is now more than 1.0 or less than 0.0, the number is discarded and a new attempt is made to satisfy the conditions. These n numbers are the variances of x̂_i, denoted by v_i = Var(x̂_i) in (2.3.5). When this step is completed, (2.3.3) and (2.3.4) are exactly satisfied with e_x = 0.

Now S_1 is a single column, s_1, and its elements are defined as the square roots of (v_i, i = 1, …, n), with their signs chosen randomly. The elements of the last (n_s − 1) columns of S, namely S_2, are generated randomly from N(0, 1) and then rescaled, by rows, so that the row sum of squares of the whole S matrix is unity. This rescaling ensures that all x variables have unit variance. The generation of S by the e_x = 0 method is now complete.
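A sketch of the e_x = 0 method for m = 1, following the steps just described. The monograph does not spell out the rescaling formula, so the shift-and-scale used here (and the retry loop) are our own reading of the procedure.

```python
import numpy as np

def generate_S_ex0(n, n_s, q1_sq, v_x, rng, max_tries=1000):
    """e_x = 0 method (m = 1): returns S = (s1 | S2) with unit row sums of squares."""
    sigma_bar2 = q1_sq / n                                  # (2.3.9)
    for _ in range(max_tries):
        v = rng.normal(sigma_bar2, np.sqrt(v_x), size=n)
        if np.any(v < 0.0) or np.any(v > 1.0):
            continue
        # Rescale so the mean is exactly sigma_bar2 and the variance exactly v_x.
        v = sigma_bar2 + (v - v.mean()) * np.sqrt(v_x / v.var())
        if np.all((v >= 0.0) & (v <= 1.0)):
            break
    else:
        raise RuntimeError("could not generate admissible variances")

    s1 = np.sqrt(v) * rng.choice([-1.0, 1.0], size=n)       # signs chosen randomly
    S2 = rng.normal(size=(n, n_s - 1))
    # Rescale the rows of S2 so each row of S = (s1 | S2) has unit sum of squares.
    S2 *= np.sqrt((1.0 - v) / np.sum(S2 ** 2, axis=1))[:, None]
    return np.column_stack([s1, S2])

rng = np.random.default_rng(3)
S = generate_S_ex0(n=6, n_s=6, q1_sq=2.4, v_x=0.01, rng=rng)
print(np.allclose(np.sum(S ** 2, axis=1), 1.0))             # unit predictor variances
print(np.isclose(np.sum(S[:, 0] ** 2) / 6, 2.4 / 6))        # sigma_bar2 reproduced exactly
```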

The e_x ≠ 0 method of generating S is very similar to the method of generating F. In analogy to (2.3.6), in which all matrices are (m × m), S_1 is written as

(2.3.10) S_1 = TQ,

where S_1 and T are (n × m) and Q is diagonal (m × m) with diagonal elements = (q_k, k = 1, …, m). T (like V) is orthonormal by columns. Var(x̂_i) may be written in terms of T and Q as

=S’ = (TQ

Since T is orthonormal it follows that the average of these variances is exactly (1/n)Σ_k q²_k = σ̄², so that (2.3.3) is exactly satisfied. The remaining restrictions on T (as on V) are then

(a) T must be orthonormal by columns,
(b) all Var(x̂_i) calculated from (2.3.11) must be less than 1.0, and
(c) the variance of the Var(x̂_i) must satisfy (2.3.4).

The algorithmic procedure for generating V may also be used to generate a T satisfying the above three conditions for arbitrary n, m, (q²_k, k = 1, …, m), v_x, and e_x.

After S_1 is generated, S_2 (n × (n_s − m)) is generated in the same way as in the e_x = 0 method. First the elements of S_2 are generated as N(0, 1) random numbers. The rows of S_2 are rescaled so that each row sum of squares of S = (S_1 S_2) is unity, resulting in unit variance x variables. This completes the e_x ≠ 0 method for generating S.

2.4 Computer Generation of Data Samples

x and y are (n + m) random variables with a joint multivariate normal distribution. The mean vector is the null vector and the covariance matrix is Σ as given in (2.2.1). In order to draw samples from this distribution, it would be a straightforward procedure to use (2.2.3), which expresses x and y in terms of independent N(0, 1) variables w. For each simulated subject it would be necessary to generate (n + m) independent N(0, 1) numbers and to place these in (2.2.3) as the w values. The sample x and y vectors would then be found by matrix multiplication.

This procedure, while conceptually simple, has the disadvantage that the computer time required increases linearly with N, the sample size. The method is thus costly to use for all but small sample sizes.

Another procedure was chosen instead. It is not based on generating sample vectors x and y at all but on generating a sample covariance matrix C directly.


The method is the Bartlett decomposition of the Wishart distribution [Bartlett, 1933; Kshirsagar, 1959; Wijsman, 1957]. The covariance matrix C has a Wishart distribution depending solely on the population covariance matrix Σ, the sample size N, and the number of variables, which is (n + m).

Let the population covariance matrix Σ be written as

(2.4.2) \Sigma = \Omega\,\Omega'.

This may be done in a variety of ways. The Gauss-Doolittle method for computing a triangular Ω was used.

Let an ((n + m) × (n + m)) matrix A be defined by

(2.4.3) A = (1/N)\,TT',

where T is a lower triangular ((n + m) × (n + m)) matrix whose triangular elements are independent random variables:

(2.4.4) T_{ij}\ (i > j)\ \text{are}\ N(0, 1), \qquad T_{ii}\ \text{are}\ \chi(N - i), \qquad T_{ij} = 0\ (i < j).

Then, if we compute

(2.4.5) C = \Omega A\Omega' = (1/N)\,\Omega TT'\Omega',

C will have a Wishart distribution as desired. Equation (2.4.5) is the Bartlett decomposition of the Wishart matrix C. A is a sample covariance matrix from a population with identity covariance matrix. The letter A is used as a temporary symbol in this paragraph and is reserved for another use in Section 2.6.
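A sketch of this sample-generation step: given Σ and N, it produces a sample covariance matrix by the Bartlett decomposition (2.4.3) to (2.4.5). The χ(N − i) degrees of freedom follow the text, and a Cholesky factor stands in for the Gauss-Doolittle triangular Ω.

```python
import numpy as np

def wishart_sample_cov(Sigma, N, rng):
    """Sample covariance matrix C = Omega A Omega' with A = (1/N) T T' (Bartlett)."""
    p = Sigma.shape[0]
    Omega = np.linalg.cholesky(Sigma)           # Sigma = Omega Omega', cf. (2.4.2)
    T = np.zeros((p, p))
    idx = np.tril_indices(p, k=-1)
    T[idx] = rng.normal(size=idx[0].size)       # T_ij (i > j) ~ N(0, 1)
    df = N - np.arange(1, p + 1)                # T_ii ~ chi(N - i), as in (2.4.4)
    T[np.diag_indices(p)] = np.sqrt(rng.chisquare(df))
    A = (T @ T.T) / N                           # (2.4.3)
    return Omega @ A @ Omega.T                  # (2.4.5)

rng = np.random.default_rng(4)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
C_bar = np.mean([wishart_sample_cov(Sigma, N=50, rng=rng) for _ in range(4000)], axis=0)
print(np.round(C_bar, 3))                       # close to Sigma, up to the 1/N scaling convention
```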

2.5 Multiple Regression on the Predictors

The most widely used method for prediction is multiple regression. This least squares method ensures that, in the derivation sample, the correlation between the predicted score and the observed criterion score is a maximum. The maximum correlation is the multiple correlation.

Let the covariance matrix of the n predictors in the derivation sample (of size N) be C_xx (n × n) and let the column vector c_xy (n × 1) be the covariance between each predictor and the single criterion (m = 1). Let the coefficients of the multiple regression combination of the predictors be b_1 (n × 1), the subscript indicating that the first method of prediction, multiple regression on the predictors, is being used. The linear combination of the predictors is then b_1′x.

The solution for b_1 is well known to be

(2.5.1) b_1 = C_{xx}^{-1}\,c_{xy}.


The correlation between ŷ = b_1′x and y is the multiple correlation. The square of this correlation is

(2.5.2) r_1^2 = \frac{b_1'c_{xy}}{c_{yy}},

where c_yy is the sample variance of the criterion. The subscript on r_1 is used only in this chapter to distinguish the three methods of prediction. The subscript is dropped in later chapters.
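A sketch of (2.5.1) and (2.5.2) operating directly on a sample covariance matrix of the (n + 1) variables (n predictors followed by the criterion), which is the form in which the simulated samples arrive; the illustrative covariance matrix is ours.

```python
import numpy as np

def regression_on_predictors(C, n):
    """Weights b1 and squared multiple correlation r1^2 from an (n+1)x(n+1) covariance matrix."""
    Cxx = C[:n, :n]
    cxy = C[:n, n]
    cyy = C[n, n]
    b1 = np.linalg.solve(Cxx, cxy)              # (2.5.1)
    r1_sq = (b1 @ cxy) / cyy                    # (2.5.2)
    return b1, r1_sq

# Illustrative covariance matrix: 2 predictors and 1 criterion.
C = np.array([[1.0, 0.3, 0.5],
              [0.3, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
b1, r1_sq = regression_on_predictors(C, n=2)
print(b1, round(r1_sq, 4))
```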

2.6 Multiple Regression on the Principal Components

As an alternative to the original predictors one can use the largest principal components of the predictors in the regression function. The scores on the principal components are first estimated from the predictor scores.

The following calculations are made in the derivation sample. The characteristic roots and vectors of the covariance matrix of the predictors, C_xx (n × n), are calculated. Let the roots be written in descending order on the diagonal of a diagonal matrix U² and the vectors in corresponding order as the columns of an orthonormal matrix W. Then C_xx may be written as

(2.6.1) C_{xx} = WU^2W',

where all matrices are (n × n). The characteristic vectors are the coefficients for relating the principal components, f, to the predictors, i.e.

(2.6.2) ] = W’x

where x is the (n × 1) column vector of one subject's scores on the predictors, and f is the (n × 1) column vector of the principal component scores [Anderson, 1958, pp. 273-277].

We wish to use the t largest principal components in the regression function. From (2.6.2), the scores on these t principal components are estimated by

(2.6.3) {}_t(f) = {}_t(W')\,x,

where _t(f) is (t × 1) and is the vector of the first t elements of f. _t(W′) is the matrix consisting of the first t rows of W′.

The prediction equation, using the t largest principal components, is

(2.6.4) \hat{y}^{(t)} = d'\;{}_t(f),

where d is a temporary symbol representing the t-vector of weights for the principal components. The multiple regression solution for d is

(2.6.5) d = C_{ff}^{-1}\,c_{fy},

in analogy to the solution (2.5.1). In (2.6.5), C_ff is (t × t) and c_fy is (t × 1). The covariance matrix of the t principal components is, from (2.6.3), (2.6.1),


and the orthonormality of W,

(2.6.6) C_{ff} = {}_t(W')\,C_{xx}\,(W)_t = {}_t(U^2)_t,

and the covariance of the t principal components and the criterion is

(2.6.7) c_{fy} = {}_t(W')\,c_{xy}.

Therefore the weight vector is

(2.6.8) d = {}_t(U^{-2})_t\;{}_t(W')\,c_{xy}.

Substituting (2.6.3) and (2.6.8) into (2.6.4), we find

(2.6.9) :~(~’ = c~.(W)~ ~(U-2)~ ~(W’)x

is the equation for predicting y from x using regression on the t largest principal components of the predictors. This equation may be simplified slightly by writing

(2.6.10) A = WU

so that (2.6.1) becomes

(2.6.11) Cx, = AA’.

Then (2.6.9) becomes

(2.6.12) ytt’ = c~’y(A’):’ t(A-’)x.

Equation (2.6.12) is the formula for predicting y from x using regression on the t largest principal components. If we express the right hand side of this equation as [b_2^{(t)}]′x, then the weight vector is

(2.6.13) b_2^{(t)} = ((A')^{-1})_t\;{}_t(A^{-1})\,c_{xy}.

It can be shown [Herzberg, 1967] that the squared multiple correlation is

(2.6.14) [r_2^{(t)}]^2 = \frac{[b_2^{(t)}]'c_{xy}}{c_{yy}},

which is identical in form to (2.5.2) but of course involves b_2^{(t)} instead of b_1. It is important to note that r_2^{(t)} is not the multiple correlation of y with the predictors x. The latter multiple correlation is r_1. The symbol r_2^{(t)} represents the multiple correlation of y and the t largest principal components of the predictors, and therefore r_2^{(t)} must be less than r_1 unless t = n. The subscript 2 is dropped in later chapters when the context makes clear which method of prediction is used.

Multiple regression on the principal components may be called a reduced rank method since the use of the t largest principal components instead of the original predictors is equivalent to approximating the matrix C_xx by the matrix

(2.6.15) \hat{C}_{xx} = (A)_t\;{}_t(A').

The matrix Ĉ_xx is of reduced rank t < n.
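A sketch of regression on the t largest principal components, (2.6.1) to (2.6.14), again starting from the sample covariance matrix. An eigendecomposition replaces the explicit A = WU factorization but yields the same weight vector as (2.6.13); the illustrative matrix is ours.

```python
import numpy as np

def regression_on_components(C, n, t):
    """Weight vector b2^(t) of (2.6.13) and squared multiple correlation (2.6.14)."""
    Cxx, cxy, cyy = C[:n, :n], C[:n, n], C[n, n]
    evals, W = np.linalg.eigh(Cxx)
    order = np.argsort(evals)[::-1]              # roots in descending order, as in (2.6.1)
    U2, W = evals[order], W[:, order]
    Wt, U2t = W[:, :t], U2[:t]                   # keep the t largest components
    d = (Wt.T @ cxy) / U2t                       # (2.6.8): C_ff^{-1} c_fy with C_ff = diag(U2t)
    b2 = Wt @ d                                  # equivalent to (2.6.13)
    r2 = (b2 @ cxy) / cyy                        # (2.6.14)
    return b2, r2

C = np.array([[1.0, 0.6, 0.2, 0.5],
              [0.6, 1.0, 0.1, 0.4],
              [0.2, 0.1, 1.0, 0.3],
              [0.5, 0.4, 0.3, 1.0]])             # 3 predictors and 1 criterion (illustrative)
for t in (1, 2, 3):
    b2, r2 = regression_on_components(C, n=3, t=t)
    print(t, np.round(b2, 3), round(r2, 4))       # r2 increases with t; t = 3 is full regression
```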

2.7 Multiple Regression on the Principal Predictors

Another method of calculating independent scores in the derivation sample is the method of principal predictors. Scores on the largest principal predictors are used in the regression equation. The method is only applicable if there are several criteria (m > 1).

Let the scores of a subject on the m criteria be y_1, …, y_m, which may be placed in a column vector y (m × 1). Each criterion, y_j, has a part, ŷ_j, linearly predictable from the n predictors, x, in the derivation sample:

(2.7.1) \hat{y}_j = c_{xy_j}'\,C_{xx}^{-1}\,x,

where c_{xy_j} (n × 1) is the covariance of y_j with the n predictors x and C_xx is the covariance matrix of the predictors. The m predictable parts of the criteria may be written as a column vector ŷ (m × 1). Then (2.7.1) may be rewritten

(2.7.2) \hat{y} = C_{yx}\,C_{xx}^{-1}\,x,

where C_yx (m × n) is the covariance matrix of the m criteria with the n predictors. The covariance of the predictable parts is the (m × m) matrix C_ŷŷ:

(2.7.3) C_{\hat y\hat y} = C_{yx}\,C_{xx}^{-1}\,C_{xy}.

Let us diagonalize C_ŷŷ in analogy to the way that C_xx was written in (2.6.1) and (2.6.11):

(2.7.4) C_{\hat y\hat y} = VD^2V' = GG'.

The matrices V and D² are sample estimates of the corresponding population matrices denoted by the same symbols in Sections 2.2 and 2.3. The eigenvalues D²_kk are written in decreasing order in the diagonal of D² and the eigenvectors are written in the corresponding order as columns of V. The rows of G are the coefficients for relating the predictable parts of the criteria and the principal predictors, i.e.,

(2.7.5) \hat{y} = Gw.

Thus the equation for estimating the scores on the m principal predictors (column vector w (m × 1)) from the predictable parts is

(2.7.6) w = G^{-1}\hat{y},

which, combined with (2.7.2), gives

(2.7.7) w = G^{-1}\,C_{yx}\,C_{xx}^{-1}\,x

as the equation for estimating w from x.


The equation for estimating the t largest principal predictors, _t(w), is

(2.7.8) {}_t(w) = {}_t(G^{-1})\,C_{yx}\,C_{xx}^{-1}\,x,

where _t(G⁻¹) is the first t rows of G⁻¹.

Note that w is a vector of numbers which can be calculated for each subject. The use here of w for the estimated principal predictor score vector should not be confused with the use of w, in Section 2.2, for the principal predictors, a random vector in the population.

Since the principal predictors are uncorrelated in the derivation sample, the multiple regression weights for predicting each y_j from the principal predictors are simply the rows of the pattern matrix G, as shown in (2.7.5). Each row of G is the weight vector for one criterion variable. The coefficients G are still the correct weights when only some of the principal predictors are included in the regression. If the t largest principal predictors are included, the predicted parts of the criteria are

(2.7.9) \hat{y}^{(t)} = (G)_t\;{}_t(w).

In order to express this equation in terms of the original predictor scores x as

(2.7.10) \hat{y}^{(t)} = B_3'\,x,

(2.7.9) and (2.7.8) may be combined, yielding

(2.7.11) B_3 = C_{xx}^{-1}\,C_{xy}\,((G')^{-1})_t\;{}_t(G').

It can be shown [Herzberg, 1967] that the squared multiple correlation of the jth criterion variable y_j is

(2.7.12) [r_{3j}^{(t)}]^2 = \frac{(B_3'\,C_{xy})_{jj}}{c_{y_jy_j}},

which is identical in form to (2.5.2) and (2.6.14) except that there is one such equation for each criterion variable y_j. As was the case with regression on the principal components, the multiple correlation r_{3j}^{(t)} is not the multiple correlation of y_j with x but is the multiple correlation of y_j and the t largest principal predictors. The correlation r_{3j}^{(t)} is always less than or equal to r_{1j}. The subscript 3 is dropped in later chapters.

Multiple regression on the principal predictors, as on the principal components, is a reduced rank method. The use of the t largest principal predictors instead of all m principal predictors is equivalent to approximating the matrix C_ŷŷ by the matrix

(2.7.13) \hat{C}_{\hat y\hat y} = (G)_t\;{}_t(G').

The matrix Ĉ_ŷŷ is of reduced rank t < m.
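A sketch of the principal-predictor calculations (2.7.2) to (2.7.12): starting from the partitioned sample covariance matrix, it forms C_ŷŷ, takes its eigendecomposition to obtain G as in (2.7.4), and builds the reduced-rank weight matrix B_3 of (2.7.11). The illustrative covariance matrix is ours.

```python
import numpy as np

def principal_predictor_regression(C, n, t):
    """Weight matrix B3 of (2.7.11) and squared correlations (2.7.12) for the t largest principal predictors."""
    Cxx, Cxy = C[:n, :n], C[:n, n:]
    Cyy = C[n:, n:]
    Cyhat = Cxy.T @ np.linalg.solve(Cxx, Cxy)      # (2.7.3): C_yhat = Cyx Cxx^{-1} Cxy
    D2, V = np.linalg.eigh(Cyhat)
    order = np.argsort(D2)[::-1]                   # eigenvalues in decreasing order
    D2, V = D2[order], V[:, order]
    G = V @ np.diag(np.sqrt(D2))                   # (2.7.4): C_yhat = G G'
    Gt = G[:, :t]
    # (2.7.11): B3 = Cxx^{-1} Cxy ((G')^{-1})_t  _t(G')
    B3 = np.linalg.solve(Cxx, Cxy) @ np.linalg.inv(G).T[:, :t] @ Gt.T
    r2 = np.diag(B3.T @ Cxy) / np.diag(Cyy)        # (2.7.12), one value per criterion
    return B3, r2

rng = np.random.default_rng(5)
L = rng.uniform(-0.6, 0.6, size=(5, 5))
C = L @ L.T + np.eye(5)                            # illustrative covariance: 3 predictors, 2 criteria
B3, r2 = principal_predictor_regression(C, n=3, t=1)
print(np.round(B3, 3))
print(np.round(r2, 4))
```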


It was pointed out after (2.7.4) that the eigenvalues of C_ŷŷ are estimates of the population parameters (D²_kk, k = 1, …, m). In order to estimate σ̄² and (q²_k, k = 1, …, m), it is natural to require that the covariance matrices be changed to correlation matrices. The correlations of the predictors x and the principal predictors w are given by the (n × m) sample matrix S_1 = C_xw. S_1 may be written as

(2.7.14) S_1 = C_{xw} = C_{xx}\,C_{xx}^{-1}\,C_{xy}\,(G')^{-1}

by equation (2.7.7).

The sample quantity q²_k is the total dependence of the predictors on the kth principal predictor and is therefore, since the predictors have unit variance by the use of correlation matrices, given by

(2.7.15) \text{sample } q_k^2 = \sum_{i=1}^{n} S_{ik}^2.

Similarly the sample estimate of σ̄², the average criterion-related predictor variance, is the sum of the squares of all the elements of S_1 divided by n and is

(2.7.16) \text{sample } \bar\sigma^2 = (1/n)\sum_{i=1}^{n}\sum_{k=1}^{m} S_{ik}^2 = (1/n)\sum_{k=1}^{m} \text{sample } q_k^2.

2.8 Cross-Validities

The calculation of correlations in the validation sample is identical for all three weight computational methods. Given the weight vector b from the derivation sample, the square of the correlation between b′x and a criterion variable y in the validation sample is

(b’c.)~(2.8.1) r~- (b’C,~b)cy~

where c_xy, C_xx, and c_yy are covariances computed in the validation sample. The quantity r_c is called the sample cross-validity and may be negative or positive. The sign of r_c is equal to the sign of b′c_xy. The sign of r_c is also affixed to r_c² when averages of several r_c² are taken.

The predictor weight vector b, calculated in the derivation sample, may also be applied to the population itself. The square of the correlation between b′x and the criterion variable y in the population is

(2.8.2) \rho_c^2 = \frac{(b'\sigma_{xy})^2}{(b'\Sigma_{xx}b)\,\sigma_{yy}}.


This formula is identical to (2.8.1) except for the use of population covariances instead of sample covariances. ρ_c is called the population cross-validity. A sign is affixed to ρ_c in the same way as to r_c.

The three statistics, r² (in its several forms), r_c², and ρ_c², are called the correlation statistics. These statistics are the principal quantities computed in the experiments described in Chapters 3 and 4.
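A sketch of the validation-stage formulas: the same function serves for (2.8.1) and (2.8.2), since ρ_c² differs from r_c² only in using population rather than sample covariances. The sign convention of the text (the sign of b′c_xy affixed to the squared validity) is included; the illustrative numbers are ours.

```python
import numpy as np

def cross_validity_sq(b, Cxx, cxy, cyy):
    """Signed squared cross-validity (2.8.1)/(2.8.2) for derivation-sample weights b."""
    num = b @ cxy
    rc_sq = num ** 2 / ((b @ Cxx @ b) * cyy)
    return np.sign(num) * rc_sq        # sign of b'c_xy affixed to the squared validity

# Illustrative: weights from one sample evaluated against another covariance matrix.
b = np.array([0.4, 0.3])
Cxx_val = np.array([[1.0, 0.2], [0.2, 1.0]])
cxy_val = np.array([0.45, 0.35])
print(round(cross_validity_sq(b, Cxx_val, cxy_val, cyy=1.0), 4))
```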


CHAPTER 3

SIMULATION RESULTS WITH ONE CRITERION VARIABLE

A number of experiments was performed using a single criterion variable. The two methods of prediction used were multiple regression on the predictors and on the principal components of the predictors. The model and sample generation methods described in Sections 2.2 and 2.3 are applicable to the single criterion case (m = 1). However, in order to generate many samples without excessive use of computer time, a special procedure for the m = 1 case was developed. This procedure, described in detail in Herzberg [1967], allows any number of criterion variables to be generated, each with the same population multiple correlation and each with the same relation to the predictors. The criterion variables are thus all duplicates of the single criterion of interest. The number of such duplicates is denoted by m_d.

As an example, suppose that there are five predictors and ten duplicate criteria. Then, in a sample, one can compute ten multiple correlations, one for each criterion. These ten correlations are all based on one sample (size N) of five predictor scores but on ten different samples of single criterion scores.

Each calculation described in this chapter is based on one or more populations (Σ) generated for each combination of the input parameters. For each Σ that was generated, sample covariance matrices were generated in pairs (C1, C2), representing the two samples needed for double cross-validation. Σ, C1, and C2 are the covariance matrices of (n + m_d) variables: n predictors and m_d duplicate criteria. Except in Section 3.4, at least two such pairs of sample covariance matrices were generated. This allowed variation in the predictor sample covariance matrices.

The results using the simulation program described in this chapter are:

(a) When p = 0, the sample multiple correlation and cross-validity follow the known theoretical law. (Section 3.1)

(b) When p ≠ 0, variation in Σ_xx produced by change in the number of columns of S does not affect the correlation statistics r², r_c², and p_c². (Section 3.2)

(c) The correlation statistics depend on n, N, and p² in theoretically understandable ways. Tables are presented which may be used to interpret sample multiple correlations and cross-validities. (Section 3.3)

(d) The optimum number of principal components to include in the regression function depends on the parameters n, N, p², and π². (Section 3.4)




3.1 Distribution of the Correlation Statistics When p = 0

It is useful to study populations in which the multiple correlation (p) is zero even though such populations are of little practical significance. Firstly, the distributions of the correlation statistics when p = 0 provide a baseline against which to compare the distributions obtained for non-zero p. Secondly, some properties of the distributions for p = 0 are known theoretically, and a comparison of the distributions obtained from the computer model for p = 0 with the theoretical predictions provides a check of the model and the computer calculations.

Fisher [1928] showed that, if r is the sample multiple correlation,

(3.1.1)    [r² / (1 - r²)] · [(N - n - 1) / n]

is distributed as F(n, N - n - 1) when p = 0. This distribution has the important property that it is independent of the covariance matrix of the predictors, Σ_xx. The distribution of the sample cross-validity, r_c, is the distribution of the sample correlation of two variables whose population correlation is zero. Therefore, from (3.1.1),

(3.1.2)    r_c [(N - 2) / (1 - r_c²)]^{1/2}

is distributed as t(N - 2) when p = 0. This distribution is also independent of Σ_xx. Finally, the population cross-validity, p_c, is exactly zero since σ_xy = 0.
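
The two distributional checks follow directly from (3.1.1) and (3.1.2). The short functions below are an illustrative Python rendering only, with hypothetical names:

    import numpy as np

    def fisher_F(r2, n, N):
        # Eq. (3.1.1): [r^2 / (1 - r^2)] * [(N - n - 1) / n] ~ F(n, N - n - 1) when p = 0
        return r2 / (1.0 - r2) * (N - n - 1) / n

    def cross_validity_t(rc, N):
        # Eq. (3.1.2): r_c * sqrt((N - 2) / (1 - r_c^2)) ~ t(N - 2) when p = 0
        return rc * np.sqrt((N - 2) / (1.0 - rc ** 2))
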

An effective simulation of multiple regression should be able to reproduce these properties. The distributions of r² and r_c² were studied for two different sample sizes, N, and for predictor covariance matrices, Σ_xx, varying in two ways, namely in the values of the parameters π² and n_s. π² is the average of the variances of the part of each predictor dependent on the principal predictor; n_s is the number of columns of the S matrix. Two values of n_s were used: n_s = 10 implies a square S matrix since n = 10, and n_s = 20 implies a non-square S matrix.

The input parameters which were constant for all the calculations in this section are shown in Table 1. Table 2 shows the variable model parameters and sample sizes and also the total number of populations and samples calculated for each model. According to Table 2, four different Σs were generated for each model, and for each Σ generated, three double cross-validations were performed. Since ten duplicate criteria were used throughout (m_d = 10), there were altogether 10 × 4 × 3 × 2 = 240 sample multiple correlations and cross-validities computed for each model. The final factor of two in the preceding expression represents the two samples (C1, C2) which were generated for each double cross-validation.

In order to test whether r² satisfies the Fisher distribution law (3.1.1), it is necessary to have a tabulation of F(10, 39) for Models 1 and 2 and



TABLE 1
Constant Parameters for Section 3.1

n = 10
m_d = 10 (m = 1)
p² = 0.0
v_x = 0.01
e_x = 0.0
v_y, e_y inapplicable since m = 1

F(10, 120) for Models 3 and 4. These distributions were obtained directly, or by interpolation, from Owen [1962]. From (3.1.1), r² is distributed as

(3.1.3)    r² = n F(n, N - n - 1) / [(N - n - 1) + n F(n, N - n - 1)].

The percentage points used (those available in Owen) and the corresponding percentile points of the F and r² distributions are shown in Tables 3 and 4.

The four Σs and associated samples for each model were divided equally into two sets, chosen in the order that they were computed. The cumulative frequency distributions of the 120 sample squared multiple correlations, r², are presented in Tables 3 and 4. Each set of 120 squared multiple correlations

TABLE 2
Variable Parameters for Section 3.1

                                 Model 1   Model 2   Model 3   Model 4
n_s                                 10        20        10        20
π²                                  .2        .2        .5        .5
number of Σs                         4         4         4         4
N                                   50        50       131       131
number of C1, C2 pairs per Σ         3         3         3         3



TABLE 3
Distribution of r² for Models 1 and 2

                                          Cumulative Frequencies
Prob-                                            Model 1         Model 2
ability   F(10, 39)     r²      Expected     Set 1   Set 2    Set 1   Set 2
1.000                 1.000        120        120     120      120     120
 .975       2.401      .381        117        117     116      117     117
 .95        2.086      .348        114        115     114      115     114
 .90        1.769      .312        108        103     107      108     112
 .75        1.329      .254         90         85      80       98      91
 .50         .951      .196         60         50      52       60      59
 .25         .664      .145         30         25      23       30      26
 .10         .469      .107         12         16      12       11       9
 .05         .375      .0878         6          5       6        5       6
 .025        .307      .0729         3          1       3        4       4
 .000        .000      .0000         0          0       0        0       0

MD = maximum absolute difference
     (observed - expected)                      10      10        8       4
Kolmogorov-Smirnov D = MD/120                 .083    .083     .067    .033

TABLE 4
Distribution of r² for Models 3 and 4

                                          Cumulative Frequencies
Prob-                                            Model 3         Model 4
ability   F(10, 120)    r²      Expected     Set 1   Set 2    Set 1   Set 2
1.000                 1.0000       120        120     120      120     120
 .975       2.157      .1524       117        120     115      119     117
 .95        1.911      .1373       114        119     111      113     113
 .90        1.652      .1210       108        114     107      106     109
 .75        1.279      .0963        90         95      96       88      95
 .50         .939      .0726        60         65      69       60      69
 .25         .670      .0529        30         27      39       33      32
 .10         .480      .0385        12          7      18       13      13
 .05         .388      .0313         6          6       9        7       5
 .025        .318      .0259         3          4       2        5       3
 .000        .000      .0000         0          0       0        0       0

MD = maximum absolute difference
     (observed - expected)                       6       9        3       9
Kolmogorov-Smirnov D = MD/120                 .050    .075     .025    .075



originated in six sample covariance matrices for each of two Σs. There were ten duplicate criteria in each sample.

The observed frequency distributions were compared with the expected distributions by the Kolmogorov-Smirnov one sample test [Siegel, 1956]. MD is the maximum absolute difference between observed and expected cumulative frequencies and D = MD/120 is the Kolmogorov-Smirnov statistic. Both values are presented in the tables. The critical D for the one sample, two tailed test is 0.12 (α = 0.05, 120 observations). None of the Ds in Tables 3 and 4 exceeds this value. There is therefore good evidence that the multiple correlations generated by the simulation program satisfy the Fisher law.
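
The Kolmogorov-Smirnov computation used here amounts to the following illustrative sketch (the example numbers are the Model 1, Set 1 column of Table 3):

    import numpy as np

    def kolmogorov_smirnov_D(observed_cum, expected_cum, n_obs=120):
        # MD = largest absolute difference between observed and expected
        # cumulative frequencies; D = MD / (number of observations)
        md = np.max(np.abs(np.asarray(observed_cum) - np.asarray(expected_cum)))
        return md, md / n_obs

    # Example (Table 3, Model 1, Set 1): gives MD = 10 and D = 0.083
    md, D = kolmogorov_smirnov_D([120, 117, 115, 103, 85, 50, 25, 16, 5, 1, 0],
                                 [120, 117, 114, 108, 90, 60, 30, 12, 6, 3, 0])
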

The observed distributions of r_c were compared with the expected distributions. From (3.1.2), r_c is distributed as

(3.1.4)    t(N - 2) / {(N - 2) + [t(N - 2)]²}^{1/2}.

The percentile points of r_c are shown in Tables 5 and 6. In these tables the distributed variable is, for simplicity, r_c, not r_c², in order to emphasize that r_c is distributed about zero. If, consistent with all the other frequency tables, r_c² (with the sign of r_c) were chosen, the distribution would, of course, be

TABLE 5
Distribution of r_c for Models 1 and 2

                                        Cumulative Frequencies
Prob-                                          Model 1         Model 2
ability    t(48)      r_c      Expected    Set 1   Set 2    Set 1   Set 2
1.000                             120       120     120      120     120
 .975      2.012     .2789        117       118     119      116     114
 .95       1.678     .2354        114       109     116      114     111
 .90       1.300     .1855        108        96     112      102     103
 .75        .680     .0977         90        81      86       85      91
 .50        .000     .0000         60        54      55       59      61
 .25       -.680    -.0977         30        28      23       32      30
 .10      -1.300    -.1855         12        15       3       11      16
 .05      -1.678    -.2354          6        11       1        7       8
 .025     -2.012    -.2789          3         7       1        4       5
 .000                               0         0       0        0       0

MD = maximum absolute difference
     (observed - expected)                   12       9        6       5
Kolmogorov-Smirnov D = MD/120              .100    .075     .050    .042



TABLE 6
Distribution of r_c for Models 3 and 4

                                        Cumulative Frequencies
Prob-                                          Model 3         Model 4
ability    t(129)     r_c      Expected    Set 1   Set 2    Set 1   Set 2
1.000                             120       120     120      120     120
 .975      1.979     .1716        117       119     114      118     116
 .95       1.657     .1444        114       116     110      114     114
 .90       1.288     .1127        108       114     101      109     112
 .75        .677     .0595         90       103      82       99      97
 .50        .000     .0000         60        73      49       77      76
 .25       -.677    -.0595         30        37      22       37      35
 .10      -1.288    -.1127         12        14      11       13      16
 .05      -1.657    -.1444          6         5       4        6      10
 .025     -1.979    -.1716          3         2       2        4       5
 .000                               0         0       0        0       0

MD = maximum absolute difference
     (observed - expected)                   13      11       17      16
Kolmogorov-Smirnov D = MD/120              .108    .092     .142    .133

identical. The expected and observed frequency distributions and the Kolmogorov-Smirnov statistics are presented in these tables. The critical D (= 0.12, as before) is exceeded in two cases. This result might be interpreted as failure of the simulation to produce truly independent derivation and validation samples. However, in 15 further models described in Section 3.3, all with p² = 0.0 and varying n and N, none of the 15 Kolmogorov-Smirnov Ds was significant at the .05 level. (In these models only 40 cross-validities were obtained for each Σ.) Therefore it may be concluded that the two significant Ds in the present section are chance results.

The calculations in this section have shown that when the population multiple correlation is zero, the sample statistics obey the theoretically known distributions. The function (3.1.1) of the sample multiple correlation has an F distribution and the function (3.1.2) of the sample cross-validity has a t distribution.

3.2 Dependence of the Correlation Statistics on Σ_xx When p ≠ 0

It was shown in Section 2.2 that the average squared population multiple correlation, p̄², depends only on the D²_kk (equation (2.2.17)) and not on the matrix S. In particular, when there is only one criterion (m = 1), the squared population multiple correlation of the criterion with the predictors is

(3.2.1)    p² = D²_11 .



Since D²_11 is an input parameter, it is therefore straightforward to specify an arbitrary p² for a desired population.

Unfortunately this simple specification of p² is only possible when the matrix S is square. When S is non-square, say with n_s columns, then p² is given by

(3.2.2)    p² = D²_11 s_1'(SS')⁻¹ s_1 ,

which depends on the matrix S. The vector s_1 is the first column of S (equation 2.2.19). Since S is generated to some extent randomly by the population generation procedure, it is impossible to specify by input parameters what the population multiple correlation will be. This is a severe limitation on the model if S is not square.
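
For reference, (3.2.2) as reconstructed above can be evaluated as in the following illustrative sketch (the function name is hypothetical):

    import numpy as np

    def population_p2_nonsquare(D11_sq, S):
        # Eq. (3.2.2): p^2 = D^2_11 * s_1' (S S')^{-1} s_1, with s_1 the first column of S
        s1 = S[:, 0]
        return D11_sq * s1 @ np.linalg.solve(S @ S.T, s1)
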

Because of the ease in specifying p², the models in the remaining sections of this study all employ square S matrices. In order to show, at least to a certain extent, that this does not affect the generality of the conclusions, some experiments are described in this section which compare the correlation statistics of square S and non-square S models. A further comparison is made of two models which differ only in the parameter π².

The input parameters which were constant for all the calculations in this section are shown in Table 7. Table 8 shows the variable parameters and the total number of populations and samples calculated for each of the ten models. Models 5 and 6 are square S models differing only in π². Two populations were generated for each model and three sample pairs for each population. This results in 10 × 2 × 3 × 2 = 120 sample correlations for each model.

Each of the four remaining model pairs differs only in n_s; one model of each pair has square S and the other model has non-square S. The squared population multiple correlation, p², is identical for the models in each pair. The identity holds to six or seven decimal places even though only three are

TABLE 7
Constant Parameters for Section 3.2

n = 5
m_d = 10 (m = 1)
v_x = 0.01
e_x = 0.00
v_y, e_y inapplicable since m = 1
N = 40



TABLE 8
Variable Parameters for Section 3.2

                          Model
                 5     6     7     8     9    10    11    12    13    14
n_s              5     5     5    10     5    10     5    10     5    10
p²              .5    .5   .234  .234  .311  .311  .438  .438  .495  .495
π²              .2    .5    .2    .2    .2    .2    .5    .5    .5    .5
number of Σs     2     2     1     1     1     1     1     1     1     1
number of C1, C2
  pairs per Σ    3     3     6     6     6     6     6     6     6     6

shown in Table 8. The identity was produced in the following way. As explained above, p² is not an input parameter. The input D²_11 is equal to p² only for square S models. The non-square S Models 8, 10, 12, and 14 were generated using D²_11 = 0.5. The population squared multiple correlation, p², was calculated for each model by (3.2.2). These are the values in Table 8. These computed values were used as the input D²_11 for the square S Models 7, 9, 11, and 13. Since only one Σ could be generated for each set of parameters, six C1, C2 pairs were generated instead of three. This means that 10 × 1 × 6 × 2 = 120 sample correlations were generated, the same number as for Models 5 and 6.

The frequency distributions of the 120 correlations for each model are presented in Tables 9 (r²), 10 (r_c²), and 11 (p_c²). Recall, from Section 2.8, that a negative r_c² results from a negative r_c. The maximum absolute difference, MD, and the Kolmogorov-Smirnov statistic D are also presented for each pair of models. The critical D for the two sample test, two tailed, is 0.18 (α = 0.05, 120 observations). None of the sample values exceeds this value although two approach it. It can be safely stated that for these examples the variation in Σ_xx has not produced differences in the observed distributions of the correlation statistics.

In the simulation model, the population multiple correlation is independent of π², the average predictor variance related to the criteria. The comparison of Models 5 and 6 confirmed that the sample statistics are independent of π². Even though the population multiple correlation does depend on the matrix S if it is not square, it was shown that the correlation statistics are not affected by the matrix S for the models considered in this section. All further calculations in this study employ square S models only.

3.3 Dependence of the Correlation Statistics on n, N, and p²

The cross-validation technique was developed as a way to correct the sample multiple correlation [Mosier, 1951]. The purpose of the calculations



TABLE 9

Cumulative Frequency Distribution of r² for Models 5 to 14

Model

r2 5 6 7 8 9 l0 ll 12 13

.85 120 120

.80 120 120 120 119 ll9¯ 75 ll9 ll8 ll9 ll9.70 108 lll 120 120 115 119 113.65 101 93 120 119 119 102 114 9~.60 76 68 119 120 116 ll5 90 97 68.55 4~ 48 118 ll7 lll Iii 73 73 49.50 31 29 115 ll3 99 96 48 50 27.45 18 17 104 105 69 77 30 30 13.40 8 4 89 92 53 60 13 20 5.35 3 1 74 85 28 41 4 12 4.30 1 0 45 66 15 28 2 7 1.25 0 26 46 i0 21 i 4 !.20 II 21 4 lO 0 1 0.15 9 12 I 6.I0 2 3 0 1.05 0 1 0.00 0

14

MD 8 21 13 12 6

D .067 .175 .109 .I00 .050

120119115109987151291910

310

described in this section was to investigate empirically the relationship ofthe sample multiple correlation and the cross-validities to the populationmultiple correlation.

The parameters which were constant for all the calculations in this section are shown in Table 12. Table 13 indicates the values of the three parameters which were varied. All possible combinations of squared multiple correlations, p², number of predictors, n, and sample size, N, were used except for those combinations with (n, N) = (15, 16). A different model was generated for each combination of (p², n, N) so that the models for combinations differing only in N are different models.

In all cases 40 sample correlations were obtained. The means of the 40 correlations of each type (r², r_c², and p_c²) are shown in Tables 14 to 17. The expected values E(r²) as calculated from (1.2.10) are also shown in these tables. The standard errors of the correlation means range from 0.01 to 0.02 for r² and r_c² and somewhat less for p_c². However, this standard error does



TABLE 10
Cumulative Frequency Distribution of r_c² for Models 5 to 14

Model

9 l0 ii 12 13 14

¯ ?5 120 120 120.70 120 I19 i19 117.65 i19 ll5 120 120 120 I15 ll5.60 ll2 105 119 112 114 l!G 106¯ 55 102 83 120 119 120 99 108 91 93.50 72 68 120 119 116 118 85 94 66 71.45 55 56 i18 ll8 ll3 ll0 74 82 51 47.40 38 29 ll2 ll3 104 100 59 61 26 32¯ 35 2? 18 106. IIi 88 86 43 42 13 24.30 14 I0 99 102 64 68 25 26 8 15.25 8 4 88 98 44 53 13 17 4 8.20 2 2 66 81 22 35 7 7 2 3.15 0 0 54 52 12 25 5 3 0 0.lO 31 36 7 18 3 0.05 15 16 3 6 0.00 2 1 0 2

-.o5 o 1 0-.I0 0

MD 19 15 13 9 II

D .158 .125 .109 .075 .092

not take into account variation which would be produced by another population generated from the same parameters. Table 13 indicates that only one Σ was generated for each parameter combination. Nevertheless the tables give a useful picture of the dependence of the three correlation statistics on p², N, and n.

The following observations may be made from the tables:

(a) The squared sample multiple correlation, r², is an overestimate of p². The expected values of r², E(r²), from (1.2.10) match the observed r² values very well. The first two terms of the expansion (1.2.11) may be rearranged to show that the bias in E(r²) is a simple function of n, N, and p²:

(3.3.1)    E(r²) - p² = [n / (N - 1)] (1 - p²).

The match of E(r²) and r² shows that formulas (1.2.13) and (1.2.7),



which are essentially backward solutions of (1.2.10), provide reasonable estimates of the squared population multiple correlation.

(b) The squared sample cross-validity, r_c², is generally an underestimate of p² and this bias (except for p² = 0.0) tends to be approximately the same for all values of p² for fixed (n, N). As with r², the bias of the cross-validity decreases with N and increases with n for fixed p².

(c) The squared population cross-validity, p_c², is the squared cross-validity using the derived weights on a validation sample of infinite size. The tables show that p_c² has similar values to r_c² and the same comments apply to p_c² as were applied to r_c² above. Looked at another way, r_c² is an unbiased estimate of p_c².

TABLE 11
Cumulative Frequency Distribution of p_c² for Models 5 to 14

Model

5 6 7 8 9 i0 ii 12 13 14

.500 120 120

.475 82 93

.450 53 56 120

.425 31 33 113.400 18 16 78¯ 375 13 7 "~7.350 8 3 28.325 6 2 120 120 14.300 3 0 108 114 9.275 1 82 81 4.250 I 120 120 57 52 1.225 1 118 113 24 32 1.200 ! 88 91 14 14 I.175 1 65 65 5 10 0.150 1 37 45 2 5.125 0 18 32 2 3.lOO 9 15 1 3.o75 4 i0 0.050 3 4 1.025 2 4 1.000 0 1 0

-.025 0

MD ll 14 8

D .092 .117 .067

120107713721Ii6310

10

.083

12094592918

94220

120996126

6310

12

.I00



TABLE 12
Constant Parameters for Section 3.3

m = 1
n_s = n (square S matrix)
π² = 0.5
v_x = 0.01
e_x = 0.0
v_y, e_y inapplicable since m = 1

The tables confirm the known properties of multiple regression and cross-validation as summarized in (1.2.14). The sample multiple correlation is an overestimate of the population value since the sample weights are chosen to optimize the correlation in the derivation sample. These weights are not the optimum weights in either the population or another sample and the consequence is that both p_c² and r_c² are biased low. With repeated samplings the values of r_c² cluster around p_c² since some weights are better for the validation sample than the population (r_c² > p_c²) and other weights are worse (r_c² < p_c²).

These tables can, perhaps, be useful in estimating the population p² from sample r² and r_c² values obtained from real data. If the values of (n, N)

TABLE 13
Variable Parameters for Section 3.3

n  = 2, 5, 10, 15
p² = 0.0, 0.1, 0.25, 0.5, 0.75
N  = 16, 26, 50, 131

                                    models with (n, N) =
                                 (15, 16)   (10, 16)   all others
m_d                              not done       5          10
number of Σs                                    1           1
number of C1, C2 pairs per Σ                    4           2



correspond to one of the tables and if r² and r_c² match the values in the tables for a value of p², then this value of p² is the estimate of the population multiple correlation.
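
A numerical estimate in the same spirit can be obtained by inverting the bias relation (3.3.1); the sketch below is illustrative only (a Wherry-type shrunken value, not a procedure given in the monograph):

    def estimate_p2_from_r2(r2, n, N):
        # Invert E(r^2) - p^2 = [n / (N - 1)] (1 - p^2), eq. (3.3.1),
        # to obtain a shrunken estimate of p^2 from an observed r^2.
        k = n / (N - 1.0)
        return (r2 - k) / (1.0 - k)

    # Example: with n = 5, N = 16 and r2 = 0.669 (Table 14) this gives about 0.50,
    # in line with the table entry generated under p^2 = 0.50.
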

3.4 Prediction from the Principal Components

Burket [1964] showed that cross-validities can be increased by using only a few principal components of the predictors in the prediction function. The

TABLE 14
Correlation Statistics for Sample Size N = 16

  p²           .00     .10     .25     .50     .75

n = 2
  E(r²)       .133    .211    .331    .541    .763
  r²          .160    .189    .329    .593    .805
  r_c²        .029    .121    .217    .534    .763
  p_c²        .000    .056    .191    .468    .731

n = 5
  E(r²)       .333    .393    .485    .647    .818
  r²          .307    .409    .450    .669    .866
  r_c²        .014    .086    .104    .372    .691
  p_c²        .000    .027    .107    .351    .648

n = 10
  E(r²)       .667    .696    .743    .823    .909
  r²          .680    .666    .744    .821    .908
  r_c²       -.035    .003    .047    .300    .523
  p_c²        .000    .009    .041    .217    .489


[Tables 15 to 17, which give the corresponding correlation statistics for the remaining sample sizes (N = 26, 50, and 131), are not legible in this copy.]



TABLE 18
Constant Parameters for Section 3.4

m_d = 10 (m = 1)
n_s = n (square S matrix)
v_x = 0.01
e_x = 0.0
v_y, e_y inapplicable since m = 1
number of Σs per model = 1
number of C1, C2 pairs per Σ = 1

formulas for such prediction were presented in Section 2.6. The calculations described in the present section demonstrate the improvement of prediction by using the largest principal components in the simulation data and show how variation in some parameters can change the effect.

Sixty models were generated varying in four parameters: n, the number of predictors; p², the squared multiple correlation; π², the average criterion-related predictor variance; and N, the sample size. The constant parameters are listed in Table 18. All combinations of the variable parameters listed in Table 19 were used. A new model was generated for each (n, p², π², N) combination so that here, as in Section 3.3, combinations differing only in N are different models.

For each simulation model two sample covariance matrices, C1 and C2, were generated, both with the same sample size N. The covariance matrix of the first sample predictors, (C1)_xx, was diagonalized and the weights for predicting each of the ten duplicate criteria from the largest principal component were calculated. These weights were validated on C2, and the cross-validities r_c^(1) were calculated for each criterion, the superscript (1) indicating that one component was included in the regression. The weights were then recomputed for prediction from the two largest principal components,

TABLE 19
Variable Parameters for Section 3.4

n  = 5, 10
p² = 0.25, 0.50, 0.75
π² = 0.20, 0.35, 0.50, 0.65, 0.80
N  = 20, 100 for n = 5  and  N = 25, 105 for n = 10



resulting in cross-validities r_c^(2). This procedure was continued until all n components had been included in the regression. In general, the t largest principal components resulted in ten cross-validities r_c^(t) for each t (t = 1, ..., n). The average of the squares of the ten cross-validities for each t was calculated. The largest of these averages is called r_c²(max) and occurs for t = t_max. The symbol t_max represents the number of components producing the largest average squared cross-validity. (When r_c^(t) was negative, a negative sign was affixed to its square before the averages were calculated.)
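
The procedure just described can be sketched as follows; this is an illustrative Python rendering, not the original simulation program, and it assumes the covariance matrices are partitioned with the n predictors first and the m_d duplicate criteria last.

    import numpy as np

    def pc_cross_validities(C1, C2, n):
        # C1, C2: (n + m_d) x (n + m_d) covariance matrices (derivation, validation)
        Cxx1, Cxy1 = C1[:n, :n], C1[:n, n:]
        Cxx2, Cxy2, Cyy2 = C2[:n, :n], C2[:n, n:], C2[n:, n:]
        evals, V = np.linalg.eigh(Cxx1)
        order = np.argsort(evals)[::-1]              # components in descending order
        evals, V = evals[order], V[:, order]
        avg_rc2 = []
        for t in range(1, n + 1):
            Vt, lt = V[:, :t], evals[:t]
            # regression on the t largest components, expressed as weights on x
            B = Vt @ np.diag(1.0 / lt) @ Vt.T @ Cxy1
            num = np.einsum('ij,ij->j', B, Cxy2)                       # b_j' c_xy (validation)
            den = np.einsum('ij,ik,kj->j', B, Cxx2, B) * np.diag(Cyy2)  # (b_j' C_xx b_j) c_y^2
            rc2 = np.sign(num) * num ** 2 / den      # signed squared cross-validities, cf. (2.8.1)
            avg_rc2.append(rc2.mean())
        return np.array(avg_rc2)                     # average r_c^2 for t = 1, ..., n
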

The above procedure was repeated for validating the principal components of (C2)_xx on C1. The averages of the squares of the ten cross-validities for each t as well as r_c²(max) and t_max were again calculated for each model.

In Section 3.3 it was shown that the squared cross-validity r_c² (when all variables or principal components are included in the regression) underestimates the squared population multiple correlation p². The result was confirmed with the 60 new models since only 14 out of the 120 values of [r_c^(n)]² exceeded p². Recall that r_c^(n) is the same as r_c. The r_c²(max) were less biased, as 30 out of 120 values exceeded p². Thus r_c²(max) is still an underestimate of the squared population multiple correlation.

Averaging the squared correlations before calculating t_max has the disadvantage of reducing r_c²(max) from what it would be if the maximum r_c² was found for each criterion and then these maxima averaged. These maxima will occur for different t for the different criteria, in general. In several cases which were examined, however, it was found that most maxima occurred for the same t. Since some averaging had to be done to comprehend the results, the method previously described was used for simplicity.

The results from these experiments will be displayed in two ways: first, by t_max (Table 20), and, second, by the average squared cross-validity (Figures 1 to 10). Table 20 shows how t_max varies as a function of p², π², and N for each of the two values of n. For example, the first section of Table 20 shows that, for all ten models with p² = 0.25 and n = 5 (two values of t_max per model), t_max = 1 occurred five times, t_max = 2 occurred four times, etc. This section of the table shows that t_max tended to increase as p² increased. This effect is stronger when n = 10. On the other hand, the effect of increasing π² is to decrease the value of t_max. Hence for larger values of π², a smaller number of principal components produce maximum cross-validities. This is true for both values of n. Finally, there is a tendency for larger values of N to lead to larger values of t_max. As N increases, more principal components are needed to maximize the cross-validity.

Figures 1 to 6 illustrate these results in a different way. In these figures, the average squared cross-validity is shown as a function of t, the number of principal components included in the regression. Each point in Figures 1 and 4 is the average of 12 cross-validities; each point in Figures 2 and 5 is the average of 20 cross-validities; and each point in Figures 3 and 6 is the



TABLE 20
Frequencies of t_max for Cross-Validities r_c²

n = 5                              t_max
                       1      2      3      4      5
      .25              5      4      1      2      8
p²    .50              6      1      0      3     10
      .75              3      1      4      0     12

      .20              0      0      0      0     12
      .35              0      2      2      2      6
π²    .50              3      1      0      0      8
      .65              4      1      3      2      2
      .80              7      2      0      1      2

N      20             10      4      3      4      9
      100              4      2      2      1     21

n = 10                             t_max
                       1      2      3      4      5      6      7      8      9     10
      .25             16      1      0      0      0      0      0      0      1      2
p²    .50              9      1      2      0      0      0      0      3      2      3
      .75              3      2      1      0      0      1      2      0      2      9

      .20              2      0      1      0      0      0      0      2      2      5
      .35              3      1      0      0      0      0      0      1      3      4
π²    .50              8      0      0      0      0      1      0      0      0      3
      .65              8      2      0      0      0      0      0      0      0      2
      .80              7      1      2      0      0      0      2      0      0      0

N      25             17      3      3      0      0      1      0      3      2      1
      105             11      1      0      0      0      0      2      0      3     13

The cell entry is the frequency with which the value of t_max occurred for models with the row value of the parameter.

FIGURE 1 (Left). Prediction from principal components, π² varying (n = 5).
FIGURE 2 (Right). Prediction from principal components, p² varying (n = 5).



FIGURE 3 (Left). Prediction from principal components, N varying (n = 5).
FIGURE 4 (Right). Prediction from principal components, π² varying (n = 10).
FIGURE 5 (Left). Prediction from principal components, p² varying (n = 10).
FIGURE 6 (Right). Prediction from principal components, N varying (n = 10).



FIGURE 7 (Left). Prediction from principal components, π² varying (n = 5, N = 20).
FIGURE 8 (Right). Prediction from principal components, π² varying (n = 5, N = 100).

average of 30 cross-validities. (The results for π² = 0.65 have not been included in Figures 1 and 4 in order to avoid crowding the figures. In Figure 1, the π² = 0.65 curve would fall between the curves for π² = 0.5 and π² = 0.8. In Figure 4, the π² = 0.65 curve would fall slightly below the π² = 0.5 curve.)

Each figure shows that there is an interaction between the number of principal components (t) and the variable parameter (π², p², or N). This accounts for the results of Table 20. For example, when π² is small, an increase in t produces an increase in the average r_c². Hence for small π², the number of principal components producing maximum cross-validity is n or almost n. However, for large π², the average cross-validities are constant or slightly decreasing as a function of t. Therefore there is a tendency for one component, or at most a few components, to produce maximum cross-validity. In a similar way, the effects of p² and N can be compared in the figures and Table 20. The interaction of p² and t is quite small and will not be further discussed.

The present results may be compared with a previous major study of reduced rank prediction [Burket, 1964]. In Burket's Table 1, the prediction method 3 is prediction from the principal components of 29 predictors. In Burket's table it is clear that when the sample size is large (N = 255, 210, 165), the cross-validities decrease only very slowly as the number of principal components increases, while when the sample size is small (N = 75, 30), the cross-validities decrease significantly with t. While it is impossible to make a perfect comparison between Burket's results and the simulations, some similarity can be shown by considering Figures 7 to 10. These figures show



the simulation results for the two sample sizes separately instead of averaged together as in Figures 1 and 4. Now the experiments to be described later (Section 4.2) show that real data have large values of π² (approximately 0.8). It is reasonable to assume that Burket's data also have a large value of π² since the predictors and criteria in both studies are similar (Burket: predictors--Edwards Personal Preference Schedule, High School Grade-Point Averages, Test Scores, Age, Sex; criteria--Grade-Point Averages in Various College Course Areas; Herzberg: predictors--Sequential Tests of Educational Progress, School and College Ability Tests, Tests of General Interest; criteria--Scholastic Aptitude Tests, College Entrance Examination Board, Rank in High School Class). In Figures 7 to 10 it is seen that, for π² = 0.8, r_c² decreases only slightly when N is large but decreases much more rapidly when N is small, particularly when n = 10 (Figure 9). This corresponds to the results in Burket's Table 1.

Both Burket's study and the present one confirm that reduced rank prediction is of maximum benefit when the sample size is small. The present study also shows the great effect that the parameter π² has on prediction from the principal components. When π², the average criterion-related predictor variance, is small, r_c² increases rapidly with the number of principal components. However, when π² is large, r_c² is constant or slightly decreasing as t increases. This result is easily understood, for, when π² is increased, one linear combination of the predictors, namely w1, the first principal predictor, is increased in variance. The result is that the largest principal component



of the predictors becomes increasingly collinear with w1 as π² increases. This means that a single principal component can produce better prediction than several components. Hence r_c² decreases slightly as t increases and therefore t_max, the value of t giving maximum r_c², tends to be near 1 as π² increases. (Note again that the results for π² = 0.65 have not been included in these figures. In Figures 7 and 8, the π² = 0.65 curve would almost coincide with the π² = 0.8 curve. In Figures 9 and 10 the π² = 0.65 curve would fall slightly below the π² = 0.5 curve.)


CHAPTER 4

STUDIES WITH SEVERAL CRITERIA

The generation of populations with several criterion variables permits the principal predictors to be used in multiple regression. In the first section of this chapter, simulation experiments with several criteria are described. The two reduced rank methods of prediction (multiple regression on the principal components and on the principal predictors) are compared. In the next section, some studies are described using real data from high school students. Finally, in the last section, an attempt is made to simulate the real data with the computer program.

4.1 Simulation Results

The purpose of this section is to compare the principal component and principal predictor methods of prediction in a number of models which differ in the distribution of (q_k², k = 1, ..., m) and π². The importance of the parameter π², the average criterion-related predictor variance, in prediction from the principal components of the predictors, has already been shown in the single criterion case (Section 3.4). When m = 1, there is only one q_k² and it is directly related to π² (q_1² = nπ²). However, when there are several criteria, there are several q_k², each q_k² representing the total dependence of the predictors on the kth principal predictor. The sum of the q_k² is equal to nπ².

Similarly, the (D²_kk, k = 1, ..., m) are the total dependence of the criteria on each principal predictor. Since the D²_kk are eigenvalues (of Σ_yx Σ_xx⁻¹ Σ_xy), they are always considered in descending order. In this section, the D²_kk distribution was kept constant. By varying the order of the q_k² it is possible to vary the relative dependence of the predictors on the principal predictors. This variation will produce differences in the effectiveness of the two methods of prediction.

Eighteen models were studied. The constant input parameters are shown in Table 21 and the variable parameters in Table 22. For each value of π², three different q_k² distributions were used, called decreasing, level, and increasing. The same distributions, except for scaling by π², were used for all three values of π². Two models, differing only in sample size, were generated for each combination of π² and q_k² distribution. A pair of samples of size 20 (small N) was generated from one model and a pair of samples of size 75 (large N) was generated from the second model.



TABLE 21
Constant Parameters for Section 4.1

n = 10
m = 5
n_s = 10
p² = 0.6
D²_11 = 1.2, D²_22 = 0.8, D²_33 = 0.6, D²_44 = 0.3, D²_55 = 0.1
v_x = 0.01
e_x = 0.005
v_y = 0.01
e_y = 0.005
number of Σs per model = 1
number of C1, C2 pairs per Σ = 1

In the decreasing q_k² distribution, half the dependence of the predictors on the principal predictors is dependence on the first principal predictor. When π² = 0.8, this means that 40% of the total variance of the predictors is linearly dependent on the first principal predictor. The contribution of the succeeding principal predictors is progressively smaller. When π² = 0.2 and the q_k² are again decreasing, the first principal predictor accounts for only 10% of the predictor variance and the other principal predictors account for less, with a total of 20% of the variance of the predictors explained by the principal predictors.

In the level q_k² distribution, each principal predictor contributes equally to the predictors, the contribution of each varying from 16% when π² = 0.8 to 4% when π² = 0.2.

When the (q_k², k = 1, ..., m) are increasing, the dependence is exactly reversed from the decreasing case. Most of the dependence of the predictors on the principal predictors is dependence on the fifth (last) principal predictor. The dependence on the first principal predictor is very small.

One change was made in the generation of samples for the calculations in this chapter. The sample covariance matrices, C1 and C2, were changed to correlation matrices in order to make possible the calculation of a sample π² and sample (q_k², k = 1, ..., m).

The calculations performed on the sample correlation matrices were similar to the calculations in Section 3.4 except that two methods of



prediction were compared and the criteria were no longer duplicate criteria. Multiple regression on the principal components of the predictors was done first. The correlation matrix of the predictors, (C1)_xx, was diagonalized and the weights for predicting each of the five criteria from the largest component were calculated. The cross-validities for each criterion in the second sample were calculated. The average, over criteria, of the squares of these validities was calculated and is presented in Tables 23 to 28 in the "r_c²" columns under "P. C." for t = 1 (one component). The weights were then recomputed for the two largest components (t = 2) and the average squared cross-validity calculated. This was repeated for t = 3, ..., 10. The whole procedure was repeated again for validating the weights derived in C2 on C1, but these results are not reproduced in the tables as they are very similar to the validation of C1 on C2.

Prediction from the principal predictors was then performed following the method given in Section 2.7. The weights in (2.7.11) were calculated in the first sample for each value of t (t = 1, ..., 5) and the weights were cross-validated in the second sample. The average of the cross-validities for each value of t was calculated and is given in Tables 23 to 28 in the "r_c²" columns under "P. P." for each t. The validation of sample 2 on sample 1 is not reported here.

In the derivation sample the following sample statistics were calculated: sample (q_k², k = 1, ..., 5), sample π², sample (D²_kk, k = 1, ..., 5), and sample p². For the calculation of the sample (q_k², k = 1, ..., 5) and the sample π², see (2.7.15) and (2.7.16). The D²_kk are the eigenvalues of C_yx C_xx⁻¹ C_xy and the sample p² is r̄², the average squared multiple correlation using all predictors. The sample p² is the average of the (D²_kk, k = 1, ..., 5).

TABLE 22
Variable Parameters for Section 4.1

π² = 0.2, 0.5, 0.8
q_k² distribution: decreasing, level, increasing (as shown below)
N = 20, 75

            decreasing      level       increasing
q_1²        5.0 π²          2.0 π²      0.625 π²
q_2²        2.5 π²          2.0 π²      0.625 π²
q_3²        1.25 π²         2.0 π²      1.25 π²
q_4²        0.625 π²        2.0 π²      2.5 π²
q_5²        0.625 π²        2.0 π²      5.0 π²


TABLE 26
Several Criteria (N = 75), q_k² Distribution Decreasing

~ = .2 w2 = .5 ~2 = .8

P,C. P.P. Sample P.O. P.P. Sample P.C.P.P. Sampletork r2 2

D~k q~r2 r2 2 q~ r2 r2

o rc c c Dkk c C D~k q~.19 .27 1.2 2.1i -.00 .15 1.5 I.i

2 .07 .21 .9 .73 .07 .27 .6 .34 .07 .32 .2 .35 .15 .33 .i 1.46 .267 .298 -379 .44

i0 .33

Sample p2 .65

Sample ~2 .38

.26 .42 .9 1.7

.27 .53 .6 .5

.37 .54 -3 .3

.40 .56 .I .8

.52

.54

.56

.63

.54

.22 .23 1.3 4.1

.38 .34 .6 1.7

.46 .44 .5 1.4

.48 .45 .4 .6

.51 .45 .1 .7

.50

.51

.53

.53

.45

.58

.85

TABLE 27
Several Criteria (N = 75), q_k² Distribution Level

t opk

~2 = .2

P,C. P.P. =~ample

r2 2 2c rc Dkk qk2

1 .02 .12 1.3 .52 -.02 .263 -.O1 .32 .7 .74 .03 .37 .4 .45 .07 .38 .1 .56 .127 .228 .349 .33

10 .38

Sample 02 .69

Sample ~2

~r2 = .5

P.C. P.P. Sample

r c r e Dk2k qk2

=2 = .8

P.C. P.P. Sample2 2 D2kk

qk

.0o .21 1.3 1.0.06 .33 .7 .7.10 .41 .6 1.1.11 .46 .3 .9.22 .48 .2 .9.34.40.43.48.48

.26 .46

TABLE 28
Several Criteria (N = 75), q_k² Distribution Increasing

~2 =

P.C. P.P. Sample

2r~ r2c D~k qk

-.0O .26 1.4.04 .33 .8 .5

.09 .21 1.3 1.7

.19 .33 1.0 1.7

.33 .39 .6 1.9

.46 .45 .2 1.2

.49 .45 .Z 1.5

.52

.52

.52

.45

.64

.82

~2 = .2

P.O. P.P. Samplet or k

2r2c r2c Dk2k qk

I -.01 .17 1.2 .22 .o0 .~o 1.0 .73 .01 .27 .5 .84 .01 .34 .4 1.15 .05 .36 .£ .76 .087 .138 .269 .39

I0 .36

Sample 02 .65

Sample ~2¯ 35

.08 .43 .7 .7

.i0 .46 .3 1.7

.16 .48 .i 2.2

.21.28.41.46.48

.67

.56

~2 = .8

P.C. P.P. Sample

r2c r2c Dk2k qk2

.02 .23 1.6 .8

.06 .39 1.0 .8

.17 .46 .6 1.3

.22 .51 .2 1.4

.47 .51 .1 2.6

.49

.53

.55

.53

.51

.70

.68



Consider first N = 75 (Tables 26 to 28). When π² = 0.2, the first few principal predictors are far superior to the first few principal components in average cross-validity. This is true whether the q_k² distribution is decreasing, level, or increasing. The reason is that the principal predictors account for only 20% of the predictor variance when π² = 0.2. The largest principal components therefore reflect variance mostly independent of the principal predictors and therefore the largest principal components are not good predictors. The principal predictors, though of small variance, are good predictors and multiple regression on them cross-validates well.

The situation changes, however, as π² increases to 0.5 and 0.8. When π² = 0.8 and the q_k² distribution is decreasing or level (not increasing), there is practically no difference between the two prediction methods in the average cross-validity for t = 1, ..., 5. When π² = 0.8, the five principal predictors account for a total of 80% of the predictor variance and therefore the principal components of the predictors are very similar to the principal predictors. Therefore the principal components and principal predictors cross-validate equally well.

An exception occurs for the increasing q_k² distribution when π² = 0.8 (Table 28). Here the principal predictors cross-validate much better than the first few principal components. The reason is that the first principal predictor (largest D²_kk and hence best predictor) is the smallest principal predictor in terms of associated predictor variance (q_1² is the smallest q_k²). The principal predictor method of regression properly picks this first principal predictor as the best predictor. The principal component method of prediction, however, chooses the principal predictors in reverse order, since the principal predictor with largest predictor variance (40%) is the fifth principal predictor and the fourth principal predictor has the next largest variance (20%), etc. Note that there is a large increase in average cross-validity between t = 4 and t = 5 principal components in this case. The fifth principal component is approximately collinear with the first principal predictor which is the best predictor. On the other hand, when π² = 0.2, the large increase in average cross-validity using principal components does not occur until t = 8 or 9.

The results when π² = 0.5 are intermediate between those for π² = 0.2 and π² = 0.8. Furthermore, the results for N = 20 (Tables 23 to 25) are similar to the N = 75 results just described except that the effects are not as clear due to instability of the weights with small N. An interesting effect is shown in two cases, however (Table 23, π² = 0.5 and Table 25, π² = 0.2). This effect is not dependent on sample size and could have occurred for N = 75. In both cases the average cross-validity when all factors are included in the regression is essentially zero, since the predictors are dependent. The population Σ_xx which was generated in these two cases was almost singular. This was shown by the difficulty in inverting it. Every time a



matrix is inverted in the simulation program, the result is checked by multiplying the inverse by the original matrix and comparing the result with the identity matrix. The largest difference, in absolute value, between corresponding elements in the two matrices is printed as WINV. Most generated Σ_xx matrices yield WINV = 0.00001 or less. In the two cases mentioned above, WINV = 0.003 and 0.0001, respectively, indicating approximate dependence of the predictors in the population. This dependence appears in the generated samples as well.
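
The inversion check described here amounts to the following small computation (an illustrative sketch; the function name is hypothetical):

    import numpy as np

    def winv(A, A_inv):
        # Largest absolute difference between A_inv @ A and the identity matrix,
        # i.e., the quantity printed as WINV by the simulation program.
        return np.max(np.abs(A_inv @ A - np.eye(A.shape[0])))
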

In the two singular cases, prediction from the principal components, for t < 10, is successful. However, for all values of t, the cross-validities using the principal predictors as predictors are practically zero. The t largest principal components (t < 10) are independent and their weights cross-validate well. This is an advantage of prediction from the principal components of the predictors: the effect of dependence of the predictors can be eliminated. However, the weights on the principal predictors are not stable in the singular case, regardless of the number of principal predictors included in the regression.

Even though variation of the population parameters has an appreciable effect on prediction by the two reduced rank methods, for practical application of these results it would be necessary to determine from samples what the population parameters are. Can these parameters be estimated? The sample (q_k², k = 1, ..., 5) and π² are estimates of the corresponding population parameters and are shown in Tables 23 to 28. In general the sample q_k² distribution is similar to the population distribution. This is shown most clearly for large sample size (N = 75). When the population q_k² distribution is decreasing, the largest sample q_k² generally occurs for k = 1. When the q_k² distribution is level, the sample values, q_k², are approximately equal. When the q_k² distribution is increasing, the largest sample q_k² normally occurs for k = 5. It is therefore possible to decide, on the basis of the (q_k², k = 1, ..., m) in the derivation sample, whether the principal predictors cross-validate better than the principal components or whether there will be little difference between the two methods.

As an additional aid in making this determination, it is important to estimate π². This may be done from the value of π² computed in the derivation sample. It can be seen from the tables that the sample π² is a rough measure of the population value; there is a tendency for the sample value to shift nearer 0.5 than the population value.

Since the population (D²_kk, k = 1, ..., 5) were not varied in these simulation studies, the sample values, D²_kk, shown in the tables are relatively constant from sample to sample. These parameters will be discussed further in the next section.

This section has shown that prediction from the principal predictors is an effective method of prediction, particularly when the q_k² distribution is



increasing or π² is small. In other cases prediction from the principal components is almost as successful as prediction from the principal predictors. The only case in which prediction from the principal components is superior to prediction from the principal predictors is when the predictors are dependent.

4.2 Study of Real Data

The calculations described in the preceding section were also performed on some samples of real data. The data were collected in 1961 by the Educational Testing Service, Princeton, N. J., from 1205 boys in academic high schools. The 21 variables employed in the multiple regression calculations are listed in Table 29. There are two sets of eight predictors each and one set of five criterion variables. The first set of predictors, called the S-predictors, consists of six variables from the Sequential Tests of Educational Progress (STEP) and two variables from the School and College Ability Tests (SCAT). The second set of predictors, called the T-predictors, consists of eight variables from the Tests of General Interest (TGI). The criterion variables are two variables from the Scholastic Aptitude Test (SAT), two variables from the College Entrance Examination Board (CEEB), and the rank in the high school class.

Eight samples were drawn at random from the pool of 1205 subjects. Four of the samples were of size N = 20 (Samples 1, 2, 3, 4) and the other four samples were of size N = 75 (Samples 5, 6, 7, 8). Four double cross-validations were performed (two for each sample size) using the S-predictors. Then the same samples were used in four double cross-validations using the T-predictors.

Tables 30 (N = 20) and 31 (N = 75) are a summary of the calculations made on these samples; the calculations were the same as those made on the simulation samples in Section 4.1. Correlation matrices were used. Again, only the validation of C1 on C2 is reported.

In most of the samples, the principal predictors (P. P.) validate more poorly than the principal components (P. C.), and in the two cases where the first principal predictor validates better than the first principal component, the improvement is not great. Another feature of these data is the nearly constant average cross-validity of the principal predictors for t = 1, ..., 5. Even though, in some cases, a few principal components are far superior to including all predictors in the regression, in no case is the first principal predictor significantly better than all predictors.

These findings can be understood by considering the estimates of the parameters p², π², (D²_kk, k = 1, ..., 5), and (q_k², k = 1, ..., 5). In all the derivation samples, the sample π² is at least 0.87 for the S-predictors and at least 0.78 for the T-predictors. Therefore these samples correspond approximately to the π² = 0.8 cases of Section 4.1. Furthermore the sample q_k²



TABLE 29
E. T. S. Variables

S-predictors
  STEP Mathematics
  STEP Science
  STEP Social Studies
  STEP Reading
  STEP Listening
  STEP Writing
  SCAT Verbal
  SCAT Quantitative

T-predictors
  TGI Industrial Arts
  TGI Home Arts
  TGI Physical Education
  TGI Biological Science
  TGI Music and Art
  TGI History-Literature
  TGI Entertainment
  TGI Public Affairs

Criteria
  SAT Verbal
  SAT Mathematical
  CEEB English Composition
  CEEB American History
  Rank in High School Class

distributions are in all cases but one heavily weighted on q_1², thus indicating an extremely decreasing q_k² distribution. Returning to the corresponding simulation examples in Section 4.1, it is seen that the ETS results do not differ greatly from the last columns of Tables 23 (N = 20) and 26 (N = 75).

The failure of the principal predictors in the ETS samples can be further explained by the sample D²_kk distribution. In all cases, D²_11 is at



TABLE 30

E. T. S. Samples (N = 20)

t or k

Sample i validatedon Sample 2

P.C. P.~. ~ Sample

r2 r2 r, 2 2c c ~kk qk

Sample 3 validatedon Sample 4

P.C. P.P. Sample

2

r2

2 2rc c Dkk qk

1 552 553 564 535 526 527 518 44

2Sample ~

Sample w2

S-predictors

.41 3.15 4.68.42 .4o .83.46 .24 .59.44 .!I .41.4~ .o2 .46

.78

.87

.47

.47

.42

.44

.45

.44.40¯ 37

.42 3.93 5.83

.41 .28 .34¯ 39 . ll ¯ 35.36 .03 .43¯ 37 .O1 .26

.87

1 262 273 274 275 156 137 128 08

2Sample p

2Sample ~,

T-predictors

.09 2.!6 1.69

.06 .27 .68

.08 .16 1.93

.O7 .O6 1.57

.08 .O2 .37

.54

.78

2831333232303234

.35 ~.5! 4.31

.32 .25 °42

.32 .09 .47¯ 35 .04 .68.34 .01 .51

.78

.80

least 80% of the total predictable variance and therefore the D²_kk distribution in the ETS data is weighted more in favor of D²_11 than the populations considered in Section 4.1. This means that the first principal component is very similar to the first principal predictor and hence prediction from the principal components is effective.

In the next section some simulation models will be considered that more closely match the ETS data, particularly in the D²_kk distribution.

4.3 Simulation of Real Data

The ETS data samples differ from the simulated data in Section 4.1 in several respects. The (D²_kk, k = 1, ..., 5) distribution is much more con-



TABLE 31
E. T. S. Samples (N = 75)

t ork

Sample 5 validatedon Sample 6

P.C.P.P. Sample

r 2 r2 2 2c c Dkk qk

Sample 7 validatedon Sample 8

P. C. P. P. -Sample

r2 r2 Dk2k qk2c c

1234 645 646 647 638 63

2Sample p

2Sample ~

65 .61 3.2866 .63 .1366 .63 .07

.63 .02

.63 .Ol

.70

S-predictors

5.55 64¯ 57 66.40 67.27 67.24 67

686869

.88

.66 3.29 5.87

.69 .09 .47

.69 .o6 .27

.69 .O3 .36

.69 .oi .2~

.7o

.9o

I 492 493 434 44

5 446 4O7 3~8 37

2Sample O

9Sample ~-

T-predictors

¯ 37 2.47 3.54¯ 37 .04 .82¯ 37 .02 .62¯ 37 .01 .64¯ 37 .00 .62

.5i

.78

.48.47.45.46.46.45.44.43

.46 2.09 4.15

.45 .O9 .53

.44 .04 .51

.43 .02 .76

.43 .01 .67

.45

.83

centrated on D²_11 in the ETS data than in Section 4.1 where the population D²_kk distribution was fixed as (1.2, 0.8, 0.6, 0.3, 0.1). The distribution (q_k², k = 1, ..., 5) in the ETS data is similar to the decreasing q_k² distribution in Section 4.1 although, in most cases, the ETS data had an even larger q_1².

The S-predictors were superior to the T-predictors; the approximate population p² for the S-predictors might be estimated to be 0.65 and for the T-predictors about 0.45. Other parameters were roughly estimated for the two sets of predictors and are shown in Table 32. These parameters



TABLE 32
Estimated Parameters Used to Simulate E. T. S. Data

n = 8
v_x = 0.01
e_x = 0.005
v_y = 0.01
e_y = 0.005
N = 75
number of Σs per predictor type = 2
number of C1, C2 pairs per Σ = 1

S-predictors (First Σ and Second Σ)
p² = 0.65
D²_11 = 3.0, D²_22 = 0.1, D²_33 = 0.05, D²_44 = 0.05, D²_55 = 0.05
π² = 0.85
q_1² = 5.0, q_2² = 0.6, q_3² = 0.4, q_4² = 0.4, q_5² = 0.4

T-predictors (Third Σ and Fourth Σ)
p² = 0.45
D²_11 = 2.0, D²_22 = 0.1, D²_33 = 0.05, D²_44 = 0.05, D²_55 = 0.05
π² = 0.75
q_1² = 4.5, q_2² = 0.5, q_3² = 0.4, q_4² = 0.3, q_5² = 0.3



TABLE 33

Simulation of E. T. S. Samples (N = 75)

t or k 2r

First ZSample A validated

on Sample B

P.P. Sample

2 2rc Dkk

S-predictors

I .56 .57 3.242 .5~ .58 .153 .60 .60 .124 .60 .61 .035 .61 .62 .026 .607 .6~8 .62

2Sample p .71

Sample w2

2qk

5.05.44.38.45.56

.86

Second zSample c validated

on Sample D

P.C. P.P. Sample

r2

r2

2c c Dkk

.59 .57¯ 59 .6O.61 .59.63 .6O.64 .61.61.61.61

3.27.16.ii.05.01

.72

2qk

5.24.56.32.41.55

.89

T-predictors

Third ZSmmp!e E validated

on Sample F

t or kP.C. ?. P. Sample

2 2 D2 2rc rc kk qk

i .392 .393 .404 .405 .416 .4_~7 .448 . 39

2Sample ~

2Sample ~

.41 2.04 4.21¯ 37 .34 .27.36 .19 .51.38 .12 .5~¯ 39 .03 .57

.54

.76

Fourth ZSample G validated

on Sample H

P.C. P.P. Sample

2 2 2 2r r qkc c Dkk

515152525152

.54 1.67 4I.l~

.55 .20 .4~

.53 .13 .86

.52 .09 .23

.52 .03 .35

.42



were used to generate population and sample correlation matrices using the simulation program. Two populations (First Σ and Second Σ) were generated for the S-predictor simulation and two populations (Third Σ and Fourth Σ) were generated for the T-predictor simulation. One pair of sample correlation matrices was generated for each population (N = 75 in all cases). The results of the validation of C1 on C2 for each of the four populations are shown in Table 33.

If Table 33 is compared with the corresponding Table 31 for the real data, it will be seen that the real and simulated results correspond closely. By adjusting the parameters it would be possible to make the match even better, but the simulation using the parameters of Table 32 is presented here because it was the first one attempted. The close similarity of the simulation results to the results from the ETS data shows that the simulation model is basically sound.


CHAPTER 5

SUMMARY AND CONCLUSIONS

The several methods of multiple regression discussed in this study are designed to provide optimal weights for predictor variables. The weights are optimal in the sense that, in new samples, the weighted linear combination of the predictors has the highest possible correlation with the criterion variable. By means of cross-validation, it is possible to estimate the correlation in new samples using only the data in a (divided) original sample.
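A minimal sketch of this procedure, assuming standardized raw scores are available for a derivation sample and a validation sample, might look as follows in Python; the function and variable names are illustrative only and do not come from the monograph.

    import numpy as np

    def multiple_regression_weights(X, y):
        # Least-squares weights for predicting y from the columns of X
        # (standardized scores assumed, so no intercept is carried).
        return np.linalg.lstsq(X, y, rcond=None)[0]

    def cross_validate(X1, y1, X2, y2):
        # Derive the weights in the first sample, then correlate the weighted
        # composite with the criterion in the derivation sample (multiple
        # correlation) and in the validation sample (cross-validity).
        b = multiple_regression_weights(X1, y1)
        r = np.corrcoef(X1 @ b, y1)[0, 1]
        r_c = np.corrcoef(X2 @ b, y2)[0, 1]
        return r, r_c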

For any given problem it is important to decide which prediction method gives the best weights. If one exhaustively tries all prediction methods, it is straightforward, using cross-validation, to pick the best linear combination of the predictors. But this procedure has several disadvantages: it is lengthy even with a computer, it capitalizes on chance results, and it provides no way to generalize to new variables or new populations.

It is apparent that no one method of prediction will be optimal for all possible predictor and criterion distributions. Even if one method, for example prediction from the principal components, were superior, it would still be necessary to decide the number of components to include in the regression. Burket's [1964] work included the computation of statistics which were of some assistance in deciding how many principal components to include in the regression. The present study considered some fundamental parameters of the population distribution which are relevant to the choice of prediction method and the number of components to include in the regression.

In order to study the effect of these parameters on prediction, the distributions were simulated on a computer. The parameters were systematically varied, and the prediction methods were compared for each parameter set by applying the weights to cross-validation samples.

In Section 3.3 the accuracy of the sample multiple correlation and cross-validity as measures of the population multiple correlation and cross-validity was studied. The squared sample multiple correlation, r², is an overestimate of the squared population multiple correlation, ρ². The bias tends to decrease with increasing sample size and to increase with an increasing number of predictors and increasing ρ². The bias is correctly estimated by formulas of Wishart [1931] and Wherry [1931]. The sample and population cross-validities are approximately equal and underestimate ρ². The sample cross-validity is therefore a good estimate of the population cross-validity but not of the population multiple correlation.
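For reference, one common form of Wherry's [1931] adjustment, which estimates ρ² from r², the sample size N, and the number of predictors n, can be written as a one-line Python function. This is a hedged illustration of the kind of correction meant here, not a reproduction of the monograph's computations.

    def wherry_adjusted_r2(r2, N, n):
        # Shrunken estimate of the squared population multiple correlation from
        # the squared sample multiple correlation r2, sample size N, and n predictors.
        return 1.0 - (1.0 - r2) * (N - 1) / (N - n - 1)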

The dependence of the cross-validities of the principal components of the predictors on the distribution parameters was considered in Section 3.4. The technique used was to calculate the number of principal components which produced the maximum cross-validity. This number, called t_max, was studied as a function of four parameters: the sample size, N, the number of predictors, n, the squared population multiple correlation, ρ², and the average criterion-related predictor variance, w̄². It was found that t_max is an increasing function of N and ρ² and a decreasing function of n and w̄². This means that a few (say 1 or 2) principal components will be more effective than many components when N is small, n is large, ρ² is small, and w̄² is large.
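The calculation of t_max can be sketched as follows, assuming standardized scores for a derivation and a validation sample; the component directions are taken from the derivation sample only, and every name in the sketch is illustrative rather than taken from the simulation program.

    import numpy as np

    def t_max_by_cross_validation(X1, y1, X2, y2):
        # Principal-component directions from the derivation sample only.
        _, _, Vt = np.linalg.svd(X1, full_matrices=False)
        best_t, best_rc = 1, -np.inf
        for t in range(1, Vt.shape[0] + 1):
            P1 = X1 @ Vt[:t].T                   # component scores, derivation sample
            P2 = X2 @ Vt[:t].T                   # same directions, validation sample
            b = np.linalg.lstsq(P1, y1, rcond=None)[0]
            r_c = np.corrcoef(P2 @ b, y2)[0, 1]  # cross-validity with t components
            if r_c > best_rc:
                best_t, best_rc = t, r_c
        return best_t, best_rc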

Many prediction problems in psychology involve multiple criteria, no one of which can be considered the criterion. A convenient way to avoid choosing one criterion and, at the same time, to achieve some synthesis of the criteria is to weight each standardized criterion equally and to optimize the prediction of all criteria simultaneously. The effectiveness of any prediction method can then be estimated from the average squared cross-validity.
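A sketch of this summary measure, assuming standardized validation-sample scores and a matrix of derivation-sample weights with one column per criterion (names again illustrative):

    import numpy as np

    def average_squared_cross_validity(X2, Y2, B):
        # X2, Y2: validation-sample predictors and criteria (standardized scores);
        # B: derivation-sample weight matrix, one column per criterion.
        composites = X2 @ B
        r2 = [np.corrcoef(composites[:, j], Y2[:, j])[0, 1] ** 2
              for j in range(Y2.shape[1])]
        return float(np.mean(r2))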

Two prediction methods were compared in this way. The first, prediction from the largest principal components of the predictors, does not use criterion information in the selection of the components and may be used for one or several criteria. The second, prediction from the principal predictors, uses criterion information to calculate the principal predictors themselves. This method optimizes the average squared multiple correlation in the derivation sample.

It was found, for the distributions studied, that the principal predictors had cross-validities superior or equal to those of the principal components except when the predictors were approximately dependent. The superiority of the principal predictors was particularly evident when w̄² was small and the qk² distribution was increasing, meaning that the first principal predictor accounted for much less of the predictor variance than the last principal predictor. However, this combination of parameters (w̄² small and the qk² distribution increasing) may occur rarely, if at all, in real multivariate distributions. In the sample of real ability and interest data from the Educational Testing Service, w̄² was very large and the qk² distribution was decreasing, with heavy concentration on q₁². In these data, as in the corresponding simulation data, the principal components were superior to the principal predictors.

This result is similar to Burket's [1964] finding that the principal components correlating most highly with the criterion do not validate as well as the largest principal components. It appears to be an advantage to select linear combinations of the predictors independently of criterion information in order to maximize cross-validity.

In order for the conclusions of a simulation study to apply to real prediction situations, it must be shown that the simulated distributions are similar to real distributions in relevant characteristics. Several sections of this study were concerned with this demonstration. In Section 3.1 it was shown that, when the population multiple correlation is zero, the simulation sample statistics (multiple correlation and cross-validity) obey known statistical laws. In Section 3.2 it was shown that changes in the covariance matrix of the predictors, keeping the multiple correlation constant, have no effect on the correlation statistics. Finally, in Section 4.3, several models and samples were generated in order to match the ETS data more closely. The results from this simulation were almost identical to the ETS results.

It would be interesting to determine whether other real data have different values of w̄² and different qk² distributions from those of the ETS data, and to see whether calculations using these variables obey the laws discovered in the simulations. It is also necessary to extend the calculations to larger numbers of predictors and criteria. Such work would be a further check on the effectiveness of the simulation model used in this study.


REFERENCES

Anderson, T. W. An introduction to multivariate statistical analysis. New York: Wiley, 1958.

Bartlett, M. S. On the theory of statistical regression. Proceedings of the Royal Society of Edinburgh, 1933, 53, 260-283.

Box, G. E. P., & Muller, M. E. A note on the generation of random normal deviates. The Annals of Mathematical Statistics, 1958, 29, 610-611.

Burket, G. R. A study of reduced rank models for multiple prediction. Psychometric Monographs, 1964, No. 12.

Darlington, R. B. Multiple regression in psychological research and practice. Psychological Bulletin, 1968, 69, 161-182.

Efroymson, M. A. Multiple regression analysis. In A. Ralston & H. S. Wilf (Eds.), Mathematical methods for digital computers. New York: Wiley, 1960. Pp. 191-203.

Ezekiel, M., & Fox, K. A. Methods of correlation and regression analysis. (3rd ed.) New York: Wiley, 1959.

Fisher, R. A. The general sampling distribution of the multiple correlation coefficient. Proceedings of the Royal Society, 1928, A, 121, 654-673.

Guttman, L. To what extent can communalities reduce rank? Psychometrika, 1958, 23, 297-308.

Hase, H. D., & Goldberg, L. R. Comparative validity of different strategies of constructing personality inventory scales. Psychological Bulletin, 1967, 67, 231-248.

Herzberg, P. A. The parameters of cross-validation. Unpublished doctoral dissertation, Univ. of Illinois, Urbana, 1967.

Horst, P. The prediction of personal adjustment. New York: Social Science Research Council, 1941.

Horst, P., & MacEwan, C. Predictor-elimination techniques for determining multiple prediction batteries. Psychological Reports, 1960, 7, 19-50.

Hotelling, H. The relations of the newer multivariate statistical methods to factor analysis. British Journal of Statistical Psychology, 1957, 10, 69-79.

Kendall, M. G., & Stuart, A. The advanced theory of statistics. Vol. 2. London: Griffin, 1961.

Kshirsagar, A. M. Bartlett decomposition and Wishart distribution. The Annals of Mathematical Statistics, 1959, 30, 239-241.

Larson, S. C. The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology, 1931, 22, 45-55.

Leiman, J. M. The calculation of regression weights from common factor loadings. Unpublished doctoral dissertation, Univ. of Washington, 1951.

Lord, F. M. Efficiency of prediction when a regression equation from one sample is used in a new sample. Research Bulletin 50-40. Princeton, N. J.: Educational Testing Service, 1950.

Massy, W. F. Principal components regression in exploratory statistical research. Journal of the American Statistical Association, 1965, 60, 234-256.

Mosier, C. I. Problems and designs of cross-validation. Educational and Psychological Measurement, 1951, 11, 5-11.

Nicholson, G. E. Prediction in future samples. In I. Olkin et al. (Eds.), Contributions to probability and statistics. Palo Alto, Calif.: Stanford, 1960. Pp. 322-330.

Olkin, I., & Pratt, J. W. Unbiased estimation of certain correlation coefficients. The Annals of Mathematical Statistics, 1958, 29, 201-211.

Owen, D. B. Handbook of statistical tables. Reading, Mass.: Addison-Wesley, 1962.

Siegel, S. Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill, 1956.

Tucker, L. R. Transformation of predictor variables to a simplified regression structure. Unpublished report. Princeton, N. J.: Educational Testing Service, 1957.

Wherry, R. J. A new formula for predicting the shrinkage of the coefficient of multiple correlation. The Annals of Mathematical Statistics, 1931, 2, 440-457.

Wijsman, R. A. Random orthogonal transformations. The Annals of Mathematical Statistics, 1957, 28, 415-423.

Wishart, J. The mean and second moment of the multiple correlation coefficient in samples from a normal population. Biometrika, 1931, 22, 353-361.