Choosing among imputation techniques for incomplete multivariate data: a simulation study
This article was downloaded by: [Tulane University] on 03 September 2014, at 06:12. Publisher: Taylor & Francis. Informa Ltd, registered in England and Wales, registered number: 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK.
Communications in Statistics - Theory and Methods. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/lsta20
Choosing among imputation techniques for incomplete multivariate data: a simulation study. Abdul Lateef Bello, Department of Statistics, University of Oxford, Oxford, OX1 3TG, U.K. Published online: 27 Jun 2007.
To cite this article: Abdul Lateef Bello (1993) Choosing among imputationtechniques for incomplete multivariate data: a simulation study, Communications inStatistics - Theory and Methods, 22:3, 853-877, DOI: 10.1080/03610929308831061
To link to this article: http://dx.doi.org/10.1080/03610929308831061
PLEASE SCROLL DOWN FOR ARTICLE
Taylor & Francis makes every effort to ensure the accuracy of all the information (the "Content") contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content.
This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions
COMMUN. STATIST.-THEORY METH., 22(3), 853-877 (1993)
CHOOSING AMONG IMPUTATION TECHNIQUES FOR INCOMPLETE MULTIVARIATE DATA:
A SIMULATION STUDY
ABDUL LATEEF BELLO
Department of Statistics, University of Oxford, Oxford, OX1 3TG, U.K
Key Words and Phrases: Imputation techniques; imputed data matrix; EM algorithm; principal component analysis; singular value decomposition.
ABSTRACT
A wide variety of strategies for coping with the problem of missing values,
which frequently arises in multivariate data, have been proposed and tried
over the years. One popular and important strategy is to estimate the missing
values themselves in some way, usually achieved by imputation techniques. By
means of Monte Carlo simulations, this paper investigates the relative per-
formance of five deterministic imputation techniques using normal and non-
normal data with several factors that may affect their efficiency. The imputation techniques are: mean substitution method (MSM), EM algorithm (EM),
Dear's principal component method (DPC), general iterative principal com-
ponent method (GIP) and singular value decomposition method (SVD). GIP is a refined, iterative version of DPC, developed to overcome certain problems with the latter. Although results indicate that no single imputation technique
is best overall in all combinations of factors studied, MSM and DPC behave
erratically; when the intercorrelation among the variables is moderate or high,
they perform worse than the iterative imputation techniques-EM, SVD, and GIP-which, under this condition, are equally efficient. An illustrative
real data example is given.
Copyright @ 1993 by Marcel Dekker, Inc.
1. INTRODUCTION
In large scale collection of data, a complete observation vector on all individ-
uals may often be impossible for a variety of reasons. Examples of incomplete
data abound in virtually every domain of scientific inquiry: excavated materials may, on account of decay or erosion, lack key variables; respondents to questionnaires may skip or forget to answer certain questions; medical records may
not show the required clinical measurements they had been expected to show
for some patients; and so on. When summary statistics-means, variances
and covariances, which are in the background of many standard multivariate
analyses-are to be calculated from these data, the method of doing so must
be resolved. There are three quick, naive options: either (i) use only the
individuals (or cases) for which all variables are present (case-wise-deletion
method); or (ii) use only the variables for which all individuals have data val-
ues (variable-wise method); or (iii) obtain the mean and variance of a variable
from all the data available for that variable, and obtain the covariance between
any two variables, similarly, from the maximum data available for both those
variables (all-available-data method). Also, in very special circumstances, options (i) and (ii) may be combined to produce complete data, depending on
the pattern and proportion of missing values in the data. Options (i)-(iii)
shall hereafter be called deletion-pairwise strategy.
Options (i) and (ii) potentially sacrifice information-by deleting any indi- vidual or variable just because a random datum is missing. In particular, with
option (i), the consequent reduction in sample size may affect statistical power
and efficiency; also, a poor covariance matrix estimate may arise, especially if the proportion of cases with complete variables is small. These problems may,
however, be less serious if missing values are only scattered across just a few
cases. But at one extreme, option (i) may break down altogether, when, for
instance, every case has at least one missing variable on it.
In option (iii), the elements of a covariance (or correlation) matrix will be
based on different sample sizes and the covariance matrix may be inconsistent.
In fact, the covariance matrix may turn out to be non-positive definite (NPD). By definition, an NPD matrix is one that has negative eigenvalue(s). In other
words, it is a matrix with negative variance estimates and may have correlation
coefficients outside the admissible [-1, +1] range. While detecting an NPD matrix is quite straightforward, tackling the problem is, unfortunately, not at
all easy. (See Devlin et al., 1975; Frane, 1978; Huseby et al., 1980; and, Knol
and ten Berge, 1989; for some useful approaches.)
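The NPD hazard of option (iii) is easy to demonstrate. The sketch below (Python with NumPy, purely for exposition; the correlations shown are hypothetical, imagined as if each had been estimated from a different subset of cases) assembles a correlation matrix from pairwise-complete estimates and detects the problem from its eigenvalues.

```python
import numpy as np

# Hypothetical pairwise correlations, each imagined as estimated from a
# different subset of cases (option (iii), the all-available-data method).
R = np.array([[ 1.0,  0.9,  0.9],
              [ 0.9,  1.0, -0.9],
              [ 0.9, -0.9,  1.0]])

eigenvalues = np.linalg.eigvalsh(R)   # returned in ascending order
print(eigenvalues[0] < 0)             # True: R has a negative eigenvalue,
                                      # so it is non-positive definite
```

No single sample could have produced these three correlations jointly, which is exactly why the assembled matrix fails to be positive semi-definite.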
Because of a number of pitfalls associated with deletion-pairwise strategy,
a time-honoured alternative strategy, of considerable ingenuity, is imputation. This strategy involves estimating the missing values themselves in some way, usually achieved by imputation techniques. With few exceptions, imputation techniques are difficult to use; many of them make heavy demands in computing time and have only become manageable with recent advances in computing technology. Indeed, modern computing environments have certainly
enhanced interest and research in imputation techniques. However, while these
techniques are far superior to any deletion-pairwise methods (see Afifi and
Elashoff, 1966; Chan and Dunn, 1973; Beale and Little, 1975; Kim and Curry,
1977; and Little, 1988), no study so far has investigated their relative perfor-
mance on their own merit. The obvious benefits of imputation techniques are that they create com-
plete data that retain as many of the available data as possible, allow straight- forward use of standard complete-data methods, and may give consistent re-
sults and sometimes reduce bias of parameter estimates. However, these ben-
efits depend solely on the imputation technique being able to preserve the relationship among variables as far as possible, and on the missing-values mechanism. If the missing-values mechanism is ignorable, i.e., missing at random (MAR) or
missing completely at random (MCAR), in the sense defined by Rubin (1976)- and the imputation technique is good at predicting the missing values, the error
due to bias will be small. On the other hand, if the missing-values mechanism
is not ignorable (that is, if there is systematic difference between cases with
complete variables and cases with partially observed variables), imputation
will not be able to reduce bias unless a suitable model is proposed for the
missing-values mechanism. This approach is advocated by Greenlees, Reece,
and Zieschang (1982). Although the imputation strategy overcomes the inherent problems associated with the deletion-pairwise strategy, its main drawback is the distortion of the relationships among variables which may arise (since imputed values are mere approximations to the unknown missing values).
There is a wide variety of imputation techniques which could possibly
be categorized as stochastic or deterministic. The stochastic imputations use
randomization process to obtain imputed values and their refinements are es-
sentially multiple imputations, fully described by Rubin (1991). In this paper,
attention is restricted to members of deterministic imputations for the obvious
reason that their imputed values are uniquely determined and when repeated
anywhere for the same data, the results are always consistent; this is not nec-
essarily true for stochastic imputations.
The five deterministic imputation techniques studied are: mean substitu- tion method (MSM), EM algorithm (EM), Dear's principal component method
(DPC), general iterative principal component method (GIP) and singular value
decomposition method (SVD). GIP is essentially an iterative version of DPC
with certain refinements introduced to overcome some of the problems asso-
ciated with DPC. Choosing among these imputation techniques to give rea-
sonable imputed values for missing variable (and therefore reliable parameter
estimates) may be a problem. The aim of this paper, therefore, is to offer
a thorough investigation of the effects of imputed data matrix-achieved by any of the five imputation techniques-in estimating population parameters-
mean vector and covariance matrix. The investigation is largely by Monte
Carlo simulations using a root-mean-square deviation of the estimated param-
eters from their true values as a benchmark to judge the performance of any
particular imputation technique over repeated sampling.
The plan of this paper is as follows. Section 2 describes the imputation
techniques, and sections 3 and 4 present, respectively, the Monte Carlo design
and results. Section 5 provides an illustrative real data example. Summary and suggestions are given in section 6.
2. DESCRIPTION OF THE IMPUTATION TECHNIQUES
Unless otherwise stated, let X be an (n x p) incomplete data matrix with elements x_ij (i = 1, ..., n; j = 1, ..., p) denoting observed values, where n > p. We shall assume that the missing values in X are MAR as defined by Rubin (1976).
Mean substitution method ( M S M ) . This is the simplest and perhaps the oldest
imputation method. Its being the oldest stems from the fact that it was the
first to appear in statistical literature, presumably suggested by Wilks (1932). Ever since then, several authors have considered it in their works but unfortu-
nately it has never received the full and proper statistical consideration that
it deserves. The basic idea involves replacing missing values on a particular
variable by the mean of available data on the variable. This method ignores
the intercorrelation that often exists among multiple variables in obtaining
imputed values which are nonrandom in nature. Consequently, variances and covariances are systematically underestimated-a natural effect of imputing
values at the center of the distribution. To remedy this, two courses of action are possible, namely:
MI: Adjust the degrees of freedom used in calculating the statistics as follows:
let the (j,k)th element of the adjusted covariance matrix S be denoted by

$$ s_{jk} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) \big/ (n - c - 1), \qquad j, k = 1, \ldots, p, $$

where, for the diagonal elements s_jj, c is the number of imputed values on the jth variable; for the off-diagonal elements s_jk, c is the number of cases with x_ij, x_ik, or both imputed, that is, the number of terms that are zero because of imputation; and x̄_j is the mean of the jth variable.
M2: Add a small perturbation to each imputed value to overcome the underestimation phenomenon. Replace a missing value, x_ij, for instance, with x̄_j + e_j, where e_j is a random quantity having zero mean and variance equal to the variance of the jth variable (Krzanowski, 1988, p. 33). An assumption will always be required for the distribution of the disturbance e_j, most likely normality. But any assumption like this may be unjustifiable, mainly because it cannot be verified in practice. Any imputed value from this approach is not deterministic. That is, it cannot be uniquely determined: different imputed values are possible for any missing value. This approach is perhaps particularly useful in the multiple imputation strategy, which involves replacing each missing value with two or more acceptable values to represent a distribution of possibilities-
originally proposed by Rubin (1978) and now detailed in Rubin (1991) and Li, Raghunathan, and Rubin (1991).
Throughout this paper, we shall be concerned with M1 whenever MSM is
mentioned. This is because the other imputation techniques described in the
sequel are deterministic in methodology.
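As a concrete illustration of MSM with the M1 adjustment, the following sketch (Python/NumPy for exposition; the paper's own programs were written in FORTRAN 77) fills missing entries with column means and then reduces the degrees of freedom element-by-element as in M1:

```python
import numpy as np

def msm_impute(X):
    """Replace each missing entry (NaN) by the mean of the available
    data on its variable; return the filled matrix and the missingness mask."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    filled[miss] = np.take(col_means, np.nonzero(miss)[1])
    return filled, miss

def m1_adjusted_covariance(filled, miss):
    """M1 adjustment: divide element (j, k) by n - c - 1, where c counts
    cases with x_ij, x_ik, or both imputed."""
    n, p = filled.shape
    centred = filled - filled.mean(axis=0)
    S = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            c = int(np.sum(miss[:, j] | miss[:, k]))
            S[j, k] = centred[:, j] @ centred[:, k] / (n - c - 1)
    return S
```

Without the adjustment, the imputed zeros in the centred matrix would shrink every variance and covariance toward zero; dividing by n - c - 1 instead of n - 1 compensates for exactly those zero terms.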
EM algorithm (EM). When dealing with data involving multiple variables, it
is not common to have the variables independent of one another. Therefore,
if missing values occur on a particular variable, reasonable estimates can be
confidently expected if, for that variable, data on other observed variables are
used to predict the missing part. This basic concept underlies a number of
imputation techniques, prominent amongst which is the so-called EM algo-
rithm. This algorithm is most versatile and consists of two steps-expectation (E-step) and maximization (M-step). Essentially, the E-step is a regression
method, and, in general, the whole concept of EM is based on an iterative
scheme which has only become manageable due primarily to advances in com-
puter technology. A full exposition of EM is given by Dempster et al. (1977) and
Little and Rubin (1987). We give a brief account of the key ideas as follows:
let x_i = (x_oi, x_mi) be a realization of the continuous variables X_j (j = 1, ..., p) for the ith case, where x_oi denotes the set of variables observed in case i and x_mi denotes the missing variables. Suppose Θ^t = (μ^t, Σ^t) is the maximum likelihood estimate of Θ = (μ, Σ) at iteration t. The E-step calculates the expected values of the complete-data sufficient statistics given the observed data and the current estimates Θ^t:

$$ x^{*}_{ij} = \begin{cases} x_{ij} & \text{if } x_{ij} \text{ is observed} \\ E(x_{ij} \mid x_{oi}, \Theta^{t}) & \text{if } x_{ij} \text{ is missing,} \end{cases} $$

and

$$ c_{jki} = \begin{cases} 0 & \text{if } x_{ij} \text{ or } x_{ik} \text{ is observed} \\ \operatorname{cov}(x_{ij}, x_{ik} \mid x_{oi}, \Theta^{t}) & \text{if } x_{ij} \text{ and } x_{ik} \text{ are missing.} \end{cases} $$

M-step: compute new parameter estimates Θ^{t+1} = (μ^{t+1}, Σ^{t+1}), where

$$ \mu_{j}^{t+1} = \frac{1}{n} \sum_{i=1}^{n} x^{*}_{ij}, \qquad \sigma_{jk}^{t+1} = \frac{1}{n} \sum_{i=1}^{n} \left( x^{*}_{ij} x^{*}_{ik} + c_{jki} \right) - \mu_{j}^{t+1} \mu_{k}^{t+1}. $$
The algorithm then proceeds in an iterative manner through the E-step and M-step until the difference between Θ^t and Θ^{t+1} satisfies a specified convergence criterion.
The final E-step supplies the imputed values that are used in subsequent anal-
yses. Computationally, considerable time can be saved if all individuals with a similar pattern of missing variables are first grouped together and the initial
parameter estimate Θ^0 is obtained (where possible) from only those cases for which there are complete variables. Otherwise, MSM may be used to supply imputed values before estimating Θ^0. Although efforts to speed EM up were made (see Louis, 1982), it turns out, rather surprisingly, that most of the modifications
destroy its simplicity or stability (Meng and Rubin, 1991). Moreover, the avail-
ability of modern computing machines and the use of a sweep operator (Little and Rubin, 1987, chap. 6) have rendered the modifications inconsequential. In particular, convergence properties associated with EM were studied by Boyles
(1983) and Wu (1983), and EM'S robustness in the presence of outliers was
discussed in Little (1988).
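The E- and M-steps above can be sketched as follows: an illustrative Python/NumPy implementation of EM for the multivariate normal, assuming MAR and using mean substitution for the starting values as suggested in the text. The conditional moments are obtained here from the partitioned-normal formulas rather than the sweep operator of Little and Rubin (1987), a simplification for exposition only.

```python
import numpy as np

def em_normal(X, n_iter=50, tol=1e-6):
    """EM for a p-variate normal with entries missing at random (NaN).
    Returns (mu, Sigma, X_imputed); the final E-step supplies the
    imputed values used in subsequent analyses."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    Xf = X.copy()
    col_means = np.nanmean(X, axis=0)
    Xf[miss] = np.take(col_means, np.nonzero(miss)[1])   # MSM start
    mu = Xf.mean(axis=0)
    Sigma = np.cov(Xf, rowvar=False, bias=True)
    for _ in range(n_iter):
        C = np.zeros((p, p))        # accumulated conditional covariances
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            # E-step: conditional mean/covariance of the missing block
            # given the observed block (partitioned-normal formulas)
            B = Sigma[np.ix_(m, o)] @ np.linalg.pinv(Sigma[np.ix_(o, o)])
            Xf[i, m] = mu[m] + B @ (X[i, o] - mu[o])
            C[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - B @ Sigma[np.ix_(m, o)].T
        # M-step: new estimates from the completed sufficient statistics
        mu_new = Xf.mean(axis=0)
        Sigma_new = np.cov(Xf, rowvar=False, bias=True) + C / n
        done = (np.max(np.abs(mu_new - mu)) < tol
                and np.max(np.abs(Sigma_new - Sigma)) < tol)
        mu, Sigma = mu_new, Sigma_new
        if done:
            break
    return mu, Sigma, Xf
```

Note the C/n term: simply filling in conditional means and recomputing the sample covariance would understate Σ, so the conditional covariance of each missing block is added back.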
Dear's principal component method (DPC). One desirable property of princi-
pal component analysis (PCA) is that it does not require any distributional
assumption for its use. This, perhaps, explains the reason why it is a much used statistical technique. Dear (1959) explores the idea of PCA as follows:
D1: Define an (n x p) missingness-indicator matrix, R = (r_ij), where r_ij = 0 or 1 according to whether x_ij is missing or observed. Treat all variables on an equal footing by first standardizing X to Z, where z_ik = (x_ik - x̄_k)/√s_kk, and x̄_k and s_kk are, respectively, the mean and variance of the available data on the kth variable. Now, use the case-wise-deletion method to obtain the correlation matrix, S.

D2: Calculate the largest eigenvalue of S, λ_1 = max_j(λ_j), and its associated eigenvector q_1k (k = 1, ..., p).
D3: Let the first principal component score for the ith case be

$$ y_{i1} = \sum_{k=1}^{p} r_{ik} q_{1k} z_{ik}, $$

so that the points on the first principal component line that are closest to the ith case replace the missing variables thus:

$$ \hat{z}_{ij} = q_{1j} y_{i1}. \qquad (2.1) $$
Repeat D3 for all cases with missing variables and then de-standardise Z to X.
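A one-pass sketch of steps D1-D3 (Python/NumPy, for illustration only; the indicator r_ij enters through the zeros placed at missing positions, so those terms drop out of the score):

```python
import numpy as np

def dpc_impute(X):
    """One pass of D1-D3: standardize, take the correlation matrix from
    complete cases, and replace each missing standardized value by
    q_1j * y_i1, the closest point on the first principal component line."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    means = np.nanmean(X, axis=0)
    sds = np.nanstd(X, axis=0, ddof=1)
    Z = (X - means) / sds
    complete = ~miss.any(axis=1)
    S = np.corrcoef(Z[complete], rowvar=False)   # D1: case-wise deletion
    q1 = np.linalg.eigh(S)[1][:, -1]             # D2: leading eigenvector
    Z0 = np.where(miss, 0.0, Z)                  # missing terms drop out of the score
    for i in np.nonzero(miss.any(axis=1))[0]:
        y_i1 = Z0[i] @ q1                        # D3: first PC score
        Z0[i, miss[i]] = q1[miss[i]] * y_i1
    return Z0 * sds + means                      # de-standardize back to X
```

The imputed value q_1j * y_i1 is sign-invariant (the arbitrary sign of the eigenvector cancels in the product), but the method clearly fails if no complete cases are available for D1, which motivates the GIP refinement below.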
General iterative principal component method (GIP). It is not difficult to see
that DPC may fail if, for instance, all cases have partially observed variables
or if cases with a complete observation vector are relatively few. The latter will result in a poor estimate of S (correlation matrix) and the former will
make this estimate impossible. To get over these problems and make DPC a
general-purpose method, refinements are introduced as follows:
G1(a): Use the all-available-data method to calculate S. Modify S, if it is non-positive definite, with the algorithm provided by Huseby et al. (1980).

G1(b): Alternatively, use MSM imputed values for the missing data and then calculate S (with adjustment as discussed in step M1 above).
G2: Now, construct the first principal component from the smoothed (or
unsmoothed) S and estimate the missing values with (2.1).
G3: Recalculate S from the imputed data matrix and repeat G2.
G4: Cycle iteratively through G3 and G2 until successive imputed values do
not change materially-that is, satisfy a convergence criterion.
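Steps G1(b)-G4 can be sketched as an iterative loop around the DPC update (Python/NumPy, for illustration; the M1 degrees-of-freedom adjustment is omitted here for brevity):

```python
import numpy as np

def gip_impute(X, n_iter=100, tol=1e-8):
    """G1(b)-G4: start from mean substitution, then cycle between
    recalculating S from the imputed matrix and re-imputing from the
    first principal component, until the imputed values stabilize."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    Xf = X.copy()
    col_means = np.nanmean(X, axis=0)
    Xf[miss] = np.take(col_means, np.nonzero(miss)[1])   # G1(b): MSM start
    rows = np.nonzero(miss.any(axis=1))[0]
    if rows.size == 0:
        return Xf
    for _ in range(n_iter):
        means = Xf.mean(axis=0)
        sds = Xf.std(axis=0, ddof=1)
        S = np.corrcoef(Xf, rowvar=False)                # G3: recalculate S
        q1 = np.linalg.eigh(S)[1][:, -1]
        old = Xf[miss].copy()
        for i in rows:                                   # G2: re-impute via (2.1)
            z = (Xf[i] - means) / sds
            z[miss[i]] = 0.0          # score uses the observed part only
            Xf[i, miss[i]] = means[miss[i]] + sds[miss[i]] * q1[miss[i]] * (z @ q1)
        if np.max(np.abs(Xf[miss] - old)) < tol:         # G4: convergence
            break
    return Xf
```

Because S is recomputed from the full imputed matrix at every cycle, the method remains usable even when no case has a complete observation vector.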
Singular value decomposition method (SVD). Krzanowski (1988) suggests us-
ing SVD in a remarkably simple way to impute data to missing values. The
method is easy to compute, especially using the algorithms of Bunch and Nielsen (1978) and Bunch, Nielsen, and Sorensen (1978). To reinforce its wide-ranging applications, Krzanowski (1988) noted that if an (n x p) matrix is such that p > n, we can transpose the matrix so that the roles of p and n are interchanged. To fix ideas, for one missing element, x_ij, in X, the steps involved are:

S1: Omit the ith case (or row) from X and calculate the SVD of the remaining ((n - 1) x p) data matrix, denoted by

$$ X_{-i} = U D V' \quad \text{with } U = \{u_{st}\},\; V = \{v_{st}\} \text{ and } D = \operatorname{diag}\{d_1, \ldots, d_p\}. $$
S2: Omit the jth variable (or column) from X and obtain the SVD of the remaining (n x (p - 1)) data matrix, denoted by

$$ X_{-j} = \bar{U} \bar{D} \bar{V}' \quad \text{with } \bar{U} = \{\bar{u}_{st}\},\; \bar{V} = \{\bar{v}_{st}\} \text{ and } \bar{D} = \operatorname{diag}\{\bar{d}_1, \ldots, \bar{d}_{p-1}\}, $$

where U, V, Ū and V̄ are orthonormal matrices (i.e., U'U, V'V, Ū'Ū and V̄'V̄ are identity matrices) and D and D̄ are diagonal matrices.

S3: Now, combine the two SVDs, X_{-i} and X_{-j}, to get the imputed value (cf. Krzanowski, 1988)

$$ \hat{x}_{ij} = \sum_{t=1}^{p-1} \bar{u}_{it} \, \bar{d}_t^{1/2} \, v_{jt} \, d_t^{1/2}. \qquad (2.2) $$
For more than one missing value, an iterative scheme is involved and can be
conducted as follows: start with any initial imputed values, but preferably use the MSM technique. Update each initial imputed value in turn using (2.2). The process is then iterated until stability is achieved in the imputed values. It should be noted that (2.2) was first suggested by Krzanowski (1987) as a basis for determining the dimensionality of a set of multivariate data for PCA.
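For a single missing x_ij, steps S1-S3 can be sketched as follows (Python/NumPy, illustration only). One caveat not discussed above: the component signs of the two SVDs are individually arbitrary, so the sketch imposes a sign convention before combining them; that convention is our assumption, not part of the original description.

```python
import numpy as np

def svd_impute_one(X, i, j):
    """Impute a single x_ij by combining the SVDs of X with row i
    deleted (S1) and with column j deleted (S2), as in (2.2)."""
    X = np.asarray(X, dtype=float)
    U, d, Vt = np.linalg.svd(np.delete(X, i, axis=0), full_matrices=False)
    Ub, db, Vbt = np.linalg.svd(np.delete(X, j, axis=1), full_matrices=False)
    # Sign convention (our assumption): make the dominant entry of each
    # right vector of the first SVD, and of each left vector of the
    # second, positive, so the two decompositions combine consistently.
    for t in range(len(d)):
        if Vt[t, np.argmax(np.abs(Vt[t]))] < 0:
            Vt[t] *= -1.0
            U[:, t] *= -1.0
    for t in range(len(db)):
        if Ub[np.argmax(np.abs(Ub[:, t])), t] < 0:
            Ub[:, t] *= -1.0
            Vbt[t] *= -1.0
    # S3 / (2.2): geometric mean of the two sets of singular values
    r = min(len(d), len(db))
    return sum(Ub[i, t] * np.sqrt(db[t]) * Vt[t, j] * np.sqrt(d[t])
               for t in range(r))
```

The row representation ū_it comes from the matrix that still contains case i, and the variable representation v_jt from the matrix that still contains variable j, so the deleted value never influences its own imputation.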
The workability of DPC depends on being able to obtain a correlation matrix from cases with complete variables-as in D l . Since the pattern of missing values is unpredictable, it is important, therefore, to ensure that the five imputation techniques are usable for any incomplete data situation. To this end, a modified DPC which involves using MSM imputed values as ini-
tial missing-value estimates is adopted in this study and, as a result, the correlation matrix, arising from the MSM imputed data matrix, is adjusted accordingly, as discussed in M1. In this respect, the GIP uses option G1(b) and its estimated covariance matrix never suffers from a non-positive definite (NPD) problem.
3. MONTE CARLO DESIGN
This section describes a Monte Carlo study designed to investigate the perfor- mance of five imputation techniques in a continuous multivariate incomplete
data matrix. The usefulness of a Monte Carlo study is that the true pa- rameters, and the distributions from which replicated samples are drawn, are
usually known, enabling the investigator to compare the estimated parameters
with the known population parameters. For our Monte Carlo simulations, we took data from multinormal and non-multinormal distributions. For the non-multinormal case, we considered the t-distribution with four degrees of freedom, as advocated by Devlin et al. (1976), as a useful model for robustness study.
A random covariance matrix, C = VΛV^T, was first generated, as described by Bryce and Maynes (1979), having the predetermined geometric eigenvalue structure

$$ \lambda_i = \omega v^{i-1} + 0.1, \qquad (i = 1, \ldots, p), $$

as used by Bendel (1978), where

$$ \omega = \frac{(c - 0.1p)(1 - v)}{1 - v^{p}} $$

and c = Σ_{i=1}^{p} λ_i = tr(C). Let Λ = diag{λ_1, ..., λ_p}. A random orthogonal matrix V was then generated using the algorithm given by Heiberger (1978) (with the correction given by Tanner and Thisted (1982)). Pre- and post-multiplying Λ by V and V^T then gives C = VΛV^T. [Note that eigenvalues do not change when Λ is pre- and post-multiplied by an orthogonal matrix.] Therefore, the generated covariance matrix has the predetermined eigenvalues λ_1, ..., λ_p. As may be seen, when v = 1, all eigenvalues are equal to c/p, which implies that the variables are independent. On the other hand, for v = 0, the variables are strongly dependent. Evidently, values of v represent a continuum such that the interdependence among the variables increases as v decreases from 1 to zero. It is instructive to note that if we write c = p + δ and δ = 0, C will be the correlation matrix considered by Bendel (1978). In this study, in order to simplify the Monte Carlo design, we used δ = 10 and a null mean vector (μ = 0).
The multinormal data were generated using NAG subroutines G05CBF, G05EAF and G05EZF after specifying μ and C (as above). It is important to note that the G05EAF and G05EZF routines are specifically designed to generate multivariate normal data. The data matrices from the robust model (t_4), however, were obtained by transforming multinormal data as follows: let Y ~ N_p(μ, C); we drew independent random values w_j from a χ²_4 distribution using NAG subroutine G05DHF; then the transformation x_ji = y_ji/(w_j/4)^{1/2} was performed (see Mardia, Kent, and Bibby, 1979, p. 43) for j = 1, ..., n, so that X ~ t_p(4, μ, C).

In anticipation of the possible effects of some factors, such as sample
size (n), dimensionality (p), interdependence among the variables ( v ) , and
proportion of missing values (k)--on the performance of the five imputation
techniques, we used the following levels for these factors:
The number of Monte Carlo simulations for each combination of n, p, v and k was fixed at 100. In particular, both c and v were held constant in each simulation, but the random orthogonal matrices varied from one simulation
to another. Thus, a different covariance matrix with fixed eigenvalue structure
was used in each simulation, so that the covariance structure is made very
general.
Now, from any particular distribution (multinormal or t_4), 100 (n x p) data matrices were drawn with different random seeds obtained using NAG routines G05CCF and G05DYF. Each of these matrices was made incomplete
by introducing missing values at random. To delete a proportion k (0 < k < 1) of the p-variate data matrix, we followed the approach of Krzanowski (1988): pseudo-random numbers lying between 0 and 1 were generated using NAG routines G05CCF and G05CAF, and if the (pi + j)th random number was less than k, then the element in the (i + 1, j)th position of the data matrix was deleted (i = 0, ..., n - 1; j = 1, ..., p). In this way, the expected proportion of missing
values in the data matrix is thus k. The five imputation techniques were then
applied one after the other to the incomplete data to obtain imputed values for
the missing values, thereby creating five different complete data sets for each
incomplete data matrix. Estimates of the mean vector and covariance matrix
were then made from each imputed data matrix.
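The deletion mechanism can be sketched as follows (Python/NumPy, with a modern generator standing in for the NAG routines):

```python
import numpy as np

def delete_at_random(X, k, rng):
    """Delete each entry of X independently with probability k by
    comparing a uniform pseudo-random number against k, mirroring the
    Krzanowski (1988) scheme described above."""
    X = np.asarray(X, dtype=float).copy()
    X[rng.uniform(size=X.shape) < k] = np.nan
    return X
```

Because every entry is thresholded independently, the realized proportion of missing values fluctuates around k from one replication to the next, while its expectation is exactly k.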
Let Σ̂_ut and μ̂_ut denote the estimated covariance matrix and mean vector, respectively, from the tth imputed data matrix (t = MSM, EM, DPC, GIP, or SVD), at the uth simulation. Several comparison criteria are possible for assessing the closeness of Σ̂_ut to the population covariance matrix Σ_u, but the one we adopted involves the Euclidean norm. Thus the quantity

$$ Q_{\Sigma}(t) = \left[ \frac{1}{100} \sum_{u=1}^{100} \operatorname{tr}\left\{ (\hat{\Sigma}_{ut} - \Sigma_u)(\hat{\Sigma}_{ut} - \Sigma_u)' \right\} \right]^{1/2} \qquad (3.1) $$

may be viewed as a root-mean-square deviation of the estimated covariance matrix of the tth imputed data matrix from the true population covariance ma-
trix, where 'tr' means the trace of the matrix in the curly brackets. It is noteworthy that Σ_u for the t-distribution is fC/(f - 2) (Berger, 1980), and we have used it in this study, where f is the degrees of freedom and C is the covariance matrix of the underlying multinormal.
In looking for a single summary statistic as a criterion for comparing the estimated mean vector, μ̂_ut, we adopted

$$ Q_{\mu}(t) = \left[ \frac{1}{100} \sum_{u=1}^{100} \hat{\mu}_{ut}' \hat{\mu}_{ut} \right]^{1/2} \qquad (3.2) $$

and since the population mean vector is null, Q_μ(t) is also a measure of discrepancy.
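The two criteria can be computed as follows (Python/NumPy, for illustration; averaging over the 100 simulations reflects our reading of the criteria as root-mean-square deviations):

```python
import numpy as np

def q_sigma(sigma_hats, sigma_true):
    """RMS deviation, over repeated simulations, of estimated covariance
    matrices from the truth, using the Euclidean (Frobenius) norm."""
    devs = [np.trace((S - sigma_true) @ (S - sigma_true).T)
            for S in sigma_hats]
    return float(np.sqrt(np.mean(devs)))

def q_mu(mu_hats):
    """With a null population mean vector, the companion criterion is
    the RMS length of the estimated mean vectors."""
    return float(np.sqrt(np.mean([m @ m for m in mu_hats])))
```

A smaller value of either criterion for a given technique t indicates parameter estimates closer, on average, to the population values over the repeated samples.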
All programs were written in FORTRAN 77 and the computations were performed on a SUN 4/65 at the Oxford University statistical laboratory.
4. MONTE CARLO RESULTS
Figs. 1-5 present the results of the Monte Carlo design described in §3, based on criterion (3.1). Specifically, Figs. 1, 2, and 3 (for p = 2, 5, and 10, respectively) display the results from the multinormal distribution, and Figs. 4 and 5 (for p = 2 and 5, respectively) display the results from the multivariate t-distribution. Each figure has 12 plots in 4 columns and 3 rows; moving from top to bottom, the plots show the effect of increasing sample size, and moving from left to right, the effect of increasing the average intercorrelation coefficient (v) on the imputation techniques. Each plot displays the five imputation techniques, based on (3.1), versus the proportion of missing data (k).
[Figs. 1-5 about here.]

The following salient results are noticeable in Figs. 1-3 (normal data).

(i) When the variables are nearly independent (v = 0.7), and p is fixed, as the sample size (n) increases, the MSM technique outperforms the regression-like imputation techniques (EM, DPC, GIP, and SVD) in any set of incomplete data, that is, regardless of the proportion of missing values (k) in the data. This is not surprising, since MSM imputed values are obtained under the assumption that the variables are uncorrelated. Therefore, when the variables are nearly independent and the covariance matrix of the MSM imputed data matrix is adjusted (as in M1), the MSM technique should fare better than any other
competing techniques. However, it does appear that MSM may not do well for large sample size (n = 200) and high dimensionality (p = 10), for any v and k, though it may work well when v = 0.3 and p = 2. In general, under conditions favourable to MSM, the trend seems to be that EM is second to MSM in terms of efficiency, followed by DPC, SVD, and GIP, in that order. Clearly, GIP is not reliable here, as it does not improve over DPC.
(ii) For fixed n and v, the performance of the imputation techniques improves
as p increases, depending on the proportion of missing values (k) (see Figs. 1-3). This can be seen from the average deviations, which decrease with increasing dimensionality (p). This is particularly true for p = 2 and 5 (see Figs. 1 and 2).
(iii) In moderate-to-high dimensionality, p > 2 (see Figs. 2 and 3), as the interdependence among the variables increases (v ≤ 0.3), the regression-like imputation techniques show appreciable superiority over MSM, especially when n is large and v is near zero (i.e., the variables are highly correlated).
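The mean-substitution idea behind MSM can be sketched in a few lines. The paper's FORTRAN 77 implementation, and the covariance adjustment it applies afterwards, are not reproduced here; this Python fragment is only an illustrative stand-in for the basic fill-in step:

```python
# Illustrative sketch of mean-substitution (MSM-style) imputation:
# each missing entry is replaced by the mean of the observed values
# in its own column.  (The subsequent covariance adjustment the
# paper discusses is not shown.)

def mean_substitute(data):
    """Fill None entries column-wise with the observed column mean."""
    n_rows, n_cols = len(data), len(data[0])
    filled = [row[:] for row in data]  # work on a copy
    for j in range(n_cols):
        observed = [row[j] for row in data if row[j] is not None]
        col_mean = sum(observed) / len(observed)
        for row in filled:
            if row[j] is None:
                row[j] = col_mean
    return filled

# Two variables, one missing value in each column.
incomplete = [
    [1.0, 2.0],
    [3.0, None],
    [None, 4.0],
    [5.0, 6.0],
]
complete = mean_substitute(incomplete)
print(complete)  # → [[1.0, 2.0], [3.0, 4.0], [3.0, 4.0], [5.0, 6.0]]
```

Because every missing entry in a column receives the same value, the technique ignores the correlation structure entirely, which is why it suits nearly independent variables.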
Moving from the general to the specific, consider Fig. 1 (the bivariate normal data). When the sample size is small (n = 50) and v < 0.3, the choice of imputation technique is influenced by the proportion of missing values (k) in the data. But when k ≥ 0.10 and n becomes large (n > 50), EM is on average the best technique; the choice between GIP and SVD is unimportant, though both trail behind EM, while MSM and DPC both perform fairly badly and are not reliable.
Contrasting Fig. 2 (p = 5) with Fig. 1 (p = 2) reveals a marked difference in the relative merit of the imputation techniques caused by the increase in dimensionality (p). In Fig. 2, for instance, the imputation techniques exhibit a trend as v decreases with fixed n. In particular, when v < 0.3, the regression-like imputation techniques (DPC, GIP, EM, and SVD) maintain consistent superiority across different values of n, and there is also evidence that the iterative techniques (GIP, EM, and SVD) are virtually equivalent when the variables are strongly dependent (v ≤ 0.1), for any sample size (n), depending on the proportion of missing values (k) in the data.
Looking closely at Fig. 3 (p = 10), it is interesting to observe that the
trend is not very different from Fig. 2, except that the pattern of the trend
varies strongly with the proportion of missing values (k). But unlike Fig. 2,
the difference between the iterative techniques (SVD, GIP, and EM) and one regression-like technique (DPC) is marginally small when v = 0.01 for all values of n. In fact, they almost always performed alike.
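The regression-like iterative schemes compared above share a common skeleton: fill in starting values, fit, re-impute, and repeat until the imputed values stabilise. The bivariate sketch below uses plain least squares as a stand-in for the DPC, GIP, EM, and SVD updates, which differ in how they form the fitted values; it is illustrative only:

```python
# Hedged sketch of a generic regression-like iterative imputation:
# start from mean substitution, then repeatedly regress the incomplete
# variable on the complete one and re-impute the missing entries.

def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def iterative_impute(x, y, tol=1e-8, max_iter=100):
    """Impute None entries of y from x by iterated regression."""
    miss = [i for i, yi in enumerate(y) if yi is None]
    obs = [yi for yi in y if yi is not None]
    fill = sum(obs) / len(obs)           # step 0: mean substitution
    y = [fill if yi is None else yi for yi in y]
    for _ in range(max_iter):
        b1, b0 = fit_line(x, y)          # refit on current filled data
        new = [b0 + b1 * x[i] for i in miss]
        change = max(abs(v - y[i]) for v, i in zip(new, miss))
        for v, i in zip(new, miss):
            y[i] = v
        if change < tol:                 # imputed values have stabilised
            break
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.0, 3.0, 4.1, None]
print(iterative_impute(x, y))   # last entry converges to ≈ 5.05
```

The per-iteration cost of the fit is what separates the techniques computationally, a point taken up in the timing discussion below.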
Before presenting the results for the multivariate t-distribution, one point is worth noting. With the exception of EM and perhaps MSM, the remaining techniques (DPC, GIP, and SVD) are based on nonparametric statistical methods, namely principal components and the singular value decomposition, and as such can be presumed free of distributional assumptions. However, this does not mean that DPC, GIP, and SVD are robust to structures in the data.
In Figs. 4 and 5 (the results for the t4 distribution), when v = 0.7, for any value
of n, the trend exhibited by the imputation techniques is similar in all respects to their normal counterparts (Figs. 1 and 2); but for other values of v (v < 0.7), there is a clear disparity in their trends, and the superiority of one imputation technique over another varies greatly according to the sample size (n), the interdependence index (v), and the proportion of missing values (k) in the data. In spite of this, EM, which depends on the normality assumption, runs neck-and-neck with the distribution-free techniques (DPC, GIP, and SVD). In particular, when n is sufficiently large (200, say) and the variables are strongly dependent (v < 0.3) with moderate dimensionality, p = 5 (see Fig. 5), EM outperforms the other imputation techniques. On the other hand, when p = 2 (see Fig. 4) and v = 0.3, for any value of n, GIP is the most efficient of the five imputation techniques; but when v < 0.3, the choice of the best imputation technique depends on k and n. However, when p = 5, n = 200, and v < 0.3 (see Fig. 5), the regression-like imputation techniques are far better than MSM. This remark still holds when v = 0.01, regardless of n.
Interestingly, a comparison between Figs. 1-2, on the one hand, and Figs. 4-5, on the other, indicates that there is insufficient evidence to discredit the use of EM when the data deviate markedly from normality, especially when p > 2 and reasonably moderate-to-high interdependence exists among the variables. This implicitly suggests that whatever is known to affect EM (for example, outliers; Little, 1988) may also affect the other imputation techniques. However, a cursory look at Figs. 1-2 and 4-5 shows that the discrepancy measure (3.1) for the imputation techniques under the multivariate t-distribution is almost always larger than for their normal counterparts, an indication of the effect of non-normality on the imputation techniques.
For the mean vector based on criterion (3.2) (results not shown), irrespective of p, n, v, and the underlying distribution of the data, the imputation techniques give mean vectors that are reasonably unbiased. This gratifying remark
is expected, since the mean vector of an imputed data matrix is estimated with
full efficiency.
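For mean substitution the full-efficiency remark can be seen directly: filling each missing entry with the observed column mean leaves the column mean unchanged, so the imputed-data mean equals the observed-data mean, which is unbiased under MCAR. A minimal demonstration:

```python
# Replacing missing entries by the observed column mean leaves the
# column mean unchanged, so the imputed-data mean coincides with the
# observed-data mean.

col = [2.0, 4.0, None, 6.0, None]
observed = [v for v in col if v is not None]
obs_mean = sum(observed) / len(observed)

filled = [obs_mean if v is None else v for v in col]
imputed_mean = sum(filled) / len(filled)

print(obs_mean, imputed_mean)   # both 4.0
```

The regression-based techniques behave similarly, since fitted values average out to the observed mean under the fitted model.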
Finally, a note on the computing time used by these imputation techniques. For the non-iterative techniques (MSM and DPC) no special computing time is required; they are very simple and straightforward. But for the iterative techniques (SVD, EM, and GIP) the amount of computational effort, as might be expected, increases rapidly as n, p, and k become large. Nevertheless, a modern computing machine such as the SUN 4/65 greatly eases this problem. Even so, the convergence rate of EM was observed to be the slowest, followed by SVD, especially for n = 200, p = 10, and k = 0.20. SVD's slow convergence can be explained by its mode of operation: if there are c missing values in a data matrix, 2 × c SVDs are required to fill the missing values in any one iteration. If m iterations are required before the convergence criterion is satisfied, then 2 × c × m SVDs will be needed. It follows that the convergence rate of the SVD method depends on the time it takes to perform one SVD. On the other hand, GIP simply needs the eigenvalues and associated eigenvectors of a covariance matrix, which are easy to obtain. To sum up, GIP is faster than the other two iterative imputation techniques, SVD and EM.
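The operation count just described is simple arithmetic: two SVDs per missing value per iteration. A trivial helper makes the scaling explicit:

```python
# SVD workload for the iterative SVD imputation scheme described
# above: 2 SVDs per missing value per iteration.

def svds_needed(c, m):
    """Total SVD computations for c missing values over m iterations."""
    return 2 * c * m

# e.g. 10 missing values and 25 iterations to convergence:
print(svds_needed(10, 25))   # → 500
```

The linear growth in both c and m is what makes the method expensive for n = 200, p = 10, and k = 0.20.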
5. A REAL DATA EXAMPLE
To demonstrate the effects of the five imputation techniques on an analysis of
real data, we considered the incomplete data set given by Simonoff (1988, p.
Two variables, the average annual rate of change of manufacturing gross domestic product (GDP), X, and the rate of change of total GDP, Y, measured on 50 nations, were involved. All 50 nations have a complete total GDP (Y), but the manufacturing GDP (X) was missing for six countries. Simonoff (1988) used his proposed diagnostic tests to show that the six missing values are missing completely at random (MCAR). Although Simonoff did not attempt to analyse the data, a simple regression analysis after imputing values for the countries with missing X may be of interest. Table I presents the imputed values (X*) arising from the five imputation techniques. The summary statistics from the regression analyses are also given below the table.
TABLE I
A comparison of imputed values from five imputation techniques based on the data in Simonoff (1988, p. 213).

        MSM    EM     DPC    GIP    SVD
 4.00   5.69   4.33   4.94   4.20   4.52
 3.90   5.69   4.17   4.86   4.03   4.38
 5.10   5.69   6.06   5.81   6.10   6.11
 4.40   5.69   4.96   5.26   4.89   5.10
 2.90   5.69   2.59   4.07   2.30   2.93
 6.10   5.69   7.64   6.60   7.82   7.56

TABLE II
A comparison of residuals for countries with imputed values.

Least squares residuals
 MSM     EM      DPC     GIP     SVD
-0.81   -0.13   -0.44   -0.07   -0.22
-0.91   -0.15   -0.49   -0.08   -0.25
-0.29    0.05    0.20    0.02    0.03
-0.41   -0.07   -0.21   -0.04   -0.13
-1.91   -0.31   -1.07   -0.16   -0.48
 1.29    0.21    0.78    0.11    0.26

It is interesting to note that, in spite of the small proportion of missing values involved (6%), differences do show up among the imputation techniques. In particular, the values of both the coefficient of multiple determination and the residual mean square error show clearly that GIP and EM compete favourably, followed by SVD and DPC.
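The residuals of Table II come from refitting the least-squares line of Y on X after imputation and reading off the residuals for the imputed cases. A sketch with illustrative numbers (not the Simonoff data):

```python
# After imputing X for the incomplete cases, refit the least-squares
# line of Y on X and record the residuals y_i - (b0 + b1*x_i) for the
# imputed cases.  The numbers below are made up for illustration.

def least_squares(x, y):
    """Intercept b0 and slope b1 of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0, 2.5]   # last value imagined as imputed
y = [1.2, 1.9, 3.1, 4.0, 2.6]
b0, b1 = least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(residuals[-1])             # residual for the imputed case
```

Small residuals for the imputed cases, as GIP and EM achieve in Table II, indicate imputed values consistent with the fitted relationship.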
Although both the slope (b1) and the intercept (b0) give some rough idea of the relative performance of the imputation techniques, a general inference is difficult to make here since the true values are not known. Nevertheless, a comparison of the residuals for countries with imputed values (Table II) throws some light on the effects of the imputation techniques.
The overall residual plots were examined, though not shown, and reveal that one point appears to be an outlier. It was a mild case and not pursued in this study. It is quite remarkable that none of the imputation techniques seems to do badly. Notwithstanding that, serious outliers have been noted to affect EM, and a robust approach, which only requires assigning weights to the parameter estimates in the M-step, was suggested by Little (1988).
6. SUMMARY AND SUGGESTION
Above, a model for generating random covariance matrices was constructed and used extensively in a simulation study comparing the relative merits of five imputation techniques across several factors.
It was evident in the Monte Carlo study that the sample size, the dimensionality, the interdependence among the variables, the proportion of missing values and, of course, the underlying distribution of the data all affect the performance of any imputation technique, especially when the covariance matrix is estimated. Two distributions were used: one is the ideal case, the multivariate normal; the other is the multivariate t with four degrees of freedom, representing a non-ideal situation. It emerged that, in the ideal situation, when there are more than two variables and the interdependence among the variables is moderate-to-high with a reasonable sample size (> 50, say), the iterative imputation techniques SVD, EM, and GIP competed favourably. This remark was true for the non-ideal case too. However, for the estimation of the mean vector, the results indicated little or no evidence of the superiority of one imputation technique over another; they are all virtually equivalent.
Although no single imputation technique emerged as the overall best in all the combinations of factors studied, so that any general recommendation would be misleading, a straightforward approach is to try all the iterative imputation techniques (EM, GIP, and SVD) and compare the results of the ensuing statistical analysis, since in many modern computing environments this can be accomplished within a few minutes.
ACKNOWLEDGEMENTS
The author is greatly indebted to an anonymous associate editor for many valuable suggestions, ideas for improvement, and corrections. The helpful comments of Drs. F. H. C. Marriott and J. Qureshi are also gratefully acknowledged.
BIBLIOGRAPHY
Afifi, A. A. and Elashoff, R. M. (1966). Missing observations in multivariate statistics. I. Review of the literature. J. Amer. Statist. Assoc., 61, 595-604.
Beale, E. M. L. and Little, R. J. A. (1975). Missing values in multivariate statistical analysis. J. R. Statist. Soc., B37, 129-146.
Bendel, R. B. (1978). Population correlation matrices for sampling experiments. Comm. Statist., B7(2), 163-182.
Berger, J. (1980). Statistical decision theory: foundations, concepts and methods. Springer-Verlag, New York.
Boyles, R. A. (1983). On the convergence of EM algorithm. J. R. Statist. Soc., B45, 47-50.
Bunch, J. R. and Nielsen, C. P. (1978). Updating the singular value decomposition. Numerische Mathematik, 31, 111-129.
Bunch, J. R., Nielsen, C. P. and Sorensen, D. C. (1978). Rank one modification of the symmetric eigenproblem. Numerische Mathematik, 31, 31-48.
Bryce, G. R. and Maynes, D. D. (1979). Generation of multivariate data sets. Report SD-015-R, Brigham Young University Statistics Department Report Series, Provo, Utah 84602.
Chan, L. S. and Dunn, O. J. (1972). The treatment of missing values in discriminant analysis I. The sampling experiment. J. Amer. Statist. Assoc., 67, 473-477.
Dear, R. E. (1959). A principal-component missing data method for multiple regression models. Report SP-86. System Development Corporation. Santa Monica, CA.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc., B39, 1-38.
Devlin, S. J., Gnanadesikan, R. and Kettenring, J. R. (1975). Robust estimation and outlier detection with correlation coefficients. Biometrika, 62, 531-545.
Devlin, S. J., Gnanadesikan, R. and Kettenring, J. R. (1976). Some multivariate applications of elliptical distributions. In Essays in Probability and Statistics (S. Ikeda, Ed.). Shinko Tsusho, Tokyo, 365-393.
Frane, J. W. (1978). Missing data and BMDP: some pragmatic approaches. In Proceedings of the Statistical Computing Section, American Statistical Association, Washington, DC, 27-33.
Greenlees, J. S., Reece, W. S. and Zieschang, K. D. (1982). Imputation of missing values when the probability of response depends on the variable being imputed. J. Amer. Statist. Assoc., 77, 251-261.
Heiberger, R. M. (1978). Generation of random orthogonal matrices. Algorithm AS 127. Appl. Statist., 27, 199-206.
Huseby, J. R., Schwertman, N. C. and Allen, D. M. (1980). Computation of the mean vector and dispersion matrix for incomplete multivariate data. Comm. Statist., B3, 301-309.
Kim, J. 0. and Curry, J. (1977). The treatment of missing data in multivariate analysis. Sociological Methods and Research, 6, 215-240.
Knol, D. L. and ten Berge, J. M. F. (1989). Least-squares approximation of an improper correlation matrix by a proper one. Psychometrika, 54, 53-61.
Krzanowski, W. J. (1987). Cross-validation in principal component analysis. Biometrics, 43, 575-584.
Krzanowski, W. J. (1988). Missing value imputation in multivariate data using the singular value decomposition of a matrix. Biometrical Letters, 25, 31-39.
Li, K. H., Raghunathan, T. E. and Rubin, D. B. (1991). Large sample significance levels from multiply imputed data using moment-based statistics and an F reference distribution. J. Amer. Statist. Assoc., 86, 1065-1073.
Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York.
Little, R. J. A. (1988). Robust estimation of the mean and covariance matrix from data with missing values. Appl. Statist., 37, 23-38.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. J. R. Statist. Soc., B44, 226-233.
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic Press Inc., London.
Meng, X. L. and Rubin, D. B. (1991). Recent extensions to the EM algorithm. In Fourth Valencia International Meeting on Bayesian Statistics, at Peniscola, Spain.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581-592.
Rubin, D. B. (1978). Multiple imputations in sample surveys-a phenomenological Bayesian approach to nonresponse. In Proceedings of the Survey Research Methods Section of the American Statistical Association, 20-34.
Rubin, D. B. (1991). Multiple Imputation for Nonresponse in Surveys. Wiley, New York.
Simonoff, J. S. (1988). Regression diagnostics to detect nonrandom missingness in linear regression. Technometrics, 30, 205-214.
Tanner, M. A. and Thisted, R. A. (1982). A remark on AS 127. Generation of random orthogonal matrices. Appl. Statist., 31, 190-192.
Wilks, S. S. (1932). Moments and distribution of estimates of population parameters from fragmentary samples. Annals of Mathematical Statistics, 3, 163-195.
Wu, C. F. (1983). On the convergence properties of the EM algorithm. Annals of Statistics, 11, 95-103.
Received December 1991; Revised November 1992