
COMMUN. STATIST.-THEORY METH., 22(3), 853-877 (1993)

CHOOSING AMONG IMPUTATION TECHNIQUES FOR INCOMPLETE MULTIVARIATE DATA:

A SIMULATION STUDY

ABDUL LATEEF BELLO

Department of Statistics, University of Oxford, Oxford, OX1 3TG, U.K.

Key Words and Phrases: Imputation techniques; imputed data matrix; EM algorithm; principal component analysis; singular value decomposition.

ABSTRACT

A wide variety of strategies for coping with the problem of missing values, which frequently arises in multivariate data, have been proposed and tried over the years. One popular and important strategy is to estimate the missing values themselves in some way, usually achieved by imputation techniques. By means of Monte Carlo simulations, this paper investigates the relative performance of five deterministic imputation techniques using normal and non-normal data with several factors that may affect their efficiency. The imputation techniques are: the mean substitution method (MSM), the EM algorithm (EM), Dear's principal component method (DPC), the general iterative principal component method (GIP) and the singular value decomposition method (SVD). GIP is a refined, iterative version of DPC, developed to overcome certain problems with the latter. Although the results indicate that no single imputation technique is best overall in all combinations of factors studied, MSM and DPC behave erratically; when the intercorrelation among the variables is moderate or high, they perform worse than the iterative imputation techniques (EM, SVD, and GIP), which, under this condition, are equally efficient. An illustrative real data example is given.

Copyright © 1993 by Marcel Dekker, Inc.


1. INTRODUCTION

In large scale collection of data, a complete observation vector on all individuals may often be impossible for a variety of reasons. Examples of incomplete data abound in virtually every domain of scientific inquiry: excavated material may, on account of decay or erosion, lack key variables; respondents may skip or forget to answer certain questions; medical records may not show the required clinical measurements they had been expected to show for some patients; and so on. When summary statistics (means, variances and covariances, which are in the background of many standard multivariate analyses) are to be calculated from these data, the method of doing so must be resolved. There are three quick, naive options: either (i) use only the individuals (or cases) for which all variables are present (case-wise-deletion method); or (ii) use only the variables for which all individuals have data values (variable-wise method); or (iii) obtain the mean and variance of a variable from all the data available for that variable, and obtain the covariance between any two variables, similarly, from the maximum data available for both those variables (all-available-data method). Also, in very special circumstances, options (i) and (ii) may be combined to produce a complete data set, depending on the pattern and proportion of missing values in the data. Options (i)-(iii) shall hereafter be called the deletion-pairwise strategy.

Options (i) and (ii) potentially sacrifice information by deleting any individual or variable just because a random datum is missing. In particular, with option (i), the consequent reduction in sample size may affect statistical power and efficiency; also, a poor covariance matrix estimate may arise, especially if the proportion of cases with complete variables is small. These problems may, however, be less serious if missing values are scattered across just a few cases. But at one extreme, option (i) may break down altogether when, for instance, every case has at least one missing variable.

In option (iii), the elements of a covariance (or correlation) matrix will be based on different sample sizes and the covariance matrix may be inconsistent. In fact, the covariance matrix may turn out to be non-positive definite (NPD). By definition, an NPD matrix is one that has negative eigenvalue(s); in other words, it yields negative variance estimates for some linear combinations of the variables, and it may have correlation coefficients outside the admissible [-1, +1] range. While detecting an NPD matrix is quite straightforward, tackling the problem is, unfortunately, not at all easy. (See Devlin et al., 1975; Frane, 1978; Huseby et al., 1980; and Knol and ten Berge, 1989, for some useful approaches.)
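To make the NPD check concrete, here is a minimal NumPy sketch (purely illustrative; the paper's own computations used FORTRAN 77 with NAG routines, and the function names here are ours). It forms an all-available-data covariance matrix and flags negative eigenvalues:

```python
import numpy as np

def pairwise_cov(X):
    """All-available-data (pairwise) covariance: each (j, k) entry uses only
    the rows where both variables are observed.  Missing values are np.nan.
    Assumes every pair of variables is jointly observed at least once."""
    n, p = X.shape
    S = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            ok = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            xj, xk = X[ok, j], X[ok, k]
            S[j, k] = np.mean((xj - xj.mean()) * (xk - xk.mean()))
    return S

def is_npd(S, tol=1e-10):
    """True if the symmetric matrix S has a negative eigenvalue,
    i.e. it is non-positive definite in the sense used in the text."""
    return np.linalg.eigvalsh(S).min() < -tol
```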


Because of a number of pitfalls associated with the deletion-pairwise strategy, a time-honoured alternative strategy, of considerable ingenuity, is imputation. This strategy involves estimating the missing values themselves in some way, usually achieved by imputation techniques. With few exceptions, imputation techniques are difficult to use; many of them make heavy demands on computing time and have only become manageable with recent advances in computing technology. Indeed, modern computing environments have certainly enhanced interest and research in imputation techniques. However, while these techniques are far superior to any deletion-pairwise methods (see Afifi and Elashoff, 1966; Chan and Dunn, 1973; Beale and Little, 1975; Kim and Curry, 1977; and Little, 1988), no study so far has investigated their relative performance on their own merit.

The obvious benefits of imputation techniques are that they create complete data that retain as many of the available data as possible, allow straightforward use of standard complete-data methods, and may give consistent results and sometimes reduce the bias of parameter estimates. However, these benefits depend solely on the imputation technique being able to preserve the relationships among variables as far as possible, and on the missing-values mechanism. If the missing-values mechanism is ignorable, i.e., missing at random (MAR) or missing completely at random (MCAR) in the sense defined by Rubin (1976), and the imputation technique is good at predicting the missing values, the error due to bias will be small. On the other hand, if the missing-values mechanism is not ignorable (that is, if there is a systematic difference between cases with complete variables and cases with partially observed variables), imputation will not be able to reduce bias unless a suitable model is proposed for the missing-values mechanism. This approach is advocated by Greenlees, Reece, and Zieschang (1982). Although the imputation strategy overcomes the inherent problems associated with the deletion-pairwise strategy, its main problem is the distortion of relationships among variables which may arise (since imputed values are mere approximations to unknown missing values).

There is a wide variety of imputation techniques, which may be categorized as stochastic or deterministic. The stochastic imputations use a randomization process to obtain imputed values, and their refinements are essentially multiple imputations, fully described by Rubin (1991). In this paper, attention is restricted to members of the deterministic imputations for the obvious

attention is restricted to members of deterministic imputations for the obvious

Dow

nloa

ded

by [

Tul

ane

Uni

vers

ity]

at 0

6:12

03

Sept

embe

r 20

14

Page 6: Choosing among imputation techniques for incomplete multivariate data: a simulation study

reason that their imputed values are uniquely determined: when the imputation is repeated anywhere for the same data, the results are always consistent; this is not necessarily true for stochastic imputations.

The five deterministic imputation techniques studied are: the mean substitution method (MSM), the EM algorithm (EM), Dear's principal component method (DPC), the general iterative principal component method (GIP) and the singular value decomposition method (SVD). GIP is essentially an iterative version of DPC with certain refinements introduced to overcome some of the problems associated with DPC. Choosing among these imputation techniques to obtain reasonable imputed values for missing variables (and therefore reliable parameter estimates) may be a problem. The aim of this paper, therefore, is to offer a thorough investigation of the effects of an imputed data matrix, produced by any of the five imputation techniques, on the estimation of the population parameters: the mean vector and covariance matrix. The investigation is largely by Monte Carlo simulations, using a root-mean-square deviation of the estimated parameters from their true values as a benchmark to judge the performance of any particular imputation technique over repeated sampling.

The plan of this paper is as follows. Section 2 describes the imputation techniques, and sections 3 and 4 present, respectively, the Monte Carlo design and results. Section 5 provides an illustrative real data example. A summary and suggestion are given in section 6.

2. DESCRIPTION OF THE IMPUTATION TECHNIQUES

Unless otherwise stated, let X be an (n x p) incomplete data matrix with elements x_{kj} (k = 1, ..., n; j = 1, ..., p) denoting observed values, where n > p. We shall assume that the missing values in X are MAR as defined by Rubin (1976).

Mean substitution method (MSM). This is the simplest and perhaps the oldest imputation method. Its being the oldest stems from the fact that it was the first to appear in the statistical literature, presumably suggested by Wilks (1932). Ever since then, several authors have considered it in their work, but unfortunately it has never received the full and proper statistical consideration that it deserves. The basic idea involves replacing missing values on a particular variable by the mean of the available data on that variable. This method ignores


the intercorrelation that often exists among multiple variables in obtaining imputed values, which are nonrandom in nature. Consequently, variances and covariances are systematically underestimated, a natural effect of imputing values at the center of the distribution. To remedy this, two courses of action are possible, namely:

M1: Adjust the degrees of freedom used in calculating the statistics as follows: let the (j,k)th element of the adjusted covariance matrix S be denoted by

    s_{jk} = \sum_{r=1}^{n} (x_{rj} - \bar{x}_j)(x_{rk} - \bar{x}_k) / (n - c - 1),   j, k = 1, ..., p,

where, for the diagonal elements s_{jj}, c is the number of imputed values on the jth variable; for the off-diagonal elements s_{jk}, c is the number of cases with x_j, x_k, or both imputed, that is, the number of terms that are zero because of imputation; and \bar{x}_j is the mean of the jth variable.

M2: Add a small perturbation to each imputed value to overcome the underestimation phenomenon. Replace a missing value, x_{rj}, for instance, with \bar{x}_j + e_j, where e_j is a random quantity having zero mean and variance equal to the variance of the jth variable (Krzanowski, 1988, p. 33). An assumption will always be required for the distribution of the disturbance e_j, most likely normality; but any assumption like this may be unjustifiable, mainly because it cannot be verified in practice. Any imputed value from this approach is not deterministic; that is, it cannot be uniquely determined, and different imputed values are possible for any missing value. This approach is perhaps particularly useful in the multiple imputation strategy, which involves replacing each missing value with two or more acceptable values to represent a distribution of possibilities, originally proposed by Rubin (1978) and now detailed in Rubin (1991) and Li, Raghunathan, and Rubin (1991).

Throughout this paper, we shall be concerned with M1 whenever MSM is mentioned. This is because the other imputation techniques described in the sequel are deterministic in methodology.
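For concreteness, a minimal NumPy sketch of MSM with the M1 adjustment follows (illustrative only; function names are ours, and the paper's computations were in FORTRAN 77):

```python
import numpy as np

def msm_impute(X):
    """Mean substitution: replace each missing entry (np.nan) by the
    column mean of the available data."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    X_imp = np.where(miss, np.nanmean(X, axis=0), X)
    return X_imp, miss

def msm_adjusted_cov(X):
    """Covariance from the MSM-imputed matrix with the M1 adjustment:
    the divisor for s_jk is n - c - 1, where c counts the rows with
    x_j, x_k, or both imputed (the terms made zero by imputation)."""
    X_imp, miss = msm_impute(X)
    n, p = X_imp.shape
    xbar = X_imp.mean(axis=0)          # equals the available-data means
    D = X_imp - xbar
    S = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            c = np.sum(miss[:, j] | miss[:, k])
            S[j, k] = D[:, j] @ D[:, k] / (n - c - 1)
    return S
```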

EM algorithm (EM). When dealing with data involving multiple variables, it is not common to have the variables independent of one another. Therefore,


if missing values occur on a particular variable, reasonable estimates can be confidently expected if, for that variable, data on the other observed variables are used to predict the missing part. This basic concept underlies a number of imputation techniques, prominent amongst which is the so-called EM algorithm. This algorithm is most versatile and consists of two steps: expectation (E-step) and maximization (M-step). Essentially, the E-step is a regression method, and, in general, the whole concept of EM is based on an iterative scheme which has only become manageable due primarily to advances in computer technology. A full exposition of EM is given by Dempster et al. (1977) and Little and Rubin (1987). We give a brief account of the key ideas as follows: let x_i = (x_{obs,i}, x_{mis,i}) be a realization of the continuous variables X_j (j = 1, ..., p) for the ith case, where x_{obs,i} denotes the set of variables observed in case i and x_{mis,i} denotes the missing variables. Suppose \theta^t = (\mu^t, \Sigma^t) is the maximum likelihood estimate of \theta = (\mu, \Sigma) at iteration t. The E-step calculates the expected values of the complete-data sufficient statistics given the observed data and the current estimate \theta^t:

    x_{ij}^t = x_{ij}                               if x_{ij} is observed,
             = E(x_{ij} | x_{obs,i}, \theta^t)      if x_{ij} is missing,

and

    c_{ijk}^t = 0                                                  if x_{ij} or x_{ik} is observed,
              = cov(x_{ij}, x_{ik} | x_{obs,i}, \theta^t)          if x_{ij} and x_{ik} are both missing.

M-step: compute new parameter estimates \theta^{t+1} = (\mu^{t+1}, \Sigma^{t+1}), where

    \mu_j^{t+1} = \frac{1}{n} \sum_{i=1}^{n} x_{ij}^t   and
    \sigma_{jk}^{t+1} = \frac{1}{n} \sum_{i=1}^{n} [ (x_{ij}^t - \mu_j^{t+1})(x_{ik}^t - \mu_k^{t+1}) + c_{ijk}^t ].


The algorithm then proceeds in an iterative manner through the E-step and M-step until the difference between \theta^t and \theta^{t+1} meets a specified convergence criterion. The final E-step supplies the imputed values that are used in subsequent analyses. Computationally, considerable time can be saved if all individuals with a similar pattern of missing variables are first grouped together and the initial parameter estimate \theta^0 is obtained (where possible) from only the cases with complete variables. Otherwise, MSM may be used to supply imputed values before estimating \theta^0. Although efforts to speed EM up have been made (see Louis, 1982), it turns out, rather surprisingly, that most of the modifications destroy its simplicity or stability (Meng and Rubin, 1991). Moreover, the availability of modern computing machines and the use of a sweep operator (Little and Rubin, 1987, chap. 6) have rendered the modifications inconsequential. In particular, convergence properties associated with EM were studied by Boyles (1983) and Wu (1983), and EM's robustness in the presence of outliers was discussed in Little (1988).
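The following NumPy sketch illustrates the E- and M-steps just described for multivariate normal data, with starting values from mean substitution. It is an illustrative reimplementation of the standard algorithm (not the author's FORTRAN code), and the tolerance and iteration cap are arbitrary choices:

```python
import numpy as np

def em_impute(X, max_iter=100, tol=1e-6):
    """EM imputation for multivariate normal data with values MAR.
    X: (n, p) array with np.nan for missing entries.
    Returns the final E-step imputations and the estimates (mu, Sigma)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)

    # Initial estimates from mean-substituted data, as suggested in the text.
    X_imp = np.where(miss, np.nanmean(X, axis=0), X)
    mu = X_imp.mean(axis=0)
    Sigma = np.cov(X_imp, rowvar=False, bias=True)

    for _ in range(max_iter):
        # E-step: conditional means for missing entries, plus the accumulated
        # conditional covariance correction C (the c_ijk^t terms).
        C = np.zeros((p, p))
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            Soo = Sigma[np.ix_(o, o)]
            Smo = Sigma[np.ix_(m, o)]
            Smm = Sigma[np.ix_(m, m)]
            B = Smo @ np.linalg.inv(Soo)            # regression coefficients
            X_imp[i, m] = mu[m] + B @ (X_imp[i, o] - mu[o])
            C[np.ix_(m, m)] += Smm - B @ Smo.T      # cov of missing given observed

        # M-step: complete-data ML estimates from expected sufficient statistics.
        mu_new = X_imp.mean(axis=0)
        D = X_imp - mu_new
        Sigma_new = (D.T @ D + C) / n

        done = (np.max(np.abs(Sigma_new - Sigma)) < tol and
                np.max(np.abs(mu_new - mu)) < tol)
        mu, Sigma = mu_new, Sigma_new
        if done:
            break

    return X_imp, mu, Sigma
```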

Dear's principal component method (DPC). One desirable property of principal component analysis (PCA) is that it does not require any distributional assumption for its use. This, perhaps, explains why it is a much used statistical technique. Dear (1959) exploits the idea of PCA as follows:

D1: Define an (n x p) missingness-indicator matrix, R = (r_{ij}), where r_{ij} = 0 or 1 according to whether x_{ij} is missing or observed. Treat all variables on an equal footing by first standardizing X to Z, where z_{ik} = (x_{ik} - \bar{x}_k)/\sqrt{s_{kk}}, and \bar{x}_k and s_{kk} are, respectively, the mean and variance of the available data on the kth variable. Now, use the case-wise-deletion method to obtain the correlation matrix, S.

D2: Calculate the largest eigenvalue of S, \lambda_1 = max_j(\lambda_j), and its associated eigenvector q_{1k} (k = 1, ..., p).

D3: Let the first principal component score for the ith case be

    y_{i1} = \sum_{k=1}^{p} r_{ik} q_{1k} z_{ik},

so that the point on the first principal component line that is closest to the ith case replaces the missing variables thus:

    \hat{z}_{ij} = q_{1j} y_{i1}   for each missing x_{ij}.        (2.1)


Repeat D3 for all cases with missing variables and then de-standardise Z to X.
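A minimal NumPy sketch of a single DPC pass is given below. It uses mean-substituted starting values so that a correlation matrix can always be formed (the modified DPC adopted later in this paper); the M1 adjustment of that correlation matrix is omitted for brevity, and the imputation step uses (2.1) as reconstructed above:

```python
import numpy as np

def dpc_impute(X):
    """Dear's principal component imputation (single pass), steps D1-D3.
    Missing entries in X are np.nan; missing values start at column means."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    xbar = np.nanmean(X, axis=0)
    s = np.nanstd(X, axis=0)                     # available-data std deviations

    # D1: standardize, with missing z-scores starting at zero (mean substitution).
    Z = (np.where(miss, xbar, X) - xbar) / s
    S = np.corrcoef(Z, rowvar=False)

    # D2: eigenvector associated with the largest eigenvalue of S.
    q1 = np.linalg.eigh(S)[1][:, -1]

    # D3: first principal component score from the observed variables only,
    # then replace each missing z_ij by q1_j * y_i1.
    for i in range(Z.shape[0]):
        m = miss[i]
        if m.any():
            y1 = np.sum(q1[~m] * Z[i, ~m])
            Z[i, m] = q1[m] * y1

    # De-standardize back to the original scale.
    return Z * s + xbar
```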

General iterative principal component method (GIP). It is not difficult to see that DPC may fail if, for instance, all cases have partially observed variables, or if cases with a complete observation vector are relatively few. The latter will result in a poor estimate of S (the correlation matrix) and the former will make this estimate impossible. To get over these problems and make DPC a general-purpose method, refinements are introduced as follows (a sketch of the resulting iteration is given after the list):

G1(a): Use the all-available-data method to calculate S. Modify S, if it is non-positive definite, with the algorithm provided by Huseby et al. (1980).

G1(b): Alternatively, use MSM imputed values for the missing data and then calculate S (with adjustment as discussed in step M1 above).

G2: Now, construct the first principal component from the smoothed (or unsmoothed) S and estimate the missing values with (2.1).

G3: Recalculate S from the imputed data matrix and repeat G2.

G4: Cycle iteratively through G2 and G3 until successive imputed values do not change materially, that is, until they satisfy a convergence criterion.
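An illustrative NumPy sketch of the GIP iteration (starting from option G1(b); the M1 adjustment and the Huseby et al. smoothing of an NPD matrix are omitted, and the tolerance is an arbitrary choice of ours):

```python
import numpy as np

def gip_impute(X, max_iter=50, tol=1e-6):
    """General iterative principal component imputation (steps G1(b)-G4):
    start from mean-substituted values, then repeatedly re-estimate the
    correlation matrix, recompute the first principal component, and
    re-impute until the imputed values stop changing."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    if not miss.any():
        return X.copy()
    X_cur = np.where(miss, np.nanmean(X, axis=0), X)    # G1(b)

    for _ in range(max_iter):
        xbar, s = X_cur.mean(axis=0), X_cur.std(axis=0)
        Z = (X_cur - xbar) / s
        S = np.corrcoef(Z, rowvar=False)                 # G2/G3: (re)calculate S
        q1 = np.linalg.eigh(S)[1][:, -1]                 # first principal component
        Z_new = Z.copy()
        for i in range(Z.shape[0]):
            m = miss[i]
            if m.any():
                y1 = np.sum(q1[~m] * Z[i, ~m])
                Z_new[i, m] = q1[m] * y1                 # impute via (2.1)
        X_new = Z_new * s + xbar
        if np.max(np.abs(X_new[miss] - X_cur[miss])) < tol:   # G4
            return X_new
        X_cur = X_new
    return X_cur
```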

Singular value decomposition method (SVD). Krzanowski (1988) suggests using the SVD in a remarkably simple way to impute values for missing data. The method is easy to compute, especially using the algorithms of Bunch and Nielsen (1978) and Bunch, Nielsen, and Sorensen (1978). To reinforce its wide-ranging applications, Krzanowski (1988) noted that if an (n x p) matrix is such that p > n, we can transpose the matrix so that the roles of p and n are interchanged. To fix ideas, for one missing element, x_{ij}, in X, the steps involved are:

S1: Omit the ith case (or row) from X and calculate the SVD of the remaining ((n - 1) x p) data matrix, denoted by

    X_{-i} = U D V'   with U = (u_{st}), V = (v_{st}) and D = diag(d_1, ..., d_p).


S2: Omit the jth variable (or column) from X and obtain the SVD of the remaining (n x (p - 1)) data matrix, denoted by

    X_{-j} = \tilde{U} \tilde{D} \tilde{V}'   with \tilde{U} = (\tilde{u}_{st}), \tilde{V} = (\tilde{v}_{st}) and \tilde{D} = diag(\tilde{d}_1, ..., \tilde{d}_{p-1}),

where U, V, \tilde{U} and \tilde{V} are orthonormal matrices (i.e., U'U, V'V, \tilde{U}'\tilde{U} and \tilde{V}'\tilde{V} are identity matrices) and D and \tilde{D} are diagonal matrices.

S3: Now, combine the two SVDs of X_{-i} and X_{-j} to get the imputed value (cf. Krzanowski, 1988)

    \hat{x}_{ij} = \sum_{t=1}^{p-1} \tilde{u}_{it} \sqrt{\tilde{d}_t} \sqrt{d_t} v_{jt}.        (2.2)

For more than one missing value, an iterative scheme is involved and can be conducted as follows: start with any initial imputed values, but preferably use the MSM technique. Update each initial imputed value in turn using (2.2). The process is then iterated until stability is achieved in the imputed values. It should be noted that (2.2) was first suggested by Krzanowski (1987) as a basis for determining the dimensionality of a set of multivariate data for PCA.

The workability of DPC depends on being able to obtain a correlation matrix from cases with complete variables, as in D1. Since the pattern of missing values is unpredictable, it is important, therefore, to ensure that the five imputation techniques are usable for any incomplete data situation. To this end, a modified DPC which uses MSM imputed values as initial estimates of the missing values is adopted in this study, and, as a result, the correlation matrix arising from the MSM imputed data matrix is adjusted accordingly, as discussed in M1. In this respect, GIP uses option G1(b), and its estimated covariance matrix never suffers from the non-positive definite (NPD) problem.

3. MONTE CARLO DESIGN

This section describes a Monte Carlo study designed to investigate the performance of the five imputation techniques on continuous multivariate incomplete data matrices. The usefulness of a Monte Carlo study is that the true parameters, and the distributions from which replicated samples are drawn, are


usually known, enabling the investigator to compare the estimated parameters with the known population parameters. For our Monte Carlo simulations, we took data from multinormal and non-multinormal distributions. For the non-multinormal case, we considered the t-distribution with four degrees of freedom, as advocated by Devlin et al. (1976), as a useful model for a robustness study.

A random covariance matrix, C = V \Lambda V^T, was first generated, as described by Bryce and Maynes (1979), having the predetermined geometric eigenvalue structure \lambda_i = w v^{i-1} + 0.1 (i = 1, ..., p), as used by Bendel (1978), where w is determined by the constraint c = \sum_{i=1}^{p} \lambda_i = tr(C). Let \Lambda = diag(\lambda_1, ..., \lambda_p). A random orthogonal matrix V was then generated using the algorithm given by Heiberger (1978), with the correction given by Tanner and Thisted (1982). By pre-multiplying \Lambda by V and post-multiplying it by V^T we obtained C = V \Lambda V^T. [Note that the eigenvalues do not change when \Lambda is pre- and post-multiplied by an orthogonal matrix in this way.] Therefore, the generated covariance matrix C = V \Lambda V^T has the predetermined eigenvalues \lambda_1, ..., \lambda_p. As may be seen, when v = 1, all eigenvalues are equal to c/p, and this implies that the variables are independent. On the other hand, for v = 0, the variables are strongly dependent. Evidently, values of v represent a continuum such that the interdependence among the variables increases as v decreases from 1 to zero. It is instructive to note that if we write c = p + \delta and take \delta = 0, C will be a correlation matrix of the type considered by Bendel (1978). In this study, in order to simplify the Monte Carlo design, we used \delta = 10 and a null mean vector (\mu = 0).
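A NumPy sketch of this construction follows. The paper obtains the random orthogonal matrix with Heiberger's AS 127 algorithm through NAG; the sketch below substitutes a QR decomposition of a Gaussian matrix, which also yields a Haar-distributed random orthogonal matrix:

```python
import numpy as np

def random_covariance(p, v, c, rng=None):
    """Random covariance matrix C = V A V' with geometric eigenvalue structure
    lambda_i = w * v**(i-1) + 0.1, where w is chosen so that sum(lambda_i) = c.
    In the study, c = p + 10 and the mean vector is null."""
    rng = np.random.default_rng(rng)
    if v == 1.0:
        w = (c - 0.1 * p) / p                       # all eigenvalues equal c/p
    else:
        w = (c - 0.1 * p) * (1 - v) / (1 - v**p)    # solves sum_i (w v^(i-1) + 0.1) = c
    lam = w * v**np.arange(p) + 0.1
    Q, R = np.linalg.qr(rng.standard_normal((p, p)))
    Q = Q * np.sign(np.diag(R))                     # fix signs for a Haar-uniform Q
    return Q @ np.diag(lam) @ Q.T, lam
```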

The multinormal data were generated using the NAG subroutines G05CBF, G05EAF and G05EZF after specifying \mu and C (as above). It is important to note that the G05EAF and G05EZF routines are specifically designed to generate multivariate normal data. The data matrix for the robust model (t_4), however, was obtained by transforming multinormal data as follows: let Y ~ N_p(\mu, C); we drew independent random values w_j from a chi-squared distribution on four degrees of freedom using the NAG subroutine G05DHF; then the transformation x_j = y_j / (w_j/4)^{1/2} was performed (see Mardia, Kent, and Bibby, 1979, p. 43) for j = 1, ..., n, so that X ~ t_p(4, \mu, C).

In anticipation of the possible effects of some factors, such as the sample size (n), the dimensionality (p), the interdependence among the variables (v), and the


proportion of missing values (k), on the performance of the five imputation techniques, we used the following levels for these factors. [The table of factor levels is not reproduced in this transcription.]

The number of Monte Carlo simulations for each combination of n, p, v and k was fixed at 100. In particular, both c and v were held constant in each simulation, but the random orthogonal matrices varied from one simulation to another. Thus, a different covariance matrix with a fixed eigenvalue structure was used in each simulation, so that the covariance structure is made very general.

Now, from any particular distribution (multinormal or t_4), 100 (n x p) data matrices were drawn with different random seeds obtained using the NAG routines G05CCF and G05DYF. Each of these matrices was made incomplete by introducing missing values at random. To delete a proportion k (0 < k < 1) of the p-variate data matrix, we followed the approach of Krzanowski (1988): pseudo-random numbers lying between 0 and 1 were generated using the NAG routines G05CCF and G05CAF, and if the (pi + j)th random number was less than k, then the element in the (i + 1, j)th position of the data matrix was deleted (i = 0, ..., n - 1; j = 1, ..., p). In this way, the expected proportion of missing values in the data matrix is k. The five imputation techniques were then applied one after the other to the incomplete data to obtain imputed values for the missing values, thereby creating five different complete data sets for each incomplete data matrix. Estimates of the mean vector and covariance matrix were then made from each imputed data matrix.
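The data generation and deletion steps can be sketched as follows (NumPy substitutes for the NAG routines named above; the t_4 case uses the Mardia-Kent-Bibby transformation quoted earlier):

```python
import numpy as np

def simulate_data(n, p, Sigma, dist="normal", df=4, rng=None):
    """Draw an (n, p) sample with null mean vector: multivariate normal, or
    multivariate t_df obtained as x_j = y_j / sqrt(w_j / df) with
    y_j ~ N_p(0, Sigma) and w_j ~ chi-square(df)."""
    rng = np.random.default_rng(rng)
    Y = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    if dist == "normal":
        return Y
    w = rng.chisquare(df, size=n)
    return Y / np.sqrt(w / df)[:, None]

def delete_at_random(X, k, rng=None):
    """Make the matrix incomplete: each entry is deleted (set to np.nan)
    independently with probability k, so the expected proportion of
    missing values is k (Krzanowski, 1988)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float).copy()
    X[rng.random(X.shape) < k] = np.nan
    return X
```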

Let \hat{C}_{ut} and \hat{\mu}_{ut} denote, respectively, the estimated covariance matrix and mean vector from the tth imputed data matrix (t = MSM, EM, DPC, GIP or SVD) at the uth simulation. Several comparison criteria are possible for assessing the closeness of \hat{C}_{ut} to the population covariance matrix C_u, but the one we adopted involves the Euclidean norm. Thus the quantity

    [displayed equation (3.1) not recovered in this transcription]

may be viewed as a root-mean-square deviation of the estimated covariance matrix of the tth imputed data matrix from the true population covariance


matrix, where 'tr' means the trace of the matrix in the curly brackets. It is noteworthy that C_u for the t-distribution is (f/(f - 2))C (Berger, 1980), and we have used this in the study, where f is the degrees of freedom and C is the covariance matrix of the underlying multinormal.

In looking for a single summary statistic as a criterion for comparing the estimated mean vector, \hat{\mu}_{ut}, we adopted

    [displayed equation (3.2) not recovered in this transcription]

and, since the population mean vector is null, Q_u(t) is also a measure of discrepancy.
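Since the displays of (3.1) and (3.2) were not recovered, the sketch below uses one plausible reading consistent with the surrounding description (a root-mean-square of the element-wise deviations, written through the trace for the covariance matrix); the exact normalisation in the original may differ:

```python
import numpy as np

def cov_criterion(C_hat, C_true):
    """Root-mean-square deviation of an estimated covariance matrix from the
    true one: sqrt( tr{ (C_hat - C)(C_hat - C)' } / p^2 ).
    One plausible reading of criterion (3.1)."""
    D = np.asarray(C_hat, dtype=float) - np.asarray(C_true, dtype=float)
    p = D.shape[0]
    return np.sqrt(np.trace(D @ D.T) / p**2)

def mean_criterion(mu_hat, mu_true=None):
    """Root-mean-square deviation of the estimated mean vector from the
    (null) population mean: one plausible reading of criterion (3.2)."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    d = mu_hat - (np.zeros_like(mu_hat) if mu_true is None else mu_true)
    return np.sqrt(d @ d / d.size)
```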

All programs were written in FORTRAN 77 and the computations were performed on the SUN 4/65 computer at the Oxford University statistical laboratory.

4. MONTE CARLO RESULTS

Figs. 1-5 present the results of the Monte Carlo design described in §3, based on criterion (3.1). Specifically, Figs. 1, 2, and 3 (for p = 2, 5, and 10, respectively) display the results from the multinormal distribution, and Figs. 4 and 5 (for p = 2 and 5, respectively) display the results from the multivariate t-distribution. Each figure has 12 plots arranged in 4 columns and 3 rows. The columns, as one moves from top to bottom, show the effect of increasing the sample size, and the rows, from left to right, show the effect of increasing the average intercorrelation coefficient (v). Each plot displays the five imputation techniques based on (3.1) versus the proportion of missing data (k). [Figures 1-5 are not reproduced in this transcription.]

The following salient results are noticeable in Figs. 1-3 (normal data).

(i) When the variables are nearly independent (v = 0.7) and p is fixed, as the sample size (n) increases, the MSM technique outperforms the regression-like imputation techniques (EM, DPC, GIP, and SVD) in any set of incomplete data, that is, regardless of the proportion of missing values (k) in the data. This is not surprising, since MSM imputed values are obtained under the pretext that the variables are uncorrelated. Therefore, when the variables are nearly independent and the covariance matrix of the MSM imputed data


matrix is adjusted (as in M1), the MSM technique should fare better than any of the competing techniques. However, it appears that MSM may not do well for a large sample size (n = 200) and dimensionality (p = 10), for any v and k, but may work well for v = 0.3 and p = 2. In general, under conditions suitable for MSM, the trend seems to be that EM is second to MSM in terms of efficiency, followed by DPC, SVD, and GIP in that order. Clearly, GIP is not reliable here, as it does not improve over DPC.

(ii) For fixed n and v, the performance of the imputation techniques improves as p increases, depending on the proportion of missing values (k) (see Figs. 1-3). This can be seen by comparing the average deviations, which decrease with increasing dimensionality (p). It is particularly true for p = 2 and 5 (see Figs. 1 and 2).

(iii) For moderate-to-high dimensionality, p > 2 (see Figs. 2 and 3), as the interdependence among the variables increases (v <= 0.3), the regression-like imputation techniques show appreciable superiority over MSM, especially when n is large and v is near zero (i.e., the variables are highly correlated).

Moving from the general to the specific, consider Fig. 1 (the bivariate normal data): when the sample size is small (n = 50) and v < 0.3, the choice of any imputation technique is influenced by the proportion of missing values (k) in the data. But when k >= 0.10 and n becomes large (n > 50), EM is on average the best technique, and while the choice between GIP and SVD is not important (though both trail behind EM), MSM and DPC both perform fairly badly and are not reliable.

By contrasting Fig. 2 (p = 5) with Fig. 1 (p = 2), a marked difference in the relative merit of the imputation techniques caused by the increase in dimensionality (p) can be observed. In Fig. 2, for instance, the imputation techniques exhibit a trend as v decreases with fixed n. In particular, when v < 0.3, the regression-like imputation techniques (DPC, GIP, EM, and SVD) maintain consistent superiority across different values of n, and there is also evidence that the iterative techniques (GIP, EM, and SVD) are virtually equivalent when the variables are strongly dependent (v <= 0.1), for any sample size (n), depending on the proportion of missing values (k) in the data.

Looking closely at Fig. 3 (p = 10), it is interesting to observe that the trend is not very different from Fig. 2, except that the pattern of the trend varies strongly with the proportion of missing values (k). But unlike Fig. 2, the difference between the iterative techniques (SVD, GIP, and EM) and the one


regression-like technique (DPC) is marginally small when v = 0.01 for all values of n; in fact, they almost always performed alike.

Before presenting the results for the multivariate t-distribution, one point is worth noting. With the exception of EM and perhaps MSM, the remaining techniques (DPC, GIP, and SVD) are based on nonparametric statistical techniques, namely principal component analysis and the singular value decomposition, and as such they can be presumed to be free of distributional assumptions. However, this does not mean that DPC, GIP, and SVD are robust to structures in the data.

In Figs. 4 and 5 (the results for the t_4 distribution), when v = 0.7, for any value of n the trend exhibited by the imputation techniques is similar in all respects to their normal counterparts (Figs. 1 and 2); but for other values of v (v < 0.7), there is a clear disparity in their trends, and the superiority of one imputation technique over another varies greatly according to the sample size (n), the interdependence index (v) and the proportion of missing values (k) in the data. In spite of this observation, however, EM, which depends on the normality assumption, runs neck-and-neck with the distribution-free techniques (DPC, GIP, and SVD). In particular, when n is sufficiently large (200, say) and the variables are strongly dependent (v < 0.3) with moderate dimensionality, p = 5 (see Fig. 5), EM outperforms the other imputation techniques. On the other hand, when p = 2 (see Fig. 4) and v = 0.3, for any value of n, GIP is the most efficient of the five imputation techniques; but when v < 0.3, the choice of the best imputation technique depends on k and n. However, when p = 5, n = 200 and v < 0.3 (see Fig. 5), the regression-like imputation techniques are far better than MSM. This remark still holds when v = 0.01, regardless of the value of n.

Interestingly, a comparison between Figs. 1-2, on the one hand, and Figs. 4-5, on the other, indicates that there is insufficient evidence to discredit the use of EM when the data deviate markedly from normality, especially when p > 2 and reasonably moderate-to-high interdependence exists among the variables. This remark implicitly suggests that whatever is known to affect EM (for example, outliers; Little, 1988) may also affect the other imputation techniques. However, a cursory look at Figs. 1-2 and 4-5 shows that the discrepancy measure (3.1) for the imputation techniques under the multivariate t-distribution is almost always larger than for their normal counterparts, an indication of the effect of non-normality on the imputation techniques.

For the mean vector, based on criterion (3.2) (results not shown), irrespective of p, n, v, and the underlying distribution of the data, the imputation


techniques give mean vectors that are reasonably unbiased. This gratifying result is expected, since the mean vector of an imputed data matrix is estimated with full efficiency.

Finally, a note on the computer time used by these imputation techniques. For the non-iterative techniques (MSM and DPC), no special computer time is required; they are very simple and straightforward. But for the iterative techniques (SVD, EM, and GIP), the amount of computational effort, as might be expected, increases rapidly as n, p, and k become large. Nevertheless, a modern computing machine such as the SUN 4/65 greatly eases this problem. In spite of this, the convergence rate of EM was observed to be the slowest, followed by SVD, especially for n = 200, p = 10 and k = 0.20. However, SVD's slow convergence can be explained by its mode of operation as follows: if there are c missing values in a data matrix, 2 x c SVDs are required to fill the missing values in any one iteration. Suppose m iterations are required before the convergence criterion is satisfied; clearly, 2 x c x m SVDs will then be needed (for example, c = 100 missing values and m = 20 iterations require 4,000 SVDs). It then follows that the convergence rate for the SVD method depends on the time it takes to perform one SVD. On the other hand, GIP simply needs the eigenvalues and associated eigenvectors of a covariance matrix, which are easy to obtain. To sum up, GIP is faster than the two other iterative imputation techniques, SVD and EM.

5. A REAL DATA EXAMPLE

To demonstrate the effects of the five imputation techniques on an analysis of real data, we considered the incomplete data set given by Simonoff (1988, p. 213). Two variables, measured on 50 nations, were involved: the average annual rate of change of manufacturing domestic product (GDP), X, and the rate of change of total GDP, Y. All 50 nations have a complete total GDP (Y), but the manufacturing GDP (X) was missing for six countries. Simonoff (1988) used his proposed diagnostic tests to show that the six missing values are missing completely at random (MCAR). Although Simonoff did not attempt to analyse the data, a simple regression analysis after imputing values for the countries with missing variables may be of interest. Table I presents the imputed values (X*) arising from the five different imputation techniques. The summary statistics from the regression analyses are also given below the table.


TABLE I
A comparison of imputed values from the five imputation techniques, based on the data in Simonoff (1988, p. 213). (The first column, labelled Y here, appears to be the observed total GDP value for each of the six countries with X missing; its original label was not recovered.)

  Y      MSM    EM     DPC    GIP    SVD
 4.00    5.69   4.33   4.94   4.20   4.52
 3.90    5.69   4.17   4.86   4.03   4.38
 5.10    5.69   6.06   5.81   6.10   6.11
 4.40    5.69   4.96   5.26   4.89   5.10
 2.90    5.69   2.59   4.07   2.30   2.93
 6.10    5.69   7.64   6.60   7.82   7.56

TABLE II
A comparison of least-squares residuals for the countries with imputed values.

  MSM    EM     DPC    GIP    SVD
 -0.81  -0.13  -0.44  -0.07  -0.22
 -0.91  -0.15  -0.49  -0.08  -0.25
 -0.29   0.05   0.20   0.02   0.03
 -0.41  -0.07  -0.21  -0.04  -0.13
 -1.91  -0.31  -1.07  -0.16  -0.48
  1.29   0.21   0.78   0.11   0.26

It is interesting to note that, in spite of the small proportion of missing values involved (6%), differences do show up among the imputation techniques. In particular, the values of both the coefficient of multiple determination, R^2, and the residual mean square error show clearly that GIP and EM compete favourably, followed by SVD and DPC.


Although both the slope (b1) and the intercept (b0) give some rough idea of the relative performance of the imputation techniques, a general inference is difficult to make here since the true values are not known. Nevertheless, a comparison of the residuals for the countries with imputed values (Table II) throws some light on the effects of the imputation techniques.

The overall residual plots (not shown) reveal that one point appears to be an outlier. It was a mild case and was not pursued in this study. It is quite remarkable that none of the imputation techniques seems to do badly. Notwithstanding that, serious outliers have been noted to affect EM, and a robust approach which only requires assigning weights to the parameter estimates in the M-step was suggested by Little (1988).

6. SUMMARY AND SUGGESTION

Above, a model for generating random covariance matrices has been constructed and used extensively in a simulation study comparing the relative merits of five imputation techniques under several factors.

It was evident in the Monte Carlo study that the sample size, the dimensionality, the interdependence among the variables, the proportion of missing values and, of course, the underlying distribution of the data all affect the performance of any imputation technique, especially when the covariance matrix is estimated. Two types of distribution were used: one is the ideal case, the multivariate normal, and the other is the multivariate t with four degrees of freedom, representing a non-ideal situation. It emerged that, for the ideal situation, when there are more than two variables and the interdependence among the variables is moderate-to-high with a reasonable sample size (> 50, say), the iterative imputation techniques SVD, EM, and GIP competed favourably. This remark was true for the non-ideal case too. However, for the estimation of the mean vector, the results indicated that there is little or no evidence of the superiority of one imputation technique over another; they are all virtually equivalent.

Although no single imputation technique emerged as the overall best in all the combinations of factors studied, and therefore any general recommendation would be misleading, a straightforward way forward is to try all the iterative imputation techniques (EM, GIP, and SVD) and compare the results given by the statistical analysis, since, in many modern computing environments, this can be accomplished within a few minutes.


ACKNOWLEDGEMENTS

The author is greatly indebted to one anonymous associate editor for many valuable suggestions, ideas for improvements and corrections. The helpful comments of Drs. F. H. C. Marriott and J. Qureshi are also gratefully acknowledged.

BIBLIOGRAPHY

Afifi, A. A. and Elashoff, R. M. (1966). Missing observations in multivariate statistics. I. Review of the literature. J. Amer. Statist. Assoc., 61, 595-604.

Beale, E. M. L. and Little, R. J. A. (1975). Missing values in multivariate statistical analysis. J. R. Statist. Soc., B37, 129-146.

Bendel, R. B. (1978). Population correlation matrices for sampling experiments. Comm. Statist., B7(2), 163-182.

Berger, J. (1980). Statistical decision theory: foundations, concepts and methods. Springer-Verlag, New York.

Boyles, R. A. (1983). On the convergence of the EM algorithm. J. R. Statist. Soc., B45, 47-50.

Bunch, J. R. and Nielsen, C. P. (1978). Updating the singular value decomposition. Numerische Mathematik, 31, 111-129.

Bunch, J. R., Nielsen, C. P. and Sorensen, D. C. (1978). Rank one modification of the symmetric eigenproblem. Numerische Mathematik, 31, 31-48.

Bryce, G. R. and Maynes, D. D. (1979). Generation of multivariate data sets. Report SD-015-R, Brigham Young University Statistics Department Report Series, Provo, Utah 84602.

Chan, L. S. and Dunn, O. J. (1972). The treatment of missing values in discriminant analysis. I. The sampling experiment. J. Amer. Statist. Assoc., 67, 473-477.

Dear, R. E. (1959). A principal-component missing data method for multiple regression models. Report SP-86, System Development Corporation, Santa Monica, CA.


Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc., B39, 1-38.

Devlin, S. J., Gnanadesikan, R. and Kettenring, J. R. (1975). Robust estimation and outlier detection with correlation coefficients. Biometrika, 62, 531-545.

Devlin, S. J., Gnanadesikan, R. and Kettenring, J. R. (1976). Some multivariate applications of elliptical distributions. In Essays in Probability and Statistics (S. Ikeda, Ed.). Shinko Tsusho, Tokyo, 365-393.

Frane, J. W. (1978). Missing data and BMDP: some pragmatic approaches. In Proceedings of the Statistical Computing Section, American Statistical Association, Washington, DC, 27-33.

Greenlees, J. S., Reece, W. S. and Zieschang, K. D. (1982). Imputation of missing values when the probability of response depends on the variable being imputed. J. Amer. Statist. Assoc., 77, 251-261.

Heiberger, R. M. (1978). Generation of random orthogonal matrices. Algorithm AS 127. Appl. Statist., 27, 199-206.

Huseby, J. R., Schwertman, N. C. and Allen, D. M. (1980). Computation of the mean vector and dispersion matrix for incomplete multivariate data. Comm. Statist., B3, 301-309.

Kim, J. O. and Curry, J. (1977). The treatment of missing data in multivariate analysis. Sociological Methods and Research, 6, 215-240.

Knol, D. L. and ten Berge, J. M. F. (1989). Least-squares approximation of an improper correlation matrix by a proper one. Psychometrika, 54, 53-61.

Krzanowski, W. J. (1987). Cross-validation in principal component analysis. Biometrics, 43, 575-584.

Krzanowski, W. J. (1988). Missing value imputation in multivariate data using the singular value decomposition of a matrix. Biometrical Letters, 25, 31-39.

Li, K. H., Raghunathan, T. E. and Rubin, D. B. (1991). Large sample significance levels from multiply imputed data using moment-based statistics and an F reference distribution. J. Amer. Statist. Assoc., 86, 1065-1073.

Little, R. J. A. and Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Little, R. J. A. (1988). Robust estimation of the mean and covariance matrix from data with missing values. Appl. Statist., 37, 23-38.

Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. J. R. Statist. Soc., B44, 226-233.

Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic Press Inc., London.

Meng, X. L. and Rubin, D. B. (1991). Recent extensions to the EM algorithm. In Fourth Valencia International Meeting on Bayesian Statistics, Peniscola, Spain.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581-592.

Rubin, D. B. (1978). Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse. In Proceedings of the Survey Research Methods Section of the American Statistical Association, 20-34.

Rubin, D. B. (1991). Multiple imputation for nonresponse in surveys. Wiley, New York.

Simonoff, J. S. (1988). Regression diagnostics to detect nonrandom missingness in linear regression. Technometrics, 30, 205-214.

Tanner, M. A. and Thisted, R. A. (1982). A remark on AS 127. Generation of random orthogonal matrices. Appl. Statist., 31, 190-192.

Wilks, S. S. (1932). Moments and distribution of estimates of population parameters from fragmentary samples. Annals of Mathematical Statistics, 3, 163-195.

Wu, C. F. (1983). On the convergence properties of the EM algorithm. Annals of Statistics, 11, 95-103.

Received December 1991; Revised November 1992
