Managing Missing Data

Multiple imputation can improve data quality.

By Marco Vriens and Eric Melton

Marketing Research, Fall 2002


Missing values in surveys may arise for a number of reasons: the respondent didn't know the answer, didn't feel comfortable answering a question, skipped the question, or broke off before completing the questionnaire because it was too long. In some cases data are missing by design.

The model-based multiple imputation method is one of many approaches to dealing with missing data. We argue that, although implementing multiple imputation takes more effort on the part of the researcher, there are situations where that extra effort is worthwhile and even necessary to avoid biased survey results.



Executive Summary

Missing data occur in most survey research results. The multiple imputation approach holds promise for solving this problem and is considered by some to be the gold standard for dealing with missing data. This advanced statistical approach has applications in political science, medical research, polling research, sociology, and psychology. We make the case that it also deserves to be in the toolbox of marketing researchers.

Single Imputation Methods

Available case methods. These methods make use only of the data available. List-wise deletion is a well-known option: in computing the mean of a variable, we use only the cases that have valid responses on that variable. When computing more complicated models, list-wise deletion can quickly lead to a large amount of data being thrown away. Pair-wise deletion is another variety, but it's generally less preferred. Because the available case method throws away data, it tends to be expensive and usually yields biased and inefficient inferences for the parameters of interest.
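The two available-case variants can be sketched in a few lines of Python (the records and variable layout below are hypothetical illustrations, not data from the article):

```python
# Available-case methods: use only the data actually observed.
# `None` marks a missing response.

def listwise_mean(rows, var_index):
    """List-wise deletion: drop every record with any missing value,
    then average the requested variable over the remaining records."""
    complete = [r for r in rows if None not in r]
    return sum(r[var_index] for r in complete) / len(complete)

def pairwise_mean(rows, var_index):
    """Pair-wise (per-variable) deletion: drop a record only when the
    requested variable itself is missing."""
    values = [r[var_index] for r in rows if r[var_index] is not None]
    return sum(values) / len(values)

# The second respondent skipped variable 2, so list-wise deletion also
# throws away that respondent's valid answer to variable 1.
rows = [[4.0, 3.0], [5.0, None], [3.0, 2.0]]
```

Here `listwise_mean(rows, 0)` gives 3.5 while `pairwise_mean(rows, 0)` gives 4.0: the discarded record shifts the estimate, which is how the inefficiency and bias described above creep in.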

Replacing the missing value by the mean. In mean imputation, rather than analyzing only the data available, researchers compute the mean of each variable based on the cases available. For every case of that variable that has a missing value, the mean of the available cases is substituted. This procedure distorts the underlying distribution of the data, making the distribution more peaked around the mean and reducing the variance. It also doesn't use the structure in the data optimally. After the mean imputation, the data are treated as if they were a complete data set. Although the method is slightly better than the available case method, it will still lead to biased results.
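The variance shrinkage is easy to see in a small sketch (the values are hypothetical):

```python
import statistics

def mean_impute(values):
    """Mean imputation: replace each missing entry (None) with the mean
    of the observed entries, then treat the data as complete."""
    observed = [v for v in values if v is not None]
    m = statistics.mean(observed)
    return [m if v is None else v for v in values]

y = [9.0, 6.0, None, 8.0, None, 9.0]   # hypothetical variable with gaps
filled = mean_impute(y)                 # both gaps become 8.0
# The filled series is more peaked around the mean: its variance is
# smaller than the variance of the observed values alone.
```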

Random hot-deck imputation. Random hot-deck imputation (RHDI) involves several steps. First, complete records are separated from incomplete records. Next, the complete records are sorted so that records with similar attributes are grouped together. Grouping is done on the basis of general background variables (e.g., work-site size, job function, type of purchase influence). Then a random number is generated for each complete record within each group, and the same treatment is applied to the incomplete records. The incomplete records are weaved among the complete records, and each recipient record receives data from the nearest complete record in the file. RHDI is sometimes referred to as the ascription procedure, or the "man-next-door" procedure, and tends to be better than mean imputation.
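The steps above can be sketched as follows. The grouping variable and values are hypothetical, and this toy version simply draws a donor at random within each group rather than weaving sorted files:

```python
import random

def random_hot_deck(records, group_key, target):
    """Simplified random hot-deck: within each background-variable group,
    an incomplete record receives the target value of a randomly chosen
    complete record (its donor, the 'man next door') from that group."""
    rng = random.Random(0)              # fixed seed for reproducibility
    out = [dict(r) for r in records]
    groups = {}
    for r in out:
        groups.setdefault(r[group_key], []).append(r)
    for members in groups.values():
        donors = [r for r in members if r[target] is not None]
        for r in members:
            if r[target] is None and donors:
                r[target] = rng.choice(donors)[target]
    return out

recs = [
    {"job": "IT", "spend": 9.0},   {"job": "IT", "spend": None},
    {"job": "mgmt", "spend": 4.0}, {"job": "mgmt", "spend": None},
]
filled = random_hot_deck(recs, "job", "spend")
# Each missing value is now an actually observed value donated by a
# similar (same-group) record.
```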

Model-based mean imputation. Model-based imputation methods are based on estimated models and make better use of the structure in the data. The approach involves several steps. First, an available case regression is performed on the variable with missing data, with other variables in the data set as predictors. This regression equation is then used to predict values and impute the missing data. The procedure is repeated for all incomplete variables. Model-based (regression or other) mean imputation is generally better than mean imputation, but it has a disadvantage: if you impute just a single value for each missing data point on the basis of the model, the error variance in the data set may be substantially underestimated.
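A sketch of the regression step, using numpy and hypothetical numbers. Because the toy data fall exactly on a line, the imputed value carries no error term at all, which is precisely the underestimation problem described above:

```python
import numpy as np

def regression_impute(x, y):
    """Model-based single imputation: fit an available-case regression of
    y on x, then fill each missing y with its predicted (expected) value."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    obs = ~np.isnan(y)
    # Design matrix with intercept, using only the available cases.
    X = np.column_stack([np.ones(obs.sum()), x[obs]])
    beta, *_ = np.linalg.lstsq(X, y[obs], rcond=None)
    filled = y.copy()
    filled[~obs] = beta[0] + beta[1] * x[~obs]
    return filled

x = [1.0, 2.0, 3.0, 4.0]          # fully observed predictor
y = [2.0, 4.0, np.nan, 8.0]       # target with one missing value
filled = regression_impute(x, y)  # the gap receives the fitted value 6.0
```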

Model-Based Multiple Imputation

An advanced method for dealing with missing data that solves the problem of underestimating the error variance is the model-based multiple imputation (MBMI) approach. Rather than imputing a single value for each missing data point, researchers can impute multiple values. These multiple values are obtained by drawing from the "predictive distribution" of the variable. For example, in regression-based imputation, the distribution of the predicted values is normal. Rather than just computing the expected value of the missing data point, you can impute the missing value several times by drawing from a normal distribution with the estimated mean and variance. Research has shown that a relatively small number of imputations suffices, usually three to five. (See Exhibit 1.)
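In code, the difference from single imputation is just the draw. The predictive mean, standard deviation, and seed here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

def multiple_impute(pred_mean, pred_sd, m=3):
    """Instead of imputing the single expected value, draw m values from
    the estimated predictive distribution N(pred_mean, pred_sd**2)."""
    return rng.normal(pred_mean, pred_sd, size=m)

draws = multiple_impute(6.0, 1.5, m=3)
# Three plausible values for one missing data point; their spread
# reflects the uncertainty of the imputation, which a single imputed
# value would hide.
```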

MBMI uses the structure in the data (model-based imputation) and addresses the uncertainty caused by the imputation (multiple imputation). For each variable that has missing values, we could develop a regression model using the other variables in the data set as predictors. We could then use this regression equation to calculate the distribution of predicted values and, from this distribution, take the expected value and draw two (or more) additional values.

Doing this for data sets with many variables takes time. MBMI implemented variable by variable would be an improvement over the model-based single-imputation approach, but it simply wouldn't be practical.

Schafer developed several software programs to do MBMI. His method uses a Bayesian approach, which calculates the posterior distributions for all variables with missing values simultaneously. Although the Bayesian approach isn't strictly necessary, we describe it here because it is part of Schafer's approach. Below, we outline the key assumptions of this approach.

The data model. A first assumption involves a probability model for the complete data (i.e., the imputation model). For example, we can use multivariate normal models for normally distributed data, a log-linear model for categorical data, or a mixture of models. The specific model for imputation (i.e., linear regression, logistic regression, log-linear models) should be as compatible as possible with the analyses to be performed on the imputed data set. If we're only interested in means, then any regression model should be compatible. But if we're interested specifically in running non-linear regression models, then using a linear regression model for imputation isn't recommended.

Exhibit 1: Example of multiple imputations

                 Observed Data                     Imputed Values for Variable 3
  Variable 1   Variable 2   Variable 3   Imputation 1   Imputation 2   Imputation 3
      4            3            9
      5            2            6
      3            3         missing           7              6              8
      4            3         	8
      5            2         missing           6              5              5
      4            4            9
      3            3         missing           7              6              6

Source: Vriens, M., M. Wedel, and Z. Sandor (2001), "Split-Questionnaire Designs," Marketing Research, (Summer).

Assumptions about missing-ness. Three assumptions are relevant to this discussion: missing completely at random (MCAR), missing at random (MAR), and non-ignorable. Informal definitions of these concepts are provided in the accompanying box.

The first two can be referred to as ignorable assumptions, and MBMI assumes the missing-ness mechanism is ignorable. This model assumes the unobserved distribution of Y for the non-respondents is only randomly different from the observed distribution of Y for the respondents. The non-ignorable model assumes systematically different values for non-respondents compared to respondents with the same values on the observed variables. Without some external knowledge, the assumption of ignorable non-response is plausible because no extra information is available on the population distribution of Y. In this case, imputation methods adjust for all observed differences between respondents and non-respondents and assume that unobserved differences are random. The plausibility of the ignorable model increases as the number of variables in the imputation model grows: a larger pool of observed variables decreases the degree to which missing-ness depends on unobservable data given the observed variables. We will assume our missing data are MAR.

The prior distribution. In Bayesian statistics we need to specify what is known as the prior distribution of the parameters of the imputation model. In Bayesian theory this represents our a priori knowledge about these parameters. If we have no a priori knowledge, we can use what's known as an uninformative prior. The results of the Bayesian analysis are not very sensitive to our choice of prior when our sample is large.

Combining the results of different imputed data sets. When using the MBMI approach we obtain a number of imputed values for each missing value. (Exhibit 1 has three.) To get a final set of estimates of interest, or a final value for a statistic of interest, we compute the estimates or statistic from each of the imputed data sets and combine them into one final set of estimates using the rules provided by Rubin in 1987. Exhibit 2 illustrates this process.

To get final estimates and standard errors, we pool the m results obtained from the various imputed data sets into one final estimate. First, for each imputed data set we estimate the means and the standard errors, so that for each variable of interest we have three estimated means and three corresponding estimated standard errors. (With m imputations, we would get m sets of estimates.) The overall, final estimate of the mean of a particular variable is simply the average of that variable's mean across the three (or m) estimates. Calculation of the standard errors takes a few more steps.

Assumptions about the missing-ness mechanism

Missing-Completely-at-Random (MCAR). Consider a variable Y that has missing values. The missing data for Y are said to be MCAR if the probability that Y is missing is unrelated to the value of Y itself or any other variable in the data set.

Missing-at-Random (MAR). Missing data are said to be MAR if the probability of missing data on Y is unrelated to the values of Y after controlling for other variables in the data set.

Non-ignorable. If the missing data mechanism is not MAR, it is non-ignorable. If missing data are generated by a non-ignorable mechanism, then we have the following situation: a respondent and a non-respondent can have exactly the same values on all the variables both responded to, while having systematically different values on the variables that show missing values for the non-respondent.
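These definitions can be made concrete in a small simulation; all distributions, rates, and the dependence of missing-ness on the background variable are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
x = rng.normal(size=n)            # always-observed background variable
y = 2 * x + rng.normal(size=n)    # variable that will receive missing values

# MCAR: every y value has the same 30% chance of being missing,
# unrelated to y itself or to x.
mcar_missing = rng.random(n) < 0.30

# MAR: missing-ness depends only on the observed x (here, high-x
# respondents skip more often), not on y once x is controlled for.
mar_missing = rng.random(n) < np.where(x > 0, 0.50, 0.10)

# Under MCAR the observed-case mean of y is still unbiased for the true
# mean (about 0); under MAR it is biased unless we condition on x.
mcar_mean = float(y[~mcar_missing].mean())
mar_mean = float(y[~mar_missing].mean())
```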

To arrive at the final standard errors we take the following steps. First, calculate the within-imputation variance: the average of the squared standard errors across the three imputed data sets (the squares of standard errors 1, 2, and 3, summed and divided by 3). Let's call the within-imputation variance for a particular variable U. Next, calculate the between-imputation variance. We calculate the sum of three squared deviations: deviation 1 is the first mean minus the average (final) mean, deviation 2 is the second mean minus the average mean, and deviation 3 is the third mean minus the average mean. The sum of the squared deviations is then divided by (3 - 1). Let's call the between-imputation variance for a particular variable B. Then calculate the total variance for that variable, denoted by T: T = U + (1 + 1/3)*B, or in general with m imputations, T = U + (1 + 1/m)*B. The final standard error is the square root of T, and the pooled estimate divided by this standard error follows a t-distribution with a certain number of degrees of freedom.
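The pooling steps read directly as code; the per-imputation estimates below are hypothetical, with m = 3:

```python
def pool_rubin(means, within_vars):
    """Rubin's rules sketch: combine m per-imputation means and
    within-imputation variances into a final mean and a total variance
    T = U + (1 + 1/m) * B."""
    m = len(means)
    q_bar = sum(means) / m                              # final pooled mean
    U = sum(within_vars) / m                            # within-imputation variance
    B = sum((q - q_bar) ** 2 for q in means) / (m - 1)  # between-imputation variance
    T = U + (1 + 1 / m) * B                             # total variance
    return q_bar, T

# Three imputed data sets gave three means and three within variances.
q_bar, T = pool_rubin([5.0, 6.0, 7.0], [0.9, 1.0, 1.1])
# q_bar is 6.0; T = 1.0 + (4/3) * 1.0, i.e. about 2.33.
```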

Exhibit 2: Combining data sets

Software for multiple imputation of missing data (NORM). Based on the theory discussed above, J.L. Schafer developed several statistical packages to deal with missing values, each capable of dealing with different types of data models. NORM is one of those packages, performing multiple imputations using the multivariate normal distribution. This program generates multiple imputations per missing value using Bayesian approaches. Using information obtained from the observed part of the data set, NORM simulates the missing part m > 1 times, creating m equally plausible versions of the complete data. These m sets can be analyzed with complete-data techniques, and the m sets of results combined to produce one set of estimates and standard errors.

The package contains two main procedures. The first is an EM algorithm (EM phase) for efficient estimation of means, variances, and covariances (or correlations) using all of the cases in the data set, including those that are partially missing. The EM algorithm is a general method for obtaining maximum-likelihood estimates of parameters from incomplete data. The second is the data augmentation procedure (DA phase), which uses Markov Chain Monte Carlo methods for generating multiple imputations.

The method goes through the following steps. The data augmentation starts with a random imputation of the missing data, assuming specific values for the parameters, then draws new parameter estimates from a Bayesian posterior distribution based on the imputed data.

The DA algorithm simulates random values of the parameters and missing data from their posterior distribution in two steps: (1) the I step (impute the missing data by drawing them from their conditional distribution given the observed data and assumed values for the parameters); (2) the P step (simulate new values for the parameters by drawing them from a Bayesian posterior distribution given the observed data and the most recently imputed values for the missing data). The EM algorithm provides a set of starting values for the parameters. Running DA for a large number of cycles, and storing the results of a few I steps along the way, yields proper multiple imputations of the missing data.
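A toy version of the I/P alternation for a single normally distributed variable with known unit variance and a flat prior on the mean; everything here is a simplifying assumption for illustration, not Schafer's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

y_obs = np.array([9.0, 6.0, 8.0, 9.0])  # observed values (hypothetical)
n_mis = 2                                # two values are missing
mu = float(y_obs.mean())                 # EM-style starting value

for _ in range(500):
    # I step: impute the missing data from their conditional
    # distribution given the current parameter value.
    y_mis = rng.normal(mu, 1.0, size=n_mis)
    y_full = np.concatenate([y_obs, y_mis])
    # P step: draw a new mean from its posterior given the completed data.
    mu = float(rng.normal(y_full.mean(), 1.0 / np.sqrt(len(y_full))))

# After a burn-in, I-step draws stored every few hundred cycles serve as
# proper multiple imputations of the missing values.
```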

Before entering the EM phase, some pre-processing of the data might be necessary to handle non-normal variables. Several transformations can be applied if variables aren't normally distributed, including power transformations for correcting skewness, logit transformations (useful for a variable that takes values only within a limited range), and dummy coding (for including categorical variables with no missing values in the model). Finally, NORM offers the opportunity to synthesize the parameters found into a final set of estimates, as discussed in the previous section.

Is It Worth the Effort?

Applying a procedure such as MBMI can be more time-consuming than applying less-advanced approaches such as mean imputation and RHDI. The question then becomes: Is it worth the effort? A paper by Ramaswamy and colleagues compared the MBMI procedure with other alternatives. In their study, 1,000 synthetic data sets were constructed, each with a sample size of 300 and defined on several variables. In each data set, a certain amount of missing data was imposed. The synthetic data sets were treated with several missing data methods: (1) substitution by the mean, (2) single-regression model-based imputation, (3) EM imputation (the missing value analysis module in SPSS), and (4) MBMI.

The authors now have six sets of data sets:

1. A full set: 1,000 data sets containing no missing values

2. 1,000 data sets where nothing has been done to treat the missing data (complete case analysis)

3. 1,000 data sets where the imposed missing values have been replaced by the means

4. 1,000 data sets where the imposed missing values have been replaced by a single value derived from a regression model

5. 1,000 data sets where the imposed missing values have been replaced by the EM-imputation generated values

6. 1,000 data sets where the imposed missing values have been replaced by MBMI-generated multiple values.

The results were striking. Remember that the estimated model contained two parameters, beta 1 and beta 2. The complete case analysis and the mean substitution method performed worst: in both cases both coefficients were significantly biased. The regression-based single imputation performed slightly better, but still resulted in biased estimates for both coefficients. The EM imputation method (the SPSS missing value module) did reasonably well in recovering beta 1 but resulted in a biased estimate for beta 2. MBMI was the only method that didn't result in systematically biased estimates for either beta.

To illustrate the effects of missing data methods on recovering the means we would have found had there been no missing data, we created our own test data set. This data set contains approximately 10,000 respondents surveyed on several hundred variables.

We first selected only the data that pertained to a specific part of the computer industry media study (CIMS) questionnaire. Next we selected six test variables from this part. Subsequently, we selected only those respondents who had completed, or should have completed, all questions.

To this base data set, referred to as the full data set, missing data was added. Missing-ness was imposed through a series of steps. First, we performed a linear regression analysis on each of the six test variables, using 39 additional variables as independent variables, to look for underlying relationships in the data set.

The results from these regression models were then used to identify the two or three variables that were most strongly related to each specific dependent variable. For test variable 1, the file was sorted in descending order by its two or three top predicting variables. We selected the first 30% or 60% of this ordered data set, and for these records variable 1 was set to missing. Hence, the missing-ness generated for each of the test variables was based on the scores of its two or three predictor variables. We generated two test data sets: one with 30% of the observations set to missing and one with 60% of the observations set to missing.

For each of the generated data sets we applied MBMI using the NORM program, and we applied the traditional (for CIMS) RHDI. Given that the starting point for generating the data sets with different amounts of missing data is a full data set, we know what the mean values or mean proportions for the full data are, which allows us to evaluate the performance of the different methods under different conditions.

We computed the following four statistics:

1. The mean values or proportions of the test variables obtained from analyzing the full data sets (no missing-ness added)

2. The mean values or proportions of the test variables of the data set with added missing-ness, using the available case method

3. The mean values or proportions of the test variables of the data set with added missing-ness, using RHDI

4. The mean values or proportions of the test variables of the data set with added missing-ness, using MBMI (three imputed means plus an overall, final, multiple imputation mean)

To make an overall evaluation of how well the different options for dealing with missing data perform, we calculated for each test variable the absolute error, the absolute percentage error, and the mean absolute percentage error. For each variable we calculate the absolute difference between the "real" mean and the means obtained under the different methods; this is the absolute error. Then we calculate what percentage this error constitutes relative to the "true" mean: the absolute percentage error. Exhibit 4 shows these summary performance evaluation measures.

Exhibit 4: Absolute errors and absolute percentage errors

30% Missing Data Is Added

  Variables        None        %      RHDI        %      MBMI        %
  Variable 1      0.150   28.21%     0.229   43.07%     0.056   10.49%
  Variable 2      0.051   10.74%     0.053   11.23%     0.089   18.76%
  Variable 3      0.010    1.85%     0.128   22.82%     0.014    2.44%
  Variable 4      0.013    2.03%     0.128   19.96%     0.000    0.04%
  Variable 5     81.109   14.61%    47.457    8.55%     7.249    1.31%
  Variable 6      0.030    0.71%     0.022    0.52%     0.136    3.19%
  Mean % Error             9.69%             17.69%              6.04%

60% Missing Data Is Added

  Variables        None        %      RHDI        %      MBMI        %
  Variable 1      0.321   60.31%     0.425   79.87%     0.243   45.78%
  Variable 2      0.314   66.29%     0.067   14.22%     0.335   70.86%
  Variable 3      0.057   10.10%     0.241   42.88%     0.052    9.16%
  Variable 4      0.006    0.95%     0.308   47.85%     0.014    2.17%
  Variable 5    219.914   39.61%    78.160   14.08%   102.702   18.50%
  Variable 6      0.022    0.51%     0.069    1.61%     0.358    8.39%
  Mean % Error            29.63%             33.42%             25.81%

None = available case approach. RHDI = random hot-deck imputation approach. MBMI = model-based multiple imputation approach.
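The evaluation measures are simple to state in code; the means below are hypothetical stand-ins, not values taken from Exhibit 4:

```python
def abs_error(true_mean, est_mean):
    """Absolute error of an imputed-data mean against the full-data mean."""
    return abs(est_mean - true_mean)

def abs_pct_error(true_mean, est_mean):
    """Absolute error expressed as a percentage of the 'true' mean."""
    return abs_error(true_mean, est_mean) / abs(true_mean) * 100.0

def mean_abs_pct_error(true_means, est_means):
    """Average absolute percentage error across the test variables."""
    errors = [abs_pct_error(t, e) for t, e in zip(true_means, est_means)]
    return sum(errors) / len(errors)

# A full-data mean of 0.500 recovered as 0.450 is a 0.050 absolute error
# and a 10% absolute percentage error.
```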

As Exhibit 4 demonstrates, the absolute errors run from small to large, and the performance of each method of dealing with missing data can vary by variable. For some variables it seems to be difficult for all methods to recover the "full data means" (e.g., variables 1 and 2). For other variables either the available case method or the RHDI seems to have difficulty getting close to the original full data mean (e.g., variables 3 and 5). Overall, as indicated by the mean absolute percentage errors, the MBMI method has a much lower mean absolute percentage error than both the available case method and the RHDI in the 30% condition. In the 60% condition MBMI still performs better overall than the available case method and RHDI. However, it's clear that on a variable-by-variable basis all imputation methods make relatively large errors in recovering the original full data mean.

The MBMI approach outperformed the RHDI and the available case method across all variables taken together. This is consistent with what the theory predicts and with previous studies. Hence, we recommend MBMI over RHDI.

We note that the percentage missing-ness imposed on our own data may seem high, though the levels we tested (i.e., 30% and 60%) were comparable to the levels in other studies we surveyed. Do such high levels occur in practice? We think they do. In our syndicated study, CIMS, looking variable by variable, we encounter missing-ness covering anything from almost 0% to as much as 80%. In other studies we have found similar amounts.

The overall results of our analysis all point strongly to one conclusion: MBMI is the best currently available method for dealing with missing data, taking into account performance and practical applicability, especially in large, complicated data sets. Compared to the other alternatives discussed here, the MBMI approach takes into account the uncertainty of the imputation, which results in an average more accurate than any individual imputed value.

Additional Reading

Ramaswamy, V., T.E. Raghunathan, S.H. Cohen, and K. Ozcan (2001), "A Multiple Imputation Approach for the Analysis of Missing Data in Marketing Research," International Journal of Research in Marketing (in press).

Rubin, D.B. (1987), Multiple Imputation for Non-Response in Surveys. New York: John Wiley & Sons.

Schafer, J.L. (1997), Analysis of Incomplete Multivariate Data. London: Chapman & Hall.

Marco Vriens is senior vice president and chief research officer for Millward Brown IntelliQuest. He may be reached at [email protected]. Eric Melton is manager of marketing sciences for Millward Brown IntelliQuest. He may be reached at [email protected].
