
The Performance of Multiple Imputation for Likert-type Items with Missing Data

Walter L. Leite UNIVERSITY OF FLORIDA

S. Natasha Beretvas THE UNIVERSITY OF TEXAS AT AUSTIN

Copies of the paper can be obtained from: [email protected]


Types of missing data

Data missing completely at random (MCAR);

Data missing at random (MAR);

Data missing not at random (MNAR)

This classification is based on the relationships between the missing values, the incomplete variable and the other variables in the design.

[Figure: Variable X and Variable Y, with missing values marked "?"]


Common Methods to Deal with Missing Data

Listwise deletion;

Pairwise deletion;

Mean substitution;

Regression-based single imputation.


Modern Methods to Deal with Missing Data

Maximum-likelihood missing data methods;

Expectation Maximization Algorithm;

Multiple imputation.


Advantages of Multiple Imputation

Provides unbiased parameter estimates even when the data are not missing completely at random (for example, when they are missing at random)

Preserves the variability of each variable

Preserves the variability of the sample covariance matrix


Combining Parameter Estimates:

$$\bar{q} = \frac{1}{m}\sum_{i=1}^{m}\hat{q}_i$$

Calculating the total variance of each parameter:

Within-imputation variance:

$$\bar{u} = \frac{1}{m}\sum_{i=1}^{m}\hat{u}_i$$

Between-imputation variance:

$$B = \frac{1}{m-1}\sum_{i=1}^{m}(\hat{q}_i - \bar{q})^2$$

Total variance:

$$T = \bar{u} + \left(1 + \frac{1}{m}\right)B$$
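The combining rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `pool_rubin` and the input values are made up for the example, and NumPy is assumed to be available.

```python
import numpy as np

def pool_rubin(estimates, within_variances):
    """Pool m point estimates and their variances with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(within_variances, dtype=float)
    m = q.size
    q_bar = q.mean()                 # combined parameter estimate
    u_bar = u.mean()                 # within-imputation variance
    b = q.var(ddof=1)                # between-imputation variance, divisor m - 1
    t = u_bar + (1.0 + 1.0 / m) * b  # total variance
    return q_bar, u_bar, b, t

# Hypothetical estimates from m = 5 imputations (illustration only).
q_bar, u_bar, b, t = pool_rubin([0.50, 0.52, 0.48, 0.51, 0.49], [0.01] * 5)
```

With these made-up inputs the between-imputation variance is small relative to the within-imputation variance, so the total variance is only slightly larger than the average within-imputation variance.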


Main Research Questions

How does MI perform with Likert-scale data under the assumption of multivariate normality?

How does the magnitude of the variables’ inter-correlations affect the performance of MI?

How does MI perform with non-normally distributed data under the assumption of multivariate normality?


Conditions manipulated in this study

The underlying distribution of the item responses (normal versus non-normal);

The magnitude of the variables’ inter-correlations (ρ = 0.3, ρ = 0.8);

The coarseness of the categorization of the data into discrete item scores (three, five, or seven scale points);

The missing data mechanism (MCAR, MAR and MNAR);

The proportion of missing data.


The Simulation Method

Simulation of item data:

Defined correlation matrices with ρ = 0.8 or ρ = 0.3

Generated 1000 samples with 10 multivariate normal variables and 400 cases.

Introduced skewness and kurtosis into the variables using the transformation designed by Vale and Maurelli (1983).

Categorized each variable in the dataset into Likert scales with 3, 5, or 7 points.

Computed the correlation matrices for the categorized data.
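The data-generation steps above can be sketched as follows for the normally distributed case. This is a rough illustration under stated assumptions: the Vale and Maurelli transformation is omitted, the function name `simulate_likert` is hypothetical, and the equal-width thresholds between -3 and +3 are an assumption, since the slides do not say how the cut points were chosen.

```python
import numpy as np

def simulate_likert(n_cases=400, n_items=10, rho=0.8, k=5, seed=0):
    """Draw multivariate normal items and cut them into k-point Likert scores."""
    rng = np.random.default_rng(seed)
    corr = np.full((n_items, n_items), rho)   # common inter-item correlation
    np.fill_diagonal(corr, 1.0)
    x = rng.multivariate_normal(np.zeros(n_items), corr, size=n_cases)
    # Assumed thresholds: k equal-width intervals spanning -3 to +3.
    cuts = np.linspace(-3.0, 3.0, k + 1)[1:-1]
    items = np.digitize(x, cuts) + 1          # integer scores 1..k
    return x, items

x, items = simulate_likert()
r_continuous = np.corrcoef(x, rowvar=False)       # correlations before categorization
r_categorized = np.corrcoef(items, rowvar=False)  # correlations of the Likert scores
```

Comparing `r_continuous` with `r_categorized` shows how much categorization alone attenuates the correlations before any values are deleted.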


Distribution of categorized items

[Figure: two bar charts of response proportions (vertical axis 0.0 to 1.0) across categories 1 to 5, one for a normally distributed variable and one for a non-normally distributed variable]


The Simulation Method

Simulation of missing values:

Created MCAR missing data by randomly deleting values;

Deleted values according to a predictor variable to create MAR missing data;

Deleted values in each variable according to its own distribution of values, to create the MNAR missing data.
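The three deletion mechanisms above differ only in which variable drives the probability of deletion. The sketch below illustrates that idea; the helper `delete_values`, the randomly generated items, and the 10% MCAR rate are assumptions for the example, while the per-score probabilities are the level II values of the MAR-linear/MNAR condition for the 7-point scale reported in this presentation.

```python
import numpy as np

def delete_values(target, driver, probs, rng):
    """Set target[i] to NaN with probability probs[driver[i] - 1]."""
    p = np.asarray(probs)[np.asarray(driver) - 1]
    out = np.asarray(target, dtype=float).copy()
    out[rng.random(out.shape) < p] = np.nan
    return out

rng = np.random.default_rng(1)
y = rng.integers(1, 8, size=400)  # hypothetical 7-point item to receive missing values
x = rng.integers(1, 8, size=400)  # hypothetical 7-point predictor
level_ii = [.20, .24, .28, .32, .36, .40, .44]  # deletion probability for scores 1..7

y_mcar = delete_values(y, np.ones_like(y), [0.10], rng)  # same probability for every case
y_mar = delete_values(y, x, level_ii, rng)               # driven by the observed predictor
y_mnar = delete_values(y, y, level_ii, rng)              # driven by the deleted value itself
```

In the MNAR case the information needed to explain the missingness is removed along with the value, which is why no observed variable can fully account for it.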


Missing Data Conditions Simulated: 7-point Likert scale

Proportions of missingness

               MAR-Linear/MNAR         MAR-Convex
Scale point    I      II     III       I      II     III
1              .02    .20    .30       .15    .30    .50
2              .04    .24    .35       .10    .20    .45
3              .08    .28    .40       .05    .10    .40
4              .12    .32    .45       .02    .05    .35
5              .16    .36    .50       .05    .10    .40
6              .20    .40    .55       .10    .20    .45
7              .24    .44    .60       .15    .30    .50


Simulated Proportions of Missing Data

Likert scale (k = number of scale points)

                      Normally distributed data    Non-normally distributed data
Type         Level    k=3    k=5    k=7            k=3    k=5    k=7
MAR-linear   I        .10    .14    .11            .05    .06    .04
MAR-linear   II       .18    .27    .29            .11    .20    .21
MAR-linear   III      .45    .45    .40            .38    .38    .30
MAR-convex   I        .06    .07    .05            .08    .12    .11
MAR-convex   II       .15    .15    .10            .25    .24    .22
MAR-convex   III      .42    .42    .36            .52    .51    .42
MNAR         I        .11    .15    .12            .06    .07    .04
MNAR         II       .20    .30    .32            .12    .22    .23
MNAR         III      .48    .50    .45            .41    .42    .34


Missing Data Analysis

Values for the missing data were imputed with S-PLUS 6.0.

The multivariate normal model was assumed.

Ten imputations were created for each dataset.

The correlation between each pair of variables was calculated for each imputed data set.

The correlations were transformed to Fisher’s Z:

$$Z_r = \frac{1}{2}\ln\left[\frac{1+r}{1-r}\right]$$

The ten transformed correlation matrices were combined using Rubin’s (1987) rule:

$$\bar{q} = \frac{1}{m}\sum_{i=1}^{m}\hat{q}_i$$

The between-imputations variance, B, of the transformed correlation estimates was calculated:

$$B = \frac{1}{m-1}\sum_{i=1}^{m}(\hat{q}_i - \bar{q})^2$$
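The analysis steps on this slide (correlate, transform to Fisher's Z, average over imputations, compute the between-imputation variance) can be sketched as below. This is only an illustration: the study imputed with S-PLUS 6.0 under a multivariate normal model, whereas here the "imputed" datasets are simulated placeholders, and the function name `pooled_fisher_z` is hypothetical.

```python
import numpy as np

def pooled_fisher_z(imputed_datasets):
    """Pool Fisher-Z-transformed correlation matrices over imputed datasets."""
    z_list = []
    for data in imputed_datasets:
        r = np.corrcoef(data, rowvar=False)
        np.fill_diagonal(r, 0.0)        # diagonal is not of interest; avoids arctanh(1)
        z_list.append(np.arctanh(r))    # arctanh(r) = 0.5 * ln((1 + r) / (1 - r))
    z = np.stack(z_list)
    z_bar = z.mean(axis=0)              # combined estimate for each pair of variables
    b = z.var(axis=0, ddof=1)           # between-imputation variance for each pair
    return z_bar, b

# Placeholder for ten imputed datasets: one base dataset plus small random noise.
rng = np.random.default_rng(2)
base = rng.normal(size=(400, 3))
fake_imputations = [base + rng.normal(scale=0.1, size=base.shape) for _ in range(10)]
z_bar, b = pooled_fisher_z(fake_imputations)
```

Pooling on the Z scale rather than on the raw correlations keeps the sampling distribution of each estimate closer to normal, which is the usual reason for transforming before applying Rubin's rule.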


Analysis of the Performance of MI

The Fisher’s Zs for the complete data (before missingness was introduced) were compared with the MI estimates.

The comparisons were performed using relative bias averaged across replications.

The relative bias is considered acceptable if its magnitude is less than .05 (Hoogland & Boomsma, 1998).

$$B(\hat{Z}_r) = \frac{\hat{Z}_r - \zeta_\rho}{\zeta_\rho}$$
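The relative bias described on this slide, the difference between the MI estimate and the complete-data Fisher's Z divided by the complete-data value and averaged across replications, can be sketched as follows. The function names and the three example estimates are hypothetical; only the .05 cutoff comes from the slide.

```python
import numpy as np

def relative_bias(z_mi, z_complete):
    """Average relative bias of MI estimates against complete-data values."""
    z_mi = np.asarray(z_mi, dtype=float)
    z_complete = np.asarray(z_complete, dtype=float)
    return float(np.mean((z_mi - z_complete) / z_complete))

def is_acceptable(bias, cutoff=0.05):
    """Apply the |relative bias| < .05 criterion cited on the slide."""
    return abs(bias) < cutoff

# Hypothetical replications: estimates 3%, 1%, and 2% below the complete-data value.
zeta = np.arctanh(0.8)
estimates = zeta * np.array([0.97, 0.99, 0.98])
bias = relative_bias(estimates, zeta)
```

A negative value means that the MI estimates fall below the complete-data values on average.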


Analysis of the Performance of MI

The variance associated with the multiply imputed parameter estimate is a function of the average within-imputation variance and the between-imputation variance.

Because the parameter estimate of interest is the transformed correlation (Fisher’s Z), its within-imputation variance is solely a function of sample size:

$$\hat{u} = \frac{1}{n-3}$$

The between-imputations variance associated with Z did vary across conditions. For this reason, the efficiency of the Z-transformed correlations was summarized by calculating the average between-imputation variances by condition.
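As a worked illustration of how the two variance components combine, take the study's sample size (n = 400) and number of imputations (m = 10), and assume a between-imputation variance of B = .0004, the value reported for several MCAR level I conditions in the between-imputation variance tables of this presentation:

$$\hat{u} = \frac{1}{400-3} \approx 0.00252, \qquad T = \hat{u} + \left(1+\frac{1}{10}\right)B \approx 0.00252 + 1.1(0.0004) \approx 0.00296$$

Under these values the imputation process adds roughly 17% to the variance that the complete data would have had.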


Results – Fisher’s Z

MI estimates were unbiased (relative bias below .05 in magnitude) when the data were missing completely at random (MCAR), at both levels of missingness (10% and 30%).

Normally distributed data

                 ρ = .8                     ρ = .3
Type    Level    k=3     k=5     k=7        k=3     k=5     k=7
MCAR    I        -.004   -.004   -.005      -.003   -.004   -.004
MCAR    II       -.032   -.036   -.039      -.035   -.034   -.033

Non-Normally distributed data

                 ρ = .8                     ρ = .3
Type    Level    k=3     k=5     k=7        k=3     k=5     k=7
MCAR    I        -.004   -.007   -.005      -.008   -.006   -.004
MCAR    II       -.035   -.040   -.040      -.046   -.038   -.033


Results – Fisher’s Z

Under MAR, MI was robust to skewness and categorization at the two lowest levels of missingness (I and II), for both the MAR-linear and MAR-convex conditions.

                      ρ = .8                     ρ = .3
Type         Level    k=3     k=5     k=7        k=3     k=5     k=7
MAR-linear   I        -.007   -.010   -.009      -.014   -.014   -.012
MAR-linear   II       -.017   -.037   -.047      -.024   -.041   -.045
MAR-linear   III      -.121   -.126   -.102      -.139   -.132   -.099
MAR-convex   I        -.004   -.005   -.005      -.012   -.012   -.010
MAR-convex   II       -.011   -.014   -.010      -.022   -.020   -.014
MAR-convex   III      -.105   -.113   -.079      -.126   -.115   -.076


Results – Fisher’s Z

Non-Normally distributed data

                      ρ = .8                     ρ = .3
Type         Level    k=3     k=5     k=7        k=3     k=5     k=7
MAR-linear   I         .004    .008    .008      -.003   -.001    .003
MAR-linear   II        .005   -.011   -.011      -.008   -.019   -.016
MAR-linear   III      -.074   -.080   -.039      -.114   -.083   -.042
MAR-convex   I        -.014   -.019   -.021      -.012   -.018   -.017
MAR-convex   II       -.064   -.051   -.053      -.046   -.043   -.040
MAR-convex   III      -.218   -.205   -.128      -.224   -.196   -.121


Results – Fisher’s Z

Acceptable bias was found for MNAR, with the exception of the conditions where the highest proportion of missing data had been introduced.

                 ρ = .8                     ρ = .3
Type    Level    k=3     k=5     k=7        k=3     k=5     k=7
MNAR    I        -.010   -.009   -.010      -.030   -.018   -.016
MNAR    II       -.017   -.038   -.049      -.029   -.044   -.051
MNAR    III      -.094   -.134   -.109      -.093   -.146   -.112

Non-Normally distributed data

                 ρ = .8                     ρ = .3
Type    Level    k=3     k=5     k=7        k=3     k=5     k=7
MNAR    I        -.006    .002    .003      -.048   -.038   -.029
MNAR    II       -.006   -.019   -.017      -.077   -.068   -.062
MNAR    III      -.079   -.093   -.051      -.184   -.155   -.116



Results – Between-Imputations Variance

The between-imputation variance accounts for the extra error introduced by the imputation process.

As the overall proportion of missingness increased, so did the between-imputation variance.

In the conditions with a high percentage of missing data, the between-imputation variances increased as the correlation between variables increased from ρ = .3 to ρ = .8.


Between-imputation variances - Normally distributed data

ρ = .8 ρ = .3

Type Level k=3 k=5 k=7 k=3 k=5 k=7

MCAR I .0004 .0004 .0004 .0005 .0005 .0005

MCAR II .0033 .0050 .0063 .0021 .0021 .0021

MAR-linear I .0004 .0008 .0006 .0005 .0007 .0005

MAR-linear II .0012 .0051 .0079 .0010 .0019 .0021

MAR-linear III .0137 .0225 .0212 .0046 .0050 .0042

MAR-convex I .0002 .0003 .0002 .0003 .0004 .0002

MAR-convex II .0011 .0012 .0006 .0009 .0008 .0005

MAR-convex III .0148 .0206 .0161 .0043 .0046 .0034

MNAR I .0005 .0009 .0007 .0005 .0008 .0006

MNAR II .0013 .0053 .0082 .0012 .0021 .0024

MNAR III .0135 .0246 .0228 .0050 .0058 .0048


Between-imputation variances – Non-Normally distributed data

ρ = .8 ρ = .3

Type Level k=3 k=5 k=7 k=3 k=5 k=7

MCAR I .0004 .0004 .0005 .0005 .0005 .0005

MCAR II .0043 .0059 .0071 .0022 .0022 .0022

MAR-linear I .0003 .0005 .0003 .0004 .0004 .0003

MAR-linear II .0010 .0038 .0049 .0007 .0014 .0015

MAR-linear III .0153 .0208 .0149 .0039 .0041 .0029

MAR-convex I .0003 .0006 .0005 .0004 .0005 .0005

MAR-convex II .0021 .0030 .0025 .0014 .0014 .0011

MAR-convex III .0174 .0267 .0217 .0052 .0058 .0043

MNAR I .0004 .0005 .0003 .0004 .0004 .0003

MNAR II .0011 .0038 .0049 .0008 .0016 .0016

MNAR III .0153 .0219 .0150 .0042 .0045 .0031


Discussion

The results indicate that multiple imputation is robust to violations of both continuity and normality.

The biases of the parameter estimates resulting from MI were negative in nearly all conditions.

Because the correlations tend to be underestimated, statistical tests performed after MI will tend to be less powerful.

Multiple imputation can be safely used to estimate parameters if the overall proportion of missing data in the dataset does not exceed about 30%.


Limitations

The datasets used in this study contained ten variables inter-correlated with each other. The results of this study may have been different if uncorrelated variables were used.

The proportions of missing data were not consistent across conditions, which makes comparisons across conditions somewhat harder.

This simulation used a single sample size of 400, which is relatively large. Different results could be obtained if smaller sample sizes were used.


Future Research Questions

What is the maximum amount of missing data for which MI still functions adequately?

How much can the inclusion of predictors in the dataset help MI when the proportion of missing data is large?

How does sample size affect the performance of MI?
