1
The Performance of Multiple Imputation for Likert-type Items with Missing Data
Walter L. Leite UNIVERSITY OF FLORIDA
S. Natasha Beretvas THE UNIVERSITY OF TEXAS AT AUSTIN
Copies of the paper can be obtained from: [email protected]
2
Types of missing data
Data missing completely at random (MCAR);
Data missing at random (MAR);
Data missing not at random (MNAR)
This classification is based on the relationships between the missing values, the incomplete variable and the other variables in the design.
[Figure: Variable X and Variable Y, with missing values marked "?"]
3
Common Methods to Deal with Missing Data
Listwise deletion;
Pairwise deletion;
Mean substitution;
Regression-based single imputation.
4
Modern Methods to Deal with Missing Data
Maximum-likelihood missing data methods;
Expectation Maximization Algorithm;
Multiple imputation.
5
Advantages of Multiple Imputation
Provides unbiased parameter estimates even when the data are missing at random (MAR) rather than completely at random (MCAR)
Preserves the variability of each variable
Preserves the variability of the sample covariance matrix
6
Combining Parameter Estimates:
$\bar{q} = \frac{1}{m} \sum_{i=1}^{m} \hat{q}_i$
Calculating the total variance of each parameter:
Within-imputation variance: $\bar{u} = \frac{1}{m} \sum_{i=1}^{m} \hat{u}_i$
Between-imputation variance: $B = \frac{1}{m-1} \sum_{i=1}^{m} (\hat{q}_i - \bar{q})^2$
Total variance: $T = \bar{u} + \left(1 + \frac{1}{m}\right) B$
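The combining rules on this slide (pooled estimate, within-imputation variance, between-imputation variance, total variance) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `pool_rubin` and its argument names are assumptions made here, not part of the original study, which used S-PLUS.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their sampling variances (hypothetical helper)."""
    q = np.asarray(estimates, dtype=float)   # q_hat_i, one per imputed dataset
    u = np.asarray(variances, dtype=float)   # u_hat_i, one per imputed dataset
    m = q.size
    q_bar = q.mean()                         # pooled point estimate
    u_bar = u.mean()                         # within-imputation variance
    b = q.var(ddof=1)                        # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b              # total variance
    return q_bar, u_bar, b, t
```

Calling `pool_rubin` with the m estimates and their m variances returns the four quantities in the order they appear on the slide.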
7
Main Research Questions
How does MI perform with Likert-scale data under the assumption of multivariate normality?
How does the magnitude of the variables’ inter-correlations affect the performance of MI?
How does MI perform with non-normally distributed data under the assumption of multivariate normality?
8
Conditions manipulated in this study
The underlying distribution of the item responses (normal versus non-normal);
The magnitude of the variables’ inter-correlations (ρ = 0.3, ρ = 0.8);
The bluntness of the categorization of the data into discrete item scores (three, five and seven);
The missing data mechanism (MCAR, MAR and MNAR);
The proportion of missing data.
9
The Simulation Method
Simulation of item data:
Defined correlation matrices with ρ = 0.8 or ρ = 0.3
Generated 1000 samples with 10 multivariate normal variables and 400 cases.
Introduced skewness and kurtosis into the variables using the transformation designed by Vale and Maurelli (1983).
Categorized each variable in the dataset into Likert scales with 3, 5, or 7 points.
Computed the correlation matrices for the categorized data.
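The data-generation steps above can be sketched roughly as follows for the normal condition. The slides do not report the category thresholds, so the equally spaced cut points on [-2, 2] are purely an assumption for illustration, as are the function and parameter names. The Vale and Maurelli transformation for the non-normal condition is omitted from this sketch.

```python
import numpy as np

def simulate_likert(n_cases=400, n_items=10, rho=0.8, k=5, seed=0):
    """Draw correlated normal variables and cut them into k-point items (sketch)."""
    rng = np.random.default_rng(seed)
    # Correlation matrix with the same correlation rho between every pair of variables.
    corr = np.full((n_items, n_items), rho)
    np.fill_diagonal(corr, 1.0)
    latent = rng.multivariate_normal(np.zeros(n_items), corr, size=n_cases)
    # Assumed thresholds: k - 1 equally spaced cut points between -2 and 2.
    cuts = np.linspace(-2.0, 2.0, k + 1)[1:-1]
    items = np.digitize(latent, cuts) + 1    # integer scores 1..k
    return latent, items

latent, items = simulate_likert()
item_corr = np.corrcoef(items, rowvar=False)  # correlation matrix of the categorized data
```

Repeating the call with different seeds would give the replicated samples; comparing `item_corr` with the correlations of `latent` shows how much categorization attenuates the correlations.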
10
Distribution of categorized items
[Figure: two bar charts of response proportions (vertical axis 0.0 to 1.0) across categories 1 to 5. Panels: Normally distributed variable; Non-normally distributed variable]
11
The Simulation Method
Simulation of missing values:
Created MCAR missing data by randomly deleting values;
Deleted values according to a predictor variable to create MAR missing data;
Deleted values in each variable according to its own distribution of values, to create the MNAR missing data.
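The three deletion schemes can be illustrated with a small Python sketch. It is a simplification under stated assumptions: it deletes values in a single target column only, it treats the deletion probability as a lookup by item score, and the names `make_missing`, `target`, and `predictor` are hypothetical.

```python
import numpy as np

def make_missing(items, mechanism, probs, target=0, predictor=1, seed=0):
    """Return a float copy of `items` with NaN inserted in column `target` (sketch).

    For "MCAR", `probs` is a single deletion proportion.
    For "MAR" and "MNAR", `probs` lists the deletion probability for scores 1..k.
    """
    rng = np.random.default_rng(seed)
    out = items.astype(float)
    n = out.shape[0]
    if mechanism == "MCAR":
        p = np.full(n, float(probs))                  # same probability for every case
    elif mechanism == "MAR":
        p = np.asarray(probs)[items[:, predictor] - 1]  # depends on another observed item
    elif mechanism == "MNAR":
        p = np.asarray(probs)[items[:, target] - 1]     # depends on the value being deleted
    else:
        raise ValueError("mechanism must be 'MCAR', 'MAR', or 'MNAR'")
    out[rng.random(n) < p, target] = np.nan
    return out
```

For a 7-point scale, passing `probs=[.20, .24, .28, .32, .36, .40, .44]` would correspond to the level II column of the MAR-Linear/MNAR table in the deck.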
12
Missing Data Conditions Simulated
7-point Likert scale
Proportions of missingness: MAR-Linear/MNAR
Score I II III
1 .02 .20 .30
2 .04 .24 .35
3 .08 .28 .40
4 .12 .32 .45
5 .16 .36 .50
6 .20 .40 .55
7 .24 .44 .60
Proportions of missingness: MAR-Convex
Score I II III
1 .15 .30 .50
2 .10 .20 .45
3 .05 .10 .40
4 .02 .05 .35
5 .05 .10 .40
6 .10 .20 .45
7 .15 .30 .50
13
Simulated Proportions of Missing Data
Normally distributed data
Likert Scale
Type Level k=3 k=5 k=7
MAR-linear I .10 .14 .11
MAR-linear II .18 .27 .29
MAR-linear III .45 .45 .40
MAR-convex I .06 .07 .05
MAR-convex II .15 .15 .10
MAR-convex III .42 .42 .36
MNAR I .11 .15 .12
MNAR II .20 .30 .32
MNAR III .48 .50 .45
Non-normally distributed data
Likert Scale
Type Level k=3 k=5 k=7
MAR-linear I .05 .06 .04
MAR-linear II .11 .20 .21
MAR-linear III .38 .38 .30
MAR-convex I .08 .12 .11
MAR-convex II .25 .24 .22
MAR-convex III .52 .51 .42
MNAR I .06 .07 .04
MNAR II .12 .22 .23
MNAR III .41 .42 .34
14
Missing Data Analysis
Values for the missing data were imputed with S-PLUS 6.0.
The multivariate normal model was assumed.
Ten imputations were created for each dataset.
The correlation between each pair of variables was calculated for each imputed data set.
The correlations were transformed to Fisher’s Z:
$Z_r = \frac{1}{2} \ln\left[\frac{1 + r}{1 - r}\right]$
The ten transformed correlation matrices were combined using Rubin’s (1987) rule:
$\bar{q} = \frac{1}{m} \sum_{i=1}^{m} \hat{q}_i$
The between-imputations variance, B, of the transformed correlation estimates was calculated:
$B = \frac{1}{m-1} \sum_{i=1}^{m} (\hat{q}_i - \bar{q})^2$
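As a rough illustration of this analysis step, the sketch below transforms the correlation between two variables to Fisher's Z in each imputed dataset and then averages the Z values and computes their between-imputation variance. It assumes the imputed datasets are already available as complete NumPy arrays (the imputation itself is not shown), and the function names are hypothetical.

```python
import numpy as np

def fisher_z(r):
    """Fisher's Z transformation of a correlation r."""
    return 0.5 * np.log((1 + r) / (1 - r))

def pool_correlation(imputed_datasets, i, j):
    """Pool the Fisher's Z of corr(variable i, variable j) over imputed datasets (sketch)."""
    z = np.array([
        fisher_z(np.corrcoef(d[:, i], d[:, j])[0, 1])
        for d in imputed_datasets
    ])
    z_bar = z.mean()       # combined estimate on the Z scale
    b = z.var(ddof=1)      # between-imputation variance of Z
    return z_bar, b
```

Here `imputed_datasets` would be a list of ten arrays, one per imputation, each with 400 rows and 10 columns.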
15
Analysis of the Performance of MI
The Fisher’s Zs for the complete data (before missingness was introduced) were compared with the MI estimates.
The comparisons were performed using relative bias averaged across replications.
The relative bias is considered acceptable if its magnitude is less than .05 (Hoogland & Boomsma, 1998).
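$B(\hat{Z}_r) = \frac{\hat{Z}_r - \zeta_\rho}{\zeta_\rho}$, where $\hat{Z}_r$ is the MI estimate and $\zeta_\rho$ is the Fisher’s Z for the complete data.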
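The relative bias criterion can be written as a short Python sketch: the difference between the MI estimate and the complete-data Fisher's Z, divided by the complete-data value and averaged over replications. The function names and the default cutoff argument are illustrative assumptions; the .05 cutoff itself is the one cited on this slide.

```python
import numpy as np

def relative_bias(z_mi, z_complete):
    """Mean relative bias of MI estimates against complete-data values (sketch)."""
    z_mi = np.asarray(z_mi, dtype=float)              # one MI estimate per replication
    z_complete = np.asarray(z_complete, dtype=float)  # matching complete-data values
    return np.mean((z_mi - z_complete) / z_complete)

def acceptable(bias, cutoff=0.05):
    """True when the magnitude of the relative bias is below the cutoff."""
    return abs(bias) < cutoff
```

A negative result means the MI estimates fall below the complete-data values on average.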
16
Analysis of the Performance of MI
The variance associated with the multiply imputed parameter estimate is a function of the average within-imputation variance and the between-imputation variance.
Because the parameter estimate of interest is the transformed correlation (Fisher’s Z), its within-imputation variance is solely a function of sample size:
$\hat{u} = \frac{1}{n - 3}$
The between-imputations variance associated with Z did vary across conditions. For this reason, the efficiency of the Z-transformed correlations was summarized by calculating the average between-imputation variances by condition.
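As a worked illustration, the within-imputation variance of Fisher's Z is 1/(n − 3), so with the study's sample size of n = 400 it is the same in every imputed dataset. Combining it with m = 10 imputations and, as an example, the between-imputation variance of .0225 reported later in the deck for MAR-linear level III (k = 5, ρ = .8, normally distributed data) gives the total variance for that condition. The choice of condition is only for illustration.

```latex
\hat{u} = \frac{1}{n - 3} = \frac{1}{397} \approx 0.00252
\qquad
T = \hat{u} + \left(1 + \frac{1}{m}\right) B
  = 0.00252 + 1.1 \times 0.0225 \approx 0.0273
```

In this condition the between-imputation component accounts for most of the total variance.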
17
Results – Fisher’s Z
The MI estimates were essentially unbiased when the data were missing completely at random (MCAR) at both levels of missingness (10% and 30%).
Normally distributed data
ρ = .8 ρ = .3
Type Level k=3 k=5 k=7 k=3 k=5 k=7
MCAR I -.004 -.004 -.005 -.003 -.004 -.004
MCAR II -.032 -.036 -.039 -.035 -.034 -.033
Non-normally distributed data
ρ = .8 ρ = .3
Type Level k=3 k=5 k=7 k=3 k=5 k=7
MCAR I -.004 -.007 -.005 -.008 -.006 -.004
MCAR II -.035 -.040 -.040 -.046 -.038 -.033
18
Results – Fisher’s Z
With MAR data, MI was robust to skewness and categorization at the two lowest levels of missingness (I and II) under both the MAR-linear and MAR-convex conditions.
Normally distributed data
ρ = .8 ρ = .3
Type Level k=3 k=5 k=7 k=3 k=5 k=7
MAR-linear I -.007 -.010 -.009 -.014 -.014 -.012
MAR-linear II -.017 -.037 -.047 -.024 -.041 -.045
MAR-linear III -.121 -.126 -.102 -.139 -.132 -.099
MAR-convex I -.004 -.005 -.005 -.012 -.012 -.010
MAR-convex II -.011 -.014 -.010 -.022 -.020 -.014
MAR-convex III -.105 -.113 -.079 -.126 -.115 -.076
19
Results – Fisher’s Z
Non-Normally distributed data
ρ = .8 ρ = .3
Type Level k=3 k=5 k=7 k=3 k=5 k=7
MAR-linear I .004 .008 .008 -.003 -.001 .003
MAR-linear II .005 -.011 -.011 -.008 -.019 -.016
MAR-linear III -.074 -.080 -.039 -.114 -.083 -.042
MAR-convex I -.014 -.019 -.021 -.012 -.018 -.017
MAR-convex II -.064 -.051 -.053 -.046 -.043 -.040
MAR-convex III -.218 -.205 -.128 -.224 -.196 -.121
20
Results – Fisher’s Z
Acceptable bias was found for MNAR, with the exception of the conditions where the highest proportion of missing data had been introduced.
Normally distributed data
ρ = .8 ρ = .3
Type Level k=3 k=5 k=7 k=3 k=5 k=7
MNAR I -.010 -.009 -.010 -.030 -.018 -.016
MNAR II -.017 -.038 -.049 -.029 -.044 -.051
MNAR III -.094 -.134 -.109 -.093 -.146 -.112
Non-normally distributed data
ρ = .8 ρ = .3
Type Level k=3 k=5 k=7 k=3 k=5 k=7
MNAR I -.006 .002 .003 -.048 -.038 -.029
MNAR II -.006 -.019 -.017 -.077 -.068 -.062
MNAR III -.079 -.093 -.051 -.184 -.155 -.116
21
Results – Between-Imputations Variance
The between-imputation variance accounts for the extra error introduced by the imputation process.
As the overall proportion of missingness increased, so did the between-imputation variance.
In the conditions with a high percentage of missing data, the between-imputation variances increased as the correlation between variables increased from ρ = .3 to ρ = .8.
22
Between-imputation variances - Normally distributed data
ρ = .8 ρ = .3
Type Level k=3 k=5 k=7 k=3 k=5 k=7
MCAR I .0004 .0004 .0004 .0005 .0005 .0005
MCAR II .0033 .0050 .0063 .0021 .0021 .0021
MAR-linear I .0004 .0008 .0006 .0005 .0007 .0005
MAR-linear II .0012 .0051 .0079 .0010 .0019 .0021
MAR-linear III .0137 .0225 .0212 .0046 .0050 .0042
MAR-convex I .0002 .0003 .0002 .0003 .0004 .0002
MAR-convex II .0011 .0012 .0006 .0009 .0008 .0005
MAR-convex III .0148 .0206 .0161 .0043 .0046 .0034
MNAR I .0005 .0009 .0007 .0005 .0008 .0006
MNAR II .0013 .0053 .0082 .0012 .0021 .0024
MNAR III .0135 .0246 .0228 .0050 .0058 .0048
23
Between-imputation variances – Non-Normally distributed data
ρ = .8 ρ = .3
Type Level k=3 k=5 k=7 k=3 k=5 k=7
MCAR I .0004 .0004 .0005 .0005 .0005 .0005
MCAR II .0043 .0059 .0071 .0022 .0022 .0022
MAR-linear I .0003 .0005 .0003 .0004 .0004 .0003
MAR-linear II .0010 .0038 .0049 .0007 .0014 .0015
MAR-linear III .0153 .0208 .0149 .0039 .0041 .0029
MAR-convex I .0003 .0006 .0005 .0004 .0005 .0005
MAR-convex II .0021 .0030 .0025 .0014 .0014 .0011
MAR-convex III .0174 .0267 .0217 .0052 .0058 .0043
MNAR I .0004 .0005 .0003 .0004 .0004 .0003
MNAR II .0011 .0038 .0049 .0008 .0016 .0016
MNAR III .0153 .0219 .0150 .0042 .0045 .0031
24
Discussion
The results indicate that multiple imputation is robust to violations of both continuity and normality.
The biases of the parameter estimates resulting from using MI were found to be consistently negative across all conditions.
Because the correlations are underestimated, statistical tests performed after MI will tend to be less powerful.
These results suggest that multiple imputation can be safely used to estimate parameters when the overall proportion of missing data in the dataset does not exceed about 30%.
25
Limitations
The datasets used in this study contained ten variables inter-correlated with each other. The results of this study may have been different if uncorrelated variables were used.
The proportions of missing data were not consistent across conditions, which makes comparisons across conditions somewhat harder.
This simulation uses only a sample size of 400, which is relatively large. Different results could be obtained if smaller sample sizes were used.
26
Future Research Questions
What is the maximum amount of missing data for which MI still functions adequately?
How much can the inclusion of predictors in the dataset help MI when the proportion of missing data is large?
How does sample size affect the performance of MI?