
Workshop on Working with Missing Values: Supplement to PowerPoint Presentation*

Alan C. Acock
Oregon State University

This document and selected references, data, and programs are available at

http://oregonstate.edu/~acock/missing/

Note to Readers

These are lecture notes for a presentation. They are not a self-contained, systematic treatment of the topic, are not intended for publication, and have not been carefully edited for publication purposes. Instead, these notes are intended to complement a one-day workshop presentation. The workshop will expand and clarify many of the points presented in this document. The intention of this document is to help workshop participants follow the presentation. The notes are much more detailed than a usual set of PowerPoint slides, but much less detailed than a self-contained treatment of the topics. Others may find these notes useful, but they are not a substitute for participation in the workshop.

Workshop on Missing Values Presented at University of Nevada, Reno, College of Human and Community Sciences


Working with Missing Values

Types of Missing Values

Missing by definition of domain.

A survey participant is excluded from your analysis because they are not in the domain you are investigating.

If you are comparing the social networks of married women to unmarried, lesbian women, you would drop all men and unmarried women who are not lesbians because they do not fall into the domain of your study. This is not a problem.

An investigator needs to eliminate these people from the survey before talking about any type of missing values or attrition.

Note total sample size, then state number of participants who fit the definition of the study population.

Most surveys have several codes for missing values to distinguish participants who refused to answer, those who answered that they don’t know, those who were valid skips, and those who were skipped by interviewer error.

If a researcher is imputing values, participants who have missing values but are not in the domain being investigated should not have their values imputed. For example, valid skips usually should not be imputed, but a handful of people who were skipped by interviewer error should be imputed. It simply makes no sense to impute the age of first menstruation for men!



The distinctions between types of missing values are lost when a dataset uses only a single code (-9, dot, etc.) regardless of the reason the value is missing, so care in setting up the coding is critical.
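As a rough illustration of keeping the reason for missingness alongside the value, here is a minimal Python sketch. The numeric codes (-1, -2, -4, -5) and the name `split_value` are hypothetical; a real survey documents its own codes in the codebook.

```python
# Hypothetical missing-value codes; a real survey defines its own in the codebook.
MISSING_CODES = {
    -1: "refused",
    -2: "don't know",
    -4: "valid skip",
    -5: "interviewer error",
}

def split_value(raw):
    """Return (value, reason); value is None when raw is a missing code."""
    if raw in MISSING_CODES:
        return None, MISSING_CODES[raw]
    return raw, None

responses = [3, -2, 5, -4, -1, 2]          # made-up answers to one item
decoded = [split_value(r) for r in responses]
reasons = [reason for _, reason in decoded if reason is not None]

print(decoded)
print(reasons)
```

Keeping the reason in a separate field means a later imputation step can, for example, skip the valid skips while still treating interviewer errors as candidates for imputation.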

Imputing values for people who respond that they “don’t know” is especially challenging.

If asked to rate your marital satisfaction on a very satisfied to very dissatisfied scale, one researcher may say that “don’t know” is half way between satisfied and dissatisfied and assign a corresponding value.

A participant’s views may be bipolar, sometimes being extremely satisfied and other times being extremely dissatisfied. Giving them a score that is halfway between satisfied and dissatisfied, or imputing a value for them, may not make sense from this perspective; the scale itself does not make sense to them. Example: family solidarity.

Another time the “don’t know” option is problematic is when answering the question requires special knowledge.

o If a person in the U.S. were asked to rate the average marital satisfaction of women in the Ukraine, the person may answer that they “don’t know” because they are not even sure where the Ukraine is, much less know anything about gender roles and marital satisfaction there.

o This does not mean they are halfway between high and low; it does mean the scale is not meaningful to them. Imputing a value for them would be inappropriate.

Defining the domain, eliminating people who do not fit this domain, and assigning values to missing values need to be done with great care. Your decisions need to be clear to the reader, and far too few papers are clear about how they do this.

Attrition Analysis

A quick review of major family journals indicates that many authors have done little or nothing in the way of attrition analysis. Once a dataset is reduced to those participants who should have data, some analysis of attrition due to missing values is necessary.

If a traditional solution is used such as listwise deletion (drop any participant who has a missing value on any item in your analysis), then those participants who have missing values should be compared to those that you choose to analyze.

This can be done by simple chi-square or t-tests of variables on which you have information.

o For example, if a panel study has four waves and some people were missing for one or two of the waves, then you should compare them to the participants you analyze for the waves on which you have data.

o Is the percentage minority higher in the group you dropped using listwise deletion?

o Are those you analyzed more educated?

o Are women overrepresented among those you analyzed?

Attrition analysis alerts readers to potential biases in your analysis and the limits to the value your research has for generalizing.

If there are few statistically or substantively significant differences between those you drop and those you analyze, then this reassures the reader of the strength of your analysis.
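The comparisons described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the education values and the 2x2 counts are made up, the function names `welch_t` and `chi2_2x2` are hypothetical, and the t statistic shown is the unequal-variance (Welch) form rather than whatever a particular package reports by default.

```python
from math import sqrt, erfc
from statistics import mean, variance

def welch_t(a, b):
    """Welch t statistic for the difference in means of two groups."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def chi2_2x2(table):
    """Pearson chi-square (1 df, no continuity correction) and its p-value."""
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df the upper-tail probability is erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))

# Made-up years of education for analyzed vs. dropped participants.
educ_analyzed = [12, 14, 16, 13, 15, 16, 12, 14]
educ_dropped = [10, 12, 11, 12, 9, 13]
t = welch_t(educ_analyzed, educ_dropped)

# Made-up counts. Rows: analyzed, dropped. Columns: minority, non-minority.
counts = [[30, 170], [25, 75]]
chi2, p = chi2_2x2(counts)

print(f"t = {t:.2f}, chi2 = {chi2:.2f}, p = {p:.3f}")
```

In this invented example the dropped group is less educated and has a higher share of minority participants, which is exactly the kind of difference an attrition analysis should report.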

MCAR, MAR, NMAR

Here is what we are talking about

Table 1

Patterns of Missing Values

Matrix D (data with no missing values)

Y   X1  X2  X3  X4
5   6   4   3   1
4   3   2   3   2
3   1   2   1   3
3   2   1   4   5
1   5   4   5   4

Matrix D’ (data with missing values) and Matrix M (missing pattern)

Y   X1  X2  X3  X4      Dy  M1  M2  M3  M4
5   .   4   3   1       0   1   0   0   0
4   .   .   .   2       0   1   1   1   0
.   1   2   1   .       1   0   0   0   1
3   2   1   4   5       0   0   0   0   0
1   5   4   .   .       0   0   0   1   1
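The relationship between Matrix D’ and Matrix M can be shown with a short Python sketch. The data are the five rows of Table 1; representing a missing value as `None` and the variable names `D_prime`, `M`, and `n_missing` are choices made for this illustration.

```python
# Matrix D' from Table 1, with None standing in for a missing value (".").
names = ["Y", "X1", "X2", "X3", "X4"]
D_prime = [
    [5, None, 4, 3, 1],
    [4, None, None, None, 2],
    [None, 1, 2, 1, None],
    [3, 2, 1, 4, 5],
    [1, 5, 4, None, None],
]

# Matrix M: 1 where the value is missing, 0 where it is observed.
M = [[1 if v is None else 0 for v in row] for row in D_prime]

# Number of missing values per variable (column sums of M).
n_missing = {name: sum(row[j] for row in M) for j, name in enumerate(names)}

for row in M:
    print(row)
print(n_missing)
```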

There are many explanations for patterns of missing values. The procedures we discuss here are appropriate for MCAR and MAR. We will mention procedures appropriate for NMAR.

MCAR is widely used in the statistical literature for data that is missing completely at random. This is rarely a realistic assumption. For example,

We know that men are more likely to have missing values than women. If this is true for your data, then the values that are missing are systematically related to gender and we could not say they were missing completely at random.



If you give each participant a random sample of 75% of the items on a questionnaire (each participant may answer a different subset of items this way), the values that are missing would be missing completely at random (MCAR).

MAR (missing at random) is the minimum condition, and its meaning is quite different from what the name suggests. It should not be confused with MCAR. If your dataset includes variables that “explain” patterns of missingness, then the residual values controlling for these variables will be missing at random. Suppose you know that missingness (not answering items about income) is related to gender, race, education, income, marital status, and occupational category.

You control for gender, race, education, marital status, and occupational category.

After doing this, the missing values on income are not related to income; there is no residual relationship to income after you control for the other “mechanisms” of missingness.

This would be missing at random.

NMAR means that the missing values are not missing at random. Suppose your study does not include education, race, gender, and occupational status as in the last example. Without controlling for these, it is likely that those with missing values on income are either poor or rich. This would mean that not reporting income is related to income, and therefore the missingness would not be ignorable.

NMAR happens when your study does not include variables that “explain” patterns of missingness.

Consider a panel study of math test scores from grades 10-12. Those who are missing at the later grades will likely overrepresent those who do very poorly at math and decide to drop out of school. Math scores would seem to get better each year because the students who are poor at math have a higher attrition rate.



MCAR says that R (pattern of missingness) is not related to X or Y, but only to some random process Z.

MAR says that R (pattern of missingness) is related to X (other variables in model or included in imputations) in part and to some random process Z in part. R may be correlated with Y, but there is no partial relationship controlling for X and Z.

MNAR says that R (pattern of missingness) is related to X (other variables in model or included in imputations) in part, some random process Z in part, but also still related to Y controlling for X and Z.
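One common way to write these three conditions compactly is in terms of the conditional distribution of R. This is a standard textbook formulation, offered here as a sketch rather than as the notation used in the workshop; the random process Z is left implicit as whatever variation in R remains after conditioning.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid X, Y) = P(R) \\
\text{MAR:}\quad  & P(R \mid X, Y) = P(R \mid X) \\
\text{MNAR:}\quad & P(R \mid X, Y) \neq P(R \mid X)
\end{align*}
```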

These can be illustrated with data from Schafer and Graham, Table 1

Blood Pressure Measurements in January (X) and February (Y) with Missing Values Imposed in Three Different Methods

January      February     MCAR for   MAR for    MNAR for
Measure (X)  Measure (Y)  February   February   February
169          148          148        148        148
126          123          -          -          -
132          149          -          -          149
160          169          -          169        169
105          138          -          -          -
116          102          -          -          -
125          88           -          -          -
112          100          -          -          -
133          150          -          -          150
94           113          -          -          -
109          96           -          -          -
109          78           -          -          -
106          148          -          -          148
176          137          -          137        -
128          155          -          -          155
131          131          -          -          -
130          101          101        -          -
145          155          -          155        155
136          140          -          -          -
146          134          -          134        -
111          129          -          -          -
97           85           85         -          -
134          124          124        -          -
153          112          -          112        -
118          118          -          -          -
137          122          122        -          -
101          119          -          -          -
103          106          106        -          -
78           74           74         -          -
151          113          -          113        -
M = 125.7    M = 121.9    M = 108.6  M = 138.3  M = 153.4
Sd = 23.0    Sd = 24.7    Sd = 25.1  Sd = 21.1  Sd = 7.5

[Figure: three path diagrams, labeled MCAR, MAR, and MNAR, relating X, Z, R, and Y]

MCAR is a random sample of 7 observations, so missingness is produced by a completely random process.

MAR is the time-two score for people who scored over 140 in January (X). The R (pattern of missingness) is partly explained by X, but there is no relationship between Y and R that is not “explained” by X.

MNAR has Y measured only for the people who have high scores on Y. Y “explains” the R (pattern of missingness).
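The three mechanisms can be imposed on the blood pressure data with a short Python sketch. The 30 (X, Y) pairs are copied from the table. The MAR rule (X over 140) is stated in the notes; the MNAR rule (Y over 140) is an inference from which values appear in the MNAR column, and the seed used for the MCAR draw is arbitrary, so the MCAR sample will not match the table's particular 7 cases.

```python
import random
from statistics import mean

# (January X, February Y) pairs from the table.
bp = [
    (169, 148), (126, 123), (132, 149), (160, 169), (105, 138), (116, 102),
    (125, 88), (112, 100), (133, 150), (94, 113), (109, 96), (109, 78),
    (106, 148), (176, 137), (128, 155), (131, 131), (130, 101), (145, 155),
    (136, 140), (146, 134), (111, 129), (97, 85), (134, 124), (153, 112),
    (118, 118), (137, 122), (101, 119), (103, 106), (78, 74), (151, 113),
]

# MCAR: keep a purely random sample of 7 February values (arbitrary seed).
rng = random.Random(1)
mcar = [y for _, y in rng.sample(bp, 7)]

# MAR: February is observed only when January (X) exceeded 140.
mar = [y for x, y in bp if x > 140]

# MNAR: February is observed only when February (Y) itself exceeded 140
# (cutoff inferred from the table, not stated in the notes).
mnar = [y for _, y in bp if y > 140]

print("complete Y mean:", round(mean(y for _, y in bp), 1))
print("MAR mean:", round(mean(mar), 1), "n =", len(mar))
print("MNAR mean:", round(mean(mnar), 1), "n =", len(mnar))
```

The MAR and MNAR means reproduce the values in the last row of the table, and both are well above the complete-data February mean, which shows how analyzing only the observed cases can mislead.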

Traditional Approaches

Listwise Deletion

Listwise deletion is the default in virtually all packages. It has the following problems:

1. It reduces power because you are throwing away data, and it inflates standard errors.

2. If missing values are not missing completely at random, the parameter estimates will have a systematic bias. For example, more minorities and men will be missing than white women. If race or gender is related to the outcome variable, the exclusion of minorities and men who have missing values will bias estimates.



Pairwise Deletion

Pairwise deletion includes everybody who answers both items in a pair to estimate the covariance for that pair. It uses everybody who answers an item to estimate the variance of the item. It then puts these variances and covariances together in a variance-covariance matrix and analyzes this matrix.

Problems:

1. Each covariance is based on a different subsample, so what population is being represented is unclear.

2. Since the covariance matrix does not represent a single population, it may not be possible to invert it and the program may fail.

3. It is unclear what degrees of freedom are appropriate. Parts of the model use more information than other parts of the model.
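To make the contrast between listwise and pairwise deletion concrete, the following Python sketch counts the cases each approach would use, reusing Matrix D’ from Table 1 (with `None` for a missing value). The variable names are illustrative only.

```python
from itertools import combinations

names = ["Y", "X1", "X2", "X3", "X4"]
rows = [
    [5, None, 4, 3, 1],
    [4, None, None, None, 2],
    [None, 1, 2, 1, None],
    [3, 2, 1, 4, 5],
    [1, 5, 4, None, None],
]

# Listwise deletion keeps only rows with no missing value on any variable.
listwise_n = sum(all(v is not None for v in row) for row in rows)

# Pairwise deletion keeps, for each pair, the rows observed on both variables.
pairwise_n = {
    (names[i], names[j]): sum(r[i] is not None and r[j] is not None for r in rows)
    for i, j in combinations(range(len(names)), 2)
}

print("listwise n:", listwise_n)
for pair, n in pairwise_n.items():
    print(pair, n)
```

Each covariance rests on a different number (and a different set) of cases, which is why the population represented and the appropriate degrees of freedom are unclear.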

Mean Substitution

Mean substitution replaces each missing value with the mean of that variable.

1. This greatly reduces the variance of a predictor, and this weakens explanatory power.

2. It will attenuate parameter estimates in the bivariate case, but with several predictors it may make some bigger than they should be and others smaller.

3. The mean is often a terrible choice for a missing value. Those missing income often have very low or very high incomes, but those with average incomes usually report it.

4. This keeps the full N and degrees of freedom, but in doing so it does not account for the fact that the imputed cases contribute no variance because they all received the mean value.
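The variance reduction in point 1 is easy to see with a small Python sketch. The income values (in thousands) are made up for illustration.

```python
from statistics import mean, variance

# Made-up incomes in thousands; None marks a missing value.
income = [20, 35, None, 50, None, 65, 80]

observed = [v for v in income if v is not None]
m = mean(observed)

# Mean substitution: every missing value gets the observed mean.
filled = [m if v is None else v for v in income]

var_observed = variance(observed)   # based on the 5 observed cases
var_filled = variance(filled)       # based on all 7 cases after substitution

print("mean:", m)
print("variance, observed cases:", var_observed)
print("variance, after mean substitution:", var_filled)
```

The mean is unchanged, but the variance drops because the substituted cases add nothing to the sum of squares while still counting toward N.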

Substituting a mean for a subgroup will mitigate these problems, but only to a degree. For example, divide a sample of women by marital status and substitute missing values on income with the mean of each marital type.

Dummy Variable with Mean Substitution

Mean substitution replaces each missing value with the mean of that variable, and a dummy variable is added, coded 1 if the variable is missing and 0 if it is not. This keeps all the cases, but this is misleading. It will produce the same parameter estimates as listwise deletion, and the B’s for the dummy variables will show how much those missing a value on a variable deviate from the mean for those that have no missing values.

1. The parameter estimates are still potentially biased.

2. The degrees of freedom (cases) are exaggerated.

Regression Imputation

A multiple regression is done to predict each variable in the model. These equations are then used to impute missing values.

1. This approach does nothing about the uncertainty of the imputation process. If R² = .90, the imputed values might be pretty good. If R² = .10, the imputed missing values will not be very good.

2. The predicted values are a function of the variables already in the model and hence these values are not really contributing an independent effect for the variables involving imputations.

Transitional Approach: Single Imputation Using the EM Algorithm

Discuss single imputation and how uncertainty is introduced.

1. This approach is better than the previous approaches and yields unbiased parameter estimates.

2. This approach still has biased standard errors.

Readers should know that in an article in The American Statistician, von Hippel (2004) is highly critical of the way SPSS implements EM in the MVA module. He states:

The final method, expectation maximization (EM), produces asymptotically unbiased estimates, but EM’s implementation in MVA is limited to point estimates (without standard errors) of means, variances, and covariances. MVA can also impute values using the EM algorithm, but values are imputed without residual variation, so analyses that use the imputed values can be biased (von Hippel, 2004, p. 160).

von Hippel acknowledges that although SPSS does not add the residual variation appropriately, it makes an adjustment later in the process. If a researcher chooses to do single imputation, there are freeware programs available that may be superior to SPSS, although not as user friendly. An example is Graham’s program EMCOV, available at http://methodology.psu.edu/downloads/EMCOV.html. However, even Graham recommends that users should use multiple imputation when it is appropriate.

Data Used for this Presentation

Dataset for this workshop is nlsy97missing.dta. This is a subset of data from the NLSY97 dataset. Here is a condensed codebook:

. codebook, compact

Variable    Obs  Unique      Mean     Min      Max  Label
---------------------------------------------------------------------------------
pubid      8984    8984  4504.302       1     9022  youth public id code
sampwt97   8984    3920  215699.6   32330  1575942  round 1 sampling weight 1997
age97      8984       7  14.35363      12       18  age at interview date 1997
gender97   8984       2   1.48809       1        2  youth gender 1997
hhsize97   8984      16  4.548976       1       16  household size 1997
hhin97     6588    2242   46361.7  -48100   246474  household income 1997
dinner97   5356       8   5.07823       0        7  # days/wk dinner w/family 1997
fun97      5356       8  2.710045       0        7  # days/wk fun as a family 1997
psmoke97   8871       5  2.611656       1        5  % peers smoke 1997
pdrink97   8799       5  2.136152       1        5  % peers drunk 1+/month 1997
psport97   8943       5  3.688695       1        5  % peers sports, clubs 1997
pgang97    8812       5  1.594757       1        5  % peers belong to gang 1997
pcoll97    8866       5  3.568915       1        5  % peers plan college 1997
pvol97     8838       5   2.09131       1        5  % peers volunteer 1997
pdrug97    8758       5  2.307376       1        5  % peers use illegal drugs 1997
pcut97     8920       5  2.408184       1        5  % peers cut class/school 1997
hwwdy97    4717       5  3.830189       1        5  # weekdays do homework 1997
hwwenh97   4720      18   .828178       0       90  weekend hours do homework 1997
smday97    3497      31  6.942808       0       30  # days smoke last 30 days 1997
drday97    3819      27  1.834512       0       30  # days drank alc-30 days 1997
maday97    1785      31    4.0493       0       30  # days used marij-30 days 1997
---------------------------------------------------------------------------------

We will focus on two software packages, Norm and Stata. Norm is a freeware program available at: http://www.stat.psu.edu/~jls/misoftwa.html



Working with Missing Items in a Scale

Stata

The spost commands discussed in the Workshop on Categorical and Count Dependent Variables include a command called misschk. This is good to run when first creating a scale. Suppose we want to assess negative peer influence. The NLSY97 has 8 items on which students, aged 12-18, rated the percentage of their schoolmates who did various things. A score of 1 reflects 0-19%, a 2 reflects 20-39%, a 3 reflects 40-59%, a 4 reflects 60-79%, and a 5 reflects 80-100%. Here is the command (not available from a menu):

misschk psmoke97-pcut97, gen(m_) dummy help

misschk           This is the name of the command
psmoke97-pcut97   These are the variables
gen(m_)           Generates a variable for how many observations are missing items
dummy             A dummy variable for whether each item is missing or not for each observation
help              Prints out the names of the new variables and what they are

. misschk psmoke97-pcut97, gen(m_) dummy help

Variables examined for missing values

 #  Variable     # Missing   % Missing
--------------------------------------------
 1  psmoke97        113         1.3
 2  pdrink97        185         2.1
 3  psport97         41         0.5
 4  pgang97         172         1.9
 5  pcoll97         118         1.3
 6  pvol97          146         1.6
 7  pdrug97         226         2.5
 8  pcut97           64         0.7



The columns in the table below correspond to the # in the table above.
If a column is blank, there were no missing cases for that variable.

Missing for |
      which |
 variables? |      Freq.     Percent        Cum.
------------+-----------------------------------
  12345 678 |         17        0.19        0.19
  123_5 678 |          1        0.01        0.20
  123__ ___ |          1        0.01        0.21
  12_45 678 |          1        0.01        0.22
  12_45 67_ |          4        0.04        0.27
  12_45 6__ |          1        0.01        0.28
  12_45 _7_ |          1        0.01        0.29
  12_4_ 678 |          1        0.01        0.30
  12_4_ 67_ |          2        0.02        0.32
  12_4_ _78 |          2        0.02        0.35
  12_4_ _7_ |          9        0.10        0.45
  12_4_ ___ |          3        0.03        0.48
  12__5 67_ |          1        0.01        0.49
  12__5 6__ |          1        0.01        0.50
  12__5 _78 |          3        0.03        0.53
  12__5 ___ |          1        0.01        0.55
  12___ 678 |          1        0.01        0.56
  12___ 67_ |          3        0.03        0.59
  12___ 6_8 |          1        0.01        0.60
  12___ _78 |          3        0.03        0.63
  12___ _7_ |          6        0.07        0.70
  12___ ___ |         15        0.17        0.87
  1_345 678 |          1        0.01        0.88
  1_3_5 67_ |          1        0.01        0.89
  1__45 ___ |          1        0.01        0.90
  1__4_ 67_ |          1        0.01        0.91
  1__4_ _7_ |          3        0.03        0.95
  1__4_ ___ |          1        0.01        0.96
  1___5 _7_ |          1        0.01        0.97
  1___5 ___ |          1        0.01        0.98
  1____ _7_ |          6        0.07        1.05
  1____ __8 |          2        0.02        1.07
  1____ ___ |         17        0.19        1.26
  _2345 678 |          3        0.03        1.29
  _234_ 67_ |          1        0.01        1.30
  _234_ _7_ |          1        0.01        1.31
  _234_ __8 |          1        0.01        1.32
  _23__ ___ |          1        0.01        1.34
  _2_45 67_ |          2        0.02        1.36
  _2_45 _78 |          1        0.01        1.37
  _2_45 _7_ |          1        0.01        1.38
  _2_4_ 678 |          1        0.01        1.39
  _2_4_ 67_ |          3        0.03        1.42
  _2_4_ 6__ |          1        0.01        1.44
  _2_4_ _78 |          2        0.02        1.46
  _2_4_ _7_ |         11        0.12        1.58
  _2_4_ __8 |          1        0.01        1.59
  _2_4_ ___ |         11        0.12        1.71
  _2__5 67_ |          1        0.01        1.73
  _2__5 _7_ |          1        0.01        1.74
  _2__5 __8 |          1        0.01        1.75
  _2__5 ___ |          3        0.03        1.78
  _2___ 678 |          3        0.03        1.81
  _2___ 6__ |          6        0.07        1.88
  _2___ _7_ |         18        0.20        2.08
  _2___ __8 |          1        0.01        2.09
  _2___ ___ |         32        0.36        2.45
  __345 ___ |          1        0.01        2.46
  __34_ 678 |          1        0.01        2.47
  __34_ 67_ |          1        0.01        2.48
  __34_ _7_ |          2        0.02        2.50
  __34_ ___ |          1        0.01        2.52
  __3_5 6_8 |          1        0.01        2.53
  __3__ 67_ |          1        0.01        2.54
  __3__ _7_ |          1        0.01        2.55
  __3__ ___ |          4        0.04        2.59
  ___45 67_ |          1        0.01        2.60
  ___45 6__ |          3        0.03        2.64
  ___45 _7_ |          2        0.02        2.66
  ___45 ___ |          3        0.03        2.69
  ___4_ 67_ |          1        0.01        2.70
  ___4_ 6__ |          8        0.09        2.79
  ___4_ _78 |          3        0.03        2.83
  ___4_ _7_ |         13        0.14        2.97
  ___4_ __8 |          1        0.01        2.98
  ___4_ ___ |         43        0.48        3.46
  ____5 678 |          2        0.02        3.48
  ____5 67_ |          1        0.01        3.50
  ____5 6__ |          7        0.08        3.57
  ____5 _7_ |          8        0.09        3.66
  ____5 __8 |          1        0.01        3.67
  ____5 ___ |         39        0.43        4.11
  _____ 678 |          1        0.01        4.12
  _____ 67_ |          9        0.10        4.22
  _____ 6__ |         51        0.57        4.79
  _____ _78 |          5        0.06        4.84
  _____ _7_ |         57        0.63        5.48
  _____ __8 |          2        0.02        5.50
  _____ ___ |      8,490       94.50      100.00
------------+-----------------------------------
      Total |      8,984      100.00

Table indicates the number of variables for which an observation has missing data.

Missing for |
   how many |
 variables? |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      8,490       94.50       94.50
          1 |        245        2.73       97.23
          2 |        122        1.36       98.59
          3 |         46        0.51       99.10
          4 |         35        0.39       99.49
          5 |         18        0.20       99.69
          6 |          5        0.06       99.74
          7 |          6        0.07       99.81
          8 |         17        0.19      100.00
------------+-----------------------------------
      Total |      8,984      100.00

Variables created:
  m_pattern is a string variable showing the pattern of missing data.
  m_number is the number of variables for which a case has missing data.
  m_<varnm> is a binary variable indicating missing data for <varnm>.

. codebook m_*, compact

Variable     Obs  Unique      Mean  Min  Max  Label
---------------------------------------------------------------------------------
m_psmoke97  8984       2  .0125779    0    1  Missing value for psmoke97?
m_pdrink97  8984       2  .0205922    0    1  Missing value for pdrink97?
m_psport97  8984       2  .0045637    0    1  Missing value for psport97?
m_pgang97   8984       2  .0191451    0    1  Missing value for pgang97?
m_pcoll97   8984       2  .0131345    0    1  Missing value for pcoll97?
m_pvol97    8984       2  .0162511    0    1  Missing value for pvol97?
m_pdrug97   8984       2  .0251558    0    1  Missing value for pdrug97?
m_pcut97    8984       2  .0071238    0    1  Missing value for pcut97?
m_pattern   8984      89         .    .    .  Missing for which variables?
m_number    8984       9  .1185441    0    8  Missing for how many variables?
---------------------------------------------------------------------------------

It appears that quite a few people skipped one or two of the items. Very few skipped more than two. We could construct our variable using this scale if we have at least 6 of the 8 items answered (75%) and we would only lose about 1.4% of the observations.
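The "at least 6 of 8 items" rule can be sketched in Python as follows. This is an illustration of the idea (score = mean of the answered items, missing if too few were answered), not a reproduction of any particular command; the function name `scale_score` and the example responses are hypothetical.

```python
def scale_score(items, min_items):
    """Mean of the answered items, or None if fewer than min_items were answered."""
    answered = [v for v in items if v is not None]
    if len(answered) < min_items:
        return None
    return sum(answered) / len(answered)

# Made-up responses to 8 items scored 1-5; None marks a skipped item.
two_skipped = [1, 2, None, 2, 3, None, 1, 3]      # 6 answered: gets a score
three_skipped = [1, None, None, 2, 3, None, 1, 3]  # 5 answered: set to missing

print(scale_score(two_skipped, min_items=6))
print(scale_score(three_skipped, min_items=6))
```

Because the score is a mean rather than a sum, a participant who skipped one or two items stays on the same 1-5 metric as everyone else.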

In Stata we can use the command alpha (illustrate this using the menus). This is a much more capable command than reliability is in SPSS.

We can let the program decide if any items need to be reverse coded.
We can indicate which ones need to be reverse coded.
If it reverses any items, we lose the simple interpretation of the score.
We get the usual information we got from SPSS.

alpha psmoke97-pcut97, detail generate(peers_neg) item label min(6)

Test scale = mean(unstandardized items)

Items        | S  it-cor  ir-cor  ii-cov   alpha  label
-------------+-------------------------------------------------------------------
psmoke97     | +  0.745   0.616   .34491   0.710  % peers smoke 1997
pdrink97     | +  0.749   0.626   .34652   0.708  % peers drunk 1+/month 1997
psport97     | -  0.372   0.203   .47517   0.780  % peers sports, clubs 1997
pgang97      | +  0.549   0.413   .42652   0.748  % peers belong to gang 1997
pcoll97      | -  0.495   0.335   .43752   0.761  % peers plan college 1997
pvol97       | -  0.399   0.221   .46644   0.779  % peers volunteer 1997
pdrug97      | +  0.794   0.682   .3253    0.695  % peers use illegal drugs 1997
pcut97       | +  0.711   0.572   .35653   0.719  % peers cut class/school 1997
-------------+-------------------------------------------------------------------
Test scale   |                    .39735   0.765  mean(unstandardized items)
---------------------------------------------------------------------------------

I will drop the positive items because they are measuring a different dimension.

alpha psmoke97 pdrink97 pgang97 pdrug97 pcut97, detail ///
    generate(negpeers) item label min(4)

Test scale = mean(unstandardized items)

Items        | S  it-cor  ir-cor  ii-cov   alpha  label
-------------+-------------------------------------------------------------------
psmoke97     | +  0.816   0.688   .71839   0.793  % peers smoke 1997
pdrink97     | +  0.826   0.707   .71771   0.788  % peers drunk 1+/month 1997
pgang97      | +  0.590   0.429   .97742   0.856  % peers belong to gang 1997
pdrug97      | +  0.859   0.752   .6676    0.773  % peers use illegal drugs 1997
pcut97       | +  0.783   0.637   .75332   0.808  % peers cut class/school 1997
-------------+-------------------------------------------------------------------
Test scale   |                    .76691   0.839  mean(unstandardized items)
---------------------------------------------------------------------------------

Interitem covariances (obs=pairwise, see below)

            psmoke97  pdrink97   pgang97   pdrug97    pcut97
psmoke97      1.6436
pdrink97      1.0436    1.5548
 pgang97      0.4247    0.3980    0.9645
 pdrug97      1.0549    1.0926    0.5052    1.7159
  pcut97      0.8347    0.8284    0.4734    1.0124    1.6349

Pairwise number of observations

            psmoke97  pdrink97   pgang97   pdrug97    pcut97
psmoke97        8773
pdrink97        8731      8749
 pgang97        8714      8690      8732
 pdrug97        8696      8672      8655      8714
  pcut97        8769      8745      8728      8710      8787

. label var negpeers "Negative Peers"

. tab negpeers, m

   Negative |
      Peers |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,199       13.35       13.35
        1.2 |        737        8.20       21.55
       1.25 |         21        0.23       21.78
        1.4 |        688        7.66       29.44
        1.5 |         18        0.20       29.64
        1.6 |        577        6.42       36.06
       1.75 |         18        0.20       36.26
        1.8 |        584        6.50       42.76
          2 |        550        6.12       48.89
        2.2 |        562        6.26       55.14
       2.25 |         16        0.18       55.32
        2.4 |        548        6.10       61.42
        2.5 |         11        0.12       61.54
        2.6 |        567        6.31       67.85
       2.75 |         10        0.11       67.97
        2.8 |        507        5.64       73.61
          3 |        462        5.14       78.75
        3.2 |        380        4.23       82.98
       3.25 |          6        0.07       83.05
        3.4 |        325        3.62       86.67
        3.5 |          6        0.07       86.73
        3.6 |        273        3.04       89.77
       3.75 |          6        0.07       89.84
        3.8 |        207        2.30       92.14
          4 |        198        2.20       94.35
        4.2 |        115        1.28       95.63
       4.25 |          4        0.04       95.67
        4.4 |         76        0.85       96.52
        4.5 |          1        0.01       96.53
        4.6 |         50        0.56       97.08
       4.75 |          5        0.06       97.14
        4.8 |         41        0.46       97.60
          5 |         23        0.26       97.85
          . |        193        2.15      100.00
------------+-----------------------------------
      Total |      8,984      100.00

. sum negpeers

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    negpeers |      8791    2.215675    .9578616          1          5

This way I’ve only lost 2.15% of the observations to missing values and I have a nice scale. The M = 2.22 is just over 2 (20-39%) and there appears to be good variance. The problem with the distribution is the excess of 1’s.

. histogram negpeers, width(1) start(1) frequency

[Figure: histogram of Negative Peers; x-axis: Negative Peers (1 to 5), y-axis: Frequency (0 to 4000)]

Multiple Imputation using SAS

Starting with SAS 8.2, SAS provides the MI and MIANALYZE procedures for multiple imputation. There is nothing difficult about this process or what NORM requires; it just takes time. It is reasonable to assume that all major software packages will implement an integrated way of doing this in the next few years.

How is the imputation performed? SAS offers three methods: (a) regression model, (b) propensity score method, and (c) a collection of techniques called Markov chain Monte Carlo (MCMC). The MCMC approach is advocated by Schafer (1997) and this is the approach he implements in NORM.

I will not go through using SAS, but this is a good solution for those familiar with SAS



Multiple Imputations using Norm

The process

We first need to put the data into an ASCII file that is space delimited. You can do this with SPSS or any program. I will do this with Stata using the File Export dialog to produce the file (I first made all the missing values -9999).


Step 1: Create 5 to 10 datasets using data augmentation.

Step 2: Estimate your model (regression, logistic regression, etc.) separately for each of the 5 to 10 datasets.

Step 3: Compute pooled estimates of your 5 to 10 solutions.
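The pooling in Step 3 is usually done with the standard combining rules for multiple imputation (often called Rubin's rules); the notes do not spell them out, so treating them as the intended method is an assumption. The sketch below pools one coefficient across imputed datasets. The coefficient and standard error values are made up, and the function name `pool` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pool(estimates, std_errors):
    """Pool one parameter across m imputed datasets.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = mean(estimates)                        # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # variance across imputations
    total = within + (1 + 1 / m) * between         # total variance
    return q_bar, sqrt(total)

# Made-up results for one regression coefficient from 3 imputed datasets.
b = [0.50, 0.54, 0.46]
se = [0.10, 0.11, 0.09]

b_pooled, se_pooled = pool(b, se)
print(f"pooled b = {b_pooled:.3f}, pooled se = {se_pooled:.4f}")
```

The between-imputation term is what single imputation leaves out: the more the estimates disagree across imputed datasets, the larger the pooled standard error becomes.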


outfile hhsize97 hhin97 dinner97 fun97 hwwdy97 hwwenh97 ///
    smday97 drday97 maday97 negpeers age97 gender97 ///
    using "F:\flash\LDV\missing.dat", nolabel replace wide

This produces an ascii file and here is part of that file

Create 3 Imputed Datasets

Normally, we would do 5 and preferably 10, but this is a tedious process so for illustration purposes we will just do 3 imputed datasets.

Double click on the Norm.exe file or icon if you have it on your desktop.

Click on File
Click on “New Session”
Locate the file, missing.dat, and open it.

I changed the “Missing value code” to -9999.

Click on the “Variables” tab. Click on the names of the variables and replace them with the ones you wrote into the ASCII file. It is essential to use exactly the same order.



Under the “In model” column you can double click to take a variable out of the model. We will keep all of them. I did this for hhsize97 and then couldn’t undo it. The variables that will be included are in the far right column.

You could do transformations; click on a “none” to see the options.

Pick options for “Rounding”.

File
o Save as
o Missing.nrm. The “nrm” is important.

Click on the “Summarize” tab and click “Run”.

This will produce useful information for evaluating the missing values

**************************************************
NORM Version 2.03 for Windows 95/98/NT
Output from SUMMARIZE procedure
untitled
Data from file: F:\flash\LDV\missing.dat
Tuesday, 27 December 2005
18:51:39
**************************************************

NUMBER OF OBSERVATIONS = 8984
NUMBER OF VARIABLES = 12

           NUMBER MISSING   % MISSING
hhsize97          0            0.00
hhin97         2396           26.67
dinner97       3628           40.38
fun97          3628           40.38
hwwdy97        4267           47.50
hwwenh97       4264           47.46
smday97        5487           61.08
drday97        5165           57.49
maday97        7199           80.13
negpeers          0            0.00
age97             0            0.00
gender97          0            0.00

MATRIX OF MISSINGNESS PATTERNS
1=OBSERVED 0=MISSING
COUNT=NUMBER OF OBSERVATIONS WITH THE SPECIFIED PATTERN

COUNT
  289  1 1 1 1 1 1 1 1 1 1 1 1
   86  1 0 1 1 1 1 1 1 1 1 1 1
    1  1 1 0 1 1 1 1 1 1 1 1 1
    2  1 1 0 0 1 1 1 1 1 1 1 1
    4  1 0 0 0 1 1 1 1 1 1 1 1
   87  1 1 1 1 0 0 1 1 1 1 1 1
   33  1 0 1 1 0 0 1 1 1 1 1 1
  705  1 1 0 0 0 0 1 1 1 1 1 1
  232  1 0 0 0 0 0 1 1 1 1 1 1
   25  1 1 1 1 1 1 0 1 1 1 1 1
    8  1 0 1 1 1 1 0 1 1 1 1 1
    1  1 1 0 1 1 1 0 1 1 1 1 1
    1  1 1 0 0 1 1 0 1 1 1 1 1
    5  1 1 1 1 0 0 0 1 1 1 1 1
    2  1 0 1 1 0 0 0 1 1 1 1 1
   78  1 1 0 0 0 0 0 1 1 1 1 1
   38  1 0 0 0 0 0 0 1 1 1 1 1
   51  1 1 1 1 1 1 1 0 1 1 1 1
   18  1 0 1 1 1 1 1 0 1 1 1 1
    1  1 1 0 0 1 1 1 0 1 1 1 1
    1  1 0 0 0 1 1 1 0 1 1 1 1
   20  1 1 1 1 0 0 1 0 1 1 1 1
    2  1 0 1 1 0 0 1 0 1 1 1 1
   47  1 1 0 0 0 0 1 0 1 1 1 1
   11  1 0 0 0 0 0 1 0 1 1 1 1
    7  1 1 1 1 1 1 0 0 1 1 1 1
    5  1 0 1 1 1 1 0 0 1 1 1 1
    1  1 1 1 1 1 0 0 0 1 1 1 1
    6  1 1 1 1 0 0 0 0 1 1 1 1
    2  1 0 1 1 0 0 0 0 1 1 1 1
   12  1 1 0 0 0 0 0 0 1 1 1 1
    4  1 0 0 0 0 0 0 0 1 1 1 1
  387  1 1 1 1 1 1 1 1 0 1 1 1
  112  1 0 1 1 1 1 1 1 0 1 1 1
    2  1 1 0 1 1 1 1 1 0 1 1 1
    2  1 0 0 0 1 1 1 1 0 1 1 1
    1  1 1 1 1 0 1 1 1 0 1 1 1
   55  1 1 1 1 0 0 1 1 0 1 1 1
   20  1 0 1 1 0 0 1 1 0 1 1 1
    1  1 1 0 1 0 0 1 1 0 1 1 1
  410  1 1 0 0 0 0 1 1 0 1 1 1
  149  1 0 0 0 0 0 1 1 0 1 1 1
  365  1 1 1 1 1 1 0 1 0 1 1 1
  122  1 0 1 1 1 1 0 1 0 1 1 1
    1  1 1 1 0 1 1 0 1 0 1 1 1
    1  1 1 0 0 1 1 0 1 0 1 1 1
    4  1 0 0 0 1 1 0 1 0 1 1 1
    1  1 1 1 1 0 1 0 1 0 1 1 1
   52  1 1 1 1 0 0 0 1 0 1 1 1
   24  1 0 1 1 0 0 0 1 0 1 1 1
  368  1 1 0 0 0 0 0 1 0 1 1 1
  145  1 0 0 0 0 0 0 1 0 1 1 1
  333  1 1 1 1 1 1 1 0 0 1 1 1
   91  1 0 1 1 1 1 1 0 0 1 1 1
    2  1 1 1 0 1 1 1 0 0 1 1 1
    2  1 0 0 0 1 1 1 0 0 1 1 1
    1  1 1 1 1 0 1 1 0 0 1 1 1
   59  1 1 1 1 0 0 1 0 0 1 1 1
   21  1 0 1 1 0 0 1 0 0 1 1 1
    1  1 0 0 1 0 0 1 0 0 1 1 1
  175  1 1 0 0 0 0 1 0 0 1 1 1
   83  1 0 0 0 0 0 1 0 0 1 1 1
 2022  1 1 1 1 1 1 0 0 0 1 1 1
  740  1 0 1 1 1 1 0 0 0 1 1 1
    4  1 1 0 1 1 1 0 0 0 1 1 1
    2  1 0 0 1 1 1 0 0 0 1 1 1
    4  1 1 1 0 1 1 0 0 0 1 1 1
    2  1 0 1 0 1 1 0 0 0 1 1 1
    9  1 1 0 0 1 1 0 0 0 1 1 1
    7  1 0 0 0 1 1 0 0 0 1 1 1
    1  1 0 1 1 0 1 0 0 0 1 1 1
    2  1 0 1 0 0 1 0 0 0 1 1 1
    2  1 1 1 1 1 0 0 0 0 1 1 1
  209  1 1 1 1 0 0 0 0 0 1 1 1
   78  1 0 1 1 0 0 0 0 0 1 1 1
    1  1 1 0 1 0 0 0 0 0 1 1 1
    2  1 0 1 0 0 0 0 0 0 1 1 1
  784  1 1 0 0 0 0 0 0 0 1 1 1
  340  1 0 0 0 0 0 0 0 0 1 1 1



MEANS AND STANDARD DEVIATIONS OF OBSERVED DATA

              MEAN      ST.DEV.
hhsize97   4.54898      1.54016
hhin97     46361.7      42143.5
dinner97   5.07823      2.27363
fun97      2.71004      2.06815
hwwdy97    3.83019      1.18201
hwwenh97   0.828178     2.11733
smday97    6.94281      11.2727
drday97    1.83451      3.84684
maday97    4.04930      7.96396
negpeers   2.16808      1.00050
age97      14.3536      1.48814
gender97   1.48809      0.499886

Click the tab “EM algorithm”. This results in the following:

**************************************************NORM Version 2.03 for Windows 95/98/NTOutput from EM ALGORITHMuntitledData from file: F:\flash\LDV\missing.datTuesday, 27 December 200518:56:10************************************************** NUMBER OF OBSERVATIONS = 8984NUMBER OF VARIABLES = 12 VARIABLES IN MODEL hhsize97 no transformationhhin97 no transformationdinner97 no transformationfun97 no transformationhwwdy97 no transformationhwwenh97 no transformationsmday97 no transformationdrday97 no transformationmaday97 no transformationnegpeers no transformationage97 no transformationgender97 no transformation



MEANS AND STANDARD DEVIATIONS OF OBSERVED DATA

            MEAN        ST.DEV.
hhsize97    4.54898     1.54016
hhin97      46361.7     42143.5
dinner97    5.07823     2.27363
fun97       2.71004     2.06815
hwwdy97     3.83019     1.18201
hwwenh97    0.828178    2.11733
smday97     6.94281     11.2727
drday97     1.83451     3.84684
maday97     4.04930     7.96396
negpeers    2.16808     1.00050
age97       14.3536     1.48814
gender97    1.48809     0.499886

**************************************************
STARTING VALUES: DEFAULT
The default starting values use the observed data means and standard
deviations reported above; starting values for all covariances are zero
**************************************************

Maximum number of iterations: 1000
Convergence criterion (maximum relative parameter change): 0.000100
Method: maximum likelihood

ITERATION HISTORY
ITERATION #    OBSERVED-DATA LOGLIKELIHOOD
    1          -35881.00000
    2          -34207.82904
    3          -33999.76523
    (iterations 4-403 omitted)
  404          -33766.16773
  405          -33766.16772

EM CONVERGED AT ITERATION 405
**************************************************

EM PARAMETER ESTIMATES

            MEAN        ST.DEV.
hhsize97    4.54898     1.54007
hhin97      46601.6     42147.0
dinner97    4.86608     2.28890
fun97       2.48676     2.08578
hwwdy97     3.75480     1.18629
hwwenh97    0.824143    2.12757
smday97     5.02342     11.4137
drday97     1.33966     3.86059
maday97     1.33428     7.89862
negpeers    2.16808     1.00044
age97       14.3536     1.48806
gender97    1.48809     0.499858

COVARIANCE MATRIX

            hhsize97      hhin97        dinner97      fun97         hwwdy97
hhsize97    2.372
hhin97      1416.         0.1776E+10
dinner97    0.1055        5522.         5.239
fun97       0.8483E-01    -5753.        1.465         4.350
hwwdy97     -0.4170E-01   5564.         0.2100        0.1607        1.407
hwwenh97    -0.3007E-01   3550.         0.1195        0.1182        0.2914
smday97     -0.9310       -0.1843E+05   -3.427        -3.391        -1.197
drday97     -0.1257       -456.6        -1.084        -1.026        -0.3483
maday97     -0.6772       -0.2148E+05   -3.115        -0.6828       -1.655
negpeers    -0.4892E-01   -3864.        -0.4054       -0.2870       -0.1599
age97       -0.8004E-01   1717.         -0.4586       -0.4928       -0.1254
gender97    -0.1810E-02   -299.5        -0.8623E-01   -0.1820E-01   0.6015E-01

            hwwenh97      smday97       drday97       maday97       negpeers
hwwenh97    4.527
smday97     -6.216        130.3
drday97     -0.4451       12.77         14.90
maday97     -2.074        30.51         10.32         62.39
negpeers    -0.7902E-01   3.455         0.8443        2.026         1.001
age97       0.4423E-01    4.498         0.9621        2.188         0.6335
gender97    0.7897E-01    -0.9980E-01   -0.8487E-01   -0.5312       0.4783E-01

            age97         gender97
age97       2.214
gender97    0.6382E-02    0.2499

EM PARAMETER ESTIMATES WRITTEN TO: F:\flash\LDV\em.prm

Next we:

Click the tab for “Data Augmentation”
Click “Computing”



Click “Number of Iterations”

o Here you pick a number that is greater than 3 times the number of iterations in the EM step. This is because we are only doing 3 imputations. 3 x 450 = 1350. I picked 2000.

o If we were doing 10 imputations, we would compute 10 x 450 = 4500 and pick 5000.

Click “OK”
Click on “Imputation”

o I click on “Impute at every kth iteration” and set k = 600 (more than the 450 in the EM part). If you want to see the 3 imputed datasets as they are imputed, you would click the “View imputations immediately” button.

o Click OK
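
The rule of thumb in these steps can be written down as a small calculation. The following is only a sketch in Python (not part of Norm or Stata); the function name da_iteration_plan is hypothetical, and it assumes the rule as stated in these notes: the spacing k between imputations should exceed the number of EM iterations, and the chain must run for at least k times the number of imputations.

```python
def da_iteration_plan(em_iterations, n_imputations):
    """Return (k, total) for a data augmentation run.

    Assumption taken from the notes: k must be larger than the number of
    iterations EM needed, and the chain needs at least n_imputations * k
    iterations so that every k-th draw yields one imputed dataset.
    """
    k = em_iterations + 1          # smallest spacing that exceeds the EM count
    total = k * n_imputations      # minimum chain length for that spacing
    return k, total


k, total = da_iteration_plan(em_iterations=450, n_imputations=3)
print(k, total)
```

These are minimums. Rounding up generously, as the notes do with k = 600 and 2000 iterations, still satisfies both conditions because 3 x 600 = 1800 is below 2000.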

This will save 3 files: missing_1.imp, missing_2.imp, and missing_3.imp.

Here are the first five observations from the first imputed dataset. Somehow I did not get hhsize97 but start with hhin97. I’m not sure what happened here and would normally start over if I needed that variable, but I do not so I won’t.



76000  5  2  5  2  24  3  -12  1.20  15  2
55000  7  7  2  2   0  1    3  3.60  13  1
27700  7  2  1  0   0  3    0  1.80  13  2
29600  4  0  4  1  10  2    2  3.40  15  1
96163  5  4  4  1   5  3   15  2.20  14  2

Importing the Imputed Datasets and Running the Models

I will import these files into Stata and run my model separately for each file.

Here are the results for the three imputed datasets:

. infile hhin97 dinner97 fun97 hwwday97 hwwenh97 smday97 drday97 maday97 negpeers age97 gender97 using "F:\flash\LDV\missing_1.imp", clear
(8984 observations read)

. do "C:\DOCUME~1\ALANAC~1\LOCALS~1\Temp\STD03000000.tmp"



. regress drday97 dinner97 fun97 negpeers age97 gender97, beta

      Source |       SS       df       MS              Number of obs =    8984
-------------+------------------------------           F(  5,  8978) =  132.16
       Model |  9276.88704     5  1855.37741           Prob > F      =  0.0000
    Residual |  126041.143  8978  14.0388887           R-squared     =  0.0686
-------------+------------------------------           Adj R-squared =  0.0680
       Total |   135318.03  8983  15.0637905           Root MSE      =  3.7469

------------------------------------------------------------------------------
     drday97 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
    dinner97 |  -.1270517   .0192573    -6.60   0.000                -.0715165
       fun97 |  -.1253703   .0209537    -5.98   0.000                -.0638361
    negpeers |   .6241382   .0442888    14.09   0.000                 .1608902
       age97 |   .2182099    .029564     7.38   0.000                 .0836667
    gender97 |  -.5827233   .0796335    -7.32   0.000                -.0750527
       _cons |  -1.284485   .4413611    -2.91   0.004                        .
------------------------------------------------------------------------------

. end of do-file

. do "C:\DOCUME~1\ALANAC~1\LOCALS~1\Temp\STD03000000.tmp"

. infile hhin97 dinner97 fun97 hwwday97 hwwenh97 smday97 ///
> drday97 maday97 negpeers age97 gender97 ///
> using "F:\flash\LDV\missing_2.imp", clear
(8984 observations read)



------------------------------------------------------------------------------
     drday97 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
    dinner97 |   -.115675   .0192072    -6.02   0.000                -.0653073
       fun97 |  -.1693357   .0209578    -8.08   0.000                -.0862223
    negpeers |   .7323058   .0439953    16.65   0.000                 .1890819
       age97 |   .1678435   .0294491     5.70   0.000                 .0644602
    gender97 |  -.4886166   .0790318    -6.18   0.000                -.0630349
       _cons |  -.9424769   .4413855    -2.14   0.033                        .
------------------------------------------------------------------------------



. infile hhin97 dinner97 fun97 hwwday97 hwwenh97 smday97 ///
> drday97 maday97 negpeers age97 gender97 ///
> using "F:\flash\LDV\missing_3.imp", clear
(8984 observations read)



------------------------------------------------------------------------------
     drday97 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
    dinner97 |  -.0971774   .0194587    -4.99   0.000                 -.054321
       fun97 |  -.2148457   .0211425   -10.16   0.000                -.1087316
    negpeers |   .6266185   .0445778    14.06   0.000                 .1608607
       age97 |   .1717949   .0296539     5.79   0.000                 .0655973
    gender97 |  -.3408355   .0800658    -4.26   0.000                -.0437166
       _cons |  -.9064487   .4430372    -2.05   0.041                        .
------------------------------------------------------------------------------

We can compare these three solutions. The results are generally similar, but there are differences. You can see why doing at least 5 and preferably more and then pooling them is better than a single imputation approach like SPSS.

We could pool these by hand or create an ASCII file in a specific format and then have Norm do it for us. Here is the format we will use. This file is imputed.dat.

-.1270517  .0192573
-.1253703  .0209537
 .6241382  .0442888
 .2182099  .029564
-.5827233  .0796335
-1.284485  .4413611

-.115675   .0192072
-.1693357  .0209578
 .7323058  .0439953
 .1678435  .0294491
-.4886166  .0790318
-.9424769  .4413855

-.0971774  .0194587
-.2148457  .0211425
 .6266185  .0445778
 .1717949  .0296539
-.3408355  .0800658
-.9064487  .4430372

This could get quite tedious for a complex model with 10 or more imputed datasets or when you had something like a path model where you had to do this for each endogenous variable.

Norm will combine these results:

Open Norm
Click on “Analyze”
Click on “MI Inferences: Scalar”
Select our file imputed.dat
This opens a menu. Select “Stacked columns”
Enter the “number of estimands”. This is the number of parameters we estimated, the 5 B’s and the intercept, for a total of 6.
Enter the “number of imputations”. We have 3 imputed datasets, so I entered 3.
Click “run”



Here are the results:

**************************************************
NORM Version 2.03 for Windows 95/98/NT
Output from MI INFERENCE: SCALAR METHOD
untitled
Tuesday, 27 December 2005
19:44:09
**************************************************
Data read from file: F:\flash\LDV\imputed.dat
Number of estimands = 6
Number of imputations = 3
File format: stacked columns
**************************************************

QUANTITY    ESTIMATE     STD.ERR.        T-RATIO    DF    P-VALUE
QTY_1       -.113301     0.259986E-01     -4.36      9    0.0018
QTY_2       -.169851     0.557732E-01     -3.05      2    0.0930
QTY_3       0.661021     0.839346E-01      7.88      3    0.0043
QTY_4       0.185949     0.438119E-01      4.24      6    0.0054
QTY_5       -.470725     0.161728         -2.91      3    0.0620
QTY_6       -1.04447     0.503330         -2.08     38    0.0448

Multiple imputation parameter estimates (10 imputations)

Intervals and inference based on d.f. from Barnard & Rubin (1999)
------------------------------------------------------------------------------
     drday97 |     Coef.  Std. Err.      t    P>|t|    [95% Conf. Intvl]  MI.df
-------------+----------------------------------------------------------------
    dinner97 |  -.116435   .052811    -2.20   0.049   -.232059   -.00081  11.50
       fun97 |   -.14014   .033266    -4.21   0.000   -.209204  -.071077  21.60
    negpeers |   .716447   .069137    10.36   0.000    .575225   .857669  29.87
       age97 |   .174801   .052484     3.33   0.003    .065095   .284506  19.38
    gender97 |  -.535248    .15664    -3.42   0.003   -.867105  -.203391  16.12
       _cons |  -1.04943   1.00714    -1.04   0.316   -3.21621   1.11736  13.55
------------------------------------------------------------------------------
8984 observations.



CONFIDENCE LEVEL FOR INTERVAL ESTIMATES (%): 95.00

QUANTITY    LOW ENDPT.      HIGH ENDPT.     %MIS.INF.
QTY_1       -.172114        -.544885E-01    53.4
QTY_2       -.409823        0.701220E-01    90.8
QTY_3       0.393903        0.928138        80.3
QTY_4       0.787455E-01    0.293153        63.8
QTY_5       -.985417        0.439668E-01    83.3
QTY_6       -2.06341        -.255327E-01    26.7
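
The pooled values Norm reports for QTY_1 (the dinner97 coefficient) can be reproduced by hand with Rubin's rules for combining a scalar estimate. What follows is a minimal sketch in Python rather than Norm's own code; the function name pool_rubin is hypothetical. The inputs are the three dinner97 coefficients and standard errors from the regressions on the three imputed datasets.

```python
import math


def pool_rubin(estimates, std_errors):
    """Combine one coefficient across m imputations using Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                               # pooled estimate
    w = sum(se ** 2 for se in std_errors) / m                # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = w + (1 + 1 / m) * b                                  # total variance
    r = (1 + 1 / m) * b / w                                  # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2                          # large-sample degrees of freedom
    fmi = (r + 2 / (df + 3)) / (1 + r)                       # fraction of missing information
    return {"estimate": q_bar, "se": math.sqrt(t),
            "t": q_bar / math.sqrt(t), "df": df, "fmi": fmi}


# dinner97 coefficients and standard errors from the three imputed datasets
pooled = pool_rubin([-.1270517, -.115675, -.0971774],
                    [.0192573, .0192072, .0194587])
for name, value in pooled.items():
    print(name, round(value, 6))
```

The estimate, standard error, and t-ratio agree with the QTY_1 row above. The degrees of freedom come out between 9 and 10, which Norm prints as 9, and the fraction of missing information is about 53 percent, in line with the %MIS.INF. column.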

Computing Betas

If you wanted to report Beta weights, you would simply average the Betas across the imputations. If the B is significant at some level, the significance of the Beta is identical. The Betas for the effect of negpeers were .16, .19, and .16, so the mean is

. di (.1608607 + .1890819 + .1608902)/3
.1702776

So β = .17 for the influence of negative peers on number of days the adolescent drank in the last 30, p < .01.

Whew

This is a very simple example and yet it was tedious; not hard, just tedious. Norm works with most any type of model.



Checking for Auxiliary/Mechanism Variables

Before doing any imputations you should drop observations that should not be included. An argument can be made that the multiple imputation will not be biased even if you impute values for people who were excluded by design, but I will not make that argument here.

Who should be dropped?

1. Boys would not be asked questions about their age at first menstruation. If we are using age at first menstruation in our model, boys should be dropped.

2. Some questions may have been limited to a narrow age range. If this is what we want to do, we should drop those not asked the questions. Questions about sexual behavior may be asked only of those 14-18 and not of those 12-13. We would drop those who were 12-13 before imputing missing values.

3. I did not check any of the variables in our analysis for such problems because this is just an illustration.

What “Extra” Variables Should be Included in Imputations?

There are two types of variables that you want to include when doing imputation, whether or not they are related to your outcome variable:

1. Variables that are correlated with your score on the outcome variable or on other variables in your model. For example, education may not be a variable in your model explaining marital happiness. Education may still be correlated with variables that are in your model, such as income or parenting style. Including education when you create the imputed datasets will help you get the best possible imputations for those variables.

2. Variables that are mechanisms for missingness. Whether a variable is correlated with a variable in your model or is not, it may still be a mechanism that explains missingness. For example, we know that minorities, those with low education, and men are all less likely to answer questions on most topics than are majorities with high education and women. We would include variables in the imputation stage that predict whether you answer or do not answer an item.

You can use two ways to locate the first type of variable. You can think of each variable in your model (both independent and dependent) and think of possible correlates of it. These correlates could come from a theory, a literature review, or even your experience. Second, you may do an exploratory analysis by correlating any variable you think might be important with each of the variables in your model. Any variable that had a correlation of over, say .20, might be included in the list of variables used when creating the imputed datasets.

You can use the same two methods to locate the second type of variable. Even if gender is not a variable in your model, we know that men are less likely to answer most questions than women. We should, therefore, add gender as a variable used when doing the imputation.
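
The exploratory screening described above can be sketched in a few lines. This is an illustration in Python, not a Stata command; the function names and the toy data (income, education, shoe_size) are hypothetical. The .20 cutoff is the one suggested in these notes, and correlations use only the cases observed on both variables.

```python
import math


def pearson(x, y):
    """Pearson r over the pairs where both values are observed (not None)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


def screen_auxiliaries(model_vars, candidates, cutoff=0.20):
    """Keep candidates whose |r| with any model variable exceeds the cutoff."""
    keep = []
    for name, values in candidates.items():
        if any(abs(pearson(values, mv)) > cutoff for mv in model_vars.values()):
            keep.append(name)
    return keep


# Hypothetical toy data; None marks a missing value.
model_vars = {"income": [10, 20, None, 40, 50, 60]}
candidates = {
    "education": [8, 10, 12, 14, None, 18],
    "shoe_size": [9, 7, 7, 9, 8, 8],
}
print(screen_auxiliaries(model_vars, candidates))
```

In this toy example only education passes the cutoff. With real data the same loop would run over every candidate and every variable in the model.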

We will focus on finding auxiliary variables that are mechanisms for explaining whether a variable is missing or not. The Stata command misschk, which is part of the spost package, generates a series of dummy variables coded 1 if an observation is missing the variable and 0 if it is not. Here are the commands:

misschk , gen(m_) dummy help

. pwcorr m_hhin97 - m_negpeers hhin97 hhsize smday97 maday97 drday97 dinner97 fun97 negpeers ///
> gender97, obs sig

             | m_hhin97 m_din~97  m_fun97 m_hwwd~7 m_hwwe~7 m_smd~97 m_drd~97
-------------+---------------------------------------------------------------
    m_hhin97 |   1.0000
             |
             |     8984
             |
  m_dinner97 |   0.0295   1.0000
             |   0.0052
             |     8984     8984
             |
     m_fun97 |   0.0310   0.9940   1.0000
             |   0.0033   0.0000
             |     8984     8984     8984
             |
   m_hwwdy97 |   0.0262   0.8453   0.8458   1.0000
             |   0.0130   0.0000   0.0000
             |     8984     8984     8984     8984
             |
  m_hwwenh97 |   0.0251   0.8459   0.8455   0.9980   1.0000
             |   0.0173   0.0000   0.0000   0.0000
             |     8984     8984     8984     8984     8984
             |
   m_smday97 |   0.0334  -0.1939  -0.1925  -0.2067  -0.2063   1.0000
             |   0.0016   0.0000   0.0000   0.0000   0.0000
             |     8984     8984     8984     8984     8984     8984
             |
   m_drday97 |   0.0191  -0.2761  -0.2743  -0.2670  -0.2667   0.5040   1.0000
             |   0.0703   0.0000   0.0000   0.0000   0.0000   0.0000
             |     8984     8984     8984     8984     8984     8984     8984
             |
   m_maday97 |   0.0190  -0.2372  -0.2360  -0.2437  -0.2446   0.5122   0.4730
             |   0.0724   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
             |     8984     8984     8984     8984     8984     8984     8984
             |
  m_negpeers |   0.0235  -0.0187  -0.0109  -0.0087  -0.0101   0.0648   0.0824
             |   0.0260   0.0766   0.3035   0.4090   0.3361   0.0000   0.0000
             |     8984     8984     8984     8984     8984     8984     8984
             |
      hhin97 |        .   0.0237   0.0254  -0.0260  -0.0257   0.0178  -0.0282
             |   1.0000   0.0546   0.0392   0.0349   0.0373   0.1487   0.0221
             |     6588     6588     6588     6588     6588     6588     6588
             |
    hhsize97 |   0.1539  -0.0269  -0.0268   0.0028   0.0025   0.0498   0.0839
             |   0.0000   0.0108   0.0112   0.7889   0.8138   0.0000   0.0000
             |     8984     8984     8984     8984     8984     8984     8984
             |
     smday97 |  -0.0007   0.2008   0.2018   0.2178   0.2184        .  -0.1934
             |   0.9658   0.0000   0.0000   0.0000   0.0000   1.0000   0.0000
             |     3497     3497     3497     3497     3497     3497     3497
             |
     maday97 |  -0.0322   0.1121   0.1132   0.1177   0.1172  -0.0741  -0.0826
             |   0.1742   0.0000   0.0000   0.0000   0.0000   0.0017   0.0005
             |     1785     1785     1785     1785     1785     1785     1785
             |
     drday97 |   0.0024   0.1180   0.1182   0.1206   0.1211  -0.1683        .
             |   0.8826   0.0000   0.0000   0.0000   0.0000   0.0000   1.0000
             |     3819     3819     3819     3819     3819     3819     3819
             |
    dinner97 |  -0.0221        .   0.0200  -0.0885  -0.0847   0.1053   0.1141
             |   0.1064   1.0000   0.1433   0.0000   0.0000   0.0000   0.0000
             |     5356     5356     5356     5356     5356     5356     5356
             |
       fun97 |  -0.0235   0.0326        .  -0.0540  -0.0565   0.1441   0.1450
             |   0.0860   0.0170   1.0000   0.0001   0.0000   0.0000   0.0000
             |     5356     5356     5356     5356     5356     5356     5356
             |
    negpeers |   0.0173   0.3869   0.3858   0.3812   0.3813  -0.3037  -0.3339
             |   0.1045   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
             |     8791     8791     8791     8791     8791     8791     8791
             |
    gender97 |  -0.0022   0.0119   0.0123  -0.0048  -0.0050   0.0164   0.0207
             |   0.8313   0.2596   0.2419   0.6517   0.6355   0.1207   0.0495
             |     8984     8984     8984     8984     8984     8984     8984
             |

             | m_mad~97 m_negp~s   hhin97 hhsize97  smday97  maday97  drday97
-------------+---------------------------------------------------------------
   m_maday97 |   1.0000
             |
             |     8984
             |
  m_negpeers |   0.0545   1.0000
             |   0.0000
             |     8984     8984
             |
      hhin97 |   0.0273  -0.0510   1.0000
             |   0.0265   0.0000
             |     6588     6588     6588
             |
    hhsize97 |   0.0491   0.0090   0.0190   1.0000
             |   0.0000   0.3939   0.1234
             |     8984     8984     6588     8984
             |
     smday97 |  -0.3493  -0.0282  -0.0275  -0.0546   1.0000
             |   0.0000   0.0955   0.1581   0.0012
             |     3497     3497     2629     3497     3497
             |
     maday97 |        .  -0.0372  -0.0317  -0.0531   0.3372   1.0000
             |   1.0000   0.1158   0.2466   0.0249   0.0000
             |     1785     1785     1339     1785     1590     1785
             |
     drday97 |  -0.2751  -0.0135  -0.0051  -0.0173   0.2973   0.3771   1.0000
             |   0.0000   0.4046   0.7878   0.2838   0.0000   0.0000
             |     3819     3819     2838     3819     2578     1597     3819
             |
    dinner97 |   0.1186   0.0092   0.0609   0.0265  -0.0976  -0.0996  -0.0921
             |   0.0000   0.5003   0.0001   0.0522   0.0001   0.0113   0.0002
             |     5356     5356     3985     5356     1668      647     1675
             |
       fun97 |   0.1211   0.0275  -0.0591   0.0216  -0.0999  -0.0415  -0.0885
             |   0.0000   0.0445   0.0002   0.1145   0.0000   0.2909   0.0003
             |     5356     5356     3988     5356     1671      649     1679
             |
    negpeers |  -0.3667        .  -0.1122  -0.0309   0.2821   0.1900   0.1971
             |   0.0000   1.0000   0.0000   0.0038   0.0000   0.0000   0.0000
             |     8791     8791     6460     8791     3463     1775     3790
             |
    gender97 |   0.0375  -0.0111  -0.0130  -0.0024  -0.0094  -0.1035  -0.0306
             |   0.0004   0.2945   0.2917   0.8237   0.5797   0.0000   0.0590
             |     8984     8984     6588     8984     3497     1785     3819
             |

             | dinner97    fun97 negpeers gender97
-------------+------------------------------------
    dinner97 |   1.0000
             |
             |     5356
             |
       fun97 |   0.2970   1.0000
             |   0.0000
             |     5343     5356
             |
    negpeers |  -0.1559  -0.1004   1.0000
             |   0.0000   0.0000
             |     5229     5234     8791
             |
    gender97 |  -0.0772  -0.0173   0.0983   1.0000
             |   0.0000   0.2049   0.0000
             |     5356     5356     8791     8984
             |

We would need to study this and possibly check for other auxiliary/mechanism variables.
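
The logic behind the m_ dummy variables can be shown with a tiny example. This is a sketch in Python, not the misschk command itself; the function names and the toy values are hypothetical. It builds a missingness dummy for one variable and compares the mean of a second variable between cases that are missing and cases that are observed, which is the same information the correlation between a dummy and a variable summarizes.

```python
def missing_dummy(values):
    """Return 1 where a value is missing (None) and 0 where it is observed."""
    return [1 if v is None else 0 for v in values]


def mean_by_missingness(indicator, other):
    """Mean of `other` among observed (0) and missing (1) cases of the indicator."""
    groups = {0: [], 1: []}
    for flag, value in zip(indicator, other):
        if value is not None:            # skip cases missing on `other` itself
            groups[flag].append(value)
    return {flag: sum(vals) / len(vals) for flag, vals in groups.items() if vals}


# Hypothetical toy data; None marks a missing value.
dinner97 = [5, None, 7, None, 3, 6]
negpeers = [1, 4, 2, 5, 1, None]

m_dinner97 = missing_dummy(dinner97)
means = mean_by_missingness(m_dinner97, negpeers)
print(m_dinner97)
print(means)
```

In the toy data, cases missing dinner97 have a higher mean on negpeers. A difference like that would mark negpeers as a candidate mechanism variable for dinner97.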



Stata Commands: ice & micombine

The ice command is much more powerful than Norm and it has a vast array of options.

It can use different estimation models depending on the variable (regress, logit, mlogit, ologit)

It can work with multinomial categorical variables to get the appropriate category

It can work with interaction terms where you would not impute the interaction from the components, but compute the interaction from the imputed components

We will illustrate the command using the fewest options.

ice age97-negpeers using misingworking.dta, m(10) dryrun

ice                        The command name.

age97-negpeers             The variables we will use for imputation. This command will impute missing values for all the variables in our dataset except for the variables we generated as dummy variables using the misschk command (m_negpeers, etc.).

using misingworking.dta    The imputed datasets will be stacked in this file, which is saved in Stata's current working directory (shown in the lower left corner of the Stata screen).

m(10)                      The number of imputed datasets. These are stacked in the saved file. With 8,984 observations, this file will have 89,840 records; records 1-8984 are the first dataset, etc.

dryrun                     We will see what happens before doing the imputations.



. ice age97-negpeers using misingworking.dta, m(10) dryrun

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        289        3.22        3.22
          1 |        548        6.10        9.32
          2 |        922       10.26       19.58
          3 |      2,315       25.77       45.35
          4 |      1,639       18.24       63.59
          5 |      1,054       11.73       75.32
          6 |        845        9.41       84.73
          7 |        998       11.11       95.84
          8 |        358        3.98       99.82
          9 |         16        0.18      100.00
------------+-----------------------------------
      Total |      8,984      100.00

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
      age97 |         | [No missing data in estimation sample]
   gender97 |         | [No missing data in estimation sample]
   hhsize97 |         | [No missing data in estimation sample]
     hhin97 | regress | age97 gender97 hhsize97 dinner97 fun97 hwwdy97
            |         | hwwenh97 smday97 drday97 maday97 negpeers
   dinner97 | regress | age97 gender97 hhsize97 hhin97 fun97 hwwdy97 hwwenh97
            |         | smday97 drday97 maday97 negpeers
      fun97 | regress | age97 gender97 hhsize97 hhin97 dinner97 hwwdy97
            |         | hwwenh97 smday97 drday97 maday97 negpeers
    hwwdy97 | mlogit  | age97 gender97 hhsize97 hhin97 dinner97 fun97 hwwenh97
            |         | smday97 drday97 maday97 negpeers
   hwwenh97 | regress | age97 gender97 hhsize97 hhin97 dinner97 fun97 hwwdy97
            |         | smday97 drday97 maday97 negpeers
    smday97 | regress | age97 gender97 hhsize97 hhin97 dinner97 fun97 hwwdy97
            |         | hwwenh97 drday97 maday97 negpeers
    drday97 | regress | age97 gender97 hhsize97 hhin97 dinner97 fun97 hwwdy97
            |         | hwwenh97 smday97 maday97 negpeers
    maday97 | regress | age97 gender97 hhsize97 hhin97 dinner97 fun97 hwwdy97
            |         | hwwenh97 smday97 drday97 negpeers
   negpeers | regress | age97 gender97 hhsize97 hhin97 dinner97 fun97 hwwdy97
            |         | hwwenh97 smday97 drday97 maday97

End of dry run. No imputations were done, no files were created.

If we included all of these variables in a model, we would have just 289 cases using listwise (casewise) deletion. This shows us what variables will be used for each imputation and what regression model will be used. We are using all of the variables as predictors and we are predicting all of the variables. The ice command allows us to not use all variables (not use an interaction to predict its components) and not to impute all variables (not to impute an interaction from its components, but to compute this for us instead).
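
The "#missing values" table in the dry run is simply a count of missing values per case, and the row for 0 missing values is the listwise sample. Here is a small sketch of that tabulation in Python (not the ice command); the function name and the toy rows are hypothetical.

```python
from collections import Counter


def missing_count_table(rows):
    """Tabulate how many values are missing (None) in each case."""
    return Counter(sum(value is None for value in row) for row in rows)


# Hypothetical toy data: each row is one case, None marks a missing value.
rows = [
    (14, 1, 5.0),
    (15, None, 3.0),
    (13, 2, None),
    (16, None, None),
    (14, 2, 4.0),
]

table = missing_count_table(rows)
for n_missing in sorted(table):
    percent = round(100 * table[n_missing] / len(rows), 2)
    print(n_missing, table[n_missing], percent)

# The same arithmetic on the counts reported in the notes:
share_complete = round(100 * 289 / 8984, 2)   # percent of cases with no missing values
print(share_complete)
```

With the counts from the notes, 289 of 8,984 cases gives the 3.22 percent shown in the first row of the table.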

It decided to use an mlogit model for one of the variables, hwwdy97. It did this because there were just 5 categories. We might change this to a regress command, treating it as a quantitative variable. If we had a variable that was 5 nominal categories, we would need to modify the command so a series of 4 dummy variables were used as predictors instead of treating the categorical variable as quantitative.

Let’s change the ice command so that every variable is imputed using regress.

ice age97-negpeers using misingworking.dta, ///
    cmd(hwwdy97: regress) m(10)

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        289        3.22        3.22
          1 |        548        6.10        9.32
          2 |        922       10.26       19.58
          3 |      2,315       25.77       45.35
          4 |      1,639       18.24       63.59
          5 |      1,054       11.73       75.32
          6 |        845        9.41       84.73
          7 |        998       11.11       95.84
          8 |        358        3.98       99.82
          9 |         16        0.18      100.00
------------+-----------------------------------
      Total |      8,984      100.00

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
      age97 |         | [No missing data in estimation sample]
   gender97 |         | [No missing data in estimation sample]
   hhsize97 |         | [No missing data in estimation sample]
     hhin97 | regress | age97 gender97 hhsize97 dinner97 fun97 hwwdy97
            |         | hwwenh97 smday97 drday97 maday97 negpeers
   dinner97 | regress | age97 gender97 hhsize97 hhin97 fun97 hwwdy97 hwwenh97
            |         | smday97 drday97 maday97 negpeers
      fun97 | regress | age97 gender97 hhsize97 hhin97 dinner97 hwwdy97
            |         | hwwenh97 smday97 drday97 maday97 negpeers
    hwwdy97 | regress | age97 gender97 hhsize97 hhin97 dinner97 fun97 hwwenh97
            |         | smday97 drday97 maday97 negpeers
   hwwenh97 | regress | age97 gender97 hhsize97 hhin97 dinner97 fun97 hwwdy97
            |         | smday97 drday97 maday97 negpeers
    smday97 | regress | age97 gender97 hhsize97 hhin97 dinner97 fun97 hwwdy97
            |         | hwwenh97 drday97 maday97 negpeers
    drday97 | regress | age97 gender97 hhsize97 hhin97 dinner97 fun97 hwwdy97
            |         | hwwenh97 smday97 maday97 negpeers
    maday97 | regress | age97 gender97 hhsize97 hhin97 dinner97 fun97 hwwdy97
            |         | hwwenh97 smday97 drday97 negpeers
   negpeers | regress | age97 gender97 hhsize97 hhin97 dinner97 fun97 hwwdy97
            |         | hwwenh97 smday97 drday97 maday97

Imputing 1..2..3..4..5..6..7..8..9..10..
file misingworking.dta saved

The file misingworking.dta is saved in Stata's current working directory, which is shown in the lower left of the Stata screen.

We need to open this file.

. micombine regress drday97 dinner97 fun97 negpeers age97 ///
> gender97, beta

Multiple imputation parameter estimates (10 imputations)
------------------------------------------------------------------------------
     drday97 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    dinner97 |  -.1164346   .0528114    -2.20   0.027     -.219957   -.0129122
       fun97 |  -.1401403   .0332663    -4.21   0.000    -.2053497   -.0749308
    negpeers |   .7164471    .069137    10.36   0.000     .5809228    .8519714
       age97 |   .1748006   .0524843     3.33   0.001     .0719194    .2776819
    gender97 |   -.535248   .1566399    -3.42   0.001     -.842298   -.2281981
       _cons |  -1.049425   1.007142    -1.04   0.297    -3.023654    .9248038
------------------------------------------------------------------------------
8984 observations.

We can add an option, “br”, that estimates degrees of freedom using a method proposed in 1999 by Barnard and Rubin (Barnard, J., and Rubin, D. B. (1999). Small-sample degrees of freedom with multiple imputations. Biometrika 86:948-953). The advantages of this approach are not well established, but Rubin is the “father” of EM and multiple imputation.
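
For readers who want to see what a small-sample adjustment of this kind does, here is a sketch in Python of the Barnard and Rubin adjusted degrees of freedom as the formula is usually stated. It is not the code behind micombine's br option, and the function name is hypothetical. The within and between variances are the dinner97 values computed by hand from the three Norm imputations, and 8978 is the complete-data residual df from the regressions above.

```python
def barnard_rubin_df(m, within, between, df_complete):
    """Adjusted df for a pooled estimate (formula as commonly stated)."""
    total = within + (1 + 1 / m) * between
    lam = (1 + 1 / m) * between / total          # share of variance due to missing data
    df_old = (m - 1) / lam ** 2                  # large-sample df from Rubin's rules
    df_obs = (df_complete + 1) / (df_complete + 3) * df_complete * (1 - lam)
    return 1 / (1 / df_old + 1 / df_obs)         # always below both df_old and df_obs


# dinner97, three imputations: within and between variance computed by hand
within, between = 3.7280e-4, 2.27344e-4
df_old = (3 - 1) / ((4 / 3) * between / (within + (4 / 3) * between)) ** 2
df_adj = barnard_rubin_df(m=3, within=within, between=between, df_complete=8978)
print(round(df_old, 2), round(df_adj, 2))
```

With a sample this large, the adjusted value is only slightly below the unadjusted one. The adjustment matters more when the complete-data degrees of freedom are small.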



. micombine regress drday97 dinner97 fun97 negpeers age97 ///
> gender97, beta br



If you want to know the β’s, neither program will give them to you automatically. You would need to compute the 10 regressions separately, then get the mean for each β and use that value. The significance of the B is the same as the significance of the β.

Because the data is stacked, you need to use the following commands that divide the 89,840 records into the 10 separate datasets.

zscore drday97 dinner97 fun97 negpeers age97 gender97
z_drday97 created with 0 missing values
z_dinner97 created with 0 missing values
z_fun97 created with 0 missing values
z_negpeers created with 0 missing values
z_age97 created with 0 missing values
z_gender97 created with 0 missing values



. micombine regress z_drday97 z_dinner97 z_fun97 z_negpeers ///
> z_age97 z_gender97, br beta

Intervals and inference based on d.f. from Barnard & Rubin (1999)
------------------------------------------------------------------------------
   z_drday97 |     Coef.  Std. Err.      t    P>|t|    [95% Conf. Intvl]  MI.df
-------------+----------------------------------------------------------------
  z_dinner97 |  -.068906   .031254    -2.20   0.049   -.137332   -.00048  11.50
     z_fun97 |  -.075997    .01804    -4.21   0.000    -.11345  -.038544  21.60
  z_negpeers |   .177388   .017118    10.36   0.000    .142422   .212354  29.87
     z_age97 |   .067242    .02019     3.33   0.003    .025041   .109443  19.38
  z_gender97 |  -.069164   .020241    -3.42   0.003   -.112045  -.026282  16.12
       _cons |   .000078   .016708     0.00   0.996   -.034529   .034685  22.48
------------------------------------------------------------------------------
8984 observations.
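
Running the regression on z-scores gives Beta weights because a standardized coefficient is just the unstandardized B rescaled by the ratio of standard deviations. The following sketch in Python checks that identity for a single predictor; it is an illustration with hypothetical toy data, not the zscore or micombine commands. The same rescaling holds coefficient by coefficient with several predictors.

```python
import statistics


def slope(x, y):
    """Least-squares slope of y on x for a single predictor."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def zscore(values):
    """Standardize to mean 0 and standard deviation 1."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical toy data
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

b = slope(x, y)                                          # unstandardized B
beta_from_b = b * statistics.stdev(x) / statistics.stdev(y)
beta_from_z = slope(zscore(x), zscore(y))                # slope after standardizing
print(round(b, 4), round(beta_from_b, 4), round(beta_from_z, 4))
```

Both routes give the same Beta, which is why the t-values in the z-score regression match those of the unstandardized regression.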

We can compare the Stata results to those using Norm. The estimates are close. The t-tests are a bit inconsistent. When the ‘br’ option is used in Stata we get quite different estimates of the degrees of freedom.

QUANTITY    ESTIMATE     STD.ERR.        T-RATIO    DF    P-VALUE
QTY_1       -.113301     0.259986E-01     -4.36      9    0.0018
QTY_2       -.169851     0.557732E-01     -3.05      2    0.0930
QTY_3       0.661021     0.839346E-01      7.88      3    0.0043
QTY_4       0.185949     0.438119E-01      4.24      6    0.0054
QTY_5       -.470725     0.161728         -2.91      3    0.0620
QTY_6       -1.04447     0.503330         -2.08     38    0.0448





We can also compare both of these to what we would get using listwise deletion. You can see the results are quite different.

. use "F:\flash\LDV\nlsy97missingworking.dta", clear

. regress drday97 dinner97 fun97 negpeers ///
> age97 gender97, beta

------------------------------------------------------------------------------
     drday97 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
    dinner97 |  -.0736203   .0368884    -2.00   0.046                -.0515832
       fun97 |  -.1022276   .0444348    -2.30   0.022                -.0586974
    negpeers |   .5582945   .0915101     6.10   0.000                 .1571593
       age97 |   .0704611    .090599     0.78   0.437                 .0196987
    gender97 |  -.0617723   .1660812    -0.37   0.710                -.0091901
       _cons |  -.2351063   1.236965    -0.19   0.849                        .
------------------------------------------------------------------------------

Extensions of Stata’s Approach

Categorical variables

Suppose we have a categorical variable, X2, with three categories (0, 1, 2). We would want an mlogit to estimate this variable. When X2 is a predictor, however, we would want two dummy variables: X21 = 1 if X2 = 1, else X21 = 0; X22 = 1 if X2 = 2, else X22 = 0; and X2 = 0 is the reference category. Here is the somewhat complicated command:



ice age97-negpeers hw2 hw3 hw4 hw5 using sub.dta, ///
    cmd(hwwdy97: mlogit) ///
    passive(hw2:hwwdy97==2\hw3:hwwdy97==3\ ///
    hw4:hwwdy97==4\hw5:hwwdy97==5) ///
    substitute(hwwdy97:hw2 hw3 hw4 hw5) ///
    m(5)

ice                                    The command.

age97-negpeers hw2 hw3 hw4 hw5         The variables, including the 4 dummy variables representing hwwdy97 when it is a predictor, but not when it is an outcome.

using sub.dta                          The imputed data are saved here.

cmd(hwwdy97: mlogit)                   When hwwdy97 is the outcome, use mlogit to predict it.

substitute(hwwdy97:hw2 hw3 hw4 hw5)    It will make these 0,1 values when they are imputed.

If you were using hwwdy97 as an independent categorical variable, you would use the imputed hw2, hw3, hw4, and hw5 as four dummy variables. If you were predicting hwwdy97 as a dependent categorical variable, you would use multinomial logistic regression, mlogit.
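
The dummy coding itself is simple to state. Here is a sketch in Python of how the four dummies relate to a five-category variable, with category 1 as the reference. It is only an illustration of the coding, not of what ice does internally; the function name and the toy values are hypothetical.

```python
def dummy_code(values, categories, reference, stem="hw"):
    """Build one 0/1 dummy per non-reference category; None stays missing."""
    dummies = {}
    for category in categories:
        if category == reference:
            continue                      # the reference category gets no dummy
        dummies[f"{stem}{category}"] = [
            None if v is None else int(v == category) for v in values
        ]
    return dummies


# Hypothetical toy values of a 5-category variable such as hwwdy97
hwwdy97 = [1, 2, 5, None, 3]
dummies = dummy_code(hwwdy97, categories=[1, 2, 3, 4, 5], reference=1)
for name, column in dummies.items():
    print(name, column)
```

A case in the reference category is 0 on all four dummies, and a case missing on the original variable is missing on all four until it is imputed.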

Interaction

When you have an interaction term that is the product of two predictors, you need to impute the predictors and then compute the interaction term. If you imputed the predictors and also the interaction term, then the imputed interaction term might not be the same as the product of the two predictors.

Suppose we were going to use hwwdy97, hwwenh97, and hhww, where hhww = hwwdy97 x hwwenh97.



The computed interaction term, hhww, is the product of the actual/imputed values of hwwdy97*hwwenh97.
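
The point about interactions can be illustrated with a few made-up numbers. This sketch in Python is not the ice command; the function names are hypothetical, and the "imputed" values are invented stand-ins for draws from an imputation model. It fills in the missing components first and only then computes the product, so the interaction always agrees with its components.

```python
def fill_in(observed, imputed):
    """Use the observed value when present, otherwise the imputed one."""
    return [imp if obs is None else obs for obs, imp in zip(observed, imputed)]


def passive_product(x1, x2):
    """Compute the interaction from the completed components."""
    return [a * b for a, b in zip(x1, x2)]


# Hypothetical observed data (None = missing) and invented imputed values
hwwdy97_obs = [4, None, 5, 3]
hwwenh97_obs = [1, 2, None, 0]
hwwdy97_imp = [None, 3.5, None, None]
hwwenh97_imp = [None, None, 1.5, None]

hwwdy97 = fill_in(hwwdy97_obs, hwwdy97_imp)
hwwenh97 = fill_in(hwwenh97_obs, hwwenh97_imp)
hhww = passive_product(hwwdy97, hwwenh97)
print(hhww)
```

Had hhww been imputed as a variable in its own right, nothing would force the second and third cases to equal the product of their components.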

The micombine command works with the following regression models: clogit, cnreg, glm, logistic, logit, mlogit, ologit, oprobit, poisson, probit, qreg, regress, rreg, stcox, streg, or xtgee. It has not been validated for other regression commands.

Full Information Maximum Likelihood Estimation

Mplus has two ways of working with missing values. The simplest is to use full information maximum likelihood estimation with missing values (FIML). This uses all available data. For example, some adolescents were interviewed all six years but others may have skipped one, two, or even more years. We use all available information with this approach. The second approach is to utilize multiple imputations.
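
The phrase "uses all available data" has a concrete meaning: each case contributes to the likelihood through whichever variables it has. The following is a minimal sketch in Python for two normally distributed variables, assuming the mean vector and covariance matrix are given. It is not Mplus code, the function names are hypothetical, and it only evaluates the log-likelihood; it does not search for the parameter values that maximize it.

```python
import math


def normal_logpdf(value, mean, var):
    """Log density of a univariate normal."""
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)


def fiml_loglik(data, mu, sigma):
    """Casewise log-likelihood for (x, y) pairs, None marking a missing value.

    mu = (mean_x, mean_y); sigma = ((var_x, cov_xy), (cov_xy, var_y)).
    """
    (sxx, sxy), (_, syy) = sigma
    det = sxx * syy - sxy ** 2
    total = 0.0
    for x, y in data:
        if x is not None and y is not None:      # complete case: bivariate density
            dx, dy = x - mu[0], y - mu[1]
            quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
            total += -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)
        elif x is not None:                      # y missing: marginal density of x
            total += normal_logpdf(x, mu[0], sxx)
        elif y is not None:                      # x missing: marginal density of y
            total += normal_logpdf(y, mu[1], syy)
        # a case missing on both variables contributes nothing
    return total


data = [(1.0, 2.0), (0.5, None), (None, -1.0), (None, None)]
print(fiml_loglik(data, mu=(0.0, 0.0), sigma=((1.0, 0.3), (0.3, 1.0))))
```

Cases with partial data still inform the estimates of the parameters they touch, while a case missing on everything drops out.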



Multiple imputation involves

a. Imputing multiple datasets (usually 5-10) using appropriate procedures,
b. Estimating the model for each of these datasets, and
c. Then pooling the estimates and standard errors.

When the standard errors are pooled this way, they incorporate the variability across the 5-10 solutions and thereby produce unbiased estimates of standard errors. Multiple imputations can be done as follows:

Norm can be used to generate the datasets.
Mplus can read these multiple datasets, estimate the model for each dataset, and pool the estimates and their standard errors.
This would be somewhat easier than importing the imputed datasets into SPSS or Stata and then taking the results back to Norm.

We will not illustrate the multiple imputation approach. However, the Mplus User’s Guide discusses how you specify the datasets in the Data: section. We will illustrate the FIML approach because it is widely used and easily implemented.

The conceptual model does not change with missing values. The programming for implementing the FIML solution changes very little. You will recall that we did not need an Analysis: section in our program for doing a growth curve. However, we do need one when we are doing a growth curve with missing values and using FIML estimation. Directly above the Model command we insert

Analysis:
  Type = General Missing H1 ;
  Estimator = MLR ;

Type = General Missing H1; this line is the key change. The missing tells Mplus to do the full information maximum likelihood estimation. The H1 is necessary to get sample statistics in our output. We could do this with maximum likelihood estimation, but will use a robust maximum likelihood estimator, Estimator = MLR, instead. This is optional, but generally conservative when you have substantial missing values.



In the Output: section, we also add a single word, patterns. This will give us a lot of information about patterns of missing values. We will see just what patterns there are, the frequency of occurrence of each pattern, and the percentage of data present for each covariance estimate.

Output:
    Sampstat Mod(3.84) patterns ;
Plot:
    Type is Plot3 ;
    Series = bmi97 bmi98 bmi99 bmi00 bmi01 bmi02 bmi03(*) ;

Here is the complete program:

Title:
    bmi_growth_fiml.inp
    Stata2Mplus conversion for F:\flash\academica\bmi_stata.dta
Data:
    File is "F:\flash\academica\bmi_stata.dat" ;
Variable:
    Names are
        id grlprb_y boyprb_y grlprb_p boyprb_p male race_eth bmi97 bmi98 bmi99
        bmi00 bmi01 bmi02 bmi03 white black hispanic asian other ;
    Missing are all (-9999) ;
    ! usevariables is limited to bmi variables
    Usevariables are
        bmi97 bmi98 bmi99 bmi00 bmi01 bmi02 bmi03 ;
Analysis:
    Type = General Missing H1 ;
    Estimator = MLR ;
Model:
    i s | bmi97@0 bmi98@1 bmi99@2 bmi00@3 bmi01@4 bmi02@5 bmi03@6 ;
Output:
    Sampstat Mod(3.84) patterns ;
Plot:
    Type is Plot3 ;
    Series = bmi97 bmi98 bmi99 bmi00 bmi01 bmi02 bmi03(*) ;



Also, to simplify our presentation we will take out the quadratic term (the fit is better with the quadratic term, but it takes more space to present and interpret the results).

Here are selected, annotated results:

*** WARNING
  Data set contains cases with missing on all variables.
  These cases were not included in the analysis.
  Number of cases with missing on all variables:  3
   1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS

SUMMARY OF ANALYSIS

Number of groups                                   1
Number of observations                          1768   ! We had 1102 observations using listwise deletion.
Number of dependent variables                      7
Number of independent variables                    0
Number of continuous latent variables              2

Observed dependent variables

  Continuous
   BMI97   BMI98   BMI99   BMI00   BMI01   BMI02   BMI03

Continuous latent variables
   I   S

Estimator                                        MLR   ! Robust ML estimator
Information matrix                          OBSERVED
Maximum number of iterations                    1000
Convergence criterion                      0.500D-04
Maximum number of steepest descent iterations     20
Maximum number of iterations for H1             2000
Convergence criterion for H1               0.100D-03



! An ‘x’ means the data are present. Pattern 1 -- no missing values
! Pattern 2 -- missing BMI03

SUMMARY OF MISSING DATA PATTERNS

MISSING DATA PATTERNS

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 BMI97 x x x x x x x x x x x x x x x x x x x x BMI98 x x x x x x x x x x x x x x x x x x x x BMI99 x x x x x x x x x x x x x x x BMI00 x x x x x x x x x x x x x BMI01 x x x x x x x x x x x x BMI02 x x x x x x x x x x x BMI03 x x x x x x x x x x

21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 BMI97 x x x x x x x x x x x x x x x x x x x x BMI98 x x x x x x x x x BMI99 x x x x x x x x x x x BMI00 x x x x x x x x x x BMI01 x x x x x x x x x BMI02 x x x x x x x x x BMI03 x x x x x x x x x x

41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 BMI97 x x x x x x x x x x x x x x x BMI98 x x x x x BMI99 x x x x x x BMI00 x x x x x x x x x x x BMI01 x x x x x x x x x x x x BMI02 x x x x x x x x x x x BMI03 x x x x x x x x x x

61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 BMI97 BMI98 x x x x x x x x x x x BMI99 x x x x x x x x x x BMI00 x x x x x x x x x x x x BMI01 x x x x x x x x x x x x BMI02 x x x x x x x x x x x x x x BMI03 x x x x x x x x x x x x

81 BMI97 BMI98 BMI99 BMI00 BMI01 x BMI02 x BMI03 x

MISSING DATA PATTERN FREQUENCIES

Pattern  Frequency    Pattern  Frequency    Pattern  Frequency
   1       1102         28         2          55        26
   2         97         29        10          56        53
   3         73         30        51          57         9
   4         38         31         4          58         9
   5         21         32         3          59         2
   6         11         33         1          60         4
   7          5         34         1          61         1
   8         20         35         1          62         4
   9         23         36         3          63         1
  10          4         37         6          64         3
  11          8         38         1          65         5
  12          3         39         1          66         1
  13          8         40         1          67         1
  14          3         41         3          68         1
  15         11         42         6          69         1
  16         25         43         3          70         2
  17          6         44         1          71         1
  18          3         45         1          72        14
  19          2         46         2          73         1
  20          3         47         1          74         1
  21          1         48         6          75         2
  22          1         49         3          76         1
  23          2         50         2          77         1
  24          7         51         3          78         7
  25          1         52         3          79         1
  26          1         53         3          80         2
  27          6         54         3          81         4

! We might want to set some minimum standard and drop observations that do not meet that. For example, we might drop people who are missing their BMI for more than 3 waves.
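The pattern tabulation and the suggested minimum standard can be sketched outside Mplus as follows. This is an illustration only: the four cases are invented, None stands for a missing BMI, and the cutoff of 3 missing waves simply follows the example in the comment above.

```python
from collections import Counter

WAVES = ["bmi97", "bmi98", "bmi99", "bmi00", "bmi01", "bmi02", "bmi03"]


def pattern(case):
    """Return a string such as 'xxxxxx.', where 'x' marks an observed wave."""
    return "".join("x" if value is not None else "." for value in case)


def summarize(cases, max_missing=3):
    """Count missing-data patterns and keep cases missing at most max_missing waves."""
    counts = Counter(pattern(case) for case in cases)
    kept = [case for case in cases
            if sum(value is None for value in case) <= max_missing]
    return counts, kept


# Hypothetical cases, one value per wave in the order given by WAVES.
cases = [
    (20.1, 21.0, 22.3, 23.0, 23.5, 24.1, 24.8),  # complete
    (19.5, 20.2, 21.0, 21.7, 22.1, 22.9, None),  # missing bmi03 only
    (25.0, None, None, None, None, 27.2, 28.0),  # missing four waves
    (22.4, 23.0, 23.1, 23.9, 24.6, 25.0, None),  # missing bmi03 only
]
counts, kept = summarize(cases)
for pat, freq in counts.most_common():
    print(pat, freq)
print(f"kept {len(kept)} of {len(cases)} cases")
```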

COVARIANCE COVERAGE OF DATA

Minimum covariance coverage value   0.100

PROPORTION OF DATA PRESENT

           Covariance Coverage
              BMI97     BMI98     BMI99     BMI00     BMI01
              ________  ________  ________  ________  ________
 BMI97        0.925
 BMI98        0.847     0.902
 BMI99        0.850     0.856     0.910
 BMI00        0.842     0.846     0.864     0.906
 BMI01        0.839     0.837     0.854     0.859     0.904
 BMI02        0.796     0.794     0.805     0.811     0.817
 BMI03        0.777     0.775     0.788     0.788     0.801

           Covariance Coverage
              BMI02     BMI03
              ________  ________
 BMI02        0.861
 BMI03        0.774     0.840

! We have 77.4% of the 1768 observations answering both BMI02 and BMI03
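Each coverage entry is the share of cases that have both variables observed. A minimal sketch of that calculation follows. The cases are hypothetical, and the denominator is assumed to be the number of cases retained in the analysis, which matches the 1768 used in the comment above.

```python
def covariance_coverage(cases, i, j):
    """Proportion of cases with both variable i and variable j observed."""
    both = sum(1 for case in cases
               if case[i] is not None and case[j] is not None)
    return both / len(cases)


# Hypothetical cases: bmi97 ... bmi03 in positions 0 ... 6, None = missing.
cases = [
    (20.1, 21.0, 22.3, 23.0, 23.5, 24.1, 24.8),
    (19.5, 20.2, 21.0, 21.7, 22.1, 22.9, None),
    (25.0, None, None, None, None, 27.2, 28.0),
    (22.4, 23.0, 23.1, 23.9, 24.6, 25.0, None),
]
cov_02_03 = covariance_coverage(cases, 5, 6)  # bmi02 with bmi03
cov_97_98 = covariance_coverage(cases, 0, 1)  # bmi97 with bmi98
cov_97_97 = covariance_coverage(cases, 0, 0)  # diagonal: bmi97 present
print(cov_02_03, cov_97_98, cov_97_97)
```

The diagonal entries are simply the proportion of cases with that one variable present.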

SAMPLE STATISTICS

! Notice that the means are not dramatically different from the results of the “basic” analysis that had the 1098 observations using listwise deletion. This is reassuring that our missing values are not creating a systematic bias.

           Means
              BMI97     BMI98     BMI99     BMI00     BMI01
              ________  ________  ________  ________  ________
      1       20.572    21.839    22.651    23.305    23.846

           Means
              BMI02     BMI03
              ________  ________
      1       24.390    24.935

TESTS OF MODEL FIT

! If you compare nested models with MLR estimation you need to use the scaling correction factor as discussed on their web page. We are not doing that here, so this is okay.

Chi-Square Test of Model Fit

          Value                            116.426*
          Degrees of Freedom                    23
          P-Value                           0.0000
          Scaling Correction Factor          2.302
            for MLR

* The chi-square value for MLM, MLMV, MLR, ULS, WLSM and WLSMV cannot be used for chi-square difference tests. MLM, MLR and WLSM chi-square difference testing is described in the Mplus Technical Appendices at www.statmodel.com. See chi-square difference testing in the index of the Mplus User's Guide.

! The chi-square is much bigger when we use FIML estimation with missing values, in part because the sample is so much bigger. Still there are some fit problems without the quadratic term. Both the CFI and TLI are a bit low to be ideal (under .96). However the RMSEA is good and that is the most widely used measure of fit.

Chi-Square Test of Model Fit for the Baseline Model

          Value                           1279.431
          Degrees of Freedom                    21
          P-Value                           0.0000

CFI/TLI

          CFI                                0.926
          TLI                                0.932

RMSEA (Root Mean Square Error Of Approximation)

          Estimate                           0.048

SRMR (Standardized Root Mean Square Residual)

          Value                              0.051
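The CFI, TLI, and RMSEA above can be recomputed from the two chi-square tests and the sample size. The sketch below uses the common textbook formulas, with N rather than N - 1 in the RMSEA denominator as an assumption. It is not taken from the Mplus source, although it gives the same three-decimal values for this output.

```python
import math


def fit_indices(chi2, df, chi2_base, df_base, n):
    """Textbook CFI, TLI, and RMSEA from model and baseline chi-squares."""
    d_model = max(chi2 - df, 0.0)            # model noncentrality
    d_base = max(chi2_base - df_base, 0.0)   # baseline noncentrality
    worst = max(d_model, d_base)
    cfi = 1.0 - d_model / worst if worst > 0 else 1.0
    ratio_base = chi2_base / df_base
    tli = (ratio_base - chi2 / df) / (ratio_base - 1.0)
    rmsea = math.sqrt(d_model / (df * n))    # assumes N, not N - 1
    return cfi, tli, rmsea


# Values copied from the output above.
cfi, tli, rmsea = fit_indices(116.426, 23, 1279.431, 21, 1768)
print(f"CFI = {cfi:.3f}, TLI = {tli:.3f}, RMSEA = {rmsea:.3f}")
```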



! The results are similar to the linear model solution with listwise deletion, but our z-scores are bigger due to having more observations.

 S        WITH
    I                  0.408      0.112      3.658

 Means
    I                 21.035      0.105    200.935
    S                  0.701      0.022     32.311

 Variances
    I                 15.051      0.958     15.714
    S                  0.255      0.031      8.340

 Residual Variances
    BMI97              5.730      0.638      8.981
    BMI98              3.276      0.414      7.907
    BMI99              3.223      0.351      9.175
    BMI00              4.361      0.973      4.483
    BMI01              2.845      0.355      8.005
    BMI02              9.380      3.384      2.772
    BMI03              8.589      2.736      3.139

PLOT INFORMATION

The following plots are available:

  Histograms (sample values, estimated factor scores, estimated values)
  Scatterplots (sample values, estimated factor scores, estimated values)
  Sample means
  Estimated means
  Sample and estimated means
  Observed individual values
  Estimated individual values
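As a rough check on the linear model, the estimated means can be rebuilt from the intercept and slope means and the time scores 0 to 6 in the Model: command. This is a small arithmetic sketch using numbers copied from the output above; it is not something Mplus produces in this form.

```python
# Estimates and sample means copied from the output above.
INTERCEPT_MEAN = 21.035
SLOPE_MEAN = 0.701
TIME_SCORES = [0, 1, 2, 3, 4, 5, 6]  # bmi97@0 ... bmi03@6
SAMPLE_MEANS = [20.572, 21.839, 22.651, 23.305, 23.846, 24.390, 24.935]

# Model-implied mean at each wave: intercept + slope * time score.
implied = [round(INTERCEPT_MEAN + SLOPE_MEAN * t, 3) for t in TIME_SCORES]
# Sample mean minus implied mean at each wave.
residuals = [round(s - m, 3) for s, m in zip(SAMPLE_MEANS, implied)]

for year, s, m, r in zip(range(1997, 2004), SAMPLE_MEANS, implied, residuals):
    print(f"{year}: sample {s:.3f}  implied {m:.3f}  difference {r:+.3f}")
```

The differences are negative at the first and last waves and positive in between, which is consistent with the earlier remark that the fit is better with the quadratic term.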

Missing Values with Mechanism Variables

Although there is widespread use of full information maximum likelihood estimation, researchers typically do not include additional variables that explain who does or does not answer each question. It is very easy to do this and I show how in an appendix to my November article in JMF.

Title:
    Missing values including mechanism/auxiliary variables
Data:
    File is Miss_systematic-999.dat ;
Variable:
    Names are
        childs satfin male hap_gen ident income98 educ hlth age ;
    Missing are all (-999) ;
    Usevariables are
        hlth childs hap_gen income98 age educ satfin male ;
Analysis:
    Type = missing ;
Model:
    hlth on childs hap_gen income98 age educ ;
    satfin on childs hap_gen income98 age educ ;
    ! nonsense equation that brings the auxiliary/mechanism variable into the model
    male on childs hap_gen income98 age educ ;
Output:
    Standardized ;

The difference is that I’ve added a line that would include any auxiliary/mechanism variables under the Model: section. This is a nonsense equation since we would not predict gender this way. This does not have to be a meaningful equation, but it does need to include the auxiliary/mechanism variables you want involved. You ignore the results for this model. However, Mplus now includes these variables in the analysis using the full information approach.



Multiple Cohort Growth Model with Missing Waves

Major datasets often have multiple cohorts. NLSY97 has youth who were 12-18 in 1997. Seven years later, they are 19-25. It is quite likely that many growth processes that involve going from the age of 12 to the age of 19 are different from those going from 19 to 25. For example, involvement in minor crimes (petty theft, etc.) may increase from 12 to 19, but then decrease from there to 25. Here is what we might have for our NLSY97 data:

Individual  Cohort  1997  1998  1999  2000  2001  2002  2003
    1        1985     3     4     5     6     7     7     8
    2        1985     2     4     3     5     6     7     7
    3        1984     4     5     6     7     6     6     5
    4        1982     6     7     5     4     3     2     2
    5        1982     5     5     6     4     2     2     1

We can rearrange this data

Case  Cohort  HD12  HD13  HD14  HD15  HD16  HD17  HD18  HD19  HD20  HD21
  1    1985     3     4     5     6     7     7     8     *     *     *
  2    1985     2     4     3     5     6     7     7     *     *     *
  3    1984     *     4     5     6     7     6     6     5     *     *
  4    1982     *     *     *     6     7     5     4     3     2     2
  5    1982     *     *     *     5     5     6     4     2     2     1

In this table the number after HD is the age at which the data were collected. To capture everybody we would need to extend the table to HD25 because the youth who were 18 in 1997 are 25 seven years later.
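The rearrangement from survey years to ages can be sketched as follows. The sketch assumes, as the two tables imply, that age in 1997 equals 1997 minus the birth cohort. The function name to_age_columns is made up for the example, and None plays the role of the * in the table.

```python
FIRST_YEAR = 1997  # year of the first wave


def to_age_columns(cohort, scores, min_age=12, max_age=21):
    """Move wave-ordered scores into age-ordered columns HD12 ... HD21."""
    start_age = FIRST_YEAR - cohort  # assumed age at the first wave
    row = {f"HD{age}": None for age in range(min_age, max_age + 1)}
    for offset, score in enumerate(scores):
        key = f"HD{start_age + offset}"
        if key in row:  # ignore ages outside the table
            row[key] = score
    return row


# Cases 3 and 4 from the table above.
case3 = to_age_columns(1984, [4, 5, 6, 7, 6, 6, 5])
case4 = to_age_columns(1982, [6, 7, 5, 4, 3, 2, 2])
print(case3)
print(case4)
```

Each case ends up with seven observed columns and missing values everywhere else, which is the structure the growth model then has to handle.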

This table would have massive amounts of missing data, but the missingness would not be related to other variables. It would be missing at random.

We could develop a growth curve that covered the full range from age 12 to age 25. We would have 14 waves of data even though each participant was only measured 7 times. Each participant would have data for 7 of the years and have missing values for the other 7 years.

We would want to estimate a growth model with a quadratic term and expect the linear slope to be positive (growth from 12-18) and the quadratic term to be negative (decline from 18-25).



Mplus has a special Analysis: type called MCOHORT. There is an example on the Mplus web page and we will not cover it here. This is an extraordinary way to deal with missing values. Here is an example from data Muthén analyzed:

Recommendations

These recommendations rely primarily on the statistical processes and potential problems rather than on the particular illustration used in this paper.

1. Keep as much information on why a person has a missing value as possible. Distinguishing what you should impute from what you should leave missing is impossible if you have a single code such as a -9 or a dot for missing values.

2. If a “don’t know” response is interpretable as somewhere on an underlying scale between agree and disagree, then assigning or imputing a value may be reasonable. Otherwise, it should not be imputed.

3. If you have a variable that was skipped by part of your sample you may still be able to use all of your observed data.

a. In the Fragile Families study mothers were asked about abuse only if they had a prior relationship with the biological father.



b. Because there were a couple hundred fathers who had no prior relationship with the mother, but planned to have one with the child, one solution would be to drop them from the analysis.

c. A better solution is to trichotomize abuse into high, medium, and low and then create three dummy variables using low abuse as the reference group. You would create dummy variables for medium abuse, high abuse, and a third dummy for not asked. This way you would keep all your observations.

4. Sometimes you can create a meaningful value for valid skips. Again using the Fragile Families study, nonresident fathers who had some contact with a child were asked a series of items measuring the quantity of their contact.

a. A substantial number of fathers had a valid skip on these items since they had no contact with the child.

b. Rather than deleting them from the analysis or imputing a value for them, it is appropriate to assign them a value of ‘0,’ since, by definition, they have no contact.

c. If you do this, you will have a large concentration of ‘0’ values and need to use an appropriate statistical model that has an adjustment for censored variables using programs such as Mplus or Stata.

5. The mean substitution approach is probably the worst possible solution in that it attenuates variance and provides what is often a poor imputed value.

6. The widespread use of listwise or case deletion is unfortunate because it both biases estimates and reduces power in typical applications.

7. When you are imputing values using single or multiple imputation the selection of variables you include during the imputation step is critical.

a. First, choose all potential mechanism variables that have few or no missing values. Adding a few mechanism variables, even though they are not in your analysis model, can only help you to meet the MAR assumption. Including these in the imputation step even when they are not in the analysis step is not a bias (Meng, 1995 and Rubin, 1996).

b. Second, you must include all variables in your analysis model (both predictors and outcomes) in your imputation stage. If your dependent variable is related to an independent variable, this relationship needs to be incorporated in the imputation step. The parameter estimate for an analysis variable that is not included in the imputation step will be biased downward (Meng, 1995 and Rubin, 1996).

c. Third, you should include variables that predict missing values whether these variables are in the analysis stage or not.

8. Multiple imputation is better than single imputation. Until multiple imputation is seamlessly integrated into your software package of choice, you should rely on single imputation until you reach the final analysis, then do the final analysis using multiple imputation.

9. Maximum likelihood approaches used in SEM models, especially with Mplus because of its ability to model nominal, ordinal and quantitative variables, are excellent. These are easier to use than multiple imputation (see Mplus program in the appendix), but most implementations fail to include mechanism variables because these are not part of the analysis model. This problem is easy to correct.
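Recommendations 3c and 4b above both amount to simple recoding steps that keep every observation. A minimal sketch follows, assuming hypothetical variable names and codings; these are not the actual Fragile Families variables.

```python
def abuse_dummies(abuse_level):
    """Dummy-code abuse with low as the reference group.

    abuse_level is 'low', 'medium', 'high', or None when the question
    was not asked (a valid skip).
    """
    return {
        "abuse_medium": int(abuse_level == "medium"),
        "abuse_high": int(abuse_level == "high"),
        "abuse_not_asked": int(abuse_level is None),
    }


def contact_quantity(has_contact, reported_quantity):
    """Assign 0 to fathers who validly skipped because they have no contact."""
    if not has_contact:
        return 0
    return reported_quantity


# Hypothetical respondents.
print(abuse_dummies("high"))
print(abuse_dummies(None))
print(contact_quantity(False, None))
print(contact_quantity(True, 12))
```

A low-abuse mother gets zeros on all three dummies, so the reference group stays identifiable, and no case is dropped because of a valid skip.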

Hopefully, the days when legitimate journals will tolerate the absence of attrition analysis and the use of traditional approaches to missing values are numbered. Standard statistical packages over the next couple of years will make modern approaches a practical solution. Multiple imputation and the approaches available in structural equation modeling software are the best that are currently available.

