Missing Data and Non-Response
Anuj Vijay Bhatia
FPRM 14
Institute of Rural Management Anand
NON-RESPONSE ERROR: HOW TO HANDLE IT?
Research Methodology
The respondent may not have replied to the mail, may not have found time to give the interview, or could not be contacted. There can be many such reasons for non-response.
A high rate of non-response is serious.
Research may lose:
Credibility
Acceptability
Accuracy and Professional Soundness
The methodology used should be described completely.
It is the researcher's responsibility to establish external validity.
An appropriate sample size and an acceptable response rate must be achieved.
NON RESPONSE ERROR
Nonresponse error exists to the extent that subjects
included in the sample fail to provide usable responses.
Research marked by high nonresponse loses
validity and reliability.
Many research articles:
Do not mention nonresponse as a threat to external validity.
Do not attempt to control for nonresponse error.
Do not refer to the literature on handling
nonresponse.
High nonresponse limits the ability of the researcher to generalize.
NON RESPONSE ERROR
In survey research, the ability to generalize is critical.
There is a risk that non-respondents will be
systematically different from respondents.
Response rates are higher (often 100%) when
purposive or convenience sampling is used.
However, when probability sampling is used, response rates are often
low.
The ability to generalize is limited when purposive or
convenience sampling is used.
In that case the threat to validity is not due to the response rate but due
to nonrepresentative sampling procedures.
To ensure external validity, answer: Would your results be
the same if a 100% response rate had been achieved?
SAMPLING PROCEDURES AND NON-
RESPONSE
Suppose the population is divided into two strata: the respondents (r) and the non-respondents, whose data are missing (m). Suppose we want to determine $\bar{Y}$, the overall population mean.
$\bar{Y} = W_r \bar{Y}_r + W_m \bar{Y}_m$
$\bar{Y}_r$ and $\bar{Y}_m$ are the means of respondents and non-respondents respectively. $W_r$ and $W_m$ are the weights (the population proportions of the two strata), so $W_r + W_m = 1$.
If the survey fails to collect data from non-respondents, it will produce an estimate equal to $\bar{Y}_r$.
The bias is the difference between $\bar{Y}_r$ and $\bar{Y}$:
$\bar{Y}_r - \bar{Y} = \bar{Y}_r - (W_r \bar{Y}_r + W_m \bar{Y}_m)$
$= \bar{Y}_r (1 - W_r) - W_m \bar{Y}_m$
$= W_m (\bar{Y}_r - \bar{Y}_m)$
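As a rough numeric illustration of the last line of the derivation, the sketch below plugs hypothetical values into Wm (Yr - Ym) and checks the result against the difference between the respondent mean and the population mean. The function name and all numbers are assumptions made up for this example.

```python
def nonresponse_bias(mean_r, mean_m, w_m):
    """Bias of the respondent mean: Wm * (Yr - Ym)."""
    return w_m * (mean_r - mean_m)


# Hypothetical values: 70% respond (mean 4.2), 30% do not (mean 3.5).
w_r, w_m = 0.7, 0.3
mean_r, mean_m = 4.2, 3.5

population_mean = w_r * mean_r + w_m * mean_m
bias = nonresponse_bias(mean_r, mean_m, w_m)

print("population mean:", round(population_mean, 2))
print("bias of respondent mean:", round(bias, 2))
```

The bias grows with both the share of non-respondents and the gap between the two group means. If either is zero, the respondent mean is unbiased.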
A SIMPLE LOGIC
Controlling non-response begins with design and implementation.
Appropriate sampling protocols and procedures
should be used to maximize participation.
Ensure that the response rate is high enough to conclude that
non-response is not a threat to external validity.
If required, use additional procedures to
establish that non-response is not a threat to
external validity.
CONTROLLING NON-RESPONSE ERROR
Methods for Handling Non-Response
1. Comparison of Early to Late Respondents
2. Using “Days to Respond” as a Regression Variable
3. Compare Respondents to Non-Respondents
4. Compare Respondents on Characteristics known a
priori
5. Ignore Non-Response as a Threat to External
Validity
RECOMMENDATIONS FOR HANDLING
NON-RESPONSE
Method 1: Comparison of Early to Late Respondents
Extrapolation based on statistical inferences
Operationally define ‘Late Respondents’
Last wave of respondents: Late Respondents
Compare early and late respondents based on key
variables of interest.
If there is no difference, results can be generalized to the larger
population.
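One possible way to carry out the early-versus-late comparison is sketched below. The slide does not name a specific test. This example uses a two-sided permutation test on the difference in means, chosen only because it needs nothing beyond the Python standard library. The function name, the data, and the number of permutations are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(early, late, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(early) - mean(late))
    pooled = list(early) + list(late)
    n_early = len(early)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_early]) - mean(pooled[n_early:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical scores on one key variable (5-point scale).
early = [4, 5, 3, 4, 4, 5, 3, 4, 4, 5]
late = [4, 3, 4, 5, 4, 3, 4, 4]  # last wave of respondents

p = permutation_p_value(early, late)
print(f"p = {p:.3f}")
```

A large p-value is consistent with early and late respondents being similar on this variable. The comparison would be repeated for each key variable of interest.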
METHODS FOR HANDLING
NON-RESPONSE
Method 2: Using “Days to Respond” as a Regression
Variable
“Days to respond” is coded as a continuous variable and
used as an independent variable (IV) in a regression equation.
The primary variables of interest are regressed on the variable
“Days to Respond”.
If the result is not statistically significant: Assume that respondents
are not different from non-respondents.
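A minimal sketch of this regression, assuming a single outcome variable and ordinary least squares, is given below. It computes the slope on "days to respond" and its t statistic by hand. The data and the function name are hypothetical, and the critical value in the final comment is the standard two-sided 5% value for 8 degrees of freedom.

```python
from math import sqrt
from statistics import mean


def slope_t_statistic(days, outcome):
    """OLS slope of outcome on days, with its t statistic (df = n - 2)."""
    n = len(days)
    mx, my = mean(days), mean(outcome)
    sxx = sum((x - mx) ** 2 for x in days)
    sxy = sum((x - mx) * (y - my) for x, y in zip(days, outcome))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(days, outcome))
    se_slope = sqrt(sse / (n - 2) / sxx)
    return slope, slope / se_slope


# Hypothetical data: days taken to respond and a key outcome score.
days = [1, 2, 2, 3, 5, 7, 8, 10, 12, 14]
score = [4.0, 3.8, 4.2, 4.1, 3.9, 4.0, 4.3, 3.7, 4.1, 4.0]

slope, t = slope_t_statistic(days, score)
print(f"slope = {slope:.4f}, t = {t:.2f}")
# With df = 8, |t| below about 2.31 is not significant at the 5% level.
```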
METHODS FOR HANDLING
NON-RESPONSE
Method 3: Compare Respondents to Non-Respondents
Compute differences by sampling nonrespondents
and working extra diligently to get their responses.
Minimum 20% of responses from nonrespondents
should be obtained.
If fewer than 20% responses are obtained, Method 1
or 2 should be used by combining the results.
METHODS FOR HANDLING
NON-RESPONSE
Method 4: Compare Respondents on Characteristics
known a priori
Compare respondents to population or
characteristics known in advance
Describe similarities and differences.
Method 5: Ignore Non-Response as a Threat to External
Validity
If above methods are you can choose to ignore.
METHODS FOR HANDLING
NON-RESPONSE
MISSING DATA IN QUANTITATIVE RESEARCH
What is certain in life?
Death
Taxes
What is certain in research?
Measurement error
Missing data
Missing data can be:
Due to preventable errors, mistakes, or lack of foresight by the
researcher
Due to problems outside the control of the researcher
Deliberate, intended, or planned by the researcher to reduce
cost or respondent burden
Due to differential applicability of some items to subsets of
respondents, etc.
A FOOD FOR THOUGHT
• Non-Response v/s Missing Data
• Missing Data: Valid values on one or more variables are not available for analysis.
• The researcher's primary concern is to identify the patterns and relationships underlying the missing data.
• We need to understand the process leading to missing data in order to take an appropriate course of action.
• Common in Social Research
• More acute in experiments and surveys
• Best way is to avoid it by planning and conscientious data collection.
• Not uncommon to have some level of missing data.
MISSING DATA
Lost data
Reduces Statistical Power
Meaningfully diminishes sample size
Bias Parameter Estimates
Correlations biased downwards
Predictor scores affected
Restrict Variance
Central Tendency Biased
PRIMARY PROBLEMS
Simple Techniques
Listwise Deletion
Pairwise Deletion
Mean Substitution
Regression Imputation
Hot-Deck Imputation
Maximum Likelihood and Related Methods
Maximum Likelihood
Expectation Maximization
Repeated Measures and Time Series Designs
TECHNIQUES TO DEAL WITH
MISSING DATA
Eliminate all cases with missing data on any
predictor or criterion.
Sacrifices large amount of data
Decreases statistical power
May introduce bias in parameter estimates
Default option in many statistical packages
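The cost of this approach can be seen in a small sketch. Assuming records stored as dictionaries with `None` marking a missing value (a representation chosen for this example, along with the hypothetical data), any case with even one gap is dropped:

```python
def listwise_delete(rows):
    """Keep only the cases that have no missing value on any variable."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical survey records; None marks a missing value.
rows = [
    {"age": 34, "income": 52, "satisfaction": 4},
    {"age": 41, "income": None, "satisfaction": 5},
    {"age": None, "income": 38, "satisfaction": 3},
    {"age": 29, "income": 45, "satisfaction": None},
    {"age": 50, "income": 61, "satisfaction": 4},
]

complete = listwise_delete(rows)
print(len(rows), "cases before,", len(complete), "after")
```

Here only 3 of the 15 values are missing, yet 3 of the 5 cases are discarded.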
LISTWISE DELETION
Deletes cases only from those statistics
that “need” the missing information.
Preserves a great deal more information than
listwise deletion.
Interpretation becomes difficult.
May lead to mathematically inconsistent
correlations.
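A minimal sketch of pairwise deletion for correlations follows, assuming `None` marks a missing value. Each correlation uses every case observed on that particular pair of variables, so the sample size differs from one coefficient to the next. The helper name and data are hypothetical.

```python
from math import sqrt


def pairwise_corr(xs, ys):
    """Pearson r using only cases observed on both variables; returns (r, n)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy), n


# Hypothetical variables; None marks a missing value.
a = [1, 2, 3, 4, 5, 6, None, 8]
b = [2, 1, 4, 3, None, 5, 7, 8]
c = [5, None, 4, 4, 2, None, 1, 1]

for name, (u, v) in {"a-b": (a, b), "a-c": (a, c), "b-c": (b, c)}.items():
    r, n = pairwise_corr(u, v)
    print(f"r({name}) = {r:.2f} based on n = {n}")
```

Because each coefficient rests on a different subset of cases, the resulting correlation matrix need not be internally consistent, which is the interpretation problem noted above.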
PAIRWISE DELETION
Use the variable's mean in place of missing data
Allows the rest of the individual’s data to be used
Preserves data
Easy to use
Attenuates variance and covariance estimates
Useful when correlations between variables are
low and less than 10% of data are missing.
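The attenuation of variance can be checked with a short sketch. Assuming `None` marks a missing value, each gap is filled with the mean of the observed scores, and the sample variance is compared before and after. The data and function name are hypothetical.

```python
from statistics import mean, variance


def mean_substitute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical scores; None marks a missing value.
scores = [2, 4, None, 6, 8, None, 10, 12]

observed = [v for v in scores if v is not None]
filled = mean_substitute(scores)

print("variance of observed scores:", variance(observed))
print("variance after substitution:", variance(filled))
```

Every imputed value sits exactly at the mean, so the sum of squared deviations stays the same while the number of cases grows, and the variance estimate shrinks.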
MEAN SUBSTITUTION
Estimate missing data based on other variables in
data set.
Advantages:
Preserves data
Better than Listwise and Pairwise deletion
Preserves the deviation from the mean
Does not attenuate correlations the way mean substitution does.
Variants:
Simple regression strategy
Only one iteration
Estimate relationships in variables and estimate missing data
Stepwise/Iterative Regression
Isolate a few key variables, prepare correlation matrix.
Estimate regression equation and predict missing values
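The simple (one-iteration) variant might look like the sketch below, assuming one fully observed predictor `x` and one variable `y` with gaps marked by `None`. The regression is fitted on the complete cases and its predictions fill the gaps. Names and data are hypothetical.

```python
from statistics import mean


def regression_impute(xs, ys):
    """Fill missing ys (None) with predictions from a simple regression on xs."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    cx = [x for x, _ in complete]
    cy = [y for _, y in complete]
    mx, my = mean(cx), mean(cy)
    slope = sum((x - mx) * (y - my) for x, y in complete) / sum(
        (x - mx) ** 2 for x in cx
    )
    intercept = my - slope * mx
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]


# Hypothetical data: x is fully observed, y has two missing values (None).
x = [1, 2, 3, 4, 5, 6]
y = [3, 5, None, 9, 11, None]

print(regression_impute(x, y))
```

Unlike mean substitution, each imputed value reflects that case's own position on the predictor, so deviations from the mean are preserved.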
REGRESSION IMPUTATION
Replace missing value with actual score from similar
case in current data set.
Hot-deck? What is so hot about it?
What is Cold-Deck then?
Missing values are replaced with a reasonable estimate
from similar individual.
Accurate: Real values are imputed
May not distort distributions.
Helpful when data is missing in patterns.
Little literature backing the accuracy claim.
Problematic when there are large classification variables.
Categorizing variables sacrifices information.
Estimating Standard Errors Difficult.
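One simple form of hot-deck imputation is sketched below, under the assumption that "similar" means an exact match on a set of classification variables and that the donor is drawn at random from the matching cases. The function name, the field names, and the data are hypothetical.

```python
import random


def hot_deck_impute(rows, target, class_vars, seed=0):
    """Fill missing `target` values with a value from a random donor in the same class."""
    rng = random.Random(seed)
    donors = {}
    for row in rows:
        if row[target] is not None:
            key = tuple(row[v] for v in class_vars)
            donors.setdefault(key, []).append(row[target])
    imputed = []
    for row in rows:
        new_row = dict(row)
        if new_row[target] is None:
            key = tuple(new_row[v] for v in class_vars)
            if key in donors:  # leave the gap if no similar case exists
                new_row[target] = rng.choice(donors[key])
        imputed.append(new_row)
    return imputed


# Hypothetical records; None marks a missing income.
rows = [
    {"region": "north", "gender": "F", "income": 40},
    {"region": "north", "gender": "F", "income": 44},
    {"region": "north", "gender": "F", "income": None},
    {"region": "south", "gender": "M", "income": 30},
    {"region": "south", "gender": "M", "income": None},
    {"region": "south", "gender": "F", "income": None},
]

for row in hot_deck_impute(rows, "income", ["region", "gender"]):
    print(row)
```

The last record finds no donor because no other case shares its class. This is the practical difficulty with classification variables: the more classes there are, the more of them have no donor.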
HOT-DECK IMPUTATION
Assume: The observed data are a sample drawn from
multivariate normal distribution.
Parameters are estimated by available data and then
missing scores are estimated based on the parameters
just estimated.
The missing values are predicted by using conditional
distribution of variables on which data is available.
ML provides explicit modeling of the imputation process
that is open to scientific analysis and critique.
More accurate than listwise deletion and better than ad
hoc approaches like mean substitution.
However, it may be possible that differences are small
and the distributional assumptions in this method are
relatively strict.
MAXIMUM LIKELIHOOD
Uses Expectation Maximization Algorithm
Iterations through process of estimating missing data
First iteration involves estimating missing data and then
estimating parameters using ML method.
Second iteration would require re-estimating the missing
data based on new parameter estimates and then
recalculating the parameter estimates.
This process continues till there is convergence in the
parameter estimates.
Produces less biased estimates, more accurate.
Open to scientific analysis and critique.
Lengthy and complex.
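The iteration can be sketched for the simplest case: two variables assumed to be bivariate normal, with `x` fully observed and some `y` values missing (`None`). This is an illustrative assumption, not a general implementation. The E-step fills in the expected values of y, y squared, and x times y for the incomplete cases, and the M-step re-estimates the parameters, repeating until the estimates stop changing. All names and data are hypothetical.

```python
def em_bivariate(xs, ys, tol=1e-10, max_iter=1000):
    """EM estimates of mean(y), var(y), cov(x, y) when some ys are None.

    Assumes (x, y) is bivariate normal, x is fully observed, and the
    missingness in y does not depend on the missing y values themselves.
    """
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    obs = [y for y in ys if y is not None]
    # Starting values taken from the observed y values only.
    mu_y = sum(obs) / len(obs)
    s_yy = sum((y - mu_y) ** 2 for y in obs) / len(obs)
    s_xy = 0.0
    for iteration in range(1, max_iter + 1):
        # E-step: expected y, y^2 and x*y for each case given current parameters.
        beta = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                ey = mu_y + beta * (x - mu_x)
                ey2 = ey ** 2 + resid_var
            else:
                ey, ey2 = y, y ** 2
            sum_y += ey
            sum_yy += ey2
            sum_xy += x * ey
        # M-step: re-estimate the parameters from the expected statistics.
        new_mu_y = sum_y / n
        new_s_yy = sum_yy / n - new_mu_y ** 2
        new_s_xy = sum_xy / n - mu_x * new_mu_y
        change = max(
            abs(new_mu_y - mu_y), abs(new_s_yy - s_yy), abs(new_s_xy - s_xy)
        )
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break
    return mu_y, s_yy, s_xy, iteration


# Hypothetical data: y is missing for the cases with the largest x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, None, None, None]

mu_y, s_yy, s_xy, iterations = em_bivariate(x, y)
print(f"mean of y = {mu_y:.3f} after {iterations} iterations")
```

In this example the observed y values average 6.02, while the EM estimate of the mean is pulled upward because the cases with missing y have large x and the two variables are positively related.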
EXPECTATION MAXIMIZATION
Problem of Missing Data more severe
Listwise deletion: Loss of more data due to repeated
measures.
Additional data is collected on same measures at
different time.
Opportunity to use strongly correlated variables to
impute missing data.
Linear regression and subject mean can be used to
predict missing values, but it may be biased.
Interpolation and extrapolation can produce
relatively unbiased estimates.
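For repeated measures, interpolation can be as simple as the sketch below. It assumes equally spaced waves and fills each interior gap (`None`) on a straight line between the nearest observed neighbours. Gaps at the start or end are left alone, since filling them would require extrapolation. The function name and data are hypothetical.

```python
def interpolate_linear(series):
    """Fill interior gaps (None) by linear interpolation between neighbours."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical repeated measures for one subject across six waves.
waves = [10, None, 14, None, None, 20]
print(interpolate_linear(waves))
```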
REPEATED MEASURES AND TIME SERIES
DESIGN
The data can be missing at three levels:
1. Item-level missingness
2. Construct- level missingness
3. Person-level missingness
LEVELS OF MISSINGNESS
(Adopted from: Newman, D. A., (2014). Missing Data: Five Practical Guidelines, Sage Publications.)
Data can be missing randomly or
systematically.
Random Missingness:
Missing Completely at Random (MCAR)
Systematic Missingness
Missing at Random (MAR)
Missing not at Random (MNAR)
MECHANISMS OF MISSING DATA
MCAR (Missing Completely at Random)
The probability that a variable value is missing does not depend on
the observed data values nor the missing data values.
P ( missing | complete data ) = P (missing)
MAR (Missing at Random)
The probability that a variable value is missing partly depends on
other data that are observed in the dataset but does not depend on
any of the values that are missing.
P(missing | complete data ) = P (missing | observed data)
MNAR (Missing Not at Random)
The probability that a variable value is missing depends on the
missing data values themselves.
P (missing | complete data ) ≠ P (missing | observed data)
(Adopted from: Newman, D. A., (2014). Missing Data: Five Practical Guidelines, Sage Publications.)
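The three mechanisms can be made concrete with a small simulation. This is a sketch under made-up assumptions: a variable `x` that is always observed, a variable `y` that depends on `x`, and arbitrary retention probabilities of 0.6, 0.9 and 0.3. The mean of the surviving `y` values is compared with the mean of all `y` values.

```python
import random
from statistics import mean

rng = random.Random(42)
n = 10000
# Hypothetical population: x is always observed, y depends on x.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.8 * xi + rng.gauss(0, 0.6) for xi in x]


def observed_mean(keep_probability):
    """Mean of y among cases that survive the given missingness rule."""
    kept = [yi for xi, yi in zip(x, y) if rng.random() < keep_probability(xi, yi)]
    return mean(kept)


true_mean = mean(y)
mcar = observed_mean(lambda xi, yi: 0.6)                     # ignores x and y
mar = observed_mean(lambda xi, yi: 0.9 if xi < 0 else 0.3)   # depends on observed x
mnar = observed_mean(lambda xi, yi: 0.9 if yi < 0 else 0.3)  # depends on y itself

print(f"all cases: {true_mean:.3f}")
print(f"MCAR:      {mcar:.3f}")
print(f"MAR:       {mar:.3f}")
print(f"MNAR:      {mnar:.3f}")
```

Under MCAR the observed mean stays close to the full-data mean. Under MAR and MNAR the unadjusted mean is biased. In the MAR case the bias can in principle be corrected using the observed `x`. In the MNAR case it cannot be corrected from the observed data alone.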
BIAS AND INACCURATE STANDARD
ERRORS
CHOOSING MISSING DATA TREATMENTS
(Adopted from: Newman, D. A., (2014). Missing Data: Five Practical Guidelines, Sage Publications.)
STEP 1: DETERMINE THE TYPE OF MISSING DATA
Is it under the control of researcher?
Is it ignorable?
Ignorable Missing Data
Expected
Remedies not needed
Allowance for missing data are inherent in the technique
Missing data is operating at random
Non-Ignorable Missing Data
Known to researchers: Some remedies if random
Unknown missing data: Process less easy, but remedies
available
Missing data known or unknown: Proceed to next step
A FOUR STEP PROCESS FOR IDENTIFYING
MISSING DATA AND APPLYING REMEDIES
STEP 2: DETERMINE THE EXTENT OF MISSING DATA
Determine the extent of missing data
Patterns of individual variables, individual cases and even
overall.
Is it low enough not to affect the results?
Is it random?
If sufficiently low: Apply any remedy
If not low: Determine the randomness before applying the
remedy
Assessing the Extent and Pattern of Missing data:
Tabulate
Number of cases with missing data
Percentage of variables with missing data in each case.
Look for non-random pattern
Also determine number of cases with no missing data (100%
complete)
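The tabulation described above can be sketched as follows, assuming records stored as dictionaries with `None` marking a missing value. It reports the percentage of missing cases per variable, the percentage of missing variables per case, and the number of fully complete cases. Names and data are hypothetical.

```python
def missing_summary(rows):
    """Tabulate missing values (None) per variable and per case."""
    variables = list(rows[0])
    n = len(rows)
    per_variable = {
        v: 100 * sum(row[v] is None for row in rows) / n for v in variables
    }
    per_case = [
        100 * sum(row[v] is None for v in variables) / len(variables)
        for row in rows
    ]
    complete_cases = sum(pct == 0 for pct in per_case)
    return per_variable, per_case, complete_cases


# Hypothetical data set with four variables and five cases.
rows = [
    {"v1": 3, "v2": 4, "v3": None, "v4": 2},
    {"v1": 5, "v2": None, "v3": None, "v4": 1},
    {"v1": 2, "v2": 3, "v3": 4, "v4": 5},
    {"v1": None, "v2": 2, "v3": 3, "v4": 4},
    {"v1": 4, "v2": 5, "v3": 1, "v4": 3},
]

per_variable, per_case, complete_cases = missing_summary(rows)
print("% missing per variable:", per_variable)
print("% missing per case:", per_case)
print("complete cases:", complete_cases)
```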
Is missing data too high to create a bias? (Rule of Thumb 1)
Can deletion be used? (Rule of Thumb 2)
Missing data under 10% can generally be
ignored when it happens in random fashion.
The number of cases with no missing data
should be sufficient for the selected analysis
technique if replacement values will not be
substituted (imputed) for the missing data.
RULE OF THUMB 1
HOW MUCH MISSING DATA IS TOO MUCH?
Variables with as little as 15% missing data are candidates for deletion.
Higher levels of missingness, such as 20-30%, can often be
remedied.
Deletion of large data should be justifiable.
Cases with missing data for dependent variables typically
are deleted to avoid any artificial increase in the relationship with
independent variables.
While deleting a variable, ensure a highly correlated
variable is available to represent intent of original
variable.
Always perform the analysis both with and without the deleted
cases or variables to identify any marked differences.
RULE OF THUMB 2
DELETION BASED ON MISSING DATA
STEP 3: DIAGNOSE THE RANDOMNESS OF THE MISSING DATA PROCESSES.
Degree of randomness determines the appropriate level of remedy.
Level of Randomness
Random: MCAR
Observed values of Y are truly a random sample of Y values.
No underlying process that tends to bias the observed data.
Missing data are indistinguishable from complete data.
Non-Random: MAR
Missing values of Y depends on X but not on Y
Observed values of Y represent a random sample of Y for each value of X.
Cannot be generalized.
Diagnostic Tests for Level of Randomness
Forming 2 groups, with and without missing data : T-Test
Overall test of Randomness for MCAR
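The two-group comparison can be sketched as below. Cases are split by whether one variable (here income) is missing, and the groups are compared on another, fully observed variable (here age). The sketch computes Welch's t statistic, which is one reasonable choice of t-test. The data and names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for a difference in means between two groups."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical data: age is fully observed, income has missing values (None).
age = [23, 25, 31, 35, 38, 41, 44, 52, 58, 61, 64, 67]
income = [30, 32, 35, None, 40, 42, None, 50, None, None, None, None]

age_income_missing = [a for a, inc in zip(age, income) if inc is None]
age_income_present = [a for a, inc in zip(age, income) if inc is not None]

t = welch_t(age_income_missing, age_income_present)
print(f"t = {t:.2f}")
```

A large absolute t value suggests that missingness on income is related to age, which would argue against treating the data as MCAR.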
STEP 4: SELECT THE IMPUTATION METHOD
UNDER 10%
Any imputation method can be applied.
10% - 20%
For MCAR
Hot-Deck Case Substitution and Regression Imputation
For MAR
Model Based Methods
Over 20%
Regression method for MCAR
Model Based method for MAR
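The rule of thumb above can be written as a small lookup function. This is only a restatement of the slide in code. The function name is hypothetical, and the handling of the exact 10% and 20% boundaries (both placed in the middle band) is an assumption.

```python
def suggest_imputation(percent_missing, mechanism):
    """Map the extent and mechanism of missing data to the rule of thumb."""
    if mechanism not in ("MCAR", "MAR"):
        raise ValueError("mechanism must be 'MCAR' or 'MAR'")
    if percent_missing < 10:
        return "any imputation method"
    if mechanism == "MAR":
        return "model-based methods"
    if percent_missing <= 20:
        return "hot-deck case substitution and regression imputation"
    return "regression method"


for pct, mech in [(5, "MAR"), (15, "MCAR"), (15, "MAR"), (30, "MCAR"), (30, "MAR")]:
    print(f"{pct}% missing, {mech}: {suggest_imputation(pct, mech)}")
```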
RULE OF THUMB 3
IMPUTATION OF MISSING DATA
1. Dooley, L. M., & Lindner, J. R. (2003). The handling of nonresponse error. Human Resource Development Quarterly, 14(1), 99-110.
2. Roth, P. L. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47(3), 537-560.
3. Blair, E., & Zinkhan, G. M. (2006). Nonresponse and generalizability in academic research. Journal of the Academy of Marketing Science, 34(1), 4-7.
4. Newman, D. A. (2014). Missing data: Five practical guidelines. Organizational Research Methods, 17(4), 372-411.
5. Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). Multivariate data analysis (6th ed.). New Jersey: Pearson Education.
REFERENCES