missing data and non response pdf

Anuj Vijay Bhatia

FPRM 14

Institute of Rural Management Anand

NON-RESPONSE ERROR: HOW TO HANDLE IT?

Research Methodology

Non-response occurs when a respondent does not reply to the mail, does not find time to give the interview, or cannot be contacted. There can be many such reasons for non-response.

A high rate of non-response is serious.

Research may lose:

Credibility

Acceptability

Accuracy and Professional Soundness

Methodology used should be described completely.

It is the researcher's responsibility to establish external validity.

Appropriate sample size and acceptable response rate must be achieved.

NON RESPONSE ERROR

Nonresponse error exists to the extent that subjects included in the sample fail to provide usable responses.

Research marked by high nonresponse loses validity and reliability.

Many research articles:

Do not mention nonresponse as a threat to external validity.

Do not attempt to control for non response error.

Do not provide references to the literature on handling nonresponse.

Nonresponse limits the ability of the researcher to generalize.

NON RESPONSE ERROR

In survey research, the ability to generalize is critical.

There is a risk that non-respondents will be systematically different from respondents.

The response rate is higher (often 100%) when purposive or convenience sampling is used.

However, when probability sampling is used, response rates are low.

The ability to generalize is limited when purposive or convenience sampling is used.

In that case the threat to validity is due not to the response rate but to nonrepresentative sampling procedures.

To ensure external validity, ask: would your results be the same if a 100% response rate had been achieved?

SAMPLING PROCEDURES AND NON-RESPONSE

Suppose the population is divided into two strata: the respondents (r) and the non-respondents, whose data are missing (m). Suppose we want to determine $\bar{Y}$, the overall population mean.

$$\bar{Y} = W_r \bar{Y}_r + W_m \bar{Y}_m$$

$\bar{Y}_r$ and $\bar{Y}_m$ are the means of respondents and non-respondents respectively. $W_r$ and $W_m$ are the weights (the population share of each stratum), so $W_r + W_m = 1$.

If the survey fails to collect data from non-respondents, it will produce an estimate equal to $\bar{Y}_r$.

The bias is the difference between $\bar{Y}_r$ and $\bar{Y}$:

$$\bar{Y}_r - \bar{Y} = \bar{Y}_r - (W_r \bar{Y}_r + W_m \bar{Y}_m)$$

$$= \bar{Y}_r (1 - W_r) - W_m \bar{Y}_m$$

$$= W_m (\bar{Y}_r - \bar{Y}_m)$$

A SIMPLE LOGIC
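A small numeric check of the bias formula may help. The sketch below assumes made-up values (60% respondents with mean 50, 40% non-respondents with mean 40); the function name `nonresponse_bias` is hypothetical and only illustrates the arithmetic.

```python
def nonresponse_bias(w_r, mean_r, mean_m):
    """Return (population mean, bias of the respondent-only estimate)."""
    w_m = 1.0 - w_r                      # weights sum to 1
    pop_mean = w_r * mean_r + w_m * mean_m
    bias = mean_r - pop_mean             # estimate minus true mean
    return pop_mean, bias

# Illustrative (invented) inputs
pop_mean, bias = nonresponse_bias(w_r=0.6, mean_r=50.0, mean_m=40.0)

# The direct difference agrees with the closed form W_m * (Y_r - Y_m)
closed_form = (1.0 - 0.6) * (50.0 - 40.0)
print(pop_mean, bias, closed_form)
```

The bias grows with both the share of non-respondents and the gap between the two group means, and it vanishes when either is zero.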

Controlling non-response error begins with design and implementation.

Appropriate sampling protocols and procedures should be used to maximize participation.

Ensure that the response rate is high enough to conclude that non-response is not a threat to external validity.

If required, use additional procedures to establish that non-response is not a threat to external validity.

CONTROLLING NON-RESPONSE ERROR

Methods for Handling Non-Response

1. Comparison of Early to Late Respondents

2. Using “Days to Respond” as a Regression Variable

3. Compare Respondents to Non-Respondents

4. Compare Respondents on Characteristics Known a Priori

5. Ignore Non-Response as a Threat to External Validity

RECOMMENDATIONS FOR HANDLING NON-RESPONSE

Method 1: Comparison of Early to Late Respondents

Extrapolation based on statistical inferences

Operationally define ‘Late Respondents’

Last wave of respondents: Late Respondents

Compare early and late respondents on the key variables of interest.

If there is no difference, the results can be generalized to the larger population.

METHODS FOR HANDLING NON-RESPONSE
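A minimal sketch of the early-versus-late comparison in Method 1, using invented scores on one key variable. A permutation test on the difference in means stands in here for the t-test a researcher would more commonly run; that substitution, the data, and the name `permutation_test` are assumptions made for illustration.

```python
import random
from statistics import mean

def permutation_test(early, late, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(early) - mean(late))
    pooled = list(early) + list(late)
    n_early = len(early)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_early]) - mean(pooled[n_early:]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: last wave of returns treated as "late respondents"
early = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 4.3, 3.7]
late = [4.0, 4.2, 3.9, 4.1, 3.8, 4.4]
p_value = permutation_test(early, late)
print(p_value)
```

A large p-value is consistent with no early/late difference on this variable; it does not prove that non-respondents resemble respondents, which is the extrapolation Method 1 relies on.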

Method 2: Using “Days to Respond” as a Regression Variable

“Days to respond” is coded as a continuous variable and used as an independent variable (IV) in the regression equation.

The primary variables of interest are regressed on the variable “Days to Respond”.

If the result is not statistically significant, assume that respondents are not different from non-respondents.
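Method 2 can be sketched as a simple regression of one variable of interest on "days to respond". The example assumes invented data and a hand-rolled least-squares fit; `slope_t_statistic` is a hypothetical helper, and the resulting t statistic would still need to be compared against a t table with n - 2 degrees of freedom.

```python
import math

def slope_t_statistic(days, outcome):
    """OLS slope of outcome on days, and the t statistic for slope = 0."""
    n = len(days)
    mx = sum(days) / n
    my = sum(outcome) / n
    sxx = sum((d - mx) ** 2 for d in days)
    sxy = sum((d - mx) * (o - my) for d, o in zip(days, outcome))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares around the fitted line
    sse = sum((o - (intercept + slope * d)) ** 2 for d, o in zip(days, outcome))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope / se_slope

# Hypothetical survey: days until each questionnaire came back, and a score
days = [1, 2, 2, 3, 5, 7, 8, 10, 12, 14]
score = [4.0, 4.2, 3.9, 4.1, 4.3, 3.8, 4.0, 4.2, 4.1, 3.9]
slope, t_stat = slope_t_statistic(days, score)
print(slope, t_stat)
```

With 10 cases the two-sided 5% critical value is about 2.31 (8 degrees of freedom), so a |t| below that would be read as "days to respond" not predicting the score.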



Method 3: Compare Respondents to Non-Respondents

Compute differences by sampling nonrespondents and working extra diligently to get their responses.

A minimum of 20% of responses from nonrespondents should be obtained.

If fewer than 20% of responses are obtained, Method 1 or 2 should be used, combining the results.



Method 4: Compare Respondents on Characteristics Known a Priori

Compare respondents to the population on characteristics known in advance.

Describe similarities and differences.

Method 5: Ignore Non-Response as a Threat to External Validity

If the above methods cannot be used, you can choose to ignore non-response.




MISSING DATA IN QUANTITATIVE RESEARCH


What is certain in life?

Death

Taxes

What is certain in research?

Measurement error

Missing data

Missing data can be:

Due to preventable errors, mistakes, or lack of foresight by the researcher

Due to problems outside the control of the researcher

Deliberate, intended, or planned by the researcher to reduce cost or respondent burden

Due to differential applicability of some items to subsets of respondents, etc.

A FOOD FOR THOUGHT

• Non-Response vs. Missing Data

• Missing Data: Where valid values on one or more variables are not available for analysis.

• The researcher's primary concern is to identify the patterns and relationships underlying the missing data.

• We need to understand the process leading to missing data in order to take an appropriate course of action.

• Common in Social Research

• More acute in experiments and surveys

• Best way is to avoid it by planning and conscientious data collection.

• Not uncommon to have some level of missing data.

MISSING DATA

Lost data

Reduces Statistical Power

Meaningfully diminishes sample size

Bias Parameter Estimates

Correlations biased downwards

Predictor scores affected

Restrict Variance

Central Tendency Biased

PRIMARY PROBLEMS

Simple Techniques

Listwise Deletion

Pairwise Deletion

Mean Substitution

Regression Imputation

Hot-Deck Imputation

Maximum Likelihood and Related Methods

Maximum Likelihood

Expectation Maximization

Repeated Measures and Time Series Designs

TECHNIQUES TO DEAL WITH MISSING DATA

Eliminate all cases with missing data on any predictor or criterion.

Sacrifices a large amount of data

Decreases statistical power

May introduce bias in parameter estimates

Default option in many statistical packages

LISTWISE DELETION

Deletes information only from those statistics that “need” the information.

Preserves a great deal more information than listwise deletion.

Interpretation becomes difficult.

May lead to mathematically inconsistent correlations.

PAIRWISE DELETION
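The difference between listwise and pairwise deletion shows up directly in how many cases each statistic gets to use. The sketch below uses a tiny invented dataset with `None` marking a missing value; the variable names and helper functions are hypothetical.

```python
# Hypothetical records; None marks a missing value
rows = [
    {"age": 25, "income": 40, "score": 3.5},
    {"age": 31, "income": None, "score": 4.0},
    {"age": None, "income": 52, "score": 3.0},
    {"age": 45, "income": 61, "score": None},
    {"age": 38, "income": 58, "score": 4.5},
]
variables = ["age", "income", "score"]

def listwise(rows, variables):
    """Keep only cases that are complete on every variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pairwise_n(rows, a, b):
    """Number of cases usable for a statistic involving only a and b."""
    return sum(1 for r in rows if r[a] is not None and r[b] is not None)

complete = listwise(rows, variables)
pair_counts = {
    (a, b): pairwise_n(rows, a, b)
    for i, a in enumerate(variables)
    for b in variables[i + 1:]
}
print(len(complete), pair_counts)
```

Here listwise deletion keeps 2 of 5 cases, while each pairwise correlation would be based on 3. Because each pair draws on a different subset of cases, the resulting correlation matrix can be internally inconsistent, which is the interpretation problem noted for pairwise deletion.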

Use means in place of missing data

Allows use of the rest of the individual's data

Preserves data

Easy to use

Attenuates variance and covariance estimates

Useful when correlations between variables are low and less than 10% of data are missing.

MEAN SUBSTITUTION
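The variance attenuation caused by mean substitution is easy to see on a toy example. This sketch assumes four invented observed values and two missing ones; nothing here comes from the original slides beyond the technique itself.

```python
from statistics import mean, variance

observed = [2.0, 4.0, 6.0, 8.0]   # hypothetical observed values
n_missing = 2                      # two cases lack this variable

fill = mean(observed)              # substitute the observed mean
imputed = observed + [fill] * n_missing

var_before = variance(observed)    # sample variance, observed cases only
var_after = variance(imputed)      # sample variance after substitution
print(mean(imputed), var_before, var_after)
```

The mean is unchanged, but every imputed value sits exactly at the centre of the distribution, so the variance shrinks, and covariances with other variables shrink with it.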

Estimate missing data based on other variables in the data set.

Advantages:

Preserves data

Better than listwise and pairwise deletion

Preserves the deviation from the mean

Does not attenuate correlations as mean substitution does.

Variants:

Simple regression strategy

Only one iteration

Estimate relationships in variables and estimate missing data

Stepwise/Iterative Regression

Isolate a few key variables, prepare correlation matrix.

Estimate regression equation and predict missing values

REGRESSION IMPUTATION
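A minimal sketch of the simple (single-iteration) regression strategy: fit a line on the complete cases, then predict the missing values from it. The data and the helper `fit_line` are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical (x, y) pairs; y is missing for one case
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, None), (5.0, 10.0)]

# Step 1: estimate the relationship from complete cases only
complete = [(x, y) for x, y in rows if y is not None]
slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])

# Step 2: replace each missing y with its predicted value
imputed = [(x, y if y is not None else intercept + slope * x) for x, y in rows]
print(imputed)
```

Unlike mean substitution, the imputed value reflects the case's own position on x. Because imputed points fall exactly on the fitted line, this deterministic form tends to overstate the strength of the x-y relationship.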

Replace the missing value with an actual score from a similar case in the current data set.

Hot-deck? What is so hot about it?

What is Cold-Deck then?

Missing values are replaced with a reasonable estimate from a similar individual.

Accurate: Real values are imputed

May not distort distributions.

Helpful when data is missing in patterns.

Little literature backing the accuracy claim.

Problematic when there are large classification variables.

Categorizing variables sacrifices information.

Estimating Standard Errors Difficult.

HOT-DECK IMPUTATION
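One common way to define a "similar case" for hot-deck imputation is to match on a classification variable and draw a donor at random within the class. The sketch below assumes that matching rule; the field names and the function `hot_deck` are hypothetical.

```python
import random

def hot_deck(rows, target, class_var, seed=0):
    """Fill missing `target` values with a real value from a random donor
    in the same `class_var` group. Cases with no donor stay missing."""
    rng = random.Random(seed)
    donors = {}
    for r in rows:
        if r[target] is not None:
            donors.setdefault(r[class_var], []).append(r[target])
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are not modified
        if r[target] is None and donors.get(r[class_var]):
            r[target] = rng.choice(donors[r[class_var]])
        filled.append(r)
    return filled

# Hypothetical data set
rows = [
    {"region": "north", "income": 40},
    {"region": "north", "income": 44},
    {"region": "north", "income": None},
    {"region": "south", "income": 60},
    {"region": "south", "income": None},
    {"region": "east", "income": None},   # no donor available
]
filled = hot_deck(rows, "income", "region")
print(filled)
```

Every imputed value is a real observed score, which is the appeal of the method. The "east" case shows the practical limit noted above: the finer the classification, the more often a class has no donor.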

Assume: The observed data are a sample drawn from a multivariate normal distribution.

Parameters are estimated from the available data, and then the missing scores are estimated based on the parameters just estimated.

The missing values are predicted by using the conditional distribution of the variables on which data are available.

ML provides explicit modeling of the imputation process that is open to scientific analysis and critique.

More accurate than listwise deletion and better than ad hoc approaches like mean substitution.

However, the differences may be small, and the distributional assumptions of this method are relatively strict.

MAXIMUM LIKELIHOOD

Uses the Expectation Maximization (EM) algorithm.

Iterates through the process of estimating missing data.

The first iteration involves estimating the missing data and then estimating the parameters using the ML method.

The second iteration re-estimates the missing data based on the new parameter estimates and then recalculates the parameter estimates.

This process continues until the parameter estimates converge.

Produces less biased estimates, more accurate.

Open to scientific analysis and critique.

Lengthy and complex.

EXPECTATION MAXIMIZATION
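The iteration described for EM can be sketched for the simplest case: two variables assumed bivariate normal, with x fully observed and y missing for some cases. The data, the function name `em_bivariate`, and the choice of this special case are all assumptions made for illustration; general-purpose EM handles arbitrary missing patterns.

```python
def em_bivariate(x, y, max_iter=500, tol=1e-12):
    """EM estimates of mean(y), var(y), cov(x, y) when some y are None."""
    n = len(x)
    obs = [i for i in range(n) if y[i] is not None]
    mx = sum(x) / n
    sxx = sum((v - mx) ** 2 for v in x) / n
    # Starting values from the cases where y is observed
    my = sum(y[i] for i in obs) / len(obs)
    syy = sum((y[i] - my) ** 2 for i in obs) / len(obs)
    sxy = sum((x[i] - mx) * (y[i] - my) for i in obs) / len(obs)
    for _ in range(max_iter):
        b = sxy / sxx
        a = my - b * mx
        res_var = syy - b * sxy          # residual variance of y given x
        # E-step: expected y and y^2 for each case under current parameters
        ey = [y[i] if y[i] is not None else a + b * x[i] for i in range(n)]
        ey2 = [y[i] ** 2 if y[i] is not None else (a + b * x[i]) ** 2 + res_var
               for i in range(n)]
        # M-step: re-estimate the parameters from the expected statistics
        new_my = sum(ey) / n
        new_syy = sum(ey2) / n - new_my ** 2
        new_sxy = sum(x[i] * ey[i] for i in range(n)) / n - mx * new_my
        change = abs(new_my - my) + abs(new_syy - syy) + abs(new_sxy - sxy)
        my, syy, sxy = new_my, new_syy, new_sxy
        if change < tol:                 # convergence of parameter estimates
            break
    return my, syy, sxy

# Hypothetical data: y is missing for the three cases with the largest x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, None, None, None]
mu_y, var_y, cov_xy = em_bivariate(x, y)
print(mu_y, var_y, cov_xy)
```

Note that the E-step adds the residual variance to the expected squared values rather than just plugging in predictions; without that term the variance of y would be understated in the same way as with regression imputation. In this example the estimated mean of y ends up above the complete-case mean, because the cases with missing y have larger x.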

The problem of missing data is more severe.

Listwise deletion: loss of more data due to repeated measures.

Additional data are collected on the same measures at different times.

Opportunity to use strongly correlated variables to impute missing data.

Linear regression and the subject mean can be used to predict missing values, but the result may be biased.

Interpolation and extrapolation can produce relatively unbiased estimates.

REPEATED MEASURES AND TIME SERIES DESIGN
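For repeated measures, interpolation fills a gap from the same subject's neighbouring waves. A minimal sketch of linear interpolation follows, with invented wave scores; the function name `interpolate` is hypothetical, and gaps at the ends of the series (which would need extrapolation) are deliberately left missing.

```python
def interpolate(series):
    """Linearly interpolate interior gaps (None) between known values."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

# Hypothetical scores for one subject across seven waves
waves = [10.0, None, 14.0, None, None, 20.0, None]
print(interpolate(waves))
```

Each interior gap is filled along the straight line between the nearest observed waves for that subject.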

The data can be missing at three levels:

1. Item-level missingness

2. Construct-level missingness

3. Person-level missingness

LEVELS OF MISSINGNESS

(Adapted from: Newman, D. A. (2014). Missing Data: Five Practical Guidelines, Sage Publications.)

Data can be missing randomly or systematically.

Random Missingness:

Missing Completely at Random (MCAR)

Systematic Missingness

Missing at Random (MAR)

Missing not at Random (MNAR)

MECHANISMS OF MISSING DATA

MCAR (Missing Completely at Random)

The probability that a variable value is missing does not depend on either the observed data values or the missing data values.

P(missing | complete data) = P(missing)

MAR (Missing at Random)

The probability that a variable value is missing partly depends on other data that are observed in the dataset but does not depend on any of the values that are missing.

P(missing | complete data) = P(missing | observed data)

MNAR (Missing Not at Random)

The probability that a variable value is missing depends on the missing data values themselves.

P(missing | complete data) ≠ P(missing | observed data)
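A small simulation can make the three mechanisms concrete. The sketch below generates an invented variable y that is correlated with a fully observed x, then deletes y values under each mechanism; the probabilities, the seed, and all names are assumptions chosen for illustration.

```python
import random

rng = random.Random(42)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]              # always observed
y = [50 + 5 * xi + rng.gauss(0, 5) for xi in x]      # subject to missingness

# MCAR: every y has the same 30% chance of being missing
mcar = [rng.random() < 0.3 for _ in range(n)]
# MAR: chance of missing y depends only on the observed x
mar = [rng.random() < (0.6 if xi > 0 else 0.1) for xi in x]
# MNAR: chance of missing y depends on the value of y itself
mnar = [rng.random() < (0.6 if yi > 50 else 0.1) for yi in y]

def observed_mean(values, missing):
    """Mean of the values that were not flagged as missing."""
    kept = [v for v, m in zip(values, missing) if not m]
    return sum(kept) / len(kept)

full_mean = sum(y) / n
print("full:", full_mean)
print("MCAR:", observed_mean(y, mcar))
print("MAR: ", observed_mean(y, mar))
print("MNAR:", observed_mean(y, mnar))
```

Under MCAR the observed mean stays close to the full-data mean. Under MAR and MNAR the naive observed mean is pulled down, but only in the MAR case can the distortion be corrected using the observed x, because there the missingness carries no extra information about y once x is known.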


BIAS AND INACCURATE STANDARD ERRORS

CHOOSING MISSING DATA TREATMENTS


STEP 1: DETERMINE THE TYPE OF MISSING DATA

Is it under the control of researcher?

Is it ignorable?

Ignorable Missing Data

Expected

Remedies not needed

Allowances for missing data are inherent in the technique

Missing data are operating at random

Non-Ignorable Missing Data

Known to researchers: Some remedies if random

Unknown missing data: Process less easy, but remedies available

Missing data known or unknown: Proceed to next step

A FOUR-STEP PROCESS FOR IDENTIFYING MISSING DATA AND APPLYING REMEDIES

STEP 2: DETERMINE THE EXTENT OF MISSING DATA

Determine the extent of missing data

Examine patterns for individual variables, individual cases, and overall.

Is it low enough not to affect the results?

Is it random?

If sufficiently low: Apply any remedy

If not low: Determine the randomness before applying the

remedy

Assessing the Extent and Pattern of Missing data:

Tabulate

Number of cases with missing data

Percentage of variables with missing data in each case.

Look for non-random pattern

Also determine the number of cases with no missing data (100% complete)

Is missing data too high to create a bias? (Rule of Thumb 1)

Can deletion be used? (Rule of Thumb 2)
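The tabulation described in Step 2 can be sketched in a few lines. The dataset below is invented, with `None` marking a missing response; the variable names are hypothetical.

```python
# Hypothetical responses to four items; None marks a missing value
rows = [
    {"q1": 5, "q2": 3, "q3": None, "q4": 2},
    {"q1": 4, "q2": None, "q3": None, "q4": 1},
    {"q1": 3, "q2": 2, "q3": 4, "q4": 5},
    {"q1": None, "q2": 1, "q3": 2, "q4": 4},
    {"q1": 2, "q2": 5, "q3": 3, "q4": 3},
]
variables = list(rows[0])

# Percentage of cases missing each variable
pct_missing_by_variable = {
    v: 100 * sum(r[v] is None for r in rows) / len(rows) for v in variables
}
# Percentage of variables missing within each case
pct_missing_by_case = [
    100 * sum(r[v] is None for v in variables) / len(variables) for r in rows
]
# Cases with any missing data, and fully complete cases
n_cases_with_missing = sum(p > 0 for p in pct_missing_by_case)
n_complete = sum(p == 0 for p in pct_missing_by_case)

print(pct_missing_by_variable)
print(pct_missing_by_case, n_cases_with_missing, n_complete)
```

Reading the two tables side by side helps spot non-random patterns, such as one item (`q3` here) accounting for most of the missingness.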

Missing data under 10% can generally be ignored when it occurs in a random fashion.

The number of cases with no missing data should be sufficient for the selected analysis technique if replacement values will not be substituted (imputed) for the missing data.

RULE OF THUMB 1: HOW MUCH MISSING DATA IS TOO MUCH?

Variables with as little as 15% missing data are candidates for deletion.

Higher levels of missingness, such as 20-30%, can often be remedied.

Deletion of large amounts of data should be justifiable.

Cases with missing data for dependent variables are typically deleted to avoid an artificial increase in relationships with the independent variables.

While deleting a variable, ensure that a highly correlated variable is available to represent the intent of the original variable.

Always perform the analysis both with and without the deleted cases or variables to identify any marked differences.

RULE OF THUMB 2: DELETION BASED ON MISSING DATA

STEP 3: DIAGNOSE THE RANDOMNESS OF THE MISSING DATA PROCESSES.

Degree of randomness determines the appropriate level of remedy.

Level of Randomness

Random: MCAR

Observed values of Y are truly a random sample of Y values.

No underlying process that tends to bias the observed data.

Cases with missing data are indistinguishable from cases with complete data.

Non-Random: MAR

Missingness of Y depends on X but not on Y.

Observed values of Y represent a random sample of Y for each value of X.

Cannot be generalized.

Diagnostic Tests for Level of Randomness

Form two groups, with and without missing data, and compare them with a t-test

Overall test of Randomness for MCAR
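The two-group diagnostic can be sketched as follows: split cases by whether a variable (income, say) is missing, then compare the groups on another observed variable (age). The data are invented, and Welch's unequal-variance form of the t-test is an assumption; the slides do not say which variant to use.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical ages, grouped by whether the income item was answered
age_income_present = [25, 31, 38, 29, 35, 27]
age_income_missing = [52, 47, 58, 61]

t_stat, df = welch_t(age_income_present, age_income_missing)
print(t_stat, df)
```

A large |t| suggests that missingness on income is related to age, which argues against MCAR. A non-significant result on the variables tested is consistent with MCAR but cannot rule out MNAR, since the missing values themselves are never observed.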

STEP 4: SELECT THE IMPUTATION METHOD

UNDER 10%

Any imputation method can be applied.

10% - 20%

For MCAR

Hot-Deck, Case Substitution and Regression Imputation

For MAR

Model Based Methods

Over 20%

Regression method for MCAR

Model Based method for MAR

RULE OF THUMB 3: IMPUTATION OF MISSING DATA
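Rule of Thumb 3 amounts to a small lookup on the percentage of missing data and the diagnosed mechanism. The sketch below encodes it as read from the slide; the function name `suggest_treatment` and the handling of the exact 10% and 20% boundaries are assumptions, since the slide does not say which band they fall in.

```python
def suggest_treatment(pct_missing, mechanism):
    """Map extent of missing data and mechanism to a suggested remedy."""
    mechanism = mechanism.upper()
    if mechanism not in {"MCAR", "MAR"}:
        raise ValueError("mechanism must be 'MCAR' or 'MAR'")
    if pct_missing < 10:
        return "any imputation method"
    if pct_missing <= 20:  # boundary placement is an assumption
        if mechanism == "MCAR":
            return "hot-deck, case substitution, or regression imputation"
        return "model-based methods"
    if mechanism == "MCAR":
        return "regression method"
    return "model-based method"

print(suggest_treatment(5, "MAR"))
print(suggest_treatment(15, "MCAR"))
print(suggest_treatment(30, "MAR"))
```

MNAR is deliberately rejected rather than mapped to a remedy, because the rule of thumb covers only the MCAR and MAR cases.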

1. Dooley, L. M., & Lindner, J. R. (2003). The handling of nonresponse error. Human Resource Development Quarterly, 14(1), 99-110.

2. Roth, P. L. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47(3), 537-560.

3. Blair, E., & Zinkhan, G. M. (2006). Nonresponse and generalizability in academic research. Journal of the Academy of Marketing Science, 34(1), 4-7.

4. Newman, D. A. (2014). Missing data: Five practical guidelines. Organizational Research Methods, 17(4), 372-411.

5. Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). Multivariate data analysis (6th ed.). New Jersey: Pearson Education.

REFERENCES