revised thesis again

Abstract

Several imputation methods have been developed for imputing missing responses. It is

often not clear which imputation method is best for a particular condition. In choosing an

imputation method, several factors should be considered such as the types of estimates

that will be generated, the type and pattern of nonresponse, and the availability of the

auxiliary data that are highly correlated with the characteristic of interest or with the response

propensity.

This study compared the effectiveness of four imputation procedures, namely Overall

Mean, Hot Deck, Deterministic Regression and Stochastic Regression Imputation, using the

first-visit variable as the auxiliary variable. Values of the second-visit variables Total Income

and Expenditures (TOTIN2 and TOTEX2) were set to nonresponse to satisfy the assumption

of partial nonresponse. The results of the study provide some support for the following

conclusions: (a) for the 1997 FIES data, the Hot Deck Imputation and Overall Mean

Imputation methods are not appropriate for handling partial nonresponse data; (b)

stochastic regression imputation was selected as the best imputation method; and (c) the

imputation classes must be homogeneous to produce less biased estimates.

Chapter 1

The Problem and Its Background

1.1 Introduction

Missing data in sample surveys is inevitable. The problem of missing data occurs for

various reasons, such as when the respondent has moved to another location, refuses to

participate in the survey or is unable to answer specific items in the survey. This failure

to obtain responses from the units selected in the sample is called nonresponse. There are

several types of nonresponse: (a) unit nonresponse refers to the failure to collect any

data from a sample unit; (b) item nonresponse refers to the failure to collect valid

responses to one or more items from a responding sample unit (i.e., in surveys with

only one phase, or when a single phase is considered and the other phases are ignored);

and (c) partial nonresponse occurs when there is a failure to collect responses for a large

set or block of items for a responding unit (i.e., in surveys with more than one phase,

when the same respondent cannot answer in the succeeding phases of the survey).

The effect of nonresponse must not be ignored since it leads to biased estimates, and a

large bias makes the estimates inaccurate. Bias due to nonresponse is believed to be a

function of the nonresponse rate and the difference in characteristics between responding

and nonresponding units. The larger the nonresponse rate, or the wider the difference in

characteristics between the responding and nonresponding units, the larger the bias.
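
The dependence of the bias on these two quantities is often written as below. This is the usual textbook expression under the assumption that the population splits into a fixed group of respondents and a fixed group of nonrespondents. The symbols are introduced here only for illustration and are not notation from the study: W_r and W_m are the population proportions of respondents and nonrespondents, and the bars denote the corresponding population means.

```latex
\bar{Y} = W_r \bar{Y}_r + W_m \bar{Y}_m, \qquad W_r + W_m = 1
\\[4pt]
\operatorname{Bias}(\bar{y}_r) = \bar{Y}_r - \bar{Y} = W_m \left( \bar{Y}_r - \bar{Y}_m \right)
```

The bias vanishes only when there is no nonresponse (W_m = 0) or when respondents and nonrespondents have the same mean.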

In practice, there are three ways of handling missing data. These are discarding the

missing values, applying weighting adjustments or using imputation techniques.

Discarding the missing values, otherwise known as the Available Case Method, is

based on excluding the nonresponse records when analyzing the variable of interest. The

problem with this method is that it does not account for the difference in characteristics

between the responding and nonresponding units. Hence, methods for compensating for

missing data are applied. The first method is called weighting adjustment. Weighting

adjustment is based on matching nonrespondents to respondents in terms of the data

available on nonrespondents and increasing the weights of the matched respondents to

account for the missing values. Hence, the weight of each matched respondent is often

multiplied by the inverse of the response rate. This is often applied for unit nonresponse.

On the other

hand, imputation is also used by statisticians to account for nonresponse, usually in the

case of item and partial nonresponse. In imputation, a missing value is replaced by a

reasonable substitute for the missing information. Once nonresponse has been dealt with,

whether by weighting adjustments or imputation, then researchers can proceed with their

data analysis.
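
A minimal sketch of the weighting adjustment described above, assuming that nonrespondents are matched to respondents through a weighting class and that the response rate is computed from the base weights within each class. The function name, the dictionary keys and the sample values are hypothetical and are not taken from the FIES.

```python
from collections import defaultdict


def adjust_weights(units):
    """Inflate respondent weights by the inverse of the class response rate.

    `units` is a list of dicts with hypothetical keys:
    'cls' (weighting class), 'weight' (base weight), 'responded' (bool).
    Nonrespondents receive an adjusted weight of 0.
    """
    total = defaultdict(float)       # base weights of all sampled units, per class
    responding = defaultdict(float)  # base weights of respondents only, per class
    for u in units:
        total[u["cls"]] += u["weight"]
        if u["responded"]:
            responding[u["cls"]] += u["weight"]

    adjusted = []
    for u in units:
        if u["responded"]:
            response_rate = responding[u["cls"]] / total[u["cls"]]
            adjusted.append(u["weight"] / response_rate)
        else:
            adjusted.append(0.0)
    return adjusted


# Made-up sample: class A has a 50% response rate, class B has 100%.
sample = [
    {"cls": "A", "weight": 10.0, "responded": True},
    {"cls": "A", "weight": 10.0, "responded": False},
    {"cls": "B", "weight": 20.0, "responded": True},
    {"cls": "B", "weight": 20.0, "responded": True},
]
weights = adjust_weights(sample)
```

In this sketch the adjusted weights add up to the same total as the base weights, so the respondents in each class stand in for the nonrespondents of that class.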

The Family Income and Expenditure Survey (FIES) is an example of a survey which has

more than one round of data collection. The FIES is a nationwide survey of households

conducted every three years by the National Statistics Office (NSO), with two visits to

each sample unit per survey period, in order to provide information on the country’s income

distribution, spending patterns and poverty incidence. Like any other survey, FIES

encounters the problem of missing data, particularly the problem of nonresponse during

the second visit. Given the various contributions that this survey can provide, it is then

important to have precise estimates of the income and expenditure indicators.

With the 1997 FIES as the data set for this study, this paper will focus on dealing with

partial nonresponse through the use of imputation techniques. It aims to examine the

effects of imputed values in coming up with estimates for the missing data at various

nonresponse rates. Furthermore, the study aims to determine which imputation technique

is appropriate for the FIES data by applying some of the methods mentioned in the

study of the 1978 Research Panel Survey for the Income Survey Development

Program (ISDP) entitled Compensating for Missing Survey Data by Kalton (1983).

1.2 Statement of the Problem

This paper attempts to answer the following questions:

1. Which imputation technique is the most appropriate in handling partial nonresponse

for the FIES data?

2. How do varying nonresponse rates affect the results for each imputation method?

1.3 Objectives of the Study

The paper will attempt to achieve the following objectives:

1. To compare four imputation techniques, namely overall mean imputation, hot deck

imputation, deterministic regression imputation and stochastic regression imputation, in

compensating for partial nonresponse in the FIES.

2. To investigate the effect of the varying rates of missing observations, particularly the

effect of 10%, 20% and 30% nonresponse rates on the precision of the estimates.

1.4 Significance of the Study

Nonresponse is a common problem in conducting surveys. The presence of nonresponse

in surveys produces incomplete data, which could pose serious problems during

data analysis, particularly in the generation of statistically reliable estimates. For this

reason, imputation techniques are used to account for the difference between

respondents and nonrespondents. This then helps reduce nonresponse bias in the survey

estimates.

Since most statistical packages require the use of complete data before conducting any

procedure for data analysis, the use of imputation techniques can ensure consistency of

results across analyses, something that an incomplete data set cannot fully provide.

A news article by Obanil (2006) entitled Topmost Floor of the NSO Building gutted by

Fire, posted at Manila Bulletin Online, reported that on October 3, 2006, documents worth

around 1 Million Pesos were destroyed by fire. Among the documents gutted by the fire

was the first-visit questionnaire of the FIES for the NCR, which at the time of the fire

had not yet been encoded.

In terms of statistical research, most countries in the developed world, such as the United

States, Canada, the UK and the Netherlands, already employ imputation techniques in their

respective national statistical offices. In a country such as the Philippines, where data

collection is very difficult especially for some regions like the National Capital Region

(NCR), imputation will be able to ease the problem of data collection and nonresponse.

More importantly, given the great impact of this survey on the country, employing

imputation techniques will give statisticians a method for handling

nonresponse, which could lead to more meaningful generalizations about our country’s

income distribution, spending patterns and poverty incidence. Hence, estimates

with less bias and more consistent results can help our policymakers

and economists provide better solutions for improving the lives of Filipinos.

1.5 Scope and Limitations

Throughout this paper, only the data from the 1997 Family Income and Expenditure

Survey (FIES) will be used to tackle the problem of nonresponse and to examine the

impact of the different imputation methods applied to the dataset. With regard to the

extent to which these imputation methods will be applied and evaluated, this paper will

only cover the partial nonresponse occurring in the National Capital Region (NCR) since

NCR is noted as the region with the highest nonresponse rate. Also, the variables that will be

imputed for this study would be the Total Income (TOTIN2) and Total Expenditures

(TOTEX2) of the second visit of the FIES data.

The researchers will only focus on using the 1997 FIES data on the first visit to impute

the partial nonresponse that is present on the second visit. This paper also assumes that

the first visit data is complete and the pattern of nonresponse follows the Missing Completely

at Random (MCAR) case. The MCAR case happens if the probability of response to Y is

unrelated to the value of Y itself or to any other variables, making the missing data

randomly distributed across all cases (Musil et al., 2002). If the pattern of nonresponse

does not satisfy the MCAR assumption, imputation methods may not achieve their purpose.
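
One way to produce nonresponse that satisfies the MCAR assumption in a simulation is to delete values by simple random sampling, so that the chance of deletion does not depend on the value itself or on any other variable. The sketch below is illustrative only: the function name, the seed and the income values are made up, and it is not presented as the deletion procedure actually used in the study.

```python
import random


def delete_mcar(values, rate, seed=0):
    """Return a copy of `values` with a fraction `rate` of entries set to None.

    The entries to delete are drawn by simple random sampling without
    replacement, so every unit has the same chance of becoming a nonrespondent.
    """
    rng = random.Random(seed)
    n_missing = round(len(values) * rate)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


# Made-up second-visit incomes for 50 households.
incomes = [float(1000 + 10 * i) for i in range(50)]

# The three nonresponse rates considered in the study.
for rate in (0.10, 0.20, 0.30):
    with_gaps = delete_mcar(incomes, rate)
    print(rate, sum(v is None for v in with_gaps))
```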

As for the imputation techniques, only four imputation methods will be applied for this

paper namely: Overall Mean Imputation (OMI), Hot Deck Imputation (HDI),

Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI).

Other methods of handling nonresponse will not be covered in this paper.
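
The first two methods can be sketched as follows. This is a simplified illustration under stated assumptions: missing values are marked with None, and the hot deck draws a donor at random from the respondents in the same imputation class, which is one common variant and not necessarily the matching rule used in the study. The function names and the data are hypothetical.

```python
import random
from statistics import mean


def overall_mean_impute(y):
    """Replace every missing value (None) with the mean of all observed values."""
    fill = mean(v for v in y if v is not None)
    return [fill if v is None else v for v in y]


def hot_deck_impute(y, classes, seed=0):
    """Replace each missing value with the value of a randomly drawn respondent
    (donor) from the same imputation class.

    Assumes every class that contains a missing value has at least one donor.
    """
    rng = random.Random(seed)
    donors = {}
    for v, c in zip(y, classes):
        if v is not None:
            donors.setdefault(c, []).append(v)
    return [rng.choice(donors[c]) if v is None else v for v, c in zip(y, classes)]


# Made-up values in two imputation classes.
y = [100.0, None, 120.0, 400.0, None, 440.0]
classes = ["IC1", "IC1", "IC1", "IC2", "IC2", "IC2"]

omi = overall_mean_impute(y)
hdi = hot_deck_impute(y, classes)
```

Overall mean imputation gives every nonrespondent the same value regardless of class, whereas the hot deck keeps imputed values within the range observed in the nonrespondent's own class.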

The evaluation of the efficacy and appropriateness of the four imputation

methods will be limited to the following: (a) the Bias of the Mean of the Imputed

Data, (b) the Assessment of the Distributions of the Imputed vs. the Actual Data and (c) the

criteria mentioned in the report entitled Compensating for Missing Survey Data (Kalton, 1983),

namely the Mean Deviation, Mean Absolute Deviation and the Root Mean Square

Deviation.
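
The three criteria in (c) can be computed as in the sketch below, which assumes that each deviation is taken as the imputed value minus the actual value over the deleted cases only, so that a negative Mean Deviation indicates underestimation. The function name and the numbers are illustrative.

```python
from math import sqrt


def evaluation_criteria(imputed, actual):
    """Return (mean deviation, mean absolute deviation, root mean square deviation).

    `imputed` and `actual` hold the imputed and the true values of the
    deleted cases, in the same order.
    """
    errors = [i - a for i, a in zip(imputed, actual)]
    m = len(errors)
    md = sum(errors) / m                        # signed: shows the direction of the bias
    mad = sum(abs(e) for e in errors) / m       # average size of the errors
    rmsd = sqrt(sum(e * e for e in errors) / m) # penalizes large errors more heavily
    return md, mad, rmsd


# Made-up imputed and actual values for three deleted cases.
md, mad, rmsd = evaluation_criteria([90.0, 110.0, 100.0], [100.0, 100.0, 130.0])
```

The Mean Deviation lets positive and negative errors cancel, so it measures bias, while the other two criteria measure how closely the individual deleted values are reconstructed.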

Chapter 2

Review of Related Literature

Much research effort has been devoted to the efficacy of various imputation methods. In

the report entitled Compensating for Missing Survey Data, two simulation studies were

carried out using the data in the 1978 Income Survey Development Program (ISDP)

Research Panel to compare some imputation methods. The first study compared

imputation methods for the variable Hourly Rate of Pay while the second dealt with the

imputation of the variable Quarterly Earnings. For both studies, the author stratified the

data into its imputation classes, constructed data sets with missing values by randomly

deleting some of the recorded values in the original dataset and then applied the various

imputation methods to fill in the missing values. This process was replicated ten times to

ensure consistency of the results. Once the imputation methods had been applied, the

three measures for evaluating the effectiveness of imputation methods namely the Mean

Deviation, Mean Absolute Deviation and the Root Mean Square Deviation were obtained

and averaged across the ten trials (Kalton, 1983).
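
The simulation design described above (delete recorded values at random, impute, evaluate, repeat ten times and average) can be sketched as follows. The sketch uses class mean imputation and the Mean Deviation only, on randomly generated data; the class labels, sample sizes, deletion rate and seed are arbitrary assumptions and do not reproduce Kalton's data or results.

```python
import random
from statistics import mean


def class_mean_impute(y, classes):
    """Fill each missing value (None) with the respondent mean of its imputation class."""
    observed = {}
    for v, c in zip(y, classes):
        if v is not None:
            observed.setdefault(c, []).append(v)
    class_means = {c: mean(vals) for c, vals in observed.items()}
    return [class_means[c] if v is None else v for v, c in zip(y, classes)]


def one_trial(y, classes, rate, rng):
    """Delete a random fraction of the values, impute them, and return the
    mean deviation (imputed minus actual) over the deleted cases."""
    deleted = rng.sample(range(len(y)), round(len(y) * rate))
    with_gaps = list(y)
    for i in deleted:
        with_gaps[i] = None
    imputed = class_mean_impute(with_gaps, classes)
    return mean(imputed[i] - y[i] for i in deleted)


rng = random.Random(1983)  # arbitrary seed, for reproducibility only
classes = ["low"] * 30 + ["high"] * 30
y = [rng.gauss(10, 2) for _ in range(30)] + [rng.gauss(25, 4) for _ in range(30)]

# Ten replications, averaged as in the design described above.
trial_results = [one_trial(y, classes, 0.2, rng) for _ in range(10)]
average_md = mean(trial_results)
```

Because the true values of the deleted cases are known, each replication gives an exact measure of the imputation error, and averaging over replications reduces the influence of any single random deletion pattern.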

For the first study of imputing the variable Hourly Rate of Pay, eight methods were used

namely the Grand Mean Imputation (GM), the Class Mean Imputation using eight

imputation classes (CM8), the Class Mean Imputation using ten imputation classes

(CM10), Random Imputation with eight imputation classes (RC8), Random Imputation

with ten imputation classes (RC10), Multiple Regression Imputation (MI), Multiple

Regression Imputation plus a random residual chosen from a normal distribution (MN)

and Multiple Regression Imputation plus a randomly chosen respondent residual (MR).

Using the Mean Deviation criterion, the results showed that all mean deviations were

negative, indicating that the imputed values underestimated the actual values. Moreover,

the results show that the Grand Mean Imputation (GM) has the greatest underestimation

among the eight procedures. Meanwhile, for the Mean Absolute Deviation and Root Mean

Square Deviation, which measure the ability to reconstruct the deleted value, the results

showed that the Grand Mean Imputation fared the worst for both criteria. In addition, it

also showed that the Multiple Regression Imputation (MI) obtained the best measures for

the two criteria and that the procedures with a greater number of imputation classes (i.e.,

CM8 vs. CM10, RC8 vs. RC10) yield slightly better results for the two criteria (Kalton,

1983).

For the second study, which is the imputation of Quarterly Earnings, ten imputation

procedures were used. These are the Grand Mean Imputation (GM), the Class Mean

Imputation using eight imputation classes (CM8), the Class Mean Imputation using

twelve imputation classes (CM12), Random Imputation with eight imputation classes

(RC8), Random Imputation with twelve imputation classes (RC12), Multiple Regression

Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a

normal distribution (MN), Multiple Regression Imputation plus a randomly chosen

respondent residual (MR), Mixed Deductive and Random Imputation using eight

imputation classes (DI8) and Mixed Deductive and Random Imputation using twelve

imputation classes (DI12). Using the first criterion, the Mean Deviation, the results showed

that the Grand Mean (GM) obtained a positive bias. This implied that the grand mean

imputation is not an effective imputation method for the study. The results also showed

that the regression imputation procedures have very similar results, producing almost

unbiased estimates. In addition, the Class Mean Imputation methods (CM8 and CM12)

have measures similar to those of the Random Imputation methods. Nevertheless, all

methods have produced relatively small mean deviations except for the last two methods.

Comparing the Mean Absolute Deviations and the Root Mean Square Deviations, the

results show that the Grand Mean Imputation obtained values similar to the regression

procedures with residuals (i.e. Multiple Regression Imputation plus a random residual

chosen from a normal distribution or MN, Multiple Regression Imputation plus a

randomly chosen respondent residual or MR). The results also show that the values for the

RC8, RC12, MN and MR procedures are over one third larger than those for deterministic

procedures such as the CM8, CM12 and MI procedures (Kalton, 1983).

To further investigate the relatively larger biases of the DI8 and DI12 procedures, the author

further divided the data into deductive and non-deductive cases. This shed further

light on the Mean Deviations and Mean Absolute Deviations of the various imputation

methods. It was found that the mean deviations are positive in the deductive case and

negative in the non-deductive case for all of the procedures. This then explains why

there are relatively small deviations in the previous results since the measures between

the cases tend to cancel out. It also showed that the DI8 and DI12 results are similar to

those of the RC8, RC12, CM8 and CM12 in the non-deductive cases but are largely

different in the deductive cases. This explains the larger values of DI8 and DI12 in the

previous results (Kalton, 1983).

The two studies showed that the imputation procedures tend to

overestimate the Hourly Rate of Pay and underestimate the Quarterly Earnings.

Moreover, they showed that mean imputation appears to be the weakest imputation

method in both studies since it distorted the distribution of the original data.

Lastly, Kalton’s study shows that increasing the number of imputation classes

yields better values for the three criteria.

In contrast to Kalton’s criteria for measuring the performance of imputation procedures, the

paper entitled A Comparison of Imputation Techniques for Missing Data by C. Musil, C.

Warner, P. Yobas and S. Jones presented a much simpler approach to

evaluating the performance of imputation techniques: computing the means, standard

deviations and correlation coefficients, then comparing the statistics of the original data

with the statistics obtained from five methods, namely Listwise Deletion, Mean

Imputation, Deterministic Regression, Stochastic Regression and the EM Method. The

Expectation Maximization (EM) Method is an iterative procedure that generates missing

values by using expectation (E-step) and maximization (M-step) algorithms. The E-step

calculates expected values based on all complete data points while the M-step replaces

the missing values with E-step generated values and then recomputes new expected

values (Musil et al., 2002).

Using the Center for Epidemiological Studies data on stress and health ratings of older

adults, the authors imputed a single variable namely the functional health rating. Of the

492 cases, 20% of the cases were deleted in an effort to maximize the effects of each imputation

method. Except for the Listwise Deletion and Mean Imputation, the researchers used the

SPSS Missing Value Analysis function for the Deterministic Regression, Stochastic

Regression and EM Method. For the correlations, the researchers obtained the correlation

values of the imputed variable, for the original data and for the five methods, with the

variables age, gender and self-assessed health rating (Musil et al., 2002). The results show

that, comparing the mean of the original data with those of the five methods, all methods

underestimated the mean. The closest to the original data was the Stochastic Regression,

followed very closely by EM Method, Deterministic Regression, Listwise Deletion and

Mean Imputation. The same results also hold for the standard deviations. For the

correlations, however, the EM Method produced the closest correlation values to the

original data followed closely by the Stochastic Regression, Deterministic Regression,

Listwise Deletion and Mean Imputation. Hence, the findings suggest that the Stochastic

Regression and EM Method performed better while the Mean Imputation is the least

effective (Musil et al., 2002).

Chapter 4

Methodology

p.43

4.4.3 Deterministic and Stochastic Regression Imputation

Deterministic Regression Imputation (DRI) is a procedure that involves the generation of

a Least Square Regression Equation where Y …..

…. With an additional procedure of adding an error term to the predicted value in order

to generate imputed values for missing data.
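
The difference between the two regression procedures can be sketched as follows, assuming a single auxiliary variable (the first-visit value) and, for the stochastic version, an error term drawn from a normal distribution with the residual standard deviation of the fitted line. The normal error term is an assumption of this sketch, as are the function names and the data; the thesis text does not specify them here.

```python
import random
from math import sqrt
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    residual_sd = sqrt(sum(e * e for e in residuals) / (len(x) - 2))
    return a, b, residual_sd


def regression_impute(x, y, stochastic=False, seed=0):
    """Impute missing y values (None) from x using a line fitted on the respondents.

    Deterministic: the imputed value is the predicted value.
    Stochastic: a random error term is added to the predicted value.
    """
    rng = random.Random(seed)
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b, sd = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    out = []
    for xi, yi in zip(x, y):
        if yi is None:
            pred = a + b * xi
            yi = pred + rng.gauss(0.0, sd) if stochastic else pred
        out.append(yi)
    return out


# Made-up first-visit and second-visit values; the third household is a nonrespondent.
first_visit = [10.0, 20.0, 30.0, 40.0, 50.0]
second_visit = [12.0, 19.0, None, 41.0, 52.0]

dri = regression_impute(first_visit, second_visit)
sri = regression_impute(first_visit, second_visit, stochastic=True)
```

The deterministic version places every imputed value exactly on the fitted line, which understates the spread of the data; the added error term in the stochastic version is intended to restore some of that spread.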

p.44

The diagnostic checking requires the fitted model to satisfy the following assumptions:

p.45

refer to the errata sheet and remove the discussion of the variance

change title to: Bias of the mean of the imputed data

Chapter 5: Results and Discussion

p. 54

Table 3: Chi-Square Test of Independence for the Matching Variable

p.55

Table 4: Measures of Association for Matching Variable

Test      Phi-Coefficient     Cramer's V          Contingency
          CODIN1    CODEX1    CODIN1    CODEX1    CODIN1    CODEX1
PROV      0.192     0.183     0.111     0.105     0.188     0.18
CODES1    0.386     0.408     0.273     0.288     0.36      0.378
CODEP1    0.295     0.216     0.17      0.125     0.283     0.211
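
The three measures reported in Table 4 are all functions of the chi-square statistic of the corresponding contingency table. The sketch below uses the standard textbook definitions; the function name and the input numbers are illustrative and are not the values behind Table 4.

```python
from math import sqrt


def association_measures(chi_square, n, n_rows, n_cols):
    """Return (phi coefficient, Cramer's V, contingency coefficient).

    chi_square: Pearson chi-square statistic of the contingency table
    n: total number of observations
    n_rows, n_cols: dimensions of the contingency table
    """
    phi = sqrt(chi_square / n)
    cramers_v = sqrt(chi_square / (n * min(n_rows - 1, n_cols - 1)))
    contingency = sqrt(chi_square / (chi_square + n))
    return phi, cramers_v, contingency


# Made-up example: chi-square of 40 from a 3 x 5 table with 1000 observations.
phi, v, c = association_measures(chi_square=40.0, n=1000, n_rows=3, n_cols=5)
```

All three measures increase with the strength of the association, so a candidate matching variable with larger values is more closely related to the income and expenditure codes.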

p.56

(changed font and font size)

Descriptive Statistics

VI        IC     Mean        Minimum     Maximum    Std. Dev    Valid n
TOTIN2    IC1    93588.32    9067.000    1340900    75619.52    2635
          IC2    186940.9    14490.00    4215480    281852.3    1434
          IC3    643191.2    54790.00    4357180    829409.3    61
TOTEX2    IC1    74866.68    9025.000    731937.0   47517.69    2635
          IC2    135510.8    13575.00    3203978    151984.3    1434
          IC3    413184.0    40505.00    2726603    532577.1    61

p.57 (edited, changed font and font size)

VI        NRR    Observations retained    Observations deleted
                 n       Mean             n       Mean
TOTEX2    10%    3717    102748.610       413     99160.235
          20%    3304    102219.791       826     103069.697
          30%    2891    100709.947       1239    106309.365
TOTIN2    10%    3717    134821.662       413     127799.121
          20%    3304    133624.722       826     136098.155
          30%    2891    130685.596       1239    142131.636

Table 7: Model Fitted Results
