The estimation strategy of the National Household Survey (NHS). François Verret, Mike Bankier, Wesley Benjamin & Lisa Hayden, Statistics Canada. Presentation at the ITSEW 2011, June 21, 2011.
The estimation strategy of the National Household Survey (NHS)
François Verret, Mike Bankier, Wesley Benjamin & Lisa Hayden
Statistics Canada
Presentation at the ITSEW 2011
June 21, 2011
Outline of the presentation
1. Introduction
2. Handling non-response error
3. Simulation set-up
4. Results
5. Limits of the study
6. Conclusion
7. Future work
1. Introduction
2006 Census: 20% long form, 80% short form
2011:
• 100% Census mandatory short form
• 30% sampled to voluntarily complete the NHS long form
Objectives of the long form: get data to plan, deliver and support government programs directed at target populations
Topics common to both 2011 forms: demography, family structure, language
Additional 2011 long form topics: education, ethnicity, income, immigration, mobility…
NHS sample size is 4.5 million dwellings (f = 30%)
1. Introduction
Non-response error in the NHS:
• Survey now voluntary => expect significant non-response
• To minimize the impact, after a fixed date restrict the collection efforts to a Non-Response Follow-Up (NRFU) random sub-sample
Set-up developed by Hansen & Hurwitz (1946)
1. Select 1st phase sample $s$ from population $U$
2. Non-response $s_{nr}$ observed in $s$
3. NRFU selected from $s_{nr}$
4. Response $NRFU_r$ and non-response $NRFU_{nr}$ observed in the NRFU (Hansen & Hurwitz assumed a 100% response rate)
[Diagram of the nested sets: $U$, $s$, $s_r$, $s_{nr}$, NRFU, $NRFU_r$, $NRFU_{nr}$]
1. Introduction
When 100% of the NRFU responds (as in Hansen and Hurwitz's original setting), the NRFU can be used to estimate the total in $s_{nr}$ without non-response bias
This is not the case in the NHS. However, focusing the collection efforts on the NRFU converts part of the non-response bias (that would be observed in the full $s_{nr}$) into sub-sampling error
2. Handling non-response error
The estimation method chosen to minimize the remaining non-response bias should have the following properties:
• As few bias assumptions as possible should be made
• The method should be simple to explain and to implement in production
Available micro-level auxiliary data to adjust for non-response:
• 2011 Census short form
• Tax data
Calibration: Agreement with Census totals is desirable from a user’s perspective
2. Handling non-response error
First class of contenders: Reweighting
• Usual method used to compensate for total non-response in social surveys
• The Hansen & Hurwitz estimator of a total
$$\hat{t}_{HH} = \sum_{k \in s_r} \frac{y_k}{\pi_{ak}} + \sum_{k \in NRFU} \frac{y_k}{\pi_{ak}\,\pi_{k \mid s_{nr}}}$$
is unbiased if 100% of the NRFU answers
When the assumption does not hold, we must model the last non-response mechanism/phase and reweight accordingly…
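The Hansen & Hurwitz estimator above can be sketched in a few lines. This is a minimal illustration, assuming that the first-phase inclusion probability and the NRFU sub-sampling probability within the first-phase non-respondents are known for every unit; the function name and the toy values are hypothetical and do not come from the presentation.

```python
def hansen_hurwitz_total(first_phase_resp, nrfu_resp):
    """Estimate a population total under two-phase sampling for non-response.

    first_phase_resp: list of (y, pi_a) pairs for first-phase respondents,
        where pi_a is the first-phase inclusion probability.
    nrfu_resp: list of (y, pi_a, pi_sub) triples for units followed up in the
        NRFU, where pi_sub is the sub-sampling probability within the
        first-phase non-respondents. Assumes every NRFU unit responds.
    """
    total = sum(y / pi_a for y, pi_a in first_phase_resp)
    total += sum(y / (pi_a * pi_sub) for y, pi_a, pi_sub in nrfu_resp)
    return total


# Hypothetical toy data: 30% first-phase sampling, 50% NRFU sub-sampling.
s_r = [(3.0, 0.3), (6.0, 0.3)]
nrfu = [(1.5, 0.3, 0.5), (4.5, 0.3, 0.5)]
t_hh = hansen_hurwitz_total(s_r, nrfu)
```

Each NRFU unit carries the product of both inverse probabilities, so it stands in for the first-phase non-respondents that were not followed up.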
2. Handling non-response error
Scores method:
• Model the probability of response with a logistic regression
• Form Response Homogeneity Groups (RHG) of respondents and non-respondents with similar predicted response probabilities
• Calculate the response rate in each RHG and assign these new predicted response probabilities to respondents
• Divide the $NRFU_r$ weights by this probability:
$$\hat{t}_{scores} = \sum_{k \in s_r} \frac{y_k}{\pi_{ak}} + \sum_{k \in NRFU_r} \frac{y_k}{\pi_{ak}\,\pi_{k \mid s_{nr}}\,\hat{p}_k^{RHG}}$$
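The RHG steps of the scores method can be sketched as follows. This is only an illustration under stated assumptions: the predicted response probabilities are taken as given (for instance from a logistic regression that is not shown), and the groups are formed as equal-size bins of the ranked predictions, which is a choice made here and not something the slides specify. All names and values are hypothetical.

```python
def rhg_response_rates(pred_probs, responded, n_groups):
    """Return, for each unit, the observed response rate of its RHG.

    Units are ranked by predicted response probability and cut into
    n_groups bins of (nearly) equal size; this binning rule is an assumption.
    """
    order = sorted(range(len(pred_probs)), key=lambda i: pred_probs[i])
    size = -(-len(order) // n_groups)  # ceiling division
    rates = [0.0] * len(pred_probs)
    for start in range(0, len(order), size):
        group = order[start:start + size]
        rate = sum(responded[i] for i in group) / len(group)
        for i in group:
            rates[i] = rate
    return rates


def scores_weights(base_weights, pred_probs, responded, n_groups):
    """Divide respondents' weights by their RHG response rate.

    Non-respondents get weight 0. A respondent's group always has a
    positive response rate because it contains that respondent.
    """
    rates = rhg_response_rates(pred_probs, responded, n_groups)
    return [w / r if resp else 0.0
            for w, r, resp in zip(base_weights, rates, responded)]


# Hypothetical toy NRFU sub-sample of four units.
adjusted = scores_weights(
    base_weights=[10.0, 10.0, 10.0, 10.0],
    pred_probs=[0.2, 0.3, 0.8, 0.9],
    responded=[0, 1, 1, 1],
    n_groups=2,
)
```

In this toy case the respondent in the low-probability group has its weight doubled, so the adjusted weights still sum to the original total of 40.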
2. Handling non-response error
Second class of contenders: Imputation
• Usual method to compensate for item non-response
• We will consider nearest-neighbour imputation using the CANadian Census Edit & Imputation System (CANCEIS) only
1. Partial imputation: Impute only non-respondents to the subsample ($NRFU_{nr}$) and use reweighting to take sampling into account
$$\hat{t}_{partial} = \sum_{k \in s_r} \frac{y_k}{\pi_{ak}} + \sum_{k \in NRFU_r} \frac{y_k}{\pi_{ak}\,\pi_{k \mid s_{nr}}} + \sum_{k \in NRFU_{nr}} \frac{\hat{y}_k}{\pi_{ak}\,\pi_{k \mid s_{nr}}}$$
2. Mass imputation: Impute all non-respondents ($s_{nr} \setminus NRFU_r$)
$$\hat{t}_{mass} = \sum_{k \in s_r \cup NRFU_r} \frac{y_k}{\pi_{ak}} + \sum_{k \in s_{nr} \cap NRFU_r^{c}} \frac{\hat{y}_k}{\pi_{ak}}$$
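The difference between the two imputation estimators is mostly a matter of which weights are used. The sketch below assumes the imputed values have already been produced by some imputation step (not shown) and that the first-phase and sub-sampling probabilities are known; function names and toy values are hypothetical.

```python
def partial_imputation_total(s_r, nrfu_r, nrfu_nr_imputed):
    """Partial imputation: only NRFU non-respondents are imputed.

    s_r: (y, pi_a) pairs for first-phase respondents.
    nrfu_r: (y, pi_a, pi_sub) triples for NRFU respondents.
    nrfu_nr_imputed: (y_hat, pi_a, pi_sub) triples for NRFU non-respondents.
    The sub-sampling probability pi_sub is kept in the weights.
    """
    total = sum(y / pi_a for y, pi_a in s_r)
    total += sum(y / (pi_a * pi_sub) for y, pi_a, pi_sub in nrfu_r)
    total += sum(y / (pi_a * pi_sub) for y, pi_a, pi_sub in nrfu_nr_imputed)
    return total


def mass_imputation_total(observed, imputed):
    """Mass imputation: every first-phase unit without a response is imputed.

    observed: (y, pi_a) pairs for first-phase and NRFU respondents.
    imputed: (y_hat, pi_a) pairs for all remaining first-phase units.
    Only first-phase weights are used; the sub-sampling design is ignored.
    """
    return (sum(y / pi_a for y, pi_a in observed)
            + sum(y / pi_a for y, pi_a in imputed))


# Hypothetical toy values with pi_a = 0.5 and pi_sub = 0.5.
t_partial = partial_imputation_total(
    s_r=[(2.0, 0.5)],
    nrfu_r=[(1.0, 0.5, 0.5)],
    nrfu_nr_imputed=[(3.0, 0.5, 0.5)],
)
t_mass = mass_imputation_total(
    observed=[(2.0, 0.5), (1.0, 0.5)],
    imputed=[(3.0, 0.5), (2.0, 0.5)],
)
```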
2. Handling non-response error
Some pros & cons
Methods: Scores, Partial imputation, Mass imputation
• Preserves micro-level information of non-respondents: √ √√
• Does not create synthetic information: √√ √
• Uses less heavy non-response hypotheses: √√ √√
• Fully takes sub-sampling design into account: √√ √√
• Census systems available: √√ √√
• More calibration to known Census totals can be done: √ √√
3. Simulation set-up
Use 2006 Census 20% long form sample data
Restricted to Census Metropolitan Area (CMA) of Toronto
Simulation aimed at preserving the properties of the NHS (except for the f = 30%):
• Non-response to the 1st phase was simulated by deterministically blanking out the data of the 63% of respondents who answered last in 2006
• Of these non-respondents, the 78% who answered first will have their response restored if they are selected in the NRFU sub-sample
• NRFU sub-sampling was simulated by selecting a stratified random sample of 41% of $s_{nr}$
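The three simulation steps can be sketched as below. This is a simplified illustration, not the production set-up: it assumes units are already sorted from earliest to latest 2006 respondent, and it draws a simple random sub-sample where the presentation used a stratified one. The function name, parameter names and seed are hypothetical.

```python
import random


def simulate_nonresponse(response_order, late_share=0.63,
                         convertible_share=0.78, sub_rate=0.41, seed=0):
    """Split units into first-phase respondents, NRFU respondents and
    NRFU non-respondents.

    response_order: unit ids sorted from earliest to latest respondent.
    """
    n = len(response_order)
    n_late = round(late_share * n)
    s_r = response_order[:n - n_late]
    s_nr = response_order[n - n_late:]  # latest respondents, blanked out

    # Among the blanked-out units, the earliest ones can be "converted".
    n_conv = round(convertible_share * len(s_nr))
    convertible = set(s_nr[:n_conv])

    # Simplification: simple random sampling instead of stratified sampling.
    rng = random.Random(seed)
    nrfu = rng.sample(s_nr, round(sub_rate * len(s_nr)))

    nrfu_r = [k for k in nrfu if k in convertible]
    nrfu_nr = [k for k in nrfu if k not in convertible]
    return s_r, nrfu_r, nrfu_nr


s_r, nrfu_r, nrfu_nr = simulate_nonresponse(list(range(1000)))
```

Because the first two steps are deterministic, only the sub-sample draw changes between seeds.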
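3. Simulation set-up
Estimators calculated
• As points of reference, unbiased estimators:
$$\hat{t}_{2006} = \sum_{k \in s} \frac{y_k}{\pi_{ak}}$$
$$\hat{t}_{HH} = \sum_{k \in s_r} \frac{y_k}{\pi_{ak}} + \sum_{k \in NRFU} \frac{y_k}{\pi_{ak}\,\pi_{k \mid s_{nr}}}$$
• As contenders:
$$\hat{t}_{mass} = \sum_{k \in s_r \cup NRFU_r} \frac{y_k}{\pi_{ak}} + \sum_{k \in s_{nr} \cap NRFU_r^{c}} \frac{\hat{y}_k}{\pi_{ak}}$$
$$\hat{t}_{partial} = \sum_{k \in s_r} \frac{y_k}{\pi_{ak}} + \sum_{k \in NRFU_r} \frac{y_k}{\pi_{ak}\,\pi_{k \mid s_{nr}}} + \sum_{k \in NRFU_{nr}} \frac{\hat{y}_k}{\pi_{ak}\,\pi_{k \mid s_{nr}}}$$
$$\hat{t}_{scores} = \sum_{k \in s_r} \frac{y_k}{\pi_{ak}} + \sum_{k \in NRFU_r} \frac{y_k}{\pi_{ak}\,\pi_{k \mid s_{nr}}\,\hat{p}_k^{RHG}}$$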
3. Simulation set-up
The scores method
• A single logistic regression was done for the whole CMA of Toronto
• Household response probability was predicted
• Considered for stepwise selection: household-level variables, our best attempt at summarizing the person-level information and one paradata variable
• R-square of 26%
• 13 RHG formed with predicted probabilities ranging from 29% to 95%
3. Simulation set-up
Imputation methods
• Nearest-neighbour imputation done with CANCEIS
• RHG is defined by household size
• The distance between non-respondents and donors (respondents) is defined by weighting each household-level, person-level and paradata characteristic in the distance function
• Preference is given to donors who are geographically close
• For each non-respondent, a list of donors is made and one is randomly selected with probability proportional to a measure of size (1st phase weight for mass imputation, score method weights for partial imputation)
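The donor selection step can be sketched as follows. This is not CANCEIS itself: the weighted absolute-difference distance, the fixed-length candidate list and all names below are assumptions made only to illustrate drawing one donor with probability proportional to a measure of size.

```python
import random


def weighted_distance(a, b, var_weights):
    """Weighted sum of absolute differences between two units'
    matching variables (a hypothetical stand-in for the real distance)."""
    return sum(w * abs(x - y) for x, y, w in zip(a, b, var_weights))


def select_donor(recipient, donors, var_weights, n_candidates, rng):
    """Keep the n_candidates nearest donors, then draw one of them with
    probability proportional to its 'size'.

    donors: list of dicts with keys 'id', 'x' (matching variables), 'size'.
    """
    ranked = sorted(
        donors,
        key=lambda d: weighted_distance(recipient, d["x"], var_weights),
    )
    candidates = ranked[:n_candidates]
    sizes = [d["size"] for d in candidates]
    return rng.choices(candidates, weights=sizes, k=1)[0]


# Hypothetical donors; 'size' would be the first-phase weight for mass
# imputation or the scores-method weight for partial imputation.
donors = [
    {"id": "a", "x": (1.0, 0.0), "size": 2.0},
    {"id": "b", "x": (5.0, 1.0), "size": 1.0},
    {"id": "c", "x": (1.5, 0.0), "size": 3.0},
]
rng = random.Random(42)
chosen = select_donor((1.0, 0.0), donors, (1.0, 2.0), n_candidates=2, rng=rng)
```

A geographic preference could be expressed by adding a location variable with a large weight to the distance, which is again an assumption about how such a preference might be encoded.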
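3. Simulation set-up
M=84 non short form characteristics over the various topics
Average relative difference:
• Calculated at the CMA level:
$$\frac{100}{M} \sum_{j=1}^{M} \frac{\left|\hat{t}_{j} - \hat{t}_{2006,j}\right|}{\hat{t}_{2006,j}} \qquad \frac{100}{M} \sum_{j=1}^{M} \frac{\left|\hat{t}_{j} - \hat{t}_{HH,j}\right|}{\hat{t}_{HH,j}}$$
• At the Weighting area (953 WA in total) level within the CMA:
$$\frac{100}{953\,M} \sum_{i=1}^{953} \sum_{j=1}^{M} \frac{\left|\hat{t}_{ij} - \hat{t}_{2006,ij}\right|}{\hat{t}_{2006,ij}} \qquad \frac{100}{953\,M} \sum_{i=1}^{953} \sum_{j=1}^{M} \frac{\left|\hat{t}_{ij} - \hat{t}_{HH,ij}\right|}{\hat{t}_{HH,ij}}$$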
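The average relative difference can be computed as in the sketch below. It assumes the differences are taken in absolute value, which is a reading of the slide and not something the extracted text confirms; function names and toy values are hypothetical.

```python
def average_relative_difference(estimates, references):
    """Mean over characteristics of 100 * |estimate - reference| / reference.

    Assumes absolute differences and non-zero reference totals.
    """
    m = len(estimates)
    return 100.0 / m * sum(abs(t - r) / r
                           for t, r in zip(estimates, references))


def average_relative_difference_by_area(est_by_area, ref_by_area):
    """Average the per-area measure over areas (e.g. weighting areas).

    With the same number of characteristics in every area, this equals
    the double sum divided by (number of areas * M).
    """
    values = [average_relative_difference(e, r)
              for e, r in zip(est_by_area, ref_by_area)]
    return sum(values) / len(values)


# Hypothetical totals for two characteristics, then for two areas.
cma_level = average_relative_difference([110.0, 90.0], [100.0, 100.0])
wa_level = average_relative_difference_by_area(
    [[110.0, 90.0], [100.0, 100.0]],
    [[100.0, 100.0], [100.0, 100.0]],
)
```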
4. Results
Errors at the CMA and WA levels for Toronto

| Estimator | CMA, vs. full first-phase | CMA, vs. Hansen & Hurwitz | WA, vs. full first-phase | WA, vs. Hansen & Hurwitz |
|---|---|---|---|---|
| Hansen & Hurwitz estimator | 0.94 | 0.00 | 22.98 | 0.00 |
| Mass imputation | 2.97 | N/A | 24.56 | N/A |
| Partial imputation | 2.25 | 1.52 | 26.69 | 13.22 |
| Scores method | 2.03 | 1.45 | 26.77 | 18.67 |
5. Limits of the study
Results:
• The simulation only includes one replication of the sub-sampling and non-response mechanisms
• Non-response bias is the measure of interest, but errors were presented
• Non-response mechanisms were generated deterministically. Should they be generated probabilistically?
• The 2011 sampling, non-response and available data (e.g., paradata) cannot be replicated exactly
• Only totals studied. What about other parameters such as correlations?
5. Limits of the study
Possible confounding effects:
• Logistic regression was done at the aggregated level of the CMA and no WA effect or interaction was considered
• Paradata for imputation is more closely related to the non-response mechanism (preference is given to late respondents in the distance)
• Weighting of donors in imputation has an impact
• Calibration done from sample to U; calibration at inner levels/phases could help scores and partial imputation
6. Conclusion
With these preliminary results, it seems the scores method is doing well at aggregate levels, while partial imputation is doing better than scores at finer levels
• Mass imputation: Can you override the known sub-sample design with an imputation model?
• Partial imputation: Can include more information (person-level, paradata) than scores, but weighting of each component in the distance is partially data driven and not straightforward
• Scores method: More difficult to include the information, but variable selection to explain non-response is direct
7. Future work
Possible:
• Replicate sub-sampling and imputation more than once to isolate bias components
• Consider other levels of calibration in the comparisons
• Hybrid of scores and partial imputation
Definite:
• Implement a method into NHS production
• Estimate the errors and variances (multi-phase, large sampling fractions, errors due to modeling,…) and educate data users
Important to get a good model for the last non-response mechanism. Whatever the method, quality of the results is a function of the auxiliary data available.
For more information, please contact:
François Verret - SSMD/DMES [email protected]
(613) 951-7318