the u.s. census bureau mail return rate challenge ...€¦ · winning model and htc score software...

17
The U.S. Census Bureau Mail Return Rate Challenge: Crowdsourcing to Develop a Hard-to-Count Score Chandra Erdman and Nancy Bates U.S. Census Bureau Federal Committee on Statistical Methodology Washington, DC November 5, 2013 Disclaimer: The views expressed on statistical issues are those of the authors only.

Upload: others

Post on 08-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

The U.S. Census Bureau Mail Return Rate Challenge:Crowdsourcing to Develop a Hard-to-Count Score

Chandra Erdman and Nancy BatesU.S. Census Bureau

Federal Committee on Statistical MethodologyWashington, DC

November 5, 2013

Disclaimer: The views expressed on statistical issues are those of the authors only.

Page 2: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

The U.S. Census Bureau Return Rate Challenge

“All you need is data and a question. Our data scientists willprovide the answer.”– Kaggle.com

Our research question: Which statistical model best predicts 2010Census mail return rates (block-group level)?

Our dataset: 2012 Census Planning Database (PDB)

Product: Updated model-based Hard-to-Count Score

Erdman & Bates (2013) Census Return Rate Challenge 2 / 17

Page 3: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Census Crowdsource Challenge

2009 America COMPETES Act

Contest ran August 31 - November 1, 2012

244 teams and individual competitors

Unanticipated challenges:Non US citizensUse of auxiliary datasets

Erdman & Bates (2013) Census Return Rate Challenge 3 / 17

Page 4: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Winning model and HTC score

Software developer from Maryland awarded top monetary prize(MSE=2.60)

Used random forests and gradient boosting

Model included 342 variables – many from sources external toCensus PDB

How to apply Challenge results toward new model-based HTC score?

Erdman & Bates (2013) Census Return Rate Challenge 4 / 17

Page 5: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

HTC-Related Studies

Bruce et al. (2001); Bruce and Robinson (2003)Original HTC score

Guterbock et al. (2006)Community attachment theory

Erdman et al. (2013)Interviewer performance stratification

Erdman & Bates (2013) Census Return Rate Challenge 5 / 17

Page 6: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Model Selection Criteria

1 Restrict to PDB predictors

2 Small number of predictor variables

3 High predictive value (adjusted R2)

4 Low mean square error

5 Model works for both tracts and block-groups

Additional consideration: to include or exclude race/ethnicitycomposition as predictors?

Erdman & Bates (2013) Census Return Rate Challenge 6 / 17

Page 7: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Winning Model Predictors

90 percent (308/342) of predictors from censusWhen ranked by relative influence, 24/25 top predictors from census

(Rank)

Rel

ativ

e In

fluen

ce

0 10 20 30 40 50

12

34 (1) Renter

(2) Ages 18−24

(3) Female head of household, no husband

Erdman & Bates (2013) Census Return Rate Challenge 7 / 17

Page 8: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Comparison of Predictors Across Studies

Table: Overlap Between Top 25 Predictors in Bame (2012) and Erdman et al.(2013)

Predictor Bame Erdman Bruce GuterbockRenter occupied units X X X XMarried family households X X X∗ XAges 65+ X X + XAges 18-24 X X − XCollege graduates X X − XMoved in 2005-2009 X X − XAges < 5 X X + XAges 5-17 X X + XVacant units X X XSingle unit structures X X X∗

Males X X −Non-Hispanic White X X +Persons per household X X −Population Density X X −Below poverty X X XHispanic X XNon-Hispanic Black X X

Erdman & Bates (2013) Census Return Rate Challenge 8 / 17

Page 9: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Comparison of Predictors Across Studies (Cont.)

Table: Remaining Top 25 Predictors from Bame (2012)

Predictor Bame Erdman Bruce GuterbockNot high school graduate X X XDifferent housing unit 1 year ago X X XRelated child < 6 X − XAges 25-44 X − XMedian household income X − XAges 45-64 X − XFemale head, no husband X −Single person households X −

Erdman & Bates (2013) Census Return Rate Challenge 9 / 17

Page 10: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Comparison of Predictors Across Studies (Cont.)

Table: Remaining Variables from Bruce et al. (2001)

Predictor Bame Erdman Bruce GuterbockPublic assistance XUnemployed − − XCrowded units XLinguistically isolated households XNo phone service X X

Erdman & Bates (2013) Census Return Rate Challenge 10 / 17

Page 11: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Model Fit Statistics

Table: Comparison of Model Fit Statistics Across Studies and Geographies

Block-group TractModel R-squared MSE R-squared MSETop 25 Bame (2012) including race 56.27 30.36 55.03 23.09Erdman et al. (2013) including race 56.18 30.37 55.33 22.84Top 25 Bame (2012) excluding race 55.58 30.85 54.52 23.27Erdman et al. (2013) excluding race 53.89 32.05 52.06 24.72Guterbock et al. (2006) 51.17 33.86 49.83 25.75Bruce et al. (2001) 45.66 37.99 45.78 28.21

Erdman & Bates (2013) Census Return Rate Challenge 11 / 17

Page 12: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Deciles of Return Rates for Block-Groups in DC

Fitted Actual

100%90%80%70%60%50%40%30%20%10%

Erdman & Bates (2013) Census Return Rate Challenge 12 / 17

Page 13: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Three HTC Block-Groups in DC

Columbia Heights: 43% Hispanic;

36% Other Language; 92% 10+ multi-

units; 64% non-family hhds; 85%

renters; 60% moved 5 years

Anacostia: 98% Black; 46% below

poverty; 89% single unit homes; 15%

non-family hhds; 21% moved 5 years;

93% renters

Trinidad: 37% Ages 18-24;

59% Moved 5 years; 33%

Below poverty; 28% Vacant;

55% Black; 31% white; 87%

renters

Erdman & Bates (2013) Census Return Rate Challenge 13 / 17

erdma004
Line
erdma004
Line
erdma004
Line
Page 14: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Considerations

Independent variable is mail response; 2020 Census willhave an Internet response option

“Single Unattached Mobiles” (Bates and Mulry, 2011)

64.7 percent of American Community Survey selfresponse by Internet (Baumgardner, 2013)

In January, 2013, ACS began asking about Internetconnectivity

Erdman & Bates (2013) Census Return Rate Challenge 14 / 17

Page 15: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Summary

Challenge was successful

Winning model was complex but predictors in rank order ofinfluence proved useful

Accurate predictions with relatively few predictors

Simple HTC score: model fits

First score at this level of geography

Useful for planning and targeted advertising

Erdman & Bates (2013) Census Return Rate Challenge 15 / 17

Page 16: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

References

Nancy Bates and Mary H. Mulry.Using geographic segmentation to understand, predict, and plan for census and survey mailnonresponse.Journal of Official Statistics, 27(4):601–618, 2011.

S. Baumgardner.Self-response check-in rate by mode and segmentation group – june 2013.Email communication with the author, U.S. Bureau of the Census. September 17, 2013.

Antonio Bruce and J. Gregory Robinson.Tract Level Planning Database with Census 2000 Data.U.S. Government Printing Office, Washington, DC, 2007.

Antonio Bruce, J. Gregory Robinson, and Monique V. Sanders.Hard-to-count scores and broad demographic groups associated with patterns of responserates in census 2000.In Proceedings of the Social Statistics Section. American Statistical Association, 2001.

Chandra Erdman, Tamara Adams, and Barbara C. O’Hare.Development of interviewer performance standards for national surveys.Draft Paper, 2013.

Thomas M. Guterbock, Ryan A. Hubbard, and Laura M. Hoilan.Community attachment as a predictor of survey non-response.Unpublished Paper, 2006.

Erdman & Bates (2013) Census Return Rate Challenge 16 / 17

Page 17: The U.S. Census Bureau Mail Return Rate Challenge ...€¦ · Winning model and HTC score Software developer from Maryland awarded top monetary prize (MSE=2.60) Used random forests

Contact

[email protected]

[email protected]

Erdman & Bates (2013) Census Return Rate Challenge 17 / 17