2014 planning database (pdb)
TRANSCRIPT
What is the Planning Database (PDB), and How Can I Use it?
April 7, 2016
Nancy Bates, Kathleen Kephart, Suzanne McArdle
Center for Survey Measurement
Acknowledgements
Thank you to
• Travis Pape and Julia Coombs for creating the code to generate the PDB
• Luke Larsen and Alina Kline for their work on the upcoming 2016 PDB
• Nancy Bates and Barb O’Hare for their time and effort to bring the PDB back
• Suzanne McArdle for her work on PDB data visualizations
Overview
• A “greatest hits” of ACS 5-year estimates and 2010 Census variables
• Pulls together publicly available estimates in one convenient file
• Available at two levels of geography: Tract and Block Group
• Publicly available in CSV and now API format
Background
First PDB developed for 2000 Census planning
• Selected 1990 Census tract data in easy-to-use format
• Hard-to-Count Score
ACS annual 5-year estimates for block groups resulted in revised PDB in 2012
2015 PDB
• Latest 5-year ACS estimates
• Health Insurance Coverage Estimates
• An API version of the data for developers
Contents of the 2015 PDB
Both 2009-2013 5-year ACS estimates and 2010 census data
Types of variables
• Population: gender, age, education, poverty
• Household: language, relationship, income
• Housing unit: tenure, number of units
• Census operational: mailout/mailback, bilingual
A Broad Scope of Uses
Useful for:
• Identifying areas with likely low survey response rates
• Stratifying small areas
• Creating thematic maps
• Enhancing reports with population metrics
• Creating applications
Access
Available on the Census Bureau’s Research @ Census page
• Link to the PDB CSV format: http://www.census.gov/research/data/planning_database/
• API format: www.census.gov/developers
• Documentation describing the files is available in PDF format
Navigation to the PDB CSV Format
From the Census Bureau internet site (http://www.census.gov):
1. Select “Our Research” from under the “About the Bureau” menu at the top of the page
2. Select the “Data” tab
3. Select the “Research Data Products” link
4. Select “Planning Database” under the “Demographic – People and Households” heading
5. Select the appropriate year under “Data and Documentation”
Navigation to the PDB API Format
From the Census Bureau internet site (http://www.census.gov):
1. Select “Data”
2. Select “Developers”
3. Select “Available APIs” from the sidebar
4. Scroll down and select “The 2015 Planning Database”
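
Once the API page is located, a request is just a URL with a list of variables and a geography. The sketch below only builds such a URL; it does not call the service. The base URL and the `for`/`in` geography parameters are assumptions based on how Census Bureau data APIs are generally addressed, and should be confirmed against the “Available APIs” page. `build_pdb_query` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed base URL for the block-group PDB endpoint; confirm it on the
# "Available APIs" page before relying on it.
PDB_BLOCKGROUP_URL = "https://api.census.gov/data/2015/pdb/blockgroup"


def build_pdb_query(variables, state_fips, county_fips, base_url=PDB_BLOCKGROUP_URL):
    """Build a query URL for `variables` across every block group in one county."""
    params = {
        "get": ",".join(variables),
        "for": "block group:*",
        "in": f"state:{state_fips} county:{county_fips}",
    }
    # Keep commas, colons and the wildcard readable instead of percent-encoding them.
    return f"{base_url}?{urlencode(params, safe=',:*')}"


# State 11, county 001 is the District of Columbia.
url = build_pdb_query(["GIDBG", "Males_CEN_2010"], state_fips="11", county_fips="001")
print(url)
```

The variable names used here (`GIDBG`, `Males_CEN_2010`) are the ones quoted elsewhere in these slides; an API key may also be required for heavier use.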
Managing the PDB
It’s a BIG dataset
• Block Group Level: 220,354 block groups × 344 variables = ~75.8 million cells
• Tract Level: 74,021 tracts × 566 variables = ~41.9 million cells
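
The cell counts are simply rows times columns. A quick check of the arithmetic, using only the figures quoted on this slide (`cell_count` is an illustrative helper, not part of any PDB tooling):

```python
def cell_count(rows, columns):
    """Number of cells in a rectangular table."""
    return rows * columns


block_group_cells = cell_count(220_354, 344)
tract_cells = cell_count(74_021, 566)

print(f"Block group file: {block_group_cells:,} cells")  # 75,801,776
print(f"Tract file:       {tract_cells:,} cells")        # 41,895,886
```

At this size it is usually worth loading only the columns needed for a given analysis rather than the whole file.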
The Structure
Geography Identifiers
• GIDBG (12 chars) = State (2 chars) + County (3 chars) + Tract (6 chars) + Block Group (1 char)
• GIDTR (11 chars) = State (2 chars) + County (3 chars) + Tract (6 chars)
Demographic, Socioeconomic, and Housing data
• Order of variables is consistent: Census data first, followed by ACS estimates and ACS MOEs.
• For example, Males_CEN_2010, Males_ACS_09_13, Males_ACSMOE_09_13
Census Operational data, including Mail Return Rate and Low Response Score
Percentages and MOE Percentages, listed in the same order as their respective estimates
• Variables are identified by ‘pct_’ added to their variable name.
• For example, pct_Males_CEN_2010, pct_Males_ACS_09_13, pct_Males_ACSMOE_09_13
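
Because the identifiers are fixed-width, they can be split by position. The following is a minimal sketch of that layout; the function and class names are hypothetical, the all-digits check is an assumption, and the example ID is constructed for illustration rather than taken from the PDB.

```python
from typing import NamedTuple


class BlockGroupId(NamedTuple):
    state: str        # 2 chars
    county: str       # 3 chars
    tract: str        # 6 chars
    block_group: str  # 1 char


def split_gidbg(gidbg: str) -> BlockGroupId:
    """Split a 12-character GIDBG into its component codes."""
    # Assumption: identifiers consist of digits only.
    if len(gidbg) != 12 or not gidbg.isdigit():
        raise ValueError(f"expected a 12-digit GIDBG, got {gidbg!r}")
    return BlockGroupId(gidbg[0:2], gidbg[2:5], gidbg[5:11], gidbg[11])


def gidtr_from_gidbg(gidbg: str) -> str:
    """A GIDTR is the first 11 characters of the GIDBG (everything but the block group)."""
    return split_gidbg(gidbg).state + gidbg[2:11]


parts = split_gidbg("110010001001")  # illustrative ID only
print(parts)
print(gidtr_from_gidbg("110010001001"))
```

Keep these identifiers as text when loading the CSV: reading them as numbers drops leading zeros for states whose codes begin with 0 and breaks the fixed-width layout.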
Low Response Score (Erdman and Bates slides)
Low Response Score for Use in Survey and Census Planning and Analysis
Chandra Erdman and Nancy Bates
U.S. Census Bureau
Disclaimer: The views expressed on statistical issues are those of the authors only.
Overview
1. The original Hard-to-Count (HTC) Score
2. The Census Kaggle Challenge
3. The Low Response Score (LRS)
Erdman & Bates (2014) Low Response Score (LRS)
The Original HTC Score
Bruce et al. (2001); Bruce and Robinson (2003)
1. Renter occupied units
2. Unmarried
3. Vacant units
4. Multi-unit structures
5. Below poverty
6. Not high school graduate
7. Different housing unit 1 year ago
8. Public assistance
9. Unemployed
10. Crowded units
11. Linguistically isolated households
12. No phone service
The Census Kaggle Challenge - 2012
“All you need is data and a question. Our data scientists will provide the answer.” – Kaggle.com
Data: 2012 Block-Group-Level Planning Database (PDB)
Question: Which statistical model best predicts 2010 Census mail return rates?
Product: Updated model-based “Hard-to-Count” Score
The Census Kaggle Challenge (Cont.)
2009 America COMPETES Act
Contest ran August 31 - November 1, 2012
244 teams and individual competitors
Software developer from MD won top prize
Winning Model Predictors
When ranked by relative influence, 24/25 top predictors from PDB
[Figure: relative influence (x-axis, 0 to 50) of the winning model’s predictors, by rank. Top three: (1) Renter; (2) Ages 18-24; (3) Female head of household, no husband]
Low Response Model (Block-Group)
R-squared: 56.10%, n = 217,417

Variable                      Coef    Sig
(Intercept)                   10.29   ***
Renter occupied units          1.08   ***
Ages 18-24                     0.64   ***
Female head, no husband        0.58   ***
Non-Hispanic White            -0.77   ***
Ages 65+                      -1.21   ***
Related child <6               0.46   ***
Males                          0.09   ***
Married family households     -0.12   ***
Ages 25-44                    -0.06
Vacant units                   1.08   ***
College graduates             -0.32   ***
Median household income        0.24   ***
Ages 45-64                    -0.08   *
Persons per household          3.44   ***
Moved in 2005-2009             0.09   ***
Hispanic                       0.41   ***
Single unit structures        -0.52   ***
Population Density            -0.40   ***
Below poverty                  0.11   ***
Different HU 1 year ago       -0.12   ***
Ages 5-17                      0.17   ***
Black                         -0.04   **
Single person households      -0.24   ***
Not high school grad          -0.06   ***
Median house value             0.71   ***

Sig: *** p < .001; ** .001 ≤ p < .01; * .01 ≤ p < .05
Distribution of the LRS
[Figure: histogram of the Low Response Score (x-axis, 0 to 50) by number of block groups (y-axis, 0 to 25,000)]
Rule of thumb…areas with LRS = >29 are hardest to count?
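
Applying a rule of thumb like this amounts to a threshold filter. The sketch below is one way to do it, with two assumptions stated plainly: the slide's “= >29” is read as strictly greater than 29 (pass a different `threshold` if 29 itself should count), and the function name is hypothetical. The three named scores are the DC examples quoted in these slides; the fourth entry is made up for contrast.

```python
def hardest_to_count(scores, threshold=29.0):
    """Return area names with a score above `threshold`, highest score first."""
    flagged = [name for name, lrs in scores.items() if lrs > threshold]
    return sorted(flagged, key=lambda name: scores[name], reverse=True)


lrs_by_area = {
    "Columbia Heights": 32,
    "Anacostia": 38,
    "Trinidad": 37,
    "Hypothetical low-LRS area": 18,  # made-up value, not from the PDB
}

print(hardest_to_count(lrs_by_area))
```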
LRS/PDB Example: Three HTC Blocks in DC
Columbia Heights: 43% Hispanic; 36% Other Language; 92% 10+ multi-units; 64% non-family hhds; 85% renters; 60% moved 5 years; LRS=32
Anacostia: 98% Black; 46% below poverty; 89% single unit homes; 15% non-family hhds; 21% moved 5 years; 93% renters; LRS=38
Trinidad: 37% Ages 18-24; 59% moved 5 years; 33% below poverty; 28% vacant; 55% Black; 31% White; 87% renters; LRS=37
Considerations
• Dependent variable is mail response; the 2020 Census will have an Internet response option
• “Single Unattached Mobiles” (Bates and Mulry, 2011)
• 64.7 percent of American Community Survey self-response is by Internet (Baumgardner, 2013)
In January, 2013, ACS began asking about Internet connectivity
Summary
• New “hard to count” metric for tracts and block groups
• Winning model was complex, but predictors in rank order of influence proved useful
• Accurate predictions with relatively few predictors
• Useful for planning and targeted advertising
• LRS updated yearly to reflect changes
• Develop mapping app populated with PDB and LRS?
Examples Using the PDB
Area Demographics
619,371 people live in 179 tracts in DC

                                              DC*      United States*
Male to female ratio                          0.90     0.97
Population under 5 years old                  5.9%     6.4%
Population that identifies as Hispanic        9.6%     16.6%
Population that moved within the past year    19.4%    15.1%
Population that was not born in the US        13.8%    12.9%

*ACS 5 year 2009-2013
Using Excel to Analyze Demographics
I used the Excel function SUM() on all DC tracts to find the total Census population
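
The same total can be computed outside Excel. The sketch below sums a population column over the tracts whose GIDTR begins with a given state code (11 for DC). It is only an illustration: the column name `Tot_Population_CEN_2010` is an assumption to be checked against the PDB documentation, and the rows in `SAMPLE_CSV` are made-up values, not real PDB records.

```python
import csv
import io

# Made-up rows for illustration; the real file has one row per tract.
SAMPLE_CSV = """GIDTR,Tot_Population_CEN_2010
11001000100,4890
11001000201,3916
24031700101,5200
"""


def total_population(csv_file, state_fips, id_column="GIDTR",
                     pop_column="Tot_Population_CEN_2010"):
    """Sum `pop_column` over rows whose tract ID starts with `state_fips`."""
    total = 0
    for row in csv.DictReader(csv_file):
        # Skip rows from other states and rows with a blank population cell.
        if row[id_column].startswith(state_fips) and row[pop_column]:
            total += int(row[pop_column])
    return total


dc_total = total_population(io.StringIO(SAMPLE_CSV), state_fips="11")
print(dc_total)  # 8806 for the sample rows above
```

With the real tract file, pass an open file handle instead of the `io.StringIO` sample. The prefix match relies on the IDs being stored as text with their leading zeros intact.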
2016 Census Test Harris County Texas Demographics
484,358 people live in 292 block groups in the test site

                                                               Houston*   United States*
Households where no one over 14 speaks English “very well”     14.8%      4.6%
Population 18-24 years old                                     9.4%       10.0%
Renter Occupied Units                                          60.9%      35.1%
Population 25 and over, with less than a HS diploma            19.1%      13.9%

*ACS 5 year 2009-2013 Estimate
Linguistic Isolation
• What if you want to identify areas that may need support for a language other than English?
• Find block groups in the area that have a high percentage of housing units where no one over the age of 14 speaks English “very well”
• What language is spoken in these tracts?
Linguistically Isolated BGs in 2016 Census Harris TX Test Site
Rank  BG       No one speaks English “very well”  Spanish        Asian/Pacific Islander  Other
1     4327012  81.4% (14.3)                       81.4% (14.0)   0% (2.1)                0% (2.1)
2     4330012  77.2% (13.4)                       73.4% (13.5)   3.8% (4.1)              0% (2.3)
3     4327011  72.5% (11.1)                       72.5% (10.9)   0% (1.6)                0% (1.6)
4     4335012  69.3% (10.9)                       66.1% (10.7)   0% (1.7)                3.2% (4.8)
5     5214001  69.3% (21.1)                       69.3% (20.6)   0% (3.7)                0% (3.7)
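
A ranking like the one in this table can be produced by sorting block groups on the percentage of interest. The sketch below is one possible approach, not the method used for the slide: `rank_by_share` and the dictionary key `pct_no_english_very_well` are hypothetical names, and the values are the five percentages shown in the table, entered in shuffled order.

```python
def rank_by_share(rows, key, top_n=5):
    """Return the `top_n` rows with the largest value for `key`, highest first."""
    return sorted(rows, key=lambda row: row[key], reverse=True)[:top_n]


# The five block groups from the table, deliberately out of order.
block_groups = [
    {"bg": "4327011", "pct_no_english_very_well": 72.5},
    {"bg": "4335012", "pct_no_english_very_well": 69.3},
    {"bg": "4327012", "pct_no_english_very_well": 81.4},
    {"bg": "5214001", "pct_no_english_very_well": 69.3},
    {"bg": "4330012", "pct_no_english_very_well": 77.2},
]

top_three = rank_by_share(block_groups, "pct_no_english_very_well", top_n=3)
for rank, row in enumerate(top_three, start=1):
    print(rank, row["bg"], f'{row["pct_no_english_very_well"]}%')
```

Ranks 4 and 5 in the table share the same estimate (69.3%), so their order depends on how ties are broken.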
JSM Govt Section Data Challenge
Tailoring Outreach to Boost Mail Self-Response in Geographic Areas with Similar Low Response Scores — Darryl Creel
Exploring the Census Bureau's 2014 Planning Database Using Topological Data Analysis — Robert Baskin
Informing Natural Disaster Response with Census Data — Jonathan Auerbach; Christopher Eshleman, New York City Council
Optimizing Survey Cost-Error Tradeoffs: A Multiple Imputation Strategy Using the Census Planning Database — Shin-Jung Lee, University of Michigan
Important Note
Why are there duplicate tracts and BGs in the PDB?
Short answer: Changes in geography since 2010