TRANSCRIPT
INFO 7470/ILRLE 7400 Statistical Tools:
Edit and Imputation
John M. Abowd and Lars Vilhuber, March 25, 2013
© John M. Abowd and Lars Vilhuber 2013, all rights reserved
Outline
• Why learn about edit and imputation procedures
• Formal models of edits and imputations
• Missing data overview
• Missing records
– Frame or census
– Survey
• Missing items
• Overview of different products
• Overview of methods
• Formal multiple imputation methods
• Examples
Why?
• Users of public-use data can normally identify the existence and consequences of edits and imputations, but don’t have access to data that would improve them
• Users of restricted-access data normally encounter raw files that require sophisticated edit and imputation procedures in order to use effectively
• Users of integrated (linked) data from multiple sources face these problems in their extreme form
Formal Edit and Imputation Models
• Original work on this subject can be found in Fellegi and Holt (1976)
• One of Fellegi’s many seminal contributions
• Recent work by Winkler (2008)
• Formal models distinguish between edits (based on expert judgments) and imputations (based on modeling)
Definition of “Edit”
• Checking each field (variable) of a data record to ensure that it contains a valid entry
– Examples: NAICS code in range; 0 < age < 120
• Checking the entries of specified fields (variables) to ensure that they are consistent with each other
– Examples: job creations – job destructions = accessions – separations; age = data_reference_date – birth_date
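Both kinds of checks are easy to express as predicate functions. A minimal sketch in Python; the field names and edit rules here are hypothetical illustrations, not any agency's production edits:

```python
# Illustrative validity and consistency edits. Field names and ranges are
# hypothetical, not any statistical agency's production edit rules.

def validity_edit(record):
    """Field-level edit: each variable must contain a valid entry."""
    failures = []
    if not (0 < record["age"] < 120):
        failures.append("age out of range")
    if not (111110 <= record["naics"] <= 928120):  # six-digit NAICS range
        failures.append("naics out of range")
    return failures

def consistency_edit(record):
    """Cross-field edit: specified variables must agree with each other."""
    failures = []
    if (record["job_creations"] - record["job_destructions"]
            != record["accessions"] - record["separations"]):
        failures.append("flow identity violated")
    if record["age"] != record["reference_year"] - record["birth_year"]:
        failures.append("age inconsistent with birth year")
    return failures

rec = {"age": 35, "naics": 541511, "birth_year": 1978, "reference_year": 2013,
       "job_creations": 10, "job_destructions": 4,
       "accessions": 12, "separations": 6}
print(validity_edit(rec), consistency_edit(rec))  # [] [] -- record passes
```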
Options When There Is an Edit Failure
1. Check original files (written questionnaires, census forms, original administrative records)
2. Contact reference entity (household, person, business, establishment)
3. Clerically “correct” data (using expert judgment, not model-based)
4. Automatic edit (using a specified algorithm)
5. Delete record from analysis/tabulation
Direct Edits Are Expensive
• Even when feasible, it is expensive to cross-check every edit failure with original source material
• It is extremely expensive to re-contact sources, although some re-contacts are built into data collection budgets
• Computer-assisted survey/census methods can do these edits at the time of the original data collection
• Re-contact or revisiting original source material is usually either infeasible or prohibited for administrative records in statistical use
Example: Age on the SIPP
• The SIPP collects both age and birth date (as do the decennial census and ACS)
• These are often inconsistent; an edit is applied to make them consistent
• When linked to SSA administrative record data, the birth date on the SSN application becomes available
• It is often inconsistent with the respondent report
• And both birth dates are often inconsistent with the observed retirement benefit status of a respondent (also in administrative data)
• Why?
The SSA Birth Date Is Respondent Provided until Benefits are Claimed
• Prior to 1986, most Americans received SSNs when they entered the labor force, not at birth as is now the case
• The birth date on SSN application records is “respondent” provided (and not edited by SSA)
• Further updates to the SSA “Numident” file, which is the master database of SSNs, are also “respondent provided”
• Only when a person claims age-dependent benefits (Social Security at 62 or Medicare at 65) does SSA get a birth certificate and apply a true edit to the birth date
Lesson
• Editing, even informal editing, always involves building models, even if they are difficult to lay out explicitly
• Formal edit models involve specifying all the logical relations among the variables (including impossibilities), then imposing them on a probability model like the ones we will consider in this lecture
• When an edit is made with probability 1 in such a system, the designers of the edit have declared that the expert’s prior judgment cannot be overturned by data
• In resolving the SIPP/SSA birth date example, the experts declared that since benefit recipiency was based on audited (via birth certificates) birth dates, it should be taken as “true;” all other values were edited to agree
– Note: users still don’t have enough information to apply this edit on receipt of SIPP data
– Doesn’t help for those too young to claim age-eligible benefits
Edits and Missing Data Overview
• Missing data are a constant feature of both sampling frames (derived from censuses) and surveys
• Two important types are distinguished
– Missing record (frame) or interview (survey)
– Missing item (in either context)
• Methods differ depending upon type
Missing Records: Frame or Census
• The problem of missing records in a census or sampling frame is detection
• By definition in these contexts the problem requires external information to solve
Census of Population and Housing
• Dress rehearsal Census
• Pre-census housing list review
• Census processing of housing units found on a block not present on the initial list
• Post-census evaluation survey
• Post-census coverage studies
Economic Censuses and the Business Register
• Discussed in lecture 4
• Start with tax records
• Unduplication in the Business Register
• Weekly updates
• Multi-units updated with Report of Organization Survey
• Multi-units discovered during the intercensal surveys are added to the BR
Missing Records: Survey
• Non-response in a survey is normally handled within the sample design
• Follow-up (up to a limit) to obtain interview/data
• Assessment of non-response within sample strata
• Adjustment of design weights to reflect non-responses
Missing and/or Inconsistent Items
• Edits during the interview (CAPI/CATI) and during post-processing of the survey
• Edit or imputation based on the other data in the interview/case (relational edit or imputation)
• Imputation based on related information on the same respondent (longitudinal edit or imputation)
• Imputation based on statistical modeling
– Hot deck
– Cold deck
– Multiple imputation
Census 2000 PUMS Missing Data
• Pre-edit: When the original entry was rejected because it fell outside the range of acceptable values
• Consistency: Edited or imputed missing characteristics based on other information recorded for the person or housing unit
• Hot Deck: Supplied the missing information from the record of another person or housing unit.
• Cold Deck: Supplied missing information from a predetermined distribution
• See allocation flags for details
CPS Missing Data
• Relational edit or imputation: use other information in the record to infer value (based on expert judgment)
• Longitudinal edits: use values from the previous month if present in sample (based on rules, usually)
• Hot deck: use values from actual respondents whose data are complete for the relatively few conditioning variables
County Business Patterns
• The County and Zip code Business Patterns data are published from the Employer Business Register
• This is important because variables used in these publications are edited to publication standards
• The primary imputation method is a longitudinal edit
• http://www.census.gov/econ/cbp/methodology.htm
Economic Censuses
• As with the demographic products, these files usually contain both edited and unedited versions of the publication variables
• Publication variables (e.g., payroll, employment, sales, geography, ownership) have been edited
• Most recent files include allocation flags to indicate that a publication variable has been edited or imputed
• Many historical files include variables that have been edited or imputed but do not include the flags
QWI Missing Data Procedures
• Individual data
– Multiple imputation
• Employer data
– Relational edit
– Bi-directional longitudinal edit
– Single-value imputation
• Job data
– Use multiple imputation of individual data
– Multiple imputation of place of work
• Use data for each place of work
BLS National Longitudinal Surveys
• Non-responses to the first wave never enter the data
• Non-responses to subsequent waves are coded as “interview missing”
• Respondents are not dropped for missing an interview. Special procedures are used to fill critical items from missed interviews when the respondent is interviewed again
• Item non-response is coded as such
Federal Reserve Survey of Consumer Finances (SCF)
• General information on the Survey of Consumer Finances: http://www.federalreserve.gov/pubs/oss/oss2/scfindex.html
• Missing data and confidentiality protection are handled with the same multiple imputation procedure
SCF Details
• Survey collects detailed wealth information from an over-sample of wealthy households
• Item refusals and item non-response are rampant (see Kennickell, 2011)
• When there is item refusal, the interview instrument attempts to get an interval
• The reported interval is used in the missing data imputation
• When the response is deemed sensitive enough for confidentiality protection, the response is treated as an item missing (using the same interval model as above)
• First major survey released with multiple imputation
Relational Edit or Imputation
• Uses information from the same respondent
• Example: respondent provided age but not birth date. Use age to impute birth date
• Example: some members of a household have missing race/ethnicity data. Use other members of the same household to impute race/ethnicity
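The age-to-birth-date example reduces to a short rule. A minimal sketch; the field names, reference year, and allocation flag are illustrative, not an actual agency specification:

```python
# Relational imputation sketch: derive a missing birth year from the same
# respondent's reported age. Field names and the reference year are
# hypothetical illustrations.

REFERENCE_YEAR = 2013

def impute_birth_year(record):
    """Fill birth_year from age when only age was reported."""
    if record.get("birth_year") is None and record.get("age") is not None:
        record["birth_year"] = REFERENCE_YEAR - record["age"]
        record["birth_year_allocated"] = True  # flag the value as imputed
    return record

print(impute_birth_year({"age": 40, "birth_year": None}))
# {'age': 40, 'birth_year': 1973, 'birth_year_allocated': True}
```

In practice such edits also set an allocation flag, as the public-use products described later do, so users can identify imputed values.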
Longitudinal Edit or Imputation
• Look at the respondent’s history in the data to get the value
• Example: respondent’s employment information missing this month. Impute employment information from previous month
• Example: establishment industry code missing this quarter. Impute industry code from most recently reported code
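The industry-code example amounts to a carry-forward rule over the respondent's history. A minimal sketch; the codes and quarters are made up:

```python
# Longitudinal imputation sketch: carry forward the most recently reported
# value over a history (None marks a quarter with a missing report).

def carry_forward(history):
    filled, last = [], None
    for value in history:
        if value is not None:
            last = value  # a reported value becomes the new donor
        filled.append(last)
    return filled

industry = ["236115", None, None, "236118", None]
print(carry_forward(industry))
# ['236115', '236115', '236115', '236118', '236118']
```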
Cross Walks and Other Imputations
• In business data, converting an activity code (e.g., SIC) to a different activity code (e.g., NAICS) is a form of missing data imputation
– This was the original motivation for Rubin’s work (1996 review article) using occupation codes
• In general, the two activity codes are not recorded simultaneously for the same entity
• Often these imputations are treated as one-to-one when they are, in fact, many-to-many
Probabilistic Methods for Cross Walks
• Inputs:
– original codes
– new codes
– information for computing Pr[new code | original code, other data]
• Processing:
– Randomly assign a new code from the appropriate conditional distribution
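The processing step is a draw from the estimated conditional distribution. A minimal sketch; the SIC-to-NAICS pairs and probabilities below are invented for illustration, not an actual estimated crosswalk:

```python
import random

# Probabilistic crosswalk sketch: assign a new activity code by drawing
# from Pr[new code | original code]. Codes and probabilities are invented.

CROSSWALK = {
    "7372": {"511210": 0.7, "334611": 0.3},  # old SIC -> candidate NAICS
    "5812": {"722511": 0.9, "722513": 0.1},
}

def assign_new_code(original_code, rng):
    candidates = CROSSWALK[original_code]
    codes = list(candidates)
    weights = [candidates[c] for c in codes]
    return rng.choices(codes, weights=weights, k=1)[0]

rng = random.Random(0)
print([assign_new_code("7372", rng) for _ in range(5)])
```

Because the assignment is a random draw rather than a deterministic 1-1 mapping, repeating it yields different completed data sets, which is exactly what multiple imputation exploits.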
The Theory of Missing Data Models
• General principles
• Missing at random
• Weighting procedures
• Imputation procedures
• Hot decks
• Introduction to model-based procedures
General Principles
• Most of today’s lecture is taken from Statistical Analysis with Missing Data, 2nd edition, Roderick J. A. Little and Donald B. Rubin (New York: John Wiley & Sons, 2002)
• The basic insight is that missing data should be modeled using the same probability and statistical tools that are the basis of all data analysis
• Missing data are not an anomaly to be swept under the carpet
• They are an integral part of every analysis
Missing Data Patterns
• Univariate non-response
• Multivariate non-response
• Monotone
• General
• File matching
• Latent factors, Bayesian parameters
Missing Data Mechanisms
• The complete data are defined as the matrix Y (n × K)
• The pattern of missing data is summarized by a matrix of indicator variables M (n × K):

$$m_{ij} = \begin{cases} 0, & \text{if } y_{ij} \text{ is observed} \\ 1, & \text{if } y_{ij} \text{ is missing} \end{cases}$$

• The data generating mechanism is summarized by the joint distribution of Y and M:

$$p(Y, M \mid \theta, \psi)$$
Missing Completely at Random
• In this case the missing data mechanism does not depend upon the data Y:

$$p(M \mid Y, \psi) = p(M \mid \psi)$$

• This case is called MCAR
Missing at Random
• Partition Y into observed and unobserved parts:

$$Y = (Y_{obs}, Y_{mis})$$

• Missing at random means that the distribution of M depends only on the observed part of Y:

$$p(M \mid Y, \psi) = p(M \mid Y_{obs}, \psi)$$

• Called MAR
Not Missing at Random
• If the condition for MAR fails, then we say that the data are not missing at random, NMAR.
• Censoring and more elaborate behavioral models often fall into this category.
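A toy simulation, invented for illustration and not from the lecture, shows why the distinction matters for complete-case analysis: when missingness in y2 depends on an observed, correlated y1 (MAR), the complete cases stop being representative.

```python
import random

# Toy MCAR vs. MAR illustration on invented data. y2 is correlated with
# y1; under MCAR y2 is deleted at random, while under MAR the deletion
# probability depends on the observed y1.

rng = random.Random(42)
ys = []
for _ in range(20000):
    y1 = rng.gauss(0, 1)
    ys.append((y1, y1 + rng.gauss(0, 1)))  # y2 = y1 + noise, true mean 0

mcar = [(y1, None if rng.random() < 0.3 else y2) for y1, y2 in ys]
mar = [(y1, None if rng.random() < (0.8 if y1 > 0 else 0.1) else y2)
       for y1, y2 in ys]

def complete_case_mean(data):
    obs = [y2 for _, y2 in data if y2 is not None]
    return sum(obs) / len(obs)

# MCAR complete cases recover the true mean of 0; MAR complete cases are
# biased downward because high-y1 (hence high-y2) rows vanish more often.
print(round(complete_case_mean(mcar), 2), round(complete_case_mean(mar), 2))
```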
The Little and Rubin Taxonomy
• Analysis of the complete records only
• Weighting procedures
• Imputation-based procedures
• Model-based procedures
Analysis of Complete Records Only
• Assumes that the data are MCAR• Only appropriate for small amounts of missing
data• Used to be common in economics, less so in
sociology• Now very rare
Weighting Procedures
• Modify the design weights to correct for missing records
• Provide an item weight (e.g., earnings and income weights in the CPS) that corrects for missing data on that variable. See Bollinger and Hirsch discussion later in lecture
• See complete case and weighted complete case discussion in Little and Rubin
Imputation-based Procedures
• Missing values are filled in and the resulting “completed” data are analyzed
– Hot deck
– Mean imputation
– Regression imputation
• Some imputation procedures (e.g., Rubin’s multiple imputation) are really model-based procedures.
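Two of these single-imputation procedures can be sketched on a toy variable; the data are made up, and note that neither method reflects imputation uncertainty:

```python
# Single-imputation sketches on made-up data: mean imputation and
# regression imputation of missing y values using an observed x.

def mean_impute(y):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in y if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in y]

def regression_impute(x, y):
    """Fill missing y with fitted values from a least-squares line of y on x."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    beta = (sum((a - mx) * (b - my) for a, b in pairs)
            / sum((a - mx) ** 2 for a, _ in pairs))
    alpha = my - beta * mx
    return [alpha + beta * a if b is None else b for a, b in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, None, 30.0, None]
print(mean_impute(y))           # [10.0, 20.0, 30.0, 20.0]
print(regression_impute(x, y))  # [10.0, 20.0, 30.0, 40.0]
```

Both produce a single completed data set with understated variability, which is the motivation for the predictive-distribution approaches discussed later in the lecture.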
Imputation Based on Statistical Modeling
• Hot deck: use the data from related cases in the same survey to impute missing items (usually as a group)
• Cold deck: use a fixed probability model to impute the missing items
• Multiple imputation: use the posterior predictive distribution of the missing item, given all the other items, to impute the missing data
Current Population Survey
• Census Bureau imputation procedures:
– Relational Imputation
– Longitudinal Edit
– Hot Deck Allocation Procedure
– Winkler full edit/imputation system
“Hot Deck” Allocation
• Labor Force Status
– Employed
– Unemployed
– Not in the Labor Force
(Thanks to Warren Brown)
“Hot Deck” Allocation

           Black       Non-Black
Male
 16-24
 25+                   ID #0062
Female
 16-24
 25+
“Hot Deck” Allocation
           Black       Non-Black
Male
 16-24     ID #3502    ID #1241
 25+       ID #8177    ID #0062
Female
 16-24     ID #9923    ID #5923
 25+       ID #4396    ID #2271
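A sequential hot deck with adjustment cells like the race × sex × age table above can be sketched as follows; the IDs, labels, and cold-deck start values are invented for illustration:

```python
# Sequential hot-deck sketch. Each adjustment cell (race x sex x age group)
# holds the most recent complete respondent's value; a record with a
# missing item receives the current donor value for its cell. All data
# are invented.

def hot_deck(records, cold_deck):
    cells = dict(cold_deck)  # seed each cell with a cold-deck value
    for rec in records:
        cell = (rec["race"], rec["sex"], rec["age_group"])
        if rec["status"] is None:
            rec["status"] = cells[cell]  # allocate from the deck
            rec["allocated"] = True
        else:
            cells[cell] = rec["status"]  # complete case becomes the new donor
    return records

cold = {("Black", "Male", "25+"): "Employed",
        ("Non-Black", "Female", "16-24"): "Not in labor force"}
recs = [
    {"race": "Black", "sex": "Male", "age_group": "25+", "status": "Unemployed"},
    {"race": "Black", "sex": "Male", "age_group": "25+", "status": None},
    {"race": "Non-Black", "sex": "Female", "age_group": "16-24", "status": None},
]
print([r["status"] for r in hot_deck(recs, cold)])
# ['Unemployed', 'Unemployed', 'Not in labor force']
```

The second record draws its donor from the complete case just processed in the same cell, while the third falls back to the cold-deck value because no complete case in its cell has been seen yet.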
CPS Example
• Effects of hot-deck imputation of labor force status
Public Use Statistics

                                Total
Total A_LFSR              220,284,576
Working                   131,704,236
W/job, not at work          4,572,653
Unemp, looking for work     7,967,976
Unemp, on layoff            1,371,469
Not in labor force         74,668,242

Total A_AGE               220,284,576
Average A_AGE                    44.1
Std Err A_AGE                    0.15

Total A_SEX               220,284,576
Male                      105,972,746
Female                    114,311,831
Allocated v. Unallocated (by allocation flag AXLFSR)

                                Total    No change    Allocated
Total A_LFSR              220,284,576  219,529,643      754,933
Working                   131,704,236  131,294,888      409,348
W/job, not at work          4,572,653    4,564,589        8,063
Unemp, looking for work     7,967,976    7,919,562       48,414
Unemp, on layoff            1,371,469    1,367,766        3,703
Not in labor force         74,668,242   74,382,838      285,405

Total A_AGE               220,284,576  219,529,643      754,933
Average A_AGE                    44.1         44.2         35.2
Std Err A_AGE                    0.15         0.15         1.96

Total A_SEX               220,284,576  219,529,643      754,933
Male                      105,972,746  105,603,454      369,292
Female                    114,311,831  113,926,189      385,641
Bollinger and Hirsch CPS Missing Data
• Studies the effects of the particular assumptions in the CPS hot deck imputer on wage regressions
• Census Bureau uses too few variables in its hot deck model
• Inclusion of additional variables improves the accuracy of the missing data models
• See Bollinger and Hirsch (2006)
Model-based Procedures
• A probability model based on p(Y, M) forms the basis for the analysis
• This probability model is used as the basis for estimation of parameters or effects of interest
• Some general-purpose model-based procedures are designed to be combined with likelihood functions that are not specified in advance
Little and Rubin’s Principles
• Imputations should be
– Conditioned on observed variables
– Multivariate
– Draws from a predictive distribution
• Single imputation methods do not provide a means to correct standard errors for estimation error
Applications to Complicated Data
• Computational formulas for MI data
• Examples of building multiply-imputed data files
Computational Formulas
• Assume that you want to estimate something as a function of the data, Q(Y)
• The formulas account for the missing data contribution to the variance:

$$Q^{(m)} = Q(Y^{(m)}) \quad \text{(estimand from the } m\text{th implicate)}$$

$$\bar{Q} = \frac{1}{M}\sum_{m=1}^{M} Q(Y^{(m)}) \quad \text{(average estimand)}$$

$$V^{(m)} = V(Q \mid Y^{(m)}) \quad \text{(covariance matrix of } Q \text{ from the } m\text{th implicate)}$$

$$\bar{V} = \frac{1}{M}\sum_{m=1}^{M} V^{(m)} \quad \text{(average covariance matrix)}$$

$$B = \frac{1}{M-1}\sum_{m=1}^{M} \left(Q^{(m)} - \bar{Q}\right)\left(Q^{(m)} - \bar{Q}\right)^{T} \quad \text{(between-implicate variation of } Q\text{)}$$

$$T = \bar{V} + \left(1 + \frac{1}{M}\right)B \quad \text{(total variance matrix)}$$

$$\text{Missingness ratio} = b_{ii}/t_{ii}$$
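For a scalar estimand the combining formulas reduce to a few lines of code; the implicate values below are invented for illustration:

```python
# Scalar version of the multiple-imputation combining formulas.
# q[m] is the estimand from implicate m; v[m] is its sampling variance.

def combine(q, v):
    M = len(q)
    qbar = sum(q) / M                                # average estimand
    vbar = sum(v) / M                                # average within variance
    b = sum((qm - qbar) ** 2 for qm in q) / (M - 1)  # between-implicate variance
    t = vbar + (1 + 1 / M) * b                       # total variance
    return qbar, t, b / t                            # estimate, variance, missingness ratio

# Invented values from M = 5 implicates:
q = [1.0, 1.2, 0.8, 1.1, 0.9]
v = [0.04, 0.05, 0.04, 0.05, 0.04]
qbar, t, ratio = combine(q, v)
print(round(qbar, 3), round(t, 3), round(ratio, 3))  # 1.0 0.074 0.338
```

The missingness ratio b/t is the same quantity reported as the "missingness rate" in the QWI assessment table later in the lecture.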
Examples
• Survey of Consumer Finances
• Quarterly Workforce Indicators
Survey of Consumer Finances
• Codebook description of missing data procedures
How are the QWIs Built?
• Raw input files:
– UI wage records
– QCEW/ES-202 report
– Decennial census and ACS files
– SSA-supplied administrative records
– Census-derived administrative record household address files
– LEHD geo-coding system
• Processed data files:
– Individual characteristics
– Employer characteristics
– Employment history with earnings
Processing the Input Files
• Each quarter the complete history of every individual, every establishment, and every job is processed through the production system
• Missing data on the individuals are multiply imputed at the national level, and the posterior predictive distribution is stored
• Missing data on the employment history record are multiply imputed each quarter from a fresh posterior predictive distribution
• Missing data on the employer characteristics are singly-imputed (explanation to follow)
Examples of Missing Data Problems
• Missing demographic data on the national individual file (birth date, sex, race, ethnicity, place of residence, and education)
– Multiple imputations using information from the individual, establishment, and employment history files
– Model estimation component updated irregularly
– Imputations performed once for those in the estimation universe, then once when a new PIK is encountered in the production system
• This process was used on the current QWI and for the S2011 snapshot
• An older process was used to create the current snapshots (S2004/S2008)
A Very Difficult Missing Data Problem
• The employment history records only code employer to the UI account level
• Establishment characteristics (industry, geo-codes) are missing for multi-unit establishments
• The establishment (within UI account) is multiply imputed using a dynamic multi-stage probability model
• Estimation of the posterior predictive distribution depends on the existence of a state with establishments coded on the UI wage record (MN)
How Is It Done?
• Every quarter the QWI processes over 6 billion employment histories (unique person-employer pairs) covering 1990 to 2012
• Approximately 30-40% of these histories require multiple employer imputations
• So, the system does more than 25 billion full information imputations every quarter
• The information used for the imputations is current; it includes all of the historical information for the person and every establishment associated with that person’s UI account
Does It Work?
• Full assessment using the state that codes both (MN)
• Summary slide follows
[Figure: Percent discrepancy (10th percentile, median, and 90th percentile; scale roughly -15% to +20%), MN known unit vs. MN imputed unit, weighted, for earnings, full-quarter accessions, beginning-of-period employment, end-of-period employment, full-quarter employment, separations, and accessions]
Cumulative Effect of All QWI Edits and Imputations
                            Average Z-scores                    Missingness Rates
Entity    Average     Beginning  Full-quarter  Earnings  Beginning  Full-quarter  Earnings   Sample
Size*     Employment  Empl. (b)  Empl. (f)     (z_w3)    Empl. (b)  Empl. (f)     (z_w3)     Size
all           437        8.79       8.09        10.15      27.1%      27.1%        31.2%    237,741
1-9             4        1.60       1.46         2.99      33.1%      33.3%        43.6%     95,520
10-99          35        4.84       4.40         6.69      24.4%      24.4%        25.7%     84,621
100-249       160       11.08      10.14        13.52      21.8%      21.6%        20.5%     21,187
250-499       354       16.66      15.29        19.37      20.9%      20.9%        19.1%     11,972
500-999       707       23.59      21.68        26.03      20.7%      20.6%        17.9%      8,787
1000+        5538       56.67      52.61        52.11      20.2%      20.1%        16.1%     15,654

*Entity is county × NAICS sector × race × ethnicity for 2008:q3.

• Z-score is the ratio of the QWI estimate to the square root of its total variation (within and between implicate components)
• Missingness rate is the ratio of the between variance to the total variance