TRANSCRIPT
INFO 7470/ILRLE 7400 Statistical Tools:
Edit and Imputation
John M. Abowd and Lars Vilhuber, March 25, 2013
© John M. Abowd and Lars Vilhuber 2013, all rights reserved
Outline
• Why learn about edit and imputation procedures
• Formal models of edits and imputations
• Missing data overview
• Missing records
– Frame or census
– Survey
• Missing items
• Overview of different products
• Overview of methods
• Formal multiple imputation methods
• Examples
Why?
• Users of public-use data can normally identify the existence and consequences of edits and imputations, but don’t have access to data that would improve them
• Users of restricted-access data normally encounter raw files that require sophisticated edit and imputation procedures in order to use effectively
• Users of integrated (linked) data from multiple sources face these problems in their extreme form
Formal Edit and Imputation Models
• Original work on this subject can be found in Fellegi and Holt (1976)
• One of Fellegi’s many seminal contributions
• Recent work by Winkler (2008)
• Formal models distinguish between edits (based on expert judgments) and imputations (based on modeling)
Definition of “Edit”
• Checking each field (variable) of a data record to ensure that it contains a valid entry
– Examples: NAICS code in range; 0 < age < 120
• Checking the entries of specified fields (variables) to ensure that they are consistent with each other
– Examples: job creations – job destructions = accessions – separations; age = data_reference_date – birth_date
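Both kinds of checks are easy to express as predicate functions. A minimal sketch in Python; the field names and edit rules here are hypothetical illustrations, not any agency's production edits:

```python
# Illustrative validity and consistency edits. Field names and ranges are
# hypothetical, not any statistical agency's production edit rules.

def validity_edit(record):
    """Field-level edit: each variable must contain a valid entry."""
    failures = []
    if not (0 < record["age"] < 120):
        failures.append("age out of range")
    if not (111110 <= record["naics"] <= 928120):  # six-digit NAICS range
        failures.append("naics out of range")
    return failures

def consistency_edit(record):
    """Cross-field edit: specified variables must agree with each other."""
    failures = []
    if (record["job_creations"] - record["job_destructions"]
            != record["accessions"] - record["separations"]):
        failures.append("flow identity violated")
    if record["age"] != record["reference_year"] - record["birth_year"]:
        failures.append("age inconsistent with birth year")
    return failures

rec = {"age": 35, "naics": 541511, "birth_year": 1978, "reference_year": 2013,
       "job_creations": 10, "job_destructions": 4,
       "accessions": 12, "separations": 6}
print(validity_edit(rec), consistency_edit(rec))  # [] [] -- record passes
```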
Options When There Is an Edit Failure
1. Check original files (written questionnaires, census forms, original administrative records)
2. Contact reference entity (household, person, business, establishment)
3. Clerically “correct” data (using expert judgment, not model-based)
4. Automatic edit (using a specified algorithm)
5. Delete record from analysis/tabulation
Direct Edits Are Expensive
• Even when feasible, it is expensive to cross-check every edit failure with original source material
• It is extremely expensive to re-contact sources, although some re-contacts are built into data collection budgets
• Computer-assisted survey/census methods can do these edits at the time of the original data collection
• Re-contact or revisiting original source material is usually either infeasible or prohibited for administrative records in statistical use
Example: Age on the SIPP
• The SIPP collects both age and birth date (as do the decennial census and ACS)
• These are often inconsistent; an edit is applied to make them consistent
• When linked to SSA administrative record data, the birth date on the SSN application becomes available
• It is often inconsistent with the respondent report
• And both birth dates are often inconsistent with the observed retirement benefit status of a respondent (also in administrative data)
• Why?
The SSA Birth Date Is Respondent Provided until Benefits are Claimed
• Prior to 1986, most Americans received SSNs when they entered the labor force, not at birth as is now the case
• The birth date on SSN application records is “respondent” provided (and not edited by SSA)
• Further updates to the SSA “Numident” file, which is the master database of SSNs, are also “respondent provided”
• Only when a person claims age-dependent benefits (Social Security at 62 or Medicare at 65) does SSA get a birth certificate and apply a true edit to the birth date
Lesson
• Editing, even informal editing, always involves building models, even if they are difficult to lay out explicitly
• Formal edit models involve specifying all the logical relations among the variables (including impossibilities), then imposing them on a probability model like the ones we will consider in this lecture
• When an edit is made with probability 1 in such a system, the designers of the edit have declared that the expert’s prior judgment cannot be overturned by data
• In resolving the SIPP/SSA birth date example, the experts declared that since benefit recipiency was based on audited (via birth certificates) birth dates, it should be taken as “true;” all other values were edited to agree
– Note: users still don’t have enough information to apply this edit on receipt of SIPP data
– Doesn’t help for those too young to claim age-eligible benefits
Edits and Missing Data Overview
• Missing data are a constant feature of both sampling frames (derived from censuses) and surveys
• Two important types are distinguished
– Missing record (frame) or interview (survey)
– Missing item (in either context)
• Methods differ depending upon type
Missing Records: Frame or Census
• The problem of missing records in a census or sampling frame is detection
• By definition in these contexts the problem requires external information to solve
Census of Population and Housing
• Dress rehearsal Census
• Pre-census housing list review
• Census processing of housing units found on a block not present on the initial list
• Post-census evaluation survey
• Post-census coverage studies
Economic Censuses and the Business Register
• Discussed in lecture 4
• Start with tax records
• Unduplication in the Business Register
• Weekly updates
• Multi-units updated with Report of Organization Survey
• Multi-units discovered during the intercensal surveys are added to the BR
Missing Records: Survey
• Non-response in a survey is normally handled within the sample design
• Follow-up (up to a limit) to obtain interview/data
• Assessment of non-response within sample strata
• Adjustment of design weights to reflect non-responses
Missing and/or Inconsistent Items
• Edits during the interview (CAPI/CATI) and during post-processing of the survey
• Edit or imputation based on the other data in the interview/case (relational edit or imputation)
• Imputation based on related information on the same respondent (longitudinal edit or imputation)
• Imputation based on statistical modeling
– Hot deck
– Cold deck
– Multiple imputation
Census 2000 PUMS Missing Data
• Pre-edit: When the original entry was rejected because it fell outside the range of acceptable values
• Consistency: Edited or imputed missing characteristics based on other information recorded for the person or housing unit
• Hot Deck: Supplied the missing information from the record of another person or housing unit.
• Cold Deck: Supplied missing information from a predetermined distribution
• See allocation flags for details
CPS Missing Data
• Relational edit or imputation: use other information in the record to infer value (based on expert judgment)
• Longitudinal edits: use values from the previous month if present in sample (based on rules, usually)
• Hot deck: use values from actual respondents whose data are complete for the relatively few conditioning variables
County Business Patterns
• The County and Zip code Business Patterns data are published from the Employer Business Register
• This is important because variables used in these publications are edited to publication standards
• The primary imputation method is a longitudinal edit
• http://www.census.gov/econ/cbp/methodology.htm
Economic Censuses
• As with the demographic products, these files usually contain both edited and unedited versions of the publication variables
• Publication variables (e.g., payroll, employment, sales, geography, ownership) have been edited
• Most recent files include allocation flags to indicate that a publication variable has been edited or imputed
• Many historical files include variables that have been edited or imputed but do not include the flags
QWI Missing Data Procedures
• Individual data
– Multiple imputation
• Employer data
– Relational edit
– Bi-directional longitudinal edit
– Single-value imputation
• Job data
– Use multiple imputation of individual data
– Multiple imputation of place of work
• Use data for each place of work
BLS National Longitudinal Surveys
• Non-responses to the first wave never enter the data
• Non-responses to subsequent waves are coded as “interview missing”
• Respondents are not dropped for missing an interview. Special procedures are used to fill critical items from missed interviews when the respondent is interviewed again
• Item non-response is coded as such
Federal Reserve Survey of Consumer Finances (SCF)
• General information on the Survey of Consumer Finances: http://www.federalreserve.gov/pubs/oss/oss2/scfindex.html
• Missing data and confidentiality protection are handled with the same multiple imputation procedure
SCF Details
• Survey collects detailed wealth information from an over-sample of wealthy households
• Item refusals and item non-response are rampant (see Kennickell, 2011)
• When there is item refusal, the interview instrument attempts to get an interval
• The reported interval is used in the missing data imputation
• When the response is deemed sensitive enough for confidentiality protection, the response is treated as an item missing (using the same interval model as above)
• First major survey released with multiple imputation
Relational Edit or Imputation
• Uses information from the same respondent
• Example: respondent provided age but not birth date. Use age to impute birth date
• Example: some members of a household have missing race/ethnicity data. Use other members of the same household to impute race/ethnicity
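The age-to-birth-date example reduces to a short rule. A minimal sketch; the field names, reference year, and allocation flag are illustrative, not an actual agency specification:

```python
# Relational imputation sketch: derive a missing birth year from the same
# respondent's reported age. Field names and the reference year are
# hypothetical illustrations.

REFERENCE_YEAR = 2013

def impute_birth_year(record):
    """Fill birth_year from age when only age was reported."""
    if record.get("birth_year") is None and record.get("age") is not None:
        record["birth_year"] = REFERENCE_YEAR - record["age"]
        record["birth_year_allocated"] = True  # flag the value as imputed
    return record

print(impute_birth_year({"age": 40, "birth_year": None}))
# {'age': 40, 'birth_year': 1973, 'birth_year_allocated': True}
```

In practice such edits also set an allocation flag, as the public-use products described later do, so users can identify imputed values.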
Longitudinal Edit or Imputation
• Look at the respondent’s history in the data to get the value
• Example: respondent’s employment information missing this month. Impute employment information from previous month
• Example: establishment industry code missing this quarter. Impute industry code from most recently reported code
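The industry-code example amounts to a carry-forward rule over the respondent's history. A minimal sketch; the codes and quarters are made up:

```python
# Longitudinal imputation sketch: carry forward the most recently reported
# value over a history (None marks a quarter with a missing report).

def carry_forward(history):
    filled, last = [], None
    for value in history:
        if value is not None:
            last = value  # a reported value becomes the new donor
        filled.append(last)
    return filled

industry = ["236115", None, None, "236118", None]
print(carry_forward(industry))
# ['236115', '236115', '236115', '236118', '236118']
```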
Cross Walks and Other Imputations
• In business data, converting an activity code (e.g., SIC) to a different activity code (e.g., NAICS) is a form of missing data imputation
– This was the original motivation for Rubin’s work (1996 review article) using occupation codes
• In general, the two activity codes are not recorded simultaneously for the same entity
• Often these imputations are treated as one-to-one when they are, in fact, many-to-many
Probabilistic Methods for Cross Walks
• Inputs:
– original codes
– new codes
– information for computing Pr[new code | original code, other data]
• Processing:
– Randomly assign a new code from the appropriate conditional distribution
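The processing step is a draw from the estimated conditional distribution. A minimal sketch; the SIC-to-NAICS pairs and probabilities below are invented for illustration, not an actual estimated crosswalk:

```python
import random

# Probabilistic crosswalk sketch: assign a new activity code by drawing
# from Pr[new code | original code]. Codes and probabilities are invented.

CROSSWALK = {
    "7372": {"511210": 0.7, "334611": 0.3},  # old SIC -> candidate NAICS
    "5812": {"722511": 0.9, "722513": 0.1},
}

def assign_new_code(original_code, rng):
    candidates = CROSSWALK[original_code]
    codes = list(candidates)
    weights = [candidates[c] for c in codes]
    return rng.choices(codes, weights=weights, k=1)[0]

rng = random.Random(0)
print([assign_new_code("7372", rng) for _ in range(5)])
```

Because the assignment is a random draw rather than a deterministic 1-1 mapping, repeating it yields different completed data sets, which is exactly what multiple imputation exploits.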
The Theory of Missing Data Models
• General principles
• Missing at random
• Weighting procedures
• Imputation procedures
• Hot decks
• Introduction to model-based procedures
General Principles
• Most of today’s lecture is taken from Statistical Analysis with Missing Data, 2nd edition, Roderick J. A. Little and Donald B. Rubin (New York: John Wiley & Sons, 2002)
• The basic insight is that missing data should be modeled using the same probability and statistical tools that are the basis of all data analysis
• Missing data are not an anomaly to be swept under the carpet
• They are an integral part of every analysis
Missing Data Patterns
• Univariate non-response
• Multivariate non-response
• Monotone
• General
• File matching
• Latent factors, Bayesian parameters
Missing Data Mechanisms
• The complete data are defined as the matrix Y (n × K)
• The pattern of missing data is summarized by a matrix of indicator variables M (n × K):

$$m_{ij} = \begin{cases} 0, & \text{if } y_{ij} \text{ is observed} \\ 1, & \text{if } y_{ij} \text{ is missing} \end{cases}$$

• The data generating mechanism is summarized by the joint distribution of Y and M:

$$p(Y, M \mid \theta, \psi)$$
Missing Completely at Random
• In this case the missing data mechanism does not depend upon the data Y:

$$p(M \mid Y, \psi) = p(M \mid \psi)$$

• This case is called MCAR
Missing at Random
• Partition Y into observed and unobserved parts:

$$Y = (Y_{obs}, Y_{mis})$$

• Missing at random means that the distribution of M depends only on the observed part of Y:

$$p(M \mid Y, \psi) = p(M \mid Y_{obs}, \psi)$$

• Called MAR
Not Missing at Random
• If the condition for MAR fails, then we say that the data are not missing at random, NMAR.
• Censoring and more elaborate behavioral models often fall into this category.
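A toy simulation, invented for illustration and not from the lecture, shows why the distinction matters for complete-case analysis: when missingness in y2 depends on an observed, correlated y1 (MAR), the complete cases stop being representative.

```python
import random

# Toy MCAR vs. MAR illustration on invented data. y2 is correlated with
# y1; under MCAR y2 is deleted at random, while under MAR the deletion
# probability depends on the observed y1.

rng = random.Random(42)
ys = []
for _ in range(20000):
    y1 = rng.gauss(0, 1)
    ys.append((y1, y1 + rng.gauss(0, 1)))  # y2 = y1 + noise, true mean 0

mcar = [(y1, None if rng.random() < 0.3 else y2) for y1, y2 in ys]
mar = [(y1, None if rng.random() < (0.8 if y1 > 0 else 0.1) else y2)
       for y1, y2 in ys]

def complete_case_mean(data):
    obs = [y2 for _, y2 in data if y2 is not None]
    return sum(obs) / len(obs)

# MCAR complete cases recover the true mean of 0; MAR complete cases are
# biased downward because high-y1 (hence high-y2) rows vanish more often.
print(round(complete_case_mean(mcar), 2), round(complete_case_mean(mar), 2))
```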
The Little and Rubin Taxonomy
• Analysis of the complete records only
• Weighting procedures
• Imputation-based procedures
• Model-based procedures
Analysis of Complete Records Only
• Assumes that the data are MCAR• Only appropriate for small amounts of missing
data• Used to be common in economics, less so in
sociology• Now very rare
Weighting Procedures
• Modify the design weights to correct for missing records
• Provide an item weight (e.g., earnings and income weights in the CPS) that corrects for missing data on that variable. See Bollinger and Hirsch discussion later in lecture
• See complete case and weighted complete case discussion in Little and Rubin
Imputation-based Procedures
• Missing values are filled in and the resulting “completed” data are analyzed
– Hot deck
– Mean imputation
– Regression imputation
• Some imputation procedures (e.g., Rubin’s multiple imputation) are really model-based procedures.
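Two of these single-imputation procedures can be sketched on a toy variable; the data are made up, and note that neither method reflects imputation uncertainty:

```python
# Single-imputation sketches on made-up data: mean imputation and
# regression imputation of missing y values using an observed x.

def mean_impute(y):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in y if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in y]

def regression_impute(x, y):
    """Fill missing y with fitted values from a least-squares line of y on x."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    beta = (sum((a - mx) * (b - my) for a, b in pairs)
            / sum((a - mx) ** 2 for a, _ in pairs))
    alpha = my - beta * mx
    return [alpha + beta * a if b is None else b for a, b in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, None, 30.0, None]
print(mean_impute(y))           # [10.0, 20.0, 30.0, 20.0]
print(regression_impute(x, y))  # [10.0, 20.0, 30.0, 40.0]
```

Both produce a single completed data set with understated variability, which is the motivation for the predictive-distribution approaches discussed later in the lecture.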
Imputation Based on Statistical Modeling
• Hot deck: use the data from related cases in the same survey to impute missing items (usually as a group)
• Cold deck: use a fixed probability model to impute the missing items
• Multiple imputation: use the posterior predictive distribution of the missing item, given all the other items, to impute the missing data
Current Population Survey
• Census Bureau imputation procedures:
– Relational Imputation
– Longitudinal Edit
– Hot Deck Allocation Procedure
– Winkler full edit/imputation system
“Hot Deck” Allocation
• Labor Force Status
– Employed
– Unemployed
– Not in the Labor Force
(Thanks to Warren Brown)
“Hot Deck” Allocation

           Black       Non-Black
Male
 16-24
 25+                   ID #0062
Female
 16-24
 25+
“Hot Deck” Allocation
           Black       Non-Black
Male
 16-24     ID #3502    ID #1241
 25+       ID #8177    ID #0062
Female
 16-24     ID #9923    ID #5923
 25+       ID #4396    ID #2271
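A sequential hot deck with adjustment cells like the race × sex × age table above can be sketched as follows; the IDs, labels, and cold-deck start values are invented for illustration:

```python
# Sequential hot-deck sketch. Each adjustment cell (race x sex x age group)
# holds the most recent complete respondent's value; a record with a
# missing item receives the current donor value for its cell. All data
# are invented.

def hot_deck(records, cold_deck):
    cells = dict(cold_deck)  # seed each cell with a cold-deck value
    for rec in records:
        cell = (rec["race"], rec["sex"], rec["age_group"])
        if rec["status"] is None:
            rec["status"] = cells[cell]  # allocate from the deck
            rec["allocated"] = True
        else:
            cells[cell] = rec["status"]  # complete case becomes the new donor
    return records

cold = {("Black", "Male", "25+"): "Employed",
        ("Non-Black", "Female", "16-24"): "Not in labor force"}
recs = [
    {"race": "Black", "sex": "Male", "age_group": "25+", "status": "Unemployed"},
    {"race": "Black", "sex": "Male", "age_group": "25+", "status": None},
    {"race": "Non-Black", "sex": "Female", "age_group": "16-24", "status": None},
]
print([r["status"] for r in hot_deck(recs, cold)])
# ['Unemployed', 'Unemployed', 'Not in labor force']
```

The second record draws its donor from the complete case just processed in the same cell, while the third falls back to the cold-deck value because no complete case in its cell has been seen yet.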
CPS Example
• Effects of hot-deck imputation of labor force status
Public Use Statistics

                                Total
Total A_LFSR              220,284,576
Working                   131,704,236
W/job, not at work          4,572,653
Unemp, looking for work     7,967,976
Unemp, on layoff            1,371,469
Not in labor force         74,668,242

Total A_AGE               220,284,576
Average A_AGE                    44.1
Std Err A_AGE                    0.15

Total A_SEX               220,284,576
Male                      105,972,746
Female                    114,311,831
Allocated v. Unallocated (by allocation flag AXLFSR)

                                Total    No change    Allocated
Total A_LFSR              220,284,576  219,529,643      754,933
Working                   131,704,236  131,294,888      409,348
W/job, not at work          4,572,653    4,564,589        8,063
Unemp, looking for work     7,967,976    7,919,562       48,414
Unemp, on layoff            1,371,469    1,367,766        3,703
Not in labor force         74,668,242   74,382,838      285,405

Total A_AGE               220,284,576  219,529,643      754,933
Average A_AGE                    44.1         44.2         35.2
Std Err A_AGE                    0.15         0.15         1.96

Total A_SEX               220,284,576  219,529,643      754,933
Male                      105,972,746  105,603,454      369,292
Female                    114,311,831  113,926,189      385,641
Bollinger and Hirsch CPS Missing Data
• Studies the effects of the particular assumptions in the CPS hot deck imputer on wage regressions
• Census Bureau uses too few variables in its hot deck model
• Inclusion of additional variables improves the accuracy of the missing data models
• See Bollinger and Hirsch (2006)
Model-based Procedures
• A probability model based on p(Y, M) forms the basis for the analysis
• This probability model is used as the basis for estimation of parameters or effects of interest
• Some general-purpose model-based procedures are designed to be combined with likelihood functions that are not specified in advance
Little and Rubin’s Principles
• Imputations should be
– Conditioned on observed variables
– Multivariate
– Draws from a predictive distribution
• Single imputation methods do not provide a means to correct standard errors for estimation error
Applications to Complicated Data
• Computational formulas for MI data
• Examples of building multiply-imputed data files
Computational Formulas
• Assume that you want to estimate something as a function of the data, Q(Y)
• The formulas account for the missing data contribution to the variance:

$$Q^{(m)} = Q(Y^{(m)}) \quad \text{(estimand from the } m\text{th implicate)}$$

$$\bar{Q} = \frac{1}{M}\sum_{m=1}^{M} Q(Y^{(m)}) \quad \text{(average estimand)}$$

$$V^{(m)} = V(Q \mid Y^{(m)}) \quad \text{(covariance matrix of } Q \text{ from the } m\text{th implicate)}$$

$$\bar{V} = \frac{1}{M}\sum_{m=1}^{M} V^{(m)} \quad \text{(average covariance matrix)}$$

$$B = \frac{1}{M-1}\sum_{m=1}^{M} \left(Q^{(m)} - \bar{Q}\right)\left(Q^{(m)} - \bar{Q}\right)^{T} \quad \text{(between-implicate variation of } Q\text{)}$$

$$T = \bar{V} + \left(1 + \frac{1}{M}\right)B \quad \text{(total variance matrix)}$$

$$\text{Missingness ratio} = b_{ii}/t_{ii}$$
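For a scalar estimand the combining formulas reduce to a few lines of code; the implicate values below are invented for illustration:

```python
# Scalar version of the multiple-imputation combining formulas.
# q[m] is the estimand from implicate m; v[m] is its sampling variance.

def combine(q, v):
    M = len(q)
    qbar = sum(q) / M                                # average estimand
    vbar = sum(v) / M                                # average within variance
    b = sum((qm - qbar) ** 2 for qm in q) / (M - 1)  # between-implicate variance
    t = vbar + (1 + 1 / M) * b                       # total variance
    return qbar, t, b / t                            # estimate, variance, missingness ratio

# Invented values from M = 5 implicates:
q = [1.0, 1.2, 0.8, 1.1, 0.9]
v = [0.04, 0.05, 0.04, 0.05, 0.04]
qbar, t, ratio = combine(q, v)
print(round(qbar, 3), round(t, 3), round(ratio, 3))  # 1.0 0.074 0.338
```

The missingness ratio b/t is the same quantity reported as the "missingness rate" in the QWI assessment table later in the lecture.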
Examples
• Survey of Consumer Finances
• Quarterly Workforce Indicators
Survey of Consumer Finances
• Codebook description of missing data procedures
How are the QWIs Built?
• Raw input files:
– UI wage records
– QCEW/ES-202 report
– Decennial census and ACS files
– SSA-supplied administrative records
– Census-derived administrative record household address files
– LEHD geo-coding system
• Processed data files:
– Individual characteristics
– Employer characteristics
– Employment history with earnings
Processing the Input Files
• Each quarter the complete history of every individual, every establishment, and every job is processed through the production system
• Missing data on the individuals are multiply imputed at the national level, and the posterior predictive distribution is stored
• Missing data on the employment history record are multiply imputed each quarter from a fresh posterior predictive distribution
• Missing data on the employer characteristics are singly-imputed (explanation to follow)
Examples of Missing Data Problems
• Missing demographic data on the national individual file (birth date, sex, race, ethnicity, place of residence, and education)
– Multiple imputations using information from the individual, establishment, and employment history files
– Model estimation component updated irregularly
– Imputations performed once for those in the estimation universe, then once when a new PIK is encountered in the production system
• This process was used on the current QWI and for the S2011 snapshot
• An older process was used to create the current snapshots (S2004/S2008)
A Very Difficult Missing Data Problem
• The employment history records only code employer to the UI account level
• Establishment characteristics (industry, geo-codes) are missing for multi-unit establishments
• The establishment (within UI account) is multiply imputed using a dynamic multi-stage probability model
• Estimation of the posterior predictive distribution depends on the existence of a state with establishments coded on the UI wage record (MN)
How Is It Done?
• Every quarter the QWI processes over 6 billion employment histories (unique person-employer pairs) covering 1990 to 2012
• Approximately 30-40% of these histories require multiple employer imputations
• So, the system does more than 25 billion full information imputations every quarter
• The information used for the imputations is current; it includes all of the historical information for the person and every establishment associated with that person’s UI account
Does It Work?
• Full assessment using the state that codes both (MN)
• Summary slide follows
[Figure: Percent discrepancy (10th percentile, median, and 90th percentile; scale roughly -15% to +20%), MN known unit vs. MN imputed unit, weighted, for earnings, full-quarter accessions, beginning-of-period employment, end-of-period employment, full-quarter employment, separations, and accessions]
Cumulative Effect of All QWI Edits and Imputations
                            Average Z-scores                    Missingness Rates
Entity    Average     Beginning  Full-quarter  Earnings  Beginning  Full-quarter  Earnings   Sample
Size*     Employment  Empl. (b)  Empl. (f)     (z_w3)    Empl. (b)  Empl. (f)     (z_w3)     Size
all           437        8.79       8.09        10.15      27.1%      27.1%        31.2%    237,741
1-9             4        1.60       1.46         2.99      33.1%      33.3%        43.6%     95,520
10-99          35        4.84       4.40         6.69      24.4%      24.4%        25.7%     84,621
100-249       160       11.08      10.14        13.52      21.8%      21.6%        20.5%     21,187
250-499       354       16.66      15.29        19.37      20.9%      20.9%        19.1%     11,972
500-999       707       23.59      21.68        26.03      20.7%      20.6%        17.9%      8,787
1000+        5538       56.67      52.61        52.11      20.2%      20.1%        16.1%     15,654

*Entity is county × NAICS sector × race × ethnicity for 2008:q3.

• Z-score is the ratio of the QWI estimate to the square root of its total variation (within and between implicate components)
• Missingness rate is the ratio of the between variance to the total variance