www.rti.org
RTI International is a registered trademark and a trade name of Research Triangle Institute.
Synthetic Data Generation for Firm Links
The Synthetic Longitudinal Business Database
Saki Kinney
6th January 2016
A portion of this work was conducted by Special Sworn Status researchers of the U.S. Census
Bureau at the Triangle Federal Statistical Research Data Center. Research results and conclusions
expressed are those of the authors and do not necessarily reflect the views of the Census Bureau.
Results have been screened to ensure that no confidential data are revealed. This work has been
supported by the US Census Bureau; Phase 1 by NSF Grant ITR-0427889.
Longitudinal Business Database (LBD)
Longitudinal economic census covering all private non-farm business
establishments with paid employees
– Developed by U.S. Census Bureau Center for Economic Studies
– Starts with 1976, updated annually
– >30 million establishments
– Low depth, high coverage
Unique research dataset used for looking at business formation and
growth, job flows, market volatility, business cycles, international
comparisons…
– Linkable to hundreds of datasets (within secure computing environment)
Confidential data protected by US law (Title 13 and Title 26)
Research Access to the LBD
Data can only be accessed in a secure Federal Statistical Research
Data Center (RDC)
– May require travel
– All outputs require disclosure review
Project-specific applications required
– Straightforward but very time consuming. Additional time for background
checks and special data requests
– Involves substantial user fee or institutional membership
SynLBD project
Goal is to create public-use file for LBD using synthetic data methods
– Test case for generating public-use business microdata
– Part of larger goal to expand research access
Provide users with disclosure-proofed microdata that permits users
to draw valid inferences for subset of uses
– Reduce the number of requests for special tabulations.
– No need to utilize the RDC network for some researchers.
– Aids users requiring RDC access.
Phase 1: Initial SynLBD released
– First public-use establishment-level microdata.
Phase 2 (underway): Adding firm links, geography, other
improvements
SynLBD Access
80 researchers from ~40 institutions and 6 countries have requested
access
7 researchers requested validation
Frequent requests include firm structure, geographic detail, NAICS
codes, longer time series
Small proposal required – quick turnaround, reviewed for feasibility
– Access via remote desktop emulating RDC environment (Cornell Virtual
RDC)
– All SynLBD analyses can be released without disclosure review.
– Or submitted for validation on restricted data (subject to disclosure review)
Why (partially) synthetic data?
Concerns about confidentiality protection for a longitudinal census of
establishments
– Data are more disclosive than cross-sectional samples of people.
– No actual values of confidential variables may be released (so swapping
and similar methods would provide insufficient protection)
Variables Used (Phase 2)
Variable  Name        Type         Description                        Synthesized
y1        Firstyear   Categorical  First year establishment exists    Yes
y2        Lastyear    Categorical  Last year establishment exists     Yes
y3(t)     Inactive    Binary       Inactive in year t                 Yes
y4(t)     Multiunit   Binary       Part of multiunit firm in year t   Yes
y5(t)     Employment  Continuous   March 12th employment in year t    Yes
y6(t)     Payroll     Continuous   Total payroll in year t            Yes
y7(t)     Firm ID     Categorical  Firm ID in year t                  Yes
x1        Geography   Categorical  State                              No
x2        NAICS       Categorical  3-digit industry code              No
Table 1: Synthetic LBD Variable Names
Notes:
– There is also a randomly generated establishment ID number, LBDnum.
– The published SynLBD contains one implicate; excludes Geography, Inactive, and Firm ID; and uses SIC instead of NAICS.
Synthesis: General Approach
Generate joint posterior predictive distribution of Y|X
– $f(y_1, y_2, y_3, \ldots \mid X) = f(y_1 \mid X)\, f(y_2 \mid y_1, X)\, f(y_3 \mid y_1, y_2, X) \cdots$
To draw from a posterior predictive distribution $f(y_k \mid X, y_1, \ldots, y_{k-1})$
– Fit model using observed data
– Draw new values of model parameters from their posterior distributions
– Use new parameters to predict $y_k$ from $X$ and synthetic values of $y_1, \ldots, y_{k-1}$
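The three steps above can be illustrated with a minimal sketch for categorical variables. It is not the SynLBD code: the helper names (`draw_dirichlet`, `synthesize_column`), the toy records, and the symmetric prior of 1.0 are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def draw_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) using gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def synthesize_column(observed, synthetic, target, predictors, levels, rng, prior=1.0):
    """Add a synthetic `target` value to every row of `synthetic`, in place."""
    # Step 1: fit the model on the observed data (counts per predictor cell).
    counts = defaultdict(Counter)
    for row in observed:
        counts[tuple(row[p] for p in predictors)][row[target]] += 1
    # Step 2: draw new cell probabilities from their Dirichlet posterior.
    probs = {cell: draw_dirichlet([c[lvl] + prior for lvl in levels], rng)
             for cell, c in counts.items()}
    # Step 3: predict from X and the *synthetic* values of earlier variables.
    for row in synthetic:
        cell = tuple(row[p] for p in predictors)
        if cell not in probs:  # cell unseen in observed data: fall back to the prior
            probs[cell] = draw_dirichlet([prior] * len(levels), rng)
        row[target] = rng.choices(levels, weights=probs[cell], k=1)[0]


rng = random.Random(0)
observed = [  # hypothetical toy records, not LBD data
    {"x": "NC", "y1": 1990, "y2": 1995},
    {"x": "NC", "y1": 1990, "y2": 1996},
    {"x": "NC", "y1": 1991, "y2": 1996},
    {"x": "VA", "y1": 1991, "y2": 1995},
    {"x": "VA", "y1": 1991, "y2": 1996},
]
synthetic = [{"x": row["x"]} for row in observed]  # X is carried over unsynthesized
synthesize_column(observed, synthetic, "y1", ["x"], [1990, 1991], rng)        # f(y1 | X)
synthesize_column(observed, synthetic, "y2", ["x", "y1"], [1995, 1996], rng)  # f(y2 | y1, X)
```

Each later variable conditions on the synthetic, not the observed, values of earlier variables, which is what makes the chain a draw from the joint distribution.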
First Year
Year establishment enters LBD
Impute Firstyear | NAICS, State using variant of Dirichlet-Multinomial
model
– Informative “confidentiality prior” used to protect small cells
– Synthetic values obtained from sampling from posterior multinomial
distribution
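A rough sketch of this step for a single NAICS-by-State cell follows. The slides do not specify the form of the confidentiality prior, so a symmetric Dirichlet prior with a hypothetical `prior_count` stands in for it here; the function name and the toy cell are also assumptions.

```python
import random


def synthesize_first_year(observed_years, year_levels, n_synthetic, prior_count, rng):
    """Synthesize Firstyear for one (NAICS, State) cell."""
    counts = {year: 0 for year in year_levels}
    for year in observed_years:
        counts[year] += 1
    # Posterior is Dirichlet(counts + prior_count); sample it via gamma variates.
    gammas = [rng.gammavariate(counts[year] + prior_count, 1.0) for year in year_levels]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    # Sample synthetic first years from the multinomial with the drawn probabilities.
    return rng.choices(year_levels, weights=probs, k=n_synthetic)


rng = random.Random(1)
cell_years = [1976, 1976, 1976, 1980]   # a small, hypothetical cell
levels = list(range(1976, 1982))
synthetic_years = synthesize_first_year(cell_years, levels, len(cell_years), 2.0, rng)
```

A larger `prior_count` pulls the drawn probabilities toward uniform, so a cell with few establishments reveals less about its observed counts, at some cost in fidelity.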
Last Year
Year establishment exits LBD
Impute Lastyear | Firstyear, State, SIC
Dirichlet-multinomial with flat prior (“simple multinomial”)
– Multinomial cell probabilities for synthetic data obtained from matching
cells in observed data
Inactive and multiunit status
Longitudinal binary indicators
– Inactive = I(estab. exited but will return)
– Multiunit = I(estab part of firm)
Modeled with simplified version of Dirichlet-multinomial (Beta-binomial) for each year.
Status does not change for most establishments
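One way to picture the year-by-year Beta-binomial step is sketched below. Conditioning each year only on the previous year's status, the uniform Beta(1, 1) prior, and the function name are assumptions for illustration; the slides do not list the actual conditioning variables.

```python
import random


def synthesize_status(observed_paths, rng, a=1.0, b=1.0):
    """Synthesize 0/1 status paths year by year with Beta-binomial draws."""
    n = len(observed_paths)
    n_years = len(observed_paths[0])
    # First year: draw P(status = 1) from its Beta posterior, then simulate.
    ones = sum(path[0] for path in observed_paths)
    p_first = rng.betavariate(ones + a, n - ones + b)
    synthetic = [[1 if rng.random() < p_first else 0] for _ in range(n)]
    for t in range(1, n_years):
        # One Beta posterior per previous-year status (assumed conditioning).
        p_given_prev = {}
        for prev in (0, 1):
            group = [path for path in observed_paths if path[t - 1] == prev]
            ones = sum(path[t] for path in group)
            p_given_prev[prev] = rng.betavariate(ones + a, len(group) - ones + b)
        # Condition on the synthetic, not the observed, previous status.
        for path in synthetic:
            path.append(1 if rng.random() < p_given_prev[path[-1]] else 0)
    return synthetic


rng = random.Random(2)
observed_paths = [[0, 0, 0], [0, 0, 1], [1, 1, 1], [1, 1, 1], [0, 0, 0]]  # toy data
synthetic_paths = synthesize_status(observed_paths, rng)
```

Because most observed paths never change status, the posterior transition probabilities are concentrated near 0 or 1, and the synthetic paths inherit that persistence.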
Employment and Payroll
Highly skewed continuous variables
Imputed year by year for employment, then year by year for payroll
– Impute emp(t)|emp(t-1), other predictors
– Impute pay(t)|pay(t-1), emp(t), other predictors
CART models with Bayesian bootstrap and optional density
smoothing (Reiter 2005)
– Employment: For births predict employment; thereafter predict changes
CART synthesis method
Goal: Synthesize Y | X.
Grow maximum tree. Prune for confidentiality.
– Partition X space so that subsets of units formed by partitions have relatively homogeneous Y.
Drawing Y:
– For any X, trace down the tree until reaching the appropriate leaf.
– Draw Y from the leaf using the Bayesian bootstrap, optionally with a smoothed density
estimate (with agency-specified bandwidth).
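The tree-then-draw procedure can be sketched in a few lines of plain Python. This is a simplified illustration, not the production method: it uses a single numeric predictor, a minimum leaf size (`min_leaf`) as a stand-in for pruning, and omits the optional density smoothing, so it returns observed leaf values directly. All function names and data are hypothetical.

```python
import random


def _sse(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def _split(pairs, min_leaf):
    """Recursively split sorted (x, y) pairs to minimize within-node SSE of y."""
    ys = [y for _, y in pairs]
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal x values
        cost = _sse(ys[:i]) + _sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None or best[0] >= _sse(ys):
        return {"leaf": ys}
    i = best[1]
    return {"threshold": (pairs[i - 1][0] + pairs[i][0]) / 2,
            "left": _split(pairs[:i], min_leaf),
            "right": _split(pairs[i:], min_leaf)}


def grow_tree(xs, ys, min_leaf):
    """Grow a regression tree; min_leaf (>= 1) limits how small leaves can get."""
    return _split(sorted(zip(xs, ys)), min_leaf)


def find_leaf(tree, x):
    """Trace down the tree until reaching the leaf that contains x."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]


def synthesize(tree, xs_new, rng):
    """Draw one Y per X using Bayesian bootstrap weights drawn once per leaf."""
    weights_by_leaf = {}
    draws = []
    for x in xs_new:
        leaf = find_leaf(tree, x)
        if id(leaf) not in weights_by_leaf:
            # Dirichlet(1, ..., 1) weights; unnormalized gammas work for sampling.
            weights_by_leaf[id(leaf)] = [rng.gammavariate(1.0, 1.0) for _ in leaf]
        draws.append(rng.choices(leaf, weights=weights_by_leaf[id(leaf)], k=1)[0])
    return draws


rng = random.Random(4)
xs = [1, 2, 3, 4, 50, 60, 70, 80]     # e.g. prior-year employment (toy values)
ys = [1, 2, 2, 5, 55, 58, 75, 79]     # e.g. current-year employment (toy values)
tree = grow_tree(xs, ys, min_leaf=2)
synthetic_y = synthesize(tree, [2, 3, 65, 75], rng)
```

Raising `min_leaf` makes leaves larger and draws less tied to any single establishment, which mirrors the role pruning plays for confidentiality.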
Employment and Payroll
– CART synthesis was found to be too good at predicting outliers, so
multiplicative noise was applied to outliers prior to synthesis (Evans, Slanta, & Zayatz).
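As a rough illustration of multiplicative noise applied to outliers, the sketch below perturbs every value above a cutoff by a factor bounded away from 1. The cutoff rule, the noise range, and the function name are assumptions; the slides do not give the parameters actually used.

```python
import random


def perturb_outliers(values, threshold, low, high, rng):
    """Multiply values above `threshold` by 1 +/- m, with m uniform in [low, high]."""
    noisy = []
    for v in values:
        if v > threshold:
            magnitude = rng.uniform(low, high)
            # Pick the direction at random so the noise is roughly symmetric.
            factor = 1 + magnitude if rng.random() < 0.5 else 1 - magnitude
            noisy.append(v * factor)
        else:
            noisy.append(v)  # non-outliers pass through unchanged
    return noisy


rng = random.Random(5)
employment = [5, 12, 900, 15000]   # toy values; the last two are treated as outliers
noisy_employment = perturb_outliers(employment, threshold=500, low=0.1, high=0.2, rng=rng)
```

Keeping the factor at least `low` away from 1 guarantees that no outlier is released unchanged or nearly unchanged.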
Bias observed in Phase 1
[Figure: Job Creation Rates: LBD and Implicates by Year. x-axis: Year, 1977 to 2000; y-axis: 0 to 50; series: LBD, Implicate 1, Implicate 2, Implicate (Mean)]
[Figure: Job Destruction Rates: LBD and Implicates by Year. x-axis: Year, 1977 to 2000; y-axis: 0 to 50; series: LBD, Implicate 1, Implicate 2, Implicate (Mean)]
Firm ID
Firm ID: Identification variable linking establishments that belong to
the same firm.
Can vary year to year due to mergers, acquisitions, sales, etc.
For most establishments it does not change
Feature most requested by SynLBD users and potential users
Synthesis of Firm IDs
1. Impute, year by year, binary variables indicating whether
establishments keep or switch firm ID (small % of establishments
switch)
– Longitudinal binary variables with continuous predictors, modeled using
CART approach
2. For new and switching estabs, predict firm characteristics
(employment, # estabs, age, etc.) simultaneously using multivariate
recursive partition trees (mvpart)
3. Link real and synthetic establishments using propensity score
matching on firm characteristics.
Firm Links
4. Assign Synthetic Firm ID = real Firm ID of linked establishment
5. Discard synthetic firm characteristics – not needed and not safe to
release.
– Firm characteristics are instead recomputed from the Firm IDs of synthetic
establishments, ensuring logical consistency
6. Replace synthetic firm IDs with pseudo-identifiers
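Steps 3, 4, and 6 can be sketched as follows. This is a simplified stand-in: it assumes a scalar matching score has already been computed for every establishment (the propensity model itself is not shown), links by nearest score, and uses shuffled integers as pseudo-identifiers. The function name, scores, and firm labels are hypothetical.

```python
import random


def assign_pseudo_firm_ids(synthetic_scores, real_records, rng):
    """real_records is a list of (score, real_firm_id) pairs."""
    # Steps 3-4: link each synthetic establishment to the real establishment
    # with the closest score and borrow that establishment's firm ID.
    linked = [min(real_records, key=lambda rec: abs(rec[0] - s))[1]
              for s in synthetic_scores]
    # Step 6: replace the borrowed real firm IDs with pseudo-identifiers,
    # shuffled so the labels carry no information about the real IDs.
    distinct = sorted(set(linked))
    labels = list(range(1, len(distinct) + 1))
    rng.shuffle(labels)
    pseudo = dict(zip(distinct, labels))
    return [pseudo[firm_id] for firm_id in linked]


rng = random.Random(6)
real_records = [(0.10, "F-A"), (0.12, "F-A"), (0.80, "F-B")]  # toy scores and IDs
synthetic_scores = [0.11, 0.09, 0.75]
pseudo_ids = assign_pseudo_firm_ids(synthetic_scores, real_records, rng)
```

Synthetic establishments linked to the same real firm share a pseudo-identifier, so firm-level characteristics such as establishment counts can be recomputed from the synthetic file alone.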
Disclosure Risk Assessment
Fundamental trade-off between disclosure risk and analytic utility.
– Disclosure risk = probability of re-identification of establishment and/or
attributes
Phase 1 overall approach was ‘Maximum knowledge’ intruder
scenario
– With both observed data and synthetic data in hand, can a mapping between
the two be obtained?
In Phase 1, satisfactory differential privacy assessment done on a
subset.
– Phase 1 analyses can be replicated in Phase 2, adapted for firms
– Additional work needed for Phase 2 changes and additions
Current Status
Phase 2 synthesis is being finalized
Next steps
– Evaluate analytical validity
– Disclosure risk analysis
Seek IRS and Census DRB disclosure approval
Future
– Updating SynLBD as LBD updates, multiple implicates (?)
Thank you
For more info or to access Phase 1 SynLBD:
– http://www.census.gov/ces/dataproducts/synlbd/
Phase 2 coauthors: Jerry Reiter, Javier Miranda
Additional Phase 1 coauthors: John Abowd, Ron Jarmin, Arnold
Reznek
Other support: Lars Vilhuber, Kevin McKinney, many others