www.rti.org. RTI International is a registered trademark and a trade name of Research Triangle Institute.

Synthetic Data Generation for Firm Links

The Synthetic Longitudinal Business Database

Saki Kinney

6th January 2016

A portion of this work was conducted by Special Sworn Status researchers of the U.S. Census Bureau at the Triangle Federal Statistical Research Data Center. Research results and conclusions expressed are those of the authors and do not necessarily reflect the views of the Census Bureau. Results have been screened to ensure that no confidential data are revealed. This work has been supported by the U.S. Census Bureau; Phase 1 by NSF Grant ITR-0427889.

Longitudinal Business Database (LBD)

Longitudinal economic census covering all private non-farm business

establishments with paid employees

– Developed by U.S. Census Bureau Center for Economic Studies

– Starts with 1976, updated annually

– >30 million establishments

– Low depth, high coverage

Unique research dataset used for looking at business formation and

growth, job flows, market volatility, business cycles, international

comparisons…

– Linkable to hundreds of datasets (within secure computing environment)

Confidential data protected by US law (Title 13 and Title 26)

Research Access to the LBD

Data can only be accessed in a secure Federal Statistical Research

Data Center (RDC)

– May require travel

– All outputs require disclosure review

Project-specific applications required

– Straightforward but very time-consuming, with additional time needed for background checks and special data requests

– Involves substantial user fee or institutional membership

SynLBD project

Goal is to create public-use file for LBD using synthetic data methods

– Test case for generating public-use business microdata

– Part of larger goal to expand research access

Provide users with disclosure-proofed microdata that permits users

to draw valid inferences for subset of uses

– Reduce the number of requests for special tabulations.

– No need to utilize the RDC network for some researchers.

– Aids users requiring RDC access.

Phase 1: Initial SynLBD released

– First public-use establishment-level microdata.

Phase 2 (underway): Adding firm links, geography, other

improvements

SynLBD Access

80 researchers from ~40 institutions and 6 countries have requested

access

7 researchers requested validation

Frequent requests include firm structure, geographic detail, NAICS

codes, longer time series

Small proposal required – quick turnaround, reviewed for feasibility

– Access via remote desktop emulating RDC environment (Cornell Virtual

RDC)

– All SynLBD analyses can be released w/o disclosure review.

– Or submitted for validation on restricted data (subject to disclosure review)

Why (partially) synthetic data?

Concerns about confidentiality protection for longitudinal census of

establishments

– Data are more disclosive than cross-sectional samples of people.

– No actual confidential values may be released (i.e., swapping, etc. would provide insufficient protection)

Variables Used (Phase 2)

Variable  Name        Type         Description                        Synthesized
y1        Firstyear   Categorical  First year establishment exists    Yes
y2        Lastyear    Categorical  Last year establishment exists     Yes
y3 (t)    Inactive    Binary       Inactive in year t                 Yes
y4 (t)    Multiunit   Binary       Part of multiunit firm in year t   Yes
y5 (t)    Employment  Continuous   March 12th employment in year t    Yes
y6 (t)    Payroll     Continuous   Total payroll in year t            Yes
y7 (t)    Firm ID     Categorical  Firm ID in year t                  Yes
x1        Geography   Categorical  State                              No
x2        NAICS       Categorical  3-digit industry code              No

Table 1: Synthetic LBD Variable Names

Notes:

– There is also a randomly generated establishment ID number, LBDnum.

– Published SynLBD contains one implicate; excludes Geography, Inactive, and Firm ID; and uses SIC instead of NAICS.

Synthesis: General Approach

Generate joint posterior predictive distribution of Y|X

– $f(y_1, y_2, y_3, \ldots \mid X) = f(y_1 \mid X)\, f(y_2 \mid y_1, X)\, f(y_3 \mid y_1, y_2, X) \cdots$

To draw from a posterior predictive distribution $f(y_k \mid X, y_1, \ldots, y_{k-1})$:

– Fit model using observed data

– Draw new values of model parameters from their posterior distributions

– Use new parameters to predict $y_k$ from $X$ and synthetic values of $y_1, \ldots, y_{k-1}$
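The sequential factorization can be illustrated with a minimal Python sketch. It shows only the control flow: each variable is drawn in turn, conditioning on X and on the synthetic values already drawn. The function names and the toy synthesizers are hypothetical stand-ins for fitted models and are not taken from the SynLBD code.

```python
import random

def synthesize_record(x, synthesizers, rng):
    """Draw y1..yk in order; each draw sees x and the earlier synthetic values."""
    ys = []
    for draw in synthesizers:
        ys.append(draw(x, tuple(ys), rng))
    return ys

# Hypothetical toy synthesizers standing in for fitted models (assumption).
def draw_first_year(x, prev, rng):
    return rng.choice([1976, 1977, 1978])

def draw_last_year(x, prev, rng):
    first_year = prev[0]
    # Conditioning on the synthetic first year keeps the record logically consistent.
    return rng.choice(range(first_year, 1980))

rng = random.Random(0)
record = synthesize_record({"state": "NC"}, [draw_first_year, draw_last_year], rng)
```

Because the last-year draw conditions on the synthetic first year rather than the observed one, the synthetic record cannot exit before it enters.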

First Year

Year establishment enters LBD

Impute Firstyear | NAICS, State using variant of Dirichlet-Multinomial

model

– Informative “confidentiality prior” used to protect small cells

– Synthetic values obtained from sampling from posterior multinomial

distribution
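A minimal sketch of a Dirichlet-multinomial draw for a single NAICS-by-State cell, using only the Python standard library. The prior pseudo-counts here are a placeholder: the slides do not specify the form of the confidentiality prior, so the values and all function names are assumptions.

```python
import random

def draw_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def synthesize_cell(counts, prior, n_synthetic, rng):
    """counts: observed count per category; prior: positive pseudo-counts (assumed)."""
    categories = list(counts)
    posterior_alphas = [counts[c] + prior[c] for c in categories]
    probs = draw_dirichlet(posterior_alphas, rng)    # posterior cell probabilities
    return rng.choices(categories, weights=probs, k=n_synthetic)

rng = random.Random(2016)
observed_counts = {1976: 40, 1977: 3, 1978: 0}       # toy cell, not real data
prior = {year: 1.0 for year in observed_counts}      # placeholder pseudo-counts
synthetic_first_years = synthesize_cell(observed_counts, prior, n_synthetic=43, rng=rng)
```

With positive pseudo-counts, categories with zero or very few observed units still receive some probability, which is one way a prior can blur small cells.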

Last Year

Year establishment exits LBD

Impute Last Year | First Year, State, SIC

Dirichlet-multinomial with flat prior (“simple multinomial”)

– Multinomial cell probabilities for synthetic data obtained from matching

cells in observed data

Inactive and multiunit status

Longitudinal binary indicators

– Inactive = I(estab. exited but will return)

– Multiunit = I(estab part of firm)

Modeled with simplified version of Dirichlet-multinomial (Beta-binomial) for each year.

Status does not change for most establishments
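A minimal Beta-binomial sketch for one binary indicator in one year, assuming a Beta(a, b) prior with a = b = 1. The prior values, the toy counts, and the function name are illustrative assumptions; the slides do not give the actual specification or conditioning variables.

```python
import random

def synthesize_indicator(n_ones, n_total, n_synthetic, rng, a=1.0, b=1.0):
    """Beta(a, b) prior plus binomial counts gives a Beta posterior for P(status = 1)."""
    p = rng.betavariate(a + n_ones, b + n_total - n_ones)   # draw the parameter
    return [1 if rng.random() < p else 0 for _ in range(n_synthetic)]

# Toy counts (not real data): 120 of 1000 establishments are multiunit.
synthetic_multiunit = synthesize_indicator(
    n_ones=120, n_total=1000, n_synthetic=1000, rng=random.Random(1)
)
```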

Employment and Payroll

Highly skewed continuous variables

Imputed year by year for employment, then year by year for payroll

– Impute emp(t)|emp(t-1), other predictors

– Impute pay(t)|pay(t-1), emp(t), other predictors

CART models with Bayesian bootstrap and optional density

smoothing (Reiter 2005)

– Employment: For births predict employment; thereafter predict changes

CART synthesis method

Goal: Synthesize Y | X.

Grow maximum tree. Prune for confidentiality.

– Partition X space so that subsets of units formed by partitions have relatively homogeneous Y.

Drawing Y:

– For any X, trace down tree until reaching the appropriate leaf.

– Draw Y from leaf using Bayesian bootstrap, optionally with smoothed density estimate (with agency-specified bandwidth).
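The tree-then-draw idea can be sketched in plain Python with a single predictor. This is a simplified illustration, not the SynLBD implementation: a minimum leaf size (`min_leaf`) stands in for pruning, the optional density smoothing is omitted (so draws are observed leaf values), and all names and data are hypothetical.

```python
import random

def sse(ys):
    """Within-node sum of squared deviations from the mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def grow_tree(xs, ys, min_leaf):
    """Split on x to minimize within-node SSE; min_leaf stands in for pruning."""
    best = None
    for cut in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, cut)
    if best is None or best[0] >= sse(ys):
        return {"leaf": list(ys)}
    cut = best[1]
    lo = [(x, y) for x, y in zip(xs, ys) if x < cut]
    hi = [(x, y) for x, y in zip(xs, ys) if x >= cut]
    return {"cut": cut,
            "left": grow_tree([x for x, _ in lo], [y for _, y in lo], min_leaf),
            "right": grow_tree([x for x, _ in hi], [y for _, y in hi], min_leaf)}

def find_leaf(tree, x):
    """Trace x down the tree and return the Y values stored in its leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["cut"] else tree["right"]
    return tree["leaf"]

def bayesian_bootstrap_draw(values, rng):
    """Draw Dirichlet(1,...,1) weights over the leaf values, then sample one value."""
    weights = [rng.gammavariate(1.0, 1.0) for _ in values]
    return rng.choices(values, weights=weights, k=1)[0]

xs = [1, 2, 3, 10, 11, 12]                    # toy predictor values
ys = [5.0, 6.0, 5.5, 50.0, 52.0, 51.0]        # toy outcomes
tree = grow_tree(xs, ys, min_leaf=3)
synthetic_y = bayesian_bootstrap_draw(find_leaf(tree, 2.5), random.Random(0))
```

Larger values of `min_leaf` make each leaf a bigger pool of donors, trading some predictive accuracy for more uncertainty about which unit a synthetic value came from.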

Employment and Payroll

– CART synthesis was found to be too good at predicting outliers, so multiplicative noise was applied to outliers prior to synthesis (Evans, Slanta, & Zayatz).
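A rough sketch of multiplicative noise applied only to outliers. The slides cite Evans, Slanta, & Zayatz but do not give the noise distribution or the outlier rule, so the threshold, the uniform noise range, and the function name below are assumptions chosen for illustration.

```python
import random

def add_multiplicative_noise(values, threshold, rng, low=0.05, high=0.15):
    """Multiply each value above threshold by (1 +/- u), with u ~ Uniform(low, high)."""
    noisy = []
    for v in values:
        if v > threshold:
            u = rng.uniform(low, high)      # noise magnitude, bounded away from zero
            sign = rng.choice([-1, 1])      # perturb up or down with equal chance
            v = v * (1 + sign * u)
        noisy.append(v)
    return noisy

employment = [4, 12, 7, 25000]              # toy values, not real data
noisy_employment = add_multiplicative_noise(employment, threshold=1000, rng=random.Random(3))
```

Keeping the multiplier bounded away from 1 guarantees that every flagged value actually moves by at least a minimum percentage.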

Bias observed in Phase 1

[Figure: Job Creation Rates: LBD and Implicates by Year. x-axis: Year, 1977 to 2000; y-axis: 0 to 50; series: LBD, Implicate 1, Implicate 2, Implicate (Mean).]

[Figure: Job Destruction Rates: LBD and Implicates by Year. x-axis: Year, 1977 to 2000; y-axis: 0 to 50; series: LBD, Implicate 1, Implicate 2, Implicate (Mean).]

Firm ID

Firm ID: Identification variable linking establishments that belong to

the same firm.

Can vary year to year due to mergers, acquisitions, sales, etc.

For most establishments it does not change

Feature most requested by SynLBD users and potential users

Synthesis of Firm IDs

1. Impute, year by year, binary variables indicating whether

establishments keep or switch firm ID (small % of establishments

switch)

– Longitudinal binary variables with continuous predictors, modeled using

CART approach

2. For new and switching estabs, predict firm characteristics

(employment, # estabs, age, etc.) simultaneously using multivariate

recursive partition trees (mvpart)

3. Link real and synthetic establishments using propensity score

matching on firm characteristics.

Firm Links

4. Assign Synthetic Firm ID = real Firm ID of linked establishment

5. Discard synthetic firm characteristics – not needed and not safe to

release.

– Firm characteristics computed using Firm IDs of synthetic establishments,

ensuring logical consistency

6. Replace synthetic firm IDs with pseudo-identifiers
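Steps 3, 4, and 6 can be sketched as follows, assuming each establishment already has a scalar matching score. Nearest-neighbor matching with replacement on that score is a simplification standing in for the propensity score matching named above; the function names, scores, and IDs are all hypothetical.

```python
import random

def link_and_assign(synthetic, real):
    """synthetic: list of (estab_id, score); real: list of (firm_id, score).
    Each synthetic establishment takes the firm ID of the closest-scoring real record."""
    assigned = {}
    for estab_id, score in synthetic:
        firm_id, _ = min(real, key=lambda r: abs(r[1] - score))
        assigned[estab_id] = firm_id
    return assigned

def pseudonymize(assigned, rng):
    """Replace each distinct firm ID with a random pseudo-identifier, keeping links intact."""
    firm_ids = sorted(set(assigned.values()))
    labels = list(range(1, len(firm_ids) + 1))
    rng.shuffle(labels)
    mapping = dict(zip(firm_ids, labels))
    return {estab: mapping[firm] for estab, firm in assigned.items()}

real = [("F100", 0.10), ("F200", 0.80)]                      # toy real records
synthetic = [("e1", 0.12), ("e2", 0.75), ("e3", 0.85)]       # toy synthetic records
assigned = link_and_assign(synthetic, real)
pseudo = pseudonymize(assigned, random.Random(0))
```

Establishments linked to the same real firm share a pseudo-identifier, so firm-level characteristics can be recomputed from the synthetic establishments without releasing any real firm ID.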

Disclosure Risk Assessment

Fundamental trade-off between disclosure risk and analytic utility.

– Disclosure risk = probability of re-identification of establishment and/or

attributes

Phase 1 overall approach was ‘Maximum knowledge’ intruder

scenario

– With both observed data and synthetic data in hand, can mapping be

obtained?

In Phase 1, satisfactory differential privacy assessment done on a

subset.

– Phase 1 analyses can be replicated in Phase 2, adapted for firms

– Additional work needed for Phase 2 changes and additions

Current Status

Phase 2 synthesis is being finalized

Next steps

– Evaluate analytical validity

– Disclosure risk analysis

Seek IRS and Census DRB disclosure approval

Future

– Updating SynLBD as LBD updates, multiple implicates (?)

Thank you

For more info or to access Phase 1 SynLBD:

– http://www.census.gov/ces/dataproducts/synlbd/

Phase 2 coauthors: Jerry Reiter, Javier Miranda

Additional Phase 1 coauthors: John Abowd, Ron Jarmin, Arnold

Reznek

Other support: Lars Vilhuber, Kevin McKinney, many others
