Jörg Drechsler (Institute for Employment Research, Germany)
NTTS 2009
Brussels, 20 February 2009
Disclosure Control in Business Data: Experiences with Multiply Imputed Synthetic Datasets for the German IAB Establishment Survey
2
Overview
Background
Multiple imputation for statistical disclosure control
Challenges for real data applications
Some preliminary results
Conclusions/Future Work
3
SDC for Business Data
Public release of business data is often considered too risky
- Skewed distributions make identification of single units easy
- Information on businesses in the public domain
- High benefits from identifying a single unit
- High probability of inclusion for large establishments
Coarsening and top-coding alone are not sufficient
Standard perturbation methods would have to be applied at a high level of perturbation
Release of high-quality data is therefore very difficult
Multiply imputed synthetic datasets as a possible solution
4
Partially synthetic datasets (Little 1993)
only potentially identifying or sensitive variables are replaced
Advantages:
- synthesis can be tailored to the records at risk
- the approach is applicable to continuous and discrete variables
- modeling tries to preserve the joint distribution of the data
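To illustrate the general idea, here is a minimal sketch in Python; it is not the model actually used for the IAB data, and the variable names are hypothetical. A sensitive continuous variable is replaced by draws from a regression model fitted on the original data, while all other variables are left unchanged.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def synthesize_column(df, target, predictors, rng):
    """Replace `target` with draws from a normal linear model fitted on the original data."""
    X = sm.add_constant(df[predictors])
    fit = sm.OLS(df[target], X).fit()
    # Draw new values from the estimated predictive distribution (parameter
    # uncertainty is ignored here; a proper implementation would also draw
    # the model parameters from their posterior).
    mu = fit.predict(X)
    synth = df.copy()
    synth[target] = rng.normal(mu, np.sqrt(fit.scale))
    return synth

rng = np.random.default_rng(1)
# Hypothetical example data: turnover is the sensitive variable to be replaced.
df = pd.DataFrame({"employees": rng.integers(1, 500, 200),
                   "turnover": rng.gamma(2.0, 1e5, 200)})
synthetic_datasets = [synthesize_column(df, "turnover", ["employees"], rng) for _ in range(10)]
```

Repeating the draw several times, as in the last line, yields the multiple synthetic copies that are later combined in the analysis.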
7
Challenges for real data applications
Missing data
Skip patterns
Logical constraints
8
Missing Data
Missing data is a common problem in surveys (more than 200 variables with missing values in our survey)
Most SDL techniques cannot deal with missing values
Imputation in two stages for synthetic data:
- multiply impute the missing values at stage one
- generate synthetic datasets for each stage-one nest at stage two
New combining rules are necessary (Reiter, 2004)
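A purely structural sketch of the two-stage scheme in Python; the functions impute_missing and synthesize are placeholders for the survey-specific imputation and synthesis routines, which are not shown here.

```python
def two_stage_synthesis(data, impute_missing, synthesize, m=5, r=10):
    """Two-stage imputation for partially synthetic data with missing values.

    `impute_missing` and `synthesize` stand in for routines that draw from
    (approximate) posterior predictive distributions.
    """
    released = []
    for i in range(m):                    # stage one: fill in the missing values
        completed = impute_missing(data)
        nest = [synthesize(completed) for _ in range(r)]  # stage two: synthesis
        released.append(nest)             # nest i holds r synthetic datasets
    return released                       # m nests of r datasets each
```

The released data then consist of m nests of r datasets each, and point and variance estimates are combined across them with the rules derived in Reiter (2004).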
9
Skip patterns
Joint modeling very difficult for datasets with skip patterns and different types of variables
Imputation by sequential regression (Raghunathan et al., 2001)
- linear models for continuous variables
- logit models for binary variables
- multinomial models for categorical variables
For skip patterns: use a logit model to decide whether the filtered questions are applicable, and impute values only for records with a positive outcome from the logit model
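As a rough illustration in Python: the filter variable exports_any and the follow-up export_share are hypothetical examples, not variables from the IAB questionnaire, and parameter-uncertainty draws are omitted.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
# Hypothetical data: `exports_any` is the filter, `export_share` the follow-up question.
df = pd.DataFrame({"employees": rng.integers(1, 500, 300)})
df["exports_any"] = (rng.random(300) < 0.4).astype(int)
df["export_share"] = np.where(df["exports_any"] == 1, rng.beta(2, 5, 300), 0.0)

X = sm.add_constant(np.log(df["employees"]))
# Step 1: logit model decides whether the filtered question applies.
p = sm.Logit(df["exports_any"], X).fit(disp=0).predict(X)
filter_syn = (rng.random(len(df)) < p).astype(int)

# Step 2: linear model for the follow-up, fitted on applicable records only.
mask = df["exports_any"] == 1
fit = sm.OLS(df.loc[mask, "export_share"], X[mask]).fit()
value_syn = rng.normal(fit.predict(X), np.sqrt(fit.scale))
# Impute follow-up values only where the synthetic filter outcome is positive.
df["export_share_syn"] = np.where(filter_syn == 1, value_syn, 0.0)
```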
10
Logical constraints
All continuous variables > 0: redraw from the model for units with negative values until the restriction is fulfilled
- only feasible if the truncation point is in the far tail of the distribution; otherwise, refine the model
Y1 > Y2, e.g. total number of employees > number of part-time employees:
- compute x = Y2/Y1 and z = logit(x)
- use a standard linear model on the transformed variable
- back-transform the imputed values to get the final values
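A minimal sketch of the ratio/logit transformation in Python; the variables y1 and y2 are hypothetical, and parameter uncertainty and rounding back to integer counts are ignored. Because the imputation happens on the logit scale, the back-transformed values automatically satisfy 0 < Y2 < Y1.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.special import logit, expit

rng = np.random.default_rng(3)
# Hypothetical data: total employees (y1) and part-time employees (y2), with y2 < y1.
y1 = rng.integers(2, 1000, 500)
y2 = np.minimum(rng.binomial(y1, 0.3), y1 - 1)   # ensure 0 <= y2 < y1
df = pd.DataFrame({"y1": y1, "y2": y2})

eps = 1e-6
x = np.clip(df["y2"] / df["y1"], eps, 1 - eps)   # ratio in (0, 1)
z = logit(x)                                     # transform to the real line

X = sm.add_constant(np.log(df["y1"]))
fit = sm.OLS(z, X).fit()                         # standard linear model on the logit scale
z_syn = rng.normal(fit.predict(X), np.sqrt(fit.scale))
df["y2_syn"] = expit(z_syn) * df["y1"]           # back-transform: 0 < y2_syn < y1 by construction
```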
11
The IAB Establishment Panel
Annually conducted establishment survey
Since 1993 in Western Germany, since 1996 in Eastern Germany
Population: All establishments with at least one employee covered by social security
Source: Official Employment Statistics
Sample of more than 16,000 establishments in the last wave
Contents: employment structure, changes in employment, investment, training, remuneration, working hours,
collective wage agreements, works councils
12
Synthesis of the IAB Establishment Panel
We synthesize only the 2007 wave
Missing values are imputed for all variables
Roughly 25 variables are synthesized
Combination of key variables and sensitive variables
Key variables: region, industry code, personnel structure,…
Sensitive variables: turnover, investments,…
For data quality evaluation, we only look at the synthesis step
Number of imputations for the synthesis: r=10
13
Confidence interval overlap
Suggested by Karr et al. (2006)
Measure the overlap of CIs from the original data and CIs from the synthetic data
The higher the overlap, the higher the data utility
Compute the average relative CI overlap for any k
J_k = 1/2 * [ (U_{k,over} - L_{k,over}) / (U_{k,orig} - L_{k,orig}) + (U_{k,over} - L_{k,over}) / (U_{k,syn} - L_{k,syn}) ]
where [L_{k,orig}, U_{k,orig}] is the CI for the original data, [L_{k,syn}, U_{k,syn}] is the CI for the synthetic data, and [L_{k,over}, U_{k,over}] is their overlap (intersection)
(Figure: the original and synthetic confidence intervals plotted on a common axis, with the overlap interval marked)
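For a single estimand the overlap measure can be computed directly from the two interval endpoints; a minimal Python sketch:

```python
def ci_overlap(l_orig, u_orig, l_syn, u_syn):
    """Average relative CI overlap J_k (Karr et al., 2006) for one estimand."""
    l_over = max(l_orig, l_syn)          # lower end of the intersection
    u_over = min(u_orig, u_syn)          # upper end of the intersection
    if u_over <= l_over:                 # intervals do not overlap
        return 0.0
    return 0.5 * ((u_over - l_over) / (u_orig - l_orig)
                  + (u_over - l_over) / (u_syn - l_syn))

# Example: original CI (0.40, 0.60), synthetic CI (0.45, 0.70) -> overlap of about 0.675
print(ci_overlap(0.40, 0.60, 0.45, 0.70))
```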
14
Two regression results
Regressions suggested by colleagues at the IAB
First regression:
- dependent variable: part-time yes/no
- probit regression on 19 explanatory variables + industry dummies
Second regression:
- Dependent variable: expected employment trend (decrease, no change, increase)
- ordered probit on 38 variables + industry dummies
Both regressions are computed separately for West and East Germany
15
Regression results for West Germany
Average CI overlap: 0.89

Variable | beta orig. | beta syn. | CI overlap (J_k) | z-score orig. | z-score syn.
Intercept | -0.800 | -0.777 | 0.95 | -7.18 | -6.72
5-10 employees | 0.448 | 0.449 | 0.95 | 8.63 | 7.78
10-20 employees | 0.666 | 0.593 | 0.68 | 11.16 | 10.48
20-50 employees | 0.806 | 0.754 | 0.79 | 13.30 | 12.16
100-200 employees | 0.904 | 0.874 | 0.92 | 9.37 | 8.87
200-500 employees | 1.134 | 1.092 | 0.91 | 10.02 | 9.49
>500 employees | 1.675 | 1.580 | 0.88 | 8.28 | 8.00
growth in employment exp. | 0.002 | -0.003 | 0.97 | 0.05 | -0.05
decrease in emp. expected | 0.092 | 0.114 | 0.93 | 1.17 | 1.45
share of female workers | 1.453 | 1.378 | 0.76 | 17.79 | 19.22
share of employees with university degree | 0.314 | 0.372 | 0.90 | 2.14 | 2.71
share of low qualified workers | 1.105 | 1.179 | 0.80 | 12.12 | 12.53
share of temporary employees | -0.310 | -0.139 | 0.75 | -1.65 | -1.12
share of agency workers | -0.492 | -0.514 | 0.96 | -3.11 | -3.24
employment in the last 6 months | 0.388 | 0.370 | 0.90 | 8.21 | 7.86
dismissal in the last 6 months | 0.290 | 0.267 | 0.87 | 6.31 | 5.83
foreign ownership | -0.115 | -0.118 | 0.99 | -1.35 | -1.41
good or very good profitability | 0.034 | 0.034 | 0.99 | 0.86 | 0.85
salary above collective wage agreement | 0.007 | 0.010 | 0.99 | 0.12 | 0.18
collective wage agreement | 0.020 | 0.023 | 0.99 | 0.39 | 0.45
16
Regression results for East Germany
Average CI overlap: 0.92

Variable | beta orig. | beta syn. | CI overlap (J_k) | z-score orig. | z-score syn.
Intercept | -0.725 | -0.675 | 0.89 | -6.53 | -6.02
5-10 employees | 0.272 | 0.277 | 0.94 | 4.93 | 4.44
10-20 employees | 0.422 | 0.378 | 0.81 | 7.04 | 6.55
20-50 employees | 0.554 | 0.537 | 0.93 | 9.42 | 8.87
100-200 employees | 0.780 | 0.812 | 0.91 | 8.29 | 8.50
200-500 employees | 0.979 | 0.994 | 0.97 | 8.31 | 8.37
>500 employees | 1.412 | 1.410 | 0.99 | 5.72 | 5.64
growth in employment exp. | -0.034 | -0.040 | 0.97 | -0.62 | -0.73
decrease in emp. expected | 0.040 | 0.042 | 0.99 | 0.51 | 0.54
share of female workers | 1.010 | 1.062 | 0.83 | 12.69 | 15.18
share of employees with university degree | 0.208 | 0.164 | 0.90 | 1.75 | 1.46
share of low qualified workers | 0.970 | 1.057 | 0.81 | 8.39 | 9.05
share of temporary employees | -0.072 | -0.002 | 0.78 | -0.46 | -0.02
share of agency workers | -0.288 | -0.243 | 0.93 | -1.67 | -1.42
employment in the last 6 months | 0.230 | 0.206 | 0.87 | 4.96 | 4.47
dismissal in the last 6 months | 0.300 | 0.296 | 0.98 | 6.43 | 6.36
foreign ownership | -0.166 | -0.178 | 0.97 | -1.73 | -1.87
good or very good profitability | 0.098 | 0.100 | 0.99 | 2.40 | 2.45
salary above collective wage agreement | 0.092 | 0.092 | 1.00 | 1.19 | 1.19
collective wage agreement | 0.097 | 0.072 | 0.87 | 1.88 | 1.42
17
Results for the second regression
Average CI overlap: 0.90
Minimum CI overlap: 0.58
18
Conclusions
Generating synthetic datasets is difficult and labour-intensive
Synthetic datasets can handle many real data problems
Synthetic datasets seem to provide high data quality for our establishment survey
Future Work
More data quality evaluations are necessary
Remaining disclosure risk needs to be quantified (Drechsler & Reiter, 2008)
Long-term goal: release complete longitudinal data
19
Thank you for your attention
20
Categorical Variables with a low number of observations
Standard approach: Multinomial/Dirichlet model
Covariates can only be incorporated indirectly by applying the model separately to different subgroups of the data (see the sketch below)
Provides good results for subgroups only if original dataset is large
Small datasets do not provide enough observations to build models for different subgroups
Alternative: CART models
Suggested by Reiter (2005)
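A minimal sketch of the standard multinomial/Dirichlet approach mentioned above (Python; the column names are hypothetical): within each subgroup, category probabilities are drawn from the Dirichlet posterior implied by a uniform prior and the observed counts, and new categories are sampled from them.

```python
import numpy as np
import pandas as pd

def synthesize_categorical(df, target, group_vars, rng, alpha=1.0):
    """Multinomial/Dirichlet synthesis of one categorical variable, applied
    separately within the subgroups defined by `group_vars`."""
    categories = df[target].unique()
    synth = df[target].copy()
    for _, idx in df.groupby(group_vars).groups.items():
        counts = df.loc[idx, target].value_counts().reindex(categories, fill_value=0)
        probs = rng.dirichlet(alpha + counts.to_numpy())  # posterior draw of category probabilities
        synth.loc[idx] = rng.choice(categories, size=len(idx), p=probs)
    return synth

rng = np.random.default_rng(5)
# Hypothetical example: legal form synthesized separately by region.
df = pd.DataFrame({"region": rng.choice(["west", "east"], 400),
                   "legal_form": rng.choice(["ltd", "plc", "sole"], 400)})
df["legal_form_syn"] = synthesize_categorical(df, "legal_form", ["region"], rng)
```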
21
CART Models
Flexible tool for estimating the conditional distribution of a univariate outcome given multivariate predictors
Partition the predictor space to form subsets with homogeneous outcomes
Partitions found by recursive binary splits of the predictors
(Figure: example tree built by recursive binary splits, with a root split X1 < 3, a further split X2 < 5, and leaves L1, L2, L3)
22
CART models for synthesis
Grow a tree using the original data
Define the minimum number of records in each leaf
Prune the tree if necessary
Use the partially synthesized data to locate the leaf for each unit
Draw new values for each unit by applying the Bayesian bootstrap within its leaf
Difficult to define the optimal tree size
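A rough sketch of how this could look for one continuous variable (Python with scikit-learn; this follows my reading of the steps above rather than the authors' implementation, and pruning is omitted). In a sequential synthesis, the already-synthesized predictors would be passed to tree.apply instead of the original ones.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_synthesize(X, y, min_leaf=5, rng=None):
    """CART-based synthesis: fit a tree on the original data, then, for each
    unit, draw a new value via a Bayesian bootstrap within its leaf."""
    rng = rng or np.random.default_rng()
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf).fit(X, y)
    leaves = tree.apply(X)                    # leaf id for every unit
    y_syn = np.empty_like(y, dtype=float)
    for leaf in np.unique(leaves):
        donors = y[leaves == leaf]            # original values in this leaf
        n = len(donors)
        # Bayesian bootstrap: Dirichlet(1, ..., 1) weights over the donors,
        # then sample new values with those probabilities.
        w = rng.dirichlet(np.ones(n))
        y_syn[leaves == leaf] = rng.choice(donors, size=n, p=w)
    return y_syn

rng = np.random.default_rng(11)
X = rng.normal(size=(500, 3))                 # hypothetical predictors
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=500)
y_synthetic = cart_synthesize(X, y, min_leaf=10, rng=rng)
```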