Jörg Drechsler (Institute for Employment Research, Germany)
NTTS 2009
Brussels, 20 February 2009
Disclosure Control in Business Data: Experiences with Multiply Imputed Synthetic Datasets for the German IAB Establishment Survey
2
Overview
Background
Multiple imputation for statistical disclosure control
Challenges for real data applications
Some preliminary results
Conclusions/Future Work
3
SDC for Business Data
Public release of business data is often considered too risky
- Skewed distributions make identification of single units easy
- Information on businesses in the public domain
- High benefits from identifying a single unit
- High probability of inclusion for large establishments
Coarsening and top-coding alone are not sufficient
Standard perturbation methods would have to be applied at a high level of perturbation
Release of high-quality data is therefore very difficult
Multiply imputed synthetic datasets as a possible solution
4
Partially synthetic datasets (Little 1993)
only potentially identifying or sensitive variables are replaced
Advantages:
- synthesis can be tailored to the records at risk
- the approach is applicable to continuous and discrete variables
- modeling tries to preserve the joint distribution of the data
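To illustrate the general idea, here is a minimal sketch in Python; it is not the model actually used for the IAB data, and the variable names are hypothetical. A sensitive continuous variable is replaced by draws from a regression model fitted on the original data, while all other variables are left unchanged.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def synthesize_column(df, target, predictors, rng):
    """Replace `target` with draws from a normal linear model fitted on the original data."""
    X = sm.add_constant(df[predictors])
    fit = sm.OLS(df[target], X).fit()
    # Draw new values from the estimated predictive distribution (parameter
    # uncertainty is ignored here; a proper implementation would also draw
    # the model parameters from their posterior).
    mu = fit.predict(X)
    synth = df.copy()
    synth[target] = rng.normal(mu, np.sqrt(fit.scale))
    return synth

rng = np.random.default_rng(1)
# Hypothetical example data: turnover is the sensitive variable to be replaced.
df = pd.DataFrame({"employees": rng.integers(1, 500, 200),
                   "turnover": rng.gamma(2.0, 1e5, 200)})
synthetic_datasets = [synthesize_column(df, "turnover", ["employees"], rng) for _ in range(10)]
```

Repeating the draw several times, as in the last line, yields the multiple synthetic copies that are later combined in the analysis.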
7
Challenges for real data applications
Missing data
Skip patterns
Logical constraints
8
Missing Data
Missing data is a common problem in surveys (more than 200 variables with missing values in our survey)
Most SDL techniques cannot deal with missing values
Imputation in two stages for synthetic data:
- multiply impute the missing values at stage one
- generate synthetic datasets for each stage-one nest at stage two
New combining rules are necessary (Reiter, 2004)
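A purely structural sketch of the two-stage scheme in Python; the functions impute_missing and synthesize are placeholders for the survey-specific imputation and synthesis routines, which are not shown here.

```python
def two_stage_synthesis(data, impute_missing, synthesize, m=5, r=10):
    """Two-stage imputation for partially synthetic data with missing values.

    `impute_missing` and `synthesize` stand in for routines that draw from
    (approximate) posterior predictive distributions.
    """
    released = []
    for i in range(m):                    # stage one: fill in the missing values
        completed = impute_missing(data)
        nest = [synthesize(completed) for _ in range(r)]  # stage two: synthesis
        released.append(nest)             # nest i holds r synthetic datasets
    return released                       # m nests of r datasets each
```

The released data then consist of m nests of r datasets each, and point and variance estimates are combined across them with the rules derived in Reiter (2004).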
9
Skip patterns
Joint modeling very difficult for datasets with skip patterns and different types of variables
Imputation by sequential regression (Raghunathan et al., 2001)
- linear models for continuous variables
- logit models for binary variables
- multinomial models for categorical variables
For skip patterns: use a logit model to decide whether the filtered questions are applicable, and impute values only for records with a positive outcome from the logit model
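As a rough illustration in Python: the filter variable exports_any and the follow-up export_share are hypothetical examples, not variables from the IAB questionnaire, and parameter-uncertainty draws are omitted.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
# Hypothetical data: `exports_any` is the filter, `export_share` the follow-up question.
df = pd.DataFrame({"employees": rng.integers(1, 500, 300)})
df["exports_any"] = (rng.random(300) < 0.4).astype(int)
df["export_share"] = np.where(df["exports_any"] == 1, rng.beta(2, 5, 300), 0.0)

X = sm.add_constant(np.log(df["employees"]))
# Step 1: logit model decides whether the filtered question applies.
p = sm.Logit(df["exports_any"], X).fit(disp=0).predict(X)
filter_syn = (rng.random(len(df)) < p).astype(int)

# Step 2: linear model for the follow-up, fitted on applicable records only.
mask = df["exports_any"] == 1
fit = sm.OLS(df.loc[mask, "export_share"], X[mask]).fit()
value_syn = rng.normal(fit.predict(X), np.sqrt(fit.scale))
# Impute follow-up values only where the synthetic filter outcome is positive.
df["export_share_syn"] = np.where(filter_syn == 1, value_syn, 0.0)
```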
10
Logical constraints
All continuous variables > 0: redraw from the model for units with negative values until the restriction is fulfilled
- only feasible if the truncation point is in the far tail of the distribution; otherwise, refine the model
Y1 > Y2, e.g. total number of employees > number of part-time employees:
- compute x = Y2/Y1 and z = logit(x)
- use a standard linear model on the transformed variable
- back-transform the imputed values to get the final values
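A minimal sketch of the ratio/logit transformation in Python; the variables y1 and y2 are hypothetical, and parameter uncertainty and rounding back to integer counts are ignored. Because the imputation happens on the logit scale, the back-transformed values automatically satisfy 0 < Y2 < Y1.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.special import logit, expit

rng = np.random.default_rng(3)
# Hypothetical data: total employees (y1) and part-time employees (y2), with y2 < y1.
y1 = rng.integers(2, 1000, 500)
y2 = np.minimum(rng.binomial(y1, 0.3), y1 - 1)   # ensure 0 <= y2 < y1
df = pd.DataFrame({"y1": y1, "y2": y2})

eps = 1e-6
x = np.clip(df["y2"] / df["y1"], eps, 1 - eps)   # ratio in (0, 1)
z = logit(x)                                     # transform to the real line

X = sm.add_constant(np.log(df["y1"]))
fit = sm.OLS(z, X).fit()                         # standard linear model on the logit scale
z_syn = rng.normal(fit.predict(X), np.sqrt(fit.scale))
df["y2_syn"] = expit(z_syn) * df["y1"]           # back-transform: 0 < y2_syn < y1 by construction
```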
11
The IAB Establishment Panel
Annually conducted establishment survey
Since 1993 in Western Germany, since 1996 in Eastern Germany
Population: All establishments with at least one employee covered by social security
Source: Official Employment Statistics
Sample of more than 16,000 establishments in the last wave
Contents: employment structure, changes in employment, investment, training, remuneration, working hours,
collective wage agreements, works councils
12
Synthesis of the IAB Establishment Panel
We synthesize only the 2007 wave
Missing values are imputed for all variables
Roughly 25 variables are synthesized
Combination of key variables and sensitive variables
Key variables: region, industry code, personnel structure,…
Sensitive variables: turnover, investments,…
For data quality evaluation, we only look at the synthesis step
Number of imputations for the synthesis: r=10
13
Confidence interval overlap
Suggested by Karr et al. (2006)
Measure the overlap of CIs from the original data and CIs from the synthetic data
The higher the overlap, the higher the data utility
Compute the average relative CI overlap for any k
J_k = 1/2 * [ (U_{k,over} - L_{k,over}) / (U_{k,orig} - L_{k,orig}) + (U_{k,over} - L_{k,over}) / (U_{k,syn} - L_{k,syn}) ]
where [L_{k,orig}, U_{k,orig}] is the CI for the original data, [L_{k,syn}, U_{k,syn}] is the CI for the synthetic data, and [L_{k,over}, U_{k,over}] is their overlap (intersection)
(Figure: the original and synthetic confidence intervals plotted on a common axis, with the overlap interval marked)
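For a single estimand the overlap measure can be computed directly from the two interval endpoints; a minimal Python sketch:

```python
def ci_overlap(l_orig, u_orig, l_syn, u_syn):
    """Average relative CI overlap J_k (Karr et al., 2006) for one estimand."""
    l_over = max(l_orig, l_syn)          # lower end of the intersection
    u_over = min(u_orig, u_syn)          # upper end of the intersection
    if u_over <= l_over:                 # intervals do not overlap
        return 0.0
    return 0.5 * ((u_over - l_over) / (u_orig - l_orig)
                  + (u_over - l_over) / (u_syn - l_syn))

# Example: original CI (0.40, 0.60), synthetic CI (0.45, 0.70) -> overlap of about 0.675
print(ci_overlap(0.40, 0.60, 0.45, 0.70))
```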
14
Two regression results
Regressions suggested by colleagues at the IAB
First regression:
- dependent variable: part-time yes/no
- probit regression on 19 explanatory variables + industry dummies
Second regression:
- Dependent variable: expected employment trend (decrease, no change, increase)
- ordered probit on 38 variables + industry dummies
Both regressions are computed separately for West and East Germany
15
Regression results for West Germany
Average CI overlap: 0.89

Variable | beta orig. | beta syn. | CI overlap (J_k) | z-score orig. | z-score syn.
Intercept | -0.800 | -0.777 | 0.95 | -7.18 | -6.72
5-10 employees | 0.448 | 0.449 | 0.95 | 8.63 | 7.78
10-20 employees | 0.666 | 0.593 | 0.68 | 11.16 | 10.48
20-50 employees | 0.806 | 0.754 | 0.79 | 13.30 | 12.16
100-200 employees | 0.904 | 0.874 | 0.92 | 9.37 | 8.87
200-500 employees | 1.134 | 1.092 | 0.91 | 10.02 | 9.49
>500 employees | 1.675 | 1.580 | 0.88 | 8.28 | 8.00
growth in employment exp. | 0.002 | -0.003 | 0.97 | 0.05 | -0.05
decrease in emp. expected | 0.092 | 0.114 | 0.93 | 1.17 | 1.45
share of female workers | 1.453 | 1.378 | 0.76 | 17.79 | 19.22
share of employees with university degree | 0.314 | 0.372 | 0.90 | 2.14 | 2.71
share of low qualified workers | 1.105 | 1.179 | 0.80 | 12.12 | 12.53
share of temporary employees | -0.310 | -0.139 | 0.75 | -1.65 | -1.12
share of agency workers | -0.492 | -0.514 | 0.96 | -3.11 | -3.24
employment in the last 6 months | 0.388 | 0.370 | 0.90 | 8.21 | 7.86
dismissal in the last 6 months | 0.290 | 0.267 | 0.87 | 6.31 | 5.83
foreign ownership | -0.115 | -0.118 | 0.99 | -1.35 | -1.41
good or very good profitability | 0.034 | 0.034 | 0.99 | 0.86 | 0.85
salary above collective wage agreement | 0.007 | 0.010 | 0.99 | 0.12 | 0.18
collective wage agreement | 0.020 | 0.023 | 0.99 | 0.39 | 0.45
16
Regression results for East Germany
Average CI overlap: 0.92

Variable | beta orig. | beta syn. | CI overlap (J_k) | z-score orig. | z-score syn.
Intercept | -0.725 | -0.675 | 0.89 | -6.53 | -6.02
5-10 employees | 0.272 | 0.277 | 0.94 | 4.93 | 4.44
10-20 employees | 0.422 | 0.378 | 0.81 | 7.04 | 6.55
20-50 employees | 0.554 | 0.537 | 0.93 | 9.42 | 8.87
100-200 employees | 0.780 | 0.812 | 0.91 | 8.29 | 8.50
200-500 employees | 0.979 | 0.994 | 0.97 | 8.31 | 8.37
>500 employees | 1.412 | 1.410 | 0.99 | 5.72 | 5.64
growth in employment exp. | -0.034 | -0.040 | 0.97 | -0.62 | -0.73
decrease in emp. expected | 0.040 | 0.042 | 0.99 | 0.51 | 0.54
share of female workers | 1.010 | 1.062 | 0.83 | 12.69 | 15.18
share of employees with university degree | 0.208 | 0.164 | 0.90 | 1.75 | 1.46
share of low qualified workers | 0.970 | 1.057 | 0.81 | 8.39 | 9.05
share of temporary employees | -0.072 | -0.002 | 0.78 | -0.46 | -0.02
share of agency workers | -0.288 | -0.243 | 0.93 | -1.67 | -1.42
employment in the last 6 months | 0.230 | 0.206 | 0.87 | 4.96 | 4.47
dismissal in the last 6 months | 0.300 | 0.296 | 0.98 | 6.43 | 6.36
foreign ownership | -0.166 | -0.178 | 0.97 | -1.73 | -1.87
good or very good profitability | 0.098 | 0.100 | 0.99 | 2.40 | 2.45
salary above collective wage agreement | 0.092 | 0.092 | 1.00 | 1.19 | 1.19
collective wage agreement | 0.097 | 0.072 | 0.87 | 1.88 | 1.42
17
Results for the second regression
Average CI overlap: 0.90
Minimum CI overlap: 0.58
18
Conclusions
Generating synthetic datasets is difficult and labour-intensive
Synthetic datasets can handle many real data problems
Synthetic datasets seem to provide high data quality for our establishment survey
Future Work
More data quality evaluations are necessary
Remaining disclosure risk needs to be quantified (Drechsler & Reiter, 2008)
Long-term goal: release complete longitudinal data
19
Thank you for your attention
20
Categorical Variables with a low number of observations
Standard approach: Multinomial/Dirichlet model
Covariates can only be incorporated indirectly by applying the model separately to different subgroups of the data (see the sketch below)
Provides good results for subgroups only if original dataset is large
Small datasets do not provide enough observations to build models for different subgroups
Alternative: CART models
Suggested by Reiter (2005)
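A minimal sketch of the standard multinomial/Dirichlet approach mentioned above (Python; the column names are hypothetical): within each subgroup, category probabilities are drawn from the Dirichlet posterior implied by a uniform prior and the observed counts, and new categories are sampled from them.

```python
import numpy as np
import pandas as pd

def synthesize_categorical(df, target, group_vars, rng, alpha=1.0):
    """Multinomial/Dirichlet synthesis of one categorical variable, applied
    separately within the subgroups defined by `group_vars`."""
    categories = df[target].unique()
    synth = df[target].copy()
    for _, idx in df.groupby(group_vars).groups.items():
        counts = df.loc[idx, target].value_counts().reindex(categories, fill_value=0)
        probs = rng.dirichlet(alpha + counts.to_numpy())  # posterior draw of category probabilities
        synth.loc[idx] = rng.choice(categories, size=len(idx), p=probs)
    return synth

rng = np.random.default_rng(5)
# Hypothetical example: legal form synthesized separately by region.
df = pd.DataFrame({"region": rng.choice(["west", "east"], 400),
                   "legal_form": rng.choice(["ltd", "plc", "sole"], 400)})
df["legal_form_syn"] = synthesize_categorical(df, "legal_form", ["region"], rng)
```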
21
CART Models
Flexible tool for estimating the conditional distribution of a univariate outcome given multivariate predictors
Partition the predictor space to form subsets with homogeneous outcomes
Partitions found by recursive binary splits of the predictors
(Figure: example tree built by recursive binary splits, with a root split X1 < 3, a further split X2 < 5, and leaves L1, L2, L3)
22
CART models for synthesis
Grow a tree using the original data
Define the minimum number of records in each leaf
Prune the tree if necessary
Use the partially synthesized data to locate the leaf for each unit
Draw new values for each unit by applying the Bayesian bootstrap within its leaf
Difficult to define the optimal tree size
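A rough sketch of how this could look for one continuous variable (Python with scikit-learn; this follows my reading of the steps above rather than the authors' implementation, and pruning is omitted). In a sequential synthesis, the already-synthesized predictors would be passed to tree.apply instead of the original ones.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_synthesize(X, y, min_leaf=5, rng=None):
    """CART-based synthesis: fit a tree on the original data, then, for each
    unit, draw a new value via a Bayesian bootstrap within its leaf."""
    rng = rng or np.random.default_rng()
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf).fit(X, y)
    leaves = tree.apply(X)                    # leaf id for every unit
    y_syn = np.empty_like(y, dtype=float)
    for leaf in np.unique(leaves):
        donors = y[leaves == leaf]            # original values in this leaf
        n = len(donors)
        # Bayesian bootstrap: Dirichlet(1, ..., 1) weights over the donors,
        # then sample new values with those probabilities.
        w = rng.dirichlet(np.ones(n))
        y_syn[leaves == leaf] = rng.choice(donors, size=n, p=w)
    return y_syn

rng = np.random.default_rng(11)
X = rng.normal(size=(500, 3))                 # hypothetical predictors
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=500)
y_synthetic = cart_synthesize(X, y, min_leaf=10, rng=rng)
```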