UNECE workshop on data editing and imputation, Vienna 22. April 2008

Jörg Drechsler (Institute for Employment Research, Germany) & Trivellore Raghunathan (University of Michigan). UNECE workshop on data editing and imputation, Vienna 22. April 2008. Evaluating Different Approaches for Multiple Imputation Under Linear Constraints


Page 1

Jörg Drechsler (Institute for Employment Research, Germany) & Trivellore Raghunathan (University of Michigan)

UNECE workshop on data editing and imputation, Vienna 22. April 2008

Evaluating Different Approaches for Multiple Imputation Under Linear Constraints

Page 2

Overview

The Problem

The Data

A Little Background on Multiple Imputation

The Methodology

The Simulation Design

The Results

Conclusions/Future Work

Page 3

The Problem

Some variables Y1, Y2, …, Yk have to sum up to a given total Yt

Examples

- turnover in different regions

- number of employees with different qualification levels

- investment in different subcategories

$Y_t = Y_1 + Y_2 + \dots + Y_k$

Page 5

The Data

The IAB Establishment Panel

The number of employees, with

- $Y_t$: total number of employees

- $Y_{work}$: number of blue collar + white collar workers

- $Y_{train}$: number of trainees

- $Y_{exec}$: number of executives

- $Y_{own}$: number of owners + working family members

- $Y_{marg}$: number of “marginal” workers not covered by social security

- $Y_{other}$: number of other employees

$Y_t = Y_{work} + Y_{train} + Y_{exec} + Y_{own} + Y_{marg} + Y_{other}$

Page 6

The Data

Summary Statistics

- data is heavily skewed

- most variables are semi-continuous

- low variation for the number of owners

- additional constraint: all variables >= 0

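| | Min. | 1st Quart. | Median | Mean | 3rd Quart. | Max. | nb of obs != 0 |
|---|---|---|---|---|---|---|---|
| total nb of emp | 1 | 6 | 19 | 128.6 | 79 | 22920 | 11536 |
| workers | 0 | 3 | 14 | 109.9 | 66 | 19410 | 11211 |
| trainees | 0 | 0 | 0 | 6.101 | 3 | 1552 | 5232 |
| executives | 0 | 0 | 0 | 6.124 | 0 | 6323 | 729 |
| owners | 0 | 0 | 0 | 0.6667 | 1 | 21 | 5735 |
| marginal workers | 0 | 0 | 1 | 5.413 | 3 | 2492 | 5772 |
| others | 0 | 0 | 0 | 0.4577 | 0 | 566 | 555 |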

Page 8

A Little Background on Multiple Imputation

Generate random draws from $P(y_{mis} \mid y_{obs})$

$P(y_{mis} \mid y_{obs}) = \int P(y_{mis}, \theta \mid y_{obs}) \, d\theta = \int P(y_{mis} \mid y_{obs}, \theta) \, P(\theta \mid y_{obs}) \, d\theta$

Imputation in two steps

1. Generate random draws for θ from its posterior distribution given the observed values, $P(\theta \mid y_{obs})$

2. Generate random draws for the missing values from the conditional predictive distribution given the drawn parameters, $P(y_{mis} \mid y_{obs}, \theta)$

Drawing from 1. can be difficult

Solution: MCMC techniques
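For very simple models both steps can be carried out directly, without MCMC. The following is a minimal sketch for a univariate normal model with the standard noninformative prior; the model, the function name `impute_normal` and the toy data are assumptions made for illustration and are not taken from the slides.

```python
import random
import statistics


def impute_normal(y_obs, n_mis, rng):
    """One imputation draw for y ~ N(mu, sigma^2), prior p(mu, sigma^2) ~ 1/sigma^2."""
    n = len(y_obs)
    ybar = statistics.fmean(y_obs)
    s2 = statistics.variance(y_obs)  # sample variance, denominator n - 1

    # Step 1: draw theta = (mu, sigma^2) from its posterior given y_obs.
    # sigma^2 | y_obs ~ (n - 1) * s2 / chi2 with n - 1 degrees of freedom
    chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
    sigma2 = (n - 1) * s2 / chi2
    # mu | sigma^2, y_obs ~ N(ybar, sigma^2 / n)
    mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)

    # Step 2: draw the missing values given the drawn parameters.
    return [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_mis)]


rng = random.Random(2008)
y_obs = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]  # hypothetical observed values
# m = 10 imputations, each filling two missing values
imputations = [impute_normal(y_obs, n_mis=2, rng=rng) for _ in range(10)]
```

Because the parameters are redrawn for every imputation, the spread across the ten imputed data sets reflects parameter uncertainty as well as residual noise.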

Page 9

Gibbs Sampling

Generate random draws from conditional univariate distributions

$P(Y_1 \mid Y_{-1}, \theta_1)$

…

$P(Y_k \mid Y_{-k}, \theta_k)$

Iteration provides draws from the joint distribution

Imputation in two steps for every univariate distribution

Imputation model can vary for different variable types

Page 11

The Methodology

Five imputation methods

- simple imputation of all variables

- independent imputation considering semi-continuity

- nested imputation of the proportions

- non-Bayesian Dirichlet imputation

- Bayesian Dirichlet/Multinomial imputation

Page 12

Simple Imputation

Impute all variables independently

Transform all continuous variables by taking the cubic root

Ignore semi-continuity

Use simple linear models

Use same models as for independent imputation under semi-continuity

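Fulfill constraints by:

- setting $Y_t = \sum_i Y_i$ if $Y_{imp,t} < \sum_i Y_{obs,i}$

- down weighting all imputed subcategories if Yt is observed or $Y_{imp,t} \geq \sum_i Y_{obs,i}$

Example (. = missing):

| | Ytotal | Y1 | Y2 | Y3 | Y4 | Y5 | Y6 |
|---|---|---|---|---|---|---|---|
| observed data | 20 | . | 5 | 3 | . | 1 | 1 |
| after imputation | 20 | 18 | 5 | 3 | 2 | 1 | 1 |
| after down weighting | 20 | 9 | 5 | 3 | 1 | 1 | 1 |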
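The correction step can be sketched as follows. This is one plausible reading of the rule on this slide, not the authors' code: the function name `enforce_total` and its arguments are hypothetical, and the condition for replacing the total is an assumption because the comparison operators are not legible in the transcript.

```python
def enforce_total(values, imputed, total, total_imputed=False):
    """Adjust imputed subcategories so that they add up to the total.

    values        -- subcategory values after imputation
    imputed       -- True where the subcategory value was imputed
    total         -- observed or imputed total
    total_imputed -- True if the total itself was imputed
    Returns the (possibly replaced) total and the adjusted values.
    """
    obs_sum = sum(v for v, m in zip(values, imputed) if not m)
    imp_sum = sum(v for v, m in zip(values, imputed) if m)

    # Assumed rule: an imputed total below the observed parts cannot be
    # reached by rescaling, so the total is set to the sum of all parts.
    if total_imputed and total < obs_sum:
        return sum(values), list(values)
    if imp_sum == 0:
        return total, list(values)  # no imputed mass to rescale

    # Rescale only the imputed parts so that everything adds up to the total.
    factor = (total - obs_sum) / imp_sum
    return total, [v * factor if m else v for v, m in zip(values, imputed)]


# The example from the slide: Y1 and Y4 were imputed as 18 and 2.
total, fixed = enforce_total(
    [18, 5, 3, 2, 1, 1], [True, False, False, True, False, False], total=20
)
print(fixed)  # [9.0, 5, 3, 1.0, 1, 1]
```

The observed subcategories sum to 10, so the imputed values 18 and 2 are scaled by (20 - 10) / 20 = 0.5.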

Page 13

Independent imputation

Impute all variables independently

Run a logit regression for all variables to address semi-continuity

Outcome: 1 if Yij>0, 0 otherwise

Run a linear regression only for the units with Yij>0 and impute only for missing units with positive outcome in the logit regression

Set all other values to 0

Depending on number of units with Yij>0 stratify for Western/Eastern Germany and two quantiles for establishment size

Use only 20 explanatory variables for number of executives and other workers, ≈ 100 variables for all other dependent variables

Use same correction methods afterwards
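The two-part idea (a logit model for zero versus positive, then a linear model for the positive values) can be sketched with a single covariate. Everything here is illustrative: the helper names, the toy data, the cube-root transformation of the positive part and the truncation at zero are assumptions, and for brevity the model parameters are fixed at their point estimates instead of being drawn from their posterior as proper multiple imputation would require.

```python
import math
import random


def fit_logit(x, z, steps=2000, lr=0.1):
    """Fit P(z = 1 | x) = 1 / (1 + exp(-(a + b * x))) by gradient ascent."""
    a = b = 0.0
    n = len(x)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for xi, zi in zip(x, z):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            grad_a += zi - p
            grad_b += (zi - p) * xi
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def fit_line(x, y):
    """Least-squares line y = c + d * x and residual standard deviation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    d = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    c = my - d * mx
    rss = sum((yi - (c + d * xi)) ** 2 for xi, yi in zip(x, y))
    return c, d, (rss / (n - 2)) ** 0.5


def impute_two_part(x_obs, y_obs, x_mis, rng):
    # Part 1: logit model for the indicator "value is positive".
    a, b = fit_logit(x_obs, [1 if y > 0 else 0 for y in y_obs])
    # Part 2: linear model on the cube-root scale, positive units only.
    pos = [(xi, yi) for xi, yi in zip(x_obs, y_obs) if yi > 0]
    c, d, sd = fit_line([p[0] for p in pos], [p[1] ** (1.0 / 3.0) for p in pos])
    imputed = []
    for xi in x_mis:
        p_pos = 1.0 / (1.0 + math.exp(-(a + b * xi)))
        if rng.random() < p_pos:
            t = rng.gauss(c + d * xi, sd)
            imputed.append(max(t, 0.0) ** 3)  # back-transform, truncated at 0
        else:
            imputed.append(0.0)  # all other values are set to 0
    return imputed


x_obs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]  # hypothetical covariate
y_obs = [0, 0, 1, 0, 3, 4, 0, 9]                  # hypothetical semi-continuous outcome
imputed = impute_two_part(x_obs, y_obs, x_mis=[1.2, 3.8], rng=random.Random(5))
```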

Page 14

Nested Imputation of Proportions

Address semi-continuity with logit-model

Calculate proportions of the total for all subcategories with positive outcome

Use a logit transformation on the proportions

Variables are distributed between ]-Inf;Inf[

Impute variables with linear models

Use almost the same models as for independent imputation under semi-continuity

Nested Imputation: after imputing number of workers define proportions as $Y_{ij} / (Y_{i,total} - Y_{i,work})$

After imputation transform variables back and multiply with totals

Use same correction methods afterwards
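The transformation and back-transformation can be sketched for a single unit. The numbers are hypothetical, and the snippet assumes the nested proportion is the subcategory divided by the total minus the number of workers; the imputation model itself is omitted.

```python
import math


def logit(p):
    """Map a proportion in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Map a real number back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical unit: 20 employees, 12 of them workers, 2 trainees.
total, workers, trainees = 20.0, 12.0, 2.0
remainder = total - workers         # what is left after the workers
p_train = trainees / remainder      # nested proportion of trainees: 0.25
z_train = logit(p_train)            # unbounded scale used by the linear model

# ... a linear imputation model would operate on z_train here ...

# Back-transformation: invert the logit and multiply with the remainder.
trainees_back = inv_logit(z_train) * remainder
```

The logit is undefined for proportions of exactly 0 or 1, which is one reason the zeros are handled separately by the logit model for semi-continuity.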

Page 15

Non Bayesian Dirichlet Distribution

Following an idea by Tempelman (2007)

Ignore semi-continuity

Calculate nested proportions again

Assume Dirichlet distribution for the proportions

Generate starting values using the EM-Algorithm for the Dirichlet Distribution

Page 16

Non Bayesian Dirichlet Distribution II

Imputation Algorithm (Data Augmentation):

- draw new values for $\alpha \mid Y_{obs}, Y_{mis}$ from $\alpha \mid Y_{obs}, Y_{mis} \sim N(\hat{\alpha}, V(\hat{\alpha}))$, with $\hat{\alpha}$ obtained by Maximum-Likelihood-Estimation

- draw new values for $Y^*_{mis} \mid Y_{obs}, \alpha \sim Dir_{m_i}(\alpha_{mis})$, with $m_i$ the number of observations to impute for unit i

- Calculate $Y_{i,mis} = (1 - \sum_j Y_{i,obs_j}) \, Y^*_{i,mis}$

Not fully Bayesian since the distribution of $\alpha \mid Y_{obs}, Y_{mis}$ is only approximated

Use same correction methods afterwards

Page 17

Bayesian Dirichlet/Multinomial Imputation

Generate starting values using the simple imputation approach

For each unit generate a random draw $p \sim Dir(\alpha)$ from the Dirichlet distribution with $\alpha = (Y_{i,work}, Y_{i,train}, Y_{i,exec}, Y_{i,own}, Y_{i,marg}, Y_{i,other})$

For each unit generate a random draw from a multinomial distribution with $size = Y_{total} - \sum_j Y_{obs,j}$ and $prob = p^*_{mis}$

$p^*_{mis}$: weighted vector p for missing obs, $\sum p^*_{mis} = 1$

Use same correction methods afterwards
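One iteration of the Dirichlet/multinomial step for a single unit might look like the sketch below. The function names and the example unit are hypothetical, and replacing zero counts by a small constant `eps` is an assumption of this sketch (Dirichlet parameters must be positive), not something stated on the slide.

```python
import random


def draw_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalising independent gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]


def impute_unit(values, missing, total, rng, eps=0.5):
    """values: current counts (starting values where missing is True)."""
    # Dirichlet draw with the current counts as parameters.
    alpha = [v if v > 0 else eps for v in values]
    p = draw_dirichlet(alpha, rng)

    # Keep the probabilities of the missing categories, rescaled to sum to 1.
    mis_idx = [j for j, m in enumerate(missing) if m]
    mass = sum(p[j] for j in mis_idx)
    p_star = [p[j] / mass for j in mis_idx]

    # Multinomial draw: distribute what the observed categories leave over.
    size = total - sum(v for v, m in zip(values, missing) if not m)
    out = [0 if m else v for v, m in zip(values, missing)]
    for j in rng.choices(mis_idx, weights=p_star, k=size):
        out[j] += 1
    return out


rng = random.Random(17)
current = [9, 5, 3, 1, 1, 1]                        # Y1 and Y4 hold starting values
missing = [True, False, False, True, False, False]
new = impute_unit(current, missing, total=20, rng=rng)
```

By construction the imputed counts are non-negative integers and add up to the total together with the observed categories.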

Page 19

The Simulation Design

Use fully observed survey data (n=11536)

Generate a random sample with replacement of size n

Generate ≈30% missings for each variable (MAR)

Impute missings with different approaches (m=10, iterations=20)

Calculate different quantities of interest

Repeat whole process of sampling and imputation 100 times

Page 20

Generating missing values

X1 expected development for the number of employees in the next five years (6 categories)

X2 number of unskilled workers

X3 industry-wide wage agreement (1=Yes)

Increase for any X leads to decrease of pmis

321 01.05.04.1 XXXY

$p_{mis} = \frac{\exp(Y)}{1 + \exp(Y)}$
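A sketch of how missingness indicators could be generated from such a logit model. The coefficients in `COEF` are hypothetical placeholders chosen only so that every slope is negative; they are not the values used in the study, and the example units are invented.

```python
import math
import random


def p_missing(x1, x2, x3, coef):
    """Missingness probability exp(Y) / (1 + exp(Y)) for a linear predictor Y."""
    b0, b1, b2, b3 = coef
    y = b0 + b1 * x1 + b2 * x2 + b3 * x3
    # Numerically stable evaluation of the logistic function.
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


# Hypothetical coefficients: negative slopes, so an increase in any X
# leads to a decrease of the missingness probability.
COEF = (0.5, -0.3, -0.01, -0.5)

rng = random.Random(3)
units = [(1, 0, 0), (3, 10, 1), (6, 200, 1)]   # hypothetical (X1, X2, X3)
probs = [p_missing(*u, COEF) for u in units]
is_missing = [rng.random() < p for p in probs]  # True means the value is deleted
```

Because the probability depends only on the fully observed X variables, the resulting mechanism is missing at random (MAR).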

Page 21

Quality measures

For all estimates of interest:

Compute the estimate from the original survey, $\theta_{org}$

Compute the average estimate across the 100 samples, $E(\hat{\theta}_{sample})$

Compute the average estimate across the 100 imputed samples, $E(\hat{\theta}_{imp})$

Compute the 95% coverage rate for the fully observed samples and the imputed samples

Compute $var(\hat{\theta}_{sample}) / var(\hat{\theta}_{org})$

Compute $var(\hat{\theta}_{imp}) / var(\hat{\theta}_{org})$

Compute the average confidence interval overlap for the fully observed sample and the imputed sample
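As an illustration of the coverage rate, the sketch below counts how often a confidence interval contains the estimate from the original survey. Treating the original-survey estimate as the reference value is an assumption of this sketch; the intervals are invented, and the study uses 100 replications rather than four.

```python
def coverage_rate(intervals, theta_org):
    """Share of confidence intervals (low, high) that contain theta_org."""
    hits = sum(1 for low, high in intervals if low <= theta_org <= high)
    return hits / len(intervals)


# Hypothetical 95% intervals from four simulation runs.
intervals = [(0.8, 1.3), (0.9, 1.4), (1.1, 1.6), (0.7, 1.2)]
print(coverage_rate(intervals, theta_org=1.0))  # 0.75
```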

Page 22

Confidence interval overlap

Suggested by Karr et al. (2006)

Measure the overlap of CIs from the original data and CIs from the imputed data

The higher the overlap, the higher the data utility

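Compute the average relative CI overlap for any k

$J_k = \frac{1}{2} \left( \frac{U_{over,k} - L_{over,k}}{U_{orig,k} - L_{orig,k}} + \frac{U_{over,k} - L_{over,k}}{U_{syn,k} - L_{syn,k}} \right)$

with $[L_{orig}, U_{orig}]$ the CI for the original data, $[L_{syn}, U_{syn}]$ the CI for the imputed data and $[L_{over}, U_{over}]$ their overlap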
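The overlap measure averages the length of the overlap relative to each of the two intervals. A minimal sketch, with the function name `ci_overlap` chosen for illustration and with the convention that disjoint intervals get an overlap of 0 (an assumption of this sketch):

```python
def ci_overlap(orig, syn):
    """Average relative overlap of two confidence intervals (lower, upper)."""
    l_orig, u_orig = orig
    l_syn, u_syn = syn
    l_over = max(l_orig, l_syn)   # lower bound of the overlap
    u_over = min(u_orig, u_syn)   # upper bound of the overlap
    if u_over <= l_over:
        return 0.0                # assumed convention for disjoint intervals
    width = u_over - l_over
    return 0.5 * (width / (u_orig - l_orig) + width / (u_syn - l_syn))


print(ci_overlap((0.0, 10.0), (5.0, 15.0)))  # 0.5
print(ci_overlap((0.0, 10.0), (0.0, 10.0)))  # 1.0
```

Identical intervals give the maximum of 1, so values close to 1 indicate that inferences from the imputed data closely match those from the original data.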

Page 23

Estimates of Interest

Mean (Yi) in the 16 German Länder

Logit regression to explain collective wage agreements by establishment size

- Use number of employees covered by social security in 6 categories (employees covered by social security = workers + trainees):

Y~emp<10+emp<50+emp<100+emp<250+emp<750+emp>750+industry.dummies

- Compare the estimates for the establishment size from the different imputation methods

Page 25

Example for the results

number of workers

| | org mean | sample mean | mis mean | imp mean | sample cov | mis cov | imp cov | sample overl | mis overl | imp overl | sample var_ratio | mis var_ratio | imp var_ratio | length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| region1 | 79.85 | 81.67 | 94.16 | 81.78 | 0.97 | 0.81 | 0.97 | 0.81 | 0.65 | 0.81 | 1.01 | 1.24 | 1.02 | 392 |
| region2 | 303.47 | 294.44 | 359.77 | 294.59 | 0.85 | 0.93 | 0.86 | 0.80 | 0.75 | 0.80 | 0.90 | 1.10 | 0.90 | 180 |
| region3 | 110.78 | 111.65 | 126.29 | 111.65 | 0.93 | 0.70 | 0.93 | 0.79 | 0.60 | 0.78 | 0.99 | 1.19 | 0.99 | 834 |
| region4 | 56.08 | 55.37 | 62.64 | 55.41 | 0.93 | 0.86 | 0.93 | 0.78 | 0.71 | 0.78 | 0.96 | 1.19 | 0.96 | 781 |
| region5 | 181.80 | 183.64 | 213.98 | 183.49 | 0.83 | 0.74 | 0.83 | 0.77 | 0.61 | 0.77 | 0.98 | 1.20 | 0.98 | 1126 |
| region6 | 122.21 | 123.29 | 141.48 | 123.31 | 0.95 | 0.76 | 0.95 | 0.79 | 0.63 | 0.79 | 1.00 | 1.19 | 1.00 | 746 |
| region7 | 83.05 | 85.24 | 96.92 | 85.30 | 0.97 | 0.80 | 0.97 | 0.80 | 0.62 | 0.80 | 1.02 | 1.21 | 1.02 | 560 |
| region8 | 158.33 | 159.85 | 187.97 | 159.87 | 0.93 | 0.89 | 0.93 | 0.79 | 0.69 | 0.79 | 0.98 | 1.20 | 0.98 | 902 |
| region9 | 217.41 | 218.12 | 256.89 | 218.06 | 0.90 | 0.89 | 0.90 | 0.80 | 0.69 | 0.80 | 0.97 | 1.16 | 0.97 | 866 |
| region10 | 71.62 | 72.53 | 91.07 | 72.58 | 0.89 | 0.87 | 0.89 | 0.80 | 0.70 | 0.80 | 0.96 | 1.21 | 0.96 | 396 |
| region11 | 109.49 | 108.11 | 124.26 | 108.03 | 0.70 | 0.88 | 0.70 | 0.79 | 0.74 | 0.79 | 0.84 | 1.01 | 0.84 | 594 |
| region12 | 50.06 | 49.64 | 53.91 | 49.65 | 0.92 | 0.91 | 0.92 | 0.78 | 0.73 | 0.78 | 0.97 | 1.16 | 0.97 | 777 |
| region13 | 60.71 | 61.61 | 66.80 | 61.59 | 0.95 | 0.90 | 0.94 | 0.80 | 0.72 | 0.80 | 1.01 | 1.20 | 1.01 | 682 |
| region14 | 86.23 | 87.16 | 97.63 | 87.32 | 0.89 | 0.89 | 0.90 | 0.77 | 0.71 | 0.77 | 0.97 | 1.23 | 0.97 | 949 |
| region15 | 73.57 | 73.75 | 79.30 | 73.87 | 0.95 | 0.87 | 0.95 | 0.80 | 0.71 | 0.80 | 0.99 | 1.17 | 0.99 | 793 |
| region16 | 60.69 | 61.23 | 68.29 | 61.30 | 0.96 | 0.93 | 0.97 | 0.81 | 0.71 | 0.81 | 0.99 | 1.25 | 0.99 | 958 |
| average | | | | | 0.91 | 0.85 | 0.91 | 0.79 | 0.69 | 0.79 | 0.97 | 1.18 | 0.97 | |

Page 26

Results Averaged Over Different Regions

workers

| | sample cov | mis cov | imp cov | sample overl | mis overl | imp overl | sample var_ratio | mis var_ratio | imp var_ratio |
|---|---|---|---|---|---|---|---|---|---|
| simple | 0.908 | 0.852 | 0.909 | 0.793 | 0.686 | 0.793 | 0.972 | 1.182 | 0.971 |
| independent | 0.895 | 0.861 | 0.893 | 0.788 | 0.691 | 0.788 | 0.963 | 1.173 | 0.966 |
| proportions | 0.906 | 0.868 | 0.899 | 0.798 | 0.696 | 0.797 | 0.969 | 1.175 | 0.969 |
| Dirichlet | 0.906 | 0.876 | 0.924 | 0.791 | 0.692 | 0.785 | 0.972 | 1.184 | 1.261 |
| Bayesian Dir. | 0.893 | 0.870 | 0.922 | 0.791 | 0.698 | 0.778 | 0.968 | 1.178 | 1.656 |

trainees

| | sample cov | mis cov | imp cov | sample overl | mis overl | imp overl | sample var_ratio | mis var_ratio | imp var_ratio |
|---|---|---|---|---|---|---|---|---|---|
| simple | 0.893 | 0.888 | 0.902 | 0.794 | 0.769 | 0.796 | 0.955 | 1.139 | 0.962 |
| independent | 0.898 | 0.892 | 0.909 | 0.793 | 0.775 | 0.794 | 0.958 | 1.138 | 1.004 |
| proportions | 0.894 | 0.900 | 0.937 | 0.798 | 0.776 | 0.797 | 0.963 | 1.143 | 1.153 |
| Dirichlet | 0.890 | 0.885 | 0.907 | 0.795 | 0.773 | 0.791 | 0.968 | 1.148 | 1.167 |
| Bayesian Dir. | 0.884 | 0.886 | 0.890 | 0.793 | 0.770 | 0.793 | 0.963 | 1.140 | 0.992 |

Page 27

Results Averaged Over Different Regions

executives

| | sample cov | mis cov | imp cov | sample overl | mis overl | imp overl | sample var_ratio | mis var_ratio | imp var_ratio |
|---|---|---|---|---|---|---|---|---|---|
| simple | 0.863 | 0.897 | 0.868 | 0.801 | 0.783 | 0.802 | 0.938 | 1.120 | 0.943 |
| independent | 0.827 | 0.861 | 0.833 | 0.786 | 0.769 | 0.786 | 0.930 | 1.101 | 0.942 |
| proportions | 0.855 | 0.889 | 0.886 | 0.795 | 0.776 | 0.798 | 0.949 | 1.127 | 1.025 |
| Dirichlet | 0.850 | 0.874 | 0.861 | 0.790 | 0.770 | 0.789 | 0.954 | 1.139 | 1.095 |
| Bayesian Dir. | 0.845 | 0.869 | 0.853 | 0.793 | 0.774 | 0.794 | 0.936 | 1.111 | 0.947 |

owners

| | sample cov | mis cov | imp cov | sample overl | mis overl | imp overl | sample var_ratio | mis var_ratio | imp var_ratio |
|---|---|---|---|---|---|---|---|---|---|
| simple | 0.937 | 0.798 | 0.945 | 0.787 | 0.671 | 0.708 | 0.996 | 1.128 | 1.694 |
| independent | 0.946 | 0.802 | 0.943 | 0.791 | 0.674 | 0.685 | 0.995 | 1.126 | 2.576 |
| proportions | 0.938 | 0.806 | 0.951 | 0.797 | 0.674 | 0.590 | 0.996 | 1.128 | 4.394 |
| Dirichlet | 0.943 | 0.778 | 0.795 | 0.791 | 0.661 | 0.519 | 0.996 | 1.126 | 3.470 |
| Bayesian Dir. | 0.949 | 0.806 | 0.982 | 0.796 | 0.673 | 0.711 | 0.998 | 1.127 | 2.505 |

Page 28

Results Averaged Over Different Regions

marginal workers

| | sample cov | mis cov | imp cov | sample overl | mis overl | imp overl | sample var_ratio | mis var_ratio | imp var_ratio |
|---|---|---|---|---|---|---|---|---|---|
| simple | 0.865 | 0.921 | 0.873 | 0.792 | 0.757 | 0.793 | 0.948 | 1.238 | 0.957 |
| independent | 0.882 | 0.928 | 0.899 | 0.797 | 0.762 | 0.797 | 0.959 | 1.250 | 1.025 |
| proportions | 0.876 | 0.929 | 0.916 | 0.802 | 0.759 | 0.793 | 0.947 | 1.237 | 1.126 |
| Dirichlet | 0.888 | 0.919 | 0.916 | 0.799 | 0.759 | 0.791 | 0.956 | 1.250 | 1.113 |
| Bayesian Dir. | 0.874 | 0.928 | 0.903 | 0.794 | 0.757 | 0.794 | 0.954 | 1.243 | 1.017 |

others

| | sample cov | mis cov | imp cov | sample overl | mis overl | imp overl | sample var_ratio | mis var_ratio | imp var_ratio |
|---|---|---|---|---|---|---|---|---|---|
| simple | 0.803 | 0.808 | 0.852 | 0.790 | 0.760 | 0.783 | 0.912 | 1.118 | 1.045 |
| independent | 0.804 | 0.811 | 0.866 | 0.793 | 0.762 | 0.777 | 0.916 | 1.118 | 1.220 |
| proportions | 0.800 | 0.822 | 0.937 | 0.790 | 0.765 | 0.746 | 0.903 | 1.101 | 2.150 |
| Dirichlet | 0.819 | 0.825 | 0.905 | 0.793 | 0.763 | 0.735 | 0.928 | 1.139 | 1.560 |
| Bayesian Dir. | 0.799 | 0.810 | 0.873 | 0.789 | 0.761 | 0.775 | 0.913 | 1.102 | 1.201 |

Page 29

Average absolute deviation

| | simple | independent | proportions | Dirichlet | Bayesian Dirichlet |
|---|---|---|---|---|---|
| employees total | 0.344 | 0.381 | 0.340 | 4.877 | 8.045 |
| workers | 0.219 | 0.349 | 0.836 | 4.172 | 7.688 |
| trainees | 0.073 | 0.130 | 0.328 | 0.298 | 0.135 |
| executives | 0.050 | 0.078 | 0.267 | 0.186 | 0.070 |
| owners | 0.043 | 0.069 | 0.160 | 0.143 | 0.056 |
| marginal workers | 0.079 | 0.133 | 0.234 | 0.283 | 0.220 |
| others | 0.048 | 0.070 | 0.151 | 0.131 | 0.060 |

Page 30

Results for the regression

simple

| | org mean | sample mean | mis mean | imp mean | sample cov | mis cov | imp cov | sample overl | mis overl | imp overl | sample var_ratio | mis var_ratio | imp var_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | -1.09 | -1.09 | -0.79 | -1.10 | 0.94 | 0.52 | 0.93 | 0.78 | 0.47 | 0.79 | 1.01 | 1.24 | 1.01 |
| 10<x<=50 | 0.84 | 0.83 | 0.91 | 0.84 | 0.92 | 0.83 | 0.92 | 0.80 | 0.65 | 0.81 | 1.00 | 1.25 | 1.02 |
| 50<x<100 | 1.29 | 1.29 | 1.41 | 1.27 | 0.94 | 0.77 | 0.96 | 0.80 | 0.61 | 0.80 | 1.00 | 1.33 | 1.02 |
| 100<x<=250 | 1.81 | 1.81 | 1.89 | 1.81 | 0.97 | 0.86 | 0.96 | 0.79 | 0.70 | 0.79 | 1.00 | 1.30 | 1.02 |
| 250<x<=750 | 2.35 | 2.36 | 2.32 | 2.35 | 0.92 | 0.90 | 0.93 | 0.77 | 0.76 | 0.78 | 1.00 | 1.26 | 1.01 |
| >750 emp. | 3.86 | 3.93 | 3.81 | 3.93 | 0.98 | 0.93 | 0.97 | 0.80 | 0.76 | 0.80 | 1.03 | 1.21 | 1.04 |

independent

| | org mean | sample mean | mis mean | imp mean | sample cov | mis cov | imp cov | sample overl | mis overl | imp overl | sample var_ratio | mis var_ratio | imp var_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | -1.09 | -1.08 | -0.78 | -1.08 | 0.96 | 0.52 | 0.96 | 0.79 | 0.45 | 0.79 | 1.01 | 1.24 | 1.01 |
| 10<x<=50 | 0.84 | 0.84 | 0.92 | 0.85 | 0.88 | 0.74 | 0.93 | 0.76 | 0.61 | 0.76 | 1.00 | 1.26 | 1.02 |
| 50<x<100 | 1.29 | 1.30 | 1.41 | 1.30 | 0.96 | 0.81 | 0.96 | 0.82 | 0.62 | 0.82 | 1.00 | 1.32 | 1.03 |
| 100<x<=250 | 1.81 | 1.82 | 1.89 | 1.82 | 0.95 | 0.86 | 0.95 | 0.80 | 0.69 | 0.81 | 1.00 | 1.30 | 1.02 |
| 250<x<=750 | 2.35 | 2.35 | 2.31 | 2.35 | 0.98 | 0.95 | 0.98 | 0.80 | 0.76 | 0.81 | 1.00 | 1.25 | 1.01 |
| >750 emp. | 3.86 | 3.91 | 3.76 | 3.91 | 0.95 | 0.89 | 0.97 | 0.77 | 0.71 | 0.78 | 1.03 | 1.18 | 1.03 |

proportions

| | org mean | sample mean | mis mean | imp mean | sample cov | mis cov | imp cov | sample overl | mis overl | imp overl | sample var_ratio | mis var_ratio | imp var_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | -1.09 | -1.08 | -0.78 | -1.08 | 0.91 | 0.51 | 0.91 | 0.77 | 0.44 | 0.77 | 1.00 | 1.24 | 1.00 |
| 10<x<=50 | 0.84 | 0.83 | 0.91 | 0.85 | 0.97 | 0.81 | 0.96 | 0.82 | 0.63 | 0.81 | 1.00 | 1.25 | 1.03 |
| 50<x<100 | 1.29 | 1.29 | 1.40 | 1.29 | 0.97 | 0.74 | 0.97 | 0.80 | 0.62 | 0.81 | 1.00 | 1.32 | 1.03 |
| 100<x<=250 | 1.81 | 1.81 | 1.88 | 1.82 | 0.91 | 0.90 | 0.96 | 0.79 | 0.71 | 0.79 | 1.00 | 1.30 | 1.03 |
| 250<x<=750 | 2.35 | 2.34 | 2.30 | 2.36 | 0.94 | 0.93 | 0.94 | 0.81 | 0.75 | 0.79 | 1.00 | 1.25 | 1.02 |
| >750 emp. | 3.86 | 3.94 | 3.85 | 3.95 | 0.96 | 0.93 | 0.96 | 0.80 | 0.76 | 0.80 | 1.04 | 1.24 | 1.07 |

Page 31

Results for the regression II

Bayesian Dirichlet

| | org mean | sample mean | mis mean | imp mean | sample cov | mis cov | imp cov | sample overl | mis overl | imp overl | sample var_ratio | mis var_ratio | imp var_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | -1.09 | -1.07 | -0.76 | -1.08 | 0.92 | 0.48 | 0.93 | 0.77 | 0.43 | 0.78 | 1.00 | 1.24 | 1.00 |
| 10<x<=50 | 0.84 | 0.84 | 0.92 | 0.85 | 0.96 | 0.74 | 0.96 | 0.79 | 0.60 | 0.78 | 1.00 | 1.25 | 1.02 |
| 50<x<100 | 1.29 | 1.28 | 1.39 | 1.26 | 0.95 | 0.82 | 0.94 | 0.80 | 0.65 | 0.80 | 1.00 | 1.33 | 1.03 |
| 100<x<=250 | 1.81 | 1.82 | 1.88 | 1.78 | 0.94 | 0.91 | 0.95 | 0.81 | 0.72 | 0.79 | 1.00 | 1.30 | 1.02 |
| 250<x<=750 | 2.35 | 2.35 | 2.32 | 2.31 | 0.94 | 0.92 | 0.89 | 0.79 | 0.75 | 0.76 | 1.00 | 1.26 | 1.01 |
| >750 emp. | 3.86 | 3.94 | 3.82 | 3.62 | 0.94 | 0.94 | 0.80 | 0.76 | 0.75 | 0.70 | 1.05 | 1.22 | 0.99 |

Dirichlet

| | org mean | sample mean | mis mean | imp mean | sample cov | mis cov | imp cov | sample overl | mis overl | imp overl | sample var_ratio | mis var_ratio | imp var_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | -1.09 | -1.08 | -0.79 | -1.08 | 0.95 | 0.58 | 0.96 | 0.80 | 0.48 | 0.80 | 1.00 | 1.24 | 1.00 |
| 10<x<=50 | 0.84 | 0.83 | 0.92 | 0.84 | 0.98 | 0.81 | 0.98 | 0.82 | 0.62 | 0.83 | 1.00 | 1.25 | 1.03 |
| 50<x<100 | 1.29 | 1.29 | 1.42 | 1.27 | 0.94 | 0.79 | 0.95 | 0.80 | 0.59 | 0.79 | 1.00 | 1.32 | 1.02 |
| 100<x<=250 | 1.81 | 1.81 | 1.89 | 1.77 | 0.95 | 0.88 | 0.95 | 0.80 | 0.69 | 0.79 | 1.00 | 1.30 | 1.01 |
| 250<x<=750 | 2.35 | 2.35 | 2.30 | 2.31 | 0.94 | 0.96 | 0.96 | 0.80 | 0.76 | 0.79 | 1.00 | 1.25 | 1.01 |
| >750 emp. | 3.86 | 3.88 | 3.77 | 3.67 | 0.95 | 0.96 | 0.84 | 0.80 | 0.76 | 0.73 | 1.01 | 1.19 | 0.95 |

Page 33

Conclusions

All methods provide good repeated sampling properties

Differences between the approaches are relatively small

Dirichlet and proportions approaches tend to introduce more variability

Dirichlet and proportions approaches don’t work very well for owners and others

The simple approach seems to work best, with high coverage and low additional variability

Future Work

Compare the same approaches for more equally distributed subcategories

Page 34

Thank you for your attention