Estimation of the Probit Model From Anonymized Micro Data

Gerd Ronning and Martin Rosemann
Universität Tübingen & IAW Tübingen

UNECE Work Session on Statistical Data Confidentiality, Geneva, 9 – 11 November 2005

Agenda:

• The German anonymization project (see also the earlier presentation by Rainer Lenz)

• Main Results

• Estimation of the probit model from anonymized data

Overview

• German project run jointly by the German Statistical Office and the Institute for Applied Economic Research.

• German law allows the Statistical Office to provide scientific researchers with data which are only moderately anonymized.

• These data are said to satisfy “factual anonymization” (in German “faktische Anonymisierung”).

• They can be seen as scientific-use files.

• The main emphasis of the project is on data from enterprises, for which confidentiality is a more sensitive topic than for data from households.

Two objectives of anonymization

• Anonymization of data has to satisfy two opposing objectives:
  – (a) minimization of the risk of disclosure,
  – (b) minimization of the loss of data quality.

• A compromise has to be reached.

• However, factual anonymization has to be guaranteed before we may consider the quality of these data.

• Alternative strategies may be possible and some may lead to a smaller loss of data quality.

The business micro data used in the project

• Kostenstrukturerhebung im Verarbeitenden Gewerbe und Bergbau (1999) (cost structure),

• Umsatzsteuerstatistik (2000) (value added tax)

• Einzelhandelsstatistik (1999) (retail business)

Only partly related:

• IAB-Betriebspanel (IAB panel of firms)

Different masking procedures

• We compare different masking procedures, in particular microaggregation and “addition” of noise (also in a multiplicative manner).

• For discrete variables we consider masking by post-randomization.

• Other masking procedures, in particular data swapping, have been found to imply too much distortion with respect to data quality.

„Corrected“ estimation under anonymization

• We also consider the possibility of correcting the estimation procedures in linear and nonlinear models in such a way that consistent (unbiased) estimators are derived.

• Examples given below.

Two different strategies of anonymization

• „Information reducing procedures“
  – Reduction with respect to observational units
  – Reduction with respect to certain variables
  – Reduction or coarsening of possible outcomes

• „Data modifying procedures“
  – Microaggregation
  – Noise addition
  – Post randomization (PRAM)

Emphasis of project on „data modifying procedures“

• ....employing, however, some „information reducing procedures“ at the outset.

• For example, regional information was deleted with exception of „west“ and „east“ of Germany.

• „data modifying procedures“ have the advantage that impact on estimation of stochastic models can be formally analyzed.

• For example

Examples of „corrected“ estimation

• For example, in the linear regression model microaggregation can easily be handled by specifying an adequate covariance structure.

• In the case of noise addition we have ‘errors in variables’, which call for instrumental variable estimation.

• Alternatively we may use the “SIMEX” approach.

• Post-randomization of a binary dependent variable leads to a generalization of the probit model which allows consistent estimation.

Problems of „Corrected“ estimation

• Effect of anonymization of a variable depends ...

• ... on the procedure ...

• ... and whether we use the variable as regressor or as regressand!

• For example, if we post-randomize a binary variable, it can be used as dependent variable in the probit model or as a „dummy variable“ in the linear regression model.

• The first case will be discussed below in more detail.

The probit model
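
For reference, the standard probit specification can be written as below. The notation (latent variable $y_i^*$, regressor vector $x_i$, coefficient vector $\beta$, error $\varepsilon_i$, standard normal distribution function $\Phi$) is an assumption chosen here and may differ from the notation used by the authors.

```latex
y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0,1),
\qquad
y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{otherwise} \end{cases}
\qquad\Longrightarrow\qquad
P(y_i = 1 \mid x_i) = \Phi(x_i'\beta)
```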

(Symmetric) Post randomization of binary variable
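
A minimal sketch of symmetric post-randomization, assuming that each binary value is kept with a known probability and flipped otherwise, with the same probability for both outcomes. The function name `pram_symmetric` and the parameter `keep_prob` are hypothetical names introduced for illustration.

```python
import random


def pram_symmetric(values, keep_prob, rng=None):
    """Mask a list of 0/1 values by symmetric post-randomization.

    Each value is kept with probability keep_prob and flipped
    (0 -> 1, 1 -> 0) with probability 1 - keep_prob.
    keep_prob is an assumed, publicly known masking parameter.
    """
    rng = rng or random.Random()
    return [v if rng.random() < keep_prob else 1 - v for v in values]


if __name__ == "__main__":
    original = [1, 0, 0, 1, 1, 0, 1, 0]
    masked = pram_symmetric(original, keep_prob=0.8, rng=random.Random(42))
    print(original)
    print(masked)
```

Under this scheme, if a share p of the original values equals 1, the expected share of ones after masking is keep_prob * p + (1 - keep_prob) * (1 - p), so frequencies computed from the masked data are biased unless they are corrected.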

ML Estimation of the probit model under PRAM (1)
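
As a sketch of how the likelihood changes, assume symmetric PRAM in which the true binary outcome $y_i$ is kept with known probability $\pi$ and flipped with probability $1-\pi$, and write $y_i^m$ for the masked outcome. These symbols are assumptions made here. The response probability of the masked variable and the resulting log-likelihood are then:

```latex
P(y_i^m = 1 \mid x_i)
  = \pi\,\Phi(x_i'\beta) + (1-\pi)\,\bigl(1 - \Phi(x_i'\beta)\bigr)
  = (1-\pi) + (2\pi - 1)\,\Phi(x_i'\beta)

\log L(\beta)
  = \sum_{i=1}^{n} \Bigl[\, y_i^m \log P(y_i^m = 1 \mid x_i)
      + (1 - y_i^m) \log\bigl(1 - P(y_i^m = 1 \mid x_i)\bigr) \Bigr]
```

For $\pi = 1$ this reduces to the ordinary probit likelihood. For $\pi = 1/2$ the masked outcome carries no information on $\beta$, so the parameter is not identified.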

ML Estimation of the probit model under PRAM (2)

• Consistent estimation of the probit model under PRAM is possible if the right-hand regressors are left unprotected.

• As we will see, it is also possible to estimate the probit model consistently if only the right-hand variables are protected by addition of noise.

• However, no satisfactory procedure has been found so far for the most relevant case, in which both the dependent and the independent variables have been anonymized.
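
The first point above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' implementation: one unmasked regressor, no intercept, symmetric PRAM with a known keep probability, and a crude grid search instead of a proper optimizer. All function names are hypothetical.

```python
import math
import random


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pram_loglik(beta, xs, ys_masked, keep_prob):
    """Log-likelihood of a probit model whose outcome was masked by
    symmetric PRAM; keep_prob = 1.0 gives the ordinary probit."""
    total = 0.0
    for x, y in zip(xs, ys_masked):
        phi = norm_cdf(beta * x)
        p1 = keep_prob * phi + (1.0 - keep_prob) * (1.0 - phi)
        p1 = min(max(p1, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += math.log(p1) if y == 1 else math.log(1.0 - p1)
    return total


def grid_mle(xs, ys_masked, keep_prob):
    """Maximize the log-likelihood over a grid of slopes in [0, 4]."""
    grid = [i * 0.05 for i in range(81)]
    return max(grid, key=lambda b: pram_loglik(b, xs, ys_masked, keep_prob))


def simulate(n, beta, keep_prob, seed=0):
    """Draw x ~ N(0,1), generate a probit outcome and mask it by PRAM."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if beta * x + rng.gauss(0.0, 1.0) > 0 else 0
        if rng.random() >= keep_prob:  # flip with probability 1 - keep_prob
            y = 1 - y
        xs.append(x)
        ys.append(y)
    return xs, ys


if __name__ == "__main__":
    xs, ys = simulate(n=2000, beta=1.0, keep_prob=0.8, seed=1)
    print("naive probit:    ", grid_mle(xs, ys, keep_prob=1.0))
    print("PRAM-corrected:  ", grid_mle(xs, ys, keep_prob=0.8))
```

The naive fit treats the masked outcome as if it were the true one and is pulled towards zero, while the fit that uses the PRAM-adjusted response probability should land near the slope used in the simulation.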

Addition of noise (in the linear model)

• Additive error

• Multiplicative error
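
One common way to write the two masking schemes is sketched below. The symbols ($x_i$ for the original value, $x_i^a$ for the anonymized value, $u_i$ for the noise) and the moment assumptions are chosen here for illustration and are not taken from the slides.

```latex
\text{additive:} \qquad x_i^a = x_i + u_i, \qquad E[u_i] = 0, \quad \operatorname{Var}(u_i) = \sigma_u^2

\text{multiplicative:} \qquad x_i^a = x_i \cdot u_i, \qquad E[u_i] = 1, \quad \operatorname{Var}(u_i) = \sigma_u^2
```

In both cases $u_i$ is assumed to be independent of $x_i$ and of the regression error.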

Estimation under (additive) noise of regressor

• Inconsistency of estimate:

• Estimate from SIMEX procedure (adding error on purpose):
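
A sketch of both results for the simple regression $y_i = \alpha + \beta x_i + \varepsilon_i$, assuming the regressor is observed as $x_i^a = x_i + u_i$ with $u_i$ independent of $x_i$ and $\varepsilon_i$. The symbols $\sigma_x^2$, $\sigma_u^2$ and $\lambda$ are assumptions made here. The naive least-squares slope is attenuated:

```latex
\operatorname{plim}\, \hat\beta_{\text{naive}}
  = \beta \,\frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2}
```

In the simulation step of SIMEX, further noise with variance $\lambda\sigma_u^2$ is added deliberately, so that the naive estimator converges to

```latex
\beta(\lambda) = \beta \,\frac{\sigma_x^2}{\sigma_x^2 + (1+\lambda)\,\sigma_u^2},
\qquad \lambda \ge 0
```

Setting $\lambda = -1$ formally removes the measurement error and recovers $\beta$, which motivates extrapolating the estimated curve back to that point.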

Extrapolation in the SIMEX procedure
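
The extrapolation step can be sketched for a simple linear regression as follows. This is an illustration under stated assumptions (known noise standard deviation `sigma_u`, a quadratic extrapolant, pure-Python least squares) and not the Stata procedure or the authors' code; all function names are hypothetical.

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def quadratic_fit(ts, vs):
    """Least-squares fit of v = c0 + c1*t + c2*t^2 via normal equations."""
    rows = [[1.0, t, t * t] for t in ts]
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * v for r, v in zip(rows, vs))] for i in range(3)]
    for col in range(3):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= f * m[col][k]
    c = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        c[i] = (m[i][3] - sum(m[i][j] * c[j] for j in range(i + 1, 3))) / m[i][i]
    return c


def simex_slope(x_noisy, y, sigma_u, lambdas=(0.5, 1.0, 1.5, 2.0), reps=20, seed=0):
    """SIMEX: add extra noise of variance lam * sigma_u^2, average the
    naive slopes, fit a quadratic in lam and extrapolate to lam = -1."""
    rng = random.Random(seed)
    lam_points = [0.0]
    slope_points = [ols_slope(x_noisy, y)]
    for lam in lambdas:
        extra_sd = (lam ** 0.5) * sigma_u
        slopes = []
        for _ in range(reps):
            x_sim = [xi + rng.gauss(0.0, extra_sd) for xi in x_noisy]
            slopes.append(ols_slope(x_sim, y))
        lam_points.append(lam)
        slope_points.append(sum(slopes) / reps)
    c0, c1, c2 = quadratic_fit(lam_points, slope_points)
    return c0 - c1 + c2  # value of the fitted quadratic at lam = -1


def simulate(n, beta, sigma_u, seed=1):
    """Linear model with a regressor masked by additive normal noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * xi + rng.gauss(0.0, 0.5) for xi in x]
    x_noisy = [xi + rng.gauss(0.0, sigma_u) for xi in x]
    return x_noisy, y


if __name__ == "__main__":
    x_noisy, y = simulate(n=2000, beta=2.0, sigma_u=0.5)
    print("naive slope:", ols_slope(x_noisy, y))
    print("SIMEX slope:", simex_slope(x_noisy, y, sigma_u=0.5))
```

The quadratic extrapolant is only an approximation to the true attenuation curve, so the SIMEX estimate reduces the bias of the naive slope but does not remove it completely.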

Report on recent work on estimating the probit model from anonymized micro data (1)

• ML estimation of the generalized probit model combined with the SIMEX procedure did not work satisfactorily, even in the case of no post-randomization!

• However, estimation of the generalized linear model for the special case representing the probit model gave good results for the case „noise addition but no PRAM“.

• Stata SIMEX procedure!

Report on recent work on estimating the probit model from anonymized micro data (2)

• So far we have no adequate estimation procedure for the case in which the dependent variable is masked by PRAM and, at the same time, the regressor variable(s) is (are) protected by noise addition.

• Note that we consider here a (frequently used) nonlinear model.

• However, for linear models the corrected estimation procedures seem to work fine.

• See the research report!

Concluding Remarks (Importance of the project):

• For the first time, simultaneous consideration of confidentiality issues and data quality aspects as seen from the user's side.

• For the first time, consideration of the impact of anonymization on statistical inference.

• Use of real data sets from the German Statistical Office.

• Use of modern matching algorithms in simulating scenarios for disclosure. See the earlier presentation by Rainer Lenz!

• Use of commercial databases for simulating external knowledge.

Future Research

• So far only cross-section data.

• Extension to the case of panel data.

• Multiple imputation as a masking procedure.

• A project will start very soon!