Estimation of the Probit Model From Anonymized Micro Data
Gerd Ronning and Martin Rosemann
Universität Tübingen & IAW Tübingen
UNECE Work Session on Statistical Data Confidentiality, Geneva,
9 – 11 November 2005
Agenda:
• The German anonymization project (see also the earlier presentation by Rainer Lenz)
• Main results
• Estimation of the probit model from anonymized data
Overview
• German project run jointly by the German Statistical Office and the Institute for Applied Economic Research.
• German law allows the Statistical Office to provide scientific researchers with data that are only moderately anonymized.
• These data are said to satisfy “factual anonymization” (in German, “faktische Anonymisierung”).
• They can be seen as scientific-use files.
• The main emphasis of the project is on data from enterprises, for which confidentiality is a more sensitive topic than for data from households.
Two objectives of anonymization
• Anonymization of data has to satisfy two opposing objectives:
 – (a) minimization of the risk of disclosure,
 – (b) minimization of the loss of data quality.
• A compromise has to be reached.
• However, factual anonymization has to be guaranteed before we may consider the quality of these data.
• Alternative strategies may be possible, and some may lead to a smaller loss of data quality.
The business micro data used in the project
• Kostenstrukturerhebung im Verarbeitenden Gewerbe und Bergbau (1999) (cost structure)
• Umsatzsteuerstatistik (2000) (value added tax)
• Einzelhandelsstatistik (1999) (retail business)
Only partly related:
• IAB-Betriebspanel (IAB panel of firms)
Different masking procedures
• We compare different masking procedures, in particular microaggregation and “addition” of noise (also in a multiplicative manner).
• For discrete variables we consider masking by post-randomization.
• Other masking procedures, in particular data swapping, have been found to imply too much distortion with respect to data quality.
“Corrected” estimation under anonymization
• We also consider the possibility of correcting the estimation procedures in linear and nonlinear models in such a way that consistent (unbiased) estimators are derived.
• Examples given below.
Two different strategies of anonymization
• “Information reducing procedures”
 – Reduction with respect to observational units
 – Reduction with respect to certain variables
 – Reduction or coarsening of possible outcomes
• “Data modifying procedures”
 – Microaggregation
 – Noise addition
 – Post-randomization (PRAM)
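The three data modifying procedures can be illustrated with a minimal Python sketch. The function names, the group size, and the noise parameters are assumptions made for illustration only; they are not the settings used in the project, and the microaggregation shown is the simplest univariate variant (sort, group, replace by the group mean).

```python
import numpy as np

def microaggregate(x, group_size=3):
    """Replace each value by the mean of its group of neighbours in sorted order.

    Groups are formed so that each contains at least `group_size` values
    (illustrative univariate variant, not the project's procedure).
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    n_groups = max(len(x) // group_size, 1)
    out = np.empty_like(x)
    for idx in np.array_split(order, n_groups):
        out[idx] = x[idx].mean()
    return out

def add_noise(x, sd, rng, multiplicative=False):
    """Mask x by additive noise (x + u) or multiplicative noise (x * (1 + u))."""
    x = np.asarray(x, dtype=float)
    u = rng.normal(0.0, sd, size=x.shape)
    return x * (1.0 + u) if multiplicative else x + u

def pram_binary(y, keep_prob, rng):
    """Symmetric post-randomization: keep each 0/1 value with probability keep_prob."""
    y = np.asarray(y, dtype=int)
    flip = rng.random(y.shape) >= keep_prob
    return np.where(flip, 1 - y, y)

rng = np.random.default_rng(0)
turnover = np.array([3.0, 1.0, 2.0, 30.0, 10.0, 20.0])   # hypothetical values
exporter = np.array([0, 1, 1, 0, 1, 0])                  # hypothetical binary variable

print(microaggregate(turnover, group_size=3))
print(add_noise(turnover, sd=0.1, rng=rng, multiplicative=True))
print(pram_binary(exporter, keep_prob=0.8, rng=rng))
```

Microaggregation preserves the overall mean of the variable, while noise addition and PRAM preserve it only in expectation or not at all, which is one reason why the procedures affect estimation differently.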
Emphasis of project on “data modifying procedures”
• ... employing, however, some “information reducing procedures” at the outset.
• For example, regional information was deleted, with the exception of the distinction between “west” and “east” Germany.
• “Data modifying procedures” have the advantage that their impact on the estimation of stochastic models can be formally analyzed.
• For example
Examples of “corrected” estimation
• For example, in the linear regression model microaggregation can easily be handled by specifying an adequate covariance structure.
• In the case of noise addition we have “errors in variables”, which call for instrumental variable estimation.
• Alternatively, we may use the “SIMEX” approach.
• Post-randomization of a binary dependent variable leads to a generalization of the probit model which allows consistent estimation.
Problems of “corrected” estimation
• The effect of anonymizing a variable depends ...
• ... on the procedure ...
• ... and on whether we use the variable as regressor or as regressand!
• For example, if we post-randomize a binary variable, it can be used as the dependent variable in the probit model or as a “dummy variable” in the linear regression model.
• The first case is discussed below in more detail.
The probit model
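The formulas of this slide are not part of the transcript. For orientation, the textbook formulation of the probit model is sketched below; the symbols ($y_i^*$, $x_i$, $\beta$, $\varepsilon_i$) are assumed here and need not match the notation on the slide.

```latex
\begin{align*}
y_i^* &= x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, 1), \\
y_i &= \begin{cases} 1 & \text{if } y_i^* > 0, \\ 0 & \text{otherwise,} \end{cases} \\
P(y_i = 1 \mid x_i) &= \Phi(x_i'\beta).
\end{align*}
```

Here $\Phi$ denotes the standard normal distribution function. The error variance is fixed at 1 because only the ratio of $\beta$ to the error standard deviation is identified.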
(Symmetric) Post randomization of binary variable
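The formulas of this slide are not part of the transcript. A minimal sketch of symmetric post-randomization of a binary variable, in assumed notation ($y_i^m$ for the masked value, $\pi$ for the probability that a value is left unchanged), could read:

```latex
\begin{align*}
P(y_i^m = 1 \mid y_i = 1) &= P(y_i^m = 0 \mid y_i = 0) = \pi, \\
P(y_i^m = 0 \mid y_i = 1) &= P(y_i^m = 1 \mid y_i = 0) = 1 - \pi.
\end{align*}
```

“Symmetric” means that both outcomes are retained with the same probability. For $\pi = 1/2$ the masked variable carries no information about the original one, so useful values lie away from $1/2$.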
ML Estimation of the probit model under PRAM (1)
ML Estimation of the probit model under PRAM (2)
• Consistent estimation of the probit model under PRAM is possible if the right-hand-side regressors are left unprotected.
• As we will see, it is also possible to estimate the probit model consistently if only the right-hand-side variables are protected by noise addition.
• However, no satisfactory procedure has been found so far for the most relevant case, in which both the dependent and the independent variables have been anonymized.
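The first case (PRAM on the dependent variable, unprotected regressors) can be sketched in Python. Assuming symmetric PRAM with a known retention probability `keep_prob`, the masked outcome satisfies P(masked y = 1 | x) = (1 - keep_prob) + (2 * keep_prob - 1) * Φ(x'β), and maximizing the corresponding likelihood yields a consistent estimate. The function names, the Fisher-scoring optimizer, and all simulation settings below are illustrative assumptions, not the authors' implementation.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(z):
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)

def masked_prob(beta, X, keep_prob):
    """P(masked y = 1 | x) under symmetric PRAM with known keep_prob."""
    q = (1.0 - keep_prob) + (2.0 * keep_prob - 1.0) * norm_cdf(X @ beta)
    return np.clip(q, 1e-10, 1.0 - 1e-10)

def loglik(beta, X, y_masked, keep_prob):
    q = masked_prob(beta, X, keep_prob)
    return np.sum(y_masked * np.log(q) + (1 - y_masked) * np.log(1.0 - q))

def fit_pram_probit(X, y_masked, keep_prob, max_iter=100, tol=1e-8):
    """Fisher scoring with step halving; keep_prob=1.0 gives the ordinary probit.

    keep_prob = 0.5 is not allowed: the model is then unidentified.
    """
    beta = np.zeros(X.shape[1])
    ll = loglik(beta, X, y_masked, keep_prob)
    for _ in range(max_iter):
        q = masked_prob(beta, X, keep_prob)
        dq = (2.0 * keep_prob - 1.0) * norm_pdf(X @ beta)   # dq / d(x'beta)
        score = X.T @ ((y_masked - q) * dq / (q * (1.0 - q)))
        info = (X * (dq**2 / (q * (1.0 - q)))[:, None]).T @ X
        step = np.linalg.solve(info, score)
        scale, cand = 1.0, beta + step
        ll_new = loglik(cand, X, y_masked, keep_prob)
        while ll_new < ll and scale > 1e-8:                 # step halving
            scale /= 2.0
            cand = beta + scale * step
            ll_new = loglik(cand, X, y_masked, keep_prob)
        if ll_new < ll:
            break                                           # no improvement found
        converged = ll_new - ll < tol
        beta, ll = cand, ll_new
        if converged:
            break
    return beta

# Simulated illustration (all values hypothetical).
rng = np.random.default_rng(42)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.3, 0.8])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

keep_prob = 0.8
y_masked = np.where(rng.random(n) >= keep_prob, 1 - y, y)

beta_naive = fit_pram_probit(X, y_masked, keep_prob=1.0)        # ignores PRAM
beta_corrected = fit_pram_probit(X, y_masked, keep_prob=keep_prob)
print("naive:    ", beta_naive)
print("corrected:", beta_corrected)
```

In such a simulation the naive probit, which ignores the masking, typically shows a slope that is clearly attenuated toward zero, whereas the corrected likelihood recovers values close to the true parameters at the price of larger standard errors.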
Addition of noise (in the linear model)
• Additive error
• Multiplicative error
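The formulas for the two error types are not part of the transcript. In assumed notation ($x_i^a$ for the anonymized regressor, $u_i$ and $w_i$ for noise drawn independently of $x_i$), the two variants are commonly written as:

```latex
\begin{align*}
x_i^a &= x_i + u_i, & E[u_i] &= 0, & \operatorname{Var}(u_i) &= \sigma_u^2 && \text{(additive)} \\
x_i^a &= x_i \, w_i, & E[w_i] &= 1, & \operatorname{Var}(w_i) &= \sigma_w^2 && \text{(multiplicative)}
\end{align*}
```

For strictly positive variables and positive $w_i$, the multiplicative case becomes additive on the log scale, since $\log x_i^a = \log x_i + \log w_i$.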
Estimation under (additive) noise of regressor
• Inconsistency of estimate:
• Estimate from SIMEX procedure (adding error on purpose):
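The expressions behind these two bullets are not part of the transcript. For the simple regression $y_i = \alpha + \beta x_i + \varepsilon_i$ with an anonymized regressor $x_i^a = x_i + u_i$, where $u_i$ is independent of $x_i$ and $\varepsilon_i$, the standard errors-in-variables results are as follows (notation assumed here; $\lambda \sigma_u^2$ is the variance of the noise added on purpose):

```latex
\begin{align*}
\operatorname{plim} \hat\beta_{\text{naive}} &= \beta \, \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2}, \\
\operatorname{plim} \hat\beta(\lambda) &= \beta \, \frac{\sigma_x^2}{\sigma_x^2 + (1 + \lambda)\,\sigma_u^2}, \qquad \lambda \ge 0.
\end{align*}
```

The naive estimate is attenuated toward zero, and adding further noise attenuates it more. Formally setting $\lambda = -1$ in the second expression recovers $\beta$, which motivates extrapolating the simulated estimates back to $\lambda = -1$.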
Extrapolation in the SIMEX procedure
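The slide's figure is not part of the transcript. The idea can be sketched in Python for a simple linear regression with additive noise of known standard deviation: extra noise with variance λ times the original noise variance is added for several λ > 0, the naive slope is re-estimated each time, and a quadratic in λ is fitted and extrapolated to λ = -1. The quadratic extrapolant, the λ grid, the number of replications, and all simulation values are assumptions for illustration; this is not the Stata procedure mentioned in the talk.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of a simple linear regression of y on x (with intercept)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

def simex_slope(x_noisy, y, noise_sd, rng,
                lambdas=(0.5, 1.0, 1.5, 2.0), n_rep=50):
    """SIMEX estimate of the slope using a quadratic extrapolation to lambda = -1."""
    lam_grid = [0.0]
    slopes = [ols_slope(x_noisy, y)]          # naive estimate at lambda = 0
    for lam in lambdas:
        reps = []
        for _ in range(n_rep):
            # Simulation step: add extra noise with variance lam * noise_sd**2.
            extra = np.sqrt(lam) * noise_sd * rng.normal(size=x_noisy.shape)
            reps.append(ols_slope(x_noisy + extra, y))
        lam_grid.append(lam)
        slopes.append(np.mean(reps))
    # Extrapolation step: quadratic in lambda, evaluated at lambda = -1.
    coefs = np.polyfit(lam_grid, slopes, deg=2)
    return float(np.polyval(coefs, -1.0))

# Simulated illustration (all values hypothetical).
rng = np.random.default_rng(7)
n = 5000
beta_true = 2.0
noise_sd = 0.5
x = rng.normal(size=n)
y = 1.0 + beta_true * x + 0.5 * rng.normal(size=n)
x_noisy = x + noise_sd * rng.normal(size=n)   # anonymized regressor

beta_naive = ols_slope(x_noisy, y)
beta_simex = simex_slope(x_noisy, y, noise_sd, rng)
print("naive:", beta_naive)
print("SIMEX:", beta_simex)
```

Because the true dependence of the slope on λ is not exactly quadratic, the quadratic extrapolation usually removes most, but not all, of the attenuation bias.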
Report on recent work on estimating the probit model from anonymized micro data (1)
• ML estimation of the generalized probit model combined with the SIMEX procedure did not work satisfactorily, even in the case of no post-randomization!
• However, estimation of the generalized linear model for the special case representing the probit model gave good results for the case “noise addition but no PRAM”.
• Stata SIMEX procedure!
Report on recent work on estimating the probit model from anonymized micro data (2)
• So far we have no adequate estimation procedure for the case in which the dependent variable is masked by PRAM and, at the same time, the regressor variable(s) is (are) protected by noise addition.
• Note that we consider here a (frequently used) nonlinear model.
• For linear models, however, corrected estimation procedures seem to work well.
• See the research report!
Concluding Remarks (Importance of the project):
• For the first time, simultaneous consideration of confidentiality issues and data quality aspects as seen from the user’s side.
• For the first time, consideration of the impact of anonymization on statistical inference.
• Use of real data sets from the German Statistical Office.
• Use of modern matching algorithms in simulating scenarios for disclosure. See the earlier presentation by Rainer Lenz!
• Use of commercial databases for simulating external knowledge.
Future Research
• So far only cross-section data.
• Extension to the case of panel data.
• Multiple imputation as a masking procedure.
• A project will start very soon!