RTI International is a trade name of Research Triangle Institute
3040 Cornwallis Road ¦ P.O. Box 12194 ¦ Research Triangle Park, North Carolina, USA 27709 Phone: 919-541-6990 e-mail: [email protected]
A Comparative Assessment of Methods for Protecting Confidentiality of Microdata
David Wilson
Joint Statistical Meetings, Minneapolis, MN, August 7-11, 2005
Outline
§ Framing the Comparison Scenario
§ 10 Statistical Disclosure Limitation Methods
§ Comparing 10 methods
§ Summary
Apples and Oranges, Oh my!
§ Driving forces behind Statistical Disclosure Limitation (SDL) are: Risk and Information
§ In order to compare and choose acceptable SDL techniques, one must define acceptable risk and acceptable information loss
§ Requires subjective determinations of “how much risk is acceptable” and “how much information loss is acceptable”
Disclosure: A Balancing Act
DATA CONFIDENTIALITY vs. DATA QUALITY
Means of Comparison
§ Do methods use a common measure of risk?....No
§ Do methods use a common measure of information loss?....No
§ So how do we compare competing methods?
§ By ease of implementation, by the type of data they can handle, and by their impact on one of several measures of information loss
10 SDL Techniques
§ 10 SDL techniques applicable to microdata will be discussed
§ Global recoding, Local suppression, Rounding, Microaggregation, Noise addition, Sampling, Swapping, PRAM, Imputation, and MASSC
§ Not an exhaustive list
10 SDL Techniques (cont.)
In (rough) order of complexity:
§ Global Recoding (Top Coding, Bottom Coding)
Global recoding of a variable is the process of combining two or more categories of a variable into one category. Applies to continuous or categorical data. Coarsens data.
Before recoding:
Age   N
1     200
2     4
3     174
4     44
After recoding (categories 1 and 2 combined):
Age   N
1     204
2     174
3     44
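A minimal sketch of global recoding, assuming the variable is held as a plain list of category codes. The function names, the recoding map, and the cap of 90 are illustrative assumptions, not part of the talk.

```python
from collections import Counter


def global_recode(values, mapping):
    """Replace each category by its recoded category; unmapped values are kept."""
    return [mapping.get(v, v) for v in values]


def top_code(values, cap):
    """Top coding: any value above `cap` is released as `cap`."""
    return [min(v, cap) for v in values]


# Illustrative category counts: category 2 is rare, so it is merged with
# category 1 and the remaining categories are renumbered.
ages = [1] * 200 + [2] * 4 + [3] * 174 + [4] * 44
recoded = global_recode(ages, {1: 1, 2: 1, 3: 2, 4: 3})
counts = Counter(recoded)

# Top coding a continuous variable at an assumed cap of 90.
capped = top_code([30, 85, 97], cap=90)
```

The rare category disappears from the released file, at the cost of a coarser variable.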
10 SDL Techniques (cont.)
§ Local Suppression
Local suppression is a record-level process in which a value for a variable is replaced by a value that indicates “missingness.” Applied to extreme values, for example. Changes distributions.
Obs   Income (original)   Income (after suppression)
1     $45,000             $45,000
2     $32,000             $32,000
3     $1,500,000          .
4     $45,000             $45,000
5     $32,000             $32,000
6     $32,000             $32,000
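A minimal sketch of local suppression applied to extreme values. The threshold of $1,000,000 and the use of `None` as the "missing" marker are assumptions made for illustration; in practice the rule for choosing which values to suppress would come from a risk assessment.

```python
def local_suppress(values, threshold, missing=None):
    """Replace every value above `threshold` with a missing-value marker."""
    return [missing if v > threshold else v for v in values]


# Incomes for six records; one extreme value stands out.
incomes = [45_000, 32_000, 1_500_000, 45_000, 32_000, 32_000]
released = local_suppress(incomes, threshold=1_000_000)
```

Only the outlying record loses its value; every other record is released unchanged.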
10 SDL Techniques (cont.)
§ Rounding
Values for a variable are replaced with some integer multiple of a rounding base. Applicable to quantitative, continuous variables. Changes distributions.
Obs   Age in years (original)   Age in years (rounded)
1     13.2                      13
2     43.7                      44
3     35.3                      35
4     12.4                      12
5     89.5                      90
6     27.9                      28
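A minimal sketch of rounding to a base, assuming conventional "round half up" behavior. The function name is hypothetical. Python's built-in `round` uses round-half-to-even, so the sketch uses `math.floor` instead.

```python
import math


def round_to_base(value, base):
    """Round `value` to the nearest integer multiple of `base` (halves round up)."""
    return base * math.floor(value / base + 0.5)


ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]
rounded_base_1 = [round_to_base(a, 1) for a in ages]
rounded_base_5 = [round_to_base(a, 5) for a in ages]
```

A larger rounding base gives more protection and more information loss.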
10 SDL Techniques (cont.)
§ Microaggregation
Microaggregation is the process of replacing values of variables, for a given grouping of records, with an aggregate value derived from that group. “Flattens” distributions. Hides extreme values.
Obs   Gender   Age in years (original)   Age in years (microaggregated)
1     Male     13.2                      46
2     Female   43.7                      28
3     Male     35.3                      46
4     Female   12.4                      28
5     Male     89.5                      46
6     Female   27.9                      28
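A minimal sketch of microaggregation, assuming records are grouped by an existing categorical variable (gender here) and the aggregate is the group mean. Many implementations instead form groups of k similar records; that choice, like the function and field names, is not specified in the talk.

```python
from collections import defaultdict
from statistics import mean


def microaggregate(records, group_key, value_key):
    """Replace `value_key` in every record with the mean of its group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r[value_key])
    group_means = {g: mean(vals) for g, vals in groups.items()}
    return [{**r, value_key: group_means[r[group_key]]} for r in records]


records = [
    {"obs": 1, "gender": "Male", "age": 13.2},
    {"obs": 2, "gender": "Female", "age": 43.7},
    {"obs": 3, "gender": "Male", "age": 35.3},
    {"obs": 4, "gender": "Female", "age": 12.4},
    {"obs": 5, "gender": "Male", "age": 89.5},
    {"obs": 6, "gender": "Female", "age": 27.9},
]
masked = microaggregate(records, "gender", "age")
```

Every male record now carries the male mean (about 46) and every female record the female mean (about 28), so the extreme age of 89.5 is no longer visible.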
10 SDL Techniques (cont.)
§ Noise Addition
Additive noise addition (multiplicative noise is another variant) is the record-level process of adding a realization of a random vector, say µ, to the vector of responses for a variable y. Could yield impossible data values if done incorrectly. Changes distributions and multivariate relationships.
Obs   Age in years (original)   Age in years (after noise)
1     13.2                      13.9
2     43.7                      41.3
3     35.3                      33.1
4     12.4                      14.8
5     89.5                      93.2
6     27.9                      29.1
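A minimal sketch of additive noise, assuming independent zero-mean Gaussian noise with a user-chosen standard deviation. The optional bounds are one simple guard against impossible values such as negative ages; clamping is an assumption here and distorts the distribution further.

```python
import random


def add_noise(values, sd, lower=None, upper=None, rng=None):
    """Add N(0, sd^2) noise to each value, optionally clamping to [lower, upper]."""
    rng = rng or random.Random()
    noisy = []
    for v in values:
        x = v + rng.gauss(0.0, sd)
        if lower is not None:
            x = max(lower, x)
        if upper is not None:
            x = min(upper, x)
        noisy.append(x)
    return noisy


ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]
# Seeded generator so the illustration is repeatable; sd=2.0 is arbitrary.
noisy_ages = add_noise(ages, sd=2.0, lower=0.0, rng=random.Random(2005))
```

Noise applied independently to each variable weakens correlations between variables, which is the multivariate effect noted on the slide.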
10 SDL Techniques (cont.)
§ Sampling
Release a subset of all records contained in the file by sampling from the set of all records. Increases variances and may impact rare responses.
Original file:
Obs   Age (in years)
1     13.2
2     43.7
3     35.3
4     12.4
5     89.5
6     27.9
Released sample:
Obs   Age (in years)
1     13.2
2     43.7
3     35.3
6     27.9
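A minimal sketch of releasing a sample, assuming simple random sampling without replacement. Each released record gets a weight of N/n so that weighted totals still estimate full-file totals; the field name `weight` is an assumption.

```python
import random


def sample_release(records, n, rng=None):
    """Draw n records without replacement and attach the weight N/n to each."""
    rng = rng or random.Random()
    chosen = rng.sample(records, n)
    weight = len(records) / n
    return [{**r, "weight": weight} for r in chosen]


records = [
    {"obs": 1, "age": 13.2},
    {"obs": 2, "age": 43.7},
    {"obs": 3, "age": 35.3},
    {"obs": 4, "age": 12.4},
    {"obs": 5, "age": 89.5},
    {"obs": 6, "age": 27.9},
]
released = sample_release(records, n=4, rng=random.Random(7))

# The weights sum to the size of the original file.
estimated_count = sum(r["weight"] for r in released)
```

An intruder cannot be sure a given individual is in the released file, but a rare value such as age 89.5 may drop out of the sample entirely.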
10 SDL Techniques (cont.)
§ Swapping
Data swapping is the process of choosing two records at random from a microdata set and swapping the values of a set of variables.
Obs   Gender (original)   Gender (after swapping)   Age (in years)
1     Male                Female                    13.2
2     Female              Male                      43.7
3     Male                Female                    35.3
4     Female              Male                      12.4
5     Male                Male                      89.5
6     Female              Female                    27.9
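A minimal sketch of data swapping, assuming pairs are chosen uniformly at random and the number of swaps is fixed in advance. Practical implementations often restrict swaps to records that agree on other variables; that refinement is left out, and the names are hypothetical.

```python
import random


def swap_values(records, keys, n_pairs, rng=None):
    """Pick `n_pairs` random pairs of records and exchange the values of `keys`."""
    rng = rng or random.Random()
    out = [dict(r) for r in records]  # copy so the original file is untouched
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(out)), 2)
        for k in keys:
            out[i][k], out[j][k] = out[j][k], out[i][k]
    return out


records = [
    {"obs": 1, "gender": "Male", "age": 13.2},
    {"obs": 2, "gender": "Female", "age": 43.7},
    {"obs": 3, "gender": "Male", "age": 35.3},
    {"obs": 4, "gender": "Female", "age": 12.4},
    {"obs": 5, "gender": "Male", "age": 89.5},
    {"obs": 6, "gender": "Female", "age": 27.9},
]
swapped = swap_values(records, keys=["gender"], n_pairs=2, rng=random.Random(11))
```

The marginal counts of the swapped variable are preserved exactly, while its relationship with the other variables (age here) is disturbed.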
10 SDL Techniques (cont.)
§ PRAM (Post Randomization)
Values of variables for each record in a microdata set are changed according to a known probability mechanism, so values after application of PRAM may or may not differ from the original values.
§ Estimates after PRAM can be adjusted because of the known probability mechanism
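A minimal sketch of PRAM for a two-category variable, assuming a transition table whose entry `transition[k][l]` is the probability that original category k is published as l. Because that table is known, expected published counts are a linear function of the true counts, and inverting the 2x2 system gives adjusted estimates. The probabilities 0.9 and 0.8 are illustrative assumptions.

```python
import random


def pram(values, transition, rng=None):
    """Publish each value as a random draw from its row of the transition table."""
    rng = rng or random.Random()
    out = []
    for v in values:
        cats = list(transition[v])
        probs = [transition[v][c] for c in cats]
        out.append(rng.choices(cats, weights=probs, k=1)[0])
    return out


def adjust_binary_counts(observed, p_stay_a, p_stay_b):
    """Estimate true counts (a, b) from published counts.

    Solves E[obs_a] = p_stay_a * T_a + (1 - p_stay_b) * T_b
           E[obs_b] = (1 - p_stay_a) * T_a + p_stay_b * T_b
    """
    obs_a, obs_b = observed
    det = p_stay_a + p_stay_b - 1.0  # zero would make the system unsolvable
    true_a = (p_stay_b * obs_a - (1.0 - p_stay_b) * obs_b) / det
    true_b = (p_stay_a * obs_b - (1.0 - p_stay_a) * obs_a) / det
    return true_a, true_b


transition = {
    "Male": {"Male": 0.9, "Female": 0.1},
    "Female": {"Male": 0.2, "Female": 0.8},
}
published = pram(["Male", "Female", "Male", "Female"], transition, random.Random(5))
```

For example, true counts of 100 and 50 have expected published counts of 100 and 50 under these probabilities, and `adjust_binary_counts((100, 50), 0.9, 0.8)` recovers 100 and 50 up to floating-point error.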
10 SDL Techniques (cont.)
§ Multiple/Single Imputation (synthetic data)
§ For single imputation, replace a value in a data set with a value that is either
§ 1) derived from a model of the population from which the data were drawn, or
§ 2) chosen by some mathematical method to be “close” to the original value (e.g., nearest neighbor)
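A minimal sketch of the nearest-neighbor option, assuming a donor approach: each record's sensitive value is replaced by the value from the other record that is closest on a single matching variable. The choice of matching variable, the absolute-difference distance, and the field names are all assumptions made for illustration.

```python
def nearest_neighbor_impute(records, match_key, target_key):
    """Replace `target_key` with the value from the nearest other record on `match_key`."""
    out = []
    for i, r in enumerate(records):
        donor = min(
            (d for j, d in enumerate(records) if j != i),
            key=lambda d: abs(d[match_key] - r[match_key]),
        )
        out.append({**r, target_key: donor[target_key]})
    return out


# Hypothetical records: income is sensitive, age is used for matching.
records = [
    {"age": 30, "income": 40_000},
    {"age": 31, "income": 42_000},
    {"age": 60, "income": 90_000},
]
imputed = nearest_neighbor_impute(records, match_key="age", target_key="income")
```

No released record keeps its own income, but each imputed income comes from a record of similar age, which keeps the value plausible.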
10 SDL Techniques (cont.)
§ Multiple/Single Imputation (synthetic data)
§ For “full” multiple imputation:
§ Model the population distribution of the variables contained in the microdata, generate realizations of the microdata under the developed model, and release the set of generated realizations.
§ For “partial” multiple imputation:
§ Model the population distribution of the variables contained in the microdata, generate realizations of “parts” of the microdata under the developed model, and release the set of generated realizations along with the un-imputed data.
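A minimal sketch of "full" and "partial" synthesis for a single variable, assuming a normal model fitted by the sample mean and standard deviation. This is a deliberate simplification: proper multiple imputation would also draw the model parameters from their posterior distribution, and real files need a multivariate model. All names are hypothetical.

```python
import random
from statistics import mean, stdev


def synthesize_full(values, m, rng=None):
    """Generate m fully synthetic copies of `values` from a fitted normal model."""
    rng = rng or random.Random()
    mu, sigma = mean(values), stdev(values)
    return [[rng.gauss(mu, sigma) for _ in values] for _ in range(m)]


def synthesize_partial(values, replace_idx, m, rng=None):
    """Generate m copies in which only the positions in `replace_idx` are synthetic."""
    rng = rng or random.Random()
    mu, sigma = mean(values), stdev(values)
    copies = []
    for _ in range(m):
        copy = list(values)
        for i in replace_idx:
            copy[i] = rng.gauss(mu, sigma)
        copies.append(copy)
    return copies


ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]
full = synthesize_full(ages, m=5, rng=random.Random(1))
# Only the record with the extreme age (index 4) is replaced.
partial = synthesize_partial(ages, replace_idx=[4], m=5, rng=random.Random(1))
```

Releasing several realizations lets analysts account for the extra variability introduced by the synthesis.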
10 SDL Techniques (cont.)
§ MASSC
§ Uses recoding, substitution, and sampling to change the values of key variables
§ Sampling weights are created or adjusted to allow for accurate estimation of “totals”
After all that…how do we compare methods?
§ Some perturb data: All but Global recoding, Swapping, and Local suppression
§ Some require construction of models: Single and Multiple imputation
§ Some are easier to implement than others: Global recoding, Local Suppression, Rounding
After all that…how do we compare methods? (cont.)
Existence of software
§ Not much SDL software available
§ µ-Argus – Publicly Available, Free
§ MASSC – Software exists, provided as service
§ Privacert Appliance – Commercial, appears to use suppression
§ IVEware – General imputation software for SAS
After all that…how do we compare methods? (cont.)
Impact on information loss
§ No generally accepted measure of information loss
§ Some methods provide estimates of information loss: PRAM and MASSC
After all that…how do we compare methods? (cont.)
Analyzing Data After Treatment
§ Global recoding (GR) – no adjustments necessary
§ Swapping/Local suppression/rounding – no general adjustments exist
§ Imputation – adjusted variance formulas exist in certain implementations
§ Sampling/MASSC – sampling weights can be used to adjust estimates
§ PRAM – Estimates adjusted because probability mechanism is known
After all that…how do we compare methods? (cont.)
§ Ability to assess Risk
§ PRAM, Imputation, and MASSC all provide some measure of “risk”
§ No single measure of risk exists, though record linkage techniques (probabilistic, distance-based) have been used to compare different methods
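A minimal sketch of a distance-based record linkage check, assuming the original and masked files are aligned by position and hold numeric variables only. Each masked record is linked to its nearest original record by squared Euclidean distance, and the share of correct links serves as a rough risk indicator. The function name and the sample values are assumptions.

```python
def linkage_rate(original, masked):
    """Share of masked records whose nearest original record is their true source."""

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    hits = 0
    for i, m in enumerate(masked):
        nearest = min(range(len(original)), key=lambda j: dist2(original[j], m))
        hits += nearest == i
    return hits / len(masked)


original = [(13.2,), (43.7,), (35.3,)]
lightly_masked = [(13.9,), (41.3,), (33.1,)]   # small perturbations
heavily_masked = [(43.0,), (13.0,), (35.0,)]   # first two values effectively exchanged

risk_light = linkage_rate(original, lightly_masked)
risk_heavy = linkage_rate(original, heavily_masked)
```

Running the same check on files masked by different methods gives one common yardstick, though it reflects only this particular attack model.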
Summary
§ There are technical and other motivations (legal, political) that affect which SDL methods are used
§ Find a balance between disclosure risk and information loss
§ No method is right for every situation
§ Consult your local statistician!
There can be only one…Reference
§ There are many, many journal articles dealing with statistical disclosure
§ Too many to list here
§ One good reference:
Willenborg, Leon and de Waal, Ton (2001), Elements of Statistical Disclosure Control, Lecture Notes in Statistics, Springer-Verlag.