
RTI International is a trade name of Research Triangle Institute

3040 Cornwallis Road ¦ P.O. Box 12194 ¦ Research Triangle Park, North Carolina, USA 27709 Phone: 919-541-6990 e-mail: [email protected]

A Comparative Assessment of Methods for Protecting Confidentiality of Microdata

David Wilson

Joint Statistical Meetings, Minneapolis, MN, August 7-11, 2005

Outline

§ Framing the Comparison Scenario

§ 10 Statistical Disclosure Limitation Methods

§ Comparing 10 methods

§ Summary

Apples and Oranges, Oh my!

§ Driving forces behind Statistical Disclosure Limitation (SDL) are: Risk and Information

§ In order to compare and choose acceptable SDL techniques, one must define acceptable risk and acceptable information loss

§ Requires subjective determinations of “how much risk is acceptable” and “how much information loss is acceptable”

Disclosure: A Balancing Act

DATA CONFIDENTIALITY vs. DATA QUALITY

Means of Comparison

§ Do methods use a common measure of risk?....No

§ Do methods use a common measure of information loss?....No

§ So how do we compare competing methods?

§ By ease of implementation, by the type of data they can handle, and by their impact on one of several measures of information loss

10 SDL Techniques

§ 10 SDL techniques applicable to microdata will be discussed

§ Global recoding, Local suppression, Rounding, Microaggregation, Noise addition, Sampling, Swapping, PRAM, Imputation, and MASSC

§ Not an exhaustive list

10 SDL Techniques (cont.)

In (rough) order of complexity:

§ Global Recoding (Top Coding, Bottom Coding)

Global recoding of a variable is the process of combining two or more categories of the variable into one category. Applicable to continuous or categorical data. Coarsens data.

[Example tables: counts (N) by Age, before and after recoding]
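A minimal sketch of global recoding, under assumptions that are not from the slides: the ages, the top-code cap of 65, and the 10-year band width are all made up for illustration.

```python
def top_code(age, cap):
    """Collapse every age at or above `cap` into one open-ended category."""
    return f"{cap}+" if age >= cap else str(age)


def recode_to_bands(age, width):
    """Replace an exact age with the band of `width` years that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


ages = [23, 41, 42, 43, 67, 88]  # hypothetical ages, not the slide's data
top_coded = [top_code(a, 65) for a in ages]       # cap of 65 is an assumption
banded = [recode_to_bands(a, 10) for a in ages]   # 10-year bands are an assumption
print(top_coded)
print(banded)
```

Both functions apply the same rule to every record, which is what makes the recoding "global" rather than record-level.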


§ Local Suppression

Local suppression is a record-level process in which a value for a variable is replaced by a value that indicates “missingness.” Applied to extreme values, for example. Changes distributions.

Obs   Income (original)   Income (after suppression)
1     $45,000             $45,000
2     $32,000             $32,000
3     $1,500,000          .
4     $45,000             $45,000
5     $32,000             $32,000
6     $32,000             $32,000
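The income example above can be sketched as follows. The threshold of $1,000,000 is an assumption chosen only so that the one extreme income is flagged; the slides do not say how extreme values are identified.

```python
# Incomes keyed by observation number, as in the example above.
incomes = {1: 45_000, 2: 32_000, 3: 1_500_000, 4: 45_000, 5: 32_000, 6: 32_000}

THRESHOLD = 1_000_000  # hypothetical cutoff for an "extreme" value


def suppress_extremes(values, threshold):
    """Replace each value above `threshold` with None (missing)."""
    return {obs: (None if v > threshold else v) for obs, v in values.items()}


released = suppress_extremes(incomes, THRESHOLD)
for obs, income in released.items():
    print(obs, "." if income is None else income)
```

Only the flagged record changes, so the released mean and variance of income differ from the original ones.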


§ Rounding

Values for a variable are replaced with an integer multiple of a rounding base. Applicable to quantitative, continuous variables. Changes distributions.

Obs   Age in years (original)   Age in years (rounded)
1     13.2                      13
2     43.7                      44
3     35.3                      35
4     12.4                      12
5     89.5                      90
6     27.9                      28
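A short sketch of rounding to a base, using the ages from the example above. Rounding halves upward is an assumption (it matches 89.5 becoming 90); the base of 5 is a second, made-up illustration.

```python
import math


def round_to_base(value, base):
    """Round `value` to the nearest multiple of `base`, with halves rounded up."""
    return math.floor(value / base + 0.5) * base


ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]
rounded_1 = [round_to_base(a, 1) for a in ages]  # base 1, as in the example
rounded_5 = [round_to_base(a, 5) for a in ages]  # base 5, illustrative only
print(rounded_1)
print(rounded_5)
```

A larger base hides more detail and moves the released distribution further from the original.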


§ Microaggregation

Microaggregation is the process of replacing values of variables, for a given grouping of records, with an aggregate value derived from that group. “Flattens” distributions. Hides extreme values.

In the example, records are grouped by Gender, which alternates between Female and Male across the six records, and each Age is replaced by the mean of its group.

Obs   Age in years (original)   Age in years (microaggregated)
1     13.2                      46
2     43.7                      28
3     35.3                      46
4     12.4                      28
5     89.5                      46
6     27.9                      28
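A minimal sketch of microaggregation by group mean, using the ages from the example above. The group labels "g1" and "g2" are placeholders for the two gender groups; which label corresponds to which gender is not assumed here.

```python
from collections import defaultdict
from statistics import mean

# (obs, group, age); "g1"/"g2" are placeholder labels for the two groups.
records = [
    (1, "g1", 13.2), (2, "g2", 43.7), (3, "g1", 35.3),
    (4, "g2", 12.4), (5, "g1", 89.5), (6, "g2", 27.9),
]


def microaggregate(rows):
    """Replace each age with the mean age of the record's group."""
    by_group = defaultdict(list)
    for _, group, age in rows:
        by_group[group].append(age)
    group_mean = {g: mean(ages) for g, ages in by_group.items()}
    return [(obs, group, group_mean[group]) for obs, group, _ in rows]


released = microaggregate(records)
for obs, group, age in released:
    print(obs, group, round(age, 1))
```

Group totals are preserved, but within-group variation disappears, which is the "flattening" the slide refers to.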


§ Noise Addition

Additive noise addition (multiplicative noise is another variant) is the record-level process of adding a realization of a random vector, say µ, to the vector of responses y for a variable. Could yield impossible data values if done incorrectly. Changes distributions and multivariate relationships.

Obs   Age in years (original)   Age in years (with noise)
1     13.2                      13.9
2     43.7                      41.3
3     35.3                      33.1
4     12.4                      14.8
5     89.5                      93.2
6     27.9                      29.1
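A hedged sketch of additive noise with a guard against impossible values. The normal distribution, the standard deviation of 2 years, and the plausible range of 0 to 110 are all assumptions; clamping is one simple guard, and it distorts the distribution near the bounds.

```python
import random


def add_noise(values, sd, lower, upper, seed=0):
    """Add zero-mean normal noise to each value, then clamp to [lower, upper]."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    noisy = []
    for v in values:
        perturbed = v + rng.gauss(0.0, sd)
        noisy.append(min(max(perturbed, lower), upper))
    return noisy


ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]
noisy_ages = add_noise(ages, sd=2.0, lower=0.0, upper=110.0)
print([round(a, 1) for a in noisy_ages])
```

Because each variable is perturbed independently in this sketch, correlations between variables would be weakened, consistent with the slide's warning about multivariate relationships.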


§ Sampling

Release a subset of all records contained in the file by sampling from the set of all records. Increases variances and may impact rare responses.

Original file:

Obs   Age (in years)
1     13.2
2     43.7
3     35.3
4     12.4
5     89.5
6     27.9

Released sample:

Obs   Age (in years)
1     13.2
2     43.7
3     35.3
6     27.9
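A small sketch of releasing a simple random sample, using the six records from the example above. Simple random sampling without replacement and the equal weight N/n are assumptions; the slides do not specify the sampling design.

```python
import random


def sample_records(records, n, seed=0):
    """Draw n records without replacement and return them with a common weight."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    kept = sorted(rng.sample(records, n))
    weight = len(records) / n  # each released record represents N/n records
    return kept, weight


records = [(1, 13.2), (2, 43.7), (3, 35.3), (4, 12.4), (5, 89.5), (6, 27.9)]
released, weight = sample_records(records, n=4)
print(released, weight)
```

Weighted counts from the released file still estimate the size of the full file, but with added sampling variance.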


§ Swapping

Data swapping is the process of choosing two records at random from a microdata set and swapping the values of a set of variables.

In the example, Gender alternates between Female and Male across the six records before swapping. After swapping, the Gender values of two pairs of records have been exchanged, while Age is unchanged:

Obs   Age (in years)
1     13.2
2     43.7
3     35.3
4     12.4
5     89.5
6     27.9
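A minimal sketch of one random swap. The ages come from the example above; the assignment of genders to particular records is illustrative only, and swapping just the gender variable is an assumption.

```python
import random


def swap_once(records, variables, rng):
    """Pick two distinct records at random and exchange the given variables."""
    i, j = rng.sample(range(len(records)), 2)
    for var in variables:
        records[i][var], records[j][var] = records[j][var], records[i][var]
    return i, j


# Gender labels per record are illustrative, not taken from the slides.
original = [
    {"obs": 1, "gender": "Male", "age": 13.2},
    {"obs": 2, "gender": "Female", "age": 43.7},
    {"obs": 3, "gender": "Male", "age": 35.3},
    {"obs": 4, "gender": "Female", "age": 12.4},
    {"obs": 5, "gender": "Male", "age": 89.5},
    {"obs": 6, "gender": "Female", "age": 27.9},
]
released = [dict(r) for r in original]  # copy so the original stays intact
pair = swap_once(released, ["gender"], random.Random(0))
print(pair, [r["gender"] for r in released])
```

The marginal distribution of each swapped variable is unchanged, but its relationship with the variables that were not swapped can be altered.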


§ PRAM (Post Randomization)

Values of variables for each record in a microdata set are changed according to a known probability mechanism, so values after application of PRAM may or may not differ from the original values

§ Estimates after PRAM can be adjusted because the probability mechanism is known
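A hedged sketch of PRAM for one two-category variable, together with the adjustment of counts. The transition probabilities in `P` are made up, and inverting the transition matrix is one common way to adjust estimates; the slides do not specify either.

```python
import random

# P[a][b] = probability that original category a is released as category b.
# These probabilities are hypothetical.
P = {
    "Female": {"Female": 0.9, "Male": 0.1},
    "Male": {"Female": 0.2, "Male": 0.8},
}


def pram(values, P, rng):
    """Redraw each value from the row of P that belongs to its original category."""
    out = []
    for v in values:
        categories = list(P[v])
        weights = [P[v][c] for c in categories]
        out.append(rng.choices(categories, weights=weights)[0])
    return out


def adjust_counts(obs_f, obs_m, P):
    """Estimate original counts by solving observed = original x P (2x2 case)."""
    a, b = P["Female"]["Female"], P["Female"]["Male"]
    c, d = P["Male"]["Female"], P["Male"]["Male"]
    det = a * d - b * c
    est_f = (obs_f * d - obs_m * c) / det
    est_m = (obs_m * a - obs_f * b) / det
    return est_f, est_m


original = ["Female"] * 600 + ["Male"] * 400
released = pram(original, P, random.Random(0))
obs_f = released.count("Female")
obs_m = released.count("Male")
print(obs_f, obs_m, adjust_counts(obs_f, obs_m, P))
```

With 600 and 400 original records, the expected released counts are 620 and 380, and the adjustment maps those expected counts back to 600 and 400. For any single realization the adjusted counts are only estimates.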


§ Multiple/Single Imputation (synthetic data)

§ For single imputation, replace a value in a data set with a value that is either

§ 1) derived from a model of the population from which the data were drawn, or

§ 2) chosen by some mathematical method to be “close” to the original value (e.g., nearest neighbor)


§ Multiple/Single Imputation (synthetic data) (cont.)

§ For “full” multiple imputation: model the population distribution of the variables contained in the microdata, generate realizations of the microdata under the developed model, and release the set of generated realizations.

§ For “partial” multiple imputation: model the population distribution of the variables contained in the microdata, generate realizations of “parts” of the microdata under the developed model, and release the set of generated realizations along with the un-imputed data.
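A deliberately simplified sketch of “full” multiple imputation for a single variable. The univariate normal model, the number of realizations (m = 3), and the ages used are assumptions; a real application would model the joint distribution of all variables and would need to respect valid ranges.

```python
import random
from statistics import mean, stdev


def synthesize(values, m, seed=0):
    """Fit a normal model to `values` and draw m synthetic copies of the file."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    mu, sigma = mean(values), stdev(values)
    return [[rng.gauss(mu, sigma) for _ in values] for _ in range(m)]


ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]  # illustrative ages
synthetic_files = synthesize(ages, m=3)
for realization in synthetic_files:
    print([round(a, 1) for a in realization])
```

No original value is released in this sketch; “partial” imputation would instead replace only selected values and keep the rest as observed.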


§ MASSC

§ Uses recoding, substitution, and sampling to change the values of key variables

§ Sampling weights are created or adjusted to allow for accurate estimation of “totals”

After all that…how do we compare methods?

§ Some perturb data: All but Global recoding, Swapping, and Local suppression

§ Some require construction of models: Single and Multiple imputation

§ Some are easier to implement than others: Global recoding, Local Suppression, Rounding

After all that…how do we compare methods? (cont.)

Existence of software

§ Not much SDL software available

§ µ-Argus – Publicly Available, Free

§ MASSC – Software exists, provided as service

§ Privacert Appliance – Commercial, appears to use suppression

§ IVEware – General imputation software for SAS


Impact on information loss

§ No generally accepted measure of information loss

§ Some methods provide estimates of information loss: PRAM and MASSC


Analyzing Data After Treatment

§ GR – no adjustments necessary

§ Swapping/Local suppression/rounding – no general adjustments exist

§ Imputation – adjusted variance formulas exist in certain implementations

§ Sampling/MASSC – sampling weights can be used to adjust estimates

§ PRAM – Estimates adjusted because probability mechanism is known


§ Ability to assess Risk

§ PRAM, Imputation, and MASSC all provide some measure of “risk”

§ No single measure of risk exists, though record linkage techniques (probabilistic, distance-based) have been used to compare different methods
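A minimal sketch of distance-based record linkage as a way to compare methods, using the age values from the earlier examples. Linking on a single variable with absolute distance, and reporting the share of correct links, are assumptions made for illustration.

```python
def linkage_rate(original, masked):
    """Share of masked records whose nearest original value is their own source."""
    hits = 0
    for i, m in enumerate(masked):
        nearest = min(range(len(original)), key=lambda j: abs(original[j] - m))
        hits += nearest == i
    return hits / len(masked)


original = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]
rounded = [13, 44, 35, 12, 90, 28]
noisy = [13.9, 41.3, 33.1, 14.8, 93.2, 29.1]
microaggregated = [46, 28, 46, 28, 46, 28]

for name, masked in [("rounding", rounded), ("noise", noisy),
                     ("microaggregation", microaggregated)]:
    print(name, round(linkage_rate(original, masked), 2))
```

On these six records the sketch links every rounded value, five of six noisy values, and one of six microaggregated values back to the correct source record. A higher rate suggests higher re-identification risk under this particular attack, not overall risk.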

Summary

§ There are technical and other motivations (legal, political) that affect which SDL methods are used

§ Find a balance between disclosure risk and information loss

§ No method is right for every situation

§ Consult your local statistician!

There can be only one…Reference

§ There are many, many journal articles dealing with statistical disclosure

§ Too many to list here

§ One good reference:

Willenborg, Leon and de Waal, Ton (2001), Elements of Statistical Disclosure Control, Lecture Notes in Statistics, Springer-Verlag.

Contact for Additional Information

David Wilson

RTI [email protected]

http://www.rti.org/JSM
