A Comparative Assessment of Methods for Protecting Confidentiality of Microdata

David Wilson

Joint Statistical Meetings
Minneapolis, MN
August 7-11, 2005

RTI International is a trade name of Research Triangle Institute
3040 Cornwallis Road ¦ P.O. Box 12194 ¦ Research Triangle Park, North Carolina, USA 27709
Phone: 919-541-6990 e-mail: [email protected]




Outline

§ Framing the Comparison Scenario

§ 10 Statistical Disclosure Limitation Methods

§ Comparing 10 methods

§ Summary


Apples and Oranges, Oh my!

§ The driving forces behind Statistical Disclosure Limitation (SDL) are risk and information

§ In order to compare and choose acceptable SDL techniques, one must define acceptable risk and acceptable information loss

§ Requires subjective determinations of “how much risk is acceptable” and “how much information loss is acceptable”


Disclosure: A Balancing Act

DATA CONFIDENTIALITY vs. DATA QUALITY


Means of Comparison

§ Do methods use a common measure of risk? No.

§ Do methods use a common measure of information loss? No.

§ So how do we compare competing methods?

§ By ease of implementation, by the type of data they can handle, and by their impact on one of several measures of information loss


10 SDL Techniques

§ 10 SDL techniques applicable to microdata will be discussed

§ Global recoding, Local suppression, Rounding, Microaggregation, Noise addition, Sampling, Swapping, PRAM, Imputation, and MASSC

§ Not an exhaustive list


10 SDL Techniques (cont.)

In (rough) order of complexity:

§ Global Recoding (Top Coding, Bottom Coding)

Global recoding of a variable is the process of combining two or more categories of a variable into one category. Applies to continuous or categorical data. Coarsens data.

[Example tables: record counts (N) by Age]
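A minimal Python sketch of global recoding, covering top coding, bottom coding, and merging categories. The helper names, the ages, and the marital-status labels are hypothetical illustrations, not values taken from the slide.

```python
from collections import Counter

def top_code(values, cap):
    """Top coding: every value above `cap` is released as `cap`."""
    return [min(v, cap) for v in values]

def bottom_code(values, floor):
    """Bottom coding: every value below `floor` is released as `floor`."""
    return [max(v, floor) for v in values]

def recode(values, mapping):
    """Merge categories: any value found in `mapping` is replaced by its merged label."""
    return [mapping.get(v, v) for v in values]

# Hypothetical ages (an assumption): one respondent stands alone at age 44.
ages = [41] * 20 + [42] * 17 + [43] * 4 + [44]

# After top coding at 43, the released value 43 means "43 or older".
counts = Counter(top_code(ages, cap=43))

# Hypothetical categorical example: two small categories become one.
merged = recode(
    ["single", "widowed", "divorced"],
    {"widowed": "widowed/divorced", "divorced": "widowed/divorced"},
)
```

The unique record at age 44 disappears into a group of five, at the cost of a coarser age variable.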


10 SDL Techniques (cont.)

§ Local Suppression

Local suppression is a record-level process where a value for a variable is replaced by a value that indicates “missingness.” Applied to extreme values, for example. Changes distributions.

Obs   Income (original)   Income (after suppression)
1     $45,000             $45,000
2     $32,000             $32,000
3     $1,500,000          .
4     $45,000             $45,000
5     $32,000             $32,000
6     $32,000             $32,000
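A small Python sketch of local suppression applied to the six incomes in the example. The function name and the $1,000,000 threshold are assumptions made for illustration; in practice the rule for what counts as extreme has to be chosen by the data releaser.

```python
def suppress_above(values, threshold, missing=None):
    """Replace any value above `threshold` with a missing-value marker."""
    return [missing if v is not None and v > threshold else v for v in values]

# Incomes for Obs 1-6 from the example.
incomes = [45_000, 32_000, 1_500_000, 45_000, 32_000, 32_000]

# Assumed rule: suppress incomes above $1,000,000.
released = suppress_above(incomes, threshold=1_000_000)
```

Only Obs 3 is set to missing. The mean of the remaining incomes drops sharply, which is the "changes distributions" cost noted on the slide.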


10 SDL Techniques (cont.)

§ Rounding

Values for a variable are replaced with some integer multiple of a rounding base. Applicable to quantitative, continuous variables. Changes distributions.

Obs   Age (in years)   Age (after rounding)
1     13.2             13
2     43.7             44
3     35.3             35
4     12.4             12
5     89.5             90
6     27.9             28
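The rounding example can be sketched in Python as follows. The function name is hypothetical, and rounding halves upward is an assumption; the slide only says that values become integer multiples of a rounding base.

```python
import math

def round_to_base(value, base=1):
    """Round `value` to the nearest integer multiple of `base`, halves rounding up."""
    return math.floor(value / base + 0.5) * base

# Ages for Obs 1-6 from the example.
ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]

rounded = [round_to_base(a) for a in ages]               # base 1, as in the example
coarser = [round_to_base(a, base=5) for a in ages]       # a larger base gives more protection
```

With base 1 this reproduces the rounded ages in the example. A larger base hides more, and loses more information.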


10 SDL Techniques (cont.)

§ Microaggregation

Microaggregation is the process of replacing values of variables, for a given grouping of records, with an aggregate value derived from that group. “Flattens” distributions. Hides extreme values.

Obs   Gender   Age (in years)   Age (after microaggregation)
1     Male     13.2             46
2     Female   43.7             28
3     Male     35.3             46
4     Female   12.4             28
5     Male     89.5             46
6     Female   27.9             28
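A Python sketch of the microaggregation example, where each age is replaced by the mean age of the record's gender group. The function name and the choice to release the mean rounded to a whole year are assumptions. Other implementations often form groups of similar records of a minimum size rather than grouping on an existing variable.

```python
from collections import defaultdict
from statistics import mean

def microaggregate(records, group_key, value_key):
    """Replace each record's value with the rounded mean of its group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r[value_key])
    # Assumption: the aggregate is the group mean, released as a whole number.
    group_means = {g: round(mean(vals)) for g, vals in groups.items()}
    return [{**r, value_key: group_means[r[group_key]]} for r in records]

# Obs 1-6 from the example.
records = [
    {"obs": 1, "gender": "Male", "age": 13.2},
    {"obs": 2, "gender": "Female", "age": 43.7},
    {"obs": 3, "gender": "Male", "age": 35.3},
    {"obs": 4, "gender": "Female", "age": 12.4},
    {"obs": 5, "gender": "Male", "age": 89.5},
    {"obs": 6, "gender": "Female", "age": 27.9},
]

released = microaggregate(records, "gender", "age")
released_ages = [r["age"] for r in released]
```

The male mean is 46 and the female mean is 28, so the extreme age of 89.5 is no longer visible.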


10 SDL Techniques (cont.)

§ Noise Addition

Additive noise addition (multiplicative noise is an alternative) is the record-level process of adding a realization of a random vector, say µ, to the vector of responses for a variable y. Can yield impossible data values if done incorrectly. Changes distributions and multivariate relationships.

Obs   Age (in years)   Age (after noise addition)
1     13.2             13.9
2     43.7             41.3
3     35.3             33.1
4     12.4             14.8
5     89.5             93.2
6     27.9             29.1
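A Python sketch of additive noise with a guard against impossible values. The normal noise distribution, the standard deviation of 2 years, the bounds, and the function name are all assumptions; the slide does not say how its noise was generated, so this will not reproduce the numbers in the example.

```python
import random

def add_noise(values, sd, lower=None, upper=None, seed=None):
    """Add independent N(0, sd^2) noise to each value, then clip to [lower, upper]."""
    rng = random.Random(seed)
    noisy = []
    for v in values:
        x = v + rng.gauss(0.0, sd)
        if lower is not None:
            x = max(lower, x)   # e.g. an age cannot be negative
        if upper is not None:
            x = min(upper, x)
        noisy.append(round(x, 1))
    return noisy

# Ages for Obs 1-6 from the example.
ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]

# Assumed settings: sd of 2 years, ages kept within 0 to 120.
noisy_ages = add_noise(ages, sd=2.0, lower=0.0, upper=120.0, seed=2005)
```

Clipping prevents impossible values but distorts the distribution near the bounds, so it is a trade-off rather than a fix.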


10 SDL Techniques (cont.)

§ Sampling

Release a subset of all records contained in the file by sampling from the set of all records. Increases variances and may impact rare responses.

Obs   Age (in years)   Age (in released sample)
1     13.2             13.2
2     43.7             43.7
3     35.3             35.3
4     12.4             not released
5     89.5             not released
6     27.9             27.9
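A Python sketch of releasing a simple random sample without replacement. The function name, the sample size of 4, and the seed are assumptions, so the records selected here need not match the ones in the example. The weight N/n is the usual expansion weight for a simple random sample.

```python
import random

def sample_records(records, n, seed=None):
    """Draw n records without replacement; return them with the weight N/n."""
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(records)), n))   # keep file order
    weight = len(records) / n
    return [records[i] for i in keep], weight

# Ages for Obs 1-6 from the example.
ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]

sample, weight = sample_records(ages, n=4, seed=2005)
estimated_total = weight * sum(sample)   # weighted estimate of the full-file total
```

With only four of six records released, the weighted total varies from sample to sample, and a rare value such as 89.5 may be missing altogether.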


10 SDL Techniques (cont.)

§ Swapping

Data swapping is the process of choosing two records at random from a microdata set and swapping the values of a set of variables.

Obs   Age (in years)   Gender (original)   Gender (after swapping)
1     13.2             Male                Female
2     43.7             Female              Male
3     35.3             Male                Female
4     12.4             Female              Male
5     89.5             Male                Male
6     27.9             Female              Female
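A Python sketch of random data swapping on the gender variable. The function name, the number of swaps, and the seed are assumptions; which pairs get swapped depends on the random draw, so the result need not match the example.

```python
import random

def swap_values(records, variables, n_swaps=1, seed=None):
    """Pick two records at random and exchange their values for `variables`; repeat n_swaps times."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]          # work on copies, leave the input intact
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        for var in variables:
            out[i][var], out[j][var] = out[j][var], out[i][var]
    return out

# Obs 1-6 from the example.
records = [
    {"obs": 1, "gender": "Male", "age": 13.2},
    {"obs": 2, "gender": "Female", "age": 43.7},
    {"obs": 3, "gender": "Male", "age": 35.3},
    {"obs": 4, "gender": "Female", "age": 12.4},
    {"obs": 5, "gender": "Male", "age": 89.5},
    {"obs": 6, "gender": "Female", "age": 27.9},
]

swapped = swap_values(records, ["gender"], n_swaps=2, seed=7)
```

Swapping leaves the one-way counts of the swapped variable unchanged, but it can alter its relationship with the variables that were not swapped.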


10 SDL Techniques (cont.)

§ PRAM (Post Randomization)

Values of variables for each record in a microdata set are changed according to a known probabilistic methodology.

Values are changed according to a probability mechanism, so values after application of PRAM may or may not differ from the original values

§ Estimates after PRAM can be adjusted because of the known probability mechanism
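A Python sketch of PRAM on a two-category variable, together with the adjustment of released counts. The transition probabilities and function names are hypothetical. The adjustment solves observed = Pᵀ · true for the true counts, which is one common way to exploit the known mechanism.

```python
import random

# Hypothetical transition probabilities: P[a][b] = Pr(released = b | original = a).
P = {
    "Male": {"Male": 0.9, "Female": 0.1},
    "Female": {"Male": 0.2, "Female": 0.8},
}

def pram(values, P, seed=None):
    """Redraw each value from the row of P that belongs to its original category."""
    rng = random.Random(seed)
    released = []
    for v in values:
        cats = list(P[v])
        released.append(rng.choices(cats, weights=[P[v][c] for c in cats])[0])
    return released

def adjust_counts(observed, P, a, b):
    """Estimate the original counts of categories a and b from the released counts."""
    det = P[a][a] * P[b][b] - P[a][b] * P[b][a]
    true_a = (P[b][b] * observed[a] - P[b][a] * observed[b]) / det
    true_b = (P[a][a] * observed[b] - P[a][b] * observed[a]) / det
    return {a: true_a, b: true_b}

genders = ["Male", "Female", "Male", "Female", "Male", "Female"]
released = pram(genders, P, seed=11)

# If 600 Male and 400 Female records were expected to yield 620 and 380 after PRAM,
# the adjustment recovers the original counts.
estimate = adjust_counts({"Male": 620, "Female": 380}, P, "Male", "Female")
```

Individual released values cannot be trusted, yet aggregate estimates remain recoverable because the transition matrix is published.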


10 SDL Techniques (cont.)

§ Multiple/Single Imputation (synthetic data)

§ For single imputation, replace a value in a data set with a value either

§ 1) derived from a model of the population from which the data were derived or

§ 2) using some mathematical method to choose an imputed value that is “close” to the original value (e.g. Nearest neighbor)
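Option 2 can be sketched in Python as a nearest-neighbor replacement: the treated value is replaced by the closest value found in another record. The function name and the choice to treat only the extreme age (Obs 5 in the earlier examples) are assumptions.

```python
def nearest_neighbor_impute(values, targets):
    """Replace values[i], for each i in `targets`, with the closest value from another record."""
    out = list(values)
    for i in targets:
        donors = [j for j in range(len(values)) if j != i]
        j = min(donors, key=lambda k: abs(values[k] - values[i]))
        out[i] = values[j]
    return out

# Ages for Obs 1-6 from the earlier examples.
ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]

# Assumed rule: impute only the extreme value, Obs 5 (list index 4).
released = nearest_neighbor_impute(ages, targets=[4])
```

Here the age 89.5 is replaced by 43.7, the nearest age in the file. With richer data the distance would normally be computed on several variables.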


10 SDL Techniques (cont.)

§ Multiple/Single Imputation (synthetic data)

§ For “full” multiple imputation: model the population distribution of the variables contained in the microdata, generate realizations of the microdata under the developed model, and release the set of generated realizations.

§ For “partial” multiple imputation: model the population distribution of the variables contained in the microdata, generate realizations of “parts” of the microdata under the developed model, and release the set of generated realizations along with the un-imputed data.
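A deliberately simple Python sketch of the "full" case for a single variable. The normal model, the number of released data sets (m = 3), and the function name are assumptions. A real application would model all variables jointly and would also reflect uncertainty in the model parameters.

```python
import random
from statistics import mean, stdev

def synthesize(values, m, seed=None):
    """Fit a normal model to `values` and draw m synthetic data sets of the same size."""
    rng = random.Random(seed)
    mu, sigma = mean(values), stdev(values)   # assumed model: Normal(mu, sigma)
    return [
        [round(rng.gauss(mu, sigma), 1) for _ in values]
        for _ in range(m)
    ]

# Ages for Obs 1-6 from the earlier examples.
ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]

synthetic_sets = synthesize(ages, m=3, seed=2005)
```

None of the released values belongs to a real respondent. For the "partial" case only the selected records or variables would be replaced, and the rest released as collected. Note that a normal model can generate negative ages, which illustrates why the choice of model matters.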


10 SDL Techniques (cont.)

§ MASSC

§ Uses recoding, substitution, and sampling to change the values of key variables

§ Sampling weights are created or adjusted to allow for accurate estimation of “totals”


After all that…how do we compare methods?

§ Some perturb data: All but Global recoding, Swapping, and Local suppression

§ Some require construction of models: Single and Multiple imputation

§ Some are easier to implement than others: Global recoding, Local Suppression, Rounding


After all that…how do we compare methods? (cont.)

Existence of software

§ Not much SDL software available

§ µ-Argus – Publicly Available, Free

§ MASSC – Software exists, provided as service

§ Privacert Appliance – Commercial, appears to use suppression

§ IVEware – General imputation software for SAS


After all that…how do we compare methods? (cont.)

Impact on information loss

§ No generally accepted measure of information loss

§ Some methods provide estimates of information loss: PRAM and MASSC


After all that…how do we compare methods? (cont.)

Analyzing Data After Treatment

§ Global recoding (GR) – no adjustments necessary

§ Swapping/Local suppression/rounding – no general adjustments exist

§ Imputation – adjusted variance formulas exist in certain implementations

§ Sampling/MASSC – sampling weights can be used to adjust estimates

§ PRAM – Estimates adjusted because probability mechanism is known


After all that…how do we compare methods? (cont.)

§ Ability to assess Risk

§ PRAM, Imputation, and MASSC all provide some measure of “risk”

§ No one measure of risk exists, though record linkage techniques (probabilistic, distance-based) have been used to compare different methods
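A Python sketch of distance-based record linkage as a way to compare methods. Each released value is linked to the nearest original value, and the share of correct links serves as a rough risk score. The function name and the single-variable distance are assumptions; the original and treated ages are the ones from the noise addition and microaggregation examples.

```python
def linkage_rate(original, released):
    """Share of released records whose nearest original value is their own source record."""
    hits = 0
    for i, value in enumerate(released):
        nearest = min(range(len(original)), key=lambda j: abs(original[j] - value))
        hits += nearest == i
    return hits / len(released)

# Ages for Obs 1-6 from the earlier examples.
original = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]
after_noise = [13.9, 41.3, 33.1, 14.8, 93.2, 29.1]
after_microaggregation = [46, 28, 46, 28, 46, 28]

risk_noise = linkage_rate(original, after_noise)               # 5 of 6 records re-linked
risk_micro = linkage_rate(original, after_microaggregation)    # 1 of 6 records re-linked
```

On this toy file, the noise-added ages are re-linked far more often than the microaggregated ones. A single variable and six records are too little to draw conclusions, but the same comparison can be run on real files with multivariate distances.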


Summary

§ There are technical and other motivations (legal, political) that affect which SDL methods are used

§ Find a balance between disclosure risk and information loss

§ No method is right for every situation

§ Consult your local statistician!


There can be only one…Reference

§ There are many, many journal articles dealing with statistical disclosure

§ Too many to list here

§ One good reference:

Willenborg, Leon and de Waal, Ton (2001), Elements of Statistical Disclosure Control, Lecture Notes in Statistics, Springer-Verlag.


Contact for Additional Information

David Wilson

RTI International

[email protected]

http://www.rti.org/JSM