TRANSCRIPT
Perfectly Anonymous Data is Perfectly Useless Data
Micah Altman, Director of Research, MIT Libraries
Prepared for
NISO-RDA Symposium: Privacy Implications of Research Data
Denver, CO, September 2016
DISCLAIMER: These opinions are my own; they are not the opinions of MIT, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. DeMille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
Collaborators & Co-Conspirators
Privacy Tools for Sharing Research Data Team (Salil Vadhan, P.I.)
http://privacytools.seas.harvard.edu/people
Research Support: Supported in part by NSF grant CNS-123723 and in part by the Sloan Foundation
Related Work
Main Project: Privacy Tools for Sharing Research Data
http://privacytools.seas.harvard.edu/
Related publications:
Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072. 2016.
Wood A, Airoldi E, Altman M, de Montjoye Y, Gasser U, O'Brien D, Vadhan S. Privacy Tools Project Response to Common Rule Notice of Proposed Rule Making. Comments on Regulations.gov. 2016. (Copy available here: http://informatics.mit.edu/publications/privacy-tools-project-response-common-rule-notice-proposed-rule-making)
Vayena E, Gasser U, Wood A, O'Brien D, Altman M. Elements of a New Ethical and Regulatory Framework for Big Data Research. Washington and Lee Law Review. 2016;72(3):420-442.
Altman M, Capps C, Prevost R. Location Confidentiality and Official Surveys. Social Science Research Network [Internet]. 2016.
Slides and reprints available from: informatics.mit.edu
Today’s Perspectives & Provocations
* What is information privacy? *
* Privacy, Utility and Use *
* Calibrating Protection and Risk *
What is Information Privacy?
Some Definitions
Privacy: Control over the extent and circumstances of sharing
Confidentiality: Control over disclosure of information
Identifiability: Potential for learning about individuals based on their inclusion in a dataset
Sensitivity: Potential for harm if information is disclosed and used to learn about individuals
Privacy is not Anonymization
Anonymization / de-identification / PII are legal concepts; there may be no rigorous formal definition.
The definition varies by law, and may include:
• Presence of specific attributes (e.g., PII, HIPAA identifiers)
• Feasibility of record linkage
• Evaluation of the knowledge of the data publisher (e.g., "no actual knowledge", "readily ascertainable")
Legal definitions do not match statistical/scientific concepts.
Privacy is not Information Security
Confidentiality (Secrecy)
Integrity
Availability
Authenticity
Non-Repudiation
• Information security maintains many security properties within an information system
• Failures of information secrecy may also violate privacy
• Information security controls access… it does not control computation, use, or inference
• Information security functions only within an information system… privacy depends on what happens to information before, during, and after it is in an information system
Privacy aims to Prevent Harm from Disclosure
Data Subjects • Vulnerable Groups • Institutions • Society
Public Data isn’t Always …
The "Doesn't Stay in Vegas" problem – information shared locally can be found anywhere
• Digitization and APIs make aggregation easy
• Many "public records" were not practically public – until now
• Creates new business models
• Provokes sector-by-sector legislative reaction
More Information
• Sweeney, Latanya. "Discrimination in online ad delivery." Queue 11.3 (2013): 10.
• Wikipedia Mugshot database
Privacy, Utility, and Use (Cases)
No-Free-Lunch for Privacy
Any data analysis that is useful* leaks some measurable private information
Scientific Trend: More Sophisticated “Reidentification” Concepts
Record linkage ("Where's Waldo")
• Match a real person to a precise record in a database
• Examples: direct identifiers
• Caveats: Satisfies compliance for specific laws, but not generally; substantial potential for harm remains
Indistinguishability ("hidden in the crowd")
• Individuals can be linked only to a cluster of records (of known size)
• Examples: K-anonymity, attribute disclosure
• Caveats: Potential for substantial harms may remain, must specify what external information is observable, & need diversity for sensitive attributes
Limited Adversarial Learning ("confidentiality guaranteed")
• Formally bounds the total learning about any individual that occurs from a data release
• Examples: differential privacy, zero-knowledge proofs
• Caveats: Challenging to implement, often requires interactive systems
More Information
• Willenborg, Leon, and Ton de Waal. Elements of Statistical Disclosure Control. Vol. 155. Springer Science & Business Media, 2012.
• Fung, Benjamin, et al. "Privacy-preserving data publishing: A survey of recent developments." ACM Computing Surveys (CSUR) 42.4 (2010): 14.
• Dwork, Cynthia. "Differential privacy: A survey of results." International Conference on Theory and Applications of Models of Computation. Springer Berlin Heidelberg, 2008.
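The "hidden in the crowd" notion above can be made concrete: a table is k-anonymous when every combination of quasi-identifier values appears at least k times. A minimal sketch (the records and column names are hypothetical, not from the talk):

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which the table is k-anonymous:
    the size of the smallest group of rows sharing the same
    quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical, already-generalized records.
records = [
    {"birth_year": "1972", "zipcode": "940**", "gender": "M"},
    {"birth_year": "1972", "zipcode": "940**", "gender": "M"},
    {"birth_year": "1961", "zipcode": "021**", "gender": "M"},
    {"birth_year": "1961", "zipcode": "021**", "gender": "M"},
]
print(k_anonymity(records, ["birth_year", "zipcode", "gender"]))  # 2
```

Note the caveat from the slide: even a large k does not protect a sensitive attribute shared by everyone in a group, which is why diversity of sensitive values is also needed.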
What use is it?
Utility is defined broadly as the analytical value of the data.
There is no universal operational definition of utility.
There is a structural tradeoff between privacy and utility: you can't maximize both simultaneously.
However, new methods can sometimes do better than traditional anonymization on both fronts.
Some Approaches to Measuring Utility
Statistical Inference: Precision • Bias • Variance
Semantic: Truthfulness • Completeness • Consistency • Interpretability
Computational: Entropy • Information Complexity
Image Sources• https://commons.wikimedia.org/wiki/File:Solna_Brick_wall_Stretcher_bond_variation1.jpg• https://commons.wikimedia.org/wiki/File:Archery_Target_80cm.svg
x ∈ D → x′ ∈ D′
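One of the computational measures above, entropy, can be compared before and after anonymization: a coarser global recode carries less distinguishing information. A minimal sketch (the zipcode values are illustrative only):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a column's empirical distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

original = ["02145", "02138", "94043", "94043", "94041", "02127"]
recoded = [z[:3] + "**" for z in original]  # global recode to a 3-digit prefix

print(entropy(original))  # higher: more distinguishing information
print(entropy(recoded))   # lower: both utility and disclosure risk reduced
```

The same tradeoff appears at every level: reducing the information in D′ reduces both what an analyst and an adversary can learn.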
Challenges to Measuring Utility
Utility may not track quality, or value
• Quality depends on intended use
• Value depends on intended and unintended uses, and on markets
Utility at one level doesn't match utility at other levels
• Truthful records ≠ accurate aggregates
(Levels: Measure • Record • Aggregate Quantity • Model • Data)
Traditional Confidentiality-Enhancing Data Publishing
Published Outputs
* Jones * * 1961 021*
* Jones * * 1961 021*
* Jones * * 1972 9404*
* Jones * * 1972 9404*
* Jones * * 1972 9404*
Modal Practice: "The correlation between X and Y was large and statistically significant"
Summary statistics
Contingency table
Public use sample microdata
Information Visualization
Name  SSN  Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
A. Jones 12341 01011961 02145 M Raspberry 0
B. Jones 12342 02021961 02138 M Pistachio 0
C. Jones 12343 11111972 94043 M Chocolate 0
D. Jones 12344 12121972 94043 M Hazelnut 0
E. Jones 12345 03251972 94041 F Lemon 0
F. Jones 12346 03251972 02127 F Lemon 1
G. Jones 12347 08081989 02138 F Peach 1
H. Smith 12348 01011973 63200 F Lime 2
I. Smith 12349 02021973 63300 M Mango 4
J. Smith 12350 02021973 63400 M Coconut 16
K. Smith 12351 03031974 64500 M Frog 32
L. Smith 12352 04041974 64600 M Vanilla 64
M. Smith 12353 04041974 64700 F Pumpkin 128
N. Smith 12354 04041974 64800 F Allergic 256
Just some data?
Name  SSN  Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
A. Jones 12341 01011961 02145 M Raspberry 0
B. Jones 12342 02021961 02138 M Pistachio 0
C. Jones 12343 11111972 94043 M Chocolate 0
D. Jones 12344 12121972 94043 M Hazelnut 0
E. Jones 12345 03251972 94041 F Lemon 0
F. Jones 12346 03251972 02127 F Lemon 1
G. Jones 12347 08081989 02138 F Peach 1
H. Smith 12348 01011973 63200 F Lime 2
I. Smith 12349 02021973 63300 M Mango 4
J. Smith 12350 02021973 63400 M Coconut 16
K. Smith 12351 03031974 64500 M Frog 32
L. Smith 12352 04041974 64600 M Vanilla 64
M. Smith 12353 04041974 64700 F Pumpkin 128
N. Smith 12354 04041974 64800 F Allergic 256
What's wrong with this picture?
(Annotations on the table: Identifier • Private Identifier • Sensitive • Unexpected response? • Mass. resident • FERPA too? • Californian • Twins, separated at birth?)
Help, help, I’m being suppressed…
Name  SSN  Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
[Name 1] 12341 *1961 021* M Raspberry .1
[Name 2] 12342 *1961 021* M Pistachio -.1
[Name 3] 12343 *1972 940* M Chocolate 0
[Name 4] 12344 *1972 940* M Hazelnut 0
[Name 5] 12345 *1972 940* F Lemon .6
[Name 6] 12346 *1972 021* F Lemon .6
[Name 7] 12347 *1989 021* * Peach 64.6
[Name 8] 12348 *1973 632* F Lime 3
[Name 9] 12349 *1973 633* M Mango 3
[Name 10] 12350 *1973 634* M Coconut 37.2
[Name 11] 12351 *1974 645* M * 37.2
[Name 12] 12352 *1974 646* M Vanilla 37.2
[Name 13] 12353 *1974 647* F * 64.4
[Name 14] 12354 *1974 648* F Allergic 256
(Techniques illustrated: Redaction • Synthetic variable • Global recode • Local suppression • Aggregation + Perturbation)
Traditional Static Suppression
Data reduction: Observation • Measure • Cell
Perturbation: Microaggregation • Rule-based data swapping • Adding noise
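A minimal sketch of two of the traditional techniques above: a global recode of a birthdate to its year, and perturbation of a sensitive count by adding noise. (The record layout and noise scale are illustrative assumptions, not from the talk.)

```python
import random

def global_recode_birthdate(mmddyyyy):
    """Generalize an MMDDYYYY birthdate to '*YYYY' (global recode)."""
    return "*" + mmddyyyy[-4:]

def perturb_count(count, scale=1.0, rng=random):
    """Add zero-mean Gaussian noise to a sensitive count (perturbation).
    The scale here is arbitrary; real schemes calibrate it to the data."""
    return count + rng.gauss(0, scale)

record = {"birthdate": "01011961", "crimes": 0}
record["birthdate"] = global_recode_birthdate(record["birthdate"])
record["crimes"] = perturb_count(record["crimes"])
print(record["birthdate"])  # *1961
```

As the utility discussion above notes, the recode biases date-based analyses while the noise increases variance: both costs are paid even when the protection achieved is hard to quantify.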
Perfect Privacy*
Name  SSN  Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
* * * * * * *
* * * * * * *
* * * * * * *
(… all fourteen rows fully suppressed …)
* Global Tabular Suppression
How Does Anonymization Affect Utility?
(Utility measure × level of data: Measure | Record | Aggregate | Model)
Bias: Generalization | K-anon, cell suppression, record deletion (+ any measure bias) | Column deletion (+ any aggregation bias), synthetic data | (N/A)
Precision: Cell perturbation (all methods increase variance)
Truthfulness: Cell perturbation, microaggregation, swapping, synthetic data | Cell perturbation, microaggregation, swapping | (Bias is analogous to truthfulness – see above)
Completeness: Column suppression | Record suppression (cell suppression affects completeness & interpretability) | (Record incompleteness increases bias) | (N/A)
Consistency: Cell perturbation, microaggregation | Cell perturbation | (N/A)
Interpretability: Cell perturbation, microaggregation | Cell perturbation, cell suppression, microaggregation
All anonymization methods affect variance, informational complexity, and entropy
Special Challenges to Research Uses
Computational Replication / Process Verification / Historical Verification
Reuse, Reanalysis, Extension
Data Integration across Separate Sources
Data Reuse
Data Integration:
• Pooling Measures (Vertical Partitioning)
• Pooling Subjects (Horizontal Partitioning)
• Pooling Studies (Meta-analysis)
• Pooling Time (Longitudinal Integration)
• Follow-ups/recontacts (Adding Waves, Future Data Collection)
Data Reuse:
• Research Extension / Reanalysis / Critique
• New Model / Analysis / Estimation
• Extraction of New Signals (Measures)
Data Integration and Privacy Challenges
Aim: Integrate evidence across multiple independent databases
Horizontal integration is possible – if the sources are known to be different sample observations of the same population
Most privacy protections impede:
• Vertical (multi-measure) integration
• Longitudinal integration
• Full meta-analysis
Anonymization impedes follow-up studies
Mitigating Approaches:
• Tiered access
• Third-party escrow
• Artificial linking identifiers – can leak information; not robust to errors in identifying information; challenging to coordinate across institutions
• Secure multiparty computation
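Secure multiparty computation, the last mitigating approach listed, lets parties compute a joint statistic without revealing their inputs. A toy sketch of additive secret sharing over a prime field (a real protocol also needs secure channels, cryptographic randomness, and malicious-party defenses):

```python
import random

P = 2**61 - 1  # a large prime modulus for the share arithmetic

def share(secret, n_parties=3):
    """Split an integer into n additive shares that sum to it mod P.
    Any subset of fewer than n shares reveals nothing about the secret."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Two sites each share a sensitive count; parties add their shares
# locally, and only the total is ever reconstructed.
a_shares = share(120)
b_shares = share(85)
sum_shares = [(x + y) % P for x, y in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 205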
Reuse and Privacy Challenges
Aim: Critique previous results, extend models/methods, ask new questions – possibly far in the future
Methods that impede data integration partially impede critique, extension, and new questions
Synthetic data approaches and suppression of measures can impede new questions
Privacy protections that affect interpretability are a challenge to meaningful long-term access
Generalization, synthetic data approaches, and aggregation impede extraction of new measures for new questions
Mitigating Approaches: Broad consent • Tiered access
Replication/Verification
Some common characteristics:
• Uses the "original" data, model, methods, implementation
• Corresponds to those used when results were published
• Often archived as "replication packages"
Calibration:
• Verify understanding of software, method, model before extending
• Often a preliminary to reuse
Check research integrity:
• Recreate, as closely as possible, the process actually followed by an author in producing previously published results
• Detect substantial omissions in methods, unsophisticated data manipulation, misleading precision in publication
• Retrospective process quality evaluation
Evaluate process methods in practice:
• Detect transcription errors, numerical issues, software bugs
Historical/legal verification:
• Verify that some phenomenon/relation existed (or did not exist) in the data
• Verify a historical aggregate
• Did some individual have (or not have) a particular property?
Replication, Verification and Privacy Challenges
Verify a particular process and results from a prior publication/statement – affected by variance, bias, interpretability, completeness
Calibrate prior to extension or critique – affected by bias, variance, interpretability
Assess process/method robustness – affected by precision, variance, bias
Verify historical statements about records – affected by truthfulness, completeness
Mitigating Approaches:
• Tiered access
• Consent for replication
• Privacy-aware publishing: publish research only on post-protected data; avoid publishing results with unnecessary precision; document uncertainty
• Replication servers
• Process logs + audit trails
Big Data Research and Privacy Challenges
Big data can be rich, messy & surprising – the "Blog problem": pseudonymous communication used for topic mining can be linked through stylometric analysis
Observable behavior leaves unique "fingerprints" – the "GIS problem": location trails are individualistic, externally observable, and difficult to mask
Traditional anonymization methods can destroy utility – the "Netflix problem": many people may have unique long-tail behavior
More Information
• Novak, J., P. Raghavan, and A. Tomkins. "Anti-aliasing on the web." Proceedings of the 13th International Conference on World Wide Web, 2004.
• Narayanan, Arvind, and Vitaly Shmatikov. "Robust de-anonymization of large sparse datasets." 2008 IEEE Symposium on Security and Privacy (SP 2008). IEEE, 2008.
• de Montjoye, Yves-Alexandre, et al. "Unique in the crowd: The privacy bounds of human mobility." Scientific Reports 3 (2013).
Computational Methods Beyond Anonymization
Controlling Access: Virtual data enclaves
Controlling Computation: Secure multiparty computation; functional encryption; homomorphic encryption; blockchain
Controlling Inference: Differential privacy
Restricting Use: Executable policy languages
More Information
• Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072. 2016.
• Altman M, Capps C, Prevost R. Location Confidentiality and Official Surveys. Social Science Research Network [Internet]. 2016.
Caution: No Current Method Cures Algorithmic Discrimination
Big data algorithms may pick up unanticipated relationships in data
Algorithms that incorporate human behavior may amplify human biases
Excluding sensitive input is not enough:
• Check if classifications have distributional bias
• Check if classification error is biased
More Information
• Sweeney, Latanya. "Discrimination in online ad delivery." Queue 11.3 (2013): 10.
• Larson, et al. How We Analyzed the COMPAS Recidivism Algorithm. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
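The two checks above can be sketched directly: compare positive-classification rates and error rates across groups. (The records, group labels, and predictions below are synthetic, purely for illustration.)

```python
def rates_by_group(records):
    """Per-group positive-prediction rate and classification error rate."""
    stats = {}
    for r in records:
        g = stats.setdefault(r["group"], {"n": 0, "pos": 0, "err": 0})
        g["n"] += 1
        g["pos"] += r["pred"]
        g["err"] += int(r["pred"] != r["actual"])
    return {k: {"pos_rate": v["pos"] / v["n"], "err_rate": v["err"] / v["n"]}
            for k, v in stats.items()}

# Synthetic classifier output: note that group membership is never an
# input to the classifier, yet rates can still diverge across groups.
data = [
    {"group": "A", "pred": 1, "actual": 1},
    {"group": "A", "pred": 0, "actual": 0},
    {"group": "B", "pred": 1, "actual": 0},
    {"group": "B", "pred": 1, "actual": 1},
]
print(rates_by_group(data))
```

Large gaps between groups in either rate are the signal the slide warns about; as the ProPublica COMPAS analysis shows, error rates can be biased even when overall accuracy looks balanced.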
Bad News/Good News for Big Data and Privacy
Bad News: De-identification is not enough: accumulation of many information releases may compose into substantial disclosure, and/or lead to algorithmic discrimination
Good News: large numbers are your friend- new methods such as differential privacy can provide strong disclosure protection and good estimates for large samples
Bad News: Traditional anonymization primitives (e.g. local suppression, topcoding) can add variance, require reweighting, and bias answers
Good News: some new cryptography-based methods (multiparty secure computation, some differentially private algorithms) are unbiased, and directly interpretable
Bad News: Traditional methods focus explicitly on limiting data release
Good News: new frameworks for limiting inferences (differential privacy); computation (multiparty secure computation); and uses (policy languages)
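The "large numbers are your friend" point can be illustrated with the Laplace mechanism from differential privacy: the noise scale is fixed by sensitivity and ε, not by sample size, so its relative effect on a count shrinks as the sample grows. A minimal sketch (ε and the counts are illustrative):

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Draw from Laplace(0, scale) by inverse-CDF sampling."""
    u = rng.random() - 0.5          # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon=0.1):
    """Release a count under epsilon-differential privacy.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)

# The noise scale (10 for epsilon = 0.1) is the same for any sample size,
# so it is negligible for a count of 1,000,000 but overwhelming for 5.
print(dp_count(1_000_000))
```

Because the released value is the true count plus zero-mean noise, the estimate is unbiased and directly interpretable, unlike suppression or topcoding.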
Bake Privacy In
Fair Information Practice:
• Notice/awareness
• Choice/consent
• Access/participation (verification, accuracy, correction)
• Integrity/security
• Enforcement/redress (self-regulation, private remedies, government enforcement)
Privacy by Design:
• Proactive not reactive; preventative not remedial
• Privacy as the default setting
• Privacy embedded into design
• Full functionality – positive-sum, not zero-sum
• End-to-end security – full lifecycle protection
• Visibility and transparency – keep it open
• Respect for user privacy – keep it user-centric
OECD Principles:
• Collection limitation
• Data quality
• Purpose specification
• Use limitation
• Security safeguards
• Openness
• Individual participation
• Accountability
Notice and Consent, by Itself, Does Not Scale
Most people believe they have lost control of their private information
Many individuals do not understand how and what is shared, or potential consequences
Most people encounter thousands of “terms of use” every year
Because big data is rich, messy and surprising -- future uses are more difficult to anticipate
More Information
• "The State of Privacy in America", 2016, Pew Fact Tank: http://www.pewresearch.org/fact-tank/2016/01/20/the-state-of-privacy-in-america/
• Big Data and Privacy: A Technological Perspective, PCAST, 2014. https://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdf
• McDonald, Aleecia M., and Lorrie Faith Cranor. "The cost of reading privacy policies." ISJLP 4 (2008): 543.
Lifecycle approach to data management:
• Review uses, threats, and vulnerabilities as information is used over time
• Select appropriate controls at each stage
Catalog of privacy controls: procedural, technical, educational, economic, and legal means for enhancing privacy at each stage of the information lifecycle
Stage: Access/Release
Procedural: Access controls; consent; expert panels; individual privacy settings; presumption of openness vs. privacy; purpose specification; registration; restrictions on use by data controller; risk assessments
Economic: Access/use fees (for data controller or subjects); property rights assignment
Educational: Data asset registers; notice; transparency
Legal: Integrity and accuracy requirements; data use agreements (contract with data recipient) / terms of service
Technical: Authentication; computable policy; differential privacy; encryption (incl. functional, homomorphic); interactive query systems; secure multiparty computation
Calibrating Controls
Illustrating how to choose privacy controls that are consistent with the uses, threats, and vulnerabilities at each lifecycle stage
More Information
• Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072. 2016.
Principles of a Modern Approach to Information Privacy & Confidentiality
Calibrating privacy and security controls to the intended uses and privacy risks associated with the data;
When conceptualizing informational risks, considering not just reidentification risks but also inference risks, or the potential for others to learn about individuals from the inclusion of their information in the data;
Addressing informational risks using a combination of privacy and security controls rather than relying on a single control such as consent or deidentification;
Anticipating, regulating, monitoring, and reviewing interactions with data across all stages of the lifecycle (including the post-access stages), as risks and methods will evolve over time; and
Recommended Readings
Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072. 2016.
Vayena E, Gasser U, Wood A, O'Brien D, Altman M. Elements of a New Ethical and Regulatory Framework for Big Data Research. Washington and Lee Law Review. 2016;72(3):420-442.
Creative Commons License
This work by Micah Altman is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.