TRANSCRIPT
Perfectly Anonymous Data is Perfectly Useless Data
Micah Altman, Director of Research, MIT Libraries
Prepared for
NISO-RDA Symposium: Privacy Implications of Research Data
Denver, CO, September 2016
DISCLAIMER: These opinions are my own; they are not the opinions of MIT, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. DeMille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
Collaborators & Co-Conspirators
Privacy Tools for Sharing Research Data Team (Salil Vadhan, P.I.)
http://privacytools.seas.harvard.edu/people
Research Support: Supported in part by NSF grant CNS-123723 and in part by the Sloan Foundation
Related Work
Main Project: Privacy Tools for Sharing Research Data
http://privacytools.seas.harvard.edu/
Related publications:
Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072. 2016.
Wood A, Airoldi E, Altman M, de Montjoye Y, Gasser U, O'Brien D, Vadhan S. Privacy Tools Project Response to Common Rule Notice of Proposed Rule Making. Comments on Regulations.gov. 2016. (Copy available here: http://informatics.mit.edu/publications/privacy-tools-project-response-common-rule-notice-proposed-rule-making)
Vayena E, Gasser U, Wood A, O'Brien D, Altman M. Elements of a New Ethical and Regulatory Framework for Big Data Research. Washington and Lee Law Review. 2016;72(3):420-442.
Altman M, Capps C, Prevost R. Location Confidentiality and Official Surveys. Social Science Research Network [Internet]. 2016.
Slides and reprints available from: informatics.mit.edu
Today’s Perspectives & Provocations
* What is information privacy? *
* Privacy, Utility and Use *
* Calibrating Protection and Risk *
What is Information Privacy?
Some Definitions
Privacy: Control over the extent and circumstances of sharing
Confidentiality: Control over disclosure of information
Identifiability: Potential for learning about individuals based on their inclusion in a dataset
Sensitivity: Potential for harm if information is disclosed and used to learn about individuals
Privacy is not Anonymization
Anonymization / de-identification / PII are legal concepts; there may be no rigorous formal definition.
The definition varies by law, and may include:
• Presence of specific attributes (e.g., PII, HIPAA identifiers)
• Feasibility of record linkage
• Evaluation of the knowledge of the data publisher (e.g., "no actual knowledge", "readily ascertainable")
Legal definitions do not match statistical/scientific concepts.
Privacy is not Information Security
Confidentiality (Secrecy)
Integrity
Availability
Authenticity
Non-Repudiation
• Information security maintains many security properties within an information system
• Failures of information secrecy may also violate privacy
• Information security controls access… it does not control computation, use, or inference
• Information security functions only within an information system… privacy depends on what happens to information before, during, and after it is in an information system
Privacy aims to Prevent Harm from Disclosure
Data Subjects • Vulnerable Groups • Institutions • Society
Public Data isn’t Always …
The "Doesn't Stay in Vegas" problem – information shared locally can be found anywhere
• Digitization and APIs make aggregation easy
• Many "public records" were not practically public – until now
• Creates new business models
• Provokes sector-by-sector legislative reaction
More Information
• Sweeney, Latanya. "Discrimination in online ad delivery." Queue 11.3 (2013): 10.
• Wikipedia Mugshot database
Privacy, Utility, and Use (Cases)
No-Free-Lunch for Privacy
Any data analysis that is useful* leaks some measurable private information
Scientific Trend: More Sophisticated “Reidentification” Concepts
Record linkage ("Where's Waldo")
• Match a real person to a precise record in a database
• Examples: direct identifiers
• Caveats: Satisfies compliance for specific laws, but not generally; substantial potential for harm remains
Indistinguishability ("hidden in the crowd")
• Individuals can be linked only to a cluster of records (of known size)
• Examples: K-anonymity, attribute disclosure
• Caveats: Potential for substantial harms may remain, must specify what external information is observable, & need diversity for sensitive attributes
Limited Adversarial Learning ("confidentiality guaranteed")
• Formally bounds the total learning about any individual that occurs from a data release
• Examples: differential privacy, zero-knowledge proofs
• Caveats: Challenging to implement, often requires interactive systems
More Information
• Willenborg, Leon, and Ton de Waal. Elements of Statistical Disclosure Control. Vol. 155. Springer Science & Business Media, 2012.
• Fung, Benjamin, et al. "Privacy-preserving data publishing: A survey of recent developments." ACM Computing Surveys (CSUR) 42.4 (2010): 14.
• Dwork, Cynthia. "Differential privacy: A survey of results." International Conference on Theory and Applications of Models of Computation. Springer Berlin Heidelberg, 2008.
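The "hidden in the crowd" notion above can be made concrete: a table is k-anonymous when every combination of quasi-identifier values appears at least k times. A minimal sketch (the records and column names are hypothetical, not from the talk):

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which the table is k-anonymous:
    the size of the smallest group of rows sharing the same
    quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical, already-generalized records.
records = [
    {"birth_year": "1972", "zipcode": "940**", "gender": "M"},
    {"birth_year": "1972", "zipcode": "940**", "gender": "M"},
    {"birth_year": "1961", "zipcode": "021**", "gender": "M"},
    {"birth_year": "1961", "zipcode": "021**", "gender": "M"},
]
print(k_anonymity(records, ["birth_year", "zipcode", "gender"]))  # 2
```

Note the caveat from the slide: even a large k does not protect a sensitive attribute shared by everyone in a group, which is why diversity of sensitive values is also needed.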
What use is it?
Utility is defined broadly as the analytical value of the data.
There is no universal operational definition of utility.
There is a structural tradeoff between privacy and utility: you can't maximize both simultaneously.
However, new methods can sometimes do better than traditional anonymization on both fronts.
Some Approaches to Measuring Utility
Statistical Inference: Precision • Bias • Variance
Semantic: Truthfulness • Completeness • Consistency • Interpretability
Computational: Entropy • Information Complexity
Image Sources• https://commons.wikimedia.org/wiki/File:Solna_Brick_wall_Stretcher_bond_variation1.jpg• https://commons.wikimedia.org/wiki/File:Archery_Target_80cm.svg
x ∈ D → x′ ∈ D′
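One of the computational measures above, entropy, can be compared before and after anonymization: a coarser global recode carries less distinguishing information. A minimal sketch (the zipcode values are illustrative only):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a column's empirical distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

original = ["02145", "02138", "94043", "94043", "94041", "02127"]
recoded = [z[:3] + "**" for z in original]  # global recode to a 3-digit prefix

print(entropy(original))  # higher: more distinguishing information
print(entropy(recoded))   # lower: both utility and disclosure risk reduced
```

The same tradeoff appears at every level: reducing the information in D′ reduces both what an analyst and an adversary can learn.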
Challenges to Measuring Utility
Utility may not track quality, or value
• Quality depends on intended use
• Value depends on intended and unintended uses, and on markets
Utility at one level doesn't match utility at other levels
• Truthful records ≠ accurate aggregates
(Levels: Measure • Record • Aggregate Quantity • Model • Data)
Traditional Confidentiality-Enhancing Data Publishing
Published Outputs
* Jones * * 1961 021*
* Jones * * 1961 021*
* Jones * * 1972 9404*
* Jones * * 1972 9404*
* Jones * * 1972 9404*
Modal Practice: "The correlation between X and Y was large and statistically significant"
Summary statistics
Contingency table
Public use sample microdata
Information Visualization
Name  SSN  Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
A. Jones 12341 01011961 02145 M Raspberry 0
B. Jones 12342 02021961 02138 M Pistachio 0
C. Jones 12343 11111972 94043 M Chocolate 0
D. Jones 12344 12121972 94043 M Hazelnut 0
E. Jones 12345 03251972 94041 F Lemon 0
F. Jones 12346 03251972 02127 F Lemon 1
G. Jones 12347 08081989 02138 F Peach 1
H. Smith 12348 01011973 63200 F Lime 2
I. Smith 12349 02021973 63300 M Mango 4
J. Smith 12350 02021973 63400 M Coconut 16
K. Smith 12351 03031974 64500 M Frog 32
L. Smith 12352 04041974 64600 M Vanilla 64
M. Smith 12353 04041974 64700 F Pumpkin 128
N. Smith 12354 04041974 64800 F Allergic 256
Just some data?
Name  SSN  Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
A. Jones 12341 01011961 02145 M Raspberry 0
B. Jones 12342 02021961 02138 M Pistachio 0
C. Jones 12343 11111972 94043 M Chocolate 0
D. Jones 12344 12121972 94043 M Hazelnut 0
E. Jones 12345 03251972 94041 F Lemon 0
F. Jones 12346 03251972 02127 F Lemon 1
G. Jones 12347 08081989 02138 F Peach 1
H. Smith 12348 01011973 63200 F Lime 2
I. Smith 12349 02021973 63300 M Mango 4
J. Smith 12350 02021973 63400 M Coconut 16
K. Smith 12351 03031974 64500 M Frog 32
L. Smith 12352 04041974 64600 M Vanilla 64
M. Smith 12353 04041974 64700 F Pumpkin 128
N. Smith 12354 04041974 64800 F Allergic 256
What's wrong with this picture?
(Annotations on the table: Identifier • Private Identifier • Sensitive • Unexpected response? • Mass. resident • FERPA too? • Californian • Twins, separated at birth?)
Help, help, I’m being suppressed…
Name  SSN  Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
[Name 1] 12341 *1961 021* M Raspberry .1
[Name 2] 12342 *1961 021* M Pistachio -.1
[Name 3] 12343 *1972 940* M Chocolate 0
[Name 4] 12344 *1972 940* M Hazelnut 0
[Name 5] 12345 *1972 940* F Lemon .6
[Name 6] 12346 *1972 021* F Lemon .6
[Name 7] 12347 *1989 021* * Peach 64.6
[Name 8] 12348 *1973 632* F Lime 3
[Name 9] 12349 *1973 633* M Mango 3
[Name 10] 12350 *1973 634* M Coconut 37.2
[Name 11] 12351 *1974 645* M * 37.2
[Name 12] 12352 *1974 646* M Vanilla 37.2
[Name 13] 12353 *1974 647* F * 64.4
[Name 14] 12354 *1974 648* F Allergic 256
(Techniques illustrated: Redaction • Synthetic variable • Global recode • Local suppression • Aggregation + Perturbation)
Traditional Static Suppression
Data reduction: Observation • Measure • Cell
Perturbation: Microaggregation • Rule-based data swapping • Adding noise
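A minimal sketch of two of the traditional techniques above: a global recode of a birthdate to its year, and perturbation of a sensitive count by adding noise. (The record layout and noise scale are illustrative assumptions, not from the talk.)

```python
import random

def global_recode_birthdate(mmddyyyy):
    """Generalize an MMDDYYYY birthdate to '*YYYY' (global recode)."""
    return "*" + mmddyyyy[-4:]

def perturb_count(count, scale=1.0, rng=random):
    """Add zero-mean Gaussian noise to a sensitive count (perturbation).
    The scale here is arbitrary; real schemes calibrate it to the data."""
    return count + rng.gauss(0, scale)

record = {"birthdate": "01011961", "crimes": 0}
record["birthdate"] = global_recode_birthdate(record["birthdate"])
record["crimes"] = perturb_count(record["crimes"])
print(record["birthdate"])  # *1961
```

As the utility discussion above notes, the recode biases date-based analyses while the noise increases variance: both costs are paid even when the protection achieved is hard to quantify.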
Perfect Privacy*
Name  SSN  Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
* * * * * * *
* * * * * * *
* * * * * * *
(… all fourteen rows fully suppressed …)
* Global Tabular Suppression
How Does Anonymization Affect Utility?
(Utility measure × level of data: Measure | Record | Aggregate | Model)
Bias: Generalization | K-anon, cell suppression, record deletion (+ any measure bias) | Column deletion (+ any aggregation bias), synthetic data | (N/A)
Precision: Cell perturbation (all methods increase variance)
Truthfulness: Cell perturbation, microaggregation, swapping, synthetic data | Cell perturbation, microaggregation, swapping | (Bias is analogous to truthfulness – see above)
Completeness: Column suppression | Record suppression (cell suppression affects completeness & interpretability) | (Record incompleteness increases bias) | (N/A)
Consistency: Cell perturbation, microaggregation | Cell perturbation | (N/A)
Interpretability: Cell perturbation, microaggregation | Cell perturbation, cell suppression, microaggregation
All anonymization methods affect variance, informational complexity, and entropy
Special Challenges to Research Uses
Computational Replication / Process Verification / Historical Verification
Reuse, Reanalysis, Extension
Data Integration across Separate Sources
Data Reuse
Data Integration:
• Pooling Measures (Vertical Partitioning)
• Pooling Subjects (Horizontal Partitioning)
• Pooling Studies (Meta-analysis)
• Pooling Time (Longitudinal Integration)
• Follow-ups/recontacts (Adding Waves, Future Data Collection)
Data Reuse:
• Research Extension / Reanalysis / Critique
• New Model / Analysis / Estimation
• Extraction of New Signals (Measures)
Data Integration and Privacy Challenges
Aim: Integrate evidence across multiple independent databases
Horizontal integration is possible – if the sources are known to be different sample observations of the same population
Most privacy protections impede:
• Vertical (multi-measure) integration
• Longitudinal integration
• Full meta-analysis
Anonymization impedes follow-up studies
Mitigating Approaches:
• Tiered access
• Third-party escrow
• Artificial linking identifiers – can leak information; not robust to errors in identifying information; challenging to coordinate across institutions
• Secure multiparty computation
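Secure multiparty computation, the last mitigating approach listed, lets parties compute a joint statistic without revealing their inputs. A toy sketch of additive secret sharing over a prime field (a real protocol also needs secure channels, cryptographic randomness, and malicious-party defenses):

```python
import random

P = 2**61 - 1  # a large prime modulus for the share arithmetic

def share(secret, n_parties=3):
    """Split an integer into n additive shares that sum to it mod P.
    Any subset of fewer than n shares reveals nothing about the secret."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Two sites each share a sensitive count; parties add their shares
# locally, and only the total is ever reconstructed.
a_shares = share(120)
b_shares = share(85)
sum_shares = [(x + y) % P for x, y in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 205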
Reuse and Privacy Challenges
Aim: Critique previous results, extend models/methods, ask new questions – possibly far in the future
Methods that impede data integration partially impede critique, extension, and new questions
Synthetic data approaches and suppression of measures can impede new questions
Privacy protections that affect interpretability are a challenge to meaningful long-term access
Generalization, synthetic data approaches, and aggregation impede extraction of new measures for new questions
Mitigating Approaches: Broad consent • Tiered access
Replication/Verification
Some common characteristics:
• Uses the "original" data, model, methods, implementation
• Corresponds to those used when results were published
• Often archived as "replication packages"
Calibration:
• Verify understanding of software, method, model before extending
• Often a preliminary to reuse
Check research integrity:
• Recreate, as closely as possible, the process actually followed by an author in producing previously published results
• Detect substantial omissions in methods, unsophisticated data manipulation, misleading precision in publication
• Retrospective process quality evaluation
Evaluate process methods in practice:
• Detect transcription errors, numerical issues, software bugs
Historical/legal verification:
• Verify that some phenomenon/relation existed (or did not exist) in the data
• Verify a historical aggregate
• Did some individual have (or not have) a particular property?
Replication, Verification and Privacy Challenges
Verify a particular process and results from a prior publication/statement – affected by variance, bias, interpretability, completeness
Calibrate prior to extension or critique – affected by bias, variance, interpretability
Assess process/method robustness – affected by precision, variance, bias
Verify historical statements about records – affected by truthfulness, completeness
Mitigating Approaches:
• Tiered access
• Consent for replication
• Privacy-aware publishing: publish research only on post-protected data; avoid publishing results with unnecessary precision; document uncertainty
• Replication servers
• Process logs + audit trails
Big Data Research and Privacy Challenges
Big data can be rich, messy & surprising – the "Blog problem": pseudonymous communication used for topic mining can be linked through stylometric analysis
Observable behavior leaves unique "fingerprints" – the "GIS problem": location trails are individualistic, externally observable, and difficult to mask
Traditional anonymization methods can destroy utility – the "Netflix problem": many people may have unique long-tail behavior
More Information
• Novak, J., P. Raghavan, and A. Tomkins. "Anti-aliasing on the web." Proceedings of the 13th International Conference on World Wide Web, 2004.
• Narayanan, Arvind, and Vitaly Shmatikov. "Robust de-anonymization of large sparse datasets." 2008 IEEE Symposium on Security and Privacy (SP 2008). IEEE, 2008.
• de Montjoye, Yves-Alexandre, et al. "Unique in the crowd: The privacy bounds of human mobility." Scientific Reports 3 (2013).
Computational Methods Beyond Anonymization
Controlling Access: Virtual data enclaves
Controlling Computation: Secure multiparty computation; functional encryption; homomorphic encryption; blockchain
Controlling Inference: Differential privacy
Restricting Use: Executable policy languages
More Information
• Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072. 2016.
• Altman M, Capps C, Prevost R. Location Confidentiality and Official Surveys. Social Science Research Network [Internet]. 2016.
Caution: No Current Method Cures Algorithmic Discrimination
Big data algorithms may pick up unanticipated relationships in data
Algorithms that incorporate human behavior may amplify human biases
Excluding sensitive input is not enough:
• Check if classifications have distributional bias
• Check if classification error is biased
More Information
• Sweeney, Latanya. "Discrimination in online ad delivery." Queue 11.3 (2013): 10.
• Larson, et al. How We Analyzed the COMPAS Recidivism Algorithm. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
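The two checks above can be sketched directly: compare positive-classification rates and error rates across groups. (The records, group labels, and predictions below are synthetic, purely for illustration.)

```python
def rates_by_group(records):
    """Per-group positive-prediction rate and classification error rate."""
    stats = {}
    for r in records:
        g = stats.setdefault(r["group"], {"n": 0, "pos": 0, "err": 0})
        g["n"] += 1
        g["pos"] += r["pred"]
        g["err"] += int(r["pred"] != r["actual"])
    return {k: {"pos_rate": v["pos"] / v["n"], "err_rate": v["err"] / v["n"]}
            for k, v in stats.items()}

# Synthetic classifier output: note that group membership is never an
# input to the classifier, yet rates can still diverge across groups.
data = [
    {"group": "A", "pred": 1, "actual": 1},
    {"group": "A", "pred": 0, "actual": 0},
    {"group": "B", "pred": 1, "actual": 0},
    {"group": "B", "pred": 1, "actual": 1},
]
print(rates_by_group(data))
```

Large gaps between groups in either rate are the signal the slide warns about; as the ProPublica COMPAS analysis shows, error rates can be biased even when overall accuracy looks balanced.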
Bad News/Good News for Big Data and Privacy
Bad News: De-identification is not enough: accumulation of many information releases may compose into substantial disclosure, and/or lead to algorithmic discrimination
Good News: large numbers are your friend- new methods such as differential privacy can provide strong disclosure protection and good estimates for large samples
Bad News: Traditional anonymization primitives (e.g. local suppression, topcoding) can add variance, require reweighting, and bias answers
Good News: some new cryptography-based methods (multiparty secure computation, some differentially private algorithms) are unbiased, and directly interpretable
Bad News: Traditional methods focus explicitly on limiting data release
Good News: new frameworks for limiting inferences (differential privacy); computation (multiparty secure computation); and uses (policy languages)
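The "large numbers are your friend" point can be illustrated with the Laplace mechanism from differential privacy: the noise scale is fixed by sensitivity and ε, not by sample size, so its relative effect on a count shrinks as the sample grows. A minimal sketch (ε and the counts are illustrative):

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Draw from Laplace(0, scale) by inverse-CDF sampling."""
    u = rng.random() - 0.5          # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon=0.1):
    """Release a count under epsilon-differential privacy.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)

# The noise scale (10 for epsilon = 0.1) is the same for any sample size,
# so it is negligible for a count of 1,000,000 but overwhelming for 5.
print(dp_count(1_000_000))
```

Because the released value is the true count plus zero-mean noise, the estimate is unbiased and directly interpretable, unlike suppression or topcoding.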
Bake Privacy In
Fair Information Practice:
• Notice/awareness
• Choice/consent
• Access/participation (verification, accuracy, correction)
• Integrity/security
• Enforcement/redress (self-regulation, private remedies, government enforcement)
Privacy by Design:
• Proactive not reactive; preventative not remedial
• Privacy as the default setting
• Privacy embedded into design
• Full functionality – positive-sum, not zero-sum
• End-to-end security – full lifecycle protection
• Visibility and transparency – keep it open
• Respect for user privacy – keep it user-centric
OECD Principles:
• Collection limitation
• Data quality
• Purpose specification
• Use limitation
• Security safeguards
• Openness
• Individual participation
• Accountability
Notice and Consent, by Itself, Does Not Scale
Most people believe they have lost control of their private information
Many individuals do not understand how and what is shared, or potential consequences
Most people encounter thousands of “terms of use” every year
Because big data is rich, messy and surprising -- future uses are more difficult to anticipate
More Information
• "The State of Privacy in America", 2016, Pew Fact Tank: http://www.pewresearch.org/fact-tank/2016/01/20/the-state-of-privacy-in-america/
• Big Data and Privacy: A Technological Perspective, PCAST, 2014. https://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdf
• McDonald, Aleecia M., and Lorrie Faith Cranor. "The cost of reading privacy policies." ISJLP 4 (2008): 543.
Lifecycle approach to data management:
• Review uses, threats, and vulnerabilities as information is used over time
• Select appropriate controls at each stage
Catalog of privacy controls: procedural, technical, educational, economic, and legal means for enhancing privacy at each stage of the information lifecycle
Stage: Access/Release
Procedural: Access controls; consent; expert panels; individual privacy settings; presumption of openness vs. privacy; purpose specification; registration; restrictions on use by data controller; risk assessments
Economic: Access/use fees (for data controller or subjects); property rights assignment
Educational: Data asset registers; notice; transparency
Legal: Integrity and accuracy requirements; data use agreements (contract with data recipient) / terms of service
Technical: Authentication; computable policy; differential privacy; encryption (incl. functional, homomorphic); interactive query systems; secure multiparty computation
Calibrating Controls
Illustrating how to choose privacy controls that are consistent with the uses, threats, and vulnerabilities at each lifecycle stage
More Information
• Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072. 2016.
Principles of a Modern Approach to Information Privacy & Confidentiality
Calibrating privacy and security controls to the intended uses and privacy risks associated with the data;
When conceptualizing informational risks, considering not just reidentification risks but also inference risks, or the potential for others to learn about individuals from the inclusion of their information in the data;
Addressing informational risks using a combination of privacy and security controls rather than relying on a single control such as consent or deidentification;
Anticipating, regulating, monitoring, and reviewing interactions with data across all stages of the lifecycle (including the post-access stages), as risks and methods will evolve over time; and
Recommended Readings
Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072. 2016.
Vayena E, Gasser U, Wood A, O'Brien D, Altman M. Elements of a New Ethical and Regulatory Framework for Big Data Research. Washington and Lee Law Review. 2016;72(3):420-442.
Creative Commons License
This work by Micah Altman is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.