Topics in Big Data March 20, 2015
Aspects of Privacy and Big Data
Aspects of Privacy and Big Data
• Workshop on Big Data in Health Policy
• Workshop on Big Data in Social Policy
Data anonymisation (Significance, Oct 2014)
• anonymisation is meant to allow custodians of data sets to share the information with third parties
• custodians are typically governments and corporations
• a McKinsey report estimated that open data could generate $3 trillion per year in economic value (bit.ly/1tsTcve)
• with completely anonymised data, no one can be identified by that data on its own or in combination with other data
• completely anonymised data has little utility
• so the risk of identification must be estimated; it depends on how existing data is released and also on what other data may be available
Some high-profile failures
• Massachusetts Group Insurance Commission published data on all hospital visits by state employees
• birthdates, sex, zip codes, but no names
• Sweeney knew Governor Weld lived in Cambridge, MA
• purchased voter rolls
• only 6 people in Cambridge had the same birthday as the governor; only 3 were male; only one lived in his zip code
• his hospital records were identifiable from the published data
• led to new rules for de-identification under HIPAA: 18 types of identifier that must be removed
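The Weld re-identification is a linkage on quasi-identifiers. A minimal sketch of that idea follows, assuming two toy tables with invented records; the field names, the `link` helper and every value are hypothetical and are not taken from the Massachusetts data.

```python
from collections import Counter

# Toy linkage attack: join a "de-identified" release to a public list
# on shared quasi-identifiers. All records below are invented.
released = [  # hospital-style data with names removed
    {"birthdate": "1950-01-01", "sex": "M", "zip": "00001", "diagnosis": "A"},
    {"birthdate": "1950-01-01", "sex": "F", "zip": "00001", "diagnosis": "B"},
    {"birthdate": "1962-03-15", "sex": "M", "zip": "00002", "diagnosis": "C"},
]
voter_roll = [  # public list that still carries names
    {"name": "Person 1", "birthdate": "1950-01-01", "sex": "M", "zip": "00001"},
    {"name": "Person 2", "birthdate": "1950-01-01", "sex": "F", "zip": "00001"},
    {"name": "Person 3", "birthdate": "1950-01-01", "sex": "M", "zip": "00003"},
]

QUASI_IDENTIFIERS = ("birthdate", "sex", "zip")


def key(record):
    """Tuple of quasi-identifier values used as the join key."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def link(released_rows, public_rows):
    """Return (name, diagnosis) pairs whose key is unique in both tables."""
    released_counts = Counter(key(r) for r in released_rows)
    public_counts = Counter(key(p) for p in public_rows)
    names = {key(p): p["name"] for p in public_rows}
    matches = []
    for row in released_rows:
        k = key(row)
        if released_counts[k] == 1 and public_counts[k] == 1:
            matches.append((names[k], row["diagnosis"]))
    return matches


matches = link(released, voter_roll)
print(matches)
```

Only keys that are unique on both sides are reported, which mirrors the "only one lived in his zip code" step: the attack succeeds when the combination of attributes singles out one person.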
Some high-profile failures
• Netflix training data contained ratings and dates only
• 18,000 films; 480,000 raters; 100 million ratings
• Narayanan & Shmatikov matched ratings and dates to ratings and dates made by IMDb users
• and uniquely identified two individuals based on IMDb profiles
• New York City Taxi released a data set of times, routes and cab fares for 173 million rides
• identifiers for drivers were easily reverse engineered to medallion numbers; hence salaries of drivers were available
• combining the data with time-stamped photos of celebrities from fan websites enabled identification of celebrity fares (and tips)
On one hand
• Governor Weld may have been more easily re-identified because of his high profile
• de-identification (removal of formal identifiers) is a first step in anonymisation, but not enough
• Atz: “If you do anonymisation properly, it’s very powerful”
• Cavoukian & Castro (June 2014) “Big Data and Innovation, Setting the record straight: De-identification does work”
• example: Heritage Health Prize – medical claims history for 113,000 patients; Privacy Analytics hired to oversee anonymisation
• replaced direct identifiers, stripped out uncommonly high values, truncated the number of claims per patient, removed high-risk patients and claims data, etc.
On the other hand
• Narayanan & Felten (July 2014) “No silver bullet: De-identification still doesn’t work”
• “there is no known method to de-identify location data”
• “Cavoukian & Castro concede that de-identification is inadequate for high-dimensional data. But nowadays most interesting data sets are high dimensional”
• “Data privacy is a hard problem”
• “Data custodians face a choice between roughly three alternatives:
  – sticking with de-identification and hoping for the best
  – turning to emerging technologies like differential privacy
  – using legal agreements to limit the flow and use of sensitive data”
High-dimensional data
• makes all the problems more difficult
• with very many values recorded, each individual is essentially unique
• even when formal identifiers are removed
• example: de Montjoye (2013) “Unique in the crowd”
• 15 months of mobile location data for 1.5 million mobile phone users (de-identified)
• found that 95% of individuals could be uniquely identified if you knew the location of individuals at four points in time
• Science, Jan 30 2015: three months of credit card transactions; 1.1 million people in 10,000 shops
• only metadata: amounts, shop type, code for each person
• de Montjoye was able to simulate a “correlation attack” and identify 90% of the individuals
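The "unique given a few known points" idea can be illustrated on synthetic traces. The sketch below is not the method of the cited studies: the `unicity` function, the trace sizes and the (hour, cell) encoding are all assumptions made for illustration, and the numbers it prints describe only this random toy data.

```python
import random


def unicity(traces, k, rng):
    """Fraction of people singled out by k randomly chosen points of their own trace."""
    unique = 0
    for person, points in traces.items():
        known = set(rng.sample(sorted(points), k))  # what the attacker knows
        candidates = [p for p, pts in traces.items() if known <= pts]
        unique += candidates == [person]
    return unique / len(traces)


rng = random.Random(0)
# Synthetic traces (assumption): 200 people, each seen at up to 30 random
# (hour-of-week, cell) points drawn from 168 hours x 50 cells.
traces = {
    person: {(rng.randrange(24 * 7), rng.randrange(50)) for _ in range(30)}
    for person in range(200)
}

u1 = unicity(traces, 1, rng)
u4 = unicity(traces, 4, rng)
print("known points = 1:", u1)
print("known points = 4:", u4)
```

Even in this small example the fraction of people who are unique rises quickly with the number of known points, which is the qualitative effect the slide describes.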
High-dimensional data
• three alternatives:
  – sticking with de-identification and hoping for the best
  – turning to emerging technologies like differential privacy
  – using legal agreements to limit the flow and use of sensitive data
• differential privacy uses algorithms to introduce noise
• Atz, Elliot – added noise degrades usefulness of data
• legal agreements, “effective functional anonymisation”, statistical disclosure limitation used by many government agencies
• firewalls, end user agreements, etc.
Open data
• The Open Data Institute includes several success stories
• such as the start-up Mastodon C, which analysed a prescription database
• and found £27m in savings for the National Health Service, each month
• UK Anonymisation Network collects best practices in anonymisation
• de Montjoye and colleagues at MIT are proposing a new model of the Personal Data Store
Protection from hacking, cyber-attacks, etc.
• protecting data on personal computers from intrusion
• either for individuals, or companies
• statistical methods used in this setting as well
• signature-based detection looks for telltale patterns that correspond to known threats
• stateful protocol analysis relies on specific rules for what network connections should be doing
• anomaly-based detection compares statistics of normal behaviour against observed behaviour
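As a rough illustration of the anomaly-based approach in the last bullet, the sketch below flags observations that lie far from a baseline mean. The three-standard-deviation rule, the `anomalies` function and the connection counts are assumptions chosen for the example, not a description of any particular intrusion-detection product.

```python
from statistics import mean, stdev


def anomalies(baseline, observed, threshold=3.0):
    """Return observed values more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [x for x in observed if abs(x - mu) > threshold * sigma]


# Hypothetical connections-per-minute counts under normal behaviour ...
baseline = [98, 102, 101, 97, 100, 103, 99, 100]
# ... and newly observed counts to be screened.
observed = [101, 96, 250, 104]

flagged = anomalies(baseline, observed)
print(flagged)
```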
Science
• “Balancing privacy versus accuracy in research protocols” (Goroff)
• “attitudes towards research that analyzes personal data should depend both on how well the protocol generates valuable statistics and on how well it protects confidential details”
• called risk/utility or R/U tradeoff in some literature
• “many protocols purport to deliver more than they do on either score”
• “Eight examples follow”
Science
1. Open Data
  – “Sunshine List” in Ontario; salaries of all public employees earning more than $100,000 per year
  – facilitates accuracy but not confidentiality
2. Data Enclaves
  – often used for federal data, e.g. US Census Bureau, Statistics Canada
  – “federal enclaves have produced no known security breaches, and are becoming less cumbersome to use, but replication is problematic”
3. Nondisclosure agreements
  – for online business data
  – gives the company control over what details may be released
  – usually precludes replications
  – designed to protect proprietary interests, not privacy
Science
4. Anonymisation of administrative data
  – as discussed above
  – “sanitizing data doesn’t” and “de-identified data isn’t”
5. Randomised response in survey data
  – can be used for sensitive questions
  – for example, for the question “is your income below the poverty line?”: flip a coin; if heads, answer truthfully. If tails, flip again. If the 2nd toss is heads, answer truthfully; if tails, give the opposite answer
  – each answer is truthful with probability ¾, so the expected fraction of “yes” is ¼ + (actual fraction)/2
  – hence 2 × (fraction “yes”) − ½ estimates the actual fraction, but less accurately than direct answers would
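The two-coin protocol and its estimator can be checked by simulation. This is a minimal sketch: the population size, the true fraction of 0.2 and the function names are assumptions made for the example.

```python
import random


def respond(truth, rng):
    """One respondent following the two-coin protocol."""
    if rng.random() < 0.5:   # first toss heads: answer truthfully
        return truth
    if rng.random() < 0.5:   # second toss heads: answer truthfully
        return truth
    return not truth         # second toss tails: give the opposite answer


def estimate(responses):
    """Estimate the true fraction of 'yes' as 2 * (fraction 'yes') - 1/2."""
    frac_yes = sum(responses) / len(responses)
    return 2 * frac_yes - 0.5


rng = random.Random(1)
true_fraction = 0.2  # assumed for the simulation
population = [rng.random() < true_fraction for _ in range(100_000)]
responses = [respond(t, rng) for t in population]

actual = sum(population) / len(population)
print("actual fraction:", actual)
print("estimated fraction:", estimate(responses))
```

No individual response reveals the respondent's true answer with certainty, yet the aggregate estimate is close to the actual fraction; the price is the doubled sampling error from the factor of 2.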
Science
6. Multi-party communication
  – example: computing average salary of three people
  – each generates two random numbers, and gives one to each of the other two participants
  – everyone adds the two random numbers she generated to her own salary, subtracts the two numbers she was given, and reports the result
  – all six random numbers cancel in the sum
  – no individual salary was communicated
  – but can be broken by collusion
  – accurate but not private
7. Fully homomorphic encryption – personal data store
  – allows calculations to be performed on encrypted text, which duplicates the result (after decryption) on original text
  – control reverts to owners of the data
  – which jeopardizes representativeness of the sample, as well as reproducibility
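The masking scheme in example 6 can be sketched as follows. The salaries, the bound on the random numbers and the `masked_reports` helper are invented for illustration; a real protocol would also need secure channels between the participants.

```python
import random


def masked_reports(salaries, rng, bound=10**9):
    """Each party reports salary + (numbers she generated) - (numbers she was given)."""
    n = len(salaries)
    # masks[i][j] is the random number party i generates and hands to party j.
    masks = [
        [rng.randrange(bound) if i != j else 0 for j in range(n)]
        for i in range(n)
    ]
    reports = []
    for i in range(n):
        generated = sum(masks[i])                      # numbers party i created
        received = sum(masks[j][i] for j in range(n))  # numbers party i was given
        reports.append(salaries[i] + generated - received)
    return reports


rng = random.Random(42)
salaries = [52_000, 61_000, 85_000]  # hypothetical values
reports = masked_reports(salaries, rng)
average = sum(reports) / len(reports)
print("reports:", reports)
print("average salary:", average)
```

Every mask appears once with a plus sign and once with a minus sign, so the sum of the reports equals the sum of the salaries. Two colluding parties know all the masks exchanged with the third and can subtract them from her report, which is the weakness noted above.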
Science
Aside – Differencing Attack
To discover whether the CEO of a company earns more than $1m, send two queries:
  – how many employees earn more than $1m?
  – how many employees who are not the CEO earn more than $1m?
The two answers differ by one exactly when the CEO earns more than $1m.
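A minimal sketch of the differencing attack on a toy table follows; the records, field names and the `count_over` helper are hypothetical. Each query is an aggregate count, yet subtracting one answer from the other isolates a single person.

```python
# Invented employee records for illustration only.
employees = [
    {"name": "CEO", "is_ceo": True, "salary": 2_400_000},
    {"name": "E1", "is_ceo": False, "salary": 1_200_000},
    {"name": "E2", "is_ceo": False, "salary": 90_000},
    {"name": "E3", "is_ceo": False, "salary": 75_000},
]


def count_over(rows, threshold):
    """Aggregate query: number of rows with salary above the threshold."""
    return sum(1 for r in rows if r["salary"] > threshold)


q1 = count_over(employees, 1_000_000)
q2 = count_over([e for e in employees if not e["is_ceo"]], 1_000_000)

ceo_earns_over_1m = (q1 - q2) == 1
print(q1, q2, ceo_earns_over_1m)
```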
Science
8. Differential Privacy
  – compares a data set D with a data set D’ that differs only in omitting the data on one individual
  – if, for every possible answer to a query, the ratio of the probabilities of that answer under D and under D’ is close to one, then the method used to answer the query is ‘differentially private’
  – formulated by Dwork, McSherry, Nissim & Smith (2006)
  – captures intuition about privacy
  – constructed research protocols to guarantee differential privacy
  – data are held by a trusted curator
  – calculations performed behind a firewall
  – and released after adding a small amount of noise
  – the ‘closeness’ of the ratio to one determines the trade-off between risk and utility
Science
Public Health and Privacy (M. Enserink)
• example: Ebola virus
• selected individuals were publicly followed and criticized for not self-quarantining, whether or not it was reasonable
• many privacy concerns are overlooked during epidemics
• information about specific patients, although anonymised, is shared worldwide on public e-mail lists
• a qualitative study conducted in Canada at the height of the 2009 influenza pandemic showed that some doctors were surprisingly reluctant to report patients with flu-like symptoms
• World Health Organization is preparing a report on government surveillance of disease, expected in 2016
Patients controlling data (J. Couzin-Frankel)
• PatientsLikeMe; Personal Genome Project; RUDY
Statistical Disclosure Limitation
• http://community.amstat.org/CPC/methods
• Slavkovic notes http://www.cse.psu.edu/~asmith/courses/privacy598d/www/lec-notes/Privacy-F07-Lec01-Aug-30-SDL-overview.pdf
Legal Safeguards
• Statistics Canada RDC etc. http://sites.utoronto.ca/rdc/data.html
• Federal committee on statistical methodology http://fcsm.sites.usa.gov/files/2014/04/spwp22.pdf
Journal of Privacy and Confidentiality
• Volume 1, Number 1 Abowd et al. (2009)
• Dwork’s paper on differential privacy
• see also http://community.amstat.org/CPC/methods
Journal of Privacy and Confidentiality
• Volume 1, Number 1 Abowd et al. (2009) – editorial
• several groups of researchers consider scientific analysis of privacy and confidentiality
• statisticians – statistical disclosure limitation
• computer scientists – privacy-preserving data-mining and cryptography
• lawyers and social scientists – the role of government and regulation in privacy
• health researchers – trade-off between a patient’s privacy and the scientific value of integrated medical records
• survey designers – entice participation, ensure privacy
• online data companies – monetize their databases
Journal of Privacy and Confidentiality
• Volume 1, Number 1 Abowd et al. (2009) – editorial
• government agencies are responsible for providing general statistical information
• this information is a public good
  – one person’s use of a price index does not reduce the amount of “price index” available for another user
  – once an agency has published a price index, it can no longer reasonably control who uses that index, and for what purpose
• most government agencies use a “trusted custodian” model – identities are provided to the custodian, but released summary data is to be non-identifiable
• statistical disclosure limitation arose to address this
Journal of Privacy and Confidentiality
• Volume 1, Number 1 Abowd et al. (2009) – editorial
• Statistical Disclosure Limitation
  – released data are typically counts or magnitudes stratified by characteristics of the entities to which they apply
  – an item is sensitive if its publication allows estimation of another value of the entity too precisely
  – Cell Suppression: rules designed to prohibit release of data in cells at ‘too much’ risk, and prohibit release of data in other cells to prevent reconstruction of sensitive items
• other methods – synthetic data, perturbation models, swapping, reporting bounds, reporting marginal tables, use of risk-utility trade-off models
• computer science – privacy-preserving data-mining, secure computation, differential privacy
• theoretical work on differential privacy has yielded solutions for function approximation, statistical analysis, data-mining, and sanitized databases
• it remains to be seen how these theoretical results might influence the practices of government agencies and private enterprise
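The two kinds of rule under cell suppression (hide risky cells, then hide further cells so the risky ones cannot be reconstructed) can be sketched on a toy table of counts. The threshold of 5, the table and the `suppress` function are assumptions for illustration; the sketch only protects against reconstruction from row totals, whereas agency rules also consider column totals and use more careful optimisation.

```python
def suppress(table, threshold=5):
    """Return a copy of `table` (row -> column -> count) with suppressed cells set to None."""
    out = {row: dict(cols) for row, cols in table.items()}
    for cols in out.values():
        # Primary suppression: hide counts below the threshold.
        for col, value in cols.items():
            if value < threshold:
                cols[col] = None
        hidden = [c for c, v in cols.items() if v is None]
        visible = [c for c, v in cols.items() if v is not None]
        # Complementary suppression: a single hidden cell could be recovered
        # from a published row total, so hide the smallest visible cell too.
        if len(hidden) == 1 and visible:
            smallest = min(visible, key=lambda c: cols[c])
            cols[smallest] = None
    return out


# Hypothetical counts by region (rows) and category (columns).
table = {
    "North": {"A": 2, "B": 40, "C": 17},
    "South": {"A": 12, "B": 30, "C": 25},
}
protected = suppress(table)
print(protected)
```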
Differential Privacy Dwork & Smith, 2009
• model of a trusted curator, e.g. government agency
• two models: non-interactive, where data is released to be used at will
• or interactive, where data is released through queries, with responses modified by the curator to protect the privacy of respondents
• interactive curators should be able to provide better accuracy, since they don’t need to provide answers to all possible questions
• initial step: provide a mathematical interpretation of the phrase “access to the statistical database does not help the adversary to compromise the privacy of any individual”
• this goal cannot be achieved in the presence of arbitrary auxiliary information
Differential Privacy Dwork & Smith, 2009
Suppose you have access to a database that allows you to compute the total income of all residents in a certain area. If you knew that Mr. White was going to move to another area, simply querying this database before and after his move would allow you to deduce his income.
Differential Privacy Dwork & Smith, 2009
A randomized function K gives ε-differential privacy if, for all data sets D and D’ differing on at most one element, and all subsets S of the range of K,
$$\Pr\{K(D) \in S\} \le \exp(\varepsilon)\,\Pr\{K(D') \in S\}$$
The probability is computed over the randomization in K.
Example: if a database were to be consulted by an insurance provider before deciding whether or not to insure a given individual, then the presence or absence of that individual’s data in the database would not significantly affect her chance of receiving coverage.
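As a worked illustration of the definition, consider the two-coin randomised response protocol described earlier, in which each respondent answers truthfully with probability 3/4. Treating one respondent's true answer as the data and the reported answer as the output of K is an assumption made here (the so-called local setting); under it, the bound holds with ε = ln 3.

```latex
\[
\frac{\Pr\{\text{report yes} \mid \text{truth yes}\}}
     {\Pr\{\text{report yes} \mid \text{truth no}\}}
  = \frac{3/4}{1/4} = 3,
\qquad
\frac{\Pr\{\text{report no} \mid \text{truth no}\}}
     {\Pr\{\text{report no} \mid \text{truth yes}\}}
  = \frac{3/4}{1/4} = 3 ,
\]
\[
\text{so } \Pr\{K(D) \in S\} \le \exp(\varepsilon)\,\Pr\{K(D') \in S\}
\text{ holds with } \varepsilon = \ln 3 \approx 1.10 .
\]
```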
Differential Privacy Dwork & Smith, 2009
• can be extended to group privacy for a group of k individuals – use kε
• if the algorithm used to answer each question is ε-differentially private, and the adversary asks q questions, then the resulting process is qε-differentially private
• improved accuracy and more flexibility with the relaxed (ε, δ) version:
$$\Pr\{K(D) \in S\} \le \exp(\varepsilon)\,\Pr\{K(D') \in S\} + \delta$$
Differential Privacy Dwork & Smith, 2009
• how to achieve differential privacy?
• suppose the query is a function f, with true response f(D)
• the sensitivity of f is defined to be
$$\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1$$
for all D, D’ differing in at most one element
• for counting queries, this will be 1
• the privacy mechanism K computes f(D) and adds noise from the Laplace density with standard deviation
$$\sqrt{2}\,\Delta f / \varepsilon$$
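The Laplace mechanism for a counting query can be sketched with the standard library alone. The synthetic incomes, the choice ε = 0.5 and the function names are assumptions for the example. The Laplace scale is Δf/ε, which gives the standard deviation √2·Δf/ε stated above, and the noise is drawn as the difference of two exponential variates.

```python
import math
import random
from statistics import mean, stdev


def laplace_noise(scale, rng):
    """Laplace(0, scale) draw: difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(rows, predicate, epsilon, rng, sensitivity=1.0):
    """Counting query (sensitivity 1) released with Laplace noise of scale sensitivity/epsilon."""
    true_count = sum(1 for r in rows if predicate(r))
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(7)
incomes = [rng.randrange(20_000, 200_000) for _ in range(200)]  # synthetic data
epsilon = 0.5


def is_high(income):
    return income > 100_000


true_count = sum(1 for x in incomes if is_high(x))
draws = [private_count(incomes, is_high, epsilon, rng) for _ in range(5_000)]

theoretical_sd = math.sqrt(2) * 1.0 / epsilon
print("true count:", true_count)
print("mean of noisy answers:", mean(draws))
print("empirical sd:", stdev(draws), "theoretical sd:", theoretical_sd)
```

Repeating the same query, as done here only to check the noise distribution, would consume privacy budget in practice: q answers at ε each give a qε guarantee.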
Differential Privacy Dwork & Smith, 2009
• differential privacy and maximum likelihood estimation
• differential privacy and robust statistics
• differential privacy and nonparametric techniques
  – Wasserman & Zhou, JASA 2010, p. 375
  – The goals of this paper are to explain differential privacy in statistical language, to show how to compare different privacy mechanisms by computing the rate of convergence of distributions and densities based on the released data, and to study a general privacy method
• what we want to learn – set of open problems in §4 of Dwork & Smith
ASA Committee on Privacy and Confidentiality
Aleksandra B. Slavković, Department of Statistics, Penn State University
Data Privacy Course -- Aug 30, 2007
Overview of Statistical Disclosure Limitation
Statistical Methods for Data Privacy, Confidentiality and Disclosure Limitation