
Topics in Big Data March 20, 2015

Aspects of Privacy and Big Data

•  Workshop on Big Data in Health Policy

•  Workshop on Big Data in Social Policy


Data anonymisation (Significance, Oct 2014)
•  anonymisation is meant to allow custodians of data sets to share the information with third parties
•  custodians are typically governments and corporations
•  McKinsey report estimated that open data could generate $3 trillion per year in economic value (bit.ly/1tsTcve)
•  with completely anonymised data no one can be identified by that data on its own or in combination with other data
•  completely anonymised data has little utility
•  so risk of identification must be estimated – depends on how existing data is released and also on what other data may be available


Some high-profile failures
•  Massachusetts Group Insurance Commission published data on all hospital visits by state employees
•  birthdates, sex, zip codes, but no names
•  Sweeney knew Governor Weld lived in Cambridge MA
•  purchased voter rolls
•  only 6 people in Cambridge had the same birthday as the governor; only 3 were male; only one lived in his zip code
•  his hospital records were identifiable from the published data
•  led to new rules for de-identification – 18 types of identifying information that must be removed (HIPAA)
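The re-identification above is a linkage attack: two data sets are joined on shared quasi-identifiers (birthdate, sex, zip code). A minimal sketch follows, assuming tiny made-up tables; the records, names and the `link` helper are hypothetical and are not drawn from the actual GIC or voter data.

```python
from collections import defaultdict

# Hypothetical records: none of these values come from the real data sets.
hospital_records = [
    {"birthdate": "1950-01-01", "sex": "M", "zip": "00001", "diagnosis": "A"},
    {"birthdate": "1950-01-01", "sex": "F", "zip": "00001", "diagnosis": "B"},
    {"birthdate": "1962-03-15", "sex": "M", "zip": "00002", "diagnosis": "C"},
]
voter_roll = [
    {"name": "Voter One", "birthdate": "1950-01-01", "sex": "M", "zip": "00001"},
    {"name": "Voter Two", "birthdate": "1950-01-01", "sex": "F", "zip": "00001"},
    {"name": "Voter Three", "birthdate": "1962-03-15", "sex": "M", "zip": "00002"},
    {"name": "Voter Four", "birthdate": "1962-03-15", "sex": "M", "zip": "00002"},
]

QUASI_IDENTIFIERS = ("birthdate", "sex", "zip")


def link(records, roll, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for records matching exactly one voter."""
    by_key = defaultdict(list)
    for voter in roll:
        by_key[tuple(voter[k] for k in keys)].append(voter["name"])
    matches = []
    for rec in records:
        names = by_key.get(tuple(rec[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], rec["diagnosis"]))
    return matches


reidentified = link(hospital_records, voter_roll)
print(reidentified)
```

In this toy data the third hospital record matches two voters, so it stays ambiguous; the first two match one voter each and are re-identified even though no names were published.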


Some high-profile failures
•  Netflix training data contained ratings and dates only
•  18,000 films; 480,000 raters; 100 million ratings
•  Narayanan & Shmatikov matched ratings and dates to ratings and dates made by IMDb users
•  and uniquely identified two individuals based on IMDb profiles
•  New York City Taxi released a data set of times, routes and cab fares for 173 million rides
•  identifiers for drivers easily reverse engineered to medallion numbers; hence salaries of drivers available
•  combined with time-stamped photos of celebrities from fan websites, this enabled identification of celebrity fares (and tips)


On one hand
•  Governor Weld may have been more easily re-identified, because of his high profile
•  de-identification (removal of formal identifiers) is a first step in anonymisation, but not enough
•  Atz: “If you do anonymisation properly, it’s very powerful”
•  Cavoukian & Castro (June 2014) “Big Data and Innovation, Setting the record straight: De-identification does work”
•  example: Heritage Health Prize – medical claims history for 113,000 patients; Privacy Analytics hired to oversee anonymisation
•  replaced direct identifiers, stripped out uncommonly high values, truncated the number of claims per patient, removed high-risk patients and claims data, etc.


On the other hand
•  Narayanan & Felten (July 2014) “No silver bullet: De-identification still doesn’t work”
•  “there is no known method to de-identify location data”
•  “Cavoukian & Castro concede that de-identification is inadequate for high-dimensional data. But nowadays most interesting data sets are high dimensional”
•  “Data privacy is a hard problem”
•  “Data custodians face a choice between roughly three alternatives:
–  sticking with de-identification and hoping for the best
–  turning to emerging technologies like differential privacy
–  using legal agreements to limit the flow and use of sensitive data”


High-dimensional data
•  makes all the problems more difficult
•  with very many values recorded, each individual is essentially unique
•  even when formal identifiers are removed
•  example: de Montjoye (2013) “Unique in the crowd”
•  15 months of mobile location data for 1.5 million mobile phone users (de-identified)
•  found that 95% of individuals could be uniquely identified if you knew the location of individuals at four points in time
•  Science Jan 30 2015 – three months of credit card transactions; 1.1 million people in 10,000 shops
•  only metadata – amounts, shop type, code for each person
•  de Montjoye was able to simulate a “correlation attack” and identify 90% of the individuals
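The idea of counting how many individuals are pinned down by a few known points can be sketched on synthetic data. This is only an illustration under stated assumptions: the traces are random (hour, cell) pairs, the `unicity` function is a hypothetical helper, and nothing here reproduces the method or the figures of the studies cited above.

```python
import random


def unicity(traces, k, rng):
    """Fraction of users whose trace is the only one containing k of their own points."""
    unique = 0
    for user, points in traces.items():
        # The attacker is assumed to know k points drawn from this user's trace.
        known = set(rng.sample(sorted(points), k))
        candidates = [u for u, p in traces.items() if known <= p]
        if candidates == [user]:
            unique += 1
    return unique / len(traces)


rng = random.Random(0)
# Synthetic traces: each user has 30 random (hour, cell) points out of 24 * 50.
universe = [(hour, cell) for hour in range(24) for cell in range(50)]
traces = {user: set(rng.sample(universe, 30)) for user in range(200)}

for k in (1, 2, 4):
    print(k, unicity(traces, k, rng))
```

Real mobility traces are far from uniformly random, so the numbers printed here say nothing about real populations; the sketch only shows how the fraction is computed.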


High-dimensional data
•  three alternatives:
–  sticking with de-identification and hoping for the best
–  turning to emerging technologies like differential privacy
–  using legal agreements to limit the flow and use of sensitive data
•  differential privacy uses algorithms to introduce noise
•  Atz, Elliot – added noise degrades usefulness of data
•  legal agreements, “effective functional anonymisation”, statistical disclosure limitation used by many government agencies
•  firewalls, end user agreements, etc.


Open data
•  The Open Data Institute includes several success stories
•  such as the start-up, Mastodon C, that analysed a prescription database
•  and found £27m in savings for the National Health Service, each month
•  UK Anonymisation Network collects best practices in anonymisation
•  de Montjoye and colleagues at MIT are proposing a new model of the Personal Data Store


Protection from hacking, cyber-attacks, etc.
•  protecting data on personal computers from intrusion
•  either for individuals, or companies
•  statistical methods used in this setting as well
•  signature-based detections look for telltale patterns that correspond to known threats
•  stateful protocol analysis relies on specific rules for what network connections should be doing
•  anomaly-based detection compares statistics of normal behaviour against observed behaviour
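Anomaly-based detection, the last bullet above, can be illustrated with a simple z-score rule. This is a minimal sketch under assumptions: the connection counts are made up, and the three-standard-deviation threshold is one common convention rather than anything specified in the slides.

```python
from statistics import mean, stdev


def anomalies(baseline, observed, threshold=3.0):
    """Flag observations more than `threshold` baseline standard deviations from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [x for x in observed if abs(x - mu) > threshold * sigma]


# Hypothetical connections-per-minute counts.
baseline = [98, 102, 101, 97, 100, 103, 99, 100]   # "normal behaviour"
observed = [101, 96, 250, 104]                     # new traffic to score
print(anomalies(baseline, observed))               # prints [250]
```

The baseline here has mean 100 and sample standard deviation 2, so only the burst of 250 lies more than 6 units away and is flagged.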


Science
•  “Balancing privacy versus accuracy in research protocols” (Goroff)
•  “attitudes towards research that analyzes personal data should depend both on how well the protocol generates valuable statistics and on how well it protects confidential details”
•  called risk/utility or R/U tradeoff in some literature
•  “many protocols purport to deliver more than they do on either score”
•  “Eight examples follow”


Science
1.  Open Data – “Sunshine List” in Ontario; salaries of all public employees earning more than $100,000 per year
–  facilitates accuracy but not confidentiality
2.  Data Enclaves – often used for federal data, e.g. US Census Bureau, Statistics Canada
–  “federal enclaves have produced no known security breaches, and are becoming less cumbersome to use, but replication is problematic”
3.  Nondisclosure agreements – for online business data
–  gives the company control over what details may be released
–  usually precludes replications
–  designed to protect proprietary interests, not privacy


Science
4.  Anonymisation of administrative data
–  as discussed above
–  “sanitizing data doesn’t” and “de-identified data isn’t”
5.  Randomised response in survey data
–  can be used for sensitive questions
–  for example: flip a coin; if heads, answer truthfully “is your income below the poverty line?”. If tails, flip again. If the 2nd toss is heads, answer truthfully; if tails, give the opposite answer
–  2 × (fraction “yes”) − ½ estimates the actual fraction, but less accurately
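The estimator follows from the protocol: a respondent whose true answer is “yes” reports “yes” with probability 3/4, and one whose true answer is “no” reports “yes” with probability 1/4, so the expected fraction of “yes” reports is p/2 + 1/4 for a true fraction p. A small simulation can check this; the true fraction of 0.2, the sample size and the function names are assumptions chosen for illustration.

```python
import random


def respond(truth, rng):
    """One respondent's reported answer under the two-coin protocol."""
    if rng.random() < 0.5:      # first toss heads: answer truthfully
        return truth
    if rng.random() < 0.5:      # second toss heads: answer truthfully
        return truth
    return not truth            # second toss tails: give the opposite answer


def estimate(answers):
    """Estimate the true 'yes' fraction as 2 * (observed 'yes' fraction) - 1/2."""
    observed = sum(answers) / len(answers)
    return 2 * observed - 0.5


rng = random.Random(1)
true_fraction = 0.2             # assumed value for the simulation
n = 100_000
truths = [rng.random() < true_fraction for _ in range(n)]
answers = [respond(t, rng) for t in truths]
print(round(estimate(answers), 3))
```

The loss of accuracy is visible in the formula: doubling the observed fraction doubles its standard error, so the estimate is noisier than a direct survey of the same size would be.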



Science
6.  Multi-party communication
–  example: computing average salary of three people
–  each generates two random numbers, and gives one to each of the other two participants
–  everyone adds the two random numbers she generated to her own salary, subtracts the two numbers she was given, and reports the result
–  all six random numbers cancel in the sum
–  no individual salary was communicated
–  but can be broken by collusion
–  accurate but not private
7.  Fully homomorphic encryption – personal data store
–  allows calculations to be performed on encrypted text; decrypting the result gives the same answer as the calculation on the original text
–  control reverts to owners of the data
–  which jeopardizes representativeness of the sample, as well as reproducibility
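The masking scheme in item 6 can be sketched in a few lines. This is an illustrative sketch only: the salaries are made up, and the `masked_reports` helper and the range of the random masks are assumptions.

```python
import random


def masked_reports(salaries, rng, spread=1_000_000):
    """Each party gives one random number to every other party, then reports
    salary + (numbers it generated) - (numbers it received)."""
    n = len(salaries)
    # sent[i][j] is the random number party i gives to party j (0 on the diagonal).
    sent = [[rng.randint(-spread, spread) if i != j else 0 for j in range(n)]
            for i in range(n)]
    reports = []
    for i in range(n):
        generated = sum(sent[i])
        received = sum(sent[j][i] for j in range(n))
        reports.append(salaries[i] + generated - received)
    return reports


salaries = [52_000, 71_000, 64_000]   # hypothetical values
reports = masked_reports(salaries, random.Random(0))
print(reports)                        # masked values, not the salaries themselves
print(sum(reports) / len(reports))    # equals the true average salary
```

Every random number is added once by its sender and subtracted once by its receiver, so the sum of the reports equals the sum of the salaries. Two colluding parties can still pool what they know and recover the third salary from the average.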


Science
Aside – Differencing Attack
To discover whether the CEO of a company earns more than $1m, send two queries:
–  how many employees earn more than $1m?
–  how many employees who are not the CEO earn more than $1m?
The two counts differ by one exactly when the CEO earns more than $1m.
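A toy version of the two queries makes the leak concrete. The payroll figures and the `count_over` query function are hypothetical.

```python
# Hypothetical payroll; names and figures are made up.
payroll = {"ceo": 1_400_000, "cfo": 1_100_000, "analyst": 90_000, "clerk": 45_000}


def count_over(threshold, exclude=()):
    """Aggregate query: how many employees outside `exclude` earn more than threshold."""
    return sum(1 for name, pay in payroll.items()
               if name not in exclude and pay > threshold)


everyone = count_over(1_000_000)
without_ceo = count_over(1_000_000, exclude=("ceo",))
ceo_over_1m = (everyone - without_ceo) == 1
print(everyone, without_ceo, ceo_over_1m)   # prints: 2 1 True
```

Neither query names an individual salary, yet their difference reveals one fact about one person, which is why restricting users to aggregate queries is not sufficient on its own.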


Science
8.  Differential Privacy
–  compares a data set D with a data set D’ that differs only in omitting the data on one individual
–  if, for every possible answer to a query, the ratio of the probabilities of that answer under D and under D’ is close to one, then the method used to answer the query is ‘differentially private’
–  formulated by Dwork, McSherry, Nissim & Smith (2006)
–  captures intuition about privacy
–  constructed research protocols to guarantee differential privacy
–  data are held by a trusted curator
–  calculations performed behind a firewall
–  and released after adding a small amount of noise
–  the ‘closeness’ of the ratio to one determines the trade-off between risk and utility


Science
Public Health and Privacy (M. Enserink)
•  example: Ebola virus
•  selected individuals were publicly followed and criticized for not self-quarantining, whether or not it was reasonable
•  many privacy concerns are overlooked during epidemics
•  information about specific patients – although anonymised – is shared worldwide on public e-mail lists
•  qualitative study conducted in Canada at the height of the 2009 influenza pandemic showed that some doctors were surprisingly reluctant to report patients with flu-like symptoms
•  World Health Organization is preparing a report on government surveillance of disease, expected in 2016

Patients controlling data (J. Couzin-Frankel)
•  PatientsLikeMe; Personal Genome Project; RUDY


Statistical Disclosure Limitation

•  http://community.amstat.org/CPC/methods

•  Slavkovic notes http://www.cse.psu.edu/~asmith/courses/privacy598d/www/lec-notes/Privacy-F07-Lec01-Aug-30-SDL-overview.pdf


Legal Safeguards

•  Statistics Canada RDC etc. http://sites.utoronto.ca/rdc/data.html

•  Federal committee on statistical methodology http://fcsm.sites.usa.gov/files/2014/04/spwp22.pdf


Journal of Privacy and Confidentiality

•  Volume 1, Number 1 Abowd et al. (2009)

•  Dwork’s paper on differential privacy

•  see also http://community.amstat.org/CPC/methods



•  Volume 1, Number 1 Abowd et al. (2009) – editorial
•  several groups of researchers consider scientific analysis of privacy and confidentiality
•  statisticians -- statistical disclosure limitation
•  computer scientists -- privacy-preserving data-mining and cryptography
•  lawyers and social scientists -- the role of government and regulation in privacy
•  health researchers -- trade-off between a patient’s privacy and the scientific value of integrated medical records
•  survey designers -- entice participation, ensure privacy
•  online data companies -- monetize their databases



•  Volume 1, Number 1 Abowd et al. (2009) – editorial
•  government agencies are responsible for providing general statistical information
•  this information is a public good
–  one person’s use of a price index does not reduce the amount of “price index” available for another user
–  once an agency has published a price index, it can no longer reasonably control who uses that index, and for what purpose
•  most government agencies use a “trusted custodian” model – identities are provided to the custodian, but released summary data is to be non-identifiable
•  statistical disclosure limitation arose to address this



•  Volume 1, Number 1 Abowd et al. (2009) – editorial
•  Statistical Disclosure Limitation
–  released data are typically counts or magnitudes stratified by characteristics of the entities to which they apply

–  an item is sensitive if its publication allows estimation of another value of the entity too precisely

–  rules designed to prohibit release of data in cells at ‘too much’ risk, and prohibit release of data in other cells to prevent reconstruction of sensitive items – Cell Suppression

•  other methods – synthetic data, perturbation models, swapping, reporting bounds, reporting marginal tables, use of risk-utility trade-off models

•  computer science -- privacy-preserving data-mining; secure computation, differential privacy

•  theoretical work on differential privacy has yielded solutions for function approximation, statistical analysis, data-mining, and sanitized databases

•  it remains to be seen how these theoretical results might influence the practices of government agencies and private enterprise
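The cell suppression rule mentioned above has two parts: withholding risky cells, and withholding further cells so the risky ones cannot be reconstructed. The sketch below illustrates why the second part is needed. The table, the minimum-count threshold of 5 and both helper functions are assumptions for illustration, not a rule taken from the editorial.

```python
def primary_suppress(table, min_count=5):
    """Replace cells with counts below min_count by None (primary suppression)."""
    return [[c if c >= min_count else None for c in row] for row in table]


def recover_from_row_totals(suppressed, row_totals):
    """Fill any row that has exactly one suppressed cell using its published total."""
    filled = [row[:] for row in suppressed]
    for row, total in zip(filled, row_totals):
        missing = [j for j, c in enumerate(row) if c is None]
        if len(missing) == 1:
            row[missing[0]] = total - sum(c for c in row if c is not None)
    return filled


counts = [[12, 3, 20],
          [15, 9, 11]]                  # hypothetical cell counts
row_totals = [sum(row) for row in counts]

published = primary_suppress(counts)
print(published)                                       # the small cell is hidden
print(recover_from_row_totals(published, row_totals))  # and recovered by subtraction
```

Because the row total is published, hiding the single small cell protects nothing. A second cell in the same row would also have to be withheld, which is the complementary suppression the rule describes.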


Differential Privacy Dwork & Smith, 2009
•  model of a trusted curator, e.g. government agency
•  two models: data is released, to be used at will
•  or, data is released interactively – queries and responses modified by the curator to protect the privacy of respondents
•  interactive curators should be able to provide better accuracy, since they don’t need to provide answers to all possible questions
•  initial step to provide a mathematical interpretation to the phrase: “access to the statistical database does not help the adversary to compromise the privacy of any individual”
•  this goal cannot be achieved in the presence of arbitrary auxiliary information


Differential Privacy Dwork & Smith, 2009


Suppose you have access to a database that allows you to compute the total income of all residents in a certain area. If you knew that Mr. White was going to move to another area, simply querying this database before and after his move would allow you to deduce his income.

Differential Privacy Dwork & Smith, 2009
A randomized function K gives ε-differential privacy if for all data sets D and D’ differing on at most one element, and all subsets S in the range of K:

$$\Pr\{K(D) \in S\} \le \exp(\varepsilon)\,\Pr\{K(D') \in S\}$$

The probability is computed over the randomization in K.
Example: if a database were to be consulted by an insurance provider before deciding whether or not to insure a given individual, then the presence or absence of that individual’s data in the database would not significantly affect her chance of receiving coverage
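As a worked check of the definition, consider the two-coin randomised response protocol described earlier, applied to a single respondent. This is a sketch under one assumption: the respondent's true answer is treated as the one element that changes between D and D’, rather than being omitted.

```latex
\begin{align*}
\Pr\{\text{report yes} \mid \text{truth yes}\} &= \tfrac12 + \tfrac12\cdot\tfrac12 = \tfrac34, \\
\Pr\{\text{report yes} \mid \text{truth no}\}  &= \tfrac12\cdot\tfrac12 = \tfrac14, \\
\frac{\Pr\{\text{report yes} \mid \text{truth yes}\}}
     {\Pr\{\text{report yes} \mid \text{truth no}\}} &= 3 = \exp(\varepsilon)
\quad\Longrightarrow\quad \varepsilon = \ln 3 \approx 1.10 .
\end{align*}
```

The same ratio of 3 bounds the “report no” outcome, so under this reading the protocol satisfies the inequality with ε = ln 3.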



•  can be extended to group privacy – use kε
•  if the algorithm used to answer each question is ε-differentially private, and the adversary asks q questions, then the resulting process is qε-differentially private
•  improved accuracy and more flexibility:

$$\Pr\{K(D) \in S\} \le \exp(\varepsilon)\,\Pr\{K(D') \in S\} + \delta$$



•  how to achieve differential privacy?
•  suppose the query is a function f, with true response f(D)
•  the sensitivity of f is defined to be

$$\Delta f = \max_{D, D'} \| f(D) - f(D') \|_1$$

   for all D, D’ differing in at most one element
•  for counting queries, this will be 1
•  the privacy mechanism K computes f(D) and adds noise from the Laplace density with standard deviation

$$\sqrt{2}\,\Delta f / \varepsilon$$
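The Laplace mechanism for a counting query can be sketched with the standard library alone. A Laplace density with scale b = Δf/ε has standard deviation √2·b, which matches the expression above. The data, the value of ε and the function names are assumptions for illustration; the noise is drawn as the difference of two independent exponential variables, which has a Laplace distribution.

```python
import math
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(data, predicate, epsilon, rng):
    """Counting query (sensitivity 1) released with Laplace noise of scale 1/epsilon."""
    sensitivity = 1.0
    true_count = sum(1 for row in data if predicate(row))
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
incomes = [18_000, 25_000, 31_000, 47_000, 52_000, 88_000]   # hypothetical data
epsilon = 0.5                                                # assumed privacy parameter

noisy = private_count(incomes, lambda x: x < 30_000, epsilon, rng)
print(noisy)                                # true count is 2, plus noise
print(math.sqrt(2) * 1.0 / epsilon)         # noise standard deviation sqrt(2) * Δf / ε
```

Smaller ε gives stronger privacy and a larger noise scale, which is the risk/utility trade-off in concrete form.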



•  differential privacy and maximum likelihood estimation
•  differential privacy and robust statistics
•  differential privacy and nonparametric techniques
–  Wasserman & Zhou JASA 2010 p.375
–  The goals of this paper are to explain differential privacy in statistical language, to show how to compare different privacy mechanisms by computing the rate of convergence of distributions and densities based on the released data, and to study a general privacy method
•  what we want to learn – set of open problems in §4 of Dwork & Smith

ASA Committee on Privacy and Confidentiality


Aleksandra B. Slavković, Department of Statistics, Penn State University

Data Privacy Course -- Aug 30, 2007

Overview of Statistical Disclosure Limitation

Statistical Methods for Data Privacy, Confidentiality and Disclosure Limitation
