
Page 1:

Pinning Down "Privacy": Defining Privacy in Statistical Databases

Adam Smith, Weizmann Institute of Science

http://theory.csail.mit.edu/~asmith

Page 2:

Database Privacy

[Diagram: individuals (You, Bob, Alice) send their data to a trusted agency for collection and "sanitization"; users (government, researchers, marketers, …) see the released output.]

“Census problem”

Two conflicting goals

• Utility: Users can extract “global” statistics

• Privacy: Individual information stays hidden

• How can these be formalized?


Page 3:

Database Privacy


“Census problem”

Why privacy?

• Ethical & legal obligation

• Honest answers require respondents’ trust


Page 4:

Trust is important

Page 5:

Database Privacy


• Trusted collection agency

• Published statistics may be tables, graphs, microdata, etc

• May have noise or other distortions

• May be interactive


Page 6:

Database Privacy


Variations on model studied in

• Statistics

• Data mining

• Theoretical CS

• Cryptography

Different traditions for what “privacy” means


Page 7:

How can we formalize “privacy”?

• Different people mean different things

• Pin it down mathematically?

Page 8:

I ask them to take a poem and hold it up to the light like a color slide

or press an ear against its hive.

[…]

But all they want to do is tie the poem to a chair with rope and torture a confession out of it.

They begin beating it with a hose to find out what it really means.

- Billy Collins, "Introduction to Poetry"

Can we approach privacy scientifically?

• Pin down a social concept
• No perfect definition?
• But lots of room for rigor
• Too late? (see Adi's talk)

Page 9:

How can we formalize “privacy”?

• Different people mean different things

• Pin it down mathematically?

Goal #1: Rigor
Prove clear theorems about privacy

• Few exist in literature

Make clear (and refutable) conjectures

Sleep better at night

Goal #2: Interesting science
(New) computational phenomenon

Algorithmic problems

Statistical problems

Page 10:

Overview

• Examples

• Intuitions for privacy

Why crypto def’s don’t apply

• A Partial* Selection of Definitions

• Conclusions

* “partial” = “incomplete” and “biased”

Page 11:

Basic Setting

• Database DB = table of n rows, each in domain D
  D can be numbers, categories, tax forms, etc.

  This talk: D = {0,1}^d

E.g.: Married?, Employed?, Over 18?, …

[Diagram: DB = (x_1, …, x_n); a sanitizer San (with random coins) answers queries 1, …, T from users (government, researchers, marketers, …).]
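To make the setting concrete, here is a minimal sketch in Python of this model (the names and numbers are illustrative, not from the talk): a table of n rows in D = {0,1}^d and a sanitizer that answers counting queries, here still without any noise.

    import random

    d, n = 3, 100      # d attributes per person (e.g. Married?, Employed?, Over 18?), n rows

    # DB = table of n rows, each a d-bit tuple in D = {0,1}^d
    DB = [tuple(random.randint(0, 1) for _ in range(d)) for _ in range(n)]

    class San:
        """Interactive sanitizer: receives query 1, ..., query T and returns answers."""
        def __init__(self, db):
            self.db = db

        def answer(self, predicate):
            # Counting query: how many rows satisfy the predicate?
            # (Exact answer here; later slides perturb or restrict these answers.)
            return sum(1 for row in self.db if predicate(row))

    san = San(DB)
    print(san.answer(lambda row: row[0] == 1))   # e.g. "how many are married?"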

Page 12:

Examples of sanitization methods

• Input perturbation: change data before processing
  E.g. randomized response: flip each bit of the table with probability p (see the sketch after this list)

• Summary statistics
  Means, variances
  Marginal totals (# people with blue eyes and brown hair)
  Regression coefficients

• Output perturbation: summary statistics with noise

• Interactive versions of above: auditor decides which queries are OK, type of noise
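A minimal sketch of the randomized-response item above (Python; the debiasing step is a standard consequence of the flipping probability, not something stated on the slide): each bit is flipped independently with probability p, and simple statistics can still be recovered from the noisy table.

    import random

    def randomized_response(db, p):
        """Flip each bit of each row independently with probability p."""
        return [tuple(1 - b if random.random() < p else b for b in row) for row in db]

    def debias_mean(noisy_mean, p):
        # If each bit is flipped w.p. p, then E[noisy mean] = m*(1-2p) + p,
        # so the true mean m is recoverable as long as p != 1/2.
        return (noisy_mean - p) / (1 - 2 * p)

    db = [(1, 0, 1), (0, 0, 1), (1, 1, 1), (0, 1, 0)]
    noisy = randomized_response(db, p=0.25)
    col0 = [row[0] for row in noisy]
    print(debias_mean(sum(col0) / len(col0), p=0.25))   # estimate of the true fraction of 1s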

Page 13:

Two Intuitions for Privacy

“If the release of statistics S makes it possible to determine the

value [of private information] more accurately than is possible

without access to S, a disclosure has taken place.” [Dalenius]

• Learning more about me should be hard

Privacy is “protection from being brought to the attention of

others.” [Gavison]

• Safety is blending into a crowd

Page 14:

Why not use crypto definitions?

• Attempt #1: Def’n: For every entry i, no information about xi is leaked

(as if encrypted)

Problem: no information at all is revealed!

Tradeoff: privacy vs. utility

• Attempt #2: Agree on summary statistics f(DB) that are safe

Def’n: No information about DB except f(DB)

Problem: how to decide that f is safe?

Tautology trap

(Also: how do you figure out what f is? --Yosi)


Page 15:

Overview

• Examples

• Intuitions for privacy

Why crypto def’s don’t apply

• A Partial* Selection of Definitions

Two straw men

Blending into the Crowd

An impossibility result

Attribute Disclosure and Differential Privacy

• Conclusions

* "partial" = "incomplete" and "biased"

Criteria:
• Understandable
• Clear adversary's goals & prior knowledge / side information
• I am a co-author...

Page 16:

[Diagram: DB = (x_1, …, x_n); adversary A interacts with San (random coins), sending queries 1, …, T and receiving answers.]

Straw man #1: Exact Disclosure

• Def'n: safe if adversary cannot learn any entry exactly
  Leads to nice (but hard) combinatorial problems

  Does not preclude learning a value with 99% certainty or narrowing it down to a small interval

• Historically: focus on auditing interactive queries

  Difficulty: understanding relationships between queries

  E.g. two queries with small difference (a minimal sketch follows)
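A minimal sketch of such a pair of queries (Python, with made-up data): each sum query is an aggregate over several people and looks harmless in isolation, yet their difference discloses one entry exactly.

    salaries = {"Alice": 48_000, "Bob": 52_000, "Carol": 61_000, "Dave": 55_000}

    def sum_query(names):
        return sum(salaries[name] for name in names)

    everyone = list(salaries)
    everyone_but_alice = [name for name in everyone if name != "Alice"]

    # Two "large" queries with small difference: subtracting the answers
    # reveals Alice's salary exactly -- an exact disclosure.
    print(sum_query(everyone) - sum_query(everyone_but_alice))   # 48000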

Page 17:

Straw man #2: Learning the distribution

• Assume x1,…,xn are drawn i.i.d. from unknown

distribution

• Def’n: San is safe if it only reveals distribution

• Implied approach:

learn the distribution

release description of distrib

or re-sample points from distrib

• Problem: tautology trap

estimate of distrib. depends on data… why is it safe?

Page 18:

Blending into a Crowd

• Intuition: I am safe in a group of k or more
  k varies (3… 6… 100… 10,000?)

• Many variations on theme: Adv. wants predicate g such that
  0 < #{ i : g(x_i) = true } < k
  g is called a breach of privacy

• Why? Fundamental:
  • R. Gavison: "protection from being brought to the attention of others"
  Rare property helps me re-identify someone
  Implicit: information about a large group is public
  • e.g. liver problems more prevalent among diabetics

Page 19:

Blending into a Crowd


How can we capture this?
• Syntactic definitions
• Bayesian adversary
• "Crypto-flavored" definitions

Two variants:
• frequency in DB
• frequency in underlying population

Page 20:

“Syntactic” Definitions

• Given sanitization S, look at set of all databases consistent with S

• Def’n: Safe if no predicate is a breach for all consistent databases

k-anonymity [L. Sweeney]

• Sanitization is histogram of data

Partition D into bins B_1 ∪ B_2 ∪ … ∪ B_t

Output cardinalities f_j = #( DB ∩ B_j )

• Safe if for all j, either f_j ≥ k or f_j = 0 (a minimal check is sketched after the tables below)

Cell bound methods [statistics, 1990’s]

• Sanitization consists of marginal sums

Let fz = #{i : xi =z}. Then San(DB) = various sums of fz

• Safe if for all z, either ∃ a consistent DB with f_z ≥ k, or in all consistent DBs f_z = 0

• Large literature using algebraic and combinatorial techniques

Example (hair color × eye color): true cell counts and marginal totals

              brown   blue   total
    blond       2      10     12
    brown      12       6     18
    total      14      16

Cell bounds consistent with the released marginals alone:

              brown    blue    total
    blond    [0,12]   [0,12]    12
    brown    [0,14]   [0,16]    18
    total      14       16
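A minimal check of the k-anonymity-style condition above (Python; the bins and the value of k are illustrative): compute the histogram and release it only if every nonzero cell has at least k people.

    from collections import Counter

    def safe_histogram(db, bin_of, k):
        """Return the cell counts if every nonzero count is >= k, otherwise None."""
        counts = Counter(bin_of(row) for row in db)
        if all(c >= k for c in counts.values()):   # absent cells have count 0, which is allowed
            return dict(counts)
        return None   # some cell holds between 1 and k-1 people: a potential breach

    # The table above: hair color x eye color.
    db = ([("blond", "brown")] * 2 + [("blond", "blue")] * 10 +
          [("brown", "brown")] * 12 + [("brown", "blue")] * 6)
    print(safe_histogram(db, bin_of=lambda row: row, k=5))   # None: the (blond, brown) cell has only 2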

Page 21:


“Syntactic” Definitions


Issues:

• If k is small: "all three Canadians at Weizmann sing in a choir."

• Semantics? Probability not considered
  What if I have side information?
  Algorithm for making decisions not considered
  What adversary does this apply to?

Page 22:

Security for “Bayesian” adversaries

Goal:
• Adversary outputs point z ∈ D
• Score = 1/f_z if f_z > 0, and 0 otherwise
• Def'n: sanitization safe if E(score) ≤ ε

Procedure (a brute-force sketch appears at the end of this slide):
• Assume you know adversary's prior distribution over databases
• Given a candidate output (e.g. set of marginal sums):
  Update prior conditioned on output (via Bayes' rule)
  If max_z E( score | output ) < ε then release
  Else consider a new set of marginal sums

• Extensive literature on computing expected value (see Yosi's talk)

Issues:

• Restricts the type of predicates adversary can choose

• Must know prior distribution

Can 1 scheme work for many distributions?

Sanitizer works harder than adversary

• Conditional probabilities don't consider previous iterations: "simulatability" [KMN'05]

Can this be fixed (with efficient computations)?
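A brute-force sketch of the procedure on this slide (Python; the toy prior, release function, and threshold are my own choices): update the prior by Bayes' rule given the candidate output and compute max_z E[score | output], releasing only if it stays below the threshold ε.

    from collections import Counter
    from itertools import product

    def posterior(prior, output, release):
        """Bayes update of {db: prob} given that release(db) equals the observed output."""
        post = {db: p for db, p in prior.items() if release(db) == output}
        total = sum(post.values())
        return {db: p / total for db, p in post.items()}

    def max_expected_score(post, domain):
        """Adversary guesses z; score is 1/f_z if z appears f_z > 0 times in the database, else 0."""
        best = 0.0
        for z in domain:
            exp = 0.0
            for db, p in post.items():
                f_z = Counter(db)[z]
                exp += p / f_z if f_z > 0 else 0.0
            best = max(best, exp)
        return best

    # Toy example: databases of 3 bits, uniform prior, candidate output = sum of the bits.
    domain = (0, 1)
    prior = {db: 1 / 8 for db in product(domain, repeat=3)}
    post = posterior(prior, output=1, release=sum)
    print(max_expected_score(post, domain))   # release only if this stays below the threshold epsilon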

Page 23:

Crypto-flavored Approach [CDMSW,CDMT,NS]

“If the release of statistics S makes it possible to determine the

value [of private information] more accurately than is possible

without access to S, a disclosure has taken place.” [Dalenius]

Page 24:


Crypto-flavored Approach [CDMSW,CDMT,NS]

• [CDMSW]: Compare to “simulator”:

∀ distributions on databases DB
∀ adversaries A, ∃ A' such that
∀ subsets J ⊆ DB:
Pr_{DB,S}[ A(S) = breach in J ] − Pr_{DB}[ A'() = breach in J ] ≤ ε

• Definition says nothing if adversary knows x_1
  Require that it hold for all subsets of DB

• No non-trivial examples satisfying this definition
  Restrict family of distributions to some class C of distributions
  Try to make C as large as possible

Sufficient: i.i.d. from “smooth” distribution

Page 25:

Crypto-flavored Approach [CDMSW,CDMT,NS]


• [CDMSW,CDMT] Geometric data: assume x_i ∈ R^d
  Relax definition:
  • Ball predicates g_{z,r} = { x : ||x − z|| ≤ r }
    and g'_{z,r} = { x : ||x − z|| ≤ C·r }
  • Breach if #( DB ∩ g_{z,r} ) > 0 and #( DB ∩ g'_{z,r} ) < k (a minimal check is sketched below)

  Several types of histograms can be released
  Sufficient for "metric" problems: clustering, min. span tree, …
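A minimal check of the relaxed breach condition above (Python; the points, radius, C, and k are illustrative): a ball around z is a breach if it catches someone while the C-times-larger ball still contains fewer than k people.

    import math

    def dist(x, z):
        return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

    def is_breach(db, z, r, C, k):
        """Breach if 0 < #(DB in ball(z, r)) and #(DB in ball(z, C*r)) < k."""
        inner = sum(1 for x in db if dist(x, z) <= r)
        outer = sum(1 for x in db if dist(x, z) <= C * r)
        return inner > 0 and outer < k

    db = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
    print(is_breach(db, z=(5.0, 5.0), r=0.5, C=3, k=3))   # True: an isolated point
    print(is_breach(db, z=(0.0, 0.0), r=0.5, C=3, k=3))   # False: blends into a crowd of 3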


Page 26:

Crypto-flavored Approach [CDMSW,CDMT,NS]


• [NS] No geometric restrictions; a lot of noise

• Almost erase data!

Strong privacy statement

Very weak utility

• [CDMSW,CDMT,NS]: proven statements!

Issues:

• Works for a large class of prior distributions and side information

But not for all

• Not clear if it helps with “ordinary” statistical calculations

• Interesting utility requires geometric restrictions

• Too messy?

Page 27:

Blending into a Crowd

• Intuition: I am safe in a group of k or more

• Pros:
  appealing intuition for privacy
  seems fundamental
  mathematically interesting
  meaningful statements are possible!

• Cons:
  does it rule out learning facts about a particular individual?
  all results seem to make strong assumptions on the adversary's prior distribution
  is this necessary? (yes…)

Page 28:

Overview

• Examples

• Intuitions for privacy

Why crypto def’s don’t apply

• A Partial* Selection of Definitions

Two Straw men

Blending into the Crowd

An impossibility result

Attribute Disclosure and Differential Privacy

• Conclusions

Page 29:

An impossibility result

• An abstract schema: Define a privacy breach

∀ distributions on databases, ∀ adversaries A, ∃ A' such that

Pr( A(San) = breach ) − Pr( A'() = breach ) ≤ ε

• Theorem: [Dwork-Naor] For reasonable “breach”, if San(DB) contains information about DB

then some adversary breaks this definition

• Example: Adv. knows Alice is 2 inches shorter than average Lithuanian

• but how tall are Lithuanians?

With sanitized database, probability of guessing height goes up

Theorem: this is unavoidable

Page 30:

Proof sketch

• Suppose

  If DB is uniform then mutual information I( DB ; San(DB) ) > 0

  "breach" is predicting a predicate g(DB)

• Pick hash function h: {databases} → {0,1}^{H(DB|San)}

  Prior distrib. is uniform conditioned on h(DB)=z

• Then

  h(DB)=z gives no info on g(DB)

  San(DB) and h(DB)=z together determine DB, hence g(DB): an adversary holding z breaches once it sees San(DB), while A' (who never sees San(DB)) cannot

• [DN] vastly generalize this

Page 31:


Preventing Attribute Disclosure

• Large class of definitions: safe if adversary can't learn "too much" about any entry

E.g.:

• Cannot narrow X_i down to a small interval

• For uniform X_i, mutual information I( X_i ; San(DB) ) ≤ ε

• How can we decide among these definitions?

Page 32:

Differential Privacy

• Lithuanians example:

Adv. learns height even if Alice not in DB

• Intuition [DM]:

“Whatever is learned would be learned regardless of whether or not

Alice participates”

Dual: Whatever is already known, situation won’t get worse


Page 33:

Differential Privacy

[Diagram: DB with one entry replaced by 0; adversary A interacts with San (random coins) as before.]

• Define n+1 games
  "Game 0": Adv. interacts with San(DB)
  For each i, let DB_{-i} = ( x_1, …, x_{i−1}, 0, x_{i+1}, …, x_n )
  "Game i": Adv. interacts with San(DB_{-i})

• Bayesian adversary: Given S and prior distribution p(·) on DB, define n+1 posterior distributions

Page 34:

Differential Privacy


• Definition: San is safe if

  ∀ prior distributions p(·) on DB,

  ∀ transcripts S, ∀ i = 1, …, n:

  StatDiff( p_0(·|S) , p_i(·|S) ) ≤ ε

• Note that the prior distribution may be far from both

• How can we satisfy this?
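A brute-force illustration of this definition (Python; the toy mechanism, prior, and sizes are my own, chosen only so the posteriors can be enumerated exactly): compare the posterior given the real database with the posterior given the database with entry i replaced by 0.

    from itertools import product

    # Toy mechanism: San(db) = sum(db) + noise, with noise in {-1, 0, +1} w.p. (1/4, 1/2, 1/4).
    NOISE = {-1: 0.25, 0: 0.5, 1: 0.25}

    def p_transcript(db, s):
        """Probability that San(db) produces transcript s."""
        return NOISE.get(s - sum(db), 0.0)

    def posterior(prior, s, transform=lambda db: db):
        """p(db | S = s) when San actually ran on transform(db)."""
        weights = {db: p * p_transcript(transform(db), s) for db, p in prior.items()}
        total = sum(weights.values())
        return {db: w / total for db, w in weights.items()}

    def stat_diff(p, q):
        return 0.5 * sum(abs(p[db] - q[db]) for db in p)

    def zero_out(db, i):
        return db[:i] + (0,) + db[i + 1:]

    n = 3
    prior = {db: 1 / 2 ** n for db in product((0, 1), repeat=n)}    # uniform prior
    s = 2                                                           # an observed transcript
    p0 = posterior(prior, s)                                        # Game 0: real database
    for i in range(n):
        pi = posterior(prior, s, transform=lambda db, i=i: zero_out(db, i))   # Game i
        print(i, stat_diff(p0, pi))   # "safe" requires every value to stay below epsilon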

Page 35:

Approach: Indistinguishability [DiNi,EGS,BDMN]

[Diagram: DB = (x_1, x_2, x_3, …, x_n) and DB' = (x_1, x_2', x_3, …, x_n) differ in 1 row; San answers queries 1, …, T on each, producing transcripts S and S' respectively.]

Distributions at "distance" ≤ ε

Choice of distance measure is important


Page 37:

Approach: Indistinguishability [DiNi,EGS,BDMN]

[Diagram: San run on DB and on DB' (differing in one row) produces transcript distributions S and S' at "distance" ≤ ε.]

Problem: ε must be large

• By hybrid argument: any two databases induce transcripts at distance ≤ nε

• To get utility, need ε > 1/n

  Statistical difference 1/n is not meaningful

• Example: Release a random point in the database

  San( x_1, …, x_n ) = ( j, x_j ) for random j

  For every i, changing x_i induces statistical difference 1/n

  But some x_i is revealed with probability 1
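A quick check of this example (Python; toy size n = 5): the output distributions for two neighbouring databases are at statistical difference exactly 1/n, even though every single output (j, x_j) discloses row j.

    from collections import Counter
    from fractions import Fraction

    def output_dist(db):
        """Distribution of San(db) = (j, x_j) for a uniformly random index j."""
        n = len(db)
        dist = Counter()
        for j, x in enumerate(db):
            dist[(j, x)] += Fraction(1, n)
        return dist

    def stat_diff(p, q):
        return sum(abs(p[k] - q[k]) for k in set(p) | set(q)) / 2

    db  = (0, 1, 1, 0, 1)
    db2 = (1, 1, 1, 0, 1)     # neighbouring database: row 0 changed

    print(stat_diff(output_dist(db), output_dist(db2)))   # 1/5, i.e. 1/n
    # ...yet each released pair (j, x_j) reveals row j with certainty.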

Page 38:

Formalizing Indistinguishability

Definition: San is ε-indistinguishable if

∀ A, ∀ DB, DB' which differ in 1 row, ∀ sets of transcripts E:

  p( San(DB) ∈ E ) ≤ e^ε · p( San(DB') ∈ E )

• Equivalently, ∀ S:

  p( San(DB) = S ) / p( San(DB') = S ) ∈ e^{±ε} ≈ 1 ± ε
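As a sanity check of the definition (Python; the mechanism is per-bit randomized response from earlier, and the exhaustive search is only feasible for tiny d): flipping each of a row's d bits with probability p gives a worst-case probability ratio of ((1−p)/p)^d between neighbouring databases, i.e. ε = d·ln((1−p)/p).

    import math
    from itertools import product

    def p_output(row, out, p):
        """Pr[per-bit randomized response maps `row` to `out`], flipping each bit w.p. p."""
        prob = 1.0
        for b, o in zip(row, out):
            prob *= p if b != o else (1 - p)
        return prob

    def worst_case_eps(d, p):
        """max over rows x, x' and outputs S of ln( Pr[S | x] / Pr[S | x'] )."""
        rows = list(product((0, 1), repeat=d))
        eps = 0.0
        for x in rows:
            for x2 in rows:
                for out in rows:
                    eps = max(eps, math.log(p_output(x, out, p) / p_output(x2, out, p)))
        return eps

    d, p = 3, 0.25
    print(worst_case_eps(d, p), d * math.log((1 - p) / p))   # both are about 3.296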

Page 39:

Indistinguishability ⇒ Differential Privacy

• Definition: San is safe if

  ∀ prior distributions p(·) on DB,

  ∀ transcripts S, ∀ i = 1, …, n:

  StatDiff( p_0(·|S) , p_i(·|S) ) ≤ ε

• We can use indistinguishability: for every S and DB,

  p( San(DB) = S ) ∈ e^{±ε} · p( San(DB_{-i}) = S )

  This implies StatDiff( p_0(·|S) , p_i(·|S) ) ≤ ε
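One way to fill in this step, as a rough sketch in LaTeX (my own write-up; the constants are not meant to be tight): write the two posteriors via Bayes' rule and apply the multiplicative bound to numerator and denominator, which gives pointwise closeness and hence a bound on the statistical difference.

    p_0(\mathrm{db} \mid S) = \frac{p(\mathrm{db})\,\Pr[\mathrm{San}(\mathrm{db}) = S]}
                                   {\sum_{\mathrm{db}'} p(\mathrm{db}')\,\Pr[\mathrm{San}(\mathrm{db}') = S]},
    \qquad
    p_i(\mathrm{db} \mid S) = \frac{p(\mathrm{db})\,\Pr[\mathrm{San}(\mathrm{db}_{-i}) = S]}
                                   {\sum_{\mathrm{db}'} p(\mathrm{db}')\,\Pr[\mathrm{San}(\mathrm{db}'_{-i}) = S]}.

    \Pr[\mathrm{San}(\mathrm{db}) = S] \in e^{\pm\varepsilon}\,\Pr[\mathrm{San}(\mathrm{db}_{-i}) = S]
    \ \text{for every db}
    \;\Longrightarrow\;
    p_0(\cdot \mid S) \in e^{\pm 2\varepsilon}\, p_i(\cdot \mid S)
    \;\Longrightarrow\;
    \mathrm{StatDiff}\bigl(p_0(\cdot \mid S),\, p_i(\cdot \mid S)\bigr) \le e^{2\varepsilon} - 1 \approx 2\varepsilon .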

Page 40:

Why does this help?

With relatively little noise:

• Averages

• Histograms

• Matrix decompositions

• Certain types of clustering

• …

See Kobbi’s talk
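For instance, one standard way to get noisy counts and averages of this kind is output perturbation with Laplace noise calibrated to the query's sensitivity; a minimal sketch (Python; the clipping bounds and ε are my own choices, and this is not necessarily the construction from the talk):

    import random

    def laplace(scale):
        # Laplace(0, scale) noise, sampled as the difference of two exponentials.
        return random.expovariate(1 / scale) - random.expovariate(1 / scale)

    def noisy_count(db, predicate, eps):
        """Counting query: changing one row changes the count by at most 1,
        so Laplace noise of scale 1/eps suffices for eps-indistinguishability."""
        return sum(1 for row in db if predicate(row)) + laplace(1 / eps)

    def noisy_average(values, lo, hi, eps):
        """Average of values clipped to [lo, hi]; one row changes it by at most (hi - lo)/n."""
        n = len(values)
        clipped = [min(max(v, lo), hi) for v in values]
        return sum(clipped) / n + laplace((hi - lo) / (n * eps))

    db = [(1, 0, 1), (0, 0, 1), (1, 1, 1)]
    print(noisy_count(db, lambda row: row[0] == 1, eps=0.5))
    print(noisy_average([48_000, 52_000, 61_000], lo=0, hi=200_000, eps=0.5))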

Page 41:

Preventing Attribute Disclosure

• Various ways to capture

“no particular value should be revealed”

• Differential Criterion: “Whatever is learned would be learned regardless of whether or not

person i participates”

• Satisfied by indistinguishability
  Also implies protection from re-identification?

• Two interpretations:
  A given release won't make privacy worse
  Rational respondent will answer if there is some gain

• Can we preserve enough utility?

Page 42:

Overview

• Examples

• Intuitions for privacy

Why crypto def’s don’t apply

• A Partial* Selection of Definitions

Two Straw men

Blending into the Crowd

An impossibility result

Attribute Disclosure and Differential Privacy

* “partial” = “incomplete” and “biased”

Page 43:

Things I Didn’t Talk About

• Economic Perspective [KPR]

Utility of providing data = value – cost

May depend on whether others participate

When is it worth my while?

• Specific methods for re-identification

• Various other frameworks (e.g. “L-diversity”)

• Other pieces of big “data privacy” picture

Access Control

Implementing trusted collection center

Page 44:

Conclusions

• Pinning down social notion in particular context

• Biased survey of approaches to definitions
  A taste of techniques along the way
  Didn't talk about utility

• Question has a different flavor from usual crypto problems
  and from statisticians' traditional conception

• Meaningful statements are possible!
  Practical?

Do they cover everything? No

Page 45:

Conclusions

• How close are we to converging? e.g. s.f.e., encryption, Turing machines,…

But we’re after a social concept?

Silver bullet?

• What are the big challenges?

• Need “cryptanalysis” of these systems (Adi…?)