From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Hoeteck Wee


TRANSCRIPT

Page 1: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Hoeteck Wee

Page 2: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Database Privacy

Census data – a prototypical example:
– Individuals provide information
– The census bureau publishes sanitized records

Privacy is legally mandated; what utility can we achieve?

Our goal:
– What do we mean by preservation of privacy?
– Characterize the trade-off between privacy and utility: disguise individual identifying information while preserving macroscopic properties
– Develop a “good” sanitizing procedure with theoretical guarantees

Page 3: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

An outline of this talk

A mathematical formalism:
– What do we mean by privacy? Prior work
– An abstract model of datasets
– Isolation; good sanitizations

A candidate sanitization:
– A brief overview of results
– General argument for privacy of n-point datasets

Open issues and concluding remarks

Page 4: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Privacy… a philosophical viewpoint

[Ruth Gavison] … includes protection from being brought to the attention of others …

– Matches intuition; inherently desirable
– Attention invites further loss of privacy
– Privacy is assured to the extent that one blends in with the crowd

Appealing definition; can be converted into a precise mathematical statement!

Page 5: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Database Privacy

Statistical approaches:
– Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
– Additionally, erase values that reveal too much

Query-based approaches (involve a permanent trusted third party):
– Query monitoring: disallow queries that breach privacy
– Perturbation: add noise to the query output [Dinur Nissim ’03, Dwork Nissim ’04]

Statistical perturbation + adversarial analysis:
– [Evfimievski et al. ’03] combine statistical techniques with analysis similar to query-based approaches

Page 6: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Everybody’s First Suggestion

Learn the distribution, then output:
– A description of the distribution, or
– Samples from the learned distribution

Want to reflect facts on the ground

Statistically insignificant facts can be important for allocating resources

Page 7: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Our Approach

Crypto-flavored definitions:
– Mathematical characterization of the adversary’s goal
– Precise definition of when a sanitization procedure fails
– Intuition: seeing the sanitized DB gives the adversary an “advantage”

Statistical techniques:
– Perturbation of attribute values
– Differs from previous work: perturbation amounts depend on local densities of points

Highly abstracted version of the problem:
– If we can’t understand this, we can’t understand real life
– If we get negative results here, the world is in trouble

Page 8: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

A geometric view

Abstraction:
– Points in a high-dimensional metric space, say R^d, drawn i.i.d. from some distribution
– Points are unlabeled; you are your collection of attributes
– Distance is everything

Real Database (RDB) – private: n unlabeled points in d-dimensional space

Sanitized Database (SDB) – public: n′ new points, possibly in a different space

Page 9: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

The adversary or Isolator

Using SDB and auxiliary information (AUX), the Isolator outputs a point q

q “isolates” a real point x if it is much closer to x than to x’s neighbors

Even if q is close to x, it may fail to isolate x if it is comparably close to x’s neighbors

Tightly clustered points have a smaller radius of isolation

[Figure: RDB points, with examples of isolating and non-isolating query points]

Page 10: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

The adversary or Isolator

I(SDB, AUX) = q

Let δ = |q − x|; q isolates x if B(q, cδ) contains fewer than T points

T-radius of x – distance to its T-th nearest neighbor

x is “safe” if δ_x > (T-radius of x)/(c − 1), since then B(q, cδ_x) contains x’s entire T-neighborhood

c – privacy parameter; e.g., c = 4

Large T and small c make isolation easier to achieve, so guarantees against them are stronger
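A minimal numpy sketch of this isolation test and the “safe” condition; the function names and the defaults c = 4 and T = 10 are illustrative choices, not the paper’s:

```python
import numpy as np

def t_radius(x, rdb, T):
    """Distance from x to its T-th nearest neighbor in rdb.
    Assumes x is a row of rdb (index 0 holds its own zero distance) and n > T."""
    dists = np.sort(np.linalg.norm(rdb - x, axis=1))
    return dists[T]

def isolates(q, x, rdb, c=4.0, T=10):
    """q isolates x if B(q, c*delta), with delta = |q - x|,
    contains fewer than T database points."""
    delta = np.linalg.norm(q - x)
    return np.sum(np.linalg.norm(rdb - q, axis=1) <= c * delta) < T

def is_safe(q, x, rdb, c=4.0, T=10):
    """Sufficient condition from the slide: if |q - x| > T-radius(x)/(c-1),
    then B(q, c|q - x|) contains x's entire T-neighborhood, so q fails."""
    return np.linalg.norm(q - x) > t_radius(x, rdb, T) / (c - 1)
```

The “safe” condition is the triangle inequality: every point within T-rad(x) of x lies within δ_x + T-rad(x) < cδ_x of q.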

Page 11: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

A good sanitization

Sanitizing algorithm compromises privacy if the adversary is able to considerably increase his probability of isolating a point by looking at its output

A rigorous (and too ideal) definition: ∀ D ∀ I ∃ I′ such that, w.o.p. over RDB ∈_R D^n, ∀ aux z, ∀ x ∈ RDB:

| Pr[I(SDB, z) isolates x] − Pr[I′(z) isolates x] | ≤ ε/n

The definition of ε can be forgiving, say, 2^(−Ω(d)) or 1 in 1000

Quantification over x : If aux reveals info about some x, the privacy of some other y should still be preserved

Provides a framework for describing the power of a sanitization method, and hence for comparisons
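For reference, the same definition in LaTeX display form:

```latex
\forall D\ \forall I\ \exists I' \text{ s.t., w.o.p. over } RDB \in_R D^n:\quad
\forall \text{aux } z,\ \forall x \in RDB,\quad
\bigl|\Pr[I(SDB, z) \text{ isolates } x] - \Pr[I'(z) \text{ isolates } x]\bigr| \le \varepsilon/n
```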

Page 12: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

The Sanitizer

The privacy of x is linked to its T-radius: randomly perturb x in proportion to its T-radius

x′ = San(x) ∈_R S(x, T-rad(x))

[Figure: example perturbation with T = 1]

Page 13: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

The Sanitizer

The privacy of x is linked to its T-radius: randomly perturb x in proportion to its T-radius

x′ = San(x) ∈_R S(x, T-rad(x))

Intuition: We are blending x in with its crowd

If the number of dimensions (d) is large, there are “many” pre-images for x’. The adversary cannot conclusively pick any one.

We are adding random noise with mean zero to x, so several macroscopic properties should be preserved.
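A minimal numpy sketch of this sanitizer, assuming S(x, r) is the sphere of radius r around x, sampled uniformly on its surface; T = 10 is an illustrative choice:

```python
import numpy as np

def sample_sphere(center, radius, rng):
    """Uniform point on the surface of S(center, radius):
    a Gaussian direction, normalized and scaled to the given radius."""
    v = rng.standard_normal(center.shape[0])
    return center + radius * v / np.linalg.norm(v)

def sanitize(rdb, T=10, rng=None):
    """x' = San(x), drawn at random from S(x, T-rad(x)) for each x in RDB.
    Assumes rdb is an (n, d) array with n > T."""
    rng = rng or np.random.default_rng()
    sdb = []
    for x in rdb:
        dists = np.sort(np.linalg.norm(rdb - x, axis=1))
        t_rad = dists[T]          # index 0 is x itself (distance 0)
        sdb.append(sample_sphere(x, t_rad, rng))
    return np.array(sdb)
```

Since the added noise has mean zero, sample averages are approximately preserved, which is the utility intuition stated above.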

Page 14: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Results on privacy… an overview

Distribution | Num. points | Revealed to adversary | Auxiliary information
Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a cube | n | Exact histogram counts over subcells of sufficiently large size | Distribution
Uniform over a cube | n | Perturbation of n/2 points | Distribution; exact histogram counts of subcells

The adversary is computationally unbounded

Page 15: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Results on utility… an overview

Distributional/Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing largest diameter | – | Optimal diameter as well as approximations increase by at most a factor of 3
Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability as long as means are pairwise sufficiently far

Page 16: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

A special case – one sanitized point

RDB = {x1, …, xn}

The adversary is given n−1 real points x2, …, xn and one sanitized point x′1; T = 1; c = 4; “flat” prior

Recall: x′1 ∈_R S(x1, |x1 − y|), where y is the nearest neighbor of x1

Main idea: Consider the posterior distribution on x1

Show that the adversary cannot isolate a large probability mass under this distribution

Page 17: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

A special case – one sanitized point

Let Z = { p ∈ R^d | p is a legal pre-image for x′1 }

Q = { p | if x1 = p, then x1 is isolated by q }

We show that Pr[Q ∩ Z | x′1] ≤ 2^(−Ω(d)) · Pr[Z | x′1]

Pr[x1 ∈ Q ∩ Z | x′1] = (prob. mass contribution from Q ∩ Z) / (contribution from Z) ≤ 2^(1−d) / (1/4)

[Figure: sanitized point x′1 with real points x2, …, x6; the region Z of legal pre-images and the sliver Q ∩ Z, where |p − q| ≤ (1/3)|p − x′1|]
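Combining the two contributions as the slide states them, in LaTeX:

```latex
\Pr[x_1 \in Q \cap Z \mid x'_1]
  = \frac{\text{mass of } Q \cap Z}{\text{mass of } Z}
  \le \frac{2^{1-d}}{1/4}
  = 2^{3-d}
  = 2^{-\Omega(d)}
```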

Page 18: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Contribution from Z

Pr[x1 = p | x′1] ∝ Pr[x′1 | x1 = p] ∝ 1/r^d, where r = |x′1 − p|

As r increases, x′1 gets randomized over a larger area – proportional to r^d. Hence the inverse dependence.

Pr[x′1 | x1 ∈ S] ∝ ∫_S 1/r^d ∝ solid angle subtended by S at x′1

Z subtends a solid angle equal to at least half a sphere at x′1

[Figure: a region S at distance r from x′1 inside Z, with real points x2, …, x6]
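The same computation in display form:

```latex
\Pr[x_1 = p \mid x'_1] \;\propto\; \Pr[x'_1 \mid x_1 = p] \;\propto\; \frac{1}{r^d},
\qquad r = |x'_1 - p|;
\qquad
\Pr[x'_1 \mid x_1 \in S] \;\propto\; \int_S \frac{1}{r^d}
\;\propto\; \text{solid angle subtended by } S \text{ at } x'_1
```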

Page 19: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Contribution from Q ∩ Z

The ellipsoid is roughly as far from x′1 as its longest radius

Contribution from the ellipsoid is at most 2^(−d) × the total solid angle

Therefore, Pr[x1 ∈ Q ∩ Z] / Pr[x1 ∈ Z] ≤ 2^(−Ω(d))

[Figure: the region Q ∩ Z as an ellipsoid whose distance from x′1 is comparable to its longest radius r]

Page 20: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

The general case… n sanitized points

Initial intuition is wrong:

Privacy of x1 given x′1 and the other points in the clear does not imply privacy of x1 given x′1 and sanitizations of the others!

Problem: sanitization is non-oblivious. Other sanitized points reveal information about x if x is their nearest neighbor

Solution: decouple the two kinds of information – from x′ and the x′i

[Figure: dataset split into two halves, L and R]

Page 21: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

The general case… n sanitized points

Perturbation of L is a function of R. What function of R would reveal no information about R?

Answer: coarse-grained histogram information!

Divide space into “cells”

Histogram count of cell C = number of points in R ∩ C

Perturbation radius of a point p is a function of the density of points in the cell containing p

Page 22: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Histogram-based sanitization

Recursively divide space into “cells” until all cells have few points

Reveal the EXACT count of points in each cell. Contrast this to k-anonymity

[Figure: recursive subdivision into cells with exact per-cell counts; T = 6]
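A minimal sketch of the recursive subdivision, assuming quadtree-style halving along every dimension and a cutoff `max_per_cell` standing in for “few points”; each split creates 2^d children, so this is practical only for small d:

```python
import numpy as np

def histogram_cells(points, lo, hi, max_per_cell, depth=0, max_depth=20):
    """Recursively halve the box [lo, hi) along every dimension until each
    cell holds at most max_per_cell points; reveal EXACT counts per cell.
    Returns a list of (lo, hi, count) triples."""
    if len(points) <= max_per_cell or depth >= max_depth:
        return [(lo, hi, len(points))]
    mid = (lo + hi) / 2
    d = len(lo)
    cells = []
    for corner in range(2 ** d):
        # Each dimension goes to either [lo, mid) or [mid, hi)
        bits = [(corner >> i) & 1 for i in range(d)]
        new_lo = np.where(bits, mid, lo)
        new_hi = np.where(bits, hi, mid)
        mask = np.all((points >= new_lo) & (points < new_hi), axis=1)
        cells += histogram_cells(points[mask], new_lo, new_hi,
                                 max_per_cell, depth + 1, max_depth)
    return cells

# Example: 200 uniform points in the unit square, at most 6 per cell
rng = np.random.default_rng(0)
pts = rng.random((200, 2))
cells = histogram_cells(pts, np.zeros(2), np.ones(2), max_per_cell=6)
```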

Page 23: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Histogram-based sanitization

Adversary outputs (q, r) – a guess and a radius of isolation

Adversary wins if the purple ball contains ≥ 1 point and the orange ball contains < T points

[Figure: query point q with a small purple ball inside a larger orange ball]

Page 24: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Histogram-based sanitization

We show:
– If the purple ball is “large”, then the orange ball contains the parent cell ⇒ at least T points
– If the purple ball is “small”, then the orange ball is exponentially larger than the purple ball ⇒ either the purple ball contains no point or the orange ball contains ≥ T points

Recall: cells are d-dimensional
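The “exponentially larger” step is just the volume ratio of concentric balls in R^d (taking the orange radius to be c times the purple radius r, with c the privacy parameter):

```latex
\frac{\mathrm{vol}\,B(q, cr)}{\mathrm{vol}\,B(q, r)} = c^d = 2^{\Omega(d)} \qquad (c > 1 \text{ fixed})
```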

Page 25: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Results on privacy… an overview

Distribution | Num. points | Revealed to adversary | Auxiliary information
Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a cube | n | Exact histogram counts over subcells of sufficiently large size | Distribution
Uniform over a cube | n | Perturbation of n/2 points | Distribution; exact histogram counts of subcells

Page 26: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Future directions

Extend the privacy argument to other “nice” distributions

For what distributions is there no meaningful privacy-utility trade-off?

Characterize acceptable auxiliary information – think of auxiliary information as an a priori distribution

The low-dimensional case – is it inherently impossible?

Discrete-valued attributes – our proofs require a “spread” in all attributes

Extend the utility argument to other interesting macroscopic properties – e.g. correlations

Page 27: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


Conclusions

A first step towards understanding the privacy-utility trade-off

A general and rigorous definition of privacy

A work in progress!

Page 28: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


Questions?