From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Hoeteck Wee

Database Privacy
Census data: a prototypical example
  Individuals provide information
  Census bureau publishes sanitized records
Privacy is legally mandated; what utility can we achieve?
Our goal:
  What do we mean by preservation of privacy?
  Characterize the trade-off between privacy and utility
    – disguise individual identifying information
    – preserve macroscopic properties
  Develop a “good” sanitizing procedure with theoretical guarantees

An outline of this talk
A mathematical formalism
  What do we mean by privacy?
  Prior work
  An abstract model of datasets
  Isolation; good sanitizations
A candidate sanitization
  A brief overview of results
  General argument for privacy of n-point datasets
Open issues and concluding remarks

Privacy… a philosophical viewpoint
[Ruth Gavison] … includes protection from being brought to the attention of others …
  Matches intuition; inherently desirable
  Attention invites further loss of privacy
  Privacy is assured to the extent that one blends in with the crowd
Appealing definition; can be converted into a precise mathematical statement!

Database Privacy
Statistical approaches
  Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
  Additionally, erase values that reveal too much
Query-based approaches: involve a permanent trusted third party
  Query monitoring: disallow queries that breach privacy
  Perturbation: add noise to the query output [Dinur Nissim ’03, Dwork Nissim ’04]
Statistical perturbation + adversarial analysis
  [Evfimievski et al. ’03] combine statistical techniques with analysis similar to query-based approaches

Everybody’s First Suggestion
Learn the distribution, then output:
  A description of the distribution, or
  Samples from the learned distribution
Want to reflect facts on the ground
Statistically insignificant facts can be important for allocating resources

Our Approach
Crypto-flavored definitions
  Mathematical characterization of the adversary’s goal
  Precise definition of when a sanitization procedure fails
  Intuition: seeing the sanitized DB gives the adversary an “advantage”
Statistical techniques
  Perturbation of attribute values
  Differs from previous work: perturbation amounts depend on local densities of points
Highly abstracted version of the problem
  If we can’t understand this, we can’t understand real life.
  If we get negative results here, the world is in trouble.

A geometric view
Abstraction:
  Points in a high-dimensional metric space, say $\mathbb{R}^d$, drawn i.i.d. from some distribution
  Points are unlabeled; you are your collection of attributes
  Distance is everything
Real Database (RDB), private: n unlabeled points in d-dimensional space
Sanitized Database (SDB), public: n’ new points, possibly in a different space

The adversary or Isolator
Using SDB and auxiliary information (AUX), outputs a point q
q “isolates” a real point x if it is much closer to x than to x’s neighbors
Even if q looks similar to x, it may fail to isolate x if it looks just as similar to x’s neighbors
Tightly clustered points have a smaller radius of isolation
[Figure: the RDB points, with one non-isolating and one isolating choice of q]
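
The adversary or Isolator
I(SDB, AUX) = q
Let $\delta_x = |q - x|$ be the distance from q to a real point x
x is isolated if $B(q, c\delta_x)$ contains fewer than T points
  c: privacy parameter, e.g. 4
T-radius of x: distance to its T-th nearest neighbor
x is “safe” if $\delta_x > (\text{T-radius of } x)/(c-1)$
  Then, by the triangle inequality, $B(q, c\delta_x)$ contains x’s entire T-neighborhood
Large T and small c is good
[Figure: q and x, with distances labeled $\delta$, $(c-1)\delta$ and $c\delta$]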
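A minimal Python sketch of the isolation test described on this slide. It assumes Euclidean distance, takes δ to be the distance from q to x, and counts only RDB points other than x inside the ball (so that T = 1 refers to x's nearest neighbor). These conventions and all function names are illustrative assumptions, not taken from the talk.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def t_radius(rdb, i, T):
    """Distance from rdb[i] to its T-th nearest neighbor among the other points."""
    others = sorted(dist(rdb[i], y) for j, y in enumerate(rdb) if j != i)
    return others[T - 1]

def isolates(q, rdb, i, c, T):
    """True if q isolates rdb[i]: B(q, c*delta) holds fewer than T other points.

    Assumption: rdb[i] itself is not counted, so T = 1 means "the ball
    does not even reach the nearest neighbor of rdb[i]".
    """
    delta = dist(q, rdb[i])
    in_ball = sum(1 for j, y in enumerate(rdb)
                  if j != i and dist(q, y) <= c * delta)
    return in_ball < T

def is_safe(q, rdb, i, c, T):
    """Sufficient condition from the slide: delta > (T-radius of x) / (c - 1)."""
    return dist(q, rdb[i]) > t_radius(rdb, i, T) / (c - 1)

# Hypothetical toy data: three points on a line, target x = rdb[0].
rdb = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
close_guess = isolates((0.1, 0.0), rdb, 0, c=4, T=1)
far_guess = isolates((0.5, 0.0), rdb, 0, c=4, T=1)
```

Whenever `is_safe` returns True, every point within the T-radius of x lies within (c-1)δ + δ = cδ of q, so `isolates` returns False under the same counting convention.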

A good sanitization
A sanitizing algorithm compromises privacy if the adversary is able to considerably increase his probability of isolating a point by looking at its output
A rigorous (and too ideal) definition:
  $\forall D\ \forall I\ \exists I'$, w.o.p. over $\mathrm{RDB} \in_R D^n$, $\forall$ aux $z$, $\forall x \in \mathrm{RDB}$:
  $|\Pr[I(\mathrm{SDB}, z) \text{ isolates } x] - \Pr[I'(z) \text{ isolates } x]| \le \varepsilon/n$
The definition of $\varepsilon$ can be forgiving, say $2^{-\Omega(d)}$ or (1 in 1000)
Quantification over x: if aux reveals info about some x, the privacy of some other y should still be preserved
Provides a framework for describing the power of a sanitization method, and hence for comparisons

The Sanitizer
The privacy of x is linked to its T-radius
  Randomly perturb it in proportion to its T-radius
  $x' = \mathrm{San}(x) \in_R S(x, \text{T-rad}(x))$
[Figure: example perturbation with T = 1]
Intuition: we are blending x in with its crowd
  If the number of dimensions (d) is large, there are “many” pre-images for x’. The adversary cannot conclusively pick any one.
  We are adding random noise with mean zero to x, so several macroscopic properties should be preserved.
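A rough Python sketch of this perturbation step, under stated assumptions: S(x, r) is read as the surface of the sphere of radius r around x, distances are Euclidean, and the uniform direction is drawn by normalizing a Gaussian vector. The function names are hypothetical and this is not the authors' implementation.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def t_radius(rdb, i, T):
    """Distance from rdb[i] to its T-th nearest neighbor among the other points."""
    others = sorted(dist(rdb[i], y) for j, y in enumerate(rdb) if j != i)
    return others[T - 1]

def random_point_on_sphere(center, radius, rng):
    """Uniform random point on the sphere of the given radius around center.

    A normalized vector of independent Gaussians is uniformly distributed
    over directions, in any dimension.
    """
    g = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(v * v for v in g))
    return tuple(ci + radius * v / norm for ci, v in zip(center, g))

def sanitize(rdb, T, rng=None):
    """Perturb every point by a random offset whose length is its T-radius."""
    rng = rng or random.Random()
    return [random_point_on_sphere(x, t_radius(rdb, i, T), rng)
            for i, x in enumerate(rdb)]

# Hypothetical toy data: points in a dense area move less than outliers.
rdb = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0), (5.0, 5.0, 5.0)]
sdb = sanitize(rdb, T=1, rng=random.Random(0))
```

Each sanitized point lies at exactly its original point's T-radius from that point, which is what ties the amount of noise to the local density.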

Results on privacy… An overview
Distribution | Num. points | Revealed to adversary | Auxiliary information
Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a cube | n | Exact histogram count over subcells of sufficiently large size | Distribution
Uniform over a cube | n | Perturbation of n/2 points | Distribution; exact histogram counts of subcells
Adversary is computationally unbounded

Results on utility… An overview
Distributional/Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing largest diameter | - | Optimal diameter as well as approximations increase by at most a factor of 3
Distributional | Find k maximum likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability as long as means are pairwise sufficiently far apart

A special case: one sanitized point
RDB = $\{x_1, \dots, x_n\}$
The adversary is given n-1 real points $x_2, \dots, x_n$ and one sanitized point $x'_1$; T = 1; c = 4; “flat” prior
Recall: $x'_1 \in_R S(x_1, |x_1 - y|)$, where y is the nearest neighbor of $x_1$
Main idea: consider the posterior distribution on $x_1$
  Show that the adversary cannot isolate a large probability mass under this distribution
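
A special case: one sanitized point
Let $Z = \{ p \in \mathbb{R}^d \mid p \text{ is a legal pre-image for } x'_1 \}$
Let $Q = \{ p \mid \text{if } x_1 = p \text{ then } x_1 \text{ is isolated by } q \}$
We show that $\Pr[Q \cap Z \mid x'_1] \le 2^{-\Omega(d)} \Pr[Z \mid x'_1]$
$\Pr[x_1 \in Q \cap Z \mid x'_1]$ = (probability mass contribution from $Q \cap Z$) / (contribution from $Z$) = $2^{1-d}/(1/4)$
[Figure: the sanitized point $x'_1$, real points $x_2, \dots, x_6$, and the regions $Z$ and $Q \cap Z$, annotated with $|p - q| \le \frac{1}{3}|p - x'_1|$]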
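As a quick check of the arithmetic, taking the ratio on this slide to be $2^{1-d}/(1/4)$, the bound simplifies as follows. The value d = 20 is a hypothetical dimension chosen only for illustration.

```latex
\[
\frac{2^{1-d}}{1/4} \;=\; 4 \cdot 2^{1-d} \;=\; 2^{3-d},
\qquad\text{so for } d = 20:\quad 2^{3-20} = 2^{-17} \approx 7.6 \times 10^{-6}.
\]
```

The constant factor 8 is absorbed into the exponent, which is why the bound is stated as $2^{-\Omega(d)}$.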
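
Contribution from Z
$\Pr[x_1 = p \mid x'_1] \propto \Pr[x'_1 \mid x_1 = p] \propto 1/r^d$, where $r = |x'_1 - p|$
  An increase in r means $x'_1$ gets randomized over a larger area, proportional to $r^d$; hence the inverse dependence
$\Pr[x'_1 \mid x_1 \in S] \propto \int_{s \in S} 1/r^d \propto$ solid angle subtended by S at $x'_1$
Z subtends a solid angle equal to at least half a sphere at $x'_1$
[Figure: $x'_1$, real points $x_2, \dots, x_6$, the region Z, and a region S containing a point p at distance r from $x'_1$]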
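
Contribution from $Q \cap Z$
The ellipsoid is roughly as far from $x'_1$ as its longest radius
The contribution from the ellipsoid is $2^{-d} \times$ the total solid angle
Therefore, $\Pr[x_1 \in Q \cap Z] / \Pr[x_1 \in Z] \approx 2^{-d}$
[Figure: $x'_1$, real points $x_2, \dots, x_6$, and the regions $Z$ and $Q \cap Z$, with two distances labeled r]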

The general case… n sanitized points
Initial intuition is wrong:
  Privacy of $x_1$ given $x'_1$ and the other points in the clear does not imply privacy of $x_1$ given $x'_1$ and sanitizations of the others!
Problem: sanitization is non-oblivious
  Other sanitized points reveal information about x, if x is their nearest neighbor
Solution: decouple the two kinds of information, from $x'$ and from $x'_i$
[Figure: the points split into two sets, L and R]

The general case… n sanitized points
Perturbation of L is a function of R
What function of R would reveal no information about R?
Answer: coarse-grained histogram information!
  Divide space into “cells”
  Histogram count of cell C = number of points in $R \cap C$
  Perturbation radius of a point p is a function of the density of points in the cell containing p

Histogram-based sanitization
Recursively divide space into “cells” until all cells have few points
Reveal the EXACT count of points in each cell
  Contrast this to k-anonymity
[Figure: recursive subdivision of the space into cells with T = 6; each cell is labeled with its exact point count (values between 0 and 5)]
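The recursive subdivision can be sketched in Python as follows. This is an illustration under assumptions, not the authors' procedure: the space is an axis-aligned box, each cell is halved along every axis (so it has 2^d children, practical only for small d), and the stopping threshold `max_count` and the depth guard `max_depth` are hypothetical parameters standing in for "few points".

```python
from itertools import product

def histogram_cells(points, lo, hi, max_count, max_depth=16):
    """Recursively split the box [lo, hi) until each cell has <= max_count points.

    Returns a list of (cell_lo, cell_hi, exact_count) tuples. Cells are
    half-open, so points on the upper boundary of the root box are ignored.
    """
    inside = [p for p in points
              if all(l <= v < h for v, l, h in zip(p, lo, hi))]
    if len(inside) <= max_count or max_depth == 0:
        return [(tuple(lo), tuple(hi), len(inside))]
    mid = [(l + h) / 2 for l, h in zip(lo, hi)]
    cells = []
    # One child per choice of lower or upper half along each axis.
    for choice in product((0, 1), repeat=len(lo)):
        child_lo = [m if bit else l for bit, l, m in zip(choice, lo, mid)]
        child_hi = [h if bit else m for bit, m, h in zip(choice, mid, hi)]
        cells.extend(histogram_cells(inside, child_lo, child_hi,
                                     max_count, max_depth - 1))
    return cells

# Hypothetical toy data in the unit square.
pts = [(0.1, 0.1), (0.2, 0.2), (0.8, 0.8), (0.9, 0.1), (0.6, 0.6)]
cells = histogram_cells(pts, (0.0, 0.0), (1.0, 1.0), max_count=2)
```

Empty cells are kept in the output with a count of 0, matching the slide's point that the exact count of every cell is revealed.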

Histogram-based sanitization
Adversary outputs (q, r): a guess and a radius of isolation
Adversary wins if the purple ball contains at least 1 point and the orange ball contains fewer than T points
[Figure: q at the center of a purple ball and a larger orange ball]

Histogram-based sanitization
We show:
  If the purple ball is “large”, then the orange ball contains the parent cell => at least T points
  If the purple ball is “small”, then the orange ball is exponentially larger than the purple ball => either purple has < 1 points or orange has at least T points
Recall: cells are d-dimensional
[Figure: q at the center of the purple and orange balls]

Future directions
Extend the privacy argument to other “nice” distributions
  For what distributions is there no meaningful privacy-utility trade-off?
Characterize acceptable auxiliary information
  Think of auxiliary information as an a priori distribution
The low-dimensional case: is it inherently impossible?
Discrete-valued attributes
  Our proofs require a “spread” in all attributes
Extend the utility argument to other interesting macroscopic properties, e.g. correlations
Conclusions
A first step towards understanding the privacy-utility trade-off
A general and rigorous definition of privacy
A work in progress!
Questions?