From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
DESCRIPTION
From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases. Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee. A talk on database privacy; census data serves as the prototypical example of individuals providing information that is later published in sanitized form.
Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith,
Larry Stockmeyer, Hoeteck Wee
From Idiosyncratic to Stereotypical:
Toward Privacy in Public Databases
Shuchi Chawla
Database Privacy
Census data – a prototypical example:
Individuals provide information
Census bureau publishes sanitized records
Privacy is legally mandated; what utility can we achieve?
Our goals:
What do we mean by preservation of privacy?
Characterize the trade-off between privacy and utility
– disguise individual identifying information
– preserve macroscopic properties
Develop a “good” sanitizing procedure with theoretical guarantees
An outline of this talk
A mathematical formalism
What do we mean by privacy?
Prior work
An abstract model of datasets
Isolation; good sanitizations
A candidate sanitization
A brief overview of results
General argument for privacy of n-point datasets
Open issues and concluding remarks
Privacy… a philosophical view-point
[Ruth Gavison] … includes protection from being brought to the attention of others …
Matches intuition; inherently desirable
Attention invites further loss of privacy
Privacy is assured to the extent that one blends in with the crowd
Appealing definition; can be converted into a precise mathematical statement!
Database Privacy
Statistical approaches
Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
Additionally, erase values that reveal too much
Query-based approaches involve a permanent trusted third party
Query monitoring: disallow queries that breach privacy
Perturbation: add noise to the query output [Dinur Nissim ’03, Dwork Nissim ’04]
Statistical perturbation + adversarial analysis
[Evfimievski et al. ’03] combine statistical techniques with analysis similar to query-based approaches
Everybody’s First Suggestion
Learn the distribution, then output:
A description of the distribution, or,
Samples from the learned distribution
Want to reflect facts on the ground
Statistically insignificant facts can be important for allocating resources
A geometric view
Abstraction: points in a high-dimensional metric space – say R^d – drawn i.i.d. from some distribution
Points are unlabeled; you are your collection of attributes
Distance is everything
Real Database (RDB) – private: n unlabeled points in d-dimensional space
Sanitized Database (SDB) – public: n’ new points, possibly in a different space
The adversary or Isolator
Using SDB and auxiliary information (AUX), outputs a point q
q “isolates” a real point x if it is much closer to x than to x’s neighbors, i.e., if B(q, cδ) contains fewer than T RDB points, where δ = |q − x|
T-radius of x – distance to its T-th nearest neighbor
x is “safe” if δ_x > (T-radius of x)/(c − 1), since then B(q, cδ_x) contains x’s entire T-neighborhood
c – privacy parameter; e.g., c = 4
Large T and small c is good
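The isolation test above is easy to state in code. The sketch below is illustrative only (the helper names t_radius and isolates are mine, not the paper’s), assuming the RDB is an n × d NumPy array:

```python
import numpy as np

def t_radius(rdb, i, T):
    """Distance from rdb[i] to its T-th nearest neighbor.
    Index 0 of the sorted distances is rdb[i] itself, at distance 0."""
    dists = np.sort(np.linalg.norm(rdb - rdb[i], axis=1))
    return dists[T]

def isolates(q, x, rdb, c=4, T=10):
    """The adversary's point q (c, T)-isolates the real point x when the
    ball B(q, c*delta), with delta = |q - x|, contains fewer than T RDB
    points -- i.e., q is much closer to x than to x's neighbors."""
    delta = np.linalg.norm(q - x)
    points_in_ball = np.sum(np.linalg.norm(rdb - q, axis=1) <= c * delta)
    return points_in_ball < T
```

A guess near a point that blends into a crowd fails the test, because the inflated ball B(q, cδ) swallows the whole crowd; a guess near a lone outlier succeeds.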
A good sanitization
A sanitizing algorithm compromises privacy if the adversary can considerably increase his probability of isolating a point by looking at its output
A rigorous (and too ideal) definition:
∀ D, ∀ I, ∃ I’ such that, w.o.p. over RDB ∈_R D^n, ∀ aux z, ∀ x ∈ RDB:
| Pr[I(SDB, z) isolates x] − Pr[I’(z) isolates x] | ≤ ε/n
The definition of ε can be forgiving, say, 2^(−Ω(d)) or 1 in 1000
Quantification over x: if aux reveals info about some x, the privacy of some other y should still be preserved
Provides a framework for describing the power of a sanitization method, and hence for comparisons
The Sanitizer
The privacy of x is linked to its T-radius
Randomly perturb x in proportion to its T-radius:
x’ = San(x) ∈_R S(x, T-rad(x))
Intuition: we are blending x in with its crowd
If the number of dimensions (d) is large, there are “many” pre-images for x’; the adversary cannot conclusively pick any one
We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
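The perturbation step can be sketched as follows. This is a hedged illustration, not the paper’s implementation: it assumes S(x, r) denotes the ball of radius r around x (a sphere-surface variant would drop the radius draw and use the full T-radius):

```python
import numpy as np

def sanitize(rdb, T, rng=None):
    """Sketch of the candidate sanitizer: replace each point x with a point
    drawn uniformly at random from S(x, T-rad(x)), taken here to be the
    ball of radius equal to x's T-radius. Noise has mean zero, so sample
    statistics such as the mean are roughly preserved."""
    rng = rng if rng is not None else np.random.default_rng()
    n, d = rdb.shape
    sdb = np.empty_like(rdb)
    for i, x in enumerate(rdb):
        t_rad = np.sort(np.linalg.norm(rdb - x, axis=1))[T]
        direction = rng.standard_normal(d)
        direction /= np.linalg.norm(direction)       # uniform random direction
        radius = t_rad * rng.random() ** (1.0 / d)   # uniform radius within a d-ball
        sdb[i] = x + radius * direction
    return sdb
```

Note the design choice the slide motivates: the noise magnitude is not fixed but scales with each point’s own T-radius, so isolated points get large perturbations while crowded points barely move.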
Results on privacy.. An overview
Distribution Num. of points
Revealed to adversary
Auxiliary information
Uniform on surface of sphere
2 Both sanitized points Distribution, 1-radius
Uniform over a bounding box or surface of sphere
n One sanitized point, all other real points
Distribution, all real points
Gaussian 2o(d) n sanitized points Distribution
Gaussian 2(d) Work under progress
Results on utility… An overview
Distributional/Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing largest diameter | – | Optimal diameter as well as approximations increase by at most a factor of 3
Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability as long as means are pairwise sufficiently far
A special case - one sanitized point
RDB = {x_1, …, x_n}
The adversary is given n − 1 real points x_2, …, x_n and one sanitized point x’_1; T = 1; c = 4; “flat” prior
Recall: x’_1 ∈_R S(x_1, |x_1 − y|), where y is the nearest neighbor of x_1
Main idea: consider the posterior distribution on x_1
Show that the adversary cannot isolate a large probability mass under this distribution
Let Z = { p ∈ R^d | p is a legal pre-image for x’_1 }
Q = { p | if x_1 = p then x_1 is isolated by q }
We show that Pr[ Q∩Z | x’_1 ] ≤ 2^(−Ω(d)) Pr[ Z | x’_1 ]
Pr[ x_1 ∈ Q∩Z | x’_1 ] = (prob mass contribution from Q∩Z) / (contribution from Z) = 2^(1−d) / (1/4)
[Figure: sanitized point x’_1 surrounded by real points x_2, …, x_6, with regions Z and Q∩Z; points p ∈ Q satisfy |p − q| ≤ (1/3) |p − x’_1|]
Contribution from Z
Pr[ x_1 = p | x’_1 ] ∝ Pr[ x’_1 | x_1 = p ] ∝ 1/r^d, where r = |x’_1 − p|
An increase in r means x’_1 gets randomized over a larger area – proportional to r^d; hence the inverse dependence
Pr[ x’_1 | x_1 ∈ S ] ∝ ∫_{s ∈ S} 1/r^d ∝ solid angle subtended by S at x’_1
Z subtends a solid angle equal to at least half a sphere at x’_1
[Figure: region S at distance r from x’_1, containing a candidate pre-image p inside Z; real points x_2, …, x_6 shown]
Contribution from Q∩Z
The ellipsoid is roughly as far from x’_1 as its longest radius
Contribution from the ellipsoid is 2^(−Ω(d)) × total solid angle
Therefore, Pr[ x_1 ∈ Q∩Z ] / Pr[ x_1 ∈ Z ] ≤ 2^(−Ω(d))
[Figure: ellipsoid Q∩Z at distance r from x’_1, with longest radius r, inside region Z; real points x_2, …, x_6 shown]
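The exponential decay in d can be checked numerically: the fraction of the sphere’s solid angle inside a fixed angular cap shrinks rapidly with dimension. The Monte Carlo sketch below is my own illustration of that phenomenon, not part of the talk:

```python
import numpy as np

def cap_fraction(d, cos_threshold, n_samples=200000, seed=0):
    """Monte Carlo estimate of the fraction of uniformly random directions
    in R^d whose angle with a fixed axis is small (cosine >= cos_threshold).
    This solid-angle fraction decays exponentially in d -- the engine behind
    the 2^(-Omega(d)) bound on the Q-intersect-Z contribution."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n_samples, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # uniform on the unit sphere
    return float(np.mean(v[:, 0] >= cos_threshold))
```

For example, the cap with cosine at least 0.5 covers a quarter of the sphere in R^3 but a vanishing fraction of it in R^50.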
The general case… n sanitized points
The initial intuition is wrong: privacy of x_1 given x’_1 and all the other points in the clear does not imply privacy of x_1 given x’_1 and sanitizations of the others!
Sanitization is non-oblivious – other sanitized points reveal information about x if x is their nearest neighbor
Where we are now:
Consider some example of a safe sanitization (not necessarily using perturbations) – density regions? histograms?
Relate perturbations to the safe sanitization
Uniform distribution; a histogram over fixed-size cells gives an exponentially low probability of isolation
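A minimal sketch of such a histogram-style release (the function name and cell-width parameter are my own assumptions, not the talk’s):

```python
import numpy as np

def histogram_sanitize(rdb, cell_width):
    """Histogram-style release sketch: partition space into fixed-size cells
    and publish only per-cell counts, never individual points. When every
    occupied cell holds many points, no released count pins down a single
    record, yet density-based macroscopic properties survive."""
    cells = np.floor(np.asarray(rdb) / cell_width).astype(int)
    keys, counts = np.unique(cells, axis=0, return_counts=True)
    return {tuple(k): int(c) for k, c in zip(keys, counts)}
```

Relating the perturbation-based sanitizer to a safe release of this form is the proof strategy the slide describes.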
Future directions
Extend the privacy argument to other “nice” distributions
For what distributions is there no meaningful privacy-utility trade-off?
Characterize acceptable auxiliary information
Think of auxiliary information as an a priori distribution
The low-dimensional case – is it inherently impossible?
Discrete-valued attributes: our proofs require a “spread” in all attributes
Extend the utility argument to other interesting macroscopic properties – e.g., correlations
Conclusions
Our work so far:
A first step towards understanding the privacy-utility trade-off
A general and rigorous definition of privacy
A work in progress!
How does this compare to other frameworks, e.g., query-based approaches?
Query-based approaches: directly identify good and bad functions
Our approach: summarize “good” functions by a “sanitized database”
Questions?