Toward Privacy in Public Databases
Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith,
Larry Stockmeyer, Hoeteck Wee
Work Done at Microsoft Research
Database Privacy
Think “Census”: individuals provide information, and the Census Bureau publishes sanitized records.
Privacy is legally mandated; what utility can we achieve?
There is an inherent privacy vs. utility trade-off:
- One extreme: complete privacy, no information
- Other extreme: complete information, no privacy
Goals:
- Find a middle path: preserve macroscopic properties; “disguise” individual identifying information
- Change the nature of discourse: establish a framework for meaningful comparison of techniques
Current solutions
Statistical approaches:
- Alter the frequency (PRAM/DS/PERT) of particular features, while preserving means
- Additionally, erase values that reveal too much
Query-based approaches:
- Disallow queries that reveal too much
- Output perturbation (add noise to the true answer)
Both are unsatisfying:
- Ad hoc definitions of privacy breach
- Erasure can disclose information
- Noise can cancel (although, see the work of Nissim et al.)
- Combinations of several seemingly innocuous queries could reveal information; refusal to answer can itself be revelatory
Everybody’s First Suggestion
Learn the distribution, then output:
- A description of the distribution, or
- Samples from the learned distribution
But we want to reflect facts on the ground: statistically insignificant clusters can be important for allocating resources.
Our Approach
Crypto-flavored definitions:
- Mathematical characterization of the adversary’s goal
- Precise definition of when a sanitization procedure fails
- Intuition: seeing the sanitized DB gives the adversary an “advantage”
Statistical techniques:
- Perturbation of attribute values
- Differs from previous work: perturbation amounts depend on local densities of points
A highly abstracted version of the problem:
- If we can’t understand this, we can’t understand real life (and we can’t…)
- If we get negative results here, the world is in trouble.
What do WE mean by privacy?
[Ruth Gavison] Protection from being brought to the attention of others:
- privacy is inherently valuable
- attention invites further privacy loss
Privacy is assured to the extent that one blends in with the crowd.
An appealing definition, and one that can be converted into a precise mathematical statement…
A geometric view
Abstraction:
- The database consists of points in high-dimensional space R^d, independent samples from some underlying distribution
- Points are unlabeled: you are your collection of attributes
- Distance is everything: points are similar if and only if they are close (L2 norm)
Real Database (RDB), private: n unlabeled points in d-dimensional space.
Sanitized Database (SDB), public: n′ new points, possibly in a different space.
The adversary, or Isolator – intuition
On input SDB and auxiliary information, the adversary outputs a point q ∈ R^d.
- q “isolates” a real DB point x if it is much closer to x than to x’s near neighbors.
- q fails to isolate x if q looks roughly as much like everyone in x’s neighborhood as it looks like x itself.
Tightly clustered points have a smaller radius of isolation.
Isolation – the definition
The isolator outputs q = I(SDB, aux). Let δ = |q − x|. x is isolated if B(q, cδ) contains fewer than T other points from the RDB.
T-radius of x: distance to its T-th nearest neighbor.
x is “safe” if δ_x > (T-radius of x)/(c − 1), for then B(q, cδ_x) contains x’s entire T-neighborhood.
c is a privacy parameter; e.g., c = 4.
Why safety holds: if |x − p| < T-rad_x < (c − 1)δ_x, then |q − p| ≤ |q − x| + |x − p| < δ_x + T-rad_x < cδ_x.
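The definition above can be sketched as a small check. This is a minimal sketch; `t_radius` and `isolates` are illustrative names, not from the paper, and the RDB is represented as a NumPy array of points:

```python
import numpy as np

def t_radius(rdb, i, T):
    """Distance from rdb[i] to its T-th nearest neighbor in the RDB."""
    d = np.sort(np.linalg.norm(rdb - rdb[i], axis=1))
    return d[T]  # d[0] == 0 is the point itself

def isolates(q, rdb, i, c, T):
    """Check whether query point q c-isolates x = rdb[i]: with
    delta = |q - x|, x is isolated if the ball B(q, c*delta)
    contains fewer than T RDB points other than x."""
    delta = np.linalg.norm(q - rdb[i])
    dists = np.delete(np.linalg.norm(rdb - q, axis=1), i)
    return np.sum(dists < c * delta) < T
```

A query aimed at an outlier isolates it (its cδ ball catches no neighbors), while a query aimed at a point in a tight cluster does not.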
Requirements for the sanitizer
There is no way of obtaining privacy if AUX already reveals too much!
The sanitization procedure compromises privacy if giving the adversary access to the SDB considerably increases its probability of success.
The definition of “considerably” can be forgiving, say n^−2.
Made rigorous by quantification over adversaries, distributions, auxiliary information, sanitizations, and samples:
∀ I ∃ I′ such that w.o.p., ∀ D, ∀ aux z, ∀ x ∈ D: |Pr[I(SDB, z) isolates x] − Pr[I′(z) isolates x]| is small/n.
This provides a framework for describing the power of a sanitization method, and hence for comparisons.
The Sanitizer
The privacy of x is linked to its T-radius: randomly perturb x in proportion to its T-radius.
x′ = San(x) ∈_R B(x, T-rad(x))
Intuition: we are blending x in with its crowd. We are adding to x random noise with mean zero, so several macroscopic properties should be preserved.
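The perturbation step can be sketched as follows. This is an illustrative reading of San: sample x′ uniformly from the ball of radius T-rad(x) around x (the function names and the uniform-in-ball sampling details are assumptions):

```python
import numpy as np

def sanitize(rdb, T, rng=None):
    """Perturb each point uniformly within the ball of its T-radius:
    x' = San(x), sampled from B(x, T-rad(x)). A sketch only."""
    rng = np.random.default_rng(rng)
    n, d = rdb.shape
    out = np.empty_like(rdb, dtype=float)
    for i, x in enumerate(rdb):
        t_rad = np.sort(np.linalg.norm(rdb - x, axis=1))[T]
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                  # uniform direction on the sphere
        r = t_rad * rng.random() ** (1.0 / d)   # uniform radius within the ball
        out[i] = x + r * u
    return out
```

Points in dense regions get small perturbations, points in sparse regions large ones, which is exactly the density-dependent behavior the approach calls for.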
Flavor of Results (Preliminary)
Assumptions: data arises from a mixture of Gaussians; the dimension d and the number of points n are large; d = Ω(log n).
Results:
- Privacy: An adversary who knows the Gaussians and some auxiliary information cannot isolate any point with probability more than 2^−Ω(d) (several special cases; the general result is not yet proved). Very different proof techniques from anything in the statistics or crypto literatures!
- Utility: A user who does not know the Gaussians can compute the means with high probability.
The “simplest” interesting case
Two points, x and y, generated uniformly from the surface of a ball B(o, r).
The adversary knows x′, y′, and α = |x − y|.
We prove there are 2^Ω(d) “decoy” pairs (x_i, y_i) such that |x_i − y_i| = α and Pr[x_i, y_i | x′, y′] = Pr[x, y | x′, y′].
Furthermore, the adversary can only isolate one point x_i or y_i at a time: the decoys are pairwise “far apart”.
The proof is based on symmetry arguments and coding theory. High dimensionality is crucial.
Finding Decoy Pairs
Consider a hyperplane H through x′, y′, and o.
x_H, y_H: mirror reflections of x, y through H.
Note: reflections preserve distances! The world of (x_H, y_H) looks identical to the world of (x, y):
Pr[x_H, y_H | x′, y′] = Pr[x, y | x′, y′]
[Figure: x, y and their reflections x_H, y_H across the hyperplane H]
Lots of choices for H
x_H, y_H: reflections of x, y through H = H(x′, y′, o).
Note: reflections preserve distances! The world of (x_H, y_H) looks identical to the world of (x, y).
How many different H are there such that the corresponding x_H are pairwise distant (and distant from x)?
Hyperplanes whose normals differ by an angle θ send x to reflections at distance 2r sin θ apart; picking θ = 30° suffices to make the decoys pairwise distant (> (2/3)r, as in the figure).
Fact: there are 2^Ω(d) vectors in d dimensions at angle 60° from each other.
So the probability that the adversary wins is ≤ 2^−Ω(d).
[Figure: x and decoys x_1, x_2 on the sphere, pairwise distance > (2/3)r]
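The distance-preservation claim behind the decoy construction is easy to check numerically. A minimal sketch (illustrative variable names; `xs`, `ys` stand in for the sanitized points x′, y′): reflect across a hyperplane through the origin whose normal is orthogonal to both x′ and y′, so that x′, y′, and o all lie on H.

```python
import numpy as np

def reflect(p, u):
    """Reflect p across the hyperplane through the origin with unit normal u."""
    return p - 2.0 * np.dot(p, u) * u

rng = np.random.default_rng(0)
d = 20
x, y = rng.normal(size=d), rng.normal(size=d)
xs, ys = rng.normal(size=d), rng.normal(size=d)  # stand-ins for x', y'

# Unit normal orthogonal to both x' and y', so x', y', o all lie on H.
B = np.linalg.qr(np.column_stack([xs, ys]))[0]   # orthonormal basis of span(x', y')
u = rng.normal(size=d)
u -= B @ (B.T @ u)                               # remove components along x', y'
u /= np.linalg.norm(u)

xH, yH = reflect(x, u), reflect(y, u)
```

Since x′ and y′ are fixed by the reflection, the reflected pair (x_H, y_H) has exactly the same distances to x′ and y′ (and to each other) as (x, y) does, which is why the conditional probabilities match.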
Towards the general case… n points
The adversary is given n − 1 real points x_2, …, x_n and one sanitized point x′_1.
Symmetry does not work: too many constraints.
A more direct argument. Let
Z = { p ∈ R^d | p is a legal pre-image for x′_1 }
Q = { p | if x_1 = p then x_1 is isolated by q }
Show that Pr[x_1 ∈ Q∩Z | x′_1] ≤ 2^−Ω(d):
Pr[x_1 ∈ Q∩Z | x′_1] = (prob. mass contribution from Q∩Z) / (contribution from Z) ≤ 2^(1−d) / (1/4).
Why does Q∩Z contribute so little mass?
Z = { p | p is a legal pre-image for x′_1 }
Q = { p | if x_1 = p then x_1 is isolated by q }
Key observation: as |q − x′_1| increases, Q becomes larger. But a larger distance from x′_1 implies smaller probability mass, as x_1 is randomized over a larger area.
Here T = 1 and we perturb to the 1-radius, so |x′_1 − x_1| = 1-rad(x_1).
[Figure: q, x′_1, real points x_2, …, x_6, and the regions Z, Q, and Q∩Z]
The general case… n sanitized points
The initial intuition is wrong: privacy of x_1 given x′_1 and all the other points in the clear does not imply privacy of x_1 given x′_1 and sanitizations of the others!
Sanitization of the other points reveals information about x_1.
Digression: Histogram Sanitization
U = d-dimensional cube of side 2. Cut it into 2^d subcubes, splitting along each axis; each subcube has side 1.
For each subcube: if the number of RDB points in it is > 2T, then recurse.
Output: a list of cells and counts.
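The recursive subdivision can be sketched as follows (a minimal sketch with half-open cells and illustrative names; the real construction is over the cube U of side 2):

```python
import numpy as np

def histogram_sanitize(points, lo, hi, T):
    """Recursive histogram sanitization (a sketch).

    If a cell [lo, hi) holds more than 2T points, split it along every
    axis into 2^d subcubes and recurse; otherwise emit (cell, count)."""
    d = len(lo)
    if len(points) <= 2 * T:
        return [((tuple(lo), tuple(hi)), len(points))]
    mid = (lo + hi) / 2.0
    cells = []
    for corner in range(2 ** d):  # enumerate the 2^d subcubes
        bits = np.array([(corner >> j) & 1 for j in range(d)], dtype=bool)
        clo, chi = np.where(bits, mid, lo), np.where(bits, hi, mid)
        mask = np.all((points >= clo) & (points < chi), axis=1)
        cells += histogram_sanitize(points[mask], clo, chi, T)
    return cells
```

The output is the promised list of cells and counts; each point lands in exactly one leaf cell, and no leaf holds more than 2T points.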
Digression: Histogram Sanitization
Theorem: If n = 2^o(d) and points are drawn uniformly from U, then histogram sanitizations are safe with respect to 8-isolation: Pr[I(SDB) succeeds] ≤ 2^−Ω(d).
Rough intuition: for q ∈ C, the expected distance to any x ∈ C is relatively large (and even larger for x in a different cell C′), and distances are tightly concentrated. Increasing the radius by a factor of 8 captures almost all of the parent cell, which contains at least 2T points.
Combining the Two Sanitizations
Partition the RDB into two sets, A and B.
Cross-training:
- Compute the histogram sanitization for B
- For v ∈ A: σ_v = side length of the cell C containing v
- Output GSan(v, σ_v)
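A sketch of the cross-training step, assuming GSan(v, σ_v) adds Gaussian noise with standard deviation proportional to the cell side σ_v (that reading of GSan, and all names here, are assumptions):

```python
import numpy as np

def cell_side(v, B, T, top=2.0, min_side=2.0 ** -20):
    """Side length of the histogram cell (built from the B half of the
    data) that contains v: halve the side while the cell still holds
    more than 2T points of B."""
    side, lo = top, np.zeros_like(v)
    while side > min_side:
        in_cell = np.all((B >= lo) & (B < lo + side), axis=1)
        if in_cell.sum() <= 2 * T:
            break
        side /= 2.0
        lo = lo + side * np.floor((v - lo) / side)  # subcell containing v
    return side

def cross_train(rdb, T, rng=None):
    """Cross-training sketch: half the data defines the cells, the other
    half is released with Gaussian noise scaled to its cell side."""
    rng = np.random.default_rng(rng)
    half = len(rdb) // 2
    A, B = rdb[:half], rdb[half:]
    return np.array([v + cell_side(v, B, T) * rng.normal(size=v.shape)
                     for v in A])
```

Since v's own half of the data never influences the cell structure, only histogram information about B is used, matching the privacy argument on the next slide.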
Cross-Training Privacy
Privacy for B: only histogram information about B is used.
Privacy for A: there is enough variance in enough coordinates of v, even given the cell C containing v and the sanitization v′ of v.
Results on privacy… the special cases

Distribution                                      | Num. of points | Revealed to adversary                      | Auxiliary information
Uniform on surface of sphere                      | 2              | Both sanitized points                      | Distribution, 1-radius
Uniform over a bounding box or surface of sphere  | n              | One sanitized point, all other real points | Distribution
Uniform over a hypercube                          | 2^Ω(d)         | n/2 sanitized points                       | Distribution
Gaussian                                          | 2^o(d)         | n sanitized points                         | Distribution
Learning mixtures of Gaussians – spectral techniques
Observation: the optimal low-rank approximation to a matrix of complex data yields the underlying structure, e.g., the means [M01, VW02].
We show that McSherry’s algorithm works for clustering sanitized Gaussian data: the original distribution (a mixture of Gaussians) is recovered.
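The low-rank observation can be illustrated on synthetic data. This is not McSherry's algorithm itself, just a minimal rank-1 SVD split of a two-Gaussian mixture (all parameters here are made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200
mu0, mu1 = np.zeros(d), np.full(d, 1.0)
X = np.vstack([rng.normal(mu0, 1.0, size=(n, d)),
               rng.normal(mu1, 1.0, size=(n, d))])

# Optimal rank-1 approximation of the centered data: the top right
# singular vector points (roughly) along the difference of the means.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[0]                # projection onto the top singular direction
labels = (scores > 0).astype(int)  # split the mixture at zero

# Recover the two means from the spectral split.
m0 = X[labels == 0].mean(axis=0)
m1 = X[labels == 1].mean(axis=0)
```

In high dimension the means are far apart (‖μ1 − μ0‖ grows like √d) while each coordinate's noise stays O(1), so the one-dimensional projection separates the clusters almost perfectly.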
Spectral techniques for perturbed data
A sanitized point is the sum of two Gaussian variables: sample + noise.
W.h.p. the T-radius of a point is less than the “radius” of its Gaussian, so the variance of the noise is small.
Hence the previous techniques work.
Results on utility… an overview

Setting        | Objective                                   | Assumptions            | Result
Worst-case     | Find k clusters minimizing largest diameter | –                      | Diameter increases by a factor of 3
Distributional | Find k maximum-likelihood clusters          | Mixture of k Gaussians | Correct clustering with high probability, as long as the means are pairwise sufficiently far
What about the real world? Lessons from the abstract model
- High dimensionality is our friend
- Gaussian perturbations seem to be the right thing to do
- Need to scale different attributes appropriately, so that the data is well-rounded
Moving towards real data:
- Outliers: our notion of c-isolation deals with them, but the existence of an outlier may be disclosed
- Discrete attributes: convert them into real-valued attributes, e.g., convert a binary variable into a probability
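One possible reading of the binary-to-probability conversion, sketched below; the function name, the smoothing constant, and the noise scale are all assumptions, not from the talk:

```python
import numpy as np

def binary_to_real(col, eps=0.1, rng=None):
    """Embed a 0/1 attribute into the reals as eps / 1-eps, then add a
    little noise, so that real-valued perturbation machinery applies."""
    rng = np.random.default_rng(rng)
    p = np.where(col == 1, 1.0 - eps, eps)
    return np.clip(p + rng.normal(0, eps / 2, size=col.shape), 0.0, 1.0)
```

Thresholding at 0.5 recovers the original bits with overwhelming probability, while the attribute now lives on a continuous scale like the other coordinates.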