Differentially Private Bayesian Linear Regression
TRANSCRIPT
Problem Statement
‣ Privacy: 𝒜 is differentially private
‣ Calibration: 𝒫 can obtain or approximate p(𝜽, 𝜎² | 𝒛)
‣ Utility: p(𝜽, 𝜎² | 𝒛) is near to p(𝜽, 𝜎² | 𝑋, 𝒚)
Utility
‣ Maximum Mean Discrepancy (MMD) to the non-private posterior as the measure of utility
‣ The naïve method only matches the noise-aware methods in easier settings

Sufficient Statistics Update
‣ Sufficient statistics 𝒔 = ∑ᵢ₌₁ⁿ 𝑡(𝒙ᵢ) are a sum over individuals
‣ Approximate with the CLT as 𝒔 ∼ 𝒩(𝑛𝝁ₜ, 𝑛𝚺ₜ); this is easy to check numerically (see the sketch below)
‣ 𝝁ₜ and 𝚺ₜ are derived in terms of the first four moments of 𝒙
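The CLT claim can be sanity-checked in a few lines. Below is a minimal sketch, assuming a toy standard-normal population and 𝑡(𝑥) = (𝑥, 𝑥²) rather than the paper's regression statistics: the empirical mean and covariance of 𝒔 over many simulated data sets should match 𝑛𝝁ₜ and 𝑛𝚺ₜ.

```python
# Minimal sketch (not the authors' code) checking s ~ N(n*mu_t, n*Sigma_t)
# for summed sufficient statistics; the population N(0, 1) and the statistic
# t(x) = (x, x^2) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 5000

x = rng.normal(0.0, 1.0, size=(trials, n))      # hypothetical population draws
t = np.stack([x, x**2], axis=-1)                # per-individual statistics t(x_i)
s = t.sum(axis=1)                               # s = sum_i t(x_i), one per trial

mu_t = np.array([0.0, 1.0])                     # E[t(x)] = (E[x], E[x^2]) for N(0, 1)
Sigma_t = np.array([[1.0, 0.0], [0.0, 2.0]])    # Cov[t(x)]: E[x^3] = 0, Var[x^2] = 2
print(s.mean(axis=0), n * mu_t)                 # empirical vs. CLT mean
print(np.cov(s.T), n * Sigma_t, sep="\n")       # empirical vs. CLT covariance
```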
Run time
‣ MCMC-Ind scales with population size
‣ Other methods stay constant
Differentially Private Bayesian Linear Regression
Garrett Bernstein¹, Daniel Sheldon¹,²
¹University of Massachusetts Amherst; ²Mount Holyoke College
Results
Background
ϵ-Differential Privacy
‣ Pr[𝒜(𝑿) ∈ 𝑆] ≤ exp(𝜖) · Pr[𝒜(𝑿′) ∈ 𝑆] for any output set 𝑆
‣ Probability of any output is nearly unchanged by a single individual opting into the data
Bayesian Linear Regression
‣ Normal-inverse-gamma (NIG) is the conjugate prior
‣ Leads to a closed-form posterior 𝑝(𝜽, 𝜎² | 𝑋, 𝒚)
‣ Sufficient statistics: 𝒔 = ∑ᵢ₌₁ⁿ 𝑡(𝒙ᵢ, 𝑦ᵢ) = (𝑋ᵀ𝑋, 𝑋ᵀ𝒚, 𝒚ᵀ𝒚)
Laplace Mechanism
‣ Release noisy sufficient statistics: 𝒛 = 𝒔 + Laplace(Δ/𝜖) (a sketch follows)
‣ Δ = sensitivity: the impact on a function's output due to removal/addition of one individual's data
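For concreteness, a hedged sketch of the release step. The `sensitivity` value is an assumption: the stated bound holds only if every 𝑥ᵢ and 𝑦ᵢ is clipped to a known range such as [0, 1], so that one record moves each entry of 𝒔 by at most 1.

```python
# Sketch of the Laplace mechanism applied to the regression sufficient
# statistics; `sensitivity` is an assumed L1 bound, not derived here.
import numpy as np

def release_noisy_stats(X, y, epsilon, sensitivity, rng):
    """Return z = s + Laplace(Delta/epsilon) with s = (X^T X, X^T y, y^T y) flattened."""
    s = np.concatenate([(X.T @ X).ravel(), X.T @ y, [y @ y]])
    return s + rng.laplace(scale=sensitivity / epsilon, size=s.shape)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))   # toy data already in [0, 1]
y = rng.uniform(0, 1, size=100)
# With d = 2 and clipped data, s has d^2 + d + 1 = 7 entries, each changing
# by at most 1 when one record is added or removed.
z = release_noisy_stats(X, y, epsilon=0.1, sensitivity=7.0, rng=rng)
```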
Supported by National Science Foundation Grant Nos. 1522054 and 1617533.
www.GarrettBernstein.com | https://github.com/gbernstein6/private_bayesian_regression
‣ Given individuals' sensitive covariate data 𝑋 and response data 𝒚 ∼ Normal(𝜽ᵀ𝑋, 𝜎²), and prior 𝑝(𝜽, 𝜎²)…
‣ …privately release 𝒛 = 𝒜(𝑋, 𝒚)…
‣ …then calculate the posterior p(𝜽, 𝜎² | 𝒛) via 𝒫 with…
Private Bayesian Inference
Challenges + Solutions
‣ Only noisy sufficient statistics observed → use a Gibbs sampler
‣ Integration over all possible individuals is intractable → CLT approximation
‣ Not easy to sample from the product of a normal and a Laplace → scale mixture of normals
‣ Need to make assumptions about the data → only need the first four moments
Sufficient Statistics-based Gibbs sampler
Want: p(𝜽, 𝜎² | 𝒛)

Private Model
Individual-based MCMC
‣ Set explicit data prior

Notation: 𝒜 is a randomized algorithm; 𝑿, 𝑿′ are neighboring data sets.
Calibration
‣ Given true parameters 𝜽, 𝜎², let 𝐹𝒛 be the CDF of the true posterior 𝑝(𝜽, 𝜎² | 𝒛)
‣ Using the fact that 𝐹𝒛(𝜽, 𝜎²) is uniformly distributed, test the correctness of an approximate posterior by testing the uniformity of its CDF values, 𝐹̂𝒛(𝜽, 𝜎²), via a Kolmogorov-Smirnov test (see the sketch below)
‣ Noise-aware methods are nearly as well-calibrated as the non-private method
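A minimal sketch of that calibration test, with placeholder values: across simulated worlds, the true posterior's CDF evaluated at the true parameter is Uniform(0, 1), which `scipy.stats.kstest` can check; an overconfident posterior piles its CDF values near 0 and 1 and fails the test.

```python
# KS-based calibration check; the posterior widths below are placeholders.
import numpy as np
from scipy import stats

def calibration_pvalue(cdf_values):
    """KS test of posterior-CDF values against Uniform(0, 1)."""
    return stats.kstest(cdf_values, "uniform").pvalue

rng = np.random.default_rng(0)
# Ideal case: CDF values of a correctly calibrated posterior are uniform.
well_calibrated = rng.uniform(size=500)
# Overconfident case: true theta ~ N(0, 1) evaluated under a too-narrow N(0, 0.25).
overconfident = stats.norm.cdf(rng.normal(0, 1, 500), scale=0.5)
print(calibration_pvalue(well_calibrated))   # large p-value: uniformity not rejected
print(calibration_pvalue(overconfident))     # tiny p-value: miscalibration detected
```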
Parameter Update
‣ Conjugate update uses latent 𝒔 (sketched below):
𝜽, 𝜎² ∼ p(𝜽, 𝜎²; 𝜆′); 𝜆′ = Conjugate-Update(𝜆, 𝒔, 𝑛)
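A hedged sketch of what Conjugate-Update computes, using the standard normal-inverse-gamma update for linear regression; the function name and signature are ours. The key point is that it touches the data only through 𝒔 = (𝑋ᵀ𝑋, 𝑋ᵀ𝒚, 𝒚ᵀ𝒚) and 𝑛, so the latent 𝒔 sampled inside the Gibbs sweep can stand in for the raw data.

```python
# Standard NIG conjugate update written to consume only sufficient statistics;
# a sketch of Conjugate-Update(lambda, s, n), not the authors' code.
import numpy as np

def conjugate_update(mu0, Lam0, a0, b0, XtX, Xty, yty, n):
    Lam_n = Lam0 + XtX                               # posterior precision
    mu_n = np.linalg.solve(Lam_n, Lam0 @ mu0 + Xty)  # posterior mean
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * (yty + mu0 @ Lam0 @ mu0 - mu_n @ Lam_n @ mu_n)
    return mu_n, Lam_n, a_n, b_n
```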
Noise Update
‣ Represent the Laplace noise model as a scale mixture of normals for easier sampling with the CLT normal
Algorithm 1 Gibbs Sampler
1: Initialize 𝜽, 𝜎², 𝝎²
2: repeat
3:   Calculate 𝝁ₜ and 𝚺ₜ via Eqs. (2) and (3)
4:   𝒔 ∼ NormProduct(𝑛𝝁ₜ, 𝑛𝚺ₜ, 𝒛, diag(𝝎²))
5:   𝜽, 𝜎² ∼ NIG(𝜽, 𝜎²; 𝝁ₙ, 𝚲ₙ, 𝑎ₙ, 𝑏ₙ) via Eqn. (1)
6:   1/𝜔ⱼ² ∼ InverseGaussian(𝜖/(Δₛ|𝑧ⱼ − 𝑠ⱼ|), 𝜖²/Δₛ²) for all 𝑗

Subroutine NormProduct
1: input: 𝝁₁, 𝚺₁, 𝝁₂, 𝚺₂
2: 𝚺₃ = (𝚺₁⁻¹ + 𝚺₂⁻¹)⁻¹
3: 𝝁₃ = 𝚺₃(𝚺₁⁻¹𝝁₁ + 𝚺₂⁻¹𝝁₂)
4: return: 𝒩(𝝁₃, 𝚺₃)
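Subroutine NormProduct translates directly into a few lines of linear algebra. A sketch, assuming well-conditioned covariances so plain inverses suffice:

```python
# Direct transcription of Subroutine NormProduct: the product of two
# multivariate normal densities is, up to normalization, again normal.
import numpy as np

def norm_product(mu1, Sigma1, mu2, Sigma2):
    P1, P2 = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)  # precision matrices
    Sigma3 = np.linalg.inv(P1 + P2)
    mu3 = Sigma3 @ (P1 @ mu1 + P2 @ mu2)
    return mu3, Sigma3

def sample_norm_product(mu1, Sigma1, mu2, Sigma2, rng):
    """One draw of s as in line 4 of Algorithm 1."""
    mu3, Sigma3 = norm_product(mu1, Sigma1, mu2, Sigma2)
    return rng.multivariate_normal(mu3, Sigma3)
```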
$\eta_{ijkl}$, and $\xi_{ij,kl}$. The current parameter values are available within the sampler, but the modeler must provide estimates for the moments of $x$, either using prior knowledge or by (privately) estimating the moments from the data. We discuss three specific possibilities in Section 3.4.4.
$$\begin{aligned}
\mathbb{E}[x_i y] &= \sum_j \theta_j \eta_{ij} \\
\mathbb{E}[y^2] &= \sigma^2 + \sum_{i,j} \theta_i \theta_j \eta_{ij} \\
\operatorname{Cov}(x_i x_j, x_k y) &= \sum_l \theta_l \xi_{ij,kl} \\
\operatorname{Cov}(x_i x_j, y^2) &= \sum_{k,l} \theta_k \theta_l \xi_{ij,kl} \\
\operatorname{Cov}(x_i y, x_j y) &= \sigma^2 \eta_{ij} + \sum_{k,l} \theta_k \theta_l \xi_{ik,jl} \\
\operatorname{Cov}(x_i y, y^2) &= \sum_{j,k,l} \theta_j \theta_k \theta_l \xi_{ij,kl} + 2\sigma^2 \sum_j \theta_j \eta_{ij} \\
\operatorname{Var}(y^2) &= 2\sigma^4 + \sum_{i,j,k,l} \theta_i \theta_j \theta_k \theta_l \xi_{ij,kl} + 4\sigma^2 \sum_{i,j} \theta_i \theta_j \eta_{ij}
\end{aligned}$$
Once again, more modeling assumptions are needed than in the non-private case, where it is possible to condition on $x$. Gibbs-SS requires milder assumptions (second and fourth moments), however, than MCMC-Ind (a full prior distribution).
3.4.2 Variable augmentation for $p(z \mid s)$
The above approximation for the distribution over sufficient statistics means the full conditional distribution involves the product of a normal and a Laplace distribution,

$$p(s \mid \theta, z) \propto \mathcal{N}(s; n\mu_t, n\Sigma_t) \cdot \operatorname{Lap}(z; s, \Delta_s/\epsilon).$$
It is unclear how to sample from this distribution directly. A similar situation arises in the Bayesian Lasso, where it is solved by variable augmentation [25]. Bernstein and Sheldon [5] adapted the variable augmentation scheme to private inference in exponential family models. We take the same approach here, and represent a Laplace random variable as a scale mixture of normals. Specifically, $l \sim \operatorname{Lap}(u, b)$ is identically distributed to $l \sim \mathcal{N}(u, \omega^2)$ where the variance $\omega^2 \sim \operatorname{Exp}\big(1/(2b^2)\big)$ is drawn from the exponential distribution (with density $1/(2b^2) \exp\big({-\omega^2/(2b^2)}\big)$). We augment separately for each component of the vector $z$ so that $z \sim \mathcal{N}\big(s, \operatorname{diag}(\omega^2)\big)$, where $\omega_j^2 \sim \operatorname{Exp}\big(\epsilon^2/(2\Delta_s^2)\big)$. The augmented full conditional $p(s \mid \theta, z, \omega)$ is a product of two multivariate normal distributions, which is itself a multivariate normal distribution.
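The scale-mixture identity is easy to verify empirically. A minimal sketch follows; note that numpy parameterizes the exponential by its mean, $2b^2$, rather than the rate $1/(2b^2)$.

```python
# Empirical check that Lap(u, b) matches N(u, w^2) with w^2 ~ Exp(mean 2*b^2).
import numpy as np

rng = np.random.default_rng(0)
u, b, m = 0.0, 2.0, 200_000

direct = rng.laplace(u, b, size=m)             # draw Laplace directly
w2 = rng.exponential(scale=2 * b**2, size=m)   # variance mixing draws
mixture = rng.normal(u, np.sqrt(w2))           # normal with random variance

print(direct.var(), mixture.var())             # both near 2*b^2 = 8
print(np.quantile(direct, [0.1, 0.5, 0.9]))
print(np.quantile(mixture, [0.1, 0.5, 0.9]))   # quantiles should agree closely
```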
3.4.3 The Gibbs sampler
$$\begin{aligned}
\theta, \sigma^2 &\sim \operatorname{NIG}(\theta, \sigma^2; \mu_0, \Lambda_0, a_0, b_0) \\
s &\sim \mathcal{N}(n\mu_t, n\Sigma_t) \\
\omega_j^2 &\sim \operatorname{Exp}\!\big(\epsilon^2/(2\Delta_s^2)\big) \quad \text{for all } j \\
z &\sim \mathcal{N}\big(s, \operatorname{diag}(\omega^2)\big)
\end{aligned}$$
The full generative process is shown to the right, and the corresponding Gibbs sampler is shown in Algorithm 1. The update for $\omega^2$ follows Park and Casella [25]; the inverse Gaussian density is $\operatorname{InverseGaussian}(w; m, v) = \sqrt{v/(2\pi w^3)} \exp\big({-v(w - m)^2/(2m^2 w)}\big)$. Note that the resulting $s$ drawn from $p(s \mid \mu_t, \Sigma_t, \omega^2)$ may require projection onto the space of valid sufficient statistics. This can be done by observing that if $A = [X, y]$ then the sufficient statistics are contained in the positive-semidefinite (PSD) matrix $B = A^T A$. For a randomly drawn $s$, we project if necessary so the corresponding $B$ matrix is PSD.
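A minimal sketch of one way to carry out that projection; the paper does not spell out the exact method, so clipping negative eigenvalues of the reconstructed $B$ is an assumption here.

```python
# Project a candidate statistic matrix B (built from a sampled s) onto the
# PSD cone by zeroing negative eigenvalues; an illustrative projection choice.
import numpy as np

def project_psd(B):
    B = (B + B.T) / 2.0                    # enforce symmetry first
    eigval, eigvec = np.linalg.eigh(B)
    if eigval.min() >= 0:
        return B                           # already a valid statistic matrix
    return eigvec @ np.diag(np.clip(eigval, 0.0, None)) @ eigvec.T
```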
Figure 3: Synthetic data results: (a) calibration vs. n for ε = 0.1; (b) calibration vs. ε for n = 10; (c) QQ plot for n = 10 and ε = 0.1; (d) 95% credible interval coverage; (e) MMD of methods to non-private posterior; (f) method runtimes for ε = 0.1.
to approximations in the calculation of multivariate normal distribution fourth moments from a data prior. Utility results are shown in Figure 3e; the noise-aware methods provide at least as good utility as Naive. Run time results are shown in Figure 3f; MCMC-Ind scales with increasing population size while the Gibbs-SS methods, Naive, and Non-Private remain constant. Accordingly, we do not include results for MCMC-Ind for n = 1000 as its run time is prohibitive in those settings.
4.3 Predictive posteriors on real data
Figure 4: Coverage for predictive posterior 50% and 90% credible intervals.
We evaluate the predictive posteriors of the methods on a real-world data set measuring the effect of drinking rate on cirrhosis rate.⁴ We scale both covariate and response data to $[0, 1]$. There are 46 total points, which we randomly split into 36 training examples and 10 test points for each trial. After preliminary exploration to gain domain knowledge, we set a reasonable model prior of $\theta, \sigma^2 \sim \operatorname{NIG}\big([1, 0], \operatorname{diag}([.25, .25]), 20, .5\big)$. We draw samples $\theta^{(k)}, \sigma^{2(k)}$ from the posterior given training data, and then form the posterior predictive distribution for each test point $y_i$ from these samples. Figure 4 shows coverage of 50% and 90% credible intervals on 1000 test points collected over 100 random train-test splits. Non-Private achieves nearly correct coverage, with the discrepancy due to the fact that the data is not actually drawn from the prior. Gibbs-SS-Noisy achieves nearly the coverage of Non-Private, while Naive is drastically worse in this regime. We note that this experiment emphasizes the advantage of Gibbs-SS-Noisy not needing an explicitly defined data prior, as it only requires the same parameter prior that is needed in non-private analysis.
⁴ http://people.sc.fsu.edu/~jburkardt/datasets/regression/x20.txt
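For concreteness, a hedged sketch of the coverage metric behind Figure 4; the array layout and placeholder draws are ours, not the authors'. It counts how often held-out responses land inside central credible intervals formed from posterior predictive samples.

```python
# Credible-interval coverage: pred_samples[k, i] is the k-th posterior
# predictive draw for test point i; all numbers below are placeholders.
import numpy as np

def interval_coverage(pred_samples, y_test, level):
    lo, hi = np.quantile(pred_samples, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return np.mean((y_test >= lo) & (y_test <= hi))

rng = np.random.default_rng(0)
pred_samples = rng.normal(0.5, 0.1, size=(2000, 10))   # stand-in predictive draws
y_test = rng.normal(0.5, 0.1, size=10)                 # stand-in held-out responses
print(interval_coverage(pred_samples, y_test, 0.50))   # should be near 0.5
print(interval_coverage(pred_samples, y_test, 0.90))   # should be near 0.9
```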
Empirical CDFs
‣ Empirical CDF plots show noise-aware methods are well-calibrated, but the naïve method is overconfident
[Diagram: parameters (𝜽, 𝜎²) and covariate data 𝑋 generate response data 𝒚; together these yield the sufficient statistics 𝒔 = (𝑋ᵀ𝑋, 𝑋ᵀ𝒚, 𝒚ᵀ𝒚), which are released as the noisy sufficient statistics 𝒛.]

$$\begin{aligned}
\theta, \sigma^2 &\sim p(\theta, \sigma^2) \\
x_{1:n} &\overset{iid}{\sim} \;??? \\
y_{1:n} &\sim \mathcal{N}(\theta^T x_i, \sigma^2) \\
s &= \sum_{i=1}^n t(x_i, y_i) \\
z &\sim s + \operatorname{Laplace}(\Delta/\epsilon)
\end{aligned}$$

need to make assumptions!
‣ Implement in PyMC3
‣ Instantiate individuals
First four moments of 𝒙
‣ Gibbs-SS-Noisy: release moments from sample data
‣ Gibbs-SS-Prior: sample a population via the data prior, then calculate moments
‣ Gibbs-SS-Update: hierarchical normal prior with updated parameters
CLT approximation + scale mixture of normals:

$$\begin{aligned}
\theta, \sigma^2 &\sim p(\theta, \sigma^2) \\
s &\sim \mathcal{N}(n\mu_t, n\Sigma_t) \\
\omega^2 &\sim \operatorname{Exp}\!\big(\epsilon^2/(2\Delta^2)\big) \\
z &\sim \mathcal{N}(s, \omega^2)
\end{aligned}$$