A GENERAL METHODOLOGY FOR MASKING
OUTPUT FROM REMOTE ANALYSIS SYSTEMS
Krish Muralidhar
Christine O’Keefe
Rathindra Sarathy
REMOTE ANALYSIS SYSTEM
O’Keefe and Chipperfield (in press)
[Diagram of the remote analysis system. Labels: Query, Dataset Analysis, Output, Data Transformations, Output for publication]
FOCUS OF THIS PAPER
Responses to statistical queries involving
numerical variables
We explicitly do not consider tabular data release
DATA-BASED CONFIDENTIALIZATION MEASURES
FOR REMOTE ANALYSIS
Input Perturbation and Data Subsetting
Restrictions on Data Transformations
ANALYSIS-BASED CONFIDENTIALIZATION
MEASURES
Refusal to answer risky queries
Output checking
OUTPUT CONFIDENTIALIZATION
Modify output prior to release
EFFECTIVE OUTPUT MASKING
Respond to a diverse set of queries
Meaningful responses to queries
Robust
Control disclosure risk
Automated
OUTPUT MASKING MECHANISMS
Additive Perturbation
Including differential privacy
In our opinion, the applicability of differential privacy for
statistical analyses involving numerical variables is open
to question. We do not consider differential privacy
further
Multiplicative perturbation
A SIMPLE ILLUSTRATION
Query: “What is the variance of a particular
subset of the data (n = 100)?”
True response: 3.81
RESPONSE DISTRIBUTION - ADDITIVE NOISE
But which one?
RESPONSE DISTRIBUTION - MULTIPLICATIVE
But which one?
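The two perturbation mechanisms can be sketched as follows. This is only an illustration, not the authors' implementation: the noise distributions (normal and lognormal) and the scale parameters `noise_sd` and `rel_sd` are assumptions, and having to pick them arbitrarily is exactly the "which one?" problem.

```python
import numpy as np

def additive_mask(true_value, noise_sd, rng):
    """Return true_value + e, where e ~ Normal(0, noise_sd) (assumed noise model)."""
    return true_value + rng.normal(0.0, noise_sd)

def multiplicative_mask(true_value, rel_sd, rng):
    """Return true_value * m, where m is lognormal with median 1 (assumed noise model)."""
    return true_value * rng.lognormal(mean=0.0, sigma=rel_sd)

rng = np.random.default_rng(0)
true_variance = 3.81  # true response from the illustration

# Five masked responses under each mechanism; the scales are arbitrary choices.
additive = [additive_mask(true_variance, 0.5, rng) for _ in range(5)]
multiplicative = [multiplicative_mask(true_variance, 0.15, rng) for _ in range(5)]
```

With additive normal noise and a large enough `noise_sd`, the masked variance can come out negative, which is one reason the choice of noise distribution matters.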
DRAW FROM THE SAMPLING DISTRIBUTION
Use the chi-square distribution to approximate the sampling distribution of the sample variance. Draw the response from this distribution.
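A minimal sketch of this draw, assuming the data are normal so that (n - 1)S²/σ² follows a chi-square distribution with n - 1 degrees of freedom. Plugging the observed sample variance in for the unknown σ² is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def chi_square_masked_variance(sample_variance, n, rng):
    """Draw one value from the plug-in chi-square approximation.

    Under normality, (n - 1) * S^2 / sigma^2 ~ chi-square(n - 1).
    sigma^2 is unknown, so the observed sample variance is plugged in.
    """
    df = n - 1
    return sample_variance * rng.chisquare(df) / df

rng = np.random.default_rng(1)
masked = chi_square_masked_variance(3.81, 100, rng)  # one masked response
```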
ROBUST?
The chi-square approximation is sensitive to the normality assumption and not very robust. The data in this case are heavily skewed.
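One way to see the sensitivity is the standard formula for the variance of the sample variance of n independent observations with finite fourth moment. The symbols are introduced here for illustration: S² is the sample variance, σ² the population variance, and μ₄ the fourth central moment.

```latex
\operatorname{Var}(S^2) = \frac{1}{n}\left(\mu_4 - \frac{n-3}{n-1}\,\sigma^4\right),
\qquad
\mu_4 = 3\sigma^4 \;\Rightarrow\; \operatorname{Var}(S^2) = \frac{2\sigma^4}{n-1}.
```

The normal case on the right matches the chi-square result. For skewed, heavy-tailed data μ₄ can be much larger than 3σ⁴, so the chi-square approximation understates the true spread of the sample variance.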
AN IDEAL MASKING MECHANISM
For any query, select a random sample from the
relevant population (not the database),
compute the value of the statistic, and release
this value
Practically infeasible
ALTERNATIVE MECHANISM
For any query, derive the sampling distribution
of the statistic. Randomly draw a value from
this distribution. Release this value
May be feasible for some simple statistics (like the
sample mean), but as our variance example
illustrates, may not be possible for others
Theoretically infeasible
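For the sample mean, the simple case mentioned above, the sampling distribution can be approximated by a normal distribution via the central limit theorem. The sketch below is illustrative only; the function name and the plug-in use of the sample standard deviation are assumptions.

```python
import numpy as np

def masked_mean_clt(values, rng):
    """Draw one value from Normal(sample mean, s / sqrt(n)).

    Relies on the central limit theorem, with the sample standard
    deviation s plugged in for the unknown population value.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    standard_error = values.std(ddof=1) / np.sqrt(n)
    return rng.normal(values.mean(), standard_error)

rng = np.random.default_rng(4)
query_set = rng.normal(loc=5.0, scale=2.0, size=200)  # synthetic data
masked_mean = masked_mean_clt(query_set, rng)
```

No comparably simple closed form is available for many other statistics, which is the difficulty the variance example illustrates.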
A FEASIBLE APPROACH
Selecting a value from the sampling
distribution of the statistic always provides an
appropriate masked response
Problem – how do we approximate the
sampling distribution of the statistic in a way
that is both accurate and robust?
Solution – THE STATISTICAL BOOTSTRAP
THE STATISTICAL BOOTSTRAP (EFRON 1979)
Draw a bootstrap sample of size n, with replacement, from the original sample also of size n.
Compute the value of the statistic from the bootstrap sample.
Repeat the process of selecting bootstrap samples.
The standard deviation of the values of the statistic from the bootstrap samples provides a good approximation of the standard error of the statistic.
The distribution of $\hat{\theta}^{*} - \hat{\theta}$ provides a good
approximation of the distribution of $\hat{\theta} - \theta$
$\theta$ – Parameter; $\hat{\theta}$ – Statistic; $\hat{\theta}^{*}$ – Bootstrap statistic
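The bootstrap steps above can be sketched in a few lines. This is a minimal illustration on synthetic skewed data, not the data from the talk; the function name, the number of replicates, and the lognormal test data are all assumptions.

```python
import numpy as np

def bootstrap_replicates(sample, statistic, n_boot, rng):
    """Return n_boot values of `statistic`, each computed on a resample
    of size n drawn with replacement from the original sample."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        resample = rng.choice(sample, size=n, replace=True)
        replicates[b] = statistic(resample)
    return replicates

def sample_variance(x):
    return np.var(x, ddof=1)

rng = np.random.default_rng(2)
sample = rng.lognormal(mean=0.0, sigma=0.6, size=100)  # synthetic skewed data

reps = bootstrap_replicates(sample, sample_variance, 2000, rng)
bootstrap_se = reps.std(ddof=1)  # approximates the standard error of the statistic
```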
BACK TO OUR EXAMPLE
APPROPRIATE MASKED RESPONSE
Since the bootstrap distribution of the statistic
closely approximates the sampling distribution
of the statistic, choosing a value randomly from
the bootstrap distribution is a close
approximation of choosing a value randomly
from the true sampling distribution of the
statistic
Close equivalent to drawing an independent sample
from the population
CHOOSING FROM THE BOOTSTRAP
DISTRIBUTION
Only a single realization from the bootstrap
distribution is required
A single realization from the bootstrap
distribution is the result of selecting a single
bootstrap sample
No need to construct the entire bootstrap
distribution!
ACTUAL MASKING PROCEDURE
From the original query set, draw one
bootstrap sample, with replacement, of the
same size as the original set.
Compute the value of the statistic for this
bootstrap sample.
Release the value of this statistic as the
masked response.
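The three steps of the procedure can be sketched as a single function. The name `bootstrap_masked_response` and the synthetic query set are hypothetical, and this is an illustration rather than the authors' code.

```python
import numpy as np

def bootstrap_masked_response(query_set, statistic, rng):
    """Resample the query set once, with replacement, and return the
    statistic computed on that single bootstrap sample."""
    query_set = np.asarray(query_set, dtype=float)
    n = query_set.shape[0]
    idx = rng.integers(0, n, size=n)   # n row indices drawn with replacement
    return statistic(query_set[idx])

rng = np.random.default_rng(5)
query_set = rng.lognormal(mean=0.0, sigma=0.6, size=100)  # synthetic query set

def sample_variance(x):
    return np.var(x, ddof=1)

masked = bootstrap_masked_response(query_set, sample_variance, rng)
```

Resampling row indices rather than values means the same function also works when each record has several variables, provided `statistic` accepts a two-dimensional array.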
CHARACTERISTICS OF THE BOOTSTRAP METHOD
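The distribution of $\hat{\theta}^{*}$ closely approximates the
sampling distribution of $\hat{\theta}$,
If $\hat{\theta}$ is an unbiased estimator, then $E[\hat{\theta}^{*}] = \theta$,
and
Variance of $\hat{\theta}^{*}$ = $\sigma_{\hat{\theta}}^{2}$, the variance of $\hat{\theta}$.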
PERFORMANCE OF THE BOOTSTRAP METHOD
Easy implementation
Usefulness: $\hat{\theta}^{*}$ is a random value chosen from a distribution that closely approximates the
sampling distribution of $\hat{\theta}$
Disclosure risk: Noise addition approximately
equal to the standard error of the statistic $\hat{\theta}$
Robust (no distributional assumptions such as normality)
Easily automated and programmed without the need for ongoing human intervention.
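The disclosure-risk claim can be checked with a small simulation. The sketch below uses the sample mean on synthetic normal data, both of which are assumptions chosen because the standard error has a simple formula, and compares the spread of repeated masked responses with the usual standard error estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
sample = rng.normal(loc=10.0, scale=1.0, size=n)  # synthetic query set

# Each masked response is the mean of one bootstrap resample.
masked = np.array([
    rng.choice(sample, size=n, replace=True).mean()
    for _ in range(2000)
])

noise_sd = masked.std(ddof=1)                    # spread of the masked responses
estimated_se = sample.std(ddof=1) / np.sqrt(n)   # usual standard error of the mean
```

In this setup `noise_sd` and `estimated_se` should be close to each other, and both near the theoretical value 1/√100 = 0.1.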
FUTURE RESEARCH
Tabular data
Multiple imputation using the bootstrap
Compare with Rubin’s Bayesian bootstrap
Relationship between the bootstrap and
smooth sensitivity
QUESTIONS OR COMMENTS?
Thank you