privacy-maxent: integrating background knowledge in privacy quantification wenliang (kevin) du,...
Post on 19-Dec-2015
Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification
Wenliang (Kevin) Du,
Zhouxuan Teng,
and Zutao Zhu.
Department of Electrical Engineering & Computer Science
Syracuse University, Syracuse, New York.
Introduction
Privacy-Preserving Data Publishing.
The impact of background knowledge:
How does it affect privacy?
How to measure its impact on privacy?
Integrate background knowledge in privacy quantification.
Privacy-MaxEnt: A systematic approach.
Based on well-established theories.
Evaluation.
Privacy-Preserving Data Publishing
Data disguise methods:
Randomization
Generalization (e.g. Mondrian)
Bucketization (e.g. Anatomy)
Our Privacy-MaxEnt method can be applied to Generalization and Bucketization. We pick Bucketization in our presentation.
Data Sets
Identifier Quasi-Identifier (QI) Sensitive Attribute (SA)
Bucketized Data
P( Breast cancer | {female, college}, bucket=1 ) = 1/4
P( Breast cancer | {female, junior}, bucket=2 ) = 1/3
Quasi-Identifier (QI) Sensitive Attribute (SA)
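The two probabilities above follow from counting inside a bucket: without background knowledge, every QI tuple in a bucket is equally likely to own each sensitive value in that bucket. A minimal sketch of that count follows; the table contents are hypothetical, chosen only so that the bucket sizes (4 and 3) match the two values on the slide.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical bucketized table (not the slide's actual data): each bucket
# publishes its QI tuples and, separately, its multiset of sensitive values.
buckets = {
    1: {"qi": [("male", "college"), ("female", "college"),
               ("male", "junior"), ("female", "graduate")],
        "sa": ["flu", "breast cancer", "pneumonia", "hiv"]},
    2: {"qi": [("female", "junior"), ("male", "graduate"), ("female", "college")],
        "sa": ["breast cancer", "flu", "lung cancer"]},
}

def baseline_posterior(sensitive_value, qi_tuple, bucket_id):
    """P(s | q, bucket) with no background knowledge: frequency of s in the bucket."""
    bucket = buckets[bucket_id]
    if qi_tuple not in bucket["qi"]:
        raise ValueError("QI tuple does not appear in this bucket")
    return Fraction(Counter(bucket["sa"])[sensitive_value], len(bucket["sa"]))

p1 = baseline_posterior("breast cancer", ("female", "college"), 1)
p2 = baseline_posterior("breast cancer", ("female", "junior"), 2)
```

Background knowledge such as "it is rare for males to have breast cancer" breaks this uniform assumption, which is what the rest of the talk addresses.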
Impact of Background Knowledge
Background Knowledge:
It is rare for males to have breast cancer.
This analysis is hard for large data sets.
Previous Studies
Martin, et al. ICDE’07. First formal study on background knowledge.
Chen, LeFevre, Ramakrishnan. VLDB’07. Improves the previous work.
They deal with rule-based knowledge: deterministic knowledge.
Background knowledge can be much more complicated: uncertain knowledge.
Complicated Background Knowledge
Rule-based knowledge: P (s | q) = 1. P (s | q) = 0.
Probability-based knowledge: P (s | q) = 0.2. P (s | Alice) = 0.2.
Vague background knowledge: 0.3 ≤ P (s | q) ≤ 0.5.
Miscellaneous types: P (s | q1) + P (s | q2) = 0.7. One of Alice and Bob has “Lung Cancer”.
Challenges
How to analyze privacy in a systematic way for large data sets and complicated background knowledge?
Directly computing P( S | Q ) is hard.
What do we want to compute?
P( S | Q ), given the background knowledge and the published data set.
P( S | Q ) is primitive for most privacy metrics.
Our Approach
Consider P( S | Q ) as variable x (a vector).
[Diagram: Background Knowledge and Published Data (Public Information) each yield constraints on x; solve x for the most unbiased solution.]
Maximum Entropy Principle
“Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate. It is the least biased estimate possible on the given information.”
E. T. Jaynes, 1957.
The MaxEnt Approach
[Diagram: Background Knowledge and Published Data (Public Information) each yield constraints on P( S | Q ); estimate P( S | Q ) with the Maximum Entropy Estimate.]
Entropy
Entropy: $H(S \mid Q, B) = -\sum_{Q,S,B} P(Q, B)\, P(S \mid Q, B) \log P(S \mid Q, B)$.
Because H(S | Q, B) = H(Q, S, B) - H(Q, B)
Constraint should use P(Q, S, B) as variables
Entropy: $H(Q, S, B) = -\sum_{Q,S,B} P(Q, S, B) \log P(Q, S, B)$.
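The identity H(S | Q, B) = H(Q, S, B) - H(Q, B) can be checked numerically. The sketch below uses a small made-up joint distribution P(Q, S, B); the values are illustrative assumptions, not data from the talk.

```python
import math
from collections import defaultdict

# Hypothetical joint distribution P(q, s, b); the probabilities sum to 1.
joint = {
    ("q1", "s1", 1): 0.10, ("q1", "s2", 1): 0.15, ("q2", "s1", 1): 0.25,
    ("q1", "s3", 2): 0.20, ("q3", "s3", 2): 0.05, ("q3", "s4", 2): 0.25,
}

def entropy(dist):
    """Shannon entropy (natural log) of a distribution given as {outcome: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Marginal P(q, b), obtained by summing out the sensitive attribute.
p_qb = defaultdict(float)
for (q, s, b), p in joint.items():
    p_qb[(q, b)] += p

# Conditional entropy: -sum P(q, s, b) * log P(s | q, b).
h_conditional = -sum(p * math.log(p / p_qb[(q, b)])
                     for (q, s, b), p in joint.items() if p > 0)
h_difference = entropy(joint) - entropy(p_qb)
```

The two quantities agree up to floating-point error, which is why the optimization can be written over the joint variables P(Q, S, B).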
Maximum Entropy Estimate
Let vector x = P(Q, S, B).
Find the value for x that maximizes its entropy H(Q, S, B), while satisfying
h1(x) = c1, …, hu(x) = cu : equality constraints
g1(x) ≤ d1, …, gv(x) ≤ dv : inequality constraints
A special case of Non-Linear Programming.
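For the case where all constraints are linear equalities, the problem and its stationarity condition can be written out as below. This is a sketch under that assumption only: the coefficients a_{ji} and the multipliers \lambda_j are notation introduced here rather than taken from the slides, and inequality constraints are left out.

```latex
\begin{aligned}
\max_{x}\quad & H(x) = -\sum_{i} x_i \log x_i \\
\text{s.t.}\quad & \sum_{i} a_{ji}\, x_i = c_j, \qquad j = 1, \dots, u. \\[6pt]
\mathcal{L}(x, \lambda) &= -\sum_{i} x_i \log x_i
  - \sum_{j=1}^{u} \lambda_j \Big( \sum_{i} a_{ji}\, x_i - c_j \Big), \\
\frac{\partial \mathcal{L}}{\partial x_i} = 0
  \;&\Longrightarrow\;
  x_i = \exp\!\Big( -1 - \sum_{j=1}^{u} \lambda_j\, a_{ji} \Big).
\end{aligned}
```

Each x_i is therefore determined by the multipliers, so the search can be carried out over \lambda instead of over x.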
Constraints from Knowledge
Linear model: quite generic.
Conditional probability: P (S | Q) = P(Q, S) / P(Q).
Background knowledge has nothing to do with B: P(Q, S) = P(Q, S, B=1) + … + P(Q, S, B=m).
[Diagram: Background Knowledge yields constraints on P(Q, S, B).]
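A piece of knowledge such as P(s | q) = 0.2 becomes linear in the joint variables once it is multiplied through by P(q) and P(q, s) is expanded over buckets. The sketch below assumes P(q) can be read off the published QI values; the function names and the numbers are hypothetical.

```python
def equality_constraint(q, s, prob, p_q, bucket_ids):
    """P(s | q) = prob  ->  sum_b P(q, s, b) = prob * P(q)."""
    terms = [(q, s, b) for b in bucket_ids]
    return terms, prob * p_q

def interval_constraints(q, s, low, high, p_q, bucket_ids):
    """low <= P(s | q) <= high  ->  two linear inequalities on sum_b P(q, s, b)."""
    terms = [(q, s, b) for b in bucket_ids]
    return (terms, low * p_q), (terms, high * p_q)

def lhs(terms, x):
    """Evaluate the left-hand side for an assignment x = {(q, s, b): probability}."""
    return sum(x.get(t, 0.0) for t in terms)

# Hypothetical numbers: P(q1) = 0.3, and q1 appears in buckets 1 and 2.
x = {("q1", "s1", 1): 0.04, ("q1", "s1", 2): 0.02}
eq_terms, eq_rhs = equality_constraint("q1", "s1", 0.2, 0.3, [1, 2])
(lo_terms, lo_rhs), (hi_terms, hi_rhs) = interval_constraints("q1", "s1", 0.1, 0.25, 0.3, [1, 2])
```

Here `lhs(eq_terms, x)` equals `eq_rhs` up to rounding, so this assignment is consistent with P(s1 | q1) = 0.2 and also with the vague bound 0.1 ≤ P(s1 | q1) ≤ 0.25.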
Constraints from Published Data
Constraints: Truth and only the truth.
Absolutely correct for the original data set.
No inference.
[Diagram: Published Data Set D’ yields constraints on P(Q, S, B).]
Assignment and Constraints
Observation: the original data is one of the assignments.
Constraint: true for all possible assignments.
QI Constraint
Constraint: $\sum_{j=1}^{h} P(q, s_j, b) = P(q, b)$
Example: $P(q_1, s_1, 1) + P(q_1, s_2, 1) + P(q_1, s_3, 1) = P(q_1, 1) = 0.2$
SA Constraint
Constraint: $\sum_{i=1}^{g} P(q_i, s, b) = P(s, b)$
Example: $P(q_1, s_4, 2) + P(q_3, s_4, 2) + P(q_4, s_4, 2) = P(s_4, 2) = 0.1$
Zero Constraint
P(q, s, b) = 0, if q or s does not appear in Bucket b.
We can reduce the number of variables.
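The three kinds of constraints from the published data can be generated mechanically. The sketch below uses a hypothetical 10-record table, chosen only so that P(q1, 1) = 0.2 and P(s4, 2) = 0.1 as in the examples; the zero constraint is applied by never creating a variable for a (q, s, b) combination that cannot occur.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical bucketized table with N = 10 records.
buckets = {
    1: {"qi": ["q1", "q1", "q2", "q2"], "sa": ["s1", "s2", "s2", "s3"]},
    2: {"qi": ["q1", "q3", "q4"], "sa": ["s1", "s3", "s4"]},
    3: {"qi": ["q2", "q5", "q6"], "sa": ["s4", "s4", "s5"]},
}

def build_constraints(buckets):
    n_records = sum(len(data["qi"]) for data in buckets.values())
    variables, qi_cons, sa_cons = [], [], []
    for b, data in buckets.items():
        q_counts, s_counts = Counter(data["qi"]), Counter(data["sa"])
        # Zero constraint: only (q, s, b) with q and s both in bucket b get a variable.
        variables += [(q, s, b) for q in q_counts for s in s_counts]
        # QI constraint: sum over s of P(q, s, b) = P(q, b).
        for q, k in q_counts.items():
            qi_cons.append(([(q, s, b) for s in s_counts], Fraction(k, n_records)))
        # SA constraint: sum over q of P(q, s, b) = P(s, b).
        for s, k in s_counts.items():
            sa_cons.append(([(q, s, b) for q in q_counts], Fraction(k, n_records)))
    return variables, qi_cons, sa_cons

variables, qi_cons, sa_cons = build_constraints(buckets)
```

With 6 distinct QI values, 5 sensitive values, and 3 buckets, the full space would have 90 variables; after the zero constraint only 21 remain. Within each bucket the QI right-hand sides and the SA right-hand sides both add up to P(b), which is the source of the single redundant constraint per bucket.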
Theoretic Properties
Soundness: Are they correct? Easy to prove.
Completeness: Have we missed any constraint? See our theorems and proofs.
Conciseness: Are there redundant constraints? Only one redundant constraint in each bucket.
Consistency: Is our approach consistent with the existing methods (i.e., when background knowledge is Ø)?
Completeness w.r.t. Equations
Have we missed any equality constraint?
Yes! If F1 = C1 and F2 = C2 are constraints, F1 + F2 = C1 + C2 is too. However, it is redundant.
Completeness Theorem: U: our constraint set. All linear constraints can be written as linear combinations of the constraints in U.
Completeness w.r.t. Inequalities
Have we missed any inequality constraint?
Yes! If F = C, then F ≤ C + 0.2 is also valid (redundant).
Completeness Theorem: Our constraint set is also complete in the inequality sense.
Putting Them Together
[Diagram: Background Knowledge and Published Data (Public Information) each yield constraints on P( S | Q ); estimate P( S | Q ) with the Maximum Entropy Estimate.]
Tools: LBFGS, TOMLAB, KNITRO, etc.
Inevitable Questions:
Where do we get background knowledge?
Do we have to be very very knowledgeable?
For P (s | q) type of knowledge:
All useful knowledge is in the original data set.
Association rules:
Positive: Q ⇒ S
Negative: Q ⇒ ¬S, ¬Q ⇒ S, ¬Q ⇒ ¬S
Bound the knowledge in our study.
Top-K strongest association rules.
Knowledge about Individuals
Alice: (i1, q1), Bob: (i4, q2), Charlie: (i9, q5)
Knowledge 1: Alice has either s1 or s4.
Constraint: $P(i_1, q_1, s_1, 1) + P(i_1, q_1, s_1, 2) + P(i_1, q_1, s_4, 2) = P(i_1, q_1) = 1/N$
Knowledge 2: Two people among Alice, Bob, and Charlie have s4.
Constraint: $P(i_4, q_2, s_4, 3) + P(i_9, q_5, s_4, 3) + P(i_1, q_1, s_4, 2) = 2/N$
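Constraints about individuals can be assembled the same way as the group-level ones: for each person, list the (i, q, s, b) variables that are not ruled out by the bucket contents, then fix their sum. The sketch below is illustrative only; the bucket memberships and N = 10 are assumptions made for the example, not values stated on the slides.

```python
from fractions import Fraction

# Hypothetical bucket memberships for the QI and sensitive values involved.
q_buckets = {"q1": [1, 2], "q2": [1, 3], "q5": [3]}
s_buckets = {"s1": [1, 2], "s4": [2, 3]}
people = {"Alice": ("i1", "q1"), "Bob": ("i4", "q2"), "Charlie": ("i9", "q5")}
N = 10  # assumed number of records

def individual_terms(person, values):
    """Variables P(i, q, s, b) for one person, restricted to buckets holding both q and s."""
    i, q = people[person]
    return [(i, q, s, b) for s in values for b in q_buckets[q] if b in s_buckets[s]]

def one_of(person, values):
    """'person has one of these values': the listed terms sum to P(i, q) = 1/N."""
    return individual_terms(person, values), Fraction(1, N)

def k_of(persons, value, k):
    """'k of these people have value': the listed terms sum to k/N."""
    terms = [t for p in persons for t in individual_terms(p, [value])]
    return terms, Fraction(k, N)

knowledge_1 = one_of("Alice", ["s1", "s4"])
knowledge_2 = k_of(["Alice", "Bob", "Charlie"], "s4", 2)
```

Both results are (terms, right-hand side) pairs, so they can be added to the same linear constraint set as the QI and SA constraints.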
Evaluation
Implementation:
Lagrange multipliers: Constrained Optimization → Unconstrained Optimization.
LBFGS: solving the unconstrained optimization problem.
Pentium 3 GHz CPU with 4 GB memory.
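The step from constrained to unconstrained optimization can be illustrated on a toy problem. With linear equality constraints A x = c, setting the derivative of the Lagrangian to zero gives x_i = exp(-1 - Σ_j λ_j A[j][i]), and the multipliers can then be found by unconstrained minimization. The sketch below uses plain gradient descent as a simple stand-in for L-BFGS; the bucket, the constraint values, the step size, and the iteration count are all assumptions made for illustration.

```python
import math

def maxent_dual_descent(A, c, step=0.3, iters=20000):
    """Maximize -sum x_i log x_i subject to A x = c via gradient descent on the dual."""
    m, n = len(A), len(A[0])
    lam = [0.0] * m
    x = [math.exp(-1.0)] * n
    for _ in range(iters):
        # Stationarity of the Lagrangian: x_i = exp(-1 - sum_j lam_j * A[j][i]).
        x = [math.exp(-1.0 - sum(lam[j] * A[j][i] for j in range(m))) for i in range(n)]
        # The dual gradient is c - A x, so descending it moves lam along A x - c.
        residual = [sum(A[j][i] * x[i] for i in range(n)) - c[j] for j in range(m)]
        lam = [lam[j] + step * residual[j] for j in range(m)]
    return x

# One hypothetical bucket; variables are [P(q1,s1), P(q1,s2), P(q2,s1), P(q2,s2)].
A = [[1, 1, 0, 0],   # QI constraint: P(q1) = 0.5
     [0, 0, 1, 1],   # QI constraint: P(q2) = 0.5
     [1, 0, 1, 0],   # SA constraint: P(s1) = 0.25
     [0, 1, 0, 1]]   # SA constraint: P(s2) = 0.75
c = [0.5, 0.5, 0.25, 0.75]

no_knowledge = maxent_dual_descent(A, c)
# Add background knowledge P(s1 | q1) = 0.1, i.e. P(q1, s1) = 0.05.
with_knowledge = maxent_dual_descent(A + [[1, 0, 0, 0]], c + [0.05])
```

Without knowledge the estimate approaches the product form P(q)P(s), i.e. P(s1 | q) = 0.25 for both q1 and q2, matching the usual bucket-frequency estimate. With the extra constraint, P(s1 | q2) rises to 0.4 because q1 is now an unlikely owner of s1.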
Privacy versus Knowledge
Estimation Accuracy: KL Distance between P(MaxEnt) (S | Q) and P(Original) (S | Q).
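One way to compute such a KL distance between two conditional distributions is sketched below. The slide does not spell out the exact formula, so the weighting by P(q), the direction (original relative to estimate), and the natural logarithm are all assumptions here, as are the example numbers.

```python
import math

def estimation_error(p_q, original, estimate):
    """sum_q P(q) * KL( original(. | q) || estimate(. | q) ), natural log.

    Assumes estimate[q][s] > 0 wherever original[q][s] > 0.
    """
    total = 0.0
    for q, weight in p_q.items():
        for s, p in original[q].items():
            if p > 0:
                total += weight * p * math.log(p / estimate[q][s])
    return total

# Hypothetical conditional distributions P(s | q).
p_q = {"q1": 0.5, "q2": 0.5}
original = {"q1": {"s1": 0.1, "s2": 0.9}, "q2": {"s1": 0.4, "s2": 0.6}}
estimate = {"q1": {"s1": 0.25, "s2": 0.75}, "q2": {"s1": 0.25, "s2": 0.75}}

error_without_knowledge = estimation_error(p_q, original, estimate)
error_if_exact = estimation_error(p_q, original, original)
```

A smaller value means the adversary's estimate is closer to the original data, so the distance is expected to shrink as more background knowledge is added.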
Privacy versus # of QI attributes
Performance vs. Knowledge
Running Time vs. Data Size
Iteration vs. Data size
Conclusion
Privacy-MaxEnt is a systematic method:
Model various types of knowledge.
Model the information from the published data.
Based on well-established theory.
Future work:
Reducing the # of constraints.
Vague background knowledge.
Background knowledge about individuals.