OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining
Zhengli Huang and Wenliang (Kevin) Du
Department of EECS, Syracuse University
Data Mining/Analysis
• Step 1: Data Collection (individual data → data publisher)
• Step 2: Data Publishing (data publisher → data miner)
• Data cannot be published directly because of privacy concerns.
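Background: Randomized Response
• Question: "Do you smoke?" Suppose the true answer is "Yes".
• The respondent flips a biased coin: $P(\text{Head}) = \theta$, with $\theta \neq 0.5$.
– Head: report "Yes" (the true answer).
– Tail: report "No".
• A respondent whose true answer is "Yes" therefore reports "Yes" with probability $\theta$.
• The reported proportions $P'$ relate to the true proportions $P$ by:
$$P'(\text{Yes}) = P(\text{Yes}) \cdot \theta + P(\text{No}) \cdot (1 - \theta)$$
$$P'(\text{No}) = P(\text{Yes}) \cdot (1 - \theta) + P(\text{No}) \cdot \theta$$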
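The two equations above can be inverted to recover the true proportion from the reported one, as long as θ ≠ 0.5. The following is a minimal sketch, not the authors' code: the function names, the sample size, the seed, and the 30% true "Yes" rate are all assumptions made for illustration.

```python
import random

def randomize(true_answer: bool, theta: float, rng: random.Random) -> bool:
    """Report the true answer with probability theta, otherwise its opposite."""
    return true_answer if rng.random() < theta else not true_answer

def estimate_true_yes(observed_yes: float, theta: float) -> float:
    """Solve P'(Yes) = P(Yes)*theta + (1 - P(Yes))*(1 - theta) for P(Yes)."""
    if theta == 0.5:
        raise ValueError("theta = 0.5 makes the reports independent of the truth")
    return (observed_yes - (1.0 - theta)) / (2.0 * theta - 1.0)

# Hypothetical population: 10,000 respondents, 30% of whom truly answer "Yes".
rng = random.Random(0)
theta = 0.8
true_answers = [i < 3000 for i in range(10000)]
reports = [randomize(answer, theta, rng) for answer in true_answers]

observed_yes = sum(reports) / len(reports)               # P'(Yes), what the collector sees
estimated_yes = estimate_true_yes(observed_yes, theta)   # estimate of P(Yes)
```

No individual report reveals the respondent's true answer with certainty, yet the aggregate estimate stays close to the true proportion.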
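RR for Categorical Data
• A true value $s_i$ is reported as $s_i$, $s_{i+1}$, $s_{i+2}$, or $s_{i+3}$ with probabilities $q_1$, $q_2$, $q_3$, and $q_4$, respectively (indices wrap around).
• The reported distribution $P'$ is the RR matrix $M$ applied to the true distribution $P$:
$$
\begin{pmatrix} P'(s_1) \\ P'(s_2) \\ P'(s_3) \\ P'(s_4) \end{pmatrix}
=
\underbrace{\begin{pmatrix}
q_1 & q_4 & q_3 & q_2 \\
q_2 & q_1 & q_4 & q_3 \\
q_3 & q_2 & q_1 & q_4 \\
q_4 & q_3 & q_2 & q_1
\end{pmatrix}}_{M}
\begin{pmatrix} P(s_1) \\ P(s_2) \\ P(s_3) \\ P(s_4) \end{pmatrix}
$$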
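As a rough illustration of the matrix form above, the sketch below builds M from the probabilities q1..q4 and recovers the true distribution by solving the linear system. It assumes NumPy is available; the helper name `circulant_rr_matrix` and the numeric values of q and P are hypothetical, chosen only so that M is invertible.

```python
import numpy as np

def circulant_rr_matrix(q):
    """Build M so that column j is the report distribution for true value s_j.

    Entry (i, j) is q[(i - j) mod n], which reproduces the layout on the slide:
    first row q1 q4 q3 q2, second row q2 q1 q4 q3, and so on.
    """
    q = np.asarray(q, dtype=float)
    n = len(q)
    return np.array([[q[(i - j) % n] for j in range(n)] for i in range(n)])

# Hypothetical randomization probabilities (they sum to 1).
q = [0.6, 0.2, 0.15, 0.05]
M = circulant_rr_matrix(q)

# Hypothetical true distribution over s1..s4.
p_true = np.array([0.4, 0.3, 0.2, 0.1])

p_reported = M @ p_true                        # P' = M P
p_estimated = np.linalg.solve(M, p_reported)   # P = M^{-1} P', if M is invertible
```

With real data, `p_reported` would be the empirical frequencies of the randomized answers, so the recovered estimate would carry sampling error rather than match `p_true` exactly.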
A Generalization
• Several RR matrices have been proposed
– [Warner 65]
– [R. Agrawal et al. 05], [S. Agrawal et al. 05]
• The RR matrix can be arbitrary
• Can we find optimal RR matrices?
$$
M = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{bmatrix}
$$
What is an optimal matrix?
• Which of the following is better?
$$
M_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\qquad
M_2 = \begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{bmatrix}
$$
• Privacy: M2 is better
• Utility: M1 is better
• So, what is an optimal matrix?
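A short numerical check makes the trade-off between M1 and M2 concrete. This is only a sketch assuming NumPy; the two input distributions `p_a` and `p_b` are made up for illustration.

```python
import numpy as np

M1 = np.eye(3)                     # every value is reported unchanged
M2 = np.full((3, 3), 1.0 / 3.0)    # every value is replaced by a uniform draw

# Two very different hypothetical true distributions.
p_a = np.array([0.7, 0.2, 0.1])
p_b = np.array([0.1, 0.1, 0.8])

reported_m1 = M1 @ p_a     # identical to p_a: full utility, no privacy
reported_m2_a = M2 @ p_a   # uniform
reported_m2_b = M2 @ p_b   # also uniform, indistinguishable from the above

# M2 has rank 1, so it cannot be inverted to recover the true distribution.
rank_m2 = int(np.linalg.matrix_rank(M2))
```

Under M2 the reports carry no information about either the individual or the aggregate, while under M1 they reveal both exactly; useful matrices lie between these two extremes.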
Optimal RR Matrix
• An RR matrix M is optimal if no other RR matrix is better than M in both privacy and utility (i.e., no other matrix dominates M).
– Privacy quantification
– Utility quantification
• A number of privacy and utility metrics have been proposed. We use the following:
– Privacy: how accurately one can estimate individual information.
– Utility: how accurately we can estimate aggregate information.
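The dominance definition above can be sketched in a few lines. The code below is illustrative only: it assumes each matrix has already been reduced to a (privacy, utility) pair in which higher is better for both, and the candidate scores are invented. The slides do not give the exact metric formulas, so none are implemented here.

```python
from typing import List, Tuple

# (privacy, utility); ASSUMPTION: both are scores where higher is better.
Score = Tuple[float, float]

def dominates(a: Score, b: Score) -> bool:
    """True if a is better than b in both privacy and utility."""
    return a[0] > b[0] and a[1] > b[1]

def optimal_set(scores: List[Score]) -> List[Score]:
    """Keep every candidate that no other candidate dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]

# Hypothetical scores for five candidate RR matrices.
candidates = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9), (0.3, 0.2)]
front = optimal_set(candidates)
```

Here (0.4, 0.4) and (0.3, 0.2) are dropped because (0.5, 0.5) beats them on both objectives, while the remaining three are incomparable with each other and all count as optimal.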
Optimization Methods
• Approach 1: Weighted sum: w1 · Privacy + w2 · Utility
• Approach 2
– Fix Privacy, find M with the optimal Utility.
– Fix Utility, find M with the optimal Privacy.
– Challenge: it is difficult to generate M with a fixed privacy or utility.
• Our Approach: Multi-Objective Optimization
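To make Approach 1 concrete, the sketch below scalarizes the two objectives with weights w1 and w2 and picks the best candidate for each weight choice. The candidate names, their scores, and the "higher is better" convention are assumptions for illustration, not values from the slides.

```python
# Hypothetical (privacy, utility) scores; ASSUMPTION: higher is better for both.
candidates = {"A": (0.9, 0.1), "B": (0.6, 0.6), "C": (0.1, 0.9)}

def weighted_sum(privacy: float, utility: float, w1: float, w2: float) -> float:
    """Scalarize the two objectives: w1 * Privacy + w2 * Utility."""
    return w1 * privacy + w2 * utility

def best_by_weighted_sum(w1: float, w2: float) -> str:
    """Return the name of the candidate with the highest weighted sum."""
    return max(candidates, key=lambda name: weighted_sum(*candidates[name], w1, w2))

privacy_heavy = best_by_weighted_sum(0.8, 0.2)
balanced = best_by_weighted_sum(0.5, 0.5)
utility_heavy = best_by_weighted_sum(0.2, 0.8)
```

Each choice of weights yields a single trade-off point, so the weights have to be fixed before the search, whereas a multi-objective search returns the whole set of non-dominated candidates at once.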
Evolutionary Multi-Objective Optimization (EMOO)
• Genetic algorithms have difficulty dealing with multiple objectives.
• We use an EMOO algorithm.
• We use SPEA2.
Our SPEA2-based algorithm
EMOO
• Evolution
– Crossover
– Mutation
• Fitness Assignment (SPEA2)
– Strength value S(M): the number of matrices dominated by M.
– Raw fitness F'(M): the sum of the strengths of the RR matrices that dominate M. The lower, the better.
– Density d(M): discriminates between matrices with the same fitness.
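The strength and raw-fitness definitions above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it works on a single population (SPEA2 also maintains an archive), omits the density term, and assumes made-up (privacy, utility) scores where higher is better.

```python
from typing import Dict, Tuple

# (privacy, utility); ASSUMPTION: both are scores where higher is better.
Score = Tuple[float, float]

def dominates(a: Score, b: Score) -> bool:
    """True if a is better than b in both objectives."""
    return a[0] > b[0] and a[1] > b[1]

def strength(pop: Dict[str, Score]) -> Dict[str, int]:
    """S(M): the number of matrices in the population that M dominates."""
    return {m: sum(dominates(pop[m], pop[o]) for o in pop if o != m) for m in pop}

def raw_fitness(pop: Dict[str, Score]) -> Dict[str, int]:
    """F'(M): the sum of the strengths of the matrices that dominate M."""
    s = strength(pop)
    return {m: sum(s[o] for o in pop if o != m and dominates(pop[o], pop[m]))
            for m in pop}

# Hypothetical population of five candidate matrices.
population = {
    "A": (0.9, 0.2),
    "B": (0.6, 0.6),
    "C": (0.2, 0.9),
    "D": (0.5, 0.5),
    "E": (0.1, 0.1),
}
S = strength(population)
F = raw_fitness(population)
```

Non-dominated matrices (A, B, and C here) get raw fitness 0, and the value grows with how heavily a matrix is dominated, which is why lower is better.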
Diversity
[Figure: candidate matrices M1 to M5 plotted in the objective space. Axes: Privacy and Utility, each running from Worse to Better.]
The Output of Optimization
• Pareto Fronts
– The optimal set is often plotted in the objective space, and the plot is called the Pareto front.
[Figure: a Pareto front. Axes: Privacy and Utility (error).]
Experiments
[Figures: experimental result plots; only the panel titles survive.]
• For normal distribution with different δ
• For the first attribute of the Adult data
• For normal distribution (δ = 0.75)
Summary
• We use an evolutionary multi-objective optimization technique to search for optimal RR matrices.
• The evaluation shows that our scheme achieves better performance than the existing RR schemes.