TRANSCRIPT
CERIAS: The Center for Education and Research in Information Assurance and Security
The Problem: Anatomized Learning
Given some l-diverse data in the anatomy model, can we learn accurate data mining models?
• What is the anatomy model?
• What is l-diversity?
• Under which assumptions?
Anatomy Model
Separate the data table D into two tables, an identifier table (IT) and a sensitive table (ST), instead of generalizing records in the same group:
• Divide D into m groups G_j, each with group id (GID) j
• IT attributes (A_it): A_1, ..., A_d
• ST attribute: A_s
• Publish IT and ST instead of D
• L-diverse
• Xiao et al. (2006), Nergiz et al. (2011, 2013)
• Patient data example (HIPAA 2002)
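As an illustration of the splitting step, here is a minimal pure-Python sketch (a hypothetical helper, not the authors' implementation). A real anatomizer would choose the groups so that each one is L-diverse, and the poster's ST stores a hash H(SEQ) rather than the raw sequence number:

```python
def anatomize(records, qi_attrs, sensitive_attr, group_size=2):
    """Split `records` into an identifier table (IT) and a sensitive
    table (ST) linked only by a group id (GID), instead of
    generalizing the quasi-identifiers.

    Sketch only: groups are formed from consecutive records; a real
    anatomizer must group records so that each group is L-diverse."""
    it_table, st_table = [], []
    seq = 0
    for start in range(0, len(records), group_size):
        gid = start // group_size + 1
        for rec in records[start:start + group_size]:
            seq += 1
            it_row = {attr: rec[attr] for attr in qi_attrs}
            it_row.update({"GID": gid, "SEQ": seq})
            it_table.append(it_row)
            # ST keeps only the group id and the sensitive value
            # (the poster additionally hashes SEQ as H(SEQ)).
            st_table.append({"SEQ": seq, "GID": gid,
                             sensitive_attr: rec[sensitive_attr]})
    return it_table, st_table

# First two patients from the example table:
patients = [
    {"Patient": "Ike", "Age": 41, "Address": "Dayton", "Disease": "Cold"},
    {"Patient": "Eric", "Age": 22, "Address": "Richmond", "Disease": "Fever"},
]
it, st = anatomize(patients, ["Patient", "Age", "Address"], "Disease")
```

Publishing `it` and `st` separately means that an attacker who links a patient to group 1 still sees two candidate diseases.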
L-diversity: Privacy Standard
Every instance in IT can be associated with L different instances in ST.
• Patient data example: L = 2
• ∀ G_j, ∀ v ∈ S_A: freq(v, G_j) / |G_j| ≤ 1/L
• Machanavajjhala et al. (2007)
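The frequency condition can be checked mechanically. A small sketch (hypothetical function name), representing ST as (GID, sensitive value) pairs:

```python
from collections import Counter, defaultdict

def is_l_diverse(st_pairs, l):
    """True iff every group satisfies freq(v, G_j) / |G_j| <= 1/l
    for every sensitive value v."""
    groups = defaultdict(list)
    for gid, value in st_pairs:
        groups[gid].append(value)
    return all(
        max(Counter(values).values()) / len(values) <= 1 / l
        for values in groups.values()
    )

# ST from the patient example: every group holds two distinct diseases.
st = [(1, "Cold"), (1, "Fever"), (2, "Flu"), (2, "Cough"),
      (3, "Flu"), (3, "Fever"), (4, "Cough"), (4, "Flu")]
```

A group containing the same disease twice would fail the check for L = 2, since freq(v, G_j)/|G_j| would be 1 > 1/2.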
Sensitive table (ST):

H(SEQ)   GID (G)   Disease (D)
Hk2(1)   1         Cold
Hk2(2)   1         Fever
Hk2(3)   2         Flu
Hk2(4)   2         Cough
Hk2(5)   3         Flu
Hk2(6)   3         Fever
Hk2(7)   4         Cough
Hk2(8)   4         Flu

Identifier table (IT):

Patient (P)   Age (A)   Address (AD)   GID (G)   SEQ (S)
Ike           41        Dayton         1         1
Eric          22        Richmond       1         2
Olga          30        Lafayette      2         3
Kelly         35        Lafayette      2         4
Faye          24        Richmond       3         5
Mike          47        Richmond       3         6
Jason         45        Lafayette      4         7
Max           31        Lafayette      4         8
Approaches
Learning Problem Assumptions
• Training set of n_tr instances in the anatomy model and test set of n_ts instances without any anonymization (Inan et al. (2009))
• No background knowledge for IT
• Can't predict the sensitive attribute A_s; if we could, we would be violating privacy!
• Prediction task of A_i or C (binary!):
  • Type 1: A_i ∈ {A_1, ..., A_d}
  • Type 2: C ∉ {A_1, ..., A_d} ∧ C ≠ A_s
• No IT and ST linking
• Data must remain L-diverse
• No involvement of D's publisher
• Relaxed assumptions (some models):
  • Minimal involvement of D's publisher, limited resources of D's publisher
  • Link IT and ST on small subsets
  • "Distributed data mining" between third party (server) and data publisher (client)
Collaborative Decision Tree Analysis
• Type 1 prediction task with relaxed assumptions
• Advantages:
  1. Preserves privacy with reasonable accuracy
  2. A big part of the decision tree is learnt by the third party, a desired situation in a cloud server/client architecture
• Limitations:
  1. Hard to give any bound on the model performance, in particular on the conditional risk (error rate) of the classification
  2. No execution time guarantees in a cloud client/server architecture
• Need for a better justified model with conditional risk guarantees!
Nearest Neighbor Rule in Anatomy Model
• Type 2 prediction task without relaxed assumptions
• Anatomized training data (D_A): IT ⋈_{IT.GID = ST.GID} ST
• Augmentation of the nearest neighbor rule (Cover and Hart 1967): expand the training set such that the expanded version has size n_tr · L
• For all fixed L, the conditional risk follows as a corollary of Cover and Hart when n_tr → ∞
• One critical question: "How does the Bayes risk change?"
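A sketch of the expansion and a 1-NN rule on top of it (toy numeric data, hypothetical names): joining IT and ST on GID pairs each identifier record with all L sensitive values in its group, so the expanded training set has n_tr · L instances.

```python
from collections import defaultdict

def expand_anatomized(it_rows, st_pairs):
    """Join IT and ST on GID. Each IT row (features, gid, label) is
    paired with every sensitive value of its group; for a Type 2 task
    the label is a separate class attribute C, and the sensitive
    value becomes just another feature."""
    by_gid = defaultdict(list)
    for gid, sensitive in st_pairs:
        by_gid[gid].append(sensitive)
    return [(features + (s,), label)
            for features, gid, label in it_rows
            for s in by_gid[gid]]

def nn_predict(training, query):
    """1-NN rule (Cover and Hart 1967) with squared Euclidean distance."""
    return min(training,
               key=lambda inst: sum((a - b) ** 2
                                    for a, b in zip(inst[0], query)))[1]

# Toy data: two groups of two records each (L = 2), numeric sensitive values.
it = [((1.0,), 1, 0), ((2.0,), 1, 0), ((8.0,), 2, 1), ((9.0,), 2, 1)]
st = [(1, 0.5), (1, 1.5), (2, 7.5), (2, 8.5)]
expanded = expand_anatomized(it, st)   # 4 * 2 = 8 training instances
```

The query vector includes a value for the sensitive feature, since the Type 2 test set is not anonymized.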
Collaborative Decision Tree Learning
1. Distributed data mining in the cloud (client/server architecture)
2. On-the-fly encrypted subtrees (Mancuhan et al. 2014)
3. Experiments with four datasets from the UCI collection: adult, vote, autos and Australian credit
4. 10-fold cross-validation on each dataset measuring accuracy
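The evaluation protocol in item 4 can be sketched generically; `train_fn` and `predict_fn` are placeholders for any classifier:

```python
import random

def cross_val_accuracy(data, train_fn, predict_fn, k=10, seed=0):
    """k-fold cross-validation: shuffle, split into k folds, train on
    k-1 folds, measure accuracy on the held-out fold, and average."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        held_out = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i
                 for row in fold]
        model = train_fn(train)
        correct = sum(predict_fn(model, x) == y for x, y in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k
```

Rows are (features, label) pairs; with a trivial constant predictor this returns the base rate of the predicted class.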
Empirical Results
[Bar chart: accuracy (0.4 to 1.1) of three methods (cdtl, cdbsl, cnl) on five settings: Adult with Age sensitive, Adult with Relationship sensitive, Vote, Auto, and Australian Credit.]
[Figure: collaborative decision tree on the patient example. The root splits the full IT (8 records) on Address: AD=Dayton leaves one record (group 1), AD=Richmond leaves three records (groups 1 and 3), and AD=Lafayette leaves four records (groups 2 and 4). The Lafayette branch is completed by joining with ST on GID, giving (Age, Address, Disease) rows, and splitting on Disease: D=Flu (ages 30, 31) and D=Cough (ages 35, 45). The subtree over the ST rows of groups 2 and 4 is computed as an encrypted subtree.]
Current Work
• Experiments with the nearest neighbor classifier using real data
• SVM classification generalization: how to adjust the right margin for a good generalization property when the training data is anatomized?
• Real-world case study: how this could inform data retention policies

Ongoing work supported by the Northrop Grumman Cybersecurity Research Consortium
Theorem: Let X ⊂ R^(d+1) be a metric space, D be the training data and D_A be the anatomized training data. Let f_A1(x) and f_A2(x) be the smooth class-conditional probability density functions of D_A, with class priors P_A1 and P_A2 such that f_A(x) = P_A1 f_A1(x) + P_A2 f_A2(x). Similarly, let f_1(x) and f_2(x) be the smooth class-conditional probability density functions of D such that f(x) = P_1 f_1(x) + P_2 f_2(x) with class priors P_1 and P_2. Let h_A(x) = −ln(f_A1(x)/f_A2(x)) and h(x) = −ln(f_1(x)/f_2(x)) be the classifiers with biases Δh_A(x) and Δh(x) respectively. Let t = ln(P_1/P_2) be the decision threshold with threshold bias Δt. Let ε_A > 0 be the small changes on f_1(x) and f_2(x) resulting in f_A1(x) and f_A2(x); and R*_A, R* be the Bayes error estimates with respective biases ΔR*_A, ΔR*. Let f̂_Ak(x) and f̂_k(x) be the Parzen density estimates, and K(·) be the kernel function for D with shape matrix A and size/volume parameter r. Last, assume that 1) A_it and A_s are independent in the training data D and the anatomized training data D_A, 2) R*_A = R* holds, and 3) Δt < 1. Then the estimated Bayes risk is

R*_A ≈ c_1 r^2 + c_2 r^4 + c_3 r^(−d−1)/n + ε_A c_4 r^2 + ε_A c_5 r^4 − ε_A c_6 r^(−d−1)/n

where ε_A c_6 r^(−d−1)/n > 0 always holds.
• Another critical question: "How does the convergence rate to the asymptotic conditional risk change?"
• O(1/(Ln)^(d+1)) versus O(1/n^(d+1))
• Faster convergence to the asymptotic conditional risk using anatomized training data
• What is the asymptotic conditional risk?
  • It depends on the Bayes risk (theorem above)
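A quick numeric illustration of the two rates (L is the diversity parameter, d the feature dimension; the functions are illustrative, not the authors' constants): at any fixed n, the anatomized term 1/(Ln)^(d+1) is a factor L^(d+1) smaller.

```python
def rate_original(n, d):
    """Leading term O(1/n^(d+1)) for the original training data."""
    return 1.0 / n ** (d + 1)

def rate_anatomized(n, d, l):
    """Leading term O(1/(l*n)^(d+1)) for the anatomized training data."""
    return 1.0 / (l * n) ** (d + 1)

# With l = 2 and d = 3, the anatomized term is 2**4 = 16x smaller
# at every n, matching the faster convergence claimed above.
speedup = rate_original(100, 3) / rate_anatomized(100, 3, 2)
```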
This work is supported by NPRP grant 09-256-1-046 from the Qatar National Research Fund and a 1415 PRF Research Grant from Purdue University.
Theoretical Results
Koray Mancuhan
Chris Clifton
Data Classification Using Anatomized Training Data