Privacy-Preserving Decision Tree Ensembles
Scenarios, State-of-the-Art Comparison, and Challenges
Inductive Learning
[Figure: Dataset → Optimized Decision Tree Algorithm]
Inductive learning: learn the target model through iterative inductions over the training sample set.
How? Approximate an optimal hypothesis by optimizing an objective learning function.
Example: Minimizing a Loss Function
• Training dataset: (x₁, c₁), …, (xₙ, cₙ), where cᵢ is the true label of xᵢ
• Target function: c = Γ(x)
• The inductive learner produces a model y = g(x) which approximates Γ(x) such that the loss function L(c, y) is minimized
• The optimal model minimizes the average loss L(c, y) over all samples in the training set, weighted by their posterior probability P(c | x)
• For many problems, c = Γ(x) is a non-deterministic function
Decision trees are one of the most fundamental inductive learning models.
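To make the objective concrete, here is a minimal sketch (the one-dimensional data and threshold hypothesis class are our own illustration, not from the poster) of picking the hypothesis g that minimizes the average 0-1 loss L(c, y) over the training set:

```python
# toy training set: (x_i, c_i) pairs, labels roughly follow a threshold at x = 5
train = [(1, 0), (2, 0), (3, 0), (4, 1), (6, 1), (7, 1), (8, 1), (5, 0)]

def zero_one_loss(c, y):
    # L(c, y): 0 when the prediction matches the true label, 1 otherwise
    return 0 if c == y else 1

def g(x, threshold):
    # candidate hypothesis: predict class 1 when x exceeds the threshold
    return 1 if x > threshold else 0

# inductive step: search the hypothesis space for the average-loss minimizer
best = min(
    range(0, 10),
    key=lambda t: sum(zero_one_loss(c, g(x, t)) for x, c in train) / len(train),
)
```

Because the label of x = 5 is noisy, no threshold achieves zero loss; the learner settles for the hypothesis with the smallest average loss, mirroring the non-deterministic Γ(x) point above.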
Applications: healthcare cost prediction [1], disease diagnosis [2][3], computer network analysis [4], credit risk assessment [5][6]
References
[1] Sushmita, Shanu, et al. "Population cost prediction on public healthcare datasets." Proceedings of the 5th International Conference on Digital Health. ACM, 2015.
[2] Azar, Ahmad Taher, and Shereen M. El-Metwally. "Decision tree classifiers for automated medical diagnosis." Neural Computing and Applications 23.7-8 (2013): 2387-2403.
[3] Singh, Anima, and John V. Guttag. "A comparison of non-symmetric entropy-based classification trees and support vector machine for cardiovascular risk stratification." Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2011.
[4] Antonakakis, Manos, et al. "From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware." USENIX Security Symposium. Vol. 12. 2012.
[5] Kim, Soo Y., and Arun Upneja. "Predicting restaurant financial distress using decision tree and AdaBoosted decision tree models." Economic Modelling 36 (2014): 354-362.
[6] Koh, Hian Chye, Wei Chin Tan, and Chwee Peng Goh. "A two-step method to construct credit scoring models with data mining techniques." International Journal of Business and Information 1.1 (2015).
[7] Agrawal, Rakesh, and Ramakrishnan Srikant. "Privacy-preserving data mining." ACM SIGMOD Record. Vol. 29. No. 2. ACM, 2000.
[8] Kargupta, Hillol, et al. "On the privacy preserving properties of random data perturbation techniques." Third IEEE International Conference on Data Mining (ICDM). IEEE, 2003.
[9] Fan, Wei. "On the optimality of probability estimation by random decision trees." AAAI. 2004.
[10] Ho, Tin Kam. "Random decision forests." Proceedings of the Third International Conference on Document Analysis and Recognition. Vol. 1. IEEE, 1995.
[11] Dwork, Cynthia. "Differential privacy." Encyclopedia of Cryptography and Security. Springer US, 2011. 338-340.
[12] Blum, Avrim, et al. "Practical privacy: the SuLQ framework." Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 2005.
[13] Friedman, Arik, and Assaf Schuster. "Data mining with differential privacy." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.
[14] Rana, Santu, Sunil Kumar Gupta, and Svetha Venkatesh. "Differentially private random forest with high utility." IEEE International Conference on Data Mining (ICDM). IEEE, 2015.
[15] Jagannathan, Geetha, Krishnan Pillaipakkamnatt, and Rebecca N. Wright. "A practical differentially private random decision tree classifier." IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2009.
[16] Lindell, Yehuda, and Benny Pinkas. "An efficient protocol for secure two-party computation in the presence of malicious adversaries." Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer Berlin Heidelberg, 2007.
[17] Beaver, Donald. "Commodity-based cryptography." Proceedings of the 29th Annual ACM Symposium on Theory of Computing. ACM, 1997.
[18] Paillier, Pascal. "Public-key cryptosystems based on composite degree residuosity classes." Eurocrypt. Vol. 99. 1999.
[19] Cramer, Ronald, Rosario Gennaro, and Berry Schoenmakers. "A secure and optimally efficient multi-authority election scheme." Transactions on Emerging Telecommunications Technologies 8.5 (1997): 481-490.
[20] Rabin, Michael O. "How to exchange secrets with oblivious transfer." IACR Cryptology ePrint Archive 2005 (2005): 187.
[21] Yao, Andrew Chi-Chih. "How to generate and exchange secrets." 27th Annual Symposium on Foundations of Computer Science. IEEE, 1986.
[22] Shamir, Adi. "How to share a secret." Communications of the ACM 22.11 (1979): 612-613.
[23] Lindell, Yehuda, and Benny Pinkas. "Privacy preserving data mining." Advances in Cryptology—CRYPTO 2000. Springer Berlin/Heidelberg, 2000.
[24] de Hoogh, Sebastiaan, et al. "Practical secure decision tree learning in a teletreatment application." International Conference on Financial Cryptography and Data Security. Springer, Berlin, Heidelberg, 2014.
[25] Wu, David J., et al. "Privately evaluating decision trees and random forests." Proceedings on Privacy Enhancing Technologies 2016.4 (2016): 335-355.
[26] De Cock, Martine, et al. "Efficient and Private Scoring of Decision Trees, Support Vector Machines and Logistic Regression Models based on Pre-Computation." IEEE Transactions on Dependable and Secure Computing (2017).
[27] Ohrimenko, Olga, et al. "Oblivious Multi-Party Machine Learning on Trusted Processors." USENIX Security Symposium. 2016.
[28] Vaidya, Jaideep, et al. "A random decision tree framework for privacy-preserving data mining." IEEE Transactions on Dependable and Secure Computing 11.5 (2014): 399-411.
Privacy-Preserving Training
Primary Care Physician’s Dataset
Hospital’s Dataset
Insurance Provider’s Dataset
Medical Specialist’s Dataset
Complete Dataset
HIPAA ?
Patient’s Private Medical Data
Patient’s Sensitive Classification Result
Privacy-Preserving Evaluation
Decision Tree Model
Randomization Techniques
Black Box
What does this mean?
• Tree T1 is trained on dataset D1
• Tree T2 is trained on dataset D2, where D2 is any dataset differing from D1 by at most one training example
• If no adversary can tell the difference between T1 and T2, then T1 is a differentially private decision tree
How? Add noise to D1 before building the tree!
• Train using differentially private queries [12]
• Make each step of the training process differentially private [13]
• Add randomization
  - Random forests [14]
  - Random decision trees [15]
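As an illustration of the "differentially private queries" idea, here is a sketch of the Laplace mechanism applied to a counting query (the function names are ours): a count has sensitivity 1, so Laplace noise with scale 1/ε makes the answer ε-differentially private.

```python
import random

def private_count(true_count, epsilon):
    # Laplace mechanism: a counting query has sensitivity 1, so adding
    # Laplace noise with scale 1/epsilon gives epsilon-differential privacy
    scale = 1.0 / epsilon
    # the difference of two iid exponentials is Laplace-distributed
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

# e.g. a tree node asks "how many of my records have label 1?"
labels = [1, 0, 1, 1, 0, 1, 0, 1]
noisy = private_count(sum(labels), epsilon=1.0)
```

The noise is zero-mean, so answers stay useful in aggregate while any single record's influence on the released count is masked.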
“In the setting of multiparty computation, sets of two or more parties with private inputs wish to jointly compute some (predetermined) function of their inputs. The computation should be such that the outputs received by the parties are correctly distributed, and furthermore, that the privacy of each party's input is preserved as much as possible, even in the presence of adversarial behavior.” [16]
What does this mean? The parties exchange random-looking messages that can still be used to compute the decision tree.
The messages mean nothing on their own, yet the parties still obtain the trained model.
How? Building Blocks:
• Commodity-Based Cryptography [17]
• Homomorphic Encryption [18] [19]
• Oblivious Transfer [20]
• Yao’s Garbled Circuits [21]
• Shamir’s Secret Sharing Scheme [22]
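For intuition on one of these building blocks, here is a compact sketch of Shamir's scheme [22] (the prime and parameters are our choices for illustration): the secret becomes the constant term of a random degree-(t-1) polynomial; any t shares reconstruct it by Lagrange interpolation, while fewer reveal nothing.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime, large enough for this demo

def share(secret, t, n):
    # random polynomial of degree t-1 whose constant term is the secret
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(t - 1)]
    eval_at = lambda x: sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, eval_at(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term
    secret = 0
    for xj, yj in shares:
        num, den = 1, 1
        for xm, _ in shares:
            if xm != xj:
                num = num * (-xm) % PRIME
                den = den * (xj - xm) % PRIME
        secret = (secret + yj * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

shares = share(1234, t=2, n=3)  # any 2 of the 3 shares suffice
```

Shares are also additively homomorphic (summing shares pointwise yields shares of the sum), which is what lets protocols like [24] compute on shared data without revealing it.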
Primary Concern: privacy of the datasets
Leakage Points: (1) the training process, (2) the tree structure
Idea: Evaluation as a Service
• The service provider holds a predictive ensemble model
• Charges per query made
Privacy Concerns:
• Server: the models
  - a source of revenue
  - encode business knowledge
  - encode underlying, potentially sensitive, training data
• Client: the data and the classification result
Differential Privacy [11]
• Training based on Garbled Circuits [23]
• Training based on Shamir’s Secret Sharing [24]
• Evaluation using Homomorphic Encryption [25]
• Evaluation using Commodity-Based Cryptography [26]
• Evaluation using SGX [27]
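For intuition on the homomorphic-encryption route, here is a toy Paillier [18] sketch (tiny primes, illustration only, not secure): multiplying ciphertexts adds the underlying plaintexts, the property such evaluation protocols exploit to aggregate values without decrypting them.

```python
import math
import random

# toy Paillier key pair -- tiny primes for illustration only, not secure
p, q = 1000003, 1000033
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)  # inverse of L(g^lam mod n^2)

def enc(m):
    r = random.randrange(1, n)  # random blinding factor (assume gcd(r, n) = 1)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# additive homomorphism: multiplying ciphertexts adds the plaintexts
total = enc(20) * enc(22) % n2
```

A real deployment would use 2048-bit moduli; the algebra is identical.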
Secure Multiparty Computation
Seminal Work: Agrawal and Srikant – Privacy-Preserving Data Mining [7]
• Introduced the concept of privacy-preserving data mining
• Techniques:
  - Discretize values to protect individual, unique values
  - Release xᵢ + r, where r is drawn from a uniform or Gaussian distribution
• Has since been broken [8]
• Opened the door to the research area
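The value-perturbation technique can be sketched as follows (function name and data are ours): each value is released as xᵢ + r, hiding individuals while keeping aggregate statistics approximately recoverable, which is exactly the property [8] later showed can be attacked.

```python
import random

def perturb(values, sigma=5.0):
    # release x_i + r with r ~ N(0, sigma^2); a uniform r works the same way
    return [x + random.gauss(0, sigma) for x in values]

ages = [23, 35, 41, 52, 67]
released = perturb(ages)  # individual ages obscured; the mean is preserved in expectation
```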
Random Decision Trees
• Introduced by Fan [9]
• Splits at each node according to a randomly chosen feature
  - Reduces the problem to protecting the leaf nodes
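A minimal sketch of the idea (the data structure is our own, binary features assumed): the tree's structure is chosen without looking at the data, so only the leaf class counts touch the private records, and they are all that needs protecting.

```python
import random

def build_tree(features, depth):
    # structure is data-independent: split on a randomly chosen feature
    if depth == 0 or not features:
        return {"counts": {}}  # leaf: class counts filled in later
    f = random.choice(features)
    rest = [x for x in features if x != f]
    return {"feature": f,
            "children": {0: build_tree(rest, depth - 1),
                         1: build_tree(rest, depth - 1)}}

def update(tree, x, label):
    # only the leaf statistics depend on the training data
    while "feature" in tree:
        tree = tree["children"][x[tree["feature"]]]
    tree["counts"][label] = tree["counts"].get(label, 0) + 1

# toy binary records: label happens to equal feature 0
data = [({0: 0, 1: 1}, 0), ({0: 1, 1: 0}, 1), ({0: 1, 1: 1}, 1)]
tree = build_tree(features=[0, 1], depth=2)
for x, c in data:
    update(tree, x, c)
```

Prediction averages the leaf class distributions over several such random trees, per [9].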
Random Forests
• Introduced by Ho [10]
• Random subspace method to implement stochastic discrimination
• Ensemble method with bagging
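The two ingredients can be sketched as follows (function names ours): bagging resamples the training set per tree, the random subspace method restricts each tree to a random feature subset, and the forest predicts by majority vote.

```python
import random
from collections import Counter

def bootstrap(data):
    # bagging: each tree trains on |data| examples drawn with replacement
    return [random.choice(data) for _ in data]

def random_subspace(features, k):
    # random subspace method: each tree sees only k of the features
    return random.sample(features, k)

def ensemble_predict(trees, x):
    # the forest's output is the majority vote of its trees
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# toy ensemble of three constant "trees" voting 1, 1, 0
prediction = ensemble_predict([lambda x: 1, lambda x: 1, lambda x: 0], x=None)
```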
Comparison of Approaches
Trade-Offs:
• Black-Box Access vs. Accuracy Loss
  - An attacker can combine a priori information with the results of many protocol executions to reverse engineer private data, OR
  - randomness can be introduced, at the cost of accuracy in the resulting model
• Efficiency Loss vs. Data Access
  - Multiple data holders must exchange messages privately → cryptographic operations → efficiency loss, OR
  - a single data holder must be assumed
[Comparison table: approaches [7] [28] [12] [13] [14] [25] [26] [24]]
Open Research Challenges
• Risks of Reverse Engineering
• Computation Costs
• Incorporating Different Trust and Sensitivity Levels
• Combining Secure Multiparty Computation with Differential Privacy
• Dynamic and Flexible Collaborative Learning
Acknowledgement: This research has been partially supported by the National Science Foundation under Grants CNS-1115375, NSF 1547102, SaTC 1564097, and an RCN BD Fellowship, provided by the Research Coordination Network (RCN) on Big Data and Smart Cities. The first author was awarded partial GRA support from IISP.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the RCN or National Science Foundation.
[Figure: ensemble learning — classifier 1, classifier 2, …, classifier m are trained on labeled data; the combination is learned from labeled data; at testing time the ensemble model produces final predictions on unlabeled data]
Privacy Preserving Ensemble Learning
• Differential Privacy, Secure Multiparty Computation, Quantification of Privacy
• Ensemble Learning: Supervised, Unsupervised, Semi-supervised
• Distributed vs. Centralized Privacy Preserving Ensemble Learning Architecture
• Decision Trees, Deep Neural Networks
Stacey Truex and Ling Liu