Tools for Privacy Preserving Distributed Data Mining
By Michael Holmes
Why Private Data Mining
❖ The CDC may want to use data mining techniques to identify trends in disease outbreaks.
❖ Insurance companies have useful data but can’t disclose it because of privacy concerns.
❖ Is there a way to obtain this data without revealing the identity of the patients?
Private Data Mining Techniques
❖ Secure Sum
❖ Secure Set Union
❖ Secure Size of Set Intersection
❖ Scalar Product
Private Data Mining Toolkit
❖ Association Rules in horizontally partitioned data
❖ Association Rules in vertically partitioned data
❖ EM Clustering
Secure Sum
❖ Securely compute the sum of values held in separate databases.
❖ Site 1 randomly generates a number R.
❖ Site 1 adds R to each of its N values and sends the masked values to Site 2.
❖ Site 2 adds each of its own values to the values sent from Site 1 and returns a single number to Site 1.
❖ Site 1 then subtracts R N times (N × R in total) to find the correct sum.
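The masking idea can be sketched in a few lines of Python. This is a minimal sketch of a common ring-style formulation, not necessarily the exact variant on the slide: it assumes each site contributes a single non-negative value, that the true sum is smaller than a public `modulus`, and that the sites pass a running total around in order. The function name `secure_sum` and the parameter `modulus` are illustrative assumptions.

```python
import secrets

def secure_sum(site_values, modulus=2**64):
    """Simulate a ring-style secure sum; site_values[i] is site i's private value.

    Assumes non-negative values whose true sum is below `modulus`.
    """
    # Site 1 picks a random mask R, uniform in [0, modulus).
    r = secrets.randbelow(modulus)
    # Site 1 sends its masked value to the next site.
    running = (r + site_values[0]) % modulus
    # Each later site adds its own value and passes the total on.
    # Each site only ever sees a uniformly random-looking number.
    for value in site_values[1:]:
        running = (running + value) % modulus
    # The last site returns the total to site 1, which removes the mask.
    return (running - r) % modulus

total = secure_sum([12, 7, 30])
```

Working modulo a public bound keeps every intermediate total uniformly distributed, so a site learns nothing from the number it receives unless it colludes with its neighbours.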
Secure Set Union
Secure Size of Set Intersection
❖ Relies on commutative encryption, where the order in which keys are applied does not change the result.
❖ Every party encrypts its data and then sends it to another party.
❖ The next party also encrypts the already encrypted data.
❖ Once every party has encrypted the data from every other party, an item held by all parties produces the same ciphertext in every set, so only these duplicates are shared.
❖ Count the duplicates and you know the size of the intersection.
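The steps above can be simulated with a toy commutative cipher. This is a rough sketch, assuming exponentiation modulo a prime as the commutative encryption (the slides do not name a specific scheme); the prime `P`, the hashing step `to_group`, and all function names are illustrative assumptions, and the parameters are not suitable for real use.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy parameter chosen for illustration

def keygen():
    # Pick an exponent coprime to P - 1 so that x -> x^k mod P is a bijection.
    while True:
        k = secrets.randbelow(P - 2) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def to_group(item):
    # Hash an item to an integer in [1, P - 1].
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def encrypt(value, key):
    # Commutative: encrypt(encrypt(x, a), b) == encrypt(encrypt(x, b), a).
    return pow(value, key, P)

def intersection_size(sets):
    keys = [keygen() for _ in sets]  # one secret key per party
    n = len(sets)
    fully_encrypted = []
    for i, items in enumerate(sets):
        cipher = {to_group(x) for x in items}
        # Party i encrypts first, then the data travels around the ring,
        # so each set is encrypted with the keys in a different order.
        for j in range(n):
            cipher = {encrypt(c, keys[(i + j) % n]) for c in cipher}
        fully_encrypted.append(cipher)
    # Items held by every party yield identical ciphertexts.
    return len(set.intersection(*fully_encrypted))

size = intersection_size([{"a", "b", "c"}, {"b", "c", "d"}, {"c", "b", "e"}])
```

Because only fully encrypted values are compared, the parties can count the matches without learning which items matched.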
Scalar Product
❖ Want to compute the sum of x_i * y_i over all i, where the vector x is held by one database and the vector y by the other.
❖ Use linear combinations of random numbers to disguise the elements, then remove the random terms algebraically once you get the result.
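To make the "disguise, then remove" idea concrete, here is a small sketch of a different, simpler variant than the linear-combination protocol the slide refers to: it assumes a hypothetical helper party that hands out correlated random values in advance. The names `private_scalar_product`, `mask_a`, `mask_b`, `ra`, and `rb` are illustrative, and Python's `random` module is used only for brevity; a real protocol would need cryptographic randomness over a finite field.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def private_scalar_product(x, y, rng=None):
    """Return shares (u, v) with u + v == x . y (simulation of both parties)."""
    rng = rng or random.Random()
    n = len(x)
    # Hypothetical helper party: random masks with ra + rb == mask_a . mask_b.
    mask_a = [rng.randrange(1, 1000) for _ in range(n)]
    mask_b = [rng.randrange(1, 1000) for _ in range(n)]
    ra = rng.randrange(1, 1000)
    rb = dot(mask_a, mask_b) - ra
    # Party A (holds x, mask_a, ra) sends a disguised copy of x to party B.
    x_masked = [xi + m for xi, m in zip(x, mask_a)]
    # Party B (holds y, mask_b, rb) sends a disguised copy of y to party A.
    y_masked = [yi + m for yi, m in zip(y, mask_b)]
    # Party B picks its own random share v and sends t to party A.
    v = rng.randrange(1, 1000)
    t = dot(x_masked, y) + rb - v
    # Party A cancels the random terms; what remains is x . y - v.
    u = t - dot(mask_a, y_masked) + ra
    return u, v

u, v = private_scalar_product([1, 2, 3], [4, 5, 6])
```

Neither share reveals the result alone; adding `u` and `v` gives the scalar product.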
Association Rules in Horizontally Partitioned Data
❖ Candidate Set Generation
❖ Local Pruning
❖ Itemset Exchange (Secure Union Step here)
❖ Support Count Exchange
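The four steps can be sketched for one round as follows. This is an illustrative, non-private simulation: the plain set union and plain sums below mark the places where the secure union and secure sum would be used, and the function names and the fractional `min_support` parameter are assumptions. It relies on the fact that an itemset that is frequent globally must be frequent at one site at least.

```python
from itertools import combinations

def local_support(transactions, itemset):
    # Number of local transactions that contain every item in the itemset.
    return sum(1 for t in transactions if itemset <= t)

def globally_frequent(sites, candidates, min_support):
    # Local pruning: each site keeps only the candidates it finds frequent.
    local_sets = [
        {c for c in candidates if local_support(db, c) >= min_support * len(db)}
        for db in sites
    ]
    # Itemset exchange: a secure set union would replace this plain union.
    union = set().union(*local_sets)
    # Support count exchange: secure sums would replace these plain sums.
    total = sum(len(db) for db in sites)
    return {
        c for c in union
        if sum(local_support(db, c) for db in sites) >= min_support * total
    }

# Candidate set generation: all 2-itemsets over items a, b, c.
candidates = [frozenset(pair) for pair in combinations("abc", 2)]
sites = [
    [{"a", "b"}, {"a", "b", "c"}, {"a"}],
    [{"b", "c"}, {"a", "b"}, {"c"}],
]
frequent = globally_frequent(sites, candidates, min_support=0.5)
```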
Association Rules in Vertically Partitioned Data
❖ Uses scalar product to determine if the count of an item set is greater than a threshold
❖ If the count is above the threshold you’ve determined that the database is worth querying
❖ Can also use Secure Size of Set Intersection to see how much the sites have in common.
❖ Useful with algorithms such as the Apriori algorithm.
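To see why a scalar product gives the count, assume each site holds one attribute as a 0/1 column with one entry per transaction, in the same transaction order. The sketch below computes the product in the clear for illustration only; in the private setting the scalar product protocol would replace the plain sum. The function names are hypothetical.

```python
def support_count(column_a, column_b):
    # A transaction contains both items exactly when both entries are 1,
    # so the scalar product counts the transactions supporting the itemset.
    return sum(a * b for a, b in zip(column_a, column_b))

def is_frequent(column_a, column_b, threshold):
    return support_count(column_a, column_b) > threshold

site_a = [1, 0, 1, 1, 0]  # item A present per transaction (held by site A)
site_b = [1, 1, 1, 0, 0]  # item B present per transaction (held by site B)
count = support_count(site_a, site_b)
```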
EM Clustering
❖ Uses secure sum to combine the local values from all sites involved into a single global value.
❖ Once the global sum is computed, it can be used in the expectation-maximization method to generate statistical models.
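As a rough illustration of where the global sum enters, consider updating the mean of one cluster in one dimension. The sketch assumes each site can compute two local statistics (a responsibility-weighted sum of its points and the total responsibility) and that only the sums of these across sites are needed. The plain sums below stand in for secure sums, and the function names are hypothetical.

```python
def local_statistics(points, responsibilities):
    # Computed privately at one site for one cluster.
    weighted_sum = sum(r * x for x, r in zip(points, responsibilities))
    weight = sum(responsibilities)
    return weighted_sum, weight

def global_mean(site_stats):
    # In the private setting each of these totals would be a secure sum,
    # so no site reveals its own local statistics.
    total_weighted = sum(s[0] for s in site_stats)
    total_weight = sum(s[1] for s in site_stats)
    return total_weighted / total_weight

stats = [
    local_statistics([1.0, 3.0], [1.0, 1.0]),  # site 1
    local_statistics([5.0], [1.0]),            # site 2
]
mean = global_mean(stats)
```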
Things to Note
❖ These algorithms are not fully private; some information is learned in the process.
❖ For example, in the set intersection protocol, sites can potentially learn the size of each database.
❖ Make sure to pick the appropriate algorithms for what you need to accomplish
❖ Watch out for intermediate information being leaked!
Thank you