privacy protection in published data using an efficient clustering method
DESCRIPTION
Privacy protection in published data using an efficient clustering method. ICT 2010 presentation. Presented By: Md. Manzoor Murshed, Thursday, August 14, 2014. Overview of the Presentation: Introduction, Re-identification of Data, k-anonymity Model, MOKA Algorithm, Experimental results.
Privacy protection in published data using an efficient clustering method
Presented By: Md. Manzoor Murshed
Saturday, April 22, 2023
ICT 2010 presentation
23/4/22 2
Overview of the Presentation
Introduction
Re-identification of Data
k-anonymity Model
MOKA Algorithm
Experimental results
Future Work
Conclusion
Question?
An Abundance of Data
Supermarket scanners, credit card transactions, call center records, ATM machines, web server logs, customer web site trails, podcasts, blogs, closed captions
Scientific experiments, sensors, cameras, hospital visits, social networks (Facebook, Myspace, Twitter), speech-to-text translation, email, educational institutes, travel records
Print, film, optical, and magnetic storage: 5 Exabytes (EB) of new information in 2002, doubled in the last three years [How much Information 2003, UC Berkeley]
Data Holders Publish Sensitive Information to Facilitate Research.
Publish information that:
Discloses as much statistical information as possible, so that researchers can discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Preserves the privacy of the individuals contributing the data.
Question?
How do you publicly release a database without compromising individual privacy?
The Wrong Approach: Just leave out any unique identifiers like name and SSN and hope that this works.
Why? The triple (DOB, gender, zip code) suffices to uniquely identify at least 87% of US citizens in publicly available databases (1990 U.S. Census summary data).
Moral: Any real privacy guarantee must be proved and established mathematically.
Examples of Re-identification Attempts
Health-specific and General Examples of Re-identification
AOL search data
Researchers were able to reveal sensitive details of participants' private lives, such as Social Security numbers, credit-card numbers, and addresses, from the anonymized AOL Internet search data, which also contains health-related searches [8].
Chicago homicide database
A large percentage of individuals were re-identified easily by linking the Chicago homicide database with the social security death index.
Netflix movie recommendations
Several individuals were re-identified from the publicly available anonymized Netflix movie recommendations database by linking their anonymized movie ratings with ratings in a publicly available Internet movie rating web site [9].
Re-identification of the medical record
The Massachusetts governor's sensitive medical records were re-identified by linking the anonymized data of the Group Insurance Commission, which purchases health insurance for state employees, with the voter list for Cambridge [1].
Southern Illinoisan vs. The Department of Public Health
Individuals in a neuroblastoma data set from the Illinois cancer registry were re-identified with very high accuracy [4].
Canadian Adverse Event Database
The death of a 26-year-old student who had taken a particular drug was re-identified from the publicly released adverse drug reaction database of Health Canada [4].
AOL Data Release …
AOL “anonymously” released a list of 21 million web search queries.
UserIDs were replaced by random numbers …
A Face Is Exposed for AOL Searcher No. 4417749 [New York Times, August 9, 2006]
… No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men” to “dog that urinates on everything.”
And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.”
It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. “Those are my searches,” she said, after a reporter read part of the list to her. …
Re-identification of AOL data release
Ms. Arnold says she loves online research, but the disclosure of her searches has left her disillusioned. In response, she plans to drop her AOL subscription. “We all have a right to privacy,” she said. “Nobody should have found this all out.”
Source: http://data.aolsearchlogs.com
Re-identification by linking
NAHDO reported that 37 states have legislative mandates to collect hospital-level data.
GIC is responsible for purchasing health insurance.
Medical data was considered anonymous, since identifying attributes were removed.
The Governor of Massachusetts was uniquely identified by the attributes Zip, Birth Date, and Sex.
Hence, his private medical records were out in the open.
Re-identification by linking (Example)
Hospital Patient Data
DOB      Gender  Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Voter Registration Data
Name   DOB      Gender  Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237

Linking the two tables on (DOB, Gender, Zipcode) yields a single match: Andre has heart disease!
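The linking step can be sketched in a few lines of Python. This is only an illustration of the join on (DOB, gender, zip code) using the two example tables from this slide; the function name `link` and the rule of reporting only unambiguous matches are assumptions made for the sketch.

```python
# Hospital release: names removed, quasi-identifiers (DOB, gender, zip) kept.
hospital = [
    ("1/21/76", "Male", "53715", "Heart Disease"),
    ("4/13/86", "Female", "53715", "Hepatitis"),
    ("2/28/76", "Male", "53703", "Bronchitis"),
    ("1/21/76", "Male", "53703", "Broken Arm"),
    ("4/13/86", "Female", "53706", "Flu"),
    ("2/28/76", "Female", "53706", "Hang Nail"),
]

# Public voter list: names together with the same quasi-identifiers.
voters = [
    ("Andre", "1/21/76", "Male", "53715"),
    ("Beth", "1/10/81", "Female", "55410"),
    ("Carol", "10/1/44", "Female", "90210"),
    ("Dan", "2/21/84", "Male", "02174"),
    ("Ellen", "4/19/72", "Female", "02237"),
]


def link(hospital_rows, voter_rows):
    """Join both tables on (DOB, gender, zip); keep only unambiguous matches."""
    by_qi = {}
    for dob, gender, zipcode, disease in hospital_rows:
        by_qi.setdefault((dob, gender, zipcode), []).append(disease)
    matches = {}
    for name, dob, gender, zipcode in voter_rows:
        diseases = by_qi.get((dob, gender, zipcode), [])
        if len(diseases) == 1:  # exactly one hospital record fits this voter
            matches[name] = diseases[0]
    return matches


reidentified = link(hospital, voters)
print(reidentified)
```

Only Andre's combination of attributes appears in both tables, so he is the only voter whose diagnosis is exposed.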
Data Publishing and Data Privacy
Society is experiencing exponential growth in the number and variety of data collections containing person-specific information.
This collected information is valuable both in research and business. Data sharing is common.
Publishing the data may put the respondents' privacy at risk.
Objective: Maximize data utility while limiting disclosure risk to an acceptable level.
What is Privacy?
“The claim of individuals, groups, or institutions to determine for themselves when, how and to what extent information about them is communicated to others”
Westin, Privacy and Freedom, 1967
But we need quantifiable notions of privacy …
... nothing about an individual should be learnable from the database that cannot be learned without access to the database …
T. Dalenius, 1977
Quality versus anonymity
Related Works
Statistical Databases
Adding noise while maintaining some statistical invariant.
Disadvantage: destroys the integrity of the data.
Multi-level Databases
Data is stored at different security classifications, and users have different security clearances.
Restrict the release of lower-classified information to eliminate precise inference.
Disadvantages: It is impossible to consider every possible attack. Suppression can drastically reduce the quality of the data.
K-Anonymity
Sweeney came up with a formal protection model named k-anonymity.
What is k-anonymity? A release provides k-anonymity if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.
Example: Suppose you try to identify a man in a release, but the only information you have is his birth date and gender. If at least k people in the release match that birth date and gender, the release is k-anonymous with respect to those attributes.
Example of suppression and generalization
The following database:

First   Last    Age  Race
Lynn    Isvik   34   American
John    Siblic  36   Hisp
Erin    Isvik   25   Hisp
John    Deer    40   Hisp
Victor  Clark   50   Cauc
Danial  Clark   40   Cauc

Can be 2-anonymized as follows:

First  Last   Age    Race
*      Isvik  20-40  *
John   *      36-40  Hisp
*      Isvik  20-40  *
John   *      36-40  Hisp
*      Clark  40-50  Cauc
*      Clark  40-50  Cauc

Rows 1 and 3 are identical, rows 2 and 4 are identical, and rows 5 and 6 are identical.
Suppression replaces individual attribute values with a *.
Generalization replaces individual attribute values with a broader category.
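Suppression and generalization can be illustrated with a small Python sketch applied to the example table. The helper `generalize_group` and the choice of which rows to group are assumptions made for illustration, and the sketch uses the tightest age range that covers a group (for example 25-34) rather than a rounded range such as 20-40.

```python
def generalize_group(rows):
    """Return one record that every row of the group can be replaced with.

    A column is kept if all rows agree on it, generalized to a "min-max"
    range if it is numeric, and suppressed with "*" otherwise.
    """
    merged = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        if len(set(values)) == 1:
            merged[col] = str(values[0])                    # no change needed
        elif all(isinstance(v, int) for v in values):
            merged[col] = f"{min(values)}-{max(values)}"    # generalization
        else:
            merged[col] = "*"                               # suppression
    return merged


table = [
    {"First": "Lynn", "Last": "Isvik", "Age": 34, "Race": "American"},
    {"First": "John", "Last": "Siblic", "Age": 36, "Race": "Hisp"},
    {"First": "Erin", "Last": "Isvik", "Age": 25, "Race": "Hisp"},
    {"First": "John", "Last": "Deer", "Age": 40, "Race": "Hisp"},
    {"First": "Victor", "Last": "Clark", "Age": 50, "Race": "Cauc"},
    {"First": "Danial", "Last": "Clark", "Age": 40, "Race": "Cauc"},
]

# Assumed grouping (0-based row indices): rows 1 and 3, 2 and 4, 5 and 6.
groups = [(0, 2), (1, 3), (4, 5)]

anonymized = [None] * len(table)
for group in groups:
    merged = generalize_group([table[i] for i in group])
    for i in group:
        anonymized[i] = merged

for row in anonymized:
    print(row)
```

Every row now shares its values with one other row, so the result is 2-anonymous over all four columns.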
K-Anonymity Protection Model
Definition 1 (Quasi-identifier): A set of non-sensitive attributes $\{Q_1, \ldots, Q_w\}$ of a table that can be linked with external data to uniquely identify at least one individual in the general population is known as a quasi-identifier.
Definition 2 (k-anonymity requirement): Each release of data must be such that every combination of values of quasi-identifiers can be indistinctly matched to at least k respondents.
Definition 3 (k-Anonymity): A table $T$ satisfies k-anonymity if for every tuple $t \in T$ there exist $k-1$ other tuples $t_{i_1}, t_{i_2}, \ldots, t_{i_{k-1}} \in T$ such that $t[C] = t_{i_1}[C] = t_{i_2}[C] = \cdots = t_{i_{k-1}}[C]$ for all $C \in QI$.
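Definition 3 translates directly into a check that counts how often each combination of quasi-identifier values occurs. The following is a minimal sketch; the function name `is_k_anonymous` and the sample release are hypothetical and serve only to illustrate the definition.

```python
from collections import Counter


def is_k_anonymous(table, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in table
    )
    return all(n >= k for n in counts.values())


# Hypothetical generalized release; "Disease" is the sensitive attribute.
release = [
    {"Age": "20-30", "Zip": "537**", "Disease": "Flu"},
    {"Age": "20-30", "Zip": "537**", "Disease": "Hepatitis"},
    {"Age": "30-40", "Zip": "537**", "Disease": "Bronchitis"},
    {"Age": "30-40", "Zip": "537**", "Disease": "Broken Arm"},
]

print(is_k_anonymous(release, ["Age", "Zip"], 2))
print(is_k_anonymous(release, ["Age", "Zip"], 3))
```

The sample release is 2-anonymous but not 3-anonymous, because each (Age, Zip) combination occurs exactly twice.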
Metrics used for the Algorithm
[Figure: tree with root "person" and children "Asian" and "Non-Asian"]
Information loss: $L(P_i) = |P_i| \cdot D(P_i)$
The table has m numeric quasi-identifiers $N_1, N_2, \ldots, N_m$ and q categorical quasi-identifiers $C_1, C_2, \ldots, C_q$.
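The slide does not spell out D. A definition commonly used in clustering-based k-anonymization adds, for each numeric quasi-identifier, the cluster's value range divided by the table's value range, and, for each categorical quasi-identifier, the height of the lowest common ancestor in the taxonomy tree divided by the height of the tree. The sketch below assumes that definition and, as a further simplification, a two-level taxonomy (such as person with children Asian and Non-Asian), so the categorical term is 0 when all values agree and 1 otherwise. The sample records are hypothetical.

```python
def distortion(cluster, table, numeric, categorical):
    """D(P): assumed distortion of one cluster (see the assumptions above)."""
    d = 0.0
    for col in numeric:
        table_range = max(r[col] for r in table) - min(r[col] for r in table)
        cluster_range = max(r[col] for r in cluster) - min(r[col] for r in cluster)
        if table_range > 0:
            d += cluster_range / table_range
    for col in categorical:
        # Two-level taxonomy: differing values generalize to the root.
        if len({r[col] for r in cluster}) > 1:
            d += 1.0
    return d


def information_loss(cluster, table, numeric, categorical):
    """L(P) = |P| * D(P), as on the slide."""
    return len(cluster) * distortion(cluster, table, numeric, categorical)


# Hypothetical records with one numeric and one categorical quasi-identifier.
table = [
    {"Age": 20, "Race": "Asian"},
    {"Age": 30, "Race": "Asian"},
    {"Age": 40, "Race": "Asian"},
    {"Age": 60, "Race": "Non-Asian"},
]
clusters = [table[:2], table[2:]]

losses = [information_loss(c, table, ["Age"], ["Race"]) for c in clusters]
total_loss = sum(losses)
print(losses, total_loss)
```

The second cluster costs more because it spans a wider age range and mixes both categories, which is the effect that clustering similar records together is meant to reduce.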
Pseudocode of the MOKA Algorithm
// clustering stage
Sort all the records in the table using the non-sensitive attributes
Set the number of clusters K = number of records in the table / k value of k-anonymity
Remove and set every kth record of the table as the starting record of each cluster
Find and assign the rest of the records of the table to their nearest cluster
// adjusting stage
Find the clusters G that have more than k records
Sort the records of G
Remove from G, and place in R, all the records beyond the kth location
Find and assign every record of R to the closest cluster of size less than k
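The two stages can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the presenter's implementation. It assumes numeric records, Euclidean distance to a cluster's centroid as the meaning of "nearest", sorting an oversized cluster by distance to its centroid, and assigning any leftover record to the nearest cluster once no cluster is smaller than k. The function name `moka_sketch` and the sample records are hypothetical.

```python
import math


def centroid(cluster):
    """Component-wise mean of the records in a cluster."""
    dims = len(cluster[0])
    return tuple(sum(r[d] for r in cluster) / len(cluster) for d in range(dims))


def nearest(record, clusters, candidates):
    """Index (from candidates) of the cluster whose centroid is closest."""
    return min(candidates, key=lambda i: math.dist(record, centroid(clusters[i])))


def moka_sketch(records, k):
    # Clustering stage
    ordered = sorted(records)
    num_clusters = len(ordered) // k
    if num_clusters == 0:
        raise ValueError("need at least k records")
    # Every kth record (k, 2k, ...) seeds one cluster.
    seed_idx = {(i + 1) * k - 1 for i in range(num_clusters)}
    clusters = [[ordered[i]] for i in sorted(seed_idx)]
    for i, rec in enumerate(ordered):
        if i not in seed_idx:
            clusters[nearest(rec, clusters, range(num_clusters))].append(rec)

    # Adjusting stage
    removed = []
    for cluster in clusters:
        if len(cluster) > k:
            center = centroid(cluster)
            cluster.sort(key=lambda r: math.dist(r, center))
            removed.extend(cluster[k:])   # records beyond the kth location
            del cluster[k:]
    for rec in removed:
        small = [i for i, c in enumerate(clusters) if len(c) < k]
        # Assumption: if no cluster is below k, use the nearest cluster overall.
        target = nearest(rec, clusters, small or range(num_clusters))
        clusters[target].append(rec)
    return clusters


# Hypothetical (age, years of education) records.
records = [(21, 12), (23, 12), (25, 13), (40, 16), (42, 16), (45, 18), (60, 10)]
clusters = moka_sketch(records, 3)
for cluster in clusters:
    print(cluster)
```

Because the table holds at least K * k records, refilling the undersized clusters first leaves every cluster with at least k records, which is what k-anonymization of each cluster requires.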
(MOKA) ALGORITHM
Experimental results
Future work
ℓ-diversity: addresses the homogeneity attack and the background knowledge attack
t-closeness: addresses the skewness attack
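The homogeneity attack can be shown with a short check for distinct ℓ-diversity, the simplest variant of the model: every group of records sharing the same quasi-identifier values must contain at least ℓ distinct sensitive values. The function name and the sample release below are hypothetical and only sketch the idea.

```python
from collections import defaultdict


def is_distinct_l_diverse(table, quasi_identifiers, sensitive, ell):
    """True if each quasi-identifier group has at least ell distinct sensitive values."""
    groups = defaultdict(set)
    for row in table:
        key = tuple(row[col] for col in quasi_identifiers)
        groups[key].add(row[sensitive])
    return all(len(values) >= ell for values in groups.values())


# Hypothetical 2-anonymous release: the first group is homogeneous.
release = [
    {"Age": "20-30", "Zip": "537**", "Disease": "Flu"},
    {"Age": "20-30", "Zip": "537**", "Disease": "Flu"},
    {"Age": "30-40", "Zip": "537**", "Disease": "Hepatitis"},
    {"Age": "30-40", "Zip": "537**", "Disease": "Bronchitis"},
]

print(is_distinct_l_diverse(release, ["Age", "Zip"], "Disease", 2))
```

The release is 2-anonymous, yet anyone known to be in the 20-30 group is revealed to have the flu, so it fails the check for ℓ = 2.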
Conclusion
The k-anonymity protection model can prevent identity disclosure, but a lack of diversity in the values of the sensitive attribute breaks the protection mechanism.
Clustering similar records together before anonymization can lower the information loss due to generalization.
In this research we propose a modified clustering method for k-anonymization.
We compared our algorithm with the k-means algorithm and obtained lower information loss in some cases.
We plan to change some parameters of our algorithm and compare its performance with other similar algorithms.
Questions?
Thank you!
References
Sweeney, “k-anonymity: a model for protecting privacy”, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.
Jun-Lin Lin, Meng-Cheng Wei, “An efficient clustering method for k-anonymization”, Proceedings of the 2008 international workshop on Privacy and anonymity in information society, ACM, PAIS 2008.
Jun-Lin Lin, Meng-Cheng Wei, Chih-Wen Li, Kuo-Chiang Hsieh, “A hybrid Method for k-anonymization”, Proceedings of the 2008 IEEE Asia-Pacific Services Computing Conference, APSCC 2008, pp. 385-390.
Khaled El Emam, Fida Kamal Dankar, “Protecting Privacy Using K-anonymity”, Journal of the American Medical Informatics Association, Volume 15, Number 5, September/October 2008.
Sweeney, “Computational Disclosure Control: A primer on data privacy Protection”, PhD thesis, Massachusetts Institute of Technology, 2001.
Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan, “Incognito: Efficient Full Domain k-anonymity”, Proceedings of SIGMOD 2005, June 14-16, 2005, Baltimore, Maryland, USA, pp. 49-60.
Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan, “Mondrian Multidimensional K-Anonymity”, technical report, University of Wisconsin, Madison.
Robert Lemos, “Researchers reverse Netflix anonymization”, SecurityFocus, 2007-12-04 (http://www.privacyanalytics.ca/news/netflix.pdf).
S. Hettich and S. D. Bay, The UCI KDD Archive, 1999, http://kdd.ics.uci.edu
Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, Muthuramakrishnan Venkitasubramaniam, “ℓ-diversity: Privacy Beyond k-Anonymity”, IEEE International Conference on Data Engineering, 2006.
Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian, “t-Closeness: Privacy Beyond k-Anonymity and ℓ-diversity”, Proceedings of IEEE 23rd Int'l Conference on Data Engineering (ICDE), 2007.