Privacy Protection in Published Data Using an Efficient Clustering Method


Presented by: Md. Manzoor Murshed

Saturday, April 22, 2023

ICT 2010 presentation


Overview of the Presentation

Introduction
Re-identification of Data
k-anonymity Model
MOKA Algorithm
Experimental Results
Future Work
Conclusion
Questions?

An Abundance of Data

Supermarket scanners
Credit card transactions
Call center records
ATM machines
Web server logs
Customer web site trails
Podcasts
Blogs
Closed captions
Scientific experiments
Sensors, cameras
Hospital visits
Social networks: Facebook, Myspace, Twitter
Speech-to-text translation
Email
Educational institutes
Travel records

Print, film, optical, and magnetic storage: 5 Exabytes (EB) of new information in 2002, doubled in the last three years [How much Information 2003, UC Berkeley]

Data Holders Publish Sensitive Information to Facilitate Research

Publish information that:
  Discloses as much statistical information as possible (to discover valid, novel, potentially useful, and ultimately understandable patterns in data).
  Preserves the privacy of the individuals contributing the data.

Question?

How do you publicly release a database without compromising individual privacy?

The Wrong Approach: Just leave out any unique identifiers like name and SSN and hope that this works.

Why? The triple (DOB, gender, zip code) suffices to uniquely identify at least 87% of US citizens in publicly available databases (1990 U.S. Census summary data).

Moral: Any real privacy guarantee must be proved and established mathematically.

Examples of Re-identification Attempts

Health-specific and General Examples of Re-identification

AOL search data

Researchers were able to reveal sensitive details of the participants' private lives, such as Social Security numbers, credit card numbers, and addresses, from the anonymized AOL Internet search data, which also contains health-related searches [8].

Chicago homicide database

A large percentage of individuals were re-identified easily by linking the Chicago homicide database with the social security death index.

Netflix movie recommendations

Several individuals were re-identified from the publicly available anonymized Netflix movie recommendations database by linking their anonymized movie ratings with ratings in a publicly available Internet movie rating web site [9].

Re-identification of the medical record

The Massachusetts governor's sensitive medical records were re-identified by linking the anonymized data of the Group Insurance Commission, which purchases health insurance for state employees, with the voter list for Cambridge [1].

Southern Illinoisan vs. The Department of Public Health

Individuals in a neuroblastoma data set from the Illinois cancer registry were re-identified with very high accuracy [4].

Canadian Adverse Event Database

The death of a 26-year-old student who had taken a particular drug was re-identified from the publicly released adverse drug reaction database of Health Canada [4].

AOL Data Release …

AOL “anonymously” released a list of 21 million web search queries.

User IDs were replaced by random numbers …

A Face Is Exposed for AOL Searcher No. 4417749 [New York Times, August 9, 2006]

… No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men” to “dog that urinates on everything.” And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.” It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. “Those are my searches,” she said, after a reporter read part of the list to her. …

Re-identification of AOL data release

Ms. Arnold says she loves online research, but the disclosure of her searches has left her disillusioned. In response, she plans to drop her AOL subscription. “We all have a right to privacy,” she said. “Nobody should have found this all out.”

Source: http://data.aolsearchlogs.com


Re-identification by linking

NAHDO reported that 37 states have legislative mandates to collect hospital-level data.

The Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees.

The medical data was considered anonymous, since identifying attributes had been removed.

The Governor of Massachusetts was uniquely identified by the attributes ZIP code, birth date, and sex.

Hence, his private medical records were out in the open.

Re-identification by linking (Example)

Hospital Patient Data

DOB      Gender  Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Voter Registration Data

Name   DOB      Gender  Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237

Linking the two tables on DOB, gender, and zipcode reveals: Andre has heart disease!
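The linking step in this example can be sketched in a few lines of Python. This is only an illustration of the attack on the two small tables from the slide; the function name `link_records` and the dictionary keys are hypothetical choices, not part of the presentation.

```python
# Illustrative sketch of a linking attack on the slide's example tables.
# The function name and field names are assumptions made for this example.

hospital = [
    {"dob": "1/21/76", "gender": "Male", "zip": "53715", "disease": "Heart Disease"},
    {"dob": "4/13/86", "gender": "Female", "zip": "53715", "disease": "Hepatitis"},
    {"dob": "2/28/76", "gender": "Male", "zip": "53703", "disease": "Bronchitis"},
    {"dob": "1/21/76", "gender": "Male", "zip": "53703", "disease": "Broken Arm"},
    {"dob": "4/13/86", "gender": "Female", "zip": "53706", "disease": "Flu"},
    {"dob": "2/28/76", "gender": "Female", "zip": "53706", "disease": "Hang Nail"},
]

voters = [
    {"name": "Andre", "dob": "1/21/76", "gender": "Male", "zip": "53715"},
    {"name": "Beth", "dob": "1/10/81", "gender": "Female", "zip": "55410"},
    {"name": "Carol", "dob": "10/1/44", "gender": "Female", "zip": "90210"},
    {"name": "Dan", "dob": "2/21/84", "gender": "Male", "zip": "02174"},
    {"name": "Ellen", "dob": "4/19/72", "gender": "Female", "zip": "02237"},
]


def link_records(released, external, quasi_identifiers):
    """Return (name, disease) pairs for people matching exactly one released row."""
    # Index the released rows by their quasi-identifier values.
    index = {}
    for row in released:
        key = tuple(row[a] for a in quasi_identifiers)
        index.setdefault(key, []).append(row)

    matches = []
    for person in external:
        hits = index.get(tuple(person[a] for a in quasi_identifiers), [])
        if len(hits) == 1:  # a unique match re-identifies the person
            matches.append((person["name"], hits[0]["disease"]))
    return matches


print(link_records(hospital, voters, ("dob", "gender", "zip")))
```

Only Andre's (DOB, gender, zipcode) combination occurs in both tables, and it occurs exactly once in the hospital data, which is why the match is unambiguous.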


Data Publishing and Data Privacy

Society is experiencing exponential growth in the number and variety of data collections containing person-specific information.

This collected information is valuable both in research and in business. Data sharing is common.

Publishing the data may put the respondents' privacy at risk.

Objective: Maximize data utility while limiting disclosure risk to an acceptable level.

What is Privacy?

“The claim of individuals, groups, or institutions to determine for themselves when, how and to what extent information about them is communicated to others”

Westin, Privacy and Freedom, 1967

But we need quantifiable notions of privacy …

... nothing about an individual should be learnable from the database that cannot be learned without access to the database …

T. Dalenius, 1977

Quality versus anonymity


Related Works

Statistical Databases
  Add noise while maintaining some statistical invariant.
  Disadvantage: destroys the integrity of the data.

Multi-level Databases
  Data is stored at different security classifications, and users have different security clearances.
  Restrict the release of lower-classified information.
  Eliminate precise inference.
  Disadvantages: It is impossible to consider every possible attack. Suppression can drastically reduce the quality of the data.

K-Anonymity

Sweeney came up with a formal protection model named k-anonymity.

What is k-anonymity? A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 other individuals whose information also appears in the release.

Example: Suppose you try to identify a man in a release, and the only information you have is his birth date and gender. If at least k people in the release match that birth date and gender, the release is k-anonymous with respect to those attributes.

Example of suppression and generalization

The following database:

First   Last    Age  Race
Lynn    Isvik   34   American
John    Siblic  36   Hisp
Erin    Isvik   25   Hisp
John    Deer    40   Hisp
Victor  Clark   50   Cauc
Danial  Clark   40   Cauc

Can be 2-anonymized as follows:

First   Last    Age    Race
*       Isvik   20-40  *
John    *       20-40  Hisp
*       Isvik   20-40  *
John    *       20-40  Hisp
*       Clark   40-50  Cauc
*       Clark   40-50  Cauc

Rows 1 and 3 are identical, rows 2 and 4 are identical, and rows 5 and 6 are identical.

Suppression replaces individual attribute values with a *.
Generalization replaces individual attribute values with a broader category.
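A minimal Python sketch of suppression and generalization on the example table follows. It assumes the grouping of rows into pairs is already given (rows 1 and 3, rows 2 and 4, rows 5 and 6); the function names `recode_group` and `anonymize` are hypothetical. Unlike the slide, which uses rounded age intervals, this sketch generalizes numeric values to the tightest min-max range within each group.

```python
# Illustrative sketch: suppression ("*") and generalization (ranges) applied
# to pre-chosen groups of rows. Names and grouping are assumptions.

table = [
    ("Lynn", "Isvik", 34, "American"),
    ("John", "Siblic", 36, "Hisp"),
    ("Erin", "Isvik", 25, "Hisp"),
    ("John", "Deer", 40, "Hisp"),
    ("Victor", "Clark", 50, "Cauc"),
    ("Danial", "Clark", 40, "Cauc"),
]


def recode_group(rows):
    """Make all rows of a group identical, attribute by attribute."""
    recoded = []
    for column in zip(*rows):
        if len(set(column)) == 1:
            recoded.append(column[0])                     # values agree: keep
        elif all(isinstance(v, int) for v in column):
            recoded.append(f"{min(column)}-{max(column)}")  # generalize to a range
        else:
            recoded.append("*")                           # suppress
    return tuple(recoded)


def anonymize(rows, groups):
    """Replace every row by the recoded version of its group."""
    result = list(rows)
    for group in groups:
        recoded = recode_group([rows[i] for i in group])
        for i in group:
            result[i] = recoded
    return result


# Zero-based row indices: rows (1, 3), (2, 4), and (5, 6) of the slide.
for row in anonymize(table, [(0, 2), (1, 3), (4, 5)]):
    print(row)
```

Because every group has two members and all members of a group receive the same recoded row, the output is 2-anonymous over all four attributes.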

K-Anonymity Protection Model

Definition 1 (Quasi-identifier): A set of non-sensitive attributes {Q1, ..., Qw} of a table that can be linked with external data to uniquely identify at least one individual in the general population is known as a quasi-identifier.

Definition 2 (k-anonymity requirement): Each release of data must be such that every combination of values of quasi-identifiers can be indistinctly matched to at least k respondents.

Definition 3 (k-anonymity): A table T satisfies k-anonymity if, for every tuple t in T, there exist k-1 other tuples t_i1, t_i2, ..., t_i(k-1) in T such that t[C] = t_i1[C] = t_i2[C] = ... = t_i(k-1)[C] for all C in QI.
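Definition 3 translates directly into a counting check: group the tuples by their quasi-identifier values and verify that every group has at least k members. The sketch below assumes rows are tuples and quasi-identifiers are given as column positions; the function name `is_k_anonymous` and the sample rows are hypothetical.

```python
from collections import Counter

# Illustrative check of Definition 3. Function name, column layout, and the
# sample table are assumptions made for this example.


def is_k_anonymous(table, qi_columns, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(row[c] for c in qi_columns) for row in table)
    return all(count >= k for count in counts.values())


# Hypothetical released table: (age range, zip prefix, disease).
released = [
    ("20-40", "537**", "Flu"),
    ("20-40", "537**", "Hepatitis"),
    ("40-50", "537**", "Bronchitis"),
    ("40-50", "537**", "Heart Disease"),
]

print(is_k_anonymous(released, qi_columns=(0, 1), k=2))
print(is_k_anonymous(released, qi_columns=(0, 1), k=3))
```

The sensitive attribute (disease) is deliberately left out of `qi_columns`, since k-anonymity only constrains the quasi-identifiers.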

Metrics used for the Algorithm

[Figure: taxonomy tree with root "person" and children "Asian" and "Non-Asian"]

Information loss: L(Pi) = |Pi| * D(Pi)

with m numeric quasi-identifiers N1, N2, ..., Nm and q categorical quasi-identifiers C1, C2, ..., Cq.
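The slide does not spell out D(Pi). The sketch below assumes the definition commonly used in the clustering-based k-anonymization literature: for each numeric quasi-identifier, the cluster's value range divided by the range over the whole table; for each categorical quasi-identifier, the height of the smallest taxonomy subtree covering the cluster's values divided by the height of the taxonomy tree. For a two-level taxonomy such as person → {Asian, Non-Asian}, the categorical term is 0 when all values agree and 1 otherwise. All function names are hypothetical, and the formula for D is an assumption rather than something stated in the presentation.

```python
# Illustrative sketch of L(P) = |P| * D(P) under an ASSUMED definition of D(P)
# (normalized numeric range plus a two-level taxonomy cost). Names are made up.


def numeric_term(cluster, table, col):
    """Range of the cluster's values divided by the range over the whole table."""
    values = [row[col] for row in cluster]
    all_values = [row[col] for row in table]
    full_range = max(all_values) - min(all_values)
    return (max(values) - min(values)) / full_range if full_range else 0.0


def categorical_term(cluster, col):
    """Two-level taxonomy: cost 0 if all values agree, else 1 (generalize to root)."""
    return 0.0 if len({row[col] for row in cluster}) == 1 else 1.0


def information_loss(cluster, table, numeric_cols, categorical_cols):
    """L(P) = |P| * D(P) for one cluster P of the table."""
    d = sum(numeric_term(cluster, table, c) for c in numeric_cols)
    d += sum(categorical_term(cluster, c) for c in categorical_cols)
    return len(cluster) * d


# Hypothetical table: (age, origin).
table = [(20, "Asian"), (30, "Asian"), (40, "Non-Asian"), (60, "Non-Asian")]

print(information_loss(table[:2], table, numeric_cols=(0,), categorical_cols=(1,)))
print(information_loss(table[1:3], table, numeric_cols=(0,), categorical_cols=(1,)))
```

The second cluster mixes "Asian" and "Non-Asian", so its origin attribute would have to be generalized to "person", which is why its loss is higher even though the age ranges have the same width.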

Pseudocode of the MOKA Algorithm

// clustering stage
Sort all the records in the table using the non-sensitive attributes
Set the number of clusters K = number of records in the table / k value of k-anonymity
Remove every kth record of the table and set it as the starting record of a cluster
Find and assign each of the remaining records of the table to its nearest cluster
// adjusting stage
Find the clusters G that have more than k records
Sort the records of each cluster in G
Remove all records of G beyond the kth position and place them in R
Find and assign every record of R to the closest cluster of size less than k
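The two stages above can be sketched in Python as follows. This is an interpretation of the pseudocode, not the authors' implementation, and it rests on several assumptions: records are tuples of numeric quasi-identifier values, distance is squared Euclidean distance to a cluster's starting record (not to a running centroid), the starting records are those at sorted positions 0, k, 2k, ..., records in an oversized cluster are sorted by distance to the starting record, and a surplus record goes to the nearest cluster overall once no cluster smaller than k remains. The names `moka_sketch` and `sq_dist` are hypothetical.

```python
# Illustrative sketch of the two-stage clustering described in the pseudocode.
# All names and the distance/seed choices are assumptions (see lead-in).


def sq_dist(a, b):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def moka_sketch(records, k):
    ordered = sorted(records)
    n_clusters = len(ordered) // k
    if n_clusters == 0:
        raise ValueError("need at least k records")

    # Clustering stage: every k-th record starts a cluster.
    seed_positions = set(range(0, n_clusters * k, k))
    seeds = [ordered[i] for i in sorted(seed_positions)]
    clusters = [[seed] for seed in seeds]
    for i, rec in enumerate(ordered):
        if i in seed_positions:
            continue
        nearest = min(range(n_clusters), key=lambda c: sq_dist(rec, seeds[c]))
        clusters[nearest].append(rec)

    # Adjusting stage: trim clusters larger than k, collecting the surplus in R.
    surplus = []
    for c, members in enumerate(clusters):
        if len(members) > k:
            members.sort(key=lambda r: sq_dist(r, seeds[c]))
            surplus.extend(members[k:])
            del members[k:]

    # Reassign each surplus record to the closest cluster of size less than k;
    # if none is left, fall back to the closest cluster overall.
    for rec in surplus:
        small = [c for c in range(n_clusters) if len(clusters[c]) < k]
        candidates = small or range(n_clusters)
        nearest = min(candidates, key=lambda c: sq_dist(rec, seeds[c]))
        clusters[nearest].append(rec)
    return clusters


ages = [(1,), (2,), (3,), (50,), (51,), (52,), (53,), (54,)]
for cluster in moka_sketch(ages, k=2):
    print(cluster)
```

Under these assumptions every cluster ends with at least k records: trimming leaves each oversized cluster with exactly k, and the surplus is always large enough to fill the undersized ones. Each cluster can then be generalized to a single quasi-identifier value to obtain a k-anonymous table.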

MOKA Algorithm

Experimental results

Future work

ℓ-diversity
  Homogeneity attack
  Background knowledge attack

t-closeness
  Skewness attack

Conclusion

The k-anonymity protection model can prevent identity disclosure, but a lack of diversity in the values of the sensitive attribute can break the protection mechanism.

Clustering similar records together before anonymization can lower the information loss due to generalization.

In this research we propose a modified clustering method for k-anonymization.

We compared our algorithm with the k-means algorithm and obtained lower information loss in some cases.

We plan to change some parameters of our algorithm and compare its performance with other similar algorithms.

Questions?

Thank you!

References

Sweeney, “k-anonymity: a model for protecting privacy”, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.

Jun-Lin Lin, Meng-Cheng Wei, “An efficient clustering method for k-anonymization”, Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society, ACM, PAIS 2008.

Jun-Lin Lin, Meng-Cheng Wei, Chih-Wen Li, Kuo-Chiang Hsieh, “A hybrid method for k-anonymization”, Proceedings of the 2008 IEEE Asia-Pacific Services Computing Conference, APSCC 2008, pp. 385-390.

Khaled El Emam, Fida Kamal Dankar, “Protecting Privacy Using k-Anonymity”, Journal of the American Medical Informatics Association, Volume 15, Number 5, September/October 2008.

Sweeney, “Computational Disclosure Control: A Primer on Data Privacy Protection”, PhD thesis, Massachusetts Institute of Technology, 2001.

Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan, “Incognito: Efficient Full-Domain k-Anonymity”, Proceedings of SIGMOD 2005, June 14-16, 2005, Baltimore, Maryland, USA, pp. 49-60.

Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan, “Mondrian Multidimensional k-Anonymity”, technical report, University of Wisconsin, Madison.

Robert Lemos, “Researchers reverse Netflix anonymization”, SecurityFocus, 2007-12-04 (http://www.privacyanalytics.ca/news/netflix.pdf).

S. Hettich and S. D. Bay, The UCI KDD Archive, 1999, http://kdd.ics.uci.edu

Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, Muthuramakrishnan Venkitasubramaniam, “ℓ-diversity: Privacy Beyond k-Anonymity”, IEEE International Conference on Data Engineering, 2006.

Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian, “t-Closeness: Privacy Beyond k-Anonymity and ℓ-diversity”, Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE), 2007.
