1/3/2016 1 a framework for privacy- preserving cluster analysis ieee isi 2008 benjamin c. m. fung...

04/21/23 1

A Framework for Privacy-Preserving Cluster Analysis

IEEE ISI 2008

Benjamin C. M. Fung, Concordia University

Canada

[email protected]

Lingyu Wang, Mourad Debbabi, Concordia University

Canada

{wang, debbabi}@ciise.concordia.ca

Ke Wang, Simon Fraser University

Canada

[email protected]


Agenda

Motivation
Problem Scope: Anonymity in Clustering
Proposed Method: Top-Down Specialization (TDS)
Proposed Framework
Experimental Results
Related Work
Conclusion
Q & A


Motivation

Corporations, agencies, governments, and individuals want to share valuable information.

But they are reluctant to do so because of privacy concerns.

The focus of this study is to publish data for the purpose of cluster analysis.

But how can both the privacy goal and the clustering goal be satisfied?


Motivation (cont.)

Real world scenario

A data owner wants to release a person-specific data table to another party (or the public) for the purpose of cluster analysis without compromising privacy of the individuals in the released data.

[Figure: the data owner releases person-specific data to the data recipients; an adversary may be among them.]

Privacy Threat

In the tables below, a description on (Education, Sex) can be so specific that few people match it. Releasing such tables can link a unique individual, or a small number of individuals, to their sensitive information.

Education    Sex   Age   Disease   # of Recs.
9th          F     30    Flu       3
10th         M     32    Heart     4
11th         F     35    Fever     5
12th         F     37    Fever     4
Bachelors    F     42    Flu       6
Bachelors    F     44    Heart     4
Masters      M     44    Flu       4
Masters      F     44    Flu       3
Doctorate    F     44    HIV       1
Total:                             34

Name     Education    Sex   …
Alice    Bachelors    F     …
Bob      Bachelors    M     …
Cathy    Masters      F     …
Doug     Masters      F     …
Emily    Doctorate    F     …


Privacy Goal: k-Anonymity

The privacy goal is specified by the anonymity on a combination of attributes called a Quasi-Identifier (QID): each description on a QID must be shared by at least k records in the table [Sweeney and Samarati 1998].

Anonymity requirement

Consider $QID_1, \dots, QID_p$, e.g., QID = {Education, Sex}.

$a(qid_i)$ denotes the number of data records in T that share the value $qid_i$ on $QID_i$, e.g., qid = {Doctorate, Female}.

$A(QID_i)$ denotes the smallest $a(qid_i)$ for any value $qid_i$ on $QID_i$.

A table T satisfies the anonymity requirement

$\{\langle QID_1, h_1 \rangle, \dots, \langle QID_p, h_p \rangle\}$

if $A(QID_i) \ge h_i$ for $1 \le i \le p$, where $h_i$ is the anonymity threshold on $QID_i$, specified by the data owner.


Anonymity Requirement

Example: $QID_1$ = {Education, Sex}, $h_1 = 4$

Education    Sex   Age   Class   # of Recs.   a(qid_1)
9th          F     30    0G3B    3            3
10th         M     32    0G4B    4            4
11th         F     35    2G3B    5            5
12th         F     37    3G1B    4            4
Bachelors    F     42    4G2B    6            10
Bachelors    F     44    …       4            10
Masters      M     44    4G0B    4            4
Masters      F     44    3G0B    3            3
Doctorate    F     44    1G0B    1            1
Total:                           34

$A(QID_1) = 1$, which is below $h_1 = 4$, so the table violates the anonymity requirement.
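The check in this example can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names (`group_sizes`, `anonymity`, `satisfies`) are hypothetical, and the records are expanded from the (Education, Sex, # of Recs.) counts shown in the tables above.

```python
from collections import Counter

def group_sizes(records, qid):
    """a(qid): number of records sharing each value combination on the QID."""
    return Counter(tuple(rec[attr] for attr in qid) for rec in records)

def anonymity(records, qid):
    """A(QID): the smallest a(qid) over all value combinations present."""
    return min(group_sizes(records, qid).values())

def satisfies(records, requirement):
    """True if A(QID_i) >= h_i for every (QID_i, h_i) pair in the requirement."""
    return all(anonymity(records, qid) >= h for qid, h in requirement)

# (Education, Sex, # of Recs.) rows of the example table, expanded to records.
ROWS = [("9th", "F", 3), ("10th", "M", 4), ("11th", "F", 5), ("12th", "F", 4),
        ("Bachelors", "F", 6), ("Bachelors", "F", 4), ("Masters", "M", 4),
        ("Masters", "F", 3), ("Doctorate", "F", 1)]
records = [{"Education": e, "Sex": s} for e, s, n in ROWS for _ in range(n)]

QID1 = ("Education", "Sex")
print(group_sizes(records, QID1)[("Bachelors", "F")])  # 10
print(anonymity(records, QID1))                        # 1
print(satisfies(records, [(QID1, 4)]))                 # False
```

The single (Doctorate, F) record is what drives A(QID_1) down to 1, which is why that group is the target of generalization.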


Generalization

Generalize values in $\cup QID_j$.

Before generalization:

Education    Sex   Age   Disease   # of Recs.
9th          F     30    Flu       3
10th         M     32    Heart     4
11th         F     35    Fever     5
12th         F     37    Fever     4
Bachelors    F     42    Flu       6
Bachelors    F     44    Heart     4
Masters      M     44    Flu       4
Masters      F     44    Flu       3
Doctorate    F     44    HIV       1

After generalizing Masters and Doctorate to Grad School:

Education    Sex   Age   Disease   # of Recs.
9th          F     30    Flu       3
10th         M     32    Heart     4
11th         F     35    Fever     5
12th         F     37    Fever     4
Bachelors    F     42    Flu       6
Bachelors    F     44    Heart     4
Grad School  M     44    Flu       4
Grad School  F     44    Flu/HIV   4
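The generalization step shown in these tables can be sketched as a lookup in a child-to-parent map followed by merging rows that become identical. This is an illustrative sketch only: the `PARENT` map is a hypothetical one-level fragment of an Education taxonomy, and `generalize` is not a function from the original work.

```python
from collections import defaultdict

# Hypothetical taxonomy fragment: child value -> parent value.
PARENT = {"Masters": "Grad School", "Doctorate": "Grad School"}

def generalize(rows, parent):
    """Replace each Education value by its parent (if it has one) and merge
    rows that become identical on (Education, Sex, Age)."""
    merged = defaultdict(lambda: {"diseases": [], "count": 0})
    for education, sex, age, disease, count in rows:
        group = merged[(parent.get(education, education), sex, age)]
        if disease not in group["diseases"]:
            group["diseases"].append(disease)
        group["count"] += count
    return {key: ("/".join(g["diseases"]), g["count"]) for key, g in merged.items()}

# The three rows affected by the generalization in the tables above.
rows = [
    ("Masters", "M", 44, "Flu", 4),
    ("Masters", "F", 44, "Flu", 3),
    ("Doctorate", "F", 44, "HIV", 1),
]
result = generalize(rows, PARENT)
print(result[("Grad School", "F", 44)])  # ('Flu/HIV', 4)
print(result[("Grad School", "M", 44)])  # ('Flu', 4)
```

After the merge, the single Doctorate record is hidden in a group of four, so the HIV value can no longer be tied to one individual through (Education, Sex).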


Problem Statement

Anonymity in Cluster Analysis

Given a table T, an anonymity requirement, and a taxonomy tree for each categorical attribute in $\cup QID_j$, generalize T to satisfy the anonymity requirement while preserving as much information as possible (the cluster structure) for cluster analysis.

We use the existing k-anonymity algorithms available in the current literature [Sweeney 2002; Bayardo and Agrawal 2005; Fung et al. 2005, 2007; LeFevre et al. 2005].


Intuition

The clustering goal and the privacy goal appear to conflict:

Privacy goal: mask sensitive information, usually specific descriptions that identify individuals.

Clustering goal: group similar items together and extract general structures that capture trends and patterns.

Generalization eliminates outliers, but general cluster structures can be preserved.

If generalization is performed carefully, identifying information can be masked while still preserving trends and patterns for clustering.


Challenges

What exactly are the cluster structures? What information should we preserve?

Our previous work [Fung et al. 2005] addressed the problem of anonymity for classification analysis.

Before generalization:

Education    Sex   Age   Class   # of Recs.
9th          F     30    0G3B    3
10th         M     32    0G4B    4
11th         F     35    2G3B    5
12th         F     37    3G1B    4
Bachelors    F     42    4G2B    6
Bachelors    F     44    …       4
Masters      M     44    4G0B    4
Masters      F     44    3G0B    3
Doctorate    F     44    1G0B    1

After generalization:

Education    Sex   Age   Class   # of Recs.
9th          F     30    0G3B    3
10th         M     32    0G4B    4
11th         F     35    2G3B    5
12th         F     37    3G1B    4
Bachelors    F     42    4G2B    6
Bachelors    F     44    …       4
Grad School  M     44    4G0B    4
Grad School  F     44    4G0B    4


The Framework: Convert the Problem

[Figure: framework workflow between the data owner and the data user.]

Step 1, Clustering & Labeling: apply a clustering algorithm to the raw table to obtain the raw labeled table.
Step 2, Generalizing: apply Top-Down Specialization (TDS) to the raw labeled table to obtain the generalized labeled table.
Step 3, Clustering & Labeling: apply a clustering algorithm to the generalized table.
Step 3, Comparing Cluster Structures: compare the cluster structures using the F-measure.
Step 4, Release.


Algorithm: Top-Down Specialization (TDS)

Initialize every value in T to the topmost value.
Initialize $Cut_i$ to include the topmost value.
while some $x \in \cup Cut_i$ is valid do
    Find the Best specialization of the highest Score in $\cup Cut_i$.
    Perform the Best specialization on T and update $\cup Cut_i$.
    Update Score(x) and validity for $x \in \cup Cut_i$.
end while
return Generalized T and $\cup Cut_i$.

[Figure: taxonomy tree for Age. ANY [1-99) is specialized into [1-37) and [37-99).]
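The loop above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the score is plain information gain on the cluster labels (the anonymity-loss part of Score is left out), a specialization counts as valid when every group on the QID still has at least k records, and everything is recomputed from scratch each round instead of being updated incrementally. The taxonomies, records, and labels are hypothetical toy data.

```python
import math
from collections import Counter, defaultdict

def ancestor_in_cut(value, cut, parent):
    """Walk up the taxonomy until reaching a value that is in the cut."""
    while value not in cut:
        value = parent[value]
    return value

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, attr, v, cut, new_cut, parent):
    """Entropy drop of the cluster labels when v is replaced by its children."""
    groups = defaultdict(list)
    for rec, label in zip(records, labels):
        if ancestor_in_cut(rec[attr], cut, parent) == v:
            groups[ancestor_in_cut(rec[attr], new_cut, parent)].append(label)
    covered = [label for g in groups.values() for label in g]
    if not covered:
        return 0.0
    return entropy(covered) - sum(len(g) / len(covered) * entropy(g)
                                  for g in groups.values())

def tds(records, labels, taxonomies, k):
    """Specialize top-down while every group on the QID keeps at least k records."""
    attrs = list(taxonomies)
    parents = {a: {c: p for p, cs in taxonomies[a].items() for c in cs} for a in attrs}
    # The root is the only taxonomy value that has no parent.
    cuts = {a: {next(p for p in taxonomies[a] if p not in parents[a])} for a in attrs}

    def table(cs):
        return [tuple(ancestor_in_cut(r[a], cs[a], parents[a]) for a in attrs)
                for r in records]

    while True:
        best = None
        for a in attrs:
            for v in sorted(cuts[a]):
                if v not in taxonomies[a]:
                    continue  # leaf value: cannot be specialized further
                trial = dict(cuts)
                trial[a] = (cuts[a] - {v}) | set(taxonomies[a][v])
                if min(Counter(table(trial)).values()) < k:
                    continue  # invalid: would violate the anonymity threshold
                score = info_gain(records, labels, a, v, cuts[a], trial[a], parents[a])
                if best is None or score > best[0]:
                    best = (score, trial)
        if best is None:
            return table(cuts), cuts
        cuts = best[1]

# Hypothetical toy input: taxonomy maps each parent to its children.
taxonomies = {
    "Education": {"ANY": ["Secondary", "University"],
                  "Secondary": ["11th", "12th"],
                  "University": ["Bachelors", "Masters"]},
    "Sex": {"ANY": ["M", "F"]},
}
data = ([("11th", "F", "c1")] * 2 + [("12th", "F", "c1")] * 2
        + [("Bachelors", "M", "c2")] * 2 + [("Masters", "F", "c2")] * 2)
records = [{"Education": e, "Sex": s} for e, s, _ in data]
labels = [c for _, _, c in data]

rows, cuts = tds(records, labels, taxonomies, k=3)
print(cuts)  # Education is cut at {Secondary, University}; Sex stays at {ANY}
```

With k = 3, only the first Education specialization is valid: any further split would create a group of two records, so the loop stops there.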


Search Criteria: Score

Consider a specialization v → child(v). To heuristically maximize the information of the generalized data while achieving a given anonymity, we favor the specialization on v that has the maximum information gain per unit of privacy loss.
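One way to turn "information gain per unit of privacy loss" into a number is sketched below. The exact formula is an assumption here, since the slide text does not give it: the anonymity loss is taken as the drop in A(QID) caused by the specialization, and 1 is added to the denominator so that a specialization costing no anonymity does not divide by zero. The function name `score` and its parameters are hypothetical.

```python
def score(info_gain, anonymity_before, anonymity_after):
    """Information gain per unit of anonymity loss (assumed form).

    anonymity_before / anonymity_after: A(QID) before and after the
    specialization; the +1 guards against a zero anonymity loss.
    """
    anonymity_loss = anonymity_before - anonymity_after
    return info_gain / (anonymity_loss + 1)

# A specialization that gains 1.0 bit but drops A(QID) from 10 to 4
# scores lower than one that gains 0.5 bit at no anonymity cost.
print(score(1.0, 10, 4) < score(0.5, 4, 4))  # True
```

Under this form, a specialization is preferred when it separates the clusters well and leaves the smallest group nearly as large as before.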


Experimental Evaluation

Objectives:

Evaluate the information loss (in terms of cluster quality) caused by generalization. This is the cost of achieving anonymity.

Evaluate the information gain (in terms of cluster quality) compared to existing k-anonymization algorithms, which do not aim to preserve cluster structures. This is the benefit of using our method.

Data set: the Adult data set, a de facto benchmark. US census data with 45,222 records (each record represents one US resident).


Experimental Evaluation (cont.)

• Cost = 1 - clusterFM (loss in cluster structure)
• Benefit = clusterFM - distortFM
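The two measures above can be computed once an F-measure between two cluster structures is available. The sketch below uses a common clustering F-measure (best F1 per reference cluster, weighted by cluster size); whether this matches the exact definition behind clusterFM and distortFM is an assumption. It also assumes clusterFM and distortFM are the F-measures obtained on data anonymized with and without the cluster-preserving search, each compared against the clusters of the raw data. All function names are hypothetical.

```python
from collections import Counter

def cluster_f_measure(reference, found):
    """Overall F-measure of `found` cluster labels against `reference` labels.

    For each reference cluster, take the best F1 over all found clusters,
    then average weighted by reference cluster size.
    """
    n = len(reference)
    ref_sizes = Counter(reference)
    found_sizes = Counter(found)
    overlap = Counter(zip(reference, found))
    total = 0.0
    for ref, ref_size in ref_sizes.items():
        best = 0.0
        for fnd, fnd_size in found_sizes.items():
            common = overlap[(ref, fnd)]
            if common:
                precision = common / fnd_size
                recall = common / ref_size
                best = max(best, 2 * precision * recall / (precision + recall))
        total += ref_size / n * best
    return total

def cost(cluster_fm):
    """Loss in cluster structure caused by generalization."""
    return 1 - cluster_fm

def benefit(cluster_fm, distort_fm):
    """Gain over an anonymization that ignores cluster structure."""
    return cluster_fm - distort_fm

# Toy labels: clusters on raw data vs. clusters on two anonymized versions.
raw = [0, 0, 0, 1, 1, 1]
cluster_fm = cluster_f_measure(raw, ["x", "x", "x", "y", "y", "y"])
distort_fm = cluster_f_measure(raw, ["a", "a", "b", "b", "b", "b"])
print(cost(cluster_fm), benefit(cluster_fm, distort_fm))
```

Cluster labels only need to be consistent within each labeling; the measure matches clusters by overlap, so renaming clusters does not change the result.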

04/21/23 17

Experimental Evaluation (cont.)


Related Works

[Sweeney 2002] employed bottom-up generalization to achieve k-anonymity. Single QID. Does not consider the specific use of the data.

[Iyengar 2002] proposed a genetic algorithm (GA) to address the problem of anonymity for classification. Single QID. The GA needs 18 hours to generalize 45,000 records.

[Fung et al. 2005] proposed an efficient top-down specialization method for the problem of anonymity for classification. TDS needs only 7 seconds to generalize the same set of records, with comparable classification accuracy.


Conclusion

Quality clustering and privacy preservation can coexist.

An effective top-down method to iteratively specialize the data, guided by maximizing the information utility and minimizing privacy specificity.

Great applicability to both public and private sectors that share information for mutual benefits.
