A Framework for Privacy-Preserving Cluster Analysis

IEEE ISI 2008

Benjamin C. M. Fung
Concordia University, Canada
fung@ciise.concordia.ca

Lingyu Wang, Mourad Debbabi
Concordia University, Canada
{wang, debbabi}@ciise.concordia.ca

Ke Wang
Simon Fraser University, Canada
[email protected]


Agenda

• Motivation
• Problem Scope: Anonymity in Clustering
• Proposed Method: Top-Down Specialization (TDS)
• Proposed Framework
• Experimental Results
• Related Work
• Conclusion
• Q & A


Motivation

Corporations, agencies, governments, and individuals want to share valuable information.

But they are reluctant to do so because of privacy concerns.

The focus of this study is publishing data for the purpose of cluster analysis.

But how can both the privacy goal and the clustering goal be satisfied?


Motivation (cont.)

Real world scenario

A data owner wants to release a person-specific data table to another party (or the public) for the purpose of cluster analysis without compromising privacy of the individuals in the released data.

[Figure: data owner, person-specific data, data recipients, adversary]


Privacy Threat

In the tables below, a description on (Education, Sex) can be so specific that few people match it. Releasing such tables can link a unique individual, or a small number of individuals, to their sensitive information.

Education   Sex   Age   Disease   # of Recs.
9th         F     30    Flu       3
10th        M     32    Heart     4
11th        F     35    Fever     5
12th        F     37    Fever     4
Bachelors   F     42    Flu       6
Bachelors   F     44    Heart     4
Masters     M     44    Flu       4
Masters     F     44    Flu       3
Doctorate   F     44    HIV       1
Total: 34

Name    Education   Sex   …
Alice   Bachelors   F     …
Bob     Bachelors   M     …
Cathy   Masters     F     …
Doug    Masters     F     …
Emily   Doctorate   F     …


Privacy Goal: k-Anonymity

The privacy goal is specified by the anonymity on a combination of attributes called a Quasi-Identifier (QID), where each description on a QID is required to be shared by at least k records in the table [Sweeney and Samarati 1998].

Anonymity requirement

Consider QID_1, …, QID_p, e.g., QID = {Education, Sex}.

a(qid_i) denotes the number of data records in T that share the value qid_i on QID_i, e.g., qid = {Doctorate, Female}.

A(QID_i) denotes the smallest a(qid_i) for any value qid_i on QID_i.

A table T satisfies the anonymity requirement {<QID_1, h_1>, …, <QID_p, h_p>} if A(QID_i) ≥ h_i for 1 ≤ i ≤ p, where h_i is the anonymity threshold on QID_i, specified by the data owner.


Anonymity Requirement

Example: QID_1 = {Education, Sex}, h_1 = 4

Education   Sex   Age   Class   # of Recs.   a(qid_1)
9th         F     30    0G3B    3            3
10th        M     32    0G4B    4            4
11th        F     35    2G3B    5            5
12th        F     37    3G1B    4            4
Bachelors   F     42    4G2B    6            10
Bachelors   F     44    4G0B    4            10
Masters     M     44    4G0B    4            4
Masters     F     44    3G0B    3            3
Doctorate   F     44    1G0B    1            1
Total: 34

A(QID_1) = 1
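The check in this example can be reproduced with a short Python sketch. It is only an illustration: the names `ROWS`, `group_sizes`, `smallest_group`, and `satisfies` are hypothetical and do not come from the slides. The row counts are copied from the example table.

```python
from collections import Counter

# (Education, Sex, number of records), copied from the example table.
ROWS = [
    ("9th", "F", 3), ("10th", "M", 4), ("11th", "F", 5), ("12th", "F", 4),
    ("Bachelors", "F", 6), ("Bachelors", "F", 4),
    ("Masters", "M", 4), ("Masters", "F", 3), ("Doctorate", "F", 1),
]

def group_sizes(rows):
    """a(qid): number of records sharing each (Education, Sex) value."""
    sizes = Counter()
    for education, sex, count in rows:
        sizes[(education, sex)] += count
    return sizes

def smallest_group(rows):
    """A(QID): the smallest a(qid) over all qid values in the table."""
    return min(group_sizes(rows).values())

def satisfies(rows, h):
    """True if the table meets the anonymity requirement <QID, h>."""
    return smallest_group(rows) >= h

print(group_sizes(ROWS)[("Bachelors", "F")])  # 10 (the two Bachelors/F rows merge)
print(smallest_group(ROWS))                   # 1  (Doctorate, F)
print(satisfies(ROWS, 4))                     # False
```

With h_1 = 4 the raw table fails the requirement, because the single (Doctorate, F) record gives A(QID_1) = 1.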


Generalization

Generalize values in ∪QID_j.

Raw table:

Education   Sex   Age   Disease   # of Recs.
9th         F     30    Flu       3
10th        M     32    Heart     4
11th        F     35    Fever     5
12th        F     37    Fever     4
Bachelors   F     42    Flu       6
Bachelors   F     44    Heart     4
Masters     M     44    Flu       4
Masters     F     44    Flu       3
Doctorate   F     44    HIV       1

Generalized table (Masters and Doctorate generalized to Grad School):

Education     Sex   Age   Disease   # of Recs.
9th           F     30    Flu       3
10th          M     32    Heart     4
11th          F     35    Fever     5
12th          F     37    Fever     4
Bachelors     F     42    Flu       6
Bachelors     F     44    Heart     4
Grad School   M     44    Flu       4
Grad School   F     44    Flu/HIV   4
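The generalization step shown in the two tables can be sketched in Python as a value mapping followed by a regrouping. This is a minimal illustration: `PARENT`, `ROWS`, and `generalize_rows` are hypothetical names, and the mapping contains only the single step visible on the slide.

```python
from collections import Counter

# One generalization step taken from the slide: Masters/Doctorate -> Grad School.
PARENT = {"Masters": "Grad School", "Doctorate": "Grad School"}

# (Education, Sex, number of records), copied from the raw table.
ROWS = [
    ("9th", "F", 3), ("10th", "M", 4), ("11th", "F", 5), ("12th", "F", 4),
    ("Bachelors", "F", 6), ("Bachelors", "F", 4),
    ("Masters", "M", 4), ("Masters", "F", 3), ("Doctorate", "F", 1),
]

def generalize_rows(rows, parent):
    """Replace each Education value by its parent (if any) and regroup."""
    sizes = Counter()
    for education, sex, count in rows:
        sizes[(parent.get(education, education), sex)] += count
    return sizes

sizes = generalize_rows(ROWS, PARENT)
print(sizes[("Grad School", "F")])  # 4 (3 Masters/F records + 1 Doctorate/F record)
print(sizes[("Grad School", "M")])  # 4
print(min(sizes.values()))          # 3, from (9th, F)
```

The lone Doctorate record is now hidden in a group of four. Note that the smallest group in the generalized table is still (9th, F) with three records, so a threshold of 4 on {Education, Sex} would need further generalization.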


Problem Statement

Anonymity in Cluster Analysis

Given a table T, an anonymity requirement, and a taxonomy tree for each categorical attribute in ∪QID_j, generalize T to satisfy the anonymity requirement while preserving as much information (cluster structure) as possible for cluster analysis.

We use the existing k-anonymity algorithms available in the current literature [Sweeney 2002; Bayardo and Agrawal 2005; Fung et al. 2005, 2007; LeFevre et al. 2005].


Intuition

The clustering goal and the privacy goal appear to be at odds:

Privacy goal: masking sensitive information, usually specific descriptions that identify individuals.

Clustering goal: grouping similar items together and extracting general structures that capture trends and patterns.

Generalization eliminates outliers, but general cluster structures can still be preserved.

If generalization is performed carefully, identifying information can be masked while still preserving trends and patterns for clustering.


Challenges

What exactly are the cluster structures?
What information should we preserve?
Our previous work [Fung et al. 2005] addressed the problem of anonymity for classification analysis.

Education   Sex   Age   Class   # of Recs.
9th         F     30    0G3B    3
10th        M     32    0G4B    4
11th        F     35    2G3B    5
12th        F     37    3G1B    4
Bachelors   F     42    4G2B    6
Bachelors   F     44    4G0B    4
Masters     M     44    4G0B    4
Masters     F     44    3G0B    3
Doctorate   F     44    1G0B    1

Education     Sex   Age   Class   # of Recs.
9th           F     30    0G3B    3
10th          M     32    0G4B    4
11th          F     35    2G3B    5
12th          F     37    3G1B    4
Bachelors     F     42    4G2B    6
Bachelors     F     44    4G0B    4
Grad School   M     44    4G0B    4
Grad School   F     44    4G0B    4


The Framework: Convert the Problem

[Figure: workflow between the Data Owner and the Data User, connecting four tables: Raw Table, Raw Labeled Table, Generalized Labeled Table, and Generalized Table.
Step 1, Clustering & Labeling: apply a clustering algorithm.
Step 2, Generalizing: apply Top-Down Specialization (TDS).
Step 3, Clustering & Labeling: apply a clustering algorithm.
Step 3, Comparing Cluster Structures: F-measure.
Step 4, Release.]
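The slides name the F-measure as the tool for comparing cluster structures but do not define it. As a hedged illustration, the sketch below uses one common clustering F-measure: each reference cluster is matched with the candidate cluster that maximizes the harmonic mean of precision and recall, and the matches are weighted by reference cluster size. Whether the authors use exactly this variant is an assumption, and `cluster_f_measure` is a hypothetical name.

```python
from collections import Counter

def cluster_f_measure(reference, candidate):
    """Compare two clusterings given as equal-length lists of cluster labels.

    reference[i] is the cluster of record i in the raw data and
    candidate[i] is its cluster in the generalized data.
    Returns a value in (0, 1]; 1.0 means the structures are identical.
    """
    n = len(reference)
    ref_sizes = Counter(reference)
    cand_sizes = Counter(candidate)
    overlap = Counter(zip(reference, candidate))

    total = 0.0
    for ref_label, ref_size in ref_sizes.items():
        best = 0.0
        for cand_label, cand_size in cand_sizes.items():
            shared = overlap[(ref_label, cand_label)]
            if shared == 0:
                continue
            precision = shared / cand_size
            recall = shared / ref_size
            best = max(best, 2 * precision * recall / (precision + recall))
        # Weight each reference cluster by its share of the records.
        total += ref_size / n * best
    return total

# Same structure, different label names: perfect score.
print(cluster_f_measure([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 1.0
# Two clusters collapsed into one: structure partly lost.
print(cluster_f_measure([0, 0, 1, 1], [0, 0, 0, 0]))
```

Under this reading, the data owner would compute the measure between the clusters found on the raw table and those found on the generalized table before deciding to release.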


Algorithm: Top-Down Specialization (TDS)

Initialize every value in T to the topmost value.
Initialize Cut_i to include the topmost value.
while some x ∈ ∪Cut_i is valid do
    Find the Best specialization, the one with the highest Score in ∪Cut_i.
    Perform the Best specialization on T and update ∪Cut_i.
    Update Score(x) and validity for x ∈ ∪Cut_i.
end while
return generalized T and ∪Cut_i.

[Figure: taxonomy tree for Age. ANY [1-99) is specialized into [1-37) and [37-99).]
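The greedy loop above can be sketched in Python. This is a simplified illustration, not the authors' implementation. It assumes that a specialization is valid when the table still meets the anonymity threshold afterwards. The Education taxonomy, including the `Secondary` and `University` nodes, is hypothetical. The score used here (number of distinct groups after the specialization) is a stand-in for the real Score, and Score values are recomputed each round rather than updated incrementally.

```python
from collections import Counter

# Hypothetical taxonomy trees: parent -> children (leaves have no entry).
TAXONOMY = {
    "Education": {
        "ANY_Edu": ["Secondary", "University"],
        "Secondary": ["9th", "10th", "11th", "12th"],
        "University": ["Bachelors", "Grad School"],
        "Grad School": ["Masters", "Doctorate"],
    },
    "Sex": {"ANY_Sex": ["M", "F"]},
}

def ancestors(attr, value):
    """Yield the value itself, then each ancestor up to the root."""
    parent = {c: p for p, kids in TAXONOMY[attr].items() for c in kids}
    while value is not None:
        yield value
        value = parent.get(value)

def generalize(record, cut):
    """Map each raw value to its ancestor that is currently in the cut."""
    return tuple(next(a for a in ancestors(attr, record[attr]) if a in cut[attr])
                 for attr in sorted(cut))

def min_group_size(records, cut):
    """A(QID) of the table generalized to the given cut."""
    return min(Counter(generalize(r, cut) for r in records).values())

def specialize(cut, attr, value):
    """Return a new cut in which `value` is replaced by its children."""
    new_cut = {a: set(vs) for a, vs in cut.items()}
    new_cut[attr].remove(value)
    new_cut[attr].update(TAXONOMY[attr][value])
    return new_cut

def placeholder_score(records, new_cut):
    """Stand-in for Score: number of distinct groups after specializing."""
    return len({generalize(r, new_cut) for r in records})

def tds(records, k):
    cut = {"Education": {"ANY_Edu"}, "Sex": {"ANY_Sex"}}  # topmost values
    while True:
        candidates = []
        for attr in sorted(cut):
            for value in sorted(cut[attr]):
                if value not in TAXONOMY[attr]:
                    continue  # leaf value, cannot be specialized further
                new_cut = specialize(cut, attr, value)
                if min_group_size(records, new_cut) >= k:  # still valid
                    candidates.append((placeholder_score(records, new_cut), new_cut))
        if not candidates:
            return [generalize(r, cut) for r in records], cut
        cut = max(candidates, key=lambda c: c[0])[1]  # perform the best one

# (Education, Sex, count) from the example table, expanded to one dict per record.
COUNTS = [("9th", "F", 3), ("10th", "M", 4), ("11th", "F", 5), ("12th", "F", 4),
          ("Bachelors", "F", 10), ("Masters", "M", 4), ("Masters", "F", 3),
          ("Doctorate", "F", 1)]
RECORDS = [{"Education": e, "Sex": s} for e, s, n in COUNTS for _ in range(n)]

table, final_cut = tds(RECORDS, k=4)
print(sorted(final_cut["Education"]), sorted(final_cut["Sex"]))
```

On this toy input with k = 4, the loop stops with Education cut at Secondary, Bachelors, and Grad School and Sex fully specialized, because every remaining specialization would create a group smaller than four.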


Search Criteria: Score

Consider a specialization v → child(v). To heuristically maximize the information of the generalized data while achieving a given anonymity, we favor the specialization on v that has the maximum information gain per unit of privacy loss.
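The Score formula itself is not reproduced in this transcript. One way to read "information gain per unit of privacy loss" is as a ratio, sketched below in Python. Everything in it is an assumption made for illustration: information gain is taken as the entropy reduction over cluster labels, privacy loss as the drop in A(QID) caused by the specialization, and 1 is added to the denominator only to avoid division by zero. The names `entropy`, `info_gain`, and `score` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class or cluster labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent_labels, child_label_groups):
    """Entropy of the records under v minus the weighted entropy of its children."""
    n = len(parent_labels)
    remainder = sum(len(g) / n * entropy(g) for g in child_label_groups if g)
    return entropy(parent_labels) - remainder

def score(parent_labels, child_label_groups, a_before, a_after):
    """Assumed form: information gain divided by (anonymity loss + 1)."""
    anonymity_loss = a_before - a_after
    return info_gain(parent_labels, child_label_groups) / (anonymity_loss + 1)

# Specializing v splits 8 records into two pure groups of 4,
# while the smallest group size drops from 8 to 4.
parent = ["G"] * 4 + ["B"] * 4
children = [["G"] * 4, ["B"] * 4]
print(info_gain(parent, children))                      # 1.0 bit
print(score(parent, children, a_before=8, a_after=4))   # 1.0 / (4 + 1) = 0.2
```

A specialization that separates the labels cleanly while shrinking the smallest group only slightly would receive a high score under this reading.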


Experimental Evaluation

Objectives:

Evaluate the information loss (in terms of cluster quality) caused by generalization. This is the cost of achieving anonymity.

Evaluate the information gain (in terms of cluster quality) compared to existing k-anonymization algorithms, which do not focus on preserving cluster structures. This is the benefit of using our method.

Data set: the de facto benchmark, the Adult data set.
US census data.
45,222 records (each record represents one US resident).


Experimental Evaluation (cont.)

• Cost = 1 - clusterFM (in terms of loss in cluster structure)
• Benefit = clusterFM - distortFM


Experimental Evaluation (cont.)


Related Works

[Sweeney 2002] employed bottom-up generalization to achieve k-anonymity. Single QID. Does not consider the specific use of the data.

[Iyengar 2002] proposed a genetic algorithm (GA) to address the problem of anonymity for classification. Single QID. The GA needs 18 hours to generalize 45,000 records.

[Fung et al. 2005] proposed an efficient top-down specialization (TDS) method for the problem of anonymity for classification. TDS needs only 7 seconds to generalize the same set of records, with comparable classification accuracy.


Conclusion

Quality clustering and privacy preservation can coexist.

We presented an effective top-down method that iteratively specializes the data, guided by maximizing information utility and minimizing privacy specificity.

The approach is broadly applicable to public and private sector organizations that share information for mutual benefit.