South Asian Journal of Engineering and Technology Vol.2, No.40 (2016) 1–9
1
ISSN Number (online): 2454-9614
Ensuring Privacy in Micro Data Publishing Using Slicing
Gandhi. O¹, Tirumala Rao. S. N²
Department of CSE, Narasaraopeta Engineering College, Narasaraopet, JNTUK, AP, India. [email protected]
Abstract—The basic idea of slicing is to break the association
cross columns, but to preserve the association within each
column. This reduces the dimensionality of the data and
preserves better utility than generalization and bucketization.
Slicing preserves utility because it groups highly-correlated
attributes together, and preserves the correlations between such
attributes. Slicing protects privacy because it breaks the
associations between uncorrelated attributes, which are
infrequent and thus identifying. Note that when the dataset
contains QIs and one SA, bucketization has to break their
correlation; slicing, on the other hand, can group some QI
attributes with the SA, preserving attribute correlations with
the sensitive attribute. This paper focuses on how to publish and
share data in a privacy-preserving manner, presenting a new
approach called slicing for privacy-preserving microdata publishing.
Slicing overcomes the limitations of generalization and
bucketization and preserves better utility while protecting
against privacy threats. We illustrate how to use slicing to
prevent attribute disclosure and membership disclosure. Our
experiments show that slicing preserves better data utility than
generalization and is more effective than bucketization in
workloads involving the sensitive attribute. The general
methodology proposed by this work is: before anonymizing
the data, one can analyze the data characteristics and use these
characteristics in data anonymization. The rationale is that one
can design better data anonymization techniques when one knows
the data better.
I. INTRODUCTION
In the information age, data are increasingly being
collected and used. Much of such data are person specific,
containing a record for each individual. For example, micro
data are collected and used by various government agencies
(e.g., U.S. Census Bureau and Department of Motor
Vehicles) and by many commercial companies (e.g., health organizations, insurance companies, and retailers). Other
examples include personal search histories collected by web
search engines.
Companies and agencies who collect such data often need
to publish and share the data for research and other purposes.
However, such data usually contains personal sensitive
information, the disclosure of which may violate the
individual's privacy. Examples of recent attacks include
discovering the medical diagnosis of the governor of
Massachusetts.
In the wake of these well-publicized attacks, privacy has
become an important problem in data publishing and data sharing. This paper focuses on how to publish and share data
in a privacy-preserving manner.
II. BACKGROUND
2.1 Data Collection and Data Publishing
A typical scenario of data collection and publishing is
described in Figure 1.1. In the data collection phase, the
data holder collects data from record owners (e.g., Alice and
Bob). In the data publishing phase, the data holder releases
the collected data to a data miner or the public, called the
data recipient, who will then conduct data mining on the
published data. In this example, the hospital is the data
holder, patients are record owners, and the medical center is
the data recipient.
Typically, micro data is stored in a table, and each record
(row) corresponds to one individual.
Each record has a number of attributes, which can be
divided into the following four categories:
1. Explicit Identifiers, such as Name or Social Security Number, are attributes that can uniquely identify individuals.
2. Some attributes may be Sensitive Attributes (SAs), such as disease and salary.
3. Some may be Quasi-Identifiers (QIs), such as zip code, age, and sex, whose values, when taken together, can potentially identify an individual.
4. Non-Sensitive Attributes contain all attributes that do not fall into the previous three categories.
Each quasi-identifier attribute alone does not uniquely identify a record
owner, but their combination, called the quasi-identifier, often singles out a unique or a small number of record
owners. We assume that each attribute in the microdata is
associated with one of the above attribute types and that the
attribute types can be specified by the data publisher.
2.2 Information Disclosure Risks
When releasing microdata, it is necessary to prevent the
sensitive information of the individuals from being disclosed.
Three types of information disclosure have been identified in the literature: membership disclosure, identity disclosure, and
attribute disclosure.
Membership Disclosure: When the data to be published is
selected from a larger population and the selection criteria
are sensitive (e.g., when publishing datasets about diabetes
patients for research purposes), it is important to prevent an
adversary from learning whether an individual's record is in
the data or not.
Identity Disclosure: Identity disclosure (also called re-identification) occurs when an individual is linked to a
particular record in the released data. Identity disclosure is
what society views as the clearest form of privacy
violation. If one is able to correctly identify one individual's
record from supposedly anonymized data, then people agree
that privacy is violated. In fact, most publicized privacy
attacks are due to identity disclosure. When identity
disclosure occurs, we also say "anonymity" is broken.
Attribute Disclosure: Attribute disclosure occurs when new
information about some individuals is revealed, i.e., the
released data makes it possible to infer the characteristics of
an individual more accurately than would be possible
before the data release. Identity disclosure often leads to
attribute disclosure. Once there is identity disclosure, an individual is re-identified and the corresponding sensitive
values are revealed.
Attribute disclosure can occur with or without identity
disclosure. It has been recognized that even disclosure of
false attribute information may cause harm. An observer of
the released data may incorrectly perceive that an
individual's sensitive attribute takes a particular value, and
behave accordingly based on that perception. This can harm
the individual, even if the perception is incorrect.
Protection against membership disclosure also helps protect
against identity disclosure and attribute disclosure: it is in
general hard to learn sensitive information about an individual
if one does not even know whether this individual's record is in the data or not.
2.3 Privacy-Preserving Data Publishing:
In the most basic form of privacy-preserving data
publishing (PPDP), the data holder has a table of the form:
T(Explicit Identifier, Quasi-Identifier, Sensitive Attributes, Non-Sensitive Attributes).
2.4 Data Anonymization
Anonymity is the condition of having one's name or identity
unknown or concealed. It serves valuable social purposes and
empowers individuals against institutions by limiting
surveillance, but it is also used by wrongdoers to hide their
actions or avoid accountability. Anonymity also provides the
ability to allow anonymous access to services, which avoids
tracking of a user's personal information and behavior, such
as location and frequency of service usage. If someone sends
a file, there may be information on the file that leaves a trail
to the sender; the sender's information may be traced from
the data logged after the file is sent.
2.4.1. Anonymity vs. security
Anonymity is a very powerful technique for protecting
privacy. The decentralized and stateless design of the Internet
is particularly suitable for anonymous behavior. Although
anonymous actions can ensure privacy, they should not be
used as the sole means for ensuring privacy as they also allow for harmful activities, such as spamming, slander, and
harmful attacks without fear of reprisal. Security dictates that
one should be able to detect and catch individuals conducting
illegal behavior, such as hacking, conspiring for terrorist acts,
and conducting fraud. Legitimate needs for privacy should be
allowed, but the ability to conduct harmful anonymous
behavior without responsibility and repercussions in the
name of privacy should not.
2.4.2. Anonymity vs. Privacy
Privacy and anonymity are not the same. The distinction
between privacy and anonymity is clearly seen in an
information technology context. Privacy corresponds to
being able to send an encrypted e-mail to another recipient.
Anonymity corresponds to being able to send the contents of
the e-mail in plain, easily readable form but without any
information that enables a reader of the message to identify the person who wrote it. Privacy is important when the
contents of a message are at issue, whereas anonymity is
important when the identity of the author of a message is at
issue. So, in order to preserve privacy, anonymization is now
widely used.
Data anonymization is a technique that converts clear
text into a non-human-readable form. Data anonymization
for privacy-preserving data publishing has received
a lot of attention in recent years. Detailed data (also called
microdata) contains information about a person, a
household, or an organization. The most popular anonymization
techniques are generalization and bucketization.
Data is considered anonymized even when conjoined
with pointer or pedigree values that direct the user to
the originating system, record, and value (e.g., supporting
selective revelation) and when anonymized records can
be associated, matched, and/or conjoined with other
anonymized records. Clearly, explicit identifiers of record
owners must be removed. Data anonymization enables the
transfer of information across a boundary, such as
between two departments within an agency or between
two agencies, while reducing the risk of unintended
disclosure, and in certain environments in a manner that
enables evaluation and analytics post-anonymization.
III. RELATED WORK
Privacy models:
When microdata are published, various attacks can occur,
such as record-linkage and attribute-linkage attacks. To avoid
these attacks, different anonymization techniques have been
introduced. There are two principles for privacy preservation.
3.1. k-anonymity
A database in which attributes are suppressed or
generalized until each row is identical to at least k-1 other
rows is said to be k-anonymous. K-anonymity
prevents definite database linkages while still releasing data
with reasonable accuracy. K-anonymity relies on two
techniques: generalization and suppression. The k-anonymity model was developed to protect released data from linking
attacks, which cause information disclosure. The
protection k-anonymity provides is easy and simple to
understand. However, k-anonymity cannot provide safety against
attribute disclosure. The k-anonymity model for multiple
sensitive attributes identifies three kinds of
information disclosure.
1) Identity Disclosure: when an individual is linked to a particular record in the published data.
2) Attribute Disclosure: when sensitive information regarding an individual is disclosed.
3) Membership Disclosure: when it is disclosed whether or not an individual's record is present in the data set.
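The k-anonymity condition above can be sketched as a small check over already-generalized records; the schema (Zip code, Age, Sex, Disease) and the values below are illustrative assumptions, not data from this paper.

```python
from collections import Counter

def is_k_anonymous(records, qi_indices, k):
    """True if every quasi-identifier combination occurs in >= k records."""
    counts = Counter(tuple(r[i] for i in qi_indices) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical generalized microdata: (Zip code, Age, Sex, Disease)
table = [
    ("476**", "2*", "*", "Flu"),
    ("476**", "2*", "*", "Cancer"),
    ("479**", "4*", "*", "Flu"),
    ("479**", "4*", "*", "Heart disease"),
]
print(is_k_anonymous(table, qi_indices=[0, 1, 2], k=2))  # True
print(is_k_anonymous(table, qi_indices=[0, 1, 2], k=3))  # False
```

Each of the two QI combinations appears twice, so the table is 2-anonymous but not 3-anonymous.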
3.1.1. Attacks on k-anonymity
In this section we studied two attacks on k-anonymity: the
homogeneity attack and the background knowledge attack.
1) Homogeneity Attack:
If the non-sensitive information of an individual is known to
the attacker, sensitive information may be revealed based on
that known information. This attack occurs when there is no
diversity in the sensitive attribute values of a particular block.
Obtaining sensitive information in this way is also known as
positive disclosure.
2) Background Knowledge Attack:
If the user has some extra demographic information that
can be linked to the released data, helping to rule out
some of the sensitive attribute values, then sensitive
information about an individual might be revealed. Such a
method of revealing information is known as negative
disclosure. Limitations of k-anonymity are:
(1) It cannot hide whether a given individual is in the database and may reveal sensitive attributes.
(2) It cannot protect against attacks based on background knowledge.
(3) Mere knowledge of the k-anonymization algorithm can violate privacy.
(4) It cannot be applied to high-dimensional data without complete loss of utility.
(5) If a dataset is anonymized and published more than once, special methods are required.
3.2. L-Diversity
L-diversity was introduced to address the limitations of
k-anonymity. L-diversity puts constraints on the minimum
number of distinct values seen within an equivalence class
for any sensitive attribute. An equivalence class satisfies
l-diversity when there are l or more well-represented values
for the sensitive attribute.
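The equivalence-class condition can be sketched as a distinct-l-diversity check; this is a simplification ("well-represented" has stronger formal variants such as entropy l-diversity), and the disease values below are hypothetical.

```python
def is_l_diverse(equivalence_classes, l):
    """Distinct l-diversity: every equivalence class must contain at
    least l distinct sensitive-attribute values."""
    return all(len(set(sa_values)) >= l for sa_values in equivalence_classes)

# Sensitive values (Disease) grouped by equivalence class -- hypothetical
classes = [
    ["Flu", "Cancer", "Heart disease"],  # 3 distinct values
    ["Flu", "Flu", "Cancer"],            # 2 distinct values
]
print(is_l_diverse(classes, l=2))  # True
print(is_l_diverse(classes, l=3))  # False
```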
3.2.1Attacks on l-diversity
3.2.1.1) Skewness Attack
L-diversity cannot prevent attribute disclosure when the
overall distribution of the sensitive attribute is skewed, even
if l-diversity is satisfied.
3.2.1.2) Similarity Attack
When the sensitive attribute values are distinct but also
semantically similar, an adversary can learn important
information.
3.2.2.Limitation of L-diversity
While the l-diversity principle represents an important step
beyond k-anonymity in protecting against attribute
disclosure, it has several drawbacks. It is very difficult to
achieve l-diversity, and it may not provide sufficient
privacy protection.
3.3. Anonymization Techniques
Two widely popular data anonymization techniques
areGeneralization and Bucketization.
3.3.1. Generalization
Data generalization is the process of creating successive
layers of summary data in an evaluation database. The
original table is shown in Table (a) and the generalized table
in Table 3. Generalization replaces a quasi-identifier (QI)
value with a semantically consistent but less specific value.
As a result, more records display the same set of
quasi-identifier values. We define an equivalence class of a
generalized table to be a set of records that have the same
values for the quasi-identifiers.
Three types of encoding schemes have been introduced for
generalization:
Global recoding,
Regional recoding,
Local recoding.
The property of global recoding is that multiple occurrences
of the same value are replaced with the same generalized
value. Regional recoding partitions the domain space into
non-intersecting regions, and data points in the same region
are represented by the region they are in; regional recoding
is also called multi-dimensional recoding. Local recoding
allows different occurrences of the same value to be
generalized differently and does not have the above
constraints. Generalization consists of substituting attribute
values with less precise but semantically consistent values.
For example, identifying a specific individual is more
difficult if the month of birth is replaced by the year of
birth, which occurs in more records. Generalization
maintains the correctness of the data at the record level.
Generalization may also result in less specific information,
which may affect the accuracy of machine learning
algorithms applied to the k-anonymous dataset.
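Global recoding as described above can be sketched as follows; the (Zip code, Age, Sex, Disease) schema and the specific generalization rules (3-digit zip prefix, age decade, suppressed sex) are illustrative assumptions.

```python
def generalize(record):
    """Global recoding: every occurrence of a value is generalized the
    same way (zip prefix, age decade, fully suppressed sex)."""
    zip_code, age, sex, disease = record
    decade = age // 10 * 10
    return (zip_code[:3] + "**", f"{decade}-{decade + 9}", "*", disease)

rows = [("47677", 29, "F", "Flu"), ("47602", 22, "M", "Cancer")]
print([generalize(r) for r in rows])
# [('476**', '20-29', '*', 'Flu'), ('476**', '20-29', '*', 'Cancer')]
```

After recoding, both records share the same quasi-identifier values, i.e., they fall into one equivalence class.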
Drawbacks
1) Due to the curse of dimensionality generalization fails
on high-dimensional data.
2) Due to the uniform distribution assumption,
generalization causes too much information loss.
3.3.2. Bucketization
The first technique, which we term bucketization, is to
partition the tuples in T into buckets and then to separate the
sensitive attribute from the non-sensitive ones by
randomly permuting the sensitive attribute values within
each bucket. The sanitized data then consists of the buckets
with permuted sensitive values. We use bucketization as the
method of constructing the published data from the original
table T, although all our results hold for full-domain
generalization as well. We now specify our notion of
bucketization more formally. We partition the tuples into
buckets (i.e., horizontally partition the table T according to
some scheme), and within each bucket, we apply an
independent random permutation to the column containing
the sensitive values. The resulting set of buckets, denoted by
B, is then published. For added privacy, the publisher can
completely mask the identifying attribute (Name) and may
partially mask some of the other non-sensitive attributes
(Age, Sex, and Zip). While bucketization has better data
utility than generalization, it has several limitations.
1. Bucketization does not prevent membership disclosure. Because bucketization publishes the QI values in their original forms, an adversary can find out whether an individual has a record in the published data or not.
2. Bucketization requires a clear separation between QIs and SAs. However, in many data sets, it is unclear which attributes are QIs and which are SAs.
3. By separating the sensitive attribute from the QI attributes, bucketization breaks the attribute correlations between the QIs and the SAs.
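The bucket-and-permute construction can be sketched as below; the schema, the size-based bucketing scheme, and the fixed seed are illustrative assumptions, not the paper's setup.

```python
import random

def bucketize(records, bucket_size, sa_index, seed=0):
    """Horizontally partition records into buckets, then apply an
    independent random permutation to the sensitive column per bucket."""
    rng = random.Random(seed)
    buckets = []
    for start in range(0, len(records), bucket_size):
        bucket = [list(r) for r in records[start:start + bucket_size]]
        sa_values = [row[sa_index] for row in bucket]
        rng.shuffle(sa_values)  # permute sensitive values within the bucket
        for row, sa in zip(bucket, sa_values):
            row[sa_index] = sa
        buckets.append(bucket)
    return buckets

# Hypothetical table: (Age, Sex, Zip code, Disease); Disease is the SA
rows = [(22, "M", "47906", "Flu"), (22, "F", "47906", "Cancer"),
        (33, "F", "47905", "Heart disease"), (52, "F", "47905", "Flu")]
published = bucketize(rows, bucket_size=2, sa_index=3)
```

Note that the QI columns are published unchanged, which is exactly why bucketization cannot prevent membership disclosure (limitation 1 above).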
IV. EXPERIMENT RESULTS
4.1 Slicing
To improve the current state of the art in this paper, we introduce a novel data anonymization technique called
slicing [1]. Slicing partitions the data set both vertically and
horizontally. Vertical partitioning is done by grouping
attributes into columns based on the correlations among the
attributes. Each column contains a subset of attributes that
are highly correlated. Horizontal partitioning is done by
grouping tuples into buckets. Finally, within each bucket,
values in each column are randomly permuted (or sorted) to
break the linking between different columns. The basic idea
of slicing is to break the association cross columns, but
to preserve the association within each column. This reduces the dimensionality of the data and preserves better
utility than generalization and Bucketization. Slicing
preserves utility because it groups highly correlated attributes
together, and preserves the correlations between such
attributes. Slicing protects privacy because it breaks the
associations between uncorrelated attributes, which are
infrequent and thus identifying. Note that when the data set
contains QIs and one SA, Bucketization has to break their
correlation; slicing, on the other hand, can group some QI
attributes with the SA, preserving attribute correlations with
the sensitive attribute. The key intuition behind slicing's
privacy protection is that the slicing process ensures that for
any tuple, there are generally multiple matching buckets.
Slicing first partitions attributes into columns; each column
contains a subset of attributes. Slicing also partitions tuples
into buckets; each bucket contains a subset of tuples. This
horizontally partitions the table. Within each bucket, values
in each column are randomly permuted to break the linking
between different columns.
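The combined vertical and horizontal partitioning with per-column permutation can be sketched as follows; the column grouping {Age, Sex} / {Zip code, Disease} and the data are hypothetical examples, not the paper's tables.

```python
import random

def slice_table(records, columns, bucket_size, seed=0):
    """Slicing sketch: vertically partition attributes into columns,
    horizontally partition tuples into buckets, then permute each
    column's values independently within every bucket."""
    rng = random.Random(seed)
    sliced = []
    for start in range(0, len(records), bucket_size):
        bucket = records[start:start + bucket_size]
        sliced_bucket = []
        for col in columns:  # col = tuple of attribute indices
            values = [tuple(r[i] for i in col) for r in bucket]
            rng.shuffle(values)  # break the linking across columns
            sliced_bucket.append(values)
        sliced.append(sliced_bucket)
    return sliced

# Hypothetical columns {Age, Sex} and {Zip code, Disease}
rows = [(22, "M", "47906", "Flu"), (22, "F", "47906", "Cancer"),
        (33, "F", "47905", "Heart disease"), (52, "F", "47905", "Flu")]
sliced = slice_table(rows, columns=[(0, 1), (2, 3)], bucket_size=2)
```

Within each bucket, the {Age, Sex} pairs and the {Zip code, Disease} pairs survive intact, but the link between the two columns is broken by the independent permutations.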
4.2 SLICING ALGORITHM
This algorithm consists of three phases:
Attribute partitioning, Column Generalization, and Tuple
partitioning.
4.2.1 Attribute Partitioning
This algorithm partitions attributes so that highly
correlated attributes are in the same column. This is good
for both utility and privacy. In terms of data utility, grouping
highly correlated attributes preserves the correlations among
those attributes. In terms of privacy, the association of uncorrelated attributes presents higher identification risks
than the association of highly correlated attributes because
the associations of uncorrelated attribute values are much less
frequent and thus more identifiable.
4.2.2 Column Generalization
First, column generalization may be required for
identity/membership disclosure protection. If a column value
is unique in a column, a tuple with this unique column value
can only have one matching bucket. This is not good for
privacy protection, as in the case of
generalization/Bucketization where each tuple can belong to only one equivalence class/bucket.
4.2.3 Tuple Partitioning
The algorithm maintains two data structures: 1) a
queue of buckets Q and 2) a set of sliced buckets SB.
Initially, Q contains only one bucket, which includes all
tuples, and SB is empty. In each iteration, the algorithm
removes a bucket from Q and splits the bucket into two
buckets [5]. If the sliced table after the split satisfies
l-diversity, then the algorithm puts the two buckets at the end
of the queue Q. Otherwise, we cannot split the bucket
anymore, and the algorithm puts the bucket into SB. When Q
becomes empty, we have computed the sliced table. The set
of sliced buckets is SB.
4.2.4 ALGORITHM
Step 1: In the initial stage we consider a queue of buckets Q and a set of sliced buckets SB. Initially, Q contains only one bucket, which includes all tuples, and SB is empty: Q = {T}; SB = ∅.
Step 2: In each iteration the algorithm removes a bucket from Q and splits it into two buckets: Q = Q - {B}; it then runs the l-diversity check on (T, Q ∪ {B1, B2} ∪ SB, l). The main part of the tuple partitioning algorithm is to check whether a sliced table satisfies l-diversity.
Step 3: For each tuple t, the diversity-check algorithm maintains a list of statistics L[t] about each matching bucket B; initially, for t ∈ T, L[t] = ∅. The statistics include the matching probability p(t, B) and the distribution of candidate sensitive values D(t, B).
Step 4: If the check succeeds, Q = Q ∪ {B1, B2}; the two buckets are moved to the end of Q.
Step 5: Otherwise, SB = SB ∪ {B}; the bucket cannot be split further, so it is sent to SB.
Step 6: When Q becomes empty, we have computed the sliced table; the set of sliced buckets is SB. So, finally, return SB.
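The six steps above can be sketched with a queue of buckets. Note the assumptions: the paper's check runs over the whole sliced table, whereas the toy `diverse` predicate below inspects each candidate bucket on its own, and `halve` is a stand-in for the actual split procedure.

```python
from collections import deque

def tuple_partition(tuples, satisfies_privacy, split):
    """Queue-based tuple partitioning: keep splitting buckets while the
    result still satisfies the privacy check; otherwise finalize them."""
    Q = deque([tuples])  # Step 1: Q = {T}
    SB = []              # Step 1: SB = ∅
    while Q:
        bucket = Q.popleft()            # Step 2: Q = Q - {B}
        halves = split(bucket)
        if halves and all(satisfies_privacy(b) for b in halves):
            Q.extend(halves)            # Step 4: Q = Q ∪ {B1, B2}
        else:
            SB.append(bucket)           # Step 5: SB = SB ∪ {B}
    return SB                           # Step 6: return SB

# Toy stand-ins: a bucket is "diverse" with >= 2 distinct sensitive values
def diverse(bucket):
    return len({sa for _, sa in bucket}) >= 2

def halve(bucket):
    if len(bucket) < 4:  # keep resulting buckets at size >= 2
        return None
    mid = len(bucket) // 2
    return [bucket[:mid], bucket[mid:]]

T = [(1, "Flu"), (2, "Cancer"), (3, "Heart disease"), (4, "Flu")]
print(tuple_partition(T, diverse, halve))
```

Here the initial bucket splits once into two diverse buckets of size two, after which no further split is possible and both land in SB.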
Generalization of data:
Bucketized table
Sliced table:
Original table:
V. CONCLUSION
In this project, we presented a new approach called slicing
for privacy-preserving microdata publishing. Slicing overcomes the
limitations of generalization and bucketization and preserves
better utility while protecting against privacy threats. We
illustrate how to use slicing to prevent attribute disclosure
and membership disclosure. Our experiments show that
slicing preserves better data utility than generalization and is
more effective than bucketization in workloads involving the
sensitive attribute. The general methodology proposed by this
work is: before anonymizing the data, one can analyze
the data characteristics and use these characteristics in data
anonymization. The rationale is that one can design better
data anonymization techniques when one knows the data
better.
VI. FUTURE ENHANCEMENTS
This work motivates several directions for future research.
First, in this paper we consider slicing where each highly
correlated attribute appears in exactly one column. As an
extension to the notion of slicing, an attribute can be
duplicated in more than one column, which releases more
attribute correlations. For example, in the table, one could
choose to include the Disease attribute in the first column as
well; that is, the two columns would be {Age, Sex, Disease}
and {Zip code, Disease}. This could provide better data
utility, but the privacy implications need to be carefully
studied and understood.
Second, slicing is a promising technique for handling
high-dimensional data. By partitioning attributes into
columns, we protect privacy by breaking the association of
uncorrelated attributes and preserve data utility by preserving
the association between highly correlated attributes. For
example, slicing can be used for anonymizing transaction
databases, which has been studied recently.
Finally, while a number of anonymization techniques have
been designed, it remains an open problem on how to use the
anonymized data. In our experiments, we randomly generate
the associations between column values of a bucket. This
may lose data utility.
REFERENCES
[1] Tiancheng Li, Ninghui Li, Jian Zhang, and Ian Molloy, "Slicing: A New Approach for Privacy Preserving Data Publishing," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, March 2012.
[2] V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati, "k-Anonymity," in Advances in Information Security, Springer US, 2007.
[3] Latanya Sweeney, "k-Anonymity: A Model for Protecting Privacy," International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557-570, 2002.
[4] J. Brickell and V. Shmatikov, "The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 70-78, 2008.
[5] Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu, "Privacy-Preserving Data Publishing: Concepts and Techniques," Data Mining and Knowledge Discovery Series, 2010.
[6] Neha V. Mogre, Girish Agarwal, and Pragati Patil, "A Review on Data Anonymization Technique for Data Publishing," International Journal of Engineering Research & Technology (IJERT), vol. 1, issue 10, December 2012, ISSN: 2278-0181.
[7] N. Li, T. Li, and S. Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity," Proc. IEEE 23rd Int'l Conf. Data Eng. (ICDE), pp. 106-115, 2007.
[8] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity," in ICDE, 2006.
[9] D. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, and J. Halpern, "Worst-Case Background Knowledge for Privacy-Preserving Data Publishing," in ICDE, 2007.
[10] G. Ghinita, Y. Tao, and P. Kalnis, "On the Anonymization of Sparse High-Dimensional Data," Proc. IEEE 24th Int'l Conf. Data Eng. (ICDE), pp. 715-724, 2008.