South Asian Journal of Engineering and Technology Vol.2, No.40 (2016) 1–9
1
ISSN Number (online): 2454-9614
Ensuring Privacy in Micro Data Publishing Using Slicing
Gandhi. O¹, Tirumala Rao. S. N²
Department of CSE, Narasaraopeta Engineering College, Narasaraopet, JNTUK, AP, India. [email protected]
Abstract—The basic idea of slicing is to break the association
cross columns, but to preserve the association within each
column. This reduces the dimensionality of the data and
preserves better utility than generalization and bucketization.
Slicing preserves utility because it groups highly-correlated
attributes together, and preserves the correlations between such
attributes. Slicing protects privacy because it breaks the
associations between uncorrelated attributes, which are
infrequent and thus identifying. Note that when the dataset
contains QIs and one SA, bucketization has to break their
correlation; slicing, on the other hand, can group some QI
attributes with the SA, preserving attribute correlations with
the sensitive attribute. This paper focuses on how to publish and
share data in a privacy-preserving manner, presenting a new
approach called slicing for privacy-preserving microdata publishing.
Slicing overcomes the limitations of generalization and
bucketization and preserves better utility while protecting
against privacy threats. We illustrate how to use slicing to
prevent attribute disclosure and membership disclosure. Our
experiments show that slicing preserves better data utility than
generalization and is more effective than bucketization in
workloads involving the sensitive attribute. The general
methodology proposed by this work is: before anonymizing
the data, one can analyze the data characteristics and use these
characteristics in data anonymization. The rationale is that one
can design better data anonymization techniques when one knows
the data better.
I. INTRODUCTION
In the information age, data are increasingly being
collected and used. Much of such data are person specific,
containing a record for each individual. For example, micro
data are collected and used by various government agencies
(e.g., U.S. Census Bureau and Department of Motor
Vehicles) and by many commercial companies (e.g., health organizations, insurance companies, and retailers). Other
examples include personal search histories collected by web
search engines.
Companies and agencies who collect such data often need
to publish and share the data for research and other purposes.
However, such data usually contains personal sensitive
information, the disclosure of which may violate the
individual's privacy. Examples of recent attacks include
discovering the medical diagnosis of the governor of
Massachusetts.
In the wake of these well-publicized attacks, privacy has
become an important problem in data publishing and data sharing. This paper focuses on how to publish and share data
in a privacy-preserving manner.
II. BACKGROUND
2.1 Data Collection and Data Publishing
A typical scenario of data collection and publishing is
described in Figure 1.1. In the data collection phase, the
data holder collects data from record owners (e.g., Alice and
Bob). In the data publishing phase, the data holder releases
the collected data to a data miner or the public, called the
data recipient, who will then conduct data mining on the
published data. In this example, the hospital is the data
holder, patients are record owners, and the medical center is
the data recipient.
Typically, micro data is stored in a table, and each record
(row) corresponds to one individual.
Each record has a number of attributes, which can be
divided into the following four categories:
1. Explicit Identifiers, such as Name or Social Security Number, are attributes that can uniquely identify individuals.
2. Some attributes may be Sensitive Attributes (SAs), such as disease and salary.
3. Some may be Quasi-Identifiers (QIs), such as zip code, age, and sex, whose values, when taken together, can potentially identify an individual.
4. Non-Sensitive Attributes contain all attributes that do not fall into the previous three categories.
Each quasi-identifier attribute alone does not uniquely identify a record
owner, but their combination, called the quasi-identifier, often singles out a unique or a small number of record
owners. We assume that each attribute in the microdata is
associated with one of the above attribute types and that the
attribute types can be specified by the data publisher.
2.2 Information Disclosure Risks
When releasing microdata, it is necessary to prevent the
sensitive information of the individuals from being disclosed.
Three types of information disclosure have been identified in the literature: membership disclosure, identity disclosure, and
attribute disclosure.
Membership Disclosure: When the data to be published is
selected from a larger population and the selection criteria
are sensitive (e.g., when publishing datasets about diabetes
patients for research purposes), it is important to prevent an
adversary from learning whether an individual's record is in
the data or not.
Identity Disclosure: Identity disclosure (also called re-identification) occurs when an individual is linked to a
particular record in the released data. Identity disclosure is
what society views as the clearest form of privacy
violation. If one is able to correctly identify one individual's
record from supposedly anonymized data, then people agree
that privacy is violated. In fact, most publicized privacy
attacks are due to identity disclosure. When identity
disclosure occurs, we also say "anonymity" is broken.
Attribute Disclosure: Attribute disclosure occurs when new
information about some individuals is revealed, i.e., the
released data makes it possible to infer the characteristics of
an individual more accurately than would be possible
before the data release. Identity disclosure often leads to
attribute disclosure. Once there is identity disclosure, an individual is re-identified and the corresponding sensitive
values are revealed.
Attribute disclosure can occur with or without identity
disclosure. It has been recognized that even disclosure of
false attribute information may cause harm. An observer of
the released data may incorrectly perceive that an
individual's sensitive attribute takes a particular value, and
behave accordingly based on that perception. This can harm
the individual, even if the perception is incorrect.
Protection against membership disclosure also helps protect
against identity disclosure and attribute disclosure: it is in
general hard to learn sensitive information about an individual
if one does not even know whether this individual's record is in the data or not.
2.3 Privacy-Preserving Data Publishing:
In the most basic form of privacy-preserving data
publishing (PPDP), the data holder has a table of the form:
T(Explicit Identifier, Quasi-Identifier, Sensitive Attributes, Non-Sensitive Attributes).
2.4 Data Anonymization
Anonymity is the condition of having one's name or identity
unknown or concealed. It serves valuable social purposes and
empowers individuals against institutions by limiting
surveillance, but it is also used by wrongdoers to hide their
actions or avoid accountability. Anonymity also provides the
ability to allow anonymous access to services, which avoids
tracking of a user's personal information and behavior, such
as location and frequency of service usage. If someone sends
a file, there may be information on the file that leaves a trail
to the sender; the sender's information may be traced from
the data logged after the file is sent.
2.4.1. Anonymity vs. security
Anonymity is a very powerful technique for protecting
privacy. The decentralized and stateless design of the Internet
is particularly suitable for anonymous behavior. Although
anonymous actions can ensure privacy, they should not be
used as the sole means for ensuring privacy as they also allow for harmful activities, such as spamming, slander, and
harmful attacks without fear of reprisal. Security dictates that
one should be able to detect and catch individuals conducting
illegal behavior, such as hacking, conspiring for terrorist acts,
and conducting fraud. Legitimate needs for privacy should be
allowed, but the ability to conduct harmful anonymous
behavior without responsibility and repercussions in the
name of privacy should not.
2.4.2. Anonymity vs. Privacy
Privacy and anonymity are not the same. The distinction
between privacy and anonymity is clearly seen in an
information technology context. Privacy corresponds to
being able to send an encrypted e-mail to another recipient.
Anonymity corresponds to being able to send the contents of
the e-mail in plain, easily readable form but without any
information that enables a reader of the message to identify the person who wrote it. Privacy is important when the
contents of a message are at issue, whereas anonymity is
important when the identity of the author of a message is at
issue. So, in order to preserve privacy, anonymization is now
widely used.
Data anonymization is a technique that converts clear
text into a non-human-readable form. Data anonymization
for privacy-preserving data publishing has received
a lot of attention in recent years. Detailed data (also called
microdata) contains information about a person, a
household, or an organization. The most popular anonymization
techniques are generalization and bucketization.
Data is considered anonymized even when conjoined
with pointer or pedigree values that direct the user to
the originating system, record, and value (e.g., supporting
selective revelation) and when anonymized records can
be associated, matched, and/or conjoined with other
anonymized records. Clearly, explicit identifiers of record
owners must be removed. Data anonymization enables the
transfer of information across a boundary, such as
between two departments within an agency or between
two agencies, while reducing the risk of unintended
disclosure, and in certain environments in a manner that
enables evaluation and analytics post-anonymization.
III. RELATED WORK
Privacy models:
When microdata are published, various attacks can occur,
such as record-linkage and attribute-linkage attacks. To avoid
these attacks, different anonymization techniques have been
introduced. There are two principles for privacy preservation.
3.1. k-anonymity
A database in which attributes are suppressed or
generalized until each row is identical to at least k-1 other
rows is said to be k-anonymous. K-anonymity
prevents definite database linkages while still releasing data
with reasonable accuracy. K-anonymity relies on two
techniques: generalization and suppression. The k-anonymity model was developed to protect released data from linking
attacks, which cause information disclosure. The
protection k-anonymity provides is easy and simple to
understand. However, k-anonymity cannot provide safety against
attribute disclosure. The k-anonymity model for multiple
sensitive attributes identifies three kinds of
information disclosure.
1) Identity Disclosure: when an individual is linked to a particular record in the published data.
2) Attribute Disclosure: when sensitive information regarding an individual is disclosed.
3) Membership Disclosure: when it is disclosed whether or not an individual's record is present in the data set.
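The k-anonymity condition above can be sketched as a small check over already-generalized records; the schema (Zip code, Age, Sex, Disease) and the values below are illustrative assumptions, not data from this paper.

```python
from collections import Counter

def is_k_anonymous(records, qi_indices, k):
    """True if every quasi-identifier combination occurs in >= k records."""
    counts = Counter(tuple(r[i] for i in qi_indices) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical generalized microdata: (Zip code, Age, Sex, Disease)
table = [
    ("476**", "2*", "*", "Flu"),
    ("476**", "2*", "*", "Cancer"),
    ("479**", "4*", "*", "Flu"),
    ("479**", "4*", "*", "Heart disease"),
]
print(is_k_anonymous(table, qi_indices=[0, 1, 2], k=2))  # True
print(is_k_anonymous(table, qi_indices=[0, 1, 2], k=3))  # False
```

Each of the two QI combinations appears twice, so the table is 2-anonymous but not 3-anonymous.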
3.1.1. Attacks on k-anonymity
In this section we studied two attacks on k-anonymity: the
homogeneity attack and the background knowledge attack.
1) Homogeneity Attack:
If the non-sensitive information of an individual is known to
the attacker, sensitive information may be revealed based on
that known information. This attack occurs when there is no
diversity in the sensitive attribute values of a particular block.
Obtaining sensitive information in this way is also known as
positive disclosure.
2) Background Knowledge Attack:
If the user has some extra demographic information that
can be linked to the released data, helping to rule out
some of the sensitive attribute values, then sensitive
information about an individual might be revealed. Such a
method of revealing information is known as negative
disclosure. Limitations of k-anonymity are:
(1) It cannot hide whether a given individual is in the database and may reveal sensitive attributes.
(2) It cannot protect against attacks based on background knowledge.
(3) Mere knowledge of the k-anonymization algorithm can violate privacy.
(4) It cannot be applied to high-dimensional data without complete loss of utility.
(5) If a dataset is anonymized and published more than once, special methods are required.
3.2. L-Diversity
L-diversity was introduced to address the limitations of
k-anonymity. L-diversity puts constraints on the minimum
number of distinct values seen within an equivalence class
for any sensitive attribute. An equivalence class satisfies
l-diversity when there are l or more well-represented values
for the sensitive attribute.
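The equivalence-class condition can be sketched as a distinct-l-diversity check; this is a simplification ("well-represented" has stronger formal variants such as entropy l-diversity), and the disease values below are hypothetical.

```python
def is_l_diverse(equivalence_classes, l):
    """Distinct l-diversity: every equivalence class must contain at
    least l distinct sensitive-attribute values."""
    return all(len(set(sa_values)) >= l for sa_values in equivalence_classes)

# Sensitive values (Disease) grouped by equivalence class -- hypothetical
classes = [
    ["Flu", "Cancer", "Heart disease"],  # 3 distinct values
    ["Flu", "Flu", "Cancer"],            # 2 distinct values
]
print(is_l_diverse(classes, l=2))  # True
print(is_l_diverse(classes, l=3))  # False
```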
3.2.1Attacks on l-diversity
3.2.1.1) Skewness Attack
L-diversity cannot prevent attribute disclosure when the
overall distribution of the sensitive attribute is skewed, even
if l-diversity is satisfied.
3.2.1.2) Similarity Attack
When the sensitive attribute values are distinct but also
semantically similar, an adversary can learn important
information.
3.2.2.Limitation of L-diversity
While the l-diversity principle represents an important step
beyond k-anonymity in protecting against attribute
disclosure, it has several drawbacks. It is very difficult to
achieve l-diversity, and it may not provide sufficient
privacy protection.
3.3. Anonymization Techniques
Two widely popular data anonymization techniques
areGeneralization and Bucketization.
3.3.1. Generalization
Data generalization is the process of creating successive
layers of summary data in an evaluation database. The
original table is shown in Table (a) and the generalized table
in Table 3. Generalization replaces a quasi-identifier (QI)
value with a semantically consistent but less specific value.
As a result, more records display the same set of
quasi-identifier values. We define an equivalence class of a
generalized table to be a set of records that have the same
values for the quasi-identifiers.
Three types of encoding schemes have been introduced for
generalization:
Global recoding,
Regional recoding,
Local recoding.
The property of global recoding is that multiple occurrences
of the same value are replaced with the same generalized
value. Regional recoding partitions the domain space into
non-intersecting regions, and data points in the same region
are represented by the region they are in; regional recoding
is also called multi-dimensional recoding. Local recoding
allows different occurrences of the same value to be
generalized differently and does not have the above
constraints. Generalization consists of substituting attribute
values with less precise but semantically consistent values.
For example, identifying a specific individual is more
difficult if the month of birth is replaced by the year of
birth, which occurs in more records. Generalization
maintains the correctness of the data at the record level.
Generalization may also result in less specific information,
which may affect the accuracy of machine learning
algorithms applied to the k-anonymous dataset.
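Global recoding as described above can be sketched as follows; the (Zip code, Age, Sex, Disease) schema and the specific generalization rules (3-digit zip prefix, age decade, suppressed sex) are illustrative assumptions.

```python
def generalize(record):
    """Global recoding: every occurrence of a value is generalized the
    same way (zip prefix, age decade, fully suppressed sex)."""
    zip_code, age, sex, disease = record
    decade = age // 10 * 10
    return (zip_code[:3] + "**", f"{decade}-{decade + 9}", "*", disease)

rows = [("47677", 29, "F", "Flu"), ("47602", 22, "M", "Cancer")]
print([generalize(r) for r in rows])
# [('476**', '20-29', '*', 'Flu'), ('476**', '20-29', '*', 'Cancer')]
```

After recoding, both records share the same quasi-identifier values, i.e., they fall into one equivalence class.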
Drawbacks
1) Due to the curse of dimensionality generalization fails
on high-dimensional data.
2) Due to the uniform distribution assumption,
generalization causes too much information loss.
3.3.2. Bucketization
The first technique, which we term bucketization, is to
partition the tuples in T into buckets and then to separate the
sensitive attribute from the non-sensitive ones by
randomly permuting the sensitive attribute values within
each bucket. The sanitized data then consists of the buckets
with permuted sensitive values. We use bucketization as the
method of constructing the published data from the original
table T, although all our results hold for full-domain
generalization as well. We now specify our notion of
bucketization more formally. We partition the tuples into
buckets (i.e., horizontally partition the table T according to
some scheme), and within each bucket, we apply an
independent random permutation to the column containing
the sensitive values. The resulting set of buckets, denoted by
B, is then published. For added privacy, the publisher can
completely mask the identifying attribute (Name) and may
partially mask some of the other non-sensitive attributes
(Age, Sex, and Zip). While bucketization has better data
utility than generalization, it has several limitations.
1. Bucketization does not prevent membership disclosure. Because bucketization publishes the QI values in their original forms, an adversary can find out whether an individual has a record in the published data or not.
2. Bucketization requires a clear separation between QIs and SAs. However, in many data sets, it is unclear which attributes are QIs and which are SAs.
3. By separating the sensitive attribute from the QI attributes, bucketization breaks the attribute correlations between the QIs and the SAs.
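The bucket-and-permute construction can be sketched as below; the schema, the size-based bucketing scheme, and the fixed seed are illustrative assumptions, not the paper's setup.

```python
import random

def bucketize(records, bucket_size, sa_index, seed=0):
    """Horizontally partition records into buckets, then apply an
    independent random permutation to the sensitive column per bucket."""
    rng = random.Random(seed)
    buckets = []
    for start in range(0, len(records), bucket_size):
        bucket = [list(r) for r in records[start:start + bucket_size]]
        sa_values = [row[sa_index] for row in bucket]
        rng.shuffle(sa_values)  # permute sensitive values within the bucket
        for row, sa in zip(bucket, sa_values):
            row[sa_index] = sa
        buckets.append(bucket)
    return buckets

# Hypothetical table: (Age, Sex, Zip code, Disease); Disease is the SA
rows = [(22, "M", "47906", "Flu"), (22, "F", "47906", "Cancer"),
        (33, "F", "47905", "Heart disease"), (52, "F", "47905", "Flu")]
published = bucketize(rows, bucket_size=2, sa_index=3)
```

Note that the QI columns are published unchanged, which is exactly why bucketization cannot prevent membership disclosure (limitation 1 above).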
IV. EXPERIMENT RESULTS
4.1 Slicing
To improve the current state of the art in this paper, we introduce a novel data anonymization technique called
slicing [1]. Slicing partitions the data set both vertically and
horizontally. Vertical partitioning is done by grouping
attributes into columns based on the correlations among the
attributes. Each column contains a subset of attributes that
are highly correlated. Horizontal partitioning is done by
grouping tuples into buckets. Finally, within each bucket,
values in each column are randomly permuted (or sorted) to
break the linking between different columns. The basic idea
of slicing is to break the association cross columns, but
to preserve the association within each column. This reduces the dimensionality of the data and preserves better
utility than generalization and Bucketization. Slicing
preserves utility because it groups highly correlated attributes
together, and preserves the correlations between such
attributes. Slicing protects privacy because it breaks the
associations between uncorrelated attributes, which are
infrequent and thus identifying. Note that when the data set
contains QIs and one SA, Bucketization has to break their
correlation; slicing, on the other hand, can group some QI
attributes with the SA, preserving attribute correlations with
the sensitive attribute. The key intuition behind slicing's
privacy protection is that the slicing process ensures that for
any tuple, there are generally multiple matching buckets.
Slicing first partitions attributes into columns; each column
contains a subset of attributes. Slicing also partitions tuples
into buckets; each bucket contains a subset of tuples. This
horizontally partitions the table. Within each bucket, values
in each column are randomly permuted to break the linking
between different columns.
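The combined vertical and horizontal partitioning with per-column permutation can be sketched as follows; the column grouping {Age, Sex} / {Zip code, Disease} and the data are hypothetical examples, not the paper's tables.

```python
import random

def slice_table(records, columns, bucket_size, seed=0):
    """Slicing sketch: vertically partition attributes into columns,
    horizontally partition tuples into buckets, then permute each
    column's values independently within every bucket."""
    rng = random.Random(seed)
    sliced = []
    for start in range(0, len(records), bucket_size):
        bucket = records[start:start + bucket_size]
        sliced_bucket = []
        for col in columns:  # col = tuple of attribute indices
            values = [tuple(r[i] for i in col) for r in bucket]
            rng.shuffle(values)  # break the linking across columns
            sliced_bucket.append(values)
        sliced.append(sliced_bucket)
    return sliced

# Hypothetical columns {Age, Sex} and {Zip code, Disease}
rows = [(22, "M", "47906", "Flu"), (22, "F", "47906", "Cancer"),
        (33, "F", "47905", "Heart disease"), (52, "F", "47905", "Flu")]
sliced = slice_table(rows, columns=[(0, 1), (2, 3)], bucket_size=2)
```

Within each bucket, the {Age, Sex} pairs and the {Zip code, Disease} pairs survive intact, but the link between the two columns is broken by the independent permutations.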
4.2 SLICING ALGORITHM
This algorithm consists of three phases:
Attribute partitioning, Column Generalization, and Tuple
partitioning.
4.2.1 Attribute Partitioning
This algorithm partitions attributes so that highly
correlated attributes are in the same column. This is good
for both utility and privacy. In terms of data utility, grouping
highly correlated attributes preserves the correlations among
those attributes. In terms of privacy, the association of uncorrelated attributes presents higher identification risks
than the association of highly correlated attributes because
the associations of uncorrelated attribute values are much less
frequent and thus more identifiable.
4.2.2 Column Generalization
First, column generalization may be required for
identity/membership disclosure protection. If a column value
is unique in a column, a tuple with this unique column value
can only have one matching bucket. This is not good for
privacy protection, as in the case of
generalization/Bucketization where each tuple can belong to only one equivalence class/bucket.
4.2.3 Tuple Partitioning
The algorithm maintains two data structures: 1) a
queue of buckets Q and 2) a set of sliced buckets SB.
Initially, Q contains only one bucket, which includes all
tuples, and SB is empty. In each iteration, the algorithm
removes a bucket from Q and splits the bucket into two
buckets [5]. If the sliced table after the split satisfies
l-diversity, then the algorithm puts the two buckets at the end
of the queue Q. Otherwise, we cannot split the bucket
anymore, and the algorithm puts the bucket into SB. When Q
becomes empty, we have computed the sliced table. The set
of sliced buckets is SB.
4.2.4 ALGORITHM
Step 1: In the initial stage we consider a queue of buckets Q and a set of sliced buckets SB. Initially, Q contains only one bucket, which includes all tuples, and SB is empty: Q = {T}; SB = ∅.
Step 2: In each iteration the algorithm removes a bucket from Q and splits it into two buckets: Q = Q - {B}; it then runs the l-diversity check on (T, Q ∪ {B1, B2} ∪ SB, l). The main part of the tuple partitioning algorithm is to check whether a sliced table satisfies l-diversity.
Step 3: For each tuple t, the diversity-check algorithm maintains a list of statistics L[t] about each matching bucket B; initially, for t ∈ T, L[t] = ∅. The statistics include the matching probability p(t, B) and the distribution of candidate sensitive values D(t, B).
Step 4: If the check succeeds, Q = Q ∪ {B1, B2}; the two buckets are moved to the end of Q.
Step 5: Otherwise, SB = SB ∪ {B}; the bucket cannot be split further, so it is sent to SB.
Step 6: When Q becomes empty, we have computed the sliced table; the set of sliced buckets is SB. So, finally, return SB.
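The six steps above can be sketched with a queue of buckets. Note the assumptions: the paper's check runs over the whole sliced table, whereas the toy `diverse` predicate below inspects each candidate bucket on its own, and `halve` is a stand-in for the actual split procedure.

```python
from collections import deque

def tuple_partition(tuples, satisfies_privacy, split):
    """Queue-based tuple partitioning: keep splitting buckets while the
    result still satisfies the privacy check; otherwise finalize them."""
    Q = deque([tuples])  # Step 1: Q = {T}
    SB = []              # Step 1: SB = ∅
    while Q:
        bucket = Q.popleft()            # Step 2: Q = Q - {B}
        halves = split(bucket)
        if halves and all(satisfies_privacy(b) for b in halves):
            Q.extend(halves)            # Step 4: Q = Q ∪ {B1, B2}
        else:
            SB.append(bucket)           # Step 5: SB = SB ∪ {B}
    return SB                           # Step 6: return SB

# Toy stand-ins: a bucket is "diverse" with >= 2 distinct sensitive values
def diverse(bucket):
    return len({sa for _, sa in bucket}) >= 2

def halve(bucket):
    if len(bucket) < 4:  # keep resulting buckets at size >= 2
        return None
    mid = len(bucket) // 2
    return [bucket[:mid], bucket[mid:]]

T = [(1, "Flu"), (2, "Cancer"), (3, "Heart disease"), (4, "Flu")]
print(tuple_partition(T, diverse, halve))
```

Here the initial bucket splits once into two diverse buckets of size two, after which no further split is possible and both land in SB.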
Generalization of data:
Bucketized table
Sliced table:
Original table:
V. CONCLUSION
In this project, we presented a new approach called slicing
for privacy-preserving microdata publishing. Slicing overcomes the
limitations of generalization and bucketization and preserves
better utility while protecting against privacy threats. We
illustrate how to use slicing to prevent attribute disclosure
and membership disclosure. Our experiments show that
slicing preserves better data utility than generalization and is
more effective than bucketization in workloads involving the
sensitive attribute. The general methodology proposed by this
work is: before anonymizing the data, one can analyze
the data characteristics and use these characteristics in data
anonymization. The rationale is that one can design better
data anonymization techniques when one knows the data
better.
VI. FUTURE ENHANCEMENTS
This work motivates several directions for future research.
First, in this paper we consider slicing where each highly
correlated attribute appears in exactly one column. As an
extension to the notion of slicing, an attribute can be
duplicated in more than one column, which releases more
attribute correlations. For example, in the table, one could
choose to include the Disease attribute in the first column as
well; that is, the two columns would be {Age, Sex, Disease}
and {Zip code, Disease}. This could provide better data
utility, but the privacy implications need to be carefully
studied and understood.
Second, slicing is a promising technique for handling
high-dimensional data. By partitioning attributes into
columns, we protect privacy by breaking the association of
uncorrelated attributes and preserve data utility by preserving
the association between highly correlated attributes. For
example, slicing can be used for anonymizing transaction
databases, which has been studied recently.
Finally, while a number of anonymization techniques have
been designed, it remains an open problem on how to use the
anonymized data. In our experiments, we randomly generate
the associations between column values of a bucket. This
may lose data utility.
REFERENCES
[1] Tiancheng Li, Ninghui Li, Jian Zhang, and Ian Molloy, "Slicing: A New Approach for Privacy Preserving Data Publishing," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, March 2012.
[2] V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati, "k-Anonymity," in Advances in Information Security, Springer US, 2007.
[3] Latanya Sweeney, "k-Anonymity: A Model for Protecting Privacy," International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557-570, 2002.
[4] J. Brickell and V. Shmatikov, "The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 70-78, 2008.
[5] Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu, "Privacy-Preserving Data Publishing: Concepts and Techniques," Data Mining and Knowledge Discovery Series, 2010.
[6] Neha V. Mogre, Girish Agarwal, and Pragati Patil, "A Review on Data Anonymization Technique for Data Publishing," International Journal of Engineering Research & Technology (IJERT), vol. 1, issue 10, December 2012, ISSN: 2278-0181.
[7] N. Li, T. Li, and S. Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity," Proc. IEEE 23rd Int'l Conf. Data Eng. (ICDE), pp. 106-115, 2007.
[8] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity," in ICDE, 2006.
[9] D. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, and J. Halpern, "Worst-Case Background Knowledge for Privacy-Preserving Data Publishing," in ICDE, 2007.
[10] G. Ghinita, Y. Tao, and P. Kalnis, "On the Anonymization of Sparse High-Dimensional Data," Proc. IEEE 24th Int'l Conf. Data Eng. (ICDE), pp. 715-724, 2008.