generalization-based privacy preservation and discrimination prevention in data publishing and...

31
Data Min Knowl Disc DOI 10.1007/s10618-014-0346-1 Generalization-based privacy preservation and discrimination prevention in data publishing and mining Sara Hajian · Josep Domingo-Ferrer · Oriol Farràs Received: 9 February 2013 / Accepted: 10 January 2014 © The Author(s) 2014 Abstract Living in the information society facilitates the automatic collection of huge amounts of data on individuals, organizations, etc. Publishing such data for secondary analysis (e.g. learning models and finding patterns) may be extremely useful to pol- icy makers, planners, marketing analysts, researchers and others. Yet, data publishing and mining do not come without dangers, namely privacy invasion and also poten- tial discrimination of the individuals whose data are published. Discrimination may ensue from training data mining models (e.g. classifiers) on data which are biased against certain protected groups (ethnicity, gender, political preferences, etc.). The objective of this paper is to describe how to obtain data sets for publication that are: (i) privacy-preserving; (ii) unbiased regarding discrimination; and (iii) as useful as possi- ble for learning models and finding patterns. We present the first generalization-based approach to simultaneously offer privacy preservation and discrimination prevention. We formally define the problem, give an optimal algorithm to tackle it and evaluate the algorithm in terms of both general and specific data analysis metrics (i.e. vari- ous types of classifiers and rule induction algorithms). It turns out that the impact of our transformation on the quality of data is the same or only slightly higher than the Responsible editor: Guest Editors of PKDD 2014 (Dr. Toon Calders, Prof. Floriana Esposito, Prof. Eyke Hüllermeier and Dr. Rosa Meo). S. Hajian (B ) · J. Domingo-Ferrer · O. Farràs Department of Computer Engineering and Maths, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Av. Països Catalans 26, 43007 Tarragona, Catalonia e-mail: [email protected] J. Domingo-Ferrer e-mail: [email protected] O. Farràs e-mail: [email protected] 123

Upload: oriol

Post on 23-Dec-2016

214 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

Data Min Knowl DiscDOI 10.1007/s10618-014-0346-1

Generalization-based privacy preservationand discrimination prevention in data publishingand mining

Sara Hajian · Josep Domingo-Ferrer ·Oriol Farràs

Received: 9 February 2013 / Accepted: 10 January 2014© The Author(s) 2014

Abstract Living in the information society facilitates the automatic collection of hugeamounts of data on individuals, organizations, etc. Publishing such data for secondaryanalysis (e.g. learning models and finding patterns) may be extremely useful to pol-icy makers, planners, marketing analysts, researchers and others. Yet, data publishingand mining do not come without dangers, namely privacy invasion and also poten-tial discrimination of the individuals whose data are published. Discrimination mayensue from training data mining models (e.g. classifiers) on data which are biasedagainst certain protected groups (ethnicity, gender, political preferences, etc.). Theobjective of this paper is to describe how to obtain data sets for publication that are: (i)privacy-preserving; (ii) unbiased regarding discrimination; and (iii) as useful as possi-ble for learning models and finding patterns. We present the first generalization-basedapproach to simultaneously offer privacy preservation and discrimination prevention.We formally define the problem, give an optimal algorithm to tackle it and evaluatethe algorithm in terms of both general and specific data analysis metrics (i.e. vari-ous types of classifiers and rule induction algorithms). It turns out that the impact ofour transformation on the quality of data is the same or only slightly higher than the

Responsible editor: Guest Editors of PKDD 2014 (Dr. Toon Calders, Prof. Floriana Esposito, Prof. EykeHüllermeier and Dr. Rosa Meo).

S. Hajian (B) · J. Domingo-Ferrer · O. FarràsDepartment of Computer Engineering and Maths, UNESCO Chair in Data Privacy, Universitat Rovirai Virgili, Av. Països Catalans 26, 43007 Tarragona, Cataloniae-mail: [email protected]

J. Domingo-Ferrere-mail: [email protected]

O. Farràse-mail: [email protected]

123

Page 2: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

S. Hajian et al.

impact of achieving just privacy preservation. In addition, we show how to extend ourapproach to different privacy models and anti-discrimination legal concepts.

Keywords Data mining · Anti-discrimination · Privacy · Generalization

1 Introduction

In the information society, massive and automated data collection occurs as a conse-quence of the ubiquitous digital traces we all generate in our daily life. The availabilityof such wealth of data makes its publication and analysis highly desirable for a varietyof purposes, including policy making, planning, marketing, research, etc. Yet the realand obvious benefits of data publishing and mining have a dual, darker side. Thereare at least two potential threats for individuals whose information is published: pri-vacy invasion and potential discrimination. Privacy invasion occurs when the valuesof published sensitive attributes can be linked to specific individuals (or companies).Discrimination is unfair or unequal treatment of people based on membership to a cat-egory, group or minority, without regard to individual characteristics. On the legal side,parallel to the development of privacy legislation (European Union Legislation 1995),anti-discrimination legislation has undergone a remarkable expansion (Australian Leg-islation 2008; European Union Legislation 2009), and it now prohibits discriminationagainst protected groups on the grounds of race, color, religion, nationality, sex, mar-ital status, age and pregnancy, and in a number of settings, like credit and insurance,personnel selection and wages, and access to public services.

On the technology side, efforts at guaranteeing privacy have led to developingstatistical disclosure control (SDC, Willenborg and de Waal 1996; Hundepool et al.2012) and privacy preserving data mining (PPDM, Agrawal and Srikant 2000; Lindelland Pinkas 2000). SDC and PPDM have become increasingly popular because theyallow publishing and sharing sensitive data for secondary analysis. Different privacymodels and their variations have been proposed to trade off the utility of the resultingdata for protecting individual privacy against different kinds of privacy attacks. k-Anonymity (Samarati 2001; Sweeney 2002), l-diversity (Machanavajjhala et al. 2007),t-closeness (Li et al. 2007) and differential privacy (Dwork 2006) are among the best-known privacy models. Detailed descriptions of different PPDM models and methodscan be found in Aggarwal and Yu (2008) and Fung et al. (2010). The issue of anti-discrimination has recently been considered from a data mining perspective (Pedreschiet al. 2008). Some proposals are oriented to using data mining to discover and measurediscrimination (Pedreschi et al. 2008, 2009; Ruggieri et al. 2010; Loung et al. 2011;Dwork et al. 2012; Berendt and Preibusch 2012); other proposals (Calders and Verwer2010; Hajian et al. 2011; Kamiran and Calders 2011; Kamiran et al. 2010; Kamishimaet al. 2012; Zliobaite et al. 2011) deal with preventing data mining from becomingitself a source of discrimination. In other words, discrimination prevention in datamining (DPDM) consists of ensuring that data mining models automatically extractedfrom a data set are such that they do not lead to discriminatory decisions even if thedata set is inherently biased against protected groups. For a survey of contributions todiscrimination-aware data analysis see Custers et al. (2013).

123

Page 3: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

Generalization-based privacy preservation and discrimination prevention

Although PPDM/SDC and DPDM have different goals, they have some technicalsimilarities. Necessary steps of PPDM/SDC are: (i) define the privacy model (e.g.k-anonymity); (ii) apply a proper anonymization technique (e.g. generalization) tosatisfy the requirements of the privacy model; (iii) measure data quality loss as a sideeffect of data distortion (the measure can be general or tailored to specific data miningtasks). Similarly, necessary steps for DPDM include: (i) define the non-discriminationmodel according to the respective legal concept (i.e. α-protection according to thelegal concept of direct discrimination); (ii) apply a suitable data distortion method tosatisfy the requirements of the non-discrimination model; (iii) measure data qualityloss as in the case of DPDM. Considering the literature, there is an evident gap betweenthe large body of research in data privacy technologies and the recent early results onanti-discrimination technologies in data mining.

1.1 Motivating example

Table 1 presents raw customer credit data, where each record represents a customer’sspecific information. Sex, Race, and working hours named Hours can be taken asquasi-identifier attributes. The class attribute has two values, Yes and No, to indicatewhether the customer has received credit. Assume that Salary is a sensitive/privateattribute and groups of Sex and Race attributes are protected. The credit giver wantsto publish a privacy-preserving and non-discriminating version of Table 1. To do that,she needs to eliminate two types of threats against her customers:

– Privacy threat, e.g., record linkage: If a record in the table is so specific that only afew customers match it, releasing the data may allow determining the customer’sidentity (record linkage attack) and hence the salary of that identified customer.Suppose that the adversary knows that the target identified customer is white andhis working hours are 40. In Table 1, record I D = 1 is the only one matching thatcustomer, so the customer’s salary becomes known.

– Discrimination threat: If credit has been denied to most female customers, releas-ing the data may lead to making biased decisions against them when these data

Table 1 Private data set withbiased decision records

ID Sex Race Hours Salary Credit_approved

1 Male White 40 High Yes

2 Male Asian–Pac 50 Medium Yes

3 Male Black 35 Medium No

4 Female Black 35 Medium No

5 Male White 37 Medium Yes

6 Female Amer–Indian 37 Medium Yes

7 Female White 35 Medium No

8 Male Black 35 High Yes

9 Female White 35 Low No

10 Male White 50 High Yes

123

Page 4: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

S. Hajian et al.

are used for extracting decision patterns/rules as part of the automated decisionmaking. Suppose that the minimum support (ms) required to extract a classi-fication rule from the data set in Table 1 is that the rule be satisfied by atleast 30 % of the records. This would allow extracting the classification ruler : Sex = f emale → Credit_approved = no from these data. Clearly, usingsuch a rule for credit scoring is discriminatory against female customers.

1.2 Paper contributions and overview

We argue that both threats above must be addressed at the same time, since providingprotection against only one of them might not guarantee protection against the other.An important question is how we can provide protection against both privacy anddiscrimination risks without one type of protection working against the other and withminimum impact on data quality. In Hajian et al. (2012), the authors investigated thisproblem in the context of knowledge/pattern publishing. They proposed a combinedpattern sanitization framework that yields both privacy and discrimination protectedpatterns, while introducing reasonable (controlled) pattern distortion. In this paper, weinvestigate for the first time the problem of discrimination- and privacy-aware datapublishing, i.e. transforming the data, instead of patterns, in order to simultaneouslyfulfill privacy preservation and discrimination prevention in data mining. Our approachfalls into the pre-processing category: it sanitizes the data before they are used in datamining tasks rather than sanitizing the knowledge patterns extracted by data miningtasks (post-processing). Very often, knowledge publishing (publishing the sanitizedpatterns) is not enough for the users or researchers, who want to be able to mine thedata themselves. This gives researchers greater flexibility in performing the requireddata analyses.

We introduce an anti-discrimination model that can cover every possible nuance ofdiscrimination w.r.t. multiple attributes, not only for specific protected groups withinone attribute. Note that the existing pre-processing discrimination prevention methodsare based on data perturbation, either by modifying class attribute values (Kamiranand Calders 2011; Hajian et al. 2011; Hajian and Domingo-Ferrer 2013) or by mod-ifying PD attribute values (Hajian et al. 2011; Hajian and Domingo-Ferrer 2013) ofthe training data. One of the drawbacks of these techniques is that they cannot beapplied (are not preferred) in countries where data perturbation is not legally accepted(preferred), while generalization is allowed; e.g. this is the case of Sweden and otherNordic countries (see p. 24 of Statistics Sweden 2001). Moreover, generalizationnot only can make the original data privacy-protected but can also simultaneouslymake the original data both discrimination- and privacy-protected. In our earlier work(Hajian and Domingo-Ferrer 2012), we explored under which conditions several dataanonymization techniques could also help preventing discrimination. The techniquesexamined included suppression and several forms of generalization (global recoding,local recoding, multidimensional generalizations). In this paper, the approach is quitedifferent: rather than examining a set of techniques, we focus on the (novel) prob-lem of achieving simultaneous discrimination prevention and privacy protection indata publishing and mining. Specifically, we leverage the conclusions of our previous

123

Page 5: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

Generalization-based privacy preservation and discrimination prevention

study (Hajian and Domingo-Ferrer 2012) to choose the best possible technique forsolving the problem: the high applicability of generalization in data privacy justifiesits choice.

We present an optimal algorithm that can cover different legally-grounded mea-sures of discrimination to obtain all full-domain generalizations whereby the dataare discrimination and privacy protected. The “minimal” generalization (i.e., the oneincurring the least information loss according to some criterion) can then be chosen. Inaddition, we evaluate the performance of the proposed approach and the data qualityloss incurred as a side effect of the data generalization needed to achieve both discrim-ination and privacy protection. Data quality loss is measured in terms of both generaland specific data analysis metrics (i.e. various types of classifiers and rule inductionalgorithms). We compare this quality loss with the one incurred to achieve privacyprotection only. Finally, we present how to extend our approach to satisfy differentprivacy models and anti-discrimination legal concepts.

The article is organized as follows. Section 2 introduces basic definitions and con-cepts used throughout the paper. Privacy and anti-discrimination models are presentedin Sect. 3 and 4, respectively. In Sect. 5, we formally define the problem of simul-taneous privacy and anti-discrimination data protection. Our proposed approach andan algorithm for discrimination- and privacy-aware data publishing and mining arepresented in Sects. 5.1 and 5.2. Section 6 reports experimental work. An extension ofthe approach to alternative privacy-preserving requirements and anti-discriminationlegal constraints is presented in Sect. 7. Finally, Sect. 8 summarizes conclusions andidentifies future research topics.

2 Basic notions

Given the data table DB(A1, . . . , An), a set of attributes A = {A1, . . . , An}, and arecord/tuple t ∈ DB, t[Ai , . . . , A j ] denotes the sequence of the values of Ai , . . . , A j

in t , where {Ai , . . . , A j } ⊆ {A1, . . . , An}. Let DB[Ai , . . . , A j ] be the projection,maintaining duplicate records, of attributes Ai , . . . , A j in DB. Let |DB| be the cardi-nality of DB, that is, the number of records it contains. The attributes A in a databaseDB can be classified into several categories. Identifiers are attributes that uniquelyidentify individuals in the database, like Passport number. A quasi-identifier (QI) is aset of attributes that, in combination, can be linked to external identified informationfor re-identifying an individual; for example, Zipcode, Birthdate and Gender form aquasi-identifier because together they are likely to be linkable to single individualsin external public identified data sources (like the electoral roll). Sensitive attributes(S) are those that contain sensitive information, such as Disease or Salary. Let S bea set of sensitive attributes in DB. Civil rights laws (Australian Legislation 2008;European Union Legislation 2009; United States Congress 1963), explicitly identifythe attributes to be protected against discrimination. For instance, U.S. federal laws(United States Congress 1963) prohibit discrimination on the basis of race, color, reli-gion, nationality, sex, marital status, age and pregnancy. In our context, we considerthese attributes as potentially discriminatory (PD). Let D A be a set of PD attributesin DB specified by law. Comparing privacy legislation (European Union Legislation

123

Page 6: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

S. Hajian et al.

Table 2 Main acronyms usedthroughout the paper

Definition

BK Background knowledge

CA Classification accuracy

CM Classification metric

D A A set of PD attributes in DBDB Data table

DR Discernibility ratio

DT Domain tuple

GH Generalization height

L A legally-grounded attributes

PD Potentially discriminatory

PND Potentially non-discriminatory

QI Quasi-identifier

1995) and anti-discrimination legislation (European Union Legislation 2009; UnitedStates Congress 1963), PD attributes can overlap with QI attributes (e.g. Sex, Age,Marital_status) and/or sensitive attributes (e.g. Religion in some applications). A classattribute Ac ∈ A is a fixed attribute of DB, also called decision attribute, reportingthe outcome of a decision made of an individual record. An example is attributeCredit_approved, which can be yes or no. A domain DAi is associated with eachattribute Ai to indicate the set of values that the attribute can assume. Table 2 lists themain acronyms used throughout the paper.

An item is an expression Ai = q, where Ai ∈ A and q ∈ DAi , e.g. Race=black. Aclass item Ai = q is an item where Ai = Ac and q ∈ DAc , e.g. Credit_approved=no.An itemset X is a collection of one or more items, e.g. {Race = black, Hours = 40}.In previous works on anti-discrimination (Pedreschi et al. 2008, 2009; Ruggieri et al.2010; Kamiran and Calders 2011; Hajian et al. 2011; Hajian and Domingo-Ferrer2013; Kamiran et al. 2010; Zliobaite et al. 2011), the authors propose discriminationdiscovery and prevention techniques w.r.t. specific protected groups, e.g. black and/orfemale persons. However, this assumption fails to capture the various nuances ofdiscrimination since minority or disadvantaged groups can be different in differentcontexts. For instance, in a neighborhood with almost all black people, whites are aminority and may be discriminated. Then we consider Ai = q to be a PD item, forevery q ∈ DAi , where Ai ∈ D A, e.g. Race = q is a PD item for any race q, whereD A = {Race}. This definition is also compatible with the law. For instance, the U.S.Equal Pay Act (United States Congress 1963) states that: “a selection rate for anyrace, sex, or ethnic group which is less than four-fifths of the rate for the group withthe highest rate will generally be regarded as evidence of adverse impact”. An itemAi = q with q ∈ DAi is potentially1 non-discriminatory (PND) if Ai /∈ D A, e.g.Hours = 35 where D A = {Race}. A PD itemset is an itemset containing only PD

1 The use of PD (resp., PND) attributes in decision making does not necessarily lead to (or exclude)discriminatory decisions (Ruggieri et al. 2010).

123

Page 7: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

Generalization-based privacy preservation and discrimination prevention

items, which we also call protected-by-law (or protected, for short) groups. A PNDitemset is an itemset containing only PND items.

The support of an itemset X in a data table DB is the number of records that containX , i.e. suppDB(X) = |{ti ∈ DB|X ⊆ ti }|. A classification rule is an expressionr : X → C , where C is a class item and X is an itemset containing no class item,e.g. {Race = Black, Hours = 35} → Credit_approved = no. The itemset X iscalled the premise of the rule. The confidence of a classification rule, con fDB(X →C), measures how often the class item C appears in records that contain X . We omitthe subscripts in suppDB(·) and con fDB(·) when there is no ambiguity. Also, thenotation readily extends to negated itemsets ¬X . A frequent classification rule is aclassification rule with support and/or confidence greater than respective specifiedlower bounds. Support is a measure of statistical significance, whereas confidence isa measure of the strength of the rule. In this paper we consider frequent rules w.r.t. thesupport measure.

3 Privacy model

To prevent record linkage attacks through quasi-identifiers, Samarati and Sweeney(Samarati and Sweeney 1998; Sweeney 1998) proposed the notion of k-anonymity.

Definition 1 (k-anonymity) Let DB(A1, . . . , An) be a data table and Q I ={Q1, . . . , Qm} ⊆ {A1, . . . , An} be a quasi-identifier. DB is said to satisfy k-anonymityw.r.t. Q I if each combination of values of attributes in Q I is shared by at least k tuples(records) in DB.

A data table satisfying this requirement is called k-anonymous. The set of all tuplesin DB for each sequence of values in DB[Q I ] is called frequency set. Typically, theoriginal data table does not satisfy k-anonymity and, before being published, it mustbe modified through an anonymization method. Samarati and Sweeney (Samarati andSweeney 1998; Samarati 2001; Sweeney 2002) gave methods for k-anonymizationbased on generalization. Computational procedures alternative to generalization havethereafter been proposed to attain k-anonymity, like microaggregation (Domingo-Ferrer and Torra 2005). Nonetheless, generalization remains not only the main methodfor k-anonymity, but it can also be used to satisfy other privacy models (e.g., l-diversityin Machanavajjhala et al. (2007), t-closeness in Li et al. (2007) and differential pri-vacy in Mohammed et al. (2011)). Generalization replaces QI attribute values with ageneralized version of them. Let Di and D j be two domains. If the values of D j arethe generalization of the values in domain Di , we denote Di ≤D D j . A many-to-onevalue generalization function γ : Di → D j is associated with every Di , D j withDi ≤D D j .

Generalization is based on a domain generalization hierarchy and a correspondingvalue generalization hierarchy on the values in the domains. A domain generalizationhierarchy is defined to be a set of domains that is totally ordered by the relationship≤D . We can consider the hierarchy as a chain of nodes, and if there is an edge fromDi to D j , it means that D j is the direct generalization of Di . Let Domi be a set ofdomains in a domain generalization hierarchy of a quasi-identifier attribute Qi ∈ Q I .

123

Page 8: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

S. Hajian et al.

Fig. 1 An example of domain (left) and value (right) generalization hierarchies of Race, Sex and Hoursattributes

Fig. 2 Generalization lattice forthe Race and Sex attributes

For every Di , D j , Dk ∈ Domi if Di ≤D D j and D j ≤D Dk , then Di ≤D Dk . Inthis case, domain Dk is an implied generalization of Di . The maximal element ofDomi is a singleton, which means that all values in each domain can be eventuallygeneralized to a single value. Figure 1 left shows possible domain generalizationhierarchies for the Race, Sex and Hours attributes in Table 1. Value generalizationfunctions associated with the domain generalization hierarchy induce a correspondingvalue-level tree, in which edges are denoted by γ , i.e. direct value generalization, andpaths are denoted by γ +, i.e. implied value generalization. Figure 1 right shows avalue generalization hierarchy with each value in the Race, Sex and Hours domains,e.g. Colored = γ (black) and Any-race ∈ γ +(black). For a Q I = {Q1, . . . , Qn}consisting of multiple attributes, each with its own domain, the domain generalizationhierarchies of the individual attributes Dom1, . . . , Domn can be combined to forma multi-attribute generalization lattice. Each vertex of a lattice is a domain tupleDT = 〈N1, . . . , Nn〉 such that Ni ∈ Domi , for i = 1, . . . , n, representing a multi-attribute domain generalization. An example for Sex and Race attributes is presentedin Fig. 2.

Definition 2 (Full-domain generalization) Let DB be a data table having a quasi-identifier Q I = {Q1, . . . , Qn} with corresponding domain generalization hierarchiesDom1, . . . , Domn . A full-domain generalization can be defined by a domain tupleDT = 〈N1, . . . , Nn〉 with Ni ∈ Domi , for every i = 1, . . . , n. A full-domain

123

Page 9: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

Generalization-based privacy preservation and discrimination prevention

generalization with respect to DT maps each value q ∈ DQi to some a ∈ DNi

such that a = q, a = γ (q) or a ∈ γ +(q).

Full-domain generalization guarantees that all values in an attribute are generalizedto the same domain.2 For example, consider Fig. 1 right and assume that values 40 and50 of Hours are generalized to [40−100); then 35 and 37 must be generalized to [1, 40).A full-domain generalization w.r.t. domain tuple DT is k-anonymous if it yields k-anonymity for DB with respect to Q I . In the literature, different generalization-basedalgorithms have been proposed to k-anonymize a data table. They are optimal (Sama-rati 2001; Lefevre et al. 2005; Bayardo and Agrawal 2005) or minimal (Iyengar 2002;Fung et al. 2005; Wang et al. 2004). Although minimal algorithms are in generalmore efficient than optimal ones, we choose an optimal algorithm (i.e. Incognito,Lefevre et al. 2005) because, in this first work about combining privacy preservationand discrimination prevention in data publishing, it allows us to study the worst-casetoll on efficiency of achieving both properties. Incognito is a well-known suite ofoptimal bottom-up generalization algorithms to generate all possible k-anonymousfull-domain generalizations.3 In comparison with other optimal algorithms, Incognitois more scalable and practical for larger data sets and more suitable for categoricalattributes. Incognito is based on two main properties satisfied for k-anonymity:

– Subset property. IfDB is k-anonymous with respect to Q I , then it is k-anonymouswith respect to any subset of attributes in Q I (the converse property does not holdin general).

– Generalization property. Let P and Q be nodes in the generalization lattice ofDB such that DP ≤D DQ . If DB is k-anonymous with respect to P , then DB isalso k-anonymous with respect to Q (monotonicity of generalization).

Example Continuing the motivating example (Sect. 1.1), consider Table 1 and supposeQ I = {Race, Sex} and k = 3. Consider the generalization lattice over Q I attributesin Fig. 2. Incognito finds that Table 1 is 3-anonymous with respect to domain tuples〈S1, R1〉, 〈S0, R2〉 and 〈S1, R2〉.

4 Non-discrimination model

The legal notion of under-representation has inspired existing approaches for discrim-ination discovery based on rule/pattern mining (Pedreschi et al. 2008), taking intoaccount different legal concepts (e.g., direct and indirect discrimination and genuineoccupational requirement). Direct discrimination occurs when the input data contain

2 In full-domain generalization if a value is generalized, all its instances are generalized. There are alter-native generalization schemes, such as multi-dimensional generalization or cell generalization, in whichsome instances of a value may remain ungeneralized while other instances are generalized.3 Although algorithms using multi-dimensional or cell generalizations (e.g. the Mondrian algorithm,Lefevre et al. 2006) cause less information loss than algorithms using full-domain generalization, theformer suffer from the problem of data exploration (Fung et al. 2010). This problem is caused by the co-existence of specific and generalized values in the generalized data set, which make data exploration andinterpretation difficult for the data analyst.

123

Page 10: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

S. Hajian et al.

Fig. 3 Discrimination measures

PD attributes, e.g., Sex, while for indirect discrimination it is the other way round. Inthe rest, we assume that the input data contain protected groups, which is a reasonableassumption for attributes such as Sex, Age and Pregnancy/marital status and we omitthe word “direct” for brevity. In Sect. 7, we describe how to deal with other legalconcepts, e.g., indirect discrimination.

Given a set D A of potentially discriminatory attributes and starting from a data setD of historical decision records, the idea is to extract frequent classification rules of theform A, B → C (where A is a non-empty PD itemset and B a is PND itemset), calledPD rules, to unveil contexts B of possible discrimination, where the non-empty pro-tected group A suffers from over-representation with respect to the negative decisionC (C is a class item reporting a negative decision, such as credit denial, applicationrejection, job firing). In other words, A is under-represented with respect to the cor-responding positive decision ¬C . As an example, rule Sex=female, Hours=35 →Credit_approved=no is a PD rule about denying credit (the decision C) to women (theprotected group A) among those working 35 hours per week (the context B), withD A={Sex}. In other words, the context B determines the subsets of protected groups(e.g., subsets of women working 35 h per week).

The degree of under-representation should be measured over each PD rule usinga legally-grounded measure,4 such as those introduced in Pedreschi et al. (2009) andshown in Fig. 3. Selection lift (slift)5 is the ratio of the proportions of benefit denialbetween the protected and unprotected groups, e.g. women and men resp., in the givencontext. Extended lift (elift)6 is the ratio of the proportions of benefit denial, e.g. creditdenial, between the protected groups and all people who were not granted the benefit,e.g. women versus all men and women who were denied credit, in the given context.A special case of slift occurs when we deal with non-binary attributes, for instancewhen comparing the credit denial ratio of blacks with the ratio for other groups ofthe population. This yields a third measure called contrasted lift (clift) which, givenA as a single item a = v1 (e.g. Race=black), compares it with the most favored item

4 On the legal side, different measures are adopted worldwide; see Pedreschi et al. (2013) for parallelsbetween different measures and anti-discrimination acts.5 Discrimination occurs when a group is treated “less favorably” than others.6 Discrimination of a group occurs when a higher proportion of people not in the group is able to complywith a qualifying criterion.

123

Page 11: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

Generalization-based privacy preservation and discrimination prevention

a = v2 (e.g. Race=white). The last measure is the odds lift (olift), the ratio betweenthe odds of the proportions of benefit denial between the protected and unprotectedgroups. The discrimination measures mentioned so far in this paragraph are formallydefined in Fig. 3. Whether a rule is to be considered discriminatory according toa specific discrimination measure can be assessed by thresholding the measure asfollows.

Definition 3 (α-protective/discriminatory rule) Let f be one of the measures in Fig.3, α ∈ R a fixed threshold,7 A a PD itemset and B a PND itemset with respect to D A.A PD classification rule c = A, B → C is α-protective with respect to f if f (c) < α.Otherwise, c is α-discriminatory.

Building on Definition 3, we introduce the notion of α-protection for a data table.

Definition 4 (α-protective data table) Let DB(A1, . . . , An) be a data table, D A a setof PD attributes associated with it, and f be one of the measures in Fig. 3. DB issaid to satisfy α-protection or to be α-protective w.r.t. D A and f if each PD frequentclassification rule c : A, B → C extracted from DB is α-protective, where A is a PDitemset and B is a PND itemset.

Example Continuing the motivating example, suppose D A = {Sex}, α = 1.2 andms = 20 %. Table 1 does not satisfy 1.2-protection w.r.t. both f = sli f t andf = eli f t , since for a frequent rule c equal to Sex=female, Salary = medium →credit_approved=no we have

sli f t (c) = p1

p2= a1/n1

a2/n2= 2/3

1/3= 2

and

eli f t (c) = p1

p= a1/n1

(a1 + a2)/(n1 + n2)= 2/3

3/6= 1.33.

Note that α-protection in DB not only prevents discrimination against the mainprotected groups w.r.t. D A (e.g., women) but also against any subsets of protectedgroups w.r.t. A\D A (e.g., women who have medium salary and/or work 36 h perweek) (Pedreschi et al 2009). Releasing an α-protective (unbiased) version of anoriginal data table is desirable to prevent discrimination with respect to D A. If theoriginal data table is biased w.r.t. D A, it must be modified before being published (i.e.pre-processed).

7 α states an acceptable level of discrimination according to laws and regulations. For example, the U.S.Equal Pay Act (United States Congress 1963) states that “a selection rate for any race, sex, or ethnic groupwhich is less than four-fifths of the rate for the group with the highest rate will generally be regarded asevidence of adverse impact”. This amounts to using clift with α = 1.25.

123

Page 12: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

S. Hajian et al.

5 Simultaneous privacy preservation and discrimination prevention

We want to obtain anonymized data tables that are protected against record linkageand also free from discrimination, more specifically α-protective k-anonymous datatables defined as follows.

Definition 5 (α-protective k-anonymous data table) Let DB(A1, . . . , An) be a datatable, Q I = {Q1, . . . , Qm} a quasi-identifier, D A a set of PD attributes, k ananonymity threshold, and α a discrimination threshold. DB is α-protective k-anonymous if it is both k-anonymous and α-protective with respect to Q I and D A,respectively.

We focus on the problem of producing a version of DB that is α-protective k-anonymous with respect to Q I and D A. The problem could be investigated withrespect to different possible relations between categories of attributes in DB. k-Anonymity uses quasi-identifiers for re-identification, so we take the worst case forprivacy in which any attribute can be part of a quasi-identifier, except the class/decisionattribute. We exclude the latter attribute from Q I because we assume that the decisionmade on a specific individual is not publicly linked to his/her identity by the decisionmaker, e.g. banks do not publicize who was granted or denied credit. On the other hand,each QI attribute can be PD or not. In summary, the following relations are assumed:(1) Q I ∩ C = ∅, (2) D A ⊆ Q I . Taking the largest possible QI makes sense indeed.The more attributes are included in QI, the more protection k-anonymity providesand, in general, the more information loss it causes. Thus, we test our proposal in theworst-case privacy scenario. On the discrimination side, as explained in Sect. 5.2, themore attributes are included in QI, the more protection is provided by α-protection.

5.1 The generalization-based approach

To design a method, we need to consider the impact of data generalization on discrim-ination.

Definition 6 Let DB be a data table having a quasi-identifier Q I = {Q1, . . . , Qn}with corresponding domain generalization hierarchies Dom1, . . . , Domn . Let D A bea set of PD attributes associated with DB. Each node N in Domi is said to be PD ifDomi corresponds to one of the attributes in D A and N is not the singleton of Domi .Otherwise node N is said to be PND.

Definition 6 states that not only ungeneralized nodes of PD attributes are PD butalso the generalized nodes of these domains are PD. For example, in South Africa,about 80 % of the population is Black, 9 % White, 9 % Colored and 2 % Asian.Generalizing the Race attribute in a census of the South African population to {White,Non-White} causes the Non-White node to inherit the PD nature of Black, Coloredand Asian. We consider the singleton nodes as PND because generalizing all instancesof all values of a domain to single value is PND, e.g. generalizing all instances of maleand female values to any-sex is PND.

123

Page 13: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

Generalization-based privacy preservation and discrimination prevention

Example Continuing the motivating example, consider D A = {Race, Sex} and Fig. 1left. Based on Definition 6, in the domain generalization hierarchy of Race, Sex andHours, R0, R1, S0 are PD nodes, whereas R2, S1, H0, H1 and H2 are PND nodes.

When we generalize data (i.e. full-domain generalization) can we achieve α-protection? By presenting two main scenarios we show that the answer can be yesor no depending on the generalization:

– When the original data table DB is biased versus some protected groups w.r.t. D Aand f (i.e., there is at least one frequent rule c : A → C such that f (c) ≥ α,where A is a PD itemset w.r.t. D A), a full-domain generalization can make DBα-protective if it includes the generalization of the respective protected groups (i.e.A).

– When the original data table DB is biased versus a subset of the protected groupsw.r.t. D A (i.e., there is at least one frequent rule c : A, B → C such that f (c) ≥ α,where A is a PD itemset and B is a PND itemset w.r.t. D A), a full-domain gener-alization can make DB α-protective if any of the following holds: (1) it includesthe generalization of the respective protected groups (i.e. A); (2) it includes thegeneralization of the attributes which define the respective subsets of the protectedgroups (i.e. B); (3) it includes both (1) and (2).

Then, given the generalization lattice of DB over QI, where D A ⊆ Q I , there aresome candidate nodes for which DB is α-protective (i.e., α-protective full-domaingeneralizations). Specifically, we can state the following correctness result.

Theorem 1 Let DB be a data table having a quasi-identifier Q I = {Q1, . . . , Qn}with corresponding domain generalization hierarchies Dom1, . . . , Domn. Let D A bea set of PD attributes in DB. If FG is the set of k-anonymous full-domain generaliza-tions with respect to DB, Q I and Dom1, . . . , Domn, there is at least one k-anonymousfull-domain generalization in FG that is α-protective with respect to D A.

Proof There exists g ∈ FG with respect to 〈N1, . . . , Nn〉 such that for all i = 1, . . . , n,Ni ∈ Domi is the singleton of Domi of attribute Qi ∈ D A. Then g is also α-protectivebased on Definition 4, since all instances of each value in each D A are generalized toa single most general value. ��Observation 1 k-Anonymity and α-protection can be achieved simultaneously in DBby means of full-domain generalization.

Example Continuing Example 3, suppose f = eli f t/sli f t and consider the gener-alization lattice over Q I attributes in Fig. 2. Among three 3-anonymous full-domaingeneralizations, only 〈S1, R1〉 and 〈S1, R2〉 are also 1.2-protective with respect toD A = {Sex}.

Our task is to obtain α-protective k-anonymous full-domain generalizations. Thenaive approach is the sequential way: first, obtain k-anonymous full-domain gener-alizations and then restrict to the subset of these that are α-protective. Although thiswould solve the problem, it is a very expensive solution: discrimination should bemeasured for each k-anonymous full-domain generalization to determine whether itis α-protective. In the next section we present a more efficient algorithm that takesadvantage of the common properties of α-protection and k-anonymity.

123

Page 14: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

S. Hajian et al.

5.2 The algorithm

In this section, we present an optimal algorithm for obtaining all possible full-domaingeneralizations with which DB is α-protective k-anonymous.

5.2.1 Foundations

Observation 2 (Subset property of α-protection) From Definition 4, observe that ifDB is α-protective with respect to D A, it is α-protective w.r.t. any subset of attributesin D A. The converse property does not hold in general.

For example, if Table 1 is 1.2-protective w.r.t D A = {Sex, Race}, Table 1 mustalso be 1.2-protective w.r.t. D A = {Sex} and D A = {Race}. Otherwise put, if Table 1is not 1.2-protective w.r.t. D A = {Sex} or it is not 1.2 protective w.r.t. D A = {Race},it cannot be 1.2-protective w.r.t. D A = {Sex, Race}. This is in correspondence withthe subset property of k-anonymity. Thus, α-protection w.r.t. all strict subsets of D Ais a necessary (but not sufficient) condition for α-protection w.r.t. D A. Then, givengeneralization hierarchies over QI, the generalizations that are not α-protective w.r.t. asubset D A′ of D A can be discarded along with all their descendants in the hierarchy. Toprove the generalization property of α-protection, we need a preliminary well-knownmathematical result, stated in the following lemma.

Lemma 1 Let x1, . . . , xn, y1, . . . , yn be positive integers and let x = x1 + · · · + xn

and y = y1 + · · · + yn. Then

min1≤i≤n

{xi

yi

}≤ x

y≤ max

1≤i≤n

{xi

yi

}.

Proof Without loss of generality, suppose that x1y1

≤ · · · ≤ xnyn

. Then

x

y= y1

y

x1

y1+ · · · + yn

y

xn

yn≤

(y1

y+ · · · + yn

y

)xn

yn≤ xn

yn.

The other inequality is proven analogously. ��Proposition 1 (Generalization property of α-protection) Let DB be a data table andP and Q be nodes in the generalization lattice of D A with DP ≤D DQ. If DB isα-protective w.r.t. to P considering minimum support ms = 1 and discriminationmeasure elift or clift, then DB is also α-protective w.r.t. to Q.

Proof Let A1, . . . , An and A be itemsets in P and Q, respectively, such thatγ −1(A) = {A1, . . . , An}. That is, A is the generalization of {A1, . . . , An}. Let Bbe an itemset from attributes in Q I\D A, and C a decision item. For simplicity,assume that supp(Ai , B) > 0 for i = 1, . . . , n. According to Sect. 4, for the PDrule c : A, B → C ,

eli f t (c) =supp(A,B,C)

supp(A,B)

supp(B,C)supp(B)

and cli f t (c) =supp(A,B,C)

supp(A,B)

supp(X,B,C)supp(X,B)

,

123

Page 15: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

Generalization-based privacy preservation and discrimination prevention

where X is the most favored itemset in Q with respect to B and C . Since supp(A, B) =∑i supp(Ai , B), and supp(A, B, C) = ∑

i supp(Ai , B, C), by Lemma 1 we obtainthat

supp(A, B, C)

supp(A, B)≤ max

i

supp(Ai , B, C)

supp(Ai , B).

Hence if none of the rules Ai , B → C are α-discriminatory with respect to the measureelift, then the rule A, B → C is not α-discriminatory. Now we consider the measureclift. Let Y be the most favored itemset in P with respect to the itemsets B and theitem C . By following an analogous argument, we obtain that

supp(X, B, C)

supp(X, B)≥ supp(Y, B, C)

supp(Y, B).

Therefore if none of the rules Ai , B → C are α-discriminatory with respect to themeasure clift, then c is not α-discriminatory. ��

For example, considering D A = {Race} and f = eli f t or f = cli f t , based onthe generalization property of k-anonymity, if Table 1 is 3-anonymous w.r.t. 〈R0, H0〉,it must be also 3-anonymous w.r.t. 〈R1, H0〉 and 〈R0, H1〉. However, based on thegeneralization property of α-protection, if Table 1 is 1.2-protective w.r.t. 〈R0, H0〉,it must be also 1.2-protective w.r.t. 〈R1, H0〉, which contains the generalization ofthe attributes in D A, but not necessarily w.r.t. 〈R0, H1〉 (the latter generalization isfor an attribute not in D A). Thus, we notice that the generalization property of α-protection is weaker than the generalization property of k-anonymity, because theformer is only guaranteed for generalizations of attributes in D A ⊆ Q I , whereas thelatter holds for generalizations of any attribute in Q I . Moreover, the generalizationproperty has a limitation. Based on Definition 4, a data table is α-protective w.r.t. D A ifall PD frequent rules extracted from the data table are not α-discriminatory w.r.t. D A.Hence, a data table might contain PD rules which are not α-protective and not frequent,e.g. Race=White, Hours=35 → Credit_approved=no, Race=White, Hours=37 →Credit_approved=no, Race=White, Hours=36 → Credit_approved=no, where D A ={Race}. However, after generalization, frequent PD rules can appear which might beα-discriminatory and discrimination will show up, e.g. Race=White, Hours=[1-40) →Credit_approved=no. This is why the generalization property of α-protection requiresthat α-protection w.r.t. P hold for all PD rules, frequent and infrequent (this explainsthe condition ms = 1 in Proposition 1). The next property allows improving theefficiency of the algorithm for obtaining α-protective k-anonymous data tables bymeans of full-domain generalizations. Its proof is straightforward.

Proposition 2 (Roll-up property of α-protection) Let DB be a data table with recordsin a domain tuple DT , let DT ′ be a domain tuple with DT ≤D DT ′, and let γ : DT →DT ′ be the associated generalization function. The support of an itemset X in DT ′ isthe sum of the supports of the itemsets in γ −1(X).

123

Page 16: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

S. Hajian et al.

5.2.2 Overview

We take Incognito as an optimal anonymization algorithm based on the above prop-erties and extend it to generate the set of all possible α-protective k-anonymous full-domain generalizations of DB. Based on the subset property for α-protection andk-anonymity, the algorithm, named α-protective Incognito, begins by checking single-attribute subsets of QI, and then iterates by checking k-anonymity and α-protectionwith respect to increasingly larger subsets, in a manner reminiscent of Agrawal andSrikant (1994). Consider a graph of candidate multi-attribute generalizations (nodes)constructed from a subset of QI of size i . Denote this subset by Ci . The set of directmulti-attribute generalization relationships (edges) connecting these nodes is denotedby Ei .

The i-th iteration of α-protective Incognito performs a search that determines firstthe k-anonymity status and second the α-protection status of table DB with respect toeach candidate generalization in Ci . This is accomplished using a modified bottom-up breadth-first search, beginning at each node in the graph that is not the directgeneralization of some other node. A modified breadth-first search over the graphyields the set of multi-attribute generalizations of size i with respect to which DB isα-protective k-anonymous (denoted by Si ). After obtaining the entire Si , the algorithmconstructs the set of candidate nodes of size i + 1 (Ci+1), and the edges connectingthem (Ei+1) using the subset property.

5.2.3 Description

Algorithm 1 describes α-protective Incognito. In the i th iteration, the algorithm deter-mines the k-anonymity status of DB with respect to each node in Ci by computing thefrequency set in one of the following ways: if the node is root, the frequency set is com-puted using DB. Otherwise, for non-root nodes, the frequency set is computed usingall parents’ frequency sets. This is based on the roll-up property for k-anonymity. IfDB is k-anonymous with respect to the attributes of the node, the algorithm performstwo actions. First, it marks all direct generalizations of the node as k-anonymous.This is based on the generalization property for k-anonymity: these generalizationsneed not be checked anymore for k-anonymity in the subsequent search iterations.Second, if the node contains at least one PD attribute and i ≤ τ (where τ is the dis-crimination granularity level, see definition further below), the algorithm determinesthe α-protection status of DB by computing the Check α-protection(i , node) function(see Algorithm 2). If DB is α-protective w.r.t. the attributes of the node, the algorithmmarks as α-protective k-anonymous all direct generalizations of the node which areα-protective according to the generalization property of α-protection. The algorithmwill not check them anymore for α-protection in the subsequent search iterations.Finally, the algorithm constructs Ci+1 and Ei+1 by considering only nodes in Ci thatare marked as α-protective k-anonymous.

The discrimination granularity level τ ≤ |Q I | is one of the inputs of α-protectiveIncognito. The larger τ , the more protection regarding discrimination will be achieved.The reason is that, if the algorithm can check the status of α-protection in DB w.r.t.nodes which contain more attributes (i.e., finer-grained subsets of protected groups),

123

Page 17: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

Generalization-based privacy preservation and discrimination prevention

Algorithm 1 α- protective IncognitoRequire: Original data table DB, a set Q I = {Q1, · · · , Qn } of quasi-identifier attributes, a set of domain generalization

hierarchies Dom1, · · · , Domn , a set of PD attributes D A, α, f , k, C={Class item}, ms=minimum support, τ ≤ |Q I |Ensure: The set of α-protective k-anonymous full-domain generalizations1: C1={Nodes in the domain generalization hierarchies of attributes in Q I }2: CP D = {∀C ∈ C1 s.t. C is PD}3: E1={Edges in the domain generalization hierarchies of attributes in Q I }4: queue=an empty queue5: for i = 1 to n do6: //Ci and Ei define a graph of generalizations7: Si =copy of Ci8: {roots}={all nodes ∈ Ci with no edge ∈ Ei directed to them}9: Insert {roots} into queue, keeping queue sorted by height10: while queue is not empty do11: node = Remove first item from queue12: if node is not marked as k-anonymous or α-protective k-anonymous then13: if node is a root then14: f requencySet= Compute the frequency set of DB w.r.t. attributes of node using DB.15: else16: f requencySet= Compute the frequency set of DB w.r.t. attributes of node using the parents’ frequency sets.17: end if18: Use f requencySet to check k-anonymity w.r.t. attributes of node19: if DB is k-anonymous w.r.t. attributes of node then20: Mark all direct generalizations of node as k-anonymous21: if ∃N ∈ CP D s.t. N ⊆ node and i ≤ τ then22: if node is a root then23: M R= Check α- protection(i , node) of DB w.r.t. attributes of node using DB.24: else25: M R= Check α- protection(i , node) of DB w.r.t. attributes of node using parents’ Il and Il+126: end if27: Use M R to check α-protection w.r.t. attributes of node28: if M R = case3 then29: Mark all direct generalizations of node that contain the generalization of N as k-anonymous α-protective30: else if M R = case1 then31: Delete node from Si32: Insert direct generalizations of node into queue, keeping queue ordered by height33: end if34: end if35: else36: Steps 31-3237: end if38: else if node is marked as k-anonymous then39: Steps 21-3640: end if41: end while42: Ci+1, Ei+1 = GraphGeneration(Si , Ei )43: end for44: Return projection of attributes of Sn onto DB and Dom1, ..., Domn

then more possible local niches of discrimination in DB are discovered. However, agreater τ leads to more computation by α-protective Incognito, because α-protectionof DB should be checked in more iterations. In fact, by setting τ < |Q I |, we canprovide a trade-off between efficiency and discrimination protection.

As mentioned above, Algorithm 2 implements the Check α-protection(i , node)function to check the α-protection of DB with respect to the attributes of the node. Todo it in an efficient way, first the algorithm generates the set of l-itemsets of attributesof node with their support values, denoted by Il , and the set of (l + 1)-itemsetsof attributes of node and class attribute, with their support values, denoted by Il+1,where l = i is the number of items in the itemset. In SQL language, Il and Il+1 areobtained from DB by issuing a suitable query. This computation is only necessary

123

Page 18: Generalization-based privacy preservation and discrimination prevention in data publishing and mining

S. Hajian et al.

for root nodes in each iteration; for non-root nodes, Il and Il+1 are obtained fromIl and Il+1 of parent nodes based on the roll-up property of α-protection. Then, PDclassification rules (i.e. P Dgroups) with the required values to compute each f inFig. 3 (i.e. n1, a1, n2 and a2) are obtained by scanning Il+1. During the scan of Il+1,PD classification rules A, B → C (i.e. P Dgroups) are obtained with the respectivevalues a1 = supp(A, B, C), n1 = supp(A, B) (note that supp(A, B) is in Il ),a2 = supp(¬A, B, C) (obtained from Il+1), and n2 = supp(¬A, B) (obtained fromIl ). By relaxing τ we can limit the maximum number of itemsets in Il and Il+1 thatare generated during the execution of α-protective Incognito.

Algorithm 2 Check α- protection (i , node)1: l = i2: Il ={l-itemsets containing attributes of node}3: Il+1={(l + 1)-itemsets containing attributes of node and class item C}4: for each R ∈ Il+1 do5: X = R\C6: a1 = supp(R)

7: n1 = supp(X) // X found in Il8: A=largest subset of X containing protected groups w.r.t. D A9: T = R\A10: Z = ¬A ∪ T11: a2 = supp(Z) // Obtained from Il+112: n2 = supp(Z\C) // Obtained from Il13: Add R : A, B → C to P Dgroups with values a1, n1, a2 and n214: end for15: Return M R=Measure_disc(α, ms, f )

Algorithm 3 Measure_disc(α, ms, f )1: if f = sli f t or oli f t then2: if ∃ a group (A, B → C) in P Dgroup which is frequent w.r.t. ms and α-discriminatory w.r.t. f then3: Return M R = Case1 // DB is not α-protective w.r.t. attributes of node4: else5: Return M R = Case2 // DB is α-protective w.r.t. attributes of node6: end if7: end if8: if f = eli f t or cli f t then9: if ∃ a group (A, B → C) in P Dgroup which is frequent w.r.t. ms and α-discriminatory w.r.t. f then10: Return M R = Case1 // DB is not α-protective w.r.t. attributes of node11: else if ∃ a group (A, B → C) in P Dgroup which is infrequent w.r.t. ms and α-discriminatory w.r.t. f then12: Return M R = Case2 // DB is α-protective w.r.t. attributes of node13: else if f = cli f t and ∃ a group (A, B → C) in P Dgroup which is infrequent w.r.t. ms whose confidence is lower

than the confidence of the most favored item considered in the computation of cli f t then14: Return M R = Case2 // DB is α-protective w.r.t. attributes of node15: else16: Return M R = Case3 // DB is α-protective w.r.t. attributes of node and subsets of its generalizations17: end if18: end if

After obtaining PDgroups with the values a1, a2, n1 and n2, Algorithm 2 computes the Measure_disc(α, ms, f) function (see Algorithm 3). This function takes f as a parameter and is based on the generalization property of α-protection. If f = slift or f = olift and there exists at least one frequent group A, B → C in PDgroups with slift(A, B → C) ≥ α, then MR = Case1 (i.e. DB is not α-protective w.r.t. the attributes of node); otherwise MR = Case2 (i.e. DB is α-protective w.r.t. the attributes of node). If f = elift or f = clift, the generalization property of α-protection is satisfied, so if there exists at least one frequent group A, B → C in PDgroups with elift(A, B → C) ≥ α, then MR = Case1. Otherwise, if there exists at least one infrequent group A, B → C in PDgroups with elift(A, B → C) ≥ α, then MR = Case2. Otherwise, if all groups in PDgroups, frequent and infrequent, have elift(A, B → C) < α, then MR = Case3.
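The case logic just described can be summarized by the short sketch below (a simplified illustration under our own assumptions, not the paper's code): it covers f = slift and f = elift only, judges the frequency of a rule by its support a1 = supp(A, B, C) against an absolute threshold min_supp, derives elift from (a1, n1, a2, n2) using supp(B) = n1 + n2 and supp(B, C) = a1 + a2, and omits the olift and clift branches, which are analogous.

```python
# Simplified sketch of the case logic of Algorithm 3 (not the authors' implementation).
CASE1, CASE2, CASE3 = "Case1", "Case2", "Case3"

def slift(a1, n1, a2, n2):
    """conf(A,B -> C) / conf(notA,B -> C)."""
    return float("inf") if a2 == 0 else (a1 / n1) / (a2 / n2)

def elift(a1, n1, a2, n2):
    """conf(A,B -> C) / conf(B -> C), with supp(B) = n1 + n2 and supp(B,C) = a1 + a2."""
    return float("inf") if a1 + a2 == 0 else (a1 / n1) / ((a1 + a2) / (n1 + n2))

def measure_disc(pd_groups, alpha, min_supp, f):
    """pd_groups: iterable of dicts with keys a1, n1, a2, n2."""
    measure = {"slift": slift, "elift": elift}[f]
    disc = lambda g: measure(g["a1"], g["n1"], g["a2"], g["n2"]) >= alpha
    frequent = [g for g in pd_groups if g["a1"] >= min_supp]
    infrequent = [g for g in pd_groups if g["a1"] < min_supp]
    if any(disc(g) for g in frequent):
        return CASE1        # DB is not alpha-protective w.r.t. the node's attributes
    if f == "slift":
        return CASE2        # for slift, only frequent PD groups are decisive
    if any(disc(g) for g in infrequent):
        return CASE2        # alpha-protective w.r.t. the node's attributes only
    return CASE3            # alpha-protective w.r.t. the node and subsets of its generalizations

# Example: one frequent, clearly discriminatory PD group
groups = [dict(a1=40, n1=50, a2=10, n2=50)]
print(measure_disc(groups, alpha=1.2, min_supp=20, f="slift"))   # Case1
```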

It is worth mentioning that, in the i-th iteration of α-protective Incognito, for each node in Ci, k-anonymity is checked first and then α-protection. This is because the algorithm only checks α-protection for nodes that contain at least one PD attribute, whereas k-anonymity is checked for all nodes. Moreover, if τ < |QI|, the algorithm skips the α-protection check in some iterations.

6 Experimental analysis

Our first objective is to evaluate the performance of α-protective Incognito (Algorithm 1) and compare it with Incognito. Our second objective is to evaluate the quality of the unbiased anonymous data output by α-protective Incognito, compared to that of the anonymous data output by plain Incognito, using both general and specific data analysis metrics. We implemented all algorithms in Java on top of IBM DB2. All experiments were performed on an Intel Core i5 CPU with 4 GB of RAM, running Windows 7 Home Edition and DB2 Express Version 9.7. We considered different values of f, DA, k, α and τ in our experiments.

6.1 Data sets

Adult data set: This data set, also known as Census Income, can be retrieved from the UCI Repository of Machine Learning Databases (Bache and Lichman 2013). Adult has 6 continuous attributes and 8 categorical attributes. The class attribute represents two income levels, ≤50 K or >50 K. There are 45,222 records without missing values, pre-split into 30,162 training records and 15,060 testing records. We ran our experiments on the training set. We used the same 8 categorical attributes as Fung et al. (2005), shown in Table 3, and obtained their generalization hierarchies from the authors of that paper.

Table 3 Description of the Adult data set

Attribute         #Distinct values   #Levels of hierarchies
Education         16                 5
Marital_status    7                  4
Native_country    40                 5
Occupation        14                 3
Race              5                  3
Relationship      6                  3
Sex               2                  2
Work-class        8                  5


For our experiments, we set ms = 5 % and took the 8 attributes in Table 3 as QI, with DA1 = {Race, Gender, Marital_status}, DA2 = {Race, Gender} and DA3 = {Race, Marital_status}. The smaller ms, the more computation and the more discrimination discovery. In this way, we considered a very demanding scenario in terms of both privacy (all 8 attributes were QI) and anti-discrimination (small ms).

German Credit data set: We also used this data set from Bache and Lichman (2013). It has 7 continuous attributes, 13 categorical attributes, and a binary class attribute representing low or high credit risk. There are 666 and 334 records without missing values in the pre-split training and testing sets, respectively. This data set has been frequently used in the anti-discrimination literature (Pedreschi et al. 2008; Kamiran and Calders 2011). We used the 11 categorical attributes shown in Table 4. For our experiments, we set ms = 5 % and took 10 attributes in Table 4 as QI, with DA1 = {Gender, Marital_status, Foreign_worker}, DA2 = {Gender, Marital_status} and DA3 = {Gender, Foreign_worker}.

6.2 Performance

Figures 4 and 5 report the execution time of α-protective Incognito for different values of τ, DA, f and k, in comparison with Incognito, on Adult and German Credit, respectively.

Table 4 Description of the German Credit data set

Attribute          #Distinct values   #Levels of hierarchies
Account-status     4                  3
Credit-history     5                  3
Loan-purpose       11                 4
Savings-account    5                  4
Employment         5                  4
Marital-status     4                  3
Sex                2                  2
Existing-credits   4                  3
Job                4                  3
Foreign worker     2                  2

[Figure 4: three panels plotting execution time (min) against k, comparing Incognito with α-protective Incognito for DA1, DA2, DA3 (τ = 4), for f = elift vs. f = slift (τ = 4), and for τ = 8, 6, 4.]

Fig. 4 Adult data set: performance of Incognito and α-protective Incognito for several values of k, τ, f and DA. Unless otherwise specified, f = slift, DA = DA1 and α = 1.2


[Figure 5: three panels plotting execution time (min) against k, comparing Incognito with α-protective Incognito for DA1, DA2, DA3 (τ = 5), for f = elift vs. f = slift (τ = 5), and for τ = 10, 5, 4.]

Fig. 5 German Credit data set: performance of Incognito and α-protective Incognito for several values of k, τ, f and DA. Unless otherwise specified, f = slift, DA = DA1 and α = 1.2

We observe that, for both data sets, as k increases the performance of both algorithms improves. This is mainly because, as k increases, more generalizations are pruned as part of smaller subsets, and less execution time is needed. On the German Credit data set, α-protective Incognito is always faster than Incognito. On the Adult data set, α-protective Incognito is slower than Incognito only if the value of τ is very high (e.g. τ = 6 or τ = 8). The explanation is that, with α-protective Incognito, more generalizations are pruned as part of smaller subsets by checking both k-anonymity and α-protection, so less execution time is needed. The difference between the performance of the two algorithms gets smaller as k increases. In addition, because of the generalization property of α-protection with respect to elift, α-protective Incognito is faster for f = elift than for f = slift. However, this difference is not substantial since, as we mentioned in Sect. 5.2, α-protection must consider all frequent and infrequent PD rules.

In summary, since α-protective Incognito provides extra protection against discrimination compared to Incognito, the cost can sometimes be a longer execution time, especially when the value of τ is very high, near |QI|. However, our results show that in most cases α-protective Incognito is even faster than Incognito. This is a remarkable result, because discrimination discovery is an intrinsically expensive task (discrimination may be linked to a large number of attribute and value combinations).

6.3 Data quality

Privacy preservation and discrimination prevention are one side of the problem we tackle. The other side is retaining enough information for the published data to remain practically useful. Data quality can be measured in general or with respect to a specific data analysis task (e.g. classification).

First, we evaluate the data quality of the protected data obtained by α-protective Incognito and Incognito using standard general metrics: the generalization height (Lefevre et al. 2005; Samarati 2001) and discernibility (Bayardo and Agrawal 2005). The generalization height (GH) is the height of an anonymized data table in the generalization lattice; intuitively, it corresponds to the number of generalization steps that were performed. The discernibility metric charges a penalty to each record for being indistinguishable from other records: for each record in an equivalence QI class qid, the penalty is |DB[qid]|, so the discernibility cost is the sum of the |DB[qid]|^2. We define the discernibility ratio (DR) as

DR = Σ_qid |DB[qid]|^2 / |DB|^2.

Note that: (i) 0 ≤ DR ≤ 1; (ii) lower DR and GH mean higher data quality. From the lists of full-domain generalizations obtained from Incognito and α-protective Incognito, respectively, we compute the minimal full-domain generalization w.r.t. both GH and DR for each algorithm and compare them.
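As an illustration, DR can be computed directly from the sizes of the qid equivalence classes; the short sketch below (made-up column names and toy data, not the evaluation code used in the paper) does exactly that with pandas.

```python
# Illustrative sketch: discernibility ratio DR = sum_qid |DB[qid]|^2 / |DB|^2,
# where a qid group is an equivalence class of records sharing the same generalized QI values.
import pandas as pd

def discernibility_ratio(df, qi_attrs):
    group_sizes = df.groupby(qi_attrs).size()
    return float((group_sizes ** 2).sum()) / (len(df) ** 2)

# Example: three records in one equivalence class and one singleton
df = pd.DataFrame({"Age": ["3*", "3*", "3*", "4*"], "Zip": ["476**"] * 4})
print(discernibility_ratio(df, ["Age", "Zip"]))   # (3^2 + 1^2) / 4^2 = 0.625
```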

Second, we measure the data quality of the anonymous data obtained by α-protective Incognito and Incognito for a classification task using the classification metric CM from Iyengar (2002). CM charges a penalty for each record generalized to a qid group in which the record's class is not the majority class. Lower CM means higher data quality. From the lists of full-domain generalizations obtained from Incognito and α-protective Incognito, respectively, we compute the minimal full-domain generalization w.r.t. CM for each algorithm and compare them. In addition, to evaluate the impact of our transformations on the accuracy of a classification task, we first obtain the minimal full-domain generalization w.r.t. CM and use it to anonymize the training set. Then, the same generalization is applied to the testing set to produce a generalized testing set. Next, we build a classifier on the anonymized training set and measure the classification accuracy (CA) on the generalized records of the testing set. For classification models we use the well-known decision tree classifier J48 from the Weka software package (Witten and Frank 2005). We also measure the classification accuracy on the original data without anonymization. The difference represents the cost, in terms of classification accuracy, of achieving either both privacy preservation and discrimination prevention or privacy preservation only.
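A minimal sketch of CM is given below (an illustration under assumed column names and a normalized penalty, not the paper's evaluation code): each record whose class differs from the majority class of its qid group contributes a unit penalty.

```python
# Illustrative sketch of Iyengar's classification metric (CM), normalized to [0, 1].
import pandas as pd

def classification_metric(df, qi_attrs, class_attr):
    penalty = 0
    for _, group in df.groupby(qi_attrs):
        majority = group[class_attr].mode().iloc[0]          # majority class of the qid group
        penalty += (group[class_attr] != majority).sum()     # one unit per minority record
    return penalty / len(df)                                 # lower CM means higher data quality

df = pd.DataFrame({"Age": ["3*", "3*", "3*", "4*"],
                   "Income": [">50K", "<=50K", ">50K", "<=50K"]})
print(classification_metric(df, ["Age"], "Income"))          # 1 minority record out of 4 -> 0.25
```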

Figures 6 and 7 summarize the data quality results using general metrics for different values of k, DA and α, where f = slift and τ = |QI|, on Adult and German Credit, respectively. We found that the data quality of k-anonymous tables (in terms of GH and DR) without α-protection is equal to or slightly better than the quality of k-anonymous tables with α-protection. This is because the α-protection k-anonymity requirement provides extra protection (against discrimination) at the cost of some data quality loss when DA and k are large and α is small.


Fig. 6 Adult data set: general data quality metrics. Left: generalization height (GH). Right: discernibility ratio (DR). Results are given for k-anonymity (I); and α-protection k-anonymity with DA2, α = 1.2 (II); DA2, α = 1.6 (III); DA1, α = 1.2 (IV); DA1, α = 1.6 (V). In all cases f = slift, DA1 = {Race, Gender, Marital_status}, and DA2 = {Race, Gender}



Fig. 7 German Credit data set: general data quality metrics. Left: generalization height (GH). Right: discernibility ratio (DR). Results are given for k-anonymity (I); and α-protection k-anonymity with DA2, α = 1.2 (II); DA2, α = 1.6 (III); DA1, α = 1.2 (IV); DA1, α = 1.6 (V). In all cases f = slift, DA1 = {Gender, Marital_status, Foreign_worker}, and DA2 = {Gender, Marital_status}


Fig. 8 Adult data set: data quality for classification analysis. Left: classification metric (CM). Right: classification accuracy in percentage (CA). Results are given for the original data (0); k-anonymity (I); and α-protection k-anonymity with DA2, α = 1.2 (II); DA2, α = 1.6 (III); DA1, α = 1.2 (IV); DA1, α = 1.6 (V). In all cases f = slift, DA1 = {Race, Gender, Marital_status}, and DA2 = {Race, Gender}

As k increases, more generalizations are needed to achieve k-anonymity, which increases GH and DR. We performed the same experiment for other discrimination measures f and observed a similar trend (details omitted due to lack of space). As discussed in Sect. 5.2.3, τ = |QI| is the worst-case anti-discrimination scenario.

The left-hand side charts of Figs. 8 and 9 summarize the data quality results using the classification metric (CM) for different values of k, DA and α, where f = slift and τ = |QI|. The information loss is higher in the German Credit data set than in the Adult data set, because the former is more biased (that is, it contains more α-discriminatory rules). However, the relevant comparison is not between data sets, but within each data set. In this respect, we notice that, for each data set, the data quality of k-anonymous tables (in terms of CM) without α-protection is equal to or slightly better than the quality of k-anonymous tables with α-protection. This is because the α-protection k-anonymity requirement provides extra protection (against discrimination) at the cost of some data quality loss when DA and k are large. The right-hand side charts of Figs. 8 and 9 summarize the impact of achieving k-anonymity or α-protection k-anonymity on the percentage classification accuracy (CA) of J48 for different values of k, DA and α, where f = slift.



Fig. 9 German Credit data set: data quality for classification analysis. Left: classification metric (CM). Right: classification accuracy in percentage (CA). Results are given for the original data (0); k-anonymity (I); and α-protection k-anonymity with DA2, α = 1.2 (II); DA2, α = 1.6 (III); DA1, α = 1.2 (IV); DA1, α = 1.6 (V). In all cases f = slift, DA1 = {Gender, Marital_status, Foreign_worker}, and DA2 = {Gender, Marital_status}

Table 5 Adult data set: accuracy for various types of classifiers. In all cases f = slift, DA = {Race, Gender, Marital_status}, and τ = |QI|

Classifier            Original data table   50-Anonymous data table   50-Anonymous 1.2-protective data table
J48                   82.36                 82.01                     82.01
Naïve Bayes           79.40                 82.01                     82.01
Logistic regression   82.95                 82.08                     82.08
RIPPER                82.7                  81.34                     81.34
PART                  82.48                 82.01                     82.01

Table 6 German Credit data set: accuracy for various types of classifiers. In all cases f = slift, DA = {Gender, Marital_status, Foreign_worker}, and τ = |QI|

Classifier            Original data table   50-Anonymous data table   50-Anonymous 1.2-protective data table
J48                   74.25                 71.25                     71.25
Naïve Bayes           71.25                 71.25                     71.25
Logistic regression   72.15                 71.25                     71.25
RIPPER                76.94                 71.25                     71.25
PART                  66.46                 71.25                     70.95

We observe a similar trend as for CM: the accuracies of J48 on k-anonymous tables without α-protection are equal to or only slightly better than the accuracies of J48 on k-anonymous tables with α-protection.

We also extend our results to alternative data mining algorithms. Tables 5 and 6 show, for each data set respectively, the accuracy of various types of classifiers, including decision trees (J48), naïve Bayes, logistic regression, and rule induction (RIPPER and PART), obtained from the original data, the 50-anonymous data and the 50-anonymous 1.2-protective data. For neither data set do we observe a significant difference between the accuracy of the classifiers obtained from the k-anonymous and from the k-anonymous α-protective versions of the original data tables. These results support the conclusion that the transformed data obtained by our approach are still usable for learning models and finding patterns, while minimizing both privacy and discrimination threats.

7 Extensions

We consider here alternative privacy models and anti-discrimination requirements.

7.1 Alternative privacy models

7.1.1 Attribute disclosure

k-Anonymity can protect the original data against record linkage attacks, but it cannot protect the data against attribute linkage (disclosure). In an attribute linkage attack, the attacker may not be able to precisely identify the record of a specific individual, but could infer his/her sensitive values (e.g., salary, disease) from the published data table DB. In contrast to k-anonymity, the privacy models for attribute linkage assume the existence of sensitive attributes S in DB such that QI ∩ S = ∅. Several models have been proposed to address this type of threat, the most popular ones being l-diversity and t-closeness. The general idea of these models is to diminish the correlation between QI attributes and sensitive attributes (see Machanavajjhala et al. 2007; Li et al. 2007 for formal definitions). As shown in Machanavajjhala et al. (2007) and Li et al. (2007), by using full-domain generalizations over QI, we can obtain data tables protected against attribute disclosure. Considering attribute disclosure risks, we focus on the problem of producing an anonymized version of DB which is protected against attribute disclosure and free from discrimination (e.g., an α-protective l-diverse data table). We study this problem considering the following possible relations between QI, DA and S:

– DA ⊆ QI: It is possible that the original data are biased in subsets of the protected groups which are defined by sensitive attributes (e.g. women who have a medium salary). In this case, only full-domain generalizations which include the generalization of the protected group values can make DB α-protective, because generalization is only performed over QI attributes.

– DA ⊆ S: A full-domain generalization over QI can make the original data α-protective only if DB is biased in subsets of protected groups which are defined by QI attributes. In other scenarios, i.e., when data are biased against some protected groups or subsets of protected groups which are defined by sensitive attributes, full-domain generalizations over QI cannot make DB α-protective. One possible solution is to generalize attributes which are both sensitive and PD (e.g., Religion in some applications), even if they are not in QI.

Observation 3 If DA ⊆ QI, l-diversity/t-closeness and α-protection can be achieved simultaneously in DB by means of full-domain generalization.

Since the subset and generalization properties are also satisfied by l-diversity and t-closeness, to obtain all full-domain generalizations with which the data are α-protective and protected against attribute disclosure, we take α-protective Incognito and make the following changes: (1) every time a data table is tested for k-anonymity, it is also tested for l-diversity or t-closeness; (2) every time a data table is tested for α-protection, it is tested w.r.t. the attributes of node and the sensitive attributes. This can be done by simply updating the Check α-protection function. Just as the data quality of k-anonymous data tables without l-diversity or t-closeness is slightly better than the quality of k-anonymous data tables with l-diversity or t-closeness, we expect a similar slight quality loss when adding l-diversity or t-closeness to α-protection k-anonymity.
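As a rough illustration of change (1), the sketch below (our own assumption of how the combined test might look, using distinct l-diversity for simplicity and made-up column names) tests a candidate generalized table for k-anonymity and l-diversity in a single pass over its qid groups.

```python
# Sketch: joint k-anonymity / distinct l-diversity test over the qid groups of a generalized table.
import pandas as pd

def is_k_anonymous_and_l_diverse(df, qi_attrs, sensitive_attr, k, l):
    for _, group in df.groupby(qi_attrs):
        if len(group) < k:
            return False                                  # violates k-anonymity
        if group[sensitive_attr].nunique() < l:
            return False                                  # violates distinct l-diversity
    return True

df = pd.DataFrame({"Zip": ["476**"] * 4 + ["477**"] * 3,
                   "Disease": ["flu", "cold", "flu", "asthma", "flu", "flu", "flu"]})
print(is_k_anonymous_and_l_diverse(df, ["Zip"], "Disease", k=3, l=2))   # False: 477** group is not 2-diverse
```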

7.1.2 Differential privacy

Differential privacy is a privacy model that provides a worst-case privacy guarantee in the presence of arbitrary external information. It protects against any privacy breach resulting from joining different databases. Informally, differential privacy requires that the output of a data analysis mechanism be approximately the same even if any single record in the input database is arbitrarily added or removed (see Dwork 2006 for a formal definition). There are several approaches for designing algorithms that satisfy differential privacy; one of the best known is Laplace noise addition. After the query function is computed on the original data set DB, Laplace-distributed random noise is added to the query result, where the magnitude of the noise depends on the sensitivity of the query function and on a privacy budget. The sensitivity of a function is the maximum difference of its outputs on two data tables that differ in only one record. We define a differentially private data table as an anonymized data table generated by a function (algorithm) which is differentially private. Several works in the literature study the problem of differentially private data release (Dwork 2011). The general structure of these approaches is to first build a contingency table of the original raw data over the database domain and then add noise to each frequency count in the contingency table to satisfy differential privacy. However, as mentioned in Mohammed et al. (2011), these approaches are not suitable for high-dimensional data with a large domain, because when the added noise is relatively large compared to the count, the utility of the data decreases significantly. Mohammed et al. (2011) present a generalization-based algorithm for differentially private data release: it first probabilistically generates a generalized contingency table and then adds noise to the counts. Thanks to generalization, the count of each partition is typically much larger than the added noise; in this way, generalization helps achieve a differentially private version of DB with higher data utility. Considering the differential privacy model, we focus on the problem of producing a private version of DB which is differentially private and free from discrimination with respect to DA. Since the differentially private version of DB is an approximation of DB generated at random, we have the following observation.

Observation 4 Making the original data table DB differentially private using Laplace noise addition can make DB more or less α-protective w.r.t. DA and f.

Given the above observation and the fact that generalization can help achieve differential privacy with higher data quality, we propose to obtain a noisy generalized contingency table of DB which is also α-protective. To do this, one solution is to add uncertainty to an algorithm that generates all possible full-domain generalizations with which DB is α-protective. As shown in Mohammed et al. (2011), for higher values of the privacy budget the quality of differentially private data tables is higher than the quality of k-anonymous data tables, while for smaller values of the privacy budget it is the other way round. Therefore, we expect that differential privacy plus discrimination prevention will compare similarly to the k-anonymity plus discrimination prevention presented in the previous sections of this paper.
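The Laplace mechanism on a (generalized) contingency table can be sketched as follows; this is only an illustration of the mechanism under simplifying assumptions (sensitivity 1 for disjoint counts, post-processing by rounding and clipping at zero, made-up attribute names), not the algorithm of Mohammed et al. (2011) nor its α-protective extension.

```python
# Sketch: add Laplace(1/epsilon) noise to the counts of a contingency table over attrs.
import numpy as np
import pandas as pd

def noisy_contingency_table(df, attrs, epsilon, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    counts = df.groupby(attrs).size().rename("count").reset_index()
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(counts))
    # Round and clip at zero as simple post-processing of the noisy counts
    counts["noisy_count"] = np.maximum(0, np.round(counts["count"] + noise)).astype(int)
    return counts

df = pd.DataFrame({"Sex": ["M", "F", "F", "M", "F"],
                   "Credit": ["grant", "deny", "grant", "grant", "deny"]})
print(noisy_contingency_table(df, ["Sex", "Credit"], epsilon=1.0))
```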

7.2 Alternative anti-discrimination legal concepts

Unlike privacy legislation, anti-discrimination legislation is very sparse and includes different legal concepts, e.g. direct and indirect discrimination and the so-called genuine occupational requirement.

7.2.1 Indirect discrimination

Indirect discrimination occurs when the input does not contain PD attributes, but discriminatory decisions against protected groups might still be made indirectly because of the availability of some background knowledge; for example, discrimination against black people might occur if the input data contain Zipcode as an attribute (but not Race) and one knows that a specific zipcode is mostly inhabited by black people⁸ (i.e., there is a high correlation between the Zipcode and Race attributes). Then, even if the protected groups do not exist in the original data table or have been removed from it due to privacy or anti-discrimination constraints, indirect discrimination remains possible. Given DA, we define background knowledge as the correlation between DA and the PND attributes in DB:

BK = {Ai → Ax | Ai ∈ A, Ai is PND and Ax ∈ DA}

Given BK, we define IA as the set of PND attributes in DB which are highly correlated with DA, determined according to BK. Building on Definition 3, we introduce the notion of non-redlining α-protection for a data table.

Definition 7 (Non-redlining α-protected data table) Given DB(A1, . . . , An), DA, f and BK, DB is said to satisfy non-redlining α-protection, or to be non-redlining α-protective w.r.t. DA and f, if each PND frequent classification rule c : D, B → C extracted from DB is α-protective, where D is a PND itemset of IA attributes and B is a PND itemset of A\IA attributes.

Given DA and BK, releasing a non-redlining α-protective version of an original table is desirable to prevent indirect discrimination against protected groups w.r.t. DA. Since indirect discrimination against protected groups originates from the correlation between DA and IA attributes, a natural countermeasure is to diminish this correlation. An anonymized version of an original data table protected against indirect discrimination (i.e. non-redlining α-protective) can then be generated by generalizing the IA attributes. As an example, generalizing all instances of the 47677, 47602 and 47678 zipcode values to the same generalized value 476** can prevent indirect discrimination against black people living in the 47602 neighborhood.

8 http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:62011CJ0385:EN:Not

Observation 5 If IA ⊆ QI, non-redlining α-protection can be achieved in DB by means of full-domain generalization.

Consequently, non-redlining α-protection can be achieved together with each of the above-mentioned privacy models based on full-domain generalization of DB (e.g. k-anonymity), as long as IA ⊆ QI. Fortunately, the subset and generalization properties satisfied by α-protection are also satisfied by non-redlining α-protection. Hence, in order to obtain all possible full-domain generalizations with which DB is protected against both indirect discrimination and privacy invasion, we take α-protective Incognito and make the following changes: (1) add BK as an input of the algorithm and determine IA w.r.t. BK, where PD attributes are removed from DB; (2) every time a data table is tested for α-protection, test it for non-redlining α-protection instead. With these changes, when combining indirect discrimination prevention and privacy protection, we expect data quality and algorithm performance similar to what we obtained when combining direct discrimination prevention and privacy protection.
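As an illustration of how IA could be derived from BK in practice, the sketch below (our own choice, not part of the paper's algorithm) measures the correlation between each PND attribute and the attributes in DA with Cramér's V and keeps the attributes exceeding an arbitrary threshold; both the measure and the 0.5 threshold are assumptions.

```python
# Sketch: derive the set IA of PND attributes highly correlated with DA via Cramér's V.
import numpy as np
import pandas as pd

def cramers_v(x, y):
    table = pd.crosstab(x, y).to_numpy().astype(float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

def indirectly_discriminating_attributes(df, pnd_attrs, da_attrs, threshold=0.5):
    return {a for a in pnd_attrs
            if any(cramers_v(df[a], df[d]) >= threshold for d in da_attrs)}

df = pd.DataFrame({"Zipcode": ["47602", "47602", "47678", "47677"],
                   "Race": ["black", "black", "white", "white"]})
print(indirectly_discriminating_attributes(df, ["Zipcode"], ["Race"]))   # {'Zipcode'}
```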

7.2.2 Genuine occupational requirement

The legal concept of genuine occupational requirement refers to detecting the part of the discrimination which may be explained by other attributes (Zliobaite et al. 2011), named legally-grounded attributes; e.g., denying credit to women may be explainable if most of them have a low salary or a delay in returning previous credits. Whether low salary or delay in returning previous credits is an acceptable legitimate argument to deny credit is for the law to determine. Given a set LA of legally-grounded attributes in DB, some works attempt to cater technically to them in anti-discrimination protection (Loung et al. 2011; Zliobaite et al. 2011; Dwork et al. 2012). The general idea is to prevent only unexplainable (bad) discrimination. Loung et al. (2011) propose a variant of k-nearest neighbor (k-NN) classification which labels each record in a data table as discriminated or not. A record t is discriminated if: (i) it has a negative decision value in its class attribute; and (ii) among the k-nearest neighbors of t w.r.t. LA, the difference between the proportion of those that have the same decision value as t and belong to the same protected-by-law groups as t, and the proportion of those that do not belong to the same protected groups as t, is greater than the discrimination threshold. This implies that the negative decision for t is not explainable on the basis of the legally-grounded attributes, but is biased by group membership. We say that a data table is protected only against unexplainable discrimination w.r.t. DA and LA if the number of records labeled as discriminated is zero (or near zero). An anonymized version of an original data table which is protected against unexplainable discrimination can be generated by generalizing LA and/or DA attributes: given a discriminated record, generalizing LA and/or DA attributes can decrease the difference between the two above-mentioned proportions. Hence, an anonymized version of an original data table which is privacy-protected and protected against unexplainable discrimination can be obtained using full-domain generalization over QI attributes, as long as DA ⊆ QI and LA ⊆ QI.
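The labeling step can be sketched as follows; this is our simplified reading of the k-NN situation-testing idea of Loung et al. (2011), with made-up data, a naive neighbourhood construction and arbitrary parameters, not their implementation.

```python
# Sketch: flag records with a negative decision as discriminated when the negative-decision
# rate among their k nearest neighbours (w.r.t. the legally-grounded attributes LA) inside
# the protected group exceeds the rate outside it by more than a threshold t.
import numpy as np

def discriminated_records(X_la, protected, negative, k=3, t=0.3):
    """X_la: (n, d) array of legally-grounded attributes; protected, negative: boolean arrays."""
    flags = np.zeros(len(X_la), dtype=bool)
    for i in np.where(protected & negative)[0]:
        dist = np.linalg.norm(X_la - X_la[i], axis=1)
        dist[i] = np.inf                                   # exclude the record itself
        nn = np.argsort(dist)[: 2 * k]                     # neighbourhood to draw both groups from
        same, other = nn[protected[nn]][:k], nn[~protected[nn]][:k]
        p_same = negative[same].mean() if len(same) else 0.0
        p_other = negative[other].mean() if len(other) else 0.0
        flags[i] = (p_same - p_other) > t
    return flags

# Toy data: salary as LA, gender as protected group, credit denial as negative decision
X = np.array([[20.], [21.], [22.], [40.], [41.], [42.]])
protected = np.array([True, True, True, False, False, False])
negative = np.array([True, True, True, False, False, False])
print(discriminated_records(X, protected, negative, k=2, t=0.2))
```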

8 Conclusions

We have investigated the problem of discrimination- and privacy-aware data publishing and mining, i.e., distorting an original data set in such a way that neither privacy-violating nor discriminatory inferences can be made on the released data, while maximizing the usefulness of the data for learning models and finding patterns. To study the impact of data generalization (i.e. full-domain generalization) on discrimination prevention, we applied generalization not only to make the original data privacy-protected but also to make them protected against discrimination. We found that a subset of the k-anonymous full-domain generalizations, with the same or only slightly higher data distortion than the rest (in terms of general and specific data analysis metrics), are also α-protective. Hence, k-anonymity and α-protection can be combined to attain privacy protection and discrimination prevention in the published data set. We have adapted to α-protection two well-known properties of k-anonymity, namely the subset and generalization properties. This has allowed us to propose an α-protective version of Incognito, which can take as parameters several legally-grounded measures of discrimination and generate privacy- and discrimination-protected full-domain generalizations. We have evaluated the quality of the data output by this algorithm (in terms of various types of classifiers and rule induction algorithms), as well as its execution time. Both turn out to be nearly as good as with plain Incognito, so the toll paid to obtain α-protection is very reasonable. Finally, we have sketched how our approach can be extended to satisfy alternative privacy guarantees or anti-discrimination legal constraints. Detailed implementations of these extensions are left for future work.

Acknowledgments The authors wish to thank Kristen LeFevre for providing the implementation of the Incognito algorithm and Guillem Rufian-Torrell for helping in the implementation of the algorithm proposed in this paper. This work was partly supported by the Government of Catalonia under Grant 2009 SGR 1135, by the Spanish Government through projects TIN2011-27076-C03-01 “CO-PRIVACY”, TIN2012-32757 “ICWT” and CONSOLIDER INGENIO 2010 CSD2007-00004 “ARES”, and by the European Commission under FP7 projects “DwB” and “INTER-TRUST”. The second author is partially supported as an ICREA Acadèmia researcher by the Government of Catalonia. The authors are with the UNESCO Chair in Data Privacy, but they are solely responsible for the views expressed in this paper, which do not necessarily reflect the position of UNESCO nor commit that organization.

References

Aggarwal CC, Yu PS (eds) (2008) Privacy preserving data mining: models and algorithms. Springer, Berlin
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on very large data bases, VLDB, pp 487–499
Agrawal R, Srikant R (2000) Privacy preserving data mining. In: ACM SIGMOD 2000, pp 439–450
Australian Legislation (2008) (a) Equal Opportunity Act—Victoria State, (b) Anti-Discrimination Act—Queensland State
Bache K, Lichman M (2013) UCI machine learning repository. University of California, School of Information and Computer Science, Irvine, CA. http://archive.ics.uci.edu/ml. Accessed 20 Jan 2014
Bayardo RJ, Agrawal R (2005) Data privacy through optimal k-anonymization. In: ICDE 2005, IEEE, pp 217–228
Berendt B, Preibusch S (2012) Exploring discrimination: a user-centric evaluation of discrimination-aware data mining. In: IEEE 12th international conference on data mining workshops-ICDMW 2012, IEEE Computer Society, pp 344–351
Calders T, Verwer S (2010) Three naive Bayes approaches for discrimination-free classification. Data Mining Knowl Discov 21(2):277–292
Custers B, Calders T, Schermer B, Zarsky TZ (eds) (2013) Discrimination and privacy in the information society—data mining and profiling in large databases. Studies in applied philosophy, epistemology and rational ethics, vol 3. Springer, Berlin
Domingo-Ferrer J, Torra V (2005) Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining Knowl Discov 11(2):195–212
Dwork C (2006) Differential privacy. In: ICALP 2006, LNCS 4052, Springer, pp 1–12
Dwork C (2011) A firm foundation for private data analysis. Commun ACM 54(1):86–95
Dwork C, Hardt M, Pitassi T, Reingold O, Zemel RS (2012) Fairness through awareness. In: ITCS 2012, ACM, pp 214–226
European Union Legislation (1995) Directive 95/46/EC
European Union Legislation (2009) (a) Race Equality Directive, 2000/43/EC, 2000; (b) Employment Equality Directive, 2000/78/EC, 2000; (c) Equal Treatment of Persons, European Parliament legislative resolution, P6_TA(2009) 0211
Fung BCM, Wang K, Yu PS (2005) Top-down specialization for information and privacy preservation. In: ICDE 2005, IEEE, pp 205–216
Fung BCM, Wang K, Fu AW-C, Yu P (2010) Introduction to privacy-preserving data publishing: concepts and techniques. Chapman & Hall/CRC, New York
Hajian S, Domingo-Ferrer J, Martínez-Ballesté A (2011) Rule protection for indirect discrimination prevention in data mining. In: MDAI 2011, LNCS 6820, Springer, pp 211–222
Hajian S, Domingo-Ferrer J (2013) A methodology for direct and indirect discrimination prevention in data mining. IEEE Trans Knowl Data Eng 25(7):1445–1459
Hajian S, Monreale A, Pedreschi D, Domingo-Ferrer J, Giannotti F (2012) Injecting discrimination and privacy awareness into pattern discovery. In: IEEE 12th international conference on data mining workshops-ICDMW 2012, IEEE Computer Society, pp 360–369
Hajian S, Domingo-Ferrer J (2012) A study on the impact of data anonymization on anti-discrimination. In: IEEE 12th international conference on data mining workshops-ICDMW 2012, IEEE Computer Society, pp 352–359
Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Schulte-Nordholt E, Spicer K, de Wolf P-P (2012) Statistical disclosure control. Wiley, Chichester
Iyengar VS (2002) Transforming data to satisfy privacy constraints. In: SIGKDD 2002, ACM, pp 279–288
Kamiran F, Calders T (2011) Data preprocessing techniques for classification without discrimination. Knowl Inf Syst 33(1):1–33
Kamiran F, Calders T, Pechenizkiy M (2010) Discrimination aware decision tree learning. In: ICDM 2010, IEEE, pp 869–874
Kamishima T, Akaho S, Asoh H, Sakuma J (2012) Fairness-aware classifier with prejudice remover regularizer. In: ECML/PKDD, LNCS 7524, Springer, pp 35–50
Lefevre K, Dewitt DJ, Ramakrishnan R (2005) Incognito: efficient full-domain k-anonymity. In: SIGMOD 2005, ACM, pp 49–60
Lefevre K, Dewitt DJ, Ramakrishnan R (2006) Mondrian multidimensional k-anonymity. In: ICDE 2006, IEEE, p 25
Li N, Li T, Venkatasubramanian S (2007) t-Closeness: privacy beyond k-anonymity and l-diversity. In: IEEE ICDE 2007, IEEE, pp 106–115
Lindell Y, Pinkas B (2000) Privacy preserving data mining. In: Bellare M (ed) Advances in cryptology-CRYPTO'00, LNCS 1880, Springer, Berlin, pp 36–53
Loung BL, Ruggieri S, Turini F (2011) k-NN as an implementation of situation testing for discrimination discovery and prevention. In: KDD 2011, ACM, pp 502–510
Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M (2007) l-Diversity: privacy beyond k-anonymity. ACM Trans Knowl Discov Data (TKDD) 1(1):Article 3
Mohammed N, Chen R, Fung BCM, Yu PS (2011) Differentially private data release for data mining. In: KDD 2011, ACM, pp 493–501
Pedreschi D, Ruggieri S, Turini F (2008) Discrimination-aware data mining. In: KDD 2008, ACM, pp 560–568
Pedreschi D, Ruggieri S, Turini F (2009) Measuring discrimination in socially-sensitive decision records. In: SDM 2009, SIAM, pp 581–592
Pedreschi D, Ruggieri S, Turini F (2009) Integrating induction and deduction for finding evidence of discrimination. In: ICAIL 2009, ACM, pp 157–166
Pedreschi D, Ruggieri S, Turini F (2013) The discovery of discrimination. In: Custers BHM, Calders T, Schermer BW, Zarsky TZ (eds) Discrimination and privacy in the information society: studies in applied philosophy, epistemology and rational ethics. Springer, Berlin, pp 91–108
Ruggieri S, Pedreschi D, Turini F (2010) Data mining for discrimination discovery. ACM Trans Knowl Discov Data (TKDD) 4(2):Article 9
Samarati P (2001) Protecting respondents' identities in microdata release. IEEE Trans Knowl Data Eng 13(6):1010–1027
Samarati P, Sweeney L (1998) Generalizing data to provide anonymity when disclosing information. In: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems (PODS 98), Seattle, WA, p 188
Statistics Sweden (2001) Statistisk röjandekontroll av tabeller, databaser och kartor (Statistical disclosure control of tables, databases and maps, in Swedish). Statistics Sweden, Örebro. http://www.scb.se/statistik/_publikationer/OV9999_2000I02_BR_X97P0102. Accessed 20 Jan 2014
Sweeney L (1998) Datafly: a system for providing anonymity in medical data. In: Proceedings of the IFIP TC11 WG11.3 11th international conference on database security XI: status and prospects, pp 356–381
Sweeney L (2002) k-Anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl Based Syst 10(5):557–570
United States Congress (1963) US Equal Pay Act (EPA) (Pub. L. 88-38). http://www.eeoc.gov/eeoc/history/35th/thelaw/epa.html. Accessed 20 Jan 2014
Wang K, Yu PS, Chakraborty S (2004) Bottom-up generalization: a data mining solution to privacy protection. In: ICDM 2004, IEEE, pp 249–256
Willenborg L, de Waal T (1996) Elements of statistical disclosure control. Springer, Berlin
Witten I, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco
Zliobaite I, Kamiran F, Calders T (2011) Handling conditional discrimination. In: ICDM 2011, IEEE, pp 992–1001
