
Fairness Constraints in Semi-supervised Learning

Tao Zhang, Tianqing Zhu, Mengde Han, Jing Li, Wanlei Zhou, Senior Member, IEEE, and Philip S. Yu, Fellow, IEEE

Abstract—Fairness in machine learning has received considerable attention. However, most studies on fair learning focus on either supervised learning or unsupervised learning. Very few consider semi-supervised settings. Yet, in reality, most machine learning tasks rely on large datasets that contain both labeled and unlabeled data. One of the key issues in fair learning is the balance between fairness and accuracy. Previous studies argue that increasing the size of the training set can lead to a better trade-off. We believe that enlarging the training set with unlabeled data may achieve a similar result. Hence, we develop a framework for fair semi-supervised learning, formulated as an optimization problem. The objective includes a classifier loss to optimize accuracy, a label propagation loss to optimize the predictions on unlabeled data, and fairness constraints over labeled and unlabeled data to optimize the fairness level. The framework is instantiated for logistic regression and support vector machines under the fairness metrics of disparate impact and disparate mistreatment. We theoretically analyze the sources of discrimination in semi-supervised learning via bias, variance and noise decomposition. Extensive experiments show that our method is able to achieve fair semi-supervised learning, and to reach a better trade-off between accuracy and fairness than fair supervised learning.

Index Terms—Fairness, discrimination, machine learning, semi-supervised learning.

I. INTRODUCTION

Machine learning algorithms, as useful decision-making tools, are widely used in society. These algorithms are often assumed to be paragons of objectivity. However, many studies show that the decisions made by these models can be biased against certain groups of people. For example, Chouldechova et al. [1] found evidence of racial bias in recidivism prediction, where Black defendants are particularly likely to be falsely flagged as future criminals, and Obermeyer et al. [2] found prejudice in health care systems, where Black patients assigned the same level of risk by the algorithm are sicker than White patients. These findings show that discrimination can arise from machine learning algorithms and harm the fundamental rights of human beings. Given the widespread use of machine learning to support decisions over loan allocations, insurance coverage, the best candidate for a job, and many other basic precursors to equity, fairness in machine learning has become a significantly important issue. Thus, designing machine learning algorithms that treat all groups equally is critical.

Tianqing Zhu* is the corresponding author.
Tao Zhang, Tianqing Zhu, Mengde Han, and Wanlei Zhou are with the Centre for Cyber Security and Privacy, School of Computer Science, University of Technology Sydney, Sydney, Australia. Email: {[email protected], [email protected], [email protected], [email protected]}
Jing Li is with the Centre for Artificial Intelligence, University of Technology Sydney. Email: {[email protected]}
Philip S. Yu is with the Department of Computer Science, University of Illinois at Chicago, Chicago, Illinois, USA. Email: {[email protected]}

Over the past few years, many fairness metrics have been proposed to define what fairness means in machine learning. Popular fairness metrics include statistical fairness [1], [3], [4], individual fairness [5], [6], [7], [8] and causal fairness [9], [10], [11]. Meanwhile, a great many algorithms have been developed to address fairness issues in both supervised learning settings [3], [4], [5], [12], [13] and unsupervised settings [14], [15], [16], [17], [18]. Generally, these studies have focused on two key issues: how to formalize the concept of fairness in the context of machine learning tasks, and how to design efficient algorithms that strike a desirable trade-off between accuracy and fairness. What is lacking are studies that consider semi-supervised learning (SSL) scenarios.

In real-world machine learning tasks, the data used for training is often a combination of labeled and unlabeled data. Therefore, fair SSL is a vital area of development. As in the other learning settings, achieving a balance between accuracy and fairness is a key issue. However, given the mix of labeled and unlabeled data in SSL, the trade-off solutions developed for supervised and unsupervised learning do not directly apply. According to [19], increasing the size of the training set can create a better trade-off. This finding sparked the idea that the trade-off might be improved via unlabeled data. Unlabeled data is abundant and, if it could be used as training data, we may be able to avoid the need to compromise between fairness and accuracy. Generally, fair SSL poses two challenges: 1) how to achieve fair learning from both labeled and unlabeled data; and 2) how to make use of unlabeled data to achieve a better trade-off between accuracy and fairness.

To solve these challenges, we propose a framework for fair SSL that supports multiple classifiers and fairness metrics. The framework is formulated as an optimization problem, where the objective function includes a loss for both the classifier and label propagation, and fairness constraints over labeled and unlabeled data. The classifier loss optimizes the accuracy of the training result; the label propagation loss optimizes the label predictions on unlabeled data; and the fairness constraint adjusts the fairness level as desired. The optimization includes two steps. In the first step, the fairness constraints force the weight update towards a fair direction. This step can be solved as a convex problem or via convex-concave programming when disparate impact or disparate mistreatment, respectively, is used as the fairness metric. In the second step, the updated weights further direct the labels assigned to unlabeled data in a fair direction through label propagation. The labels for unlabeled data can be calculated in closed form. In this way, labeled and unlabeled data are used to achieve


a better trade-off between accuracy and fairness. With this strategy, we can control the level of discrimination in the model and, therefore, provide a machine learning framework that offers fair SSL. Our approach incorporates a wide range of fairness definitions, such as disparate impact and disparate mistreatment, and is guaranteed to yield an accurate fair classifier.

With the aim of achieving fair SSL, the contributions of this paper are three-fold.
• First, we propose a framework that is able to achieve fair SSL and supports multiple classifiers and the fairness metrics of disparate impact and disparate mistreatment. This framework enables the use of unlabeled data to achieve a better trade-off between fairness and accuracy.
• Second, we consider different cases of fairness constraints on labeled and unlabeled data, and analyze the impacts of these constraints on the training results. This shows how to control the fairness level in practice.
• Third, we theoretically analyze the sources of discrimination in SSL, and conduct extensive experiments to validate the effectiveness of our proposed method.

The rest of this paper is organized as follows. The preliminaries are given in Section II, and the proposed framework is presented in Section III. Section IV presents the discrimination analysis, and the experiments are set out in Section V. The related work appears in Section VI, with the conclusion in Section VII.

II. PRELIMINARIES

A. Notations and Problem Definition

Let $X = \{x_1, \dots, x_K\}^T \in \mathbb{R}^{K \times v}$ denote the training data matrix, where $K$ is the number of data points and $v$ is the number of unprotected attributes; $z = \{z_1, \dots, z_K\} \in \{0, 1\}$ denotes the protected attribute, e.g., gender or race. The labeled dataset is denoted as $D_l = \{x_i, z_i, y_{l,i}\}_{i=1}^{K_l}$ with $K_l$ data points, and $y_l = \{y_{l,1}, \dots, y_{l,K_l}\}^T \in \{0, 1\}$ are the labels of the labeled dataset. The unlabeled dataset is denoted as $D_u = \{x_i, z_i\}_{i=1}^{K_u}$ with $K_u$ data points, and $y_u = \{y_{u,1}, \dots, y_{u,K_u}\}^T \in \{0, 1\}$ are the predicted labels for the unlabeled dataset. Our objective is to learn a classification model $f(\cdot)$ with parameters $w$ and $y_u$ over the discriminatory datasets $D_l$ and $D_u$ that delivers high accuracy with low discrimination.

B. Fairness Metric

Fairness is usually assessed on the basis of protected/unprotected groups of individuals (defined by their sensitive attributes). In our framework, we have applied disparate impact and disparate mistreatment as the fairness metrics [3], [20].

1) Disparate impact: A classification model does not suffer disparate impact if

$\Pr(\hat{y} = 1 \mid z = 1) = \Pr(\hat{y} = 1 \mid z = 0) \qquad (1)$

where $\hat{y}$ is the predicted label. That is, the probability that the classifier predicts the positive class for a data point should not depend on whether the data point is in the protected or the unprotected group. If the positive prediction rate is the same for both values of the sensitive feature $z$, there is no disparate impact.

2) Disparate mistreatment: A binary classifier does not suffer disparate mistreatment if the misclassification rates of the groups with different values of the sensitive feature $z$ are the same. Specifically, the misclassification rate can be measured as fractions over the group and the ground-truth label, that is, as the false positive rate and the false negative rate. Here, three kinds of disparate mistreatment are adopted to evaluate discrimination, as follows.

Overall misclassification rate (OMR):

$\Pr(\hat{y} \neq y \mid z = 1) = \Pr(\hat{y} \neq y \mid z = 0) \qquad (2)$

False positive rate (FPR):

$\Pr(\hat{y} \neq y \mid z = 1, y = 0) = \Pr(\hat{y} \neq y \mid z = 0, y = 0) \qquad (3)$

False negative rate (FNR):

$\Pr(\hat{y} \neq y \mid z = 1, y = 1) = \Pr(\hat{y} \neq y \mid z = 0, y = 1) \qquad (4)$

In most cases, a classifier suffers discrimination in terms of disparate impact or disparate mistreatment. The discrimination level is defined as the difference in these rates between the groups.

Definition 1. Let $\gamma_z$ denote the extent of discrimination in group $z$ in terms of a fairness metric. The discrimination level $\Gamma(\hat{y})$ is measured by the difference between the $\gamma_z$, denoted as $\Gamma(\hat{y}) = |\gamma_0(\hat{y}) - \gamma_1(\hat{y})|$.
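To make the metrics above concrete, the following minimal Python sketch (not part of the paper's implementation; function and variable names are illustrative) computes the discrimination level $\Gamma(\hat{y})$ for disparate impact (Eq. (1)) and the three disparate mistreatment variants (Eqs. (2)-(4)) from binary predictions, ground-truth labels and a binary sensitive attribute.

```python
import numpy as np

def discrimination_level(y_pred, y_true, z, metric="disparate_impact"):
    """Illustrative computation of Gamma = |gamma_0 - gamma_1| for binary
    labels and a binary sensitive attribute z (0 = unprotected, 1 = protected)."""
    rates = []
    for group in (0, 1):
        mask = (z == group)
        if metric == "disparate_impact":      # Eq. (1): positive prediction rate
            rates.append(np.mean(y_pred[mask] == 1))
        elif metric == "omr":                 # Eq. (2): overall misclassification rate
            rates.append(np.mean(y_pred[mask] != y_true[mask]))
        elif metric == "fpr":                 # Eq. (3): false positive rate
            neg = mask & (y_true == 0)
            rates.append(np.mean(y_pred[neg] == 1))
        elif metric == "fnr":                 # Eq. (4): false negative rate
            pos = mask & (y_true == 1)
            rates.append(np.mean(y_pred[pos] == 0))
        else:
            raise ValueError(metric)
    return abs(rates[0] - rates[1])
```

For instance, discrimination_level(y_pred, y_true, z, metric="fnr") returns the absolute FNR gap between the two groups.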

C. Bias, Variance and Noise

According to [19], a bias, variance and noise decomposition can be used to analyze the sources of discrimination. In the following, we give the definitions of the main prediction, bias, variance and noise. The main prediction is defined as $y_m(x, z) = \arg\min_{y'} \mathbb{E}_D[L(\hat{y}, y')]$ for a loss function $L$ and a set of training sets $D$, where $y$ is the true label, $\hat{y}$ is the predicted label, and the expectation is taken over the training sets in $D$.

Definition 2. (Bias, variance and noise) Following [21], the bias $B$, variance $V$ and noise $N$ at a point $(x, z)$ are defined as

$B(\hat{y}, x, z) = L(y_*(x, z), y_m(x, z)) \qquad (5)$
$V(\hat{y}, x, z) = \mathbb{E}_D[L(y_m(x, z), \hat{y}(x, z))] \qquad (6)$
$N(\hat{y}, x, z) = \mathbb{E}_D[L(y_*(x, z), y(x, z))] \qquad (7)$

where $y_*$ is the optimal prediction that attains the smallest expected error. Bias is the loss between the main prediction and the optimal prediction. Variance is the average loss induced by predictions relative to the main prediction. Noise is the unavoidable part of the loss, which is independent of the learning model.

The bias-variance-noise decomposition is suitable for discrimination analysis because the loss is related to the misclassification rate. For instance, when the zero-one loss function is applied, the misclassification rate can be presented as

$\mathbb{E}_D[L(\hat{y}, y)] = \mathbb{E}_D[\hat{y} \neq y \mid z = 0] + \mathbb{E}_D[\hat{y} \neq y \mid z = 1]$

The loss function can thus be decomposed into the false positive rate and the false negative rate. When these rates are known, the true positive rate and true negative rate can also be calculated. This means


that many fairness metrics, such as demographic parity and equal opportunity, might be explained via bias, variance and noise decomposition.

III. PROPOSED METHOD

In this section, we first present the proposed framework in Section III-A. Then, the fairness metrics of disparate impact and disparate mistreatment are analyzed for logistic regression in Section III-B. Section III-C covers the case of support vector machines, and a discussion is given in Section III-D.

A. The Proposed Framework

We formulate the framework of fair SSL as follows, including the classification loss, the label propagation loss and the fairness constraints:

$\min_{w, y_u} \; L_1(w, y_u) + \alpha L_2(y_u) \quad \text{s.t.} \quad s(w) \le c \qquad (8)$

where $L_1$ is the classifier loss between the predicted labels and the true labels; $L_2$ is the loss of label propagation from labeled data to unlabeled data; $\alpha$ is a parameter that balances the two losses; $s(w)$ denotes the fairness constraint and $c$ is a threshold.

1) Classifier Loss: A classifier loss function evaluates how well a specific algorithm models the given dataset. When different algorithms are used to train on the datasets, such as logistic regression or neural networks, a corresponding loss function is applied to evaluate the accuracy of the model.

2) Label Propagation Loss: Label propagation is performed by an efficient SSL algorithm that allocates labels to unlabeled data [22]. In our framework, label propagation is based on a fully-connected graph whose nodes represent the labeled and unlabeled data points. The edge between each pair of data points $i$ and $j$ is weighted: the closer the two points are in Euclidean distance $d_{ij}$, the greater the weight $\theta_{ij}$. We choose a Gaussian similarity function to calculate the weights, given as follows:

$\theta_{ij} = \exp\left(\frac{-d_{ij}^2}{\sigma^2}\right) = \exp\left(\frac{-\sum_d (x_i^d - x_j^d)^2}{\sigma^2}\right) \qquad (9)$

where $\sigma$ is a length scale parameter. This parameter has an impact on the graph structure; hence, the value of $\sigma$ needs to be selected carefully [23].

Given the whole dataset, an adjacency matrix is constructed as $\Theta = [\theta_{ij}] \in \mathbb{R}^{K \times K}, \forall i, j \in \{1, \dots, K\}$ ($K = K_l + K_u$), and the degree matrix $D$ is constructed as a diagonal matrix whose $i$-th diagonal element is $d_{ii} = \sum_{j=1}^{K} \theta_{ij}$. When binary classification is considered, the vector $y = [y_l; y_u] \in \mathbb{R}^{K}$ is a cluster indicator vector that includes the labels of the labeled and unlabeled data. Therefore, the label propagation loss $L_2$ for SSL can be expressed as

$L_2 = \min_{\forall f_i = y_{u,i},\, i = 1, \dots, K_u} \mathrm{Tr}(y^T U y) \qquad (10)$

where $\mathrm{Tr}$ denotes the trace, and $U$ is the Laplacian matrix, calculated as $U = D - \Theta$.

3) Fairness Constraints: Adding fairness constraints is a useful way to enforce fair learning with in-processing methods. In SSL, labeled data and unlabeled data have different impacts on discrimination. One reason is that predicting labels for unlabeled data adds noise to the labels. Another reason is that labeled data and unlabeled data may have different data distributions. Therefore, the discrimination inherent in unlabeled data is different from the discrimination in labeled data. For these reasons, we impose fairness constraints on labeled and unlabeled data separately to measure discrimination. We consider four cases of fairness constraints enforced on the training data:
• 1. Labeled data: $s_1(w) \le c$.
• 2. Unlabeled data: $s_2(w) \le c$.
• 3. Combined labeled and unlabeled data: $s_1(w) \le c_1$ (for labeled data) and $s_2(w) \le c_2$ (for unlabeled data).
• 4. Mixed labeled and unlabeled data: $s(w) \le c$.

Here, $c$ is a discrimination threshold that is adjusted to control the trade-off between accuracy and fairness. Note that many fairness constraints [3], [20], [24] have been proposed to enforce various fairness metrics, such as disparate impact and disparate mistreatment, and these fairness constraints can be used in our framework. The basic idea behind such fairness constraints is to use the covariance between the users' sensitive attributes and the signed distance from the feature vectors to the decision boundary to restrict the correlation between the sensitive attributes and the classification result. This can be described as

$\left| \frac{1}{K}\, g_w (z - \bar{z}) \right| \le c \qquad (11)$

where $g_w$ is the signed distance between the feature vectors and the decision boundary of the classifier. The form of $g_w$ differs across fairness metrics, as listed below.
• (Disparate impact)

$g_w = w^T X \qquad (12)$

• (Overall misclassification rate)

$g_w = \min\left(0, y\, w^T X\right) \qquad (13)$

• (False positive rate)

$g_w = \min\left(0, \frac{1 - y^T}{2}\, y\, w^T X\right) \qquad (14)$

• (False negative rate)

$g_w = \min\left(0, \frac{1 + y^T}{2}\, y\, w^T X\right) \qquad (15)$

Note that labels appear in the disparate mistreatment constraints but not in the disparate impact constraint. This can lead to differences when the four cases of fairness constraints are used. These fairness metrics are analyzed in the following sections.
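The covariance constraint of Eq. (11) with the boundary terms of Eqs. (12)-(15) can be evaluated as in the sketch below. It assumes labels $y \in \{-1, +1\}$ for the mistreatment variants (a common convention for boundary-based constraints); the names are illustrative and this is not the authors' released implementation.

```python
import numpy as np

def fairness_covariance(w, X, z, y=None, metric="disparate_impact"):
    """Covariance term |1/K * sum_i (z_i - z_bar) g_w(x_i)| from Eq. (11).
    X: (K, v) features, z: (K,) binary sensitive attribute,
    y: (K,) labels in {-1, +1}, needed for the mistreatment variants."""
    d = X @ w                                   # signed distance to the boundary
    if metric == "disparate_impact":            # Eq. (12)
        g = d
    elif metric == "omr":                       # Eq. (13)
        g = np.minimum(0.0, y * d)
    elif metric == "fpr":                       # Eq. (14): only ground-truth negatives contribute
        g = np.minimum(0.0, (1 - y) / 2 * y * d)
    elif metric == "fnr":                       # Eq. (15): only ground-truth positives contribute
        g = np.minimum(0.0, (1 + y) / 2 * y * d)
    else:
        raise ValueError(metric)
    return abs(np.mean((z - z.mean()) * g))
```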

B. A Case of Logistic Regression

In this section, we focus on Eq. (8) for the case of a binary logistic regression (LR) classifier. The classifier is subject to the fairness metric of demographic parity with mixed labeled and unlabeled data.


The objective function of LR is defined as

$L_1^{LR} = -\ln(p)\, y - \ln(1 - p)(1 - y) \qquad (16)$

where $p = \frac{1}{1 + e^{-w^T X}}$ is the probability of mapping $X$ to the class label $y$, and $1$ denotes a column vector with all elements equal to $1$. Given the logistic regression loss, the label propagation loss and the fairness metric, the optimization problem (8) takes the form

$\min_{w, y_u} \; -\ln(p)\, y - \ln(1 - p)(1 - y) + \alpha \mathrm{Tr}(y^T U y) \quad \text{s.t.} \quad \left| \frac{1}{K}\, g_w (z - \bar{z}) \right| \le c \qquad (17)$

1) Disparate impact: The fairness constraint used here is from [3]; it is defined as the covariance between the sensitive attribute and the signed distance from the feature vectors to the decision boundary. Therefore, $c$ is the threshold on the covariance that controls the discrimination level; a smaller $c$ indicates a lower discrimination level. The optimization of problem (17) includes two parts: learning the weights $w$ and the predicted labels $y_u$ of the unlabeled data. The problem is solved by updating $w$ and $y_u$ iteratively as follows.

Solving $w$ when $y_u$ is fixed, problem (17) becomes

$\min_w \; -\ln(p)\, y - \ln(1 - p)(1 - y) \quad \text{s.t.} \quad \left| \frac{1}{K}\, w^T X (z - \bar{z}) \right| \le c \qquad (18)$

Note that problem (18) is a convex problem that can be written as a regularized optimization problem by moving the fairness constraint into the objective function. The optimal $w^*$ can then be calculated using the KKT conditions.
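A minimal sketch of this $w$-step is given below. It solves the penalized (regularized) form of problem (18) with a generic numerical optimizer rather than an explicit KKT derivation; the penalty weight lam stands in for the multiplier and is an assumption of this example.

```python
import numpy as np
from scipy.optimize import minimize

def fit_fair_lr(X, y, z, c=0.0, lam=10.0):
    """Sketch of the w-step (problem (18)) in penalized form: logistic loss
    plus a penalty on the part of the covariance constraint exceeding c.
    y is in {0, 1}; lam plays the role of the KKT multiplier."""
    zc = z - z.mean()

    def objective(w):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        eps = 1e-12
        log_loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        cov = abs(np.mean(zc * (X @ w)))        # disparate-impact covariance, Eqs. (11)-(12)
        return log_loss + lam * max(0.0, cov - c)

    w0 = np.zeros(X.shape[1])
    return minimize(objective, w0, method="Nelder-Mead").x
```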

Solving $y_u$ when $w$ is fixed, problem (17) becomes

$\min_{y_u} \; -\ln(p)\, y - \ln(1 - p)(1 - y) + \alpha \mathrm{Tr}(y^T U y) \qquad (19)$

Given that problem (19) is also convex, the optimal $y_u$ can be obtained from the derivative with respect to $y_u$. To calculate $y_u$ conveniently, we split the Laplacian matrix $U$ into four blocks after the $l$-th row and the $l$-th column: $U = \begin{bmatrix} U_{ll} & U_{lu} \\ U_{ul} & U_{uu} \end{bmatrix}$. Taking the derivative of Eq. (19) with respect to $y_u$ and setting it to zero, we have

$\alpha\left(2 y_u U_{uu} + U_{ul} y_l + (y_l U_{lu})^T\right) - \left[(\ln(p))^T + (\ln(1 - p))^T\right] = 0 \qquad (20)$

Note that $U$ is a symmetric matrix and, after simplification, the closed-form update of $y_u$ can be derived as

$y_u = -U_{uu}^{-1}\left(U_{ul}\, y_l + \frac{1}{2\alpha}\left[(\ln(p))^T + (\ln(1 - p))^T\right]\right) \qquad (21)$

When $y_u$ is updated with Eq. (21), its values may fall outside the range of 0 to 1. Therefore, before using $y_u$ to update the next $w$, each value $y_{u,i} \in y_u$, $i = 1, \dots, K_u$, is set to

$y_{u,i} = \begin{cases} 1, & y_{u,i} \ge T \\ 0, & y_{u,i} < T \end{cases} \qquad (22)$

where $T$ is the threshold that determines the classification result. Then, the optimization problem (17) can be solved by optimizing $w$ and $y_u$ iteratively. Algorithm 1 summarizes the solution of optimization problem (17) with disparate impact.

Algorithm 1: Optimizing problem (17) with disparate impact
Input: Labeled dataset $D_l$, unlabeled dataset $D_u$, fairness threshold $c$
Parameters: $T$, $\sigma$
Initialize: random initial values of $y_u$
Output: $w$ and $y_u$
1: Calculate the adjacency matrix $\Theta$ according to Eq. (9)
2: repeat
3:   Fix $y_u$ and update $w$ via the KKT conditions
4:   Fix $w$ and update $y_u$ by Eq. (21)
5:   Set each $y_{u,i} \in y_u$ to 0 or 1 by Eq. (22)
6: until the optimization problem (17) converges
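The sketch below mirrors the alternating structure of Algorithm 1, reusing the graph_laplacian and fit_fair_lr helpers sketched earlier. Restricting the logarithmic terms of Eq. (21) to the unlabeled entries, and the fixed iteration count, are assumptions made for this illustration; it is not the authors' implementation.

```python
import numpy as np

def fair_ssl_disparate_impact(X_l, y_l, X_u, z, c, alpha=1.0, T=0.5,
                              sigma=0.5, n_iters=20):
    """Sketch of Algorithm 1: alternate between the fair w-step and the
    closed-form y_u update (Eq. (21)) followed by thresholding (Eq. (22)).
    z holds the sensitive attribute for all K = K_l + K_u points.
    Relies on graph_laplacian() and fit_fair_lr() sketched earlier."""
    X = np.vstack([X_l, X_u])
    K_l = len(y_l)
    U, _ = graph_laplacian(X, sigma)
    U_uu, U_ul = U[K_l:, K_l:], U[K_l:, :K_l]
    y_u = np.random.randint(0, 2, size=len(X_u)).astype(float)
    for _ in range(n_iters):
        y = np.concatenate([y_l, y_u])
        w = fit_fair_lr(X, y, z, c=c)                   # step 3: fix y_u, update w
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad_term = np.log(p + 1e-12) + np.log(1 - p + 1e-12)
        # step 4: closed-form y_u update, Eq. (21), using the unlabeled entries
        y_u = -np.linalg.solve(U_uu, U_ul @ y_l + grad_term[K_l:] / (2 * alpha))
        y_u = (y_u >= T).astype(float)                  # step 5: threshold, Eq. (22)
    return w, y_u
```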

2) Disparate mistreatment: The disparate mistreatment metrics include the overall misclassification rate, the false positive rate and the false negative rate. For simplicity, the overall misclassification rate is used here to analyze disparate mistreatment. However, the false positive rate and false negative rate can be analyzed in the same way, and the results for all three disparate mistreatment metrics are presented in the experiments.

With the overall misclassification rate as the fairness metric, the objective function is denoted as

$\min_w \; -\ln(p)\, y - \ln(1 - p)(1 - y) + \alpha \mathrm{Tr}(y^T U y) \quad \text{s.t.} \quad \left| \frac{1}{K}\, g_w(x) (z - \bar{z}) \right| \le c \qquad (23)$

Note that the fairness constraints of disparate mistreatment are non-convex, so solving the optimization problem (23) is more challenging than solving problem (17). Next, we convert these constraints into a Disciplined Convex-Concave Program (DCCP). Thus, the optimization problem (23) can be solved efficiently with recent advances in convex-concave programming [25].

The fairness constraint of disparate mistreatment can be split into two terms,

$\frac{1}{K} \left| \sum_{D_0} (0 - \bar{z})\, g_w + \sum_{D_1} (1 - \bar{z})\, g_w \right| \le c \qquad (24)$

where $D_0$ and $D_1$ are the subsets of the labeled dataset $D_l$ and the unlabeled dataset $D_u$ with values $z = 0$ and $z = 1$, respectively. $K_0$ and $K_1$ are defined as the numbers of data points in $D_0$ and $D_1$, and thus $\bar{z}$ can be rewritten as $\bar{z} = \frac{0 \cdot K_0 + 1 \cdot K_1}{K} = \frac{K_1}{K}$. The fairness constraint of disparate mistreatment can then be rewritten as

$\frac{K_1}{K} \left| \sum_{D_0} g_w + \sum_{D_1} g_w \right| \le c \qquad (25)$

Solving $w$ when $y_u$ is fixed, problem (23) becomes

$\min_w \; -\ln(p)\, y - \ln(1 - p)(1 - y) \quad \text{s.t.} \quad \frac{K_1}{K} \left| \sum_{D_0} g_w + \sum_{D_1} g_w \right| \le c \qquad (26)$

The optimization problem (26) is a Disciplined Convex-Concave Program (DCCP) for any convex loss, and it can be solved with efficient heuristics [25].
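As an illustration of this $w$-step, the sketch below formulates problem (26) for the OMR constraint in cvxpy and solves it with the DCCP extension [25]. It assumes labels in {-1, +1}, and it splits the absolute-value constraint into two one-sided constraints so that each side has a known curvature; this is a hedged sketch under those assumptions, not the authors' released code.

```python
import cvxpy as cp
import dccp  # DCCP extension for cvxpy (pip install dccp)
import numpy as np

def fit_fair_lr_dccp(X, y, z, c=1.0):
    """Sketch of the w-step for problem (26) under the OMR constraint.
    Assumes y in {-1, +1}. g_w = min(0, y * (X @ w)) is concave in w, so the
    one-sided constraint cov <= c is non-convex and is handled by DCCP."""
    K, v = X.shape
    K1 = int(np.sum(z == 1))
    idx0, idx1 = np.where(z == 0)[0], np.where(z == 1)[0]
    w = cp.Variable(v)
    log_loss = cp.sum(cp.logistic(cp.multiply(-y, X @ w))) / K
    g = cp.minimum(0, cp.multiply(y, X @ w))
    cov = (cp.sum(g[idx0]) + cp.sum(g[idx1])) * K1 / K
    prob = cp.Problem(cp.Minimize(log_loss), [cov <= c, cov >= -c])
    prob.solve(method="dccp")   # disciplined convex-concave programming [25]
    return w.value
```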


Solving $y_u$ when $w$ is fixed, problem (23) becomes

$\min_{y_u} \; -\ln(p)\, y - \ln(1 - p)(1 - y) + \alpha \mathrm{Tr}(y^T U y) \qquad (27)$

The solution of problem (27) is the same as that of problem (19): the closed form of $y_u$ is given by Eq. (21) and thresholded via Eq. (22). The optimization problem (23) can then be solved by updating $y_u$ and $w$ iteratively. Algorithm 2 summarizes this process.

Algorithm 2: Optimizing problem (23) with disparate mistreatment
Input: Labeled dataset $D_l$, unlabeled dataset $D_u$, fairness threshold $c$
Parameters: $T$, $\sigma$
Initialize: random initial values of $y_u$
Output: $w$ and $y_u$
1: Calculate the adjacency matrix $\Theta$ according to Eq. (9)
2: Choose a metric of disparate mistreatment
3: repeat
4:   Divide $D$ into $D_0$ and $D_1$
5:   Calculate $K_0$ and $K_1$
6:   Fix $y_u$ and update $w$ with DCCP
7:   Fix $w$ and update $y_u$ by Eq. (21)
8:   Set each $y_{u,i} \in y_u$ to 0 or 1 by Eq. (22)
9: until the optimization problem (23) converges

C. A Case of SVM

SVM can also be applied in our framework. SVM uses a hyperplane $w^T X = 0$ to classify data points. The loss function of SVM is defined as

$L_1^{SVM} = \frac{1}{K}\left(1 - y\left(w^T X\right)\right) \qquad (28)$

Based on the SVM loss, the label propagation loss and the fairness metrics, the objective function is given as

$\min_{w, y_u} \; \frac{1}{K}\left(1 - y\left(w^T X\right)\right) + \alpha \mathrm{Tr}(y^T U y) \quad \text{s.t.} \quad \left| \frac{1}{K}\, g_w (z - \bar{z}) \right| \le c \qquad (29)$

Both disparate impact and disparate mistreatment can be used in the fairness constraint. Since the SVM loss is convex, the solution of problem (29) is similar to the LR case. For simplicity, we omit the derivation and show the results in the experiments.
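For completeness, a sketch of the SVM $w$-step under the (convex) disparate impact constraint is given below. It uses the standard hinge form max(0, 1 - y w^T x) of the SVM loss and assumes labels in {-1, +1}; as before, this is an illustrative example under those assumptions rather than the paper's implementation.

```python
import cvxpy as cp
import numpy as np

def fit_fair_svm(X, y, z, c=1.0):
    """Sketch of the SVM w-step under the disparate-impact constraint
    (problem (29) with y_u fixed). Assumes y in {-1, +1} and the hinge form
    of the SVM loss; both the loss and the constraint are convex."""
    K, v = X.shape
    w = cp.Variable(v)
    hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ w))) / K
    cov = cp.sum(cp.multiply(z - z.mean(), X @ w)) / K   # Eq. (11) with g_w = w^T X
    prob = cp.Problem(cp.Minimize(hinge), [cp.abs(cov) <= c])
    prob.solve()
    return w.value
```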

D. Discussion

Based on the above analysis, some conclusions can be drawn.
1. Since unlabeled data do not contain any label information, they do not carry label bias, so we can take advantage of the unlabeled data to improve the trade-off between accuracy and fairness. In our framework, due to the fairness constraint, the weight $w$ is updated towards a fair direction. Using the updated $w$ to update $y_u$ also ensures that $y_u$ is directed towards fairness. In this way, fairness is enforced on labeled and unlabeled data by updating $w$ and $y_u$ iteratively. Therefore, the labels of unlabeled data are calculated in a fair way, which benefits both the accuracy and the fairness of the classifier.
2. The fairness constraints on disparate impact and disparate mistreatment adjust the covariance between the sensitive attribute and the signed distance from the feature vectors to the decision boundary. Disparate impact leads to a convex constraint, whereas disparate mistreatment leads to a non-convex constraint.
3. The computed optimal $y_u$ takes fractional values and cannot be used to update $w$ directly, because only integer labels are allowed when optimizing $L_1$ in the next update. We therefore convert $y_u$ from fractional to integer values before updating $w$.

IV. DISCRIMINATION ANALYSIS

Following [19], we analyze the fairness of the predictive model on the basis of model bias, variance, and noise. First, we redefine the discrimination level to account for randomness in the sampled training datasets.

Definition 3. The expected discrimination level $\Gamma(\hat{y})$ of the model learned from a random training dataset $D$ is defined as

$\Gamma(\hat{y}) = \left|\mathbb{E}_D[\gamma_0(\hat{y}_D) - \gamma_1(\hat{y}_D)]\right| = \left|\bar{\gamma}_0(\hat{y}_D) - \bar{\gamma}_1(\hat{y}_D)\right| \qquad (30)$

Discrimination can be decomposed into discrimination in bias, discrimination in variance and discrimination in noise.

Lemma 1. The discrimination with regard to group $z \in Z$ is defined as

$\gamma_z(\hat{y}) = B_z(\hat{y}) + V_z(\hat{y}) + N_z \qquad (31)$

When the protected group and the unprotected group are given, the discrimination level is calculated as

$\Gamma = \left|(B_0 - B_1) + (V_0 - V_1) + (N_0 - N_1)\right| \qquad (32)$

and each component of Eq. (32) is calculated as

$B_z(\hat{y}) = \mathbb{E}_D[B(y_m, x, z) \mid Z = z] \qquad (33)$
$V_z(\hat{y}) = \mathbb{E}_D[c_v(x, z) V(y_m, x, z) \mid Z = z] \qquad (34)$
$N_z = \mathbb{E}_D[c_n(x, z) L(y_*(x, z), y) \mid Z = z] \qquad (35)$

where $c_v(x, z)$ and $c_n(x, z)$ are parameters related to the loss function. For more details, see the proof in [19].

Lemma 2. The discrimination learning curve $\Gamma(\hat{y}, K) := |\gamma_0(\hat{y}, K) - \gamma_1(\hat{y}, K)|$ is asymptotic and behaves as an inverse power-law curve, where $K$ is the size of the training set [19].

Theorem 1. Unlabeled data help to reduce discrimination with our proposed method if $(|V_z(\hat{y})_{sl}| - |V_z(\hat{y})_{ssl}|) - N_{z,u} \ge 0$.

Proof. To prove Theorem 1, the goal is to show that the discrimination level in SSL, $\Gamma_{ssl}$, is lower than the discrimination level in supervised learning, $\Gamma_{sl}$, i.e., $\Gamma_{ssl} \le \Gamma_{sl}$. In the following, we compare supervised learning and SSL in terms of discrimination in bias $B_z(\hat{y})_{ssl}$, discrimination in variance $V_z(\hat{y})_{ssl}$, and discrimination in noise $N_{z,ssl}$.

Discrimination in bias: Bias measures the fitting ability of the algorithm itself and represents the accuracy of the model. Hence, discrimination in bias $B_z(\hat{y}) = \mathbb{E}_D[B(y_m, x, z) \mid Z = z]$ depends only on the model.


When the labeled dataset and the mixed dataset are trained with the same model in both the supervised and the SSL settings, this can be expressed as $|B(\hat{y})_{sl}| - |B(\hat{y})_{ssl}| = 0$.

Discrimination in variance: Lemma 2 states that the discrimination level $\Gamma(\hat{y}, K)$ decreases as the size of the training data $K$ increases. This means that discrimination in variance $V_z(\hat{y})$ can be lessened by more unlabeled data. In our framework, the classifier is trained with labeled data and with unlabeled data whose labels are assigned through label propagation. The size of the mixed training dataset $K$ is larger than the size of the labeled training dataset $K_l$. Hence, we can conclude that $|V_z(\hat{y})_{ssl}| - |V_z(\hat{y})_{sl}| \le 0$.

Discrimination in noise: Unlabeled data introduce discrimination in noise, because predicting labels for unlabeled data via label propagation introduces noise. In our method, the fairness constraint threshold $c$ is related to the noise level in the unlabeled data: a smaller $c$ steers label propagation in a fairer direction and therefore brings smaller discrimination in noise. To analyze discrimination in noise in SSL, we divide it into discrimination in noise in labeled data $N_{z,l}$ and discrimination in noise in unlabeled data $N_{z,u}$, expressed as

$N_{z,ssl} = N_{z,l} + N_{z,u} \qquad (36)$

Discrimination in noise in labeled data $N_{z,l}$ is the same as discrimination in noise in supervised learning. Discrimination in noise in unlabeled data may stem from four cases of mislabeling during label propagation,

$N_{y=0,z=0} = \mathbb{E}_{D_{un}}[y^*_p = 1 \mid y = 0, z = 0] \qquad (37)$
$N_{y=0,z=1} = \mathbb{E}_{D_{un}}[y^*_p = 1 \mid y = 0, z = 1] \qquad (38)$
$N_{y=1,z=0} = \mathbb{E}_{D_{un}}[y^*_p = 0 \mid y = 1, z = 0] \qquad (39)$
$N_{y=1,z=1} = \mathbb{E}_{D_{un}}[y^*_p = 0 \mid y = 1, z = 1] \qquad (40)$

where $y^*_p$ is the optimal predicted label of the unlabeled data via label propagation. With the 'missing' labels filled in, the noise can be separated into the protected group $N_{1,u} = N_{y=0,z=1} + N_{y=1,z=1}$ and the unprotected group $N_{0,u} = N_{y=0,z=0} + N_{y=1,z=0}$. In our framework, $N_{z,u}$ is controlled by the fairness constraint threshold $c$: a smaller $c$ generates a smaller $N_{z,u}$. Hence, $N_{z,u}$ can be adjusted by $c$ and measured as

$N_{z,u} = |N_{1,u} - N_{0,u}| \qquad (41)$

From the analysis above, we conclude that when $(|V_z(\hat{y})_{sl}| - |V_z(\hat{y})_{ssl}|) - N_{z,u} \ge 0$, unlabeled data are able to reduce discrimination with our proposed method, which results in a better trade-off between accuracy and discrimination. Unlabeled data do not change discrimination in bias, reduce discrimination in variance, and increase discrimination in noise.

Compared with the discrimination analysis for SSL in [26], where discrimination in variance is reduced by unlabeled data and ensemble learning and discrimination in noise is reduced by ensemble learning, our proposed method reduces discrimination in variance via unlabeled data and discrimination in noise via the fairness constraint. The advantage of our method is that discrimination in noise is controllable with the threshold $c$ in the fairness constraint. [26] and our proposed method are complementary works exploring how to reduce the discrimination in noise that unlabeled data may bring.

V. EXPERIMENT

In this section, we first describe the experimental setup, including datasets, baselines, and parameters. Then, we evaluate our method on three real-world datasets under the fairness metrics of disparate impact and disparate mistreatment (including OMR, FNR and FPR). The aim of our experiments is to assess: the effectiveness of our method in achieving fair semi-supervised learning; the impact of different fairness constraints on fairness; and the extent to which unlabeled data can balance fairness with accuracy.

A. Experimental Setup

1) Datasets: Our experiments involve three real-world datasets: the Health dataset (https://foreverdata.org/1015/index.html), the Titanic dataset (https://www.kaggle.com/c/titanic/data) and the Bank dataset (https://archive.ics.uci.edu/ml/datasets/bank+marketing).
• The task in the Health dataset is to predict whether people will spend time in the hospital. To convert the problem into a binary classification task, we only predict whether people will spend any day in the hospital. After data preprocessing, the dataset contains 27,000 data points with 132 features. We divide patients into two groups based on age (≥65 years) and consider 'Age' to be the sensitive attribute.
• The Bank dataset contains a total of 41,188 records with 20 attributes and a binary label, which indicates whether the client has subscribed (positive class) or not (negative class) to a term deposit. We consider 'Age' as the sensitive attribute.
• The Titanic dataset comes from a Kaggle competition whose goal is to analyze which sorts of people were likely to survive the sinking of the Titanic. We consider 'Gender' as the sensitive attribute. After data preprocessing, we extract 891 data points with 9 features.

2) Parameters: The sensitive attributes are excluded from the training set to ensure fairness between groups and are only used to evaluate discrimination in the test phase. In the Health, Bank and Titanic datasets, all data are labeled. In the Health dataset, we sample 4,000 data points as the labeled dataset, 4,000 data points as the test dataset, and use the rest as the unlabeled dataset. In the Bank dataset, we likewise sample 4,000 data points as the labeled dataset, 4,000 data points as the test dataset, and use the rest as the unlabeled dataset. In the Titanic dataset, we sample 200 data points as the labeled dataset, 200 data points as the test dataset, and use the rest as the unlabeled dataset. Therefore, $D_l$ and $D_u$ are drawn from similar data distributions.
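A minimal sketch of the sampling procedure described above (illustrative only; not the experiment scripts used in the paper) is:

```python
import numpy as np

def split_dataset(X, y, n_labeled, n_test, seed=0):
    """Illustrative random split into labeled, test, and unlabeled subsets,
    mirroring the sampling described above (e.g., 4000/4000/rest for Health)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    lab, test, unlab = np.split(idx, [n_labeled, n_labeled + n_test])
    return (X[lab], y[lab]), (X[test], y[test]), X[unlab]
```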

In the experiments, the reported results are the average of 10 runs with randomly sampled labeled, test and unlabeled datasets. We set α = 1 and T = 0.5 for all datasets;


σ = 0.5 in the Health and Bank datasets, and σ = 0.1 in the Titanic dataset. In DCCP, the parameter τ is set to 0.05 and 1 for the Bank and Titanic datasets, respectively, and the parameter µ is set to 1.2 for both the Bank and Titanic datasets.

3) Baseline Methods: The methods chosen for comparison are listed as follows. It is worth noting that [27] also used unlabeled data for fairness. However, they only applied the equal opportunity metric, which differs from ours, so we do not compare the proposed method with theirs.
• Fairness Constraints (FS): Fairness constraints are used to ensure fairness for classifiers [3].
• Uniform Sampling (US): The number of data points in each group is equalized through oversampling and/or undersampling [28].
• Preferential Sampling (PS): The number of data points in each group is equalized by taking samples near the borderline data points [28].
• Fairness-enhanced sampling (FES): A fair SSL framework that includes pseudo labeling, re-sampling and ensemble learning [26].

B. Experimental Results of Disparate Impact

Fig. 1: The trade-off between accuracy and discrimination for the proposed method Semi (red), FS (blue), US (blue cross), PS (yellow cross) and FES (green cross) under the fairness metric of disparate impact with LR and SVM on two datasets. Panels: (a) LR-Health, (b) SVM-Health, (c) LR-Titanic, (d) SVM-Titanic. As the threshold of covariance c increases, accuracy and discrimination increase. The results demonstrate that our method achieves a better trade-off between accuracy and discrimination than the other methods.

1) Trade-off Between Accuracy and Discrimination: Figure 1 shows how accuracy and discrimination level vary with c for the proposed method and the other methods with LR and SVM on two datasets. From the results, we observe that our framework provides a better trade-off between accuracy and discrimination: with the same accuracy, discrimination is lower, or with the same discrimination, accuracy is higher. For example, at the same level of accuracy on the Titanic dataset (shown by the black line), our method with LR has a discrimination level of around 0.08, while the FS method has a discrimination level of 0.11. A similar observation can be made from the results for the PS method (yellow cross), the US method (blue cross) and the FES method (green cross). Note that the discrimination level curve (red line) with LR in the Health dataset does not extend further because discrimination does not increase as c grows. Additionally, we note that accuracy and discrimination level are related to the training models: in the Titanic dataset, LR has lower accuracy and discrimination than SVM, and the choice of training model depends on the dataset.

TABLE I: The impact of fairness constraints in terms of accuracy (Acc) and discrimination level (Dis) under the fairness metric of disparate impact with LR on the Health dataset.

Constraint | Labeled Acc / Dis | Unlabeled Acc / Dis | Combined Acc / Dis | Mixed Acc / Dis
c=0.0 | 0.7868 / 0.0042 | N/A / N/A | 0.7874 / 0.0022 | 0.7862 / 0.0003
c=0.1 | 0.7890 / 0.0129 | N/A / N/A | 0.7890 / 0.0145 | 0.7892 / 0.0149
c=0.2 | 0.7900 / 0.0170 | 0.7900 / 0.0207 | 0.7898 / 0.0170 | 0.7898 / 0.0170
c=0.3 | 0.7898 / 0.0207 | 0.7898 / 0.0170 | 0.7900 / 0.0207 | 0.7900 / 0.0207
c=0.4 | 0.7902 / 0.0178 | 0.7898 / 0.0170 | 0.7900 / 0.0207 | 0.7900 / 0.0207
c=0.5 | 0.7900 / 0.0207 | 0.7900 / 0.0207 | 0.7900 / 0.0207 | 0.7900 / 0.0207
c=0.6 | 0.7900 / 0.0207 | 0.7906 / 0.0186 | 0.7900 / 0.0207 | 0.7900 / 0.0207
c=0.7 | 0.7900 / 0.0207 | 0.7900 / 0.0207 | 0.7900 / 0.0207 | 0.7900 / 0.0207
c=0.8 | 0.7900 / 0.0207 | 0.7904 / 0.0191 | 0.7900 / 0.0207 | 0.7900 / 0.0207
c=0.9 | 0.7900 / 0.0207 | 0.7908 / 0.0190 | 0.7900 / 0.0207 | 0.7900 / 0.0207
c=1.0 | 0.7900 / 0.0207 | 0.7900 / 0.0207 | 0.7900 / 0.0207 | 0.7900 / 0.0207

TABLE II: The impact of fairness constraints in terms of accuracy (Acc) and discrimination level (Dis) under the fairness metric of disparate impact with LR on the Titanic dataset.

Constraint | Labeled Acc / Dis | Unlabeled Acc / Dis | Combined Acc / Dis | Mixed Acc / Dis
c=0.0 | 0.6330 / 0.0128 | 0.6970 / 0.1244 | 0.6290 / 0.0139 | 0.6440 / 0.0402
c=0.05 | 0.6690 / 0.0579 | 0.7070 / 0.1265 | 0.6690 / 0.0716 | 0.6810 / 0.0948
c=0.1 | 0.7150 / 0.1272 | 0.7140 / 0.1332 | 0.7100 / 0.1239 | 0.7150 / 0.1256
c=0.15 | 0.7200 / 0.1366 | 0.7190 / 0.1336 | 0.7190 / 0.1336 | 0.7200 / 0.1366
c=0.2 | 0.7200 / 0.1366 | 0.7200 / 0.1366 | 0.7200 / 0.1366 | 0.7200 / 0.1366
c=0.25 | 0.7200 / 0.1366 | 0.7200 / 0.1366 | 0.7200 / 0.1366 | 0.7200 / 0.1366

2) Different Fairness Constraints: Our next set of experiments determines the impact of different fairness constraints. For these tests, the size of the unlabeled dataset is set to 12,000 data points in the Health dataset and 400 data points in the Titanic dataset. Due to space limitations, we only report the results for LR, which appear in Tables I and II. The results show that, when the threshold of covariance c is varied, the different fairness constraints on labeled and unlabeled data have different impacts on the training results. As the threshold of covariance increases, both accuracy and discrimination level increase before leveling off. In terms of accuracy, this is because a larger c allows a larger space in which to find better weights w for classification. In terms of discrimination, a larger c tends to introduce more discrimination in noise.

It is also observed that the fairness constraint on mixed data generally yields the best trade-off between accuracy and discrimination. The other three constraints have very similar accuracy and discrimination levels. We attribute this to the assumption that labeled and unlabeled data have similar data distributions, so the mixed fairness constraint on labeled and unlabeled data gives the best description of the covariance between the sensitive attributes and the signed distance from the feature vectors to the decision boundary.


Fig. 2: The impact of the amount of unlabeled data in the training set on accuracy (red) and discrimination level (blue) under the fairness metric of disparate impact with LR and SVM on two datasets. Panels: (a) LR-Health, (b) SVM-Health, (c) LR-Titanic, (d) SVM-Titanic. The x-axis is the size of the unlabeled dataset; the left y-axis is accuracy; the right y-axis is discrimination level.

3) The Impact of Unlabeled Data: For these experiments, we set the covariance threshold c = 1 for the Health and Titanic datasets. Figure 2 shows how accuracy and discrimination level vary with the amount of unlabeled data for both the LR and SVM classifiers on both datasets. As shown, accuracy increases as the amount of unlabeled data increases in both datasets before stabilizing at its peak. The discrimination level sharply decreases almost immediately, and then also stabilizes. These results clearly demonstrate that discrimination in variance decreases as the amount of unlabeled data in the training set increases.

C. Experimental Results of Disparate Mistreatment

1) Trade-off Between Accuracy and Discrimination: Figures 3-5 show how accuracy and discrimination level vary with c for the proposed framework and the FS method with LR and SVM on two datasets under the fairness metrics of OMR, FPR and FNR. From the results, we observe that our proposed method (red line) generally lies above and to the left of the FS method (blue line). This indicates that our framework provides a better trade-off between accuracy and discrimination for all three metrics most of the time. For example, at the same level of accuracy (Acc = 0.885) on the Bank dataset under OMR, our method with LR has a discrimination level of around 0.045, while the FS method has a discrimination level of 0.06. We also observe that the discrimination level differs considerably across fairness metrics: for example, it can reach 0.17 under FNR, while it only reaches 0.01 under FPR. In addition, we note that accuracy and discrimination level differ across training models: on the Bank dataset, SVM generally has lower accuracy and discrimination than LR.

2) Different Fairness Constraints under OMR: Tables III and IV show that different fairness constraints on labeled and unlabeled data have different impacts on the training results.

Fig. 3: The trade-off between accuracy and discrimination for the proposed method Semi (red) and FS (blue) with LR and SVM on two datasets under the metric of overall misclassification rate. Panels: (a) LR-Bank-OMR, (b) SVM-Bank-OMR, (c) LR-Titanic-OMR, (d) SVM-Titanic-OMR. As the threshold of covariance c increases, accuracy and discrimination increase. The results demonstrate that our method, using unlabeled data, achieves a better trade-off between accuracy and discrimination.

Fig. 4: The trade-off between accuracy and discrimination for the proposed method Semi (red) and FS (blue) with LR and SVM on two datasets under the metric of false negative rate. Panels: (a) LR-Bank-FNR, (b) SVM-Bank-FNR, (c) LR-Titanic-FNR, (d) SVM-Titanic-FNR. As the threshold of covariance c increases, accuracy and discrimination increase.

Due to space limitations, we only report the results for LR under the OMR metric on the Bank and Titanic datasets. For these tests, the size of the unlabeled dataset is set to 4,000 data points in the Bank dataset and 400 data points in the Titanic dataset. As shown, when the threshold of covariance c is varied, the different fairness constraints on labeled and unlabeled data lead to markedly different training results. When the fairness constraint is enforced on labeled data, accuracy and discrimination increase as c increases on the Titanic dataset.


Fig. 5: The trade-off between accuracy and discrimination for the proposed method Semi (red) and FS (blue) with LR and SVM on two datasets under the metric of false positive rate. Panels: (a) LR-Bank-FPR, (b) SVM-Bank-FPR, (c) LR-Titanic-FPR, (d) SVM-Titanic-FPR. As the threshold of covariance c increases, accuracy and discrimination increase.

TABLE III: The impact of fairness constraints in terms of accuracy (Acc) and discrimination level (Dis) under the fairness metric of overall misclassification rate with LR on the Bank dataset.

Constraint | Labeled Acc / Dis | Unlabeled Acc / Dis | Combined Acc / Dis | Mixed Acc / Dis
c=0.0 | 0.8635 / 0.0905 | 0.8407 / 0.1847 | 0.8342 / 0.147 | 0.8605 / 0.0987
c=0.5 | 0.8625 / 0.092 | 0.8402 / 0.1854 | 0.8342 / 0.1442 | 0.8605 / 0.0987
c=1.0 | 0.8638 / 0.0922 | 0.8402 / 0.1854 | 0.835 / 0.1452 | 0.8635 / 0.1071
c=1.5 | 0.8645 / 0.0918 | 0.8407 / 0.1833 | 0.835 / 0.1452 | 0.8635 / 0.1071
c=2.0 | 0.8648 / 0.0907 | 0.841 / 0.1822 | 0.8347 / 0.1462 | 0.8625 / 0.1071
c=2.5 | 0.8652 / 0.0914 | 0.8413 / 0.1812 | 0.8353 / 0.1469 | 0.8635 / 0.1084
c=3.0 | 0.866 / 0.0923 | 0.8413 / 0.1784 | 0.8342 / 0.147 | 0.8627 / 0.1084
c=3.5 | 0.8662 / 0.0927 | 0.8407 / 0.1805 | 0.8342 / 0.147 | 0.8627 / 0.1097
c=4.0 | 0.8665 / 0.093 | 0.841 / 0.1795 | 0.8342 / 0.147 | 0.8627 / 0.1097
c=4.5 | 0.8668 / 0.0919 | 0.8407 / 0.1791 | 0.835 / 0.1452 | 0.8635 / 0.1113
c=5.0 | 0.867 / 0.0909 | 0.8407 / 0.1791 | 0.8355 / 0.1444 | 0.8635 / 0.1113

This is because a smaller c enforces a lower discrimination level, which results in lower accuracy.

However, when the fairness constraint is enforced on unlabeled data, accuracy and discrimination may decrease as c increases. This is because the labels of the unlabeled data appear in the fairness constraint of disparate mistreatment and are updated during training, which means that the distribution of the unlabeled data is not well described during training. As a result, the fairness constraint on unlabeled data is not as effective.

3) The Impact of Unlabeled Data under OMR: For these experiments, we show the impact of unlabeled data under OMR. The covariance threshold is set to c = 1 for the Bank and Titanic datasets. Figure 6 shows how accuracy and discrimination level vary with the size of the unlabeled dataset for LR and SVM on the two datasets. As shown, accuracy increases with the amount of unlabeled data in both datasets until it reaches its peak. The discrimination level decreases at the beginning and then stabilizes in the Titanic dataset.

TABLE IV: The impact of fairness constraints in terms of accuracy (Acc) and discrimination level (Dis) under the fairness metric of overall misclassification rate with LR on the Titanic dataset.

Constraint | Labeled Acc / Dis | Unlabeled Acc / Dis | Combined Acc / Dis | Mixed Acc / Dis
c=0.0 | 0.7448 / 0.0285 | 0.7138 / 0.3996 | 0.7655 / 0.1387 | 0.7483 / 0.0175
c=0.5 | 0.7483 / 0.0335 | 0.6966 / 0.4386 | 0.7655 / 0.1547 | 0.7483 / 0.0175
c=1.0 | 0.7517 / 0.0385 | 0.6931 / 0.4656 | 0.7552 / 0.1397 | 0.7517 / 0.0225
c=1.5 | 0.7552 / 0.0436 | 0.7103 / 0.3946 | 0.7793 / 0.1748 | 0.7448 / 0.0445
c=2.0 | 0.7552 / 0.0436 | 0.7069 / 0.4216 | 0.7724 / 0.1648 | 0.7483 / 0.0495
c=2.5 | 0.7586 / 0.0326 | 0.7103 / 0.4106 | 0.7759 / 0.1378 | 0.7448 / 0.0605
c=3.0 | 0.7552 / 0.0596 | 0.7552 / 0.2678 | 0.7552 / 0.0596 | 0.7483 / 0.0655
c=3.5 | 0.7552 / 0.0596 | 0.6931 / 0.4656 | 0.7552 / 0.0596 | 0.7483 / 0.0816
c=4.0 | 0.7586 / 0.0646 | 0.7103 / 0.4106 | 0.7586 / 0.0646 | 0.7517 / 0.0866
c=4.5 | 0.7586 / 0.0646 | 0.7138 / 0.3996 | 0.7586 / 0.0646 | 0.7517 / 0.0866
c=5.0 | 0.7552 / 0.0756 | 0.7103 / 0.4106 | 0.7552 / 0.0756 | 0.7483 / 0.0816

Fig. 6: The impact of the amount of unlabeled data in the training set on accuracy (red) and discrimination level (blue) under the fairness metric of overall misclassification rate with LR and SVM in two datasets. Panels: (a) LR-Bank-OMR, (b) SVM-Bank-OMR, (c) LR-Titanic-OMR, (d) SVM-Titanic-OMR. The x-axis is the size of the unlabeled dataset; the left y-axis is accuracy; the right y-axis is discrimination level.

These results indicate that discrimination in variance decreases as the amount of unlabeled data in the training set increases.

D. Discussion and Summary

1) Discussion: We now discuss how the proposed framework reduces discrimination in terms of the decomposition into discrimination in bias, variance and noise. Discrimination in bias depends on the model choice. Discrimination in variance relates to the size of the training data; our framework uses unlabeled data to expand the size of the training data, and thus reduces discrimination in variance. Discrimination in noise depends on the quality of the training data and, in our framework, also on label propagation: predicting labels for unlabeled data may introduce discrimination in noise. This discrimination can be adjusted by the threshold c in the fairness constraint, where a smaller threshold yields a smaller discrimination in noise. The reduction in discrimination in variance is generally larger than the discrimination in noise induced by label propagation. Thus, when the same model is used, unlabeled data help to lessen discrimination.


2) Summary: From these experiments, we can draw the following conclusions. 1) The proposed framework can make use of unlabeled data to achieve a better trade-off between accuracy and discrimination. 2) Under the fairness metric of disparate impact, the fairness constraint on mixed labeled and unlabeled data generally yields the best trade-off between accuracy and discrimination; under the fairness metric of disparate mistreatment, the fairness constraint on labeled data is used to achieve the trade-off. 3) More unlabeled data generally helps to strike a better compromise between accuracy and discrimination. 4) The model choice affects the trade-off between accuracy and discrimination; our experiments show that SVM achieves a better trade-off than LR.

VI. RELATED WORK

In recent years, a large body of work has studied fairness in machine learning; we classify it into two streams.

A. Fair Supervised Learning

Methods for fair supervised learning include pre-processing, in-processing and post-processing methods. In pre-processing, discrimination is eliminated by guiding the distribution of the training data in a fairer direction [28] or by transforming the training data into a new space [12], [13], [29], [30], [31]. The main advantage of pre-processing is that it does not require changes to the machine learning algorithm, so it is very simple to use. In in-processing, discrimination is constrained by fairness constraints or a regularizer during the training phase. For example, Kamishima et al. [32] used a regularizer term to penalize discrimination in the learning objective. The works in [3], [20], [33] designed convex fairness constraints, such as the decision boundary covariance, to achieve fair classification (a sketch of this constraint is given below). Recent work casts the constrained optimization problem as a two-player game and formalizes the definition of fairness as a linear inequality [34], [35], [36]. This category is more flexible in optimizing different fairness constraints, and solutions using this method are considered the most robust; these methods have also shown good results in terms of both accuracy and fairness. A third approach to achieving fairness is post-processing, where a learned classifier is modified so that its decisions are non-discriminatory across groups [4], [37], [38]. Post-processing requires no changes to the classifier, but it cannot guarantee an optimal classifier.
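As an illustration of the decision boundary covariance idea in [3], [20], the following is a minimal sketch, assuming labels in {-1, +1}, a generic convex solver (cvxpy), and synthetic toy data; the function name and data are illustrative, and this is not the authors' implementation.

```python
import cvxpy as cp
import numpy as np

def fair_logreg(X, y, z, c=1.0):
    """Logistic regression with a decision boundary covariance constraint
    in the spirit of [3], [20]. X: (n, d) features, y: labels in {-1, +1},
    z: binary sensitive attribute, c: covariance threshold. Sketch only."""
    n, d = X.shape
    w = cp.Variable(d)
    # Standard logistic loss: mean of log(1 + exp(-y_i * w^T x_i))
    loss = cp.sum(cp.logistic(-cp.multiply(y, X @ w))) / n
    # Empirical covariance between z and the signed distance to the boundary
    cov = (z - z.mean()) @ (X @ w) / n
    prob = cp.Problem(cp.Minimize(loss), [cp.abs(cov) <= c])
    prob.solve()
    return w.value

# Usage on synthetic data (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
z = (X[:, 1] > 0).astype(float)
w = fair_logreg(X, y, z, c=0.1)
```

A smaller threshold c forces the boundary to be less correlated with the sensitive attribute, which is the role the threshold plays in the experiments above.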

B. Fair Unsupervised Learning

Chierichetti et al. [14] were the first to study fairness in clustering problems. Their solution, under both the k-center and k-median objectives, required every group to be (approximately) equally represented in each cluster. Many subsequent works have since addressed fair clustering. Among these, Rosner and Schmidt [17] extended fair clustering to more than two groups. Schmidt et al. [39] considered the fair k-means problem in the streaming model, defined fair coresets, and showed how to compute them in a streaming setting, resulting in a significant reduction in the input size. Bera et al. [40] presented a more generalized approach to fair clustering, providing a tunable notion of fairness in clustering. Kleindessner et al. [41] studied a version of constrained spectral clustering that incorporates fairness constraints.
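To illustrate the representation notion behind [14], here is a minimal sketch (assumed names, not code from any of the cited papers) of measuring the balance of a clustering with respect to a binary group attribute.

```python
import numpy as np

def cluster_balance(labels, group):
    """Balance of a clustering w.r.t. a binary group attribute, in the spirit
    of Chierichetti et al. [14]: for each cluster take min(ratio, 1/ratio) of
    the two group counts and report the worst cluster. 1.0 means every
    cluster is perfectly balanced; 0.0 means some cluster is single-group.
    Illustrative sketch only."""
    labels, group = np.asarray(labels), np.asarray(group)
    balances = []
    for k in np.unique(labels):
        in_k = group[labels == k]
        n0, n1 = np.sum(in_k == 0), np.sum(in_k == 1)
        balances.append(0.0 if n0 == 0 or n1 == 0 else min(n0 / n1, n1 / n0))
    return min(balances)

print(cluster_balance([0, 0, 1, 1], [0, 1, 0, 0]))  # -> 0.0: cluster 1 contains one group only
```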

C. Comparing with other work

Existing fair learning methods mainly focus on supervised and unsupervised learning and cannot be directly applied to SSL. To the best of our knowledge, only [27], [42], [26] have explored fairness in SSL. Chzhen et al. [27] studied the Bayes classifier under the fairness metric of equal opportunity, where labeled data is used to learn the output conditional probability and unlabeled data is used to calibrate the threshold in the post-processing phase. However, the unlabeled data is not fully used to eliminate discrimination, and the proposed method only applies to equal opportunity. In [42], the proposed method is built on neural networks for SSL in the in-processing phase, where unlabeled data is assigned labels via pseudo labeling. Zhang et al. [26] proposed a pre-processing framework which includes pseudo labeling, re-sampling and ensemble learning to remove representation discrimination. Our solution focuses on margin-based classifiers in the in-processing stage, as in-processing methods have demonstrated good flexibility both in balancing fairness and in supporting multiple classifiers and fairness metrics.

VII. CONCLUSION

In this paper, we propose a framework for fair semi-supervised learning that operates during the in-processing phase. Our framework is formulated as an optimization problem whose goal is to find the weights and to label the unlabeled data by minimizing the loss function subject to fairness constraints. We analyze several different cases of fairness constraints and their effects on the optimization problem, as well as on the accuracy and discrimination level of the results. A theoretical analysis of the three sources of discrimination, via bias, variance and noise decomposition, shows that unlabeled data is a viable option for achieving a better trade-off between accuracy and fairness. Our experiments confirm this analysis, showing that the proposed framework provides high levels of both accuracy and fairness in semi-supervised settings.

A further research direction is to explore ways of achieving fair SSL under other fairness metrics, such as individual fairness. Moreover, in this paper we assume that the labeled and unlabeled data have a similar distribution. However, this assumption may not hold in some practical situations. Therefore, another research direction is how to achieve fair SSL when the labeled and unlabeled data distributions differ.

ACKNOWLEDGMENT

This work is supported by an ARC Discovery Project (DP190100981) from the Australian Research Council, Australia, and in part by NSF under grants III-1526499, III-1763325, III-1909323, and CNS-1930941.


REFERENCES

[1] A. Chouldechova, “Fair prediction with disparate impact: A study of bias in recidivism prediction instruments,” Big Data, vol. 5, no. 2, pp. 153–163, 2017.
[2] Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan, “Dissecting racial bias in an algorithm used to manage the health of populations,” Science, vol. 366, no. 6464, pp. 447–453, 2019. [Online]. Available: https://science.sciencemag.org/content/366/6464/447
[3] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi, “Fairness constraints: Mechanisms for fair classification,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54, 2017, pp. 962–970.
[4] M. Hardt, E. Price, N. Srebro et al., “Equality of opportunity in supervised learning,” in Advances in Neural Information Processing Systems, 2016, pp. 3315–3323.
[5] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, “Fairness through awareness,” in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. ACM, 2012, pp. 214–226.
[6] C. Jung, M. Kearns, S. Neel, A. Roth, L. Stapleton, and Z. S. Wu, “Eliciting and enforcing subjective individual fairness,” arXiv preprint arXiv:1905.10660, 2019.
[7] T. Zhu and P. S. Yu, “Applying differential privacy mechanism in artificial intelligence,” in 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), 2019, pp. 1601–1609.
[8] C. Dwork, C. Ilvento, and M. Jagadeesan, “Individual fairness in pipelines,” arXiv preprint arXiv:2004.05167, 2020.
[9] M. J. Kusner, J. Loftus, C. Russell, and R. Silva, “Counterfactual fairness,” in Advances in Neural Information Processing Systems, 2017, pp. 4066–4076.
[10] N. Kilbertus, M. R. Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf, “Avoiding discrimination through causal reasoning,” in Advances in Neural Information Processing Systems, 2017, pp. 656–666.
[11] Y. Wu, L. Zhang, X. Wu, and H. Tong, “PC-fairness: A unified framework for measuring causality-based fairness,” in Advances in Neural Information Processing Systems, 2019, pp. 3404–3414.

[12] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork, “Learning fair representations,” in International Conference on Machine Learning, 2013, pp. 325–333.
[13] J. Song, P. Kalluri, A. Grover, S. Zhao, and S. Ermon, “Learning controllable fair representations,” in Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 89, 2019, pp. 2164–2173.
[14] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii, “Fair clustering through fairlets,” in Advances in Neural Information Processing Systems, 2017, pp. 5029–5037.
[15] M. Kleindessner, P. Awasthi, and J. Morgenstern, “Fair k-center clustering for data summarization,” in Proceedings of the 36th International Conference on Machine Learning, vol. 97, 2019, pp. 3448–3457.
[16] A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner, “Scalable fair clustering,” arXiv preprint arXiv:1902.03519, 2019.
[17] C. Rosner and M. Schmidt, “Privacy preserving clustering with constraints,” in 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018), vol. 107, 2018, pp. 96:1–96:14.
[18] X. Chen, B. Fain, C. Lyu, and K. Munagala, “Proportionally fair clustering,” in ICML, 2019.
[19] I. Chen, F. D. Johansson, and D. Sontag, “Why is my classifier discriminatory?” in Advances in Neural Information Processing Systems 31, 2018, pp. 3539–3550.
[20] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi, “Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment,” in Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 1171–1180.
[21] P. Domingos, “A unified bias-variance decomposition,” in Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 231–238.

[22] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with label propagation,” 2002.
[23] F. Wang and C. Zhang, “Label propagation through linear neighborhoods,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 1, pp. 55–67, 2007.
[24] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach, “A reductions approach to fair classification,” arXiv preprint arXiv:1803.02453, 2018.
[25] X. Shen, S. Diamond, Y. Gu, and S. Boyd, “Disciplined convex-concave programming,” in 2016 IEEE 55th Conference on Decision and Control (CDC). IEEE, 2016, pp. 1009–1014.
[26] T. Zhang, T. Zhu, J. Li, M. Han, W. Zhou, and P. Yu, “Fairness in semi-supervised learning: Unlabeled data help to reduce discrimination,” IEEE Transactions on Knowledge and Data Engineering, 2020.
[27] E. Chzhen, C. Denis, M. Hebiri, L. Oneto, and M. Pontil, “Leveraging labeled and unlabeled data for consistent fair binary classification,” in Advances in Neural Information Processing Systems, 2019, pp. 12739–12750.
[28] F. Kamiran and T. Calders, “Data preprocessing techniques for classification without discrimination,” Knowledge and Information Systems, vol. 33, no. 1, pp. 1–33, 2012.
[29] F. Calmon, D. Wei, B. Vinzamuri, K. N. Ramamurthy, and K. R. Varshney, “Optimized pre-processing for discrimination prevention,” in Advances in Neural Information Processing Systems, 2017, pp. 3992–4001.
[30] R. Feng, Y. Yang, Y. Lyu, C. Tan, Y. Sun, and C. Wang, “Learning fair representations via an adversarial framework,” arXiv preprint arXiv:1904.13341, 2019.
[31] H. Zhao and G. Gordon, “Inherent tradeoffs in learning fair representations,” in Advances in Neural Information Processing Systems, 2019, pp. 15675–15685.

[32] T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma, “Fairness-aware classifier with prejudice remover regularizer,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2012, pp. 35–50.
[33] G. Goh, A. Cotter, M. Gupta, and M. P. Friedlander, “Satisfying real-world goals with dataset constraints,” in Advances in Neural Information Processing Systems 29, 2016, pp. 2415–2423.
[34] M. Donini, L. Oneto, S. Ben-David, J. S. Shawe-Taylor, and M. Pontil, “Empirical risk minimization under fairness constraints,” in Advances in Neural Information Processing Systems, 2018, pp. 2791–2801.
[35] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach, “A reductions approach to fair classification,” arXiv preprint arXiv:1803.02453, 2018.
[36] A. Cotter, H. Jiang, S. Wang, T. Narayan, S. You, K. Sridharan, and M. R. Gupta, “Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals,” Journal of Machine Learning Research, vol. 20, no. 172, pp. 1–59, 2019.
[37] M. P. Kim, A. Ghorbani, and J. Zou, “Multiaccuracy: Black-box post-processing for fairness in classification,” in Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019, pp. 247–254.
[38] P. K. Lohia, K. N. Ramamurthy, M. Bhide, D. Saha, K. R. Varshney, and R. Puri, “Bias mitigation post-processing for individual and group fairness,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2847–2851.
[39] M. Schmidt, C. Schwiegelshohn, and C. Sohler, “Fair coresets and streaming algorithms for fair k-means clustering,” CoRR, vol. abs/1812.10854, 2018. [Online]. Available: http://arxiv.org/abs/1812.10854
[40] S. Bera, D. Chakrabarty, N. Flores, and M. Negahbani, “Fair algorithms for clustering,” in Advances in Neural Information Processing Systems 32, 2019, pp. 4955–4966.
[41] M. Kleindessner, P. Awasthi, and J. Morgenstern, “Fair k-center clustering for data summarization,” in Proceedings of the 36th International Conference on Machine Learning, vol. 97, 2019, pp. 3448–3457.
[42] V. Noroozi, S. Bahaadini, S. Sheikhi, N. Mojab, and P. S. Yu, “Leveraging semi-supervised learning for fairness using neural networks,” arXiv preprint arXiv:1912.13230, 2019.

Tao Zhang is working towards his Ph.D. degree with the School of Computer Science at the University of Technology Sydney, Australia. His research interests include privacy preserving, algorithmic fairness, and machine learning.


Tianqing Zhu received the B.Eng. degree in chemistry and the M.Eng. degree in automation from Wuhan University, Wuhan, China, in 2000 and 2004, respectively, and the Ph.D. degree in computer science from Deakin University, Geelong, Australia, in 2014.

She is currently an Associate Professor in the Faculty of Engineering and Information Technology with the School of Computer Science, University of Technology Sydney, Sydney, Australia. Before that, she was a Lecturer in the School of Information Technology, Deakin University, from 2014 to 2018. Her research interests include privacy preserving, data mining, and network security.

Mengde Han is a Ph.D. student at the University of Technology Sydney with a focus on local differential privacy. He completed his Master's degree at Johns Hopkins University.

Jing Li received the B.Eng. and M.Eng. degrees in computer science and technology from Northwestern Polytechnical University, Xi'an, China, in 2015 and 2018, respectively.

Currently, he is pursuing the Ph.D. degree with the Centre for Artificial Intelligence at the University of Technology Sydney, Australia. His research interests include machine learning and privacy preserving.

Wanlei Zhou received the B.Eng. and M.Eng. degrees from Harbin Institute of Technology, Harbin, China, in 1982 and 1984, respectively, and the Ph.D. degree from The Australian National University, Canberra, Australia, in 1991, all in Computer Science and Engineering. He also received a D.Sc. degree (a higher doctorate degree) from Deakin University in 2002. He is currently the Head of the School of Computer Science at the University of Technology Sydney (UTS). Before joining UTS, Professor Zhou held the positions of Alfred Deakin Professor, Chair of Information Technology, and Associate Dean (International Research Engagement) of the Faculty of Science, Engineering and Built Environment, Deakin University. His research interests include security and privacy, bioinformatics, and e-learning. Professor Zhou has published more than 400 papers in refereed international journals and refereed international conference proceedings, including many articles in IEEE transactions and journals.

Philip S. Yu received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, USA, and the M.B.A. degree from New York University, New York, NY, USA. He was with IBM, Armonk, NY, USA, where he was a Manager of the Software Tools and Techniques Department at the Thomas J. Watson Research Center. He is a Distinguished Professor of computer science with the University of Illinois at Chicago, Chicago, IL, USA, where he also holds the Wexler Chair in Information Technology. He has published over 1200 papers in peer-reviewed journals, such as the IEEE Transactions on Knowledge and Data Engineering, ACM Transactions on Knowledge Discovery from Data, VLDBJ, and the ACM Transactions on Intelligent Systems and Technology, and conferences, such as KDD, ICDE, WWW, AAAI, SIGIR, ICML, and CIKM. He holds or has applied for over 300 U.S. patents. His current research interests include data mining, data streams, databases, and privacy. Dr. Yu was a recipient of the ACM SIGKDD 2016 Innovation Award for his influential research and scientific contributions on mining, fusion, and anonymization of big data, and the IEEE Computer Society 2013 Technical Achievement Award. He was the Editor-in-Chief of the ACM Transactions on Knowledge Discovery from Data (2011–2017) and the IEEE Transactions on Knowledge and Data Engineering (2001–2004). He is a Fellow of the ACM.