
Characterizing Bias in Classifiers using Generative Models

Daniel McDuff, Yale Song and Ashish Kapoor
Microsoft

Redmond, WA, USA
{damcduff,yalesong,akapoor}@microsoft.com

Shuang Ma
SUNY Buffalo

Buffalo, NY
[email protected]

Abstract

Models that are learned from real-world data are often biased because the data used to train them is biased. This can propagate systemic human biases that exist and ultimately lead to inequitable treatment of people, especially minorities. To characterize bias in learned classifiers, existing approaches rely on human oracles labeling real-world examples to identify the “blind spots” of the classifiers; these are ultimately limited due to the human labor required and the finite nature of existing image examples. We propose a simulation-based approach for interrogating classifiers using generative adversarial models in a systematic manner. We incorporate a progressive conditional generative model for synthesizing photo-realistic facial images and Bayesian Optimization for an efficient interrogation of independent facial image classification systems. We show how this approach can be used to efficiently characterize racial and gender biases in commercial systems.

1 Introduction

Models that are learned from found data (e.g., data scraped from the Internet) are often biased because these data sources are biased (Torralba & Efros, 2011). This can propagate systemic inequalities that exist in the real world (Caliskan et al., 2017) and ultimately lead to unfair treatment of people. A model may perform poorly on populations that are minorities within the training set and present higher risks to them. For example, there is evidence of lower precision in pedestrian detection systems for people with darker skin tones (higher on the Fitzpatrick (1988) scale). This exposes one group to greater risk from self-driving/autonomous vehicles than another (Wilson et al., 2019). Other studies have revealed systematic biases in facial classification systems (Buolamwini, 2017; Buolamwini & Gebru, 2018), with the error rate of gender classification up to seven times larger on women than men and poorer on people with darker skin tones. Another study found that face recognition systems misidentify people with darker skin tones, women, and younger people at higher error rates (Klare et al., 2012). To exacerbate the negative effects of inequitable performance, there is evidence that African Americans are subjected to higher rates of facial recognition searches (Garvie, 2016). Commercial facial classification APIs are already deployed in consumer-facing systems and are being used by law enforcement. The combination of greater exposure to algorithms and a reduced precision in the results for certain demographic groups deserves urgent attention.

Many learned models exhibit bias as training datasets are limited in size and diversity. The “preferential selection of units for data analysis” leads to sample selection bias (Bareinboim & Pearl, 2016) and is the primary form of bias we consider here. Preferential selection need not necessarily be intentional. Let us take several benchmark datasets as exemplars. Almost 50% of the people featured in the widely used MS-CELEB-1M dataset (Guo et al., 2016) are from North America (USA and Canada) and Western Europe (UK and Germany), and over 75% are men.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.


Figure 1: We propose a composite loss function, modeled as a Gaussian Process. It takes as input conditioning parameters for a validated progressive conditional generative model, and produces an image f(θ). This image is then used to interrogate an image classification system to compute a classification loss Lc = Loss(f(θ)), where the loss is a binary classification loss (0 or 1), capturing whether the classifier performed accurately or poorly. This loss allows exploitation of failures and is combined with a diversity loss to promote exploration of the parameter space.

The demographic make up of these countries is predominantly Caucasian/white.¹ Another dataset of faces, IMDB-WIKI (Rothe et al., 2015), features 59.3% men and Americans are hugely over-represented (34.5%). Another systematic profile (Buolamwini, 2017) found that the IARPA Janus Benchmark A (IJB-A) (Klare et al., 2015) contained only 7.80% of faces with skin types V or VI (on the Fitzpatrick skin type scale) and again featured over 75% males. Sampling from the datasets listed here indiscriminately leads to a large proportion of images of males with lighter skin tones, and training an image classifier on such data often results in a biased system. Creating balanced datasets is a non-trivial task. Sourcing naturalistic images of a large number of different people is challenging. Furthermore, no matter how large the dataset is, it may still be difficult to find images that are distributed evenly across different demographic groups. Attempts have been made to improve facial classification by including gender and racial diversity. In one example, by Ryu et al. (2017), results were improved by scraping images from the web and learning facial representations from a held-out dataset with a uniform distribution across race and gender intersections.

Improving the performance of machine-learned classifiers is virtuous but there are other approaches to addressing concerns around bias. The concept of fairness through awareness (Dwork et al., 2012) is the principle that in order to combat bias, we need to be aware of the biases and why they occur. In complex systems, such as deep neural networks, many of the “unknowns” are unknown and need to be identified (Lakkaraju et al., 2016, 2017). Identifying and characterizing “unknowns” in a model requires a combination of exploration, to identify regions of the model that contain failure modes, and exploitation, to sample frequently from these regions in order to characterize performance. Identifying failure modes is similar to finding adversarial examples for image classifiers (Athalye et al., 2017; Tramèr et al., 2017), a subject that is of increasing interest.

One way of characterizing bias that holds promise is data simulation. Parameterized computer graphics simulators are one way of testing vision models (Veeravasarapu et al., 2015a,b, 2016; Vazquez et al., 2014; Qiu & Yuille, 2016). Generally, it has been proposed that graphics models be used for performance evaluation (Haralick, 1992). Qiu and Yuille (2016) propose applications of simulations in computer vision, including testing deep network algorithms. Recently, McDuff et al. (2018) illustrated how highly realistic simulations could be used to interrogate the performance of face detection systems. Kortylewski et al. (2018; 2019) show that the damage of real-world dataset biases on facial recognition systems can be partially addressed by pre-training on synthetic data. However, creating high fidelity 3D assets for simulating many different facial appearances (e.g., bone structures, facial attributes, skin tones, etc.) is time consuming and expensive.

Generative adversarial networks (GANs) (Goodfellow et al., 2014a) are becoming increasingly popular for synthesizing data (Shrivastava et al., 2017) and present an alternative, or complement, to graphical simulations. Generative models are trained to match the target distribution. Thus, once trained, a generative model can flexibly generate a large amount of diverse samples without the need for pre-built 3D assets. They can also be used to generate images with a set of desired properties by conditioning the model during the training stage, thus enabling generation of new samples in a controllable way at testing time.

¹ http://data.un.org/


Thus, GANs have been used for synthesizing images of faces at different ages (Yang et al., 2017a; Choi et al., 2018) or genders (Dong et al., 2017; Choi et al., 2018). However, unlike parameterized models, statistical models (such as GANs) are fallible and might also have errors themselves. For example, the model may produce an image of a man even when conditioned on a woman. To use such a model for characterizing the performance of an independent classifier it is important to first characterize the error in the image generation itself. Two motivations for using GANs are: 1) we can sample a larger number of faces than existed in the original photo dataset used to train the GAN; 2) synthesizing face images has an advantage of preserving the privacy of individuals, thus when interrogating a commercial classifier we do not need images of “real people”.

In this paper, we propose to use a characterized state-of-the-art progressive conditional generative model in order to test existing computer vision classification systems for bias, as shown in Figure 1. In particular, we train a progressive conditional generative network that allows us to create high-fidelity images of new faces with different appearances by exploiting the underlying manifold. We train the model on diverse image examples by sampling in a balanced manner from men and women from different countries. Then we characterize this generator using oracles (human judges) to identify any errors in the synthesis model. Using the conditioned synthesis model and a Bayesian search scheme we efficiently exploit and explore the parameterized space of faces, in order to find the failure cases of a set of existing commercial facial classification systems and identify biases. One advantage of this scheme is that we only need to train and characterize the performance of the generator once and can then evaluate many classifiers efficiently and systematically, with potentially many more variations of facial images than were used to train the generator.

The contributions of this paper are: (1) to present an approach for conditionally generating synthetic face images based on a curated dataset of people from different nations, (2) to show how synthetic data can be used to efficiently identify limits in existing facial classification systems, and (3) to propose a Bayesian Optimization based sampling procedure to identify these limits more efficiently. We release the nationality data, model and code to accompany the image data used in this paper (see the supplementary material).

2 Related Work

Algorithmic Bias. There is wide concern about the equitable nature of machine learned systems. Algorithmic bias can exist for several reasons and the discovery or characterization of biases is non-trivial (Hajian et al., 2016). Even if biases are not introduced maliciously they can result from explicit variables contained within a model or via variables that correlate with sensitive attributes (indirect discrimination). Ideally, we would minimize algorithmic bias or discrimination as much as possible, or prevent it entirely. However, this is challenging: First, algorithms can be released by third parties who may not be acting in the public’s best interest and may not take the time or effort required to maximize the fairness of their models. Second, removing biases is technically challenging. For example, balancing a dataset and removing correlates with sensitive variables is very difficult, especially when learning algorithms are data hungry and the largest, accessible data sources (e.g., the Internet) are biased (Baeza-Yates, 2016).

Tools are needed to help practitioners evaluate models, especially black-box approaches. Making algorithms more transparent and increasing accountability is another approach to increasing fairness (Dwork et al., 2012; Lepri et al., 2018). A significant study (Buolamwini & Gebru, 2018) highlighted that facial classification systems were not as accurate on faces with darker skin tones and on females. This paper led to improvements in the models behind these APIs being made (Raji & Buolamwini, 2019). This illustrates how characterization of model biases can be used to advance the quality of machine learned systems.

Biases often result from unknowns within a system. Methods have been proposed to help address the discovery of unknowns in predictive models (Lakkaraju et al., 2016, 2017). In their work the search space is partitioned into groups which can be given interpretable descriptions. Then an explore-exploit strategy is used to navigate through these groups systematically based on the feedback from an oracle (e.g., a human labeler). Bansal and Weld proposed a new class of utility models that rewarded how well the discovered unknown unknowns help explain a sample distribution of expected queries (Bansal & Weld, 2018). Using human oracles is labor intensive and not scalable.


We employ an explore-exploit strategy in our work, but rather than rely on a human oracle we use an image synthesis model. Using a conditioned network we provide a systematic way of interrogating a black box model by generating variations on the target images and repeatedly testing a classifier’s performance. We incentivize the search algorithm to explore the parameter space of faces but also reward it for identifying failures and interrogating these regions of the space more frequently.

Generative Adversarial Networks. Deep generative adversarial networks have enabled considerable improvements in image generation (Goodfellow et al., 2014b; Zhang et al., 2017; Xu et al., 2018). Conditional GANs (Mirza & Osindero, 2014) allow for the addition of conditional variables, such that generation can be performed in a “controllable” way. The conditioning variables can take different forms (e.g., specific attributes or a raw image (Choi et al., 2018; Yan et al., 2016)). For facial images, this has been applied to control the gender (Dong et al., 2017), age (Yang et al., 2017a; Choi et al., 2018), hair color, skin tone and facial expressions (Choi et al., 2018) of generated faces. This allows for a level of systematic simulation via manifolds in the space of faces. Increasing the resolution of images synthesized using GANs is the focus of considerable research. Higher quality output images have been achieved by decomposing the generation process into different stages. The LR-GAN (Yang et al., 2017b) decomposes the process by generating image foregrounds and backgrounds separately. StackGAN (Zhang et al., 2017) decomposes generation into several stages, each with greater resolution. PG-GAN (Karras et al., 2018) has shown impressive performance using a progressive training procedure starting from very low resolution images (4×4) and ending with high resolution images (1024×1024). It can produce high fidelity pictures that are often tricky to distinguish from real photos.

In this paper, we employ a progressive conditional generative adversarial model for creating photo-realistic image examples with controllable “gender” and “race”. These images are then used to interrogate independent image classification systems. We model the problem as a Gaussian process (Rasmussen, 2004), sampling images from the model iteratively based on the performance of the classifiers, to efficiently discover blind spots in the models. Empirically, we find that these examples can be used to identify biases in the image classification systems.

3 Approach

We propose to use a generative model to synthesize face images and then apply Bayesian Optimization to efficiently generate images that have the highest likelihood of breaking a target classifier.

Image Generation. To generate photo-realistic face images in a controllable way, we propose to adopt a progressively growing conditional GAN (Karras et al., 2018; Mirza & Osindero, 2014) architecture. This model is trained so as to condition the generator G and discriminator D on additional labels. The given condition θ could be any kind of auxiliary information; here we use θ to specify both the race r and gender g of the subject in the image, i.e., θ = [r; g]. At testing time, the trained G should produce face images with the race and gender as specified by θ.

We curated a dataset {x; θ} (described below), where x is a face image and θ indicates the race r and gender g labels of x. To train the conditional generator, the input of the generator is a combination of a condition θ and a prior noise input pz(z); z is a 100-D vector sampled from a unit normal distribution and θ is a one-hot vector that represents a unique combination of (race, gender) conditions. We concatenate z and θ as the input to our model. G’s objective is defined by:

$$L_G = -\,\mathbb{E}_{z,\theta}\big[\log D(G(z,\theta))\big] \tag{1}$$
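As a concrete illustration, the following is a minimal PyTorch-style sketch of how the condition θ can be encoded as a one-hot vector, concatenated with z, and used to compute the generator objective of Eq. 1. The modules `G` and `D` are placeholders under our notational assumptions, not the released implementation; D is assumed to output a real/fake logit.

```python
import torch
import torch.nn.functional as F

NOISE_DIM = 100    # z is a 100-D vector sampled from a unit normal distribution
NUM_CLASSES = 8    # one-hot theta over the 4 x 2 (race, gender) combinations

def generator_step(G, D, batch_size):
    """Eq. 1: L_G = -E[log D(G(z, theta))], with the one-hot condition theta
    concatenated to the noise vector z as the generator input."""
    z = torch.randn(batch_size, NOISE_DIM)
    cls = torch.randint(NUM_CLASSES, (batch_size,))
    theta = F.one_hot(cls, NUM_CLASSES).float()          # theta = [r; g] as a one-hot code
    fake = G(torch.cat([z, theta], dim=1))               # conditional input [z; theta]
    # -log sigmoid(D(fake)) == binary cross-entropy against the "real" label.
    return F.binary_cross_entropy_with_logits(D(fake), torch.ones(batch_size, 1))
```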

The design of the discriminator D is inspired by Thekumparampil et al.’s (2018) Robust Conditional GAN model, which proved successful at delivering robust results. We train D on two objectives: to discriminate whether the synthesized image is real or fake, and to classify the synthesized image into the correct class (e.g., race and gender). The training objective for D is defined by:

$$L_D = -\,\mathbb{E}\big[\log D(x)\big] - \mathbb{E}_{z,\theta}\big[\log D(G(z,\theta))\big] - \mathbb{E}_{z,\theta}\big[\log C(G(z,\theta))\big] \tag{2}$$

where C is an N-way classifier. This classifier (C) is independent of the classifier being interrogated for biases. Our full learning objective is:

$$L_{adv} = \min_G \max_D \; L_G + L_D \tag{3}$$
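For concreteness, here is a minimal sketch of one discriminator update with the two objectives described above: a real/fake term and an auxiliary N-way (race, gender) classification term on the synthesized image. It uses the standard binary cross-entropy form of the real/fake terms; `G`, `D` (real/fake head), and `C` (classification head) are placeholder modules, not the released training code.

```python
import torch
import torch.nn.functional as F

def discriminator_step(G, D, C, x_real, noise_dim=100, num_classes=8):
    """One discriminator update: distinguish real from synthesized images and
    classify synthesized images into their conditioning (race, gender) class."""
    batch_size = x_real.size(0)
    z = torch.randn(batch_size, noise_dim)
    cls = torch.randint(num_classes, (batch_size,))
    theta = F.one_hot(cls, num_classes).float()
    fake = G(torch.cat([z, theta], dim=1)).detach()      # no gradients into G here

    real_loss = F.binary_cross_entropy_with_logits(D(x_real), torch.ones(batch_size, 1))
    fake_loss = F.binary_cross_entropy_with_logits(D(fake), torch.zeros(batch_size, 1))
    cls_loss = F.cross_entropy(C(fake), cls)             # auxiliary N-way classifier
    return real_loss + fake_loss + cls_loss
```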


Figure 2: The training pipeline of our generative network. The input for the generator G is a joint hidden representation of combined latent vectors z and labels θ, where θ specifies the race and gender of the face. The discriminator D is used to discriminate the generated samples from real samples (indicated by 1/0). At the same time, D forces the generated samples to be classified into the appropriate corresponding classes (indicated by θ). The network is trained progressively, i.e., from a resolution of 4×4 pixels to 16×16 pixels, and eventually to 128×128 pixels (increasing by a factor of two).

We train the generator progressively (Karras et al., 2018) by increasing the image resolution by a power of two at each step, from 4×4 pixels to 128×128 pixels (see Figure 2). The real samples are downsampled to the corresponding resolution in each stage. The training code is included in the supplementary material.
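To make the schedule concrete, the sketch below shows how real batches can be downsampled to match the resolution of the current stage. It is illustrative only; `train_stage`, `G`, `D`, and `C` are placeholders rather than our released training code, and the data loader is assumed to yield (image, label) pairs.

```python
import torch.nn.functional as F

def progressive_training(G, D, C, data_loader, train_stage):
    """Grow resolution by a factor of two per stage (4x4 -> ... -> 128x128),
    downsampling each real batch to the current stage's resolution.
    `train_stage` stands in for one adversarial update at that resolution."""
    for resolution in [4, 8, 16, 32, 64, 128]:
        for x_real, theta_real in data_loader:
            x_small = F.interpolate(x_real, size=(resolution, resolution),
                                    mode="bilinear", align_corners=False)
            train_stage(G, D, C, x_small, theta_real, resolution)
```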

Bayesian Optimization. Now that we have a systematically controllable image generation model, we propose to combine this with Bayesian Optimization (Brochu et al., 2010) to explore and exploit the space of parameters θ to find errors in the target classifier. We have θ as parameters that spawn an instance of a simulation f(θ) (e.g., a synthesized face image). This instance is fed into a target image classifier to check whether the system correctly identifies f(θ). Consequently, we can define a composite function Lc = Loss(f(θ)), where Loss is the classification loss and reflects if the target classifier correctly handles the simulation instance generated when applying the parameters θ. As an example, for gender classification the Loss is 0 if the classification of gender was correct or 1 if it was incorrect. Carrying out Bayesian Optimization with Lc allows us to find θ that maximizes the loss, thus discovering parameters that are likely to break the classifier we are interrogating (i.e., exploitation). However, we are not interested in just one instance but sets of diverse examples that would lead to misclassification. Consequently, we carry out a sequence of Bayesian Optimization tasks, where each subsequent run considers an adaptive objective function that is conditioned on examples that were discovered in the previous round. Formally, in each round of Bayesian Optimization we maximize:

$$L = (1-\alpha)\,L_c + \alpha \min_i \lVert \Theta_i - \theta \rVert \tag{4}$$

The above composite function is a convex combination of the misclassification cost with a term that encourages discovering new solutions θ that are diverse from the set of previously found examples Θ_i. Specifically, the second term is the minimum Euclidean distance of θ from the existing set, and a high value of that term indicates that the example being considered is diverse from the rest of the set. Intuitively, this term encourages exploration and prioritizes sampling a diverse set of images. In our preliminary tests, we varied the size of the set of previously found examples. We found the results were not sensitive to the size of this set and fixed it to 50 in our main experiments. Figure 1 graphically describes such a composition. The sequence of Bayesian Optimizations finds a diverse set of examples by first modeling the composite function L as a Gaussian Process (GP) (Rasmussen, 2004). Modeling as a GP allows us to quantify uncertainty around the predictions, which in turn is used to efficiently explore the parameter space in order to identify the spots that satisfy the search criterion. In this work, we follow the recommendations in (Snoek et al., 2012), and model the composite function via a GP with a Radial Basis Function (RBF) kernel, and use Expected Improvement (EI) as an acquisition function. A formal definition of EI is provided by Frazier (2018). The code is included in the supplementary material.
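The sketch below illustrates one way such a search loop could be put together, using scikit-learn's GaussianProcessRegressor with an RBF kernel and an expected-improvement acquisition over randomly drawn candidate points. The functions `render` and `query_classifier` are hypothetical stand-ins for the generator and the target API, θ is treated as a point in a unit hypercube for illustration, and the sequence of optimization rounds described above is folded into a single loop for brevity.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def composite_loss(theta, L_c, found, alpha=0.6):
    # Eq. 4: trade off exploiting failures (L_c) against diversity from the
    # set of previously found failure examples `found`.
    diversity = min(np.linalg.norm(t - theta) for t in found) if found else 0.0
    return (1 - alpha) * L_c + alpha * diversity

def expected_improvement(gp, candidates, best):
    # EI for maximization, using the GP's predictive mean and std.
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_search(render, query_classifier, theta_dim, n_iter=400, alpha=0.6):
    thetas, losses, found = [], [], []
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
    for i in range(n_iter):
        if i < 5:                                    # a few random warm-up points
            theta = np.random.rand(theta_dim)
        else:
            gp.fit(np.array(thetas), np.array(losses))
            candidates = np.random.rand(256, theta_dim)
            ei = expected_improvement(gp, candidates, max(losses))
            theta = candidates[np.argmax(ei)]
        L_c = query_classifier(render(theta))        # 0 if correct, 1 if misclassified
        losses.append(composite_loss(theta, L_c, found, alpha))
        thetas.append(theta)
        if L_c == 1:
            found.append(theta)                      # exploit: remember failure cases
    return found
```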


Region    Country      People (M / W)   Frames (M / W)

Black     Nigeria      81 / 28          768 / 467
          Kenya        11 / 5           91 / 49
          S. Africa    136 / 102        1641 / 1984
          Total        228 / 135        2500 / 2500

S Asian   India        142 / 83         2108 / 2267
          Sri Lanka    1 / 2            11 / 7
          Pakistan     19 / 11          381 / 226
          Total        162 / 96         2500 / 2500

White     Australia    175 / 121        2500 / 2500
          Total        175 / 121        2500 / 2500

NE Asian  Japan        105 / 89         930 / 1421
          China        105 / 46         789 / 447
          S. Korea     29 / 12          464 / 251
          Hong Kong    36 / 28          317 / 381
          Total        275 / 175        2500 / 2500

Table 1: The number of people and images we sampled from (by country and gender) to train our generation model. Examples of generated faces for each race and gender. M = Men, W = Women.

Data. We use the MS-CELEB-1M (Guo et al., 2016) dataset for our experimentation. This is a large image dataset with a training set containing 100K different people and approximately 10 million images. To identify the nationalities of the people in the dataset we used the Google Search API and pulled biographic text associated with each person featured in the dataset. We then used the NLTK library to extract nationality and gender information from the biographies. Many nations have heterogeneous national and/or ethnic compositions, and assuming that sampling from them at random would give consistent appearances is not well founded. Characterizing these differences is difficult, but necessary if we are to understand biases in vision classifiers. The United Nations (UN) notes that the ethnic and/or national groups of the population are dependent upon individual national circumstances and terms such as “race” and “origin” have many connotations. There are no internationally accepted criteria. Therefore, care must be taken in how we use these labels to generate images of different appearances.

To help address this we used demographic data provided by the UN that gives the national and/or ethnic statistics for each country¹ and then only sampled from countries with more homogeneous demographics. We selected four regions that have predominant and similar racial appearance groups. These groups are Black (darker skin tones, Sub-Saharan African appearance), South Asian (darker skin tone, Caucasian appearance), Northeast Asian (moderate skin tone, East Asian appearance) and White (light skin tone, Caucasian appearance), and we sampled from a set of countries to obtain images for each. We sampled 5,000 images (2,500 men and 2,500 women) from each region, prioritizing higher resolution images (256×256) and then lower resolution images (128×128). The original raw images selected for training and the corresponding race and gender labels are included in the supplementary material. The nationality and gender labels for the complete MS-CELEB-1M training set will also be released. Table 1 shows the nations from which we sampled images and the corresponding appearance group. The number of people and images that were used in the final data are shown. It was not necessary to use all the images from every country to create a model for generating faces, and to obtain evenly distributed data over both gender and region we used this subset. Examples of the images produced by our trained model are also shown in the table. Higher resolution images can be found in the supplementary material.
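As a rough illustration of the kind of biography parsing described above (not the exact pipeline we used), the snippet below collects GPE (geo-political entity) chunks from NLTK's named-entity tagger as candidate nationality cues; demonyms such as “Kenyan” would still need an additional lookup table.

```python
import nltk

def candidate_nationalities(biography_text):
    """Collect GPE (geo-political entity) chunks from a biography as rough
    nationality cues. Requires the 'punkt', 'averaged_perceptron_tagger',
    'maxent_ne_chunker' and 'words' NLTK resources."""
    tokens = nltk.word_tokenize(biography_text)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "GPE"]

# Example (hypothetical): candidate_nationalities(
#     "Jane Doe is a Kenyan actress born in Nairobi.") might return ["Nairobi"].
```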

4 Experiments and Results

Validation of Image Generation. Statistical generative models such as GANs are not perfect and may not always generate images that reflect the conditioned variables. Therefore, it is important to validate the performance of the GAN that we used at producing images that represent the specified conditions (race and gender) reliably.


API   Task            All    Black   S Asian   NE Asian   White   Men    Women

IBM   Face Det.       8.05   16.9    7.63      3.96       3.8     11.3   2.27
      Gender Class.   8.26   9.00    2.13      20.0       1.87    15.8   0.27

SE    Face Det.       0.13   0.00    0.00      0.53       0.00    0.21   0.00
      Gender Class.   2.84   3.39    0.74      5.85       1.38    5.14   0.00

Table 2: Face detection and gender classification error rates (in percentage). SE = SightEngine.

[Figure 3 panels: mean faces for Men / Women / Both, under Face Detection and Gender Classification, split into Misclassified and Correctly Classified rows.]

Figure 3: Mean faces for correct classifications and incorrect classifications. Notice how the skin tone for face detection failure cases is darker than for success cases. Women were very infrequently classified as men, thus the average face is not very clear.

We generated a uniform sample of 50 images, at 128×128 resolution, from each race and gender (total 50 × 4 × 2 = 400 images) and recruited five participants on MTurk to label the gender of the face in each image and the quality of the image (see Table 1 for example images). The quality of the image was labeled on a scale of 0 (no face is identifiable) to 5 (the face is indistinguishable from a photograph). Of 400 images, the gender of only seven (1.75%) images was classified differently by a majority of the labelers than the intended condition dictated. The mean quality rating of the images was 3.39 (SD=0.831) out of 5. There was no significant difference in quality between races or genders (Black Men: 3.31, Black Women: 3.24, White Men: 3.20, White Women: 3.51, S. Asian Men: 3.37, S. Asian Women: 3.40, NE Asian Men: 3.48, NE Asian Women: 3.58). In none of the images was a face considered unidentifiable.

Additionally, we computed FID scores for each region and gender: we synthesized 500 images from each region and gender (4K in total) using our conditional PG-GAN and StyleGAN (Karras et al., 2018). StyleGAN is the current state of the art in face image synthesis and serves as a good reference point. The FID scores were similar across all regions: Ours (Black: M 8.10, F 8.14; White: M 8.08, F 7.70; NE Asian: M 8.01, F 8.00; S. Asian: M 8.06, F 8.10); StyleGAN (Black: M 7.70, F 7.92; White: M 7.68, F 7.8; NE Asian: M 7.76, F 7.80; S. Asian: M 7.94, F 7.66). Our model produces FID scores comparable with the state-of-the-art results. Note that StyleGAN synthesizes new images by conditioning on a real image, while our model is conditioned only on labels. The results confirm that our dataset can be used to synthesize images across each gender and region with sufficient quality and diversity.
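For reference, the FID between two sets of Inception activations with means μ1, μ2 and covariances Σ1, Σ2 is ||μ1 − μ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^{1/2}). A minimal NumPy sketch follows; extraction of the activations with an Inception network is assumed to have been done already.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real, feats_fake):
    """FID between two sets of Inception activations, each of shape (n, d)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```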

Classifier Interrogation. Numerous companies offer services for face detection and gender classification from images (Microsoft, IBM, Amazon, SightEngine, Kairos, etc.). We selected two of these commercial APIs (IBM and SightEngine) to interrogate in our experiments. These are exemplars and the specific APIs used here are not the focus of our paper. Each API accepts HTTP POST requests with URLs of the images or binary image data as a parameter within the request. If a face is detected they return JSON formatted data structures with the locations of the detected faces and a prediction of the gender of the face. Details of the APIs can be found in the supplementary material.
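The general shape of such a request is sketched below with a placeholder endpoint and key; the real providers differ in authentication headers and response schemas, which are detailed in the supplementary material.

```python
import requests

def query_face_api(image_bytes, api_url, api_key):
    """POST binary image data to a (hypothetical) face analysis endpoint and
    return the parsed JSON. `api_url` and the header layout are placeholders."""
    response = requests.post(
        api_url,                                    # e.g. "https://example.com/v1/faces" (placeholder)
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/octet-stream"},
        data=image_bytes,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()   # expected: face locations and a gender prediction per face
```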

We ran our sampling procedure for a fixed number of iterations (400) in each trial (i.e., we sampled 400 images at 128×128 resolution). In all cases the percentage of failure cases had stabilized. Table 2 shows the error rates (in %) for face detection and gender classification.


Figure 4: a) Sample efficiency of finding samples that were misclassified using random sampling and Bayesian Optimization with α=1 (number of face or gender classification errors vs. number of samples). Shaded regions reflect one standard deviation either side of the mean. b) Percentage of images that cause classifier failures (y-axis) as we vary the value of α (α = 0.2 to 1.0 and random sampling). Error bars reflect one standard deviation either side of the mean. c) Qualitative examples of failure cases: face detection failures and gender classification failures (men classified as women).

Figure 3 shows the mean face of images containing faces that were detected and not detected by the APIs. The skin tones illustrate that missed faces had darker skin tones and gender classification was considerably less accurate on people from NE Asia. We found men were more frequently misclassified as women than the other way around.

Next, we compare two approaches for searching our space of simulated faces for face detection and gender classification failure cases. For these analyses we used the IBM API as the target image classifier. First, we randomly sample parameters for generating face configurations, and second we use Bayesian Optimization. Again, we ran the sampling for 400 iterations in each case. In the case of Bayesian Optimization, the image generation was updated dependent on the success or failure of classification of the previous image. This allows us to use an explore-exploit strategy and navigate through the facial appearance space efficiently using the feedback from the automated “oracle”. Figure 4(a) shows the sample efficiency of finding face detection and gender classification failures. Figure 4(b) shows how the percentage of errors found varies with the value of α and for random sampling.

We repeated our analysis by directly sampling from the real images used to train the PG-GAN. Using BO and the GAN images allowed us to discover gender classification failures at a higher rate (16% higher) than using BO and the real images (both with α=0.6), partially because the real image set is ultimately limited and in specific failure regions we eventually exhaust the images and have to sample elsewhere.

Classifier Retraining. Finally, we wanted to test whether the synthesized images could be used to improve the “weak spots” in a classifier. To do this, we trained three ResNet-50 gender classification models. First, we sampled 14K images from our dataset (1K Black, 3K S Asian, 5K White and 5K NE Asian; this captures a typical model trained on unbalanced data). We then tested this classifier on an independent set of images and the gender classification accuracy was 81%. Following this, we retrained the model, adding 800 synthesized failure cases discovered using our method; the retrained classifier achieved 87% accuracy on the gender classification task. Compare this to a model trained on our balanced dataset, for which the accuracy was 92%. Adding the synthesized images brings performance up, close to that of a model trained on a balanced dataset.
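A schematic of this retraining step, under the assumption of a standard torchvision ResNet-50 and datasets that yield (image, label) pairs, is shown below; it is illustrative, not our exact training configuration.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader
from torchvision.models import resnet50

def retrain_with_failures(base_dataset, failure_dataset, epochs=5, lr=1e-4):
    """Append synthesized failure cases discovered by the search to the original
    (unbalanced) training data and fine-tune a two-class gender classifier."""
    model = resnet50(num_classes=2)
    loader = DataLoader(ConcatDataset([base_dataset, failure_dataset]),
                        batch_size=64, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```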

5 Discussion

Bias in machine learning classifiers is problematic and often these biases may not be introduced intentionally. Regardless, biases can still propagate systemic inequalities that exist in the real world. Yet, there are still few practical tools for helping researchers and developers mitigate bias and create well characterized classifiers. Adversarial training is a powerful tool for creating generative models that can produce highly realistic content. By using an adversarial training architecture we create a model that can be used to interrogate facial classification systems. We apply an optimal search algorithm that allows us to perform an efficient exploration of the space of faces to reveal biases from a smaller number of samples than via a brute force method.


We tested this approach on face detection and gender classification tasks and interrogated commercial APIs to demonstrate its application.

Can our conditional GAN produce sufficiently high quality images to interrogate a classifier? Our validation of our face GAN shows that the model is able to generate realistic face images that are reliably conditioned on race and gender. Human subjects showed very high agreement with the conditional labels for the generated images and the quality of the images was rated similarly across each race and gender. This suggests that our balanced dataset and training procedure produced a model that can generate images reliably conditioned on race and gender and of suitably equivalent quality. Examples of the generated images can be seen in Table 1 (high resolution images are available in the supplementary material). Very few of the images have very noticeable artifacts.

How do commercial APIs perform? Both of the commercial APIs we tested failed at significantly higher rates on images of people from the African and South Asian groups (see Table 2). For the IBM system the face detection error rate was more than four times as high on African faces as on White and NE Asian faces. The face detection error rates on Black and South Asian faces were the highest, suggesting that skin tone is a key variable here. Gender classification error rates were also high for African faces but, unlike face detection, the performance was worst for NE Asian faces. These results suggest that gender classification performance is not only impacted by skin tone but also by other characteristics of appearance: perhaps facial hair, or lack thereof, in NE Asian photographs, men with “bangs”, and make-up (see Figure 4c and the supplementary material for examples of images that resulted in errors).

Can we efficiently sample images to find errors? Errors are typically sparse (many APIs have a global error rate of less than 10%) and therefore simply randomly sampling images in order to identify biases is far from efficient. Using our optimization scheme we are able to identify an equivalent number of errors in significantly fewer samples (see Figure 4a). The results show that we are able to identify almost 50% more failure cases using the Bayesian sampling scheme than without. In some senses our approach can be thought of as a way of efficiently identifying adversarial examples.

Trading off exploitation and exploration? In our sampling procedure we have an explicit trade-off, using α, between exploration of the underlying manifold of face images and exploitation to find the highest number of errors (see Figure 4(b)). With little exploration there is a danger that the sampling will find a local minimum and continue sampling from a single region. Our results show that with α equal to 0.6 we maximize the number of errors found. This is empirical evidence that exploration and exploitation are both important; otherwise there is a risk that one might miss regions that have frequent failure cases. As the parameter space of θ grows in dimensionality our sampling procedure will become even more favorable compared to naive methods.

Can synthesized images be used to help address errors? Using our generative face image model we compared the performance of a gender classification model trained with unbalanced data and also trained with supplementary synthesized failure images discovered using our sampling method. The synthesized images helped successfully improve the accuracy of the resulting gender classifier, suggesting that the generated images can help address problems associated with data bias. In this work we only use GANs to interrogate bias in facial classification systems across two (albeit important) dimensions: gender and race. The number of parameters could be extended with other critical dimensions, e.g., age; this could accentuate the advantages of using generative models that we have highlighted.

6 Conclusions

We have presented an approach applying a conditional progressive generative model for creating photo-realistic synthetic images that can be used to interrogate facial classifiers. We test commercial image classification application programming interfaces and find evidence of systematic biases in their performance. A Bayesian search algorithm allows for efficient search and characterization of these biases. Biases in vision-based systems are of wide concern, especially as these systems become widely deployed by industry and governments. In this work, we focus specifically on sample selection bias but it should be noted that other types of bias exist (e.g., confounding bias). Generative models are a practical tool that can be used to characterize the performance of these systems. We hope that this work can help increase the prevalence of rigorous benchmarking of commercial classifiers in the future.


References

Athalye, Anish, Engstrom, Logan, Ilyas, Andrew, and Kwok, Kevin. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.

Baeza-Yates, Ricardo. Data and algorithmic bias in the web. In Proceedings of the 8th ACM Conference on Web Science, pp. 1–1. ACM, 2016.

Bansal, Gagan and Weld, Daniel S. A coverage-based utility model for identifying unknown unknowns. In Proc. of AAAI, 2018.

Bareinboim, Elias and Pearl, Judea. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.

Brochu, Eric, Cora, Vlad M, and De Freitas, Nando. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

Buolamwini, Joy and Gebru, Timnit. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pp. 77–91, 2018.

Buolamwini, Joy Adowaa. Gender shades: Intersectional phenotypic and demographic evaluation of face datasets and gender classifiers. PhD thesis, Massachusetts Institute of Technology, 2017.

Caliskan, Aylin, Bryson, Joanna J, and Narayanan, Arvind. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.

Choi, Yunjey, Choi, Minje, Kim, Munyoung, Ha, Jung-Woo, Kim, Sunghun, and Choo, Jaegul. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797, 2018.

Dong, Hao, Neekhara, Paarth, Wu, Chao, and Guo, Yike. Unsupervised image-to-image translation with generative adversarial networks. arXiv preprint arXiv:1701.02676, 2017.

Dwork, Cynthia, Hardt, Moritz, Pitassi, Toniann, Reingold, Omer, and Zemel, Richard. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226. ACM, 2012.

Fitzpatrick, Thomas B. The validity and practicality of sun-reactive skin types I through VI. Archives of Dermatology, 124(6):869–871, 1988.

Frazier, Peter I. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

Garvie, Clare. The perpetual line-up: Unregulated police face recognition in America. Georgetown Law, Center on Privacy & Technology, 2016.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014a. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014b.

Guo, Yandong, Zhang, Lei, Hu, Yuxiao, He, Xiaodong, and Gao, Jianfeng. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. 87–102. Springer, 2016.

Hajian, Sara, Bonchi, Francesco, and Castillo, Carlos. Algorithmic bias: From discrimination discovery to fairness-aware data mining. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2125–2126. ACM, 2016.


Haralick, Robert M. Performance characterization in computer vision. In BMVC92, pp. 1–8. Springer, 1992.

Karras, Tero, Aila, Timo, Laine, Samuli, and Lehtinen, Jaakko. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk99zCeAb.

Klare, Brendan F, Burge, Mark J, Klontz, Joshua C, Bruegge, Richard W Vorder, and Jain, Anil K. Face recognition performance: Role of demographic information. IEEE Transactions on Information Forensics and Security, 7(6):1789–1801, 2012.

Klare, Brendan F, Klein, Ben, Taborsky, Emma, Blanton, Austin, Cheney, Jordan, Allen, Kristen, Grother, Patrick, Mah, Alan, and Jain, Anil K. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1931–1939, 2015.

Kortylewski, Adam, Egger, Bernhard, Schneider, Andreas, Gerig, Thomas, Morel-Forster, Andreas, and Vetter, Thomas. Empirically analyzing the effect of dataset biases on deep face recognition systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2093–2102, 2018.

Kortylewski, Adam, Egger, Bernhard, Schneider, Andreas, Gerig, Thomas, Morel-Forster, Andreas, and Vetter, Thomas. Analyzing and reducing the damage of dataset bias to face recognition with synthetic data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0, 2019.

Lakkaraju, Himabindu, Kamar, Ece, Caruana, Rich, and Horvitz, Eric. Discovering blind spots of predictive models: Representations and policies for guided exploration. CoRR, 2016.

Lakkaraju, Himabindu, Kamar, Ece, Caruana, Rich, and Horvitz, Eric. Identifying unknown unknowns in the open world: Representations and policies for guided exploration. In AAAI, volume 1, pp. 2, 2017.

Lepri, Bruno, Oliver, Nuria, Letouzé, Emmanuel, Pentland, Alex, and Vinck, Patrick. Fair, transparent, and accountable algorithmic decision-making processes. Philosophy & Technology, 31(4):611–627, 2018.

McDuff, Daniel, Cheng, Roger, and Kapoor, Ashish. Identifying bias in AI using simulation. arXiv preprint arXiv:1810.00471, 2018.

Mirza, Mehdi and Osindero, Simon. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014. URL http://arxiv.org/abs/1411.1784.

Qiu, Weichao and Yuille, Alan. UnrealCV: Connecting computer vision to Unreal Engine. In European Conference on Computer Vision, pp. 909–916. Springer, 2016.

Raji, Inioluwa Deborah and Buolamwini, Joy. Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial AI products. In AAAI/ACM Conference on AI Ethics and Society, 2019.

Rasmussen, Carl Edward. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, pp. 63–71. Springer, 2004.

Rothe, Rasmus, Timofte, Radu, and Van Gool, Luc. DEX: Deep expectation of apparent age from a single image. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 10–15, 2015.

Ryu, Hee Jung, Mitchell, Margaret, and Adam, Hartwig. Improving smiling detection with race and gender diversity. arXiv preprint arXiv:1712.00193, 2017.

Shrivastava, Ashish, Pfister, Tomas, Tuzel, Oncel, Susskind, Joshua, Wang, Wenda, and Webb, Russell. Learning from simulated and unsupervised images through adversarial training. In CVPR, volume 2, pp. 5, 2017.


Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.

Thekumparampil, Kiran K, Khetan, Ashish, Lin, Zinan, and Oh, Sewoong. Robustness of conditional GANs to noisy labels. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 10271–10282. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8229-robustness-of-conditional-gans-to-noisy-labels.pdf.

Torralba, Antonio and Efros, Alexei A. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1521–1528. IEEE, 2011.

Tramèr, Florian, Papernot, Nicolas, Goodfellow, Ian, Boneh, Dan, and McDaniel, Patrick. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.

Vazquez, David, Lopez, Antonio M, Marin, Javier, Ponsa, Daniel, and Geronimo, David. Virtual and real world adaptation for pedestrian detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):797–809, 2014.

Veeravasarapu, VSR, Hota, Rudra Narayan, Rothkopf, Constantin, and Visvanathan, Ramesh. Model validation for vision systems via graphics simulation. arXiv preprint arXiv:1512.01401, 2015a.

Veeravasarapu, VSR, Hota, Rudra Narayan, Rothkopf, Constantin, and Visvanathan, Ramesh. Simulations for validation of vision systems. arXiv preprint arXiv:1512.01030, 2015b.

Veeravasarapu, VSR, Rothkopf, Constantin, and Ramesh, Visvanathan. Model-driven simulations for deep convolutional neural networks. arXiv preprint arXiv:1605.09582, 2016.

Wilson, Benjamin, Hoffman, Judy, and Morgenstern, Jamie. Predictive inequity in object detection. arXiv e-prints, arXiv:1902.11097, February 2019.

Xu, Tao, Zhang, Pengchuan, Huang, Qiuyuan, Zhang, Han, Gan, Zhe, Huang, Xiaolei, and He, Xiaodong. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324, 2018.

Yan, Xinchen, Yang, Jimei, Sohn, Kihyuk, and Lee, Honglak. Attribute2Image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776–791. Springer, 2016.

Yang, Hongyu, Huang, Di, Wang, Yunhong, and Jain, Anil K. Learning face age progression: A pyramid architecture of GANs. arXiv preprint arXiv:1711.10352, 2017a.

Yang, Jianwei, Kannan, Anitha, Batra, Dhruv, and Parikh, Devi. LR-GAN: Layered recursive generative adversarial networks for image generation. CoRR, abs/1703.01560, 2017b.

Zhang, Han, Xu, Tao, Li, Hongsheng, Zhang, Shaoting, Wang, Xiaogang, Huang, Xiaolei, and Metaxas, Dimitris N. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915, 2017.
