arXiv:1911.03842v2 [cs.CL] 16 Apr 2020

Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation

Emily Dinan∗, Angela Fan∗†, Adina Williams, Jack Urbanek, Douwe Kiela, Jason Weston

Facebook AI Research
†Laboratoire Lorrain d’Informatique et Applications (LORIA)

Abstract

Models often easily learn biases present in the training data, and their predictions directly reflect this bias. We analyze gender bias in dialogue data, and examine how this bias is actually amplified in subsequent generative chit-chat dialogue models. We measure gender bias in six existing dialogue datasets, and focus on the most biased one, the multi-player text-based fantasy adventure dataset LIGHT (Urbanek et al., 2019), as a testbed for our bias mitigation techniques. The LIGHT dataset is highly imbalanced with respect to gender, containing predominantly male characters, likely because it is entirely collected by crowdworkers and reflects common biases that exist in fantasy or medieval settings. We consider three techniques to mitigate gender bias: counterfactual data augmentation, targeted data collection, and bias controlled training. We show that our proposed techniques mitigate gender bias in LIGHT by balancing the genderedness of generated dialogue utterances and are particularly effective in combination. We quantify performance using various evaluation methods—such as quantity of gendered words, a dialogue safety classifier, and human studies—all of which show that our models generate less gendered, but equally engaging chit-chat responses.

1 Introduction

Machine learning algorithms learn to model patterns present in training datasets, so data quality affects what they learn. Model predictions have been shown to directly reflect harmful societal biases present in training datasets, such as racial bias in sports reports (Merullo et al., 2019) and political bias in news data (Fan et al., 2019). Moreover, biases have been discovered in many NLP tasks, for example, in learned word embeddings (Bolukbasi et al., 2016; Brunet et al., 2018; Zhao et al., 2019), visual semantic role labeling (Zhao et al., 2017), natural language inference (He et al., 2019), abusive language classification (Park et al., 2018), and coreference resolution (Zhao et al., 2018a). Although research into bias in NLP is maturing, bias in dialogue utterances has received somewhat less attention (Liu et al., 2019; Sheng et al., 2019; Henderson et al., 2018). However, with the rapid development of real-world use-cases for dialogue agents, such as interactive assistants, bias in dialogue models has the very real potential not only to replicate existing social biases, but also to exacerbate them. Dialogue debiasing is thus becoming an increasingly important problem in NLP. In this work, we aim to address this issue by measuring gender bias in dialogue data and mitigating its effects on downstream dialogue generation models.

∗Joint first authors.

Gendered word counts in dialogue datasets

Dataset                 % gend. words   % male bias
LIGHT                   0.94            73.4
Reddit                  1.32            69.76
Wizard of Wikipedia     0.076           65.9
Daily Dialog            1.02            59.04
Empathetic Dialogues    2.07            53.45
ConvAI2                 1.28            50.05

Table 1: Counts of gendered words in several dialogue datasets. We report the percent of gendered words (% gend. words) as well as the percentage of male-gendered words among all gendered words (% male bias). Datasets are arranged in descending order from most to least male biased. Among these, LIGHT has the most male bias, making it an ideal testbed.

Previous work has noted that gender bias is prevalent in many machine learning datasets (Stock and Cisse, 2017; Zhao et al., 2017), and here we analyzed the gender bias in several existing dialogue datasets (see Table 1, and §3 for more discussion). As a testbed for our investigation, we chose the dialogue dataset from the LIGHT text adventure world (Urbanek et al., 2019), because we find it to be significantly more male-biased than other comparable dialogue datasets. Not only is it large enough to train neural chit-chat dialogue models, LIGHT is also interesting because it has multiple potential sources of bias—namely, characters, personas, and dialogues. In the dialogue creation process, crowdworkers were presented with a character (with names such as “farmer” or “witch”), as well as an associated persona—a short textual description for the character. Supplied with characters and personas, crowdworkers were paired up and tasked with generating a dialogue between the characters. All dialogues contained within LIGHT are entirely crowdsourced—thus susceptible to reflecting the gender biases of crowdworkers (Otterbacher et al., 2018; Barbosa and Chen, 2019). We investigate characters, personas, and dialogues as possible sources of bias in turn in §3.

Persona Example (Original LIGHT Dataset)

daughter: I spend most of my time doing household chores. I want to find meaning in life. I am energetic and happy.

chief wife: I am the king’s chief wife. Of all the women that he has married, or who are his concubines, I am the principal one. I represent the kingdom of my father, who is the king’s biggest ally. My sons are the ones who will most likely become the king after the death of my husband.

women: I live with my husband and 4 children in the village. I spend my days washing clothing and cleaning our home. My husband works for the royal army defending out town.

farmer Bob’s wife: I am farmer Bob’s wife. I like to take care of all our animals. I help Farmer Bob everyday on the farm.

mother: I am a mother of eight children. I live with my family in a cottage in the countryside. I spend every day tending to the needs of all of my little ones which can be overwhelming, but I always manage to maintain a pleasing disposition and a happy smile.

wife: I am the wife of a farmer. While I may not be the most attractive woman ever, I am loyal and loving. My husband is a good man, but only seems to stay with me out of duty.

shady lady: I am a shady lady. I work in a tavern, and I am willing to trade sexual favors for money. I have to split the money with the tavernkeeper, so that he will offer me a room to work in. I am beginning to get sick from the “king’s evil”, which doctors call syphilis. My future is bleak: madness and death. But this is the only way that I can support myself, so I continue.

Table 2: Examples of gender biased personas in LIGHT. In a review that we conducted in this work, none of these characters were flagged as sexist or offensive. For male examples, see Appendix Table 11.

After measuring gender bias in LIGHT, we then explore three bias mitigation techniques, each of which is either wholly novel, or novel in its application to dialogue: (i) Counterfactual Data Augmentation (CDA) (Maudslay et al., 2019; Zmigrod et al., 2019), (ii) a targeted data collection method, which we refer to as Positive-Bias Data Collection, and (iii) Bias Controlled text generation. We show that these techniques are most effective in combination, resulting in dialogue models that produce engaging responses with measurably less gender bias and offensive content (see §5). Models and code will be released at parl.ai/projects/genderation_bias.

Dialogue Example (Original LIGHT Dataset)

wife: I was married off by my family about five years ago. I spend my days cooking and cleaning so my husband will have something to eat when he returns from his work and can enjoy a clean home. I love my husband dearly because he works very hard to provide for us.

merchant: What a great day for more money.
wife: Oh my. That is some thick dust!
merchant: Indeed, it is very old.
wife: This room is going to take a while to clean. You might want to come back later.
merchant: It is fine I can set my booth up here.
wife: With all the foot traffic?
merchant: Yes it should be ok.
wife: It doesn’t appear that anyone ever comes up here!
merchant: Well they will when they know I am here.
wife: I have my doubts but I’ll just go about my cleaning.
merchant: Yea sounds like a good idea.
wife: What is that supposed to mean?
merchant: I am saying we should both do our jobs.
wife: Don’t take that tone with me!

Table 3: A dialogue from the original LIGHT data. The text for the wife persona was crowdsourced.


2 Related Work

Recently, the NLP community has focused on exploring gender bias in NLP systems (Sun et al., 2019), uncovering many gender disparities and harmful biases in algorithms and text (Cao and Daume 2019; Chang et al. 2019; Chang and McKeown 2019; Costa-jussa 2019; Du et al. 2019; Emami et al. 2019; Garimella et al. 2019; Gaut et al. 2019; Habash et al. 2019; Hashempour 2019; Hoyle et al. 2019; Kang et al. 2019; Lee et al. 2019a; Lepp 2019; Qian 2019; Qian et al. 2019; Sharifirad et al. 2019; Sharifirad and Matwin 2019; Stanovsky et al. 2019; O’Neil 2016). Particular attention has been paid to uncovering, analyzing, and removing gender biases in word embeddings (Basta et al., 2019; Kaneko and Bollegala, 2019; Zhao et al., 2019, 2018b; Bolukbasi et al., 2016). This word embedding work has extended to multilingual work on gender-marking (Gonen et al., 2019; Williams et al., 2019; Zhou et al., 2019). Despite these efforts, many methods for debiasing embeddings remain problematic—i.e., they have only succeeded in hiding word embedding biases as opposed to removing them (Gonen and Goldberg, 2019)—making gender debiasing still an open area of research.

Despite the relatively ample literature on gender debiasing for word-level representations, very little work has focused on sentence representations (Liang et al., 2019; Liu et al., 2019; Sheng et al., 2019; Lee et al., 2019b). The majority of sentence debiasing work up until this point foregrounds measuring bias (Lee et al., 2019b; Sheng et al., 2019). For example, Liu et al. present a test dataset for dialogue created counterfactually by combining templates and hand-created lists of word pairs; this work shows that models produce less diverse dialogues when prompted with sentences containing words describing individuals from underrepresented groups. Acknowledging the obvious importance of measuring gender bias (see, e.g., Liu et al. 2019), our dialogue work is novel in that we also propose and compare three methods for directly mitigating it.1

1 To the best of our knowledge, only one other work attempts to gender-debias sentence representations (Li et al., 2018); however, it extends a word-embedding post-processing method (Bolukbasi et al., 2016) shown to be ineffective at removing gender bias (Gonen and Goldberg, 2019) to sentences. Thus, we take a different tack.

                  # Characters                   # References
                  F      M      N      All       F      M

LIGHT
  Orig Data       159    258    1460   1877      439    1238
  Swap Persona    336    230    694    1260      1419   1030
  New Charac.     151    120    1448   1719      357    275
  Total           646    608    3602   4856      2215   2543

ConvAI2
  Orig Data       1109   1048   4214   6371      1283   1148

Table 4: Analysis of gender in LIGHT and ConvAI2: the original LIGHT dataset contains 1.6× as many male-gendered as female-gendered characters. We compare the original dataset with the dataset obtained after gender-swapping personas and collecting new characters (with new personas). The references column indicates the gender of characters mentioned in the personas. By contrast, ConvAI2 contains a roughly equal number of male and female gendered personas.

3 Measuring Bias

Before one can mitigate bias, one must first measure it. To determine which dataset to focus on, we initially measured both the amount of gendered words used, and the percent of those which referred to male characters, for six existing dialogue datasets (Table 1). Throughout, we compare LIGHT to the other dialogue datasets, and find that it is considerably more biased, which leads us to give LIGHT particular attention in this paper. We primarily address three sources of gender bias in dialogue: (i) imbalance in character genders, (ii) personas (Table 2), and (iii) dialogues between characters (Table 3).

Dialogue research has found that incorporating personas, or personality descriptions that ground a speaker’s chat, like I love fishing, increases engagingness and improves consistency (Zhang et al., 2018; Shuster et al., 2018; Mazare et al., 2018; Olabiyi et al., 2018; Li et al., 2016). However, they can also crystallize gender bias (Clark et al., 2019; Henderson et al., 2018), propagating it to subsequently generated conversations. We answer three questions in the context of persona-based dialogue: when creating dialogue datasets, (i) do crowdworkers generate an equal number of male and female characters, (ii) do these characters’ personas feature sexism or gender biases, and (iii) are the resulting dialogue utterances biased?

Bias in Number of Characters. We first answer the question: do crowdworkers create an equal number of male and female characters? In addition to LIGHT, we also consider the persona-based dialogue dataset ConvAI2 (Zhang et al., 2018).

To examine gender balance in characters, we asked annotators to label the gender of each character in both the LIGHT and ConvAI2 datasets based on the persona (choosing neutral if the gender was not explicit). This annotation is possible because many personas include text such as I am a young woman, although the majority of personas do not mention an explicit gender.

We find LIGHT characters to be highly gender imbalanced: in Table 4, we can see that there are over 1.6 times as many male characters as female ones.2 It is considerably less gender-balanced than ConvAI2, which has a nearly equal number of male and female gendered personas.3

Bias in Personas. In addition to the stark underrepresentation of female characters, the medieval setting in LIGHT is likely to encourage crowdworkers to generate dialogues accentuating historical biases and inequalities of the time period (Bowman, 2010; Garcia, 2017). There is no obligation to recreate historical biases: one can instead use creative license to craft a fun world with gender parity. Therefore, we investigate references to men or women in the text of personas, as another source of bias. To motivate this, take, for example, a female persona that contains a gendered reference such as I want to follow in my father’s footsteps rather than in my mother’s. Using gendered relational nouns (Barker, 1992; Williams, 2018), such as father, doesn’t always signal gender bias, but if female characters are predominantly defined in reference to male characters, it becomes a problem. We count the appearance of gendered words in personas using the list compiled by Zhao et al. (2018b) and find that men are disproportionately referred to in the personas: there are nearly 3x as many mentions of men as women (see Table 2 for examples, and Table 4 for counts).

Qualitatively, LIGHT personas contain many examples that strike us as gender biased (see Table 2). For example, the character description for girl contains the line I regularly clean and cook dinner. Gender bias and sexism are clearly present in many dialogue datasets (Henderson et al., 2018), but finding a clear way to define sexism (and other kinds of unsafe text), let alone measure it at scale, is very challenging. A simple answer is to rely on annotation where annotators operate under their own, albeit subjective, definition(s) of sexism. To assess the pervasiveness of unsafe content in existing personas, we asked three independent annotators to examine each persona for potentially offensive content. If annotators detected content was ‘offensive’ or ‘maybe offensive’, they were asked to place it in one of four categories—racist, sexist, classist, other—and to provide a reason for their response. Just over 2% of personas were flagged by at least one annotator, and these personas and the dialogues between these personas were removed from the dataset.

2 We use “female” and “male” for LIGHT characters – rather than “woman” and “man” – because some are binarily gendered, but not human.

3 Note that annotators may widen the gender gap by implicitly assuming genders for ungendered personas.

Bias in Human-Generated Dialogue Utterances. After uncovering bias in the gender of characters and personas—qualitatively and in number of gendered words—we go on to examine how those biases may propagate to the dialogues that are created from crowdworkers playing the role of these personas.

First, we count the number of male and female gendered words in the training sets of various dialogue datasets (LIGHT, Reddit, Wizard of Wikipedia, Daily Dialog, Empathetic Dialogues, and ConvAI2), using the same word list as before (Zhao et al., 2018b). We use this to calculate the percentage of gendered words out of all words, and the percent male bias, or the percentage of male gendered words among all gendered words. Results are shown in Table 1. LIGHT is the most gender imbalanced dataset among all datasets in this table, with a male bias of 73%.
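To make these two statistics concrete, the sketch below computes the percent of gendered words and the percent male bias over a collection of utterances. It is a minimal illustration under simplifying assumptions: the tiny MALE_WORDS and FEMALE_WORDS sets stand in for the full gendered word list of Zhao et al. (2018b), and tokenization is plain whitespace splitting rather than whatever tokenizer was actually used.

from typing import Iterable, Tuple

# Illustrative stand-ins for the Zhao et al. (2018b) gendered word list.
MALE_WORDS = {"he", "him", "his", "man", "men", "king", "father", "husband", "son"}
FEMALE_WORDS = {"she", "her", "hers", "woman", "women", "queen", "mother", "wife", "daughter"}


def gendered_word_stats(utterances: Iterable[str]) -> Tuple[float, float]:
    """Return (% gendered words among all tokens, % male words among gendered words)."""
    total = male = female = 0
    for utt in utterances:
        for tok in utt.lower().split():
            tok = tok.strip(".,!?'\"")
            total += 1
            if tok in MALE_WORDS:
                male += 1
            elif tok in FEMALE_WORDS:
                female += 1
    gendered = male + female
    pct_gendered = 100.0 * gendered / max(total, 1)
    pct_male_bias = 100.0 * male / max(gendered, 1)
    return pct_gendered, pct_male_bias


if __name__ == "__main__":
    sample = [
        "What a great day for more money.",
        "I spend my days cooking so my husband will have something to eat.",
    ]
    print(gendered_word_stats(sample))  # (5.0, 100.0) for this toy sample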

With this in mind, we qualitatively examine the LIGHT dataset and find many biased utterances present in the training data. For example, the queen persona adheres to negatively stereotyped gender roles when uttering the line I spend my days doing embroidery and having a talk with the ladies. Another character admires a sultry wench with fire in her eyes. We see the direct effect of the biased persona on the resultant dialogue (see Table 3): for example, a wife persona contains the text I spend my days cooking and cleaning so my husband will have something to eat when he returns from his work..., and, in dialogue with a merchant, discusses only her cleaning duties. The merchant even derisively refers to cleaning as the wife’s job.


Figure 1: We compare the performance of various bias mitigation methods—Counterfactual Data Augmentation (CDA), Positive-Bias Data Collection (Pos. Data), Bias Control Model (Bias Ctrl), and combining these methods (ALL)—on the test set, splitting the test set across the four genderedness bins: F0/+M0/+. X0 indicates there are no X-gendered words in the gold response, while X+ indicates that there is at least one. We measure the percent of gendered words in the generated utterances (% gend. words) and the percent of male bias (% male bias), i.e. the percent of male-gendered words among all gendered words generated. While each of these methods yields some improvement, combining all of these methods in one yields the best control over the genderedness of the utterances while improving the F1-score.

4 Mitigating Bias in Generative Dialogue

As we found that LIGHT was considerably more biased than other dialogue datasets, throughout the rest of the paper we use the LIGHT dataset as a testbed for developing a general framework for mitigating bias in generative dialogue.

When we train dialogue models on biased datasets, the bias will manifest in model-generated dialogues. We explore data augmentation and other algorithmic methods to mitigate bias in generative Transformer models. We (i) extend counterfactual data augmentation to dialogue (Maudslay et al., 2019; Zmigrod et al., 2019) to swap gendered words, (ii) perform positive data collection by augmenting the existing dataset via targeted data collection with crowdworkers, and lastly, (iii) present a bias controlled dialogue generation method that controls how many male and female gendered words models produce.

4.1 Counterfactual Data Augmentation

A solution proposed for gender bias in word embeddings is Counterfactual Data Augmentation (CDA) (Maudslay et al., 2019; Zmigrod et al., 2019; Liu et al., 2019). CDA swaps, say, all instances of grandmother with grandfather, she with he, and so on. We apply this word-based data augmentation to dialogue generation by first copying every dialogue with a gendered word(s) in it, then swapping it with its pair from the list provided by Zhao et al. (2018b). The augmentation is limited to words on the gendered word list, and the swapping is performed automatically.
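As a rough illustration of this swap-and-copy procedure, the sketch below augments a set of dialogues by adding a gender-swapped copy of every dialogue that contains a gendered word. It is a simplified sketch, not the authors' released implementation: the small GENDER_PAIRS mapping stands in for the full word-pair list of Zhao et al. (2018b), tokenization is naive, capitalization is not preserved, and ambiguities such as her mapping to either him or his are ignored.

# Toy subset of gendered word pairs; the paper swaps using the full
# list of Zhao et al. (2018b).
GENDER_PAIRS = {
    "she": "he", "her": "him", "hers": "his",
    "woman": "man", "women": "men", "queen": "king",
    "mother": "father", "grandmother": "grandfather",
    "wife": "husband", "daughter": "son",
}
# Make the mapping symmetric so swaps work in both directions.
SWAP = {**GENDER_PAIRS, **{v: k for k, v in GENDER_PAIRS.items()}}


def swap_gendered_words(utterance: str) -> str:
    """Replace each gendered token with its opposite-gendered pair."""
    out = []
    for tok in utterance.split():
        core = tok.strip(".,!?").lower()
        out.append(tok.lower().replace(core, SWAP[core]) if core in SWAP else tok)
    return " ".join(out)


def counterfactually_augment(dialogues):
    """Return the original dialogues plus a swapped copy of each dialogue
    (a list of utterance strings) that contains at least one gendered word."""
    augmented = list(dialogues)
    for dialogue in dialogues:
        if any(t.strip(".,!?").lower() in SWAP for utt in dialogue for t in utt.split()):
            augmented.append([swap_gendered_words(utt) for utt in dialogue])
    return augmented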

4.2 Positive-Bias Data Collection

While CDA has been shown to be a somewhat effective strategy for mitigating bias in word embeddings, this method has several pitfalls: it may result in ungrammatical sentences and it relies on existing (and incomplete) word pair lists to determine and swap gender. To resolve these issues, we use humans to collect additional dialogue data via a two-pronged Positive-Bias Data Collection (Pos. Data) strategy. We first collect additional personas by having humans (i) manually swap the gender of the persona (rather than relying on the word lists) and (ii) write additional, diversified personas. We then use these personas to seed the collection of additional, positively biased dialogue data, which we refer to as Pos. Data throughout.

New Personas. As LIGHT contains more male personas than female personas (see §3), we balance existing personas with gender swapping. For every gendered persona, annotators create a new opposite-gendered persona for which referring nouns or pronouns are changed, but the rest of the character description remains unchanged. For example, for every persona describing a king, annotators will create a new one describing a queen. Annotators are instructed to swap the gender(s) of other characters referred to in the text (e.g., if an original persona describes a female in relation to her father, the new male persona will describe a male in relation to his mother). This method ensures that the created sentences will be grammatical, unlike heuristic data augmentation.

However, simply balancing references to men and women is insufficient, as female characters might be described in sexist ways (see §3). As detecting sexism is challenging, we take our qualitative analysis to be sufficient, and move to offset it by collecting a new set of interesting and independent female characters. We do this by priming workers with examples like adventurer with personas like I am a woman passionate about exploring a world I have not yet seen. I embark on ambitious adventures. We also provide crowdworkers with additional instruction to guide them towards creating diverse characters: We’re looking for strong and diverse descriptions. Avoid descriptions that could be considered hateful, offensive, or stereotypical. Even with this explicit instruction, 3 times as many male characters as female characters were created; this fact alone reveals the inherent gender biases of the available crowdworker pool. We ultimately exclude all male-gendered personas created in this fashion from the new dataset, which brings the number of men and women and the number of references to male or female gendered words to approximate balance in the new dataset (see Table 4). In total, we add 2,629 new personas.

New Dialogues. After gender-balancing the personas, we focus next on our main goal: debiasing generated dialogues. As the personas are a starting point for bias entering the dataset, it is important to address balance in personas as a prior step.

We use the gender-balanced set of personas derived from the two methods described above to crowdsource additional dialogues. We select more female-gendered characters for new dialogue collection, and instruct annotators to be mindful of gender bias. In particular, we encourage them to assume equality—social, economic, political, or otherwise—between genders in this fantasy setting. We collect a total of 507 new dialogues containing 6,658 utterances (approximately 6% of the original dataset size). We refer to this additional dialogue data as Pos. Data.

                F0M0    F0M+    F+M0    F+M+
% of test set   60.65   27.21   7.61    4.63

Table 5: Percentage of dialogue examples in each of the four genderedness bins—F0/+M0/+—for the LIGHT dialogue data test set.

4.3 Bias Controlled Training

Gender bias in dialogue can take the form of imbalanced use of gendered words. To create dialogue models that can generate an equal number of gendered words, we control model output with Bias Control (Bias Ctrl) via conditional training. Previous conditional training models learn to associate specific control tokens with some desired text properties (Kikuchi et al., 2016; Fan et al., 2017; Oraby et al., 2018; See et al., 2019), but have not been applied to address bias issues.

We apply conditional training techniques to control gender bias in generative dialogue by learning to associate control tokens with properties of gender bias. Any general function that takes as input a dialogue utterance and outputs a continuous or discrete value that provides information about gender bias could be used as a control variable. In our case, prior to training, each dialogue response is binned into one of four bins—F0/+M0/+—where X0 indicates that there are zero X-gendered words in the response, and X+ indicates the presence of one or more X-gendered words. The percentage of examples from the test set that fall into each bin is noted in Table 5. Nouns and adjectives are binned via an aggregation of existing gendered word lists (Zhao et al., 2018b,a; Hoyle et al., 2019). Note that other functions could be used as well, such as a bias classifier.

We append a special token to the input that indicates the bin that the response falls into. During Bias Ctrl training, the model should learn to associate the special token with the genderedness of the dialogue response, such that at inference time, we could modify these special tokens to control the genderedness of the model output. For example, a model trained with multiple gender control bins could be set to the gender neutral (in this case, F0M0) setting at inference time, to produce a response containing no gendered words.
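A minimal sketch of this binning and conditioning step is shown below. It assumes MALE_WORDS and FEMALE_WORDS word sets (in the paper, aggregated from the lists of Zhao et al. (2018a,b) and Hoyle et al. (2019)); the exact string format of the control token is our own illustrative choice, not necessarily the one used in the released models.

def genderedness_bin(response, male_words, female_words):
    """Assign a response to one of the four bins: F0M0, F0M+, F+M0, F+M+."""
    toks = {t.strip(".,!?").lower() for t in response.split()}
    f = "F+" if toks & set(female_words) else "F0"
    m = "M+" if toks & set(male_words) else "M0"
    return f + m


def training_input(context, gold_response, male_words, female_words):
    """At training time, append the bin of the gold response to the dialogue context."""
    return context + " " + genderedness_bin(gold_response, male_words, female_words)


def inference_input(context, bin_token="F0M0"):
    """At inference time, the bin is chosen by hand, e.g. F0M0 to request
    a response containing no gendered words."""
    return context + " " + bin_token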

4.4 Implementation Details

Figure 2: Performance of the ALL debiasing model controlled by indicating specific bins for all examples at test time. We report results for each possible conditioning bin choice. Across bins, the model maintains performance as measured by F1 whilst radically changing the genderedness of the language generated.

Following Urbanek et al. (2019), we fine-tune a large, pre-trained Transformer encoder-decoder neural network in all generation experiments on the dialogues in the LIGHT dataset. Following Humeau et al. (2019), the model was pre-trained on Reddit conversations using a previously existing Reddit dataset extracted and obtained by a third party and made available on pushshift.io. During pre-training, models learned to generate a comment conditioned on the full thread leading up to the comment. Comments containing URLs or under 5 characters in length were removed, along with child comments, resulting in approximately 2.2 billion training examples. Similar to pre-training, during fine-tuning, the models are conditioned on the full dialogue history leading up to the next utterance. The model is based on the ParlAI implementation of Miller et al. (2017), and is an 8-layer encoder, 8-layer decoder Transformer with 512-dimensional embeddings and 16 attention heads. For final generations, we decode sequences with a beam search of size 5.
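For reference, the generator configuration described above can be summarized as a simple config object. The field names below are our own shorthand for the reported hyperparameters, not the exact ParlAI flag names.

from dataclasses import dataclass


@dataclass
class GeneratorConfig:
    """Hyperparameters of the generative dialogue model as reported in this section."""
    n_encoder_layers: int = 8
    n_decoder_layers: int = 8
    embedding_size: int = 512
    n_attention_heads: int = 16
    beam_size: int = 5  # beam search used for final generations
    pretraining_data: str = "pushshift.io Reddit threads (~2.2B examples)"
    finetuning_data: str = "LIGHT dialogues (plus CDA / Pos. Data / Bias Ctrl variants)"


config = GeneratorConfig()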

5 Results

We train five Transformer models: a baseline; three models, one for each of our new methods (see §4.1 for CDA, §4.2 for Positive-Bias Data Collection, and §4.3 for Bias Control); and one final model, ALL, which combines all three methods and achieves the best results.

Bias is Amplified in Generation. Existing Transformer generative dialogue models (Serban et al., 2016; Yang et al., 2018; Urbanek et al., 2019) are trained to take the dialogue context as input and generate the next utterance. Generative models are well-known to produce generic text (Li et al., 2015; Fan et al., 2018), which makes it likely they will reproduce statistical biases present in datasets. As described previously (see §2), work shows that machine learning models reflect biases (Zhao et al., 2019; Brunet et al., 2018). Moreover, biases can be easier to learn than more challenging reasoning (Bolukbasi et al., 2016; Lewis and Fan, 2018), suggesting that Transformer models are likely to reflect dataset bias.

Table 9 compares the performance of the various techniques. We compare our methods to the gold labels from the test set and a baseline Transformer generative dialogue model trained on the original data without any bias mitigation techniques. To do this, we divide the test set into four genderedness bins (as defined in Section 4.3)—F0M0, F0M+, F+M0, and F+M+—and calculate: (i) the F1 word overlap with the gold response, (ii) the percentage of gendered words generated (% gend. words), and (iii) the percentage of male-gendered words generated (relative to the sum total of gendered words generated by the model).
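The F1 word overlap referred to here is the standard unigram F1 between the generated and gold responses; a minimal version is sketched below (the paper does not specify its exact tokenization, so plain whitespace splitting is assumed).

from collections import Counter


def f1_word_overlap(generated: str, gold: str) -> float:
    """Unigram F1 between a generated response and the gold response."""
    gen_toks = generated.lower().split()
    gold_toks = gold.lower().split()
    overlap = sum((Counter(gen_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)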

We find that Transformer models not only reflect dataset biases, but also amplify them. When the model produces gendered words (from our gendered word list), it generates male-gendered words the vast majority of the time. Even on utterances for which it is supposed to generate only female-gendered words (the gold label only contains female-gendered words), it generates male-gendered words nearly 78% of the time.

Comparing Debiasing Methods. As shown in Figure 1, each of our methods improves the metrics—percent gendered words, percent male bias, and F1—over the baseline Transformer, but we find combining all methods in one in the ALL model is most advantageous. While ALL has more data than CDA and Bias Ctrl, more data alone is not enough — the Positive-Bias Data Collection model does not achieve as good results. Both the Bias Ctrl and ALL models benefit from knowing the data split (F0M0, for example), and both yield a gender ratio closest to ground truth.

Bias Controlled Training Controls Gendered Words. Our Bias Ctrl method can control the number of gendered words in generated dialogues, as shown in Table 10. We examine the effect of Bias Ctrl by generating responses conditioning the ALL model on each bin. We observe that changing the bin radically changes the genderedness of generated text with only small differences in overall F1. We can control the male bias of the generated dialogue by manipulating these bins.

Examples of generated text from both the baseline and the ALL model are shown in Table 6. The baseline model generates male-gendered words when the gold response contains no gendered words or only female-gendered words, even generating unlikely sequences such as my name is abigail. i am the king of this kingdom.

For various methods, we show the top 20 words generated on the test set (after removing stop words) in Table 8. We denote gendered nouns using an asterisk. Among the top 20 words generated by the baseline model, there are only two gendered nouns—knight and king—and both are male-gendered. The ALL model generates a similar set of words, but also features queen in its top 20 words, another indication that it is more balanced across the male and female genders.
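As a concrete illustration of how such a list can be produced, the sketch below counts the most frequent non-stop-word tokens across a set of generated utterances. The STOP_WORDS set is an illustrative placeholder; the paper does not specify which stop-word list it used.

from collections import Counter

# Illustrative stop-word set; the actual list used in the paper is not specified.
STOP_WORDS = {"the", "a", "an", "i", "you", "is", "are", "to", "of", "and", "in", "it", "that", "my"}


def top_k_words(generated_utterances, k=20):
    """Most frequent non-stop-word tokens across a set of generated utterances."""
    counts = Counter()
    for utt in generated_utterances:
        for tok in utt.lower().split():
            tok = tok.strip(".,!?'")
            if tok and tok not in STOP_WORDS:
                counts[tok] += 1
    return [word for word, _ in counts.most_common(k)]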

5.1 Safety of Generated Text

In Table 7, following Liu et al. (2019), we use a Transformer-based dialogue safety classifier to classify model-generated utterances as offensive or safe. The classifier was fine-tuned on an offensive language classification task (Dinan et al., 2019), and achieves state-of-the-art results.

We apply this classifier to each utterance generated by the ALL and baseline models on the test set, in addition to the gold (human generated) labels from the test set. Our proposed ALL model is rated as less offensive than both the baseline model and the ground truth (gold) labels (see Table 7).
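The offensive-rate numbers in Table 7 reduce to a simple aggregation over per-utterance classifier decisions, sketched below. Here is_offensive is a stand-in for the Dinan et al. (2019) safety classifier; any binary text classifier exposing this interface could be plugged in.

from typing import Callable, Iterable


def percent_offensive(utterances: Iterable[str],
                      is_offensive: Callable[[str], bool]) -> float:
    """Percentage of utterances flagged as offensive by a binary safety classifier."""
    utterances = list(utterances)
    flagged = sum(1 for u in utterances if is_offensive(u))
    return 100.0 * flagged / max(len(utterances), 1)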

Generation Examples

Bin F0M0

Context:
  [P1:] Owl can you find out how I died?
  [P2:] I can look around the forest, but I need more information to help. Tell me what you remember about your past life.
  [P1:] I don't remember anything I was hoping you could find out.
  [P2:] Your form is very hazy. Do you remember if you are a man or woman?
Baseline: i am not a man. i am a man of the forest.
ALL: no, i don't remember.
Gold: I don't know what's wrong with me!

Bin F+M0

Context:
  [P1:] I do not believe my eyes, for an angel is upon me! Angel, please tell me your name.
  [P2:] My name is Abigail!
Baseline: my name is abigail. i am the king of this kingdom.
ALL: i am the queen's daughter!
Gold: Abigail! Such a beautiful name. To what do I owe the pleasure of meeting you?

Table 6: Example generations from the baseline model and the proposed debiased models. Gold truth (‘Gold’) either contains no gendered words or only female-gendered words, but the baseline model still generates male-gendered words.

                Gold Labels   Baseline   ALL
% Offensive     13.0          14.25      10.37

Table 7: Offensive language classification of model responses on the LIGHT dialogue test set.

5.2 Human Evaluation: Bias and Quality

We compare the quality of our debiasing methods using human evaluation (see Figure 3). One might hypothesize that some gender debiasing methods work by replacing contentful words (e.g., witch) with bleached or uninteresting ones (e.g., person, thing), effectively trading off gender bias with engagingness. We use the dialogue evaluation system Acute-Eval (Li et al., 2019) to ask human evaluators to compare two conversations from different models and decide which model generates (i) more biased dialogues and (ii) more engaging dialogues. We collect 100 model conversations with crowdworkers. Then, we compare conversations between a human and the baseline model to conversations between a human and the ALL model with all generations set to the F0M0 gender-neutral control bin. Asking for predictions of speaker gender was found to be more effective than asking about sexism or gender bias directly.


Figure 3: Human Evaluation of ALL model (F0M0) compared to baseline Transformer generative model. Evaluators choose which model output they prefer for dialogue engagingness and difficulty of predicting speaker gender. The ALL model produces less gendered text while engagingness is not affected.

Model        Top 20 generated words

Baseline     sorry, hear, not, what, glad, doing, don, king*, thank, sure, will, your, can, much, do, know, but, knight*, blacksmith, going

ALL          sorry, hear, sure, not, what, help, doing, your, course, trying, glad, thank, queen*, don, good, king*, but, yes, know, sir*

ALL F0M0     sorry, hear, sure, what, not, doing, glad, thank, your, yes, course, but, don, do, know, help, have, enjoying, fool, much

ALL F0M+     sorry, hear, help, trying, sure, good, king*, sir*, not, your, day, course, father*, he*, don, thank, happy, guard*, glad, have

ALL F+M0     sorry, hear, queen*, sure, miss*, not, your, thank, how, hello, today, guard*, she*, yes, course, kind, woman*, help, glad, what

ALL F+M+     sorry, queen*, hear, guard*, help, trying, your, sure, good, course, day, knight*, not, protect, yes, friend, king*, woman*, she*, thank

Table 8: Genderedness bins control the genderedness of generated text. The top 20 words (test set) with stop words removed. * indicates gendered nouns.

As shown in Figure 3, it is more challenging to predict the gender of ALL model generations (significant at p < 0.01) but the responses are just as engaging according to human evaluators. We conclude our proposed methods are able to mitigate gender bias without degrading dialogue quality.

6 Discussion

Generality of Gendered Words. The gendered word lists used may not be comprehensive (Zhao et al., 2018a,b; Hoyle et al., 2019). For example, they do not include hag or wench, which are common in LIGHT. Further, a more continuous representation of gender should be used in the future.

More Fine-Grained Control. We present an effective method to control the quantity of gendered words generated by manipulating control bins. This technique is general and could be used to control other properties of generated utterances. For example, a sexism or bias classifier could be used instead of the gendered word list.

Quality of Generated Dialogue. Generative dialogue models are prone to overuse frequent words and produce generic utterances, the so-called I don't know problem (Li et al., 2015). We also observe these effects, which can affect bias.

7 Conclusion

We propose general purpose techniques for reducing gender bias in dialogue. Our methods combine data augmentation, positive-bias data collection, and bias controlled training. Our new data collection techniques help mitigate issues in the data itself, so bias should clearly be considered at the earliest stages of a project. Bias controlled training lessens bias at the training stage, and is also beneficial. Together, these methods are especially effective, producing less gendered, more gender balanced, safer utterances that maintain engaging dialogue with humans.

References

Nata M Barbosa and Monchu Chen. 2019. Rehumanized crowdsourcing: A labeling framework addressing bias and ethics in machine learning. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, page 543. ACM.

Chris Barker. 1992. Possessive descriptions.

Christine Basta, Marta R Costa-jussa, and Noe Casas. 2019. Evaluating the underlying gender bias in contextualized word embeddings. In Proceedings of the 1st Workshop on Gender Bias in Natural Language Processing.

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357.

Sarah Lynne Bowman. 2010. McFarland and Co.

Marc-Etienne Brunet, Colleen Alkalay-Houlihan, Ashton Anderson, and Richard Zemel. 2018. Understanding the origins of bias in word embeddings. arXiv preprint arXiv:1810.03611.

Yang Trista Cao and Hal Daume. 2019. Toward gender-inclusive coreference resolution. arXiv preprint arXiv:1910.13913.

Kai-Wei Chang, Vinod Prabhakaran, and Vicente Ordonez. 2019. Bias and fairness in natural language processing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts, Hong Kong, China. Association for Computational Linguistics.

Serina Chang and Kathy McKeown. 2019. Automatically inferring gender associations from language. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5745–5751, Hong Kong, China. Association for Computational Linguistics.

Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. 2019. Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. arXiv preprint arXiv:1909.03683.

Marta R Costa-jussa. 2019. An analysis of gender bias studies in natural language processing. Nature Machine Intelligence, pages 1–2.

Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. arXiv preprint arXiv:1908.06083.

Yupei Du, Yuanbin Wu, and Man Lan. 2019. Exploring human gender stereotypes with word association test. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6135–6145.

Ali Emami, Paul Trichelair, Adam Trischler, Kaheer Suleman, Hannes Schulz, and Jackie Chi Kit Cheung. 2019. The KnowRef coreference corpus: Removing gender and number cues for difficult pronominal anaphora resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3952–3961.

Angela Fan, David Grangier, and Michael Auli. 2017. Controllable abstractive summarization. arXiv preprint arXiv:1711.05217.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.

Lisa Fan, Marshall White, Eva Sharma, Ruisi Su, Prafulla Kumar Choubey, Ruihong Huang, and Lu Wang. 2019. In plain sight: Media bias through the lens of factual reporting. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6342–6348, Hong Kong, China. Association for Computational Linguistics.

Antero Garcia. 2017. Privilege, power, and dungeons & dragons: How systems shape racial and gender identities in tabletop role-playing games. Mind, Culture, and Activity, 24(3):232–246.

Aparna Garimella, Carmen Banea, Dirk Hovy, and Rada Mihalcea. 2019. Women's syntactic resilience and men's grammatical luck: Gender-bias in part-of-speech tagging and dependency parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3493–3498.

Andrew Gaut, Tony Sun, Shirlyn Tang, Yuxin Huang, Jing Qian, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, et al. 2019. Towards understanding gender bias in relation extraction. arXiv preprint arXiv:1911.03642.

Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614, Minneapolis, Minnesota. Association for Computational Linguistics.

Hila Gonen, Yova Kementchedjhieva, and Yoav Goldberg. 2019. How does grammatical gender affect noun representations in gender-marking languages? arXiv preprint arXiv:1910.14161.

Nizar Habash, Houda Bouamor, and Christine Chung. 2019. Automatic gender identification and reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 155–165.

Reyhaneh Hashempour. 2019. A deep learning approach to language-independent gender prediction on Twitter. In Proceedings of the 2019 Workshop on Widening NLP.

He He, Sheng Zha, and Haohan Wang. 2019. Unlearn dataset bias in natural language inference by fitting the residual. arXiv preprint arXiv:1908.10763.

Peter Henderson, Koustuv Sinha, Nicolas Angelard-Gontier, Nan Rosemary Ke, Genevieve Fried, Ryan Lowe, and Joelle Pineau. 2018. Ethical challenges in data-driven dialogue systems. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES 2018, New Orleans, LA, USA, February 02-03, 2018, pages 123–129.

Alexander Miserlis Hoyle, Lawrence Wolf-Sonkin, Hanna Wallach, Isabelle Augenstein, and Ryan Cotterell. 2019. Unsupervised discovery of gendered language through latent-variable modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1706–1716, Florence, Italy. Association for Computational Linguistics.

Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Real-time inference in multi-sentence tasks with deep pretrained transformers. arXiv preprint arXiv:1905.01969.

Masahiro Kaneko and Danushka Bollegala. 2019. Gender-preserving debiasing for pre-trained word embeddings. arXiv preprint arXiv:1906.00742.

Dongyeop Kang, Varun Gangal, and Eduard Hovy. 2019. (male, bachelor) and (female, Ph.D) have different connotations: Parallelly annotated stylistic language dataset with multiple personas. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1696–1706, Hong Kong, China. Association for Computational Linguistics.

Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. arXiv preprint arXiv:1609.09552.

Nayeon Lee, Yejin Bang, Jamin Shin, and Pascale Fung. 2019a. Understanding the shades of sexism in popular TV series. In Proceedings of the 2019 Workshop on Widening NLP.

Nayeon Lee, Andrea Madotto, and Pascale Fung. 2019b. Exploring social bias in chatbots using stereotype knowledge. In Proceedings of the 2019 Workshop on Widening NLP, pages 177–180.

Haley Lepp. 2019. Pardon the interruption: Automatic analysis of gender and competitive turn-taking in United States Supreme Court hearings. In Proceedings of the 2019 Workshop on Widening NLP, pages 143–145, Florence, Italy. Association for Computational Linguistics.

Mike Lewis and Angela Fan. 2018. Generative question answering: Learning to answer the whole question. ICLR.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, Berlin, Germany. Association for Computational Linguistics.

Margaret Li, Jason Weston, and Stephen Roller. 2019. Acute-Eval: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087.

Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards deep conversational recommendations. In Advances in Neural Information Processing Systems, pages 9748–9758.

Paul Pu Liang, Irene Li, Emily Zheng, Yao Chong Lim, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2019. Towards debiasing sentence representations.

Haochen Liu, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, and Jiliang Tang. 2019. Does gender matter? Towards fairness in dialogue systems. CoRR, abs/1910.10486.

Rowan Hall Maudslay, Hila Gonen, Ryan Cotterell, and Simone Teufel. 2019. It's all in the name: Mitigating gender bias with name-based counterfactual data substitution. CoRR, abs/1909.00871.

Pierre-Emmanuel Mazare, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. arXiv preprint arXiv:1809.01984.

Jack Merullo, Luke Yeh, Abram Handler, Alvin Grissom II, Brendan O'Connor, and Mohit Iyyer. 2019. Investigating sports commentator bias within a large corpus of American football broadcasts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6354–6360, Hong Kong, China. Association for Computational Linguistics.

A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston. 2017. ParlAI: A dialog research software platform. arXiv preprint arXiv:1705.06476.

Oluwatobi O Olabiyi, Anish Khazane, and Erik T Mueller. 2018. A persona-based multi-turn conversation model in an adversarial learning framework. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 489–494. IEEE.

Cathy O'Neil. 2016. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books.

Shereen Oraby, Lena Reed, Shubhangi Tandon, TS Sharath, Stephanie Lukin, and Marilyn Walker. 2018. Controlling personality-based stylistic variation with neural natural language generators. arXiv preprint arXiv:1805.08352.

Jahna Otterbacher, Alessandro Checco, Gianluca Demartini, and Paul Clough. 2018. Investigating user perception of gender bias in image search: The role of sexism. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 933–936. ACM.

Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2799–2804, Brussels, Belgium. Association for Computational Linguistics.

Yusu Qian. 2019. Gender stereotypes differ between male and female writings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 48–53.

Yusu Qian, Urwa Muaz, Ben Zhang, and Jae Won Hyun. 2019. Reducing gender bias in word-level language models with a gender-equalizing loss function. arXiv preprint arXiv:1905.12801.

Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. arXiv preprint arXiv:1902.08654.

Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2016. Generative deep neural networks for dialogue: A short review. arXiv preprint arXiv:1611.06216.

Sima Sharifirad, Alon Jacovi, and Stan Matwin. 2019. Learning and understanding different categories of sexism using convolutional neural networks filters. In Proceedings of the 2019 Workshop on Widening NLP, pages 21–23.

Sima Sharifirad and Stan Matwin. 2019. Using attention-based bidirectional LSTM to identify different categories of offensive language directed toward female celebrities. In Proceedings of the 2019 Workshop on Widening NLP, pages 46–48.

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3398–3403, Hong Kong, China. Association for Computational Linguistics.

Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. 2018. Engaging image chat: Modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945.

Gabriel Stanovsky, Noah A Smith, and Luke Zettlemoyer. 2019. Evaluating gender bias in machine translation. arXiv preprint arXiv:1906.00591.

Pierre Stock and Moustapha Cisse. 2017. Convnets and ImageNet beyond accuracy: Explanations, bias detection, adversarial examples and model criticism.

Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. 2019. Mitigating gender bias in natural language processing: Literature review. arXiv preprint arXiv:1906.08976.

Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktaschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in a fantasy text adventure game. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 673–683, Hong Kong, China. Association for Computational Linguistics.

Adina Williams. 2018. Representing Relationality: MEG Studies on Argument Structure. Ph.D. thesis, New York University.

Adina Williams, Damian Blasi, Lawrence Wolf-Sonkin, Hanna Wallach, and Ryan Cotterell. 2019. Quantifying the semantic core of gender systems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5733–5738, Hong Kong, China. Association for Computational Linguistics.

Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning semantic textual similarity from conversations. arXiv preprint arXiv:1804.07754.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender bias in contextualized word embeddings. In NAACL (short).

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018a. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 15–20.

Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018b. Learning gender-neutral word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4847–4853.

Pei Zhou, Weijia Shi, Jieyu Zhao, Kuan-Hao Huang, Muhao Chen, Ryan Cotterell, and Kai-Wei Chang. 2019. Examining gender bias in languages with grammatical gender. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5275–5283, Hong Kong, China. Association for Computational Linguistics.

Ran Zmigrod, Sebastian J. Mielke, Hanna Wallach, and Ryan Cotterell. 2019. Counterfactual data augmentation for mitigating gender stereotypes in languages with rich morphology. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1651–1661, Florence, Italy. Association for Computational Linguistics.

Appendix

Data Split:   F0M0                          F0M+                          F+M0                          F+M+                          All
Model         % gend.   % male    F1        % gend.   % male    F1        % gend.   % male    F1        % gend.   % male    F1        F1
              words     bias      score     words     bias      score     words     bias      score     words     bias      score     score

Gold Lbl      0         0         -         4.11      100       -         4.03      0         -         6.67      50.71     -         -
Baseline      2.37      88.39     11.24     3.66      90.26     11.77     2.44      77.99     11.54     3.05      80.05     11.43     11.42
ConvAI2 FT    0.79      71.09     7.78      1.1       78.31     7.94      1.35      51.6      8.75      1.97      67.23     8.99      7.95
Reddit Base   2.18      73.68     9.93      3.03      81.78     11.54     2.81      52.99     10.99     3.94      63.16     12.61     10.57

CDA           0.88      71.03     11.63     1.38      68.57     11.7      1.2       56.18     11.43     1.17      58.01     11.12     11.62
Pos. Data     2.76      82.44     10.46     3.68      86.43     10.07     4.59      72.1      10.07     4.43      86.5      9.88      10.44
Bias Ctrl     0.14      68.75     10.72     5.83      98.08     13.01     4.8       2.69      10.84     4.05      45.86     11.35     11.38
ALL           0.14      64.19     11.72     6.59      97.94     12.77     5.84      7.13      11.28     8.81      50.94     12.22     11.99

Table 9: We compare the performance of various bias mitigation methods—Counterfactual Data Augmentation (CDA), Positive-Bias Data Collection (Pos. Data), Bias Control Model (Bias Ctrl), and combining these methods (ALL)—on the test set, splitting the test set across the four genderedness bins: F0/+M0/+. X0 indicates there are no X-gendered words in the gold response, while X+ indicates that there is at least one. We measure the percent of gendered words in the generated utterances (% gend. words) and the percent of male bias (% male bias), i.e. the percent of male-gendered words among all gendered words generated. While each of these methods yields some improvement, combining all of these methods in one yields the best control over the genderedness of the utterances while improving the F1-score.

Data Split:   F0M0                          F0M+                          F+M0                          F+M+                          All
Model         % gend.   % male    F1        % gend.   % male    F1        % gend.   % male    F1        % gend.   % male    F1        F1
              words     bias      score     words     bias      score     words     bias      score     words     bias      score     score

Gold Lbl      0         0         -         4.11      100       -         4.03      0         -         6.67      50.71     -         -
Baseline      2.37      88.39     11.24     3.66      90.26     11.77     2.44      77.99     11.54     3.05      80.05     11.43     11.42

ALL F0M0      0.14      64.19     11.72     0.24      80.11     11.51     0.22      25.0      11.63     0.23      81.58     10.72     11.61
ALL F0M+      6.47      97.97     9.58      6.59      97.94     12.77     7.22      96.33     10.0      6.27      97.52     12.21     10.6
ALL F+M0      4.77      11.66     10.27     5.12      15.84     10.94     5.84      7.13      11.28     5.03      13.64     11.23     10.57
ALL F+M+      9.53      53.34     8.89      9.6       55.35     11.19     9.42      48.65     10.5      8.81      50.94     12.22     9.79

Table 10: Performance of the ALL debiasing model controlled by indicating specific bins for all examples at test time. We report results for each possible conditioning bin choice. Across bins, the model maintains performance as measured by F1 whilst radically changing the genderedness of the language generated.

Persona Example (Original LIGHT Dataset)

son: I am spoiled and rich. I enjoy running in the castle. I like hide and seek.

men: I am an average man in the village. I do what ever work that my King requires me to do. At night, I spend my time in the local pub with my fellow men.

farmer Bob: I was born in a poor village. I eat what we grow. I love being close to the earth.

father: I am a role model for my children. I provide for the family with meat and I keep a roof over their heads. I am stability to the family, and keep things together and provide safety to my children.

husband: I try to be good to my wife. I want to provide for my family. I try to be strong.

Table 11: Examples of male gender biased personas written for gendered characters in the LIGHT dataset.