
Towards Hate Speech Detection at Large via Deep Generative Modeling

Tomer Wullach, University of Haifa

Amir Adler, Braude College of Engineering and Massachusetts Institute of Technology

Einat Minkov, University of Haifa

Abstract—Hate speech detection is a critical problem in social media platforms, which are often accused of enabling the spread of hatred and igniting physical violence. Hate speech detection requires overwhelming resources, including high-performance computing for monitoring online posts and tweets, as well as thousands of human experts for daily screening of suspected posts or tweets. Recently, Deep Learning (DL)-based solutions have been proposed for the automatic detection of hate speech, using modest-sized training datasets of a few thousand hate speech sequences. While these methods perform well on the specific datasets, their ability to detect new hate speech sequences is limited and has not been investigated. Being a data-driven approach, DL is well known to surpass other methods whenever a scale-up in training dataset size and diversity is achieved. Therefore, we first present a dataset of 1 million realistic hate and non-hate sequences, produced by a deep generative language model. We further utilize the generated dataset to train a well-studied DL-based hate speech detector, and demonstrate consistent and significant performance improvements across five public hate speech datasets. The proposed solution thus enables high-sensitivity detection of a very large variety of hate speech sequences, paving the way to a fully automatic solution.

INTRODUCTION

The rapidly growing usage of social media platforms and blogging websites has led to over 3.8 billion [1] active social media users worldwide (half of the world population), who use text as their prominent means of mutual communication. A fraction of the users use language that expresses hatred towards a specific group of people, known as Hate Speech [2]–[6]. Hate speech is mostly used against groups of common race, ethnicity, national origin, religious affiliation, sexual orientation, caste, gender, gender identity, and serious disease or disability. Hate speech can be expressed explicitly, for example: "Refugees are the illness of the world and must be cured with antibiotics.", or implicitly: "Black women on London transport...", which leads to a very large diversity of realistic hate speech sequences. Therefore, hate speech detection has become a critical problem for social media platforms, which are often accused of enabling the spread of hatred and igniting physical violence in the real world.


Hate speech detection requires overwhelming resources, including high-performance computing for monitoring online posts and tweets. In addition, a portion of the suspected posts and tweets are sent for further screening by teams of thousands [7] of human experts. It is therefore essential that automatic hate speech detection systems identify hate speech sequences in real time with very high sensitivity, namely, with a low false-negative rate; otherwise, a significant portion of the hate speech will not be detected.

Hate speech detection has been investigated extensively in recent years, and due to the outstanding success of Deep Learning (DL) [8] in natural language processing [9], it has become a leading approach for solving this problem. Among the proposed DL architectures, the most well-studied ones, such as [10], were adopted from the related problem of sentiment analysis [11]. The main reason for the success of DL is its generalization capability. Generalization [12] is defined as the ability of a classifier to perform well on unseen examples. In the supervised learning framework, a classifier $C : \mathcal{X} \rightarrow \mathcal{Y}$ is learned using a training dataset $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ of size $N$, where $x_i \in \mathcal{X}$ is a data example (i.e., a text sequence) and $y_i \in \mathcal{Y}$ is the corresponding label (i.e., hate or non-hate). Let $P(\mathcal{X}, \mathcal{Y})$ be the true distribution of the data; then the expected risk is defined by $R(C) = \mathbb{E}_{x,y \sim P(\mathcal{X}, \mathcal{Y})}[L(C(x), y)]$, where $L$ is a loss function that measures the misfit between the classifier output and the data label. The goal of DL is to find a classifier $C$ that minimizes the expected risk; however, the expected risk cannot be computed, since the true distribution is unavailable. Therefore, the empirical risk is minimized instead: $R_E(C) = \frac{1}{N}\sum_{i=1}^{N} L(C(x_i), y_i)$, which approximates the statistical expectation with an empirical mean. The Generalization Gap is defined as the difference between the expected risk and the empirical risk: $R(C) - R_E(C)$. By using large and diverse datasets, DL has been shown to achieve a low generalization gap, where an approximation of the expected risk is computed using the learned classifier and a held-out testing dataset $T = \{(x_1, y_1), \ldots, (x_M, y_M)\}$ of size $M$, such that $S \cap T = \emptyset$. The implication of achieving a low generalization gap is good performance of the classifier on unseen data, namely, accurately detecting unseen hate speech.
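To make these definitions concrete, the following minimal Python sketch (our illustration, not from the paper) computes the empirical risk on a labeled sample and contrasts the train and held-out estimates:

```python
# Minimal sketch (not from the paper): empirical risk and an estimate of the
# generalization gap. `classifier` and `loss` stand in for C and L above.
def empirical_risk(classifier, examples, labels, loss):
    """R_E(C) = (1/N) * sum_i L(C(x_i), y_i) over a labeled sample."""
    return sum(loss(classifier(x), y) for x, y in zip(examples, labels)) / len(labels)

# With disjoint train (S) and held-out test (T) splits, the test-set risk
# approximates the expected risk R(C), so the gap can be estimated as:
# gap_estimate = empirical_risk(C, x_test, y_test, L) - empirical_risk(C, x_train, y_train, L)
```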

Existing hate speech detection solutions utilize public datasets with representative hate and non-hate sequences, collected from social media. These datasets contain only a few thousand sequences, and thus have a limited capability to faithfully represent the large diversity of hate and non-hate sequences. Therefore, while good results are achieved by existing solutions, these are limited to the scope of the utilized datasets, and have not been tested on a larger variety of hate/non-hate sequences.

In this paper we present a novel approach for significantly increasing the generalization capabilities of hate speech detection solutions. The performance of the proposed approach is evaluated using a well-studied DL-based detector, trained using five different public datasets. First, we scale up the training datasets significantly in terms of size and diversity by utilizing the GPT-2 [13] deep generative language model. Each one of the five datasets was used to fine-tune the GPT-2 model, such that it generated 100,000 hate and 100,000 non-hate speech sequences, per dataset. Following the data generation stage, hate speech detection performance was evaluated by augmenting the baseline training sets with the generated data, demonstrating consistent and significant performance improvements of the DL detector, as compared to training using only the baseline (un-augmented) datasets.

The contributions of this paper are three-fold: (i) the utilization of deep generative language modeling for approximating the probability distribution of hate speech and generating massive amounts of hate speech sequences; (ii) a comparative study of the achievable generalization of hate speech detection using existing hate speech datasets; and (iii) a detailed demonstration of significant hate speech detection generalization improvements, obtained by utilizing the generated data.

The rest of this paper is organized as follows: an overview of hate speech detectors and datasets is provided in the related work section. The generation of hate and non-hate speech by the deep generative model is detailed in the proposed approach section. Hate speech detection results and a study of the achievable generalization are provided in the performance evaluation section. The conclusions section summarizes the paper.


Related work

Researchers have proposed several approaches for hate speech detection over recent years, ranging from classical learning methods to modern deep learning. While these methods differ with respect to architecture and data representation, they share the need for a large, diverse, and high-quality data resource for training effective classification models. In the absence of such a resource, previous related works constructed datasets for this purpose, which were manually labeled, and therefore strictly limited in size. Below, we review existing hate speech detection approaches, and describe the datasets that have been used by researchers for training and evaluating such classifiers. We will leverage those labeled datasets which are publicly available for generating our MegaSpeech corpus.

Hate Speech Classifiers

Classical machine-learning methods require data to be transformed into some pre-defined feature representation scheme, which should ideally capture meaningful and relevant information for classification purposes. Given textual data, features typically pertain to raw lexical information (surface word forms), considering individual words as well as word sequences (n-grams). Other features may encode processed syntactic and semantic information. For example, Davidson et al. [14] represented tweets using word n-grams, weighted by word frequency (TF-IDF), alongside part-of-speech and sentiment features. In addition, they encoded genre-specific features, denoting hashtag and user mentions, URLs, and the tweet's length. They then trained a logistic regression classifier to distinguish between tweets labeled as hate or offensive speech, or neither. In another work, a Support Vector Machine (SVM) was trained to predict whether a tweet was hateful or not using feature representations of word n-grams and typed word dependencies [15].
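To ground this description, here is a minimal sketch of such a classical pipeline in the spirit of Davidson et al. [14]; the n-gram range is an assumption, and the genre-specific, part-of-speech, and sentiment features described above are simplified away:

```python
# Hedged sketch (not the authors' code): TF-IDF-weighted word n-grams fed
# to a logistic regression classifier, as in classical hate speech detection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),  # word uni-, bi- and trigrams, TF-IDF weighted
    LogisticRegression(max_iter=1000),    # classes: hate / offensive / neither
)
# pipeline.fit(train_texts, train_labels)
# predictions = pipeline.predict(test_texts)
```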

Deep Learning (DL) methods can effectively exploit large-scale data, learning latent representations using multi-layered neural network architectures. Modern DL architectures for text processing generally consist of a word embedding layer, intended to capture the generalized semantic meaning of words, mapping each word in the input sentence into a low-dimensional vector [16]. The following layers learn relevant latent feature representations, and the processed information is fed into a classification layer that predicts the label of the input sentence.

In a recent work, Founta et al. [17] experimented with a popular DL architecture for text processing, classifying tweets into the categories of hate, abusive, or offensive speech, sarcasm and cyberbullying. They transformed the input words into GloVe word embeddings [18]. They then used a recurrent layer comprised of Gated Recurrent Units (GRU) for generating contextual and sequential word representations, having each word processed given the representations of previous words in the input sentence. Following a dropout layer (intended to prevent over-fitting), the final dense layer outputs the probability that the sentence belongs to each of the target classes using a softmax activation function. Another popular choice of DL architecture for processing textual data is convolutional neural networks (CNN). Previously, Badjatiya et al. [19] used such an architecture for hate speech detection. A CNN applies a filter over the input word representations, processing local features per fixed-size word subsequences. Hierarchical processing, which involves aggregation and down-sampling, consolidates the local features into global representations. These representations are finally fed into a fully-connected layer which predicts the probability distribution over all of the possible classes.
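A compact Keras sketch of such a GRU-based classifier follows; the embedding dimension, GRU width, dropout rate, and class count are illustrative assumptions (in [17], the embedding layer would be initialized with pre-trained GloVe vectors):

```python
# Hedged sketch of a GRU text classifier in the spirit of Founta et al. [17].
from tensorflow.keras import layers, models

def build_gru_classifier(vocab_size=30_000, seq_len=30, n_classes=5):
    return models.Sequential([
        # In practice, initialized with pre-trained GloVe embeddings [18].
        layers.Embedding(vocab_size, 200, input_length=seq_len),
        layers.GRU(128),            # contextual, sequential word representations
        layers.Dropout(0.5),        # intended to prevent over-fitting
        layers.Dense(n_classes, activation="softmax"),  # per-class probabilities
    ])
```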

Another DL classification architecture, presented by Zhang et al. [10], is comprised of both convolutional and recurrent processing layers. This architecture has been shown to yield top-performance results on the task of hate speech classification, and is therefore the architecture of choice in our experiments. In brief, and as illustrated in Figure 1, this text classification network consists of a convolution layer applied to individual dimensions of the input word embeddings, which is down-sampled using a subsequent max-pooling layer. The following layer is recurrent (GRUs), producing hidden state representations per time step. Finally, a global max-pooling layer is applied, and a softmax layer produces a probability distribution over the target classes for the given input. Further implementation details are provided later in this paper.

Hate speech datasets

Table 1 details the statistics of five public datasets of hate speech, which are all manually curated and of modest size. Below, we discuss the generation process and characteristics of these datasets. In this work, we will leverage these datasets for tuning an automatic language generation process, resulting in a very large number of hate speech sequences, namely the MegaSpeech resource.

The typical process that has been traditionally applied by researchers and practitioners for constructing labeled datasets in general, and of hate speech in particular, involves the identification of authentic hate speech sentences, as well as counter (non-hate) examples. Twitter is often targeted as a source of relevant data; this (and other) public social media platforms allow users to share their thoughts and interact with each other freely, making them a fertile ground for expressing all kinds of agendas, some of which may be racist or hateful. The initial retrieval of hate speech examples from Twitter is based on keyword matching, specifying terms that are strongly associated with hate (e.g., terms listed at https://www.hatebase.org/). Once candidate tweets are collected, they are assessed and labeled by human annotators into pre-specified categories. The manual annotation of the examples provides high-quality ground-truth labeled datasets, yet it is costly. Accordingly, the available datasets each include only a few thousand labeled examples. Due to their size limit, and biases involved in the dataset collection process, e.g., keyword selection and labeling guidelines, these datasets may be under-representative of the numerous forms and shapes in which hate speech may be manifested. Another caveat of referring to authentic content within a dataset concerns the discontinued availability of the collected texts by the relevant provider over time; for example, tweets must be stored by their identifier number, where access to the tweet's content may later be denied, impairing the dataset.

We now turn to describe the individual datasets in Table 1 in more detail.

Table 1. Publicly available hate speech datasets, pertaining to examples labeled strictly as hate or non-hate.

Dataset    Source    Hate     Non-Hate
DV [14]    Tweets    1,404    7,875
FN [22]    Tweets    4,020    49,775
WS [20]    Tweets    1,965    3,797
WH [23]    Posts     1,196    9,507
SE [21]    Tweets    4,217    5,758

The dataset due to Davidson et al. [14] (DV) includes tweets labeled by CrowdFlower (https://www.welcome.ai/crowdflower) workers into three categories: hate speech, offensive, or neither. For the purposes of this work, we only consider examples of the first and latter categories. Waseem and Hovy [20] created another dataset (WS), considering tweets of accounts which frequently used slurs and terms related to religious, sexual, gender and ethnic minorities; those tweets were manually labeled into the categories of racism, sexism or neither. Again, we only consider the first and latter categories as examples of hate and non-hate, respectively. Another dataset was constructed by SemEval conference organizers [21] for the purpose of promoting hate detection (SE). They considered the historical posts of identified hateful Twitter users, narrowed down to tweets that included hateful terms, and had examples labeled by CrowdFlower workers. We find that many of the tweets labeled as hateful in this dataset target women and immigrants. Founta et al. [22] (FN) performed iterative sampling and exploration while having tweets annotated using crowdsourcing. Their resulting dataset is relatively large (~80K examples), and distinguishes between multiple flavors of offensive speech, namely offensive, abusive, hateful, aggressive, cyberbullying, spam and none. In order to maintain our focus on hate speech, we consider the labeled examples associated with the hateful and none categories. Finally, the dataset due to de Gibert et al. [23] (WH) was extracted from the extremist Stormfront Internet forum (https://en.wikipedia.org/wiki/Stormfront_(website)). This dataset aims to gauge hate in context, considering also cases where a sentence does not qualify as hate speech on its own, but is interpreted as hateful within a larger context comprised of several sentences.


The proposed approach: Scalable Hate Speech Detection via Data Generation

Hate Speech Data Generation

Ideally, DL classifiers would be trained using large and diverse datasets, which more closely represent the true distribution of hate speech. We automatically develop large-scale datasets of hate speech text sequences and counter examples from seed labeled examples, using the GPT-2 model [13].

The Generative Pretrained Transformer 2 (GPT-2) is a language model, trained to predict, or "generate", the next token in a sequence given the previous tokens of that sequence, in an unsupervised way. Having been trained using 40GB of Web text, GPT-2 automatically generates text sequences of greater quality than previously seen [13]. We employed the large configuration of GPT-2 (764M parameters) in this work; a yet larger GPT-2 model (1542M parameters) may yield even better text quality, but requires substantially larger computing resources.

Our aim, however, is to generate hateful rather than general text sequences. Concretely, we wish to generate hate (and non-hate) speech sequences that: (a) are perceived by humans as hate speech (or non-hate) with high probability, (b) introduce high language diversity, and (c) exhibit high language fluency. In order to meet the first requirement, we exploit available labeled examples, tuning GPT-2 towards hate (and non-hate) speech generation. High diversity is achieved by leveraging the rich language model encoded into GPT-2, and by generating a very large number of candidate sequences. Last, we maintain high data quality by automatically evaluating the generated candidate sequences using a state-of-the-art text classifier, namely BERT [24]; generated sequences that are evaluated as low-quality texts by BERT are automatically removed. Next, we formalize and discuss each of these components of our approach for hate speech generation.

Domain fine-tuning. We wish to approximate the language distribution of a specific class, namely hate speech or non-hate, as opposed to generating random text sequences. We therefore fine-tune the pre-trained GPT-2 model, continuing training the model from its distribution checkpoint while serving it with relevant text sequences. We randomly selected 80% of the labeled examples per dataset and category for this purpose (see Table 1); the remaining labeled examples are held out for evaluating hate detection performance in our experiments. Let us denote the examples sampled from dataset $d_i$ that are labeled as hate and non-hate by $d_i^{hate}$ and $d_i^{non\text{-}hate}$, respectively. Accordingly, we fine-tune two distinct GPT-2 models per dataset $d_i$:

• $G_i^{hate}$, fine-tuned using $d_i^{hate}$, and

• $G_i^{non\text{-}hate}$, fine-tuned using $d_i^{non\text{-}hate}$.

Importantly, fine-tuning a separate GPT-2 model per dataset and class (yielding 10 models overall) is expected to bias the language generation process according to the topics and terms that are characteristic to that combination of dataset and class. Once text sequences are generated from each fine-tuned model, they are unified so as to obtain a diverse distribution of automatically generated text samples.
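As an illustration of this step, a minimal fine-tuning sketch using the HuggingFace transformers library [28] might look as follows; the file layout, checkpoint names, and training hyperparameters are our assumptions, not details reported in the paper:

```python
# Hedged sketch: fine-tuning one GPT-2 model per (dataset, class) split.
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

def finetune_gpt2(train_file, output_dir):
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")  # the "large" configuration
    model = GPT2LMHeadModel.from_pretrained("gpt2-large")
    # train_file holds one labeled example per line, e.g., the 80% split of d_i^hate.
    dataset = TextDataset(tokenizer=tokenizer, file_path=train_file, block_size=64)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    args = TrainingArguments(output_dir=output_dir,
                             num_train_epochs=3,             # assumption
                             per_device_train_batch_size=4)  # assumption
    Trainer(model=model, args=args, data_collator=collator,
            train_dataset=dataset).train()

# Ten models overall: one per combination of dataset and class.
for ds in ["WS", "DV", "FN", "WH", "SE"]:
    for cls in ["hate", "non_hate"]:
        finetune_gpt2(f"{ds}_{cls}_train.txt", f"gpt2_{ds}_{cls}")
```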

Candidate sequence generation. Given the fine-tuned GPT-2 models, we generate a large number of sequences (600K) per model. Similarly to the labeled examples, we generate text sequences that are relatively short, limiting the sequence length to 30 tokens. We provide no prompt to the GPT-2 model, that is, text is generated unconditionally, starting from an empty string (we set the generation parameters top_p and temperature both to 0.9).
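A minimal generation sketch under these settings might look as follows; the checkpoint name and batch size are assumptions:

```python
# Hedged sketch: unconditional sampling from a fine-tuned GPT-2 model, capped
# at 30 tokens, with top_p and temperature both set to 0.9 as stated above.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2_WS_hate")  # one fine-tuned checkpoint
model.eval()

def generate_candidates(n_sequences, batch_size=50):
    sequences = []
    # An "empty" prompt: generation starts from the beginning-of-sequence token.
    bos = torch.full((batch_size, 1), tokenizer.bos_token_id, dtype=torch.long)
    with torch.no_grad():
        while len(sequences) < n_sequences:
            out = model.generate(bos, do_sample=True, top_p=0.9, temperature=0.9,
                                 max_length=30, pad_token_id=tokenizer.eos_token_id)
            sequences += [tokenizer.decode(s, skip_special_tokens=True) for s in out]
    return sequences[:n_sequences]

candidates = generate_candidates(600_000)  # 600K candidate sequences per model
```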

Sequence selection. One should not expect all of the sequences automatically generated by the fine-tuned GPT-2 models to be perceived as hateful (or non-hateful). We found it crucial to assess and filter the generated sequences by their quality and relevance to the focus class.

We employ BERT [24], a powerful DL transformer-based model which has been shown to achieve state-of-the-art results on many natural language processing tasks, for automatically classifying the generated sequences as hateful (or not); we used the BERT-base model in our experiments. BERT, which has been trained using large amounts of text, projects given text sequences into contextual embedding representations, which are useful for text classification; adding a softmax layer on top of BERT, one may learn to predict target classes of interest by further training the model using labeled examples [24], [25].

We utilize the pool of labeled training examples for fine-tuning BERT on the task of hate speech detection, using a balanced set of hate and non-hate labeled examples (by down-sampling the latter) for this purpose. Similarly to other works, we found it beneficial to first fine-tune BERT on the related task of sentiment classification, using labeled examples from the Stanford Sentiment Treebank (SST) [26], [27]. We used the popular HuggingFace platform [28] for fine-tuning BERT, and consequently for predicting the class probabilities of each generated sequence using the fine-tuned BERT. We maintained the top-scoring 100K sequences per dataset and class; their union comprises our 1M MegaSpeech sequence corpus.
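The selection step can be sketched as follows; `bert_hate_classifier` is an assumed local checkpoint name, and the two-stage fine-tuning (SST sentiment, then hate/non-hate) is omitted for brevity:

```python
# Hedged sketch: scoring generated candidates with the fine-tuned BERT-base
# classifier and keeping the 100K top-scoring sequences per dataset and class.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert_hate_classifier")
model.eval()

def class_probability(texts, target_class=1, batch_size=64):
    """Return P(target_class | text) for each generated sequence."""
    probs = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, max_length=32, return_tensors="pt")
            logits = model(**batch).logits
            probs += torch.softmax(logits, dim=-1)[:, target_class].tolist()
    return probs

scores = class_probability(candidates)             # candidates: generated sequences
ranked = sorted(zip(scores, candidates), reverse=True)
selected = [text for _, text in ranked[:100_000]]  # the top-scoring 100K sequences
```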

MegaSpeech: A Large Hate Speech Resource

We created MegaSpeech, a large-scale resource comprised of 1M text sequences, automatically generated and assessed as hate and non-hate speech. This resource exceeds the size of currently available datasets (each comprised of thousands of examples) by orders of magnitude. We will make this resource available for research purposes upon request.

A sample of 1,000 randomly selected sequences was independently evaluated by the authors, manually assigning the generated texts to the classes of hate and non-hate. We measured inter-annotator agreement based on a subset of 250 co-annotated random examples and obtained a score of 0.712 in terms of Fleiss' kappa [29], indicating strong agreement between the annotators. Overall, the ratio of generated hate sequences perceived as hate speech was as high as 65%, and similarly, the ratio of generated non-hate sequences perceived as non-hate was estimated at 86%. We further assessed the text sample with respect to coherence, or 'readability', using a scale of 1–5 (ranging from poor to excellent text quality, respectively), where this evaluation yielded an average score of 3.62, i.e., overall the text quality was good. We note that the generated text resembles the seed examples genre-wise, making use of slang, abbreviations, and hashtags. Table 5 includes several representative examples of hate and non-hate generated sequences.

Table 2. ROUGE-L scores of data generated by the GPT-2 language model per dataset. These values indicate low similarity between the generated sequences and the sequences on which the model has been fine-tuned, for each dataset and class.

Source dataset    Generated hate    Generated non-hate
WS                0.12              0.05
DV                0.07              0.05
FN                0.11              0.14
WH                0.09              0.16
SE                0.05              0.03

We further wished to verify that the generated sequences depart from the seed labeled examples, forming new and diverse language usage, as opposed to duplicating or repeating parts of the input examples. We therefore assessed the lexical similarity between the automatically generated sequences and the source examples per dataset and class [30]. Concretely, we employed ROUGE-L [31], a popular text similarity measure which indicates how much is common between two word sequences, ranging from 0 (nothing in common) to 1 (identical sequences). Table 2 reports the computed average ROUGE-L similarity scores. As shown, the similarity scores are low, indicating that the rich language model encoded within GPT-2 was effectively leveraged for generating new and different text sequences.
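The paper does not name its ROUGE-L implementation or its pairing scheme; one plausible sketch, using the rouge-score package and best-match aggregation over the seed examples (both assumptions), is:

```python
# Hedged sketch: average ROUGE-L between generated sequences and seed examples.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"])

def avg_rouge_l(generated, seeds):
    # For each generated sequence, take its best match among the seed examples;
    # a low average indicates the model is not copying its fine-tuning inputs.
    best_matches = [max(scorer.score(seed, gen)["rougeL"].fmeasure for seed in seeds)
                    for gen in generated]
    return sum(best_matches) / len(best_matches)
```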

Can MegaSpeech boost hate speech detection?

In the following section, we report the results of an empirical study, where we train deep learning hate detection models using MegaSpeech. Let us denote those hate and non-hate text sequences in MegaSpeech that were generated given the training examples of dataset $d_i$ by $gen_i^{hate}$ and $gen_i^{non\text{-}hate}$, respectively. In our experiments, we augment the training examples of $d_i$ with those automatically generated examples, ideally providing hate detection classifiers that are trained using this dataset with a more general and truthful approximation of the underlying language distribution.

Our results show consistent improvements in hate detection performance by using the MegaSpeech text sequences. In particular, we show a major boost in classification recall (also known as sensitivity), which is crucial for effective hate speech detection; this impact is due to the greater diversity of hate speech introduced by MegaSpeech compared with the existing datasets.

Generalized Hate Detection: Evaluation

This section presents a comparative study of the generalization capabilities of DL-based hate speech detection, where we augment hate speech datasets with massive amounts of realistic text sequences, sampled from the approximated distribution. Concretely, each one of the datasets in Table 1 was randomly split into a training set (80%) and a testing set (20%). The data generation processes and all the training procedures described below were performed using the training sets, while hate detection performance is evaluated exclusively using the corresponding testing sets. We learn hate detection classification models using two types of training sets: the baseline, which includes several thousand human-generated text samples per dataset (Table 1), and the augmented, constructed by adding $gen_i^{hate}$ (100,000 model-generated hate speech samples) and $gen_i^{non\text{-}hate}$ (100,000 model-generated non-hate speech samples) to the corresponding baseline training set.

Hate speech detection results were obtained using the high-performing DL-based detector [10], as detailed in Figure 1. The DL detector was implemented in TensorFlow [32] and trained using an NVIDIA K-80 GPU, with the Adam [8] optimizer, minimizing the binary cross-entropy loss, with early stopping and a mini-batch size of 32 samples. The convolutional layer consists of 100 filters with a window size of 4, the MaxPooling1D layer uses a pool size of 4, and the GRU outputs 100 hidden units per timestep. Note that in this study we compare hate detection performance given different datasets using a single DL detector; it is not claimed that the chosen detector achieves state-of-the-art results for each of the datasets.
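A minimal Keras sketch of this configuration follows; the vocabulary size and input length are assumptions, and a sigmoid output unit (equivalent to a two-class softmax) matches the binary cross-entropy loss:

```python
# Hedged sketch of the Convolution-GRU detector [10] with the hyperparameters
# stated above: 100 conv filters of width 4, pooling size 4, a 100-unit GRU,
# 300-dimensional embeddings, Adam, and binary cross-entropy.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_detector(vocab_size=30_000, seq_len=30):
    model = models.Sequential([
        layers.Embedding(vocab_size, 300, input_length=seq_len),
        layers.Conv1D(filters=100, kernel_size=4, activation="relu"),
        layers.MaxPooling1D(pool_size=4),
        layers.GRU(100, return_sequences=True),  # 100 hidden units per timestep
        layers.GlobalMaxPooling1D(),
        layers.Dense(1, activation="sigmoid"),   # posterior P(hate | input)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

detector = build_detector()
# Training, with mini-batches of 32 and early stopping as described above:
# detector.fit(x_train, y_train, batch_size=32, validation_split=0.1,
#              callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)])
```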

Performance was evaluated using standard classification metrics: accuracy (A), precision (P), recall (R) and F1 (the harmonic mean of precision and recall). We note that the utilized datasets are generally imbalanced, and so are the derived test sets, including smaller proportions of examples labeled as hate speech vs. non-hate speech. We report evaluation with respect to the hate speech class: Recall corresponds to the proportion of true hate speech examples that were automatically identified as hate speech, and Precision is the proportion of correct predictions within the examples identified as hate speech by the detector. While false-positive predictions may be tracked relatively easily by means of further human inspection, it is impossible to track false negatives at scale. Therefore, increasing recall by means of improved generalization is of great importance in practice.

Experiments

Our study included two types of experiments, which we refer to as intra-dataset and cross-dataset hate speech detection. The intra-dataset experiments applied the DL detector to each dataset independently: performance of the DL detector was measured using the held-out test examples of every dataset $D_i$, having trained the detector with either the baseline training set, or with the augmented training set available for $D_i$. In addition, we formed a combined dataset comprised of the union of all datasets (having joined all training and testing sets, respectively). The cross-dataset experiments aim to evaluate the realistic condition of data shift, where the hate detector is applied to new examples drawn from a data distribution that is different from the one used for training. In these experiments, having the detector trained using the training set of dataset $D_i$, it was evaluated on the held-out testing set of dataset $D_j \neq D_i$. Again, we trained the DL detector using either the baseline or the augmented training set available for dataset $D_i$, aiming to gauge the impact of dataset augmentation on hate detection generalization; ideally, hate detection should maintain good performance when applied to new examples, where, as discussed above, it is crucial to maintain high recall at test time.

There generally exists a trade-off between precision and recall classification performance [33]. Given an input text sequence $x$, the output of the DL detector is the posterior probability $P(\mathrm{class}(x) = \mathrm{hate} \mid x)$, namely, the probability that the input belongs to the "hate" class. The input is classified as "hate" if $P(\mathrm{class}(x) = \mathrm{hate} \mid x) > \tau$, and otherwise as "non-hate", for some pre-defined threshold $\tau$. Different threshold choices alter the trade-off between precision and recall (with high thresholds improving precision).


Figure 1. The DL hate speech detector [10]: an input text sequence is represented by a bag-of-words model, where each word is a one-hot vector. The Embedding layer reduces the dimension of each word vector from the input vocabulary dimension to 300, and the sequence of embedded word vectors is processed by a 1D convolutional layer which extracts features. The 1D MaxPooling layer down-samples the convolutional features, and the resulting down-sampled sequence enters a Gated Recurrent Unit (GRU) which learns the time-dependencies between words. The GRU output is down-sampled by a 1D GlobalMaxPooling layer and converted to a posterior probability (that the input sequence is hate speech) by the output layer.

In order to provide a uniform view of the DL detector performance, we fixed $\tau = 0.7$ across all experiments ($\tau = 0.7$ provided good F1 results on average for the intra-dataset experiments), although tuning $\tau$ per dataset can lead to better results for some datasets.

Intra-Dataset Experiments. Table 3 reports test results of the hate detector trained using the baseline (un-augmented) versus the augmented training sets, for each of the datasets in Table 1, and for the combined dataset. The results indicate that augmenting the training sets with automatically generated examples leads to improvements in Recall and F1 in most cases, peaking at +73.0% and +33.1%, respectively, for the SE dataset. Precision, on the other hand, is decreased for most datasets, and Accuracy changes mildly. This is not surprising, as the training and test sets belong to the same distribution and are highly similar in this setup, whereas the generated text sequences introduce language diversity, as well as some label noise. For the combined dataset, which is relatively diverse, there is yet a boost due to training set augmentation of +48.2% in Recall and +16.1% in F1; Precision decreases by 19.7% and Accuracy is almost unchanged. Therefore, by dataset augmentation, a dramatic drop in false negatives (missed hate speech sequences) is achieved, namely, significantly more hate speech is detected, which is also evident in the improved F1. The decrease in Precision indicates more false positives (non-hate sequences classified as hate); however, these are less severe than missed hate speech sequences. Figure 2 provides the complete Precision-Recall curves for all datasets, computed by altering the detector threshold in the range of $\tau = 0.5$ to $0.95$ in steps of $0.05$. These curves clearly demonstrate that training data augmentation yields significantly improved Recall (and consequently F1) across this range.
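The curves in Figure 2 can be reproduced with a simple threshold sweep; `detector`, `x_test`, and `y_test` are assumed to come from the setup above:

```python
# Hedged sketch: sweep tau from 0.5 to 0.95 in steps of 0.05 and record the
# precision/recall of the "hate" class at each threshold.
import numpy as np
from sklearn.metrics import precision_score, recall_score

def precision_recall_points(detector, x_test, y_test):
    p_hate = detector.predict(x_test).ravel()  # posterior P(hate | x) per example
    points = []
    for tau in np.arange(0.5, 1.0, 0.05):      # 0.5, 0.55, ..., 0.95
        y_pred = (p_hate > tau).astype(int)    # classify as hate iff P > tau
        points.append((round(float(tau), 2),
                       precision_score(y_test, y_pred, zero_division=0),
                       recall_score(y_test, y_pred)))
    return points
```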

Cross-Dataset Experiments. We further evaluated the generalization of the DL detectors under conditions of data shift. Figure 3 shows the results of applying the DL detectors trained using the baseline (non-augmented) training set of $D_i$ to the held-out test set of $D_i$ (intra-dataset), as well as to the held-out test sets of other datasets, $D_j \neq D_i$ (cross-dataset). As shown, there is a significant drop in performance measurements between the first and latter setups: we observe precision drops, and poor recall performance (below 0.2) on the target test sets under conditions of data shift. Therefore, provided with (un-augmented) training sets of a few thousand hate speech examples, the achievable generalization to other hate speech datasets is very limited, resulting in low levels of recall (i.e., a high ratio of missed hate speech).

In another set of experiments, we re-trained the detector with each one of the corresponding augmented training sets, and evaluated it on the same cross-dataset combinations as in the first experiment.


Table 3. Intra-dataset classification results, comparing the baseline and augmented-baseline training sets.

           Accuracy                  Precision                 Recall                    F1
Dataset    Base    Augm.   (%)       Base    Augm.   (%)       Base    Augm.   (%)       Base    Augm.   (%)
WS         0.967   0.977   +1.03     0.968   0.989   +2.17     0.936   0.943   +0.75     0.952   0.966   +1.47
WH         0.891   0.872   -2.13     0.862   0.600   -30.40    0.375   0.582   +55.20    0.523   0.591   +13.00
SE         0.715   0.764   +6.85     0.901   0.767   -14.87    0.367   0.635   +73.02    0.522   0.695   +33.14
DV         0.922   0.935   +1.41     0.929   0.923   -0.65     0.753   0.814   +8.10     0.832   0.865   +3.97
FN         0.956   0.942   -1.46     0.874   0.644   -26.32    0.337   0.515   +52.82    0.486   0.573   +17.90
Combined   0.904   0.905   +0.11     0.895   0.718   -19.77    0.433   0.642   +48.26    0.584   0.678   +16.10

Figure 2. Precision-Recall curves of the DL hate speech detector, trained separately using each dataset (a)-(e) and using the combined dataset (f), comparing baseline and augmented training data results.

Table 4 summarizes the results of the cross-dataset experiments. The metrics, averaged over all cross-dataset pairs, reveal a consistent increase in accuracy (+6.9%), precision (+3.5%), recall (+182.8%) and F1 (+179.7%). That is, augmenting the original labeled training sets with the automatically generated text sequences resulted in a dramatic increase in recall, while maintaining good levels of precision. Overall, this trend is reflected by a steep increase in the combined measure of F1. These results thus provide further confirmation of the benefits of augmenting the data with generated sentences, drawing on a broader data distribution by utilizing a pre-trained language model.


Figure 3. Cross-dataset performance without training set augmentation: in each experiment the DL detector was trained using the training set of one dataset (the left initials of each pair). The trained detector was evaluated on the held-out test set of the same dataset (Intra-Dataset) and on the held-out test set of a different dataset (Cross-Dataset, according to the right initials of each pair). The results clearly indicate a significant decrease in cross-dataset performance, especially in the Recall and F1 metrics. Therefore, the proportion of false negatives (i.e., undetected hate speech) significantly increases, as compared to the intra-dataset results.

Table 4. Cross-dataset performance, comparing detector training using baseline versus augmented-baseline training sets.

Trainset-  Accuracy                  Precision                 Recall                       F1
Testset    Base    Augm.   (%)       Base    Augm.   (%)       Base    Augm.   (%)          Base    Augm.   (%)
FN-SE      0.613   0.645   +5.22     0.689   0.570   -17.27    0.155   0.644   +315.48      0.253   0.605   +139.13
FN-WH      0.846   0.850   +0.47     0.636   0.528   -16.98    0.063   0.507   +704.76      0.114   0.517   +353.50
DV-WH      0.794   0.820   +3.27     0.278   0.441   +58.63    0.189   0.507   +168.25      0.225   0.472   +109.77
DV-WS      0.652   0.832   +27.60    0.471   0.870   +84.71    0.107   0.599   +459.81      0.174   0.709   +307.47
SE-WS      0.662   0.754   +13.89    0.875   0.872   -0.34     0.017   0.331   +1,847.06    0.034   0.480   +1,311.76
SE-FN      0.926   0.924   -0.21     0.694   0.489   -29.53    0.031   0.226   +629.03      0.059   0.309   +423.72
WH-SE      0.584   0.593   +1.54     0.572   0.522   -8.74     0.069   0.450   +552.17      0.124   0.484   +290.32
WH-DV      0.752   0.777   +3.32     0.714   0.610   -14.56    0.053   0.364   +586.79      0.099   0.456   +360.60
WS-WH      0.749   0.824   +10.01    0.332   0.450   +35.54    0.579   0.489   -15.54       0.422   0.469   +11.13
WS-FN      0.822   0.895   +8.88     0.135   0.233   +72.59    0.255   0.176   -30.98       0.177   0.201   +13.55
Average    0.740   0.791   +6.95     0.539   0.559   +3.50     0.151   0.429   +182.81      0.168   0.470   +179.71

CONCLUSIONS

In this work we demonstrated the problems that may occur when training a hate speech classifier on small datasets. We showed the performance decrease that arises when a hate speech classifier is trained on a small dataset and tested on a dataset with a different hate speech language distribution. We proposed a new framework for augmenting hate speech datasets using data generated by a deep generative model, fine-tuned on the available small hate speech datasets. The proposed framework was evaluated using a high-performing DL hate speech detector, demonstrating a significant boost in the generalization capability of the DL detector, both intra-dataset and cross-dataset.

ACKNOWLEDGMENT

This research was supported in part by a Facebook Content Policy Research on Social Media Platforms Research Award.


Table 5. Examples of sequences generated by the GPT-2 language model, fine-tuned on each source training set.

WS
Hate: "What does Islam have to do with democracy and human rights islam is a hateful religion that has inspired the murders of"
Non-hate: "The haters are all over this comment"
Hate: "for starters i know some of you never read this canadian quran is full of excuses why so many muslims are murdering islamists"
Non-hate: "This story originally appeared on National Geographic News"
Hate: "the writer of the sharia law book given to the muslim world in the 1980s is calling for a religious civil war"
Non-hate: "US wants Zambias political crisis to end"

DV
Hate: "#AltRight is so important to white people these days they want to rebrand white genocide as immigration amnesty #WhiteGenocide #NoMoreD"
Non-hate: "A member of the Liberty Bell Choir sings the National Anthem before the game between the New York Yankees"
Hate: "Fox News allowed these other nut jobs on their network to spout hatred towards women and minorities so that they could sell more toys to these scum pigs"
Non-hate: "Who do you think of when you think of one of sports lowest paid employees in league history #yankees #mariners #mlb"
Hate: "we must stop lawless illegal black hearted sand nigger lovers theyre ruining our country my countrymen dont like those people"
Non-hate: "Afghanistans military training looks to be on the brink of a revolution after President Ashraf Ghani pledged to make it so special that it is"

FN
Hate: "this new administration is willing to start a world war over corruption while these illegal immigrants destroying our country are not taken into consideration"
Non-hate: "On Thursday evening a collection of posts appeared on a website with the slogan stop the racist terrorist threats to burma #facebook #burmaco"
Hate: "i wish someone would tell me what the hell is going on with this video i hate it man that fucking nigga"
Non-hate: "what the fuck does feeling good imply hes not feeling good anymore ill shake his hand or give him a hug mfw"
Hate: "Democrats and their media sycophants do not want to talk about why illegal border crossers are stealing from citizens who pay taxes #buildthatwall"
Non-hate: "It was initially deemed a case of falling victim to men in power who are trying to change the #innovation #technology"

WH
Hate: "The number of refugees and immigration controls are clear, now is it time to go to extremes in a desperate attempt to stop them"
Non-hate: "thanks for your reponse this is what our school system is becoming"
Hate: "the left celebrates the death of freedom as it is the road to the extermination of whites and the rebirth of the baboon"
Non-hate: "as an asian guy living in canada i have to say i love it here welcome to stormfront"
Hate: "Mexicans wonder why their country is going to hell when they send their sons and daughters to fight for such disgusting filth"
Non-hate: "i remember when my 3 year old son made a face like the one on the far left in the above cartoon"

SE
Hate: "Afghan refugees now run our towns and streets #buildthatwall #secureourborders #unleashtheinsane"
Non-hate: "Police say a man in his 20s is dead and three others were injured after a brawl broke out at a political rally in Athens"
Hate: "illegal aliens must be deported or even sued if not they will keep coming illegally #nodaca #noamnesty #endbirthrightcitizenship"
Non-hate: "To celebrate the launch of the Fund for Refugees I am inviting you to become a guardian of this important right for every Syrian refugee as we defend refugee rights"
Hate: "Hillary Clinton Confronts Gay Muslim Refugee #Rapefugee who bragged about the way he raped a little boy"
Non-hate: "Republican presidential candidate Donald Trump has vowed to ensure anything is possible with all of our great employees #muslimimmigrationpartners #bayareamus"


REFERENCES

1. S. Kemp, "Digital 2020: Global digital overview," DataReportal, January 2020. [Online]. Available: https://datareportal.com/reports/digital-2020-global-digital-overview

2. Twitter, "Twitter hateful conduct policy." [Online]. Available: https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy

3. ——, "Updating our rules against hateful conduct," March 2020. [Online]. Available: https://blog.twitter.com/en_us/topics/company/2019/hatefulconductupdate.html

4. Facebook, "Community standards: Hate speech." [Online]. Available: https://www.facebook.com/communitystandards/hate_speech

5. J. Peters, "Twitter now bans dehumanizing remarks based on age, disability, and disease," The Verge, March 2020. [Online]. Available: https://www.theverge.com/2020/3/5/21166940/twitter-hate-speech-ban-age-disability-disease-dehumanize

6. M. Zuckerberg, "Hate speech has no place on facebook," The Guardian. [Online]. Available: https://www.theguardian.com/technology/2016/feb/26/mark-zuckerberg-hate-speech-germany-facebook-refugee-crisis

7. I. Lunden, "Facebook to add 3,000 to team reviewing posts with hate speech, crimes, and other harming posts," TechCrunch, May 2017. [Online]. Available: https://techcrunch.com/2017/05/03/facebook-to-hire-3000-to-review-posts-with-hate-speech-crimes-and-other-harming-posts/

8. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org

9. T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

10. Z. Zhang, D. Robinson, and J. Tepper, "Detecting hate speech on twitter using a Convolution-GRU based deep neural network," in European Semantic Web Conference. Springer, 2018, pp. 745–760. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-93417-4_48

11. L. Zhang, S. Wang, and B. Liu, "Deep learning for sentiment analysis: A survey," WIREs Data Mining and Knowledge Discovery, vol. 8, no. 4, p. e1253, 2018. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1253

12. K. Kawaguchi, L. P. Kaelbling, and Y. Bengio, "Generalization in deep learning," ArXiv, 2017. [Online]. Available: https://dspace.mit.edu/bitstream/handle/1721.1/115274/gedl.pdf?sequence=2&isAllowed=y

13. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.

14. T. Davidson, D. Warmsley, M. Macy, and I. Weber, "Automated hate speech detection and the problem of offensive language," in Eleventh International AAAI Conference on Web and Social Media (ICWSM), 2017.

15. P. Burnap and M. L. Williams, "Cyber hate speech on twitter: An application of machine classification and statistical modeling for policy and decision making," Policy & Internet, vol. 7, no. 2, pp. 223–242, 2015.

16. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.

17. A. M. Founta, D. Chatzakou, N. Kourtellis, J. Blackburn, A. Vakali, and I. Leontiadis, "A unified deep learning architecture for abuse detection," in Proceedings of the 10th ACM Conference on Web Science, 2019.

18. J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

19. P. Badjatiya, S. Gupta, M. Gupta, and V. Varma, "Deep learning for hate speech detection in tweets," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, pp. 759–760.

20. Z. Waseem and D. Hovy, "Hateful symbols or hateful people? predictive features for hate speech detection on twitter," in Proceedings of the NAACL Student Research Workshop, 2016, pp. 88–93.

21. V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. R. Pardo, P. Rosso, and M. Sanguinetti, "Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter," in Proceedings of the 13th International Workshop on Semantic Evaluation, 2019, pp. 54–63.

22. A. M. Founta, C. Djouvas, D. Chatzakou, I. Leontiadis, J. Blackburn, G. Stringhini, A. Vakali, M. Sirivianos, and N. Kourtellis, "Large scale crowdsourcing and characterization of twitter abusive behavior," in International AAAI Conference on Web and Social Media, 2018.

23. O. de Gibert, N. Perez, A. García-Pablos, and M. Cuadros, "Hate speech dataset from a white supremacy forum," in Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 11–20. [Online]. Available: https://www.aclweb.org/anthology/W18-5102

24. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT, 2019.

25. C. Sun, X. Qiu, Y. Xu, and X. Huang, "How to fine-tune BERT for text classification?" in China National Conference on Chinese Computational Linguistics. Springer, 2019, pp. 194–206.

26. R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1631–1642.

27. A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "Glue: A multi-task benchmark and analysis platform for natural language understanding," in EMNLP, 2018.

28. T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew, "Huggingface's transformers: State-of-the-art natural language processing," ArXiv, vol. abs/1910.03771, 2019.

29. J. L. Fleiss, B. Levin, and M. C. Paik, Statistical Methods for Rates and Proportions. John Wiley & Sons, 2013.

30. S. Witteveen and M. Andrews, "Paraphrasing with large language models," in Proceedings of the 3rd Workshop on Neural Generation and Translation. Association for Computational Linguistics, 2019.

31. C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available: https://www.aclweb.org/anthology/W04-1013

32. TensorFlow. [Online]. Available: https://www.tensorflow.org/

33. E. Minkov, R. C. Wang, A. Tomasic, and W. Cohen, "Ner systems that suit users preferences: adjusting the recall-precision trade-off for entity extraction," in Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, 2006, pp. 93–96.