A Computational Framework for Slang Generation

Zhewei Sun (1), Richard Zemel (1,2), Yang Xu (1,2)

(1) Department of Computer Science, University of Toronto, Toronto, Canada
(2) Vector Institute for Artificial Intelligence, Toronto, Canada

{zheweisun, zemel, yangxu}@cs.toronto.edu

Abstract

Slang is a common type of informal language, but its flexible nature and paucity of data resources present challenges for existing natural language systems. We take an initial step toward machine generation of slang by developing a framework that models the speaker's word choice in slang context. Our framework encodes novel slang meaning by relating the conventional and slang senses of a word while incorporating syntactic and contextual knowledge in slang usage. We construct the framework using a combination of probabilistic inference and neural contrastive learning. We perform rigorous evaluations on three slang dictionaries and show that our approach not only outperforms state-of-the-art language models, but also better predicts the historical emergence of slang word usages from the 1960s to the 2000s. We interpret the proposed models and find that the contrastively learned semantic space is sensitive to the similarities between slang and conventional senses of words. Our work creates opportunities for the automated generation and interpretation of informal language.

1 Introduction

Slang is a common type of informal language that appears frequently in daily conversations, social media, and mobile platforms. The flexible and ephemeral nature of slang (Eble, 1989; Landau, 1984) poses a fundamental challenge for computational representation of slang in natural language systems. As of today, slang constitutes only a small portion of text corpora used in the natural language processing (NLP) community, and it is severely under-represented in standard lexical resources (Michel et al., 2011). Here we propose a novel framework for automated generation of slang with a focus on generative modeling of slang word meaning and choice.

[Figure 1 (illustration omitted): the sentence "I have a feeling he's gonna ice himself someday.", with the slang word ice linked to its conventional senses ("frozen water; cold juice; unfriendliness") and its novel slang sense ("to kill").]

Figure 1: A slang generation framework that models the speaker's choice of a slang term (ice) based on the novel sense ("to kill") in context and relations with conventional senses (e.g., "frozen water").

Existing language models trained on large-scale text corpora have shown success in a variety of NLP tasks. However, they are typically biased toward formal language and under-represent slang. Consider the sentence "I have a feeling he's gonna ___ himself someday". Directly applying a state-of-the-art GPT-2 (Radford et al., 2019) based language infilling model (e.g., Donahue et al. (2020)) would result in the retrieval of kill as the most probable word choice (probability = 7.7%). However, such a language model is limited and near-insensitive to slang usage, e.g., ice—a common slang alternative for kill—received virtually zero probability, suggesting that existing models of distributional semantics, even the transformer-type models, do not capture slang effectively, if at all.

Our goal is to extend the capacity of NLP systems toward slang in a principled framework. As an initial step, we focus on modeling the generative process of slang, specifically the problem of slang word choice that we illustrate in Figure 1. Given an intended slang sense such as "to kill", we ask how we can emulate the speaker's choice of slang word(s) in informal context.[1]

We are particularly interested in how the speaker chooses existing words from the lexicon and makes innovative use of those words in novel slang context (such as the use of ice in Figure 1).

Our basic premise is that sensible slang word choice depends on linking conventional or established senses of a word (such as "frozen water" for ice) to its emergent slang senses (such as "to kill" for ice). For instance, the extended use of ice to express killing could have emerged from the use of cold ice to refrigerate one's remains. A principled semantic representation should adapt to such similarity relations. Our proposed framework is aimed at encoding slang that relates informal and conventional word senses, hence capturing semantic similarities beyond those from existing language models. In particular, contextualized embedding models such as BERT would consider "frozen water" to be semantically distant from or irrelevant to "to kill", so they cannot predict ice to be appropriate for expressing "to kill" in slang context.

The capacity for generating novel slang word usages will have several implications and applications. From a scientific view, modeling the generative process of slang word choice will help explain the emergence of novel slang usages over time—we show how our framework can predict the emergence of slang in the history of English. From a practical perspective, automated slang generation paves the way for automated slang interpretation. Existing psycho-linguistic work suggests that language generation and comprehension rely on similar cognitive processes (e.g., Pickering and Garrod (2013); Ferreira Pinto Jr. and Xu (2021)). Similarly, a generative model of slang can be an integral component of slang comprehension that informs the relation between a candidate sense and a query word, where the mapping can be unseen during training. Furthermore, a generative approach to slang may also be applied to downstream tasks such as naturalistic chatbots, sentiment analysis, and sarcasm detection (see work by Aly and van der Haar (2020) and Wilson et al. (2020)).

We propose a neural-probabilistic framework that involves three components: 1) a probabilistic choice model that infers an appropriate word for expressing a query slang meaning given its context, 2) an encoder based on contrastive learning that captures slang meaning in a modified embedding space, and 3) a prior that incorporates different forms of context. Specifically, the slang encoder we propose transforms slang and conventional senses of a word into a slang-sensitive embedding space where they will lie in close proximity. As such, senses like "frozen water", "unfriendliness" and "to kill" will be encouraged to be in close proximity in the learned embedding space. Furthermore, the resulting embedding space will also set apart slang senses of a word from senses of other unrelated words, hence contrasting within-word senses from across-word senses in the lexicon. A practical advantage of this encoding method is that semantic similarities pertinent to slang can be extracted automatically from a small amount of training data, and the learned semantic space will be sensitive to slang.

[1] We will use the terms meaning and sense interchangeably.

Our framework also captures the flexible nature of slang usages in natural context. Here, we focus on syntax and linguistic context, although our framework should allow for the incorporation of social or extra-linguistic features as well. Recent work has found that the flexibility of slang is reflected prominently in syntactic shift (Pei et al., 2019). For example, ice—most commonly used as a noun—is used as a verb to express "to kill" (in Figure 1). We build on these findings by incorporating syntactic shift as a prior in the probabilistic model, which is integrated coherently with the contrastive neural encoder that captures flexibility in slang sense extension. We also show how a contextualized language infilling model can provide additional prior information from linguistic context (cf. Erk (2016)).

To preview our results, we show that our framework yields a substantial improvement in the accuracy of slang generation against state-of-the-art embedding methods including deep contextualized models, in both few-shot and zero-shot settings. We evaluate our framework rigorously on three datasets constructed from slang dictionaries and in a historical prediction task. We show evidence that the learned slang embedding space yields intuitive interpretation of slang and offers future opportunities for informal natural language processing.

2 Related Work

NLP for Non-Literal Language. Machine processing of non-literal language has been explored in different linguistic phenomena including metaphor (Shutova et al., 2013b; Veale et al., 2016; Gao et al., 2018; Dankers et al., 2019), metonymy (Lapata and Lascarides, 2003; Nissim and Markert, 2003; Shutova et al., 2013a), irony (Filatova, 2012), neologism (Cook, 2010), idiom (Fazly et al., 2009; Liu and Hwa, 2018), vulgarity (Holgate et al., 2018), and euphemism (Magu and Luo, 2018). Non-literal usages are present in slang, but these existing studies do not directly model the semantics of slang. In addition, work in this area has typically focused on detection and comprehension. In contrast, generation of novel informal language use has been sparsely tackled.

Computational Studies of Slang. Slang has been extensively studied as a social phenomenon (Mattiello, 2005), where social variables such as gender (Blodgett et al., 2016), ethnicity (Bamman et al., 2014), and socio-economic status (Labov, 1972, 2006) have been shown to play important roles in slang construction. More recently, an analysis of social media text has shown that linguistic features also correlate with the survival of slang terms, where linguistically appropriate terms have a higher likelihood of being popularized (Stewart and Eisenstein, 2018).

Recent work in the NLP community has also analyzed slang. Ni and Wang (2017) studied slang comprehension as a translation task. In their study, both the spelling of a word and its context are provided as input to a translation model to decode a definition sentence. Pei et al. (2019) proposed end-to-end neural models to detect and identify slang automatically in natural sentences.

Kulkarni and Wang (2018) have proposed computational models that derive novel word forms of slang from spellings of existing words. Here, we instead explore the generation of novel slang usage from existing words and focus on word sense extension toward slang context, based on the premise that new slang senses are often derived from existing conventional word senses.

Our work is inspired by previous research suggesting that slang relies on reusing words in the existing lexicon (Eble, 2012). Previous work has applied cognitive models of categorization to predict novel slang usage (Sun et al., 2019). In that work, the generative model is motivated by research on word sense extension (Ramiro et al., 2018). In particular, slang generation is operationalized by categorizing slang senses based on their similarities to dictionary definitions of candidate words, enhanced by collaborative filtering (Goldberg et al., 1992) to capture the fact that words with similar senses are likely to extend to similar novel senses (Lehrer, 1985). However, this approach presupposes that slang senses are similar to conventional senses of a word represented in standard embedding space, an assumption that is not warranted and yet to be addressed.

Our work goes beyond the existing work in three important aspects: 1) We capture semantic flexibility of slang usage by contributing a novel method based on contrastive learning. Our method encodes slang meaning and conventional meaning of a word under a common embedding space, thereby improving the inadequate existing methodology for slang generation that uses common, slang-insensitive embeddings; 2) We capture syntactic and contextual flexibility of slang usage in a coherent probabilistic framework, an aspect that was ignored in previous work; 3) We rigorously test our framework against slang sense definition entries from three large slang dictionaries and contribute a new dataset for slang research to the community.

Contrastive Learning. Contrastive learning is a semi-supervised learning technique used to extract semantic representations in data-scarce situations. It can be incorporated into neural networks in the form of twin networks, where two exact copies of an encoder network are applied to two different examples. The encoded representations are then compared and back-propagated. Alternative loss schemes such as Triplet (Weinberger and Saul, 2009; Wang et al., 2014) and Quadruplet loss (Law et al., 2013) have also been proposed to enhance stability in training. In NLP, contrastive learning has been applied to learn similarities between text (Mueller and Thyagarajan, 2016; Neculoiu et al., 2016) and speech utterances (Kamper et al., 2016) with recurrent neural networks.

The contrastive learning method we develop has two main differences: 1) We do not use recurrent encoders because they perform poorly on dictionary-based definitions; 2) We propose a joint neural-probabilistic framework on the learned embedding space instead of resorting to methods such as nearest-neighbor search for generation.


3 Computational Framework

Our computational framework for slang generation comprises three interrelated components: 1) A probabilistic formulation of word choice, extending that in Sun et al. (2019) to leverage encapsulated slang senses from a modified embedding space; 2) A contrastive encoder—inspired by variants of the twin network (Baldi and Chauvin, 1993; Bromley et al., 1994)—that constructs a modified embedding space for slang by adapting the conventional embeddings to incorporate new senses of slang words; 3) A contextually informed prior for capturing flexible uses of naturalistic slang.

3.1 Probabilistic Slang Choice Model

Given a query slang sense M_S and its context C_S, we cast the problem of slang generation as inference over candidate words w in our vocabulary. Assuming all candidate words w are drawn from a fixed vocabulary V, the posterior is as follows:

P(w | M_S, C_S) ∝ P(M_S | w, C_S) P(w | C_S)
              ∝ P(M_S | w) P(w | C_S)    (1)

Here, we define the prior P(w | C_S) based on regularities of syntax and/or linguistic context in slang usage (described in Section 3.4). We formulate the likelihood P(M_S | w)[2] by specifying the relations between conventional senses of word w (denoted by M_w = {M_w1, M_w2, ..., M_wm}, i.e., the set of senses drawn from a standard dictionary) and the query M_S (i.e., the slang sense that is outside the standard dictionary). Specifically, we model the likelihood by a similarity function that measures the proximity between the slang sense M_S and the set of conventional senses M_w of word w in a continuous, embedded semantic space:

P(M_S | w) = P(M_S | M_w) ∝ f({sim(E_S, E_wi) : E_wi ∈ E_w})    (2)

Here, f(·) is a similarity function in range [0, 1], while E_S and E_w represent semantic embeddings of the slang sense M_S and the set of conventional senses M_w. We derive these embeddings from contrastive learning, which we describe in detail in Section 3.2, and we compare this proposed method with baseline methods that draw embeddings from existing sentence embedding models.

[2] Here, we only consider linguistically motivated context as C_S and assume the semantic shift patterns of slang are universal across all such contexts.

Our choice of the similarity function f(·) is motivated by prior work on few-shot classification. Specifically, we consider variants of two established methods: One Nearest Neighbor (1NN) matching (Koch et al., 2015; Vinyals et al., 2016) and Prototypical learning (Snell et al., 2017).

The 1NN model postulates that a candidate word should be chosen according to the similarity between the query slang sense and the closest conventional sense:

f_1NN(E_S, E_w) = max_{E_wi ∈ E_w} sim(E_S, E_wi)    (3)

In contrast, the prototype model postulates that a candidate word should be chosen if its aggregate (or average) sense is in close proximity to the query slang sense:

f_prototype(E_S, E_w) = sim(E_S, E_w^prototype), where E_w^prototype = (1 / |E_w|) Σ_{E_wi ∈ E_w} E_wi    (4)

In both cases, the similarity between two senses is defined by the exponentiated negative squared Euclidean distance in the semantic embedding space:

sim(E_S, E_w) = exp(−||E_S − E_w||² / h_s)    (5)

Here, h_s is a learned kernel width parameter.
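As an illustration of Equations (1)–(5), the scoring of candidate words can be written compactly. The following is a minimal NumPy sketch (not the authors' released implementation); the embedding inputs, the prior vector, and the fitted kernel width h_s are assumed to be supplied by the surrounding pipeline.

```python
import numpy as np

def sim(e_s, E_w, h_s):
    """Exponentiated negative squared Euclidean distance, Eq. (5)."""
    d2 = np.sum((E_w - e_s) ** 2, axis=-1)
    return np.exp(-d2 / h_s)

def likelihood(e_s, E_w, h_s, mode="prototype"):
    """P(M_S | w) up to a constant, Eqs. (2)-(4)."""
    if mode == "1nn":                              # closest conventional sense
        return sim(e_s, E_w, h_s).max()
    prototype = E_w.mean(axis=0)                   # average of conventional senses
    return sim(e_s, prototype[None, :], h_s)[0]

def posterior(e_s, sense_embs, prior, h_s, mode="prototype"):
    """P(w | M_S, C_S) over the vocabulary, Eq. (1).

    sense_embs: list of (n_senses, d) arrays, one per candidate word.
    prior:      (|V|,) array of prior probabilities P(w | C_S).
    """
    scores = np.array([likelihood(e_s, E_w, h_s, mode) for E_w in sense_embs])
    scores = scores * prior
    return scores / scores.sum()
```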

We also consider an enhanced version of the posterior using collaborative filtering (Goldberg et al., 1992), where words with similar meaning are predicted to shift to similar novel slang meanings. We operationalize this by summing over the close neighborhood L(w) of a candidate word:

P(w | M_S, C_S) = Σ_{w' ∈ L(w)} P(w | w') P(w' | M_S, C_S)    (6)

Here, P(w' | M_S, C_S) is a fixed term calculated identically as in Equation (1), and P(w | w') is the weighting of words in the close neighborhood of a candidate word w. This weighting probability is set proportional to the exponentiated negative cosine distance between w and its neighbor w', defined in a pre-trained word embedding space, and the kernel parameter h_cf is also estimated from the training data:

P(w | w') ∝ sim(w, w') = exp(−d(w, w') / h_cf)    (7)

Here, d(w, w') is the cosine distance between the two words in a word embedding space.
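For concreteness, the collaborative-filtering step in Equations (6) and (7) can be sketched as below. This is a minimal NumPy illustration under stated assumptions (function and variable names are our own; the neighborhood is taken to be the k nearest words by cosine distance, as in Section 4.5, and the word itself is included in its own neighborhood here):

```python
import numpy as np

def neighbour_weights(word_vecs, idx, k=5, h_cf=1.0):
    """P(w | w') over the k nearest neighbours of word idx, Eq. (7).

    word_vecs: (|V|, d) array of word embeddings (e.g., fastText).
    Cosine distance is weighted by an exponential kernel with width h_cf.
    """
    v = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    cos_dist = 1.0 - v @ v[idx]
    nbrs = np.argsort(cos_dist)[:k + 1]            # nearest words, incl. idx itself
    w = np.exp(-cos_dist[nbrs] / h_cf)
    return nbrs, w / w.sum()

def cf_posterior(base_posterior, word_vecs, k=5, h_cf=1.0):
    """Collaborative-filtering smoothing of P(w | M_S, C_S), Eq. (6)."""
    smoothed = np.zeros_like(base_posterior)
    for idx in range(len(base_posterior)):
        nbrs, weights = neighbour_weights(word_vecs, idx, k, h_cf)
        smoothed[idx] = np.dot(weights, base_posterior[nbrs])
    return smoothed / smoothed.sum()
```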


3.2 Contrastive Semantic Encoding (CSE)

We develop a contrastive semantic encoder for constructing a new embedding space representing slang and conventional word senses that do not bear surface similarities. For instance, the conventional sense of kick such as "propel with foot" can hardly be related to the slang sense of kick such as "a strong flavor". The contrastive embedding space we construct seeks to redefine or warp similarities, such that the otherwise unrelated senses will be in closer proximity than they are under existing embedding methods. For example, two metaphorically related senses can bear strong similarity in slang usage, even though they may be far apart in a literal sense.

We sample triplets of word senses as input to contrastive learning, following work on twin networks (Baldi and Chauvin, 1993; Bromley et al., 1994; Chopra et al., 2005; Koch et al., 2015). We use dictionary definitions of conventional and slang senses to obtain the initial sense embeddings (see Section 4.4 for details). Each triplet consists of 1) an anchor slang sense M_S, 2) a positive conventional sense M_P, and 3) a negative conventional sense M_N. The positive sense should ideally be encouraged to lie close to the anchor slang sense (in the resulting embedding space), whereas the negative sense should ideally be further away from both the positive conventional and anchor slang senses. Section 3.3 describes the detailed sampling procedure.

Our triplet network uses a single neural encoder g to project each word sense representation into a joint embedding space in R^d:

E_S = g(M_S);  E_P = g(M_P);  E_N = g(M_N)    (8)

We choose a 1-layer fully connected network with ReLU (Nair and Hinton, 2010) as the encoder g for pre-trained word vectors (e.g., fastText). For the contextualized embedding models we consider, g is a Transformer encoder (Vaswani et al., 2017). In both cases, we apply the same encoder network g to each of the three inputs. We train the triplet network using the max-margin triplet loss (Weinberger and Saul, 2009), where the squared distance between the positive pair is constrained to be closer than that of the negative pair by a margin m:

L_triplet = [m + ||E_S − E_P||² − ||E_S − E_N||²]_+    (9)
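A minimal PyTorch sketch of the shared encoder g and the max-margin triplet loss in Equations (8) and (9) might look as follows. This is an illustration rather than the authors' code; the layer sizes follow the fastText configuration in Section 4.5, and the inputs are assumed to be pre-computed definition embeddings.

```python
import torch
import torch.nn as nn

class ContrastiveSenseEncoder(nn.Module):
    """Shared encoder g applied to anchor, positive and negative senses (Eq. 8)."""
    def __init__(self, in_dim=300, hidden=1000, out_dim=300):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)

def triplet_loss(e_s, e_p, e_n, margin=0.1):
    """Max-margin triplet loss with squared Euclidean distances (Eq. 9)."""
    pos = ((e_s - e_p) ** 2).sum(dim=-1)
    neg = ((e_s - e_n) ** 2).sum(dim=-1)
    return torch.clamp(margin + pos - neg, min=0.0).mean()

# Usage sketch: m_s, m_p, m_n are batches of definition embeddings.
# encoder = ContrastiveSenseEncoder()
# loss = triplet_loss(encoder(m_s), encoder(m_p), encoder(m_n))
# loss.backward()
```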

3.3 Triplet Sampling

To train the triplet network, we build data triplets from every slang lexical entry in our training set. For each slang sense M_S of word w, we create a positive pair with each conventional sense M_wi of the same word w. Then, for each positive pair, we sample a negative example every training epoch by randomly selecting a conventional sense M_w' from a word w' that is sufficiently different from w, such that the corresponding definition sentence D_w' has less than 20% overlap in the set of content words compared to M_S and any conventional definition sentence D_wi of word w. We rank all candidate words in our vocabulary against w by computing cosine distances from pre-trained word embeddings and consider a word w' to be sufficiently different if it is not in the top 20 percent.

Neighborhood Sampling (NS). In addition to using conventional senses of the matching word w for constructing positive pairs, we also sample positive senses from a small neighborhood L(w) of similar words. This sampling strategy provides linguistic knowledge from parallel semantic change to encourage neighborhood structure in the learned embedding space. Sampling from neighboring words also augments the size of the training data considerably in this data-scarce task. We sample negative senses in a similar way, except that we also consider all conventional definition sentences from neighboring words when checking for overlapping senses.
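A possible implementation of the per-epoch negative sampling step, under the content-word-overlap and cosine-distance criteria described above, is sketched below; the helper names and data structures are hypothetical and the tokenization is simplified.

```python
import random

def content_words(definition, stopwords):
    """Lower-cased content-word set of a definition sentence."""
    return {t for t in definition.lower().split() if t not in stopwords}

def overlap(def_a, def_b, stopwords):
    """Fraction of def_a's content words that also appear in def_b."""
    a, b = content_words(def_a, stopwords), content_words(def_b, stopwords)
    return len(a & b) / max(len(a), 1)

def sample_negative(word, slang_def, conv_defs, vocab_by_distance, stopwords,
                    top_frac=0.2, max_overlap=0.2, max_tries=1000):
    """Sample a negative conventional sense for one (anchor, positive) pair.

    vocab_by_distance: candidate words sorted by cosine distance to `word`
                       (nearest first); the nearest top_frac are excluded.
    conv_defs:         dict mapping each word to its conventional definitions.
    """
    cutoff = int(top_frac * len(vocab_by_distance))
    far_words = vocab_by_distance[cutoff:]
    references = [slang_def] + conv_defs.get(word, [])
    for _ in range(max_tries):
        w_neg = random.choice(far_words)
        d_neg = random.choice(conv_defs[w_neg])
        if all(overlap(d_neg, ref, stopwords) < max_overlap for ref in references):
            return w_neg, d_neg
    return None  # no suitable negative found within max_tries
```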

3.4 Contextual Prior

The final component of our framework is the prior P(w | C_S) (see Equation (1)) that captures flexible use of slang words with regard to syntax and distributional semantics. For example, slang exhibits flexible Part-of-Speech (POS) shift, e.g., the noun→verb transition as in the example ice, and surprisals in linguistic context, e.g., ice in "I have a feeling he's gonna [blank] himself someday." Here, we formulate the context C_S in two forms: 1) a syntactic-shift prior, namely the POS information P_S to capture syntactic regularities in slang, and/or 2) a linguistic context prior, namely the linguistic context K_S to capture distributional semantic context when this is available in the data.

Syntactic-Shift Prior (SSP). Given a query POS tag P_S, we construct the syntactic prior by comparing the POS distribution P_w from literal natural usage of a candidate word w with a smoothed POS distribution centered at the query tag. However, we cannot directly compare this query distribution to P_w because slang usage often involves shifting POS (Eble, 2012; Pei et al., 2019). To account for this, we apply a transformation T obtained by counting the number of POS transitions for each slang-conventional definition pair in the training data. Each column of the transformation matrix T is then normalized, so column i of T can be interpreted as the expected slang-informed POS distribution given the i'th POS tag in conventional context (e.g., the noun column gives the expected slang POS distribution of a word that is used exclusively as a noun in conventional usage). The slang-contextualized POS distribution P*_S can then be computed by applying T to the query distribution P_S: P*_S = T × P_S. The prior can be estimated by comparing the POS distributions P_w and P*_S via the Kullback-Leibler (KL) divergence:

P(w | C_S) = P(w | P_S) ∝ exp(−KL(P_w, P*_S))^(1/2)    (10)

Intuitively, this prior captures the regularities of syntactic shift in slang usage, and it favors candidate words with POS characteristics that fit well with the queried POS tag in a slang context.
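The syntactic-shift prior can be assembled from POS counts roughly as follows. This is a hedged NumPy sketch of Equation (10) with placeholder inputs; the POS tag inventory, the smoothed query distribution, and the training POS pairs are assumed to be prepared elsewhere.

```python
import numpy as np

def pos_transition_matrix(train_pos_pairs, n_tags, eps=1e-6):
    """Column-normalized POS transition counts T.

    train_pos_pairs: iterable of (conventional_tag, slang_tag) index pairs;
    column i of T is the expected slang POS distribution given conventional tag i.
    """
    T = np.full((n_tags, n_tags), eps)          # small smoothing for empty columns
    for conv_tag, slang_tag in train_pos_pairs:
        T[slang_tag, conv_tag] += 1
    return T / T.sum(axis=0, keepdims=True)

def syntactic_prior(P_w, P_S, T, eps=1e-12):
    """Normalized P(w | P_S) for every candidate word, Eq. (10).

    P_w: (|V|, n_tags) natural POS distributions of candidate words.
    P_S: (n_tags,) smoothed POS distribution centred at the query tag.
    """
    P_star = T @ P_S                             # slang-contextualized query POS
    kl = np.sum(P_w * np.log((P_w + eps) / (P_star + eps)), axis=1)
    prior = np.exp(-kl) ** 0.5
    return prior / prior.sum()
```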

Linguistic Context Prior (LCP). We apply a language model P_LM to a given linguistic context K_S to estimate the probability of each candidate word:

P(w | C_S) = P(w | K_S) ∝ P_LM(w | K_S) + α    (11)

Here, α is a Laplace smoothing constant. We use the GPT-2 based language infilling model from Donahue et al. (2020) as P_LM and discuss the implementation in Section 4.3.
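Given per-word scores from the infilling language model, the linguistic context prior of Equation (11) amounts to Laplace smoothing followed by normalization. A minimal sketch, with the LM scores assumed to be computed externally:

```python
import numpy as np

def linguistic_context_prior(lm_scores, alpha=0.001):
    """P(w | K_S) from infilling LM scores with Laplace smoothing, Eq. (11).

    lm_scores: (|V|,) array of P_LM(w | K_S) for the blanked-out slot,
    one entry per candidate word in the vocabulary.
    """
    prior = np.asarray(lm_scores) + alpha
    return prior / prior.sum()
```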

4 Experimental Setup

4.1 Lexical Resources

We collected lexical entries of slang and conventional words/phrases from three separate online dictionaries:[3] 1) Online Slang Dictionary (OSD),[4] 2) Green's Dictionary of Slang (GDoS) (Green, 2010),[5] and 3) an open-source subset of Urban Dictionary (UD) data from Kaggle.[6] In addition, we gathered dictionary definitions of conventional senses of words from the online version of the Oxford Dictionary (OD).[7]

[3] We obtained written permissions from all authors for the datasets that we use for this work.
[4] OSD: http://onlineslangdictionary.com
[5] GDoS: https://greensdictofslang.com
[6] UD: https://www.kaggle.com/therohk/urban-dictionary-words-dataset

Slang Dictionaries. Both slang dictionaries (OSD and GDoS) are freely accessible online and contain slang definitions with meta-data such as Part-of-Speech tags. Each data entry contains the word, its slang definition, and its part-of-speech (POS) tag. In particular, OSD includes example sentence(s) for each slang entry, which we leverage as linguistic context, and GDoS contains time-tagged references that allow us to perform historical prediction (described later). We removed all acronyms (i.e., fully capitalized words) as they generally do not extend meaning, as well as slang definitions that share more than 50% of content words with any of their corresponding conventional definitions, to account for conventionalized slang. We also removed slang with novel word forms where no conventional sense definitions are available. Slang phrases were treated as unigrams because our task only concerns the association between senses and lexical items. Each sense definition was considered a data point during both learning and prediction. We later partitioned definition entries from each dataset to be used for training, validation, and testing. Note that a word may appear in both training and testing, but the word-sense pairings are unique (see Section 5.3 for discussion).

Conventional Word Senses. We focused on the subset of OD containing word forms that are also available in the slang datasets described. For each word entry, we removed all definitions that have been tagged as informal, because these are likely to represent slang senses. This results in 10,091 and 29,640 conventional sense definitions corresponding to the OSD and GDoS datasets, respectively.

Data Split. We used all definition entries from the slang resources such that the corresponding slang word also exists in the collected OD subset. The resulting datasets (OSD and GDoS) had 2,979 and 29,300 definition entries respectively, from 1,635 and 6,540 unique slang words, of which 1,253 are shared across both dictionaries. For each dataset, the slang definition entries were partitioned into a 90% training set and a 10% test set. 5% of the data in the training set were set aside for validation when training the contrastive encoder.

[7] OD: https://en.oxforddictionaries.com


Urban Dictionary. In addition to the two datasets described above, we provide a third dataset based on Urban Dictionary (UD) that is made available via Kaggle. Unlike the previous two datasets, we are able to make this one publicly available without requiring one to obtain prior permission from the data owners.[8] To guard against the crowd-sourced and noisy nature of UD, we ensure quality by keeping a definition entry only if 1) it has at least 10 more upvotes than downvotes, 2) the word entry exists in one of OSD or GDoS, and 3) at least one of the corresponding definition sentences in these dictionaries has a 20% or greater overlap in the set of content words with the UD definition sentence. We also remove entries with more than 50% overlap in content words with any other UD slang definition under the same word, to remove duplicated senses. This results in 2,631 definition entries from 1,464 unique slang words. The corresponding OD subset has 10,357 conventional sense entries. We find entries from UD to be more stylistically variable and lengthier, with a mean entry length of 9.73 in comparison to 7.54 and 6.48 for OSD and GDoS, respectively.

4.2 Part-of-Speech Data

The natural POS distribution P_w for each candidate word w is obtained using POS counts from the most recent available decade of the HistWords project (Hamilton et al., 2016). For word entries that are not available, mostly phrases, we estimate P_w by counting POS tags from Oxford Dictionary (OD) entries of w.

When estimating the slang POS transformation for the syntactic prior, we mapped all POS tags into one of the following six categories for the OSD experiments: {verb, other, adv, noun, interj, adj}. For GDoS, the tag 'interj' was excluded as it is not present in the dataset.

4.3 Contextualized Language Model Baseline

We considered a state-of-the-art GPT-2 based language infilling model from Donahue et al. (2020) as both a baseline model and a prior to our framework (on the OSD data, where context sentences are available for the slang entries). For each entry, we blanked out the corresponding slang word in the example sentence, effectively treating our task as a cloze task. We applied the infilling model to obtain probability scores for each of the candidate words and applied a Laplace smoothing of 0.001. We fine-tuned the LM infilling model using all example sentences in the OSD training set until convergence. We also experiment with a combined prior where the two priors are combined using element-wise multiplication and re-normalization.

[8] Code and data available at: https://github.com/zhewei-sun/slanggen
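A small sketch of the steps described above is given below: blanking out the slang word to form a cloze query, and combining the two priors by element-wise multiplication with re-normalization. The blank token and helper names are hypothetical; the actual infilling interface of Donahue et al. (2020) may differ.

```python
import numpy as np

def make_cloze(sentence, slang_word, blank="[blank]"):
    """Blank out the slang word in its example sentence (hypothetical format;
    the placeholder token expected by the infilling model may differ)."""
    return sentence.replace(slang_word, blank, 1)

def combined_prior(ssp, lcp):
    """Combine syntactic-shift and linguistic-context priors by
    element-wise multiplication and re-normalization (Section 4.3)."""
    p = np.asarray(ssp) * np.asarray(lcp)
    return p / p.sum()
```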

4.4 Baseline Embedding Methods

To compare with and compute the baseline embedding methods M for definition sentences, we used 300-dimensional fastText embeddings (Bojanowski et al., 2017) pre-trained with subword information on 600 billion tokens from Common Crawl[9] as well as 768-dimensional Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) encoders pre-trained on Wikipedia and fine-tuned on NLI datasets (Bowman et al., 2015; Williams et al., 2018). The fastText embeddings were also used to compute cosine distances d(w, w') in Eq. (7). Embeddings for phrases and the fastText-based sentence embeddings were both computed by applying average pooling to normalized word-level embeddings of all content words. In the case of SBERT, we fed in the original definition sentence.
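For the fastText-based baseline, the sentence embedding of a definition reduces to average pooling over normalized content-word vectors. A minimal sketch, where the stopword list and vector lookup are assumed inputs:

```python
import numpy as np

def definition_embedding(definition, word_vecs, stopwords, dim=300):
    """Average-pooled embedding of a definition's content words (Section 4.4).

    word_vecs: dict mapping a token to its fastText vector; each content-word
    vector is L2-normalized before averaging.
    """
    pooled = []
    for tok in definition.lower().split():
        if tok in stopwords or tok not in word_vecs:
            continue
        v = np.asarray(word_vecs[tok], dtype=float)
        pooled.append(v / (np.linalg.norm(v) + 1e-12))
    if not pooled:
        return np.zeros(dim)
    return np.mean(pooled, axis=0)
```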

4.5 Training Procedures

We trained the triplet networks for a maximum of 20 epochs using Adam (Kingma and Ba, 2015) with a learning rate of 10^-4 for fastText and 2^-5 for SBERT-based models. We preserved the dimensions of the input sense vectors for the contrastive embeddings learned by the triplet network (that is, 300 for fastText and 768 for SBERT). We used 1,000 fully-connected units in the contrastive encoder's hidden layer for fastText-based models. Triplet margins of 0.1 and 1.0 were used with fastText and SBERT embeddings, respectively.

We trained the probabilistic classification framework by minimizing the negative log likelihood of the posterior P(w* | M_S, C_S) on the ground-truth words for all definition entries in the training set. We jointly optimized kernel width parameters using L-BFGS-B (Byrd et al., 1995). To construct a word w's neighborhood L(w) in both collaborative filtering and triplet sampling, we considered the 5 closest words in cosine distance of their fastText embeddings.

[9] http://commoncrawl.org


[Figure 2 (plot omitted). ROC curves (x-axis: number of retrievals; y-axis: precision) with legend AUC values: Random 50.0; Syntactic-shift prior 60.6; Linguistic context prior 61.8; FastText baseline 68.7; FastText full model 75.6; SBERT baseline 72.0; SBERT full model 79.5.]

Figure 2: ROC curves for slang generation in the OSD test set. The collaborative-filtering prototype model was used for prediction. Ticks on the y-axis indicate the median precision of the models.

5 Results

5.1 Model Evaluation

We first evaluated our models quantitatively by predicting slang word choices: Given a novel slang sense (a definition taken from a slang dictionary) and its part-of-speech, how likely is the model to predict the ground-truth slang word recorded in the dictionary? To assess model performance, we allowed each model to make up to |V| ranked predictions, where V is the vocabulary of the dataset being evaluated, and we used the standard Area-Under-Curve (AUC) percentage from Receiver-Operator Characteristic (ROC) curves to assess overall performance.

We show the ROC curves for the OSD evaluation in Figure 2 as an illustration. The AUC metric is similar to, and a continuous extension of, an F1 score, as it comprehensively sweeps through the number of candidate words a model is allowed to predict. We find this metric to be the most appropriate because multiple words may be appropriate to express a probe slang sense.
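The exact curve construction is not spelled out here, but under the common reading that each test definition has a single ground-truth word ranked among |V| candidates, the per-item ROC AUC reduces to the fraction of candidates ranked below the true word. A hedged sketch under that assumption:

```python
import numpy as np

def mean_auc(true_ranks, vocab_size):
    """Mean ROC AUC (%) when each test item has one ground-truth word.

    true_ranks: 1-based rank of the ground-truth word among vocab_size
    ranked candidates, one entry per test definition. For a single
    positive, AUC equals the fraction of candidates ranked below it.
    """
    ranks = np.asarray(true_ranks, dtype=float)
    auc = (vocab_size - ranks) / (vocab_size - 1)
    return 100.0 * auc.mean()
```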

To examine the effectiveness of the contrastive embedding method, we varied the semantic representation as input to the models by considering both fastText and SBERT (described in Section 4.4). For both embeddings, we experimented with the baseline variant without the contrastive encoding (e.g., vanilla embeddings from fastText and SBERT). We then augmented the models incrementally with the contrastive encoder and the priors whenever applicable, to examine their respective and joint effects on model performance in slang word choice prediction. We observed that, under both datasets, models leveraging the contrastively learned sense embeddings more reliably predict the ground-truth slang words, indicated by both higher AUC scores and consistent improvement in precision over all retrieval ranks. Note that the vanilla SBERT model, despite being a much larger model trained on more data, only presented minor performance gains when compared with the plain fastText model. This suggests that simply training larger models on more data does not better encapsulate slang semantics.

We also analyzed whether the contrastive embeddings are robust under different choices of the probabilistic models. Specifically, we considered the following four variants of the models: 1) 1-Nearest Neighbor (1NN), 2) Prototype, 3) 1NN with collaborative filtering (CF), and 4) Prototype with CF. Our results show that applying contrastively learned semantic embeddings consistently improves predictive accuracy across all probabilistic choice models. The complete set of results for all 3 datasets is summarized in Table 1.

We noted that the syntactic information from the prior improves predictive accuracy in all settings, while by itself predicting significantly better than chance. On OSD, we used the context sentences alone in a contextualized language infilling model for prediction and also incorporated them as a prior. Again, while the prior consistently improves model prediction, both by itself and when paired with the syntactic-shift prior, the language model alone is not sufficient.

We found the syntactic-shift prior and the linguistic context prior to capture complementary information (mean Spearman correlation of 0.054 ± 0.003 across all examples), resulting in improved performance when they are combined together.

However, the majority of the performance gain is attributed to the augmented contrastive embeddings, which highlights the importance of encoding slang and conventional senses together and supports our premise that this encoding is crucial to slang word choice.

5.2 Historical Analysis of Slang Emergence

We next performed a temporal analysis to evaluate whether our model explains slang emergence over time. We used the time tags available in the GDoS dataset and predicted historically emerged slang from the past 50 years (1960s–2000s). For a given slang entry recorded in history, we tagged its emergent decade using the earliest dated reference available in the dictionary.


Model                                                        1NN    Prototype  1NN+CF  Proto+CF

Dataset 1: Online Slang Dictionary (OSD)
Prior Baseline - Uniform                                                 51.9
Prior Baseline - Syntactic-shift                                         60.6
Prior Baseline - Linguistic Context (Donahue et al., 2020)               61.8
Prior Baseline - Syntactic-shift + Linguistic Context                    67.3
FastText Baseline (Sun et al., 2019)                         63.2   65.2       66.0    68.7
FastText + Contrastive Semantic Encoding (CSE)               71.7   71.6       73.0    72.6
FastText + CSE + Syntactic-shift Prior (SSP)                 73.8   73.4       75.2    74.4
FastText + CSE + Linguistic Context Prior (LCP)              73.6   73.2       74.7    73.9
FastText + CSE + SSP + LCP                                   75.4   74.9       76.5    75.6
SBERT Baseline                                               67.4   68.1       69.5    72.0
SBERT + CSE                                                  76.6   77.4       77.4    78.0
SBERT + CSE + SSP                                            77.6   78.0       78.8    78.9
SBERT + CSE + LCP                                            77.8   78.4       78.1    78.7
SBERT + CSE + SSP + LCP                                      78.5   79.0       79.4    79.5

Dataset 2: Green's Dictionary of Slang (GDoS)
Prior Baseline - Uniform                                                 51.5
Prior Baseline - Syntactic-shift                                         61.0
FastText Baseline (Sun et al., 2019)                         68.2   69.9       67.8    69.7
FastText + Contrastive Semantic Encoding (CSE)               73.4   74.0       74.1    74.8
FastText + CSE + Syntactic-shift Prior (SSP)                 74.5   74.8       75.2    75.8
SBERT Baseline                                               67.1   68.0       66.8    67.5
SBERT + CSE                                                  77.8   78.2       77.4    77.9
SBERT + CSE + SSP                                            78.5   78.7       78.3    78.6

Dataset 3: Urban Dictionary (UD)
Prior Baseline - Uniform                                                 52.3
FastText Baseline (Sun et al., 2019)                         65.2   68.8       67.6    70.9
FastText + Contrastive Semantic Encoding (CSE)               71.0   72.2       71.5    73.7
SBERT Baseline                                               72.4   71.7       74.0    74.4
SBERT + CSE                                                  76.2   76.6       77.2    78.8

Table 1: Summary of model AUC scores (%) for slang generation in 3 slang datasets.

For each future decade d, we trained our model using all entries before d and assessed whether our model can predict the choices of slang words for slang senses that emerged in the future decade. We scored the models on slang words that emerged during each subsequent decade, simulating a scenario where future slang usages are incrementally predicted.

Table 2 summarizes the results from the historical analysis for the non-contrastive SBERT baseline and our full model (with contrastive embeddings), based on the GDoS data. AUC scores are similar to the previous findings but slightly lower for both models in this historical setting. Overall, we find the full model to improve over the baseline consistently over the course of history examined and achieve similar performance as in the synchronic evaluation. This provides strong evidence that our framework is robust and has explanatory power over the historical emergence of slang.

5.3 Model Error Analysis and Interpretation

Few-shot vs Zero-shot Prediction. We analyze our model errors and note that one source of error stems from whether the probe slang word has appeared during training or not. Here, each candidate word is treated as a class, and each slang sense of a word seen in the training set is considered a 'shot'.


Decade   # Test   Baseline   SBERT+CSE+SSP
1960s    2010     67.5       77.4
1970s    1757     66.3       77.9
1980s    1655     66.3       78.6
1990s    1605     66.2       75.4
2000s    1374     65.9       77.0

Table 2: Summary of model AUC scores in historical prediction of slang emergence (1960s–2000s). The non-contrastive SBERT baseline and the proposed full model (with contrastive embedding, CSE, and syntactic prior, SSP) are compared using the collaborative-filtering Prototype. Models were trained and tested incrementally through time (test sizes shown) and trained initially on 20,899 Green's Dictionary definitions prior to the 1960s.

In the few-shot case, although the slang sense in question was not observed in prediction, the model has some a priori knowledge about its target word and how it has been used in slang context (because a word may have multiple slang senses), thus allowing the model to generalize toward novel slang usage of that word. In the zero-shot case, the model needs to select a novel slang word (i.e., one that never appeared in training) and hence has no direct knowledge about how that word should be extended in a slang context. Such knowledge must be inferred indirectly, in this case from the conventional senses of the candidate words. The model can then infer how words with similar conventional senses might extend to slang context.

Table 3 outlines the AUC scores of the collaboratively filtered prototype models under few-shot and zero-shot settings. For each dataset, we partitioned the corresponding test set by whether the target word appears at least once within another definition entry in the training data. This results in 179, 2,661, and 165 few-shot definitions in OSD, GDoS and UD respectively, along with 120, 269, and 96 zero-shot definitions. From our results, we observed that it is more challenging for the model to generalize usage patterns to unseen words, with AUC scores often being higher in the few-shot case. Overall, we found the model to have the most issues handling zero-shot cases from GDoS, due to the fine-grained senses recorded in this dictionary, where a word has more slang senses on average (in comparison to the OSD and UD data). This issue caused the models to be more biased towards generalizing usage patterns from more commonly observed words. Finally, the SBERT-based models tend to be more robust towards unseen word forms, potentially benefiting from their contextualized properties.
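The few-shot/zero-shot partition described above can be obtained with a simple membership check. A minimal sketch, where entries are assumed to be (word, definition) pairs:

```python
def split_few_zero_shot(test_entries, train_entries):
    """Partition test definitions by whether the target word was seen in training.

    A test entry is few-shot if its word has at least one slang definition
    in the training data, and zero-shot otherwise.
    """
    seen = {word for word, _ in train_entries}
    few_shot = [e for e in test_entries if e[0] in seen]
    zero_shot = [e for e in test_entries if e[0] not in seen]
    return few_shot, zero_shot
```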

(a) Online Slang Dictionary (OSD)

Model                        Few-shot   Zero-shot
Prior - Uniform              55.1       47.1
Prior - Syntactic-shift      63.4       56.4
Prior - Linguistic Context   72.4       45.8
Prior - SSP + LCP            74.7       56.4
FT Baseline                  68.3       69.2
FT + CSE                     74.8       69.4
FT + CSE + SSP               76.8       70.9
FT + CSE + LCP               76.7       69.5
FT + CSE + SSP + LCP         78.7       70.9
SBERT Baseline               72.2       71.6
SBERT + CSE                  78.3       77.5
SBERT + CSE + SSP            79.3       78.3
SBERT + CSE + LCP            79.8       77.1
SBERT + CSE + SSP + LCP      80.7       77.8

(b) Green's Dictionary of Slang (GDoS)

Model                        Few-shot   Zero-shot
Prior - Uniform              51.8       48.1
Prior - Syntactic-shift      61.6       54.8
FT Baseline                  70.6       61.3
FT + CSE                     76.3       59.2
FT + CSE + SSP               77.3       60.7
SBERT Baseline               68.3       59.6
SBERT + CSE                  79.0       66.8
SBERT + CSE + SSP            79.7       67.7

(c) Urban Dictionary (UD)

Model                        Few-shot   Zero-shot
Prior - Uniform              54.2       49.1
FT Baseline                  68.6       75.0
FT + CSE                     76.2       69.4
SBERT Baseline               73.0       76.8
SBERT + CSE                  80.6       75.6

Table 3: Model AUC scores (%) for Few-shot and Zero-shot test sets ("CSE" for contrastive embedding, "SSP" for syntactic prior, "LCP" for contextual prior, and "FT" for fastText).

Synonymous Slang Senses. We also examined the influence of synonymy (or sense overlap) in the slang datasets. We quantified the degree of sense synonymy by checking each test sense against all training senses and computing the edit distance between the corresponding sets of constituent content words of the sense definitions.

Figure 3 shows the distribution of the degree of synonymy across all test examples, where the edit distance to the closest training example is considered. We perform our evaluation by binning based on the degree of synonymy and summarize the results in Figure 4.


[Figure 3 (plot omitted). Distribution of edit distance to the closest training example (x-axis: 0 to 10+; y-axis: proportion of test examples) for OSD (299 test examples), GDoS (2930 test examples), and UD (261 test examples).]

Figure 3: Degree of synonymy in the test examples relative to training data in each of the 3 datasets.

(a) Online Slang Dictionary (OSD)

Model            Training          Testing
FT Baseline      0.33 ± 0.011      0.35 ± 0.033
FT + CSE         0.15 ± 0.0083     0.28 ± 0.030
SBERT Baseline   0.34 ± 0.011      0.32 ± 0.033
SBERT + CSE      0.097 ± 0.0069    0.23 ± 0.029

(b) Green's Dictionary of Slang (GDoS)

Model            Training          Testing
FT Baseline      0.30 ± 0.0034     0.30 ± 0.010
FT + CSE         0.19 ± 0.0028     0.26 ± 0.0097
SBERT Baseline   0.32 ± 0.0035     0.32 ± 0.010
SBERT + CSE      0.10 ± 0.0019     0.22 ± 0.0089

(c) Urban Dictionary (UD)

Model            Training          Testing
FT Baseline      0.34 ± 0.012      0.31 ± 0.037
FT + CSE         0.20 ± 0.010      0.28 ± 0.033
SBERT Baseline   0.34 ± 0.012      0.28 ± 0.034
SBERT + CSE      0.10 ± 0.0075     0.23 ± 0.031

Table 4: Mean Euclidean distance from slang senses to prototypical conventional senses.

We do not observe any substantial changes in performance when controlling for the degree of synonymy, and in fact, the highly synonymous definitions appear to be more difficult (as opposed to easier) for the models. Overall, we find the models to yield consistent improvement across different degrees of synonymy, particularly with the SBERT-based full model, which offers improvement in all cases.

Semantic Distance. To understand the consequence of contrastive embedding, we examine the relative distance between conventional and slang senses of a word in embedding space and the extent to which the learned semantic relations might generalize.

[Figure 4 (plots omitted). Three panels, (a) Online Slang Dictionary (OSD), (b) Green's Dictionary of Slang (GDoS), and (c) Urban Dictionary (UD), each plotting AUC score (y-axis, 40 to 100) against edit distance to the closest synonymous sense in the training set (x-axis, 0 to 10+) for the FT and SBERT baselines and the corresponding best models (with CSE, plus SSP/LCP where applicable).]

Figure 4: Model AUC scores (%) under test sets with different degrees of synonymy present in training, for the baselines and the best performing models (under collaborative-filtering prototype).

We measured the Euclidean distance between each slang embedding and the prototype sense vector of all candidate words, without applying the probabilistic choice models. Table 4 shows the ranks of the corresponding candidate words, averaged over all slang sense embeddings considered and normalized between 0 and 1. We observed that contrastive learning indeed brings slang and conventional senses (from the same word) closer, as indicated by lower mean semantic distances after the embedding procedure is applied.


Example 1. True slang: kick; slang sense: "a thrill, amusement or excitement"
Sample usage: I got a huge kick when things were close to out of hand.
SBERT Baseline: thrill, pleasure, frolic, yahoo, sparkle (rank 3496 / 6540)
Full model: twist, spin, trick, crank, punch (rank 96 / 6540)

Example 2. True slang: whiff; slang sense: "to kill, to murder, [play on SE, to blow away]"
Sample usage: The trouble is he wasn't alone when you whiffed him.
SBERT Baseline: suicide, homicide, murder, killing, rape (rank 2735 / 6540)
Full model: spill, swallow, blow, flare, dash (rank 296 / 6540)

Example 3. True slang: chirp; slang sense: "an act of informing, a betrayal"
Sample usage: Once we're sure there's no back-fire anywhere, the Sparrow will chirp his last chirp.
SBERT Baseline: dupe, sin, scam, humbug, hocus (rank 2431 / 6540)
Full model: chirp, squeal, squawk, fib, chat (rank 1 / 6540)

Example 4. True slang: red; slang sense: "a communist, a socialist or anyone considered to have left-wing leanings"
Sample usage: Why the hell would I bed a red?
SBERT Baseline: leveller, wildcat, mole, pawn, domino (rank 1744 / 6540)
Full model: orange, bluey, black and tan, violet, shadow (rank 164 / 6540)

Example 5. True slang: team; slang sense: "a gang of criminals"
Sample usage: And a little team to follow me – all wanted up the yard.
SBERT Baseline: gangster, hoodlum, thug, mob, gangsta (rank 826 / 6540)
Full model: brigade, mob, business, gang, school (rank 15 / 6540)

Table 5: Example slang word predictions from the contrastively learned full model and the SBERT baseline (with no contrastive embedding) on slang usage from Green's Dictionary. Each example shows the true slang, the probe slang sense, a sample usage, the top-5 alternative slang words predicted by each model, and the predicted rank of the true slang word from a lexicon of 6,540 words.

Under both fastText and SBERT, we obtained significant improvement on both the OSD and GDoS test sets (p < 0.001). On UD, the improvement is significant for SBERT (p = 0.018) but marginal for fastText (p = 0.087).

Examples of Model Prediction. Table 5 shows 5 example slang usages from the GDoS test set and the top words predicted by both the baseline SBERT model and the full SBERT-based model with contrastive learning. The full model exhibits a greater tendency to choose words that appear remotely related to the queried sense (e.g., spill, swallow for the act of killing), while the baseline model favors words that share only surface semantic similarity (e.g., retrieving murder and homicide directly). We found cases where the model extends meaning metaphorically (e.g., animal to action, in the case of chirp), euphemistically (e.g., spill and swallow for kill), and by generalizing a concept (e.g., brigade and mob for gang), all of which are commonly attested in slang usage (Eble, 2012).

We found the full model to achieve better retrieval accuracy in cases where the queried slang undergoes a non-literal sense extension, whereas the baseline model is better suited to retrieving candidate words with incremental or literal changes in meaning. We also noted many cases where the true slang word is difficult to predict without appropriate background knowledge. For instance, the full model suggested words such as orange and bluey to mean "a communist" but could not pinpoint the color red without knowing its cultural association with communism. Finally, we observed that our model performs generally worse when the target slang sense can hardly be related to the conventional senses of the target word, suggesting that cultural knowledge may be important to consider in the future.

6 Conclusion

We have presented a framework that combines probabilistic inference with neural contrastive learning to generate novel slang word usages. Our results suggest that capturing semantic and contextual flexibility simultaneously helps to improve the automated generation of slang word choices with limited training data. To our knowledge, this work constitutes the first formal computational approach to modeling slang generation, and we have shown the promise of the learned semantic space for representing slang senses. Our framework will provide opportunities for future research in the natural language processing of informal language, particularly the automated interpretation of slang.


Acknowledgements

We thank the anonymous TACL reviewers and action editors for their constructive and detailed comments. We thank Walter Rader and Jonathon Green respectively for their permissions to use The Online Slang Dictionary and Green's Dictionary of Slang for our research. We thank Graeme Hirst, Ella Rabinovich, and members of the Language, Cognition, and Computation (LCC) Group at the University of Toronto for offering thoughtful feedback on this work. We also thank Dan Jurafsky and Derek Denis for stimulating discussion. This work was supported by an NSERC Discovery Grant RGPIN-2018-05872 and a Connaught New Researcher Award to YX.

References

Elton Shah Aly and Dustin Terence van der Haar. 2020. Slang-based text sentiment analysis in Instagram. In Fourth International Congress on Information and Communication Technology, pages 321–329, Singapore. Springer Singapore.

Pierre Baldi and Yves Chauvin. 1993. Neural networks for fingerprint recognition. Neural Computation, 5(3):402–418.

David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135–160.

Su Lin Blodgett, Lisa Green, and Brendan O'Connor. 2016. Demographic dialectal variation in social media: A case study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1119–1130, Austin, Texas. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal. Association for Computational Linguistics.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "siamese" time delay neural network. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 737–744. Morgan-Kaufmann.

Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16:1190–1208.

Sumit Chopra, Raia Hadsell, and Yann LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, CVPR '05, pages 539–546, Washington, DC, USA. IEEE Computer Society.

C. Paul Cook. 2010. Exploiting linguistic knowledge to infer properties of neologisms. University of Toronto, Toronto, Canada.

Verna Dankers, Marek Rei, Martha Lewis, and Ekaterina Shutova. 2019. Modelling the interplay of metaphor and emotion through multitask learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2218–2229, Hong Kong, China. Association for Computational Linguistics.

Chris Donahue, Mina Lee, and Percy Liang. 2020. Enabling language models to fill in the blanks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2492–2501, Online. Association for Computational Linguistics.

Connie C Eble. 1989. The ephemerality of American college slang. In The Fifteenth LACUS Forum, 15, pages 457–469.


Connie C Eble. 2012. Slang & sociability: In-group language among college students. Univ.of North Carolina Press, Chapel Hill, NC.

Katrin Erk. 2016. What do you know about an alligator when you know the company it keeps? Semantics and Pragmatics, 9(17):1–63.

Afsaneh Fazly, Paul Cook, and Suzanne Stevenson. 2009. Unsupervised type and token identification of idiomatic expressions. Computational Linguistics, 35(1):61–103.

Renato Ferreira Pinto Jr. and Yang Xu. 2021. A computational theory of child overextension. Cognition, 206:104472.

Elena Filatova. 2012. Irony and sarcasm: Corpus generation and analysis using crowdsourcing. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 392–398, Istanbul, Turkey. European Language Resources Association (ELRA).

Ge Gao, Eunsol Choi, Yejin Choi, and Luke Zettlemoyer. 2018. Neural metaphor detection in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 607–613, Brussels, Belgium. Association for Computational Linguistics.

David Goldberg, David Nichols, Brian M. Oki, and Douglas Terry. 1992. Using collaborative filtering to weave an information tapestry. Commun. ACM, 35:61–70.

Jonathon Green. 2010. Green's Dictionary of Slang. Chambers, London.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501, Berlin, Germany. Association for Computational Linguistics.

Eric Holgate, Isabel Cachola, Daniel Preotiuc-Pietro, and Junyi Jessy Li. 2018. Why swear? Analyzing and inferring the intentions of vulgar expressions. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4405–4414, Brussels, Belgium. Association for Computational Linguistics.

Herman Kamper, Weiran Wang, and Karen Livescu. 2016. Deep convolutional acoustic word embeddings using word-pair side information. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4950–4954.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In Deep Learning Workshop at the International Conference on Machine Learning, volume 2.

Vivek Kulkarni and William Yang Wang. 2018. Simple models for word formation in slang. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1424–1434, New Orleans, Louisiana. Association for Computational Linguistics.

William Labov. 1972. Language in the inner city: Studies in the Black English vernacular. University of Pennsylvania Press.

William Labov. 2006. The social stratification of English in New York City. Cambridge University Press.

Sidney Landau. 1984. Dictionaries: The art and craft of lexicography. Charles Scribner's Sons, New York, NY.

Maria Lapata and Alex Lascarides. 2003. A probabilistic account of logical metonymy. Comput. Linguist., 29(2):261–315.

Marc T. Law, Nicolas Thome, and Matthieu Cord. 2013. Quadruplet-wise image similarity learning. In 2013 IEEE International Conference on Computer Vision, pages 249–256.

Adrienne Lehrer. 1985. The influence of semantic fields on semantic change. Historical Semantics: Historical Word Formation, 29:283–296.

Changsheng Liu and Rebecca Hwa. 2018. Heuristically informed unsupervised idiom usage recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1723–1731, Brussels, Belgium. Association for Computational Linguistics.

Rijul Magu and Jiebo Luo. 2018. Determining code words in euphemistic hate speech using word embedding networks. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 93–100, Brussels, Belgium. Association for Computational Linguistics.

Elisa Mattiello. 2005. The pervasiveness of slang in standard and non-standard English. Mots Palabras Words, 6:7–41.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2011. Quantitative analysis of culture using millions of digitized books. Science, 331:176–182.

Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2786–2792. AAAI Press.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, pages 807–814, USA. Omnipress.

Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. 2016. Learning text similarity with Siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 148–157, Berlin, Germany. Association for Computational Linguistics.

Ke Ni and William Yang Wang. 2017. Learning to explain non-standard English words and phrases. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 413–417, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Malvina Nissim and Katja Markert. 2003. Syntactic features and word similarity for supervised metonymy resolution. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 56–63, Stroudsburg, PA, USA. Association for Computational Linguistics.

Zhengqi Pei, Zhewei Sun, and Yang Xu. 2019. Slang detection and identification. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 881–889, Hong Kong, China. Association for Computational Linguistics.

Martin J. Pickering and Simon Garrod. 2013. Forward models and their implications for production, comprehension, and dialogue. Behavioral and Brain Sciences, 36(4):377–392.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Christian Ramiro, Mahesh Srinivasan, Barbara C. Malt, and Yang Xu. 2018. Algorithms in the historical emergence of word senses. Proceedings of the National Academy of Sciences, 115:2323–2328.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Ekaterina Shutova, Jakub Kaplan, Simone Teufel, and Anna Korhonen. 2013a. A computational model of logical metonymy. ACM Trans. Speech Lang. Process., 10(3):11:1–11:28.

Ekaterina Shutova, Simone Teufel, and Anna Korhonen. 2013b. Statistical metaphor processing. Computational Linguistics, 39(2):301–353.

Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4080–4090.

Ian Stewart and Jacob Eisenstein. 2018. Making “fetch” happen: The influence of social and linguistic context on nonstandard word growth and decline. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4360–4370, Brussels, Belgium. Association for Computational Linguistics.

Zhewei Sun, Richard Zemel, and Yang Xu. 2019. Slang generation as categorization. In Proceedings of the 41st Annual Conference of the Cognitive Science Society, pages 2898–2904. Cognitive Science Society.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, Red Hook, NY, USA. Curran Associates Inc.

Tony Veale, Ekaterina Shutova, and Beata Beigman Klebanov. 2016. Metaphor: A computational perspective. Morgan & Claypool Publishers.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching networks for one shot learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 3637–3645, USA. Curran Associates Inc.

Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. 2014. Learning fine-grained image similarity with deep ranking. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '14, pages 1386–1393, Washington, DC, USA. IEEE Computer Society.

Kilian Q. Weinberger and Lawrence K. Saul. 2009. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Steven Wilson, Walid Magdy, Barbara McGillivray, Kiran Garimella, and Gareth Tyson. 2020. Urban Dictionary embeddings for slang NLP applications. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4764–4773, Marseille, France. European Language Resources Association.