
memeBot: Towards Automatic Image Meme Generation

Aadhavan Sadasivam, Kausic Gunasekar, Hasan Davulcu, Yezhou Yang
Arizona State University, Tempe AZ, United States
{asadasi1, kgunase3, hdavulcu, yz.yang}@asu.edu

Abstract

Image memes have become a widespread tool used by people for interacting and exchanging ideas over social media, blogs, and open messengers. This work proposes to treat automatic image meme generation as a translation process, and further presents an end-to-end neural and probabilistic approach to generate an image-based meme for any given sentence using an encoder-decoder architecture. For a given input sentence, an image meme is generated by combining a meme template image and a text caption, where the meme template image is selected from a set of popular candidates using a selection module and the meme caption is generated by an encoder-decoder model. An encoder maps the selected meme template and the input sentence into a meme embedding, and a decoder decodes the meme caption from that embedding. The generated natural language meme caption is thus conditioned on both the input sentence and the selected meme template. The model learns the dependencies between meme captions and meme template images and generates new memes using the learned dependencies. The quality of the generated captions and memes is evaluated through both automated and human evaluation. An experiment is designed to score how well the generated memes represent tweets from Twitter conversations, and experiments on Twitter data show the efficacy of the model in generating memes for sentences in online social interaction.

1 Introduction

An Internet meme commonly takes the form of an image and is created by combining a meme template (image) and a caption (natural language sentence). The image typically comes from a set of popular image candidates, and the caption conveys the intended message through natural language.

[Figure 1 diagram: the input sentence ("I am curiously waiting for my father to cook supper tonight") feeds the meme template selection module, which outputs a meme template name and meme template image; an encoder maps the template name and input sentence to a meme embedding, and a decoder produces the generated caption, which is combined with the template image.]

Figure 1: An illustrative figure of memeBot. It generates an image meme for a given input sentence by combining the selected meme template image and the generated meme caption.

Over the internet, information exists in the form of text, images, video, or a combination of these. However, information that combines an image or video with short text often goes viral. Image memes combine image, text, and humor, making them a powerful tool to deliver information. Image memes are popular because they portray the culture and social choices embraced by the internet community, and they have a strong influence on the cultural norms of how specific demographics of people operate. For example, in Figure 2, we present memes used by an online deep learning community to ridicule how new pre-training methods are outperforming the previous state-of-the-art models.

The popularity of image-based memes can be attributed to the fact that visual information is easier to process and understand than large blocks of text, and this fact is evident in Figure 2.


Figure 2: Memes used by the online deep learning community on social media to ridicule the state-of-the-art pre-training models.

The key role played by image memes in shaping the popular culture of the internet community makes automatic meme image generation a demanding research topic to delve into.

Davison (2012) separates a meme into three components: Manifestation, Behavior, and Ideal. In an image meme, the Ideal is the idea that needs to be conveyed; the Behavior is selecting a suitable meme template and caption to convey that idea; and the Manifestation is the final meme image with a caption conveying the idea. Wang and Wen (2015) and Peirson et al. (2018) focus on the behavior and manifestation of a meme, giving little importance to its ideal: their approaches are limited to selecting the most appropriate meme caption or generating a meme caption for a given image and template name. In this work, we automatically generate an image meme to represent a given input sentence (the Ideal), as illustrated in Figure 1, which is a challenging NLP task with immediate practical applications for online social interaction.

Taking a deep look into the process of image meme generation, we propose to relate image meme generation to natural language translation. To translate a sentence from a source to a target language, one has to decode the meaning of the sentence in its entirety, analyze that meaning, and then encode it into the target sentence. Similarly, a sentence can be translated into a meme by encoding the meaning of the sentence into an image-caption pair capable of conveying the same meaning or emotion as the sentence. Motivated by this intuition, we develop a model that operates beyond the known approaches and extends the capability of image meme generation to generating memes for a given input sentence. We summarize our contributions as follows:

• We present an end-to-end encoder-decoder architecture to generate an image meme for any given sentence.

• We compile the first large-scale meme caption dataset.

• We design experiments based on human evaluation and provide a thorough analysis of the experimental results on using the generated memes for online social interaction.

2 Related Work

There are only a few studies on automatic image meme generation, and the existing approaches treat meme generation as a caption selection or caption generation problem. Wang and Wen (2015) combine an image and its text description to select a meme caption from a corpus based on a ranking algorithm. Peirson et al. (2018) extend natural language description generation to generating a meme caption using an encoder-decoder model with an attention mechanism (Luong et al., 2015). Although there is not much work on automatic meme generation, the task can be closely aligned with tasks such as sentiment analysis, neural machine translation, image caption generation, and controlled natural language generation.

In Natural Language Understanding (NLU), researchers have explored classifying a sentence based on its sentiment (Socher et al., 2013). We extend this idea to classify a sentence based on its compatibility with a meme template. The idea of creating an encoded representation and decoding it into a desired target is well established in neural machine translation and image caption generation. Sutskever et al. (2014), Bahdanau et al. (2014), and Vaswani et al. (2017) use an encoder-decoder model to translate a sentence from a source to a target language. Vinyals et al. (2015), Xu et al. (2015), and Karpathy and Fei-Fei (2015) encode the visual features of an image and use a decoder to generate a natural language description of the image. Fang et al. (2020) generate natural language descriptions containing common sense knowledge from encoded visual inputs.

Our proposed model of meme generation shares a similar spirit with the problems mentioned above: we encode the given input sentence into a latent space and then decode it into a meme caption that can be combined with the meme image to convey the same meaning as that of the input sentence.


[Figure 3 diagram: the input sentence ("Please save the world from Corona Virus") is passed through a parts-of-speech tagger to obtain a POS vector and POS mask, yielding a masked input sentence; the meme template selection module outputs the predicted meme template. A transformer encoder (input, template, and position embeddings; multi-head attention; add & norm; feed-forward) produces the meme embedding. A transformer decoder (output embedding of the output caption, shifted right; masked multi-head attention; multi-head attention; feed-forward; linear; softmax) produces the generated meme caption, which is combined with the meme template image to form the generated meme.]

Figure 3: memeBot model architecture. For a given input sentence, a meme is created by combining the meme image selected by the template selection module and the meme caption generated by the caption generation transformer.

However, the generated meme caption should represent the input sentence through the selected meme template, making this a conditioned or controlled text generation task. Controlled natural language generation with desired emotions, semantics, and keywords has been studied previously. Huang et al. (2018) generate text with desired emotions by embedding emotion representations into Seq2Seq models. Hu et al. (2017) concatenate a control vector to the latent space of their model to generate text with designated semantics. Su et al. (2018) and Miao et al. (2019) generate a sentence with desired emotions or keywords using sampling techniques. While these approaches to controlled text generation involve relatively complex conditioning factors, we implement a transformer-based (Vaswani et al., 2017) encoder-decoder model to generate a meme caption conditioned on both the input sentence and the selected meme template.

3 Our Approach

In this section, we describe our approach: an end-to-end neural and probabilistic architecture for meme generation. Our model has two components: first, a meme template selection module that identifies a compatible meme template (image) for the input sentence; second, a meme caption generator, as illustrated in Figure 3.

3.1 Meme Template Selection Module

Pre-trained language representations from transformer-based architectures such as BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019) are used in a wide range of Natural Language Understanding tasks. Devlin et al. (2019), Yang et al. (2019), and Liu et al. (2019) show that these models can be fine-tuned for a range of NLU tasks to create state-of-the-art models.

For the meme template selection module, we fine-tune the pre-trained language representation models with a linear neural network on the meme template selection task. In training, the probability of selecting the correct template for a given sentence is maximized using the formulation below:

l(θ1) = argmax_{θ1} Σ_{(T,S)} log P(T | S, θ1)    (1)

where θ1 denotes the parameters of the meme template selection module, T is the template, and S is the input sentence.
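To make the selection step concrete, the following minimal sketch (ours, not the authors' released code) fine-tunes a pre-trained RoBERTa encoder with a single linear layer over the 24 templates, assuming the HuggingFace transformers and PyTorch libraries; the example caption and label index are illustrative.

```python
import torch
from torch import nn
from transformers import RobertaModel, RobertaTokenizer

class TemplateSelector(nn.Module):
    """RoBERTa encoder followed by a linear layer scoring 24 meme templates."""
    def __init__(self, num_templates=24):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        self.classifier = nn.Linear(768, num_templates)  # 768 = RoBERTa-base hidden size

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # representation of the <s> token
        return self.classifier(cls)        # logits over templates

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = TemplateSelector()
loss_fn = nn.CrossEntropyLoss()  # minimizing this maximizes log P(T | S) in Eq. (1)

batch = tokenizer(["when you win your first fortnite game"],
                  return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([2])  # hypothetical index of the "Success Kid" template
loss = loss_fn(model(batch["input_ids"], batch["attention_mask"]), labels)
loss.backward()
```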

3.2 Meme Caption Generator

We train the meme caption generator by corrupting the input caption, borrowing from the denoising autoencoder (Vincent et al., 2008). We extract the parts of speech of the input caption using a part-of-speech (POS) tagger (Honnibal and Montani, 2017). Using the POS vector, we mask the input caption so that only the noun phrases and verbs are passed as input to the meme caption generator. We corrupt the data so that our model learns meme generation from existing captions and generalizes the process of meme generation to any input sentence during inference.
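Since the cited POS tagger (Honnibal and Montani, 2017) is spaCy, the masking step might look like the following minimal sketch; keeping exactly the tokens inside noun chunks plus verbs is our reading of the description above, not a detail the paper spells out.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # spaCy POS tagger (Honnibal and Montani, 2017)

def mask_caption(sentence: str) -> str:
    """Keep only tokens inside noun chunks plus verbs; drop everything else."""
    doc = nlp(sentence)
    noun_tokens = {tok.i for chunk in doc.noun_chunks for tok in chunk}
    kept = [tok.text for tok in doc if tok.i in noun_tokens or tok.pos_ == "VERB"]
    return " ".join(kept)

print(mask_caption("I am curiously waiting for my father to cook supper tonight"))
# Output depends on the tagger, e.g. "I waiting my father cook supper tonight"
```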

The meme caption generator uses a transformer architecture inspired by Vaswani et al. (2017). Our transformer encoder creates a meme embedding for a given sentence by performing multi-head scaled dot-product attention over the selected meme template and the input sentence. The transformer decoder first performs masked multi-head attention on the expected caption and then performs multi-head scaled dot-product attention between the encoded meme embedding and the outputs of the masked multi-head attention, as shown in Figure 3. This enables the meme caption generator to learn the dependency between the input sentence, the selected meme template, and the expected meme caption. We optimize the transformer using the formulation below:

l(θ2) = argmax_{θ2} Σ_{(S,C)} log P(C | M, θ2)    (2)

where θ2 denotes the parameters of the meme caption generator, C is the meme caption, and M is the meme embedding obtained from the transformer encoder.
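A minimal sketch of this conditioning using PyTorch's nn.Transformer is shown below, with dimensions following the SMT2MC row of Table 3. Adding a learned template embedding to the encoder inputs is our reading of the "Template Embedding" block in Figure 3, not a detail the paper spells out.

```python
import torch
from torch import nn

class CaptionGenerator(nn.Module):
    """Transformer encoder-decoder conditioned on the selected meme template."""
    def __init__(self, vocab_size=30000, num_templates=24, d_model=512,
                 nhead=8, num_layers=6, dim_ff=2048, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)        # position embedding
        self.tpl_emb = nn.Embedding(num_templates, d_model)  # template embedding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_ff, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, template_id, tgt_ids):
        spos = torch.arange(src_ids.size(1), device=src_ids.device)
        # Encoder input: token + position embeddings plus the template embedding.
        src = (self.tok_emb(src_ids) + self.pos_emb(spos)
               + self.tpl_emb(template_id).unsqueeze(1))
        tpos = torch.arange(tgt_ids.size(1), device=tgt_ids.device)
        tgt = self.tok_emb(tgt_ids) + self.pos_emb(tpos)  # caption, shifted right
        causal = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        dec = self.transformer(src, tgt, tgt_mask=causal)  # decode from meme embedding
        return self.out(dec)  # next-token logits; cross-entropy here optimizes Eq. (2)
```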

4 Dataset

4.1 Meme Caption Dataset

To make possible and validate the aforementioned technical framework, we collect a dataset that enables us to learn the dependency between a meme template and a meme caption. We adopt the open online resource imgflip¹, one of the most commonly used meme generators, and developed a web crawler to collect the memes automatically.
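As an illustration of the collection step, a crawler sketch follows, assuming the requests and BeautifulSoup libraries; the URL pattern and CSS selector are hypothetical placeholders, since the paper does not describe imgflip's page structure.

```python
import requests
from bs4 import BeautifulSoup

def crawl_template_captions(template_slug: str, pages: int = 5):
    """Collect (template, caption) pairs from a template's listing pages.
    The URL pattern and the 'img.base-img' selector are illustrative only."""
    pairs = []
    for page in range(1, pages + 1):
        url = f"https://imgflip.com/meme/{template_slug}?page={page}"  # hypothetical
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for img in soup.select("img.base-img"):  # hypothetical selector
            caption = img.get("alt", "").strip()
            if caption:
                pairs.append((template_slug, caption))
    return pairs

pairs = crawl_template_captions("Success-Kid")
```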

We observe that only a few meme templates dominate the collection. We investigate these dominating memes along with the factors that can make a meme popular. Replication of a meme depends on the mental processes of observation and learning within the group of people across which it is shared (Davison, 2012). Popular meme templates make content shareable and are replicated frequently because of their capability to make content viral. To this end, we experiment on image meme generation using the popular meme templates.

Our dataset has 177,942 meme captions from 24 templates. The distribution of meme captions across the meme templates is presented in Figure 4. The dataset consists of meme template (image and template name) and meme caption pairs; a sample from the dataset is shown in Table 1. To add diversity to the generated memes, we use various images for the same meme template. The meme template figures and a sample of the additional images used are presented in Appendices A and B.

4.2 Twitter Dataset

We collect tweets from Twitter to evaluate the efficacy of our model in generating memes for sentences used in online social interaction. We randomly sampled 6,000 tweets for the query "Corona virus" and pruned them to 1,000 tweets by keeping only those with non-negative sentiment, using VADER-Sentiment-Analysis (Hutto and Gilbert, 2014). Twitter is an open domain and may contain tweets that could affect people's beliefs and sentiments; to retain control over our model, we remove the tweets with negative sentiment. The goal is to prompt our model to generate an image meme for an input tweet and evaluate whether the generated meme is relevant to the tweet.
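A minimal sketch of this filtering step using the vaderSentiment package is given below; VADER's conventional −0.05 compound-score boundary for negative sentiment is our assumption, as the paper does not state its cutoff.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def keep_non_negative(tweets):
    """Keep tweets whose VADER compound score is above the negative boundary."""
    return [t for t in tweets
            if analyzer.polarity_scores(t)["compound"] > -0.05]

print(keep_non_negative(["Stay at home and save your life.",
                         "This outbreak is an absolute disaster."]))
```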

¹ https://imgflip.com


Template Name: Leonardo Dicaprio Cheers [template image]
Captions:
• to those who have been fortunate enough to have known true love
• cheers to us making the future bright
• when you see your cousin at a family gathering

Template Name: Success Kid [template image]
Captions:
• carries the laundry didn't drop a single sock
• when you win your first fortnite game
• late to work and boss was even later
• when she gives you her phone number

Table 1: Sample examples (template name, captions, and meme image) from the meme caption dataset.

[Figure 4 pie chart, approximate shares: bad luck brian 5.8%, first world problems 5.8%, the most interesting man in the world 5.7%, one does not simply 5.6%, futurama fry 5.3%, 10 guy 5.2%, creepy condescending wonka 5.0%, but thats none of my business 4.7%, that would be great 4.5%, grumpy cat 4.4%, waiting skeleton 4.4%, y u no 4.1%, leonardo dicaprio cheers 3.9%, disaster girl 3.8%, matrix morpheus 3.7%, ancient aliens 3.4%, captain picard facepalm 3.4%, roll safe think about it 3.4%, third world skeptical kid 3.3%, x x everywhere 3.1%, success kid 3.1%, am i the only one around here 3.1%.]

Figure 4: Distribution of meme caption count for the meme templates.

5 Experiments and Results

We train our model on the meme caption dataset (Section 4.1). The train, validation, and test splits contain 142,341, 17,802, and 17,799 samples, respectively. We evaluate the performance of the meme template selection module in selecting the compatible template, the effectiveness of the caption generator in generating captions similar to the input captions from the meme caption test set, and the efficacy of the model in generating memes for real-world examples (tweets) through human evaluation.

5.1 Meme Template Selection Module

We fine-tune the pre-trained language representation models (BERT_base (Devlin et al., 2019), XLNet_base (Yang et al., 2019), and RoBERTa_base (Liu et al., 2019)) with a single-layer linear neural network of 768 units on the meme template selection task, using the meme caption dataset. The performance of the meme template selection module on the meme caption test data with each pre-trained language representation model is reported in Table 2.

Model | Accuracy ↑ | F1 ↑ | Loss ↓
BERT_base | 68.18 | 68.71 | 1.15
XLNet_base | 68.74 | 69.06 | 1.17
RoBERTa_base | 69.31 | 69.87 | 1.10

Table 2: Meme template selection module performance on the meme caption test dataset using the fine-tuned language representation models. Bold font highlights the best scores obtained. For accuracy and F1, higher is better; for loss, lower is better.

We adopt the best-performing model, with a fine-tuned RoBERTa_base, as the meme template selection module in the meme generation pipeline.


[Figure 5 panels: input sentence "Waiting for trump to answer a question."; memes generated by MT2MC, SMT2MC-NP, and SMT2MC-NP+V.]

Figure 5: Memes generated by the caption generator variants for the given input sentence.

5.2 Meme Caption Generator

For meme caption generation, we experiment with two variants. The first variant, Meme Template to Meme Caption (MT2MC), takes only the selected meme template and generates a meme caption. The second variant, Sentence and Meme Template to Meme Caption (SMT2MC), takes the input sentence along with the selected meme template and generates a meme caption. The two variants form an ablation study demonstrating that using the input sentence features enables our model to generate memes relevant to the input sentence.

We also experiment with two variants of SMT2MC. The first uses only the noun phrases from the input sentence, while the second uses the verbs along with the noun phrases. We experiment with only the noun phrases in order to study to what extent adding verbs directs the context of the generated meme toward the context of the input sentence. The SMT2MC and MT2MC architectures follow the same notation as Vaswani et al. (2017); we report the hyper-parameters used in Table 3.

Architecture | N | d_model | d_ff | h
MT2MC | 8 | 768 | 2048 | 12
SMT2MC | 6 | 512 | 2048 | 8

Table 3: Hyper-parameters used in the caption generation models.

We use residual dropout P_drop (Srivastava et al., 2014) for regularization and the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98, and ε = 1e−9, with a learning rate schedule using cosine annealing with warm restarts (Loshchilov and Hutter, 2016). During inference, we generate meme captions using beam search with a beam size of 6 and length penalty α = 0.7. We stop caption generation when a special end token is produced or a maximum length of 32 tokens is reached.
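These training choices map directly onto standard PyTorch components; a minimal, self-contained sketch follows, with a stand-in module and a placeholder restart period T_0, since the paper reports neither the learning rate nor the scheduler period.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = nn.Linear(512, 512)  # stand-in for the caption generation transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,   # lr unreported; placeholder
                             betas=(0.9, 0.98), eps=1e-9)   # values from the paper
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10)  # T_0 unreported; placeholder

steps_per_epoch = 100  # illustrative
for epoch in range(3):
    for step in range(steps_per_epoch):
        optimizer.zero_grad()
        loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy loss for illustration
        loss.backward()
        optimizer.step()
        # A fractional-epoch argument advances the cosine schedule between restarts.
        scheduler.step(epoch + step / steps_per_epoch)
```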

A sample of memes generated by the caption generator variants is presented in Figure 5. MT2MC generates a random caption for the given meme template that is irrelevant to the input sentence, while the captions generated by the SMT2MC variants are contextually relevant to the input sentence. Among the SMT2MC variants, the captions generated using noun phrases and verbs as inputs better represent the input sentence.

5.3 Evaluation Metrics

We use the BLEU score (Papineni et al., 2002) to evaluate the quality of the generated captions. The perception of meme quality is subjective and varies among people, and to the best of our knowledge there are no automatic evaluation metrics for the quality of a meme. A fairly reliable technique is to have a set of human raters evaluate the quality of a meme on a subjective scale.
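For reference, corpus-level BLEU-1 through BLEU-4 as reported in Table 4 can be computed with NLTK as in the minimal sketch below; the token sequences are illustrative.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of references per hypothesis; tokens are illustrative.
references = [[["when", "you", "win", "your", "first", "fortnite", "game"]]]
hypotheses = [["when", "you", "win", "your", "first", "game"]]
smooth = SmoothingFunction().method1  # avoids zero scores on short captions

for n, weights in enumerate([(1, 0, 0, 0), (0.5, 0.5, 0, 0),
                             (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)], start=1):
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")
```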


Variant | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4
MT2MC | 13.93 | 6.85 | 4.45 | 2.72
SMT2MC-NP | 38.71 | 24.67 | 14.56 | 8.89
SMT2MC-NP+V | 45.67 | 27.56 | 17.14 | 11.12

Table 4: BLEU scores for the caption generator variants. Bold font highlights the best scores obtained. NP: noun phrase; V: verb.

[Figure 6 pie charts. (a) Coherence score distribution: score 4, 15.9%; 3.5, 8.2%; 3, 39.7%; 2.5, 5.9%; 2, 8.9%; 1.5, 3.0%; 1, 18.3%. (b) Relevance score distribution: score 4, 12.5%; 3.5, 9.2%; 3, 39.3%; 2.5, 9.1%; 2, 10.2%; 1.5, 4.7%; 1, 15.1%. (c) User Likes distribution: user liked, 59.8%; no likes, 28.4%; disagreement, 11.8%.]

Figure 6: Human evaluation scores.

In machine translation, adequacy and fluency (Snover et al., 2009) are used to subjectively rate the correctness and fluency of a translation. Inspired by adequacy and fluency, we define two metrics, Coherence and Relevance, to evaluate the generated memes, described as follows:

• Coherence: Can you understand the message conveyed through the meme (image + text)?

• Relevance: Is the meme contextually relevant to the text?

The Coherence score captures the quality (fluency) of the generated meme, and the Relevance score captures how well the generated meme represents the input sentence (correctness). We also ask the raters whether they like the meme, to evaluate whether the generated memes are good. Relevance and Coherence are scored on a range of 1 to 4; the User Likes score represents the percentage of total raters who liked the meme. To score these metrics, we set up an Amazon Mechanical Turk (AMT) experiment.

5.4 Evaluation Results

5.4.1 Caption Generation Results

The scores for the caption generator variants are reported in Table 4. The SMT2MC variants produce meme captions textually similar to the input sentence. Among the SMT2MC variants, the variant that takes verbs along with noun phrases scores higher; using verbs with noun phrases enables the caption generator to generate captions more relevant to the input sentence than the variant that takes only noun phrases. We use the best-performing SMT2MC-NP+V to generate memes for the Twitter data.

5.4.2 Human Evaluation Task Setup

We choose Amazon Mechanical Turk (AMT) for the evaluation of the generated memes due to its easy-to-use platform and the ready availability of a large worker pool with the required skills. An example AMT questionnaire is presented in Appendix C. Each sample was rated by two raters, and in case of disagreement among the raters we take their average score as the final score.

In our AMT evaluation setup, we design a two-stage process to evaluate the meme. We first display the meme image and ask the workers to score the Coherence metric based only on their understanding of the meme. We then display the tweet, ask them to read it, and ask them to score the Relevance metric based on their comprehension of the tweet and the meme. We expect AMT workers to be capable of visually understanding an image, semantically and contextually understanding a sentence, and reasoning to compare context from different information sources; we assume an adult human being is well qualified to meet these expectations.

5.4.3 Human Evaluation Results

The performance of the SMT2MC-NP+V model on the human evaluation metrics is reported in Table 5, and the score distribution across the evaluation metrics is presented in Figure 6. A qualitative comparison of memes grouped by rater scores is presented in Figures 7 and 8.


[Figure 7 panels: example memes rated Very Good (4), Good (3), Poor (2), and Very Poor (1) for Coherence.]

Figure 7: Qualitative comparison of memes grouped by Coherence score.

[Figure 8 panels: generated memes rated Very Good (4), Good (3), Poor (2), and Very Poor (1) for Relevance, for the tweets "It's illegal to post fake information about coronavirus.", "Stay at home and save your life.", "Corona virus cases continue to escalates.", and "People say the coronavirus is like an invisible war."]

Figure 8: Qualitative comparison of memes grouped by Relevance score.


Metric | Score
Coherence | 2.66
Relevance | 2.65
User Likes | 0.65

Table 5: Human evaluation scores on Twitter data. Relevance and Coherence are scored out of 4. The User Likes score represents the percentage of total raters who liked the meme.

Before interpreting the scores, we review the image meme generation task. It requires the ability to semantically and contextually understand the input sentence, along with contextual knowledge of image memes. Even with an understanding of the input sentence and meme images, one must possess good fluency in natural language to generate a meme caption that is compatible with the meme image, and the generated meme should also be relevant to the input sentence. We analyze the performance of our model by assuming that a human-generated meme would receive a perfect score across all metrics for generating a good quality meme for a tweet.

Observing the score distribution in Figure 6, we infer that more than 60% of the generated memes are coherent and relevant to the input tweets. From Table 5, we see that 65% of the raters liked the meme shown to them, and the like percentage correlates with the coherence and relevance scores; we infer that raters liked a meme when they understood the information it conveyed and the meme was relevant to the input tweet. Quantitatively, our model generates coherent memes with 66.5% confidence and relevant memes with 66.25% confidence. Our model performs with good confidence on the challenging image meme generation task while using only the language features of the image meme during training.

6 Inter Rater Reliability

We use Cohen's Kappa (κ) to measure the reliability among the raters. Cohen's Kappa is defined as

κ = (p_o − p_e) / (1 − p_e)    (3)

where p_o is the relative observed agreement and p_e is the hypothetical probability of chance agreement among the raters.

Figure 9: Memes generated by the SMT2MC-NP+V variant for the input tweet "Please save the world from Corona" by conditioning caption generation on different meme templates.

The Inter Rater Reliability (IRR) scores among the raters on the different metrics are reported in Table 6.

Metric | Agreement Score
Coherence | 71.68
Relevance | 61.39
User Likes | 73.85

Table 6: Inter Rater Reliability scores [%] on the human evaluation metrics.

The raters have higher than 60% agreement on all metrics, which establishes good consistency among the raters in evaluating the quality of the generated image memes.
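As a reference for how this agreement statistic is computed, a minimal sketch using scikit-learn follows; the two raters' scores are illustrative, not actual study data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical Coherence ratings (1-4) from two AMT raters on the same ten memes.
rater_a = [4, 3, 3, 2, 1, 3, 4, 2, 3, 1]
rater_b = [4, 3, 2, 2, 1, 3, 3, 2, 3, 2]

kappa = cohen_kappa_score(rater_a, rater_b)  # implements Eq. (3)
print(f"Cohen's kappa: {kappa:.3f}")
```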

7 Controlling Meme Generation

Corrupting the input data during training enables our model to learn from the meme caption dataset and to scale to any input sentence during inference, as shown in Figure 8. In an ideal scenario, the user might want to select the meme template themselves. During the experiments, we observed that for a given sentence, information abstraction during training has enabled our model to create a meme caption conditioned on any given meme template. We experiment further by forcing the caption generator to generate captions for an input sentence conditioned on different meme templates. The generated memes are presented in Figure 9; our model is capable of generating a meme for an input sentence conditioned on a selected or given meme template.

8 Conclusion and Future Work

We have presented memeBot, an end-to-end architecture that can automatically generate a meme for a given sentence. memeBot is composed of two components: a module to select a meme template and an encoder-decoder to generate a meme caption. The model is trained on a meme caption dataset to maximize the likelihood of selecting a template given a caption and to maximize the likelihood of generating a meme caption given the input sentence and the meme template. Automatic evaluation on the meme caption test data and human evaluation scores on Twitter data show promising performance in generating image memes for sentences in online social interaction.

The notion of meme quality varies highly among people and is hard to evaluate using a set of pre-defined metrics. In real-world scenarios, if an individual likes a meme, he or she shares it with others; if a group of individuals likes the same meme, the meme can become viral or trending. Future work includes evaluating a meme by introducing it into a social media stream and rating it based on its transmission among people. The meme transmission rate and the group of people it transmits across can be used as reinforcement to generate more creative and better quality memes.

Acknowledgment: The Office of Naval Research (award #N00014-18-1-2761) and the National Science Foundation under the Robust Intelligence Program (#1750082) are gratefully acknowledged.


References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Patrick Davison. 2012. The language of internet memes. The Social Media Reader, pages 120-134.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Zhiyuan Fang, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. 2020. Video2Commonsense: Generating commonsense descriptions to enrich video captioning. arXiv preprint arXiv:2003.05162.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1587-1596. JMLR.org.

Chenyang Huang, Osmar Zaiane, Amine Trabelsi, and Nouha Dziri. 2018. Automatic dialogue generation with expressed emotions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 49-54.

Clayton J. Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421, Lisbon, Portugal. Association for Computational Linguistics.

Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6834-6842.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.

Abel L. Peirson V and E. Meltem Tolunay. 2018. Dank learning: Generating memes using deep neural networks. arXiv preprint arXiv:1806.04510.

Matthew Snover, Nitin Madnani, Bonnie J. Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 259-268. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631-1642.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958.

Jinyue Su, Jiacheng Xu, Xipeng Qiu, and Xuanjing Huang. 2018. Incorporating discriminator in sentence generation: A Gibbs sampling method. In Thirty-Second AAAI Conference on Artificial Intelligence.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096-1103. ACM.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156-3164.

William Yang Wang and Miaomiao Wen. 2015. I can has cheezburger? A nonparanormal approach to combining textual and visual information for predicting and generating popular meme descriptions. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 355-365.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048-2057.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.


A Meme Templates

[Template images omitted. The 24 meme templates used: Bad Luck Brian, Leonardo Dicaprio Cheers, Success Kid, X X Everywhere, Ancient Aliens, Disaster Girl, One Does Not Simply, Third World Skeptical Kid, Futurama Fry, 10 Guy, Am I The Only One Around Here, Captain Picard Facepalm, Brace Yourselves X is Coming, Creepy Condescending Wonka, Matrix Morpheus, Y U No, But Thats None Of My Business, Roll Safe Think About It, Waiting Skeleton, The Most Interesting Man In The World, First World Problems, Hide the Pain Harold, That Would Be Great, Grumpy Cat.]

Meme templates and meme images from the meme caption dataset used in our experiments.


B Sample Additional Images

A sample of the additional images used for the Waiting Skeleton meme template.

C Amazon Mechanical Turk Questionnaire

Sample AMT questionnaire - Coherence metric.


Sample AMT questionnaire - Relevance and User Likes metrics.

Sample AMT questionnaire - disclaimer.

D memeBot Demo

A live demo of the presented memeBot is available on the website.