

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 3624–3635, April 19-23, 2021. ©2021 Association for Computational Linguistics


Lifelong Knowledge-Enriched Social Event Representation Learning

Prashanth Vijayaraghavan
MIT Media Lab
Cambridge, MA, USA
[email protected]

Deb Roy
MIT Media Lab
Cambridge, MA, USA
[email protected]

Abstract

The ability of humans to symbolically represent social events and situations is crucial for various interactions in everyday life. Several studies in cognitive psychology have established the role of mental state attributions in effectively representing variable aspects of these social events. In the past, NLP research on learning event representations has often focused on construing syntactic and semantic information from language, failing to consider the importance of pragmatic aspects and the need to consistently update new social situational information without forgetting the accumulated experiences. In this work, we propose a representation learning framework that directly addresses these shortcomings by integrating social commonsense knowledge with recent advancements in lifelong language learning. First, we investigate methods to incorporate pragmatic aspects into our social event embeddings by leveraging social commonsense knowledge. Next, we introduce continual learning strategies that allow for incremental consolidation of new knowledge while retaining and promoting efficient usage of prior knowledge. Experimental results on event similarity, reasoning, and paraphrase detection tasks demonstrate the efficacy of our social event embeddings.

1 Introduction

Everyday life comprises the ways in which people typically act, think, and feel on a daily basis. Our life experiences unfold naturally into temporally extended daily events. These event descriptions can be packaged in various ways depending on several factors, such as the speaker's perspective or the related domain. Interpretation of event descriptions will be incomplete without understanding the multiple entities involved in the events, even more so when the focus is primarily on "social events",

[Figure 1: panels (a)-(d) plot embeddings of five event texts: S1 "Student goes to class", S2 "Student goes to wedding", S3 "Student takes course", S4 "Teacher takes course", S5 "Professor teaches subject", after training with (a) no external knowledge, (b) ConceptNet, (c) ConceptNet + ATOMIC, and (d) ConceptNet + ATOMIC + SB-SCK.]

Figure 1: Illustration of the functioning of our lifelong representation learning approach, which produces incrementally richer social event representations. Event texts are given in the top green box. With more knowledge, social event embeddings move beyond high lexical overlap [shown in (a)] and learn to integrate semantic and pragmatic properties [shown in (b), (c)] of event texts along with social role information [shown in (d)].

i.e., events explaining social situations and interactions. Therefore, a social event representation model must capture the semantic properties from the event text description and embed salient knowledge that encompasses the implicit pragmatic abilities. Early definitions of pragmatic aspects refer to the use of language in context, comprising the verbal, paralinguistic, and non-verbal elements of language (Adams et al., 2005). Contemporary definitions have expanded beyond purely communicative functions to include the social, emotional, and communicative aspects of language behavior (Adams et al., 2005; Parsons et al., 2017).

Moving away from the extensively studied


speech acts, we analyze characteristics that reflect how a person behaves in social situations and how social contextual aspects influence linguistic meaning. In the context of event representations, the pragmatic properties can specifically refer to the human's inferred implicit understanding of event actors' intents, beliefs, and feelings or reactions (Wood, 1976; Hopper and Naremore, 1978).

Understanding the pragmatic implications of social events is non-trivial for machines as they are not explicitly found in the event texts. Prior studies (Ding et al., 2014, 2015; Granroth-Wilding and Clark, 2016; Weber et al., 2018) often extract the syntactic and semantic information from the event descriptions but ignore the pragmatic aspects of language. In this work, we address this shortcoming, and aim to (a) disentangle semantic and pragmatic attributes from social event descriptions and (b) encapsulate these attributes into an embedding that can move beyond simple linguistic structures and dispel apparent ambiguities in the real sense of their context and meaning.

Towards this goal, we propose to train our models with social commonsense knowledge about events, focusing specifically on the intents and emotional reactions of people. Such commonsense understanding can be obtained from existing knowledge bases like ConceptNet (Speer et al., 2017), Event2Mind/ATOMIC (Sap et al., 2019a; Rashkin et al., 2018) or by collecting more noisy commonsense knowledge using data mining techniques. As new domain sources emerge, each containing different knowledge assertions, it is essential that the representation models for social events keep evolving with this growing knowledge. Since it is generally infeasible to retrain models from scratch for every new knowledge source, we consider the need to employ prominent continual learning practices (Kirkpatrick et al., 2017; Lopez-Paz and Ranzato, 2017; Asghar et al., 2018; d'Autume et al., 2019) and enable semantic and pragmatic enrichment of social event representations. This problem can be addressed from the perspective of incremental domain adaptation (Asghar et al., 2018; Wulfmeier et al., 2018), which quickly adapts to new domain knowledge without interfering with existing ones. Figure 1 presents a sample functioning scenario producing incrementally richer social event embeddings. As the model gains more knowledge from different sources, it learns to discern events based on semantic and pragmatic properties, including

social roles. For example, "Student takes course" and "Teacher takes course" have significant lexical and semantic relatedness. However, the social role information changes the meaning, as depicted in Figure 1(d), with the introduction of our in-house dataset (SB-SCK).

In this paper, we develop a lifelong representation learning approach for embedding social events from their free-form textual descriptions. Our model augments a growing set of knowledge obtained from various domain sources to allow for positive knowledge transfer across these domains. Our contributions are as follows:

• We propose a continual representation learning approach that integrates both text encoding and lifelong learning techniques to aid better representation of social events.

• We adopt a domain-representative episodic memory replay strategy with text encoding techniques to effectively consolidate the expanding knowledge from several domain sources and generate a semantically and pragmatically enriched social event embedding.

• We evaluate our models primarily on four different tasks: (a) intent-emotion prediction for event texts based on an in-house Lifelong EventRep Corpus, (b) event similarity using the hard similarity dataset (Ding et al., 2019; Weber et al., 2018), (c) paraphrase detection using the Twitter URL corpus (Lan et al., 2017), and (d) social commonsense reasoning using the SocialIQA (Sap et al., 2019b) dataset.

2 Related Work

2.1 Social Events Representation Learning

Early work in the domain of events can be traced back to modeling narrative chains. Chambers and Jurafsky (2008, 2009) introduced models for event sequences involving coreference resolution and inferring event schemas. Similar efforts (Balasubramanian et al., 2013; Cheung et al., 2013; Jans et al., 2012) have explored the use of open-domain relations to extract event schemas but suffer from reduced predictive capabilities and increased sparsity. Recent advancements, aimed at addressing the limitations of prior work, compute distributed embeddings of events involving word embeddings, recurrent sequence models, and tensor-based composition models (Modi and


Titov, 2013; Granroth-Wilding and Clark, 2016; Pichotta and Mooney, 2016; Hu et al., 2017). Specifically, tensor-based methods have demonstrated improved performance by representing events that predict implicit arguments with event knowledge (Cheng and Erk, 2018), combine (subject, predicate, object) triple information (Weber et al., 2018), and reflect thematic fit (Tilk et al., 2016).

2.2 Lifelong Learning

Lifelong learning or continual learning approaches can be grouped into regularization-based, data-based, and model-based approaches. Regularization-based approaches (Kirkpatrick et al., 2017; Schwarz et al., 2018; Zenke et al., 2017) minimize significant changes to the previously learned representations as parameters are updated for the current task. This is usually implemented as an additional constraint on the objective function based on the sensitivity of parameters. Recent studies (Kemker and Kanan, 2017; d'Autume et al., 2019; Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019) under data-based approaches store previous task data either using a replay memory buffer or a generative model. In the NLP domain, lifelong language learning approaches have investigated the use of memory replay and local adaptation techniques (d'Autume et al., 2019).

Finally, model-based approaches allow models to allocate or grow capacity (layers or features) necessary for the tasks (Rusu et al., 2016; Lee et al., 2016). More recently, Asghar et al. (2018) augmented RNNs with a progressive memory bank leading to increased model capacity. However, the challenges related to increased architectural complexity are tackled by hybrid models as in (Sodhani et al., 2020). In this paper, we build on ideas from the hybrid models and apply them to our learning task. While previous studies (Ding et al., 2019) have attempted to incorporate commonsense knowledge, this work is one of the first efforts to integrate multi-source knowledge and address it through the lens of incremental domain adaptation.

3 Problem Formalization

Formally, we assume that our learning framework has access to streams of social commonsense knowledge data obtained from n different domains, denoted by D = {D1, D2, ..., Dn}. At a particular point in time, we extract knowledge from the current domain Di. We produce an embedding of social events by consolidating the accumulated knowledge across the modeled domains D≤i. Data from each domain source contains source-specific textual descriptions of social situations and their intuitive commonsense information, such as intents and emotions. Training samples drawn from a domain dataset Di could contain either a significant overlap or an entirely new set of knowledge when compared with the previously processed domains D1:i. Given such a setup, we aim to generate incrementally richer social event representations using our continual learning framework.

4 Datasets

For our representation learning task, we aggregate social commonsense knowledge data from various domain sources. This knowledge contains details about pragmatic aspects like intents and emotional reactions. We create a continual learning benchmark based on these commonsense data sources¹.

4.1 Lifelong Social Events Dataset

The different domain sources of social commonsense knowledge used for training our social event representation model are explained as follows.

ATOMIC dataset consists of inferential knowledge based on 24k short events covering a diverse range of everyday events and motivations. Though each event contains nine dimensions, the scope of this work is limited to intents and emotions as our inferential pragmatic dimensions.

CONCEPTNET knowledge base contains several commonsense assertions. For our purpose, we choose ConceptNet's relevant relations: /r/MotivatedByGoal, /r/CausesDesire, /r/Entails, /r/Causes, /r/HasSubevent. We convert triples in the dataset into template form.

SB-SCK Since social roles (e.g., student, mother, teacher, worker, etc.) provide additional information about the motives and emotions behind actions specified in the events (as shown in Figures 1 and 2), we adopt web-based knowledge mining techniques for capturing this aspect. This dataset was collected as a part of our recent work (Vijayaraghavan and Roy, 2021) using the following steps: (a) process texts from Reddit posts containing personal narratives as in (Vijayaraghavan and

¹The project details about future data/code releases or any updates will be available at https://pralav.github.io/lifelong eventrep?c=10


[Figure 2]
Left — Search-based Social Commonsense Knowledge samples:

Social Roles | Event Phrases | Motives
Politicians | use social media | to woo voters
Activists | use social media | to create a movement
Police | use social media | to connect with residents and solve crime
Workers | gather around table | to solve business problems
Priests | gather around table | to pray to god, share wine and bread
Friends | gather around table | to share a meal, conversation

Right — SB-SCK dataset statistics:

#events w/ motives: 103,357
#events w/ emotions: 69,584
#unique social roles: 586

Figure 2: Left: Samples from the Search-based Social Commonsense Knowledge (SB-SCK) dataset with highlighted motivations for social roles. Right: Statistics of the SB-SCK dataset.

Roy, 2021), (b) extract propositions from text using OpenIE tools, (c) perform a web search for plausible intents and emotions by attaching purpose clauses (Palmer et al., 2005) and feelings lexical units from FrameNet (Baker et al., 1998), and (d) finally, remove poorly extracted facts using a simple classifier trained on some seed commonsense knowledge. Figure 2 (Left) shows samples from this dataset indicating how the same action could have different social-role-related motivations. We refer to this as Search-based Social Commonsense Knowledge (SB-SCK) data. Figure 2 (Right) presents the data statistics.

For data from each of the above domain sources, we sample a free-form event text, its paraphrase, intent, emotional reactions, and negative samples of paraphrases, intents, and emotional reactions. Based on the annotated labels for motivation (Maslow's) and emotional reactions (Plutchik) in the STORYCOMMONSENSE data, we run simple K-Means clustering on the open-text intent data. We identify five disjoint clusters for each of the three domains and map them to those categories. For the purpose of our lifelong learning problem, we divide each domain's data into two sets (3 clusters and 2 clusters) and consider them as different sub-domains. This results in 6 tasks in our continual learning setup. We refer to this dataset as the Lifelong EventRep Corpus.
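The cluster-then-split step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact pipeline: a plain k-means stands in for the clustering, and random vectors stand in for real intent embeddings.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means; a stand-in for clustering open-text intent embeddings."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to the nearest center, then recompute centers.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def split_into_subdomains(intent_vecs, n_clusters=5, first=3):
    """Cluster intents, then split the 5 clusters 3/2 into two sub-domains."""
    labels = kmeans(intent_vecs, n_clusters)
    in_first = np.isin(labels, list(range(first)))      # clusters {0, 1, 2}
    return np.where(in_first)[0], np.where(~in_first)[0]

rng = np.random.default_rng(1)
vecs = rng.normal(size=(40, 8))    # hypothetical intent embeddings for one domain
sub_a, sub_b = split_into_subdomains(vecs)
```

Applying this per domain yields the two sub-domains that make up the 6 continual learning tasks.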

4.2 Paraphrase Datasets

We use random samples of parallel texts from paraphrase datasets like the PARANMT-50M corpus (Wieting and Gimpel, 2017) and the Quora Question Pairs dataset². These paraphrase datasets are primarily used for pretraining our model. We also produce paraphrases of free-form event texts in our

²https://www.kaggle.com/c/quora-question-pairs/data

dataset using a back-translation approach (Iyyer et al., 2018). We used pretrained English↔German translation models for this purpose.
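Back-translation round-trips each event text through a pivot language. The sketch below shows only the composition; the translate functions are hypothetical table-backed stubs standing in for pretrained En↔De NMT models.

```python
# Hypothetical stand-ins for pretrained En->De and De->En translation models.
def translate_en_de(text):
    return {"student takes course": "Student belegt Kurs"}.get(text, text)

def translate_de_en(text):
    return {"Student belegt Kurs": "student enrolls in a course"}.get(text, text)

def back_translate(event_text):
    """Round-trip En -> De -> En to produce a paraphrase of an event text."""
    return translate_de_en(translate_en_de(event_text))

paraphrase = back_translate("student takes course")
# With real NMT models, the round trip typically yields a lexically varied
# paraphrase of the input event text.
```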

5 Framework

Our goal is to learn distributed representations of social events by incorporating pragmatic aspects of the language beyond shallow event semantics. Moving away from conventional supervised multi-task classification based lifelong learning approaches, we focus on a lifelong representation learning approach that enables us to adapt and sequentially learn a social event embedding model. The motivation for a lifelong learning framework is that the growing knowledge obtained from various domain sources can effectively guide the modeling of complex social events. This involves systematically updating the model by consolidating this expanding knowledge to produce richer embeddings without forgetting previously accumulated knowledge. In this section, we explain the various components of our modeling framework.

5.1 Social Event Representation

Given an input event text description, the core idea is to first encode the free-form event text and decompose the ensuing representation into pragmatic (implied emotions and intents) and non-pragmatic (syntactic and semantic information) components. Eventually, we combine these decomposed representations to obtain an overall event representation and apply it in different downstream tasks.

5.1.1 Encoder

The input to our model is a free-form event text description from the ith domain, x(i)j ∈ Di. This free-form event text contains a sequence of tokens, x(i)j = [w1, w2, ..., wL], where each token w(·) is obtained from an input vocabulary V. The model encodes the input event text x(i)j ∈ R^(L×dX) in multiple steps. First, we construct a context-dependent token embedding using a context embedding function G: R^(L×dX) → R^(L×dH), where dX and dH refer to the embedding and hidden layer dimensions respectively. Following this encoding step, we incorporate a pooling or projection function, Gpp: R^(L×dH) → R^(3×dH), that transforms the event text from the context-dependent embedding space into pragmatic and semantic spaces. More specifically, we produce latent vectors for intents (hI), reactions (hR) and non-pragmatic (hN) information. Finally, we combine the latent vectors hN, hI, hR using a simple feed-forward layer, GC: R^(3×dH) → R^(dH), to produce a powerful social event representation, hC, capable of dispelling apparent ambiguities in the true sense of their meaning. Given positive and negative examples of intents, emotional reactions and paraphrases associated with the input event text, we learn to effectively sharpen each of the embeddings hI, hR and hC using metric learning methods.

For the sake of brevity, we drop the domain index i and the sample index j in this section. These encoding steps are summarized as:

He = [h1, h2, ..., hL] = G([w1, w2, ..., wL]) (1)

hI, hR, hN = Gpp(He) (2)

hC = GC(hI, hR, hN) (3)

We denote this multi-step encoding process resulting in hI, hR, hC as a function Gevent. We experiment with the following text embedding techniques as our context embedding function (G):

BiGRU Using bidirectional GRUs (Chung et al., 2014), we compute the context embedding of the input event text by concatenating the forward (→ht) and backward (←ht) hidden states: ht = [→ht; ←ht].

BERT We employ BERT (Devlin et al., 2018), a multi-layer bidirectional Transformer-based encoder, as our context embedding method G. We fine-tune a BERT model that takes the attribute-augmented event text x = [CLS] m [SEP] w1, ..., wL [SEP] as input and outputs a powerful context-dependent event representation He. The attribute m ∈ {xIntent, xReact, xNprag} refers to special tokens for intents, reactions and non-pragmatic aspects.

In our default case, our Gpp function is the output embedding of the [CLS] token associated with the respective attribute-augmented input. In cases where the input event text is not augmented with attribute special tokens, we apply pooling strategies such as attentive pooling (AP) or the mean (MEAN) of all context vectors obtained from the previous encoding step G. We obtain hI, hR, hN based on these techniques. Depending on the type of context embedding function, we refer to our multi-step event text encoder, Gevent, as EVENTGRU or EVENTBERT.
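Constructing the attribute-augmented input is straightforward; a minimal sketch, with plain token lists in place of a real BERT tokenizer:

```python
ATTRIBUTES = ["xIntent", "xReact", "xNprag"]   # special attribute tokens m

def augment(event_tokens, attribute):
    """Build the attribute-augmented input: [CLS] m [SEP] w1 ... wL [SEP]."""
    assert attribute in ATTRIBUTES
    return ["[CLS]", attribute, "[SEP]", *event_tokens, "[SEP]"]

# One augmented sequence per attribute; each yields its own [CLS] embedding.
inputs = {m: augment(["student", "takes", "course"], m) for m in ATTRIBUTES}
```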

5.1.2 Objective Loss

Using positive {upI, upR, upC} and N−1 negative {unI, unR, unC} examples of intents, emotions and paraphrases associated with the event texts, we calculate an N-pair loss, Lv(hv, zpv, {znk}k=1..N−1), to maximize the similarity between the representation of positive examples (zpv) and the computed embeddings (hv). Here, zev is computed using a transformation function fv as zev = fv(uev), where v ∈ {I, R, C} and e ∈ {p, n}. Thus, our loss function is devised as:

LT = βD · (LI + LR) + βE · LC (4)

where LI, LR are used to learn disentangled pragmatic embeddings (intent and emotion), and LC is intended to jointly embed semantic and pragmatic aspects to produce an overall social event representation. βD, βE are loss coefficients that weigh the importance of the disentanglement loss and the overall joint embedding loss. These coefficients are non-negative and sum to 1.
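A minimal NumPy sketch of the N-pair loss and the combined objective in Eq. (4). The dot-product similarity and the log-sum-exp form of the N-pair loss (Sohn, 2016) are assumptions for illustration; the paper does not spell out these internals.

```python
import numpy as np

def n_pair_loss(h, z_pos, z_negs):
    """N-pair loss: pull h toward its positive z_pos, away from N-1 negatives."""
    neg_logits = np.array([h @ z for z in z_negs]) - (h @ z_pos)
    return np.log1p(np.exp(neg_logits).sum())

def total_loss(losses, beta_D=0.5, beta_E=0.5):
    """Eq. (4): LT = beta_D * (LI + LR) + beta_E * LC."""
    L_I, L_R, L_C = losses
    assert abs(beta_D + beta_E - 1.0) < 1e-9   # coefficients sum to 1
    return beta_D * (L_I + L_R) + beta_E * L_C

rng = np.random.default_rng(0)
h, zp = rng.normal(size=8), rng.normal(size=8)
zns = [rng.normal(size=8) for _ in range(4)]   # N-1 = 4 negatives
L_v = n_pair_loss(h, zp, zns)
```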

5.2 Continual Learning

Given a never-ending list of social events, once-and-for-all training on a fixed dataset limits the utility of such models in real-world applications. Therefore, we draw ideas from the lifelong learning literature to adapt our models to new data while retaining prior knowledge. First, we implement a data-based approach, a variant of episodic memory replay (EMR) (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019; d'Autume et al., 2019), for mitigating catastrophic forgetting while allowing for beneficial backward knowledge transfer. Next, we combine it with simple vocabulary expansion for incremental domain adaptation.

5.2.1 Domain-Representative Episodic Memory Replay (DR-EMR)

We augment the model described in Section 5.1 with an episodic memory module to perform sparse experience replay. As we train our lifelong representation learning model, we create mixed training mini-batches (Bdom, Brep) by drawing samples from: (a) the new domain dataset, Bdom ⊂ Di, and (b) the episodic memory containing old domain samples, Brep ⊂ M. Training our model using mixed mini-batches introduces parameter changes that alter past event data embeddings, including those stored in our episodic memory, M. The episodic memory module is implemented as a "Read-Write" memory store containing selective event samples from the original domain dataset and their respective embeddings. The key memory operations are stated as follows:


[Figure 3: Left — architecture diagram of the Lifelong Social Event Representation Model: an Encoder maps the event text x = [w1, w2, ..., wn] to embeddings hI, hR and hC, trained with LossI + LossR and LossC, with an Experience Replay Memory over domain clusters. Right — a per-domain accuracy table with rows D1:1 to D1:6 and columns D1 to D6.]

Figure 3: Left: Illustration of our Lifelong Social Event Representation Model. Right: Average accuracy score (%) of one specific permuted run across 6 domains. The table shows the dynamics of our continual learning model allowing positive backward transfer, i.e., performance on previous domains gradually increases with new knowledge.

Read Operation The read operation retrieves domain-specific random samples from our episodic memory for experience replay. These samples contain the original event training data and their representations. We choose samples from a previously trained domain at every read step, assuming an overall uniform distribution over domains.
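The episodic memory's read path and the mixed mini-batch construction (Bdom, Brep) can be sketched as follows; the per-domain dict layout and the sample format are simplifying assumptions.

```python
import random

class EpisodicMemory:
    """Minimal 'Read-Write' store of (event_text, embedding) pairs per domain."""
    def __init__(self):
        self.store = {}                       # domain id -> list of samples

    def write(self, domain, samples):
        self.store.setdefault(domain, []).extend(samples)

    def read(self, batch_size, rng):
        # Pick one previously trained domain uniformly, then sample from it.
        domain = rng.choice(sorted(self.store))
        pool = self.store[domain]
        return [rng.choice(pool) for _ in range(batch_size)]

def mixed_batch(new_domain_data, memory, b_dom, b_rep, rng):
    """Form (B_dom, B_rep): new-domain samples plus replayed old-domain samples."""
    B_dom = [rng.choice(new_domain_data) for _ in range(b_dom)]
    B_rep = memory.read(b_rep, rng) if memory.store else []
    return B_dom, B_rep

rng = random.Random(0)
mem = EpisodicMemory()
mem.write(1, [("event_a", [0.1]), ("event_b", [0.2])])   # old-domain samples
B_dom, B_rep = mixed_batch([("event_c", [0.3])], mem, b_dom=2, b_rep=2, rng=rng)
```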

Write Operation In this work, we incorporate the following desirable characteristics of an episodic memory module: (a) M stores domain-representative samples that best approximate the domain's data distribution, and (b) M's capacity is bounded or at most expands sub-linearly. To achieve this, we adopt the following strategies:

Domain-Representative Sample Selection: Past studies have explored using a random writing strategy (d'Autume et al., 2019) and distribution approximation or K-Means to cluster the samples (Rebuffi et al., 2017). In our work, we perform sample selection using the CURE (Clustering Using REpresentatives) algorithm (Guha et al., 2001) to find C representative points. CURE employs hierarchical clustering that is made computationally feasible by adopting random sampling and two-pass clustering. Using event representations obtained after the embedding alignment transformation, we identify the domain-representative samples by computing Euclidean nearest neighbors to the C representative points. We set the value of C based on the memory budget. These domain-representative samples are stored in our episodic memory with their corresponding representations.
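A simplified sketch of domain-representative selection. Greedy farthest-point selection stands in for CURE's hierarchical clustering (an assumption made for brevity); the nearest-neighbor lookup to the C representative points follows the description above.

```python
import numpy as np

def representative_points(X, C, seed=0):
    """Stand-in for CURE: greedy farthest-point selection of C representatives."""
    rng = np.random.default_rng(seed)
    reps = [int(rng.integers(len(X)))]
    while len(reps) < C:
        # Distance of every point to its nearest already-chosen representative.
        d = np.min([((X - X[r]) ** 2).sum(1) for r in reps], axis=0)
        reps.append(int(np.argmax(d)))        # farthest point joins the set
    return X[reps]

def select_domain_representatives(embeddings, C):
    """Store the Euclidean nearest neighbor of each representative point."""
    reps = representative_points(embeddings, C)
    idx = [int(np.argmin(((embeddings - r) ** 2).sum(1))) for r in reps]
    return sorted(set(idx))

rng = np.random.default_rng(1)
emb = rng.normal(size=(50, 4))     # hypothetical event embeddings for one domain
chosen = select_domain_representatives(emb, C=5)
```

C is chosen from the memory budget, exactly as in the write strategy above.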

Replacement Policy: When the memory becomes full, we follow a simple memory replacement policy that selects an existing memory entry to delete. Specifically, we replace the qth memory entry, determined by q = argmax_q(softmax(φ(x(t)j) · Mq)), similar to the idea proposed in (Gulcehre et al., 2018).

Inspired by Wang et al. (2019), we propose a variant of the alignment model that helps overcome catastrophic forgetting by ensuring minimal distortion to previously computed representational spaces. To accomplish this, we define simple linear transformations GIA, GRA, GCA on top of the outputs of the multi-step encoder function Gevent. For simplicity, we drop the subscripts (I, R, C) and denote these transformation functions as GA. Given a new domain i, we initialize our multi-step event encoder Gievent and alignment function GiA with the last trained parameters, Gi−1event and Gi−1A respectively. The optimization is implemented in two steps: (a) for samples from (Bdom, Brep), we optimize the encoder output representations to be closer to their respective ground-truth examples in this linearly transformed space. Therefore, we modify the N-pair loss function to accommodate the alignment function as L(GiA(Gievent(x)), GiA(zp), {GiA(znk)}k=1..N−1); (b) for samples from Brep, we add an extra constraint (L + LEA) for each embedding component to align the old and new domain embedding spaces:

LEA = ||GiA(Gievent(x)) − Gi−1A(Gi−1event(x))||2 (5)
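The replacement policy and the alignment constraint of Eq. (5) can be sketched as follows; φ(x) and the memory matrix M are hypothetical stand-ins for the encoder output and the stored embeddings.

```python
import numpy as np

def entry_to_replace(phi_x, M):
    """Replace the q-th entry: q = argmax_q softmax(phi(x) . M_q)."""
    scores = M @ phi_x                         # dot product with each memory entry
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                       # softmax (monotone, so argmax = argmax of scores)
    return int(np.argmax(probs))

def alignment_loss(h_new, h_old):
    """Eq. (5): L2 distance between new- and old-encoder embeddings of x."""
    return float(np.linalg.norm(h_new - h_old))

M = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # 3 stored entry embeddings
q = entry_to_replace(np.array([0.9, 0.1]), M)
```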

6 Training

Since our model involves metric learning, hard negative mining is an essential step for faster convergence and improved discriminative capabilities. However, selecting overly hard examples too often makes training unstable. Therefore, we adopt a hybrid negative mining technique in which we choose a few semi-hard negative examples (Hermans et al., 2017) and combine them with random negative samples to effectively train our model.

In our work, we define a heuristic objective by weighing samples based on two factors: (i) word overlap or similarity in the embedding space of the event text, and (ii) intent and emotion free-form text or categories based on the STORYCOMMONSENSE data. More specifically, given an event text as anchor and a positive intent text based on a ground


Figure 4: Average accuracy scores (%) of data after 6 permuted runs from domains that have been observed at that point during the continual learning process.

truth motivation category, we mine negative instances for intent as follows: (a) choose random text samples associated with a motivation category that is different from that of the positive example but closer in the embedding space or word overlap, (b) choose random text samples within the same motivation category but with a different emotion category. We repeat this process for drawing negative instances related to emotions. For paraphrases, we consider a few examples with significant word overlap while the rest are randomly chosen. The N-pair loss helps alleviate the sensitivity of the triplet loss function to the choice of hard triplets. Finally, we pre-train our model with paraphrase data and fine-tune it using the examples obtained from hard negative mining for intents, emotions and paraphrases. For training, the learning rate is set to 0.0001 and the number of training epochs is 20. By default, we use EVENTBERT as our multi-step encoder. We conduct a study by assigning different values to the loss coefficients βD, βE, and explain the results in Section 7.1.2.
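The hybrid mining step (a few semi-hard negatives plus random ones) can be sketched as follows, using Euclidean distances over hypothetical embeddings; the margin value and split sizes are illustrative assumptions.

```python
import numpy as np

def hybrid_negatives(anchor, positive, candidates, n_semi=2, n_rand=2,
                     margin=1.0, seed=0):
    """Mix a few semi-hard negatives with random negatives (hybrid mining)."""
    rng = np.random.default_rng(seed)
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(candidates - anchor, axis=1)
    # Semi-hard: farther than the positive, but within a margin of it.
    semi = np.where((d_an > d_ap) & (d_an < d_ap + margin))[0]
    picked = [int(i) for i in rng.permutation(semi)[:n_semi]]
    # Fill the rest of the batch with random negatives not already picked.
    rest = [i for i in range(len(candidates)) if i not in picked]
    picked += [int(i) for i in rng.choice(rest, size=min(n_rand, len(rest)),
                                          replace=False)]
    return picked

rng = np.random.default_rng(3)
anchor, positive = rng.normal(size=4), rng.normal(size=4)
cands = rng.normal(size=(20, 4))     # hypothetical candidate negative embeddings
negs = hybrid_negatives(anchor, positive, cands)
```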

7 Experiments

We conduct several experiments to study the power of our learned embeddings. Our experiments are designed to answer the following questions:

RQ1: How well does our model perform in comparison to other continual learning approaches for intent-emotion prediction?

RQ2: To what extent do our modeling choices impact the results in predicting intents and emotions?

RQ3: Does our model outperform existing state-of-the-art methods on the hard similarity task that evaluates the effectiveness of the learned embeddings?

RQ4: Do the learned embeddings demonstrate transfer capability to downstream tasks, i.e., paraphrase detection and SocialIQA reasoning?

7.1 Intent-Emotion Prediction (RQ1)

The continual learning methods evaluated using our Lifelong EventRep Corpus are given below.

• Base: We simply fine-tune our model on successive tasks from the previously trained checkpoint.

• A-GEM: This method (Chaudhry et al., 2019) uses a constraint that enables the projected gradient to decrease the loss on older tasks. We randomly choose 2-3% samples from all the previous tasks to form a constraint.

• EWC: A regularization-based technique, Online-EWC (Schwarz et al., 2018), is used to address catastrophic forgetting.

• EMR: We use randomly stored examples for sparse experience replay.

• A-EMR: A variant of our model that uses random samples for experience replay with alignment constraints.

• DR-EMR: This is our complete model involving domain-representative experience replay and alignment constraints.
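A minimal sketch of the sparse experience replay that the EMR-style methods above rely on; the buffer capacity, replay schedule, and reservoir-sampling policy are illustrative assumptions rather than the paper's exact configuration:

```python
import random

class EpisodicMemory:
    """Stores a bounded sample of past-domain examples; replayed sparsely
    while training on a new domain to curb catastrophic forgetting."""

    def __init__(self, capacity=100):
        self.buffer = []
        self.capacity = capacity
        self.seen = 0  # total examples offered to the buffer so far

    def store(self, example):
        """Reservoir sampling keeps a uniform sample over everything seen."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = example

    def replay_batch(self, step, every=10, batch_size=4):
        """Return a batch of stored examples every `every` steps, else None."""
        if self.buffer and step % every == 0:
            return random.sample(self.buffer, min(batch_size, len(self.buffer)))
        return None
```

DR-EMR would replace the uniform `store` policy with domain-representative sample selection (e.g., CURE-based clustering per domain).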

7.1.1 Empirical Results

After running six permuted sequences of tasks, we calculate the mean performance on the test sets of all observed task domains after time step k, given by AvgAcc = (1/k) ∑_{i=1}^{k} Acc(i), where Acc(i) is the model accuracy on the test set from the domain Di. Further, we also compute the standard deviation to determine the importance of the order of the tasks during training. Table 1a contains the comparison of last-step AvgAcc scores and standard deviation for predicting intents and emotions. Figure 4 plots the AvgAcc at every step, where D1:i indicates that the model has seen data from i domains, to evaluate our continual learning process.
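The metric above is straightforward to compute; a small helper (function names are ours) under the stated definition:

```python
import statistics

def avg_acc(acc_by_domain):
    """AvgAcc after step k = (1/k) * sum_{i=1..k} Acc(i), where
    acc_by_domain[i-1] is the test accuracy on observed domain D_i."""
    return sum(acc_by_domain) / len(acc_by_domain)

def order_sensitivity(final_avgaccs):
    """Std-dev of last-step AvgAcc across permuted task orders; a high
    value means performance depends heavily on training order."""
    return statistics.stdev(final_avgaccs)
```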

In the absence of any lifelong learning strategies, the performance drops significantly for the Base model, while emphasizing the importance of task order as indicated by the high standard deviation value. Compared to the Base model, our results show some improvement in intent and emotion prediction as we introduce methods like A-GEM and EWC. However, we observe a significant performance gain with EMR-based techniques. Our complete model (DR-EMR) outperforms all the other methods, thereby demonstrating the importance of


Models                 Intents          Emotions
                       AvgAcc   Std    AvgAcc   Std

EVENTBERT (Base)       51.16    10.41  62.27    9.74
EVENTBERT + A-GEM      60.47    6.75   69.44    6.12
EVENTBERT + EWC        56.93    8.86   66.89    7.55
EVENTBERT + EMR        63.17    4.20   72.04    4.66
EVENTBERT + A-EMR      68.54    2.85   73.42    3.10
EVENTBERT + DR-EMR     70.02    0.38   78.48    0.65

(a) Lifelong EventRep Dataset

Models                                    % Acc.

KGEB (Ding et al., 2016)                  50.09
NTN + Int (Ding et al., 2019)             58.83
NTN + Int + Senti (Ding et al., 2019)     64.31
EVENTBERT (βE = 0.3) + DR-EMR             66.19
EVENTBERT (βE = 0.5) + DR-EMR             71.23
EVENTBERT (βE = 0.7) + DR-EMR             69.79

(b) Hard Similarity Dataset

Table 1: Evaluation results on: (a) held-out Lifelong EventRep test set and (b) combined hard similarity dataset.

domain-representative sampling and alignment constraints towards learning representations that help effective prediction of intents and emotions. Moreover, we assess the domain sequences that cause a performance drop. For example, whenever SBSCK is trained at the end, the model shows reduced accuracy. The reason can be ascribed to the effect of interference of noisy knowledge over the previously trained cleaner domains. Training this noisy source earlier leads to positive knowledge transfer and hence produces sharpened performance outcomes. Despite these odds, our DR-EMR model records the least standard deviation, implying reduced sensitivity to training order. Figure 3 (Right) shows the dynamics of our continual learning model. It contains accuracy scores of a single run and displays the lower-triangular values. We see that our approach allows positive backward transfer, i.e., model performance in previous domains gradually increases with new domain knowledge. Our model achieves the best performance (see the bold-faced accuracy scores in Figure 3 (Right)) in most domains after observing all the data (D1:6).

7.1.2 Ablation Study (RQ2)

We analyze different model configurations related to: (a) encoding: EVENTGRU, EVENTBERT; (b) pooling: attribute-augmented input (CLS), Mean Pooling (MP), and Attentive Pooling (AP); and (c) sampling: K-Means, CURE. Results of the study are given in Figure 5 (Left). We evaluated different combinations of these strategies but only report the average accuracy scores of configurations having the best strategy at each category combined with the variants in the following category, i.e., for pooling strategies, we only report scores with the EVENTBERT encoding strategy, and so on. From the results, we ascertain that the best configuration comprises an EVENTBERT encoder supported by attentive pooling and CURE-based sample selection.

Methods      Intents   Emotions

Encoding Strategy
EVENTGRU     60.92     70.61
EVENTBERT    68.56     78.48

Pooling Strategy
MP           65.50     76.13
AP           67.28     76.69
CLS          68.56     78.48

Sampling Strategy
K-Means      66.45     75.88
CURE         68.56     78.48

Figure 5: Results of our ablation study on a held-out validation set. Left: AvgAcc scores (%) on the intent-emotion prediction task using different encoding, pooling & sampling strategies. Right: AvgAcc scores (%) to measure the effect of βE in predicting intents.

Additionally, we measure the effect of βE on the prediction of intents. As shown in Figure 5 (Right), the model performs significantly better for lower values of βE, as more weight is assigned to the disentanglement of pragmatic aspects. However, we observe that a balanced loss function with βE = 0.5 allows for consistently good performance in both intent-emotion prediction (Figure 5 (Right)) and hard similarity tasks (see Section 7.2). Unlike other hyperparameters, changes to βE determine the importance of incorporating semantic or pragmatic information in the ensuing event embedding.

7.2 Hard Similarity Task (RQ3)

Following Ding et al. (2019), we evaluate our social event representation on an extended dataset of event pairs containing: (a) similar event pairs having minimum lexical overlap (e.g., people admired president / citizens loved leader); (b) dissimilar event pairs with high lexical overlap (e.g., people admired president / people admired nature). Good performance in this task ensures that similar events are pulled closer to each other than dissimilar events. Combining hard similarity datasets from (Ding et al., 2019) and (Weber et al., 2018), the total size of this balanced


dataset is 2,230 event pairs. Using our joint embedding hC for an event text and a triplet loss setup, we compute a similarity score between similar and dissimilar pairs. The baselines include: a knowledge-graph based embedding model (KGEB) (Ding et al., 2016), and the Neural Tensor Network (NTN) and its variants augmented with ATOMIC dataset-based embeddings (Int, Senti) (Ding et al., 2019). We report the model's accuracy in assigning a higher similarity score to similar pairs than to dissimilar pairs. Table 1b shows that our model outperforms the state-of-the-art method for this task.
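The evaluation protocol reduces to a pairwise ranking check; a sketch using cosine similarity between embeddings (the choice of cosine is our assumption — the paper scores pairs using its joint embedding hC and a triplet-loss setup):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def hard_similarity_accuracy(triples):
    """Each triple holds embeddings (anchor, similar, dissimilar);
    accuracy is the fraction of triples where the similar event scores
    higher than the dissimilar one."""
    hits = sum(cosine(a, s) > cosine(a, d) for a, s, d in triples)
    return hits / len(triples)
```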

7.3 Paraphrase Detection (RQ4)

Given a sentence pair, the objective is to detect whether they are paraphrases or not. For each sentence pair (s1, s2), we pass them through our model and obtain their respective hC, given by vectors (u, v). We concatenate these vectors (u, v) with the element-wise difference |u − v| and feed the result to a feed-forward layer. We optimize a binary cross-entropy loss. For evaluation, we compare our model against baselines like BERT and ESIM (Chen et al., 2016). Trained on a subset of the dataset described in Section 4.2, we choose an out-of-domain test dataset where samples stem from a dissimilar input distribution. To this end, the Twitter URL paraphrasing corpus (Lan et al., 2017), referred to as TwitterPPDB, is selected. Table 2b contains the results of our evaluation. The results testify to the efficacy of our embeddings.
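The classifier head described above — concatenating (u, v) with their element-wise difference — can be sketched as follows (the weights are illustrative; training would fit them under binary cross-entropy):

```python
import numpy as np

def pair_features(u, v):
    """Pair representation [u; v; |u - v|] fed to the feed-forward layer."""
    return np.concatenate([u, v, np.abs(u - v)])

def paraphrase_prob(u, v, W, b):
    """One feed-forward layer plus a sigmoid -> P(paraphrase)."""
    logit = pair_features(u, v) @ W + b
    return float(1.0 / (1.0 + np.exp(-logit)))
```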

7.4 Social IQA Reasoning (RQ4)

We determine the quality of our latent social event representations by evaluating on a social commonsense reasoning benchmark – the SocialIQA dataset (Sap et al., 2019b). Given a context, a question, and three candidate answers, the goal is to select the right answer among the candidates. Following Sap et al. (2019b), the context, question, and candidate answer are concatenated using separator tokens and passed to the BERT model. Additionally, we feed the context to our EVENTBERT model to obtain three embeddings hI, hR, hC. While the original work computed a score l using the hidden state of the [CLS] token, we introduce a minor modification to this step as: l = W5 tanh(W1 hCLS + W2 hI + W3 hR + W4 hC), where W1:4 ∈ R^(dH×dH) and W5 ∈ R^(1×dH) are learnable parameters. Similar to (Sap et al., 2019b), the triple with the highest normalized score is used as the model's prediction. We fine-tune BERT models
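The modified scoring function can be written out directly; matrix shapes follow the equation above, and the softmax over candidates reflects the normalization step from Sap et al. (2019b):

```python
import numpy as np

def socialiqa_score(h_cls, h_i, h_r, h_c, W1, W2, W3, W4, W5):
    """l = W5 tanh(W1 hCLS + W2 hI + W3 hR + W4 hC);
    W1..W4 are (dH, dH) and W5 is (1, dH), so l is a scalar."""
    hidden = np.tanh(W1 @ h_cls + W2 @ h_i + W3 @ h_r + W4 @ h_c)
    return float(W5 @ hidden)

def predict(candidate_scores):
    """Softmax-normalize the candidate scores and pick the argmax."""
    s = np.exp(candidate_scores - np.max(candidate_scores))
    return int(np.argmax(s / s.sum()))
```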

Models        Dev    Test

w/o Social Event Embeddings
GPT           63.3   63.0
BERT-base     63.3   63.1
BERT-large    66.0   64.5

w/ Social Event Embeddings
BERT-base     65.1   64.0
BERT-large    68.7   67.9

(a) SocialIQA Dataset

Models                  % Acc

ESIM                    84.01
BERT                    87.63
EVENTBERT (βE = 0.5)    88.23
EVENTBERT (βE = 0.7)    90.16

(b) TwitterPPDB dataset

Table 2: Accuracy scores (%) of different models on: (a) SocialIQA dev & test set, and (b) Twitter URL Paraphrasing corpus, TwitterPPDB. EVENTBERT models employ DR-EMR; the parenthesized value indicates βE.

using our new scoring function with social event embeddings (denoted as "w/") and compare against baselines (like GPT (Radford et al., 2018)) without our event embeddings (denoted as "w/o"). Results in Table 2a indicate that a simple enhancement procedure at the penultimate step can offer significant performance gains. Our findings also suggest that our enhanced model performed well for question types like 'wants' and 'effects' that weren't explicitly modeled in our embedding model. This confirms that our pragmatics-enriched embeddings lead to improved reasoning capabilities.

8 Conclusion

Humans rely upon commonsense knowledge about social contexts to ascribe meaning to everyday events. In this paper, we introduce a lifelong learning approach to effectively embed social events with the help of a growing set of social commonsense knowledge assertions acquired from different domains. First, we leverage social commonsense knowledge to sharpen social event embeddings with semantic and pragmatic attributes. Next, we employ domain-representative episodic memory replay (DR-EMR) to overcome catastrophic forgetting and enable positive knowledge transfer with the emergence of new domain knowledge. By evaluating on a corpus of social events aggregated from multiple sources, we establish that our model is able to outperform several baselines. Experimental results on downstream tasks like event similarity, reasoning, and paraphrase detection demonstrate the capabilities of our social event embeddings. We hope that our work will motivate further exploration into lifelong representation learning of social events and advance research in inferring pragmatic dimensions from texts.


References

Catherine Adams, Janet Baxendale, Julian Lloyd, and Catherine Aldred. 2005. Pragmatic language impairment: case studies of social and pragmatic language therapy. Child Language Teaching and Therapy, 21(3):227–250.

Nabiha Asghar, Lili Mou, Kira A Selby, Kevin D Pantasdo, Pascal Poupart, and Xin Jiang. 2018. Progressive memory banks for incremental domain adaptation. arXiv preprint arXiv:1811.00239.

Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90.

Niranjan Balasubramanian, Stephen Soderland, Oren Etzioni, et al. 2013. Generating coherent event schemas at scale. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1721–1731.

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 602–610. Association for Computational Linguistics.

Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. 2019. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2016. Enhanced LSTM for natural language inference. arXiv preprint arXiv:1609.06038.

Pengxiang Cheng and Katrin Erk. 2018. Implicit argument prediction with event knowledge. arXiv preprint arXiv:1802.07226.

Jackie Chi Kit Cheung, Hoifung Poon, and Lucy Vanderwende. 2013. Probabilistic frame induction. arXiv preprint arXiv:1302.4813.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Cyprien de Masson d'Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. 2019. Episodic memory in lifelong language learning. arXiv preprint arXiv:1906.01076.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Xiao Ding, Kuo Liao, Ting Liu, Zhongyang Li, and Junwen Duan. 2019. Event representation learning enhanced with external commonsense knowledge. arXiv preprint arXiv:1909.05190.

Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2014. Using structured events to predict stock price movement: An empirical investigation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1415–1425.

Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2015. Deep learning for event-driven stock prediction. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2016. Knowledge-driven event embedding for stock prediction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2133–2142.

Mark Granroth-Wilding and Stephen Clark. 2016. What happens next? Event prediction using a compositional neural network model. In Thirtieth AAAI Conference on Artificial Intelligence.

Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. 2001. CURE: an efficient clustering algorithm for large databases. Information Systems, 26(1):35–58.

Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. 2018. Dynamic neural Turing machine with continuous and discrete addressing schemes. Neural Computation, 30(4):857–884.

Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.

Robert Hopper and Rita C Naremore. 1978. Children's Speech: A Practical Introduction to Communication Development. HarperCollins Publishers.

Linmei Hu, Juanzi Li, Liqiang Nie, Xiao-Li Li, and Chao Shao. 2017. What happens next? Future subevent prediction using contextual hierarchical LSTM. In Thirty-First AAAI Conference on Artificial Intelligence.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059.

Bram Jans, Steven Bethard, Ivan Vulic, and Marie Francine Moens. 2012. Skip n-grams and ranking functions for predicting script events. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 336–344. Association for Computational Linguistics.

Ronald Kemker and Christopher Kanan. 2017. FearNet: Brain-inspired model for incremental learning. arXiv preprint arXiv:1711.10563.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.

Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A continuously growing dataset of sentential paraphrases. arXiv preprint arXiv:1708.00391.

Sang-Woo Lee, Chung-Yeon Lee, Dong-Hyun Kwak, Jiwon Kim, Jeonghee Kim, and Byoung-Tak Zhang. 2016. Dual-memory deep learning architectures for lifelong learning of everyday human behaviors. In IJCAI, pages 1669–1675.

David Lopez-Paz and Marc'Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476.

Ashutosh Modi and Ivan Titov. 2013. Learning semantic script knowledge with event embeddings. arXiv preprint arXiv:1312.5198.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Lauren Parsons, Reinie Cordier, Natalie Munro, Annette Joosten, and Renee Speyer. 2017. A systematic review of pragmatic language interventions for children with autism spectrum disorder. PLoS ONE, 12(4):e0172242.

Karl Pichotta and Raymond J Mooney. 2016. Learning statistical scripts with LSTM recurrent neural networks. In Thirtieth AAAI Conference on Artificial Intelligence.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A Smith, and Yejin Choi. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. arXiv preprint arXiv:1805.06939.

Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. 2017. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010.

Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. 2019a. ATOMIC: An atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3027–3035.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019b. SocialIQA: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728.

Jonathan Schwarz, Jelena Luketina, Wojciech M Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. 2018. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370.

Shagun Sodhani, Sarath Chandar, and Yoshua Bengio. 2020. Toward training recurrent neural networks for lifelong learning. Neural Computation, 32(1):1–35.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.

Ottokar Tilk, Vera Demberg, Asad Sayeed, Dietrich Klakow, and Stefan Thater. 2016. Event participant modelling with neural networks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 171–182.

Prashanth Vijayaraghavan and Deb Roy. 2021. Modeling human motives and emotions from personal narratives using external knowledge and entity tracking. In Proceedings of The Web Conference 2021.

Hong Wang, Wenhan Xiong, Mo Yu, Xiaoxiao Guo, Shiyu Chang, and William Yang Wang. 2019. Sentence embedding alignment for lifelong relation extraction. arXiv preprint arXiv:1903.02588.

Noah Weber, Niranjan Balasubramanian, and Nathanael Chambers. 2018. Event representations with tensor-based compositions. In Thirty-Second AAAI Conference on Artificial Intelligence.

John Wieting and Kevin Gimpel. 2017. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. arXiv preprint arXiv:1711.05732.

Barbara S Wood. 1976. Children and Communication: Verbal and Nonverbal Language Development.

Markus Wulfmeier, Alex Bewley, and Ingmar Posner. 2018. Incremental adversarial domain adaptation for continually changing environments. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–9. IEEE.

Friedemann Zenke, Ben Poole, and Surya Ganguli. 2017. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3987–3995. JMLR.org.