
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 1392–1401, November 16–20, 2020. ©2020 Association for Computational Linguistics


Coupled Hierarchical Transformer for Stance-Aware Rumor Verification in Social Media Conversations

Jianfei Yu¹, Jing Jiang², Ling Min Serena Khoo³, Hai Leong Chieu³, and Rui Xia¹

¹ School of Artificial Intelligence, Nanjing University of Science & Technology, China
² School of Information Systems, Singapore Management University, Singapore
³ DSO National Laboratories, Singapore

{jfyu, rxia}@njust.edu.cn, [email protected], {klingmin, chaileon}@dso.org.sg

    Abstract

The prevalent use of social media enables the rapid spread of rumors on a massive scale, creating an emerging need for automatic rumor verification (RV). A number of previous studies focus on leveraging stance classification to enhance RV with multi-task learning (MTL) methods. However, most of these methods fail to employ pre-trained contextualized embeddings such as BERT, and do not exploit inter-task dependencies by using predicted stance labels to improve the RV task. Therefore, in this paper, to extend BERT to obtain thread representations, we first propose a Hierarchical Transformer¹, which divides each long thread into shorter subthreads and employs BERT to separately represent each subthread, followed by a global Transformer layer to encode all the subthreads. We further propose a Coupled Transformer Module to capture the inter-task interactions and a Post-Level Attention layer to use the predicted stance labels for RV. Experiments on two benchmark datasets show the superiority of our Coupled Hierarchical Transformer model over existing MTL approaches.

    1 Background

Recent years have witnessed a profound revolution in social media, as many individuals gradually turn to different social platforms to share the latest news and voice personal opinions. Meanwhile, the flourishing of social media also enables the rapid dissemination of unverified information (i.e., rumors) on a massive scale, which may cause serious harm to society (e.g., impacting presidential election decisions (Allcott and Gentzkow, 2017)).

¹Note that the concept of hierarchy in this paper is different from that in Yang et al. (2016), as we use hierarchy to refer to a neural structure that first models the local interactions among posts within each subthread, followed by modeling the global interactions among all the posts in the whole thread.

Figure 1: An example conversation thread with both a rumor veracity label and stance labels. Each post has a stance label towards the claim in the source post, and the source claim was later identified as a false rumor.

Since manually checking the sheer quantity of rumors on social media is labor-intensive and time-consuming, it is crucial to develop automatic rumor verification approaches to mitigate their harmful effects.

Rumor verification is typically defined as the task of determining whether the source claim in a conversation thread is a false rumor, a true rumor, or an unverified rumor (Zubiaga et al., 2018a). In the literature, much work has been done on rumor verification (Liu et al., 2015; Ma et al., 2016; Ruchansky et al., 2017; Chen et al., 2018; Kochkina and Liakata, 2020). Among them, one appealing line of work focuses on exploiting stance signals to enhance rumor verification (Zubiaga et al., 2016), since it is observed that people's stances in reply posts usually provide important clues for rumor verification (e.g., in Fig. 1, if the source claim is denied or queried by most replies, it is highly probable that the source claim contains misinformation and is a false rumor).

This line of work has attracted increasing attention in recent years. A number of multi-task learning (MTL) methods have been proposed to jointly perform stance classification (SC) and rumor verification (RV) over conversation threads, including sequential LSTM-based methods (Li et al., 2019), Tree LSTM-based methods (Kumar and Carley, 2019), and graph convolutional network-based methods (Wei et al., 2019). These MTL approaches are mainly constructed upon the MTL2 framework proposed by Kochkina et al. (2018), which first learns shared representations with shared layers at the low level, followed by learning task-specific representations with separate stance-specific layers and rumor-specific layers at the high level.

Although these MTL approaches have shown the usefulness of stance signals for rumor verification, they still suffer from the following shortcomings. (1) The first obstacle lies in their single-task models for SC or RV, whose randomly initialized text encoders such as LSTMs tend to overfit the existing small annotated corpora. With the recent trend of pre-training, pre-trained text encoders such as BERT have been shown to overcome this overfitting problem and achieve significant improvements on many NLP tasks (Devlin et al., 2019). However, unlike previous sentence-level tasks, our SC and RV tasks require language understanding over conversation threads in social media. Since BERT is unable to process arbitrarily long sequences due to the maximum length constraint in its pre-training stage, it remains an open question how to extend BERT to our SC and RV tasks. (2) Another important limitation of previous studies lies in their multi-task learning framework. First, the MTL2 framework used in existing methods fails to explicitly model the inter-task interactions between the stance-specific and rumor-specific layers. Second, although it has been observed that people's stances in reply posts are crucial for rumor verification, the stance distributions predicted by the stance-specific layers have not been utilized for rumor veracity prediction in the MTL2 framework.

To address the above two shortcomings, we explore the potential of BERT for stance-aware rumor verification and propose a new multi-task learning model based on the Transformer (Vaswani et al., 2017), named Coupled Hierarchical Transformer. Our main contributions can be summarized as follows:

• To extend BERT as our single-task model for SC and RV, we propose a Hierarchical Transformer architecture. Specifically, we first flatten all the posts in a conversation thread into a long sequence, and then decompose it evenly into multiple subthreads, each within the length constraint of BERT. Next, each subthread is encoded with BERT to capture the local interactions between posts within the subthread, and a Transformer layer is stacked on top of all the subthreads to capture the global interactions between posts in the whole conversation thread.

• To tackle the limitations of the MTL2 framework, we first design a Coupled Transformer Module to capture the inter-task interactions between the stance-specific and the rumor-specific layers. Moreover, to utilize the stance distributions predicted for each post, we propose to concatenate them with the associated post representations, followed by a post-level attention mechanism to automatically learn the importance of each post for the final rumor verification task.

Evaluations on two benchmark datasets demonstrate the following. First, compared with existing single-task models, our Hierarchical Transformer brings consistent performance gains in Macro-F1 on both the SC and RV tasks. Second, our Coupled Hierarchical Transformer outperforms the state-of-the-art multi-task learning approach by 9.2% and 6.3% in Macro-F1 on the two benchmarks, respectively.

    2 Related Work

Stance Classification: Although stance classification has been well studied in different contexts such as online forums (Hasan and Ng, 2013; Lukasik et al., 2016; Ferreira and Vlachos, 2016; Mohammad et al., 2016), a recent trend is to study stance classification towards rumors on different social media platforms (Mendoza et al., 2010; Qazvinian et al., 2011). These studies can be roughly categorized into two groups. One line of work aims to design different features to capture the sequential properties of conversation threads (Zubiaga et al., 2016; Aker et al., 2017; Pamungkas et al., 2018; Zubiaga et al., 2018b; Giasemidis et al., 2018). Another line of work applies recent deep learning models to automatically capture effective stance features (Kochkina et al., 2017; Veyseh et al., 2017). Our work extends the latter line by proposing a hierarchical Transformer based on the recent pre-trained BERT for this task. Moreover, we note that our BERT-based hierarchical Transformer is similar to the model proposed by Pappagari et al. (2019), but our model's design of the input and output layers is specific to stance classification, which differs from their work.

Rumor Verification: Due to the negative impact of various rumors spreading on social media, rumor verification has attracted increasing attention in recent years. Existing approaches to single-task rumor verification generally fall into two groups. The first line of work focuses either on employing a myriad of hand-crafted features (Qazvinian et al., 2011; Yang et al., 2012; Kwon et al., 2013; Ma et al., 2015), including post contents, user profiles, information credibility features (Castillo et al., 2011), and propagation patterns, or on resorting to various kinds of kernels to model the event propagation structure (Wu et al., 2015; Ma et al., 2017). The second line of work applies variants of several neural network models to automatically capture important features among all the propagated posts (Ma et al., 2016; Ruchansky et al., 2017; Chen et al., 2018). Different from these studies, our goal in this paper is to leverage stance classification to improve rumor verification with a multi-task learning architecture.

Stance-Aware Rumor Verification: A recent advance in rumor verification is to exploit stance information with different multi-task learning approaches. Specifically, Ma et al. (2018a) and Kochkina et al. (2018) respectively proposed two multi-task learning architectures to jointly optimize stance classification and rumor verification based on two different RNN variants, i.e., GRU and LSTM. More recently, Kumar and Carley (2019) proposed another multi-task LSTM model based on tree structures for stance-aware rumor verification. Our work shares the same intuition as these previous studies, and aims to explore the potential of the pre-trained BERT for this multi-task learning problem.

    3 Methodology

In this section, we first formulate the tasks of stance classification (SC) and rumor verification (RV). We then describe our single-task model for SC and RV, followed by our multi-task learning framework for stance-aware rumor verification.

    3.1 Task Formulation

Given a Twitter corpus, let $D = \{C_1, C_2, \ldots, C_{|D|}\}$ denote the set of conversation threads in the corpus. Each thread $C_i$ is assumed to consist of a post containing the source claim $S_0$ and a sequence of reply posts sorted in chronological order, denoted by $R_1, R_2, \ldots, R_N$.

For the SC task, given an input thread $C_i$, we assume that each post in the thread (including the source post and the reply posts) is annotated with a stance label towards the source claim, namely support, deny, query, or comment. Formally, let $\mathbf{s} = (s_0, s_1, \ldots, s_N)$ denote the sequence of stance labels; the goal of SC is to learn a sequence classification function $g: S_0, R_1, \ldots, R_N \rightarrow s_0, s_1, \ldots, s_N$.

For the RV task, we assume that each input thread $C_i$ is associated with a rumor label $y_i$, which belongs to one of three classes, namely false rumor, true rumor, or unverified rumor. The goal of RV is to learn a classification function $f: C_i \rightarrow y_i$.
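To make the formulation concrete, the following is a minimal Python sketch of the data structures it implies; the class and label names are our own illustrations, not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

STANCE_LABELS = ["support", "deny", "query", "comment"]  # s_0 ... s_N, one per post
RUMOR_LABELS = ["false", "true", "unverified"]           # y_i, one per thread

@dataclass
class Post:
    text: str
    stance: Optional[str] = None    # one of STANCE_LABELS (gold or predicted)

@dataclass
class Thread:
    source: Post                    # S_0, carrying the source claim
    replies: List[Post]             # R_1 ... R_N in chronological order
    veracity: Optional[str] = None  # one of RUMOR_LABELS

    def posts(self) -> List[Post]:
        # SC labels every post (source + replies); RV labels the whole thread.
        return [self.source] + self.replies
```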

3.2 Hierarchical Transformer for Stance Classification and Rumor Verification

In this subsection, we present our proposed Hierarchical Transformer, a single-task learning framework covering both the SC and RV tasks. Fig. 2 illustrates an overview of our model, which consists of four modules: input thread transformation, local context encoding, global context encoding, and output layers.

Motivation: Although BERT has been widely adopted in various NLP tasks (Devlin et al., 2019), its application to our SC and RV tasks is not trivial. First, most previous studies employed BERT to obtain token-level representations for sentence or paragraph understanding, whereas our SC and RV tasks primarily require sentence-level representations for conversation thread understanding. Second, due to the maximum length constraint during the pre-training stage, BERT cannot be directly applied to encode arbitrarily long sequences, e.g., the conversation threads in our tasks. Although truncating the input sequences is a feasible solution, it inevitably ignores many posts that might be crucial for rumor verification.

Our main idea to address the above limitations is to divide the long sequence of a thread into shorter sequences, each within the length constraint of BERT, and to use a hierarchical model to capture the global interactions at the top layer.

Input Thread Transformation: First, to obtain post-level representations, we insert two special tokens, [CLS] and [SEP], at the beginning and the end of each post, where the [CLS] token is intended to represent the semantic meaning of the post following it. We then sort the transformed posts in each thread $C_i$ in chronological order and flatten them into a long sequence.

Figure 2: Our Single-Task Model (Hierarchical Transformer) for Stance Classification and Rumor Verification.

Second, to eliminate the maximum length constraint, we decompose the flattened sequence into multiple subthreads, so that each subthread contains the same number of posts and the sequence length of each subthread satisfies the length constraint.
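The input transformation and decomposition just described can be sketched as follows, using the Hugging Face BERT tokenizer and the hyperparameters reported in Section 4.1 (m = 30 tokens per post, n = 17 posts per subthread, k = 6 subthreads); the helper names are our own.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode_post(text: str, m: int = 30) -> list:
    # [CLS] tok_1 ... tok_{m-2} [SEP], padded/truncated to exactly m ids
    return tokenizer.encode(text, max_length=m, padding="max_length",
                            truncation=True)

def to_subthreads(posts: list, n: int = 17, k: int = 6, m: int = 30) -> list:
    """Flatten a thread into k subthreads of n posts, each post m tokens long."""
    ids = [encode_post(p, m) for p in posts[: n * k]]
    pad_post = [tokenizer.pad_token_id] * m
    ids += [pad_post] * (n * k - len(ids))  # pad the thread to k*n posts
    # Subthread l holds posts l*n ... (l+1)*n - 1; n*m = 510 tokens stays
    # within BERT's 512-token limit.
    return [ids[l * n : (l + 1) * n] for l in range(k)]
```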

Formally, let $C_i = (S_0, R_1, \ldots, R_N)$ denote the flattened thread, where $S_0$ is the source post and $R_j$ refers to the $j$-th reply post. As shown at the bottom of Fig. 2, we assume that $C_i$ is decomposed into $k$ subthreads, where each subthread consists of $n$ consecutive posts and each post consists of $m$ tokens.² For the $j$-th post in the thread $C_i$, let $P_j = (\mathbf{x}^j_{\mathrm{CLS}}, \mathbf{x}^j_1, \ldots, \mathbf{x}^j_{m-2}, \mathbf{x}^j_{\mathrm{SEP}})$ denote its input representation, where each token $\mathbf{x}$ is represented by summing up its word embeddings, segment embeddings, and position embeddings. For the $l$-th subthread in $C_i$, we write $B_l = (P_{l0}, P_{l1}, \ldots, P_{l(n-1)})$.

²Note that for parallel computing, each post is padded or truncated to have the same number of tokens, i.e., $m$, and each subthread is padded to have the same number of posts, i.e., $n$.

Local Context Encoding (LCE): Next, we employ the pre-trained BERT to separately process the $k$ subthreads, capturing the local interactions between adjacent posts within each subthread:

$$\mathbf{h}_l = \mathrm{BERT}(B_l), \quad l = 1, 2, \ldots, k, \quad (1)$$

where $\mathbf{h}_l \in \mathbb{R}^{nm \times d}$ is the hidden representation generated for the $l$-th subthread.

Global Context Encoding (GCE): To further capture the global interactions between all the posts in the whole conversation thread, we first concatenate the hidden representations of all subthreads, $\mathbf{h} = \mathbf{h}_1 \oplus \mathbf{h}_2 \oplus \ldots \oplus \mathbf{h}_k$, and then feed $\mathbf{h}$ to a standard Transformer layer:

$$\tilde{\mathbf{h}} = \mathrm{LN}(\mathbf{h} + \text{MH-ATT}(\mathbf{h})), \quad (2)$$
$$\mathbf{H} = \mathrm{LN}(\tilde{\mathbf{h}} + \mathrm{FFN}(\tilde{\mathbf{h}})), \quad (3)$$

where MH-ATT and FFN respectively refer to the multi-head self-attention and the feed-forward network (Vaswani et al., 2017), and LN refers to layer normalization (Ba et al., 2016).

Output Layers: Based on the global hidden representation $\mathbf{H}$, we stack output layers to make predictions for SC and RV, respectively. For the SC task, we treat the hidden state of the $j$-th [CLS] token as the representation of the $j$-th post, followed by a softmax layer to classify its stance towards the source claim:

$$p(s_j \mid \mathbf{H}^j_{\mathrm{CLS}}) = \mathrm{softmax}(\mathbf{W}_s^\top \mathbf{H}^j_{\mathrm{CLS}} + \mathbf{b}_s), \quad (4)$$

where $\mathbf{W}_s \in \mathbb{R}^{d \times 4}$ and $\mathbf{b}_s \in \mathbb{R}^4$ are learnable parameters. Moreover, for the RV task, we add a softmax layer over the last hidden state of the first [CLS] token for rumor veracity prediction:

$$p(y \mid \mathbf{H}^0_{\mathrm{CLS}}) = \mathrm{softmax}(\mathbf{W}_r^\top \mathbf{H}^0_{\mathrm{CLS}} + \mathbf{b}_r), \quad (5)$$

where $\mathbf{W}_r \in \mathbb{R}^{d \times 3}$ and $\mathbf{b}_r \in \mathbb{R}^3$ are weight and bias parameters.
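A minimal PyTorch sketch of the Hierarchical Transformer defined by Eqs. (1)-(5) is given below, assuming a batch size of 1 for readability; the module and variable names are ours, and the paper's exact implementation may differ in detail.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class HierarchicalTransformer(nn.Module):
    def __init__(self, d: int = 768, heads: int = 12):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # LCE, Eq. (1)
        self.global_layer = nn.TransformerEncoderLayer(             # GCE, Eqs. (2)-(3)
            d_model=d, nhead=heads, batch_first=True)
        self.stance_head = nn.Linear(d, 4)  # W_s, b_s in Eq. (4)
        self.rumor_head = nn.Linear(d, 3)   # W_r, b_r in Eq. (5)

    def forward(self, subthreads: torch.Tensor, m: int = 30):
        # subthreads: (k, n*m) token ids, one row per subthread; posts are
        # m tokens each, so every m-th position holds a [CLS] token.
        h = self.bert(subthreads).last_hidden_state  # Eq. (1): (k, n*m, d)
        h = h.reshape(1, -1, h.size(-1))             # concatenate h_1 ⊕ ... ⊕ h_k
        H = self.global_layer(h)                     # Eqs. (2)-(3): (1, k*n*m, d)
        cls = H[0, ::m]                              # [CLS] states of all k*n posts
        stance_logits = self.stance_head(cls)        # Eq. (4): one 4-way softmax per post
        rumor_logits = self.rumor_head(cls[0])       # Eq. (5): first [CLS] = source post
        return stance_logits, rumor_logits
```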

3.3 Coupled Hierarchical Transformer for Stance-Aware Rumor Verification

Based on the above single-task model (i.e., the Hierarchical Transformer), we now describe our proposed multi-task learning (MTL) framework for stance-aware rumor verification.

Figure 3: Baseline Multi-Task Learning Framework (MTL2) for Stance-Aware Rumor Verification.

Baseline MTL Framework: To exploit stance signals for rumor verification, a widely used MTL framework is the MTL2 model proposed by Kochkina et al. (2018), which assumes that the SC and RV tasks share the low-level neural layers while the high-level layers are specific to each task. As illustrated in Fig. 3, to adapt our Hierarchical Transformer to this MTL2 framework, we share the input and LCE modules between SC and RV, followed by separate GCE and output modules for the two tasks.

Motivation: However, as mentioned before, this baseline MTL framework has two major limitations. First, it fails to consider the inter-task interaction. Since the GCE module for SC is supervised to capture salient stance-specific features such as no doubt, agree, and fake news, these features can be leveraged to guide the GCE module for RV to capture the important rumor-specific features closely related to stance features. Moreover, since both stance-specific and rumor-specific features are intuitively crucial for RV, it is necessary to integrate them effectively. Second, it ignores the sequential stance labels predicted by the output module of SC. In fact, the stance distributions predicted for each post can capture the temporal evolution of public stances towards the source claim, which may provide indicative clues for veracity prediction.

Coupled Transformer Module: To model inter-task interactions, we devise a Coupled Transformer Module with two coupled components, shown in Fig. 4: a stance-specific Transformer and a cross-task Transformer.

Concretely, we first employ a standard Transformer layer (i.e., Eqs. (2) and (3)) to obtain stance-specific representations $\mathbf{P}$ in the right channel. Next, to learn the inter-task interactions in the left channel, we design a multi-head stance-aware attention mechanism (MH-SATT) that treats $\mathbf{P}$ as queries and $\mathbf{h}$ as keys and values.

Figure 4: Our Multi-Task Learning Framework (Coupled Hierarchical Transformer) for Stance-Aware Rumor Verification.

MH-SATT essentially leverages the stance-specific features in $\mathbf{P}$ to guide our model to pay more attention to stance-aware rumor-specific features. Specifically, the $i$-th head of MH-SATT is defined as follows:

$$\mathrm{SATT}_i(\mathbf{P}, \mathbf{h}) = \mathrm{softmax}\!\left(\frac{[\mathbf{W}_q\mathbf{P}]^\top[\mathbf{W}_k\mathbf{h}]}{\sqrt{d/z}}\right)[\mathbf{W}_v\mathbf{h}]^\top, \quad (6)$$

where $\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v \in \mathbb{R}^{d/z \times d}$ are parameters and $z$ is the number of heads.

Moreover, to integrate stance-specific and rumor-specific features, we add layer normalization together with a residual connection:

$$\tilde{\mathbf{V}} = \mathrm{LN}(\mathbf{P} + \text{MH-SATT}(\mathbf{P}, \mathbf{h})). \quad (7)$$

Finally, we add a feed-forward network and another layer normalization to obtain the rumor-stance hybrid representations $\mathbf{V}$:

$$\mathbf{V} = \mathrm{LN}(\tilde{\mathbf{V}} + \mathrm{FFN}(\tilde{\mathbf{V}})). \quad (8)$$
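A sketch of the Coupled Transformer Module defined by Eqs. (6)-(8) is shown below; we reuse PyTorch's nn.MultiheadAttention to play the role of MH-SATT, with P as queries and the shared representation h as keys and values. All module and variable names are our own illustrations.

```python
import torch
import torch.nn as nn

class CoupledTransformerModule(nn.Module):
    def __init__(self, d: int = 768, heads: int = 12):
        super().__init__()
        self.stance_layer = nn.TransformerEncoderLayer(  # stance-specific GCE (right channel)
            d_model=d, nhead=heads, batch_first=True)
        self.mh_satt = nn.MultiheadAttention(d, heads, batch_first=True)  # MH-SATT, Eq. (6)
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, h: torch.Tensor):
        # h: (1, L, d) shared representations from the LCE module
        P = self.stance_layer(h)                         # stance-specific features
        satt, _ = self.mh_satt(query=P, key=h, value=h)  # stance-aware attention over h
        V_tilde = self.norm1(P + satt)                   # Eq. (7): residual + LayerNorm
        V = self.norm2(V_tilde + self.ffn(V_tilde))      # Eq. (8): rumor-stance hybrid reps
        return P, V  # P feeds the stance output layer; V feeds rumor verification
```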

Post-Level Attention with Stance Labels: To address the second limitation, we concatenate each post's predicted stance distribution with its corresponding hidden representation, followed by a post-level attention layer that automatically learns the importance of each post.

Specifically, as shown in Fig. 4, we first use Eq. (4) to predict the stance distribution of the $j$-th post in the right channel, denoted by $\mathbf{p}_j$. We then treat the hybrid representation of the $j$-th [CLS] token (i.e., $\mathbf{V}^j_{\mathrm{CLS}}$) as the representation of the $j$-th post and concatenate it with $\mathbf{p}_j$, feeding the result to a post-level attention layer to obtain the stance label-aware thread representation $\mathbf{U}$:

                                  Stance Labels                        Rumor Veracity Labels
Dataset      #Threads  #Tweets    #Support  #Deny  #Query  #Comment    #True  #False  #Unverified
SemEval-17   325       5,568      1,004     415    464     3,685       145    74      106
PHEME        2,402     105,354    -         -      -       -           1,067  638     697

Table 1: Basic statistics of the SemEval-2017 dataset and the PHEME dataset.

$$u_j = \mathbf{v}^\top \tanh\!\left(\mathbf{W}_h(\mathbf{V}^j_{\mathrm{CLS}} \oplus \mathbf{p}_j)\right), \quad (9)$$
$$\alpha_j = \frac{\exp(u_j)}{\sum_{l=1}^{N} \exp(u_l)}, \quad (10)$$
$$\mathbf{U} = \sum_{j=1}^{N} \alpha_j (\mathbf{V}^j_{\mathrm{CLS}} \oplus \mathbf{p}_j). \quad (11)$$
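Eqs. (9)-(11) translate almost directly into the following PyTorch sketch; the parameter names mirror the equations, and the module itself is our own illustration rather than the paper's released code.

```python
import torch
import torch.nn as nn

class PostLevelAttention(nn.Module):
    def __init__(self, d: int = 768, num_stances: int = 4):
        super().__init__()
        self.W_h = nn.Linear(d + num_stances, d)  # W_h in Eq. (9)
        self.v = nn.Linear(d, 1, bias=False)      # v in Eq. (9)

    def forward(self, V_cls: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        # V_cls: (N, d) per-post hybrid [CLS] states; p: (N, 4) stance distributions
        x = torch.cat([V_cls, p], dim=-1)    # V_CLS^j ⊕ p_j
        u = self.v(torch.tanh(self.W_h(x)))  # Eq. (9): unnormalized post scores, (N, 1)
        alpha = torch.softmax(u, dim=0)      # Eq. (10): post-level attention weights
        return (alpha * x).sum(dim=0)        # Eq. (11): thread representation U, (d+4,)
```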

Output Layers: Finally, since $\mathbf{V}^0_{\mathrm{CLS}}$ and $\mathbf{U}$ can be considered the token-level and post-level thread representations respectively, we concatenate them to predict the veracity label of the source claim:

$$p(y \mid \mathbf{V}^0_{\mathrm{CLS}}, \mathbf{U}) = \mathrm{softmax}\!\left(\mathbf{W}^\top(\mathbf{V}^0_{\mathrm{CLS}} \oplus \mathbf{U}) + \mathbf{b}\right), \quad (12)$$

where $\mathbf{W} \in \mathbb{R}^{(2d+4) \times 3}$ and $\mathbf{b} \in \mathbb{R}^3$ are weight and bias terms.

Model Training: To optimize all the parameters in our Coupled Hierarchical Transformer, we adopt an alternating optimization strategy to minimize the following objective, a combination of the cross-entropy losses of the two tasks:

$$J = -\left(\frac{1}{M}\sum_{i=1}^{M} \log p(y_i \mid \mathbf{V}^0_{\mathrm{CLS}}, \mathbf{U}) + \frac{1}{M'}\sum_{k=1}^{M'}\sum_{j=1}^{N} \log p(s_j \mid \mathbf{P}^j_{\mathrm{CLS}})\right), \quad (13)$$

where $M$ and $M'$ refer to the number of samples for the RV and SC tasks, respectively.
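One plausible reading of the alternating optimization over the two cross-entropy terms in Eq. (13) is sketched below; the model methods and data loaders are hypothetical placeholders standing in for the RV and SC branches of the model.

```python
import torch.nn.functional as F

def train_epoch(model, optimizer, rumor_loader, stance_loader):
    """One epoch of alternating updates on the two loss terms of Eq. (13)."""
    for rumor_batch, stance_batch in zip(rumor_loader, stance_loader):
        # RV term: -(1/M) * sum_i log p(y_i | V_CLS^0, U)
        rumor_logits = model.verify(rumor_batch["threads"])  # hypothetical method
        loss_rv = F.cross_entropy(rumor_logits, rumor_batch["veracity"])
        optimizer.zero_grad()
        loss_rv.backward()
        optimizer.step()

        # SC term: -(1/M') * sum_k sum_j log p(s_j | P_CLS^j)
        stance_logits = model.classify_stances(stance_batch["threads"])  # hypothetical method
        loss_sc = F.cross_entropy(stance_logits, stance_batch["stances"])
        optimizer.zero_grad()
        loss_sc.backward()
        optimizer.step()
```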

    4 Experiments

In this section, we first evaluate our single-task model on both stance classification (SC) and rumor verification (RV), followed by an evaluation of our multi-task learning model on RV. Finally, we perform further analysis to provide deeper insights into our proposed multi-task learning model.

4.1 Experiment Setting

Dataset: To demonstrate the effectiveness of our proposed approaches, we carry out experiments on two benchmark datasets, SemEval-2017 and PHEME. Table 1 shows the basic statistics of the two datasets.

Specifically, SemEval-2017 is a widely used dataset from SemEval-2017 Task 8, which contains 325 Twitter conversation threads discussing rumors (Derczynski et al., 2017). The dataset has been split into training, development, and test sets, where the former two sets cover eight events and the test set covers two additional events. Since each thread is annotated with a rumor veracity label and each post in the thread is annotated with its stance towards the source claim, this dataset is used to evaluate both the SC and RV tasks in this work.

PHEME is a well-known dataset for RV, which contains 2,402 Twitter conversation threads discussing nine events. For fair comparison with existing approaches, we perform cross-validation experiments under the leave-one-event-out setting: in each fold, all the threads related to one event are used for testing, and all the threads related to the other eight events are used for training. Following previous studies (Kochkina et al., 2018; Wei et al., 2019), PHEME is only used for evaluating the performance of RV.
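The leave-one-event-out protocol can be sketched as a simple fold generator; the data representation here (a list of (event, thread) pairs) is our own assumption.

```python
def leave_one_event_out(threads):
    """Yield (held-out event, train threads, test threads) folds."""
    events = sorted({event for event, _ in threads})
    for held_out in events:  # nine folds for PHEME's nine events
        train = [t for e, t in threads if e != held_out]
        test = [t for e, t in threads if e == held_out]
        yield held_out, train, test
```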

Since the class distributions of the two datasets are imbalanced, we employ Macro-F1 as the main evaluation metric and accuracy as the secondary evaluation metric for both tasks.

Parameter Settings: Our models are based on the pre-trained uncased BERT-base model (Devlin et al., 2019), with 12 BERT layers and z = 12 attention heads. For both the Hierarchical Transformer and the Coupled Hierarchical Transformer, we set the learning rate to 5e-5 and the dropout rate to 0.1. Due to memory limitations, for each conversation thread, the number of subthreads is set to k = 6, and the maximum input length of each subthread is set to 512. For each subthread, the number of posts is set to n = 17, and the number of tokens in each post is fixed to m = 30. The batch size is set to 4 and 2 for the Hierarchical Transformer and the Coupled Hierarchical Transformer, respectively. We implement all the models in PyTorch on a 24GB NVIDIA TITAN RTX GPU.
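For reference, the hyperparameters above can be collected into a single illustrative configuration object:

```python
from dataclasses import dataclass

@dataclass
class Config:
    bert_model: str = "bert-base-uncased"  # 12 layers, z = 12 attention heads
    learning_rate: float = 5e-5
    dropout: float = 0.1
    k: int = 6                    # subthreads per thread
    max_subthread_len: int = 512  # BERT length constraint
    n: int = 17                   # posts per subthread
    m: int = 30                   # tokens per post (so n*m = 510 <= 512)
    batch_size_single: int = 4    # Hierarchical Transformer
    batch_size_coupled: int = 2   # Coupled Hierarchical Transformer
```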

                                        Single Stance Type Evaluation               Overall Evaluation
Method                                  Support-F1  Deny-F1  Query-F1  Comment-F1   Macro-F1  Accuracy
SVM (Pamungkas et al., 2018)            0.410       0.000    0.580     0.880        0.470     0.795
BranchLSTM (Kochkina et al., 2018)      0.403       0.000    0.462     0.873        0.434     0.784
Temporal ATT (Veyseh et al., 2017)      -           -        -         -            0.482     0.820
Conversational-GCN (Wei et al., 2019)   0.311       0.194    0.646     0.847        0.499     0.751
Hierarchical Transformer (Ours)         0.421       0.255    0.520     0.841        0.509     0.763

Table 2: Results of stance classification on the SemEval-2017 dataset.

                                                             SemEval-2017 Dataset   PHEME Dataset
Setting      Method                                          Macro-F1   Accuracy    Macro-F1   Accuracy
Single-Task  BranchLSTM (Kochkina et al., 2018)              0.491      0.500       0.259      0.314
             TD-RvNN (Ma et al., 2018b)                      0.509      0.536       0.264      0.341
             Hierarchical GCN-RNN (Wei et al., 2019)         0.540      0.536       0.317      0.356
             HiTPLAN (Khoo et al., 2020)                     0.581      0.571       0.361      0.438
             Hierarchical Transformer (Ours)                 0.592      0.607       0.372      0.441
Multi-Task   BranchLSTM+NileTMRG (Kochkina et al., 2018)     0.539      0.570       0.297      0.360
             MTL2 (Veracity+Stance) (Kochkina et al., 2018)  0.558      0.571       0.318      0.357
             Hierarchical PSV (Wei et al., 2019)             0.588      0.643       0.333      0.361
             MTL2-Hierarchical Transformer (Ours)            0.657      0.643       0.375      0.454
             Coupled Hierarchical Transformer (Ours)         0.680†     0.678†      0.396†     0.466†

Table 3: Results of rumor veracity prediction. Single-Task indicates that stance labels are not used during the training stage. † indicates that our Coupled Hierarchical Transformer model is significantly better than the best compared system with p-value < 0.05 based on McNemar's significance test.

    4.2 Main Results

    4.2.1 Evaluation on Single-Task Models

In this subsection, we compare our proposed Hierarchical Transformer with existing single-task models for SC and RV, respectively.

Stance Classification (SC): We first consider the following competitive approaches that focus on SC only: (1) SVM is a baseline that feeds conversation-based and affective-based features to a linear SVM (Pamungkas et al., 2018); (2) BranchLSTM is an LSTM-based architecture designed by Kochkina et al. (2018), which models the sequential branches in each thread; (3) Temporal ATT is an attention-based model proposed by Veyseh et al. (2017), which treats each post's adjacent posts in the conversation timeline as its local context and employs an attention mechanism over this local context to learn the importance of each adjacent post; (4) Conversational-GCN is the state-of-the-art approach recently proposed by Wei et al. (2019), which leverages a graph convolutional network to model the relations between posts in each thread.

We report the SC results in Table 2. First, our Hierarchical Transformer model clearly performs much better than all the compared systems on Macro-F1. Second, compared with previous approaches, our model shows a strong capability of detecting posts with the support and deny stances. This is crucial for veracity prediction, because the support and deny stances usually provide important clues for identifying true and false rumors, respectively (see Fig. 5). These observations demonstrate the general effectiveness of our Hierarchical Transformer model.

Rumor Verification (RV): We then consider several competitive systems that focus on RV only: (1) TD-RvNN is a recursive neural network model based on a top-down tree structure, proposed by Ma et al. (2018b); (2) Hierarchical GCN-RNN is a variant of Conversational-GCN for veracity prediction; (3) PLAN is the state-of-the-art approach recently proposed by Khoo et al. (2020), which uses a randomly initialized Transformer to encode each conversation thread.

We report the RV results of the compared systems on SemEval-2017 and PHEME in the top part of Table 3. First, compared with earlier methods for RV, our Hierarchical Transformer model gains significant improvements, outperforming Hierarchical GCN-RNN by 5.2 and 5.5 absolute percentage points on Macro-F1 for the two datasets, respectively.

Methods (TFM: Transformer)                     Macro-F1  Accuracy
Hierarchical TFM                               0.372     0.441
  - Truncating Input & Removing Global TFM     0.354     0.409
Coupled Hierarchical TFM                       0.396     0.466
  - Removing Post-Level Attention              0.385     0.430
  - Replacing Cross-Task TFM with TFM          0.390     0.456

Table 4: Ablation study on the PHEME dataset.

Second, even compared with the recent state-of-the-art model PLAN, our model still brings moderate performance gains on the two datasets. Since PLAN is based on a randomly initialized Transformer whereas our model is based on a pre-trained Transformer (i.e., BERT), this shows the usefulness of employing pre-trained models for RV, which agrees with our first motivation.

4.2.2 Evaluation on Multi-Task Models

In this subsection, we evaluate the effectiveness of our Coupled Hierarchical Transformer model against several multi-task learning frameworks for stance-aware rumor verification: (1) BranchLSTM+NileTMRG is a pipeline approach, which first trains a BranchLSTM model for SC, followed by an SVM classifier for RV (Kochkina et al., 2018); (2) MTL2 is the MTL framework proposed by Kochkina et al. (2018), which shares a single LSTM channel but uses two separate output layers for SC and RV; (3) Hierarchical PSV is a hierarchical model proposed by Wei et al. (2019), which first learns content and stance features via Conversational-GCN, followed by exploiting their temporal evolution for RV via a Stance-Aware RNN; (4) MTL2-Hierarchical Transformer is our adapted MTL2 model introduced in Section 3.3.

In the bottom part of Table 3, we first find that all the multi-task learning models achieve better performance than their corresponding single-task baselines across the two datasets, which verifies the usefulness of stance signals for RV. Second, among all the multi-task learning approaches, our Coupled Hierarchical Transformer consistently achieves the best results on both SemEval-2017 and PHEME, outperforming the second-best method by 2.3 and 2.1 absolute percentage points on Macro-F1 for the two datasets, respectively. These observations show the superiority of our proposed model over previous multi-task learning methods for stance-aware rumor verification.

    4.3 Ablation Study

To examine the impact of each key component in our single-task and multi-task approaches, we further perform an ablation study in this subsection.

Figure 5: Correlation between predicted stance classes (y-axis) and predicted rumor labels (x-axis) from our Coupled Hierarchical Transformer on the test sets of our two datasets.

As shown in Table 4, for our proposed Hierarchical Transformer, if we directly apply BERT to our RV task (i.e., truncating the input thread and removing the global Transformer layer), the performance drops significantly. This is in line with our first motivation and further demonstrates the effectiveness of our proposed model.

Moreover, for our multi-task learning framework (i.e., the Coupled Hierarchical Transformer), the post-level attention layer plays an indispensable role, as shown by the significant performance drop after its removal. Meanwhile, replacing our cross-task Transformer with the standard Transformer leads to a moderate performance drop on both datasets, which also suggests its importance to our full model.

4.4 Correlation Between Predicted Stance Labels and Veracity Labels

To better understand the usefulness of stance signals for veracity prediction in our Coupled Hierarchical Transformer, we first analyze the correlation between the predicted stance classes and the predicted veracity labels on our two datasets. Since the comment stance is not crucial for rumor verification, we focus on the other three stance classes: deny, query, and support.

As shown in Fig. 5, true rumors are more closely associated with the support stance, whereas false rumors are generally dominated by the other two stances, deny and query. This suggests that our multi-task learning model has implicitly learnt that stance signals can provide important clues for rumor verification.

Case Study: To provide deeper insights into our Coupled Hierarchical Transformer, we carefully choose one representative sample from our test set, and show the stance and veracity prediction results as well as the attention weight of each post learnt by the post-level attention layer. Due to space limitations, we only show the five posts with the top-5 attention weights in the thread.

Figure 6: Stance classes and rumor labels predicted by the Coupled Hierarchical Transformer on a test sample from the PHEME dataset.

In Fig. 6, we can see that although the source claim is supported by some replies, our model learns to assign much higher attention weights to the two posts with the deny stance while largely ignoring the other posts, which may help our model correctly predict the veracity label as false rumor.

    5 Conclusion

In this paper, we first examined the limitations of existing approaches to stance classification (SC) and rumor verification (RV). To tackle these limitations, we proposed a single-task model (the Hierarchical Transformer) for SC and RV, followed by a multi-task learning framework with a Coupled Transformer Module to capture inter-task interactions and a Post-Level Attention layer to use stance distributions for the RV task. Experiments on two benchmarks show the effectiveness of both our single-task and multi-task learning methods.

    Acknowledgments

We would like to thank the three anonymous reviewers for their valuable comments. This research is supported by DSO grant DSOCL18009, the Natural Science Foundation of China (No. 61672288, 62076133, and 62006117), and the Natural Science Foundation of Jiangsu Province for Young Scholars (SBK2020040749) and Distinguished Young Scholars (SBK2020010154).

References

Ahmet Aker, Leon Derczynski, and Kalina Bontcheva. 2017. Simple open stance classification for rumour analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 31–39.

Hunt Allcott and Matthew Gentzkow. 2017. Social media and fake news in the 2016 election. Journal of Economic Perspectives, 31(2):211–36.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2011. Information credibility on Twitter. In Proceedings of WWW.

Tong Chen, Xue Li, Hongzhi Yin, and Jun Zhang. 2018. Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection. In Proceedings of PAKDD.

Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. SemEval-2017 task 8: RumourEval: Determining rumour veracity and support for rumours. In Proceedings of SemEval.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL.

William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance classification. In Proceedings of NAACL.

Georgios Giasemidis, Nikolaos Kaplis, Ioannis Agrafiotis, and Jason Nurse. 2018. A semi-supervised approach to message stance classification. IEEE TKDE, 32(1):1–11.

Kazi Saidul Hasan and Vincent Ng. 2013. Stance classification of ideological debates: Data, models, features, and constraints. In Proceedings of IJCNLP.

Ling Min Serena Khoo, Hai Leong Chieu, Zhong Qian, and Jing Jiang. 2020. Interpretable rumor detection in microblogs by attending to user interactions. In Proceedings of AAAI.

Elena Kochkina and Maria Liakata. 2020. Estimating predictive uncertainty for rumour verification models. In Proceedings of ACL.

Elena Kochkina, Maria Liakata, and Isabelle Augenstein. 2017. Turing at SemEval-2017 task 8: Sequential approach to rumour stance classification with branch-LSTM. In Proceedings of SemEval.

Elena Kochkina, Maria Liakata, and Arkaitz Zubiaga. 2018. All-in-one: Multi-task learning for rumour verification. In Proceedings of COLING.

Sumeet Kumar and Kathleen Carley. 2019. Tree LSTMs with convolution units to predict stance and rumor veracity in social media conversations. In Proceedings of ACL.

Sejeong Kwon, Meeyoung Cha, Kyomin Jung, Wei Chen, and Yajun Wang. 2013. Prominent features of rumor propagation in online social media. In Proceedings of ICDM.

Quanzhi Li, Qiong Zhang, and Luo Si. 2019. Rumor detection by exploiting user credibility information, attention and multi-task learning. In Proceedings of ACL.

Xiaomo Liu, Armineh Nourbakhsh, Quanzhi Li, Rui Fang, and Sameena Shah. 2015. Real-time rumor debunking on Twitter. In Proceedings of CIKM.

Michal Lukasik, P. K. Srijith, Duy Vu, Kalina Bontcheva, Arkaitz Zubiaga, and Trevor Cohn. 2016. Hawkes processes for continuous time sequence classification: an application to rumour stance classification in Twitter. In Proceedings of ACL.

Jing Ma, Wei Gao, Prasenjit Mitra, Sejeong Kwon, Bernard J. Jansen, Kam-Fai Wong, and Meeyoung Cha. 2016. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of IJCAI.

Jing Ma, Wei Gao, Zhongyu Wei, Yueming Lu, and Kam-Fai Wong. 2015. Detect rumors using time series of social context information on microblogging websites. In Proceedings of CIKM.

Jing Ma, Wei Gao, and Kam-Fai Wong. 2017. Detect rumors in microblog posts using propagation structure via kernel learning. In Proceedings of ACL.

Jing Ma, Wei Gao, and Kam-Fai Wong. 2018a. Detect rumor and stance jointly by neural multi-task learning. In Companion Proceedings of The Web Conference 2018.

Jing Ma, Wei Gao, and Kam-Fai Wong. 2018b. Rumor detection on Twitter with tree-structured recursive neural networks. In Proceedings of ACL.

Marcelo Mendoza, Barbara Poblete, and Carlos Castillo. 2010. Twitter under crisis: can we trust what we RT? In Proceedings of the 3rd Workshop on Social Network Mining and Analysis, SNA-KDD 2009, Paris, France, June 28, 2009.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of SemEval.

E. W. Pamungkas, V. Basile, and V. Patti. 2018. Stance classification for rumour analysis in Twitter: Exploiting affective information and conversation structure. In 2nd International Workshop on Rumours and Deception in Social Media (RDSM 2018), volume 2482, pages 1–7.

Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak. 2019. Hierarchical transformers for long document classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev, and Qiaozhu Mei. 2011. Rumor has it: Identifying misinformation in microblogs. In Proceedings of EMNLP.

Natali Ruchansky, Sungyong Seo, and Yan Liu. 2017. CSI: A hybrid deep model for fake news detection. In Proceedings of CIKM.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS.

Amir Pouran Ben Veyseh, Javid Ebrahimi, Dejing Dou, and Daniel Lowd. 2017. A temporal attentional model for rumor stance classification. In Proceedings of CIKM.

Penghui Wei, Nan Xu, and Wenji Mao. 2019. Modeling conversation structure and temporal dynamics for jointly predicting rumor stance and veracity. In Proceedings of EMNLP.

Ke Wu, Song Yang, and Kenny Q. Zhu. 2015. False rumors detection on Sina Weibo by propagation structures. In Proceedings of ICDE.

Fan Yang, Yang Liu, Xiaohui Yu, and Min Yang. 2012. Automatic detection of rumor on Sina Weibo. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, MDS '12, pages 13:1–13:7.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL.

Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva, Maria Liakata, and Rob Procter. 2018a. Detection and resolution of rumours in social media: A survey. ACM Computing Surveys (CSUR), 51(2):32.

Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, Michal Lukasik, Kalina Bontcheva, Trevor Cohn, and Isabelle Augenstein. 2018b. Discourse-aware rumour stance classification in social media using sequential classifiers. Information Processing & Management, 54(2):273–290.

Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Peter Tolmie. 2016. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE, 11(3):e0150989.