
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4596–4608, July 5–10, 2020. ©2020 Association for Computational Linguistics


Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words?

Cansu Sen¹, Thomas Hartvigsen², Biao Yin², Xiangnan Kong¹,², and Elke Rundensteiner¹,²

¹Computer Science Department, Worcester Polytechnic Institute
²Data Science Program, Worcester Polytechnic Institute
{csen,twhartvigsen,byin,xkong,rundenst}@wpi.edu

Abstract

Motivated by human attention, computational attention mechanisms have been designed to help neural networks adjust their focus on specific parts of the input data. While attention mechanisms are claimed to achieve interpretability, little is known about the actual relationships between machine and human attention. In this work, we conduct the first quantitative assessment of human versus computational attention mechanisms for the text classification task. To achieve this, we design and conduct a large-scale crowd-sourcing study to collect human attention maps that encode the parts of a text that humans focus on when conducting text classification. Based on this new human attention dataset for text classification, YELP-HAT, collected on the publicly available YELP dataset, we perform a quantitative comparative analysis of machine attention maps created by deep learning models and human attention maps. Our analysis offers insights into the relationships between human and machine attention maps along three dimensions: overlap in word selections, distribution over lexical categories, and context-dependency of sentiment polarity. Our findings open promising future research opportunities ranging from supervised attention to the design of human-centric attention-based explanations.

1 Introduction

Attention-based models have become the architectures of choice for a vast number of NLP tasks including, but not limited to, language modeling (Daniluk et al., 2017), machine translation (Bahdanau et al., 2015), document classification (Yang et al., 2016), and question answering (Kundu and Ng, 2018; Sukhbaatar et al., 2015). While attention mechanisms have been said to add interpretability since their introduction (Bahdanau et al., 2015), the investigation of whether this claim is correct has only just recently become a topic of high interest (Mullenbach et al., 2018; Thorne et al., 2019; Serrano and Smith, 2019). If attention mechanisms indeed offer a more in-depth understanding of a model's inner workings, application areas from model debugging to architecture selection would benefit greatly from profound insights into the internals of attention-based neural models.

Figure 1: Examples of binary human attention (blue in top two texts) and continuous machine attention (red in bottom text).

Recently, Jain and Wallace (2019), Wiegreffe and Pinter (2019), and Serrano and Smith (2019) proposed three distinct approaches for evaluating the explainability of attention. Jain and Wallace (2019) base their work on the premise that explainable attention scores should be unique for a given prediction as well as consistent with other feature-importance measures. This prompts their conclusion that attention is not explanation. Based on similar experiments on alternative attention scores, Serrano and Smith (2019) conclude that attention does not necessarily correspond to the importance of inputs. In contrast, Wiegreffe and Pinter (2019) find that attention learns a meaningful relationship between input tokens and model predictions, which cannot be easily hacked adversarially.

While these works ask valuable questions, they embrace model-driven approaches for manipulating the attention weights and thereafter evaluate the post-hoc explainability of the generated machine attention. In other words, they overlook the human factor in the evaluation process, which should be integral in assessing the plausibility of the generated explanations (Riedl, 2019).

In this work, we adopt a novel approach to attention explainability from a human-centered perspective and, in particular, investigate to what degree machine attention mimics human behavior. More precisely, we are interested in the following research question: Do neural networks with attention mechanisms attend to the same parts of the text as humans? To this end, we first collect a large dataset of human attention maps and then compare the validated human attention with a variety of machine attention mechanisms for text classification.

Figure 1 displays examples of human and machine-generated attention for classifying a restaurant review's overall rating. Our goal is to quantify the similarity between human attention and machine-generated attention scores. Measuring this similarity is non-trivial and is not appropriately captured by an existing similarity metric (e.g., Euclidean) between two vectors, for the following reasons. A binary human attention vector does not solely denote which tokens are given higher importance but also carries information about the underlying grammatical structure and linguistic construction. For example, whether or not adjectives tend to be high-importance is encoded in the attention weights as well. Further, it is well known that human attention is itself subjective: given the same text and task, human annotators may not always agree on which words are important. That is, one single human's attention should rarely be regarded as the ground truth for attention.

Given this objective, we use crowd-sourcing to collect a large set of human attention maps. We provide a detailed account of the iterative design process for our data collection study in §3. We design new metrics that quantify the similarity between machine and human attention from three perspectives (§4). Behavioral similarity measures the number of common words selected by human and machine, discerning whether neural networks with attention mechanisms attend to the same parts of the text as humans. Humans associate certain lexical categories (e.g., adjectives) with a sentiment more heavily; lexical (grammatical) similarity identifies whether machine attention favors lexical categories similar to those humans favor. A high lexical similarity shows that the attention mechanism learns language patterns similar to those of humans. Context-dependency quantifies the sentiment polarity of word selections.

We then employ these metrics to compare attention maps from a variety of attention-based Recurrent Neural Networks (RNNs). We find that bidirectional RNNs with additive attention demonstrate strong similarities to human attention on all three metrics. In contrast, unidirectional RNNs with attention differ from human attention significantly. Finally, as the text length increases and, with it, the prediction task becomes more difficult, both the accuracy of the models and the similarity between human and machine attention decrease.

Our contributions are as follows:

• We conduct a large-scale collection of 15,000 human attention maps as a companion to the publicly-available Yelp Review dataset. Our collected Yelp-HAT (Human ATtention) dataset is publicly available as a valuable resource to the NLP community.

• We develop rich metrics for comparing human and machine attention maps for text. Our new metrics cover three complementary perspectives: behavioral similarity, lexical similarity, and context-dependency.

• We conduct the first in-depth assessment comparing human versus machine attention maps, with the latter generated by a variety of state-of-the-art soft and hard attention mechanisms.

• We show that when used with bidirectional architectures, attention can be interpreted as human-like explanations for model predictions. However, as text length increases, machine attention resembles human attention less.

2 Preliminaries on Attention Maps

In this section, we define the concepts of Human Attention Map and Machine Attention Map.

Definition 2.1. Attention Map. An Attention Map (AM) is a vector in which each entry is associated with the word in the corresponding position of the associated text. The value of an entry indicates the level of attention the corresponding word receives with respect to a classification task.

Definition 2.2. Human Attention Map. A Human Attention Map (HAM) is a binary attention map produced by a human, where each entry with a set bit indicates that the corresponding word receives high attention.

Definition 2.3. Machine Attention Map. A Machine Attention Map (MAM) is an attention map generated by a neural network model. If computed through soft attention, a MAM is an AM of continuous values that capture a probability distribution over the words. If computed through hard attention, a MAM is a binary AM.

We now introduce aggregation operators that coalesce the HAMs produced by multiple annotators into aggregated HAMs.

Definition 2.4. Consensus Attention Map. If multiple HAMs exist for the same text, a Consensus Attention Map (CAM) is computed through a bitwise AND operation over the HAMs.

Definition 2.5. Super Attention Map. If multiple HAMs exist for the same text, a Super Attention Map (SAM) is computed through a bitwise OR operation over the HAMs.
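To make the aggregation operators concrete, the following is a minimal sketch of Defs. 2.4 and 2.5, assuming HAMs are represented as binary NumPy arrays of equal length (the three maps shown are hypothetical):

```python
import numpy as np

# Hypothetical binary HAMs for the same 6-word text, one per annotator
# (1 = the word received high attention).
ham1 = np.array([1, 0, 1, 1, 0, 0])
ham2 = np.array([1, 0, 1, 0, 0, 1])
ham3 = np.array([1, 0, 0, 1, 0, 1])
hams = np.stack([ham1, ham2, ham3])

# Consensus Attention Map (Def. 2.4): bitwise AND across annotators,
# keeping only the words every annotator marked.
cam = np.bitwise_and.reduce(hams, axis=0)  # [1, 0, 0, 0, 0, 0]

# Super Attention Map (Def. 2.5): bitwise OR across annotators,
# keeping the words any annotator marked.
sam = np.bitwise_or.reduce(hams, axis=0)   # [1, 0, 1, 1, 0, 1]
```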

3 Collection and Analysis of Human Attention Maps

3.1 HAM Collection by Crowd-sourcing

We collect human attention maps for the Yelp dataset¹ on the classification task of rating a review as positive or negative on Amazon Mechanical Turk. Participants are asked to complete two tasks: 1) identify the sentiment of the review as positive, negative, or neither, and 2) highlight the words that are indicative of the chosen sentiment. The interface used for data collection is shown in Figure 2.

Preliminary investigation of the quality of human annotations. First, we conduct a series of data collection studies on two subsets of the Yelp dataset. Both subsets consist of 50 randomly-selected reviews from the Restaurant category. The first subset contains reviews with exactly 50 words, while the second contains reviews with exactly 100 words. For each review, human annotation is collected from two unique users.

We explore the quality of data we can collect on Mechanical Turk, as the platform encourages users to complete their tasks as quickly as possible: the number of completed tasks determines their income. This may lower the quality of the collected data, since users may not select all relevant words, instead opting for the few most obvious ones, or they may choose words randomly.

¹https://www.yelp.com/dataset/challenge

Figure 2: User interface used for data collection on Amazon Mechanical Turk.

Based on our preliminary investigations, we observe that both the average time users spend on the task (44 vs. 70 seconds) and the average number of words selected per review (9 vs. 13 words) increase as the number of words in the review increases from 50 to 100. This suggests that users do not choose words randomly; instead, they make an informed decision. We also visually examine the collected human attention maps and confirm that subjects make meaningful selections.

Pilot study assessing two design choices for data collection. Next, we design another pilot study to understand how humans perform the cognitive task of classifying a text and selecting the particular words that led to this decision. In this study, we ask eight participants to perform the same task while adhering to one of two strategies. The first strategy, the read-first design, involves reading the review first, deciding on the sentiment, then rereading the review, this time to highlight the relevant words. The second strategy, the free-style design, gives participants the freedom to choose the relevant words as they read the review to determine the sentiment. Each participant is asked to complete two tasks in order to experience both strategies. Half of the participants first work with the read-first design followed by the free-style design, while the other half work in the reverse order. After completing the tasks, we ask the participants in a post-task questionnaire which strategy they find more natural.

Findings from the pilot study. Out of eight participants, half find it more useful to read the review first and then decide on the words, whereas the other half indicate the opposite. We then evaluate the collected data from three perspectives to decide which design is most suitable for our purposes.

We first examine the agreement between participants adhering to a particular strategy. This involves calculating the percentage of participants that mutually select the same phrase. We find that participant agreement is higher (73%) when the participants are forced to read the review before making any selections, compared to the free-style design (69%). Next, we investigate how similar the results are to the ground truth we defined for each review. The read-first design achieves better performance (3.30) compared to the free-style design (3.10). Our final criterion involves examining the amount of noise in the data (i.e., selections which deviate from the chosen sentiment). Only one review exhibits this situation: the review is clearly positive; however, it also contains a negative-opinion sentence. We observe that the read-first design reduces this cross-sentiment noise (1 vs. 0.5 scores).

Data collection protocol for the main study. Based on conclusions from the pilot studies, the read-first design is adopted to conduct the main data collection for 5,000 reviews on Amazon Mechanical Turk. For this study, three different subjects annotated each review, resulting in a total of 15,000 human attention maps. The resulting Yelp Human Attention Dataset (YELP-HAT) is publicly available².

3.2 Analysis and Insights About HAMs

Factors that affect human accuracy. Some reviews contain a mixture of opinions, even though the reviewer felt strongly positive or negative about the restaurant. For example, consider the following review: “Nothing to write home about, the chicken seems microwaved and the appetizers are meh. ... If your [sic] looking for a quick oriental fix I’d say go for it.. otherwise look elsewhere.” This review is labeled as negative, positive, and neither. The annotator who assigned it to the positive class selected the words “go for it”, while the annotator who assigned it to the negative class selected the words “otherwise look elsewhere”. This type of “mixed review” is the principal reason for discrepancies in classifications by the human annotators. The nature of crowd-sourcing also causes such inconsistencies, as not all annotators provide annotations of equal quality.

²http://davis.wpi.edu/dsrg/PROJECTS/YELPHAT/index.html

Ambiguity in human attention. Intuitively, human attention is highly subjective. Some common patterns across annotators lead to differences in human annotations. A common behavior is to select keywords that indicate a sentiment. Another typical action is to select an entire sentence if the sentence expresses an opinion.

Some reviews include subjective phrases that people interpret differently with regard to sentiment polarity. For instance, “I come here often” can be construed as a favorable opinion; however, some people find it neutral. In some cases, an overwhelmingly-positive review incorporates a negative remark (or vice versa). In these cases, some people select all pieces of evidence of any sentiment, whereas others only choose words that indicate the prevailing sentiment.

4 Attention Map Similarity Framework

We quantify the similarity between HAMs and MAMs through our similarity framework, which comprises the three new metrics described in this section.

4.1 Overlap in Word Selections

For two attention mechanisms to be similar, they must put attention on the same parts of the text. Thus, we first define a metric for quantifying the overlap in the words selected by human annotators and by deep learning models.

Definition 4.1. Behavioral Similarity. Given a collection of attention maps HAM_D and MAM_D for a text dataset D, the behavioral similarity between human (H) and machine (M) corresponds to the average pairwise similarity between each (HAM_i, MAM_i) vector pair for all i ∈ D, as defined below:

$$\mathrm{PairwiseSim}_i = \mathrm{AUC}(\mathrm{HAM}_i, \mathrm{MAM}_i)$$

$$\mathrm{BehavioralSim}(M, H) = \frac{1}{|D|} \sum_{i} \mathrm{PairwiseSim}_i$$

where |D| is the number of reviews in the dataset D. Intuitively, this corresponds to adopting the human attention vector as binary ground truth; that is, it measures how similar the machine-generated continuous vector is to this ground truth. AUC is between 0 and 1, with 0.5 representing no similarity and 1 perfect similarity.
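A minimal sketch of Def. 4.1, assuming scikit-learn's AUC implementation; note that the AUC is only defined when a HAM contains both selected and unselected words:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_sim(ham: np.ndarray, mam: np.ndarray) -> float:
    """AUC between one binary human attention map and one
    continuous machine attention map."""
    return roc_auc_score(ham, mam)

def behavioral_sim(hams, mams) -> float:
    """Average pairwise AUC over a dataset (Def. 4.1)."""
    return float(np.mean([pairwise_sim(h, m) for h, m in zip(hams, mams)]))

# Hypothetical 6-word review: AUC is 1.0 because every human-selected
# word outranks every unselected word in the machine attention.
ham = np.array([1, 0, 1, 1, 0, 0])
mam = np.array([0.40, 0.10, 0.30, 0.15, 0.02, 0.03])
print(pairwise_sim(ham, mam))  # 1.0
```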


4.2 Distribution over Lexical Categories

Previous work has found that lexical indicators of sentiment are commonly associated with syntactic categories such as adjective, adverb, noun, and verb (Marimuthu and Devi, 2012). We define the following lexical similarity metric to test whether human and machine adopt similar behaviors in terms of favoring certain lexical categories.

Definition 4.2. Lexical Similarity. Given a collection of attention maps HAM_D and MAM_D for a text dataset D, the Lexical Similarity (LS) between human (H) and machine (M) over D is computed as:

$$\mathrm{LS}(M, H) = \mathrm{corr}(\mathrm{dist}(\mathrm{words}_H), \mathrm{dist}(\mathrm{words}_M))$$

where words_H is the list of all words selected by the human across all reviews of D, words_M is the list of all words selected by the machine across all reviews of D, and dist() is a function that computes the distribution of a word list over a tagset (e.g., nouns, verbs, etc.). After computing the two distributions, the corr() function computes the correlation between them; in our experiments, we adopt Pearson correlation. If the MAM is continuous, the words selected by M correspond to the k words with the highest attention scores, where k is the number of words selected by the human for that text.

Using a random attention R as a baseline, where the most important k words are selected randomly, we then compute an Adjusted Lexical Similarity, which is between 0 and 1, as follows:

$$\mathrm{AdjustedLS} = \frac{\mathrm{LS}(M, H) - \mathrm{LS}(R, H)}{1 - \mathrm{LS}(R, H)}$$
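A minimal sketch of Def. 4.2 and the adjustment above, assuming NLTK's Penn Treebank POS tagger and SciPy; `tagset` and the helper names are illustrative. For a continuous MAM, words_m would first be reduced to the top-k attention words, with k matching the human selection count:

```python
from collections import Counter

import nltk
import numpy as np
from scipy.stats import pearsonr

# nltk.download("averaged_perceptron_tagger")  # one-time setup

def tag_distribution(words: list, tagset: list) -> np.ndarray:
    """Relative frequency of each POS tag over a word list (dist())."""
    tags = [tag for _, tag in nltk.pos_tag(words)]
    counts = Counter(tags)
    total = max(sum(counts.values()), 1)
    return np.array([counts[t] / total for t in tagset])

def lexical_similarity(words_h: list, words_m: list, tagset: list) -> float:
    """Pearson correlation between the human- and machine-selected
    POS-tag distributions (Def. 4.2)."""
    r, _ = pearsonr(tag_distribution(words_h, tagset),
                    tag_distribution(words_m, tagset))
    return r

def adjusted_ls(ls_mh: float, ls_rh: float) -> float:
    """Rescale LS so the random-attention baseline R maps to 0."""
    return (ls_mh - ls_rh) / (1.0 - ls_rh)
```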

4.3 Context-dependency of Sentiment Polarity

When deciding the sentiment of a review, human subjects may consider positive-sentiment words in a negative review and vice versa. To assess how context-dependent human and machine attention are, we compute cross-sentiment selection rates.

Definition 4.3. Cross-sentiment selection rate (CSSR). Assume we have a collection of attention maps AM_D for a dataset D, the ground truth for overall sentiment Y for each review in D (y_i ∈ {0, 1}), and a list of positive words P and negative words N in the English language. CSSR denotes the ratio of selected words from the opposite sentiment:

$$\mathrm{p\_words} = \mathrm{get\_words}(\mathrm{HAM}_D, Y = 1)$$
$$\mathrm{n\_words} = \mathrm{get\_words}(\mathrm{HAM}_D, Y = 0)$$

$$\mathrm{CSSR}_p = \frac{|\mathrm{p\_words} \cap N|}{|\mathrm{p\_words} \cap P|} \qquad \mathrm{CSSR}_n = \frac{|\mathrm{n\_words} \cap P|}{|\mathrm{n\_words} \cap N|}$$

The get_words() function returns the list of attention-receiving words (those with HAM_ij = 1) over the entire set of HAM_D, separately for positive-sentiment reviews (Y = 1) and negative-sentiment reviews (Y = 0). The lists of words with positive and negative connotations, P and N, are obtained from Hu and Liu (2004). CSSR_p (positive) and CSSR_n (negative) are then computed as the ratio of the number of cross-sentiment words to the number of same-sentiment words. A high CSSR means many words from the opposite sentiment are selected. This metric provides insight into how similar human and machine attention are with regard to their context-dependent behaviour.
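A minimal sketch of Def. 4.3 on hypothetical word sets; it assumes the attention-receiving words have already been gathered per sentiment class (the role of get_words()) and that the denominators are nonzero:

```python
def cssr(p_words: set, n_words: set, pos_lex: set, neg_lex: set):
    """Cross-sentiment selection rates (Def. 4.3).

    p_words / n_words: words attended in positive- / negative-labeled
    reviews; pos_lex / neg_lex: opinion lexicons (e.g., Hu and Liu, 2004).
    """
    cssr_p = len(p_words & neg_lex) / len(p_words & pos_lex)
    cssr_n = len(n_words & pos_lex) / len(n_words & neg_lex)
    return cssr_p, cssr_n

# Hypothetical toy lexicons and selections.
P = {"great", "delicious", "friendly"}
N = {"bland", "rude", "slow"}
p_sel = {"great", "delicious", "slow"}  # one cross-sentiment word
n_sel = {"bland", "rude"}               # no cross-sentiment words
print(cssr(p_sel, n_sel, P, N))         # (0.5, 0.0)
```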

5 Is Machine Attention Similar to Human Attention?

5.1 Generating Machine Attention Maps

The Yelp dataset contains reviews and their rating scores between 0 and 5 (stars). This rating score corresponds to the ground truth for the review's overall sentiment. We create a binary classification task by assigning 1- and 2-star reviews to the negative class and 4- and 5-star reviews to the positive class. We omit 3-star reviews, as they may not exhibit a clear sentiment. For training the neural network models, we extract balanced subsets and split them into an 80% training set, a 10% validation set, and a 10% test set (a minimal sketch of this preprocessing follows the next paragraph). We then generate MAMs using the following machine learning models.

RNN with soft attention. Recurrent Neural Networks (RNNs) enhanced with attention mechanisms have emerged as the state of the art for NLP tasks (Bahdanau et al., 2015; Yang et al., 2016; Daniluk et al., 2017; Kundu and Ng, 2018). We implement additive attention for the many-to-one classification task, as it is commonly used in the literature (Yang et al., 2016; Bahdanau et al., 2015), and pair it with both uni- and bi-directional RNNs. In our implementation, we use LSTM memory cells.
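Returning to the dataset construction above, the following is a minimal sketch of the star-based labeling and the 80/10/10 split; the file name and column names are assumptions, not the exact pipeline used in this work:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical loading of Yelp reviews (column names are assumptions).
df = pd.read_json("yelp_reviews.json", lines=True)[["text", "stars"]]

# Binary task: 1-2 stars -> negative (0), 4-5 stars -> positive (1);
# 3-star reviews are omitted as they may not exhibit a clear sentiment.
df = df[df.stars != 3]
df["label"] = (df.stars >= 4).astype(int)

# Balance the classes by downsampling the majority class.
n = df.label.value_counts().min()
df = df.groupby("label").sample(n=n, random_state=0)

# 80% train / 10% validation / 10% test.
train, rest = train_test_split(df, test_size=0.2, stratify=df.label,
                               random_state=0)
val, test = train_test_split(rest, test_size=0.5, stratify=rest.label,
                             random_state=0)
```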


            Yelp-50        Yelp-100       Yelp-200
Human       0.96           0.94           0.94
RNN         0.91 ± 0.006   0.90 ± 0.013   0.88 ± 0.01
biRNN       0.93 ± 0.008   0.91 ± 0.005   0.88 ± 0.02
Rationales  0.90 ± 0.004   0.85 ± 0.035   0.77 ± 0.015

Table 1: Test accuracy from three subsets of Yelp data.

Assuming that Γ is the recurrence function of the LSTM and x_i is the embedded i-th word of the T words in a review, we model our method as:

$$h_i = \Gamma(x_i, h_{i-1}), \quad i \in [1, T] \tag{1}$$

$$u_i = \tanh(W h_i + b) \tag{2}$$

$$\alpha_i = \frac{\exp(u_i^\top u)}{\sum_t \exp(u_t^\top u)} \tag{3}$$

Here h_i, i ∈ [1, T] are hidden representations; W, b, and u are trainable parameters; and α_i, i ∈ [1, T] are the attention scores for each word x_i. A context vector c corresponds to the weighted average of the hidden representations of the words with their attention weights, denoted by:

$$c = \sum_j \alpha_j h_j \tag{4}$$

Through a softmax layer, the context vector c is then used to classify the input sequence. (A code sketch of this attention model is given after the rationale mechanism description below.)

Rationale mechanism. An alternative approach, referred to as the “rationale mechanism”, can be seen as a type of hard attention (Lei et al., 2016; Bao et al., 2018). This model consists of two main parts that are jointly learned: a generator and an encoder. The generator specifies a distribution over the input text to select candidate rationales; the encoder makes predictions based on the rationales. The two components are integrated and regularized in the cost function with two hyper-parameters, the selection lambda and the continuity lambda, which optimize the representative selections. The selection lambda penalizes the number of words selected, while the continuity lambda encourages continuity by minimizing the distances between the chosen words.
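As promised above, here is a minimal PyTorch sketch of the additive attention classifier defined by Eqs. (1)-(4); the class name and layer sizes are illustrative (see Appendix A.2 for the hyper-parameters actually used), not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class AdditiveAttentionClassifier(nn.Module):
    """LSTM encoder with additive attention for many-to-one
    classification, following Eqs. (1)-(4)."""

    def __init__(self, vocab_size: int, embed_dim: int = 100,
                 hidden_dim: int = 100, attn_dim: int = 100,
                 num_classes: int = 2, bidirectional: bool = True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                           bidirectional=bidirectional)
        enc_dim = hidden_dim * (2 if bidirectional else 1)
        self.W = nn.Linear(enc_dim, attn_dim)         # W and b in Eq. (2)
        self.u = nn.Parameter(torch.randn(attn_dim))  # context query u
        self.out = nn.Linear(enc_dim, num_classes)

    def forward(self, tokens: torch.Tensor):
        h, _ = self.rnn(self.embed(tokens))           # Eq. (1): (B, T, enc_dim)
        u_i = torch.tanh(self.W(h))                   # Eq. (2): (B, T, attn_dim)
        alpha = torch.softmax(u_i @ self.u, dim=1)    # Eq. (3): (B, T)
        c = (alpha.unsqueeze(-1) * h).sum(dim=1)      # Eq. (4): (B, enc_dim)
        return self.out(c), alpha                     # logits and the MAM

# The attention weights `alpha` are read out as the machine attention map.
model = AdditiveAttentionClassifier(vocab_size=10000)
logits, alpha = model(torch.randint(0, 10000, (4, 50)))  # 4 50-word reviews
```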

5.2 Behavioral Similarity Analysis

We conduct a set of controlled experiments in which the length of the reviews changes across experiments. First, we generate MAMs for three subsets of the Yelp dataset: reviews containing 50 words (Yelp-50), 100 words (Yelp-100), and 200 words (Yelp-200). Neural network models with attention mechanisms are trained on each of these subsets. The corresponding test-set accuracies for sentiment classification of human versus machine are shown in Table 1. Next, we acquire the HAMs collected for each test set. Since each review is annotated by three people, we have three sets of HAMs: HAM1, HAM2, and HAM3. The aggregate maps, CAM and SAM, are computed as per Defs. 2.4 and 2.5. Then we measure the Behavioral Similarity between human and machine. The amount of overlap in the selected words is presented in Table 2.

We observe that accuracy and similarity both decrease as the review length increases and the classification task becomes more difficult for both humans and machine learning models. We identify two reasons for this. First, when a review is long, the prevailing opinion is usually not obvious at first glance and may require more intensive reading and contemplation. Second, reviewers are more likely to state conflicting facts and opinions in long reviews; this, in turn, creates distracting and hard-to-read text. Compared to the unidirectional model, the bidirectional RNN with attention consistently rates closer to human attention. This is most striking for the Yelp-50 subset and can be explained by the fact that bidirectional RNNs possess information from both directions of the text, similar to humans.

For all three subsets, Yelp-50, Yelp-100, and Yelp-200, the behavioral similarity for the Consensus Attention Map is higher than for all three individual HAMs. This is an important result because it indicates that the words all annotators agree to be important are selected by machine attention too, whereas more subjective selections do not always receive high attention from the machine, as indicated by the lower SAM similarity.

Finally, we compare the similarity of these three sets of HAMs to each other. Even though human-to-human similarity is usually higher than human-to-machine similarity (as expected), the numbers are still far from 1. This confirms the subjectivity of human attention. Also, note that human-to-human similarity decreases as the review length increases.

We observe that the performance of the rationale-based models degrades more sharply as the review length increases. As our goal is to compare human attention with machine-generated attention for model interpretability, we optimize the model not only for accuracy but also for the number of selected rationales: we aim to generate roughly an equal number of words selected by human annotators and by machine-generated rationales. Hence, we force the rationale models to pick fewer words by tuning the selection lambda accordingly. This gives a comparative advantage to attention-based models over rationale-based models, as the rationale model is a hard-attention mechanism. In addition, rationales are better suited for sentence-level tasks, as they encourage consecutive selection, in contrast to the behavior of attention.

Yelp-50            HAM1, k = 10   HAM2, k = 12   HAM3, k = 12   CAM, k = 5     SAM, k = 22
HAM2               0.73           -              -              -              -
HAM3               0.74           0.75           -              -              -
RNN Attention      0.59 ± 0.021   0.59 ± 0.002   0.57 ± 0.012   0.59 ± 0.024   0.58 ± 0.021
Bi-RNN Attention   0.69 ± 0.004   0.70 ± 0.008   0.69 ± 0.007   0.79 ± 0.003   0.64 ± 0.008
Rationales         0.62 ± 0.014   0.62 ± 0.012   0.63 ± 0.015   0.68 ± 0.020   0.58 ± 0.010

Yelp-100           HAM1, k = 15   HAM2, k = 16   HAM3, k = 16   CAM, k = 6     SAM, k = 30
HAM2               0.71           -              -              -              -
HAM3               0.73           0.74           -              -              -
RNN Attention      0.57 ± 0.009   0.58 ± 0.011   0.59 ± 0.012   0.57 ± 0.010   0.58 ± 0.008
Bi-RNN Attention   0.65 ± 0.011   0.65 ± 0.021   0.66 ± 0.021   0.73 ± 0.031   0.62 ± 0.012
Rationales         0.55 ± 0.015   0.55 ± 0.005   0.55 ± 0.010   0.59 ± 0.015   0.54 ± 0.005

Yelp-200           HAM1, k = 26   HAM2, k = 27   HAM3, k = 25   CAM, k = 11    SAM, k = 45
HAM2               0.70           -              -              -              -
HAM3               0.69           0.71           -              -              -
RNN Attention      0.60 ± 0.011   0.60 ± 0.013   0.60 ± 0.014   0.60 ± 0.017   0.60 ± 0.011
Bi-RNN Attention   0.61 ± 0.015   0.61 ± 0.008   0.61 ± 0.018   0.63 ± 0.009   0.60 ± 0.008
Rationales         0.51 ± 0.013   0.52 ± 0.021   0.51 ± 0.018   0.52 ± 0.025   0.49 ± 0.019

Table 2: Behavioral similarity of human attention to machine attention for varying review lengths. k indicates the average number of words selected. (0.5: no similarity, 1.0: perfect similarity)

5.3 Lexical Similarity Analysis

Next, we analyze whether humans and neural networks pay more attention to words from particular lexical categories, using the Adjusted Lexical Similarity score.

Lexical Similarity results, presented in Table 3, are consistent with Behavioral Similarity in that the bidirectional model with attention is most similar to human attention (0.91 for Yelp-50 and 0.84 for Yelp-100). The Rationales model follows the bidirectional RNN, and the unidirectional RNN is the least similar to human attention. Overall, lexical similarity to humans decreases for all models as the reviews become longer.

Next, we inspect which lexical categories are selected more heavily by human and machine. For this, we compare the relative frequency of lexical categories among human-selected words, machine-selected words (bi-RNN), and the overall relative frequency of each tag within the dataset. Adjectives (Human: 0.24, bi-RNN: 0.23, Overall: 0.02), comparative adjectives (Human: 0.002, bi-RNN: 0.001, Overall: 0.0001), and nouns (Human: 0.38, bi-RNN: 0.37, Overall: 0.09) are among the lexical categories that humans and bi-RNN models favor heavily. Similarly, personal pronouns are rarely selected by either humans or bi-RNN models (Human: 0.005, bi-RNN: 0.005, Overall: 0.01).

5.4 Cross-sentiment Selection Rate Analysis

Finally, we compute CSSR scores, presented in Table 4, to evaluate the context-dependency of sentiment polarity for human and machine attention. Our observations for the Yelp-50 dataset are as follows. Human annotators select almost exclusively positive words if the overall review sentiment is positive; for negative reviews, they select cross-sentiment (positive) words at a higher rate (CSSR_p = 0.06, CSSR_n = 0.20). Among the neural network models, the bidirectional RNN once more behaves most similarly to human annotators, with CSSR_p = 0.04 and CSSR_n = 0.19. The RNN model's behavior differs from that of humans and the bi-RNN: even though it is similar for positive polarity (CSSR_p = 0.06), the opposite holds for negative polarity. In fact, positive words are selected 2.28 times more often than negative words in negative reviews, which is counter-intuitive. For the Rationales model, CSSR_p is 0.08 and CSSR_n is 0.44, indicating that the Rationales model is more similar to human attention than the RNN model with attention. We observe similar trends for the Yelp-100 and Yelp-200 datasets.

6 Related Work

A large body of work has used attention mechanisms in an attempt to bring 'interpretability' to model predictions (Choi et al., 2016; Sha and Wang, 2017; Yang et al., 2016). However, these works only assess the produced attention maps qualitatively, by visualizing a few hand-selected instances. Recently, researchers began to question the interpretability of attention. Jain and Wallace (2019) and Serrano and Smith (2019) argue that if alternative attention distributions exist that produce results similar to those obtained by the original model, then the original model's attention scores cannot be reliably used to explain the model's prediction. They empirically show that achieving such alternative distributions is possible. In contrast, Wiegreffe and Pinter (2019) find that attention learns a meaningful relationship between input tokens and model predictions which cannot be easily hacked adversarially.

Das et al. (2016) conducted the first quantitative assessment of computational attention mechanisms for the visual question answering (VQA) task. Similar to our work, they collect a human attention dataset and then measure the similarity of human and machine attention within the context of VQA. This VQA-HAT dataset now provides a fertile research vehicle for researchers in computer vision studying the supervision of the attention mechanism (Liu et al., 2017a). The development of a similar dataset and an in-depth quantitative evaluation for text, to advance NLP research, has been sorely lacking. In concurrent and independent work, DeYoung et al. (2019) collect the ERASER dataset of human annotations of rationales. While ERASER includes multiple datasets for a number of NLP tasks, with relatively small amounts of data for each, we focus on text classification and collect a large amount of data on a different corpus.

7 Discussion

Recent papers, including our work, take strides toward answering the question of whether attention is interpretable. This is complicated by the fact that “interpretability” remains an ill-defined concept.

Attention adds transparency. Lipton (2018) defines transparency as overall human understanding of a model, i.e., why a model makes its decisions. Under this definition, attention scores can be seen as partial transparency: they provide a look into the inner workings of a model, in that they produce an easily-understandable weighting of hidden states (Wiegreffe and Pinter, 2019).

Attention is not faithful. Whether adversarial attention scores exist that result in the same predictions as the original attention scores helps us understand whether attention is faithful. With their empirical analyses, Serrano and Smith (2019) and Jain and Wallace (2019) show that attention is not faithful.

Rationale models for human-like explanations. Riedl (2019) argues that explanations are post-hoc descriptions of how a system came to a given conclusion. This raises the question of what makes a good explanation of the behavior of a machine learning system. One line of research offers these explanations in the form of binary rationales, namely, explanations that plausibly justify a model's actions (Bao et al., 2018; Lei et al., 2016).

Our approach to attention as human-like explanations. The claim that attention is explanation implies that attention mimics humans in rationalizing past actions. In our work, we approach interpretability from this human-centric perspective. We develop a systematic approach to either support or refute the hypothesis that attention corresponds to human-like explanations for model behavior. Based on our comparative analyses, we provide initial answers to this important question, offering insights into the similarities and dissimilarities of attention-based architectures to human attention.

Towards additional tasks beyond text classification. Confidently concluding whether attention mimics human attention requires tremendous effort from many researchers, with human data collected via well-designed data collection methodologies, a task that is both labor-intensive and costly. In this work, we thus focus on one task, namely sentiment classification, and collect HAMs for this task on a single dataset. We invite other researchers to continue this line of research by exploring other tasks (e.g., question answering).

                   Yelp-50                      Yelp-100                     Yelp-200
                   Lexical Sim.   Adjusted LS   Lexical Sim.   Adjusted LS   Lexical Sim.   Adjusted LS
Random Attention   0.85 ± 0.006   -             0.84 ± 0.013   -             0.90 ± 0.010   -
RNN Attention      0.93 ± 0.015   0.54          0.91 ± 0.007   0.44          0.93 ± 0.005   0.37
Bi-RNN Attention   0.99 ± 0.005   0.91          0.98 ± 0.013   0.84          0.93 ± 0.003   0.36
Rationales         0.95 ± 0.012   0.66          0.93 ± 0.027   0.53          0.90 ± 0.002   0.05

Table 3: Lexical Similarity and Adjusted Lexical Similarity of human attention to machine attention for varying review lengths. (Adjusted LS 0: no similarity, 1: perfect similarity)

                   CSSR_p   CSSR_n
Human              0.06     0.20
RNN Attention      0.06     2.28
Bi-RNN Attention   0.04     0.19
Rationales         0.08     0.44

Table 4: Cross-sentiment selection rates for positive and negative reviews on the Yelp-50 dataset.

Next steps in attention research. Our work opens promising future research opportunities. One is to supervise attention models explicitly. Attention mechanisms themselves are typically learned in an unsupervised manner; however, initial research offers compelling evidence for the success of supervised attention models in the computer vision area (Chen et al., 2017; Liu et al., 2017b). Also, attention has the potential to be leveraged both for making predictions and for concurrently producing human-centric explanations, similar to rationale-based architectures.

8 Conclusion

To gain a deeper understanding of the relationships between humans and attention-based neural network models, we conduct a large crowd-sourcing study to collect human attention maps for text classification. This human attention dataset represents a valuable community resource, which we then leverage to quantify similarities between human and attention-based neural network models using novel attention-map similarity metrics. Our research not only yields insights into significant similarities between bidirectional RNNs and human attention, but also opens the avenue for promising future research directions.

Acknowledgments

This research was supported by the U.S. Dept. of Education grant P200A150306, Worcester Polytechnic Institute through the Arvid Anderson Fellowship, and the National Science Foundation through grants IIS-1815866, IIS-1910880, IIS-1718310, and CNS-1852498. We thank Prof. Lane Harrison, WPI, for his advice and guidance on the design study for the data collection, and Prof. Jeanine Skorinko, WPI, for helpful discussion about the cognitive aspects of human attention.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.

Yujia Bao, Shiyu Chang, Mo Yu, and Regina Barzilay. 2018. Deriving machine attention from human rationales. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1903–1913.

Lei Chen, Mengyao Zhai, and Greg Mori. 2017. Attending to distinctive moments: Weakly-supervised attention models for action localization in video. In Proceedings of the IEEE International Conference on Computer Vision, pages 328–336.

Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. 2016. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pages 3504–3512.

Michał Daniluk, Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. 2017. Frustratingly short attention spans in neural language modeling. In Proceedings of the International Conference on Learning Representations.

Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv Batra. 2016. Human attention in visual question answering: Do humans and deep networks look at the same regions? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 932–937.

Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2019. ERASER: A benchmark to evaluate rationalized NLP models. arXiv preprint arXiv:1911.03429.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. ACM.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556.

Souvik Kundu and Hwee Tou Ng. 2018. A question-focused multi-factor attention network for question answering. In Thirty-Second AAAI Conference on Artificial Intelligence.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117.

Zachary C. Lipton. 2018. The mythos of model interpretability. Communications of the ACM, 61(10):36–43.

Chenxi Liu, Junhua Mao, Fei Sha, and Alan Yuille. 2017a. Attention correctness in neural image captioning. In Thirty-First AAAI Conference on Artificial Intelligence.

Shulin Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2017b. Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1789–1798.

K. Marimuthu and Sobha Lalitha Devi. 2012. How humans analyse lexical indicators of sentiments: A cognitive analysis using reaction-time. In Proceedings of the 2nd Workshop on Sentiment Analysis where AI meets Psychology, pages 81–90.

James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1101–1111.

Mark O. Riedl. 2019. Human-centered artificial intelligence and machine learning. arXiv preprint arXiv:1901.11184.

Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Ying Sha and May D. Wang. 2017. Interpretable predictions of clinical outcomes with an attention-based recurrent neural network. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 233–240. ACM.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2019. Generating token-level explanations for natural language inference. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 963–969.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.


A Appendix

A.1 Training Rationale-based models

For the Rationale Neural Prediction Framework, we use the PyTorch implementation³ suggested by Lei et al. (2016). In this framework, the encoder is built as a Convolutional Neural Network (CNN) and the generator as a Gumbel-Softmax with independent selectors. The following CNN hyper-parameters are used, following Lei et al. (2016): 200 hidden dimensions, 0.1 dropout rate, 2 hidden layers, 128 batch size, 64 epochs, and 0.0003 initial learning rate.

We conducted an extensive parameter search to find the optimal values for the two key hyper-parameters of the rationale model, the selection lambda and the continuity lambda, which regularize the number and the continuity of the words selected during the optimization process. For the selection lambda, we experimented with the values 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9, and 0. For the continuity lambda, we experimented with the values 0 and twice the selection lambda. We observe that the performance of the rationale-based model is extremely sensitive to its hyper-parameters.
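For reference, a minimal sketch of how the two regularizers enter the rationale model's cost function, following Lei et al. (2016); `z` is assumed to be the generator's binary selection mask and the lambda values are illustrative:

```python
import torch

def rationale_regularizer(z: torch.Tensor,
                          selection_lambda: float = 1e-4,
                          continuity_lambda: float = 2e-4) -> torch.Tensor:
    """Regularization added to the prediction loss.

    z: (batch, seq_len) binary mask from the generator
       (1 = the word is part of the rationale).
    """
    # Selection term: penalizes the number of selected words.
    selection_cost = z.sum(dim=1).mean()
    # Continuity term: penalizes transitions between selected and
    # unselected words, encouraging contiguous rationales.
    continuity_cost = (z[:, 1:] - z[:, :-1]).abs().sum(dim=1).mean()
    return (selection_lambda * selection_cost
            + continuity_lambda * continuity_cost)
```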

One conflicting interest with the rationale-based models is that the more words the model selects, the higher its accuracy becomes. As our goal is to compare human attention with machine-generated attention for model interpretability, we optimize the model not only for accuracy but also for the number of selected rationales. We aim to generate roughly an equal number of words selected by human annotators and by machine-generated rationales.

A.2 Training Attention-based models

We used the following hyper-parameters for the RNN-based models: 100 hidden dimensions, 100 attention size, 0.2 dropout rate, 128 batch size, 64 epochs, and 0.0001 initial learning rate.

A.3 Additional Analysis Results

An example visualization of the attention maps annotated by human annotators and machine learning models is provided in Figure 4. The agreement between the human annotators and all machine learning models can be considered high in this example, as there are many mutual selections.

³https://github.com/yala/text_nn

Figure 3: Human attention is highly subjective. Some annotators tend to select only a few words, whereas others choose entire sentences.

Another example is provided in Figure 3, showing the attention maps produced by two different annotators for the same review. This is an extreme example of the subjectivity of human attention: the first annotator highlights only individual words with the strongest cues of sentiment, whereas the second annotator sometimes selects entire sentences when they indicate a sentiment.

Table 5 shows the distribution of selected words over lexical categories for Human (CAM), Machine (bi-RNN), and the entire corpus for the Yelp-50 subset. Any divergence of the Human and Machine columns from the Corpus column indicates a selection tendency for that lexical category. For example, adjectives are selected very heavily by both Human and Machine, even though they make up only 0.02 of all words in the dataset.


Lexical Category                            Human    Machine (bi-RNN)   Corpus
Coordinating conjunction                    0.0000   0.0098             0.0147
Cardinal number                             0.0098   0.0077             0.0043
Determiner                                  0.0112   0.0168             0.0312
Existential there                           0.0000   0.0000             0.0000
Foreign word                                0.0000   0.0000             0.0000
Preposition or subordinating conjunction    0.0266   0.0084             0.0298
Adjective                                   0.2374   0.2269             0.0201
Adjective, comparative                      0.0021   0.0014             0.0002
Adjective, superlative                      0.0252   0.0287             0.0016
List item marker                            0.0000   0.0000             0.0000
Modal                                       0.0035   0.0000             0.0030
Noun, singular or mass                      0.3838   0.3711             0.0950
Noun, plural                                0.0000   0.0000             0.0000
Proper noun, singular                       0.0000   0.0000             0.0000
Proper noun, plural                         0.0413   0.0665             0.0154
Predeterminer                               0.0000   0.0000             0.0000
Possessive ending                           0.0000   0.0000             0.0000
Personal pronoun                            0.0056   0.0049             0.0141
Possessive pronoun                          0.0035   0.0028             0.0067
Adverb                                      0.1296   0.0931             0.0277
Adverb, comparative                         0.0070   0.0000             0.0014
Adverb, superlative                         0.0000   0.0000             0.0000
Particle                                    0.0000   0.0000             0.0000
Symbol                                      0.0000   0.0000             0.0000
to                                          0.0035   0.0007             0.0077
Interjection                                0.0000   0.0000             0.0000
Verb, base form                             0.0196   0.0028             0.0098
Verb, past tense                            0.0070   0.0609             0.0148
Verb, gerund or present participle          0.0357   0.0462             0.0053
Verb, past participle                       0.0455   0.0455             0.0083
Verb, non-3rd person singular present       0.0000   0.0028             0.0023
Verb, 3rd person singular present           0.0007   0.0021             0.0065
Wh-determiner                               0.0000   0.0000             0.0005
Wh-pronoun                                  0.0007   0.0000             0.0005
Possessive wh-pronoun                       0.0000   0.0000             0.0000
Wh-adverb                                   0.0007   0.0007             0.0012

Table 5: Distribution over lexical categories for human-selected words, machine-selected words, and the entire corpus.


Figure 4: Visualizations of attention maps by human annotators and machine learning models. From top to bottom: first human annotator, second human annotator, RNN, bi-RNN, Rationales.