

Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora

William L. Hamilton, Kevin Clark, Jure Leskovec, Dan Jurafsky
Department of Computer Science, Stanford University, Stanford CA, 94305

wleif,jure,kevclark,[email protected]

Abstract

A word’s sentiment depends on the domain in which it is used. Computational social science research thus requires sentiment lexicons that are specific to the domains being studied. We combine domain-specific word embeddings with a label propagation framework to induce accurate domain-specific sentiment lexicons using small sets of seed words. We show that our approach achieves state-of-the-art performance on inducing sentiment lexicons from domain-specific corpora and that our purely corpus-based approach outperforms methods that rely on hand-curated resources (e.g., WordNet). Using our framework, we induce and release historical sentiment lexicons for 150 years of English and community-specific sentiment lexicons for 250 online communities from the social media forum Reddit. The historical lexicons we induce show that more than 5% of sentiment-bearing (non-neutral) English words completely switched polarity during the last 150 years, and the community-specific lexicons highlight how sentiment varies drastically between different communities.

1 Introduction

The sentiment of the word soft varies drastically between an online community dedicated to sports and one dedicated to toy animals (Figure 1). Terrific once had a highly negative connotation; now it is essentially synonymous with good (Figure 2).

Inducing domain-specific sentiment lexicons is crucial to computational social science (CSS) research. Sentiment lexicons allow researchers to analyze key subjective properties of texts, such as user opinions and emotional attitudes (Taboada et al., 2011). However, without domain-specific lexicons, analyses can be misled by sentiment assignments that are biased towards domain-general contexts and that fail to take into account community-specific vernacular or demographic variations in language use (Hovy, 2015; Yang and Eisenstein, 2015).

Figure 1: The sentiment of soft in different online communities. Sentiment values computed using SENTPROP (Section 3) on comments from Reddit communities illustrate how sentiment depends on social context. Bootstrap-sampled standard deviations provide a measure of confidence in the scores.

Experts or crowdsourced annotators can be used to construct sentiment lexicons for a specific domain, but these efforts are expensive and time-consuming (Mohammad and Turney, 2010; Fast et al., 2016). Crowdsourcing is especially problematic when the domain involves very non-standard language (e.g., historical documents or obscure social media forums), since in these cases annotators must understand the sociolinguistic context of the data.

Recent work has shown that web-scale sentiment lexicons can be automatically induced for large socially-diffuse domains, such as the internet-at-large (Velikovich et al., 2010) or all of Twitter (Tang et al., 2014).


However, in cases where researchers want to analyze the sentiment of domain-specific language—such as in financial documents, historical texts, or tight-knit social media forums—it is not enough to simply use generic crowdsourced or web-scale lexicons. Generic lexicons will not only be inaccurate in specific domains, they may mislead research by introducing harmful biases (Loughran and McDonald, 2011).[1] Researchers need a principled and accurate framework for inducing lexicons that are specific to their domain of study.

To meet this need, we introduce SENTPROP, a framework to learn accurate sentiment lexicons from small sets of seed words and domain-specific corpora. Unlike previous approaches, SENTPROP is designed to maintain accurate performance when using modestly-sized domain-specific corpora (∼10^7 tokens), and it provides confidence scores along with the learned lexicons, which allows researchers to quantify uncertainty in a principled manner.

The key contributions of this work are:

1. A state-of-the-art sentiment induction algorithm, combining high-quality word vector embeddings with an intuitive label propagation approach.

2. A novel bootstrap-sampling framework for inferring confidence scores along with the sentiment values.

3. Two large-scale studies that reveal how sentiment depends on both social and historical context.
   (a) We induce community-specific sentiment lexicons for the largest 250 “subreddit” communities on the social-media forum Reddit, revealing substantial variation in word sentiment between communities.
   (b) We induce historical sentiment lexicons for 150 years of English, revealing that >5% of words switched polarity during this time.

To the best of our knowledge, this is the first work to systematically analyze the domain-dependency of sentiment at a large scale, across hundreds of years and hundreds of user-defined online communities.

All of the inferred lexicons along with code for SENTPROP and all methods evaluated are made available in the SOCIALSENT package released with this paper.[2]

[1] http://brandsavant.com/brandsavant/the-hidden-bias-of-social-media-sentiment-analysis
[2] http://nlp.stanford.edu/projects/socialsent

Figure 2: Terrific becomes more positive over the last 150 years. Sentiment values and bootstrapped confidences were computed using SENTPROP on historical data (Section 6).

2 Related work

Our work builds upon a wealth of previous research on inducing sentiment lexicons, along two threads:

Corpus-based approaches use seed words and patterns in unlabeled corpora to induce domain-specific lexicons. These patterns may rely on syntactic structures (Hatzivassiloglou and McKeown, 1997; Thelen and Riloff, 2002; Widdows and Dorow, 2002; Jijkoun et al., 2010; Rooth et al., 1999), which can be domain-specific and brittle (e.g., in social media lacking usual grammatical structures). Other models rely on general co-occurrence (Turney and Littman, 2003; Riloff and Shepherd, 1997; Igo and Riloff, 2009). Often corpus-based methods exploit distant-supervision signals (e.g., review scores, emoticons) specific to certain domains (Asghar et al., 2015; Blair-Goldensohn et al., 2008; Bravo-Marquez et al., 2015; Choi and Cardie, 2009; Severyn and Moschitti, 2015; Speriosu et al., 2011; Tang et al., 2014). An effective corpus-based approach that does not require distant supervision—which we adapt here—is to construct lexical graphs using word co-occurrences and then to perform some form of label propagation over these graphs (Huang et al., 2014; Velikovich et al., 2010). Recent work has also learned transformations of word-vector representations in order to induce sentiment lexicons (Rothe et al., 2016). Fast et al. (2016) combine word vectors with crowdsourcing to produce domain-general topic lexicons.

Dictionary-based approaches use hand-curated lexical resources—usually WordNet (Fellbaum, 1998)—in order to propagate sentiment from seed labels (Esuli and Sebastiani, 2006; Hu and Liu, 2004; Kamps et al., 2004; Rao and Ravichandran, 2009; San Vicente et al., 2014; Takamura et al., 2005; Tai and Kao, 2013).


Domain           | Positive seed words                                                                    | Negative seed words
Standard English | good, lovely, excellent, fortunate, pleasant, delightful, perfect, loved, love, happy | bad, horrible, poor, unfortunate, unpleasant, disgusting, evil, hated, hate, unhappy
Finance          | successful, excellent, profit, beneficial, improving, improved, success, gains, positive | negligent, loss, volatile, wrong, losses, damages, bad, litigation, failure, down, negative
Twitter          | love, loved, loves, awesome, nice, amazing, best, fantastic, correct, happy           | hate, hated, hates, terrible, nasty, awful, worst, horrible, wrong, sad

Table 1: Seed words. The seed words were manually selected to be context insensitive (without knowledge of the test lexicons).

There is an implicit consensus that dictionary-based approaches will generate higher-quality lexicons, due to their use of these clean, hand-curated resources; however, they are not applicable in domains lacking such a resource (e.g., most historical texts).

Most previous work seeks to enrich or enlarge existing lexicons (San Vicente et al., 2014; Velikovich et al., 2010; Qiu et al., 2009), emphasizing recall over precision. This recall-oriented approach is motivated by the need for massive polarity lexicons in tasks like web-advertising (Velikovich et al., 2010). In contrast to these previous efforts, the goal of this work is to induce high-quality lexicons that are accurate to a specific social context.

Algorithmically, our approach is inspired by Velikovich et al. (2010). We extend this work by incorporating high-quality word vector embeddings, a new graph construction approach, an alternative label propagation algorithm, and a bootstrapping method to obtain confidences. Together these improvements, especially the high-quality word vectors, allow our corpus-based method to even outperform the state-of-the-art dictionary-based approach.

3 Framework

Our framework, SENTPROP, is designed to meet four key desiderata:

1. Resource-light: Accurate performance without massive corpora or hand-curated resources.

2. Interpretable: Uses small seed sets of “paradigm” words to maintain interpretability and avoid ambiguity in sentiment values.

3. Robust: Bootstrap-sampled standard deviations provide a measure of confidence.

4. Out-of-the-box: Does not rely on signals that are specific to only certain domains.

SENTPROP involves two steps: constructing a lexical graph from unlabeled corpora and propagating sentiment labels over this graph.

3.1 Constructing a lexical graph

Lexical graphs are constructed from distributional word embeddings learned on unlabeled corpora.

Distributional word embeddings

The first step in our approach is building high-quality semantic representations for words using a vector space model (VSM). We embed each word $w_i \in \mathcal{V}$ as a vector $\mathbf{w}_i$ that captures information about its co-occurrence statistics with other words (Landauer and Dumais, 1997; Turney and Pantel, 2010). This VSM approach has a long history in NLP and has been highly successful in recent applications (see Levy et al., 2015 for a survey).

When recreating known lexicons, we used a number of publicly available embeddings (Section 4).

In the cases where we learned embeddings ourselves, we employed an SVD-based method to construct the word vectors. First, we construct a matrix $M^{\mathrm{PPMI}} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ with entries given by

$$M^{\mathrm{PPMI}}_{i,j} = \max\left\{\log\left(\frac{\hat{p}(w_i, w_j)}{\hat{p}(w_i)\,\hat{p}(w_j)}\right),\, 0\right\}, \quad (1)$$

where $\hat{p}$ denotes smoothed empirical probabilities of word (co-)occurrences within fixed-size sliding windows of text.[3] $M^{\mathrm{PPMI}}_{i,j}$ is equal to a smoothed variant of the positive pointwise mutual information between words $w_i$ and $w_j$ (Levy et al., 2015). Next, we compute $M^{\mathrm{PPMI}} = U \Sigma V^\top$, the truncated singular value decomposition of $M^{\mathrm{PPMI}}$. The vector embedding for word $w_i$ is then given by

$$\mathbf{w}^{\mathrm{SVD}}_i = (U)_i. \quad (2)$$

[3] We use contexts of size four on each side and context-distribution smoothing with $c = 0.75$ (Levy et al., 2015).
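As an illustration of Eqs. (1)-(2), the following is a minimal NumPy sketch of the PPMI-plus-SVD construction on a small dense co-occurrence matrix. The function name and the dense representation are our own assumptions for exposition; the paper specifies only the window size, the smoothing exponent, the truncated SVD, and the discarding of singular values.

```python
import numpy as np

def ppmi_svd_embeddings(cooc, dim=300, cds=0.75):
    # Sketch of Eqs. (1)-(2): PPMI with context-distribution smoothing,
    # followed by truncated SVD with the singular values discarded.
    # `cooc` is a dense |V| x |V| co-occurrence count matrix gathered
    # over fixed-size sliding windows (size four per side in the paper).
    total = cooc.sum()
    p_w = cooc.sum(axis=1) / total            # empirical p(w_i)
    p_c = cooc.sum(axis=0) ** cds             # smoothed context counts
    p_c = p_c / p_c.sum()                     # smoothed p(w_j)
    p_wc = cooc / total                       # joint p(w_i, w_j)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / np.outer(p_w, p_c))
    ppmi = np.maximum(pmi, 0.0)               # clip negatives (Eq. 1)
    ppmi[~np.isfinite(ppmi)] = 0.0            # zero counts -> PPMI of 0
    U, _, _ = np.linalg.svd(ppmi)             # full SVD, then truncate
    return U[:, :dim]                         # w_i^SVD = (U)_i  (Eq. 2)
```

For a realistic vocabulary one would use a sparse matrix and a truncated solver (e.g., scipy.sparse.linalg.svds) rather than a full dense SVD.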

Page 4: arXiv:1606.02820v1 [cs.CL] 9 Jun 2016one dedicated to toy animals (Figure 1). Terrific once had a highly negative connotation; now it is essentially synonomous with good (Figure 2)

Excluding the singular value weights, $\Sigma$, has been shown to dramatically improve embedding quality (Turney and Pantel, 2010; Bullinaria and Levy, 2012). Following standard practices, we learn embeddings of dimension 300.

We found that this SVD method significantly outperformed word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) on the domain-specific datasets we examined. Our results echo the findings of Levy et al. (2015) that the SVD approach performs best on rare-word similarity tasks.

Defining the graph edges

Given a set of word embeddings, a weighted lexical graph is constructed by connecting each word with its nearest k neighbors within the semantic space (according to cosine similarity). The weights of the edges are set as

$$E_{i,j} = \arccos\left(-\frac{\mathbf{w}_i^\top \mathbf{w}_j}{\|\mathbf{w}_i\|\|\mathbf{w}_j\|}\right). \quad (3)$$
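A minimal sketch of this construction follows. The function name and the default k are hypothetical, since the paper does not report the number of neighbors used; everything else follows Eq. (3) directly.

```python
import numpy as np

def build_lexical_graph(W, k=10):
    # Sketch of the graph construction: connect each word to its k
    # nearest neighbors by cosine similarity, weighting each edge by
    # E_ij = arccos(-cos(w_i, w_j)) as in Eq. (3).
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos = np.clip(Wn @ Wn.T, -1.0, 1.0)       # pairwise cosine similarities
    E = np.zeros_like(cos)
    for i in range(W.shape[0]):
        nbrs = np.argsort(-cos[i])            # most similar first
        nbrs = nbrs[nbrs != i][:k]            # drop self, keep top k
        E[i, nbrs] = np.arccos(-cos[i, nbrs])
    return np.maximum(E, E.T)                 # symmetrize the graph
```

Note that arccos of the negated cosine maps highly similar pairs (cosine near 1) to large edge weights (near π), so strong semantic neighbors carry the most propagation mass.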

3.2 Propagating polarities from a seed set

Once a weighted lexical graph is constructed, we propagate sentiment labels over this graph using a random walk method (Zhou et al., 2004). A word’s polarity score for a seed set is proportional to the probability of a random walk from the seed set hitting that word.

Let $\mathbf{p} \in \mathbb{R}^{|\mathcal{V}|}$ be a vector of word-sentiment scores constructed using seed set $S$ (e.g., ten negative words); $\mathbf{p}$ is initialized to have $\frac{1}{|\mathcal{V}|}$ in all entries. And let $E$ be the matrix of edge weights given by equation (3). First, we construct a symmetric transition matrix from $E$ by computing $T = D^{-1/2} E D^{-1/2}$, where $D$ is a matrix with the column sums of $E$ on the diagonal. Next, using $T$ we iteratively update $\mathbf{p}$ until numerical convergence:

$$\mathbf{p}^{(t+1)} = \beta T \mathbf{p}^{(t)} + (1 - \beta)\mathbf{s}, \quad (4)$$

where $\mathbf{s}$ is a vector with values set to $\frac{1}{|S|}$ in the entries corresponding to the seed set $S$ and zeros elsewhere. The $\beta$ term controls the extent to which the algorithm favors local consistency (similar labels for neighbors) vs. global consistency (correct labels on seed words), with lower $\beta$s emphasizing the latter.

To obtain a final polarity score for a word $w_i$, we run the walk using both positive and negative seed sets, obtaining positive ($p_P(w_i)$) and negative ($p_N(w_i)$) label scores. We then combine these values into a positive-polarity score as

$$\bar{p}_P(w_i) = \frac{p_P(w_i)}{p_P(w_i) + p_N(w_i)}$$

and standardize the final scores to have zero mean and unit variance (within a corpus).
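The following sketch puts the propagation and score combination together, assuming the edge-weight matrix E from Eq. (3). The function names and the β and tolerance defaults are our own placeholders; the paper does not report the settings it used.

```python
import numpy as np

def random_walk_scores(E, seed_idx, beta=0.9, tol=1e-6):
    # Sketch of Eq. (4) for one seed set.
    n = E.shape[0]
    d = E.sum(axis=0)                          # column sums of E
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    T = D_inv_sqrt @ E @ D_inv_sqrt            # symmetric transition matrix
    s = np.zeros(n)
    s[seed_idx] = 1.0 / len(seed_idx)          # mass on the seed words
    p = np.full(n, 1.0 / n)                    # uniform initialization
    while True:
        p_next = beta * (T @ p) + (1.0 - beta) * s
        if np.max(np.abs(p_next - p)) < tol:   # numerical convergence
            return p_next
        p = p_next

def polarity_scores(E, pos_idx, neg_idx):
    # Run the walk from both seed sets, combine into a positive-polarity
    # score, and z-score within the corpus.
    pP = random_walk_scores(E, pos_idx)
    pN = random_walk_scores(E, neg_idx)
    pol = pP / (pP + pN)
    return (pol - pol.mean()) / pol.std()
```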

Many variants of this random walk approach and related label propagation techniques exist in the literature (Zhou et al., 2004; Zhu and Ghahramani, 2002; Zhu et al., 2003; Velikovich et al., 2010; San Vicente et al., 2014). We experimented with a number of these approaches and found little difference between their performance, so we present only this random walk approach here. The SOCIALSENT package contains a full suite of these methods.

3.3 Bootstrap-sampling for robustness

Propagated sentiment scores are inevitably influenced by the seed set, and it is important for researchers to know the extent to which polarity values are simply the result of corpus artifacts that are correlated with these seed words. We address this issue by using a bootstrap-sampling approach to obtain confidence regions over our sentiment scores. We bootstrap by running our propagation over B random equally-sized subsets of the positive and negative seed sets. Computing the standard deviation of the bootstrap-sampled polarity scores provides a measure of confidence and allows the researcher to evaluate the robustness of the assigned polarities. We set B = 50 and used 7 words per random subset (full seed sets are size 10; see Table 1).
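A minimal sketch of this bootstrap, reusing the hypothetical polarity_scores function from the Section 3.2 sketch; the B = 50 and 7-word subset settings come from the text above.

```python
import numpy as np

def bootstrap_polarities(E, pos_idx, neg_idx, B=50, subset_size=7, seed=0):
    # Rerun propagation over B random subsets of the seed sets and
    # summarize each word's score distribution: the mean is the score,
    # the standard deviation is the confidence measure.
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(B):
        pos = rng.choice(pos_idx, size=subset_size, replace=False)
        neg = rng.choice(neg_idx, size=subset_size, replace=False)
        runs.append(polarity_scores(E, pos, neg))
    runs = np.stack(runs)
    return runs.mean(axis=0), runs.std(axis=0)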

4 Recreating known lexicons

We validate our approach by recreating known sentiment lexicons in three domains: Standard English, Twitter, and Finance. Table 1 lists the seed words used in each domain.

Standard English: To facilitate comparison with previous work, we focus on the well-known General Inquirer lexicon (Stone et al., 1966). We also use the continuous valence (i.e., polarity) scores collected by Warriner et al. (2013) in order to evaluate the fine-grained performance of our framework. We test our framework’s performance using two different embeddings: off-the-shelf Google News embeddings constructed from 10^11 tokens[4] and embeddings we constructed from the 2000s decade of the Corpus of Historical American English (COHA), which contains ∼2 × 10^7 words in each decade, from 1850 to 2000 (Davies, 2010). The COHA corpus allows us to test how the algorithms deal with this smaller historical corpus, which is important since we will use the COHA corpus to infer historical sentiment lexicons (Section 6).

[4] https://code.google.com/p/word2vec/

Finance: Previous work found that general-purpose sentiment lexicons performed very poorly on financial text (Loughran and McDonald, 2011), so a finance-specific sentiment lexicon (containing binary labels) was hand-constructed for this domain (ibid.). To test against this lexicon, we constructed embeddings using a dataset of ∼2 × 10^7 tokens from financial 8K documents (Lee et al., 2014).

Twitter: Numerous works attempt to induce Twitter-specific sentiment lexicons using supervised approaches and features unique to that domain (e.g., follower graphs; Speriosu et al., 2011). Here, we emphasize that we can induce an accurate lexicon using a simple domain-independent and resource-light approach, with the implication that lexicons can easily be induced for related social media domains without resorting to complex supervised frameworks. We evaluate our approach using the test set from the 2015 SemEval task 10E competition (Rosenthal et al., 2015), and we use the embeddings constructed by Rothe et al. (2016).[5]

4.1 Baselines and state-of-the-art comparisons

We compared SENTPROP against standard baselines and state-of-the-art approaches. The PMI baseline of Turney and Littman (2003) computes the pointwise mutual information between the seeds and the targets without using propagation. The CountVec baseline, corresponding to the method in Velikovich et al. (2010), is similar to our method but uses an alternative propagation approach and raw co-occurrence vectors instead of learned embeddings. Both these methods require raw corpora, so they function as baselines in cases where we do not use off-the-shelf embeddings. We also compare against DENSIFIER, a state-of-the-art method which learns orthogonal transformations of word vectors instead of propagating labels (Rothe et al., 2016). Lastly, on standard English we compare against a state-of-the-art WordNet-based method, which performs label propagation over a WordNet-derived graph (San Vicente et al., 2014). Several variant baselines, all of which SENTPROP outperforms, are omitted for brevity (e.g., using word-vector cosines in place of PMI in Turney and Littman (2003)’s framework). Code for replicating all these variants is available in the SOCIALSENT package.

[5] The official SemEval task 10E involved fully-supervised learning, so we do not use their evaluation setup.

4.2 Evaluation setup

We evaluate the approaches according to (i) their binary classification accuracy (ignoring the neutral class, as is common in previous work), (ii) ternary classification performance (positive vs. neutral vs. negative),[6] and (iii) Kendall τ rank-correlation with continuous human-annotated polarity scores.

For all methods in the ternary-classification condition, we use the class-mass normalization method (Zhu et al., 2003) to label words as positive, neutral, or negative. This method assumes knowledge of the label distribution—i.e., how many positive/negative vs. neutral words there are—and simply assigns labels to best match this distribution.
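One simple reading of this labeling step, for illustration: rank words by propagated score and cut the ranking so the label proportions match the known class distribution. The function name and rank-based cut are our assumptions; Zhu et al. (2003) describe class-mass normalization in terms of rescaled class probabilities, which reduces to a similar thresholding in the one-dimensional case.

```python
import numpy as np

def class_mass_labels(scores, frac_pos, frac_neg):
    # Assign positive/neutral/negative labels so that the label
    # proportions match the known class distribution.
    n = len(scores)
    order = np.argsort(-scores)                   # most positive first
    labels = np.zeros(n, dtype=int)               # 0 = neutral
    labels[order[: int(round(frac_pos * n))]] = 1
    labels[order[n - int(round(frac_neg * n)):]] = -1
    return labels
```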

4.3 Evaluation results

Tables 2a-2d summarize the performance of our framework along with baselines and other state-of-the-art approaches. Our framework significantly outperforms the baselines on all tasks, outperforms a state-of-the-art approach that uses WordNet on standard English (Table 2a), and is competitive with Sentiment140 on Twitter (Table 2b), a distantly-supervised approach that uses signals from emoticons (Mohammad and Turney, 2010). DENSIFIER also performs extremely well, outperforming SENTPROP when off-the-shelf embeddings are used (Tables 2a and 2b). However, SENTPROP significantly outperforms all other approaches when using the domain-specific embeddings (Tables 2c and 2d).

Overall, SENTPROP is competitive with the state-of-the-art across all conditions and, unlike previous approaches, it is able to maintain high accuracy even when modestly-sized domain-specific corpora are used.

[6] Only GI contains words explicitly marked neutral, so for ternary evaluations in Twitter and Finance we sample neutral words from GI to match its neutral-vs-not distribution.


Method       | AUC  | Ternary F1 | τ
SENTPROP     | 90.6 | 58.6       | 0.44
DENSIFIER    | 93.3 | 62.1       | 0.50
WordNet      | 89.5 | 58.7       | 0.34
Majority     | –    | 24.8       | –

(a) Corpus methods outperform WordNet on standard English. Using word-vector embeddings learned on a massive corpus (10^11 tokens), we see that both corpus-based methods outperform the WordNet-based approach overall.

Method       | AUC  | Ternary F1 | τ
SENTPROP     | 86.0 | 60.1       | 0.50
DENSIFIER    | 90.1 | 59.4       | 0.57
Sentiment140 | 86.2 | 57.7       | 0.51
Majority     | –    | 24.9       | –

(b) Corpus approaches are competitive with a distantly supervised method on Twitter. Using Twitter embeddings learned from ∼10^9 tokens, we see that the semi-supervised corpus approaches using small seed sets perform very well.

Method       | AUC  | Ternary F1
SENTPROP     | 91.6 | 63.1
DENSIFIER    | 80.2 | 50.3
PMI          | 86.1 | 49.8
CountVecs    | 81.6 | 51.1
Majority     | –    | 23.6

(c) SENTPROP performs best with domain-specific finance embeddings. Using embeddings learned from a financial corpus (∼2 × 10^7 tokens), SENTPROP significantly outperforms the other methods.

Method       | AUC  | Ternary F1 | τ
SENTPROP     | 83.8 | 53.0       | 0.28
DENSIFIER    | 77.4 | 46.6       | 0.19
PMI          | 70.6 | 41.9       | 0.16
CountVecs    | 52.7 | 32.9       | 0.01
Majority     | –    | 24.3       | –

(d) SENTPROP performs well on standard English even with a 1000x reduction in corpus size. SENTPROP maintains strong performance even when using embeddings learned from the 2000s decade of COHA (only ∼2 × 10^7 tokens).

Table 2: Results on recreating known lexicons.

We found that the baseline method of Velikovich et al. (2010), which our method is closely related to, performed very poorly with these domain-specific corpora. This indicates that using high-quality word-vector embeddings can have a drastic impact on performance. However, it is worth noting that Velikovich et al. (2010)’s method was designed for high recall with massive corpora, so its poor performance in our regime is not surprising.

5 Inducing community-specific lexicons

As a first large-scale study, we investigate how sentiment depends on the social context in which a word is used. It is well known that there is substantial sociolinguistic variation between different communities, whether these communities are defined geographically (Trudgill, 1974) or via underlying sociocultural differences (Labov, 2006). However, no previous work has systematically investigated community-specific variation in word sentiment at a large scale. Yang and Eisenstein (2015) exploit social network structures in Twitter to infer a small number (1-10) of communities and analyzed sentiment variation via a supervised framework. Our analysis extends this line of work by analyzing sentiment across hundreds of user-defined communities using only unlabeled corpora and a small set of “paradigm” seed words (the Twitter seed words outlined in Table 1).

In our study, we induced sentiment lexicons for the top-250 (by comment count) subreddits from the social media forum Reddit.[7] We used all the 2014 comment data to induce the lexicons, with words lower-cased and comments from bots and deleted users removed.[8] Sentiment was induced for the top-5000 non-stop words in each subreddit (again, by comment frequency).

5.1 Examining the lexicons

Analysis of the learned lexicons reveals the extent to which sentiment can differ across communities. Figure 3 highlights some words with opposing sentiment in two communities: in r/TwoXChromosomes (r/TwoX), a community dedicated to female perspectives and gender issues, the words crazy and insane have negative polarity, which is not true in the r/sports community, and, vice-versa, words like soft are positive in r/TwoX but negative in r/sports.


[7] Subreddits are user-created topic-specific forums.
[8] https://archive.org/details/2015_reddit_comments_corpus


Figure 3: Word sentiment differs drastically between a community dedicated to sports (r/sports) and one dedicated to female perspectives and gender issues (r/TwoX). Words like soft and animal have positive sentiment in r/TwoX but negative sentiment in r/sports, while the opposite holds for words like crazy and insane.

[Histogram: community-community sentiment correlation (τ25%) vs. count.]

Figure 4: There is a long tail of communities with very different word sentiments. Some communities have very similar sentiment (e.g., r/sports and r/hockey), while other community pairs differ drastically (e.g., r/sports and r/TwoX).

To get a sense of how much sentiment differs across communities in general, we selected a random subset of 1000 community pairs and examined the correlation in their sentiment values for highly sentiment-bearing words (Figure 4). The 1000 random pairs were selected such that each member of the pair overlapped in at least half of their top-5000 word vocabulary. We then computed the correlation between the sentiments in these community pairs. Since sentiment is noisy and relatively uninteresting for neutral words, we compute τ25%, the Kendall τ correlation over the top 25% most sentiment-bearing words shared between the two communities. We see that the distribution is noticeably skewed, with many community pairs having highly uncorrelated sentiment values.
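For illustration, a minimal sketch of the τ25% statistic for one community pair; the function name and the summed-absolute-score criterion for "most sentiment-bearing" are our own assumptions about the exact ranking used.

```python
import numpy as np
from scipy.stats import kendalltau

def tau_25(scores_a, scores_b):
    # Kendall's tau restricted to the top 25% most sentiment-bearing
    # words, given aligned score arrays over the shared vocabulary of
    # two communities.
    strength = np.abs(scores_a) + np.abs(scores_b)
    top = np.argsort(-strength)[: len(strength) // 4]   # top 25%
    return kendalltau(scores_a[top], scores_b[top]).correlation
```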

Analysis of individual pairs reveals some interesting insights about sentiment and inter-community dynamics. For example, we found that the sentiment correlation between r/TwoX and r/TheRedPill (τ25% = 0.58), two communities that hold conflicting views and often attack each other,[9] was actually higher than the sentiment correlation between r/TwoX and r/sports (τ25% = 0.41), two communities that are entirely unrelated. This result suggests that conflicting communities may have more similar sentiment in their language compared to communities that are entirely unrelated.

6 Inducing diachronic sentiment lexicons

Sentiment also depends on the historical time period in which a word is used. To investigate this dependency, we use our framework to analyze how word polarities have shifted over the last 150 years. The phenomena of amelioration (words becoming more positive) and pejoration (words becoming more negative) are well-discussed in the linguistic literature (Traugott and Dasher, 2001); however, no comprehensive polarity lexicons exist for historical data (Cook and Stevenson, 2010). Such lexicons are crucial to the growing body of work on NLP analyses of historical text (Piotrowski, 2012), which are informing diachronic linguistics (Hamilton et al., 2016), the digital humanities (Muralidharan and Hearst, 2012), and history (Hendrickx et al., 2011).

The only previous work on automatically inducing historical sentiment lexicons is Cook and Stevenson (2010).

[9] This conflict is well-known on Reddit; for example, both communities mention each others’ names along with fuck-based profanity in the same comment far more than one would expect by chance ($\chi^2_1 > 6.8$, $p < 0.01$ for both). r/TheRedPill is dedicated to male empowerment.

Page 8: arXiv:1606.02820v1 [cs.CL] 9 Jun 2016one dedicated to toy animals (Figure 1). Terrific once had a highly negative connotation; now it is essentially synonomous with good (Figure 2)

(a) Lean becomes more positive. Lean underwent amelioration, becoming more similar to muscular and less similar to weak.

(b) Pathetic becomes more negative. Pathetic underwent pejoration, becoming similar to weak and less similar to passionate.

Figure 5: Examples of amelioration and pejoration.

They use the PMI method and a full modern sentiment lexicon as their seed set, which problematically assumes that all these words have not changed in sentiment. In contrast, we use a small seed set of words that were manually selected based upon having strong and stable sentiment over the last 150 years (Table 1; confirmed via historical entries in the Oxford English Dictionary).

6.1 Examining the lexicons

We constructed lexicons from COHA, since it was carefully constructed to be genre-balanced (e.g., compared to the Google N-Grams; Pechenick et al., 2015). We built lexicons for all adjectives with counts above 100 in a given decade and also for the top-5000 non-stop words within each year. In both these cases we found that >5% of sentiment-bearing (positive/negative) words completely switched polarity during this 150-year time period and >25% of all words changed their sentiment label (including switches to/from neutral).[10] The prevalence of full polarity switches highlights the importance of historical sentiment lexicons for work on diachronic linguistics and cultural change.

Figure 5a shows an example amelioration detected by this method: the word lean lost its negative connotations associated with “weakness” and instead became positively associated with concepts like “muscularity” and “fitness”. Figure 5b shows an example pejoration, where pathetic, which used to be more synonymous with passionate, gained stronger negative associations with the concepts of “weakness” and “inadequacy” (Simpson et al., 1989).

[10] We defined the thresholds for polar vs. neutral using the class-mass normalization method and compared scores averaged over 1850-1880 to those averaged over 1970-2000.

In both these cases, semantic similarities computed using our learned historical word vectors were used to contextualize the shifts.

Some other well-known examples of sentiment changes captured by our framework include the semantic bleaching of sorry, which shifted from negative and serious (“he was in a sorry state”) to uses as a neutral discourse marker (“sorry about that”), and worldly, which used to have negative connotations related to materialism and religious impurity (“sinful worldly pursuits”) but now is frequently used to indicate sophistication (“a cultured, worldly woman”) (Simpson et al., 1989). Our hope is that the full lexicons released with this work will spur further examinations of such historical shifts in sentiment, while also facilitating CSS applications that require sentiment ratings for historical text.

7 Conclusion

SENTPROP allows researchers to easily induce robust and accurate sentiment lexicons that are relevant to their particular domain of study. Such lexicons are crucial to CSS research, as evidenced by our two studies showing that sentiment depends strongly on both social and historical context.

The sentiment lexicons induced by SENTPROP are not perfect, which is reflected in the uncertainty associated with our bootstrap-sampled estimates. However, we believe that these user-constructed, domain-specific lexicons, which quantify uncertainty, provide a more principled foundation for CSS research compared to domain-general sentiment lexicons that contain unknown biases. In the future our method could also be integrated with supervised domain-adaptation (e.g., Yang and Eisenstein, 2015) to further improve these domain-specific results.


Acknowledgements

The authors thank P. Liang for his helpful comments. This research has been supported in part by NSF CNS-1010921, IIS-1149837, IIS-1514268, NIH BD2K, ARO MURI, DARPA XDATA, DARPA SIMPLEX, Stanford Data Science Initiative, SAP Stanford Graduate Fellowship, NSERC PGS-D, Boeing, Lightspeed, SAP, and Volkswagen.

References

Muhammad Zubair Asghar, Aurangzeb Khan, Shakeel Ahmad, Imran Ali Khan, and Fazal Masud Kundi. 2015. A Unified Framework for Creating Domain Dependent Polarity Lexicons from User Generated Reviews. PLOS ONE, 10(10):e0140204, October.

Sasha Blair-Goldensohn, Kerry Hannan, Ryan McDonald, Tyler Neylon, George A. Reis, and Jeff Reynar. 2008. Building a sentiment summarizer for local service reviews. In WWW Workshop on NLP in the Information Explosion Era.

Felipe Bravo-Marquez, Eibe Frank, and Bernhard Pfahringer. 2015. From Unlabelled Tweets to Twitter-specific Opinion Words. In SIGIR.

John A. Bullinaria and Joseph P. Levy. 2012. Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behavior Research Methods, 44(3):890–907, September.

Yejin Choi and Claire Cardie. 2009. Adapting a polarity lexicon using integer linear programming for domain-specific sentiment classification. In EMNLP.

Paul Cook and Suzanne Stevenson. 2010. Automatically Identifying Changes in the Semantic Orientation of Words. In LREC.

Mark Davies. 2010. The Corpus of Historical American English: 400 million words, 1810-2009.

Andrea Esuli and Fabrizio Sebastiani. 2006. SentiWordNet: A publicly available lexical resource for opinion mining. In LREC.

Ethan Fast, Binbin Chen, and Michael S. Bernstein. 2016. Empath: Understanding Topic Signals in Large-Scale Text. In CHI.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In ACL. arXiv:1605.09096.

Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the semantic orientation of adjectives. In EACL.

Iris Hendrickx, Michel Généreux, and Rita Marquilhas. 2011. Automatic pragmatic text segmentation of historical letters. In Language Technology for Cultural Heritage, pages 135–153. Springer.

Dirk Hovy. 2015. Demographic factors improve classification performance. In ACL.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In KDD. ACM.

Sheng Huang, Zhendong Niu, and Chongyang Shi. 2014. Automatic construction of domain-specific sentiment lexicon based on constrained label propagation. Knowledge-Based Systems, 56:191–200, January.

Sean P. Igo and Ellen Riloff. 2009. Corpus-based semantic lexicon induction with web-based corroboration. In ACL Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics.

Valentin Jijkoun, Maarten de Rijke, and Wouter Weerkamp. 2010. Generating focused topic-specific sentiment lexicons. In ACL, pages 585–594.

Jaap Kamps, M. J. Marx, Robert J. Mokken, and M. de Rijke. 2004. Using WordNet to measure semantic orientations of adjectives. In LREC.

William Labov. 2006. The Social Stratification of English in New York City. Cambridge University Press.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev., 104(2):211.

Heeyoung Lee, Mihai Surdeanu, Bill MacCartney, and Dan Jurafsky. 2014. On the Importance of Text Analysis for Stock Price Prediction. In LREC, pages 1170–1175.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Ling., 3.

Tim Loughran and Bill McDonald. 2011. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1):35–65.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Saif M. Mohammad and Peter D. Turney. 2010. Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. In NAACL, pages 26–34.

Aditi Muralidharan and Marti A. Hearst. 2012. Supporting exploratory text analysis in literature study. Literary and Linguistic Computing.

Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PLoS ONE, 10(10).

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Michael Piotrowski. 2012. Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies, 5(2):1–157.

Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. 2009. Expanding Domain Sentiment Lexicon through Double Propagation. In IJCAI, volume 9.

Delip Rao and Deepak Ravichandran. 2009. Semi-supervised polarity lexicon induction. In EACL, pages 675–682.

Ellen Riloff and Jessica Shepherd. 1997. A corpus-based approach for building semantic lexicons. arXiv preprint cmp-lg/9706013.

Mats Rooth, Stefan Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil. 1999. Inducing a semantically annotated lexicon via EM-based clustering. In ACL, pages 104–111.

Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif M. Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 Task 10: Sentiment analysis in Twitter. In SemEval-2015.

Sascha Rothe, Sebastian Ebert, and Hinrich Schütze. 2016. Ultradense Word Embeddings by Orthogonal Transformation. arXiv preprint arXiv:1602.07572.

Iñaki San Vicente, Rodrigo Agerri, and German Rigau. 2014. Simple, Robust and (almost) Unsupervised Generation of Polarity Lexicons for Multiple Languages. In EACL, pages 88–97.

Aliaksei Severyn and Alessandro Moschitti. 2015. On the automatic learning of sentiment lexicons. In NAACL-HLT.

John Andrew Simpson, Edmund S. C. Weiner, et al. 1989. The Oxford English Dictionary, volume 2. Clarendon Press, Oxford, UK.

Michael Speriosu, Nikita Sudan, Sid Upadhyay, and Jason Baldridge. 2011. Twitter polarity classification with label propagation over lexical links and the follower graph. In ACL Workshop on Unsupervised Learning in NLP.

Philip J. Stone, Dexter C. Dunphy, and Marshall S. Smith. 1966. The General Inquirer: A Computer Approach to Content Analysis.

Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based methods for sentiment analysis. Comput. Ling., 37(2):267–307.

Yen-Jen Tai and Hung-Yu Kao. 2013. Automatic domain-specific sentiment lexicon generation with label propagation. In Proceedings of International Conference on Information Integration and Web-based Applications & Services, page 53. ACM.

Hiroya Takamura, Takashi Inui, and Manabu Okumura. 2005. Extracting semantic orientations of words using spin model. In ACL, pages 133–140.

Duyu Tang, Furu Wei, Bing Qin, Ming Zhou, and Ting Liu. 2014. Building Large-Scale Twitter-Specific Sentiment Lexicon: A Representation Learning Approach. In COLING, pages 172–182.

Michael Thelen and Ellen Riloff. 2002. A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In EMNLP.

Elizabeth Closs Traugott and Richard B. Dasher. 2001. Regularity in Semantic Change. Cambridge University Press, Cambridge, UK.

Peter Trudgill. 1974. Linguistic change and diffusion: Description and explanation in sociolinguistic dialect geography. Language in Society, 3(2):215–246.

Peter D. Turney and Michael L. Littman. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM Trans. Inf. Sys., 21(4):315–346.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. J. Artif. Intell. Res., 37(1):141–188.

Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan, and Ryan McDonald. 2010. The viability of web-derived polarity lexicons. In NAACL-HLT, pages 777–785.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207.

Dominic Widdows and Beate Dorow. 2002. A graph model for unsupervised lexical acquisition. In COLING.

Yi Yang and Jacob Eisenstein. 2015. Putting Things in Context: Community-specific Embedding Projections for Sentiment Analysis. arXiv preprint arXiv:1511.06052.

Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. 2004. Learning with local and global consistency. In NIPS, volume 16.

Xiaojin Zhu and Zoubin Ghahramani. 2002. Learning from labeled and unlabeled data with label propagation. Technical report, Citeseer.

Xiaojin Zhu, Zoubin Ghahramani, John Lafferty, et al. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pages 912–919.