Post on 12-Mar-2020
| | COSS
Machine Learning and Modelling for Social Networks
Lloyd Sanders, Olivia Woolley, Iza Moize, Nino Antulov-Fantulin
D-GESS: Computational Social Science
Introduction to Sentiment Analysis
Overview
§ What is Sentiment Analysis?
§ Classifying Sentiment
§ Feature Creation and Selection
§ Use Case: Public Health and Vaccine Sentiment
§ References and Reading
What is Sentiment Analysis?
Sentiment analysis is the operation of understanding the intent or emotion behind a given piece of text.
§ A positive/negative polarity is assigned to text
§ The sentiment ‘space’ is being expanded to accommodate more than a single dimension
§ Classification with respect to emotion (joy, frustration, sadness) is also emerging
§ Classification with respect to stance (either for or against a position) is similar to, but not entirely the same as, sentiment
§ Sentiment analysis is also known as opinion mining
What is Sentiment Analysis?
§ Sentiment analysis is a branch of computer science that overlaps heavily with machine learning and computational linguistics
§ Why? One seeks to understand the general opinion across many documents within a corpus (e.g., all tweets relating to a given brand)
§ Labeling every document by hand is labor intensive, so we train a classifier on a labeled dataset (supervised learning) and use it to label the remaining documents automatically
Emotional Arcs of Fiction
§ Vonnegut posited in his Master’s thesis that there were 6 basic shapes to a story
  § Rags to Riches (rise)
  § Riches to Rags (fall)
  § Man in a hole (fall then rise)
  § Icarus (rise then fall)
  § Cinderella (rise then fall then rise)
  § Oedipus (fall then rise then fall)
§ A team used sentiment analysis to test this on over 1700 English fiction novels [Reagan et al. 2016]
Emotional Arcs of Fiction
[Figure: emotional arcs for the six story shapes. Panels: Rags to Riches, Riches to Rags, Man in a Hole, Icarus, Cinderella, Oedipus. Source: Reagan et al. 2016]
Why is it useful?
Sentiment could be considered a latent variable in social behavior. Measuring and understanding this behavior could lead to a better understanding of social phenomena.
§ Sentiment analysis often correlates well with real-world observables:
  § Commercial: brand awareness
  § Stock fluctuations and public opinion [Bollen et al. 2010]
  § Health related: vaccine sentiment vs. coverage [see the use case later in this lecture]
  § Public safety: situational awareness in mass emergencies via Twitter [Verma et al. 2011]
Sentiment Classification is Difficult
§ Sentiment is very domain specific and, on social media, also time specific
§ Different contexts alter the polarity of words (e.g. ‘unpredictable’ is good in a movie review, bad when describing driving)
§ Slang: ‘Movie is bad ass’
§ Sentiment has multiple levels:
  § Document or message (tweet/SMS) level
  § Term/aspect level: “The coffee was amazing, but the atmosphere was dull”
  § Word level / within-word level (severity of sentiment per word)
§ Negations and sloppy spelling/structure compound the difficulty
Classifying Sentiment: A Recipe
§ Gather a large quantity of data (the more the better)
§ Construct a labeled set of data with your classes (e.g. positive/negative/neutral)
§ Split the labeled set into training/test sets
§ Construct your features
§ Train a classifier (SVM, Naïve Bayes, ensemble methods, neural nets)
§ Assess accuracy on the test set
§ Let the classifier loose on the full set
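The recipe above can be sketched end to end on toy data. This is a minimal illustration, not the lecture's implementation: the eight labeled sentences, the whitespace tokenizer, the 75/25 split, and the function names are all assumptions, and the classifier is a small multinomial Naïve Bayes with add-one smoothing written with the Python standard library only.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


def train_naive_bayes(docs, labels):
    """Collect the counts needed for multinomial Naive Bayes."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, doc):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(doc):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Steps 1-2: a (tiny, hypothetical) labeled data set.
labeled = [
    ("great food and great service", "pos"),
    ("i love this place", "pos"),
    ("amazing coffee", "pos"),
    ("wonderful and friendly staff", "pos"),
    ("terrible food and rude service", "neg"),
    ("i hate this place", "neg"),
    ("awful coffee", "neg"),
    ("horrible and slow staff", "neg"),
]

# Step 3: split into training and test sets (fixed seed for repeatability).
random.Random(0).shuffle(labeled)
split = int(0.75 * len(labeled))
train_set, test_set = labeled[:split], labeled[split:]

# Steps 4-6: build features (word counts), train, and assess accuracy.
model = train_naive_bayes([d for d, _ in train_set], [y for _, y in train_set])
accuracy = sum(predict(model, d) == y for d, y in test_set) / len(test_set)
print(f"held-out accuracy: {accuracy:.2f}")
```

With so little data the held-out accuracy is not meaningful; the point is only the order of the steps.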
Labeling Training Data
“Put junk in, get junk out”
§ It’s important to have well-labeled data, and there are a number of ways of obtaining it
§ Self-annotation can lead to biases
§ Crowdsourced annotation (e.g. mturk.com, crowdflower.com)
Labeling Training Data
§ Pseudo-labeling data can have a net positive effect
§ On social media, for example, this can be achieved through hashtags or emoticons/emojis [Kouloumpis et al. 2011, Davidov et al. 2010]
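Pseudo-labeling from emoticons and hashtags might look like the sketch below. The marker sets and the function name `pseudo_label` are hypothetical and chosen only for illustration; the cited papers use their own marker lists. Tweets with no marker, or with conflicting markers, are left unlabeled, and the markers are removed from the text so that a classifier cannot simply memorise them.

```python
# Hypothetical marker lists, for illustration only (matched case-sensitively).
POSITIVE_MARKERS = {":)", ":-)", ":D", "#happy", "#love"}
NEGATIVE_MARKERS = {":(", ":-(", "#sad", "#fail"}


def pseudo_label(tweet):
    """Return (label, cleaned_text); label is None when no usable marker is found."""
    tokens = tweet.split()
    has_pos = any(tok in POSITIVE_MARKERS for tok in tokens)
    has_neg = any(tok in NEGATIVE_MARKERS for tok in tokens)
    if has_pos == has_neg:
        # Either no marker at all, or conflicting markers: leave unlabeled.
        return None, tweet
    label = "pos" if has_pos else "neg"
    markers = POSITIVE_MARKERS | NEGATIVE_MARKERS
    cleaned = " ".join(tok for tok in tokens if tok not in markers)
    return label, cleaned


print(pseudo_label("Just got my flu shot :)"))
print(pseudo_label("Missed the train again #fail"))
```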
Constructing text features
§ A common practice is the bag-of-words technique, which discards structure but does incorporate word count
§ Each document in the corpus is disassembled into a bag of words, represented as a vector
§ One can apply TF-IDF to this bag-of-words vector [see Iza’s lecture on Big Data]
§ The bag-of-words vector per document will be sparse, which can be leveraged in computation
Bag of words:
\[ \vec{d}_i = [x_1, x_2, \cdots, x_n]^T \]
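As a rough sketch of the bag-of-words vector and its TF-IDF weighting: the three toy documents and the function names below are assumptions, and the weighting shown (raw count times log(N / df)) is only one of several common TF-IDF variants. The last helper stores just the non-zero entries, which is how the sparsity can be exploited.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    # Sorted so that each word gets a stable position in the vector.
    return sorted({tok for doc in docs for tok in doc.lower().split()})


def bag_of_words(doc, vocab):
    """Count vector d_i = [x_1, ..., x_n] over the vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocab]


def tf_idf(docs, vocab):
    """Weight each count by log(N / df), one common TF-IDF variant."""
    n_docs = len(docs)
    vectors = [bag_of_words(doc, vocab) for doc in docs]
    # df[j]: number of documents containing vocabulary word j.
    df = [sum(1 for vec in vectors if vec[j] > 0) for j in range(len(vocab))]
    return [
        [vec[j] * math.log(n_docs / df[j]) for j in range(len(vocab))]
        for vec in vectors
    ]


def to_sparse(vector):
    """Keep only the non-zero entries as {index: value}."""
    return {j: x for j, x in enumerate(vector) if x != 0}


docs = ["the food was good", "the food was bad", "good good coffee"]
vocab = build_vocabulary(docs)
weights = tf_idf(docs, vocab)
print(vocab)
print(bag_of_words(docs[2], vocab))
print(to_sparse(bag_of_words(docs[2], vocab)))
```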
N-grams
§ N-grams are a simple technique to capture document structure
§ When considering words: a unigram is a single word, a bigram is a sequence of two words
§ Bigrams can begin to capture negations such as ‘this food was not_good’, but will miss ‘this food was not_very_good’ (less severe)
§ One can construct skip n-grams, e.g. not_*_good
§ N-grams are also possible with characters: ‘good’ is a character 4-gram, ‘happy’ is a character 5-gram
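The three kinds of n-gram mentioned above can be generated with a few lines each. This is a sketch with assumed function names; the skip n-gram variant shown replaces exactly one interior word of each window with `*`, which is one possible reading of "skip n-gram".

```python
def word_ngrams(tokens, n):
    """Contiguous word n-grams, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Contiguous character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def skip_ngrams(tokens, n):
    """Word n-grams with one interior word replaced by '*'."""
    grams = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        for k in range(1, n - 1):  # interior positions only
            grams.append("_".join(window[:k] + ["*"] + window[k + 1:]))
    return grams


tokens = "this food was not very good".split()
print(word_ngrams(tokens, 2))
print(skip_ngrams(tokens, 3))
print(char_ngrams("good", 2))
```

The skip trigram `not_*_good` is produced for both "not very good" and "not so good", which is the generalisation that plain bigrams miss.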
Negations and how to deal with them
“This food was not good”
§ A negation word can flip the polarity of an entire sentence
§ Bigrams or trigrams go some way towards capturing this, as mentioned before
§ How else can one take negations into account?
  § Preprocess the text to mark negated words: ‘not good’ => ‘good_neg’
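A possible preprocessing step for negations is sketched below. The slide only specifies the mapping 'not good' => 'good_neg'; the negation word list, the choice to mark every word up to the next punctuation mark, and the lowercasing are assumptions made for this example.

```python
import re

NEGATION_WORDS = {"not", "no", "never", "cannot"}  # illustrative, not exhaustive
CLAUSE_END = re.compile(r"[.,;:!?]$")


def mark_negations(text):
    """Drop negation words and append '_neg' to the words in their scope.

    Assumption: the scope of a negation runs until the next punctuation mark.
    """
    marked = []
    in_scope = False
    for token in text.lower().split():
        ends_clause = bool(CLAUSE_END.search(token))
        word = token.rstrip(".,;:!?")
        if word in NEGATION_WORDS or word.endswith("n't"):
            in_scope = True  # the negation word itself is not emitted
        elif word:
            marked.append(word + "_neg" if in_scope else word)
        if ends_clause:
            in_scope = False  # punctuation closes the negation scope
    return " ".join(marked)


print(mark_negations("This food was not good"))
print(mark_negations("This food was not very good, but the coffee was great."))
```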
Sentiment Lexicons
There are many publicly available sentiment (and emotion) lexicons. These can be used to construct complementary features for your classifiers (especially for out-of-vocabulary words, i.e. those not in your corpus).
§ General Inquirer [http://www.wjh.harvard.edu/~inquirer/homecat.htm]
§ SentiWordNet [http://sentiwordnet.isti.cnr.it/]
§ Bing Liu’s lexicons [https://www.cs.uic.edu/~liub/]
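Lexicon scores can be turned into a handful of numeric features per document. The sketch below uses a toy lexicon with made-up scores; it does not reproduce the format or contents of any of the lexicons listed above, and the feature names are assumptions.

```python
# Toy lexicon with hypothetical scores; real lexicons are far larger.
TOY_LEXICON = {"amazing": 1.0, "good": 0.5, "dull": -0.5, "terrible": -1.0}


def lexicon_features(text, lexicon=TOY_LEXICON):
    """Summarise the lexicon scores of the words found in the text."""
    scores = [lexicon[tok] for tok in text.lower().split() if tok in lexicon]
    return {
        "n_positive": sum(1 for s in scores if s > 0),
        "n_negative": sum(1 for s in scores if s < 0),
        "total_score": sum(scores),
        "max_score": max(scores, default=0.0),
        "min_score": min(scores, default=0.0),
    }


print(lexicon_features("The coffee was amazing but the atmosphere was dull"))
```

Because the lexicon is built independently of the training corpus, these features remain informative for words the classifier never saw during training.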
Feature Vectors for short informal texts: a bird’s eye view
§ Here is a sample of the features used by a state-of-the-art Twitter sentiment classifier:
  § Word n-grams (up to 4), skip n-grams with 1 missing word
  § Character n-grams up to 5
  § All caps: number of words in capitals
  § Number of hashtags
  § Number of contiguous punctuation marks, either exclamation or question or mixed; also whether the last character is one of these
  § Presence of emoticons
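Several of the surface features listed above can be computed with simple string operations. This is a sketch only: the emoticon pattern is a small illustrative regular expression, the feature names are made up, and the classifier referred to on the slide may define each feature differently.

```python
import re

EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")  # small illustrative pattern
PUNCT_RUN = re.compile(r"[!?]{2,}")          # runs of '!' and/or '?'


def surface_features(tweet):
    tokens = tweet.split()
    words = [t.strip(".,!?") for t in tokens]  # strip edge punctuation
    return {
        # Words written entirely in capitals (single letters excluded).
        "n_all_caps": sum(1 for w in words if len(w) > 1 and w.isalpha() and w.isupper()),
        "n_hashtags": sum(1 for t in tokens if t.startswith("#") and len(t) > 1),
        "n_punct_runs": len(PUNCT_RUN.findall(tweet)),
        "ends_with_punct": tweet.rstrip().endswith(("!", "?")),
        "has_emoticon": bool(EMOTICON.search(tweet)),
    }


print(surface_features("WOW this vaccine news is GREAT!!! #health :)"))
```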
Feature Vectors: a bird’s eye view
§ Here is a sample of the features used by a state-of-the-art Twitter sentiment classifier:
  § Number of elongated words (one character repeated more than twice: ‘raaaaaad’)
  § Normalization: URLs to http://someurl; user IDs to @someuser
  § Part-of-speech tagged tweets: number of occurrences of each POS tag
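Normalization and the elongated-word count can both be approximated with regular expressions. In this sketch the patterns and function names are assumptions, and the user placeholder `@someuser` is my choice for illustration; any fixed token would serve the same purpose.

```python
import re

URL = re.compile(r"https?://\S+")
USER = re.compile(r"@\w+")
# A word containing one letter repeated more than twice in a row.
ELONGATED = re.compile(r"\b[A-Za-z]*([A-Za-z])\1{2,}[A-Za-z]*\b")


def normalize(tweet):
    """Map every URL and every user mention to a fixed placeholder."""
    tweet = URL.sub("http://someurl", tweet)
    return USER.sub("@someuser", tweet)


def count_elongated(tweet):
    """Count words such as 'raaaaaad' or 'sooo'."""
    return sum(1 for _ in ELONGATED.finditer(tweet))


print(normalize("@alice check http://t.co/abc123 now"))
print(count_elongated("that was raaaaaad and sooo cool"))
```

Counting elongated words after normalization avoids false hits inside URLs (for example the `www` prefix).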
Classifying your sentiment
§ Sentiment analysis is a classification problem
§ Typically, people have used Naïve Bayes or Support Vector Machines (SVM) in the past [Mohammad et al. 2013]
§ Artificial neural nets are also becoming more popular now [Nogueira dos Santos & Gatti, 2014]
Sentiment Accuracy
§ How does one construct a baseline for accuracy?
§ As always, we refer to a ‘better than chance’ baseline
§ In the context of pos/neg/neu, the classes are often not split evenly
§ One can use the most frequent class as the baseline: if pos makes up 70% of the labels, always predicting pos is already 70% accurate
§ For multiple classes, it is common to use the macro F-score as a single measure
§ For the binary case, the go-to measure is the area under the ROC curve (ROC AUC)
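The majority-class baseline and the macro F-score can be computed as in the sketch below (function names and the ten toy labels are assumptions). The macro F-score is taken here as the unweighted mean of the per-class F1 scores, with F1 = 2TP / (2TP + FP + FN).

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


y_true = ["pos"] * 7 + ["neg"] * 2 + ["neu"]
always_pos = ["pos"] * len(y_true)
print(majority_baseline_accuracy(y_true))  # share of the largest class
print(macro_f1(y_true, always_pos))        # penalises the ignored classes
```

On this toy split the always-pos predictor reaches the 70% accuracy baseline, yet its macro F-score is far lower because two of the three classes are never predicted. That gap is why the macro F-score is preferred for imbalanced multi-class data.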
Use case: Public Health and Vaccine Sentiment
The authors wanted to investigate the correlation between sentiment towards vaccines and vaccine uptake. The usual survey methods are expensive, so they took a new approach using Twitter. They took the model further to understand what outbreaks would look like if such sentiments clustered similarly within real-world communities.
Synopsis
§ Analyzed over 100k Twitter users over 6 months to assess how sentiment towards a new (2009) H1N1 vaccine correlated with actual coverage of the vaccine
§ 478k tweets (320k relevant to H1N1): 256k neutral, 27k negative, 36k positive (imbalanced data set)
Salathe & Khandelwal [2011]
Sentiment-Coverage Correlation
Given the correlation, this technique shows promise as a cost-effective probing tool for staging vaccine interventions.
Salathe & Khandelwal [2011]
Methodology of the Classifier
§ Built a web app which was used by 64 ‘volunteers’ (students)
§ Each student was given 1400 tweets (with heavy overlap with other students’ tweet sets)
§ 47k tweets were rated; each tweet was labeled by a majority decision
§ The high confidence* test set numbered 630 tweets: those rated 44 times
§ Built an ensemble classifier: Naïve Bayes (pos/neg) and Maximum Entropy (irrelevant/neu)
§ Accuracy was 84.29%
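Labeling by majority decision, and the accuracy figure used to evaluate the classifier, can be sketched as follows. The function names are assumptions, and so is the tie rule (ties return `None`); the slide does not say how the authors resolved ties.

```python
from collections import Counter


def majority_label(ratings):
    """Most common rating, or None when there is no strict winner (assumed tie rule)."""
    counts = Counter(ratings).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


print(majority_label(["neg", "neg", "pos"]))
print(accuracy(["pos", "neg", "neu", "neu"], ["pos", "neg", "neu", "pos"]))
```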
Social Network – Homophily and Herd Immunity
§ Created a directed graph of 40k nodes and 685k edges
§ Nodes are users with either a pos or a neg sentiment score
§ A directed edge is created if a user follows another user
§ Measured the assortative mixing r of users with a qualitatively similar opinion on vaccination (homophily)
  § 0 < r ≤ 1: nodes are mostly connected to nodes of the same type
  § -1 ≤ r < 0: nodes are mostly connected to nodes of the opposite type
  § r = 0.144: people with the same vaccination opinion are more likely to be connected. Sentiment gives a measure of information flow.
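Assuming r is Newman's assortativity coefficient for a categorical node attribute, it can be computed from the edge list as r = (Σ_i e_ii − Σ_i a_i b_i) / (1 − Σ_i a_i b_i), where e_ij is the fraction of edges running from a type-i node to a type-j node, a_i = Σ_j e_ij and b_i = Σ_j e_ji. The sketch below uses a hypothetical four-user follower graph; the function name and data are illustrative and this is not the authors' code.

```python
def assortativity(edges, node_type):
    """Assortativity coefficient r for a categorical attribute on a directed graph.

    edges: list of (source, target) pairs; node_type: dict node -> category.
    Undefined (division by zero) when all edges connect a single type.
    """
    types = sorted(set(node_type.values()))
    index = {t: i for i, t in enumerate(types)}
    k = len(types)
    # e[i][j]: fraction of edges from a type-i node to a type-j node.
    e = [[0.0] * k for _ in range(k)]
    for src, dst in edges:
        e[index[node_type[src]]][index[node_type[dst]]] += 1 / len(edges)
    a = [sum(e[i]) for i in range(k)]                       # edge tails per type
    b = [sum(e[i][j] for i in range(k)) for j in range(k)]  # edge heads per type
    trace = sum(e[i][i] for i in range(k))
    expected = sum(a[i] * b[i] for i in range(k))
    return (trace - expected) / (1 - expected)


opinions = {"a": "pos", "b": "pos", "c": "neg", "d": "neg"}
same_type = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c")]
opposite_type = [("a", "c"), ("c", "a"), ("b", "d"), ("d", "b")]
print(assortativity(same_type, opinions))
print(assortativity(opposite_type, opinions))
```

The two toy graphs sit at the extremes of the range described above; a value such as 0.144 indicates a mild tendency for like-minded users to be connected.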
References and Reading
§ Working with Text Data, scikit-learn User Guide, http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
§ Sentiment Analysis of Short Informal Texts, Kiritchenko et al., Journal of Artificial Intelligence Research, 2014
§ Stance and Sentiment in Tweets, Mohammad et al., arXiv, 2016
§ Assessing vaccination sentiments with online social media: Implications for infectious disease dynamics and control, Salathe & Khandelwal, PLoS Comp. Bio., 2011
§ The emotional arcs of stories are dominated by six basic shapes, Reagan et al., arXiv, 2016
§ Survey on Aspect-level sentiment analysis, Schouten and Frasincar, IEEE, 2016
§ Twitter mood predicts the stock market, Bollen, Mao, and Zeng, 2010
§ Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts, Cicero Nogueira dos Santos & Maira Gatti, 2014