DEGREE PROJECT, FIRST CYCLE, 15 CREDITS, STOCKHOLM, SWEDEN 2017
Examining the Implications of Adding Sentiment Analysis when Clustering based on Party Affiliation through Twitter
WILHELM ÅKERMARK, OSCAR AHLQVIST
KTH SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Bachelor in Computer Science
Date: June 5, 2017
Supervisor: Mårten Björkman
Examiner: Örjan Ekeberg
Swedish title: Undersökning av innebörden för tillförandet av attitydanalys till klustering baserat på partitillhörighet på Twitter
School of Computer Science and Communication


Abstract

Political discussion is very common, and a popular statistic among the politically interested is the survey of people’s party affiliation. Today such surveys are carried out manually, but it would be beneficial to have an automated method that could produce the same or similar results. On the social network Twitter, many users discuss politics, from political leaders and representatives to regular people with an interest in the subject. Because of this, and because data is convenient to collect, Twitter has been used in previous work to cluster political users based on what they discuss. What has not been done before, and what this thesis contributes, is clustering with sentiment analysis: clustering people based not only on what they discuss but also on whether their opinion is positive or negative. The clustering without sentiment analysis, which had been done in previous studies, was successfully replicated. However, adding sentiment analysis did not have the desired effect and only worsened the results.


Sammanfattning (Summary)

Political discussions are very common, and a popular measurement among the politically interested is the surveys that establish people’s party affiliation. Today they are carried out manually, but it would be advantageous if there were an automated method that could produce the same or similar results. On the social network Twitter there are many users who discuss politics, from political leaders and representatives to ordinary people with an interest in the subject. Because of this, and because of its convenience when it comes to collecting data, Twitter has been used in earlier work to cluster political users based on what they discuss. What has not been done before, and what this thesis contributes, is clustering with sentiment analysis. In other words, trying to cluster people based on what they discuss but also on whether their opinion is positive or negative. What was noted in this work was that the part that clusters people without sentiment analysis, which had been done in earlier studies, was successfully repeated. Adding sentiment analysis, however, did not have the desired effect and only worsened the results.


Contents

1 Introduction

2 Theoretical background
2.1 Twitter
2.2 Sentiment analysis
2.3 Clustering
2.4 Hybrid TF-IDF
2.5 Cosine similarity

3 Method
3.1 Data collection
3.2 Calculating word frequency
3.3 Sentiment analysis
3.4 Constructing the similarity matrix
3.5 Clustering
3.6 Analyzing the clusters

4 Results
4.1 Clustering without sentiment analysis
4.2 Clustering with sentiment analysis

5 Discussion
5.1 Successfully replicating past work
5.2 Clustering without sentiment analysis compared to clustering with sentiment analysis
5.3 Ethics
5.4 Method critique
5.5 Conclusion
5.6 Future research

Bibliography

A Search terms used


Chapter 1

Introduction

Social media networks are a big part of people’s everyday life: according to IIS, 48% of all Swedes used social media daily during 2014. Of these users, 19% used Twitter [1]. This, along with the short message length and public availability of tweets, makes Twitter a well-suited platform for studying the social patterns of users. Twitter is also a huge platform for political discussion, which makes it well suited for studies of users’ political patterns.

Today, information about people’s political affiliation is gathered through surveys, which demand manual work. In this project we investigate whether such work can be automated with clustering and sentiment analysis.

Related work

A previous paper by Vakili (2016) studied the advantages and drawbacks of clustering Twitter users based on social interaction and tweet content [2]. Other previous papers, such as Du & Söderberg (2016) and De Silva & Engelin (2016), studied the possibility of detecting trolls on Twitter using clustering [3, 4].

Another study, by Hallman & Lökk (2016), examined the viability of using sentiment analysis when detecting trolls on Twitter [5]. There do, however, seem to be no studies examining the possibility of combining sentiment analysis with clustering to cluster users based on their political interactions.


Problem statement

Simply clustering might only create clusters that group people who discuss the same subjects, rather than grouping people based on their opinions about those subjects. This paper studies the implications of using sentiment analysis in conjunction with clustering when clustering Twitter users based on a fraction of their political tweets. The results are used to answer the question:

Does adding sentiment analysis increase the success rate when clustering political Twitter accounts?


Chapter 2

Theoretical background

2.1 Twitter

Twitter is a social network where users post messages, known as “tweets”, that are restricted to 140 characters each. Unlike similar social networks, there are very few restrictions on who can view a tweet. It’s a source of news and social interactions for its 313 million users [6], and allegedly 40 million election-related tweets were posted on the day of the 2016 U.S. presidential election [7], making it a huge platform for political discussion.

Most of the content on Twitter is categorised by the users themselves using so-called “hashtags”. A hashtag is nothing more than a “#” sign followed by a keyword. These keywords can be almost anything, even whole sentences written without spaces (e.g. “#CutForBieber”). Users can then use hashtags to find discussions about a specific subject. This is also one of the things that makes Twitter well suited for analysing user patterns, since the tweets are already categorised beforehand.

Another feature of Twitter is the retweet function, which allows a user to post a copy of another user’s entire tweet, with a reference to the original tweet and its poster. This can be compared to the “like” functions of similar social media networks, since it’s mostly used to express agreement.

There’s no function for “friends” on Twitter (as on similar social media networks, e.g. Facebook); instead there’s a “follow” function. Following a user means that their tweets will be shown to you on a timeline on the Twitter home page whenever a new tweet appears. A follow is not mutual, unlike friendships on similar networks, and thus the user you follow will not see your tweets unless that user follows you as well.

2.2 Sentiment analysis

Sentiment analysis aims to determine the attitude of the speaker or writer using natural language processing. Basic sentiment analysis is generally carried out on a document or sentence by weighing in predefined values of how positive, negative or strong each word in the sentence is, and how the words affect each other. Take the example sentence “My mom is f-ing great!”. The words “My”, “mom” and “is” have neither sentiment nor strength, while “f-ing” is a very strong and partly negative word, and “great!” is both strong and positive. Considering only the individual value of each word in the sentence results in a somewhat positive value. That would be a false result, since we intuitively know that the sentence is very positive. Therefore we consider both the words individually and how they affect each other: since “f-ing” modifies the adjective “great!”, the strength of “f-ing” is used to multiply the positivity of “great”, and the sentence will thus get a very strong positive value.
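The intensifier effect described above can be illustrated with a minimal sketch. The lexicon values and the function names below are made up for illustration and are not taken from any real sentiment library; the only point is that letting “f-ing” scale “great” gives a stronger positive score than summing each word’s own value.

```python
# Hypothetical toy lexicon (assumed values, for illustration only).
# POLARITY: how positive (+) or negative (-) a word is on its own.
# INTENSITY: how strongly a modifier scales the word that follows it.
POLARITY = {"great": 0.75, "f-ing": -0.25}
INTENSITY = {"f-ing": 1.5}


def naive_score(words):
    """Sum each word's own polarity, ignoring how words affect each other."""
    return sum(POLARITY.get(word, 0.0) for word in words)


def modified_score(words):
    """Let an intensifier multiply the polarity of the next scored word."""
    total = 0.0
    boost = 1.0
    for word in words:
        if word in INTENSITY:
            # The intensifier contributes its strength, not its own polarity.
            boost *= INTENSITY[word]
        else:
            total += boost * POLARITY.get(word, 0.0)
            boost = 1.0
    return total


sentence = "my mom is f-ing great".split()
print(naive_score(sentence), modified_score(sentence))
```

With these assumed values the naive sum is only mildly positive, while the modified score is clearly higher, which mirrors the reasoning in the paragraph above.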

Advanced sentiment analysis strives to measure more than just the general positivity of the writer or speaker, by using an emotional vector rather than a single value. In theory it’s the same as basic sentiment analysis, only with more than one value for a user’s emotions, adding to and subtracting from the different emotional values. As the name suggests, this is a rather advanced method that is much more complex to implement, and it was thus not used in this study.

2.3 Clustering

In the field of machine learning there are four core steps: data collection, data sorting, data processing and result presentation. One method that is widely used in the data sorting step is clustering, which broadly speaking means grouping sets of data into different groups, called clusters. The conditions on which the data is grouped, and how many clusters are created, are up to the coder and the algorithm used.


In this paper, a specific clustering algorithm known as spectral clustering is used. The input for spectral clustering is an adjacency matrix, which in this context is called the similarity matrix. The similarity matrix represents a graph where each node is a user and the edge weights are the calculated similarities between the nodes. The algorithm uses the eigenvalues and eigenvectors of the similarity matrix to perform a dimensionality reduction of the data, which allows the algorithm to consider fewer random variables [8]. The lower-dimensional data is then grouped, and the resulting clusters are returned.
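One common formulation of these steps is sketched below. It is given as an illustration and is not necessarily the exact variant used by any particular library; the symbols W (similarity matrix), D (degree matrix), L_sym (normalized graph Laplacian) and U are introduced here and do not appear in the original text.

```latex
D_{ii} = \sum_{j} W_{ij}, \qquad
L_{\mathrm{sym}} = I - D^{-1/2} \, W \, D^{-1/2}, \qquad
L_{\mathrm{sym}} \, u_m = \lambda_m u_m, \quad
\lambda_1 \le \lambda_2 \le \dots \le \lambda_n
```

Under this formulation, the eigenvectors belonging to the k smallest eigenvalues are placed as columns in a matrix U = [u_1, ..., u_k]. Each user i is then represented by row i of U, a point in k dimensions instead of n, and these points are grouped with a standard algorithm such as k-means.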

2.4 Hybrid TF-IDF

TF-IDF, term frequency-inverse document frequency, is a numerical statistic that assigns a value to a word based on how frequently it occurs in a document within a collection or corpus [9]. In this work we use a hybrid version of the TF-IDF formula from Sharifi et al. (2010), simply called Hybrid TF-IDF. The reason is that traditional TF-IDF implementations are usually meant for documents with much more text than a short Twitter post, so an adaptation is needed. Hybrid TF-IDF is made for these kinds of shorter documents, such as tweets, and is therefore used in this paper. [10]

The Hybrid TF-IDF formula from Sharifi et al. (2010) contains the two key values that give it its name: term frequency and inverse document frequency. The term frequency is calculated by counting the number of times the word occurs in all Twitter posts being examined and dividing it by the total number of words in those posts. The inverse document frequency is calculated by counting all sentences in all posts and dividing that by the number of sentences in which the word appears. When both have been computed, the term frequency is multiplied by the binary logarithm of the inverse document frequency. The result of that operation is the term frequency-inverse document frequency value.

$$\mathrm{tfidf}(w) = \mathrm{tf}(w) \cdot \log_2\bigl(\mathrm{idf}(w)\bigr)$$ [10]

$$\mathrm{tf}(w) = \frac{\#\,\text{occurrences of the word } w \text{ in all posts}}{\#\,\text{words in all posts}}$$

$$\mathrm{idf}(w) = \frac{\#\,\text{sentences in all posts}}{\#\,\text{sentences in which the word } w \text{ appears}}$$

The idea of using the two values in the calculation is to find words that occur often but also characterize a post. Words like “a”, “the”, “and”, etc. (also known as stop words) will effectively be filtered out because they are common words that are uncharacteristic for a single post: a word that appears in nearly every sentence has an inverse document frequency close to 1, whose logarithm is close to 0. [10]
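The formulas above can be sketched in a few lines of Python. This is a minimal illustration and not the authors’ implementation: the tokenizer, the sentence splitter and the function names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract words, keeping a leading '#' or '@'."""
    return re.findall(r"[#@]?\w+", text.lower())


def split_sentences(post):
    """Assumed sentence splitter: break on '.', '!' and '?'."""
    return [part for part in re.split(r"[.!?]+", post) if part.strip()]


def hybrid_tfidf(posts):
    """Return a dict mapping each word to tf(w) * log2(idf(w))."""
    word_counts = Counter()      # occurrences of each word in all posts
    sentence_counts = Counter()  # number of sentences containing each word
    total_sentences = 0
    for post in posts:
        word_counts.update(tokenize(post))
        for sentence in split_sentences(post):
            total_sentences += 1
            sentence_counts.update(set(tokenize(sentence)))
    total_words = sum(word_counts.values())
    return {
        word: (count / total_words)
        * math.log2(total_sentences / sentence_counts[word])
        for word, count in word_counts.items()
    }


posts = ["the tax is high. the tax must go", "the school is good"]
scores = hybrid_tfidf(posts)
```

In this toy corpus the word “the” appears in every sentence, so its score becomes zero, while “tax” receives a positive weight. This is the stop-word effect described above.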

2.5 Cosine similarity

Cosine similarity measures the similarity of two vectors using the cosine of the angle between them. The value returned is therefore in the range -1 to 1, where 1 indicates that the vectors point in the same direction and -1 indicates that they point in opposite directions.

Given two vectors, A and B, the cosine similarity can be obtained using the following formula:

$$\text{Similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert_2 \, \lVert B \rVert_2}$$ [11]

Cosine similarity can thus be used to determine the similarity between the word usage of two users, assuming that we have one vector for each user containing the vocabulary used in that user’s tweets.
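As a small sketch of the formula, the function below computes the cosine similarity between two sparse word vectors stored as dictionaries. The dictionary representation and the choice to return 0 for an empty vector are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors given as {word: weight}."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


user_a = {"tax": 1.0, "school": 2.0}
user_b = {"tax": 2.0, "school": 4.0}
user_c = {"climate": 3.0}
```

Because word weights are non-negative here, two users with proportional vocabularies (`user_a` and `user_b`) get a similarity of about 1, and users with no words in common (`user_a` and `user_c`) get 0.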


Chapter 3

Method

3.1 Data collection

Data was collected using the Twitter REST API by searching for a number of prespecified search terms that were mostly seen in tweets discussing political matters. Approximately 1 000 tweets, by approximately 700 users, were saved along with the name of the posting user.

One of the search terms used was “#SvPol”, a hashtag used by many Swedes discussing politics. The search terms also included the username of at least one party leader for each of Sweden’s 9 largest parties, which was thought to effectively catch a majority of the tweets targeting a specific party leader. (All the search terms used when collecting the tweets can be found in the appendix.)

3.2 Calculating word frequency

Word vectors were created for every user, containing every word posted by that user within the data set along with an integer denoting how many times the word had been used. Using these vectors, TF-IDF values were added to every corresponding word based on the frequency with which that word occurred in all tweets.
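A per-user word vector of this kind can be sketched with `collections.Counter`. The text does not say exactly how the TF-IDF values were combined with the counts, so multiplying the count by the weight in `weight_vector` is an assumption, as are the function names and the tokenizer.

```python
import re
from collections import Counter, defaultdict


def build_word_vectors(tweets):
    """Map each user to a Counter of the words used in that user's tweets.

    `tweets` is an iterable of (username, text) pairs.
    """
    vectors = defaultdict(Counter)
    for user, text in tweets:
        vectors[user].update(re.findall(r"[#@]?\w+", text.lower()))
    return vectors


def weight_vector(counts, tfidf):
    """Assumed weighting: word count multiplied by the word's TF-IDF value."""
    return {word: n * tfidf.get(word, 0.0) for word, n in counts.items()}


tweets = [
    ("anna", "#svpol Skatten är hög"),
    ("anna", "skatten"),
    ("bo", "Skolan"),
]
vectors = build_word_vectors(tweets)
weighted = weight_vector(vectors["anna"], {"skatten": 0.5})
```

The weighted vectors can then be compared pairwise with cosine similarity.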


3.3 Sentiment analysis

Basic sentiment analysis, using a library called TextBlob for the Python programming language, was used to determine whether a user was overall positive or negative, by looking at all the tweets of a user and summing the sentiment values of the individual tweets.

Weighing in the sentiment of each tweet individually, although it would probably have given more reliable results, was deemed too complex, since it would have required many more variables stored per user as well as a new system for calculating the “score” of individual tweets.
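The per-user summation can be sketched as follows. To keep the example self-contained, the polarity function is passed in as a parameter and a toy stand-in (`toy_polarity`, with a made-up word list) is used; in a setup like the one described, a function returning `TextBlob(text).sentiment.polarity` could be passed instead. The function names are assumptions.

```python
def user_sentiment(tweets, polarity):
    """Sum the polarity of every tweet per user.

    `tweets` is an iterable of (username, text) pairs and `polarity`
    is any function mapping a text to a number (negative = negative).
    """
    totals = {}
    for user, text in tweets:
        totals[user] = totals.get(user, 0.0) + polarity(text)
    return totals


def toy_polarity(text):
    """Stand-in scorer with a made-up word list, for illustration only."""
    words = text.lower().split()
    positive = sum(word in {"good", "great"} for word in words)
    negative = sum(word in {"bad", "awful"} for word in words)
    return float(positive - negative)


tweets = [
    ("anna", "good budget"),
    ("anna", "great debate"),
    ("bo", "bad bad policy"),
    ("bo", "good point"),
]
totals = user_sentiment(tweets, toy_polarity)
```

Here one positive tweet from “bo” does not outweigh the negative one, so the user ends up with a negative total. This is exactly the kind of per-subject detail that a single summed value hides.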

3.4 Constructing the similarity matrix

Two similarity matrices were created. The first used only the cosine similarity between users’ word vectors, weighted with the corresponding Hybrid TF-IDF values. The second also took into account the value assigned to the user by the sentiment analysis, by multiplying the cosine similarity by either 1 for a positive sentiment value or -1 for a negative sentiment value.

Since the resulting value can be any number in the range between -1 and 1, whereas the cosine similarity using Hybrid TF-IDF only ranges from 0 to 1, double the number of clusters was used when clustering with sentiment analysis. These extra clusters were needed since every cluster is split into two smaller clusters: one containing those who were neutral or positive, and another containing those who were at least somewhat negative. All users still had the same cosine similarity to the other users in both clusters; the difference is that their sentiment values were then weighted in, creating two opposite clusters.

The decision to use double the number of clusters came from the idea that clustering without sentiment analysis would create clusters of people discussing the same subjects, while adding sentiment analysis would theoretically split each of these clusters in two: one cluster containing users who are positive about the subject and one containing users who are negative about it.
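The text does not spell out how the sign is applied to a pair of users. One possible reading, shown here purely as an assumption, multiplies each entry by the signs of both users, so that users with the same sentiment keep their similarity and users with opposite sentiment get a negated one. Neutral users are counted as positive, following the description above.

```python
def sentiment_sign(value):
    """Return -1 for a negative sentiment value, otherwise 1 (neutral counts as positive)."""
    return -1 if value < 0 else 1


def signed_similarity(cos_sim, sentiment):
    """Apply sentiment signs to a cosine similarity matrix.

    Assumed interpretation: entry (i, j) is multiplied by the signs of
    both user i and user j. `cos_sim` is a square list of lists and
    `sentiment` holds one summed sentiment value per user.
    """
    signs = [sentiment_sign(value) for value in sentiment]
    size = len(cos_sim)
    return [
        [cos_sim[i][j] * signs[i] * signs[j] for j in range(size)]
        for i in range(size)
    ]


cos_sim = [
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.4],
    [0.6, 0.4, 1.0],
]
sentiment = [0.5, 0.0, -0.2]  # positive, neutral, negative
signed = signed_similarity(cos_sim, sentiment)
```

Under this reading the matrix stays symmetric and the diagonal stays at 1, while entries between the negative user and the other two become negative.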


3.5 Clustering

Spectral clustering was performed using the implementation in scikit-learn, a machine learning library for the Python programming language. As described by Vakili (2016), the challenge when using spectral clustering is to determine the k-value, that is, the number of clusters to create [2]. A good k-value was determined to be around a tenth of the number of users and, as mentioned earlier, double that when sentiment analysis was applied.

For simplicity, and to avoid too many clusters, the k-values were set to 50 when clustering without sentiment analysis and 100 when clustering with sentiment analysis.
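A minimal sketch of such a call is shown below, using scikit-learn’s `SpectralClustering` with a precomputed similarity matrix. The toy matrix, the helper name `cluster_users` and the fixed random seed are assumptions for the example; the text does not state which parameters the authors used beyond the k-values, nor how negative entries in the sentiment-weighted matrix were handled.

```python
import numpy as np
from sklearn.cluster import SpectralClustering


def cluster_users(similarity, k, seed=0):
    """Cluster users from a precomputed similarity matrix into k groups."""
    model = SpectralClustering(
        n_clusters=k,
        affinity="precomputed",  # treat the input as similarities, not features
        random_state=seed,       # assumed: fixed seed for repeatable output
    )
    return model.fit_predict(np.asarray(similarity, dtype=float))


# Toy matrix (made up): users 0-2 resemble each other, as do users 3-5.
high, low = 0.9, 0.05
similarity = [
    [1.0, high, high, low, low, low],
    [high, 1.0, high, low, low, low],
    [high, high, 1.0, low, low, low],
    [low, low, low, 1.0, high, high],
    [low, low, low, high, 1.0, high],
    [low, low, low, high, high, 1.0],
]
labels = cluster_users(similarity, k=2)
```

In the study the same kind of call would use k = 50 for the plain matrix and k = 100 for the sentiment-weighted one.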

3.6 Analyzing the clusters

The clusters were analyzed manually by looking at each cluster and searching for similarities in the political affiliation of its users. Affiliation was determined by looking at the biography that users themselves have provided, along with manually reading their tweets. A cluster was deemed to be of high quality if the majority of its users showed a high degree of similarity in political affiliation, or a high degree of similarity in the subjects they discussed.


Chapter 4

Results

4.1 Clustering without sentiment analysis

Clustering without sentiment analysis led to very few clusters with fewer than 3 people and quite a few very large clusters. The largest one, cluster number 31, contained a total of 78 people. Within this cluster (#31), no dominating common denominator in the users’ party affiliation was found to link the people together. It is therefore concluded that the cluster mostly consists of people who discuss similar subjects without necessarily sharing party affiliation. The same conclusion applies to most of the clusters with a larger than average cluster size.

Cluster number 38, containing 27 users, was deemed to be a cluster of high quality. This cluster contained many users with strong connections to the feminist party (“Feministiskt Initiativ”). The most noteworthy examples of such users include: “Feministerna”, the official Twitter account of the feminist party; “FiSundsvall” and “FiJamtland”, regional Twitter accounts of the feminist party based in the town of Sundsvall and the county of Jämtland respectively; “gudschy” and “vkwesa”, the personal Twitter accounts of the two party leaders of the feminist party; and “RosaGuiden”, a Twitter account that posts updates about advancements in equality (heavily connected to feminism and clearly sharing the views of the feminist party).

Besides these noteworthy users with clear connections to the feminist party, about half of the users within the cluster strongly indicate on their profiles that they share the views of the feminist party. Two Twitter accounts within this cluster without any obvious connection to the feminist party, but with connections to each other, are the accounts of the local newspapers “Mitt i Stockholm” and “Mitt i Skärholmen”. Another noteworthy user within this cluster is the former party secretary of the moderate party, Kent Persson.

Another cluster with clear connections to a specific political party is cluster number 46. This cluster contains 21 users, of which 7 have strong connections to the liberal party (“Liberalerna”). One example is the Twitter account of Helene Odenjung, the deputy party leader of the liberal party. The remaining users showed signs of liberal views but did not clearly state their political affiliation, and only three of the users showed any sign of opposing the liberals’ views in any way. Therefore, the cluster was deemed to be of good quality.

Another fairly small cluster, number 13 with 15 users, shows very clear connections to the left party (“Vänsterpartiet”), with all but one of its users showing clear signs of affiliation with the left party. The remaining user did not, however, show any signs of supporting any other party either. The most noteworthy Twitter accounts within this cluster are the official Twitter account of the left party and the official Twitter account of the left party’s regional branch in Botkyrka. It was also noted that the party leader of the left party was not present in this cluster, but in another cluster (cluster number 47). This cluster was concluded to be of high quality due to the shared party affiliation of all but one of its users.

Cluster number 42, with a size of 10 users, was concluded to be a cluster of moderate party users. This cluster consisted mostly of people who clearly showed that they either voted for or shared the views of the moderate party. The most noteworthy Twitter account was the official Twitter account of the moderate party’s regional branch in Malmö.

Another small cluster, number 16, also shows clear connections to the liberal party. It contains the official Twitter account of the liberal party (named “liberalerna”), along with two users employed by the party. Of the cluster’s 9 users, 2 show signs of liberal views through their tweets without stating any political affiliation on their profile, while one user was concluded to be the product of an automated posting tool, a bot, with no real connections to anything.

The users of cluster number 24, which has 33 users, do not all share any specific opinion. Rather, it’s a cluster where the common denominator is the subjects being discussed. This cluster focused mainly on the subjects of immigration, religion (mostly focused on but not limited to Islam) and the Swedish democrats party. This cluster was deemed to be of high quality since almost all of the 33 users in the cluster regularly discussed a common subject.

Cluster number 30, with a size of 9 users, was concluded to be the main cluster for the Christian democrats, containing the official Twitter accounts of both the party itself and its party leader. Noteworthy was that two of the users within the cluster stated on their profiles that they were voters of other political parties, one of them being the chairwoman of the liberal party’s regional branch in Stockholm. This cluster was deemed to be of high quality since a clear majority of its users had clear connections to the Christian democrats.

Cluster number 10, with a size of 26 users, was concluded to be the main cluster for the green party (“Miljöpartiet”), containing only three people who did not openly state that they support the opinions of the green party. The cluster also contained the official Twitter account of the green party’s youth organization, “Grön Ungdom”. One noteworthy anomaly is that the official Twitter account of the party itself, “miljopartiet”, was not present in this cluster but in another cluster that was deemed to be of lower quality.

4.2 Clustering with sentiment analysis

Adding sentiment analysis gave varying results, but no new interesting or high-quality clusters were noted. The number of clusters with few members rose considerably. Without sentiment analysis, there were only 4 clusters with a single member. With sentiment analysis, there are 45, almost half of all 100 clusters. It is, however, possible to find some resemblance between the two collections, though some changes have occurred.

The green party cluster, number 10 among the clusters without sentiment analysis, is now number 4 among the clusters with sentiment analysis. Only one user dropped out of the cluster, one who shared the views of the majority of the cluster.

The left party cluster 13 can now be found as cluster 57. It previously had 15 members and has now dropped to 13, all of whom are original members from the first cluster. Neither of the two users who are now in different clusters is among the previously mentioned noteworthy users.

The liberal party cluster 16 is now cluster 87 and has gone from 9 members to 7. Three of the previous members, one troll account and two accounts that do not show any party affiliation, have been placed in other clusters. The new cluster 87 has one new member. However, this new member does not discuss politics at all. The member’s only tweet with any ties to the liberal party is a retweet of someone else expressing disapproval of the party’s current logo.

Cluster number 24, which shared the common discussion topics of immigration, religion and the Swedish democrats party, has dissolved. The members of the cluster were scattered into multiple different clusters when sentiment analysis was added.

The previous main cluster for the Christian democrats, cluster 30, is now down to roughly half of its members. The party leader is still a member, but of the 5 members only 3 can be considered Christian democrats.

Cluster 38, which had connections to the feminist party and was deemed a cluster of high quality, no longer exists. Among the new sentiment analysis clusters, a few of the remnants from cluster 38 are now in cluster 17. The only member of that group that was mentioned earlier as noteworthy is “FiSundsvall”. The rest of the noteworthy users are divided among other clusters.

The moderate cluster is now cluster 48. It is almost the same as before, with one change: one of the legitimate users has been replaced by a troll account.


Chapter 5

Discussion

5.1 Successfully replicating past work

When we cluster all the users without sentiment analysis, the results indicate that we get the same kind of output as Vakili (2016) describes in his thesis. This means that we have successfully replicated his work and can consider our work an addition to his, continuing the research from where he ended it.

5.2 Clustering without sentiment analysis compared to clustering with sentiment analysis

It’s quite clear that the clusters with sentiment analysis are generally of lower quality, with fewer high-quality clusters, than the clusters obtained when clustering without sentiment analysis. It’s quite possible that this is a product of using double the number of clusters when clustering with sentiment analysis.

Another possible explanation is that positive users stay in the same cluster, while negative users don’t. This would explain why the cluster for the Swedish democrats got split into so many pieces, since they are viewed as a party of discontent, while the cluster for the moderate party, often viewed as one of the more stable parties, stayed intact.


5.3 Ethics

Labeling humans based on party affiliation is a very sensitive subject and must be carried out with caution. Publishing or assigning such a label to a person might lead to great personal embarrassment or worse. Thus, the names of users who did not explicitly state in their bio, or show through the content of their tweets, that they support a specific political party, as well as the names of under-aged Twitter users, are not published in this report.

5.4 Method critique

Limiting the number of tweets to 1 000 might have been a limiting factor in the study. A higher value might have given more reliable results, but it would also have put more strain on the clustering algorithm, which already took around 30 seconds to cluster the users of 1 000 tweets.

Another limiting factor during the data collection is the search terms. Using only one actual hashtag and letting the rest be references to other Twitter users might have had a negative impact on the results. These terms limit the scope of the data to people either using that one specific hashtag, or talking about or to one of the users included in the search terms.

Only taking into account a user’s general positivity, instead of their sentiment on a subject-by-subject basis, may negatively affect the result of the study, since a user may be negative towards one subject but positive towards another.
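As a small illustration of this limitation, the sketch below compares a single overall sentiment score with scores computed per subject. The tweets, subject labels, and per-tweet sentiment values are made up for the example, and the helper names are hypothetical; it is a sketch of the idea, not the implementation used in this study, which only considered the overall value.

```python
from collections import defaultdict


def sign(value):
    """Map a number to -1, 0 or 1."""
    return (value > 0) - (value < 0)


def overall_sentiment(tweets):
    """Sign of the summed sentiment over all of a user's tweets."""
    return sign(sum(score for _subject, score in tweets))


def sentiment_per_subject(tweets):
    """Sign of the summed sentiment, computed separately for each subject."""
    totals = defaultdict(int)
    for subject, score in tweets:
        totals[subject] += score
    return {subject: sign(total) for subject, total in totals.items()}


# Hypothetical user: (subject, sentiment of one tweet), +1 positive, -1 negative.
tweets = [
    ("taxes", -1),
    ("taxes", -1),
    ("schools", 1),
    ("schools", 1),
    ("schools", 1),
]

print(overall_sentiment(tweets))
print(sentiment_per_subject(tweets))
```

In this made-up case the user counts as positive overall, even though every tweet about one of the two subjects is negative. That is the kind of information a single general value discards.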

Considering only the user’s positivity, and not a wider range of emotions, may also heavily affect the result. However, taking additional emotions into consideration might also negatively affect the result, since more clusters would probably have to be used.

Letting the sentiment value be either one or negative one may also affect the results of the study: a user who shows only small signs of negativity might get grouped with users who show heavy signs of negativity in their tweets. Letting the sentiment value have a varying impact on the end result, on the other hand, introduces another challenge: picking a suitable weight for the sentiment value. If this weight is too high, the result will depend entirely on the sentiment analysis, while a too low weight will make the result depend entirely on the original method, wasting the sentiment analysis.
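The effect of the sentiment weight can be illustrated with a minimal sketch. It assumes, hypothetically, that each user is represented by a small content vector (for example TF-IDF weights) with one sentiment feature appended and scaled by a weight, and that users are compared by cosine similarity. The vectors, function names, and weights are illustrative only and are not taken from the implementation used in this study.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def with_sentiment(content, sentiment, weight):
    """Append a sentiment feature, scaled by `weight`, to a content vector."""
    return content + [weight * sentiment]


# Hypothetical users: A and B write about the same topic, C about another one.
content_a = [1.0, 0.0]
content_b = [1.0, 0.0]
content_c = [0.0, 1.0]

# Hypothetical sentiment values: A is negative, B and C are positive.
sent_a, sent_b, sent_c = -1.0, 1.0, 1.0

for weight in (0.0, 1.0, 10.0):
    a = with_sentiment(content_a, sent_a, weight)
    b = with_sentiment(content_b, sent_b, weight)
    c = with_sentiment(content_c, sent_c, weight)
    # Columns: weight, similarity A-B (same content), similarity B-C (same sentiment).
    print(weight, round(cosine(a, b), 3), round(cosine(b, c), 3))
```

With a weight of zero the sentiment is ignored, so A and B are identical and B and C unrelated. With a weight of ten the picture is reversed: A and B become almost opposite although they discuss the same topic, while B and C become almost identical although they share no content. A usable weight would have to lie somewhere in between.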

5.5 Conclusion

Clustering with basic sentiment analysis, as implemented in this paper, does not give any advantage over clustering without sentiment analysis. On the contrary, adding basic sentiment analysis when clustering political Twitter accounts lowered the success rate.

5.6 Future research

As mentioned earlier in this paper, only taking into account the sum of all sentiment values of a user’s tweets may be suboptimal. Studying the effects of weighing in the sentiment of single posts instead might therefore allow for improvements on the results presented in this paper.

Taking into consideration all the different emotions of the user, not just the general positivity, might also lead to interesting results, whether positive or negative.

Future studies could additionally try to find out where things went wrong in ours.


Bibliography

[1] Pamela Davidson. New Statistics - How Swedes use social media. [cited 5th June 2017]. URL: https://www.iis.se/english/blog/new-statistics-how-swedes-use-social-media/.

[2] Thomas Vakili. A Comparison of Clustering the Swedish Political Twittersphere Based on Social Interactions and on Tweet Content. 2016.

[3] Erik Söderberg and Lili Du. Trolldetektering : En undersökning i lämpligheten att använda ämnesmodellering och klustring för trolldetektion. 2016.

[4] Martin Engelin and Felix De Silva. Troll detection : A comparative study in detecting troll farms on Twitter using cluster analysis. 2016.

[5] Adrian Lökk and Jacob Hallman. Viability of Sentiment Analysis for Troll Detection on Twitter : A Comparative Study Between the Naive Bayes and Maximum Entropy Algorithms. 2016.

[6] Twitter Inc. [cited 5th June 2017]. 2017. URL: https://about.twitter.com/company.

[7] Mike Isaac and Saul Ember. For Election Day Influence, Twitter Ruled Social Media. 2016. URL: https://www.nytimes.com/2016/11/09/technology/for-election-day-chatter-twitter-ruled-social-media.html.

[8] Sam T. Roweis and Lawrence K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. [cited 5th June 2017]. 2000. URL: http://science.sciencemag.org/content/290/5500/2323.

[9] Juan Ramos. Using TF-IDF to Determine Word Relevance in Document Queries. 2003. URL: https://www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/papers/ramos.pdf.


[10] Beaux Sharifi, Mark-Anthony Hutton, and Jugal K. Kalita. Experiments in Microblog Summarization. 2010. URL: http://ieeexplore.ieee.org/abstract/document/5590862/.

[11] Mohammad Alodadi and Vandana P. Janeja. Similarity in Patient Support Forums Using TF-IDF and Cosine Similarity Metrics. 2015. URL: http://ieeexplore.ieee.org/abstract/document/7349760/.


Appendix A

Search terms used

"#svpol", "@jimmieakesson", "@annieloof", "@bjorklundjan", "@BuschEbba", "@IsabellaLovin", "@KinbergBatra", "@SwedishPM", "@jsjostedt", "@gudschy", "@GustavFridolin"
