
The University of Manchester
School of Computer Science

Third Year Project Report 2013/2014

Twitter Analysis Using Text Mining Tools

Supervisor: Prof. Sophia Ananiadou, [email protected]

Author: Abdullah AlBannay.
Degree: Computer Systems Engineering.

[email protected]

Date: April 2014.


Abstract

Social media websites are growing rapidly, both in the amount of written content they hold and in the number of users they serve. Twitter is an important source of information not only for its users but also for others who are interested in extracting useful information from such a huge amount of data. For example, marketers, politicians, policy makers, social analysts, etc. are interested in understanding how different people, i.e. users on Twitter, think about a particular topic. This report presents some background information about text mining (more precisely, opinion mining) and how natural language input can be transformed into useful new knowledge. The different processes and steps required for sentiment analysis are also investigated, and some of them are implemented. Four different methods were used to achieve the task, ranging from simple methods, i.e. word scoring and classification, to more advanced methods, i.e. Recursive Neural Tensor Networks. The highest accuracy achieved by the application built is 80%. Many features were implemented to reach that result, but more remains to be achieved, which shows that text mining is an evolving and hard task.


Acknowledgements

Firstly, I would like to thank my parents and my brothers and sisters for everything they have supported me with, not only in this project but in every step that led to such an achievement. I would also like to thank my supervisor, the director of the National Centre for Text Mining (NaCTeM), Professor Sophia Ananiadou, for her guidance, advice and support during the project and for being patient with me. I would like to thank the deputy director of NaCTeM, Mr. John McNaught, for his valuable advice during the initial and final presentations of the project. In addition, I would like to thank my colleagues Hasan Ali and Mohim Ali and the other students and friends who supported and continuously encouraged me during my years of study at the University of Manchester. Moreover, I would like to thank Saudi Aramco for giving me the opportunity to study in the UK and to live an unforgettable experience that has surely widened my knowledge and changed the way I look at life and many of its aspects. Finally, I would like to congratulate myself for achieving such a great milestone in my life.


Table of Contents

Table of Figures
Chapter 1: Introduction
  1.1 Overview
  1.2 Context
  1.3 Motivation
  1.4 Related Work
  1.5 Goals and Objectives
  1.6 Report Structure
Chapter 2: Background and Literature Survey
  2.1 Overview
  2.2 Text Mining in Depth
    2.2.1 Information Retrieval
    2.2.2 Information Extraction
    2.2.3 NLP
    2.2.4 Data Mining
    2.2.5 TM Approaches
      2.2.5.1 Co-occurrence Based
      2.2.5.2 Machine Learning Based (ML)
      2.2.5.3 Rule Based
      2.2.5.4 Hybrid
  2.3 Text Mining with Twitter
    2.3.1 Difficulties with Twitter
    2.3.2 Twitter API
  2.4 Sentiment Analysis in Detail
    2.4.1 Related Work
Chapter 3: Design
  3.1 Design Overview
  3.2 Twitter and Tweets Handling
  3.3 Tweets Preprocessing
  3.4 Sentiment Evaluation Methods
  3.5 Visualisation
Chapter 4: Implementation
  4.1 Implementation Technologies
  4.2 Tweets Retrieval
  4.3 Tweets Storage
  4.4 Tweets Normalisation
    4.4.1 Identifying Language
    4.4.2 Spellchecking
    4.4.3 Normal Cleaning Methods
    4.4.4 Discarding Twitter Patterns
    4.4.5 Cleaning Tokens
      4.4.5.1 Twitter Abbreviations
      4.4.5.2 Internet Abbreviations
      4.4.5.3 Noisy Tokens
      4.4.5.4 Handling Smileys
  4.5 Tweets Sentiment Evaluation
    4.5.1 Baseline Classification
    4.5.2 Advanced Baseline Classification
    4.5.3 LingPipe Classification
    4.5.4 Stanford CoreNLP Classification
  4.6 Generation of Results
  4.7 Visualisation
Chapter 5: Results
  5.1 Introduction
  5.2 Twitter Search
  5.3 Preprocessing
  5.4 Sentiment Evaluation
  5.5 Results Summary
  5.6 WordClouds
Chapter 6: Testing and Evaluation
  6.1 Overview
  6.2 Methodology
  6.3 Results
    6.3.1 Preprocessing
    6.3.2 Sentiment Evaluation
    6.3.3 Visualisation
Chapter 7: Conclusions
  7.1 Achievements
  7.2 Reflection
  7.3 Further Work
    7.3.1 Twitter
    7.3.2 Preprocessing
    7.3.3 Sentiment Analysis
    7.3.4 Visualisation
    7.3.5 Others
  7.4 Conclusion
References
Appendix A

Final word count: 13112


Table of Figures

Figure(1): Search results for "Whitney Houston" on Sentiment140
Figure(2): TM Cycle
Figure(3): The IR system family
Figure(4): Comparison between IR methodologies
Figure(5): The TwitIE Information Extraction Pipeline
Figure(6): Knowledge Relevant to Natural Language Understanding
Figure(7): Steps in the Evolution of Data Mining
Figure(8): Twitter RESTful API Flow of Request
Figure(9): Twitter Streaming API Flow of Request
Figure(10): Accuracy for fine-grained (5-class) and binary predictions at sentence level and for all nodes
Figure(11): Example of Recursive Neural Tensor Network
Figure(12): Project Activity Diagram
Figure(13): Schema Design for Tweets Database (TDB)
Figure(14): A WordCloud of most regularly used words in negative reviews of an Android phone
Figure(15): Class Diagram of MyTweet used for presenting Tweets along with their attributes
Figure(16): Class Diagram of MyUser used for presenting Users (those who posted a Tweet) along with their attributes
Figure(17): Class Diagram for the Database interface
Figure(18): CleanTweet Class Diagram
Figure(19): Some of the Twitter Patterns that are discarded from Tweets
Figure(20): Technical Twitter Abbreviations
Figure(21): Smileys (Emoticons) Categories with their examples and regexes
Figure(22): Class Diagram for the ClassificationScoring Package
Figure(23): Class Diagram for the WordScoring Package
Figure(24): Class Diagram for the PolarityBasic Class
Figure(25): Left (Stanford Dependencies) and Right (Basic Dependencies) Trees
Figure(26): Classes for Using & Training the Stanford CoreNLP pipeline
Figure(27): WordCloud (negative) for FSA Sentiment Analysis Results
Figure(28): Some of the search results about "Ukraine"
Figure(29): Statistics about the number of Tweets after search and after filtering
Figure(30): Results of preprocessing
Figure(31): A Tweet along with its cleaned version
Figure(32): Output for building the NB classifier and the Stanford pipeline
Figure(33): A Tweet with its cleaned version and sentiments
Figure(34): A Tweet with its sentiments plus its parse tree and dependency tree details
Figure(35): Summary results of sentiment analysis
Figure(36): WordCloud for negative Tweets for the "Ukraine" search
Figure(37): WordCloud for positive Tweets for the "Ukraine" search
Figure(38): WordCloud for neutral Tweets for the "Ukraine" search
Figure(39): Class Diagram for the Testing Classes
Figure(40): An example of a Preprocessing Test Case
Figure(41): Results of the First Test of the Sentiment Evaluation Methods
Figure(42): Results of the Second Test of the Sentiment Evaluation Methods
Figure(43): Test Results Comparison
Figure(44): Words Blocked from WordCloud Tagging


Chapter 1: Introduction.

1.1 Overview.

Social media has contributed greatly to the growth of data that is written by individuals and shared across the web. It presents different opinions, beliefs, attractions, preferences and more. This user-generated data does not occupy storage space without a purpose: it is very important to many people and organisations, including firms, manufacturers, politicians, policy makers, marketers, intelligence agencies and even armed forces, once the facts useful to them have been extracted from it. The importance of user data comes not only from its quantity but also from the fact that users include their opinions and/or experiences, which can be used by the different parties mentioned previously in order to gain profit, victory, strategic vision and more. Twitter has over 554 million users and more than 58 million posts (Tweets) are posted every day[1]. This makes Twitter an important source for gathering information from a vast number of users' inputs. The vast variety of users on Twitter provides the opportunity for anyone interested in searching for knowledge to identify different opinions about a particular topic, either by reading about it on Twitter or by collecting the relevant Tweets and analysing them using computational linguistics methodologies, which are advancing and being implemented in many areas within research communities and commercial applications.

1.2 Context.

This project is concerned with analysing Tweets collected from Twitter using some text mining (TM) tools and techniques. Discovering and extracting information from unstructured data is the core objective of TM, which involves the following steps[2]:

• Information Retrieval (IR). The process of gathering related text about a particular subject in order to be analysed afterward.

• Natural Language Processing (NLP). The process of handling human natural language so that it can be understood and analysed by a computer. This is one of the hardest artificial intelligence problems.

• Information Extraction (IE). The process of identifying entities, events, facts and the connections between them. They are then extracted and can be processed later by the different techniques of the next step.

• Data Mining (DM). The process of finding various relationships between pieces of information extracted from different data sources such as databases, collections of documents, blogs, webpages and/or various combinations of them.

Some of the previous steps, and the different technologies, processes and methods used in each of them, are explained throughout this report. It is important to know that the main objective of text mining applications is making implicit information within a document (or a group of documents) more explicit, which contributes greatly to saving organisations and researchers time and effort, and hence the money spent on manually reading and analysing a huge number of documents about a particular subject or a group of associated subjects. For example, NaCTeM, the National Centre for Text Mining, is the first publicly funded TM centre globally. Its main objective is to provide text mining services and support for the UK academic community in related research[3]. NaCTeM provides different tools and services for its funders along with a group of clients. Its major work has been focused on TM of biological knowledge bases. Among NaCTeM


tools, there are many that can be used freely by students, staff and researchers in the academic domain, including TerMine, AcroMine, KLEIO, MEDIE, FACTA and others[4].

Opinion mining (also called sentiment analysis) is the process of finding the opinion(s) within a piece of text. An opinion is what a person feels or thinks about a particular thing, such as an entity or an event. Opinion mining is a hot topic within the TM research community, and its importance arises from the importance of people's opinions and how these opinions might lead to changes in policy making, strategy drawing, marketing and many other fields. Using sentiment analysis techniques, the polarity of a piece of text can be obtained, i.e. whether it is considered positive, negative or neutral in relation to the investigated topic[5].

This project plans to use the available TM technologies in order to analyse Tweets and to come up with summaries about the users' views and perceptions.

1.3 Motivation.

The growth of user-generated content on social media websites such as Twitter, Facebook, etc. gives such content its importance in many fields and has led many researchers to work on extracting useful information that can be used to explain some phenomena or to draw up plans for individuals, businesses and even governments.

The motivation behind this project is to widen knowledge about, and explore, one of the most important areas of computer science, namely natural language processing. Learning about it opens up more opportunities in the data mining market, especially in developing countries where opportunities are vast and easy to capture. In addition, using Twitter as the source of the data to be analysed provides an opportunity to explore how available APIs work with current social media websites and how their public data can be accessed, stored and analysed. Such an objective might lead to exploring the available chances to fill the gap in the usage of such a great amount of data and the possible implications for users, organisations and the social media websites themselves.

Analysing micro-blogs brings many benefits for societies and businesses. For example, Emergency Situation Awareness (ESA) uses micro-blogs in order to save lives and to provide crucial help to those affected by natural crises or living under armed conflict. Yin et al. built an ESA system that mines micro-blogs, i.e. Tweets, in real time in order to extract and visualise useful information that helps authorities and those in need to respond to an incident[6].

Micro-blogs such as Tweets that mention a particular product or service can be a source for commercial usage, too. For instance, querying Twitter on a product's name will retrieve the recent Tweets that include that particular product name, and analysing them might reveal advantages and disadvantages of the product[7]. Furthermore, there have been some experiments on mining suicide notes to analyse the kinds of feelings their authors experienced before committing suicide[8].


1.4 Related Work.

There are a lot of tools for analysing Tweets and other social media blogs and micro-blogs online. Some of those tools are used for sentiment analysis on Twitter such as the following[9]:

• Sentiment140.
• Social Mention.
• Twitteratr.
• Twendz.
• UberVu.
• TweetFeel.

Hence, one might question the reason for and benefit of this project. However, these tools available online are not designed for the main objective of this project, which is reflecting the opinions of Twitter users on a specific controversial subject, and even for more generic searches and analyses, some of these tools do not reflect the real opinions of people. For example, when using Sentiment140 to search for "Whitney Houston", some results are considered negative Tweets only because they reflect sad feelings about her death. Some of these Tweets are shown in the figure below:

Figure(1): Search results for “Whitney Houston” on Sentiment1401.

Therefore, there is a need to develop tools that can deal with such problems and give a more trusted and representative sentiment analysis of Tweets. This can be achieved by using more advanced TM techniques that give more reliable polarity scores for the analysed text. However, there are still further problems under research concerning the identification of certain phenomena within the investigated text, such as sarcasm, irony, made-up words, swear words and implicit feelings[10].

Maynard and Funk worked on analysing Tweets to find which candidate was supported during the British election[9]. That was done by creating triples of the form <Person, Opinion, Political Party>. They had training data which had been analysed by hand, they trained an algorithm on that training data, and the achieved accuracy was 80%. Their work would not have been as easy as it was without the many supporting tools provided by the General Architecture for Text Engineering (GATE) from the University of Sheffield[11], which offers many tools that can be used for different purposes, including taggers, annotators, parsers and more.

1 Obtained from Twitter Sentiment website: http://www.sentiment140.com


1.5 Goals and Objectives.

The main objective of this project is to gather Tweets about an event, such as a political event, and then to identify the opinions about that particular event, in which the polarity of the Tweets (i.e. whether they are positive, negative, neutral or perhaps irrelevant) is identified depending on the user's view expressed in those Tweets. However, Tweets are restricted to a length of 140 characters, which might make the analysis harder and motivates users to produce shortened forms of words; for that reason the content of Twitter is considered noisy.

Normalising and cleaning the Tweets retrieved from Twitter before performing the analysis on them is important, since without preprocessing such analysis would be more difficult. Therefore, preprocessing of each Tweet is needed. Also, there should be some sort of storage medium to store the Tweets about the investigated topic or event; hence, a reliable database is needed for storing Tweets along with their authors, i.e. users on Twitter.

Results produced after analysing the Tweets should be visualised for the user, and that can be done using diagrams to display how the polarities change over a period of time in percentages and values. In addition, maps can be used to visualise the areas where each class of Tweets is concentrated. Additionally or alternatively, WordClouds can be used to identify the significant words that occur in each class of Tweets.
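As an illustration of the last point, the sketch below shows one simple way the word frequencies behind a WordCloud could be gathered from a class of Tweets. The class name, method names and example Tweets are only illustrative assumptions; a real WordCloud library would handle the weighting and drawing.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCloudCounts {
    // Count how often each word appears in a list of (already cleaned) Tweets;
    // the most frequent words would be drawn largest in the WordCloud.
    public static Map<String, Integer> countWords(List<String> tweets) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String tweet : tweets) {
            for (String word : tweet.toLowerCase().split("\\s+")) {
                if (word.isEmpty()) continue;
                Integer current = counts.get(word);
                counts.put(word, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> negativeTweets = Arrays.asList(
                "battery life is terrible", "terrible screen and terrible battery");
        System.out.println(countWords(negativeTweets)); // e.g. {terrible=3, battery=2, ...}
    }
}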

1.6 Report Structure.

This report consists of three main parts: background knowledge; design and implementation; and testing, evaluation and conclusions. The first two chapters present the required background knowledge about the project, text mining and its applications to Twitter. The subsequent two chapters describe how the system is planned to be structured and built, and the final part investigates the correctness of the work implemented and evaluates it.

The second chapter of this report provides background information about methods and techniques used in sentiment analysis and text mining. It also covers recent work on the topic and expands on what was said in the introduction about Twitter and its usefulness in text mining applications.

The third chapter presents the design of the application to be built and gives an overview of the possible technologies that can be used to build it, along with their pros and cons. Subsequently, the fourth chapter uses the information provided by the design in order to implement the application, and describes the problems faced during the implementation and the reasons behind the choices that were made.

The fifth chapter provides some information about the results produced after implementing what was designed. After that, the sixth chapter describes how the application was tested and evaluated, presenting the results and the decisions that were made to improve them.

Finally, the seventh chapter concludes with the achievements and lessons learned from this project. It also gives an overview of the project as a major part of the studying experience at the university, and states whether the objectives set out in the introduction have been met or not.


Chapter 2: Background and Literature Survey.

2.1 Overview.

In this chapter, background information and knowledge of the required TM techniques will be outlined and some of the recent work in the TM field will be reviewed. Many approaches to achieving the objectives of this project are presented only briefly, due to report length restrictions, and readers are advised to consult the provided sources to widen their knowledge of the topics presented here.

2.2 Text Mining in Depth.

Text mining steps were mentioned in section 1.2 of this report; here they are explained in more depth, together with the methods used to complete each of them. The figure below illustrates the life cycle of TM.

Figure(2): TM Cycle[3].

2.2.1 Information Retrieval.

IR systems can be seen everywhere, e.g. search engines, library catalogues, cookbook indices, and so on. The main objective of any IR system is to retrieve whatever is useful and to ignore whatever is not. The IR family can be summarised through figure (3)[12]. Traditional IR systems retrieve information from unstructured (raw) texts such as documents, comments, etc., while advanced IR systems extract it from structured (marked-up/tagged) texts such as XML1 documents. However, IR systems are contrasted with Relational Databases (RDB), which query structured data. RDB systems are useful when querying over highly structured data, since they offer reliability and speed.

There are many data sources that contain textual data. They can be presented in the form of structured documents such as XML documents, and searching over such documents is called structured information retrieval. The differences between the different types of IR systems and RDB systems are summarised in figure (4)[13].

1 XML: Extensible Markup Language, a markup language that defines a set of rules for encoding documents in a format which is both human and machine readable.


Figure (3): The IR system family.

                    | RDB search       | Unstructured retrieval  | Structured retrieval
Objects             | Records          | Unstructured documents  | Trees with text at leaves
Model               | Relational model | Vector space and others | ?
Main data structure | Table            | Inverted index          | ?
Queries             | SQL              | Free text queries       | ?

Figure (4): Comparison between IR methodologies and traditional relational database search.

2.2.2 Information Extraction.

Information extraction is the extraction of structured data from unstructured or semi-structured machine-readable documents, a task which traditionally relied heavily on human involvement. Structured data that can be extracted includes[14]:

• Named Entities. For example, persons, locations, organisations, products, etc. They can be extracted using Named Entity Recognition (NER) techniques.

• General Entities. They are hard to disambiguate and in order to be identified, contextual information is required e.g. knowledge bases or encyclopedias.

• Characteristics and Attributes of Entities. They can be gathered using entity-centred web search (object level vertical search). An example of this is Microsoft's EntityCube project which summarises information about the entities regardless of their presence level on the web.

• Classes of Entities. A particular entity could be classified into multiple classes depending on some attributes or characteristics. For example, a car might be classified by its maker, country of production, production year, colour, and so on.

• General Relationships Between Entities. Some relationships between entities can be identified by the classes of an entity such as the relationships “is-a” and “part of”. However, some other relationships might be more complex. For example, it is more difficult to identify the relationship “born-in” between a person and a city.

Hobbs et al. described the operation of the Finite State Automaton Text Understanding System (FASTUS), which is used for IE from unstructured text in English. It comprises the following steps[15]:


(1) Triggering: during the first pass over a sentence, searching for the trigger words that each pattern minimally requires.

(2) Recognising Phrases: distinguishing noun groups, verb groups and variable word classes including all the other parts of speech.

(3) Recognising Patterns: taking the input from the second step as a list of phrases. It identifies some predefined patterns in order to build incident structures.

(4) Merging Incidents: incidents found in a sentence (incident structures) are merged together, and any un-merged incidents can be merged with incidents from previous sentences.

TwitIE is an open-source IE pipeline constructed specifically for micro-blogs, and it includes metadata handling and data import specific to Twitter. It uses GATE tools, and the pipeline is summarised in Figure(5)[16].

Figure(5): The TwitIE Information Extraction Pipeline.

Hybrid IE systems, which use both manual and automatic techniques in order to extract structured information, achieve better results[14]. They might use human input to improve what is produced by the IE algorithm, or they might need direct human involvement to help the system perform its task.

2.2.3 NLP.

Natural Language Processing refers to the practices performed on sentences written in a natural language such as the English language. It includes many subtopics such as[17]:

• Signal processing: dealing with spoken language (out of scope).
• Syntactic analysis: dealing with how sentences are written and their grammar.
• Semantic analysis: dealing with the meanings of words within the sentence.
• Pragmatics: dealing with the correlation between the meaning of a sentence and its everyday usage.


NLP requires knowledge of many topics in order to achieve its objectives; these topics are summarised in Figure(6).

Phonetic and phonological knowledge: how words relate to their sounds.
Morphological knowledge: how words are built from more primitive morphemes, e.g. how "sunny" comes from "sun".
Syntactic knowledge: how a sequence of words makes a correct sentence (knowledge of the grammar rules).
Semantic knowledge: how words have meaning, and how words have denotations (references) and connotations (associated concepts).
Pragmatic knowledge: how sentences are used in different situations and how the different usage can affect the meaning of the sentence; involves intentions and context.
Discourse knowledge: how previous sentences can affect the meaning of a sentence, e.g. when resolving a pronoun reference.
World knowledge: general knowledge, such as knowledge about the others involved in the conversation.

Figure(6): Knowledge Relevant to Natural Language Understanding.

Syntactic analysis would be easy if each word had only one part of speech associated with it and one meaning regardless of context. However, in English that is not the case, since many words have different parts of speech associated with them and can be used differently in different contexts. For example, "bank" (n) may refer to a place to keep a boat or a place to keep money.

A parser is an interpretation tool that can be used to map natural language sentences to their syntactic structure or representation (the result of syntactic analysis) and their logical form (the result of semantic analysis)[17]. A parser can be of three different types:

• Noise-disposal parsers.
• State-machine parsers (FSM parsers).
• Definite clause grammar parsers (corresponding to phrase structure grammar parsers).

The Stanford CoreNLP1 pipeline is an open-source, easy-to-use pipeline of NLP analysis tools. It involves many tools that can be used for processing documents in English and other languages. It includes and integrates different tools that can be used separately or together, such as a part-of-speech (POS) tagger, named entity recogniser (NER), parser, coreference resolution system and a sentiment analysis tool. It uses annotators and produces annotations of various types, which can be used to view the output from each tool or as the input for another processing tool along the pipeline. This will be discussed further in the subsequent chapters[19].

1 Stanford CoreNLP is an NLP package of tools that can be found on: http://nlp.stanford.edu/software/corenlp.shtml
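As a brief illustration, the sketch below builds a small CoreNLP pipeline and reads back the POS tags and parse tree it produces. The example sentence is arbitrary, and the exact set of annotators (for instance adding "sentiment") would depend on the task and the CoreNLP version in use.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class CoreNLPExample {
    public static void main(String[] args) {
        // Choose which annotators run, in order, along the pipeline.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Wrap the raw text in an Annotation and run all annotators over it.
        Annotation document = new Annotation("I really like this phone, but the battery is poor.");
        pipeline.annotate(document);

        // Each annotator's output is read back through typed annotation keys.
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.println(token.word() + " / "
                        + token.get(CoreAnnotations.PartOfSpeechAnnotation.class));
            }
            Tree parseTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            parseTree.pennPrint();  // constituency parse used by later analysis steps
        }
    }
}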


2.2.4 Data Mining.

DM is the process of discovering new, previously unknown knowledge from already available data. Nowadays it is used broadly and in many applications. It depends on the way the information is extracted and represented in order to find the relationships between pieces of information and to derive conclusions, facts, etc.[18]. The scope of DM is broad and it can be used in many fields in different ways, but once it is applied over some data it is expected to perform the following two tasks to prove successful[20]:

• Firstly, automatically predicting trends and behaviours.
• Secondly, automatically discovering previously unknown patterns.

There are many techniques used in DM; here are some of the most common ones[20]:

• Artificial neural networks: structured like biological neural networks, they learn in a non-linear predictive way.

• Decision trees: structured as trees that represent different sets of decisions. These decisions produce the rules used for classification, for instance Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).

• Genetic algorithms: use optimisation processes such as genetic combination, natural selection and mutation; they are designed based on evolution concepts.

• Nearest neighbour method: classifies each element in the dataset based on a combination of the classes of the k element(s) nearest to it. It is also called the k-nearest neighbour technique.

• Rule induction: uses statistics to extract useful if-then relationships.

DM uses modelling to reveal new, previously unknown, important things. Modelling basically means building a model in a situation where the answer is known and then applying it to situations where the answer is unknown. For example, a model can be built from pre-classified Tweets and used to classify newly encountered Tweets. Likewise, it can be used to estimate the opinion of a customer or to predict a future event, based on a previously known opinion or a previously known event respectively[20].

The evolution of DM can be seen in Figure(7), which shows the different methodologies used and the advancement of DM systems[20].

Data Collection (1960s)
  Business question: "What was my total revenue in the last five years?"
  Enabling technologies: computers, tapes, disks
  Product providers: IBM, CDC
  Characteristics: retrospective, static data delivery

Data Access (1980s)
  Business question: "What were unit sales in New England last March?"
  Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
  Business question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
  Characteristics: retrospective, dynamic data delivery at multiple levels

Data Mining (emerging today)
  Business question: "What's likely to happen to Boston unit sales next month?"
  Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
  Product providers: Pilot, Lockheed, IBM, SGI, numerous startups
  Characteristics: prospective, proactive information delivery

Figure(7): Steps in the Evolution of Data Mining.


2.2.5 TM approaches.

There are three main approaches to TM, and each approach has its own advantages and disadvantages. They are used in different ways depending on the task required of the application that uses them[21].

2.2.5.1. Co-occurrence based.

A co-occurrence based system looks for co-occurring concepts within a piece of text, often a sentence, although the unit can sometimes be as large as an abstract and a relationship may be posited between sentences. Systems based on this approach have been used in the biomedical field, for example to find correlations between diseases and particular proteins, enzymes, etc. Such systems are not commonly in use today, since they are highly error-prone.

2.2.5.2. Machine learning based (ML).

The machine learning (also called statistical) approach works by building classifiers to be used at different TM operational levels, such as POS tagging, parsing and classifying sentences and documents. Unfortunately, ML needs a sufficient amount of training data, which must be classified manually to ensure correctness. Gathering such data and processing it manually is a difficult task. Therefore, using the ML approach without initially having the required training data is going to consume time and effort.

2.2.5.3. Rule based.

The rule based (also called knowledge-based) approach uses some knowledge, which might be in the form of general language structure or something more specific such as biologically relevant facts. It can also use rules that are domain-specific and serve a special purpose known only to domain experts, such as biomedical rules. Identifying a variety of potential methods to determine the most likely predicted class might be done using complex linguistic and semantic analysis. However, developing rule based systems is assumed to take longer than developing ML systems. In addition, the rules produced are often domain specific and hard to make portable.

2.2.5.4. Hybrid.

A hybrid system combining the ML based and rule based approaches can be built, too. For example, many systems use initial ML processing followed by knowledge-based processing. However, there is a fundamental problem associated with both approaches, namely ambiguity, which is the presence of multiple relationships between the linguistic part and the semantic or contextual parts. In other words, a sentence can have multiple meanings or belong to multiple classes. Resolving ambiguity might be either easy or difficult depending on the context in which it occurs[22]. ML-based and rule-based (lexicon-based in this context) approaches are used for the sentiment evaluation task specifically. A hybrid approach can be used too, with sentiment lexicons playing an important role in most tasks[10].


2.3 Text Mining with Twitter.

Using TM approaches with Twitter introduces many challenges and problems and requires the use of already available libraries. Using what is already available helps in building on cutting-edge technologies and adding extra features as they are needed.

2.3.1 Difficulties with Twitter.

There are many difficulties and challenges associated with using Tweets as the source of data for TM tasks. Since Tweets have a restricted number of characters (140), they do not contain as much contextual information as longer pieces of text, and it is assumed that the reader has enough implicit knowledge. Therefore, ambiguity is an issue, since there is little contextual information to draw on: Tweets usually appear in isolation rather than in a conversational thread. In addition, Tweets tend to be less grammatical than longer posts, show more language variation, contain unnecessary capitalisation and use emoticons, hashtags1 and abbreviations, which might be necessary to capture the meaning of the Tweet. Furthermore, extensive use of sarcasm and irony (which might be implicit) can be seen within Tweets, and that is hard for a machine to capture. On the other hand, the restricted length helps to force users to write more on-topic and domain-specific Tweets[22]. Also, there might be some Tweets which are about a different topic but contain the same keyword(s) used to query Twitter, i.e. irrelevant Tweets. For instance, "apple" might refer to a fruit or a brand.

Tweets might have other problems, not because they are restricted in length but because of the way they are written, since they can be written in languages other than the targeted language, especially if the investigated topic is active in a non-English-speaking country. Also, some users are spammers who just write the same Tweet repeatedly, resulting in duplicates, or write Tweets containing the keywords but with content unrelated to the topic. Furthermore, Tweets may contain spelling mistakes, be incomplete and use slang. All these problems should be considered and solutions found when the application is designed and implemented.

2.3.2 Twitter API.

Twitter provides two different APIs for querying its publicly accessible content. First, the RESTful2 API, which accepts a limited number of requests in a defined period of time. Second, the Streaming API, which maintains a connection with Twitter servers and retrieves Tweets in real time without a duration limitation. The difference between the two API methodologies can be seen in Figure(8) and Figure(9)[23]. Both APIs can be used for the purposes of this project. However, using the RESTful API is more generic and more useful for searching for previously written Tweets, especially for topics which are not currently active, or which are active but would take a long time to yield enough Tweets for analysis.

1 A hashtag is used on Twitter to mark a particular topic within the Tweet, to let readers know its significance and to help deliver the message to them. For example, in "I like to read #example books", example is the hashtag used.

2 REST: Representational state transfer.


Figure(8): Twitter RESTful API Flow of Request.

Figure(9): Twitter Streaming API Flow of Request.

The Twitter APIs (both Streaming and RESTful) can be accessed using libraries such as Twitter4J[24], which provides many easy-to-use functionalities with proper documentation. This library might be used in this project since the project is developed in Java, and it is highly recommended for its reliability and compatibility. After a request is made to the Twitter API, the response is a JSON document which is processed by Twitter4J to create a Java object named Status, which is used to store a Tweet along with all its parameters.
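For illustration, the sketch below shows roughly how a search against the RESTful API might look through Twitter4J. The OAuth credentials are assumed to be supplied in a twitter4j.properties file, and the query term is only an example; details may differ between Twitter4J versions.

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;

public class TweetSearchExample {
    public static void main(String[] args) throws TwitterException {
        // TwitterFactory reads the OAuth credentials from twitter4j.properties.
        Twitter twitter = new TwitterFactory().getInstance();

        // Build a search query; the keyword here is only an example.
        Query query = new Query("Ukraine");
        query.setLang("en");   // restrict to English Tweets where possible
        query.setCount(100);   // maximum number of Tweets per page

        QueryResult result = twitter.search(query);
        for (Status status : result.getTweets()) {
            // Each Status carries the Tweet text plus its attributes (user, date, etc.).
            System.out.println(status.getUser().getScreenName() + ": " + status.getText());
        }
    }
}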

2.4 Sentiment Analysis in Detail.

Sentiment analysis (SA) is another name for Opinion Mining (OM), which was mentioned earlier in the introduction. There has been a lot of recent work on sentiment analysis, since it is useful in obtaining a better understanding of markets, societies, voters, customers, employees, etc.


2.4.1 Related Work.

Maynard et al.[22] used GATE tools to build a system that follows three main steps to analyse Tweets: entity extraction, event recognition and sentiment analysis. Entity extraction is used to build the corpus to be annotated and used by the subsequent steps, i.e. event recognition. ANNIE, the main named entity recogniser of GATE, is the NER used. A rule-based approach was used in the sentiment analysis step in order to assign polarities to different Tweets. They achieved that using different subcomponents, mainly consisting of JAPE1 grammars (a part of the GATE system), which are used for generating annotations on fragments of text. JAPE grammars use rules written in a special form for GATE, together with lookup lists called gazetteers; these are used in combination with other linguistic features such as POS tags, and contextual data is used to create sets of features and annotations that can be edited by rules which can be altered in the future to improve performance. For instance, a gazetteer was built using WordNet2, containing words with their parts of speech and additional information about the original WordNet synset to which they belong. Every sentiment-bearing word that is matched is analysed along with its neighbours in the same sentence or phrase, taking into account negation and sarcasm, since they tend to reverse the polarity. Other gazetteers are made for emoticons and swear words, since these express strong sentiments. OM has also been applied to product reviews, using WordNet as a base for finding polarities and evaluating adjectives that are related to a product's name within the reviews being analysed. This method achieved an average accuracy of 84%. It is different from analysing Tweets, since Tweets can be written about any subject while reviews are written to review a certain product; also, reviews are longer than Tweets and tend to be less noisy. This might be useful once sentiment analysis is performed on products that are mentioned on Twitter[25].

Stanford CoreNLP uses recursive deep models for semantic compositionality over a sentiment treebank in order to analyse movie reviews, and it can be used for sentiment analysis of Tweets, too. This model draws on and combines five different topics in NLP:

• Semantic Vector Spaces. Co-occurrence statistics and the context of a word are used to describe this idea, e.g. tf-idf3. Also, more complex frequencies might be used, such as the frequency of a word appearing in a particular syntactic context.

• Compositionality in Vector Spaces. Most compositionality algorithms capture only two word compositions to compute the similarities using vector addition. Some others might use matrix representations to perform matrix product on them.

• Logical Form. This tackles the compositionality problem from another perspective by using logical forms to represent sentences.

• Deep Learning. This can be achieved by using a neural tensor network (NTN).

• Sentiment Analysis. This regards the importance of polarity shifting rules and the syntactic structures needed to capture and evaluate the input text.

Semantic vector spaces are used for single words, but they cannot properly capture the meanings of longer phrases. Also, their models of composition and evaluation resources are limited and need to be advanced. Therefore, the Stanford CoreNLP sentiment analysis method introduced a sentiment treebank, constructed from more than 215K phrases, along with a powerful Recursive Neural Tensor Network (RNTN) which is able to predict the semantic effects

1 JAPE: Java-based pattern matching language used in GATE. 2 WordNet: refer to Appendix A for more details. 3 tf-idf: term frequency-inverse document frequency.


represented by the treebank accurately. RNTNs are used for compositionality purposes since they are much more powerful than the traditional methods mentioned above. The phrases contained in the treebank are highly accurate since they were evaluated by three human annotators. A phrase is represented through word vectors and a parse tree, and the network is recursive because the vectors of higher nodes are evaluated using the same tensor-based composition function. Compared with traditional recursive neural networks (RNN), Naive Bayes (NB), bi-gram NB and support vector machines (SVM), it achieves a higher accuracy of 85% (see Figure(10)). The RNTN can predict five different sentiment classes (- -, -, 0, +, + +), and it takes into account negations and contrastive conjunctions such as "but". Predicted classes can be represented numerically from 0 (most negative) to 4 (most positive), with 2 representing neutrality. The different predicted classes and their representation in a tree can be seen in Figure(11)[26].

Figure (10): Accuracy for fine-grained (5-class) and binary predictions at sentence level and for all nodes.

Figure(11): Example of the Recursive Neural Tensor Network.


Chapter 3: Design.

3.1 Design Overview.

This project's goal is to build a system that is able to retrieve Tweets from Twitter, filter them, normalise them, give them polarity scores and visualise the produced results for the user. The activity diagram in Figure(12) illustrates the flow of the application to be built, and each main step will be explained throughout the subsections of this chapter. All design decisions were taken considering that the system will be written in Java.

Figure(12): Project Activity Diagram.


3.2 Twitter and Tweets Handling.

Tweets are not only associated with their textual content, i.e. the text of the Tweet; they also have many attributes, which are summarised in Appendix A. However, some of those attributes are not supported by the Twitter4J[24] library and some are not important for the scope of this project. The following are believed to be among the most important attributes of a Tweet needed by this project:

• Twitter user's ID which is used for calculating the number of Tweets about the investigated subject from each user highlighting the influential users with the highest number of Tweets.

• Time/Date, which is used when Tweets are analysed by posting time or plotted with sentiment values over time. It can also be used for filtering.

• Tweet's ID which is used for identifying duplicate Tweets when the system is used to retrieve and store the Tweets automatically i.e. if the Tweets are already stored they will be discarded from the storing process.

• URLs which might be used in any future work to retrieve their content and analyse it or for presenting some multimedia for positive, negative and/or neutral Tweets.

• Retweet count, which can be used for ranking purposes and for identifying the most important Tweets, which in turn can be used in computing the overall sentiment of the investigated topic.

• Place which might be used in order to visualise the distribution of the negative and positive Tweets on a map or on a country based list.

Once the Twitter4J RESTful API is used to search Twitter with a given query, the resulting Tweets are returned as Java objects of class Status, which needs to be extended to include extra features for the purpose of this project, such as the Tweet's text after preprocessing and its polarity score.

After they are retrieved from Twitter, Tweets are stored in a dedicated table in the Tweets database (TDB) along with their attributes, and their publishers, i.e. users (also supported by Twitter4J), are stored in another table. The design of TDB is shown in Figure(13) and it will be implemented as an Oracle SQL DB. Storing and retrieving Tweets to and from TDB will be done using JDBC1, as it was used in previous projects and has sufficient documentation. In addition, functionality for manipulating the Tweets and their attributes should be supported by the DB interface that will be designed. Also, when the user issues a query, the system should look for Tweets already available in TDB to be retrieved and evaluated along with the new Tweets retrieved from Twitter.

There should be some Tweets used for testing which should be stored in the database and accessed through the database interface as normal Tweets. The difference between both of them is that, for test Tweets, polarity score is set accurately and must not be altered.

Figure(13): Schema Design for Tweets Database(TDB).

1 JDBC: Java Database Connectivity. An API in Java to connect with databases and use their contents in Java.


3.3 Tweets Preprocessing.

Attributes of Tweets are beneficial when the Tweets are processed. They can be used to filter Tweets, e.g. distinguishing Tweets that are written in English from those written in other languages. Furthermore, spam Tweets, such as Tweets with a URL only or hashtags only, should be identified and therefore excluded from the Tweets to be processed.

Tweets that are processed need to be cleaned of @mentions, unnecessary #tags and Twitter-specific abbreviations such as RT (retweet) and PRT (previous retweet). Also, URLs are not useful for sentiment analysis and should be removed prior to processing. All of this can be achieved using Java collections such as lists, or regular expressions (regexes).

There are multiple issues with Tweets to be solved when a Tweet is normalised. Han and Baldwin mention some of them[27], including duplicated letters within words e.g. gooooood, typographical errors, slang and abbreviations. They can be found easily using a POS tagger with the <unk> (unknown) option or using a spellchecker. Furthermore, emoticons contain strong sentiment and they should be either normalised or used for sentiment evaluation[28].

3.4 Sentiment Evaluation Methods.

There are many methods that can be used to evaluate the sentiments of Tweets. The first is a baseline classifier, which classifies Tweets depending on the occurrence of previously known words with given sentiments. This baseline classifier might be implemented based on multi-label classification[29]. The classifier can be improved by employing modifiers and negations, which might increase or decrease polarity scores. Also, different lexicons can be used, comparing their effectiveness. For both methods, there are some prerequisites such as tokenization and sentence splitting, and two corpora are required for implementing such a classifier.

Different classifiers can be used, too. For example, some NLP Java APIs such as WEKA1 and LingPipe2 provide the required functionalities to implement different classifiers. Two are focused on here: the NB classifier and the Weighted Majority classifier. Training data is required for classifiers, so there should be some pre-collected and pre-classified data to train the classifier on.

Stanford CoreNLP (mentioned in section 2.4.2) is another option for evaluation. Its main advantage is that all the required tools for analysis are provided. It works by building a pipeline of different tools then passing the Tweets to the pipeline in order to be processed and evaluated. Also, it provides the output from each tool in the pipeline which can be presented to the user. In addition, the sentiment of the sentence that contain the search keyword can be easily provided which means more accuracy in evaluation[19][26].

The main design approach will be to apply the different evaluation methods to the Tweets and present the results from each method to the user. Alternatively, the user should be able to choose the method prior to processing, because different methods give better performance depending on the context of the Tweets used and the way each method is designed. Also, the average evaluation across all of them can be presented to the user, who can then decide on the accuracy of these methods. There should also be statistical results of the number of Tweets in each polarity class, i.e. positive, negative and neutral.

1 WEKA is a DM framework in Java by the University of Waikato. Available at: http://www.cs.waikato.ac.nz/ml/weka/
2 LingPipe is a tool kit for processing text using computational linguistics. Available at: http://alias-i.com/lingpipe


3.5 Visualisation.

Results from sentiment analysis should be visualised in a way that expresses the feelings in each class of Tweets. This can be done using a WordCloud that emphasises the most used terms in a class of Tweets. It can be implemented with OpenCloud1 straightforwardly by considering each word as a tag and incrementing its weight as it repeatedly occurs within the document. However, a stopword list should be made to avoid words with high frequency in English. An example of a WordCloud is shown in Figure(14)[30].
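A minimal sketch of this weighting step in plain Java is shown below (the stopword entries are illustrative assumptions and the hand-over of the counts to OpenCloud tags is omitted):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TagWeights {
    // Illustrative stopwords; the real blacklist is larger (see section 6.3.3).
    private static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("the", "a", "an", "and", "of", "to", "is", "in"));

    // Counts how often each non-stopword term occurs across one class of Tweets;
    // the counts are then used as tag weights for the WordCloud.
    public static Map<String, Integer> weights(Iterable<String> tweets) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String tweet : tweets) {
            for (String token : tweet.toLowerCase().split("\\W+")) {
                if (token.isEmpty() || STOPWORDS.contains(token)) {
                    continue; // skip empty splits and very frequent English words
                }
                Integer current = counts.get(token);
                counts.put(token, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }
}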

Figure(14): A WordCloud of most regularly used words in negative reviews of Android phone.

1 OpenCloud is an API in Java for creating WordClouds. Available at: http://opencloud.mcavallo.org


Chapter 4: Implementation.

4.1 Implementation Technologies

This project was implemented using the Java programming language and a MySQL database. All classes mentioned in this section are implemented in Java, and the JDBC API was used to connect TDB with the application. The database includes 7 different tables:

• Tweet. Stores Tweets that are retrieved from Twitter for analysis and some of their attributes (mentioned in Figure(13)).

• TestTweet. Stores Tweets that are used for testing purposes.
• User. Stores users who posted the Tweets that are used for analysis.
• Lexicon. Stores 11816 different terms along with their sentiments.
• Modifier. Stores modifiers that change the sentiments of the words in the lexicon.
• Negation. Stores negation words which reverse the sentiment of a word in the lexicon.
• Abbreviation. Stores some abbreviations along with their full form.

4.2 Tweets Retrieval.

Tweets are retrieved from Twitter using Twitter4J, which provides the required functionalities to use the Twitter RESTful API. However, before the Twitter API can be used, the application must be registered on Twitter, which provides the keys required to authorise the application to use the API. They should be stored in a special property file called twitter4j.properties which contains 5 different parameters:

• debug (boolean), which is set to true.
• consumerKey.
• consumerSecret.
• accessToken.
• accessTokenSecret.

Tweets can be retrieved by creating an instance of Twitter using TwitterFactory and then using that instance to perform a search with a Query, which is another class in Twitter4J. After the search, the results are saved as a QueryResult which contains the resulting Tweets. Each time a search is performed, another set of Tweets is returned; therefore, the search is repeated until the result is null or until the maximum number of requests is exceeded. The request limits are defined by Twitter as “When using application-only authentication, rate limits are determined globally for the entire application. If a method allows for 15 requests per rate limit window, then it allows you to make 15 requests per window — on behalf of your application. This limit is considered completely separately from per-user limits.”[31]. Each Tweet is associated with the user who posted it on Twitter, which can be extracted using the getUser() method.
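A minimal sketch of this retrieval loop (the Twitter4J calls are standard, but pagination and rate-limit handling are simplified here):

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;

public class TweetRetriever {
    // Searches Twitter repeatedly for the keyword until no further result pages are returned.
    public static void retrieve(String searchKeyword) throws TwitterException {
        Twitter twitter = TwitterFactory.getSingleton(); // credentials are read from twitter4j.properties
        Query query = new Query(searchKeyword);
        do {
            QueryResult result = twitter.search(query);
            for (Status status : result.getTweets()) {
                // Each Status carries the Tweet text and the user who posted it.
                System.out.println(status.getUser().getScreenName() + ": " + status.getText());
            }
            query = result.nextQuery(); // null when there are no more pages
        } while (query != null);
    }
}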

Query can take some parameters to do the filtering on the retrieved Tweets. Filters are concatenated with the query, i.e. the search keywords, as in the following line of code:

Query query = new Query(searchKeyword + " -filter:retweets -filter:links -filter:replies -filter:images");

The previous query uses 4 filters:

• retweets filter, which excludes any retweets from the results. In other words, it removes duplicates when an encountered Tweet is a retweet.

• links filter, which excludes any Tweets with URLs. If used, this saves some time in the preprocessing step.

• replies filter, which excludes any Tweets created as replies to other Tweets, whose conversational context would be hard to interpret since they are not retrieved together with the Tweets they replied to.

• images filter, which excludes any Tweets that contain Twitter images, which mostly carry little or no textual content.

Tweets and users required the implementation of two new classes to add some attributes needed by this application, such as polarity scores, search keywords and the normalised text of the Tweet. The two classes are shown in Figure(15) and Figure(16). Finally, Tweets along with their authors, i.e. users, are saved in TDB using the SearchTwitterToDB class. Other functionalities for searching Twitter were implemented, too, for example retrieving Tweets to be stored in a txt file or a CSV1 file.

Figure(15): Class Diagram of MyTweet used for presenting Tweets along with their attributes.

Figure(16): Class Diagram of MyUser used for presenting Users (those who posted a Tweet) along with their attributes.

1 CSV: Comma Separated Values, a plain-text format for tabular data that can be opened in spreadsheet applications such as Excel.


4.3 Tweets Storage.

After searching Twitter, Tweets along with the users who created them are stored in TDB. Therefore, an interface for inserting them into the database and manipulating them later was implemented. The Database class provides the required functionalities for storing and manipulating the data in TDB. SearchTwitterToDB uses Database for inserting Tweets and users into the database. The functionalities provided by this class can be seen in the class diagram in Figure(17).

Figure(17): Class diagram for Database interface.

There were several problems associated with storing the Tweets, such as ensuring that a Tweet has not been stored before and that the format of the Tweet text does not cause any problems in the database. For example, Tweets might contain single quotes, which are used by SQL syntax to delimit strings. Therefore, some engineering solutions were applied, such as doubling single quotes when the text is stored and replacing the doubled quotes again when the Tweet is retrieved from the database at a later stage.
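A sketch of the two options for the quote problem; the table and column names below are illustrative assumptions, and the PreparedStatement variant is the idiomatic JDBC alternative to manual escaping:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TweetStorage {
    // Approach described above: double every single quote before embedding the text in an SQL string.
    public static String escapeForSql(String tweetText) {
        return tweetText.replace("'", "''");
    }

    // Alternative: let the JDBC driver handle quoting through a parameterised statement.
    public static void insertTweet(Connection conn, long tweetId, String tweetText) throws SQLException {
        String sql = "INSERT INTO Tweet (tweet_id, tweet_text) VALUES (?, ?)"; // hypothetical column names
        PreparedStatement ps = conn.prepareStatement(sql);
        try {
            ps.setLong(1, tweetId);
            ps.setString(2, tweetText);
            ps.executeUpdate();
        } finally {
            ps.close();
        }
    }
}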

Other functionalities include retrieving modifiers and negations that are used by the sentiment analysis to provide their sentiments or their effects on the scoring of terms' polarities. Also, test Tweets used for testing purposes are kept apart from the search Tweets and stored in a separate table. Finally, the abbreviations used to analyse the Tweets are stored in TDB; they can be retrieved through the database interface and used in normalising the Tweets.

4.4 Tweets Normalisation.

Tweet normalisation is mainly done using the CleanTweet class (Figure 18), which can be given the text of a Tweet and return its clean (normalised) version, or be given a Tweet and return the same Tweet with its cleanedTweet attribute set. Normalisation of Tweets consists of various steps:

Figure(18): CleanTweet Class Diagram.


4.4.1 Identifying language.

Identifying the language of a Tweet is easy because each Tweet has an ISO1 language code among its attributes. When the ISO language code is “en”, the Tweet is written in English. This can be used to separate English Tweets from any non-English Tweets that might be among the results.
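The filtering check is then a single comparison against the language attribute (a sketch; getLang() is the Twitter4J accessor for this attribute in recent library versions):

import twitter4j.Status;

public class LanguageFilter {
    // Keeps only Tweets whose reported language code is English ("en").
    public static boolean isEnglish(Status status) {
        return "en".equals(status.getLang());
    }
}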

4.4.2 Spellchecking.

Tweets may contain typographical errors, so these should be corrected before processing. The Google spellchecker[32] is highly reliable but unfortunately is no longer available as an open-source tool. Hence, other spellcheckers were tested, such as the JLanguageTool spellchecker[33]. However, they proved unreliable, since they change some names and known abbreviations to what they consider their correct form. Therefore, spellchecking was not fully implemented over the full text of Tweets, since the available time was not enough to assure reliability and there were more important tasks to be achieved.

4.4.3 Normal Cleaning methods.

Tweets might have different characteristics that should be discarded before processing. For example, extra white spaces and new line characters should be removed from the Tweet before it can be processed. Also, some words are written with duplicated letters, such as whaaaat instead of what. Runs of duplicate letters are reduced to 2 letters, and then the spellchecker is used to ensure the correct version of the word is included in the cleaned text.
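The letter-collapsing step can be expressed with one back-referencing regex; a sketch (the exact patterns used in CleanTweet may differ):

public class BasicCleaner {
    // Collapses runs of three or more identical word characters to two, e.g. "whaaaat" -> "whaat",
    // then squeezes white space and new lines into single spaces.
    public static String basicClean(String text) {
        String collapsed = text.replaceAll("(\\w)\\1{2,}", "$1$1");
        return collapsed.replaceAll("\\s+", " ").trim();
    }
}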

4.4.4 Discarding Twitter Patterns.

There are some patterns used in writing Tweets which can be identified easily using regexes. Such patterns include the usage of some Twitter abbreviations at the start of the Tweet, for example to indicate to other users that a particular Tweet is a retweet (RT) or refers to the previous Tweet (PRT). Such patterns might also appear in the middle and at the end of the Tweet. For example, some users mention multiple users or use multiple hashtags at the end of the Tweet, which are not useful for sentiment analysis. URLs should also be removed since they are not useful either. User mentions and hashtags that appear elsewhere are not removed, because they are different and might be used for analysis, too. Figure(19) illustrates some of the patterns discarded from Tweets.

Pattern: White spaces
Regex:   "\\s+"

Pattern: "RT @someone:"
Regex:   "\\s*RT\\s*@\\w+\\s*:\\s*"

Pattern: "http://foo via @someone", "https://bar by @someone #hashtags"
Regex:   "\\s*(#\\w+\\s*)*https?:[^\\s]*\\s*(via|by)*\\s*(@\\w+)*\\s*(#\\w+\\s*)*"

Figure(19): Some of Twitter Patterns that are discarded from Tweets.

Tweets can also be spam, e.g. when the same Tweet is reposted by the same user, when it is full of hashtags, or when it has no relevance to the topic, such as adverts. Therefore, any spam Tweet is removed from the results. This could be made more efficient with a blacklist of such users, which would be more useful if the project focused on a particular topic.

1 ISO: International Standards Organisation.


4.4.5 Cleaning Tokens.

After processing a Tweet using the previous steps, the Tweet can be tokenised, either using a dedicated tokenizer such as the Stanford CoreNLP tokenizer, the WEKA tokenizer or any other tokenizer. Java also provides a standard text tokenizer which can be used by providing the tokenization pattern, i.e. characters such as white space, full stop, comma, etc. The standard Java tokenizer was used to tokenise the Tweets, and each token was processed using the following steps:
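A sketch of the tokenisation step with java.util.StringTokenizer (the delimiter set shown is an assumption; any whitespace/punctuation set would do):

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TweetTokenizer {
    // Splits a cleaned Tweet into tokens on white space and common punctuation characters.
    public static List<String> tokenize(String cleanedTweet) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(cleanedTweet, " \t\n\r,.;!?");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }
}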

4.4.5.1 Twitter Abbreviations.

Twitter abbreviations that were not dealt with by the previous processing methods should be interpreted into their full form. There are only 8 distinct abbreviations; they are represented in a list and each token in the Tweet is compared with each abbreviation, taking into account capitalisation and the way they are written. The list of Twitter abbreviations is summarised in Figure(20)[34]. It might be obvious that they are not very useful for the analysis and could simply be discarded. There are some other abbreviations associated with the Twitter industry, and some conversational abbreviations, which were added to the general internet abbreviations since their full form might be important for the user to understand their meaning. They might also be important for the sentiment analysis to work more accurately.

Abbreviation Full Form

CC Carbon-copy. Same as carbon-copy in emails.

CX Correction

CT Cut Tweet. Not full Tweet. i.e. partial Tweet.

DM Direct message. Private messages on Twitter.

HT Hat tip. This is a way of attributing a link to another Twitter user

MT Modified Tweet.

PRT Partial retweet.

PRT Please retweet.

RT Retweet.

Figure(20): Technical Twitter Abbreviations.

4.4.5.2 Internet Abbreviations.

Internet abbreviations are widely used over the web, and new abbreviations and jargon are evolving every day. Atkinson gathered the most used Internet abbreviations with their full forms and used them for deciphering documents before analysing them[35]. Atkinson's list of abbreviations, along with some other abbreviations taken from socialmediatoday.com[34], was stored in TDB. They need to be retrieved from TDB only once each time this process is performed. The total number of abbreviations in TDB is 384. Each token that is identified to be in the list of internet abbreviations is replaced by its full form.
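The expansion itself is a per-token map lookup once the table has been loaded from TDB; a sketch with two illustrative entries (the real map holds the 384 abbreviations mentioned above):

import java.util.HashMap;
import java.util.Map;

public class AbbreviationExpander {
    private final Map<String, String> abbreviations = new HashMap<String, String>();

    public AbbreviationExpander() {
        // In the application this map is filled once from the Abbreviation table in TDB;
        // these two entries are only illustrative.
        abbreviations.put("btw", "by the way");
        abbreviations.put("imo", "in my opinion");
    }

    // Replaces a token by its full form when it is a known abbreviation (case-insensitive).
    public String expand(String token) {
        String full = abbreviations.get(token.toLowerCase());
        return full != null ? full : token;
    }
}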

4.4.5.3 Noisy Tokens.

Noisy tokens include those used as hashtags or user mentions. If they have not been identified and discarded in the previous steps, they are kept because they carry meaning that is useful when the sentiment analysis is performed. Any token containing the # or @ characters has those characters removed before any further processing is applied. For example, if the token is a user mention, the @ character is removed and any dots or underscores used in it are identified and removed, too.


4.4.5.4 Handling Smileys.

Identifying smileys (emoticons) in Tweets is an important task since they may carry strong sentiment that can be used to improve the analysis and evaluation of Tweets. Identifying smileys is performed using regexes to find which category a particular encountered smiley belongs to. Smileys were divided into 9 different categories described in Figure(21). Some smileys might help in identifying the polarity of the Tweet while others might not, e.g. speechless smileys are not useful for sentiment analysis since someone might be speechless because of happiness, because of sadness or because there is nothing to say. Not all smileys are recognised, but these are the most frequent ones. Also, these are western smileys; other smileys, such as eastern smileys, could be added[35][36][37].

One of the sentiment analysis methods could be a smiley-based sentiment analyser, which gives Tweets their likely polarity depending on the smileys used in them. However, such a method is not very practical on its own, since not all Tweets contain emoticons.

Category Examples Regexes

Sad >:[ :-( :( :-c :c :-< :C :< :-[ :[ :{:'-( :'(

"[><].[><]""[:][',][-<>o]?[\\]\\)}]""[}\\)\\]]+[->]?[',]?[;:8=]+[0oO<{\\[\\(]?"

Happy :-) :) :o) :] :3 :c) :> =] 8) =) :} :^) :D 8-D 8D x-D xD X-D XD :-)) :'-) :') ;-) ;) *-) *) ;-] ;] ;D ;^) :-,

"[0oO>}\\]\\)]?[:;8=Xx]+[-oc^0]?[\\)\\]}D3>]+""[<\\[\\(\\{C]+[-o^0]?[:;8=Xx]+[0oO<}\\[\\(]?"

Angry :-|| :@ >:( ">*:[-]?\\|?[\\|(@]+"

Surprised >:O :-O :O :-o :o 8-0 O_O o-o O_o o_O o_o O-O

">?[:8]-?[0oO]""[o0O8][._]+[o0O8]""°[O0o]°"

Disgusted D:< D: D8 D; D= DX v.v D-': "D-?'?[:;8=]?[<xX]?""[vV][._][Vv]"

Cheeky >:P :-P :P X-P x-p xp XP :-p :p =p :-Þ :Þ :þ :-þ :-b :b d:

"[oO0>]?[:Xx=]-?[pPbdÞþ]+"

Kiss :* :^* ( '}{' ) "[:;8][-^]?\\*+"

Annoyed >:\ >:/ :-/ :-. :/ :\ =/ =\ :L =L :S >.<

">?[:=]-?[\\\\LSs/.]+"

Speechless :-X :X :-# :# ":-?[#Xx]+"

Figure(21): Smileys (Emoticons) Categories with their examples and regexes.
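Smiley handling then reduces to matching each token against the category patterns and substituting the category name, as in the test case of section 6.3.1. A sketch with two deliberately simplified patterns (the application uses the fuller regexes of Figure(21)):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class SmileyNormaliser {
    private static final Map<String, Pattern> CATEGORIES = new LinkedHashMap<String, Pattern>();
    static {
        // Simplified stand-ins for two of the nine categories.
        CATEGORIES.put("HAPPY", Pattern.compile("[:;=]-?[\\)\\]D]+"));
        CATEGORIES.put("SAD", Pattern.compile("[:;=]'?-?[\\(\\[]+"));
    }

    // Returns the category name for a smiley token, or the token unchanged otherwise.
    public static String normalise(String token) {
        for (Map.Entry<String, Pattern> entry : CATEGORIES.entrySet()) {
            if (entry.getValue().matcher(token).matches()) {
                return entry.getKey();
            }
        }
        return token;
    }
}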

4.5 Tweets Sentiment Evaluation.

Four different methods were used to evaluate the sentiments of Tweets. Other methods were also tried but could not be finished within the time limit of the project; they are discussed here as well. The following subsections explain the methods used:


4.5.1 BaseLine Classification.

This method was influenced by the work done in [38]. BaseLine classification uses the negative and positive terms in the lexicon provided by MPQA1 in order to evaluate the sentiment of each Tweet. They are stored in two lists, each token of the analysed Tweet is matched against both lists, and the sentiment of the sentence is calculated from the matches. It is mainly supported by the BaseLineClassifier class, which was implemented in the classificationScoring package along with many supporting classes, as visualised in Figure(22).
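A minimal sketch of the word-scoring idea behind BaseLineClassifier, assuming the positive and negative term lists have already been loaded from the lexicon stored in TDB:

import java.util.List;
import java.util.Set;

public class BaselineScorer {
    private final Set<String> positiveTerms; // loaded from the Lexicon table
    private final Set<String> negativeTerms;

    public BaselineScorer(Set<String> positiveTerms, Set<String> negativeTerms) {
        this.positiveTerms = positiveTerms;
        this.negativeTerms = negativeTerms;
    }

    // Adds +1 for each positive term and -1 for each negative term among the Tweet's tokens.
    // A positive total is classed as positive, a negative total as negative, and zero as neutral.
    public int score(List<String> tokens) {
        int score = 0;
        for (String token : tokens) {
            String t = token.toLowerCase();
            if (positiveTerms.contains(t)) {
                score++;
            } else if (negativeTerms.contains(t)) {
                score--;
            }
        }
        return score;
    }
}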

The main objective of this package was to allow any type of classifier to be created using the WEKA API. Everything was implemented correctly except that training the classifiers with a certain type of file (ARFF files)2 was an issue. The training was completed and achieved 75% precision. However, the classifier could not be used to classify Tweets, since building the model that should be returned after training took a very long time without returning anything, which was not practical or acceptable for the purpose of this project. This problem could have been solved if the implementation phase of the project had been planned to be longer and if enough knowledge about WEKA had been acquired.

Figure(22): Class Diagram For ClassificationScoring Package.

4.5.2 Advanced BaseLine Classification.

Again, Tweets are tokenized and each token is evaluated separately, but modifiers such as very, so, etc. are used along with negations such as was not, am not, etc. in order to make the sentiment evaluation more accurate. This method was influenced by [39], where each technique, i.e. negations, modifiers, gaussian distribution, etc., was applied separately; here they are applied all together to achieve better results. Unlike the previous method, instead of simply summing up the sentiments of tokens, some calculations are made to evaluate the total sentiment of a Tweet. The calculations include the gaussian distance from each modifier and negation word. Some other features were experimented with, too. For example, words that are not associated with a term in the lexicon can be looked up for related words in Wikipedia1 or using WordNik2, as there are APIs for communicating with these websites and retrieving the required data from them. Such methods can be used to find the full form of rarely used abbreviations that are not included in the TDB abbreviation table. However, such a method would require some voting mechanism to determine the most related word. Unfortunately, attempts to build classifiers here faced the same problems as the previously used package (section 4.5.1). Therefore, not all features supported by this package were used in the final version of the application. Figure(23) illustrates the work associated with this sentiment analysis method.

1 MPQA: Multi-Perspective Question Answering. An opinion corpus containing articles, lexicons, and some other opinion mining resources and tools.
2 ARFF: Attribute-Relation File Format. It is used to distinguish different attributes in the training file and how they can be used to generate the expected results. For more info: http://www.cs.waikato.ac.nz/ml/weka/arff.html
1 Wikipedia, the free encyclopedia: http://www.wikipedia.org
2 WordNik: used for discovering different meanings of words: https://www.wordnik.com/

Figure(23): Class Diagram For WordScoring Package.
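The idea behind the improved scoring can be sketched as follows. This is not the actual WordScoring implementation; the Gaussian width and the exact amplification/negation factors are illustrative assumptions:

import java.util.List;
import java.util.Set;

public class AdvancedScorer {
    private static final double SIGMA = 2.0; // hypothetical width of the Gaussian window

    // Weight that decays with the token distance from a modifier or negation word.
    private static double gaussian(int distance) {
        return Math.exp(-(distance * distance) / (2 * SIGMA * SIGMA));
    }

    // Each sentiment-bearing token contributes its base polarity, intensified by nearby
    // modifiers ("very", "so", ...) and flipped/damped by nearby negations ("not", ...),
    // both weighted by their Gaussian distance from the token.
    public static double score(List<String> tokens, Set<String> positive, Set<String> negative,
                               Set<String> modifiers, Set<String> negations) {
        double total = 0.0;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i).toLowerCase();
            double polarity = positive.contains(t) ? 1.0 : negative.contains(t) ? -1.0 : 0.0;
            if (polarity == 0.0) {
                continue;
            }
            for (int j = 0; j < tokens.size(); j++) {
                if (j == i) {
                    continue;
                }
                String other = tokens.get(j).toLowerCase();
                double w = gaussian(Math.abs(i - j));
                if (modifiers.contains(other)) {
                    polarity *= 1.0 + w;        // intensify when a modifier is close
                } else if (negations.contains(other)) {
                    polarity *= 1.0 - 2.0 * w;  // flip the sign when a negation is adjacent, fade with distance
                }
            }
            total += polarity;
        }
        return total;
    }
}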

4.5.3 LingPipe Classification.

The LingPipe API provided the alternative to the WEKA classifiers experimented with in the previous two methods. Two classifiers were tried here: the Naive Bayes and the Dynamic Language Model classifiers. Before a classifier can be used it should be trained; therefore, the classifiers were trained on pre-classified Tweets. The number of Tweets used to train both classifiers is 1,578,627 classified Tweets[40]. The classifiers were not trained on neutrality but only on negativity and positivity, since the training document did not contain any Tweets classified as neutral. The classified Tweets are based on two different corpora:

• University of Michigan Sentiment Analysis competition on Kaggle.• Twitter Sentiment Corpus by Niek Sanders.

The DynamicLMClassifier class is LingPipe's implementation of a dynamic language model classifier. The difference between a normal LM classifier and a dynamic LM classifier is that the latter can be trained using training data. LingPipe describes it as “a classifier that performs joint probability-based classification of character sequences into non-overlapping categories based on language models for each category and a multivariate distribution over categories”, and further explanation of how it works can be found in the API documentation[41]. The NaiveBayesClassifier class is LingPipe's implementation of the NB classifier; it extends the DynamicLMClassifier implementation and is also explained in the API documentation[42]. Using other classifiers would be similar; however, these two are sufficient to illustrate how classifiers can be used. The class implemented to use the LingPipe API and its classifiers is the PolarityBasic class, which is illustrated by Figure (24).

Figure(24): Class Diagram For PolarityBasic Class.

It works by calling the classify method with the cleaned text of the Tweet, which returns the class as a string, i.e. pos or neg.
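A usage sketch, assuming classifier is a DynamicLMClassifier already trained on the pos/neg corpus described above (the exact training calls differ between LingPipe versions and are omitted here):

import com.aliasi.classify.Classification;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;

public class PolaritySketch {
    // Returns "pos" or "neg" for a cleaned Tweet, given an already trained classifier.
    public static String polarityOf(DynamicLMClassifier<NGramProcessLM> classifier, String cleanedTweet) {
        Classification classification = classifier.classify(cleanedTweet);
        return classification.bestCategory();
    }
}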

4.5.4 Stanford CoreNLP Classification.

Implementing the Stanford CoreNLP pipeline is made simple by the interface provided by the API1. An instance of the pipeline is created, then the annotators used by the pipeline are specified, along with their options where necessary. Each annotator generates annotations which can then be used by the next process in the pipeline or shown to the user when required. Four annotators are used by the pipeline to obtain the sentiments of processed Tweets[19]:

1. tokenize annotator. It uses extended Penn Treebank style (PTB-style) tokenization, following the PTB methodology along with some other options. It has 20 options in total that can be changed to conform to users' needs. For example, untokenizable=allKeep is the option for keeping characters or sequences of characters that are not tokenizable[43].

2. ssplit annotator. It forms sentences from the sequence of tokens generated by tokenize annotator.

3. parse annotator. It provides full syntactic analysis using a probabilistic natural language parser. It generates the Stanford Dependency Tree (SDT) annotation (TreeAnnotation), which can be printed neatly. A comparison between basic dependencies and SDT can be seen in Figure(25)[44].

4. sentiment annotator. It uses the Socher et al. sentiment model[26], which was explained in some detail in section 2.4. Basically, it attaches a binarised tree of the sentence to the sentence-level CoreMap, which is then used to evaluate the different tokens of the sentence, where the parent (highest-level node) holds the final sentiment value. Each node of the tree carries a Recursive Neural Network (RNN) core annotation (RNNCoreAnnotations), which is used to obtain the predicted class and scores for that subtree. The score ranges between 0 and 4, with 2 being the neutral score.

1 Stanford CoreNLP API: http://www-nlp.stanford.edu/nlp/javadoc/javanlp/

5. Some other useful annotators are also provided, such as ner (named entity recognition) and pos (part-of-speech tagging).

After creating the pipeline and assigning the annotators, the sentences within the Tweet can be retrieved as CoreMaps. Subsequently, the annotated tree of each sentence can be extracted using the SentimentCoreAnnotations class. After that, the tree containing the search keyword is passed to the next step to find its sentiment, which can be calculated using the RNNCoreAnnotations class; its getPredictedClass method is used to find the polarity of the sentence, which will be an integer between 0 and 4. Finding the sentence with the keyword increases the accuracy of the method, since each sentence might have a different sentiment and averaging them would give incorrect results. Therefore, the ssplit annotator is important and might also be used by other evaluation methods to increase accuracy. The StanfordPipeline class provides the functionalities explained here and can also be used to return annotations usable by other sentiment analysis methods. The sentiment stage of the pipeline is trained on more than 10,000 documents and can be retrained using the TrainStanfordPipeline class. Both of these classes are illustrated by Figure(26).

Figure(25): Left(Stanford Dependencies) Right(Basic Dependencies) Trees.

Figure(26): Classes For Using & Training Stanford CoreNLP pipeline
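A condensed sketch of the pipeline construction and per-sentence sentiment extraction described above (annotation key names differ slightly between CoreNLP releases, so the exact classes should be checked against the version in use):

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;

public class SentimentPipelineSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation annotation = new Annotation("The film was not good. The acting was brilliant.");
        pipeline.annotate(annotation);

        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            // The sentiment annotator attaches a binarised tree to each sentence.
            Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
            int sentiment = RNNCoreAnnotations.getPredictedClass(tree); // 0 (very negative) .. 4 (very positive)
            System.out.println(sentiment + "\t" + sentence.toString());
        }
    }
}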


4.6 Generation of Results.

Using MainClass, Tweets are printed for the user as soon as they are retrieved from Twitter. After that, all Tweets are normalised, and each Tweet that is expanded because of the presence of an abbreviation is printed for the user. The final number of Tweets after filtering, i.e. after removing spam, duplicates and non-English Tweets, is also reported. Each sentiment analysis method mentioned above classifies Tweets into positive, negative and possibly neutral classes. Then the percentage of each class is calculated for each method and printed at the end. Also, each Tweet is evaluated using the 4 methods and the resulting sentiments are shown to the user.

4.7 Visualisation.

The classes resulting from classifying Tweets with the Stanford pipeline are visualised using WordClouds, such that each class is represented by an individual WordCloud in a separate tab of a window, so the user can move between the tabs and see the differences. Figure(27) presents the window that is shown to the user. The WordCloud was implemented using the OpenCloud API. The different classes of Tweets to be visualised are passed to the makeOpenCloud method in the MainClass class, together with the required tab headers.

Figure(27): WordCloud (negative) for FSA1 Sentiment Analysis Results.

1 FSA: Free Syrian Army. http://en.wikipedia.org/wiki/Free_Syrian_Army

Chapter 5: Results.

5.1 Introduction.

This chapter provides a walkthrough of the application produced, how it is used, and the outputs that are presented to the user. Due to time restrictions, no GUI was implemented; instead, a command line interface is given to users to analyse the topic they choose. Basically, the application is used through the MainClass class, which does all the required processing for the user and displays the results.

5.2 Twitter Search.

Keywords entered by the user are used to build the query, which is then used to perform a search on Twitter. While Tweets are being pulled from Twitter, they are printed for the user along with the user who posted them and the time they were created, as seen in Figure(28).

Figure(28): Some of the search results about “Ukraine”.

After the search is done, then non-English Tweets will be filtered and the final number of Tweets will be presented to the user as seen in Figure(29).

Figure(29): Statistics about the number of Tweets after search and after filtering.

5.3 Preprocessing.

Tweet normalisation and cleaning is done on the English Tweets retrieved from Twitter, after discarding the non-English Tweets. For each Tweet, the cleaned version is stored with the Tweet in the database and then printed for the user when the Tweet is analysed (Figure(30)). Before that, the abbreviations that are replaced by their full forms are shown to the user. Also, Tweets that are duplicated in TDB or identified as spam are deleted from both the results and the database. Results from preprocessing can be seen in Figure(31).

Figure(30): Results of preprocessing.


Figure(31): A Tweet along with its cleaned version.

5.4 Sentiment Evaluation.

Before performing Sentiment analysis on Tweets, classification and Stanford pipeline methods will construct the required models i.e. NB classifier and the pipeline tools. This step is confirmed to the user through some output as seen in Figure(32).

Figure(32): Output for building NB classifier and Stanford pipeline.

Sentiment of each Tweet will be evaluated using the 4 methods discussed above and the result from each evaluation method will be shown to the user. Figure(33) shows a Tweet along with its sentiments from the different methods.

Figure(33): A Tweet with its cleaned version and sentiments.

Using the Stanford pipeline provides additional functionalities, such as visualising the parse tree and the dependency tree, which are shown to the user for each Tweet. Figure(34) illustrates the way they are shown. Parse trees are printed neatly like a tree, while the dependency trees are printed textually with the dependencies attached to each pair of words.


Figure(34): A Tweet with its sentiments plus its parse tree and dependency tree details.

5.5 Results Summary.

Statistics about the analysed Tweets are viewed to the user after sentiment analysis is done. Tweets are classified into 2 or 3 classes using each sentiment analysis method. After that, the percentage of Tweets of each class is calculated and presented to the user. An example of the summary results is shown in Figure(35).

Figure(35): Summary results of sentiment analysis.

5.6 WordClouds.

After generating the summary, a WordCloud for each class of Tweets from the final sentiment analysis method, i.e. the Stanford pipeline, is created. The WordClouds are shown in a pop-up window that contains three different tabs, one for each class of Tweets. Figures(36-38) show the different WordClouds resulting from analysing “Ukraine” related Tweets.


Figure(36): WordCloud for negative Tweets for “Ukraine” search.

Figure(37): WordCloud for positive Tweets for “Ukraine” search.

Figure(38): WordCloud for neutral Tweets for “Ukraine” search.


Chapter 6: Testing and Evaluation.

6.1 Overview.

This chapter gives an overview of the methods used for testing and how the application was evaluated. Evaluation was done in multiple steps and, where required, improved as more understanding was gained of the important features implemented and how they are expected to work.

6.2 Methodology.

The project consists of various processes which need to be integrated in order to achieve the stated objectives. Therefore, every process was tested and assessed individually. While the final evaluation is done on pre-classified Tweets that were divided into 2 sets depending on the sentiment analysis method being tested, some of the implemented processes were tested using dedicated testing classes, i.e. using a unit test methodology. However, some other processes were tested using the main method within the class that implements them. The testing classes that were created are shown in Figure(39).

Figure(39): Class Diagram for Testing Classes.

Sentiment evaluation methods were tested using the main method within the class that implements them, except the NB classifier, which was tested using the PolarityBasicTest class. Sentiment methods that were implemented but not used in the final version of the application were tested independently using the main method within the class that implements them, or another class within the package in which they were implemented.


6.3 Results.

6.3.1 Preprocessing.

The Tweet cleaning process was tested by testing each method and each case independently. For example, the pattern identification step was tested first, and after it proved successful the next method was tested, and so on. This was done using the CleanTweet class main method and the CleanTweetTest class. The implemented methods proved reliable and accurate through the different test cases tried at each step. An example of a test case is shown in Figure(40), where smileys are replaced by their category and each word with duplicate letters is shortened.

Figure(40): An example of Preprocessing Test Case.

6.3.2 Sentiment Evaluation.

Sentiment evaluation methods were tested using pre-classified Tweets stored in TDB. Testing was done using 2 different sets of test Tweets, because the first three evaluation methods cannot deal with neutral Tweets, whereas the final method can evaluate Tweets as neutral and, unlike the others, uses fine-grained evaluation. The results from the first test using the first dataset are given in Figure(41). The first dataset included positive and negative Tweets only, which resulted in lower accuracy for the final evaluation method. Also, the second method was not implemented accurately at that point, which resulted in low accuracy.

Figure(41): Results of First Test of Sentiment Evaluation Methods.

As the Figure above shows, the results were not satisfying, and therefore further development of the Tweet processing pipeline was introduced and the dataset was changed to meet the speciality of each evaluation method. The improvements introduced include the following:

• Extracting the sentences that contain the search keywords and using them for evaluation.

• Solving the errors in the second sentiment analysis method; mainly, using the gaussian distribution with many features seemed to produce a higher error rate.

• Improving the normalisation process and using better training data.

• Expanding the sentiment lexicon used.

After building another dataset, the application was tested again using each of the evaluation methods, with some of its Tweets excluded where necessary, i.e. neutral Tweets for NB/LM classification. Also, the second evaluation method had some errors in the way the total sentiment of each Tweet was evaluated, which is why it had very low accuracy; this was fixed by improving the gaussian distribution function implemented. The results achieved after the second test was performed can be seen in Figure(42). A comparison between the results of both tests is given in Figure(43).


Figure(42): Results of Second Test of Sentiment Evaluation Methods.

Figure(43): Test Results Comparison.

6.3.3 Visualisation.

Some random WordClouds were generated for testing. After that, the evaluation of the WordCloud generation was done over real Tweets to test it and to identify any associated errors. The main issue observed was that some of the most common words in the English language filled the WordCloud. Therefore, a blacklist of these words was created to prevent any of them from being presented in the WordClouds generated for users. Figure(44) shows some of the words identified.

Figure(44): Words Blocked From WordCloud Tagging.


Chapter 7: Conclusions.

7.1 Achievements.

The project objectives stated in section 1.5 are:

• Retrieving Tweets from Twitter based on some keywords given by the user.

• Storing Tweets along with their publishers i.e. users.

• Processing Tweets to give normalised less noisy text.

• Identifying users' opinions in Tweets about a particular subject.

• Visualising different Tweets and the significant words in each class (opinion) of Tweets.

The above objectives have been achieved to some extent, and the way they were achieved is discussed in the design and implementation chapters (3 & 4).

7.2 Reflection.

This project was a great learning experience; it was challenging and interesting. It widened my knowledge about one of the most advanced and evolving topics of computer science, i.e. text mining, and how it can be applied in research and commercial applications. It also proved again that computer science has a lot more to achieve, and that we are still at the beginning of the advancement of its applications, especially regarding the understanding of natural language. Representing human perception to a computer is still an open problem, and as young, courageous computer science students we should find where to place our steps to give more to our communities and the world as a whole.

The first important learning outcome to highlight is that the same problem might be solved using different methods; as a researcher, enough background knowledge about each method should be acquired and investigated before starting the implementation, especially for large projects. Also, some methods might evolve while being worked on, and there will always be some out-of-date tools that were used in previous projects; if possible, alternatives should be found or created to replace them.

Another important learning outcome is that time management is an important aspect of any achievement. Even if something seems easy to achieve, it might turn into a dilemma and consume more time than estimated. Therefore, any deadlines associated with important milestones should be given some flexibility, so that when a problem occurs there is still some extra time to deal with it.

During this project, patience and continuous balanced work proved to be more productive than overworking for a short time just to meet a set deadline, as continuous balanced work provides more time to think and allows many creative ideas to be implemented. Also, continuous communication with experienced staff is important before taking decisions that might affect the initial plans, since it widens the view of inexperienced students who are new to some topics and do not have enough background knowledge to base their decisions on. Being adventurous is good if there is enough time and background knowledge about the topic and the different tools used. Hence, different tools should be tested and investigated before using them, and important design decisions should be discussed with others who have enough information about the most suitable tools to use.


7.3 Further Work.

7.3.1 Twitter.

The Twitter RESTful API was used for this project, but the Streaming API is also available and offers advantages for analysing hot topics without limitations on the number of Tweets that can be retrieved. Therefore, some future work might be based on using these methods for analysing topics in real time and with more filters. Also, some ranking mechanism might be implemented, which might depend on different features decided by the user.

Also, there are restrictions made on Twitter content in some countries such as Turkey[44], and some analysis might be performed on such content to indicate whether it should be restricted or not. This is possible since Twitter provides the parameters to identify the countries that restrict a particular Tweet.

Moreover, using Twitter location attributes might be useful in developing applications that analyse Tweets depending on the place they were issued from, or for performing comparisons between views about a particular topic in different countries. This could help establish the correlation between the content produced on social media and the location of the user. Also, the geographical distribution of views might be visualised. This can be done using the Twitter Streaming API, since the retrieved Tweets can be filtered by location, too.

Finally, this project might be used on different languages and possibly about a particular topic to indicate the differences between global views i.e. found in English Tweets and the Tweets that are written in the language spoken in the area that is mostly interested in the topic. For example, investigating the differences between Arabic and English Tweets about Syrian conflict to see whether Arabic media is influencing users differently in comparison with Global media or English media.

7.3.2 Preprocessing.

The patterns used to clean Tweets can be extended through further analysis of more Tweets to identify additional patterns used on Twitter. Sometimes these patterns are specific to Tweets written about a particular topic; therefore more dynamic pattern identification methodologies are required. Also, the filtering method should be improved to filter out Tweets whose meaning is irrelevant to the topic. For example, questions and spam should be identified using more efficient methodologies. Furthermore, statistics about the Tweets that are filtered should be generated and presented to the user.

Spellchecking should be implemented efficiently and effectively taking into account the collection of abbreviations which should be extended to include more of the frequent and rare abbreviations. In addition, some techniques for disambiguating which abbreviation to use depending on the context should be made.

Finally, more smileys should be identified, and the set of Tweets with smileys should be distinguished from the other Tweets in order to introduce some smiley-based sentiment analysis. For instance, mobile-specific emoticons should be identified, since the usage of mobile devices on Twitter and other social networks is growing. Furthermore, some mechanism might be developed to allow the user to interact with the preprocessing, to add suggestions or to fix errors generated by the automatic process. Also, a training corpus could be built using this methodology, e.g. searching for Tweets that contain smileys, assigning their sentiment values based on the smileys used, and adding them to a special training corpus that can be used to train classifiers.


7.3.3 Sentiment analysis.

Sentiment analysis for other languages might be implemented, and other methods of sentiment analysis in English should be investigated. The results from the four sentiment analysis methods discussed above can be improved once the sources of their weaknesses are identified. Also, sentiment analysis using smileys can be implemented.

The data used for training should be improved by building a new corpus or integrating already available corpora. Such an integration requires further assurance of the accuracy of the corpora and whether they are useful or not.

Calculations based on the number of Tweets in each polarity class might be useful for investigating whether some methods are more trustworthy in evaluating a certain class of Tweets and whether they can be relied on most when a Tweet is classified into a certain class.

Finally, integrating multiple methods together to produce more accurate results might be investigated further; its success will depend on the way it is implemented. Again, user involvement at this stage might be useful in expanding the corpus and in increasing the accuracy of the evaluation methodologies.

7.3.4 Visualisation.

Other visualisation techniques might be developed, such as viewing the percentages of each class of Tweets as a pie chart, or viewing the distribution of Tweets on a map either per country or globally. Tweets associated with a particular term appearing in the WordCloud might be made accessible from the WordCloud itself, to allow the user to locate them easily and more efficiently. Other visualisation methods could show Tweets as they appear on Twitter, with all the reply Tweets, retweets and favourite counts. This is possible since the URL of each user's image and other attributes are provided by the Twitter API and some of the wrappers that use it in different languages.

7.3.5 Others

Building a user interface will help in providing some of the improvements suggested above and will give the user more options regarding which tools to use and how each step of the process is performed. The user interface might be implemented using Java Swing or possibly as a web interface; these options can be investigated in the future. Also, further mining technologies might be investigated, for example mining the content of URLs attached to Tweets or even mining images.

7.4 Conclusion.

In this report, the steps required for sentiment analysis were discussed to some extent and a variety of sentiment analysis methods were investigated. The steps used to achieve the objectives are retrieval, preprocessing, sentiment evaluation and visualisation of Tweets. Sentiment analysis was done using 4 different methods: word scoring, improved word scoring, NB/LM classification and the Stanford CoreNLP pipeline. The accuracy of the different methods ranged between 56-80%, with the Stanford pipeline obtaining the highest accuracy. Due to time limitations many ideas were not implemented; therefore, there are further improvements to be applied to the work achieved and to the results, which might be done in future projects by me or by other students.


References

[1] Statistic Brain, “Twitter Statistics”. (2012). Available at: http://www.statisticbrain.com/twitter-statistics/[Accessed on: 25th March 2014]

[2] NaCTeM., Redfearn, J. “Text Mining”. (2008).Available at: http://sitecore.jisc.ac.uk/publications/briefingpapers/2008/bptextminingv2.aspx[Accessed on 25th March 2014]

[3] McNaught, J. “NaCTeM Brochure, The National Centre for Text Mining”.Available at: http://nactem.ac.uk/brochure/NaCTeM_Brochure.pdf[Accessed on 25th March 2014]

[4] Ananiadou, S. “National Centre for Text Mining: Introduction to Tools for Researchers”. (2008).Available at: http://sitecore.jisc.ac.uk/publications/briefingpapers/2008/bpnationalcentrefortextminingv1.aspx[Accessed on 25th March 2014]

[5] Liu, B. “Opinion Mining”. Available at: http://www.cs.uic.edu/~liub/FBS/opinion-mining.pdf[Accessed on 25th March 2014]

[6] Yin, J., Karimi, S., Robinson, B., Cameron, M. “ESA: Emergency Situation Awareness via Microbloggers”. (2012). Available at: http://www.ict.csiro.au/staff/jie.yin/files/de0418-yin-CIKM12.pdf [Accessed on 27th March 2014]

[7] Liu, B., Hu, M. “Mining Summarising Costumer Reviews”. (2004).Available at: http:// www.cs.uic.edu/~liub/publications/kdd04-revSummary.pdf [Accessed on 27th March 2014]

[8] Pestian, J. P., et al. “Sentiment Analysis of Suicide Notes: A Shared Task”. (2012).Available at: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3299408/[Accessed on 27th March 2014]

[9] Rosales, F. “6 Tools For Twitter Sentiment Tracking”. (2010). Available at: http://socialmouths.com/blog/2010/03/31/6-tools-for-twitter-sentiment-tracking/[Accessed on 27th March 2014]

[10] Maynard, D., Funk, A. “Automatic Detection of Political Opinions in Tweets”. (2011).Available at: http://gate.ac.uk/sale/eswc11/opinion-mining.pdf[Accessed on 27th March 2014]

[11] “GATE: An Overview”. Available at: http://gate.ac.uk/[Accessed on 28th March 2014]

[12] Soergel, D. “Information Retrieval”. (2004).Available at: http://www.dsoergel.com/NewPublications/HCIEncyclopediaIRShortEForDS.pdf[Accessed on 28th March 2014]


[13] Manning, C. D., Raghavan, P., Schutze, H. “Introduction to Information Retrieval”. Cambridge University Press. (2008). Available at: http://nlp.stanford.edu/IR-book/pdf/10xml.pdf[Accessed on 28th March 2014]

[14] Balke, W. “Introduction to Information Extraction: Basic Notions”. (2012). Available at: http://download.springer.com/static/pdf/371/art%253A10.1007%252Fs13222-012-0090-x.pdf?auth66=1397522691_434139fa3c7541453bf1c3f0fbce0435&ext=.pdf[Accessed on 28th March 2014]

[15] Hobbs, J. R. et. al. “SRI International: Description of The FASTUS System Used for MUC-4”. (1992). Available at: http://acl.ldc.upenn.edu/M/M92/M92-1036.pdf[Accessed on 28th March 2014]

[16] Bontcheva, K. et al. “TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text”. (2013).Available at: http://gate.ac.uk/sale/ranlp2013/twitie/twitie-ranlp2013.pdf[Accessed on 28th March 2014]

[17] Phillips, W. “Introduction to Natural Language Processing”. (2006). Available at: http://www.mind.ilstu.edu/curriculum/protothinker/natural_language_processing.php?modGUI=194&compGUI=1970&itemGUI=3445[Accessed on 30th March 2014]

[18] Markov, Z. “Introduction to Data Mining”.Available at: http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-1.html[Accessed on 30th March 2014]

[19] The Stanford Natural Language Processing Group. “Stanford CoreNLP: A Suite of Core NLP Tools”. (2013). Available at: http://nlp.stanford.edu/software/corenlp.shtml[Accessed on 30th March 2014]

[20] Thearling, K. “An Introduction to Data Mining: Discovering hiding value in your data warehouse”. (2012). Available at: http://www.thearling.com/text/dmwhite/dmwhite.htm[Accessed on 1st April 2014]

[21] Cohen K. B., Hunter L. “Getting Started in Text Mining”. (2008). Available at: http://www.ploscompbiol.org/article/info%3Adoi/10.1371/journal.pcbi.0040020#pcbi-0040020-se001[Accessed on 1st April 2014]

[22] Maynard, D., Bontcheva, K., Rout, D. “Challenges in developing opinion mining tools forsocial media”. (2012).Available at: http://gate.ac.uk/sale/lrec2012/ugc-workshop/opinion-mining-extended.pdf[Accessed on 1st April 2014]


[23] Twitter. “The Streaming APIs”. (2012). Available at: https://dev.twitter.com/docs/api/streaming[Accessed on 2nd April 2014]

[24] “Twitter4J”. (2012). Available at: http://twitter4j.org/en/index.html[Accessed on 2nd April 2014]

[25] Liu, B., Hu, M. “Mining and Summarising Customer Reviews”. (2004). Available at: www.cs.uic.edu/~liub/publications/kdd04-revSummary.pdf[Accessed on 2nd April 2014]

[26] Socher, S. et. al. “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”. (2013). Available at: http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf[Accessed on 5th April 2014]

[27] Han, B., Baldwin, T. “Lexical Normalisation of Short Text Messages: Makn Sens a #twitter ”. (2011). Available at: http://dl.acm.org/citation.cfm?id=2002520[Accessed on 5th April 2014]

[28] Bal, D. et. al. “Exploiting Emoticons in Sentiment Analysis”. (2013). Available at:http://eprints.eemcs.utwente.nl/23268/01/sac13-senticon.pdf[Accessed on 5th April 2014]

[29] Tsoumakas, G., Katakis, I. “Multi-Label Classification: An Overview”. (2007). Available at: http://talos.csd.auth.gr/tsoumakas/publications/J6.pdf[Accessed on 8th April 2014]

[30] Mark. “Effective Use of The WordCloud”. (2011). Available at: http://epicgraphic.com/effective-use-of-the-word-clou/[Accessed on 8th April 2014]

[31] Twitter. “REST API Rate Limiting in v1.1”. (2013). Available at: https://dev.twitter.com/docs/rate-limiting/1.1[Accessed on 8th April 2014]

[32] “google-api-spelling-java: A Java API for Google spell checking service”. (2009).Available at: http://code.google.com/p/google-api-spelling-java/[Accessed on 9th April 2014]

[33] LanguageTool. “LanguageTool Wiki: Java API”. (2014).Available at: http://wiki.languagetool.org/java-api[Accessed on 10th April 2014]

[34] Fisher, T. “Top Twitter Abbreviations You Need to Know”. (2012). Available at: http://socialmediatoday.com/emoderation/512987/top-twitter-abbreviations-you-need-know[Accessed on 10th April 2014]


[35] Atkinson, C. “Deciphering Emoticons for Text Analytics: A Macro-Based Approach”. (2013).Available at: http://support.sas.com/resources/papers/proceedings13/104-2013.pdf[Accessed on 11th April 2014]

[36] Daniel, P. “Internet Smiley Face Guide”. Available at: http://www.newbie.org/SmileyFAQ/[Accessed on 11th April 2014]

[37] Wikipedia. “List of emoticons”. (2014). Available at: http://en.wikipedia.org/wiki/List_of_emoticons[Accessed on 11th April 2014]

[38] “twitter-sentiment-analysis: twitter sentiment analysis using machine learning techniques”. (2011). Available at: https://code.google.com/p/twitter-sentiment-analysis/[Accessed on 13th April 2014]

[39] “twitter-sentiments: Analyse brand sentiments derived from twitter messages”. (2013).Available at: https://code.google.com/p/twitter-sentiments/[Accessed on 14th April 2014]

[40] thinknook.com. “Twitter Sentiment Analysis Training Corpus (Dataset)”. (2012). Available at: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/[Accessed on 14th April 2014]

[41] alias-i.com. “Javadoc: Class DynamicLMClassifier”. (2011). Available at:http://alias-i.com/lingpipe/docs/api/com/aliasi/classify/DynamicLMClassifier.html[Accessed on 14th April 2014]

[42] alias-i.com. “Javadoc: Class NaiveBayesClassifier”. (2011). Available at: http://alias-i.com/lingpipe/docs/api/com/aliasi/classify/NaiveBayesClassifier.html[Accessed on 14th April 2014]

[43] nlp.stanford.edu. “Javadoc: Class PTBTokenizer”. (2013).Available at: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/process/PTBTokenizer.html[Accessed on 16th April 2014]

[44] Letsch, C. “Turkey Twitter users flout Erdogan ban on micro-blogging site”. (2014). Available at: http://www.theguardian.com/world/2014/mar/21/turkey-twitter-users-flout-ban-erdogan[Accessed on 19th April 2014]


Appendix A

The table below, reproduced from the Twitter API documentation (https://dev.twitter.com/docs/platform-objects/tweets#obj-coordinates), lists the attributes that are returned with each Tweet in JSON format and wrapped into a Java object by the Twitter4J library. A short usage sketch follows the table.

Field (Type): Description

annotations (Object): Unused. Future/beta home for status annotations.

contributors (Collection of Contributors): Nullable. A collection of brief user objects (usually only one) indicating users who contributed to the authorship of the Tweet, on behalf of the official Tweet author.

coordinates (Coordinates): Nullable. Represents the geographic location of this Tweet as reported by the user or client application. The inner coordinates array is formatted as geoJSON (longitude first, then latitude).

created_at (String): UTC time when this Tweet was created.

current_user_retweet (Object): Perspectival. Only surfaces on methods supporting the include_my_retweet parameter, when set to true. Details the Tweet ID of the user's own retweet (if existent) of this Tweet.

entities (Entities): Entities which have been parsed out of the text of the Tweet.

favorite_count (Integer): Nullable. Indicates approximately how many times this Tweet has been "favorited" by Twitter users.

favorited (Boolean): Nullable. Perspectival. Indicates whether this Tweet has been favorited by the authenticating user.

filter_level (String): Indicates the maximum value of the filter_level parameter which may be used and still stream this Tweet. So a value of medium will be streamed on none, low, and medium streams.

geo (Object): Deprecated. Nullable. Use the "coordinates" field instead.

id (Int64): The integer representation of the unique identifier for this Tweet. This number is greater than 53 bits and some programming languages may have difficulty/silent defects in interpreting it. Using a signed 64 bit integer for storing this identifier is safe. Use id_str for fetching the identifier to stay on the safe side.

id_str (String): The string representation of the unique identifier for this Tweet. Implementations should use this rather than the large integer in id.

in_reply_to_screen_name (String): Nullable. If the represented Tweet is a reply, this field will contain the screen name of the original Tweet's author.

in_reply_to_status_id (Int64): Nullable. If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet's ID.

in_reply_to_status_id_str (String): Nullable. If the represented Tweet is a reply, this field will contain the string representation of the original Tweet's ID.

in_reply_to_user_id (Int64): Nullable. If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet's author ID. This will not necessarily always be the user directly mentioned in the Tweet.

in_reply_to_user_id_str (String): Nullable. If the represented Tweet is a reply, this field will contain the string representation of the original Tweet's author ID. This will not necessarily always be the user directly mentioned in the Tweet.

lang (String): Nullable. When present, indicates a BCP 47 language identifier corresponding to the machine-detected language of the Tweet text, or "und" if no language could be detected.

place (Places): Nullable. When present, indicates that the Tweet is associated with (but not necessarily originating from) a Place.

possibly_sensitive (Boolean): Nullable. This field only surfaces when a Tweet contains a link. The meaning of the field doesn't pertain to the Tweet content itself, but instead it is an indicator that the URL contained in the Tweet may contain content or media identified as sensitive content.

scopes (Object): A set of key-value pairs indicating the intended contextual delivery of the containing Tweet. Currently used by Twitter's Promoted Products.

retweet_count (Int): Number of times this Tweet has been retweeted. This field is no longer capped at 99 and will not turn into a String for "100+".

retweeted (Boolean): Perspectival. Indicates whether this Tweet has been retweeted by the authenticating user.

retweeted_status (Tweet): Users can amplify the broadcast of Tweets authored by other users by retweeting. Retweets can be distinguished from typical Tweets by the existence of a retweeted_status attribute. This attribute contains a representation of the original Tweet that was retweeted. Note that retweets of retweets do not show representations of the intermediary retweet, but only the original Tweet. (Users can also unretweet a retweet they created by deleting their retweet.)

source (String): Utility used to post the Tweet, as an HTML-formatted string. Tweets from the Twitter website have a source value of web.

text (String): The actual UTF-8 text of the status update. See twitter-text for details on what is currently considered valid characters.

truncated (Boolean): Indicates whether the value of the text parameter was truncated, for example, as a result of a retweet exceeding the 140 character Tweet length. Truncated text will end in an ellipsis, like this ... Since Twitter now rejects long Tweets rather than truncating them, the large majority of Tweets will have this set to false. Note that while native retweets may have their top-level text property shortened, the original text will be available under the retweeted_status object and the truncated parameter will be set to the value of the original status (in most cases, false).

user (Users): The user who posted this Tweet. Perspectival attributes embedded within this object are unreliable.

withheld_copyright (Boolean): When present and set to "true", it indicates that this piece of content has been withheld due to a DMCA complaint.

withheld_in_countries (Array of String): When present, indicates a list of uppercase two-letter country codes this content is withheld from.

withheld_scope (String): When present, indicates whether the content being withheld is the "status" or a "user".
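To illustrate how these attributes are accessed in practice, the listing below is a minimal sketch using Twitter4J: it searches for Tweets and reads a few of the fields from the table through the corresponding getters of the Status object. It assumes Twitter4J 3.x or later with OAuth credentials supplied in a twitter4j.properties file; the class name and the search keyword are illustrative only.

import java.util.List;

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;

// Minimal sketch: search for Tweets and print some of the fields listed in the
// table above via the Twitter4J Status object. Assumes OAuth credentials are
// provided in twitter4j.properties; the query keyword is illustrative only.
public class TweetFieldsSketch {
    public static void main(String[] args) throws TwitterException {
        Twitter twitter = TwitterFactory.getSingleton();
        QueryResult result = twitter.search(new Query("manchester"));
        List<Status> tweets = result.getTweets();

        for (Status status : tweets) {
            System.out.println("id:            " + status.getId());            // id (Int64)
            System.out.println("text:          " + status.getText());          // text (String)
            System.out.println("created_at:    " + status.getCreatedAt());     // created_at, as java.util.Date
            System.out.println("retweet_count: " + status.getRetweetCount());  // retweet_count (Int)
            System.out.println("retweeted:     " + status.isRetweeted());      // retweeted (Boolean)
            System.out.println("user:          " + status.getUser().getScreenName()); // user (Users)
        }
    }
}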
