
SENTIMENT ANALYSIS AND CLASSIFICATION OF ONLINE REVIEWS USING

CATEGORICAL PROPORTIONAL DIFFERENCE

A PROJECT REPORT

SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

MASTER OF SCIENCE

IN

COMPUTER SCIENCE

UNIVERSITY OF REGINA

By

Dorothy Aku Allotey

Regina, Saskatchewan

December 2, 2011

© Copyright 2011: Dorothy Aku Allotey


Abstract

Sentiment analysis and classification is an area of text classification that began around

2001 and has recently been receiving a lot of attention from researchers. Sentiment analysis

involves analyzing textual datasets which contain opinions (e.g., social media, blogs, discussion

groups, and internet forums) with the objective of classifying the opinions as positive, negative,

or neutral. Classification of textual objects according to sentiment is considered a more difficult

task than classification of textual objects according to content because opinions in natural

language can be expressed in subtle and complex ways containing slang, ambiguity, sarcasm,

irony, and idiom.

Various measures, including Information gain, simple chi-square, feature relation

network, log-likelihood ratio, and minimum frequency thresholds, have previously been used as

feature selection methods in sentiment analysis and classification. In this report, we investigate

the performance of categorical proportional difference (CPD), a novel feature selection method used for text classification. While CPD has previously been shown by others to be useful in sentiment classification of text documents, here we apply CPD to the

classification of online reviews. Online reviews differ from typical text documents in that they

contain few features (i.e., they are short documents). The results obtained using CPD on online


reviews are compared to those obtained using information gain and simple chi-square. We also

apply two weighting schemes, called Feature Presence and SentiWordNet scores, in combination

with each of CPD, information gain and simple chi-square, to classify the reviews as either

positive or negative. Experimental results show that given a dataset containing labeled movie

reviews, CPD generates a classification accuracy of 88.1% with 42% of the features. The

performance of Information gain and simple chi-square varied depending on the dataset.

Information gain performed better than simple chi-square on the movie review dataset while

simple chi-square performed better than information gain on the congressional floor debate

dataset. We therefore recommend CPD as a good feature selection method for sentiment classification tasks.


Acknowledgements

I would like to express my sincere appreciation to my supervisor, Dr. Robert Hilderman, for all the assistance and guidance through all stages of this project. This project would not have been possible without his advice and support. I would like to thank the Faculty of Graduate

Studies and Research for their financial support. To all friends who helped in one way or the

other, I am very grateful.


Contents

Abstract ii

Acknowledgements iv

Table of Contents v

List of Tables vii

List of Figures viii

Chapter 1 INTRODUCTION 1

1.1 Statement of the Problem ....................................................................................... 2

1.2 The Process of Sentiment Analysis and Classification .......................................... 3

1.3 Objectives of the project ........................................................................................ 4

1.4 The contributions of the project. ............................................................................ 5

1.5 Organization of the Project Report. ....................................................................... 6

Chapter 2 BACKGROUND 7

2.1 General Overview of Sentiment Analysis and Classification ................................ 8

2.2 Factors that affect sentiment analysis and classification ....................................... 8

2.3 Resources for Sentiment Analysis and Classification ......................................... 10

2.3.1 The POS tagger ......................................................................................... 11

2.3.2 SentiWordNet ........................................................................................... 12

2.3.3 Dictionary Structure .................................................................. 12

Chapter 3 RELATED WORK 15

3.1 Recent Work ........................................................................................................ 15

Chapter 4 METHODOLOGY 21


4.1 Assumptions ....................................................................................... 21

4.2 Description of our approach................................................................................. 22

4.3 SentiWordNet Scoring ......................................................................................... 22

4.4 Feature Selection .................................................................................................. 25

4.4.1 Feature selection methods ......................................................................... 25

4.4.2 CPD ........................................................................................................... 26

4.4.3 IG and χ2 ................................................................................................... 27

4.5 Classifier … ......................................................................................................... 28

4.6 Classification........................................................................................................ 28

Chapter 5 EXPERIMENTAL RESULTS 30

5.1 Hardware and Software ........................................................................................ 30

5.2 Weka…. ............................................................................................................... 31

5.3 Datasets … ........................................................................................................... 31

5.4 Interest Measures ................................................................................................. 32

5.5 Results .................................................................................................................. 34

5.5.1 Initial Classification Without Feature Selection ....................................... 34

5.5.2 Results for Movie Review Dataset ........................................................... 35

5.5.3 Results for Convote Dataset ...................................................................... 38

5.6 Discussion ............................................................................................................ 41

5.6.1 CPD Performance ..................................................................................... 41

5.6.2 Comparison with Other Results ................................................................ 42

5.6.3 Drawbacks................................................................................................. 44

Chapter 6 CONCLUSION AND FUTURE WORK 45

References 47

Appendix 53


List of Tables

4.5 Contingency table ....................................................................................................... 26

5.1 Average 10-fold cross validation accuracies for base classifier in percent ................ 35

5.2 SVM accuracies for Movie review dataset in percent ................................................ 36

5.3 Naïve Bayes accuracies for Movie review dataset in percent ..................................... 36

5.6 SVM accuracies for convote dataset in percent .......................................................... 39

5.7 Naïve Bayes accuracies for convote dataset in percent .............................................. 39

5.11 Accuracy Comparison ............................................................................................... 43


List of Figures

1.1 Sentiment analysis and classification steps................................................................... 4

2.1 Sample SentiWordNet output for the word “boring” ................................................. 13

4.1 SentiWordNet output .................................................................................................. 23

4.2 A sample dynamic array ............................................................................................. 23

4.3 Sample movie review dataset in arff format ............................................................... 24

4.4 Sample binary feature vector representation............................................................... 25

4.5 Experiment Execution stages ...................................................................................... 29

5.1 Sample Naïve Bayes classification results .................................................................. 34

5.4 SVM accuracies for the movie review dataset ............................................................ 37

5.5 Naïve Bayes Accuracies for movie review ................................................................. 38

5.8 SVM accuracies for convote dataset ........................................................................... 40

5.9 Naive Bayes accuracies for convote dataset ............................................................... 41

5.10 CPD feature selection process .................................................................................. 42


Chapter 1

INTRODUCTION

The World Wide Web and the Internet provide a forum through which an individual’s

process of decision making may be influenced by the opinions of others. For example, the

customer feedback system used by eBay.com allows customers to use free-form text to rate

products and services received while making the ratings available to other customers to

review before they make a purchase decision, in effect allowing a customer to make a more

informed decision. Customer feedback and product evaluations can also be found at many

online sites including epinions.com and amazon.com. Online sites such as

rottentomatoes.com allow movie buffs to leave reviews for movies they have seen. Online

sites, such as Facebook and blogs, allow users to leave opinions and comments. Other

online sites, such as cnn.com and globeandmail.com, allow readers to leave comments.

These kinds of online media have resulted in large quantities of textual data containing

opinion and facts. Over the years, there has been extensive research aimed at analyzing and

classifying text and data [6], where the objective is to assign predefined category labels to

documents based upon learned models [7]. However, more recent research has attempted to

analyze textual data to determine how an individual “feels” about a particular topic (i.e., the


individual’s sentiment towards that topic). This has led to the development of sentiment

analysis and classification systems [1]. Sentiment analysis and classification are technically

challenging because opinions can be expressed in subtle and complex ways, involving the

use of slang, ambiguity, sarcasm, irony and idiom.

Sentiment analysis and classification is performed for several reasons, for example to

track the ups and downs of aggregate attitudes to a brand or product [13], to compare the

attitudes of online customers between one brand or product and another, and to pull out examples

of particular types of positive or negative statements on some topic. It may also be performed to

enhance customer relationship management and to help other potential customers make informed

choices.

The remainder of this chapter is organized as follows. In Section 1.1, we give a

statement of the problem. In Section 1.2, we describe the steps involved in sentiment

analysis and classification. In Section 1.3, the objectives of this project are discussed, and in

Section 1.4, the contributions of this project are summarized. The organization of this

project report is provided in Section 1.5.

1.1 Statement of the Problem

Sentiment analysis has been formally referred to as a broad (definitionally challenged) area

of natural language processing, computational linguistics, and text mining [10]. Generally

speaking, it aims to determine the attitude of a speaker or a writer with respect to some topic.

Their attitude may be a judgment or evaluation, an affective state (that is to say, the emotional

state of the author when writing), or an emotional communication (that is to say, the emotional

effect the author wishes to have on the reader). With the increasing amounts of text in on-line

documents, efforts have been made to organize this information using automatic text


classification (categorization) techniques. However, using these approaches to classify sentiment

in documents has not been effective. This is due to the fact that text classification involves

automatically sorting a set of documents into categories from a predefined set. Most research has

been based on topical classification of these texts. Sentiment analysis on the other hand goes

beyond topical classification of texts to include extracting opinions and classifying them as

positive or negative, and favourable or not favourable.

This presents challenges which may not be easily addressed by simple text classification

approaches. Consequently, there is a need to incorporate methods for classifying opinions into a

general text classification tool, or to develop separate systems that will be able to analyze and

accurately classify sentiments in text. This need has arisen in recent years due to the application

of sentiment analysis in diverse areas such as market research, business intelligence, public

relations, electronic governance, web search, and email filtering. Other factors which have contributed to the increased interest in sentiment analysis include (according to Pang and Lee [1]):

- The rise of machine learning methods in natural language processing and information retrieval,
- The availability of datasets for machine learning algorithms to be trained on, specifically, the development of review-aggregation web-sites, and
- The realization of the challenges offered by commercial and intelligence applications.

1.2 The Process of Sentiment Analysis and Classification

Here, we provide a brief overview of the process of sentiment analysis and

classification. A more detailed description is provided in Chapter 2. A diagram summarizing

the steps required for text classification using sentiment is shown in Figure 1.1. In Figure 1.1,

sentiment analysis and classification is shown to consist of two main steps: preprocessing


step and classification step. The preprocessing step involves extracting the reviews or

documents from a source dataset. Terms in each review are parsed by a part-of-speech

(POS) tagger such as the Stanford POS tagger [18]. The tags are then used by a lexical

resource, such as SentiWordNet [27], to determine the sentiment score for each term. The

terms in each document, together with their sentiment scores, are stored as feature vectors to

be used as input to a text classifier.

Figure 1.1 Sentiment analysis and classification steps

The classification step involves using a text classifier to classify the selected features as

either positive or negative. The stored feature vectors become input to the text classifier. If

necessary, feature selection can be applied to reduce the number of features. In order to


arrive at the best results, a series of iterative steps, known as cross-validation, can be used to

estimate how accurately a predictive model will perform in practice.

1.3 Objectives of the project

The objectives of this project are:

- To analyze the Cornell movie review datasets used by Pang and Lee [5] and the congressional-speech corpus used by Thomas et al. [24] by incorporating features from SentiWordNet.
- To extract features based on SentiWordNet scores and use them for sentiment classification.
- To use CPD, simple chi-square (χ2), and information gain (IG) for feature selection [28], and to compare the performance of CPD to χ2 and IG in sentiment classification tasks.
- To evaluate the effect of feature selection on the overall performance of sentiment classification.
- To compare our experimental results with other well-known research results.

1.4 The contributions of the project.

This project makes the following contributions to the field of sentiment analysis and

classification:

- It demonstrates that CPD is an effective feature selection method, not only in text categorization, but also in sentiment classification.
- It shows that CPD is an effective feature selection method in the movie review and political review domains.
- It confirms that using feature presence in conjunction with SentiWordNet scores provides better sentiment classification accuracy than feature presence alone.

1.5 Organization of the Project Report.

The remainder of this report is organized as follows. In Chapter 2, we present some

background information on sentiment analysis and classification. In Chapter 3, we discuss

some related work in this research field. In Chapter 4, we explain our methodology. In

Chapter 5, we discuss our experimental results and compare them with other well known

results. In Chapter 6 we give our conclusions and suggest future work.


Chapter 2

BACKGROUND

Data available on the Internet comes in a variety of different formats. The simplest and

most common format is plain textual data. One example of textual data is online reviews.

Online reviews are broadly described as either objective or subjective [8]. An objective

review tends to contain mostly facts while a subjective review contains mostly opinions.

There are two common review formats [9]: the restricted review format and the free format.

In a restricted review, the reviewer is asked to separately describe the pros and cons, and to

write a comprehensive review. In a free format review, the reviewer writes freely without

any separation of the pros and cons.

In Section 2.1, we present an overview of how sentiment analysis and classification

work in general. In Section 2.2, we discuss some factors that affect sentiment analysis and

classification. In Section 2.3, we discuss SentiWordNet, a lexical resource used for

sentiment analysis and classification, and other lexical resources.


2.1 General Overview of Sentiment Analysis and Classification

Sentiments can be classified at the word, sentence, or document level. This project

focused on document-level sentiment classification. Document-level sentiment classification

can be described as follows. Given a set of related documents containing opinions (also

known as opinionated documents), we determine whether each document d expresses a

positive or negative opinion of (i.e., the sentiment toward) an object. Existing research in this

area makes the assumption that the opinionated document d (e.g., a movie review) contains

opinions regarding a single object [9]. This assumption typically holds for customer reviews of

products and services, but not for forum and blog posts due to the fact that such a post may

contain opinions on multiple products and services.

In trying to find the document-level sentiment, one approach that could be used is to

perform sentence-level sentiment classification on each sentence in the document and then sum

up all the sentence-level sentiments to produce the document-level sentiment. This, however,

requires that word-level sentiment classification be performed on each sentence prior to

sentence-level classification. One of the main challenges of document-level sentiment analysis

and classification is that not every part of the document is equally informative in inferring the

sentiment of the whole document [11]. However, identifying the useful sentences automatically

is itself a difficult learning problem. Some researchers have proposed automatic approaches to

extracting useful sentences [11]. Others have considered all sentences in the document in

conjunction with algorithms that were able to produce better experimental results [5, 12].

2.2 Factors that affect sentiment analysis and classification

To classify a document as either positive or negative according to the overall

sentiment expressed by the author is seen as a more challenging problem than strict text


classification. For example, opinions can often be expressed in a more complex manner,

making them difficult to identify from any of the words in the sentence or document when

considered in isolation. By just looking at the words in a review, one may not be able to

accurately classify the sentiments which have been expressed. Consider this review:

“This film should be brilliant. It sounds like a great plot, the actors are first grade, and the

supporting cast is good as well, and Stallone is attempting to deliver a good performance,

however, it can't hold up"

The presence of words such as "brilliant", "great", and "good" suggests a positive sentiment, so one might think it easy to identify the sentiment of a review from a set of keywords. However, Pang and Lee found from their experimental results that the classification accuracy obtained using a human-generated list of keywords was lower than that obtained using keywords generated by machine learning techniques [5].

The above review has an anaphor (e.g., the phrase “it can’t hold up”). It is not clear what

“it” in this phrase refers to, whether the movie or Stallone’s performance. This makes it difficult

to determine the overall polarity of the review. Many free format reviews contain anaphora, abbreviations, a lack of capitals, poor spelling, poor punctuation, and poor grammar.

The main factors that affect how opinions are analyzed and classified include the domain

of the datasets, the size of the datasets to be analyzed, the format of the datasets (labeled or

unlabelled), and quality of the dataset. The accuracy of sentiment classification can be influenced

by the domain of the items to which it is applied [1]. One reason is that the same phrase can

indicate different sentiment in different domains: for example, “go read the book” most likely

indicates positive sentiment for book reviews, but negative sentiment for movie reviews. A

review such as “it is so easy to predict the next action….” is a negative sentiment for a movie


plot but a positive sentiment for a political review. Difference in vocabularies across different

domains also adds to the difficulty when applying classifiers trained on labeled data in one

domain to test data in another.

Depending on the size of the datasets to be used, either manual or automatic approaches, or a combination of both, could be used. However, it is generally best to use a combination of both approaches, as a number of experiments have shown that even the worst results obtained from combining the approaches are superior to the best of the manual approaches and some of the automatic approaches [4]. The availability of labeled data also reduces the time taken to classify opinions. Without it, many researchers have had to use the linguistic/semantic approach of building lexicons, which is very time consuming and yet does not yield better performance. Lexicons are also specific to a particular language and domain, making the sentiment analysis and classification task even more difficult.

The quality of the dataset is also bound to affect sentiment classification performance. Due to

the fact that there is no quality control measure in place to check reviews, anyone can write

anything on the web. This results in many low quality reviews and review spam. Bing et al.

studied opinion spam and discovered that online reviews are normally made up of spam

messages which consist of untruthful or fake reviews, irrelevant reviews, and reviews which are

not actual reviews, but rather statements or questions [33]. Effective preprocessing of the data is

needed to reduce the level of spam in these reviews. However, this is a research area which has

not received much attention.

2.3 Resources for Sentiment Analysis and Classification

The goal of this research is to classify sentiments in text, and as such, there is the need to

use an approach that can determine the sentiments of extracted features in individual documents.


One major approach normally used is lexical induction which involves creating resources that

contain opinion information on words based on lexicons. Among the many lexical resources

available, SentiWordNet 3.0 was chosen for this research [27]. It is a lexical resource explicitly

devised for supporting sentiment classification and opinion mining applications. SentiWordNet

accepts a term and its part-of-speech as input. This requires the use of the Stanford POS tagger

to tag terms before their scores can be determined.

2.3.1 The POS tagger

A POS tagger parses a string of words (e.g., a sentence) and tags each term with its part

of speech. For example, parsing the following text which is an excerpt taken from a sample

review:

“well , its main problem is that it's simply too jumbled . it starts off " normal " but then

downshifts into this " fantasy " world in which you , as an audience member , have no

idea what's going on .

generates the following output:

“RB well , , PRP$ its JJ main NN problem VBZ is IN that PRP it VBZ 's RB simply

RB too JJ jumbled . PRP it VBZ starts RP off `` " JJ normal '' " CC but RB then

VBZ downshifts IN into DT this `` " NN fantasy '' " NN world IN in WDT which

PRP you , IN as DT an NN audience NN member , , VBP have DT no NN idea WP

what VBZ 's VBG going IN on . ”

Every term has been associated with a relevant tag indicating its role in the sentence, such as

VBZ (verb), NN (noun), JJ (adjective), etc. The full list of tags and their meanings is based on the Penn Treebank tag set, which appears to be the most widely used standard for POS tagging [26].
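For reference, the short sketch below shows how a document can be tagged programmatically with the Stanford tagger's MaxentTagger class. The model file path is an assumption (it depends on the tagger distribution installed), and tagString() emits word_TAG pairs, which differs slightly from the tag-before-word layout reproduced above.

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TagExample {
    public static void main(String[] args) throws Exception {
        // Load a trained tagger model; the file name below is an assumption and
        // depends on which model ships with the tagger distribution being used.
        MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");

        String excerpt = "well , its main problem is that it's simply too jumbled .";

        // tagString() returns the text with each token annotated with its Penn
        // Treebank tag, e.g. "well_RB", "problem_NN", "jumbled_JJ".
        System.out.println(tagger.tagString(excerpt));
    }
}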


2.3.2 SentiWordNet

WordNet is a publicly available database of words containing a semantic lexicon for the

English language that organizes words into groups called synsets (i.e., synonym sets). A synset is

a collection of synonym words linked to other synsets according to a number of different

possible relationships between the synsets (e.g., is-a, has-a, is-part-of, and others). SentiWordNet

is a publicly available lexical resource for research purposes providing a semi-supervised form of

sentiment classification [27] based on the annotation of all the synsets of WordNet according to

the notions of “positivity”, “negativity”, and “neutrality”.

Each synset s is associated with three numerical scores, Pos(s), Neg(s), and Obj(s), which indicate the degree to which the terms in the synset are positive, negative, or objective (i.e., neutral), respectively. Different senses of the same term may thus have different opinion-related

properties. For example, in SentiWordNet 1.0, the synset [estimable(J,3)], corresponding to the

sense “may be computed or estimated” of the adjective estimable, has an Obj score of 1.0 (and

Pos and Neg scores of 0.0), while the synset [estimable(J,1)] corresponding to the sense

“deserving of respect or high regard” has a Pos score of 0.75, a Neg score of 0.0, and an Obj

score of 0.25. Each of the three scores ranges from 0 to 1 and their sum is 1.0 for each synset. A

synset may have nonzero scores for all the three categories.

2.3.3 Dictionary Structure

The SentiWordNet dictionary can be downloaded online as a text file. In this file, the

term scores are grouped by the synset and the relevant part of speech. Each entry in the

dictionary is made of seven fields: the part of speech, the offset, the positive score, the negative

score, the objective score, the associated synset terms, and an example of the context in which

the term may be used. The parts of speech used in this dictionary are limited to adjective, noun,


verb, and adverb. The offset is a numerical value which uniquely identifies a synset in the

database. Associated synset terms are the list of terms included in that particular synset. Figure 2.1

shows a sample output for the term “boring” entered as input for the SentiWordNet dictionary.

Figure 2.1 Sample SentiWordNet output for the word “boring”

According to the SentiWordNet dictionary, the term “boring” can be used as either an

adjective or a noun. When used as an adjective, it has an offset of 01345307, a Pos score of

0.0, a Neg score of 0.25 and an Obj score of 0.75. Its associated synset terms include


“wearisome”, “tiresome”, “tedious”, and “slow”. The term could mean “so lacking in

interest as to cause mental weariness” and an example of a context in which the term may be

used is “a boring evening with uninteresting people”.

When the term is used as a noun, it may have two offsets, 00942799 and 00923130

depending on the context in which it is used. For both offsets, the term has a Pos score of

0.0, a Neg score of 0.0 and an objective score of 1.0. The word “drilling” is an associated

synset term and is used in the context of “the act of drilling a hole”. From the scores

generated for this term, we can see that as a noun it is a strongly objective term, whereas as an adjective it is a weakly negative term. Such a term, when used as a noun in a document, cannot contribute to determining the sentiment of the document, since our aim is to determine sentiment based on positive and negative scores. When used as an adjective, however, we may treat it as a negative term.
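To make the dictionary structure concrete, the sketch below reads one entry, assuming a tab-separated layout with the seven fields described above. This is an illustration only; the field order and the presence of an explicit objective score vary between SentiWordNet releases, in which case Obj can be derived as 1 - Pos - Neg.

public class SentiWordNetEntry {
    final String partOfSpeech;   // a, n, v, or r
    final String offset;         // numerical synset identifier
    final double posScore;
    final double negScore;
    final double objScore;
    final String synsetTerms;    // terms belonging to the synset
    final String gloss;          // example of the context in which the term is used

    SentiWordNetEntry(String line) {
        // Assumes one tab-separated line per entry with the seven fields above.
        String[] f = line.split("\t");
        partOfSpeech = f[0];
        offset       = f[1];
        posScore     = Double.parseDouble(f[2]);
        negScore     = Double.parseDouble(f[3]);
        objScore     = Double.parseDouble(f[4]);
        synsetTerms  = f[5];
        gloss        = f[6];
    }

    public static void main(String[] args) {
        String line = "a\t01345307\t0.0\t0.25\t0.75\t" +
                      "boring wearisome tiresome tedious slow\t" +
                      "a boring evening with uninteresting people";
        SentiWordNetEntry e = new SentiWordNetEntry(line);
        System.out.println(e.synsetTerms + ": Pos=" + e.posScore +
                           " Neg=" + e.negScore + " Obj=" + e.objScore);
    }
}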


Chapter 3

RELATED WORK

There is an increasing number and variety of research papers in the area of sentiment analysis

and classification. We chose to consider only related work which makes use of the following:

machine learning techniques, document-level classification, n-gram features (such as unigrams),

part-of-speech tagging, lexical resources (especially SentiWordNet), sentiment classification

techniques involving support vector machines (SVM) and Naïve Bayes, and the review domain,

such as movie reviews or product reviews. We did not consider other research work which used

natural language processing techniques, did not involve feature selection and did not utilize SVM

and Naïve Bayes classifiers.

3.1 Recent Work

Pang and Lee applied machine learning techniques to classify movie reviews according to

sentiment [5]. They employed Naive Bayes, Maximum Entropy, and SVM classifiers, and

observed that these do not perform as well on sentiment classification tasks as on traditional

topic-based text classification tasks. However, SVM did generate a higher accuracy than Naïve

Bayes and Maximum Entropy. They noticed that using unigrams as features with term presence


on SVM always yielded highly accurate results, but when bigrams were used as features, the

accuracy was lower than that of the unigrams. They also found that machine learning techniques outperformed human-produced baselines.

In other work, they attempted to improve the classifiers by using only the subjective

sentences in movie reviews [8]. They explored extraction methods based on a minimum cut

formulation framework which resulted in the development of efficient algorithms for sentiment

analysis. They noted that utilizing contextual information via this framework can lead to a

statistically significant improvement in polarity classification accuracy.

An approach to sentiment analysis using SVMs and a variety of diverse information sources

was introduced by Mullen and Collier [38]. Their work involved extracting value phrases (two

word phrases conforming to a particular part-of-speech) and assigning them sentiment

orientation values using pointwise mutual information. They used SVM in conjunction with

other hybrid models of SVM to classify their datasets. Their results indicated that their

proposed approach performed better when used on data which is not topic annotated. However,

when used on data that has been manually topic annotated, their approach did not yield much

improvement.

Ohana and Tierney studied sentiment classification using features built from the

SentiWordNet database of term polarity scores [34]. Their approach consisted of counting

positive and negative term scores to determine sentiment orientation. They also presented an

improvement of this by building a data set of relevant features using SentiWordNet and a

machine learning classifier. They implemented a negation detection algorithm to adjust

SentiWordNet scores accordingly for negated terms and set a threshold value in cases where

multiple SentiWordNet scores were found for a term. A three-fold classification approach was


used and results obtained were similar to those obtained using manual lexicons seen in the

literature. This result was close to the results obtained by Pang and Lee for a classifier based on single-document statistics and a word list for positive and negative terms generated manually

for the dataset [5]. In addition, their feature set approach yielded improvements over the baseline

term counting method. The results indicated that SentiWordNet could be used as an important

resource for sentiment classification tasks.

Dave et al. developed ReviewSeer, a document level opinion classifier that uses statistical

techniques and POS tagging information for sifting through and synthesizing product reviews,

essentially automating the sort of work done by aggregation sites or clipping services [25]. They

first used structured reviews for testing and training, identifying appropriate features and scoring

methods from information retrieval for determining whether reviews are positive or negative.

These results performed as well as traditional machine learning methods. They then used the

classifier to identify and classify review sentences from the web, where classification is more

difficult. However, a simple technique for identifying the relevant attributes of a product

produces a subjectively useful summary.

A simple unsupervised learning algorithm for classifying a review as recommended or

not recommended was presented by Turney [40]. The algorithm takes a written review as input

and produces a classification as output in a three step approach: using a part-of-speech tagger to

identify phrases in a review that contain adjectives or adverbs, estimating the semantic

orientation of each phrase extracted, and assigning the review to a class, either recommended or

not recommended, based on the average semantic orientation of the extracted phrases. If the

average is positive, the review is assumed to recommend the item, otherwise, the item is not

recommended. The pointwise mutual information and information retrieval (PMI-IR) algorithm is used to


measure the similarity of pairs of words or phrases to estimate the semantic orientation of a

phrase.

A method for sentiment classification based on extracting and analyzing appraisal groups,

such as “very good” or “not terribly funny”, was presented by Whitelaw et al. [35]. An appraisal

group is represented as a set of attribute values in several task-independent semantic taxonomies,

based on appraisal theory. Semi-automated methods were used to build a lexicon of appraising

adjectives and their modifiers. They heuristically extract adjectival appraisal groups from texts

and compute their sentiment scores according to this lexicon. Documents were represented as vectors of relative frequency features computed over these groups, and an SVM learning algorithm was used to learn a classifier discriminating positive from negative oriented text documents. Movie reviews were classified using features based upon these taxonomies combined with standard "bag-of-words" features, and a state-of-the-art accuracy of 90.2% was reported.

Abbasi et al. studied the use of sentiment analysis methodologies for classification of

Web forum opinions in multiple languages [36]. The utility of stylistic and syntactic features was

evaluated for sentiment classification of English and Arabic content. Stylistic features consist of

lexical and structural attributes (e.g., unique words, words per sentence, and sentence length)

while syntactic features incorporate manual, semi-automatic, or automatic annotation techniques

to add polarity scores to words and phrases (e.g. Word/POS tag n-grams, phrase patterns,

punctuation). Specific feature extraction components were integrated to account for the linguistic

characteristics of Arabic. The entropy weighted genetic algorithm (EWGA), a hybridized genetic

algorithm that incorporates the information-gain heuristic was developed for feature selection.

EWGA was designed to improve performance and get a better assessment of key features. The

proposed features and techniques were evaluated on a benchmark movie review dataset, and U.S.


and Middle Eastern Web forum postings. The experimental results obtained using EWGA with

SVM produced accuracy over 91%. Stylistic features significantly enhanced performance across

all testbeds, while EWGA also outperformed other feature selection methods, indicating the

utility of these features and techniques for document-level classification of sentiments.

Gamon demonstrated that it is possible to perform automatic sentiment classification in

the very noisy domain of customer feedback data [37]. This work showed that by using large

feature vectors in combination with feature reduction, linear SVMs can be trained to achieve

high classification accuracy on data that present classification challenges for a human annotator.

It also showed that the addition of abstract linguistic analysis features (e.g., stop word removal and spell checking) consistently contributed to an increase in classification accuracy in sentiment

classification.

Yessenalina et al. proposed a two-level approach to document-level sentiment

classification that extracts useful sentences and predicts document-level sentiment based on the

extracted sentences [11]. Their model, unlike previous learning methods for this task, does not

rely on the gold standard sentence-level subjectivity annotations and optimizes directly for

document-level performance. This model was evaluated using the movie reviews dataset and the

U.S. Congressional floor debates, and showed improved performance over previous approaches.

Thomas et al. [24] investigated the possibility of determining from the transcripts of U.S.

Congressional floor debates whether each speech (i.e., a continuous single-speaker segment of

text) represents support for or opposition to a proposed piece of legislation. They addressed the

problem by exploiting the fact that the speeches occur as part of a discussion and this allowed the

use of sources of information regarding relationships between discourse segments, such as

whether a given utterance indicates agreement with the opinion expressed by another. They


demonstrated that the incorporation of agreement modeling can provide substantial

improvements over the application of SVMs in isolation, which represents the state of the art in

the individual classification of documents. Their enhanced accuracies were obtained via a fairly

primitive automatically-acquired “agreement detector” and a conceptually simple method for

integrating isolated-document and agreement-based information. Their results showed the

potentially large benefits of exploiting sentiment-related discourse-segment relationships in

sentiment analysis tasks.

Mullen and Collier proposed an approach to sentiment analysis which uses SVMs to bring

together diverse sources of potentially pertinent information, including several favorability

measures for phrases and adjectives and, where available, knowledge of the topic of the text [38].

Models using the features introduced were further combined with unigram models and

lemmatized versions of the unigram models. The results of a variety of experiments are

presented, using both data which is not topic annotated and data which has been hand annotated

for topic. In the case of the former, their approach is shown to yield better performance than

previous models on the same data. In the latter, their results indicated that their approach may

allow for further improvements to be gained given knowledge of the topic of the text.


Chapter 4

METHODOLOGY

In this research we combined supervised and unsupervised sentiment classification

approaches to investigate the performance of CPD as a feature selection method in sentiment

classification. In this chapter, we discuss our experimental setup and execution steps. In Section

4.1, we state some assumptions. In Section 4.2, we provide an overview of our approach. In

Section 4.3, we discuss our scoring method. In Section 4.4, we discuss the feature selection

methods used. In Section 4.5, we discuss the classifier used. In Section 4.6, we discuss our

classification steps.

4.1 Assumptions

The following assumptions were made as part of our methodology:

- Each dataset is assumed to be restricted to one type of object. For example, all the movie reviews will be about movies.
- Unigrams (individual terms in a review, e.g., clear, noisy) are used as features (i.e., we do not use bi-grams and higher forms of n-grams).
- Stemming (the process of reducing inflected words to their base or root form, e.g., "clearly" is stemmed to "clear") could distort the part-of-speech of a term, so it is not used.
- For tagging, our aim is to find the sentiment of the whole document, thus whole documents are tagged rather than individual sentences.
- For terms with multiple polarity scores, the average of all these scores is used.
- The percentage of unique terms in the documents that are not found in the SentiWordNet dictionary is considered to be minimal, so such terms are ignored.
- Terms not found in the SentiWordNet dictionary are regarded as non-sentiment-bearing words.

4.2 Description of our approach

In this work, we used the general sentiment analysis and classification steps outlined in

Chapter 1 of this report plus a few additions. The pre-processing step involves downloading the

datasets from their online sources. Using the Stanford POS tagger [41], terms in each document are parsed to

determine the part-of-speech of each term. Tokenization is performed on the tagged dataset to

extract each term and its part-of-speech. To filter the terms for easy scoring, stop words are

removed and SentiWordNet is used to determine the score for each term.

4.3 SentiWordNet Scoring

Each document is parsed by SentiWordNet to determine the term scores. The version of

SentiWordNet that is used requires a term and its part-of-speech to produce a score. For a Neg

score, the value is preceded by a negation operator. For a Pos score, there is no preceding

operator. For an Obj score, the value generated is zero. Terms that are not found in the


SentiWordNet dictionary are automatically scored as zero. All terms in each document are stored

as binary-valued feature vectors representing a term and its score. In situations where a term had

more than one score, the average score is chosen. For example, the sample output generated after

one document is parsed by SentiWordNet is shown in Figure 4.1. Each line represents a term, its part-of-speech, and its score, where "a" represents adjective, "v" represents verb, and "n" represents

a noun.

nullhappy: a: 0.2881;

bastard: n: -0.2135;

head: n: 0.0035;

start: n: 0.0067;

movie: n: 0.0000;

strangeness: n: -0.2700;

kick: v: -0.015;

Figure 4.1 SentiWordNet output

A dynamic array, such as the one shown in Figure 4.2, is created to store the feature

vectors. This is an n by m matrix where n represents the total number of documents and m the

total number of terms or attributes in the entire dataset. This matrix has a unique id for each term

stored as the attribute heading and an id for each document stored as rows. At the intersection of

each column and row is the score of the term (i.e., each cell contains the score of a term in a

particular document). The matrix also stores the overall sentiment of each document as either 0

(negative) or 1 (positive) at the end of each row.

Term1 Term2 Term3 ---- TermM-1 TermM Class

Doc1 0.1201 -0.233 0.000 ---- -0.256 -0.956 0

Doc2 0.989 0.00 0.2365 ---- -0.003 0.5633 1

------

------

DocN-1 0.000 -0.123 0.0235 -0.9865 0.001 -0.212 0

DocN 0.2333 0.000 0.455 0.5615 -0.200 -0.0001 1

Figure 4.2 A sample dynamic array
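A minimal sketch of how such a dynamic array can be built is shown below. It assumes each document has already been reduced to a map from terms to their averaged SentiWordNet scores; the class and method names are illustrative and not taken from the project code.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScoreMatrix {
    private final Map<String, Integer> termIndex = new LinkedHashMap<>(); // term -> column
    private final List<Map<String, Double>> documents = new ArrayList<>();
    private final List<Integer> labels = new ArrayList<>();               // 0 = neg, 1 = pos

    // termScores maps each term in one document to its averaged SentiWordNet score.
    public void addDocument(Map<String, Double> termScores, int label) {
        for (String term : termScores.keySet()) {
            termIndex.putIfAbsent(term, termIndex.size());
        }
        documents.add(termScores);
        labels.add(label);
    }

    // Materializes the n-by-(m+1) matrix of Figure 4.2; cells for terms that do
    // not occur in a document are left at 0, and the last column holds the class.
    public double[][] toMatrix() {
        double[][] matrix = new double[documents.size()][termIndex.size() + 1];
        for (int i = 0; i < documents.size(); i++) {
            for (Map.Entry<String, Double> entry : documents.get(i).entrySet()) {
                matrix[i][termIndex.get(entry.getKey())] = entry.getValue();
            }
            matrix[i][termIndex.size()] = labels.get(i);
        }
        return matrix;
    }
}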


All the entries in a feature vector array are converted into attribute-relation file format

(arff) to serve as input for the classifier. Figure 4.3 shows the movie review dataset terms in arff

format. In Figure 4.3, txt_sentoken is the name of the dynamic array that was used. The terms are

represented as attributes with numeric values and each term is given an id or index value. The

output shows the results generated for three documents with the scores for each document

represented in a pair of braces ({}). The values in each of these sets represent the index of a term

@relation data/txt_sentoken

@attribute start numeric

@attribute great numeric

@attribute stars numeric

…………

…………

…………

@attribute white numeric

@attribute boring numeric

@attribute Class {0,1}

@data

{0 0.2881,1 -0.2135,4 0.0447,5 -0.125,6 0.0012,7 0.0035,8 0.00055,14

0.0153,16 0.08,17 -0.0197,24 -0.27,25 -0.0111,26 0.0136,30 0.0095,32 -

0.045,34 0.1701,35 0.02745,36 0.0403,39 0.0223,41 -0.2273,42 -0.0061,66

0.466571,67 0.002,69 0.1037,72 0.0006,73 -0.125}

{5 -0.125,16 0.08,17 -0.0197,56 0.0103,59 -0.0199,66 0.466571,76

0.0127,107 0.5407,109 0.3864,121 0.2625,165 -0.00215,186 0.3182,207

0.2727, 422 0.0137,448 0.0022,615 0.2083,943 0.3636,944 0.625,946

0.0121,947 0.25,948 0.1667,949 -0.0324,951 0.2045,956 -0.3182,1549 1}

{17 -0.0197,42 -0.0061,66 0.466571,113 -0.0302,132 -0.0571,135 0.0219,136

-0.375,230 0.0743,302 -0.0021,309 0.0139,564 0.0876,584 0.0338,589

0.0035,725 -0.5,843 -0.0255,845 0.0539,846 0.0059,847 -0.0793,848 -

0.0937,856 0.0307, 866 0.25,867 -0.0026,868 -0.4091,1549 1}

Figure 4.3 Sample movie review dataset in arff format

and its score. For positively labeled documents, the final value in the set represents the overall

document classification. For instance, for the value “1549 1”, “1549” represents the index of the

last column in the array and “1” is the class of the document. Negatively labeled documents did

not have this representation for the last column in the array. The index corresponding to the last


column was empty since its value was zero and the Weka model automatically ignores all zero

scores.
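The conversion can be sketched as follows. The sparse ARFF instance format lists only non-zero "index value" pairs inside braces, which is why negatively labeled documents (class value 0) omit the final class entry. The helper names below are illustrative.

import java.io.PrintWriter;
import java.util.Map;

public class SparseArffWriter {

    // Writes the @relation, @attribute, and @data header lines.
    public static void writeHeader(PrintWriter out, Iterable<String> terms) {
        out.println("@relation data/txt_sentoken");
        for (String term : terms) {
            out.println("@attribute " + term + " numeric");
        }
        out.println("@attribute Class {0,1}");
        out.println("@data");
    }

    // Writes one document as a sparse instance: only non-zero "index value"
    // pairs appear, so a class value of 0 is simply omitted.
    public static void writeInstance(PrintWriter out, Map<Integer, Double> scoresByIndex,
                                     int classIndex, int label) {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<Integer, Double> e : scoresByIndex.entrySet()) {
            if (e.getValue() != 0.0) {
                sb.append(e.getKey()).append(' ').append(e.getValue()).append(',');
            }
        }
        if (label != 0) {
            sb.append(classIndex).append(' ').append(label).append(',');
        }
        if (sb.charAt(sb.length() - 1) == ',') {
            sb.setLength(sb.length() - 1);   // drop the trailing comma
        }
        out.println(sb.append('}').toString());
    }
}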

4.4 Feature Selection

Before any dataset can be classified, each document must be converted into feature vector

representation, the most widely used representation in classification. This ensures that the most important

features that express the sentiments of each document are extracted while at the same time

preserving the meaning of the document. This is applicable not only in sentiment analysis, but

also in general text classification. In this project, unigrams are the only features used based on

the recommendations of Pang and Lee [5]. A binary feature vector representation of some terms

(found in the movie review documents) and their SentiWordNet scores is shown in Figure 4.4.

(basically 0.9162907318741551, hide 0.10536051565782635, drink

0.10536051565782635, year -0.7419373447293773, quickly 0.9162907318741551,

world -0.7419373447293773)

Figure 4.4 Sample binary feature vector representation

The efficiency of our classification depends on the number of features (unique terms)

extracted from the documents. The larger the number, the longer the classifier takes to classify a

document. Feature selection is used to enhance the efficiency of classification by eliminating less

relevant features from the documents. Each feature is scored based on some predefined measure

and the most relevant terms are selected based on this measure. In this project, we use three

feature selection methods, and based on the f-measure values, features are eliminated.

4.4.1 Feature selection methods

Both text classification and sentiment classification have benefited from the use of

feature selection methods to improve classification accuracy and efficiency. We used three


feature selection methods: IG, χ2, and CPD. IG and χ2 have been used extensively in previous

research [28, 29, 36]. CPD is a new method introduced by Simeon and Hilderman [29] and

recently studied in feature selection and weighting methods for sentiment analysis by O’Keefe

and Koprinska [39].

4.4.2 CPD

CPD measures the degree to which a word contributes to differentiating a particular

category from other categories in a text corpus. The possible values for CPD fall in the range -1

to 1, where values near -1 indicate that a term occurs in approximately an equal number of

documents in all categories and a value of 1 indicates that a term occurs in the documents in only

one category. Simeon and Hilderman used an example contingency table (see Table 4.5) to formally define CPD for a word w in category c as:

CPD(w, c) = (A - B) / (A + B),

where A is the number of times word w and category c occur together, B is the number of times word w occurs without category c, C is the number of times category c occurs without word w, D is the number of times neither word w nor category c occurs, and N = A + B + C + D is the total number of documents [29].

Table 4.5 Contingency table

            c          ¬c         Σ Row
w           A          B          A + B
¬w          C          D          C + D
Σ Column    A + C      B + D      N

CPD can be simply stated as the ratio of the difference between the number of documents of a category in which a word occurs and the number of documents of other categories in which the word also occurs, divided by the total number of documents in which the word occurs. The CPD for a word is the ratio associated with the category ci for which the value is greatest. That is,

CPD(w) = max over all categories ci of CPD(w, ci).
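For the two-class case used in this project, CPD can be computed directly from document frequencies, as in the sketch below (the method names are illustrative):

public class Cpd {

    // CPD(w, c) = (A - B) / (A + B), where A is the number of documents of
    // category c containing word w and B is the number of documents of the
    // other categories containing w.
    static double cpdForCategory(int a, int b) {
        return (a + b) == 0 ? 0.0 : (double) (a - b) / (a + b);
    }

    // CPD(w) = max over categories ci of CPD(w, ci); for two classes this is
    // just the larger of the two per-category ratios.
    static double cpdForWord(int positiveDocFreq, int negativeDocFreq) {
        return Math.max(cpdForCategory(positiveDocFreq, negativeDocFreq),
                        cpdForCategory(negativeDocFreq, positiveDocFreq));
    }

    public static void main(String[] args) {
        System.out.println(cpdForWord(40, 10));  // word mostly in positive reviews: 0.6
        System.out.println(cpdForWord(25, 25));  // word spread evenly over both classes: 0.0
    }
}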

4.4.3 IG and χ2

IG measures the decrease in entropy when a selected feature is present versus when it is

absent. χ2 is the common statistical test that measures divergence from the distribution expected

if one assumes the feature occurrence is actually independent of the class value. As a statistical

test, it is known to behave erratically for very small expected counts, which are common in text

classification due to rarely occurring word features and having few positive training examples

for a concept [28]. IG and χ2 can be defined generally by the following formulas:

IG(w) = e(pos, neg) - [Pword * e(tp, fp) + P¬word * e(fn, tn)],

where e(x, y) = -(x / (x + y)) log2(x / (x + y)) - (y / (x + y)) log2(y / (x + y)), and

χ2(w) = t(tp, (tp + fp) * Ppos) + t(fn, (fn + tn) * Ppos) + t(fp, (tp + fp) * Pneg) + t(tn, (fn + tn) * Pneg),

where t(count, expect) = (count - expect)^2 / expect.

Here, tp, fp, tn, and fn are the number of true positives, false positives, true negatives, and false negatives, respectively; that is, tp and fp are the numbers of positive and negative documents containing the word, while fn and tn are the numbers of positive and negative documents not containing it. pos = tp + fn and neg = fp + tn are the number of positive and negative documents, respectively. Ppos = pos / all is the probability of a positive, Pneg = neg / all is the probability of a negative, Pword = (tp + fp) / all is the probability of the word appearing, and P¬word = 1 - Pword is the probability of the word not appearing.
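The sketch below computes both measures from the tp, fp, fn, and tn counts defined above, following the formulation in [28]; it is an illustration rather than the project's actual code.

public class FeatureScores {

    // Entropy (in bits) of a two-way split containing x and y items.
    static double e(double x, double y) {
        double total = x + y;
        return total == 0 ? 0.0 : h(x / total) + h(y / total);
    }

    private static double h(double p) {
        return p == 0.0 ? 0.0 : -p * (Math.log(p) / Math.log(2));
    }

    // IG(w) = e(pos, neg) - [Pword * e(tp, fp) + P¬word * e(fn, tn)]
    static double infoGain(int tp, int fp, int fn, int tn) {
        double all = tp + fp + fn + tn;
        double pWord = (tp + fp) / all;
        return e(tp + fn, fp + tn) - (pWord * e(tp, fp) + (1 - pWord) * e(fn, tn));
    }

    // t(count, expect) = (count - expect)^2 / expect
    private static double t(double count, double expect) {
        return expect == 0 ? 0.0 : (count - expect) * (count - expect) / expect;
    }

    // χ2(w) as the sum of t() over the four cells of the contingency table.
    static double chiSquare(int tp, int fp, int fn, int tn) {
        double all = tp + fp + fn + tn;
        double pPos = (tp + fn) / all;
        double pNeg = (fp + tn) / all;
        return t(tp, (tp + fp) * pPos) + t(fn, (fn + tn) * pPos)
             + t(fp, (tp + fp) * pNeg) + t(tn, (fn + tn) * pNeg);
    }

    public static void main(String[] args) {
        // A word present in 40 of 50 positive and 10 of 50 negative documents:
        System.out.println(infoGain(40, 10, 10, 40));   // about 0.278 bits
        System.out.println(chiSquare(40, 10, 10, 40));  // 36.0
    }
}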


4.5 Classifier

This research utilized SVM and Naïve Bayes classifiers, two classifiers that have been

widely and frequently used for many text and sentiment classification tasks [5, 8, 24, 34, 37, 38].

Another reason for our using these classifiers is to be able to compare our results to the text

categorization results of Simeon and Hilderman [29] and the sentiment classification results of

Pang and Lee [5], Ohana and Tierney [34], and O'Keefe and Koprinska [39]. SVM performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories, while Naïve Bayes is a simple probabilistic classifier based on Bayes' theorem.

These classifiers were part of Weka, an open source data mining application which we used in

our experiments. We discuss Weka in Chapter 5.

4.6 Classification

Figure 4.5 describes the steps used in this research. The Weka data mining software is

used to classify our datasets using the generated matrix from the preprocessing stage which had

already been converted into arff format as input. The command line version of this software

enabled the creation of a model that is suitable for the purpose of our experiment. This model

uses its Naïve Bayes and SVM classifiers and the feature selection methods CPD, IG, and χ2 in the experiments. All the terms are used by each classifier in turn, first without any feature

selection methods.


Figure 4.5 Experiment Execution stages

The model is once again executed but with each feature selection method in turn, first

with CPD, then IG, and finally with χ2. A ten-fold cross validation was used to arrive at the best performance results. For each execution, the f-measure values are obtained by scoring each term using the current feature selection method, and the unique scores are stored in a sorted score list.

Based on these scores, 5% of the total number of features is eliminated after each classification.

The features that are eliminated after each run are those with the lowest f-measure values. This

process continues until there are no more features to be eliminated.
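For illustration, the sketch below shows a single classification run expressed through the Weka Java API rather than the command line: it loads the ARFF file produced in the preprocessing stage and estimates accuracy with ten-fold cross validation. The file name is a placeholder.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunClassifier {
    public static void main(String[] args) throws Exception {
        // Load the sparse ARFF matrix produced in the preprocessing stage; the
        // file name is a placeholder.
        Instances data = DataSource.read("movie_reviews.arff");
        data.setClassIndex(data.numAttributes() - 1);     // the Class attribute

        NaiveBayes classifier = new NaiveBayes();          // or weka.classifiers.functions.SMO
        Evaluation evaluation = new Evaluation(data);
        evaluation.crossValidateModel(classifier, data, 10, new Random(1));

        System.out.println("Accuracy:  " + evaluation.pctCorrect() + "%");
        System.out.println("F-measure: " + evaluation.weightedFMeasure());
        System.out.println("RMSE:      " + evaluation.rootMeanSquaredError());
    }
}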


Chapter 5

EXPERIMENTAL RESULTS

A series of experiments was performed to evaluate and compare CPD, IG, and χ2

on sentiment classification tasks. In this chapter, we discuss the experimental results. In Section

5.1, we give an overview of the hardware used. In Section 5.2, we discuss the Weka data mining

software. In Section 5.3, we discuss the datasets used in the experiments. In Section 5.4, we

present the interest measures used. In Section 5.5, we present our results. In Section 5.6, we

discuss our results.

5.1 Hardware and Software

The experiments were executed on two systems: the first was an HP G60 notebook

running Windows Vista with 3GB RAM, 250G hard disk space, and a 2.0GHz Pentium(R)

Dual-Core CPU, and the other was a Dell Desktop PC running Windows XP Professional

with 3GB RAM, 250G hard disk space and a 1.8GHz processor. The software used included

the Weka data mining application (version 3.7), Java SE (version 6), the Netbeans IDE

(version 6.9), and the basic English Stanford POS tagger (version 3.0.4).


5.2 Weka

Weka is an acronym for Waikato Environment for Knowledge Analysis. The Weka

application is a collection of state-of-the-art machine learning algorithms and data processing

tools for data mining tasks. It is open source software issued under the GNU General Public

License and runs on almost any platform. The algorithms can either be applied directly to a

dataset or called from one’s own Java code. It is suited for developing new machine learning

schemes [30]. All Weka algorithms take their input in the form of a single relational table in

ARFF (Attribute-Relation File Format) format which can be read from a file or generated by a

database query.

Weka has a variety of classifiers and filters to choose from. Its main strength lies in the

classification area, where many machine learning approaches have been implemented [32]. It

comes with four interfaces: the Explorer, the Knowledge Flow, the Experimenter, and the

command line interface. The command line interface provides the basic functionality that lies

behind Weka’s interactive interfaces. It can be accessed in raw form by entering textual

commands in order to gain access to all features of the system. The command line accepts Java

syntax and is made up of Java features such as classes, instances, and packages. Weka provides a

directory containing documentation of all Java classes it uses. The command line is

recommended for in-depth experimental usage because it offers some functionality which is not

available via the GUI - and uses far less memory. This project used the command line version to

execute all experiments.

5.3 Datasets

The Cornell movie review dataset (sentiment polarity dataset v2.0) and the congressional-

speech corpus (convote dataset) were used for our experiments. The movie review dataset is a


popular dataset that has been previously used for research in sentiment classification. It consists

of 2000 processed text files, half of which have been categorized into positive (pos) reviews and

the other half negative (neg). This was downloaded from

http://www.cs.cornell.edu/People/pabo/movie-review-data/. All files were renamed according to

the category they belonged to and their count. For instance, all text files corresponding to

positive (negative) reviews were renamed beginning with the suffix “pos” (“neg”) followed by a

number. For example, pos(30) represents the thirtieth positive text file and neg(500) represents

the five hundredth negative text file.

The convote dataset was downloaded from

http://www.cs.cornell.edu/home/llee/data/convote.html. The original dataset includes three

stages of tokenized speech-segment data, corresponding to three different stages in the analysis

pipeline employed by Thomas et al. [24]. For our experiment, we only used the data in stage 3,

consisting of speech-segments that had been classified into support for, or opposition to, a

particular bill.

All filenames ending in Y were renamed as positive, and all those ending in N, as

negative, using the same labelling format used for the movie dataset. The files had been divided

into three sets: the development set, the training set, and the test set. The training set was used since it contained the majority of the data. It comprised 1,200 positive and 1,200 negative documents.

5.4 Interest Measures

The performance of most classifiers is measured using the popular metrics precision,

recall, f-measure and accuracy. Precision measures the exactness of a classifier. A higher

precision means fewer false positives (negative documents which have been wrongly classified

as positive), while a lower precision means more false positives. This is often at odds with recall,


as an easy way to improve precision is to decrease recall. Recall measures the completeness, or

sensitivity, of a classifier. Higher recall means fewer false negatives (positive documents which

have been wrongly classified as negative), while lower recall means more false negatives.

Improving recall can often decrease precision because it gets increasingly harder to be precise as

the sample space increases. The f-measure is the harmonic mean of precision and recall, and it gives higher scores when precision and recall are closer together. Accuracy is the proportion of documents that are classified correctly.
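For reference, these measures can be computed directly from the counts in a two-class confusion matrix, as the short sketch below shows; the counts used are hypothetical.

public class InterestMeasuresSketch {
    public static void main(String[] args) {
        // Hypothetical confusion-matrix counts.
        double tp = 480, fp = 110, tn = 390, fn = 20;

        double precision = tp / (tp + fp);                    // exactness
        double recall    = tp / (tp + fn);                    // completeness (sensitivity)
        double fMeasure  = 2 * precision * recall / (precision + recall); // harmonic mean
        double accuracy  = (tp + tn) / (tp + fp + tn + fn);   // fraction classified correctly

        System.out.printf("P=%.3f R=%.3f F=%.3f Acc=%.3f%n",
                precision, recall, fMeasure, accuracy);
    }
}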

The performance measures used in this research are the same as those used by Simeon and Hilderman [29]. Figure 5.1 shows the results generated after 1,000 (500 positive and 500

negative) documents of the convote dataset are classified using the Naïve Bayes classifier. The

measures used include the accuracy, true positive, false positive, true negative, false negative,

precision, recall, f-measure and root mean square error.

For a feature size of 10,543, the accuracy was 59.8%, the f-measure was 0.597, and the root

mean squared error was 0.618. For both classes (positive=1 and negative=0), the average recall

was 0.598, average precision was 0.599, average true positive was 0.598, average false positive

was 0.402, average true negative was 0.598, and average false negative was 0.402. For a feature

size of 9,476, the accuracy was 60.07%, the f-measure was 0.605, and the root mean squared error

was 0.612. For both classes, the average recall was 0.607, average precision was 0.609, average

true positive was 0.607, average false positive was 0.393, average true negative was 0.607, and

average false negative was 0.393.


Figure 5.1 Sample Naïve Bayes classification results

5.5 Results

Experiments with CPD and the other feature selection methods using the SVM and Naïve Bayes classifiers have shown that, in general, according to the f-measure, CPD performs better than the other feature selection methods in most text categorization tasks [29]. Here, our objective is to experiment with CPD, IG, and χ2 for sentiment classification using the SVM and Naïve Bayes classifiers to ascertain whether this conclusion still holds. The results of our experiments are

presented in the following sections.

5.5.1 Initial Classification Without Feature Selection

Table 5.1 shows the summary of average ten-fold cross validation accuracies in percent

for the base classifier. SVM performed better than Naïve Bayes for both datasets, results that are

consistent with those obtained by Pang et al. [5]. For the movie review dataset, SVM had an

accuracy of 79.45%, while Naïve Bayes had an accuracy of 78.95%. For the convote dataset,


accuracies for both classifiers were reduced. SVM had an accuracy of 65.96% and Naïve Bayes

had an accuracy of 57.38%.

Table 5.1 Average 10-fold cross validation accuracies for base classifier in percent

Dataset Number of features SVM Naïve Bayes

Movie Reviews 36,467 79.45 78.95

Convote 15,549 65.96 57.38

5.5.2 Results for Movie Review Dataset

Tables 5.2 and 5.3 show the summary results for the movie review dataset generated

when feature selection is performed during the classification stage using SVM and Naïve Bayes,

respectively. The first line of each table is repeated from Table 5.1. The best results generated for each feature selection method are shown in bold. Raw results can be found in Appendix A. From Table 5.2, we notice that CPD performs better than IG and χ2. Its best accuracy of 88.1% was obtained with only 42% of the features. IG, on the other hand, generated its best accuracy of 82.2% with 19% of the features. χ2 did not generate any accuracy values greater than the base classifier and performed poorly compared to CPD and IG. This could be attributed to the fact that χ2 is known to behave erratically for the very small expected counts that are common in text classification, where many word features occur rarely.


Table 5.2 SVM accuracies for Movie review dataset in percent

# of Features Percentage CPD IG χ2

36,467 100 79.45 79.45 79.45

27,350 75 79.05 79.45 77.5

17,536 48 84.5 78.75 77.5

15,204 41.7 88.1 78.75 77.5

9,116 25 88.1 80.95 77.5

7,037 19 88.1 82.2 65.2

3,647 10 88.1 80.65 63.45

Table 5.3 Naïve Bayes accuracies for Movie review dataset in percent

# of Features Percentage CPD IG χ2

36,467 100 78.95 78.95 78.95

27,350 75 79.05 79.45 77.2

17,536 48 77.95 73.7 67.45

7,190 19.7 77.95 67.1 67.45

3,647 10 77.95 65.3 63.95

Table 5.3 shows the summary of the Naïve Bayes accuracies generated for the movie review dataset. With 75% of the features, IG had the highest accuracy of 79.45%, while CPD had an accuracy of 79.05%. χ2 did not perform any better than the base classifier. The lowest accuracy produced by CPD was 77.95%, the lowest for IG was 65.3%, and the lowest for χ2 was 63.95%. Comparing CPD with IG, even though IG had the highest accuracy at 79.45%, the general


performance of CPD was better since its accuracy remained constant after 50% of the features had been eliminated, while that of IG decreased.

Figure 5.4 SVM accuracies for the movie review dataset

Figure 5.4 shows a graphical representation of the SVM accuracies for the movie review dataset at different numbers of features. From this figure, we notice that as the number of features decreases, the CPD accuracy increases until about half of the features have been eliminated, and then remains consistent as more features are eliminated. IG, on the other hand, showed no consistency: its accuracy decreased, then increased, and then decreased again. The accuracy of χ2 did not increase at all; it decreased and remained relatively consistent until, at 19% of the features, it began to decrease again.

Figure 5.5 shows a graphical representation of the accuracies generated by Naïve Bayes for the movie review dataset. We notice that the accuracy for CPD and IG increased after 25% of the features were eliminated and, after that, the accuracy decreased as the number of features


decreased. We also notice that CPD performed better than IG and χ2, even though its accuracy decreased.

Figure 5.5 Naïve Bayes accuracies for the movie review dataset

5.5.3 Results for Convote Dataset

Tables 5.6 and 5.7 show the summary results generated when feature selection is performed during the classification stage using SVM and Naïve Bayes, respectively. The best results generated for each feature selection method are shown in bold. For the convote dataset, the accuracies generated were lower than those for the movie review dataset. This can be attributed to the fact that both classifiers are domain dependent, and as such it is normal for them to perform differently on different domains.

In Table 5.6, we notice that none of the results showed any improvement over the base classifier result of 65.96%. However, comparing CPD, IG, and χ2, CPD had the best accuracy.


We also notice that after 25% of the features were eliminated, the accuracy of CPD remained the same while those of IG and χ2 kept decreasing. The lowest accuracy for CPD was 65.5%, for χ2 it was 60.45%, and for IG it was 59.3%. In Table 5.7, even though the accuracies were low, all three feature selection methods performed better than the base classifier accuracy of 57.38%, which was lower than that generated for the SVM. The highest result generated was for χ2 (61.46%) with 28% of the features. χ2 performed better than IG and CPD, while CPD performed better than IG. The performance of χ2 here can also be attributed to its erratic behavior on text classification tasks.

Table 5.6 SVM accuracies for convote dataset in percent

# of Features Percentage CPD IG χ2

15,549 100 65.96 65.96 65.96

11,662 75 65.5 65.18 62.90

4,332 28 65.5 61.93 62.82

3,334 22 65.5 60.03 62.25

1,555 10 65.5 59.3 60.45

Table 5.7 Naïve Bayes accuracies for convote dataset in percent

# of Features Percentage CPD IG χ2

15,549 100 57.38 57.38 57.38

11,662 75 57.9 57.17 57.38

4,332 28 57.9 57.9 61.46

3,334 22 57.9 57.4 60.04

1,555 10 57.9 57.0 59.4


Figure 5.8 shows a graphical representation of using SVM and the three feature selection methods on the convote dataset. We notice that CPD is relatively consistent, while IG and χ2 are inconsistent. IG decreases at a greater rate than χ2. With 75% of the features, its accuracy was higher than that of χ2, but just before 50% of the features were eliminated, its accuracy decreased drastically (to below that of χ2) and kept decreasing as more features were eliminated. Figure 5.9, on the other hand, shows a graphical representation of the Naïve Bayes accuracies with the three feature selection methods on the convote dataset. CPD's performance was relatively consistent in this case too. The accuracy of IG decreased, then increased, and then decreased. The performance of χ2 was different in this classification: it increased sharply and then began to decrease, but its lowest accuracy was still higher than the base classifier result.

Figure 5.8 SVM accuracies for the convote dataset


Figure 5.9 Naïve Bayes accuracies for the convote dataset

5.6 Discussion

In this section, we present a discussion based on the results obtained from our experiments. We explain and justify the performance of CPD, and compare our results with those of Ohana et al. [34], O’Keefe et al. [39], Pang et al. [5], and Simeon et al. [29].

5.6.1 CPD Performance

We observe from Table 5.2 that CPD generated the best accuracies on the movie review dataset for the SVM classifier. For the Naïve Bayes classifier, its accuracy was slightly lower than that of IG with 75% of the features. For the convote dataset, we observe from Table 5.6 that even though all the feature selection methods generated accuracies below that of the base classifier (SVM), CPD had the highest accuracy. For Naïve Bayes, however, CPD had the highest accuracy with 75% of the features, but χ2 performed best in most situations.


From all the results generated, we noticed that CPD maintained the same accuracy value after 50% of the features had been eliminated. This was due to the CPD scores that were generated. The objective of CPD is to find the maximum f-measure value and set a cutoff point based on this value. This cutoff point is used to eliminate features. For our experiments, we set specific scale rates to eliminate features, rather than using the f-measure values, due to time constraints. Using these scale factors, one experiment took, on average, eight days to run to completion. For the CPD experiments, the CPD scores generated after 50% of the features had been eliminated did not differ much. This resulted in most of the terms being eliminated at that point, causing the experiment to complete after this step. The same situation was observed for both datasets. Figure 5.10 shows a graphical representation of the CPD feature selection process.

Figure 5.10 CPD feature selection process
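For reference, the sketch below shows the CPD score computation as we understand it from Simeon and Hilderman [29]: for a term and a category, CPD is (A - B) / (A + B), where A is the number of documents of the category containing the term and B is the number of documents of the other category containing it. Terms whose counts are nearly balanced across the two classes receive scores close to zero, which helps explain why many terms ended up with similar scores. The counts in the example are hypothetical.

public class CpdScoreSketch {

    // CPD(w, c) = (A - B) / (A + B), our reading of the definition in [29].
    static double cpd(int docsInCategory, int docsOutsideCategory) {
        return (double) (docsInCategory - docsOutsideCategory)
                / (docsInCategory + docsOutsideCategory);
    }

    public static void main(String[] args) {
        // A term that appears mostly in one class scores close to 1.
        System.out.println(cpd(120, 5));
        // A term spread evenly over both classes scores close to 0.
        System.out.println(cpd(60, 55));
    }
}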

5.6.2 Comparison with Other Results

Given that we used an approach similar to previous work in sentiment analysis and classification, we compare our results with past results. Table 5.11 shows how our results compare with other published results on the movie dataset. Our approach has a high accuracy


compared to other results using manually built lexicons, and this serves as an encouragement for the use of lexicons (specifically SentiWordNet) built using semi-supervised methods. We also notice that feature selection had a great effect on accuracy compared to the other two approaches, which did not use feature selection as we did.

Table 5.11 Accuracy Comparison

Method                                                                       Accuracy
SentiWordNet scores used as features and CPD as the feature
selection method (this research)                                             88.1%
SentiWordNet scores used as features with no feature selection method [34]   69.35%
Term counting with a manually built list of positive/negative words [5]      69.00%

We can also compare our results with those obtained by Simeon and Hilderman [29]. According to their experimental results, they concluded that CPD outperformed other frequently studied feature selection methods in four out of six text categorization tasks on the datasets they used. In our sentiment analysis and classification research, we also observed that CPD performed better in most cases on both datasets when the SVM classifier was used. This is in line with their results and serves as an encouragement to use CPD in further sentiment analysis and classification research.

O’Keefe and Koprinska [39] also studied the use of CPD together with other feature selection methods on the movie review dataset. Their results show that it is possible to maintain a state-of-the-art classification accuracy of 87.15% while using less than 36% of the features. Our results show that, with 41.7% of the features, we could obtain an accuracy of 88.1%, which is an improvement. Just as our research confirmed, their research also showed SentiWordNet scoring to be a good feature scoring approach.


5.6.3 Drawbacks

We observed that CPD has a much longer running time than IG and χ2, even though its performance was better. To reduce the running time, we used a cutoff scale of 5%. Some of the SentiWordNet scores generated were inaccurate, and this may be attributed to the reliance on glosses (a gloss is a brief notation of the meaning of a word or wording in a text) as a source of information for determining term orientation. Some terms that should have a negative orientation were given positive scores, and vice versa, based on the gloss the term is more likely to be associated with. Some ambiguous terms were difficult to score, and so for such terms, the average score was chosen. This also influenced our accuracy values.
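The kind of term scoring involved is sketched below. It assumes the tab-separated SentiWordNet 3.0 file layout (POS, ID, PosScore, NegScore, SynsetTerms, Gloss) and simply averages the positive-minus-negative score of a term over all of the synsets it appears in, which is how ambiguous terms end up with averaged scores; the file name and parsing details are assumptions rather than the exact project code.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SentiWordNetAverageSketch {
    public static void main(String[] args) throws IOException {
        Map<String, List<Double>> polarities = new HashMap<>();
        // Assumed layout: POS, ID, PosScore, NegScore, SynsetTerms, Gloss (tab-separated).
        for (String line : Files.readAllLines(Paths.get("SentiWordNet_3.0.0.txt"))) {
            if (line.startsWith("#")) continue;                  // skip comment lines
            String[] fields = line.split("\t");
            if (fields.length < 5) continue;
            double polarity = Double.parseDouble(fields[2]) - Double.parseDouble(fields[3]);
            for (String entry : fields[4].split(" ")) {          // e.g. "good#1"
                String term = entry.split("#")[0];
                polarities.computeIfAbsent(term, k -> new ArrayList<>()).add(polarity);
            }
        }
        // Average over all senses; ambiguous terms receive a blended score.
        List<Double> senses = polarities.getOrDefault("good", Collections.emptyList());
        double average = senses.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        System.out.println("Average polarity of 'good': " + average);
    }
}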

Unigrams alone are known to yield better accuracies, but for some terms, using unigrams distorted their sentiment orientation and thereby affected the sentiment score. For such terms, higher-order n-grams could have been used.

The convote dataset did not perform as well as the movie dataset. This could be attributed to the composition and organization of the data. Further processing will have to be done on this dataset to improve its accuracy in the future. Another factor is that the reviews appear to cover several topics of debate, even though they all seem to have some political connotation.


Chapter 6

CONCLUSION

In this work, we investigated the performance of CPD as a feature selection method together

with other popular feature selection methods: IG and χ2. Two datasets were used in this research

and SentiWordNet was used to score the terms in each document. The SVM and Naïve Bayes

classifiers in the Weka data mining application were used for classifying the datasets into

positive and negative sentiments. Experimental results show that CPD performs well as a feature selection method for sentiment analysis and classification tasks, yielding the best accuracy results in three out of four experiments.

We noticed, however, that the accuracy became constant after 50% of the features were

eliminated due to the fact that most terms had similar CPD scores. As future work in this

research, we hope to study CPD in more detail by doing further work with the convote dataset to

eliminate any reviews which are not related, and possibly group reviews based on the topic being

debated. This will involve using both supervised and unsupervised approaches, as it has been noticed that the combination of these approaches yields better results. We would also like to study

the performance of CPD on other datasets described in the sentiment analysis literature.


We also hope to use SentiWordNet with other scoring measures to arrive at better scores for terms, which will make up for the inaccurate scores sometimes generated from SentiWordNet. In the future, we hope to use f-measure values as cutoff values during feature selection and also improve the time taken by CPD to generate scores for terms. This will greatly enhance the classification step and also improve accuracy. Finally, we would like to investigate the use of unigrams and bigrams in this research to see if accuracy can be improved.


References

[1] Bo Pang and Lillian Lee. (2008). Opinion mining and Sentiment analysis. Foundations

and Trends in Information Retrieval Vol. 2, Nos. 1–2. Pages: 1–135

[2] Michelle de Haaff. (2010). Sentiment Analysis, Hard But Worth It!, CustomerThink,

http://www.customerthink.com/blog/sentiment_analysis_hard_but_worth_it, retrieved

2010-03-12.

[3] Fangzhong Su and Katja Markert. (2008). "From Words to Senses: a Case Study in

Subjectivity Recognition". Proceedings of Coling 2008, Manchester, UK

[4] Annett Michelle and Kondrak Grzegorz. (2008). Comparison of sentiment analysis

techniques: polarizing movie blogs. Lecture Notes in Computer Science, Advances in

Artificial Intelligence. Springer Berlin / Heidelberg .Vol. 5032. Pages:25-30

[5] Bo Pang and Lillian Lee. (2002). Thumbs up? Sentiment Classification using Machine

Learning Techniques. Proceedings of the Conference on Empirical Methods in Natural

Language Processing (EMNLP). Pages: 79-86

[6] Dunja Mladenic. (1999). Text-Learning and Related Intelligent Agents: A Survey. In

Intelligent Systems and their applications, IEEE. Vol. 14 Issue 4. Pages: 44-54.


[7] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini and Chris Watkins.

(2002). Text Classification using String Kernels. Journal of Machine Learning

Research 2 (2002). Pages:419-444.

[8] Bo Pang and Lillian Lee. (2004). A Sentimental Education: Sentiment Analysis

Using Subjectivity Summarization Based on Minimum Cuts. Proceedings of the

Association for Computational Linguistics (ACL). Pages: 271–278.

[9] Bing Liu. (2010). Sentiment Analysis and Subjectivity. Handbook of Natural

Language Processing, Second Edition. (editors: N. Indurkhya and F. J. Damerau)

[10] http://en.wikipedia.org/wiki/Sentiment_analysis

[11] Ainur Yessenalina, Yisong Yue and Claire Cardie. (2010). Multi-level Structured

Models for Document-level Sentiment Classification. Proceedings of the 2010

Conference on Empirical Methods in Natural Language Processing. Pages: 1046–1056.

[12] Tu Bao Ho, David Cheung and Huan Liu. (2005). Advances in knowledge

discovery and data mining: 9th Pacific-Asia Conference on Knowledge Discovery

and Datamining. Pages:301-311

[13] http://reviews.cnet.com/smartphones/apple-iphone-4-16gb/4852-6452_7-34117598.html

[14] http://www.radian6.com/blog/2009/12/on-automated-sentiment-analysis/

[15] http://datamining.typepad.com/data_mining/2007/12/sentiment-minin.html

[16] http://en.wikipedia.org/wiki/Stemming

[17] http://aimotion.blogspot.com/2010/07/working-on-sentiment-analysis-on.html


[18] Suhaas Prasad. (2010).Micro-blogging Sentiment Analysis Using Bayesian

Classification Methods. http://nlp.stanford.edu/courses/cs224n/2010/reports/suhaasp.pdf

[19] Alec Go, Lei Huang and Richa Bhayani. Twitter Sentiment Analysis

[20] Ravi Parikh, Matin Movassate. Sentiment Analysis of User-Generated Twitter Updates

using Various Classification Techniques

[21] Minqing Hu and Bing Liu. (2004). Mining and summarizing customer reviews.

Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge

Discovery and Data, Aug. 22-25, ACM Press, Washington, USA., pp: 168-177.

[22] Pimwadee Chaovalit and Lina Zhou. (2005). Movie Review Mining: a Comparison

between Supervised and Unsupervised Classification Approaches. In Proceedings of

the 38th Annual Hawaii International Conference on System Sciences. Pages: 112c-

112c

[23] Benjamin K Tsuo and Oi Yee Kwong. (2005). Semantic Role Tagging for Chinese at

the Lexical level. Proceedings of the Natural Language Conference-IJCNLP.

Pages:804-815.

[24] Matt Thomas, Bo Pang and Lillian Lee. (2006) Get out the vote: Determining support or

opposition from Congressional floor-debate transcripts. Proceedings of EMNLP. Pages:

327–335.

[25] Kushal Dave, Steve Lawrence, and David M. Pennock. (2003) Mining the peanut

gallery: Opinion extraction and semantic classification of product reviews. In

Proceedings of WWW. Pages: 519–528.

[26] http://www.computing.dcu.ie/~acahill/tagset.html


[27] Baccianella Stefano, Andrea Esuli and Fabrizio Sebastiani. (2010). SentiWordNet

3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In

Proceedings of the Seventh Conference on International Language Resources and

Evaluation (LREC’10). Pages 2200–2204.

[28] George Forman. (2003). An extensive empirical study of feature selection metrics for

text classification. Journal of Machine Learning Research 3. Pages: 1289 –1305.

[29] Mondelle Simeon and Robert Hilderman. (2008). Categorical Proportional

Difference: A Feature Selection Method for Text Categorization. In Proceedings of

the Seventh Australasian Data Mining Conference (AusDM 2008), Glenelg, South

Australia. CRPIT, 87. Roddick, J. F., Li, J., Christen, P. and Kennedy, P. J., Eds.

ACS. 201-208.

[30] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann,

Ian H. Witten (2009). The WEKA Data Mining Software: An Update; SIGKDD

Explorations, Volume 11, Issue 1

[31] Ian H. Witten, Eibe Frank, Mark A. Hall. (2005). Data Mining: Practical Machine

Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data

Management Systems).

[32] http://weka.wikispaces.com/Primer

[33] Nitin Jindal and Bing Liu. (2008). Opinion Spam and Analysis. Proceedings of

First ACM International Conference on Web Search and Data Mining (WSDM-

2008), Feb 11-12, 2008, Stanford University, Stanford, California, USA


[34] Bruno Ohana and Brendan Tierney. (2009). Sentiment Classification of reviews

using SentiWordNet. IT&T Conference.

[35] Casey Whitelaw, Navendu Garg, Shlomo Argamon. (2005). Using appraisal groups for

sentiment analysis. In Proceedings of the 14th ACM Conference on Information and

Knowledge Management. Pages:625–631.

[36] Abbasi Ahmed, Chen Hsinchun and Salem Arab.(2008). Sentiment analysis in multiple

languages: Feature selection for opinion classification in Web forums. ACM

Transactions of Information System. Volume 26, Number 3, Article 12 (June 2008) 34

pages.

[37] Gamon, M. (2004). Sentiment Classification on Customer Feedback Data: Noisy Data,

Large Feature Vectors, and the Role of Linguistic Analysis. Proceedings of the 20th

international conference on Computational Linguistics. Geneva, Switzerland:

Association for Computational Linguistics.

[38] Tony Mullen and Nigel Collier.(2004). Sentiment analysis using support vector

machines with diverse information sources. Proceedings of EMNLP-2004, Barcelona,

Spain, July 2004. Association for Computational Linguistics. Pages: 412–418

[39] Tim O’Keefe and Irena Koprinska.(2009). Feature Selection and Weighting methods in

Sentiment Analysis. Proceedings of the Fourteenth Australasian Document Computing

Symposium.

[40] Peter D. Turney. (2002). Thumbs up or thumbs down?: Semantic orientation applied to

unsupervised classification of reviews. Proceedings of the 40th Annual Meeting on

Association for Computational Linguistics. Pages: 417-424.


[41] http://nlp.stanford.edu/software/tagger.shtml


APPENDIX

Extract from the movie review SVM classification result.

data/txt_sentoken Corpora Size: 2000 Feature Size: 36467 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.79 0.201 0.797 0.79 0.794 0.795 0
0.799 0.21 0.792 0.799 0.795 0.795 1
Weighted Avg. 0.795 0.206 0.795 0.795 0.794 0.795

Accuracy: 79.45 F-Measure: 0.794 Recall: 0.794 Precision: 0.795 True Positive:

0.794 False Positive: 0.206 True Negative: 0.794 False Negative: 0.206 Root Mean Squared

Error: 0.453

data/txt_sentoken Corpora Size: 2000 Feature Size: 36467 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.79 0.201 0.797 0.79 0.794 0.795 0

0.799 0.21 0.792 0.799 0.795 0.795 1

Weighted Avg. 0.795 0.206 0.795 0.795 0.794 0.795

Accuracy: 79.45 F-Measure: 0.794 Recall: 0.794 Precision: 0.795 True Positive:

0.794 False Positive: 0.206 True Negative: 0.794 False Negative: 0.206 Root Mean Squared

Error: 0.453

data/txt_sentoken Corpora Size: 2000 Feature Size: 32601 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.798 0.197 0.802 0.798 0.8 0.801 0

0.803 0.202 0.799 0.803 0.801 0.801 1

Weighted Avg. 0.801 0.2 0.801 0.801 0.8 0.801

Accuracy: 80.05 F-Measure: 0.8 Recall: 0.8 Precision: 0.801 True Positive: 0.8

False Positive: 0.2 True Negative: 0.8 False Negative: 0.2 Root Mean Squared

Error: 0.447

data/txt_sentoken Corpora Size: 2000 Feature Size: 32525 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.799 0.194 0.805 0.799 0.802 0.803 0

0.806 0.201 0.8 0.806 0.803 0.803 1

Weighted Avg. 0.803 0.198 0.803 0.803 0.802 0.803

Accuracy: 80.25 F-Measure: 0.802 Recall: 0.802 Precision: 0.803 True Positive:

0.802 False Positive: 0.198 True Negative: 0.802 False Negative: 0.198 Root Mean Squared

Error: 0.444

54

data/txt_sentoken Corpora Size: 2000 Feature Size: 32353 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.798 0.201 0.799 0.798 0.798 0.799 0

0.799 0.202 0.798 0.799 0.799 0.799 1

Weighted Avg. 0.799 0.202 0.799 0.799 0.798 0.799

Accuracy: 79.85 F-Measure: 0.798 Recall: 0.798 Precision: 0.799 True Positive:

0.798 False Positive: 0.202 True Negative: 0.798 False Negative: 0.202 Root Mean Squared

Error: 0.449

data/txt_sentoken Corpora Size: 2000 Feature Size: 32161 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.804 0.199 0.802 0.804 0.803 0.803 0

0.801 0.196 0.803 0.801 0.802 0.803 1

Weighted Avg. 0.803 0.198 0.803 0.803 0.802 0.803

Accuracy: 80.25 F-Measure: 0.802 Recall: 0.802 Precision: 0.803 True Positive:

0.802 False Positive: 0.198 True Negative: 0.802 False Negative: 0.198 Root Mean Squared

Error: 0.444

data/txt_sentoken Corpora Size: 2000 Feature Size: 31887 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.793 0.206 0.794 0.793 0.793 0.794 0

0.794 0.207 0.793 0.794 0.794 0.794 1

Weighted Avg. 0.794 0.207 0.794 0.794 0.793 0.794

Accuracy: 79.35 F-Measure: 0.793 Recall: 0.794 Precision: 0.794 True Positive:

0.794 False Positive: 0.206 True Negative: 0.794 False Negative: 0.206 Root Mean Squared

Error: 0.454

data/txt_sentoken Corpora Size: 2000 Feature Size: 31574 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.791 0.207 0.793 0.791 0.792 0.792 0

0.793 0.209 0.791 0.793 0.792 0.792 1

Weighted Avg. 0.792 0.208 0.792 0.792 0.792 0.792

Accuracy: 79.2 F-Measure: 0.792 Recall: 0.792 Precision: 0.792 True Positive: 0.792

False Positive: 0.208 True Negative: 0.792 False Negative: 0.208 Root Mean Squared

Error: 0.456

data/txt_sentoken Corpora Size: 2000 Feature Size: 31272 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.791 0.201 0.797 0.791 0.794 0.795 0

0.799 0.209 0.793 0.799 0.796 0.795 1

Weighted Avg. 0.795 0.205 0.795 0.795 0.795 0.795

Accuracy: 79.5 F-Measure: 0.795 Recall: 0.795 Precision: 0.795 True Positive: 0.795

False Positive: 0.205 True Negative: 0.795 False Negative: 0.205 Root Mean Squared

Error: 0.453

data/txt_sentoken Corpora Size: 2000 Feature Size: 30851 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.781 0.208 0.79 0.781 0.785 0.787 0

0.792 0.219 0.783 0.792 0.788 0.787 1

Weighted Avg. 0.787 0.214 0.787 0.787 0.786 0.787

Accuracy: 78.65 F-Measure: 0.786 Recall: 0.786 Precision: 0.787 True Positive:

0.786 False Positive: 0.214 True Negative: 0.786 False Negative: 0.214 Root Mean Squared

Error: 0.462

55

data/txt_sentoken Corpora Size: 2000 Feature Size: 30382 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.785 0.211 0.788 0.785 0.787 0.787 0

0.789 0.215 0.786 0.789 0.787 0.787 1

Weighted Avg. 0.787 0.213 0.787 0.787 0.787 0.787

Accuracy: 78.7 F-Measure: 0.787 Recall: 0.787 Precision: 0.787 True Positive: 0.787

False Positive: 0.213 True Negative: 0.787 False Negative: 0.213 Root Mean Squared

Error: 0.462

data/txt_sentoken Corpora Size: 2000 Feature Size: 29587 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.776 0.214 0.784 0.776 0.78 0.781 0

0.786 0.224 0.778 0.786 0.782 0.781 1

Weighted Avg. 0.781 0.219 0.781 0.781 0.781 0.781

Accuracy: 78.1 F-Measure: 0.781 Recall: 0.781 Precision: 0.781 True Positive: 0.781

False Positive: 0.219 True Negative: 0.781 False Negative: 0.219 Root Mean Squared

Error: 0.468

data/txt_sentoken Corpora Size: 2000 Feature Size: 29378 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.777 0.214 0.784 0.777 0.781 0.782 0

0.786 0.223 0.779 0.786 0.782 0.782 1

Weighted Avg. 0.782 0.219 0.782 0.782 0.781 0.782

Accuracy: 78.15 F-Measure: 0.781 Recall: 0.782 Precision: 0.782 True Positive:

0.782 False Positive: 0.218 True Negative: 0.782 False Negative: 0.218 Root Mean Squared

Error: 0.467

data/txt_sentoken Corpora Size: 2000 Feature Size: 28977 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.787 0.206 0.793 0.787 0.79 0.791 0

0.794 0.213 0.788 0.794 0.791 0.791 1

Weighted Avg. 0.791 0.21 0.791 0.791 0.79 0.791

Accuracy: 79.05 F-Measure: 0.79 Recall: 0.79 Precision: 0.791 True Positive:

0.79 False Positive: 0.21 True Negative: 0.79 False Negative: 0.21 Root Mean Squared

Error: 0.458

data/txt_sentoken Corpora Size: 2000 Feature Size: 27741 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.776 0.227 0.774 0.776 0.775 0.775 0

0.773 0.224 0.775 0.773 0.774 0.775 1

Weighted Avg. 0.775 0.226 0.775 0.775 0.774 0.775

Accuracy: 77.45 F-Measure: 0.774 Recall: 0.774 Precision: 0.775 True Positive:

0.774 False Positive: 0.226 True Negative: 0.774 False Negative: 0.226 Root Mean Squared

Error: 0.475

data/txt_sentoken Corpora Size: 2000 Feature Size: 27395 Classifier: SVM

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.777 0.232 0.77 0.777 0.774 0.773 0

0.768 0.223 0.775 0.768 0.771 0.773 1

Weighted Avg. 0.773 0.228 0.773 0.773 0.772 0.773

Accuracy: 77.25 F-Measure: 0.772 Recall: 0.772 Precision: 0.773 True Positive:

0.772 False Positive: 0.228 True Negative: 0.772 False Negative: 0.228 Root Mean Squared

Error: 0.