SENTIMENT ANALYSIS AND CLASSIFICATION OF ONLINE REVIEWS USING
CATEGORICAL PROPORTIONAL DIFFERENCE
A PROJECT REPORT
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
MASTER OF SCIENCE
IN
COMPUTER SCIENCE
UNIVERSITY OF REGINA
By
Dorothy Aku Allotey
Regina, Saskatchewan
December 2, 2011
© Copyright 2011: Dorothy Aku Allotey
Abstract
Sentiment analysis and classification is an area of text classification that began around
2001 and has recently been receiving a lot of attention from researchers. Sentiment analysis
involves analyzing textual datasets which contain opinions (e.g., social media, blogs, discussion
groups, and internet forums) with the objective of classifying the opinions as positive, negative,
or neutral. Classification of textual objects according to sentiment is considered a more difficult
task than classification of textual objects according to content because opinions in natural
language can be expressed in subtle and complex ways containing slang, ambiguity, sarcasm,
irony, and idiom.
Various measures, including information gain, simple chi-square, feature relation
network, log-likelihood ratio, and minimum frequency thresholds, have previously been used as
feature selection methods in sentiment analysis and classification. In this report, we investigate
the performance of categorical proportional difference (CPD), a novel feature selection method used for
text classification. While categorical proportional difference has previously been shown by
others to be useful in sentiment classification of text documents, here we apply CPD to the
classification of online reviews. Online reviews differ from typical text documents in that they
contain few features (i.e., they are short documents). The results obtained using CPD on online
reviews are compared to those obtained using information gain and simple chi-square. We also
apply two weighting schemes, called Feature Presence and SentiWordNet scores, in combination
with each of CPD, information gain and simple chi-square, to classify the reviews as either
positive or negative. Experimental results show that given a dataset containing labeled movie
reviews, CPD generates a classification accuracy of 88.1% with 42% of the features. The
performance of information gain and simple chi-square varied depending on the dataset.
Information gain performed better than simple chi-square on the movie review dataset while
simple chi-square performed better than information gain on the congressional floor debate
dataset. We therefore conclude that CPD is a good feature selection method for
sentiment classification tasks.
Acknowledgements
I would like to express my sincere appreciation to my supervisor, Dr. Robert Hilderman,
for all the assistance and guidance through all stages of this project. This project would
not have been possible without his advice and support. I would like to thank the Faculty of Graduate
Studies and Research for their financial support. To all friends who helped in one way or the
other, I am very grateful.
Contents
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 INTRODUCTION
1.1 Statement of the Problem
1.2 The Process of Sentiment Analysis and Classification
1.3 Objectives of the project
1.4 The contributions of the project
1.5 Organization of the Project Report
Chapter 2 BACKGROUND
2.1 General Overview of Sentiment Analysis and Classification
2.2 Factors that affect sentiment analysis and classification
2.3 Resources for Sentiment Analysis and Classification
2.3.1 The POS tagger
2.3.2 SentiWordNet
2.3.3 Dictionary Structure
Chapter 3 RELATED WORK
3.1 Recent Work
Chapter 4 METHODOLOGY
4.1 Assumptions
4.2 Description of our approach
4.3 SentiWordNet Scoring
4.4 Feature Selection
4.4.1 Feature selection methods
4.4.2 CPD
4.4.3 IG and χ²
4.5 Classifier
4.6 Classification
Chapter 5 EXPERIMENTAL RESULTS
5.1 Hardware and Software
5.2 Weka
5.3 Datasets
5.4 Interest Measures
5.5 Results
5.5.1 Initial Classification Without Feature Selection
5.5.2 Results for Movie Review Dataset
5.5.3 Results for Convote Dataset
5.6 Discussion
5.6.1 CPD Performance
5.6.2 Comparison with Other Results
5.6.3 Drawbacks
Chapter 6 CONCLUSION AND FUTURE WORK
References
Appendix
List of Tables
4.5 Contingency table
5.1 Average 10-fold cross validation accuracies for base classifier in percent
5.2 SVM accuracies for Movie review dataset in percent
5.3 Naïve Bayes accuracies for Movie review dataset in percent
5.6 SVM accuracies for convote dataset in percent
5.7 Naïve Bayes accuracies for convote dataset in percent
5.11 Accuracy Comparison
List of Figures
1.1 Sentiment analysis and classification steps
2.1 Sample SentiWordNet output for the word “boring”
4.1 SentiWordNet output
4.2 A sample dynamic array
4.3 Sample movie review dataset in arff format
4.4 Sample binary feature vector representation
4.5 Experiment Execution stages
5.1 Sample Naïve Bayes classification results
5.4 SVM accuracies for the movie review dataset
5.5 Naïve Bayes accuracies for the movie review dataset
5.8 SVM accuracies for convote dataset
5.9 Naïve Bayes accuracies for convote dataset
5.10 CPD feature selection process
Chapter 1
INTRODUCTION
The World Wide Web and the Internet provide a forum through which an individual’s
process of decision making may be influenced by the opinions of others. For example, the
customer feedback system used by eBay.com allows customers to use free-form text to rate
products and services received while making the ratings available to other customers to
review before they make a purchase decision, in effect allowing a customer to make a more
informed decision. Customer feedback and product evaluations can also be found at many
online sites including epinions.com and amazon.com. Online sites such as
rottentomatoes.com, allow movie buffs to leave reviews for movies they have seen. Online
sites, such as Facebook and blogs, allow users to leave opinions and comments. Other
online sites, such as cnn.com and globeandmail.com, allow readers to leave comments.
These kinds of online media have resulted in large quantities of textual data containing
opinion and facts. Over the years, there has been extensive research aimed at analyzing and
classifying text and data [6], where the objective is to assign predefined category labels to
documents based upon learned models [7]. However, more recent research has attempted to
analyze textual data to determine how an individual “feels” about a particular topic (i.e., the
individual’s sentiment towards that topic). This has led to the development of sentiment
analysis and classification systems [1]. Sentiment analysis and classification are technically
challenging because opinions can be expressed in subtle and complex ways, involving the
use of slang, ambiguity, sarcasm, irony and idiom.
Sentiment analysis and classification is performed for several reasons, for example to
track the ups and downs of aggregate attitudes to a brand or product [13], to compare the
attitudes of online customers between one brand or product and another, and to pull out examples
of particular types of positive or negative statements on some topic. It may also be performed to
enhance customer relationship management and to help other potential customers make informed
choices.
The remainder of this chapter is organized as follows. In Section 1.1, we give a
statement of the problem. In Section 1.2, we describe the steps involved in sentiment
analysis and classification. In Section 1.3, the objectives of this project are discussed, and in
Section 1.4, the contributions of this project are summarized. The organization of this
project report is provided in Section 1.5.
1.1 Statement of the Problem
Sentiment analysis has been formally referred to as a broad (definitionally challenged) area
of natural language processing, computational linguistics, and text mining [10]. Generally
speaking, it aims to determine the attitude of a speaker or a writer with respect to some topic.
Their attitude may be a judgment or evaluation, an affective state (that is to say, the emotional
state of the author when writing), or an emotional communication (that is to say, the emotional
effect the author wishes to have on the reader). With the increasing amounts of text in on-line
documents, efforts have been made to organize this information using automatic text
classification (categorization) techniques. However, using these approaches to classify sentiment
in documents has not been effective. This is because text classification involves
automatically sorting a set of documents into categories from a predefined set. Most research has
been based on topical classification of these texts. Sentiment analysis on the other hand goes
beyond topical classification of texts to include extracting opinions and classifying them as
positive or negative, and favourable or not favourable.
This presents challenges which may not be easily addressed by simple text classification
approaches. Consequently, there is a need to incorporate methods for classifying opinions into a
general text classification tool, or to develop separate systems that will be able to analyze and
accurately classify sentiments in text. This need has risen in recent years due to the application
of sentiment analysis in diverse areas such as market research, business intelligence, public
relations, electronic governance, web search, and email filtering. Other factors, which have
culminated in increased interest in sentiment analysis include (according to Pang and Lee [1]):
- The rise of machine learning methods in natural language processing and information
retrieval,
- The availability of datasets for machine learning algorithms to be trained on,
specifically, the development of review-aggregation web-sites, and
- The realization of the challenges offered by commercial and intelligence applications.
1.2 The Process of Sentiment Analysis and Classification
Here, we provide a brief overview of the process of sentiment analysis and
classification. A more detailed description is provided in Chapter 2. A diagram summarizing
the steps required for text classification using sentiment is shown in Figure 1.1. In Figure 1.1,
sentiment analysis and classification is shown to consist of two main steps: preprocessing
step and classification step. The preprocessing step involves extracting the reviews or
documents from a source dataset. Terms in each review are parsed by a part-of-speech
(POS) tagger, such as the Stanford POS tagger [18]. The tags are then used by a lexical
resource, such as SentiWordNet [27], to determine the sentiment score for each term. The
terms in each document, together with their sentiment scores, are stored as feature vectors to
be used as input to a text classifier.
Figure 1.1 Sentiment analysis and classification steps
The classification step involves using a text classifier to classify the selected features as
either positive or negative. The stored feature vectors become input to the text classifier. If
necessary, feature selection can be applied to reduce the number of features. In order to
arrive at the best results, a series of iterative steps, known as cross-validation can be used to
estimate how accurately a predictive model will perform in practice.
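The two-step process described above can be sketched in a few lines of code. In the sketch below, the lexicon, its scores, and the tokenization are invented placeholders (the actual system uses the Stanford tagger and SentiWordNet), and the "classifier" is a trivial score comparison rather than a trained model, so this illustrates only the data flow:

```python
# Sketch of the pipeline: preprocessing builds a feature vector of scored
# terms; classification labels the review from that vector.

TOY_LEXICON = {                 # hypothetical: term -> (pos_score, neg_score)
    "great":  (0.75, 0.0),
    "good":   (0.5,  0.0),
    "boring": (0.0,  0.25),
    "awful":  (0.0,  0.875),
}

def preprocess(review: str) -> dict:
    """Preprocessing step: extract known terms and their sentiment scores."""
    features = {}
    for token in review.lower().split():
        term = token.strip(".,!?\"'")
        if term in TOY_LEXICON:
            features[term] = TOY_LEXICON[term]
    return features

def classify(features: dict) -> str:
    """Classification step: compare summed positive and negative scores."""
    pos = sum(p for p, _ in features.values())
    neg = sum(n for _, n in features.values())
    return "positive" if pos >= neg else "negative"
```

For example, `classify(preprocess("A great plot and a good cast."))` yields `"positive"`; in the real system the feature vectors would instead be passed to a trained text classifier, with cross-validation used to estimate accuracy.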
1.3 Objectives of the project
The objectives of this project are:
- To analyze the Cornell movie review datasets used by Pang and Lee [5] and the
congressional-speech corpus used by Thomas et al. [24] by incorporating features from
SentiWordNet.
- To extract features based on SentiWordNet scores and use them for sentiment
classification.
- To use CPD, simple chi-square (χ²), and information gain (IG) for feature selection
[28], and to compare the performance of CPD to χ² and IG in sentiment classification
tasks.
- To evaluate the effect of feature selection on the overall performance of sentiment
classification.
- To compare our experimental results with other well-known research results.
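CPD itself is defined in Chapter 4; the sketch below assumes the commonly cited two-class formulation, (A − B)/(A + B), where A counts the positive documents containing a term and B the negative documents containing it, and should be read as an illustration rather than the exact measure used in the experiments:

```python
# Hedged sketch of categorical proportional difference (CPD) for a
# two-class corpus of documents represented as sets of terms.

def cpd(term: str, pos_docs: list, neg_docs: list) -> float:
    a = sum(1 for doc in pos_docs if term in doc)  # positive docs with term
    b = sum(1 for doc in neg_docs if term in doc)  # negative docs with term
    if a + b == 0:
        return 0.0                                 # term never occurs
    return (a - b) / (a + b)

pos_docs = [{"great", "plot"}, {"great", "cast"}]
neg_docs = [{"boring", "plot"}]
```

Here `cpd("great", pos_docs, neg_docs)` is 1.0 (the term appears only in positive documents, so it is highly discriminative), while `cpd("plot", pos_docs, neg_docs)` is 0.0 (it appears equally in both classes); feature selection would keep terms whose absolute CPD exceeds a threshold.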
1.4 The contributions of the project
This project makes the following contributions to the field of sentiment analysis and
classification:
- It demonstrates that CPD is an effective feature selection method, not only in text
categorization, but also in sentiment classification.
- It shows that CPD is an effective feature selection method in the movie review and
political review domains.
- It confirms that using feature presence in conjunction with SentiWordNet scores
provides better sentiment classification accuracy than feature presence alone.
1.5 Organization of the Project Report
The remainder of this report is organized as follows. In Chapter 2, we present some
background information on sentiment analysis and classification. In Chapter 3, we discuss
some related work in this research field. In Chapter 4, we explain our methodology. In
Chapter 5, we discuss our experimental results and compare them with other well known
results. In Chapter 6 we give our conclusions and suggest future work.
Chapter 2
BACKGROUND
Data available on the Internet comes in a variety of different formats. The simplest and
most common format is plain textual data. One example of textual data is online reviews.
Online reviews are broadly described as either objective or subjective [8]. An objective
review tends to contain mostly facts while a subjective review contains mostly opinions.
There are two common review formats [9]: the restricted review format and the free format.
In a restricted review, the reviewer is asked to separately describe the pros and cons, and to
write a comprehensive review. In a free format review, the reviewer writes freely without
any separation of the pros and cons.
In Section 2.1, we present an overview of how sentiment analysis and classification
work in general. In Section 2.2, we discuss some factors that affect sentiment analysis and
classification. In Section 2.3, we discuss SentiWordNet, a lexical resource used for
sentiment analysis and classification and other lexical resources.
2.1 General Overview of Sentiment Analysis and Classification
Sentiments can be classified at the word, sentence, or document level. This project
focused on document-level sentiment classification. Document-level sentiment classification
can be described as follows. Given a set of related documents containing opinions (also
known as opinionated documents), we determine whether each document d expresses a
positive or negative opinion of (i.e., the sentiment toward) an object. Existing research in this
area makes the assumption that the opinionated document d (e.g., a movie review) contains
opinions regarding a single object [9]. This assumption typically holds for customer reviews of
products and services, but not for forum and blog posts due to the fact that such a post may
contain opinions on multiple products and services.
One approach to determining the document-level sentiment is to perform sentence-level
sentiment classification on each sentence in the document and then combine the
sentence-level sentiments to produce the document-level sentiment. This, however,
requires that word-level sentiment classification be performed on each sentence prior to
sentence-level classification. One of the main challenges of document-level sentiment analysis
and classification is that not every part of the document is equally informative in inferring the
sentiment of the whole document [11]. However, identifying the useful sentences automatically
is itself a difficult learning problem. Some researchers have proposed automatic approaches to
extracting useful sentences [11]. Others have considered all sentences in the document in
conjunction with algorithms that were able to produce better experimental results [5, 12].
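The sentence-aggregation approach just described can be sketched as follows. The word-score table is a toy stand-in for genuine word-level classification, and ties default to "positive":

```python
# Sketch of aggregation: word-level scores roll up into sentence-level
# scores, which roll up into a document-level label.

TOY_SCORES = {"brilliant": 1, "great": 1, "good": 1,        # hypothetical
              "boring": -1, "awful": -1, "predictable": -1}

def sentence_sentiment(sentence: str) -> int:
    """Word-level scores summed into a sentence-level score."""
    return sum(TOY_SCORES.get(w.strip(".,!?"), 0)
               for w in sentence.lower().split())

def document_sentiment(document: str) -> str:
    """Sentence-level scores summed into a document-level label."""
    total = sum(sentence_sentiment(s)
                for s in document.split(".") if s.strip())
    return "positive" if total >= 0 else "negative"
```

This makes the dependency chain concrete: the document-level label is only as good as the word-level scoring feeding it, which is why not every sentence is equally informative.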
2.2 Factors that affect sentiment analysis and classification
To classify a document as either positive or negative according to the overall
sentiment expressed by the author is seen as a more challenging problem than strict text
classification. For example, opinions are often expressed in a complex manner, making
them difficult to identify from any of the words in the sentence or document considered
in isolation. By just looking at the words in a review, one may not be able to
accurately classify the sentiments which have been expressed. Consider this review:
“This film should be brilliant. It sounds like a great plot, the actors are first grade, the
supporting cast is good as well, and Stallone is attempting to deliver a good performance;
however, it can’t hold up.”
The presence of words such as “brilliant”, “great”, and “good” suggests a positive sentiment, so
one might think it easy to identify the sentiment of a review from a set of keywords. However,
Pang and Lee found from their experimental results that classification based on a
human-generated list of keywords was less accurate than classification based on keywords
generated using machine learning techniques [5].
The above review has an anaphor (e.g., the phrase “it can’t hold up”). It is not clear what
“it” in this phrase refers to, whether the movie or Stallone’s performance. This makes it difficult
to determine the overall polarity of the review. Many free-format reviews contain anaphora,
abbreviations, missing capitalization, poor spelling, poor punctuation, and poor grammar.
The main factors that affect how opinions are analyzed and classified include the domain
of the datasets, the size of the datasets to be analyzed, the format of the datasets (labeled or
unlabelled), and quality of the dataset. The accuracy of sentiment classification can be influenced
by the domain of the items to which it is applied [1]. One reason is that the same phrase can
indicate different sentiment in different domains: for example, “go read the book” most likely
indicates positive sentiment for book reviews, but negative sentiment for movie reviews. A
review such as “it is so easy to predict the next action….” is a negative sentiment for a movie
plot but a positive sentiment for a political review. Difference in vocabularies across different
domains also adds to the difficulty when applying classifiers trained on labeled data in one
domain to test data in another.
Depending on the size of the dataset, a manual approach, an automatic approach, or a
combination of both may be used. A combination of both is generally best: a number of
experiments have shown that even the worst results obtained by combining the two approaches
are superior to the best results of manual approaches and of some automatic approaches [4].
The availability of labeled data also reduces the time taken to classify opinions.
Without it, many researchers have had to use the linguistic/semantic approach of building
lexicons, which is very time consuming and yet does not yield better performance. Lexicons are
also specific to a particular language and domain, making the sentiment analysis and
classification task even more difficult.
The quality of the dataset is also bound to affect sentiment classification performance. Due to
the fact that there is no quality control measure in place to check reviews, anyone can write
anything on the web. This results in many low-quality reviews and review spam. Bing et al.
studied opinion spam and found that online reviews often contain spam, consisting of
untruthful or fake reviews, irrelevant reviews, and entries which are not actual reviews,
but rather statements or questions [33]. Effective preprocessing of the data is
needed to reduce the level of spam in these reviews. However, this is a research area which has
not received much attention.
2.3 Resources for Sentiment Analysis and Classification
The goal of this research is to classify sentiments in text, and as such, there is the need to
use an approach that can determine the sentiments of extracted features in individual documents.
One major approach normally used is lexical induction which involves creating resources that
contain opinion information on words based on lexicons. Among the many lexical resources
available, SentiWordNet 3.0 was chosen for this research [27]. It is a lexical resource explicitly
devised for supporting sentiment classification and opinion mining applications. SentiWordNet
accepts a term and its part-of-speech as input. This requires the use of the Stanford POS tagger
to tag terms before their scores can be determined.
2.3.1 The POS tagger
A POS tagger parses a string of words (e.g., a sentence) and tags each term with its part
of speech. For example, parsing the following text which is an excerpt taken from a sample
review:
“well , its main problem is that it's simply too jumbled . it starts off " normal " but then
downshifts into this " fantasy " world in which you , as an audience member , have no
idea what's going on . ”
generates the following output:
“RB well , , PRP$ its JJ main NN problem VBZ is IN that PRP it VBZ 's RB simply
RB too JJ jumbled . PRP it VBZ starts RP off `` " JJ normal '' " CC but RB then
VBZ downshifts IN into DT this `` " NN fantasy '' " NN world IN in WDT which
PRP you , IN as DT an NN audience NN member , , VBP have DT no NN idea WP
what VBZ 's VBG going IN on . ”
Every term has been associated with a relevant tag indicating its role in the sentence, such as
VBZ (verb), NN (noun), JJ (adjective), etc. The full list of tags and their meanings is based on
the tagset of the Penn Treebank, an annotated corpus that is the most popular standard used in
POS tagging [26].
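Since SentiWordNet scores only four parts of speech, the Penn Treebank tags produced by the tagger must be collapsed before lookup. The mapping below follows the conventional WordNet letter codes (a/n/v/r) and is an assumption for illustration, not code taken from this project:

```python
# Collapse Penn Treebank tags into the four SentiWordNet POS categories.

def penn_to_sentiwordnet(tag: str):
    """Map a Penn Treebank tag to a SentiWordNet POS letter, or None."""
    if tag.startswith("JJ"):
        return "a"   # adjectives: JJ, JJR, JJS
    if tag.startswith("NN"):
        return "n"   # nouns: NN, NNS, NNP, NNPS
    if tag.startswith("VB"):
        return "v"   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"   # adverbs: RB, RBR, RBS
    return None      # other tags (DT, IN, PRP, ...) are not scored

tagged = [("well", "RB"), ("jumbled", "JJ"), ("problem", "NN"), ("is", "VBZ")]
swn_pos = [(word, penn_to_sentiwordnet(tag)) for word, tag in tagged]
```

Applied to the tagged excerpt above, "jumbled" (JJ) maps to an adjective lookup and "well" (RB) to an adverb lookup, while function words such as DT and PRP are simply skipped.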
2.3.2 SentiWordNet
WordNet is a publicly available database of words containing a semantic lexicon for the
English language that organizes words into groups called synsets (i.e., synonym sets). A synset is
a collection of synonym words linked to other synsets according to a number of different
possible relationships between the synsets (e.g., is-a, has-a, is-part-of, and others). SentiWordNet
is a publicly available lexical resource for research purposes providing a semi-supervised form of
sentiment classification [27] based on the annotation of all the synsets of WordNet according to
the notions of “positivity”, “negativity”, and “neutrality”.
Each synset s is associated with three numerical scores, Pos(s), Neg(s), and Obj(s), which
indicate the degree to which the terms in the synset are positive, negative, or objective (i.e.,
neutral), respectively. Different senses of the same term may thus have different opinion-related
properties. For example, in SentiWordNet 1.0, the synset [estimable(J,3)], corresponding to the
sense “may be computed or estimated” of the adjective estimable, has an Obj score of 1.0 (and
Pos and Neg scores of 0.0), while the synset [estimable(J,1)] corresponding to the sense
“deserving of respect or high regard” has a Pos score of 0.75, a Neg score of 0.0, and an Obj
score of 0.25. Each of the three scores ranges from 0 to 1 and their sum is 1.0 for each synset. A
synset may have nonzero scores for all the three categories.
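Because the three scores always sum to 1.0, the objectivity score can be recovered from the other two. A small sketch, checked against the estimable examples above:

```python
# Pos(s) + Neg(s) + Obj(s) = 1.0 for every synset, so Obj(s) is derivable.

def obj_score(pos: float, neg: float) -> float:
    """Recover Obj(s) from Pos(s) and Neg(s)."""
    obj = 1.0 - pos - neg
    if not 0.0 <= obj <= 1.0:
        raise ValueError("Pos + Neg must lie between 0 and 1")
    return obj
```

For estimable(J,1), `obj_score(0.75, 0.0)` returns 0.25, and for estimable(J,3), `obj_score(0.0, 0.0)` returns 1.0, matching the scores quoted above.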
2.3.3 Dictionary Structure
The SentiWordNet dictionary can be downloaded online as a text file. In this file, the
term scores are grouped by the synset and the relevant part of speech. Each entry in the
dictionary is made of seven fields: the part of speech, the offset, the positive score, the negative
score, the objective score, the associated synset terms, and an example of the context in which
the term may be used. The parts of speech used in this dictionary are limited to adjective, noun,
verb, and adverb. The offset is a numerical value which uniquely identifies a synset in the
database. The associated synset terms are the terms included in that particular synset. Figure 2.1
shows a sample output for the term “boring” entered as input to the SentiWordNet dictionary.
Figure 2.1 Sample SentiWordNet output for the word “boring”
According to the SentiWordNet dictionary, the term “boring” can be used as either an
adjective or a noun. When used as an adjective, it has an offset of 01345307, a Pos score of
0.0, a Neg score of 0.25 and an Obj score of 0.75. Its associated synset terms include
“wearisome”, “tiresome”, “tedious”, and “slow”. The term could mean “so lacking in
interest as to cause mental weariness” and an example of a context in which the term may be
used is “a boring evening with uninteresting people”.
When the term is used as a noun, it may have two offsets, 00942799 and 00923130
depending on the context in which it is used. For both offsets, the term has a Pos score of
0.0, a Neg score of 0.0 and an objective score of 1.0. The word “drilling” is an associated
synset term and is used in the context of “the act of drilling a hole”. These scores indicate that
the term is strongly objective as a noun, but weakly negative as an adjective. When used as a
noun in a document, such a term cannot contribute to determining the document's sentiment,
since our aim is to determine sentiment from positive and negative scores. When used as an
adjective, however, we may consider it a negative term.
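A sketch of parsing one dictionary entry according to the seven-field layout described above. The tab-delimited format and field order are assumptions for illustration (the actual SentiWordNet file layout should be checked before relying on this), and the sample line reuses the “boring” adjective values from Figure 2.1:

```python
# Parse one seven-field SentiWordNet-style entry: part of speech, offset,
# positive score, negative score, objective score, synset terms, gloss.

def parse_entry(line: str) -> dict:
    pos_tag, offset, pos, neg, obj, terms, gloss = line.split("\t")
    return {
        "pos": pos_tag,           # part of speech (a, n, v, r)
        "offset": offset,         # unique synset identifier
        "pos_score": float(pos),
        "neg_score": float(neg),
        "obj_score": float(obj),
        "terms": terms.split(),   # associated synset terms
        "gloss": gloss,           # example of the term in context
    }

entry = parse_entry(
    "a\t01345307\t0.0\t0.25\t0.75\t"
    "boring wearisome tiresome tedious slow\t"
    "a boring evening with uninteresting people"
)
```

Looking up a term then reduces to scanning entries whose synset terms contain it and whose part of speech matches the POS tagger's output.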
Chapter 3
RELATED WORK
There is an increasing number and variety of research papers in the area of sentiment analysis
and classification. We chose to consider only related work which makes use of the following:
machine learning techniques, document-level classification, n-gram features (such as unigrams),
part-of-speech tagging, lexical resources (especially SentiWordNet), sentiment classification
techniques involving support vector machines (SVM) and Naïve Bayes, and the review domain,
such as movie reviews or product reviews. We did not consider other research work which used
natural language processing techniques, did not involve feature selection and did not utilize SVM
and Naïve Bayes classifiers.
3.1 Recent Work
Pang and Lee applied machine learning techniques to classify movie reviews according to
sentiment [5]. They employed Naive Bayes, Maximum Entropy, and SVM classifiers, and
observed that these do not perform as well on sentiment classification tasks as on traditional
topic-based text classification tasks. However, SVM did generate a higher accuracy than Naïve
Bayes and Maximum Entropy. They observed that using unigram presence features with an
SVM consistently yielded highly accurate results, but that bigram features produced lower
accuracy than unigrams. They also found that machine learning techniques outperformed
human-produced baselines.
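The distinction between unigram presence and unigram frequency features can be made concrete with a small sketch (toy vocabulary and review; in the actual experiments these vectors would be fed to an SVM):

```python
# Build unigram feature vectors over a fixed vocabulary, either as binary
# presence indicators or as raw term frequencies.

def unigram_features(review: str, vocabulary: list, presence: bool = True):
    words = review.lower().split()
    if presence:
        return [1 if term in words else 0 for term in vocabulary]
    return [words.count(term) for term in vocabulary]

vocab = ["good", "bad", "great"]
review = "good good great"
presence_vec  = unigram_features(review, vocab, presence=True)   # [1, 0, 1]
frequency_vec = unigram_features(review, vocab, presence=False)  # [2, 0, 1]
```

The presence vector discards how often a term repeats, which is precisely the representation Pang and Lee found most effective for sentiment, in contrast to frequency-weighted representations common in topic classification.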
In other work, they attempted to improve the classifiers by using only the subjective
sentences in movie reviews [8]. They explored extraction methods based on a minimum cut
formulation framework which resulted in the development of efficient algorithms for sentiment
analysis. They noted that utilizing contextual information via this framework can lead to a
statistically significant improvement in polarity classification accuracy.
An approach to sentiment analysis using SVMs and a variety of diverse information sources
was introduced by Mullen and Collier [38]. Their work involved extracting value phrases (two
word phrases conforming to a particular part-of-speech) and assigning them sentiment
orientation values using pointwise mutual information. They used SVM in conjunction with
other hybrid SVM models to classify their datasets. Their results indicated that their
proposed approach performed better when used on data which is not topic annotated. However,
when used on data that has been manually topic annotated, their approach did not yield much
improvement.
Ohana and Tierney studied sentiment classification using features built from the
SentiWordNet database of term polarity scores [34]. Their approach consisted of counting
positive and negative term scores to determine sentiment orientation. They also presented an
improvement of this by building a data set of relevant features using SentiWordNet and a
machine learning classifier. They implemented a negation detection algorithm to adjust
SentiWordNet scores accordingly for negated terms and set a threshold value in cases where
multiple SentiWordNet scores were found for a term. A three-fold classification approach was
used and the results obtained were similar to those obtained using the manual lexicons seen in the
literature. These results were close to those obtained by Pang and Lee for a classifier based on
single-document statistics and a word list of positive and negative terms generated manually
for the dataset [5]. In addition, their feature set approach yielded improvements over the baseline
term counting method. The results indicated that SentiWordNet could be used as an important
resource for sentiment classification tasks.
Dave et al. developed ReviewSeer, a document level opinion classifier that uses statistical
techniques and POS tagging information for sifting through and synthesizing product reviews,
essentially automating the sort of work done by aggregation sites or clipping services [25]. They
first used structured reviews for testing and training, identifying appropriate features and scoring
methods from information retrieval for determining whether reviews are positive or negative.
methods. The resulting classifiers performed as well as traditional machine learning methods. They then used the
classifier to identify and classify review sentences from the web, where classification is more
difficult. However, a simple technique for identifying the relevant attributes of a product
produces a subjectively useful summary.
A simple unsupervised learning algorithm for classifying a review as recommended or
not recommended was presented by Turney [40]. The algorithm takes a written review as input
and produces a classification as output in a three step approach: using a part-of-speech tagger to
identify phrases in a review that contain adjectives or adverbs, estimating the semantic
orientation of each phrase extracted, and assigning the review to a class, either recommended or
not recommended, based on the average semantic orientation of the extracted phrases. If the
average is positive, the review is assumed to recommend the item, otherwise, the item is not
recommended. The pointwise mutual information and information retrieval (PMI-IR) algorithm is used to
measure the similarity of pairs of words or phrases in order to estimate the semantic orientation
of a phrase.
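Turney's semantic-orientation step can be sketched as follows. This is a minimal illustration, not his implementation: in the original work the counts come from search-engine hits with a NEAR operator, whereas here they are passed in directly, and the function names are hypothetical.

```python
import math

# Sketch of Turney's PMI-IR semantic-orientation estimate:
# SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor").
# Expanding both PMI terms gives the single log-ratio below.
def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor):
    return math.log2((hits_near_excellent * hits_poor) /
                     (hits_near_poor * hits_excellent))

def classify_review(phrase_orientations):
    """Recommend the item if the average semantic orientation of the
    extracted phrases is positive, as in Turney's third step."""
    avg = sum(phrase_orientations) / len(phrase_orientations)
    return "recommended" if avg > 0 else "not recommended"
```

A phrase seen four times more often near "excellent" than near "poor" (all else equal) gets an orientation of log2(4) = 2.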
A method for sentiment classification based on extracting and analyzing appraisal groups,
such as “very good” or “not terribly funny”, was presented by Whitelaw et al. [35]. An appraisal
group is represented as a set of attribute values in several task-independent semantic taxonomies,
based on appraisal theory. Semi-automated methods were used to build a lexicon of appraising
adjectives and their modifiers. They heuristically extract adjectival appraisal groups from texts
and compute their sentiment scores according to this lexicon. Documents were represented as
vectors of relative frequency features computed over these groups, and an SVM learning
algorithm was used to learn a classifier discriminating positive from negative text documents.
Movie reviews were classified using features based upon these taxonomies combined with
standard “bag-of-words” features, yielding a reported state-of-the-art accuracy of 90.2%.
Abbasi et al. studied the use of sentiment analysis methodologies for classification of
Web forum opinions in multiple languages [36]. The utility of stylistic and syntactic features was
evaluated for sentiment classification of English and Arabic content. Stylistic features consist of
lexical and structural attributes (e.g., unique words, words per sentence, and sentence length)
while syntactic features incorporate manual, semi-automatic, or automatic annotation techniques
to add polarity scores to words and phrases (e.g., word/POS tag n-grams, phrase patterns,
punctuation). Specific feature extraction components were integrated to account for the linguistic
characteristics of Arabic. The entropy weighted genetic algorithm (EWGA), a hybridized genetic
algorithm that incorporates the information-gain heuristic was developed for feature selection.
EWGA was designed to improve performance and get a better assessment of key features. The
proposed features and techniques were evaluated on a benchmark movie review dataset, and U.S.
and Middle Eastern Web forum postings. The experimental results obtained using EWGA with
SVM produced accuracies of over 91%. Stylistic features significantly enhanced performance across
all testbeds, while EWGA also outperformed other feature selection methods, indicating the
utility of these features and techniques for document-level classification of sentiments.
Gamon demonstrated that it is possible to perform automatic sentiment classification in
the very noisy domain of customer feedback data [37]. This work showed that by using large
feature vectors in combination with feature reduction, linear SVMs can be trained to achieve
high classification accuracy on data that present classification challenges for a human annotator.
It also showed that the addition of abstract linguistic analysis features (e.g., stop-word removal
and spell checking) consistently contributes to an increase in classification accuracy in sentiment
classification.
Yessenalina et al. proposed a two-level approach to document-level sentiment
classification that extracts useful sentences and predicts document-level sentiment based on the
extracted sentences [11]. Their model, unlike previous learning methods for this task, does not
rely on the gold standard sentence-level subjectivity annotations and optimizes directly for
document-level performance. This model was evaluated using the movie reviews dataset and the
U.S. Congressional floor debates, and showed improved performance over previous approaches.
Thomas et al. [24] investigated the possibility of determining from the transcripts of U.S.
Congressional floor debates whether each speech (i.e., a continuous single-speaker segment of
text) represents support for or opposition to a proposed piece of legislation. They addressed the
problem by exploiting the fact that the speeches occur as part of a discussion and this allowed the
use of sources of information regarding relationships between discourse segments, such as
whether a given utterance indicates agreement with the opinion expressed by another. They
demonstrated that the incorporation of agreement modeling can provide substantial
improvements over the application of SVMs in isolation, which represents the state of the art in
the individual classification of documents. Their enhanced accuracies were obtained via a fairly
primitive automatically-acquired “agreement detector” and a conceptually simple method for
integrating isolated-document and agreement-based information. Their results showed the
potentially large benefits of exploiting sentiment-related discourse-segment relationships in
sentiment analysis tasks.
Mullen and Collier proposed an approach to sentiment analysis which uses SVMs to bring
together diverse sources of potentially pertinent information, including several favorability
measures for phrases and adjectives and, where available, knowledge of the topic of the text [38].
Models using the features introduced were further combined with unigram models and
lemmatized versions of the unigram models. The results of a variety of experiments are
presented, using both data which is not topic annotated and data which has been hand annotated
for topic. In the case of the former, their approach is shown to yield better performance than
previous models on the same data. In the latter, their results indicated that their approach may
allow for further improvements to be gained given knowledge of the topic of the text.
Chapter 4
METHODOLOGY
In this research, we combined supervised and unsupervised sentiment classification
approaches to investigate the performance of CPD as a feature selection method in sentiment
classification. In this chapter, we discuss our experimental setup and execution steps. In Section
4.1, we state some assumptions. In Section 4.2, we provide an overview of our approach. In
Section 4.3, we discuss our scoring method. In Section 4.4, we discuss the feature selection
methods used. In Section 4.5, we discuss the classifier used. In Section 4.6, we discuss our
classification steps.
4.1 Assumptions
The following assumptions were made as part of our methodology:
- Each dataset is assumed to be restricted to one type of object. For example, all the movie
reviews will be about movies.
- Unigrams (individual terms in a review, e.g., clear, noisy) are used as features (i.e., we do
not use bigrams or higher-order n-grams).
- Stemming (the process of reducing inflected words to their base or root form, e.g.,
“clearly” is stemmed to “clear”) could distort the part-of-speech of a term, so it is not
used.
- For tagging, our aim is to find the sentiment of the whole document, thus whole
documents are tagged rather than individual sentences.
- For terms with multiple polarity scores, the average of all these scores is used.
- The percentage of unique terms in the documents that are not found in the
SentiWordNet dictionary is considered minimal and so is ignored.
- Terms not found in the SentiWordNet dictionary are regarded as non-sentiment-bearing
words.
4.2 Description of our approach
In this work, we used the general sentiment analysis and classification steps outlined in
Chapter 1 of this report, plus a few additions. The pre-processing step involves downloading the
datasets. Using the Stanford POS tagger [41], terms in each document are parsed to
determine the part-of-speech of each term. Tokenization is performed on the tagged dataset to
extract each term and its part-of-speech. To filter the terms for easy scoring, stop words are
removed and SentiWordNet is used to determine the score for each term.
4.3 SentiWordNet Scoring
Each document is parsed by SentiWordNet to determine the term scores. The version of
SentiWordNet that is used requires a term and its part-of-speech to produce a score. For a Neg
score, the value is preceded by a negation operator. For a Pos score, there is no preceding
operator. For an Obj score, the value generated is zero. Terms that are not found in the
SentiWordNet dictionary are automatically scored as zero. All terms in each document are stored
as binary-valued feature vectors representing a term and its score. In situations where a term had
more than one score, the average score is chosen. For example, the sample output generated after
one document is parsed by SentiWordNet is shown in Figure 4.1. Each line represents a term, its
part-of-speech, and its score, where “a” represents an adjective, “v” a verb, and “n” a noun.
happy: a: 0.2881;
bastard: n: -0.2135;
head: n: 0.0035;
start: n: 0.0067;
movie: n: 0.0000;
strangeness: n: -0.2700;
kick: v: -0.015;
Figure 4.1 SentiWordNet output
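The scoring behavior described in this section can be sketched as follows. This is a minimal illustration, assuming a tiny lookup table as a stand-in for the real SentiWordNet database; the entries and their scores are illustrative, not actual SentiWordNet values.

```python
# Illustrative stand-in for the SentiWordNet database: each
# (term, part-of-speech) pair may have several synset scores.
SENTIWORDNET = {
    ("happy", "a"): [0.5, 0.0762],  # multiple scores -> averaged
    ("movie", "n"): [0.0],
}

def score_term(term, pos):
    """Average all scores for (term, POS); terms not found in the
    dictionary are scored as zero (non-sentiment-bearing)."""
    scores = SENTIWORDNET.get((term, pos))
    if not scores:
        return 0.0
    return sum(scores) / len(scores)
```

With these assumed entries, ("happy", "a") averages to 0.2881, while an unknown term scores 0.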
A dynamic array, such as the one shown in Figure 4.2, is created to store the feature
vectors. This is an n by m matrix where n represents the total number of documents and m the
total number of terms or attributes in the entire dataset. This matrix has a unique id for each term
stored as the attribute heading and an id for each document stored as rows. At the intersection of
each column and row is the score of the term (i.e., each cell contains the score of a term in a
particular document). The matrix also stores the overall sentiment of each document as either 0
(negative) or 1 (positive) at the end of each row.
Term1 Term2 Term3 ---- TermM-1 TermM Class
Doc1 0.1201 -0.233 0.000 ---- -0.256 -0.956 0
Doc2 0.989 0.00 0.2365 ---- -0.003 0.5633 1
------
------
DocN-1 0.000 -0.123 0.0235 -0.9865 0.001 -0.212 0
DocN 0.2333 0.000 0.455 0.5615 -0.200 -0.0001 1
Figure 4.2 A sample dynamic array
All the entries in a feature vector array are converted into attribute-relation file format
(arff) to serve as input for the classifier. Figure 4.3 shows the movie review dataset terms in arff
format. In Figure 4.3, txt_sentoken is the name of the dynamic array that was used. The terms are
represented as attributes with numeric values and each term is given an id or index value. The
output shows the results generated for three documents with the scores for each document
represented in a pair of braces ({}). The values in each of these sets represent the index of a term
@relation data/txt_sentoken
@attribute start numeric
@attribute great numeric
@attribute stars numeric
…………
…………
…………
@attribute white numeric
@attribute boring numeric
@attribute Class {0,1}
@data
{0 0.2881,1 -0.2135,4 0.0447,5 -0.125,6 0.0012,7 0.0035,8 0.00055,14
0.0153,16 0.08,17 -0.0197,24 -0.27,25 -0.0111,26 0.0136,30 0.0095,32 -
0.045,34 0.1701,35 0.02745,36 0.0403,39 0.0223,41 -0.2273,42 -0.0061,66
0.466571,67 0.002,69 0.1037,72 0.0006,73 -0.125}
{5 -0.125,16 0.08,17 -0.0197,56 0.0103,59 -0.0199,66 0.466571,76
0.0127,107 0.5407,109 0.3864,121 0.2625,165 -0.00215,186 0.3182,207
0.2727, 422 0.0137,448 0.0022,615 0.2083,943 0.3636,944 0.625,946
0.0121,947 0.25,948 0.1667,949 -0.0324,951 0.2045,956 -0.3182,1549 1}
{17 -0.0197,42 -0.0061,66 0.466571,113 -0.0302,132 -0.0571,135 0.0219,136
-0.375,230 0.0743,302 -0.0021,309 0.0139,564 0.0876,584 0.0338,589
0.0035,725 -0.5,843 -0.0255,845 0.0539,846 0.0059,847 -0.0793,848 -
0.0937,856 0.0307, 866 0.25,867 -0.0026,868 -0.4091,1549 1}
Figure 4.3 Sample movie review dataset in arff format
and its score. For positively labeled documents, the final value in the set represents the overall
document classification. For instance, for the value “1549 1”, “1549” represents the index of the
last column in the array and “1” is the class of the document. Negatively labeled documents did
not have this representation for the last column in the array. The index corresponding to the last
column was empty since its value was zero and the Weka model automatically ignores all zero
scores.
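The conversion from a row of the score matrix to a sparse ARFF data row can be sketched as follows. The function name is a hypothetical illustration; the class-attribute index 1549 is taken from the example above, and zero-valued cells (including a class of 0) are omitted, which is why negatively labeled documents have no entry for the last column.

```python
CLASS_INDEX = 1549  # index of the class attribute, as in the example above

def to_sparse_arff_row(scores, class_label):
    """Render one document as a sparse ARFF row: "{index value,...}".

    `scores` maps attribute index -> SentiWordNet score; zero scores
    are omitted, matching Weka's sparse instance format.
    """
    entries = [(i, v) for i, v in sorted(scores.items()) if v != 0.0]
    if class_label != 0:  # a zero class value is omitted like any zero
        entries.append((CLASS_INDEX, class_label))
    return "{" + ",".join(f"{i} {v}" for i, v in entries) + "}"
```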
4.4 Feature Selection
Before any dataset can be classified, each document must be converted into the feature vector
representation, the representation most widely used in classification. This ensures that the most important
features that express the sentiments of each document are extracted while at the same time
preserving the meaning of the document. This is applicable not only in sentiment analysis, but
also in general text classification. In this project, unigrams are the only features used based on
the recommendations of Pang and Lee [5]. A binary feature vector representation of some terms
(found in the movie review documents) and their SentiWordNet scores is shown in Figure 4.4.
(basically 0.9162907318741551, hide 0.10536051565782635, drink
0.10536051565782635, year -0.7419373447293773, quickly 0.9162907318741551,
world -0.7419373447293773)
Figure 4.4 Sample binary feature vector representation
The efficiency of our classification depends on the number of features (unique terms)
extracted from the documents. The larger the number, the longer the classifier takes to classify a
document. Feature selection is used to enhance the efficiency of classification by eliminating less
relevant features, from the document. Each feature is scored based on some predefined measure
and the most relevant terms are selected based on this measure. In this project, we use three
feature selection methods, and based on the f-measure values, features are eliminated.
4.4.1 Feature selection methods
Both text classification and sentiment classification have benefited from the use of
feature selection methods to improve classification accuracy and efficiency. We used three
feature selection methods: IG, χ2, and CPD. IG and χ2 have been used extensively in previous
research [28, 29, 36]. CPD is a new method introduced by Simeon and Hilderman [29] and
recently studied among feature selection and weighting methods for sentiment analysis by O’Keefe
and Koprinska [39].
4.4.2 CPD
CPD measures the degree to which a word contributes to differentiating a particular
category from other categories in a text corpus. The possible values for CPD fall in the range -1
to 1, where values near -1 indicate that a term occurs in approximately an equal number of
documents in all categories and a value of 1 indicates that a term occurs in the documents in only
one category. Simeon and Hilderman used an example contingency table (see Table 4.5) to
formally define CPD for a word w in category c [29].

Table 4.5 Contingency table

            c         ¬c        Σ Row
w           A         B         A + B
¬w          C         D         C + D
Σ Column    A + C     B + D     N

The categorical proportional difference of w in c is:

CPD(w, c) = (A - B) / (A + B),

where A is the number of times word w and category c occur together, B is the number of times
word w occurs without category c, C is the number of times category c occurs without word w,
and D is the number of times neither word w nor category c occurs [29].
CPD can be simply stated as the ratio of the difference between the number of documents
of a category in which a word occurs and the number of documents of other categories in which
the word also occurs, divided by the total number of documents in which the word occurs. The
CPD for a word is the ratio associated with the category ci for which the value is greatest. That is,

CPD(w) = max over all categories ci of CPD(w, ci).
4.4.3 IG and χ2
IG measures the decrease in entropy when a selected feature is present versus when it is
absent. χ2 is the common statistical test that measures divergence from the distribution expected
if one assumes the feature occurrence is actually independent of the class value. As a statistical
test, it is known to behave erratically for very small expected counts, which are common in text
classification due to rarely occurring word features and having few positive training examples
for a concept [28]. IG and χ2 can be defined generally by the following formulas:

IG = e(pos, neg) - [Pword * e(tp, fp) + P¬word * e(fn, tn)],

where e(x, y) = -(x/(x+y)) log2(x/(x+y)) - (y/(x+y)) log2(y/(x+y)), and

χ2 = t(tp, (tp+fp)Ppos) + t(fp, (tp+fp)Pneg) + t(fn, (fn+tn)Ppos) + t(tn, (fn+tn)Pneg),

where t(count, expect) = (count - expect)^2 / expect.
tp, fp, tn, and fn are the number of true positives, false positives, true negatives, and false
negatives, respectively. pos = tp + fn and neg = fp + tn are the number of positives and number of
negatives, respectively. Ppos = pos/all is the probability of a positive, Pneg = neg/all is the
probability of a negative, Pword = (tp + fp)/all is the probability of the word appearing, and
P¬word = 1 - Pword is the probability of the word not appearing.
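Under these definitions, IG and χ2 can be computed directly from the four confusion counts, as in the following sketch (the function names are illustrative):

```python
import math

def entropy(x, y):
    """e(x, y) from the IG formula above."""
    total = x + y
    result = 0.0
    for v in (x, y):
        p = v / total
        if p > 0:  # 0 * log2(0) is taken as 0
            result -= p * math.log2(p)
    return result

def info_gain(tp, fp, tn, fn):
    """IG = e(pos, neg) - [Pword * e(tp, fp) + Pnotword * e(fn, tn)]."""
    allc = tp + fp + tn + fn
    p_word = (tp + fp) / allc
    return (entropy(tp + fn, fp + tn)
            - p_word * entropy(tp, fp)
            - (1 - p_word) * entropy(fn, tn))

def chi_square(tp, fp, tn, fn):
    """Sum of (count - expect)^2 / expect over the four cells."""
    allc = tp + fp + tn + fn
    p_pos = (tp + fn) / allc
    p_neg = (fp + tn) / allc
    t = lambda count, expect: (count - expect) ** 2 / expect if expect else 0.0
    return (t(tp, (tp + fp) * p_pos) + t(fp, (tp + fp) * p_neg)
            + t(fn, (fn + tn) * p_pos) + t(tn, (fn + tn) * p_neg))
```

As a sanity check, a word whose occurrence is independent of the class scores 0 under both measures, while a word perfectly associated with one class scores 1 bit of information gain and a χ2 equal to the number of documents.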
4.5 Classifier
This research utilized SVM and Naïve Bayes classifiers, two classifiers that have been
widely and frequently used for many text and sentiment classification tasks [5, 8, 24, 34, 37, 38].
Another reason for our using these classifiers is to be able to compare our results to the text
categorization results of Simeon and Hilderman [29] and the sentiment classification results of
Pang and Lee [5], Ohana and Tierney [34], and O’Keefe and Koprinska [39]. SVM performs
classification by constructing an N-dimensional hyperplane that optimally separates the data into
two categories, while Naïve Bayes is a simple probabilistic classifier based on Bayes’ theorem.
These classifiers were part of Weka, an open source data mining application which we used in
our experiments. We discuss Weka in Chapter 5.
4.6 Classification
Figure 4.5 describes the steps used in this research. The Weka data mining software is
used to classify our datasets using the generated matrix from the preprocessing stage which had
already been converted into arff format as input. The command line version of this software
enabled the creation of a model that is suitable for the purpose of our experiment. This model
uses its Naïve Bayes and SVM classifiers and the feature selection methods CPD, IG, and χ2 in
the experiments. All the terms are used by each classifier in turn, first without any feature
selection methods.
Figure 4.5 Experiment Execution stages
The model is once again executed, but with each feature selection method in turn: first
with CPD, then IG, and finally with χ2. Ten-fold cross validation was used to arrive at the best
performance results. For each execution, the f-measure values are obtained by scoring each term
using the current feature selection method, and the unique scores are stored in a sorted score list.
Based on these scores, 5% of the total number of features is eliminated after each classification.
The features eliminated after each run are those with the lowest f-measure values. This
process continues until there are no more features to eliminate.
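The iterative elimination loop described above can be sketched as follows. This is an illustration only: `score_feature` and `run_classifier` are hypothetical stand-ins for the current feature selection method (CPD, IG, or χ2) and for a Weka ten-fold cross-validation run, respectively.

```python
# Sketch of the iterative 5% feature-elimination loop described above.
def eliminate_iteratively(features, score_feature, run_classifier):
    results = []
    step = max(1, round(0.05 * len(features)))   # 5% of the original total
    while features:
        results.append(run_classifier(features)) # e.g., 10-fold CV accuracy
        ranked = sorted(features, key=score_feature)
        features = ranked[step:]                 # drop the lowest-scored 5%
    return results
```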
Chapter 5
EXPERIMENTAL RESULTS
A series of experiments was performed to evaluate and compare CPD, IG, and χ2
on sentiment classification tasks. In this chapter, we discuss the experimental results. In Section
5.1, we give an overview of the hardware used. In Section 5.2, we discuss the Weka data mining
software. In Section 5.3, we discuss the datasets used in the experiments. In Section 5.4, we
present the interest measures used. In Section 5.5, we present our results. In Section 5.6, we
discuss our results.
5.1 Hardware and Software
The experiments were executed on two systems: the first was an HP G60 notebook
running Windows Vista with 3GB RAM, 250GB of hard disk space, and a 2.0GHz Pentium(R)
Dual-Core CPU, and the other was a Dell desktop PC running Windows XP Professional
with 3GB RAM, 250GB of hard disk space, and a 1.8GHz processor. The software used included
the Weka data mining application (version 3.7), Java SE (version 6), the Netbeans IDE
(version 6.9), and the basic English Stanford POS tagger (version 3.0.4).
5.2 Weka
Weka is an acronym for Waikato Environment for Knowledge Analysis. The Weka
application is a collection of state-of-the-art machine learning algorithms and data processing
tools for data mining tasks. It is open source software issued under the GNU General Public
License and runs on almost any platform. The algorithms can either be applied directly to a
dataset or called from one’s own Java code. It is suited for developing new machine learning
schemes [30]. All Weka algorithms take their input in the form of a single relational table in
ARFF (Attribute-Relation File Format) format which can be read from a file or generated by a
database query.
Weka has a variety of classifiers and filters to choose from. Its main strength lies in the
classification area, where many machine learning approaches have been implemented [32]. It
comes with four interfaces: the Explorer, the Knowledge Flow, the Experimenter, and the
command line interface. The command line interface provides the basic functionality that lies
behind Weka’s interactive interfaces. It can be accessed in raw form by entering textual
commands in order to gain access to all features of the system. The command line accepts Java
syntax and is made up of Java features such as classes, instances, and packages. Weka provides a
directory containing documentation of all Java classes it uses. The command line is
recommended for in-depth experimental usage because it offers some functionality which is not
available via the GUI, and it uses far less memory. This project used the command line version to
execute all experiments.
5.3 Datasets
The Cornell movie review dataset (sentiment polarity dataset v2.0) and the congressional-
speech corpus (convote dataset) were used for our experiments. The movie review dataset is a
popular dataset that has been previously used for research in sentiment classification. It consists
of 2000 processed text files, half of which have been categorized into positive (pos) reviews and
the other half negative (neg). This was downloaded from
http://www.cs.cornell.edu/People/pabo/movie-review-data/. All files were renamed according to
the category they belonged to and their count. For instance, all text files corresponding to
positive (negative) reviews were renamed beginning with the prefix “pos” (“neg”) followed by a
number. For example, pos(30) represents the thirtieth positive text file and neg(500) represents
the five hundredth negative text file.
The convote dataset was downloaded from
http://www.cs.cornell.edu/home/llee/data/convote.html. The original dataset includes three
stages of tokenized speech-segment data, corresponding to three different stages in the analysis
pipeline employed by Thomas et al. [24]. For our experiment, we only used the data in stage 3,
consisting of speech-segments that had been classified into support for, or opposition to, a
particular bill.
All filenames ending in Y were renamed as positive, and all those ending in N, as
negative, using the same labelling format used for the movie dataset. The files had been divided
into three sets: the development set, the training set, and the test set. The training set was used
since it contained the majority of the data. It comprised 1,200 positive and 1,200 negative documents.
5.4 Interest Measures
The performance of most classifiers is measured using the popular metrics precision,
recall, f-measure and accuracy. Precision measures the exactness of a classifier. A higher
precision means fewer false positives (negative documents which have been wrongly classified
as positive), while a lower precision means more false positives. This is often at odds with recall,
as an easy way to improve precision is to decrease recall. Recall measures the completeness, or
sensitivity, of a classifier. Higher recall means fewer false negatives (positive documents which
have been wrongly classified as negative), while lower recall means more false negatives.
Improving recall can often decrease precision because it gets increasingly harder to be precise as
the sample space increases. The f-measure is the harmonic mean of precision and recall, which
gives higher scores when the precision and recall results are closer together. How well this all
works depends on what is being analyzed. Accuracy is the fraction of documents that the
classifier classifies correctly.
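These interest measures can be sketched from the confusion-matrix counts of a binary classifier as follows (the function name is illustrative):

```python
def interest_measures(tp, fp, tn, fn):
    """Precision, recall, f-measure (harmonic mean), and accuracy
    from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f_measure, accuracy
```

Note that when precision equals recall, the f-measure equals them both; the further apart they are, the more the harmonic mean is pulled toward the smaller value.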
The performance measures used in this research are the same as that used by Simeon and
Hilderman [29]. Figure 5.1 shows the results generated after 1000 (500 positive and 500
negative) documents of the convote dataset are classified using the Naïve Bayes classifier. The
measures used include the accuracy, true positive, false positive, true negative, false negative,
precision, recall, f-measure and root mean square error.
For a feature size of 10,543, the accuracy was 59.8%, the f-measure was 0.597, and the root
mean squared error was 0.618. For both classes (positive=1 and negative=0), the average recall
was 0.598, average precision was 0.599, average true positive was 0.598, average false positive
was 0.402, average true negative was 0.598, and average false negative was 0.402. For a feature
size of 9,476, the accuracy was 60.07%, the f-measure was 0.605, and the root mean squared error
was 0.612. For both classes, the average recall was 0.607, average precision was 0.609, average
true positive was 0.607, average false positive was 0.393, average true negative was 0.607, and
average false negative was 0.393.
Figure 5.1 Sample Naïve Bayes classification results
5.5 Results
Experiments with CPD and the other feature selection methods using the SVM and Naïve
Bayes classifiers have shown that, in general, according to the f-measure, CPD performs better than
the other feature selection methods in most text categorization tasks [29]. Here, our objective is to
experiment using CPD, IG, and χ2 for sentiment classification with the SVM and Naïve Bayes
classifiers to ascertain whether this conclusion still holds. The results of our experiments are
presented in the following sections.
5.5.1 Initial Classification Without Feature Selection
Table 5.1 shows the summary of average ten-fold cross validation accuracies in percent
for the base classifier. SVM performed better than Naïve Bayes for both datasets, results that are
consistent with those obtained by Pang and Lee [5]. For the movie review dataset, SVM had an
accuracy of 79.45%, while Naïve Bayes had an accuracy of 78.95%. For the convote dataset,
accuracies for both classifiers were reduced. SVM had an accuracy of 65.96% and Naïve Bayes
had an accuracy of 57.38%.
Table 5.1 Average 10-fold cross validation accuracies for base classifier in percent
Dataset Number of features SVM Naïve Bayes
Movie Reviews 36,467 79.45 78.95
Convote 15,549 65.96 57.38
5.5.2 Results for Movie Review Dataset
Tables 5.2 and 5.3 show the summary results for the movie review dataset generated
when feature selection is performed during the classification stage using SVM and Naïve Bayes,
respectively. The first line of each table is repeated from Table 5.1. The best results generated for
each feature selection method are shown in bold. Raw results can be found in Appendix A. From Table
5.2, we notice that CPD performs better than IG and χ2. Its best accuracy of 88.1% was
obtained with only 42% of the features. IG, on the other hand, generated its best accuracy
of 82.2% with 19% of the features. χ2 did not generate any accuracy values greater than the base
classifier and performed poorly compared to CPD and IG. This could be attributed to the fact that
χ2 is known to behave erratically for the very small expected counts that are common in text
classification due to rarely occurring word features.
Table 5.2 SVM accuracies for Movie review dataset in percent
# of Features Percentage CPD IG χ2
36,467 100 79.45 79.45 79.45
27,350 75 79.05 79.45 77.5
17,536 48 84.5 78.75 77.5
15,204 41.7 88.1 78.75 77.5
9,116 25 88.1 80.95 77.5
7,037 19 88.1 82.2 65.2
3,647 10 88.1 80.65 63.45
Table 5.3 Naïve Bayes accuracies for Movie review dataset in percent
# of Features Percentage CPD IG χ2
36,467 100 78.95 78.95 78.95
27,350 75 79.05 79.45 77.2
17,536 48 77.95 73.7 67.45
7,190 19.7 77.95 67.1 67.45
3,647 10 77.95 65.3 63.95
Table 5.3 shows the summary Naïve Bayes accuracies generated for the movie review
dataset. With 75% of the features, IG had the highest accuracy of 79.45%, while CPD had an
accuracy of 79.05%. χ2 did not perform any better than the base classifier. The lowest accuracy
produced by CPD was 77.95%, the lowest for IG was 65.3%, and the lowest for χ2 was 63.95%.
Comparing CPD with IG, even though IG had the highest accuracy at 79.45%, the general
performance of CPD was better since its accuracy remained constant after 50% of the features had
been eliminated, while that of IG decreased.
Figure 5.4 SVM accuracies for the movie review dataset
Figure 5.4 shows a graphical representation of the SVM movie review dataset
accuracies at different numbers of features. From this figure, we notice that as the number of
features decreases, the CPD accuracy increases until about half of the features have been eliminated,
and then remains consistent as more features are eliminated. IG, on the other
hand, showed no consistency: its accuracy decreases, then increases, and then decreases
again. χ2 did not increase at all; its accuracy decreased and remained relatively consistent until, at
19% of the features, it began to decrease again.
Figure 5.5 shows a graphical representation of the accuracies generated by Naïve Bayes
for the movie review dataset. We notice that the accuracy for CPD and IG increased after 25%
of the features were eliminated, and after that the accuracy decreased as the number of features
decreased. We notice that CPD performed better than IG and χ2 even though the accuracy
decreased.
Figure 5.5 Naïve Bayes accuracies for the movie review dataset
5.5.3 Results for Convote Dataset
Tables 5.6 and 5.7 show the summary results generated when feature selection is
performed during the classification stage using SVM and Naïve Bayes, respectively. The best
result generated for each feature selection method is in bold. For the convote dataset, the
accuracies generated were lower than for the movie review dataset. This can be attributed to
the fact that both classifiers are domain dependent, and it is therefore normal for them to perform
differently on different domains.
In Table 5.6, we notice that none of the results showed any improvement over the base
classifier result of 65.96%. However, comparing CPD, IG, and χ2, CPD had the best accuracy.
We also notice that after 25% of the features were eliminated, the accuracy of CPD remained the
same while that of IG and χ2 kept decreasing. The lowest accuracy for CPD was 65.5%, for χ2 it
was 60.45%, and for IG it was 59.3%. For Table 5.7, even though the accuracies were low, all three
feature selection methods performed better than the base classifier accuracy of 57.38%, which
was lower than that generated for the SVM. The highest result was generated by χ2 (61.46%)
with 28% of the features. χ2 performed better than IG and CPD, while CPD performed better
than IG. The performance of χ2 here can also be attributed to its erratic behavior on text
classification tasks.
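For reference, the simple χ2 score used here measures the dependence between a term and a class from a 2×2 contingency table of document counts. A minimal sketch (a generic formulation, not the exact implementation used in the experiments):

```python
def chi_square(a, b, c, d):
    """Simple chi-square score for a term/class 2x2 contingency table:
    a = in-class docs containing the term, b = out-of-class docs containing it,
    c = in-class docs without it,          d = out-of-class docs without it."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# A term confined to one class gets the maximum score (equal to n);
# a term spread evenly across classes scores zero.
print(chi_square(10, 0, 0, 10))  # 20.0
print(chi_square(5, 5, 5, 5))    # 0.0
```

Because the statistic is driven by raw counts, it is known to be unreliable for rare terms, which is consistent with the erratic behavior observed above.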
Table 5.6 SVM accuracies for convote dataset in percent
# of Features Percentage CPD IG χ2
15,549 100 65.96 65.96 65.96
11,662 75 65.5 65.18 62.90
4,332 28 65.5 61.93 62.82
3,334 22 65.5 60.03 62.25
1,555 10 65.5 59.3 60.45
Table 5.7 Naïve Bayes accuracies for convote dataset in percent
# of Features Percentage CPD IG χ2
15,549 100 57.38 57.38 57.38
11,662 75 57.9 57.17 57.38
4,332 28 57.9 57.9 61.46
3,334 22 57.9 57.4 60.04
1,555 10 57.9 57.0 59.4
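The retention levels in Tables 5.6 and 5.7 correspond to keeping a fixed fraction of the highest-scoring features under each method. A minimal sketch of this selection step (the function name and scores are illustrative, not taken from the experiments):

```python
def retain_top(scores, fraction):
    """Return the indices of the top `fraction` of features ranked by
    their feature selection score (CPD, IG, or chi-square), mirroring
    the retention levels in Tables 5.6 and 5.7."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6]
print(retain_top(scores, 0.75))  # keeps 6 of 8 features: [0, 2, 3, 4, 5, 7]
```

The classifier is then retrained on only the retained columns at each retention level.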
Figure 5.8 shows the graphical representation of using SVM and the three feature
selection methods on the convote dataset. We notice that CPD is relatively consistent, while IG
and χ2 are inconsistent. IG decreases at a greater rate than χ2. With 75% of the features, its
accuracy was higher than that of χ2, but just before 50% of the features were eliminated, its
accuracy decreased drastically (below that of χ2) and kept decreasing as more features were
eliminated. Figure 5.9, on the other hand, shows the graphical representation of the Naïve Bayes
accuracies for the three feature selection methods on the convote dataset. CPD's performance was
relatively consistent in this case too. The accuracy of IG decreased, then increased, and then
decreased. The performance of χ2 was different in this classification: it increased sharply and
then began to decrease, but its lowest accuracy was still higher than the base classifier result.
Figure 5.8 SVM accuracies for convote dataset
Figure 5.9 Naive Bayes accuracies for convote dataset
5.6 Discussion
In this section, we present a discussion of the results obtained from our
experiments. We explain and justify the performance of CPD, and compare our results with those
of Ohana et al. [34], O'Keefe et al. [39], Pang et al. [5], and Simeon et al. [29].
5.6.1 CPD Performance
We observe from Table 5.2 that CPD generated the best accuracies on the movie review
dataset for the SVM classifier. For the Naïve Bayes classifier, its accuracy was slightly
lower than that of IG with 75% of the features. For the convote dataset, we observe from Table
5.6 that even though all the feature selection methods generated accuracies below that of the base
classifier (SVM), CPD had the highest accuracy among them. For Naïve Bayes, however, CPD had the
highest accuracy with 75% of the features, but χ2 performed the best in most situations.
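As background for this discussion, the CPD score of a term for a category, following Simeon and Hilderman [29], is the ratio (A − B)/(A + B), where A is the number of documents of that category containing the term and B is the number of documents of the other categories containing it. A minimal sketch:

```python
def cpd_score(a, b):
    """Categorical proportional difference for a term t and category c,
    where a = number of documents of category c containing t and
    b = number of documents of all other categories containing t.
    Ranges from -1 (never in c) to close to 1 (only in c)."""
    return (a - b) / (a + b)

# A term appearing in 30 positive reviews and 5 negative reviews is
# strongly associated with the positive class; an evenly spread term
# has no discriminative power.
print(round(cpd_score(30, 5), 3))  # 0.714
print(cpd_score(20, 20))           # 0.0
```

Terms with scores near zero contribute little to distinguishing the classes and are the first candidates for elimination.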
From all the results generated, we noticed that CPD maintained the same accuracy value
after 50% of the features had been eliminated. This was due to the CPD scores that were
generated. The objective of CPD is to find the maximum F-measure value and set a cutoff point
based on this value; this cutoff point is used to eliminate features. For our experiments, due to
time constraints, we eliminated features at specific scale rates rather than at the F-measure
values. Using these scale factors, one experiment took on average eight days to run to
completion. For the CPD experiments, the CPD scores generated after 50% of the features were
eliminated did not differ much. This resulted in most of the terms being eliminated at that point,
causing the experiment to complete after this step. The same situation was observed for both
datasets. Figure 5.10 shows a graphical representation of the CPD feature selection process.
[Flowchart summary: generate CPD scores for all terms and produce a sorted score list; the
base classifier supplies the initial F-measure value and the maximum F-measure is set to 0;
the cutoff is set to 2% of terms; the 2% of terms with the lowest CPD scores are eliminated;
the model is run with the reduced term set and 10-fold cross-validation yields an F-measure
value; if this F-measure exceeds the maximum F-measure, the maximum is updated; the loop
repeats while terms remain, then the process ends.]
Figure 5.10 CPD feature selection process
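The elimination loop depicted in Figure 5.10 can be sketched as follows; `evaluate` is a placeholder for the classifier run with 10-fold cross-validation, and the mock mean-score evaluation used in the demonstration is illustrative only:

```python
def cpd_select(terms, scores, evaluate, step=0.02):
    """Sketch of the loop in Figure 5.10: repeatedly drop the `step`
    fraction of terms with the lowest CPD score, re-evaluate the
    classifier, and keep the term set with the best F-measure.
    `evaluate` stands in for a classifier run returning an F-measure."""
    ranked = sorted(terms, key=lambda t: scores[t])  # ascending CPD score
    best_f, best_terms = evaluate(ranked), list(ranked)
    while ranked:
        cut = max(1, int(len(ranked) * step))
        ranked = ranked[cut:]            # eliminate lowest-scoring terms
        if not ranked:
            break
        f = evaluate(ranked)
        if f > best_f:                   # new maximum F-measure
            best_f, best_terms = f, list(ranked)
    return best_f, best_terms

# Mock evaluation: the mean CPD score of the surviving terms stands in
# for the classifier's F-measure (illustrative values only).
scores = {"a": 0.1, "b": 0.2, "c": 0.9, "d": 0.8}
mean_score = lambda ts: sum(scores[t] for t in ts) / len(ts)
best_f, best_terms = cpd_select(list(scores), scores, mean_score)
print(best_terms)  # ['c'] -- the mock measure keeps rising as terms drop
```

With this mock measure the loop runs until a single term remains; with a real classifier, the F-measure typically peaks earlier and the best intermediate term set is returned.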
5.6.2 Comparison with Other Results
Since we used an approach similar to previous work in sentiment analysis and
classification, we compare our results with past results. Table 5.11 shows how our results
compare with other published results on the movie dataset. Our approach has a high accuracy
compared to results obtained using manually built lexicons, which serves as encouragement for
the use of lexicons (specifically SentiWordNet) built with semi-supervised methods. We also
notice that feature selection had a great effect on accuracy compared to the other two
approaches, which did not use feature selection as we did.
Table 5.11 Accuracy comparison
Method Accuracy
SentiWordNet scores used as features and CPD as feature selection method (this research) 88.1%
SentiWordNet scores used as features with no feature selection method [34] 69.35%
Term counting using a manually built list of positive/negative words [5] 69.00%
We can also compare our results with those obtained by Simeon and Hilderman [29].
According to their experimental results, they concluded that CPD outperformed other frequently
studied feature selection methods in four out of six text categorization tasks on the datasets
they used. In our sentiment analysis and classification research, we also observed that CPD
performed better in most cases on both datasets when the SVM classifier was used. This is in line
with their results and serves as encouragement to use CPD in further sentiment analysis and
classification research.
O’Keefe and Koprinska [39] also studied CPD together with other feature selection
methods on the movie review dataset. Their results show that it is possible to maintain a
state-of-the-art classification accuracy of 87.15% while using less than 36% of the features.
Our results show that, with 41.7% of the features, we could obtain an accuracy of 88.1%, which
is an improvement. Just as our research confirmed, their research also showed SentiWordNet
scoring to be a good feature scoring approach.
5.6.3 Drawbacks
We observed that CPD has a much longer running time than IG and χ2, even though its
performance was better. To reduce the running time, we used a cutoff scale of 5%.
Some of the SentiWordNet scores generated were inaccurate. This may be attributed to the
reliance on glosses (a gloss is a brief note on the meaning of a word or wording in a text) as
the source of information for determining term orientation. Some terms that should have a
negative orientation were given positive scores, and vice versa, based on the gloss the term is
more likely to be associated with. Some ambiguous terms were difficult to score, and for such
terms the average score was chosen. This score also influenced our accuracy values.
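The averaging fallback for ambiguous terms can be sketched as follows; the per-sense (positive, negative) score pairs below are illustrative values, not actual SentiWordNet entries:

```python
def average_polarity(senses):
    """Average (positivity - negativity) over a term's senses, the
    fallback used for ambiguous terms.  Each sense is a
    (pos_score, neg_score) pair; in practice these would come from
    the SentiWordNet lexicon, here they are illustrative values."""
    if not senses:
        return 0.0
    return sum(p - n for p, n in senses) / len(senses)

# Hypothetical senses for an ambiguous word: one clearly positive
# sense and one mildly negative one.
print(average_polarity([(0.75, 0.0), (0.125, 0.375)]))  # 0.25
```

Averaging across senses can mask the sense actually used in a review, which is one way such scores end up influencing the accuracy values.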
Unigrams alone are known to yield better accuracies, but for some terms, using unigrams
distorted their sentiment orientation and thereby affected the sentiment score; for such terms,
higher-order n-grams could have been used.
The convote dataset did not perform as well as the movie dataset. This could be
attributed to the composition and organization of the data; further processing will have to be
done on this dataset to improve its accuracy in future. Another factor is that the reviews
appear to cover several topics of debate, even though they all have some political connotation.
Chapter 6
CONCLUSION
For our work, we investigated the performance of CPD as a feature selection method alongside
other popular feature selection methods: IG and χ2. Two datasets were used in this research,
and SentiWordNet was used to score the terms in each document. The SVM and Naïve Bayes
classifiers in the Weka data mining application were used to classify the datasets into
positive and negative sentiments. Experimental results show that CPD performs well as a feature
selection method for sentiment analysis and classification tasks, yielding the best accuracy
in three out of four experiments.
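The experiments relied on 10-fold cross-validation, which partitions the corpus into ten train/test splits so every document is tested exactly once. A minimal sketch of the fold construction (a generic illustration, not the Weka implementation):

```python
def k_fold_indices(n, k=10):
    """Partition n document indices into k folds; each index appears in
    exactly one test fold, and the remaining indices form the train set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [(sorted(set(range(n)) - set(test)), test) for test in folds]

splits = k_fold_indices(2000, 10)  # movie review corpus size
print(len(splits))                 # 10 train/test splits
print(len(splits[0][1]))           # 200 test documents per fold
```

The reported accuracy is the average over the ten test folds.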
We noticed, however, that the accuracy became constant after 50% of the features were
eliminated, due to the fact that most terms had similar CPD scores. As future work, we hope to
study CPD in more detail by doing further work on the convote dataset to eliminate any reviews
that are not related, and possibly group reviews by the topic being debated. This will involve
using both supervised and unsupervised approaches, as it has been observed that combining these
approaches yields better results. We would also like to study the performance of CPD on other
datasets described in the sentiment analysis literature.
We also hope to use SentiWordNet with other scoring measures to arrive at better scores
for terms, which will make up for the inaccurate scores sometimes generated from SentiWordNet.
In future, we hope to use F-measure values as cutoff values during feature selection and to
improve the time taken by CPD to generate scores for terms. This will greatly enhance the
classification step and also improve accuracy. Finally, we would like to investigate the use of
unigrams and bigrams in this research to see if accuracy can be improved.
References
[1] Bo Pang and Lillian Lee. (2008). Opinion mining and Sentiment analysis. Foundations
and Trends in Information Retrieval Vol. 2, Nos. 1–2. Pages: 1–135
[2] Michelle de Haaff. (2010). Sentiment Analysis, Hard But Worth It!, CustomerThink,
http://www.customerthink.com/blog/sentiment_analysis_hard_but_worth_it, retrieved
2010-03-12.
[3] Fangzhong Su and Katja Markert. (2008). "From Words to Senses: a Case Study in
Subjectivity Recognition". Proceedings of Coling 2008, Manchester, UK
[4] Michelle Annett and Grzegorz Kondrak. (2008). A comparison of sentiment analysis
techniques: polarizing movie blogs. Lecture Notes in Computer Science, Advances in
Artificial Intelligence. Springer Berlin / Heidelberg. Vol. 5032. Pages: 25-30.
[5] Bo Pang and Lillian Lee. (2002). Thumbs up? Sentiment Classification using Machine
Learning Techniques. Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP). Pages: 79-86
[6] Dunja Mladenic. (1999). Text-Learning and Related Intelligent Agents: A Survey. In
Intelligent Systems and their applications, IEEE. Vol. 14 Issue 4. Pages: 44-54.
[7] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini and Chris Watkins.
(2002). Text Classification using String Kernels. Journal of Machine Learning
Research 2 (2002). Pages: 419-444.
[8] Bo Pang and Lillian Lee. (2004). A Sentimental Education: Sentiment Analysis
Using Subjectivity Summarization Based on Minimum Cuts. Proceedings of the
Association for Computational Linguistics (ACL). Pages: 271–278.
[9] Bing Liu. (2010). Sentiment Analysis and Subjectivity. Handbook of Natural
Language Processing, Second Edition. (editors: N. Indurkhya and F. J. Damerau)
[10] http://en.wikipedia.org/wiki/Sentiment_analysis
[11] Ainur Yessenalina, Yisong Yue and Claire Cardie. (2010). Multi-level Structured
Models for Document-level Sentiment Classification. Proceedings of the 2010
Conference on Empirical Methods in Natural Language Processing. Pages: 1046–1056.
[12] Tu Bao Ho, David Cheung and Huan Liu. (2005). Advances in Knowledge
Discovery and Data Mining: 9th Pacific-Asia Conference on Knowledge Discovery
and Data Mining. Pages: 301-311.
[13] http://reviews.cnet.com/smartphones/apple-iphone-4-16gb/4852-6452_7-34117598.html
[14] http://www.radian6.com/blog/2009/12/on-automated-sentiment-analysis/
[15] http://datamining.typepad.com/data_mining/2007/12/sentiment-minin.html
[16] http://en.wikipedia.org/wiki/Stemming
[17] http://aimotion.blogspot.com/2010/07/working-on-sentiment-analysis-on.html
[18] Suhaas Prasad. (2010). Micro-blogging Sentiment Analysis Using Bayesian
Classification Methods. http://nlp.stanford.edu/courses/cs224n/2010/reports/suhaasp.pdf
[19] Alec Go, Lei Huang and Richa Bhayani. Twitter Sentiment Analysis
[20] Ravi Parikh, Matin Movassate. Sentiment Analysis of User-Generated Twitter Updates
using Various Classification Techniques
[21] Minqing Hu and Bing Liu. (2004). Mining and summarizing customer reviews.
Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge
Discovery and Data, Aug. 22-25, ACM Press, Washington, USA., pp: 168-177.
[22] Pimwadee Chaovalit and Lina Zhou. (2005). Movie Review Mining: a Comparison
between Supervised and Unsupervised Classification Approaches. In Proceedings of
the 38th Annual Hawaii International Conference on System Sciences. Pages: 112c-
112c
[23] Benjamin K. Tsou and Oi Yee Kwong. (2005). Semantic Role Tagging for Chinese at
the Lexical Level. Proceedings of the Natural Language Conference-IJCNLP.
Pages: 804-815.
[24] Matt Thomas, Bo Pang and Lillian Lee. (2006) Get out the vote: Determining support or
opposition from Congressional floor-debate transcripts. Proceedings of EMNLP. Pages:
327–335.
[25] Kushal Dave, Steve Lawrence, and David M. Pennock. (2003) Mining the peanut
gallery: Opinion extraction and semantic classification of product reviews. In
Proceedings of WWW. Pages: 519–528.
[26] http://www.computing.dcu.ie/~acahill/tagset.html
[27] Baccianella Stefano, Andrea Esuli and Fabrizio Sebastiani. (2010). SentiWordNet
3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In
Proceedings of the Seventh Conference on International Language Resources and
Evaluation (LREC’10). Pages 2200–2204.
[28] George Forman. (2003). An extensive empirical study of feature selection metrics for
text classification. Journal of Machine Learning Research 3. Pages: 1289 –1305.
[29] Mondelle Simeon and Robert Hilderman. (2008). Categorical Proportional
Difference: A Feature Selection Method for Text Categorization. In Proceedings of
the Seventh Australasian Data Mining Conference (AusDM 2008), Glenelg, South
Australia. CRPIT, 87. Roddick, J. F., Li, J., Christen, P. and Kennedy, P. J., Eds.
ACS. 201-208.
[30] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann,
Ian H. Witten (2009). The WEKA Data Mining Software: An Update; SIGKDD
Explorations, Volume 11, Issue 1
[31] Ian H. Witten, Eibe Frank, Mark A. Hall. (2005). Data Mining: Practical Machine
Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data
Management Systems).
[32] http://weka.wikispaces.com/Primer
[33] Nitin Jindal and Bing Liu. (2008). Opinion Spam and Analysis. Proceedings of
First ACM International Conference on Web Search and Data Mining (WSDM-
2008), Feb 11-12, 2008, Stanford University, Stanford, California, USA
[34] Bruno Ohana and Brendan Tierney. (2009). Sentiment Classification of reviews
using SentiWordNet. IT&T Conference.
[35] Casey Whitelaw, Navendu Garg, Shlomo Argamon. (2005). Using appraisal groups for
sentiment analysis. In Proceedings of the 14th ACM Conference on Information and
Knowledge Management. Pages:625–631.
[36] Abbasi Ahmed, Chen Hsinchun and Salem Arab. (2008). Sentiment analysis in multiple
languages: Feature selection for opinion classification in Web forums. ACM
Transactions on Information Systems. Volume 26, Number 3, Article 12 (June 2008),
34 pages.
[37] Gamon, M. (2004). Sentiment Classification on Customer Feedback Data: Noisy Data,
Large Feature Vectors, and the Role of Linguistic Analysis. Proceedings of the 20th
international conference on Computational Linguistics. Geneva, Switzerland:
Association for Computational Linguistics.
[38] Tony Mullen and Nigel Collier. (2004). Sentiment analysis using support vector
machines with diverse information sources. Proceedings of EMNLP-2004, Barcelona,
Spain, July 2004. Association for Computational Linguistics. Pages: 412-418.
[39] Tim O’Keefe and Irena Koprinska. (2009). Feature Selection and Weighting Methods in
Sentiment Analysis. Proceedings of the Fourteenth Australasian Document Computing
Symposium.
[40] Peter D. Turney. (2002). Thumbs up or thumbs down?: Semantic orientation applied to
unsupervised classification of reviews. Proceedings of the 40th Annual Meeting on
Association for Computational Linguistics. Pages: 417-424.
APPENDIX
Extract from the movie review SVM classification result.
data/txt_sentoken Corpora Size: 2000 Feature Size: 36467 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.79 0.201 0.797 0.79 0.794 0.795 0
0.799 0.21 0.792 0.799 0.795 0.795 1
Weighted Avg. 0.795 0.206 0.795 0.795 0.794 0.795
Accuracy: 79.45 F-Measure: 0.794 Recall: 0.794 Precision: 0.795 True Positive:
0.794 False Positive: 0.206 True Negative: 0.794 False Negative: 0.206 Root Mean Squared
Error: 0.453
data/txt_sentoken Corpora Size: 2000 Feature Size: 32601 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.798 0.197 0.802 0.798 0.8 0.801 0
0.803 0.202 0.799 0.803 0.801 0.801 1
Weighted Avg. 0.801 0.2 0.801 0.801 0.8 0.801
Accuracy: 80.05 F-Measure: 0.8 Recall: 0.8 Precision: 0.801 True Positive: 0.8
False Positive: 0.2 True Negative: 0.8 False Negative: 0.2 Root Mean Squared
Error: 0.447
data/txt_sentoken Corpora Size: 2000 Feature Size: 32525 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.799 0.194 0.805 0.799 0.802 0.803 0
0.806 0.201 0.8 0.806 0.803 0.803 1
Weighted Avg. 0.803 0.198 0.803 0.803 0.802 0.803
Accuracy: 80.25 F-Measure: 0.802 Recall: 0.802 Precision: 0.803 True Positive:
0.802 False Positive: 0.198 True Negative: 0.802 False Negative: 0.198 Root Mean Squared
Error: 0.444
data/txt_sentoken Corpora Size: 2000 Feature Size: 32353 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.798 0.201 0.799 0.798 0.798 0.799 0
0.799 0.202 0.798 0.799 0.799 0.799 1
Weighted Avg. 0.799 0.202 0.799 0.799 0.798 0.799
Accuracy: 79.85 F-Measure: 0.798 Recall: 0.798 Precision: 0.799 True Positive:
0.798 False Positive: 0.202 True Negative: 0.798 False Negative: 0.202 Root Mean Squared
Error: 0.449
data/txt_sentoken Corpora Size: 2000 Feature Size: 32161 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.804 0.199 0.802 0.804 0.803 0.803 0
0.801 0.196 0.803 0.801 0.802 0.803 1
Weighted Avg. 0.803 0.198 0.803 0.803 0.802 0.803
Accuracy: 80.25 F-Measure: 0.802 Recall: 0.802 Precision: 0.803 True Positive:
0.802 False Positive: 0.198 True Negative: 0.802 False Negative: 0.198 Root Mean Squared
Error: 0.444
data/txt_sentoken Corpora Size: 2000 Feature Size: 31887 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.793 0.206 0.794 0.793 0.793 0.794 0
0.794 0.207 0.793 0.794 0.794 0.794 1
Weighted Avg. 0.794 0.207 0.794 0.794 0.793 0.794
Accuracy: 79.35 F-Measure: 0.793 Recall: 0.794 Precision: 0.794 True Positive:
0.794 False Positive: 0.206 True Negative: 0.794 False Negative: 0.206 Root Mean Squared
Error: 0.454
data/txt_sentoken Corpora Size: 2000 Feature Size: 31574 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.791 0.207 0.793 0.791 0.792 0.792 0
0.793 0.209 0.791 0.793 0.792 0.792 1
Weighted Avg. 0.792 0.208 0.792 0.792 0.792 0.792
Accuracy: 79.2 F-Measure: 0.792 Recall: 0.792 Precision: 0.792 True Positive: 0.792
False Positive: 0.208 True Negative: 0.792 False Negative: 0.208 Root Mean Squared
Error: 0.456
data/txt_sentoken Corpora Size: 2000 Feature Size: 31272 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.791 0.201 0.797 0.791 0.794 0.795 0
0.799 0.209 0.793 0.799 0.796 0.795 1
Weighted Avg. 0.795 0.205 0.795 0.795 0.795 0.795
Accuracy: 79.5 F-Measure: 0.795 Recall: 0.795 Precision: 0.795 True Positive: 0.795
False Positive: 0.205 True Negative: 0.795 False Negative: 0.205 Root Mean Squared
Error: 0.453
data/txt_sentoken Corpora Size: 2000 Feature Size: 30851 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.781 0.208 0.79 0.781 0.785 0.787 0
0.792 0.219 0.783 0.792 0.788 0.787 1
Weighted Avg. 0.787 0.214 0.787 0.787 0.786 0.787
Accuracy: 78.65 F-Measure: 0.786 Recall: 0.786 Precision: 0.787 True Positive:
0.786 False Positive: 0.214 True Negative: 0.786 False Negative: 0.214 Root Mean Squared
Error: 0.462
data/txt_sentoken Corpora Size: 2000 Feature Size: 30382 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.785 0.211 0.788 0.785 0.787 0.787 0
0.789 0.215 0.786 0.789 0.787 0.787 1
Weighted Avg. 0.787 0.213 0.787 0.787 0.787 0.787
Accuracy: 78.7 F-Measure: 0.787 Recall: 0.787 Precision: 0.787 True Positive: 0.787
False Positive: 0.213 True Negative: 0.787 False Negative: 0.213 Root Mean Squared
Error: 0.462
data/txt_sentoken Corpora Size: 2000 Feature Size: 29587 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.776 0.214 0.784 0.776 0.78 0.781 0
0.786 0.224 0.778 0.786 0.782 0.781 1
Weighted Avg. 0.781 0.219 0.781 0.781 0.781 0.781
Accuracy: 78.1 F-Measure: 0.781 Recall: 0.781 Precision: 0.781 True Positive: 0.781
False Positive: 0.219 True Negative: 0.781 False Negative: 0.219 Root Mean Squared
Error: 0.468
data/txt_sentoken Corpora Size: 2000 Feature Size: 29378 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.777 0.214 0.784 0.777 0.781 0.782 0
0.786 0.223 0.779 0.786 0.782 0.782 1
Weighted Avg. 0.782 0.219 0.782 0.782 0.781 0.782
Accuracy: 78.15 F-Measure: 0.781 Recall: 0.782 Precision: 0.782 True Positive:
0.782 False Positive: 0.218 True Negative: 0.782 False Negative: 0.218 Root Mean Squared
Error: 0.467
data/txt_sentoken Corpora Size: 2000 Feature Size: 28977 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.787 0.206 0.793 0.787 0.79 0.791 0
0.794 0.213 0.788 0.794 0.791 0.791 1
Weighted Avg. 0.791 0.21 0.791 0.791 0.79 0.791
Accuracy: 79.05 F-Measure: 0.79 Recall: 0.79 Precision: 0.791 True Positive:
0.79 False Positive: 0.21 True Negative: 0.79 False Negative: 0.21 Root Mean Squared
Error: 0.458
data/txt_sentoken Corpora Size: 2000 Feature Size: 27741 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.776 0.227 0.774 0.776 0.775 0.775 0
0.773 0.224 0.775 0.773 0.774 0.775 1
Weighted Avg. 0.775 0.226 0.775 0.775 0.774 0.775
Accuracy: 77.45 F-Measure: 0.774 Recall: 0.774 Precision: 0.775 True Positive:
0.774 False Positive: 0.226 True Negative: 0.774 False Negative: 0.226 Root Mean Squared
Error: 0.475
data/txt_sentoken Corpora Size: 2000 Feature Size: 27395 Classifier: SVM
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.777 0.232 0.77 0.777 0.774 0.773 0
0.768 0.223 0.775 0.768 0.771 0.773 1
Weighted Avg. 0.773 0.228 0.773 0.773 0.772 0.773
Accuracy: 77.25 F-Measure: 0.772 Recall: 0.772 Precision: 0.773 True Positive:
0.772 False Positive: 0.228 True Negative: 0.772 False Negative: 0.228 Root Mean Squared
Error: 0.