sentiment classification for product reviews (documentation)
The documentation of the pre-master graduation project prepared by myself and my colleagues Mostafa Ameen, Mai M. Farag and Mohamed Abd El kader.
2013
Supervisor:
Dr. Mohamed Farouk
Cairo University - ISSR
Sentiment Classification For Product Reviews
by
Mahmoud Mohamed Hassan
Mostafa Mohamed Ameen
Mohamed Abdelkader Hamed
Mai Mohamed Mahmoud
Table of Contents
Abstract
Chapter 1 Introduction
1.1 Motivations
Chapter 2 Sentiment Analysis
2.1 Sentiment Analysis Applications
2.2 Sentiment Analysis Research
2.3 Different Levels of Analysis
2.3.1 Document level:
2.3.2 Sentence level:
2.3.3 Entity and Aspect level:
2.4 Sentiment Lexicon and Its Issues
2.5 Natural Language Processing Issues
2.6 Opinion Spam Detection
Chapter 3 Machine Learning Approaches
3.1 Data preprocessing
3.1.1 Feature extraction
3.1.2 Feature selection (dimensionality reduction)
3.2 Classification
3.2.1 Support vector machine classification (SVM)
3.2.2 Naïve Bayes classification
3.2.3 Conditional Independence
3.2.4 Bayes Theorem
3.2.5 Maximum A Posteriori (MAP) Hypothesis
3.2.6 Maximum Likelihood (ML) Hypothesis
3.2.7 Naïve Bayesian Classification
3.3 Evaluation measures
3.3.1 Precision
3.3.2 Recall (Sensitivity)
3.3.3 Accuracy
3.3.4 F1-measure
Chapter 4 Opinion lexicons
4.1 Definition
4.2 SENTIWORDNET 3.0
4.3 (Bing Liu) Opinion lexicon
4.3.1 Who is Dr. Bing Liu?
4.3.2 (Bing Liu) Opinion lexicon
Chapter 5 Experimental results
5.1 Data collection
5.2 Feature extraction
5.2.1 Unigram:
5.2.2 Bigram:
5.3 Feature selection
5.4 Results of the classifiers
5.4.1 Unigram
5.4.2 Bigram
5.5 Comparison
5.6 Charts
5.7 UI for predictions preview
5.8 Application for live sentiment analysis
Chapter 6 Conclusion
References
Acknowledgment
First, we thank Allah for helping us to complete this project.
We would like to thank our project supervisor,
Dr. Mohamed Farouk,
for his efforts and his great support, which helped us to improve our performance and knowledge from the very
beginning, as he helped us to discover the world of machine learning and made us look forward to exploring
this overwhelming science further.
Also, all thanks and respect to all the staff members who taught us during the four semesters of the diploma and the two
semesters of the pre-master courses.
We also thank Professor Hisham Hefny for his inspiration and support, teaching us many subjects and opening our
minds to new trends and challenges in the computer science field.
Special thanks to our parents for their endless efforts and unconditional support in all phases of our lives,
especially in our education, which led us this far.
Project team
Abstract
Sentiment classification concerns the use of automatic methods for predicting the orientation of subjective
content in text documents, with applications in a number of areas including recommender and advertising
systems, customer intelligence and information retrieval.
This research presents the results of applying two different approaches to the problem of automatic sentiment
classification of product reviews. The first approach uses opinion lexicons (SentiWordNet 3.0 (Stefano
Baccianella) and Bing Liu’s opinion lexicon). The second approach uses supervised machine learning
algorithms (NaiveBayes and LibSVM).
This research also attempts to present these results in a comparative manner, to help the reader
understand the effect of varying the number of selected attributes, the attribute extraction method and
the classification algorithm.
Chapter 1 Introduction
Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments,
evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations,
individuals, issues, events, topics, and their attributes. It represents a large problem space. There are also many
names and slightly different tasks, e.g., sentiment analysis, opinion mining, opinion extraction, sentiment
mining, subjectivity analysis, affect analysis, emotion analysis, review mining, etc. However, they are now all
under the umbrella of sentiment analysis or opinion mining. In industry, the term sentiment analysis is
more commonly used, but in academia both sentiment analysis and opinion mining are frequently employed.
They basically represent the same field of study. The term sentiment analysis perhaps first appeared in
(Nasukawa and Yi, 2003), and the term opinion mining first appeared in (Dave, Lawrence and Pennock, 2003).
However, the research on sentiments and opinions appeared earlier (Das and Chen, 2001; Morinaga et al., 2002;
Pang, Lee and Vaithyanathan, 2002; Tong, 2001; Turney, 2002; Wiebe, 2000).
Although linguistics and natural language processing (NLP) have a long history, little research had been done
about people’s opinions and sentiments before the year 2000. Since then, the field has become a very active
research area. There are several reasons for this. First, it has a wide range of applications, in almost every
domain. The industry surrounding sentiment analysis has also flourished due to the proliferation of commercial
applications. This provides a strong motivation for research. Second, it offers many challenging research
problems, which had never been studied before.
Third, for the first time in human history, we now have a huge volume of opinionated data in the social media on
the Web. Without this data, a lot of research would not have been possible. Not surprisingly, the inception and
the rapid growth of sentiment analysis coincide with those of the social media. In fact, sentiment analysis is now
right at the center of the social media research. Hence, research in sentiment analysis not only has an important
impact on NLP, but may also have a profound impact on management sciences, political science, economics, and
social sciences as they are all affected by people’s opinions. Although sentiment analysis research mainly
started in the early 2000s, there was some earlier work on interpretation of metaphors, sentiment adjectives,
subjectivity, viewpoints, and affects (Hatzivassiloglou and McKeown, 1997; Hearst, 1992; Wiebe, 1990; Wiebe,
1994; Wiebe, Bruce and O'Hara, 1999).
1.1 Motivations
With the explosive growth of social media (e.g., reviews, forum discussions, blogs, micro-blogs, Twitter,
comments, and postings in social network sites) on the Web, individuals and organizations are increasingly using
the content in these media for decision making. Nowadays, if one wants to buy a consumer product, one is no
longer limited to asking one’s friends and family for opinions because there are many user reviews and
discussions in public forums on the Web about the product. For an organization, it may no longer be necessary
to conduct surveys, opinion polls, and focus groups in order to gather public opinions because there is an
abundance of such information publicly available. However, finding and monitoring opinion sites on the Web
and distilling the information contained in them remains a formidable task because of the proliferation of
diverse sites. Each site typically contains a huge volume of opinion text that is not always easily deciphered in
long blogs and forum postings. The average human reader will have difficulty identifying relevant sites and
extracting and summarizing the opinions in them. Automated sentiment analysis systems are thus needed.
In recent years, we have witnessed that opinionated postings in social media have helped reshape businesses,
and sway public sentiments and emotions, which have profoundly impacted our social and political systems.
Such postings have also mobilized masses for political changes such as those that happened in some Arab countries
in 2011. It has thus become a necessity to collect and study opinions on the Web. Of course, opinionated
documents exist not only on the Web (called external data); many organizations also have their internal data,
e.g., customer feedback collected from emails and call centers or results from surveys conducted by the
organizations.
Due to these applications, industrial activities have flourished in recent years. Sentiment analysis applications
have spread to almost every possible domain, from consumer products, services, healthcare, and financial
services to social events and political elections.
Chapter 2 Sentiment Analysis
Opinions are central to almost all human activities because they are key influencers of our behaviors. Whenever
we need to make a decision, we want to know others’ opinions. In the real world, businesses and organizations
always want to find consumer or public opinions about their products and services. Individual consumers also
want to know the opinions of existing users of a product before purchasing it, and others’ opinions about
political candidates before making a voting decision in a political election. In the past, when an individual
needed opinions, he/she asked friends and family. When an organization or a business needed public or
consumer opinions, it conducted surveys, opinion polls, and focus groups. Acquiring public and consumer
opinions has long been a huge business itself for marketing, public relations, and political campaign companies.
2.1 Sentiment Analysis Applications
Apart from real-life applications, many application-oriented research papers have also been published. For
example, in (Liu et al., 2007), a sentiment model was proposed to predict sales performance. In (McGlohon,
Glance and Reiter, 2010), reviews were used to rank products and merchants. In (Hong and Skiena, 2010), the
relationships between the NFL betting line and public opinions in blogs and Twitter were studied. In (O'Connor
et al., 2010), Twitter sentiment was linked with public opinion polls. In (Tumasjan et al., 2010), Twitter sentiment
was also applied to predict election results. In (Chen et al., 2010), the authors studied political standpoints. In
(Yano and Smith, 2010), a method was reported for predicting comment volumes of political blogs. In (Asur and
Huberman, 2010; Joshi et al., 2010; Sadikov, Parameswaran and Venetis, 2009), Twitter data, movie reviews and
blogs were used to predict box-office revenues for movies. In (Miller et al., 2011), sentiment flow in social
networks was investigated. In (Mohammad and Yang, 2011), sentiments in mails were used to find how genders
differed on emotional axes. In (Mohammad, 2011), emotions in novels and fairy tales were tracked. In (Bollen,
Mao and Zeng, 2011), Twitter moods were used to predict the stock market. In (Bar-Haim et al., 2011; Feldman
et al., 2011), expert investors in microblogs were identified and sentiment analysis of stocks was performed. In
(Zhang and Skiena, 2010), blog and news sentiment was used to study trading strategies. In (Sakunkoo and
Sakunkoo, 2009), social influences in online book reviews were studied. In (Groh and Hauffa, 2011), sentiment
analysis was used to characterize social relations. A comprehensive sentiment analysis system and some case
studies were also reported in (Castellanos et al., 2011).
2.2 Sentiment Analysis Research
As discussed above, pervasive real-life applications are only part of the reason why sentiment analysis is a
popular research problem. It is also highly challenging as an NLP research topic, and covers many novel
sub-problems, as we will see later. Additionally, there was little research before the year 2000 in either NLP or in
linguistics. Part of the reason is that before then there was little opinion text available in digital forms. Since the
year 2000, the field has grown rapidly to become one of the most active research areas in NLP. It is also widely
researched in data mining, Web mining, and information retrieval. In fact, it has spread from computer science
to management sciences (Archak, Ghose and Ipeirotis, 2007; Chen and Xie, 2008; Das and Chen, 2007;
Dellarocas, Zhang and Awad, 2007; Ghose, Ipeirotis and Sundararajan, 2007; Hu, Pavlou and Zhang, 2006; Park,
Lee and Han, 2007).
2.3 Different Levels of Analysis
In general, sentiment analysis has been investigated mainly at three levels:
2.3.1 Document level:
The task at this level is to classify whether a whole opinion document expresses a positive or negative
sentiment (Pang, Lee and Vaithyanathan, 2002; Turney, 2002). For example, given a product review, the
system determines whether the review expresses an overall positive or negative opinion about the product.
This task is commonly known as document-level sentiment classification. This level of analysis assumes that
each document expresses opinions on a single entity (e.g., a single product). Thus, it is not applicable to
documents which evaluate or compare multiple entities.
2.3.2 Sentence level:
The task at this level goes to the sentences and determines whether each sentence expresses a positive,
negative, or neutral opinion. Neutral usually means no opinion. This level of analysis is closely related to
subjectivity classification (Wiebe, Bruce and O'Hara, 1999), which distinguishes sentences (called objective
sentences) that express factual information from sentences (called subjective sentences) that express
subjective views and opinions. However, we should note that subjectivity is not equivalent to sentiment as
many objective sentences can imply opinions, e.g., “We bought the car last month and the windshield wiper
has fallen off.” Researchers have also analyzed clauses (Wilson, Wiebe and Hwa, 2004), but the clause level
is still not enough, e.g., “Apple is doing very well in this lousy economy.”
2.3.3 Entity and Aspect level:
Neither the document-level nor the sentence-level analysis discovers what exactly people liked and did
not like. Aspect level performs finer-grained analysis. Aspect level was earlier called feature level
(feature-based opinion mining and summarization) (Hu and Liu, 2004). Instead of looking at language constructs
(documents, paragraphs, sentences, clauses or phrases), aspect level directly looks at the opinion itself. It is
based on the idea that an opinion consists of a sentiment (positive or negative) and a target (of opinion). An
opinion without its target being identified is of limited use. Realizing the importance of opinion targets also
helps us understand the sentiment analysis problem better. For example, although the sentence “although
the service is not that great, I still love this restaurant” clearly has a positive tone, we cannot say that this
sentence is entirely positive. In fact, the sentence is positive about the restaurant (emphasized), but
negative about its service (not emphasized). In many applications, opinion targets are described by entities
and/or their different aspects. Thus, the goal of this level of analysis is to discover sentiments on entities
and/or their aspects. For example, the sentence “The iPhone’s call quality is good, but its battery life is
short” evaluates two aspects, call quality and battery life, of iPhone (entity). The sentiment on iPhone’s call
quality is positive, but the sentiment on its battery life is negative. The call quality and battery life of iPhone
are the opinion targets. Based on this level of analysis, a structured summary of opinions about entities and
their aspects can be produced, which turns unstructured text to structured data and can be used for all
kinds of qualitative and quantitative analyses. Both the document level and sentence level classifications are
already highly challenging. The aspect-level is even more difficult. It consists of several sub-problems.
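As a minimal sketch of the structured summary described above, the iPhone example can be written as (entity, aspect, sentiment) records. The class name `AspectOpinion` and the helper `summarize` are hypothetical, and the records are entered by hand here rather than extracted automatically from the sentence.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AspectOpinion:
    """One opinion: a sentiment attached to an aspect of an entity (hypothetical structure)."""
    entity: str
    aspect: str
    sentiment: str  # "positive" or "negative"


def summarize(opinions):
    """Count positive and negative opinions per (entity, aspect) pair."""
    summary = defaultdict(lambda: {"positive": 0, "negative": 0})
    for op in opinions:
        summary[(op.entity, op.aspect)][op.sentiment] += 1
    return dict(summary)


# Hand-labelled records for the sentence
# "The iPhone's call quality is good, but its battery life is short".
opinions = [
    AspectOpinion("iPhone", "call quality", "positive"),
    AspectOpinion("iPhone", "battery life", "negative"),
]
summary = summarize(opinions)
```

Once opinions are stored in this form, counting or charting them per aspect is ordinary data processing; the hard part, which this sketch leaves out, is extracting the records from free text.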
To make things even more interesting and challenging, there are two types of opinions, i.e., regular opinions and
comparative opinions (Jindal and Liu, 2006b). A regular opinion expresses a sentiment only on a particular
entity or an aspect of the entity, e.g., “Coke tastes very good,” which expresses a positive sentiment on the
aspect taste of Coke. A comparative opinion compares multiple entities based on some of their shared aspects,
e.g., “Coke tastes better than Pepsi,” which compares Coke and Pepsi based on their tastes (an aspect) and
expresses a preference for Coke.
2.4 Sentiment Lexicon and Its Issues
Not surprisingly, the most important indicators of sentiments are sentiment words, also called opinion words.
These are words that are commonly used to express positive or negative sentiments. For example, good,
wonderful, and amazing are positive sentiment words, and bad, poor, and terrible are negative sentiment
words. Apart from individual words, there are also phrases and idioms, e.g., cost someone an arm and a leg.
Sentiment words and phrases are instrumental to sentiment analysis for obvious reasons. A list of such words
and phrases is called a sentiment lexicon (or opinion lexicon). Over the years, researchers have designed
numerous algorithms to compile such lexicons.
Although sentiment words and phrases are important for sentiment analysis, only using them is far from
sufficient. The problem is much more complex. In other words, we can say that a sentiment lexicon is necessary
but not sufficient for sentiment analysis. Below, we highlight several issues:
1. A positive or negative sentiment word may have opposite orientations in different application domains.
For example, “suck” usually indicates negative sentiment, e.g., “This camera sucks,” but it can also imply
positive sentiment, e.g., “This vacuum cleaner really sucks.”
2. A sentence containing sentiment words may not express any sentiment. This phenomenon happens
frequently in several types of sentences. Question (interrogative) sentences and conditional sentences
are two important types, e.g., “Can you tell me which Sony camera is good?” and “If I can find a good
camera in the shop, I will buy it.” Both these sentences contain the sentiment word “good”, but neither
expresses a positive or negative opinion on any specific camera. However, not all conditional sentences
or interrogative sentences express no sentiments, e.g., “Does anyone know how to repair this terrible
printer?” and “If you are looking for a good car, get Toyota Camry.”
3. Sarcastic sentences with or without sentiment words are hard to deal with, e.g., “What a great car! It
stopped working in two days.” Sarcasm is not so common in consumer reviews about products and
services, but is very common in political discussions, which makes political opinions hard to deal with.
4. Many sentences without sentiment words can also imply opinions. Many of these sentences are actually
objective sentences that are used to express some factual information. Again, there are many types of
such sentences. Here we just give two examples. The sentence “This washer uses a lot of water” implies
a negative sentiment about the washer since it uses a lot of a resource (water). The sentence “After
sleeping on the mattress for two days, a valley has formed in the middle” expresses a negative opinion
about the mattress. This sentence is objective as it states a fact. All these sentences have no sentiment
words.
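The four issues listed above can be reproduced with a deliberately naive word-counting classifier. The sketch below is only an illustration: the tiny word sets are taken from the examples in this section, not from SentiWordNet or Bing Liu's lexicon, and the function names are hypothetical.

```python
import re

# Tiny illustrative lexicon; the words come from the examples in the text,
# not from any published opinion lexicon.
POSITIVE = {"good", "wonderful", "amazing", "great"}
NEGATIVE = {"bad", "poor", "terrible", "sucks"}


def lexicon_score(sentence):
    """Return (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def lexicon_label(sentence):
    """Map the score to a label; a score of zero is treated as neutral."""
    score = lexicon_score(sentence)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


examples = {
    "domain": "This vacuum cleaner really sucks.",
    "question": "Can you tell me which Sony camera is good?",
    "sarcasm": "What a great car! It stopped working in two days.",
    "implied": "This washer uses a lot of water",
}
labels = {name: lexicon_label(text) for name, text in examples.items()}
```

Under these assumptions the counter labels the vacuum cleaner sentence negative, the question and the sarcastic sentence positive, and the washer sentence neutral, which is the wrong reading in each case and mirrors issues 1 to 4.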
2.5 Natural Language Processing Issues
Finally, we must not forget that sentiment analysis is an NLP problem. It touches every aspect of NLP, e.g.,
coreference resolution, negation handling, and word sense disambiguation, which add more difficulties since these
are not solved problems in NLP. However, it is also useful to realize that sentiment analysis is a highly restricted
NLP problem because the system does not need to fully understand the semantics of each sentence or
document but only needs to understand some aspects of it, i.e., positive or negative sentiments and their target
entities or topics. In this sense, sentiment analysis offers a great platform for NLP researchers to make tangible
progress on all fronts of NLP with the potential of making a huge practical impact.
2.6 Opinion Spam Detection
A key feature of social media is that it enables anyone from anywhere in the world to freely express his/her
views and opinions without disclosing his/her true identity and without the fear of undesirable consequences.
These opinions are thus highly valuable. However, this anonymity also comes with a price. It allows people with
hidden agendas or malicious intentions to easily game the system to give people the impression that they are
independent members of the public and post fake opinions to promote or to discredit target products, services,
organizations, or individuals without disclosing their true intentions, or the person or organization that they are
secretly working for. Such individuals are called opinion spammers and their activities are called opinion
spamming (Jindal and Liu, 2008; Jindal and Liu, 2007).
Opinion spamming has become a major issue. Apart from individuals who give fake opinions in reviews and
forum discussions, there are also commercial companies that are in the business of writing fake reviews and
bogus blogs for their clients. Several high profile cases of fake reviews have been reported in the news. It is
important to detect such spamming activities to ensure that the opinions on the Web are a trusted source of
valuable information. Unlike extraction of positive and negative opinions, opinion spam detection is not just an
NLP problem as it involves the analysis of people’s posting behaviors. It is thus also a data mining problem.
Chapter 3 Machine Learning Approaches
3.1 Data preprocessing
Sentiment classification or opinion classification is a text classification problem in the first place. The
performance of a text classification task is directly affected by the representation of data. Once features are
appropriately selected, even simple classifiers may produce good classification results.
Several feature selection and extraction methods have been proposed in the literature. Feature selection
merely selects a good subset of the original features, whereas feature extraction allows for arbitrary new
features based on the original ones.
3.1.1 Feature extraction
The most commonly used features to represent words, Term Frequency (TF) and Inverse Document
Frequency (IDF), may not be always appropriate.
Choosing an appropriate representation of words in text documents is crucial to obtaining good
classification performance. Researchers have used different representations to maximize the accuracy of
machine learning algorithms. The “bag of words” representation is widely used to represent text documents.
In this representation, a document is considered to be an unordered collection of words in which the
position of words in the document bears no importance. “Bag of words” is the simplest representation of
textual data. The number of occurrences of each word in the document is represented by term frequency
(TF) which is a document specific measure of importance of a term. The collection of documents under
consideration is called a corpus. The importance of a term in a document is measured by its weight in the
document. A number of term weighting techniques have been proposed in literature. In the vector space
model [2], a document is represented by a document vector whose components are term weights. A
document using term frequency as term weights can be represented in vector form as $\{tf_1, tf_2, tf_3, \ldots, tf_n\}$,
where $tf_i$ is the term frequency of the $i$-th term and $n$ is the total number of terms in the document.
Lengths of documents in a corpus may vary and longer documents usually have higher term frequencies and
more unique terms than shorter documents. [18]
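The term-frequency vectors described above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and one vector component per term of the corpus vocabulary so that documents can be compared; the function names and the two-document corpus are hypothetical. Dividing by document length is shown as one simple way to offset the effect of longer documents.

```python
from collections import Counter


def term_frequencies(document):
    """Count occurrences of each lowercase, whitespace-separated token."""
    return Counter(document.lower().split())


def tf_vectors(corpus):
    """Return the sorted vocabulary and one raw TF vector per document."""
    counts = [term_frequencies(doc) for doc in corpus]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


def length_normalized(vector):
    """Divide each count by the document length (one simple normalization)."""
    total = sum(vector)
    return [v / total for v in vector] if total else list(vector)


corpus = ["good camera good price", "bad camera"]
vocabulary, vectors = tf_vectors(corpus)
normalized = [length_normalized(v) for v in vectors]
```

Word order is discarded entirely: any permutation of the words in a document yields the same vector, which is the "bag" in bag of words.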
3.1.1.1 Text Feature Generators
Before we address the question of how to discard words, we must first determine what shall count as a
word. For example, is ‘HP-UX’ one word, or is it two words? What about ‘650-857-1501’? When it comes to
programming, a simple solution is to take any contiguous sequence of alphabetic characters; or
alphanumeric characters to include identifiers such as ‘ioctl32’, which may sometimes be useful. By using
the Posix regular expression \p{L&}+ we avoid breaking ‘naïve’ in two, as well as many accented words in
French, German, etc. But what about ‘win 32’, ‘can’t’ or words that may be hyphenated over a line break?
Like most data cleaning endeavors, the list of exceptions is endless, and one must simply draw a line
somewhere and hope for an 80%-20% tradeoff. Fortunately, semantic errors in word parsing are usually only
seen by the core learning algorithm, and it is their statistical properties that matter, not their readability or
intuitiveness to people. Our purpose is to offer a range of feature generators so that the feature selector
may discover the strongly predictive features. The most beneficial feature generators will vary according to
the characteristics of the domain text.
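The tokenization choices discussed above can be tried out with Python's built-in `re` module. Note that `re` does not support `\p{...}` property classes, so the two patterns below are approximations of "letters only" and "letters or digits"; the names `LETTERS`, `ALPHANUMERIC` and `tokenize` are hypothetical.

```python
import re

# Python's built-in `re` module does not support \p{...} classes, so these
# patterns approximate "Unicode letters" and "Unicode letters or digits".
LETTERS = re.compile(r"[^\W\d_]+")
ALPHANUMERIC = re.compile(r"[^\W_]+")


def tokenize(text, pattern=LETTERS):
    """Return every maximal run of characters matched by the pattern."""
    return pattern.findall(text)


letters_only = {w: tokenize(w) for w in ["HP-UX", "na\u00efve", "ioctl32", "can't"]}
with_digits = {w: tokenize(w, ALPHANUMERIC) for w in ["ioctl32", "650-857-1501"]}
```

With the letters-only pattern, 'HP-UX' splits into two tokens, the accented word stays whole, and 'ioctl32' loses its digits; the alphanumeric pattern keeps 'ioctl32' intact but turns the phone number into three separate tokens. Neither choice handles every case, which is the point made in the paragraph above.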
Word Merging
One method of reducing the size of the feature space somewhat is to merge word variants together, and
treat them as a single feature. More importantly, this can also improve the predictive value of some
features. Forcing all letters to lowercase is a nearly ubiquitous practice. It normalizes for capitalization at the
beginning of a sentence, which does not otherwise affect the word’s meaning, and helps reduce the
dispersion issue mentioned in the introduction. For proper nouns, it occasionally conflates other word
meanings, e.g. ‘Bush’ or ‘LaTeX.’ Likewise, various word stemming algorithms can be used to merge multiple
related word forms. For example, ‘cat,’ ‘cats,’ ‘catlike’ and ‘catty’ may all be merged into a common feature.
Stemming typically benefits recall but at a cost of precision. If one is searching for ‘catty’ and the word is
treated the same as ‘cat,’ then a certain amount of precision is necessarily lost. For extremely skewed class
distributions, this loss may be unsupportable.
Stemming algorithms make both over-stemming errors and under-stemming errors, but again, the
semantics are less important than the feature’s statistical properties.
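As a rough illustration of word merging, the sketch below lowercases a word and strips a few suffixes. The suffix list and the `min_stem` threshold are hypothetical toy rules chosen only to reproduce the ‘cat’ example; this is not a real stemming algorithm such as Porter's.

```python
SUFFIXES = ("like", "ty", "s")  # hypothetical toy rules for illustration only

def merge_word(word, min_stem=3):
    """Lowercase the word and strip the first matching suffix, if the stem stays long enough."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w

for word in ("cat", "Cats", "catlike", "catty", "party"):
    print(word, "->", merge_word(word))
```

The four ‘cat’ variants collapse into one feature, while ‘party’ is cut to ‘par’, a small example of the over-stemming errors mentioned above.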
Word Phrases
Whereas merging related words together can produce features with more frequent occurrence (typically
with greater recall and lower precision), identifying multiple word phrases as a single term can produce
rarer, highly specific features (which typically aid precision and have lower recall), e.g. ‘John Denver’ or ‘user
interface.’ Rather than require a dictionary of phrases as above, a simple approach is to treat all consecutive
pairs of words as a phrase term, and let feature selection determine which are useful for prediction.
This can be extended for phrases of three or more words with occasionally more specificity, but with strictly
decreasing frequency. Most of the benefit is obtained by two-word phrases. This is in part because portions
of the phrase may already have the same statistical properties, e.g. the four-word phrase ‘United States of
America’ is covered already by the two-word phrase ‘United States.’ In addition, the reach of a two-word
phrase can be extended by eliminating common stop words, e.g. ‘head of the household’ becomes ‘head
household.’ Stop word lists are language specific, unfortunately. Their primary benefit to classification is in
extending the reach of phrases, rather than eliminating commonly useless words, which most feature
selection methods can already remove in a language-independent fashion.
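A minimal sketch of two-word phrase generation with stop-word elimination follows. The stop-word set is a tiny illustrative list assumed for this example, and `phrase_bigrams` is a hypothetical helper name.

```python
STOP_WORDS = {"of", "the", "a", "an", "and"}  # tiny assumed list, not a real resource

def phrase_bigrams(tokens, stop_words=STOP_WORDS):
    """Drop stop words, then treat every consecutive pair of remaining words as a phrase."""
    kept = [t.lower() for t in tokens if t.lower() not in stop_words]
    return [f"{a} {b}" for a, b in zip(kept, kept[1:])]

print(phrase_bigrams(["head", "of", "the", "household"]))
print(phrase_bigrams(["United", "States", "of", "America"]))
```

Feature selection would then decide which of the generated phrases are actually predictive.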
Character N-grams
The word identification methods above fail in some situations, and can miss some good
opportunities for features. For example, languages such as Chinese and Japanese do not use a space
character. Segmenting such text into words is complex, whereas nearly equivalent accuracy may be
obtained by simply using every pair of adjacent Unicode characters as features (character n-grams with n = 2). Certainly many of
the combinations will be meaningless, but feature selection can identify the most predictive ones. For
languages that use the Latin character set, 3-grams or 6-grams may be appropriate. For example, n-grams
would capture the essence of common technical text patterns such as ‘HP-UX 11.0’, ‘while (<>) ,’, ‘#!/bin/’,
and ‘ :)’. Phrases of two adjacent n-grams simply correspond to (2n)-grams. Note that while the number of
potential n-grams grows exponentially with n, in practice only a small fraction of the possibilities occur in
actual training examples, and only a fraction of those will be found predictive.
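Character n-gram generation can be sketched in a few lines. The helper name `char_ngrams` is hypothetical; the choice of n is left to the caller.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return every run of n adjacent characters, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("#!/bin/", 3))
# Counting the n-grams gives term-frequency style features.
print(Counter(char_ngrams("HP-UX 11.0", 3)).most_common(3))
```

Only n-grams that actually occur in the text are produced, which is why the exponential number of potential n-grams is not a problem in practice.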
Multi-Field Records
Although most research deals with training cases as a single string, many applications have multiple text
(and non-text) fields associated with each record. In document management, these may be title, author,
abstract, keywords, body, and references. In technical support, they may be title, product, keywords,
engineer, customer, symptoms, problem description, and solution. Multi-field records are common in
applications, even though the bulk of text classification research treats only a single string. Furthermore,
when classifying long strings, e.g. arbitrary file contents, the first few kilobytes may be treated as a separate
field and may often prove sufficient for generating adequate features, avoiding the overhead of processing
huge files, such as tar or zip archives.
Feature Values
Once a decision has been made about what to consider as a feature term, the meaning of the numerical
feature must be determined. For some purposes, a binary value is sufficient, indicating whether the term
appears at all.
This representation is used by the Bernoulli formulation of the Naive Bayes classifier. Many other classifiers
use the term frequency tf(t,k) (the word count in document k) directly as the feature value.
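The two feature-value choices can be compared on a single document. This is only a sketch: the function names are hypothetical and the example document is made up.

```python
from collections import Counter

def term_frequency(tokens):
    """tf(t, k): how many times each term t occurs in document k."""
    return dict(Counter(tokens))

def binary_presence(tokens):
    """1 if the term appears at all; absent terms are implicitly 0."""
    return {term: 1 for term in set(tokens)}

doc = "good good phone bad battery".split()
print(term_frequency(doc))
print(binary_presence(doc))
```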
3.1.2 Feature selection (dimensionality reduction)
The total number of features in a corpus of text documents is the number of unique words present in all
documents. Word sharing is reduced in documents belonging to different categories thus producing a large
number of unique words in the whole corpus. High dimensionality is thus inherent to text classification.
Researchers have been trying to filter out terms which are not important for classification or are redundant.
Techniques used for filtering terms in text classification are based on the assumption that very rare and very
common terms do not help in discriminating documents of different categories. Very common terms
occurring in almost all documents are treated as stop words and removed. Rarely occurring terms, appearing in
only 2 or 3 documents, are also discarded.
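The frequency-based filtering described above might look like the following sketch. The thresholds `min_df` and `max_df_ratio` are assumed parameters for illustration; the text does not fix exact cut-offs beyond "2 or 3 documents".

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=3, max_df_ratio=0.9):
    """Keep terms that occur in at least min_df documents
    and in at most max_df_ratio of all documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    return {t for t, c in df.items() if c >= min_df and c / n_docs <= max_df_ratio}

docs = [["good", "the"], ["bad", "the"], ["good", "the"], ["good", "the", "rare"]]
print(filter_by_document_frequency(docs, min_df=2, max_df_ratio=0.9))
```

Here ‘the’ is dropped as too common and ‘bad’ and ‘rare’ as too rare, leaving only ‘good’.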
One of the problems with high-dimensional datasets is that, in many cases, not all the measured variables
are important for understanding the underlying phenomena of interest. Certain computationally
expensive novel methods can construct predictive models with high accuracy from high-dimensional data, but it
is still of interest in many applications to reduce the dimension of the original data prior to any modeling of
the data. Feature selection reduces both the amount of data and the computational complexity.
The reduced dataset is more efficient to process, and the selection step helps identify useful feature subsets.
There are four main steps in a feature selection method (see Figure 1):
Generation = select a candidate feature subset.
Evaluation = compute the relevancy value of the subset.
Stopping criterion = determine whether the subset is relevant.
Validation = verify the subset's validity.
3.1.2.1 Information Gain (IG)
Entropy and Information Gain
The entropy (very common in Information Theory) characterizes the impurity of an arbitrary collection of
examples.
Information Gain is the expected reduction in entropy caused by partitioning the examples according to a given
attribute.
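$$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$$
where $p_i$ is the proportion of examples in the collection S that belong to class i.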
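As a hedged illustration, entropy and information gain can be computed for a single binary "term present / term absent" attribute. The sketch assumes the standard definition Gain(S, A) = Entropy(S) minus the size-weighted entropy of each partition; the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, has_term):
    """Expected reduction in entropy from splitting on whether a term is present."""
    n = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for y, h in zip(labels, has_term) if h == value]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

labels = ["pos", "pos", "neg", "neg"]
print(information_gain(labels, [True, True, False, False]))   # term separates the classes
print(information_gain(labels, [True, False, True, False]))   # term is uninformative
```

A term that splits the classes perfectly recovers the full one bit of entropy, while a term spread evenly across both classes gains nothing.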
Figure 1
3.1.2.2 Principal Component Analysis (PCA)
Principal component analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to
convert a set of observations of possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components. The number of principal components is less than or equal
to the number of original variables. This transformation is defined in such a way that the first principal
component has the largest possible variance (that is, accounts for as much of the variability in the data as
possible), and each succeeding component in turn has the highest variance possible under the constraint that it
be orthogonal to (i.e., uncorrelated with) the preceding components. Principal components are guaranteed to
be independent only if the data set is jointly normally distributed.
PCA was invented in 1901 by Karl Pearson,[1] as an analogue of the principal axes theorem in mechanics; it was
later independently developed (and named) by Harold Hotelling in the 1930s.[2] The method is mostly used as a
tool in exploratory data analysis and for making predictive models. PCA can be done by eigenvalue
decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix,
usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute.[3] The results
of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed
variable values corresponding to a particular data point), and loadings (the weight by which each standardized
original variable should be multiplied to get the component score).
PCA is mathematically defined[7] as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
Consider a data matrix, X, with zero empirical mean (the empirical (sample) mean of the distribution has been subtracted from the data set), where each of the n rows represents a different repetition of the experiment, and each of the p columns gives a particular kind of datum (say, the results from a particular probe).
Mathematically, the transformation is defined by a set of p-dimensional vectors of weights or loadings $\mathbf{w}_{(k)} = (w_1, \dots, w_p)_{(k)}$ that map each row vector $\mathbf{x}_{(i)}$ of X to a new vector of principal component scores $\mathbf{t}_{(i)} = (t_1, \dots, t_p)_{(i)}$, given by
$$t_{k(i)} = \mathbf{x}_{(i)} \cdot \mathbf{w}_{(k)}$$
in such a way that the individual variables of t considered over the data set successively inherit the maximum possible variance from x, with each loading vector w constrained to be a unit vector.
3.1.2.2.1 First component
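The first loading vector w(1) thus has to satisfy
$$\mathbf{w}_{(1)} = \arg\max_{\|\mathbf{w}\|=1} \left\{ \sum_i t_{1(i)}^2 \right\} = \arg\max_{\|\mathbf{w}\|=1} \left\{ \sum_i \left(\mathbf{x}_{(i)} \cdot \mathbf{w}\right)^2 \right\}$$
Equivalently, writing this in matrix form gives
$$\mathbf{w}_{(1)} = \arg\max_{\|\mathbf{w}\|=1} \left\{ \|X\mathbf{w}\|^2 \right\} = \arg\max_{\|\mathbf{w}\|=1} \left\{ \mathbf{w}^T X^T X \mathbf{w} \right\}$$
Since w(1) has been defined to be a unit vector, it equivalently also satisfies
$$\mathbf{w}_{(1)} = \arg\max \left\{ \frac{\mathbf{w}^T X^T X \mathbf{w}}{\mathbf{w}^T \mathbf{w}} \right\}$$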
The quantity to be maximized can be recognized as a Rayleigh quotient. A standard result for a symmetric matrix such as $X^T X$ is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector.
With w(1) found, the first component of a data vector x(i) can then be given as a score t1(i) = x(i) ⋅ w(1) in the transformed co-ordinates, or as the corresponding vector in the original variables, {x(i) ⋅ w(1)} w(1).
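One simple way to find the leading eigenvector numerically is power iteration, sketched below in plain Python. This is an assumed illustrative technique, not necessarily the method used in the project; the function names are hypothetical, the rows of X are assumed to be mean-centred already, and the method assumes the starting vector is not orthogonal to the leading eigenvector.

```python
def first_loading_vector(X, iterations=200):
    """Power iteration on the symmetric matrix A = X^T X; returns a unit vector w."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    w = [1.0 / p ** 0.5] * p  # unit-length starting guess
    for _ in range(iterations):
        v = [sum(A[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = sum(c * c for c in v) ** 0.5
        if norm == 0.0:  # degenerate data: every row is zero
            break
        w = [c / norm for c in v]
    return w

def scores(X, w):
    """First-component score t_1(i) = x(i) . w for every row of X."""
    return [sum(a * b for a, b in zip(row, w)) for row in X]

X = [[-2.0, -1.0], [-1.0, -1.0], [1.0, 1.0], [2.0, 1.0]]  # toy data with zero column means
w1 = first_loading_vector(X)
print(w1, scores(X, w1))
```

For this toy matrix the sum of squared scores equals the largest eigenvalue of $X^T X$, which is the maximum of the Rayleigh quotient described above.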
3.1.2.2.2 Further components
The kth component can be found by subtracting the first k-1 principal components from X:
$$\hat{X}_{k-1} = X - \sum_{s=1}^{k-1} X \mathbf{w}_{(s)} \mathbf{w}_{(s)}^T$$
and then finding the loading vector which extracts the maximum variance from this new data matrix
$$\mathbf{w}_{(k)} = \arg\max_{\|\mathbf{w}\|=1} \left\{ \|\hat{X}_{k-1} \mathbf{w}\|^2 \right\} = \arg\max \left\{ \frac{\mathbf{w}^T \hat{X}_{k-1}^T \hat{X}_{k-1} \mathbf{w}}{\mathbf{w}^T \mathbf{w}} \right\}$$
It turns out that this gives the remaining eigenvectors of $X^T X$, with the maximum values for the quantity in brackets given by their corresponding eigenvalues.
The kth principal component of a data vector x(i) can therefore be given as a score tk(i) = x(i) ⋅ w(k) in the transformed co-ordinates, or as the corresponding vector in the space of the original variables, {x(i) ⋅ w(k)} w(k), where w(k) is the kth eigenvector of $X^T X$.
The full principal components decomposition of X can therefore be given as
$$T = X W$$
where W is a p-by-p matrix whose columns are the eigenvectors of $X^T X$.
3.1.2.2.3 Covariance
$X^T X$ itself can be recognized as proportional to the empirical sample covariance matrix of the dataset X.
The sample covariance Q between two of the different principal components over the dataset is given by
$$\begin{aligned} Q(\mathrm{PC}_{(j)}, \mathrm{PC}_{(k)}) &\propto (X\mathbf{w}_{(j)})^T (X\mathbf{w}_{(k)}) \\ &= \mathbf{w}_{(j)}^T X^T X \mathbf{w}_{(k)} \\ &= \mathbf{w}_{(j)}^T \lambda_{(k)} \mathbf{w}_{(k)} \\ &= \lambda_{(k)} \mathbf{w}_{(j)}^T \mathbf{w}_{(k)} \end{aligned}$$
where the eigenvector property of w(k) has been used to move from line 2 to line 3. However, eigenvectors w(j) and w(k) corresponding to eigenvalues of a symmetric matrix are orthogonal (if the eigenvalues are different), or can be orthogonalized (if the vectors happen to share an equal repeated value). The product in the final line is therefore zero; there is no sample covariance between different principal components over the dataset.
Another way to characterize the principal components transformation is therefore as the transformation to coordinates which diagonalize the empirical sample covariance matrix.
In matrix form, the empirical covariance matrix for the original variables can be written
$$Q \propto X^T X = W \Lambda W^T$$
The empirical covariance matrix between the principal components becomes
$$W^T Q W \propto W^T W \Lambda W^T W = \Lambda$$
where Λ is the diagonal matrix of eigenvalues λ(k) of $X^T X$ (λ(k) being equal to the sum of the squares over the dataset associated with each component k: $\lambda_{(k)} = \sum_i t_{k(i)}^2 = \sum_i (\mathbf{x}_{(i)} \cdot \mathbf{w}_{(k)})^2$).
3.1.2.2.4 Dimensionality reduction
The faithful transformation T = X W maps a data vector x(i) from an original space of p variables to a new space of p variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first L principal components, produced by using only the first L loading vectors, gives the truncated transformation
$$T_L = X W_L$$
where $W_L$ holds the first L columns of W, so the matrix $T_L$ now has n rows but only L columns. By construction, of all the transformed data matrices with only L columns, this score matrix maximizes the variance in the original data that has been preserved, while minimizing the total squared reconstruction error $\|T W^T - T_L W_L^T\|_2^2$.
Such dimensionality reduction can be a very useful step for visualizing and processing high-dimensional datasets, while still retaining as much of the variance in the dataset as possible. For example, selecting L=2 and keeping only the first two principal components finds the two-dimensional plane through the high-dimensional dataset in which the data is most spread out, so if the data contains clusters these too may be most spread out, and therefore most visible to be plotted out in a two-dimensional diagram; whereas if two directions through the data (or two of the original variables) are chosen at random, the clusters may be much less spread apart from each other, and may in fact be much more likely to substantially overlay each other, making them indistinguishable.
Similarly, in regression analysis, the larger the number of explanatory variables allowed, the greater is the chance of overfitting the model, producing conclusions that fail to generalize to other datasets. One approach, especially when there are strong correlations between different possible explanatory variables, is to reduce them to a few principal components and then run the regression against them, a method called principal component regression.
Dimensionality reduction may also be appropriate when the variables in a dataset are noisy. If each column of the dataset contains independent identically distributed Gaussian noise, then the columns of T will also contain similarly identically distributed Gaussian noise (such a distribution is invariant under the effects of the matrix W, which can be thought of as a high-dimensional rotation of the co-ordinate axes). However, with more of the total variance concentrated in the first few principal components compared to the same noise variance, the proportionate effect of the noise is less: the first components achieve a higher signal-to-noise ratio. PCA thus can have the effect of concentrating much of the signal into the first few principal components, which can usefully be captured by dimensionality reduction, while the later principal components may be dominated by noise, and so disposed of without great loss.
3.2 Classification
3.2.1 Support vector machine classification (SVM)
3.2.1.1 Description
Support Vector Machines (SVM's) are a relatively new learning method used for binary classification. The
basic idea is to find a hyper plane which separates the d-dimensional data perfectly into its two classes.
However, since example data is often not linearly separable, SVM's introduce the notion of a “kernel
induced feature space” which casts the data into a higher dimensional space where the data is separable.
Typically, casting into such a space would cause problems computationally, and with overfitting. The key
insight used in SVM's is that the higher-dimensional space doesn't need to be dealt with directly (as it turns
out, only the formula for the dot-product in that space is needed), which eliminates the above concerns.
Furthermore, the VC-dimension (a measure of a system's likelihood to perform well on unseen data) of
SVM's can be explicitly calculated, unlike other learning methods like neural networks, for which there is no
such measure. Overall, SVM's are intuitive, theoretically well-founded, and have been shown to be practically
successful. SVM's have also been extended to solve regression tasks (where the system is trained to output
a numerical value, rather than a “yes/no” classification).
3.2.1.2 History
Support Vector Machines were introduced by Vladimir Vapnik and colleagues. The earliest mention was in
(Vapnik, 1979), but the first main paper seems to be (Vapnik, 1995).
3.2.1.3 Mathematics
We are given l training examples $\{\mathbf{x}_i, y_i\}$, i = 1, . . . , l, where each example has d inputs ($\mathbf{x}_i \in \mathbb{R}^d$), and a class
label with one of two values ($y_i \in \{-1, 1\}$). Now, all hyper planes in $\mathbb{R}^d$ are parameterized by a vector (w),
and a constant (b), expressed in the equation
w.x + b = 0 (1)
(Recall that w is in fact the vector orthogonal to the hyper plane.) Given such a hyper plane (w, b) that
separates the data, this gives the function
f(x) = sign(w. x + b) (2)
This correctly classifies the training data (and hopefully other “testing” data it hasn’t seen yet). However, a
given hyper plane represented by (w, b) is equally expressed by all pairs $(\lambda \mathbf{w}, \lambda b)$ for $\lambda \in \mathbb{R}^+$. So we define
the canonical hyper plane to be that which separates the data from the hyper plane by a “distance” of at
least 1. That is, we consider those that satisfy:
$\mathbf{x}_i \cdot \mathbf{w} + b \geq +1$ when $y_i = +1$ (3)
$\mathbf{x}_i \cdot \mathbf{w} + b \leq -1$ when $y_i = -1$ (4)
or more compactly:
$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \geq 1 \quad \forall i$ (5)
All such hyper planes have a “functional distance” ≥ 1 (quite literally, the function's value is ≥ 1). This shouldn't
be confused with the “geometric” or “Euclidean distance” (also known as the margin). For a given hyper plane
(w, b), all pairs $(\lambda \mathbf{w}, \lambda b)$ define the exact same hyper plane, but each has a different functional distance to a
given data point. To obtain the geometric distance from the hyper plane to a data point, we must normalize by
the magnitude of w. This distance is simply:
$$d\big((\mathbf{w}, b), \mathbf{x}_i\big) = \frac{y_i(\mathbf{x}_i \cdot \mathbf{w} + b)}{\|\mathbf{w}\|} \geq \frac{1}{\|\mathbf{w}\|} \tag{6}$$
Intuitively, we want the hyper plane that maximizes the geometric distance to the closest data points. (See
figure 1.)
Figure 1: Choosing the hyper plane that maximizes the margin.
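From the equation we see this is accomplished by minimizing $\|\mathbf{w}\|$ (subject to the distance constraints). The
main method of doing this is with Lagrange multipliers. (See (Vapnik, 1995), or (Burges, 1998) for derivation
details.) The problem is eventually transformed into:
minimize: $W(\alpha) = -\sum_{i=1}^{l} \alpha_i + \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j (\mathbf{x}_i \cdot \mathbf{x}_j)$
subject to: $\sum_{i=1}^{l} y_i \alpha_i = 0$ and $0 \leq \alpha_i \leq C \quad (\forall i)$
where $\alpha$ is the vector of l non-negative Lagrange multipliers to be determined, and C is a constant (to be
explained later). We can define the matrix $(H)_{ij} = y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$ and introduce more compact notation:
minimize: $W(\alpha) = -\alpha^T \mathbf{1} + \frac{1}{2} \alpha^T H \alpha$ (7)
subject to: $\alpha^T \mathbf{y} = 0$ (8)
$0 \leq \alpha \leq C\mathbf{1}$ (9)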
(This minimization problem is what is known as a Quadratic Programming Problem (QP). Fortunately, many
techniques have been developed to solve them.)
In addition, from the derivation of these equations, it was seen that the optimal hyper plane can be written as:
$\mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i$ (10)
That is, the vector w is just a linear combination of the training examples. Interestingly, it can also be shown that
$\alpha_i \big(y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1\big) = 0 \quad (\forall i)$
which is just a fancy way of saying that when the functional distance of an example is strictly greater than 1
(when $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 1$) then $\alpha_i = 0$. So only the closest data points contribute to w. These training examples
for which $\alpha_i > 0$ are termed support vectors. They are the only ones needed in defining (and finding) the
optimal hyper plane. Intuitively, the support vectors are the “borderline cases” in the decision function we are
trying to learn. Even more interesting is that $\alpha_i$ can be thought of as a “difficulty rating” for the example $\mathbf{x}_i$: how
important that example was in determining the hyper plane.
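Assuming we have the optimal $\alpha$ (from which we construct w), we must still determine b to fully specify the
hyper plane. To do this, take any “positive” and “negative” support vector, $\mathbf{x}^+$ and $\mathbf{x}^-$, for which we know
$\mathbf{w} \cdot \mathbf{x}^+ + b = +1$
$\mathbf{w} \cdot \mathbf{x}^- + b = -1$
Solving these equations gives us
$b = -\frac{1}{2}\left(\mathbf{w} \cdot \mathbf{x}^+ + \mathbf{w} \cdot \mathbf{x}^-\right)$ (11)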
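Equations 2, 6, 10 and 11 can be tried out on a two-point toy problem. The sketch below does not solve the quadratic program; it assumes the multipliers `alphas = [0.5, 0.5]`, which is the dual optimum for this particular pair of points, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weight_vector(alphas, ys, xs):
    """Eq. 10: w = sum_i alpha_i * y_i * x_i."""
    d = len(xs[0])
    return [sum(a * y * x[k] for a, y, x in zip(alphas, ys, xs)) for k in range(d)]

def bias(w, x_pos, x_neg):
    """Eq. 11: b = -(w.x+ + w.x-) / 2 for one positive and one negative support vector."""
    return -0.5 * (dot(w, x_pos) + dot(w, x_neg))

def classify(w, b, x):
    """Eq. 2: f(x) = sign(w.x + b)."""
    return 1 if dot(w, x) + b >= 0 else -1

def geometric_distance(w, b, x, y):
    """Eq. 6: functional distance divided by the magnitude of w."""
    return y * (dot(w, x) + b) / dot(w, w) ** 0.5

xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
alphas = [0.5, 0.5]  # assumed: dual optimum for this toy set, not computed here
w = weight_vector(alphas, ys, xs)
b = bias(w, xs[0], xs[1])
print(w, b, classify(w, b, (3.0, 2.0)), geometric_distance(w, b, xs[0], ys[0]))
```

Both training points are support vectors here: each has functional distance exactly 1 and a non-zero multiplier.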
Now, you may have wondered about the need for the constraint (eq. 9) $\alpha_i \leq C \quad (\forall i)$.
When C = ∞, the optimal hyper plane will be the one that completely separates the data (assuming one exists).
For finite C, this changes the problem to finding a “soft-margin” classifier, which allows for some of the data to
be misclassified. One can think of C as a tunable parameter: higher C corresponds to more importance on
classifying all the training data correctly, lower C results in a “more flexible” hyper plane that tries to minimize
the margin error (how badly $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) < 1$) for each example. Finite values of C are useful in situations
where the data is not easily separable (perhaps because the input data $\{\mathbf{x}_i\}$ are noisy).
3.2.1.4 The Generalization Ability of Perfectly Trained SVM's
Suppose we find the optimal hyper plane separating the data, and of the l training examples, $N_s$ of them are
support vectors. It can then be shown that the expected out-of-sample error (the portion of unseen data that
will be misclassified), $\Pi$, is bounded by
$$\Pi \leq \frac{N_s}{l - 1} \tag{12}$$
This is a very useful result. It ties together the notions that simpler systems are better (Ockham's Razor principle)
and that for SVM's, fewer support vectors are in fact a more “compact" and “simpler" representation of the
hyper plane and hence should perform better. If the data cannot be separated however, no such theorem
applies, which at this point seems to be a potential setback for SVM's.
3.2.1.5 Mapping the Inputs to other dimensions – the use of Kernels
Now just because a data set is not linearly separable, doesn't mean there isn't some other concise way to
separate the data. For example, it might be easier to separate the data using polynomial curves, or circles.
However, finding the optimal curve to fit the data is difficult, and it would be a shame not to use the method of
finding the optimal hyper plane that we investigated in the previous section. Indeed there is a way to
“pre-process” the data in such a way that the problem is transformed into one of finding a simple
hyper plane. To do this, we define a mapping $\mathbf{z} = \phi(\mathbf{x})$ that transforms the
d dimensional input vector x into a (usually higher) d’ dimensional vector z. We hope to choose a $\phi$ so that the
new training data $\{\phi(\mathbf{x}_i), y_i\}$ is separable by a hyper plane. (See Figure 2.)
This method looks like it might work, but there are some concerns. Firstly, how do we go about choosing $\phi$? It
would be a lot of work to have to construct one explicitly for any data set we are given. Not to fear: if $\phi$ casts
the input vector into a high enough space (d’ ≫ d), the data should eventually become separable, so perhaps there is a standard $\phi$ that does
this for most data... But casting into a very high dimensional space is also worrisome. Computationally, this
creates much more of a burden. Recall that the construction of the matrix H requires the dot products $(\mathbf{x}_i \cdot \mathbf{x}_j)$. If
d’ is exponentially larger than d (and it very well could be), the computation of H becomes prohibitive (not to
mention the extra space requirements). Also, by increasing the complexity of our system in such a way,
overfitting becomes a concern. By casting into a high enough dimensional space, it is a fact that we can separate any
data set. How can we be sure that the system isn't just fitting the idiosyncrasies of the training data, but is
actually learning a legitimate pattern that will generalize to other data it hasn't been trained on?
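As we'll see, SVM's avoid these problems. Given a mapping $\mathbf{z} = \phi(\mathbf{x})$, to set up our new optimization problem,
we simply replace all occurrences of x with $\phi(\mathbf{x})$.
Our QP problem (recall eq. 7) would still be
minimize: $W(\alpha) = -\alpha^T \mathbf{1} + \frac{1}{2} \alpha^T H \alpha$
but instead of $(H)_{ij} = y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$, it is $(H)_{ij} = y_i y_j (\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j))$. Eq. 10 would be
$\mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \phi(\mathbf{x}_i)$
And eq. 2 would be
$f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \phi(\mathbf{x}) + b)$
$= \mathrm{sign}\left(\left[\sum_{i} \alpha_i y_i \phi(\mathbf{x}_i)\right] \cdot \phi(\mathbf{x}) + b\right)$
$= \mathrm{sign}\left(\sum_{i} \alpha_i y_i \left(\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x})\right) + b\right)$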
The important observation in all this is that any time a $\phi(\mathbf{x}_a)$ appears, it is always in a dot product with some
other $\phi(\mathbf{x}_b)$. That is, if we knew the formula (called a kernel) for the dot product in the higher dimensional
feature space,
$$K(\mathbf{x}_a, \mathbf{x}_b) = \phi(\mathbf{x}_a) \cdot \phi(\mathbf{x}_b) \tag{13}$$
we would never need to deal with the mapping $\mathbf{z} = \phi(\mathbf{x})$ directly. The matrix in our optimization would simply be
$(H)_{ij} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$, and
our classifier $f(\mathbf{x}) = \mathrm{sign}\left(\sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right)$. Once the problem is set up in this way, finding the optimal hyper plane
proceeds as usual, only the hyper plane will be in some unknown feature space. In the original input space, the
data will be separated by some curved, possibly non-continuous contour. It may not seem obvious why the use
of a kernel alleviates our concerns, but it does. Earlier, we mentioned that it would be tedious to have to design
a different feature map for any training set we are given, in order to linearly separate the data. Fortunately,
useful kernels have already been discovered. Consider the “polynomial kernel”
$$K(\mathbf{x}_a, \mathbf{x}_b) = (\mathbf{x}_a \cdot \mathbf{x}_b + 1)^p \tag{14}$$
where p is a tunable parameter, which in practice varies from 1 to ~10. Notice that evaluating K involves only
an extra addition and exponentiation more than computing the original dot product. You might wonder what
the implicit mapping $\phi$ was in the creation of this kernel. Well, if you were to expand the dot product inside K
$$K(\mathbf{x}_a, \mathbf{x}_b) = (x_{a1} x_{b1} + x_{a2} x_{b2} + \dots + x_{ad} x_{bd} + 1)^p \tag{15}$$
and multiply these (d+1) terms by each other p times, it would result in $\binom{d+p}{p}$ distinct terms, each of which is a
polynomial of varying degree in the input vectors. Thus, one can think of this polynomial kernel as the dot
product of two exponentially large z vectors. By using a larger value of p the dimension of the feature space is
implicitly larger, where the data will likely be easier to separate. (However, in a larger dimensional space, there
might be more support vectors, which we saw leads to worse generalization.)
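The claim that the polynomial kernel is a dot product in a larger feature space can be checked directly for the small case d = 2, p = 2. The explicit map `phi_degree2` below is one valid choice assumed for this sketch (the √2 factors come from the cross terms of the squared sum); the function names are hypothetical.

```python
import math

def poly_kernel(xa, xb, p=2):
    """Eq. 14: K(xa, xb) = (xa . xb + 1)^p."""
    return (sum(a * b for a, b in zip(xa, xb)) + 1) ** p

def phi_degree2(x):
    """An explicit 6-dimensional feature map for d = 2, p = 2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)

xa, xb = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi_degree2(xa), phi_degree2(xb)))
print(poly_kernel(xa, xb), explicit)
```

Both routes give the same value, but the kernel needs only one 2-dimensional dot product while the explicit route needs a 6-dimensional one; the gap widens quickly as d and p grow.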
3.2.1.6 Other kernels
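Another popular one is the Gaussian RBF Kernel
$$K(\mathbf{x}_a, \mathbf{x}_b) = \exp\left(-\frac{\|\mathbf{x}_a - \mathbf{x}_b\|^2}{2\sigma^2}\right) \tag{16}$$
where $\sigma$ is a tunable parameter. Using this kernel results in the classifier
$$f(\mathbf{x}) = \mathrm{sign}\left(\sum_i \alpha_i y_i \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2\sigma^2}\right) + b\right)$$
which is really just a Radial Basis Function, with the support vectors as the centers. So here, an SVM was implicitly
used to find the number (and location) of centers needed to form the RBF network with the highest expected
generalization performance.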
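A kernel classifier of this form is short to write down. In the sketch below the support vectors, multipliers and bias are made-up values chosen for illustration, not the result of training, and the function names are hypothetical.

```python
import math

def rbf_kernel(xa, xb, sigma=1.0):
    """Gaussian RBF kernel: exp(-||xa - xb||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xa, xb))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_classify(x, support_vectors, ys, alphas, b, kernel=rbf_kernel):
    """f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    s = sum(a * y * kernel(sv, x) for a, y, sv in zip(alphas, ys, support_vectors))
    return 1 if s + b >= 0 else -1

# Hypothetical, hand-picked values (not obtained by solving the QP).
support_vectors = [(0.0, 0.0), (2.0, 0.0)]
ys = [1, -1]
alphas = [1.0, 1.0]
print(kernel_classify((0.1, 0.0), support_vectors, ys, alphas, b=0.0))
print(kernel_classify((1.9, 0.0), support_vectors, ys, alphas, b=0.0))
```

Each support vector acts as the center of a Gaussian bump, and a new point takes the label of whichever weighted bump dominates at its location.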
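At this point one might wonder what other kernels exist, and if making your own kernel is as simple as just
dreaming up some function $K(\mathbf{x}_a, \mathbf{x}_b)$. As it turns out, K must in fact be the dot product in a feature space for
some $\phi$, if all the theory behind SVM's is to go through. Now there are two ways to ensure this. The first is to
create some mapping $\mathbf{z} = \phi(\mathbf{x})$ and then derive the analytic expression for $K(\mathbf{x}_a, \mathbf{x}_b) = \phi(\mathbf{x}_a) \cdot \phi(\mathbf{x}_b)$.
This kernel is most definitely the dot product in a feature space, since it was created as such. The second way is
to dream up some function K and then check if it is valid by applying Mercer's condition. Without giving too
many details, the condition states: Suppose K can be written as $K(\mathbf{x}_a, \mathbf{x}_b) = \sum_{i=1}^{\infty} f_i(\mathbf{x}_a) f_i(\mathbf{x}_b)$ for some choice of
the $f_i$'s. If K is indeed a dot product, then
$$\int\!\!\int K(\mathbf{x}_a, \mathbf{x}_b)\, g(\mathbf{x}_a)\, g(\mathbf{x}_b)\, d\mathbf{x}_a\, d\mathbf{x}_b \geq 0$$
for every function g for which $\int g(\mathbf{x})^2\, d\mathbf{x}$ is finite.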
The mathematically inclined reader interested in the derivation details is encouraged to see (Cristianini, Shawe-
Taylor, 2000). It is indeed a strange mathematical requirement. Fortunately for us, the polynomial and RBF
kernels have already been proven to be valid. And most of the literature presenting results using SVM's all use
these two simple kernels. So most SVM users need not be concerned with creating new kernels, and checking
that they meet Mercer's condition. (Interestingly though, kernels satisfy many closure properties. That is,
addition, multiplication, and composition of valid kernels all result in valid kernels. Again, see (Cristianini, Shawe-
Taylor, 2000).)
3.2.2 Naïve Bayes classification
3.2.2.1 Introduction to Bayesian Classification
Bayesian classification is both a supervised learning method and a statistical method for
classification. It assumes an underlying probabilistic model, which allows us to capture uncertainty about the
model in a principled way by determining probabilities of the outcomes. It can solve diagnostic and predictive
problems.
This classification is named after Thomas Bayes (1702 - 1761), who proposed Bayes' Theorem.
Bayesian classification provides practical learning algorithms in which prior knowledge and observed data can be
combined. It also provides a useful perspective for understanding and evaluating many learning
algorithms. It calculates explicit probabilities for hypotheses and it is robust to noise in input data.
Uses of Naive Bayes classification:
1. Naive Bayes text classification
(http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
Bayesian classification is used as a probabilistic learning method (Naive Bayes text classification). Naive
Bayes classifiers are among the most successful known algorithms for learning to classify text documents.
2. Spam filtering (http://en.wikipedia.org/wiki/Bayesian_spam_filtering)
Spam filtering is the best known use of Naive Bayesian text classification. It makes use of a Naive Bayes
classifier to identify spam e-mail.
Bayesian spam filtering has become a popular mechanism to distinguish illegitimate spam email from
legitimate email (sometimes called "ham" or "bacn").[4] Many modern mail clients implement Bayesian
spam filtering. Users can also install separate email filtering programs. Server-side email filters, such as
DSPAM, SpamAssassin, SpamBayes, Bogofilter and ASSP, make use of Bayesian spam filtering techniques,
and the functionality is sometimes embedded within mail server software itself.
3. Hybrid Recommender System Using Naive Bayes Classifier and Collaborative Filtering
(http://eprints.ecs.soton.ac.uk/18483/)
Recommender Systems apply machine learning and data mining techniques for filtering unseen information
and can predict whether a user would like a given resource.
The cited work proposes a switching hybrid recommendation approach that combines a Naive Bayes
classification approach with collaborative filtering. Experimental results on two different data sets
show that the proposed algorithm is scalable and provides better performance, in terms of accuracy and
coverage, than other algorithms, while at the same time eliminating some recorded problems with
recommender systems.
4. Online applications (http://www.convo.co.uk/x02/)
This online application has been set up as a simple example of supervised machine learning and affective
computing. Using a training set of examples which reflect nice, nasty or neutral sentiments, we're training
Ditto to distinguish between them.
Simple Emotion modeling combines a statistically based classifier with a dynamical model. The Naive Bayes
classifier employs single words and word pairs as features. It allocates user utterances into nice, nasty and
neutral classes, labeled +1, -1 and 0 respectively. This numerical output drives a simple first-order
dynamical system, whose state represents the simulated emotional state of the experiment's
personification, Ditto the donkey.
Independence
Example
Suppose there are two events:
M: Manuela teaches the class (otherwise it's Andrew)
S: It is sunny
"The sunshine levels do not depend on and do not influence who is teaching."
Theory:
From P(S | M) = P(S), the rules of probability imply:
P(~S | M) = P(~S)
P(M | S) = P(M)
P(M ^ S) = P(M) P(S)
P(~M ^ S) = P(~M) P(S)
P(M^~S) = P(M)P(~S)
P(~M^~S) = P(~M)P(~S)
Theory applied on previous example:
"The sunshine levels do not depend on and do not influence who is teaching." can be specified very simply:
P(S | M) = P(S)
"Two events A and B are statistically independent if the probability of A is the same value when B occurs, when
B does not occur or when nothing is known about the occurrence of B"
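The independence rules can be checked numerically. The sketch below builds a joint distribution under the independence assumption and confirms that conditioning on M leaves P(S) unchanged. It borrows P(M) = 0.6 and P(S) = 0.3 from the detailed example given later in this section; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def independent_joint(p_m, p_s):
    """Joint table over (M, S) built under the assumption P(M ^ S) = P(M) P(S)."""
    return {
        (m, s): (p_m if m else 1 - p_m) * (p_s if s else 1 - p_s)
        for m, s in product((True, False), repeat=2)
    }

def conditional_s_given_m(joint, s, m):
    """P(S = s | M = m) = P(M = m ^ S = s) / P(M = m)."""
    p_m = sum(p for (mm, _), p in joint.items() if mm == m)
    return joint[(m, s)] / p_m

joint = independent_joint(Fraction(6, 10), Fraction(3, 10))
print(conditional_s_given_m(joint, True, True))    # P(S | M)
print(conditional_s_given_m(joint, True, False))   # P(S | ~M)
```

Both conditionals equal P(S), so knowing who teaches tells us nothing about the weather. Exact fractions are used to avoid floating-point rounding in the comparison.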
3.2.2.2 Conditional Probability
3.2.2.3 Simple Example:
H = "Have a headache"
F = "Coming down with Flu"
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
"Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance
you'll have a headache."
P(H|F) = Fraction of flu-inflicted worlds in which you have a headache
= (#worlds with flu and headache) / (#worlds with flu)
= (Area of "H and F" region) / (Area of "F" region)
= P(H ^ F) / P(F)
3.2.2.4 Theory:
P(A|B) = Fraction of worlds in which B is true that also have A true
P(A|B) = P(A ^ B) / P(B)
Corollary:
P(A ^ B) = P(A|B) P(B)
P(A|B) + P(~A|B) = 1
$\sum_{k=1}^{n} P(A = v_k \mid B) = 1$, where $v_1, \dots, v_n$ are the possible values of A.
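The corollary can be applied to the headache and flu numbers from the simple example. This is a small sketch using exact fractions; the variable names are hypothetical.

```python
from fractions import Fraction

p_f = Fraction(1, 40)          # P(F)
p_h_given_f = Fraction(1, 2)   # P(H|F)

# Corollary: P(H ^ F) = P(H|F) P(F)
p_h_and_f = p_h_given_f * p_f
# Corollary: P(H|F) + P(~H|F) = 1
p_not_h_given_f = 1 - p_h_given_f

print(p_h_and_f, p_not_h_given_f)
```

Only one world in eighty has both flu and a headache, even though half of the flu worlds do.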
3.2.2.5 Detailed Example
M : Manuela teaches the class
S : It is sunny
L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive
late than Manuela. Let's begin with writing down the knowledge:
P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6
Lateness is not independent of the weather and is not independent of the lecturer. Therefore
lateness is dependent on both weather and lecturer.
3.2.3 Conditional Independence
3.2.3.1 Example:
Suppose we have these three events:
M : Lecture taught by Manuela
L : Lecturer arrives late
R : Lecture concerns robots
Suppose:
Andrew has a higher chance of being late than Manuela.
Andrew has a higher chance of giving robotics lectures.
3.2.3.2 Theory:
R and L are conditionally independent given M if for all x, y, z in {T, F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally:
Let S1, S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,
P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)
P(A|B) = P(A ^ B) / P(B)
Therefore P(A ^ B) = P(A|B) P(B), also known as the chain rule.
Also P(A ^ B) = P(B|A) P(A)
Therefore P(A|B) = P(B|A) P(A) / P(B)
P(A,B|C) = P(A ^ B ^ C) / P(C)
= P(A|B,C) P(B ^ C) / P(C) (applying the chain rule)
= P(A|B,C) P(B|C)
= P(A|C) P(B|C), if A and B are conditionally independent given C.
This can be extended to n variables: P(A1, A2, ..., An | C) = P(A1|C) P(A2|C) ... P(An|C) if A1, A2, ..., An are conditionally independent given C.
3.2.3.3 Theory applied on previous example:
For the previous example, we can use the following notations: P(R| M,L) = P(R| M) and P(R| ~M,L) = P(R| ~M)
We express this in the following way: "R and L are conditionally independent given M"
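To make the statement concrete, the sketch below builds a joint distribution over M, L and R in which L and R each depend only on M, and then checks that P(R | M, L) = P(R | M) for every combination. P(M) = 0.6 is taken from the earlier example; the conditional probabilities for lateness and robotics are hypothetical values chosen only so that Andrew is more often late and more often lectures on robots.

```python
from fractions import Fraction
from itertools import product

P_M = Fraction(6, 10)  # P(M) = 0.6, as in the earlier example
# Hypothetical conditional probabilities, chosen only for illustration.
P_L_GIVEN_M = {True: Fraction(1, 10), False: Fraction(3, 10)}  # P(L | M), P(L | ~M)
P_R_GIVEN_M = {True: Fraction(2, 10), False: Fraction(7, 10)}  # P(R | M), P(R | ~M)


def bernoulli(p_true, value):
    """Probability of `value` for an event that is true with probability p_true."""
    return p_true if value else 1 - p_true


# Joint distribution built as P(M, L, R) = P(M) P(L | M) P(R | M).
joint = {
    (m, l, r): bernoulli(P_M, m) * bernoulli(P_L_GIVEN_M[m], l) * bernoulli(P_R_GIVEN_M[m], r)
    for m, l, r in product([True, False], repeat=3)
}


def prob(event):
    """Sum the joint over all worlds (m, l, r) in which `event` holds."""
    return sum(p for world, p in joint.items() if event(*world))


def p_r_given_m_l(m, l):
    return prob(lambda mm, ll, rr: rr and mm == m and ll == l) / prob(lambda mm, ll, rr: mm == m and ll == l)


def p_r_given_m(m):
    return prob(lambda mm, ll, rr: rr and mm == m) / prob(lambda mm, ll, rr: mm == m)


conditionally_independent = all(
    p_r_given_m_l(m, l) == p_r_given_m(m) for m, l in product([True, False], repeat=2)
)
p_r = prob(lambda m, l, r: r)
p_r_given_l = prob(lambda m, l, r: r and l) / prob(lambda m, l, r: l)
print(conditionally_independent, p_r, p_r_given_l)
```

With these numbers R and L are conditionally independent given M, yet P(R | L) differs from P(R): without knowing the lecturer, a late arrival hints at Andrew and therefore at robotics.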
3.2.4 Bayes Theorem
Bayesian reasoning is applied to decision making and to inferential statistics that deal with probability inference. It uses the knowledge of prior events to predict future events. Example: predicting the color of marbles in a basket.
3.2.4.1 Example:
Table1: Data table
3.2.4.2 Theory:
The Bayes Theorem:
P(h/D) = P(D/h) P(h) / P(D)
P(h) : Prior probability of hypothesis h
P(D) : Prior probability of training data D
P(h/D) : Probability of h given D
P(D/h) : Probability of D given h
3.2.4.3 Theory applied on previous example:
D : 35 year old customer with an income of $50,000 PA
h : Hypothesis that our customer will buy our computer
P(h/D) : Probability that customer D will buy our computer given that we know his age and income (posterior probability)
P(h) : Probability that any customer will buy our computer regardless of age (prior probability)
P(D/h) : Probability that the customer is 35 yrs old and earns $50,000, given that he has bought our computer (likelihood)
P(D) : Probability that a person from our set of customers is 35 yrs old and earns $50,000
3.2.5 Maximum A Posteriori (MAP) Hypothesis
3.2.5.1 Example:
h1 : Customer buys a computer = Yes
h2 : Customer buys a computer = No
where h1 and h2 are members of our hypothesis space 'H'
P(h/D) (Final Outcome) = arg max { P(D/h1) P(h1), P(D/h2) P(h2) }
P(D) can be ignored as it is the same for both terms.
3.2.5.2 Theory:
Generally we want the most probable hypothesis given the training data:
hMAP = arg max P(h/D) (where h belongs to H and H is the hypothesis space)
hMAP = arg max P(D/h) P(h) / P(D)
hMAP = arg max P(D/h) P(h)
3.2.6 Maximum Likelihood (ML) Hypothesis
3.2.6.1 Example:
Table 2
3.2.6.2 Theory:
If we assume that all hypotheses are equally probable a priori, P(hi) = P(hj), the prior no longer affects the maximization, and a further simplification leads to:
hML = arg max P(D/hi) (where hi belongs to H)
3.2.6.3 Theory applied on previous example:
P(buys computer = yes) = 5/10 = 0.5
P(buys computer = no) = 5/10 = 0.5
P(customer is 35 yrs & earns $50,000) = 4/10 = 0.4
P(customer is 35 yrs & earns $50,000 / buys computer = yes) = 3/5 = 0.6
P(customer is 35 yrs & earns $50,000 / buys computer = no) = 1/5 = 0.2
Customer buys a computer: P(h1/D) = P(h1) * P(D/h1) / P(D) = 0.5 * 0.6 / 0.4 = 0.75
Customer does not buy a computer: P(h2/D) = P(h2) * P(D/h2) / P(D) = 0.5 * 0.2 / 0.4 = 0.25
Final Outcome = arg max {P(h1/D), P(h2/D)} = max(0.75, 0.25) => Customer buys a computer
Because the priors are equal, the maximum likelihood hypothesis arg max {P(D/h1), P(D/h2)} = max(0.6, 0.2) selects the same outcome.
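The arithmetic of this example can be reproduced with a few lines of code. This is only a sketch using the probabilities listed above; the dictionary keys are illustrative names. Normalizing by P(D) = 0.4 gives posteriors of 0.75 and 0.25, and since the priors are equal the likelihoods 0.6 and 0.2 lead to the same choice.

```python
# Priors P(h) and likelihoods P(D/h) from the worked example.
priors = {"yes": 0.5, "no": 0.5}
likelihoods = {"yes": 0.6, "no": 0.2}

# P(D) by the law of total probability: sum over h of P(D/h) P(h).
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Posterior P(h/D) for each hypothesis via Bayes' theorem.
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}

# MAP hypothesis: the one with the largest posterior.
h_map = max(posteriors, key=posteriors.get)
# ML hypothesis: the one with the largest likelihood (priors ignored).
h_ml = max(likelihoods, key=likelihoods.get)

print(evidence, posteriors, h_map, h_ml)
```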
3.2.7 Naïve Bayesian Classification
It is based on Bayes' theorem and is particularly suited to cases where the dimensionality of the inputs is high. Parameter estimation for naive Bayes models uses the method of maximum likelihood. In spite of its over-simplified assumptions, it often performs well in many complex real-world situations.
Advantage: Requires a small amount of training data to estimate the parameters
3.2.7.1 Example
X = ( age= youth, income = medium, student = yes, credit_rating = fair)
Will a person described by tuple X buy a computer?
3.2.7.2 Theory:
Derivation:
D : Set of tuples. Each tuple is an 'n'-dimensional attribute vector X : (x1, x2, x3, ..., xn)
Let there be 'm' classes : C1, C2, C3, ..., Cm
The Naïve Bayes classifier predicts that X belongs to class Ci iff
P(Ci/X) > P(Cj/X) for 1 <= j <= m, j <> i (maximum a posteriori hypothesis)
P(Ci/X) = P(X/Ci) P(Ci) / P(X)
Maximize P(X/Ci) P(Ci), as P(X) is constant.
With many attributes, it is computationally expensive to evaluate P(X/Ci). The naïve assumption of "class conditional independence" gives:
P(X/Ci) = Π(k=1..n) P(xk/Ci) = P(x1/Ci) * P(x2/Ci) * ... * P(xn/Ci)
3.2.7.3 Theory applied on previous example:
P(C1) = P(buys_computer = yes) = 9/14 = 0.643
P(C2) = P(buys_computer = no) = 5/14 = 0.357
P(age=youth / buys_computer = yes) = 2/9 = 0.222
P(age=youth / buys_computer = no) = 3/5 = 0.600
P(income=medium / buys_computer = yes) = 4/9 = 0.444
P(income=medium / buys_computer = no) = 2/5 = 0.400
P(student=yes / buys_computer = yes) = 6/9 = 0.667
P(student=yes / buys_computer = no) = 1/5 = 0.200
P(credit rating=fair / buys_computer = yes) = 6/9 = 0.667
P(credit rating=fair / buys_computer = no) = 2/5 = 0.400
P(X / Buys a computer = yes) = P(age=youth / buys_computer = yes) * P(income=medium / buys_computer = yes) * P(student=yes / buys_computer = yes) * P(credit rating=fair / buys_computer = yes) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044
P(X / Buys a computer = no) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019
Find the class Ci that maximizes P(X/Ci) * P(Ci):
=> P(X / Buys a computer = yes) * P(buys_computer = yes) = 0.028
=> P(X / Buys a computer = no) * P(buys_computer = no) = 0.007
Prediction : Buys a computer for Tuple X
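The calculation above can be sketched in code as follows. The counts are the ones listed in this example; the dictionary layout and names are assumptions made for illustration and are not part of the Weka implementation used in the experiments.

```python
from fractions import Fraction
from math import prod

# Class priors P(Ci) from the example.
priors = {"yes": Fraction(9, 14), "no": Fraction(5, 14)}

# Conditional probabilities P(xk/Ci) for the attribute values of tuple X.
conditionals = {
    "yes": {
        "age=youth": Fraction(2, 9),
        "income=medium": Fraction(4, 9),
        "student=yes": Fraction(6, 9),
        "credit_rating=fair": Fraction(6, 9),
    },
    "no": {
        "age=youth": Fraction(3, 5),
        "income=medium": Fraction(2, 5),
        "student=yes": Fraction(1, 5),
        "credit_rating=fair": Fraction(2, 5),
    },
}

# Naive assumption: P(X/Ci) is the product of the per-attribute probabilities.
likelihoods = {c: prod(conditionals[c].values()) for c in priors}

# Score each class by P(X/Ci) * P(Ci); P(X) is the same for both and is dropped.
scores = {c: likelihoods[c] * priors[c] for c in priors}
prediction = max(scores, key=scores.get)

for c in scores:
    print(c, round(float(likelihoods[c]), 3), round(float(scores[c]), 3))
print("prediction:", prediction)
```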
3.3 Evaluation measures
Text classification rules are typically evaluated using performance measures from information retrieval. Common metrics for text categorization evaluation include recall, precision, accuracy, error rate and F1. Given a test set of N documents, a two-by-two contingency table (see Table 1) with four cells can be constructed for each binary classification problem. The cells contain the counts for true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), respectively. Clearly, N = TP + FP + TN + FN.
Table 1
                                 actual class (observation)
                                 positive                    negative
predicted class    positive      TP (true positive)          FP (false positive)
(expectation)                    Correct result              Unexpected result
                   negative      FN (false negative)         TN (true negative)
                                 Missing result              Correct absence of result
The terms true positives, true negatives, false positives, and false negatives compare the results of the classifier under test with trusted external judgments.
The terms positive and negative refer to the classifier's prediction (sometimes known as the expectation), and the terms true and false refer to whether that prediction corresponds to the external judgment (sometimes known as the observation).
Hence the metrics for binary-decisions are defined as:
3.3.1 Precision
Precision is the proportion of Predicted Positive cases that are correctly Real Positives. This is what Machine Learning, Data Mining and Information Retrieval focus on, but it is totally ignored in ROC analysis. It can however analogously be called True Positive Accuracy (tpa), being a measure of accuracy of Predicted Positives in contrast with the rate of discovery of Real Positives (tpr).
Put simply, it represents how many of the returned documents or topics are correctly predicted.
Precision = TP / (TP + FP)
Issue in Precision: when a system outputs only the topics it is confident about, the precision easily reaches a high percentage.
3.3.2 Recall (Sensitivity)
Recall is the proportion of Real Positive cases that are correctly Predicted Positive. This measures the coverage of the Real Positive cases by the +P (Predicted Positive) rule. Its desirable feature is that it reflects how many of the relevant cases the +P rule picks up.
Put simply, it represents what percent of the positive cases were caught.
Recall = TP / (TP + FN)
Issue in Recall: when a system outputs loosely, the recall easily reaches a high percentage.
3.3.3 Accuracy
Accuracy represents what percent of the predictions were correct, i.e. the rate of correctly predicted topics.
Figure 2
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Issue in Accuracy: when a certain topic (e.g., not-spam) is a majority, the accuracy easily reaches a high percentage.
3.3.4 F1-measure
F1 measure effectively references the True Positives to the Arithmetic Mean of Predicted Positives and Real Positives, being a constructed rate normalized to an idealized value. Expressed in this form it is known in statistics as a Proportion of Specific Agreement, as it is applied to a specific class, here the Positive class.
F1 measure = 2*Recall*Precision/(Recall + Precision)
Since there is a trade-off between recall and precision, the F-measure is widely used to evaluate text classification systems.
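The four measures can be computed directly from the contingency counts. The sketch below uses the counts of the unigram Naive Bayes run without attribute selection reported in Section 5.4.1 (positive class: TP = 93, FN = 7, FP = 14, TN = 86); the function name is illustrative.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return precision, recall, accuracy and F1 for one binary contingency table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, accuracy, f1


# Counts for the positive class, read from the confusion matrix in Section 5.4.1.
precision, recall, accuracy, f1 = binary_metrics(tp=93, fp=14, fn=7, tn=86)
print(round(precision, 3), round(recall, 3), round(accuracy, 3), round(f1, 3))
```

The rounded values agree with the precision, recall and F-measure that Weka reports for the positive class of that run.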
Chapter 4 Opinion lexicons
4.1 Definition
Opinion lexicons are resources that associate sentiment orientation and words. Their use in opinion mining research
stems from the hypothesis that individual words can be considered as a unit of opinion information, and therefore
may provide clues to document sentiment and subjectivity.
Manually created opinion lexicons were applied to sentiment classification as seen in [13], where a
prediction of document polarity is given by counting positive and negative terms. A similar approach
is presented in the work of Kennedy and Inkpen [10], this time using an opinion lexicon based on the
combination of other existing resources.
Manually built lexicons however tend to be constrained to a small number of terms. By its nature, building manual
lists is a time consuming effort, and may be subject to annotator bias. To overcome these issues lexical induction
approaches have been proposed in the literature with a view to extend the size of opinion lexicons from a core set of
seed terms, either by exploring term relationships, or by evaluating similarities in document corpora. Early work in
this area seen in [9] extends a list of positive and negative adjectives by evaluating conjunctive statements in a
document corpus. Another common approach is to derive opinion terms from the WordNet database of terms and
relationships [12], typically by examining the semantic relationships of a term such as synonyms and antonyms.
In this work two commonly used opinion lexicons are used: the first is SentiWordNet 3.0 and the second is the opinion lexicon created by Dr. Bing Liu.
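The term-counting approach described above can be sketched as follows. The two word sets are tiny hypothetical stand-ins for a real opinion lexicon, and returning "neutral" on a tie is an assumption of this sketch rather than a rule taken from the cited work.

```python
import re

# Hypothetical miniature lexicon; a real lexicon holds thousands of entries.
POSITIVE = {"good", "great", "excellent", "love", "sharp"}
NEGATIVE = {"bad", "poor", "terrible", "blurry", "broken"}


def classify(review):
    """Predict polarity by counting positive and negative lexicon terms."""
    tokens = re.findall(r"[a-z']+", review.lower())
    positive_hits = sum(token in POSITIVE for token in tokens)
    negative_hits = sum(token in NEGATIVE for token in tokens)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"  # tie handling is an assumption of this sketch


print(classify("Great camera, sharp pictures, but poor battery"))
print(classify("Terrible lens and blurry photos"))
```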
4.2 SENTIWORDNET 3.0
SENTIWORDNET 3.0 is an enhanced lexical resource explicitly devised for supporting sentiment classification and opinion mining applications (Pang and Lee, 2008). It is an improved version of SENTIWORDNET 1.0 (Esuli and Sebastiani, 2006), a lexical resource publicly available for research purposes, currently licensed to more than 300 research groups and used in a variety of research projects worldwide. SENTIWORDNET is the result of the automatic annotation of all the synsets of WORDNET according to the notions of “positivity”, “negativity”, and “neutrality”. Each synset s is associated with three numerical scores Pos(s), Neg(s), and Obj(s), which indicate how positive, negative, and “objective” (i.e., neutral) the terms contained in the synset are. Different senses of the same term may thus have different opinion-related properties. For example, in SENTIWORDNET 1.0 the synset [estimable(J,3)], corresponding to the sense “may be computed or estimated” of the adjective estimable, has an Obj score of 1.0 (and Pos and Neg scores of 0.0), while the synset [estimable(J,1)], corresponding to the sense “deserving of respect or high regard”, has a Pos score of 0.75, a Neg score of 0.0, and an Obj score of 0.25. Each of the three scores ranges in the interval [0.0, 1.0], and their sum is 1.0 for each synset. This means that a synset may have nonzero scores for all three categories, which would indicate that the corresponding terms have, in the sense indicated by the synset, each of the three opinion-related properties to a certain degree.
Each set of terms sharing the same meaning in SentiWordNet (a synset) is associated with two numerical scores ranging from 0 to 1, indicating the synset’s positive and negative bias. The scores reflect the agreement amongst the classifier committee on the positive or negative label for a term; thus one distinct aspect of SentiWordNet is that it is possible for a term to have non-zero values for both positive and negative scores, according to the formula:
Pos. Score(term) + Neg. Score(term) + Objective Score(term) = 1
Terms in the SentiWordNet database follow the categorization into parts of speech derived from WordNet, and therefore to correctly apply scores to terms, a part of speech tagger program was applied to the polarity data set. In our experiment, the Stanford Part of Speech Tagger was used. The opinion lexicon can be downloaded freely for research purposes from the following link:
http://sentiwordnet.isti.cnr.it/
4.3 (Bing Liu) Opinion lexicon
4.3.1 Who is Dr. Bing Liu?
Dr. Bing Liu is a professor in the Department of Computer Science at the University of Illinois at Chicago (UIC) whose research interests are sentiment analysis, opinion mining, data and web mining, machine learning, constraint satisfaction and AI scheduling. He has an extensive publication record, especially in the field of opinion mining and in data mining in general.
The following are examples of his publications in this field:
Mining and summarizing customer reviews
Opinion observer: analyzing and comparing opinions on the Web
Mining opinion features in customer reviews
Sentiment analysis and subjectivity
A holistic lexicon-based approach to opinion mining
Opinion spam and analysis
The following link contains most of his publications:
http://www.cs.uic.edu/~liub/publications/papers_chron.html
4.3.2 (Bing Liu) Opinion lexicon
The opinion lexicon is a list of positive and negative opinion words (sentiment words) for English: around 6800 words divided into two separate files, one containing the positive words and the other containing the negative words. This list was compiled over many years starting from his first paper (Hu and Liu, KDD-2004).
The opinion lexicon can be downloaded freely for research purposes from the following link: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Chapter 5 Experimental results
5.1 Data collection
The data set used in this work is the Amazon product review data (Jindal and Liu, WSDM-2008) used in (Jindal and Liu, WWW-2007, WSDM-2008; Lim et al, CIKM-2010; Jindal, Liu and Lim, CIKM-2010; Mukherjee et al. WWW-2011; Mukherjee, Liu and Glance, WWW-2012) for opinion spam (fake review) detection. The dataset consists of more than 2.8 million product reviews in multiple domains.
From this dataset we extracted 2000 reviews (1000 positive reviews and 1000 negative reviews) in the digital cameras domain. The extracted data were later split as follows: 90% of the data (1800 reviews) for training the classifiers and 10% of the data (200 reviews) as a test set.
The full dataset is free for research purposes and can be downloaded from the following link:
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
The following steps were applied to the dataset to extract the data of the desired domain:
Converted to database using SQL Server 2008 Import and Export Wizard tool
Figure 3
5.2 Feature extraction
We used the Weka filter “StringToWordVector” to extract the features as unigrams and bigrams.
5.2.1 Unigram:
Figure 4
5.2.2 Bigram:
Figure 5
All stop words were also excluded from both the unigram and the bigram feature vectors.
The stop words that were removed are gathered from the following link:
https://code.google.com/p/textminingtools/source/browse/trunk/data/stopwords.txt?r=5
In both feature vectors (unigram and bigram) the “IteratedLovinsStemmer” stemmer was chosen, and the tokenizer for the bigram vector produced both one-word tokens (unigrams) and two-word tokens.
TFTransform and IDFTransform were also applied to both feature vectors.
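The extraction step can be approximated outside Weka with the sketch below. It is not the StringToWordVector implementation: stop-word removal and stemming are omitted, the tokenizer is a plain whitespace split, and the weighting assumes that the TF transform is log(1 + f) and the IDF transform multiplies by log(N / df), combined as a product. The function names are illustrative.

```python
import math
from collections import Counter


def ngrams(text, max_n=2):
    """Return all word n-grams of length 1..max_n (unigrams first, then bigrams)."""
    words = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        features += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return features


def tfidf_vectors(documents, max_n=2):
    """Build one sparse feature vector (dict) per document."""
    counts = [Counter(ngrams(doc, max_n)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each feature occurs.
    doc_freq = Counter(term for c in counts for term in c)
    return [
        {term: math.log(1 + f) * math.log(n_docs / doc_freq[term]) for term, f in c.items()}
        for c in counts
    ]


docs = ["great camera", "great lens great zoom"]
vectors = tfidf_vectors(docs)
print(ngrams(docs[0]))
print(vectors[0])
```

A feature that occurs in every document, such as "great" here, receives a weight of zero, which is the usual effect of the IDF factor.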
5.3 Feature selection
For dimensionality reduction the following patterns were removed from the feature vector (using Weka).
Patterns excluded from the uni-gram feature vector:
Numbers only: ([0-9]+)
Contains special characters: (.*[^a-z0-9 ]+.*)
One character: (.)
Two characters: (..)
The above patterns were excluded to remove features that consist of numbers only, features that include special characters and features that consist of only one or two characters.
Patterns excluded from the bi-gram feature vector:
Numbers only: ([0-9]+)
Contains special characters: (.*[^a-z0-9 ]+.*)
One character: (.)
Two characters: (..)
One word is one letter: ([a-z] .*)|(.* [a-z])
One word is two letters: ([a-z][a-z] .*)|(.* [a-z][a-z])
2nd word is a number: (.* [0-9]+)
The above patterns were excluded to remove features that consist of numbers only, features that include special characters, features that consist of only one or two characters, features in which one word is a single character, features in which one word consists of only two characters, and features whose second word is a number.
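The effect of these patterns can be illustrated with the sketch below, which treats each pattern as a regular expression that must match the whole feature name. Whole-name matching and the helper name are assumptions of this sketch; the experiments themselves applied the patterns inside Weka.

```python
import re

# Patterns applied to both feature vectors.
COMMON_PATTERNS = [
    r"[0-9]+",            # numbers only
    r".*[^a-z0-9 ]+.*",   # contains special characters
    r".",                 # one character
    r"..",                # two characters
]
# Additional patterns applied to the bigram feature vector only.
BIGRAM_PATTERNS = [
    r"[a-z] .*|.* [a-z]",            # one word is one letter
    r"[a-z][a-z] .*|.* [a-z][a-z]",  # one word is two letters
    r".* [0-9]+",                    # second word is a number
]


def keep_feature(name, bigram=False):
    """Return True if the feature name matches none of the excluded patterns."""
    patterns = COMMON_PATTERNS + (BIGRAM_PATTERNS if bigram else [])
    return not any(re.fullmatch(pattern, name) for pattern in patterns)


candidates = ["camera", "2013", "ok", "can't", "great camera", "a camera", "zoom 10"]
print([name for name in candidates if keep_feature(name, bigram=True)])
```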
A further dimensionality reduction is applied on the uni-gram feature vector and the bi-gram feature vector
using the information gain algorithm. The following are the results of running information gain algorithm on
both feature vectors (unigram and Bi-gram respectively) in weka:
The results of applying information gain attribute selection algorithm On unigram feature vector:
=== Run information ===
Evaluator: weka.attributeSelection.InfoGainAttributeEval
Search: weka.attributeSelection.Ranker -T 0.0 -N -1
Instances: 2000
Attributes: 8081
Evaluation mode: evaluate on all training data
=== Attribute Selection on all input data ===
Search Method:
Attribute ranking.
Threshold for discarding attributes: 0
Attribute Evaluator (supervised, Class (nominal): 1 ReviewClass):
Information Gain Ranking Filter
Selected attributes: 1034 attribute
The results of applying information gain attribute selection algorithm On Bi-gram feature vector:
=== Run information ===
Evaluator: weka.attributeSelection.InfoGainAttributeEval
Search: weka.attributeSelection.Ranker -T 0.0 -N -1
Instances: 2000
Attributes: 107765
Evaluation mode: evaluate on all training data
=== Attribute Selection on all input data ===
Search Method:
Attribute ranking.
Threshold for discarding attributes: 0
Attribute Evaluator (supervised, Class (nominal): 1 ReviewClass):
Information Gain Ranking Filter
Selected attributes: 3740
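For reference, information gain measures how much knowing an attribute reduces the entropy of the class label. The sketch below computes it for a binary "feature present / absent" attribute; this simplification and the function names are assumptions of the sketch, since the experiments relied on Weka's own evaluator over weighted numeric attributes.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels)
    )


def information_gain(feature_present, labels):
    """Entropy of the class minus the expected entropy after splitting on the feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(feature_present, labels) if x == value]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain


labels = ["+", "+", "-", "-"]
informative = [True, True, False, False]   # occurs only in positive reviews
uninformative = [True, False, True, False]  # occurs equally in both classes
print(information_gain(informative, labels), information_gain(uninformative, labels))
```

An attribute that carries no information about the class has a gain of zero; with the ranking threshold of 0 shown in the runs above, such attributes appear to be the ones discarded.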
A further dimensionality reduction is applied to the uni-gram feature vector and the bi-gram feature vector using the principal components attribute selection algorithm. The following are the results of running the PCA algorithm on both feature vectors (unigram and bi-gram respectively) in Weka:
The results of applying PCA attribute selection algorithm On unigram feature vector:
=== Run information ===
Evaluator: weka.attributeSelection.PrincipalComponents -R 0.2 -A 5
Search:weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Instances: 2000
Attributes: 1035
Evaluation mode:evaluate on all training data
=== Attribute Selection on all input data ===
Search Method: Attribute ranking.
Attribute Evaluator (unsupervised): Principal Components Attribute
Transformer
Correlation matrix
1 0.22 0.13 0.14 0.17 . . . . ..
0.22 1 0.13 0.19 0.23 . . . . .
. . . . . . . . .. .
. . . . . . . . . . .
. . . . . . .. . .. . .
eigenvalue proportion cumulative
16.34076 0.0158 0.0158 -0.117mod-0.1featur-0.099shoot-0.097control-0.093ma...
9.9449 0.00962 0.02542 -0.3rebuilt-0.3unreasolut-0.287apar-0.278bureau 0.2...
7.42351 0.00718 0.0326 0.241tech+0.231awar+0.224proper+0.221easyshar+0.204p...
6.98087 0.00675 0.03935 0.329deceit+0.321downright+0.318death+0.303cnet+0.27...
6.53621 0.00632 0.04567 0.319ent+0.313cluster+0.312shiver+0.286univer+0.28 k...
6.13852 0.00594 0.05161 0.158len-0.136card-0.123usb+0.111foc-0.111vid...
5.64292 0.00546 0.05707 -0.358dieg-0.358transcript-0.356raynox-0.329unedit-....
........ ........ ........ .........................................
........ ........ ........ .........................................
Ranked attributes: 0.984 1 -0.117mod-0.1featur-0.099shoot-0.097control-0.093manu...
0.975 2 -0.3rebuilt-0.3unreasolut-0.287apar-0.278bureau-0.268gask...
.
.
.
0.798 55 0.116piec+0.111algorithm+0.11 snapshot-0.108laser+0.106doubl...
Selected attributes: 55
The results of applying PCA attribute selection algorithm On Bi-gram feature vector:
Selected attributes: 58
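In the unigram PCA output above, the ranking value of each component appears to be one minus the cumulative proportion of variance explained so far. The sketch below checks this reading against the first three components listed in the output; it is an interpretation of the printed numbers, not a statement about Weka's internals.

```python
# (eigenvalue, proportion of variance) for the first three components in the output.
components = [(16.34076, 0.0158), (9.9449, 0.00962), (7.42351, 0.00718)]

cumulative = 0.0
ranks = []
for _eigenvalue, proportion in components:
    cumulative += proportion
    # Ranking value read as the share of variance still unexplained.
    ranks.append(round(1 - cumulative, 3))

print(ranks)
```

The first two values match the ranked attributes 0.984 and 0.975 in the output. On the same reading, the last retained component (rank 55, value 0.798) is the point where roughly 20% of the variance has been covered, which is consistent with the -R 0.2 option of the evaluator.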
The following table (Table 2) summarizes the evolution of the feature vectors through the different phases of feature extraction and feature selection (dimensionality reduction) applied to the data set used in this work, starting from the original feature vector and ending with the smallest feature vector obtained. This is also visualized in Figure 6 and Figure 7 below.
Table 2
Feature selection / extraction step applied                    Feature vector size
                                                               Unigram     Bigram
Original feature vector                                        12974       165788
After removing stop words                                      12733       165547
After removing pattern ([0-9]+)                                12294       165349
After removing pattern (.)                                     12280       165311
After removing pattern (..)                                    11805       164765
After removing pattern (.*[^a-z0-9 ]+.*)                       8081        146376
After removing pattern ([a-z] .*)|(.* [a-z])                   n/a         132961
After removing pattern ([a-z][a-z] .*)|(.* [a-z][a-z])         n/a         100665
After removing pattern (.* [0-9]+)                             n/a         97631
After applying information gain attribute selection            1034        3740
After applying principal components analysis                   55          58
Figure 6
Figure 7
5.4 Results of the classifiers
5.4.1 Unigram
NaiveBayes without attributes selection
=== Summary ===
Correctly Classified Instances 179 89.5 %
Incorrectly Classified Instances 21 10.5 %
Kappa statistic 0.79
Mean absolute error 0.105
Root mean squared error 0.324
Relative absolute error 21 %
Root relative squared error 64.8074 %
Coverage of cases (0.95 level) 89.5 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 200
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.930 0.140 0.869 0.930 0.899 0.792 0.917 0.864 +
0.860 0.070 0.925 0.860 0.891 0.792 0.948 0.920 -
Weighted Avg. 0.895 0.105 0.897 0.895 0.895 0.792 0.933 0.892
=== Confusion Matrix ===
a b <-- classified as
93 7 | a = +
14 86 | b = -
NaiveBayes after attribute selection (using Information gain only)
=== Summary ===
Correctly Classified Instances 186 93 %
Incorrectly Classified Instances 14 7 %
Kappa statistic 0.86
Mean absolute error 0.07
Root mean squared error 0.2646
Relative absolute error 14 %
Root relative squared error 52.915 %
Coverage of cases (0.95 level) 93 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 200
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.950 0.090 0.913 0.950 0.931 0.861 0.930 0.893 +
0.910 0.050 0.948 0.910 0.929 0.861 0.930 0.908 -
Weighted Avg. 0.930 0.070 0.931 0.930 0.930 0.861 0.930 0.900
=== Confusion Matrix ===
a b <-- classified as
95 5 | a = +
9 91 | b = -
NaiveBayes after using Principal components analysis (PCA)
=== Summary ===
Correctly Classified Instances 176 87.1287 %
Incorrectly Classified Instances 26 12.8713 %
Kappa statistic 0.7426
Mean absolute error 0.1269
Root mean squared error 0.3397
Relative absolute error 25.3785 %
Root relative squared error 67.9319 %
Coverage of cases (0.95 level) 91.5842 %
Mean rel. region size (0.95 level) 53.7129 %
Total Number of Instances 202
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.941 0.198 0.826 0.941 0.880 0.750 0.957 0.958 +
0.802 0.059 0.931 0.802 0.862 0.750 0.957 0.959 -
Weighted Avg. 0.871 0.129 0.879 0.871 0.871 0.750 0.957 0.959
=== Confusion Matrix ===
a b <-- classified as
95 6 | a = +
20 81 | b = -
SVM without attributes selection
=== Summary ===
Correctly Classified Instances 186 93 %
Incorrectly Classified Instances 14 7 %
Kappa statistic 0.86
Mean absolute error 0.07
Root mean squared error 0.2646
Relative absolute error 14 %
Root relative squared error 52.915 %
Coverage of cases (0.95 level) 93 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 200
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.950 0.090 0.913 0.950 0.931 0.861 0.930 0.893 +
0.910 0.050 0.948 0.910 0.929 0.861 0.930 0.908 -
Weighted Avg. 0.930 0.070 0.931 0.930 0.930 0.861 0.930 0.900
=== Confusion Matrix ===
a b <-- classified as
95 5 | a = +
9 91 | b = -
SVM after attributes selection (using information gain)
=== Summary ===
Correctly Classified Instances 189 94.5 %
Incorrectly Classified Instances 11 5.5 %
Kappa statistic 0.89
Mean absolute error 0.055
Root mean squared error 0.2345
Relative absolute error 11 %
Root relative squared error 46.9042 %
Coverage of cases (0.95 level) 94.5 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 200
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.960 0.070 0.932 0.960 0.946 0.890 0.945 0.915 +
0.930 0.040 0.959 0.930 0.944 0.890 0.945 0.927 -
Weighted Avg. 0.945 0.055 0.945 0.945 0.945 0.890 0.945 0.921
=== Confusion Matrix ===
a b <-- classified as
96 4 | a = +
7 93 | b = -
SVM after using Principal components analysis (PCA)
=== Summary ===
Correctly Classified Instances 186 92.0792 %
Incorrectly Classified Instances 16 7.9208 %
Kappa statistic 0.8416
Mean absolute error 0.0792
Root mean squared error 0.2814
Relative absolute error 15.8416 %
Root relative squared error 56.2878 %
Coverage of cases (0.95 level) 92.0792 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 202
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.980 0.139 0.876 0.980 0.925 0.848 0.921 0.869 +
0.861 0.020 0.978 0.861 0.916 0.848 0.921 0.911 -
Weighted Avg. 0.921 0.079 0.927 0.921 0.921 0.848 0.921 0.890
=== Confusion Matrix ===
a b <-- classified as
99 2 | a = +
14 87 | b = -
5.4.2 Bigram
NaiveBayes without attributes selection
=== Summary ===
Correctly Classified Instances 183 91.5 %
Incorrectly Classified Instances 17 8.5 %
Kappa statistic 0.83
Mean absolute error 0.085
Root mean squared error 0.2915
Relative absolute error 17 %
Root relative squared error 58.3095 %
Coverage of cases (0.95 level) 91.5 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 200
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.930 0.100 0.903 0.930 0.916 0.830 0.920 0.883 +
0.900 0.070 0.928 0.900 0.914 0.830 0.938 0.907 -
Weighted Avg. 0.915 0.085 0.915 0.915 0.915 0.830 0.929 0.895
=== Confusion Matrix ===
a b <-- classified as
93 7 | a = +
10 90 | b = -
NaiveBayes after attribute selection (with IG)
=== Summary ===
Correctly Classified Instances 181 90.5 %
Incorrectly Classified Instances 19 9.5 %
Kappa statistic 0.81
Mean absolute error 0.095
Root mean squared error 0.3082
Relative absolute error 19 %
Root relative squared error 61.6441 %
Coverage of cases (0.95 level) 90.5 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 200
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.970 0.160 0.858 0.970 0.911 0.817 0.913 0.855 +
0.840 0.030 0.966 0.840 0.898 0.817 0.904 0.891 -
Weighted Avg. 0.905 0.095 0.912 0.905 0.905 0.817 0.908 0.873
=== Confusion Matrix ===
a b <-- classified as
97 3 | a = +
16 84 | b = -
NaiveBayes after using Principal components analysis (PCA)
=== Summary ===
Correctly Classified Instances 195 96.5347 %
Incorrectly Classified Instances 7 3.4653 %
Kappa statistic 0.9307
Mean absolute error 0.0366
Root mean squared error 0.1802
Relative absolute error 7.3105 %
Root relative squared error 36.0455 %
Coverage of cases (0.95 level) 97.0297 %
Mean rel. region size (0.95 level) 51.2376 %
Total Number of Instances 202
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.980 0.050 0.952 0.980 0.966 0.931 0.982 0.973 +
0.950 0.020 0.980 0.950 0.965 0.931 0.983 0.986 -
Weighted Avg. 0.965 0.035 0.966 0.965 0.965 0.931 0.983 0.980
=== Confusion Matrix ===
a b <-- classified as
99 2 | a = +
5 96 | b = -
SVM without attributes selection
=== Summary ===
Correctly Classified Instances 129 64.5 %
Incorrectly Classified Instances 71 35.5 %
Kappa statistic 0.29
Mean absolute error 0.355
Root mean squared error 0.5958
Relative absolute error 71 %
Root relative squared error 119.1638 %
Coverage of cases (0.95 level) 64.5 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 200
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1.000 0.710 0.585 1.000 0.738 0.412 0.645 0.585 +
0.290 0.000 1.000 0.290 0.450 0.412 0.645 0.645 -
Weighted Avg. 0.645 0.355 0.792 0.645 0.594 0.412 0.645 0.615
=== Confusion Matrix ===
a b <-- classified as
100 0 | a = +
71 29 | b = -
SVM after attributes selection (IG)
=== Summary ===
Correctly Classified Instances 191 95.5 %
Incorrectly Classified Instances 9 4.5 %
Kappa statistic 0.91
Mean absolute error 0.045
Root mean squared error 0.2121
Relative absolute error 9 %
Root relative squared error 42.4264 %
Coverage of cases (0.95 level) 95.5 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 200
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.980 0.070 0.933 0.980 0.956 0.911 0.955 0.925 +
0.930 0.020 0.979 0.930 0.954 0.911 0.955 0.945 -
Weighted Avg. 0.955 0.045 0.956 0.955 0.955 0.911 0.955 0.935
=== Confusion Matrix ===
a b <-- classified as
98 2 | a = +
7 93 | b = -
SVM after using Principal components analysis (PCA)
=== Summary ===
Correctly Classified Instances 195 96.5347 %
Incorrectly Classified Instances 7 3.4653 %
Kappa statistic 0.9307
Mean absolute error 0.0347
Root mean squared error 0.1862
Relative absolute error 6.9307 %
Root relative squared error 37.2309 %
Coverage of cases (0.95 level) 96.5347 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 202
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.980 0.050 0.952 0.980 0.966 0.931 0.965 0.943 +
0.950 0.020 0.980 0.950 0.965 0.931 0.965 0.956 -
Weighted Avg. 0.965 0.035 0.966 0.965 0.965 0.931 0.965 0.949
=== Confusion Matrix ===
a b <-- classified as
99 2 | a = +
5 96 | b = -
5.5 Comparison
The following table (Table 3) summarizes the classification results of the classifiers used in this work; the results are also illustrated in Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12.
Table 3
N-Gram    Attributes                     Classifier/Lex    Accuracy    RMS         Precision    Recall     F-Measure
Uni       All Attributes                 Naive Bayes       89.50%      0.324       0.897        0.895      0.895
Uni       All Attributes                 SVM               93.00%      0.2646      0.931        0.93       0.93
Uni       Selected Attrs (Info. Gain)    Naive Bayes       89.50%      0.3242      0.902        0.895      0.895
Uni       Selected Attrs (Info. Gain)    SVM               94.50%      0.2345      0.945        0.945      0.945
Uni       Using PCA                      Naive Bayes       87.13%      0.3397      0.879        0.871      0.871
Uni       Using PCA                      SVM               92.08%      0.2814      0.927        0.921      0.921
Bi+Uni    All Attributes                 Naive Bayes       91.50%      0.2915      0.915        0.915      0.915
Bi+Uni    All Attributes                 SVM               64.50%      0.5958      0.792        0.645      0.594
Bi+Uni    Selected Attrs (Info. Gain)    Naive Bayes       90.50%      0.3082      0.912        0.905      0.905
Bi+Uni    Selected Attrs (Info. Gain)    SVM               95.50%      0.2121      0.956        0.955      0.955
Bi+Uni    Using PCA                      Naive Bayes       (96.53%)    (0.1802)    (0.966)      (0.965)    (0.965)
Bi+Uni    Using PCA                      SVM               (96.53%)    0.1862      (0.966)      (0.965)    (0.965)
Lexicon   -                              SentiWordNet      72.50%      0.5244      0.6552       0.95       0.776
Lexicon   -                              Bing Liu’s        76.00%      0.4899      0.6857       0.96       0.8
5.6 Charts
Figure 8: Accuracy for each configuration (U-A, U-S, U-P, B-A, B-S, B-P, Lex); series: Naive Bayes/SWN and SVM/Bing Liu’s
Figure 9: Root mean squared error (RMS) for each configuration; series: Naive Bayes/SWN and SVM/Bing Liu’s
Figure 10: Precision for each configuration; series: Naive Bayes/SWN and SVM/Bing Liu’s
Figure 11: Recall for each configuration; series: Naive Bayes/SWN and SVM/Bing Liu’s
Figure 12: F-Measure for each configuration; series: Naive Bayes/SWN and SVM/Bing Liu’s
Where:
U-A = Unigram - All attributes
U-S = Unigram - Selected attributes
U-P = Unigram - PCA
B-A = Bigram - All attributes
B-S = Bigram - Selected attributes
B-P = Bigram - PCA
Lex = Lexicon
5.7 UI for predictions preview
To make it easier to display the detailed predictions of each classifier/lexicon, we developed a C# application
(see Figure 13) for navigating through the product reviews. It lists the prediction of every classifier/lexicon for a
product review and compares it with the review's actual class.
The test set used in this work (200 product reviews in the cameras domain) is loaded into the application, so
we can step through each of the 200 reviews and display its actual class alongside the predictions of the
different classifiers used in this work.
Figure 13
As shown in the above figure, the displayed review has a negative actual class and is classified correctly by
the following classifiers: Naïve Bayes with all attributes on unigram or bigram feature vectors; Naïve Bayes after
attribute selection using information gain (IG) on unigram or bigram feature vectors; SVM with IG attribute
selection, with PCA attribute selection, or without attribute selection, on either unigram or bigram feature
vectors; Naïve Bayes with PCA attribute selection on bigram feature vectors; and the SentiWordNet lexicon.
However, this review was misclassified by the Naïve Bayes classifier after PCA attribute selection on the
unigram feature vector, and also by the opinion lexicon.
5.8 Application for live sentiment analysis
Since it is interesting to analyze live product reviews, we developed a Java application (see Figure 14) that
fetches all the reviews of a product from its web page and analyzes them. The product is identified by its URL on
the Amazon or GSMArena websites, or the user can type a review manually. Using the chosen lexicon
(SentiWordNet or Bing Liu's), the application analyzes the review(s) and displays how positive and/or negative
they are, along with the score for each reviews file or URL.
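The lexicon-based scoring described here can be sketched as follows. This is a minimal illustration, not the Java application's actual code: the word lists are a tiny hypothetical stand-in for SentiWordNet or Bing Liu's lexicon, and the function names and the simple hit-counting rule are assumptions.

```python
import re

# Hypothetical miniature lexicon for illustration only; the real
# SentiWordNet and Bing Liu lexicons are far larger.
POSITIVE = {"good", "great", "excellent", "sharp", "love"}
NEGATIVE = {"bad", "poor", "blurry", "terrible", "hate"}


def score_review(text):
    """Count positive and negative lexicon hits and return (pos, neg, label)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos > neg:
        label = "+"
    elif neg > pos:
        label = "-"
    else:
        label = "neutral"
    return pos, neg, label


def summarize(reviews):
    """Return the share of positive and negative reviews in a collection."""
    if not reviews:
        return {"positive": 0.0, "negative": 0.0}
    labels = [score_review(review)[2] for review in reviews]
    return {
        "positive": labels.count("+") / len(labels),
        "negative": labels.count("-") / len(labels),
    }


reviews = [
    "Great camera, sharp pictures",
    "Terrible battery and blurry photos",
]
for review in reviews:
    print(score_review(review), review)
print(summarize(reviews))
```

The shares returned by `summarize` correspond to the kind of positive/negative breakdown that a pie chart or results table would display.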
Figure 14
The input to this application can be a reviews file with the .arff extension, or the URL of a
product reviews page on Amazon.com or gsmarena.com.
The output of the application can take three forms (see Figure 15):
First, it may be exported to an external XML file.
Second, the sentiments in the document/URL page may be shown as a pie chart.
Third, the results may be shown as a table, as in the following figure.
Figure 15
Chapter 6 Conclusion
The experimental work shows that investing effort in the preprocessing phase and carefully applying
appropriate attribute extraction and attribute selection methods leads to better classification results,
even with fewer features and a lower classification cost.
In our experiments, principal component analysis (PCA) as an attribute selection method suited the text
classification task well, giving the highest classification accuracy, precision, recall and F-measure.
In general, we applied two different approaches to sentiment analysis: the opinion lexicon approach
(SentiWordNet and Bing Liu’s) and the supervised machine learning approach (Naive Bayes and SVM).
The supervised machine learning approach gave the best results, reaching 96.53% accuracy, with precision of
about 88 to 96.6% and accuracy of about 87 to 96.5% on camera and photo product reviews (the one exception
being SVM on all unigram and bigram attributes, at 64.50% accuracy), compared with the relatively low
measures given by the opinion lexicon approach.
The explanation for the poor classification results of the lexicon approaches, as mentioned before in
chapter one, is that an opinion lexicon is necessary but not sufficient for sentiment classification.
However, from our initial experience with sentiment detection, we have identified a few areas of potentially
substantial improvement in opinion lexicon classification. First, we expect that applying negation detection
would improve polarity detection with the opinion lexicon approach, and thus the analysis
results. Second, more advanced sentiment patterns currently require a fair amount of manual validation.
Although some human expert involvement may be inevitable in validation to handle the
semantics accurately, we plan further research on increasing the accuracy of the sentiment analysis.
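To illustrate what negation detection could look like on top of a lexicon, the sketch below flips the polarity of a lexicon word when a negator appears shortly before it. It is a hypothetical example, not the planned implementation: the word lists, the two-token window and the function name are all assumptions.

```python
import re

# Hypothetical miniature word lists for illustration only.
POSITIVE = {"good", "great", "sharp"}
NEGATIVE = {"bad", "poor", "blurry"}
NEGATORS = {"not", "no", "never", "isn't", "don't", "doesn't", "wasn't"}


def polarity_with_negation(text, window=2):
    """Sum lexicon polarities, flipping a word's sign if a negator
    occurs within `window` tokens before it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            value = 1
        elif token in NEGATIVE:
            value = -1
        else:
            continue
        preceding = tokens[max(0, i - window):i]
        if any(word in NEGATORS for word in preceding):
            value = -value
        score += value
    return score


for sentence in ["The pictures are good",
                 "The pictures are not good",
                 "It isn't blurry"]:
    print(polarity_with_negation(sentence), sentence)
```

Without the negation check, "The pictures are not good" would score as positive because it contains "good". A fixed window is a crude heuristic and would miss longer-range or implicit negation.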
Besides the potential improvements above, it is important to state that some issues in the opinion lexicon
classification field remain very hard for researchers to solve; some of them were discussed earlier in
chapter two, section 2.4.
References
[1] Narendra Ahuja, Ming-Hsuan Yang. \A Geometric Approach to Train Support Vector Machines" Proceedings of the 2000 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2000), pp. 430-437, vol. 1, Hilton Head Island, June, 2000.
[2] Bernhard E. Boser, Isabelle M. Guyon, Vladimir Vapnik. \A Training Algorithm for Optimal Margin Classifiers." Fifth Annual Workshop on Computational Learning Theory. ACM Press, Pittsburgh. 1992
[3] Christopher J.C. Burges, Alexander J. Smola, and Bernhard Scholkopf (editors). Advances in Kernel Methods - Support Vector Learning MIT Press, Cambridge, USA, 1999
[4] Christopher J.C. Burges. "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery 2, 121-167, 1998
[5] Nello Cristianini, John Shawe-Taylor. An Introduction to Support Vector Networks and other kernel-based learning methods. Cambridge University Pres 2000
[6] Flake, G. W., Lawrence, S. “Efficient SVM Regression Training with SMO." NEC Research Institute, (submitted to Machine Learning, special issue on Support Vector Machines). 2000
[7] Robert Freund, Federico Girosi, Edgar Osuna. “Training Support Vector Machines: an Application to Face Detection." IEEE Conference on Computer Vision and Pattern Recognition, pages 130-136, 1997a
[8] Robert Freund, Federico Girosi, Edgar Osuna. “An Improved TraininAlgorithm for Support Vector Machines." In J. Principe, L. Gile, N.Morgan, and E. Wilson, editors, Neural Networks for Signal Processing VII { Proceeding of the 1997 IEEE Workshop, pages 276-285, New York, 1997b
[9] Thorsten Joachims. "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", 1998
[10] Thorsten Joachims. "Making Large-Scale SVM Learning Practical" 1999 (Chapter 11 of (Burges, 1999)) [11] Linda Kaufman. \Solving the Quadratic Programming Problem Arising in Support Vector Classification",
1999 (Chapter 10 of (Burges, 1999)) [12] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya and K.R.K. Murthy.”A fast iterative nearest point algorithm
for support vector machine classifier design, "Technical Report TR-ISL-99-03, Intelligent Systems Lab, Dept. of Computer Science & Automation, Indian Institute of Science, 1999a.
[13] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, and K.R.K. Murthy. “Improvements to Platt's SMO algorithm for SVM classifier design." Technical report, Dept of CSA, IISc, Bangalore, India, 1999b.
[14] John C. Platt. “Fast Training of Support Vector Machines using Sequential Minimal Optimization" (Chapter 12 of (Burges, 1999))
[15] Robert Vanderbei. \Loqo: An Interior Point Code for Quadratic Programming." Technical Report SOR 94-15, Princeton University, 1994
[16] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995 [17] Vladimir Vapnik, Corinna Cortes. "Support vector networks," Machine Learning, vol. 20, pp.273-297,
1995. [18] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani “SENTIWORDNET 3.0: An Enhanced Lexical
Resource for Sentiment Analysis and Opinion Mining” [19] T. Mitchell, Machine Learning, McGraw Hill, 1997. [20] Chai, K.; H. T. Hn, H. L. Chieu; "Bayesian Online Classifiers for Text Classification and Filtering",
Proceedings of the 25th annual international ACM SIGIR conference on Research and Development in Information Retrieval, August 2002, pp 97-104
[21] DATA MINING Concepts and Techniques,Jiawei Han, Micheline Kamber Morgan Kaufman Publishers, 2003
[22] Abdi. H., & Williams, L.J. (2010). "Principal component analysis.". Wiley Interdisciplinary Reviews: Computational Statistics, 2: 433–459.
[23] ^ a b Olson, David L.; and Delen, Dursun (2008); Advanced Data Mining Techniques, Springer, 1st edition (February 1, 2008), page 138, ISBN 3-540-76916-1
[24] ^ a b c Powers, David M W (2007/2011). "Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies 2 (1): 37–63.