sentiment classification for product reviews (documentation)


2013

Supervisor:

Dr. Mohamed Farouk

Cairo University - ISSR

Sentiment Classification For Product Reviews

by

Mahmoud Mohamed Hassan

Mostafa Mohamed Ameen

Mohamed Abdelkader Hamed

Mai Mohamed Mahmoud


Table of Contents

Abstract
Chapter 1 Introduction
1.1 Motivations
Chapter 2 Sentiment Analysis
2.1 Sentiment Analysis Applications
2.2 Sentiment Analysis Research
2.3 Different Levels of Analysis
2.3.1 Document level:
2.3.2 Sentence level:
2.3.3 Entity and Aspect level:
2.4 Sentiment Lexicon and Its Issues
2.5 Natural Language Processing Issues
2.6 Opinion Spam Detection
Chapter 3 Machine Learning Approaches
3.1 Data preprocessing
3.1.1 Feature extraction
3.1.2 Feature selection (dimensionality reduction)
3.2 Classification
3.2.1 Support vector machine classification (SVM)
3.2.2 Naïve Bayes classification
3.2.3 Conditional Independence
3.2.4 Bayes Theorem
3.2.5 Maximum A Posteriori (MAP) Hypothesis
3.2.6 Maximum Likelihood (ML) Hypothesis
3.2.7 Naïve Bayesian Classification
3.3 Evaluation measures
3.3.1 Precision
3.3.2 Recall (Sensitivity)
3.3.3 Accuracy
3.3.4 F1-measure
Chapter 4 Opinion lexicons
4.1 Definition
4.2 SENTIWORDNET 3.0
4.3 (Bing Liu) Opinion lexicon
4.3.1 Who is Dr. Bing Liu?
4.3.2 (Bing Liu) Opinion lexicon
Chapter 5 Experimental results
5.1 Data collection
5.2 Feature extraction
5.2.1 Unigram:
5.2.2 Bigram:
5.3 Feature selection
5.4 Results of the classifiers
5.4.1 Unigram
5.4.2 Bigram
5.5 Comparison
5.6 Charts
5.7 UI for predictions preview
5.8 Application for live sentiment analysis
Chapter 6 Conclusion
References


Acknowledgment

Firstly, we thank Allah for helping us to complete this project.

We would like to thank our project supervisor

Dr. Mohamed Farouk

for his efforts and his great support, which helped us to improve our performance and knowledge from the very beginning. He helped us to discover the world of machine learning and made us look forward to exploring this overwhelming science further.

We also extend our thanks and respect to all the staff members who taught us during the four semesters of the diploma and the two semesters of the premaster courses.

We also thank Professor Hisham Hefny for his inspiration and support in teaching us many subjects and opening our minds to new trends and challenges in the computer science field.

Special thanks go to our parents for their endless efforts and unconditional support in all phases of our lives, especially in our education, which led us this far.

Project team


Abstract

Sentiment classification concerns the use of automatic methods for predicting the orientation of subjective content in text documents, with applications in a number of areas including recommender and advertising systems, customer intelligence and information retrieval.

This research presents the results of applying two different approaches to the problem of automatic sentiment classification of product reviews. The first approach uses opinion lexicons (SentiWordNet 3.0 (Stefano Baccianella) and Bing Liu’s opinion lexicon). The second approach uses supervised machine learning algorithms (NaiveBayes and LibSVM).

This research also attempts to present these results in a comparative manner, to help the reader understand the effect of varying the number of selected attributes, the attribute extraction method and the classification algorithm.


Chapter 1 Introduction

Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments,

evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations,

individuals, issues, events, topics, and their attributes. It represents a large problem space. There are also many

names and slightly different tasks, e.g., sentiment analysis, opinion mining, opinion extraction, sentiment

mining, subjectivity analysis, affect analysis, emotion analysis, review mining, etc. However, they are now all

under the umbrella of sentiment analysis or opinion mining. In industry, the term sentiment analysis is

more commonly used, whereas in academia both sentiment analysis and opinion mining are frequently employed.

They basically represent the same field of study. The term sentiment analysis perhaps first appeared in

(Nasukawa and Yi, 2003), and the term opinion mining first appeared in (Dave, Lawrence and Pennock, 2003).

However, the research on sentiments and opinions appeared earlier (Das and Chen, 2001; Morinaga et al., 2002;

Pang, Lee and Vaithyanathan, 2002; Tong, 2001; Turney, 2002; Wiebe, 2000).

Although linguistics and natural language processing (NLP) have a long history, little research had been done

about people’s opinions and sentiments before the year 2000. Since then, the field has become a very active

research area. There are several reasons for this. First, it has a wide range of applications, in almost every

domain. The industry surrounding sentiment analysis has also flourished due to the proliferation of commercial

applications. This provides a strong motivation for research. Second, it offers many challenging research

problems, which had never been studied before.

Third, for the first time in human history, we now have a huge volume of opinionated data in the social media on

the Web. Without this data, a lot of research would not have been possible. Not surprisingly, the inception and

the rapid growth of sentiment analysis coincide with those of the social media. In fact, sentiment analysis is now

right at the center of the social media research. Hence, research in sentiment analysis not only has an important

impact on NLP, but may also have a profound impact on management sciences, political science, economics, and

social sciences as they are all affected by people’s opinions. Although sentiment analysis research mainly

started in the early 2000s, there was some earlier work on the interpretation of metaphors, sentiment adjectives,

subjectivity, view points, and affects (Hatzivassiloglou and McKeown, 1997; Hearst, 1992; Wiebe, 1990; Wiebe,

1994; Wiebe, Bruce and O'Hara, 1999).

1.1 Motivations

With the explosive growth of social media (e.g., reviews, forum discussions, blogs, micro-blogs, Twitter,

comments, and postings in social network sites) on the Web, individuals and organizations are increasingly using

the content in these media for decision making. Nowadays, if one wants to buy a consumer product, one is no

longer limited to asking one’s friends and family for opinions because there are many user reviews and

discussions in public forums on the Web about the product. For an organization, it may no longer be necessary

to conduct surveys, opinion polls, and focus groups in order to gather public opinions because there is an

abundance of such information publicly available. However, finding and monitoring opinion sites on the Web

and distilling the information contained in them remains a formidable task because of the proliferation of

diverse sites. Each site typically contains a huge volume of opinion text that is not always easily deciphered in


long blogs and forum postings. The average human reader will have difficulty identifying relevant sites and

extracting and summarizing the opinions in them. Automated sentiment analysis systems are thus needed.

In recent years, we have witnessed that opinionated postings in social media have helped reshape businesses,

and sway public sentiments and emotions, which have profoundly impacted our social and political systems.

Such postings have also mobilized masses for political changes such as those that happened in some Arab countries

in 2011. It has thus become a necessity to collect and study opinions on the Web. Of course, opinionated

documents exist not only on the Web (called external data); many organizations also have their internal data,

e.g., customer feedback collected from emails and call centers or results from surveys conducted by the

organizations.

Due to these applications, industrial activities have flourished in recent years. Sentiment analysis applications

have spread to almost every possible domain, from consumer products, services, healthcare, and financial

services to social events and political elections.

Chapter 2 Sentiment Analysis

Opinions are central to almost all human activities because they are key influencers of our behaviors. Whenever

we need to make a decision, we want to know others’ opinions. In the real world, businesses and organizations

always want to find consumer or public opinions about their products and services. Individual consumers also

want to know the opinions of existing users of a product before purchasing it, and others’ opinions about

political candidates before making a voting decision in a political election. In the past, when an individual

needed opinions, he/she asked friends and family. When an organization or a business needed public or

consumer opinions, it conducted surveys, opinion polls, and focus groups. Acquiring public and consumer

opinions has long been a huge business itself for marketing, public relations, and political campaign companies.

2.1 Sentiment Analysis Applications

Apart from real-life applications, many application-oriented research papers have also been published. For

example, in (Liu et al., 2007), a sentiment model was proposed to predict sales performance. In (McGlohon,

Glance and Reiter, 2010), reviews were used to rank products and merchants. In (Hong and Skiena, 2010), the

relationships between the NFL betting line and public opinions in blogs and Twitter were studied. In (O'Connor

et al., 2010), Twitter sentiment was linked with public opinion polls. In (Tumasjan et al., 2010), Twitter sentiment

was also applied to predict election results. In (Chen et al., 2010), the authors studied political standpoints. In

(Yano and Smith, 2010), a method was reported for predicting comment volumes of political blogs. In (Asur and

Huberman, 2010; Joshi et al., 2010; Sadikov, Parameswaran and Venetis, 2009), Twitter data, movie reviews and

blogs were used to predict box-office revenues for movies. In (Miller et al., 2011), sentiment flow in social

networks was investigated. In (Mohammad and Yang, 2011), sentiments in mails were used to find how genders

differed on emotional axes. In (Mohammad, 2011), emotions in novels and fairy tales were tracked. In (Bollen,

Mao and Zeng, 2011), Twitter moods were used to predict the stock market. In (Bar-Haim et al., 2011; Feldman

et al., 2011), expert investors in microblogs were identified and sentiment analysis of stocks was performed. In

(Zhang and Skiena, 2010), blog and news sentiment was used to study trading strategies. In (Sakunkoo and

Sakunkoo, 2009), social influences in online book reviews were studied. In (Groh and Hauffa, 2011), sentiment


analysis was used to characterize social relations. A comprehensive sentiment analysis system and some case

studies were also reported in (Castellanos et al., 2011).

2.2 Sentiment Analysis Research

As discussed above, pervasive real-life applications are only part of the reason why sentiment analysis is a

popular research problem. It is also highly challenging as an NLP research topic, and covers many novel

sub-problems, as we will see later. Additionally, there was little research before the year 2000 in either NLP or

linguistics. Part of the reason is that before then there was little opinion text available in digital forms. Since the

year 2000, the field has grown rapidly to become one of the most active research areas in NLP. It is also widely

researched in data mining, Web mining, and information retrieval. In fact, it has spread from computer science

to management sciences (Archak, Ghose and Ipeirotis, 2007; Chen and Xie, 2008; Das and Chen, 2007;

Dellarocas, Zhang and Awad, 2007; Ghose, Ipeirotis and Sundararajan, 2007; Hu, Pavlou and Zhang, 2006; Park,

Lee and Han, 2007).

2.3 Different Levels of Analysis

In general, sentiment analysis has been investigated mainly at three levels:

2.3.1 Document level:

The task at this level is to classify whether a whole opinion document expresses a positive or negative

sentiment (Pang, Lee and Vaithyanathan, 2002; Turney, 2002). For example, given a product review, the

system determines whether the review expresses an overall positive or negative opinion about the product.

This task is commonly known as document-level sentiment classification. This level of analysis assumes that

each document expresses opinions on a single entity (e.g., a single product). Thus, it is not applicable to

documents which evaluate or compare multiple entities.

2.3.2 Sentence level:

The task at this level goes to the sentences and determines whether each sentence expresses a positive,

negative, or neutral opinion. Neutral usually means no opinion. This level of analysis is closely related to

subjectivity classification (Wiebe, Bruce and O'Hara, 1999), which distinguishes sentences (called objective

sentences) that express factual information from sentences (called subjective sentences) that express

subjective views and opinions. However, we should note that subjectivity is not equivalent to sentiment as

many objective sentences can imply opinions, e.g., “We bought the car last month and the windshield wiper

has fallen off.” Researchers have also analyzed clauses (Wilson, Wiebe and Hwa, 2004), but the clause level

is still not enough, e.g., “Apple is doing very well in this lousy economy.”

2.3.3 Entity and Aspect level:

Neither the document-level nor the sentence-level analysis discovers what exactly people liked and did

not like. Aspect level performs finer-grained analysis. Aspect level was earlier called feature level (feature-

based opinion mining and summarization) (Hu and Liu, 2004). Instead of looking at language constructs

(documents, paragraphs, sentences, clauses or phrases), aspect level directly looks at the opinion itself. It is

based on the idea that an opinion consists of a sentiment (positive or negative) and a target (of opinion). An


opinion without its target being identified is of limited use. Realizing the importance of opinion targets also

helps us understand the sentiment analysis problem better. For example, although the sentence “although

the service is not that great, I still love this restaurant” clearly has a positive tone, we cannot say that this

sentence is entirely positive. In fact, the sentence is positive about the restaurant (emphasized), but

negative about its service (not emphasized). In many applications, opinion targets are described by entities

and/or their different aspects. Thus, the goal of this level of analysis is to discover sentiments on entities

and/or their aspects. For example, the sentence “The iPhone’s call quality is good, but its battery life is

short” evaluates two aspects, call quality and battery life, of iPhone (entity). The sentiment on iPhone’s call

quality is positive, but the sentiment on its battery life is negative. The call quality and battery life of iPhone

are the opinion targets. Based on this level of analysis, a structured summary of opinions about entities and

their aspects can be produced, which turns unstructured text to structured data and can be used for all

kinds of qualitative and quantitative analyses. Both the document level and sentence level classifications are

already highly challenging. The aspect-level is even more difficult. It consists of several sub-problems.
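The idea of an opinion as a sentiment plus a target, and of a structured summary built from such opinions, can be sketched in Python as follows. This is only an illustration: the `Opinion` class and the `summarize` function are hypothetical names, and the opinions are labelled by hand from the iPhone example above rather than extracted automatically.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    entity: str     # the thing being evaluated, e.g. "iPhone"
    aspect: str     # the part or attribute of the entity
    sentiment: str  # "positive" or "negative"


# Hand-labelled opinions from the example sentence (an assumption for
# illustration; extracting them from raw text is the hard part).
opinions = [
    Opinion("iPhone", "call quality", "positive"),
    Opinion("iPhone", "battery life", "negative"),
]


def summarize(opinions):
    """Count how often each (entity, aspect, sentiment) triple occurs."""
    summary = Counter()
    for op in opinions:
        summary[(op.entity, op.aspect, op.sentiment)] += 1
    return summary


for (entity, aspect, sentiment), count in summarize(opinions).items():
    print(f"{entity} / {aspect}: {sentiment} x{count}")
```

Once opinions are stored as such triples, counting and grouping them is straightforward, which is what makes quantitative analysis of unstructured reviews possible.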

To make things even more interesting and challenging, there are two types of opinions, i.e., regular opinions and

comparative opinions (Jindal and Liu, 2006b). A regular opinion expresses a sentiment only on a particular

entity or an aspect of the entity, e.g., “Coke tastes very good,” which expresses a positive sentiment on the

aspect taste of Coke. A comparative opinion compares multiple entities based on some of their shared aspects,

e.g., “Coke tastes better than Pepsi,” which compares Coke and Pepsi based on their tastes (an aspect) and

expresses a preference for Coke.

2.4 Sentiment Lexicon and Its Issues

Not surprisingly, the most important indicators of sentiments are sentiment words, also called opinion words.

These are words that are commonly used to express positive or negative sentiments. For example, good,

wonderful, and amazing are positive sentiment words, and bad, poor, and terrible are negative sentiment

words. Apart from individual words, there are also phrases and idioms, e.g., cost someone an arm and a leg.

Sentiment words and phrases are instrumental to sentiment analysis for obvious reasons. A list of such words

and phrases is called a sentiment lexicon (or opinion lexicon). Over the years, researchers have designed

numerous algorithms to compile such lexicons.
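As a minimal sketch of how a sentiment lexicon might be applied, the following Python snippet counts positive and negative words in a piece of text. The tiny word lists are taken from the examples above; the function names `lexicon_score` and `classify` and the neutral tie-breaking rule are illustrative assumptions, not the method used in this project.

```python
import re

# Toy lexicon built from the example words in the text (an assumption;
# real lexicons contain thousands of entries).
POSITIVE = {"good", "wonderful", "amazing"}
NEGATIVE = {"bad", "poor", "terrible"}


def lexicon_score(text):
    """Return (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z]+", text.lower())
    positive = sum(1 for w in words if w in POSITIVE)
    negative = sum(1 for w in words if w in NEGATIVE)
    return positive - negative


def classify(text):
    """Label text by the sign of its lexicon score; zero counts as neutral."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("The screen is amazing and the sound is good"))
print(classify("Terrible battery and poor build quality"))
```

A count like this looks only at individual words and ignores the context in which they appear.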

Although sentiment words and phrases are important for sentiment analysis, only using them is far from

sufficient. The problem is much more complex. In other words, we can say that a sentiment lexicon is necessary

but not sufficient for sentiment analysis. Below, we highlight several issues:

1. A positive or negative sentiment word may have opposite orientations in different application domains.

For example, “suck” usually indicates negative sentiment, e.g., “This camera sucks,” but it can also imply

positive sentiment, e.g., “This vacuum cleaner really sucks.”

2. A sentence containing sentiment words may not express any sentiment. This phenomenon happens

frequently in several types of sentences. Question (interrogative) sentences and conditional sentences

are two important types, e.g., “Can you tell me which Sony camera is good?” and “If I can find a good

camera in the shop, I will buy it.” Both these sentences contain the sentiment word “good”, but neither

expresses a positive or negative opinion on any specific camera. However, not all conditional sentences


or interrogative sentences express no sentiments, e.g., “Does anyone know how to repair this terrible

printer” and “If you are looking for a good car, get Toyota Camry.”

3. Sarcastic sentences with or without sentiment words are hard to deal with, e.g., “What a great car! It

stopped working in two days.” Sarcasm is not so common in consumer reviews about products and

services, but is very common in political discussions, which makes political opinions hard to deal with.

4. Many sentences without sentiment words can also imply opinions. Many of these sentences are actually

objective sentences that are used to express some factual information. Again, there are many types of

such sentences. Here we just give two examples. The sentence “This washer uses a lot of water” implies

a negative sentiment about the washer since it uses a lot of a resource (water). The sentence “After

sleeping on the mattress for two days, a valley has formed in the middle” expresses a negative opinion

about the mattress. This sentence is objective as it states a fact. All these sentences have no sentiment

words.

2.5 Natural Language Processing Issues

Finally, we must not forget that sentiment analysis is an NLP problem. It touches every aspect of NLP, e.g.,

coreference resolution, negation handling, and word sense disambiguation, which add more difficulties since these

are not solved problems in NLP. However, it is also useful to realize that sentiment analysis is a highly restricted

NLP problem because the system does not need to fully understand the semantics of each sentence or

document but only needs to understand some aspects of it, i.e., positive or negative sentiments and their target

entities or topics. In this sense, sentiment analysis offers a great platform for NLP researchers to make tangible

progress on all fronts of NLP with the potential of making a huge practical impact.

2.6 Opinion Spam Detection

A key feature of social media is that it enables anyone from anywhere in the world to freely express his/her

views and opinions without disclosing his/her true identity and without the fear of undesirable consequences.

These opinions are thus highly valuable. However, this anonymity also comes with a price. It allows people with

hidden agendas or malicious intentions to easily game the system to give people the impression that they are

independent members of the public and post fake opinions to promote or to discredit target products, services,

organizations, or individuals without disclosing their true intentions, or the person or organization that they are

secretly working for. Such individuals are called opinion spammers and their activities are called opinion

spamming (Jindal and Liu, 2008; Jindal and Liu, 2007).

Opinion spamming has become a major issue. Apart from individuals who give fake opinions in reviews and

forum discussions, there are also commercial companies that are in the business of writing fake reviews and

bogus blogs for their clients. Several high profile cases of fake reviews have been reported in the news. It is

important to detect such spamming activities to ensure that the opinions on the Web are a trusted source of

valuable information. Unlike extraction of positive and negative opinions, opinion spam detection is not just an

NLP problem as it involves the analysis of people’s posting behaviors. It is thus also a data mining problem.


Chapter 3 Machine Learning Approaches

3.1 Data preprocessing

Sentiment classification, or opinion classification, is first and foremost a text classification problem. The

performance of a text classification task is directly affected by the representation of the data. Once features are

appropriately selected, even simple classifiers may produce good classification results.

Several feature selection and extraction methods have been proposed in the literature. Feature selection

merely selects a good subset of the original features, whereas feature extraction allows for arbitrary new

features based on the original ones.

3.1.1 Feature extraction

The most commonly used features to represent words, Term Frequency (TF) and Inverse Document

Frequency (IDF), may not always be appropriate.

Choosing an appropriate representation of words in text documents is crucial to obtaining good

classification performance. Researchers have used different representations to maximize the accuracy of

machine learning algorithms. The “bag of words” representation is widely used to represent text documents.

In this representation, a document is considered to be an unordered collection of words, so the

position of words in the document bears no importance. “Bag of words” is the simplest representation of

textual data. The number of occurrences of each word in the document is represented by term frequency

(TF) which is a document specific measure of importance of a term. The collection of documents under

consideration is called a corpus. The importance of a term in a document is measured by its weight in the

document. A number of term weighting techniques have been proposed in literature. In the vector space

model [2], a document is represented by a document vector whose components are term weights. A

document using term frequency as term weights can be represented in vector form as $(tf_1, tf_2, tf_3, \ldots, tf_n)$,

where $tf_i$ is the term frequency of the $i$-th term and $n$ is the total number of terms in the document.

Lengths of documents in a corpus may vary and longer documents usually have higher term frequencies and

more unique terms than shorter documents [18].
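The term-frequency vector described above can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions: tokens are obtained by lowercasing and splitting on whitespace, the vocabulary is the sorted set of distinct terms in a two-document toy corpus, and the function names are hypothetical. Dividing by document length is shown as one common way to offset the length effect just mentioned, not as the weighting used in this project.

```python
from collections import Counter


def term_frequencies(document):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(document.lower().split())


def tf_vector(document, vocabulary):
    """Map a document to the vector (tf_1, ..., tf_n) over a fixed vocabulary."""
    counts = term_frequencies(document)
    return [counts[term] for term in vocabulary]


def normalized_tf_vector(document, vocabulary):
    """Divide each count by the number of in-vocabulary tokens."""
    vector = tf_vector(document, vocabulary)
    total = sum(vector)
    return [count / total if total else 0.0 for count in vector]


# Toy corpus (made up for illustration).
corpus = ["good camera good price", "bad battery"]

# Vocabulary: the sorted distinct terms of the whole corpus.
vocabulary = sorted({term for doc in corpus for term in doc.lower().split()})

print(vocabulary)
for doc in corpus:
    print(tf_vector(doc, vocabulary), normalized_tf_vector(doc, vocabulary))
```

Because the vector only stores counts per vocabulary term, word order is discarded, which is exactly the bag-of-words assumption.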

3.1.1.1 Text Feature Generators

Before we address the question of how to discard words, we must first determine what shall count as a

word. For example, is ‘HP-UX’ one word, or is it two words? What about ‘650-857-1501’? When it comes to

programming, a simple solution is to take any contiguous sequence of alphabetic characters; or

alphanumeric characters to include identifiers such as ‘ioctl32’, which may sometimes be useful. By using

the Posix regular expression \p{L&}+ we avoid breaking ‘naïve’ in two, as well as many accented words in

French, German, etc. But what about ‘win 32’, ‘can’t’ or words that may be hyphenated over a line break?

Like most data cleaning endeavors, the list of exceptions is endless, and one must simply draw a line

somewhere and hope for an 80%-20% tradeoff. Fortunately, semantic errors in word parsing are usually only

seen by the core learning algorithm, and it is their statistical properties that matter, not their readability or

intuitiveness to people. Our purpose is to offer a range of feature generators so that the feature selector


may discover the strongly predictive features. The most beneficial feature generators will vary according to

the characteristics of the domain text.
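The two simple tokenization rules discussed above (alphabetic sequences, or alphanumeric sequences that keep identifiers) can be sketched in Python as follows. Python's built-in `re` module does not support `\p{...}` property classes, so the character class `[^\W\d_]` is used here as an approximate stand-in for "any Unicode letter"; the pattern names and the `tokenize` function are illustrative assumptions.

```python
import re

# Approximation of "one or more Unicode letters": word characters that are
# neither digits nor the underscore.
LETTERS = re.compile(r"[^\W\d_]+")

# A letter followed by any letters or digits, to keep identifiers
# such as 'ioctl32' in one piece.
ALPHANUMERIC = re.compile(r"[^\W\d_][^\W_]*")


def tokenize(text, pattern=LETTERS):
    """Return all non-overlapping matches of the pattern, in order."""
    return pattern.findall(text)


sample = "naïve HP-UX ioctl32 650-857-1501"
print(tokenize(sample))                # letters only
print(tokenize(sample, ALPHANUMERIC))  # letters followed by digits allowed
```

Under either rule ‘HP-UX’ is split in two and the phone number is discarded entirely, which illustrates why the choice of feature generator depends on the domain text.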

Word Merging

One method of reducing the size of the feature space somewhat is to merge word variants together, and

treat them as a single feature. More importantly, this can also improve the predictive value of some

features. Forcing all letters to lowercase is a nearly ubiquitous practice. It normalizes for capitalization at the

beginning of a sentence, which does not otherwise affect the word’s meaning, and helps reduce the

dispersion issue mentioned in the introduction. For proper nouns, it occasionally conflates other word

meanings, e.g. ‘Bush’ or ‘LaTeX.’ Likewise, various word stemming algorithms can be used to merge multiple

related word forms. For example, ‘cat,’ ‘cats,’ ‘catlike’ and ‘catty’ may all be merged into a common feature.

Stemming typically benefits recall but at a cost of precision. If one is searching for ‘catty’ and the word is

treated the same as ‘cat,’ then a certain amount of precision is necessarily lost. For extremely skewed class

distributions, this loss may be unsupportable.

Stemming algorithms make both over-stemming errors and under-stemming errors, but again, the

semantics are less important than the feature’s statistical properties.
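The merging idea can be illustrated with a deliberately crude sketch. This is not a real stemming algorithm such as Porter's; the suffix list and the function name `merge_variant` are assumptions made only for this example.

```python
# Hypothetical, tiny suffix list used only for illustration.
SUFFIXES = ("like", "ty", "s")


def merge_variant(word):
    """Lowercase a word and strip at most one known suffix (toy stemmer)."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not destroyed.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


merged = {merge_variant(w) for w in ["cat", "Cats", "catlike", "catty"]}
over_stemmed = merge_variant("party")  # an over-stemming error of this toy rule
```

All four variants collapse into one feature, while ‘party’ is wrongly cut down as well, which shows the kind of over-stemming error mentioned above.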

Word Phrases

Whereas merging related words together can produce features with more frequent occurrence (typically

with greater recall and lower precision), identifying multiple word phrases as a single term can produce

rarer, highly specific features (which typically aid precision and have lower recall), e.g. ‘John Denver’ or ‘user

interface.’ Rather than require a dictionary of phrases as above, a simple approach is to treat all consecutive

pairs of words as a phrase term, and let feature selection determine which are useful for prediction.

This can be extended for phrases of three or more words with occasionally more specificity, but with strictly

decreasing frequency. Most of the benefit is obtained by two-word phrases. This is in part because portions

of the phrase may already have the same statistical properties, e.g. the four word phrase ‘United States of

America’ is covered already by the two-word phrase ‘United States.’ In addition, the reach of a two-word

phrase can be extended by eliminating common stop words, e.g. ‘head of the household’ becomes ‘head

household.’ Stop word lists are language specific, unfortunately. Their primary benefit to classification is in

extending the reach of phrases, rather than eliminating commonly useless words, which most feature

selection methods can already remove in a language-independent fashion.
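The two-word phrase generator with stop word removal described above could be sketched as follows. The stop word list is a hypothetical, very short English list, and `phrase_features` is a name introduced for this example.

```python
# Hypothetical, minimal English stop word list.
STOP_WORDS = {"of", "the", "a", "an", "and"}


def phrase_features(tokens, stop_words=STOP_WORDS):
    """Drop stop words, then emit every pair of consecutive remaining words."""
    kept = [t.lower() for t in tokens if t.lower() not in stop_words]
    return [f"{a} {b}" for a, b in zip(kept, kept[1:])]


household = phrase_features("head of the household".split())
country = phrase_features(["United", "States", "of", "America"])
```

Feature selection would then decide which of the generated pairs are actually predictive.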

Character N-grams

The word identification methods above fail in some situations, and can miss some good

opportunities for features. For example, languages such as Chinese and Japanese do not use a space

character. Segmenting such text into words is complex, whereas nearly equivalent accuracy may be

obtained by simply using every pair of adjacent Unicode characters as features (character n-grams). Certainly many of

the combinations will be meaningless, but feature selection can identify the most predictive ones. For

languages that use the Latin character set, 3-grams or 6-grams may be appropriate. For example, n-grams

would capture the essence of common technical text patterns such as ‘HP-UX 11.0’, ‘while (<>) ,’, ‘#!/bin/’,

and ‘ :)’. Phrases of two adjacent n-grams simply correspond to (2n)-grams. Note that while the number of

potential n-grams grows exponentially with n, in practice only a small fraction of the possibilities occur in

actual training examples, and only a fraction of those will be found predictive.
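A character n-gram generator needs only a sliding window over the string. The sketch below is one possible form; the function names are assumptions for this illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (empty if too short)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_features(text, n):
    """Count how often each character n-gram occurs in `text`."""
    return Counter(char_ngrams(text, n))


trigrams = char_ngrams("HP-UX", 3)
counts = ngram_features("aaaa", 2)
```

Only n-grams that actually occur are stored, which is why the exponential number of potential n-grams is not a problem in practice.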

Multi-Field Records

Although most research deals with training cases as a single string, many applications have multiple text

(and non-text) fields associated with each record. In document management, these may be title, author,

abstract, keywords, body, and references. In technical support, they may be title, product, keywords,

engineer, customer, symptoms, problem description, and solution. Multi-field records are common in

applications, even though the bulk of text classification research treats only a single string. Furthermore,

when classifying long strings, e.g. arbitrary file contents, the first few kilobytes may be treated as a separate

field and may often prove sufficient for generating adequate features, avoiding the overhead of processing

huge files, such as tar or zip archives.

Feature Values

Once a decision has been made about what to consider as a feature term, the meaning of the numerical

feature must be determined. For some purposes, a binary value is sufficient, indicating whether the term

appears at all.

This representation is used by the Bernoulli formulation of the Naive Bayes classifier. Many other classifiers

use the term frequency tf(t,k) (the word count in document k) directly as the feature value.
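The two feature values mentioned above, binary presence and raw term frequency, can be computed from the same token list. This is a minimal sketch with hypothetical function names.

```python
from collections import Counter


def term_frequency(tokens):
    """tf(t, k): number of times each term t occurs in document k."""
    return Counter(tokens)


def binary_features(tokens):
    """1 for every term that appears at least once (Bernoulli-style value)."""
    return {t: 1 for t in set(tokens)}


doc = "good good bad".split()
tf = term_frequency(doc)
present = binary_features(doc)
```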

3.1.2 Feature selection (dimensionality reduction)

The total number of features in a corpus of text documents is the number of unique words present in all

documents. Word sharing is reduced in documents belonging to different categories thus producing a large

number of unique words in the whole corpus. High dimensionality is thus inherent to text classification.

Researchers have been trying to filter out terms which are not important for classification or are redundant.

Techniques used for filtering terms in text classification are based on the assumption that very rare and very

common terms do not help in discriminating documents of different categories. Very common terms

occurring in all documents are treated as stop words and are removed. Rarely occurring terms, appearing in

only 2 or 3 documents, are also not considered.
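The rare/common filtering rule above amounts to thresholding each term's document frequency. The sketch below assumes two tunable thresholds, `min_df` and `max_df_ratio`; both names and their default values are choices made for this example, not values taken from the experiments in this document.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=4, max_df_ratio=1.0):
    """Keep terms found in at least `min_df` documents and in fewer than
    `max_df_ratio` of all documents. `docs` is a list of token lists."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    return {t for t, c in df.items() if c >= min_df and c < max_df_ratio * n_docs}


docs = [["the", "good"], ["the", "bad"], ["the", "good"], ["the", "good", "rare"]]
kept = filter_by_document_frequency(docs, min_df=2, max_df_ratio=1.0)
```

In this toy corpus ‘the’ occurs in every document and is dropped as a stop word, while ‘bad’ and ‘rare’ are dropped as too rare.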

One of the problems with high-dimensional datasets is that, in many cases, not all the measured variables

are important for understanding the underlying phenomena of interest. While certain computationally

expensive novel methods can construct predictive models with high accuracy from high-dimensional data, it

is still of interest in many applications to reduce the dimension of the original data prior to any modeling of

the data. Feature selection is a method that can reduce both the size of the data and the computational complexity.

The reduced dataset is also more efficient to process, and the procedure helps identify useful feature subsets.

There are four main steps in a feature selection method (see Figure 1):

Generation = select feature subset candidate.

Evaluation = compute relevancy value of the subset.

Stopping criterion = determine whether subset is relevant.

Validation = verify subset validity.

3.1.2.1 Information Gain (IG)

Entropy and Information Gain

The entropy (very common in Information Theory) characterizes the impurity of an arbitrary collection of

examples.

Information Gain is the expected reduction in entropy caused by partitioning the examples according to a given

attribute.

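Entropy(S) = $-\sum_{i=1}^{c} p_i \log_2 p_i$, where $p_i$ is the proportion of examples in the collection S that belong to class i and c is the number of classes.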

Figure 1
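The definitions of entropy and information gain given above can be made concrete with a short sketch. It assumes the standard base-2 entropy over class proportions and a discrete-valued attribute; the function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Impurity of a collection of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Expected reduction in entropy after partitioning by the attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


labels = ["pos", "pos", "neg", "neg"]
gain_perfect = information_gain([1, 1, 0, 0], labels)  # term splits classes exactly
gain_useless = information_gain([1, 0, 1, 0], labels)  # term is unrelated to class
```

A term whose presence separates the classes perfectly recovers the full one bit of entropy, while an unrelated term gains nothing.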

3.1.2.2 Principal Component Analysis (PCA)

Principal component analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to

convert a set of observations of possibly correlated variables into a set of values of linearly

uncorrelated variables called principal components. The number of principal components is less than or equal

to the number of original variables. This transformation is defined in such a way that the first principal

component has the largest possible variance (that is, accounts for as much of the variability in the data as

possible), and each succeeding component in turn has the highest variance possible under the constraint that it

be orthogonal to (i.e., uncorrelated with) the preceding components. Principal components are guaranteed to

be independent only if the data set is jointly normally distributed.

PCA was invented in 1901 by Karl Pearson,[1] as an analogue of the principal axes theorem in mechanics; it was

later independently developed (and named) by Harold Hotelling in the 1930s.[2] The method is mostly used as a

tool in exploratory data analysis and for making predictive models. PCA can be done by eigenvalue

decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix,

usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute.[3] The results

of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed

variable values corresponding to a particular data point), and loadings (the weight by which each standardized

original variable should be multiplied to get the component score).

PCA is mathematically defined[7] as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

Consider a data matrix, X, with zero empirical mean (the empirical (sample) mean of the distribution has been subtracted from the data set), where each of the n rows represents a different repetition of the experiment, and each of the p columns gives a particular kind of datum (say, the results from a particular probe).

Mathematically, the transformation is defined by a set of p-dimensional vectors of weights or loadings $w_{(k)} = (w_1, \dots, w_p)_{(k)}$ that map each row vector $x_{(i)}$ of X to a new vector of principal component scores $t_{(i)} = (t_1, \dots, t_p)_{(i)}$, given by

$$t_{k(i)} = x_{(i)} \cdot w_{(k)}$$

in such a way that the individual variables of t considered over the data set successively inherit the maximum possible variance from x, with each loading vector w constrained to be a unit vector.

3.1.2.2.1 First component

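The first loading vector $w_{(1)}$ thus has to satisfy

$$w_{(1)} = \arg\max_{\|w\|=1} \left\{ \sum_i \left(t_1\right)_{(i)}^2 \right\} = \arg\max_{\|w\|=1} \left\{ \sum_i \left(x_{(i)} \cdot w\right)^2 \right\}$$

Equivalently, writing this in matrix form gives

$$w_{(1)} = \arg\max_{\|w\|=1} \left\{ \|Xw\|^2 \right\} = \arg\max_{\|w\|=1} \left\{ w^T X^T X w \right\}$$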

Since $w_{(1)}$ has been defined to be a unit vector, it equally must also satisfy

$$w_{(1)} = \arg\max \left\{ \frac{w^T X^T X w}{w^T w} \right\}$$

The quantity to be maximized can be recognized as a Rayleigh quotient. A standard result for a symmetric matrix such as $X^T X$ is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector.

With $w_{(1)}$ found, the first component of a data vector $x_{(i)}$ can then be given as a score $t_{1(i)} = x_{(i)} \cdot w_{(1)}$ in the transformed co-ordinates, or as the corresponding vector in the original variables, $\{x_{(i)} \cdot w_{(1)}\}\, w_{(1)}$.
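Because the first loading vector is the leading eigenvector of the (mean-centred) matrix product X-transpose times X, it can be approximated without any linear algebra library by power iteration. The sketch below is an illustration only: power iteration is a technique swapped in for this example rather than the method used in this project, the function name and the toy data are hypothetical, and the starting vector is assumed not to be orthogonal to the leading eigenvector.

```python
def first_principal_component(rows, iterations=200):
    """Approximate the first loading vector and scores by power iteration."""
    n, p = len(rows), len(rows[0])
    # Centre each column so the data matrix has zero empirical mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    X = [[r[j] - means[j] for j in range(p)] for r in rows]
    # S = X^T X, a symmetric p-by-p matrix.
    S = [[sum(x[a] * x[b] for x in X) for b in range(p)] for a in range(p)]
    w = [1.0] * p  # assumed not orthogonal to the leading eigenvector
    for _ in range(iterations):
        v = [sum(S[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = sum(c * c for c in v) ** 0.5
        w = [c / norm for c in v]  # keep w a unit vector
    # Score of each data vector: its dot product with the loading vector.
    scores = [sum(x[j] * w[j] for j in range(p)) for x in X]
    return w, scores


data = [[2.0, 1.9], [1.0, 1.1], [0.0, 0.0], [-1.0, -0.9], [-2.0, -2.1]]
w1, t1 = first_principal_component(data)
```

For these points, which lie close to the line y = x, the loading vector comes out close to the unit vector along that diagonal.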

3.1.2.2.2 Further components

The kth component can be found by subtracting the first k-1 principal components from X:

$$\hat{X}_k = X - \sum_{s=1}^{k-1} X w_{(s)} w_{(s)}^T$$

and then finding the loading vector which extracts the maximum variance from this new data matrix

$$w_{(k)} = \arg\max_{\|w\|=1} \left\{ \|\hat{X}_k w\|^2 \right\} = \arg\max \left\{ \frac{w^T \hat{X}_k^T \hat{X}_k w}{w^T w} \right\}$$

It turns out that this gives the remaining eigenvectors of $X^T X$, with the maximum values for the quantity in brackets given by their corresponding eigenvalues.

The kth principal component of a data vector $x_{(i)}$ can therefore be given as a score $t_{k(i)} = x_{(i)} \cdot w_{(k)}$ in the transformed co-ordinates, or as the corresponding vector in the space of the original variables, $\{x_{(i)} \cdot w_{(k)}\}\, w_{(k)}$, where $w_{(k)}$ is the kth eigenvector of $X^T X$.

The full principal components decomposition of X can therefore be given as

$$T = XW$$

where W is a p-by-p matrix whose columns are the eigenvectors of $X^T X$.

3.1.2.2.3 Covariance

$X^T X$ itself can be recognized as proportional to the empirical sample covariance matrix of the dataset X.

The sample covariance Q between two of the different principal components over the dataset is given by

$$\begin{aligned} Q(PC_{(j)}, PC_{(k)}) &\propto (X w_{(j)})^T (X w_{(k)}) \\ &= w_{(j)}^T X^T X w_{(k)} \\ &= w_{(j)}^T \lambda_{(k)} w_{(k)} \\ &= \lambda_{(k)} w_{(j)}^T w_{(k)} \end{aligned}$$

where the eigenvector property of $w_{(k)}$ has been used to move from line 2 to line 3. However eigenvectors $w_{(j)}$ and $w_{(k)}$ corresponding to eigenvalues of a symmetric matrix are orthogonal (if the eigenvalues are different), or can be orthogonalized (if the vectors happen to share an equal repeated value). The product in

the final line is therefore zero; there is no sample covariance between different principal components over the dataset.

Another way to characterize the principal components transformation is therefore as the transformation to coordinates which diagonalize the empirical sample covariance matrix.

In matrix form, the empirical covariance matrix for the original variables can be written

$$Q \propto X^T X = W \Lambda W^T$$

The empirical covariance matrix between the principal components becomes

$$W^T Q W \propto W^T W \Lambda W^T W = \Lambda$$

where Λ is the diagonal matrix of eigenvalues $\lambda_{(k)}$ of $X^T X$ ($\lambda_{(k)}$ being equal to the sum of the squares over the dataset associated with each component k: $\lambda_{(k)} = \sum_i t_{k(i)}^2 = \sum_i (x_{(i)} \cdot w_{(k)})^2$).

3.1.2.2.4 Dimensionality reduction

The faithful transformation T = X W maps a data vector $x_{(i)}$ from an original space of p variables to a new space of p variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first L principal components, produced by using only the first L loading vectors, gives the truncated transformation

$$T_L = X W_L$$

where $W_L$ contains the first L columns of W, so the matrix $T_L$ now has n rows but only L columns. By construction, of all the transformed data matrices with only L columns, this score matrix maximizes the variance in the original data that has been preserved, while minimizing the total squared reconstruction error $\|T W^T - T_L W_L^T\|_2^2$.

Such dimensionality reduction can be a very useful step for visualizing and processing high-dimensional datasets, while still retaining as much of the variance in the dataset as possible. For example, selecting L=2 and keeping only the first two principal components finds the two-dimensional plane through the high-dimensional dataset in which the data is most spread out, so if the data contains clusters these too may be most spread out, and therefore most visible to be plotted out in a two-dimensional diagram; whereas if two directions through the data (or two of the original variables) are chosen at random, the clusters may be much less spread apart from each other, and may in fact be much more likely to substantially overlay each other, making them indistinguishable.

Similarly, in regression analysis, the larger the number of explanatory variables allowed, the greater is the chance of overfitting the model, producing conclusions that fail to generalize to other datasets. One approach, especially when there are strong correlations between different possible explanatory variables, is to reduce them to a few principal components and then run the regression against them, a method called principal component regression.

Dimensionality reduction may also be appropriate when the variables in a dataset are noisy. If each column of the dataset contains independent identically distributed Gaussian noise, then the columns of T will also contain similarly identically distributed Gaussian noise (such a distribution is invariant under the effects of the matrix W, which can be thought of as a high-dimensional rotation of the co-ordinate axes). However, with more of the total variance concentrated in the first few principal components compared to the same noise variance, the proportionate effect of the noise is less; the first components achieve a higher signal-to-noise ratio. PCA thus can have the effect of concentrating much of the signal into the first few principal components, which can

usefully be captured by dimensionality reduction; while the later principal components may be dominated by noise, and so disposed of without great loss.

3.2 Classification

3.2.1 Support vector machine classification (SVM)

3.2.1.1 Description

Support Vector Machines (SVM's) are a relatively new learning method used for binary classification. The

basic idea is to find a hyper plane which separates the d-dimensional data perfectly into its two classes.

However, since example data is often not linearly separable, SVM's introduce the notion of a “kernel

induced feature space” which casts the data into a higher dimensional space where the data is separable.

Typically, casting into such a space would cause problems computationally, and with overfitting. The key

insight used in SVM's is that the higher-dimensional space doesn't need to be dealt with directly (as it turns

out, only the formula for the dot-product in that space is needed), which eliminates the above concerns.

Furthermore, the VC-dimension (a measure of a system's likelihood to perform well on unseen data) of

SVM's can be explicitly calculated, unlike other learning methods like neural networks, for which there is no

such measure. Overall, SVM's are intuitive, theoretically well-founded, and have been shown to be practically

successful. SVM's have also been extended to solve regression tasks (where the system is trained to output

a numerical value, rather than a “yes/no” classification).

3.2.1.2 History

Support Vector Machines were introduced by Vladimir Vapnik and colleagues. The earliest mention was in

(Vapnik, 1979), but the first main paper seems to be (Vapnik, 1995).

3.2.1.3 Mathematics

We are given l training examples $\{x_i, y_i\}$, $i = 1, \dots, l$, where each example has d inputs ($x_i \in \mathbb{R}^d$), and a class

label with one of two values ($y_i \in \{-1, 1\}$). Now, all hyper planes in $\mathbb{R}^d$ are parameterized by a vector (w),

and a constant (b), expressed in the equation

w.x + b = 0 (1)

(Recall that w is in fact the vector orthogonal to the hyper plane.) Given such a hyper plane (w, b) that

separates the data, this gives the function

f(x) = sign(w. x + b) (2)

This correctly classifies the training data (and hopefully other “testing” data it hasn’t seen yet). However, a

given hyper plane represented by (w, b) is equally expressed by all pairs $(\lambda w, \lambda b)$ for $\lambda \in \mathbb{R}^+$. So we define

the canonical hyper plane to be that which separates the data from the hyper plane by a “distance” of at

least 1. That is, we consider those that satisfy:

$x_i \cdot w + b \ge +1$ when $y_i = +1$ (3)

$x_i \cdot w + b \le -1$ when $y_i = -1$ (4)

or more compactly:

$y_i (x_i \cdot w + b) \ge 1 \quad \forall i$ (5)

All such hyper planes have a “functional distance” ≥ 1 (quite literally, the function's value is ≥ 1). This shouldn't

be confused with the “geometric” or “Euclidean distance” (also known as the margin). For a given hyper plane

(w, b), all pairs $(\lambda w, \lambda b)$ define the exact same hyper plane, but each has a different functional distance to a

given data point. To obtain the geometric distance from the hyper plane to a data point, we must normalize by

the magnitude of w. This distance is simply:

$d\big((w, b), x_i\big) = \frac{y_i (x_i \cdot w + b)}{\|w\|} \ge \frac{1}{\|w\|}$ (6)

Intuitively, we want the hyper plane that maximizes the geometric distance to the closest data points. (See

figure 1.)

Figure 1: Choosing the hyper plane that maximizes the margin.
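The difference between the functional and the geometric distance can be checked numerically. The sketch below assumes the usual form of the geometric distance, the functional distance divided by the norm of w; the function names and the example hyper plane are hypothetical, and a point lying exactly on the hyper plane is arbitrarily assigned to the positive class.

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def classify(w, b, x):
    """f(x) = sign(w.x + b); a value of exactly 0 is treated as +1 here."""
    return 1 if dot(w, x) + b >= 0 else -1


def functional_distance(w, b, x, y):
    return y * (dot(w, x) + b)


def geometric_distance(w, b, x, y):
    # Normalising by the magnitude of w removes the dependence on scaling.
    return functional_distance(w, b, x, y) / sqrt(dot(w, w))


x, y = (1.0, 1.0), 1
w, b = (0.5, 0.5), 0.0
w2, b2 = (1.0, 1.0), 0.0  # the same hyper plane, scaled by 2
```

Scaling (w, b) doubles the functional distance of the point but leaves its geometric distance unchanged, which is why the canonical hyper plane has to be fixed before the margin can be compared.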

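From the equation we see this is accomplished by minimizing $\|w\|$ (subject to the distance constraints). The

main method of doing this is with Lagrange multipliers. (See (Vapnik, 1995), or (Burges, 1998) for derivation

details.) The problem is eventually transformed into:

minimize: $W(\alpha) = -\sum_{i=1}^{l} \alpha_i + \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j)$

subject to: $\sum_{i=1}^{l} y_i \alpha_i = 0$ and $0 \le \alpha_i \le C \quad (\forall i)$

where α is the vector of l non-negative Lagrange multipliers to be determined, and C is a constant (to be

explained later). We can define the matrix $(H)_{ij} = y_i y_j (x_i \cdot x_j)$ and introduce more compact notation:

minimize: $W(\alpha) = -\alpha^T \mathbf{1} + \frac{1}{2} \alpha^T H \alpha$ (7)

subject to: $\alpha^T y = 0$ (8)

$0 \le \alpha \le C \mathbf{1}$ (9)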

(This minimization problem is what is known as a Quadratic Programming Problem (QP). Fortunately, many

techniques have been developed to solve them.)

In addition, from the derivation of these equations, it was seen that the optimal hyper plane can be written as:

$w = \sum_{i} \alpha_i y_i x_i$ (10)

That is, the vector w is just a linear combination of the training examples. Interestingly, it can also be shown that

$\alpha_i \big( y_i (w \cdot x_i + b) - 1 \big) = 0 \quad (\forall i)$

which is just a fancy way of saying that when the functional distance of an example is strictly greater than 1

(when $y_i (w \cdot x_i + b) > 1$) then $\alpha_i = 0$. So only the closest data points contribute to w. These training examples

for which $\alpha_i > 0$ are termed support vectors. They are the only ones needed in defining (and finding) the

optimal hyper plane. Intuitively, the support-vectors are the “borderline cases” in the decision function we are

trying to learn. Even more interesting is that $\alpha_i$ can be thought of as a “difficulty rating” for the example $x_i$ - how

important that example was in determining the hyper plane.

Assuming we have the optimal α (from which we construct w), we must still determine b to fully specify the

hyper plane. To do this, take any “positive” and “negative” support vector, $x^+$ and $x^-$, for which we know

$w \cdot x^+ + b = +1$

$w \cdot x^- + b = -1$

Solving these equations gives us

$b = -\frac{1}{2} \left( w \cdot x^+ + w \cdot x^- \right)$ (11)
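To make the construction of the hyper plane from the Lagrange multipliers concrete, here is a minimal sketch on a two-point toy problem. It assumes the standard forms of eq. 10 (w as the sum of the terms alpha_i y_i x_i) and eq. 11 (b as minus one half of the sum of w.x+ and w.x-). The data, the function names, and the multipliers (worked out by hand for this toy problem rather than by a QP solver) are all assumptions of the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas, ys, xs):
    """Eq. 10: w is a linear combination of the training examples."""
    d = len(xs[0])
    return [sum(a * y * x[j] for a, y, x in zip(alphas, ys, xs)) for j in range(d)]


def bias(w, x_pos, x_neg):
    """Eq. 11: b from one positive and one negative support vector."""
    return -0.5 * (dot(w, x_pos) + dot(w, x_neg))


xs = [(1.0, 1.0), (-1.0, -1.0)]  # one positive and one negative example
ys = [1, -1]
alphas = [0.25, 0.25]            # hand-derived multipliers for this toy problem

w = weight_vector(alphas, ys, xs)
b = bias(w, xs[0], xs[1])
margins = [y * (dot(w, x) + b) for x, y in zip(xs, ys)]
```

Both examples end up with a functional distance of exactly 1, as expected for support vectors.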

Now, you may have wondered about the need for the constraint (eq. 9): $0 \le \alpha_i \le C \quad (\forall i)$.

When C = ∞, the optimal hyper plane will be the one that completely separates the data (assuming one exists).

For finite C, this changes the problem to finding a “soft-margin” classifier, which allows for some of the data to

be misclassified. One can think of C as a tunable parameter: higher C corresponds to more importance on

classifying all the training data correctly, lower C results in a “more flexible” hyper plane that tries to minimize

the margin error (how badly $y_i (w \cdot x_i + b) < 1$) for each example. Finite values of C are useful in situations

where the data is not easily separable (perhaps because the input data $\{x_i\}$ are noisy).

3.2.1.4 The Generalization Ability of Perfectly Trained SVM's

Suppose we find the optimal hyper plane separating the data, and of the l training examples, $N_s$ of them are

support vectors. It can then be shown that the expected out-of-sample error Π (the portion of unseen data that

will be misclassified) is bounded by

$\Pi \le \frac{N_s}{l - 1}$ (12)

This is a very useful result. It ties together the notions that simpler systems are better (Ockham's Razor principle)

and that for SVM's, fewer support vectors are in fact a more “compact" and “simpler" representation of the

hyper plane and hence should perform better. If the data cannot be separated however, no such theorem

applies, which at this point seems to be a potential setback for SVM's.

3.2.1.5 Mapping the Inputs to other dimensions – the use of Kernels

Now just because a data set is not linearly separable, doesn't mean there isn't some other concise way to

separate the data. For example, it might be easier to separate the data using polynomial curves, or circles.

However, finding the optimal curve to fit the data is difficult, and it would be a shame not to use the method of

finding the optimal hyper plane that we investigated in the previous section. Indeed there is a way to “pre-process”

the data in such a way that the problem is transformed into one of finding a simple

hyper plane. To do this, we define a mapping $z = \phi(x)$ that transforms the

d dimensional input vector x into a (usually higher) d’ dimensional vector z. We hope to choose a $\phi()$ so that the

new training data $\{\phi(x_i), y_i\}$ is separable by a hyper plane. (See Figure 2.)

This method looks like it might work, but there are some concerns. Firstly, how do we go about choosing $\phi()$? It

would be a lot of work to have to construct one explicitly for any data set we are given. Not to fear, if $\phi(x)$ casts

the input vector into a high enough space (d’ ≫ d), the data should eventually become separable. So maybe there is a standard $\phi()$ that does

this for most data... But casting into a very high dimensional space is also worrisome. Computationally, this

creates much more of a burden. Recall that the construction of the matrix H requires the dot products $(x_i \cdot x_j)$. If

d’ is exponentially larger than d (and it very well could be), the computation of H becomes prohibitive (not to

mention the extra space requirements). Also, by increasing the complexity of our system in such a way,

overfitting becomes a concern. By casting into a high enough dimensional space, it is a fact that we can separate any

data set. How can we be sure that the system isn't just fitting the idiosyncrasies of the training data, but is

actually learning a legitimate pattern that will generalize to other data it hasn't been trained on?

As we'll see, SVM's avoid these problems. Given a mapping $z = \phi(x)$, to set up our new optimization problem,

we simply replace all occurrences of x with $\phi(x)$.

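Our QP problem (recall eq. 7) would still be

minimize: $W(\alpha) = -\alpha^T \mathbf{1} + \frac{1}{2} \alpha^T H \alpha$

but instead of $(H)_{ij} = y_i y_j (x_i \cdot x_j)$, it is $(H)_{ij} = y_i y_j (\phi(x_i) \cdot \phi(x_j))$. Eq. 10 would be

$w = \sum_i \alpha_i y_i \phi(x_i)$

And eq. 2 would be

$f(x) = \mathrm{sign}(w \cdot \phi(x) + b)$

$= \mathrm{sign}\big( \big[ \sum_i \alpha_i y_i \phi(x_i) \big] \cdot \phi(x) + b \big)$

$= \mathrm{sign}\big( \sum_i \alpha_i y_i (\phi(x_i) \cdot \phi(x)) + b \big)$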

The important observation in all this is that any time a $\phi(x_a)$ appears, it is always in a dot product with some

other $\phi(x_b)$. That is, if we knew the formula (called a kernel) for the dot product in the higher dimensional

feature space,

$K(x_a, x_b) = \phi(x_a) \cdot \phi(x_b)$ (13)

we would never need to deal with the mapping $z = \phi(x)$ directly. The matrix in our optimization would simply be

$(H)_{ij} = y_i y_j K(x_i, x_j)$, and

our classifier $f(x) = \mathrm{sign}\big( \sum_i \alpha_i y_i K(x_i, x) + b \big)$. Once the problem is set up in this way, finding the optimal hyper plane

proceeds as usual, only the hyper plane will be in some unknown feature space. In the original input space, the

data will be separated by some curved, possibly non-continuous contour. It may not seem obvious why the use

of a kernel alleviates our concerns, but it does. Earlier, we mentioned that it would be tedious to have to design

a different feature map for any training set we are given, in order to linearly separate the data. Fortunately,

useful kernels have already been discovered. Consider the “polynomial kernel”

$K(x_a, x_b) = (x_a \cdot x_b + 1)^p$ (14)

where p is a tunable parameter, which in practice varies from 1 to ~10. Notice that evaluating K involves only

an extra addition and exponentiation more than computing the original dot product. You might wonder what

the implicit mapping $\phi()$ was in the creation of this kernel. Well, if you were to expand the dot product inside K

$K(x_a, x_b) = (x_{a1} x_{b1} + x_{a2} x_{b2} + \dots + x_{ad} x_{bd} + 1)^p$ (15)

and multiply these (d+1) terms by each other p times, it would result in $\binom{d+p}{p}$ distinct

terms, each of which is a

polynomial of varying degree in the input vectors. Thus, one can think of this polynomial kernel as the dot

product of two exponentially large z vectors. By using a larger value of p the dimension of the feature space is

implicitly larger, where the data will likely be easier to separate. (However, in a larger dimensional space, there

might be more support vectors, which we saw leads to worse generalization.)
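The claim that the polynomial kernel is a dot product in a larger feature space can be verified directly for a small case. The sketch below assumes the kernel has the form (a.b + 1)^p and uses d = 2, p = 2, for which one valid explicit mapping has six components; the function names and the test vectors are hypothetical.

```python
from math import sqrt


def poly_kernel(a, b, p=2):
    """Polynomial kernel: one dot product, one addition, one exponentiation."""
    return (sum(x * y for x, y in zip(a, b)) + 1) ** p


def phi_degree2(x):
    """Explicit feature map for d = 2, p = 2 (six-dimensional z vector)."""
    x1, x2 = x
    r = sqrt(2.0)
    return [x1 * x1, x2 * x2, r * x1 * x2, r * x1, r * x2, 1.0]


a, b = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(a, b)
explicit = sum(u * v for u, v in zip(phi_degree2(a), phi_degree2(b)))
```

The two values agree, yet the kernel never builds the six-dimensional vectors. For larger d and p the explicit vectors become very long while the cost of the kernel barely changes.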

3.2.1.6 Other kernels

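Another popular one is the Gaussian RBF Kernel

$K(x_a, x_b) = \exp\left( -\frac{\|x_a - x_b\|^2}{2\sigma^2} \right)$ (16)

where σ is a tunable parameter. Using this kernel results in the classifier

$f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right) + b \right)$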

which is really just a Radial Basis Function, with the support vectors as the centers. So here, a SVM was implicitly

used to find the number (and location) of centers needed to form the RBF network with the highest expected

generalization performance.
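A kernel classifier of this kind is short to write down once the multipliers are known. The sketch below assumes the common Gaussian form exp(-||a - b||^2 / (2 sigma^2)) for the kernel. The support vectors, multipliers and bias are made-up values for illustration, not the solution of an actual optimization.

```python
from math import exp


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel with width parameter sigma."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return exp(-squared_distance / (2 * sigma ** 2))


def kernel_classifier(x, support_vectors, ys, alphas, b, kernel=rbf_kernel):
    """f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b), with sign(0) taken as +1."""
    s = sum(a * y * kernel(sv, x) for a, y, sv in zip(alphas, ys, support_vectors)) + b
    return 1 if s >= 0 else -1


# Hypothetical support vectors acting as the centres of the RBF network.
centres = [(0.0, 0.0), (3.0, 3.0)]
labels = [1, -1]
alphas = [1.0, 1.0]  # illustrative values, not solved for
near_first = kernel_classifier((0.5, 0.5), centres, labels, alphas, b=0.0)
near_second = kernel_classifier((2.8, 3.1), centres, labels, alphas, b=0.0)
```

Each test point takes the label of the centre it is closest to, because the Gaussian term of the far centre is negligible.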

At this point one might wonder what other kernels exist, and if making your own kernel is as simple as just

dreaming up some function $K(x_a, x_b)$. As it turns out, K must in fact be the dot product in a feature space for

some $\phi()$, if all the theory behind SVM's is to go through. Now there are two ways to ensure this. The first is to

create some mapping $z = \phi(x)$ and then derive the analytic expression for $K(x_a, x_b) = \phi(x_a) \cdot \phi(x_b)$.

This kernel is most definitely the dot product in a feature space, since it was created as such. The second way is

to dream up some function K and then check if it is valid by applying Mercer's condition. Without giving too

many details, the condition states: Suppose K can be written as $K(x_a, x_b) = \sum_{i=1}^{\infty} f_i(x_a) f_i(x_b)$ for some choice of

the $f_i$'s. If K is indeed a dot product, then

$\int\!\!\int K(x_a, x_b)\, g(x_a)\, g(x_b)\, dx_a\, dx_b \ge 0$ for every g such that $0 < \int g^2(x)\, dx < \infty$.

The mathematically inclined reader interested in the derivation details is encouraged to see (Cristianini, Shawe-Taylor, 2000).

It is indeed a strange mathematical requirement. Fortunately for us, the polynomial and RBF

kernels have already been proven to be valid, and most of the literature presenting results using SVM's uses

these two simple kernels. So most SVM users need not be concerned with creating new kernels, and checking

that they meet Mercer's condition. (Interestingly though, kernels satisfy many closure properties. That is,

addition, multiplication, and composition of valid kernels all result in valid kernels. Again, see (Cristianini, Shawe-Taylor, 2000).)

3.2.2 Naïve Bayes classification

3.2.2.1 Introduction to Bayesian Classification

Bayesian classification is a supervised learning method as well as a statistical method for

classification. It assumes an underlying probabilistic model, which allows us to capture uncertainty about the

model in a principled way by determining probabilities of the outcomes. It can solve diagnostic and predictive

problems.

This classification is named after Thomas Bayes (1702 - 1761), who proposed Bayes' Theorem.

Bayesian classification provides practical learning algorithms in which prior knowledge and observed data can be

combined. It also provides a useful perspective for understanding and evaluating many learning

algorithms. It calculates explicit probabilities for hypotheses and is robust to noise in input data.

Uses of Naive Bayes classification:

1. Naive Bayes text classification

(http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)

Bayesian classification is used as a probabilistic learning method (Naive Bayes text classification). Naive

Bayes classifiers are among the most successful known algorithms for learning to classify text documents.

2. Spam filtering (http://en.wikipedia.org/wiki/Bayesian_spam_filtering)

Spam filtering is the best known use of Naive Bayesian text classification. It makes use of a Naive Bayes

classifier to identify spam e-mail.

Bayesian spam filtering has become a popular mechanism to distinguish illegitimate spam email from

legitimate email (sometimes called "ham" or "bacn"). Many modern mail clients implement Bayesian

spam filtering. Users can also install separate email filtering programs. Server-side email filters, such as

DSPAM, SpamAssassin, SpamBayes, Bogofilter and ASSP, make use of Bayesian spam filtering techniques,

and the functionality is sometimes embedded within mail server software itself.

3. Hybrid Recommender System Using Naive Bayes Classifier and Collaborative Filtering

(http://eprints.ecs.soton.ac.uk/18483/)

Recommender Systems apply machine learning and data mining techniques for filtering unseen information

and can predict whether a user would like a given resource.

A unique switching hybrid recommendation approach is proposed that combines a Naive Bayes

classification approach with collaborative filtering. Experimental results on two different data sets

show that the proposed algorithm is scalable and provides better performance, in terms of accuracy and

coverage, than other algorithms, while at the same time eliminating some recorded problems with

recommender systems.

4. Online applications (http://www.convo.co.uk/x02/)

This online application has been set up as a simple example of supervised machine learning and affective

computing. Using a training set of examples which reflect nice, nasty or neutral sentiments, we're training

Ditto to distinguish between them.

Simple Emotion modeling combines a statistically based classifier with a dynamical model. The Naive Bayes

classifier employs single words and word pairs as features. It allocates user utterances into nice, nasty and

neutral classes, labeled +1, -1 and 0 respectively. This numerical output drives a simple first-order

dynamical system, whose state represents the simulated emotional state of the experiment's

personification, Ditto the donkey.

Independence

Example

Suppose there are two events:

M: Manuela teaches the class (otherwise it's Andrew)

S: It is sunny

"The sunshine levels do not depend on and do not influence who is teaching."

Theory:

From P(S | M) = P(S), the rules of probability imply:

P(~S | M) = P(~S)

P(M | S) = P(M)

P(M ^ S) = P(M) P(S)

P(~M ^ S) = P(~M) P(S)

P(M^~S) = P(M)P(~S)

P(~M^~S) = P(~M)P(~S)

Theory applied on previous example:

"The sunshine levels do not depend on and do not influence who is teaching." can be specified very simply:

P(S | M) = P(S)

"Two events A and B are statistically independent if the probability of A is the same value when B occurs, when

B does not occur or when nothing is known about the occurrence of B"
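The identities listed under "Theory" can be checked numerically. The following is a minimal sketch, assuming P(M) = 0.6 and P(S) = 0.3 (the values quoted in the detailed example of this chapter); the function and variable names are illustrative only.

```python
from itertools import product

# Assumed marginals (values quoted in the detailed example of this chapter).
P_M = 0.6  # Manuela teaches the class
P_S = 0.3  # it is sunny


def marginal(event_true, p):
    """Probability of an event (True) or of its negation (False)."""
    return p if event_true else 1.0 - p


# Joint table built under the independence assumption P(M ^ S) = P(M) P(S).
joint = {
    (m, s): marginal(m, P_M) * marginal(s, P_S)
    for m, s in product([True, False], repeat=2)
}


def conditional_s_given_m(s, m):
    """P(S=s | M=m) computed from the joint table."""
    p_m = sum(p for (mm, _), p in joint.items() if mm == m)
    return joint[(m, s)] / p_m


# P(S | M) and P(S | ~M) should both equal P(S).
for m in (True, False):
    print(m, round(conditional_s_given_m(True, m), 6))
```

Whoever teaches, the conditional probability of sunshine stays equal to P(S), which is exactly what independence states.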


3.2.2.2 Conditional Probability

3.2.2.3 Simple Example:

H = "Have a headache"

F = "Coming down with flu"

P(H) = 1/10

P(F) = 1/40

P(H|F) = 1/2

"Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache."

P(H|F) = Fraction of flu-inflicted worlds in which you have a headache

= (#worlds with flu and headache) / (#worlds with flu)

= (Area of "H and F" region) / (Area of "F" region)

= P(H ^ F) / P(F)

3.2.2.4 Theory:

P(A|B) = Fraction of worlds in which B is true that also have A true

P(A|B) = P(A ^ B) / P(B)

Corollary: P(A ^ B) = P(A|B) P(B)

P(A|B) + P(~A|B) = 1

\sum_{k=1}^{n} P(A = v_k | B) = 1, where v_1, ..., v_n are the possible values of A
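The headache and flu numbers from the simple example can be pushed through these formulas. This is a small sketch using exact fractions; the variable names are illustrative, and the reversed conditional P(F|H) is an extra step that follows from the same definition rather than something stated in the example.

```python
from fractions import Fraction

p_h = Fraction(1, 10)         # P(H): have a headache
p_f = Fraction(1, 40)         # P(F): coming down with flu
p_h_given_f = Fraction(1, 2)  # P(H|F)

# Corollary: P(H ^ F) = P(H|F) P(F)
p_h_and_f = p_h_given_f * p_f

# Same definition with the roles swapped: P(F|H) = P(H ^ F) / P(H)
p_f_given_h = p_h_and_f / p_h

# Complement rule: P(~H|F) = 1 - P(H|F)
p_not_h_given_f = 1 - p_h_given_f

print(p_h_and_f, p_f_given_h, p_not_h_given_f)
```

So although half of the flu cases come with a headache, only one headache in eight comes with flu under these numbers.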


3.2.2.5 Detailed Example

M : Manuela teaches the class

S : It is sunny

L : The lecturer arrives slightly late

Assume both lecturers are sometimes delayed by bad weather, and that Andrew is more likely to arrive late than Manuela. Let's begin by writing down the knowledge:

P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6

Lateness is not independent of the weather and is not independent of the lecturer. Therefore lateness is dependent on both weather and lecturer.


3.2.3 Conditional Independence

3.2.3.1 Example:

Suppose we have these three events:

M : Lecture taught by Manuela

L : Lecturer arrives late

R : Lecture concerns robots

Suppose:

Andrew has a higher chance of being late than Manuela.

Andrew has a higher chance of giving robotics lectures.


3.2.3.2 Theory:

R and L are conditionally independent given M if for all x, y, z in {T, F}:

P(R=x | M=y ^ L=z) = P(R=x | M=y)

More generally:

Let S1, S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,

P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)

P(A|B) = P(A ^ B) / P(B)

Therefore P(A ^ B) = P(A|B).P(B) (also known as the chain rule)

Also P(A ^ B) = P(B|A).P(A)

Therefore P(A|B) = P(B|A).P(A) / P(B)

P(A,B|C) = P(A ^ B ^ C) / P(C)

= P(A|B,C).P(B ^ C) / P(C) (applying the chain rule)

= P(A|B,C).P(B|C)

= P(A|C).P(B|C), if A and B are conditionally independent given C.

This can be extended to n variables as P(A1, A2, ..., An | C) = P(A1|C).P(A2|C)...P(An|C) if A1, A2, ..., An are conditionally independent given C.

3.2.3.3 Theory applied on previous example:

For the previous example, we can use the following notations: P(R| M,L) = P(R| M) and P(R| ~M,L) = P(R| ~M)

We express this in the following way: "R and L are conditionally independent given M"


3.2.4 Bayes Theorem

Bayesian reasoning is applied to decision making and to inferential statistics that deal with probabilistic inference. It uses the knowledge of prior events to predict future events. Example: predicting the color of marbles in a basket.

3.2.4.1 Example:

Table1: Data table

3.2.4.2 Theory:

The Bayes Theorem:

P(h/D) = P(D/h) P(h) / P(D)

P(h) : Prior probability of hypothesis h

P(D) : Prior probability of training data D

P(h/D) : Probability of h given D (posterior probability)

P(D/h) : Probability of D given h (likelihood)

For example:

D : A 35-year-old customer with an income of $50,000 per year

h : Hypothesis that our customer will buy our computer

P(h/D) : Probability that customer D will buy our computer given that we know his age and income (posterior probability)

P(h) : Probability that any customer will buy our computer regardless of age and income (prior probability)

P(D/h) : Probability that the customer is 35 years old and earns $50,000, given that he has bought our computer (likelihood)

P(D) : Probability that a person from our set of customers is 35 years old and earns $50,000

3.2.5 Maximum A Posteriori (MAP) Hypothesis

3.2.5.1 Example:

h1 : Customer buys a computer = Yes

h2 : Customer buys a computer = No

where h1 and h2 are members of our hypothesis space 'H'.

Final outcome = arg max {P(D/h1) P(h1), P(D/h2) P(h2)}

P(D) can be ignored as it is the same for both terms.

3.2.5.2 Theory:

Generally we want the most probable hypothesis given the training data:

hMAP = arg max P(h/D) (where h belongs to H and H is the hypothesis space)

hMAP = arg max P(D/h) P(h) / P(D)

hMAP = arg max P(D/h) P(h)


3.2.6 Maximum Likelihood (ML) Hypothesis

3.2.6.1 Example:

Table 2

3.2.6.2 Theory:

If we assume that all hypotheses are equally probable a priori, P(hi) = P(hj), the prior no longer affects the maximization, and further simplification leads to:

hML = arg max P(D/hi) (where hi belongs to H)

3.2.6.3 Theory applied on previous example:

P(buys computer = yes) = 5/10 = 0.5

P(buys computer = no) = 5/10 = 0.5

P(customer is 35 yrs & earns $50,000) = 4/10 = 0.4

P(customer is 35 yrs & earns $50,000 / buys computer = yes) = 3/5 = 0.6

P(customer is 35 yrs & earns $50,000 / buys computer = no) = 1/5 = 0.2

Customer buys a computer: P(h1/D) = P(h1) * P(D/h1) / P(D) = 0.5 * 0.6 / 0.4 = 0.75

Customer does not buy a computer: P(h2/D) = P(h2) * P(D/h2) / P(D) = 0.5 * 0.2 / 0.4 = 0.25

Final outcome = arg max {P(h1/D), P(h2/D)} = max(0.75, 0.25) => Customer buys a computer

Because P(h1) = P(h2), the maximum likelihood hypothesis gives the same decision: arg max {P(D/h1), P(D/h2)} = max(0.6, 0.2) => Customer buys a computer
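The arithmetic of this example can be reproduced in a few lines. The sketch below applies Bayes' theorem to the priors and likelihoods quoted above and then picks the MAP and ML hypotheses; the dictionary keys and the function name are illustrative choices, not part of the original work.

```python
def posteriors(priors, likelihoods):
    """Return P(h|D) for every hypothesis h using Bayes' theorem."""
    # P(D) is obtained by summing P(D|h) P(h) over all hypotheses.
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


priors = {"buys": 0.5, "does_not_buy": 0.5}       # P(h)
likelihoods = {"buys": 0.6, "does_not_buy": 0.2}  # P(D|h)

post = posteriors(priors, likelihoods)
h_map = max(post, key=post.get)               # maximum a posteriori hypothesis
h_ml = max(likelihoods, key=likelihoods.get)  # maximum likelihood hypothesis

print(post, h_map, h_ml)
```

With equal priors the MAP and ML hypotheses coincide; they can differ only when one hypothesis is more probable a priori than the other.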

3.2.7 Naïve Bayesian Classification

Naïve Bayesian classification is based on Bayes' theorem. It is particularly suited to problems where the dimensionality of the inputs is high. Parameter estimation for naive Bayes models uses the method of maximum likelihood. In spite of its over-simplified assumptions, it often performs well in many complex real-world situations.

Advantage: Requires a small amount of training data to estimate the parameters

3.2.7.1 Example

X = ( age= youth, income = medium, student = yes, credit_rating = fair)

Will a person described by tuple X buy a computer?

3.2.7.2 Theory:

Derivation:

D : Set of tuples. Each tuple is an 'n'-dimensional attribute vector X : (x1, x2, x3, ..., xn)

Let there be 'm' classes : C1, C2, C3, ..., Cm

The Naïve Bayes classifier predicts that X belongs to class Ci iff

P(Ci/X) > P(Cj/X) for 1 <= j <= m, j <> i (maximum a posteriori hypothesis)

P(Ci/X) = P(X/Ci) P(Ci) / P(X)

Maximize P(X/Ci) P(Ci), as P(X) is constant.

With many attributes, it is computationally expensive to evaluate P(X/Ci). The naïve assumption of "class conditional independence" gives:

P(X/Ci) = \prod_{k=1}^{n} P(xk/Ci) = P(x1/Ci) * P(x2/Ci) * ... * P(xn/Ci)


P(C1) = P(buys_computer = yes) = 9/14 = 0.643

P(C2) = P(buys_computer = no) = 5/14 = 0.357

P(age=youth / buys_computer = yes) = 2/9 = 0.222

P(age=youth / buys_computer = no) = 3/5 = 0.600

P(income=medium / buys_computer = yes) = 4/9 = 0.444

P(income=medium / buys_computer = no) = 2/5 = 0.400

P(student=yes / buys_computer = yes) = 6/9 = 0.667

P(student=yes / buys_computer = no) = 1/5 = 0.200

P(credit rating=fair / buys_computer = yes) = 6/9 = 0.667

P(credit rating=fair / buys_computer = no) = 2/5 = 0.400

P(X / Buys a computer = yes) = P(age=youth / buys_computer = yes) * P(income=medium / buys_computer = yes) * P(student=yes / buys_computer = yes) * P(credit rating=fair / buys_computer = yes) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044

P(X / Buys a computer = no) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019

Find the class Ci that maximizes P(X/Ci) * P(Ci):

=> P(X / Buys a computer = yes) * P(buys_computer = yes) = 0.028

=> P(X / Buys a computer = no) * P(buys_computer = no) = 0.007

Prediction : Buys a computer for tuple X
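The same calculation can be written as a short program. This is a minimal sketch that hard-codes the priors and conditional probabilities quoted above instead of estimating them from a training table; the dictionary layout and names are illustrative assumptions.

```python
from math import prod

# Class priors P(Ci), as quoted in the example.
priors = {"yes": 9 / 14, "no": 5 / 14}

# P(attribute value | class), as quoted in the example.
conditionals = {
    "yes": {
        "age=youth": 2 / 9,
        "income=medium": 4 / 9,
        "student=yes": 6 / 9,
        "credit_rating=fair": 6 / 9,
    },
    "no": {
        "age=youth": 3 / 5,
        "income=medium": 2 / 5,
        "student=yes": 1 / 5,
        "credit_rating=fair": 2 / 5,
    },
}

# Tuple X to classify.
x = ["age=youth", "income=medium", "student=yes", "credit_rating=fair"]


def score(label):
    """Unnormalized posterior P(X|Ci) * P(Ci) under class conditional independence."""
    return priors[label] * prod(conditionals[label][value] for value in x)


scores = {label: score(label) for label in priors}
prediction = max(scores, key=scores.get)

print({label: round(s, 3) for label, s in scores.items()}, prediction)
```

P(X) is never computed because it is the same for both classes and does not change which score is largest.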


3.3 Evaluation measures

Text classification rules are typically evaluated using performance measures from information retrieval. Common metrics for text categorization evaluation include recall, precision, accuracy, error rate and F1. Given a test set of N documents, a two-by-two contingency table (see Table 1) with four cells can be constructed for each binary classification problem. The cells contain the counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), respectively. Clearly, N = TP + FP + TN + FN.

Table 1

                                  actual class (observation)
                                  positive                      negative
predicted class    positive      TP (true positive)            FP (false positive)
(expectation)                    Correct result                Unexpected result
                   negative      FN (false negative)           TN (true negative)
                                 Missing result                Correct absence of result

The terms true positives, true negatives, false positives, and false negatives compare the results of the classifier under test with trusted external judgments.

The terms positive and negative refer to the classifier's prediction (sometimes known as the expectation), and the terms true and false refer to whether that prediction corresponds to the external judgment (sometimes known as the observation).

Hence the metrics for binary-decisions are defined as:

3.3.1 Precision

Precision is the proportion of Predicted Positive cases that are correctly Real Positives. This is what Machine Learning, Data Mining and Information Retrieval focus on, but it is totally ignored in ROC analysis. It can however analogously be called True Positive Accuracy (tpa), being a measure of accuracy of Predicted Positives in contrast with the rate of discovery of Real Positives (tpr).

In short, it represents how many of the returned documents or topics are correctly predicted.

Precision = TP / (TP + FP)

Issue with precision: when a system outputs only the topics it is confident about, the precision easily reaches a high percentage.


3.3.2 Recall (Sensitivity)

Recall is the proportion of Real Positive cases that are correctly Predicted Positive. This measures the coverage of the Real Positive cases by the +P (Predicted Positive) rule. Its desirable feature is that it reflects how many of the relevant cases the +P rule picks up.

In short, it represents what percentage of the positive cases were caught.

Recall = TP / (TP + FN)

Issue with recall: when a system outputs topics loosely, the recall easily reaches a high percentage.

3.3.3 Accuracy

Accuracy represents the percentage of predictions that were correct, i.e. the rate of correctly predicted topics.

Figure 2

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Issue with accuracy: when a certain topic (e.g., not-spam) is in the majority, the accuracy easily reaches a high percentage.

3.3.4 F1-measure

The F1 measure effectively references the True Positives to the arithmetic mean of Predicted Positives and Real Positives, being a constructed rate normalized to an idealized value. Expressed in this form it is known in statistics as a Proportion of Specific Agreement, as it is applied to a specific class, here the Positive class.

F1 measure = 2 * Recall * Precision / (Recall + Precision)

Since there is a trade-off between recall and precision, the F-measure is widely used to evaluate text classification systems.
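The four measures can be computed directly from the cells of the contingency table. The sketch below is illustrative: the function name is an assumption, and the counts are those of the unigram Naive Bayes confusion matrix reported in Chapter 5, read with "+" as the positive class (93 and 7 in the first row, 14 and 86 in the second). It does not guard against empty denominators.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * recall * precision / (recall + precision)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# 93 positives predicted "+", 7 positives predicted "-",
# 14 negatives predicted "+", 86 negatives predicted "-".
metrics = binary_metrics(tp=93, fp=14, fn=7, tn=86)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts the rounded values agree with the "+" row of the corresponding Weka output (precision 0.869, recall 0.930, F-measure 0.899) and with the reported accuracy of 89.5 %.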


Chapter 4 Opinion lexicons

4.1 Definition

Opinion lexicons are resources that associate sentiment orientation and words. Their use in opinion mining research

stems from the hypothesis that individual words can be considered as a unit of opinion information, and therefore

may provide clues to document sentiment and subjectivity.

Manually created opinion lexicons were applied to sentiment classification as seen in [13], where a

prediction of document polarity is given by counting positive and negative terms. A similar approach

is presented in the work of Kennedy and Inkpen [10], this time using an opinion lexicon based on the

combination of other existing resources.
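The term-counting idea can be sketched in a few lines. The example below is an illustration only: the two word sets are tiny hypothetical stand-ins for a real opinion lexicon, the tokenizer is a simple regular expression, and ties are reported as neutral ("0"), which is an assumption rather than a rule taken from the cited works.

```python
import re

# Hypothetical miniature lexicon; a real lexicon holds thousands of terms.
POSITIVE = {"good", "great", "excellent", "sharp"}
NEGATIVE = {"bad", "poor", "blurry", "broken"}


def classify(text):
    """Label a text "+", "-" or "0" by counting lexicon terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(token in POSITIVE for token in tokens)
    negative_hits = sum(token in NEGATIVE for token in tokens)
    if positive_hits > negative_hits:
        return "+"
    if negative_hits > positive_hits:
        return "-"
    return "0"


print(classify("Great camera, sharp pictures but poor battery"))
```

A counter like this ignores negation and word sense, which is one reason lexicon-based scores are usually compared against trained classifiers.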

Manually built lexicons however tend to be constrained to a small number of terms. By its nature, building manual

lists is a time consuming effort, and may be subject to annotator bias. To overcome these issues lexical induction

approaches have been proposed in the literature with a view to extend the size of opinion lexicons from a core set of

seed terms, either by exploring term relationships, or by evaluating similarities in document corpora. Early work in

this area seen in [9] extends a list of positive and negative adjectives by evaluating conjunctive statements in a

document corpus. Another common approach is to derive opinion terms from the WordNet database of terms and

relationships [12], typically by examining the semantic relationships of a term such as synonyms and antonyms.

In this work two commonly used opinion lexicons are used: the first is SentiWordNet 3.0 and the second is the opinion lexicon created by Dr. Bing Liu.

4.2 SENTIWORDNET 3.0

SENTIWORDNET 3.0 is an enhanced lexical resource explicitly devised for supporting sentiment classification and opinion mining applications (Pang and Lee, 2008). It is an improved version of SENTIWORDNET 1.0 (Esuli and Sebastiani, 2006), a lexical resource publicly available for research purposes, currently licensed to more than 300 research groups and used in a variety of research projects worldwide.

SENTIWORDNET is the result of the automatic annotation of all the synsets of WORDNET according to the notions of “positivity”, “negativity”, and “neutrality”. Each synset s is associated with three numerical scores Pos(s), Neg(s), and Obj(s) which indicate how positive, negative, and “objective” (i.e., neutral) the terms contained in the synset are. Different senses of the same term may thus have different opinion-related properties. For example, in SENTIWORDNET 1.0 the synset [estimable(J,3)], corresponding to the sense “may be computed or estimated” of the adjective estimable, has an Obj score of 1.0 (and Pos and Neg scores of 0.0), while the synset [estimable(J,1)], corresponding to the sense “deserving of respect or high regard”, has a Pos score of 0.75, a Neg score of 0.0, and an Obj score of 0.25.

Each of the three scores ranges in the interval [0.0, 1.0], and their sum is 1.0 for each synset. This means that a synset may have nonzero scores for all three categories, which would indicate that the corresponding terms have, in the sense indicated by the synset, each of the three opinion-related properties to a certain degree.

Each set of terms sharing the same meaning in SentiWordNet (a synset) is thus stored with two numerical scores ranging from 0 to 1, indicating the synset’s positive and negative bias; the objective score follows from them. The scores reflect the agreement amongst the classifier committee on the positive or negative label for a term. One distinct aspect of SentiWordNet is therefore that a term may have non-zero values for both positive and negative scores, according to the formula:

Pos. Score(term) + Neg. Score(term) + Objective Score(term) = 1
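The relation between the three scores can be illustrated with the two senses of "estimable" quoted above. This is a minimal sketch with a hypothetical `SynsetScore` class; it does not parse the SentiWordNet distribution file, and it simply derives the objective score from the other two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetScore:
    """Positive and negative scores of one synset (hypothetical container)."""
    pos: float
    neg: float

    @property
    def obj(self):
        # The three scores sum to 1, so the objective score is the remainder.
        return 1.0 - self.pos - self.neg


# The two senses of "estimable" quoted in the text.
estimable_computable = SynsetScore(pos=0.0, neg=0.0)  # "may be computed or estimated"
estimable_respected = SynsetScore(pos=0.75, neg=0.0)  # "deserving of respect or high regard"

print(estimable_computable.obj, estimable_respected.obj)
```

Because the scores are attached to senses rather than to words, a term first has to be mapped to a part of speech (and ideally a sense) before a score can be looked up.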


Terms in the SentiWordNet database follow the categorization into parts of speech derived from WordNet, and therefore to correctly apply scores to terms, a part of speech tagger program was applied to the polarity data set. In our experiment, the Stanford Part of Speech Tagger was used. The opinion lexicon can be downloaded freely for research purposes from the following link:

http://sentiwordnet.isti.cnr.it/

4.3 (Bing Liu) Opinion lexicon

4.3.1 Who is Dr. Bing Liu?

Dr. Bing Liu is a professor in the Department of Computer Science at the University of Illinois at Chicago (UIC) whose research interests include sentiment analysis, opinion mining, data and web mining, machine learning, constraint satisfaction and AI scheduling. He has published extensively, especially in the field of opinion mining and in data mining in general.

The following are examples of his publications in this field:

Mining and summarizing customer reviews

Opinion observer: analyzing and comparing opinions on the Web

Mining opinion features in customer reviews

Sentiment analysis and subjectivity

A holistic lexicon-based approach to opinion mining

Opinion spam and analysis

The following link contains most of his publications:

http://www.cs.uic.edu/~liub/publications/papers_chron.html

4.3.2 (Bing Liu) Opinion lexicon

The opinion lexicon is a list of positive and negative opinion words (sentiment words) for English. It holds around 6800 words divided into two separate files: one contains the positive words and the other contains the negative words. This list was compiled over many years, starting from his first paper (Hu and Liu, KDD-2004).

The opinion lexicon can be downloaded freely for research purposes from the following link: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html


http://scholar.google.com/citations?view_op=view_citation&hl=en&user=Kt1bjZoAAAAJ&citation_for_view=Kt1bjZoAAAAJ:u5HHmVD_uO8C

http://scholar.google.com/citations?view_op=view_citation&hl=en&user=Kt1bjZoAAAAJ&citation_for_view=Kt1bjZoAAAAJ:u-x6o8ySG0sC

http://scholar.google.com/citations?view_op=view_citation&hl=en&user=Kt1bjZoAAAAJ&citation_for_view=Kt1bjZoAAAAJ:IjCSPb-OGe4C

http://scholar.google.com/citations?view_op=view_citation&hl=en&user=Kt1bjZoAAAAJ&citation_for_view=Kt1bjZoAAAAJ:8k81kl-MbHgC

http://scholar.google.com/citations?view_op=view_citation&hl=en&user=Kt1bjZoAAAAJ&citation_for_view=Kt1bjZoAAAAJ:LkGwnXOMwfcC

http://scholar.google.com/citations?view_op=view_citation&hl=en&user=Kt1bjZoAAAAJ&citation_for_view=Kt1bjZoAAAAJ:5nxA0vEk-isC


http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar


Chapter 5 Experimental results

5.1 Data collection

The data set used in this work is the Amazon product review data (Jindal and Liu, WSDM-2008), used in (Jindal and Liu, WWW-2007, WSDM-2008; Lim et al, CIKM-2010; Jindal, Liu and Lim, CIKM-2010; Mukherjee et al. WWW-2011; Mukherjee, Liu and Glance, WWW-2012) for opinion spam (fake review) detection. The dataset consists of more than 2.8 million product reviews in multiple domains.

In this work we extracted 2000 reviews (1000 positive reviews and 1000 negative reviews) in the digital cameras domain. The extracted data were later split as follows: 90% of the data (1800 reviews) for training the classifiers and 10% of the data (200 reviews) as a test set.
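A 90/10 split of this kind can be sketched as follows. The code is illustrative and rests on assumptions: it keeps the class proportions equal in both parts (a stratified split), uses a fixed random seed, and works on placeholder review texts; the text above does not state how the original split was drawn.

```python
import random


def stratified_split(reviews, test_fraction=0.1, seed=0):
    """Split (text, label) pairs into train and test parts, class by class."""
    rng = random.Random(seed)
    by_label = {}
    for item in reviews:
        by_label.setdefault(item[1], []).append(item)

    train, test = [], []
    for items in by_label.values():
        items = items[:]            # copy so the caller's list is not reordered
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Placeholder data: 1000 positive and 1000 negative reviews.
reviews = [(f"review {i}", "+") for i in range(1000)]
reviews += [(f"review {i}", "-") for i in range(1000, 2000)]

train, test = stratified_split(reviews)
print(len(train), len(test))
```

Splitting per class keeps the test set balanced, so accuracy on it is not inflated by a majority class.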

The full dataset is free for research purposes and can be downloaded from the following link:


The following steps were applied to the dataset to extract the data of the desired domain:

Converted to database using SQL Server 2008 Import and Export Wizard tool

Figure 3

http://www.cs.uic.edu/~liub/FBS/opinion-spam-WSDM-08.pdf



5.2 Feature extraction

We used the Weka filter “StringToWordVector” to extract the features as unigrams and bigrams.

5.2.1 Unigram:

Figure 4


5.2.2 Bigram:

Figure 5

All stop words were also excluded from both the unigram and the bigram feature vectors.

The stop words chosen for removal were gathered from the following link:

https://code.google.com/p/textminingtools/source/browse/trunk/data/stopwords.txt?r=5

In both feature vectors (unigram and bigram) the “IteratedLovinsStemmer” stemmer was chosen, and the tokenizer used for the bigram vector produced both one-word tokens (unigrams) and two-word tokens.

TFTransform and IDFTransform were also applied to both feature vectors.
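The shape of the extracted features can be illustrated without Weka. The following sketch is not the StringToWordVector implementation: it uses a hypothetical five-word stop list, a simple regular-expression tokenizer, drops stop words before building two-word tokens (an assumption), and omits stemming and the TF/IDF transforms.

```python
import re

# Hypothetical miniature stop list; the experiment used a much longer one.
STOP_WORDS = {"the", "a", "is", "and", "of"}


def extract_features(text, max_n=2):
    """Return the one-word tokens followed by the two-word tokens of a text."""
    tokens = [
        token
        for token in re.findall(r"[a-z0-9']+", text.lower())
        if token not in STOP_WORDS
    ]
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


print(extract_features("The lens is sharp and fast"))
```

Calling `extract_features(text, max_n=1)` gives the unigram-only variant, while the default corresponds to the combined unigram and bigram vector.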


5.3 Feature selection

For dimensionality reduction the following patterns were removed from the feature vector (using Weka).

Patterns excluded from the uni-gram feature vector:

Numbers only: ([0-9]+)

Contains special characters: (.*[^a-z0-9 ]+.*)

One character: (.)

Two characters: (..)

These patterns remove features that consist of numbers only, features that include special characters, and features that consist of only one or two characters.

Patterns excluded from the bi-gram feature vector:

Numbers only: ([0-9]+)

Contains special characters: (.*[^a-z0-9 ]+.*)

One character: (.)

Two characters: (..)

One word is one letter: ([a-z] .*)|(.* [a-z])

One word is two letters: ([a-z][a-z] .*)|(.* [a-z][a-z])

2nd word is a number: (.* [0-9]+)

These patterns remove features that consist of numbers only, features that include special characters, features that consist of only one or two characters, features in which one word is a single character, features in which one word has only two characters, and features whose second word is a number.
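The effect of these patterns can be tried out on a few sample features. The sketch below assumes that a feature is dropped when a pattern matches the whole feature string (hence `re.fullmatch`); how Weka applied the patterns is not stated in the text, and the sample features are made up.

```python
import re

# Patterns as listed above.
UNIGRAM_PATTERNS = [
    r"([0-9]+)",            # numbers only
    r"(.*[^a-z0-9 ]+.*)",   # contains special characters
    r"(.)",                 # one character
    r"(..)",                # two characters
]
BIGRAM_PATTERNS = UNIGRAM_PATTERNS + [
    r"([a-z] .*)|(.* [a-z])",            # one word is one letter
    r"([a-z][a-z] .*)|(.* [a-z][a-z])",  # one word is two letters
    r"(.* [0-9]+)",                      # 2nd word is a number
]


def keep(feature, patterns):
    """True if no exclusion pattern matches the whole feature."""
    return not any(re.fullmatch(pattern, feature) for pattern in patterns)


samples = ["camera", "2013", "don't", "ok", "good camera", "a camera", "camera 5"]
print([feature for feature in samples if keep(feature, BIGRAM_PATTERNS)])
```

Only "camera" and "good camera" survive in this sample; the other features each trigger at least one pattern.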

A further dimensionality reduction is applied on the uni-gram feature vector and the bi-gram feature vector

using the information gain algorithm. The following are the results of running information gain algorithm on

both feature vectors (unigram and Bi-gram respectively) in weka:

The results of applying information gain attribute selection algorithm On unigram feature vector:

=== Run information ===

Evaluator: weka.attributeSelection.InfoGainAttributeEval

Search: weka.attributeSelection.Ranker -T 0.0 -N -1

Instances: 2000

Attributes: 8081

Evaluation mode: evaluate on all training data

=== Attribute Selection on all input data ===

Search Method:

Attribute ranking.

Threshold for discarding attributes: 0

Attribute Evaluator (supervised, Class (nominal): 1 ReviewClass):

Information Gain Ranking Filter

Selected attributes: 1034 attribute


The results of applying information gain attribute selection algorithm On Bi-gram feature vector:


Evaluator: weka.attributeSelection.InfoGainAttributeEval

Search: weka.attributeSelection.Ranker -T 0.0 -N -1

Instances: 2000

Attributes: 107765

Evaluation mode: evaluate on all training data


Search Method:

Attribute ranking.

Threshold for discarding attributes: 0

Attribute Evaluator (supervised, Class (nominal): 1 ReviewClass):

Information Gain Ranking Filter

Selected attributes: 3740
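The quantity behind this ranking can be illustrated on a toy example. The sketch below computes the information gain of a single binary feature (term present or absent) with respect to the class label; it is a simplified stand-in and not Weka's InfoGainAttributeEval, which also has to handle numeric attribute values.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def information_gain(rows):
    """rows: list of (feature_present, label). IG = H(class) - H(class | feature)."""
    labels = sorted({label for _, label in rows})

    def class_counts(subset):
        return [sum(1 for _, l in subset if l == label) for label in labels]

    gain = entropy(class_counts(rows))
    for value in (True, False):
        subset = [row for row in rows if row[0] == value]
        if subset:
            gain -= len(subset) / len(rows) * entropy(class_counts(subset))
    return gain


# A term that appears only in positive reviews versus one spread evenly.
predictive = [(True, "+")] * 5 + [(False, "-")] * 5
uninformative = [(True, "+"), (True, "-"), (False, "+"), (False, "-")]

print(information_gain(predictive), information_gain(uninformative))
```

A feature that separates the classes perfectly scores 1 bit, while a feature distributed evenly across both classes scores 0. With the ranking threshold of 0 shown in the output above, only attributes with a positive score would be expected to survive.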

A further dimensionality reduction is applied to the unigram feature vector and the bigram feature vector using the principal components attribute selection algorithm. The following are the results of running the PCA algorithm on both feature vectors (unigram and bigram respectively) in Weka:

The results of applying PCA attribute selection algorithm On unigram feature vector:


Evaluator: weka.attributeSelection.PrincipalComponents -R 0.2 -A 5

Search:weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1

Instances: 2000

Attributes: 1035

Evaluation mode:evaluate on all training data


Search Method: Attribute ranking.

Attribute Evaluator (unsupervised): Principal Components Attribute

Transformer

Correlation matrix

1 0.22 0.13 0.14 0.17 . . . . ..

0.22 1 0.13 0.19 0.23 . . . . .

. . . . . . . . .. .

. . . . . . . . . . .

. . . . . . .. . .. . .

eigenvalue proportion cumulative

16.34076 0.0158 0.0158 -0.117mod-0.1featur-0.099shoot-0.097control-0.093ma...

9.9449 0.00962 0.02542 -0.3rebuilt-0.3unreasolut-0.287apar-0.278bureau 0.2...

7.42351 0.00718 0.0326 0.241tech+0.231awar+0.224proper+0.221easyshar+0.204p...

6.98087 0.00675 0.03935 0.329deceit+0.321downright+0.318death+0.303cnet+0.27...

6.53621 0.00632 0.04567 0.319ent+0.313cluster+0.312shiver+0.286univer+0.28 k...

6.13852 0.00594 0.05161 0.158len-0.136card-0.123usb+0.111foc-0.111vid...


5.64292 0.00546 0.05707 -0.358dieg-0.358transcript-0.356raynox-0.329unedit-....

........ ........ ........ .........................................

........ ........ ........ .........................................

Ranked attributes: 0.984 1 -0.117mod-0.1featur-0.099shoot-0.097control-0.093manu...

0.975 2 -0.3rebuilt-0.3unreasolut-0.287apar-0.278bureau-0.268gask...

.

.

.

0.798 55 0.116piec+0.111algorithm+0.11 snapshot-0.108laser+0.106doubl...

Selected attributes: 55

The results of applying PCA attribute selection algorithm On Bi-gram feature vector:

Selected attributes: 58

The following table (Table 2) summarizes how the size of the feature vectors evolved through the different phases of feature extraction and feature selection (dimensionality reduction) applied to the data set used in this work, starting from the original feature vector and ending with the smallest feature vector obtained. This is also visualized in Figure 6 and Figure 7 below.

Table 2

The applied feature selection / extraction                      Feature vector size
                                                                Unigram    Bigram
Original feature vector                                         12974      165788
After removing stop words                                       12733      165547
After removing patterns: [0-9]*                                 12294      165349
After removing patterns: [.]                                    12280      165311
After removing patterns: [..]                                   11805      164765
After removing patterns: (.*[^a-z0-9 ]+.*)                      8081       146376
After removing patterns: ([a-z] .*)|(.* [a-z])                  n/a        132961
After removing patterns: ([a-z][a-z] .*)|(.* [a-z][a-z])        n/a        100665
After removing patterns: (.* [0-9]+)                            n/a        97631
After applying information gain attribute selection             1034       3740
After applying principal components analysis                    55         58


Figure 6

Figure 7


5.4 Results of the classifiers

5.4.1 Unigram

NaiveBayes without attributes selection

=== Summary ===

Correctly Classified Instances 179 89.5 %

Incorrectly Classified Instances 21 10.5 %

Kappa statistic 0.79

Mean absolute error 0.105

Root mean squared error 0.324

Relative absolute error 21 %

Root relative squared error 64.8074 %

Coverage of cases (0.95 level) 89.5 %

Mean rel. region size (0.95 level) 50 %

Total Number of Instances 200

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class

0.930 0.140 0.869 0.930 0.899 0.792 0.917 0.864 +

0.860 0.070 0.925 0.860 0.891 0.792 0.948 0.920 -

Weighted Avg. 0.895 0.105 0.897 0.895 0.895 0.792 0.933 0.892

=== Confusion Matrix ===

a b <-- classified as

93 7 | a = +

14 86 | b = -

NaiveBayes after attribute selection (using Information gain only)

=== Summary ===

Correctly Classified Instances 186 93 %

Incorrectly Classified Instances 14 7 %






Coverage of cases (0.95 level) 93 %





0.950 0.090 0.913 0.950 0.931 0.861 0.930 0.893 +

0.910 0.050 0.948 0.910 0.929 0.861 0.930 0.908 -

Weighted Avg. 0.930 0.070 0.931 0.930 0.930 0.861 0.930 0.900



95 5 | a = +

9 91 | b = -


NaiveBayes after using Principal components analysis (PCA)

=== Summary ===






Relative absolute error 25.3785 %



Mean rel. region size (0.95 level) 53.7129 %




0.941 0.198 0.826 0.941 0.880 0.750 0.957 0.958 +

0.802 0.059 0.931 0.802 0.862 0.750 0.957 0.959 -

Weighted Avg. 0.871 0.129 0.879 0.871 0.871 0.750 0.957 0.959



95 6 | a = +

20 81 | b = -

SVM without attributes selection

=== Summary ===

Correctly Classified Instances 186 93 %

Incorrectly Classified Instances 14 7 %






Coverage of cases (0.95 level) 93 %





0.950 0.090 0.913 0.950 0.931 0.861 0.930 0.893 +

0.910 0.050 0.948 0.910 0.929 0.861 0.930 0.908 -

Weighted Avg. 0.930 0.070 0.931 0.930 0.930 0.861 0.930 0.900



95 5 | a = +

9 91 | b = -


SVM after attributes selection (using information gain)

=== Summary ===













0.960 0.070 0.932 0.960 0.946 0.890 0.945 0.915 +

0.930 0.040 0.959 0.930 0.944 0.890 0.945 0.927 -

Weighted Avg. 0.945 0.055 0.945 0.945 0.945 0.890 0.945 0.921



96 4 | a = +

7 93 | b = -

SVM after using Principal components analysis (PCA)

=== Summary ===













0.980 0.139 0.876 0.980 0.925 0.848 0.921 0.869 +

0.861 0.020 0.978 0.861 0.916 0.848 0.921 0.911 -

Weighted Avg. 0.921 0.079 0.927 0.921 0.921 0.848 0.921 0.890



99 2 | a = +

14 87 | b = -


5.4.2 Bigram

NaiveBayes without attributes selection

=== Summary ===













0.930 0.100 0.903 0.930 0.916 0.830 0.920 0.883 +

0.900 0.070 0.928 0.900 0.914 0.830 0.938 0.907 -

Weighted Avg. 0.915 0.085 0.915 0.915 0.915 0.830 0.929 0.895



93 7 | a = +

10 90 | b = -

NaiveBayes after attribute selection (with IG)

=== Summary ===













0.970 0.160 0.858 0.970 0.911 0.817 0.913 0.855 +

0.840 0.030 0.966 0.840 0.898 0.817 0.904 0.891 -

Weighted Avg. 0.905 0.095 0.912 0.905 0.905 0.817 0.908 0.873



97 3 | a = +

16 84 | b = -


NaiveBayes after using Principal components analysis (PCA)

=== Summary ===









Mean rel. region size (0.95 level) 51.2376 %




0.980 0.050 0.952 0.980 0.966 0.931 0.982 0.973 +

0.950 0.020 0.980 0.950 0.965 0.931 0.983 0.986 -

Weighted Avg. 0.965 0.035 0.966 0.965 0.965 0.931 0.983 0.980



99 2 | a = +

5 96 | b = -

SVM without attributes selection

=== Summary ===

Correctly Classified Instances 129 64.5 %












1.000 0.710 0.585 1.000 0.738 0.412 0.645 0.585 +

0.290 0.000 1.000 0.290 0.450 0.412 0.645 0.645 -

Weighted Avg. 0.645 0.355 0.792 0.645 0.594 0.412 0.645 0.615



100 0 | a = +

71 29 | b = -


SVM after attributes selection (IG)

=== Summary ===













0.980 0.070 0.933 0.980 0.956 0.911 0.955 0.925 +

0.930 0.020 0.979 0.930 0.954 0.911 0.955 0.945 -

Weighted Avg. 0.955 0.045 0.956 0.955 0.955 0.911 0.955 0.935



98 2 | a = +

7 93 | b = -

SVM after using Principal components analysis (PCA)

=== Summary ===













0.980 0.050 0.952 0.980 0.966 0.931 0.965 0.943 +

0.950 0.020 0.980 0.950 0.965 0.931 0.965 0.956 -

Weighted Avg. 0.965 0.035 0.966 0.965 0.965 0.931 0.965 0.949



99 2 | a = +

5 96 | b = -


5.5 Comparison

The following table (Table 3) summarizes the classification results of the classifiers used in this work, which are also shown in the figures (Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12).

Table 3

N-Gram    Attributes                     Classifier/Lex   Accuracy    RMS        Precision   Recall     F-Measure
Uni       All Attributes                 Naive Bayes      89.50%      0.324      0.897       0.895      0.895
                                         SVM              93.00%      0.2646     0.931       0.93       0.93
          Selected Attrs (Info. Gain)    Naive Bayes      89.50%      0.3242     0.902       0.895      0.895
                                         SVM              94.50%      0.2345     0.945       0.945      0.945
          Using PCA                      Naive Bayes      87.13%      0.3397     0.879       0.871      0.871
                                         SVM              92.08%      0.2814     0.927       0.921      0.921
Bi+Uni    All Attributes                 Naive Bayes      91.50%      0.2915     0.915       0.915      0.915
                                         SVM              64.50%      0.5958     0.792       0.645      0.594
          Selected Attrs (Info. Gain)    Naive Bayes      90.50%      0.3082     0.912       0.905      0.905
                                         SVM              95.50%      0.2121     0.956       0.955      0.955
          Using PCA                      Naive Bayes      (96.53%)    (0.1802)   (0.966)     (0.965)    (0.965)
                                         SVM              (96.53%)    0.1862     (0.966)     (0.965)    (0.965)
Lexicon                                  SentiWordNet     72.50%      0.5244     0.6552      0.95       0.776
                                         Bing Liu’s       76.00%      0.4899     0.6857      0.96       0.8


5.6 Charts

Figure 8: Accuracy

Figure 9: Root mean squared error (RMS)

Figure 10: Precision

Figure 11: Recall

Figure 12: F-Measure

Each chart plots two series (Naive Bayes/SWN and SVM/Bing Liu’s) over the categories U-A, U-S, U-P, B-A, B-S, B-P and Lex, where:

U-A = Unigram - All attributes

U-S = Unigram - Selected attributes

U-P = Unigram - PCA

B-A = Bigram - All attributes

B-S = Bigram - Selected attributes

B-P = Bigram - PCA

Lex = Lexicon


5.7 UI for predictions preview

To make it easier to display the detailed predictions of each classifier/lexicon, we developed an application in C# (see Figure 13) that navigates through the product reviews, lists the prediction of every classifier/lexicon for a product review, and compares it with the review's actual class.

The test set used in this work (200 product reviews in the cameras domain) is loaded into this application, so we can scan through each of the 200 product reviews to display its actual class and the predictions of the different classifiers used in this work.

Figure 13

As shown in Figure 13, the displayed review has a negative actual class. It is classified correctly by Naïve Bayes with all attributes (on both unigram and bigram feature vectors), by Naïve Bayes after attribute selection with information gain (unigram and bigram), by SVM on both unigram and bigram feature vectors with information gain, with PCA, and without attribute selection, by Naïve Bayes with PCA on the bigram feature vector, and by the SentiWordNet lexicon.

The same review is misclassified by Naïve Bayes with PCA on the unigram feature vector and by Bing Liu’s opinion lexicon.
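The comparison the preview application performs for a single review can be sketched as follows. This is a minimal illustration, not the C# application's code: the dictionary keys are made-up display names, only a subset of the configurations is listed, and the labels simply restate the outcome described for the review in Figure 13.

```python
# Hypothetical labels for the review shown in Figure 13; the keys are
# illustrative names, not identifiers from the C# application.
ACTUAL = "negative"
predictions = {
    "NB / unigram / all attributes": "negative",
    "NB / unigram / PCA": "positive",
    "NB / bigram / PCA": "negative",
    "SVM / unigram / IG": "negative",
    "SentiWordNet": "negative",
    "Bing Liu's lexicon": "positive",
}

def split_by_correctness(actual, predicted):
    """Partition classifier names into those matching the actual class and those not."""
    correct = [name for name, label in predicted.items() if label == actual]
    wrong = [name for name, label in predicted.items() if label != actual]
    return correct, wrong

correct, wrong = split_by_correctness(ACTUAL, predictions)
print("correct:", correct)
print("misclassified:", wrong)
```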


5.8 Application for live sentiment analysis

Because analyzing live product reviews is interesting in its own right, we developed a Java application (see Figure 14) that fetches all the reviews of a product from its web page and analyzes them. The product is identified by its URL on the Amazon or GSMArena websites; alternatively, the user can type a review manually. Using the chosen lexicon (SentiWordNet or Bing Liu's), the application analyzes the review(s), displays how positive and/or negative they are, and reports a score for every reviews file or URL.

Figure 14
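The source does not describe how the application turns lexicon matches into a score, so the following is only a minimal sketch of one common lexicon-based scheme: count positive and negative lexicon words and report their shares. The tiny word sets and the `score_review` helper are hypothetical stand-ins for the real lexicons and the Java code.

```python
import re

# Toy lexicon for illustration only; the real lexicons hold thousands of entries.
POSITIVE = {"good", "great", "excellent", "sharp", "love"}
NEGATIVE = {"bad", "poor", "blurry", "slow", "hate"}

def score_review(text):
    """Return the share of positive and negative lexicon hits and an overall label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    matched = pos + neg
    if matched == 0:
        # No lexicon word found: the review cannot be scored either way.
        return {"positive": 0.0, "negative": 0.0, "label": "neutral"}
    label = "positive" if pos > neg else "negative" if neg > pos else "neutral"
    return {"positive": pos / matched, "negative": neg / matched, "label": label}

result = score_review("Great lens and sharp pictures, but the autofocus is slow.")
print(result)
```

A scheme like this ignores context such as negation, which is one reason a lexicon alone tends to fall short of the supervised classifiers.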

The input of this application can be a reviews file in .arff format or the URL of a product reviews page on Amazon.com or gsmarena.com.

The output of the application can take three forms (see Figure 15):

First, it may be exported to an external XML file.

Second, the sentiment breakdown of the document or URL page may be shown as a pie chart.

Third, the results may be shown as a table.
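As an illustration of the first output form, per-review results could be serialized to XML along these lines. The element and attribute names (`reviews`, `review`, `label`, `score`, `source`) and the file name are assumptions made for this sketch; the source does not show the schema the Java application actually writes.

```python
import xml.etree.ElementTree as ET

def results_to_xml(source, results):
    """Serialize (label, score) pairs to an XML string; the schema is hypothetical."""
    root = ET.Element("reviews", source=source)
    for index, (label, score) in enumerate(results, start=1):
        review = ET.SubElement(root, "review", id=str(index))
        ET.SubElement(review, "label").text = label
        ET.SubElement(review, "score").text = f"{score:.2f}"
    return ET.tostring(root, encoding="unicode")

xml_text = results_to_xml("reviews.arff", [("positive", 0.67), ("negative", -0.5)])
print(xml_text)
```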


Figure 15


Chapter 6 Conclusion

The experimental work shows that investing effort in the preprocessing phase and carefully applying appropriate attribute extraction and attribute selection methods leads to better classification results, even with fewer features and a lower classification cost.

In this case, principal component analysis (PCA) proved well suited to attribute selection for text classification, giving the highest classification accuracy, precision, recall and F-measure.

In general, we applied two different approaches to sentiment analysis: the opinion lexicon approach (SentiWordNet and Bing Liu’s) and the supervised machine learning approach (Naive Bayes and SVM).

The supervised machine learning approach delivered high-quality results, reaching 96.53% for product reviews, with precision of 88 to 96.6% and accuracy of 87 to 96.5% for cameras and photos product reviews, compared with the relatively low measures given by the opinion lexicon approach. An exception in the reported results is SVM with all attributes on the Bi+Uni feature vectors, which reached only 64.50% accuracy.

Lexicon approaches classified poorly because, as mentioned in chapter one, an opinion lexicon is necessary but not sufficient for sentiment classification.

However, from our initial experience with sentiment detection, we have identified a few areas of potentially substantial improvement in opinion lexicon classification. First, we expect that applying negation detection would improve polarity detection with the opinion lexicon approach and therefore the analysis results. Second, more advanced sentiment patterns currently require a fair amount of manual validation. Although some human expert involvement may be inevitable in the validation to handle the semantics accurately, we plan further research on increasing the accuracy of the sentiment analysis.
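The negation detection proposed above was not implemented in this work. One simple way it could be approached is sketched below, under stated assumptions: a negator flips the polarity of lexicon words among the next few tokens. The word sets, the window size of 3 and the `polarity` helper are all hypothetical choices for illustration.

```python
import re

# Toy word lists for illustration only.
POSITIVE = {"good", "great", "sharp", "recommend"}
NEGATIVE = {"bad", "poor", "blurry", "slow"}
NEGATORS = {"not", "no", "never", "don't", "doesn't", "isn't", "wasn't"}
WINDOW = 3  # assumed scope: a negator affects the next 3 tokens

def polarity(text, handle_negation=True):
    """Sum +1/-1 per lexicon word, flipping the sign inside a negation window."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate_left = 0  # how many upcoming tokens are still under negation
    for token in tokens:
        if token in NEGATORS:
            negate_left = WINDOW
            continue
        value = (token in POSITIVE) - (token in NEGATIVE)
        if handle_negation and negate_left > 0:
            value = -value
        score += value
        negate_left = max(0, negate_left - 1)
    return score

print(polarity("The pictures are not good", handle_negation=False))
print(polarity("The pictures are not good", handle_negation=True))
```

A fixed window is crude: it can flip words the negator does not actually govern, so a real implementation would likely need syntactic scope or punctuation boundaries.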

Beyond the potential improvements above, some issues in opinion lexicon classification remain very hard for researchers to solve; some of them are discussed in chapter two, section 2.4.


References

[1] Narendra Ahuja, Ming-Hsuan Yang. "A Geometric Approach to Train Support Vector Machines." Proceedings of the 2000 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2000), vol. 1, pp. 430-437, Hilton Head Island, June 2000.

[2] Bernhard E. Boser, Isabelle M. Guyon, Vladimir Vapnik. "A Training Algorithm for Optimal Margin Classifiers." Fifth Annual Workshop on Computational Learning Theory, ACM Press, Pittsburgh, 1992.

[3] Christopher J.C. Burges, Alexander J. Smola, Bernhard Schölkopf (editors). Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, USA, 1999.

[4] Christopher J.C. Burges. "A Tutorial on Support Vector Machines for Pattern Recognition." Data Mining and Knowledge Discovery 2, 121-167, 1998.

[5] Nello Cristianini, John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[6] G. W. Flake, S. Lawrence. "Efficient SVM Regression Training with SMO." NEC Research Institute (submitted to Machine Learning, special issue on Support Vector Machines), 2000.

[7] Robert Freund, Federico Girosi, Edgar Osuna. "Training Support Vector Machines: an Application to Face Detection." IEEE Conference on Computer Vision and Pattern Recognition, pages 130-136, 1997a.

[8] Robert Freund, Federico Girosi, Edgar Osuna. "An Improved Training Algorithm for Support Vector Machines." In J. Principe, L. Giles, N. Morgan, and E. Wilson, editors, Neural Networks for Signal Processing VII - Proceedings of the 1997 IEEE Workshop, pages 276-285, New York, 1997b.

[9] Thorsten Joachims. "Text Categorization with Support Vector Machines: Learning with Many Relevant Features." 1998.

[10] Thorsten Joachims. "Making Large-Scale SVM Learning Practical." 1999 (Chapter 11 of (Burges, 1999)).

[11] Linda Kaufman. "Solving the Quadratic Programming Problem Arising in Support Vector Classification." 1999 (Chapter 10 of (Burges, 1999)).

[12] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.R.K. Murthy. "A fast iterative nearest point algorithm for support vector machine classifier design." Technical Report TR-ISL-99-03, Intelligent Systems Lab, Dept. of Computer Science & Automation, Indian Institute of Science, 1999a.

[13] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.R.K. Murthy. "Improvements to Platt's SMO algorithm for SVM classifier design." Technical report, Dept. of CSA, IISc, Bangalore, India, 1999b.

[14] John C. Platt. "Fast Training of Support Vector Machines using Sequential Minimal Optimization." (Chapter 12 of (Burges, 1999)).

[15] Robert Vanderbei. "Loqo: An Interior Point Code for Quadratic Programming." Technical Report SOR 94-15, Princeton University, 1994.

[16] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

[17] Vladimir Vapnik, Corinna Cortes. "Support vector networks." Machine Learning, vol. 20, pp. 273-297, 1995.

[18] Stefano Baccianella, Andrea Esuli, Fabrizio Sebastiani. "SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining."

[19] T. Mitchell. Machine Learning. McGraw Hill, 1997.

[20] K. Chai, H. T. Ng, H. L. Chieu. "Bayesian Online Classifiers for Text Classification and Filtering." Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 2002, pp. 97-104.

[21] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2003.

[22] H. Abdi, L.J. Williams. "Principal component analysis." Wiley Interdisciplinary Reviews: Computational Statistics, 2: 433-459, 2010.

[23] David L. Olson, Dursun Delen. Advanced Data Mining Techniques. Springer, 1st edition (February 1, 2008), page 138, ISBN 3-540-76916-1.

[24] David M. W. Powers. "Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation." Journal of Machine Learning Technologies 2 (1): 37-63, 2007/2011.
