
The How, When and Why of Sentiment Analysis

Mrs. Vijyalaxmi M*, Mrs. Shalu Chopra*, Mrs. Sangeeta Oswal*, Mrs. Deepshikha Chaturvedi*
* Department of Information Technology, VESIT, Mumbai

Abstract: With the explosive growth of social media on the web, individuals and organizations are increasingly using the content of these media for decision making. Nowadays, someone who wants to buy a consumer product tends to consult user reviews and discussions of the product in public web forums. As a result, opinion mining has gained importance. This online word-of-mouth represents a new and measurable source of information with many applications. The process of identifying and extracting subjective information from raw data is known as sentiment analysis. This paper presents a survey of sentiment analysis, or opinion mining. The existing literature on opinion mining is explained in detail. We take an application-oriented approach and discuss various domains where opinion mining would be useful. Finally, we put forward research challenges in this field.

1. INTRODUCTION

Nowadays, social media has become a platform for people to convey their voice to the public. The Internet has rapidly advanced from a static to an interactive medium. Today's users can not only obtain information but also actively generate content. News reports, BBS, forums, blogs, etc. are the main sources of public opinion information. The text contains both facts and opinions, and the opinions can be extracted using natural language processing. Opinions are usually subjective expressions that describe people's sentiments, appraisals or feelings toward entities, events and their properties. Opinion mining is a sub-discipline of computational linguistics that focuses on extracting people's opinions from the web.

When it comes to sentiment, opinion or emotion, we are not concerned with the topic of the text but with the positive or negative opinion it expresses. People can freely express their opinions about any product, service, news item or organization in social media through reviews, blogs, microblogs, forum discussions and social network sites. All these platforms provide a huge amount of valuable information that we are interested in analyzing.

Given a piece of text, opinion-mining systems analyze:

· Which part expresses an opinion?

· Who wrote the opinion?

· What is being commented on?

Sentiment analysis, on the other hand, is about

determining the subjectivity, polarity (positive or

negative) and polarity strength (weakly positive,

mildly positive, strongly positive, etc.) of a piece

of text.

Sentiment analysis has found application in almost every domain. Any individual who wants to buy a product or use a service will check the opinions or reviews of others. Likewise, an organization would like to know the sentiment of its customers in order to improve its service or product.

2. LITERATURE SURVEY

Sentiment analysis can be done at three levels: document level, sentence level, and entity and aspect level. Document-level analysis determines whether the document as a whole expresses a positive or negative opinion about a single entity. In sentence-level analysis, each sentence in the document is analyzed to determine whether it expresses a positive, negative or neutral opinion [1]. An aspect-based opinion polling system takes as input a set of textual reviews and some predefined aspects, and identifies the polarity of each aspect from each review to produce an opinion poll [2]. It is based on the idea that an opinion consists of a sentiment and a target on which the opinion has been expressed. For instance, in a product review sentence, it identifies the product features that the reviewer has commented on and determines whether the comments are positive or negative. For example, in the sentence “The battery life of this camera is too short,” the comment is on the “battery life” of the camera object (the target) and the opinion is negative. Many real-life applications require this level of detailed analysis because, in order to make product improvements, one needs to know which components and/or features of the product are liked and disliked by consumers. Such information is not discovered by sentiment and subjectivity classification [1].

Sangeeta Oswal et al, Int. J. Computer Technology & Applications, Vol 4 (4), 660-665
IJCTA | July-August 2013
ISSN: 2229-6093

The sentiment expressed can be a direct opinion or a comparative opinion on entities. Comparisons are related to, but also quite different from, direct opinions. For example, a typical direct opinion sentence is “the picture quality of Camera X is great”, while a typical comparative sentence is “the picture quality of Camera X is better than that of Camera Y.” We can see that comparisons use different language constructs from direct opinions. A comparison typically expresses a comparative opinion on two or more entities with regard to their shared features or attributes, e.g., “picture quality”. A comparative opinion is a sextuple of the form (E1, E2, A, PE, h, t), where E1 and E2 are the entity sets being compared based on their shared aspects A (entities in E1 appear before entities in E2 in the sentence), PE (∈ {E1, E2}) is the preferred entity set of the opinion holder h, and t is the time when the comparative opinion is expressed [6]. A comparative sentence expresses a relation based on similarities or differences of more than one entity. There are several types of comparisons, which can be grouped into two main categories: gradable and non-gradable comparisons. The types of gradable comparison are: 1) non-equal gradable, which expresses a total ordering (greater than or less than) of some entities with regard to their shared features; 2) equative, which states that two objects are equal with respect to some features; and 3) superlative, which ranks one object over all others. Non-gradable comparisons compare features of two or more entities but do not explicitly grade them [3].
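As an illustration only, the sextuple form (E1, E2, A, PE, h, t) of a comparative opinion can be held in a small record type. The Python sketch below encodes the comparative sentence about Camera X and Camera Y. The class name `ComparativeOpinion`, its field names, and the holder and time values are hypothetical and chosen for this example; they are not taken from [6].

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ComparativeOpinion:
    """Sextuple (E1, E2, A, PE, h, t); field names are illustrative."""
    e1: FrozenSet[str]         # entity set mentioned first in the sentence
    e2: FrozenSet[str]         # entity set mentioned second
    aspects: FrozenSet[str]    # shared aspects A being compared
    preferred: FrozenSet[str]  # PE, the preferred entity set (E1 or E2)
    holder: str                # opinion holder h
    time: str                  # time t when the opinion was expressed

    def __post_init__(self) -> None:
        # PE must be one of the two compared entity sets.
        if self.preferred not in (self.e1, self.e2):
            raise ValueError("preferred entity set must be E1 or E2")


# "The picture quality of Camera X is better than that of Camera Y."
opinion = ComparativeOpinion(
    e1=frozenset({"Camera X"}),
    e2=frozenset({"Camera Y"}),
    aspects=frozenset({"picture quality"}),
    preferred=frozenset({"Camera X"}),
    holder="reviewer",   # hypothetical holder
    time="2013-07-01",   # hypothetical date
)
```

Keeping the preferred set as a separate field, rather than inferring it from the order of E1 and E2, matters because a sentence such as "Camera Y is worse than Camera X" mentions the preferred entity second.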

The opinion can be explicit or implicit within the document; an explicit opinion is easier to detect. For example, “Coke tastes good” is an explicit opinion, while “I bought the mattress a week ago, and a valley has formed” expresses an implicit opinion.

The rest of the paper is organized as follows: Section 3 describes the opinion mining process. Section 4 covers applications of opinion mining and sentiment analysis, and Section 5 discusses the evaluation of results. Section 6 focuses on research challenges in this field, and Section 7 concludes the paper.

3. OPINION MINING PROCESS

The opinion mining process is shown in Figure 1 below. The raw data is collected from various social media; one can also write a crawler to extract the data. Data sets available online for research work are listed below.

Hotel review data set from TripAdvisor (http://sifaka.cs.uiuc.edu/~wang296/LARA/TripAdvisor)

MP3 review data set from Amazon (http://sifaka.cs.uiuc.edu/~wang296/LARA/Amazon/mp3)

Review data set from Amazon (http://liu.cs.uic.edu/download/data)

Review data set from Epinions (http://www.sfu.ca/~sam39/Datasets/EpinionsReviews)

After data collection, we preprocess the data to obtain a structured set of reviews so that classification techniques can be applied to classify each opinion as positive, negative or neutral. This paper does not focus on the lexicon-based approach, which uses dictionaries of words annotated with each word's semantic orientation, or polarity. It focuses mainly on the text classification approach, which involves building classifiers from labeled instances of texts or sentences [5] and is essentially a supervised classification task.






Supervised classification is discussed in detail below, with algorithms such as SVM, naïve Bayes and nearest neighbour.

3.1 Supervised classification:

Sentiment classification is usually formulated as a

two-class classification problem, positive and

negative. Training and testing data used are

normally product reviews. Any existing supervised

learning method can be applied, e.g., naïve Bayes

classification, and support vector machines (SVM)

(Joachims, 1999; Shawe-Taylor and Cristianini,

2000). Pang, Lee and Vaithyanathan (2002) was

the first paper to take this approach to classify

movie reviews into two classes, positive and

negative.

Figure 1: Opinion Mining Process

[Flow diagram: unstructured raw text (blogs, forum discussions, reviews) → data preprocessing → text classification approach (supervised classification, unsupervised classification; features: n-gram, tf-idf, POS, PMI) → sentiment classification (positive, negative, neutral) → sentiment application (forecasting, prediction, business intelligence, decision making)]




Feature Extraction: This is the process in which properties (features) are extracted from the data, because the whole input data is too large to use directly in classification.

N-gram model: An n-gram is a contiguous sequence of n items from a textual or spoken source. In the case of unigrams (n=1), each text is a document and is split up into words. The term frequency tf(w,d) is the number of times that a word w occurs in a document d:

$$\mathrm{tf}(w,d) = \bigl|\{\, i : d_i = w \,\}\bigr| \qquad \text{(eq.1)}$$

where $d_i$ denotes the i-th word of d. The term presence tp(w,d) only checks whether a word w is present within a document d, which results in a binary value:

$$\mathrm{tp}(w,d) = \begin{cases} 1 & \text{if } w \in d \\ 0 & \text{otherwise} \end{cases} \qquad \text{(eq.2)}$$

However, frequent words are not necessarily good features for classification. If a highly frequent word is uniformly distributed over the classes, its discriminative power is low.

In the case of n=2 (bigrams), the items consist of two consecutive words, i.e. the set contains all pairs of words that are consecutive in the original text. For example, “This car is good” can be split into the bigrams 'this car', 'car is', 'is good'.

The same holds for n=3 (trigrams) and higher values of n.
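The n-gram, term frequency and term presence features described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a deliberately naive tokenizer that lowercases and splits on whitespace; the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequency(word: str, tokens: List[str]) -> int:
    """tf(w, d): number of times the word occurs in the document."""
    return Counter(tokens)[word]


def term_presence(word: str, tokens: List[str]) -> int:
    """tp(w, d): 1 if the word occurs in the document, else 0."""
    return 1 if word in tokens else 0


doc = tokenize("This car is good")
print(ngrams(doc, 2))  # [('this', 'car'), ('car', 'is'), ('is', 'good')]
```

A real pipeline would normally also strip punctuation and handle negation before building n-grams; those steps are left out here to keep the sketch short.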

Tf-idf Measure: The tf-idf (term frequency-inverse document frequency) measure is a statistic that reflects the importance of a word to a document within a set of documents. The term frequency is given in eq.1, and the document frequency of a word, the number of documents in which it is present, follows from the term presence in eq.2. The inverse document frequency measures the rareness of a word across all the documents: the higher the inverse document frequency, the rarer the word across the set of documents.

The inverse document frequency idf(w,D) of the word w across all documents D is

$$\mathrm{idf}(w,D) = \log \frac{|D|}{\bigl|\{\, d \in D : w \in d \,\}\bigr|} \qquad \text{(eq.3)}$$

The tf-idf value is high when a word occurs often in a document but does not occur often across all documents. However, it remains a question whether tf-idf is a good measure for feature selection.
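A small sketch may make the behaviour of tf-idf concrete. It assumes one common formulation: raw term counts multiplied by the natural logarithm of the number of documents divided by the number of documents containing the word. Other weightings and smoothing schemes exist, and the toy corpus is hypothetical.

```python
import math
from typing import List


def idf(word: str, docs: List[List[str]]) -> float:
    """idf(w, D) = log(|D| / number of documents containing w)."""
    df = sum(1 for d in docs if word in d)
    if df == 0:
        raise ValueError(f"{word!r} does not occur in any document")
    return math.log(len(docs) / df)


def tf_idf(word: str, doc: List[str], docs: List[List[str]]) -> float:
    """Raw term frequency multiplied by inverse document frequency."""
    return doc.count(word) * idf(word, docs)


# Hypothetical toy corpus of three tokenized reviews.
corpus = [
    "the battery is great great".split(),
    "the screen is dull".split(),
    "the price is fair".split(),
]
print(tf_idf("the", corpus[0], corpus))    # 0.0, the word is in every document
print(tf_idf("great", corpus[0], corpus))  # 2 * log(3), frequent here, rare elsewhere
```

The word "the" scores zero because it appears in every document, while "great" scores highest in the one review that repeats it.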

Part-of-speech Tagger: A POS tagger marks up each word in a text with its corresponding part of speech. The idea behind this is that only a limited set of words in a sentence indicates the sentiment; these are referred to as sentiment words. In English, examples of parts of speech are noun, verb, adverb and adjective. POS tagging is a commonly used technique and is applied in several papers [5, 10].

3.2 Classification Algorithm:

Classification algorithms such as nearest neighbour, naïve Bayes, maximum entropy and support vector machines (SVM) are applied in many domains.

Nearest Neighbour:

It is applied in domains where the number of dimensions is low. It is known from the literature [3] that nearest neighbour has more difficulties in higher-dimensional spaces.
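For illustration, a k-nearest-neighbour classifier over a low-dimensional feature space can be sketched as follows. The two features (counts of positive and negative words per review), the training points and the choice of Euclidean distance with k = 3 are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def knn_predict(train: List[Tuple[Sequence[float], str]],
                query: Sequence[float], k: int = 3) -> str:
    """Label the query with the majority class of its k nearest points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D features: (positive-word count, negative-word count).
training_set = [
    ((3, 0), "positive"), ((2, 1), "positive"), ((4, 1), "positive"),
    ((0, 3), "negative"), ((1, 2), "negative"), ((1, 4), "negative"),
]
print(knn_predict(training_set, (3, 1)))  # positive
```

With only two dimensions the distances are informative; with a full bag-of-words vocabulary most points end up roughly equally far apart, which is one reason the method struggles in high-dimensional spaces.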

Naïve Bayes

The naïve Bayes algorithm uses Bayes' theorem. The expression P(C|F) denotes the conditional probability of C given F, where C is a class label and F a feature.

$$P(C \mid F) = \frac{P(F \mid C)\, P(C)}{P(F)} \qquad \text{(eq.4)}$$

It allows an unknown conditional probability to be calculated from a known conditional probability together with the prior probabilities. It is assumed that the presence of a feature is unrelated to the presence of any other feature, given the class.

The major advantage of naïve Bayes models is that a relatively small training set is sufficient to train the model. It is a good model to use as a reference for testing the quality of other models. The naïve Bayes classifier has been applied in many papers on sentiment analysis.
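The following sketch shows one way a naïve Bayes sentiment classifier could be written from scratch. It assumes a multinomial model over word counts with add-one (Laplace) smoothing, works in log space, and drops the denominator P(F) because it is the same for every class. The training sentences and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_nb(examples: List[Tuple[List[str], str]]):
    """Collect class counts and per-class word counts from labeled tokens."""
    class_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens: List[str]) -> str:
    """Return the class maximizing log P(C) + sum of log P(w | C)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in tokens:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][w] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
data = [
    ("great battery and great screen".split(), "positive"),
    ("good price".split(), "positive"),
    ("poor battery".split(), "negative"),
    ("bad screen and poor sound".split(), "negative"),
]
model = train_nb(data)
print(predict_nb(model, "great price".split()))  # positive
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long documents.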

Support Vector Machine: SVMs exist in different forms, linear and non-linear. An SVM is a supervised classifier. In the ideal situation the classes are linearly separable, and a line can be found which splits the two classes perfectly.

In practice the classes are usually not linearly separable; in such cases a higher-order function can split the data set. A function is applied to the data set which maps the points of the non-linearly separable data set to points in a space where they are linearly separable.

Although it is possible to create a model that perfectly separates the data, this is not desirable because such models overfit the training data.

SVM classifiers are applied in many papers [5, 8, 9, 10] and are very popular in recent research. This popularity is due to their good overall empirical performance.

Comparing the naïve Bayes and SVM classifiers, the SVM has been applied the most.
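The idea of mapping non-linearly separable data into a space where a linear boundary works can be shown with a toy example. The sketch below is not an SVM: the separating line is picked by hand rather than learned, and practical SVMs usually apply such mappings implicitly through kernel functions. The data points and the mapping phi(x) = (x, x²) are assumptions for illustration.

```python
from typing import List, Tuple


def feature_map(x: float) -> Tuple[float, float]:
    """Map a 1-D point to 2-D: phi(x) = (x, x^2)."""
    return (x, x * x)


def linear_classify(point: Tuple[float, float]) -> int:
    """Hand-picked linear rule in the mapped space: sign of (x2 - 1)."""
    return 1 if point[1] - 1.0 > 0 else -1


# Hypothetical 1-D data: class +1 lies on both sides of class -1,
# so no single threshold on x separates the two classes.
samples: List[Tuple[float, int]] = [
    (-2.0, 1), (-1.5, 1), (-0.5, -1), (0.0, -1), (0.5, -1), (1.5, 1), (2.0, 1),
]

predictions = [linear_classify(feature_map(x)) for x, _ in samples]
labels = [y for _, y in samples]
print(predictions == labels)  # True
```

After the mapping, the horizontal line x² = 1 separates the classes perfectly, although no single cut on the original axis could.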

3.3 Unsupervised classification

Pointwise mutual information (PMI) is a simple association measure which can be used for unsupervised learning. The classification is based on the average semantic orientation, which makes use of reference words for the most positive and the most negative association.

$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} \qquad \text{(eq.5)}$$

This PMI measure is used in a semantic orientation function SO, which formalizes the difference between the association of a word with the positive reference word $w^{+}$ and its association with the negative reference word $w^{-}$.

$$\mathrm{SO}(w) = \mathrm{PMI}(w, w^{+}) - \mathrm{PMI}(w, w^{-}) \qquad \text{(eq.6)}$$

This semantic orientation function can be applied to all the extracted words in a message. Averaging the semantic orientation values of a message results in a single number, which can be interpreted as the sentiment of the message: the more positive the number, the more positive the sentiment.
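The PMI-based semantic orientation can be sketched as below. The sketch assumes that probabilities are estimated as the fraction of documents containing a word (or both words), that PMI is the base-2 logarithm of the joint probability divided by the product of the individual probabilities, and that "excellent" and "poor" serve as reference words. The tiny corpus is hypothetical and built so that every word pair co-occurs at least once; real data would need smoothing to avoid taking the logarithm of zero.

```python
import math
from typing import List, Set


def prob(docs: List[Set[str]], *words: str) -> float:
    """Fraction of documents that contain all the given words."""
    return sum(1 for d in docs if all(w in d for w in words)) / len(docs)


def pmi(docs: List[Set[str]], w1: str, w2: str) -> float:
    """PMI(w1, w2) = log2(P(w1, w2) / (P(w1) * P(w2)))."""
    return math.log2(prob(docs, w1, w2) / (prob(docs, w1) * prob(docs, w2)))


def semantic_orientation(docs: List[Set[str]], word: str,
                         pos_ref: str, neg_ref: str) -> float:
    """SO(w) = PMI(w, positive reference) - PMI(w, negative reference)."""
    return pmi(docs, word, pos_ref) - pmi(docs, word, neg_ref)


def message_orientation(docs: List[Set[str]], words: List[str],
                        pos_ref: str, neg_ref: str) -> float:
    """Average semantic orientation of the words extracted from a message."""
    scores = [semantic_orientation(docs, w, pos_ref, neg_ref) for w in words]
    return sum(scores) / len(scores)


# Hypothetical corpus, each document reduced to its set of words.
reviews = [
    {"excellent", "sharp"}, {"excellent", "sharp"}, {"excellent", "blurry"},
    {"poor", "blurry"}, {"poor", "blurry"}, {"poor", "sharp"},
]
print(semantic_orientation(reviews, "sharp", "excellent", "poor"))   # about 1.0
print(semantic_orientation(reviews, "blurry", "excellent", "poor"))  # about -1.0
```

In this corpus "sharp" co-occurs twice as often with "excellent" as with "poor", so its orientation is positive, and the reverse holds for "blurry".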

The area of unsupervised classification algorithms is somewhat underdeveloped compared to supervised classification algorithms.

Another unsupervised approach is the lexicon-

based method, which uses a dictionary of

sentiment words and phrases with their associated

orientations and strength, and incorporates

intensification and negation to compute a

sentiment score for each document (Taboada et al.,

2011). This method was originally used in sentence

and aspect-level sentiment classification (Ding, Liu

and Yu, 2008; Hu and Liu, 2004; Kim and Hovy,

2004).
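A very reduced version of the lexicon-based idea can be sketched as follows. The lexicon entries, strengths and intensifier weights are invented for this example, and negation is handled by simply flipping the sign of the next sentiment word, which is a simplification and not the procedure of Taboada et al.

```python
from typing import Dict, List

# Hypothetical mini-lexicon: word -> orientation strength.
LEXICON: Dict[str, float] = {"good": 2.0, "great": 3.0, "bad": -2.0, "awful": -3.0}
INTENSIFIERS: Dict[str, float] = {"very": 1.5, "slightly": 0.5}
NEGATORS = {"not", "never"}


def lexicon_score(tokens: List[str]) -> float:
    """Sum word orientations, scaling by intensifiers and flipping on negation."""
    score = 0.0
    scale, negate = 1.0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
        elif tok in INTENSIFIERS:
            scale *= INTENSIFIERS[tok]
        elif tok in LEXICON:
            value = LEXICON[tok] * scale
            score += -value if negate else value
            scale, negate = 1.0, False  # modifiers apply to one sentiment word
        else:
            scale, negate = 1.0, False  # any other word resets the modifiers
    return score


print(lexicon_score("the screen is not very good".split()))  # -3.0
```

A positive total suggests a positive document and a negative total a negative one; the threshold and the handling of neutral scores are left to the application.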

4. APPLICATION OF OPINION MINING

Some of the applications of sentiment analysis include online advertising, hotspot detection in forums, web blog author's attitude analysis, sentiment filtering, etc.

Opinion mining for recommendation systems:

One possibility is to augment recommendation systems, since such a system might then avoid recommending items that receive a lot of negative feedback.

Opinion mining for Ad Placement

In online systems that display ads as sidebars, it is helpful to detect web pages that contain sensitive content inappropriate for ad placement; for more sophisticated systems, it could be useful to bring up product ads when relevant positive sentiments are detected and, perhaps more importantly, to pull the ads when relevant negative statements are discovered.

Opinion mining in Business Intelligence

When faced with a tremendous amount of online information, information seekers usually find it very difficult to obtain accurate information that is useful to them, which has motivated research in hotspot detection.

Sentiment analysis plays a major role in business intelligence: it can extract and visualize comparative relations between customer reviews and help enterprises discover potential risks and design new products and marketing strategies.

Using an opinion mining approach, we can exploit web content to improve customer relationship management and provide better service to the customer by improving product quality and personalizing the product according to the customer's viewpoint.




Opinion mining for trend prediction

Organizations could perform trend prediction in sales using opinion mining by tracking public viewpoints.

In the stock market, we can analyze the sentiment related to a stock to predict whether its price will go higher or lower, and help the investor decide whether to buy or sell the stock.

Opinion mining for political domain

Opinions matter a great deal in politics. Sentiment analysis has specifically been proposed as a key enabling technology, allowing the automatic analysis of the opinions that people submit about pending policy or government regulation, understanding what voters are thinking, predicting the outcome of elections, etc.

5. EVALUATION

The performance of the different methods used for opinion mining is evaluated by calculating various metrics such as precision, recall and F-measure. Precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. The F-measure is the harmonic mean of precision and recall.
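As a worked illustration, the three metrics can be computed for one target class from a list of gold labels and a list of predicted labels. The labels below are hypothetical, and the F-measure is taken here as the balanced F1 score, the harmonic mean of precision and recall.

```python
from typing import List, Tuple


def precision_recall_f1(gold: List[str], predicted: List[str],
                        target: str = "positive") -> Tuple[float, float, float]:
    """Precision, recall and F1 for one target class."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if p == target and g == target)
    false_pos = sum(1 for g, p in pairs if p == target and g != target)
    false_neg = sum(1 for g, p in pairs if p != target and g == target)
    retrieved = true_pos + false_pos  # instances predicted as target
    relevant = true_pos + false_neg   # instances that truly are target
    precision = true_pos / retrieved if retrieved else 0.0
    recall = true_pos / relevant if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical gold labels and classifier output for five reviews.
gold = ["positive", "positive", "negative", "negative", "positive"]
pred = ["positive", "negative", "negative", "positive", "positive"]
print(precision_recall_f1(gold, pred))
```

Here two of the three reviews predicted positive are truly positive, and two of the three truly positive reviews are found, so precision, recall and F1 all come out at two thirds.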

6. RESEARCH CHALLENGES FOR OPINION MINING

The challenge is to detect sentiment in spoken and written language, which is easy for humans to understand but difficult for computers to detect. Opinions are far harder to describe than facts, as they are short, informally written and highly diverse. The raw text contains misspellings, sarcasm, idioms, abbreviations and poor grammar. Using a computer, sentiment can be analyzed for huge amounts of data in less time, but accuracy is important. A few opinion mining challenges are listed below:

1. Analyzing natural language is difficult

enough. Sarcasm or other forms of derisive

language are extremely problematic for

technologies to interpret.

2. Identifying what the person is actually talking about.

3. Resolving what a phrase refers to. For example, in "We watched the movie and went to dinner; it was awful," what does "it" refer to?

4. Parsing the sentence to find the subject and object to which a verb and/or adjective refers.

5. Opinions on Twitter contain abbreviations, lack capitals, and have poor spelling, punctuation and grammar, and so are difficult to understand.

6. The detection of spam and fake reviews, mainly through the identification of duplicates, the comparison of qualitative with summary reviews, the detection of outliers, and the reputation of the reviewer.

7. Language is another challenge: most of the work done in sentiment analysis is focused on English and Chinese; other languages are yet to be explored.

8. A statement can be positive in one context and negative in another. For example, “fighting” is negative in a war context but positive in a medical one. The sentiment differs across domains.

9. A single word can convey three different opinions (positive, neutral or negative) depending on its use and context.

7. CONCLUSION

Sentiment analysis can be applied across a wide range of domains to classify and summarize reviews and to make predictions. However, finding opinion sources and monitoring them on the Web can still be a difficult task because there are a large number of diverse sources, and each source may also have a huge volume of opinionated text (text with opinions or sentiments). In addition, sentiment analysis is a natural language processing task, which is not an easy problem.




Due to its tremendous value for practical applications, there has been an explosive growth of both research in academia and applications in industry. This paper has explained opinion mining, covering its process, applications and challenges. In future, more work is needed on further improving the performance measures. Sentiment analysis can also be applied to new applications such as depression analysis and sentiment analysis of songs.

REFERENCES

[1] G. Vinodhini and R. M. Chandrasekaran, “Sentiment Analysis and Opinion Mining: A Survey”, International Journal of Advanced Research in Computer Science and Software Engineering, vol. 2, issue 6, June 2012.

[2] J. Zhu, H. Wang, M. Zhu, B. Tsou and Matthew M, “Aspect-Based Opinion Polling from Customer Reviews”, IEEE Transactions on Affective Computing, vol. 2, no. 1, January-March 2011.

[3] M. Ganapathibhotla and Bing Liu, “Mining Opinions in Comparative Sentences”, in Proceedings of the 22nd International Conference on Computational Linguistics, August 2008.

[4] Peter Turney, “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews”, in Proceedings of the 40th Meeting of the Association for Computational Linguistics, pages 417–424, Philadelphia, July 2002.

[5] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, “Thumbs up? Sentiment classification using machine learning techniques”, in Proceedings of the Conference on Empirical Methods in NLP, pages 79–86, Philadelphia, 2002.

[6] Nitin Jindal and Bing Liu, “Mining Comparative Sentences and Relations”, in Proceedings of AAAI-06, the 21st National Conference on Artificial Intelligence, 2006.

[7] B. Pang and L. Lee, “Opinion mining and sentiment analysis”, Foundations and Trends in Information Retrieval, 2(1-2), pp. 1–135, 2008.

[8] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau, “Sentiment analysis of Twitter data”, in Proceedings of the Workshop on Languages in Social Media, Association for Computational Linguistics, 2011, pp. 30-38.

[9] GebreKirstos Gebreselassie Gebremeskel, “Sentiment Analysis of Twitter Posts About News”, Sentiment Analysis, Feb. 2011.

[10] Alec Go, Richa Bhayani, and Lei Huang, “Twitter Sentiment Classification using Distant Supervision”, Processing (2009), pp. 1-6.


