Post on 12-Mar-2020
Sentiment Analysis
Term Paper Report
Submitted in partial fulfilment of the requirement
For
Master of Technology
In
Computer Science and Engineering
Under the guidance
of
Ritika Vern
Research Scholar
USICT, GGSIPU
New Delhi
Submitted By
Varsha Mittal
(00716414816)
M. Tech - CSE (2nd Semester)
University School of Information & Communication Technology
Guru Gobind Singh Indraprastha University
Sector 16C, Dwarka
New Delhi 110078
Contents
1. Certificate
2. Acknowledgement
3. Abstract
4. Introduction to the Topic
5. Related Work
6. Proposed Research Work
7. References
CERTIFICATE
This is to certify that this research work, a study of Sentiment Analysis and its techniques,
submitted by Varsha Mittal, Enrollment no. 00716414816, in partial fulfillment of the
requirement for the award of the degree of Master of Technology in Computer Science, is a
valid work carried out by her under my supervision and guidance. The research work is
original and has not been submitted anywhere else for any other degree.
Name of Student Name of the Guide
Varsha Mittal Ritika Vern
Signature
ACKNOWLEDGEMENT
I express my sincere thanks and deep sense of gratitude to my research mentor, Ritika Vern, for
her valuable motivation and guidance, without which this report would not have been possible.
I consider myself fortunate to have had the opportunity to learn and work under her able
supervision and guidance over the period of association. I have a deep sense of admiration for
her innate goodness.
Varsha Mittal
(00716414816)
M.tech(CSE)
Abstract
Sentiment Analysis is the process of identifying and categorizing opinions expressed in a piece of text, especially
in order to determine whether the writer's attitude towards a particular topic or product is positive, negative or
neutral. With the increasing use of micro-blogging websites such as Twitter, Facebook and other social media,
a large number of reviews are made available online every day. These reviews could be of a product or a movie, or
they can be independent statements describing a situation. Sentiment Analysis is used to classify these statements
as positive or negative. Sentiment Analysis has various benefits. It makes users aware of the positive and negative
features of a product and helps them make effective decisions. Furthermore, SA helps companies gather feedback
from these reviews and improve their products and services wherever necessary. For example, when a person plans
to buy a mobile phone, he tends to scrutinize multiple review sites to read the reviews that other consumers have
written. In this manner, the consumer can get an idea of the features that he may consider important. Analyzing
the reviews available on thousands of sites is a tedious task, and this is where Sentiment Analysis comes into
play. It eases the consumer's task by categorizing the text as positive or negative, which helps in effective
decision making.
Introduction to the Topic
Sentiment Analysis is the process of determining whether the emotion expressed in a piece of writing is positive,
negative or neutral; it is used to identify the writer's attitude. The trend today is to consider the opinions of a
variety of individuals around the globe, as expressed in micro-blogging data, before purchasing an item. Customers
tend to go over many reviews of a particular item before buying it, and Sentiment Analysis makes this task easy
for them. Sentiment Analysis aims to achieve its function in the simplest manner with the help of an existing
approach and an existing algorithm. A number of frameworks have been designed for predicting user sentiment and
the topic of a particular discussion.
Sentiment Analysis is done at three different levels: Document Level, Sentence Level and Feature Level.
Document Level sentiment analysis takes the whole document and classifies it as positive or negative based on
the sentiment expressed by the user. It reduces the whole document to a single score. Analysis is done based on
four pairs of opposing emotions: joy/sadness, acceptance/disgust, anticipation/surprise and fear/anger. The
problem with this analysis is that it hides the most useful insights and prevents clients from drilling down to
extract useful information.
Sentence Level sentiment analysis takes a sentence and determines whether it expresses a positive, negative or
neutral opinion. Neutral usually means no opinion. It is divided into subjectivity classification and sentiment
classification. A sentence can carry two kinds of information, objective and subjective. Subjectivity
classification determines the type of the sentence, separating sentences that express factual information from
sentences that express subjective views and opinions. Sentiment classification then classifies the subjective
information as positive or negative.
Feature Level sentiment analysis looks at the opinion itself and does not take language constructs (documents,
paragraphs, sentences, clauses or phrases) into account. It is based on the idea that an opinion consists of a
sentiment (positive or negative) and a target of the opinion. It consists of three main tasks. The first is
extracting features from the web content. The next is determining the polarity of the opinion. The last is
grouping feature synonyms. This type of classification is also known as word/phrase classification. Document
Level and Sentence Level analysis do not capture every detail of the opinions and facts, and thus Feature Level
analysis is widely used.
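The three Feature Level tasks can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the two small lexicons, the function name feature_opinions and the fixed word window are hypothetical and are not taken from the surveyed work. Real systems learn or curate far larger resources.

```python
# Hypothetical lexicons, for illustration only.
# Keys are surface words; values are the canonical feature they are grouped under.
FEATURE_SYNONYMS = {
    "battery": "battery", "charge": "battery",
    "screen": "screen", "display": "screen",
}
OPINION_WORDS = {"great": 1, "good": 1, "poor": -1, "bad": -1}


def feature_opinions(review, window=2):
    """Map each canonical feature to the summed polarity of nearby opinion words."""
    tokens = review.lower().replace(",", " ").replace(".", " ").split()
    scores = {}
    for i, token in enumerate(tokens):
        # Tasks 1 and 3: extract a feature and group it with its synonyms.
        feature = FEATURE_SYNONYMS.get(token)
        if feature is None:
            continue
        # Task 2: polarity from opinion words within `window` tokens of the feature.
        nearby = tokens[max(0, i - window): i + window + 1]
        polarity = sum(OPINION_WORDS.get(word, 0) for word in nearby)
        scores[feature] = scores.get(feature, 0) + polarity
    return scores


print(feature_opinions("Great display, but the battery is poor."))
```

A review that praises one feature and criticizes another yields one score per feature, which is the detail that a single document-level score would hide.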
A specific model framework is followed throughout the process of Sentiment Analysis.
SENTIMENT ANALYSIS FRAMEWORK
This framework consists of three main steps [1]. The first step is data collection, followed by preprocessing
of the collected data. The last step is classification, which categorizes the processed data as either positive
or negative. Fig. 1 gives a basic overview of the sentiment analysis framework.
Fig. 1: Sentiment Analysis Framework [1]
A. Data Collection. Sentiment Analysis can be done on any data. The data can either be taken from an existing
data set or be extracted from a website. Data sets are available online with thousands of reviews, each labelled
as positive or negative. Extracting data from the web, on the other hand, is a lengthy task, but it allows one to
perform sentiment analysis on data of one's own choice.
B. Pre-Processing. Data extracted from the web contains several syntactic features that may not be useful, and
therefore data cleaning and filtering need to be done. This step removes the raw, noisy parts of the data, and
all the data must be preprocessed before any further processing is carried out. The various pre-processing steps
involved are given below:
1) Removing URLs. URLs are of no use while performing sentiment analysis and can sometimes
lead to false analysis. For example, take "I have logged in to www.happy.com as I am bored...". This
sentence is negative, but because there is one positive word in the URL, it becomes neutral, leading to a
wrong prediction. To avoid false predictions, URLs must be removed.
2) Filtering. Repeated letters in words like "thankuuuuu" are often used to show the depth of
expression. However, such words are absent from the dictionary, so the extra letters need to be
eliminated. This is done on the basis of a rule that a letter cannot repeat itself more than three times;
any letters beyond that are eliminated.
3) Questions. Words like "what", "which" and "how" do not contribute to polarity, so they are
removed to reduce complexity.
4) Removing special characters. To avoid discrepancies during the Sentiment Analysis
process, special characters like '[] {} ()/' should be removed. For example, "it's good:". If these characters
are not eliminated before performing sentiment analysis, they will get combined with the words and those
words will not be recognized. To avoid the situation, removal of such characters is important.
5) Removing stop words and emoticons. Stop words are words that should be excluded before
proceeding with the SA process. They carry little meaning, such as determiners and prepositions
(in, to, from, etc.), and thus need to be filtered out. When writing a review, people often use
emoticons to express their feelings better. Although emoticons help in understanding the emotions,
they can mislead the classifier and cause wrong predictions during Sentiment Analysis.
6) Lemmatization or stemming. Lemmatization and stemming aim to reduce inflectional and
related forms of a word to a common base form. Stemming achieves this goal correctly most of the time
by chopping off the ends of words, whereas lemmatization does the same properly with the use of a
vocabulary and morphological analysis of words.
7) Tokenization. Tokenization refers to splitting the sentence into its constituent parts. It is an
important step in all NLP tasks.
8) Feature selection. Feature selection finds a reduced set of attributes that provides a suitable
representation of the database for the analysis to be performed. This is necessary because the excessive
use of slang, irony and language mixtures makes the classification task difficult.
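Several of the pre-processing steps above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not a complete pipeline: the word lists are tiny hypothetical samples, the repeated-letter rule is interpreted here as collapsing any run of three or more identical characters to one, and stemming, lemmatization and feature selection are left out because they normally rely on an external library.

```python
import re

# Hypothetical, deliberately tiny word lists; a real system would use fuller ones.
STOP_WORDS = {"i", "a", "an", "the", "in", "to", "from", "of", "as", "am", "is", "it", "have"}
QUESTION_WORDS = {"what", "which", "how"}
REMOVED_WORDS = STOP_WORDS | QUESTION_WORDS

URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")
REPEAT_PATTERN = re.compile(r"(.)\1{2,}")     # a character repeated three or more times
NON_WORD_PATTERN = re.compile(r"[^a-z\s']")   # anything except letters, spaces, apostrophes


def preprocess(text):
    """Return a list of cleaned tokens for one review or tweet."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)         # step 1: remove URLs
    text = REPEAT_PATTERN.sub(r"\1", text)    # step 2: squeeze repeated letters
    text = NON_WORD_PATTERN.sub(" ", text)    # step 4: remove special characters and emoticons
    tokens = [t.strip("'") for t in text.split()]   # step 7: whitespace tokenization
    # steps 3 and 5: drop question words and stop words
    return [t for t in tokens if t and t not in REMOVED_WORDS]


print(preprocess("I have logged in to www.happy.com as I am bored..."))
print(preprocess("thankuuuuu"))
```

With the URL removed, the misleading word "happy" never reaches the classifier, which is the point made in step 1.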
C. Classification. Classification is a technique that assigns data to categories. In Sentiment Analysis it is
used to classify data into three classes, namely positive, negative and neutral, and the sentiment analysis
process is completed on that basis. The classification task requires a pre-classified database sample, called
the training set, which is used to train and generate a classifier. It also helps in comparing
new unlabeled data to be classified. The accuracy of the classifier depends heavily on this training data.
Several classifiers are available for the task and are discussed below; the Naive Bayes classifier is the one
most commonly used for classification of data in Sentiment Analysis.
1) Naive Bayes classifier is a supervised machine learning approach. It is based on Bayes' theorem, which is
named after Thomas Bayes, and hence the name. According to this theorem, given two events p1 and p2, the
conditional probability of event p1 occurring when p2 has already occurred is given by the following formula:
P(p1|p2) = P(p2|p1) P(p1) / P(p2)
The algorithm uses the same formula to calculate the probability of the data being positive or negative:
P(A|B) = P(B|A) P(A) / P(B)
where A = sentiment and B = sentence.
The conditional probability of a word is given by
P(word|A) = (C + 1) / (D + E)
where
C = number of occurrences of the word in the class
D = number of words belonging to the class
E = total number of words
2) Maximum entropy classifier is another probabilistic classifier, which belongs to the class of exponential
models. It is similar to the Naive Bayes classifier; however, Naive Bayes assumes that the features are
conditionally independent of each other, whereas this algorithm does not make that assumption. The classifier
works by the Principle of Maximum Entropy: from all the models that fit the training data, it selects the one
with the largest entropy. Apart from Sentiment Analysis, the Max Entropy classifier can be applied to many text
classification problems, such as language detection, topic classification and more.
3) Support Vector Machine classifier. SVMs are supervised learning models with associated learning algorithms
that analyze data for classification and regression analysis. An SVM model represents examples as points in
space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as
possible. New examples are then mapped into the same space and assigned a category according to the side of
the gap on which they fall.
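To make the Naive Bayes formulas above concrete, the following minimal Python sketch trains on a handful of made-up reviews and applies the smoothed word probability (C + 1) / (D + E). The class name, the toy training data and the whitespace tokenization are hypothetical. E is taken here as the number of distinct words seen in training, which is the usual choice for add-one smoothing and is an assumption of this sketch. The denominator P(B) is omitted because it is the same for every class, and logarithms are summed to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Multinomial Naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)   # class -> word -> count
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def word_probability(self, word, label):
        c = self.word_counts[label][word]            # C: occurrences of the word in the class
        d = sum(self.word_counts[label].values())    # D: words belonging to the class
        e = len(self.vocabulary)                     # E: distinct words overall (assumption)
        return (c + 1) / (d + e)

    def predict(self, document):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)              # log P(A), the class prior
            for word in document.lower().split():
                score += math.log(self.word_probability(word, label))  # log P(word|A)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data, invented for this example.
train_docs = ["good phone great camera", "great battery", "bad screen", "poor battery bad camera"]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayesSentiment().fit(train_docs, train_labels)

print(model.predict("great camera"))
print(model.predict("bad screen"))
```

Without the "+ 1" in the numerator, a single word that never occurred in a class would force that class's probability to zero, regardless of the other words in the sentence.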
Related Work
Sentiment Analysis aims to help customers make effective decisions. Manually analyzing reviews is a difficult
task, and many researchers have made significant contributions toward automating it. This section reviews the
existing and related work on Sentiment Analysis.
Keke Cai et al. presented research on topic detection techniques that identify the topics most highly correlated
with positive and negative opinions. These techniques help business analysts understand the overall sentiment
scope as well as the drivers behind the sentiment. They first performed basic sentiment classification, which
categorized the text as positive, negative or neutral, but found that this type of classification lacked insight
into what drives the sentiments. To solve the problem, they came up with a new sentiment analysis technique that
determines not only the sentiment of a given topic but also the root cause of that sentiment.
Prashant Raina came up with an opinion mining engine that uses common-sense knowledge extracted from ConceptNet
and Semantic Net to perform sentiment analysis on news articles. He used a large corpus of sentences from news
articles to test the opinion mining engine. The classification accuracy was 71%, with 91% precision for neutral
sentences.
Federico Neri et al. described a sentiment study carried out on more than 1000 Facebook posts about newscasts,
comparing the sentiment for Rai, the Italian public broadcasting service [2].
Ana C. E. S. Lima et al. proposed an automatic sentiment classifier for tweets containing emoticons or
sentiment-based words. They used the Naive Bayes algorithm to classify the tweets. However, this approach
classified tweets only as positive or negative and did not add a neutral class [3].
Min Wang et al. emphasized an approach that realizes polarity analysis of new words and, in addition, implements
quantitative computation of sentiment words and automatic expansion of the polarity lexicon [4]. Their
experimental results showed the feasibility and effectiveness of their approach.
ZHU Nanli et al. have presented a study on recent developments in the field of
sentiment analysis. They conducted a survey of three major research areas: framework, feature extraction and
sentiment analysis. The gap they identified was that there had been no research on the commercial value of
online reviews [5].
Seyed-Ali Bahrainian et al. came up with a novel solution for target-oriented sentiment summarization and SA of
short informal texts, with emphasis on tweets [6]. They compared different algorithms and methods for SA
polarity detection (PD) and sentiment summarization. However, detection of sarcasm is yet to be taken into
account.
Andreas Dengel et al. compared state-of-the-art Sentiment Analysis methods against a novel hybrid method. Their
approach trains a linear Support Vector Machine (SVM) classifier on a new set of features created using a
sentiment lexicon. The problem they faced was that the classification did not take sarcasm into account [7].
Sunil Kumar Khatri et al. presented research in which they classified e-data collected from multiple sites and
then analyzed the classified data with an ANN. They reduced the prediction error to a minimum. Their study did
not stop at predicting the direction of the market for a particular day; they aimed to take it to a level where
they could predict the closing value for the day [8].
Rui Xia et al. came up with a model called dual sentiment analysis (DSA). Their paper highlighted the issues
with sentiment classification [9]. They created a sentiment-reversed review for each training and test review
as a novel data expansion technique. They developed a dual training algorithm that uses both kinds of reviews
together to learn a sentiment classifier, which is then used to classify the test reviews. They then extended
the approach from 2-class to 3-class classification by taking neutral reviews into account. Finally, they
created a pseudo-antonym dictionary using a corpus-based method. They conducted a wide range of experiments,
and the results demonstrate how effective DSA is.
Vee W. Lo et al. discussed the existing work on opinion mining and sentiment classification performed on
customer feedback and online reviews, and evaluated the
various approaches used for the process [10]. The existing literature shows that many algorithms exist for
Sentiment Analysis, but each has drawbacks, and there is still room for improvement.
Harsha et al. carried out a comparative study of the basic techniques used for sentiment analysis [1]. The table
below shows their comparison.
TABLE 1: COMPARATIVE CHART OF ALGORITHMS [1]
Proposed Research Work
With the rapid growth in the use of social networking sites in the past decade, they have become a notable
medium for people to express their views and opinions. This has promoted sentiment analysis as a dynamic and
promising area of research in which new techniques and models need to be explored to keep improving result
accuracy. Many techniques have been used to analyze sentiment in datasets of various categories and sizes,
including the Naïve Bayes algorithm, Support Vector Machines, neural networks and many others. Experiments
conducted using these techniques have demonstrated their efficiency, but many areas of research remain open.
Many researchers have used a hybrid approach to sentiment analysis, combining various algorithms to achieve
better results. Akshi et al. used a neural network to perform sentiment analysis on tweets. Neural networks
offer several advantages over other techniques, with prominent features such as adaptive learning, fault
tolerance, parallelism and generalization.
ANNs are capable of learning, and they need to be trained. There are several learning strategies:
Supervised Learning − It involves a teacher that is more knowledgeable than the ANN itself. The teacher feeds
the ANN example data for which the teacher already knows the answers. In pattern recognition, for example, the
ANN comes up with guesses while recognizing, and the teacher then provides the correct answers. The network
compares its guesses with the teacher's "correct" answers and makes adjustments according to the errors.
Unsupervised Learning − It is required when there is no example data set with known answers, for example when
searching for a hidden pattern. In this case, clustering, i.e. dividing a set of elements into groups
according to some unknown pattern, is carried out based on the existing data sets.
Reinforcement Learning − This strategy is built on observation. The ANN makes a decision by observing its
environment. If the observation is negative, the network adjusts its weights so that it can make a different,
more suitable decision the next time.
In my proposed research work I will work with the supervised learning model of ANN. For my dataset I will take a
set of feedback given by users. The feedback can be labelled as Excellent, Good, Average or Poor.
The ANN needs a learning algorithm to effectively calculate the weights that determine when the neurons fire.
For my research I am planning to use an optimization algorithm to adjust the weights of the neural network. The
algorithm that I will be using is the Firefly Algorithm, a metaheuristic proposed by Xin-She Yang and inspired
by the flashing behavior of fireflies [11]. The pseudo code of the algorithm is given below:-
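The following is not the pseudo code from the original report but a minimal Python sketch of the firefly algorithm as it is commonly described: each firefly moves toward every brighter firefly, with an attractiveness that decays with distance, plus a small random step. The function name, the parameter names (alpha, beta0, gamma), their default values and the test function are assumptions made for illustration. For the proposed work, the cost function would be the classification error of the neural network as a function of its weights.

```python
import math
import random


def firefly_minimize(cost, dim, n_fireflies=15, n_iterations=100,
                     alpha=0.2, beta0=1.0, gamma=1.0, bounds=(-5.0, 5.0), seed=0):
    """Minimize `cost` over `dim` real variables with a basic firefly algorithm."""
    rng = random.Random(seed)
    low, high = bounds
    swarm = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(n_fireflies)]
    costs = [cost(x) for x in swarm]
    best_cost = min(costs)
    best = list(swarm[costs.index(best_cost)])
    for _ in range(n_iterations):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if costs[j] < costs[i]:   # firefly j is brighter (lower cost)
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * r2)   # attractiveness decays with distance
                    # Move i toward j, add a random step, and clip to the bounds.
                    swarm[i] = [
                        min(high, max(low, a + beta * (b - a) + alpha * (rng.random() - 0.5)))
                        for a, b in zip(swarm[i], swarm[j])
                    ]
                    costs[i] = cost(swarm[i])
                    if costs[i] < best_cost:
                        best_cost, best = costs[i], list(swarm[i])
        alpha *= 0.97   # gradually reduce the size of the random step
    return best, best_cost


def sphere(x):
    """Simple test function whose minimum is 0 at the origin."""
    return sum(v * v for v in x)


best, best_cost = firefly_minimize(sphere, dim=2)
print(best, best_cost)
```

Because the algorithm only needs cost values and no gradients, it can be applied to a network's weights even when the error measure is not differentiable.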
My proposed algorithm will have the following steps:-
1) Collection of dataset
2) Pre-processing and cleaning of data
3) Calculate the relative occurrence of words in the dataset
4) Creation of neural network structure with each word being assigned a node
5) Re-adjusting the weights of the neural network by using the firefly algorithm
6) Calculation of the accuracy of the generated result
Fig. 2: Proposed Sentiment Analysis Framework
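Steps 3 and 6 of the proposed algorithm are simple enough to sketch directly. The Python below is a minimal illustration with hypothetical function names and made-up inputs; it assumes that "relative occurrence" means a word's count divided by the total number of words in the dataset, and that accuracy is the fraction of feedback items whose predicted label matches the actual label.

```python
from collections import Counter


def relative_occurrence(token_lists):
    """Step 3: share of all tokens in the dataset accounted for by each word."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def accuracy(predicted, actual):
    """Step 6: fraction of items whose predicted label equals the actual label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up, already pre-processed feedback and labels.
frequencies = relative_occurrence([["good", "phone"], ["good", "battery"]])
score = accuracy(["Good", "Poor", "Good", "Average"],
                 ["Good", "Poor", "Average", "Average"])
print(frequencies, score)
```

The relative occurrences could serve as the input values of the word nodes in step 4, and the accuracy (or one minus it) as the quantity the firefly algorithm optimizes in step 5.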
References
[1] H. Sinha and A. Kaur, "A Detailed Survey and Comparative Study of Sentiment Analysis Algorithms", IEEE, 2016.
[2] Federico Neri, Carlo Aliprandi, Federico Capeci, Montserrat Cuadros, Tomas, "Sentiment Analysis on Social Media",
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2012, pp. 919-926.
[3] Ana C. E. S. Lima, Leandro N. de Castro, "Automatic Sentiment Analysis of Twitter Messages", IEEE, 2012, pp. 52-57.
[4] Min Wang, Hanxio Shi, "Research on Sentiment Analysis Technology and Polarity Computation of Sentiment Words",
IEEE, 2010, pp. 331-334.
[5] ZHU Nanli, ZOU Ping, LI Weign, CHENG Meng, "Sentiment Analysis: A Literature Review", in Proceedings of the 2012
IEEE ISMOT, pp. 572-576.
[6] Seyed-Ali Bahrainian, Andreas Dengel, "Sentiment Analysis and Summarization of Twitter Data", IEEE 16th
International Conference on Computational Science and Engineering, 2013, pp. 227-234.
[7] Seyed-Ali Bahrainian, Andreas Dengel, "Sentiment Analysis using Sentiment Features", IEEE/WIC/ACM International
Conferences on Web Intelligence (WI) and Intelligent Agent Technology (IAT), 2013, pp. 26-29.
[8] Sunil Kumar Khatri, Himanshu Singhal, Prashant Johri, "Sentiment Analysis to Predict Bombay Stock Exchange Using
Artificial Neural Network", IEEE, 2014.
[9] Rui Xia et al., "Dual Sentiment Analysis: Considering Two Sides of One Review", IEEE Transactions on Knowledge and
Data Engineering, Vol. 27, No. 8, 2015, pp. 2121-2133.
[10] Vee W. Lo, Vidyasagar Potdar, "A Review of Opinion Mining and Sentiment Classification Framework in Social
Networks", 3rd IEEE International Conference on Digital Ecosystems and Technologies, 2009, pp. 396-401.
[11] A. Kumar and R. Khorwal, "Firefly Algorithm for Feature Selection in Sentiment Analysis", SpringerLink, 2017.