
Lebanese American University

School of Engineering

Department of Electrical and Computer Engineering

Byblos, Lebanon

Undergraduate Research Project (COE 594)

Presented by: Mireille Fares – 201102508

Supervised by Dr. Joe Tekli

L.I.S.A 2.0: Lexical Information-based Sentiment Analysis 2.0

Byblos, 16/05/2016


Abstract— Sentiment analysis systems are automated tools which analyze text extracts entered by users and attempt to classify them under different sentiment categories, namely positive, negative, or neutral emotion. Such systems are gaining increasing interest with a wide range of applications covering blog sentiment analysis, client feedback analysis, and opinion mining on social media (with people expressing their opinions on social media websites, e.g., Facebook, Twitter, etc.). In this project, we introduce a tool titled LISA which performs Lexical Information-based Sentiment Analysis, covering not only positive, negative, and neutral sentiments, but also a battery of affect classes ranging from positive to negative as well as more ambiguous emotions such as joy, sadness, love, anger, disgust and astonishment.

Index Terms—Sentiment Analysis, Affect Analysis, LISA, Effectiveness and Efficiency Tests.

CONTENTS

1 Introduction
2 Background
3 Proposal
4 Experimental Evaluation
5 Conclusion
6 List of Figures
7 References


1 INTRODUCTION

Lexical sentiment analysis systems are automated tools which analyze text extracts entered by users and attempt to classify them under different sentiment categories, namely positive, negative, or neutral sentiments. Affect analysis can be viewed as a more generalized/comprehensive approach to sentiment analysis, involving specific classes of affective emotions such as happiness, sadness, surprise, anger, etc. Methods in both (sentiment/affect) categories utilize Natural Language Processing (NLP) and machine learning techniques to automatically identify the underlying emotions carried in textual data. Sentiment and affect analysis methods are becoming increasingly popular in a wide range of applications covering blog sentiment analysis, client feedback analysis, opinion mining on social media, and therapeutic and educational analysis that can help people express their emotions (e.g., autistic people). In this project, we aim to provide an improved method for sentiment and affect analysis, improving on both the effectiveness and efficiency of legacy systems such as SentiWordNet (a sentiment analysis system) and AlchemyAPI (an affect and sentiment analysis system). On one hand, our method reduces the processing time of the analysis; on the other hand, it improves the quality of the results.

2 BACKGROUND

2.1 Related Works

Sentiment analysis has been a hot topic during the last decade, especially with the rise of social media and web forums. Several techniques for sentiment and affect analysis have been previously researched. Those techniques can be lexicon-based or generic n-gram based. Lexicon-based techniques rely on a lexicon generated manually or automatically (by expanding a small set of seed words) using semantic orientation (SO) or WordNet. An example of a lexicon manually generated from WordNet is WordNet-Affect, which we will use in our project. Generic n-gram based techniques rely on word and part-of-speech (POS) tag n-grams. Moreover, concerning the affect intensities assigned to words, both scoring techniques and machine learning techniques have been applied. An instance of scoring methods, used with lexicons, is evaluating the frequency of co-occurrence of a word with a set of core paradigm words that reflect a certain affect class. On the other hand, examples of machine learning techniques to assign affective intensities to words are Support Vector Machines (SVM) and Support Vector Regression (SVR) [1].

Many tools are nowadays available to compute sentiment analysis of words, sentences or texts. For instance, SentiWordNet is a tool originating from the WordNet database which associates to each synset in WordNet three numerical scores: Positive, Negative and Objective [2]. Moreover, we have the L.I.W.C. tool, which stands for Linguistic Inquiry and Word Count, a well-known technique to analyze the contents of a text and determine the percentage of positive or negative emotions it uses, along with other dimensions such as self-references, social words, etc. [3]. Another tool used for affective lexicon analysis is General Inquirer, which maps each text file with counts on dictionary-supplied categories such as positive, negative, frequency, adverbs reflecting degrees (very, extremely, etc.), and so on [2]. Furthermore, some existing tools and APIs perform sentiment analysis on Twitter by analyzing the polarities (negative, positive or neutral) of tweets based on a specific keyword that the user enters, such as the Sentiment140 tool (http://www.sentiment140.com) and the Alchemy API for sentiment analysis (http://www.alchemyapi.com/api/sentiment-analysis).

2.2 Lexicon Based Approach

One example of a lexicon-based approach is WordNet. WordNet is a lexical database created by researchers at Princeton University; it contains around 155,287 words and their definitions. Words inside this database are organized in groups called "synsets". Every synset groups words that are related by synonymy, as long as they have the same type as the synset type (noun, verb, adjective, etc.). Semantic relationships relate a synset to other synsets in the database. WordNet has almost 18 semantic relationships, such as: hypernym, hyponym, instance hyponym, part holonym, part meronym, substance meronym, etc. In addition to semantic relationships, WordNet contains lexical relationships that define relationships between words. Some of the lexical relationships are: pertainym (a lexical relationship that allows deriving the adjective of a noun) and derivation (a lexical relationship that allows deriving any related form, if it exists) [4][5].
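To make this structure concrete, the following toy Java model mirrors the description above: synsets typed by part of speech, grouped synonyms, a gloss, and semantic relations to other synsets. This is an illustrative sketch only, not the actual WordNet API.

```java
import java.util.*;

/** Toy data model for the WordNet structure described above.
 *  The real database is far richer; names here are illustrative. */
class WordNetModel {
    enum SemanticRelation { HYPERNYM, HYPONYM, INSTANCE_HYPONYM, PART_HOLONYM,
                            PART_MERONYM, SUBSTANCE_MERONYM, SIMILAR }
    enum Pos { NOUN, VERB, ADJECTIVE, ADVERB }

    static class Synset {
        Pos pos;                                           // type shared by all words in the synset
        List<String> words = new ArrayList<>();            // synonyms sharing this sense
        String gloss;                                      // the definition
        Map<SemanticRelation, List<Synset>> related = new EnumMap<>(SemanticRelation.class);
    }
}
```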

WordNet-Affect is another example of a lexicon-based approach. WordNet-Affect is a database that was built starting from WordNet. It is an approach for affective definitions, and it was created by retrieving the synsets from WordNet that truly represent emotions. All synsets were annotated using one or more a-labels such as emotion, mood, cognitive state, physical state, and so on. In WordNet-Affect, synsets belong to different valences, which made it easier for us to assign scores to words entered by the user; four main valences (new labels) were assigned: positive emotion (joy, enthusiasm, etc.), negative emotion (fear, horror, etc.), ambiguous emotion (surprise, gravity, etc.) and neutral emotion (indifference, etc.). The WordNet-Affect hierarchy is an "is-a" hierarchy. In other words, it only includes the hypernymy and hyponymy relationships.


2.3 Scoring Approach

An example of a scoring approach is LISA. LISA is a tool developed by A. Moufarrej and E. Jreij [8]. It performs Lexical Information-based Sentiment Analysis, covering positive, negative, and neutral sentiments, as well as 6 affect classes ranging from positive to negative and to more ambiguous emotions. LISA computes the scores of a synset with respect to sentiment/affect classes. It is a knowledge-based method utilizing a predefined digital dictionary, the WordNet online lexical reference, combined with the WordNet-Affect hierarchy, allowing it to produce sentiment scores using typical shortest-path (OSPF, i.e., Dijkstra-style) graph navigation techniques.

2.4 Limitations of LISA

LISA had a major drawback: it takes several minutes up to hours to compute the scores of a certain word, due to the amount of time it takes to navigate the large WordNet semantic network. This drawback is a bottleneck in the engine's performance, because the engine computes the shortest path 6 times, which slows it down. Moreover, the tool was limited to analyzing 6 affect categories, while WordNet-Affect is composed of 298 affect categories. LISA also suffers from word sense ambiguity: it does not distinguish between different meanings of words. The first synset of a word (Synset 0, the first meaning) is always taken into consideration, ignoring the other ones, while in many cases the meaning we are most interested in (the one closest to the sentiment/emotion) appears in the second synset of the word.

3 PROPOSAL

3.1 Introducing LISA 2.0

In this project, we introduce LISA 2.0, an updated/extended version of the original LISA engine, performing both sentiment and affect analysis with more optimized functionality and higher efficiency levels. LISA 2.0 offers several solutions to the problems that LISA 1.0 was facing, along with some other improvements, listed as follows:

1. Increasing the number of sentiments: the number of sentiments to analyze was increased from 6 to all 298 affect categories present in WordNet-Affect.

2. Creating a database: storing the sentiment scores of all 117,000 synsets of WordNet with respect to the 298 affect classes in a database (see the sketch after this list).

3. Increasing the performance: the major problem that LISA 1.0 was facing was solved by developing several algorithms that speed up the navigation in WordNet. One of the algorithms navigates the whole WordNet network in order to compute the scores of all synsets, which are then stored in the database.

4. Solving the disambiguation problem: we developed an algorithm that deals with the disambiguation problem that LISA 1.0 was facing: it can choose the accurate meaning of a word.
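To give a sense of scale for improvement 2: one score per (synset, affect class) pair yields roughly 117,000 x 298 = 34,866,000 entries. A minimal row type for such a table might look as follows (hypothetical; the report does not specify the actual schema):

```java
/** Hypothetical row type for the precomputed score table:
 *  117,000 synsets x 298 affect categories = 34,866,000 score entries. */
record SynsetScore(long synsetId, int affectCategoryId, double score) {}
```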


3.2 Architecture of LISA 2.0

The overall architecture of LISA 2.0 is illustrated in the following diagram (Fig. 1). Lexical sentiment processing and lexical syntactic processing are first applied to the user's input text:

- Lexical sentiment processing involves checking for character repetition, negation, word repetition, emoticons and salient words.
- Lexical syntactic processing involves removing repeated characters and repeated words, lemmatizing, removing emoticons and removing salient words (a rough sketch of these steps follows below).

Then the scores of the resulting derived nouns are:

- either computed on the fly using several algorithms that will be explained later,
- or taken from the pre-computed scores stored in a database (using backward processing, explained later).
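As a rough illustration of the lexical syntactic steps listed above, the following Java sketch collapses repeated characters, strips emoticons, and drops repeated words; lemmatization is omitted since it would normally be delegated to a WordNet morphological processor. The regexes are simplified assumptions, not the engine's actual rules:

```java
import java.util.*;

/** Rough sketch of the lexical syntactic preprocessing named above. */
class LexicalSyntacticProcessor {
    // e.g. "haaappy" -> "happy" (collapse 3+ repeats to one; a real engine may keep two)
    static String collapseRepeatedChars(String s) {
        return s.replaceAll("(.)\\1{2,}", "$1");
    }
    // remove a few common ASCII emoticons such as ":)" or ";-D"
    static String stripEmoticons(String s) {
        return s.replaceAll("[:;=][-^']?[)(DPpOo/\\\\|]", "").trim();
    }
    // drop consecutive duplicate words: "so so so happy" -> "so happy"
    static String dropRepeatedWords(String s) {
        List<String> out = new ArrayList<>();
        String prev = null;
        for (String w : s.split("\\s+")) {
            if (!w.equalsIgnoreCase(prev)) out.add(w);
            prev = w;
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String raw = "I am sooooo so so happy :)";
        System.out.println(dropRepeatedWords(stripEmoticons(collapseRepeatedChars(raw))));
        // -> "I am so happy"
    }
}
```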

In order to speed up the running time of LISA 1.0, we have implemented 4 different algorithms that compute sentiment scores on the fly. These algorithms are listed as follows:

- OSPF-WN (updated version of the one used in LISA 1.0)
- OSPF-WNA (updated version of the one used in LISA 1.0)
- Delta Stepping
- ID-Processing

Moreover, we have developed an algorithm called "Backward Processing" that navigates the whole WordNet semantic network in order to compute the sentiment scores of all 117,000 synsets with respect to the 298 affect categories, and saves the results in a database.

Fig. 1 Overall Architecture of LISA 2.0


3.2.1 WordSenseDisambiguation Algorithm

In WordNet, every word can belong to several synsets. Every synset represents a single common meaning of the words it contains. In general, the synsets are ranked in terms of how strongly they possess a given semantic property, and they appear to the user in that specific ranking. This means that when the user enters a word in WordNet, the first definition that appears is the one with the most common, most frequently used meaning.

This is why, when dealing with synsets, LISA 1.0 always took the ones at index zero. This caused a disambiguation problem when dealing with the affective categories in LISA 2.0: we cannot always take the synsets at index zero to be the correct synsets corresponding to the different affective categories. Sometimes the affective meaning belongs to the synset at index one.

To solve this disambiguation problem, we came up with the WordSenseDisambiguation algorithm. The following figure shows its pseudo-code.

When a word possesses two or more synsets, we need to identify the correct synset that represents the meaning of the affective category. To find it, we check whether the definition of each synset contains the word "feeling" or the word "emotion". The algorithm identifies four situations:

Case 1: The word possesses only one synset. In this case, this synset will surely represent the meaning related to the affective category.

Case 2: The definition of the word at the synset with index zero contains the word "emotion" or "feeling". In this case, we choose the synset at index zero to represent the corresponding affective category.

Case 3: The definition of the word at the synset with index zero doesn't contain the word "emotion" or "feeling", while the one at index one does. In this case, we choose the synset at index one to represent the corresponding affective category.

Case 4: Neither the synset at index zero nor the synset at index one contains the word "emotion" or "feeling". In this case, we choose the first synset (at index zero) to represent the corresponding affective category.

Fig. 2 Pseudo-code of WordSenseDisambiguation
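The four cases reduce to a short decision rule. The following Java sketch implements them over a minimal stand-in Synset type; the real engine works against the WordNet API, so the types here are illustrative only:

```java
import java.util.List;

/** Minimal sketch of the WordSenseDisambiguation rule described above. */
public class WordSenseDisambiguation {

    record Synset(String gloss) {
        boolean glossMentionsAffect() {
            String g = gloss.toLowerCase();
            return g.contains("feeling") || g.contains("emotion");
        }
    }

    /** Returns the index of the synset taken to represent the affective category. */
    static int chooseAffectSynset(List<Synset> synsets) {
        if (synsets.size() == 1) return 0;                   // Case 1: only one sense
        if (synsets.get(0).glossMentionsAffect()) return 0;  // Case 2: sense 0 is affective
        if (synsets.get(1).glossMentionsAffect()) return 1;  // Case 3: sense 1 is affective
        return 0;                                            // Case 4: fall back to sense 0
    }

    public static void main(String[] args) {
        // "creeps": sense 0 is non-affective, sense 1 mentions "feeling" -> Case 3
        List<Synset> creeps = List.of(
                new Synset("someone unpleasantly strange or eccentric"),
                new Synset("a feeling of fear and revulsion"));
        System.out.println(chooseAffectSynset(creeps)); // prints 1
    }
}
```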

Page 7: L.I.S.A 2.0 L Information-based Sentiment Analysis 2

LEBANEESE AMERICAN UNIVERSITY 7

The different cases are illustrated as follows. The following figure is an example of Case 1: the word "affection" possesses only one synset, and we can see clearly that its definition contains the word "feeling".

Fig. 3 Case 1

The following figure is an example of Case 3: the word "creeps" has two synsets. The synset at index 0 doesn't contain the word "emotion" or "feeling", while the synset at index 1 contains the word "feeling" in its definition.

Fig. 5 Case 3

The following figure is an example of Case 4: the word "compassion" possesses 2 synsets. Neither the synset at index zero nor the synset at index 1 contains the word "feeling" or "emotion". In this case, we choose the synset at index zero to be the synset that corresponds to the affective category compassion.


3.2.2 Optimized OSPF-WN

In order to compute the scores of a certain synset with respect to the affect categories, LISA 1.0 used the well-known Open Shortest Path First algorithm, also known as Dijkstra's algorithm. The figure below shows the basic algorithm's pseudo-code, where V is the set of vertices in the graph G and v is a specific vertex. In order to use this algorithm on WordNet, some modifications were made. Instead of starting from an initial distance of 0 for the source, the distance starts at 1 and is multiplied (not added) throughout the graph by the weight of each node. Concerning the minimum neighbor, it is chosen to be the one which has the maximum distance from the source among the neighbors. In LISA 2.0, the algorithm OSPF-WN was developed in order to speed up the running time of the OSPF algorithm used in LISA 1.0. The following figure shows the OSPF-WN pseudo-code.

Fig 6.0 Dijkstra Pseudo-Code
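To make the modification concrete, here is a minimal Java sketch of the multiplicative Dijkstra variant described above: the source starts at distance 1, edge weights in (0,1] are multiplied along the path, and the node with the maximum current distance is expanded next, so the largest product plays the role of the shortest distance. The graph representation and weights are illustrative:

```java
import java.util.*;

/** Sketch of the multiplicative Dijkstra variant used on WordNet. */
public class MultiplicativeDijkstra {
    record Entry(String node, double dist) {}

    static Map<String, Double> run(Map<String, Map<String, Double>> graph, String source) {
        Map<String, Double> best = new HashMap<>();
        PriorityQueue<Entry> pq =
                new PriorityQueue<>(Comparator.comparingDouble(Entry::dist).reversed());
        best.put(source, 1.0);                 // initial distance is 1, not 0
        pq.add(new Entry(source, 1.0));
        while (!pq.isEmpty()) {
            Entry cur = pq.poll();
            if (cur.dist() < best.getOrDefault(cur.node(), 0.0)) continue; // stale entry
            for (var e : graph.getOrDefault(cur.node(), Map.of()).entrySet()) {
                double d = cur.dist() * e.getValue();          // multiply, don't add
                if (d > best.getOrDefault(e.getKey(), 0.0)) {  // keep the largest product
                    best.put(e.getKey(), d);
                    pq.add(new Entry(e.getKey(), d));
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> g = Map.of(
                "joy", Map.of("happiness", 0.9, "love", 0.5),
                "happiness", Map.of("love", 0.8));
        // joy -> happiness -> love = 0.9 * 0.8 = 0.72 beats the direct 0.5 edge
        System.out.println(run(g, "joy").get("love")); // 0.72
    }
}
```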


OSPF-WN alters the OSPF used in LISA 1.0 by subdividing it into two: Forward Dijkstra and Backward Dijkstra. In order to get the neighbors of a certain synset, the OSPF of LISA 1.0 used the following relations: Hyponyms, Hypernyms, Similar, Pertainym, Related, and Derivation. Forward Dijkstra applies the OSPF from the destination to the source, and it only uses the relations Hyponyms, Similar, Pertainym, Related, and Derivation (without the relation Hypernyms). On the other hand, Backward Dijkstra applies the OSPF from the destination to the target, and it only uses the relations Hypernyms, Similar, Pertainym, Related, and Derivation (without the relation Hyponyms). In order to use Forward Dijkstra and Backward Dijkstra properly, we need to distinguish these four situations:

1. Case 1: The source synset is an ancestor of the destination synset, i.e., the source synset has a lower depth than the destination synset in the WordNet hierarchy.

2. Case 2: The source synset is a descendant of the destination synset. In this case, the source synset has a higher depth than the destination synset in the WordNet hierarchy.

3. Case 3: The source synset and the destination synset are not related by the ancestor/descendant relationship, but they have a least common ancestor synset instead.

4. Case 4: The source synset and the destination synset are not related to each other.


The following figure displays the four cases:

- The first case displays the IS-A relation between Light and Afternoon. In this case, Afternoon is a descendant of Light. In order to get the distance from Light to Afternoon, the algorithm OSPF-WN uses the Backward Dijkstra from the target (Afternoon) to the source (Light).
- The second case displays the HAS-A relation between Equine and Pony. In this case, Equine is an ancestor of Pony. In order to get the distance from Pony to Equine, the algorithm OSPF-WN uses the Forward Dijkstra from the target (Equine) to the source (Pony).
- The third case is when the source and the destination are not related by the ancestor/descendant relation but have a least common ancestor synset instead. In this case, the least common ancestor of Evening and Glow is Light. In order to get the distance from Evening to Glow, the algorithm OSPF-WN uses the Backward Dijkstra from the target (Glow) to the least common ancestor (Light).
- The fourth case is when there are no relations between the two synsets. In this case, Horse and Steve Jobs are not related, and the algorithm OSPF-WN sets the distance to 0.

Fig. 7 Cases to consider

3.2.3 Optimized OSPF-WNA

LISA 1.0 computed the scores of a certain synset with respect to six affect categories, calculating each score by finding the shortest path between the synset and the corresponding affect category. This means computing the shortest path 6 times, which slows down the engine. In LISA 2.0, we have upgraded from computing the scores of 6 affect categories to computing the scores of all 298 WordNet-Affect categories. In this case, if we were to calculate the shortest path between the synset and each of the 298 categories, the engine would spend several hours on the calculations. In order to speed up the process, we propose the algorithm OSPF-WNA as a solution. OSPF-WNA computes the scores of all affect categories with respect to each affect category in WordNet-Affect. In other words, each affect category stores the scores of itself with respect to all other affect categories. In this way, instead of computing the shortest path between the synset and all 298 affect categories, we compute the shortest path once, from the synset to the closest affect category; we then get the scores of all the affect categories (already stored inside this closest affect category) and multiply them by the distance we got from the shortest path.

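In code, the final combination step amounts to a single shortest-path run followed by scaling a precomputed score vector. A hedged Java sketch, assuming the WNAMap structure introduced below and a distance already obtained from OSPF-WN (names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the OSPF-WNA lookup step: one shortest-path run to the closest
 *  affect category, then that category's 298 precomputed scores are scaled
 *  by the obtained distance. */
class WnaLookup {
    static Map<String, Double> scoresFor(String closestCategory,
                                         double distToClosest,  // OSPF-WN result, in (0,1]
                                         Map<String, Map<String, Double>> wnaMap) {
        Map<String, Double> result = new HashMap<>();
        // wnaMap.get(closestCategory) maps every affect category to its score
        // with respect to closestCategory
        wnaMap.get(closestCategory)
              .forEach((category, score) -> result.put(category, distToClosest * score));
        return result;
    }
}
```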


The figure below displays the OSPF-WNA pseudo-code. The algorithm's goal is to return a HashMap (WNAMap) that stores all the affect categories as keys; each affect category maps to the scores of all WordNet-Affect categories with respect to it. The algorithm starts by getting the leaf nodes of the WordNet-Affect hierarchy (there are 196 leaf nodes). Then it iterates over the leaf nodes one by one:

- For each leaf node, it calls the method fillHashmapOf( ), which computes the scores of all the affect categories with respect to the current leaf node.
- Then it gets the first ancestor (the hypernym) of the current leaf node and calls fillHashmapOf( ) again in order to compute the scores of all affect categories with respect to this ancestor node.
- The affect sentiments that belong to the node "Emotion" have a maximum depth of 3. This is why we need to check whether the current leaf node has a second ancestor. If yes, we call fillHashmapOf( ) again in order to compute the scores of all affect categories with respect to this second ancestor node.

Fig.8 OSPF-WNA Pseudo-Code


The figure below displays the pseudo-code of the method fillHashmapOf( ). This method creates a HashMap for the input affect category node. This HashMap will contain all WNA categories as keys, with the value of each key initialized to null.

- The distance of the input affect category to itself is 1, so the value of the key representing the input affect category is set to 1 in the HashMap.
- The second step is to get the ancestors of the input affect category node.
- The third step is to check whether the input node is a leaf node:
  o If the input node is a leaf node, iterate through the ancestors. The relationship between the input node and each of its ancestors is an "IS-A" relationship, hence the distance from the input node to each of these ancestors is 1. Set each one's distance to 1 in the HashMap, and then call the method navigateForward (from the ancestor).
  o If the input node is not a leaf node, call the method navigateForward (from the input node), then iterate through the ancestors, set each one's distance to 1 in the HashMap, and call the method navigateForward (from the ancestor).
- The final step is to insert into the WNAMap the input affect category node along with its HashMap of scores.

Fig. 9 fillHashMap Pseudo-Code

The figure below displays the pseudo-code of the method navigateForward( ). This method iterates recursively through all the nodes that are descendants of the input node n (until it reaches the leaf nodes), while setting, in the HashMap, the distance from the original node to each node traversed. The following is an example of how the algorithm OSPF-WNA works:

- We get the leaf nodes of the WNA hierarchy: there are 196 affective sentiment nodes that are leaf nodes; they don't possess children.

Fig10 Navigate Forward Pseudo Code


- We iterate through the leaf nodes. Let's take animosity as an example. We create a hashmap for animosity and initialize all the distance values to null.
- Then we set the distance of animosity to 1. We get the ancestors of animosity and set the distance of each one of them in the hashmap to 1 (as shown in the figure below).

Fig. 11 Example

Fig. 12 Example


- We iterate over the ancestors: the first ancestor is hostility; we use the method navigateForward in order to set the distance of all the descendants (with respect to hostility) in the hashmap. We then repeat the same process for all the ancestors.

Fig. 13 Example

- We repeat the same process for all the leaf nodes and all their ancestors. We only set the distance of an affect sentiment that still has a null value in the hashmap.

3.2.4 ID-Processing Algorithm

The algorithm ID-Processing aims to find the closest affect sentiment synset with respect to the source synset in WordNet. After finding the closest affect sentiment, the algorithm uses the WNAMap (from OSPF-WNA) in order to get the scores of all the affect sentiments with respect to this closest synset. The final distance between the source synset and each of the 298 affect sentiments will be equal to the distance between the source synset and the closest affect synset multiplied by each affective sentiment score. The distance between the source synset and the closest affect sentiment synset is found using the OSPF-WN algorithm.

The algorithm ID-Processing finds the closest affect sentiment synset with the help of the synset IDs in WordNet. In fact, in WordNet each synset has an identifier called synsetID, and the IDs become larger as we go deeper in the WordNet hierarchy. This fact helps us locate the closest affect sentiment synset to the source synset.

The following figure shows the pseudo-code of the algorithm ID-Processing.


Fig.14 ID-Processing Pseudo Code

In order to locate the closest affect sentiment synset to the source synset, the following steps are performed:

- Get the ID of the source synset. The IDs of the affective sentiment synsets were found beforehand; they range from 100149041 to 114348977.
- Check whether the source synset is itself an affective sentiment synset. If yes, the scores of the 298 affect sentiments are taken directly from the WNAMap.
- If the source synset is not an affective sentiment synset, check whether its ID lies in the interval 100149041 to 114348977. As previously explained, we used the WordSenseDisambiguation algorithm to choose the right synsets representing the affective sentiments in WordNet, so this interval contains the correct affective sentiment synsets with no ambiguity.
- If the source ID belongs to this interval, we get the two affect sentiment synsets that surround the source synset: one with a higher ID than the source synset and one with a lower ID. The closest affective sentiment synset will be one of these two, namely the one with the highest distance (i.e., the largest multiplicative score) with respect to the source synset.
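Since the affective synset IDs are stored in ascending order, finding the two surrounding synsets is a plain binary search. A small Java sketch with a truncated, illustrative ID list (the full list covers all affect categories):

```java
import java.util.Arrays;

/** Sketch of the ID-Processing neighbour lookup over sorted affect-synset IDs. */
class IdProcessing {
    // Illustrative subset only; the real table lists every affect synset ID.
    static final long[] AFFECT_IDS = {100149041L, 113799076L, 114168673L, 114348977L};

    /** Returns {lowerNeighbour, upperNeighbour} for a source ID inside the affect range. */
    static long[] surroundingAffectIds(long sourceId) {
        int i = Arrays.binarySearch(AFFECT_IDS, sourceId);
        if (i >= 0) return new long[]{sourceId, sourceId};  // source is itself an affect synset
        int ins = -i - 1;                                   // insertion point
        // Callers guarantee sourceId is inside the interval, so both neighbours exist.
        return new long[]{AFFECT_IDS[ins - 1], AFFECT_IDS[ins]};
    }

    public static void main(String[] args) {
        // "Breakdown" (ID 113879954) lies between "Shadow" (113799076)
        // and "Sickness" (114168673), as in the worked example that follows.
        System.out.println(Arrays.toString(surroundingAffectIds(113879954L)));
        // -> [113799076, 114168673]
    }
}
```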


The following example clarifies this case. The synset "Breakdown" doesn't have any ancestor that is an affective sentiment synset (its ancestors are: disruption, disturbance, move, change, action, human activity, event, psychological feature, abstraction, abstract entity, entity). Moreover, its synsetID is 113879954, and it belongs to the interval 100149041 to 114348977. Hence there exist two affective sentiment synsets that are the closest to the synset Breakdown. We had previously listed all the IDs of the affective sentiment synsets in ascending order in a file (as shown in the next figure). These two synsets are "Shadow" (ID 113799076) and "Sickness" (ID 114168673).

Fig. 15 Example

Fig. 16 – Closest two affective synsets


If the ID of the source synset doesn't belong to the interval 100149041 to 114348977, the closest affective sentiment synset is one of the boundary affective sentiment synsets. The boundary affective sentiment synsets are those with the maximum number of ancestors and those with the minimum number of ancestors, relative to the other affective sentiment synsets. The boundary affective sentiments are:

- the ones with the lowest number of ancestors: emotionless, brotherhood and antagonism;
- the ones with the highest number of ancestors: identification, creeps, woe, oppression, frustration, covetousness, jealousy.

The boundary affective sentiment synsets are shown in the following figure.

The boundary affective synsets surround all the other affective synsets, which means that they are the closest to all WordNet synsets whose synset ID is not in the interval 100149041 to 114348977. For example, the closest affective synset to the synset "Kinship" is "Brotherhood": the synset ID of "Kinship" is not in the interval, hence the closest affective synset is one of the boundary synsets, and from the figure it is clearly "Brotherhood". After identifying that we are in the case where the source ID is closer to one of the boundary synsets, we apply the algorithm OSPF-WN between the source synset and each of the boundary synsets in order to find, and return, the closest one.

NB: When the source synset has the same distance to several affective sentiment synsets, the distance from the source to each of the 298 affective sentiment categories is based on the highest score of each of the 298 categories in the WNAMap, where the keys are these closest affective sentiment synsets.

Fig. 17- Boundary Affective sentiments


The following figure shows the pseudo-code of the method locateClosestWNASynset:

Fig. 18 locateClosestWNASynset Pseudo-Code

3.2.5 Delta Stepping

Another solution to the OSPF problem is the Delta Stepping algorithm. Delta Stepping is a simple algorithm that can perform the OSPF sequentially or in parallel on a large graph. In order to find the shortest path between two nodes using Delta Stepping, the first step is to get the nodes located between the source node and the destination node. The next step is to set the weight of each of these nodes. These nodes are then divided into buckets, where each bucket contains the nodes within a specific range of weights; the ranges are multiples of Δ, and Δ can be any value the user specifies. The buckets are then processed in parallel: multithreading is applied on these buckets in order to choose the nodes with the smallest weights, and these weights are then multiplied to obtain the distance. In our setting, every set of hyponyms (children) of a certain synset is placed in a bucket, and all the hyponyms of a certain synset have equal weights, hence we don't need to choose the synset with the smallest weight. We keep getting the hyponyms of synsets and placing them into buckets until we reach a synset that corresponds to an affective sentiment. Multithreading is applied while inserting the synsets into buckets. When the buckets are filled, we get the set of ancestor synsets located between the target synset and the source synset, and multiply the weights of these synsets to obtain the desired distance.
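For reference, here is a compact sequential sketch of textbook delta stepping in Java: tentative distances are kept in buckets of width delta, and buckets are relaxed in increasing order. It uses classic additive weights for clarity; LISA's variant multiplies weights over the WordNet hierarchy and fills the buckets with multithreading, as described above.

```java
import java.util.*;

/** Compact sequential sketch of delta stepping with additive weights. */
class DeltaStepping {
    static Map<Integer, Double> run(Map<Integer, Map<Integer, Double>> g, int src, double delta) {
        Map<Integer, Double> dist = new HashMap<>();
        TreeMap<Integer, Set<Integer>> buckets = new TreeMap<>(); // bucket index -> nodes
        relax(src, 0.0, dist, buckets, delta);
        while (!buckets.isEmpty()) {
            Set<Integer> bucket = buckets.pollFirstEntry().getValue();
            for (Integer u : bucket)                              // a parallel version would scan
                for (var e : g.getOrDefault(u, Map.of()).entrySet()) // each bucket in threads
                    relax(e.getKey(), dist.get(u) + e.getValue(), dist, buckets, delta);
        }
        return dist;
    }

    /** Record an improved tentative distance and file the node in its bucket. */
    static void relax(int v, double d, Map<Integer, Double> dist,
                      TreeMap<Integer, Set<Integer>> buckets, double delta) {
        if (d < dist.getOrDefault(v, Double.POSITIVE_INFINITY)) {
            dist.put(v, d);
            buckets.computeIfAbsent((int) (d / delta), k -> new HashSet<>()).add(v);
        }
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> g = Map.of(
                0, Map.of(1, 1.0, 2, 4.0),
                1, Map.of(2, 1.5));
        System.out.println(run(g, 0, 2.0)); // {0=0.0, 1=1.0, 2=2.5}
    }
}
```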



The algorithm ShortestPath_DELTA considers the 4 cases that were previously explained in order to optimize its running time. The following is an example of how ShortestPath_DELTA proceeds:

1. In the following graph, the source is an ancestor of the target, hence we're in Case 1:

Fig. 20 Case 1

2. The nodes are then divided into buckets (using multithreading), while setting the weight of each node:

Fig. 21 Nodes divided into buckets

3. The ancestors that are between the source and the target are then identified. The final distance is equal to the product of the weights of these ancestors.

Fig. 22 Final distance


3.2.6 Backward Processing Algorithm

We came up with the algorithm BackwardProcessing, which we use to get the scores of all the WordNet synsets and save them in a database. BackwardProcessing computes the scores of all the synsets of WordNet by starting with those that are neighbors of the affect category synsets. In other words, BackwardProcessing starts by computing the scores of the affective category synsets with respect to all 298 affect categories (by calling OSPF-WNA), then sets the scores of their surrounding neighbors, and continues through all the synsets in WordNet. The following figure shows the pseudo-code of the algorithm BackwardProcessing. Note that the algorithm will store, in every synset in WordNet, different distances with respect to different affect categories. These distances are later compared to each other in order to get the closest affect categories to every synset in WordNet. The algorithm starts by storing, in every descendant of a certain affect category synset, the distance of that descendant with respect to that affect category. This is done by calling the method navigateForwardFromWNA( ). Note that when the descendant of a certain affect category synset happens to be an affect category synset itself, the navigation from that affect category synset stops. In this way, we can determine the closest affect category synset for every descendant synset of all affect category synsets, and optimize the running time of the algorithm.

Fig. 23 Backward Processing Pseudo-Code
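At its core, the first phase behaves like a multi-source traversal seeded at every affect category synset. A simplified Java sketch, assuming a uniform edge weight and ignoring the ancestor/LCA refinements described next:

```java
import java.util.*;

/** Simplified sketch of the BackwardProcessing seeding phase: every affect
 *  category claims the synsets it reaches first, recording the category and
 *  a multiplicative distance. Graph and weights are illustrative. */
class BackwardProcessing {
    record Label(String category, double dist) {}

    static Map<String, Label> propagate(Map<String, List<String>> neighbours,
                                        Set<String> affectCategories, double edgeWeight) {
        Map<String, Label> labels = new HashMap<>();
        Deque<String> frontier = new ArrayDeque<>();
        for (String cat : affectCategories) {        // each category starts at distance 1 to itself
            labels.put(cat, new Label(cat, 1.0));
            frontier.add(cat);
        }
        while (!frontier.isEmpty()) {
            String u = frontier.poll();
            Label lu = labels.get(u);
            for (String v : neighbours.getOrDefault(u, List.of())) {
                if (labels.containsKey(v)) continue; // already claimed by a closer category;
                                                     // navigation also stops at other categories
                labels.put(v, new Label(lu.category(), lu.dist() * edgeWeight));
                frontier.add(v);
            }
        }
        return labels;                               // synset -> (closest category, distance)
    }

    public static void main(String[] args) {
        Map<String, List<String>> wn = Map.of(
                "joy", List.of("elation"),
                "elation", List.of("euphoria"));
        System.out.println(propagate(wn, Set.of("joy"), 0.9));
        // euphoria ends up labeled (joy, 0.9 * 0.9 = 0.81)
    }
}
```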


The following is an example of how navigateForwardFromWNA( ) works. In the picture below, each colored synset stores the distance 1 with respect to the affect category synset that has the same color. At this stage, the algorithm calls the method navigateToAncestorsFromWNA( ). This method gets the ancestors of every affective category synset and checks whether the first ancestor synset stores a distance of 1 with respect to any affective category. If yes, then this ancestor synset has already set its closest affective category synset. If no, then the ancestors haven't set their closest affect category synset yet, and we move on to another affective category, as shown in the figure below:

Fig. 24 Example

Fig. 19 Example


If the first ancestor hasn't stored a distance of 1 with respect to any affective category, then neither this ancestor nor any of the ancestors has set its closest affective category synset yet. The next step is to calculate the distance of all the ancestors with respect to the corresponding affect category synset, as shown in the figure below. Then, while the algorithm is iterating through these ancestors, it iterates again over all the affective category synsets and takes the ones that are not related to the current ancestor by an IS-A or HAS-A relation, meaning that the ancestor and the picked affective category synset have a least common ancestor, as shown in the figure below.

It then checks whether the first ancestor of the current affective category synset has previously stored a distance of 1 with respect to an affective category. If yes, it leaves it and moves to the next affect category synset. If not, it gets the least common ancestor (LCA) and stores, in each synset that has a HAS-A relationship with the affective category synset and an IS-A relationship with the LCA, its distance with respect to the affective category synset, as shown in the following figures. Then the algorithm gets the synsets that have a HAS-A relationship with the ancestor and an IS-A relationship with the LCA, and stores in each one of them the distance that was stored in the LCA. This distance represents the distance from each one of these synsets with respect to the corresponding affective category.

Then, the method locateClosestWNANode( ) keeps, in every synset, the highest distances in order to locate the closest affective category synsets. For the ancestor of every affective category synset, we check whether it stores one distance or more:

- If it stores only one distance, we get the corresponding 298 scores from the WNAMap and multiply all of them by that distance. Example: assume D is the sentiment joy and the stored distance is 0.5; then the scores of synset F = 0.5 × (scores of joy w.r.t. all 298 affective categories).
- If it stores two or more distances, we get from the WNAMap the 298 scores of the corresponding sentiments, multiply each by its distance, and keep the maximum per category. Example: assume B is the sentiment love and D is the sentiment affection; then score(loyalty) of synset A = max(score(loyalty) w.r.t. B, score(loyalty) w.r.t. D), and we do the same for all 298 affective categories.

Fig. 25 Example


The final step is to get all the ancestors of all the affective category synsets and set the scores of all the descendants of every ancestor to be the same as the scores of the ancestor, since every ancestor will have the scores of the closest affective category synset multiplied by the distance to that synset, and there is an IS-A relation between the ancestor and its descendants. Note that if a descendant happens to be an affective category synset, an ancestor of an affective category synset, or a leaf node, the navigation stops at that descendant.

4 EXPERIMENTAL EVALUATION

We conducted several experiments to highlight the quality of the affect scores of LISA (effectiveness tests) and the performance of the system (efficiency tests).

4.1 Effectiveness Tests

To evaluate the quality of the affect scores, we used the Pearson Correlation Coefficient (PCC) to compare the quality of LISA's scores against other established systems: SentiWordNet and AlchemyAPI. All systems' scores were evaluated with respect to the human experts' ratings developed in the ANEW study [10], widely used as a reference in sentiment/affect analysis empirical evaluations. ANEW provides a set of normative emotional ratings for a large number of English words. The following figure shows the variation of the positive-emotion scores for different words, computed by LISA, SentiWordNet and ANEW.

[Figure: positive-emotion scores (0 to 0.8) for a sample of ANEW words, from "abduction" through "betray", plotted for the series SWN, ANEW, LISA and ALCHEMY]

From the above graph, we can calculate PCC(LISA, ANEW) and PCC(SentiWordNet, ANEW):

- PCC(SentiWordNet, ANEW) = -0.2796377
- PCC(LISA, ANEW) = 0.2815578

As we can see, the PCC of LISA with ANEW is positive whereas the PCC of SentiWordNet with ANEW is negative, which means that the scores of LISA are more reliable because they correlate positively with the scores of the experts.



The following figure shows the variation of the negative-emotion scores for different words, computed by LISA, SentiWordNet and ANEW.

[Figure: negative-emotion scores (0 to 0.8) for the same sample of ANEW words, plotted for the series SWN, ANEW, LISA and ALCHEMY]

From the above graph, we can conclude that for negative scores:

- PCC(LISA, ANEW) = 0.195082782
- PCC(SentiWordNet, ANEW) = -0.116738623

As we can see, the PCC of LISA with ANEW is positive whereas the PCC of SentiWordNet with ANEW is negative, which means that the scores of LISA are more reliable because they correlate positively with the scores of the experts.

Furthermore, we calculated the Mean Square Error (MSE), which is the average of the squares of the differences between the actual observations and those predicted. In other words, we compared LISA and the legacy systems (Alchemy and SentiWordNet) to the published benchmark ANEW in order to determine which system has the least error. The obtained results are the following:

- MSE(Alchemy, ANEW) = 0.099786
- MSE(LISA, ANEW) = 0.077128
- MSE(SentiWordNet, ANEW) = 0.127913

As we can see, LISA has the least MSE with respect to ANEW. We can deduce that the scores of LISA have the smallest error compared to the legacy systems SentiWordNet and Alchemy.
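Both comparison measures are straightforward to compute over parallel score arrays (one entry per evaluated ANEW word). A minimal Java helper for reference; the arrays in main are placeholders, not the experimental data:

```java
/** Reference implementations of the two comparison measures used above. */
class Agreement {
    /** Pearson Correlation Coefficient between two parallel score arrays. */
    static double pcc(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }

    /** Mean Square Error of predicted scores against the reference ratings. */
    static double mse(double[] predicted, double[] reference) {
        double s = 0;
        for (int i = 0; i < predicted.length; i++) {
            double d = predicted[i] - reference[i];
            s += d * d;
        }
        return s / predicted.length;
    }

    public static void main(String[] args) {
        double[] system = {0.2, 0.5, 0.7};  // toy placeholder values only
        double[] anew   = {0.1, 0.6, 0.8};
        System.out.printf("PCC=%.4f MSE=%.4f%n", pcc(system, anew), mse(system, anew));
    }
}
```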



4.2 Efficiency Tests

The following graph shows the performance of our 4 developed algorithms. The x-axis represents the number of words in a sentence, and the y-axis represents the time measured in seconds. As we can see, the time required to run a certain algorithm increases with the number of words. We can notice that the running time of Backward Processing is the best, since it retrieves the scores from the database.

The second graph shows the running time required by each algorithm as the number of affect categories varies (x-axis), with the y-axis again measuring time in seconds. We can notice that the algorithm Backward Processing has the best running time: as the number of affect categories to analyze increases, it requires less time than the other algorithms.

Fig. 26 Performance of all algorithms

Fig. 27 Performance of all algorithms


5 CONCLUSION

In this project, we introduced an updated version of a knowledge-based method for sentiment and affect analysis, titled LISA 2.0. Our solution covers not only positive and negative sentiments, but also a battery of affect classes ranging from positive to negative as well as more ambiguous emotions such as joy, sadness, love, anger, disgust and astonishment. In contrast with corpus-based and machine learning methods, which require training data and time (not always available), our method utilizes predefined lexical dictionaries: WordNet and WordNet-Affect. LISA 2.0 is an updated/extended version of the original LISA engine, performing both sentiment and affect analysis with more optimized functionality and higher efficiency levels. LISA 2.0 offers several solutions to the problems that LISA 1.0 was facing, along with other improvements. Furthermore, we conducted experiments to highlight our method's effectiveness and efficiency in automatically analyzing and detecting sentiment in text. Concerning future work, our project can be upgraded in several ways:

- Investigating other possible improvements to LISA 2.0, such as considering emoticons, word repetitions, word negations, and other lexical cues which are currently disregarded by the engine.
- Implementing and integrating our method with different social media sites, such as Facebook, WhatsApp, Flickr, and YouTube, in order to perform various sentiment search and mining applications (e.g., analyzing sentiments reflected in images/videos based on related text, subtitles, etc.).


6 LIST OF FIGURES

Fig. 1 Overall Architecture of LISA 2.0
Fig. 2 Pseudo-code of WordSenseDisambiguation
Fig. 3 Case 1
Fig. 4 Case 2
Fig. 5 Case 3
Fig. 6 Dijkstra Pseudo-Code
Fig. 7 Cases to consider
Fig. 8 OSPF-WNA Pseudo-Code
Fig. 9 fillHashMap Pseudo-Code
Fig. 10 navigateForward Pseudo-Code
Fig. 11 Example
Fig. 12 Example
Fig. 13 Example
Fig. 14 ID-Processing Pseudo-Code
Fig. 15 Example
Fig. 16 Closest two affective synsets
Fig. 17 Boundary affective sentiments
Fig. 18 locateClosestWNASynset Pseudo-Code
Fig. 19 FillBuckets Pseudo-Code
Fig. 20 Case 1
Fig. 21 Nodes divided into buckets
Fig. 22 Final distance
Fig. 23 Backward Processing Pseudo-Code
Fig. 24 Example
Fig. 25 Example
Fig. 26 Performance of all algorithms
Fig. 27 Performance of all algorithms


7 REFERENCES

[1] Strapparava, C., & Valitutti, A. WordNet-Affect: an Affective Extension of WordNet.

[2] Strapparava, C., Valitutti, A., & Stock, O. "The Affective Weight of Lexicon", Proceedings of LREC 2006.

[3] Gill, A. J., French, R. M., Gergle, D., & Oberlander, J. (2008). Identifying Emotional Characteristics from Short Blog Texts. In Proceedings of the Thirtieth Annual Cognitive Science Conference, NJ: LEA.

[4] Richardson, R., & Smeaton, A. (n.d.). Using WordNet in a knowledge-based approach to information retrieval.

[5] Valitutti, A., Strapparava, C., & Stock, O. (2004, March 31). Developing affective lexical resources. PsychNology Journal, 2(1), 61-83.

[6] Strapparava, C., & Valitutti, A. (n.d.). WordNet Affect.

[7] Banerjee, S., & Pedersen, T. (n.d.). Extended gloss overlaps as a measure of semantic relatedness.

[8] Moufarrej, A., & Jreij, E. (2015). LISA: Lexical Information-based Sentiment Analysis, Capstone (Final Year Project) report, ECE Dept., Lebanese American University (LAU), Byblos, 42 pp.