building large arabic multi-domain resources for sentiment analysis

Building Large Arabic Multi-domain Resources

for Sentiment AnalysisHady ElSahar and Samhaa R. El-Beltagy

Center for Informatics Science, Nile University

CICLing 2015 – April 19, 2014

[email protected]

mailto:[email protected]

Agenda

• Problem Statement

• Building Multi-Domain Datasets for Sentiment Analysis

• Building Multi-Domain lexicons

• Experiments and Evaluation

• Mining experiments results

Problem Statement

Problem Statement

• Small size

• Domain Specificity

• Not publicly available

• Insufficient coverage of different Arabic dialects and non standard terms

Current resources for sentiment analysis suffer many deficiencies:

Problem Statement

Author Dataset name Size Multi Domain Publicly Available

Rushdi-Saleh et al. OCA 500 NO YES

Abdul-Mageed & Diab AWATIF < 10K Yes NO

Aly, M. & Atiya, A. LABR 63K NO YES

Eshrag Refaee et al. Twitter Corpus 8,868 N/A YES

Sentiment Datasets related work

Problem Statement

Author Size MSA / Dialect Multi Domain Publicly Available

El-Beltagy et al. 4K MSA + Dialect N/A YES

Abdul-Mageed & Diab (SANA) 225K MSA + Dialect Yes NO

Badaro et al. 150K MSA only N/A YES

Sentiment lexicons related work

Proposed solution

• Building large Arabic datasets and lexicons for sentiment analysis• Large size

• Multi-domain

• Arabic dialects

• Well documented, tested for sentiment classification

• Publicly available for every one to use

Agenda






Building Datasets

Building datasets from reviewing content on the internet

Building Datasets

• Lack of Arabic reviewing content on the internet:

• Less Arabic based e-commerce & reviewing websites • Arabic speakers use the English language to write their reviews

English *** , Do you Speak it !!!!

Domain Reviewing Websites Scrapped

Hotel reviews

Restaurant reviews

Product Reviews

Movie Reviews

Building Datasets

Scrapping Arabic Reviewing content on the Internet

Building Datasets

• Normalize different ratings systems into ( positive, negative and neutral ) classes using heuristics.

• Automatic labeling of reviews.

Building Datasets

• Removing redundant and spamming reviews• Removing contradicting reviews ( Similar Text Different polarity )

• Remove duplicate reviews

Datasets Statistics

Hotels Restaurants Movies Products ALL

#Reviews 15579 11310 1524 14279 42692

#Unique Reviews 15562 10940 1522 5092 33116

#Users 13407 1639 416 7465 24653

#Items 8100 4654 933 5906 19593

Sizes of Extracted Datasets

Datasets Statistics

Number of reviews for each class

Datasets Statistics

Number of tokens per review for each of the datasets

Agenda






Building multi domain lexicons

• Manually hand crafting sentiment lexicons is a tedious task

• Proposed approach utilizes feature selection and ranking of Support Vector

Machines (SVM)

• SVM with L1 regularization penalty results in sparse coefficient vectors

doesn't deserve

bad failure happen scene better wonderful enjoyable

ال يستحق سئ فشل ……. حصل مشهد …….. افضل رائع ممتع

-0.532 -0.52 -0.4 ……. 0 0 …….. 0.270 0.272 0.357

Coefficient vector of a trained support vector machine

DatasetsTraining

L1-norm SVM

Selecting Top Features

Manually Verification

Multi domain

Lexicons

Building multi domain lexicons

• Train SVM classifier on each of the generated datasets using a unigram + bigram model

• Omit features corresponding to zero coefficients

• Label features with positive coefficient values as positive lexicon entries

• Label features with negative coefficient values as Negative lexicon entries

• Manually filter and verify resulting lexicon ( a lot easier ! )

Building multi domain lexicon from the datasets

Hotels Restaurants Movies Products LABR / Books ALL

# non-zero coef.

features556 1413 526 661 3552 6708

# Manually filtered 218 734 87 369 874 1913

Size of built multi-domain lexicons before and after manual filtration

Building multi domain lexicon from the datasets Selected examples from the Generated lexicons:

Hotels Restaurants Movies

لن أعود not coming back

بارد cold

يستحق المشاهدةworth watching

ضعيفة المياهlow water pressure

يشبعEnough portions

برافوBravo

Agenda






Experiments and bench marking Datasets

• Verify the viability of using the datasets for sentiment analysis

• Test the effectiveness of the generated lexicon

• Export the results of all experiments publicly for further analysis

• Provide easy benchmarking framework for future sentiment classifiers

Experiments Benchmarking the datasets for the task of sentiment analysis :


Datasets setups : • 2 Class sentiment Classification (Positive or Negative)

• 3 Class Sentiment Classification problem (Positive, Negative or Mixed/Neutral )

• Balanced / Unbalanced Setups

• 20%-80% Splits (testing generated lexicons on unseen data)

• Cross validation


Feature building Methods :

• Standard feature building methods : • Count, TF-IDF, Delta-TFIDF

• Features built from generated lexicons :• (term existence, term count, weighted count )• Domain specific lexicon, domain general lexicon

• Merging Lexicon based features with other features

Classifiers : Linear SVM, Logistic regression, BNB , KNN and SGD


• 3075 experiments, resulted from using all classifiers, features and Datasets setups combinations together.

• Results are publicly available for further analysis and as benchmarks

Agenda






Mining experiments results

Mining the experiments results to answer questions like :

• What are the top performing classifiers and features combinations ?

• Can we rely only on lexicons for sentiment analysis ?

• What is the effect of combining lexicon based features with other features ?

• Are shorter documents easier to classify ?

• Are documents richer with subjective words easier to classify ?

Can we rely only on lexicon based features for sentiment classification?Can features generated from lexicons provide an adequate accuracy relative to other feature generating methods.


Features Number of features Average Accuracy

2 C

lass

Lex-domain ~ 500 0.768

Lex-all 1913 0.782

Count ~ 50K features 0.783


Features Number of features Average Accuracy

3C

lass

Lex-domain ~ 500 0.549

Lex-all 1913 0.554

Count ~ 50K features 0.570

Effect of merging lexicon based features with other features?Can features generated from lexicons provide an adequate accuracy relative to other feature generating methods.


Features Aggregated Lexicon Average Accuracy Enhancement

2 C

lass

Count

None 0.783

Lex-domain 0.790 + 1 %

Lex-all 0.796 + 1.6 %

TFIDF

None 0.7

Lex-domain 0.791 + 9.1 %

Lex-all 0.8 +10 %

Delta-TFIDF

None 0.692

Lex-domain 0.789 + 9.7 %

Lex-all 0.798 + 10.6 %

Shorter documents are easier to classify?Or longer ones?, How about longer ones rich with subjective terms ?


Small Space


Storyline : Patch Adams was desperate and attempt to commit a suicide many times, until he was sent to a mental hospital….……..Then he started unintentionally helping others through socializing with them until they have become better


• Document length : No. of tokens in per document (log scale)

• Subjectivity score • Sum of polarities of words that appear in the document (using generated

lexicons)

• Error Rate • Number of misclassified documents of this specific group (doc. Length and

subjectivity score )


The error rate for various document lengths and subjectivity score groups (the Darker the worse)

Conclusion• Built a large multi-domain datasets for sentiment Analysis ( 33K

reviews)

• Proposed an approach for semi-automatically learning multi-domain lexicons (~2K)

• Everything is publicly available :• Datasets (raw + processed)

• Lexicons

• Web Scrappers (to rerun for more recent reviews)

• Experiments code and results

Questions ?

Slides : bit.ly/cicling2015_elsahar_slidesDatasets : bit.ly/cicling2015_elsahar_resources

building large arabic multi-domain resources for sentiment analysis

Science

size multi domain

datasetsscrapping arabic

sentiment analysishady

reviews english

websites arabic speakers

arabic based ecommerce

automatic labeling of

support vector machines