RCOMM 2011 - Sentiment Classification with RapidMiner

DESCRIPTION

Presentation on Sentiment Classification from the 2011 RapidMiner User Conference in Dublin.

TRANSCRIPT

Page 1: RCOMM 2011 - Sentiment Classification with RapidMiner

Sentiment Classification with RapidMiner

Bruno Ohana and Brendan Tierney
DIT School of Computing
June 2011

Page 2: RCOMM 2011 - Sentiment Classification with RapidMiner

Our Talk

• Introduction to Sentiment Analysis
• Supervised Learning Approaches
• Case Study with RapidMiner

Page 3: RCOMM 2011 - Sentiment Classification with RapidMiner

Motivation

"81% of US internet users (60% of the population) have used the internet to perform research on a product they intended to purchase, as of 2007."

"Over 30% of US internet users have at one time posted a comment or online review about a product or service they've purchased."

(Horrigan, 2008)

Page 4: RCOMM 2011 - Sentiment Classification with RapidMiner

Motivation

• A lot of online content is subjective in nature.
• User Generated Content: product reviews, blog posts, Twitter, etc.
• epinions.com, Amazon, RottenTomatoes.com.
• The sheer volume of opinion data calls for automated analytical methods.

Page 5: RCOMM 2011 - Sentiment Classification with RapidMiner

Why Are Automated Methods Relevant?

• Search and Recommendation Engines.
  • Show me only positive/negative/neutral.
• Market Research.
  • What is being said about brand X on Twitter?
• Contextual Ad Placement.
• Mediation of online communities.

Page 6: RCOMM 2011 - Sentiment Classification with RapidMiner

A Growing Industry

• Opinion Mining offerings
  • Voice of Customer analytics
  • Social Media Monitoring
  • SaaS or embedded in data mining packages

Page 7: RCOMM 2011 - Sentiment Classification with RapidMiner

Opinion Mining – Sentiment Classification

• For a given text document, determine its sentiment orientation.
  • Positive or negative, favorable or unfavorable, etc.
  • Binary or along a scale (e.g. 1-5 stars).
  • Data is unstructured text, from sentence to document level.

Ex: Positive or Negative?
"This is by far the worst hotel experience I've ever had. The owner overbooked while I was staying there (even though I booked the room two months in advance) and made me move to another room, but that room wasn't even a hotel room!"

Page 8: RCOMM 2011 - Sentiment Classification with RapidMiner

Supervised Learning for Text

• Train a classifier algorithm based on a training data set.
  • Raw data will be text.
• Approach: use term presence information as features.
  • A plain text document becomes a word vector (see the sketch below).
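As a rough illustration of the term-presence idea, here is a minimal Python sketch, assuming scikit-learn's CountVectorizer as a stand-in for RapidMiner's text processing operators; the documents are purely illustrative, not from the talk:

```python
# Term presence as features: scikit-learn's CountVectorizer used as a stand-in
# for RapidMiner's text processing operators (illustrative documents).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "this movie was a wonderful surprise",      # hypothetical positive review
    "the plot was dull and the acting worse",   # hypothetical negative review
]

# binary=True records term presence (1/0) instead of term frequency
vectorizer = CountVectorizer(binary=True, lowercase=True)
word_vectors = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary: one column per term
print(word_vectors.toarray())              # each row is one document's word vector
```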

Page 9: RCOMM 2011 - Sentiment Classification with RapidMiner

Supervised Learning for Text

• A word vector can be used to train a classifier.
• Building a Word Vector:
  • Unit of tokenization: uni/bi/n-gram.
  • Term presence metric: binary, tf-idf, frequency.
  • Stemming.
  • Stop words removal.

Process: IMDB Data Set (Plain Text) → Tokenize → Stemming → Word Vector → Train Classifier
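The word-vector construction steps above can be sketched end to end in Python, assuming NLTK's PorterStemmer and scikit-learn's TfidfVectorizer and LinearSVC as stand-ins for the RapidMiner operators; the stop list, documents and labels are illustrative only:

```python
# Tokenize -> stem -> stop-word removal -> word vector -> train classifier,
# assuming NLTK and scikit-learn as stand-ins for the RapidMiner operators.
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()
stop_words = {"the", "a", "an", "is", "and", "of"}   # tiny illustrative stop list

def tokenize_and_stem(text):
    # unigram tokenization; bi-/n-grams could be built from the same token stream
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

docs = ["an imaginative and enjoyable comedy", "a dull, badly acted mess"]
labels = ["positive", "negative"]                    # hypothetical labels

# tf-idf is one of the term metrics mentioned; binary or raw frequency also work
vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, token_pattern=None)
word_vectors = vectorizer.fit_transform(docs)

classifier = LinearSVC().fit(word_vectors, labels)   # train the classifier
```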

Page 10: RCOMM 2011 - Sentiment Classification with RapidMiner

Opinion Mining – Sentiment Classification

Challenges of Data Driven Approaches

• Domain dependence.
  • "chuck norris" might be a good sentiment predictor, but on movies only.
• We lose discourse information.
  • Ex: negation detection.
  • "This comedy is not really funny."
• NLP techniques might help.

Page 11: RCOMM 2011 - Sentiment Classification with RapidMiner

RapidMiner Case Study

• Sentiment Classification based on Word Vectors.
• Convert text data to word vectors.
  • Using RapidMiner's Text Processing Extension.
• Use the word vectors to train and test a learner model.
  • Using Cross-Validation.
  • Using Correlation and Parameter Testing to pick better features.
• Our data set is the collection of film reviews from IMDB presented in (Pang et al, 2004).

Page 12: RCOMM 2011 - Sentiment Classification with RapidMiner

RapidMiner Case Study

• Select the document collection from a directory.
• From text to a list of tokens.
• Convert word variations to their stem.
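A hedged sketch of the "read documents from a directory" step, assuming scikit-learn's load_files helper and a hypothetical reviews/pos, reviews/neg folder layout (the talk uses RapidMiner's document reader, not this helper):

```python
# Loading a labeled review collection from a directory layout such as
#   reviews/pos/*.txt and reviews/neg/*.txt   (hypothetical local path).
from sklearn.datasets import load_files

corpus = load_files("reviews", encoding="utf-8")  # sub-folder names become labels

print(len(corpus.data))      # number of plain-text documents
print(corpus.target_names)   # e.g. ['neg', 'pos']
print(corpus.target[:10])    # numeric label per document
```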

Page 13: RCOMM 2011 - Sentiment Classification with RapidMiner

RapidMiner Case Study – Parameter Testing

• Filter the "top K" most correlated attributes.
• K is a macro iterated using Parameter Testing (sketched below).
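In scikit-learn terms, the "filter top K, iterate K" loop might look like the sketch below; SelectKBest with a chi-squared score stands in for the correlation-based weighting, and GridSearchCV plays the role of the Parameter Testing operator (both are assumptions, not the talk's actual setup):

```python
# "Filter top-K attributes, iterate K": SelectKBest approximates the
# correlation-based filter, GridSearchCV plays the Parameter Testing role.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("select_top_k", SelectKBest(score_func=chi2)),  # chi2 is an assumed weighting
    ("svm", LinearSVC()),
])

param_grid = {"select_top_k__k": [100, 500, 1800, 5000]}  # candidate values of K
search = GridSearchCV(pipeline, param_grid, cv=10)        # 10-fold cross-validation

# X is a document-term matrix and y the sentiment labels, built as in earlier steps:
# search.fit(X, y)
# print(search.best_params_, search.best_score_)
```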

Page 14: RCOMM 2011 - Sentiment Classification with RapidMiner

RapidMiner Case Study – Cross-Validation: Training Step

• Calculate attribute weights and normalize.
• Pass the fitted models on the "through port" to the Testing step.
• Select the "top k" attributes by weight and train an SVM.
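The key point of the training step is that the weights, normalization and feature selection are fitted on the training split only and then handed, already fitted, to the testing step. A minimal sketch of that discipline, with scikit-learn objects as stand-ins for the RapidMiner operators and an illustrative run_fold helper:

```python
# One cross-validation fold: the selector and scaler are fitted on the training
# split only, then reused unchanged on the test split -- the same idea as passing
# fitted models "through port" from Training to Testing in RapidMiner.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import LinearSVC

def run_fold(X_train, y_train, X_test, y_test, k=1800):
    # attribute weighting + top-k selection (chi2 is a stand-in weighting scheme)
    selector = SelectKBest(score_func=chi2, k=k).fit(X_train, y_train)
    # normalization fitted on the selected training attributes
    scaler = MaxAbsScaler().fit(selector.transform(X_train))
    svm = LinearSVC().fit(scaler.transform(selector.transform(X_train)), y_train)

    # testing step: apply the already-fitted selector and scaler, never refit here
    X_test_prepared = scaler.transform(selector.transform(X_test))
    return svm.score(X_test_prepared, y_test)
```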

Page 15: RCOMM 2011 - Sentiment Classification with RapidMiner

RapidMiner Case Study – Cross-Validation: Testing Step

Page 16: RCOMM 2011 - Sentiment Classification with RapidMiner

Case Study – Adding More Features

• Pre-computed features based on text statistics.
  • Document, word and sentence sizes, part-of-speech presence, stop words ratio, syllable count.
• Features based on scoring using a sentiment lexicon (Ohana & Tierney '09).
  • Used SentiWordNet as the lexicon (Esuli et al, 09).
• In RapidMiner we can merge those data sets using a known unique ID (the file name in our case), as sketched below.
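A small sketch of the merge-by-ID step, using pandas as a stand-in for RapidMiner's join operator; the file names and column names are illustrative:

```python
# Merging word-vector features with lexicon features on a shared file-name ID,
# using pandas as a stand-in for RapidMiner's join (column names illustrative).
import pandas as pd

word_vectors = pd.DataFrame({
    "file": ["rev001.txt", "rev002.txt"],
    "good": [1, 0],
    "bad": [0, 1],
})
lexicon_scores = pd.DataFrame({
    "file": ["rev001.txt", "rev002.txt"],
    "swn_positive": [0.61, 0.12],
    "swn_negative": [0.08, 0.55],
})

# inner join on the unique file-name ID keeps documents present in both sets
merged = word_vectors.merge(lexicon_scores, on="file", how="inner")
print(merged)
```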

Page 17: RCOMM 2011 - Sentiment Classification with RapidMiner

Opinion Lexicons

• A database of terms and the opinion information they carry.
• Some terms and expressions carry an "a priori" opinion bias, relatively independent of context.
  • Ex: good, excellent, bad, poor.
• To build the data set:
  • Score each document based on the lexicon terms found (see the sketch below).
  • Total positive/negative scores.
  • Per part-of-speech.
  • Per document section.
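A minimal sketch of lexicon-based scoring with a tiny hand-made lexicon; the talk uses SentiWordNet, and the scores below are invented for illustration:

```python
# Scoring a document against a tiny hand-made opinion lexicon; the real lexicon
# in the talk is SentiWordNet, and these (positive, negative) scores are invented.
LEXICON = {
    "good": (0.7, 0.0), "excellent": (0.9, 0.0),
    "bad": (0.0, 0.7), "poor": (0.0, 0.8),
}

def score_document(tokens):
    positive = sum(LEXICON[t][0] for t in tokens if t in LEXICON)
    negative = sum(LEXICON[t][1] for t in tokens if t in LEXICON)
    return {"positive": positive, "negative": negative}

print(score_document("an excellent cast wasted on a poor script".split()))
# -> {'positive': 0.9, 'negative': 0.8}
```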

Page 18: RCOMM 2011 - Sentiment Classification with RapidMiner

Lexicon Based Approach

Process: IMDB Data Set (Plain Text) → POS Tagger → Negation Detection → Scoring (against SentiWordNet) → SWN Features / Document Scores

Page 19: RCOMM 2011 - Sentiment Classification with RapidMiner

Part of Speech Tagging

The computer-animated comedy " shrek " is designed to be enjoyed on different levels by different groups . for children , it offers imaginative visuals , appealing new characters mixed with a host of familiar faces , loads of action and a barrage of big laughs

The/DT computer-animated/JJ comedy/NN ''/'' shrek/NN ''/'' is/VBZ designed/VBN to/TO be/VB enjoyed/VBN on/IN different/JJ levels/NNS by/IN different/JJ groups/NNS ./. for/IN children/NNS ,/, it/PRP offers/VBZ imaginative/JJ visuals/NNS ,/, appealing/VBG new/JJ characters/NNS mixed/VBN with/IN a/DT host/NN of/IN familiar/JJ faces/NNS ,/, loads/NNS of/IN action/NN and/CC a/DT barrage/NN of/IN big/JJ laughs/NNS
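Tagged output of this kind can be produced, for example, with NLTK's Penn Treebank tagger; the talk does not state which tagger was used, so this is only an illustrative sketch:

```python
# Producing word/TAG pairs with NLTK's Penn Treebank tagger (illustrative only;
# the talk does not say which tagger produced the output above).
import nltk

nltk.download("punkt")                        # one-time tokenizer model
nltk.download("averaged_perceptron_tagger")   # one-time tagger model

text = 'The computer-animated comedy "shrek" is designed to be enjoyed on different levels'
tagged = nltk.pos_tag(nltk.word_tokenize(text))
print(" ".join(f"{word}/{tag}" for word, tag in tagged))
# e.g. The/DT computer-animated/JJ comedy/NN ... designed/VBN to/TO be/VB enjoyed/VBN ...
```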

Page 20: RCOMM 2011 - Sentiment Classification with RapidMiner

Negation Detection

• NegEx (Chapman et al '01).
• Look for negating expressions.
  • Pseudo-negations: "no wonder", "no change", "not only".
  • Forward and backward scope: "don't", "not", "without", "unlikely to", etc.
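A simplified, forward-scope-only sketch of this idea in Python; it is a loose adaptation of the NegEx approach, not the authors' implementation, and the trigger and pseudo-negation lists are illustrative:

```python
# Forward-scope negation marking, loosely in the spirit of NegEx: tokens after a
# negation trigger are flagged until the scope ends, and pseudo-negations such as
# "no wonder" are skipped. Trigger lists and scope length are illustrative.
NEGATION_TRIGGERS = {"not", "don't", "without", "no"}
PSEUDO_NEGATIONS = {("no", "wonder"), ("no", "change"), ("not", "only")}
SCOPE = 3  # mark up to three tokens of forward scope

def mark_negation(tokens):
    marked, remaining = [], 0
    for i, token in enumerate(tokens):
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        if token in NEGATION_TRIGGERS and (token, next_token) not in PSEUDO_NEGATIONS:
            marked.append(token)
            remaining = SCOPE          # open a new forward scope
        elif remaining > 0:
            marked.append("NOT_" + token)
            remaining -= 1
        else:
            marked.append(token)
    return marked

print(mark_negation("this comedy is not really funny".split()))
# -> ['this', 'comedy', 'is', 'not', 'NOT_really', 'NOT_funny']
```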

Page 21: RCOMM 2011 - Sentiment Classification with RapidMiner

Case Study – Adding More Features

• Data Set Merging

Page 22: RCOMM 2011 - Sentiment Classification with RapidMiner

Results - Accuracy

Average accuracy using 10-fold Cross-Validation:

Method                                                              | Accuracy % | Feature Count
Baseline word vector                                                | 85.39      | 6739
Baseline less uncorrelated attributes                               | 85.49      | 1800
Document Stats (S)                                                  | 68.73      | 22
SentiWordNet features (SWN)                                         | 67.40      | 39
Merging (S) + (SWN)                                                 | 72.79      | 61
Merging Baseline + (S) + (SWN) and removing uncorrelated attributes | 86.39      | 1800

Page 23: RCOMM 2011 - Sentiment Classification with RapidMiner

Opinion Mining – Sentiment Classification

Some results from the field (IMDB data set):

Method                                           | Accuracy | Source
Support Vector Machines and Bigrams word vector  | 77.10%   | (Pang et al, 2002)
Word Vector Naïve Bayes + Parts of Speech        | 77.50%   | (Salvetti et al, 2004)
Support Vector Machines and Unigrams word vector | 82.90%   | (Pang et al, 2002)
Unigrams + Subjectivity Detection                | 87.15%   | (Pang et al, 2004)
SVM + stylistic features                         | 87.95%   | (Abbasi et al, 2008)
SVM + GA feature selection                       | 95.55%   | (Abbasi et al, 2008)

Page 24: RCOMM 2011 - Sentiment Classification with RapidMiner

Results – Term Correlation

Terms (after stemming):

Most Correlated:  didn, georg, add, wast, bore, guess, bad, son, stupid, masterpiece, perform, stereotyp, if, adventur, oscar, worst, blond, mediocr
Least Correlated: already, face, which, put, same, without, someth, must, manag, someon, talent, get, goe, sinc, abrupt

Page 25: RCOMM 2011 - Sentiment Classification with RapidMiner

Thank You