RCOMM 2011 - Sentiment Classification with RapidMiner

DESCRIPTION

Presentation on Sentiment Classification from the 2011 RapidMiner User Conference in Dublin.

TRANSCRIPT

Page 1: RCOMM 2011 - Sentiment Classification with RapidMiner

Sentiment Classification with RapidMiner

Bruno Ohana and Brendan Tierney
DIT School of Computing
June 2011

Page 2: RCOMM 2011 - Sentiment Classification with RapidMiner

Our Talk

• Introduction to Sentiment Analysis
• Supervised Learning Approaches
• Case Study with RapidMiner

Page 3: RCOMM 2011 - Sentiment Classification with RapidMiner

Motivation

"81% of US internet users (60% of the population) have used the internet to perform research on a product they intended to purchase, as of 2007."

"Over 30% of US internet users have at one time posted a comment or online review about a product or service they've purchased."

(Horrigan, 2008)

Page 4: RCOMM 2011 - Sentiment Classification with RapidMiner

Motivation

• A lot of online content is subjective in nature.
• User Generated Content: product reviews, blog posts, Twitter, etc.
• epinions.com, Amazon, RottenTomatoes.com.
• The sheer volume of opinion data calls for automated analytical methods.

Page 5: RCOMM 2011 - Sentiment Classification with RapidMiner

Why Are Automated Methods Relevant?

• Search and Recommendation Engines.
  • Show me only positive/negative/neutral.
• Market Research.
  • What is being said about brand X on Twitter?
• Contextual Ad Placement.
• Mediation of online communities.

Page 6: RCOMM 2011 - Sentiment Classification with RapidMiner

A Growing Industry

• Opinion Mining offerings
  • Voice of Customer analytics
  • Social Media Monitoring
  • SaaS or embedded in data mining packages

Page 7: RCOMM 2011 - Sentiment Classification with RapidMiner

Opinion Mining – Sentiment Classification

• For a given text document, determine its sentiment orientation.
  • Positive or negative, favorable or unfavorable, etc.
  • Binary or along a scale (e.g. 1-5 stars).
  • Data is unstructured text, from sentence to document level.

Ex: Positive or Negative?
"This is by far the worst hotel experience I've ever had. The owner overbooked while I was staying there (even though I booked the room two months in advance) and made me move to another room, but that room wasn't even a hotel room!"

Page 8: RCOMM 2011 - Sentiment Classification with RapidMiner

Supervised Learning for Text

• Train a classifier algorithm based on a training data set.
  • Raw data will be text.
• Approach: use term presence information as features.
  • A plain text document becomes a word vector (see the sketch below).
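As a rough illustration of the term-presence idea, here is a minimal Python sketch, assuming scikit-learn's CountVectorizer as a stand-in for RapidMiner's text processing operators; the documents are purely illustrative, not from the talk:

```python
# Term presence as features: scikit-learn's CountVectorizer used as a stand-in
# for RapidMiner's text processing operators (illustrative documents).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "this movie was a wonderful surprise",      # hypothetical positive review
    "the plot was dull and the acting worse",   # hypothetical negative review
]

# binary=True records term presence (1/0) instead of term frequency
vectorizer = CountVectorizer(binary=True, lowercase=True)
word_vectors = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary: one column per term
print(word_vectors.toarray())              # each row is one document's word vector
```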

Page 9: RCOMM 2011 - Sentiment Classification with RapidMiner

Supervised Learning for Text

• A word vector can be used to train a classifier.
• Building a Word Vector:
  • Unit of tokenization: uni/bi/n-gram.
  • Term presence metric: binary, tf-idf, frequency.
  • Stemming.
  • Stop words removal.

Process: IMDB Data Set (Plain Text) → Tokenize → Stemming → Word Vector → Train Classifier
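The word-vector construction steps above can be sketched end to end in Python, assuming NLTK's PorterStemmer and scikit-learn's TfidfVectorizer and LinearSVC as stand-ins for the RapidMiner operators; the stop list, documents and labels are illustrative only:

```python
# Tokenize -> stem -> stop-word removal -> word vector -> train classifier,
# assuming NLTK and scikit-learn as stand-ins for the RapidMiner operators.
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()
stop_words = {"the", "a", "an", "is", "and", "of"}   # tiny illustrative stop list

def tokenize_and_stem(text):
    # unigram tokenization; bi-/n-grams could be built from the same token stream
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

docs = ["an imaginative and enjoyable comedy", "a dull, badly acted mess"]
labels = ["positive", "negative"]                    # hypothetical labels

# tf-idf is one of the term metrics mentioned; binary or raw frequency also work
vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, token_pattern=None)
word_vectors = vectorizer.fit_transform(docs)

classifier = LinearSVC().fit(word_vectors, labels)   # train the classifier
```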

Page 10: RCOMM 2011 - Sentiment Classification with RapidMiner

Opinion Mining – Sentiment Classification

Challenges of Data Driven Approaches

• Domain dependence.
  • "chuck norris" might be a good sentiment predictor, but on movies only.
• We lose discourse information.
  • Ex: negation detection.
  • "This comedy is not really funny."
• NLP techniques might help.

Page 11: RCOMM 2011 - Sentiment Classification with RapidMiner

RapidMiner Case Study

• Sentiment Classification based on Word Vectors.
• Convert text data to word vectors.
  • Using RapidMiner's Text Processing Extension.
• Use the word vectors to train and test a learner model.
  • Using Cross-Validation.
  • Using Correlation and Parameter Testing to pick better features.
• Our data set is the collection of film reviews from IMDB presented in (Pang et al, 2004).

Page 12: RCOMM 2011 - Sentiment Classification with RapidMiner

RapidMiner Case Study

• Select the document collection from a directory.
• From text to a list of tokens.
• Convert word variations to their stem.
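A hedged sketch of the "read documents from a directory" step, assuming scikit-learn's load_files helper and a hypothetical reviews/pos, reviews/neg folder layout (the talk uses RapidMiner's document reader, not this helper):

```python
# Loading a labeled review collection from a directory layout such as
#   reviews/pos/*.txt and reviews/neg/*.txt   (hypothetical local path).
from sklearn.datasets import load_files

corpus = load_files("reviews", encoding="utf-8")  # sub-folder names become labels

print(len(corpus.data))      # number of plain-text documents
print(corpus.target_names)   # e.g. ['neg', 'pos']
print(corpus.target[:10])    # numeric label per document
```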

Page 13: RCOMM 2011 - Sentiment Classification with RapidMiner

RapidMiner Case Study – Parameter Testing

• Filter the "top K" most correlated attributes.
• K is a macro iterated using Parameter Testing (sketched below).
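In scikit-learn terms, the "filter top K, iterate K" loop might look like the sketch below; SelectKBest with a chi-squared score stands in for the correlation-based weighting, and GridSearchCV plays the role of the Parameter Testing operator (both are assumptions, not the talk's actual setup):

```python
# "Filter top-K attributes, iterate K": SelectKBest approximates the
# correlation-based filter, GridSearchCV plays the Parameter Testing role.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("select_top_k", SelectKBest(score_func=chi2)),  # chi2 is an assumed weighting
    ("svm", LinearSVC()),
])

param_grid = {"select_top_k__k": [100, 500, 1800, 5000]}  # candidate values of K
search = GridSearchCV(pipeline, param_grid, cv=10)        # 10-fold cross-validation

# X is a document-term matrix and y the sentiment labels, built as in earlier steps:
# search.fit(X, y)
# print(search.best_params_, search.best_score_)
```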

Page 14: RCOMM 2011 - Sentiment Classification with RapidMiner

RapidMiner Case Study – Cross-Validation: Training Step

• Calculate attribute weights and normalize.
• Pass the fitted models on the "through port" to the Testing step.
• Select the "top k" attributes by weight and train an SVM.
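The key point of the training step is that the weights, normalization and feature selection are fitted on the training split only and then handed, already fitted, to the testing step. A minimal sketch of that discipline, with scikit-learn objects as stand-ins for the RapidMiner operators and an illustrative run_fold helper:

```python
# One cross-validation fold: the selector and scaler are fitted on the training
# split only, then reused unchanged on the test split -- the same idea as passing
# fitted models "through port" from Training to Testing in RapidMiner.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import LinearSVC

def run_fold(X_train, y_train, X_test, y_test, k=1800):
    # attribute weighting + top-k selection (chi2 is a stand-in weighting scheme)
    selector = SelectKBest(score_func=chi2, k=k).fit(X_train, y_train)
    # normalization fitted on the selected training attributes
    scaler = MaxAbsScaler().fit(selector.transform(X_train))
    svm = LinearSVC().fit(scaler.transform(selector.transform(X_train)), y_train)

    # testing step: apply the already-fitted selector and scaler, never refit here
    X_test_prepared = scaler.transform(selector.transform(X_test))
    return svm.score(X_test_prepared, y_test)
```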

Page 15: RCOMM 2011 - Sentiment Classification with RapidMiner

RapidMiner Case Study – Cross-Validation: Testing Step

Page 16: RCOMM 2011 - Sentiment Classification with RapidMiner

Case Study – Adding More Features

• Pre-computed features based on text statistics.
  • Document, word and sentence sizes, part-of-speech presence, stop words ratio, syllable count.
• Features based on scoring using a sentiment lexicon (Ohana & Tierney '09).
  • Used SentiWordNet as the lexicon (Esuli et al, 09).
• In RapidMiner we can merge those data sets using a known unique ID (the file name in our case), as sketched below.
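A small sketch of the merge-by-ID step, using pandas as a stand-in for RapidMiner's join operator; the file names and column names are illustrative:

```python
# Merging word-vector features with lexicon features on a shared file-name ID,
# using pandas as a stand-in for RapidMiner's join (column names illustrative).
import pandas as pd

word_vectors = pd.DataFrame({
    "file": ["rev001.txt", "rev002.txt"],
    "good": [1, 0],
    "bad": [0, 1],
})
lexicon_scores = pd.DataFrame({
    "file": ["rev001.txt", "rev002.txt"],
    "swn_positive": [0.61, 0.12],
    "swn_negative": [0.08, 0.55],
})

# inner join on the unique file-name ID keeps documents present in both sets
merged = word_vectors.merge(lexicon_scores, on="file", how="inner")
print(merged)
```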

Page 17: RCOMM 2011 - Sentiment Classification with RapidMiner

Opinion Lexicons

• A database of terms and the opinion information they carry.
• Some terms and expressions carry an "a priori" opinion bias, relatively independent of context.
  • Ex: good, excellent, bad, poor.
• To build the data set:
  • Score each document based on the lexicon terms found (see the sketch below).
  • Total positive/negative scores.
  • Per part-of-speech.
  • Per document section.
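A minimal sketch of lexicon-based scoring with a tiny hand-made lexicon; the talk uses SentiWordNet, and the scores below are invented for illustration:

```python
# Scoring a document against a tiny hand-made opinion lexicon; the real lexicon
# in the talk is SentiWordNet, and these (positive, negative) scores are invented.
LEXICON = {
    "good": (0.7, 0.0), "excellent": (0.9, 0.0),
    "bad": (0.0, 0.7), "poor": (0.0, 0.8),
}

def score_document(tokens):
    positive = sum(LEXICON[t][0] for t in tokens if t in LEXICON)
    negative = sum(LEXICON[t][1] for t in tokens if t in LEXICON)
    return {"positive": positive, "negative": negative}

print(score_document("an excellent cast wasted on a poor script".split()))
# -> {'positive': 0.9, 'negative': 0.8}
```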

Page 18: RCOMM 2011 - Sentiment Classification with RapidMiner

Lexicon Based Approach

Process: IMDB Data Set (Plain Text) → POS Tagger → Negation Detection → Scoring (against SentiWordNet) → SWN Features / Document Scores

Page 19: RCOMM 2011 - Sentiment Classification with RapidMiner

Part of Speech Tagging

The computer-animated comedy " shrek " is designed to be enjoyed on different levels by different groups . for children , it offers imaginative visuals , appealing new characters mixed with a host of familiar faces , loads of action and a barrage of big laughs

The/DT computer-animated/JJ comedy/NN ''/'' shrek/NN ''/'' is/VBZ designed/VBN to/TO be/VB enjoyed/VBN on/IN different/JJ levels/NNS by/IN different/JJ groups/NNS ./. for/IN children/NNS ,/, it/PRP offers/VBZ imaginative/JJ visuals/NNS ,/, appealing/VBG new/JJ characters/NNS mixed/VBN with/IN a/DT host/NN of/IN familiar/JJ faces/NNS ,/, loads/NNS of/IN action/NN and/CC a/DT barrage/NN of/IN big/JJ laughs/NNS
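Tagged output of this kind can be produced, for example, with NLTK's Penn Treebank tagger; the talk does not state which tagger was used, so this is only an illustrative sketch:

```python
# Producing word/TAG pairs with NLTK's Penn Treebank tagger (illustrative only;
# the talk does not say which tagger produced the output above).
import nltk

nltk.download("punkt")                        # one-time tokenizer model
nltk.download("averaged_perceptron_tagger")   # one-time tagger model

text = 'The computer-animated comedy "shrek" is designed to be enjoyed on different levels'
tagged = nltk.pos_tag(nltk.word_tokenize(text))
print(" ".join(f"{word}/{tag}" for word, tag in tagged))
# e.g. The/DT computer-animated/JJ comedy/NN ... designed/VBN to/TO be/VB enjoyed/VBN ...
```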

Page 20: RCOMM 2011 - Sentiment Classification with RapidMiner

Negation Detection

• NegEx (Chapman et al '01).
• Look for negating expressions.
  • Pseudo-negations: "no wonder", "no change", "not only".
  • Forward and backward scope: "don't", "not", "without", "unlikely to", etc.
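A simplified, forward-scope-only sketch of this idea in Python; it is a loose adaptation of the NegEx approach, not the authors' implementation, and the trigger and pseudo-negation lists are illustrative:

```python
# Forward-scope negation marking, loosely in the spirit of NegEx: tokens after a
# negation trigger are flagged until the scope ends, and pseudo-negations such as
# "no wonder" are skipped. Trigger lists and scope length are illustrative.
NEGATION_TRIGGERS = {"not", "don't", "without", "no"}
PSEUDO_NEGATIONS = {("no", "wonder"), ("no", "change"), ("not", "only")}
SCOPE = 3  # mark up to three tokens of forward scope

def mark_negation(tokens):
    marked, remaining = [], 0
    for i, token in enumerate(tokens):
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        if token in NEGATION_TRIGGERS and (token, next_token) not in PSEUDO_NEGATIONS:
            marked.append(token)
            remaining = SCOPE          # open a new forward scope
        elif remaining > 0:
            marked.append("NOT_" + token)
            remaining -= 1
        else:
            marked.append(token)
    return marked

print(mark_negation("this comedy is not really funny".split()))
# -> ['this', 'comedy', 'is', 'not', 'NOT_really', 'NOT_funny']
```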

Page 21: RCOMM 2011 - Sentiment Classification with RapidMiner

Case Study – Adding More Features

• Data Set Merging

Page 22: RCOMM 2011 - Sentiment Classification with RapidMiner

Results - Accuracy

Average accuracy using 10-fold Cross-Validation:

Method                                                              | Accuracy % | Feature Count
Baseline word vector                                                | 85.39      | 6739
Baseline less uncorrelated attributes                               | 85.49      | 1800
Document Stats (S)                                                  | 68.73      | 22
SentiWordNet features (SWN)                                         | 67.40      | 39
Merging (S) + (SWN)                                                 | 72.79      | 61
Merging Baseline + (S) + (SWN) and removing uncorrelated attributes | 86.39      | 1800

Page 23: RCOMM 2011 - Sentiment Classification with RapidMiner

Opinion Mining – Sentiment Classification

Some results from the field (IMDB data set):

Method                                           | Accuracy | Source
Support Vector Machines and Bigrams word vector  | 77.10%   | (Pang et al, 2002)
Word Vector Naïve Bayes + Parts of Speech        | 77.50%   | (Salvetti et al, 2004)
Support Vector Machines and Unigrams word vector | 82.90%   | (Pang et al, 2002)
Unigrams + Subjectivity Detection                | 87.15%   | (Pang et al, 2004)
SVM + stylistic features                         | 87.95%   | (Abbasi et al, 2008)
SVM + GA feature selection                       | 95.55%   | (Abbasi et al, 2008)

Page 24: RCOMM 2011 - Sentiment Classification with RapidMiner

Results – Term Correlation

Terms (after stemming):

Most Correlated:  didn, georg, add, wast, bore, guess, bad, son, stupid, masterpiece, perform, stereotyp, if, adventur, oscar, worst, blond, mediocr
Least Correlated: already, face, which, put, same, without, someth, must, manag, someon, talent, get, goe, sinc, abrupt

Page 25: RCOMM 2011 - Sentiment Classification with RapidMiner

Thank You