rule-based classifications - cloudinary · 2016. 5. 12. · kaggle commercial datasets provided by...


Post on 18-Jan-2021


Page 1: RULE-BASED CLASSIFICATIONS - Cloudinary · 2016. 5. 12. · KAGGLE Commercial datasets provided by Newsroom with machine graded tweets 4,000 Tweets Newsroom Using Python and twython
01
DATA SETS
What was the data collection process? When, where, and how?

02
PRE-PROCESSING
Natural language data pre-processing steps

03
RULE-BASED CLASSIFICATIONS
Without using machine learning, can we use some basic rules to classify sentiments?

04
MACHINE LEARNING: BINARY CLASSIFICATIONS
Using machine learning algorithms, we attempt to classify positive and negative sentiments

05
MACHINE LEARNING: MULTI-CLASS CLASSIFICATIONS
Using machine learning to predict positive, neutral and negative sentiments and determine the best model for data with unknown sentiments

06
RESULTS
The summary and statistics of the collected airline tweets

07
CLOSING THOUGHTS
FEEDBACK | Q&A


Where, when and how?


UNITED: #unitedairlines, #unitedair, @united

AA: #americanairlines, #americanair, @AmericanAir

DELTA: #deltaairlines, #deltaair, @delta

SOUTHWEST: #southwestairlines, #southwestair, @southwestair
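One way the search terms above could be turned into a Twitter search is sketched below. It is only an illustration: the slides say Twython was used but do not show the calls, so joining the terms with `OR`, the helper names `build_query` and `fetch_tweets`, and the credential arguments are all assumptions.

```python
# Search terms per airline, copied from the slide.
SEARCH_TERMS = {
    "United": ["#unitedairlines", "#unitedair", "@united"],
    "American": ["#americanairlines", "#americanair", "@AmericanAir"],
    "Delta": ["#deltaairlines", "#deltaair", "@delta"],
    "Southwest": ["#southwestairlines", "#southwestair", "@southwestair"],
}


def build_query(terms):
    """Join search terms with OR so that a tweet matching any of them is returned."""
    return " OR ".join(terms)


def fetch_tweets(airline, app_key, app_secret, oauth_token, oauth_token_secret, count=100):
    """Hypothetical helper: run one search request for an airline.

    The credentials are placeholders. Twython is imported lazily so the
    query-building part can be used without the package installed.
    """
    from twython import Twython

    twitter = Twython(app_key, app_secret, oauth_token, oauth_token_secret)
    response = twitter.search(q=build_query(SEARCH_TERMS[airline]), count=count)
    return response["statuses"]
```

Collecting tweets over a 7-day period would mean repeating such requests and storing the results; that part is not shown here.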

KAGGLE
14,640 Tweets
Reformatted/cleaned tweets with graded sentiment of major airlines from Feb 2015

Newsroom
4,000 Tweets
Commercial datasets provided by Newsroom with machine graded tweets

TWITTER API
80,121 Tweets
Using Python and twython to retrieve tweets through Twitter’s API during a 7-day period.

Natural language processes to prepare and transform the tweet content for various classification models and sentiment analysis

Major processing steps

• Perform regular expression

• Tokenize all tweet contents

• Remove all the English stop words

• Vector Matrix… get it? (for machine learning)
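The tokenize, stop-word and vector-matrix steps can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the stop-word list is a tiny hypothetical sample, and the names `tokenize`, `remove_stop_words` and `vectorize` are made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"the", "is", "a", "an", "to", "and", "my", "was", "of", "for", "in", "on"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def vectorize(docs):
    """Build a sorted vocabulary and a document-by-term count matrix."""
    token_lists = [remove_stop_words(tokenize(d)) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    matrix = []
    for tokens in token_lists:
        counts = Counter(tokens)
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix
```

Each row of the matrix is one tweet and each column is one vocabulary word, which is the form a classifier expects as input.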

Replace ‘&amp’ Signs
Replace every ‘&amp’ symbol with the word ‘and’, preceded by a space.
    re.sub(r'&', r' and', tweet)

Remove Additional Spaces
Remove all annoying white spaces.
    re.sub(r'[\s]+', ' ', tweet)

Replace #hashtags
Convert all #hashtagwords to just ‘hashtagwords’.
    re.sub(r'#([^\s]+)', r'\1', tweet)

Eliminate URLs
Turn all https://... and www… formats into just the word ‘URL’.
    re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)

Replace @username
Convert all @username to just ‘username’, without the ‘@’ sign.
    re.sub(r'@([^\s]+)', r'\1', tweet)

Lower Cases
Turn all letters to lower case. One option is to retain the original upper case.
    tweet.lower()
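The individual substitutions above can be chained into one helper. The sketch below reuses the slide's patterns; the function name `clean_tweet`, the `lowercase` flag, the final `strip()` and the order of the steps (URLs first, so that a `#` inside a link is not touched) are assumptions made for this example.

```python
import re


def clean_tweet(tweet, lowercase=True):
    """Apply the slide's regular-expression steps to a single tweet."""
    # Replace links with the placeholder word 'URL'.
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    # Keep the user name but drop the '@' sign.
    tweet = re.sub(r'@([^\s]+)', r'\1', tweet)
    # Keep the hashtag word but drop the '#' sign.
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    # Replace the ampersand with the word 'and'.
    tweet = re.sub(r'&', r' and', tweet)
    # Collapse runs of white space, then trim the ends (trimming is an addition).
    tweet = re.sub(r'[\s]+', ' ', tweet).strip()
    # Lowercasing is optional, as the slide notes.
    return tweet.lower() if lowercase else tweet
```

Note that lowercasing also turns the placeholder `URL` into `url`; passing `lowercase=False` keeps the original capitalization.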


RULE BASED

First we will use two simple rule-based techniques to conduct our sentiment analysis to see what happens


• (pos) x (pos) = (pos): “This movie is very good”

• (pos) x (neg) = (neg): “This movie is not good at all”

• (neg) x (neg) = (pos): “This movie was not bad”
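The sign rule on this slide can be read as multiplying +1 for each positive word and -1 for each negative word. The sketch below implements that reading; the two word sets are tiny hypothetical lexicons, treating "very" as positive and "not" as negative is an assumption, and `sign_product` is a made-up name. It is not necessarily how the project's word-count classifier worked.

```python
# Tiny hypothetical lexicons, just large enough for the slide's three examples.
POSITIVE = {"good", "great", "very"}
NEGATIVE = {"bad", "not", "rude"}


def sign_product(text):
    """Multiply +1 for positive words and -1 for negative words."""
    product = 1
    hits = 0
    for word in text.lower().split():
        if word in POSITIVE:
            hits += 1
        elif word in NEGATIVE:
            product *= -1
            hits += 1
    if hits == 0:
        return "neutral"  # no sentiment-bearing word found
    return "pos" if product > 0 else "neg"
```

An even number of negative words cancels out, which is why "not bad" comes out positive.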


VADER is a lexicon and rule-based sentiment analysis tool that is especially attuned to sentiments expressed in social media.


Examples

“VADER is very smart, handsome, and funny!”
{'neg': 0.0, 'neu': 0.293, 'pos': 0.707, 'compound': 0.8646}

“VADER is smart, handsome and funny.”
{'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}

“VADER is VERY SMART, handsome, and FUNNY!!!”
{'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'compound': 0.9342}

“Today SUX!”
{'neg': 0.779, 'neu': 0.221, 'pos': 0.0, 'compound': -0.5461}

“At least it isn’t a horrible book.”
{'neg': 0.0, 'neu': 0.637, 'pos': 0.363, 'compound': 0.431}

“Today kinda sux! But I’ll get by, lol :)”
{'neg': 0.154, 'neu': 0.42, 'pos': 0.426, 'compound': 0.5974}
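Score dictionaries like the ones above are what `SentimentIntensityAnalyzer().polarity_scores(text)` from the `vaderSentiment` package returns. To use them for classification, the compound score has to be mapped to a label. The sketch below uses cutoffs of +0.05 and -0.05, which are commonly suggested for VADER; the slides do not state which cutoffs the project used, so both the threshold and the name `label_from_compound` are assumptions.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dictionary to a sentiment label."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Two of the slide's examples, used as plain dictionaries.
today_sux = {"neg": 0.779, "neu": 0.221, "pos": 0.0, "compound": -0.5461}
not_horrible = {"neg": 0.0, "neu": 0.637, "pos": 0.363, "compound": 0.431}
```

With these inputs, `label_from_compound(today_sux)` gives `"negative"` and `label_from_compound(not_horrible)` gives `"positive"`.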


Part I – Binary Classifications


Data Prep
Remove All Neutral Sentiment

Linear SVM
Perform Support Vector Classifier
Hyperplane

Random Forest
Perform Random Forest Model
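The data preparation step and the accuracy figures reported for these models can be illustrated with two small helpers. This is a sketch under assumptions: tweets are represented as `(text, label)` pairs with the label strings `"positive"`, `"neutral"` and `"negative"`, and the names `drop_neutral` and `accuracy` are made up for this example.

```python
def drop_neutral(rows, neutral_label="neutral"):
    """Keep only the tweets whose label is not neutral."""
    return [(text, label) for text, label in rows if label != neutral_label]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

Computing `accuracy` once on the training set and once on the held-out test set gives the two figures that are compared for each model.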

TRAINING   TESTING   Training Accuracy   Testing Accuracy
11,541     1,843     99.88%              86.49%
1,124      290       99.91%              84.48%
9,266      2,275     99.87%              89.58%

Training Accuracy   Testing Accuracy
99.12%              86.38%
99.56%              84.14%
99.18%              90.95%

TRAINING: 11,541 TESTING: 1,843

TRAINING: 9,266 TESTING: 2,275

TRAINING: 1,124 TESTING: 290

MACHINE LEARNING

Part II – Multi-Class Classifications

TRAINING: 2,040 TESTING: 360

TRAINING: 14,640 TESTING: 2,400

While the test accuracy is slightly lower, using Kaggle data as training set proved to be a stronger predictor based on the cross-validation accuracy results. In addition, it is superior in positive and neutral precision, not to mention much higher training volume.

                   Kaggle->Tweet   Tweet->Tweet
[ -1] Precision    65.25%          96.89%
[ne] Precision     50.00%          1.43%
[+1] Precision     60.34%          33.85%
Test Accuracy      60.88%          66.94%
CV Accuracy        67.80%          64.86%
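Per-class precision, as compared in this chart, is the share of tweets predicted as a class that truly belong to it. A minimal sketch follows; the label strings `"-1"`, `"ne"` and `"+1"` mirror the chart's category names, and the function name `precision_per_class` is an assumption.

```python
def precision_per_class(y_true, y_pred, labels=("-1", "ne", "+1")):
    """Return {label: precision}; None when a label is never predicted."""
    result = {}
    for label in labels:
        # True labels of every tweet the model assigned to this class.
        truths = [t for t, p in zip(y_true, y_pred) if p == label]
        if truths:
            result[label] = sum(1 for t in truths if t == label) / len(truths)
        else:
            result[label] = None
    return result
```

A model can reach high precision on one class while almost never predicting another correctly, which is why the three precision values are reported alongside overall accuracy.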

Again, while the precision on negative tweets is higher when training on the Twitter data, the Kaggle data yields more balanced results across all three sentiments. Because the overall test accuracy is about the same, we chose to use the Kaggle data for training due to its superior CV accuracy as well as its much larger number of data points (~8X).

                   Kaggle->Tweet   Tweet->Tweet
[ -1] Precision    79.34%          93.33%
[ne] Precision     49.10%          14.29%
[+1] Precision     60.34%          47.69%
Test Accuracy      69.08%          69.72%
CV Accuracy        71.91%          66.39%


Support Vector Classifier

Model Parameters: decision_fun=‘ovo’, kernel=‘linear’, cost=0.4, gamma=0

Use Entire Kaggle Dataset as Training Set (14,640 data points)

Apply Model to Tweets of Unknown Sentiment
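If the classifier is built with scikit-learn's `SVC` (the slides do not name the library, so this is an assumption), the slide's parameter names would map onto `SVC` keyword arguments roughly as sketched below. `SLIDE_PARAMS`, `to_sklearn_kwargs` and `build_model` are hypothetical names for this example.

```python
# Parameters as written on the slide.
SLIDE_PARAMS = {"decision_fun": "ovo", "kernel": "linear", "cost": 0.4, "gamma": 0}


def to_sklearn_kwargs(params):
    """Translate the slide's parameter names into SVC keyword arguments."""
    kwargs = {
        "decision_function_shape": params["decision_fun"],
        "kernel": params["kernel"],
        "C": params["cost"],
    }
    # gamma only affects the 'rbf', 'poly' and 'sigmoid' kernels,
    # so it is left out for a linear kernel.
    if params["kernel"] != "linear":
        kwargs["gamma"] = params["gamma"]
    return kwargs


def build_model(params=SLIDE_PARAMS):
    """Create the classifier; scikit-learn is imported lazily."""
    from sklearn.svm import SVC

    return SVC(**to_sklearn_kwargs(params))
```

The model would then be fitted on the vectorized Kaggle tweets and used to predict labels for the vectorized tweets of unknown sentiment.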

Now that we have the model, let’s take a look at the results of the collected tweets!

[Bar chart: Original Tweets and Re-Tweets per airline (United, Southwest, Delta, American), axis from 0 to 20,000. Data labels: 5,705; 5,478; 5,728; 10,516; 12,567; 9,647; 11,718; 18,762]

52,694 Original Tweets

[Pie chart: share of Positive, Neutral and Negative sentiment]

[Bar charts: Negative, Neutral and Positive tweets per airline (American, Delta, Southwest, United), shown as counts from 0 to 20,000 and as percentages from 0% to 100%]

Positive Sentiment: -
Negative Sentiment: “Cenk Uygur”

Positive Sentiment: -
Negative Sentiment: “Delta Assist”

Positive Sentiment: “Dalton Rapattoni”
Negative Sentiment: “Dalton Rapattoni”

Positive Sentiment: “Happy Birthday”, “90th”
Negative Sentiment: “Muslim”, “Shame”

            Original Tweets   Retweets
NEUTRAL     26.0%             31.7%
POSITIVE    14.5%             19.1%
NEGATIVE    59.5%             49.2%

If the sentiment is classified accurately, people are more likely to retweet a neutral or positive tweet than a negative tweet.
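A comparison like this one can be computed by splitting tweets into retweets and originals and then taking the share of each sentiment within each group. In the sketch below, detecting a retweet by the `RT @` prefix is an assumption, and `is_retweet` and `sentiment_shares` are made-up names.

```python
from collections import Counter


def is_retweet(text):
    """Treat a tweet as a retweet when it starts with 'RT @' (an assumption)."""
    return text.startswith("RT @")


def sentiment_shares(rows):
    """rows: iterable of (text, sentiment). Returns percentages per group."""
    counts = {"original": Counter(), "retweet": Counter()}
    for text, sentiment in rows:
        group = "retweet" if is_retweet(text) else "original"
        counts[group][sentiment] += 1
    shares = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        shares[group] = (
            {s: round(100 * n / total, 1) for s, n in counter.items()} if total else {}
        )
    return shares
```

Each group's percentages sum to roughly 100, so the two groups can be compared directly even though there are more original tweets than retweets.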

RT @JackAndJackReal: Thanks @Delta for adding our movie to all your flights! If any of you are traveling and bored make sure u peep it
3,535 Retweets…

RT @ggreenwald: Arab-American family of 5 going on vacation - 3 young kids in tow - removed from @United plane for "safety reasons" https:/…
352 Retweets…

RT @UMG: The perfect trio @theweeknd @ddlovato and Lucian Grainge #MusicIsUniversal #UMGShowcase w/ @AmericanAir & @Citi https://t.co/J
3,795 Retweets…

RT @DaltonRapattoni: Been home in Dallas for hours now but still in the airport. @SouthwestAir PLEASE find my bag. Please RT. #SWAFindDalto
1,952 Retweets…

RT @Mets: We’re partnering with @Delta to send a lucky fan to games 1 & 2! RT for your chance to win #DeltaMetsSweeps https://t.co/CDELSe84
17,093 Retweets…

RT @antoniodelotero: this skank bitch was SO RUDE to me on my flight!!!! @united I asked for her name and she goes, "it's Britney bitch.”
3,329 Retweets…

RT @JensenAckles: @AmericanAir I know it's crazy #DFWsnow I'm just excited 2see my baby girl! Man freaking out next 2 me isn't helping
10,260 Retweets…

RT @SouthwestAir: Yo #Kimye, we're really happy for you, we'll let you finish, but South is one of the best names of all time.
3,362 Retweets…

The majority of the tweets come from iPhones, and almost 20% come from Android devices. The rest come from the Web client and other applications. There is little difference between the sources of negative and positive tweets, but neutral tweets seem to come from alternative sources more often.

[Bar chart: share of tweets by source (iPhone, Android, Web, Others) for Negative, Neutral and Positive sentiment, axis from 0% to 60%]

[Chart: media type (Photo, Link, Video) by sentiment (Positive, Neutral, Negative)]

ENDING THOUGHTS

Discussing final thoughts and some future steps

https://github.com/michaelucsb/Python_Twitter-Sentiment_GA

MULTIPLE VALIDATION
Each data point being validated by multiple people

OTHER MODELS
Naïve Bayes
Lemmatization

COORDINATES
Current coordinates data isn’t great for geo-location analysis

CLUSTERING
Unsupervised model to discover word associations

Project Description

The project aims to use rule-based classification and machine learning algorithms to predict the sentiment of tweets that mention four major airlines: American Airlines, United, Delta, and Southwest.

The project collected 80,121 tweets during a 7-day period via Twitter’s API using Twython, a Python package; among these tweets, 2,400 randomly selected tweets were given a sentiment rating manually (test). Through Kaggle, the project also obtained a cleaned dataset containing 14,640 airline tweets with already rated sentiments (training).

To start off the project, the tweets were processed using tokenization, regular expressions, stop-word removal, and vectorization. With regular expressions, the project eliminated all URLs, “@” signs, and “#” signs, and replaced the “&” sign with the word “and”.

Prior to machine learning, the project conducted two rule-based sentiment classifications. One simply counts the number of positive and negative words, and the other uses the VADER lexicon, a pre-built sentiment analysis tool for Python. Both rule-based sentiment classifications achieved an accuracy of 47% to 55%.

Next, the project used machine learning algorithms (Random Forest and Support Vector Classifier) to conduct binary classification on positive and negative sentiments after removing all known neutral tweets. An accuracy of 85% to 90% was achieved on the test dataset.

Finally, the project used Ada Boost and Support Vector Classifier to conduct multi-class classification by adding the neutral tweets back in. Through grid search and cross-validation, the project achieved a 70% overall accuracy on prediction and an 80% precision on negative sentiment prediction on the test dataset.

Using the best model and parameters found, the model was applied to tweets with unknown sentiments. A statistical summary is included in the latter half of the presentation. The presented word clouds also captured interesting stories from the time the tweets were collected.