
Upload: nagarjun-kotyada

Post on 13-Apr-2017


Page 1: Sales_Prediction_Technique using R Programming

Sales Prediction

Analyzing Twitter Data


Abstract:

The Chevrolet Camaro is an American automobile manufactured by Chevrolet. The sixth-generation car was launched in the United States on Feb 27, 2016 and is yet to be launched in India. According to a recent report by SpeedLux, the 2016 Chevrolet Camaro was the 4th most searched vehicle on Google this past year. Compared with last year's line of Camaros from GM, this year's model is shorter, narrower, lower and lighter. Car Scoops believes consumers will like these sportier physical changes.

The car has therefore been on the road for some time with its enhanced specifications and new design. It has been the talk of the town since its release and has received a lot of feedback from users through various social media channels. As mentioned in my previous deliverable, I chose Twitter reviews and feedback for detecting the sentiment of posts, because Twitter is widely used to share opinions. R was used for the sentiment analysis.

Introduction:

The core idea of this paper is to detect and understand how the audience responded to this sixth-generation car by performing sentiment analysis on the captured tweets. As previously mentioned, sentiment analysis is the process of determining the emotional tone behind a series of words, used to gain an understanding of the opinions and emotions expressed on an online platform. Sentiment analysis is used for social media monitoring, tracking product reviews, analyzing survey responses and business analytics. Extracting insights from social data is a practice that is being widely adopted by organizations across the world. Machine learning makes sentiment analysis more convenient. I chose R for the sentiment analysis because it offers the sentimentr and RTextTools packages as well as the more general text mining package tm, which come in handy for detailed analysis. Text analysis in R is well established, and the tm package plays the largest role in this analysis. tm is a framework for text mining applications within R. It handles text cleaning (stemming, stop word removal, etc.) and transforms texts into a document-term matrix (dtm).
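To make the document-term matrix concrete, here is a minimal sketch of how tm can turn a few short texts into one. The three tweets are invented for illustration and are not taken from the Camaro data set; the block is skipped when tm is not installed.

```r
# Toy tweets (invented for illustration; not from the Camaro data set)
tweets <- c("Love the new Camaro engine",
            "The Camaro seats are cramped",
            "Camaro engine power is great")

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  # One document per vector element
  corpus <- VCorpus(VectorSource(tweets))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  # Rows are documents (tweets), columns are terms
  dtm <- DocumentTermMatrix(corpus)
  inspect(dtm)
}
```

Each row of the matrix corresponds to one tweet and each column to one term, so a term such as "camaro" that appears in every tweet has a non-zero count in every row.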

Data Analysis:

Before describing the steps involved in the analysis, below are the important packages which are essential in

the data analysis process:

twitteR: Provides an interface to the Twitter web API.

ROAuth: Provides an interface to the OAuth 1.0 specification, allowing users to authenticate via OAuth to the server of their choice.

plyr: Provides tools for splitting, applying and combining data. It solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together.

stringr: Simple, consistent wrappers for common string operations. A consistent, simple and easy-to-use set of wrappers around the 'stringi' package.

ggplot2: Create elegant data visualizations using the grammar of graphics. A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics and what graphical primitives to use, and it takes care of the details.

httr: Tools for working with URLs and HTTP. Useful tools for working with HTTP, organized by HTTP verbs (GET(), POST(), etc.). Configuration functions make it easy to control additional request components (authenticate(), add_headers() and so on).

wordcloud: Plots word clouds, including clouds that compare the frequencies of words across documents.

sentimentr: Calculates text polarity sentiment at the sentence level and optionally aggregates by rows or grouping variable(s). Note that the code in this paper loads the separate sentiment package, which provides the classify_emotion and classify_polarity functions used below.

SnowballC: An R interface to the C libstemmer library that implements Porter's word stemming algorithm for collapsing words to a common root to aid comparison of vocabulary.

tm: The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands. Advanced metadata management is implemented for collections of text documents to ease the use of large, metadata-enriched document sets. tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal, stemming, or stopword deletion. Further, a generic filter architecture is available to filter documents for certain criteria, or to perform full text search. The package supports the export from document collections to term-document matrices.

tmap: Thematic maps are geographical maps in which spatial data distributions are visualized. This package offers a flexible, layer-based, and easy-to-use approach to create thematic maps, such as choropleths and bubble maps.

RColorBrewer: Provides color schemes for maps (and other graphics).

1. Building word cloud using R

A word cloud is a text mining method that highlights the most frequently used keywords in a body of text. It is also referred to as a text cloud or tag cloud. Building a word cloud is a powerful text mining method that adds simplicity and clarity. Word clouds are easy to understand and share, and they are impactful and more visually engaging than a table of data. The size of each word in the picture indicates how frequently the word occurs in the entire text. To build the word cloud, we follow the steps below:

Step 1: Setting up the working Directory in RStudio

setwd("C:/Users/nagar/Desktop/Extras/R")

Step 2: Installing and loading the necessary packages.

Below is the list of packages loaded for building the word cloud. The functionality of each of these packages is explained in the package list above.

library(twitteR)

library(ROAuth)

library(plyr)

library(dplyr)

library(stringr)

library(ggplot2)

library(httr)

library(wordcloud)

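library(sentiment)

# Needed below for Corpus() and tm_map(), stemDocument() and brewer.pal()

library(tm)

library(SnowballC)

library(RColorBrewer)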
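Step 2 mentions installing the packages, but only the library() calls are shown. A minimal sketch of the installation check could look like the following. The vector name needed and the choice of packages in it are my own. The sentiment package is left out because it may not be available from the default repository and may have to be installed from an archive.

```r
# Packages used in this walkthrough (selection is an assumption of this sketch)
needed <- c("plyr", "stringr", "ggplot2", "httr",
            "wordcloud", "tm", "SnowballC", "RColorBrewer")

# Compare against what is already installed on this machine
installed <- rownames(installed.packages())
missing_pkgs <- setdiff(needed, installed)

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install from the configured repository:
  # install.packages(missing_pkgs)
} else {
  message("All listed packages are installed.")
}
```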

Step 3: Create a corpus from the collection of text files.

mydata <- read.csv("C:/Users/nagar/Desktop/Extras/R/camaro.csv")

mycorpus <- Corpus(VectorSource(mydata$text))

After the data is read from the CSV file into the variable mydata with read.csv, a corpus is created from its text column using the Corpus() function from the text mining (tm) package. A corpus is a collection of documents. Because VectorSource() treats each element of the vector as a separate document, every tweet becomes one document.
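The reading step can be tried without the original file. The sketch below writes a two-row CSV to a temporary file and reads it back. The column names id and text and the rows themselves are invented for illustration, on the assumption that the real file also has a text column.

```r
# Write a tiny CSV to a temporary file (toy rows, invented for illustration)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,text",
             "1,Love the new Camaro",
             "2,Camaro engine sounds great"), toy_path)

# Keep the tweet text as character strings rather than factors
toy <- read.csv(toy_path, stringsAsFactors = FALSE)
str(toy)

# Each element of toy$text would become one document in the corpus
n_docs <- length(toy$text)
print(n_docs)
```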

Step 4: Create structured data from the text file

# Convert to lower case; content_transformer() keeps the corpus structure intact
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, removeNumbers)
mycorpus <- tm_map(mycorpus, removeWords, stopwords(kind = "en"))
mycorpus <- tm_map(mycorpus, stripWhitespace)
# Stemming relies on the SnowballC package
mycorpus <- tm_map(mycorpus, stemDocument)
# Eight-color palette from RColorBrewer, used for the word cloud
pal <- brewer.pal(8, "Dark2")

The tm_map() function applies a transformation to every document in the corpus. It is used here to convert the text to lower case, to strip unnecessary white space and to remove common stopwords such as “the” and “we”. The information value of stopwords is near zero because they are so common in a language, so removing them is useful before further analysis. For stopwords, the supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish and swedish. Language names are case sensitive.

Numbers and punctuation are also removed, using the removeNumbers and removePunctuation transformations.

Another important preprocessing step is stemming, which reduces words to their root form. In other words, this process strips suffixes from words so that related forms share a common stem. For example, stemming reduces the words “moving” and “moved” to the common stem “move”.
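Stopword removal and stemming can be tried on a handful of words outside the corpus. In this sketch the stop list is a tiny hand-written one (an assumption, standing in for tm's much longer English list), and the stemming call is skipped when SnowballC is not installed.

```r
words <- c("the", "camaro", "is", "moving", "and", "we", "moved")

# A tiny hand-written stop list, used here instead of stopwords("en")
stop_list <- c("the", "is", "and", "we")
kept <- words[!words %in% stop_list]
print(kept)

if (requireNamespace("SnowballC", quietly = TRUE)) {
  # wordStem() reduces each word to its stem
  stems <- SnowballC::wordStem(kept, language = "english")
  print(stems)
}
```

After stemming, "moving" and "moved" collapse to the same stem, so they are counted as one term in the word cloud.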

Step 5: Making the word cloud using the structured form of the data.

# Plot terms occurring at least 3 times, most frequent first, colored with pal
wordcloud(mycorpus, min.freq = 3, max.words = Inf,
          random.order = FALSE, colors = pal)

Arguments of the word cloud generator function :

words : the words to be plotted

freq : their frequencies

min.freq : words with frequency below min.freq will not be plotted

max.words : maximum number of words to be plotted


random.order : plot words in random order. If false, they will be plotted in decreasing frequency

rot.per : proportion words with 90 degree rotation (vertical text)

colors : color words from least to most frequent. Use, for example, colors =“black” for single color.
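The words and freq arguments listed above can also be supplied directly instead of a corpus. The following sketch counts a few invented tokens with table() and passes the result to wordcloud(). The token list and the three colors are assumptions for illustration, and the plot is only drawn when the wordcloud package is installed.

```r
# Toy tokens (invented); in practice these would come from the cleaned tweets
tokens <- c("engine", "seat", "engine", "speed", "power",
            "engine", "speed", "drive")

# Count occurrences and sort from most to least frequent
freq_table <- sort(table(tokens), decreasing = TRUE)
words <- names(freq_table)
freq <- as.integer(freq_table)
print(data.frame(words, freq))

if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(42)  # word placement is random, so fix the seed for repeatability
  wordcloud::wordcloud(words = words, freq = freq, min.freq = 1,
                       max.words = 50, random.order = FALSE, rot.per = 0.1,
                       colors = c("grey40", "steelblue", "firebrick"))
}
```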

Word Cloud on Camaro tweets:

2. Classify Emotion and publish graph:

Emotion classification in R is performed with the classify_emotion function. This function analyzes text and classifies it into different types of emotion: anger, disgust, fear, joy, sadness, and surprise. For this, I am using a naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti’s emotions lexicon.

Below is the detailed description of the steps involved in Emotion Classification:

Step 1: Setting up the working Directory in RStudio

setwd("C:/Users/nagar/Desktop/Extras/R")

Step 2: Installing and loading the necessary packages.

Below is the list of packages required for emotion classification. The functionality of each of these packages is explained in the package list above.

library(twitteR)


library(ROAuth)

library(plyr)

library(dplyr)

library(stringr)

library(ggplot2)

library(httr)

library(wordcloud)

library(sentiment)

Step 3: Prepare the text for sentiment analysis.

Before starting the sentiment analysis, the data file must be ready in the working directory. Alternatively, R can connect directly to Twitter and use OAuth to authorize our credentials before proceeding with the analysis. This requires a Twitter account, a developer account and all the authentication keys needed to establish the connection. Here, I downloaded the CSV data file and saved it in the working directory set in the step above. The data file is then loaded into a variable for further processing.
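The alternative route, connecting to Twitter directly, was not used in this paper, so the following is only a sketch of what it might look like with the twitteR package. The helper name fetch_tweets and the environment variable names are hypothetical, and the call needs valid developer keys, network access and an API endpoint that still accepts this client.

```r
# Hypothetical helper: fetch_tweets() is not part of the original analysis
fetch_tweets <- function(query, n = 100) {
  if (!requireNamespace("twitteR", quietly = TRUE)) {
    stop("Package 'twitteR' is required")
  }
  # Credentials are read from environment variables (names are assumptions)
  twitteR::setup_twitter_oauth(
    consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
    consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
    access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
    access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET")
  )
  tweets <- twitteR::searchTwitter(query, n = n, lang = "en")
  # Convert the list of status objects to a data frame with a text column
  twitteR::twListToDF(tweets)
}

# Example call (requires valid keys and network access):
# camaro <- fetch_tweets("Chevrolet Camaro", n = 200)
# write.csv(camaro, "camaro.csv", row.names = FALSE)
```

Saving the result to a CSV file keeps the rest of the analysis identical to the file-based route described above.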

data <- read.csv("C:/Users/nagar/Desktop/Extras/R/Book1.csv")

Step 4: Perform Sentiment Analysis.

class_emo = classify_emotion(data, algorithm = "bayes", prior = 1.0)

Of the three parameters passed to the function above, the first is the data to be classified, here the contents of the data file. The second is a string indicating whether to use the naive Bayes algorithm or a simple voter algorithm; in our case it is "bayes". The third is a numeric value specifying the prior probability to use for the naive Bayes classifier.

emotion = class_emo[,7]

The function returns an object of class data.frame with seven columns (anger, disgust, fear, joy, sadness, surprise and the best fit) and one row for each document. The statement above extracts the seventh column, the best-fit emotion.

emotion[is.na(emotion)] = "unknown"

Lastly, NA values are replaced with "unknown".
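The extraction and NA replacement can be checked on a stand-in for the classifier output. The matrix below is invented, and its column names are my assumption about the layout described above; only the seventh column matters for this step.

```r
# Toy stand-in for the classifier output (values and column names are assumptions)
class_emo_toy <- matrix(
  "0", nrow = 3, ncol = 7,
  dimnames = list(NULL, c("ANGER", "DISGUST", "FEAR", "JOY",
                          "SADNESS", "SURPRISE", "BEST_FIT"))
)
class_emo_toy[, "BEST_FIT"] <- c("joy", NA, "anger")

# Take the seventh column (best fit) and label unclassified rows
emotion_toy <- class_emo_toy[, 7]
emotion_toy[is.na(emotion_toy)] <- "unknown"
print(emotion_toy)
```

The second row had no best fit, so it ends up labelled "unknown" and still counts as a category in the later plot.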

Step 5: Create and Sort the data frame.

sent_df = data.frame(text=data, emotion=emotion, stringsAsFactors =

FALSE)

The function data.frame() creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software. The first argument is our data and the second is the emotion vector obtained in the steps above. Because data is itself a data frame, it is treated as if each of its columns had been passed as a separate argument. Setting stringsAsFactors to FALSE keeps character columns as character vectors instead of converting them to factors. The data frame is stored in the variable sent_df, which is then reordered by the command below.

sent_df = within(sent_df, emotion <- factor(emotion, levels =

names(sort(table(emotion), decreasing = TRUE))))

Here, the factor function is used to create a factor. The only required argument to factor is a vector of values, which is returned as a vector of factor values. To change the order in which the levels are displayed from their default sorted order, the levels= argument can be given a vector of all the possible values of the variable in the desired order. Here the levels are ordered by decreasing frequency, and the result is stored back into the variable sent_df.
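The effect of reordering the factor levels is easiest to see on a small example. The data frame toy_df and its six rows are invented for illustration.

```r
toy_df <- data.frame(
  emotion = c("joy", "anger", "joy", "unknown", "joy", "anger"),
  stringsAsFactors = FALSE
)

# Count each emotion and sort from most to least frequent
counts <- sort(table(toy_df$emotion), decreasing = TRUE)
print(counts)

# Use the sorted names as factor levels so plots follow this order
toy_df <- within(toy_df, emotion <- factor(emotion, levels = names(counts)))
print(levels(toy_df$emotion))  # levels are now joy, anger, unknown
```

Without the levels= argument the levels would be alphabetical (anger, joy, unknown), and the bars in the plot would appear in that order instead of by frequency.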

Step 6: Plotting the Statistics.

ggplot(sent_df, aes(x=emotion)) +
  geom_bar(aes(y=..count.., fill=emotion)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x="emotion categories", y="number of tweets",
       title="classification based on emotion")

One of the most important functions in R is ggplot, which enables us to create a very wide range of useful plots. The aes function takes a list of name-value pairs giving the aesthetics to map; in the current case the mapped variable is emotion. The bar geom is used to produce 1d area plots: bar charts for categorical x, and histograms for continuous x. The color selection can be changed with one of the scale functions, such as scale_fill_brewer. x and y specify the variables placed on the horizontal and vertical axes.
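The same kind of bar chart can be reproduced on toy data to see how the pieces fit together. The data frame toy_plot_df is invented, and the plot is only built when ggplot2 is installed. Since counting is the default statistic of geom_bar(), the y aesthetic is left out in this sketch.

```r
# Toy data (invented); levels are already ordered by frequency
toy_plot_df <- data.frame(
  emotion = factor(c("joy", "joy", "joy", "anger", "anger", "surprise"),
                   levels = c("joy", "anger", "surprise"))
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy_plot_df, aes(x = emotion)) +
    geom_bar(aes(fill = emotion)) +           # bar height = number of rows
    scale_fill_brewer(palette = "Dark2") +    # ColorBrewer palette for fills
    labs(x = "emotion categories", y = "number of tweets",
         title = "classification based on emotion (toy data)")
  print(p)
}
```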

Below is the Emotion classification plot for the Camaro tweets:


3. Classify Polarity and publish graph:

A fundamental task in sentiment analysis is polarity detection: the classification of the polarity of a given

text, whether the opinion expressed is positive, negative or neutral. This approach uses a supervised learning

algorithm to build a classifier that will detect polarity of textual data and classify it as either positive or negative. It

uses an opinionated dataset to train the classifier, data processing techniques to pre-process the textual data and

simple rules for categorizing text as positive or negative.

I am using the naïve Bayes classifier to attempt to classify the sentences as positive or negative. As the

name suggests, this works by implementing a Naive Bayes algorithm. Basically, this algorithm tries to guess

whether a sentence is positive or negative by examining how many words it has in each category and relating this

to the probabilities of those numbers appearing in positive and negative sentences.
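To illustrate the idea, here is a toy naive Bayes classifier written in base R. It is a sketch under my own assumptions (four invented training sentences, add-one smoothing, a simple letter-based tokenizer) and is not the implementation used by the classify_polarity function.

```r
# Toy training data (invented for illustration)
train <- data.frame(
  text  = c("love this car", "great engine", "hate the seats", "terrible ride"),
  label = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)
classes <- c("positive", "negative")

# Split a sentence into lower-case words
tokenize <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z]+")[[1]]
  toks[nzchar(toks)]
}

vocab <- unique(unlist(lapply(train$text, tokenize)))

# log P(word | class) with add-one (Laplace) smoothing
word_log_prob <- sapply(classes, function(cls) {
  toks <- unlist(lapply(train$text[train$label == cls], tokenize))
  counts <- as.vector(table(factor(toks, levels = vocab)))
  log((counts + 1) / (sum(counts) + length(vocab)))
})
rownames(word_log_prob) <- vocab

# log P(class), estimated from the share of training sentences
log_prior <- sapply(classes, function(cls) log(mean(train$label == cls)))

classify_sentence <- function(sentence) {
  toks <- tokenize(sentence)
  toks <- toks[toks %in% vocab]  # ignore words never seen in training
  scores <- log_prior + colSums(word_log_prob[toks, , drop = FALSE])
  names(scores)[which.max(scores)]
}

print(classify_sentence("love the engine"))
print(classify_sentence("terrible seats"))
```

The first sentence scores higher for the positive class because "love" and "engine" occur only in positive training sentences, while the second is pulled toward the negative class by "terrible" and "seats".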

The steps involved in polarity classification are the same as for emotion classification, except that the classify_polarity function is used to classify text as positive or negative. The data frame creation and sorting steps use the polarity column in this case. Below are the steps involved in polarity classification.

Step 1: Setting up the working Directory in RStudio

setwd("C:/Users/nagar/Desktop/Extras/R")

Step 2: Installing and loading the necessary packages.

Below is the list of packages required for polarity classification. The functionality of each of these packages is explained in the package list above.

library(twitteR)

library(ROAuth)

library(plyr)

library(dplyr)

library(stringr)

library(ggplot2)

library(httr)

library(wordcloud)

library(sentiment)

Step 3: Prepare the text for sentiment analysis.

Before starting the sentiment analysis, the data file must be ready in the working directory. Alternatively, R can connect directly to Twitter and use OAuth to authorize our credentials before proceeding with the analysis. This requires a Twitter account, a developer account and all the authentication keys needed to establish the connection. Here, I downloaded the CSV data file and saved it in the working directory set in the step above. The data file is then loaded into a variable for further processing.

data <- read.csv("C:/Users/nagar/Desktop/Extras/R/Book1.csv")

Step 4: Perform Sentiment Analysis.

class_pol = classify_polarity(data, algorithm = "bayes")

In contrast to the classification of emotions, the classify_polarity function allows us to classify some text as positive

or negative. In this case, the classification can be done by using a naive Bayes algorithm trained on Janyce Wiebe’s

subjectivity lexicon; or by a simple voter algorithm.
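The simple voter alternative is not described further in this paper. One plausible reading, sketched below under that assumption, is to count the positive and negative words in a sentence and let the majority decide. The word lists and the function name vote_polarity are invented for illustration and do not come from the subjectivity lexicon.

```r
# Tiny hand-made word lists (assumptions; a real lexicon is far larger)
positive_words <- c("love", "great", "fast", "beautiful")
negative_words <- c("hate", "slow", "ugly", "cramped")

vote_polarity <- function(sentence) {
  toks <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  pos <- sum(toks %in% positive_words)  # votes for positive
  neg <- sum(toks %in% negative_words)  # votes for negative
  if (pos > neg) "positive" else if (neg > pos) "negative" else "neutral"
}

examples <- c("Love the fast Camaro", "Ugly and cramped", "Saw a Camaro today")
votes <- vapply(examples, vote_polarity, character(1), USE.NAMES = FALSE)
print(data.frame(examples, votes))
```

A sentence with no lexicon words, or with equally many positive and negative ones, ends in a tie and is labelled neutral in this sketch.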

polarity = class_pol[,4]

Polarity best fit is set in the above syntax.

Step 5: Create and Sort the data frame.

sent_df = data.frame(text=data, polarity=polarity, stringsAsFactors

= FALSE)

The function data.frame() creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software. The first argument is our data and the second is the polarity vector obtained in the steps above. Because data is itself a data frame, it is treated as if each of its columns had been passed as a separate argument. Setting stringsAsFactors to FALSE keeps character columns as character vectors instead of converting them to factors. The data frame is stored in the variable sent_df, which is then reordered by the command below.

sent_df1 = within(sent_df, polarity <- factor(polarity, levels =

names(sort(table(polarity), decreasing = TRUE))))

Here, the factor function is used to create a factor. The only required argument to factor is a vector of values, which is returned as a vector of factor values. To change the order in which the levels are displayed from their default sorted order, the levels= argument can be given a vector of all the possible values of the variable in the desired order. Here the levels are ordered by decreasing frequency, and the result is stored in the variable sent_df1.

Step 6: Plotting the Statistics.

ggplot(sent_df1, aes(x=polarity)) +
  geom_bar(aes(y=..count.., fill=polarity)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x="polarity categories", y="number of tweets",
       title="classification based on polarity")

One of the most important functions in R is ggplot, which enables us to create a very wide range of useful plots. The aes function takes a list of name-value pairs giving the aesthetics to map; in the current case the mapped variable is polarity. The bar geom is used to produce 1d area plots: bar charts for categorical x, and histograms for continuous x. The color selection can be changed with one of the scale functions, such as scale_fill_brewer. x and y specify the variables placed on the horizontal and vertical axes.

Polarity graph of Camaro tweets:

Conclusions:

The insight from the word cloud for the Chevrolet Camaro is that the most frequently discussed topics are the engine, seat, speed, power, drive and a few other aspects captured in the word cloud. A visual representation of the data as a word cloud tends to have an impact and generates interest among the audience. It may raise more questions than it answers, but that is a good entry point for discussion and further analysis. Based on the emotion classification, joy stands out by a large margin, followed by anger, surprise and sadness in smaller proportions. Joy adds to the positive outlook for the car in the market; further analysis of this data can be carried out using machine learning techniques. Based on the polarity classification, the highest proportion falls under the neutral category. It has been argued that specific classifiers such as Max Entropy and SVMs can benefit from the introduction of a neutral class, which can improve the overall accuracy of the classification. The other approach in the current case is to estimate a probability distribution over all categories. Since the data is clearly clustered into neutral, negative and positive language, it makes sense to filter the neutral language out and focus on the polarity between positive and negative sentiments. Open source software tools deploy machine learning, statistics, and natural language processing techniques to automate sentiment analysis on large collections of data similar to the Camaro review data.
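Filtering out the neutral tweets and comparing the remaining classes takes only a few lines. The polarity labels below are invented for illustration and do not reflect the actual Camaro results.

```r
# Toy polarity labels (invented; not the actual Camaro results)
polarity_toy <- c("neutral", "positive", "neutral", "negative",
                  "positive", "neutral", "positive")

# Drop the neutral tweets and look at the balance of the rest
non_neutral <- polarity_toy[polarity_toy != "neutral"]
share <- prop.table(table(non_neutral))
print(round(share, 2))

# Share of positive tweets among the non-neutral ones
positive_share <- as.numeric(share["positive"])
print(positive_share)
```

In this toy case three of the four non-neutral tweets are positive, so the positive share is 0.75 even though neutral is the largest single category overall.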


As I mentioned in my previous deliverables, this was a great learning process from an R programming perspective. I’m glad that I could pull together the important aspects of sentiment analysis from several different areas into one unified program. I believe sentiment analysis is an evolving field with a variety of applications. Although sentiment analysis tasks are challenging due to their natural language processing origins, much progress has been made over the last few years due to the high demand for it. Sentiment analysis within microblogging has shown that Twitter can be seen as a valid online indicator of political sentiment. The political sentiment of tweets corresponds closely to the political positions of parties and politicians, indicating that the content of Twitter messages plausibly reflects the offline political landscape.
