ARSA: A Sentiment-Aware Model for Predicting Sales Performance Using Blogs

Yang Liu, Xiangji Huang, Aijun An and Xiaohui Yu
Department of Computer Science and Engineering, York University, Toronto, Canada
School of Information Technology, York University, Toronto, Canada
SIGIR 2007

Page 1: ARSA: A Sentiment-Aware Model for Predicting Sales Performance Using Blogs

ARSA: A Sentiment-Aware Model for Predicting Sales Performance Using Blogs

Yang Liu, Xiangji Huang, Aijun An and Xiaohui Yu
Department of Computer Science and Engineering, York University, Toronto, Canada
School of Information Technology, York University, Toronto, Canada

SIGIR 2007

Page 2

Introduction

• What the general public thinks of a product can no doubt influence how well it sells
• Blogs can be commentaries or discussions on a particular subject
  – Ranging from mainstream topics to highly personal interests
• This paper studies the predictive power of opinions and sentiments expressed in blogs
  – Focus on blogs that contain product reviews
• Blogs serve as a very good indicator of a product’s future sales performance

Page 3

Introduction (Cont.)

• Develop models and algorithms that can
  – mine opinions and sentiments from blogs
  – use them for predicting product sales
• Investigate how to predict box office revenues of movies using the sentiment information obtained from blog mentions

Page 4

Why Use Movies?

• Data availability
  – daily box office revenue data are published on the Web (IMDB) and readily available
• The models and algorithms are expected to adapt easily to other types of products that are subject to online discussion
  – books, music CDs and electronics

Page 5

Introduction (Cont.)

• S-PLSA model
  – sentiment mining based on Probabilistic Latent Semantic Analysis
  – appraisal words are used to compose the feature vectors for blogs
• ARSA model (Autoregressive Sentiment-Aware)
  – AR model
    • takes past sales performance into account
  – combined with sentiment information mined from the blogs

Page 6

Related Work

• Sentiment mining
  – focuses on determining the semantic orientations of documents
    • machine learning approaches
  – evaluate the semantic distance from a word to good/bad with WordNet
• Blog mining
  – make use of links or URLs in the blogspace
  – analyze the contents of blogs

Page 7

Characteristics of Online Discussion

• Based on the number of blog mentions alone
  – very difficult to make a successful prediction of sales ranks
• Two movies both released on May 19, 2006
  – The Da Vinci Code
  – Over the Hedge
• Use the name of each movie as a query to a blog search engine
  – fixed time stamp
  – starting from one week before the movie release until three weeks after the release
• Use the number of returned results for a particular date as a rough estimate of the number of blog mentions published on that day

Page 8

Example

• Movie 1 has a higher number of blog mentions
• Box office revenues are about the same; movie 2 sometimes even surpasses movie 1

Page 9

Box Office Data and User Rating

• collect the average user ratings of the two movies from the IMDB website
  – The Da Vinci Code: 6.5
  – Over the Hedge: 7.1

• the number of blog mentions may not be an accurate indicator of a product’s sales performance

• people’s opinions (as reflected by the user ratings) seem to be a good indicator of how the box office performance evolves

Page 10

S-PLSA: a Probabilistic Approach to Sentiment Mining

• Feature selection
  – Traditional way
    • Compute the (relative) frequencies of various words in a given blog
    • Use the resulting multidimensional feature vector as the representation of the blog
  – Focus on the set of 2030 appraisal words extracted from the lexicon constructed by Whitelaw et al.
    • use their frequencies in a blog as a feature vector
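The appraisal-word representation can be illustrated with a short sketch. The five-word list below is a hypothetical stand-in for the 2030-word lexicon, and the function name and tokenizer are assumptions made only for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for the 2030-word appraisal lexicon.
APPRAISAL_WORDS = ["good", "bad", "boring", "brilliant", "awful"]

def appraisal_vector(text, lexicon=APPRAISAL_WORDS):
    """Return the count of each lexicon word in `text`, in lexicon order."""
    # Simple lowercase tokenizer; a real pipeline would likely do more.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [counts[w] for w in lexicon]

vec = appraisal_vector("Good cast, good pacing, but a boring ending.")
```

Words outside the lexicon ("cast", "pacing", "ending") are ignored, so the vector length equals the lexicon size no matter how long the blog entry is.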

Page 11

Sentiment PLSA

• sentiments are often multifaceted
  – differ from one another in a variety of ways
• just classifying the sentiments expressed in a blog as either positive or negative
  – too simplistic
• a blog can be considered as being generated under the influence of a number of hidden sentiment factors
  – each hidden factor focuses on one specific aspect of the sentiments
  – accommodates the intricate nature of sentiments
• model sentiments and opinions as a mixture of hidden factors and use PLSA for sentiment mining

Page 12

S-PLSA: Formal Presentation

• a set of blog entries B = {b_1, ..., b_N}
• a set of words (appraisal words) from a vocabulary W = {w_1, ..., w_M}
• blog data can be described as an N × M matrix D = (c(b_i, w_j))_ij
  – c(b_i, w_j) is the number of times w_j appears in blog entry b_i
• each row in D is then a frequency vector that corresponds to a blog entry

Page 13

S-PLSA (Cont.)

• consider the blog entries as being generated from a number of hidden sentiment factors Z = {z_1, ..., z_K}
  – correspond to the blogger’s complex sentiments expressed in the blog review
• Generative model

Page 14

S-PLSA (Cont.)

• Result
• Assume the blog entry b and the word w are conditionally independent given the hidden sentiment factor z
• Estimate model parameters
  – P(z), P(b|z), P(w|z)
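The formula on this slide did not survive in the transcript. Under the conditional-independence assumption and with the three parameter families listed above, the joint probability of a blog entry and a word would take the standard PLSA form. This is a reconstruction from the surrounding text, not a copy of the slide:

```latex
P(b, w) = \sum_{z \in Z} P(z)\, P(b \mid z)\, P(w \mid z)
```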

Page 15

S-PLSA (Cont.)

• maximize the following likelihood function:

• EM algorithm
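The likelihood function is not preserved in the transcript. The sketch below assumes the usual PLSA log-likelihood, L = Σ_b Σ_w c(b, w) log P(b, w) with P(b, w) = Σ_z P(z) P(b|z) P(w|z), and shows one plain way EM could estimate the parameters. The function names, the random initialization and the fixed iteration count are illustrative assumptions, not details from the slides.

```python
import numpy as np

def plsa_em(D, K, n_iter=50, seed=0):
    """Fit PLSA to a count matrix D (N blogs x M words) with K hidden factors."""
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    N, M = D.shape
    eps = 1e-12  # guards against division by zero
    Pz = np.full(K, 1.0 / K)
    Pb_z = rng.random((K, N))
    Pb_z /= Pb_z.sum(axis=1, keepdims=True)
    Pw_z = rng.random((K, M))
    Pw_z /= Pw_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z | b, w) is proportional to P(z) P(b|z) P(w|z); shape (K, N, M).
        joint = Pz[:, None, None] * Pb_z[:, :, None] * Pw_z[:, None, :]
        post = joint / (joint.sum(axis=0, keepdims=True) + eps)
        # M-step: re-estimate from expected counts c(b, w) * P(z | b, w).
        weighted = D[None, :, :] * post
        Pb_z = weighted.sum(axis=2)
        Pw_z = weighted.sum(axis=1)
        Pz = weighted.sum(axis=(1, 2))
        Pb_z /= Pb_z.sum(axis=1, keepdims=True) + eps
        Pw_z /= Pw_z.sum(axis=1, keepdims=True) + eps
        Pz /= Pz.sum() + eps
    return Pz, Pb_z, Pw_z

def log_likelihood(D, Pz, Pb_z, Pw_z):
    """Sum over (b, w) of c(b, w) * log P(b, w)."""
    Pbw = np.einsum("k,kn,km->nm", Pz, Pb_z, Pw_z)
    return float((np.asarray(D, dtype=float) * np.log(Pbw + 1e-12)).sum())
```

The dense (K, N, M) posterior array keeps the code short but is only practical for small data; a production version would iterate over non-zero counts instead.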

Page 16

S-PLSA (Cont.)

• P(z|b) represents how much a hidden sentiment factor z “contributes” to the blog document b

• the set of probabilities {P(z|b) | z ∈ Z} can be considered as a succinct summarization of b in terms of sentiments
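One way to obtain P(z|b) from the estimated parameters is Bayes' rule, P(z|b) = P(b|z) P(z) / Σ_z' P(b|z') P(z'). The sketch below assumes the parameters are stored as NumPy arrays; the function name and the toy values are illustrative only.

```python
import numpy as np

def sentiment_posterior(Pz, Pb_z):
    """Return an N x K array whose row b holds P(z | b) for every factor z."""
    joint = (Pb_z * Pz[:, None]).T  # entry (b, z) = P(b|z) P(z)
    return joint / joint.sum(axis=1, keepdims=True)

# Toy parameters for K = 2 factors and N = 3 blog entries (illustrative values only).
Pz = np.array([0.5, 0.5])
Pb_z = np.array([[0.6, 0.3, 0.1],
                 [0.2, 0.2, 0.6]])
post = sentiment_posterior(Pz, Pb_z)
```

Each row of `post` sums to 1, so it can be read directly as the K-dimensional sentiment summary of one blog entry.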

Page 17

ARSA : a Sentiment-Aware Model

• Capture two different factors that can affect the box office revenue of the current day
  – box office revenue of the preceding days
    • autoregressive model (AR)
  – people’s sentiments about the movie

Page 18

The autoregressive model

• denote the box office revenue of the movie of interest at day t by x_t
  – t = 1, ..., N
• basic AR process of order p
  – φ_1, φ_2, ..., φ_p : parameters of the model
  – ε_t : error term
• Once this model is learned from training data
  – at day t, the box office revenue x_t can be predicted from x_{t−1}, x_{t−2}, ..., x_{t−p}
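The AR equation itself is missing from the transcript. With the symbols defined on this slide, an AR process of order p is conventionally written as below; the symbol \epsilon_t for the error term is the usual choice and is assumed here:

```latex
x_t = \sum_{i=1}^{p} \phi_i \, x_{t-i} + \epsilon_t
```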

Page 19

New AR Model

• AR models are only appropriate for time series that are stationary

• 1st step

• 2nd step (remove seasonality)

• New AR model
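The formulas for the two steps are not preserved in the transcript. The sketch below is an assumption about what they could look like: a log transform as the first step (to damp the trend) and differencing at a 7-day lag as the second step (to remove weekly seasonality). The AR model would then be fitted to the transformed series, called y here. The function name `preprocess` and the parameter `season` are hypothetical.

```python
import math

def preprocess(revenues, season=7):
    """Log-transform daily revenues, then difference at a weekly lag.

    Assumed steps (not confirmed by the transcript):
      x'_t = log(x_t)
      y_t  = x'_t - x'_{t - season}
    """
    logged = [math.log(x) for x in revenues]
    return [logged[t] - logged[t - season] for t in range(season, len(logged))]

# Toy series: each day in week two earns half of the same weekday in week one.
week1 = [10.0, 8.0, 6.0, 5.0, 5.0, 12.0, 14.0]
week2 = [x / 2 for x in week1]
y = preprocess(week1 + week2)
```

In this toy case every entry of `y` equals log(0.5): the weekend peaks cancel out and only the week-over-week decay remains, which is the kind of series an AR model is better suited to.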

Page 20

Incorporating Sentiments

• B_t denotes the set of blogs on the movie of interest that were posted on day t
• average probability of sentiment factor z = j conditional on blogs in B_t
  – ω_{t,j} represents the average fraction of the sentiment “mass” that can be attributed to the hidden sentiment factor j
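Reading the bullet above literally, the daily sentiment would be the mean of P(z = j | b) over the blogs posted on day t, that is, the sum over b in the day's blog set divided by the number of blogs in it. The slide's own formula is not preserved, so the sketch below is an assumption based on that reading; the function name and the input values are illustrative.

```python
import numpy as np

def daily_sentiment(posteriors):
    """Average P(z | b) over the blogs posted on one day.

    posteriors: one row per blog, one column per hidden sentiment factor.
    Returns a length-K vector of per-factor averages for that day.
    """
    return np.asarray(posteriors, dtype=float).mean(axis=0)

# Three blogs posted on the same day, K = 2 factors (illustrative values only).
omega_t = daily_sentiment([[0.75, 0.25], [0.25, 0.75], [0.5, 0.5]])
```

Because each row sums to 1, the averaged vector also sums to 1, which matches the "fraction of the sentiment mass" interpretation.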

Page 21

ARSA Model

• Autoregressive Sentiment-Aware model

  – p, q, and K : user-chosen parameters
    • q : the number of preceding days whose sentiment information is considered
    • K : the number of hidden sentiment factors used by S-PLSA to represent the sentiment information
  – φ_i and ρ_{i,j} : parameters whose values are to be estimated using the training data
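The model equation is missing from the transcript. Combining an AR term of order p with the daily sentiment averages of the q preceding days, a form consistent with the parameters named on this slide would be the following. Here y_t is an assumed symbol for the (possibly preprocessed) revenue series at day t, and \omega_{t-i,j} is the average sentiment for factor j on day t − i:

```latex
y_t = \sum_{i=1}^{p} \phi_i \, y_{t-i}
    + \sum_{i=1}^{q} \sum_{j=1}^{K} \rho_{i,j} \, \omega_{t-i,j}
    + \epsilon_t
```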

Page 22

Training the ARSA Model

• learn the parameters φ_i (i = 1, ..., p) and ρ_{i,j} (i = 1, ..., q; j = 1, ..., K) from training data that consist of the true box office revenues
• for a particular movie m (m = 1, ..., M)
  – M : total number of movies in the training data
• minimize the prediction error over the training data
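The objective being minimized is not preserved in the transcript. The sketch below assumes it is the sum of squared prediction errors over all movies and days, which makes training an ordinary least-squares problem. The function name, the argument layout and the use of `numpy.linalg.lstsq` are choices made for illustration, not the authors' stated procedure.

```python
import numpy as np

def fit_arsa(series, sentiments, p, q):
    """Least-squares estimate of phi (length p) and rho (q x K).

    series:     list of 1-D arrays, one revenue series y per movie
    sentiments: list of 2-D arrays (T x K), row t holding the K daily
                sentiment averages for day t of the same movie
    """
    rows, targets = [], []
    start = max(p, q)  # first day with enough history for both terms
    for y, omega in zip(series, sentiments):
        for t in range(start, len(y)):
            past_y = [y[t - i] for i in range(1, p + 1)]
            past_omega = np.concatenate([omega[t - i] for i in range(1, q + 1)])
            rows.append(np.concatenate([past_y, past_omega]))
            targets.append(y[t])
    X, target = np.array(rows), np.array(targets)
    # Minimizes ||X @ coef - target||^2 (assumed objective).
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    K = sentiments[0].shape[1]
    return coef[:p], coef[p:].reshape(q, K)
```

Stacking the rows of all training movies into one design matrix yields a single shared set of φ and ρ values, which is what allows the fitted model to be applied to movies outside the training set.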

Page 23

Empirical Study

• Experiment settings
  – a set of blog documents on movies of interest collected from the Web
    • from May 1, 2006 to August 8, 2006
    • timestamps ranging from one week before the release to four weeks after
    • the number of blog entries collected for each movie ranges from 663 (for Waist Deep) to 2069 (for Little Man)
    • 45046 blog entries commenting on 30 different movies
  – corresponding daily box office revenue data for these movies
    • manually collected from IMDB

Page 24

Experiment

• choose half of the movies for training, and the other half for testing

• train an S-PLSA model
  – for each blog entry b, the sentiments towards a movie are summarized using a vector of the posterior probabilities of the hidden sentiment factors, P(z|b)
• apply the ARSA model
  – obtain estimates of the parameters
• evaluate the prediction performance of the ARSA model on the testing data set

Page 25

MAPE

• use the mean absolute percentage error (MAPE) to measure the prediction accuracy
  – n : total number of predictions made on the testing data
  – Pred_i : predicted value
  – True_i : true value of the box office revenue
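The MAPE formula is not preserved in the transcript. Its standard definition, using the three quantities listed above, is the mean of |Pred_i − True_i| / True_i over the n predictions. A minimal sketch follows; the function name is illustrative.

```python
def mape(pred, true):
    """Mean absolute percentage error, returned as a fraction.

    Multiply by 100 to express it as a percentage. True values must be
    non-zero, which holds for positive box office revenues.
    """
    if len(pred) != len(true) or not pred:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum(abs(p - t) / t for p, t in zip(pred, true)) / len(pred)

# One prediction 10% too high and one 10% too low.
score = mape([110.0, 90.0], [100.0, 100.0])
```

Over- and under-predictions do not cancel out because each error is taken in absolute value before averaging.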

Page 26

Parameter Selection

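• K, p, q
• Figure annotations:
  – 12.1%
  – Overfitting; high cost
  – Lower-dimensional vector; loss of sentiment information
  – Picks up irrelevant information
  – Factors in all influences on the preceding days’ performance
  – Sentiment information from blogs posted on the preceding day is the most relevant to the prediction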

Page 27

Result on a Particular Movie

• similar to the observation of Figure 2
• predictions are close to the true values when using the optimal parameter settings

Page 28

Comparison with Alternative Methods

Pure AR model

AR model that uses the volume of blog mentions

– v_{t−i} : number of blog mentions on day t − i
– φ_i and ρ_i : parameters to be learned
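The equation for the volume-based baseline is missing from the transcript. Given the symbols above, a plausible form replaces the sentiment term with the raw mention counts of the preceding days. The symbol y_t for the revenue series and the upper limit q of the second sum are assumptions:

```latex
y_t = \sum_{i=1}^{p} \phi_i \, y_{t-i}
    + \sum_{i=1}^{q} \rho_i \, v_{t-i}
    + \epsilon_t
```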

Page 29

Comparison with Other Feature Selection Methods

• feature vectors computed using the (relative) frequencies of all the words appearing in the blog entries
  – large feature set, high cost
• select only the words with the highest frequencies (excluding stop words)
  – same number of words as ARSA, for fairness

Page 30

Conclusions and Future Work

• predicting sales performance using sentiment information mined from blogs
  – using movies as a case study
• proposal of S-PLSA
  – generative model for sentiment analysis
  – “summarizing” sentiment information from blogs
• ARSA
  – model for predicting sales performance based on
    • sentiment information
    • product’s past sales performance
• Future work
  – clustering and classification of blogs based on their sentiments
  – use S-PLSA as a tool to help track and monitor the changes and trends in sentiments expressed online