

Page 1: Exploring the Relationship between Customer Reviews …lg5bt/files/finalReport-lz2ay-lg5bt... · Exploring the Relationship between Customer Reviews and Prices Bo Man, Lin Gong, Lingjie

Exploring the Relationship between Customer Reviews and Prices

Bo Man, Lin Gong, Lingjie Zhang
Department of Computer Science
85 Engineer’s Way, Charlottesville, Virginia
{bm6es, lg5bt, lz2ay}@virginia.edu

ABSTRACT
Do customer reviews indirectly affect sale prices? In this project, we collect Amazon reviews and prices and explore the relationship between them. First, Amazon rating data and price data are collected. Second, we assume that customers’ ratings are consistent with their review contents, and we verify this assumption with machine learning methods that predict ratings from the review text. Third, we analyze the correlation between ratings and prices. We find that for some items, such as the “Ceramic Flat Hair-styling Iron”, the correlation is 0.807, indicating a strong positive relationship. For other items, however, the relationship is weak.

Keywords
Machine Learning, Review, Price

1. INTRODUCTION
Customer reviews play an important role in e-commerce and directly affect sales volume. For users, reviews provide recommendations and help them make decisions, such as choosing a movie or picking a restaurant; Yelp was built on this idea. For retailers, reviews provide important feedback. Internet vendors take reviews seriously: for example, sellers on Amazon reply to customer reviews and offer explanations. However, to what extent do they care about those reviews? How do they react to them? Do they lower prices in response to negative reviews, given that such reviews are likely to reduce sales volume?

To answer these questions, we surveyed the related literature. Some studies focus on social media, including online reviews and forum discussions. Some of them classify reviews to help users make quick decisions, while others mine review data to recommend products to users automatically. However, none of them connects user reviews with prices. To the best of our knowledge, there are no published papers covering both reviews and prices, so we believe this question has not been studied before.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright all reserved by Bo, Lin, Lingjie.

In this project, we choose the SNAP Amazon reviews as our data set since it provides a comprehensive collection of reviews. We analyze the correlation between the rating curves and the price curves; a positive correlation indicates that a relationship exists.

Verify the assumption that customer ratings are consistent with content: We assume that the five-star ratings given by users directly represent their reviews. We verify this assumption by predicting user ratings from review contents using three machine learning methods: Naive Bayes, Logistic Regression and Support Vector Machine.

Collect Amazon reviews and prices as data set: Our data set consists of two parts. (1) Amazon reviews from SNAP (Stanford University) [5]. The original data span a period of 18 years and include more than 35 million reviews up to March 2013. We analyze this data set and select 419 products from 5 departments, each with more than 100 reviews. Every review includes product and user information, a rating, and plain-text content. (2) For each product, we crawl daily prices from the “thetracktor” website [1], which tracks the prices of Amazon products. We then analyze the relationship between the two data sets.

Pre-process ratings and prices: In order to compare ratings and prices, we scale the prices into the range of 1 to 5. Then, we smooth both the rating and price curves. Because reviews take effect with some delay, we introduce an adjustable lag parameter L.

Analyze correlation: After pre-processing the two curves, we calculate the correlation between prices and ratings. For some items we find strong relationships between prices and ratings; for the Ceramic Flat Hair-styling Iron, for example, the correlation is 0.807. For other items, however, the correlation is weak.

Contributions Our specific findings and contributions include the following:

• We use machine learning methods to verify that customer ratings are consistent with text contents. Thus, reviews can be represented by ratings directly.

• We analyze the correlation of prices and ratings for selected items and report several interesting observations.

The rest of the paper is organized as follows. Related work is surveyed in Section 1.1. Section 2 presents our approach to finding the relationships between reviews and prices. Section 3 applies our analysis methods to the Amazon data sets. Section 4 concludes the paper, and Section 5 outlines future work.

1.1 State of the Art
Some research has focused on social media such as online reviews and forum discussions. Pang et al. [8] perform sentiment classification of film reviews using machine learning techniques, concentrating on the analysis and comparison of three classical machine learning methods for classifying reviews. Elsas and Glance [3] develop a forum ranking model based on customer discussions of product-related topics. Reschke et al. [9] propose a recommendation dialog system built from the reviews given by customers, which benefits customers in return. This model also connects customer reviews with business.

Since our work involves time series analysis, we also refer to related papers in that area. O’Connor et al. [7] connect measures of public opinion from polls with sentiment measured on the popular micro-blogging site Twitter. They find that political polls correlate with sentiment word frequencies in contemporaneous Twitter messages, and they use time series analysis to support this finding. Their results highlight the potential of text streams as a substitute and supplement for traditional polling, which could save considerable effort. Calais Guerra et al. [2] also perform real-time analysis of user opinions and sentiment. Instead of learning textual models to predict content polarity, they assume that user bias does not change often, and they analyze sentiment by transferring user bias to text features, which they propose as a transfer learning strategy.

In addition, a lot of research has been conducted on Amazon reviews. Hu and Liu [4] focus on mining opinions and the product features that reviewers have commented on. Mudambi and Schuff [6] explore what makes a customer review helpful. Other existing research on Amazon reviews usually focuses on review classification and recommendation for customers. In contrast, our work explores the relationship between reviews and prices, which to our knowledge is a new topic and can benefit both customers and retailers.

2. METHODOLOGY

2.1 Collect Reviews
Amazon is the largest Internet-based company in the United States. Amazon.com started as an online bookstore but soon diversified, selling video products, software, electronics, furniture, food and so on. Our dataset is a subset of the SNAP Amazon review dataset, which is widely used in academia, especially in text mining. It spans a period of 18 years and includes 35 million reviews up to March 2013, covering all the types of items mentioned above. Because of the large number of items and the limited availability of sale prices, we filter the items used in this project: we choose items with more than 100 ratings between August 2012 and March 2013. Each review has the following format:

product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: A1QA985ULVCQOB
review/profileName: Carleen M. Amadio “Lady Dragonfly”
review/helpfulness: 2/2
review/score: 5.0
review/time: 1314057600
review/summary: Fun for adults too!
review/text: I really enjoy these scissors for my inspiration books that I am making (like collage, but in books) and using these different textures these give is just wonderful, makes a great statement with the pictures and sayings. Want more, perfect for any need you have even for gifts as well. Pretty cool!
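A record in this format can be read into a dictionary with a few lines of code. The following is a minimal sketch, assuming each field sits on its own line as `key: value`; the function name `parse_review` is hypothetical and is not part of the SNAP tooling.

```python
from datetime import datetime, timezone


def parse_review(record: str) -> dict:
    """Parse one 'key: value' review record into a flat dictionary."""
    review = {}
    for line in record.strip().splitlines():
        # Split on the first ': ' only, so values may contain colons.
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # skip lines that do not have the 'key: value' shape
        review[key.strip()] = value.strip()
    # review/score is the star rating; review/time is a Unix timestamp.
    review["review/score"] = float(review["review/score"])
    review["review/time"] = datetime.fromtimestamp(
        int(review["review/time"]), tz=timezone.utc
    )
    return review


SAMPLE = """\
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: A1QA985ULVCQOB
review/helpfulness: 2/2
review/score: 5.0
review/time: 1314057600
review/summary: Fun for adults too!
"""

review = parse_review(SAMPLE)
print(review["product/productId"], review["review/score"], review["review/time"].date())
```

Note that `product/price` stays a string here, because the sample record carries the value `unknown` rather than a number.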

2.2 Assumption
As a leader among online shopping websites, Amazon provides a comprehensive review system. Each review consists of two parts: the five-star rating and the review text. In this paper, we assume that the rating is consistent with the text, since we expect reviewers to be consistent. To verify this, we adopt three machine learning methods to predict ratings from the given text: Naive Bayes (NB), Logistic Regression (LR) and Support Vector Machine (SVM). Tables 1-3 list the predicted results for Naive Bayes, Logistic Regression and Support Vector Machine, respectively.

As the tables show, we select five categories and predict the ratings. The overall precision is quite high for all three methods, at around 0.85. Low recalls usually occur for ratings 1 and 2, since these two classes have smaller sample sizes, which may introduce more errors. Among the three methods, LR performs best, since it generally achieves higher precision and recall. Therefore, we conclude that ratings can represent customers’ opinions. In our later experiments, we use ratings directly to represent customers’ overall reviews.
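The per-rating precision and recall reported in Tables 1-3 can be computed from true and predicted labels as sketched below. This is an illustration only: the function name `precision_recall` and the toy labels are assumptions, and the paper does not say how its scores were computed.

```python
from collections import Counter


def precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label that occurs."""
    true_pos, predicted, actual = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        actual[t] += 1
        predicted[p] += 1
        if t == p:
            true_pos[t] += 1
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # Precision: share of predictions of this label that are correct.
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        # Recall: share of true items of this label that were found.
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy data (made up): 5-star reviews dominate, low ratings are rare.
y_true = [1, 1, 2, 5, 5, 5, 5, 4]
y_pred = [1, 5, 2, 5, 5, 5, 5, 5]
for rating, (p, r) in precision_recall(y_true, y_pred).items():
    print(rating, round(p, 3), round(r, 3))
```

In the toy data, one of the two 1-star reviews is predicted as 5 stars, so rating 1 keeps a precision of 1.0 while its recall drops to 0.5. This is the same pattern of high precision with low recall that the tables show for the rare low ratings.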

2.3 Crawl Prices
Even though the dataset from SNAP contains a price for every item, many of the prices are unavailable or not accurate enough; for example, some records only contain the initial price. Because Amazon’s prices do not fluctuate on a regular schedule, some websites, such as “camelcamelcamel” and “thetracktor”, keep track of them for customers. We crawl the price data from the latter website for every item chosen in Section 2.1. If the price of an item changes several times in a day, we record the average price. If the price does not change over several days, we record the same price for each day. In the end, 221 of the 419 products have price changes in the crawled data, which spans September 2012 to March 2013.
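The two recording rules above (average within a day, repeat the price on days without a change) can be sketched as follows. The function name `daily_prices` and the input layout of `(day, price)` pairs are assumptions for illustration. The paper also does not say which price is carried forward after a day with several changes, so this sketch assumes the last price observed that day.

```python
from datetime import date, timedelta


def daily_prices(observations, start, end):
    """Build one price per calendar day from (day, price) observations.

    Days with several observations record their average; days with none
    repeat the most recent price. Observations must be in chronological order.
    """
    by_day = {}
    for day, price in observations:
        by_day.setdefault(day, []).append(price)

    series = {}
    carried = None  # most recent observed price, reused on quiet days
    day = start
    while day <= end:
        if day in by_day:
            series[day] = sum(by_day[day]) / len(by_day[day])
            carried = by_day[day][-1]  # assumption: carry the day's last price
        elif carried is not None:
            series[day] = carried
        day += timedelta(days=1)
    return series


observations = [
    (date(2012, 9, 1), 10.0),
    (date(2012, 9, 3), 12.0),  # two changes on the same day
    (date(2012, 9, 3), 14.0),
]
for day, price in daily_prices(observations, date(2012, 9, 1), date(2012, 9, 5)).items():
    print(day, price)
```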

2.4 Analysis Method
Scaling Scaling adjusts values measured on different scales to a notionally common scale, and we carry it out first. The data show that ratings only vary from one to five, whereas prices vary over a large range. The purpose of scaling is to show the relationship clearly in visualizations and to make comparison easier. We have also verified that scaling changes neither the shape of a curve nor its correlation with other curves. The whole process is given by the following equation:

$$Y = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \times 5 \tag{1}$$
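Equation (1) translates directly into code. The sketch below uses a hypothetical function name `scale`. As written, the equation maps the smallest price to 0 and the largest to 5, so the scaled values span 0 to 5.

```python
def scale(values, top=5.0):
    """Min-max scale values following Equation (1): min -> 0, max -> top."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return [(v - lo) / (hi - lo) * top for v in values]


print(scale([10.0, 15.0, 20.0]))  # [0.0, 2.5, 5.0]
```

Because this map is affine with a positive slope, it leaves the Pearson correlation with any other curve unchanged, which agrees with the remark that scaling does not change the correlation.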

Rating  Sport              Tools              Home               Beauty             Baby
        Precision  Recall  Precision  Recall  Precision  Recall  Precision  Recall  Precision  Recall
1       0.924      0.476   0.91       0.447   0.949      0.795   0.923      0.773   0.952      0.702
2       0.947      0.907   0.929      0.833   0.962      0.881   0.898      0.808   0.948      0.95
3       0.947      0.981   0.946      0.889   0.967      0.905   0.933      0.957   0.951      0.952
4       0.927      0.994   0.906      0.961   0.893      0.567   0.968      0.952   0.934      0.975
5       0.981      0.977   0.963      0.993   0.865      0.979   0.976      0.992   0.981      0.989

Table 1: The predicted results of Naive Bayes.

Rating  Sport              Tools              Home               Beauty             Baby
        Precision  Recall  Precision  Recall  Precision  Recall  Precision  Recall  Precision  Recall
1       0.958      0.737   0.975      0.517   0.84       0.454   0.973      0.775   0.978      0.825
2       0.949      1.0     0.962      1.0     0.908      0.762   0.976      1.0     0.961      1.0
3       0.96       0.984   0.96       0.968   0.827      0.587   0.99       1.0     0.97       0.961
4       0.94       0.918   0.971      0.992   0.7        0.373   0.982      0.956   0.97       0.981
5       0.963      0.994   0.976      0.998   0.768      0.949   0.978      1.0     0.984      1.0

Table 2: The predicted results of Logistic Regression.

Moving average Since we sample the reviews and prices every day, the data are volatile, rising and falling each day. In order to derive a more consistent signal, we smooth both the reviews and prices with a simple temporal smoothing technique, a moving average over a window of the past k days:

$$RA_t = \frac{1}{k}\left(x_{t-k+1} + x_{t-k+2} + x_{t-k+3} + \dots + x_t\right) \tag{2}$$

Smoothing has an important side effect: it causes both the prices and reviews to respond more slowly to recent changes. In return, it allows consistent behavior to appear over longer periods of time. The window length matters a lot in smoothing, and we need to choose it carefully: too much smoothing makes it impossible to see fine-grained changes.
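The trailing moving average of Equation (2) can be sketched as follows. The function name `moving_average` is hypothetical, and the sketch only returns values for days that have a full window of k observations; how the first k-1 days were treated in the paper is not stated.

```python
def moving_average(xs, k):
    """Trailing moving average: element j is the mean of xs[j : j + k].

    The result has len(xs) - k + 1 entries, one for each day t >= k - 1.
    """
    if k < 1 or k > len(xs):
        raise ValueError("window length k must satisfy 1 <= k <= len(xs)")
    window_sum = sum(xs[:k])
    averages = [window_sum / k]
    for t in range(k, len(xs)):
        # Slide the window: add the newest day, drop the oldest one.
        window_sum += xs[t] - xs[t - k]
        averages.append(window_sum / k)
    return averages


print(moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```

A larger k averages over more days, which is why long windows hide short-lived changes.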

Shifting Analysis Customer reviews cannot affect prices immediately, so a time delay should be considered in the analysis. We introduce a hyper-parameter L into the model, so that each window of ratings is compared against the window of prices ending L days later.
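One way to realize this comparison is to pair the rating on day t with the price on day t + L and drop the days that have no partner. This is a sketch under that assumption; the function name `align_with_lag` is hypothetical.

```python
def align_with_lag(ratings, prices, lag):
    """Pair ratings[t] with prices[t + lag]; both lists are indexed by day."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    n = min(len(ratings), len(prices) - lag)
    if n <= 0:
        raise ValueError("lag is too large for the available data")
    return ratings[:n], prices[lag:lag + n]


# Toy curves: the price repeats the rating pattern two days later.
ratings = [1, 2, 3, 4, 5]
prices = [9, 9, 1, 2, 3]
print(align_with_lag(ratings, prices, 2))  # ([1, 2, 3], [1, 2, 3])
```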

Correlation Analysis Based on the samples, we use the Pearson correlation coefficient to indicate the relationship between two curves, as given by the following equation:

$$s(x, y) = \frac{\sum_{i=1}^{p}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{p}(y_i - \bar{y})^2}} \tag{3}$$

According to common statistical criteria, we classify correlations into several levels (Table 4). Since we only care about positive correlations, all correlations mentioned in this paper refer to positive correlations, and all relationships mean positive relationships. If the correlation is larger than 0.4, we conclude that there is a relationship between the two curves. If the correlation is smaller than 0, we label it as no correlation rather than as no positive correlation.
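The Pearson coefficient used here can be computed with the standard library alone. The following is a minimal sketch with a hypothetical function name `pearson` and made-up toy values; it also checks the earlier claim that min-max scaling leaves the correlation unchanged.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


ratings = [1, 2, 3, 4]
prices = [10, 30, 20, 40]
# Min-max scale the prices to 0..5 with the same form as the scaling equation.
scaled = [(p - 10) / (40 - 10) * 5 for p in prices]

r = pearson(ratings, prices)
print(round(r, 3), round(pearson(ratings, scaled), 3))
print("relationship" if r > 0.4 else "no relationship")  # the 0.4 threshold from the text
```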

Criteria
Positive Correlation Degree   Correlation
Extremely Strong              0.8-1.0
Strong                        0.6-0.8
Medium                        0.4-0.6
Weak                          0.2-0.4
Extremely weak                0-0.2
No                            <0

Table 4: Statistical Criteria on Classifying Correlations

3. EVALUATION

3.1 Sample Selection
From the 221 products, we only choose those whose prices change more than 50 times, which gives 26 products in total. This selection is based on the assumption that the more price changes occur, the more likely it is that a relationship exists or can be found. For example, if an item’s price never changes over a long period of time no matter what happens, the retailer evidently does not adjust the price in response to other factors, whether ratings or reviews. However, if an item’s price changes often, some factors may be affecting it, and reviews could be one of them.
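The selection rule of Section 3.1 can be sketched as a filter over daily price series. The function names and the dictionary layout (product ID mapped to a list of daily prices) are assumptions for illustration, and a "price change" is taken to mean that a day's price differs from the previous day's.

```python
def count_price_changes(prices):
    """Count the days whose price differs from the previous day's price."""
    return sum(1 for prev, cur in zip(prices, prices[1:]) if cur != prev)


def select_products(price_series, min_changes=50):
    """Keep product IDs whose price changed more than min_changes times."""
    return [
        product_id
        for product_id, prices in price_series.items()
        if count_price_changes(prices) > min_changes
    ]


# Hypothetical toy series; real series would cover roughly six months of days.
toy = {
    "item-A": [10, 10, 11, 11, 10],  # 2 changes
    "item-B": [5, 5, 5, 5, 5],       # 0 changes
}
print(select_products(toy, min_changes=1))  # ['item-A']
```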

3.2 Scaling
Figure 1 demonstrates the effect of scaling. The two curves are plotted together while keeping their shapes, which makes it easier for us to “see” the relationship.

3.3 Moving Average
As Figure 2 shows, 15-day smoothing is applied to the data. After smoothing, extreme values are eliminated and the curve shows more consistent behavior.

To show the importance of the window length, Figure 3 compares window lengths of 7, 15 and 30 days.

According to our experiments and reference papers, we find that 15-day smoothing is the best choice compared with the inadequate 7-day smoothing and the excessive 30-day smoothing. Although over-smoothing leads to a higher correlation to some extent, it cannot reflect the relationship realistically.

Rating  Sport              Tools              Home               Beauty             Baby
        Precision  Recall  Precision  Recall  Precision  Recall  Precision  Recall  Precision  Recall
1       0.899      0.625   0.781      0.343   0.699      0.38    0.879      0.677   0.938      0.755
2       0.851      0.477   0.723      0.357   0.858      0.568   0.819      0.612   0.872      0.661
3       0.871      0.613   0.745      0.407   0.72       0.37    0.905      0.735   0.851      0.605
4       0.782      0.588   0.875      0.633   0.468      0.204   0.908      0.758   0.88       0.693
5       0.822      0.964   0.841      0.961   0.703      0.926   0.875      0.961   0.864      0.971

Table 3: The predicted results of Support Vector Machine.

Figure 1: Scaling of prices and ratings.

Figure 2: Prices after 15-day smoothing.

3.4 Shifting Analysis
The lag parameter is of great importance in the analysis. In Figure 4, lags of 7, 15, 30, 45 and 60 days are used to plot the curves. Shifting amounts to moving a curve right by L days, as the figure shows clearly.

The relationship also becomes much clearer after shifting. Figure 5 compares the curves before and after shifting: after shifting, both the peaks and the valleys match each other.

Figure 3: Smoothing with different window lengths.

Since different items have different price-rating patterns, for each item we search over the lag values (7, 10, 15, 30, 45, 60) for the one that maximizes the correlation.
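The per-item lag search can be sketched as a loop over the candidate lags that keeps the one with the highest Pearson correlation. The names `pearson` and `best_lag` are hypothetical, and the synthetic curves below are constructed for illustration only; they are not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(xs)
    if n != len(ys) or n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant curve has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def best_lag(ratings, prices, lags=(7, 10, 15, 30, 45, 60)):
    """Return (lag, correlation) for the candidate lag with the highest correlation."""
    best = None
    for lag in lags:
        # Compare the rating on day t with the price on day t + lag.
        n = min(len(ratings), len(prices) - lag)
        if n < 2:
            continue
        r = pearson(ratings[:n], prices[lag:lag + n])
        if r is not None and (best is None or r > best[1]):
            best = (lag, r)
    return best


# Synthetic check: prices repeat the rating curve 10 days later.
ratings = [math.sin(t / 5) for t in range(100)]
prices = [0.0] * 10 + ratings[:90]
lag, r = best_lag(ratings, prices)
print(lag, round(r, 3))  # the built-in delay of 10 days is recovered
```

Because the search only covers the listed lags, the result is the best candidate within that set rather than a global optimum over all possible delays.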

3.5 Correlation Analysis
The correlation analysis yields several interesting observations, which we present below.

For some items, there is almost no positive relationship between prices and ratings. As Figure 6 shows, the two curves do not match at all. The maximized correlation after shifting is also quite low, only -0.199. Thus, we conclude that there is no relationship between prices and reviews for these items.

We also find items with strong relationships between prices and ratings. Figure 7 shows a hair-styling iron, for which the two curves clearly match each other. The correlation between its prices and ratings is also high, at 0.807.

Of the 26 samples, 17 (65%) have a correlation larger than 0.4, which indicates a relationship between ratings and prices according to our criteria.

Figure 4: Different lag parameters.

Figure 5: Rating and price curves before and after shifting.

Figure 6: No-positive-relationship item (Home-B000FFVJ3C).

4. CONCLUSION
In this project, we explore the relationships between prices and reviews. Machine learning methods are adopted to verify our assumption that ratings are consistent with review text, and scaling, smoothing, shifting and correlation analysis are used to analyze the data. Our experiments identify some items with strong relationships between prices and reviews: 65% of the sample products meet our correlation threshold of 0.4. This suggests that many retailers adjust product prices according to user reviews. We also find that reviews often appear to influence prices with a delay of 7 to 30 days.

Figure 7: Positive-related item (Beauty-B0009V1YR8).

Figure 8: Correlation Distribution of the 26 items.

5. FUTURE WORK
First, we will try different principles for sample selection, such as the threshold on the number of price changes. For example, if we require more than 25 price changes instead of 50 during the six-month period, the assumed reaction rate of retailers becomes about once per week, which seems more reasonable and will provide a larger sample with which to verify the relationships.
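As a rough check on the frequencies these two thresholds imply, assume that the six-month period contains about 26 weeks (an approximation, not a figure taken from the data):

$$\frac{25\ \text{changes}}{26\ \text{weeks}} \approx 0.96\ \text{changes per week}, \qquad \frac{50\ \text{changes}}{26\ \text{weeks}} \approx 1.92\ \text{changes per week}.$$

Under this assumption, a threshold of 25 corresponds to roughly one price change per week, while the current threshold of 50 corresponds to roughly two.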

Then we will expand our analysis in three directions. One is to repeat the same procedure for each category separately. Another is to use metrics other than correlation to represent relationships, since small samples can sometimes lead to bias. In addition, our experiments suggest that negative reviews contribute more to price changes, which deserves more attention in future analysis.

Finally, we will use our findings to predict prices.


No.  Item               Corr. (no shift)  Corr. (shifted)  Score
1    Baby-B000BNQC58    0.76              0.803            5
2    Baby-B000GKWA66    0.37              0.613            4
3    Baby-B00020L78M    -0.2              0.187            1
4    Beauty-B0009V1YR8  0.75              0.807            5
5    Home-B000AQSMPO    0.25              0.329            2
6    Home-B000F49XXG    0.79              0.719            4
7    Home-B000FFVJ3C    -0.06             -0.199           0
8    Home-B000GTR2F6    -0.17             0.479            3
9    Home-B000Q5XTE8    0.58              0.502            3
10   Home-B0000X7CMQ    -0.06             0.458            3
11   Home-B0002OKDT2    0.54              0.814            5
12   Home-B0006G3JRO    -0.23             0.309            2
13   Home-B00008T960    0.13              0.406            3
14   Home-B00018RRRK    -0.31             0.415            3
15   Sport-B000FI6XGC   -0.08             0.299            2
16   Sport-B000O3GCFU   -0.36             0.525            3
17   Sport-B000RYAKHC   0.56              0.484            3
18   Sport-B0007IS6ZG   -0.74             0.425            3
19   Sport-B0009VC9Y0   0.84              0.828            5
20   Tool-B000CITK8S    0.08              0.692            4
21   Tool-B000DZGN7Q    0.2               0.260            2
22   Tool-B000E7NYY8    -0.09             0.643            4
23   Tool-B000IE0YIQ    -0.12             0.200            2
24   Tool-B000PICTYC    0.2               0.185            1
25   Tool-B00004WA4C    0.51              0.377            2
26   Tool-B0006VVN1S    0.42              0.634            4

Table 5: Correlation of each item
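The last column of Table 5 appears to follow the levels of Table 4, with 5 for an extremely strong correlation down to 0 for a negative one. The sketch below reproduces it from the shifted correlations and recounts the items above the 0.4 threshold. The function name `score` is hypothetical, and treating each lower bound as inclusive is an assumption, suggested by the item whose correlation of 0.200 has a score of 2.

```python
# Shifted correlations and scores, copied from Table 5 in row order.
SHIFTED = [
    0.803, 0.613, 0.187, 0.807, 0.329, 0.719, -0.199, 0.479, 0.502,
    0.458, 0.814, 0.309, 0.406, 0.415, 0.299, 0.525, 0.484, 0.425,
    0.828, 0.692, 0.260, 0.643, 0.200, 0.185, 0.377, 0.634,
]
SCORES = [
    5, 4, 1, 5, 2, 4, 0, 3, 3,
    3, 5, 2, 3, 3, 2, 3, 3, 3,
    5, 4, 2, 4, 2, 1, 2, 4,
]


def score(corr):
    """Map a correlation to a 0-5 level using the bins of Table 4."""
    for level, lower_bound in ((5, 0.8), (4, 0.6), (3, 0.4), (2, 0.2), (1, 0.0)):
        if corr >= lower_bound:  # assumption: lower bounds are inclusive
            return level
    return 0  # negative correlation: "No" in Table 4


print([score(c) for c in SHIFTED] == SCORES)
related = sum(1 for c in SHIFTED if c > 0.4)
print(related, len(SHIFTED), round(100 * related / len(SHIFTED)))
```

The recount gives 17 of 26 items above 0.4, which matches the 65% reported in Section 3.5.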

6. REFERENCES
[1] https://thetracktor.com/detail/0262033844/.

[2] P. H. Calais Guerra, A. Veloso, W. Meira Jr, and V. Almeida. From bias to opinion: a transfer-learning approach to real-time sentiment analysis. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 150–158. ACM, 2011.

[3] J. L. Elsas and N. Glance. Shopping for top forums: discovering online discussion for product research. In Proceedings of the First Workshop on Social Media Analytics, pages 23–30. ACM, 2010.

[4] M. Hu and B. Liu. Mining opinion features in customer reviews. In AAAI, volume 4, pages 755–760, 2004.

[5] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pages 165–172. ACM, 2013.

[6] S. M. Mudambi and D. Schuff. What makes a helpful online review? A study of customer reviews on Amazon.com. Management Information Systems Quarterly, 34(1):11, 2010.

[7] B. O’Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith. From tweets to polls: Linking text sentiment to public opinion time series. ICWSM, 11:122–129, 2010.

[8] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pages 79–86. Association for Computational Linguistics, 2002.

[9] K. Reschke, A. Vogel, and D. Jurafsky. Generating recommendation dialogs by extracting information from user reviews. In ACL (2), pages 499–504, 2013.