
Improving Product Review Search Experiences on General Search Engines

Shen Huang, Dan Shen eBay Research Labs

No.88 KeYuan Rd., 2010203, Shanghai, China 86-21-28913756

[email protected] [email protected]

Wei Feng Shanghai Jiao Tong University

No. 800, Dongchuan Rd., 200240, Shanghai, China 86-21-34205446

[email protected]

Catherine Baudin, Yongzheng Zhang eBay Research Labs

2145 Hamilton Avenue San Jose, CA 95125

[email protected]

[email protected]

ABSTRACT

In the Web 2.0 era, internet users contribute a large amount of online content, and product reviews are a good example. Since these reviews are scattered across shopping sites, weblogs, forums, etc., most people rely on general search engines to discover and digest others’ comments. While conventional search engines work well in many situations, they are not sufficient for gathering such information. The reasons include, but are not limited to: 1) the ranking strategy does not incorporate product reviews’ inherent characteristics, e.g., sentiment orientation; 2) the snippets are neither indicative nor descriptive of user opinions. In this paper, we propose a feasible solution to enhance the product review search experience. Based on this approach, a system named “Improved Product Review Search (IPRS)” is implemented on top of a general search engine. Given a query on a product, our system is capable of: 1) automatically identifying user opinion segments within a whole article; 2) ranking opinions by incorporating both the sentiment orientation and the topics expressed in reviews; 3) generating readable review snippets that indicate user sentiment orientations; 4) easily comparing products based on a visualization of opinions. The results of both a usability study and an automatic evaluation show that our system helps users quickly understand product reviews within limited time.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications – Data Mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – retrieval models, search process; H.5.2 [User Interfaces]: Graphical User Interfaces

General Terms

Algorithms, Experimentation, Human Factors.

Keywords

Opinion Mining, Sentiment Classification, Review Search, Affinity Rank

1. INTRODUCTION

Online shopping has become more convenient with the boom of e-commerce. According to a report from iResearch China Internet Research Center (http://www.iresearch.com.cn/), in 2003 “user review” was already an important factor in purchase-decision making, even more important than “producer/brand”. Many shoppers are willing to leverage others’ opinions to make a better choice. We analyzed a 2-year query log collected from a commercial search engine: roughly 1.7% of queries contain the word “review” or “rating”, which indicates a high demand for review search. Moreover, among a total of 4,980 million pages indexed by this search engine, about 16% are review pages. There is thus a clear challenge in providing a better search experience for product reviews.

Some e-commerce and online shopping portal sites, like eBay (http://www.ebay.com), Amazon (http://www.amazon.com) and Epinions (http://www.epinions.com), have organized their reviews quite well. Still, this scattered information drives some users to general web search engines for a more complete overview. However, we observe that current search engines have been optimized for finding information relevant to a given topic rather than for discovering good product reviews. For example, when the query “Nikon D200 review” is issued on Google, the results are ranked by their relevance to the query. Relevance is usually measured by the terms a result page shares with the query, without considering the characteristics of reviews. Also, the snippets are not very helpful for understanding the “review” or “rating” of the product: only three keywords, “Nikon”, “D200” and “review”, are highlighted, so users still have to follow the URL links and check the reviews one by one. We suppose users will have a better experience if the system can select important aspects and show a paragraph containing the crucial comments. Sentiment analysis, a related research field, has recently received considerable attention [1][2][3]. Nevertheless, these efforts do not explore how to adapt the ranking algorithm and improve the user experience for product review search.

In this paper, we propose a novel approach that consists of the following three components:

• First, within the search results, we identify the pages with subjective content and extract the passages containing opinions. The classification of subjective versus objective content has been explored in previous work [4][5][6][7]. Most of these approaches suffer from the “unseen words” problem, which is quite common due to the fact that there are far fewer focused and organized topics on the web. In our system, we exploit Part-Of-Speech (POS) information to smooth the probability of “unseen words” and improve the accuracy of subjectivity/objectivity classification [8].

• Second, after the pages with subjective information are identified, the next key point is to predict the orientation of the opinions, i.e., positive or negative. Additionally, an importance weight should be assigned to each opinion. To the best of our knowledge, few research efforts have focused on measuring the quality of opinions. We propose two kinds of implicit links to leverage available link analysis algorithms such as PageRank [9][24]. One is the implicit content link, which connects two opinions if they discuss the same topic [10]. The other is the opinion orientation link, which reflects whether the opinions in different reviews agree or disagree with each other. Inspired by the work of [10], we formulate four criteria for ranking product reviews: opinion richness, opinion diversity, topic richness and topic diversity.

• Finally, a novel presentation interface is set forth, with the subjective content ranked by the four criteria above. Practical user opinions are incorporated into the search snippets, which allows end users to get a clear overview of the main product comments at a glance instead of browsing the whole page. Additionally, all the opinions extracted from the result pages are summarized in a “radar graph”. This helps users quickly grasp the overall opinions on one product and easily compare two products on the web.

Using a commercial search engine as the backend, we have built a working system, named “Improved Product Review Search (IPRS)”, that consolidates the above functionalities. Our usability study shows that participants are satisfied with the product review search experience provided by IPRS. Evaluation results using reviews from 3 shopping sites also indicate that our review ranking algorithm is promising.

The rest of the paper is organized as follows: Related work is reviewed in Section 2. In Section 3, we present the overview of our system. Section 4 introduces our solutions to subjective content extraction, review ranking and opinion presentation. In Section 5, we show a usability study and some experimental results. Finally, we draw conclusions in Section 6.

2. RELATED WORK

Linguistic features such as adjectives, verbs and n-grams were studied in [11][12][6] for the purpose of subjectivity categorization. In practical use, the “unseen words” problem is critical and has not been well studied. Finn [13] presented a pilot study in domain transfer: how well classifiers learned on a training corpus transfer to a new corpus. In Finn’s work, POS statistics are utilized and the results are promising. In the language modeling area, researchers have employed models based on n-grams of word classes to generalize over unseen word sequences and hence offer improved robustness to novel or rare word combinations [14][15]. Our approach to the “unseen words” problem differs from these in that we adopt POS-based word classes.

Sentiment analysis tries to classify people’s sentiments as positive, negative or neutral. Pang et al. [1] found that for sentiment classification, standard machine learning techniques definitively outperform human-produced baselines. They also examined the factors that make the sentiment classification problem challenging. Hu and Liu [2] proposed a feature-based opinion summarization system, which focuses on mining and summarizing customer reviews of products posted on websites. Based on this work, Liu et al. [3] proposed a sentiment analysis system called Opinion Observer, which provides visual comparisons of different products. Different from their work, we also measure the quality of product reviews and provide more informative reviews.

Eguchi and Lavrenko [16] proposed several sentiment information retrieval models within the framework of probabilistic language models, assuming that users are willing to specify the sentiment polarity of interest when they input query terms. However, such frameworks do not offer users a good survey for making purchase decisions. Based on these observations, we try to define the quality of a review and propose a novel ranking approach.

Link analysis algorithms such as PageRank [9][24] and HITS [17] are used to assign an importance factor to web pages and have drawn much research attention for years. Most of them exploit the hyperlink structure among web pages to model a group of web pages as a link graph. However, they may fail on opinion ranking since few review pages link to each other. More recently, “implicit link analysis” [18][19][20] has become another major sub-area of link analysis research. Zhang et al. [10] construct links by the asymmetric content similarity between each pair of documents, which is called Affinity Rank. In our view, the “agreement” relationships among reviews are very similar to the “recommendation” relationships among web pages. The difference is that “disagreement”, or “no recommendation”, also exists among reviews. Thus, we construct two kinds of implicit links: the implicit content link and the opinion orientation link. Based on this link information, we propose a revised version of Affinity Rank for product review search.

A radar graph, also called a spider plot, star plot or polar plot, is a two-dimensional polar graph that can simultaneously display many variables with different quantitative scales. Radar graphs have been studied in data visualization [21], financial model analysis, and mathematical and statistical applications. They also appear in RPG game UIs to display an avatar’s multiple attributes. As far as we know, we are the first to use the radar graph to summarize user sentiments towards products in a review search system.

3. IPRS SYSTEM OVERVIEW

Figure 1 gives two examples of the IPRS system interface. The search results for a single product are illustrated in Figure 1(a), and the comparison of two products is given in Figure 1(b). In both modes, two components are provided: opinion snippets and visualized opinion summarization. For a single product, after a query is submitted, we collect the top 100 results from a commercial search engine and re-rank them with our adapted Affinity Rank algorithm. As Figure 1(a) illustrates, we generate opinion-based snippets and highlight positive comments, negative comments and product features for easy understanding. In the right panel, we show a radar graph generated from the statistics of the 6 most frequent product features. As can be seen in Figure 1(b), for the comparison of two products, the snippets containing opinions are listed side by side. The two radar graphs are also overlapped to show the differences in terms of features.¹

¹ Currently we only compare products belonging to the same catalog, such as digital cameras.

Figure 1(a). IPRS system interface for single product.

Figure 1(b). IPRS system interface for the comparison of two products.


Using a commercial search engine as our backend, we built an enhanced system to support product review search functionality. The framework of the system is illustrated in Figure 2 and includes three major components:

1. Subjectivity extraction. A preprocessing step which identifies the passages containing subjective opinions in each result page.

2. Opinion ranking. For each passage with subjective opinions, we extract the product features as well as the sentiment polarity on each feature. Considering both, a similarity function is re-defined to construct what we call an affinity graph.

3. Opinion presentation. Opinion-based snippets are generated to help users grasp the main idea easily instead of browsing the page from top to bottom. A summary of the opinions within all returned pages is visualized with a “radar graph”, which helps users quickly get the overall opinion of one product and compare two products on the web.

In the next section, we will present the detailed algorithms for all proposed components.

4. ALGORITHMS FOR PRODUCT REVIEW SEARCH

4.1 Subjective Content Extraction

Since the user aims to find good product reviews, the first step is to identify the passages containing subjective opinions in the web pages returned by the backend search engine. We extract all sentences from the result pages, and a classifier is then trained to label each sentence as subjective or objective. Different from previous work on specific domains, our target is to build a robust review search engine that can handle all kinds of products across the web. When data is collected from the web for training classification models, we find it hard to cover all topics, and out-of-vocabulary (OOV) words become a critical problem for this task. We bring forward a Part-Of-Speech (POS) based smoothing technique to alleviate this problem. The idea of our POS-based smoothing is similar to that of the Jelinek-Mercer method [22] in language modeling, which involves a linear interpolation

of two models, i.e., the word maximum likelihood model and the POS maximum likelihood model:

$$P(w_i \mid c) = \alpha_1 P_{SP}(w_i \mid c) + \alpha_2 P_{BGSP}(w_i \mid c) + \alpha_3 P_{TGSP}(w_i \mid c) + \alpha_4 P_{MBGSP}(w_i \mid c) + \alpha_5 P_{MTGSP}(w_i \mid c)$$

$$\begin{aligned} &= (\beta_1 P(w_i \mid c) + \beta_2 P(pos_i \mid c)) + (\beta_3 P(w_i w_{i-1} \mid c) + \beta_4 P(pos_i pos_{i-1} \mid c)) \\ &\quad + (\beta_5 P(w_i w_{i-1} w_{i-2} \mid c) + \beta_6 P(pos_i pos_{i-1} pos_{i-2} \mid c)) + (\beta_7 P(w_i \mid w_{i-1}, c) + \beta_8 P(pos_i \mid pos_{i-1}, c)) \\ &\quad + (\beta_9 P(w_i \mid w_{i-1} w_{i-2}, c) + \beta_{10} P(pos_i \mid pos_{i-1} pos_{i-2}, c)) \end{aligned} \quad (1)$$

where $P_{SP}(w_i \mid c)$ denotes smoothing by POS tags, $P_{BGSP}(w_i \mid c)$ and $P_{TGSP}(w_i \mid c)$ denote smoothing by POS bigrams and trigrams, respectively, and $P_{MBGSP}(w_i \mid c)$ and $P_{MTGSP}(w_i \mid c)$ denote smoothing by Markov POS bigrams and trigrams. For the combination of these 10 model probabilities, one important problem is how to assign a coefficient to each one. We utilize a linear regression model to learn the coefficients automatically. Empirical studies on five datasets show that our approach consistently outperforms Naïve Bayes with Laplace smoothing. In a cross-domain experiment, our approach achieves a 22.0% improvement in Macro F1 and 24.4% in Micro F1 over Naïve Bayes. More details can be found in Huang’s work [8].

4.2 Opinion Ranking

4.2.1 Product Feature Level Sentiment Analysis

After we identify the sentences with subjective opinions, we extract the product features as well as the opinion orientations. The process can be summarized in three major steps: 1) product feature extraction; 2) opinion appraisal extraction; 3) sentiment classification.

At the very beginning, basic noun phrases [26] are extracted as product feature candidates. After compactness pruning and redundancy pruning [2], the frequently appearing ones are identified as the final product features. Secondly, we extract opinion appraisals using machine learning techniques combined with dictionaries and web resources. An opinion appraisal is a word or phrase that can express an opinion. As described in [2][3], adjectives are useful for predicting opinion orientations. However, people express their opinions not only with adjectives but also with adverbs, verbs, nouns and phrases, e.g., “badly”, “buy”, “problem”, “give it low score”. To improve the coverage of our classifier, we modified the algorithm using the following two methods:

• Exploit the user rating information in the reviews collected from shopping sites. Usually, the reviews with 5 stars are assumed to be positive and those with 1 star to be negative. However, we observed that some 1-star reviews may also praise some features of a product, and vice versa. To remove such noise, we first use a well-trained model, which has high precision but low recall, to select sentences with high classification confidence from a large corpus of reviews. After that, the model is re-trained with the expanded training data. With such a bootstrapping process, we can gradually increase the recall of our classifier with little loss of precision (a sketch of this loop follows the list below).

• By examining the wrongly classified samples, we found that phrases play an important role in sentiment classification. For example, “buy it again” and “get them now” are frequently used in positive comments, while phrases like “keep away from it” and “avoid this brand” are frequent in negative comments. To avoid the biases introduced by noisy patterns, we mine such phrases from review titles, since titles are often short and contain good candidates for such phrases.

Figure 2. The framework of IPRS.
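As a rough illustration of the bootstrapping described in the first bullet, the following sketch uses a hypothetical classifier interface (`fit`, and `classify` returning a (label, confidence) pair); the threshold and round count are illustrative, not the values used in our system.

```python
# Sketch of a self-training loop: a high-precision seed model repeatedly
# expands its own training set with confidently classified sentences.

def bootstrap(model, labeled, unlabeled, threshold=0.9, rounds=3):
    """Grow the training set with high-confidence predictions to raise
    recall while keeping the loss of precision small."""
    train = list(labeled)            # (sentence, label) pairs
    pool = list(unlabeled)           # unlabeled sentences
    for _ in range(rounds):
        model.fit(train)
        confident, remaining = [], []
        for sentence in pool:
            label, confidence = model.classify(sentence)
            if confidence >= threshold:
                confident.append((sentence, label))
            else:
                remaining.append(sentence)
        if not confident:            # nothing new to learn from
            break
        train.extend(confident)      # expand the training data
        pool = remaining
    return model
```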

Finally, we use Naïve Bayes to predict the sentiment orientation. Motivated by the work in [2][3], we treat negation expressions specially. Let $oa$ denote an opinion appraisal, $oa_i$ ($i = 1 \ldots n$) denote the appraisals in affirmation, $oa_j$ ($j = 1 \ldots m$) denote the appraisals in negation (with the negation word removed), and $\bar{c}$ denote the opposite class of $c$. We revise Naïve Bayes as follows:

$$c^* = \arg\max_{c \in C} P(c) \times \prod_{i=1}^{n} P(oa_i \mid c) \times P(\bar{c}) \times \prod_{j=1}^{m} P(oa_j \mid \bar{c}) \quad (2)$$
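As an illustration, here is a minimal sketch of the revised classifier in equation (2), computed in log space. The prior and likelihood tables are toy values, not learned parameters.

```python
import math

# Appraisals in affirmation vote for class c; appraisals under negation
# (negation word removed) vote for the opposite class c-bar, as in equation (2).

OPPOSITE = {"pos": "neg", "neg": "pos"}

def classify(affirmed, negated, prior, likelihood, eps=1e-6):
    best_c, best_score = None, -math.inf
    for c in ("pos", "neg"):
        score = math.log(prior[c])
        for oa in affirmed:                       # P(oa_i | c)
            score += math.log(likelihood.get((oa, c), eps))
        score += math.log(prior[OPPOSITE[c]])
        for oa in negated:                        # P(oa_j | c-bar)
            score += math.log(likelihood.get((oa, OPPOSITE[c]), eps))
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# "not good": "good" is an appraisal under negation, so it votes for "neg".
prior = {"pos": 0.5, "neg": 0.5}
likelihood = {("good", "pos"): 0.1, ("good", "neg"): 0.01}
print(classify(affirmed=[], negated=["good"], prior=prior, likelihood=likelihood))
# -> neg
```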

Our algorithm is tested on 1,800 sentences labeled by human judges, with 10-fold cross-validation. We achieve an F1 score of 0.9 for the positive class and 0.74 for the negative class. The empirical results show that such a model can significantly improve the recall of negative comments, from 0.64 to 0.71. After analyzing the data, we found that users are more likely to express negative comments in a roundabout way; for example, “not good” and “without any improvement” frequently occur in negative opinions.

4.2.2 Affinity Opinion Rank

After opinion orientation classification, we take the opinion quality into consideration. Opinion quality is two-fold: one goal is to get as many comments as possible on different product features; the other is to get opinion polarities as diverse as possible on the commented features. In this section, we introduce Affinity Opinion Rank to measure opinion quality.

4.2.2.1 Overview of Affinity Rank

Before explaining Affinity Opinion Rank, we briefly review prior work on Affinity Rank [10]. In that paper, the authors introduced two metrics, diversity and information richness, which measure the quality of search results by considering the content-based link structure of a group of documents and the content of a single document in the search results. Thus Affinity Rank can be used to re-rank the top search results. The four major components of Affinity Rank include:

1. Definitions of Information Richness and Diversity: Information richness measures how many different topics a single document contains. Diversity measures the variety of topics in a group of documents.

2. Construction of Affinity Graph: Let $D = \{d_i \mid 1 \le i \le n\}$ denote a document collection. According to the vector space model [23], each document $d_i$ can be represented as a vector $\vec{d_i}$. Each dimension of the vector is a term and its value is the TFIDF of that term. The affinity of $d_i$ to $d_j$ is defined as

$$\mathrm{aff}(d_i, d_j) = \frac{\vec{d_i} \cdot \vec{d_j}}{|\vec{d_i}|} \quad (3)$$

The affinity defined here is asymmetric because $\mathrm{aff}(d_i, d_j) \neq \mathrm{aff}(d_j, d_i)$. If we consider documents as nodes, the document collection can be modeled as an affinity graph by generating links between documents using the following rule: a directional link from $d_i$ to $d_j$ ($i \neq j$) with weight $\mathrm{aff}(d_i, d_j)$ is constructed if $\mathrm{aff}(d_i, d_j) \ge \mathrm{aff}_t$ ($\mathrm{aff}_t$ is a threshold, empirically set to 0.2 in our experiments); otherwise no link is constructed.

3. Link Analysis by Affinity Graph: After obtaining an affinity graph, we apply a link analysis algorithm similar to PageRank [9] to compute the information richness of each node in the graph. First, an adjacency matrix $M = (M_{i,j})_{n \times n}$ is used to describe the affinity graph, with each entry corresponding to the weight of a link:

$$M_{i,j} = \begin{cases} \mathrm{aff}(d_i, d_j), & \text{if } \mathrm{aff}(d_i, d_j) \ge \mathrm{aff}_t \\ 0, & \text{otherwise} \end{cases} \quad (4)$$

Without loss of generality, $M$ is normalized so that each row sums to 1. The normalized adjacency matrix $\tilde{M} = (\tilde{M}_{i,j})_{n \times n}$ is used to compute the information richness score of each node. The computation is based on two intuitions: 1) the more neighbors a document has, the more informative it is; 2) the more informative a document’s neighbors are, the more informative it is. Thus the score of document $d_i$ can be deduced from those of all other documents linked to it, formulated recursively as:

$$\mathrm{InfoRich}(d_i) = \sum_{\text{all } j \neq i} \mathrm{InfoRich}(d_j) \cdot \tilde{M}_{j,i} \quad (5)$$

Since $\tilde{M}$ is normally a sparse matrix, all-zero rows may appear, i.e., some documents have no other documents with significant affinity to them. To compute a meaningful eigenvector, PageRank introduces a damping factor $c$:

$$\vec{\lambda} = c\tilde{M}^{T}\vec{\lambda} + \frac{(1-c)}{n}\vec{e} \quad (6)$$

where $\vec{e}$ is a unit vector with all components equal to 1, and the damping factor $c$ is set to 0.85 in our experiments. The computation of information richness can be explained in a way similar to the random surfer model; we call it the random information flow model.

4. Diversity Penalty: Computing information richness helps us choose more informative documents for the top search results. However, in some cases the two most informative documents can be very similar. To increase coverage in the top results, a penalty is imposed on each document’s information richness score according to its influence on topic diversity. We calculate the diversity penalty with a greedy algorithm: in each iteration, the penalty is imposed on documents topic by topic, and the Affinity Ranking scores are updated accordingly. The more similar a document is to the most informative one, the more penalty it receives and the more its Affinity Ranking score decreases. This ensures that only the most informative document in each topic becomes distinctive in the ranking. More details can be found in [10]. A code sketch of the core computation in equations (3)-(6) follows this list.
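The following sketch illustrates equations (3)-(6): it builds the asymmetric affinity matrix from toy TF-IDF vectors, thresholds it into a link graph, and runs the damped iteration to obtain information richness scores. The vectors and parameter values are illustrative only, and the diversity penalty step is omitted.

```python
import numpy as np

def info_richness(doc_vectors, aff_t=0.2, c=0.85, iters=50):
    d = np.asarray(doc_vectors, dtype=float)
    n = len(d)
    # Equation (3): aff(d_i, d_j) = (d_i . d_j) / |d_i|  (asymmetric)
    norms = np.linalg.norm(d, axis=1, keepdims=True)
    aff = (d @ d.T) / norms
    np.fill_diagonal(aff, 0.0)
    # Equation (4): keep links whose affinity clears the threshold aff_t
    M = np.where(aff >= aff_t, aff, 0.0)
    # Row-normalize; all-zero rows stay zero (handled by the damping term)
    row_sums = M.sum(axis=1, keepdims=True)
    M_norm = np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)
    # Equations (5)-(6): damped power iteration, as in PageRank
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = c * (M_norm.T @ scores) + (1.0 - c) / n
    return scores

# Toy example: three short "documents" in a 4-term TF-IDF space.
docs = [[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.1, 0.0], [0.0, 0.0, 0.1, 0.9]]
print(info_richness(docs))
```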

4.2.2.2 Rank Opinions by Affinity Graph

Before purchasing a product, a user would like to survey multiple reviews to avoid biased opinions, so information coverage is indispensable. We claim that Affinity Rank is appropriate for opinion ranking for two reasons: it is necessary for a user to listen to different reviewers, and it is necessary to let a user find more information with limited reading effort. For the first, diversity measures the variety of topics in a group of documents; for the second, information richness should be taken into consideration. We notice that two kinds of implicit links can be constructed to build affinity graphs. One is the implicit content link, which was introduced in [10]. The other is the opinion orientation link: the opinions in different reviews may agree or disagree with each other. Considering these specialties of reviews, we revise Affinity Rank in two aspects:

1. Definitions of diversity and information richness.

2. Definitions of affinity between reviews.

We call this revised Affinity Rank the Affinity Opinion Rank. First, we refine diversity and information richness into 4 metrics:

• Opinion Diversity: this ensures the diversity of opinions and lets users see different opinions from different reviewers.

• Opinion Richness: we suppose that comments covering more distinct product features are more valuable to users.

• Topic Diversity and Topic Richness: these are the same as the diversity and information richness definitions in [10]. Since users care more about product features when they search for product reviews, we use product features to represent topics.

Before constructing the affinity graph for reviews, we first introduce the definition of review affinity. For each product feature, we classify comments into three sentiments: positive, negative and neutral. We assume that comments with the same sentiment orientation recommend each other. The strength of recommendation is defined in Table 1.

Table 1. Recommendation between different types of opinions

              Positive (+)   Negative (-)   Neutral (0)
Positive (+)  Strong         None           Middle
Negative (-)  None           Strong         Middle
Neutral (0)   Middle         Middle         Weak

The basic assumptions are:

• Sharing the same opinion means two comments support each other; we consider an implicit recommendation to exist between them, i.e., an implicit opinion-orientation link.

• Sharing the same product feature means two comments discuss the same aspect of a product, i.e., an implicit content link.

“Strong”, “Middle”, “Weak” and “None” denote the relationships between different opinions, and we use them as the weights of the implicit links. We assign the highest weight to agreeing opinions and no weight to disagreeing opinions on a product feature. If at least one of the two comments is neutral, the weight is decreased to “Middle” or “Weak”. By defining different levels of weights, we combine the similarities based on

opinion orientations and product features. Two kinds of implicit links are thus constructed in the same graph, so opinion richness/diversity and topic richness/diversity can be calculated simultaneously. Based on these observations, we re-define the similarity measurement between two documents as follows. Let $D = \{d_i \mid 1 \le i \le n\}$ denote a document collection, with each document $d_i$ represented as a vector $\vec{d_i}$. The review affinity of $d_i$ to $d_j$ is defined as

$$\mathrm{aff}(d_i, d_j) = \frac{\vec{d_i} \cdot \vec{d_j}}{|\vec{d_i}|} = \frac{\sum_{k=1}^{t} w_{k,i} \times w_{k,j}}{\sum_{k=1}^{t} w_{k,i}^{2}} \quad (7)$$

Different from the conventional search model, we treat each product feature as one vector dimension and its sentiment as the value. In our experiments, the sentiment value is obtained by combining the normalized probability of the Naïve Bayes classifier with the sentiment polarity. If a feature is not neutral, its normalized probability is larger than 0.5; otherwise, its probability is set to 0.5. Suppose $w_{k,i}$ and $w_{k,j}$ appear in $d_i$ and $d_j$ respectively, the opinion associated with feature $w_{k,i}$ belongs to class $C_p$, and the opinion associated with feature $w_{k,j}$ belongs to class $C_q$. Then $w_{k,i} \times w_{k,j}$ is defined as:

$$w_{k,i} \times w_{k,j} = \begin{cases} (\mathrm{Polarity}(C_p) \times \mathrm{Polarity}(C_q)) \times (P(w_{k,i} \mid C_p) \times P(w_{k,j} \mid C_q)), & \text{if } \mathrm{Polarity}(C_p) \times \mathrm{Polarity}(C_q) \neq 0 \\ P(w_{k,i} \mid C_p) \times P(w_{k,j} \mid C_q), & \text{otherwise} \end{cases}$$

$$\mathrm{Polarity}(C) = \begin{cases} 1, & \text{if } C \text{ is the “Positive” class} \\ -1, & \text{if } C \text{ is the “Negative” class} \\ 0, & \text{if } C \text{ is the “Neutral” class} \end{cases} \quad (8)$$

where $P(w_{k,i} \mid C_p)$ and $P(w_{k,j} \mid C_q)$ are the normalized probabilities from the Naïve Bayes classifier. We use this definition to reflect the relationships in Table 1. With review affinity, links between documents are generated in a way similar to that used in previous work [10].
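Here is a small sketch of equations (7) and (8) with illustrative data structures: each review is reduced to a map from product features to (sentiment class, normalized probability) pairs. Agreeing opinions reinforce the affinity, disagreeing opinions cancel it, and neutral opinions contribute a damped weight, mirroring Table 1.

```python
POLARITY = {"pos": 1.0, "neg": -1.0, "neu": 0.0}

def feature_product(cp, p_i, cq, p_j):
    """w_{k,i} x w_{k,j} as defined in equation (8)."""
    sign = POLARITY[cp] * POLARITY[cq]
    if sign != 0.0:
        return sign * p_i * p_j          # agree: positive, disagree: negative
    return p_i * p_j                     # at least one neutral: damped weight

def review_affinity(di, dj):
    """aff(d_i, d_j) from equation (7); asymmetric in d_i."""
    shared = set(di) & set(dj)
    num = sum(feature_product(di[k][0], di[k][1], dj[k][0], dj[k][1])
              for k in shared)
    den = sum(p * p for _, p in di.values())
    return num / den if den else 0.0

# Two reviews agreeing on "lens" and disagreeing on "battery".
d1 = {"lens": ("pos", 0.9), "battery": ("neg", 0.8)}
d2 = {"lens": ("pos", 0.7), "battery": ("pos", 0.6)}
print(review_affinity(d1, d2))   # agreement and disagreement partly cancel
```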

4.3 Opinion Presentation

When users search for product reviews on the web, they usually input the product name together with words like “reviews” and “opinions”. In that situation, conventional search engines show snippets containing such keywords, but we do not think this helps users quickly understand what others think about a product. After extracting and ranking the opinions in the web pages, we can show this information to users in a friendlier way. This section studies how to improve the user experience by generating opinion snippets and visualizing an opinion summary.

4.3.1 Opinion Snippet Generation

Normally, topic keywords are highlighted to assist users in reading the text snippets returned by a conventional search engine. Here we suppose that keywords expressing opinions are also important to review readers. Therefore, we assign more weight to short segments containing both product features (topic keywords) and opinion keywords. We define the snippet score as follows:

$$\mathrm{snippet\_score} = \sum_{k=1}^{n} P(w_{k,i} \mid C) \quad (9)$$

Similar to the definitions in equation (8), $w_{k,i}$ is a product feature word described in the snippet and $P(w_{k,i} \mid C)$ is the corresponding normalized probability for $w_{k,i}$. If a feature is not neutral, its normalized probability is larger than 0.5; otherwise, its probability is set to 0.5. We define the snippet score as a sum of such probabilities so that text enriched with more opinions receives a higher score. After that, a greedy algorithm is used to generate opinion snippets:

Algorithm Overview:

1. Set the max length (in words) for the snippet as n.

2. Select opinion words and product features from the review. Expand each selected word backward and forward by up to 5 words. These short segments are the candidate snippets. Calculate the snippet score for each candidate.

3. Let m denote the length of the already selected text. Select the candidate with the highest snippet score from the remaining candidates.

   a) If the candidate overlaps with candidates that are already selected, merge them.

   b) If the latest snippet is longer than n, truncate it and exit.

4. Let n = n - m, and repeat step 3.

Eventually, we highlight the product features, positive appraisals and negative appraisals in different colors.
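A condensed sketch of the snippet generator follows, assuming pre-tokenized text and toy feature probabilities; real candidate extraction, truncation to the exact length budget, and color highlighting are simplified away.

```python
# Score candidate segments by the sum of feature probabilities (equation 9)
# and greedily merge the best ones up to a length budget.

def candidates(tokens, keywords, window=5):
    """Expand each feature/opinion keyword by up to `window` words each way."""
    return [(max(0, i - window), min(len(tokens), i + window + 1))
            for i, tok in enumerate(tokens) if tok in keywords]

def snippet_score(tokens, span, feature_prob):
    # Equation (9): sum of normalized probabilities of features in the span
    return sum(feature_prob.get(t, 0.0) for t in tokens[span[0]:span[1]])

def build_snippet(tokens, keywords, feature_prob, max_len=20):
    spans = sorted(candidates(tokens, keywords),
                   key=lambda s: snippet_score(tokens, s, feature_prob),
                   reverse=True)
    chosen, used = [], 0
    for start, end in spans:
        if used >= max_len:
            break
        # Merge with an overlapping span that is already selected.
        for i, (s, e) in enumerate(chosen):
            if start <= e and end >= s:
                chosen[i] = (min(s, start), max(e, end))
                break
        else:
            chosen.append((start, end))
        used = sum(e - s for s, e in chosen)
    return [" ".join(tokens[s:e]) for s, e in sorted(chosen)]

tokens = "the lens is sharp but the battery life is a real problem".split()
print(build_snippet(tokens, {"lens", "battery", "sharp", "problem"},
                    {"lens": 0.9, "battery": 0.8, "sharp": 0.7, "problem": 0.6}))
```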

4.3.2 Opinion Summary Visualization

Besides the improved preview of details, we also provide an opinion summarization algorithm to give the user a glimpse of the overall comments. In this paper, we propose to use a radar graph to visualize the opinion summary. Each axis in a radar graph stands for a product feature, and the length stands for the support ratio of that feature. Taking digital cameras as an example, the axes in Figure 3 represent different camera features, i.e., appearance, price, function, etc. Users can get an intuitive feel for the strengths and weaknesses of the product. When the radar graphs of several products are put together, it is much easier to make comparisons among products. As shown in Figure 4, a user can quickly see on which features product A is better than product B. Our usability test indicates that the radar graph expresses the product opinion summary clearly and effectively.

Figure 3. Radar graph for one product.

Figure 4. Comparison of two products using radar graph.

Now the question is how to obtain the support ratio for each product feature. One intuitive way is to use the percentage of positive comments. However, if the comments on a product feature are very limited, the percentage is not trustworthy enough. For example, we cannot say for sure that “100% positive of 2 reviews” is definitely better than “90% positive of 100 reviews”. Thus both the number of reviews and the percentage of positive ones should be considered. Let $pf$ denote a product feature and $p$ denote a product; we model the risk of using the percentage as a criterion with the geometric distribution [25] as follows:

$$\hat{R}_{pf,p} = \left(\frac{\bar{f}_{pf}}{1.0+\bar{f}_{pf}}\right)^{cf} \times \frac{1.0}{1.0+\bar{f}_{pf}} \quad (10)$$

where $\bar{f}_{pf}$ is the mean frequency of product feature $pf$ in products where $pf$ occurs, and $cf$ denotes the frequency of the comments on $pf$. The intuition behind this formula is that as $cf$ gets further away from the mean, the percentage becomes less risky to use as an estimation. We use this risk function as a mixing parameter in our calculation of the radar score $rs$:

$$rs = pp(pf,p)^{(1.0-\hat{R}_{pf,p})} \times 0.5^{\hat{R}_{pf,p}} \quad (11)$$

where $pp(pf,p)$ is the percentage of positive comments for product feature $pf$. Once the radar score is obtained, we use the 6 most frequent product features to generate the radar graph. If two products are compared, the 6 most frequent common features are chosen to display the radar graph.
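A minimal sketch of equations (10) and (11) as reconstructed above, with illustrative numbers: the shrinkage toward the 0.5 prior is stronger for a feature with only 2 comments than for one with 100.

```python
def risk(mean_freq, comment_freq):
    """Equation (10): geometric-distribution risk of trusting the percentage."""
    p = mean_freq / (1.0 + mean_freq)
    return (p ** comment_freq) * (1.0 / (1.0 + mean_freq))

def radar_score(pos_percentage, mean_freq, comment_freq):
    """Equation (11): mix the observed percentage with the 0.5 prior."""
    r = risk(mean_freq, comment_freq)
    return (pos_percentage ** (1.0 - r)) * (0.5 ** r)

# "100% positive of 2 reviews" vs "90% positive of 100 reviews", for a
# feature commented on 30 times per product on average.
print(radar_score(1.0, 30.0, 2))    # ~0.979: pulled toward 0.5 more strongly
print(radar_score(0.9, 30.0, 100))  # ~0.899: barely shrunk at all
```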

5. EXPERIMENTS

To test the effectiveness of our proposed “improved product review search” solution, we conduct an objective experiment on the effectiveness of Affinity Opinion Rank using data collected from online shopping sites. Besides this backend evaluation, a usability study with 7 users compares our system with a commercial search engine from the front-end perspective.

5.1 Automatic Evaluation of Affinity Opinion Rank

5.1.1 Datasets

We collected product reviews from three online shopping websites: eBay, Epinions and Amazon. For each site we collected reviews for three product categories: digital camera (DC), cell phone (CP) and video game console (VGC). Some statistics of the three datasets are listed in Table 2. For each review, we store the product name, date, helpfulness (i.e., how many people think the review is useful), user rating and user comments.

Table 2. Datasets Statistics

Category             Products   Reviews per Product   Length per Review (Words)
Digital Camera       2,669      18                    138
Cell Phone           737        27                    121
Video Game Console   153        50                    126

5.1.2 Evaluation Metrics

One challenge of the affinity ranking strategy is how to evaluate the richness and diversity conveyed in the top n documents. Here we make use of helpfulness and user rating to estimate Opinion Richness and Opinion Diversity. Moreover, we collected specifications for these products to build a combined product scheme, which is used for the Topic Richness and Topic Diversity calculations.

In the collected data set, we notice that the websites provide a mechanism for users to express whether a review is helpful or not. Thus the information “x of y people found the review helpful” and the product rating can be utilized to approximately estimate opinion richness and diversity for the top n reviews:

• Opinion Richness (OR): Suppose h denotes the number of people who found a review useful and len denotes the length of the review (in words). OR is defined as:

$$OR = \frac{h}{len} \quad (12)$$

The assumption is that the higher helpfulness a review has, the more useful opinions can be found in it. The OR score is normalized by the maximum in the collection. The OR of the top n reviews is the average OR of the top n documents.

• Opinion Diversity (OD): For all reviews of a specific product, let rd denote the user rating (normally 1 to 5 stars) distribution in the top n documents and rdc denote the user rating distribution in the whole collection. OD is defined as the similarity between rd and rdc. By formulating the two rating lists as vectors, we calculate OD as the cosine of the two vectors. The assumption is that high similarity between the two distributions indicates that users will not be misled even if they read only the several reviews ranked at the top.

Moreover, we combine the major product features from the three websites into a unified product scheme, which is used to measure topic richness and diversity for the top n review documents:

• Topic Richness (TR): Suppose cf represents the common features between a review document and the unified scheme, and ef is the set of features extracted from the review. TR is defined as:

$$TR = \frac{|cf|}{|ef|} \quad (13)$$

The TR of the top n reviews is the average TR of the top n documents.

• Topic Diversity (TD): Suppose cf represents the common features between a review document and the unified scheme, and F is the whole feature set in the combined product scheme. TD is defined as:

$$TD = \frac{|cf|}{|F|} \quad (14)$$
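These four metrics are straightforward to compute. The sketch below uses toy inputs, with features represented as sets and ratings as lists of star values.

```python
import math

def opinion_richness(review, max_or):
    # Equation (12): helpfulness votes per word, normalized by the collection max
    return (review["helpful"] / review["length"]) / max_or

def opinion_diversity(top_ratings, all_ratings):
    """Cosine similarity between the 1-5 star rating distributions of the
    top-n reviews and of the whole collection."""
    top = [top_ratings.count(s) for s in range(1, 6)]
    full = [all_ratings.count(s) for s in range(1, 6)]
    dot = sum(a * b for a, b in zip(top, full))
    norm = math.sqrt(sum(a * a for a in top)) * math.sqrt(sum(b * b for b in full))
    return dot / norm if norm else 0.0

def topic_richness(review_features, scheme):
    # Equation (13): common features over extracted features
    return len(review_features & scheme) / len(review_features)

def topic_diversity(review_features, scheme):
    # Equation (14): common features over the whole unified scheme
    return len(review_features & scheme) / len(scheme)

scheme = {"lens", "battery", "screen", "price"}
print(topic_richness({"lens", "battery", "grip"}, scheme))   # 2/3
print(topic_diversity({"lens", "battery", "grip"}, scheme))  # 2/4
print(opinion_diversity([5, 5, 4], [5, 5, 4, 3, 1, 5, 4]))
```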

5.1.3 Results and Analysis

On the shopping websites, reviews can be ranked by helpfulness, user rating and review date. Because helpfulness-based ranking is a form of human knowledge and is already used in our evaluation metrics, we exclude it from the comparison between Affinity Ranking and the other ranking methods. The baselines rank by user rating and date: Highest Rating First (HRF), Lowest Rating First (LRF), Oldest First (OF) and Newest First (NF). With Affinity Rank (AR), the improvements in OR, OD, TR and TD averaged over the 3 datasets are shown in Figures 5 and 6. From these figures, we can see that Affinity Opinion Rank brings larger improvements in OR and TD. For the OR improvement, the reason may be that users prefer content with both product features and others’ comments, which is consistent with our assumption in Section 4.2.2.2. We further studied the data and found that in the 3 datasets about 70% of reviewers gave very high ratings, i.e., 5 or 4 stars. That makes the rating distribution less sensitive to the selection of the top n documents.


Figure 5. Information richness and diversity of top 5 results.

Figure 6. Information richness and diversity of top 10 results.


5.2 Usability Study

Based on the solutions introduced in Section 4, we built a working system, Improved Product Review Search (IPRS), with a commercial search engine as our backend. The basic goal of the usability study is to compare IPRS with a traditional search engine.

5.2.1 Experimental Design

In the experiment, three aspects of our system are evaluated: 1) the efficiency of the opinion snippets; 2) the quality of Affinity Opinion Rank; 3) the usefulness of the visualization. A subjective study is conducted, with the questionnaire designed as follows:

Q1. Is the snippet in our system helpful for understanding the information, compared with traditional search?

Q2. Are the search results in our system better for the purpose of a product survey, compared with traditional search?

Q3. How helpful is the radar graph for understanding the overall opinion of a product?

Q4. How helpful is the radar graph when comparing two products?

Q5. Do you have any other opinions about our system?

Q1 and Q2 are designed for aspects 1) and 2) respectively; Q3 and Q4 are designed for aspect 3).

5.2.2 Participants

The study was conducted with a group of 7 volunteers (2 female, 5 male), aged 20 to 36. All of them had experience searching the web.

5.2.3 Procedure

Each participant was seated in a quiet room facing a laptop, and the moderator explained the tasks and the overall procedure. Each participant was made to feel comfortable and relaxed with the procedure. The 20 queries in Table 3 were used in this study, covering different types of digital cameras, cell phones and video game consoles. Each participant played with the system for about 20 minutes and tried 2 or 3 queries. After reviewing the snippets, search results and radar graphs, the participant filled in the questionnaire to complete the test. Each participant followed these steps during the test:

1. Submit a query, and then go through all snippets presented by our system and by the traditional search engine, respectively.

2. Review the top 5 search results presented by our system and by the traditional search engine, respectively.

3. Get the overall opinion for a single product from a radar graph.

4. Get the comparison information for two products shown in a radar graph.

5. Repeat the above process until two or more queries are finished.

6. At the end of the test, each participant was invited to answer the 5 questions we designed, rating quality on a scale from 1 (not helpful at all) to 5 (very helpful).

Table 3. Queries for Usability Study

Category             Query
Digital Camera       “Canon Powershot” review; “Canon digital ixus” review; “Sony Cybershot dsct9” review; “Sony dscs600” review; “Nikon Coolpix p4” review; “Nikon d200” review; “Kodak dsc620” review; “Kodak pro880” review
Cell Phone           “Nokia 6236i” review; “Nokia 3220” review; “Nokia 3660” review; “Motorola RAZR v3c” review; “Motorola v557” review; “Motorola MPx220 Smartphone” review; “Sony Ericsson w300i” review; “Sony Ericsson t68i” review; “Sony Ericsson w810i” review
Video Game Console   “XBox 360” review; “Playstation Portal” review; “Nitendo DS Light” review

5.2.4 Experimental Results

Table 4 summarizes the participants’ feedback on the three aspects we care about. The answers to the open question Q5 were collected as well for a better evaluation. The study results indicate that our system performs well on the tasks of opinion search and visualization.

Table 4. Evaluation of new GUI through questionnaire

Feature   Answers* (1 / 2 / 3 / 4 / 5)   Mean   Standard Deviation
Q1        0   1   2   4   0              3.43   0.73
Q2        0   0   2   4   1              3.86   0.64
Q3        0   1   3   2   1              3.43   0.91
Q4        0   1   0   4   2              4.00   0.93

* 1 - not helpful at all, 2 - unhelpful, 3 - neutral, 4 - helpful, 5 - very helpful

As can be seen in Table 4, most participants think the search results in our system are better than those of traditional search; in other words, users are satisfied with the quality of Affinity Opinion Rank. In addition, more than 85% of users think the radar graph helps intuitive understanding when showing comparison information for two products. The efficiency of the opinion snippets is rated only slightly better than the traditional search engine. From the answers to Q5 and our observation of user behavior during the test, we found that the complex colored UI for showing snippets confuses users: they have to stop reading and think about what the green color means, what the red color means, etc.

6. CONCLUSION

We propose an improved product review search experience to help users find product reviews on general web search engines.


Given a query, our system IPRS can automatically collect text segments containing user opinions from the web, with the help of a traditional web search engine. We put forward the Affinity Opinion Rank algorithm to make review search results rich and diverse in user opinions. Our system can also automatically generate review snippets that indicate user sentiment orientations and describe user opinions on product features. Moreover, the system makes it convenient to compare products based on user opinions.

7. ACKNOWLEDGMENTS

We thank Dr. Zheng Chen, Dr. Jian-Tao Sun and Dr. Min Zhou for their help on algorithm and system design. We thank Ning Liu, Yunbo Cao, Shiwei Cheng, Qionghua Wang and Pu Wang for their help with the usability study. We are thankful to Johnson Wang for his constant support of our research projects. We also thank the reviewers for their valuable suggestions on this work.

8. REFERENCES

[1] Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2002.

[2] Hu, M., and Liu, B. Mining and Summarizing Customer Reviews. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2004.

[3] Liu, B., Hu, M., and Cheng, J. Opinion Observer: Analyzing and Comparing Opinions on the Web. In Proceedings of the 14th International World Wide Web Conference, 2005.

[4] Wiebe, J., Bruce, R., and O’Hara, T. Development and Use of a Gold Standard Data Set for Subjectivity Classifications. In Proceedings of ACL-99, 1999.

[5] Bruce, R., and Wiebe, J. Recognizing Subjectivity: A Case Study of Manual Tagging. Natural Language Engineering, 2000.

[6] Wiebe, J., and Wilson, T. Learning to Disambiguate Potentially Subjective Expressions. In Proceedings of the Conference on Computational Natural Language Learning, 2002.

[7] Riloff, E., and Wiebe, J. Learning Extraction Patterns for Subjective Expressions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2003.

[8] Huang, S., Sun, J.-T., Wang, X.-H., Zeng, H.-J., and Chen, Z. Subjectivity Categorization in Weblog Space using Part-Of-Speech based Smoothing. In Proceedings of the 2006 IEEE International Conference on Data Mining, 2006.

[9] Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies Project, 1998.

[10] Zhang, B.-Y., Li, H., Liu, Y., Ji, L., Xi, W.-S., Fan, W.-G., Chen, Z., and Ma, W.-Y. Improving Web Search Results Using Affinity Graph. In Proceedings of ACM SIGIR, 2005.

[11] Turney, P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proceedings of ACL-02, 2002.

[12] Hatzivassiloglou, V., and McKeown, K. Predicting the Semantic Orientation of Adjectives. In Proceedings of ACL-EACL-97, 1997.

[13] Finn, A., Kushmerick, N., and Smyth, B. Genre Classification and Domain Transfer for Information Filtering. In Proceedings of ECIR-02, 2002.

[14] Brown, P.F., Della Pietra, V.J., de Souza, P.V., Lai, J.C., and Mercer, R.L. Class-based N-gram Models of Natural Language. Computational Linguistics, 18(4), 1992.

[15] Niesler, T.R., Whittaker, E.W.D., and Woodland, P.C. Comparison of Part-of-speech and Automatically Derived Category-based Language Models for Speech Recognition. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 1998.

[16] Eguchi, K., and Lavrenko, V. Sentiment Retrieval using Generative Models. In Proceedings of EMNLP, 2006.

[17] Kleinberg, J.M. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM (JACM), 46(5), 604-632.

[18] Chen, Z., Tao, L., Wang, J., Liu, W., and Ma, W.-Y. A Unified Framework for Web Link Analysis. In Proceedings of the 3rd International Conference on Web Information Systems Engineering, 2002.

[19] Xi, W., Zhang, B., Chen, Z., Lu, Y., Yan, S., Ma, W.-Y., and Fox, E.A. Link Fusion: A Unified Link Analysis Framework for Multi-type Interrelated Data Objects. In Proceedings of the 13th International Conference on World Wide Web, 2004.

[20] Xue, G.-R., Zeng, H.-J., Chen, Z., Ma, W.-Y., Zhang, H.-J., and Lu, C.-J. Implicit Link Analysis for Small Web Search. In Proceedings of ACM SIGIR, 2003.

[21] Wesson, J.L. Applying Visualisation Techniques in Novel Domains. In Proceedings of the Ninth International Conference on Information Visualisation (IV'05), 2005.

[22] Jelinek, F., and Mercer, R. Interpolated Estimation of Markov Source Parameters from Sparse Data. Pattern Recognition in Practice, 1980.

[23] Wong, S.K.M., and Raghavan, V.V. Vector Space Model of Information Retrieval: A Reevaluation. In Proceedings of ACM SIGIR, 1984.

[24] Haveliwala, T.H. Topic-sensitive PageRank. In Proceedings of the 11th International World Wide Web Conference, 2002.

[25] Zelen, M., and Severo, N. “Probability Functions.” Handbook of Mathematical Functions. National Bureau of Standards Applied Mathematics Series No. 55, 1964.

[26] Evans, D.A., and Zhai, C. Noun-phrase Analysis in Unrestricted Text for Information Retrieval. In Proceedings of ACL-96, 1996.