mining arabic business reviews · 2018-04-06 · mining arabic business reviews mohamed elhawary...

Mining Arabic Business Reviews

Mohamed Elhawary Mohamed ElfekyGoogle Inc.

Mountain View, CA, USA{elhawary | mgelfeky}@google.com

Abstract—For languages with rich content over the web,business reviews are easily accessible via many known websites,e.g., Yelp.com. For languages with poor content over the weblike Arabic, there are very few websites (we are actually awareof only one that is indeed unpopular) that provide businessreviews. However, this does not mean that such reviews do notexist. They indeed exist unstructured in websites not originallyintended for reviews, e.g., Forums and Blogs. Hence, there is aneed to mine for those Arabic reviews from the web in order toprovide them in the search results when a user searches for abusiness or a category of businesses. In this paper, we showhow to extract the business reviews scattered on the webwritten in the Arabic language. The mined reviews areanalyzed to also provide their sentiments (positive, negative orneutral). This way, we provide our users the information theyneed about the local businesses in the language theyunderstand, and therefore provide a better search experiencefor the Middle East region, which mostly speaks Arabic.

I. INTRODUCTION

Review websites, like Yelp.com, are popular becausepeople like to read reviews about restaurants they consider toeat in, or shops they consider to purchase from, etc. Searchengines, e.g., Google, do not provide content for the internetbut their goal is to organize the content and provide theinformation to its users. Therefore, search engines, as well asreview aggregation websites, would like to organize suchbusiness reviews and make them available when someonesearches for them. Search engines do not have problems withreviews in languages with rich content over the web likeEnglish, as they can whitelist the websites that contain suchreviews. The problem is in the languages with poor contentlike Arabic. We need to search for these reviews over theweb, crawl their webpages, extract and organize them, andpresent them to our users when they ask for reviews. In thispaper, we describe this latter system for Arabic.

The first component in this system is to determinewhether an internet page contains a review or not. We haveemployed an in-house developed multi-label classifier thatclassifies any English document as a review, forum, blog,news, or shopping store. We have extended this classifier towork on Arabic documents and we used the review label toidentify Arabic reviews.

Next, we need to identify the sentences in the classified-as-review page, which actually contain the review, and

determine whether they contain a sentiment or not. For everysentiment, it is important to determine its polarity: positive,negative, mixed, or neutral, and its score that indicates howstrong this sentiment is. This way, we can tell whether thewhole review is positive or negative so that we present theinformation to the user sorted by the score. The sentimentanalysis is also an in-house developed system [1] thatclassifies English sentences by polarity as positive, negative,mixed, or neutral; and by magnitude as strong or weak. Wehave extended this system to build an Arabic SentimentAnalyzer using MapReduce [2].

The final component of this system is how to provide themined reviews and sentiments to the users. We annotate thedocuments, classified as Arabic reviews, in the search indexwith the extracted sentiments, and it is up to the searchengine now to show those annotations as the snippet of thesearch result.

The remaining of the paper is organized as follows.Sections II and III describe the two above components: thereviews classifier and the sentiment analysis. In Section IV,we evaluate the performance of the system.

II. ARABIC REVIEWS CLASSIFIER

To build the Arabic reviews classifier, we need first tocollect training data from websites with Arabic languagecontent. We collected about 2,000 URLs where about 40%of them were reviews. We found the reviews by searching theweb for keywords that usually exist in user reviews like “thefood was bad”, or “the bed sheets were unclean”, etc.

Next step is to determine and extract the features that canrepresent a review. The most important features were alongthe line of the number of times some keywords appear in thedocument. We have used the translated the features thatappeared in previous classifiers for other languages and weadded our own features that include Arabic keywords thatusually appears in opinionated text. We have built around1500 features that scan every piece of information we can getabout a document. Fig. 1 shows a sample of the features filefor Arabic. It basically describes a feature that counts theoccurrences of the mentioned keywords when they have abold typeface.

After constructing the training data, we trained anAdaBoost classifier with 200 stumps to classify whether adocument is an Arabic review or not. (80% of the data wereused in training and 20% in evaluation).

2010 IEEE International Conference on Data Mining Workshops

978-0-7695-4257-7/10 $26.00 © 2010 IEEE

DOI 10.1109/ICDMW.2010.24

1108

<feature> <type>StringTesterInSpecialText</type> <name>BoldReviewsOpinion</name> <select>bold</select> <CountFeatureHelper> <select>count</select> </CountFeatureHelper> <stringtester> <type>acpattern</type> <keyword> ••• </keyword> <keyword> ••••• </keyword> <keyword> ••••• </keyword> <keyword> ••• </keyword> <keyword> •••• ••• </keyword> <keyword> ••••••• ••• </keyword> <keyword> ••••• </keyword> <keyword> •••• </keyword> <keyword> •••• </keyword> <keyword> •••• ••• </keyword> <keyword> ••••••• •••• </keyword> <keyword> ••••• </keyword> <keyword> •••• </keyword> <keyword> ••••• </keyword> <keyword> ••••• </keyword> <keyword> ••••• •••• ••• </keyword> <keyword> ••••• </keyword> <keyword> •••• •• •••• </keyword> <keyword> ••••• </keyword> <keyword> •••• </keyword> <keyword> •••• </keyword> <keyword> •••• </keyword> <keyword> ••••• •••••••• </keyword> </stringtester></feature>

Figure 1. A sample features file for Arabic.

III. ARABIC SENTIMENT ANALYSIS

The Arabic Sentiment Analysis relies on the ArabicLexicon (vocabulary) that contains positive and negativewords / phrases ranked by their score. The standard way togenerate such a Lexicon is to take a small set of manually-labeled positive and negative seed words and perform labelpropagation along the Arabic similarity graph.

A. Arabic Similarity GraphThe similarity graph is a graph that clusters all the

words / phrases in a certain language, where two words /phrases have an edge if they are similar, i.e., they have thesame polarity of sentiment, or they have the same meaning.The weight on the edge is an indication of how similar thesetwo words / phrases are. This graph can be constructed fromlexical co-occurrences in an unsupervised learning schemefrom a large web corpus that can easily be trained on anylanguage [3]. The similarity graph automatically providesextensive coverage of multi-word phrases, alternativespellings, and slang expressions.

B. Arabic LexiconWe started by a seed list of more than 600 positive words

/ phrases and more than 900 negative words / phrasescollected by manually looking at some of the reviewscollected for the reviews classifier training; and a seed list ofalmost 100 neutral words / phrases, collected from the topfrequent Arabic words / phrases used over the web.

We used those seed lists and the Arabic similarity graphto build the Arabic Lexicon. The Lexicon is a two columnfile: a phrase / word and a score. The score captures thepolarity and magnitude of the phrase / word. To build theArabic Lexicon, every word / phrase in the seed lists wouldget a high positive or negative score, or a neutral score in thesimilarity graph. Then, using label propagation, the score ofeach node in the similarity graph gets the sum of edgesweights connecting to this node in the similarity graph,multiplied by the score of the adjacent nodes. Therefore, anode with many positive neighbors will become positive anda node with many negative neighbors will become negative.A node with a nearly balanced positive and negativeneighbors will become neutral.

The scores of the words from the Arabic Lexicon werecarefully inspected. We discovered that the top 200 words /phrases with positive polarity are not supposed to be positiveat all. We traced down the positive nodes that led to suchlabeling through the similarity graph and we discovered thatthese 200 nodes shared a few garbage nodes. We discoveredthat, since there is sparse data coverage in Arabic, thesimilarity graph was built using webpages with low quality orlow page rank to reach a total of 1.8 million phrases. Hence,this provides a good coverage, or so it was thought, however,we have noticed lots of garbage words in there due to the lowquality of Arabic documents used. To fix this problem, wehave modified the pruning scheme used to throw away nodesin the similarity graph. We came up with a new filtering rulethat the words / nodes in the similarity graph with a largenumber of high weight edges should be thrown out from thegraph. When we have applied this modification to all otherlanguages, their corresponding Lexicon performance graphshave significantly improved.

Another important addition is that we decided to onlykeep the top-ranked 25 synonyms per phrase. This decisionwas based on the correlation between the synonym rank andaccuracy used in the similarity graph: the top 25 synonyms ofa positive word tend to be >=90% positive, while all of thesynonyms may only be 50-60% positive.

The label propagation mechanism produces separatepositive and negative word scores. We normalize the positiveand negative scores before adding them to compensate forthe negative skew (bias) in the scores. Finally, the scores arefiltered by eliminating the scores below some cutoff and thelog is taken. The end result of running label propagation is aLexicon with scored words / phrases which are considered tocarry positive or negative sentiments.

C. Negation Mechanism and Sentence BoundaryDetectionA negation mechanism is needed in order to switch the

polarity from negative to positive and vice versa in case thereis a negation word (around 20 words in Arabic). Thenegation logic currently assumes that the negation termprecedes the negated text, which is true for Arabic as well.

1109

120 characters) were discarded.

Using the Arabic Lexicon, the negation mechanism, andthe sentence boundary detection, the Arabic sentimentanalyzer is built. Given the scores of every word / phrase in asentence, and the flipped scores of negated words/phrases,the scores are added together. The total score of a sentencewill determine if it is positive, negative, mixed or neutral.

Fig. 2 shows a working example of the SentimentAnalysis process, using English text for clarification. Fig.

boundary detection. Fig. 2(c) shows the words / phrases thatwere identified as having positive sentiment (highlighted ingreen), and those identified as having negative sentiment(highlighted in pink). Note that although the word “quickly”was originally identified as positive, it got a negative scorebecause it was negated. Finally, Fig. 2(d) shows theaggregated polarity of the sentences. The first sentence(highlighted in yellow) is identified as mixed since itcontained one positive and one negative word. The othersentences (highlighted in green) are all identified as positive

1110

since they contain more positive words than than negativeones. Fig. 3 shows the same behavior for Arabic. Note thatthe neutral sentences are highlighted in blue.

D. Label Propagation CappingAs mentioned above, while working on the Arabic

similarity graph, which is built on rather sparse data, wediscovered the following problem. Certain words in thesimilarity graph have very high edge weights to neighbors.Nevertheless, these words and their synonym links might begarbage. This happens somewhat infrequently in a high-quality language graph such as English, but much more oftenin languages with sparse data. We realized the sameimprovement can be used in other languages includingEnglish, to suppress this type of undesirable noise.

This problem affects building any language Lexicon sincethe label propagation multiplies the propagated values byedge weights, which usually correlate with the precision.Hence, very large edge weights cause excess weight to bepropagated. Weird clusters like these get amplified by achain reaction, and produce high scoring garbage words.

To solve this problem, we introduced a cap on themaximum allowed value that could be propagated at eachnode. For Arabic, this maximum was set to 1.0 (thuspreventing any chain-reaction-like amplification ofpropagated values). For English, a maximum of 1.5 workedbetter, because the graph is more accurate and we benefitfrom a small degree of signal amplification. Using thosecaps, we were able to get a slight improvement in the Englishlexicon, yet a significant improvement for Arabic, especiallyfor positive sentences. Fig. 4 shows an example of one node"michael jordan" in the English similarity graph, where onlythe top 10 edges are shown.

Figure 4. Top 10 edges for the node “michael jordan”

IV. PERFORMANCE EVALUATION

First, we will evaluate the Arabic Reviews classifier andcompare it to the English Reviews classifier that is known towork well. Fig. 5 shows the precision-recall curves for bothArabic and English Reviews classifiers. It shows that wewere able to achieve a high-precision relatively-high-recall

Arabic Reviews classifier. As mentioned before, 80% of thedata are used for training and 20% are used for testing.

To evaluate the Arabic Lexicon and the Arabic SentimentAnalysis, we are going to use two measures. The a-measureis the mean average precision of the precision-recall curve orthe area underneath the curve. It gives a rough measure forcomparing curves that have reasonably good precision andrecall. Note that the a-measure is the mean average precisionover the entire x-axis length. If the recall is low (i.e., thecurve only covers a fraction of the horizontal space), the a-measure will be proportionally low. The f-measure [4], aweighted measure of an optimal point on the curve. Note thatwe are actually using F0.5 measure, not F1, which weighsprecision twice as much as recall. The F0.5 measure is betterfor measuring the optimal performance on a curve.

When we evaluate the Arabic Lexicon, we sort words inthe lexicon by score, and measure precision and recallconsecutively at each word, starting from the highest-scoringword and going down the list. The Arabic lexicon wasevaluated against 605 labeled random phrases: 500 manuallylabeled, and 105 auto labeled neutral phrases. Fig. 6 and 7show the evaluation for the Lexicon and the SentimentAnalysis, respectively, also compared to the known-to-be-working-well English ones. Fig. 6(a) shows that the ArabicLexicon has a relatively-high precision, similar to the Englishone, and has a low recall due to the label propagationcapping. Fig. 7(a) shows that Arabic Sentiment Analysis hasrelatively-high precision for the sentences with positive ornegative sentiments, which is more important to achieve thanthe mixed or neutral sentiments. That behavior is also similarto that of the English Sentiment Analysis

V. CONCLUSIONS

In this paper, we have demonstrated a system for miningArabic business reviews from the web. The system comprisestwo main components: a reviews classifier that classifies anywebpage whether it contains reviews or not, and a sentimentanalyzer that identifies the review text itself and identifies theindividual sentences that actually contain a sentiment(positive, negative, neutral or mixed) about the businessgetting reviewed. The system is of particular interest forlanguages that are of poor web content, e.g., Arabic; and caneasily be extended to other alike languages.

ACKNOWLEDGMENT

We would like to thank Sasha Blair-Goldensohn, KerryHannan, Leonid Velikovich, and Jeremy Shute for their helpthroughout the process of developing this system. We alsothank Ryan McDonald for his valuable comments regardingthe paper.

REFERENCES

[1] S. Blair-Goldensohn, K. Hannan, R. McDonald, T. Neylon, G. Reis,and J. Reynar, “Building a Sentiment Summarizer for Local ServiceReviews”, Proc. of the WWW Workshop on NLP Challenges in theInformation Explosion Era, NLPIX, 2008.

1111

[2] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processingon Large Clusters”, Proc. of the 6th Symposium on Operating SystemDesign and Implementation, OSDI 2004, pp. 137-150.

[3] L. Velikovich, S. Blair-Goldensohn, K. Hannan, and R. McDonald,“The Viability of Web-derived Polarity Lexicons”, Proc. of the North

American Chapter of the Association for Computational Linguistics,2010.

[4] http://en.wikipedia.org/wiki/Information_retrieval#F-measure

1112

Figure 5. Performance of Reviews Classifiers: (a) Arabic (b) English

Figure 6. Performance of Lexicons: (a) Arabic (b) English

Figure 7. Performance of Sentiment Analysis: (a) Arabic (b) English

1113

mining arabic business reviews · 2018-04-06 · mining arabic business reviews mohamed elhawary...

Documents