review spam detection via temporal pattern discovery

Review Spam Detection via Temporal Pattern Discovery. Sihong Xie, Guan Wang, Shuyang Lin, Philip S. Yu. Department of Computer Science, University of Illinois at Chicago.


Post on 17-Jan-2016



TRANSCRIPT

Page 1: Review Spam Detection via Temporal Pattern Discovery

Review Spam Detection via Temporal Pattern Discovery

Sihong Xie, Guan Wang, Shuyang Lin, Philip S. Yu

Department of Computer Science

University of Illinois at Chicago

Page 2: Review Spam Detection via Temporal Pattern Discovery

What is review spam?

• Created on review websites in order to create positive impressions for bad products/stores and make a profit out of misled customers

• It is harmful: it leads to poor customer experience and ruins the reputation of good stores

• Guidelines exist that help humans spot fake reviews, but the task is hard for machines [1]

Give some examples of spam reviews. Also give brief descriptions.

http://consumerist.com/2010/04/how-you-spot-fake-online-reviews.html

Page 3: Review Spam Detection via Temporal Pattern Discovery

Human friendly clues of spams

• Language features [5]

1. All praises

2. Say nothing about the product

3. Red flag words

4. Mention the name a lot

Too hard for machines: involves natural language processing

Page 4: Review Spam Detection via Temporal Pattern Discovery

Machine friendly clues of spams

• Similar reviews (texts and ratings) on one product / product group in a short time [1,2]

Page 5: Review Spam Detection via Temporal Pattern Discovery

Machine friendly clues of spams

• Group of spammers: wrote reviews together frequently on the same set of products / stores [3]

Reviewer 1 Reviewer 2 Reviewer 3
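The group clue above can be illustrated with a small sketch that counts how many stores each pair of reviewer ids has in common. This is a simplification for illustration only, not the method of [3]; the function names, the `(reviewer_id, store_id)` input format, and the threshold `min_shared_stores` are assumptions.

```python
from collections import Counter
from itertools import combinations


def co_review_counts(reviews):
    """Count, for every pair of reviewer ids, how many stores both reviewed.

    `reviews` is an iterable of (reviewer_id, store_id) pairs (assumed format).
    """
    reviewers_by_store = {}
    for reviewer_id, store_id in reviews:
        reviewers_by_store.setdefault(store_id, set()).add(reviewer_id)

    pair_counts = Counter()
    for reviewers in reviewers_by_store.values():
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(reviewers), 2):
            pair_counts[pair] += 1
    return pair_counts


def frequent_pairs(reviews, min_shared_stores=2):
    """Keep only pairs that reviewed at least `min_shared_stores` stores together."""
    counts = co_review_counts(reviews)
    return {pair: n for pair, n in counts.items() if n >= min_shared_stores}


# Hypothetical toy data: r1, r2, r3 always appear together; r4 appears once.
toy_reviews = [
    ("r1", "s1"), ("r2", "s1"), ("r3", "s1"), ("r4", "s1"),
    ("r1", "s2"), ("r2", "s2"), ("r3", "s2"),
]
suspicious_pairs = frequent_pairs(toy_reviews)
```

A reviewer id that wrote a single review can never reach the threshold, which is why this clue fails on singleton reviews.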

Page 6: Review Spam Detection via Temporal Pattern Discovery

How to play the spamming game

Player 1: detection systems. Player 2: spammers.

• Duplicated reviews: two reviews are almost the same

Player 1: easy, shingling

Player 2: more sophisticated writings

• Group spamming: a group of reviewers frequently write reviews together

Player 1: detection is easy [3]

Player 2: use different reviewer ids

• Other kinds: targeting the same product, similar texts/ratings by one id

Player 1: detection based on statistics of the reviewer

Player 2: use different reviewer ids
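As a rough illustration of why near-duplicate reviews are easy to catch, the sketch below compares two texts by word shingles and Jaccard similarity. The shingle size `k`, the `threshold`, and the example texts are assumptions chosen for the example, not values from the slides.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, whitespace split)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(text_a, text_b, k=3, threshold=0.5):
    """Flag two reviews as near-duplicates when their shingle sets overlap enough."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold


# Hypothetical review texts that differ only in the last word.
review_a = "great store fast shipping and great customer service"
review_b = "great store fast shipping and great customer support"
review_c = "the package arrived late and the box was damaged"
```

Here `review_a` and `review_b` share five of seven distinct shingles, so they are flagged, while `review_c` shares none. More sophisticated rewriting lowers the overlap, which is the spammers' counter-move on the slide.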

Page 7: Review Spam Detection via Temporal Pattern Discovery

Failed machine friendly clues of spams

If these reviews were posted by the same id, it would have been easy to detect

Page 8: Review Spam Detection via Temporal Pattern Discovery

Failed machine friendly clues of spams

The same id wrote multiple reviews, which makes it easy to detect. Smart spammers would avoid this.

Reviewer 1 Reviewer 2 Reviewer 3

Page 9: Review Spam Detection via Temporal Pattern Discovery

Spammers like singleton spam

• Strong motivations to have singleton spams:

1. Need to boost the rating in a short time

2. Need to avoid being caught

• Post reviews with high rating under different names in a short time

Page 10: Review Spam Detection via Temporal Pattern Discovery

Singleton reviews

Each reviewer id contributes only one review, for one store only

A physical person can register many reviewer ids

[Figure: reviewer ids linked to stores by singleton and non-singleton reviews; legend: + spammer, 0 normal reviewer; a spammer registers several reviewer ids]
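The definition above translates directly into a count of reviews per reviewer id. The following is a minimal sketch; the `(reviewer_id, store_id)` input format and the function names are assumptions.

```python
from collections import Counter


def singleton_flags(reviews):
    """For each review, True if its reviewer id wrote exactly one review overall.

    `reviews` is a list of (reviewer_id, store_id) pairs (assumed format).
    """
    counts = Counter(reviewer_id for reviewer_id, _store in reviews)
    return [counts[reviewer_id] == 1 for reviewer_id, _store in reviews]


def singleton_ratio(reviews):
    """Fraction of reviews that are singleton reviews (0.0 for an empty list)."""
    flags = singleton_flags(reviews)
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical toy data: ids "a" and "b" are singletons, "c" wrote two reviews.
toy_reviews = [("a", "s1"), ("b", "s1"), ("c", "s1"), ("c", "s2")]
```

Note that the count is per reviewer id, not per physical person, so one person with many registered ids produces many singleton reviews.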

Page 11: Review Spam Detection via Temporal Pattern Discovery

Facts of Singleton reviews

• Constitute a large portion of all the reviews

• Over 90% of the reviewers in this paper's dataset wrote only one review, so 76% of all reviews are singleton reviews; similar situations in another dataset [4]

• More influential, more harmful

Page 12: Review Spam Detection via Temporal Pattern Discovery

The challenges

Traditional clues and their shortcomings:

• Review features (bag of words, ratings, brand name references) [4]: hard for humans, not to mention machines

• Reviewer features (rating behaviors) [1]: poor if one wrote only one review

• Product/store features [4]: tell little about individual reviews

• Review/reviewer/store reinforcements: fail on a large number of spam reviews with consistent ratings

• Group spamming [2,3]: not applicable to singleton reviews

• Singleton review detection [7]*: finds suspicious hotels, but can't find individual singleton spam

* [7] is a supervised method, and our conclusions contrast with theirs

Page 13: Review Spam Detection via Temporal Pattern Discovery

The proposed method

• Recall the motivations of singleton reviews: boost the ratings in a short time and avoid being caught

• The results: in a short time, many reviewers wrote only one review with a very high rating

• The correlation between rating and volume of (singleton) reviews is the key feature of singleton review spamming
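To make the idea of correlated rating and volume concrete, the sketch below computes a Pearson correlation between two per-window series. The slides do not say which correlation measure the authors use, so Pearson is an assumption here, and the numbers are invented toy values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical per-window values for one store: the fourth window has both
# a rating jump and a jump in the number of singleton reviews.
avg_rating = [3.0, 3.1, 2.9, 4.7, 3.0]
singleton_volume = [2, 3, 2, 20, 3]
correlation = pearson(avg_rating, singleton_volume)
```

In this toy series the two dimensions move together, so the correlation is close to 1; a store whose rating and singleton volume fluctuate independently would score near 0.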

Page 14: Review Spam Detection via Temporal Pattern Discovery

Detected burst of singleton spams

[Figure: three time series for one store (average rating, number of reviews, ratio of singleton reviews), with a suspicious time window marked]

Page 15: Review Spam Detection via Temporal Pattern Discovery

The algorithm

1. For each store do

A. split the whole period into small time windows

B. compute the average rating, total number of reviews, and percentage of singleton reviews in each window

C. form a three-dimensional time series

D. detect windows with correlated burst patterns

E. for each detected window, repeat steps A.-D. with smaller windows until the window size becomes too small
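The steps above can be sketched for a single store as follows. This is an illustration under stated assumptions, not the authors' implementation: the slides do not specify the burst detector, so a simple z-score threshold `z` stands in for it, and the input format `(day, rating, is_singleton)`, the `shrink` factor, and `min_window` are all hypothetical.

```python
from statistics import mean, pstdev


def build_series(reviews, start, end, window):
    """Steps A-C: one (avg_rating, volume, singleton_ratio) tuple per window."""
    n_windows = -(-(end - start) // window)  # ceiling division
    buckets = [[] for _ in range(n_windows)]
    for day, rating, is_singleton in reviews:
        if start <= day < end:
            buckets[(day - start) // window].append((rating, is_singleton))
    rated = [r for b in buckets for r, _ in b]
    fallback = mean(rated) if rated else 0.0  # rating used for empty windows
    series = []
    for b in buckets:
        avg = mean(r for r, _ in b) if b else fallback
        ratio = sum(s for _, s in b) / len(b) if b else 0.0
        series.append((avg, len(b), ratio))
    return series


def burst_indices(values, z=1.0):
    """Indices whose value lies more than z standard deviations above the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()
    return {i for i, v in enumerate(values) if (v - mu) / sigma > z}


def correlated_bursts(series, z=1.0):
    """Step D: windows that burst in all three dimensions at once."""
    if not series:
        return []
    ratings, volumes, ratios = zip(*series)
    hits = burst_indices(ratings, z) & burst_indices(volumes, z) & burst_indices(ratios, z)
    return sorted(hits)


def detect(reviews, start, end, window, min_window=7, shrink=3, z=1.0):
    """Step E: re-run on each detected window with a smaller window size."""
    found = []
    for i in correlated_bursts(build_series(reviews, start, end, window), z):
        w_start = start + i * window
        w_end = min(w_start + window, end)
        smaller = window // shrink
        finer = []
        if smaller >= min_window:
            finer = detect(reviews, w_start, w_end, smaller, min_window, shrink, z)
        found.extend(finer or [(w_start, w_end)])
    return found


# Hypothetical store: two ordinary reviews per 30-day window over 180 days,
# plus ten 5-star singleton reviews posted on days 90-99.
toy = []
for w in range(6):
    toy += [(w * 30 + 3, 3, True), (w * 30 + 20, 3, False)]
toy += [(90 + d, 5, True) for d in range(10)]
series = build_series(toy, 0, 180, 30)
```

With these toy values only the fourth window stands out in rating, volume, and singleton ratio simultaneously, so `correlated_bursts(series)` returns `[3]`. Passing `min_window=30` disables the recursive refinement and yields the coarse window `(90, 120)`.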

Page 16: Review Spam Detection via Temporal Pattern Discovery

The algorithm (example)

Reviews are sorted by posting time and divided into groups, which gives a multi-dimensional time series (SR = singleton review):

• Group 1: average rating: 2, review volume: 3, SR volume: 1/3

• Group 2: average rating: 4.6, review volume: 5, SR volume: 5/5

• Group 3: average rating: 2, review volume: 3, SR volume: 3/3

The correlated burst is the group in which the dimensions rise together (here, the middle group).

Page 17: Review Spam Detection via Temporal Pattern Discovery

Dataset

• A snapshot of a review website*

• 408,469 reviews, 343,629 reviewers

• 310,499 reviewers (> 90%) wrote only one review

• 76% of the reviews are singleton reviews

• Focus on top 53 stores with over 1,000 reviews

* www.resellerratings.com
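The two percentages on this slide follow from the three counts. A quick check, using only the numbers stated above (the variable names are illustrative):

```python
# Counts taken from the slide.
total_reviews = 408_469
total_reviewers = 343_629
single_review_reviewers = 310_499

# Share of reviewers who wrote exactly one review.
share_of_reviewers = single_review_reviewers / total_reviewers

# Each such reviewer contributes exactly one review, so the number of
# singleton reviews equals the number of single-review reviewers.
share_of_reviews = single_review_reviewers / total_reviews

print(f"{share_of_reviewers:.1%} of reviewers, {share_of_reviews:.1%} of reviews")
```

The 90% figure is therefore a share of reviewers, while the 76% figure is a share of reviews.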

[Figure: plot with axes labeled # reviewers and # reviews]

Page 18: Review Spam Detection via Temporal Pattern Discovery

Experimental results

• 29 stores are regarded as suspicious by at least 2 out of 3 human evaluators.

• The proposed algorithm labeled 39 stores as suspicious (recall = 75.86%, precision = 61.11%).
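Precision and recall can be recomputed from counts. The slide does not state how many flagged stores the evaluators agreed with; the value 22 below is an inference from the reported recall (22 / 29 = 75.86%), not a number given in the source.

```python
def precision_recall(true_positives, predicted_positives, actual_positives):
    """Return (precision, recall) from raw counts."""
    precision = true_positives / predicted_positives
    recall = true_positives / actual_positives
    return precision, recall


human_suspicious = 29      # from the slide
inferred_hits = 22         # assumption: inferred from recall = 75.86%

# Using the 39 flagged stores stated on the slide.
precision_39, recall = precision_recall(inferred_hits, 39, human_suspicious)

# For comparison: the number of flagged stores that would match 61.11%.
precision_36, _ = precision_recall(inferred_hits, 36, human_suspicious)
```

Under this inference, 22 correct stores out of 39 would give a precision of about 56.41%, whereas the reported 61.11% corresponds to 22 out of 36, so one of the figures on the slide may contain a typo.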

Page 19: Review Spam Detection via Temporal Pattern Discovery

Case studies

Time window size = 30 days: correlated bursts detected

Period with the detected correlated burst enlarged, time window size = 15 days

Annotated bursts:

• Volume of reviews: 57, ratio of SRs: 61%, rating: 4.56

• Volume of reviews: 154, ratio of SRs: 83%, rating: 4.79

The smaller windows pin-point the exact time and shape of the bursts

Page 20: Review Spam Detection via Temporal Pattern Discovery

Case studies (cont.)

• Text features: ratio of reviews talking about “customer service/support”

• Hurry reviewers: wrote only one review, at the same time as the ID registration

• Human validation: read the reviews and found that a reviewer disclosed being solicited for a 5 star review

Most of the later reviews are written by “Hurry Reviewers”

More than 80% of the singleton reviews are related to “customer service”
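The two validation signals above are simple to compute once registration dates and review texts are available. The sketch below is a hedged illustration: the field layout, the `max_gap_days` tolerance, and the keyword list are assumptions, and the example texts are invented.

```python
from datetime import date


def is_hurry_reviewer(registered_on, review_dates, max_gap_days=0):
    """True if the id wrote exactly one review, within max_gap_days of registering."""
    if len(review_dates) != 1:
        return False
    gap = (review_dates[0] - registered_on).days
    return 0 <= gap <= max_gap_days


def keyword_ratio(texts, keywords=("customer service", "customer support")):
    """Fraction of review texts that mention any of the keywords."""
    if not texts:
        return 0.0
    hits = sum(any(k in t.lower() for k in keywords) for t in texts)
    return hits / len(texts)


# Hypothetical examples.
same_day = is_hurry_reviewer(date(2010, 5, 1), [date(2010, 5, 1)])
veteran = is_hurry_reviewer(date(2009, 1, 1), [date(2010, 5, 1), date(2010, 6, 1)])
ratio = keyword_ratio([
    "Great customer service, five stars",
    "Customer support solved everything",
    "The laptop arrived with a cracked screen",
    "Best customer service ever",
])
```

In this toy sample three of four texts mention customer service or support, giving a ratio of 0.75.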

Page 21: Review Spam Detection via Temporal Pattern Discovery

References

1. Detecting Product Review Spammers using Rating Behaviors

2. Finding Unusual Review Patterns Using Unexpected Rules

3. Spotting Fake Reviewer Groups in Consumer Reviews

4. Opinion spam and analysis

5. Finding Deceptive Opinion Spam by Any Stretch of the Imagination

6. Review Graph based Online Store Review Spammer Detection

7. Merging Multiple Criteria to Identify Suspicious Reviews

Page 22: Review Spam Detection via Temporal Pattern Discovery

the end

Page 23: Review Spam Detection via Temporal Pattern Discovery

Examples

• All praises

• Posted in a short time

• Say nothing about the product

• Similar ratings

• Red flag words

• Mention the name a lot

Need to set up the animations for all these elements

Page 24: Review Spam Detection via Temporal Pattern Discovery

Types of review spams

• Duplicate (easy, string matching)

• Advertisements

• Other easy-to-detect spams (all symbols, numbers, empty, etc.)

• Untruthful (very hard, need machines to understand the intentions of the reviews)
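The easy categories in this list can be caught with a few lines of string handling. A minimal sketch, assuming reviews arrive as plain strings; the function names and the normalization rule are illustrative choices.

```python
def is_trivial_spam(text):
    """True for empty reviews or reviews with no letters (only symbols/numbers)."""
    stripped = text.strip()
    if not stripped:
        return True
    return not any(ch.isalpha() for ch in stripped)


def normalize(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def exact_duplicates(texts):
    """Indices of reviews whose normalized text already appeared earlier."""
    seen = set()
    duplicates = []
    for i, text in enumerate(texts):
        key = normalize(text)
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates
```

For example, `is_trivial_spam("!!! ***")` and `is_trivial_spam("12345")` are both true, and `exact_duplicates(["Great  store!", "great store!", "Bad"])` returns `[1]`. Untruthful reviews pass all of these checks, which is what makes them the hard case.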

Page 25: Review Spam Detection via Temporal Pattern Discovery

Feature based methods

•Paradigms:

•Define features, training set and pick a classifier

•Keys to success: good features, large training data, and powerful classifier
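The paradigm above (define features, build a training set, pick a classifier) can be shown end to end with a toy example. Everything here is an assumption made for illustration: the three features, the four hand-written training reviews, and the nearest-centroid classifier are not taken from any of the cited papers.

```python
def extract_features(review):
    """Map a review dict {"rating", "text"} to a small numeric feature vector."""
    text = review["text"]
    words = text.split()
    return [
        review["rating"] / 5.0,               # normalized rating
        min(len(words), 50) / 50.0,           # capped, normalized length
        text.count("!") / max(len(text), 1),  # share of exclamation marks
    ]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train(labeled_reviews):
    """Nearest-centroid 'training': one mean feature vector per label."""
    by_label = {}
    for review, label in labeled_reviews:
        by_label.setdefault(label, []).append(extract_features(review))
    return {label: centroid(vs) for label, vs in by_label.items()}


def predict(model, review):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    x = extract_features(review)
    return min(model, key=lambda lbl: sum((a - b) ** 2 for a, b in zip(x, model[lbl])))


# Hypothetical, hand-labeled training set.
training_set = [
    ({"rating": 5, "text": "Best store ever!!!"}, "spam"),
    ({"rating": 5, "text": "Amazing!!! Buy here!!!"}, "spam"),
    ({"rating": 3, "text": "Shipping took nine days but the laptop arrived "
                           "intact and support answered my email"}, "genuine"),
    ({"rating": 4, "text": "Fair prices, the return process was slow but it "
                           "worked out in the end"}, "genuine"),
]
model = train(training_set)
```

A real system would need far richer features and far more labeled data, which is exactly the difficulty the slide points at: labeled spam is scarce, and for singleton reviews many reviewer-level features carry no information.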

Page 26: Review Spam Detection via Temporal Pattern Discovery

Common but not fully investigated

• Previous methods cannot catch them

•Each reviewer id has only one review

•Many features used by previous methods simply become meaningless

Page 27: Review Spam Detection via Temporal Pattern Discovery

Traditional methods (cont.)

• [6] uses a graph to describe reinforcement relationships between entities: good / bad reviews influence their authors, who in turn influence the stores, which in turn influence their reviews.

[Figure: reinforcement graph linking reviews, reviewers, and stores]

Page 28: Review Spam Detection via Temporal Pattern Discovery

Traditional methods (cont.)

[Figure: reinforcement graph linking reviews, reviewers, and stores]

If these reviews are posted in a short period with consistent ratings, the store will be regarded as a good one

Page 29: Review Spam Detection via Temporal Pattern Discovery

Traditional methods

• Review features: rating: 4, bag-of-words: {switch, nook, kindle, why, what, learn}

• Reviewer features: name: KKX; number of reviews: 1; average rating: 4

• Store/product features: Kindle average rating: 4/5 stars, price: $199

• Group spamming: KKX wrote one review only, failed the frequency test

What can you conclude from these features?