Exploring Linkability of User Reviews

Mishari Almishari and Gene Tsudik

University of California, Irvine

Roadmap

1. Introduction
2. Data Set & Problem Settings
3. Linkability Results & Improvements
4. Discussion
5. Future Work & Conclusion

Motivation

Increasing Popularity of Reviewing Sites

Yelp, more than 39M visitors and 15M reviews in 2010

Example

Figure: a sample review, annotated with its category and rating

Motivation

Rising awareness of privacy

Motivation

How is it applied?

Traceability/Linkability

Linkability of Ad hoc Reviews

Linkability of Several Accounts

Goal

Assess the linkability in user reviews

Data Set

• 1 million reviews
• 2,000 users
• Each user has more than 300 reviews

Problem Settings

IR: Identified Record

AR: Anonymous Record

Diagram: a set of IRs and a set of ARs

Problem Formulation

Anonymous Record (AR)

Identified Records (IR’s)

Matching Model

Top-X Linkability

X: 1 and 10

Anonymous record sizes: 1, 5, 10, 20, …, 60
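The Top-X measure above can be sketched in a few lines. This is a minimal illustration that assumes the matching model returns, for each anonymous record, a list of candidate users ordered from best to worst match; the function name `top_x_linkability` and the toy data are hypothetical and do not come from the slides.

```python
def top_x_linkability(ranked_lists, true_authors, x):
    """Fraction of anonymous records whose true author appears among
    the first x entries of that record's ranked candidate list."""
    hits = sum(
        1
        for ranking, author in zip(ranked_lists, true_authors)
        if author in ranking[:x]
    )
    return hits / len(ranked_lists)


# Toy data (hypothetical): three anonymous records, each with a ranked
# list of candidate users, plus the user who really wrote each record.
rankings = [["u1", "u2", "u3"], ["u3", "u2", "u1"], ["u2", "u3", "u1"]]
truth = ["u1", "u2", "u1"]

top1 = top_x_linkability(rankings, truth, 1)  # only the first record is a hit
top2 = top_x_linkability(rankings, truth, 2)  # the first two records are hits
```

With X = 1 the measure counts only exact first-place matches; larger X also credits records whose true author is merely near the top of the list.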

Problem Settings

Methodologies

(1) Naïve Bayesian Model: decreasing sorted list of IRs

(2) Kullback-Leibler Divergence (KLD): increasing sorted list of IRs

Maximum-Likelihood Estimation
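A rough sketch of the two rankings over letter distributions follows. It assumes each record is reduced to one text string, and it adds a small additive smoothing constant `alpha` to the IR estimates so that unseen letters do not give zero probability; that constant, and all function names, are assumptions for illustration rather than details taken from the slides.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def letter_distribution(text, alpha=0.0):
    """Relative letter frequencies; a maximum-likelihood estimate when alpha is 0."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values()) + alpha * len(ALPHABET)
    if total == 0:
        raise ValueError("text contains no letters")
    return {ch: (counts[ch] + alpha) / total for ch in ALPHABET}


def nb_log_likelihood(ar_text, ir_dist):
    """Naive Bayes score: log-probability of the AR's letters under an IR's distribution."""
    return sum(math.log(ir_dist[ch]) for ch in ar_text.lower() if ch in ALPHABET)


def kld(ar_dist, ir_dist):
    """Kullback-Leibler divergence D(AR || IR); zero-probability AR terms add nothing."""
    return sum(p * math.log(p / ir_dist[ch]) for ch, p in ar_dist.items() if p > 0)


def rank_nb(ar_text, irs, alpha=0.5):
    """IRs sorted by decreasing Naive Bayes score (assumed smoothing: alpha)."""
    dists = {user: letter_distribution(text, alpha) for user, text in irs.items()}
    return sorted(dists, key=lambda u: nb_log_likelihood(ar_text, dists[u]), reverse=True)


def rank_kld(ar_text, irs, alpha=0.5):
    """IRs sorted by increasing divergence from the AR (assumed smoothing: alpha)."""
    ar_dist = letter_distribution(ar_text)
    dists = {user: letter_distribution(text, alpha) for user, text in irs.items()}
    return sorted(dists, key=lambda u: kld(ar_dist, dists[u]))


# Toy identified records (hypothetical text, not real reviews).
irs = {"alice": "zzz zebra pizza jazz", "bob": "eee tree see bee"}
nb_ranking = rank_nb("fuzzy pizza", irs)
kld_ranking = rank_kld("fuzzy pizza", irs)
```

The Naive Bayes list is sorted in decreasing order because a higher likelihood is a better match, while the KLD list is sorted in increasing order because a smaller divergence is a better match.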

Tokens

• Unigram
    • “privacy”: “p”, “r”, “i”, “v”, “a”, “c”, “y”
    • 26 values

• Digram
    • “privacy”: “pr”, “ri”, “iv”, “va”, “ac”, “cy”
    • 676 values

• Rating
    • 5 values

• Category
    • 28 values
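The lexical tokens listed above can be extracted as follows. This is a small sketch that assumes text is lowercased and that only the 26 ASCII letters count; the helper names are hypothetical.

```python
import string

LETTERS = string.ascii_lowercase  # 26 possible unigram values


def unigrams(word):
    """Single letters of the word, in order."""
    return [ch for ch in word.lower() if ch in LETTERS]


def digrams(word):
    """Pairs of adjacent letters; 26 * 26 = 676 possible values."""
    w = word.lower()
    return [
        w[i:i + 2]
        for i in range(len(w) - 1)
        if w[i] in LETTERS and w[i + 1] in LETTERS
    ]


print(unigrams("privacy"))  # ['p', 'r', 'i', 'v', 'a', 'c', 'y']
print(digrams("privacy"))   # ['pr', 'ri', 'iv', 'va', 'ac', 'cy']
print(len(LETTERS), len(LETTERS) ** 2)  # 26 676
```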

Unigram Results

Plot: NB model, unigram tokens (x-axis: Anonymous Record Size; y-axis: Linkability Ratio)

Size 60: LR 83% (Top-1), LR 96% (Top-10)

Digram Results

Plot: NB model, digram tokens (y-axis: Linkability Ratio)

Size 20: LR 97% (Top-1)

Size 10: LR 88% (Top-1)

Improvement (1): Combining Lexical and Non-Lexical Tokens

Plot: NB model (y-axis: Linkability Ratio)

Gain of up to 20%

Size 60: 83% to 96%

Size 30: 60% to 80%
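One way to combine token types in a Naive Bayes model is to treat them as independent and add their log-likelihoods. The sketch below assumes that reading; the smoothing constant `alpha`, the function names, and the toy records are illustrative assumptions, and the slides do not confirm the exact combination rule.

```python
import math
from collections import Counter


def smoothed_dist(tokens, vocabulary, alpha=0.5):
    """Additively smoothed token distribution (alpha is an assumed constant).
    Every token is assumed to belong to the vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}


def combined_nb_score(ar_features, ir_features, vocabularies):
    """Sum of per-token-type log-likelihoods, under the naive independence assumption."""
    score = 0.0
    for name, vocab in vocabularies.items():
        dist = smoothed_dist(ir_features[name], vocab)
        score += sum(math.log(dist[t]) for t in ar_features[name])
    return score


# Toy example (hypothetical): both identified records use the same letters,
# so only the rating tokens can tell them apart.
vocabularies = {"letter": ["a", "b", "c"], "rating": [1, 2, 3, 4, 5]}
ir_a = {"letter": list("aaab"), "rating": [5, 5, 4]}
ir_b = {"letter": list("aaab"), "rating": [1, 1, 2]}
ar = {"letter": list("aab"), "rating": [5, 4]}

score_a = combined_nb_score(ar, ir_a, vocabularies)
score_b = combined_nb_score(ar, ir_b, vocabularies)
```

In the toy data the letter evidence is identical for both candidates, so the higher score for `ir_a` comes entirely from the rating tokens.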

What about Restricting Identified Record (IR) Size?

Plots: NB Model and KLD Model (x-axis: Anonymous Record Size; y-axis: Linkability Ratio)

Affected by IR size

Performed better for smaller IR

Size 20 or less, improved


Improvement (2): Matching All IR’s At Once
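The slides give only the title of this improvement. One plausible reading, assuming every anonymous record in a batch was written by a different user, is to assign all records jointly so that no identified record is used twice. The greedy procedure below is an illustration of that idea under that assumption, not the authors' algorithm; `greedy_match_all` and the score table are hypothetical.

```python
def greedy_match_all(scores):
    """scores[ar][ir] is a match score where higher is better.
    Returns {ar: ir}, using each IR at most once (greedy, best pair first)."""
    pairs = sorted(
        ((s, ar, ir) for ar, row in scores.items() for ir, s in row.items()),
        key=lambda t: t[0],
        reverse=True,
    )
    assignment, used = {}, set()
    for _, ar, ir in pairs:
        if ar not in assignment and ir not in used:
            assignment[ar] = ir
            used.add(ir)
    return assignment


# Toy scores (hypothetical log-likelihoods): matched independently, both
# anonymous records would pick u1; matched jointly, ar2 falls back to u2.
scores = {
    "ar1": {"u1": -10.0, "u2": -12.0},
    "ar2": {"u1": -11.0, "u2": -15.0},
}
matching = greedy_match_all(scores)
```

A greedy pass is simple but not guaranteed to maximize the total score; an optimal assignment method could be substituted if that matters.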

Matching All Results

Plots: Restricted IR and Full IR (y-axis: Linkability Ratio)

Gain of up to 16%: Size 30, from 74% to 90%

Gain of up to 23%: Size 20, from 35% to 55%

Improvement (3): For Small IR Size

Changing it to: 0.5 + Review Length

Plot (y-axis: Linkability Ratio)

Size 10: 89% to 92%

Size 7: 79% to 84%

Gain of up to 5%

Discussion

o Unigram and Scalability
    o 26 vs. 676
    o 59 vs. 676
    o Less than 10%

o Prolific Users
    o In the long run, users will be prolific

o Anonymous Record Size
    o A set of 60 reviews, less than 20% of the minimum contribution

o Detecting Spam Reviews

Future Work

o Improving more for Small AR’s
o Other Probabilistic Models
o Using Stylometry
o Review Anonymization
o Exploring Linkability in other Preference Databases

Conclusion

o Extensive Study to Assess Linkability of User Reviews
    o For a large set of users
    o Using very simple features

o Users are very exposed even with simple features and a large number of authors

Takeaway Point: Reviews can be accurately de-anonymized using alphabetical letter distributions

Questions?
