Collective Spammer Detection in Evolving Multi-Relational Social Networks (to appear at KDD’15)
Shobeir Fakhraei*, University of Maryland
*Research performed while on internship at if(we) (Tagged Inc.) through the Graphlab program
In collaboration with: James Foulds (UCSC, currently UCSD), Madhusudana Shashanka (if(we) Inc., currently Niara Inc.), Lise Getoor (UCSC)
Spam in Social Networks
• Recent study by Nexgate in 2013:
  • Spam grew by more than 300% in half a year
  • 1 in 200 social messages is spam
  • 5% of all social apps are spammy
• Social networks:
  • Spammers have more ways to interact with users, e.g., commenting on photos, sending messages, and sending non-verbal signals
  • They can split content across several messages and forms
  • More information about users is available on their profiles
Our Approach
• Let's look at the graph:
  • Multi-relational social network
  • Links = actions at time t (e.g., profile view, message, or poke)
  • Time-stamped links
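One way to picture this data model is a list of time-stamped, typed links that can be split into one graph per relation. The sketch below is only an illustration: the `Link` type, its field names, and the sample values are assumptions, not the format used in the experiments.

```python
from collections import defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    """One action from src to dst (hypothetical record layout)."""
    src: int
    dst: int
    relation: str   # e.g. "profile_view", "message", "poke"
    timestamp: int  # any sortable time value


def split_by_relation(links):
    """Group links into one time-ordered edge list per relation."""
    graphs = defaultdict(list)
    for link in sorted(links, key=lambda l: l.timestamp):
        graphs[link.relation].append((link.src, link.dst, link.timestamp))
    return dict(graphs)


links = [
    Link(1, 2, "message", 30),
    Link(1, 3, "profile_view", 10),
    Link(1, 2, "profile_view", 20),
]
graphs = split_by_relation(links)
```

Each entry of `graphs` is then a single-relation graph on which structural features can be computed, while the time order is kept for sequence features.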
Tagged.com
• Tagged.com is a social network founded in 2004 that focuses on connecting people through different social features and games
• It has over 300 million registered members
• Data sample for experiments (on a laptop):
  • 5.6 million users (3.9% labeled spammers)
  • 912 million links
Graph Structure
• Using Graphlab Create:
  • Graph Analytics Toolkit to generate features
  • In each relation graph we compute: PageRank, Total Degree, In Degree, Out Degree, k-Core, Graph Coloring, Connected Components, Triangle Count
• Gradient Boosted Trees:
  • Classification (spam or legitimate)
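The idea of computing the same features in every relation graph and concatenating them per user can be sketched with the three degree features alone. This is a plain-Python illustration, not the Graphlab Create toolkit used in the talk; the function name, the relation names, and the toy edges are assumptions.

```python
from collections import defaultdict


def degree_features(edges_by_relation, users):
    """Return feature names and, per user, [in, out, total] degree for each relation."""
    relations = sorted(edges_by_relation)
    features = {u: [] for u in users}
    for rel in relations:
        in_deg = defaultdict(int)
        out_deg = defaultdict(int)
        for src, dst in edges_by_relation[rel]:
            out_deg[src] += 1
            in_deg[dst] += 1
        for u in users:
            features[u].extend([in_deg[u], out_deg[u], in_deg[u] + out_deg[u]])
    names = [f"{rel}_{kind}" for rel in relations for kind in ("in", "out", "total")]
    return names, features


# Toy data: user 1 messages three users and is viewed by two of them.
edges_by_relation = {
    "message": [(1, 2), (1, 3), (1, 4)],
    "profile_view": [(2, 1), (3, 1)],
}
names, features = degree_features(edges_by_relation, users=[1, 2, 3, 4])
```

The resulting rows (one per user, one column per relation and feature type) are the kind of table a gradient boosted tree classifier could be trained on.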
Graph Structure
Experiments                     AU-PR            AU-ROC
1 Relation, k Feature Types     0.187 ± 0.004    0.803 ± 0.001
n Relations, 1 Feature Type     0.285 ± 0.002    0.809 ± 0.001
n Relations, k Feature Types    0.328 ± 0.003    0.817 ± 0.001
Sequence of Actions
• Sequential k-gram features:
  • Short sequence segments of k consecutive actions, to capture the order of events
  • User1 actions: Message, Profile_view, Message, Friend_Request, ...
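Counting k-grams over a user's action sequence can be sketched in a few lines. The function name `kgram_counts` is an assumption, and the sequence is the visible prefix of the User1 example above.

```python
from collections import Counter


def kgram_counts(actions, k=2):
    """Count every window of k consecutive actions in the sequence."""
    return Counter(tuple(actions[i:i + k]) for i in range(len(actions) - k + 1))


actions = ["Message", "Profile_view", "Message", "Friend_Request"]
bigrams = kgram_counts(actions, k=2)
# bigrams holds one count each for:
#   ("Message", "Profile_view"), ("Profile_view", "Message"), ("Message", "Friend_Request")
```

Each distinct k-gram becomes one feature column, with the count (or its presence) as the value for that user.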
Sequence of Actions
• Mixture of Markov Models (MMM):
  • Also called a chain-augmented or tree-augmented naive Bayes model; used to capture longer sequences
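A simplified reading of this model is one first-order Markov chain per class, with a sequence assigned to the class under which it is most probable. The sketch below makes that assumption; the class name, the Laplace smoothing parameter `alpha`, and the toy sequences are illustrative and not taken from the paper.

```python
import math
from collections import defaultdict


class MarkovChainClassifier:
    """One smoothed first-order Markov chain per class (illustrative sketch)."""

    def __init__(self, actions, alpha=1.0):
        self.actions = list(actions)
        self.alpha = alpha  # Laplace smoothing, an assumed choice
        self.prior, self.start, self.trans = {}, {}, {}

    def fit(self, sequences, labels):
        n = len(self.actions)
        for label in set(labels):
            seqs = [s for s, y in zip(sequences, labels) if y == label]
            start = defaultdict(float)
            trans = defaultdict(lambda: defaultdict(float))
            for s in seqs:  # sequences are assumed non-empty
                start[s[0]] += 1
                for a, b in zip(s, s[1:]):
                    trans[a][b] += 1
            start_total = sum(start.values()) + self.alpha * n
            self.start[label] = {
                a: math.log((start[a] + self.alpha) / start_total) for a in self.actions
            }
            self.trans[label] = {}
            for a in self.actions:
                row_total = sum(trans[a].values()) + self.alpha * n
                self.trans[label][a] = {
                    b: math.log((trans[a][b] + self.alpha) / row_total)
                    for b in self.actions
                }
            self.prior[label] = math.log(len(seqs) / len(sequences))
        return self

    def log_joint(self, seq, label):
        """log P(label) + log P(seq | label) under that class's chain."""
        score = self.prior[label] + self.start[label][seq[0]]
        for a, b in zip(seq, seq[1:]):
            score += self.trans[label][a][b]
        return score

    def predict(self, seq):
        return max(self.prior, key=lambda label: self.log_joint(seq, label))


train = [
    (["message", "message", "message"], "spam"),
    (["message", "message", "friend_request"], "spam"),
    (["profile_view", "message", "profile_view"], "legit"),
    (["profile_view", "profile_view", "message"], "legit"),
]
model = MarkovChainClassifier(["message", "profile_view", "friend_request"])
model.fit([s for s, _ in train], [y for _, y in train])
```

The difference `log_joint(seq, "spam") - log_joint(seq, "legit")` could also be used as a single numeric feature alongside the graph and k-gram features.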
Sequence of Actions
Experiments        AU-PR            AU-ROC
Bigram Features    0.471 ± 0.004    0.859 ± 0.001
MMM                0.246 ± 0.009    0.821 ± 0.003
Results
[Figure: Precision-Recall and ROC curves]
We can classify 70% of the spammers that need manual labeling with about 90% accuracy
Deployment and Example Runtimes
• We can:
  • Run the model at short intervals, with new snapshots of the network
  • Update some of the features with events
• Example runtimes with Graphlab Create™ on a MacBook Pro:
  • 5.6 million nodes and 350 million links
  • PageRank: 6.25 minutes
  • Triangle counting: 17.98 minutes
  • k-core: 14.3 minutes
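Updating features with events, rather than recomputing them from a full snapshot, is straightforward for counters such as degree. The sketch below assumes a hypothetical `IncrementalDegrees` helper fed one action at a time; features like PageRank or k-core would still need the periodic snapshot runs.

```python
from collections import defaultdict


class IncrementalDegrees:
    """Keep per-relation in/out degree up to date as events arrive (illustrative)."""

    def __init__(self):
        self.in_deg = defaultdict(lambda: defaultdict(int))
        self.out_deg = defaultdict(lambda: defaultdict(int))

    def observe(self, src, dst, relation):
        """Apply one new link (action) to the counters."""
        self.out_deg[relation][src] += 1
        self.in_deg[relation][dst] += 1

    def features(self, user, relations):
        """Return (in_degree, out_degree) for the user in each relation."""
        return [(self.in_deg[r][user], self.out_deg[r][user]) for r in relations]


tracker = IncrementalDegrees()
for src, dst, relation in [(1, 2, "message"), (1, 3, "message"), (2, 1, "poke")]:
    tracker.observe(src, dst, relation)
user1 = tracker.features(1, ["message", "poke"])
```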
Refining the abuse reporting systems
• Abuse report systems are very noisy:
  • People have different standards
  • Spammers report random people to increase noise
  • Personal gain in social games
• The goal is to clean up the system using:
  • Reporters' previous history
  • All the reports, considered collectively
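To make the role of reporter history concrete, one simple heuristic is to estimate each reporter's credibility as the smoothed fraction of their past reports that were confirmed, and to combine reports against a user with a noisy-or. This is an assumed stand-in for illustration, not the PSL model from the talk; the function names, prior values, and history counts are all hypothetical.

```python
def credibility(confirmed, total, prior_confirmed=1.0, prior_total=2.0):
    """Smoothed fraction of a reporter's past reports that were confirmed."""
    return (confirmed + prior_confirmed) / (total + prior_total)


def report_score(reporters, history):
    """Noisy-or over reporter credibilities: chance that at least one report is right."""
    miss = 1.0
    for reporter in reporters:
        confirmed, total = history.get(reporter, (0, 0))
        miss *= 1.0 - credibility(confirmed, total)
    return 1.0 - miss


# (confirmed reports, total reports) per reporter; made-up numbers.
history = {"alice": (9, 10), "mallory": (0, 10)}
score_credible = report_score(["alice"], history)
score_noisy = report_score(["mallory"], history)
score_unknown = report_score(["new_user"], history)
```

A report from a reporter with a good track record then counts for much more than one from a reporter whose past reports were never confirmed, and a reporter with no history falls in between.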
Probabilistic Soft Logic (PSL)
• Probabilistic Soft Logic (PSL): a declarative language based on first-order logic for expressing collective probabilistic inference problems
• Example:
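As an illustration only (a hypothetical rule, not necessarily the one shown in the talk), a rule such as CREDIBLE(r) ∧ REPORTED(r, u) → SPAMMER(u) can be read with soft truth values in [0, 1]. The sketch below assumes the Łukasiewicz relaxation that PSL uses for conjunction and measures how far the rule is from being satisfied; the predicate names and truth values are made up.

```python
def lukasiewicz_and(a, b):
    """Soft conjunction of two truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)


def distance_to_satisfaction(body, head):
    """How strongly the rule body -> head is violated (0 means satisfied)."""
    return max(0.0, body - head)


# Hypothetical soft truth values for one grounding of the rule.
credible = 0.9   # CREDIBLE(r)
reported = 1.0   # REPORTED(r, u)
spammer = 0.3    # SPAMMER(u), the value inference would adjust

body = lukasiewicz_and(credible, reported)
penalty = distance_to_satisfaction(body, spammer)

# Raising SPAMMER(u) to the body's truth value removes the violation.
penalty_if_spammer = distance_to_satisfaction(body, body)
```

Collective inference then amounts to choosing the unknown truth values so that the weighted sum of such penalties over all grounded rules is as small as possible.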
Results of Classification Using Reports
Experiments                                     AU-PR            AU-ROC
Reports Only                                    0.674 ± 0.008    0.611 ± 0.007
Reports & Credibility                           0.869 ± 0.006    0.862 ± 0.004
Reports & Credibility & Collective Reasoning    0.884 ± 0.005    0.873 ± 0.004
Conclusions
• Strong signals in the multi-relational, time-stamped graph
• Graph structure:
  • Multi-relationality is more important than extracting more features from one relation
• Sequence:
  • Bigram features of actions capture most of the patterns
• Reports:
  • Incorporating the credibility of reporting users and collective reasoning significantly improves the signal
• Fast experiment iterations with Graphlab Create™ facilitate finding these patterns
Acknowledgements
• Co-authors of the paper: James Foulds (UCSC), Madhusudana Shashanka (if(we) Inc., currently Niara Inc.), Lise Getoor (UCSC)
• If(we) Inc. (formerly Tagged Inc.): Johann Schleier-Smith, Karl Dawson, Dai Li, Stuart Robinson, Vinit Garg, and Simon Hill
• Dato (formerly Graphlab): Danny Bickson, Brian Kent, Srikrishna Sridhar, Rajat Arya, Shawn Scully, and Alice Zheng
• The paper is available online, and code and part of the data will be released soon: http://www.cs.umd.edu/~shobeir
Thank you!