
Detecting Spammers on Twitter
Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida
Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
ACSAC 2010
A Presentation at Advanced Defense Lab

Outline
Introduction
Background
Dataset and Labeled Collection
Identifying User Attributes
Detecting Spammers
Related Work
Conclusion

Introduction
Twitter has recently emerged as a popular social system.

It has a simple interface in which only 140-character messages can be posted.

These services open opportunities for new forms of spam.

Introduction
4-step approach
Crawled a near-complete dataset from Twitter.
Created a labeled collection with users manually classified as spammers and non-spammers.
Conducted a study of the characteristics of tweet content and user behavior.
Used a supervised machine learning method to identify spammers.

Background
Relationship links are directional.
A re-tweeted message usually starts with RT @username.
Users usually use hashtags (#) to identify certain topics.
Trending Topics: #musicmonday

Background
A URL to a website containing advertisements completely unrelated to a hashtag on the tweet.
Re-tweets in which legitimate links are changed to illegitimate ones.


Dataset and Labeled Collection
We asked Twitter to allow us to collect such data, and they white-listed 58 servers located at the MPI-SWS.
Twitter assigns each user a numeric ID that uniquely identifies the user's profile.
We launched our crawler in August 2009 to collect all user IDs ranging from 0 to 80 million.
In total:
    54,981,152 used accounts
    1,963,263,821 social links
    1,755,925,520 tweets

Building a labeled collection
Three desired properties need to be considered when creating such a collection of users labeled as spammers and non-spammers:
The collection needs to have a significant number of spammers and non-spammers.
The labeled collection needs to include spammers who are aggressive in their strategies and mostly affect the system.
The users are chosen randomly and not based on their characteristics.

Building a labeled collection
Three trending topics:
Michael Jackson's death
Susan Boyle's emergence
The hashtag #musicmonday


Building a labeled collection
We developed a website to help volunteers manually label users as spammers or non-spammers based on their tweets containing #keywords related to the trending topics.
In total, 8,207 users were labeled, including 355 spammers and 7,852 non-spammers.
We selected only 710 of the legitimate users to include in our collection.


Identifying User Attributes
Content Attributes (39)
The maximum, minimum, average, and median of the following metrics:
    number of hashtags per number of words on each tweet
    number of URLs per words
    number of words of each tweet
    number of characters of each tweet
    number of URLs on each tweet
    number of hashtags on each tweet
    number of numeric characters that appear on the text
    number of users mentioned on each tweet
    number of times the tweet has been re-tweeted
The fraction of tweets with at least one word from a popular list of spam words
The fraction of tweets that are reply messages
The fraction of tweets of the user containing URLs

Identifying User Attributes
Total: 1,065 users.
For 39% of the spammers, every tweet they posted contained spam words, whereas for non-spammers typically no more than 4% of their tweets contain a spam word.
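As a minimal sketch of how a few of the content attributes above could be computed for one user, the following Python aggregates per-tweet counts into the maximum, minimum, average, and median, and computes two of the fractions. The regular expressions, the function names, the sample tweets, and the one-word spam list are illustrative assumptions, not the authors' implementation.

```python
import re
from statistics import mean, median

# Simplified patterns (assumptions); real tweets may need more careful parsing.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")

def summarize(values):
    """Return the four summary statistics named on the slide."""
    return {
        "max": max(values),
        "min": min(values),
        "avg": mean(values),
        "median": median(values),
    }

def content_attributes(tweets, spam_words):
    """Compute a subset of the content attributes for one user's tweets."""
    words_per_tweet = [len(t.split()) for t in tweets]
    urls_per_tweet = [len(URL_RE.findall(t)) for t in tweets]
    hashtags_per_tweet = [len(HASHTAG_RE.findall(t)) for t in tweets]
    # A tweet counts as spammy if any of its words is in the spam-word list.
    with_spam_word = sum(
        1 for t in tweets if any(w in spam_words for w in t.lower().split())
    )
    with_url = sum(1 for n in urls_per_tweet if n > 0)
    return {
        "words": summarize(words_per_tweet),
        "urls": summarize(urls_per_tweet),
        "hashtags": summarize(hashtags_per_tweet),
        "frac_spam_words": with_spam_word / len(tweets),
        "frac_with_url": with_url / len(tweets),
    }

# Hypothetical tweets and spam-word list, for illustration only.
tweets = [
    "Check this out http://example.com #musicmonday",
    "free followers now http://example.com/a http://example.com/b",
    "good morning everyone",
]
attrs = content_attributes(tweets, spam_words={"free"})
```

The remaining metrics on the slide (characters, numeric characters, mentions, re-tweet counts) would follow the same pattern: one per-tweet list, then `summarize`.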

Identifying User Attributes
User Behavior Attributes (23)
The maximum, minimum, average, and median of the following metrics:
    the time between tweets
    number of tweets posted per day
    number of tweets posted per week
    number of followers
    number of followees
    fraction of followers per followees
    number of tweets
    age of the user account
    number of times the user was mentioned
    number of times the user was replied to
    number of times the user replied to someone
    number of followees of the user's followers
    number of tweets received from followees
    existence of spam words in the user's screen name

Identifying User Attributes
(a) Spammers have a high ratio of followers per followees.
(b) Spammers usually have new accounts, probably because they are constantly being blocked by other users and reported to Twitter.
(c) Non-spammers receive a much larger number of tweets from their followees than spammers do.
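Two of the behavior attributes above, the time between tweets and the fraction of followers per followees, can be sketched as follows. This is an illustration under assumptions: the function names, the timestamps, and the follower counts are hypothetical, and the sketch expects at least two tweets per user.

```python
from datetime import datetime
from statistics import mean, median

def inter_tweet_gaps(timestamps):
    """Seconds between consecutive tweets, after sorting by time."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]

def behavior_attributes(timestamps, followers, followees):
    """Compute a subset of the user behavior attributes (needs >= 2 tweets)."""
    gaps = inter_tweet_gaps(timestamps)
    return {
        "gap_max": max(gaps),
        "gap_min": min(gaps),
        "gap_avg": mean(gaps),
        "gap_median": median(gaps),
        # Guard against accounts that follow nobody.
        "followers_per_followee": followers / followees if followees else 0.0,
    }

# Hypothetical account data, for illustration only.
stamps = [
    datetime(2009, 8, 1, 10, 0),
    datetime(2009, 8, 1, 10, 1),
    datetime(2009, 8, 1, 10, 5),
]
behavior = behavior_attributes(stamps, followers=50, followees=200)
```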


Detecting Spammers
SVM-light
5-fold cross-validation. In each test, the original sample is partitioned into 5 sub-samples, out of which four are used as training data, and the remaining one is used for testing.
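The partitioning step of 5-fold cross-validation can be sketched in plain Python. This shows only how the sample is split into training and test indices; training the SVM itself (done with SVM-light in the talk) is omitted. The function name, the shuffle seed, and the round-robin fold assignment are assumptions made for illustration; 1,065 is the size of the labeled collection reported earlier.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k sub-samples.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # The other k-1 sub-samples together form the training data.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(1065, k=5))
```

Each labeled user lands in exactly one test fold, so every user is used for testing once and for training four times.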



Detecting Spammers
χ²



Detecting Spam
Consider the following attributes for each tweet:
    number of words from a list of spam words
    number of hashtags per words
    number of URLs per words
    number of words
    number of numeric characters on the text
    number of characters that are numbers
    number of URLs
    number of hashtags
    number of mentions
    number of times the tweet has been replied to
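Most of the per-tweet attributes above can be derived from the tweet text alone. The sketch below is one possible way to do so; the regular expressions, the function name, the sample tweet, and the spam-word list are assumptions for illustration. The number of replies is left out because it needs data beyond the text.

```python
import re

# Simplified patterns (assumptions); real tweets may need more careful parsing.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")

def tweet_attributes(text, spam_words):
    """Compute text-based attributes for a single tweet."""
    words = text.split()
    n_words = len(words)
    n_urls = len(URL_RE.findall(text))
    n_hashtags = len(HASHTAG_RE.findall(text))
    return {
        "spam_words": sum(1 for w in words if w.lower() in spam_words),
        "hashtags_per_word": n_hashtags / n_words if n_words else 0.0,
        "urls_per_word": n_urls / n_words if n_words else 0.0,
        "words": n_words,
        "numeric_chars": sum(1 for c in text if c.isdigit()),
        "urls": n_urls,
        "hashtags": n_hashtags,
        "mentions": len(MENTION_RE.findall(text)),
    }

# Hypothetical tweet and spam-word list, for illustration only.
sample = "RT @alice win 2 free tickets http://example.com #musicmonday"
tweet_attrs = tweet_attributes(sample, spam_words={"free", "win"})
```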



Related Work
Spam has been observed in various applications, including e-mail, web search engines, blogs, videos, and opinions.
RE: Each user specifies a list of users from whom they are willing to receive content.

Conclusions
Crawled the Twitter site to obtain more than 54 million user profiles.

Investigated different tradeoffs for our classification approach and the impact of different attribute sets.
