identifying twitter spam by utilizing random forests · 1 introduction i top social media platforms...

103
Identifying Twitter Spam by Utilizing Random Forests Humza Haider Division of Science and Mathematics University of Minnesota Morris 2017-04-15

Upload: others

Post on 06-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

Identifying Twitter Spam by UtilizingRandom Forests

Humza HaiderDivision of Science and Mathematics

University of Minnesota Morris

2017-04-15

Page 2: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

1

Introduction

I Top social media platforms

I 500 million tweets per dayI Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively

impact other usersI How can we identify spammers?

I Manual classificationI URL blacklistingI Machine learning classification

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 3: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

1

Introduction

I Top social media platformsI 500 million tweets per day

I Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively

impact other usersI How can we identify spammers?

I Manual classificationI URL blacklistingI Machine learning classification

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 4: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

1

Introduction

I Top social media platformsI 500 million tweets per dayI Attracts spammers and malicious users

I Twitter spam: Any unsolicited, repeated actions that negativelyimpact other users

I How can we identify spammers?I Manual classificationI URL blacklistingI Machine learning classification

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 5: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

1

Introduction

I Top social media platformsI 500 million tweets per dayI Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively

impact other users

I How can we identify spammers?I Manual classificationI URL blacklistingI Machine learning classification

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 6: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

1

Introduction

I Top social media platformsI 500 million tweets per dayI Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively

impact other usersI How can we identify spammers?

I Manual classificationI URL blacklistingI Machine learning classification

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 7: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

1

Introduction

I Top social media platformsI 500 million tweets per dayI Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively

impact other usersI How can we identify spammers?

I Manual classification

I URL blacklistingI Machine learning classification

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 8: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

1

Introduction

I Top social media platformsI 500 million tweets per dayI Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively

impact other usersI How can we identify spammers?

I Manual classificationI URL blacklisting

I Machine learning classification

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 9: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

1

Introduction

I Top social media platformsI 500 million tweets per dayI Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively

impact other usersI How can we identify spammers?

I Manual classificationI URL blacklistingI Machine learning classification

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 10: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

2

Outline

BackgroundDecision TreesRandom ForestsModel Evaluation

MethodsTweet and User Content FeaturesGeo-Tagged FeaturesTime Features

Results

Conclusion

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 11: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

3

BackgroundDecision Trees

I Decision Trees

I Machine learning technique for classificationI Classifies an observation based on features available in a dataset

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 12: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

3

BackgroundDecision Trees

I Decision TreesI Machine learning technique for classification

I Classifies an observation based on features available in a dataset

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 13: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

3

BackgroundDecision Trees

I Decision TreesI Machine learning technique for classificationI Classifies an observation based on features available in a dataset

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 14: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

4

BackgroundDecision Trees

URL Account Age Reported ClassNo Old Yes Not SpamNo Old Yes Not SpamNo Old No Not SpamNo New No Not SpamYes New Yes SpamNo New Yes SpamNo Old Yes SpamYes New No Not Spam

......

......

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 15: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

5

BackgroundDecision Trees

URL Account Age Reported ClassNo Old Yes Not SpamNo Old Yes Not SpamNo Old No Not SpamNo New No Not SpamYes New Yes SpamNo New Yes SpamNo Old Yes SpamYes New No Not Spam

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 16: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

6

BackgroundDecision Trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 17: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

7

BackgroundDecision Trees

URL Account Age Reported ClassNo Old Yes Not SpamNo Old Yes Not SpamNo Old No Not SpamNo New No Not SpamYes New Yes SpamNo New Yes SpamNo Old Yes SpamYes New No Not Spam

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 18: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

8

BackgroundDecision Trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 19: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

9

BackgroundDecision Trees

URL Account Age Reported ClassNo Old Yes Not SpamNo Old Yes Not SpamNo Old No Not SpamNo New No Not Spam

Yes New Yes SpamNo New Yes SpamNo Old Yes SpamYes New No Not Spam

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 20: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

10

BackgroundDecision Trees

URL Account Age Reported ClassNo Old Yes Not SpamNo Old Yes Not SpamNo Old No Not SpamNo New No Not Spam

Yes New Yes SpamNo New Yes SpamNo Old Yes SpamYes New No Not Spam

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 21: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

11

BackgroundDecision Trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 22: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

12

BackgroundDecision Trees

URL Account Age Reported ClassNo Old Yes Not SpamNo Old Yes Not SpamNo Old No Not SpamNo New No Not Spam

Yes New Yes SpamNo New Yes SpamNo Old Yes SpamYes New No Not Spam

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 23: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

13

BackgroundDecision Trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 24: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

14

BackgroundDecision Trees

URL Account Age Reported ClassYes Old Yes TBD

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 25: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

15

BackgroundDecision Trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 26: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

15

BackgroundDecision Trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 27: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

15

BackgroundDecision Trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 28: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

16

BackgroundDecision Trees

I How are splits decided?

I EntropyI Information Gain

I Trees seem pretty neat! Why do I need a whole forest?I Disagreement in decisions between different trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 29: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

16

BackgroundDecision Trees

I How are splits decided?I Entropy

I Information GainI Trees seem pretty neat! Why do I need a whole forest?

I Disagreement in decisions between different trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 30: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

16

BackgroundDecision Trees

I How are splits decided?I EntropyI Information Gain

I Trees seem pretty neat! Why do I need a whole forest?I Disagreement in decisions between different trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 31: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

16

BackgroundDecision Trees

I How are splits decided?I EntropyI Information Gain

I Trees seem pretty neat! Why do I need a whole forest?

I Disagreement in decisions between different trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 32: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

16

BackgroundDecision Trees

I How are splits decided?I EntropyI Information Gain

I Trees seem pretty neat! Why do I need a whole forest?I Disagreement in decisions between different trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 33: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

17

BackgroundDecision Trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 34: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

17

BackgroundDecision Trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 35: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

17

BackgroundDecision Trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 36: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

17

BackgroundDecision Trees

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 37: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

18

BackgroundRandom Forests

I How do we handle disagreement?

I Train many trees on samples of the data (Bagging)I Don’t let trees access all the features (Feature Bagging)

I After we make a bunch of trees, how do we combine them?I Majority vote

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 38: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

18

BackgroundRandom Forests

I How do we handle disagreement?I Train many trees on samples of the data (Bagging)

I Don’t let trees access all the features (Feature Bagging)I After we make a bunch of trees, how do we combine them?

I Majority vote

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 39: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

18

BackgroundRandom Forests

I How do we handle disagreement?I Train many trees on samples of the data (Bagging)I Don’t let trees access all the features (Feature Bagging)

I After we make a bunch of trees, how do we combine them?I Majority vote

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 40: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

18

BackgroundRandom Forests

I How do we handle disagreement?I Train many trees on samples of the data (Bagging)I Don’t let trees access all the features (Feature Bagging)

I After we make a bunch of trees, how do we combine them?

I Majority vote

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 41: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

18

BackgroundRandom Forests

I How do we handle disagreement?I Train many trees on samples of the data (Bagging)I Don’t let trees access all the features (Feature Bagging)

I After we make a bunch of trees, how do we combine them?I Majority vote

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 42: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

19

BackgroundRandom Forests

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Source: https://i.ytimg.com/vi/ajTc5y3OqSQ/hqdefault.jpg

Page 43: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

20

BackgroundModel Evaluation

I How do we evaluate a random forest’s performance?

True PositiveSpam

Spam

False Positive

Not Spam

False NegativeNot Spam True Negative

Pre

dict

ion

Truth

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 44: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

20

BackgroundModel Evaluation

I How do we evaluate a random forest’s performance?

True PositiveSpam

Spam

False Positive

Not Spam

False NegativeNot Spam True Negative

Pre

dict

ion

Truth

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 45: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

21

BackgroundModel Evaluation

Accuracy

I “How many tweets were correctly identified?”

True PositiveSpam

Spam

False Positive

Not Spam

False NegativeNot Spam True Negative

Pre

dict

ion

Truth

Accuracy

Accuracy =TP + TN

TP + TN + FP + FN

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 46: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

21

BackgroundModel Evaluation

Accuracy

I “How many tweets were correctly identified?”

True PositiveSpam

Spam

False Positive

Not Spam

False NegativeNot Spam True Negative

Pre

dict

ion

Truth

Accuracy

Accuracy =TP + TN

TP + TN + FP + FN

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 47: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

22

BackgroundModel Evaluation

Precision (p)

I “How good is our spam prediction?"

True PositiveSpam

Spam

False Positive

Not Spam

False NegativeNot Spam True Negative

Pre

dict

ion

Truth

Precision

Precision =TP

TP + FP

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 48: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

22

BackgroundModel Evaluation

Precision (p)

I “How good is our spam prediction?"

True PositiveSpam

Spam

False Positive

Not Spam

False NegativeNot Spam True Negative

Pre

dict

ion

Truth

Precision

Precision =TP

TP + FP

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 49: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

23

BackgroundModel Evaluation

Recall (r )

I “How much spam was identified?"

True PositiveSpam

Spam

False Positive

Not Spam

False NegativeNot Spam True Negative

Pre

dict

ion

Truth

Recall

Recall =TP

TP + FN

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 50: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

23

BackgroundModel Evaluation

Recall (r )

I “How much spam was identified?"

True PositiveSpam

Spam

False Positive

Not Spam

False NegativeNot Spam True Negative

Pre

dict

ion

Truth

Recall

Recall =TP

TP + FN

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 51: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

24

BackgroundModel Evaluation

I F-measure (F )I Harmonic mean of Precision and RecallI Equally weights both Precision and Recall

F-Measure

F-measure = 2 · Precision · RecallPrecision + Recall

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 52: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

25

Outline

BackgroundDecision TreesRandom ForestsModel Evaluation

MethodsTweet and User Content FeaturesGeo-Tagged FeaturesTime Features

Results

Conclusion

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 53: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

26

MethodsTweet and User Content Features

I Chen et al. identify tweets, as opposed to users

I Utilized 12 features directly accessible from a tweetI 6 user featuresI 6 tweet features

I User FeaturesI Age in days of accountI Number of followers, followeesI Number of tweets

I Tweet FeaturesI Number of hashtags (#)I Number of mentionsI Number of URLs

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 54: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

26

MethodsTweet and User Content Features

I Chen et al. identify tweets, as opposed to usersI Utilized 12 features directly accessible from a tweet

I 6 user featuresI 6 tweet features

I User FeaturesI Age in days of accountI Number of followers, followeesI Number of tweets

I Tweet FeaturesI Number of hashtags (#)I Number of mentionsI Number of URLs

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 55: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

26

MethodsTweet and User Content Features

I Chen et al. identify tweets, as opposed to usersI Utilized 12 features directly accessible from a tweet

I 6 user features

I 6 tweet featuresI User Features

I Age in days of accountI Number of followers, followeesI Number of tweets

I Tweet FeaturesI Number of hashtags (#)I Number of mentionsI Number of URLs

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 56: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

26

MethodsTweet and User Content Features

I Chen et al. identify tweets, as opposed to usersI Utilized 12 features directly accessible from a tweet

I 6 user featuresI 6 tweet features

I User FeaturesI Age in days of accountI Number of followers, followeesI Number of tweets

I Tweet FeaturesI Number of hashtags (#)I Number of mentionsI Number of URLs

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 57: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

26

MethodsTweet and User Content Features

I Chen et al. identify tweets, as opposed to usersI Utilized 12 features directly accessible from a tweet

I 6 user featuresI 6 tweet features

I User FeaturesI Age in days of accountI Number of followers, followeesI Number of tweets

I Tweet FeaturesI Number of hashtags (#)I Number of mentionsI Number of URLs

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 58: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

26

MethodsTweet and User Content Features

I Chen et al. identify tweets, as opposed to usersI Utilized 12 features directly accessible from a tweet

I 6 user featuresI 6 tweet features

I User FeaturesI Age in days of accountI Number of followers, followeesI Number of tweets

I Tweet FeaturesI Number of hashtags (#)I Number of mentionsI Number of URLs

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 59: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

27

MethodsTweet and User Content Features

I Two sets of testing data.

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 60: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

28

Outline

BackgroundDecision TreesRandom ForestsModel Evaluation

MethodsTweet and User Content FeaturesGeo-Tagged FeaturesTime Features

Results

Conclusion

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 61: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

29

MethodsGeo-Tagged Features

I So, what is a geo-tagged tweet?

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 62: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

29

MethodsGeo-Tagged Features

I So, what is a geo-tagged tweet?

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 63: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

30

MethodsGeo-Tagged Features

I Guo and Chen identify non-personal usersI SpammersI BotsI Business accounts

I Features:I Tweeting Speed

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 64: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

30

MethodsGeo-Tagged Features

I Guo and Chen identify non-personal usersI SpammersI BotsI Business accounts

I Features:

I Tweeting Speed

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 65: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

30

MethodsGeo-Tagged Features

I Guo and Chen identify non-personal usersI SpammersI BotsI Business accounts

I Features:I Tweeting Speed

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 66: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

31

MethodsGeo-Tagged Features

I 19.6 Miles

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 67: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

31

MethodsGeo-Tagged Features

I 19.6 MilesI From Clontarf at 7:53 PM

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 68: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

31

MethodsGeo-Tagged Features

I 19.6 MilesI From Clontarf at 7:53 PMI From Morris at 8:00 PM

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 69: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

31

MethodsGeo-Tagged Features

I 19.6 MilesI From Clontarf at 7:53 PMI From Morris at 8:00 PMI Tweeting speed = 19.6 miles

7 minutes = 2.8 miles per minute (168 MPH)

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 70: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

32

MethodsGeo-Tagged Features

I Features:

I Max SpeedI Mean SpeedI Max Distance (connected to Max Speed)I Mean number of times a user exceeds 90 MPH per month

I County based features

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 71: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

32

MethodsGeo-Tagged Features

I Features:I Max Speed

I Mean SpeedI Max Distance (connected to Max Speed)I Mean number of times a user exceeds 90 MPH per month

I County based features

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 72: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

32

MethodsGeo-Tagged Features

I Features:I Max SpeedI Mean Speed

I Max Distance (connected to Max Speed)I Mean number of times a user exceeds 90 MPH per month

I County based features

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 73: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

32

MethodsGeo-Tagged Features

I Features:I Max SpeedI Mean SpeedI Max Distance (connected to Max Speed)

I Mean number of times a user exceeds 90 MPH per monthI County based features

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 74: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

32

MethodsGeo-Tagged Features

I Features:I Max SpeedI Mean SpeedI Max Distance (connected to Max Speed)I Mean number of times a user exceeds 90 MPH per month

I County based features

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 75: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

32

MethodsGeo-Tagged Features

I Features:I Max SpeedI Mean SpeedI Max Distance (connected to Max Speed)I Mean number of times a user exceeds 90 MPH per month

I County based features

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 76: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

33

I Number of times a user crosses county borders per monthI Mean number of counties a user has been to per month

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 77: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

33

I Number of times a user crosses county borders per month

I Mean number of counties a user has been to per month

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 78: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

33

I Number of times a user crosses county borders per monthI Mean number of counties a user has been to per month

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 79: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

34

Outline

BackgroundDecision TreesRandom ForestsModel Evaluation

MethodsTweet and User Content FeaturesGeo-Tagged FeaturesTime Features

Results

Conclusion

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 80: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

35

MethodsTime Features

I Washha et al. classify spammers on a user level

I Motivated to use time since altering time dependent features is achallenge.

I Features that spammers can easily manipulate:I Number of URLsI Number of HashtagsI Including Geo-tags

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 81: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

35

MethodsTime Features

I Washha et al. classify spammers on a user levelI Motivated to use time since altering time dependent features is a

challenge.

I Features that spammers can easily manipulate:I Number of URLsI Number of HashtagsI Including Geo-tags

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 82: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

35

MethodsTime Features

I Washha et al. classify spammers on a user levelI Motivated to use time since altering time dependent features is a

challenge.I Features that spammers can easily manipulate:

I Number of URLsI Number of HashtagsI Including Geo-tags

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 83: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

35

MethodsTime Features

I Washha et al. classify spammers on a user levelI Motivated to use time since altering time dependent features is a

challenge.I Features that spammers can easily manipulate:

I Number of URLs

I Number of HashtagsI Including Geo-tags

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 84: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

35

MethodsTime Features

I Washha et al. classify spammers on a user levelI Motivated to use time since altering time dependent features is a

challenge.I Features that spammers can easily manipulate:

I Number of URLsI Number of Hashtags

I Including Geo-tags

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 85: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

35

MethodsTime Features

I Washha et al. classify spammers on a user levelI Motivated to use time since altering time dependent features is a

challenge.I Features that spammers can easily manipulate:

I Number of URLsI Number of HashtagsI Including Geo-tags

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 86: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

36

MethodsTime Features

FeaturesI Differences in Account Age

I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships

I Time weighted correlations:I URLsI MentionsI Hashtags

I Tweet similarity weighted by time

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 87: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

36

MethodsTime Features

FeaturesI Differences in Account Age

I Spammers have multiple accounts

I Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships

I Time weighted correlations:I URLsI MentionsI Hashtags

I Tweet similarity weighted by time

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 88: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

36

MethodsTime Features

FeaturesI Differences in Account Age

I Spammers have multiple accountsI Likely to be made at the same time

I FollowersI FolloweesI Bi-directional relationships

I Time weighted correlations:I URLsI MentionsI Hashtags

I Tweet similarity weighted by time

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 89: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

36

MethodsTime Features

FeaturesI Differences in Account Age

I Spammers have multiple accountsI Likely to be made at the same timeI Followers

I FolloweesI Bi-directional relationships

I Time weighted correlations:I URLsI MentionsI Hashtags

I Tweet similarity weighted by time

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 90: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

36

MethodsTime Features

FeaturesI Differences in Account Age

I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI Followees

I Bi-directional relationshipsI Time weighted correlations:

I URLsI MentionsI Hashtags

I Tweet similarity weighted by time

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 91: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

36

MethodsTime Features

FeaturesI Differences in Account Age

I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships

I Time weighted correlations:I URLsI MentionsI Hashtags

I Tweet similarity weighted by time

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 92: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

36

MethodsTime Features

FeaturesI Differences in Account Age

I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships

I Time weighted correlations:

I URLsI MentionsI Hashtags

I Tweet similarity weighted by time

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 93: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

36

MethodsTime Features

FeaturesI Differences in Account Age

I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships

I Time weighted correlations:I URLs

I MentionsI Hashtags

I Tweet similarity weighted by time

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 94: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

36

MethodsTime Features

FeaturesI Differences in Account Age

I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships

I Time weighted correlations:I URLsI Mentions

I HashtagsI Tweet similarity weighted by time

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 95: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

36

MethodsTime Features

FeaturesI Differences in Account Age

I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships

I Time weighted correlations:I URLsI MentionsI Hashtags

I Tweet similarity weighted by time

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 96: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

36

MethodsTime Features

FeaturesI Differences in Account Age

I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships

I Time weighted correlations:I URLsI MentionsI Hashtags

I Tweet similarity weighted by time

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 97: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

37

Outline

BackgroundDecision TreesRandom ForestsModel Evaluation

MethodsTweet and User Content FeaturesGeo-Tagged FeaturesTime Features

Results

Conclusion

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 98: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

38

Results

I Accuracy:True PositiveSpam

Spam

False Positive

Not Spam

False NegativeNot Spam True Negative

Pre

dict

ion

Truth

TP+TNTP+FP+TN+FN "How many tweets were

correctly identified?"

I Precision (p):True PositiveSpam

Spam

False Positive

Not Spam

False NegativeNot Spam True Negative

Pre

dict

ion

Truth

TPTP+FP "How good is our spam

prediction?"

I Recall (r ):True PositiveSpam

Spam

False Positive

Not Spam

False NegativeNot Spam True Negative

Pre

dict

ion

Truth

TPTP+FN "How much spam was

identified?"I F-measure (F ): 2 · precision·recall

precision+recall

Model Results of the Three Studies

Study %Spam p r F Accuracy

User/Tweet Features: I 50.0% 0.929 0.943 0.936 0.936User/Tweet Features: II 5.0% 0.929 0.407 0.566 0.978Geo-tagged Features 21.4% 0.959 0.959 0.958 0.959

Time Features 46.9% 0.932 0.931 0.931 0.931

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 99: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

39

Conclusion

I Classification via random forestI Recall (r ) may drop when test set contains a low proportion of

spamI Future work: Apply this finding to geo-tagged tweets and time

featuresI Future spam classification by Twitter: Random forests?

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 100: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

40

Acknowledgments

I Peter DolanI For acting as my advisor for this research project and his continued

friendship for the past few yearsI Elena Machkasova

I For the invaluable advice and insightful comments throughout theentire senior seminar course

I Jacob OpdahlI For the exceptional revisions and comments he made as my alumni

review

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 101: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

41

References I

Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, andVirgilio Almeida, Detecting spammers on twitter, Collaboration,electronic messaging, anti-abuse and spam conference (CEAS),vol. 6, 2010, p. 12.

Leo Breiman, Random forests, Machine learning 45 (2001),no. 1, 5–32.

Chao Chen, Jun Zhang, Xiao Chen, Yang Xiang, and WanleiZhou, 6 million spam tweets: A large ground truth for timelytwitter spam detection, 2015 IEEE International Conference onCommunications (ICC), IEEE, 2015, pp. 7065–7070.

Diansheng Guo and Chao Chen, Detecting non-personal andspam users on geo-tagged twitter network, Transactions in GIS18 (2014), no. 3, 370–384.

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 102: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

42

References II

Miroslav Kubat, An introduction to machine learning, 1st ed.,Springer Publishing Company, Incorporated, 2015.

Nikita Spirin and Jiawei Han, Survey on web spam detection:Principles and algorithms, SIGKDD Explor. Newsl. 13 (2012),no. 2, 50–64.

Claude Elwood Shannon, A mathematical theory ofcommunication, ACM SIGMOBILE Mobile Computing andCommunications Review 5 (2001), no. 1, 3–55.

Igor Santos, Igor Miñambres-Marcos, Carlos Laorden, PatxiGalán-García, Aitor Santamaría-Ibirika, and Pablo GarcíaBringas, Twitter content-based spam filtering, pp. 449–458,Springer International Publishing, Cham, 2014.

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests

Page 103: Identifying Twitter Spam by Utilizing Random Forests · 1 Introduction I Top social media platforms I 500 million tweets per day I Attracts spammers and malicious users I Twitter

43

References III

Kurt Thomas, Chris Grier, Dawn Song, and Vern Paxson,Suspended accounts in retrospect: An analysis of twitter spam,Proceedings of the 2011 ACM SIGCOMM Conference onInternet Measurement Conference (New York, NY, USA), IMC’11, ACM, 2011, pp. 243–258.

Mahdi Washha, Aziz Qaroush, and Florence Sedes, Leveragingtime for spammers detection on twitter, Proceedings of the 8thInternational Conference on Management of Digital EcoSystems(New York, NY, USA), MEDES, ACM, 2016, pp. 109–116.

Humza Haider | Identifying Twitter Spam by Utilizing Random Forests