botgraph: large scale spamming botnet detection

27
BotGraph: Large Scale Spamming Botnet Detection Yao Zhao Yinglian Xie * , Fang Yu * , Qifa Ke * , Yuan Yu * , Yan Chen and Eliot Gillum EECS Department, Northwestern University Microsoft Research Silicon Valley * Microsoft Cooperation 1

Upload: xerxes

Post on 08-Jan-2016

52 views

Category:

Documents


2 download

DESCRIPTION

BotGraph: Large Scale Spamming Botnet Detection. Yao Zhao Yinglian Xie * , Fang Yu * , Qifa Ke * , Yuan Yu * , Yan Chen and Eliot Gillum ‡ EECS Department, Northwestern University Microsoft Research Silicon Valley * Microsoft Cooperation ‡. Web-Account Abuse Attack. Zombie - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: BotGraph: Large Scale Spamming Botnet Detection

BotGraph: Large Scale Spamming Botnet Detection

Yao ZhaoYinglian Xie*, Fang Yu*, Qifa Ke*, Yuan Yu*, Yan Chen and Eliot

Gillum‡

EECS Department, Northwestern UniversityMicrosoft Research Silicon Valley*

Microsoft Cooperation ‡

1

Page 2: BotGraph: Large Scale Spamming Botnet Detection

2

Web-Account Abuse AttackZombie

(Compromised host)

Spammer’sServer

Captcha solver

RDSXXTD3

User/Pwd

Page 3: BotGraph: Large Scale Spamming Botnet Detection

Problems and Challenges• Detect Web-account Abuse with Hotmail Logs

– Input: user activity traces (signup, login, email-sending records)

– Goal: stop aggressive account signup, limit outgoing spam

• Challenges – Attack is stealthy: individual account detection

difficult– Attack is large scale: finding correlated activities

• >500 million accounts• 300GB-400GB data per month

– Low false positive and false negative rate

3

Page 4: BotGraph: Large Scale Spamming Botnet Detection

4

The BotGraph System• A graph-based approach to attack detection

– A large user-user graph to capture bot-account correlations– Identify 26M bot-accounts with a low false positive rate in two

months• Efficient implementation using Dryad/DryadLINQ

– Graph construction/analysis is not easily parallelizable– Hundreds of millions of nodes, hundreds of billions of edges– Process 200GB-300GB data in 1.5 hours with a 240-machine

cluster

The first to provide a systematic solution to the new botnet-based web-account abuse attack

Page 5: BotGraph: Large Scale Spamming Botnet Detection

System Architecture

5

Login data

Login graph

Graph generation

Random graph based clustering

Verification & prune

Sendmail data

Spamming botnets

Suspicious clusters

Signup data

EWMA based change detection Aggressive

signups

Verification & prune

Signup botnets

3. Parallel algorithm on DryadLINQ clusters

(ID, IP, time)

(ID, time, # of recipients)

(ID, IP, time)

1. History based algorithm to detect aggressive signups

2. Graph-based algorithm to find correlations

Page 6: BotGraph: Large Scale Spamming Botnet Detection

6

Detect Aggressive SignupsLarge

prediction error

Back to normal

Date

Nu

mb

er

of S

ign

up A

cco

unts

25

20

15

10

5

1-Jul 2-Jul 3-Jul 4-Jul 5-Jul 6-Jul 7-Jul 8-Jul 9-Jul

Signup Count

EWMA Prediction

• Simple and efficient• Detect 20 million malicious accounts in 2 months• Simple and efficient• Detect 20 million malicious accounts in 2 months

Page 7: BotGraph: Large Scale Spamming Botnet Detection

System Architecture

7

Login data

Login graph

Graph generation

Random graph based clustering

Verification & prune

Sendmail data

Spamming botnets

Suspicious clusters

Signup data

EWMA based change detection Aggressive

signups

Verification & prune

Signup botnets

3. Parallelel Algorithm on DryadLinq clusters

(ID, IP, time)

(ID, time, # of recipients)

(ID, IP, time)

1. History based algorithm on Signup detection

2. Graph-based algorithm on login detection

Page 8: BotGraph: Large Scale Spamming Botnet Detection

8

• Observation: bot-accounts work collaboratively

• Normal Users– Share IP addresses in one AS with DHCP assignment

• Bot-users

Detect Stealthy Accounts by Graphs

A user-user graph to model behavior similaritiesA user-user graph to model behavior similarities

Page 9: BotGraph: Large Scale Spamming Botnet Detection

9

• Observation: bot-accounts work collaboratively

• Normal Users– Share IP addresses in one AS with DHCP assignment

• Bot-users– Likely to share different IPs across ASes

Detect Stealthy Accounts by Graphs

A user-user graph to model behavior similaritiesA user-user graph to model behavior similarities

Page 10: BotGraph: Large Scale Spamming Botnet Detection

User-user Graph

• Node: Hotmail account • Edge weight: # of ASes of the

shared IP addresses– Consider edges with weight>1

• Key Observations– Bot-users form a giant

connected-component while normal users do not

– Interpreted by the random graph theory

10

2 ASes

3 ASes5 ASes

1 AS

4 ASes

User1

User2

User3

User4

User5

User6

Page 11: BotGraph: Large Scale Spamming Botnet Detection

Random Graph Theory

• Random Graph G(n,p) – n nodes and each pair of nodes has an edge with

probability p and average degree d = (n-1) · p

• Theorem– If d < 1, then with high probability the largest component

in the graph has size less than O(log n) No large connected subgraph

– If d > 1, with high probability the graph will contain a giant component with size at the order of O(n)

Most nodes are in one connected subgraph

11

Page 12: BotGraph: Large Scale Spamming Botnet Detection

Graph-based Bot-user Detection

• Step 1: detect giant connected-components from the user-user graph

• Step 2: hierarchical algorithm to identify the correct groupings– Different bot-user groups may be mixed– Easier validation with correct group statistics– Difficult to choose a fixed edge-threshold

• Step 3: prune normal-user groups– Due to national proxies, cell phone users, facebook applications,

etc.

12

Page 13: BotGraph: Large Scale Spamming Botnet Detection

Graph-based Bot-user Detection

• Step 1: detect giant connected-components from the user-user graph

• Step 2: hierarchical algorithm to identify the correct groupings– Different bot-user groups may be mixed– Easier validation with correct group statistics– Difficult to choose a fixed edge-threshold

• Step 3: prune normal-user groups– Due to national proxies, cell phone users, facebook applications,

etc.

13

Page 14: BotGraph: Large Scale Spamming Botnet Detection

Hierarchical Bot-Group Extraction

A

B

T=2

C DT=3

T=4

14

A

Page 15: BotGraph: Large Scale Spamming Botnet Detection

System Architecture

15

Login data

Login graph

Graph generation

Random graph based clustering

Verification & prune

Sendmail data

Spamming botnets

Suspicious clusters

Signup data

EWMA based change detection Aggressive

signups

Verification & prune

Signup botnets

3. Parallelel Algorithm on DryadLINQ clusters

(ID, IP, time)

(ID, time, # of recipients)

(ID, IP, time)

1. History based algorithm on Signup detection

2. Graph-based algorithm on login detection

Page 16: BotGraph: Large Scale Spamming Botnet Detection

Parallel Implementation on DryadLINQ

• EWMA-based Signup Abuse Detection– Partition data by IP– Can achieve real-time detection

• User-User Graph Construction– Two algorithms and optimizations– Process 200GB-300GB data in 1.5 hours with 240

machines

• Connected Component Extraction– Divide and conquer– Process a graph of 8.6 billion edges in 7 minutes

Page 17: BotGraph: Large Scale Spamming Botnet Detection

Parallel Implementation on DryadLINQ

• EWMA-based Signup Abuse Detection– Partition data by IP– Can achieve real-time detection

• User-User Graph Construction– Two algorithms and optimizations– Process 200GB-300GB data in 1.5 hours with 240

machines

• Connected Component Extraction– Divide and conquer– Process a graph of 8.6 billion edges in 7 minutes

Page 18: BotGraph: Large Scale Spamming Botnet Detection

Graph Construction 1: Simple Data Parallelism

• Potential Edges– Select ID group by IP (Map)– Generate potential edges (IDi, IDj, IPk) (Reduce)

• Edge Weights– Select IP group by ID pair (Map)– Calculate edge weight (Reduce)

• Problem– Weight 1 edge is two orders of magnitude more than

others – Their computation/communication is unnecessary

Page 19: BotGraph: Large Scale Spamming Botnet Detection

Graph Construction 2: Selective Filtering

19

Page 20: BotGraph: Large Scale Spamming Botnet Detection

Comparison of Two Algorithms• Method 1

– Simple and scalable• Method 2

– Optimized to filter out weight 1 edges– Utilize Join functionality, data compression and

broadcast optimization

20

Page 21: BotGraph: Large Scale Spamming Botnet Detection

Detection Results

• Data description– Two datasets

• Jun 2007 and Jan 2008

– Three types of data• Signup log (IP, ID, Time)• Login log (IP, ID, Time)

– 500M users and 200~300GB data per month

• Sendmail log (ID, time, # of recipients)– About 100GB per month

21

Page 22: BotGraph: Large Scale Spamming Botnet Detection

Detection of Signup Abuse

22

Page 23: BotGraph: Large Scale Spamming Botnet Detection

Detection by User-user Graph

23

Page 24: BotGraph: Large Scale Spamming Botnet Detection

Validations• Manual Check

– Sampled groups verified by the Hotmail team– Almost no false positives

• Comparison with Known Spamming Users – Detect 86% of complained accounts– Up to 54% of detected accounts are our new findings

• Email Sending Sizes per Group– Most groups have a sharp peak– The remaining contain several peaks

• False Positive Estimation – Naming pattern (0.44%)– Signup time (0.13%)

24

Page 25: BotGraph: Large Scale Spamming Botnet Detection

Possible to Evade BotGraph?

• Evade signup detection: Be stealthy

• Evade graph-based detection– Fixed IP/AS binding

• Low utilization rate• Bot-accounts bound to one host are easy to be grouped

– Be stealthy (sending as few emails as normal user)

25

Severely limit attackers’ spam throughput

Page 26: BotGraph: Large Scale Spamming Botnet Detection

Conclusions

• A graph-based approach to attack detection– Identify 26M bot-accounts with a low false positive rate in

two months

• Efficient implementation using Dryad/DryadLINQ– Process 200GB-300GB data in 1.5 hours with a 240-

machine cluster

26

Large-scale data-mining for network security is effective and practical

Large-scale data-mining for network security is effective and practical

Page 27: BotGraph: Large Scale Spamming Botnet Detection

Q & A?

Thanks!

27