Improving spam detection with automata

Upload: cooler-freenode

Post on 15-Jan-2017

1/17

Improving SPAM detection

1 March 2016

2/17

Whois

● Antonio Costa – Cooler

● Just another systems analyst

● GitHub: CoolerVoid

● https://github.com/CoolerVoid

Contact: [email protected]

3/17

How it works

● Anti-spam: the common way

● Fetch e-mails over POP3 / IMAP ...

● Validate

● Clean the text and tokenize it

● BoW (bag-of-words), SoW (set-of-words) ...

● tf–idf (term frequency–inverse document frequency) ...

● Supervised learning

● Classification (SVM, KNN, NB, random forest ...)

4/17

How it works

● Anti-spam: the common way

● Fetch e-mails over POP3 / IMAP

● Validate

– Country-based filtering

– DNS-based blacklists

– Enforcing RFC standards

– SMTP callback verification

5/17

● DNS-based blacklists
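A DNS-based blacklist is queried by reversing the octets of the sender's IPv4 address and prepending them to the blacklist's zone; if the name resolves, the address is listed. A minimal Python sketch of that lookup follows (the project itself is C++). The zone `dnsbl.example.org` and both function names are hypothetical, since the slides do not say which lists are used.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query: reversed IPv4 octets followed by the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist zone answers for this IP (needs network access)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name: the address is not on this list.
        return False


# Hypothetical zone, for illustration only.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Only `dnsbl_query_name` is pure; `is_listed` performs a real DNS query and should be pointed at a list you are allowed to use.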

6/17

Wake up

7/17

How it works

● Anti-spam: the common way

● Fetch e-mails over POP3 / IMAP ... (the INPUT STRING)

● Validate

● Clean the text and tokenize it

● BoW (bag-of-words), SoW (set-of-words), tf–idf (term frequency–inverse document frequency) ... (create the MATRIX)

● Supervised learning (USING the MATRIX)

● Classification (SVM, KNN, NB, random forest ...)
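The "create the matrix" step can be sketched with tf–idf in Python. This assumes one common variant of the formula (term count divided by document length, times the natural log of N over document frequency); the slides do not say which variant the project uses. The function name and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is the tf-idf of term j in doc i.

    Each document is a non-empty list of tokens.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = [
            (counts[term] / len(doc)) * math.log(n_docs / df[term])
            for term in vocab
        ]
        matrix.append(row)
    return vocab, matrix


# Toy, already-tokenized e-mails (illustrative data only).
docs = [["cheap", "pills", "cheap"], ["meeting", "notes"], ["cheap", "meeting"]]
vocab, matrix = tf_idf_matrix(docs)
```

Each row of `matrix` is the feature vector that a classifier such as naive Bayes, SVM or KNN would be trained on.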

8/17

Bag-of-words

[1] - "Luan likes to make hacking. Josimar likes to make hacking too."

[2] - "Luan also likes to web hacking."

● Create an array of words (tokenize ...)

{ "Luan", "likes", "to", "make", "hacking", "Josimar", "too", "also", "web" } Total of 9 elements

● Count the number of appearances of each word!

[1] – { 1, 2, 2, 2, 2, 1, 1, 0, 0 }

[2] – { 1, 1, 1, 0, 1, 0, 0, 1, 1 }
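The two count vectors above can be reproduced with a short Python sketch. The tokenizer here is an assumption: it keeps runs of ASCII letters and preserves case, which is enough for the slide's sentences but not for real mail.

```python
import re


def tokenize(text):
    """Split on anything that is not a letter; case is kept as on the slide."""
    return re.findall(r"[A-Za-z]+", text)


def bag_of_words(texts):
    """Return (vocabulary in order of first appearance, one count vector per text)."""
    tokenized = [tokenize(t) for t in texts]
    vocab = []
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocab:
                vocab.append(tok)
    vectors = [[tokens.count(term) for term in vocab] for tokens in tokenized]
    return vocab, vectors


texts = [
    "Luan likes to make hacking. Josimar likes to make hacking too.",
    "Luan also likes to web hacking.",
]
vocab, vectors = bag_of_words(texts)
print(vocab)    # the 9-word array from the slide
print(vectors)  # one row of counts per sentence
```

A set-of-words (SoW) variant would store 1 instead of the count whenever a word is present.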

9/17

The common way: look at the following

10/17

The common way: why naive Bayes?

● In my tests!

Classifier  Accuracy  Performance
KNN         96%       Slow
SVM         92%       Medium
NB          94%       Fast

Super simple: you are just doing a bunch of counts. Naive Bayes is an eager learning classifier and it is much faster than KNN. Nowadays it can be used for real-time prediction.
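To show why naive Bayes amounts to "a bunch of counts", here is a minimal multinomial naive Bayes in Python with add-one smoothing. It is a sketch under stated assumptions, not the project's C++ classifier: the class name, the smoothing choice and the toy training mails are all illustrative.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()

    def train(self, tokens, label):
        # Training is only counting: one document, plus each of its words.
        self.class_docs[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            total_words = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (illustrative only).
nb = NaiveBayes()
nb.train(["cheap", "pills", "buy", "now"], "spam")
nb.train(["win", "money", "now"], "spam")
nb.train(["meeting", "notes", "attached"], "ham")
nb.train(["lunch", "tomorrow"], "ham")
print(nb.predict(["cheap", "money"]))
```

All the work happens at training time (eager learning), so prediction is a handful of lookups and additions per word, whereas KNN has to compare each new mail against the stored training set.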

11/17

My way: automata as match rules

● Gain accuracy!

● Gain performance!

● Because a rule can match SPAM before the classifier is used!

● www.site.com/www.bank.com/

● URL/malware.exe: a rule like URL/[a-zA-Z]*\.exe ...

● A rule to detect an IP address in a URL

● Deterministic finite automaton to detect

● Use ranking!

NB 94% +4% Fast
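The idea of matching rules before the classifier runs can be sketched in Python with regular expressions. The two patterns are illustrative readings of the bullets above (an `.exe` download and an IP address used as a URL host), not the project's actual rules; `RULES`, `matched_rules` and `classify` are hypothetical names.

```python
import re

# Hypothetical rule set; the patterns are illustrative, not taken from the project.
RULES = {
    "exe_download": re.compile(r"https?://\S+/[a-zA-Z]*\.exe\b", re.IGNORECASE),
    "ip_in_url": re.compile(r"https?://(?:\d{1,3}\.){3}\d{1,3}(?:[:/]|$)"),
}


def matched_rules(text):
    """Return the names of all rules that match somewhere in the text."""
    return [name for name, pattern in RULES.items() if pattern.search(text)]


def classify(text, classifier):
    """Label as spam on any rule hit; otherwise fall back to the classifier."""
    if matched_rules(text):
        return "spam"
    return classifier(text)


# The lambda stands in for a trained classifier such as naive Bayes.
print(classify("get it at http://example.com/malware.exe now", lambda t: "ham"))
```

A rule hit skips tokenization and classification entirely, which is where the performance gain would come from.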

12/17

My way: automata as match rules

● Gain accuracy!

● Gain performance!

● Because a rule can match SPAM before the classifier is used!

● Deterministic finite automaton in the rules to detect

● www.site.com/www.bank.com/

● URL/malware.exe: a rule like URL/[a-zA-Z]*\.exe ...

● A rule to detect an IP address in a URL

● A rule to detect phishing

● Use ranking!

NB 94% +4% Fast
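A rule like `[a-zA-Z]*\.exe` can also be written out as an explicit deterministic finite automaton. The Python sketch below hand-codes the five states for that one pattern, applied to the last path segment of a URL. The state numbering and function names are assumptions made for illustration; the project's automata are in C++ and may be built differently.

```python
import string

LETTERS = set(string.ascii_letters)
ACCEPT = 4  # reached only after reading ".exe"


def step(state, ch):
    """Transition function; returns None for the dead (reject) state."""
    if state == 0:
        if ch in LETTERS:
            return 0  # still reading the [a-zA-Z]* part
        if ch == ".":
            return 1
    elif state == 1 and ch == "e":
        return 2
    elif state == 2 and ch == "x":
        return 3
    elif state == 3 and ch == "e":
        return 4
    return None


def dfa_accepts(name):
    """Run the DFA over the whole string, one character per step."""
    state = 0
    for ch in name:
        state = step(state, ch)
        if state is None:
            return False
    return state == ACCEPT


url = "http://example.com/malware.exe"
print(dfa_accepts(url.rsplit("/", 1)[-1]))
```

Each input character costs one table step and there is no backtracking, so matching time is linear in the length of the input. As written the automaton is case-sensitive, so `.EXE` is rejected.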

13/17

Why ranking? Automata as match rules

● Gain accuracy!

NB 94% +4% Fast
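The slides do not spell out how the ranking works. One plausible reading, sketched here in Python, is that every rule carries a weight, the weights of all matching rules are summed, and only mail above a threshold is marked as spam without consulting the classifier. The rules, weights and threshold below are invented for illustration.

```python
import re

# Hypothetical rules and weights, chosen only for illustration.
WEIGHTED_RULES = [
    # .exe download
    (re.compile(r"https?://\S+/[a-zA-Z]*\.exe\b"), 5),
    # IP address used as the URL host
    (re.compile(r"https?://(?:\d{1,3}\.){3}\d{1,3}(?:[:/]|$)"), 3),
    # a second host name hidden in the path, as in www.site.com/www.bank.com/
    (re.compile(r"[\w-]+(?:\.[\w-]+)+/www\.[\w-]+\.[a-z]{2,}", re.IGNORECASE), 4),
]
SPAM_THRESHOLD = 5  # assumed cut-off


def rank(text):
    """Sum the weights of every rule that matches the text."""
    return sum(weight for pattern, weight in WEIGHTED_RULES if pattern.search(text))


def classify(text, classifier):
    """High-ranked mail is spam outright; the rest goes to the classifier."""
    if rank(text) >= SPAM_THRESHOLD:
        return "spam"
    return classifier(text)


print(rank("http://192.0.2.10/a/update.exe"))
```

With weights, a single weak signal no longer forces a verdict on its own, which is one way ranking could reduce false positives and so raise accuracy.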

14/17

E-mail audit: the project!

● All source code in C++! 100% open source!

● IMAP – communication

● Blacklists – DNS, bad domains, e-mail addresses ...

● Deterministic finite automaton – filters

● tf–idf (term frequency–inverse document frequency)

● Naive Bayes – classifier

16/17

E-mail audit: the project!

● In the future: use the GPU to run KNN and the automata ...

● With a GPU, everything should get fast ...

● Next step: 100% accuracy?

https://github.com/CoolerVoid/email_audit

17/17

Thanks

● https://github.com/CoolerVoid