Enhancing Scalability in Anomaly-based Email Spam Filtering - CEAS 2011
DESCRIPTION
Presentation at the CEAS 2011 international conference of the paper: Enhancing Scalability in Anomaly-based Email Spam Filtering
TRANSCRIPT
Anomaly-based spam filtering
Carlos Laorden
Billions of daily losses in
productivity
Infected computers
Stolen credentials
Food? NO!
Monty Python’s Flying Circus
WHAT YOU GOT, THEN? SPAM, EGG,
SPAM, SPAM, BACON AND
SPAM.
SPAM, SPAM, SPAM, BAKED BEANS AND
SPAM.
ANYTHING WITHOUT
SPAM?
I DON’T LIKE
SPAM!!
UGH!
Something that repeats and repeats until it becomes annoying
It is a
real problem for Information Security
We must
fight
Anti-spam methods
Pre-sending
New protocols
Post-sending
Increase sending costs
Increase risks for spammers
E-mail sender
E-mail content
Usually
supervised approaches
A significant
labelling effort is needed
But, is this
possible?
I mean, is this
possible...
...without
losing
accuracy
drastically?
YES
Anomaly Detection
SpamAssassin
Ling Spam
"this word has no interest"
[Figure: e-mails D1 to D11 plotted as points in a vector space with term axes t1, t2 and t3; an unlabelled e-mail is marked "??"]
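The term axes t1, t2, t3 and the points D1 to D11 suggest a vector space representation in which every e-mail becomes one point per term axis. A minimal sketch of that idea follows, assuming plain term-frequency weighting and a tiny made-up stop-word list; the slides do not state the weighting scheme, and `STOP_WORDS`, `tokenize`, `build_vocabulary` and `to_vector` are hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"this", "has", "no", "a", "the", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,!?;:\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_vocabulary(emails):
    """Sorted list of distinct terms; each term is one axis (t1, t2, ...)."""
    return sorted({term for email in emails for term in tokenize(email)})


def to_vector(email, vocabulary):
    """Term-frequency vector of one e-mail over the shared vocabulary."""
    counts = Counter(tokenize(email))
    return [counts[term] for term in vocabulary]


# Toy messages, not taken from SpamAssassin or Ling Spam.
emails = ["Cheap pills, cheap offer!", "Meeting agenda for the project"]
vocab = build_vocabulary(emails)
vectors = [to_vector(e, vocab) for e in emails]
```

Each entry of `vectors` is one point (D1, D2, ...) in the space spanned by `vocab`.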
Anomaly detection
d > threshold?
Manhattan distance
Euclidean distance
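The two distance measures named here can be written in a few lines. This is a sketch that assumes e-mails are already equal-length numeric vectors; the function names are illustrative.

```python
import math


def manhattan(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Square root of the sum of squared differences (L2 distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

For the vectors `[0, 0]` and `[3, 4]`, the Manhattan distance is 7 and the Euclidean distance is 5.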
Anomaly detection
Minimum distance
Maximum distance
Mean distance
Manhattan distance
Euclidean distance
10 different thresholds
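The decision rule suggested by these slides is: measure the distance from a new e-mail to every known e-mail, combine the distances with the minimum, maximum or mean, and compare the result with a threshold. The sketch below assumes the known set holds legitimate e-mails, so a large deviation is read as spam; the slides do not say which class models normality. The toy 2-D vectors, the threshold value and all function names are hypothetical.

```python
import math
from statistics import mean


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# The three combination rules listed on the slide.
COMBINERS = {"min": min, "max": max, "mean": mean}


def deviation(email, normal_set, distance=euclidean, combine="mean"):
    """Combine the distances from `email` to every vector in `normal_set`."""
    distances = [distance(email, known) for known in normal_set]
    return COMBINERS[combine](distances)


def is_anomalous(email, normal_set, threshold,
                 distance=euclidean, combine="mean"):
    """Flag the e-mail when its combined distance exceeds the threshold."""
    return deviation(email, normal_set, distance, combine) > threshold


# Toy data: three "normal" e-mails and one far-away newcomer.
legitimate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_email = [4.0, 5.0]
flagged = is_anomalous(new_email, legitimate, threshold=3.0)
```

Note that `deviation` computes one distance per known e-mail, so the cost of classifying a message grows linearly with the size of the known set.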
Anomaly detection
d < threshold
d > threshold
min
max
High processing overhead
1. Representation of the emails
2. Anomaly Detection
1.5 Data clustering
QT clustering
algorithm
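A minimal sketch of Quality Threshold (QT) clustering in its generic form: grow one candidate cluster per seed, never letting the cluster diameter exceed the threshold, keep the largest candidate, remove its points and repeat. Whether the paper uses exactly this variant is not stated in the slides. Replacing each cluster with its centroid is an assumption about how clustering shrinks the set of e-mails to compare against, and the helper names are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def grow_candidate(seed, points, threshold, distance):
    """Grow a cluster of indices from `seed` while the diameter <= threshold."""
    cluster = [seed]
    remaining = [i for i in range(len(points)) if i != seed]

    def reach(i):
        # Largest distance from point i to any current cluster member.
        return max(distance(points[i], points[j]) for j in cluster)

    while remaining:
        best = min(remaining, key=reach)
        if reach(best) > threshold:
            break  # adding any point would exceed the quality threshold
        cluster.append(best)
        remaining.remove(best)
    return cluster


def qt_cluster(points, threshold, distance=euclidean):
    """Repeatedly extract the largest candidate cluster."""
    remaining = list(points)
    clusters = []
    while remaining:
        candidates = [grow_candidate(s, remaining, threshold, distance)
                      for s in range(len(remaining))]
        best = set(max(candidates, key=len))
        clusters.append([p for i, p in enumerate(remaining) if i in best])
        remaining = [p for i, p in enumerate(remaining) if i not in best]
    return clusters


def centroid(cluster):
    """Coordinate-wise mean, used as the cluster's single representative."""
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]


# Toy data: two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0]]
clusters = qt_cluster(points, threshold=2.0)
representatives = [centroid(c) for c in clusters]
```

With an infinite threshold (`math.inf`) every point lands in a single cluster, which is the extreme case of compression.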
Minimum distance
Maximum distance
Mean distance
Manhattan distance
Euclidean distance
10 thresholds
QT: 1.50, 1.75, 2.00, ∞
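Results
Dataset        Distance measure   Quality threshold   % Average reduction
Ling Spam      Euclidean          1.50                13.21%
Ling Spam      Euclidean          1.75                57.10%
Ling Spam      Euclidean          2.00                89.72%
Ling Spam      Euclidean          ∞                   99.94%
Ling Spam      Manhattan          1.50                33.75%
Ling Spam      Manhattan          1.75                46.78%
Ling Spam      Manhattan          2.00                62.47%
Ling Spam      Manhattan          ∞                   99.94%
SpamAssassin   Euclidean          1.50                89.78%
SpamAssassin   Euclidean          1.75                97.63%
SpamAssassin   Euclidean          2.00                99.34%
SpamAssassin   Euclidean          ∞                   99.96%
SpamAssassin   Manhattan          1.50                93.59%
SpamAssassin   Manhattan          1.75                96.81%
SpamAssassin   Manhattan          2.00                98.57%
SpamAssassin   Manhattan          ∞                   99.96%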
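One plausible reading of "% reduction" is the share of vectors that no longer need to be compared once each cluster is replaced by a single representative. The slides do not define the measure, so the formula below is an assumption, and the counts in the example are made up for illustration.

```python
def reduction_percentage(n_original, n_representatives):
    """Percentage of vectors removed when clusters replace single e-mails."""
    return 100.0 * (n_original - n_representatives) / n_original


# Illustrative counts only: 1000 e-mails compressed into 100 representatives.
example = reduction_percentage(1000, 100)
```

Under this reading, an infinite quality threshold leaves one representative per set of e-mails, so the reduction approaches but never reaches 100%.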
SpamAssassin
Detects more than 95% of junk emails
Fewer than 5% of
legitimate emails misclassified
Ling Spam
Detects more than 95% of junk emails
An improvable 10% of
legitimate emails misclassified
SpamAssassin
Distance measure   Previous work   Clustering reduction
Euclidean          93.99%          94.39%
Manhattan          96.50%          95.37%
Ling Spam
Distance measure   Previous work   Clustering reduction
Euclidean          95.02%          95.54%
Manhattan          83.85%          89.60%
Suitable to
overcome the amount of unclassified spam e-mails
Will we see the END of spam?