The Fight against Spam - A Machine Learning Approach
ELPUB 2007, Vienna. The Fight against Spam - A Machine Learning Approach. Jiri Hynek ([email protected]), Karel Jezek ([email protected]). www.textmining.cz. Contents: Stats 101, Today's Spam Types, Spammer Tricks, Text-Based Spam Filter Implementation, Results.
The Fight against Spam - A Machine Learning Approach
Jiri Hynek ([email protected]), Karel Jezek ([email protected])
ELPUB 2007, Vienna
www.textmining.cz
Contents:
• Stats 101
• Today's Spam Types
• Spammer Tricks
• Text-Based Spam Filter Implementation
• Results
Contents:
Spamming is publishing:
• Web spam ("comment spam") on blogs, (unmoderated) forums, and wikis. Why: to trigger higher page ranking!
• Unsolicited marketing spam in our e-mails: info dissemination to the public. Why: to sell products!
A bit of Terminology:
"Canned meat made largely from pork"
• Ham vs. Spam (Spam mail)
• UCE (Unsolicited Commercial Email)
• UBM (Unsolicited Bulk Mail)
• EMP (Excessive Multi-Posting)
• Junk mail
• Bulk email
Stats 101
Top five spam categories:
• Online pharmacies: 20.0%
• Mortgage refinancing: 9.7%
• Investment/financial services: 9.0%
• Male products (\/i@gra, CI@1i$): 8.7%
• Discount computer software: 6.9%
Source: Communications of the ACM, February 2007, Vol. 50, No. 2
Stats 101
• 1998: spam was a mere 10% of overall mail volume
• Now: 80% (Communications of the ACM, February 2007, Vol. 50, No. 2)
• Average spammers' revenue: $1 per 45,000 spams dispatched
• A database of 100 million e-mail addresses costs 100 dollars, spam software included
(www.symantec.com)
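A back-of-the-envelope sketch of what these two figures imply. It assumes that the purchased database holds addresses, that each address is mailed exactly once, and that there are no other costs; none of this is stated on the slide.

```python
# Figures quoted on the slide
SPAMS_PER_DOLLAR = 45_000        # spams dispatched per $1 of revenue
DATABASE_SIZE = 100_000_000      # entries in the purchased database
DATABASE_COST = 100              # dollars, spam software included

# Assumption: one message per entry, no sending costs.
revenue_one_pass = DATABASE_SIZE / SPAMS_PER_DOLLAR
break_even_spams = DATABASE_COST * SPAMS_PER_DOLLAR

print(f"Revenue from one pass over the database: ${revenue_one_pass:,.2f}")
print(f"Spams needed to recoup the purchase: {break_even_spams:,}")
```

Under these assumptions the purchase pays for itself after 4.5 million messages, a small fraction of a single pass over the database.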
Today's Spam Types
Text Spam
Today's Spam Types
Text Spam
Commonly used phrases filtered out by antispam filters (and words to avoid, of course):
• Free!
• 50% off!
• Click Here
• Call now!
• Subscribe
• Earn $
• Discount!
• Eliminate Debt
• Double your income
• You're a Winner!
• Reverses Aging
• Hidden
• Information you requested
• Stop / Stops
• Lose Weight
• Multi level Marketing
• Million Dollars
• Opportunity
• Compare
• Removes
• Collect
• Amazing
• Cash Bonus
• Promise You
• Credit
• Loans
• Satisfaction Guaranteed
• Serious Cash
• Search Engine Listings
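A phrase list like this is typically applied as a simple case-insensitive lookup. The sketch below is illustrative only: the function name and the shortened phrase list are assumptions, and real antispam products combine such matches with many other signals.

```python
import re

# A few phrases taken from the slide; a real list would be much longer.
SPAM_PHRASES = ["free!", "click here", "call now!", "eliminate debt",
                "double your income", "lose weight", "satisfaction guaranteed"]

def matched_phrases(message: str) -> list[str]:
    """Return the listed phrases that occur in the message (case-insensitive)."""
    # Collapse whitespace so line breaks and extra spaces do not hide a phrase.
    text = re.sub(r"\s+", " ", message.lower())
    return [p for p in SPAM_PHRASES if p in text]

hits = matched_phrases("Lose   weight fast. CLICK HERE, satisfaction guaranteed")
print(hits)
```

Exact phrase matching of this kind is what the spammer tricks on the later slides are designed to defeat.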
Today's Spam Types
Image-Based Spam
Today's Spam Types
Image-Based Spam in our mailboxes

| | June 2005 | June 2006 |
|---|---|---|
| Overall share in spam | 1% | 12% |
| New spam domain originating every | 48 hours | 4 hours |
| Daily spam volume | 30,000 million | 55,000 million |
Today's Spam Types
Phishing
Today's Spam Types
Captcha - fighting web spam
Common Spammer Tricks
Tricks to fool statistical spam filters:
• Avoidance of keywords (such as stock, Viagra, etc.)
• Frequent changes of the sender's address
• Message encoding (such as base64, normally used to carry binary content through text-only mail transport)
• Hashing (e.g. insertion of HTML tags into messages)
• Use of images instead of plain text (namely GIF, JPEG, and PNG)
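Two of these tricks, base64 encoding and inserted HTML tags, can be undone before the text reaches a statistical filter. The following is a minimal sketch of such a preprocessing step, assuming a UTF-8 body; the helper names are illustrative, and a production filter would use a proper MIME and HTML parser rather than regular expressions.

```python
import base64
import re

def strip_html(text: str) -> str:
    """Remove HTML comments and tags so split-up keywords are rejoined."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    return re.sub(r"<[^>]+>", "", text)

def decode_base64_body(body: str) -> str:
    """Decode a base64-encoded body; assumes UTF-8 text inside."""
    return base64.b64decode(body).decode("utf-8", errors="replace")

# A keyword broken up by empty tags and a comment, then base64-encoded.
encoded = base64.b64encode(b"Buy Vi<b></b>ag<!-- x -->ra now").decode("ascii")
plain = strip_html(decode_base64_body(encoded))
print(plain)
```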
New Spammer Tricks
Character Hashing:
I finlaly was able to lsoe the wieght I have been sturggling to lose for years! And I couldn't bileeve how simple it was! Amizang pacth makes you shed the ponuds! It's Guanarteed to work or your menoy back!
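The scrambled words in this example keep their first and last letters and only shuffle the interior. One possible countermeasure, sketched here as an assumption rather than as the authors' method, is to compare words by a signature made of the first letter, the sorted interior letters, and the last letter. The vocabulary is a hypothetical sample, and punctuation handling is left out.

```python
def signature(word: str) -> str:
    """First and last letter kept in place, interior letters sorted."""
    w = word.lower()
    if len(w) <= 3:
        return w
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]

# Hypothetical mini-vocabulary of words a filter cares about.
VOCABULARY = ["finally", "lose", "weight", "struggling", "believe",
              "amazing", "patch", "pounds", "guaranteed", "money"]
BY_SIGNATURE = {signature(w): w for w in VOCABULARY}

def unscramble(text: str) -> str:
    """Replace tokens whose signature matches a vocabulary word."""
    return " ".join(BY_SIGNATURE.get(signature(t), t) for t in text.split())

restored = unscramble("Amizang pacth makes you shed the ponuds")
print(restored)
```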
New Spammer Tricks
Keyword masking by repeating characters: Buuuyyyy cheeeeaaap viaaagraaa
Word obfuscations:
• \/laGr@
• Need a{} Dpiloma?
• sh1pp1ng //orldwide
• S0ft T4bs
• Ci@li$
• repl1ca w4tches from r0lex
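Repeated characters and look-alike substitutions can both be reduced by normalizing text before matching. The sketch below assumes a small, hypothetical substitution table and collapses every run of a repeated letter to one letter; because that also shortens legitimate doubles ("shipping" becomes "shiping"), keywords have to be normalized the same way before comparison. Ambiguous characters (for example "1" standing for either "i" or "l") are not resolved here.

```python
import re

# Hypothetical substitution table; multi-character look-alikes come first.
LOOKALIKES = [("\\/", "v"), ("//", "w"), ("@", "a"), ("$", "s"),
              ("0", "o"), ("1", "i"), ("4", "a")]

def normalize(text: str) -> str:
    """Lowercase, undo look-alike characters, collapse repeated letters."""
    t = text.lower()
    for fake, real in LOOKALIKES:
        t = t.replace(fake, real)
    return re.sub(r"([a-z])\1+", r"\1", t)

print(normalize("Buuuyyyy cheeeeaaap viaaagraaa"))
print(normalize("repl1ca w4tches from r0lex"))
```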
New Spammer Tricks
Word obfuscations: V I A G R A
• V: V v \/ (3 variations)
• I: I i 1 l | ï ì : Ì Î Í Ï (12 variations)
• A: A a @ /\ á à â ã ä å æ À Á Â Ã Ä Å (17 variations)
• G: G g (2 variations)
• R: R r ® (3 variations)
• A: A a @ /\ á à â ã ä å æ À Á Â Ã Ä Å (17 variations)
There are 62,424 (3 x 12 x 17 x 2 x 3 x 17) ways to portray the name Viagra.
In fact, there are 600,426,974,379,824,381,952 ways to spell Viagra.
Source: http://cockeyed.com/lessons/viagra/viagra.html
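The 62,424 figure is the product of the number of variants per letter. A short check, using the variant lists as they appear on the slide:

```python
from math import prod

I_FORMS = ["I", "i", "1", "l", "|", "ï", "ì", ":", "Ì", "Î", "Í", "Ï"]
A_FORMS = ["A", "a", "@", "/\\", "á", "à", "â", "ã", "ä", "å", "æ",
           "À", "Á", "Â", "Ã", "Ä", "Å"]

# One entry per letter position of the word, in order.
WORD = [("V", ["V", "v", "\\/"]), ("I", I_FORMS), ("A", A_FORMS),
        ("G", ["G", "g"]), ("R", ["R", "r", "®"]), ("A", A_FORMS)]

counts = [len(forms) for _, forms in WORD]
total = prod(counts)
print(counts, total)
```

A blacklist would need one entry per combination, which is why exact keyword matching does not scale against this trick.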
New Spammer Tricks
ASCII Art:
      \|||||/
      ( o o )
-ooO--(_)--Ooo--
        / \
New Spammer Tricks
ASCII Art:
New Spammer Tricks
Good word attacks (Bayesian poisoning)
Russa says McGwire belongs in Hall AP - 35 minutes ago One year on, the face live! EDITORS' BLOG CNN.com AP Action on Elder Abuse Politics My Sources Weather Alerts Back Security SPACE.com The council is now proposing to increase the annual fee to nurses Freeman dies AFP Pope calls for Islam dialogue "There's a lot of theoreticalCSMonitor.com Last Updated: Tuesday, 28 November 2006, 23:13 GMT Bad rapto top ^^ Five girls killed in Iraqi clash This is where a little bit of help28, 6:33 AM ET Wales Lottery Video: Bush Praises Estonia As War on Terror AllyANALYSIS Mucking about? Hazards Podcasts ELSEWHERE ON THE BBC At the same timeVictims Were Asleep Fashion Wire Daily AFP Football's elite Baby beluga dies athands-on situation." 'My mother was assaulted' Entertainment Search World Radio 2 Google together Mr Litvinenko's movements on 1 November, the day he fell...
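Padding a message with innocent news text like this works because a Bayesian filter multiplies per-word evidence, so enough "hammy" words can outweigh a few spammy ones. The sketch below illustrates the effect with a toy naive Bayes score; every probability in it is made up for illustration and is not taken from the authors' experiments.

```python
import math

# Hypothetical per-token probabilities P(token | spam) and P(token | ham).
P_SPAM = {"viagra": 0.20, "cheap": 0.10,
          "council": 0.001, "weather": 0.002, "football": 0.002}
P_HAM = {"viagra": 0.0005, "cheap": 0.01,
         "council": 0.02, "weather": 0.03, "football": 0.03}

def spam_probability(tokens, prior_spam=0.5):
    """Naive Bayes posterior computed from summed log-likelihood ratios."""
    log_odds = math.log(prior_spam / (1 - prior_spam))
    for t in tokens:
        if t in P_SPAM:  # unknown tokens are ignored in this sketch
            log_odds += math.log(P_SPAM[t] / P_HAM[t])
    return 1 / (1 + math.exp(-log_odds))

plain = spam_probability(["cheap", "viagra"])
padded = spam_probability(["cheap", "viagra", "council", "weather", "football"])
print(f"without padding: {plain:.4f}, with padding: {padded:.4f}")
```

With these made-up numbers, three added news words are enough to pull the score below 0.5.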
New Spammer Tricks
Good word attacks
A Filter to Fight Text-Based Spam
It's just another Short Document Classification Problem:
• The Itemsets Filter
• Plain Bayes Filter
• LSI Filter
• SVM Filter
• GZip (Compression-based) Filter
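The slides do not describe how the GZip filter is built. As an illustration of the general idea behind compression-based classification, the sketch below labels a message by which training corpus it compresses better with: a message that resembles a corpus adds few bytes when appended to it. It uses Python's `zlib` (the DEFLATE algorithm that gzip is based on); the corpora and function names are hypothetical.

```python
import zlib

def compressed_size(text: str) -> int:
    """Size in bytes of the text after DEFLATE compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Pick the class whose corpus grows least when the message is appended."""
    extra_spam = (compressed_size(spam_corpus + " " + message)
                  - compressed_size(spam_corpus))
    extra_ham = (compressed_size(ham_corpus + " " + message)
                 - compressed_size(ham_corpus))
    return "spam" if extra_spam < extra_ham else "ham"

# Tiny toy corpora; a real filter would train on thousands of messages.
SPAM = ("buy cheap pills online now. earn cash bonus today. "
        "click here to lose weight fast.")
HAM = ("the project meeting is moved to thursday. "
       "please review the attached report before lunch.")

print(classify("click here to lose weight fast", SPAM, HAM))
print(classify("please review the attached report", SPAM, HAM))
```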
Standard Spam Testing Collections
• PU1: A mixture of 481 spam messages and 618 legitimate messages
• PU123A: Four corpora, based on private mailboxes
• Enron Corpus: 200,399 unique messages collected by 158 users (mostly managers)
Itemsets Spam Filter: Results

| | 100 spams, 100 hams | 400 spams, 400 hams | Avg (100, 400) |
|---|---|---|---|
| FPI (%) | 19.00 | 20.61 | 19.81 |
| FNI (%) | 2.53 | 3.55 | 3.04 |
| TNI (%) | 97.47 | 97.80 | 97.64 |
| TPI (%) | 80.99 | 89.91 | 85.45 |

FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake.
FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter.
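The two error rates follow directly from the misclassification counts. A minimal sketch of the definitions above; the counts in the example call are hypothetical and are not taken from the reported experiments.

```python
def error_rates(ham_as_spam: int, n_ham: int, spam_as_ham: int, n_spam: int):
    """Return (FPI, FNI) in percent, following the definitions on the slide."""
    fpi = 100.0 * ham_as_spam / n_ham    # legitimate mail deleted by mistake
    fni = 100.0 * spam_as_ham / n_spam   # spam passing through the filter
    return fpi, fni

# Hypothetical counts for a test set of 100 hams and 100 spams.
fpi, fni = error_rates(ham_as_spam=19, n_ham=100, spam_as_ham=3, n_spam=100)
print(f"FPI = {fpi:.2f}%, FNI = {fni:.2f}%")
```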
SVM Spam Filter: Results

| | 100 spams, 100 hams | 400 spams, 400 hams | Avg (100, 400) |
|---|---|---|---|
| FPI (%) | 13.91 | 10.10 | 12.01 |
| FNI (%) | 1.18 | 2.19 | 1.69 |
| TNI (%) | 98.82 | 97.80 | 98.31 |
| TPI (%) | 86.09 | 89.91 | 88.00 |

FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake.
FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter.
GZip Spam Filter: Results

| | 100 spams, 100 hams | 400 spams, 400 hams | Avg (100, 400) |
|---|---|---|---|
| FPI (%) | 1.72 | 2.33 | 2.03 |
| FNI (%) | 30.28 | 27.31 | 28.80 |
| TNI (%) | 69.72 | 72.69 | 71.21 |
| TPI (%) | 98.28 | 97.67 | 97.98 |

FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake.
FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter.
…We will look into this in the near future
Light at the end of the tunnel?
Payment per e-mail?
Quite unlikely…
E-mail authentication by SIDF
• Sender ID Framework (by Microsoft)
• … registered list of servers of domain owners
• Confirmation of e-mail source domain (automatically, by ISPs)
• Protects 40% of legitimate email sent worldwide
• Helps combat phishing scams / domain spoofing (forging a sender's address)
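The core check behind this kind of sender authentication is whether the connecting server's IP address appears on the list the domain owner has registered. The sketch below shows that check against a hypothetical in-memory registry; a real deployment looks the list up in records the domain publishes in DNS, which is omitted here, and the domains and addresses are reserved documentation values.

```python
import ipaddress

# Hypothetical registry standing in for the records a domain owner publishes.
AUTHORIZED_NETWORKS = {
    "example.com": ["192.0.2.0/24"],
    "example.org": ["198.51.100.17/32"],
}

def sender_is_authorized(domain: str, sending_ip: str) -> bool:
    """True if the connecting server's IP is on the domain's registered list."""
    ip = ipaddress.ip_address(sending_ip)
    return any(ip in ipaddress.ip_network(net)
               for net in AUTHORIZED_NETWORKS.get(domain, []))

print(sender_is_authorized("example.com", "192.0.2.25"))
print(sender_is_authorized("example.com", "203.0.113.5"))
```

A message claiming to come from example.com but delivered from an unlisted address fails the check, which is what makes forged sender domains detectable.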
Light at the end of the tunnel?
DomainKeys Identified Mail (DKIM)
• Similar technology by Yahoo, Cisco Systems, Sendmail, PGP
• Based on digital signatures
• An official proposed standard of the Internet Engineering Task Force (IETF)
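A DKIM signature covers, among other things, a hash of the message body, which the receiver recomputes and compares. The sketch below shows only that body-hash step, following my reading of the "simple" body canonicalization (trailing empty lines removed, body ending in a single CRLF); it is an approximation for illustration, and the signing and verification with the domain's key pair are left out.

```python
import base64
import hashlib

def simple_body_canonicalize(body: bytes) -> bytes:
    """Drop trailing empty lines and make the body end with one CRLF."""
    while body.endswith(b"\r\n\r\n"):
        body = body[:-2]
    if not body.endswith(b"\r\n"):
        body += b"\r\n"
    return body

def body_hash(body: bytes) -> str:
    """Base64-encoded SHA-256 digest of the canonicalized body."""
    digest = hashlib.sha256(simple_body_canonicalize(body)).digest()
    return base64.b64encode(digest).decode("ascii")

print(body_hash(b"Hello\r\n"))
```

Because any change to the body changes the hash, a message altered in transit or forged without the signer's private key no longer matches its signature.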
Thank You For Your Attention
Questions?