spam.ppt
TRANSCRIPT
-
8/14/2019 Spam.ppt
1/34
Text Categorization
Moshe Koppel
Lecture 10: Spam Detection
Some slides from Joshua Goodman
-
2/34
Obligatory Scare Slide
There's lots of spam
The proportion of spam is growing; it will soon exceed 100% of all email sent
It costs the world gazillions of dollars
Spam is BAD
(Actually, lately it looks like spam email has been mostly defeated.)
-
3/34
Kinds of spam
Active spam: ads and scams
  email
  chatbots
  commentbots
Passive spam: websites
  link farms for SEO
  AdSense parking lots
Differences between these are increasingly artificial
-
4/34
Special Issues
Spam detection is basically a text cat problem, but there are some special issues:
Collecting data: non-spam email is private
Asymmetry: must never class good mail as spam
Adversarial: spammers try to defeat filters
-
5/34
Collecting Data
Standard collections
  SpamAssassin Corpus
  TREC corpora
Use your own email
  Might not reflect world
Gmail has user feedback
  LOTS of examples
  Haphazardly labeled
  How much info do they keep about each email?
-
6/34
Problem of False Positives
False positives are more costly than false negatives
Research must report recall-precision curves; the key point is where precision ~ 1
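A minimal sketch of what "precision ~ 1" means in practice: pick the lowest score threshold at which nothing good is flagged, then read off the recall. The scores and labels below are made up for illustration, and the function names are hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Spam precision and recall when messages scoring >= threshold are flagged."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(flagged)
    total_spam = sum(labels)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / total_spam if total_spam else 0.0
    return precision, recall


def lowest_safe_threshold(scores, labels, min_precision=1.0):
    """Lowest threshold whose precision reaches min_precision.

    Recall can only drop as the threshold rises, so the lowest
    acceptable threshold is also the one with the highest recall.
    """
    for threshold in sorted(set(scores)):
        precision, _ = precision_recall(scores, labels, threshold)
        if precision >= min_precision:
            return threshold
    return None


# Made-up filter scores (higher = more spam-like); 1 = spam, 0 = good mail.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
labels = [1, 1, 1, 0, 1, 0]

threshold = lowest_safe_threshold(scores, labels)
precision, recall = precision_recall(scores, labels, threshold)
print(threshold, precision, recall)
```

On this toy data one spam message scores below a good message, so insisting on perfect precision costs a quarter of the recall.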
-
7/34
Adversarial Problem
Spammers reverse engineer global filters; use nasty tricks to circumvent them
This is what makes spam detection an interesting problem
-
8/34
Basic Spam
Let's start with some garden variety spam
This is easily detected by standard text cat tricks
-
9/34
It cost you nothing (Yes! $0) to give Us a call, We will contact You back
Absolutely No exams/Tests/classes/books/Interviews No Pre-School qualification Needed!
-----------------------------
Inside USA: 1-718-989-5XXX
0utside USA: +1-718-989-5XXX
-----------------------------
Degree, Bacheelor, masteer
MBA, PhDD available in the field of your choice
that's Right, You can even become a doctor & receive all the benefits Thatomes With it!
Please Leave Below 3 INFO in voicemail:
1) your Name
2) your Country
3) your Phone No. (with Countrycode)
Call Now! 24 hours a day, 7 Days a week to recieve Your call
-
10/34
Most Honorable Sir,
I am Ehud Olmert, formerly the Prime Minister of Israel. I URGENTLY REQUIRE YOUR ASSISTANCE IN A MOST DISCRETE MATTER. As a result of certain events in my country, it has become necessary for me to transfer a considerable sum of cash to a foreign bank account. I turn to you as a MOST HONORABLE AND TRUSTED PERSON for your discrete assistance.
The total amount involved is THIRTY MILLION NEW ISRAELI SHEKELS only [30,000.000.00 NIS] and we wish to transfer this money into safe foreigners account abroad. I am only contacting you as a foreigner because this money cannot be approved to a local person here, but to a foreigner who has information about the account, which I shall give to you upon your positive response. I am revealing this to you with believe in God that you will never let me down in this business, you are the FIRST AND THE ONLY PERSON that I am contacting for this business, so please reply urgently so that I will inform you the next step to take urgently.
At the conclusion of this business, you will be given 40% of the total amount, 50% will be for us while 10% will be for the expenses both parties may incurred during this transaction. PLEASE, TREAT THIS PROPOSAL AS TOP SECRET.
-
11/34
Early Work: Sahami et al. 98
Learner: Naive Bayes
Feature Set: Words, Phrases, Structural Features
Feature Selection: top 500 by information gain
Evaluation Data: ~1700 Messages, ~88% Spam
Results: Spam precision 100%, Spam recall 98.3%
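For orientation, here is a minimal Naive Bayes text classifier. It is a sketch, not a reconstruction of the Sahami et al. system: it assumes a multinomial model with add-one smoothing over whitespace tokens, skips the information-gain feature selection, and trains on four made-up messages.

```python
import math
from collections import Counter


def train_naive_bayes(documents):
    """documents: list of (tokens, label) pairs. Returns a dict of counts."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for tokens, label in documents:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return {"class_counts": class_counts,
            "word_counts": word_counts,
            "vocabulary": vocabulary}


def classify(model, tokens):
    """Return the label with the highest log posterior."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)  # log prior
        for token in tokens:
            if token in model["vocabulary"]:  # words never seen in training are ignored
                # add-one (Laplace) smoothing avoids log(0)
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this example.
training = [
    ("free money call now".split(), "spam"),
    ("degree no exams call now".split(), "spam"),
    ("lecture notes for thursday".split(), "ham"),
    ("meeting moved to thursday".split(), "ham"),
]
model = train_naive_bayes(training)
print(classify(model, "call now for free degree".split()))
print(classify(model, "notes for the meeting".split()))
```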
-
12/34
Early Work: Sahami et al. 98, Hand Crafted Features
35 Phrases
  Free Money
  Only $
  be over 21
20 Domain Specific Features
  Domain type of sender (.edu, .com, etc)
  Sender name resolutions (internal mail)
  Has attachments
  Time received
  Percent of non-alphanumeric characters in subject
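A few of these domain-specific features can be computed with Python's standard `email` package. This is an illustrative sketch: the feature names and the sample message are invented, and not counting whitespace as non-alphanumeric is an assumption of this example.

```python
from email import message_from_string
from email.utils import parseaddr


def structural_features(raw_message):
    """Extract three hand-crafted features from a raw RFC 822 style message."""
    msg = message_from_string(raw_message)
    _, address = parseaddr(msg.get("From", ""))
    subject = msg.get("Subject", "")

    # Domain type of sender: the part after the last dot (.edu, .com, ...)
    domain_type = address.rsplit(".", 1)[-1].lower() if "." in address else ""

    # Percent of subject characters that are neither alphanumeric nor whitespace
    odd_chars = sum(1 for ch in subject if not ch.isalnum() and not ch.isspace())
    percent_non_alnum = 100.0 * odd_chars / len(subject) if subject else 0.0

    # A part with a filename is treated as an attachment
    has_attachment = any(part.get_filename() for part in msg.walk())

    return {"domain_type": domain_type,
            "percent_non_alnum_subject": percent_non_alnum,
            "has_attachment": has_attachment}


sample = (
    "From: Deals <deals@offers.example.com>\n"
    "Subject: FREE $$$ !!!\n"
    "\n"
    "Call now\n"
)
features = structural_features(sample)
print(features)
```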
-
13/34
Later Studies
The early work was followed by the usual stream of extended feature sets and fancier learning methods (e.g. SVM)
It is now common to use over 100,000 features
Learning methods for huge data sets must be very efficient (online algorithms)
Methods must be adaptive
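As one illustration of an efficient, adaptive online method, the sketch below uses a mistake-driven perceptron with sparse weights. The choice of perceptron is an assumption of this example; the slides do not name a specific algorithm. Each message is seen once and only the weights of the words it contains are touched, so cost per message does not depend on the total number of features.

```python
class OnlinePerceptron:
    """Sparse online perceptron over word features (1 = spam, 0 = good mail)."""

    def __init__(self):
        self.weights = {}  # only words that have been updated are stored
        self.bias = 0.0

    def score(self, text):
        words = set(text.lower().split())
        return self.bias + sum(self.weights.get(word, 0.0) for word in words)

    def predict(self, text):
        return 1 if self.score(text) > 0 else 0

    def update(self, text, label):
        """Adjust weights only when the current prediction is wrong."""
        if self.predict(text) == label:
            return
        step = 1.0 if label == 1 else -1.0
        for word in set(text.lower().split()):
            self.weights[word] = self.weights.get(word, 0.0) + step
        self.bias += step


# Made-up stream of user-labelled messages, processed one at a time.
stream = [
    ("free money now", 1),
    ("lecture notes attached", 0),
    ("free degree call now", 1),
    ("meeting notes for thursday", 0),
]
learner = OnlinePerceptron()
for text, label in stream:
    learner.update(text, label)

print(learner.predict("claim your free money now"))
print(learner.predict("notes from the lecture"))
```

Because `update` can be called whenever a user corrects the filter, the same loop keeps adapting as spam changes.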
-
14/34
How to Beat an Adaptive Spam Filter: Graham-Cumming 04
Use machine learning to discover words that beat an adaptive filter
  Take a message that is near the spam threshold
  Send it to the target filter 10,000 times, each time adding 5 random words
  Train an evil filter to learn which messages beat the target filter
  Use the evil filter to modify new spam messages
Found single word additions that get new spam by the filter
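The probing loop can be simulated on a toy scale. Everything here is a stand-in: the target filter is a hidden table of made-up word weights, the candidate word list is invented, and the "evil filter" is reduced to a per-word pass rate rather than a trained classifier.

```python
import random

# Toy target filter. The attacker never reads these weights, only the verdicts.
HIDDEN_WEIGHTS = {"viagra": 3.0, "free": 2.0, "thursday": -2.0, "lecture": -2.0}
SPAM_THRESHOLD = 4.5


def target_filter_blocks(words):
    """True if the target filter would classify the message as spam."""
    return sum(HIDDEN_WEIGHTS.get(word, 0.0) for word in words) >= SPAM_THRESHOLD


def probe(base_message, candidate_words, trials, words_per_trial, rng):
    """Send the message many times with random extra words; return pass rate per word."""
    seen = {word: 0 for word in candidate_words}
    passed = {word: 0 for word in candidate_words}
    for _ in range(trials):
        extra = rng.sample(candidate_words, words_per_trial)
        got_through = not target_filter_blocks(base_message + extra)
        for word in extra:
            seen[word] += 1
            passed[word] += got_through  # True counts as 1
    return {word: passed[word] / seen[word] for word in candidate_words if seen[word]}


base_message = ["free", "viagra"]  # scores 5.0, just over the threshold
candidates = ["thursday", "lecture", "apple", "river", "window", "garden",
              "pencil", "orange", "bridge", "candle", "silver", "market",
              "planet", "forest", "button", "rocket", "violin", "harbor",
              "ladder", "pillow"]

rates = probe(base_message, candidates, trials=10_000, words_per_trial=5,
              rng=random.Random(0))
best_words = sorted(rates, key=rates.get, reverse=True)[:2]
print(best_words)
```

In this toy setup a message gets through exactly when one of the two negatively weighted words is added, so those two words end up with a pass rate of 1.0 and any single one of them is enough to slip the base message past the filter.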
-
15/34
Other Tricks
Fill messages with real text taken from books, sites, etc.
Can even generate real-looking texts using Markovian language models
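A word-level bigram chain is about the simplest Markovian language model. The sketch below assumes whitespace tokenization and uses a short made-up source text; real chaff would be trained on much larger text.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Map each word to the list of words that follow it in the source text."""
    words = text.split()
    model = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        model[current_word].append(next_word)
    return model


def generate(model, start, length, rng):
    """Random walk over the chain: each word is drawn from the successors of the last."""
    output = [start]
    while len(output) < length and model.get(output[-1]):
        output.append(rng.choice(model[output[-1]]))
    return " ".join(output)


source_text = ("the filter reads the message and the spammer reads "
               "the filter and the message fools the filter")
model = build_bigram_model(source_text)
generated = generate(model, "the", 12, random.Random(0))
print(generated)
```

Every adjacent word pair in the output occurs somewhere in the source, which is why the result looks locally plausible to a word-based filter even though it says nothing.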
-
16/34
The Hitchhiker Chaffer
Content Chaff
  Random passages from the Hitchhiker's Guide
  Footers from valid mail
"This must be Thursday," said Arthur to himself, sinking low over his beer, "I never could get the hang of Thursdays."
Express yourself with MSN Messenger 6.0
-
17/34
Hitchhiker Chaffers
Later Work
There is nothing fancy about this spam
"A spam filter will catch that in its sleep" (anonymous)
Or maybe not
-
18/34
Hitchhiker Chaffers
Later Work
  Hidden Text
  Content Chaff
  URL Spamming
Also included a number of unusual statements made by candidates during, On display? I eventually had to go down to the cellar to find them.
http://join.msn.com/?Page=features/es
-
19/34
More Tricks
Encoded Text
Distorted Text
-
20/34
Secret Decoder Ring Dude
Another spam that looks easy
Is it?
-
21/34
Secret Decoder Ring Dude
Character Encoding
HTML word breaking
Pharmacy
Products
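To see why these two tricks hide words from a filter that reads raw HTML, the sketch below undoes them: it strips tags and comments, then decodes character references. The obfuscated sample string is invented, and the regular expression is a rough approximation rather than a real HTML parser.

```python
import html
import re

# Matches anything between < and >, including comments and empty tag pairs.
TAG_PATTERN = re.compile(r"<[^>]*>")


def normalize_html(source):
    """Recover the text a reader sees from word-broken, entity-encoded HTML."""
    without_tags = TAG_PATTERN.sub("", source)  # undo HTML word breaking
    return html.unescape(without_tags)          # undo character encoding


obfuscated = "Pha<!-- x -->rm&#97;cy Prod<b></b>ucts"
print("Pharmacy" in obfuscated)
print(normalize_html(obfuscated))
```

Tags are removed before decoding so that an encoded `&lt;` in the text is not mistaken for the start of a tag.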
-
22/34
Diploma Guy
Word Obscuring
Dplmoia Pragorm
Caerte a mroe prosoeprus
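The obscured words above keep their first and last letters and shuffle the rest, so a reader still recognizes them while an exact word match fails. One possible countermeasure, sketched here as an assumption rather than anything the slides propose, is to index a lexicon by a signature that ignores the order of the inner letters. The tiny lexicon is hypothetical.

```python
def signature(word):
    """First letter + sorted inner letters + last letter, lowercased."""
    word = word.lower()
    if len(word) <= 3:
        return word  # nothing to shuffle in very short words
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]


def build_index(lexicon):
    """Map each signature to the first lexicon word that produces it."""
    index = {}
    for word in lexicon:
        index.setdefault(signature(word), word)
    return index


def unscramble(text, index):
    """Replace each token by its lexicon match, or leave it lowercased if unknown."""
    return " ".join(index.get(signature(token), token.lower())
                    for token in text.split())


lexicon = ["diploma", "program", "create", "a", "more", "prosperous"]
index = build_index(lexicon)
print(unscramble("Dplmoia Pragorm", index))
print(unscramble("Caerte a mroe prosoeprus", index))
```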
-
23/34
More of Diploma Guy
Diploma Guy is good at what he does
-
24/34
One Pretty Good Text Cat Method
Optimally compress spam training examples
Optimally compress non-spam training examples
Check which compression method better compresses the suspicious message
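A rough approximation of this idea with an off-the-shelf compressor: measure how many extra bytes the message costs when appended to each class's training text, and pick the class where it is cheaper. Using `zlib` this way is a swapped-in simplification, not the optimal compression models the slide refers to, and the two tiny corpora are made up.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus, message):
    """Additional compressed bytes needed for the message, given the corpus."""
    return compressed_size(corpus + "\n" + message) - compressed_size(corpus)


def classify(message, spam_corpus, ham_corpus):
    """Assign the class whose training text makes the message cheaper to encode."""
    if extra_bytes(spam_corpus, message) < extra_bytes(ham_corpus, message):
        return "spam"
    return "ham"


spam_corpus = "\n".join([
    "Call now to receive your degree, no exams needed",
    "Free money, call now, 24 hours a day",
    "Get your diploma today, no classes, no books",
])
ham_corpus = "\n".join([
    "The lecture on text categorization is moved to Thursday",
    "Please send me the slides before the meeting",
    "Homework three is due next week",
])

print(classify("Call now to receive your degree", spam_corpus, ham_corpus))
print(classify("The lecture on text categorization", spam_corpus, ham_corpus))
```

zlib only looks back over a 32 KB window, so this sketch is only meaningful for small corpora; a compressor with an adaptive statistical model would be the closer match to the slide.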
-
25/34
Why This Works
Works at level of character n-grams
Should be applied to html source
Captures weird encodings, word distortions
Probably using character n-grams with SVM would also work well
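A small sketch of character n-gram extraction, to make concrete why a distorted word still shares features with the original. The example words and the choice of n = 3 are illustrative.

```python
def char_ngrams(text, n):
    """All overlapping substrings of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


original = set(char_ngrams("diploma", 3))
distorted = set(char_ngrams("dipl0ma", 3))
shared = original & distorted

print(char_ngrams("spam", 3))
print(sorted(shared))
```

A word-level feature sees "diploma" and "dipl0ma" as unrelated tokens; at the trigram level they still overlap on "dip" and "ipl", and the odd trigrams containing "0" become features in their own right.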
-
26/34
But Spammers Aren't Sitting Around
Embed text in images (can vary non-text parts of image)
Also, just send link to spam site
-
27/34
-
28/34
Text Cat isn't the only Trick
Don't display images w/o user okay
Blacklist IPs that spam comes from
  Can harm legitimate senders (zombies, etc.)
Charge postage for email
  Cash
  Puzzles that waste CPU
  Task easy for humans, hard for computers
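One way to realize "puzzles that waste CPU" is a proof-of-work stamp: the sender must find a number that makes a hash of the message start with a given count of zero bits, which is slow to find but cheap to check. The sketch below is in that spirit; the stamp format, the use of SHA-256, and the difficulty value are assumptions of this example.

```python
import hashlib


def leading_zero_bits(digest):
    """Number of zero bits at the start of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def stamp_is_valid(message, nonce, difficulty_bits):
    """Cheap check done by the recipient: one hash computation."""
    digest = hashlib.sha256(f"{message}:{nonce}".encode("utf-8")).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def mint_stamp(message, difficulty_bits):
    """Expensive search done by the sender: about 2**difficulty_bits hashes on average."""
    nonce = 0
    while not stamp_is_valid(message, nonce, difficulty_bits):
        nonce += 1
    return nonce


message = "to:alice@example.org"
nonce = mint_stamp(message, 12)
print(nonce, stamp_is_valid(message, nonce, 12))
```

The cost is negligible for someone sending a few messages and adds up quickly for someone sending millions, which is the asymmetry the postage idea relies on.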
-
29/34
Sender Recipient
Challenge
Response
Message
-
30/34
CAPTCHAS
Identify distorted characters
Supposed to be easy for humans, hard for computers
Actually, nowadays computers are better at it than humans
-
31/34
Computers vs. Humans
-
32/34
Slight Variation
Fortunately, for now, humans are still better than computers at identifying character boundaries
-
33/34
New CAPTCHAS
-
34/34