spam.ppt

Upload: rajeev-hatwar

Post on 04-Jun-2018


  • 8/14/2019 Spam.ppt

    1/34

    Text Categorization

    Moshe Koppel

    Lecture 10: Spam Detection

    Some slides from Joshua Goodman


    Obligatory Scare Slide

    There's lots of spam

    The proportion of spam is growing: it will soon exceed 100% of all email sent

    It costs the world gazillions of dollars

    Spam is BAD

    (Actually, lately it looks like spam email has been mostly defeated.)


    Kinds of spam

    Active spam: ads and scams
      email
      chatbots
      comment bots

    Passive spam: websites
      link farms for SEO
      AdSense parking lots

    Differences between these are increasingly artificial


    Special Issues

    Spam detection is basically a text cat problem, but there are some special issues:

    Collecting data: non-spam email is private

    Asymmetry: must never classify good mail as spam

    Adversarial: spammers try to defeat filters


    Collecting Data

    Standard collections
      SpamAssassin Corpus
      TREC corpora

    Use your own email
      Might not reflect the world

    Gmail has user feedback
      LOTS of examples
      Haphazardly labeled
      How much info do they keep about each email?


    Problem of False Positives

    False positives are more costly than false negatives

    Research must report recall-precision curves; the key operating point is precision ~ 1
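To make the "precision ~ 1" operating point concrete, here is a minimal sketch that computes spam precision and recall at a score threshold and then reports the best recall achievable with no good mail flagged. The scores and labels are made-up illustration data, and the function names are hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for the spam class at a given score threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def recall_at_full_precision(scores, labels):
    """Best recall among thresholds where no good mail is flagged (precision 1)."""
    best = 0.0
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if p == 1.0:
            best = max(best, r)
    return best


# Hypothetical filter scores (higher = more spam-like); 1 = spam, 0 = good mail.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.10]
labels = [1, 1, 1, 0, 1, 0]
print(recall_at_full_precision(scores, labels))
```

Sweeping the threshold over all observed scores traces out the recall-precision curve; the single number returned here is the point on that curve that matters most when false positives are unacceptable.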


    Adversarial Problem

    Spammers reverse engineer global filters; use nasty tricks to circumvent them

    This is what makes spam detection an interesting problem


    Basic Spam

    Let's start with some garden-variety spam

    This is easily detected by standard text cat tricks


    It cost you nothing (Yes! $0) to give Us a call, We will contact You back

    Absolutely No exams/Tests/classes/books/Interviews No Pre-School qualification Needed!

    -----------------------------
    Inside USA: 1-718-989-5XXX
    0utside USA: +1-718-989-5XXX
    -----------------------------

    Degree, Bacheelor, masteer
    MBA, PhDD available in the field of your choice
    that's Right, You can even become a doctor & receive all the benefits Thatomes With it!

    Please Leave Below 3 INFO in voicemail:

    1) your Name
    2) your Country
    3) your Phone No. (with Countrycode)

    Call Now! 24 hours a day, 7 Days a week to recieve Your call


    Most Honorable Sir,

    I am Ehud Olmert, formerly the Prime Minister of Israel. I URGENTLY REQUIRE YOUR ASSISTANCE IN A MOST DISCRETE MATTER. As a result of certain events in my country, it has become necessary for me to transfer a considerable sum of cash to a foreign bank account. I turn to you as a MOST HONORABLE AND TRUSTED PERSON for your discrete assistance.

    The total amount involved is THIRTY MILLION NEW ISRAELI SHEKELS only [30,000.000.00 NIS] and we wish to transfer this money into safe foreigners account abroad. I am only contacting you as a foreigner because this money cannot be approved to a local person here, but to a foreigner who has information about the account, which I shall give to you upon your positive response. I am revealing this to you with believe in God that you will never let me down in this business, you are the FIRST AND THE ONLY PERSON that I am contacting for this business, so please reply urgently so that I will inform you the next step to take urgently.

    At the conclusion of this business, you will be given 40% of the total amount, 50% will be for us while 10% will be for the expenses both parties may incurred during this transaction. PLEASE, TREAT THIS PROPOSAL AS TOP SECRET.


    Early Work: Sahami et al. 98

    Learner: Naïve Bayes

    Feature Set: Words, Phrases, Structural Features

    Feature Selection: top 500 by information gain

    Evaluation Data: ~1700 Messages, ~88% Spam

    Results: Spam precision 100%, Spam recall 98.3%
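The recipe on this slide (select features by information gain, then train naive Bayes) can be sketched as follows. This is a toy illustration under stated assumptions, not the system Sahami et al. built: it uses a Bernoulli (word present or absent) model with Laplace smoothing, word features only, and a four-message made-up training set. All function names are hypothetical.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a group with pos spam and neg good messages."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h


def info_gain(word, docs, labels):
    """Reduction in label entropy from knowing whether word occurs in a message."""
    n = len(docs)
    base = entropy(sum(labels), n - sum(labels))
    groups = ([y for d, y in zip(docs, labels) if word in d],
              [y for d, y in zip(docs, labels) if word not in d])
    cond = sum(len(g) / n * entropy(sum(g), len(g) - sum(g)) for g in groups if g)
    return base - cond


def train_nb(docs, labels, k=500):
    """Keep the top-k words by information gain, then estimate P(word | class)."""
    vocab = sorted({w for d in docs for w in d})
    feats = sorted(vocab, key=lambda w: -info_gain(w, docs, labels))[:k]
    n = {1: sum(labels), 0: len(labels) - sum(labels)}
    # Laplace-smoothed probability that a word is present, per class.
    p = {c: {w: (sum(1 for d, y in zip(docs, labels) if y == c and w in d) + 1)
                / (n[c] + 2) for w in feats} for c in (0, 1)}
    return feats, n, p


def spam_log_odds(model, doc):
    """log P(spam | doc) - log P(good | doc); positive means spam is more likely."""
    feats, n, p = model
    score = math.log((n[1] + 1) / (n[0] + 1))  # smoothed prior odds
    for w in feats:
        for c, sign in ((1, 1), (0, -1)):
            prob = p[c][w] if w in doc else 1 - p[c][w]
            score += sign * math.log(prob)
    return score


# Made-up training data: each message is a set of words; 1 = spam, 0 = good mail.
docs = [{"free", "money", "now"}, {"free", "degree", "call"},
        {"meeting", "tomorrow", "notes"}, {"lecture", "notes", "call"}]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(spam_log_odds(model, {"free", "money"}) > 0)
```

To respect the asymmetry between error types, a real filter would flag a message only when the log-odds exceed a large positive threshold rather than zero.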


    Early Work: Sahami et al. 98, Hand-Crafted Features

    35 Phrases
      "Free Money"
      "Only $"
      "be over 21"

    20 Domain-Specific Features
      Domain type of sender (.edu, .com, etc.)
      Sender name resolutions (internal mail)
      Has attachments
      Time received
      Percent of non-alphanumeric characters in subject
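A few of the hand-crafted features listed above are easy to compute directly. The sketch below is an illustration only: the function name, the feature keys, and the choice to ignore whitespace when computing the subject percentage are assumptions, and only three of the 35 phrases are known from the slide.

```python
PHRASES = ["free money", "only $", "be over 21"]  # the three phrases named on the slide


def extract_features(sender, subject, body, has_attachment=False):
    """Return a dict with a handful of phrase and domain-specific features."""
    text = body.lower()
    feats = {f"phrase:{p}": p in text for p in PHRASES}
    # Domain type of sender: the last label of the address, e.g. "edu" or "com".
    domain = sender.rsplit("@", 1)[-1]
    feats["sender_tld"] = domain.rsplit(".", 1)[-1].lower()
    feats["has_attachment"] = has_attachment
    # Percent of non-alphanumeric characters in the subject (whitespace ignored).
    chars = [c for c in subject if not c.isspace()]
    non_alnum = sum(1 for c in chars if not c.isalnum())
    feats["subject_non_alnum_pct"] = 100.0 * non_alnum / len(chars) if chars else 0.0
    return feats


example = extract_features("deals@example.com", "FREE $$$ !!!",
                           "Get free money, only $ 5")
print(example)
```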


    Later Studies

    The early work was followed by the usual stream of extended feature sets and fancier learning methods (e.g. SVM)

    It is now common to use over 100,000 features

    Learning methods for huge data sets must be very efficient (online algorithms)

    Methods must be adaptive
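As one example of what "online" and "adaptive" can mean in practice, here is a minimal sketch of a mistake-driven perceptron that sees each message once and updates sparse weights immediately. The perceptron is chosen only for brevity; the slides do not name a specific online algorithm, and the class name and data are hypothetical.

```python
class OnlinePerceptron:
    """Mistake-driven linear classifier with sparse, per-word weights."""

    def __init__(self):
        self.w = {}  # word -> weight; only words seen in a mistake are stored

    def score(self, tokens):
        return sum(self.w.get(t, 0.0) for t in set(tokens))

    def update(self, tokens, is_spam):
        """Process one labeled message; adjust weights only if it was misclassified."""
        predicted_spam = self.score(tokens) > 0
        if predicted_spam != is_spam:
            step = 1.0 if is_spam else -1.0
            for t in set(tokens):
                self.w[t] = self.w.get(t, 0.0) + step


# Made-up stream of (words, is_spam) pairs, e.g. arriving as user feedback.
stream = [
    (["free", "money"], True),
    (["lecture", "notes"], False),
    (["free", "degree"], True),
    (["meeting", "notes"], False),
]
clf = OnlinePerceptron()
for tokens, is_spam in stream:
    clf.update(tokens, is_spam)
print(clf.score(["free", "offer"]) > 0)
```

Each update touches only the words in the current message, so the cost per message does not grow with the size of the feature space, and new feedback shifts the weights without retraining from scratch.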


    How to Beat an Adaptive Spam Filter (Graham-Cumming 04)

    Use machine learning to discover words that beat an adaptive filter
      Take a message that is near the spam threshold
      Send it to the target filter 10,000 times, each time adding 5 random words
      Train an evil filter to learn which messages beat the target filter
      Use the evil filter to modify new spam messages

    Found single-word additions that get new spam by the filter
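The probing loop described above can be simulated in a few lines. Everything here is a toy stand-in: the target filter is a hypothetical linear word-weight model, the dictionary is made up, and the "evil filter" is reduced to a per-word pass rate instead of a trained classifier.

```python
import random
from collections import Counter

# Hypothetical target filter (an assumption for illustration only).
TARGET_WEIGHTS = {"viagra": 3.0, "free": 2.0, "meeting": -2.5, "lecture": -0.1}
THRESHOLD = 4.8


def target_flags_spam(words):
    return sum(TARGET_WEIGHTS.get(w, 0.0) for w in set(words)) >= THRESHOLD


def probe(base_message, dictionary, trials=10_000, extra=5, seed=0):
    """Send the message repeatedly with random added words; return per-word pass rates."""
    rng = random.Random(seed)
    sent, passed = Counter(), Counter()
    for _ in range(trials):
        added = rng.sample(dictionary, extra)
        got_through = not target_flags_spam(base_message + added)
        for w in added:
            sent[w] += 1
            passed[w] += got_through  # True counts as 1
    return {w: passed[w] / sent[w] for w in sent}


dictionary = ["meeting", "lecture", "apple", "river", "window", "garden",
              "pencil", "orange", "castle", "yellow", "silver", "market"]
rates = probe(["viagra", "free"], dictionary)
best_word = max(rates, key=rates.get)
print(best_word, rates[best_word])
```

In this toy setup a single added word is enough to carry the spam past the filter, which mirrors the finding reported on the slide; the attacker never needs to see the filter's weights, only its accept or reject decisions.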


    Other Tricks

    Fill messages with real text taken from books, sites, etc.

    Can even generate real-looking texts using Markovian language models
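For readers unfamiliar with Markovian language models, the following minimal sketch shows how a word-level Markov chain produces plausible-looking filler text from a source passage. The function names and the sample text are assumptions; the source text must contain more words than the chain order.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` words to the words that followed it in the source."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, length=20, seed=None):
    """Random walk over the chain, restarting from a random state at dead ends."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            state = rng.choice(sorted(chain))
            out.extend(state)
            continue
        nxt = rng.choice(followers)
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out[:length])


source = "the cat sat on the mat and the dog sat on the rug"
chaff = generate(build_chain(source), length=12, seed=1)
print(chaff)
```

Because every word and most word transitions come from genuine text, the output has word statistics close to those of legitimate mail, which is exactly what makes it useful as chaff.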


    The Hitchhiker Chaffer

    Content Chaff
      Random passages from the Hitchhiker's Guide
      Footers from valid mail

    "This must be Thursday," said Arthur to himself, sinking low over his beer, "I never could get the hang of Thursdays."

    Express yourself with MSN Messenger 6.0


    Hitchhiker Chaffer's Later Work

    There is nothing fancy about this spam

    "A spam filter will catch that in its sleep" (anonymous)

    Or maybe not


    Hitchhiker Chaffer's Later Work

    Hidden Text
    Content Chaff
    URL Spamming

    Also included a number of unusual statements made by candidates during, On display? I eventually had to go down to the cellar to find them.

    http://join.msn.com/?Page=features/es


    More Tricks

    Encoded Text

    Distorted Text


    Secret Decoder Ring Dude

    Another spam that looks easy

    Is it?


    Secret Decoder Ring Dude

    Character Encoding

    HTML word breaking

    Pharmacy

    Products
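Both tricks on this slide hide words such as "Pharmacy" and "Products" from a word-based filter while the mail client still renders them normally. A minimal sketch of undoing them before tokenizing, assuming the obfuscation uses HTML comments, empty tags, and numeric character references (the sample string is made up):

```python
import html
import re


def normalize_html(source):
    """Strip comments and tags, then decode character references."""
    no_comments = re.sub(r"<!--.*?-->", "", source, flags=re.DOTALL)
    # Tags are removed without inserting a space so that words split by a tag
    # are rejoined; the cost is that words separated only by a tag also merge.
    no_tags = re.sub(r"<[^>]+>", "", no_comments)
    return html.unescape(no_tags)


obfuscated = "Ph<!-- x -->arm<b></b>acy Pr&#111;du&#x63;ts"
print(normalize_html(obfuscated))
```

A regular expression is not a full HTML parser, so this only covers the simple cases; a filter could also feed both the raw and the normalized text to the classifier.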


    Diploma Guy

    Word Obscuring

    Dplmoia Pragorm

    Caerte a mroe prosoeprus
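The scrambled words on this slide keep their first and last letters and shuffle only the interior, which is why people can still read them. One way a filter might map them back to known words is to compare a signature that ignores interior order. This is a sketch under that assumption; the watch list is hypothetical and the technique is not claimed to be what any deployed filter used.

```python
def signature(word):
    """First letter, sorted interior letters, last letter (case-insensitive)."""
    w = word.lower()
    if len(w) <= 3:
        return w
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]


KNOWN = ["diploma", "program", "create", "a", "more", "prosperous"]  # hypothetical watch list
BY_SIGNATURE = {signature(w): w for w in KNOWN}


def unscramble(text):
    """Replace each token with the known word sharing its signature, if any."""
    return " ".join(BY_SIGNATURE.get(signature(t), t) for t in text.split())


print(unscramble("Dplmoia Pragorm"))
print(unscramble("Caerte a mroe prosoeprus"))
```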


    More of Diploma Guy

    Diploma Guy is good at what he does


    One Pretty Good Text Cat Method

    Optimally compress spam training examples

    Optimally compress non-spam training examples

    Check which compression method better compresses the suspicious message
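The three steps above can be approximated with an off-the-shelf compressor. The sketch below uses zlib as a stand-in for "optimal" compression (an assumption; the slides do not name a compressor) and measures how many extra bytes a message costs when appended to each training corpus. The corpora and messages are made up.

```python
import zlib


def compressed_size(text):
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus, message):
    """Additional compressed bytes needed once the message is appended to the corpus."""
    return compressed_size(corpus + "\n" + message) - compressed_size(corpus)


def classify(message, spam_corpus, ham_corpus):
    """Assign the message to whichever corpus lets it compress into fewer bytes."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


spam_corpus = "\n".join([
    "Get your university diploma now, no exams or classes needed",
    "Free money waiting for you, call now 24 hours a day",
    "Cheap pharmacy products shipped to your door, order now",
    "You can become a doctor and receive all the benefits",
])
ham_corpus = "\n".join([
    "The lecture on text categorization is moved to Thursday",
    "Please send me the slides from the meeting before noon",
    "I attached the draft of the paper for your comments",
    "Can we reschedule the seminar to next week instead",
])
print(classify("Call now to get your university diploma, no exams needed",
               spam_corpus, ham_corpus))
```

Note that zlib only looks back 32 KB, so for realistically sized corpora a compressor with a larger or adaptive model would be needed; the toy corpora here are far below that limit.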


    Why This Works

    Works at level of character n-grams

    Should be applied to html source

    Captures weird encodings, word distortions

    Probably using character n-grams with SVM would also work well
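To see why character n-grams are robust to word distortions, consider this minimal sketch: a misspelled word shares no whole-word feature with the original, yet most of its character trigrams survive. The function names and the example words are assumptions for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Multiset of all overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def overlap(a, b, n=3):
    """Fraction of a's n-grams (with multiplicity) that also occur in b."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    shared = sum((ga & gb).values())
    return shared / max(1, sum(ga.values()))


# Word-level match fails, but 5 of the 7 trigrams are shared.
print("pharrmacy" == "pharmacy", overlap("pharrmacy", "pharmacy"))
```

Applied to the raw HTML source, the same n-gram features also pick up encoding tricks directly, since sequences such as "&#1" become features in their own right.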


    But Spammers Aren't Sitting Around

    Embed text in images (can vary non-text parts of image)

    Also, just send link to spam site


    Text Cat isn't the only Trick

    Don't display images w/o user okay

    Blacklist IPs that spam comes from
      Can harm legitimate senders (zombies, etc.)

    Charge postage for email
      Cash
      Puzzles that waste CPU
      Task easy for humans, hard for computers
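One common way to realize "puzzles that waste CPU" is a hashcash-style proof of work: the sender must find a nonce whose hash has a required number of leading zero bits, while the recipient verifies with a single hash. The sketch below is an assumption about what such a puzzle could look like; the slide does not specify a scheme, and the stamp format and function names are hypothetical.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Number of zero bits at the start of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(message, difficulty=12):
    """Sender's side: brute-force a nonce (about 2**difficulty hashes on average)."""
    for nonce in count():
        digest = hashlib.sha256(f"{message}:{nonce}".encode("utf-8")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(message, nonce, difficulty=12):
    """Recipient's side: one hash is enough to check the stamp."""
    digest = hashlib.sha256(f"{message}:{nonce}".encode("utf-8")).digest()
    return leading_zero_bits(digest) >= difficulty


stamp = solve("to:alice@example.com")
print(verify("to:alice@example.com", stamp))
```

The cost is negligible for someone sending a few messages and adds up quickly for someone sending millions, which is the "postage" effect the slide is after.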


    [Diagram: Sender and Recipient exchanging Message, Challenge, and Response]


    CAPTCHAs

    Identify distorted characters

    Supposed to be easy for humans, hard for computers

    Actually, nowadays computers are better at it than humans


    Computers vs. Humans


    Slight Variation

    Fortunately, for now, humans are still better than computers at identifying character boundaries


    New CAPTCHAS
