Bayesian Spam Filter By Joshua Spaulding

Post on 21-Jan-2016


Page 1: Bayesian Spam Filter

Bayesian Spam Filter

By

Joshua Spaulding

Page 2: Bayesian Spam Filter

Statement of Problem

“Spam email now accounts for more than half of all messages sent and imposes huge productivity costs…By 2007, Spam-stopping should grow to a $2.4 Billion Business.”

Technology Review 8/03

Page 3: Bayesian Spam Filter

Objective

Using Bayes’ rule I will attempt to classify an email message as spam or non-spam (ham). I will use a corpus of spam and ham to determine the probability that a new email is spam given the tokens in the message.

Page 4: Bayesian Spam Filter

Definition of Spam

Unsolicited automated email

Page 5: Bayesian Spam Filter

Bayes’ Rule

P(A|B) = P(B|A)P(A) / P(B)

P(A|B) is the conditional probability that event A occurs given that event B has occurred.

P(B|A) is the conditional probability that event B occurs given that event A has occurred.

P(A) is the probability of event A occurring.

P(B) is the probability of event B occurring.

Page 6: Bayesian Spam Filter

Bayes’ Rule

P(spam|token) = P(token|spam)P(spam) / P(token)

P(spam|token) – probability that the email is spam given a token

P(token|spam) – probability that the token exists given the email is spam

P(spam) – probability of an email being spam

P(token) – probability of the token in an email
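
As a rough illustration of the rule on this slide, the following Python sketch estimates P(spam|token) from message counts. The function name, the example counts, and the choice to count a token at most once per message are assumptions made for illustration; none of them come from the project itself.

```python
def token_spam_probability(spam_with_token, spam_total, ham_with_token, ham_total):
    """Estimate P(spam | token) with Bayes' rule from message counts.

    All four arguments are hypothetical message counts: how many spam/ham
    messages contain the token, and how many spam/ham messages exist in total.
    """
    if spam_total <= 0 or ham_total <= 0:
        raise ValueError("both corpora must contain at least one message")

    total = spam_total + ham_total
    p_spam = spam_total / total                            # P(spam)
    p_token_given_spam = spam_with_token / spam_total      # P(token | spam)
    p_token = (spam_with_token + ham_with_token) / total   # P(token)

    if p_token == 0:
        # Assumption: a token never seen in either corpus is treated as neutral.
        return 0.5
    return p_token_given_spam * p_spam / p_token


# Token seen in 40 of 100 spam messages and 10 of 100 ham messages.
print(token_spam_probability(40, 100, 10, 100))
```

With these made-up counts the estimate comes out at about 0.8, since 40 of the 50 messages containing the token are spam.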

Page 7: Bayesian Spam Filter

Project Design (orig)

Read in a large text file containing 1000 spam.

Read in a large text file containing 1000 ham.

Create a file for each corpus consisting of each token and its occurrence count in the corpus.

I’ll then create another file with each token and the probability that an email containing it is spam, using Bayes’ rule.

When an email arrives I will parse the email. For each token, I will look up the probability that the email is spam given that token. I’ll then combine all the probabilities to determine the probability that the email is spam.
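
The design does not say how the per-token probabilities are combined. One common choice, assumed here, is the naive Bayes combination that treats tokens as independent: p = (p1 · p2 · … · pn) / (p1 · … · pn + (1 − p1) · … · (1 − pn)). The Python sketch below computes it in log space; the function name and the clamping bounds are hypothetical.

```python
import math


def combine_probabilities(probs, lo=0.01, hi=0.99):
    """Combine per-token spam probabilities under an independence assumption.

    Computes prod(p) / (prod(p) + prod(1 - p)) using logarithms so that
    long messages do not underflow to zero.
    """
    log_spam = 0.0
    log_ham = 0.0
    for p in probs:
        p = min(max(p, lo), hi)        # clamp so log(0) never occurs
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)

    # The combined value equals sigmoid(log_spam - log_ham); pick the
    # numerically stable form for each sign of the difference.
    diff = log_spam - log_ham
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


print(combine_probabilities([0.9, 0.8, 0.5]))
```

A token with probability 0.5 leaves the result unchanged, which is why 0.5 works as a neutral value for unknown tokens.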

Page 8: Bayesian Spam Filter

Project Design

Create Narl model from 100 spam and 100 ham contained in two separate CSV files, using Narl’s built-in Excel Model function. (emailCorpus.narl)

Parse the body slot from emailCorpus.narl, create word nodes and calculate the probability for each. (kb.narl)

Examine incoming text body, tokenize and create nodeNames. If a nodeName is already in the kb, look up its probability. Otherwise assign a probability value of 0.5.
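
The lookup step can be sketched in Python as follows. This is only an approximation of the idea: the knowledge base is modeled as a plain dictionary rather than a Narl model, and the lowercasing, whitespace tokenization, function name, and sample values are all assumptions.

```python
def token_probabilities(body, kb, default=0.5):
    """Map each distinct token in an email body to its spam probability.

    `kb` is a hypothetical stand-in for the knowledge base: a dict from
    token to P(spam | token). Tokens missing from it get the neutral default.
    """
    result = {}
    for token in body.lower().split():
        result[token] = kb.get(token, default)
    return result


# Made-up knowledge base entries for illustration only.
kb = {"viagra": 0.99, "meeting": 0.05}
print(token_probabilities("Viagra now", kb))
```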

Page 9: Bayesian Spam Filter

Model

Page 10: Bayesian Spam Filter

Email node

Page 11: Bayesian Spam Filter

Word Node

Page 12: Bayesian Spam Filter

Issues

Text is unknown and often incomplete.

Java data structures: Vector, StringTokenizer, floating-point operations

Unfamiliar with Narl

Page 13: Bayesian Spam Filter

Enhancements

Read slots other than body.

Read data in from another format.

Gain more knowledge about the email.

Better error handling.

Read emails as they enter the mail server.

Regular expression matching of StringTokenizer.

Performance tuning with more data.

Take advantage of Narl functionality??
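
If the regular-expression enhancement is read as replacing delimiter-based splitting with pattern matching, a tokenizer along these lines is one possibility. The sketch uses Python's `re` module rather than Java, and the token pattern (letters, digits, `$`, apostrophes, hyphens) is a hypothetical choice, not something the slides specify.

```python
import re

# Hypothetical token pattern: runs of letters, digits, '$', apostrophes, hyphens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Return the lowercase tokens in `text` that match TOKEN_PATTERN."""
    return [match.lower() for match in TOKEN_PATTERN.findall(text)]


print(tokenize("FREE!!! Win $100 now."))
```

Matching tokens directly, instead of splitting on delimiters, drops punctuation runs such as "!!!" while keeping strings like "$100" intact.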

Page 14: Bayesian Spam Filter

Demonstration

Page 15: Bayesian Spam Filter

Questions?