
Bayesian Filtering

CMPE 209, SJSU, Spring 2007

Instructor: Richard Sinn

Team “Glyph”

Debbie Bridygham, Pravesvuth Uparanukraw, Ronald Ko, Rihui Luo, Thuong Luu

Table of Contents

I. Introduction

II. Bayes’s Theorem

Naïve Bayes Classifier

III. Bayesian Spam Filter Process

IV. Advantages

V. Disadvantages

VI. Bayesian Poisoning

VII. Conclusion

VIII. References


I. Introduction

Spam has long been a problem for email. It costs businesses both time and money. Studies show that 50% of all current email is spam, and because spam keeps rising, this share is predicted to climb higher in a short time. Spam forces end users to spend time sorting through their mail to work out which messages are legitimate and to delete the junk. Many email programs today include some form of junk email filtering. The simplest form keeps a list of words or email addresses and searches each message for them to identify spam. While this method is quick and easy, it is static: spammers can easily adapt and find ways around the list. Because spammers can readily change their tactics, we need a form of email filtering that adapts as well. Bayesian filtering is one such technique, and it is one of the most tried and common ways of filtering out spam. Because of its effectiveness, many third-party email filtering programs rely on it as the ‘brains’ behind their filtering. In this paper, we explore the logic and inner workings of Bayesian filtering: how it works and what it achieves.

II. Bayes’s Theorem

Bayesian filtering is based on Bayes’ theorem, also known as Bayes’ law. The theorem is named after Thomas Bayes, a minister with an interest in mathematics. It was published in an essay in the Philosophical Transactions of the Royal Society of London in 1763.

Bayes’ theorem is not a ‘magical’ or special technique, as its name might suggest. It is based on simple math and statistics. The following formula is Bayes’ theorem in its well-known short form. The idea is that the formula determines the probability of one event from the probabilities of other events:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

To make the formula more understandable in the context of spam, it can be rewritten as follows:

$$P(\text{spam} \mid \text{words}) = \frac{P(\text{words} \mid \text{spam})\, P(\text{spam})}{P(\text{words})}$$

“Bayes’s theorem, in the context of spam, says that the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email.” (1)
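The denominator, the probability of finding the words in any email, is rarely measured directly. A standard step, not spelled out in the source, is to expand it over the two classes with the law of total probability:

$$P(\text{words}) = P(\text{words} \mid \text{spam})\, P(\text{spam}) + P(\text{words} \mid \text{not spam})\, P(\text{not spam})$$

As a worked example with purely illustrative values (assumptions, not measurements), take $P(\text{words} \mid \text{spam}) = 0.4$, $P(\text{words} \mid \text{not spam}) = 0.01$ and $P(\text{spam}) = P(\text{not spam}) = 0.5$:

$$P(\text{spam} \mid \text{words}) = \frac{0.4 \times 0.5}{0.4 \times 0.5 + 0.01 \times 0.5} = \frac{0.2}{0.205} \approx 0.976$$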


Naïve Bayes Classifier

A naïve Bayes classifier is a probabilistic classifier. It is a Bayesian network that is used for classification (2). The term naïve is used because the classifier makes an independence assumption: each word is assumed to occur independently of the other words in the document.

A Bayes classifier needs to be trained in order to work effectively, yet a naïve Bayes classifier requires only a small amount of training data to perform classification. This is an advantage for email, because users are likely to expect the spam filter to work without much effort on their part.

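The following example uses Bayes’ theorem to classify a document as spam or not spam.

First, we have a document $D$ that contains a set of words $w_i$. The document can be classified as spam ($S$) or as not spam ($\neg S$). Therefore, from probability theory and the independence assumption, we can say:

$$p(D \mid S) = \prod_i p(w_i \mid S)$$

and

$$p(D \mid \neg S) = \prod_i p(w_i \mid \neg S)$$

From the above and Bayes’ theorem, we can write:

$$p(S \mid D) = \frac{p(S)}{p(D)} \prod_i p(w_i \mid S)$$

and

$$p(\neg S \mid D) = \frac{p(\neg S)}{p(D)} \prod_i p(w_i \mid \neg S)$$

Dividing one by the other gives:

$$\frac{p(S \mid D)}{p(\neg S \mid D)} = \frac{p(S)}{p(\neg S)} \prod_i \frac{p(w_i \mid S)}{p(w_i \mid \neg S)}$$

Knowing that $p(S \mid D) + p(\neg S \mid D) = 1$, the probability $p(S \mid D)$ can be recovered from this ratio. The factors $p(w_i \mid S) / p(w_i \mid \neg S)$ are called likelihood ratios.

Taking the logarithm yields:

$$\ln \frac{p(S \mid D)}{p(\neg S \mid D)} = \ln \frac{p(S)}{p(\neg S)} + \sum_i \ln \frac{p(w_i \mid S)}{p(w_i \mid \neg S)}$$

Where:

$p(w_i \mid S)$ is the probability that word $w_i$ appears in a spam email.

$p(w_i \mid \neg S)$ is the probability that word $w_i$ appears in a non-spam email.

$p(S)$ is the probability that an email is spam.

$p(\neg S)$ is the probability that an email is not spam.

Thus the left side can be calculated from the right side of the equation using the word probabilities stored in the database. Our aim is to get $\ln \frac{p(S \mid D)}{p(\neg S \mid D)} < 0$, that is $p(\neg S \mid D) > p(S \mid D)$, so that we can assume the email is not spam (3).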
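The log form is convenient in code, because a sum of logarithms avoids the numerical underflow that a long product of small probabilities can cause. The Python sketch below evaluates the log of the ratio between the spam and non-spam probabilities of a document as the log prior ratio plus the summed log likelihood ratios of its words. The function names, the per-word probabilities, the 0.5 prior, and the choice to skip unknown words are all illustrative assumptions, not taken from the source.

```python
import math

def log_spam_ratio(words, p_word_spam, p_word_ham, p_spam=0.5):
    """Return ln(p(S|D) / p(not S|D)) for a document given as a list of words."""
    # Log prior ratio ln(p(S) / p(not S)); the 0.5 default is an assumption.
    total = math.log(p_spam / (1.0 - p_spam))
    for word in words:
        # Assumption: words absent from either table are simply skipped.
        if word in p_word_spam and word in p_word_ham:
            total += math.log(p_word_spam[word] / p_word_ham[word])
    return total

def spam_probability(log_ratio):
    """Recover p(S|D) from the log ratio, using p(S|D) + p(not S|D) = 1."""
    return 1.0 / (1.0 + math.exp(-log_ratio))

# Hypothetical per-word probabilities, made up for illustration only.
P_WORD_SPAM = {"mortgage": 0.4, "free": 0.3, "meeting": 0.01}
P_WORD_HAM = {"mortgage": 0.02, "free": 0.05, "meeting": 0.2}

spam_like = log_spam_ratio(["free", "mortgage"], P_WORD_SPAM, P_WORD_HAM)
ham_like = log_spam_ratio(["meeting"], P_WORD_SPAM, P_WORD_HAM)
print(spam_like > 0)  # True: positive log ratio, spam is the more likely class
print(ham_like < 0)   # True: negative log ratio, not spam is the more likely class
```

A log ratio of zero corresponds to a spam probability of exactly 0.5, which is why the sign of the ratio is enough to pick the more likely class.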

III. Bayesian Spam Filter Process

Before mail can be filtered using the Bayesian technique, a user needs to generate a database of tokens (words) collected from a sample of spam mail and a sample of valid mail (referred to as ‘ham’).

A probability value is then assigned to each token. This probability is based on how often the token occurs in spam as opposed to legitimate mail (ham). It is obtained by analyzing the user's outgoing mail and known spam: all the tokens in both sets are analyzed to generate the probability that a particular word indicates spam.

For instance, if the word "mortgage" occurs in 400 of 3,000 spam mails and in 5 of 300 legitimate emails, then its spam probability would be 0.8889 = [400/3000] / [5/300 + 400/3000].
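The arithmetic in the "mortgage" example can be reproduced with a few lines of Python. This is a minimal sketch: the function name and the neutral value of 0.5 returned for a token that was never seen are assumptions added for illustration.

```python
def token_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Spam probability of one token from its frequency in spam and in ham."""
    spam_freq = spam_hits / spam_total  # share of spam mails containing the token
    ham_freq = ham_hits / ham_total     # share of ham mails containing the token
    if spam_freq + ham_freq == 0:
        return 0.5  # assumption: a token never seen before is treated as neutral
    return spam_freq / (spam_freq + ham_freq)

p = token_spam_probability(400, 3000, 5, 300)
print(round(p, 4))  # 0.8889
```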

It should be noted that the analysis of ham mail should be tailored to a particular organization. For example, a financial institution might use the word "mortgage" many times and would get a lot of false positives if using a general anti-spam rule set. On the other hand, the Bayesian filter, if tailored to that financial institution through an initial training period, records the institution's valid outgoing mails and has a much better spam detection rate and a lower false positive rate.

Besides ham mail, the Bayesian filter also needs to be trained on spam data. This spam data should include a large sample of known spam and be constantly updated. This is to ensure that the Bayesian filter is aware of the latest spam tricks which results in a high spam detection rate.


Once the ham and spam databases have been created, the word probabilities can be calculated and the filter is ready for use.

When a new mail arrives, it is broken down into tokens and the relevant tokens that are significant in identifying whether the mail is spam or not are singled out. From these tokens, the Bayesian filter calculates the probability of the new message being spam or not. If the probability is greater than a predefined threshold, say 0.9, the message is classified as spam.
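The classification step can be sketched in Python as follows. The source does not say how the token probabilities are combined, so this sketch assumes the common naive Bayes combination with equal priors, computed as a sum of log odds. The tokenizer, the limit of 15 significant tokens, the clamping bounds, and the token table are hypothetical choices; only the 0.9 threshold comes from the text.

```python
import math
import re

def tokenize(text):
    """Split a message into lowercase tokens (a simplified tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def classify(text, token_probs, threshold=0.9, max_tokens=15):
    """Return (spam probability, is_spam) for one message."""
    known = [token_probs[t] for t in set(tokenize(text)) if t in token_probs]
    # The most significant tokens are those furthest from the neutral value 0.5.
    known.sort(key=lambda p: abs(p - 0.5), reverse=True)
    log_odds = 0.0
    for p in known[:max_tokens]:
        p = min(max(p, 0.01), 0.99)  # clamp so the logarithm stays finite
        log_odds += math.log(p / (1.0 - p))
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    return probability, probability > threshold

# Hypothetical token database; only "mortgage" uses a value from the text.
TOKEN_PROBS = {"mortgage": 0.8889, "free": 0.9, "rates": 0.7, "meeting": 0.05}

print(classify("Free mortgage rates", TOKEN_PROBS)[1])           # True
print(classify("Meeting about mortgage rates", TOKEN_PROBS)[1])  # False
```

A message with no known tokens ends up at the neutral probability 0.5 and is therefore not flagged.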

IV. Advantages

1. The Bayesian method takes the whole message (including the header) into account. It recognizes keywords that identify spam as well as keywords that identify valid mail, weighs the most interesting of them, and comes up with a probability that a message is spam. Bayesian filtering is an intelligent approach because it examines all aspects of a message, as opposed to keyword checking, which classifies a mail as spam on the basis of a single word.

2. A Bayesian filter is self-adapting. Token probabilities are constantly updated. The filter learns and evolves from new spam and from new valid incoming and outgoing mail. Learning from outgoing mail greatly reduces false positives.

3. The Bayesian technique is user sensitive. It learns the email habits of the organization and therefore detects spam more accurately, with a significantly lower false positive rate.

4. The Bayesian method is multi-lingual and international. A Bayesian filter can be used for any language. It can also be configured to take into account language deviations or the diverse usage of certain words in different areas.

5. A Bayesian filter is difficult to fool, as opposed to a keyword filter. A spammer who wants to trick the Bayesian filter can either use fewer words that usually indicate spam or more words that generally indicate valid mail. Doing the latter is impractical in general because the spammer would have to know the email profile of each recipient.
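The self-adapting behaviour described in item 2 can be illustrated with a small token store whose probabilities shift as new mail is learned. This is a sketch under stated assumptions: the class name `TokenStore`, its methods, and the sample messages are hypothetical, and the probability formula is the frequency ratio used in the "mortgage" example of Section III.

```python
from collections import Counter

class TokenStore:
    """Keeps per-token mail counts and updates them as new mail is learned."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_total = 0
        self.ham_total = 0

    def train(self, tokens, is_spam):
        """Learn one message; each token is counted once per message."""
        if is_spam:
            self.spam_counts.update(set(tokens))
            self.spam_total += 1
        else:
            self.ham_counts.update(set(tokens))
            self.ham_total += 1

    def spam_probability(self, token):
        """Frequency-ratio spam probability; 0.5 when there is no evidence."""
        if self.spam_total == 0 or self.ham_total == 0:
            return 0.5
        spam_freq = self.spam_counts[token] / self.spam_total
        ham_freq = self.ham_counts[token] / self.ham_total
        if spam_freq + ham_freq == 0:
            return 0.5
        return spam_freq / (spam_freq + ham_freq)

store = TokenStore()
store.train(["cheap", "mortgage"], is_spam=True)
store.train(["team", "meeting"], is_spam=False)
before = store.spam_probability("mortgage")  # only ever seen in spam so far

# A legitimate outgoing mail that mentions mortgages is learned as ham.
store.train(["mortgage", "statement"], is_spam=False)
after = store.spam_probability("mortgage")
print(before > after)  # True: the token now looks less spammy
```

This mirrors the financial institution example: once valid mail containing "mortgage" has been learned, the word alone no longer marks a message as spam.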

V. Disadvantages

1. Significantly increased system resource usage to identify a message as spam or not. What used to take one pass now takes as many as 10-15 passes.

2. Cannot identify cloaked spam, which is generally the most vile spam, such as "v*i(a)g-r-a" or bogus HTML tags, as well as more sophisticated cloaking.

3. Still based on and dependent upon having clearly visible and obvious keywords/tokens.

4. No standard method of determining why a particular message was caught by the filter, which makes it very difficult to intelligently tune the filter for optimal spam recognition.

5. Blind "training" and retraining of the Bayesian filter can produce unpredictable results and negatively impact the filter's ability to correctly identify future spam.
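The cloaking problem in item 2 comes down to tokenization. The short Python sketch below assumes a deliberately simple tokenizer that keeps only runs of letters (an illustrative choice, not the tokenizer of any particular product) and shows that the cloaked spelling never yields the token the filter has learned.

```python
import re

def tokenize(text):
    """Split text into lowercase alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("cheap viagra"))       # ['cheap', 'viagra']
print(tokenize("cheap v*i(a)g-r-a"))  # ['cheap', 'v', 'i', 'a', 'g', 'r', 'a']
```

Because the fragments carry little or no learned spam probability, the cloaked word contributes almost nothing to the message score.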


VI. Bayesian Poisoning

Bayesian poisoning is a technique used by spammers to attempt to degrade the effectiveness of Bayesian spam filters. In Bayesian spam filtering, the appearance of innocent words is as important as the appearance of spam words. Spam words in a spam email raise the probability that the filter classifies it correctly, while innocent words in the same email lower it. Therefore, if spammers learn which words are innocent, they can inject them into spam emails to reduce the likelihood of being identified as spam.
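The effect of injected innocent words can be shown numerically. The Python sketch below assumes that token probabilities are combined as a sum of log odds with equal priors, which is one common naive Bayes formulation and not necessarily what a given filter does. The token table and messages are hypothetical.

```python
import math

def combined_spam_probability(tokens, token_probs):
    """Combine per-token spam probabilities by summing their log odds."""
    log_odds = 0.0
    for token in tokens:
        p = token_probs.get(token)
        if p is not None:  # tokens unknown to the filter are ignored
            log_odds += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical database: two spam words and two innocent words.
TOKEN_PROBS = {"free": 0.9, "mortgage": 0.9, "meeting": 0.1, "thanks": 0.1}

plain = ["free", "mortgage"]
poisoned = plain + ["meeting", "thanks"]  # innocent words injected by the spammer

plain_score = combined_spam_probability(plain, TOKEN_PROBS)
poisoned_score = combined_spam_probability(poisoned, TOKEN_PROBS)
print(plain_score > 0.9)     # True: the plain spam is caught at a 0.9 threshold
print(poisoned_score > 0.9)  # False: the injected words pull the score down
```

In this toy table the two innocent words exactly cancel the two spam words, so the poisoned message scores about 0.5. A real spammer would first have to learn which words a given recipient's filter treats as innocent.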

There are two types of attack by spammers: Type I and Type II. A Type I attack attempts to deliver the spam to the user, and a Type II attack attempts to turn previously innocent words into spam words in the Bayesian database. In other words, a Type I attack tries to bypass the filter’s checking mechanism, while a Type II attack attempts to confuse the filter.

There are also two types of Bayesian poisoning: passive and active. In passive poisoning, the spammer blindly sends messages without getting any feedback about the outcome. In active poisoning, the spammer adds random words to the message and uses web bugs to track whether the message is actually delivered.

The outcome of Bayesian poisoning depends on the actual filter software, but in general active poisoning is more effective than its passive counterpart. Adding random words in a passive attack is inefficient, while adding innocent words in an active attack is quite effective.

There are many strategies for defending against Bayesian poisoning, and the most important measure is to prevent active attacks. This can be achieved by preventing the spammer from receiving any feedback.

VII. Conclusion

Bayesian filtering is a great improvement over static keyword-based filtering techniques. Because it is based on Bayes’ theorem, a Bayesian filter, unlike its static counterpart, can adapt itself to spammers’ latest tactics, which makes it very difficult for spammers to get around. It can also be trained to suit an individual’s unique situation, which makes it extremely unlikely that spammers can come up with a universal spam scheme. The result is a very low rate of false positives.

Spammers have developed various methods to defeat Bayesian filters. These methods, known as Bayesian poisoning, use both passive and active techniques to attempt to get around the filtering mechanism. Although they partially work in some situations, they are largely unsuccessful.

With such a high spam rate in everybody’s inbox, a Bayesian filter provides a great way to protect your inbox from being flooded by spam. The filter can adapt to changes and fit your individual situation. With a little initial help from the user, it can produce a very low false positive rate. The Bayesian filter has proved to be a very effective spam-fighting tool.


VIII. References

1. Answers.com. Bayes' theorem. [Online] [Cited: April 9, 2007.] http://www.answers.com/topic/bayes-theorem.

2. Sahami, Mehran, et al. A Bayesian Approach to Filtering Junk E-Mail. AAAI-98 Workshop on Learning for Text Categorization. Madison, Wisconsin : AAAI Technical Report WS-98-05, 1998.

3. Wikipedia. Naive Bayes classifier. [Online] March 23, 2007. [Cited: April 8, 2007.] http://en.wikipedia.org/wiki/Naive_bayes_classifier.

4. Wikipedia. Bayesian spam filtering. [Online] February 14, 2007. [Cited: April 8, 2007.] http://en.wikipedia.org/wiki/Bayesian_spam_filtering.

5. Wei, Kai. A Naive Bayes Spam Filter. 2003.

6. GFI. Why Bayesian filtering is the most effective anti-spam technology. [Online] 2007. http://www.gfi.com/whitepapers/why-bayesian-filtering.pdf.
