spam filtering algorithm.pptx

8/14/2019 SPAM FILTERING ALGORITHM.pptx

1/25


2/25

What is Spam?

Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient.


3/25

Purpose of spam:

Delivering information to the recipient that contains a payload such as:

Advertising for a (likely worthless, illegal, or non-existent) product,
Bait for a fraud scheme,
Promotion of a cause, or
Computer malware designed to hijack the recipient's computer.


4/25

As it is so cheap to send information, only a very small fraction of targeted recipients, perhaps 1 in 10,000 or fewer, need to receive and respond to the payload for spam to be profitable to its sender.


5/25

Problems faced due to spam:

Large amounts of spam traffic between servers cause delays in the delivery of legitimate e-mail.
People with dial-up Internet access have to spend bandwidth downloading junk mail.
Sorting out the unwanted messages takes time and introduces a risk of deleting normal mail by mistake.


6/25

Methods for dealing with spam:

SOCIAL METHODS:

Legal measures: Ex. Anti-spam law introduced in the US

Plain personal involvement: Ex. Never respond to spam, never publish your email address on web pages, never forward chain letters

TECHNOLOGICAL METHODS:

Blocking the spammer's IP address
Email filtering


7/25

Email Filtering:

Two general approaches to mail filtering are:

Knowledge engineering
Machine Learning


8/25

Knowledge engineering

A set of rules is created according to which messages are categorized as spam or legitimate mail.

Ex. A typical rule of this kind could look like "if the subject of a message contains the text BUY NOW, then the message is spam".

These rules are created either by the user of the filter or by some other authority (ex. the software company that provides a particular rule-based spam-filtering tool).
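A rule of this kind can be sketched in a few lines of Python. This is only an illustration: the `RULES` list, the dictionary layout of a message, and the `is_spam` helper are hypothetical names chosen for this sketch, not part of any particular filtering tool.

```python
# Hypothetical rule set: each rule is a (field, phrase) pair.
# A message is flagged if the phrase occurs in that field.
RULES = [
    ("subject", "buy now"),
]


def is_spam(message, rules=RULES):
    """Return True if any rule matches the message (case-insensitive)."""
    return any(
        phrase in message.get(field, "").lower()
        for field, phrase in rules
    )


print(is_spam({"subject": "BUY NOW and save!"}))  # matches the rule
print(is_spam({"subject": "Meeting notes"}))      # no rule matches
```

Every new spam pattern needs a new entry in the rule list, so a filter of this kind only stays useful while someone keeps maintaining the rules.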


9/25

Drawbacks of knowledge engineering:

The set of rules must be constantly updated, and maintaining it is not convenient for most users.

When the rules are publicly available, the spammer has the ability to adjust the text of his message so that it passes through the filter.


10/25

Machine Learning:

A set of pre-classified documents (training samples) is needed.

A specific algorithm is then used to learn the classification rules from this data.


11/25

Problem Statement:

To obtain a spam filter, that is: a decision function f that would tell us whether a given e-mail message m is spam (S) or legitimate mail (L).

If we denote the set of all e-mail messages by M, we search for a function

$f : M \to \{S, L\}$.


12/25

SPAM FILTERING ALGORITHMS

Naïve Bayesian Classifier
k-Nearest Neighbours Classifier
Artificial neural networks: The Perceptron, Multilayer Perceptron
Support Vector Machine


13/25

Naïve Bayesian Classifier

We have two categories of messages: Spam (S) and Legitimate mail (L), and x is a feature vector of a message, that is, a vector of the numbers of occurrences of certain words in the message.

P(x | c) denotes the probability of obtaining a message with feature vector x from category c.

The aim of a spam filter is to determine P(c | x): given a message x, the probability that category c produced it.
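The step from P(x | c) to P(c | x) is not shown in the text. One standard way to make it is Bayes' rule, written here with the same symbols (S, L, x, c) and with the assumption that S and L are the only two categories:

```latex
P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}
            = \frac{P(x \mid c)\,P(c)}{P(x \mid S)\,P(S) + P(x \mid L)\,P(L)}
```

The denominator is the same for both categories, so comparing P(S | x) with P(L | x) only requires comparing the products P(x | S) P(S) and P(x | L) P(L).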


15/25

The final classification rule is in the form of a likelihood ratio:

$c = \begin{cases} S & \text{if } P(x \mid S)\,P(S) > k\,P(x \mid L)\,P(L) \\ L & \text{otherwise} \end{cases}$

where k is the parameter that specifies how dangerous it is to misclassify legitimate mail as spam.

The greater k is, the fewer false positives the classifier will produce.


16/25

Algorithm:

Training:

For the given training set (x, c), calculate the likelihood ratio $\Lambda(x) = \frac{P(x \mid S)}{P(x \mid L)}$ and the threshold $k\,\frac{P(L)}{P(S)}$.

Classification:

Given a message m, determine x, retrieve the stored value for $\Lambda(x)$, and use the decision rule to determine the category of message m:

$c = \begin{cases} S & \text{if } P(x \mid S)\,P(S) > k\,P(x \mid L)\,P(L) \\ L & \text{otherwise} \end{cases}$
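The training and classification steps can be sketched as follows. This is a minimal illustration, not a reference implementation. It assumes that words occur independently given the category (the "naïve" assumption), estimates word probabilities with add-one smoothing, and works with logarithms to avoid numerical underflow. The function names and the toy training messages are hypothetical.

```python
import math
from collections import Counter


def train(messages):
    """messages: list of (list_of_words, label), label is 'S' or 'L'."""
    word_counts = {"S": Counter(), "L": Counter()}
    class_counts = Counter()
    for words, label in messages:
        word_counts[label].update(words)
        class_counts[label] += 1
    vocab = set(word_counts["S"]) | set(word_counts["L"])
    return word_counts, class_counts, vocab


def log_likelihood(words, label, word_counts, vocab):
    """log P(x | label), assuming independent words, add-one smoothing."""
    total = sum(word_counts[label].values())
    return sum(
        math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        for w in words
        if w in vocab  # words never seen in training are ignored
    )


def classify(words, model, k=1.0):
    """Return 'S' if P(x|S) P(S) > k P(x|L) P(L), otherwise 'L'."""
    word_counts, class_counts, vocab = model
    n = sum(class_counts.values())
    log_spam = (log_likelihood(words, "S", word_counts, vocab)
                + math.log(class_counts["S"] / n))
    log_legit = (log_likelihood(words, "L", word_counts, vocab)
                 + math.log(class_counts["L"] / n))
    return "S" if log_spam > math.log(k) + log_legit else "L"


# Toy training set (invented for illustration only).
training = [
    ("buy cheap pills now".split(), "S"),
    ("buy now limited offer".split(), "S"),
    ("meeting agenda for monday".split(), "L"),
    ("lunch on monday".split(), "L"),
]
model = train(training)
print(classify("buy pills now".split(), model))         # S
print(classify("monday meeting".split(), model))        # L
print(classify("buy pills now".split(), model, k=1e6))  # L
```

The last call shows the role of k: with a very large k the same message is no longer classified as spam, which trades missed spam for fewer false positives.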


17/25

Advantages:

Conceptually very easy to understand

Very effective (filters more than 99.691%)

Everyone's filter is essentially customized, making it difficult for spammers to defeat everyone's filter with a single message


18/25

Disadvantages:

We need to have a collection of spam and legitimate mail to initialize the filter.

Initialization is a bit time consuming.

On each message a user-specific database of word probabilities has to be consulted.

False positives do happen (though rarely).


19/25

k Nearest Neighbours Classifier

Training

Store the training messages.

Classification

Given a message x, determine its k nearest neighbours among the messages in the training set. If l or more messages among the k nearest neighbours of x are spam, classify x as spam, otherwise classify it as legitimate mail.

l is used for controlling the number of false positives.


20/25

We use Euclidean distances for determining the nearest neighbours.

We have to calculate distances to all training messages and find the k nearest neighbours. This may take about O(nm) time for a training set of n messages containing feature vectors with m elements.
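A brute-force version of this classifier might look like the sketch below. The function names, the two-element feature vectors, and the toy training set are hypothetical. Computing the n distances costs O(nm), as stated above; selecting the k smallest with `heapq.nsmallest` adds roughly O(n log k) on top of that.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(x, training, k=3, l=2):
    """Classify x as 'S' if at least l of its k nearest neighbours are spam.

    training: list of (feature_vector, label), label is 'S' or 'L'.
    """
    nearest = heapq.nsmallest(
        k, training, key=lambda item: euclidean(x, item[0])
    )
    spam_votes = sum(1 for _, label in nearest if label == "S")
    return "S" if spam_votes >= l else "L"


# Toy data: each vector holds counts of two words (invented for illustration).
training = [
    ([3, 0], "S"), ([2, 0], "S"), ([4, 1], "S"),
    ([0, 2], "L"), ([0, 3], "L"), ([1, 2], "L"),
]
print(knn_classify([3, 1], training))            # S
print(knn_classify([0, 2], training))            # L
print(knn_classify([2, 1], training, k=3, l=3))  # L
```

In the last call two of the three nearest neighbours are spam, so raising l from 2 to 3 flips the decision to legitimate mail.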


21/25

Artificial Neural Networks

Artificial neural networks are models inspired by animal central nervous systems that are capable of machine learning and pattern recognition.

They are usually presented as systems of interconnected "neurons" that can compute values from inputs by feeding information through the network. There are two kinds of neural networks generally used:

The perceptron, and
The multilayer perceptron.


22/25

The Perceptron

The idea of the perceptron is to find a linear function of the feature vector, $f(x) = w^T x + b$, such that f(x) > 0 for vectors of one class and f(x) < 0 for vectors of the other class. w = (w1, w2, . . . , wm) is the vector of coefficients (weights) of the function, and b is the so-called bias.


23/25

Algorithm: Training

Initialize w and b (to random values or to 0).

Find a training example (x, c) for which $\mathrm{sign}(w^T x + b) \neq c$. If there is no such example, training is completed. Store the final w and b and stop. Otherwise go to the next step.

Update (w, b): w := w + cx, b := b + c. Go to the previous step.

Classification

Given a message x, determine its class as $\mathrm{sign}(w^T x + b)$.
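The training loop can be sketched as follows, assuming the two classes are encoded as c = +1 and c = -1 (the update rule w := w + cx only makes sense with such labels). The `max_updates` cap is an addition for this sketch, since the loop as described never terminates when the data is not linearly separable. The function names and the toy samples are hypothetical.

```python
def sign(v):
    """Return +1 for positive values and -1 otherwise."""
    return 1 if v > 0 else -1


def predict(x, w, b):
    """Class of x as sign(w . x + b)."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_perceptron(samples, max_updates=1000):
    """samples: list of (feature_vector, c) with c in {+1, -1}."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(max_updates):
        # Find a training example that is currently misclassified.
        wrong = next(
            ((x, c) for x, c in samples if predict(x, w, b) != c), None
        )
        if wrong is None:
            break  # every example is classified correctly
        x, c = wrong
        w = [wi + c * xi for wi, xi in zip(w, x)]  # w := w + c x
        b += c                                     # b := b + c
    return w, b


# Toy data (invented): +1 stands for spam, -1 for legitimate mail.
samples = [([3, 0], 1), ([2, 0], 1), ([0, 2], -1), ([0, 3], -1)]
w, b = train_perceptron(samples)
print(w, b)
print(predict([4, 1], w, b), predict([1, 4], w, b))
```

On this toy set the loop stops after two updates, at which point every training sample is classified correctly.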


24/25

Multilayer Perceptron

Multilayer perceptron is a function that may be visualized as a network with several layers of neurons, connected in a feed-forward manner.

The neurons in the first layer are called input neurons, and represent input variables.

The neurons in the last layer are called output neurons and provide the function result value.

The layers between the first and the last are called hidden layers.

Each neuron in the network is similar to a perceptron: it takes input values x1, x2, . . . , xk and calculates its output value o by the formula

$o = \varphi\left(\sum_{i=1}^{k} w_i x_i + b\right)$

where wi, b are the weights and the bias of the neuron and $\varphi$ is a certain nonlinear function. Most often $\varphi(x)$ is either $1/(1+e^{ax})$ or $\tanh(x)$.
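The output of a single neuron, and of a whole feed-forward network built from such neurons, can be sketched as below. The layer representation (a list of `(weights, bias)` pairs per layer) and all weight values are hypothetical. With a = -1, the first nonlinearity in the text becomes the familiar logistic function, which is the default used in `logistic` here.

```python
import math


def logistic(z, a=-1.0):
    """1 / (1 + e^(a z)); a = -1 gives the standard logistic function."""
    return 1.0 / (1.0 + math.exp(a * z))


def neuron_output(x, w, b, phi=math.tanh):
    """Output of one neuron: phi(sum_i w_i x_i + b)."""
    return phi(sum(wi * xi for wi, xi in zip(w, x)) + b)


def mlp_forward(x, layers, phi=math.tanh):
    """Feed x through the network layer by layer.

    layers: list of layers, each a list of (weights, bias) per neuron.
    """
    values = x
    for layer in layers:
        values = [neuron_output(values, w, b, phi) for w, b in layer]
    return values


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output neuron.
layers = [
    [([0.5, -0.25], 0.0), ([1.0, 1.0], -1.0)],  # hidden layer
    [([1.0, -1.0], 0.0)],                       # output layer
]
print(mlp_forward([1.0, 2.0], layers))
```

Input neurons are not modelled separately here: the input vector x plays the role of the first layer.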


25/25

Training of the multilayer perceptron means searching for such weights and biases of all the neurons for which the network will have as small an error on the training set as possible.

Total training error:

$E(f) = \sum_i |f(x_i) - c_i|^2$,

where (xi, ci) are training samples.
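As a small illustration, the total training error can be computed as below for any function f with a scalar output. The stand-in classifier `f` and the toy samples are hypothetical and only serve to show the arithmetic.

```python
def training_error(f, samples):
    """Sum of squared differences |f(x_i) - c_i|^2 over the samples."""
    return sum(abs(f(x) - c) ** 2 for x, c in samples)


def f(x):
    """Stand-in classifier (not a trained network): +1 if x[0] > x[1]."""
    return 1 if x[0] > x[1] else -1


# Toy samples (invented); the third one is misclassified by f.
samples = [([3, 0], 1), ([0, 2], -1), ([1, 2], 1)]
print(training_error(f, samples))  # 4
```

Only the third sample contributes: f returns -1 where the label is +1, giving |-1 - 1|^2 = 4.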
