spam filtering algorithm.pptx
TRANSCRIPT
-
8/14/2019 SPAM FILTERING ALGORITHM.pptx
1/25
-
What is Spam?
Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient.
-
Purpose of spam:
Delivering information to the recipient that contains a payload such as:
Advertising for a (likely worthless, illegal, or non-existent) product,
Bait for a fraud scheme,
Promotion of a cause, or
Computer malware designed to hijack the recipient's computer.
-
As it is so cheap to send information, only a very small fraction of targeted recipients, perhaps 1 in 10,000 or fewer, need to receive and respond to the payload for spam to be profitable to its sender.
-
Problems faced due to spam:
Large amounts of spam traffic between servers cause delays in the delivery of legitimate e-mail.
People with dial-up Internet access have to spend bandwidth downloading junk mail.
Sorting out the unwanted messages takes time and introduces a risk of deleting normal mail by mistake.
-
Methods for dealing with spam:
SOCIAL METHODS:
Legal measures: Ex. anti-spam law introduced in the US
Plain personal involvement: Ex. never respond to spam, never publish your email address on web pages, never forward chain letters
TECHNOLOGICAL METHODS:
Blocking the spammer's IP address
Email filtering
-
Email Filtering:
Two general approaches to mail filtering are:
Knowledge engineering
Machine learning
-
Knowledge engineering
A set of rules is created according to which messages are categorized as spam or legitimate mail.
Ex. A typical rule of this kind could look like "if the subject of a message contains the text BUY NOW, then the message is spam".
These rules are created either by the user of the filter or by some other authority (e.g. the software company that provides a particular rule-based spam-filtering tool).
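A rule of the kind described above can be sketched in a few lines. This is only an illustration: the rule list, the dictionary-based message format, and the function name `is_spam_by_rules` are assumptions made for the example, not part of any real filtering tool.

```python
# Hypothetical rule set: each rule is a (message field, phrase) pair.
RULES = [
    ("subject", "BUY NOW"),
    ("subject", "FREE MONEY"),
]


def is_spam_by_rules(message, rules=RULES):
    """Return True if any rule's phrase occurs in the named field.

    `message` is assumed to be a dict such as {"subject": ..., "body": ...}.
    Matching is case-insensitive.
    """
    for field, phrase in rules:
        if phrase.lower() in message.get(field, "").lower():
            return True
    return False


print(is_spam_by_rules({"subject": "Buy now and save!", "body": "..."}))
print(is_spam_by_rules({"subject": "Meeting at 10", "body": "..."}))
```

The sketch also makes the maintenance burden visible: every new spam phrase needs a new entry in `RULES`.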
-
Drawbacks of knowledge engineering:
The set of rules must be constantly updated, and maintaining it is not convenient for most users.
When the rules are publicly available, the spammer can adjust the text of his message so that it passes through the filter.
-
Machine Learning:
A set of pre-classified documents (training samples) is needed.
A specific algorithm is then used to learn the classification rules from this data.
-
Problem Statement:
To obtain a spam filter, that is, a decision function $f$ that tells us whether a given e-mail message $m$ is spam ($S$) or legitimate mail ($L$).
If we denote the set of all e-mail messages by $M$, we search for a function
$$f : M \to \{S, L\}.$$
-
SPAM FILTERING ALGORITHMS
Naïve Bayesian Classifier
k-Nearest Neighbours Classifier
Artificial neural networks
  The Perceptron
  Multilayer Perceptron
Support Vector Machine
-
Naïve Bayesian Classifier
We have two categories of messages: spam ($S$) and legitimate mail ($L$), and $x$ is the feature vector of a message, that is, a vector of the numbers of occurrences of certain words in the message.
$P(x \mid c)$ denotes the probability of obtaining a message with feature vector $x$ from category $c$.
The aim of a spam filter is to determine $P(c \mid x)$: given a message $x$, which category $c$ produced it.
-
Using Bayes' rule we get:
$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} = \frac{P(x \mid c)\,P(c)}{P(x \mid S)\,P(S) + P(x \mid L)\,P(L)}$$
where $P(x)$ denotes the a priori probability of message $x$ and $P(c)$ denotes the a priori probability of class $c$.
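As a quick numeric check of Bayes' rule for the two-class case, the sketch below computes $P(S \mid x)$ from the two likelihoods and the prior. The function name `posterior_spam` and all the probabilities are made up for illustration; they are not measurements.

```python
def posterior_spam(p_x_given_s, p_x_given_l, p_s):
    """Return P(S | x) using Bayes' rule with two classes, S and L."""
    p_l = 1.0 - p_s  # only two classes, so P(L) = 1 - P(S)
    # Denominator of Bayes' rule: P(x) = P(x|S)P(S) + P(x|L)P(L)
    p_x = p_x_given_s * p_s + p_x_given_l * p_l
    return p_x_given_s * p_s / p_x


# Illustrative (invented) values: the message is 50 times more likely
# under the spam model than under the legitimate-mail model.
p = posterior_spam(p_x_given_s=0.05, p_x_given_l=0.001, p_s=0.4)
print(round(p, 3))  # 0.971
```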
-
The final classification rule takes the form of a likelihood-ratio test:
$$c = \begin{cases} S & \text{if } P(x \mid S)\,P(S) > k\,P(x \mid L)\,P(L), \\ L & \text{otherwise,} \end{cases}$$
where $k$ is the parameter that specifies how dangerous it is to misclassify legitimate mail as spam.
The greater $k$ is, the fewer false positives the classifier will produce.
-
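Algorithm:
Training:
For the given training set $(x, c)$, calculate the likelihood ratio $\Lambda(x) = \dfrac{P(x \mid S)}{P(x \mid L)}$ and the threshold $k\,\dfrac{P(L)}{P(S)}$.
Classification:
Given a message $m$, determine $x$, retrieve the stored value for $\Lambda(x)$, and use the decision rule to determine the category of message $m$:
$$c = \begin{cases} S & \text{if } P(x \mid S)\,P(S) > k\,P(x \mid L)\,P(L), \\ L & \text{otherwise,} \end{cases}$$
which is equivalent to testing whether $\Lambda(x) > k\,\dfrac{P(L)}{P(S)}$.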
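The training and classification steps can be sketched as follows. Several things here are assumptions that the slides do not spell out: words are treated as independent given the class (the usual "naive" assumption), probabilities are estimated from word counts with add-one smoothing, and the comparison is done in log space to avoid underflow. Instead of storing the likelihood ratio, the sketch recomputes both sides of the decision rule for each message. All function names and the toy training set are hypothetical.

```python
import math
from collections import Counter


def train(messages):
    """messages: list of (list_of_words, label), label being 'S' or 'L'."""
    word_counts = {"S": Counter(), "L": Counter()}
    class_counts = Counter()
    for words, label in messages:
        word_counts[label].update(words)
        class_counts[label] += 1
    vocab = set(word_counts["S"]) | set(word_counts["L"])
    return word_counts, class_counts, vocab


def log_likelihood(words, label, word_counts, vocab):
    """log P(x | label), assuming independent words and add-one smoothing."""
    total = sum(word_counts[label].values())
    return sum(
        math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        for w in words
        if w in vocab  # words never seen in training are ignored
    )


def classify(words, model, k=1.0):
    """Return 'S' if P(x|S)P(S) > k * P(x|L)P(L), otherwise 'L'."""
    word_counts, class_counts, vocab = model
    n = sum(class_counts.values())
    log_s = log_likelihood(words, "S", word_counts, vocab) + math.log(class_counts["S"] / n)
    log_l = log_likelihood(words, "L", word_counts, vocab) + math.log(class_counts["L"] / n)
    return "S" if log_s > math.log(k) + log_l else "L"


# Toy training set, invented for illustration.
training = [
    ("buy now cheap pills".split(), "S"),
    ("buy cheap watches now".split(), "S"),
    ("meeting agenda for monday".split(), "L"),
    ("lunch on monday".split(), "L"),
]
model = train(training)
print(classify("buy cheap pills".split(), model))         # S
print(classify("monday meeting".split(), model))          # L
print(classify("buy cheap pills".split(), model, k=100))  # L
```

The last call shows the effect of k: with a large enough k, even a message that looks like spam is kept as legitimate, trading missed spam for fewer false positives.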
-
Advantages:
Conceptually very easy to understand
Very effective (filters more than 99.691%)
Everyone's filter is essentially customized, making it difficult for spammers to defeat everyone's filter with a single message
-
Disadvantages:
We need to have a collection of spam and legitimate mail to initialize the filter.
Initialization is a bit time-consuming.
For each message, a user-specific database of word probabilities has to be consulted.
False positives do happen (though rarely).
-
k-Nearest Neighbours Classifier
Training
Store the training messages.
Classification
Given a message x, determine its k nearest neighbours among the messages in the training set. If l or more messages among the k nearest neighbours of x are spam, classify x as spam; otherwise classify it as legitimate mail.
l is used for controlling the number of false positives.
-
We use Euclidean distances to determine the nearest neighbours.
We have to calculate the distances to all training messages and find the k nearest neighbours. This may take about O(nm) time for a training set of n messages containing feature vectors with m elements.
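The k-nearest-neighbours procedure with Euclidean distance can be sketched as below. The function names, the default values of k and l, and the two-dimensional toy vectors are assumptions chosen for illustration. Sorting all n distances adds an O(n log n) term on top of the O(nm) distance computation; it is used here only to keep the sketch short.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(x, training, k=3, l=2):
    """training: list of (feature_vector, label), label being 'S' or 'L'.

    Returns 'S' if at least l of the k nearest neighbours are spam.
    """
    # One distance per training message, each over m features: O(n*m).
    nearest = sorted(training, key=lambda item: euclidean(x, item[0]))[:k]
    spam_votes = sum(1 for _, label in nearest if label == "S")
    return "S" if spam_votes >= l else "L"


# Toy feature vectors (counts of two hypothetical words), invented here.
training = [
    ([3, 0], "S"),
    ([2, 1], "S"),
    ([0, 3], "L"),
    ([1, 2], "L"),
    ([0, 2], "L"),
]
print(knn_classify([3, 1], training))       # S: two of the three nearest are spam
print(knn_classify([3, 1], training, l=3))  # L: a stricter l reduces false positives
print(knn_classify([0, 3], training))       # L
```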
-
Artificial Neural Networks
Artificial neural networks are models inspired by animal central nervous systems that are capable of machine learning and pattern recognition.
They are usually presented as systems of interconnected "neurons" that can compute values from inputs by feeding information through the network. There are two kinds of neural networks generally used:
The perceptron, and
The multilayer perceptron.
-
The Perceptron
The idea of the perceptron is to find a linear function of the feature vector,
$$f(x) = w^{T}x + b,$$
such that $f(x) > 0$ for vectors of one class and $f(x) < 0$ for vectors of the other class.
$w = (w_1, w_2, \ldots, w_m)$ is the vector of coefficients (weights) of the function, and $b$ is the so-called bias.
-
Algorithm: Training
Initialize $w$ and $b$ (to random values or to 0).
Find a training example $(x, c)$ for which $\operatorname{sign}(w^{T}x + b) \neq c$. If there is no such example, training is completed: store the final $w$ and $b$ and stop. Otherwise go to the next step.
Update $(w, b)$: $w := w + cx$, $b := b + c$. Go to the previous step.
Classification
Given a message $x$, determine its class as $\operatorname{sign}(w^{T}x + b)$.
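The perceptron training loop can be sketched as follows. Some details are assumptions made for the example: class labels are encoded as +1 and -1 so that they can be compared with the sign, a value of exactly 0 is treated as -1, and a `max_passes` cap is added because the plain algorithm does not terminate on data that is not linearly separable. The function names and toy data are hypothetical.

```python
def sign(v):
    """Return +1 for positive values and -1 otherwise (0 counts as -1)."""
    return 1 if v > 0 else -1


def train_perceptron(samples, max_passes=1000):
    """samples: list of (feature_vector, c) with c in {+1, -1}."""
    m = len(samples[0][0])
    w, b = [0.0] * m, 0.0  # initialize to 0
    for _ in range(max_passes):  # assumed safety cap, not in the slides
        mistakes = 0
        for x, c in samples:
            if sign(sum(wi * xi for wi, xi in zip(w, x)) + b) != c:
                # Update rule: w := w + c*x, b := b + c
                w = [wi + c * xi for wi, xi in zip(w, x)]
                b += c
                mistakes += 1
        if mistakes == 0:  # no misclassified example left: training is done
            break
    return w, b


def perceptron_classify(x, w, b):
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy linearly separable data, invented for illustration (+1 = one class, -1 = the other).
samples = [([2, 0], 1), ([3, 1], 1), ([0, 2], -1), ([1, 3], -1)]
w, b = train_perceptron(samples)
print(all(perceptron_classify(x, w, b) == c for x, c in samples))  # True
```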
-
Multilayer Perceptron
A multilayer perceptron is a function that may be visualized as a network with several layers of neurons, connected in a feed-forward manner.
The neurons in the first layer are called input neurons and represent the input variables.
The neurons in the last layer are called output neurons and provide the function's result value.
The layers between the first and the last are called hidden layers.
Each neuron in the network is similar to a perceptron: it takes input values $x_1, x_2, \ldots, x_k$ and calculates its output value $o$ by the formula
$$o = \varphi\left(\sum_{i=1}^{k} w_i x_i + b\right),$$
where $w_i$ and $b$ are the weights and the bias of the neuron and $\varphi$ is a certain nonlinear function. Most often $\varphi(x)$ is either $\dfrac{1}{1 + e^{ax}}$ or $\tanh(x)$.
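The neuron formula and the feed-forward structure can be sketched as below, using tanh as the nonlinear function. The layer representation (a list of weight rows plus a list of biases), the function names, and all weight values are assumptions chosen for illustration; they are not trained values.

```python
import math


def neuron(inputs, weights, bias, phi=math.tanh):
    """Output of one neuron: phi(sum_i w_i * x_i + b)."""
    return phi(sum(w * x for w, x in zip(weights, inputs)) + bias)


def layer(inputs, weight_rows, biases, phi=math.tanh):
    """Outputs of all neurons in one layer, each seeing the same inputs."""
    return [neuron(inputs, w, b, phi) for w, b in zip(weight_rows, biases)]


def mlp(inputs, layers, phi=math.tanh):
    """Feed the inputs forward through each (weight_rows, biases) layer in turn."""
    values = inputs
    for weight_rows, biases in layers:
        values = layer(values, weight_rows, biases, phi)
    return values


# Hypothetical network: 2 inputs, one hidden layer of 2 neurons, 1 output neuron.
layers = [
    ([[0.5, -0.25], [0.1, 0.2]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                    # output layer
]
output = mlp([1.0, 2.0], layers)
print(output)
```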
-
Training the multilayer perceptron means searching for the weights and biases of all the neurons for which the network has as small an error on the training set as possible.
Total training error:
$$E(f) = \sum_{i} \lvert f(x_i) - c_i \rvert^{2},$$
where $(x_i, c_i)$ are the training samples.
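The total training error is a plain sum of squared differences and can be computed as in this sketch. The stand-in classifier `f`, the +1/-1 label encoding, and the sample values are assumptions for illustration only; in practice `f` would be the network being trained.

```python
def training_error(f, samples):
    """E(f) = sum over training samples of |f(x_i) - c_i|^2."""
    return sum(abs(f(x) - c) ** 2 for x, c in samples)


def f(x):
    """Stand-in classifier: +1 if the single feature is positive, else -1."""
    return 1 if x[0] > 0 else -1


# Invented samples: the first is classified correctly, the other two are not.
samples = [([1], 1), ([-1], 1), ([2], -1)]
print(training_error(f, samples))  # 8
```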