Bayesian Learning
Application to Text Classification
Example: spam filtering
Kunstmatige Intelligentie / RuG
Marius Bulacu & prof. dr. Lambert Schomaker
KI2 - 3
2
Founders of Probability Theory
Blaise Pascal
(1623-1662, France)
Pierre Fermat
(1601-1665, France)
They laid the foundations of the probability theory in a correspondence on a dice game.
3
Prior, Joint and Conditional Probabilities
P(A, B) = joint probability of A and B
P(A | B) = conditional (posterior) probability of A given B
P(B | A) = conditional (posterior) probability of B given A
P(A) = prior probability of A
P(B) = prior probability of B
4
Probability Rules
Product rule:
P(A, B) = P(A | B) P(B)
or equivalently
P(A, B) = P(B | A) P(A)
Sum rule:
P(A) = Σ_B P(A, B) = Σ_B P(A | B) P(B)
if A is conditionalized on B, then the total probability of A is the sum of its joint probabilities over all values of B
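The two rules can be checked numerically on a toy joint distribution; the probability table below is made up for illustration:

```python
# Sketch: verifying the product and sum rules on an illustrative
# joint distribution P(A, B) over two binary-like variables.
P_joint = {          # P(A=a, B=b), values chosen arbitrarily, summing to 1
    ("a1", "b1"): 0.10, ("a1", "b2"): 0.30,
    ("a2", "b1"): 0.20, ("a2", "b2"): 0.40,
}

# Prior P(B) by marginalizing out A
P_B = {}
for (a, b), p in P_joint.items():
    P_B[b] = P_B.get(b, 0.0) + p

# Product rule rearranged: P(A | B) = P(A, B) / P(B)
P_A_given_B = {(a, b): p / P_B[b] for (a, b), p in P_joint.items()}

# Sum rule: P(A) = sum_B P(A | B) P(B) recovers the direct marginal of A
P_A = {}
for (a, b), p in P_A_given_B.items():
    P_A[a] = P_A.get(a, 0.0) + p * P_B[b]

print(P_A)
```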
5
Statistical Independence
Two random variables A and B are independent iff:
P(A, B) = P(A) P(B)
P(A | B) = P(A)
P(B | A) = P(B)
knowing the value of one variable does not yield any information about the value of the other
6
Statistical Dependence - Bayes
Thomas Bayes
(1702-1761, England)
“Essay towards solving a problem in the doctrine of chances” published in the Philosophical Transactions of the Royal Society of London in 1764.
7
Bayes Theorem
P(A|B) = P(A, B) / P(B)
P(B|A) = P(A, B) / P(A)
=> P(A, B) = P(A|B) P(B) = P(B|A) P(A)
=> P(A|B) = P(B|A) P(A) / P(B)
8
Bayes Theorem - Causality
Diagnostic:
P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
Pattern Recognition:
P(Class|Feature) = P(Feature|Class) P(Class) / P(Feature)
P(A|B) = P(B|A) P(A) / P(B)
9
Bayes Formula and Classification
p(C|X) = p(X|C) p(C) / p(X)
p(C|X): posterior probability of the class after seeing the data
p(C): prior probability of the class before seeing anything
p(X|C): conditional likelihood of the data given the class
p(X): unconditional probability of the data
10
Medical example
p(+disease) = 0.002
p(+test | +disease) = 0.97
p(+test | -disease) = 0.04

p(+test) = p(+test | +disease) * p(+disease) + p(+test | -disease) * p(-disease)
         = 0.97 * 0.002 + 0.04 * 0.998 = 0.00194 + 0.03992 = 0.04186

p(+disease | +test) = p(+test | +disease) * p(+disease) / p(+test)
                    = 0.97 * 0.002 / 0.04186 = 0.00194 / 0.04186 = 0.046

p(-disease | +test) = p(+test | -disease) * p(-disease) / p(+test)
                    = 0.04 * 0.998 / 0.04186 = 0.03992 / 0.04186 = 0.954
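The arithmetic above can be reproduced in a few lines, using the same numbers as the slide:

```python
# Sketch of the medical example: Bayes' theorem with the slide's numbers.
p_disease = 0.002                 # prior p(+disease)
p_test_given_disease = 0.97       # sensitivity p(+test | +disease)
p_test_given_healthy = 0.04       # false-positive rate p(+test | -disease)

# Total probability of a positive test (sum rule)
p_test = (p_test_given_disease * p_disease
          + p_test_given_healthy * (1 - p_disease))

# Posterior via Bayes' theorem
p_disease_given_test = p_test_given_disease * p_disease / p_test

print(round(p_test, 5))                # 0.04186
print(round(p_disease_given_test, 3))  # 0.046: still unlikely to be ill!
```

Note how small the posterior remains: the rare prior (0.002) dominates the fairly accurate test.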
11
MAP Classification
To minimize the probability of misclassification, assign a new input x to the class with the Maximum A Posteriori probability, e.g. assign x to class C1 if:
p(C1|x) > p(C2|x) <=> p(x|C1)p(C1) > p(x|C2)p(C2)
The decision threshold therefore lies where the two posterior distributions p(C1|x) = p(x|C1)p(C1) and p(C2|x) = p(x|C2)p(C2) cross each other.
12
Maximum Likelihood Classification
When the prior class probabilities are not known, or for equal (non-informative) priors:
p(x|C1)p(C1) > p(x|C2)p(C2)
becomes
p(x|C1) > p(x|C2)
Therefore assign the input x to the class with the maximum likelihood of having generated it.
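A minimal sketch of both decision rules, assuming Gaussian class-conditional densities with made-up means, spreads and priors:

```python
# Sketch: MAP vs. ML classification for two classes with Gaussian
# class-conditional densities.  All parameters are illustrative.
import math

def gauss(x, mu, sigma):
    """Gaussian PDF value at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu1, s1, prior1 = 0.0, 1.0, 0.9   # class C1: likelihood params and prior p(C1)
mu2, s2, prior2 = 2.0, 1.0, 0.1   # class C2

def classify_map(x):
    # compare p(x|C1)p(C1) vs p(x|C2)p(C2)
    return "C1" if gauss(x, mu1, s1) * prior1 > gauss(x, mu2, s2) * prior2 else "C2"

def classify_ml(x):
    # equal / unknown priors: compare likelihoods only
    return "C1" if gauss(x, mu1, s1) > gauss(x, mu2, s2) else "C2"

x = 1.2
print(classify_ml(x), classify_map(x))  # ML picks C2; MAP's strong prior keeps C1
```

The point x = 1.2 lies closer to the C2 mean, so ML chooses C2, but the heavy prior on C1 shifts the MAP decision boundary past it.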
13
Continuous Features
Two methods for dealing with continuous-valued features:
1) Binning: divide the range of continuous values into a discrete number of bins, then apply the discrete methodology.
2) Mixture of Gaussians: make an assumption regarding the functional form of the PDF (a linear combination of Gaussians) and estimate the corresponding parameters (means and standard deviations).
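Method 1 (binning) can be sketched as follows; the bin range and the sample values for one class are illustrative assumptions:

```python
# Sketch: estimating a discrete class-conditional likelihood
# P(bin | class) by binning continuous feature values.
n_bins, lo, hi = 5, 0.0, 10.0
samples = [1.2, 2.3, 2.9, 4.1, 4.4, 4.8, 6.0, 7.7]   # feature values for one class

def bin_index(x):
    i = int((x - lo) / (hi - lo) * n_bins)
    return min(max(i, 0), n_bins - 1)                 # clamp values at the edges

counts = [0] * n_bins
for x in samples:
    counts[bin_index(x)] += 1

# Relative frequencies approximate p(x falls in bin | class)
probs = [c / len(samples) for c in counts]
print(probs)
```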
14
Accumulation of Evidence
Bayesian inference allows for integrating prior knowledge about the world (beliefs being expressed in terms of probabilities) with new incoming data.
Different forms of data (possibly incommensurable) can be fused towards the final decision using the “common currency” of probability.
As the new data arrives, the latest posterior becomes the new prior for interpreting the new input.
p(C|X,Y) ∝ p(X,Y,C) = p(C) p(X,Y|C) = p(C) p(X|C) p(Y|C,X)
p(C|X,Y,Z) ∝ p(C) p(X|C) p(Y|C,X) p(Z|C,X,Y) ...
Here p(C) is the prior, p(C) p(X|C) plays the role of the new prior once X has been seen, and so on.
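The update loop — the latest posterior becoming the new prior — can be sketched as follows; the two observed words and their likelihood tables are invented for illustration:

```python
# Sketch of evidence accumulation: after each observation the posterior
# becomes the prior for interpreting the next input.
prior = {"spam": 0.5, "ham": 0.5}

# p(word | class) for two observed words, assumed independent given the class
likelihood = {
    "free":  {"spam": 0.30, "ham": 0.02},
    "offer": {"spam": 0.20, "ham": 0.05},
}

posterior = dict(prior)
for word in ["free", "offer"]:
    unnorm = {c: likelihood[word][c] * posterior[c] for c in posterior}
    z = sum(unnorm.values())                           # p(word): the normalizer
    posterior = {c: p / z for c, p in unnorm.items()}  # becomes the new prior

print(posterior)
```

Starting from even odds, two moderately spam-flavoured words already push the spam posterior above 0.98.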
15
Example: temperature classification
Classes C: Cold P(x|C), Normal P(x|N), Warm P(x|W), Hot P(x|H)
[Figure: the class-conditional likelihoods P(x|C), P(x|N), P(x|W), P(x|H) and the unconditional likelihood P(x) plotted over the range of x values]
16
Bayes: probability “blow up”
Classes C: Cold P(x|C), Normal P(x|N), Warm P(x|W), Hot P(x|H)
P(C|x) = P(x|C) P(C) / P(x)
[Figure: class-conditional likelihoods P(x|·) in, posterior probabilities P(·|x) out]
The Bayesian output has a nice plateau even with an irregular PDF shape.
18
Puzzle
So if Bayes is optimal and can be used for continuous data too, why did it become popular so late, i.e., much later than neural networks?
19
Why Bayes has become popular so late…
[Figure: a 1-dimensional PDF P(x) plotted over x]
Note: the example was 1-dimensional.
A PDF (histogram) with 100 bins for one dimension will cost 10,000 cells for two dimensions, etc.
N_cells = N_bins ^ n_dims
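The cell-count formula is easy to confirm for a few dimensionalities:

```python
# Sketch: N_cells = N_bins ** n_dims grows exponentially with dimension.
n_bins = 100
for n_dims in (1, 2, 3, 6):
    print(n_dims, n_bins ** n_dims)   # 6 dims already needs 10^12 cells
```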
20
Why Bayes has become popular so late…
N_cells = N_bins ^ n_dims
Yes… but you could use n-dimensional theoretical distributions (Gauss, Weibull etc.) instead of empirically measured PDFs…
21
Why Bayes has become popular so late…
… use theoretical distributions instead of empirically measured PDFs …
Still the dimensionality is a problem:
– ~20 samples needed to estimate a 1-dim. Gaussian PDF
– ~400 samples needed to estimate a 2-dim. Gaussian!, etc.
Massive amounts of labeled data are needed to estimate probabilities reliably!
22
Labeled (ground truthed) data
Example: client evaluation in insurance
0.1 0.54 0.53 0.874 8.455 0.001 –0.111 risk
0.2 0.59 0.01 0.974 8.40 0.002 –0.315 risk
0.11 0.4 0.3 0.432 7.455 0.013 –0.222 safe
0.2 0.64 0.13 0.774 8.123 0.001 –0.415 risk
0.1 0.17 0.59 0.813 9.451 0.021 –0.319 risk
0.8 0.43 0.55 0.874 8.852 0.011 –0.227 safe
0.1 0.78 0.63 0.870 8.115 0.002 –0.254 risk
. . . . . . . .
23
Success of speech recognition
– massive amounts of data
– increased computing power
– cheap computer memory
allowed for the use of Bayes in hidden Markov models for speech recognition.
Similarly (but slower): application of Bayes in script recognition.
Global structure: year, title, date, date and number of entry (“Rappt”), redundant lines between paragraphs, jargon words (“Notificatie”, “Besluit”, “fiat”), imprint with page number
XML model
Local probabilistic structure:
P(“Novb 16 is a date” | “sticks out to the left” & is left of “Rappt”) ?
26
Naive Bayes - Conditional Independence
Naive Bayes assumes the attributes (features) are independent given the class:
p(X,Y|C) = p(X|C) p(Y|C)
or
p(x1, ... xn|C) = Π_i p(xi|C)
Often works surprisingly well in practice despite its manifest simplicity.
27
Accumulation of Evidence – Independence
p(C|X,Y) = p(X,Y|C) p(C) / p(X,Y)
         = p(Y|C,X) p(X|C) p(C) / (p(Y|X) p(X))
         = p(Y|C,X) p(C|X) / p(Y|X)
         ≈ p(Y|C) p(C|X) / p(Y)
using the “naive” assumption that X and Y are independent
28
The Naive Bayes Classifier
Assume that each sample x to be classified is described by the attributes a1, a2 ... an.
The most probable (MAP) classification for x is:
class(x) = argmax_ci P(ci | a1, a2 ... an)
         = argmax_ci P(a1, a2 ... an | ci) P(ci) / P(a1, a2 ... an)
         = argmax_ci P(a1, a2 ... an | ci) P(ci)
Naive Bayes independence assumption:
P(a1, a2 ... an | ci) = Π_j P(aj | ci)
Therefore:
class(x) = argmax_ci P(ci) Π_j P(aj | ci)
29
Learning to Classify Text
Representation: each electronic document is represented by the set of words that it contains, under the independence assumptions:
- order of words does not matter
- co-occurrences of words do not matter
i.e. each document is represented as a “bag of words”.
Learning: estimate from the training dataset of documents
- the prior class probability P(ci)
- the conditional likelihood P(wj|ci) of a word wj given the document class ci
Classification: maximum a posteriori (MAP)
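The learning and classification steps above can be sketched as a minimal bag-of-words naive Bayes classifier; the four training messages and the add-one (Laplace) smoothing are illustrative choices, not taken from the slides:

```python
# Minimal bag-of-words naive Bayes text classifier (illustrative sketch).
import math
from collections import Counter

train = [
    ("win free credit card now", "spam"),
    ("claim your free prize", "spam"),
    ("workshop paper deadline", "ham"),
    ("meeting notes and agenda", "ham"),
]

# Learning: class priors P(c) and per-class word counts for P(w|c)
class_counts = Counter(c for _, c in train)
word_counts = {c: Counter() for c in class_counts}
for text, c in train:
    word_counts[c].update(text.split())

vocab = {w for wc in word_counts.values() for w in wc}

def log_posterior(text, c):
    # log P(c) + sum_j log P(w_j | c), with add-one smoothing so that
    # unseen words do not zero out the whole product
    lp = math.log(class_counts[c] / len(train))
    total = sum(word_counts[c].values())
    for w in text.split():
        lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return lp

def classify(text):
    # MAP decision: the class maximizing P(c) * prod_j P(w_j|c)
    return max(class_counts, key=lambda c: log_posterior(text, c))

print(classify("free credit card offer"))   # spam
print(classify("paper deadline extended"))  # ham
```

Working in log space avoids numerical underflow when documents contain many words.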
30
Learning to Classify e-mail
Is this e-mail spam?: e-mail → {spam, ham}
Each word represents an attribute characterizing the e-mail.
Estimate the class priors p(spam) and p(ham) from the training data, as well as the class-conditional likelihoods for all the encountered words.
For a new e-mail, assuming naive Bayes conditional independence, compute the MAP hypothesis.
31
Spam filtering
Example of regular mail:
From acd@essex.ac.uk Mon Nov 10 19:23:44 2003 Return-Path: <alan@essex.ac.uk> Received: from serlinux15.essex.ac.uk (serlinux15.essex.ac.uk [155.245.48.17]) by tcw2.ppsw.rug.nl (8.12.8/8.12.8) with ESMTP id hAAIecHC008727; Mon, 10 Nov 2003 19:40:38 +0100
Apologies for multiple postings. > 2nd C a l l f o r P a p e r s > DAS 2004 > Sixth IAPR International Workshop on > Document Analysis Systems > September 8-10, 2004 > Florence, Italy > http://www.dsi.unifi.it/DAS04 > Note: > There are two main additions with respect to the previous CFP: > 1) DAS&DL data are now available on the workshop web site > 2) Proceedings will be published by Springer Verlag in LNCS series
32
Spam filtering
Example of spam:
From : Easy Qualify" <mbulacu@netaccessproviders.net> To : bulacu@hotmail.com Subject : Claim your Unsecured Platinum Card - 75OO dollar limit Date : Tue, 28 Oct 2003 17:12:07 -0400
==================================================mbulacu - Tuesday, Oct 28, 2003==================================================
Congratulations, you have been selected for an Unsecured Platinum Credit Card / $7500 starting credit limit.
This offer is valid even if you've had past credit problems or evenno credit history. Now you can receive a $7,500 unsecured Platinum Credit Card that can help build your credit. And to help get your card to you sooner, we have been authorized to waive any employment or credit verification.
33
Conclusions
Effective: about 90% correct classification.
Could be applied to any text classification problem.
Needs to be polished.
34
Summary
Bayesian inference allows for integrating prior knowledge about the world (beliefs expressed in terms of probabilities) with new incoming data.
Inductive bias of Naive Bayes: attributes are independent given the class. Although this assumption is often violated, it provides a very efficient and widely used tool (e.g. text classification – spam filtering).
Applicable to discrete or continuous data.