naive bayes presentation
Posted on 17-Jul-2015
Naive Bayes
Md Enamul Haque Chowdhury
ID : CSE013083972D
University of Luxembourg
(Based on presentations by Ke Chen and Ashraf Uddin)
Contents
Background
Bayes Theorem
Bayesian Classifier
Naive Bayes
Uses of Naive Bayes classification
Relevant Issues
Advantages and Disadvantages
Some NBC Applications
Conclusions
Background
There are three methods to establish a classifier
a) Model a classification rule directly
Examples: k-NN, decision trees, perceptron, SVM
b) Model the probability of class memberships given input data
Example: perceptron with the cross-entropy cost
c) Make a probabilistic model of data within each class
Examples: Naive Bayes, model-based classifiers
a) and b) are examples of discriminative classification
c) is an example of generative classification
b) and c) are both examples of probabilistic classification
Bayes Theorem
Given a hypothesis h and data D which bears on the hypothesis:
P(h): independent probability of h: prior probability
P(D): independent probability of D
P(D|h): conditional probability of D given h: likelihood
P(h|D): conditional probability of h given D: posterior probability
Bayes theorem: P(h|D) = P(D|h) · P(h) / P(D)
Maximum A Posteriori
Based on Bayes theorem, we can compute the Maximum A Posteriori (MAP) hypothesis for the data.
We are interested in the best hypothesis for some space H given observed training data D.
H: set of all hypotheses.
hMAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) · P(h) / P(D) = argmax_{h ∈ H} P(D|h) · P(h)
Note that we can drop P(D), as the probability of the data is constant (and independent of the hypothesis).
Maximum Likelihood
Now assume that all hypotheses are equally probable a priori, i.e. P(hi) = P(hj) for all hi, hj in H.
This is called assuming a uniform prior. It simplifies computing the posterior:
hML = argmax_{h ∈ H} P(D|h)
This hypothesis is called the maximum likelihood hypothesis.
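To make the distinction concrete, here is a small sketch (the hypothesis space, priors and likelihoods are invented numbers, not from the slides) in which the MAP and ML hypotheses differ because of an uneven prior:

```python
# Toy MAP vs. ML comparison over a three-hypothesis space.
# The priors and likelihoods below are made-up numbers for illustration.
hypotheses = {
    "h1": {"prior": 0.7, "likelihood": 0.2},  # P(h), P(D|h)
    "h2": {"prior": 0.2, "likelihood": 0.4},
    "h3": {"prior": 0.1, "likelihood": 0.9},
}

# MAP maximizes P(D|h) * P(h); ML maximizes P(D|h) alone.
h_map = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])
h_ml = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"])
print(h_map, h_ml)  # h1 h3
```

Here the strong prior on h1 outweighs the higher likelihood of h3, so the two criteria disagree; under a uniform prior they would always coincide.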
Bayesian Classifier
The classification problem may be formalized using a posteriori probabilities:
P(C|X) = prob. that the sample tuple X=<x1,…,xk> is of class C.
E.g. P(class=N | outlook= sunny, windy=true,…)
Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a posteriori probabilities
Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
P(X) is constant for all classes
P(C) = relative frequency of class C samples
C such that P(C|X) is maximal = C such that P(X|C)·P(C) is maximal
Problem: computing P(X|C) directly is infeasible!
Naive Bayes
Bayes classification:
P(C|X) ∝ P(X|C) · P(C) = P(X1, …, Xn | C) · P(C)
- Difficulty: learning the joint probability P(X1, …, Xn | C)
Naive Bayes classification
- Assumption: all input features are conditionally independent given the class!
P(X1, X2, …, Xn | C) = P(X1 | X2, …, Xn, C) · P(X2, …, Xn | C)
= P(X1 | C) · P(X2, …, Xn | C)
= P(X1 | C) · P(X2 | C) · … · P(Xn | C)
- MAP classification rule: for x = (x1, …, xn), assign the label c* if
[P(x1|c*) · … · P(xn|c*)] · P(c*) > [P(x1|c) · … · P(xn|c)] · P(c), for c ≠ c*, c ∈ {c1, …, cL}
Naive Bayes
Algorithm: Discrete-Valued Features
- Learning Phase: Given a training set S,
for each target value ci (ci = c1, …, cL):
P̂(C = ci) ← estimate P(C = ci) with examples in S;
for every feature value xjk of each feature Xj (j = 1, …, n; k = 1, …, Nj):
P̂(Xj = xjk | C = ci) ← estimate P(Xj = xjk | C = ci) with examples in S;
Output: conditional probability tables; for Xj, Nj × L elements
- Test Phase: Given an unknown instance X´ = (a´1, …, a´n),
look up the tables to assign the label c* to X´ if
[P̂(a´1|c*) · … · P̂(a´n|c*)] · P̂(c*) > [P̂(a´1|c) · … · P̂(a´n|c)] · P̂(c), for c ≠ c*, c ∈ {c1, …, cL}
Example
The training set: the classic PlayTennis weather data, 14 examples with features Outlook, Temperature, Humidity and Wind, and target Play = Yes/No.
Example
Learning Phase :
P(Play=Yes) = 9/14
P(Play=No) = 5/14
Outlook      Play=Yes  Play=No
Sunny          2/9       3/5
Overcast       4/9       0/5
Rain           3/9       2/5

Temperature  Play=Yes  Play=No
Hot            2/9       2/5
Mild           4/9       2/5
Cool           3/9       1/5

Humidity     Play=Yes  Play=No
High           3/9       4/5
Normal         6/9       1/5

Wind         Play=Yes  Play=No
Strong         3/9       3/5
Weak           6/9       2/5
Example
Test Phase :
-Given a new instance, predict its label
x´=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
- Look up the tables obtained in the learning phase
-Decision making with the MAP rule:
P(Outlook=Sunny|Play=Yes) = 2/9
P(Temperature=Cool|Play=Yes) = 3/9
P(Humidity=High|Play=Yes) = 3/9
P(Wind=Strong|Play=Yes) = 3/9
P(Play=Yes) = 9/14
P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=No) = 3/5
P(Play=No) = 5/14
P(Yes|x´): [ P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes) ] P(Play=Yes) = 0.0053
P(No|x´): [ P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No) ] P(Play=No) = 0.0206
Since P(Yes|x´) < P(No|x´), we label x´ as “No”.
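The two scores above can be reproduced with a short script. This is an illustrative sketch, not code from the slides; the probability tables are the ones computed in the learning phase:

```python
from functools import reduce

# Priors and conditional probability tables from the learning phase.
prior = {"Yes": 9/14, "No": 5/14}
cpt = {
    "Yes": {"Outlook=Sunny": 2/9, "Temperature=Cool": 3/9,
            "Humidity=High": 3/9, "Wind=Strong": 3/9},
    "No":  {"Outlook=Sunny": 3/5, "Temperature=Cool": 1/5,
            "Humidity=High": 4/5, "Wind=Strong": 3/5},
}

# The test instance x' from the slide.
x = ["Outlook=Sunny", "Temperature=Cool", "Humidity=High", "Wind=Strong"]

# MAP rule: score each class by prior * product of per-feature likelihoods.
scores = {c: reduce(lambda p, f: p * cpt[c][f], x, prior[c]) for c in prior}
label = max(scores, key=scores.get)
print(round(scores["Yes"], 4), round(scores["No"], 4), label)  # 0.0053 0.0206 No
```

Note the scores are unnormalized (P(x´) is dropped, as in the MAP rule), so they are proportional to, not equal to, P(Yes|x´) and P(No|x´).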
Naive Bayes
Algorithm: Continuous-Valued Features
- Innumerable (continuous) values for a feature
- Conditional probability is often modeled with the normal distribution:
P̂(Xj | C = ci) = 1/(√(2π)·σji) · exp( −(Xj − μji)² / (2·σji²) )
μji: mean (average) of feature values Xj of examples for which C = ci
σji: standard deviation of feature values Xj of examples for which C = ci
- Learning Phase: for X = (X1, …, Xn), C = c1, …, cL
Output: n × L normal distributions and P(C = ci), i = 1, …, L
- Test Phase: Given an unknown instance X´ = (a´1, …, a´n),
instead of looking up tables, calculate the conditional probabilities with the normal distributions obtained in the learning phase,
then apply the MAP rule to make a decision.
Naive Bayes
Example: Continuous-Valued Features
- Temperature is naturally continuous-valued.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
- Estimate the mean and (sample) variance for each class:
μ = (1/N) Σn xn,  σ² = (1/(N−1)) Σn (xn − μ)²
- Learning Phase: output two Gaussian models for P(temp|C):
μYes = 21.64, σYes = 2.35
μNo = 23.88, σNo = 7.09
P̂(x|Yes) = 1/(√(2π)·2.35) · exp( −(x − 21.64)² / (2·2.35²) )
P̂(x|No) = 1/(√(2π)·7.09) · exp( −(x − 23.88)² / (2·7.09²) )
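These estimates can be checked with a few lines (an illustrative script, not from the slides; `statistics.stdev` uses the N−1 denominator, which matches the slide's numbers):

```python
import math
import statistics

# The temperature readings per class from the example.
yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
no = [27.3, 30.1, 17.4, 29.5, 15.1]

# Per-class Gaussian parameters (sample standard deviation).
mu_yes, sd_yes = statistics.mean(yes), statistics.stdev(yes)
mu_no, sd_no = statistics.mean(no), statistics.stdev(no)
print(round(mu_yes, 2), round(sd_yes, 2))  # 21.64 2.35
print(round(mu_no, 2), round(sd_no, 2))    # 23.88 7.09

def gaussian(x, mu, sd):
    """Class-conditional density P(temp = x | C) under the fitted normal."""
    return math.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (math.sqrt(2 * math.pi) * sd)

# A temperature of 20.0 is more likely under the Yes model than the No model.
print(gaussian(20.0, mu_yes, sd_yes) > gaussian(20.0, mu_no, sd_no))  # True
```

In the test phase these densities simply replace the table look-ups in the product of the MAP rule.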
Uses of Naive Bayes classification
Text Classification
Spam Filtering
Hybrid Recommender System
- Recommender Systems apply machine learning and data mining techniques for
filtering unseen information and can predict whether a user would like a given
resource
Online Application
- Simple Emotion Modeling
Why text classification?
Learning which articles are of interest
Classify web pages by topic
Information extraction
Internet filters
Examples of Text Classification
CLASSES = BINARY: “spam” / “not spam”
CLASSES = TOPICS: “finance” / “sports” / “politics”
CLASSES = OPINION: “like” / “hate” / “neutral”
CLASSES = TOPICS: “AI” / “Theory” / “Graphics”
CLASSES = AUTHOR: “Shakespeare” / “Marlowe” / “Ben Jonson”
Naive Bayes Approach
Build the Vocabulary as the list of all distinct words that appear in all the documents
of the training set.
Remove stop words and markings
The words in the vocabulary become the attributes, assuming that classification is
independent of the positions of the words
Each document in the training set becomes a record with frequencies for each word
in the Vocabulary.
Train the classifier on the training data set by computing the prior probability of each class and the conditional probabilities of the attributes.
Evaluate the results on Test data
Text Classification Algorithm: Naive Bayes
Tct – number of occurrences of term t in training documents of class c
Σt′ Tct′ – total number of term occurrences in documents of class c
B – number of distinct words across all classes (vocabulary size)
With these counts, the smoothed conditional probability of term t given class c is estimated as:
P̂(t|c) = (Tct + 1) / (Σt′ Tct′ + B)
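A minimal sketch of this count-based text classifier on a toy corpus (all documents and labels here are invented for illustration):

```python
import math
from collections import Counter

# Tiny labeled corpus; contents are made up for the sketch.
train = [
    ("spam", "win money now"),
    ("spam", "win win prize"),
    ("ham", "meeting at noon"),
    ("ham", "lunch meeting tomorrow"),
]

# T_ct: per-class term counts; priors from class document frequencies.
counts = {c: Counter() for c in {c for c, _ in train}}
doc_freq = Counter(c for c, _ in train)
for c, doc in train:
    counts[c].update(doc.split())

vocab = {w for ctr in counts.values() for w in ctr}  # B = len(vocab)

def score(c, doc):
    """log P(c) plus sum of log P(t|c) with add-one (Laplace) smoothing."""
    total = sum(counts[c].values())  # sum over t' of T_ct'
    s = math.log(doc_freq[c] / len(train))
    for t in doc.split():
        s += math.log((counts[c][t] + 1) / (total + len(vocab)))
    return s

label = max(doc_freq, key=lambda c: score(c, "win money tomorrow"))
print(label)  # spam
```

Working in log space avoids the numeric underflow that multiplying many small word probabilities would otherwise cause.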
Relevant Issues
Violation of Independence Assumption
Zero conditional probability Problem
Violation of Independence Assumption
Naive Bayesian classifiers assume that the effect of an attribute value on a given
class is independent of the values of the other attributes. This assumption is called
class conditional independence. It is made to simplify the computations involved and,
in this sense, is considered “naive.”
Improvement
Bayesian belief networks are graphical models that, unlike naive Bayesian classifiers, allow the representation of dependencies among subsets of attributes.
Bayesian belief networks can also be used for classification.
Zero conditional probability Problem
If a given class and feature value never occur together in the training set then the
frequency-based probability estimate will be zero.
This is problematic since it will wipe out all information in the other probabilities when
they are multiplied.
It is therefore often desirable to incorporate a small-sample correction in all
probability estimates such that no probability is ever set to be exactly zero.
Naive Bayes Laplace Correction
To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to
each count
Example
Suppose that for the class buys_computer = yes we have a training database D containing 1000 tuples:
0 tuples with income = low,
990 tuples with income = medium, and
10 tuples with income = high.
The probabilities of these events, without the Laplacian correction, are 0, 0.990 (from 990/1000), and 0.010 (from 10/1000), respectively.
Using the Laplacian correction for the three quantities, we pretend that we have 1 more tuple for each income value. In this way, we instead obtain the probabilities 1/1003 ≈ 0.001, 991/1003 ≈ 0.988, and 11/1003 ≈ 0.011, respectively. The “corrected” probability estimates are close to their “uncorrected” counterparts, yet the zero probability value is avoided.
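The correction can be verified in a few lines (an illustrative sketch of the same income example):

```python
# Add-one (Laplace) smoothing for the income attribute,
# reproducing the buys_computer = yes example above.
counts = {"low": 0, "medium": 990, "high": 10}
n = sum(counts.values())  # 1000 tuples in the class
k = len(counts)           # 3 distinct income values

raw = {v: c / n for v, c in counts.items()}
smoothed = {v: (c + 1) / (n + k) for v, c in counts.items()}

print(raw["low"], round(smoothed["low"], 3))  # 0.0 0.001
print(round(smoothed["medium"], 3))           # 0.988
print(round(smoothed["high"], 3))             # 0.011
```

The smoothed estimates still sum to 1 over the three income values, since each count gains exactly one pseudo-tuple.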
Advantages
Easy to implement
Requires a small amount of training data to estimate the parameters
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
- E.g., in hospitals: a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.) and disease (lung cancer, diabetes, etc.)
- Dependencies among these cannot be modelled by a Naive Bayes classifier
Some NBC Applications
Credit scoring
Marketing applications
Employee selection
Image processing
Speech recognition
Search engines…
Conclusions
Naive Bayes is:
- Really easy to implement and often works well
- Often a good first thing to try
- Commonly used as a “punching bag” for smarter algorithms
References
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch6.pdf
Data Mining: Concepts and Techniques, 3rd Edition, Han, Kamber & Pei, ISBN 9780123814791
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://www.slideshare.net/ashrafmath/naive-bayes-15644818
http://www.slideshare.net/gladysCJ/lesson-71-naive-bayes-classifier
Questions?