Machine Learning, Chapter 6, CSE 574, Spring 2003 (srihari/cse574/chapbl/chapbl.part2a.pdf)
Bayes Optimal Classifier (6.7)
Bayes Optimal Classification
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$$
Bayes Optimal Classifier
• Instead of asking “What is the most probable hypothesis given the training data?”, ask:
• “What is the most probable classification of the new instance given the training data?”
• Instead of learning the function fi, the Bayes optimal classifier assigns any given input to the most likely output vj
[Figure: a function f_i with inputs x0, x1, x2 and output v_j]
Bayes Optimal Classifier
• Instead of learning the function, the Bayes optimal classifier assigns any given input to the most likely output
• Calculate a posteriori probabilities
• P(x0,x1,x2|0) is the class-conditional probability
[Figure: a function f_i with inputs x0, x1, x2]

$$P(0 \mid x_0, x_1, x_2) = \frac{P(x_0, x_1, x_2 \mid 0)\, P(0)}{P(x_0, x_1, x_2)}$$
Example of Bayes Optimal Classifier

x0 x1 x2 | f0 f1 f2 f3 f4 ... f255
 0  0  0 |  0  1  0  1  0 ...   1
 0  0  1 |  0  0  1  1  0 ...   1
 0  1  0 |  0  0  0  0  1 ...   1
 0  1  1 |  0  0  0  0  0 ...   1
 1  0  0 |  0  0  0  0  0 ...   1
 1  0  1 |  0  0  0  0  0 ...   1
 1  1  0 |  0  0  0  0  0 ...   1
 1  1  1 |  0  0  0  0  0 ...   1
$$P(0 \mid x_0, x_1, x_2) = \frac{P(x_0, x_1, x_2 \mid 0)\, P(0)}{P(x_0, x_1, x_2)}
\qquad
P(1 \mid x_0, x_1, x_2) = \frac{P(x_0, x_1, x_2 \mid 1)\, P(1)}{P(x_0, x_1, x_2)}$$
Bayes Optimal Classifier
• To calculate a posteriori probabilities, need to know the class-conditional probabilities
• Each is a table of 2^n different probabilities estimated from many training samples
P(x0,x1,x2 | 0):                  P(x0,x1,x2 | 1):
x0 x1 x2  Prob(0)                 x0 x1 x2  Prob(1)
 0  0  0   0.10                    0  0  0   0.05
 0  0  1   0.05                    0  0  1   0.10
 0  1  0   0.10                    0  1  0   0.25
 0  1  1   0.25                    0  1  1   0.25
 1  0  0   0.30                    1  0  0   0.10
 1  0  1   0.10                    1  0  1   0.10
 1  1  0   0.05                    1  1  0   0.15
 1  1  1   0.05                    1  1  1   0.05
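To make the use of these tables concrete, here is a minimal Python sketch (not from the slides) that plugs them into Bayes rule; since the slide does not give the class priors P(0) and P(1), equal priors are assumed purely for illustration.

```python
# Minimal sketch: posterior P(class | x0,x1,x2) via Bayes rule, using the
# class-conditional tables above. Equal priors are an assumption made here
# for illustration only; the slides do not specify them.
p_given_0 = {(0,0,0): .10, (0,0,1): .05, (0,1,0): .10, (0,1,1): .25,
             (1,0,0): .30, (1,0,1): .10, (1,1,0): .05, (1,1,1): .05}
p_given_1 = {(0,0,0): .05, (0,0,1): .10, (0,1,0): .25, (0,1,1): .25,
             (1,0,0): .10, (1,0,1): .10, (1,1,0): .15, (1,1,1): .05}
prior = {0: 0.5, 1: 0.5}   # assumed, not given on the slide

def posterior(x):
    """Return (P(class=0 | x), P(class=1 | x)) using Bayes rule."""
    joint0 = p_given_0[x] * prior[0]
    joint1 = p_given_1[x] * prior[1]
    evidence = joint0 + joint1          # P(x0, x1, x2)
    return joint0 / evidence, joint1 / evidence

print(posterior((1, 0, 0)))             # (0.75, 0.25) under equal priors
```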
Bayes Optimal Classifier
• Need to know Class-conditional probabilities
• The two tables together have 2·2^n entries
• Will need many training samples: need to see every instance many times in order to obtain reliable estimates
• When the number of attributes is large, it is impossible to even list all probabilities in a table
$P(x_0, x_1, x_2 \mid 0)$ and $P(x_0, x_1, x_2 \mid 1)$
Bayes Optimal Classifier
• Target function f(x) takes any value from a finite set V, e.g. {0,1}
• Each instance x is composed of attribute values x1, x2, .., xn
• Most probable target value v_MAP:
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid x_1, x_2, \ldots, x_n)
         = \arg\max_{v_j \in V} \frac{P(x_1, x_2, \ldots, x_n \mid v_j)\, P(v_j)}{P(x_1, x_2, \ldots, x_n)}$$
Most Probable Hypothesis vs Most Probable Classification
• Classification result can be different!
• Suppose three hypotheses f0, f1, f2 have posterior probabilities given the training data of .3, .4, and .3; therefore the MAP hypothesis is f1
• Instance x = <0,0,0> is classified as 1 by f1 but as 0 by f0 and f2
• P(1|x,D) = P(1|f0,x)P(f0|D,x) + P(1|f1,x)P(f1|D,x) + P(1|f2,x)P(f2|D,x) = 0(.3) + 1(.4) + 0(.3) = .4
• Similarly P(0|x,D) = .6
• Therefore the most probable classification of x is 0
x0 x1 x2 | f0 f1 f2 f3 f4 ... f255
 0  0  0 |  0  1  0  1  0 ...   1
 0  0  1 |  0  0  1  1  0 ...   1
 0  1  0 |  0  0  0  0  1 ...   1
 0  1  1 |  0  0  0  0  0 ...   1
 1  0  0 |  0  0  0  0  0 ...   1
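The same arithmetic can be written as a small Python sketch (illustrative only; the hypotheses, posteriors, and predictions are the ones stated on this slide):

```python
# Minimal sketch of the Bayes optimal classification rule
#   argmax_v  sum_h P(v | h, x) P(h | D)
# applied to the three-hypothesis example above.
posterior = {"f0": 0.3, "f1": 0.4, "f2": 0.3}        # P(h | D)
prediction = {"f0": 0, "f1": 1, "f2": 0}             # each h's label for x = <0,0,0>

def bayes_optimal(x_predictions, h_posterior, values=(0, 1)):
    """Return the value v maximizing sum_h P(v | h, x) * P(h | D)."""
    score = {v: sum(h_posterior[h] * (1.0 if x_predictions[h] == v else 0.0)
                    for h in h_posterior)
             for v in values}
    return max(score, key=score.get), score

print(bayes_optimal(prediction, posterior))   # -> (0, {0: 0.6, 1: 0.4})
```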
Maximum Likelihood and Least-Squared Error Hypotheses (6.4)
• Bayesian analysis shows that under certain circumstances any learning algorithm that minimizes the squared error between output hypothesis predictions and the training data will output a maximum likelihood hypothesis.
Learning a Real-Valued Function
Figure 6.2
Probability Density Function
$$p(x_0) \equiv \lim_{\epsilon \to 0} \frac{1}{\epsilon}\, P(x_0 \le x < x_0 + \epsilon)$$
Maximum Likelihood Hypothesis
Maximum Likelihood Hypothesis: One that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi)
$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2$$
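A one-line sketch of why this follows (the standard argument from Section 6.4, under the assumption that the training values are the target values corrupted by zero-mean Gaussian noise, d_i = f(x_i) + e_i):

$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}}
        = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{(d_i - h(x_i))^2}{2\sigma^2}
        = \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2$$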
Maximum Likelihood Hypotheses for Predicting Probabilities (6.5)
$$P(D \mid h) = \prod_{i=1}^{m} P(x_i, d_i \mid h)$$

$$P(D \mid h) = \prod_{i=1}^{m} P(x_i, d_i \mid h) = \prod_{i=1}^{m} P(d_i \mid h, x_i)\, P(x_i)$$
Maximum Likelihood Hypotheses for Predicting Probabilities, continued
$$P(d_i \mid h, x_i) = \begin{cases} h(x_i) & \text{if } d_i = 1 \\ 1 - h(x_i) & \text{if } d_i = 0 \end{cases}$$

$$P(d_i \mid h, x_i) = h(x_i)^{d_i}\,(1 - h(x_i))^{1-d_i}$$

$$P(D \mid h) = \prod_{i=1}^{m} h(x_i)^{d_i}\,(1 - h(x_i))^{1-d_i}\, P(x_i)$$
Maximum Likelihood Hypotheses for Predicting Probabilities, continued
$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i}\,(1 - h(x_i))^{1-d_i}\, P(x_i)$$

$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i}\,(1 - h(x_i))^{1-d_i}$$

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i))$$
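As a side note (standard, though not stated on the slide): the sum being maximized is the negative of the cross-entropy error, so the maximum likelihood hypothesis here is also the hypothesis minimizing cross-entropy:

$$h_{ML} = \arg\min_{h \in H} \; -\sum_{i=1}^{m} \Big[ d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i)) \Big]$$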
Gradient Search to Maximize Likelihood in a Neural Network (6.5.1)
$$\frac{\partial G(h,D)}{\partial w_{jk}} = \sum_{i=1}^{m} \frac{\partial G(h,D)}{\partial h(x_i)} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}$$

$$= \sum_{i=1}^{m} \frac{\partial \big[\, d_i \ln h(x_i) + (1 - d_i)\ln(1 - h(x_i)) \,\big]}{\partial h(x_i)} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}$$

$$= \sum_{i=1}^{m} \frac{d_i - h(x_i)}{h(x_i)\,(1 - h(x_i))} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}$$
Gradient Search to Maximize Likelihood in a Neural Network, continued
Weight update rule for gradient ascent on G(h,D):

$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}, \qquad \Delta w_{jk} = \eta \sum_{i=1}^{m} \big(d_i - h(x_i)\big)\, x_{ijk}$$

Compare with the update rule used by backpropagation to minimize the sum of squared errors:

$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}, \qquad \Delta w_{jk} = \eta \sum_{i=1}^{m} h(x_i)\,(1 - h(x_i))\,\big(d_i - h(x_i)\big)\, x_{ijk}$$
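A minimal Python sketch (not from the lecture) of one gradient-ascent step of the first update rule for a single sigmoid unit; the array shapes and the names X, d, w, eta are our own choices:

```python
import numpy as np

# One gradient-ascent step of the likelihood-maximizing rule
#   delta_w = eta * sum_i (d_i - h(x_i)) * x_i
# for a single sigmoid unit. Illustrative sketch only.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ml_gradient_step(w, X, d, eta=0.1):
    """X: (m, n) inputs, d: (m,) 0/1 targets, w: (n,) weights."""
    h = sigmoid(X @ w)                 # h(x_i) for every training example
    delta_w = eta * X.T @ (d - h)      # eta * sum_i (d_i - h(x_i)) x_i
    return w + delta_w

# Tiny usage example with toy data:
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 3)).astype(float)
d = (X[:, 0] == 1).astype(float)       # a toy target
w = np.zeros(3)
for _ in range(100):
    w = ml_gradient_step(w, X, d)
```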
Minimum Description Length Principle (6.6)
• Occam’s razor: choose the shortest explanation for the observed data
  • Used in decision tree design, where the goal was to find the shortest tree
• Here we consider:
  • a Bayesian perspective on this issue
  • a closely related principle called the Minimum Description Length (MDL) principle
Minimum Description Length Principle
• Motivated by interpreting the definition of h_MAP using concepts from information theory
• Familiar definition:

$$h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h)$$
Minimum Description Length Principle
$$h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h)$$

• Equivalently, taking logarithms:

$$h_{MAP} = \arg\max_{h \in H} \log_2 P(D \mid h) + \log_2 P(h)$$

• Equivalently, taking negatives:

$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h)$$
Minimum Description Length Principle
$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h)$$

• Interpretation of the above equation:
  • assuming a particular representation scheme for encoding hypotheses and data,
  • short hypotheses are to be preferred
  • explanation to follow
Designing a Compact Code to Transmit Messages Drawn at Random
• Probability of message i is p_i
• Find the code that minimizes the expected number of bits we must transmit to encode a message drawn at random
  • Assign shorter codes to more probable messages
• Shannon and Weaver (1949): the optimal code assigns -log_2 p_i bits to encode message i
• The number of bits needed to encode message i using code C is the description length of message i with respect to C, i.e., L_C(i)
Minimum Length encoding
• A Huffman code (C) optimally assigns shorter codes to more likely symbols

Message i   Code   p_i     Bit length L_C(i)
A           0      0.5     1
B           10     0.25    2
C           110    0.125   3
D           111    0.125   3
• Uniquely decodable
Expected length of a message
A: code 0,   prob 0.5,   length 1
B: code 10,  prob 0.25,  length 2
C: code 110, prob 0.125, length 3
D: code 111, prob 0.125, length 3
• Expected length of a message:
• Same as formula for entropy
$$-\sum_i p_i \log_2 p_i = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + 2\cdot\tfrac{1}{8}\log_2\tfrac{1}{8}\right) = 0.5 + 0.5 + 0.75 = 1.75 \text{ bits}$$
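The same arithmetic as a quick Python check (illustrative only): the optimal code length for message i is -log2 p_i, and the expected length equals the entropy of the distribution.

```python
import math

# Optimal code lengths and expected message length for the A/B/C/D example above.
p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

lengths = {msg: -math.log2(prob) for msg, prob in p.items()}    # -log2 p_i bits
expected = sum(prob * lengths[msg] for msg, prob in p.items())  # the entropy

print(lengths)    # {'A': 1.0, 'B': 2.0, 'C': 3.0, 'D': 3.0}
print(expected)   # 1.75
```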
Interpretation of MAP hypothesis in terms of Coding Theory
$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h)$$

• $-\log_2 P(h)$ is the description length of h under the optimal encoding for hypothesis space H, i.e., the size of the description of hypothesis h using this optimal representation:

$$-\log_2 P(h) = L_{C_H}(h), \qquad \text{where } C_H \text{ is the optimal code for } H$$

• $-\log_2 P(D \mid h)$ is the description length of the training data D given hypothesis h, under its optimal encoding:

$$-\log_2 P(D \mid h) = L_{C_{D|h}}(D), \qquad \text{where } C_{D|h} \text{ is the optimal code for describing } D \text{ assuming sender and receiver know hypothesis } h$$
Interpretation of Bayes MAP hypothesis in terms of Coding Theory
$$h_{MAP} = \arg\min_{h \in H} \; L_{C_H}(h) + L_{C_{D|h}}(D)$$
Minimum Description Length (MDL) Principle
• If C1 and C2 are the codes used to represent the hypothesis and the data given the hypothesis, respectively,
• the MDL principle recommends choosing h_MDL, where

$$h_{MDL} = \arg\min_{h \in H} \; L_{C_1}(h) + L_{C_2}(D \mid h)$$
Minimum Description Length (MDL) Principle
• If C1 and C2 are chosen optimally, then h_MDL = h_MAP
• Intuitively:
  • MDL recommends the shortest method for re-encoding the training data,
  • where we count the size of the hypothesis plus any additional cost of encoding the data given this hypothesis
Gibbs Algorithm (6.8)
• The Bayes optimal classifier can be costly to apply:
  • it computes the posterior probability for every hypothesis in H
  • then it combines the predictions of each hypothesis to classify each new instance
• The Gibbs algorithm is an alternative, less optimal method (a sketch follows):
  • choose a hypothesis h from H at random, according to the posterior probability distribution over H
  • use h to predict the classification of the next instance x
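A minimal Python sketch (not from the slides) of the Gibbs algorithm, reusing the three-hypothesis example from the earlier slide:

```python
import random

# Gibbs algorithm sketch: sample one hypothesis according to the posterior
# P(h | D), then classify with that hypothesis alone.
posterior = {"f0": 0.3, "f1": 0.4, "f2": 0.3}        # P(h | D)
prediction = {"f0": 0, "f1": 1, "f2": 0}             # each h's label for x = <0,0,0>

def gibbs_classify(x_predictions, h_posterior):
    hypotheses = list(h_posterior)
    h = random.choices(hypotheses,
                       weights=[h_posterior[name] for name in hypotheses])[0]
    return x_predictions[h]

print(gibbs_classify(prediction, posterior))   # 1 with probability .4, else 0
```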
Naïve Bayes Classifier
• Practical Bayesian learning method
• In some domains performance is comparable to that of neural network and decision tree learning
Naïve Bayes Classifier
• Based on the simplifying assumption that the attribute values are statistically independent given the class
$$P(x_1, x_2, \ldots, x_n \mid 0) = P(x_1 \mid 0)\, P(x_2 \mid 0) \cdots P(x_n \mid 0) = \prod_i P(x_i \mid 0)$$
Naïve Bayes Classifier
• Class-conditional probabilities assuming statistical independence
• Tables now have 2·2·n entries ==> much better than 2·2^n entries

x0  Prob(x0|0)    x1  Prob(x1|0)    x2  Prob(x2|0)
0   0.65          0   0.4           0   0.15
1   0.35          1   0.6           1   0.85

x0  Prob(x0|1)    x1  Prob(x1|1)    x2  Prob(x2|1)
0   0.65          0   0.4           0   0.15
1   0.35          1   0.6           1   0.85
Naïve Bayes Classifier (6.9)
• Naïve Bayes applies to learning tasks where:
  • each instance x is described by a conjunction of attribute values
  • the target function f(x) can take on any value from some finite set V
Naïve Bayes Classifier, continued
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$$

$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)}
          = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$$
Naïve Bayes Classifier, continued
Naïve Bayes Classifier
$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$
Naïve Bayes: PlayTennis Example
Classify days according to whether someone will play tennis. Given 14 examples:

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
Table 3.2
Naïve Bayes: PlayTennis Example
• Task is to predict the target value (yes or no) of the target concept PlayTennis for the new instance:
Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong
(PlayTennis training examples repeated; see Table 3.2 above)
Naïve Bayes: PlayTennis Example
v_NB is the target value output by the Naïve Bayes classifier. Instantiating the Naïve Bayes classifier equation to fit this task, the target value is given by

$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j) \prod_i P(a_i \mid v_j)$$

$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j)\, P(Outlook{=}sunny \mid v_j)\, P(Temperature{=}cool \mid v_j)\, P(Humidity{=}high \mid v_j)\, P(Wind{=}strong \mid v_j)$$

where P(v_j) are the prior probabilities and the remaining terms are the conditional probabilities.
Naïve Bayes: PlayTennis Example
$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j) \prod_i P(a_i \mid v_j)$$

$$= \arg\max_{v_j \in \{yes,\, no\}} P(v_j)\, P(Outlook{=}sunny \mid v_j)\, P(Temperature{=}cool \mid v_j)\, P(Humidity{=}high \mid v_j)\, P(Wind{=}strong \mid v_j)$$

Prior probabilities (2), conditional probabilities (8).
• Notice that in the final expression ai has been instantiated using the particular attribute values of the new instance
• To calculate vNB, need 10 probabilities that can be estimated from the training data
Estimating Prior Probabilities
• Probabilities of the different target values are estimated from frequencies over the 14 training examples:
P(PlayTennis = yes) = 9/14 = .64
P(PlayTennis = no) = 5/14 = .36
(PlayTennis training examples repeated; see Table 3.2 above)
Estimating Conditional Probabilities
• Similarly, we can estimate the conditional probabilities. For example, those for Wind = strong are:
P(Wind = strong | PlayTennis = yes) = 3/9 = .33
P(Wind = strong | PlayTennis = no) = 3/5 = .60
(PlayTennis training examples repeated; see Table 3.2 above)
Naïve Bayes: PlayTennis Target Values
• Using these and similar probability estimates for the remaining attribute values, vNB is calculated as follows:

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = .0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = .0206

• Thus, the naïve Bayes classifier assigns the target value PlayTennis = no to this new instance
PlayTennis: Normalizing Class Probabilities
• By normalizing the above quantities to sum to one, we can calculate the conditional probability that the target value is no, given the observed attribute values.
• For the current example, this probability is

$$\frac{.0206}{.0206 + .0053} = .795$$
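The entire PlayTennis calculation can be reproduced with a short Python sketch (illustrative; the counts come from Table 3.2, while the function and variable names are our own, not from the lecture):

```python
# Naive Bayes for the PlayTennis example: estimate probabilities by counting
# over the 14 examples in Table 3.2, then score the new instance
# <sunny, cool, high, strong>.
data = [  # (Outlook, Temp, Humidity, Wind, PlayTennis)
    ("Sunny","Hot","High","Weak","No"),     ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),  ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

def score(v, instance):
    """P(v) * prod_i P(a_i | v), using simple frequency estimates."""
    rows = [r for r in data if r[-1] == v]
    p = len(rows) / len(data)                       # prior P(v)
    for i, a in enumerate(instance):
        p *= sum(1 for r in rows if r[i] == a) / len(rows)
    return p

new = ("Sunny", "Cool", "High", "Strong")
s_yes, s_no = score("Yes", new), score("No", new)
print(round(s_yes, 4), round(s_no, 4))              # ~.0053 and ~.0206
print(round(s_no / (s_yes + s_no), 3))              # ~.795
```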
Estimating Probabilities
• A probability is estimated as the fraction of times the event is observed (nc) over the total number of observations (n)
  • P(Wind = strong | PlayTennis = no) = 3/5 = .60
• When nc is very small ==> poor estimate of the probability
  • suppose the true P(Wind = strong | PlayTennis = no) = .08 and we have n = 5; then the most probable value for nc is 0
  • this yields a biased underestimate of the probability
  • this probability term will dominate, since it multiplies the other probabilities
Estimating Probabilities with Small Sample Size
• To avoid this problem, use a Bayesian approach: the m-estimate
• m-estimate of probability:

$$\frac{n_c + m\,p}{n + m}$$

• p is the prior estimate of the probability we wish to determine
• m is a constant called the equivalent sample size
Estimating Probabilities with Small Sample Size
• m-estimate of probability:

$$\frac{n_c + m\,p}{n + m}$$

• p is the prior estimate of the probability we wish to determine
  • assume uniform priors: if the attribute has k values, we set p = 1/k
  • if k = 2, then p = .5
• m is the equivalent sample size
  • if m = 0, the m-estimate is equivalent to the simple fraction nc/n
  • the prior and the observed fraction are combined according to the weight m
  • called the equivalent sample size since the n actual samples are augmented by m virtual samples distributed according to p
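A tiny Python sketch of the m-estimate (illustrative), applied to the Wind = strong | PlayTennis = no example from the earlier slide:

```python
# m-estimate of probability: (n_c + m*p) / (n + m)

def m_estimate(n_c, n, p, m):
    """n_c: observed count, n: total observations, p: prior, m: equivalent sample size."""
    return (n_c + m * p) / (n + m)

print(m_estimate(n_c=3, n=5, p=0.5, m=0))   # 0.6   -> the plain frequency when m = 0
print(m_estimate(n_c=3, n=5, p=0.5, m=2))   # 0.571 -> pulled toward the prior p
```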
Bayesian Learning Example: Classifying Text
• Instances are text documents
• Target concept examples:
  • electronic news articles that I find interesting
  • pages of the world-wide web that discuss machine learning topics
• If a computer could learn the target concept accurately in instances involving text documents, it could automatically filter a large volume of on-line documents and present only the most relevant
Text Classification Task
• General setting:
  • instance space X consists of all possible text documents (i.e., all possible strings of words and punctuation of all possible lengths)
  • given training examples of some unknown target function f(x), which can take on any value from some finite set V
• Task:
  • learn to classify future documents as interesting or not interesting to a particular person,
  • using the target values like and dislike to indicate these two classes
Naïve Bayes Example: Learning To Classify Text
• Two main design issues:
  • decide how to represent an arbitrary text document in terms of attribute values
  • decide how to estimate the probabilities required by the naive Bayes classifier
Approach to Representing Arbitrary Text Documents
• Given a text document, we define:
  • an attribute for each word position in the document, and
  • the value of that attribute to be the English word found in that position
  • thus, the paragraph beginning with the sentence “Given a text document ...,” would be described by 111 attribute values, corresponding to its 111 word positions
  • the value of the first attribute is the word “Given”, the value of the second attribute is “a”, etc.
• Note: long text documents require a larger number of attributes than short documents
  • not a problem
Document Classification Task
• We are given a set of training documents that have been classified by a friend:
  • 700 classified as dislike
  • 300 classified as like
• Use these to classify new documents
Naïve Bayes Classification of Text
$$v_{NB} = \arg\max_{v_j \in \{like,\, dislike\}} P(v_j) \prod_{i=1}^{111} P(a_i \mid v_j)$$

$$= \arg\max_{v_j \in \{like,\, dislike\}} P(v_j)\, P(a_1{=}\text{“given”} \mid v_j)\, P(a_2{=}\text{“a”} \mid v_j) \cdots P(a_{111}{=}\text{“problem”} \mid v_j)$$
• Naïve Bayes Classification vNB is the classification that maximizes the probability of observing the words that were actually found in the document
Text Classification: independence assumption
• Independence assumption states that the word probabilities for one text position are independent of the words that occur in other positions, given the document classification vj
• The assumption is incorrect
  • e.g., the probability of observing “learning” is higher if the preceding word is “machine”
• But without making the assumption, a prohibitive number of probability terms would be involved
• In practice the naïve Bayes learner is known to perform well in text classification problems
Estimating Probability Terms
• To calculate vNB we need:
  • prior probability terms P(vj)
    • easy: P(like) = .3 and P(dislike) = .7
  • conditional probability terms P(ai = wk | vj)
    • wk is the kth word in the English vocabulary, e.g. P(a1 = given | dislike)
    • difficult: need one probability term for each combination of text position (111), English word (50,000), and target value (2), which implies roughly 10^7 probabilities
Reducing Number of Probability Terms
• Assume positional independence:
  • the probability of encountering a specific word wk is independent of the specific word position being encountered (a23 versus a95)
• This amounts to assuming that the attributes are independent and identically distributed:
  • P(ai = wk | vj) = P(am = wk | vj) for all i, j, k, m
Reducing Number of Probability Terms
• Replace the entire set of probabilities P(a1 = wk | vj), P(a2 = wk | vj), ... by the single position-independent probability P(wk | vj)
  • use P(wk | vj) regardless of word position
• Now require only 2 x 50,000 = 10^5 terms
• When training data is limited, this:
  • increases the number of samples available to estimate each required probability
  • increases the reliability of the estimates
Text Classification: Estimating Probability Terms
• Using the m-estimate with uniform priors and m equal to the size of the word vocabulary, the estimate for P(wk | vj) will be

$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |Vocabulary|}$$

• where n is the total number of word positions in all training examples whose target value is vj
• nk is the number of times the word wk is found among these n word positions
Naïve Bayes Algorithm for Learning and Classifying Text
Learn_Naive_Bayes_Text (Examples, V)
Examples is a set of text documents along with their target values. V is the set of all possible target values. This function learns the probability terms P(wk|vj), describing the probability that a randomly drawn word from a document in class vj will be the English word wk. It also learns the class prior probabilities P(vj).
Learn_Naive_Bayes_Text (Examples, V)
Collect all words, punctuation, and other tokens that occur in Examples
• Vocabulary ← the set of all distinct words and other tokens occurring in any text document from Examples
Learn_Naive_Bayes_Text (Examples, V)
Calculate the required P(vj) and P(wk|vj) probability terms
• For each target value vj in V, do:
  • docsj ← the subset of documents from Examples for which the target value is vj
  • P(vj) ← |docsj| / |Examples|
  • Textj ← a single document created by concatenating all members of docsj
  • n ← total number of distinct word positions in Textj
  • for each word wk in Vocabulary:
    • nk ← number of times the word wk occurs in Textj
    • P(wk | vj) ← (nk + 1) / (n + |Vocabulary|)
Classify_Naïve_Bayes_Text (Doc)
Return the estimated target value for the document Doc. ai denotes the word found in the ith position within Doc.
• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return vNB, where

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in positions} P(a_i \mid v_j)$$
Table 6.2
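A compact Python sketch of Learn_Naive_Bayes_Text and Classify_Naive_Bayes_Text as described above (illustrative: the tokenization is simplified to a lowercased whitespace split, log-probabilities replace the raw product to avoid numeric underflow on long documents, and the example documents are made up):

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples, V):
    """examples: list of (document_string, target_value); V: set of target values."""
    vocabulary = {w for doc, _ in examples for w in doc.lower().split()}
    prior, cond = {}, {}
    for vj in V:
        docs_j = [doc for doc, v in examples if v == vj]
        prior[vj] = len(docs_j) / len(examples)            # P(vj)
        text_j = " ".join(docs_j).lower().split()          # Textj as a token list
        n = len(text_j)
        counts = Counter(text_j)
        cond[vj] = {wk: (counts[wk] + 1) / (n + len(vocabulary))   # P(wk | vj)
                    for wk in vocabulary}
    return vocabulary, prior, cond

def classify_naive_bayes_text(doc, vocabulary, prior, cond):
    words = [w for w in doc.lower().split() if w in vocabulary]
    def log_score(vj):
        return math.log(prior[vj]) + sum(math.log(cond[vj][w]) for w in words)
    return max(prior, key=log_score)

# Tiny usage example with made-up documents:
examples = [("machine learning is fun", "like"),
            ("I dislike spam about taxes", "dislike"),
            ("great article on learning theory", "like")]
model = learn_naive_bayes_text(examples, {"like", "dislike"})
print(classify_naive_bayes_text("a new article about machine learning", *model))
```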
Experimental Results with Naïve Bayes for Classifying Text
• Problem: classifying news articles
  • 20 electronic newsgroups (usenet) considered
  • comp.graphics, alt.atheism, etc.
• Target classification: the name of the newsgroup in which the article appeared
  • the task is that of a newsgroup posting service that learns to assign documents to the appropriate newsgroup
Electronic Newsgroups considered in Text Classification Experiment
Table 6.3
Naïve Bayes Text Classification Experimental Results
• Data set:
  • 1,000 articles collected from each newsgroup, forming a data set of 20,000 documents
• Naïve Bayes was applied using:
  • 2/3 of these 20,000 documents as training examples
  • performance measured over the remaining 1/3
Naïve Bayes Text Classification Experimental Results
• Vocabulary:
  • the 100 most frequent words were removed (including “the” and “of”)
  • any word occurring fewer than 3 times was removed
  • the resulting Vocabulary consisted of 38,500 words
Naïve Bayes Text Classification Experimental Results
• Accuracy achieved by the program was 89%
  • random guessing would yield 5% accuracy
• Another variant of Naïve Bayes: the NewsWeeder system
  • training: the user rates some news articles as interesting
  • based on this user profile, NewsWeeder then suggests subsequent articles of interest to the user
  • NewsWeeder suggests the top 10% of its automatically rated articles each day
  • result: 59% of the articles presented were interesting, as opposed to 16% in the overall pool
  • this is the precision of the system