
CS340: Machine Learning

Naive Bayes classifiers

Kevin Murphy


Classifiers

• A classifier is a function f that maps input feature vectors x ∈ X to output class labels y ∈ {1, . . . , C}.

• X is the feature space, e.g. X = IR^p or X = {0, 1}^p (we can mix discrete and continuous features).

• We assume the class labels are unordered (categorical) and mutually exclusive. (If we allow an input to belong to multiple classes, this is called a multi-label problem.)

• Goal: to learn f from a labeled training set of N input-output pairs (xi, yi), i = 1 : N.

• We will focus our attention on probabilistic classifiers, i.e., methods that return p(y|x).

• The alternative is to learn a discriminant function f(x) = ŷ(x) to predict the most probable label.


Generative vs discriminative classifiers

• Discriminative: directly learn the function that computes the class posterior p(y|x). It discriminates between different classes given the input.

• Generative: learn the class-conditional density p(x|y) for each value of y, and learn the class priors p(y); then one can apply Bayes rule to compute the posterior

p(y|x) = p(x, y) / p(x) = p(x|y) p(y) / p(x)

where p(x) = ∑_{y'=1}^C p(x|y') p(y').
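
As a concrete illustration, here is a minimal NumPy sketch of this Bayes-rule computation; the function name and array layout are illustrative, not from the slides.

import numpy as np

def posterior_from_generative(lik, prior):
    # lik[c]   = p(x | y = c), the class-conditional density evaluated at x
    # prior[c] = p(y = c), the class prior
    joint = lik * prior           # p(x, y = c) for every class c
    return joint / joint.sum()    # divide by p(x) = sum_c p(x, y = c)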


Generative classifiers

We usually use a plug-in approximation for simplicity

p(y = c|x, D) ≈ p(y = c|x, θ, π) = p(x|y = c, θc) p(y = c|π) / ∑_{c'} p(x|y = c', θc') p(y = c'|π)

where D is the training data, π are the parameters of the class prior p(y), and θ are the parameters of the class-conditional densities p(x|y).


Class prior

• Class prior: p(y = c|π) = πc

• MLE

πc = (∑_{i=1}^N I(yi = c)) / N = Nc / N

where Nc is the number of training examples that have class label c.

• Posterior mean (using Dirichlet prior)

πc = (Nc + 1) / (N + C)
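
A minimal NumPy sketch of both estimates; the function name is illustrative, and labels are assumed to be integers in {0, ..., C-1}.

import numpy as np

def estimate_class_prior(y, C):
    # y: integer labels assumed to lie in {0, ..., C-1}
    N = len(y)
    Nc = np.bincount(y, minlength=C)     # Nc = number of examples with label c
    pi_mle = Nc / N                      # MLE: Nc / N
    pi_bayes = (Nc + 1) / (N + C)        # posterior mean: (Nc + 1) / (N + C)
    return pi_mle, pi_bayes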


Gaussian class-conditional densities

Suppose x ∈ IR^2 represents the height and weight of adult Westerners (in inches and pounds respectively), and y ∈ {1, 2} represents male or female. A natural choice for the class-conditional density is a two-dimensional Gaussian

p(x|y = c, θc) = N(x|µc, Σc)

where the mean and covariance matrix depend on the class c.

[Figure: scatter plot of weight vs. height; dotted red circles = female, solid blue crosses = male]
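
A minimal NumPy sketch of fitting such a full-covariance Gaussian to each class (function name and array conventions are illustrative):

import numpy as np

def fit_gaussian_class_conditionals(X, y, C):
    # X: N x p feature matrix, y: numpy array of labels in {0, ..., C-1}
    params = []
    for c in range(C):
        Xc = X[y == c]                     # rows belonging to class c
        mu = Xc.mean(axis=0)               # class mean
        Sigma = np.cov(Xc, rowvar=False)   # full covariance matrix
        params.append((mu, Sigma))
    return params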


Naive Bayes assumption

Assume features are conditionally independent given class.

p(x|y = c, θc) = ∏_{d=1}^p p(xd|y = c, θc) = ∏_{d=1}^p N(xd|µcd, σcd)

This is equivalent to assuming that Σc is diagonal.

[Figure: fitted 1-D Gaussians per feature and class: height (male), height (female), weight (male), weight (female)]
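
Under this assumption the class-conditional density is just a product of 1-D Gaussians; a minimal NumPy sketch (names are illustrative):

import numpy as np

def nb_gaussian_density(x, mu_c, sigma_c):
    # product over features d of N(x_d | mu_cd, sigma_cd)
    pdf = np.exp(-0.5 * ((x - mu_c) / sigma_c) ** 2) / (sigma_c * np.sqrt(2.0 * np.pi))
    return np.prod(pdf)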


Training

[Figure: fitted 1-D Gaussians per feature and class: height (male), height (female), weight (male), weight (female)]

Training data:

 h    w   y
67  125   m
79  210   m
71  150   m
68  140   f
67  142   f
60  120   f

Fitted parameters:

      h                        w
m   µ = 71.66, σ = 3.13    µ = 175.62, σ = 32.40
f   µ = 65.07, σ = 3.19    µ = 129.69, σ = 18.67
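
Training amounts to computing a per-class, per-feature mean and standard deviation; a minimal NumPy sketch (the function name is illustrative, and the use of the unbiased standard deviation is an assumption, since the slides do not specify the estimator):

import numpy as np

def fit_naive_bayes_gaussian(X, y):
    # per-class, per-feature mean and standard deviation (a diagonal Sigma_c)
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])
    sigma = np.array([X[y == c].std(axis=0, ddof=1) for c in classes])  # unbiased std: an assumption
    return classes, mu, sigma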


Testing

p(y = m|x) = p(x|y = m) p(y = m) / [p(x|y = m) p(y = m) + p(x|y = f) p(y = f)]

Let us assume p(y = m) = p(y = f) = 0.5. Then

p(y = m|x) = p(x|y = m) / [p(x|y = m) + p(x|y = f)]

= p(xh|y = m) p(xw|y = m) / [p(xh|y = m) p(xw|y = m) + p(xh|y = f) p(xw|y = f)]

= N(xh; µmh, σmh) N(xw; µmw, σmw) / [N(xh; µmh, σmh) N(xw; µmw, σmw) + N(xh; µfh, σfh) N(xw; µfw, σfw)]
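
A minimal NumPy sketch of this computation for the height/weight example (the function names and the params layout are illustrative; the usage line plugs in the fitted values from the table above):

import numpy as np

def gauss(x, mu, sigma):
    # 1-D Gaussian density N(x; mu, sigma)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def p_male(xh, xw, params):
    # p(y = m | x) with equal class priors and independent features
    lik_m = gauss(xh, *params['m']['h']) * gauss(xw, *params['m']['w'])
    lik_f = gauss(xh, *params['f']['h']) * gauss(xw, *params['f']['w'])
    return lik_m / (lik_m + lik_f)

# e.g., using the fitted (mu, sigma) pairs:
params = {'m': {'h': (71.66, 3.13), 'w': (175.62, 32.40)},
          'f': {'h': (65.07, 3.19), 'w': (129.69, 18.67)}}
print(p_male(72, 180, params))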


Testing

[Figure: fitted 1-D Gaussians per feature and class: height (male), height (female), weight (male), weight (female)]

      h                        w
m   µ = 71.66, σ = 3.13    µ = 175.62, σ = 32.40
f   µ = 65.07, σ = 3.19    µ = 129.69, σ = 18.67

 h    w   p(y = m|x)
72  180
60  100
68  155


Binary data

• Suppose we want to classify email into spam vs non-spam.

• A simple way to represent a text document (such as email) is as a bag of words.

• Let xd = 1 if word d occurs in the document and xd = 0 otherwise.

[Figure: training set shown as a binary document-word matrix (rows = documents, columns = words)]
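
A minimal sketch of building such binary bag-of-words features (the function name, the crude whitespace tokenization, and the example vocabulary are illustrative assumptions):

import numpy as np

def binary_bag_of_words(doc, vocab):
    # x_d = 1 if word d occurs in the document, 0 otherwise
    words = set(doc.lower().split())      # very crude whitespace tokenization
    return np.array([1 if w in words else 0 for w in vocab])

# e.g.
vocab = ['cheap', 'meeting', 'offer', 'schedule']
x = binary_bag_of_words("cheap cheap offer", vocab)   # -> [1, 0, 1, 0]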

11

Parameter estimation

The class-conditional density becomes a product of Bernoullis

p(x|Y = c, θ) = ∏_{d=1}^p θcd^{xd} (1 − θcd)^{1−xd}

MLE

θcd = Ncd / Nc

where Ncd is the number of class-c examples in which word d occurs.

Posterior mean (with Beta(1,1) prior)

θcd = (Ncd + 1) / (Nc + 2)
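
A minimal NumPy sketch of the smoothed estimate (the function name is illustrative; X is assumed to be a binary N x p document-word matrix):

import numpy as np

def fit_bernoulli_nb(X, y, C):
    # X: binary N x p document-word matrix, y: numpy array of labels in {0, ..., C-1}
    theta = np.zeros((C, X.shape[1]))
    for c in range(C):
        Xc = X[y == c]                                    # documents of class c
        theta[c] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)   # (Ncd + 1) / (Nc + 2)
    return theta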


Fitted class conditional densities p(x = 1|y = c)

[Figure: fitted θcd plotted against word index d, one panel per class: "word freq for class 1", "word freq for class 2"]


Numerical issues

When computing

P(Y = c|x) = P(x|Y = c) P(Y = c) / ∑_{c'=1}^C P(x|Y = c') P(Y = c')

you will often encounter numerical underflow, since p(x, y = c) is very small. Take logs:

bc := log[P(x|Y = c) P(Y = c)]

log P(Y = c|x) = bc − log ∑_{c'=1}^C e^{bc'}

but e^{bc} will underflow!


Log-sum-exp trick

log(e^{−120} + e^{−121}) = log(e^{−120} (e^0 + e^{−1})) = log(e^0 + e^{−1}) − 120

In general

log ∑_c e^{bc} = log[(∑_c e^{bc}) e^{−B} e^B] = log[(∑_c e^{bc−B}) e^B] = [log(∑_c e^{bc−B})] + B

where B = max_c bc. In Matlab, use logsumexp.m.
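
The slide points to a Matlab routine; an equivalent NumPy sketch (illustrative, not the course's logsumexp.m) is:

import numpy as np

def log_sum_exp(b):
    # log(sum_c exp(b_c)), computed stably by factoring out B = max_c b_c
    B = np.max(b)
    return B + np.log(np.sum(np.exp(b - B)))

# e.g., log(e^-120 + e^-121) without underflow:
print(log_sum_exp(np.array([-120.0, -121.0])))   # about -119.69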


Softmax function

p(y = c|x, θ, π) = p(x|y = c) p(y = c) / ∑_{c'} p(x|y = c') p(y = c')

= exp[log p(x|y = c) + log p(y = c)] / ∑_{c'} exp[log p(x|y = c') + log p(y = c')]

= exp[log πc + ∑_d xd log θcd] / ∑_{c'} exp[log πc' + ∑_d xd log θc'd]

Now define the vectors

x = [1, x1, . . . , xp]

βc = [log πc, log θc1, . . . , log θcp]

Hence

p(y = c|x, β) = exp[βc^T x] / ∑_{c'} exp[βc'^T x]
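
A minimal NumPy sketch of this softmax form (names are illustrative; the max-shift reuses the log-sum-exp idea from the previous slide):

import numpy as np

def softmax_posterior(x_aug, beta):
    # x_aug: length p+1 vector [1, x1, ..., xp]
    # beta:  C x (p+1) matrix, row c = [log pi_c, log theta_c1, ..., log theta_cp]
    scores = beta @ x_aug            # beta_c^T x for every class c
    scores = scores - scores.max()   # shift by the max (the log-sum-exp trick)
    e = np.exp(scores)
    return e / e.sum()               # p(y = c | x) for every class c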


Logistic function

If y is binary

p(y = 1|x) = e^{β1^T x} / (e^{β1^T x} + e^{β2^T x})

= 1 / (1 + e^{(β2 − β1)^T x})

= 1 / (1 + e^{−w^T x})

= σ(w^T x)

where we have defined w = β1 − β2 and σ(·) is the logistic or sigmoid function

σ(u) = 1 / (1 + e^{−u})
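
A minimal NumPy sketch of the two-class special case (function names are illustrative):

import numpy as np

def sigmoid(u):
    # logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-u))

def p_y1(x_aug, beta1, beta2):
    # two-class special case: p(y = 1 | x) = sigmoid((beta1 - beta2)^T x)
    w = beta1 - beta2
    return sigmoid(w @ x_aug)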

