
Page 1:

Naïve Bayes

Vibhav Gogate

The University of Texas at Dallas

Page 2:

Supervised Learning of Classifiers: Find f

• Given: Training set {(xi, yi) | i = 1 … n}

• Find: A good approximation to f : X → Y

Examples: what are X and Y ?

• Spam Detection

– Map email to {Spam,Ham}

• Digit recognition

– Map pixels to {0,1,2,3,4,5,6,7,8,9}

• Stock Prediction

– Map new, historic prices, etc. to ℝ (the real numbers)

The first two examples are classification problems.

Page 3:


Bayesian Categorization/Classification

• Let the set of categories be {c1, c2, …, cn}.

• Let E be the description of an instance.

• Determine the category of E by computing, for each ci,

P(ci | E) = P(ci) P(E | ci) / P(E)

• P(E) can be ignored (it is a normalization constant), so

P(ci | E) ∝ P(ci) P(E | ci)

• Select the class with the maximum probability.
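As a minimal sketch of this rule (assuming the priors and class-conditional likelihoods are already available; the dictionary/function layout below is an illustrative assumption, not from the slides):

```python
# Bayesian categorization: pick the class ci maximizing P(ci) * P(E | ci).
# priors: dict mapping class -> P(ci); likelihood(E, ci) returns P(E | ci).
def classify(E, priors, likelihood):
    # P(E) is a shared normalization constant, so it drops out of the argmax.
    return max(priors, key=lambda c: priors[c] * likelihood(E, c))
```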

Page 4:

Text classification

• Classify e-mails

– Y = {Spam,NotSpam}

• Classify news articles

– Y = {what is the topic of the article?}

• Classify webpages

– Y = {Student, professor, project, …}

• What to use for features, X?

Page 5:

Features: X is the sequence of words in the document; Xi is the ith word in the article.

Page 6:

Features for Text Classification

• X is the sequence of words in the document

• X (and hence P(X|Y)) is huge!
– An article has at least 1000 words, so X = {X1, …, X1000}

– Xi represents the ith word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster's Dictionary (or larger), roughly 10,000 words

• 10,000^1000 = 10^4000 possible word sequences

• Atoms in the universe: ≈ 10^80

– We may have a problem…

Page 7:

Bag of Words Model

• Typical additional assumption:

– Position in document doesn’t matter:

• P(Xi = xi | Y = y) = P(Xk = xi | Y = y) (all positions have the same distribution)

– Ignore the order of words

– Sounds really silly, but often works very well!

– From now on:

• Xi = Boolean: "word i is in the document"

• X = X1 … Xn (a minimal sketch of this representation follows below)
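A minimal sketch of the Boolean bag-of-words representation just described (the toy vocabulary and whitespace tokenization are illustrative assumptions):

```python
# One Boolean feature per vocabulary word: Xi = "word i is in the document".
VOCAB = ["aardvark", "about", "africa", "apple", "gas", "oil", "zaire"]  # toy vocabulary

def bool_bag_of_words(document: str) -> dict:
    words = set(document.lower().split())        # ignore word order and position
    return {w: (w in words) for w in VOCAB}

features = bool_bag_of_words("Oil and gas prices in Africa rose sharply")
# features["oil"] is True, features["aardvark"] is False
```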

Page 8:

Bag of Words Approach

aardvark 0

about 2

all 2

Africa 1

apple 0

anxious 0

...

gas 1

...

oil 1

Zaire 0

Page 9:


Bayesian Categorization

• Need to know:
– Priors: P(yi)
– Conditionals: P(X | yi)

• P(yi) are easily estimated from data:
– If ni of the examples in D belong to class yi, then P(yi) = ni / |D|

• Conditionals:
– X = X1 … Xn
– Estimate P(X1 … Xn | yi)

• Too many possible instances to estimate! (exponential in n)
– Even with the bag of words assumption!

P(yi | X) ∝ P(yi) P(X | yi)

Page 10:

Need to Simplify Somehow

• Too many probabilities to estimate:
– P(x1, x2, x3 | yi), i.e., the full table
P(x1, x2, x3 | spam), P(x1, x2, ¬x3 | spam), P(x1, ¬x2, x3 | spam), …, P(¬x1, ¬x2, ¬x3 | spam)

• Can we assume some are the same?
– P(x1, x2 | yi) = P(x1 | yi) P(x2 | yi)

Page 11:

Conditional Independence

• X is conditionally independent of Y given Z if the probability distribution of X is independent of the value of Y, given the value of Z:

∀x, y, z: P(X = x | Y = y, Z = z) = P(X = x | Z = z)

• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

• Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)

Page 12:

Naïve Bayes

• Naïve Bayes assumption:
– Features are independent given the class:

P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

– More generally:

P(X1, …, Xn | Y) = ∏i P(Xi | Y)

• How many parameters now?
– Suppose X is composed of n binary features: with a binary class, Naïve Bayes needs only 2n conditional parameters (one P(Xi = 1 | Y = y) per feature and class) plus 1 for the prior, instead of the 2(2^n − 1) needed for the full joint P(X1, …, Xn | Y).

Page 13:

The Naïve Bayes Classifier

• Given:

– Prior P(Y)

– n conditionally independent features X given the class Y

– For each Xi, we have likelihood P(Xi|Y)

• Decision rule:

y* = argmax_y P(y) ∏i P(xi | y)

(Graphical model: the class variable Y is the parent of each feature X1, X2, …, Xn.)
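A minimal sketch of this decision rule (the layout of the parameter tables `prior` and `likelihood` is an illustrative assumption; estimating them is the topic of the next slide):

```python
# prior[y] = P(Y = y); likelihood[(i, y)][x] = P(X_i = x | Y = y)
def nb_predict(x, prior, likelihood):
    # y* = argmax_y  P(y) * prod_i P(x_i | y)
    def score(y):
        p = prior[y]
        for i, xi in enumerate(x):
            p *= likelihood[(i, y)][xi]
        return p
    return max(prior, key=score)
```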

Page 14:

MLE for the parameters of NB

• Given a dataset, count occurrences of all pairs:
– Count(Xi = xi, Y = y). How many pairs are there?

• MLE for discrete NB is simply:

– Prior: P(Y = y) = Count(Y = y) / N, where N is the number of training examples

– Likelihood: P(Xi = xi | Y = y) = Count(Xi = xi, Y = y) / Count(Y = y)
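A rough sketch of these counting estimates (the data format, a list of (x, y) pairs with discrete feature tuples, is an assumption for illustration; it produces tables in the same layout as the decision-rule sketch above):

```python
from collections import Counter, defaultdict

def mle_naive_bayes(data):
    """data: list of (x, y) pairs, where x is a tuple of discrete features."""
    class_count = Counter(y for _, y in data)            # Count(Y = y)
    pair_count = defaultdict(Counter)                    # Count(X_i = x_i, Y = y)
    for x, y in data:
        for i, xi in enumerate(x):
            pair_count[(i, y)][xi] += 1

    n = len(data)
    prior = {y: c / n for y, c in class_count.items()}   # P(Y = y) = Count(Y = y) / N
    likelihood = {key: {xi: c / class_count[key[1]] for xi, c in cnt.items()}
                  for key, cnt in pair_count.items()}    # P(X_i = x_i | Y = y)
    return prior, likelihood
```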

Page 15:

NAÏVE BAYES CALCULATIONS

Page 16:

Subtleties of NB Classifier #1: Violating the NB Assumption

• Usually, features are not conditionally independent:

P(X1, …, Xn | Y) ≠ ∏i P(Xi | Y)

• The naïve Bayes assumption is often violated, yet NB performs surprisingly well in many cases.

• Plausible reason: we only need the probability of the correct class to be the largest!
– Example: in two-way classification, we just need to land on the correct side of 0.5, not get the actual probability right (0.51 is as good as 0.99).

Page 17:

Subtleties of NB Classifier #2: Insufficient Training Data

• What if you never see a training instance where X1 = a and Y = b?
– E.g., you never saw Y = Spam together with X1 = "Enlargement"

– Then P(X1 = Enlargement | Y = Spam) = 0

• Thus, no matter what values X2, X3, …, Xn take:
– P(X1 = Enlargement, X2 = a2, …, Xn = an | Y = Spam) = 0

– Why? The NB likelihood is a product over features, so a single zero factor makes the whole product zero.

Page 18:

For Binary Features: We already know the answer!

• MAP: use most likely parameter

• Beta prior equivalent to extra observations for each feature

• As N → ∞, the prior is "forgotten"

• But for small sample sizes, the prior is important!
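A minimal sketch of the MAP estimate for a single binary feature under a Beta(α, β) prior, which behaves like α − 1 extra positive and β − 1 extra negative observations (the function and its defaults are illustrative assumptions):

```python
def map_bernoulli(heads, tails, alpha=2.0, beta=2.0):
    # MAP of theta = P(X = 1) with a Beta(alpha, beta) prior:
    #   theta_MAP = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)
    # As the number of observations grows, the prior's pseudo-counts are "forgotten";
    # for small samples they keep the estimate away from 0 and 1.
    return (heads + alpha - 1) / (heads + tails + alpha + beta - 2)
```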

Page 19:

That’s Great for Binomial

• Works for Spam / Ham

• What about multiple classes?

– E.g., given a Wikipedia page, predicting its type

Page 20:

Multinomials: Laplace Smoothing

• Laplace's estimate:
– Pretend you saw every outcome k extra times:

P_LAP,k(x) = (count(x) + k) / (N + k|X|)

– What's Laplace with k = 0? (Just the MLE.)

– k is the strength of the prior

– Can derive this as a MAP estimate for a multinomial with a Dirichlet prior

• Laplace for conditionals:
– Smooth each conditional independently:

P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)

(Example observations: H, H, T.)
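A minimal sketch of Laplace's estimate; applied to the H, H, T observations above with k = 1 it gives P(H) = (2 + 1) / (3 + 2·1) = 0.6 (the function layout is an illustrative assumption):

```python
from collections import Counter

def laplace_estimate(observations, outcomes, k=1.0):
    # P_LAP,k(x) = (count(x) + k) / (N + k * |X|)
    counts = Counter(observations)
    n = len(observations)
    return {x: (counts[x] + k) / (n + k * len(outcomes)) for x in outcomes}

print(laplace_estimate(["H", "H", "T"], outcomes=["H", "T"], k=1))   # {'H': 0.6, 'T': 0.4}
```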

Page 21:

Probabilities: Important Detail!

Any more potential problems here?

• P(spam | X1 … Xn) = ∏i P(spam | Xi)

We are multiplying lots of small numbers. Danger of underflow!

0.5^57 ≈ 7 × 10^-18

Solution? Use logs and add!

p1 * p2 = e^(log(p1) + log(p2))

Always keep in log form
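A minimal sketch of scoring a class entirely in log form (reusing the hypothetical parameter-table layout from the earlier sketches):

```python
import math

def nb_log_score(x, y, prior, likelihood):
    # log P(y) + sum_i log P(x_i | y): a sum of logs does not underflow the way
    # a product of thousands of tiny probabilities would.
    s = math.log(prior[y])
    for i, xi in enumerate(x):
        s += math.log(likelihood[(i, y)][xi])
    return s
```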

Page 22:

NB for Text Classification: Learning

• Learning phase: estimate the prior P(Ym) and the word likelihoods P(Xi | Ym) from the training documents.
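A rough sketch of the learning phase for bag-of-words text, with Laplace-smoothed word likelihoods; the exact estimator used in the lecture is not visible in this transcript, so the word-count formulation below is an assumption:

```python
import math
from collections import Counter, defaultdict

def train_text_nb(docs, k=1.0):
    """docs: list of (list_of_words, label); returns log-space parameters."""
    class_count = Counter(label for _, label in docs)
    word_count = defaultdict(Counter)          # word_count[y][w] = count of w in class y
    vocab = set()
    for words, y in docs:
        word_count[y].update(words)
        vocab.update(words)

    log_prior = {y: math.log(n / len(docs)) for y, n in class_count.items()}
    log_lik = {}
    for y in class_count:
        total = sum(word_count[y].values())
        # Laplace-smoothed P(w | y) = (count(w, y) + k) / (total words in class y + k|V|)
        log_lik[y] = {w: math.log((word_count[y][w] + k) / (total + k * len(vocab)))
                      for w in vocab}
    return log_prior, log_lik, vocab
```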

Page 23:

NB for Text Classification: Classification

• Given a new document of length L, predict Y* = argmax_m P(Ym) ∏_{i=1..L} P(Xi | Ym) (computed with log probabilities in practice).
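A matching classification sketch that sums log probabilities over the L words of the new document, reusing the hypothetical `train_text_nb` parameters above:

```python
def classify_text_nb(words, log_prior, log_lik, vocab):
    # Y* = argmax_m  log P(Y_m) + sum over the document's words of log P(word | Y_m)
    def score(y):
        s = log_prior[y]
        for w in words:
            if w in vocab:                 # words never seen in training are skipped here
                s += log_lik[y][w]
        return s
    return max(log_prior, key=score)
```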

Page 24:

Example: (Borrowed from Dan Jurafsky)

Page 25:

Bayesian Learning

What if Features are Continuous?

P(Y | X) ∝ P(X | Y) P(Y)
(posterior ∝ data likelihood × prior)

E.g., character recognition: Xi is the ith pixel.

Page 26:

Bayesian Learning

What if Features are Continuous?

E.g., character recognition: Xi is the ith pixel.

P(Xi = x | Y = yk) = N(x; μik, σik)

P(Y | X) ∝ P(X | Y) P(Y)

N(x; μik, σik) = 1 / (σik √(2π)) · exp(−(x − μik)² / (2σik²))

Page 27:

Gaussian Naïve Bayes

P(Xi = x | Y = yk) = N(x; μik, σik)

P(Y | X) ∝ P(X | Y) P(Y)

N(x; μik, σik) = 1 / (σik √(2π)) · exp(−(x − μik)² / (2σik²))

Sometimes we assume the variance
– is independent of Y (i.e., σi),
– or independent of Xi (i.e., σk),
– or both (i.e., σ)
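A minimal sketch of the Gaussian class-conditional density and the corresponding Gaussian NB score (hand-rolled for illustration; the parameter-table layout is an assumption):

```python
import math

def gaussian_pdf(x, mu, sigma):
    # N(x; mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def gnb_log_score(x, y, log_prior, mu, sigma):
    # log P(y) + sum_i log N(x_i; mu_ik, sigma_ik)
    s = log_prior[y]
    for i, xi in enumerate(x):
        s += math.log(gaussian_pdf(xi, mu[(i, y)], sigma[(i, y)]))
    return s
```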

Page 28:

Learning Gaussian Parameters

Maximum Likelihood Estimates:

• Mean: μ̂ik = Σj δ(y^j = yk) xi^j / Σj δ(y^j = yk)

• Variance: σ̂²ik = Σj δ(y^j = yk) (xi^j − μ̂ik)² / Σj δ(y^j = yk)

Page 29:

Learning Gaussian Parameters

Maximum Likelihood Estimates, where j indexes training examples and δ(x) = 1 if x is true, else 0:

• Mean: μ̂ik = Σj δ(y^j = yk) xi^j / Σj δ(y^j = yk)

• Variance: σ̂²ik = Σj δ(y^j = yk) (xi^j − μ̂ik)² / Σj δ(y^j = yk)
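A rough sketch of these maximum-likelihood estimates (the data layout, real-valued feature tuples paired with labels, is an illustrative assumption):

```python
from collections import defaultdict

def fit_gaussian_params(data):
    """data: list of (x, y) with x a tuple of real-valued features.
    Returns MLE mean and variance per (feature index, class) pair."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in data:
        for i, xi in enumerate(x):
            sums[(i, y)] += xi
            counts[(i, y)] += 1
    mu = {key: sums[key] / counts[key] for key in sums}

    sq = defaultdict(float)
    for x, y in data:
        for i, xi in enumerate(x):
            sq[(i, y)] += (xi - mu[(i, y)]) ** 2
    var = {key: sq[key] / counts[key] for key in sq}   # MLE divides by N_k, not N_k - 1
    return mu, var
```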


Page 31:

What you need to know about Naïve Bayes

• Naïve Bayes classifier
– What's the assumption?

– Why we use it

– How do we learn it

– Why is Bayesian estimation important

• Text classification
– Bag of words model

• Gaussian NB
– Features are still conditionally independent

– Each feature has a Gaussian distribution given class