mle, map, and naive bayestom/10601_sp08/slides/recitation-mle-nb.pdfmap is the foundation for naive...

17
MLE, MAP, AND NAIVE BAYES 10-601 RECITATION MARY MCGLOHON

Upload: others

Post on 25-Feb-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

MLE, MAP, AND NAIVE BAYES10-601 RECITATION

MARY MCGLOHON

Page 2: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

MLE

The usual representation we come across is a probability density function:

But what if we know that , but we don’t know ?

We can set up a likelihood equation: , and find the value of that maximizes it.

P (X = x|θ)

x1, x2, ..., xn ∼ N(µ, σ2)µ

µ

P (x|µ, σ)

Page 3: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

MLE OF MU

Since x’s are independent and from the same distribution,

Taking the log likelihood (we get to do this since log is monotonic) and removing some constants:

p(x|µ, σ2) =

n∏

i=1

p(xi|µ, σ2)

L(x) =n∏

i=1

p(xi|µ, σ2) =1√

2πσ2

n∏

i=1

exp((xi − µ)2

2σ2)

log(L(x)) = l(x) ∝n∑

i=1

−(xi − µ)

-

2

Page 4: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

CALCULUS!

We can take the derivative of this value and set it equal to zero, to maximize.

dl(x)

dx=

d

dx(−

n∑

i=1

x2

i − xiµ + µ2) = −

n∑

i=1

xi − µ

n∑

i=1

xi − µ = 0 → nµ =

n∑

i=1

xi → µ =

∑n

i=1xi

n

Page 5: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

ABOUT THE MLE

We can’t always do this analytically, but there are all sorts of tricks. (See homework).

This is the traditional statistical approach to finding parameters.

(The Unbiased M.L.E.)

Page 6: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

ABOUT THE MLE

We can’t always do this analytically, but there are all sorts of tricks. (See homework).

This is the traditional statistical approach to finding parameters.

(The Unbiased M.L.E.)

Page 7: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

MAP

What if you have some ideas about your parameter?

In the Bayesian school of thought (or “cult”, depending on who you ask)...

We can use Bayes’ Rule:

P (θ|x) =P (x|θ)P (θ)

P (x)=

P (x|θ)P (θ)∑

ΘP (θ,x)

=P (x|θ)P (θ)

∑Θ

P (x|θ)P (θ)

Page 8: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

MAP

This is just maximizing the numerator, since the denominator is a normalizing constant.

This assumes a prior distribution, .

Old-school statisticians hate that. But ifyou get a good estimate of the prior, you’ll probably be ok.

(This is why we don’t have old-school statisticians writing spam filters.)

argmaxθP (θ|x) = argmaxθ

P (x|θ)P (θ)∑

ΘP (x|θ)P (θ)

(Emcee M.C.)P (θ)

Page 9: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

MAP

This is just maximizing the numerator, since the denominator is a normalizing constant.

This assumes a prior distribution, .

Old-school statisticians hate that. But ifyou get a good estimate of the prior, you’ll probably be ok.

(This is why we don’t have old-school statisticians writing spam filters.)

argmaxθP (θ|x) = argmaxθ

P (x|θ)P (θ)∑

ΘP (x|θ)P (θ)

(Emcee M.C.)P (θ)

Page 10: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

WHAT WE CAN DO NOW

MAP is the foundation for Naive Bayes classifiers.

Here, we’re assuming our data are drawn from two “classes”. We have a bunch of data where we know the class, and want to be able to predict P(class|data-point).

So, we use empirical probabilities

In NB, we also make the assumption that the features are conditionally independent.

prediction = argmaxCP (C = c|X = x) ∝ argmaxC P̂ (X = x|C = c)P̂ (C = c)

Page 11: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

SPAM FILTERING

Suppose we wanted to build a spam filter. To use the “bag of words” approach, assuming that n words in an email are conditionally independent, we’d get:

Whichever one’s bigger wins!

P (spam|w) ∝n∏

i=1

P (wi|spam)P (spam)

P (¬spam|w) ∝n∏

i=1

P (wi|¬spam)P (¬spam)

^ ^

^ ^

Page 12: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

THE IMPORTANCE OF THE SAMPLE

What happens if you train on a set of data that’s mostly spam, and test on a set that’s mostly good emails?

Also, just choosing one test set “wastes data”.

What can we do?

P (spam|w) ∝n∏

i=1

P (wi|spam)P (spam)^ ^

Page 13: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

CROSS-VALIDATION

Cross-validation involves training several times.

LOOCV (Leave-one-out cross validation):

For each data point, train classifier on -- that is, all data points besides , and classify .

Error is the average accuracy.

What’s wrong with this?

x−i

xi xi

Page 14: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

K-FOLDS CV

A cheaper way of doing cross-validation is to divide (“fold”) the dataset into k pieces.

For each piece i,

Train on all data not in set i, classify set i.

Report mean error.

This is a “happy medium” between straight-up training/testing and LOOCV.

Page 15: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

LESS-NAIVE BAYES

How would we modify NB to use two dependent attributes?

Hint: -- you’re estimating the joint conditional distribution of the attributes.

P (xi|c) = P (xi,1, xi,2, ..., xi,A|c)

Page 16: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

CONTINUOUSLY NAIVE BAYES

How could we modify NB for continuous attributes?

For instance, classify whether you like basketball given your age (real), height (real), whether you like football (binary), and type of shoes you wear (categorical).

Page 17: MLE, MAP, AND NAIVE BAYEStom/10601_sp08/slides/recitation-mle-nb.pdfMAP is the foundation for Naive Bayes classifiers. Here, we’re assuming our data are drawn from two “classes”

TOPOLOGY OF NAIVE BAYES

What’s the decision surface for Naive Bayes?

Hint: P (c|word) = P (word|c)I(word) + P (¬word|c)I(¬word)