Bayesian Learning
Bayes rule:

P(X | Y) = P(Y | X) P(X) / P(Y)
Bayesianism vs. Frequentism
• Classical probability: Frequentists
– Probability of a particular event is defined relative to its frequency in a sample space of events.
– E.g., the probability of “the coin will come up heads on the next trial” is defined relative to the frequency of heads in a sample space of coin tosses.
• Bayesian probability:
– Combine a measure of “prior” belief you have in a proposition with your subsequent observations of events.
• Example: A Bayesian can assign a probability to the statement “There was life on Mars a billion years ago,” but a frequentist cannot.
Bayesian Learning
– Prior probability of h:
• P(h): Probability that hypothesis h is true given our prior knowledge.
• If we have no prior knowledge, all h ∈ H are equally probable.
– Posterior probability of h:
• P(h | D): Probability that hypothesis h is true, given the data D.
– Likelihood of D:
• P(D | h): Probability that we will see data D, given that hypothesis h is true.
Question is: given the data D and our prior beliefs, what is the probability that h is the correct hypothesis, i.e., P(h | D)?
Bayes rule says:

P(h | D) = P(D | h) P(h) / P(D)
Bayesian Learning
The Monty Hall Problem
You are a contestant on a game show. There are 3 doors, A, B, and C. There is a new car behind one of them and goats behind the other two.
Monty Hall, the host, asks you to pick a door, any door. You pick door A.
Monty tells you he will open a door, different from A, that has a goat behind it. He opens door B: behind it there is a goat.
Monty now gives you a choice: Stick with your original choice A or switch to C. Should you switch?
http://math.ucsd.edu/~crypto/Monty/monty.html
Bayesian probability formulation
Hypothesis space H:
h1 = Car is behind door A
h2 = Car is behind door B
h3 = Car is behind door C
Data D = Monty opened B
What is P(h1 | D)? What is P(h2 | D)? What is P(h3 | D)?
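A quick numeric check of these posteriors (a minimal Python sketch; it assumes Monty never opens your door or the car door, and picks at random between B and C when both hide goats):

```python
# Posterior over the three hypotheses after observing D = "Monty opened door B",
# given that the contestant picked door A.
priors = {"h1_car_A": 1/3, "h2_car_B": 1/3, "h3_car_C": 1/3}

# Likelihood P(D | h): the probability Monty opens B under each hypothesis.
likelihood = {
    "h1_car_A": 1/2,  # car behind A: Monty opens B or C at random
    "h2_car_B": 0.0,  # car behind B: Monty never opens the car door
    "h3_car_C": 1.0,  # car behind C: B is the only goat door left
}

evidence = sum(likelihood[h] * priors[h] for h in priors)            # P(D) = 1/2
posterior = {h: likelihood[h] * priors[h] / evidence for h in priors}
print(posterior)  # h1 -> 1/3, h2 -> 0, h3 -> 2/3: switching doubles your chances
```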
A Medical Example
T. takes a test for leukemia. The test has two outcomes: positive and negative. It is known that if the patient has leukemia, the test is positive 98% of the time. If the patient does not have leukemia, the test is positive 3% of the time. It is also known that 0.008 of the population has leukemia.
T.’s test is positive.
Which is more likely: T. has leukemia or T. does not have leukemia?
• Hypothesis space:
h1 = T. has leukemia
h2 = T. does not have leukemia
• Prior: 0.008 of the population has leukemia. Thus
P(h1) = 0.008
P(h2) = 0.992
• Likelihood:
P(+ | h1) = 0.98, P(− | h1) = 0.02
P(+ | h2) = 0.03, P(− | h2) = 0.97
• Posterior knowledge: Blood test is + for this patient.
• Maximum A Posteriori (MAP) hypothesis:
hMAP = argmax_{h ∈ H} P(h | D)
     = argmax_{h ∈ H} P(D | h) P(h) / P(D)
     = argmax_{h ∈ H} P(D | h) P(h)
because P(D) is a constant independent of h.
Note: If every h ∈ H is equally probable, then

hMAP = argmax_{h ∈ H} P(D | h)
This is called the “maximum likelihood hypothesis”.
Back to our example
• We had:
P(h1) = 0.008, P(h2) = 0.992
P(+ | h1) = 0.98, P(− | h1) = 0.02
P(+ | h2) = 0.03, P(− | h2) = 0.97
• Thus:
hMAP = argmax_{h ∈ H} P(D | h) P(h)

P(+ | leukemia) P(leukemia) = (0.98)(0.008) = 0.0078
P(+ | ¬leukemia) P(¬leukemia) = (0.03)(0.992) = 0.0298

hMAP = ¬leukemia
• What is P(leukemia|+)?
So, by Bayes rule, P(h | D) = P(D | h) P(h) / P(D):

P(leukemia | +) = 0.0078 / (0.0078 + 0.0298) = 0.21
P(¬leukemia | +) = 0.0298 / (0.0078 + 0.0298) = 0.79
These are called the “posterior” probabilities.
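The same calculation as a minimal Python sketch (the dictionary keys are just labels for h1 and h2):

```python
# MAP hypothesis and posteriors for the leukemia example.
priors = {"leukemia": 0.008, "no_leukemia": 0.992}
likelihood_pos = {"leukemia": 0.98, "no_leukemia": 0.03}   # P(+ | h)

# Unnormalized posteriors P(+ | h) P(h)
joint = {h: likelihood_pos[h] * priors[h] for h in priors}
print(max(joint, key=joint.get))          # no_leukemia  (the MAP hypothesis)

# Normalize by P(+) to get the posteriors P(h | +)
evidence = sum(joint.values())
print({h: round(joint[h] / evidence, 2) for h in joint})
# {'leukemia': 0.21, 'no_leukemia': 0.79}
```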
In-class exercise
Suppose you receive an e-mail message with the subject “Hi”. You have been keeping statistics on your e-mail, and have found that while only 10% of the total e-mail messages you receive are spam, 50% of the spam messages have the subject “Hi” and 2% of the non-spam messages have the subject “Hi”. What is the probability that the message is spam?
P(h|D) = P(D|h)P(h) / P(D)
P(spam | Subject = “Hi”) = P(Subject = “Hi” | spam) P(spam) / P(Subject = “Hi”)
= (0.5)(0.1) / [(0.5)(0.1) + (0.02)(0.9)] = 0.735
Independence and Conditional Independence
• Two random variables, X and Y, are independent if P(X, Y) = P(X) P(Y).
• Two random variables, X and Y, are independent given C if P(X, Y | C) = P(X | C) P(Y | C).
• Examples?
Naive Bayes Classifier
Let x = <a1, a2, ..., an>
Let classes = {c1, c2, ..., cv}
We want to find MAP class, cMAP :
cMAP = argmax_{cj ∈ classes} P(cj | a1, a2, ..., an)
By Bayes Theorem:
P(cj) can be estimated from the training data. How?
However, can’t use training data to estimate P(a1, a2, ..., an | cj). Why not?
cMAP = argmax_{cj ∈ classes} P(a1, a2, ..., an | cj) P(cj) / P(a1, a2, ..., an)
     = argmax_{cj ∈ classes} P(a1, a2, ..., an | cj) P(cj)
Naive Bayes classifier:
Assume

P(a1, a2, ..., an | cj) = P(a1 | cj) P(a2 | cj) ⋯ P(an | cj)

I.e., assume all the attributes are independent of one another, given the class. (Is this a good assumption?)
Given this assumption, here’s how to classify an instance x = <a1, a2, ..., an>:

cNB(x) = argmax_{cj ∈ classes} P(cj) ∏i P(ai | cj)

We can estimate the values of these various probabilities over the training set.
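A minimal sketch of this training and classification procedure for categorical attributes, using raw frequency estimates (no smoothing yet); the function and variable names are illustrative, not from the slides:

```python
from collections import Counter, defaultdict

def train_nb(instances, labels):
    """Estimate P(c) and the counts needed for P(a_i = v | c) over the training set."""
    class_counts = Counter(labels)
    # value_counts[(attr_index, value, c)] = n(a_i = v, c)
    value_counts = defaultdict(int)
    for x, c in zip(instances, labels):
        for i, v in enumerate(x):
            value_counts[(i, v, c)] += 1
    n = len(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    return priors, value_counts, class_counts

def classify_nb(x, priors, value_counts, class_counts):
    """Return argmax_c P(c) * prod_i P(a_i | c), with P(a_i | c) = n(a_i, c) / n(c)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(x):
            score *= value_counts[(i, v, c)] / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```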
Example
• Use training data from decision tree example to classify the new instance:
<Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>
cNB(x) = argmax_{cj ∈ classes} P(cj) ∏i P(ai | cj)
Day Outlook Temp Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
In-class example
Training set:
x1 x2 x3 class
0 1 0 +
1 0 1 +
0 0 1 −
1 1 0 −
1 0 0 −

What class would be assigned by a NB classifier to 1 1 1?
Estimating probabilities
• Recap: In previous example, we had a training set and a new example,
<Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>
• We asked: What classification is given by a naive Bayes classifier?
• Let n(c) be the number of training instances with class c, and n(ai,c) be the number of training instances with attribute value ai and class c. Then
P(ai | c) = n(ai, c) / n(c)
P(Outlook = Sunny | Yes) = 2/9        P(Outlook = Sunny | No) = 3/5
P(Outlook = Overcast | Yes) = 4/9     P(Outlook = Overcast | No) = 0
P(Outlook = Rain | Yes) = 3/9         P(Outlook = Rain | No) = 2/5

P(Temperature = Hot | Yes) = 2/9      P(Temperature = Hot | No) = 2/5
P(Temperature = Mild | Yes) = 4/9     P(Temperature = Mild | No) = 2/5
P(Temperature = Cool | Yes) = 3/9     P(Temperature = Cool | No) = 1/5

P(Humidity = High | Yes) = 3/9        P(Humidity = High | No) = 4/5
P(Humidity = Normal | Yes) = 6/9      P(Humidity = Normal | No) = 1/5

P(Wind = Strong | Yes) = 3/9          P(Wind = Strong | No) = 3/5
P(Wind = Weak | Yes) = 6/9            P(Wind = Weak | No) = 2/5
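Plugging these estimates into cNB(x) for the new instance <Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong> (a quick sanity check; the numbers follow from the table and estimates above):

```python
# Unnormalized scores P(c) * prod_i P(a_i | c) for
# <Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong>
p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)   # ~ 0.0053
p_no  = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)   # ~ 0.0206
print("Yes" if p_yes > p_no else "No")            # -> No
print(p_no / (p_yes + p_no))                      # P(No | x) ~ 0.80
```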
In practice, use the training data to compute a probabilistic model:
• Problem with this method: if n(ai, c) is very small, it gives a poor estimate.
• E.g., P(Outlook = Overcast | no) = 0.
• Now suppose we want to classify a new instance: <Outlook=Overcast, Temperature=Cool, Humidity=High, Wind=Strong>. Then:

P(no) ∏i P(ai | no) = 0

This incorrectly gives us zero probability due to the small sample.
One solution: Laplace smoothing (also called “add-one” smoothing)
For each class cj and attribute a with value ai, add one “virtual” instance.
That is, recalculate:

P(ai | cj) ≈ (n(ai, cj) + 1) / (n(cj) + k)

where k is the number of possible values of attribute a.
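A one-function sketch of the smoothed estimate (the function name is illustrative):

```python
def smoothed_p(n_ai_c, n_c, k):
    """Laplace ("add-one") estimate of P(a_i | c).

    n_ai_c : number of training instances with attribute value a_i and class c
    n_c    : number of training instances with class c
    k      : number of possible values of the attribute
    """
    return (n_ai_c + 1) / (n_c + k)

# Outlook has k = 3 possible values (Sunny, Overcast, Rain):
print(smoothed_p(0, 5, 3))   # P(Outlook=Overcast | No): 0   -> 1/8
print(smoothed_p(4, 9, 3))   # P(Outlook=Overcast | Yes): 4/9 -> 5/12
```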
Day Outlook Temp Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
− Sunny − − − Yes
− Sunny − − − No
− Overcast − − − Yes
− Overcast − − − No
− Rain − − − Yes
− Rain − − − No
P(Outlook = Sunny | Yes) = 2/9 → 3/12       P(Outlook = Sunny | No) = 3/5 → 4/8
P(Outlook = Overcast | Yes) = 4/9 → 5/12    P(Outlook = Overcast | No) = 0 → 1/8
P(Outlook = Rain | Yes) = 3/9 → 4/12        P(Outlook = Rain | No) = 2/5 → 3/8
Day Outlook Temp Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
− − − High − Yes
− − − High − No
− − − Normal − Yes
− − − Normal − No
P(Humidity = High | Yes) = 3/9 → 4/11      P(Humidity = High | No) = 4/5 → 5/7
P(Humidity = Normal | Yes) = 6/9 → 7/11    P(Humidity = Normal | No) = 1/5 → 2/7
In-class exercise
Training set:
x1 x2 x3 class
0 1 0 +
1 0 1 +
1 1 1 −
1 1 0 −
1 0 0 −

Calculate smoothed probabilities for attribute x1:
P(x1 = 0 | +)
P(x1 = 0 | −)
P(x1 = 1 | +)
P(x1 = 1 | −)
Naive Bayes on continuous-valued attributes
• How to deal with continuous-valued attributes?
Two possible solutions:
– Discretize
– Assume the attribute values follow a particular probability distribution within each class (and estimate its parameters from the training data)
Simplest discretization method
For each attribute a, create k equal-size bins in the interval from amin to amax.
Humidity: 25, 38, 50, 80, 93, 98, 98, 99
Simplest discretization method
For each attribute a, create k equal-size bins in the interval from amin to amax.
Choose thresholds in between bins.
P(Humidity < 40 | yes)    P(40 <= Humidity < 80 | yes)    P(80 <= Humidity < 120 | yes)
P(Humidity < 40 | no)     P(40 <= Humidity < 80 | no)     P(80 <= Humidity < 120 | no)

Humidity: 25, 38, 50, 80, 93, 98, 98, 99    (bin thresholds at 40, 80, 120)
Simplest discretization method
For each attribute a, create k equal-size bins in the interval from amin to amax.
Questions: What should k be? What if some bins have very few instances?
There is a tradeoff between discretization bias and variance: the more bins, the lower the bias, but the higher the variance, due to the small sample size in each bin.
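A minimal sketch of equal-width binning on the Humidity values (the choice of k and the bin-index rule are illustrative):

```python
# Equal-width discretization of Humidity into k bins between a_min and a_max.
humidity = [25, 38, 50, 80, 93, 98, 98, 99]
k = 3
lo, hi = min(humidity), max(humidity)
width = (hi - lo) / k

def bin_index(v):
    """Map a value to a bin 0..k-1; the maximum value falls into the last bin."""
    return min(int((v - lo) / width), k - 1)

print([bin_index(v) for v in humidity])   # [0, 0, 1, 2, 2, 2, 2, 2]
```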
Alternative simple (but effective) discretization method
(Yang & Webb, 2001)
Let n = number of training examples. For each attribute Ai, create ≈ √n bins. Sort the values of Ai in ascending order, and put ≈ √n of them in each bin.
Don’t need add-one smoothing of probabilities.
This gives a good balance between discretization bias and variance.

Humidity: 25, 38, 50, 80, 93, 98, 98, 99
√8 ≈ 3 bins, 3 items per bin
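A minimal sketch of this equal-frequency scheme on the same Humidity values (rounding √n to the nearest integer is an assumption; it matches the slide's 3 values per bin):

```python
import math

# Equal-frequency discretization: ~sqrt(n) values per bin, in sorted order.
values = sorted([25, 38, 50, 80, 93, 98, 98, 99])
n = len(values)
per_bin = round(math.sqrt(n))                       # sqrt(8) ~ 3

bins = [values[i:i + per_bin] for i in range(0, n, per_bin)]
print(bins)   # [[25, 38, 50], [80, 93, 98], [98, 99]]
```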
Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (P. Domingos and M. Pazzani)
The naive Bayes classifier is called “naive” because it assumes the attributes are independent of one another given the class.
• This paper asks: why does the naive (“simple”) Bayes classifier, SBC, do so well in domains with clearly dependent attributes?
Experiments
• Compare five classification methods on 30 data sets from the UCI ML database.
SBC = Simple Bayesian Classifier
Default = “Choose class with most representatives in data”
C4.5 = Quinlan’s decision tree induction system
PEBLS = An instance-based learning system
CN2 = A rule-induction system
• For SBC, numeric values were discretized into ten equal-length intervals.
Comparison measures reported (results table not reproduced here):
– Number of domains in which SBC was more accurate versus less accurate than each other classifier
– The same comparison, counting only differences significant at the 95% confidence level
– Average rank over all domains (1 is best in each domain)
Measuring Attribute Dependence
They used a simple, pairwise mutual information measure:
For attributes Am and An, dependence is defined as

D(Am, An | C) = H(Am | C) + H(An | C) − H(AmAn | C)

where AmAn is a “derived attribute” whose values consist of the possible combinations of values of Am and An.
Note: If Am and An are independent given C, then D(Am, An | C) = 0.
H(Aj | C) is the conditional entropy:

H(Aj | C) = Σi P(C = ci) Σk −P(Aj = vjk | C = ci) log2 P(Aj = vjk | C = ci)
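A minimal sketch of this measure computed from raw value lists (the helper names are illustrative):

```python
import math
from collections import Counter

def cond_entropy(attr_vals, class_vals):
    """H(A | C) = sum_i P(c_i) * sum_k -P(A = v_k | c_i) log2 P(A = v_k | c_i)."""
    n = len(class_vals)
    h = 0.0
    for c, n_c in Counter(class_vals).items():
        # distribution of the attribute's values within class c
        within = Counter(a for a, cl in zip(attr_vals, class_vals) if cl == c)
        h += (n_c / n) * sum(-(cnt / n_c) * math.log2(cnt / n_c)
                             for cnt in within.values())
    return h

def dependence(a_m, a_n, class_vals):
    """D(Am, An | C) = H(Am | C) + H(An | C) - H(AmAn | C)."""
    derived = list(zip(a_m, a_n))        # the "derived attribute" AmAn
    return (cond_entropy(a_m, class_vals) + cond_entropy(a_n, class_vals)
            - cond_entropy(derived, class_vals))

# Attributes that are independent given the class give D = 0; a duplicated
# attribute (a_n identical to a_m) gives the maximum, D = H(Am | C).
```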
Results:
(1) SBC is more successful than more complex methods, even when there is substantial dependence among attributes.
(2) There is no correlation between the degree of attribute dependence and SBC’s rank.
But why????
An Example
• Let the classes be {+, −}, and the attributes be {A, B, C}.
• Let P(+) = P(−) = 1/2.
• Suppose A and C are completely independent, and A and B are completely dependent (i.e., A=B).
• Optimal classification procedure:
cMAP = argmax_{cj ∈ {+,−}} P(A, B, C | cj) P(cj)
     = argmax_{cj ∈ {+,−}} P(A | cj) P(C | cj)

(B = A adds no information, and the equal priors P(cj) drop out.)
• This leads to the following conditions:
If P(A | +) P(C | +) > P(A | −) P(C | −) then class = +, else class = −.
• In the paper, the authors use Bayes Theorem to rewrite these conditions, and plot the “decision boundaries” for the optimal classifier and for the SBC.
p = P(+ | A)
q = P(+ | C)
[Figure: decision boundaries of the Optimal classifier and the SBC in the (p, q) plane, with + and − regions.]
Even though A and B are completely dependent, and the SBC assumes they are completely independent, the SBC gives the optimal classification in a very large part of the problem space! But why?
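A rough numerical illustration of that claim under the paper's setup (equal priors, A = B, A and C independent given the class); the restatement of the two decision rules in terms of p and q is my own and only a sketch:

```python
# With equal priors, Bayes rule gives P(A | +) proportional to p = P(+ | A),
# P(A | -) proportional to (1 - p), and likewise for C with q = P(+ | C).
# Optimal rule: predict + iff p * q     > (1 - p) * (1 - q)
# SBC rule:     predict + iff p**2 * q  > (1 - p)**2 * (1 - q)   (A counted twice, via B)
steps = 200
agree = total = 0
for i in range(1, steps):
    for j in range(1, steps):
        p, q = i / steps, j / steps
        optimal = p * q > (1 - p) * (1 - q)
        sbc = p * p * q > (1 - p) * (1 - p) * (1 - q)
        agree += (optimal == sbc)
        total += 1
print(agree / total)   # ~0.9: the two rules agree over most of the (p, q) square
```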
• Explanation:
Suppose C = {+,-} are the possible classes. Let E be a new example with attributes <a1, a2, ..., an>.
What the naive Bayes classifier does is calculate two probabilities,

P(+ | E) ∼ P(+) ∏i P(ai | +)
P(− | E) ∼ P(−) ∏i P(ai | −)

and return the class that has the maximum probability given E.
• The probability calculations are correct only if the independence assumption is correct.
• However, the classification is correct in all cases in which the relative ranking of the two probabilities, as calculated by the SBC, is correct!
• The latter covers a lot more cases than the former.
• Thus, the SBC is effective in many cases in which the independence assumption does not hold.