Bayesian Classification NG



    Bayesian Classification

What are Bayesian classifiers?

Statistical classifiers that predict class membership probabilities, based on Bayes Theorem.

The Naïve Bayesian classifier is computationally simple, with performance comparable to decision tree and neural network classifiers.


    Bayesian Classification

Probabilistic learning: calculates explicit probabilities for a hypothesis; among the most practical approaches to certain types of learning problems.

Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.


    Bayes Theorem

Let X be a data sample whose class label is unknown.

Let H be some hypothesis that X belongs to a class C.

For classification, determine P(H|X): the probability that H holds given the observed data sample X.

P(H|X) is the posterior probability.


    Bayes Theorem

Example: sample space = all fruits.

X is round and red.

H = the hypothesis that X is an apple.

P(H|X) is our confidence that X is an apple given that X is round and red.

P(H) is the prior probability of H, i.e., the probability that any given data sample is an apple, regardless of how it looks.

P(H|X) is based on more information; note that P(H) is independent of X.


    Bayes Theorem

Example: sample space = all fruits.

P(X|H)? It is the probability that X is round and red given that we know it is true that X is an apple.

Here P(X) is the prior probability = P(a data sample from our set of fruits is red and round).


    Estimating Probabilities

P(X), P(H), and P(X|H) may be estimated from the given data.

    Bayes Theorem

Use of Bayes Theorem in the Naïve Bayesian Classifier!!

P(H|X) = P(X|H) P(H) / P(X)
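To make the formula concrete, here is a minimal Python sketch of Bayes Theorem for the fruit example above; all three input probabilities are made-up illustrative numbers, not values from these slides.

    # Bayes Theorem: P(H|X) = P(X|H) * P(H) / P(X)
    p_h = 0.2           # assumed prior: fraction of fruits that are apples
    p_x_given_h = 0.9   # assumed: probability an apple is round and red
    p_x = 0.3           # assumed: probability any fruit is round and red

    p_h_given_x = p_x_given_h * p_h / p_x
    print(p_h_given_x)  # 0.6: confidence that X is an apple, given round and red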


Naïve Bayesian Classification

    Also called Simple BC

Why Naïve/Simple?

Class conditional independence: the effect of an attribute's value on a given class is independent of the values of the other attributes.

    This assumption simplifies computations


Naïve Bayesian Classification

Steps Involved

1. Each data sample is of the type X = (x1, x2, ..., xn), where xi is the value of X for attribute Ai.

2. Suppose there are m classes Ci, i = 1, ..., m. X ∈ Ci iff

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i

i.e., the BC assigns X to the class Ci having the highest posterior probability.


Naïve Bayesian Classification

The class for which P(Ci|X) is maximized is called the maximum posterior hypothesis.

From Bayes Theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3. P(X) is constant for all classes, so only P(X|Ci) P(Ci) need be maximized.

If the class prior probabilities are not known, assume all classes to be equally likely and maximize P(X|Ci) alone; otherwise maximize P(X|Ci) P(Ci), estimating the priors as

P(Ci) = Si/S

where Si is the number of training samples in class Ci and S is the total number of training samples.

Problem: computing P(X|Ci) directly is infeasible!

(Find out how you would compute it and why it is infeasible.)


Naïve Bayesian Classification

4. Naïve assumption: attribute independence.

P(X|Ci) = P(x1, ..., xn|Ci) = ∏k P(xk|Ci)

5. In order to classify an unknown sample X, evaluate P(X|Ci) P(Ci) for each class Ci. Sample X is assigned to the class Ci iff

P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i
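As a concrete illustration of steps 1-5, here is a minimal Python sketch for categorical attributes; the function names and data layout are illustrative choices, not from the slides.

    from collections import Counter, defaultdict

    def train(samples, labels):
        # Estimate the counts behind P(Ci) = Si/S and P(xk|Ci)
        priors = Counter(labels)         # Si for each class Ci
        cond = defaultdict(Counter)      # (class, attribute index) -> value counts
        for x, c in zip(samples, labels):
            for k, v in enumerate(x):
                cond[(c, k)][v] += 1
        return priors, cond, len(labels)

    def classify(x, priors, cond, total):
        # Assign x to the class Ci that maximizes P(X|Ci) * P(Ci)
        best, best_score = None, -1.0
        for c, s_i in priors.items():
            score = s_i / total                     # P(Ci) = Si/S
            for k, v in enumerate(x):
                score *= cond[(c, k)][v] / s_i      # P(xk|Ci): naive independence
            if score > best_score:
                best, best_score = c, score
        return best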


Naïve Bayesian Classification

EXAMPLE

Training data (excerpt; the full training set has 14 tuples, 9 with buys_comp = Y and 5 with buys_comp = N):

Age      Income   Student   Credit_rating   Class: Buys_comp
>40      LOW      Y         FAIR            Y
>40      LOW      Y         EXCELLENT       N
31..40   LOW      Y         EXCELLENT       Y


Naïve Bayesian Classification

EXAMPLE

X = (age <= 30, income = MEDIUM, student = Y, credit_rating = FAIR)

Which class does X belong to: buys_comp = Y or buys_comp = N?


Naïve Bayesian Classification

EXAMPLE

P(buys_comp = Y) = 9/14 = 0.643
P(buys_comp = N) = 5/14 = 0.357

P(age <= 30 | buys_comp = Y) = 2/9 = 0.222
P(age <= 30 | buys_comp = N) = 3/5 = 0.600
P(income = MEDIUM | buys_comp = Y) = 4/9 = 0.444
P(income = MEDIUM | buys_comp = N) = 2/5 = 0.400
P(student = Y | buys_comp = Y) = 6/9 = 0.667
P(student = Y | buys_comp = N) = 1/5 = 0.200
P(credit_rating = FAIR | buys_comp = Y) = 6/9 = 0.667
P(credit_rating = FAIR | buys_comp = N) = 2/5 = 0.400


Naïve Bayesian Classification

    EXAMPLE

    P(X | buys_comp=Y)=0.222*0.444*0.667*0.667=0.044

    P(X | buys_comp=N)=0.600*0.400*0.200*0.400=0.019

    P(X | buys_comp=Y)P(buys_comp=Y) = 0.044*0.643=0.028

    P(X | buys_comp=N)P(buys_comp=N) = 0.019*0.357=0.007

CONCLUSION: since 0.028 > 0.007, X is assigned to the class buys_comp = Y, i.e., X buys a computer.
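The same arithmetic in a few lines of Python, using the probability values from the example above:

    p_x_given_yes = 0.222 * 0.444 * 0.667 * 0.667   # P(X|buys_comp=Y) ~ 0.044
    p_x_given_no  = 0.600 * 0.400 * 0.200 * 0.400   # P(X|buys_comp=N) ~ 0.019

    score_yes = p_x_given_yes * 0.643               # ~ 0.028
    score_no  = p_x_given_no  * 0.357               # ~ 0.007
    print("buys_comp = Y" if score_yes > score_no else "buys_comp = N")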


Naïve Bayes Classifier: Issues

Probability values of ZERO! Recall what you observed in WEKA!

If Ak is continuous-valued! Recall what you observed in WEKA!

If there are no tuples in the training set corresponding to students for the class buys_comp = N, then

P(student = Y | buys_comp = N) = 0

Implications? Solution?


Naïve Bayes Classifier: Issues

Laplacian Correction (Laplace Estimator)

Philosophy: we assume that the training data set is so large that adding one to each count we need would make only a negligible difference in the estimated probability value.

Example: D contains 1,000 tuples of class buys_comp = Y:

income = low: 0 tuples
income = medium: 990 tuples
income = high: 10 tuples

Without the Laplacian correction, the probabilities are 0, 0.990, and 0.010.

With the Laplacian correction: 1/1003 = 0.001, 991/1003 = 0.988, and 11/1003 = 0.011, respectively.
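A short Python sketch of the correction applied to exactly these counts:

    counts = {"low": 0, "medium": 990, "high": 10}
    total = sum(counts.values())                          # 1000 tuples

    without = {v: c / total for v, c in counts.items()}   # 0, 0.990, 0.010
    # Add 1 to each of the 3 counts; the denominator grows from 1000 to 1003
    corrected = {v: (c + 1) / (total + len(counts)) for v, c in counts.items()}
    print(corrected)  # low: 1/1003 = 0.001, medium: 991/1003 = 0.988, high: 11/1003 = 0.011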


Naïve Bayes Classifier: Issues

Continuous variables need more work than categorical attributes!

A continuous attribute is typically assumed to have a Gaussian distribution with a mean μ and a standard deviation σ.

Do it yourself! And cross-check with WEKA!
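A do-it-yourself starting point in Python; the age values below are made-up, and the Gaussian density g(x, μ, σ) replaces the count-based estimate of P(xk|Ci):

    import math

    def gaussian(x, mu, sigma):
        # Gaussian density g(x, mu, sigma)
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    ages_in_class = [25, 35, 38, 42, 30]   # made-up training values of age within one class
    mu = sum(ages_in_class) / len(ages_in_class)
    sigma = math.sqrt(sum((a - mu) ** 2 for a in ages_in_class) / len(ages_in_class))

    print(gaussian(34, mu, sigma))         # used in place of P(age = 34 | Ci)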


Naïve Bayes (Summary)

Robust to isolated noise points.

Handles missing values by ignoring the instance during probability estimate calculations.

Robust to irrelevant attributes.

The independence assumption may not hold for some attributes; use other techniques such as Bayesian Belief Networks (BBN).


Probability Calculations

Training data (excerpt):

Age      Income   Student   Credit_rating   Class: Buys_comp
>40      LOW      Y         GOOD            Y
>40      LOW      Y         EXCELLENT       N
31..40   LOW      Y         EXCELLENT       Y


    Bayesian Belief Networks

The Naïve BC assumes class conditional independence; this assumption simplifies computations.

When this assumption holds true, the Naïve BC is the most accurate compared to all other classifiers.

In real problems, dependencies do exist between variables. Two methods overcome this limitation of the NBC:

Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes.

Decision trees, which reason on one attribute at a time, considering the most important attributes first.


    Conditional Independence

    Let X, Y, & Z denote three set of randomvariables. The variables in X are said tobe conditionally independent of Y, givenZ if

    P(X|Y,Z) = P(X|Z)Rel. bet. a persons arm length and

    his/her reading skills!!One might observe that people with

    longer arms tend to have higher levels ofreading skills

    How do you explain this rel.?


    Conditional Independence

It can be explained through a confounding factor: AGE.

A young child tends to have short arms and lacks the reading skills of an adult.

If the age of a person is fixed, then the observed relationship between arm length and reading skills disappears.

We can thus conclude that arm length and reading skills are conditionally independent when the age variable is fixed:

P(reading skills | long arms, age) = P(reading skills | age)


    Conditional Independence

P(X,Y|Z) = P(X,Y,Z)/P(Z)
         = P(X,Y,Z)/P(Y,Z) x P(Y,Z)/P(Z)
         = P(X|Y,Z) x P(Y|Z)
         = P(X|Z) x P(Y|Z)   (by conditional independence)

This explains the Naïve Bayesian factorization:

P(X|Ci) = P(x1, x2, x3, ..., xn|Ci) = ∏k P(xk|Ci)


    Bayesian Belief Networks

Also known as: Belief Networks, Bayesian Networks, or Probabilistic Networks.


    Bayesian Belief Networks

The Conditional Independence (CI) assumption made by the NBC may be too rigid, especially for classification problems in which the attributes are somewhat correlated. We need a more flexible approach for modeling the class conditional probabilities:

P(X|Ci) = P(x1, x2, x3, ..., xn|Ci)

Instead of requiring that all the attributes be CI given the class, a BBN allows us to specify which pairs of attributes are CI.


    Bayesian Belief Networks

A belief network has two components:

    Directed Acyclic Graph (DAG)

    Conditional Probability Table (CPT)


    Bayesian Belief Networks

A node in a BBN is CI of its non-descendants if its parents are known.


Bayesian Belief Networks

[Figure: a Bayesian belief network over six variables: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea. FamilyHistory and Smoker are the parents of LungCancer.]

The conditional probability table for the variable LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9


    Bayesian Belief Networks

LungCancer is CI of Emphysema, given its parents, FamilyHistory and Smoker.

A BBN has a Conditional Probability Table (CPT) for each variable in the DAG.

The CPT for a variable Y specifies the conditional distribution P(Y | parents(Y)), e.g.:

P(LC = Y | FH = Y, S = Y) = 0.8
P(LC = N | FH = N, S = N) = 0.9

CPT for LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9


    Bayesian Belief Networks

Let X = (x1, x2, ..., xn) be a tuple described by variables or attributes Y1, Y2, ..., Yn respectively.

Each variable is CI of its non-descendants given its parents.

This allows the DAG to provide a complete representation of the existing joint probability distribution:

P(x1, x2, ..., xn) = ∏i P(xi | Parents(Yi))

where P(x1, x2, ..., xn) is the probability of a particular combination of values of X, and the values P(xi | Parents(Yi)) correspond to the entries in the CPT for Yi.
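A Python sketch of this factorization for the FamilyHistory/Smoker/LungCancer fragment of the network above; the LC CPT values come from the slides, while the priors for FH and S are made-up, since the slides do not give them:

    p_fh = {True: 0.1, False: 0.9}    # assumed prior for FamilyHistory (not on the slides)
    p_s  = {True: 0.3, False: 0.7}    # assumed prior for Smoker (not on the slides)
    p_lc = {(True, True): 0.8, (True, False): 0.5,    # CPT: P(LC=True | FH, S)
            (False, True): 0.7, (False, False): 0.1}

    def joint(fh, s, lc):
        # P(fh, s, lc) = P(fh) * P(s) * P(lc | Parents(LC)), with Parents(LC) = {FH, S}
        p = p_lc[(fh, s)] if lc else 1.0 - p_lc[(fh, s)]
        return p_fh[fh] * p_s[s] * p

    print(joint(True, True, True))    # 0.1 * 0.3 * 0.8 = 0.024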


    Bayesian Belief Networks

A node within the network can be selected as an output node, representing a class label attribute. There can be more than one output node.

Rather than returning a single class label, the classification process can return a probability distribution that gives the probability of each class.

Training the BBN!!


    Training BBN

A number of scenarios are possible:

The network topology may be given in advance or inferred from the data.

Variables may be observable or hidden (missing or incomplete data) in all or some of the training tuples.

Many algorithms exist for learning the network topology from the training data, given observable attributes.

If the network topology is known and the variables are observable, training is straightforward: just compute the CPT entries, as in the sketch below.
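A minimal Python sketch of that straightforward case: computing the CPT entries P(LC = True | FH, S) by counting over fully observed (and entirely made-up) training tuples:

    from collections import Counter

    # Made-up fully observed tuples: (family_history, smoker, lung_cancer)
    data = [(True, True, True), (True, True, False), (True, True, True),
            (False, True, True), (False, False, False), (False, False, False)]

    parent_counts = Counter((fh, s) for fh, s, _ in data)     # count(FH, S)
    lc_counts = Counter((fh, s) for fh, s, lc in data if lc)  # count(FH, S, LC=True)

    cpt = {cfg: lc_counts[cfg] / n for cfg, n in parent_counts.items()}
    print(cpt)  # approximately {(True, True): 0.667, (False, True): 1.0, (False, False): 0.0}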


    Training BBNs

Topology given, but some variables are hidden: Gradient Descent (self study).

This falls under the class of algorithms called Adaptive Probabilistic Networks.

BBNs are computationally expensive.

BBNs provide an explicit representation of causal structure.

Domain experts can provide prior knowledge to the training process, in the form of topology and/or conditional probability values. This leads to significant improvement in the learning process.