A survey on using Bayes reasoning in Data Mining
Directed by: Dr Rahgozar
Mostafa Haghir Chehreghani
Outline
Bayes theorem
MAP, ML hypothesis
Minimum description length principle
Bayes optimal classifier
Naïve Bayes learner
Summary
Two Roles for Bayesian Methods
Provides practical learning algorithms:
  Naïve Bayes learning
  Bayesian belief network learning
  Combine prior knowledge with observed data
  Require prior probabilities
Provides useful conceptual framework:
  Provides a gold standard for evaluating other learning algorithms
  Additional insight into Ockham's razor
Bayes Theorem
P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D
P(D|h) = probability of D given h

P(h|D) = P(D|h) P(h) / P(D)
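The theorem can be sketched numerically. The numbers below are hypothetical, chosen only to show how a high likelihood P(D|h) can still yield a modest posterior when the prior P(h) is small:

```python
# Bayes theorem: P(h|D) = P(D|h) P(h) / P(D), with hypothetical numbers.
p_h = 0.008             # prior P(h)
p_D_given_h = 0.98      # likelihood P(D|h)
p_D_given_not_h = 0.03  # P(D|not h)

# P(D) by the law of total probability over h and not-h.
p_D = p_D_given_h * p_h + p_D_given_not_h * (1 - p_h)
p_h_given_D = p_D_given_h * p_h / p_D
print(round(p_h_given_D, 3))  # → 0.209
```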
Choosing hypotheses
Generally we want the most probable hypothesis given the training data: the maximum a posteriori (MAP) hypothesis h_MAP:

h_MAP = argmax_{h∈H} P(h|D)
      = argmax_{h∈H} P(D|h) P(h) / P(D)
      = argmax_{h∈H} P(D|h) P(h)

If we assume P(h_i) = P(h_j) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:

h_ML = argmax_{h_i∈H} P(D|h_i)
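The two selection rules differ only in whether the prior is used. A minimal sketch over a hypothetical two-hypothesis space, with priors and likelihoods chosen so the two rules disagree:

```python
# Hypothetical hypothesis space: name -> (prior P(h), likelihood P(D|h)).
hypotheses = {"h1": (0.7, 0.4), "h2": (0.3, 0.9)}

# MAP weighs the likelihood by the prior; ML ignores the prior.
h_map = max(hypotheses, key=lambda h: hypotheses[h][1] * hypotheses[h][0])
h_ml = max(hypotheses, key=lambda h: hypotheses[h][1])
print(h_map, h_ml)  # → h1 h2 (unequal priors flip the choice)
```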
Why Likelihood Functions Are Great

MLEs achieve the Cramer-Rao lower bound (CRLB)
  The CRLB on variance is the inverse of the Fisher information, i.e. of the negative expected second derivative of the log-likelihood function
  Any unbiased estimator of β must have a variance greater than or equal to the CRLB
The Neyman-Pearson lemma
  A likelihood ratio test has the minimum possible Type II error of any test with the significance level α that we selected
Learning a Real Valued Function
Consider any real-valued target function f
Training examples <x_i, d_i>, where d_i is a noisy training value: d_i = f(x_i) + e_i
e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with μ = 0
Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML = argmin_{h∈H} Σ_{i=1}^{m} (d_i − h(x_i))²
Learning a Real Valued Function
h_ML = argmax_{h∈H} p(D|h)
     = argmax_{h∈H} ∏_{i=1}^{m} p(d_i|h)
     = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^(−(d_i − h(x_i))² / (2σ²))

Maximize the natural log of this instead:

h_ML = argmax_{h∈H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − (d_i − h(x_i))² / (2σ²) ]
     = argmax_{h∈H} Σ_{i=1}^{m} − (d_i − h(x_i))² / (2σ²)
     = argmax_{h∈H} Σ_{i=1}^{m} − (d_i − h(x_i))²
     = argmin_{h∈H} Σ_{i=1}^{m} (d_i − h(x_i))²
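The end result can be sketched directly: pick, from a small hypothesis space, the h that minimizes the sum of squared errors. The data points and candidate hypotheses below are hypothetical:

```python
# Hypothetical noisy samples (x_i, d_i) of an unknown real-valued target.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2)]

def sse(h):
    """Sum of squared errors of hypothesis h on the training data."""
    return sum((d - h(x)) ** 2 for x, d in data)

# A tiny hypothesis space H; h_ML is the SSE minimizer.
H = {"f(x) = x": lambda x: x, "f(x) = 2x": lambda x: 2 * x}
h_ml = min(H, key=lambda name: sse(H[name]))
print(h_ml)  # → f(x) = x
```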
Learning to Predict Probabilities
Consider predicting survival probability from patient data
Training examples <x_i, d_i>, where d_i is 1 or 0
Want to train a neural network to output a probability given x_i (not a 0 or 1)
In this case it can be shown that

h_ML = argmax_{h∈H} Σ_{i=1}^{m} [ d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)) ]

Weight update rule for a sigmoid unit:

w_jk ← w_jk + Δw_jk

where η is the learning rate and

Δw_jk = η Σ_{i=1}^{m} (d_i − h(x_i)) x_ijk
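One gradient step for the simplest case, a sigmoid unit with a single input weight (so x_ijk reduces to x_i), might look like this; the training pairs and learning rate are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical training pairs (x_i, d_i) with d_i in {0, 1}.
data = [(1.0, 1), (-1.0, 0)]
eta = 0.1  # learning rate
w = 0.0

# Delta w = eta * sum_i (d_i - h(x_i)) * x_i   (single-weight case)
grad = sum((d - sigmoid(w * x)) * x for x, d in data)
w += eta * grad
print(w)  # → 0.1 (both examples push the weight in the same direction)
```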
Minimum Description Length Principle
Ockham's razor: prefer the shortest hypothesis
MDL: prefer the hypothesis h that minimizes

h_MDL = argmin_{h∈H} L_C1(h) + L_C2(D|h)

where L_C(x) is the description length of x under encoding C
Example:
  H = decision trees, D = training data labels
  L_C1(h) is # bits to describe tree h
  L_C2(D|h) is # bits to describe D given h
Hence h_MDL trades off tree size for training errors
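The tradeoff can be sketched with hypothetical bit counts for two candidate trees: a large tree that fits the labels perfectly versus a small tree that needs extra bits to encode its training errors:

```python
# Hypothetical (bits to encode tree h, bits to encode D's labels given h).
candidates = {
    "large tree, zero training errors": (120, 0),
    "small tree, a few training errors": (30, 25),
}

# h_MDL minimizes the total description length L_C1(h) + L_C2(D|h).
h_mdl = min(candidates, key=lambda h: sum(candidates[h]))
print(h_mdl)  # → the small tree wins despite its training errors
```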
Minimum Description Length Principle
Interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability p is −log2 p bits
So interpret (1):
  −log2 P(h) is the length of h under the optimal code
  −log2 P(D|h) is the length of D given h under the optimal code

h_MAP = argmax_{h∈H} P(D|h) P(h)
      = argmax_{h∈H} [ log2 P(D|h) + log2 P(h) ]
      = argmin_{h∈H} [ −log2 P(D|h) − log2 P(h) ]   (1)
Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., h_MAP)
Given a new instance x, what is its most probable classification? h_MAP(x) is not necessarily the most probable classification!
Consider three possible hypotheses:

P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3

Given a new instance x, each hypothesis predicts h1(x), h2(x), h3(x)
What's the most probable classification of x?
Bayes Optimal Classifier

v_OB = argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D)

Example:

P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0

therefore

Σ_{h_i∈H} P(+|h_i) P(h_i|D) = .4
Σ_{h_i∈H} P(−|h_i) P(h_i|D) = .6

and

argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D) = −
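The example above can be checked directly in code, using the same posteriors and per-hypothesis label probabilities:

```python
# Posteriors P(h_i|D) and label probabilities P(v|h_i) from the example.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_label = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def weighted_vote(v):
    """Sum over hypotheses of P(v|h_i) P(h_i|D)."""
    return sum(p_label[h][v] * posterior[h] for h in posterior)

v_ob = max(["+", "-"], key=weighted_vote)
print(v_ob)  # → "-", even though the single MAP hypothesis h1 predicts "+"
```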
Gibbs Classifier
The Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify the new instance
Surprising fact: assume target concepts are drawn at random from H according to the priors on H. Then:

E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
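Step 1 of the algorithm, sampling a hypothesis in proportion to P(h|D), can be sketched via inverse-transform sampling (reusing the posteriors from the earlier example):

```python
import random

posterior = [("h1", 0.4), ("h2", 0.3), ("h3", 0.3)]

def gibbs_pick(rng):
    """Draw one hypothesis at random according to P(h|D)."""
    r = rng.random()
    acc = 0.0
    for h, p in posterior:
        acc += p
        if r < acc:
            return h
    return posterior[-1][0]  # guard against floating-point roundoff

rng = random.Random(0)
draws = [gibbs_pick(rng) for _ in range(10000)]
print(draws.count("h1") / len(draws))  # ≈ 0.4
```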
Naive Bayes Classifier
Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods
When to use:
  Moderate or large training set available
  Attributes that describe instances are conditionally independent given the classification
Successful applications:
  Diagnosis
  Classifying text documents
Naive Bayes Classifier
Assume target function f: X → V, where each instance x is described by attributes <a1, a2, …, an>.
The most probable value of f(x) is:

v_MAP = argmax_{v_j∈V} P(v_j|a1, a2, …, an)
      = argmax_{v_j∈V} P(a1, a2, …, an|v_j) P(v_j) / P(a1, a2, …, an)
      = argmax_{v_j∈V} P(a1, a2, …, an|v_j) P(v_j)

Naïve Bayes assumption:

P(a1, a2, …, an|v_j) = ∏_i P(a_i|v_j)

which gives

v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(a_i|v_j)
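A minimal sketch of v_NB on a tiny hypothetical training set, estimating the probabilities by raw frequencies (no smoothing):

```python
from collections import Counter, defaultdict

# Hypothetical examples: (attribute tuple, class value).
train = [
    (("sunny", "strong"), "no"),
    (("sunny", "weak"), "no"),
    (("rain", "weak"), "yes"),
    (("overcast", "weak"), "yes"),
]

class_counts = Counter(v for _, v in train)
attr_counts = defaultdict(Counter)  # attr_counts[v][(position, value)]
for attrs, v in train:
    for i, a in enumerate(attrs):
        attr_counts[v][(i, a)] += 1

def classify(attrs):
    """v_NB = argmax_v P(v) * prod_i P(a_i|v), estimated by frequencies."""
    def score(v):
        p = class_counts[v] / len(train)
        for i, a in enumerate(attrs):
            p *= attr_counts[v][(i, a)] / class_counts[v]
        return p
    return max(class_counts, key=score)

print(classify(("sunny", "weak")))  # → no
```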
Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀ x_i, y_j, z_k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

more compactly, we write

P(X|Y, Z) = P(X|Z)

Example: Thunder is conditionally independent of Rain, given Lightning
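The definition can be verified numerically on a small joint distribution built so that X is conditionally independent of Y given Z by construction (all numbers hypothetical):

```python
from itertools import product

# Build P(x, y, z) = P(z) P(x|z) P(y|z), which makes X and Y
# conditionally independent given Z by construction.
p_z = {0: 0.5, 1: 0.5}
p_x_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_y_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
joint = {(x, y, z): p_z[z] * p_x_z[z][x] * p_y_z[z][y]
         for x, y, z in product((0, 1), repeat=3)}

def marginal(**fixed):
    """Sum the joint over all entries matching the fixed coordinates."""
    names = ("x", "y", "z")
    return sum(p for key, p in joint.items()
               if all(key[names.index(n)] == v for n, v in fixed.items()))

# P(X=0 | Y=0, Z=0) should equal P(X=0 | Z=0).
lhs = marginal(x=0, y=0, z=0) / marginal(y=0, z=0)
rhs = marginal(x=0, z=0) / marginal(z=0)
print(round(lhs, 6), round(rhs, 6))  # → 0.9 0.9
```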
Inference in Bayesian Networks
How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
  A Bayes net contains all the information needed for this inference
  If only one variable has an unknown value, it is easy to infer
  In the general case, the problem is NP-hard
In practice, inference can succeed in many cases:
  Exact inference methods work well for some network structures
  Monte Carlo methods simulate the network randomly to calculate approximate solutions
Learning Bayes Nets
Suppose the structure is known and the variables are partially observable
  e.g. observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire, …
Similar to training a neural network with hidden units
In fact, we can learn the network's conditional probability tables using gradient ascent!
Converges to a network h that (locally) maximizes P(D|h)
More on Learning Bayes Nets
The EM algorithm can also be used. Repeatedly:
1. Calculate the probabilities of unobserved variables, assuming h
2. Calculate new w_ijk to maximize E[ln P(D|h)], where D now includes both the observed variables and the (calculated probabilities of) unobserved variables
When the structure is unknown:
  Algorithms use greedy search to add/subtract edges and nodes
  Active research topic
Summary
Combine prior knowledge with observed data
The impact of prior knowledge (when correct!) is to lower the sample complexity
Active research areas:
  Extending from boolean to real-valued variables
  Parameterized distributions instead of tables
  Extending to first-order instead of propositional systems
  More effective inference methods
  …
Any Questions?