A survey on using Bayes reasoning in Data Mining
Directed by: Dr Rahgozar
Mostafa Haghir Chehreghani
Outline
Bayes theorem
MAP, ML hypothesis
Minimum description length principle
Bayes optimal classifier
Naïve Bayes learner
Summary
Two Roles for Bayesian Methods
Provides practical learning algorithms:
  Naïve Bayes learning
  Bayesian belief network learning
  Combine prior knowledge with observed data
  Require prior probabilities
Provides useful conceptual framework:
  Provides a gold standard for evaluating other learning algorithms
  Additional insight into Ockham's razor
Bayes Theorem
P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D
P(D|h) = probability of D given h

P(h|D) = P(D|h) P(h) / P(D)
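The theorem can be sketched numerically. The numbers below are hypothetical, chosen only to show how a high likelihood P(D|h) can still yield a modest posterior when the prior P(h) is small:

```python
# Bayes theorem: P(h|D) = P(D|h) P(h) / P(D), with hypothetical numbers.
p_h = 0.008             # prior P(h)
p_D_given_h = 0.98      # likelihood P(D|h)
p_D_given_not_h = 0.03  # P(D|not h)

# P(D) by the law of total probability over h and not-h.
p_D = p_D_given_h * p_h + p_D_given_not_h * (1 - p_h)
p_h_given_D = p_D_given_h * p_h / p_D
print(round(p_h_given_D, 3))  # → 0.209
```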
Choosing hypotheses
Generally we want the most probable hypothesis given the training data: the maximum a posteriori (MAP) hypothesis h_MAP:

h_MAP = argmax_{h∈H} P(h|D)
      = argmax_{h∈H} P(D|h) P(h) / P(D)
      = argmax_{h∈H} P(D|h) P(h)

If we assume P(h_i) = P(h_j) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:

h_ML = argmax_{h_i∈H} P(D|h_i)
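The two selection rules differ only in whether the prior is used. A minimal sketch over a hypothetical two-hypothesis space, with priors and likelihoods chosen so the two rules disagree:

```python
# Hypothetical hypothesis space: name -> (prior P(h), likelihood P(D|h)).
hypotheses = {"h1": (0.7, 0.4), "h2": (0.3, 0.9)}

# MAP weighs the likelihood by the prior; ML ignores the prior.
h_map = max(hypotheses, key=lambda h: hypotheses[h][1] * hypotheses[h][0])
h_ml = max(hypotheses, key=lambda h: hypotheses[h][1])
print(h_map, h_ml)  # → h1 h2 (unequal priors flip the choice)
```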
Why Likelihood Functions Are Great

MLEs achieve the Cramer-Rao lower bound (CRLB)
  The CRLB on variance is the inverse of the Fisher information, i.e. of the negative expected second derivative of the log-likelihood function
  Any unbiased estimator of β must have a variance greater than or equal to the CRLB
The Neyman-Pearson lemma
  A likelihood ratio test has the minimum possible Type II error of any test with the significance level α that we selected
Learning a Real Valued Function
Consider any real-valued target function f
Training examples <x_i, d_i>, where d_i is a noisy training value: d_i = f(x_i) + e_i
e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with μ = 0
Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML = argmin_{h∈H} Σ_{i=1}^{m} (d_i − h(x_i))²
Learning a Real Valued Function
h_ML = argmax_{h∈H} p(D|h)
     = argmax_{h∈H} ∏_{i=1}^{m} p(d_i|h)
     = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^(−(d_i − h(x_i))² / (2σ²))

Maximize the natural log of this instead:

h_ML = argmax_{h∈H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − (d_i − h(x_i))² / (2σ²) ]
     = argmax_{h∈H} Σ_{i=1}^{m} − (d_i − h(x_i))² / (2σ²)
     = argmax_{h∈H} Σ_{i=1}^{m} − (d_i − h(x_i))²
     = argmin_{h∈H} Σ_{i=1}^{m} (d_i − h(x_i))²
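The end result can be sketched directly: pick, from a small hypothesis space, the h that minimizes the sum of squared errors. The data points and candidate hypotheses below are hypothetical:

```python
# Hypothetical noisy samples (x_i, d_i) of an unknown real-valued target.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2)]

def sse(h):
    """Sum of squared errors of hypothesis h on the training data."""
    return sum((d - h(x)) ** 2 for x, d in data)

# A tiny hypothesis space H; h_ML is the SSE minimizer.
H = {"f(x) = x": lambda x: x, "f(x) = 2x": lambda x: 2 * x}
h_ml = min(H, key=lambda name: sse(H[name]))
print(h_ml)  # → f(x) = x
```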
Learning to Predict Probabilities
Consider predicting survival probability from patient data
Training examples <x_i, d_i>, where d_i is 1 or 0
Want to train a neural network to output a probability given x_i (not a 0 or 1)
In this case it can be shown that

h_ML = argmax_{h∈H} Σ_{i=1}^{m} [ d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)) ]

Weight update rule for a sigmoid unit:

w_jk ← w_jk + Δw_jk

where η is the learning rate and

Δw_jk = η Σ_{i=1}^{m} (d_i − h(x_i)) x_ijk
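One gradient step for the simplest case, a sigmoid unit with a single input weight (so x_ijk reduces to x_i), might look like this; the training pairs and learning rate are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical training pairs (x_i, d_i) with d_i in {0, 1}.
data = [(1.0, 1), (-1.0, 0)]
eta = 0.1  # learning rate
w = 0.0

# Delta w = eta * sum_i (d_i - h(x_i)) * x_i   (single-weight case)
grad = sum((d - sigmoid(w * x)) * x for x, d in data)
w += eta * grad
print(w)  # → 0.1 (both examples push the weight in the same direction)
```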
Minimum Description Length Principle
Ockham's razor: prefer the shortest hypothesis
MDL: prefer the hypothesis h that minimizes

h_MDL = argmin_{h∈H} L_C1(h) + L_C2(D|h)

where L_C(x) is the description length of x under encoding C
Example:
  H = decision trees, D = training data labels
  L_C1(h) is # bits to describe tree h
  L_C2(D|h) is # bits to describe D given h
Hence h_MDL trades off tree size for training errors
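The tradeoff can be sketched with hypothetical bit counts for two candidate trees: a large tree that fits the labels perfectly versus a small tree that needs extra bits to encode its training errors:

```python
# Hypothetical (bits to encode tree h, bits to encode D's labels given h).
candidates = {
    "large tree, zero training errors": (120, 0),
    "small tree, a few training errors": (30, 25),
}

# h_MDL minimizes the total description length L_C1(h) + L_C2(D|h).
h_mdl = min(candidates, key=lambda h: sum(candidates[h]))
print(h_mdl)  # → the small tree wins despite its training errors
```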
Minimum Description Length Principle
Interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability p is −log2 p bits
So interpret (1):
  −log2 P(h) is the length of h under the optimal code
  −log2 P(D|h) is the length of D given h under the optimal code

h_MAP = argmax_{h∈H} P(D|h) P(h)
      = argmax_{h∈H} [ log2 P(D|h) + log2 P(h) ]
      = argmin_{h∈H} [ −log2 P(D|h) − log2 P(h) ]   (1)
Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., h_MAP)
Given a new instance x, what is its most probable classification? h_MAP(x) is not necessarily the most probable classification!
Consider three possible hypotheses:

P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3

Given a new instance x, each hypothesis predicts h1(x), h2(x), h3(x)
What's the most probable classification of x?
Bayes Optimal Classifier

v_OB = argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D)

Example:

P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0

therefore

Σ_{h_i∈H} P(+|h_i) P(h_i|D) = .4
Σ_{h_i∈H} P(−|h_i) P(h_i|D) = .6

and

argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D) = −
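The example above can be checked directly in code, using the same posteriors and per-hypothesis label probabilities:

```python
# Posteriors P(h_i|D) and label probabilities P(v|h_i) from the example.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_label = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def weighted_vote(v):
    """Sum over hypotheses of P(v|h_i) P(h_i|D)."""
    return sum(p_label[h][v] * posterior[h] for h in posterior)

v_ob = max(["+", "-"], key=weighted_vote)
print(v_ob)  # → "-", even though the single MAP hypothesis h1 predicts "+"
```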
Gibbs Classifier
The Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify the new instance
Surprising fact: assume target concepts are drawn at random from H according to the priors on H. Then:

E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
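Step 1 of the algorithm, sampling a hypothesis in proportion to P(h|D), can be sketched via inverse-transform sampling (reusing the posteriors from the earlier example):

```python
import random

posterior = [("h1", 0.4), ("h2", 0.3), ("h3", 0.3)]

def gibbs_pick(rng):
    """Draw one hypothesis at random according to P(h|D)."""
    r = rng.random()
    acc = 0.0
    for h, p in posterior:
        acc += p
        if r < acc:
            return h
    return posterior[-1][0]  # guard against floating-point roundoff

rng = random.Random(0)
draws = [gibbs_pick(rng) for _ in range(10000)]
print(draws.count("h1") / len(draws))  # ≈ 0.4
```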
Naive Bayes Classifier
Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods
When to use:
  Moderate or large training set available
  Attributes that describe instances are conditionally independent given the classification
Successful applications:
  Diagnosis
  Classifying text documents
Naive Bayes Classifier
Assume target function f: X → V, where each instance x is described by attributes <a1, a2, …, an>.
The most probable value of f(x) is:

v_MAP = argmax_{v_j∈V} P(v_j|a1, a2, …, an)
      = argmax_{v_j∈V} P(a1, a2, …, an|v_j) P(v_j) / P(a1, a2, …, an)
      = argmax_{v_j∈V} P(a1, a2, …, an|v_j) P(v_j)

Naïve Bayes assumption:

P(a1, a2, …, an|v_j) = ∏_i P(a_i|v_j)

which gives

v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(a_i|v_j)
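A minimal sketch of v_NB on a tiny hypothetical training set, estimating the probabilities by raw frequencies (no smoothing):

```python
from collections import Counter, defaultdict

# Hypothetical examples: (attribute tuple, class value).
train = [
    (("sunny", "strong"), "no"),
    (("sunny", "weak"), "no"),
    (("rain", "weak"), "yes"),
    (("overcast", "weak"), "yes"),
]

class_counts = Counter(v for _, v in train)
attr_counts = defaultdict(Counter)  # attr_counts[v][(position, value)]
for attrs, v in train:
    for i, a in enumerate(attrs):
        attr_counts[v][(i, a)] += 1

def classify(attrs):
    """v_NB = argmax_v P(v) * prod_i P(a_i|v), estimated by frequencies."""
    def score(v):
        p = class_counts[v] / len(train)
        for i, a in enumerate(attrs):
            p *= attr_counts[v][(i, a)] / class_counts[v]
        return p
    return max(class_counts, key=score)

print(classify(("sunny", "weak")))  # → no
```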
Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀ x_i, y_j, z_k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

more compactly, we write

P(X|Y, Z) = P(X|Z)

Example: Thunder is conditionally independent of Rain, given Lightning
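The definition can be verified numerically on a small joint distribution built so that X is conditionally independent of Y given Z by construction (all numbers hypothetical):

```python
from itertools import product

# Build P(x, y, z) = P(z) P(x|z) P(y|z), which makes X and Y
# conditionally independent given Z by construction.
p_z = {0: 0.5, 1: 0.5}
p_x_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_y_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
joint = {(x, y, z): p_z[z] * p_x_z[z][x] * p_y_z[z][y]
         for x, y, z in product((0, 1), repeat=3)}

def marginal(**fixed):
    """Sum the joint over all entries matching the fixed coordinates."""
    names = ("x", "y", "z")
    return sum(p for key, p in joint.items()
               if all(key[names.index(n)] == v for n, v in fixed.items()))

# P(X=0 | Y=0, Z=0) should equal P(X=0 | Z=0).
lhs = marginal(x=0, y=0, z=0) / marginal(y=0, z=0)
rhs = marginal(x=0, z=0) / marginal(z=0)
print(round(lhs, 6), round(rhs, 6))  # → 0.9 0.9
```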
Inference in Bayesian Networks
How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
  A Bayes net contains all the information needed for this inference
  If only one variable has an unknown value, it is easy to infer
  In the general case, the problem is NP-hard
In practice, inference can succeed in many cases:
  Exact inference methods work well for some network structures
  Monte Carlo methods simulate the network randomly to calculate approximate solutions
Learning Bayes Nets
Suppose the structure is known and the variables are partially observable
  e.g. observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire, …
Similar to training a neural network with hidden units
In fact, we can learn the network's conditional probability tables using gradient ascent!
Converges to a network h that (locally) maximizes P(D|h)
More on Learning Bayes Nets
The EM algorithm can also be used. Repeatedly:
1. Calculate the probabilities of unobserved variables, assuming h
2. Calculate new w_ijk to maximize E[ln P(D|h)], where D now includes both the observed variables and the (calculated probabilities of) unobserved variables
When the structure is unknown:
  Algorithms use greedy search to add/subtract edges and nodes
  Active research topic
Summary
Combine prior knowledge with observed data
The impact of prior knowledge (when correct!) is to lower the sample complexity
Active research areas:
  Extending from boolean to real-valued variables
  Parameterized distributions instead of tables
  Extending to first-order instead of propositional systems
  More effective inference methods
  …
Any Questions?