Bayesian Statistics and Belief Networks
Overview
• Book: Ch. 13, 14
• Refresher on probability
• Bayesian classifiers
• Belief Networks / Bayesian Networks
Why Should We Care?
• Theoretical framework for machine learning, classification, knowledge representation, analysis
• Bayesian methods are capable of handling noisy, incomplete data sets
• Bayesian methods are commonly in use today
Bayesian Approach To Probability and Statistics
• Classical Probability: A physical property of the world (e.g., a 50% chance of heads on the flip of a fair coin). True probability.
• Bayesian Probability: A person's degree of belief in event X. Personal probability.
• Unlike classical probabilities, Bayesian probabilities benefit from, but do not require, repeated trials; the focus is only on the next event, e.g., the probability that the Seawolves win their next game.
Uncertainty
Methods for Handling Uncertainty
Probability
Making Decisions Under Uncertainty
Probability Basics
Random Variables
Prior Probability
Conditional Probability
Inference by Enumeration
Bayes Rule

Product Rule:

$P(A \wedge B) = P(A|B)\,P(B)$
$P(A \wedge B) = P(B|A)\,P(A)$

Equating sides:

$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$

i.e.

$P(Class \mid evidence) = \frac{P(evidence \mid Class)\,P(Class)}{P(evidence)}$
All classification methods can be seen as estimates of Bayes’ Rule, with different techniques to estimate P(evidence|Class).
Simple Bayes Rule Example

Probability your computer has a virus: P(V) = 1/1000. If virused, the probability of a crash that day: P(C|V) = 4/5. Probability your computer crashes in one day: P(C) = 1/10.

P(C|V) = 0.8, P(V) = 0.001, P(C) = 0.1

$P(V|C) = \frac{P(C|V)\,P(V)}{P(C)} = \frac{(0.8)(0.001)}{(0.10)} = 0.008$
Even though a crash is a strong indicator of a virus, we expect only 8/1000 crashes to be caused by viruses. Why not estimate P(V|C) directly from observed evidence? This is the distinction between causal and diagnostic knowledge: the causal quantity P(C|V) is stable, while a directly assessed diagnostic quantity would have to be re-learned whenever the environment changes (consider what happens if P(C) suddenly drops).
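A quick Python check of this arithmetic (a minimal sketch; the variable names are ours, not from the slides):

```python
# Bayes rule check for the virus/crash example.
p_c_given_v = 0.8   # P(C|V): probability of a crash given a virus
p_v = 0.001         # P(V): prior probability of a virus
p_c = 0.1           # P(C): overall probability of a crash in a day

# P(V|C) = P(C|V) * P(V) / P(C)
p_v_given_c = p_c_given_v * p_v / p_c
print(p_v_given_c)  # 0.008
```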
Bayesian Classifiers

$P(Class \mid evidence) = \frac{P(evidence \mid Class)\,P(Class)}{P(evidence)}$

If we're selecting the single most likely class, we only need to find the class that maximizes P(e|Class)P(Class).

The hard part is estimating P(e|Class). Evidence e typically consists of a set of observations:

$E = (e_1, e_2, \ldots, e_n)$

The usual simplifying assumption is conditional independence:

$P(E|C) = \prod_{i=1}^{n} P(e_i \mid C)$

$P(C|E) = \frac{P(C)\prod_{i=1}^{n} P(e_i \mid C)}{P(E)}$
Bayesian Classifier Example

Probability       C=Virus   C=Bad Disk
P(C)              0.4       0.6
P(crashes|C)      0.1       0.2
P(diskfull|C)     0.6       0.1

Given a case where the disk is full and the computer crashes, the classifier chooses Virus as most likely since (0.4)(0.1)(0.6) > (0.6)(0.2)(0.1).
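A minimal naive Bayes sketch of this comparison (the class and evidence names are ours; P(evidence) is omitted since it is identical for both classes and does not affect the argmax):

```python
# Naive Bayes: score each class by P(C) * product_i P(e_i | C).
priors = {"virus": 0.4, "bad_disk": 0.6}
likelihoods = {
    "virus":    {"crashes": 0.1, "diskfull": 0.6},
    "bad_disk": {"crashes": 0.2, "diskfull": 0.1},
}
evidence = ["crashes", "diskfull"]  # disk is full and the computer crashes

def score(c):
    s = priors[c]
    for e in evidence:
        s *= likelihoods[c][e]
    return s

print({c: score(c) for c in priors})  # {'virus': 0.024, 'bad_disk': 0.012}
print(max(priors, key=score))         # 'virus'
```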
Beyond Conditional Independence
• Include second-order dependencies; i.e. pairwise combination of variables via joint probabilities:
[Figure: a linear classifier separating classes C1 and C2]

The correction factor (pairwise joint terms such as $P(e_i \wedge e_j \mid C)$) is difficult to compute: there are on the order of $n^2$ joint probabilities to consider.
Belief Networks
• DAG that represents the dependencies between variables and specifies the joint probability distribution
• Random variables make up the nodes
• Directed links represent direct causal influences
• Each node has a conditional probability table quantifying the effects from its parents
• No directed cycles
Burglary Alarm Example

Structure: Burglary → Alarm ← Earthquake; Alarm → John Calls; Alarm → Mary Calls

P(B) = 0.001    P(E) = 0.002

B E | P(A)
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001

A | P(J)
T | 0.90
F | 0.05

A | P(M)
T | 0.70
F | 0.01
Sample Bayesian Network
Using The Belief Network

(Same burglary network and CPTs as above.)
$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(X_i))$

Probability of alarm, no burglary or earthquake, and both John and Mary calling:

$P(J \wedge M \wedge A \wedge \neg B \wedge \neg E) = P(J|A)\,P(M|A)\,P(A|\neg B \wedge \neg E)\,P(\neg B)\,P(\neg E) = (0.9)(0.7)(0.001)(0.999)(0.998) \approx 0.00062$
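The same chain-rule computation as a Python sketch (CPT values are taken straight from the tables above; the dictionary encoding is ours):

```python
# CPTs for the burglary network.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}  # P(J | A)
P_M = {True: 0.70, False: 0.01}  # P(M | A)

def joint(b, e, a, j, m):
    """Full joint via the chain rule: product of P(x_i | Parents(X_i))."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# Alarm, no burglary or earthquake, both John and Mary call:
print(joint(b=False, e=False, a=True, j=True, m=True))  # ~0.00062
```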
Belief Computations

• Two types; both are NP-hard
• Belief Revision
– Models explanatory/diagnostic tasks
– Given evidence, what is the most likely hypothesis to explain the evidence?
– Also called abductive reasoning
• Belief Updating
– Queries
– Given evidence, what is the probability of some other random variable occurring?
Belief Revision

• Given some evidence variables, find the state of all other variables that maximizes the probability.
• E.g.: We know John calls, but not Mary. What is the most likely state? Only consider assignments where J=T and M=F, and maximize. Best:

$P(\neg B)\,P(\neg E)\,P(\neg A|\neg B \wedge \neg E)\,P(J|\neg A)\,P(\neg M|\neg A) = (0.999)(0.998)(0.999)(0.05)(0.99) \approx 0.049$
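On a network this small, belief revision can be brute-forced by scanning every assignment consistent with the evidence (a sketch assuming the joint() helper and CPTs from the earlier chain-rule sketch are in scope):

```python
from itertools import product

# Fix evidence J=True, M=False; maximize the joint over B, E, A.
best = max(product([False, True], repeat=3),
           key=lambda bea: joint(*bea, j=True, m=False))
print(best)                           # (False, False, False)
print(joint(*best, j=True, m=False))  # ~0.049
```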
Belief Updating
• Causal Inferences
• Diagnostic Inferences
• Intercausal Inferences
• Mixed Inferences
[Diagrams: relative positions of the query node Q and the evidence E for each of the four inference types]
Causal Inferences

Inference from cause to effect. E.g., given a burglary, what is P(J|B)?

$P(A|B) = P(E)(0.95) + P(\neg E)(0.94) = (0.002)(0.95) + (0.998)(0.94) \approx 0.94$

$P(J|B) = P(A|B)(0.9) + P(\neg A|B)(0.05) = (0.94)(0.9) + (0.06)(0.05) \approx 0.85$

P(M|B) ≈ 0.66 via a similar calculation.
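The same two-step calculation as a minimal Python check (conditioning first on Earthquake, then on Alarm):

```python
# P(A|B) by summing over Earthquake, then P(J|B) by summing over Alarm.
P_E = 0.002
p_a_given_b = P_E * 0.95 + (1 - P_E) * 0.94                  # ~0.94
p_j_given_b = p_a_given_b * 0.90 + (1 - p_a_given_b) * 0.05  # ~0.85
print(p_a_given_b, p_j_given_b)
```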
Diagnostic Inferences

From effect to cause. E.g., given that John calls, what is P(B|J)?

$P(B|J) = \frac{P(J|B)\,P(B)}{P(J)}$

What is P(J)? Need P(A) first:

$P(A) = P(B)P(E)(0.95) + P(B)P(\neg E)(0.94) + P(\neg B)P(E)(0.29) + P(\neg B)P(\neg E)(0.001)$
$\quad = (0.001)(0.002)(0.95) + (0.001)(0.998)(0.94) + (0.999)(0.002)(0.29) + (0.999)(0.998)(0.001) \approx 0.002517$

$P(J) = P(A)(0.9) + P(\neg A)(0.05) = (0.002517)(0.9) + (0.997483)(0.05) \approx 0.052$

$P(B|J) = \frac{(0.85)(0.001)}{0.052} \approx 0.016$

Many false positives.
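The diagnostic query can be checked by enumerating the full joint (again assuming the joint() helper from the chain-rule sketch is in scope):

```python
from itertools import product

tt = [False, True]
# P(B|J) = P(B, J) / P(J), summing out the hidden variables E, A, M.
p_bj = sum(joint(True, e, a, True, m) for e, a, m in product(tt, repeat=3))
p_j  = sum(joint(b, e, a, True, m) for b, e, a, m in product(tt, repeat=4))
print(p_bj / p_j)  # ~0.016
```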
Intercausal Inferences

Explaining-away inferences. Given an alarm, P(B|A) = 0.37. But if we add the evidence that the earthquake is true, then P(B|A ∧ E) = 0.003. Even though B and E are independent a priori, once the alarm is observed the presence of one can make the other more or less likely.
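The explaining-away numbers can be verified the same way (assuming joint() from the chain-rule sketch):

```python
from itertools import product

tt = [False, True]
# P(B|A): burglary given only the alarm.
p_ba = sum(joint(True, e, True, j, m) for e, j, m in product(tt, repeat=3))
p_a  = sum(joint(b, e, True, j, m) for b, e, j, m in product(tt, repeat=4))
print(p_ba / p_a)    # ~0.37

# P(B|A, E): the earthquake explains the alarm away.
p_bae = sum(joint(True, True, True, j, m) for j, m in product(tt, repeat=2))
p_ae  = sum(joint(b, True, True, j, m) for b, j, m in product(tt, repeat=3))
print(p_bae / p_ae)  # ~0.003
```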
Mixed Inferences

Simultaneous intercausal and diagnostic inference. E.g., if John calls and Earthquake is false:

$P(A \mid J \wedge \neg E) = 0.03$
$P(B \mid J \wedge \neg E) = 0.017$

Computing these values exactly is somewhat complicated.
Exact Computation - Polytree Algorithm

• Judea Pearl, 1982
• Only works on singly-connected networks: at most one undirected path between any two nodes.
• Backward-chaining, message-passing algorithm for computing posterior probabilities for a query node X
– Compute causal support for X: evidence variables "above" X
– Compute evidential support for X: evidence variables "below" X
Polytree Computation

[Diagram: query node X with parents U(1) ... U(m), children Y(1) ... Y(n), and the children's other parents Z(i,j)]

$P(X|E) = \alpha\,P(X \mid E_X^+)\,P(E_X^- \mid X)$

$P(X \mid E_X^+) = \sum_{u} P(X \mid u) \prod_i P(u_i \mid E_{U_i \setminus X})$

$P(E_X^- \mid X) = \beta \prod_i \sum_{y_i} P(E_{Y_i} \mid y_i) \sum_{z_{ij}} P(y_i \mid X, z_{ij}) \prod_j P(z_{ij} \mid E_{Z_{ij} \setminus Y_i})$

The algorithm is recursive, forming a message-passing chain.
Other Query Methods

• Exact Algorithms
– Clustering
• Cluster nodes to form a single cluster; message-pass along that cluster
– Symbolic Probabilistic Inference
• Uses d-separation to find expressions to combine
• Approximate Algorithms
– Select a sampling distribution, conduct trials sampling from root to evidence nodes, accumulating weight for each node. Still tractable for dense networks.
– Forward Simulation
– Stochastic Simulation
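A minimal likelihood-weighting sketch of the sampling idea described above, applied to the burglary network (this is one common form of stochastic simulation; the specific variants named on the slide may differ in detail):

```python
import random

# CPTs for the burglary network (as in the earlier sketches).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}

def estimate_p_b_given_j(trials=500_000):
    """Estimate P(B | J=True) by sampling from roots to leaves and
    weighting each sample by the likelihood of the evidence."""
    num = den = 0.0
    for _ in range(trials):
        b = random.random() < P_B           # sample Burglary
        e = random.random() < P_E           # sample Earthquake
        a = random.random() < P_A[(b, e)]   # sample Alarm given parents
        w = P_J[a]                          # weight: P(J=True | a)
        num += w * b
        den += w
    return num / den

print(estimate_p_b_given_j())  # noisy, but converges to ~0.016
```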
Summary

• Bayesian methods provide a sound theory and framework for implementing classifiers
• Bayesian networks are a natural way to represent conditional independence information: qualitative information in the links, quantitative information in the tables
• Computing exact values is NP-hard; it is typical to make simplifying assumptions or use approximate methods
• Many Bayesian tools and systems exist
References

• Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall.
• Weiss, S. and Kulikowski, C. (1991). Computer Systems That Learn. Morgan Kaufmann.
• Heckerman, D. (1996). A Tutorial on Learning with Bayesian Networks. Microsoft Technical Report MSR-TR-95-06.
• Internet resources on Bayesian networks and machine learning: http://www.cs.orst.edu/~wangxi/resource.html