Bayesian Network
Introduction
Independence assumptions seem to be necessary for probabilistic inference to be practical.
Naïve Bayes method: makes independence assumptions that are often not true; also called the Idiot Bayes method for this reason.
Bayesian network: explicitly models the independence relationships in the data, and uses these relationships to make probabilistic inferences.
Also known as: Belief Net, Bayes Net, Causal Net,
…
Why Bayesian Networks?
Intuitive language
Can utilize causal knowledge in constructing models
Domain experts are comfortable building a network
General-purpose “inference” algorithms, e.g., P(Bad Battery | Has Gas, Won’t Start)
Exact: modular specification leads to large computational efficiencies
[Diagram: a small example network over the variables Battery, Gas, and Start, matching the query above]
Random Variables
A random variable is a set of exhaustive and mutually exclusive possibilities.
Example: throwing a die: small {1,2}, medium {3,4}, large {5,6}
Medical data: patient’s age, blood pressure
Variable vs. event: a variable taking a value = an event.
Independence of Variables
Instantiation of a variable is an event. A set of variables is independent iff all possible instantiations of the variables are independent.
Example:
X: patient blood pressure {high, medium, low}
Y: patient sneezes {yes, no}
P(X=high, Y=yes) = P(X=high) x P(Y=yes)
P(X=high, Y=no) = P(X=high) x P(Y=no)
... ...
P(X=low, Y=yes) = P(X=low) x P(Y=yes)
P(X=low, Y=no) = P(X=low) x P(Y=no)
Conditional independence between a set of variables holds iff the conditional independence between all possible instantiations of the variables holds.
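A minimal sketch of these definitions in code (the joint-probability numbers below are hypothetical, chosen for illustration): two discrete variables are independent iff every joint entry equals the product of the corresponding marginals.

```python
# Hypothetical joint distribution P(X, Y) over blood pressure and sneezing.
joint = {
    ("high", "yes"): 0.06, ("high", "no"): 0.24,
    ("medium", "yes"): 0.10, ("medium", "no"): 0.40,
    ("low", "yes"): 0.04, ("low", "no"): 0.16,
}

# Marginals P(X) and P(Y) obtained by summing out the other variable.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# X and Y are independent iff P(X=x, Y=y) = P(X=x) * P(Y=y) for every instantiation.
independent = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9 for (x, y) in joint)
print(independent)   # True for these numbers
```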
Bayesian Networks: Definition
Bayesian networks are directed acyclic graphs (DAGs).
Nodes in Bayesian networks represent random variables, which are normally assumed to take on discrete values.
The links of the network represent direct probabilistic influence.
The structure of the network represents the probabilistic dependence/independence relationships between the random variables represented by the nodes.
Example
[Diagram: family-out (fo) and bowel-problem (bp) are parents of dog-out (do); family-out is also a parent of light-on (lo), and dog-out is a parent of hear-bark (hb)]
Bayesian Network: Probabilities
The nodes and links are quantified with probability distributions.
The root nodes (those with no ancestors) are assigned prior probability distributions.
The other nodes are assigned the conditional probability distribution of the node given its parents.
Example
[Diagram: the family-out network, quantified with the CPTs below]
P(fo)=.15  P(bp)=.01
P(lo | fo)=.6  P(lo | ¬fo)=.05
P(do | fo, bp)=.99  P(do | fo, ¬bp)=.90  P(do | ¬fo, bp)=.97  P(do | ¬fo, ¬bp)=.3
P(hb | do)=.7  P(hb | ¬do)=.01
Conditional Probability Tables (CPTs)
Noisy-OR-Gate
Exception Independence
Noisy-OR gate: the exceptions to the causations are independent.
P(E|C1, C2)=1-(1-P(E|C1))(1-P(E|C2))
[Diagram: causes C1 and C2 point to effect E, each acting through its own independent inhibitor (Inhibitor1, Inhibitor2)]
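A small sketch of the noisy-OR combination rule, generalized from two causes to any set of present causes (the function name and the numbers are illustrative, not from the slides):

```python
def noisy_or(p_e_given_cause, present):
    """p_e_given_cause: {cause: P(E | that cause alone)}; present: causes that are true.
    Each present cause independently fails to produce E with probability 1 - P(E|Ci)."""
    failure = 1.0
    for c in present:
        failure *= 1.0 - p_e_given_cause[c]
    return 1.0 - failure

p = {"C1": 0.8, "C2": 0.6}            # hypothetical single-cause probabilities
print(noisy_or(p, ["C1"]))            # 0.8
print(noisy_or(p, ["C1", "C2"]))      # 1 - 0.2*0.4 = 0.92, matching the formula above
print(noisy_or(p, []))                # 0.0: with no cause present, E does not occur
```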
Inference in Bayesian Networks
Given a Bayesian network and its CPTs, we can compute probabilities of the following form:
P(H | E1, E2,... ... , En)
where H, E1, E2,... ... , En are assignments to nodes (random variables) in the network.
Example:
The probability of family-out given that the light is on and the dog is heard barking:
P( fo | lo, hb).
Example: Car Diagnosis
MammoNet
Applications of Bayesian Networks
[Diagram: an example network of excuses for a late paper, with nodes Student Ill, Doctor's Note, Frequent Absence, Student Disinterested, Lives in Dorm, Hungry Dog, Dog Ate It, “Dog Ate It”, Power Failure, Clocks Slow, Computer Crash, “Computer Crash”, Make up Excuse, and Paper Late]
Application of Bayesian Networks: Diagnosis
ARCO1: Forecasting Oil Prices
Semantics of Belief Networks
Two ways of understanding the semantics of a belief network:
Representation of the joint probability distribution
Encoding of a collection of conditional independence statements
Terminology
[Diagram: a node X and related nodes Y1, Y2, illustrating the terms Parent, Ancestor, Descendant, and Non-descendant]
Three Types of Connections
[Diagram: three connection patterns among nodes a, b, and c: linear (chain), converging, and diverging]
Examples from the dog network:
Linear: family-out -> dog-out -> hear-bark
Converging: family-out -> dog-out <- bowel-problem
Diverging: light-on <- family-out -> dog-out
Connection Pattern and Independence
Linear connection: the two end variables are usually dependent on each other. Knowing the middle variable renders them independent.
Converging connection: the two end variables are usually independent of each other. Knowing the middle variable (or one of its descendants) renders them dependent (see the numeric sketch after this list).
Diverging connection: the two end variables are usually dependent on each other. Knowing the middle variable renders them independent.
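A small numeric illustration of the converging case ("explaining away"), using the CPT numbers of the dog network given earlier; the brute-force helpers below are ad hoc illustration code, not an inference algorithm:

```python
from itertools import product

p_fo, p_bp = 0.15, 0.01
p_do = {(True, True): 0.99, (True, False): 0.90,
        (False, True): 0.97, (False, False): 0.30}     # P(do | fo, bp)

def joint(fo, bp, do):
    """P(fo, bp, do); lo and hb marginalize out of this query."""
    p = (p_fo if fo else 1 - p_fo) * (p_bp if bp else 1 - p_bp)
    return p * (p_do[(fo, bp)] if do else 1 - p_do[(fo, bp)])

def cond_bp(given):
    """P(bp=true | given), where `given` fixes do and optionally fo."""
    num = den = 0.0
    for fo, bp in product([True, False], repeat=2):
        if "fo" in given and fo != given["fo"]:
            continue
        p = joint(fo, bp, given["do"])
        den += p
        if bp:
            num += p
    return num / den

print(cond_bp({"do": True}))               # ~0.025: the barking dog suggests bowel-problem
print(cond_bp({"do": True, "fo": True}))   # ~0.011: family-out explains the barking away
```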
D-Separation
A variable a is d-separated from b by a set of variables E if there is no d-connecting path between a and b. A path is d-connecting (given E) if:
None of its linear or diverging nodes is in E, and
For each of its converging nodes, either it or one of its descendants is in E.
Intuition: any influence between a and b must propagate along a d-connecting path.
If a and b are d-separated by E, then they are conditionally independent of each other given E:
P(a, b | E) = P(a | E) x P(b | E)
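A brute-force sketch of the d-separation test on the family-out network (function names are made up for illustration; the check enumerates undirected paths and applies the blocking rules above):

```python
# Edges of the family-out network.
edges = [("fo", "lo"), ("fo", "do"), ("bp", "do"), ("do", "hb")]
nodes = {"fo", "lo", "do", "bp", "hb"}

children = {n: set() for n in nodes}
parents = {n: set() for n in nodes}
for u, v in edges:
    children[u].add(v)
    parents[v].add(u)

def descendants(n):
    """All nodes reachable from n along directed edges."""
    out, stack = set(), [n]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def undirected_paths(a, b, path=None):
    """All simple paths between a and b, ignoring edge direction."""
    path = path or [a]
    if a == b:
        yield path
        return
    for nxt in children[a] | parents[a]:
        if nxt not in path:
            yield from undirected_paths(nxt, b, path + [nxt])

def d_separated(a, b, E):
    """True iff every undirected path between a and b is blocked given E."""
    for path in undirected_paths(a, b):
        blocked = False
        for u, v, w in zip(path, path[1:], path[2:]):
            if u in parents[v] and w in parents[v]:           # converging node
                if v not in E and not (descendants(v) & E):
                    blocked = True
            elif v in E:                                      # linear or diverging node
                blocked = True
        if not blocked:
            return False    # found a d-connecting path
    return True

print(d_separated("lo", "hb", set()))    # False: lo - fo - do - hb is d-connecting
print(d_separated("lo", "hb", {"fo"}))   # True: observing fo blocks the diverging node
print(d_separated("lo", "bp", set()))    # True: the path is blocked at the converging node do
print(d_separated("lo", "bp", {"hb"}))   # False: hb is a descendant of do and unblocks it
```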
Example Independence Relationships
[Diagram: the family-out network, used to read off example independence relationships]
Chain Rule
A joint probability distribution can be expressed as a product of conditional probabilities:
P(X1, X2, ..., Xn)
= P(X1) x P(X2, X3, ..., Xn | X1)
= P(X1) x P(X2|X1) x P(X3, X4, ..., Xn | X1, X2)
= P(X1) x P(X2|X1) x P(X3|X1,X2) x P(X4, ..., Xn | X1, X2, X3)
... ...
= P(X1) x P(X2|X1) x P(X3|X1,X2) x ... x P(Xn | X1, ..., Xn-1)
This has nothing to do with any independence assumption!
Compute the Joint Probability
Given a Bayesian network, let X1, X2, ..., Xn be an ordering of the nodes such that only nodes indexed lower than i may have a directed path to Xi (a topological ordering).
Since the parents of Xi d-separate Xi from all the other nodes indexed lower than i,
P(Xi | X1, ..., Xi-1) = P(Xi | parents(Xi))
This probability is available in the Bayesian network.
Therefore, P(X1, X2, ..., Xn) can be computed from the probabilities available in the Bayesian network.
[Diagram: the family-out network with its CPTs]
P(fo)=.15  P(bp)=.01
P(lo | fo)=.6  P(lo | ¬fo)=.05
P(do | fo, bp)=.99  P(do | fo, ¬bp)=.90  P(do | ¬fo, bp)=.97  P(do | ¬fo, ¬bp)=.3
P(hb | do)=.7  P(hb | ¬do)=.01
P(fo, ¬lo, do, hb, ¬bp) = ?
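A worked version of this query, assuming the CPT numbers listed above; the network factorization gives P(fo, ¬lo, do, hb, ¬bp) = P(fo) P(¬bp) P(¬lo | fo) P(do | fo, ¬bp) P(hb | do):

```python
p_fo, p_bp = 0.15, 0.01
p_lo = {True: 0.60, False: 0.05}                      # P(lo | fo)
p_do = {(True, True): 0.99, (True, False): 0.90,
        (False, True): 0.97, (False, False): 0.30}    # P(do | fo, bp)
p_hb = {True: 0.70, False: 0.01}                      # P(hb | do)

# Each factor is the node's CPT entry given its parents' values in the query.
p = p_fo * (1 - p_bp) * (1 - p_lo[True]) * p_do[(True, False)] * p_hb[True]
print(p)   # 0.15 * 0.99 * 0.4 * 0.90 * 0.7 = 0.037422
```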
What can Bayesian Networks Compute?
The input to a Bayesian network evaluation algorithm is a set of evidence, e.g.,
E = { hear-bark=true, lights-on=true }
The output of a Bayesian network evaluation algorithm is
P(Xi=v | E)
where Xi is a variable in the network.
For example, P(family-out=true | E) is the probability that the family is out given that we hear the dog bark and see the lights on.
Computation in Bayesian Networks
Computation in Bayesian networks is NP-hard: in the worst case, exact algorithms for computing the probabilities take time exponential in the size of the network.
There are two ways around the complexity barrier:
Algorithms for special subclasses of networks, e.g., singly connected networks.
Approximate algorithms.
The computation for singly connected networks is linear in the size of the network.
Bayesian Network Inference
Inference: calculating P(X | Y) for some variables or sets of variables X and Y. Inference in Bayesian networks is #P-hard!
Counting the satisfying assignments of a Boolean formula reduces to it:
[Diagram: inputs I1, I2, I3, I4, I5 feed into an output node O representing the formula]
Give each input a prior probability of .5; then
P(O) must be (# satisfying assignments) * (.5 ^ # inputs)
www.cs.cmu.edu/~awm/381/lec/bayesinfer/bayesinf.ppt
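A tiny sanity check of the counting argument above, using a hypothetical two-input formula O = I1 OR I2 and brute-force enumeration rather than a real network evaluator:

```python
from itertools import product

def formula(i1, i2):          # hypothetical formula with 3 satisfying assignments
    return i1 or i2

n_inputs = 2
assignments = list(product([False, True], repeat=n_inputs))
n_sat = sum(1 for bits in assignments if formula(*bits))

# With independent 0.5 priors on the inputs, P(O=true) weights each
# satisfying assignment by 0.5 ** n_inputs.
p_o_true = sum(0.5 ** n_inputs for bits in assignments if formula(*bits))
print(n_sat, p_o_true)        # 3 0.75, and indeed 0.75 == 3 * 0.5**2
```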
Bayesian Network Inference
But… inference is still tractable in some cases. Let's look at a special class of networks: trees / forests, in which each node has at most one parent.
Decomposing the probabilities
Suppose we want P(Xi | E) where E is some set of evidence variables. Let's split E into two parts:
Ei- is the part consisting of assignments to variables in the subtree rooted at Xi
Ei+ is the rest of it
Decomposing the probabilities
P(Xi | E) = P(Xi | Ei+, Ei-)
          = P(Ei- | Xi, Ei+) P(Xi | Ei+) / P(Ei- | Ei+)
          = P(Ei- | Xi) P(Xi | Ei+) / P(Ei- | Ei+)   (Xi d-separates Ei- from Ei+)
          = α π(Xi) λ(Xi)
Where:
• α is a constant independent of Xi
• π(Xi) = P(Xi | Ei+)
• λ(Xi) = P(Ei- | Xi)
Using the decomposition for inference
We can use this decomposition to do inference as follows. First, compute λ(Xi) = P(Ei- | Xi) for all Xi recursively, using the leaves of the tree as the base case.
If Xi is a leaf:
If Xi is in E: λ(Xi) = 1 if Xi matches the evidence in E, 0 otherwise.
If Xi is not in E: Ei- is the null set, so P(Ei- | Xi) = 1 (constant).
Quick aside: “Virtual evidence”
For theoretical simplicity, but without loss of generality, let's assume that all variables in E (the evidence set) are leaves in the tree.
Why we can do this WLOG: observing an internal node Xi is equivalent to adding a new leaf child Xi' with P(Xi' | Xi) = 1 if Xi' = Xi, 0 otherwise, and observing Xi' instead of Xi.
Calculating λ(Xi) for non-leaves
Suppose Xi has one child, Xj. Then:
λ(Xi) = P(Ei- | Xi)
      = Σ_Xj P(Ei-, Xj | Xi)
      = Σ_Xj P(Xj | Xi) P(Ei- | Xj, Xi)
      = Σ_Xj P(Xj | Xi) P(Ej- | Xj)
      = Σ_Xj P(Xj | Xi) λ(Xj)
Calculating λ(Xi) for non-leaves
Now, suppose Xi has a set of children, C.
Since Xi d-separates each of its subtrees, the contribution of each subtree to λ(Xi) is independent:
λ(Xi) = P(Ei- | Xi) = Π_{Xj in C} λj(Xi)
where λj(Xi) = Σ_Xj P(Xj | Xi) λ(Xj) is the contribution to P(Ei- | Xi) of the part of the evidence lying in the subtree rooted at Xi's child Xj.
We are now λ-happy
So now we have a way to recursively compute all the λ(Xi)'s, starting from the root and using the leaves as the base case. If we want, we can think of each node in the network as an autonomous processor that passes a little “λ message” to its parent.
The other half of the problem
Remember, P(Xi | E) = α π(Xi) λ(Xi). Now that we have all the λ(Xi)'s, what about the π(Xi)'s? π(Xi) = P(Xi | Ei+).
What about the root of the tree, Xr? In that case, Er+ is the null set, so π(Xr) = P(Xr). No sweat. Since we also know λ(Xr), we can compute the final P(Xr | E).
So for an arbitrary Xi with parent Xp, let's inductively assume we know π(Xp) and/or P(Xp | E). How do we get π(Xi)?
Computing π(Xi)
For Xi with parent Xp:
π(Xi) = P(Xi | Ei+)
      = Σ_Xp P(Xi, Xp | Ei+)
      = Σ_Xp P(Xi | Xp, Ei+) P(Xp | Ei+)
      = Σ_Xp P(Xi | Xp) P(Xp | Ei+)
      = Σ_Xp P(Xi | Xp) [ P(Xp | E) / λi(Xp) ]
      = Σ_Xp P(Xi | Xp) πi(Xp)
where πi(Xp) is defined as P(Xp | E) / λi(Xp).
We’re done. Yay!
Thus we can compute all the π(Xi)'s, and, in turn, all the P(Xi | E)'s. Can think of nodes as autonomous processors passing λ and π messages to their neighbors.
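The sketch below implements the λ/π recursions just derived on a small hypothetical tree; the structure, CPT numbers, and variable names are made up for illustration, and evidence is restricted to leaves as assumed above:

```python
from math import prod

# Tree: a -> b, a -> c, b -> d   (a is the root; c and d are leaves)
parent = {"b": "a", "c": "a", "d": "b"}
children = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
values = [0, 1]                                        # all variables binary

prior = {0: 0.6, 1: 0.4}                               # P(a)
cpt = {                                                # cpt[node][parent_value][node_value]
    "b": {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}},   # P(b | a)
    "c": {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}},   # P(c | a)
    "d": {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}},   # P(d | b)
}
evidence = {"d": 1}                                    # observed leaves only

def lam(node):
    """lambda(node)(v) = P(evidence below node | node=v), computed bottom-up."""
    if not children[node]:                             # leaf: indicator or all-ones
        if node in evidence:
            return {v: 1.0 if v == evidence[node] else 0.0 for v in values}
        return {v: 1.0 for v in values}
    out = {v: 1.0 for v in values}
    for ch in children[node]:                          # independent subtree contributions
        out = {v: out[v] * lam_msg(ch, v) for v in values}
    return out

def lam_msg(child, v):
    """lambda_child(parent=v) = sum_w P(child=w | parent=v) * lambda(child)(w)."""
    lam_ch = lam(child)
    return sum(cpt[child][v][w] * lam_ch[w] for w in values)

def posteriors():
    """P(node | evidence) for every node: lambda pass up, then pi pass down."""
    bel, pi = {}, {}
    root = next(n for n in children if n not in parent)
    pi[root] = dict(prior)                             # pi(root) = P(root)
    lam_root = lam(root)
    z = sum(pi[root][v] * lam_root[v] for v in values)
    bel[root] = {v: pi[root][v] * lam_root[v] / z for v in values}

    def down(node):
        for ch in children[node]:
            # pi message: pi(node) times the lambda contributions of node's OTHER children
            pi_msg = {v: pi[node][v] * prod(lam_msg(sib, v)
                                            for sib in children[node] if sib != ch)
                      for v in values}
            pi[ch] = {w: sum(cpt[ch][v][w] * pi_msg[v] for v in values) for w in values}
            lam_ch = lam(ch)
            z = sum(pi[ch][w] * lam_ch[w] for w in values)
            bel[ch] = {w: pi[ch][w] * lam_ch[w] / z for w in values}
            down(ch)

    down(root)
    return bel

for node, dist in posteriors().items():
    print(node, dist)   # e.g. P(b=1 | d=1) = 0.45 / 0.55, about 0.82
```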
Conjunctive queries
What if we want, e.g., P(A, B | C) instead of just the marginal distributions P(A | C) and P(B | C)? Just use the chain rule: P(A, B | C) = P(A | C) P(B | A, C). Each of the latter probabilities can be computed using the technique just discussed.
Polytrees
The technique can be generalized to polytrees: the undirected version of the graph is still a tree, but nodes can have more than one parent.
Dealing with cycles
Can deal with undirected cycles in the graph by:
Clustering variables together (e.g., merging B and C into a single node BC)
Conditioning: instantiate a variable (e.g., set A to 0 and then to 1), solve each resulting singly connected network, and combine the results
[Diagram: a network over A, B, C, D with an undirected cycle, its clustered version over A, BC, D, and its conditioned versions with A set to 0 and set to 1]
Join trees
An arbitrary Bayesian network can be transformed via some evil graph-theoretic magic into a join tree, in which a similar method can be employed.
[Diagram: a Bayesian network over A, B, C, D, E, F, G and the corresponding join tree, with cluster nodes such as ABC, BCD, and DF]
In the worst case the join tree nodes must take on exponentially many combinations of values, but the method often works well in practice.