Machine Learning, Chapter 6 CSE 574, Spring 2004
Bayesian Belief Networks
• Naïve Bayes assumes that the values of its attributes a1, ..., an are conditionally independent given the target value v
• This independence assumption is overly restrictive
• Bayesian Belief Networks provide an intermediate approach that is
  • less constraining than the Naïve Bayes assumption
  • more tractable than avoiding conditional independence assumptions altogether
Statistically dependent and independent variables
Bayesian Belief Network
• Describes the probability distribution governing a set of variables
  • by specifying a set of conditional independence assumptions
  • along with a set of conditional probabilities
• Allows conditional independence assumptions to apply to subsets of the variables
• Less constraining than the global assumption of conditional independence made by the Naïve Bayes classifier
Probability Distribution over a set of variables
• Random variables $Y_1, \ldots, Y_n$
• Each variable $Y_i$ can take on the set of possible values $V(Y_i)$
• Joint space of the variables is the cross-product $V(Y_1) \times V(Y_2) \times \cdots \times V(Y_n)$
• Each item in the joint space corresponds to one possible assignment of values to $Y_1, \ldots, Y_n$
• A Bayesian Belief Network specifies the joint probability distribution over this space
Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z; that is,

$(\forall x_i, y_j, z_k)\; P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$

which can be written more compactly as

$P(X \mid Y, Z) = P(X \mid Z)$
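As an illustrative sketch (not from the lecture), the definition can be checked numerically for discrete variables by comparing P(x | y, z) with P(x | z) for every value triple; the joint distribution below is hypothetical and is built to satisfy the property by construction:

```python
from itertools import product

# Hypothetical joint over Boolean X, Y, Z, built as P(z) P(x|z) P(y|z),
# so X and Y are conditionally independent given Z by construction.
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.7, 1: 0.3}}
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product([0, 1], repeat=3)}

def cond_indep(joint, tol=1e-9):
    """True iff P(x | y, z) == P(x | z) for every (x, y, z)."""
    for x, y, z in joint:
        p_yz = sum(p for (_, y2, z2), p in joint.items() if (y2, z2) == (y, z))
        p_zm = sum(p for (_, _, z2), p in joint.items() if z2 == z)
        p_xz = sum(p for (x2, _, z2), p in joint.items() if (x2, z2) == (x, z))
        if abs(joint[(x, y, z)] / p_yz - p_xz / p_zm) > tol:
            return False
    return True

print(cond_indep(joint))  # True
```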
Conditional Independence, continued
• Naïve Bayes
  • Assumes instance attribute $A_1$ is conditionally independent of instance attribute $A_2$ given the target value $V$
  • Allows Naïve Bayes to calculate

$P(A_1, A_2 \mid V) = P(A_1 \mid A_2, V)\, P(A_2 \mid V) = P(A_1 \mid V)\, P(A_2 \mid V)$
Bayesian Belief Network Example
• Boolean variables (present or absent)
  • Storm
  • BusTourGroup
  • Lightning
  • Campfire
  • Thunder
  • ForestFire
• Specify conditional probabilities between terms
Bayesian Belief Network
[Figure: Bayesian network over variables Y1-Y6, with the conditional probability table shown for variable Y4]
Probabilities stored in Bayesian Network
• Parents($Y_i$) denotes the set of immediate predecessors of $Y_i$ in the network
• $P(y_i \mid \mathrm{Parents}(Y_i))$ are the values stored in the conditional probability table associated with node $Y_i$
Representation
• A Bayesian network represents the joint probability distribution over its variables
• The joint probability for any desired assignment of values $\langle y_1, \ldots, y_n \rangle$ to the tuple of variables $\langle Y_1, \ldots, Y_n \rangle$ is computed by

$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid \mathrm{Parents}(Y_i))$
Example for Joint Probability Calculation
Joint Probability Calculation
• P(a3, b1, x2, c3, d2)
  = P(a3) P(b1) P(x2 | a3, b1) P(c3 | x2) P(d2 | x2)   (so A and B are parents of X, and C and D are children of X)
  = 0.25 × 0.6 × 0.4 × 0.5 × 0.4 = 0.012
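A minimal sketch of this computation (the probability values are the ones from the slide; the structure, with A and B as parents of X and with C and D as children of X, is read off the factorization):

```python
# CPT entries needed for this one assignment (values from the slide).
p_a = {'a3': 0.25}
p_b = {'b1': 0.6}
p_x_given_ab = {('x2', 'a3', 'b1'): 0.4}
p_c_given_x = {('c3', 'x2'): 0.5}
p_d_given_x = {('d2', 'x2'): 0.4}

# Joint probability factorizes as P(a) P(b) P(x|a,b) P(c|x) P(d|x).
joint = (p_a['a3'] * p_b['b1'] * p_x_given_ab[('x2', 'a3', 'b1')]
         * p_c_given_x[('c3', 'x2')] * p_d_given_x[('d2', 'x2')])
print(round(joint, 6))  # 0.012
```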
Representation
• A Bayesian network allows causal knowledge to be represented
  • Lightning causes Thunder
  • Once we know the values of Storm and BusTourGroup, no additional information about Campfire is provided by Lightning and Thunder
Causal relationships
• State of an automobile
  • Temperature of engine
  • Pressure of brake fluid
  • Pressure of air in tires
  • Voltages in the wires
• Oil pressure and air pressure are not causally related
• Engine temperature and oil temperature are causally related
Representation
• Arcs represent the assertion that a variable is conditionally independent of its non-descendants given its parents
  • Thunder is conditionally independent of the other variables given the value of Lightning
  • Pr(T, L | F)
    = Pr(T | L, F) Pr(L | F)   by the chain rule
    = Pr(T | L) Pr(L | F)      by the network
Inference Tasks
1. Infer the value of some target variable given the observed values of other variables (a brute-force sketch follows this list)
  • Pr(F | S, B, L, C, T)
    = Pr(F, T, B | S, L, C) / Pr(T, B | S, L, C)
    = Pr(F | S, L, C) Pr(T, B | S, L, C) / Pr(T, B | S, L, C)
    = Pr(F | S, L, C)
2. Infer the probability distribution of the target variable
3. Infer a subset of variables when some other variables are known
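As a rough illustration of task 1 (a hypothetical sketch, not the lecture's code), a brute-force routine can compute Pr(target | evidence) by enumerating all assignments to the hidden variables and marginalizing the joint distribution; the node encoding and the tiny two-node network are made up for the example:

```python
from itertools import product

# Each node maps to (parents, cpt); the cpt maps
# (value, tuple_of_parent_values) -> probability. All variables are Boolean.
def joint_prob(network, assignment):
    """P(y1,...,yn) as the product of P(yi | Parents(Yi)) over all nodes."""
    p = 1.0
    for var, (parents, cpt) in network.items():
        parent_vals = tuple(assignment[q] for q in parents)
        p *= cpt[(assignment[var], parent_vals)]
    return p

def infer(network, target, evidence):
    """Task 1: Pr(target = True | evidence), summing out the hidden variables."""
    hidden = [v for v in network if v != target and v not in evidence]
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(hidden)):
        a = dict(evidence, **dict(zip(hidden, values)))
        for t in (True, False):
            a[target] = t
            totals[t] += joint_prob(network, a)
    return totals[True] / (totals[True] + totals[False])

# Tiny illustrative network: Rain -> WetGrass.
net = {
    'Rain': ((), {(True, ()): 0.2, (False, ()): 0.8}),
    'WetGrass': (('Rain',), {(True, (True,)): 0.9, (False, (True,)): 0.1,
                             (True, (False,)): 0.1, (False, (False,)): 0.9}),
}
print(infer(net, 'Rain', {'WetGrass': True}))  # 0.18 / 0.26 ≈ 0.692
```

Enumeration is exponential in the number of hidden variables, which is why the NP-hardness and approximate-inference points on the next slide matter in practice.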
Inference Efficiency
• Exact inference for arbitrary networks is NP-hard
• Approximate inference sacrifices precision to gain efficiency
  • e.g., Monte Carlo methods for randomly sampling the unobserved variables
• Approximate methods can also be NP-hard, but are useful in many cases
Learning Bayesian Networks
• Learning Bayesian Networks from training data
• Several different settings for this problem:
1. Network structure
  • given in advance, or
  • inferred from the training data
2. Network variables
  • observable in the training data, or
  • might be unobservable
Learning Bayesian Networks
Network structure given in advance
1. Variables are fully observable in the training data
  • Estimate the conditional probability table entries as for a Naïve Bayes classifier (see the sketch after this list)
2. Only some variable values are observable
  • More difficult
  • Similar to learning weights for hidden units in ANNs
    • ANNs have input and output values specified
    • Hidden unit values are learnt from the training examples
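For case 1, the maximum-likelihood table entries are just conditional relative frequencies, exactly as in Naïve Bayes parameter estimation. A minimal sketch, assuming each training example arrives as a dict of observed variable values (the function name and sample data are illustrative, not from the lecture):

```python
from collections import Counter

def estimate_cpt(data, var, parents):
    """Estimate P(var | parents) entries by relative-frequency counting."""
    joint_counts = Counter()   # counts of (var_value, parent_values)
    parent_counts = Counter()  # counts of parent_values alone
    for d in data:
        pv = tuple(d[p] for p in parents)
        joint_counts[(d[var], pv)] += 1
        parent_counts[pv] += 1
    return {key: n / parent_counts[key[1]] for key, n in joint_counts.items()}

# Illustrative training examples for P(Campfire | Storm, BusTourGroup):
data = [
    {'Storm': True,  'BusTourGroup': True, 'Campfire': True},
    {'Storm': True,  'BusTourGroup': True, 'Campfire': False},
    {'Storm': False, 'BusTourGroup': True, 'Campfire': True},
]
print(estimate_cpt(data, 'Campfire', ['Storm', 'BusTourGroup']))
# {(True, (True, True)): 0.5, (False, (True, True)): 0.5, (True, (False, True)): 1.0}
```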
Gradient Ascent for learning probabilities
• Learns the entries in the conditional probability tables
• Gradient ascent searches through the space of all possible entries for the conditional probability tables
• The objective function maximized during the ascent is the probability P(D | h) of the observed training data D given the hypothesis h
• Corresponds to searching for the maximum likelihood hypothesis
Gradient Ascent Training of Bayesian Networks
• Network structure given but only some variables are observable
• Learns the entries in the conditional probability tables
• Searches through a space of hypotheses corresponding to the set of all possible entries for the conditional probability tables
• The objective function maximized is P(D | h), the probability of the observed data given the hypothesis h
Gradient Ascent Training of Bayesian Networks
• Maximize P(D | h) with respect to the parameters that define the conditional probability tables of the Bayesian network
• $w_{ijk}$ is a single entry in one of the tables of the known network structure, where
  • i indexes the variable $Y_i$
  • j indexes a value of that variable
  • k indexes an assignment of values to the variable's parents
Bayesian Network Gradient Ascent Notation
[Figure: Bayesian network over Y1-Y6, with the conditional probability table for variable Y4 (the Campfire node in the example that follows)]
$w_{ijk}$ is the probability that $Y_i$ will take on the value $y_{ij}$ given that its immediate parents $U_i$ take on the values given by $u_{ik}$.
If $w_{ijk}$ is the top-right entry of the table, $Y_i$ is the variable Campfire, $U_i$ is the parent tuple <Storm, BusTourGroup>, $y_{ij}$ = True, and $u_{ik}$ = <False, False>.
Gradient Ascent Training of Bayesian Networks
• Maximize P(D | h) by following the gradient of ln P(D | h)
• $w_{ijk}$ is a single entry in one of the tables in the Bayesian network
• It can be shown that each of these derivatives can be calculated as

$\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P(Y_i = y_{ij}, U_i = u_{ik} \mid d)}{w_{ijk}}$
Gradient Ascent calculation example
[Figure: Bayesian network over Y1-Y6, with the conditional probability table for variable Y4 (Campfire)]
To calculate the derivative of ln P(D | h) with respect to the upper-rightmost entry in the table, we have to calculate P(Campfire = True, Storm = False, BusTourGroup = False | d) for each training example d in D.
Gradient Ascent Training
• When variables are unobservable for training example d,
  • the required probability can be calculated from the observed variables in d using standard Bayesian network inference
• The required calculations are easily performed during most Bayesian network inference
  • so learning can be performed at little additional cost whenever the Bayesian network is used for inference and new evidence is subsequently obtained
Derivation of Key Equation for Gradient Ascent Training
We want to show that

$\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P(Y_i = y_{ij}, U_i = u_{ik} \mid d)}{w_{ijk}}$

Denoting P(D | h) by $P_h(D)$:

$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \frac{\partial}{\partial w_{ijk}} \ln \prod_{d \in D} P_h(d) = \sum_{d \in D} \frac{\partial \ln P_h(d)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial P_h(d)}{\partial w_{ijk}}$
Derivation of Key Equation for Gradient Ascent Training of Bayesian Networks, continued
Introducing the values of $Y_i$ and $U_i$ by summing over them (here $j'$ and $k'$ range over all values $y_{ij'}$ of $Y_i$ and $u_{ik'}$ of $U_i$):

$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} \sum_{j', k'} P_h(d \mid y_{ij'}, u_{ik'})\, P_h(y_{ij'}, u_{ik'})$

$= \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} \sum_{j', k'} P_h(d \mid y_{ij'}, u_{ik'})\, P_h(y_{ij'} \mid u_{ik'})\, P_h(u_{ik'})$
Derivation of Key Equation for Gradient Ascent Training of Bayesian Networks, continued
Only the term with $j' = j$ and $k' = k$ depends on $w_{ijk}$; since $w_{ijk} = P_h(y_{ij} \mid u_{ik})$:

$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} \left[ P_h(d \mid y_{ij}, u_{ik})\, P_h(y_{ij} \mid u_{ik})\, P_h(u_{ik}) \right]$

$= \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} \left[ P_h(d \mid y_{ij}, u_{ik})\, w_{ijk}\, P_h(u_{ik}) \right]$

$= \sum_{d \in D} \frac{1}{P_h(d)}\, P_h(d \mid y_{ij}, u_{ik})\, P_h(u_{ik})$
Derivation of Key Equation for Gradient Ascent Training of Bayesian Networks, continued
Applying Bayes' theorem to $P_h(d \mid y_{ij}, u_{ik})$:

$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \cdot \frac{P_h(y_{ij}, u_{ik} \mid d)\, P_h(d)\, P_h(u_{ik})}{P_h(y_{ij}, u_{ik})}$

$= \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)\, P_h(u_{ik})}{P_h(y_{ij}, u_{ik})}$

$= \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{P_h(y_{ij} \mid u_{ik})}$

$= \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$
Gradient Ascent Training Procedure
• Two-step procedure
1. Update each $w_{ijk}$ using the training data D:

$w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$

where $\eta$ is the learning rate

2. Renormalize the weights $w_{ijk}$ to assure that

$\sum_j w_{ijk} = 1$ and $0 \le w_{ijk} \le 1$
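A sketch of one iteration of this procedure, assuming the per-example posteriors $P_h(y_{ij}, u_{ik} \mid d)$ have already been computed by Bayesian network inference and are passed in (the nested-list layout and function name are illustrative; in the fully observable case each posterior is a 0/1 indicator):

```python
def gradient_ascent_step(w, posteriors, eta=0.01):
    """One two-step update of the CPT entries.

    w[i][j][k]             -- current estimate of P(Y_i = y_ij | U_i = u_ik)
    posteriors[d][i][j][k] -- P_h(Y_i = y_ij, U_i = u_ik | d) for example d
    """
    # Step 1: w_ijk <- w_ijk + eta * sum_d P_h(y_ij, u_ik | d) / w_ijk
    for i in range(len(w)):
        for j in range(len(w[i])):
            for k in range(len(w[i][j])):
                total_posterior = sum(post[i][j][k] for post in posteriors)
                w[i][j][k] += eta * total_posterior / w[i][j][k]
    # Step 2: renormalize so that sum_j w_ijk = 1 for each (i, k);
    # all terms are nonnegative, so 0 <= w_ijk <= 1 holds afterwards.
    for i in range(len(w)):
        for k in range(len(w[i][0])):
            total = sum(w[i][j][k] for j in range(len(w[i])))
            for j in range(len(w[i])):
                w[i][j][k] /= total
    return w

# One Boolean variable (two values j), one parent configuration (k), two
# fully observed examples in which the variable took its first value.
w = [[[0.5], [0.5]]]
posteriors = [[[[1.0], [0.0]]], [[[1.0], [0.0]]]]
print(gradient_ascent_step(w, posteriors, eta=0.1))  # ≈ [[[0.643], [0.357]]]
```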
Properties of Algorithm for Gradient Ascent Training of Bayesian Networks
• Converges to a locally maximum likelihood hypothesis
  • for the conditional probabilities in the Bayesian network
• Only finds a locally optimal solution
• Alternative to gradient ascent when not all variables are observable:
  • the EM algorithm, which also finds a locally maximum likelihood solution
Learning the Structure of Bayesian Networks
• Network structure is not known in advance
• A Bayesian scoring metric can be used to choose among alternative networks
• The heuristic search algorithm K2 learns network structure when the data is fully observable
  • Greedy search that trades network complexity for accuracy over the training data
• Given 3000 training examples from a manually constructed Bayesian network containing 37 nodes and 46 arcs, K2 was able to reconstruct the network almost exactly