Machine Learning, Chapter 6 CSE 574, Spring 2004
Bayesian Belief Networks
• Naïve Bayes assumes that the values of its attributes a1, ..., an are conditionally independent given the target value v
• This independence assumption is overly restrictive
• Bayesian Belief Networks provide an intermediate approach that is
  • less constraining than the Naïve Bayes assumption
  • more tractable than avoiding conditional independence assumptions altogether
Statistically dependent and independent variables
Bayesian Belief Network
• Describes the probability distribution governing a set of variables
  • by specifying a set of conditional independence assumptions
  • along with a set of conditional probabilities
• Allows conditional independence assumptions to apply to subsets of the variables
• Less constraining than the global assumption of conditional independence made by the Naïve Bayes classifier
Probability Distribution over a set of variables
• Random variables $Y_1, \ldots, Y_n$
• Each variable $Y_i$ can take on the set of possible values $V(Y_i)$
• Joint space of the variables is the cross-product $V(Y_1) \times V(Y_2) \times \cdots \times V(Y_n)$
• Each item in the joint space corresponds to one possible assignment of values to $Y_1, \ldots, Y_n$
• A Bayesian Belief Network specifies the joint probability distribution over this space
Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z; that is,

$(\forall x_i, y_j, z_k)\; P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$

which can be written more compactly as

$P(X \mid Y, Z) = P(X \mid Z)$
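As an illustrative sketch (not from the lecture), the definition can be checked numerically for discrete variables by comparing P(x | y, z) with P(x | z) for every value triple; the joint distribution below is hypothetical and is built to satisfy the property by construction:

```python
from itertools import product

# Hypothetical joint over Boolean X, Y, Z, built as P(z) P(x|z) P(y|z),
# so X and Y are conditionally independent given Z by construction.
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.7, 1: 0.3}}
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product([0, 1], repeat=3)}

def cond_indep(joint, tol=1e-9):
    """True iff P(x | y, z) == P(x | z) for every (x, y, z)."""
    for x, y, z in joint:
        p_yz = sum(p for (_, y2, z2), p in joint.items() if (y2, z2) == (y, z))
        p_zm = sum(p for (_, _, z2), p in joint.items() if z2 == z)
        p_xz = sum(p for (x2, _, z2), p in joint.items() if (x2, z2) == (x, z))
        if abs(joint[(x, y, z)] / p_yz - p_xz / p_zm) > tol:
            return False
    return True

print(cond_indep(joint))  # True
```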
Conditional Independence, continued
• Naïve Bayes
  • Assumes instance attribute $A_1$ is conditionally independent of instance attribute $A_2$ given the target value $V$
  • Allows Naïve Bayes to calculate

$P(A_1, A_2 \mid V) = P(A_1 \mid A_2, V)\, P(A_2 \mid V) = P(A_1 \mid V)\, P(A_2 \mid V)$
Bayesian Belief Network Example
• Boolean variables (present or absent)
  • Storm
  • BusTourGroup
  • Lightning
  • Campfire
  • Thunder
  • ForestFire
• Specify conditional probabilities between terms
Bayesian Belief Network
[Figure: Bayesian network over variables Y1-Y6, with the conditional probability table shown for variable Y4]
Probabilities stored in Bayesian Network
• Parents($Y_i$) denotes the set of immediate predecessors of $Y_i$ in the network
• $P(y_i \mid \mathrm{Parents}(Y_i))$ are the values stored in the conditional probability table associated with node $Y_i$
Representation
• A Bayesian network represents the joint probability distribution over its variables
• The joint probability for any desired assignment of values $\langle y_1, \ldots, y_n \rangle$ to the tuple of variables $\langle Y_1, \ldots, Y_n \rangle$ is computed by

$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid \mathrm{Parents}(Y_i))$
Example for Joint Probability Calculation
Joint Probability Calculation
• P(a3, b1, x2, c3, d2)
  = P(a3) P(b1) P(x2 | a3, b1) P(c3 | x2) P(d2 | x2)   (so A and B are parents of X, and C and D are children of X)
  = 0.25 × 0.6 × 0.4 × 0.5 × 0.4 = 0.012
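A minimal sketch of this computation (the probability values are the ones from the slide; the structure, with A and B as parents of X and with C and D as children of X, is read off the factorization):

```python
# CPT entries needed for this one assignment (values from the slide).
p_a = {'a3': 0.25}
p_b = {'b1': 0.6}
p_x_given_ab = {('x2', 'a3', 'b1'): 0.4}
p_c_given_x = {('c3', 'x2'): 0.5}
p_d_given_x = {('d2', 'x2'): 0.4}

# Joint probability factorizes as P(a) P(b) P(x|a,b) P(c|x) P(d|x).
joint = (p_a['a3'] * p_b['b1'] * p_x_given_ab[('x2', 'a3', 'b1')]
         * p_c_given_x[('c3', 'x2')] * p_d_given_x[('d2', 'x2')])
print(round(joint, 6))  # 0.012
```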
Representation
• A Bayesian network allows causal knowledge to be represented
  • Lightning causes Thunder
  • Once we know the values of Storm and BusTourGroup, no additional information about Campfire is provided by Lightning and Thunder
Causal relationships
• State of an automobile
  • Temperature of engine
  • Pressure of brake fluid
  • Pressure of air in tires
  • Voltages in the wires
• Oil pressure and air pressure are not causally related
• Engine temperature and oil temperature are causally related
Representation
• Arcs represent the assertion that a variable is conditionally independent of its non-descendants given its parents
  • Thunder is conditionally independent of the other variables given the value of Lightning
  • Pr(T, L | F)
    = Pr(T | L, F) Pr(L | F)   by the chain rule
    = Pr(T | L) Pr(L | F)      by the network
Inference Tasks
1. Infer the value of some target variable given the observed values of other variables (a brute-force sketch follows this list)
  • Pr(F | S, B, L, C, T)
    = Pr(F, T, B | S, L, C) / Pr(T, B | S, L, C)
    = Pr(F | S, L, C) Pr(T, B | S, L, C) / Pr(T, B | S, L, C)
    = Pr(F | S, L, C)
2. Infer the probability distribution of the target variable
3. Infer a subset of variables when some other variables are known
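As a rough illustration of task 1 (a hypothetical sketch, not the lecture's code), a brute-force routine can compute Pr(target | evidence) by enumerating all assignments to the hidden variables and marginalizing the joint distribution; the node encoding and the tiny two-node network are made up for the example:

```python
from itertools import product

# Each node maps to (parents, cpt); the cpt maps
# (value, tuple_of_parent_values) -> probability. All variables are Boolean.
def joint_prob(network, assignment):
    """P(y1,...,yn) as the product of P(yi | Parents(Yi)) over all nodes."""
    p = 1.0
    for var, (parents, cpt) in network.items():
        parent_vals = tuple(assignment[q] for q in parents)
        p *= cpt[(assignment[var], parent_vals)]
    return p

def infer(network, target, evidence):
    """Task 1: Pr(target = True | evidence), summing out the hidden variables."""
    hidden = [v for v in network if v != target and v not in evidence]
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(hidden)):
        a = dict(evidence, **dict(zip(hidden, values)))
        for t in (True, False):
            a[target] = t
            totals[t] += joint_prob(network, a)
    return totals[True] / (totals[True] + totals[False])

# Tiny illustrative network: Rain -> WetGrass.
net = {
    'Rain': ((), {(True, ()): 0.2, (False, ()): 0.8}),
    'WetGrass': (('Rain',), {(True, (True,)): 0.9, (False, (True,)): 0.1,
                             (True, (False,)): 0.1, (False, (False,)): 0.9}),
}
print(infer(net, 'Rain', {'WetGrass': True}))  # 0.18 / 0.26 ≈ 0.692
```

Enumeration is exponential in the number of hidden variables, which is why the NP-hardness and approximate-inference points on the next slide matter in practice.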
Inference Efficiency
• Exact inference for arbitrary networks is NP-hard
• Approximate inference sacrifices precision to gain efficiency
  • e.g., Monte Carlo methods for randomly sampling the unobserved variables
• Approximate methods can also be NP-hard, but are useful in many cases
Learning Bayesian Networks
• Learning Bayesian Networks from training data
• Several different settings for this problem:
1. Network structure
  • given in advance, or
  • inferred from the training data
2. Network variables
  • observable in the training data, or
  • might be unobservable
Learning Bayesian Networks
Network structure given in advance
1. Variables are fully observable in the training data
  • Estimate the conditional probability table entries as for a Naïve Bayes classifier (see the sketch after this list)
2. Only some variable values are observable
  • More difficult
  • Similar to learning weights for hidden units in ANNs
    • ANNs have input and output values specified
    • Hidden unit values are learnt from the training examples
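For case 1, the maximum-likelihood table entries are just conditional relative frequencies, exactly as in Naïve Bayes parameter estimation. A minimal sketch, assuming each training example arrives as a dict of observed variable values (the function name and sample data are illustrative, not from the lecture):

```python
from collections import Counter

def estimate_cpt(data, var, parents):
    """Estimate P(var | parents) entries by relative-frequency counting."""
    joint_counts = Counter()   # counts of (var_value, parent_values)
    parent_counts = Counter()  # counts of parent_values alone
    for d in data:
        pv = tuple(d[p] for p in parents)
        joint_counts[(d[var], pv)] += 1
        parent_counts[pv] += 1
    return {key: n / parent_counts[key[1]] for key, n in joint_counts.items()}

# Illustrative training examples for P(Campfire | Storm, BusTourGroup):
data = [
    {'Storm': True,  'BusTourGroup': True, 'Campfire': True},
    {'Storm': True,  'BusTourGroup': True, 'Campfire': False},
    {'Storm': False, 'BusTourGroup': True, 'Campfire': True},
]
print(estimate_cpt(data, 'Campfire', ['Storm', 'BusTourGroup']))
# {(True, (True, True)): 0.5, (False, (True, True)): 0.5, (True, (False, True)): 1.0}
```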
Gradient Ascent for learning probabilities
• Learns the entries in the conditional probability tables
• Gradient ascent searches through the space of all possible entries for the conditional probability tables
• The objective function maximized during the ascent is the probability P(D | h) of the observed training data D given the hypothesis h
• Corresponds to searching for the maximum likelihood hypothesis
Gradient Ascent Training of Bayesian Networks
• Network structure given but only some variables are observable
• Learns the entries in the conditional probability tables
• Searches through a space of hypotheses corresponding to the set of all possible entries for the conditional probability tables
• The objective function maximized is P(D | h), the probability of the observed data given the hypothesis h
Gradient Ascent Training of Bayesian Networks
• Maximize P(D | h) with respect to the parameters that define the conditional probability tables of the Bayesian network
• $w_{ijk}$ is a single entry in one of the tables of the known network structure, where
  • i indexes the variable $Y_i$
  • j indexes a value of that variable
  • k indexes an assignment of values to the variable's parents
Bayesian Network Gradient Ascent Notation
[Figure: Bayesian network over Y1-Y6, with the conditional probability table for variable Y4 (the Campfire node in the example that follows)]
$w_{ijk}$ is the probability that $Y_i$ will take on the value $y_{ij}$ given that its immediate parents $U_i$ take on the values given by $u_{ik}$.
If $w_{ijk}$ is the top-right entry of the table, $Y_i$ is the variable Campfire, $U_i$ is the parent tuple <Storm, BusTourGroup>, $y_{ij}$ = True, and $u_{ik}$ = <False, False>.
Gradient Ascent Training of Bayesian Networks
• Maximize P(D | h) by following the gradient of ln P(D | h)
• $w_{ijk}$ is a single entry in one of the tables in the Bayesian network
• It can be shown that each of these derivatives can be calculated as

$\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P(Y_i = y_{ij}, U_i = u_{ik} \mid d)}{w_{ijk}}$
Gradient Ascent calculation example
[Figure: Bayesian network over Y1-Y6, with the conditional probability table for variable Y4 (Campfire)]
To calculate the derivative of ln P(D | h) with respect to the upper-rightmost entry in the table, we have to calculate P(Campfire = True, Storm = False, BusTourGroup = False | d) for each training example d in D.
Gradient Ascent Training
• When variables are unobservable for training example d,
  • the required probability can be calculated from the observed variables in d using standard Bayesian network inference
• The required calculations are easily performed during most Bayesian network inference
  • so learning can be performed at little additional cost whenever the Bayesian network is used for inference and new evidence is subsequently obtained
Derivation of Key Equation for Gradient Ascent Training
We want to show that

$\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P(Y_i = y_{ij}, U_i = u_{ik} \mid d)}{w_{ijk}}$

Denoting P(D | h) by $P_h(D)$:

$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \frac{\partial}{\partial w_{ijk}} \ln \prod_{d \in D} P_h(d) = \sum_{d \in D} \frac{\partial \ln P_h(d)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial P_h(d)}{\partial w_{ijk}}$
Derivation of Key Equation for Gradient Ascent Training of Bayesian Networks, continued
Introducing the values of $Y_i$ and $U_i$ by summing over them (here $j'$ and $k'$ range over all values $y_{ij'}$ of $Y_i$ and $u_{ik'}$ of $U_i$):

$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} \sum_{j', k'} P_h(d \mid y_{ij'}, u_{ik'})\, P_h(y_{ij'}, u_{ik'})$

$= \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} \sum_{j', k'} P_h(d \mid y_{ij'}, u_{ik'})\, P_h(y_{ij'} \mid u_{ik'})\, P_h(u_{ik'})$
Derivation of Key Equation for Gradient Ascent Training of Bayesian Networks, continued
Only the term with $j' = j$ and $k' = k$ depends on $w_{ijk}$; since $w_{ijk} = P_h(y_{ij} \mid u_{ik})$:

$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} \left[ P_h(d \mid y_{ij}, u_{ik})\, P_h(y_{ij} \mid u_{ik})\, P_h(u_{ik}) \right]$

$= \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} \left[ P_h(d \mid y_{ij}, u_{ik})\, w_{ijk}\, P_h(u_{ik}) \right]$

$= \sum_{d \in D} \frac{1}{P_h(d)}\, P_h(d \mid y_{ij}, u_{ik})\, P_h(u_{ik})$
Derivation of Key Equation for Gradient Ascent Training of Bayesian Networks, continued
Applying Bayes' theorem to $P_h(d \mid y_{ij}, u_{ik})$:

$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \cdot \frac{P_h(y_{ij}, u_{ik} \mid d)\, P_h(d)\, P_h(u_{ik})}{P_h(y_{ij}, u_{ik})}$

$= \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)\, P_h(u_{ik})}{P_h(y_{ij}, u_{ik})}$

$= \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{P_h(y_{ij} \mid u_{ik})}$

$= \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$
Gradient Ascent Training Procedure
• Two-step procedure
1. Update each $w_{ijk}$ using the training data D:

$w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$

where $\eta$ is the learning rate

2. Renormalize the weights $w_{ijk}$ to assure that

$\sum_j w_{ijk} = 1$ and $0 \le w_{ijk} \le 1$
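A sketch of one iteration of this procedure, assuming the per-example posteriors $P_h(y_{ij}, u_{ik} \mid d)$ have already been computed by Bayesian network inference and are passed in (the nested-list layout and function name are illustrative; in the fully observable case each posterior is a 0/1 indicator):

```python
def gradient_ascent_step(w, posteriors, eta=0.01):
    """One two-step update of the CPT entries.

    w[i][j][k]             -- current estimate of P(Y_i = y_ij | U_i = u_ik)
    posteriors[d][i][j][k] -- P_h(Y_i = y_ij, U_i = u_ik | d) for example d
    """
    # Step 1: w_ijk <- w_ijk + eta * sum_d P_h(y_ij, u_ik | d) / w_ijk
    for i in range(len(w)):
        for j in range(len(w[i])):
            for k in range(len(w[i][j])):
                total_posterior = sum(post[i][j][k] for post in posteriors)
                w[i][j][k] += eta * total_posterior / w[i][j][k]
    # Step 2: renormalize so that sum_j w_ijk = 1 for each (i, k);
    # all terms are nonnegative, so 0 <= w_ijk <= 1 holds afterwards.
    for i in range(len(w)):
        for k in range(len(w[i][0])):
            total = sum(w[i][j][k] for j in range(len(w[i])))
            for j in range(len(w[i])):
                w[i][j][k] /= total
    return w

# One Boolean variable (two values j), one parent configuration (k), two
# fully observed examples in which the variable took its first value.
w = [[[0.5], [0.5]]]
posteriors = [[[[1.0], [0.0]]], [[[1.0], [0.0]]]]
print(gradient_ascent_step(w, posteriors, eta=0.1))  # ≈ [[[0.643], [0.357]]]
```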
Properties of Algorithm for Gradient Ascent Training of Bayesian Networks
• Converges to a locally maximum likelihood hypothesis
  • for the conditional probabilities in the Bayesian network
• Only finds a locally optimal solution
• Alternative to gradient ascent when not all variables are observable:
  • the EM algorithm, which also finds a locally maximum likelihood solution
Learning the Structure of Bayesian Networks
• Network structure is not known in advance
• A Bayesian scoring metric can be used to choose among alternative networks
• The heuristic search algorithm K2 learns network structure when the data is fully observable
  • Greedy search that trades network complexity for accuracy over the training data
• Given 3000 training examples from a manually constructed Bayesian network containing 37 nodes and 46 arcs, K2 was able to reconstruct the network almost exactly